Exercise on joins and anti-joins: add information from other tables

In [1]:

import pandas as pd

Load data from the clinical trial

The data come in two files. The file predimed_records.csv contains the clinical data for each patient, except the diet group they were assigned to. The file predimed_mapping.csv records which patient was assigned to which diet group.

In [2]:

df = pd.read_csv('../../data/predimed_records.csv')
df.head()

Out[2]:

	patient-id	location-id	sex	age	smoke	bmi	waist	wth	htn	diab	hyperchol	famhist	hormo	p14	toevent	event
0	436	4	Male	58	Former	33.53	122	0.753086	No	No	Yes	No	No	10	5.374401	Yes
1	1130	4	Male	77	Current	31.05	119	0.730061	Yes	Yes	No	No	No	10	6.097194	No
2	1131	4	Female	72	Former	30.86	106	0.654321	No	Yes	No	Yes	No	8	5.946612	No
3	1132	4	Male	71	Former	27.68	118	0.694118	Yes	No	Yes	No	No	8	2.907598	Yes
4	1111	2	Female	79	Never	35.94	129	0.806250	Yes	No	Yes	No	No	9	4.761123	No

In [3]:

info = pd.read_csv('../../data/predimed_mapping.csv')
info.head()

Out[3]:

	location-id	patient-id	group
0	2	885	MedDiet + VOO
1	1	182	MedDiet + Nuts
2	1	971	MedDiet + Nuts
3	2	691	MedDiet + Nuts
4	2	632	Control

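The study was conducted at 5 different locations, and each location assigned its own patient-id numbers to its participants. The same patient-id may therefore appear at more than one location, so a patient is identified by the pair (location-id, patient-id).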

In [4]:

info['location-id'].unique()

Out[4]:

array([2, 1, 3, 4, 5])
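Because each location hands out its own IDs, it is worth checking whether patient-id alone identifies a row before merging. The following is a minimal sketch on made-up data (the toy_info table and its values are hypothetical, not taken from predimed_mapping.csv); the same two checks could be run on info.

```python
import pandas as pd

# Hypothetical toy mapping: patient-id 7 appears at two different locations.
toy_info = pd.DataFrame({
    'location-id': [1, 1, 2],
    'patient-id': [7, 8, 7],
    'group': ['Control', 'MedDiet + Nuts', 'MedDiet + VOO'],
})

# Is patient-id on its own unique across the whole table?
id_is_unique = not toy_info.duplicated(subset='patient-id').any()

# Is the (location-id, patient-id) pair unique?
pair_is_unique = not toy_info.duplicated(subset=['location-id', 'patient-id']).any()

print(id_is_unique, pair_is_unique)  # False True
```

If the first check fails and the second passes, merges should use both columns as the key.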

1. Add diet information to the patients' records

For how many patients do we have clinical information? (i.e., rows in df)

In [6]:

## your code here
df.shape

Out[6]:

(6324, 16)

For how many patients do we have diet information? (i.e., rows in info)

In [7]:

## your code here
info.shape

Out[7]:

(6287, 3)

Perform the merge, keeping in mind that it only makes sense to analyze patients for whom we have diet information.

Which type of merge would you do?
For how many patients do we have full information (records and which diet they followed)?

In [10]:

## your code here
# Inner merge: keep only patients present in both tables.
# Both tables share the key column names, so `on=` is enough.
data_diet = pd.merge(df, info, how='inner', on=['location-id', 'patient-id'])
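To see what the inner merge does to the row count, here is a small sketch on hypothetical toy tables (toy_records and toy_info are invented for illustration). It also passes validate='one_to_one', which asks pandas to raise an error if either table contains a repeated key; whether that assumption holds for the real files is something to verify rather than take for granted.

```python
import pandas as pd

# Hypothetical clinical records: four patients.
toy_records = pd.DataFrame({
    'location-id': [1, 1, 2, 2],
    'patient-id': [7, 8, 7, 9],
    'age': [58, 77, 72, 71],
})

# Hypothetical diet mapping: only two of those patients appear here,
# plus one patient (location 2, id 10) who has no clinical record.
toy_info = pd.DataFrame({
    'location-id': [1, 2, 2],
    'patient-id': [7, 7, 10],
    'group': ['Control', 'MedDiet + VOO', 'MedDiet + Nuts'],
})

keys = ['location-id', 'patient-id']

# Inner merge keeps only key pairs found in both tables.
toy_merged = pd.merge(toy_records, toy_info, how='inner', on=keys,
                      validate='one_to_one')

print(toy_merged.shape)  # (2, 4)
```

On the real data, data_diet.shape answers the second question in the same way.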

2. Add location information to the patients' records

There were five locations where the study was conducted. The DataFrame defined below contains the city of each location.

Add a new column to the dataset that contains the city where each patient was recorded.

In [9]:

locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5], 
                                    'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})
locations

Out[9]:

	location-id	City
0	1	Madrid
1	2	Valencia
2	3	Barcelona
3	4	Bilbao
4	5	Malaga

In [11]:

## your code here:
# Look up the city of each patient through the shared location-id column.
data_diet_loc = pd.merge(data_diet, locations, how='inner', on='location-id')

In [23]:

data_diet_loc.head()

Out[23]:

	patient-id	location-id	sex	age	smoke	bmi	waist	wth	htn	diab	hyperchol	famhist	hormo	p14	toevent	event	group	City
0	436	4	Male	58	Former	33.53	122	0.753086	No	No	Yes	No	No	10	5.374401	Yes	Control	Bilbao
1	1130	4	Male	77	Current	31.05	119	0.730061	Yes	Yes	No	No	No	10	6.097194	No	Control	Bilbao
2	1131	4	Female	72	Former	30.86	106	0.654321	No	Yes	No	Yes	No	8	5.946612	No	MedDiet + VOO	Bilbao
3	1132	4	Male	71	Former	27.68	118	0.694118	Yes	No	Yes	No	No	8	2.907598	Yes	MedDiet + Nuts	Bilbao
4	1111	2	Female	79	Never	35.94	129	0.806250	Yes	No	Yes	No	No	9	4.761123	No	MedDiet + VOO	Valencia
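Adding the city is a lookup: many patients map to one location, so the merge should neither add nor lose rows. A minimal sketch of how to check this, using a hypothetical three-row patient table (toy_patients) and the same five cities as above:

```python
import pandas as pd

# Hypothetical patients; only the key columns matter for this check.
toy_patients = pd.DataFrame({
    'patient-id': [436, 1130, 1111],
    'location-id': [4, 4, 2],
})

toy_locations = pd.DataFrame({
    'location-id': [1, 2, 3, 4, 5],
    'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao', 'Malaga'],
})

# validate='many_to_one' raises if location-id is repeated in toy_locations.
toy_with_city = pd.merge(toy_patients, toy_locations, how='left',
                         on='location-id', validate='many_to_one')

# A lookup merge should preserve the number of patient rows.
rows_preserved = len(toy_with_city) == len(toy_patients)

print(rows_preserved, toy_with_city['City'].tolist())
```

A left merge is used here so that a patient with an unknown location-id would show up with a missing City instead of silently disappearing; with an inner merge, such a row would be dropped.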

3. Remove dropouts from the table

Some patients dropped out of the study early on and should be removed from our analysis. Their IDs are stored in the file dropped.csv.

Load the list of patients who dropped out from dropped.csv
Use an anti-join to remove them from the table
How many patients (rows) are left in the data?

In [13]:

dropped = pd.read_csv('dropped.csv')

In [14]:

dropped.shape

Out[14]:

(42, 2)

In [15]:

dropped.head()

Out[15]:

	location-id	patient-id
0	1	217
1	1	1147
2	1	1170
3	1	627
4	4	541

In [21]:

# your code here
# Anti-join: left-merge with an indicator column, then keep only the rows
# that found no match in `dropped` (i.e. `_merge` is not "both").
data_diet_loc_drop = pd.merge(
    data_diet_loc, 
    dropped, 
    how='left', 
    on=['location-id', 'patient-id'], 
    indicator=True
).query('_merge != "both"').drop(columns='_merge')
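The anti-join pattern is easier to follow on a small example. The sketch below uses hypothetical tables (toy_data and toy_dropped, with invented IDs) and spells the filter as _merge == 'left_only', which is equivalent to != "both" for a left merge.

```python
import pandas as pd

# Hypothetical patient table with four rows.
toy_data = pd.DataFrame({
    'location-id': [1, 1, 2, 4],
    'patient-id': [217, 300, 217, 541],
    'age': [58, 77, 72, 71],
})

# Hypothetical dropouts: (1, 217) and (4, 541).
toy_dropped = pd.DataFrame({
    'location-id': [1, 4],
    'patient-id': [217, 541],
})

keys = ['location-id', 'patient-id']

# Step 1: left merge, recording where each row came from in `_merge`.
flagged = toy_data.merge(toy_dropped, how='left', on=keys, indicator=True)

# Step 2: keep rows that exist only in the left table, then drop the helper column.
toy_kept = flagged[flagged['_merge'] == 'left_only'].drop(columns='_merge')

print(toy_kept[keys].values.tolist())  # [[1, 300], [2, 217]]
```

Note that patient 217 at location 2 is kept: only the full (location-id, patient-id) pair counts as a match. On the real data, data_diet_loc_drop.shape gives the number of patients left.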

4. Save the final result in `processed_data_predimed.csv`

Use the .to_csv method of pandas DataFrames.

In [22]:

fname = 'processed_data_predimed.csv'

#  your code here
data_diet_loc_drop.to_csv(fname)
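By default, .to_csv also writes the row index as an unnamed first column, which after filtering is just a list of leftover row numbers. If that column is not wanted, index=False leaves it out. A minimal sketch with a hypothetical two-row table, writing to an in-memory buffer instead of a file so that nothing is created on disk:

```python
import io
import pandas as pd

# Hypothetical table whose index has a gap, as it would after removing rows.
toy = pd.DataFrame(
    {'patient-id': [436, 1130], 'group': ['Control', 'MedDiet + VOO']},
    index=[0, 3],
)

# Write without the index, then read the result back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))  # ['patient-id', 'group']
```

With a real file name in place of the buffer, the call would be data_diet_loc_drop.to_csv(fname, index=False).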

In [ ]:
