Exercise on Joins and anti-joins: add information from other tables¶

In [1]:

import pandas as pd

# Set some Pandas options: maximum number of rows/columns it's going to display
#pd.set_option('display.max_rows', 1000)
#pd.set_option('display.max_columns', 100)

Load data from clinical trial¶

Data comes in two different files. The file predimed_records.csv file contains the clinical data for each patient, except which diet group they were assigned. The file predimed_mapping.csv contain the information of which patient was assigned to which diet group.

In [2]:

df = pd.read_csv('../../data/predimed_records.csv')
df.head()

Out[2]:

	patient-id	location-id	sex	age	smoke	bmi	waist	wth	htn	diab	hyperchol	famhist	hormo	p14	toevent	event
0	436	4	Male	58	Former	33.53	122	0.753086	No	No	Yes	No	No	10	5.374401	Yes
1	1130	4	Male	77	Current	31.05	119	0.730061	Yes	Yes	No	No	No	10	6.097194	No
2	1131	4	Female	72	Former	30.86	106	0.654321	No	Yes	No	Yes	No	8	5.946612	No
3	1132	4	Male	71	Former	27.68	118	0.694118	Yes	No	Yes	No	No	8	2.907598	Yes
4	1111	2	Female	79	Never	35.94	129	0.806250	Yes	No	Yes	No	No	9	4.761123	No

In [3]:

info = pd.read_csv('../../data/predimed_mapping.csv')
info.head()

Out[3]:

	location-id	patient-id	group
0	2	885	MedDiet + VOO
1	1	182	MedDiet + Nuts
2	1	971	MedDiet + Nuts
3	2	691	MedDiet + Nuts
4	2	632	Control

There were 5 different locations where the study was conducted, each one gave an identification number patient-id to each participant.

In [4]:

info['location-id'].unique()

Out[4]:

array([2, 1, 3, 4, 5])

1. Add diet information to the patients' records¶

For how many patients do we have clinical information? (i.e., rows in df)
For how many patients do we have diet information? (i.e., rows in info)

In [5]:

len(df)

Out[5]:

In [6]:

len(info)

Out[6]:

Perform the merge, keeping in mind that it only make sense to analyze patients with the diet information.

Which type of merge would you do?
For how many patients do we have full information (records and which diet they followed?

In [7]:

df_with_info = df.merge(info, on=['patient-id', 'location-id'], how='right')

In [8]:

df_with_info.count()

Out[8]:

patient-id     6287
location-id    6287
sex            6287
age            6287
smoke          6287
bmi            6287
waist          6287
wth            6287
htn            6287
diab           6287
hyperchol      6287
famhist        6287
hormo          5629
p14            6287
toevent        6287
event          6287
group          6287
dtype: int64

In [9]:

df_with_info

Out[9]:

	patient-id	location-id	sex	age	smoke	bmi	waist	wth	htn	diab	hyperchol	famhist	hormo	p14	toevent	event	group
0	885	2	Male	74	Former	29.94	107	0.681529	Yes	Yes	Yes	No	NaN	8	5.711157	No	MedDiet + VOO
1	182	1	Female	60	Former	30.76	85	0.555556	No	No	Yes	Yes	Yes	10	3.274470	No	MedDiet + Nuts
2	971	1	Female	65	Never	23.81	86	0.540881	Yes	Yes	Yes	No	No	6	3.088296	No	MedDiet + Nuts
3	691	2	Female	64	Never	32.70	102	0.637500	Yes	Yes	No	No	No	8	3.028063	No	MedDiet + Nuts
4	632	2	Female	73	Never	28.32	91	0.594771	Yes	Yes	Yes	Yes	No	9	5.919233	No	Control
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6282	855	1	Male	55	Former	29.77	106	0.612717	Yes	No	Yes	Yes	No	9	3.449692	No	MedDiet + VOO
6283	711	4	Female	78	Never	34.72	104	0.712329	Yes	No	Yes	No	No	9	1.921971	No	Control
6284	113	4	Female	60	Never	31.48	98	0.640523	No	Yes	Yes	No	No	10	6.403833	No	MedDiet + VOO
6285	50	5	Female	77	Never	28.92	92	0.625850	Yes	No	Yes	No	No	11	5.360712	No	MedDiet + Nuts
6286	85	4	Female	61	Never	37.50	106	0.662500	Yes	No	Yes	Yes	No	9	1.823409	No	MedDiet + Nuts

6287 rows × 17 columns

2. Add location information to the patients' records¶

There were five locations where the study was conducted. Here is a DataFrame containing the information of each location.

Add a new column to the dataset that contains the city where each patient was recorded.

In [10]:

locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5], 
                                    'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})
locations

Out[10]:

	location-id	City
0	1	Madrid
1	2	Valencia
2	3	Barcelona
3	4	Bilbao
4	5	Malaga

In [11]:

df_with_info = df_with_info.merge(locations, on='location-id', how='right')
df_with_info

Out[11]:

	patient-id	location-id	sex	age	smoke	bmi	waist	wth	htn	diab	hyperchol	famhist	hormo	p14	toevent	event	group	City
0	182	1	Female	60	Former	30.76	85	0.555556	No	No	Yes	Yes	Yes	10	3.274470	No	MedDiet + Nuts	Madrid
1	971	1	Female	65	Never	23.81	86	0.540881	Yes	Yes	Yes	No	No	6	3.088296	No	MedDiet + Nuts	Madrid
2	485	1	Male	71	Former	22.41	92	0.516854	Yes	Yes	Yes	No	No	9	0.438056	No	Control	Madrid
3	621	1	Male	71	Former	32.70	110	0.662651	No	Yes	Yes	No	No	11	2.661191	No	MedDiet + Nuts	Madrid
4	954	1	Male	70	Former	29.48	107	0.633136	No	Yes	Yes	No	No	8	4.186174	No	MedDiet + Nuts	Madrid
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6282	1120	5	Male	63	Never	28.83	92	0.534884	Yes	No	Yes	No	NaN	9	3.392197	No	MedDiet + VOO	Malaga
6283	1082	5	Female	63	Never	25.33	92	0.613333	Yes	No	Yes	No	No	5	5.886379	No	MedDiet + Nuts	Malaga
6284	302	5	Female	59	Never	31.71	80	0.519481	No	Yes	Yes	No	No	8	5.631759	No	MedDiet + VOO	Malaga
6285	566	5	Male	58	Former	27.81	98	0.583333	Yes	No	Yes	No	No	10	1.913758	No	MedDiet + VOO	Malaga
6286	50	5	Female	77	Never	28.92	92	0.625850	Yes	No	Yes	No	No	11	5.360712	No	MedDiet + Nuts	Malaga

6287 rows × 18 columns

3. Remove drops from table¶

Some patients drop from the study early on and they should be removed from our analysis. Their IDS are stored in file dropped.csv.

Load the list of patients who droped, from dropped.csv
Use an anti-join to remove them from the table
How many patients (rows) are left in the data?

In [12]:

dropped = pd.read_csv('dropped.csv')

In [13]:

dropped.shape

Out[13]:

(42, 2)

In [14]:

dropped.head()

Out[14]:

	location-id	patient-id
0	1	217
1	1	1147
2	1	1170
3	1	627
4	4	541

In [15]:

temp = df_with_info.merge(dropped, on=['location-id', 'patient-id'], how='outer', indicator=True)
temp

Out[15]:

	patient-id	location-id	sex	age	smoke	bmi	waist	wth	htn	diab	hyperchol	famhist	hormo	p14	toevent	event	group	City	_merge
0	1	1	Female	77	Never	25.92	94	0.657343	Yes	No	Yes	Yes	No	9	5.538672	No	MedDiet + VOO	Madrid	left_only
1	2	1	Female	68	Never	34.85	150	0.949367	Yes	No	Yes	Yes	NaN	10	3.063655	No	MedDiet + Nuts	Madrid	left_only
2	3	1	Female	66	Never	37.50	120	0.750000	Yes	Yes	No	No	No	6	5.590691	No	MedDiet + Nuts	Madrid	left_only
3	4	1	Female	77	Never	29.26	93	0.628378	Yes	Yes	No	No	No	6	5.456537	No	MedDiet + VOO	Madrid	left_only
4	5	1	Female	60	Never	30.02	104	0.662420	Yes	No	Yes	No	No	9	2.746064	No	Control	Madrid	left_only
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6282	1253	5	Male	79	Never	25.28	105	0.640244	Yes	No	Yes	No	No	8	5.828884	No	MedDiet + VOO	Malaga	left_only
6283	1254	5	Male	62	Former	27.10	104	0.594286	Yes	No	Yes	Yes	No	9	5.067762	No	MedDiet + Nuts	Malaga	left_only
6284	1255	5	Female	65	Never	35.02	103	0.686667	Yes	No	Yes	No	No	10	1.993155	No	MedDiet + VOO	Malaga	left_only
6285	1256	5	Male	61	Never	28.42	94	0.576687	Yes	Yes	No	No	No	9	2.039699	No	MedDiet + Nuts	Malaga	left_only
6286	1257	5	Male	58	Former	24.43	93	0.547059	Yes	Yes	Yes	No	No	9	2.590007	No	MedDiet + Nuts	Malaga	left_only

6287 rows × 19 columns

In [16]:

df_without_dropped = temp[temp['_merge'] == 'left_only'].drop('_merge', axis=1)

In [17]:

df_without_dropped.shape

Out[17]:

(6245, 18)

In [18]:

df_without_dropped.head()

Out[18]:

	patient-id	location-id	sex	age	smoke	bmi	waist	wth	htn	diab	hyperchol	famhist	hormo	p14	toevent	event	group	City
0	1	1	Female	77	Never	25.92	94	0.657343	Yes	No	Yes	Yes	No	9	5.538672	No	MedDiet + VOO	Madrid
1	2	1	Female	68	Never	34.85	150	0.949367	Yes	No	Yes	Yes	NaN	10	3.063655	No	MedDiet + Nuts	Madrid
2	3	1	Female	66	Never	37.50	120	0.750000	Yes	Yes	No	No	No	6	5.590691	No	MedDiet + Nuts	Madrid
3	4	1	Female	77	Never	29.26	93	0.628378	Yes	Yes	No	No	No	6	5.456537	No	MedDiet + VOO	Madrid
4	5	1	Female	60	Never	30.02	104	0.662420	Yes	No	Yes	No	No	9	2.746064	No	Control	Madrid

4. Save final result in `processed_data_predimed.csv`¶

Using the .to_csv method of Pandas DataFrames

In [19]:

df_without_dropped.to_csv('processed_data_predimed.csv', index=None)

In [ ]:

58 KiB Raw Blame History Unescape Escape