2025-plovdiv-data/exercises/tabular_join/tabular_join_solution.ipynb

58 KiB
Raw Permalink Blame History

Exercise on Joins and anti-joins: add information from other tables

In [1]:
import pandas as pd

# Set some Pandas options: maximum number of rows/columns it's going to display
#pd.set_option('display.max_rows', 1000)
#pd.set_option('display.max_columns', 100)

Load data from clinical trial

Data comes in two different files. The file predimed_records.csv file contains the clinical data for each patient, except which diet group they were assigned. The file predimed_mapping.csv contain the information of which patient was assigned to which diet group.

In [2]:
df = pd.read_csv('../../data/predimed_records.csv')
df.head()
Out[2]:
patient-id location-id sex age smoke bmi waist wth htn diab hyperchol famhist hormo p14 toevent event
0 436 4 Male 58 Former 33.53 122 0.753086 No No Yes No No 10 5.374401 Yes
1 1130 4 Male 77 Current 31.05 119 0.730061 Yes Yes No No No 10 6.097194 No
2 1131 4 Female 72 Former 30.86 106 0.654321 No Yes No Yes No 8 5.946612 No
3 1132 4 Male 71 Former 27.68 118 0.694118 Yes No Yes No No 8 2.907598 Yes
4 1111 2 Female 79 Never 35.94 129 0.806250 Yes No Yes No No 9 4.761123 No
In [3]:
info = pd.read_csv('../../data/predimed_mapping.csv')
info.head()
Out[3]:
location-id patient-id group
0 2 885 MedDiet + VOO
1 1 182 MedDiet + Nuts
2 1 971 MedDiet + Nuts
3 2 691 MedDiet + Nuts
4 2 632 Control

There were 5 different locations where the study was conducted, each one gave an identification number patient-id to each participant.

In [4]:
info['location-id'].unique()
Out[4]:
array([2, 1, 3, 4, 5])

1. Add diet information to the patients' records

  • For how many patients do we have clinical information? (i.e., rows in df)
  • For how many patients do we have diet information? (i.e., rows in info)
In [5]:
len(df)
Out[5]:
6324
In [6]:
len(info)
Out[6]:
6287

Perform the merge, keeping in mind that it only make sense to analyze patients with the diet information.

  • Which type of merge would you do?
  • For how many patients do we have full information (records and which diet they followed?
In [7]:
df_with_info = df.merge(info, on=['patient-id', 'location-id'], how='right')
In [8]:
df_with_info.count()
Out[8]:
patient-id     6287
location-id    6287
sex            6287
age            6287
smoke          6287
bmi            6287
waist          6287
wth            6287
htn            6287
diab           6287
hyperchol      6287
famhist        6287
hormo          5629
p14            6287
toevent        6287
event          6287
group          6287
dtype: int64
In [9]:
df_with_info
Out[9]:
patient-id location-id sex age smoke bmi waist wth htn diab hyperchol famhist hormo p14 toevent event group
0 885 2 Male 74 Former 29.94 107 0.681529 Yes Yes Yes No NaN 8 5.711157 No MedDiet + VOO
1 182 1 Female 60 Former 30.76 85 0.555556 No No Yes Yes Yes 10 3.274470 No MedDiet + Nuts
2 971 1 Female 65 Never 23.81 86 0.540881 Yes Yes Yes No No 6 3.088296 No MedDiet + Nuts
3 691 2 Female 64 Never 32.70 102 0.637500 Yes Yes No No No 8 3.028063 No MedDiet + Nuts
4 632 2 Female 73 Never 28.32 91 0.594771 Yes Yes Yes Yes No 9 5.919233 No Control
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6282 855 1 Male 55 Former 29.77 106 0.612717 Yes No Yes Yes No 9 3.449692 No MedDiet + VOO
6283 711 4 Female 78 Never 34.72 104 0.712329 Yes No Yes No No 9 1.921971 No Control
6284 113 4 Female 60 Never 31.48 98 0.640523 No Yes Yes No No 10 6.403833 No MedDiet + VOO
6285 50 5 Female 77 Never 28.92 92 0.625850 Yes No Yes No No 11 5.360712 No MedDiet + Nuts
6286 85 4 Female 61 Never 37.50 106 0.662500 Yes No Yes Yes No 9 1.823409 No MedDiet + Nuts

6287 rows × 17 columns

2. Add location information to the patients' records

There were five locations where the study was conducted. Here is a DataFrame containing the information of each location.

  • Add a new column to the dataset that contains the city where each patient was recorded.
In [10]:
locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5], 
                                    'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})
locations
Out[10]:
location-id City
0 1 Madrid
1 2 Valencia
2 3 Barcelona
3 4 Bilbao
4 5 Malaga
In [11]:
df_with_info = df_with_info.merge(locations, on='location-id', how='right')
df_with_info
Out[11]:
patient-id location-id sex age smoke bmi waist wth htn diab hyperchol famhist hormo p14 toevent event group City
0 182 1 Female 60 Former 30.76 85 0.555556 No No Yes Yes Yes 10 3.274470 No MedDiet + Nuts Madrid
1 971 1 Female 65 Never 23.81 86 0.540881 Yes Yes Yes No No 6 3.088296 No MedDiet + Nuts Madrid
2 485 1 Male 71 Former 22.41 92 0.516854 Yes Yes Yes No No 9 0.438056 No Control Madrid
3 621 1 Male 71 Former 32.70 110 0.662651 No Yes Yes No No 11 2.661191 No MedDiet + Nuts Madrid
4 954 1 Male 70 Former 29.48 107 0.633136 No Yes Yes No No 8 4.186174 No MedDiet + Nuts Madrid
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6282 1120 5 Male 63 Never 28.83 92 0.534884 Yes No Yes No NaN 9 3.392197 No MedDiet + VOO Malaga
6283 1082 5 Female 63 Never 25.33 92 0.613333 Yes No Yes No No 5 5.886379 No MedDiet + Nuts Malaga
6284 302 5 Female 59 Never 31.71 80 0.519481 No Yes Yes No No 8 5.631759 No MedDiet + VOO Malaga
6285 566 5 Male 58 Former 27.81 98 0.583333 Yes No Yes No No 10 1.913758 No MedDiet + VOO Malaga
6286 50 5 Female 77 Never 28.92 92 0.625850 Yes No Yes No No 11 5.360712 No MedDiet + Nuts Malaga

6287 rows × 18 columns

3. Remove drops from table

Some patients drop from the study early on and they should be removed from our analysis. Their IDS are stored in file dropped.csv.

  1. Load the list of patients who droped, from dropped.csv
  2. Use an anti-join to remove them from the table
  3. How many patients (rows) are left in the data?
In [12]:
dropped = pd.read_csv('dropped.csv')
In [13]:
dropped.shape
Out[13]:
(42, 2)
In [14]:
dropped.head()
Out[14]:
location-id patient-id
0 1 217
1 1 1147
2 1 1170
3 1 627
4 4 541
In [15]:
temp = df_with_info.merge(dropped, on=['location-id', 'patient-id'], how='outer', indicator=True)
temp
Out[15]:
patient-id location-id sex age smoke bmi waist wth htn diab hyperchol famhist hormo p14 toevent event group City _merge
0 1 1 Female 77 Never 25.92 94 0.657343 Yes No Yes Yes No 9 5.538672 No MedDiet + VOO Madrid left_only
1 2 1 Female 68 Never 34.85 150 0.949367 Yes No Yes Yes NaN 10 3.063655 No MedDiet + Nuts Madrid left_only
2 3 1 Female 66 Never 37.50 120 0.750000 Yes Yes No No No 6 5.590691 No MedDiet + Nuts Madrid left_only
3 4 1 Female 77 Never 29.26 93 0.628378 Yes Yes No No No 6 5.456537 No MedDiet + VOO Madrid left_only
4 5 1 Female 60 Never 30.02 104 0.662420 Yes No Yes No No 9 2.746064 No Control Madrid left_only
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6282 1253 5 Male 79 Never 25.28 105 0.640244 Yes No Yes No No 8 5.828884 No MedDiet + VOO Malaga left_only
6283 1254 5 Male 62 Former 27.10 104 0.594286 Yes No Yes Yes No 9 5.067762 No MedDiet + Nuts Malaga left_only
6284 1255 5 Female 65 Never 35.02 103 0.686667 Yes No Yes No No 10 1.993155 No MedDiet + VOO Malaga left_only
6285 1256 5 Male 61 Never 28.42 94 0.576687 Yes Yes No No No 9 2.039699 No MedDiet + Nuts Malaga left_only
6286 1257 5 Male 58 Former 24.43 93 0.547059 Yes Yes Yes No No 9 2.590007 No MedDiet + Nuts Malaga left_only

6287 rows × 19 columns

In [16]:
df_without_dropped = temp[temp['_merge'] == 'left_only'].drop('_merge', axis=1)
In [17]:
df_without_dropped.shape
Out[17]:
(6245, 18)
In [18]:
df_without_dropped.head()
Out[18]:
patient-id location-id sex age smoke bmi waist wth htn diab hyperchol famhist hormo p14 toevent event group City
0 1 1 Female 77 Never 25.92 94 0.657343 Yes No Yes Yes No 9 5.538672 No MedDiet + VOO Madrid
1 2 1 Female 68 Never 34.85 150 0.949367 Yes No Yes Yes NaN 10 3.063655 No MedDiet + Nuts Madrid
2 3 1 Female 66 Never 37.50 120 0.750000 Yes Yes No No No 6 5.590691 No MedDiet + Nuts Madrid
3 4 1 Female 77 Never 29.26 93 0.628378 Yes Yes No No No 6 5.456537 No MedDiet + VOO Madrid
4 5 1 Female 60 Never 30.02 104 0.662420 Yes No Yes No No 9 2.746064 No Control Madrid

4. Save final result in processed_data_predimed.csv

  1. Using the .to_csv method of Pandas DataFrames
In [19]:
df_without_dropped.to_csv('processed_data_predimed.csv', index=None)
In [ ]: