2025-plovdiv-data/exercises/tabular_join/tabular_join.ipynb
2025-09-24 13:04:28 +03:00

Exercise on Joins and anti-joins: add information from other tables

In [1]:
import pandas as pd

Load data from clinical trial

The data come in two different files. The file predimed_records.csv contains the clinical data for each patient, except the diet group they were assigned to. The file predimed_mapping.csv records which patient was assigned to which diet group.

In [2]:
df = pd.read_csv('../../data/predimed_records.csv')
df.head()
Out[2]:
patient-id location-id sex age smoke bmi waist wth htn diab hyperchol famhist hormo p14 toevent event
0 436 4 Male 58 Former 33.53 122 0.753086 No No Yes No No 10 5.374401 Yes
1 1130 4 Male 77 Current 31.05 119 0.730061 Yes Yes No No No 10 6.097194 No
2 1131 4 Female 72 Former 30.86 106 0.654321 No Yes No Yes No 8 5.946612 No
3 1132 4 Male 71 Former 27.68 118 0.694118 Yes No Yes No No 8 2.907598 Yes
4 1111 2 Female 79 Never 35.94 129 0.806250 Yes No Yes No No 9 4.761123 No
In [3]:
info = pd.read_csv('../../data/predimed_mapping.csv')
info.head()
Out[3]:
location-id patient-id group
0 2 885 MedDiet + VOO
1 1 182 MedDiet + Nuts
2 1 971 MedDiet + Nuts
3 2 691 MedDiet + Nuts
4 2 632 Control

The study was conducted at 5 different locations, and each location assigned its own identification number, patient-id, to each of its participants. A patient-id is therefore only meaningful together with its location-id.

In [4]:
info['location-id'].unique()
Out[4]:
array([2, 1, 3, 4, 5])
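
Because each location hands out its own patient-id values, the same number may appear at more than one location. The sketch below shows one way to check this with `DataFrame.duplicated`. It uses a small made-up table (`toy_info` and its values are hypothetical, not taken from the PREDIMED files).

```python
import pandas as pd

# Toy mapping table with made-up values: patient-id 7 appears at two locations.
toy_info = pd.DataFrame({
    'location-id': [1, 1, 2, 2],
    'patient-id': [7, 8, 7, 9],
    'group': ['Control', 'MedDiet + Nuts', 'MedDiet + VOO', 'Control'],
})

# patient-id on its own is not a unique key in this toy table ...
id_is_unique = not toy_info['patient-id'].duplicated().any()

# ... but the pair (location-id, patient-id) is.
pair_is_unique = not toy_info.duplicated(subset=['location-id', 'patient-id']).any()

print(id_is_unique, pair_is_unique)
```

Running the same two checks on `info` would tell you whether the real data need both columns as the join key.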

1. Add diet information to the patients' records

  • For how many patients do we have clinical information? (i.e., rows in df)
In [6]:
## your code here
df.shape
Out[6]:
(6324, 16)
  • For how many patients do we have diet information? (i.e., rows in info)
In [7]:
## your code here
info.shape
Out[7]:
(6287, 3)

Perform the merge, keeping in mind that it only makes sense to analyze patients for whom we have the diet information.

  • Which type of merge would you do?
  • For how many patients do we have full information (records and which diet they followed)?
In [10]:
## your code here
data_diet = pd.merge(df, info, how='inner', on=['location-id', 'patient-id'])
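
To see how the choice of `how` changes the number of rows, here is a minimal sketch on two made-up tables (`records`, `mapping` and their values are hypothetical). One patient has no diet group, and one diet assignment has no clinical record.

```python
import pandas as pd

# Made-up clinical records: patient (1, 8) has no entry in the mapping table.
records = pd.DataFrame({
    'location-id': [1, 1, 2],
    'patient-id': [7, 8, 7],
    'age': [58, 77, 72],
})
# Made-up diet mapping: patient (2, 9) has no clinical record.
mapping = pd.DataFrame({
    'location-id': [1, 2, 2],
    'patient-id': [7, 7, 9],
    'group': ['Control', 'MedDiet + VOO', 'MedDiet + Nuts'],
})
keys = ['location-id', 'patient-id']

inner = pd.merge(records, mapping, how='inner', on=keys)  # only matched patients
left = pd.merge(records, mapping, how='left', on=keys)    # all records, group may be NaN
outer = pd.merge(records, mapping, how='outer', on=keys)  # everything from both tables

print(len(inner), len(left), len(outer))
```

On these toy tables the three merges give 2, 3 and 4 rows. Only the inner merge returns patients with both a record and a diet group, and `len(inner)` (or `.shape`) then answers the "full information" question.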

2. Add location information to the patients' records

There were five locations where the study was conducted. Here is a DataFrame containing the information of each location.

  • Add a new column to the dataset that contains the city where each patient was recorded.
In [9]:
locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5], 
                                    'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})
locations
Out[9]:
location-id City
0 1 Madrid
1 2 Valencia
2 3 Barcelona
3 4 Bilbao
4 5 Malaga
In [11]:
## your code here:
data_diet_loc = pd.merge(data_diet, locations, how='inner', on='location-id')
In [23]:
data_diet_loc.head()
Out[23]:
patient-id location-id sex age smoke bmi waist wth htn diab hyperchol famhist hormo p14 toevent event group City
0 436 4 Male 58 Former 33.53 122 0.753086 No No Yes No No 10 5.374401 Yes Control Bilbao
1 1130 4 Male 77 Current 31.05 119 0.730061 Yes Yes No No No 10 6.097194 No Control Bilbao
2 1131 4 Female 72 Former 30.86 106 0.654321 No Yes No Yes No 8 5.946612 No MedDiet + VOO Bilbao
3 1132 4 Male 71 Former 27.68 118 0.694118 Yes No Yes No No 8 2.907598 Yes MedDiet + Nuts Bilbao
4 1111 2 Female 79 Never 35.94 129 0.806250 Yes No Yes No No 9 4.761123 No MedDiet + VOO Valencia
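
Adding the city is a many-to-one lookup: many patients share one location, and each location should appear only once in the lookup table. As a hedged sketch, the `validate` argument of `pd.merge` can make that assumption explicit. The small tables below (`patients`, `cities`) are cut-down illustrations rather than the full data.

```python
import pandas as pd

# Cut-down illustration: three patients and the two locations they belong to.
patients = pd.DataFrame({
    'patient-id': [436, 1130, 1111],
    'location-id': [4, 4, 2],
})
cities = pd.DataFrame({
    'location-id': [2, 4],
    'City': ['Valencia', 'Bilbao'],
})

# validate='many_to_one' raises a MergeError if a location-id is repeated
# in `cities`, which would otherwise silently duplicate patient rows.
with_city = pd.merge(patients, cities, how='left', on='location-id',
                     validate='many_to_one')

print(with_city)
```

A quick sanity check after such a lookup is that the number of rows is unchanged, i.e. `len(with_city) == len(patients)`.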

3. Remove dropouts from the table

Some patients dropped out of the study early on and should be removed from our analysis. Their IDs are stored in the file dropped.csv.

  1. Load the list of patients who dropped out from dropped.csv
  2. Use an anti-join to remove them from the table
  3. How many patients (rows) are left in the data?
In [13]:
dropped = pd.read_csv('dropped.csv')
In [14]:
dropped.shape
Out[14]:
(42, 2)
In [15]:
dropped.head()
Out[15]:
location-id patient-id
0 1 217
1 1 1147
2 1 1170
3 1 627
4 4 541
In [21]:
# your code here
# Anti-join: left-merge with an indicator column, then keep only the rows
# that have no match in `dropped`.
data_diet_loc_drop = pd.merge(
    data_diet_loc,
    dropped,
    how='left',
    on=['location-id', 'patient-id'],
    indicator=True
).query('_merge == "left_only"').drop(columns='_merge')
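
pandas has no built-in anti-join, so a common pattern is a left merge with `indicator=True` followed by a filter on the `_merge` column. The sketch below walks through that pattern on made-up tables (`patients`, `dropouts` and their values are hypothetical).

```python
import pandas as pd

# Made-up patients: patient-id 217 exists at both location 1 and location 4.
patients = pd.DataFrame({
    'location-id': [1, 1, 4, 4],
    'patient-id': [217, 300, 541, 217],
    'age': [60, 65, 70, 75],
})
# Made-up dropouts: (1, 217) and (4, 541).
dropouts = pd.DataFrame({'location-id': [1, 4], 'patient-id': [217, 541]})
keys = ['location-id', 'patient-id']

# Step 1: the left merge keeps every patient and marks where each row was found.
marked = pd.merge(patients, dropouts, how='left', on=keys, indicator=True)

# Step 2: 'left_only' rows have no match among the dropouts, so keep those.
remaining = marked[marked['_merge'] == 'left_only'].drop(columns='_merge')

print(remaining)
```

Two patients remain in this toy example. Patient (4, 217) is kept even though (1, 217) dropped out, because the join uses both key columns.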

4. Save final result in processed_data_predimed.csv

  1. Use the .to_csv method of pandas DataFrames
In [22]:
fname = 'processed_data_predimed.csv'

#  your code here
# index=False: the row index carries no information here, so do not write it
data_diet_loc_drop.to_csv(fname, index=False)
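
By default, `to_csv` also writes the row index as an extra, unnamed first column, which then reappears when the file is read back. The sketch below compares both variants on a made-up table, writing to in-memory buffers instead of real files (`filtered` and the buffer names are hypothetical).

```python
import io
import pandas as pd

# Made-up table; filtering leaves gaps in the row index, as an anti-join does.
filtered = pd.DataFrame({
    'patient-id': [436, 1130, 1131],
    'age': [58, 77, 72],
}).query('age > 60')

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
filtered.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
filtered.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()

print(cols_with)
print(cols_without)
```

Whether to keep the index depends on whether it carries information. After a merge and a filter it is usually just a leftover row counter.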