Exercise on joins and anti-joins: add information from other tables
import pandas as pd
Load data from the clinical trial
The data come in two different files. The file predimed_records.csv contains the clinical data for each patient, but not the diet group they were assigned to. The file predimed_mapping.csv contains the information about which patient was assigned to which diet group.
df = pd.read_csv('../../data/predimed_records.csv')
df.head()
info = pd.read_csv('../../data/predimed_mapping.csv')
info.head()
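The study was conducted at 5 different locations, and each location assigned its own identification number, patient-id, to each of its participants. A patient is therefore identified by the pair location-id and patient-id, not by patient-id alone.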
info['location-id'].unique()
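Since every location hands out its own numbering, the same patient-id may appear at more than one location. The sketch below illustrates this on a small made-up table (the values and the `diet` column name are hypothetical, not taken from the PREDIMED files): patient-id alone is duplicated, while the pair of location-id and patient-id is unique.

```python
import pandas as pd

# Hypothetical toy mapping, NOT the real study data:
# patient-id 1 exists at both location 1 and location 2.
toy_info = pd.DataFrame({
    'location-id': [1, 1, 2, 2],
    'patient-id': [1, 2, 1, 3],
    'diet': ['A', 'B', 'C', 'A'],  # hypothetical column name and labels
})

# patient-id on its own is not a unique key ...
dup_on_patient = toy_info.duplicated(subset=['patient-id']).any()
# ... but the (location-id, patient-id) pair is.
dup_on_pair = toy_info.duplicated(subset=['location-id', 'patient-id']).any()

print(dup_on_patient, dup_on_pair)
```

The same `duplicated` check can be run on `info` and `df` before merging to confirm that the two-column key is unique in each table.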
1. Add diet information to the patients' records
- For how many patients do we have clinical information? (i.e., rows in df)
## your code here
df.shape
- For how many patients do we have diet information? (i.e., rows in info)
## your code here
info.shape
Perform the merge, keeping in mind that it only makes sense to analyze patients for whom we have diet information.
- Which type of merge would you do?
- For how many patients do we have full information (records and which diet they followed)?
## your code here
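# An inner merge keeps only patients present in both tables.
# Both columns are needed as the key because patient-id is assigned per location.
data_diet = pd.merge(df, info, how='inner', on=['location-id', 'patient-id'])
# Number of patients with full information (rows, columns)
data_diet.shape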
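To see why the choice of merge type matters here, the following sketch compares an inner and a left merge on two small made-up tables (all values and the `age` and `diet` columns are hypothetical). With an inner merge, records without a diet assignment disappear; with a left merge they stay, but with a missing diet.

```python
import pandas as pd

# Hypothetical toy tables, NOT the real study data.
records = pd.DataFrame({
    'location-id': [1, 1, 2],
    'patient-id': [1, 2, 1],
    'age': [60, 71, 65],
})
mapping = pd.DataFrame({
    'location-id': [1, 2, 2],
    'patient-id': [1, 1, 9],
    'diet': ['A', 'B', 'A'],
})

keys = ['location-id', 'patient-id']

# Inner: only patients found in both tables.
inner = pd.merge(records, mapping, how='inner', on=keys)
# Left: every record is kept; unmatched ones get NaN in 'diet'.
left = pd.merge(records, mapping, how='left', on=keys)

print(len(inner))                  # patients with records and diet
print(left['diet'].isna().sum())   # records without a diet assignment
```

Comparing the row counts of the inner and left results is a quick way to find out how many patients are lost because they have no diet information.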
2. Add location information to the patients' records
The study was conducted at five locations. The DataFrame below maps each location-id to its city.
- Add a new column to the dataset that contains the city where each patient was recorded.
locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5],
'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})
locations
## your code here:
data_diet_loc = pd.merge(data_diet, locations, how='inner', left_on='location-id', right_on='location-id')
data_diet_loc.head()
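Adding a lookup column like the city is a many-to-one merge: many patients share one location, and each location should appear only once in the lookup table. As an optional safeguard, `pd.merge` accepts a `validate` argument that raises an error if this assumption is violated. The sketch below uses small made-up tables (the `patients`, `cities` and `bad_cities` names are hypothetical).

```python
import pandas as pd

# Hypothetical toy tables, NOT the real study data.
patients = pd.DataFrame({'location-id': [1, 1, 3], 'patient-id': [1, 2, 1]})
cities = pd.DataFrame({'location-id': [1, 2, 3],
                       'City': ['Madrid', 'Valencia', 'Barcelona']})

# validate='many_to_one' checks that the keys of the right table are unique.
with_city = pd.merge(patients, cities, how='left', on='location-id',
                     validate='many_to_one')
print(with_city)

# A lookup table with a repeated location-id would silently duplicate patients;
# with validate set, pandas raises a MergeError instead.
bad_cities = pd.concat([cities, cities.iloc[[0]]], ignore_index=True)
try:
    pd.merge(patients, bad_cities, on='location-id', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```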
3. Remove dropouts from the table
Some patients dropped out of the study early on and should be removed from our analysis. Their IDs are stored in the file dropped.csv.
- Load the list of patients who dropped out from dropped.csv
- Use an anti-join to remove them from the table
- How many patients (rows) are left in the data?
dropped = pd.read_csv('dropped.csv')
dropped.shape
dropped.head()
# your code here
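# Anti-join: left merge with an indicator column, then keep only the rows
# that were NOT matched in `dropped` (marked 'left_only').
data_diet_loc_drop = pd.merge(
    data_diet_loc,
    dropped,
    how='left',
    on=['location-id', 'patient-id'],
    indicator=True
).query('_merge == "left_only"').drop(columns='_merge')
# Number of patients (rows) left in the data
data_diet_loc_drop.shape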
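The anti-join pattern is easier to follow when the intermediate `_merge` column is inspected. The sketch below runs it step by step on small made-up tables (the `table` and `to_remove` names and their values are hypothetical): rows marked `both` were found in the removal list, rows marked `left_only` were not and are the ones to keep.

```python
import pandas as pd

# Hypothetical toy tables, NOT the real study data.
table = pd.DataFrame({
    'location-id': [1, 1, 2, 2],
    'patient-id': [1, 2, 1, 2],
})
to_remove = pd.DataFrame({'location-id': [1, 2], 'patient-id': [2, 2]})

keys = ['location-id', 'patient-id']

# Step 1: left merge with indicator=True adds a '_merge' column.
marked = pd.merge(table, to_remove, how='left', on=keys, indicator=True)
print(marked)

# Step 2: keep rows that only exist in the left table, then drop the helper column.
kept = marked[marked['_merge'] == 'left_only'].drop(columns='_merge')
print(kept)
```

Because the merge is a left merge, `_merge` can only take the values `left_only` and `both`, so filtering on `_merge != "both"` selects the same rows.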
4. Save final result in processed_data_predimed.csv
- Using the .to_csv method of Pandas DataFrames
fname = 'processed_data_predimed.csv'
# your code here
data_diet_loc_drop.to_csv(fname)
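One detail worth knowing: by default `.to_csv` also writes the row index, which shows up as an extra unnamed column when the file is read back. Whether that is wanted depends on the analysis; the sketch below, on a made-up table written to an in-memory buffer instead of a file, shows the difference that `index=False` makes.

```python
import io
import pandas as pd

# Hypothetical toy result, NOT the real study data.
result = pd.DataFrame({'patient-id': [1, 2], 'City': ['Madrid', 'Bilbao']})

# Default: the index is written as a first, unnamed column.
with_index = io.StringIO()
result.to_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
result.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_with)
print(cols_without)
```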