Exercise on joins and anti-joins: add information from other tables
import pandas as pd
Load data from clinical trial
The data comes in two different files. The file predimed_records.csv contains the clinical data for each patient, except for the diet group they were assigned to. The file predimed_mapping.csv contains the information on which patient was assigned to which diet group.
df = pd.read_csv('../../data/predimed_records.csv')
df.head()
info = pd.read_csv('../../data/predimed_mapping.csv')
info.head()
The study was conducted at 5 different locations. Each location gave its own identification number, patient-id, to each of its participants, so the same patient-id may appear at more than one location.
info['location-id'].unique()
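Because each location hands out its own patient-id values, a quick uniqueness check can help decide which columns to join on. The sketch below uses a small made-up table (the `group` column and its values are hypothetical, not taken from predimed_mapping.csv) and assumes that IDs can repeat across locations.

```python
import pandas as pd

# Toy mapping table with hypothetical values (not the PREDIMED data).
toy_info = pd.DataFrame({
    'location-id': [1, 1, 2, 2],
    'patient-id': [1, 2, 1, 2],
    'group': ['Control', 'MedDiet + VOO', 'MedDiet + Nuts', 'Control'],
})

# patient-id on its own repeats across locations in this toy table...
patient_id_is_unique = toy_info['patient-id'].is_unique

# ...but the pair (location-id, patient-id) identifies exactly one row.
pair_is_unique = not toy_info.duplicated(subset=['location-id', 'patient-id']).any()

print(patient_id_is_unique, pair_is_unique)
```

If the same check on `info` shows the pair is unique while patient-id alone is not, both columns are needed as merge keys.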
1. Add diet information to the patients' records
- For how many patients do we have clinical information? (i.e., rows in df)
## your code here
df.shape
- For how many patients do we have diet information? (i.e., rows in info)
## your code here
info.shape
Perform the merge, keeping in mind that it only makes sense to analyze patients for whom we have the diet information.
- Which type of merge would you do?
- For how many patients do we have full information (records and which diet they followed)?
## your code here
data_with_diet = df.merge(info, how='inner', on=['location-id', 'patient-id'])
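To see what an inner merge keeps and what it discards, it can help to compare it with an outer merge that uses `indicator=True`. This is a minimal sketch on made-up tables; the `age` and `group` columns and all values are hypothetical.

```python
import pandas as pd

# Toy clinical records and toy diet mapping (hypothetical values).
toy_records = pd.DataFrame({
    'location-id': [1, 1, 2],
    'patient-id': [1, 2, 1],
    'age': [58, 77, 72],
})
toy_mapping = pd.DataFrame({
    'location-id': [1, 2, 2],
    'patient-id': [1, 1, 2],
    'group': ['Control', 'MedDiet + Nuts', 'MedDiet + VOO'],
})

keys = ['location-id', 'patient-id']

# Inner merge: only patients present in both tables survive.
inner = toy_records.merge(toy_mapping, how='inner', on=keys)

# Outer merge with indicator=True adds a '_merge' column that says
# whether each row came from 'left_only', 'right_only' or 'both'.
audit = toy_records.merge(toy_mapping, how='outer', on=keys, indicator=True)

print(len(inner))
print(audit['_merge'].value_counts())
```

The number of rows labelled `both` in the audit table equals the number of rows in the inner merge, which answers the "full information" question.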
2. Add location information to the patients' records
There were five locations where the study was conducted. Here is a DataFrame containing the information of each location.
- Add a new column to the dataset that contains the city where each patient was recorded.
locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5],
'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})
locations
## your code here:
data_with_locations = data_with_diet.merge(locations, on = 'location-id')
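Adding the city is a lookup: many patients map to one row of the locations table. One optional way to guard this kind of merge, sketched here on a made-up patient table, is to pass `how='left'` so no patient row is lost and `validate='many_to_one'` so pandas raises an error if a location-id is duplicated in the lookup table. The patient rows are hypothetical.

```python
import pandas as pd

# Toy patient table (hypothetical values).
toy_patients = pd.DataFrame({
    'location-id': [1, 1, 3, 5],
    'patient-id': [1, 2, 1, 1],
})
locations = pd.DataFrame({
    'location-id': [1, 2, 3, 4, 5],
    'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao', 'Malaga'],
})

# Left merge keeps every patient; validate checks that each
# location-id appears at most once in the lookup table.
with_city = toy_patients.merge(
    locations, how='left', on='location-id', validate='many_to_one'
)

print(with_city)
```

After a lookup merge like this, the row count should be unchanged, which is an easy sanity check.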
3. Remove dropouts from the table
Some patients dropped out of the study early on, and they should be removed from our analysis. Their IDs are stored in the file dropped.csv.
- Load the list of patients who dropped out from dropped.csv
- Use an anti-join to remove them from the table
- How many patients (rows) are left in the data?
dropped = pd.read_csv('dropped.csv')
dropped.shape
dropped.head()
# your code here
data_without_dropouts = data_with_locations
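pandas has no dedicated anti-join, but a common pattern is a left merge with `indicator=True` followed by a filter on rows marked `left_only`. The sketch below runs on made-up tables and assumes that the dropout list carries the same two key columns, location-id and patient-id. That is an assumption about dropped.csv, so check `dropped.head()` first.

```python
import pandas as pd

# Toy patient table and toy dropout list (hypothetical values).
toy_data = pd.DataFrame({
    'location-id': [1, 1, 2, 2],
    'patient-id': [1, 2, 1, 2],
    'City': ['Madrid', 'Madrid', 'Valencia', 'Valencia'],
})
toy_dropped = pd.DataFrame({
    'location-id': [1, 2],
    'patient-id': [2, 1],
})

keys = ['location-id', 'patient-id']

# Left merge against the de-duplicated keys of the dropout list;
# '_merge' is 'both' for dropouts and 'left_only' for everyone else.
marked = toy_data.merge(
    toy_dropped[keys].drop_duplicates(), how='left', on=keys, indicator=True
)

# Anti-join: keep only rows with no match in the dropout list.
kept = marked.loc[marked['_merge'] == 'left_only'].drop(columns='_merge')

print(kept)
```

The number of rows left should equal the original row count minus the number of dropouts that actually appear in the table.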
4. Save final result in processed_data_predimed.csv
- Use the .to_csv method of pandas DataFrames
fname = 'processed_data_predimed.csv'
# your code here
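As a rough illustration of `.to_csv`, the sketch below writes a made-up table to an in-memory buffer instead of a file, so nothing is created on disk. Passing `index=False` is a choice made here, not something the exercise requires: it leaves out the row index so that reading the file back does not produce an extra unnamed column.

```python
import io

import pandas as pd

# Toy final table (hypothetical values).
toy_final = pd.DataFrame({
    'location-id': [1, 2],
    'patient-id': [1, 2],
    'City': ['Madrid', 'Valencia'],
})

# Write to an in-memory text buffer; a file name such as fname
# could be passed in place of the buffer.
buffer = io.StringIO()
toy_final.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the text back to confirm the round trip.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(csv_text)
```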