Exercise on Joins and anti-joins: add information from other tables¶
import pandas as pd
# Set some Pandas options: maximum number of rows/columns it's going to display
#pd.set_option('display.max_rows', 1000)
#pd.set_option('display.max_columns', 100)
Load data from clinical trial¶
The data come in two files. The file predimed_records.csv contains the clinical data for each patient, but not the diet group they were assigned to. The file predimed_mapping.csv records which patient was assigned to which diet group.
df = pd.read_csv('../../data/predimed_records.csv')
df.head()
info = pd.read_csv('../../data/predimed_mapping.csv')
info.head()
The study was conducted at 5 different locations, and each location assigned its own identification number, patient-id, to each of its participants.
info['location-id'].unique()
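Because each location numbered its own participants, the same patient-id may well appear at more than one location, in which case a patient is only identified by the pair (location-id, patient-id). The following is a minimal sketch of how one might check this; the toy table, its values, and the `group` column name are invented for illustration and are not taken from the PREDIMED files.

```python
import pandas as pd

# Hypothetical toy mapping: patient-id 1 exists at both location 1 and location 2.
toy_info = pd.DataFrame({
    'location-id': [1, 1, 2, 2],
    'patient-id': [1, 2, 1, 3],
    'group': ['A', 'B', 'B', 'A'],  # assumed column name
})

# Is patient-id on its own a unique key?
id_is_unique = not toy_info['patient-id'].duplicated().any()

# Is the pair (location-id, patient-id) a unique key?
pair_is_unique = not toy_info.duplicated(subset=['location-id', 'patient-id']).any()

print(id_is_unique, pair_is_unique)
```

Running the same two checks on `info` would show whether merging on both columns is actually required for the real data.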
1. Add diet information to the patients' records¶
- For how many patients do we have clinical information? (i.e., rows in df)
- For how many patients do we have diet information? (i.e., rows in info)
len(df)
len(info)
Perform the merge, keeping in mind that it only makes sense to analyze patients for whom we have diet information.
- Which type of merge would you do?
- For how many patients do we have full information (records and which diet they followed)?
df_with_info = df.merge(info, on=['patient-id', 'location-id'], how='right')
df_with_info.count()
df_with_info
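To make the choice of merge type concrete, here is a small sketch comparing the four `how` options on invented data. The tables `records` and `mapping` and the columns `age` and `group` are hypothetical stand-ins for `df` and `info`, not the real files.

```python
import pandas as pd

# Hypothetical toy tables (values invented for illustration).
records = pd.DataFrame({'location-id': [1, 1, 2],
                        'patient-id': [1, 2, 1],
                        'age': [60, 71, 65]})
mapping = pd.DataFrame({'location-id': [1, 2, 2],
                        'patient-id': [1, 1, 9],
                        'group': ['A', 'B', 'A']})
keys = ['location-id', 'patient-id']

# Number of rows produced by each merge type.
rows = {how: len(records.merge(mapping, on=keys, how=how))
        for how in ['inner', 'left', 'right', 'outer']}
print(rows)

# A right merge keeps every row of `mapping`; patients without a record
# get NaN in the columns that come from `records`.
right = records.merge(mapping, on=keys, how='right')
n_missing_records = int(right['age'].isna().sum())
print(n_missing_records)
```

In this toy case a right merge keeps all patients with diet information, while an inner merge keeps only those that also have a clinical record; comparing the two row counts is one way to answer the second question.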
2. Add location information to the patients' records¶
There were five locations where the study was conducted. Here is a DataFrame containing the information of each location.
- Add a new column to the dataset that contains the city where each patient was recorded.
locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5],
'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})
locations
df_with_info = df_with_info.merge(locations, on='location-id', how='right')
df_with_info
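The cell above uses `how='right'`, which keeps every location even if no patient was recorded there. As an alternative sketch, a left merge keeps exactly one row per patient, and the `validate` argument of `merge` can guard against a lookup table with repeated keys. The toy tables below are invented for illustration.

```python
import pandas as pd

# Hypothetical toy tables: no patient was recorded at location 2.
patients = pd.DataFrame({'location-id': [1, 1, 3], 'patient-id': [1, 2, 1]})
cities = pd.DataFrame({'location-id': [1, 2, 3],
                       'City': ['Madrid', 'Valencia', 'Barcelona']})

# how='left' keeps one output row per patient, in the original order.
# validate='many_to_one' raises a MergeError if a location-id is repeated in `cities`.
with_city = patients.merge(cities, on='location-id', how='left',
                           validate='many_to_one')
print(with_city)
```

With the five-location table from the exercise both approaches should give the same rows as long as every location has at least one patient; only the row order may differ.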
3. Remove dropped patients from the table¶
Some patients dropped out of the study early on and should be removed from our analysis. Their IDs are stored in the file dropped.csv.
- Load the list of patients who dropped out, from dropped.csv
- Use an anti-join to remove them from the table
- How many patients (rows) are left in the data?
dropped = pd.read_csv('dropped.csv')
dropped.shape
dropped.head()
temp = df_with_info.merge(dropped, on=['location-id', 'patient-id'], how='outer', indicator=True)
temp
df_without_dropped = temp[temp['_merge'] == 'left_only'].drop('_merge', axis=1)
df_without_dropped.shape
df_without_dropped.head()
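The anti-join pattern used above can also be wrapped in a small helper. This is only a sketch: the function name `anti_join` and the toy tables are hypothetical, and it uses a left merge rather than an outer merge, which gives the same `left_only` rows without creating `right_only` rows that are filtered out anyway.

```python
import pandas as pd

def anti_join(left, right, on):
    """Return the rows of `left` whose key columns have no match in `right`."""
    # Use only the de-duplicated key columns of `right`, so the merge
    # cannot add extra columns or duplicate rows of `left`.
    right_keys = right[on].drop_duplicates()
    merged = left.merge(right_keys, on=on, how='left', indicator=True)
    return merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Hypothetical toy data: patient 2 at location 1 dropped out.
toy_patients = pd.DataFrame({'location-id': [1, 1, 2],
                             'patient-id': [1, 2, 1],
                             'age': [60, 71, 65]})
toy_dropped = pd.DataFrame({'location-id': [1], 'patient-id': [2]})

kept = anti_join(toy_patients, toy_dropped, on=['location-id', 'patient-id'])
print(kept)
```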
4. Save final result in processed_data_predimed.csv¶
- Using the .to_csv method of Pandas DataFrames
# index=False leaves the row index out of the saved file
df_without_dropped.to_csv('processed_data_predimed.csv', index=False)
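To see why the row index is left out when saving, here is a small round-trip sketch. It writes to an in-memory buffer rather than a file, and the toy table is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical toy table.
toy = pd.DataFrame({'patient-id': [1, 2], 'City': ['Madrid', 'Bilbao']})

# Write to an in-memory text buffer; index=False omits the row index.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Reading the text back gives the same columns, with no extra index column.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(list(round_trip.columns))
```

If the index were written as well, reading the file back would produce an additional unnamed column holding the old row numbers.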