2025-plovdiv-data/exercises/tabular_join/tabular_join.ipynb

{
"cells": [
{
"cell_type": "markdown",
"id": "f11a76bf",
"metadata": {},
"source": [
"# Exercise on joins and anti-joins: adding information from other tables"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b6f2742b",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "2967c84e",
"metadata": {},
"source": [
"# Load data from the clinical trial\n",
"\n",
"The data come in two files. The file `predimed_records.csv` contains the clinical data for each patient, except the diet group they were assigned to. The file `predimed_mapping.csv` records which patient was assigned to which diet group."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ed626ee3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>patient-id</th>\n",
" <th>location-id</th>\n",
" <th>sex</th>\n",
" <th>age</th>\n",
" <th>smoke</th>\n",
" <th>bmi</th>\n",
" <th>waist</th>\n",
" <th>wth</th>\n",
" <th>htn</th>\n",
" <th>diab</th>\n",
" <th>hyperchol</th>\n",
" <th>famhist</th>\n",
" <th>hormo</th>\n",
" <th>p14</th>\n",
" <th>toevent</th>\n",
" <th>event</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>436</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>58</td>\n",
" <td>Former</td>\n",
" <td>33.53</td>\n",
" <td>122</td>\n",
" <td>0.753086</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>5.374401</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1130</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>77</td>\n",
" <td>Current</td>\n",
" <td>31.05</td>\n",
" <td>119</td>\n",
" <td>0.730061</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>6.097194</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1131</td>\n",
" <td>4</td>\n",
" <td>Female</td>\n",
" <td>72</td>\n",
" <td>Former</td>\n",
" <td>30.86</td>\n",
" <td>106</td>\n",
" <td>0.654321</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>5.946612</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1132</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>71</td>\n",
" <td>Former</td>\n",
" <td>27.68</td>\n",
" <td>118</td>\n",
" <td>0.694118</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>2.907598</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1111</td>\n",
" <td>2</td>\n",
" <td>Female</td>\n",
" <td>79</td>\n",
" <td>Never</td>\n",
" <td>35.94</td>\n",
" <td>129</td>\n",
" <td>0.806250</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>4.761123</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" patient-id location-id sex age smoke bmi waist wth htn \\\n",
"0 436 4 Male 58 Former 33.53 122 0.753086 No \n",
"1 1130 4 Male 77 Current 31.05 119 0.730061 Yes \n",
"2 1131 4 Female 72 Former 30.86 106 0.654321 No \n",
"3 1132 4 Male 71 Former 27.68 118 0.694118 Yes \n",
"4 1111 2 Female 79 Never 35.94 129 0.806250 Yes \n",
"\n",
" diab hyperchol famhist hormo p14 toevent event \n",
"0 No Yes No No 10 5.374401 Yes \n",
"1 Yes No No No 10 6.097194 No \n",
"2 Yes No Yes No 8 5.946612 No \n",
"3 No Yes No No 8 2.907598 Yes \n",
"4 No Yes No No 9 4.761123 No "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('../../data/predimed_records.csv')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "48d5375f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location-id</th>\n",
" <th>patient-id</th>\n",
" <th>group</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>885</td>\n",
" <td>MedDiet + VOO</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>182</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>971</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>691</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>632</td>\n",
" <td>Control</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location-id patient-id group\n",
"0 2 885 MedDiet + VOO\n",
"1 1 182 MedDiet + Nuts\n",
"2 1 971 MedDiet + Nuts\n",
"3 2 691 MedDiet + Nuts\n",
"4 2 632 Control"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"info = pd.read_csv('../../data/predimed_mapping.csv')\n",
"info.head()"
]
},
{
"cell_type": "markdown",
"id": "2b4b98ed-d7ec-4b7c-b983-adc616d2f16f",
"metadata": {},
"source": [
"The study was conducted at 5 different locations, and each location assigned its own identification number `patient-id` to each of its participants."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b9dbc492-1489-4530-96ac-5f33f7389caa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 1, 3, 4, 5])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"info['location-id'].unique()"
]
},
{
"cell_type": "markdown",
"id": "2fef4d37",
"metadata": {},
"source": [
"# 1. Add diet information to the patients' records\n",
"\n",
"* For how many patients do we have clinical information? (i.e., rows in `df`)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "861ac334-14ce-490a-b3c4-877b32789f3e",
"metadata": {},
"outputs": [],
"source": [
"## your code here\n"
]
},
{
"cell_type": "markdown",
"id": "1c1701e2-c295-4032-9e89-0d8470f41593",
"metadata": {},
"source": [
"* For how many patients do we have diet information? (i.e., rows in `info`)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "14f57842-5722-4953-88d6-d7cf3070400c",
"metadata": {},
"outputs": [],
"source": [
"## your code here\n"
]
},
{
"cell_type": "markdown",
"id": "3f23fa17-af3e-41c3-883f-3e1279d4820e",
"metadata": {},
"source": [
"Perform the merge, keeping in mind that it only makes sense to analyze patients for whom we have diet information.\n",
"* Which type of merge would you use?\n",
"* For how many patients do we have full information (records and the diet they followed)?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "35e19a53",
"metadata": {},
"outputs": [],
"source": [
"## your code here\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "946beb08-30a5-4020-8612-360385cdfc1e",
"metadata": {},
"source": [
"# 2. Add location information to the patients' records\n",
"\n",
"There were five locations where the study was conducted. The DataFrame below contains the information for each location.\n",
"\n",
"- Add a new column to the dataset that contains the city where each patient was recorded.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "36ce0688-d421-4a07-b00e-0e9b3201f0e0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location-id</th>\n",
" <th>City</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Valencia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Barcelona</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Bilbao</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location-id City\n",
"0 1 Madrid\n",
"1 2 Valencia\n",
"2 3 Barcelona\n",
"3 4 Bilbao\n",
"4 5 Malaga"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5], \n",
" 'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})\n",
"locations"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "b636dde4-129a-4dd1-8cbf-c539c9c8a5f2",
"metadata": {},
"outputs": [],
"source": [
"## your code here:\n"
]
},
{
"cell_type": "markdown",
"id": "44031178",
"metadata": {},
"source": [
"# 3. Remove dropouts from the table\n",
"\n",
"Some patients dropped out of the study early on and should be removed from our analysis. Their IDs are stored in the file `dropped.csv`.\n",
"1. Load the list of patients who dropped out from `dropped.csv`\n",
"2. Use an anti-join to remove them from the table\n",
"3. How many patients (rows) are left in the data?"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d1d4cc27",
"metadata": {},
"outputs": [],
"source": [
"dropped = pd.read_csv('dropped.csv')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fbebbd97",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(42, 2)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dropped.shape"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8a3c7943",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location-id</th>\n",
" <th>patient-id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>217</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1147</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1170</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>627</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>541</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location-id patient-id\n",
"0 1 217\n",
"1 1 1147\n",
"2 1 1170\n",
"3 1 627\n",
"4 4 541"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dropped.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "573687e7",
"metadata": {},
"outputs": [],
"source": [
"# your code here\n"
]
},
{
"cell_type": "markdown",
"id": "84270332",
"metadata": {},
"source": [
"# 4. Save the final result in `processed_data_predimed.csv`\n",
"\n",
"1. Use the `.to_csv` method of pandas DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "85902eea",
"metadata": {},
"outputs": [],
"source": [
"fname = 'processed_data_predimed.csv'\n",
"\n",
"# your code here\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}