2025-plovdiv-data/exercises/tabular_join/tabular_join_solution.ipynb

1908 lines
58 KiB
Text
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"id": "f11a76bf",
"metadata": {},
"source": [
"# Exercise on Joins and anti-joins: add information from other tables"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b6f2742b",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Set some Pandas options: maximum number of rows/columns it's going to display\n",
"#pd.set_option('display.max_rows', 1000)\n",
"#pd.set_option('display.max_columns', 100)"
]
},
{
"cell_type": "markdown",
"id": "2967c84e",
"metadata": {},
"source": [
"# Load data from clinical trial\n",
"\n",
"Data comes in two different files. The file `predimed_records.csv` file contains the clinical data for each patient, except which diet group they were assigned. The file `predimed_mapping.csv` contain the information of which patient was assigned to which diet group. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ed626ee3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>patient-id</th>\n",
" <th>location-id</th>\n",
" <th>sex</th>\n",
" <th>age</th>\n",
" <th>smoke</th>\n",
" <th>bmi</th>\n",
" <th>waist</th>\n",
" <th>wth</th>\n",
" <th>htn</th>\n",
" <th>diab</th>\n",
" <th>hyperchol</th>\n",
" <th>famhist</th>\n",
" <th>hormo</th>\n",
" <th>p14</th>\n",
" <th>toevent</th>\n",
" <th>event</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>436</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>58</td>\n",
" <td>Former</td>\n",
" <td>33.53</td>\n",
" <td>122</td>\n",
" <td>0.753086</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>5.374401</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1130</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>77</td>\n",
" <td>Current</td>\n",
" <td>31.05</td>\n",
" <td>119</td>\n",
" <td>0.730061</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>6.097194</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1131</td>\n",
" <td>4</td>\n",
" <td>Female</td>\n",
" <td>72</td>\n",
" <td>Former</td>\n",
" <td>30.86</td>\n",
" <td>106</td>\n",
" <td>0.654321</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>5.946612</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1132</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>71</td>\n",
" <td>Former</td>\n",
" <td>27.68</td>\n",
" <td>118</td>\n",
" <td>0.694118</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>2.907598</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1111</td>\n",
" <td>2</td>\n",
" <td>Female</td>\n",
" <td>79</td>\n",
" <td>Never</td>\n",
" <td>35.94</td>\n",
" <td>129</td>\n",
" <td>0.806250</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>4.761123</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" patient-id location-id sex age smoke bmi waist wth htn \\\n",
"0 436 4 Male 58 Former 33.53 122 0.753086 No \n",
"1 1130 4 Male 77 Current 31.05 119 0.730061 Yes \n",
"2 1131 4 Female 72 Former 30.86 106 0.654321 No \n",
"3 1132 4 Male 71 Former 27.68 118 0.694118 Yes \n",
"4 1111 2 Female 79 Never 35.94 129 0.806250 Yes \n",
"\n",
" diab hyperchol famhist hormo p14 toevent event \n",
"0 No Yes No No 10 5.374401 Yes \n",
"1 Yes No No No 10 6.097194 No \n",
"2 Yes No Yes No 8 5.946612 No \n",
"3 No Yes No No 8 2.907598 Yes \n",
"4 No Yes No No 9 4.761123 No "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('../../data/predimed_records.csv')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "48d5375f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location-id</th>\n",
" <th>patient-id</th>\n",
" <th>group</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>885</td>\n",
" <td>MedDiet + VOO</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>182</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>971</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>691</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>632</td>\n",
" <td>Control</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location-id patient-id group\n",
"0 2 885 MedDiet + VOO\n",
"1 1 182 MedDiet + Nuts\n",
"2 1 971 MedDiet + Nuts\n",
"3 2 691 MedDiet + Nuts\n",
"4 2 632 Control"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"info = pd.read_csv('../../data/predimed_mapping.csv')\n",
"info.head()"
]
},
{
"cell_type": "markdown",
"id": "2b4b98ed-d7ec-4b7c-b983-adc616d2f16f",
"metadata": {},
"source": [
"There were 5 different locations where the study was conducted, each one gave an identification number `patient-id` to each participant."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b9dbc492-1489-4530-96ac-5f33f7389caa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 1, 3, 4, 5])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"info['location-id'].unique()"
]
},
{
"cell_type": "markdown",
"id": "2fef4d37",
"metadata": {},
"source": [
"# 1. Add diet information to the patients' records\n",
"\n",
"* For how many patients do we have clinical information? (i.e., rows in `df`)\n",
"* For how many patients do we have diet information? (i.e., rows in `info`)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "861ac334-14ce-490a-b3c4-877b32789f3e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6324"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "14f57842-5722-4953-88d6-d7cf3070400c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6287"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(info)"
]
},
{
"cell_type": "markdown",
"id": "3f23fa17-af3e-41c3-883f-3e1279d4820e",
"metadata": {},
"source": [
"Perform the merge, keeping in mind that it only make sense to analyze patients with the diet information. \n",
"* Which type of merge would you do? \n",
"* For how many patients do we have full information (records and which diet they followed? "
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "35e19a53",
"metadata": {},
"outputs": [],
"source": [
"df_with_info = df.merge(info, on=['patient-id', 'location-id'], how='right')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "eac1244f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"patient-id 6287\n",
"location-id 6287\n",
"sex 6287\n",
"age 6287\n",
"smoke 6287\n",
"bmi 6287\n",
"waist 6287\n",
"wth 6287\n",
"htn 6287\n",
"diab 6287\n",
"hyperchol 6287\n",
"famhist 6287\n",
"hormo 5629\n",
"p14 6287\n",
"toevent 6287\n",
"event 6287\n",
"group 6287\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_with_info.count()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "0d770d69-7a0a-47c9-bb5d-93a11329e7ad",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>patient-id</th>\n",
" <th>location-id</th>\n",
" <th>sex</th>\n",
" <th>age</th>\n",
" <th>smoke</th>\n",
" <th>bmi</th>\n",
" <th>waist</th>\n",
" <th>wth</th>\n",
" <th>htn</th>\n",
" <th>diab</th>\n",
" <th>hyperchol</th>\n",
" <th>famhist</th>\n",
" <th>hormo</th>\n",
" <th>p14</th>\n",
" <th>toevent</th>\n",
" <th>event</th>\n",
" <th>group</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>885</td>\n",
" <td>2</td>\n",
" <td>Male</td>\n",
" <td>74</td>\n",
" <td>Former</td>\n",
" <td>29.94</td>\n",
" <td>107</td>\n",
" <td>0.681529</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>NaN</td>\n",
" <td>8</td>\n",
" <td>5.711157</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>182</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>60</td>\n",
" <td>Former</td>\n",
" <td>30.76</td>\n",
" <td>85</td>\n",
" <td>0.555556</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>10</td>\n",
" <td>3.274470</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>971</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>65</td>\n",
" <td>Never</td>\n",
" <td>23.81</td>\n",
" <td>86</td>\n",
" <td>0.540881</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>3.088296</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>691</td>\n",
" <td>2</td>\n",
" <td>Female</td>\n",
" <td>64</td>\n",
" <td>Never</td>\n",
" <td>32.70</td>\n",
" <td>102</td>\n",
" <td>0.637500</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>3.028063</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>632</td>\n",
" <td>2</td>\n",
" <td>Female</td>\n",
" <td>73</td>\n",
" <td>Never</td>\n",
" <td>28.32</td>\n",
" <td>91</td>\n",
" <td>0.594771</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>5.919233</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6282</th>\n",
" <td>855</td>\n",
" <td>1</td>\n",
" <td>Male</td>\n",
" <td>55</td>\n",
" <td>Former</td>\n",
" <td>29.77</td>\n",
" <td>106</td>\n",
" <td>0.612717</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>3.449692</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6283</th>\n",
" <td>711</td>\n",
" <td>4</td>\n",
" <td>Female</td>\n",
" <td>78</td>\n",
" <td>Never</td>\n",
" <td>34.72</td>\n",
" <td>104</td>\n",
" <td>0.712329</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>1.921971</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6284</th>\n",
" <td>113</td>\n",
" <td>4</td>\n",
" <td>Female</td>\n",
" <td>60</td>\n",
" <td>Never</td>\n",
" <td>31.48</td>\n",
" <td>98</td>\n",
" <td>0.640523</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>6.403833</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6285</th>\n",
" <td>50</td>\n",
" <td>5</td>\n",
" <td>Female</td>\n",
" <td>77</td>\n",
" <td>Never</td>\n",
" <td>28.92</td>\n",
" <td>92</td>\n",
" <td>0.625850</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>11</td>\n",
" <td>5.360712</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6286</th>\n",
" <td>85</td>\n",
" <td>4</td>\n",
" <td>Female</td>\n",
" <td>61</td>\n",
" <td>Never</td>\n",
" <td>37.50</td>\n",
" <td>106</td>\n",
" <td>0.662500</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>1.823409</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6287 rows × 17 columns</p>\n",
"</div>"
],
"text/plain": [
" patient-id location-id sex age smoke bmi waist wth \\\n",
"0 885 2 Male 74 Former 29.94 107 0.681529 \n",
"1 182 1 Female 60 Former 30.76 85 0.555556 \n",
"2 971 1 Female 65 Never 23.81 86 0.540881 \n",
"3 691 2 Female 64 Never 32.70 102 0.637500 \n",
"4 632 2 Female 73 Never 28.32 91 0.594771 \n",
"... ... ... ... ... ... ... ... ... \n",
"6282 855 1 Male 55 Former 29.77 106 0.612717 \n",
"6283 711 4 Female 78 Never 34.72 104 0.712329 \n",
"6284 113 4 Female 60 Never 31.48 98 0.640523 \n",
"6285 50 5 Female 77 Never 28.92 92 0.625850 \n",
"6286 85 4 Female 61 Never 37.50 106 0.662500 \n",
"\n",
" htn diab hyperchol famhist hormo p14 toevent event group \n",
"0 Yes Yes Yes No NaN 8 5.711157 No MedDiet + VOO \n",
"1 No No Yes Yes Yes 10 3.274470 No MedDiet + Nuts \n",
"2 Yes Yes Yes No No 6 3.088296 No MedDiet + Nuts \n",
"3 Yes Yes No No No 8 3.028063 No MedDiet + Nuts \n",
"4 Yes Yes Yes Yes No 9 5.919233 No Control \n",
"... ... ... ... ... ... ... ... ... ... \n",
"6282 Yes No Yes Yes No 9 3.449692 No MedDiet + VOO \n",
"6283 Yes No Yes No No 9 1.921971 No Control \n",
"6284 No Yes Yes No No 10 6.403833 No MedDiet + VOO \n",
"6285 Yes No Yes No No 11 5.360712 No MedDiet + Nuts \n",
"6286 Yes No Yes Yes No 9 1.823409 No MedDiet + Nuts \n",
"\n",
"[6287 rows x 17 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_with_info"
]
},
{
"cell_type": "markdown",
"id": "946beb08-30a5-4020-8612-360385cdfc1e",
"metadata": {},
"source": [
"# 2. Add location information to the patients' records\n",
"\n",
"There were five locations where the study was conducted. Here is a DataFrame containing the information of each location. \n",
"\n",
"- Add a new column to the dataset that contains the city where each patient was recorded.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "36ce0688-d421-4a07-b00e-0e9b3201f0e0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location-id</th>\n",
" <th>City</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Valencia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Barcelona</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Bilbao</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location-id City\n",
"0 1 Madrid\n",
"1 2 Valencia\n",
"2 3 Barcelona\n",
"3 4 Bilbao\n",
"4 5 Malaga"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5], \n",
" 'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})\n",
"locations"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "b636dde4-129a-4dd1-8cbf-c539c9c8a5f2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>patient-id</th>\n",
" <th>location-id</th>\n",
" <th>sex</th>\n",
" <th>age</th>\n",
" <th>smoke</th>\n",
" <th>bmi</th>\n",
" <th>waist</th>\n",
" <th>wth</th>\n",
" <th>htn</th>\n",
" <th>diab</th>\n",
" <th>hyperchol</th>\n",
" <th>famhist</th>\n",
" <th>hormo</th>\n",
" <th>p14</th>\n",
" <th>toevent</th>\n",
" <th>event</th>\n",
" <th>group</th>\n",
" <th>City</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>182</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>60</td>\n",
" <td>Former</td>\n",
" <td>30.76</td>\n",
" <td>85</td>\n",
" <td>0.555556</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>10</td>\n",
" <td>3.274470</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>971</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>65</td>\n",
" <td>Never</td>\n",
" <td>23.81</td>\n",
" <td>86</td>\n",
" <td>0.540881</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>3.088296</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>485</td>\n",
" <td>1</td>\n",
" <td>Male</td>\n",
" <td>71</td>\n",
" <td>Former</td>\n",
" <td>22.41</td>\n",
" <td>92</td>\n",
" <td>0.516854</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>0.438056</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>621</td>\n",
" <td>1</td>\n",
" <td>Male</td>\n",
" <td>71</td>\n",
" <td>Former</td>\n",
" <td>32.70</td>\n",
" <td>110</td>\n",
" <td>0.662651</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>11</td>\n",
" <td>2.661191</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>954</td>\n",
" <td>1</td>\n",
" <td>Male</td>\n",
" <td>70</td>\n",
" <td>Former</td>\n",
" <td>29.48</td>\n",
" <td>107</td>\n",
" <td>0.633136</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>4.186174</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6282</th>\n",
" <td>1120</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>63</td>\n",
" <td>Never</td>\n",
" <td>28.83</td>\n",
" <td>92</td>\n",
" <td>0.534884</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>NaN</td>\n",
" <td>9</td>\n",
" <td>3.392197</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6283</th>\n",
" <td>1082</td>\n",
" <td>5</td>\n",
" <td>Female</td>\n",
" <td>63</td>\n",
" <td>Never</td>\n",
" <td>25.33</td>\n",
" <td>92</td>\n",
" <td>0.613333</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>5</td>\n",
" <td>5.886379</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6284</th>\n",
" <td>302</td>\n",
" <td>5</td>\n",
" <td>Female</td>\n",
" <td>59</td>\n",
" <td>Never</td>\n",
" <td>31.71</td>\n",
" <td>80</td>\n",
" <td>0.519481</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>5.631759</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6285</th>\n",
" <td>566</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>58</td>\n",
" <td>Former</td>\n",
" <td>27.81</td>\n",
" <td>98</td>\n",
" <td>0.583333</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>1.913758</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6286</th>\n",
" <td>50</td>\n",
" <td>5</td>\n",
" <td>Female</td>\n",
" <td>77</td>\n",
" <td>Never</td>\n",
" <td>28.92</td>\n",
" <td>92</td>\n",
" <td>0.625850</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>11</td>\n",
" <td>5.360712</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6287 rows × 18 columns</p>\n",
"</div>"
],
"text/plain": [
" patient-id location-id sex age smoke bmi waist wth \\\n",
"0 182 1 Female 60 Former 30.76 85 0.555556 \n",
"1 971 1 Female 65 Never 23.81 86 0.540881 \n",
"2 485 1 Male 71 Former 22.41 92 0.516854 \n",
"3 621 1 Male 71 Former 32.70 110 0.662651 \n",
"4 954 1 Male 70 Former 29.48 107 0.633136 \n",
"... ... ... ... ... ... ... ... ... \n",
"6282 1120 5 Male 63 Never 28.83 92 0.534884 \n",
"6283 1082 5 Female 63 Never 25.33 92 0.613333 \n",
"6284 302 5 Female 59 Never 31.71 80 0.519481 \n",
"6285 566 5 Male 58 Former 27.81 98 0.583333 \n",
"6286 50 5 Female 77 Never 28.92 92 0.625850 \n",
"\n",
" htn diab hyperchol famhist hormo p14 toevent event group \\\n",
"0 No No Yes Yes Yes 10 3.274470 No MedDiet + Nuts \n",
"1 Yes Yes Yes No No 6 3.088296 No MedDiet + Nuts \n",
"2 Yes Yes Yes No No 9 0.438056 No Control \n",
"3 No Yes Yes No No 11 2.661191 No MedDiet + Nuts \n",
"4 No Yes Yes No No 8 4.186174 No MedDiet + Nuts \n",
"... ... ... ... ... ... ... ... ... ... \n",
"6282 Yes No Yes No NaN 9 3.392197 No MedDiet + VOO \n",
"6283 Yes No Yes No No 5 5.886379 No MedDiet + Nuts \n",
"6284 No Yes Yes No No 8 5.631759 No MedDiet + VOO \n",
"6285 Yes No Yes No No 10 1.913758 No MedDiet + VOO \n",
"6286 Yes No Yes No No 11 5.360712 No MedDiet + Nuts \n",
"\n",
" City \n",
"0 Madrid \n",
"1 Madrid \n",
"2 Madrid \n",
"3 Madrid \n",
"4 Madrid \n",
"... ... \n",
"6282 Malaga \n",
"6283 Malaga \n",
"6284 Malaga \n",
"6285 Malaga \n",
"6286 Malaga \n",
"\n",
"[6287 rows x 18 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_with_info = df_with_info.merge(locations, on='location-id', how='right')\n",
"df_with_info"
]
},
{
"cell_type": "markdown",
"id": "44031178",
"metadata": {},
"source": [
"# 3. Remove drops from table\n",
"\n",
"Some patients drop from the study early on and they should be removed from our analysis. Their IDS are stored in file `dropped.csv`.\n",
"1. Load the list of patients who droped, from `dropped.csv`\n",
"2. Use an anti-join to remove them from the table\n",
"3. How many patients (rows) are left in the data?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "d1d4cc27",
"metadata": {},
"outputs": [],
"source": [
"dropped = pd.read_csv('dropped.csv')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "fbebbd97",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(42, 2)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dropped.shape"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "8a3c7943",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location-id</th>\n",
" <th>patient-id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>217</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1147</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1170</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>627</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>541</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location-id patient-id\n",
"0 1 217\n",
"1 1 1147\n",
"2 1 1170\n",
"3 1 627\n",
"4 4 541"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dropped.head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "573687e7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>patient-id</th>\n",
" <th>location-id</th>\n",
" <th>sex</th>\n",
" <th>age</th>\n",
" <th>smoke</th>\n",
" <th>bmi</th>\n",
" <th>waist</th>\n",
" <th>wth</th>\n",
" <th>htn</th>\n",
" <th>diab</th>\n",
" <th>hyperchol</th>\n",
" <th>famhist</th>\n",
" <th>hormo</th>\n",
" <th>p14</th>\n",
" <th>toevent</th>\n",
" <th>event</th>\n",
" <th>group</th>\n",
" <th>City</th>\n",
" <th>_merge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>77</td>\n",
" <td>Never</td>\n",
" <td>25.92</td>\n",
" <td>94</td>\n",
" <td>0.657343</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>5.538672</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Madrid</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>68</td>\n",
" <td>Never</td>\n",
" <td>34.85</td>\n",
" <td>150</td>\n",
" <td>0.949367</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>10</td>\n",
" <td>3.063655</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>66</td>\n",
" <td>Never</td>\n",
" <td>37.50</td>\n",
" <td>120</td>\n",
" <td>0.750000</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>5.590691</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>77</td>\n",
" <td>Never</td>\n",
" <td>29.26</td>\n",
" <td>93</td>\n",
" <td>0.628378</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>5.456537</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Madrid</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>60</td>\n",
" <td>Never</td>\n",
" <td>30.02</td>\n",
" <td>104</td>\n",
" <td>0.662420</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.746064</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" <td>Madrid</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6282</th>\n",
" <td>1253</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>79</td>\n",
" <td>Never</td>\n",
" <td>25.28</td>\n",
" <td>105</td>\n",
" <td>0.640244</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>5.828884</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Malaga</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6283</th>\n",
" <td>1254</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>62</td>\n",
" <td>Former</td>\n",
" <td>27.10</td>\n",
" <td>104</td>\n",
" <td>0.594286</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>5.067762</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6284</th>\n",
" <td>1255</td>\n",
" <td>5</td>\n",
" <td>Female</td>\n",
" <td>65</td>\n",
" <td>Never</td>\n",
" <td>35.02</td>\n",
" <td>103</td>\n",
" <td>0.686667</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>1.993155</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Malaga</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6285</th>\n",
" <td>1256</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>61</td>\n",
" <td>Never</td>\n",
" <td>28.42</td>\n",
" <td>94</td>\n",
" <td>0.576687</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.039699</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6286</th>\n",
" <td>1257</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>58</td>\n",
" <td>Former</td>\n",
" <td>24.43</td>\n",
" <td>93</td>\n",
" <td>0.547059</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.590007</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6287 rows × 19 columns</p>\n",
"</div>"
],
"text/plain": [
" patient-id location-id sex age smoke bmi waist wth \\\n",
"0 1 1 Female 77 Never 25.92 94 0.657343 \n",
"1 2 1 Female 68 Never 34.85 150 0.949367 \n",
"2 3 1 Female 66 Never 37.50 120 0.750000 \n",
"3 4 1 Female 77 Never 29.26 93 0.628378 \n",
"4 5 1 Female 60 Never 30.02 104 0.662420 \n",
"... ... ... ... ... ... ... ... ... \n",
"6282 1253 5 Male 79 Never 25.28 105 0.640244 \n",
"6283 1254 5 Male 62 Former 27.10 104 0.594286 \n",
"6284 1255 5 Female 65 Never 35.02 103 0.686667 \n",
"6285 1256 5 Male 61 Never 28.42 94 0.576687 \n",
"6286 1257 5 Male 58 Former 24.43 93 0.547059 \n",
"\n",
" htn diab hyperchol famhist hormo p14 toevent event group \\\n",
"0 Yes No Yes Yes No 9 5.538672 No MedDiet + VOO \n",
"1 Yes No Yes Yes NaN 10 3.063655 No MedDiet + Nuts \n",
"2 Yes Yes No No No 6 5.590691 No MedDiet + Nuts \n",
"3 Yes Yes No No No 6 5.456537 No MedDiet + VOO \n",
"4 Yes No Yes No No 9 2.746064 No Control \n",
"... ... ... ... ... ... ... ... ... ... \n",
"6282 Yes No Yes No No 8 5.828884 No MedDiet + VOO \n",
"6283 Yes No Yes Yes No 9 5.067762 No MedDiet + Nuts \n",
"6284 Yes No Yes No No 10 1.993155 No MedDiet + VOO \n",
"6285 Yes Yes No No No 9 2.039699 No MedDiet + Nuts \n",
"6286 Yes Yes Yes No No 9 2.590007 No MedDiet + Nuts \n",
"\n",
" City _merge \n",
"0 Madrid left_only \n",
"1 Madrid left_only \n",
"2 Madrid left_only \n",
"3 Madrid left_only \n",
"4 Madrid left_only \n",
"... ... ... \n",
"6282 Malaga left_only \n",
"6283 Malaga left_only \n",
"6284 Malaga left_only \n",
"6285 Malaga left_only \n",
"6286 Malaga left_only \n",
"\n",
"[6287 rows x 19 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp = df_with_info.merge(dropped, on=['location-id', 'patient-id'], how='outer', indicator=True)\n",
"temp"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "a4a6574b",
"metadata": {},
"outputs": [],
"source": [
"df_without_dropped = temp[temp['_merge'] == 'left_only'].drop('_merge', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "8fd89a40",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(6245, 18)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_without_dropped.shape"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "07f4776a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>patient-id</th>\n",
" <th>location-id</th>\n",
" <th>sex</th>\n",
" <th>age</th>\n",
" <th>smoke</th>\n",
" <th>bmi</th>\n",
" <th>waist</th>\n",
" <th>wth</th>\n",
" <th>htn</th>\n",
" <th>diab</th>\n",
" <th>hyperchol</th>\n",
" <th>famhist</th>\n",
" <th>hormo</th>\n",
" <th>p14</th>\n",
" <th>toevent</th>\n",
" <th>event</th>\n",
" <th>group</th>\n",
" <th>City</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>77</td>\n",
" <td>Never</td>\n",
" <td>25.92</td>\n",
" <td>94</td>\n",
" <td>0.657343</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>5.538672</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>68</td>\n",
" <td>Never</td>\n",
" <td>34.85</td>\n",
" <td>150</td>\n",
" <td>0.949367</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>10</td>\n",
" <td>3.063655</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>66</td>\n",
" <td>Never</td>\n",
" <td>37.50</td>\n",
" <td>120</td>\n",
" <td>0.750000</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>5.590691</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>77</td>\n",
" <td>Never</td>\n",
" <td>29.26</td>\n",
" <td>93</td>\n",
" <td>0.628378</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>5.456537</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>60</td>\n",
" <td>Never</td>\n",
" <td>30.02</td>\n",
" <td>104</td>\n",
" <td>0.662420</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.746064</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" patient-id location-id sex age smoke bmi waist wth htn \\\n",
"0 1 1 Female 77 Never 25.92 94 0.657343 Yes \n",
"1 2 1 Female 68 Never 34.85 150 0.949367 Yes \n",
"2 3 1 Female 66 Never 37.50 120 0.750000 Yes \n",
"3 4 1 Female 77 Never 29.26 93 0.628378 Yes \n",
"4 5 1 Female 60 Never 30.02 104 0.662420 Yes \n",
"\n",
" diab hyperchol famhist hormo p14 toevent event group City \n",
"0 No Yes Yes No 9 5.538672 No MedDiet + VOO Madrid \n",
"1 No Yes Yes NaN 10 3.063655 No MedDiet + Nuts Madrid \n",
"2 Yes No No No 6 5.590691 No MedDiet + Nuts Madrid \n",
"3 Yes No No No 6 5.456537 No MedDiet + VOO Madrid \n",
"4 No Yes No No 9 2.746064 No Control Madrid "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_without_dropped.head()"
]
},
{
"cell_type": "markdown",
"id": "84270332",
"metadata": {},
"source": [
"# 4. Save final result in `processed_data_predimed.csv`\n",
"\n",
"1. Using the `.to_csv` method of Pandas DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "85902eea",
"metadata": {},
"outputs": [],
"source": [
"df_without_dropped.to_csv('processed_data_predimed.csv', index=None)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7bcff45",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}