{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f11a76bf",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Exercise on Joins and anti-joins: add information from other tables"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"id": "b6f2742b",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import pandas as pd"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "2967c84e",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Load data from clinical trial\n",
|
||
"\n",
|
||
"Data comes in two different files. The file `predimed_records.csv` file contains the clinical data for each patient, except which diet group they were assigned. The file `predimed_mapping.csv` contain the information of which patient was assigned to which diet group. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"id": "ed626ee3",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>patient-id</th>\n",
|
||
" <th>location-id</th>\n",
|
||
" <th>sex</th>\n",
|
||
" <th>age</th>\n",
|
||
" <th>smoke</th>\n",
|
||
" <th>bmi</th>\n",
|
||
" <th>waist</th>\n",
|
||
" <th>wth</th>\n",
|
||
" <th>htn</th>\n",
|
||
" <th>diab</th>\n",
|
||
" <th>hyperchol</th>\n",
|
||
" <th>famhist</th>\n",
|
||
" <th>hormo</th>\n",
|
||
" <th>p14</th>\n",
|
||
" <th>toevent</th>\n",
|
||
" <th>event</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>436</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>58</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>33.53</td>\n",
|
||
" <td>122</td>\n",
|
||
" <td>0.753086</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>5.374401</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>1130</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>77</td>\n",
|
||
" <td>Current</td>\n",
|
||
" <td>31.05</td>\n",
|
||
" <td>119</td>\n",
|
||
" <td>0.730061</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>6.097194</td>\n",
|
||
" <td>No</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>1131</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>72</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>30.86</td>\n",
|
||
" <td>106</td>\n",
|
||
" <td>0.654321</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>5.946612</td>\n",
|
||
" <td>No</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>1132</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>71</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>27.68</td>\n",
|
||
" <td>118</td>\n",
|
||
" <td>0.694118</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>2.907598</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>1111</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>79</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>35.94</td>\n",
|
||
" <td>129</td>\n",
|
||
" <td>0.806250</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>4.761123</td>\n",
|
||
" <td>No</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" patient-id location-id sex age smoke bmi waist wth htn \\\n",
|
||
"0 436 4 Male 58 Former 33.53 122 0.753086 No \n",
|
||
"1 1130 4 Male 77 Current 31.05 119 0.730061 Yes \n",
|
||
"2 1131 4 Female 72 Former 30.86 106 0.654321 No \n",
|
||
"3 1132 4 Male 71 Former 27.68 118 0.694118 Yes \n",
|
||
"4 1111 2 Female 79 Never 35.94 129 0.806250 Yes \n",
|
||
"\n",
|
||
" diab hyperchol famhist hormo p14 toevent event \n",
|
||
"0 No Yes No No 10 5.374401 Yes \n",
|
||
"1 Yes No No No 10 6.097194 No \n",
|
||
"2 Yes No Yes No 8 5.946612 No \n",
|
||
"3 No Yes No No 8 2.907598 Yes \n",
|
||
"4 No Yes No No 9 4.761123 No "
|
||
]
|
||
},
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df = pd.read_csv('../../data/predimed_records.csv')\n",
|
||
"df.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"id": "48d5375f",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>location-id</th>\n",
|
||
" <th>patient-id</th>\n",
|
||
" <th>group</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>2</td>\n",
|
||
" <td>885</td>\n",
|
||
" <td>MedDiet + VOO</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>182</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>971</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>2</td>\n",
|
||
" <td>691</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>2</td>\n",
|
||
" <td>632</td>\n",
|
||
" <td>Control</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" location-id patient-id group\n",
|
||
"0 2 885 MedDiet + VOO\n",
|
||
"1 1 182 MedDiet + Nuts\n",
|
||
"2 1 971 MedDiet + Nuts\n",
|
||
"3 2 691 MedDiet + Nuts\n",
|
||
"4 2 632 Control"
|
||
]
|
||
},
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"info = pd.read_csv('../../data/predimed_mapping.csv')\n",
|
||
"info.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "2b4b98ed-d7ec-4b7c-b983-adc616d2f16f",
|
||
"metadata": {},
|
||
"source": [
|
||
"There were 5 different locations where the study was conducted, each one gave an identification number `patient-id` to each participant."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"id": "b9dbc492-1489-4530-96ac-5f33f7389caa",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array([2, 1, 3, 4, 5])"
|
||
]
|
||
},
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"info['location-id'].unique()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "2fef4d37",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 1. Add diet information to the patients' records\n",
|
||
"\n",
|
||
"* For how many patients do we have clinical information? (i.e., rows in `df`)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"id": "861ac334-14ce-490a-b3c4-877b32789f3e",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"(6324, 16)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"## your code here\n",
|
||
"\n",
|
||
"print(df.shape)\n",
|
||
"\n",
|
||
"identifier = ['location-id', 'patient-id']\n",
|
||
"\n",
|
||
"\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1c1701e2-c295-4032-9e89-0d8470f41593",
|
||
"metadata": {},
|
||
"source": [
|
||
"* For how many patients do we have diet information? (i.e., rows in `info`)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"id": "14f57842-5722-4953-88d6-d7cf3070400c",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(6287, 3)"
|
||
]
|
||
},
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"## your code here\n",
|
||
"\n",
|
||
"info.shape\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3f23fa17-af3e-41c3-883f-3e1279d4820e",
|
||
"metadata": {},
|
||
"source": [
|
||
"Perform the merge, keeping in mind that it only make sense to analyze patients with the diet information. \n",
|
||
"* Which type of merge would you do? \n",
|
||
"* For how many patients do we have full information (records and which diet they followed? "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"id": "2dfaf171-2b81-4c7f-9101-c689aa56494d",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"\u001b[0;31mSignature:\u001b[0m\n",
|
||
"\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmerge\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0mleft\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'DataFrame | Series'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0mright\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'DataFrame | Series'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0mhow\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'MergeHow'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'inner'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0mon\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'IndexLabel | AnyArrayLike | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0mleft_on\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'IndexLabel | AnyArrayLike | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0mright_on\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'IndexLabel | AnyArrayLike | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0mleft_index\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0mright_index\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0msort\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0msuffixes\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'Suffixes'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m'_x'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'_y'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0mindicator\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m \u001b[0mvalidate\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
|
||
"\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;34m'DataFrame'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
|
||
"\u001b[0;31mDocstring:\u001b[0m\n",
|
||
"Merge DataFrame or named Series objects with a database-style join.\n",
|
||
"\n",
|
||
"A named Series object is treated as a DataFrame with a single named column.\n",
|
||
"\n",
|
||
"The join is done on columns or indexes. If joining columns on\n",
|
||
"columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes\n",
|
||
"on indexes or indexes on a column or columns, the index will be passed on.\n",
|
||
"When performing a cross merge, no column specifications to merge on are\n",
|
||
"allowed.\n",
|
||
"\n",
|
||
".. warning::\n",
|
||
"\n",
|
||
" If both key columns contain rows where the key is a null value, those\n",
|
||
" rows will be matched against each other. This is different from usual SQL\n",
|
||
" join behaviour and can lead to unexpected results.\n",
|
||
"\n",
|
||
"Parameters\n",
|
||
"----------\n",
|
||
"left : DataFrame or named Series\n",
|
||
"right : DataFrame or named Series\n",
|
||
" Object to merge with.\n",
|
||
"how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'\n",
|
||
" Type of merge to be performed.\n",
|
||
"\n",
|
||
" * left: use only keys from left frame, similar to a SQL left outer join;\n",
|
||
" preserve key order.\n",
|
||
" * right: use only keys from right frame, similar to a SQL right outer join;\n",
|
||
" preserve key order.\n",
|
||
" * outer: use union of keys from both frames, similar to a SQL full outer\n",
|
||
" join; sort keys lexicographically.\n",
|
||
" * inner: use intersection of keys from both frames, similar to a SQL inner\n",
|
||
" join; preserve the order of the left keys.\n",
|
||
" * cross: creates the cartesian product from both frames, preserves the order\n",
|
||
" of the left keys.\n",
|
||
"on : label or list\n",
|
||
" Column or index level names to join on. These must be found in both\n",
|
||
" DataFrames. If `on` is None and not merging on indexes then this defaults\n",
|
||
" to the intersection of the columns in both DataFrames.\n",
|
||
"left_on : label or list, or array-like\n",
|
||
" Column or index level names to join on in the left DataFrame. Can also\n",
|
||
" be an array or list of arrays of the length of the left DataFrame.\n",
|
||
" These arrays are treated as if they are columns.\n",
|
||
"right_on : label or list, or array-like\n",
|
||
" Column or index level names to join on in the right DataFrame. Can also\n",
|
||
" be an array or list of arrays of the length of the right DataFrame.\n",
|
||
" These arrays are treated as if they are columns.\n",
|
||
"left_index : bool, default False\n",
|
||
" Use the index from the left DataFrame as the join key(s). If it is a\n",
|
||
" MultiIndex, the number of keys in the other DataFrame (either the index\n",
|
||
" or a number of columns) must match the number of levels.\n",
|
||
"right_index : bool, default False\n",
|
||
" Use the index from the right DataFrame as the join key. Same caveats as\n",
|
||
" left_index.\n",
|
||
"sort : bool, default False\n",
|
||
" Sort the join keys lexicographically in the result DataFrame. If False,\n",
|
||
" the order of the join keys depends on the join type (how keyword).\n",
|
||
"suffixes : list-like, default is (\"_x\", \"_y\")\n",
|
||
" A length-2 sequence where each element is optionally a string\n",
|
||
" indicating the suffix to add to overlapping column names in\n",
|
||
" `left` and `right` respectively. Pass a value of `None` instead\n",
|
||
" of a string to indicate that the column name from `left` or\n",
|
||
" `right` should be left as-is, with no suffix. At least one of the\n",
|
||
" values must not be None.\n",
|
||
"copy : bool, default True\n",
|
||
" If False, avoid copy if possible.\n",
|
||
"\n",
|
||
" .. note::\n",
|
||
" The `copy` keyword will change behavior in pandas 3.0.\n",
|
||
" `Copy-on-Write\n",
|
||
" <https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html>`__\n",
|
||
" will be enabled by default, which means that all methods with a\n",
|
||
" `copy` keyword will use a lazy copy mechanism to defer the copy and\n",
|
||
" ignore the `copy` keyword. The `copy` keyword will be removed in a\n",
|
||
" future version of pandas.\n",
|
||
"\n",
|
||
" You can already get the future behavior and improvements through\n",
|
||
" enabling copy on write ``pd.options.mode.copy_on_write = True``\n",
|
||
"indicator : bool or str, default False\n",
|
||
" If True, adds a column to the output DataFrame called \"_merge\" with\n",
|
||
" information on the source of each row. The column can be given a different\n",
|
||
" name by providing a string argument. The column will have a Categorical\n",
|
||
" type with the value of \"left_only\" for observations whose merge key only\n",
|
||
" appears in the left DataFrame, \"right_only\" for observations\n",
|
||
" whose merge key only appears in the right DataFrame, and \"both\"\n",
|
||
" if the observation's merge key is found in both DataFrames.\n",
|
||
"\n",
|
||
"validate : str, optional\n",
|
||
" If specified, checks if merge is of specified type.\n",
|
||
"\n",
|
||
" * \"one_to_one\" or \"1:1\": check if merge keys are unique in both\n",
|
||
" left and right datasets.\n",
|
||
" * \"one_to_many\" or \"1:m\": check if merge keys are unique in left\n",
|
||
" dataset.\n",
|
||
" * \"many_to_one\" or \"m:1\": check if merge keys are unique in right\n",
|
||
" dataset.\n",
|
||
" * \"many_to_many\" or \"m:m\": allowed, but does not result in checks.\n",
|
||
"\n",
|
||
"Returns\n",
|
||
"-------\n",
|
||
"DataFrame\n",
|
||
" A DataFrame of the two merged objects.\n",
|
||
"\n",
|
||
"See Also\n",
|
||
"--------\n",
|
||
"merge_ordered : Merge with optional filling/interpolation.\n",
|
||
"merge_asof : Merge on nearest keys.\n",
|
||
"DataFrame.join : Similar method using indices.\n",
|
||
"\n",
|
||
"Examples\n",
|
||
"--------\n",
|
||
">>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],\n",
|
||
"... 'value': [1, 2, 3, 5]})\n",
|
||
">>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],\n",
|
||
"... 'value': [5, 6, 7, 8]})\n",
|
||
">>> df1\n",
|
||
" lkey value\n",
|
||
"0 foo 1\n",
|
||
"1 bar 2\n",
|
||
"2 baz 3\n",
|
||
"3 foo 5\n",
|
||
">>> df2\n",
|
||
" rkey value\n",
|
||
"0 foo 5\n",
|
||
"1 bar 6\n",
|
||
"2 baz 7\n",
|
||
"3 foo 8\n",
|
||
"\n",
|
||
"Merge df1 and df2 on the lkey and rkey columns. The value columns have\n",
|
||
"the default suffixes, _x and _y, appended.\n",
|
||
"\n",
|
||
">>> df1.merge(df2, left_on='lkey', right_on='rkey')\n",
|
||
" lkey value_x rkey value_y\n",
|
||
"0 foo 1 foo 5\n",
|
||
"1 foo 1 foo 8\n",
|
||
"2 bar 2 bar 6\n",
|
||
"3 baz 3 baz 7\n",
|
||
"4 foo 5 foo 5\n",
|
||
"5 foo 5 foo 8\n",
|
||
"\n",
|
||
"Merge DataFrames df1 and df2 with specified left and right suffixes\n",
|
||
"appended to any overlapping columns.\n",
|
||
"\n",
|
||
">>> df1.merge(df2, left_on='lkey', right_on='rkey',\n",
|
||
"... suffixes=('_left', '_right'))\n",
|
||
" lkey value_left rkey value_right\n",
|
||
"0 foo 1 foo 5\n",
|
||
"1 foo 1 foo 8\n",
|
||
"2 bar 2 bar 6\n",
|
||
"3 baz 3 baz 7\n",
|
||
"4 foo 5 foo 5\n",
|
||
"5 foo 5 foo 8\n",
|
||
"\n",
|
||
"Merge DataFrames df1 and df2, but raise an exception if the DataFrames have\n",
|
||
"any overlapping columns.\n",
|
||
"\n",
|
||
">>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))\n",
|
||
"Traceback (most recent call last):\n",
|
||
"...\n",
|
||
"ValueError: columns overlap but no suffix specified:\n",
|
||
" Index(['value'], dtype='object')\n",
|
||
"\n",
|
||
">>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})\n",
|
||
">>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})\n",
|
||
">>> df1\n",
|
||
" a b\n",
|
||
"0 foo 1\n",
|
||
"1 bar 2\n",
|
||
">>> df2\n",
|
||
" a c\n",
|
||
"0 foo 3\n",
|
||
"1 baz 4\n",
|
||
"\n",
|
||
">>> df1.merge(df2, how='inner', on='a')\n",
|
||
" a b c\n",
|
||
"0 foo 1 3\n",
|
||
"\n",
|
||
">>> df1.merge(df2, how='left', on='a')\n",
|
||
" a b c\n",
|
||
"0 foo 1 3.0\n",
|
||
"1 bar 2 NaN\n",
|
||
"\n",
|
||
">>> df1 = pd.DataFrame({'left': ['foo', 'bar']})\n",
|
||
">>> df2 = pd.DataFrame({'right': [7, 8]})\n",
|
||
">>> df1\n",
|
||
" left\n",
|
||
"0 foo\n",
|
||
"1 bar\n",
|
||
">>> df2\n",
|
||
" right\n",
|
||
"0 7\n",
|
||
"1 8\n",
|
||
"\n",
|
||
">>> df1.merge(df2, how='cross')\n",
|
||
" left right\n",
|
||
"0 foo 7\n",
|
||
"1 foo 8\n",
|
||
"2 bar 7\n",
|
||
"3 bar 8\n",
|
||
"\u001b[0;31mFile:\u001b[0m /usr/lib64/python3.13/site-packages/pandas/core/reshape/merge.py\n",
|
||
"\u001b[0;31mType:\u001b[0m function"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"pd.merge?\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"id": "35e19a53",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>patient-id</th>\n",
|
||
" <th>location-id</th>\n",
|
||
" <th>sex</th>\n",
|
||
" <th>age</th>\n",
|
||
" <th>smoke</th>\n",
|
||
" <th>bmi</th>\n",
|
||
" <th>waist</th>\n",
|
||
" <th>wth</th>\n",
|
||
" <th>htn</th>\n",
|
||
" <th>diab</th>\n",
|
||
" <th>hyperchol</th>\n",
|
||
" <th>famhist</th>\n",
|
||
" <th>hormo</th>\n",
|
||
" <th>p14</th>\n",
|
||
" <th>toevent</th>\n",
|
||
" <th>event</th>\n",
|
||
" <th>group</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>436</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>58</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>33.53</td>\n",
|
||
" <td>122</td>\n",
|
||
" <td>0.753086</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>5.374401</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Control</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>1130</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>77</td>\n",
|
||
" <td>Current</td>\n",
|
||
" <td>31.05</td>\n",
|
||
" <td>119</td>\n",
|
||
" <td>0.730061</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>6.097194</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Control</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>1131</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>72</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>30.86</td>\n",
|
||
" <td>106</td>\n",
|
||
" <td>0.654321</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>5.946612</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + VOO</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>1132</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>71</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>27.68</td>\n",
|
||
" <td>118</td>\n",
|
||
" <td>0.694118</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>2.907598</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>1111</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>79</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>35.94</td>\n",
|
||
" <td>129</td>\n",
|
||
" <td>0.806250</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>4.761123</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + VOO</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6319</th>\n",
|
||
" <td>120</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>66</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>28.51</td>\n",
|
||
" <td>104</td>\n",
|
||
" <td>0.645963</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>3.550992</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Control</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6320</th>\n",
|
||
" <td>118</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>80</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>23.81</td>\n",
|
||
" <td>109</td>\n",
|
||
" <td>0.589189</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>2.743326</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Control</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6321</th>\n",
|
||
" <td>351</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>57</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>25.24</td>\n",
|
||
" <td>100</td>\n",
|
||
" <td>0.571429</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>7</td>\n",
|
||
" <td>0.479124</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6322</th>\n",
|
||
" <td>499</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>71</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>32.04</td>\n",
|
||
" <td>98</td>\n",
|
||
" <td>0.653333</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>2.587269</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + VOO</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6323</th>\n",
|
||
" <td>1257</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>58</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>24.43</td>\n",
|
||
" <td>93</td>\n",
|
||
" <td>0.547059</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>2.590007</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>6324 rows × 17 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" patient-id location-id sex age smoke bmi waist wth \\\n",
|
||
"0 436 4 Male 58 Former 33.53 122 0.753086 \n",
|
||
"1 1130 4 Male 77 Current 31.05 119 0.730061 \n",
|
||
"2 1131 4 Female 72 Former 30.86 106 0.654321 \n",
|
||
"3 1132 4 Male 71 Former 27.68 118 0.694118 \n",
|
||
"4 1111 2 Female 79 Never 35.94 129 0.806250 \n",
|
||
"... ... ... ... ... ... ... ... ... \n",
|
||
"6319 120 5 Female 66 Never 28.51 104 0.645963 \n",
|
||
"6320 118 5 Male 80 Never 23.81 109 0.589189 \n",
|
||
"6321 351 3 Male 57 Former 25.24 100 0.571429 \n",
|
||
"6322 499 5 Female 71 Never 32.04 98 0.653333 \n",
|
||
"6323 1257 5 Male 58 Former 24.43 93 0.547059 \n",
|
||
"\n",
|
||
" htn diab hyperchol famhist hormo p14 toevent event group \n",
|
||
"0 No No Yes No No 10 5.374401 Yes Control \n",
|
||
"1 Yes Yes No No No 10 6.097194 No Control \n",
|
||
"2 No Yes No Yes No 8 5.946612 No MedDiet + VOO \n",
|
||
"3 Yes No Yes No No 8 2.907598 Yes MedDiet + Nuts \n",
|
||
"4 Yes No Yes No No 9 4.761123 No MedDiet + VOO \n",
|
||
"... ... ... ... ... ... ... ... ... ... \n",
|
||
"6319 Yes No Yes Yes No 8 3.550992 No Control \n",
|
||
"6320 Yes Yes Yes Yes No 8 2.743326 No Control \n",
|
||
"6321 Yes No Yes No NaN 7 0.479124 No MedDiet + Nuts \n",
|
||
"6322 Yes No Yes Yes No 6 2.587269 No MedDiet + VOO \n",
|
||
"6323 Yes Yes Yes No No 9 2.590007 No MedDiet + Nuts \n",
|
||
"\n",
|
||
"[6324 rows x 17 columns]"
|
||
]
|
||
},
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"## your code here\n",
|
||
"\n",
|
||
"\n",
|
||
"df_merged = df.merge(info, how = \"left\", on = identifier)\n",
|
||
"df_merged"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "946beb08-30a5-4020-8612-360385cdfc1e",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 2. Add location information to the patients' records\n",
|
||
"\n",
|
||
"There were five locations where the study was conducted. Here is a DataFrame containing the information of each location. \n",
|
||
"\n",
|
||
"- Add a new column to the dataset that contains the city where each patient was recorded.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"id": "36ce0688-d421-4a07-b00e-0e9b3201f0e0",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>location-id</th>\n",
|
||
" <th>City</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Madrid</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2</td>\n",
|
||
" <td>Valencia</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Barcelona</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Bilbao</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Malaga</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" location-id City\n",
|
||
"0 1 Madrid\n",
|
||
"1 2 Valencia\n",
|
||
"2 3 Barcelona\n",
|
||
"3 4 Bilbao\n",
|
||
"4 5 Malaga"
|
||
]
|
||
},
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5], \n",
|
||
" 'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})\n",
|
||
"locations"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"id": "b636dde4-129a-4dd1-8cbf-c539c9c8a5f2",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>patient-id</th>\n",
|
||
" <th>location-id</th>\n",
|
||
" <th>sex</th>\n",
|
||
" <th>age</th>\n",
|
||
" <th>smoke</th>\n",
|
||
" <th>bmi</th>\n",
|
||
" <th>waist</th>\n",
|
||
" <th>wth</th>\n",
|
||
" <th>htn</th>\n",
|
||
" <th>diab</th>\n",
|
||
" <th>hyperchol</th>\n",
|
||
" <th>famhist</th>\n",
|
||
" <th>hormo</th>\n",
|
||
" <th>p14</th>\n",
|
||
" <th>toevent</th>\n",
|
||
" <th>event</th>\n",
|
||
" <th>group</th>\n",
|
||
" <th>City</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>436</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>58</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>33.53</td>\n",
|
||
" <td>122</td>\n",
|
||
" <td>0.753086</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>5.374401</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Control</td>\n",
|
||
" <td>Bilbao</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>1130</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>77</td>\n",
|
||
" <td>Current</td>\n",
|
||
" <td>31.05</td>\n",
|
||
" <td>119</td>\n",
|
||
" <td>0.730061</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>6.097194</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Control</td>\n",
|
||
" <td>Bilbao</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>1131</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>72</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>30.86</td>\n",
|
||
" <td>106</td>\n",
|
||
" <td>0.654321</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>5.946612</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + VOO</td>\n",
|
||
" <td>Bilbao</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>1132</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>71</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>27.68</td>\n",
|
||
" <td>118</td>\n",
|
||
" <td>0.694118</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>2.907598</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" <td>Bilbao</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>1111</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>79</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>35.94</td>\n",
|
||
" <td>129</td>\n",
|
||
" <td>0.806250</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>4.761123</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + VOO</td>\n",
|
||
" <td>Valencia</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6319</th>\n",
|
||
" <td>120</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>66</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>28.51</td>\n",
|
||
" <td>104</td>\n",
|
||
" <td>0.645963</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>3.550992</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Control</td>\n",
|
||
" <td>Malaga</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6320</th>\n",
|
||
" <td>118</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>80</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>23.81</td>\n",
|
||
" <td>109</td>\n",
|
||
" <td>0.589189</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>2.743326</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Control</td>\n",
|
||
" <td>Malaga</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6321</th>\n",
|
||
" <td>351</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>57</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>25.24</td>\n",
|
||
" <td>100</td>\n",
|
||
" <td>0.571429</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>7</td>\n",
|
||
" <td>0.479124</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" <td>Barcelona</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6322</th>\n",
|
||
" <td>499</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>71</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>32.04</td>\n",
|
||
" <td>98</td>\n",
|
||
" <td>0.653333</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>2.587269</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + VOO</td>\n",
|
||
" <td>Malaga</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6323</th>\n",
|
||
" <td>1257</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>58</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>24.43</td>\n",
|
||
" <td>93</td>\n",
|
||
" <td>0.547059</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>2.590007</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" <td>Malaga</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>6324 rows × 18 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" patient-id location-id sex age smoke bmi waist wth \\\n",
|
||
"0 436 4 Male 58 Former 33.53 122 0.753086 \n",
|
||
"1 1130 4 Male 77 Current 31.05 119 0.730061 \n",
|
||
"2 1131 4 Female 72 Former 30.86 106 0.654321 \n",
|
||
"3 1132 4 Male 71 Former 27.68 118 0.694118 \n",
|
||
"4 1111 2 Female 79 Never 35.94 129 0.806250 \n",
|
||
"... ... ... ... ... ... ... ... ... \n",
|
||
"6319 120 5 Female 66 Never 28.51 104 0.645963 \n",
|
||
"6320 118 5 Male 80 Never 23.81 109 0.589189 \n",
|
||
"6321 351 3 Male 57 Former 25.24 100 0.571429 \n",
|
||
"6322 499 5 Female 71 Never 32.04 98 0.653333 \n",
|
||
"6323 1257 5 Male 58 Former 24.43 93 0.547059 \n",
|
||
"\n",
|
||
" htn diab hyperchol famhist hormo p14 toevent event group \\\n",
|
||
"0 No No Yes No No 10 5.374401 Yes Control \n",
|
||
"1 Yes Yes No No No 10 6.097194 No Control \n",
|
||
"2 No Yes No Yes No 8 5.946612 No MedDiet + VOO \n",
|
||
"3 Yes No Yes No No 8 2.907598 Yes MedDiet + Nuts \n",
|
||
"4 Yes No Yes No No 9 4.761123 No MedDiet + VOO \n",
|
||
"... ... ... ... ... ... ... ... ... ... \n",
|
||
"6319 Yes No Yes Yes No 8 3.550992 No Control \n",
|
||
"6320 Yes Yes Yes Yes No 8 2.743326 No Control \n",
|
||
"6321 Yes No Yes No NaN 7 0.479124 No MedDiet + Nuts \n",
|
||
"6322 Yes No Yes Yes No 6 2.587269 No MedDiet + VOO \n",
|
||
"6323 Yes Yes Yes No No 9 2.590007 No MedDiet + Nuts \n",
|
||
"\n",
|
||
" City \n",
|
||
"0 Bilbao \n",
|
||
"1 Bilbao \n",
|
||
"2 Bilbao \n",
|
||
"3 Bilbao \n",
|
||
"4 Valencia \n",
|
||
"... ... \n",
|
||
"6319 Malaga \n",
|
||
"6320 Malaga \n",
|
||
"6321 Barcelona \n",
|
||
"6322 Malaga \n",
|
||
"6323 Malaga \n",
|
||
"\n",
|
||
"[6324 rows x 18 columns]"
|
||
]
|
||
},
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"## your code here:\n",
|
||
"\n",
|
||
"df_location = df_merged.merge(locations, on=\"location-id\", how=\"left\")\n",
|
||
"df_location\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "44031178",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 3. Remove dropouts from the table\n",
|
||
"\n",
|
||
"Some patients dropped out of the study early on and should be removed from our analysis. Their IDs are stored in the file `dropped.csv`.\n",
|
||
"1. Load the list of patients who dropped out from `dropped.csv`\n",
|
||
"2. Use an anti-join to remove them from the table\n",
|
||
"3. How many patients (rows) are left in the data?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 17,
|
||
"id": "d1d4cc27",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"dropped = pd.read_csv('dropped.csv')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 18,
|
||
"id": "fbebbd97",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(42, 2)"
|
||
]
|
||
},
|
||
"execution_count": 18,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"dropped.shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"id": "8a3c7943",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>location-id</th>\n",
|
||
" <th>patient-id</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>217</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>1147</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>1170</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>627</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>4</td>\n",
|
||
" <td>541</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" location-id patient-id\n",
|
||
"0 1 217\n",
|
||
"1 1 1147\n",
|
||
"2 1 1170\n",
|
||
"3 1 627\n",
|
||
"4 4 541"
|
||
]
|
||
},
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"dropped.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 31,
|
||
"id": "573687e7",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>patient-id</th>\n",
|
||
" <th>location-id</th>\n",
|
||
" <th>sex</th>\n",
|
||
" <th>age</th>\n",
|
||
" <th>smoke</th>\n",
|
||
" <th>bmi</th>\n",
|
||
" <th>waist</th>\n",
|
||
" <th>wth</th>\n",
|
||
" <th>htn</th>\n",
|
||
" <th>diab</th>\n",
|
||
" <th>hyperchol</th>\n",
|
||
" <th>famhist</th>\n",
|
||
" <th>hormo</th>\n",
|
||
" <th>p14</th>\n",
|
||
" <th>toevent</th>\n",
|
||
" <th>event</th>\n",
|
||
" <th>group</th>\n",
|
||
" <th>City</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>77</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>25.92</td>\n",
|
||
" <td>94</td>\n",
|
||
" <td>0.657343</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>5.538672</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + VOO</td>\n",
|
||
" <td>Madrid</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>68</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>34.85</td>\n",
|
||
" <td>150</td>\n",
|
||
" <td>0.949367</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>3.063655</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" <td>Madrid</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>3</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>66</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>37.50</td>\n",
|
||
" <td>120</td>\n",
|
||
" <td>0.750000</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>5.590691</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" <td>Madrid</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>4</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>77</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>29.26</td>\n",
|
||
" <td>93</td>\n",
|
||
" <td>0.628378</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>5.456537</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + VOO</td>\n",
|
||
" <td>Madrid</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>5</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>60</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>30.02</td>\n",
|
||
" <td>104</td>\n",
|
||
" <td>0.662420</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>2.746064</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Control</td>\n",
|
||
" <td>Madrid</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6319</th>\n",
|
||
" <td>1253</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>79</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>25.28</td>\n",
|
||
" <td>105</td>\n",
|
||
" <td>0.640244</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>5.828884</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + VOO</td>\n",
|
||
" <td>Malaga</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6320</th>\n",
|
||
" <td>1254</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>62</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>27.10</td>\n",
|
||
" <td>104</td>\n",
|
||
" <td>0.594286</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>5.067762</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" <td>Malaga</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6321</th>\n",
|
||
" <td>1255</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Female</td>\n",
|
||
" <td>65</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>35.02</td>\n",
|
||
" <td>103</td>\n",
|
||
" <td>0.686667</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>1.993155</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + VOO</td>\n",
|
||
" <td>Malaga</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6322</th>\n",
|
||
" <td>1256</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>61</td>\n",
|
||
" <td>Never</td>\n",
|
||
" <td>28.42</td>\n",
|
||
" <td>94</td>\n",
|
||
" <td>0.576687</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>2.039699</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" <td>Malaga</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6323</th>\n",
|
||
" <td>1257</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Male</td>\n",
|
||
" <td>58</td>\n",
|
||
" <td>Former</td>\n",
|
||
" <td>24.43</td>\n",
|
||
" <td>93</td>\n",
|
||
" <td>0.547059</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>2.590007</td>\n",
|
||
" <td>No</td>\n",
|
||
" <td>MedDiet + Nuts</td>\n",
|
||
" <td>Malaga</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>6282 rows × 18 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" patient-id location-id sex age smoke bmi waist wth \\\n",
|
||
"0 1 1 Female 77 Never 25.92 94 0.657343 \n",
|
||
"1 2 1 Female 68 Never 34.85 150 0.949367 \n",
|
||
"2 3 1 Female 66 Never 37.50 120 0.750000 \n",
|
||
"3 4 1 Female 77 Never 29.26 93 0.628378 \n",
|
||
"4 5 1 Female 60 Never 30.02 104 0.662420 \n",
|
||
"... ... ... ... ... ... ... ... ... \n",
|
||
"6319 1253 5 Male 79 Never 25.28 105 0.640244 \n",
|
||
"6320 1254 5 Male 62 Former 27.10 104 0.594286 \n",
|
||
"6321 1255 5 Female 65 Never 35.02 103 0.686667 \n",
|
||
"6322 1256 5 Male 61 Never 28.42 94 0.576687 \n",
|
||
"6323 1257 5 Male 58 Former 24.43 93 0.547059 \n",
|
||
"\n",
|
||
" htn diab hyperchol famhist hormo p14 toevent event group \\\n",
|
||
"0 Yes No Yes Yes No 9 5.538672 No MedDiet + VOO \n",
|
||
"1 Yes No Yes Yes NaN 10 3.063655 No MedDiet + Nuts \n",
|
||
"2 Yes Yes No No No 6 5.590691 No MedDiet + Nuts \n",
|
||
"3 Yes Yes No No No 6 5.456537 No MedDiet + VOO \n",
|
||
"4 Yes No Yes No No 9 2.746064 No Control \n",
|
||
"... ... ... ... ... ... ... ... ... ... \n",
|
||
"6319 Yes No Yes No No 8 5.828884 No MedDiet + VOO \n",
|
||
"6320 Yes No Yes Yes No 9 5.067762 No MedDiet + Nuts \n",
|
||
"6321 Yes No Yes No No 10 1.993155 No MedDiet + VOO \n",
|
||
"6322 Yes Yes No No No 9 2.039699 No MedDiet + Nuts \n",
|
||
"6323 Yes Yes Yes No No 9 2.590007 No MedDiet + Nuts \n",
|
||
"\n",
|
||
" City \n",
|
||
"0 Madrid \n",
|
||
"1 Madrid \n",
|
||
"2 Madrid \n",
|
||
"3 Madrid \n",
|
||
"4 Madrid \n",
|
||
"... ... \n",
|
||
"6319 Malaga \n",
|
||
"6320 Malaga \n",
|
||
"6321 Malaga \n",
|
||
"6322 Malaga \n",
|
||
"6323 Malaga \n",
|
||
"\n",
|
||
"[6282 rows x 18 columns]"
|
||
]
|
||
},
|
||
"execution_count": 31,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# your code here\n",
|
||
"\n",
|
||
"# Anti-join, step 1: outer merge with an indicator column recording each row's origin\n",
"df_removed = df_location.merge(dropped, how=\"outer\", on=identifier, indicator=True)\n",
|
||
"\n",
|
||
"# Step 2: keep rows found only in the left table, then drop the helper column\n",
"df_removed = df_removed.loc[df_removed[\"_merge\"] == \"left_only\"].drop(columns=\"_merge\")\n",
|
||
"df_removed"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "84270332",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 4. Save the final result to `processed_data_predimed.csv`\n",
|
||
"\n",
|
||
"1. Use the `.to_csv` method of pandas DataFrames."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 32,
|
||
"id": "85902eea",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"fname = 'processed_data_predimed.csv'\n",
|
||
"\n",
|
||
"# your code here\n",
|
||
"\n",
|
||
"df_removed.to_csv(fname, index=False)\n"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.13.6"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|