2025-plovdiv-data/exercises/tabular_join/tabular_join.ipynb
2025-09-24 13:09:53 +03:00

{
"cells": [
{
"cell_type": "markdown",
"id": "f11a76bf",
"metadata": {},
"source": [
"# Exercise on joins and anti-joins: add information from other tables"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b6f2742b",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "2967c84e",
"metadata": {},
"source": [
"# Load data from clinical trial\n",
"\n",
"The data come in two files. The file `predimed_records.csv` contains the clinical data for each patient, except the diet group they were assigned to. The file `predimed_mapping.csv` contains the mapping of each patient to their diet group."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ed626ee3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>patient-id</th>\n",
" <th>location-id</th>\n",
" <th>sex</th>\n",
" <th>age</th>\n",
" <th>smoke</th>\n",
" <th>bmi</th>\n",
" <th>waist</th>\n",
" <th>wth</th>\n",
" <th>htn</th>\n",
" <th>diab</th>\n",
" <th>hyperchol</th>\n",
" <th>famhist</th>\n",
" <th>hormo</th>\n",
" <th>p14</th>\n",
" <th>toevent</th>\n",
" <th>event</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>436</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>58</td>\n",
" <td>Former</td>\n",
" <td>33.53</td>\n",
" <td>122</td>\n",
" <td>0.753086</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>5.374401</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1130</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>77</td>\n",
" <td>Current</td>\n",
" <td>31.05</td>\n",
" <td>119</td>\n",
" <td>0.730061</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>6.097194</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1131</td>\n",
" <td>4</td>\n",
" <td>Female</td>\n",
" <td>72</td>\n",
" <td>Former</td>\n",
" <td>30.86</td>\n",
" <td>106</td>\n",
" <td>0.654321</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>5.946612</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1132</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>71</td>\n",
" <td>Former</td>\n",
" <td>27.68</td>\n",
" <td>118</td>\n",
" <td>0.694118</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>2.907598</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1111</td>\n",
" <td>2</td>\n",
" <td>Female</td>\n",
" <td>79</td>\n",
" <td>Never</td>\n",
" <td>35.94</td>\n",
" <td>129</td>\n",
" <td>0.806250</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>4.761123</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" patient-id location-id sex age smoke bmi waist wth htn \\\n",
"0 436 4 Male 58 Former 33.53 122 0.753086 No \n",
"1 1130 4 Male 77 Current 31.05 119 0.730061 Yes \n",
"2 1131 4 Female 72 Former 30.86 106 0.654321 No \n",
"3 1132 4 Male 71 Former 27.68 118 0.694118 Yes \n",
"4 1111 2 Female 79 Never 35.94 129 0.806250 Yes \n",
"\n",
" diab hyperchol famhist hormo p14 toevent event \n",
"0 No Yes No No 10 5.374401 Yes \n",
"1 Yes No No No 10 6.097194 No \n",
"2 Yes No Yes No 8 5.946612 No \n",
"3 No Yes No No 8 2.907598 Yes \n",
"4 No Yes No No 9 4.761123 No "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('../../data/predimed_records.csv')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "48d5375f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location-id</th>\n",
" <th>patient-id</th>\n",
" <th>group</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>885</td>\n",
" <td>MedDiet + VOO</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>182</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>971</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>691</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>632</td>\n",
" <td>Control</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location-id patient-id group\n",
"0 2 885 MedDiet + VOO\n",
"1 1 182 MedDiet + Nuts\n",
"2 1 971 MedDiet + Nuts\n",
"3 2 691 MedDiet + Nuts\n",
"4 2 632 Control"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"info = pd.read_csv('../../data/predimed_mapping.csv')\n",
"info.head()"
]
},
{
"cell_type": "markdown",
"id": "2b4b98ed-d7ec-4b7c-b983-adc616d2f16f",
"metadata": {},
"source": [
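"The study was conducted at 5 different locations, and each location assigned its own identification number `patient-id` to each participant. A `patient-id` on its own may therefore not be unique across locations; a patient is identified by the combination of `location-id` and `patient-id`."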
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b9dbc492-1489-4530-96ac-5f33f7389caa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 1, 3, 4, 5])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"info['location-id'].unique()"
]
},
{
"cell_type": "markdown",
"id": "2fef4d37",
"metadata": {},
"source": [
"# 1. Add diet information to the patients' records\n",
"\n",
"* For how many patients do we have clinical information? (i.e., rows in `df`)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "861ac334-14ce-490a-b3c4-877b32789f3e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(6324, 16)\n"
]
}
],
"source": [
"## your code here\n",
"\n",
"print(df.shape)\n",
"\n",
"# A patient is identified by the location plus the per-location patient number\n",
"identifier = ['location-id', 'patient-id']"
]
},
{
"cell_type": "markdown",
"id": "1c1701e2-c295-4032-9e89-0d8470f41593",
"metadata": {},
"source": [
"* For how many patients do we have diet information? (i.e., rows in `info`)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "14f57842-5722-4953-88d6-d7cf3070400c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(6287, 3)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## your code here\n",
"\n",
"info.shape\n"
]
},
{
"cell_type": "markdown",
"id": "3f23fa17-af3e-41c3-883f-3e1279d4820e",
"metadata": {},
"source": [
"Perform the merge, keeping in mind that it only makes sense to analyze patients with diet information. \n",
"* Which type of merge would you use? \n",
"* For how many patients do we have full information (records and the diet they followed)? "
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "2dfaf171-2b81-4c7f-9101-c689aa56494d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\u001b[0;31mSignature:\u001b[0m\n",
"\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmerge\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mleft\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'DataFrame | Series'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mright\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'DataFrame | Series'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mhow\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'MergeHow'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'inner'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mon\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'IndexLabel | AnyArrayLike | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mleft_on\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'IndexLabel | AnyArrayLike | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mright_on\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'IndexLabel | AnyArrayLike | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mleft_index\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mright_index\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0msort\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0msuffixes\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'Suffixes'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m'_x'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'_y'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mindicator\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mvalidate\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;34m'DataFrame'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mDocstring:\u001b[0m\n",
"Merge DataFrame or named Series objects with a database-style join.\n",
"\n",
"A named Series object is treated as a DataFrame with a single named column.\n",
"\n",
"The join is done on columns or indexes. If joining columns on\n",
"columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes\n",
"on indexes or indexes on a column or columns, the index will be passed on.\n",
"When performing a cross merge, no column specifications to merge on are\n",
"allowed.\n",
"\n",
".. warning::\n",
"\n",
" If both key columns contain rows where the key is a null value, those\n",
" rows will be matched against each other. This is different from usual SQL\n",
" join behaviour and can lead to unexpected results.\n",
"\n",
"Parameters\n",
"----------\n",
"left : DataFrame or named Series\n",
"right : DataFrame or named Series\n",
" Object to merge with.\n",
"how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'\n",
" Type of merge to be performed.\n",
"\n",
" * left: use only keys from left frame, similar to a SQL left outer join;\n",
" preserve key order.\n",
" * right: use only keys from right frame, similar to a SQL right outer join;\n",
" preserve key order.\n",
" * outer: use union of keys from both frames, similar to a SQL full outer\n",
" join; sort keys lexicographically.\n",
" * inner: use intersection of keys from both frames, similar to a SQL inner\n",
" join; preserve the order of the left keys.\n",
" * cross: creates the cartesian product from both frames, preserves the order\n",
" of the left keys.\n",
"on : label or list\n",
" Column or index level names to join on. These must be found in both\n",
" DataFrames. If `on` is None and not merging on indexes then this defaults\n",
" to the intersection of the columns in both DataFrames.\n",
"left_on : label or list, or array-like\n",
" Column or index level names to join on in the left DataFrame. Can also\n",
" be an array or list of arrays of the length of the left DataFrame.\n",
" These arrays are treated as if they are columns.\n",
"right_on : label or list, or array-like\n",
" Column or index level names to join on in the right DataFrame. Can also\n",
" be an array or list of arrays of the length of the right DataFrame.\n",
" These arrays are treated as if they are columns.\n",
"left_index : bool, default False\n",
" Use the index from the left DataFrame as the join key(s). If it is a\n",
" MultiIndex, the number of keys in the other DataFrame (either the index\n",
" or a number of columns) must match the number of levels.\n",
"right_index : bool, default False\n",
" Use the index from the right DataFrame as the join key. Same caveats as\n",
" left_index.\n",
"sort : bool, default False\n",
" Sort the join keys lexicographically in the result DataFrame. If False,\n",
" the order of the join keys depends on the join type (how keyword).\n",
"suffixes : list-like, default is (\"_x\", \"_y\")\n",
" A length-2 sequence where each element is optionally a string\n",
" indicating the suffix to add to overlapping column names in\n",
" `left` and `right` respectively. Pass a value of `None` instead\n",
" of a string to indicate that the column name from `left` or\n",
" `right` should be left as-is, with no suffix. At least one of the\n",
" values must not be None.\n",
"copy : bool, default True\n",
" If False, avoid copy if possible.\n",
"\n",
" .. note::\n",
" The `copy` keyword will change behavior in pandas 3.0.\n",
" `Copy-on-Write\n",
" <https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html>`__\n",
" will be enabled by default, which means that all methods with a\n",
" `copy` keyword will use a lazy copy mechanism to defer the copy and\n",
" ignore the `copy` keyword. The `copy` keyword will be removed in a\n",
" future version of pandas.\n",
"\n",
" You can already get the future behavior and improvements through\n",
" enabling copy on write ``pd.options.mode.copy_on_write = True``\n",
"indicator : bool or str, default False\n",
" If True, adds a column to the output DataFrame called \"_merge\" with\n",
" information on the source of each row. The column can be given a different\n",
" name by providing a string argument. The column will have a Categorical\n",
" type with the value of \"left_only\" for observations whose merge key only\n",
" appears in the left DataFrame, \"right_only\" for observations\n",
" whose merge key only appears in the right DataFrame, and \"both\"\n",
" if the observation's merge key is found in both DataFrames.\n",
"\n",
"validate : str, optional\n",
" If specified, checks if merge is of specified type.\n",
"\n",
" * \"one_to_one\" or \"1:1\": check if merge keys are unique in both\n",
" left and right datasets.\n",
" * \"one_to_many\" or \"1:m\": check if merge keys are unique in left\n",
" dataset.\n",
" * \"many_to_one\" or \"m:1\": check if merge keys are unique in right\n",
" dataset.\n",
" * \"many_to_many\" or \"m:m\": allowed, but does not result in checks.\n",
"\n",
"Returns\n",
"-------\n",
"DataFrame\n",
" A DataFrame of the two merged objects.\n",
"\n",
"See Also\n",
"--------\n",
"merge_ordered : Merge with optional filling/interpolation.\n",
"merge_asof : Merge on nearest keys.\n",
"DataFrame.join : Similar method using indices.\n",
"\n",
"Examples\n",
"--------\n",
">>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],\n",
"... 'value': [1, 2, 3, 5]})\n",
">>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],\n",
"... 'value': [5, 6, 7, 8]})\n",
">>> df1\n",
" lkey value\n",
"0 foo 1\n",
"1 bar 2\n",
"2 baz 3\n",
"3 foo 5\n",
">>> df2\n",
" rkey value\n",
"0 foo 5\n",
"1 bar 6\n",
"2 baz 7\n",
"3 foo 8\n",
"\n",
"Merge df1 and df2 on the lkey and rkey columns. The value columns have\n",
"the default suffixes, _x and _y, appended.\n",
"\n",
">>> df1.merge(df2, left_on='lkey', right_on='rkey')\n",
" lkey value_x rkey value_y\n",
"0 foo 1 foo 5\n",
"1 foo 1 foo 8\n",
"2 bar 2 bar 6\n",
"3 baz 3 baz 7\n",
"4 foo 5 foo 5\n",
"5 foo 5 foo 8\n",
"\n",
"Merge DataFrames df1 and df2 with specified left and right suffixes\n",
"appended to any overlapping columns.\n",
"\n",
">>> df1.merge(df2, left_on='lkey', right_on='rkey',\n",
"... suffixes=('_left', '_right'))\n",
" lkey value_left rkey value_right\n",
"0 foo 1 foo 5\n",
"1 foo 1 foo 8\n",
"2 bar 2 bar 6\n",
"3 baz 3 baz 7\n",
"4 foo 5 foo 5\n",
"5 foo 5 foo 8\n",
"\n",
"Merge DataFrames df1 and df2, but raise an exception if the DataFrames have\n",
"any overlapping columns.\n",
"\n",
">>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))\n",
"Traceback (most recent call last):\n",
"...\n",
"ValueError: columns overlap but no suffix specified:\n",
" Index(['value'], dtype='object')\n",
"\n",
">>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})\n",
">>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})\n",
">>> df1\n",
" a b\n",
"0 foo 1\n",
"1 bar 2\n",
">>> df2\n",
" a c\n",
"0 foo 3\n",
"1 baz 4\n",
"\n",
">>> df1.merge(df2, how='inner', on='a')\n",
" a b c\n",
"0 foo 1 3\n",
"\n",
">>> df1.merge(df2, how='left', on='a')\n",
" a b c\n",
"0 foo 1 3.0\n",
"1 bar 2 NaN\n",
"\n",
">>> df1 = pd.DataFrame({'left': ['foo', 'bar']})\n",
">>> df2 = pd.DataFrame({'right': [7, 8]})\n",
">>> df1\n",
" left\n",
"0 foo\n",
"1 bar\n",
">>> df2\n",
" right\n",
"0 7\n",
"1 8\n",
"\n",
">>> df1.merge(df2, how='cross')\n",
" left right\n",
"0 foo 7\n",
"1 foo 8\n",
"2 bar 7\n",
"3 bar 8\n",
"\u001b[0;31mFile:\u001b[0m /usr/lib64/python3.13/site-packages/pandas/core/reshape/merge.py\n",
"\u001b[0;31mType:\u001b[0m function"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pd.merge?\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "35e19a53",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>patient-id</th>\n",
" <th>location-id</th>\n",
" <th>sex</th>\n",
" <th>age</th>\n",
" <th>smoke</th>\n",
" <th>bmi</th>\n",
" <th>waist</th>\n",
" <th>wth</th>\n",
" <th>htn</th>\n",
" <th>diab</th>\n",
" <th>hyperchol</th>\n",
" <th>famhist</th>\n",
" <th>hormo</th>\n",
" <th>p14</th>\n",
" <th>toevent</th>\n",
" <th>event</th>\n",
" <th>group</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>436</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>58</td>\n",
" <td>Former</td>\n",
" <td>33.53</td>\n",
" <td>122</td>\n",
" <td>0.753086</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>5.374401</td>\n",
" <td>Yes</td>\n",
" <td>Control</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1130</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>77</td>\n",
" <td>Current</td>\n",
" <td>31.05</td>\n",
" <td>119</td>\n",
" <td>0.730061</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>6.097194</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1131</td>\n",
" <td>4</td>\n",
" <td>Female</td>\n",
" <td>72</td>\n",
" <td>Former</td>\n",
" <td>30.86</td>\n",
" <td>106</td>\n",
" <td>0.654321</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>5.946612</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1132</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>71</td>\n",
" <td>Former</td>\n",
" <td>27.68</td>\n",
" <td>118</td>\n",
" <td>0.694118</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>2.907598</td>\n",
" <td>Yes</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1111</td>\n",
" <td>2</td>\n",
" <td>Female</td>\n",
" <td>79</td>\n",
" <td>Never</td>\n",
" <td>35.94</td>\n",
" <td>129</td>\n",
" <td>0.806250</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>4.761123</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6319</th>\n",
" <td>120</td>\n",
" <td>5</td>\n",
" <td>Female</td>\n",
" <td>66</td>\n",
" <td>Never</td>\n",
" <td>28.51</td>\n",
" <td>104</td>\n",
" <td>0.645963</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>3.550992</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6320</th>\n",
" <td>118</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>80</td>\n",
" <td>Never</td>\n",
" <td>23.81</td>\n",
" <td>109</td>\n",
" <td>0.589189</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>2.743326</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6321</th>\n",
" <td>351</td>\n",
" <td>3</td>\n",
" <td>Male</td>\n",
" <td>57</td>\n",
" <td>Former</td>\n",
" <td>25.24</td>\n",
" <td>100</td>\n",
" <td>0.571429</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>NaN</td>\n",
" <td>7</td>\n",
" <td>0.479124</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6322</th>\n",
" <td>499</td>\n",
" <td>5</td>\n",
" <td>Female</td>\n",
" <td>71</td>\n",
" <td>Never</td>\n",
" <td>32.04</td>\n",
" <td>98</td>\n",
" <td>0.653333</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>2.587269</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6323</th>\n",
" <td>1257</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>58</td>\n",
" <td>Former</td>\n",
" <td>24.43</td>\n",
" <td>93</td>\n",
" <td>0.547059</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.590007</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6324 rows × 17 columns</p>\n",
"</div>"
],
"text/plain": [
" patient-id location-id sex age smoke bmi waist wth \\\n",
"0 436 4 Male 58 Former 33.53 122 0.753086 \n",
"1 1130 4 Male 77 Current 31.05 119 0.730061 \n",
"2 1131 4 Female 72 Former 30.86 106 0.654321 \n",
"3 1132 4 Male 71 Former 27.68 118 0.694118 \n",
"4 1111 2 Female 79 Never 35.94 129 0.806250 \n",
"... ... ... ... ... ... ... ... ... \n",
"6319 120 5 Female 66 Never 28.51 104 0.645963 \n",
"6320 118 5 Male 80 Never 23.81 109 0.589189 \n",
"6321 351 3 Male 57 Former 25.24 100 0.571429 \n",
"6322 499 5 Female 71 Never 32.04 98 0.653333 \n",
"6323 1257 5 Male 58 Former 24.43 93 0.547059 \n",
"\n",
" htn diab hyperchol famhist hormo p14 toevent event group \n",
"0 No No Yes No No 10 5.374401 Yes Control \n",
"1 Yes Yes No No No 10 6.097194 No Control \n",
"2 No Yes No Yes No 8 5.946612 No MedDiet + VOO \n",
"3 Yes No Yes No No 8 2.907598 Yes MedDiet + Nuts \n",
"4 Yes No Yes No No 9 4.761123 No MedDiet + VOO \n",
"... ... ... ... ... ... ... ... ... ... \n",
"6319 Yes No Yes Yes No 8 3.550992 No Control \n",
"6320 Yes Yes Yes Yes No 8 2.743326 No Control \n",
"6321 Yes No Yes No NaN 7 0.479124 No MedDiet + Nuts \n",
"6322 Yes No Yes Yes No 6 2.587269 No MedDiet + VOO \n",
"6323 Yes Yes Yes No No 9 2.590007 No MedDiet + Nuts \n",
"\n",
"[6324 rows x 17 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
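"## your code here\n",
"\n",
"# Note: how=\"left\" keeps every row of df; patients without a match in info\n",
"# get NaN in 'group'. Use how=\"inner\" to keep only patients with diet information.\n",
"df_merged = df.merge(info, how=\"left\", on=identifier)\n",
"df_merged"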
]
},
{
"cell_type": "markdown",
"id": "946beb08-30a5-4020-8612-360385cdfc1e",
"metadata": {},
"source": [
"# 2. Add location information to the patients' records\n",
"\n",
"The study was conducted at five locations. The DataFrame below contains the information for each location. \n",
"\n",
"- Add a new column to the dataset that contains the city where each patient was recorded.\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "36ce0688-d421-4a07-b00e-0e9b3201f0e0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location-id</th>\n",
" <th>City</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Valencia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Barcelona</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Bilbao</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location-id City\n",
"0 1 Madrid\n",
"1 2 Valencia\n",
"2 3 Barcelona\n",
"3 4 Bilbao\n",
"4 5 Malaga"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5], \n",
" 'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})\n",
"locations"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b636dde4-129a-4dd1-8cbf-c539c9c8a5f2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>patient-id</th>\n",
" <th>location-id</th>\n",
" <th>sex</th>\n",
" <th>age</th>\n",
" <th>smoke</th>\n",
" <th>bmi</th>\n",
" <th>waist</th>\n",
" <th>wth</th>\n",
" <th>htn</th>\n",
" <th>diab</th>\n",
" <th>hyperchol</th>\n",
" <th>famhist</th>\n",
" <th>hormo</th>\n",
" <th>p14</th>\n",
" <th>toevent</th>\n",
" <th>event</th>\n",
" <th>group</th>\n",
" <th>City</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>436</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>58</td>\n",
" <td>Former</td>\n",
" <td>33.53</td>\n",
" <td>122</td>\n",
" <td>0.753086</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>5.374401</td>\n",
" <td>Yes</td>\n",
" <td>Control</td>\n",
" <td>Bilbao</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1130</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>77</td>\n",
" <td>Current</td>\n",
" <td>31.05</td>\n",
" <td>119</td>\n",
" <td>0.730061</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>6.097194</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" <td>Bilbao</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1131</td>\n",
" <td>4</td>\n",
" <td>Female</td>\n",
" <td>72</td>\n",
" <td>Former</td>\n",
" <td>30.86</td>\n",
" <td>106</td>\n",
" <td>0.654321</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>5.946612</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Bilbao</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1132</td>\n",
" <td>4</td>\n",
" <td>Male</td>\n",
" <td>71</td>\n",
" <td>Former</td>\n",
" <td>27.68</td>\n",
" <td>118</td>\n",
" <td>0.694118</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>2.907598</td>\n",
" <td>Yes</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Bilbao</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1111</td>\n",
" <td>2</td>\n",
" <td>Female</td>\n",
" <td>79</td>\n",
" <td>Never</td>\n",
" <td>35.94</td>\n",
" <td>129</td>\n",
" <td>0.806250</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>4.761123</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Valencia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6319</th>\n",
" <td>120</td>\n",
" <td>5</td>\n",
" <td>Female</td>\n",
" <td>66</td>\n",
" <td>Never</td>\n",
" <td>28.51</td>\n",
" <td>104</td>\n",
" <td>0.645963</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>3.550992</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6320</th>\n",
" <td>118</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>80</td>\n",
" <td>Never</td>\n",
" <td>23.81</td>\n",
" <td>109</td>\n",
" <td>0.589189</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>2.743326</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6321</th>\n",
" <td>351</td>\n",
" <td>3</td>\n",
" <td>Male</td>\n",
" <td>57</td>\n",
" <td>Former</td>\n",
" <td>25.24</td>\n",
" <td>100</td>\n",
" <td>0.571429</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>NaN</td>\n",
" <td>7</td>\n",
" <td>0.479124</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Barcelona</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6322</th>\n",
" <td>499</td>\n",
" <td>5</td>\n",
" <td>Female</td>\n",
" <td>71</td>\n",
" <td>Never</td>\n",
" <td>32.04</td>\n",
" <td>98</td>\n",
" <td>0.653333</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>2.587269</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6323</th>\n",
" <td>1257</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>58</td>\n",
" <td>Former</td>\n",
" <td>24.43</td>\n",
" <td>93</td>\n",
" <td>0.547059</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.590007</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6324 rows × 18 columns</p>\n",
"</div>"
],
"text/plain": [
" patient-id location-id sex age smoke bmi waist wth \\\n",
"0 436 4 Male 58 Former 33.53 122 0.753086 \n",
"1 1130 4 Male 77 Current 31.05 119 0.730061 \n",
"2 1131 4 Female 72 Former 30.86 106 0.654321 \n",
"3 1132 4 Male 71 Former 27.68 118 0.694118 \n",
"4 1111 2 Female 79 Never 35.94 129 0.806250 \n",
"... ... ... ... ... ... ... ... ... \n",
"6319 120 5 Female 66 Never 28.51 104 0.645963 \n",
"6320 118 5 Male 80 Never 23.81 109 0.589189 \n",
"6321 351 3 Male 57 Former 25.24 100 0.571429 \n",
"6322 499 5 Female 71 Never 32.04 98 0.653333 \n",
"6323 1257 5 Male 58 Former 24.43 93 0.547059 \n",
"\n",
" htn diab hyperchol famhist hormo p14 toevent event group \\\n",
"0 No No Yes No No 10 5.374401 Yes Control \n",
"1 Yes Yes No No No 10 6.097194 No Control \n",
"2 No Yes No Yes No 8 5.946612 No MedDiet + VOO \n",
"3 Yes No Yes No No 8 2.907598 Yes MedDiet + Nuts \n",
"4 Yes No Yes No No 9 4.761123 No MedDiet + VOO \n",
"... ... ... ... ... ... ... ... ... ... \n",
"6319 Yes No Yes Yes No 8 3.550992 No Control \n",
"6320 Yes Yes Yes Yes No 8 2.743326 No Control \n",
"6321 Yes No Yes No NaN 7 0.479124 No MedDiet + Nuts \n",
"6322 Yes No Yes Yes No 6 2.587269 No MedDiet + VOO \n",
"6323 Yes Yes Yes No No 9 2.590007 No MedDiet + Nuts \n",
"\n",
" City \n",
"0 Bilbao \n",
"1 Bilbao \n",
"2 Bilbao \n",
"3 Bilbao \n",
"4 Valencia \n",
"... ... \n",
"6319 Malaga \n",
"6320 Malaga \n",
"6321 Barcelona \n",
"6322 Malaga \n",
"6323 Malaga \n",
"\n",
"[6324 rows x 18 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## your code here:\n",
"\n",
"df_location = df_merged.merge(locations, on = \"location-id\")\n",
"df_location\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "44031178",
"metadata": {},
"source": [
"# 3. Remove drops from table\n",
"\n",
"Some patients drop from the study early on and they should be removed from our analysis. Their IDS are stored in file `dropped.csv`.\n",
"1. Load the list of patients who droped, from `dropped.csv`\n",
"2. Use an anti-join to remove them from the table\n",
"3. How many patients (rows) are left in the data?"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "d1d4cc27",
"metadata": {},
"outputs": [],
"source": [
"dropped = pd.read_csv('dropped.csv')"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "fbebbd97",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(42, 2)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dropped.shape"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "8a3c7943",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location-id</th>\n",
" <th>patient-id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>217</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1147</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1170</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>627</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>541</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location-id patient-id\n",
"0 1 217\n",
"1 1 1147\n",
"2 1 1170\n",
"3 1 627\n",
"4 4 541"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dropped.head()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "573687e7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>patient-id</th>\n",
" <th>location-id</th>\n",
" <th>sex</th>\n",
" <th>age</th>\n",
" <th>smoke</th>\n",
" <th>bmi</th>\n",
" <th>waist</th>\n",
" <th>wth</th>\n",
" <th>htn</th>\n",
" <th>diab</th>\n",
" <th>hyperchol</th>\n",
" <th>famhist</th>\n",
" <th>hormo</th>\n",
" <th>p14</th>\n",
" <th>toevent</th>\n",
" <th>event</th>\n",
" <th>group</th>\n",
" <th>City</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>77</td>\n",
" <td>Never</td>\n",
" <td>25.92</td>\n",
" <td>94</td>\n",
" <td>0.657343</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>5.538672</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>68</td>\n",
" <td>Never</td>\n",
" <td>34.85</td>\n",
" <td>150</td>\n",
" <td>0.949367</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>10</td>\n",
" <td>3.063655</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>66</td>\n",
" <td>Never</td>\n",
" <td>37.50</td>\n",
" <td>120</td>\n",
" <td>0.750000</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>5.590691</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>77</td>\n",
" <td>Never</td>\n",
" <td>29.26</td>\n",
" <td>93</td>\n",
" <td>0.628378</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>5.456537</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>60</td>\n",
" <td>Never</td>\n",
" <td>30.02</td>\n",
" <td>104</td>\n",
" <td>0.662420</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.746064</td>\n",
" <td>No</td>\n",
" <td>Control</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6319</th>\n",
" <td>1253</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>79</td>\n",
" <td>Never</td>\n",
" <td>25.28</td>\n",
" <td>105</td>\n",
" <td>0.640244</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>5.828884</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6320</th>\n",
" <td>1254</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>62</td>\n",
" <td>Former</td>\n",
" <td>27.10</td>\n",
" <td>104</td>\n",
" <td>0.594286</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>5.067762</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6321</th>\n",
" <td>1255</td>\n",
" <td>5</td>\n",
" <td>Female</td>\n",
" <td>65</td>\n",
" <td>Never</td>\n",
" <td>35.02</td>\n",
" <td>103</td>\n",
" <td>0.686667</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>1.993155</td>\n",
" <td>No</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6322</th>\n",
" <td>1256</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>61</td>\n",
" <td>Never</td>\n",
" <td>28.42</td>\n",
" <td>94</td>\n",
" <td>0.576687</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.039699</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6323</th>\n",
" <td>1257</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>58</td>\n",
" <td>Former</td>\n",
" <td>24.43</td>\n",
" <td>93</td>\n",
" <td>0.547059</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.590007</td>\n",
" <td>No</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6282 rows × 18 columns</p>\n",
"</div>"
],
"text/plain": [
" patient-id location-id sex age smoke bmi waist wth \\\n",
"0 1 1 Female 77 Never 25.92 94 0.657343 \n",
"1 2 1 Female 68 Never 34.85 150 0.949367 \n",
"2 3 1 Female 66 Never 37.50 120 0.750000 \n",
"3 4 1 Female 77 Never 29.26 93 0.628378 \n",
"4 5 1 Female 60 Never 30.02 104 0.662420 \n",
"... ... ... ... ... ... ... ... ... \n",
"6319 1253 5 Male 79 Never 25.28 105 0.640244 \n",
"6320 1254 5 Male 62 Former 27.10 104 0.594286 \n",
"6321 1255 5 Female 65 Never 35.02 103 0.686667 \n",
"6322 1256 5 Male 61 Never 28.42 94 0.576687 \n",
"6323 1257 5 Male 58 Former 24.43 93 0.547059 \n",
"\n",
" htn diab hyperchol famhist hormo p14 toevent event group \\\n",
"0 Yes No Yes Yes No 9 5.538672 No MedDiet + VOO \n",
"1 Yes No Yes Yes NaN 10 3.063655 No MedDiet + Nuts \n",
"2 Yes Yes No No No 6 5.590691 No MedDiet + Nuts \n",
"3 Yes Yes No No No 6 5.456537 No MedDiet + VOO \n",
"4 Yes No Yes No No 9 2.746064 No Control \n",
"... ... ... ... ... ... ... ... ... ... \n",
"6319 Yes No Yes No No 8 5.828884 No MedDiet + VOO \n",
"6320 Yes No Yes Yes No 9 5.067762 No MedDiet + Nuts \n",
"6321 Yes No Yes No No 10 1.993155 No MedDiet + VOO \n",
"6322 Yes Yes No No No 9 2.039699 No MedDiet + Nuts \n",
"6323 Yes Yes Yes No No 9 2.590007 No MedDiet + Nuts \n",
"\n",
" City \n",
"0 Madrid \n",
"1 Madrid \n",
"2 Madrid \n",
"3 Madrid \n",
"4 Madrid \n",
"... ... \n",
"6319 Malaga \n",
"6320 Malaga \n",
"6321 Malaga \n",
"6322 Malaga \n",
"6323 Malaga \n",
"\n",
"[6282 rows x 18 columns]"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# your code here\n",
"\n",
"df_removed = df_location.merge(dropped, how = \"outer\", on = identifier, indicator = True)\n",
"\n",
"df_removed = df_removed.loc[df_removed[\"_merge\"] == \"left_only\",].drop([\"_merge\"], axis = 1)\n",
"df_removed"
]
},
{
"cell_type": "markdown",
"id": "84270332",
"metadata": {},
"source": [
"# 4. Save final result in `processed_data_predimed.csv`\n",
"\n",
"1. Using the `.to_csv` method of Pandas DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "85902eea",
"metadata": {},
"outputs": [],
"source": [
"fname = 'processed_data_predimed.csv'\n",
"\n",
"# your code here\n",
"\n",
"df_removed.to_csv(fname)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}