adds exercises for tabular data part

Guillermo Aguilar 2025-09-23 11:49:18 +02:00
parent 2e60b94c52
commit 26eb146a5c
16 changed files with 60195 additions and 0 deletions


@@ -0,0 +1,467 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6f6aa857",
"metadata": {},
"source": [
"# Exercise window functions: compute the cumulative number of cases across time, per diet group\n",
"\n",
"The variable `toevent` contains the time that patients where followed up. We want to calculate the number of events as a function of the follow-up time, separatedely for each diet group. We expect that, if the mediterranean diet has an effect, then over time there will be more cases appearing on the control group in comparison to the other diet groups. \n",
"\n",
"Here is how to proceed:\n",
"- Use a window function to compute the cumulative number of events for each diet group separatedly. As we are interested in the follow-up time, you need to sort the events by the follow-up time first (`toevent`), and then calculate the cumulative sum of events, separatedely per group.\n",
"- Add the result as a new column called `'cumulative_event_count'`\n",
"\n",
"With your new awesome vectorization skills, these two steps should take only one line!\n",
"\n",
"When ready, execute the code at the end, which has already code that creates a visualiation with the cumulative number of events per group, as a function of the time of follow-up."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8f9bc8b1",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"id": "1be11d54",
"metadata": {},
"source": [
"### Load patient data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8dfc3020",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>patient-id</th>\n",
" <th>location-id</th>\n",
" <th>sex</th>\n",
" <th>age</th>\n",
" <th>smoke</th>\n",
" <th>bmi</th>\n",
" <th>waist</th>\n",
" <th>wth</th>\n",
" <th>htn</th>\n",
" <th>diab</th>\n",
" <th>hyperchol</th>\n",
" <th>famhist</th>\n",
" <th>hormo</th>\n",
" <th>p14</th>\n",
" <th>toevent</th>\n",
" <th>event</th>\n",
" <th>group</th>\n",
" <th>City</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>77</td>\n",
" <td>Never</td>\n",
" <td>25.92</td>\n",
" <td>94</td>\n",
" <td>0.657343</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>5.538672</td>\n",
" <td>0</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>68</td>\n",
" <td>Never</td>\n",
" <td>34.85</td>\n",
" <td>150</td>\n",
" <td>0.949367</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>10</td>\n",
" <td>3.063655</td>\n",
" <td>0</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>66</td>\n",
" <td>Never</td>\n",
" <td>37.50</td>\n",
" <td>120</td>\n",
" <td>0.750000</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>5.590691</td>\n",
" <td>0</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>77</td>\n",
" <td>Never</td>\n",
" <td>29.26</td>\n",
" <td>93</td>\n",
" <td>0.628378</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>6</td>\n",
" <td>5.456537</td>\n",
" <td>0</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>Female</td>\n",
" <td>60</td>\n",
" <td>Never</td>\n",
" <td>30.02</td>\n",
" <td>104</td>\n",
" <td>0.662420</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.746064</td>\n",
" <td>0</td>\n",
" <td>Control</td>\n",
" <td>Madrid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6240</th>\n",
" <td>1253</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>79</td>\n",
" <td>Never</td>\n",
" <td>25.28</td>\n",
" <td>105</td>\n",
" <td>0.640244</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>8</td>\n",
" <td>5.828884</td>\n",
" <td>0</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6241</th>\n",
" <td>1254</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>62</td>\n",
" <td>Former</td>\n",
" <td>27.10</td>\n",
" <td>104</td>\n",
" <td>0.594286</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>5.067762</td>\n",
" <td>0</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6242</th>\n",
" <td>1255</td>\n",
" <td>5</td>\n",
" <td>Female</td>\n",
" <td>65</td>\n",
" <td>Never</td>\n",
" <td>35.02</td>\n",
" <td>103</td>\n",
" <td>0.686667</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>10</td>\n",
" <td>1.993155</td>\n",
" <td>0</td>\n",
" <td>MedDiet + VOO</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6243</th>\n",
" <td>1256</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>61</td>\n",
" <td>Never</td>\n",
" <td>28.42</td>\n",
" <td>94</td>\n",
" <td>0.576687</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.039699</td>\n",
" <td>0</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6244</th>\n",
" <td>1257</td>\n",
" <td>5</td>\n",
" <td>Male</td>\n",
" <td>58</td>\n",
" <td>Former</td>\n",
" <td>24.43</td>\n",
" <td>93</td>\n",
" <td>0.547059</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>9</td>\n",
" <td>2.590007</td>\n",
" <td>0</td>\n",
" <td>MedDiet + Nuts</td>\n",
" <td>Malaga</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6245 rows × 18 columns</p>\n",
"</div>"
],
"text/plain": [
" patient-id location-id sex age smoke bmi waist wth \\\n",
"0 1 1 Female 77 Never 25.92 94 0.657343 \n",
"1 2 1 Female 68 Never 34.85 150 0.949367 \n",
"2 3 1 Female 66 Never 37.50 120 0.750000 \n",
"3 4 1 Female 77 Never 29.26 93 0.628378 \n",
"4 5 1 Female 60 Never 30.02 104 0.662420 \n",
"... ... ... ... ... ... ... ... ... \n",
"6240 1253 5 Male 79 Never 25.28 105 0.640244 \n",
"6241 1254 5 Male 62 Former 27.10 104 0.594286 \n",
"6242 1255 5 Female 65 Never 35.02 103 0.686667 \n",
"6243 1256 5 Male 61 Never 28.42 94 0.576687 \n",
"6244 1257 5 Male 58 Former 24.43 93 0.547059 \n",
"\n",
" htn diab hyperchol famhist hormo p14 toevent event group \\\n",
"0 Yes No Yes Yes No 9 5.538672 0 MedDiet + VOO \n",
"1 Yes No Yes Yes NaN 10 3.063655 0 MedDiet + Nuts \n",
"2 Yes Yes No No No 6 5.590691 0 MedDiet + Nuts \n",
"3 Yes Yes No No No 6 5.456537 0 MedDiet + VOO \n",
"4 Yes No Yes No No 9 2.746064 0 Control \n",
"... ... ... ... ... ... ... ... ... ... \n",
"6240 Yes No Yes No No 8 5.828884 0 MedDiet + VOO \n",
"6241 Yes No Yes Yes No 9 5.067762 0 MedDiet + Nuts \n",
"6242 Yes No Yes No No 10 1.993155 0 MedDiet + VOO \n",
"6243 Yes Yes No No No 9 2.039699 0 MedDiet + Nuts \n",
"6244 Yes Yes Yes No No 9 2.590007 0 MedDiet + Nuts \n",
"\n",
" City \n",
"0 Madrid \n",
"1 Madrid \n",
"2 Madrid \n",
"3 Madrid \n",
"4 Madrid \n",
"... ... \n",
"6240 Malaga \n",
"6241 Malaga \n",
"6242 Malaga \n",
"6243 Malaga \n",
"6244 Malaga \n",
"\n",
"[6245 rows x 18 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('processed_data_predimed.csv')\n",
"df['event'] = df['event'].map({'Yes': 1, 'No': 0})\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "db0d21f7-033e-48ca-9c90-81451af57003",
"metadata": {},
"outputs": [],
"source": [
"# calculate cumulative number of cases across time, independently for each group\n",
"\n",
"# your code here:\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "7ee10840-12ce-491b-b48a-5fb04df22919",
"metadata": {},
"source": [
"If you do it right, the following code will create a visualization as shown in the slides. Uncomment it"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "de6ee1a8",
"metadata": {},
"outputs": [],
"source": [
"#sns.lineplot(data=df.sort_values('toevent'), x='toevent', y='cumulative_event_count', hue='group')\n",
"#plt.ylabel('Cumulative events')\n",
"#plt.xlabel('Years of follow up (variable `toevent`)')\n",
"#sns.despine()"
]
},
{
"cell_type": "markdown",
"id": "b8f7a0b7-1ae0-470c-a153-59dc8a6caa28",
"metadata": {},
"source": [
"### Optional exercise\n",
"\n",
"Redo the plot but with the cummulative *percentage* of cases. For that you need to divide the cummulative count by the total number of cases in each group. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "d63a56b1",
"metadata": {},
"outputs": [],
"source": [
"# your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ffd0e784-ab11-4073-a14a-038ef87c5464",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
