2025-plovdiv-data/exercises/tabular_window_functions/window_functions_solution.ipynb

134 KiB
Raw Permalink Blame History

Exercise window functions: compute the cumulative number of cases across time, per diet group

The variable toevent contains the time that patients where followed up. We want to calculate the number of events as a function of the follow-up time, separatedely for each diet group. We expect that, if the mediterranean diet has an effect, then over time there will be more cases appearing on the control group in comparison to the other diet groups.

Here is how to proceed:

  • Use a window function to compute the cumulative number of events for each diet group separatedly. As we are interested in the follow-up time, you need to sort the events by the follow-up time first (toevent), and then calculate the cumulative sum of events, separatedely per group.
  • Add the result as a new column called 'cumulative_event_count'

With your new awesome vectorization skills, these two steps should take only one line!

When ready, execute the code at the end, which has already code that creates a visualiation with the cumulative number of events per group, as a function of the time of follow-up.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Load patient data

In [2]:
df = pd.read_csv('processed_data_predimed.csv')
df['event'] = df['event'].map({'Yes': 1, 'No': 0})
df
Out[2]:
patient-id location-id sex age smoke bmi waist wth htn diab hyperchol famhist hormo p14 toevent event group City
0 1 1 Female 77 Never 25.92 94 0.657343 Yes No Yes Yes No 9 5.538672 0 MedDiet + VOO Madrid
1 2 1 Female 68 Never 34.85 150 0.949367 Yes No Yes Yes NaN 10 3.063655 0 MedDiet + Nuts Madrid
2 3 1 Female 66 Never 37.50 120 0.750000 Yes Yes No No No 6 5.590691 0 MedDiet + Nuts Madrid
3 4 1 Female 77 Never 29.26 93 0.628378 Yes Yes No No No 6 5.456537 0 MedDiet + VOO Madrid
4 5 1 Female 60 Never 30.02 104 0.662420 Yes No Yes No No 9 2.746064 0 Control Madrid
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6240 1253 5 Male 79 Never 25.28 105 0.640244 Yes No Yes No No 8 5.828884 0 MedDiet + VOO Malaga
6241 1254 5 Male 62 Former 27.10 104 0.594286 Yes No Yes Yes No 9 5.067762 0 MedDiet + Nuts Malaga
6242 1255 5 Female 65 Never 35.02 103 0.686667 Yes No Yes No No 10 1.993155 0 MedDiet + VOO Malaga
6243 1256 5 Male 61 Never 28.42 94 0.576687 Yes Yes No No No 9 2.039699 0 MedDiet + Nuts Malaga
6244 1257 5 Male 58 Former 24.43 93 0.547059 Yes Yes Yes No No 9 2.590007 0 MedDiet + Nuts Malaga

6245 rows × 18 columns

In [3]:
# calculate cumulative number of cases across time, independently for each group
df['cumulative_event_count'] = df.sort_values('toevent').groupby('group')['event'].cumsum()
df
Out[3]:
patient-id location-id sex age smoke bmi waist wth htn diab hyperchol famhist hormo p14 toevent event group City cumulative_event_count
0 1 1 Female 77 Never 25.92 94 0.657343 Yes No Yes Yes No 9 5.538672 0 MedDiet + VOO Madrid 73
1 2 1 Female 68 Never 34.85 150 0.949367 Yes No Yes Yes NaN 10 3.063655 0 MedDiet + Nuts Madrid 35
2 3 1 Female 66 Never 37.50 120 0.750000 Yes Yes No No No 6 5.590691 0 MedDiet + Nuts Madrid 61
3 4 1 Female 77 Never 29.26 93 0.628378 Yes Yes No No No 6 5.456537 0 MedDiet + VOO Madrid 73
4 5 1 Female 60 Never 30.02 104 0.662420 Yes No Yes No No 9 2.746064 0 Control Madrid 50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6240 1253 5 Male 79 Never 25.28 105 0.640244 Yes No Yes No No 8 5.828884 0 MedDiet + VOO Malaga 74
6241 1254 5 Male 62 Former 27.10 104 0.594286 Yes No Yes Yes No 9 5.067762 0 MedDiet + Nuts Malaga 57
6242 1255 5 Female 65 Never 35.02 103 0.686667 Yes No Yes No No 10 1.993155 0 MedDiet + VOO Malaga 27
6243 1256 5 Male 61 Never 28.42 94 0.576687 Yes Yes No No No 9 2.039699 0 MedDiet + Nuts Malaga 16
6244 1257 5 Male 58 Former 24.43 93 0.547059 Yes Yes Yes No No 9 2.590007 0 MedDiet + Nuts Malaga 27

6245 rows × 19 columns

In [4]:
sns.lineplot(data=df.sort_values('toevent'), x='toevent', y='cumulative_event_count', hue='group')
plt.ylabel('Cumulative events')
plt.xlabel('Years of follow up (variable `toevent`)')
sns.despine()

Optional exercise

Redo the plot but with the cummulative percentage of cases. For that you need to divide the cummulative count by the total number of cases in each group.

In [5]:
# we calculate the number of cases per group, and save them in a dictionary
n_dict = df.groupby('group')['cumulative_event_count'].count().to_dict()
n_dict
Out[5]:
{'Control': 2016, 'MedDiet + Nuts': 2077, 'MedDiet + VOO': 2152}
In [6]:
# we assign now the N to eaach group, to later divide it in vector form
df['N'] = df['group'].map(n_dict)
df
Out[6]:
patient-id location-id sex age smoke bmi waist wth htn diab hyperchol famhist hormo p14 toevent event group City cumulative_event_count N
0 1 1 Female 77 Never 25.92 94 0.657343 Yes No Yes Yes No 9 5.538672 0 MedDiet + VOO Madrid 73 2152
1 2 1 Female 68 Never 34.85 150 0.949367 Yes No Yes Yes NaN 10 3.063655 0 MedDiet + Nuts Madrid 35 2077
2 3 1 Female 66 Never 37.50 120 0.750000 Yes Yes No No No 6 5.590691 0 MedDiet + Nuts Madrid 61 2077
3 4 1 Female 77 Never 29.26 93 0.628378 Yes Yes No No No 6 5.456537 0 MedDiet + VOO Madrid 73 2152
4 5 1 Female 60 Never 30.02 104 0.662420 Yes No Yes No No 9 2.746064 0 Control Madrid 50 2016
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6240 1253 5 Male 79 Never 25.28 105 0.640244 Yes No Yes No No 8 5.828884 0 MedDiet + VOO Malaga 74 2152
6241 1254 5 Male 62 Former 27.10 104 0.594286 Yes No Yes Yes No 9 5.067762 0 MedDiet + Nuts Malaga 57 2077
6242 1255 5 Female 65 Never 35.02 103 0.686667 Yes No Yes No No 10 1.993155 0 MedDiet + VOO Malaga 27 2152
6243 1256 5 Male 61 Never 28.42 94 0.576687 Yes Yes No No No 9 2.039699 0 MedDiet + Nuts Malaga 16 2077
6244 1257 5 Male 58 Former 24.43 93 0.547059 Yes Yes Yes No No 9 2.590007 0 MedDiet + Nuts Malaga 27 2077

6245 rows × 20 columns

In [7]:
# divide, vectorized
df['cumulative_incidence'] = df['cumulative_event_count'] / df['N']
In [8]:
sns.lineplot(data=df.sort_values('toevent'), x='toevent', y='cumulative_incidence', hue='group')
plt.ylabel('Cumulative incidence')
plt.xlabel('Years of follow up')
sns.despine()
In [ ]: