Exercise window functions: compute the cumulative number of cases across time, per diet group¶

The variable toevent contains the time that patients where followed up. We want to calculate the number of events as a function of the follow-up time, separatedely for each diet group. We expect that, if the mediterranean diet has an effect, then over time there will be more cases appearing on the control group in comparison to the other diet groups.

Here is how to proceed:

Use a window function to compute the cumulative number of events for each diet group separatedly. As we are interested in the follow-up time, you need to sort the events by the follow-up time first (toevent), and then calculate the cumulative sum of events, separatedely per group.
Add the result as a new column called 'cumulative_event_count'

With your new awesome vectorization skills, these two steps should take only one line!

When ready, execute the code at the end, which has already code that creates a visualiation with the cumulative number of events per group, as a function of the time of follow-up.

In [1]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Load patient data¶

In [2]:

df = pd.read_csv('processed_data_predimed.csv')
df['event'] = df['event'].map({'Yes': 1, 'No': 0})
df

Out[2]:

	patient-id	location-id	sex	age	smoke	bmi	waist	wth	htn	diab	hyperchol	famhist	hormo	p14	toevent	event	group	City
0	1	1	Female	77	Never	25.92	94	0.657343	Yes	No	Yes	Yes	No	9	5.538672	0	MedDiet + VOO	Madrid
1	2	1	Female	68	Never	34.85	150	0.949367	Yes	No	Yes	Yes	NaN	10	3.063655	0	MedDiet + Nuts	Madrid
2	3	1	Female	66	Never	37.50	120	0.750000	Yes	Yes	No	No	No	6	5.590691	0	MedDiet + Nuts	Madrid
3	4	1	Female	77	Never	29.26	93	0.628378	Yes	Yes	No	No	No	6	5.456537	0	MedDiet + VOO	Madrid
4	5	1	Female	60	Never	30.02	104	0.662420	Yes	No	Yes	No	No	9	2.746064	0	Control	Madrid
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6240	1253	5	Male	79	Never	25.28	105	0.640244	Yes	No	Yes	No	No	8	5.828884	0	MedDiet + VOO	Malaga
6241	1254	5	Male	62	Former	27.10	104	0.594286	Yes	No	Yes	Yes	No	9	5.067762	0	MedDiet + Nuts	Malaga
6242	1255	5	Female	65	Never	35.02	103	0.686667	Yes	No	Yes	No	No	10	1.993155	0	MedDiet + VOO	Malaga
6243	1256	5	Male	61	Never	28.42	94	0.576687	Yes	Yes	No	No	No	9	2.039699	0	MedDiet + Nuts	Malaga
6244	1257	5	Male	58	Former	24.43	93	0.547059	Yes	Yes	Yes	No	No	9	2.590007	0	MedDiet + Nuts	Malaga

6245 rows × 18 columns

In [3]:

# calculate cumulative number of cases across time, independently for each group
df['cumulative_event_count'] = df.sort_values('toevent').groupby('group')['event'].cumsum()
df

Out[3]:

	patient-id	location-id	sex	age	smoke	bmi	waist	wth	htn	diab	hyperchol	famhist	hormo	p14	toevent	event	group	City	cumulative_event_count
0	1	1	Female	77	Never	25.92	94	0.657343	Yes	No	Yes	Yes	No	9	5.538672	0	MedDiet + VOO	Madrid	73
1	2	1	Female	68	Never	34.85	150	0.949367	Yes	No	Yes	Yes	NaN	10	3.063655	0	MedDiet + Nuts	Madrid	35
2	3	1	Female	66	Never	37.50	120	0.750000	Yes	Yes	No	No	No	6	5.590691	0	MedDiet + Nuts	Madrid	61
3	4	1	Female	77	Never	29.26	93	0.628378	Yes	Yes	No	No	No	6	5.456537	0	MedDiet + VOO	Madrid	73
4	5	1	Female	60	Never	30.02	104	0.662420	Yes	No	Yes	No	No	9	2.746064	0	Control	Madrid	50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6240	1253	5	Male	79	Never	25.28	105	0.640244	Yes	No	Yes	No	No	8	5.828884	0	MedDiet + VOO	Malaga	74
6241	1254	5	Male	62	Former	27.10	104	0.594286	Yes	No	Yes	Yes	No	9	5.067762	0	MedDiet + Nuts	Malaga	57
6242	1255	5	Female	65	Never	35.02	103	0.686667	Yes	No	Yes	No	No	10	1.993155	0	MedDiet + VOO	Malaga	27
6243	1256	5	Male	61	Never	28.42	94	0.576687	Yes	Yes	No	No	No	9	2.039699	0	MedDiet + Nuts	Malaga	16
6244	1257	5	Male	58	Former	24.43	93	0.547059	Yes	Yes	Yes	No	No	9	2.590007	0	MedDiet + Nuts	Malaga	27

6245 rows × 19 columns

In [4]:

sns.lineplot(data=df.sort_values('toevent'), x='toevent', y='cumulative_event_count', hue='group')
plt.ylabel('Cumulative events')
plt.xlabel('Years of follow up (variable `toevent`)')
sns.despine()

Optional exercise¶

Redo the plot but with the cummulative percentage of cases. For that you need to divide the cummulative count by the total number of cases in each group.

In [5]:

# we calculate the number of cases per group, and save them in a dictionary
n_dict = df.groupby('group')['cumulative_event_count'].count().to_dict()
n_dict

Out[5]:

{'Control': 2016, 'MedDiet + Nuts': 2077, 'MedDiet + VOO': 2152}

In [6]:

# we assign now the N to eaach group, to later divide it in vector form
df['N'] = df['group'].map(n_dict)
df

Out[6]:

	patient-id	location-id	sex	age	smoke	bmi	waist	wth	htn	diab	hyperchol	famhist	hormo	p14	toevent	event	group	City	cumulative_event_count	N
0	1	1	Female	77	Never	25.92	94	0.657343	Yes	No	Yes	Yes	No	9	5.538672	0	MedDiet + VOO	Madrid	73	2152
1	2	1	Female	68	Never	34.85	150	0.949367	Yes	No	Yes	Yes	NaN	10	3.063655	0	MedDiet + Nuts	Madrid	35	2077
2	3	1	Female	66	Never	37.50	120	0.750000	Yes	Yes	No	No	No	6	5.590691	0	MedDiet + Nuts	Madrid	61	2077
3	4	1	Female	77	Never	29.26	93	0.628378	Yes	Yes	No	No	No	6	5.456537	0	MedDiet + VOO	Madrid	73	2152
4	5	1	Female	60	Never	30.02	104	0.662420	Yes	No	Yes	No	No	9	2.746064	0	Control	Madrid	50	2016
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6240	1253	5	Male	79	Never	25.28	105	0.640244	Yes	No	Yes	No	No	8	5.828884	0	MedDiet + VOO	Malaga	74	2152
6241	1254	5	Male	62	Former	27.10	104	0.594286	Yes	No	Yes	Yes	No	9	5.067762	0	MedDiet + Nuts	Malaga	57	2077
6242	1255	5	Female	65	Never	35.02	103	0.686667	Yes	No	Yes	No	No	10	1.993155	0	MedDiet + VOO	Malaga	27	2152
6243	1256	5	Male	61	Never	28.42	94	0.576687	Yes	Yes	No	No	No	9	2.039699	0	MedDiet + Nuts	Malaga	16	2077
6244	1257	5	Male	58	Former	24.43	93	0.547059	Yes	Yes	Yes	No	No	9	2.590007	0	MedDiet + Nuts	Malaga	27	2077

6245 rows × 20 columns

In [7]:

# divide, vectorized
df['cumulative_incidence'] = df['cumulative_event_count'] / df['N']

In [8]:

sns.lineplot(data=df.sort_values('toevent'), x='toevent', y='cumulative_incidence', hue='group')
plt.ylabel('Cumulative incidence')
plt.xlabel('Years of follow up')
sns.despine()

In [ ]:

134 KiB Raw Permalink Blame History Unescape Escape

Exercise window functions: compute the cumulative number of cases across time, per diet group¶

Load patient data¶

Optional exercise¶

134 KiB

Raw Permalink Blame History