2025-plovdiv-data/exercises/tabular_split_apply_combine/split_apply_combine_solution.ipynb

Exercise: Compute summary statistics

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

Load the patient data

In [25]:
df = pd.read_csv('processed_data_predimed.csv')
In [26]:
df.shape
Out[26]:
(6245, 17)
In [27]:
df.head()
Out[27]:
patient-id location-id sex age smoke bmi waist wth htn diab hyperchol famhist hormo p14 toevent event group City
0 1 1 Female 77 Never 25.92 94 0.657343 Yes No Yes Yes No 9 5.538672 No MedDiet + VOO Madrid
1 2 1 Female 68 Never 34.85 150 0.949367 Yes No Yes Yes NaN 10 3.063655 No MedDiet + Nuts Madrid
2 3 1 Female 66 Never 37.50 120 0.750000 Yes Yes No No No 6 5.590691 No MedDiet + Nuts Madrid
3 4 1 Female 77 Never 29.26 93 0.628378 Yes Yes No No No 6 5.456537 No MedDiet + VOO Madrid
4 5 1 Female 60 Never 30.02 104 0.662420 Yes No Yes No No 9 2.746064 No Control Madrid

1. Did the Mediterranean diet help prevent cardiovascular events?

To answer this question, we need to compute how many cardiovascular "events" occurred in each group of participants, separated by the diet they followed. In the data, the column event contains Yes or No, indicating whether that patient had a cardiovascular event. The column group indicates which diet they followed.

  • Convert the column event from string to binary (1 for Yes, 0 for No); this simplifies the calculations that follow. Hint: use the method .map()
In [28]:
df['event'] = df['event'].map({'Yes': 1, 'No': 0})
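As a side note on how .map() behaves with a dictionary, here is a minimal sketch on made-up values (the series toy and its 'Maybe' entry are invented for illustration and do not come from the patient file). Values that are not keys of the dictionary become NaN, which also means that running the conversion cell a second time on the already-converted column would wipe it out.

```python
import pandas as pd

# Made-up values for illustration; not taken from the patient file.
toy = pd.Series(['Yes', 'No', 'Yes', 'Maybe'], name='event')

mapping = {'Yes': 1, 'No': 0}
coded = toy.map(mapping)

# 'Maybe' is not a key in the dictionary, so it becomes NaN
# and the result is stored as float rather than int.
print(coded)

# Mapping an already-coded column again finds no matching keys,
# so every entry turns into NaN (this is what a second run of the cell does).
recoded = coded.map(mapping)
print(recoded.isna().all())
```

If the event column unexpectedly shows only NaN, reloading the CSV and running the conversion once is the simplest fix.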
  • Now compute the total number of events by diet group. Compare the numbers and see if you can answer the question.
In [29]:
df.groupby('group')['event'].sum()
Out[29]:
group
Control           96
MedDiet + Nuts    69
MedDiet + VOO     83
Name: event, dtype: int64
  • Check how many patients each group had.
In [30]:
df.groupby('group')['event'].count()
Out[30]:
group
Control           2016
MedDiet + Nuts    2077
MedDiet + VOO     2152
Name: event, dtype: int64
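One detail worth keeping in mind: .count() only counts non-missing values of the selected column, whereas .size() counts all rows in each group. The sketch below uses an invented mini-dataset (toy, with one missing event) to show the difference; whether the real event column has missing values is not checked here.

```python
import pandas as pd

# Hypothetical mini-dataset; group names mirror the exercise, values are invented.
toy = pd.DataFrame({
    'group': ['Control', 'Control', 'MedDiet + Nuts', 'MedDiet + Nuts', 'MedDiet + Nuts'],
    'event': [1, None, 0, 0, 1],
})

# All rows per group, regardless of missing values
n_rows = toy.groupby('group').size()

# Only rows where 'event' is not missing
n_valid = toy.groupby('group')['event'].count()

print(n_rows)
print(n_valid)
```

When the column has no missing values, both give the same numbers.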

The groups were not of equal size, so to compare them fairly we need to express the number of events relative to the size of each group. For that:

  • Calculate how many events occurred relative to the number of patients in each group (as a percentage). Do this separately for each diet group.
In [31]:
df.groupby('group')['event'].sum()*100 / df.groupby('group')['event'].count()
Out[31]:
group
Control           4.761905
MedDiet + Nuts    3.322099
MedDiet + VOO     3.856877
Name: event, dtype: float64

It seems that the control group had a higher percentage of events (about 4.8%) than the two Mediterranean diet groups (about 3.3% and 3.9%).
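Because event is coded as 0/1, the percentage of events is simply the group mean times 100, so the sum/count division can be written more compactly. A minimal sketch on an invented mini-dataset (toy is hypothetical, not the patient data):

```python
import pandas as pd

# Hypothetical mini-dataset; group names mirror the exercise, values are invented.
toy = pd.DataFrame({
    'group': ['Control', 'Control', 'Control', 'Control',
              'MedDiet + VOO', 'MedDiet + VOO'],
    'event': [1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby('group')['event']

# sum / count, as in the cell above
by_ratio = grouped.sum() * 100 / grouped.count()

# The mean of 0/1 values is the fraction of ones
by_mean = grouped.mean() * 100

print(by_mean)
```

Both approaches skip missing values in the same way, so they should agree on the real data as well.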

2. Smoking

Did smoking make a difference in the outcome of the study? Calculate how many events occurred by diet group and smoking status. The idea is that you arrive at a table like this:

group Current Former Never
Control ... ... ...
MedDiet + Nuts ... ... ...
MedDiet + VOO ... ... ...

where each entry in the table is the percentage of events for that combination of diet group and smoking status. Hint: use pivot_table

In [32]:
counts = df.pivot_table(index='group', columns='smoke', values='event', aggfunc='sum')
counts
Out[32]:
smoke Current Former Never
group
Control 13 39 44
MedDiet + Nuts 15 20 34
MedDiet + VOO 20 29 34
In [17]:
N = df.pivot_table(index='group', columns='smoke', values='event', aggfunc='count')
N
Out[17]:
smoke Current Former Never
group
Control 264 485 1267
MedDiet + Nuts 291 539 1247
MedDiet + VOO 290 531 1331
In [18]:
counts*100/N
Out[18]:
smoke Current Former Never
group
Control 4.924242 8.041237 3.472770
MedDiet + Nuts 5.154639 3.710575 2.726544
MedDiet + VOO 6.896552 5.461394 2.554470
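The two pivot tables and the division can also be collapsed into a single call, again relying on event being coded as 0/1: aggregating with 'mean' gives the fraction of events per cell. The sketch below checks this on an invented mini-dataset (toy is hypothetical, not the patient data).

```python
import pandas as pd

# Hypothetical mini-dataset; column names mirror the exercise, values are invented.
toy = pd.DataFrame({
    'group': ['Control', 'Control', 'Control', 'Control',
              'MedDiet + Nuts', 'MedDiet + Nuts', 'MedDiet + Nuts', 'MedDiet + Nuts'],
    'smoke': ['Current', 'Current', 'Never', 'Never',
              'Current', 'Current', 'Never', 'Never'],
    'event': [1, 0, 0, 0, 1, 1, 0, 1],
})

# Two-step version: events per cell divided by patients per cell
counts = toy.pivot_table(index='group', columns='smoke', values='event', aggfunc='sum')
n = toy.pivot_table(index='group', columns='smoke', values='event', aggfunc='count')
two_step = counts * 100 / n

# One-step version: the mean of 0/1 values is the event fraction
one_step = toy.pivot_table(index='group', columns='smoke', values='event', aggfunc='mean') * 100

print(one_step)
```

Keeping the count table N around is still useful, since percentages based on small cells are less reliable.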

3. Age differences?

Finally, check that there were no big differences in age between the groups.

  • Calculate the mean and standard deviation of the patients' age, separated by diet group and sex.

You should get a table with the diet groups in the rows and sex in the columns, like this (mean age):

group Female Male
Control 68 66.4
MedDiet + Nuts 67.4 65.8
MedDiet + VOO 67.7 66.1
In [33]:
# this works but it is longer than necessary
df.groupby(['group', 'sex'])['age'].mean().reset_index().pivot_table(index='group', columns='sex', values='age').round(1)
Out[33]:
sex Female Male
group
Control 68.0 66.4
MedDiet + Nuts 67.4 65.8
MedDiet + VOO 67.7 66.1
In [34]:
# instead of grouping first, do the pivot first and pass the aggregation function as an argument
df.pivot_table(index='group', columns='sex', values='age', aggfunc='mean').round(1)
Out[34]:
sex Female Male
group
Control 68.0 66.4
MedDiet + Nuts 67.4 65.8
MedDiet + VOO 67.7 66.1
In [37]:
# To get the standard deviation you could do the same as last time but pass aggfunc='std'. This returns another dataframe.

# Alternatively, you can calculate both mean and S.D. in one step. For that you can pass more than one value for the aggregation function
df.pivot_table(index='group', columns='sex', values='age', aggfunc=['mean', 'std']).round(1)
Out[37]:
mean std
sex Female Male Female Male
group
Control 68.0 66.4 6.0 6.6
MedDiet + Nuts 67.4 65.8 5.6 6.4
MedDiet + VOO 67.7 66.1 5.8 6.6
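For comparison, the same table can be built with the split-apply-combine route: group by both columns, aggregate with several functions, then move sex into the columns with .unstack(). This is a sketch on an invented mini-dataset (toy is hypothetical, not the patient data). Note that pandas computes 'std' as the sample standard deviation (ddof=1).

```python
import pandas as pd

# Hypothetical mini-dataset; column names mirror the exercise, values are invented.
toy = pd.DataFrame({
    'group': ['Control', 'Control', 'Control', 'Control',
              'MedDiet + VOO', 'MedDiet + VOO', 'MedDiet + VOO', 'MedDiet + VOO'],
    'sex': ['Female', 'Female', 'Male', 'Male',
            'Female', 'Female', 'Male', 'Male'],
    'age': [70, 66, 64, 68, 72, 62, 65, 67],
})

# Pivot first, passing both aggregation functions
via_pivot = toy.pivot_table(index='group', columns='sex', values='age',
                            aggfunc=['mean', 'std'])

# Group first, aggregate, then move 'sex' from the row index to the columns
via_groupby = (
    toy.groupby(['group', 'sex'])['age']
       .agg(['mean', 'std'])
       .unstack('sex')
)

print(via_groupby.round(1))
```

Both results have two column levels (statistic, then sex), so a single cell can be selected with a tuple, for example via_groupby.loc['Control', ('mean', 'Female')].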