Exercise: Compute summary statistics
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
Load the patient data
df = pd.read_csv('processed_data_predimed.csv')
df.shape
df.head()
1. Did the Mediterranean diet help prevent cardiovascular events?
To answer this question, we need to compute how many cardiovascular "events" occurred in each group of participants, separated by the diet they followed.
In the data, the column event contains Yes or No, indicating whether that patient had a cardiovascular event. The column group contains the diet they followed.
- Convert the column event from string to binary (1 for Yes, 0 for No); this will simplify the calculations that follow. Hint: use the method .map()
df['event'] = df['event'].map({'Yes': 1, 'No': 0})
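One thing worth checking after a .map() call is whether every value was covered by the dictionary, because values that are not in it become NaN rather than raising an error. The sketch below illustrates this on a small made-up Series (the toy values are an assumption for illustration, not taken from the PREDIMED file):

```python
import pandas as pd

# Toy data for illustration only; the lowercase 'yes' is a deliberate mismatch
toy = pd.Series(['Yes', 'No', 'No', 'yes'])

# Same mapping as in the exercise
mapped = toy.map({'Yes': 1, 'No': 0})

# Values absent from the dictionary are turned into NaN, so count them
n_unmapped = int(mapped.isna().sum())
print(mapped)
print('unmapped values:', n_unmapped)
```

On the real data, running df['event'].isna().sum() after the conversion would be a quick way to confirm that nothing was lost.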
- Now compute the total number of events by diet group. Compare the numbers and see if you can answer the question.
df.groupby('group')['event'].sum()
- Check how many patients were in each group
df.groupby('group')['event'].count()
The groups did not have the same number of patients, so to be precise we need to put the event counts in proportion to the group totals. For that:
- Calculate how many events occurred relative to the number of patients in each group (as a percentage). Do this separately for each diet group.
df.groupby('group')['event'].sum()*100 / df.groupby('group')['event'].count()
It seems that the control group had a higher percentage of events than the other two groups.
2. Smoking
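Because event is now coded as 0/1, the mean of the column within a group is the fraction of patients with an event, so the percentage can also be obtained in a single aggregation. A minimal sketch on made-up data (the group labels A and B and the values are hypothetical) comparing both routes:

```python
import pandas as pd

# Hypothetical toy data: group A has 1 event in 4 patients, group B has 1 in 2
toy = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B'],
    'event': [1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby('group')['event']

# Route used in the exercise: events divided by patients, as a percentage
by_ratio = grouped.sum() * 100 / grouped.count()

# Shortcut: the mean of a 0/1 column is the event rate
by_mean = grouped.mean() * 100

print(by_ratio)
print(by_mean)
```

On the patient data the equivalent call would be df.groupby('group')['event'].mean() * 100.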
Did smoking make a difference in the outcome of the study? Calculate how many events occurred by diet group and smoking status. The idea is that you arrive at a table like this:
group | Current | Former | Never |
---|---|---|---|
Control | ... | ... | ... |
MedDiet + Nuts | ... | ... | ... |
MedDiet + VOO | ... | ... | ... |
where each entry in the table is the percentage of events for that combination of diet group and smoking status.
Hint: use pivot_table
counts = df.pivot_table(index='group', columns='smoke', values='event', aggfunc='sum')
counts
N = df.pivot_table(index='group', columns='smoke', values='event', aggfunc='count')
N
counts*100/N
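The two pivot tables can also be collapsed into one by asking pivot_table for the mean of the 0/1 event column, which is the event rate in each cell. The following is a sketch on invented data (group labels and values are hypothetical), not output from the study:

```python
import pandas as pd

# Hypothetical toy data with a binary event column
toy = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'smoke': ['Never', 'Never', 'Current', 'Never', 'Current', 'Current'],
    'event': [1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column per (group, smoke) cell, scaled to a percentage
rates = toy.pivot_table(index='group', columns='smoke',
                        values='event', aggfunc='mean') * 100
print(rates)
```

Keeping the separate counts and N tables, as done above, still has the advantage of showing how many patients each percentage is based on.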
3. Age differences?
Finally, check that there were no big differences in age between the groups.
- Calculate the mean and standard deviation of the patients' age, separated by diet group and sex.
For the mean, you should get a table with the diet groups in the rows and sex in the columns, like this:
group | Female | Male |
---|---|---|
Control | 68 | 66.4 |
MedDiet + Nuts | 67.4 | 65.8 |
MedDiet + VOO | 67.7 | 66.1 |
# this works but it is longer than necessary
df.groupby(['group', 'sex'])['age'].mean().reset_index().pivot_table(index='group', columns='sex', values='age').round(1)
# instead of grouping first, do the pivot first and pass the aggregation function as an argument
df.pivot_table(index='group', columns='sex', values='age', aggfunc='mean').round(1)
# to get the standard deviation you could do the same as above but pass aggfunc='std'. This returns another dataframe.
# Alternatively, you can calculate both mean and S.D. in one step by passing a list of aggregation functions
df.pivot_table(index='group', columns='sex', values='age', aggfunc=['mean', 'std']).round(1)
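When a list of aggregation functions is passed, the resulting columns have two levels: the function name on top and the sex categories below. The sketch below, using invented ages and hypothetical group labels, shows one way to pull individual pieces out of such a table:

```python
import pandas as pd

# Hypothetical toy data: two patients per (group, sex) cell
toy = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'sex': ['Female', 'Female', 'Male', 'Male'] * 2,
    'age': [60, 70, 62, 66, 64, 68, 65, 71],
})

summary = toy.pivot_table(index='group', columns='sex',
                          values='age', aggfunc=['mean', 'std'])

# Selecting the top-level key returns the sub-table for that function
means = summary['mean']

# A (function, sex) tuple addresses a single column
male_std = summary[('std', 'Male')]

print(means)
print(male_std)
```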