Exercise: Compute summary statistics
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
Load the patient data
df = pd.read_csv('processed_data_predimed.csv')
df.shape
df.head()
1. Did the Mediterranean diet help prevent cardiovascular events?
To answer this question, we need to compute how many cardiovascular "events" occurred in each group of participants, separated by the diet they followed.
In the data, the column `event` contains `Yes` or `No`, indicating whether that patient had a cardiovascular event. The column `group` contains the diet they followed.
We first convert the column `event` to a binary value.
df['event'] = df['event'].map({'Yes': 1, 'No': 0})
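A quick sanity check for this conversion, sketched on made-up data (the `toy` frame is hypothetical and not part of the PREDIMED file): `Series.map` with a dictionary returns NaN for any value missing from the dictionary, so it can be worth confirming that nothing was lost.

```python
import pandas as pd

# Hypothetical toy data standing in for the real 'event' column
toy = pd.DataFrame({'event': ['Yes', 'No', 'No', 'Yes', 'No']})

# Map the labels to 1/0; labels not in the dictionary would become NaN
toy['event'] = toy['event'].map({'Yes': 1, 'No': 0})

# Number of values that could not be mapped (0 means the conversion was complete)
n_unmapped = toy['event'].isna().sum()
print(toy['event'].tolist(), n_unmapped)
```

Running the same `isna().sum()` check on `df['event']` after the conversion should flag unexpected labels such as different capitalisation.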
- Now compute the total number of events by diet group. Compare the numbers and see if you can answer the question.
# your code here:
df.groupby('group')['event'].sum()
- Check how many patients were in each group.
# your code here:
df.groupby('group')['event'].count()
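One detail that may matter when counting patients: `count()` only counts non-missing values, whereas `size()` counts rows. The sketch below uses a hypothetical `toy` frame with one missing event to show the difference.

```python
import pandas as pd

# Hypothetical toy data: group A has one patient with a missing event value
toy = pd.DataFrame({
    'group': ['A', 'A', 'B'],
    'event': [1.0, None, 0.0],
})

# count() ignores missing values in the selected column
counts = toy.groupby('group')['event'].count()

# size() counts every row in the group, missing or not
sizes = toy.groupby('group').size()

print(counts.to_dict())
print(sizes.to_dict())
```

If the `event` column has no missing values, both approaches give the same group sizes.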
The groups are not the same size, so to compare them fairly we need to express the number of events relative to the number of patients in each group. For that:
- Calculate how many events occurred relative to the number of patients in each group (as a percentage). Do this separately for each diet group.
# your code here:
df.groupby('group')['event'].mean()*100
It seems that the control group had a higher percentage of events than the other two groups.
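The percentage calculation relies on the fact that the mean of a 0/1 column equals the fraction of ones. A minimal sketch on hypothetical toy data (the groups `A` and `B` are made up) compares the explicit ratio with the `mean()` shortcut:

```python
import pandas as pd

# Hypothetical toy data: two diet groups of different sizes
toy = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B'],
    'event': [1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby('group')['event']

# Explicit calculation: events divided by patients, times 100
explicit = grouped.sum() / grouped.count() * 100

# Shortcut: the mean of a 0/1 column is the fraction of ones
shortcut = grouped.mean() * 100

print(explicit.to_dict())
print(shortcut.to_dict())
```

Both groups have exactly one event, yet the percentages differ because the group sizes differ, which is why raw counts alone can mislead.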
2. Smoking
Did smoking make a difference in the outcome of the study? Calculate the percentage of events by diet group and smoking status. The idea is that you arrive at a table like this:
group | Current | Former | Never |
---|---|---|---|
Control | ... | ... | ... |
MedDiet + Nuts | ... | ... | ... |
MedDiet + VOO | ... | ... | ... |
where each entry in the table is the percentage of events for that combination of diet group and smoking status.
Hint: use `pivot_table`.
# your code here
(df.pivot_table(index='group', columns='smoke', values='event')*100).round(2)
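To see what `pivot_table` does here, the sketch below (on a hypothetical `toy` frame) reproduces it with a `groupby` on both columns followed by `unstack`. By default, `pivot_table` aggregates with the mean, which for a 0/1 column is the event rate.

```python
import pandas as pd

# Hypothetical toy data with two grouping columns
toy = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'smoke': ['Never', 'Never', 'Current', 'Never', 'Current', 'Current'],
    'event': [0, 1, 1, 0, 0, 1],
})

# pivot_table aggregates with the mean by default
pivot = toy.pivot_table(index='group', columns='smoke', values='event') * 100

# Same numbers: group by both keys, then move 'smoke' into the columns
manual = toy.groupby(['group', 'smoke'])['event'].mean().unstack('smoke') * 100

print(pivot)
print(manual)
```

The `pivot_table` form is shorter, while the `groupby` form makes the two-step logic (aggregate, then reshape) explicit.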
3. Age differences?
Finally, check that there were no big differences in age between the groups.
- Calculate the mean and standard deviation of the patients' age, separated by diet group.
For the mean age, you should get a table with the diet groups in the rows and gender in the columns, like this:
group | Female | Male |
---|---|---|
Control | 68 | 66.4 |
MedDiet + Nuts | 67.4 | 65.8 |
MedDiet + VOO | 67.7 | 66.1 |
# your code here:
# Assumes the gender column is named 'sex'; adjust if your file uses another name
df.pivot_table(index='group', columns='sex', values='age').round(1)
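For the mean and standard deviation per diet group, one possible approach is `groupby` with `agg`. The sketch below uses a hypothetical `toy` frame, not the real data. Note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical toy data: ages for two made-up diet groups
toy = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'age': [60, 65, 70, 66, 68, 70],
})

# Mean and sample standard deviation of age per group, one column each
age_stats = toy.groupby('group')['age'].agg(['mean', 'std'])
print(age_stats)
```

The same pattern should work on the real data by replacing `toy` with `df`.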