Exercise: Compute summary statistics¶

In [1]:

%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

Load the patient data¶

In [2]:

df = pd.read_csv('processed_data_predimed.csv')

In [3]:

df.shape

Out[3]:

(6245, 18)

In [4]:

df.head()

Out[4]:

	patient-id	location-id	sex	age	smoke	bmi	waist	wth	htn	diab	hyperchol	famhist	hormo	p14	toevent	event	group	City
0	1	1	Female	77	Never	25.92	94	0.657343	Yes	No	Yes	Yes	No	9	5.538672	No	MedDiet + VOO	Madrid
1	2	1	Female	68	Never	34.85	150	0.949367	Yes	No	Yes	Yes	NaN	10	3.063655	No	MedDiet + Nuts	Madrid
2	3	1	Female	66	Never	37.50	120	0.750000	Yes	Yes	No	No	No	6	5.590691	No	MedDiet + Nuts	Madrid
3	4	1	Female	77	Never	29.26	93	0.628378	Yes	Yes	No	No	No	6	5.456537	No	MedDiet + VOO	Madrid
4	5	1	Female	60	Never	30.02	104	0.662420	Yes	No	Yes	No	No	9	2.746064	No	Control	Madrid

1. Did the Mediterranean diet help prevent cardiovascular events?¶

To answer this question, we need to compute how many cardiovascular "events" occurred in each group of participants, separated by the diet they followed. In the data, the column event contains Yes or No, indicating whether that patient had a cardiovascular event. The column group indicates which diet they followed.

We first convert the column "event" to a binary value.

In [5]:

df['event'] = df['event'].map({'Yes': 1, 'No': 0})
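A quick sanity check after this kind of conversion can be useful: `Series.map` with a dictionary turns any label that is not a key of the dictionary into NaN. The sketch below uses a small made-up Series (not the study data) with one deliberately misspelled label to show how such values could be counted.

```python
import pandas as pd

# Toy labels, invented for illustration; 'yes' is a deliberate misspelling.
raw = pd.Series(['Yes', 'No', 'yes', 'No'])

# Same mapping as in the cell above.
mapped = raw.map({'Yes': 1, 'No': 0})

# Labels missing from the dictionary become NaN, so count them.
n_unmapped = mapped.isna().sum()
print(n_unmapped)
```

If the count is zero on the real column, every row was either Yes or No.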

Now compute the total number of events by diet group. Compare the numbers and see if you can answer the question.

In [12]:

# your code here:
event_by_group = df.groupby('group')['event'].sum()

Check how many patients were in each group.

In [13]:

# your code here:
group_size = df.groupby('group').size()

The groups are not the same size, so for a fair comparison we need to express the event counts relative to the number of patients in each group.

Calculate how many events occurred relative to the number of patients in each group (as a percentage). Do this separately for each diet group.

In [15]:

# your code here:
event_by_group / group_size * 100

Out[15]:

group
Control           4.761905
MedDiet + Nuts    3.322099
MedDiet + VOO     3.856877
dtype: float64

It seems that the control group had a higher percentage of events (about 4.8%) than the two Mediterranean diet groups (about 3.3% and 3.9%).
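Because event is coded as 0/1, the mean of that column within a group is the fraction of patients with an event, so the percentage can also be computed in one step. A minimal sketch on a small made-up DataFrame (the values are invented for illustration and are not taken from the study data):

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'group': ['Control', 'Control', 'Control', 'Control',
              'MedDiet + VOO', 'MedDiet + VOO'],
    'event': [1, 0, 0, 0, 1, 0],
})

# Two-step version, as in the cells above: event count divided by group size.
rate_two_step = (toy.groupby('group')['event'].sum()
                 / toy.groupby('group').size() * 100)

# One-step version: the mean of a 0/1 column is the fraction of ones.
rate_one_step = toy.groupby('group')['event'].mean() * 100

print(rate_one_step)
```

The two versions agree as long as event has no missing values, since `mean` skips NaN while `size` counts every row.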

2. Smoking¶

Did smoking make a difference in the outcome of the study? Calculate how many events occurred by diet group and smoking status. The goal is to arrive at a table like this:

group	Current	Former	Never
Control	...	...	...
MedDiet + Nuts	...	...	...
MedDiet + VOO	...	...	...

where each entry is the percentage of events for that combination of diet group and smoking status.

Hint: use pivot_table

In [20]:

# your code here
df.pivot_table(index='group', columns='smoke', values='event', aggfunc='sum')

Out[20]:

smoke	Current	Former	Never
group
Control	13	39	44
MedDiet + Nuts	15	20	34
MedDiet + VOO	20	29	34
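The table above holds raw event counts. Since the exercise asks for percentages, one possible variation is to aggregate with 'mean' instead of 'sum' and multiply by 100, which works because event is coded as 0/1. A minimal sketch on made-up data (the toy values are assumptions for illustration, not study results):

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'group': ['Control'] * 4 + ['MedDiet + VOO'] * 4,
    'smoke': ['Never', 'Never', 'Current', 'Current'] * 2,
    'event': [1, 0, 1, 1, 0, 0, 1, 0],
})

# Mean of the 0/1 event column per (group, smoke) cell, as a percentage.
pct = toy.pivot_table(index='group', columns='smoke',
                      values='event', aggfunc='mean') * 100

print(pct)
```

Each cell is then the share of patients in that diet group and smoking category who had an event, which makes cells with different numbers of patients comparable.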

3. Age differences?¶

Finally, check that there were no big differences in the age between the groups.

Calculate the mean and standard deviation of the patients' age, separated by diet group and sex.

You should get a table with the diet groups in the rows and sex in the columns, like this (mean values shown):

group	Female	Male
Control	68	66.4
MedDiet + Nuts	67.4	65.8
MedDiet + VOO	67.7	66.1

In [22]:

# your code here:
df.pivot_table(index='group', columns='sex', values='age', aggfunc=['mean','std'])

Out[22]:

	mean		std
sex	Female	Male	Female	Male
group
Control	68.009046	66.400000	5.979313	6.605266
MedDiet + Nuts	67.414591	65.822665	5.580050	6.403373
MedDiet + VOO	67.668775	66.080045	5.816703	6.621440
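If only the comparison between diet groups is of interest, without the split by sex, a `groupby` with `agg` gives the same two statistics in a flat table. A minimal sketch on made-up ages (the values are invented for illustration); note that the pandas `std` uses the sample standard deviation (`ddof=1`) by default:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'group': ['Control', 'Control', 'MedDiet + Nuts', 'MedDiet + Nuts'],
    'age': [60, 70, 65, 67],
})

# One row per diet group, one column per statistic.
age_summary = toy.groupby('group')['age'].agg(['mean', 'std'])

print(age_summary)
```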
