Split-apply-combine operations for tabular data

In [1]:

import pandas as pd

In [2]:

data = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12, 'LEFT'],
        ['312', 'A2', 0.37, 'LEFT'],
        ['312', 'C2', 0.68, 'LEFT'],
        ['313', 'A1', 0.07, 'RIGHT'],
        ['313', 'B1', 0.08, 'RIGHT'],
        ['314', 'A2', 0.29, 'LEFT'],
        ['314', 'B1', 0.14, 'RIGHT'],
        ['314', 'C2', 0.73, 'RIGHT'],
        ['711', 'A1', 4.01, 'RIGHT'],
        ['712', 'A2', 3.29, 'LEFT'],
        ['713', 'B1', 5.74, 'LEFT'],
        ['714', 'B2', 3.32, 'RIGHT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)
data

Out[2]:

	subject_id	condition_id	response_time	response
0	312	A1	0.12	LEFT
1	312	A2	0.37	LEFT
2	312	C2	0.68	LEFT
3	313	A1	0.07	RIGHT
4	313	B1	0.08	RIGHT
5	314	A2	0.29	LEFT
6	314	B1	0.14	RIGHT
7	314	C2	0.73	RIGHT
8	711	A1	4.01	RIGHT
9	712	A2	3.29	LEFT
10	713	B1	5.74	LEFT
11	714	B2	3.32	RIGHT

Group-by

We want to compute the mean response time by condition.

Let's start by doing it by hand, using for loops!

In [14]:

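# Split: find the distinct conditions
conditions = data['condition_id'].unique()
results_dict = {}
for condition in conditions:
    # Apply: select the rows of one condition and average their response times
    group = data[data['condition_id'] == condition]
    results_dict[condition] = group['response_time'].mean()

# Combine: collect the per-condition means into a one-column DataFrame
results = pd.DataFrame([results_dict], index=['response_time']).T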

In [15]:

results

Out[15]:

	response_time
A1	1.400000
A2	1.316667
C2	0.705000
B1	1.986667
B2	3.320000

This is a basic operation, and we would need to repeat this pattern a million times!

Pandas, like most tools for tabular data, provides a command for performing operations on groups.

In [29]:

# df.groupby(column_name) groups a DataFrame by the values in the column
data.groupby('condition_id')

Out[29]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x14ff67a90>

In [3]:

# The group-by object can be used much like a DataFrame.
# Operations are executed on each group individually, then the results are combined
data.groupby('condition_id').size()

Out[3]:

condition_id
A1    3
A2    3
B1    3
B2    1
C2    2
dtype: int64
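
The comment in the cell above compresses three steps: split the table by key, apply an operation to each piece, and combine the results. A minimal sketch of those steps written out, using a few rows copied from the example table (the names `sample`, `sizes`, `manual_sizes`, and `builtin_sizes` are illustrative and not part of the notebook):

```python
import pandas as pd

# A few rows copied from the example table (illustrative subset)
sample = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12, 'LEFT'],
        ['312', 'A2', 0.37, 'LEFT'],
        ['313', 'A1', 0.07, 'RIGHT'],
        ['313', 'B1', 0.08, 'RIGHT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)

sizes = {}
# Split: iterating over a group-by object yields (key, sub-DataFrame) pairs
for key, group in sample.groupby('condition_id'):
    # Apply: compute one value per group
    sizes[key] = len(group)

# Combine: gather the per-group values into a single Series
manual_sizes = pd.Series(sizes)
builtin_sizes = sample.groupby('condition_id').size()
```

Both Series should hold the same counts; the built-in `size()` simply hides the loop.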

In [33]:

data.groupby('condition_id')['response_time'].mean()

Out[33]:

condition_id
A1    1.400000
A2    1.316667
B1    1.986667
B2    3.320000
C2    0.705000
Name: response_time, dtype: float64

In [36]:

data.groupby('condition_id')['response_time'].max()

Out[36]:

condition_id
A1    4.01
A2    3.29
B1    5.74
B2    3.32
C2    0.73
Name: response_time, dtype: float64
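
Instead of calling `mean()` and `max()` in separate cells, several statistics can be requested at once with `agg`. A small sketch, assuming a reduced table with a few response times copied from the data above (`sample` and `stats` are illustrative names):

```python
import pandas as pd

# Reduced table: four response times copied from the example data
sample = pd.DataFrame({
    'condition_id': ['A1', 'A2', 'A1', 'B1'],
    'response_time': [0.12, 0.37, 0.07, 0.08],
})

# One row per condition, one column per requested statistic
stats = sample.groupby('condition_id')['response_time'].agg(['mean', 'max', 'count'])
```

The result is a DataFrame indexed by `condition_id` with the columns `mean`, `max`, and `count`.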

Pivot tables

We want to look at response time biases when the subjects respond LEFT vs RIGHT. In principle, we expect them to have the same response time in both cases.

We compute a summary table with 1) condition_id on the rows; 2) response on the columns; 3) the average response time for all experiments with that condition and response.

We can do it with groupby, followed by a table manipulation command that reshapes the result.

In [44]:

summary = data.groupby(['condition_id', 'response'])['response_time'].mean()
summary

Out[44]:

condition_id  response
A1            LEFT        0.120000
              RIGHT       2.040000
A2            LEFT        1.316667
B1            LEFT        5.740000
              RIGHT       0.110000
B2            RIGHT       3.320000
C2            LEFT        0.680000
              RIGHT       0.730000
Name: response_time, dtype: float64

In [45]:

summary.unstack(level=1)

Out[45]:

response	LEFT	RIGHT
condition_id
A1	0.120000	2.04
A2	1.316667	NaN
B1	5.740000	0.11
B2	NaN	3.32
C2	0.680000	0.73
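
`unstack` moves one level of the row index into the columns; combinations that have no entry in the long form become NaN in the wide form. A minimal sketch on a hand-built Series that mimics the shape of `summary`, with three values copied from the output above (`toy_summary`, `wide`, and `missing` are illustrative names). The level is selected by name here, which should be equivalent to `level=1` for this index:

```python
import pandas as pd

# Hand-built Series with a two-level index, mimicking the shape of `summary`
index = pd.MultiIndex.from_tuples(
    [('A1', 'LEFT'), ('A1', 'RIGHT'), ('A2', 'LEFT')],
    names=['condition_id', 'response'],
)
toy_summary = pd.Series([0.12, 2.04, 1.316667], index=index, name='response_time')

# Move the 'response' level from the rows to the columns
wide = toy_summary.unstack(level='response')

# ('A2', 'RIGHT') has no entry in the long form, so the wide form holds NaN there
missing = wide.loc['A2', 'RIGHT']
```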

Pandas has a command called pivot_table that can be used to perform this kind of operation straightforwardly.

In [47]:

data.pivot_table(index='condition_id', columns='response', values='response_time', aggfunc='mean')

Out[47]:

response	LEFT	RIGHT
condition_id
A1	0.120000	2.04
A2	1.316667	NaN
B1	5.740000	0.11
B2	NaN	3.32
C2	0.680000	0.73

In [59]:

(
    data
    .pivot_table(
        index='condition_id', 
        columns='response', 
        values='response_time', 
        aggfunc=['mean', 'std', 'count'],
    )
)

Out[59]:

	mean		std		count
response	LEFT	RIGHT	LEFT	RIGHT	LEFT	RIGHT
condition_id
A1	0.120000	2.04	NaN	2.786001	1.0	2.0
A2	1.316667	NaN	1.709425	NaN	3.0	NaN
B1	5.740000	0.11	NaN	0.042426	1.0	2.0
B2	NaN	3.32	NaN	NaN	NaN	1.0
C2	0.680000	0.73	NaN	NaN	1.0	1.0
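
In the table above, NaN in the count columns marks a condition/response combination that never occurs, and NaN in the std columns also appears wherever the count is 1, because the sample standard deviation needs at least two values. If zeros are preferred for missing counts, `pivot_table` accepts a `fill_value` argument. A small sketch, assuming a subset of rows copied from the example table (`sample` and `counts` are illustrative names):

```python
import pandas as pd

# A few rows copied from the example table (illustrative subset)
sample = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12, 'LEFT'],
        ['312', 'A2', 0.37, 'LEFT'],
        ['313', 'A1', 0.07, 'RIGHT'],
        ['313', 'B1', 0.08, 'RIGHT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)

# Count the trials per condition and response; missing combinations become 0
counts = sample.pivot_table(
    index='condition_id',
    columns='response',
    values='response_time',
    aggfunc='count',
    fill_value=0,
)
```

Filling with 0 makes sense for counts, but not for means or standard deviations, where a missing value is different from a value of zero.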
