Split-apply-combine operations for tabular data
import pandas as pd
data = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12, 'LEFT'],
        ['312', 'A2', 0.37, 'LEFT'],
        ['312', 'C2', 0.68, 'LEFT'],
        ['313', 'A1', 0.07, 'RIGHT'],
        ['313', 'B1', 0.08, 'RIGHT'],
        ['314', 'A2', 0.29, 'LEFT'],
        ['314', 'B1', 0.14, 'RIGHT'],
        ['314', 'C2', 0.73, 'RIGHT'],
        ['711', 'A1', 4.01, 'RIGHT'],
        ['712', 'A2', 3.29, 'LEFT'],
        ['713', 'B1', 5.74, 'LEFT'],
        ['714', 'B2', 3.32, 'RIGHT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)
data
Group-by
We want to compute the mean response time by condition.
Let's start by doing it by hand, using for loops!
conditions = data['condition_id'].unique()
results_dict = {}
for condition in conditions:
    group = data[data['condition_id'] == condition]
    results_dict[condition] = group['response_time'].mean()
results = pd.DataFrame([results_dict], index=['response_time']).T
results
This is a basic operation, and we would need to repeat this pattern a million times!
Pandas and all other tools for tabular data provide a command for performing operations on groups.
# df.groupby(column_name) groups a DataFrame by the values in the column
data.groupby('condition_id')
# The group-by object can be used much like a DataFrame.
# Operations are executed on each group individually, then aggregated
data.groupby('condition_id').size()
data.groupby('condition_id')['response_time'].mean()
data.groupby('condition_id')['response_time'].max()
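When we need several statistics per group, we do not have to call each method separately. The sketch below rebuilds the same table so that it runs on its own and uses the `agg` method of the group-by object with a list of function names. The variable name `stats` is only chosen for illustration.

```python
import pandas as pd

# Same data as above, repeated so that this cell runs on its own
data = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12, 'LEFT'], ['312', 'A2', 0.37, 'LEFT'],
        ['312', 'C2', 0.68, 'LEFT'], ['313', 'A1', 0.07, 'RIGHT'],
        ['313', 'B1', 0.08, 'RIGHT'], ['314', 'A2', 0.29, 'LEFT'],
        ['314', 'B1', 0.14, 'RIGHT'], ['314', 'C2', 0.73, 'RIGHT'],
        ['711', 'A1', 4.01, 'RIGHT'], ['712', 'A2', 3.29, 'LEFT'],
        ['713', 'B1', 5.74, 'LEFT'], ['714', 'B2', 3.32, 'RIGHT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)

# One row per condition, one column per statistic
stats = data.groupby('condition_id')['response_time'].agg(['mean', 'max', 'count'])
print(stats)
```

The result has one row per condition and one column per requested statistic, so it replaces three separate group-by calls.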
Pivot tables
We want to look at response time biases when the subjects respond LEFT vs RIGHT. In principle, we expect them to have the same response time in both cases.
We compute a summary table with 1) condition_id on the rows; 2) response on the columns; 3) the average response time for all experiments with that condition and response.
We can do it with groupby, plus some table manipulation commands.
summary = data.groupby(['condition_id', 'response'])['response_time'].mean()
summary
summary.unstack(level=1)
Pandas has a command called pivot_table that can be used to perform this kind of operation straightforwardly.
data.pivot_table(index='condition_id', columns='response', values='response_time', aggfunc='mean')
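As a sanity check, we can compare the two approaches directly. The sketch below rebuilds the same table so that it runs on its own, computes the summary both ways, and compares the results with `DataFrame.equals`, which treats missing values in the same position as equal. The names `by_groupby` and `by_pivot` are only chosen for illustration.

```python
import pandas as pd

# Same data as above, repeated so that this cell runs on its own
data = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12, 'LEFT'], ['312', 'A2', 0.37, 'LEFT'],
        ['312', 'C2', 0.68, 'LEFT'], ['313', 'A1', 0.07, 'RIGHT'],
        ['313', 'B1', 0.08, 'RIGHT'], ['314', 'A2', 0.29, 'LEFT'],
        ['314', 'B1', 0.14, 'RIGHT'], ['314', 'C2', 0.73, 'RIGHT'],
        ['711', 'A1', 4.01, 'RIGHT'], ['712', 'A2', 3.29, 'LEFT'],
        ['713', 'B1', 5.74, 'LEFT'], ['714', 'B2', 3.32, 'RIGHT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)

# Approach 1: group by both columns, then move 'response' to the columns
by_groupby = (
    data
    .groupby(['condition_id', 'response'])['response_time']
    .mean()
    .unstack(level='response')
)

# Approach 2: pivot_table in a single call
by_pivot = data.pivot_table(
    index='condition_id',
    columns='response',
    values='response_time',
    aggfunc='mean',
)

# Combinations that never occur (e.g. B2 with LEFT) are NaN in both tables
print(by_groupby.equals(by_pivot))
```

For this data the two tables should agree, including the missing entries for condition and response combinations that were never observed.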
(
    data
    .pivot_table(
        index='condition_id',
        columns='response',
        values='response_time',
        aggfunc=['mean', 'std', 'count'],
    )
)
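With a list of aggregation functions, the resulting table has two levels of columns: the function name on top and the response below. The sketch below rebuilds the same table so that it runs on its own and shows one way to pull out the sub-table for a single statistic. The names `table`, `means` and `counts` are only chosen for illustration.

```python
import pandas as pd

# Same data as above, repeated so that this cell runs on its own
data = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12, 'LEFT'], ['312', 'A2', 0.37, 'LEFT'],
        ['312', 'C2', 0.68, 'LEFT'], ['313', 'A1', 0.07, 'RIGHT'],
        ['313', 'B1', 0.08, 'RIGHT'], ['314', 'A2', 0.29, 'LEFT'],
        ['314', 'B1', 0.14, 'RIGHT'], ['314', 'C2', 0.73, 'RIGHT'],
        ['711', 'A1', 4.01, 'RIGHT'], ['712', 'A2', 3.29, 'LEFT'],
        ['713', 'B1', 5.74, 'LEFT'], ['714', 'B2', 3.32, 'RIGHT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)

table = data.pivot_table(
    index='condition_id',
    columns='response',
    values='response_time',
    aggfunc=['mean', 'std', 'count'],
)

# Selecting on the top column level returns a plain LEFT/RIGHT table
means = table['mean']
counts = table['count']
print(means)
print(counts)
```

Looking at the counts next to the means is useful here, because several cells are based on a single observation, for which the standard deviation is not defined.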