41 KiB
41 KiB
Window functions for tabular data¶
In [1]:
import pandas as pd
Load experimental data¶
In [2]:
df = pd.read_csv('timed_responses.csv', index_col=0)
In [3]:
df
Out[3]:
Split-apply-combine operations return one aggregated value per group¶
In [4]:
df.groupby('subject_id')['accuracy'].max()
Out[4]:
However, for some calculations we need to have a value per row¶
For example: for each subject, rank the responses by decreasing accuracy
In [5]:
df.groupby('subject_id')['accuracy'].rank()
Out[5]:
In [6]:
df['accuracy_rank'] = df.groupby('subject_id')['accuracy'].rank(ascending=False)
df
Out[6]:
In [7]:
df.sort_values(['subject_id', 'accuracy_rank'])
Out[7]:
In many cases, a window functions is combined with a sorting operation¶
For example: for each subject, count the number of "LEFT" responses up until any moment in the experiment
In [8]:
# Add a flag column "is_left", so that we can count the number of LEFT reponses using a cumulative sum
df['is_left'] = df['response'] == 'LEFT'
df
Out[8]:
In [9]:
# Without sorting, we get the number of LEFT responses... in no particular order
df['nr_lefts'] = df.groupby('subject_id')['is_left'].cumsum()
df.sort_values(['subject_id'])
Out[9]:
Window functions are also useful to compute changes in the data for each group¶
In this case, the window function often uses the shift(n)
method that lags the data by n
rows
In [10]:
df['shifted time'] = (
df
.sort_values('time (ms)')
.groupby('subject_id')['time (ms)']
.shift(1)
)
df.sort_values(['subject_id', 'time (ms)'])[['subject_id', 'time (ms)', 'shifted time']]
Out[10]:
In [11]:
df['time from prev'] = df['time (ms)'] - df['shifted time']
df.sort_values(['subject_id', 'time (ms)'])[['subject_id', 'time (ms)', 'time from prev']]
Out[11]:
In [ ]:
In [ ]:
In [ ]: