Window functions for tabular data¶

In [1]:

import pandas as pd

Load experimental data¶

In [2]:

df = pd.read_csv('timed_responses.csv', index_col=0)

In [3]:

df

Out[3]:

	subject_id	time (ms)	response	accuracy
574	3	540	RIGHT	0.04
1190	2	552	LEFT	0.43
1895	2	1036	LEFT	0.36
53	3	257	RIGHT	0.11
158	2	743	RIGHT	0.32
551	3	619	LEFT	0.25
1602	1	43	RIGHT	0.65
413	1	471	LEFT	0.80
785	1	121	LEFT	0.10
1393	2	903	RIGHT	0.33
629	2	353	LEFT	0.17
1829	3	768	RIGHT	0.26
902	1	1093	LEFT	0.34
1486	2	3	RIGHT	0.29

Split-apply-combine operations return one aggregated value per group¶

In [4]:

df.groupby('subject_id')['accuracy'].max()

Out[4]:

subject_id
1    0.80
2    0.43
3    0.26
Name: accuracy, dtype: float64

However, for some calculations we need to have a value per row¶

For example: for each subject, rank the responses by decreasing accuracy

In [5]:

# By default, rank() is ascending: the lowest accuracy in each group gets rank 1
df.groupby('subject_id')['accuracy'].rank()

Out[5]:

574     1.0
1190    6.0
1895    5.0
53      2.0
158     3.0
551     3.0
1602    3.0
413     4.0
785     1.0
1393    4.0
629     1.0
1829    4.0
902     2.0
1486    2.0
Name: accuracy, dtype: float64

In [6]:

# Pass ascending=False so that the highest accuracy in each group gets rank 1
df['accuracy_rank'] = df.groupby('subject_id')['accuracy'].rank(ascending=False)
df

Out[6]:

	subject_id	time (ms)	response	accuracy	accuracy_rank
574	3	540	RIGHT	0.04	4.0
1190	2	552	LEFT	0.43	1.0
1895	2	1036	LEFT	0.36	2.0
53	3	257	RIGHT	0.11	3.0
158	2	743	RIGHT	0.32	4.0
551	3	619	LEFT	0.25	2.0
1602	1	43	RIGHT	0.65	2.0
413	1	471	LEFT	0.80	1.0
785	1	121	LEFT	0.10	4.0
1393	2	903	RIGHT	0.33	3.0
629	2	353	LEFT	0.17	6.0
1829	3	768	RIGHT	0.26	1.0
902	1	1093	LEFT	0.34	3.0
1486	2	3	RIGHT	0.29	5.0

In [7]:

df.sort_values(['subject_id', 'accuracy_rank'])

Out[7]:

	subject_id	time (ms)	response	accuracy	accuracy_rank
413	1	471	LEFT	0.80	1.0
1602	1	43	RIGHT	0.65	2.0
902	1	1093	LEFT	0.34	3.0
785	1	121	LEFT	0.10	4.0
1190	2	552	LEFT	0.43	1.0
1895	2	1036	LEFT	0.36	2.0
1393	2	903	RIGHT	0.33	3.0
158	2	743	RIGHT	0.32	4.0
1486	2	3	RIGHT	0.29	5.0
629	2	353	LEFT	0.17	6.0
1829	3	768	RIGHT	0.26	1.0
551	3	619	LEFT	0.25	2.0
53	3	257	RIGHT	0.11	3.0
574	3	540	RIGHT	0.04	4.0
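The difference between an aggregation and a window function can be sketched on a small, self-contained frame. The rows below are a subset of the table above; the column name `best_accuracy` and the use of `transform('max')` are additions for illustration only and are not part of the original notebook.

```python
import pandas as pd

# Illustrative subset of the rows shown above (subjects 1 and 3 only).
df = pd.DataFrame(
    {
        "subject_id": [1, 1, 1, 3, 3],
        "accuracy": [0.65, 0.80, 0.10, 0.04, 0.26],
    },
    index=[1602, 413, 785, 574, 1829],
)

grouped = df.groupby("subject_id")["accuracy"]

# Aggregation: the result has one row per group.
per_group = grouped.max()

# Window functions: the results have one row per original row,
# and keep the original index, so they can be assigned as columns.
per_row_rank = grouped.rank(ascending=False)
per_row_best = grouped.transform("max")  # hypothetical extra column

df["accuracy_rank"] = per_row_rank
df["best_accuracy"] = per_row_best
print(per_group)
print(df)
```

Because the per-row results share the index of `df`, the assignment lines up row by row regardless of how the frame is ordered.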

In many cases, a window function is combined with a sorting operation¶

For example: for each subject, count the number of "LEFT" responses up until any moment in the experiment

In [8]:

# Add a flag column "is_left", so that we can count the number of LEFT responses using a cumulative sum
df['is_left'] = df['response'] == 'LEFT'
df

Out[8]:

	subject_id	time (ms)	response	accuracy	accuracy_rank	is_left
574	3	540	RIGHT	0.04	4.0	False
1190	2	552	LEFT	0.43	1.0	True
1895	2	1036	LEFT	0.36	2.0	True
53	3	257	RIGHT	0.11	3.0	False
158	2	743	RIGHT	0.32	4.0	False
551	3	619	LEFT	0.25	2.0	True
1602	1	43	RIGHT	0.65	2.0	False
413	1	471	LEFT	0.80	1.0	True
785	1	121	LEFT	0.10	4.0	True
1393	2	903	RIGHT	0.33	3.0	False
629	2	353	LEFT	0.17	6.0	True
1829	3	768	RIGHT	0.26	1.0	False
902	1	1093	LEFT	0.34	3.0	True
1486	2	3	RIGHT	0.29	5.0	False

In [9]:

# Without sorting by time first, the cumulative sum follows the current row order of the DataFrame, not chronological order
df['nr_lefts'] = df.groupby('subject_id')['is_left'].cumsum()
df.sort_values(['subject_id'])

Out[9]:

	subject_id	time (ms)	response	accuracy	accuracy_rank	is_left	nr_lefts
1602	1	43	RIGHT	0.65	2.0	False	0
413	1	471	LEFT	0.80	1.0	True	1
785	1	121	LEFT	0.10	4.0	True	2
902	1	1093	LEFT	0.34	3.0	True	3
1190	2	552	LEFT	0.43	1.0	True	1
1895	2	1036	LEFT	0.36	2.0	True	2
158	2	743	RIGHT	0.32	4.0	False	2
1393	2	903	RIGHT	0.33	3.0	False	2
629	2	353	LEFT	0.17	6.0	True	3
1486	2	3	RIGHT	0.29	5.0	False	3
574	3	540	RIGHT	0.04	4.0	False	0
53	3	257	RIGHT	0.11	3.0	False	0
551	3	619	LEFT	0.25	2.0	True	1
1829	3	768	RIGHT	0.26	1.0	False	1
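To get a count that follows the course of the experiment, the rows can be sorted by time before the cumulative sum is taken. The following is a minimal sketch on a subset of the rows above (subjects 1 and 3); the column name `nr_lefts_so_far` is made up for this example.

```python
import pandas as pd

# Illustrative subset of the rows shown above, in their original (unsorted) order.
df = pd.DataFrame(
    {
        "subject_id": [3, 3, 3, 1, 1, 1, 3, 1],
        "time (ms)": [540, 257, 619, 43, 471, 121, 768, 1093],
        "response": ["RIGHT", "RIGHT", "LEFT", "RIGHT",
                     "LEFT", "LEFT", "RIGHT", "LEFT"],
    },
    index=[574, 53, 551, 1602, 413, 785, 1829, 902],
)
df["is_left"] = df["response"] == "LEFT"

# Sort by time first, then take the cumulative sum within each subject.
# The result keeps the original index labels, so assigning it back to df
# aligns each value with the right row even though df itself is unsorted.
df["nr_lefts_so_far"] = (
    df.sort_values("time (ms)")
    .groupby("subject_id")["is_left"]
    .cumsum()
)

result = df.sort_values(["subject_id", "time (ms)"])
print(result[["subject_id", "time (ms)", "response", "nr_lefts_so_far"]])
```

Sorting inside the chained expression leaves `df` in its original order; only the computation sees the chronological ordering.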

Window functions are also useful to compute changes in the data for each group¶

In this case, the window function often uses the shift(n) method, which lags the data by n rows within each group

In [10]:

df['shifted time'] = (
    df
    .sort_values('time (ms)')
    .groupby('subject_id')['time (ms)']
    .shift(1)
)
df.sort_values(['subject_id', 'time (ms)'])[['subject_id', 'time (ms)', 'shifted time']]

Out[10]:

	subject_id	time (ms)	shifted time
1602	1	43	NaN
785	1	121	43.0
413	1	471	121.0
902	1	1093	471.0
1486	2	3	NaN
629	2	353	3.0
1190	2	552	353.0
158	2	743	552.0
1393	2	903	743.0
1895	2	1036	903.0
53	3	257	NaN
574	3	540	257.0
551	3	619	540.0
1829	3	768	619.0

In [11]:

df['time from prev'] = df['time (ms)'] - df['shifted time']
df.sort_values(['subject_id', 'time (ms)'])[['subject_id', 'time (ms)', 'time from prev']]

Out[11]:

	subject_id	time (ms)	time from prev
1602	1	43	NaN
785	1	121	78.0
413	1	471	350.0
902	1	1093	622.0
1486	2	3	NaN
629	2	353	350.0
1190	2	552	199.0
158	2	743	191.0
1393	2	903	160.0
1895	2	1036	133.0
53	3	257	NaN
574	3	540	283.0
551	3	619	79.0
1829	3	768	149.0
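The two steps above (shift, then subtract) can probably be collapsed into one call to the group-wise `diff()` method. This is a sketch of that alternative on a subset of the rows above, not the approach used in the notebook; it checks that both routes give the same values.

```python
import pandas as pd

# Illustrative subset of the rows shown above (subjects 1 and 3).
df = pd.DataFrame(
    {
        "subject_id": [3, 3, 3, 1, 1, 1],
        "time (ms)": [540, 257, 619, 43, 471, 121],
    },
    index=[574, 53, 551, 1602, 413, 785],
)

by_subject = df.sort_values("time (ms)").groupby("subject_id")["time (ms)"]

# Route 1: lag by one row within each subject, then subtract.
via_shift = df["time (ms)"] - by_subject.shift(1)

# Route 2: diff() computes the difference with the previous row of the group.
via_diff = by_subject.diff()

# Both results carry the original index labels; compare after aligning them.
pd.testing.assert_series_equal(via_shift.sort_index(), via_diff.sort_index())

df["time from prev"] = via_diff
result = df.sort_values(["subject_id", "time (ms)"])
print(result)
```

The first response of each subject has no predecessor, so its value is NaN in both versions.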
