2024-heraklion-data/notebooks/030_tabular_data/040_window_functions.ipynb
2024-08-27 15:27:53 +03:00

Window functions for tabular data

In [1]:
import pandas as pd

Load experimental data

In [2]:
df = pd.read_csv('timed_responses.csv', index_col=0)
In [3]:
df
Out[3]:
subject_id time (ms) response accuracy
574 3 540 RIGHT 0.04
1190 2 552 LEFT 0.43
1895 2 1036 LEFT 0.36
53 3 257 RIGHT 0.11
158 2 743 RIGHT 0.32
551 3 619 LEFT 0.25
1602 1 43 RIGHT 0.65
413 1 471 LEFT 0.80
785 1 121 LEFT 0.10
1393 2 903 RIGHT 0.33
629 2 353 LEFT 0.17
1829 3 768 RIGHT 0.26
902 1 1093 LEFT 0.34
1486 2 3 RIGHT 0.29

Split-apply-combine operations return one aggregated value per group

In [4]:
df.groupby('subject_id')['accuracy'].max()
Out[4]:
subject_id
1    0.80
2    0.43
3    0.26
Name: accuracy, dtype: float64

However, for some calculations we need to have a value per row

For example: for each subject, rank the responses by decreasing accuracy

In [5]:
# By default, rank() ranks in ascending order: the lowest accuracy in each group gets rank 1
df.groupby('subject_id')['accuracy'].rank()
Out[5]:
574     1.0
1190    6.0
1895    5.0
53      2.0
158     3.0
551     3.0
1602    3.0
413     4.0
785     1.0
1393    4.0
629     1.0
1829    4.0
902     2.0
1486    2.0
Name: accuracy, dtype: float64
In [6]:
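# To rank by decreasing accuracy, pass ascending=False: the highest accuracy in each group gets rank 1
df['accuracy_rank'] = df.groupby('subject_id')['accuracy'].rank(ascending=False)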
df
Out[6]:
subject_id time (ms) response accuracy accuracy_rank
574 3 540 RIGHT 0.04 4.0
1190 2 552 LEFT 0.43 1.0
1895 2 1036 LEFT 0.36 2.0
53 3 257 RIGHT 0.11 3.0
158 2 743 RIGHT 0.32 4.0
551 3 619 LEFT 0.25 2.0
1602 1 43 RIGHT 0.65 2.0
413 1 471 LEFT 0.80 1.0
785 1 121 LEFT 0.10 4.0
1393 2 903 RIGHT 0.33 3.0
629 2 353 LEFT 0.17 6.0
1829 3 768 RIGHT 0.26 1.0
902 1 1093 LEFT 0.34 3.0
1486 2 3 RIGHT 0.29 5.0
In [7]:
df.sort_values(['subject_id', 'accuracy_rank'])
Out[7]:
subject_id time (ms) response accuracy accuracy_rank
413 1 471 LEFT 0.80 1.0
1602 1 43 RIGHT 0.65 2.0
902 1 1093 LEFT 0.34 3.0
785 1 121 LEFT 0.10 4.0
1190 2 552 LEFT 0.43 1.0
1895 2 1036 LEFT 0.36 2.0
1393 2 903 RIGHT 0.33 3.0
158 2 743 RIGHT 0.32 4.0
1486 2 3 RIGHT 0.29 5.0
629 2 353 LEFT 0.17 6.0
1829 3 768 RIGHT 0.26 1.0
551 3 619 LEFT 0.25 2.0
53 3 257 RIGHT 0.11 3.0
574 3 540 RIGHT 0.04 4.0
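
The same value-per-row pattern is not limited to rank(). As a minimal sketch, assuming we want to compare each response to the best accuracy of its subject, transform('max') broadcasts the group maximum back to every row. The frame name demo and the columns best_accuracy and is_best are made up for this illustration; the rows are a subset copied from the table above.

```python
import pandas as pd

# A subset of the rows shown above (subjects 1 and 3 only), rebuilt by hand
demo = pd.DataFrame(
    {
        'subject_id': [3, 3, 1, 1, 1, 3],
        'accuracy': [0.04, 0.11, 0.65, 0.80, 0.10, 0.26],
    },
    index=[574, 53, 1602, 413, 785, 1829],
)

# transform('max') returns one value per row: the maximum of that row's group
demo['best_accuracy'] = demo.groupby('subject_id')['accuracy'].transform('max')

# Flag the response that reaches the best accuracy of its subject
demo['is_best'] = demo['accuracy'] == demo['best_accuracy']
print(demo)
```

Unlike the aggregation in cell 4, the result has the same index as the original rows, so it can be assigned directly as a new column.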

In many cases, a window function is combined with a sorting operation

For example: for each subject, count the number of "LEFT" responses up until any moment in the experiment

In [8]:
# Add a flag column "is_left", so that we can count the number of LEFT responses using a cumulative sum
df['is_left'] = df['response'] == 'LEFT'
df
Out[8]:
subject_id time (ms) response accuracy accuracy_rank is_left
574 3 540 RIGHT 0.04 4.0 False
1190 2 552 LEFT 0.43 1.0 True
1895 2 1036 LEFT 0.36 2.0 True
53 3 257 RIGHT 0.11 3.0 False
158 2 743 RIGHT 0.32 4.0 False
551 3 619 LEFT 0.25 2.0 True
1602 1 43 RIGHT 0.65 2.0 False
413 1 471 LEFT 0.80 1.0 True
785 1 121 LEFT 0.10 4.0 True
1393 2 903 RIGHT 0.33 3.0 False
629 2 353 LEFT 0.17 6.0 True
1829 3 768 RIGHT 0.26 1.0 False
902 1 1093 LEFT 0.34 3.0 True
1486 2 3 RIGHT 0.29 5.0 False
In [9]:
# Without sorting, the cumulative count follows the row order of the DataFrame, not the order in time
df['nr_lefts'] = df.groupby('subject_id')['is_left'].cumsum()
df.sort_values(['subject_id'])
Out[9]:
subject_id time (ms) response accuracy accuracy_rank is_left nr_lefts
1602 1 43 RIGHT 0.65 2.0 False 0
413 1 471 LEFT 0.80 1.0 True 1
785 1 121 LEFT 0.10 4.0 True 2
902 1 1093 LEFT 0.34 3.0 True 3
1190 2 552 LEFT 0.43 1.0 True 1
1895 2 1036 LEFT 0.36 2.0 True 2
158 2 743 RIGHT 0.32 4.0 False 2
1393 2 903 RIGHT 0.33 3.0 False 2
629 2 353 LEFT 0.17 6.0 True 3
1486 2 3 RIGHT 0.29 5.0 False 3
574 3 540 RIGHT 0.04 4.0 False 0
53 3 257 RIGHT 0.11 3.0 False 0
551 3 619 LEFT 0.25 2.0 True 1
1829 3 768 RIGHT 0.26 1.0 False 1
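
To count the LEFT responses up to each moment in the experiment, the rows need to be ordered in time before the cumulative sum is taken. One possible sketch, assuming 'time (ms)' is the column that defines the order: sort first, then group and accumulate. The frame name demo is made up for this illustration, and the rows are a subset copied from the table above.

```python
import pandas as pd

# A subset of the rows shown above (subjects 1 and 3 only), rebuilt by hand
demo = pd.DataFrame(
    {
        'subject_id': [3, 3, 3, 1, 1, 1, 3, 1],
        'time (ms)': [540, 257, 619, 43, 471, 121, 768, 1093],
        'response': ['RIGHT', 'RIGHT', 'LEFT', 'RIGHT', 'LEFT', 'LEFT', 'RIGHT', 'LEFT'],
    },
    index=[574, 53, 551, 1602, 413, 785, 1829, 902],
)
demo['is_left'] = demo['response'] == 'LEFT'

# Sort by time first, so that the cumulative sum runs in chronological order
# within each subject. The assignment aligns on the index, so each count
# lands on the right row even though demo itself stays unsorted.
demo['nr_lefts'] = (
    demo
    .sort_values('time (ms)')
    .groupby('subject_id')['is_left']
    .cumsum()
)
print(demo.sort_values(['subject_id', 'time (ms)']))
```

For subject 1, the response at 121 ms now gets a count of 1 and the one at 471 ms a count of 2, the reverse of the unsorted result.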

Window functions are also useful to compute changes in the data for each group

In this case, the window function often uses the shift(n) method, which lags the data by n rows within each group

In [10]:
df['shifted time'] = (
    df
    .sort_values('time (ms)')
    .groupby('subject_id')['time (ms)']
    .shift(1)
)
df.sort_values(['subject_id', 'time (ms)'])[['subject_id', 'time (ms)', 'shifted time']]
Out[10]:
subject_id time (ms) shifted time
1602 1 43 NaN
785 1 121 43.0
413 1 471 121.0
902 1 1093 471.0
1486 2 3 NaN
629 2 353 3.0
1190 2 552 353.0
158 2 743 552.0
1393 2 903 743.0
1895 2 1036 903.0
53 3 257 NaN
574 3 540 257.0
551 3 619 540.0
1829 3 768 619.0
In [11]:
df['time from prev'] = df['time (ms)'] - df['shifted time']
df.sort_values(['subject_id', 'time (ms)'])[['subject_id', 'time (ms)', 'time from prev']]
Out[11]:
subject_id time (ms) time from prev
1602 1 43 NaN
785 1 121 78.0
413 1 471 350.0
902 1 1093 622.0
1486 2 3 NaN
629 2 353 350.0
1190 2 552 199.0
158 2 743 191.0
1393 2 903 160.0
1895 2 1036 133.0
53 3 257 NaN
574 3 540 283.0
551 3 619 79.0
1829 3 768 149.0