2024-heraklion-data/notebooks/030_tabular_data/021_join_operations_tutor.ipynb
2024-08-27 15:27:53 +03:00

Combine information across tables: joins and anti-joins

In [1]:
import pandas as pd

"Load" some experimental data

In [2]:
data = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12, 'LEFT'],
        ['312', 'A2', 0.37, 'LEFT'],
        ['312', 'C2', 0.68, 'LEFT'],
        ['711', 'A1', 4.01, 'RIGHT'],
        ['711', 'A2', 0.44, 'LEFT'],
        ['313', 'A1', 0.07, 'RIGHT'],
        ['313', 'B1', 0.08, 'RIGHT'],
        ['712', 'A2', 3.29, 'LEFT'],
        ['314', 'A2', 0.29, 'LEFT'],
        ['714', 'B2', 3.32, 'RIGHT'],
        ['314', 'B1', 0.14, 'RIGHT'],
        ['314', 'C2', 0.73, 'RIGHT'],
        ['713', 'B1', 5.74, 'LEFT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)
data
Out[2]:
    subject_id  condition_id  response_time  response
 0         312            A1           0.12      LEFT
 1         312            A2           0.37      LEFT
 2         312            C2           0.68      LEFT
 3         711            A1           4.01     RIGHT
 4         711            A2           0.44      LEFT
 5         313            A1           0.07     RIGHT
 6         313            B1           0.08     RIGHT
 7         712            A2           3.29      LEFT
 8         314            A2           0.29      LEFT
 9         714            B2           3.32     RIGHT
10         314            B1           0.14     RIGHT
11         314            C2           0.73     RIGHT
12         713            B1           5.74      LEFT

Each measurement belongs to one experimental condition (identified by condition_id), but the parameters of the conditions are not stored in the table.

In [3]:
condition_to_orientation = {
    'A1': 0,
    'A2': 0,
    'B1': 45,
    'B2': 45,
    'C1': 90,
}

condition_to_duration = {
    'A1': 0.1,
    'A2': 0.01,
    'B1': 0.1,
    'B2': 0.01,
    'C1': 0.2,
}

condition_to_surround = {
    'A1': 'FULL',
    'A2': 'NONE',
    'B1': 'NONE',
    'B2': 'FULL',
    'C1': 'FULL',
}


condition_to_stimulus_type = {
    'A1': 'LINES',
    'A2': 'DOTS',
    'B1': 'PLAID',
    'B2': 'PLAID',
    'C1': 'WIGGLES',
}

Manually adding the condition parameters to the table

In [73]:
data_with_properties = data.copy()
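
One possible way to fill in this step, shown as a minimal sketch rather than the tutorial's own solution: look up each condition_id in the dictionaries with Series.map. The block uses only three rows of the table and two of the dictionaries so that it runs on its own.

```python
import pandas as pd

# A few rows of the `data` table above, enough to show the behaviour
data = pd.DataFrame(
    [
        ['312', 'A1', 0.12, 'LEFT'],
        ['312', 'C2', 0.68, 'LEFT'],
        ['313', 'B1', 0.08, 'RIGHT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)
condition_to_orientation = {'A1': 0, 'A2': 0, 'B1': 45, 'B2': 45, 'C1': 90}
condition_to_duration = {'A1': 0.1, 'A2': 0.01, 'B1': 0.1, 'B2': 0.01, 'C1': 0.2}

data_with_properties = data.copy()
# Series.map looks up each condition_id in the dictionary;
# ids that are missing from the dictionary (here 'C2') become NaN
data_with_properties['orientation'] = data_with_properties['condition_id'].map(condition_to_orientation)
data_with_properties['duration'] = data_with_properties['condition_id'].map(condition_to_duration)
print(data_with_properties)
```

This needs one line per property, and the mismatch between 'C2' in the data and 'C1' in the dictionaries only shows up as missing values.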
In [ ]:

In [ ]:

Using a join operation

In [4]:
# In practice, a table like this is often loaded from a spreadsheet
condition_properties = pd.DataFrame(
    [condition_to_orientation, condition_to_duration, condition_to_surround, condition_to_stimulus_type],
    index=['orientation', 'duration', 'surround', 'stimulus_type'],
).T
condition_properties
Out[4]:
    orientation  duration  surround  stimulus_type
A1            0       0.1      FULL          LINES
A2            0      0.01      NONE           DOTS
B1           45       0.1      NONE          PLAID
B2           45      0.01      FULL          PLAID
C1           90       0.2      FULL        WIGGLES
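
A minimal sketch of the join itself, assuming the properties table is indexed by condition id as above. The reduced tables here are stand-ins so that the block runs on its own. DataFrame.merge matches the condition_id column of the data against the index of the properties table, and the how argument decides what happens to rows without a match.

```python
import pandas as pd

# Reduced stand-ins for `data` and `condition_properties`
data = pd.DataFrame(
    [['312', 'A1', 0.12], ['312', 'C2', 0.68], ['313', 'B1', 0.08]],
    columns=['subject_id', 'condition_id', 'response_time'],
)
condition_properties = pd.DataFrame(
    {'orientation': [0, 45, 90], 'surround': ['FULL', 'NONE', 'FULL']},
    index=['A1', 'B1', 'C1'],
)

# Left join: keep every row of `data`; unmatched conditions ('C2') get NaN
left = data.merge(
    condition_properties, how='left', left_on='condition_id', right_index=True
)
# Inner join: keep only rows whose condition_id appears in both tables
inner = data.merge(
    condition_properties, how='inner', left_on='condition_id', right_index=True
)
print(left)
print(inner)
```

With how='left' the 'C2' measurement is kept with missing properties; with how='inner' it is dropped. Which one is appropriate depends on whether unmatched measurements should remain visible in the analysis.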
In [ ]:

In [ ]:

Anti-join: filter out unwanted data

In [5]:
# We are given a list of subjects that are outliers and should be disregarded in the analysis
outliers = pd.DataFrame([['711'], ['712'], ['713'], ['714'], ['888']], columns=['subject_id'])
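
pandas has no dedicated anti-join method, so the sketch below shows one common way to build it: a left join with indicator=True, followed by a filter on the resulting _merge column. The reduced data table is a stand-in so that the block runs on its own.

```python
import pandas as pd

# Reduced stand-in for `data`; `outliers` is the table defined above
data = pd.DataFrame(
    [['312', 'A1', 0.12], ['711', 'A1', 4.01], ['313', 'B1', 0.08], ['713', 'B1', 5.74]],
    columns=['subject_id', 'condition_id', 'response_time'],
)
outliers = pd.DataFrame([['711'], ['712'], ['713'], ['714'], ['888']], columns=['subject_id'])

# indicator=True adds a `_merge` column saying where each row was found
merged = data.merge(outliers, how='left', on='subject_id', indicator=True)
# Anti-join: keep rows present only in `data`, then drop the helper column
data_clean = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Equivalent shortcut when matching on a single column
data_clean_isin = data[~data['subject_id'].isin(outliers['subject_id'])]
print(data_clean)
```

Outlier ids that never occur in the data, such as '888', have no effect on the result. The merge-based version also extends to matching on several columns, which the isin shortcut does not handle directly.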
In [ ]:

In [ ]: