2024-heraklion-data/exercises/tabular_join/.ipynb_checkpoints/tabular_join_solution-checkpoint.ipynb
2024-08-27 15:27:53 +03:00

34 KiB

Exercise: Add experiment information to electrophysiology data

In [1]:
import pandas as pd

# Set some Pandas options: maximum number of rows/columns it's going to display
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 100)

Load electrophysiology data

In [2]:
df = pd.read_csv('../../data/QC_passed_2024-07-04_collected.csv')
info = pd.read_csv('../../data/op_info.csv')
In [3]:
df.head()
Out[3]:
OP filename slice cell_ch cell_ID day treatment hrs_incubation repatch hrs_after_OP Rs Rin resting_potential max_spikes Rheobase AP_heigth TH max_depol max_repol membra_time_constant_tau capacitance comments rheo_ramp AP_halfwidth Rheobse_ramp Unnamed: 27 rheos_ramp comment high K concentration RMP_from_char
0 OP230420 23420003.abf S1 1 23420S1c1 D1 TTX 0.0 no 10.416389 6.675643 39.025301 -74.285889 24 200.0 80.749512 -35.278320 336.181641 -60.791016 19.40 510.601767 0 753.380113 1.151009 NaN NaN NaN NaN NaN 8 mM -61.828554
1 OP230420 23420003.abf S1 3 23420S1c3 D1 TTX 0.0 no 10.416389 7.867174 48.728367 -69.573975 26 300.0 78.448486 -32.043457 350.097656 -67.138672 17.30 393.397918 1 585.102837 1.006321 NaN NaN NaN NaN NaN 8 mM -60.460298
2 OP230420 23420003.abf S1 6 23420S1c6 D1 TTX 0.0 no 10.416389 8.820134 35.971082 -54.956055 22 300.0 76.660156 -29.827881 270.629883 -52.246094 14.85 426.098774 3 173.915797 1.266335 NaN NaN NaN NaN NaN 8 mM -59.615979
3 OP230420 23420003.abf S1 7 23420S1c7 D1 TTX 0.0 yes 10.416389 7.269195 39.186101 -69.268799 24 300.0 75.030518 -29.699707 242.553711 -71.411133 17.15 478.273362 4 598.079936 0.994396 NaN NaN NaN NaN NaN 8 mM -61.173839
4 OP230420 23420003.abf S1 8 23420S1c8 D1 TTX 0.0 yes 10.416389 6.000400 31.599917 -70.550537 22 350.0 81.011963 -33.068848 309.448242 -61.401367 16.65 575.513924 5 786.927898 1.182830 NaN NaN NaN NaN NaN 8 mM -60.956350
In [4]:
info.head()
Out[4]:
OP tissue_source area patient_age patcher
0 OP201027 Mitte temporal 33 Rosie
1 OP201029 Mitte temporal 47 Rosie
2 OP210323 Virchow temporal 10 Rosie
3 OP210615 Virchow temporal 19 Rosie
4 OP211123 Bielefeld temporal 68 Rosie

1. Add experiment information to the electrophysiology results

  • Is there information for every experiment?
  • How many experiments did each patcher perform? (i.e., individual OPs, or rows in info)
  • How many samples did each patcher analyze? (i.e., individual rows in df)
In [5]:
df_with_info = df.merge(info, on='OP', how='left')
In [6]:
df_with_info.count()
Out[6]:
OP                          827
filename                    827
slice                       827
cell_ch                     827
cell_ID                     827
day                         827
treatment                   827
hrs_incubation              827
repatch                     827
hrs_after_OP                827
Rs                          827
Rin                         827
resting_potential           827
max_spikes                  827
Rheobase                    824
AP_heigth                   824
TH                          824
max_depol                   824
max_repol                   824
membra_time_constant_tau    827
capacitance                 827
comments                    742
rheo_ramp                   120
AP_halfwidth                820
Rheobse_ramp                160
Unnamed: 27                   0
rheos_ramp                   32
comment                       5
                             37
high K concentration        827
RMP_from_char               827
tissue_source               800
area                        800
patient_age                 800
patcher                     800
dtype: int64
In [7]:
info['patcher'].value_counts()
Out[7]:
Verji    35
Rosie     8
Anna      2
Name: patcher, dtype: int64
In [8]:
df_with_info['patcher'].value_counts()
Out[8]:
Verji    594
Rosie    206
Name: patcher, dtype: int64

2. Remove outliers from the table

  1. Load the list of outliers in outliers.csv
  2. Use an anti-join to remove the outliers from the table
  3. How many samples (rows) are left in the data?
In [9]:
outliers = pd.read_csv('outliers.csv')
In [10]:
outliers.shape
Out[10]:
(134, 2)
In [11]:
outliers.head()
Out[11]:
OP cell_ID
0 OP240201 24201S2c2
1 OP210323 2021_03_25_0S4_D2c6
2 OP230808 23808S2c6
3 OP240503 24503S1c6
4 OP230109 2311S3c2
In [12]:
temp = df_with_info.merge(outliers, on=['OP', 'cell_ID'], how='outer', indicator=True)
In [13]:
df_without_outliers = temp[temp['_merge'] == 'left_only'].drop('_merge', axis=1)
In [14]:
df_without_outliers.shape
Out[14]:
(659, 35)
In [15]:
df_without_outliers.head()
Out[15]:
OP filename slice cell_ch cell_ID day treatment hrs_incubation repatch hrs_after_OP Rs Rin resting_potential max_spikes Rheobase AP_heigth TH max_depol max_repol membra_time_constant_tau capacitance comments rheo_ramp AP_halfwidth Rheobse_ramp Unnamed: 27 rheos_ramp comment high K concentration RMP_from_char tissue_source area patient_age patcher
0 OP230420 23420003.abf S1 1 23420S1c1 D1 TTX 0.0 no 10.416389 6.675643 39.025301 -74.285889 24 200.0 80.749512 -35.278320 336.181641 -60.791016 19.40 510.601767 0 753.380113 1.151009 NaN NaN NaN NaN NaN 8 mM -61.828554 Bielefeld temporal 13.0 Verji
1 OP230420 23420003.abf S1 3 23420S1c3 D1 TTX 0.0 no 10.416389 7.867174 48.728367 -69.573975 26 300.0 78.448486 -32.043457 350.097656 -67.138672 17.30 393.397918 1 585.102837 1.006321 NaN NaN NaN NaN NaN 8 mM -60.460298 Bielefeld temporal 13.0 Verji
2 OP230420 23420003.abf S1 6 23420S1c6 D1 TTX 0.0 no 10.416389 8.820134 35.971082 -54.956055 22 300.0 76.660156 -29.827881 270.629883 -52.246094 14.85 426.098774 3 173.915797 1.266335 NaN NaN NaN NaN NaN 8 mM -59.615979 Bielefeld temporal 13.0 Verji
5 OP230420 23420003.abf S1 8 23420S1c8 D1 TTX 0.0 yes 10.416389 6.000400 31.599917 -70.550537 22 350.0 81.011963 -33.068848 309.448242 -61.401367 16.65 575.513924 5 786.927898 1.182830 NaN NaN NaN NaN NaN 8 mM -60.956350 Bielefeld temporal 13.0 Verji
6 OP230420 23420061.abf S1_D2 8 23420S1c8 D2 TTX 19.0 yes 29.633333 8.271614 30.607259 -70.745850 1 1300.0 48.883057 -20.855713 100.952148 -27.465820 13.25 864.892430 29 565.938865 1.504127 NaN NaN NaN NaN NaN 8 mM -61.283967 Bielefeld temporal 13.0 Verji

3. Save final result in processed_QC_passed_2024-07-04_collected_v1.csv

  1. Using the .to_csv method of Pandas DataFrames
In [16]:
df_without_outliers.to_csv('processed_QC_passed_2024-07-04_collected_v1.csv', index=None)
In [ ]: