2024-heraklion-data/notebooks/030_tabular_data/010_pandas_introduction.ipynb
2024-08-27 15:27:53 +03:00

Pandas, quick introduction

In [1]:
import pandas as pd

Pandas introduces a tabular data structure, the DataFrame

  • Each column has its own dtype, usually a C-native NumPy type (e.g. int64, float64); strings are stored as generic Python objects
  • Columns and rows have indices, i.e. labels that identify each column or row
In [2]:
df = pd.DataFrame(
    data = [
        ['Anthony', 28, 1.53], 
        ['Maria', 31, 1.76], 
        ['Emma', 26, 1.83], 
        ['Philip', 41, 1.81], 
        ['Bill', 27, None],
    ],
    columns = ['name', 'age', 'height'],
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
)
In [3]:
df
Out[3]:
         name  age  height
A484  Anthony   28    1.53
C012    Maria   31    1.76
A123     Emma   26    1.83
B663   Philip   41    1.81
A377     Bill   27     NaN
In [4]:
df.head(3)
Out[4]:
         name  age  height
A484  Anthony   28    1.53
C012    Maria   31    1.76
A123     Emma   26    1.83
In [5]:
df.sample(3)
Out[5]:
         name  age  height
A377     Bill   27     NaN
C012    Maria   31    1.76
A484  Anthony   28    1.53
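
`df.sample` draws rows at random, so the output above changes from run to run. A minimal sketch of how the draw could be made repeatable, assuming the same table as above; the seed value 0 is an arbitrary choice for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    data=[
        ['Anthony', 28, 1.53],
        ['Maria', 31, 1.76],
        ['Emma', 26, 1.83],
        ['Philip', 41, 1.81],
        ['Bill', 27, None],
    ],
    columns=['name', 'age', 'height'],
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
)

# Passing the same random_state (an arbitrary seed) gives the same rows each time
first = df.sample(3, random_state=0)
second = df.sample(3, random_state=0)

print(first.equals(second))  # True
```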

DataFrame attributes

In [6]:
df.shape
Out[6]:
(5, 3)
In [7]:
# Each column can have a different dtype
# Numeric columns use native NumPy dtypes; strings are stored as generic Python objects (dtype `object`)
df.dtypes
Out[7]:
name       object
age         int64
height    float64
dtype: object
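
The `float64` dtype of `height` is a consequence of the missing value: `None` is converted to `NaN`, which is a floating-point value. A small sketch of that behaviour, using a reduced, hypothetical two-row table (`people`) rather than the full one:

```python
import pandas as pd

# Hypothetical two-row table, built from a dict of columns
people = pd.DataFrame(
    {'name': ['Anthony', 'Bill'], 'age': [28, 27], 'height': [1.53, None]},
    index=['A484', 'A377'],
)

print(people.dtypes)

# None in a numeric column becomes NaN, so the column is float64
print(people['height'].isna().tolist())  # [False, True]

# The same happens to integers: one missing value upcasts the column to float64
ages_with_gap = pd.Series([28, None])
print(ages_with_gap.dtype)  # float64
```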
In [8]:
df.columns
Out[8]:
Index(['name', 'age', 'height'], dtype='object')
In [9]:
df.index
Out[9]:
Index(['A484', 'C012', 'A123', 'B663', 'A377'], dtype='object')

Indexing rows and columns

In [10]:
# Default indexing is by column
df['age']
Out[10]:
A484    28
C012    31
A123    26
B663    41
A377    27
Name: age, dtype: int64
In [11]:
# Use a list to select multiple columns (like in NumPy's fancy indexing)
df[['age', 'name']]
Out[11]:
      age     name
A484   28  Anthony
C012   31    Maria
A123   26     Emma
B663   41   Philip
A377   27     Bill
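
The two forms of column indexing return different types: a single label gives a `Series`, a list of labels gives a `DataFrame`, even when the list has one element. A quick check, assuming the same table as above:

```python
import pandas as pd

df = pd.DataFrame(
    data=[
        ['Anthony', 28, 1.53],
        ['Maria', 31, 1.76],
        ['Emma', 26, 1.83],
        ['Philip', 41, 1.81],
        ['Bill', 27, None],
    ],
    columns=['name', 'age', 'height'],
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
)

one_column = df['age']       # a Series (one-dimensional)
one_col_frame = df[['age']]  # a DataFrame with a single column

print(type(one_column).__name__)     # Series
print(type(one_col_frame).__name__)  # DataFrame
print(one_col_frame.shape)           # (5, 1)
```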
In [12]:
# Indexing by row / column name
df.loc['A484', 'height']
Out[12]:
1.53
In [13]:
# Indexing by integer position, as in NumPy (a bit of a code smell: positions silently change when rows or columns are reordered, so prefer labels)
df.iloc[0, 2]
Out[13]:
1.53
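
Both `.loc` and `.iloc` also accept slices, with one difference that is easy to miss: label slices include the end label, while position slices exclude the end position, as in NumPy. A short sketch on the same table:

```python
import pandas as pd

df = pd.DataFrame(
    data=[
        ['Anthony', 28, 1.53],
        ['Maria', 31, 1.76],
        ['Emma', 26, 1.83],
        ['Philip', 41, 1.81],
        ['Bill', 27, None],
    ],
    columns=['name', 'age', 'height'],
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
)

# Label slice: both 'A484' and 'A123' are included (3 rows)
by_label = df.loc['A484':'A123', ['name', 'age']]

# Position slice: the end position is excluded (2 rows, 2 columns)
by_position = df.iloc[0:2, 0:2]

# A single row label returns that row as a Series
maria = df.loc['C012']

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 2)
print(maria['age'])       # 31
```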

Examining a column

In [14]:
df['name'].unique()
Out[14]:
array(['Anthony', 'Maria', 'Emma', 'Philip', 'Bill'], dtype=object)
In [15]:
df['name'].nunique()
Out[15]:
5
In [16]:
df['height'].describe()
Out[16]:
count    4.000000
mean     1.732500
std      0.138173
min      1.530000
25%      1.702500
50%      1.785000
75%      1.815000
max      1.830000
Name: height, dtype: float64
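
The `count` of 4 reported by `describe` is lower than the number of rows because the missing height is skipped. A sketch of how the missing entries could be inspected directly, assuming the same `height` column:

```python
import pandas as pd

heights = pd.Series(
    [1.53, 1.76, 1.83, 1.81, None],
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
    name='height',
)

print(len(heights))           # 5: all rows, missing or not
print(heights.count())        # 4: non-missing values only
print(heights.isna().sum())   # 1: number of missing values

# Labels of the rows with a missing height
missing_labels = heights[heights.isna()].index.tolist()
print(missing_labels)         # ['A377']
```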

Filtering

In [17]:
df[df['age'] > 30]
Out[17]:
        name  age  height
C012   Maria   31    1.76
B663  Philip   41    1.81
In [18]:
is_old_and_tall = (df['age'] > 30) & (df['height'] > 1.8)
df[is_old_and_tall]
Out[18]:
        name  age  height
B663  Philip   41    1.81
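
Boolean masks combine with `&` (and), `|` (or) and `~` (not); the parentheses around each comparison are needed because these operators bind more tightly than `>`. A sketch with the other two operators, on the same table. Note that a comparison with a missing value is `False`, so Bill never counts as tall:

```python
import pandas as pd

df = pd.DataFrame(
    data=[
        ['Anthony', 28, 1.53],
        ['Maria', 31, 1.76],
        ['Emma', 26, 1.83],
        ['Philip', 41, 1.81],
        ['Bill', 27, None],
    ],
    columns=['name', 'age', 'height'],
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
)

is_old = df['age'] > 30
is_tall = df['height'] > 1.8   # False for the missing height

old_or_tall = df[is_old | is_tall]
not_old = df[~is_old]

print(old_or_tall['name'].tolist())  # ['Maria', 'Emma', 'Philip']
print(not_old['name'].tolist())      # ['Anthony', 'Emma', 'Bill']
```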

Basic operations work column by column (unlike NumPy, which reduces over all elements by default)

In [19]:
df['age'].min()
Out[19]:
26
In [20]:
df.min()
Out[20]:
name      Anthony
age            26
height       1.53
dtype: object
In [21]:
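# Note that Pandas operations ignore NaNs (they consider them as "missing")
# Calling mean() on a frame that contains a string column triggers the FutureWarning below;
# pandas 2.0 and later raise a TypeError instead, so pass numeric_only=True as in the next cell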
df.mean()
/tmp/ipykernel_80457/1061404192.py:2: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.mean()
Out[21]:
age       30.6000
height     1.7325
dtype: float64
In [22]:
df.mean(numeric_only=True)
Out[22]:
age       30.6000
height     1.7325
dtype: float64
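
The contrast with NumPy can be made concrete. The sketch below uses only the two numeric columns (the frame name `numbers` is introduced here for illustration) and compares the pandas defaults with the NumPy ones:

```python
import numpy as np
import pandas as pd

numbers = pd.DataFrame(
    {
        'age': [28, 31, 26, 41, 27],
        'height': [1.53, 1.76, 1.83, 1.81, None],
    },
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
)

# pandas: one result per column, missing values skipped
column_means = numbers.mean()

# skipna=False makes the missing value propagate, as in NumPy
strict_means = numbers.mean(skipna=False)

# axis=1 computes one result per row instead
row_means = numbers.mean(axis=1)

# NumPy: a single number over all elements, and NaN propagates
numpy_mean = numbers.to_numpy().mean()

print(column_means)
print(strict_means)
print(numpy_mean)  # nan
```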
In [23]:
# Sorting reorders the rows, but each row keeps its index label
# sort_values returns a new DataFrame; `df` itself is not modified
df.sort_values('name', axis=0)
Out[23]:
         name  age  height
A484  Anthony   28    1.53
A377     Bill   27     NaN
A123     Emma   26    1.83
C012    Maria   31    1.76
B663   Philip   41    1.81
In [24]:
df
Out[24]:
         name  age  height
A484  Anthony   28    1.53
C012    Maria   31    1.76
A123     Emma   26    1.83
B663   Philip   41    1.81
A377     Bill   27     NaN
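
As the two cells above show, `sort_values` leaves `df` in its original order. To keep the sorted table, assign the result to a name. A sketch, sorting by age in descending order; the name `by_age` is chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    data=[
        ['Anthony', 28, 1.53],
        ['Maria', 31, 1.76],
        ['Emma', 26, 1.83],
        ['Philip', 41, 1.81],
        ['Bill', 27, None],
    ],
    columns=['name', 'age', 'height'],
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
)

# The sorted copy is a new DataFrame; rows keep their index labels
by_age = df.sort_values('age', ascending=False)

print(by_age.index.tolist())  # ['B663', 'C012', 'A484', 'A377', 'A123']
print(df.index.tolist())      # ['A484', 'C012', 'A123', 'B663', 'A377']
```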

Operations on strings

In [25]:
# Use `.str` to access string operations
# Third character of each name
df['name'].str[2]
Out[25]:
A484    t
C012    r
A123    m
B663    i
A377    l
Name: name, dtype: object
In [26]:
# Upper-case version of each name
df['name'].str.upper()
Out[26]:
A484    ANTHONY
C012      MARIA
A123       EMMA
B663     PHILIP
A377       BILL
Name: name, dtype: object
In [27]:
# Count occurrences of 'a' (case-sensitive, so the 'A' in 'Anthony' is not counted)
df['name'].str.count('a')
Out[27]:
A484    0
C012    2
A123    1
B663    0
A377    0
Name: name, dtype: int64
In [28]:
# Lower-case first to count 'a' regardless of case
df['name'].str.lower().str.count('a')
Out[28]:
A484    1
C012    2
A123    1
B663    0
A377    0
Name: name, dtype: int64
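
String methods that return booleans can be used directly as filters. A few more `.str` methods, sketched on the same names (held here in a standalone Series for brevity):

```python
import pandas as pd

names = pd.Series(
    ['Anthony', 'Maria', 'Emma', 'Philip', 'Bill'],
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
    name='name',
)

# Length of each name
lengths = names.str.len()

# Does the name contain an 'a', ignoring case?
has_a = names.str.contains('a', case=False)

# Boolean results work as filters
starts_with_a = names[names.str.startswith('A')]

print(lengths.tolist())        # [7, 5, 4, 6, 4]
print(has_a.tolist())          # [True, True, True, False, False]
print(starts_with_a.tolist())  # ['Anthony']
```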

Adding new columns

In [29]:
df
Out[29]:
         name  age  height
A484  Anthony   28    1.53
C012    Maria   31    1.76
A123     Emma   26    1.83
B663   Philip   41    1.81
A377     Bill   27     NaN
In [30]:
df['name_upper'] = df['name'].str.upper()
In [31]:
df
Out[31]:
         name  age  height name_upper
A484  Anthony   28    1.53    ANTHONY
C012    Maria   31    1.76      MARIA
A123     Emma   26    1.83       EMMA
B663   Philip   41    1.81     PHILIP
A377     Bill   27     NaN       BILL
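
New columns can also be computed from numeric columns. Assigning with `df[...] = ...` modifies `df` in place, while `df.assign(...)` returns a new DataFrame and leaves the original untouched. A sketch; the column names `height_cm` and `age_next_year` are made up for this example:

```python
import pandas as pd

df = pd.DataFrame(
    data=[
        ['Anthony', 28, 1.53],
        ['Maria', 31, 1.76],
        ['Emma', 26, 1.83],
        ['Philip', 41, 1.81],
        ['Bill', 27, None],
    ],
    columns=['name', 'age', 'height'],
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
)

# In-place: adds a column to df; the missing height stays missing
df['height_cm'] = df['height'] * 100

# assign returns a new DataFrame with the extra column; df is not changed
with_extra = df.assign(age_next_year=df['age'] + 1)

print(df.columns.tolist())
print(with_extra['age_next_year'].tolist())  # [29, 32, 27, 42, 28]
```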