Pandas, quick introduction

In [1]:

import pandas as pd

Pandas introduces a tabular data structure, the DataFrame

Each column has its own dtype: a native NumPy type such as int64 or float64, or generic Python objects (used here for strings)
Columns and rows have indices, i.e. labels that identify each column or row

In [2]:

df = pd.DataFrame(
    data=[
        ['Anthony', 28, 1.53],
        ['Maria', 31, 1.76],
        ['Emma', 26, 1.83],
        ['Philip', 41, 1.81],
        ['Bill', 27, None],
    ],
    columns=['name', 'age', 'height'],
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
)

In [3]:

df

Out[3]:

	name	age	height
A484	Anthony	28	1.53
C012	Maria	31	1.76
A123	Emma	26	1.83
B663	Philip	41	1.81
A377	Bill	27	NaN

In [4]:

df.head(3)

Out[4]:

	name	age	height
A484	Anthony	28	1.53
C012	Maria	31	1.76
A123	Emma	26	1.83

In [5]:

df.sample(3)

Out[5]:

	name	age	height
A377	Bill	27	NaN
C012	Maria	31	1.76
A484	Anthony	28	1.53

DataFrame attributes

In [6]:

df.shape

Out[6]:

(5, 3)

In [7]:

# Each column can have a different dtype
# Numeric columns use native NumPy dtypes; strings are stored as Python objects ("object")
df.dtypes

Out[7]:

name       object
age         int64
height    float64
dtype: object
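The dtype of a column also depends on whether values are missing. The sketch below rebuilds a two-row version of the table (so it runs on its own) and shows that a `None` in a numeric column is stored as NaN in a floating-point column, and that `astype` returns a converted copy. The variable names are illustrative only.

```python
import pandas as pd

# Two-row version of the table above, rebuilt so the snippet is self-contained
people = pd.DataFrame(
    data=[['Anthony', 28, 1.53], ['Bill', 27, None]],
    columns=['name', 'age', 'height'],
    index=['A484', 'A377'],
)

# The missing height is stored as NaN, so the column is floating point
height_dtype = people['height'].dtype
n_missing = people['height'].isna().sum()

# astype returns a new Series; the original column keeps its dtype
age_as_float = people['age'].astype('float64')

print(height_dtype, n_missing, age_as_float.dtype, people['age'].dtype)
```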

In [8]:

df.columns

Out[8]:

Index(['name', 'age', 'height'], dtype='object')

In [9]:

df.index

Out[9]:

Index(['A484', 'C012', 'A123', 'B663', 'A377'], dtype='object')

Indexing rows and columns

In [10]:

# Default indexing is by column
df['age']

Out[10]:

A484    28
C012    31
A123    26
B663    41
A377    27
Name: age, dtype: int64

In [11]:

# Use a list to select multiple columns (like in NumPy's fancy indexing)
df[['age', 'name']]

Out[11]:

	age	name
A484	28	Anthony
C012	31	Maria
A123	26	Emma
B663	41	Philip
A377	27	Bill

In [12]:

# Indexing by row / column name
df.loc['A484', 'height']

Out[12]:

1.53

In [13]:

# Indexing by integer position, as in NumPy (usually a code smell: labels are more robust)
df.iloc[0, 2]

Out[13]:

1.53
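Both `.loc` and `.iloc` also accept slices, with one difference that is easy to miss: label slices include the end label, while position slices exclude the end position. A minimal sketch on a three-row version of the table (variable names are illustrative):

```python
import pandas as pd

people = pd.DataFrame(
    data=[['Anthony', 28], ['Maria', 31], ['Emma', 26]],
    columns=['name', 'age'],
    index=['A484', 'C012', 'A123'],
)

# Label-based slice: both endpoints are included
ages_by_label = people.loc['A484':'C012', 'age']

# Position-based slice: the end position is excluded, as in NumPy
ages_by_position = people.iloc[0:1, 1]

print(ages_by_label.tolist())
print(ages_by_position.tolist())
```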

Examining a column

In [14]:

df['name'].unique()

Out[14]:

array(['Anthony', 'Maria', 'Emma', 'Philip', 'Bill'], dtype=object)

In [15]:

df['name'].nunique()

Out[15]:

5

In [16]:

df['height'].describe()

Out[16]:

count    4.000000
mean     1.732500
std      0.138173
min      1.530000
25%      1.702500
50%      1.785000
75%      1.815000
max      1.830000
Name: height, dtype: float64

Filtering

In [17]:

df[df['age'] > 30]

Out[17]:

	name	age	height
C012	Maria	31	1.76
B663	Philip	41	1.81

In [18]:

is_old_and_tall = (df['age'] > 30) & (df['height'] > 1.8)
df[is_old_and_tall]

Out[18]:

	name	age	height
B663	Philip	41	1.81
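Boolean masks can also be combined with `|` (or) and negated with `~`. The sketch below, on a three-row version of the table, illustrates two details: each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` and `<`, and a comparison against NaN is False, so a row with a missing height ends up in the negated mask. Variable names are illustrative.

```python
import pandas as pd

people = pd.DataFrame(
    data=[['Maria', 31, 1.76], ['Philip', 41, 1.81], ['Bill', 27, None]],
    columns=['name', 'age', 'height'],
    index=['C012', 'B663', 'A377'],
)

# Parentheses are required around each comparison
is_young_or_tall = (people['age'] < 30) | (people['height'] > 1.8)

# ~ negates a boolean mask; NaN > 1.8 is False, so Bill is selected here
is_not_tall = ~(people['height'] > 1.8)

young_or_tall_names = people[is_young_or_tall]['name'].tolist()
not_tall_names = people[is_not_tall]['name'].tolist()

print(young_or_tall_names)
print(not_tall_names)
```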

Reductions work column by column by default (unlike NumPy, which reduces over the whole array)

In [19]:

df['age'].min()

Out[19]:

26

In [20]:

df.min()

Out[20]:

name      Anthony
age            26
height       1.53
dtype: object

In [21]:

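# Note that Pandas operations ignore NaNs (they consider them as "missing")
# Without numeric_only, this pandas version warns about the non-numeric 'name' column;
# pandas 2.0 and later raise a TypeError instead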
df.mean()

/tmp/ipykernel_80457/1061404192.py:2: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.mean()

Out[21]:

age       30.6000
height     1.7325
dtype: float64

In [22]:

df.mean(numeric_only=True)

Out[22]:

age       30.6000
height     1.7325
dtype: float64
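The NaN handling and the reduction direction can both be changed through keyword arguments. A small sketch, assuming a two-row version of the table: `skipna=False` makes a missing value propagate into the result, and `axis=1` reduces across the columns of each row instead of down each column.

```python
import pandas as pd

people = pd.DataFrame(
    data=[['Anthony', 28, 1.53], ['Bill', 27, None]],
    columns=['name', 'age', 'height'],
    index=['A484', 'A377'],
)

# Default behaviour: the NaN is skipped, so only Anthony's height counts
mean_skipping_nan = people['height'].mean()

# With skipna=False, a single NaN makes the result NaN
mean_with_nan = people['height'].mean(skipna=False)

# axis=1 reduces along each row: number of non-missing numeric values per person
values_per_row = people[['age', 'height']].count(axis=1)

print(mean_skipping_nan, mean_with_nan)
print(values_per_row.tolist())
```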

In [23]:

# Operations that change the order of the rows keep the index and column labels intact
df.sort_values('name', axis=0)

Out[23]:

	name	age	height
A484	Anthony	28	1.53
A377	Bill	27	NaN
A123	Emma	26	1.83
C012	Maria	31	1.76
B663	Philip	41	1.81

In [24]:

# sort_values returned a new DataFrame; df itself is unchanged
df

Out[24]:

	name	age	height
A484	Anthony	28	1.53
C012	Maria	31	1.76
A123	Emma	26	1.83
B663	Philip	41	1.81
A377	Bill	27	NaN
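Because sorting returns a new DataFrame, the result has to be assigned to a name to be kept. The sketch below, on a three-row version of the table, sorts by a column in descending order and by the row labels, and checks that the original order is untouched. Variable names are illustrative.

```python
import pandas as pd

people = pd.DataFrame(
    data=[['Maria', 31], ['Emma', 26], ['Philip', 41]],
    columns=['name', 'age'],
    index=['C012', 'A123', 'B663'],
)

# Sort by a column, largest value first; each row keeps its label
oldest_first = people.sort_values('age', ascending=False)

# Sort by the row labels instead of a column
by_label = people.sort_index()

print(oldest_first.index.tolist())
print(by_label.index.tolist())
print(people.index.tolist())  # the original frame is not reordered
```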

Operations on strings

In [25]:

# Use `.str` to access string operations
# Third character of each name
df['name'].str[2]

Out[25]:

A484    t
C012    r
A123    m
B663    i
A377    l
Name: name, dtype: object

In [26]:

# Upper-case version of each name
df['name'].str.upper()

Out[26]:

A484    ANTHONY
C012      MARIA
A123       EMMA
B663     PHILIP
A377       BILL
Name: name, dtype: object

In [27]:

# Counting is case-sensitive: the 'A' in 'Anthony' is not counted
df['name'].str.count('a')

Out[27]:

A484    0
C012    2
A123    1
B663    0
A377    0
Name: name, dtype: int64

In [28]:

# Convert to lower case first to count both 'A' and 'a'
df['name'].str.lower().str.count('a')

Out[28]:

A484    1
C012    2
A123    1
B663    0
A377    0
Name: name, dtype: int64
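String methods that return booleans can be used directly as filters. A minimal sketch on a standalone Series of three of the names (the variable names are illustrative): `.str.len()` gives the length of each string, and `.str.contains()` with `case=False` builds a case-insensitive mask.

```python
import pandas as pd

names = pd.Series(['Anthony', 'Maria', 'Emma'], index=['A484', 'C012', 'A123'])

# Length of each name
name_lengths = names.str.len()

# Case-insensitive test for the letter 'm', used as a boolean mask
has_m = names.str.contains('m', case=False)
names_with_m = names[has_m]

print(name_lengths.tolist())
print(names_with_m.tolist())
```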

Adding new columns

In [29]:

df

Out[29]:

	name	age	height
A484	Anthony	28	1.53
C012	Maria	31	1.76
A123	Emma	26	1.83
B663	Philip	41	1.81
A377	Bill	27	NaN

In [30]:

df['name_upper'] = df['name'].str.upper()

In [31]:

df

Out[31]:

	name	age	height	name_upper
A484	Anthony	28	1.53	ANTHONY
C012	Maria	31	1.76	MARIA
A123	Emma	26	1.83	EMMA
B663	Philip	41	1.81	PHILIP
A377	Bill	27	NaN	BILL
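New columns can also be computed from numeric columns, and missing values carry over into the result. The sketch below uses a two-row version of the table; the column names `height_cm` and `is_adult` are made up for illustration. It also shows `assign`, which returns a new DataFrame with the extra column instead of modifying the original.

```python
import pandas as pd

people = pd.DataFrame(
    data=[['Anthony', 28, 1.53], ['Bill', 27, None]],
    columns=['name', 'age', 'height'],
    index=['A484', 'A377'],
)

# Arithmetic is applied element by element; Bill's missing height stays NaN
people['height_cm'] = people['height'] * 100

# assign returns a copy with the new column; `people` itself is not changed
with_flag = people.assign(is_adult=people['age'] >= 18)

print(people.columns.tolist())
print(with_flag.columns.tolist())
```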
