Pandas, quick introduction
In [1]:
import pandas as pd
Pandas introduces a tabular data structure, the DataFrame
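- Each column has its own dtype, typically a NumPy-native type (integers, floats, booleans); text columns are stored as `object` or as a dedicated string dtype, depending on the pandas version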
- Columns and rows have indices, i.e. labels that identify each column or row
In [2]:
df = pd.DataFrame(
data = [
['Anthony', 28, 1.53],
['Maria', 31, 1.76],
['Emma', 26, 1.83],
['Philip', 41, 1.81],
['Bill', 27, None],
],
columns = ['name', 'age', 'height'],
index=['A484', 'C012', 'A123', 'B663', 'A377'],
)
In [3]:
df
Out[3]:
In [4]:
df.head(3)
Out[4]:
In [5]:
df.sample(3)
Out[5]:
DataFrame attributes
In [6]:
df.shape
Out[6]:
In [7]:
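# Each column can have a different dtype
# Numeric columns use native NumPy dtypes (int64, float64);
# the text column shows up as `object` (or as a string dtype in recent pandas versions)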
df.dtypes
Out[7]:
In [8]:
df.columns
Out[8]:
In [9]:
df.index
Out[9]:
Indexing rows and columns
In [10]:
# Default indexing is by column
df['age']
Out[10]:
In [11]:
# Use a list to select multiple columns (like in NumPy's fancy indexing)
df[['age', 'name']]
Out[11]:
In [12]:
# Indexing by row / column name
df.loc['A484', 'height']
Out[12]:
In [13]:
# Indexing by element position, like in NumPy (often a code smell: prefer labels where possible)
df.iloc[0, 2]
Out[13]:
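To illustrate why positional indexing is fragile, here is a minimal sketch. It rebuilds a reduced copy of the table so that it runs on its own; the names `people` and `by_age` are introduced here only for illustration. After sorting, the label `'A484'` still identifies the same person, while position 0 points at a different row.

```python
import pandas as pd

# Reduced copy of the example table (illustrative name: `people`)
people = pd.DataFrame(
    data=[
        ['Anthony', 28, 1.53],
        ['Maria', 31, 1.76],
        ['Emma', 26, 1.83],
    ],
    columns=['name', 'age', 'height'],
    index=['A484', 'C012', 'A123'],
)

by_age = people.sort_values('age')

# Label-based lookup is unaffected by the row order
label_lookup = by_age.loc['A484', 'name']

# Position-based lookup returns whichever row happens to come first
position_lookup = by_age.iloc[0, 0]

print(label_lookup, position_lookup)
```

Code that relies on `.iloc` silently changes meaning whenever rows are reordered or filtered, which is the main reason to prefer `.loc` with labels.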
Examining a column
In [14]:
df['name'].unique()
Out[14]:
In [15]:
df['name'].nunique()
Out[15]:
In [16]:
df['height'].describe()
Out[16]:
Filtering
In [17]:
df[df['age'] > 30]
Out[17]:
In [18]:
is_old_and_tall = (df['age'] > 30) & (df['height'] > 1.8)
df[is_old_and_tall]
Out[18]:
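The filters above are boolean masks: a comparison on a column yields one True/False value per row, and indexing with that mask keeps the True rows. The sketch below rebuilds the table under the illustrative name `people` and shows the other logical operators. It also shows how a missing height behaves: a comparison against a missing value evaluates to False.

```python
import pandas as pd

# Standalone copy of the example table (illustrative name: `people`)
people = pd.DataFrame(
    data=[
        ['Anthony', 28, 1.53],
        ['Maria', 31, 1.76],
        ['Emma', 26, 1.83],
        ['Philip', 41, 1.81],
        ['Bill', 27, None],
    ],
    columns=['name', 'age', 'height'],
    index=['A484', 'C012', 'A123', 'B663', 'A377'],
)

is_tall = people['height'] > 1.8   # boolean Series, one value per row
is_young = people['age'] < 30

# Combine masks with & (and), | (or), ~ (not)
tall_or_young = people[is_tall | is_young]
not_tall = people[~is_tall]

# Bill's height is missing, so `is_tall` is False for him,
# and he therefore appears in the negated selection
print(not_tall['name'].tolist())
```

If rows with missing values should be excluded from both the mask and its negation, one option is to drop them first, for example with `people.dropna(subset=['height'])`.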
Basic operations are by column (unlike NumPy)
In [19]:
df['age'].min()
Out[19]:
In [20]:
df.min()
Out[20]:
In [21]:
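# Note that Pandas operations ignore NaNs (they consider them as "missing")
# With the non-numeric 'name' column present, this call raises a TypeError in pandas >= 2.0
# (older versions dropped the column, silently or with a FutureWarning); see the next cell
df.mean()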
Out[21]:
In [22]:
df.mean(numeric_only=True)
Out[22]:
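As a small sketch of the missing-value behaviour described above, the following compares the pandas default with NumPy on the heights from the example table. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Heights from the example table; None becomes NaN in a float column
heights = pd.Series([1.53, 1.76, 1.83, 1.81, None], name='height')

pandas_mean = heights.mean()               # NaN is skipped by default
strict_mean = heights.mean(skipna=False)   # NaN propagates when skipna=False
numpy_mean = np.mean(heights.to_numpy())   # plain NumPy also propagates NaN
n_valid = heights.count()                  # number of non-missing values

print(pandas_mean, strict_mean, numpy_mean, n_valid)
```

Checking `count()` next to a mean is a cheap way to see how many values actually contributed to it.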
In [23]:
# Operations that change the order of the rows keep the index and column labels intact
df.sort_values('name', axis=0)
Out[23]:
In [24]:
# sort_values returned a sorted copy; df itself is unchanged
df
Out[24]:
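The point of the two cells above can be checked directly. In this minimal sketch (with a reduced, illustrative table called `people`), sorting returns a new DataFrame whose index labels have moved together with their rows, and the original stays in its initial order.

```python
import pandas as pd

# Reduced copy of the example table (illustrative name: `people`)
people = pd.DataFrame(
    {'name': ['Anthony', 'Maria', 'Emma'], 'age': [28, 31, 26]},
    index=['A484', 'C012', 'A123'],
)

by_name = people.sort_values('name')

original_order = list(people.index)   # unchanged by the sort
sorted_order = list(by_name.index)    # labels travel with their rows

print(original_order, sorted_order)
```

To keep the sorted version, assign the result to a name, as done here with `by_name`.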
Operations on strings
In [25]:
# Use `.str` to access string operations
# Third character of each name
df['name'].str[2]
Out[25]:
In [26]:
# Uppercase version of each name
df['name'].str.upper()
Out[26]:
In [27]:
# str.count is case-sensitive: the 'A' in 'Anthony' is not counted
df['name'].str.count('a')
Out[27]:
In [28]:
# Lowercase the names first to count both 'A' and 'a'
df['name'].str.lower().str.count('a')
Out[28]:
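A short sketch of the string operations above, run on a standalone Series with the names from the example table (the variable names are illustrative). It makes the case-sensitivity of `str.count` explicit and shows that the result of a `.str` operation can be used as a filter.

```python
import pandas as pd

names = pd.Series(['Anthony', 'Maria', 'Emma', 'Philip', 'Bill'])

# Counting 'a' is case-sensitive unless the names are lowercased first
case_sensitive = names.str.count('a').tolist()
case_insensitive = names.str.lower().str.count('a').tolist()

# The first character of each name, used as a boolean filter
starts_with_vowel = names[names.str[0].isin(list('AEIOU'))].tolist()

print(case_sensitive, case_insensitive, starts_with_vowel)
```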
Adding new columns
In [29]:
df
Out[29]:
In [30]:
df['name_upper'] = df['name'].str.upper()
In [31]:
df
Out[31]:
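New columns can also be computed from numeric columns, and there is a non-mutating alternative to assignment. The sketch below uses a reduced, illustrative table (`people`) and two hypothetical column names, `height_cm` and `initial`, that are not part of the original example.

```python
import pandas as pd

# Reduced copy of the example table (illustrative name: `people`)
people = pd.DataFrame(
    {'name': ['Anthony', 'Bill'], 'height': [1.53, None]},
    index=['A484', 'A377'],
)

# Arithmetic on a column is element-wise; missing values stay missing
people['height_cm'] = people['height'] * 100

# assign() returns a new DataFrame and leaves `people` untouched
with_initial = people.assign(initial=people['name'].str[0])

print(with_initial)
```

Assigning with `people[...] = ...` modifies the DataFrame in place, while `assign` is convenient when the original should remain unchanged or when chaining several steps.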