2024-heraklion-data/exercises/tabular_tidy_data/tidy_data.ipynb
2024-08-27 15:27:53 +03:00

Exercise: Analysis of tuberculosis cases by country and year period

In [1]:
import pandas as pd

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 100)
pd.set_option("display.max_colwidth", None)

Load the TB data from the World Health Organization

In [2]:
tb_raw = pd.read_csv('who2.csv', index_col='rownames')

Keep only the data from 2000 to 2012 (inclusive), restricted to the country, year, and sp_* columns

In [3]:
cols = ['country', 'year'] + [c for c in tb_raw.columns if c.startswith('sp')]
tb_raw = tb_raw.loc[tb_raw['year'].between(2000, 2012), cols]
In [4]:
tb_raw.shape
Out[4]:
(2783, 16)
In [5]:
tb_raw.sample(7, random_state=727)
Out[5]:
rownames     country  year  sp_m_014  sp_m_1524  sp_m_2534  sp_m_3544  sp_m_4554  sp_m_5564  sp_m_65  sp_f_014  sp_f_1524  sp_f_2534  sp_f_3544  sp_f_4554  sp_f_5564  sp_f_65
    5551  San Marino  2009       NaN        NaN        NaN        NaN        NaN        NaN      NaN       NaN        NaN        NaN        NaN        NaN        NaN      NaN
     642     Belarus  2009       0.0       66.0      173.0      208.0      287.0      134.0     54.0       0.0       41.0       52.0       52.0       41.0       25.0     68.0
    7234    Zimbabwe  2007     138.0      500.0     3693.0        0.0      716.0      292.0    153.0     185.0      739.0     3311.0        0.0      553.0      213.0     90.0
    3471      Kuwait  2008       0.0       18.0       90.0       56.0       34.0       11.0      9.0       2.0       33.0       47.0       27.0        7.0        5.0      6.0
    3336      Jordan  2009       1.0        5.0       15.0       14.0       10.0        7.0      6.0       0.0        7.0       14.0        8.0        3.0        7.0     12.0
    2689     Grenada  2008       NaN        1.0        NaN        1.0        2.0        NaN      1.0       NaN        NaN        NaN        NaN        NaN        NaN      NaN
     634     Belarus  2001       2.0        NaN        NaN        NaN        NaN        NaN      NaN       4.0        NaN        NaN        NaN        NaN        NaN      NaN
In [6]:
tb_raw[tb_raw['country'] == 'Angola']
Out[6]:
rownames     country  year  sp_m_014  sp_m_1524  sp_m_2534  sp_m_3544  sp_m_4554  sp_m_5564  sp_m_65  sp_f_014  sp_f_1524  sp_f_2534  sp_f_3544  sp_f_4554  sp_f_5564  sp_f_65
     191      Angola  2000     186.0      999.0     1003.0      912.0      482.0      312.0    194.0     247.0     1142.0     1091.0      844.0      417.0      200.0    120.0
     192      Angola  2001     230.0      892.0      752.0      648.0      420.0      197.0    173.0     279.0      993.0      869.0      647.0      323.0      200.0    182.0
     193      Angola  2002     435.0     2223.0     2292.0     1915.0     1187.0      624.0    444.0     640.0     2610.0     2208.0     1600.0      972.0      533.0    305.0
     194      Angola  2003     409.0     2355.0     2598.0     1908.0     1090.0      512.0    361.0     591.0     3078.0     2641.0     1747.0     1157.0      395.0    129.0
     195      Angola  2004     554.0     2684.0     2659.0     1998.0     1196.0      561.0    321.0     733.0     3198.0     2772.0     1854.0     1029.0      505.0    269.0
     196      Angola  2005     520.0     2549.0     2797.0     1918.0     1255.0      665.0    461.0     704.0     2926.0     2682.0     1797.0     1138.0      581.0    417.0
     197      Angola  2006     540.0     2632.0     3049.0     2182.0     1397.0      729.0    428.0     689.0     2851.0     2892.0     1990.0     1223.0      583.0    314.0
     198      Angola  2007     484.0     2824.0     3197.0     2255.0     1357.0      699.0    465.0     703.0     2943.0     2721.0     1812.0     1041.0      554.0    367.0
     199      Angola  2008     367.0     2970.0     3493.0     2418.0     1480.0      733.0    420.0     512.0     3199.0     2786.0     2082.0     1209.0      556.0    337.0
     200      Angola  2009     392.0     3054.0     3600.0     2420.0     1590.0      748.0    463.0     568.0     3152.0     2798.0     1790.0     1069.0      572.0    272.0
     201      Angola  2010     448.0     2900.0     3584.0     2415.0     1424.0      691.0    355.0     558.0     2763.0     2594.0     1688.0      958.0      482.0    286.0
     202      Angola  2011     501.0     3000.0     3792.0     2386.0     1395.0      680.0    455.0     708.0     2731.0     2563.0     1683.0     1006.0      457.0    346.0
     203      Angola  2012     390.0     2804.0     3627.0     2529.0     1427.0      732.0    424.0     592.0     2501.0     2540.0     1617.0     1028.0      529.0    384.0
In [7]:
tb_raw.columns
Out[7]:
Index(['country', 'year', 'sp_m_014', 'sp_m_1524', 'sp_m_2534', 'sp_m_3544',
       'sp_m_4554', 'sp_m_5564', 'sp_m_65', 'sp_f_014', 'sp_f_1524',
       'sp_f_2534', 'sp_f_3544', 'sp_f_4554', 'sp_f_5564', 'sp_f_65'],
      dtype='object')

1. Make data tidy

The final table should have these columns: country, year, gender, age_range, cases

In [ ]:
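One possible approach to the tidying step, sketched on a toy table rather than on the real WHO data: melt the sp_* columns into rows, then split each column name on the underscore to recover gender and age range. The frame toy_raw, its values, and the intermediate column name key are hypothetical stand-ins chosen for illustration; only the column layout mirrors tb_raw. This is one way to do it, not necessarily the intended solution.

```python
import pandas as pd

# Hypothetical toy stand-in for tb_raw: same column layout, made-up values.
toy_raw = pd.DataFrame({
    'country': ['A', 'A', 'B'],
    'year': [2000, 2001, 2000],
    'sp_m_014': [1.0, 2.0, None],
    'sp_f_65': [3.0, None, 5.0],
})

# Wide -> long: one row per (country, year, original sp_* column).
tidy = toy_raw.melt(id_vars=['country', 'year'],
                    var_name='key', value_name='cases')

# 'sp_m_014' -> ['sp', 'm', '014']; the constant 'sp' prefix is discarded.
parts = tidy['key'].str.split('_', expand=True)
tidy['gender'] = parts[1]
tidy['age_range'] = parts[2]

# Keep only the columns requested by the exercise, in that order.
tidy = tidy[['country', 'year', 'gender', 'age_range', 'cases']]
print(tidy)
```

On the real data the same steps would start from tb_raw instead of toy_raw. Whether to drop rows with missing cases (for example with dropna(subset=['cases'])) and whether to relabel codes such as 014 as 0-14 are left as design choices.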

2. Compute summary tables

  1. Compute the number of cases per country and gender, for data between 2000 and 2006 (inclusive)
  2. Compute the number of cases per country and year range (2000-2006, 2007-2012) on the rows, and gender on the columns
In [ ]:
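The two summary tables could be sketched as follows, assuming a tidy table with the columns country, year, gender, age_range, cases. The frame below is a hypothetical toy example with made-up values, and the names early, cases_by_gender, year_range, and cases_by_period are illustrative. The first table uses a filter followed by groupby, the second a pivot_table over a derived year-range label.

```python
import pandas as pd

# Hypothetical tidy table with made-up values, in the target column layout.
tidy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'year': [2000, 2000, 2007, 2007, 2006, 2012],
    'gender': ['m', 'f', 'm', 'f', 'm', 'f'],
    'age_range': ['014'] * 6,
    'cases': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# 1. Cases per country and gender, for 2000-2006 (both ends inclusive).
early = tidy[tidy['year'].between(2000, 2006)]
cases_by_gender = early.groupby(['country', 'gender'])['cases'].sum()

# 2. Label each row with its year range. This assumes the data was already
#    restricted to 2000-2012, so anything after 2006 falls in the second range.
tidy['year_range'] = tidy['year'].map(
    lambda y: '2000-2006' if y <= 2006 else '2007-2012'
)

# Country and year range on the rows, gender on the columns.
cases_by_period = tidy.pivot_table(
    index=['country', 'year_range'],
    columns='gender',
    values='cases',
    aggfunc='sum',
)

print(cases_by_gender)
print(cases_by_period)
```

Note that groupby(...).sum() skips missing values, so a group whose cases are all NaN sums to 0 rather than NaN; if that distinction matters, sum(min_count=1) keeps such groups as NaN.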