10 Using Pandas to Explore COVID-19 Data

10.1 Plans for this section

In the previous two sections, we introduced basics to plotting using matplotlib and generating exploratory analysis & visualisations using pandas. In this section, we are going to put these skills into practice and explore the COVID19 dataset from GitHub (https://github.com/datasets/covid-19) that is maintained by Johns Hopkins University.

10.2 Housekeeping

It is always good to get a rough idea about a dataset, right at the start.

Exercise 1 What are we dealing with?

Result


Confirmed Recovered Deaths
count 1.142700e+05 1.142700e+05 114270.000000
mean 3.832402e+05 2.055785e+05 8941.022079
std 2.057547e+06 1.153802e+06 40707.331433
min 0.000000e+00 0.000000e+00 0.000000
25% 4.410000e+02 2.600000e+01 6.000000
50% 9.767000e+03 3.811000e+03 155.000000
75% 1.213500e+05 6.130400e+04 2185.000000
max 3.879675e+07 3.097475e+07 637531.000000
Date Country Confirmed Recovered Deaths
0 2020-01-22 Afghanistan 0 0 0
1 2020-01-23 Afghanistan 0 0 0
2 2020-01-24 Afghanistan 0 0 0
3 2020-01-25 Afghanistan 0 0 0
4 2020-01-26 Afghanistan 0 0 0

Tasks


The dataframe is huge with many countries!

  1. Load the data from https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv

  2. What is the shape of the data? I.e. how many rows and columns are there?

  3. Are there any missing numbers?

  4. What are column names?

Solution


data_url = 'https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv'
df = pd.read_csv(data_url)

df.shape

df.columns

df.describe()

df.info()

df.head()

Exercise 2 Drop those unnecessary columns?

Result


Date Country Confirmed Deaths
0 2020-01-22 Afghanistan 0 0
1 2020-01-23 Afghanistan 0 0
2 2020-01-24 Afghanistan 0 0
3 2020-01-25 Afghanistan 0 0
4 2020-01-26 Afghanistan 0 0

Tasks


Assuming we are only interested in Number of Deaths and Number of Confirmed cases.

  1. What are all the columns in the this dataset.
  2. Drop unnecessary columns.
  3. Keep necessary columns

Solution (Drop Columns)


df_dropped = df.drop(columns = 'Recovered')
df_dropped.head()

Solution (Keep Columns)


df_keep = df[["Date", "Country", "Confirmed", "Deaths"]]
df_keep.head()

Exercise 3 Subset to ASEAN countries only

Result


Date Country Confirmed Recovered Deaths
14064 2020-01-22 Brunei 0 0 0
14065 2020-01-23 Brunei 0 0 0
14066 2020-01-24 Brunei 0 0 0
14067 2020-01-25 Brunei 0 0 0
14068 2020-01-26 Brunei 0 0 0

Tasks


The dataframe is huge with many countries!

  1. Find a way to subset the dataframe to only contain ASEAN countries. Here is a list of the ASEAN countries.

    ASEAN_countries_list = [
    'Brunei', 'Burma', 'Cambodia',
    'Indonesia', 'Laos', 'Malaysia',
    'Philippines', 'Singapore', 'Vietnam']

Solution


ASEAN_countries = [
    'Brunei', 'Burma', 'Cambodia',
    'Indonesia', 'Laos', 'Malaysia',
    'Philippines', 'Singapore', 'Vietnam'
]

# Where are the rows with the ASEAN countries?
filtered_ASEAN_rows = df_all['Country'].isin(ASEAN_countries)

# Select the ASEAN rows
df_ASEAN = df_all[filtered_ASEAN_rows]
df_ASEAN.head()

10.3 Exploration of COVID19 dataset: Bar Chart

Exercise 4 Plotting the Total Deaths in each country

Result


Tasks


  1. Subset the dataset so that it contains only the latest date of each country.

  2. Using either pandas or matplotlib or both, plot the total number of deaths in each country.

  3. Is there a better way to represent this plot? Hint: Maybe make the barchart descending!

Solution


# Pandas Manipulation #
latest_date = max(df_ASEAN['Date'])
df_ASEAN_latest_date = df_ASEAN[df_ASEAN['Date'] == latest_date]
df_ASEAN_latest_date.set_index('Country', inplace=True)
df_ASEAN_latest_date.sort_values(by='Deaths', ascending=True, inplace=True)

# Plotting #
ax = df_ASEAN_latest_date['Deaths'].plot(kind='barh')

# Pretty Things #
ax.set_xlabel('Total Deaths', fontsize=12)
ax.set_ylabel('Country', fontsize=12)
ax.set_title('Number of Deaths due to COVID-19 by countries', fontsize=15)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.tight_layout()
plt.show()

10.4 Exploration of COVID19 dataset: Time-Series

Exercise 5 Total Confirmed Cases in each country over time

Result


Tasks


Using the dataframe containing ALL DATES of ASEAN countries only:

  1. Plot the total number of confirmed cases against dates.

  2. Ensure that each country has their own colour.

  3. Enable the legend

Solution


# Plotting #
fig, ax = plt.subplots(figsize=(6, 5))

# Iterate through our countries so we can plot automatically plot them!
for country in ASEAN_countries:
    df_ASEAN_perCountry = df_ASEAN[df_ASEAN['Country'] == country]
    df_ASEAN_perCountry.plot('Date','Confirmed',ax=ax, label = country)

# Pretty Things #
plt.legend()
plt.ylabel('Number of Confirmed Cases', fontsize = 12)
plt.xlabel('Date', fontsize = 12)
plt.title('Confirmed Cases over time in ASEAN countries\n', fontsize = 15)
plt.xticks(fontsize = 8)
plt.yticks(fontsize = 8)
plt.tight_layout()
plt.show()

10.5 Exploration of COVID19 dataset: Time-Series Daily Cases

Exercise 6 Daily Confirmed Cases in each country over time

Result


Tasks


Using the dataframe containing ALL DATES of ASEAN countries only:

  1. Find the daily cases for each country using .diff().

  2. Set all negative values to NaN or 0s.

  3. Plot the daily number of confirmed cases against dates.

  4. Ensure that each country has their own colour.

  5. Enable the legend

Solution


import numpy as np

# Pandas Manipulation #

df_ASEAN.loc[:, 'Daily'] = df_ASEAN['Confirmed'].diff()
df_ASEAN[df_ASEAN['Daily'] < 0] = np.NaN

# Plotting #
fig, ax = plt.subplots(figsize=(6, 5))

# Iterate through our countries so we can plot automatically plot them!
for country in ASEAN_countries:
    df_ASEAN_perCountry = df_ASEAN[df_ASEAN['Country'] == country]
    df_ASEAN_perCountry.plot('Date','Daily',ax=ax, label = country)

# Pretty Things #
plt.legend()
plt.ylabel('Number of Daily Cases', fontsize = 12)
plt.xlabel('Date', fontsize = 12)
plt.title('Daily Cases over time in ASEAN countries', fontsize = 15)
plt.xticks(fontsize = 8)
plt.yticks(fontsize = 8)
plt.tight_layout()
plt.show()