8 Using Pandas to Explore Data

In this chapter we will use the pandas package to make some exploratory plots of a dataset containing Gender, Height, and Weight. The dataset is an edited version of one from Kaggle: we trimmed it to 200 rows in total and converted all values to metric units.

8.1 Barcharts

Example 1 Barcharts with error bars

Result


Things to take note


  1. Loading in the dataset from https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv
  2. The mean and standard deviation are calculated using numpy (np.mean and np.std are passed to .agg).
  3. Usage of groupby, .agg and yerr.
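To make the groupby and .agg step more concrete, here is a minimal sketch on a tiny, made-up table. The `toy` DataFrame and its values are hypothetical and only stand in for the real CSV. The statistics are requested by name ('mean' and 'std') instead of as numpy functions, which should give the same kind of table here.

```python
import pandas as pd

# Hypothetical toy data (made-up values), standing in for the real CSV.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "Height": [160.0, 162.0, 164.0, 170.0, 174.0, 178.0],
})

# Group the heights by gender, then compute two statistics per group.
# The result has one row per gender and one column per statistic.
summary = toy.groupby("Gender")["Height"].agg(["mean", "std"])
print(summary)

# Passing yerr="std" to summary.plot(kind="bar", ...) would then use the
# 'std' column for the error bars, leaving 'mean' as the bar heights.
```

Note that the pandas 'std' is the sample standard deviation (it divides by n - 1), whereas calling np.std directly on an array divides by n by default.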

Example


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)

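# Group the Height column by Gender
df_height_grouped = df.groupby("Gender")['Height']

# Mean and standard deviation of Height for each Gender
df_height_mean_std_gender = df_height_grouped.agg([np.mean, np.std])

# yerr='std' draws the 'std' column as error bars on the 'mean' bars
df_height_mean_std_gender.plot(kind='bar', yerr='std', capsize=10, rot=0, legend=False)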

plt.ylabel('Average Height (cm)')
plt.xlabel('Gender')
plt.title('Average Height (cm) vs Gender')
plt.tight_layout()
plt.show()

8.2 Histograms

Example 2 Histograms with Grouping

Result (Grouped)


Things to take note


  1. Loading in the dataset from https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv
  2. Usage of .groupby to group in histograms
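As a rough illustration of what grouping does for a histogram, the sketch below loops over the groups explicitly and draws each one on the same axes. The `toy` data and the `bin_edges` variable are assumptions made for this example and are not part of the original code. Shared bin edges are used so the two histograms line up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data (made-up values), standing in for the real CSV.
toy = pd.DataFrame({
    "Gender": ["Female"] * 4 + ["Male"] * 4,
    "Height": [158.0, 160.0, 162.0, 166.0, 170.0, 172.0, 176.0, 180.0],
})

# Five equal-width bins spanning the full range of heights,
# shared by both groups so the bars are directly comparable.
bin_edges = np.linspace(toy["Height"].min(), toy["Height"].max(), 6)

fig, ax = plt.subplots()

# Iterating over a grouped column yields (group name, values) pairs.
for name, heights in toy.groupby("Gender")["Height"]:
    ax.hist(heights, bins=bin_edges, alpha=0.5, label=name)

ax.set_xlabel("Height (cm)")
ax.set_ylabel("Frequency")
ax.legend()
# Call plt.show() to display the figure in an interactive session.
```

The alpha value makes the bars semi-transparent, so overlapping regions of the two groups remain visible.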

Example (No Grouping)


import pandas as pd
import matplotlib.pyplot as plt

# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)


df['Height'].plot(kind='hist')

# Label your axes
plt.ylabel('Frequency', fontsize = 10)
plt.xlabel('Height (cm)', fontsize = 10)
plt.title('Histogram of Height (cm)', fontsize = 20)
plt.tight_layout()
plt.legend()
plt.show()

Example (Grouped)


import pandas as pd
import matplotlib.pyplot as plt

# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)


grouped_gender = df.groupby("Gender")['Height']
grouped_gender.plot(kind='hist', alpha = 0.5, bins = 15)

# Label your axes
plt.ylabel('Frequency')
plt.xlabel('Height (cm)')
plt.title('Histogram of Height (cm) grouped by Gender')
plt.tight_layout()
plt.legend()
plt.show()

8.3 Scatterplots

Example 3 Grouped Scatterplots

Result (Grouped)


Example (No Grouping)


import pandas as pd
import matplotlib.pyplot as plt

# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)


df.plot("Height", 'Weight', kind='scatter')

# Label your axes
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Weight (kg) vs Height (cm)')
plt.tight_layout()
# No legend here: there is only one unlabelled group of points
plt.show()

Example (Grouped)


import pandas as pd
import matplotlib.pyplot as plt

# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)


df_grouped_gender = df.groupby("Gender")

fig, ax = plt.subplots()

for name, gender in df_grouped_gender:
    ax.scatter(gender["Height"], gender["Weight"], label=name)

# Label your axes
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Weight (kg) vs Height (cm) grouped by Gender')
plt.tight_layout()
plt.legend()
plt.show()

8.4 Boxplots

Example 4 Boxplots

Result


Things to take note


  1. Loading in the dataset from https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv
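  2. The boxplot is drawn with df.boxplot rather than df.plot.
  3. When the by argument is used, pandas adds its own figure-level title ("Boxplot grouped by Gender"). We call plt.suptitle('') to remove this secondary title and then set our own with plt.title.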
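It can help to see the numbers behind each box. With the default settings, a box runs from the first to the third quartile with a line at the median. The sketch below computes those three values per group on a small, made-up table. The `toy` data is hypothetical and only stands in for the real CSV.

```python
import pandas as pd

# Hypothetical toy data (made-up values), standing in for the real CSV.
toy = pd.DataFrame({
    "Gender": ["Female"] * 5 + ["Male"] * 5,
    "Height": [156.0, 160.0, 162.0, 164.0, 168.0,
               168.0, 172.0, 174.0, 176.0, 180.0],
})

# First quartile, median, and third quartile of Height for each Gender.
# unstack() turns the quantile level into columns (0.25, 0.5, 0.75).
quartiles = (
    toy.groupby("Gender")["Height"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Comparing a table like this with the plot is a quick sanity check that the grouping in df.boxplot is doing what we expect.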

Example


import pandas as pd
import matplotlib.pyplot as plt

# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)


# Plot a boxplot of Height for each Gender
ax = df.boxplot('Height', by='Gender')
ax.grid(False)

# Don't forget to add the labels for clarity!
plt.suptitle('')
plt.title('Boxplot of Height (cm) grouped by Gender')
plt.ylabel('Height (cm)')
plt.xlabel('Gender')
plt.tight_layout()
plt.show()