8 Using Pandas to Explore Data
In this chapter we will use the pandas package to draw some exploratory plots of a dataset containing Gender, Height and Weight. The dataset is an edited version of one from Kaggle: we trimmed it to 200 rows in total and converted everything to the metric system.
8.1 Barcharts
Example 1 Barcharts with error bars
Things to take note
- Loading in the dataset from https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv
- Standard Deviation and Mean are calculated using numpy.
- Usage of groupby, .agg and yerr.
Example
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)
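# Group the heights by gender
df_height_grouped = df.groupby("Gender")['Height']
# One row per gender, with a 'mean' and a 'std' column
df_height_mean_se_gender = df_height_grouped.agg([np.mean, np.std])
# yerr='std' uses the 'std' column as error bars instead of plotting it as a bar
df_height_mean_se_gender.plot(kind='bar', yerr='std', capsize=10, rot=0, legend=False)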
plt.ylabel('Average Height (cm)')
plt.xlabel('Gender')
plt.title('Average Height (cm) vs Gender')
plt.tight_layout()
plt.show()
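To see what the groupby and .agg steps above hand to the plot, the following minimal sketch applies them to a small made-up table. The toy DataFrame and its values are illustrative assumptions, not the workshop dataset. It passes the aggregation names as strings and also computes 'sem', the standard error of the mean, in case that is the error bar wanted instead of the standard deviation.

```python
import pandas as pd

# Made-up heights (cm), for illustration only; not the workshop dataset
toy = pd.DataFrame({
    "Gender": ["Female"] * 3 + ["Male"] * 3,
    "Height": [160.0, 164.0, 168.0, 170.0, 175.0, 180.0],
})

# One row per gender, one column per aggregation
summary = toy.groupby("Gender")["Height"].agg(["mean", "std", "sem"])
print(summary)
```

The result has one row per gender, which is why the bar chart draws one bar per gender. The yerr argument refers to one of these columns by name.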
8.2 Histograms
Example 2 Histograms with Grouping
Things to take note
- Loading in the dataset from https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv
- Usage of .groupby to group the data in histograms
Example (No Grouping)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)
df['Height'].plot(kind='hist')
# Label your axes
plt.ylabel('Frequency', fontsize = 10)
plt.xlabel('Height (cm)', fontsize = 10)
plt.title('Histogram of Height (cm)', fontsize = 20)
plt.tight_layout()
plt.legend()
plt.show()
Example (Grouped)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)
grouped_gender = df.groupby("Gender")['Height']
grouped_gender.plot(kind='hist', alpha = 0.5, bins = 15)
# Label your axes
plt.ylabel('Frequency')
plt.xlabel('Height (cm)')
plt.title('Histogram of Height (cm) grouped by Gender')
plt.tight_layout()
plt.legend()
plt.show()
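When each group is plotted separately with an integer number of bins, the bin edges are worked out per group, so the bars of the two genders need not line up. One possible way around this, sketched below on made-up numbers, is to compute the edges once from all heights and reuse them for every group. The toy DataFrame and the choice of 3 bins are assumptions for illustration. The counts are computed here with numpy so the effect can be checked without a plot.

```python
import numpy as np
import pandas as pd

# Made-up heights (cm), for illustration only; not the workshop dataset
toy = pd.DataFrame({
    "Gender": ["Female"] * 4 + ["Male"] * 4,
    "Height": [155.0, 160.0, 162.0, 168.0, 165.0, 172.0, 178.0, 185.0],
})

# Bin edges computed once from all heights, so every group shares them
edges = np.histogram_bin_edges(toy["Height"], bins=3)

# Count how many heights of each gender fall into each shared bin
counts = {
    name: np.histogram(heights, bins=edges)[0]
    for name, heights in toy.groupby("Gender")["Height"]
}
print(edges)
print(counts)
```

matplotlib's hist accepts such an array of edges for its bins argument, so the same edges could be used when drawing the grouped histogram.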
8.3 Scatterplots
Example 3 Grouped Scatterplots
Things to take note
- Loading in the dataset from https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv
Example (No Grouping)
import pandas as pd
import matplotlib.pyplot as plt
# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)
df.plot("Height", 'Weight', kind='scatter')
# Label your axes
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Weight (kg) vs Height (cm)')
plt.tight_layout()
plt.show()
Example (Grouped)
import pandas as pd
import matplotlib.pyplot as plt
# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)
df_grouped_gender = df.groupby("Gender")
fig, ax = plt.subplots()
# Each iteration gives the group name and the rows belonging to that group
for name, gender in df_grouped_gender:
    ax.scatter(gender["Height"], gender["Weight"], label=name)
# Label your axes
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Weight (kg) vs Height (cm) grouped by Gender')
plt.tight_layout()
plt.legend()
plt.show()
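The loop in the grouped example relies on the fact that iterating over a groupby object yields pairs of a group key and the rows for that key. A minimal sketch on a made-up table shows what each iteration receives; the toy DataFrame and the group_sizes dictionary are hypothetical names used only for illustration.

```python
import pandas as pd

# Made-up measurements, for illustration only; not the workshop dataset
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Male"],
    "Height": [160.0, 175.0, 165.0, 180.0, 172.0],
    "Weight": [55.0, 70.0, 60.0, 82.0, 68.0],
})

group_sizes = {}
for name, group in toy.groupby("Gender"):
    # name is the group key; group is a DataFrame holding only that key's rows
    group_sizes[name] = len(group)
    print(name, group[["Height", "Weight"]].to_numpy().tolist())
```

Because each pass of the loop sees only one gender's rows, a separate ax.scatter call per pass gives each gender its own colour and legend entry.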
8.4 Boxplots
Example 4 Boxplots
Things to take note
- Loading in the dataset from https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv
- Note the use of df.boxplot for the boxplot instead of df.plot.
- For our purposes, plt.suptitle('') was used to remove the secondary title of our boxplot.
Example
import pandas as pd
import matplotlib.pyplot as plt
# Reading in the dataset as df
link = 'https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/height-weight-metric.csv'
df = pd.read_csv(link)
# Plotting boxplot
ax = df.boxplot('Height', by='Gender')
ax.grid(False)
# Don't forget to add the labels for clarity!
plt.suptitle('')
plt.title('Boxplot of Height (cm) grouped by Gender')
plt.ylabel('Height (cm)')
plt.xlabel('Gender')
plt.tight_layout()
plt.show()
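In a boxplot, each box spans the first to the third quartile of its group, with a line at the median. As a rough check on what the plot shows, the same numbers can be computed directly. The sketch below does this for a made-up table; the toy DataFrame and its values are assumptions for illustration, not the workshop dataset.

```python
import pandas as pd

# Made-up heights (cm), for illustration only; not the workshop dataset
toy = pd.DataFrame({
    "Gender": ["Female"] * 4 + ["Male"] * 4,
    "Height": [155.0, 160.0, 162.0, 168.0, 165.0, 172.0, 178.0, 185.0],
})

# First quartile, median and third quartile of Height for each gender
quartiles = (
    toy.groupby("Gender")["Height"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Each row of quartiles corresponds to one box: the 0.25 and 0.75 columns are the box edges and the 0.5 column is the median line.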