Chapter 7 The Power of Dataframes | TFI e-Workshop | Using Python To Tell Stories With Data

Exercise 1 Uploding Files to Colab

Task

Although Pandas can read the file directly over the internet, it will be good for us to learn how to upload files to Colab.

Download the xls file from the following link by right-clicking and using ‘Save as…’
https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/dummy_class.xls.
Upload the file to Colab.

Use the following code to read the file into Pandas.

import pandas as pd
pd.read_excel('dummy_class.xls')

Now we are ready to start exploring the Pandas by exploring the dataset! It is always best to first understand what kind of dataset you are dealing with. So, lets do some data ‘house keeping!’

Exercise 2 Some Housekeeping

Result

	Unnamed: 0	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)
0	0	Braiden Henson	PHY	M	20.205	18.960	21.25
1	1	Gustavo Vang	CHM	F	13.470	17.440	18.75
2	2	Ronin Christian	PHY	M	18.366	15.560	43.75
3	3	Owen Anderson	LS	F	18.366	16.360	18.75
4	4	Kyla Young	PHY	M	15.306	17.896	17.50
5	5	Wyatt Oliver	PHY	M	12.246	14.088	32.50
6	6	Essence Bauer	LS	M	16.530	16.720	31.25
7	7	Maryjane Sandoval	LS	F	18.981	16.400	43.75
8	8	Carl Trujillo	LS	F	15.306	13.680	47.50
9	9	Halle Fritz	LS	F	17.754	9.040	41.25

Tasks

import pandas as pd
df = pd.read_excel('dummy_class.xls')

Starting with the code above, answer the questions in the following table using the Pandas command indicated.

	Question	Possible code
1.	How many rows and columns are there?	`df.shape`
2.	What are the names of the columns?	`df.colmns`
3.	Can we look at a sample of the data	`df.head()` `df.tail()`
4.	What type of data is contained?	`df.info()`
5.	Can we have descriptive statistics of the data?	`df.describe()`
6.	What are the unique values in a column?	`df[column_name].unique()`
7.	How many unique values are there?	`df[column_name].value_counts()`

Solution

import pandas as pd
# df = pd.read_excel('dummy_class.xls')


print(f'{argv[0]}.html')

df.shape

df.columns

df.head(3)
df.tail(3)

df.info()

df.describe()

df['Major'].unique()

df['Major'].value_counts()

Example 1 Housekeeping Round 2 : Rename stuff

Result

	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)
0	Braiden Henson	Physics	Male	20.205	18.960	21.25
1	Gustavo Vang	Chemistry	Female	13.470	17.440	18.75
2	Ronin Christian	Physics	Male	18.366	15.560	43.75
3	Owen Anderson	Life Sciences	Female	18.366	16.360	18.75
4	Kyla Young	Physics	Male	15.306	17.896	17.50
5	Wyatt Oliver	Physics	Male	12.246	14.088	32.50
6	Essence Bauer	Life Sciences	Male	16.530	16.720	31.25
7	Maryjane Sandoval	Life Sciences	Female	18.981	16.400	43.75
8	Carl Trujillo	Life Sciences	Female	15.306	13.680	47.50
9	Halle Fritz	Life Sciences	Female	17.754	9.040	41.25

Tasks

Let’s tidy up our dataset by:

Dropping unnecessary columns (i.e.’Unnamed: 0’)
Replacing the majors with full names:
Replacing ‘M’ and ‘F’ with thier full form.

Before	After
PHY	Physics
CHM	Chemistry
LS	Life Sciences
CBIO’	Comp. Biology

Solution

my_to_replace = {
    'PHY': 'Physics',
    'CHM': 'Chemistry',
    'LS': 'Life Sciences',
    'CBIO': 'Comp. Biology',
    'F': 'Female',
    'M': 'Male'
}

# What does inplace do?
df.replace(to_replace=my_to_replace, inplace=True)

# Now lets drop the unnecessary columns
my_to_drop = 'Unnamed: 0'
df.drop(columns=my_to_drop, inplace=True)

Example 2 Housekeeping Round 3 : Missing Numbers

Result

’

	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)
10	Marie Hoffman	LS	F	19.593	16.880	32.50
11	Lilianna Kaufman	LS	M	26.328	17.000	50.00
12	Jaxon Chung	PHY	M	14.082	16.680	41.25
13	Zoey Oconnell	LS	F	22.041	14.128	25.00
14	Quentin Kemp	CHM	F	0.000	15.720	35.00
15	Leo Mayo	PHY	F	20.817	14.680	20.00
16	Camden Williams	CHM	F	22.653	17.648	50.00
17	Sidney Wiggins	PHY	F	19.593	14.680	0.00
18	Solomon Fletcher	CHM	F	18.981	15.000	33.75
19	Riley Christensen	CHM	F	15.306	16.016	43.75

’

Code

Missing numbers are common in real datasets. Therefore, it is important to have a policy on how you deal with them. So, let’s fill the missing numbers with 0 for our class. But, first, let’s see how to locate missing data.

Note: Pandas uses na and NaN to represent missing numbers.

Check if there are missing numbers by using info()
Use isna() and any() to locate:
All the missing numbers in the dataframe.
Any columns with missing numbers
Any rows with missing numbers

How many missing numbers do we have?

# How many missing numbers?
df.info()

#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 35 entries, 0 to 34
#> Data columns (total 6 columns):
#>  #   Column        Non-Null Count  Dtype  
#> ---  ------        --------------  -----  
#>  0   Name          35 non-null     object 
#>  1   Major         35 non-null     object 
#>  2   Gender        35 non-null     object 
#>  3   Test 1 (30%)  33 non-null     float64
#>  4   Test 2 (20%)  35 non-null     float64
#>  5   Test 3 (50%)  32 non-null     float64
#> dtypes: float64(3), object(3)
#> memory usage: 1.8+ KB

Missing everywhere

# Missing everywhere
df.isna()

’

	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)
0	False	False	False	False	False	False
1	False	False	False	False	False	False
2	False	False	False	False	False	False
3	False	False	False	False	False	False
4	False	False	False	False	False	False

’

Shows columns that have missing values

# Columns with missing numbers
df.isna().any(axis=0)

#> Name            False
#> Major           False
#> Gender          False
#> Test 1 (30%)     True
#> Test 2 (20%)    False
#> Test 3 (50%)     True
#> dtype: bool

Shows rows that have missing values

# Rows with missing numbers
df.isna().any(axis=1).head(20)

#> 0     False
#> 1     False
#> 2     False
#> 3     False
#> 4     False
#> 5     False
#> 6     False
#> 7     False
#> 8     False
#> 9     False
#> 10    False
#> 11    False
#> 12    False
#> 13    False
#> 14     True
#> 15    False
#> 16    False
#> 17     True
#> 18    False
#> 19    False
#> dtype: bool

Show the rows that have missing values

mask = df.isna().any(axis=1)
df[mask]

’

	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)
14	Quentin Kemp	CHM	F	NaN	15.720	35.00
17	Sidney Wiggins	PHY	F	19.593	14.680	NaN
28	Norah Miles	CHM	M	NaN	18.440	38.75
32	Katie Ayers	CBIO	F	16.530	10.280	NaN
34	Carter Crane	LS	M	20.817	13.816	NaN

’

Replace values with 0

df.fillna(0,inplace=True)

’

	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)
0	Braiden Henson	PHY	M	20.205	18.960	21.25
1	Gustavo Vang	CHM	F	13.470	17.440	18.75
2	Ronin Christian	PHY	M	18.366	15.560	43.75
3	Owen Anderson	LS	F	18.366	16.360	18.75
4	Kyla Young	PHY	M	15.306	17.896	17.50
5	Wyatt Oliver	PHY	M	12.246	14.088	32.50
6	Essence Bauer	LS	M	16.530	16.720	31.25
7	Maryjane Sandoval	LS	F	18.981	16.400	43.75
8	Carl Trujillo	LS	F	15.306	13.680	47.50
9	Halle Fritz	LS	F	17.754	9.040	41.25
10	Marie Hoffman	LS	F	19.593	16.880	32.50
11	Lilianna Kaufman	LS	M	26.328	17.000	50.00
12	Jaxon Chung	PHY	M	14.082	16.680	41.25
13	Zoey Oconnell	LS	F	22.041	14.128	25.00
14	Quentin Kemp	CHM	F	0.000	15.720	35.00
15	Leo Mayo	PHY	F	20.817	14.680	20.00
16	Camden Williams	CHM	F	22.653	17.648	50.00
17	Sidney Wiggins	PHY	F	19.593	14.680	0.00
18	Solomon Fletcher	CHM	F	18.981	15.000	33.75
19	Riley Christensen	CHM	F	15.306	16.016	43.75
20	Malik Becker	PHY	M	12.858	15.480	46.25
21	Skylar Hensley	CHM	M	23.877	17.648	32.50
22	Braydon Duran	LS	F	15.918	14.360	41.25
23	Jalen Harmon	LS	F	22.041	16.880	22.50
24	Ean Haas	LS	F	19.593	14.360	30.00
25	Carolina Mcmahon	PHY	M	12.858	18.080	47.50
26	Elian Potter	PHY	M	23.265	14.360	31.25
27	Litzy White	PHY	M	20.817	15.920	16.25
28	Norah Miles	CHM	M	0.000	18.440	38.75
29	Mariela Sheppard	LS	M	18.366	11.000	28.75
30	Isabela Stokes	CHM	M	18.366	16.800	36.25
31	Kathleen Rodriguez	PHY	M	22.653	17.760	16.25
32	Katie Ayers	CBIO	F	16.530	10.280	0.00
33	Tucker Sloan	LS	M	17.754	15.040	25.00
34	Carter Crane	LS	M	20.817	13.816	0.00

’

Example 3 Adding New Columns

Results

’

	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)	Total (100%)
0	Braiden Henson	PHY	M	20.205	18.960	21.25	60.415
1	Gustavo Vang	CHM	F	13.470	17.440	18.75	49.660
2	Ronin Christian	PHY	M	18.366	15.560	43.75	77.676
3	Owen Anderson	LS	F	18.366	16.360	18.75	53.476
4	Kyla Young	PHY	M	15.306	17.896	17.50	50.702

’

Code

It is very easy to add columns with Pandas. Let’s add a column for the total of the test scores.

df['Total (100%)'] = df['Test 1 (30%)'] +   df['Test 2 (20%)'] + df['Test 3 (50%)']
df.head()

Exercise 3 Is Maryjane and Ronin in the class?

Results

’

	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)
2	Ronin Christian	PHY	M	18.366	15.56	43.75
7	Maryjane Sandoval	LS	F	18.981	16.40	43.75

’

Tasks

Use the ‘hidden’ function isin() to find out if Maryjane Sandoval and Ronin Christian are in the class?
Use a mask to locate the corresponding rows in the dataframe.

Solution

students = ['Maryjane Sandoval' , 'Ronin Christian']
mask=df.isin(students).any(axis=1)
df[mask]

import pandas as pd
df=pd.read_excel('files/dummy_class.xls')
df.drop(columns = ['Unnamed: 0'],inplace=True)
students = ['Maryjane Sandoval' , 'Ronin Christian']
mask=df.isin(students).any(axis=1)
df[mask]

#>                 Name Major Gender  Test 1 (30%)  Test 2 (20%)  Test 3 (50%)
#> 2    Ronin Christian   PHY      M        18.366         15.56         43.75
#> 7  Maryjane Sandoval    LS      F        18.981         16.40         43.75

Example 4 Moving Around and Asking Questions

Results

’

	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)

’

Tasks

One of the most powerful features of Pandas is how easy it is to find subsets of data that satisfy one or more conditions.

Show columns ‘Name’ and ‘Major.’
Show rows 5 to 10 for columns 3 to 5.
What are the names of the female students?
What are the names of the female students and studying chemsitry?

Code

Pandas offers us two ways (loc and iloc) to ‘move around’ in a dataframe.

How	Command
By specifying a name	`loc[]`
By specifying a row number (starting at 0)	`iloc[]`

Show columns ‘Name’ and ‘Major.’

df.loc[:,['Name', 'Major']]

#>                   Name          Major
#> 0       Braiden Henson        Physics
#> 1         Gustavo Vang      Chemistry
#> 2      Ronin Christian        Physics
#> 3        Owen Anderson  Life Sciences
#> 4           Kyla Young        Physics
#> 5         Wyatt Oliver        Physics
#> 6        Essence Bauer  Life Sciences
#> 7    Maryjane Sandoval  Life Sciences
#> 8        Carl Trujillo  Life Sciences
#> 9          Halle Fritz  Life Sciences
#> 10       Marie Hoffman  Life Sciences
#> 11    Lilianna Kaufman  Life Sciences
#> 12         Jaxon Chung        Physics
#> 13       Zoey Oconnell  Life Sciences
#> 14        Quentin Kemp      Chemistry
#> 15            Leo Mayo        Physics
#> 16     Camden Williams      Chemistry
#> 17      Sidney Wiggins        Physics
#> 18    Solomon Fletcher      Chemistry
#> 19   Riley Christensen      Chemistry
#> 20        Malik Becker        Physics
#> 21      Skylar Hensley      Chemistry
#> 22       Braydon Duran  Life Sciences
#> 23        Jalen Harmon  Life Sciences
#> 24            Ean Haas  Life Sciences
#> 25    Carolina Mcmahon        Physics
#> 26        Elian Potter        Physics
#> 27         Litzy White        Physics
#> 28         Norah Miles      Chemistry
#> 29    Mariela Sheppard  Life Sciences
#> 30      Isabela Stokes      Chemistry
#> 31  Kathleen Rodriguez        Physics
#> 32         Katie Ayers  Comp. Biology
#> 33        Tucker Sloan  Life Sciences
#> 34        Carter Crane  Life Sciences

This will work too.

df[['Name', 'Major']]

Show rows 5 to 10 for columns 3 to 5

df.iloc[5:11,3:6]

#>     Test 1 (30%)  Test 2 (20%)  Test 3 (50%)
#> 5         12.246        14.088         32.50
#> 6         16.530        16.720         31.25
#> 7         18.981        16.400         43.75
#> 8         15.306        13.680         47.50
#> 9         17.754         9.040         41.25
#> 10        19.593        16.880         32.50

What are the names of the female students?

mask = df['Gender'] == 'Female'                 # Asking a question.
df.loc[mask,['Name','Gender']]

#>                  Name  Gender
#> 1        Gustavo Vang  Female
#> 3       Owen Anderson  Female
#> 7   Maryjane Sandoval  Female
#> 8       Carl Trujillo  Female
#> 9         Halle Fritz  Female
#> 10      Marie Hoffman  Female
#> 13      Zoey Oconnell  Female
#> 14       Quentin Kemp  Female
#> 15           Leo Mayo  Female
#> 16    Camden Williams  Female
#> 17     Sidney Wiggins  Female
#> 18   Solomon Fletcher  Female
#> 19  Riley Christensen  Female
#> 22      Braydon Duran  Female
#> 23       Jalen Harmon  Female
#> 24           Ean Haas  Female
#> 32        Katie Ayers  Female

What are the names of the female students studying chemistry?

You can combine the use of == with the and (&) or the or (|) operator.

mask = (df['Gender'] == 'Female') & (df['Major'] == 'Chemistry')
df[mask]

’

	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)
1	Gustavo Vang	Chemistry	Female	13.470	17.440	18.75
14	Quentin Kemp	Chemistry	Female	0.000	15.720	35.00
16	Camden Williams	Chemistry	Female	22.653	17.648	50.00
18	Solomon Fletcher	Chemistry	Female	18.981	15.000	33.75
19	Riley Christensen	Chemistry	Female	15.306	16.016	43.75

’

Exercise 4 Physics students with less that 60%

Results

’

	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)	Total (100%)
4	Kyla Young	Physics	Male	15.306	17.896	17.50	50.702
5	Wyatt Oliver	Physics	Male	12.246	14.088	32.50	58.834
15	Leo Mayo	Physics	Female	20.817	14.680	20.00	55.497
17	Sidney Wiggins	Physics	Female	19.593	14.680	0.00	34.273
27	Litzy White	Physics	Male	20.817	15.920	16.25	52.987
31	Kathleen Rodriguez	Physics	Male	22.653	17.760	16.25	56.663

’

Tasks

Who are the Physics studnets with less than a total of 60%?

Solution

mask = (df['Major'] == 'Physics') & (df['Total (100%)'] <= 60)
df[mask]

Example 5 Grouping

Results

’

	Name	Major	Gender	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)	Total (100%)
4	Kyla Young	Physics	Male	15.306	17.896	17.50	50.702
5	Wyatt Oliver	Physics	Male	12.246	14.088	32.50	58.834
15	Leo Mayo	Physics	Female	20.817	14.680	20.00	55.497
17	Sidney Wiggins	Physics	Female	19.593	14.680	0.00	34.273
27	Litzy White	Physics	Male	20.817	15.920	16.25	52.987
31	Kathleen Rodriguez	Physics	Male	22.653	17.760	16.25	56.663

’

Code

Another extremely powerful feature of Pandas is its ability to group items according to categories and then calculate values for those groupes.

You can apply the ‘in-built’ function (count(), sum(), mean(), median(), min(), max(), mode(), std(), var()) directly to the groups. However, the real power of grouping comes with using agg() to apply any function.

What are the means of the scores for the various test, according to Major

df.groupby(by='Major').mean().round(2)

’

	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)	Total (100%)
Major
Chemistry	14.08	16.84	36.09	67.01
Comp. Biology	16.53	10.28	0.00	26.81
Life Sciences	19.24	14.69	31.25	65.18
Physics	17.76	16.18	27.81	61.75

’

What are the means of the scores for the various tests, according to Gender.

df.groupby(by='Gender').mean().round(2)

’

	Test 1 (30%)	Test 2 (20%)	Test 3 (50%)	Total (100%)
Gender
Female	17.47	14.91	29.63	62.01
Male	17.48	16.18	30.90	64.57

’

What are the means of the scores for the various tests, according to Major & Gender.

df.groupby(by=['Major','Gender']).mean().round(2)

’

		Test 1 (30%)	Test 2 (20%)	Test 3 (50%)	Total (100%)
Major	Gender
Chemistry	Female	14.08	16.36	36.25	66.70
Chemistry	Male	14.08	17.63	35.83	67.54
Comp. Biology	Female	16.53	10.28	0.00	26.81
Life Sciences	Female	18.84	14.68	33.61	67.13
Life Sciences	Male	19.96	14.72	27.00	61.67
Physics	Female	20.20	14.68	10.00	44.89
Physics	Male	17.27	16.48	31.38	65.12

’

Let’s use `mean()` and `std()` that belongs to `numpy`

df.groupby(by=['Major','Gender']).agg([np.mean, np.std]).round(2)

’

		Test 1 (30%)		Test 2 (20%)		Test 3 (50%)		Total (100%)
		mean	std	mean	std	mean	std	mean	std
Major	Gender
Chemistry	Female	14.08	8.63	16.36	1.14	36.25	11.83	66.70	17.13
Chemistry	Male	14.08	12.50	17.63	0.82	35.83	3.15	67.54	9.06
Comp. Biology	Female	16.53	NaN	10.28	NaN	0.00	NaN	26.81	NaN
Life Sciences	Female	18.84	2.35	14.68	2.47	33.61	10.28	67.13	8.06
Life Sciences	Male	19.96	3.89	14.72	2.45	27.00	17.91	61.67	21.03
Physics	Female	20.20	0.87	14.68	0.00	10.00	14.14	44.89	15.01
Physics	Male	17.27	4.29	16.48	1.66	31.38	12.85	65.12	10.41

’

7 The Power of Dataframes

Just enough

Exercise 1 Uploding Files to Colab

Task

Exercise 2 Some Housekeeping

Result

Tasks

Solution

Example 1 Housekeeping Round 2 : Rename stuff

Result

Tasks

Solution

Example 2 Housekeeping Round 3 : Missing Numbers

Result

Code

How many missing numbers do we have?

Missing everywhere

Shows columns that have missing values

Shows rows that have missing values

Show the rows that have missing values

Replace values with 0

Example 3 Adding New Columns

Results

Code

Exercise 3 Is Maryjane and Ronin in the class?

Results

Tasks

Solution

Example 4 Moving Around and Asking Questions

Results

Tasks

Code

Show columns ‘Name’ and ‘Major.’

Show rows 5 to 10 for columns 3 to 5

What are the names of the female students?

What are the names of the female students studying chemistry?

Exercise 4 Physics students with less that 60%

Results

Tasks

Solution

Example 5 Grouping

Results

Code

What are the means of the scores for the various test, according to Major

What are the means of the scores for the various tests, according to Gender.

What are the means of the scores for the various tests, according to Major & Gender.

Let’s use mean() and std() that belongs to numpy

Let’s use `mean()` and `std()` that belongs to `numpy`