7 The Power of Dataframes
In this day and age of ‘big data,’ it almost an oxymoron to claim that ‘data analysis skills are important, relevant and useful.’ In this section, we will start using Pandas which is Python’s de-facto module for data analysis. In addition to using Pandas you will see how immensely efficient it is to have your data in a dataframe than in a spreadsheet.
Just enough
Pandas offer several ways (i.e. different syntaxes) to achieve a specific goal. We will not cover all of these. Instead, we will only highlight a minimally sufficient set of syntax and instructions.
To learn the power of Pandas, we will use a dataset of test scores for a dummy class. You can access the data at https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/dummy_class.xls.
Exercise 1 Uploding Files to Colab
Task
Although Pandas can read the file directly over the internet, it will be good for us to learn how to upload files to Colab.
Download the
xls
file from the following link by right-clicking and using ‘Save as…’
https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/dummy_class.xls.Upload the file to Colab.
-
Use the following code to read the file into Pandas.
import pandas as pd 'dummy_class.xls') pd.read_excel(
Now we are ready to start exploring the Pandas by exploring the dataset! It is always best to first understand what kind of dataset you are dealing with. So, lets do some data ‘house keeping!’
Exercise 2 Some Housekeeping
Result
Unnamed: 0 | Name | Major | Gender | Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | |
---|---|---|---|---|---|---|---|
0 | 0 | Braiden Henson | PHY | M | 20.205 | 18.960 | 21.25 |
1 | 1 | Gustavo Vang | CHM | F | 13.470 | 17.440 | 18.75 |
2 | 2 | Ronin Christian | PHY | M | 18.366 | 15.560 | 43.75 |
3 | 3 | Owen Anderson | LS | F | 18.366 | 16.360 | 18.75 |
4 | 4 | Kyla Young | PHY | M | 15.306 | 17.896 | 17.50 |
5 | 5 | Wyatt Oliver | PHY | M | 12.246 | 14.088 | 32.50 |
6 | 6 | Essence Bauer | LS | M | 16.530 | 16.720 | 31.25 |
7 | 7 | Maryjane Sandoval | LS | F | 18.981 | 16.400 | 43.75 |
8 | 8 | Carl Trujillo | LS | F | 15.306 | 13.680 | 47.50 |
9 | 9 | Halle Fritz | LS | F | 17.754 | 9.040 | 41.25 |
Tasks
import pandas as pd
= pd.read_excel('dummy_class.xls') df
Starting with the code above, answer the questions in the following table using the Pandas command indicated.
Question | Possible code | |
---|---|---|
1. | How many rows and columns are there? | df.shape |
2. | What are the names of the columns? | df.colmns |
3. | Can we look at a sample of the data |
df.head() df.tail()
|
4. | What type of data is contained? | df.info() |
5. | Can we have descriptive statistics of the data? | df.describe() |
6. | What are the unique values in a column? | df[column_name].unique() |
7. | How many unique values are there? | df[column_name].value_counts() |
Example 1 Housekeeping Round 2 : Rename stuff
Result
Name | Major | Gender | Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | |
---|---|---|---|---|---|---|
0 | Braiden Henson | Physics | Male | 20.205 | 18.960 | 21.25 |
1 | Gustavo Vang | Chemistry | Female | 13.470 | 17.440 | 18.75 |
2 | Ronin Christian | Physics | Male | 18.366 | 15.560 | 43.75 |
3 | Owen Anderson | Life Sciences | Female | 18.366 | 16.360 | 18.75 |
4 | Kyla Young | Physics | Male | 15.306 | 17.896 | 17.50 |
5 | Wyatt Oliver | Physics | Male | 12.246 | 14.088 | 32.50 |
6 | Essence Bauer | Life Sciences | Male | 16.530 | 16.720 | 31.25 |
7 | Maryjane Sandoval | Life Sciences | Female | 18.981 | 16.400 | 43.75 |
8 | Carl Trujillo | Life Sciences | Female | 15.306 | 13.680 | 47.50 |
9 | Halle Fritz | Life Sciences | Female | 17.754 | 9.040 | 41.25 |
Tasks
Let’s tidy up our dataset by:
- Dropping unnecessary columns (i.e.’Unnamed: 0’)
- Replacing the majors with full names:
- Replacing ‘M’ and ‘F’ with thier full form.
Before | After |
---|---|
PHY | Physics |
CHM | Chemistry |
LS | Life Sciences |
CBIO’ | Comp. Biology |
Solution
my_to_replace = {
'PHY': 'Physics',
'CHM': 'Chemistry',
'LS': 'Life Sciences',
'CBIO': 'Comp. Biology',
'F': 'Female',
'M': 'Male'
}
# What does inplace do?
df.replace(to_replace=my_to_replace, inplace=True)
# Now lets drop the unnecessary columns
my_to_drop = 'Unnamed: 0'
df.drop(columns=my_to_drop, inplace=True)
Example 2 Housekeeping Round 3 : Missing Numbers
Result
’Name | Major | Gender | Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | |
---|---|---|---|---|---|---|
10 | Marie Hoffman | LS | F | 19.593 | 16.880 | 32.50 |
11 | Lilianna Kaufman | LS | M | 26.328 | 17.000 | 50.00 |
12 | Jaxon Chung | PHY | M | 14.082 | 16.680 | 41.25 |
13 | Zoey Oconnell | LS | F | 22.041 | 14.128 | 25.00 |
14 | Quentin Kemp | CHM | F | 0.000 | 15.720 | 35.00 |
15 | Leo Mayo | PHY | F | 20.817 | 14.680 | 20.00 |
16 | Camden Williams | CHM | F | 22.653 | 17.648 | 50.00 |
17 | Sidney Wiggins | PHY | F | 19.593 | 14.680 | 0.00 |
18 | Solomon Fletcher | CHM | F | 18.981 | 15.000 | 33.75 |
19 | Riley Christensen | CHM | F | 15.306 | 16.016 | 43.75 |
’
Code
Missing numbers are common in real datasets. Therefore, it is important to have a policy on how you deal with them. So, let’s fill the missing numbers with 0 for our class. But, first, let’s see how to locate missing data.
Note: Pandas uses na
and NaN
to represent missing numbers.
- Check if there are missing numbers by using
info()
- Use
isna()
andany()
to locate: - All the missing numbers in the dataframe.
- Any columns with missing numbers
- Any rows with missing numbers
How many missing numbers do we have?
# How many missing numbers?
df.info()
#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 35 entries, 0 to 34
#> Data columns (total 6 columns):
#> # Column Non-Null Count Dtype
#> --- ------ -------------- -----
#> 0 Name 35 non-null object
#> 1 Major 35 non-null object
#> 2 Gender 35 non-null object
#> 3 Test 1 (30%) 33 non-null float64
#> 4 Test 2 (20%) 35 non-null float64
#> 5 Test 3 (50%) 32 non-null float64
#> dtypes: float64(3), object(3)
#> memory usage: 1.8+ KB
Missing everywhere
# Missing everywhere
df.isna()
Name | Major | Gender | Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | False |
1 | False | False | False | False | False | False |
2 | False | False | False | False | False | False |
3 | False | False | False | False | False | False |
4 | False | False | False | False | False | False |
’
Shows columns that have missing values
# Columns with missing numbers
any(axis=0) df.isna().
#> Name False
#> Major False
#> Gender False
#> Test 1 (30%) True
#> Test 2 (20%) False
#> Test 3 (50%) True
#> dtype: bool
Shows rows that have missing values
# Rows with missing numbers
any(axis=1).head(20) df.isna().
#> 0 False
#> 1 False
#> 2 False
#> 3 False
#> 4 False
#> 5 False
#> 6 False
#> 7 False
#> 8 False
#> 9 False
#> 10 False
#> 11 False
#> 12 False
#> 13 False
#> 14 True
#> 15 False
#> 16 False
#> 17 True
#> 18 False
#> 19 False
#> dtype: bool
Show the rows that have missing values
= df.isna().any(axis=1)
mask df[mask]
Name | Major | Gender | Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | |
---|---|---|---|---|---|---|
14 | Quentin Kemp | CHM | F | NaN | 15.720 | 35.00 |
17 | Sidney Wiggins | PHY | F | 19.593 | 14.680 | NaN |
28 | Norah Miles | CHM | M | NaN | 18.440 | 38.75 |
32 | Katie Ayers | CBIO | F | 16.530 | 10.280 | NaN |
34 | Carter Crane | LS | M | 20.817 | 13.816 | NaN |
’
Replace values with 0
0,inplace=True) df.fillna(
Name | Major | Gender | Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | |
---|---|---|---|---|---|---|
0 | Braiden Henson | PHY | M | 20.205 | 18.960 | 21.25 |
1 | Gustavo Vang | CHM | F | 13.470 | 17.440 | 18.75 |
2 | Ronin Christian | PHY | M | 18.366 | 15.560 | 43.75 |
3 | Owen Anderson | LS | F | 18.366 | 16.360 | 18.75 |
4 | Kyla Young | PHY | M | 15.306 | 17.896 | 17.50 |
5 | Wyatt Oliver | PHY | M | 12.246 | 14.088 | 32.50 |
6 | Essence Bauer | LS | M | 16.530 | 16.720 | 31.25 |
7 | Maryjane Sandoval | LS | F | 18.981 | 16.400 | 43.75 |
8 | Carl Trujillo | LS | F | 15.306 | 13.680 | 47.50 |
9 | Halle Fritz | LS | F | 17.754 | 9.040 | 41.25 |
10 | Marie Hoffman | LS | F | 19.593 | 16.880 | 32.50 |
11 | Lilianna Kaufman | LS | M | 26.328 | 17.000 | 50.00 |
12 | Jaxon Chung | PHY | M | 14.082 | 16.680 | 41.25 |
13 | Zoey Oconnell | LS | F | 22.041 | 14.128 | 25.00 |
14 | Quentin Kemp | CHM | F | 0.000 | 15.720 | 35.00 |
15 | Leo Mayo | PHY | F | 20.817 | 14.680 | 20.00 |
16 | Camden Williams | CHM | F | 22.653 | 17.648 | 50.00 |
17 | Sidney Wiggins | PHY | F | 19.593 | 14.680 | 0.00 |
18 | Solomon Fletcher | CHM | F | 18.981 | 15.000 | 33.75 |
19 | Riley Christensen | CHM | F | 15.306 | 16.016 | 43.75 |
20 | Malik Becker | PHY | M | 12.858 | 15.480 | 46.25 |
21 | Skylar Hensley | CHM | M | 23.877 | 17.648 | 32.50 |
22 | Braydon Duran | LS | F | 15.918 | 14.360 | 41.25 |
23 | Jalen Harmon | LS | F | 22.041 | 16.880 | 22.50 |
24 | Ean Haas | LS | F | 19.593 | 14.360 | 30.00 |
25 | Carolina Mcmahon | PHY | M | 12.858 | 18.080 | 47.50 |
26 | Elian Potter | PHY | M | 23.265 | 14.360 | 31.25 |
27 | Litzy White | PHY | M | 20.817 | 15.920 | 16.25 |
28 | Norah Miles | CHM | M | 0.000 | 18.440 | 38.75 |
29 | Mariela Sheppard | LS | M | 18.366 | 11.000 | 28.75 |
30 | Isabela Stokes | CHM | M | 18.366 | 16.800 | 36.25 |
31 | Kathleen Rodriguez | PHY | M | 22.653 | 17.760 | 16.25 |
32 | Katie Ayers | CBIO | F | 16.530 | 10.280 | 0.00 |
33 | Tucker Sloan | LS | M | 17.754 | 15.040 | 25.00 |
34 | Carter Crane | LS | M | 20.817 | 13.816 | 0.00 |
’
Example 3 Adding New Columns
Results
’Name | Major | Gender | Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | Total (100%) | |
---|---|---|---|---|---|---|---|
0 | Braiden Henson | PHY | M | 20.205 | 18.960 | 21.25 | 60.415 |
1 | Gustavo Vang | CHM | F | 13.470 | 17.440 | 18.75 | 49.660 |
2 | Ronin Christian | PHY | M | 18.366 | 15.560 | 43.75 | 77.676 |
3 | Owen Anderson | LS | F | 18.366 | 16.360 | 18.75 | 53.476 |
4 | Kyla Young | PHY | M | 15.306 | 17.896 | 17.50 | 50.702 |
’
Exercise 3 Is Maryjane and Ronin in the class?
Results
’Name | Major | Gender | Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | |
---|---|---|---|---|---|---|
2 | Ronin Christian | PHY | M | 18.366 | 15.56 | 43.75 |
7 | Maryjane Sandoval | LS | F | 18.981 | 16.40 | 43.75 |
’
Tasks
- Use the ‘hidden’ function
isin()
to find out if Maryjane Sandoval and Ronin Christian are in the class? - Use a mask to locate the corresponding rows in the dataframe.
Solution
= ['Maryjane Sandoval' , 'Ronin Christian']
students =df.isin(students).any(axis=1)
mask df[mask]
import pandas as pd
=pd.read_excel('files/dummy_class.xls')
df= ['Unnamed: 0'],inplace=True)
df.drop(columns = ['Maryjane Sandoval' , 'Ronin Christian']
students =df.isin(students).any(axis=1)
mask df[mask]
#> Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%)
#> 2 Ronin Christian PHY M 18.366 15.56 43.75
#> 7 Maryjane Sandoval LS F 18.981 16.40 43.75
Example 4 Moving Around and Asking Questions
Tasks
One of the most powerful features of Pandas is how easy it is to find subsets of data that satisfy one or more conditions.
- Show columns ‘Name’ and ‘Major.’
- Show rows 5 to 10 for columns 3 to 5.
- What are the names of the female students?
- What are the names of the female students and studying chemsitry?
Code
Pandas offers us two ways (loc
and iloc
) to ‘move around’ in a dataframe.
How | Command |
---|---|
By specifying a name | loc[] |
By specifying a row number (starting at 0) | iloc[] |
Show columns ‘Name’ and ‘Major.’
'Name', 'Major']] df.loc[:,[
#> Name Major
#> 0 Braiden Henson Physics
#> 1 Gustavo Vang Chemistry
#> 2 Ronin Christian Physics
#> 3 Owen Anderson Life Sciences
#> 4 Kyla Young Physics
#> 5 Wyatt Oliver Physics
#> 6 Essence Bauer Life Sciences
#> 7 Maryjane Sandoval Life Sciences
#> 8 Carl Trujillo Life Sciences
#> 9 Halle Fritz Life Sciences
#> 10 Marie Hoffman Life Sciences
#> 11 Lilianna Kaufman Life Sciences
#> 12 Jaxon Chung Physics
#> 13 Zoey Oconnell Life Sciences
#> 14 Quentin Kemp Chemistry
#> 15 Leo Mayo Physics
#> 16 Camden Williams Chemistry
#> 17 Sidney Wiggins Physics
#> 18 Solomon Fletcher Chemistry
#> 19 Riley Christensen Chemistry
#> 20 Malik Becker Physics
#> 21 Skylar Hensley Chemistry
#> 22 Braydon Duran Life Sciences
#> 23 Jalen Harmon Life Sciences
#> 24 Ean Haas Life Sciences
#> 25 Carolina Mcmahon Physics
#> 26 Elian Potter Physics
#> 27 Litzy White Physics
#> 28 Norah Miles Chemistry
#> 29 Mariela Sheppard Life Sciences
#> 30 Isabela Stokes Chemistry
#> 31 Kathleen Rodriguez Physics
#> 32 Katie Ayers Comp. Biology
#> 33 Tucker Sloan Life Sciences
#> 34 Carter Crane Life Sciences
This will work too.
'Name', 'Major']] df[[
Show rows 5 to 10 for columns 3 to 5
5:11,3:6] df.iloc[
#> Test 1 (30%) Test 2 (20%) Test 3 (50%)
#> 5 12.246 14.088 32.50
#> 6 16.530 16.720 31.25
#> 7 18.981 16.400 43.75
#> 8 15.306 13.680 47.50
#> 9 17.754 9.040 41.25
#> 10 19.593 16.880 32.50
What are the names of the female students?
= df['Gender'] == 'Female' # Asking a question.
mask 'Name','Gender']] df.loc[mask,[
#> Name Gender
#> 1 Gustavo Vang Female
#> 3 Owen Anderson Female
#> 7 Maryjane Sandoval Female
#> 8 Carl Trujillo Female
#> 9 Halle Fritz Female
#> 10 Marie Hoffman Female
#> 13 Zoey Oconnell Female
#> 14 Quentin Kemp Female
#> 15 Leo Mayo Female
#> 16 Camden Williams Female
#> 17 Sidney Wiggins Female
#> 18 Solomon Fletcher Female
#> 19 Riley Christensen Female
#> 22 Braydon Duran Female
#> 23 Jalen Harmon Female
#> 24 Ean Haas Female
#> 32 Katie Ayers Female
What are the names of the female students studying chemistry?
You can combine the use of ==
with the and (&
) or the or (|
) operator.
= (df['Gender'] == 'Female') & (df['Major'] == 'Chemistry')
mask df[mask]
Name | Major | Gender | Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | |
---|---|---|---|---|---|---|
1 | Gustavo Vang | Chemistry | Female | 13.470 | 17.440 | 18.75 |
14 | Quentin Kemp | Chemistry | Female | 0.000 | 15.720 | 35.00 |
16 | Camden Williams | Chemistry | Female | 22.653 | 17.648 | 50.00 |
18 | Solomon Fletcher | Chemistry | Female | 18.981 | 15.000 | 33.75 |
19 | Riley Christensen | Chemistry | Female | 15.306 | 16.016 | 43.75 |
’
Exercise 4 Physics students with less that 60%
Results
’Name | Major | Gender | Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | Total (100%) | |
---|---|---|---|---|---|---|---|
4 | Kyla Young | Physics | Male | 15.306 | 17.896 | 17.50 | 50.702 |
5 | Wyatt Oliver | Physics | Male | 12.246 | 14.088 | 32.50 | 58.834 |
15 | Leo Mayo | Physics | Female | 20.817 | 14.680 | 20.00 | 55.497 |
17 | Sidney Wiggins | Physics | Female | 19.593 | 14.680 | 0.00 | 34.273 |
27 | Litzy White | Physics | Male | 20.817 | 15.920 | 16.25 | 52.987 |
31 | Kathleen Rodriguez | Physics | Male | 22.653 | 17.760 | 16.25 | 56.663 |
’
Example 5 Grouping
Results
’Name | Major | Gender | Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | Total (100%) | |
---|---|---|---|---|---|---|---|
4 | Kyla Young | Physics | Male | 15.306 | 17.896 | 17.50 | 50.702 |
5 | Wyatt Oliver | Physics | Male | 12.246 | 14.088 | 32.50 | 58.834 |
15 | Leo Mayo | Physics | Female | 20.817 | 14.680 | 20.00 | 55.497 |
17 | Sidney Wiggins | Physics | Female | 19.593 | 14.680 | 0.00 | 34.273 |
27 | Litzy White | Physics | Male | 20.817 | 15.920 | 16.25 | 52.987 |
31 | Kathleen Rodriguez | Physics | Male | 22.653 | 17.760 | 16.25 | 56.663 |
’
Code
Another extremely powerful feature of Pandas is its ability to group items according to categories and then calculate values for those groupes.
You can apply the ‘in-built’ function (count()
, sum()
, mean()
, median()
, min()
, max()
, mode()
, std()
, var()
) directly to the groups. However, the real power of grouping comes with using agg()
to apply any function.
What are the means of the scores for the various test, according to Major
='Major').mean().round(2) df.groupby(by
Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | Total (100%) | |
---|---|---|---|---|
Major | ||||
Chemistry | 14.08 | 16.84 | 36.09 | 67.01 |
Comp. Biology | 16.53 | 10.28 | 0.00 | 26.81 |
Life Sciences | 19.24 | 14.69 | 31.25 | 65.18 |
Physics | 17.76 | 16.18 | 27.81 | 61.75 |
’
What are the means of the scores for the various tests, according to Gender.
='Gender').mean().round(2) df.groupby(by
Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | Total (100%) | |
---|---|---|---|---|
Gender | ||||
Female | 17.47 | 14.91 | 29.63 | 62.01 |
Male | 17.48 | 16.18 | 30.90 | 64.57 |
’
What are the means of the scores for the various tests, according to Major & Gender.
=['Major','Gender']).mean().round(2) df.groupby(by
Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | Total (100%) | ||
---|---|---|---|---|---|
Major | Gender | ||||
Chemistry | Female | 14.08 | 16.36 | 36.25 | 66.70 |
Male | 14.08 | 17.63 | 35.83 | 67.54 | |
Comp. Biology | Female | 16.53 | 10.28 | 0.00 | 26.81 |
Life Sciences | Female | 18.84 | 14.68 | 33.61 | 67.13 |
Male | 19.96 | 14.72 | 27.00 | 61.67 | |
Physics | Female | 20.20 | 14.68 | 10.00 | 44.89 |
Male | 17.27 | 16.48 | 31.38 | 65.12 |
’
Let’s use mean()
and std()
that belongs to numpy
=['Major','Gender']).agg([np.mean, np.std]).round(2) df.groupby(by
Test 1 (30%) | Test 2 (20%) | Test 3 (50%) | Total (100%) | ||||||
---|---|---|---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | mean | std | ||
Major | Gender | ||||||||
Chemistry | Female | 14.08 | 8.63 | 16.36 | 1.14 | 36.25 | 11.83 | 66.70 | 17.13 |
Male | 14.08 | 12.50 | 17.63 | 0.82 | 35.83 | 3.15 | 67.54 | 9.06 | |
Comp. Biology | Female | 16.53 | NaN | 10.28 | NaN | 0.00 | NaN | 26.81 | NaN |
Life Sciences | Female | 18.84 | 2.35 | 14.68 | 2.47 | 33.61 | 10.28 | 67.13 | 8.06 |
Male | 19.96 | 3.89 | 14.72 | 2.45 | 27.00 | 17.91 | 61.67 | 21.03 | |
Physics | Female | 20.20 | 0.87 | 14.68 | 0.00 | 10.00 | 14.14 | 44.89 | 15.01 |
Male | 17.27 | 4.29 | 16.48 | 1.66 | 31.38 | 12.85 | 65.12 | 10.41 |
’