7 The Power of Dataframes

In this day and age of ‘big data,’ it almost an oxymoron to claim that ‘data analysis skills are important, relevant and useful.’ In this section, we will start using Pandas which is Python’s de-facto module for data analysis. In addition to using Pandas you will see how immensely efficient it is to have your data in a dataframe than in a spreadsheet.

Just enough

Pandas offer several ways (i.e. different syntaxes) to achieve a specific goal. We will not cover all of these. Instead, we will only highlight a minimally sufficient set of syntax and instructions.

To learn the power of Pandas, we will use a dataset of test scores for a dummy class. You can access the data at https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/dummy_class.xls.

Exercise 1 Uploding Files to Colab

Task

Although Pandas can read the file directly over the internet, it will be good for us to learn how to upload files to Colab.

  1. Download the xls file from the following link by right-clicking and using ‘Save as…’
    https://raw.githubusercontent.com/nus-sps/workshops.tfi.data-visualisation/main/files/dummy_class.xls.

  2. Upload the file to Colab.

  3. Use the following code to read the file into Pandas.

    import pandas as pd
    pd.read_excel('dummy_class.xls')

Now we are ready to start exploring the Pandas by exploring the dataset! It is always best to first understand what kind of dataset you are dealing with. So, lets do some data ‘house keeping!’

Exercise 2 Some Housekeeping

Result


Unnamed: 0 Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%)
0 0 Braiden Henson PHY M 20.205 18.960 21.25
1 1 Gustavo Vang CHM F 13.470 17.440 18.75
2 2 Ronin Christian PHY M 18.366 15.560 43.75
3 3 Owen Anderson LS F 18.366 16.360 18.75
4 4 Kyla Young PHY M 15.306 17.896 17.50
5 5 Wyatt Oliver PHY M 12.246 14.088 32.50
6 6 Essence Bauer LS M 16.530 16.720 31.25
7 7 Maryjane Sandoval LS F 18.981 16.400 43.75
8 8 Carl Trujillo LS F 15.306 13.680 47.50
9 9 Halle Fritz LS F 17.754 9.040 41.25

Tasks


import pandas as pd
df = pd.read_excel('dummy_class.xls')

Starting with the code above, answer the questions in the following table using the Pandas command indicated.

Question Possible code
1. How many rows and columns are there? df.shape
2. What are the names of the columns? df.colmns
3. Can we look at a sample of the data df.head()
df.tail()
4. What type of data is contained? df.info()
5. Can we have descriptive statistics of the data? df.describe()
6. What are the unique values in a column? df[column_name].unique()
7. How many unique values are there? df[column_name].value_counts()

Solution


import pandas as pd
# df = pd.read_excel('dummy_class.xls')


print(f'{argv[0]}.html')

df.shape

df.columns

df.head(3)
df.tail(3)

df.info()

df.describe()

df['Major'].unique()

df['Major'].value_counts()

Example 1 Housekeeping Round 2 : Rename stuff

Result


Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%)
0 Braiden Henson Physics Male 20.205 18.960 21.25
1 Gustavo Vang Chemistry Female 13.470 17.440 18.75
2 Ronin Christian Physics Male 18.366 15.560 43.75
3 Owen Anderson Life Sciences Female 18.366 16.360 18.75
4 Kyla Young Physics Male 15.306 17.896 17.50
5 Wyatt Oliver Physics Male 12.246 14.088 32.50
6 Essence Bauer Life Sciences Male 16.530 16.720 31.25
7 Maryjane Sandoval Life Sciences Female 18.981 16.400 43.75
8 Carl Trujillo Life Sciences Female 15.306 13.680 47.50
9 Halle Fritz Life Sciences Female 17.754 9.040 41.25

Tasks


Let’s tidy up our dataset by:

  1. Dropping unnecessary columns (i.e.’Unnamed: 0’)
  2. Replacing the majors with full names:
  3. Replacing ‘M’ and ‘F’ with thier full form.
Before After
PHY Physics
CHM Chemistry
LS Life Sciences
CBIO’ Comp. Biology

Solution


my_to_replace = {
    'PHY': 'Physics',
    'CHM': 'Chemistry',
    'LS': 'Life Sciences',
    'CBIO': 'Comp. Biology',
    'F': 'Female',
    'M': 'Male'
}

# What does inplace do?
df.replace(to_replace=my_to_replace, inplace=True)

# Now lets drop the unnecessary columns
my_to_drop = 'Unnamed: 0'
df.drop(columns=my_to_drop, inplace=True)

Example 2 Housekeeping Round 3 : Missing Numbers

Result

Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%)
10 Marie Hoffman LS F 19.593 16.880 32.50
11 Lilianna Kaufman LS M 26.328 17.000 50.00
12 Jaxon Chung PHY M 14.082 16.680 41.25
13 Zoey Oconnell LS F 22.041 14.128 25.00
14 Quentin Kemp CHM F 0.000 15.720 35.00
15 Leo Mayo PHY F 20.817 14.680 20.00
16 Camden Williams CHM F 22.653 17.648 50.00
17 Sidney Wiggins PHY F 19.593 14.680 0.00
18 Solomon Fletcher CHM F 18.981 15.000 33.75
19 Riley Christensen CHM F 15.306 16.016 43.75

Code

Missing numbers are common in real datasets. Therefore, it is important to have a policy on how you deal with them. So, let’s fill the missing numbers with 0 for our class. But, first, let’s see how to locate missing data.

Note: Pandas uses na and NaN to represent missing numbers.

  1. Check if there are missing numbers by using info()
  2. Use isna() and any() to locate:
  3. All the missing numbers in the dataframe.
  4. Any columns with missing numbers
  5. Any rows with missing numbers
How many missing numbers do we have?
# How many missing numbers?
df.info()
#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 35 entries, 0 to 34
#> Data columns (total 6 columns):
#>  #   Column        Non-Null Count  Dtype  
#> ---  ------        --------------  -----  
#>  0   Name          35 non-null     object 
#>  1   Major         35 non-null     object 
#>  2   Gender        35 non-null     object 
#>  3   Test 1 (30%)  33 non-null     float64
#>  4   Test 2 (20%)  35 non-null     float64
#>  5   Test 3 (50%)  32 non-null     float64
#> dtypes: float64(3), object(3)
#> memory usage: 1.8+ KB
Missing everywhere
# Missing everywhere
df.isna()           
Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%)
0 False False False False False False
1 False False False False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False

Shows columns that have missing values
# Columns with missing numbers
df.isna().any(axis=0)  
#> Name            False
#> Major           False
#> Gender          False
#> Test 1 (30%)     True
#> Test 2 (20%)    False
#> Test 3 (50%)     True
#> dtype: bool
Shows rows that have missing values
# Rows with missing numbers
df.isna().any(axis=1).head(20)
#> 0     False
#> 1     False
#> 2     False
#> 3     False
#> 4     False
#> 5     False
#> 6     False
#> 7     False
#> 8     False
#> 9     False
#> 10    False
#> 11    False
#> 12    False
#> 13    False
#> 14     True
#> 15    False
#> 16    False
#> 17     True
#> 18    False
#> 19    False
#> dtype: bool
Show the rows that have missing values
mask = df.isna().any(axis=1)
df[mask]
Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%)
14 Quentin Kemp CHM F NaN 15.720 35.00
17 Sidney Wiggins PHY F 19.593 14.680 NaN
28 Norah Miles CHM M NaN 18.440 38.75
32 Katie Ayers CBIO F 16.530 10.280 NaN
34 Carter Crane LS M 20.817 13.816 NaN

Replace values with 0
df.fillna(0,inplace=True)
Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%)
0 Braiden Henson PHY M 20.205 18.960 21.25
1 Gustavo Vang CHM F 13.470 17.440 18.75
2 Ronin Christian PHY M 18.366 15.560 43.75
3 Owen Anderson LS F 18.366 16.360 18.75
4 Kyla Young PHY M 15.306 17.896 17.50
5 Wyatt Oliver PHY M 12.246 14.088 32.50
6 Essence Bauer LS M 16.530 16.720 31.25
7 Maryjane Sandoval LS F 18.981 16.400 43.75
8 Carl Trujillo LS F 15.306 13.680 47.50
9 Halle Fritz LS F 17.754 9.040 41.25
10 Marie Hoffman LS F 19.593 16.880 32.50
11 Lilianna Kaufman LS M 26.328 17.000 50.00
12 Jaxon Chung PHY M 14.082 16.680 41.25
13 Zoey Oconnell LS F 22.041 14.128 25.00
14 Quentin Kemp CHM F 0.000 15.720 35.00
15 Leo Mayo PHY F 20.817 14.680 20.00
16 Camden Williams CHM F 22.653 17.648 50.00
17 Sidney Wiggins PHY F 19.593 14.680 0.00
18 Solomon Fletcher CHM F 18.981 15.000 33.75
19 Riley Christensen CHM F 15.306 16.016 43.75
20 Malik Becker PHY M 12.858 15.480 46.25
21 Skylar Hensley CHM M 23.877 17.648 32.50
22 Braydon Duran LS F 15.918 14.360 41.25
23 Jalen Harmon LS F 22.041 16.880 22.50
24 Ean Haas LS F 19.593 14.360 30.00
25 Carolina Mcmahon PHY M 12.858 18.080 47.50
26 Elian Potter PHY M 23.265 14.360 31.25
27 Litzy White PHY M 20.817 15.920 16.25
28 Norah Miles CHM M 0.000 18.440 38.75
29 Mariela Sheppard LS M 18.366 11.000 28.75
30 Isabela Stokes CHM M 18.366 16.800 36.25
31 Kathleen Rodriguez PHY M 22.653 17.760 16.25
32 Katie Ayers CBIO F 16.530 10.280 0.00
33 Tucker Sloan LS M 17.754 15.040 25.00
34 Carter Crane LS M 20.817 13.816 0.00

Example 3 Adding New Columns

Results

Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%) Total (100%)
0 Braiden Henson PHY M 20.205 18.960 21.25 60.415
1 Gustavo Vang CHM F 13.470 17.440 18.75 49.660
2 Ronin Christian PHY M 18.366 15.560 43.75 77.676
3 Owen Anderson LS F 18.366 16.360 18.75 53.476
4 Kyla Young PHY M 15.306 17.896 17.50 50.702

Code

It is very easy to add columns with Pandas. Let’s add a column for the total of the test scores.

df['Total (100%)'] = df['Test 1 (30%)'] +   df['Test 2 (20%)'] + df['Test 3 (50%)']
df.head()

Exercise 3 Is Maryjane and Ronin in the class?

Results

Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%)
2 Ronin Christian PHY M 18.366 15.56 43.75
7 Maryjane Sandoval LS F 18.981 16.40 43.75

Tasks

  1. Use the ‘hidden’ function isin() to find out if Maryjane Sandoval and Ronin Christian are in the class?
  2. Use a mask to locate the corresponding rows in the dataframe.

Solution

students = ['Maryjane Sandoval' , 'Ronin Christian']
mask=df.isin(students).any(axis=1)
df[mask]
import pandas as pd
df=pd.read_excel('files/dummy_class.xls')
df.drop(columns = ['Unnamed: 0'],inplace=True)
students = ['Maryjane Sandoval' , 'Ronin Christian']
mask=df.isin(students).any(axis=1)
df[mask]
#>                 Name Major Gender  Test 1 (30%)  Test 2 (20%)  Test 3 (50%)
#> 2    Ronin Christian   PHY      M        18.366         15.56         43.75
#> 7  Maryjane Sandoval    LS      F        18.981         16.40         43.75

Example 4 Moving Around and Asking Questions

Results

Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%)

Tasks

One of the most powerful features of Pandas is how easy it is to find subsets of data that satisfy one or more conditions.

  1. Show columns ‘Name’ and ‘Major.’
  2. Show rows 5 to 10 for columns 3 to 5.
  3. What are the names of the female students?
  4. What are the names of the female students and studying chemsitry?

Code

Pandas offers us two ways (loc and iloc) to ‘move around’ in a dataframe.

How Command
By specifying a name loc[]
By specifying a row number (starting at 0) iloc[]
Show columns ‘Name’ and ‘Major.’
df.loc[:,['Name', 'Major']]
#>                   Name          Major
#> 0       Braiden Henson        Physics
#> 1         Gustavo Vang      Chemistry
#> 2      Ronin Christian        Physics
#> 3        Owen Anderson  Life Sciences
#> 4           Kyla Young        Physics
#> 5         Wyatt Oliver        Physics
#> 6        Essence Bauer  Life Sciences
#> 7    Maryjane Sandoval  Life Sciences
#> 8        Carl Trujillo  Life Sciences
#> 9          Halle Fritz  Life Sciences
#> 10       Marie Hoffman  Life Sciences
#> 11    Lilianna Kaufman  Life Sciences
#> 12         Jaxon Chung        Physics
#> 13       Zoey Oconnell  Life Sciences
#> 14        Quentin Kemp      Chemistry
#> 15            Leo Mayo        Physics
#> 16     Camden Williams      Chemistry
#> 17      Sidney Wiggins        Physics
#> 18    Solomon Fletcher      Chemistry
#> 19   Riley Christensen      Chemistry
#> 20        Malik Becker        Physics
#> 21      Skylar Hensley      Chemistry
#> 22       Braydon Duran  Life Sciences
#> 23        Jalen Harmon  Life Sciences
#> 24            Ean Haas  Life Sciences
#> 25    Carolina Mcmahon        Physics
#> 26        Elian Potter        Physics
#> 27         Litzy White        Physics
#> 28         Norah Miles      Chemistry
#> 29    Mariela Sheppard  Life Sciences
#> 30      Isabela Stokes      Chemistry
#> 31  Kathleen Rodriguez        Physics
#> 32         Katie Ayers  Comp. Biology
#> 33        Tucker Sloan  Life Sciences
#> 34        Carter Crane  Life Sciences

This will work too.

df[['Name', 'Major']]
Show rows 5 to 10 for columns 3 to 5
df.iloc[5:11,3:6]
#>     Test 1 (30%)  Test 2 (20%)  Test 3 (50%)
#> 5         12.246        14.088         32.50
#> 6         16.530        16.720         31.25
#> 7         18.981        16.400         43.75
#> 8         15.306        13.680         47.50
#> 9         17.754         9.040         41.25
#> 10        19.593        16.880         32.50
What are the names of the female students?
mask = df['Gender'] == 'Female'                 # Asking a question.
df.loc[mask,['Name','Gender']]
#>                  Name  Gender
#> 1        Gustavo Vang  Female
#> 3       Owen Anderson  Female
#> 7   Maryjane Sandoval  Female
#> 8       Carl Trujillo  Female
#> 9         Halle Fritz  Female
#> 10      Marie Hoffman  Female
#> 13      Zoey Oconnell  Female
#> 14       Quentin Kemp  Female
#> 15           Leo Mayo  Female
#> 16    Camden Williams  Female
#> 17     Sidney Wiggins  Female
#> 18   Solomon Fletcher  Female
#> 19  Riley Christensen  Female
#> 22      Braydon Duran  Female
#> 23       Jalen Harmon  Female
#> 24           Ean Haas  Female
#> 32        Katie Ayers  Female
What are the names of the female students studying chemistry?

You can combine the use of == with the and (&) or the or (|) operator.

mask = (df['Gender'] == 'Female') & (df['Major'] == 'Chemistry')
df[mask]
Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%)
1 Gustavo Vang Chemistry Female 13.470 17.440 18.75
14 Quentin Kemp Chemistry Female 0.000 15.720 35.00
16 Camden Williams Chemistry Female 22.653 17.648 50.00
18 Solomon Fletcher Chemistry Female 18.981 15.000 33.75
19 Riley Christensen Chemistry Female 15.306 16.016 43.75

Exercise 4 Physics students with less that 60%

Results

Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%) Total (100%)
4 Kyla Young Physics Male 15.306 17.896 17.50 50.702
5 Wyatt Oliver Physics Male 12.246 14.088 32.50 58.834
15 Leo Mayo Physics Female 20.817 14.680 20.00 55.497
17 Sidney Wiggins Physics Female 19.593 14.680 0.00 34.273
27 Litzy White Physics Male 20.817 15.920 16.25 52.987
31 Kathleen Rodriguez Physics Male 22.653 17.760 16.25 56.663

Tasks

Who are the Physics studnets with less than a total of 60%?

Solution

mask = (df['Major'] == 'Physics') & (df['Total (100%)'] <= 60)
df[mask]

Example 5 Grouping

Results

Name Major Gender Test 1 (30%) Test 2 (20%) Test 3 (50%) Total (100%)
4 Kyla Young Physics Male 15.306 17.896 17.50 50.702
5 Wyatt Oliver Physics Male 12.246 14.088 32.50 58.834
15 Leo Mayo Physics Female 20.817 14.680 20.00 55.497
17 Sidney Wiggins Physics Female 19.593 14.680 0.00 34.273
27 Litzy White Physics Male 20.817 15.920 16.25 52.987
31 Kathleen Rodriguez Physics Male 22.653 17.760 16.25 56.663

Code

Another extremely powerful feature of Pandas is its ability to group items according to categories and then calculate values for those groupes.

You can apply the ‘in-built’ function (count(), sum(), mean(), median(), min(), max(), mode(), std(), var()) directly to the groups. However, the real power of grouping comes with using agg() to apply any function.

What are the means of the scores for the various test, according to Major
df.groupby(by='Major').mean().round(2)
Test 1 (30%) Test 2 (20%) Test 3 (50%) Total (100%)
Major
Chemistry 14.08 16.84 36.09 67.01
Comp. Biology 16.53 10.28 0.00 26.81
Life Sciences 19.24 14.69 31.25 65.18
Physics 17.76 16.18 27.81 61.75

What are the means of the scores for the various tests, according to Gender.
df.groupby(by='Gender').mean().round(2)
Test 1 (30%) Test 2 (20%) Test 3 (50%) Total (100%)
Gender
Female 17.47 14.91 29.63 62.01
Male 17.48 16.18 30.90 64.57

What are the means of the scores for the various tests, according to Major & Gender.
df.groupby(by=['Major','Gender']).mean().round(2)
Test 1 (30%) Test 2 (20%) Test 3 (50%) Total (100%)
Major Gender
Chemistry Female 14.08 16.36 36.25 66.70
Male 14.08 17.63 35.83 67.54
Comp. Biology Female 16.53 10.28 0.00 26.81
Life Sciences Female 18.84 14.68 33.61 67.13
Male 19.96 14.72 27.00 61.67
Physics Female 20.20 14.68 10.00 44.89
Male 17.27 16.48 31.38 65.12

Let’s use mean() and std() that belongs to numpy
df.groupby(by=['Major','Gender']).agg([np.mean, np.std]).round(2)
Test 1 (30%) Test 2 (20%) Test 3 (50%) Total (100%)
mean std mean std mean std mean std
Major Gender
Chemistry Female 14.08 8.63 16.36 1.14 36.25 11.83 66.70 17.13
Male 14.08 12.50 17.63 0.82 35.83 3.15 67.54 9.06
Comp. Biology Female 16.53 NaN 10.28 NaN 0.00 NaN 26.81 NaN
Life Sciences Female 18.84 2.35 14.68 2.47 33.61 10.28 67.13 8.06
Male 19.96 3.89 14.72 2.45 27.00 17.91 61.67 21.03
Physics Female 20.20 0.87 14.68 0.00 10.00 14.14 44.89 15.01
Male 17.27 4.29 16.48 1.66 31.38 12.85 65.12 10.41