Module 1.1

Working with Flat Files

Prework
  • Install R, R Studio and the Tidyverse collection of packages if you have not done so already (see getting started)
  • Install the janitor package (install.packages("janitor"))
  • Have a look at the documentation for readr, dplyr, tidyr and janitor
  • Familiarize yourself with the readr, dplyr and tidyr cheatsheets
  • Create a new directory/folder in your project folder called “data” where you can store .csv files
  • Start a new quarto project called “modules” and generate a quarto document named “module-1.1.qmd” inside of it so that you can code along with me
Note

In this module, we will be using the readr, tidyr and dplyr packages from the Tidyverse. While we could just load the entire Tidyverse collection of packages (e.g. library(tidyverse)), it is generally better from a performance standpoint to load the packages individually. Another reason to laod the packages separately is that it is good to be more intentional about which packages we are loading while we are learning them.

Also, note that in this module we will load the packages as we go along in various code chunks; but it is also possible to load all of the packages you will be using in a Quarto doc or R script in a single “setup” code chunk at the beginning of the file. You will see both workflows represented in this course.

Overview

In this module, we are going to work with a “flat file” (.csv) that we will download from the World Bank’s Data Bank. We are going to encounter many problems with these data that we will rectify using various R packages that I will introduce along the way. The idea is to take this file in its current state and transform it into a tidy dataset where each column represents a variable, each row represents an observation, and each cell represents a single value.

Reading data into R

Go to the World Development Indicators portal at the World Bank’s Data Bank.

Under Countries, select the Countries tab and then select the little check mark ☑️ to select all of the countries. Be sure to select the Countries tab first, though, or you will also be downloading aggregate data for regions and groups of countries.

Next, under Series, search for “labor force participation” and find labor force participation rates for women ages 15-64 (ILO modeled estimates). Check that series.

Now go to Time and select the years from the last 50 years. Click Apply Changes, go to Download Options and download as a .csv file. Place the .csv file in the data directory that you created for this module. Save it as “messy_wb_data.csv” or something like that.

Now we are going to read this messy World Bank data into R using the read_csv() function from the readr package. readr is a collection of functions that parses data from a flat file into a tibble, the modern Tidyverse version of a data frame. After we have read the data into R, we are going to have a look at it with the glimpse() function from the dplyr package.

glimpse() shows us a list of the columns in the along with their type (e.g. character, double, etc.) and a few rows’ worth of data.

Note

While comma delimited files are the most common kind of flat file, readr includes functions for parsing files with a wide range of delimiters including tabs (read_tsv()), semicolons (read_csv2()) and white spaces (read_table()). There is also a Tidyverse package for reading in Excel files called readxl.

# Load packages
library(readr) 
library(dplyr) 

# Read data from csv file into an object called "wb_data_messy"
wb_data_messy <- read_csv("data/messy_wb_data.csv")

# View the data
glimpse(wb_data_messy)
Rows: 222
Columns: 54
$ `Country Name`  <chr> "Afghanistan", "Albania", "Algeria", "American Samoa",…
$ `Country Code`  <chr> "AFG", "ALB", "DZA", "ASM", "AND", "AGO", "ATG", "ARG"…
$ `Series Name`   <chr> "Labor force participation rate, female (% of female p…
$ `Series Code`   <chr> "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI…
$ `1972 [YR1972]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1973 [YR1973]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1974 [YR1974]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1975 [YR1975]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1976 [YR1976]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1977 [YR1977]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1978 [YR1978]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1979 [YR1979]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1980 [YR1980]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1981 [YR1981]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1982 [YR1982]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1983 [YR1983]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1984 [YR1984]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1985 [YR1985]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1986 [YR1986]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1987 [YR1987]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1988 [YR1988]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1989 [YR1989]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1990 [YR1990]` <chr> "15.83", "60.63", "12.31", "..", "..", "76.73", "..", …
$ `1991 [YR1991]` <chr> "15.89", "65.54", "12.33", "..", "..", "76.69", "..", …
$ `1992 [YR1992]` <chr> "15.92", "66.56", "12.37", "..", "..", "76.66", "..", …
$ `1993 [YR1993]` <chr> "15.91", "65.01", "12.41", "..", "..", "76.68", "..", …
$ `1994 [YR1994]` <chr> "15.88", "63.64", "12.47", "..", "..", "76.64", "..", …
$ `1995 [YR1995]` <chr> "15.92", "61.59", "12.56", "..", "..", "76.57", "..", …
$ `1996 [YR1996]` <chr> "15.75", "60.28", "12.64", "..", "..", "76.55", "..", …
$ `1997 [YR1997]` <chr> "15.59", "61.91", "12.59", "..", "..", "76.53", "..", …
$ `1998 [YR1998]` <chr> "15.47", "60.62", "12.59", "..", "..", "76.53", "..", …
$ `1999 [YR1999]` <chr> "15.4", "58.87", "12.63", "..", "..", "76.51", "..", "…
$ `2000 [YR2000]` <chr> "15.35", "57.89", "12.71", "..", "..", "76.49", "..", …
$ `2001 [YR2001]` <chr> "15.5", "56.71", "12.85", "..", "..", "76.48", "..", "…
$ `2002 [YR2002]` <chr> "15.7", "56.06", "13.02", "..", "..", "76.44", "..", "…
$ `2003 [YR2003]` <chr> "15.92", "55.3", "13.24", "..", "..", "76.41", "..", "…
$ `2004 [YR2004]` <chr> "16.13", "54.57", "13.5", "..", "..", "76.38", "..", "…
$ `2005 [YR2005]` <chr> "16.33", "53.88", "13.79", "..", "..", "76.36", "..", …
$ `2006 [YR2006]` <chr> "16.12", "53.43", "14.12", "..", "..", "76.39", "..", …
$ `2007 [YR2007]` <chr> "15.91", "53.07", "14.47", "..", "..", "76.42", "..", …
$ `2008 [YR2008]` <chr> "15.74", "52.78", "14.87", "..", "..", "76.46", "..", …
$ `2009 [YR2009]` <chr> "15.65", "51.57", "15.31", "..", "..", "76.53", "..", …
$ `2010 [YR2010]` <chr> "15.65", "52.75", "15.49", "..", "..", "76.59", "..", …
$ `2011 [YR2011]` <chr> "16", "60.59", "16.45", "..", "..", "76.67", "..", "55…
$ `2012 [YR2012]` <chr> "16.44", "55.1", "17.48", "..", "..", "76.73", "..", "…
$ `2013 [YR2013]` <chr> "17.42", "50.58", "18.29", "..", "..", "76.79", "..", …
$ `2014 [YR2014]` <chr> "18.46", "50.18", "16.68", "..", "..", "76.83", "..", …
$ `2015 [YR2015]` <chr> "19.55", "54.05", "17.5", "..", "..", "76.87", "..", "…
$ `2016 [YR2016]` <chr> "20.7", "56.4", "18.33", "..", "..", "76.9", "..", "56…
$ `2017 [YR2017]` <chr> "21.91", "55.54", "19.19", "..", "..", "76.91", "..", …
$ `2018 [YR2018]` <chr> "22.32", "59.12", "18.95", "..", "..", "76.9", "..", "…
$ `2019 [YR2019]` <chr> "22.74", "61.46", "18.7", "..", "..", "76.88", "..", "…
$ `2020 [YR2020]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `2021 [YR2021]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …

Reshaping data

There are a few things about the data here that make it messy. First, in order for the data to be tidy, we want each column to represent a variable and each row to represent an observation.

But here we see the reverse: the data are in wide form, meaning that each column represents a year and each row represents a country. This entails that each row represents multiple observations in that we have data for multiple years for each row.

To rectify this, we need to reshape the data from wide form to long form. For this, we need the pivot_longer() function from the tidyr package.

The pivot_longer() function takes three basic arguments:

  • cols - which columns you want to pivot
  • names_to - the name of the column where the old column names are going to
  • values_to - the name of the column where the values are going to

In our case, we want to reshape all of the year columns and have the years represented in the rows. We want the newly created column to be called “year” and the values are going to represent the data on female labor force participation we downloaded (flfp).

# Load tidyr
library(tidyr)

# Reshape the data
wb_data <- wb_data_messy |> # take wb_data_messy, and put it in wb_data, but first...
  pivot_longer(             # pivot the data from wide to long form
    cols = `1972 [YR1972]`: `2021 [YR2021]`, # columns to pivot
    names_to = "year", # name the first column "year"
    values_to = "flfp" # name the second column "flfp"
  ) 

# View the data
glimpse(wb_data)
Rows: 11,100
Columns: 6
$ `Country Name` <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist…
$ `Country Code` <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ `Series Name`  <chr> "Labor force participation rate, female (% of female po…
$ `Series Code`  <chr> "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.…
$ year           <chr> "1972 [YR1972]", "1973 [YR1973]", "1974 [YR1974]", "197…
$ flfp           <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", "…
The Pipe Operator

For our pivot-longer() call we used R’s native pipe operator, e.g. |>. Pipes tell R to do something to the object that they are attached to. In this case, we are telling R to apply pivot_longer() to wb_data. The alternative way of writing this code would be to include the data as the first argument in the function, e.g. pivot_longer(wb_data, cols = ..., names_to = ... , values_to = ...). As you will see, pipe operators enable us to string together multiple functions in a convenient way to transform our data.

Spaces in Variable Names

Notice that when we specify the years in our pivot_longer() call we encapsulate them in backticks (``). This is because the years, as they were imported from the WDI dataset, have spaces in them. Typically we want to avoid this scenario by writing our variable names in snake_case.

Cleaning up our data

Now that our data are transposed, we can start to work on some other key issues. Notice that the year is stored in the weird way in which it was imported–as a character (or string) with both the year and the year in brackets, e.g. 1972 [YR1972]. Notice that flfp is also stored as a character whereas it should be numeric.

To fix this, we will use the mutate() and mutate_at() functions from dplyr. mutate() is used to transform variables and to create new ones while mutate_at() allow us to transform multiple columns at once.

First we call mutate() along with substring() to truncate the year variable to only include the first four characters of the string. Then we call mutate_at() along with as.numeric to transform year and flfp to numeric variables.

# Fix year and flfp
wb_data <- wb_data |> # replace wb_data with a modified version of the dataframe 
  mutate(year = substring(year, 1, 4)) |> # truncate year (keep first four characters)
  mutate_at(c("year", "flfp"), as.numeric) # change year and flfp to numeric

# View the data
glimpse(wb_data)
Rows: 11,100
Columns: 6
$ `Country Name` <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist…
$ `Country Code` <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ `Series Name`  <chr> "Labor force participation rate, female (% of female po…
$ `Series Code`  <chr> "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.…
$ year           <dbl> 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1…
$ flfp           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

The last thing we are going to do is to fix the variable names. Specifically, we want to remove the spaces from the remaining variables and conver them from title case to snake case. To do this, we will use the clean_names() function from the janitor package.

As a final step, we can export our clean data to a new .csv file with the write.csv() function from readr.

# Load janitor
library(janitor)

# Apply clean_names() to wb_data, store in new data frame called wb_data_clean
wb_data_clean <- wb_data |>  
  clean_names() 

# Write wb_data_clean to a csv file
write_csv(wb_data_clean, "data/wb_data_clean.csv")

# View the data
glimpse(wb_data_clean)
Rows: 11,100
Columns: 6
$ country_name <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan…
$ country_code <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "…
$ series_name  <chr> "Labor force participation rate, female (% of female popu…
$ series_code  <chr> "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.FE…
$ year         <dbl> 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 198…
$ flfp         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…