Wrangling Data

September 23, 2024

Data Frames

What is a Data Frame?


  • Special kind of tabular data used in data science
  • Each column can be a different data type
  • Data frames are the most common data structure in R

What is a Tibble?


  • Modern data frames in R
  • Offers better printing and subsetting behaviors
  • Does not convert character vectors to factors by default
  • Displays only the first 10 rows and as many columns as fit on screen
  • Column names are preserved exactly, even if they contain spaces

Creating a Tibble


  • When you read data into R with readr you automatically get a tibble
  • You can create a tibble using tibble() from the tibble package:
  library(tibble)
  
  # Create a tibble
  my_tibble <- tibble(
    name = c("Alice", "Bob", "Charlie"),
    age = c(25, 30, 35),
    height = c(160, 170, 180),
    is_student = c(TRUE, FALSE, FALSE)
  )
  
my_tibble  
# A tibble: 3 × 4
  name      age height is_student
  <chr>   <dbl>  <dbl> <lgl>     
1 Alice      25    160 TRUE      
2 Bob        30    170 FALSE     
3 Charlie    35    180 FALSE     

Common Data Types

  • <chr> (Character): Stores text strings
    • Example: "hello", "R programming"
  • <dbl> (Double): Stores decimal (floating-point) numbers
    • Example: 3.14, -1.0
  • <int> (Integer): Stores whole numbers (integers)
    • Example: 1, -100, 42
  • <lgl> (Logical): Stores boolean values (TRUE, FALSE, NA)
    • Example: TRUE, FALSE, NA
  • <fct> (Factor): Stores categorical variables with fixed levels
    • Example: factor(c("low", "medium", "high"))
  • <date> (Date): Stores dates in the “YYYY-MM-DD” format
    • Example: as.Date("2024-09-05")

Other Data Types


  • <dttm> (Date-Time or POSIXct): Stores date-time objects (both date and time).
    • Example: as.POSIXct("2024-09-05 14:30:00")
  • <time> (Time): Specifically stores time-of-day values (rarely seen without a date)
    • Example: "14:30:00"
  • <list> (List): Stores lists, where each entry can be a complex object.
    • Example: list(c(1, 2, 3), c("a", "b", "c"))

Dates and Times with lubridate

  • lubridate is an R package that makes it easier to work with dates and times

  • Use convenient functions to store dates in different formats

library(lubridate)
  
# Store a date
my_date <- ymd("2024-09-05")
my_date2 <- mdy("09-05-2024")
my_date3 <- dmy("05-09-2024")
  
# Print in long form
format(my_date, "%B %d, %Y")
[1] "September 05, 2024"

Your Turn


  • Create your own tibble
  • Make it on a topic you find interesting
  • Try to include at least three data types
05:00

Tidying Data

What is Tyding Data?

  • In the last class we talked about the concept of “tidy data”
    • Each variable forms a column
    • Each observation is in a row
    • Each cell has a single value
  • The process of tidying data involves reshaping (or pivoting) data into a tidy format
  • We want to use the pivot_longer() or pivot_wider() functions from tidyr to do this

Query: Are these data in a tidy format?


library(tidyr)

smiths
# A tibble: 2 × 5
  subject     time   age weight height
  <chr>      <dbl> <dbl>  <dbl>  <dbl>
1 John Smith     1    33     90   1.87
2 Mary Smith     1    NA     NA   1.54


To get a list of data frames available in a package use the data() function, e.g. data(package = "tidyr").


How about these data?


world_bank_pop
# A tibble: 1,064 × 20
   country indicator      `2000`  `2001`  `2002`  `2003`  `2004`  `2005`  `2006`
   <chr>   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ABW     SP.URB.TOTL    4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
 2 ABW     SP.URB.GROW    1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
 3 ABW     SP.POP.TOTL    8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
 4 ABW     SP.POP.GROW    2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
 5 AFE     SP.URB.TOTL    1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
 6 AFE     SP.URB.GROW    3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
 7 AFE     SP.POP.TOTL    4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
 8 AFE     SP.POP.GROW    2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
 9 AFG     SP.URB.TOTL    4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG     SP.URB.GROW    1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
#   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
#   `2016` <dbl>, `2017` <dbl>

Pivot Longer


pivot_longer() takes three arguments:

  • cols - which columns you want to pivot
  • names_to - the name of the column where the old column names are going to (identifier)
  • values_to - the name of the column where the values are going to

Example: WB Population Data


library(tidyr)
library(dplyr)

# Pivot using pivot_longer

long_pop_data <- world_bank_pop |>
  pivot_longer(
    cols = `2000`:`2017`,   # The columns you want to pivot (years)
    names_to = "year",      # New column name for the years
    values_to = "pop"       # New column name for the values
  ) 

# View the tidied data
long_pop_data
# A tibble: 19,152 × 4
   country indicator   year    pop
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

This is better, usable even, but are we done if we want a tidy data frame?


library(tidyr)
library(dplyr)

# Pivot using pivot_longer

long_pop_data <- world_bank_pop |>
  pivot_longer(
    cols = `2000`:`2017`,   # The columns you want to pivot (years)
    names_to = "year",      # New column name for the years
    values_to = "pop"       # New column name for the values
  ) 

# View the tidied data
long_pop_data
# A tibble: 19,152 × 4
   country indicator   year    pop
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows


This is better, usable even, but are we done if we want a tidy data frame?


library(tidyr)
library(dplyr)

# Pivot using pivot_longer

long_pop_data <- world_bank_pop |>
  pivot_longer(
    cols = `2000`:`2017`,   # The columns you want to pivot (years)
    names_to = "year",      # New column name for the years
    values_to = "pop"       # New column name for the values
  ) 

# View the tidied data
long_pop_data
# A tibble: 19,152 × 4
   country indicator   year    pop
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

Issue is that the data are in long form (which is OK for some purposes), but we want to make it wider. Wider, but tidy…


library(tidyr)
library(dplyr)

# Pivot using pivot_longer

long_pop_data <- world_bank_pop |>
  pivot_longer(
    cols = `2000`:`2017`,   # The columns you want to pivot (years)
    names_to = "year",      # New column name for the years
    values_to = "pop"       # New column name for the values
  ) 

# View the tidied data
long_pop_data
# A tibble: 19,152 × 4
   country indicator   year    pop
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

Pivot Wider


pivot_wider() takes three main arguments:

  • names_from - the column whose values will become new column names (identifier)
  • values_from - the column containing the values that will fill the new columns
  • values_fill (optional) - specifies what to use for missing values (e.g., NA, 0)

pivotwider() the WB Data


# pivot wider
tidy_pop_data <- long_pop_data |>
  pivot_wider(
    names_from = indicator, 
    values_from = pop
  )

# view the data
tidy_pop_data
# A tibble: 4,788 × 6
   country year  SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW
   <chr>   <chr>       <dbl>       <dbl>       <dbl>       <dbl>
 1 ABW     2000        41625      1.66         89101       2.54 
 2 ABW     2001        42025      0.956        90691       1.77 
 3 ABW     2002        42194      0.401        91781       1.19 
 4 ABW     2003        42277      0.197        92701       0.997
 5 ABW     2004        42317      0.0946       93540       0.901
 6 ABW     2005        42399      0.194        94483       1.00 
 7 ABW     2006        42555      0.367        95606       1.18 
 8 ABW     2007        42729      0.408        96787       1.23 
 9 ABW     2008        42906      0.413        97996       1.24 
10 ABW     2009        43079      0.402        99212       1.23 
# ℹ 4,778 more rows

Exercise

Download Some Messy Data


  • Download some messy WB data
  • Make it multiple variables
  • Download as a CSV file
  • Save in your project /data folder

Messy Data Example

Read Data


# Load packages
library(readr) 
library(dplyr) 

# Read data from csv file into an object called "wb_data_messy"
wb_data_messy <- read_csv("data/your_file_name.csv")

# View the data
glimpse(wb_data_messy)

Pivot the Data


  • Use pivot_longer() to get the data in long form
  • Use pivot_wider() to get the series in the columns

Special Considerations


# Load tidyr
library(tidyr)

# Reshape the data
wb_data <- wb_data_messy |> 
  pivot_longer(             
    cols = `1973 [YR1973]`: `2022 [YR2022]`, # columns to pivot
    names_to = "year", # name the identifier column "year"
    values_to = "values" # name the numeric var column 
  ) 

# View the data
glimpse(wb_data)

Special Considerations


wb_data <- wb_data |>
  select(-`Series Name`)

# Reshape the data
tidy_data <- wb_data |> 
  pivot_wider(             
    names_from = `Series Code`,
    values_from = values
  ) 

# View the data
glimpse(wb_data)

Give it a Shot!


  • Try downloading some messy WB data and make it tidy
  • Refer to previous slides for help
10:00

Mutate

Mutating Variables


  • Anytime we want to change a variable, we are going to use the dplyr verbs mutate() or mutate_at()
  • mutate() is if you want to change on variable
  • mutate_at() is for multiple variables

Let’s Fix Our Variables


# Fix year and flfp
wb_data <- wb_data |> # replace wb_data with a modified version of the dataframe 
  mutate(year = substring(year, 1, 4)) |> # truncate year (keep first four characters)
  mutate_at(c("year", "flfp"), as.numeric) # change year and flfp to numeric

# View the data
glimpse(wb_data)

Now Try it With Multiple Varaibles


  • Go to the data frame with multiple variables that you created earlier
  • How would you modify this code to make sure the variables are in the right format?

Clean Variable Names


Very simple: use the janitor package!
# Load janitor
library(janitor)

# Apply clean_names() to wb_data, store in new data frame called wb_data_clean
wb_data_clean <- wb_data |>  
  clean_names() 

# Write wb_data_clean to a csv file
write_csv(wb_data_clean, "data/wb_data_clean.csv")

# View the data
glimpse(wb_data_clean)

Column Specifications


  • Can clean columns on backend
  • But can also specify data types in read_csv() or read_excel()
spec(wb_data_messy)

wb_data_messy <- read_csv("data/your_file_name.csv", 
                          col_types = cols(
                           col_double(`1990 [YR1990]`) 
                          )
                          )