Wrangling Data

September 23, 2024

Data Frames

What is a Data Frame?

Special kind of tabular data used in data science
Each column can be a different data type
Data frames are the most common data structure in R

What is a Tibble?

Modern data frames in R
Offers better printing and subsetting behaviors
Does not convert character vectors to factors by default
Displays only the first 10 rows and as many columns as fit on screen
Column names are preserved exactly, even if they contain spaces

Creating a Tibble

When you read data into R with readr you automatically get a tibble
You can create a tibble using tibble() from the tibble package:

  library(tibble)
  
  # Create a tibble
  my_tibble <- tibble(
    name = c("Alice", "Bob", "Charlie"),
    age = c(25, 30, 35),
    height = c(160, 170, 180),
    is_student = c(TRUE, FALSE, FALSE)
  )
  
my_tibble

# A tibble: 3 × 4
  name      age height is_student
  <chr>   <dbl>  <dbl> <lgl>     
1 Alice      25    160 TRUE      
2 Bob        30    170 FALSE     
3 Charlie    35    180 FALSE

Common Data Types

<chr> (Character): Stores text strings
- Example: "hello", "R programming"
<dbl> (Double): Stores decimal (floating-point) numbers
- Example: 3.14, -1.0
<int> (Integer): Stores whole numbers (integers)
- Example: 1, -100, 42
<lgl> (Logical): Stores boolean values (TRUE, FALSE, NA)
- Example: TRUE, FALSE, NA
<fct> (Factor): Stores categorical variables with fixed levels
- Example: factor(c("low", "medium", "high"))
<date> (Date): Stores dates in the “YYYY-MM-DD” format
- Example: as.Date("2024-09-05")

Other Data Types

<dttm> (Date-Time or POSIXct): Stores date-time objects (both date and time).
- Example: as.POSIXct("2024-09-05 14:30:00")
<time> (Time): Specifically stores time-of-day values (rarely seen without a date)
- Example: "14:30:00"
<list> (List): Stores lists, where each entry can be a complex object.
- Example: list(c(1, 2, 3), c("a", "b", "c"))

Dates and Times with `lubridate`

lubridate is an R package that makes it easier to work with dates and times
Use convenient functions to store dates in different formats

library(lubridate)
  
# Store a date
my_date <- ymd("2024-09-05")
my_date2 <- mdy("09-05-2024")
my_date3 <- dmy("05-09-2024")
  
# Print in long form
format(my_date, "%B %d, %Y")

[1] "September 05, 2024"

Your Turn

Create your own tibble
Make it on a topic you find interesting
Try to include at least three data types

05:00

Tidying Data

What is Tyding Data?

In the last class we talked about the concept of “tidy data”
- Each variable forms a column
- Each observation is in a row
- Each cell has a single value
The process of tidying data involves reshaping (or pivoting) data into a tidy format
We want to use the pivot_longer() or pivot_wider() functions from tidyr to do this

Query: Are these data in a tidy format?

library(tidyr)

smiths

# A tibble: 2 × 5
  subject     time   age weight height
  <chr>      <dbl> <dbl>  <dbl>  <dbl>
1 John Smith     1    33     90   1.87
2 Mary Smith     1    NA     NA   1.54

To get a list of data frames available in a package use the data() function, e.g. data(package = "tidyr").

How about these data?

world_bank_pop

# A tibble: 1,064 × 20
   country indicator      `2000`  `2001`  `2002`  `2003`  `2004`  `2005`  `2006`
   <chr>   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ABW     SP.URB.TOTL    4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
 2 ABW     SP.URB.GROW    1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
 3 ABW     SP.POP.TOTL    8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
 4 ABW     SP.POP.GROW    2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
 5 AFE     SP.URB.TOTL    1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
 6 AFE     SP.URB.GROW    3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
 7 AFE     SP.POP.TOTL    4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
 8 AFE     SP.POP.GROW    2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
 9 AFG     SP.URB.TOTL    4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG     SP.URB.GROW    1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
#   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
#   `2016` <dbl>, `2017` <dbl>

Pivot Longer

pivot_longer() takes three arguments:

cols - which columns you want to pivot
names_to - the name of the column where the old column names are going to (identifier)
values_to - the name of the column where the values are going to

Example: WB Population Data

library(tidyr)
library(dplyr)

# Pivot using pivot_longer

long_pop_data <- world_bank_pop |>
  pivot_longer(
    cols = `2000`:`2017`,   # The columns you want to pivot (years)
    names_to = "year",      # New column name for the years
    values_to = "pop"       # New column name for the values
  ) 

# View the tidied data
long_pop_data

# A tibble: 19,152 × 4
   country indicator   year    pop
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

This is better, usable even, but are we done if we want a tidy data frame?

library(tidyr)
library(dplyr)

# Pivot using pivot_longer

long_pop_data <- world_bank_pop |>
  pivot_longer(
    cols = `2000`:`2017`,   # The columns you want to pivot (years)
    names_to = "year",      # New column name for the years
    values_to = "pop"       # New column name for the values
  ) 

# View the tidied data
long_pop_data

# A tibble: 19,152 × 4
   country indicator   year    pop
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

This is better, usable even, but are we done if we want a tidy data frame?

library(tidyr)
library(dplyr)

# Pivot using pivot_longer

long_pop_data <- world_bank_pop |>
  pivot_longer(
    cols = `2000`:`2017`,   # The columns you want to pivot (years)
    names_to = "year",      # New column name for the years
    values_to = "pop"       # New column name for the values
  ) 

# View the tidied data
long_pop_data

# A tibble: 19,152 × 4
   country indicator   year    pop
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

Issue is that the data are in long form (which is OK for some purposes), but we want to make it wider. Wider, but tidy…

library(tidyr)
library(dplyr)

# Pivot using pivot_longer

long_pop_data <- world_bank_pop |>
  pivot_longer(
    cols = `2000`:`2017`,   # The columns you want to pivot (years)
    names_to = "year",      # New column name for the years
    values_to = "pop"       # New column name for the values
  ) 

# View the tidied data
long_pop_data

# A tibble: 19,152 × 4
   country indicator   year    pop
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

Pivot Wider

pivot_wider() takes three main arguments:

names_from - the column whose values will become new column names (identifier)
values_from - the column containing the values that will fill the new columns
values_fill (optional) - specifies what to use for missing values (e.g., NA, 0)

`pivotwider()` the WB Data

# pivot wider
tidy_pop_data <- long_pop_data |>
  pivot_wider(
    names_from = indicator, 
    values_from = pop
  )

# view the data
tidy_pop_data

# A tibble: 4,788 × 6
   country year  SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW
   <chr>   <chr>       <dbl>       <dbl>       <dbl>       <dbl>
 1 ABW     2000        41625      1.66         89101       2.54 
 2 ABW     2001        42025      0.956        90691       1.77 
 3 ABW     2002        42194      0.401        91781       1.19 
 4 ABW     2003        42277      0.197        92701       0.997
 5 ABW     2004        42317      0.0946       93540       0.901
 6 ABW     2005        42399      0.194        94483       1.00 
 7 ABW     2006        42555      0.367        95606       1.18 
 8 ABW     2007        42729      0.408        96787       1.23 
 9 ABW     2008        42906      0.413        97996       1.24 
10 ABW     2009        43079      0.402        99212       1.23 
# ℹ 4,778 more rows

Exercise

Download Some Messy Data

Download some messy WB data
Make it multiple variables
Download as a CSV file
Save in your project /data folder

Messy Data Example

Read Data

# Load packages
library(readr) 
library(dplyr) 

# Read data from csv file into an object called "wb_data_messy"
wb_data_messy <- read_csv("data/your_file_name.csv")

# View the data
glimpse(wb_data_messy)

Pivot the Data

Use pivot_longer() to get the data in long form
Use pivot_wider() to get the series in the columns

Special Considerations

# Load tidyr
library(tidyr)

# Reshape the data
wb_data <- wb_data_messy |> 
  pivot_longer(             
    cols = `1973 [YR1973]`: `2022 [YR2022]`, # columns to pivot
    names_to = "year", # name the identifier column "year"
    values_to = "values" # name the numeric var column 
  ) 

# View the data
glimpse(wb_data)

Special Considerations

wb_data <- wb_data |>
  select(-`Series Name`)

# Reshape the data
tidy_data <- wb_data |> 
  pivot_wider(             
    names_from = `Series Code`,
    values_from = values
  ) 

# View the data
glimpse(wb_data)

Give it a Shot!

Try downloading some messy WB data and make it tidy
Refer to previous slides for help

10:00

Mutate

Mutating Variables

Anytime we want to change a variable, we are going to use the dplyr verbs mutate() or mutate_at()
mutate() is if you want to change on variable
mutate_at() is for multiple variables

Let’s Fix Our Variables

# Fix year and flfp
wb_data <- wb_data |> # replace wb_data with a modified version of the dataframe 
  mutate(year = substring(year, 1, 4)) |> # truncate year (keep first four characters)
  mutate_at(c("year", "flfp"), as.numeric) # change year and flfp to numeric

# View the data
glimpse(wb_data)

Now Try it With Multiple Varaibles

Go to the data frame with multiple variables that you created earlier
How would you modify this code to make sure the variables are in the right format?

Clean Variable Names

Very simple: use the janitor package!

# Load janitor
library(janitor)

# Apply clean_names() to wb_data, store in new data frame called wb_data_clean
wb_data_clean <- wb_data |>  
  clean_names() 

# Write wb_data_clean to a csv file
write_csv(wb_data_clean, "data/wb_data_clean.csv")

# View the data
glimpse(wb_data_clean)

Column Specifications

Can clean columns on backend
But can also specify data types in read_csv() or read_excel()

spec(wb_data_messy)

wb_data_messy <- read_csv("data/your_file_name.csv", 
                          col_types = cols(
                           col_double(`1990 [YR1990]`) 
                          )
                          )

Wrangling Data

Data Frames

What is a Data Frame?

What is a Tibble?

Creating a Tibble

Common Data Types

Other Data Types

Dates and Times with lubridate

Your Turn

Tidying Data

What is Tyding Data?

Pivot Longer

Example: WB Population Data

Pivot Wider

pivotwider() the WB Data

Exercise

Download Some Messy Data

Messy Data Example

Read Data

Pivot the Data

Special Considerations

Special Considerations

Give it a Shot!

Mutate

Mutating Variables

Let’s Fix Our Variables

Now Try it With Multiple Varaibles

Clean Variable Names

Column Specifications

Dates and Times with `lubridate`

`pivotwider()` the WB Data