Working With Data

September 22, 2024

How Do We Get Tidy/Clean Data?


  • Wrangle it ourselves
  • Use a package where it has been wrangled for us
  • Download via an API

This Lesson


  • Practice with World Bank and V-Dem data
  • World Bank data through wbstats
    • There is another package called WDI
    • Both packages for accessing data through WB API
  • Varieties of Democracy (V-Dem) through vdemlite
    • There is also a package called vdemdata
    • vdemlite offers more functionality, works better in the cloud

filter(), select(), mutate()


Along the way we will practice some important dplyr verbs:


  • filter() is used to select observations based on their values
  • select() is used to select variables
  • mutate() is used to create new variables or modifying existing ones


As well as some helpful functions from the janitor package.

APIs


  • API stands for “Application Programming Interface”
  • Way for two computers to talk to each other
  • In our case, we will use APIs to download social science data

APIs in R

  • APIs are accessed through packages in R
  • Sometimes there can be more than one package for an API
  • Much easier than reading in data from messy flat file!
  • We will use a few API packages in this course
    • World Bank data through wbstats (or WDI)
    • fredr for Federal Reserve Economic Data
    • tidycensus for US Census data
  • But there are many APIs out there (please explore!)

Searching for WB Indicators


flfp_indicators <- wb_search("female labor force") # store the list of indicators

print(flfp_indicators, n=26) # view the indicators

wbstats Example


# Load packages
library(wbstats) # for downloading WB data
library(dplyr) # for selecting, renaming and mutating
library(janitor) # for rounding

# Store the list of indicators in an object
indicators <- c("flfp" = "SL.TLF.CACT.FE.ZS", "women_rep" = "SG.GEN.PARL.ZS") 

# Download the data  
women_emp <- wb_data(indicators, mrv = 50) |> # download data for last 50 yrs
  select(!iso2c) |> # drop the iso2c code which we won't be using
  rename(year = date) |> # rename date to year 
  mutate(
    flfp = round_to_fraction(flfp, denominator = 100), # round to nearest 100th
    women_rep = round_to_fraction(women_rep, denominator = 100) 
  )

# View the data
glimpse(women_emp) 

Your Turn!


  • Search for a WB indicator
  • Download the data
05:00

The V-Dem Dataset


  • V-Dem stands for Varieties of Democracy
  • It is a dataset that measures democracy around the world
  • Based on expert assessments of the quality of democracy in each country
  • Two packages we will explore: vdemdata and vdemlite

Downloading V-Dem Data


The vdem function from vdemdata just downloads all of the data. Try running this code chunk. What do you see in democracy?

library(vdemdata) # load the V-Dem package

democracy <- vdem() # download the V-Dem dataset

filter()

  • Run this code. What do you see?
  • Try changing the year
  • For one year, use == instead of >=
  • Or try <= and see what happens
democracy <- vdem |> # download the V-Dem dataset
  filter(year >= 1990) # filter out years less than 1990
  
glimpse(democracy)  

= versus ==


  • = is used to assign values to variables, just like <-
  • == is used to test if two values are equal to each other
  • So filter(year == 1990) will give you just the observations for 1990

>= and <=

  • >= is used to test if a value is greater than or equal to another value
  • <= is used to test if a value is less than or equal to another value
  • So filter(year >= 1990) will give you the observations for 1990 and later
  • And filter(year <= 1990) will give you the observations for 1990 and earlier

select()

  • Run this code. What do you see?
  • Now try v2x_libdem instead of v2x_polyarchy
  • Choose more from the codebook
democracy <- vdem |> # download the V-Dem dataset
  select(                  # select (and rename) these variables
    country = country_name,     # before the = sign is new name  
    vdem_ctry_id = country_id,  # after the = sign is the old name
    year, 
    polyarchy = v2x_polyarchy
  )
  
glimpse(democracy)  
05:00

mutate()

  • Modify the code to create new variable that is three times the value of polyarchy
  • How about polyarchy squared?
democracy <- vdem |> # download the V-Dem dataset
  filter(year == 2015) |> # keep only observations from 2015
  select(                  # select (and rename) these variables
    country = country_name,     # name before the = sign is new name  
    vdem_ctry_id = country_id,  # name after the = sign is old name
    year, 
    polyarchy = v2x_polyarchy 
    ) |>
  mutate(
    polyarchy_dbl = polyarchy * 2 # create variable 2X polyarchy
  )
  
glimpse(democracy)  
05:00

Some Common Arithmetic Operators


  • + addition
  • - subtraction
  • * multiplication
  • / division
  • ^ exponentiation (also **)

vdemdata Example


# Load packages
library(vdemdata) # to download V-Dem data
library(dplyr)

# Download the data
democracy <- vdem |> # download the V-Dem dataset
  filter(year == 2015)  |> # filter year, keep 2015
  select(                  # select (and rename) these variables
    country = country_name,     # the name before the = sign is the new name  
    vdem_ctry_id = country_id,  # the name after the = sign is the old name
    year, 
    polyarchy = v2x_polyarchy, 
    gdp_pc = e_gdppc, 
    region = e_regionpol_6C
    ) |>
  mutate(
    region = case_match(region, # replace the values in region with country names
                     1 ~ "Eastern Europe", 
                     2 ~ "Latin America",  
                     3 ~ "Middle East",   
                     4 ~ "Africa", 
                     5 ~ "The West", 
                     6 ~ "Asia")
  )

# View the data
glimpse(democracy)


Use filter() to select years…


# Download the data
democracy <- vdem |> 
  filter(year == 2015)  |> # keep 2015
  select(                 
    country = country_name,       
    vdem_ctry_id = country_id,  
    year, 
    polyarchy = v2x_polyarchy, 
    gdp_pc = e_gdppc, 
    region = e_regionpol_6C
    ) |>
  mutate(
    region = case_match(region,
                     1 ~ "Eastern Europe", 
                     2 ~ "Latin America",  
                     3 ~ "Middle East",   
                     4 ~ "Africa", 
                     5 ~ "The West", 
                     6 ~ "Asia")
  )


Use select() to choose variables…


# Download the data
democracy <- vdem |> 
  filter(year == 2015)  |> 
  select(                  # select (and rename) these variables
    country = country_name,     # the name before the = sign is the new name  
    vdem_ctry_id = country_id,  # the name after the = sign is the old name
    year, 
    polyarchy = v2x_polyarchy, 
    gdp_pc = e_gdppc, 
    region = e_regionpol_6C
    ) |>
  mutate(
    region = case_match(region, 
                     1 ~ "Eastern Europe", 
                     2 ~ "Latin America",  
                     3 ~ "Middle East",   
                     4 ~ "Africa", 
                     5 ~ "The West", 
                     6 ~ "Asia")
  )


Use mutate with case_match() to Recode Region….


# Download the data
democracy <- vdem |>
  filter(year == 2015)  |> 
  select(                  
    country = country_name,     
    vdem_ctry_id = country_id,  
    year, 
    polyarchy = v2x_polyarchy, 
    gdp_pc = e_gdppc, 
    region = e_regionpol_6C
    ) |>
  mutate(
    region = case_match(region, # replace the values in region with country names
                     1 ~ "Eastern Europe", 
                     2 ~ "Latin America",  
                     3 ~ "Middle East",   
                     4 ~ "Africa", 
                     5 ~ "The West", 
                     6 ~ "Asia")
                    # number on the left of the ~ is the V-Dem region code
                    # we are changing the number to the country name on the right
                    # of the equals sign
  )

vdemlite


  • Covers a few hundred commonly used indicators and indices from 1970 onward
  • Covers everything in this document
  • As opposed to 4000+ indicators from the 18th century onward
  • Adds some functionality for working with the data
  • Easier to work with in the cloud and apps

vdemlite fuctions


  • fetchdem() to download the data
  • summarizedem() provides searchable table of indicators with summary stats
  • searchdem() to search for specific indicators or all indicators used to construct an index
  • See the vdemlite documentation for more details

fetchdem()


# Load packages
library(vdemlite) # to download V-Dem data

# Polyarchy and clean elections index for USA and Sweden for 2000-2020
dem_indicators <- fetchdem(indicators = c("v2x_polyarchy", "v2xel_frefair"),
                           countries = c("USA", "SWE"))

# View the data
glimpse(dem_indicators)

summarizedem()


# Summary statistics for the polyarchy index
summarizedem(indicator = "v2x_polyarchy")

searchdem()


searchdem()

Your Turn


  • Look at the vdemlite documentation
  • Try using searchdem() to find an indicator you are interested in using
  • Use summarizedem() to get summary statistics for that variable
  • Use fetchdem() to download the data for that variable for a country or countries of interest
  • Try using mutate() to add region codes to the data
05:00