Merging and Summarizing Data

September 22, 2024

Merging Data Frames

Often we have data from two different sources
Results in two data frames
How to make them one so we can analyze?
Key questions
- What is the unit of analysis?
- What is/are the corresponding identifier variables?
- Are the identifier variables in common?
- Or do they have to be added/transformed to match?

Merging WB and V-Dem Data

These are both time-series, country-level data
Need to merge by country-year
Year is easy
But there are many different country codes
Can use countrycode package to assign country codes

`countrycode` Example

# Load countrycode
library(countrycode)

# Create new iso3c variable
dem_data <- dem_data |>    
  mutate(iso3c = countrycode(sourcevar = country_id, # what we are converting
        origin = "vdem",         # we are converting from vdem
        destination = "wb"))  |> # and converting to the WB iso3c code 
  relocate(iso3c, .after = country_id) # move iso3c 

# View the data
glimpse(dem_data)

Try it Yourself

Using your democracy data frame from the last lesson
Use mutate() and countrycode() to add iso3c country codes
Use relocate to move your iso3c code to the “front” of your data frame (optional)

Types of Joins in `dplyr`

Mutating versus filtering joins
Four types of mutating joins
- inner_join()
- full_join()
- left_join()
- right_join()
For the most part we will use left_join()

`left_join()` Example

# Perform left join using common iso3c variable and year
dem_women <- left_join(dem_data, women_emp, by = c("iso3c", "year")) |> 
  select(-c(country_text_id, country_id, country)) |> # drop extra country codes/names
  rename(region = e_regionpol_6C, 
         polyarchy = v2x_polyarchy,
         libdem = v2x_libdem) |>
    mutate(
    region = case_match(region, # replace the values in region with country names
                     1 ~ "Eastern Europe", 
                     2 ~ "Latin America",  
                     3 ~ "Middle East",   
                     4 ~ "Africa", 
                     5 ~ "The West", 
                     6 ~ "Asia")
    )
  
# View the data
glimpse(dem_women)

Try it Yourself

Take your V-Dem data frame and your World Bank data frame
Using left_join() to merge on country code and year
Along the way, use rename() and select() to insure you have just one country name

Group, Summarize and Arrange

group_by(), summarize(), arrange()
A very common sequence in data science:
- Take an average or some other statistic for a group
- Rank from high to low values of summary value

Example

# group_by(), summarize() and arrange()
dem_women |> # save result as new object
  group_by(region)  |> # group data by region
  summarize(           # summarize following vars (by region)
    polyarchy_mean = mean(polyarchy, na.rm = TRUE), # calculate mean after remove NAs
    libdem_median = median(libdem, na.rm = TRUE), # calculate median after remove NAs
    women_rep_sd = sd(women_rep, na.rm = TRUE), # calculate std. dev after remove NAs
    flfp_min = min(flfp, na.rm = TRUE) # calculate minimum flfp after remove NAs
  ) |> 
  arrange(desc(polyarchy_mean)) # arrange in descending order by polyarchy score

Use across() to Apply Same Function to Multiple Columns

dem_women |>
  group_by(region) |> 
  summarize(
    across(c(polyarchy, libdem, women_rep, flfp), mean, na.rm = TRUE)
  ) |> 
  arrange(desc(polyarchy))

Try it Yourself

Try changing the parameters in the code to answer these questions:

Try summarizing the data with a different function for one or more of the variables.

What is the median value of polyarchy for The West?
What is the max value of libdem for Eastern Europe?
What is the standard deviation of women_rep for Africa?
What is the mean of flfp for the Middle East?

Now try grouping by country instead of region and filter for years >= 2000.

What is the median value of polyarchy for Sweden?
What is the max value of libdem New Zealand?
What is the standard deviation of women_rep for Norway?
What is the min of flfp for Germany?

Try sorting countries in ascending order of the selected variable (hint: delete “desc” from the arrange() call). Which country ranks at the “top” of the list? Now change it back to descending order.

Merging and Summarizing Data

Merging Data Frames

Merging WB and V-Dem Data

countrycode Example

Try it Yourself

Types of Joins in dplyr

left_join() Example

Try it Yourself

Group, Summarize and Arrange

Example

Try it Yourself

`countrycode` Example

Types of Joins in `dplyr`

`left_join()` Example