Grammar of Graphics

September 22, 2024

Types of Visualizations


  • Column charts (bar charts)
    • Use to compare values across categories
  • Histograms
    • Use to show distribution of a single variable
  • Line charts
    • Use to show trends over time
    • Column charts can also show trends over time, but less effectively
  • Scatter plots
    • Use to show relationships between two variables
    • X-axis is usually explanatory variable, Y-axis is outcome variable

The Grammar of Graphics

  • Data viz has a language with its own grammar
  • Basic components include:
    • Data we are trying to visualize
    • Aesthetics (dimensions)
    • Geom (e.g. bar, line, scatter plot)
    • Color scales
    • Themes
    • Annotations
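
As a rough sketch of how these components stack up in code, the example below builds one plot that touches each of them. The toy data frame, its column names, and the annotation text are hypothetical and are not part of the lecture data.

```r
library(ggplot2)

# Hypothetical toy data (made up for illustration)
toy <- data.frame(
  group = c("A", "B", "C"),
  score = c(0.3, 0.6, 0.8)
)

p <- ggplot(toy, aes(x = group, y = score, fill = score)) +  # data + aesthetics
  geom_col() +                                               # geom
  scale_fill_viridis_c() +                                   # color scale
  theme_minimal() +                                          # theme
  annotate("text", x = "C", y = 0.85, label = "Highest")     # annotation

p
```

Each component is added with +, so a plot can be built up one layer at a time.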


Let’s start with the first two, the data and the aesthetic, with a column chart example…


library(readr)
library(ggplot2)

dem_summary <- read_csv("data/dem_summary.csv")

ggplot(dem_summary, aes(x = region, y = polyarchy)) 

This gives us the axes without any visualization:


Now let’s add a geom. In this case we want a column chart so we add geom_col().


ggplot(dem_summary, aes(x = region, y = polyarchy)) + 
  geom_col()

That gets the idea across but looks a little depressing, so…


…let’s change the color of the columns by specifying fill = "steelblue".


ggplot(dem_summary, aes(x = region, y = polyarchy)) + 
  geom_col(fill = "steelblue")


Tip

See here for more available ggplot2 colors.
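
One way to explore color options without leaving R is the base colors() function, which lists the built-in color names. The short sketch below checks that "steelblue" is one of them and shows that a hex code can be used in the same place. The toy data frame is hypothetical.

```r
library(ggplot2)

# Base R ships a vector of built-in color names
is_named_color <- "steelblue" %in% colors()
is_named_color
head(colors())

# Hypothetical toy data (made up for illustration)
toy <- data.frame(group = c("A", "B"), score = c(1, 2))

# fill is set outside aes(), so it is a fixed color rather than a mapping.
# "#4682B4" is the hex code for steelblue.
p_hex <- ggplot(toy, aes(x = group, y = score)) +
  geom_col(fill = "#4682B4")

p_hex
```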

Note how the color of the original columns is simply overwritten:


Now let’s add some labels with the labs() function:


ggplot(dem_summary, aes(x = region, y = polyarchy)) + 
  geom_col(fill = "steelblue") +
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    )

And that gives us…

Next, we reorder the bars with fct_reorder() from the forcats package.


library(forcats)

ggplot(dem_summary, aes(x = fct_reorder(region, -polyarchy), y = polyarchy)) +
  geom_col(fill = "steelblue") + 
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    )


Note that we could also use the base R reorder() function here.
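
To see what the reordering does, here is a small sketch comparing the two functions on hypothetical toy data (the regions and scores are made up). Negating the score sorts the levels from highest to lowest.

```r
library(forcats)

# Hypothetical toy data (values are made up for illustration)
toy <- data.frame(
  region    = c("North", "South", "East", "West"),
  polyarchy = c(0.4, 0.9, 0.2, 0.7)
)

# Both functions return a factor whose levels are sorted by the second argument
lv_forcats <- levels(fct_reorder(toy$region, -toy$polyarchy))
lv_base    <- levels(reorder(toy$region, -toy$polyarchy))

lv_forcats
lv_base
```

By default fct_reorder() summarises by the median and reorder() by the mean, so the two agree here because each region has a single row.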

This way, we get a nice, visually appealing ordering of the bars according to levels of democracy…


Now let’s change the theme to theme_minimal().


ggplot(dem_summary, aes(x = reorder(region, -polyarchy), y = polyarchy)) +
  geom_col(fill = "steelblue") + 
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    ) + theme_minimal()


Tip

See here for available ggplot2 themes.

Gives us a clean, elegant look.


Note that you can also save your plot as an object to modify later.


dem_bar_chart <- ggplot(dem_summary, aes(x = reorder(region, -polyarchy), y = polyarchy)) +
  geom_col(fill = "steelblue")

Which gives us…

dem_bar_chart


Now let’s add back our labels…


dem_bar_chart <- dem_bar_chart +
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    )

So now we have…

dem_bar_chart


And now we’ll add back our theme…


dem_bar_chart <- dem_bar_chart + theme_minimal()

Voila!

dem_bar_chart

Change the theme. There are many themes to choose from.

dem_bar_chart + theme_bw()
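
A point worth checking when working with saved plots: adding something with + returns a new plot and leaves the saved object unchanged unless you assign the result back. A minimal sketch with hypothetical toy data and object names:

```r
library(ggplot2)

# Hypothetical toy data (made up for illustration)
toy <- data.frame(group = c("A", "B", "C"), score = c(0.3, 0.6, 0.8))

base_plot <- ggplot(toy, aes(x = group, y = score)) +
  geom_col(fill = "steelblue")

# This creates a new object; base_plot itself is not modified
titled_plot <- base_plot + labs(title = "Toy plot")

is.null(base_plot$labels$title)  # the saved plot still has no title
titled_plot$labels$title
```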

Your Turn!


  1. glimpse() the data
  2. Find a new variable to visualize
  3. Make a bar chart with it
  4. Change the color of the bars
  5. Order the bars
  6. Add labels
  7. Add a theme
  8. Try saving your plot as an object
  9. Then change the labels and/or theme

Histograms

Purpose of Histograms


  • Histograms are used to visualize the distribution of a single variable
  • x-axis represents value of variable of interest
  • y-axis represents the frequency of that value



  • They are generally used for continuous variables (e.g., income, age, etc.)
    • A continuous variable is one that can take on any value within a range (e.g., 0.5, 1.2, 3.7, etc.)
    • A discrete variable is one that can only take on certain values (e.g., 1, 2, 3, etc.)
  • Typically, the height of the bar represents the number of observations which fall in that bin
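
The binning idea can be sketched by hand in base R: cut the range of the variable into intervals and count the observations in each. The values below are hypothetical and only illustrate the mechanics.

```r
# Hypothetical values of a continuous variable (made up for illustration)
x <- c(1.2, 2.5, 2.7, 3.1, 4.8, 5.0, 5.5, 7.9, 8.3, 9.6)

# Split the range 0 to 10 into five bins of width 2
breaks <- seq(0, 10, by = 2)
bins <- cut(x, breaks = breaks)

# Count how many observations fall in each bin;
# these counts are the bar heights of a histogram
counts <- table(bins)
counts
```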

Example

Histogram Code


# load dplyr

library(dplyr)

# load data
dem_women <- read_csv("data/dem_women.csv")

# filter to 2022
dem_women_2022 <- dem_women |>
  filter(year == 2022) 

# create histogram
ggplot(dem_women_2022, aes(x = flfp)) +
  geom_histogram(fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Number of Countries",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()



Note that you only need to specify the x-axis variable in the aes() function. ggplot2 automatically computes the counts shown on the y-axis of a histogram.


ggplot(dem_women_2022, aes(x = flfp)) +
  geom_histogram(bins = 50, fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Number of Countries",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()
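
If you want to see the values ggplot2 computes for the y-axis, one option is layer_data(), which returns the data behind a layer. This is a sketch on hypothetical toy data, not the lecture data.

```r
library(ggplot2)

# Hypothetical toy data (made up for illustration)
toy <- data.frame(value = c(1.2, 2.5, 2.7, 3.1, 4.8, 5.0, 5.5, 7.9, 8.3, 9.6))

p <- ggplot(toy, aes(x = value)) +
  geom_histogram(bins = 5, fill = "steelblue")

# layer_data() returns the values computed for the first layer,
# including the bin edges and the count in each bin
computed <- layer_data(p)
computed[, c("xmin", "xmax", "count")]
```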

Change Number of Bins


Change number of bins (bars) using bins or binwidth arguments (default number of bins = 30):


ggplot(dem_women_2022, aes(x = flfp)) +
  geom_histogram(bins = 50, fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Number of Countries",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()

At 50 bins…

At 100 bins…probably too many!


Using binwidth instead of bins


ggplot(dem_women_2022, aes(x = flfp)) +
  geom_histogram(binwidth = 2, fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Number of Countries",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()

Setting binwidth to 2…
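
The bins and binwidth arguments are two ways of controlling the same thing. As a rough back-of-the-envelope sketch, the number of bins is about the range of the variable divided by the bin width. The range below is hypothetical, and the count ggplot2 actually draws may differ by one depending on how the bins are aligned.

```r
# Hypothetical range for a percentage variable (not taken from the data)
x_min <- 20
x_max <- 90
binwidth <- 2

# Approximate number of bins implied by this bin width
n_bins <- ceiling((x_max - x_min) / binwidth)
n_bins
```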

Change from Count to Density


ggplot(dem_women_2022, aes(x = flfp, y = after_stat(density))) +
  geom_histogram(fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Density",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()


For densities, the total area of the bars sums to 1. The area of a bar (its height times the bin width) represents the proportion of observations in that bin, rather than the number of observations.

Which gives us…
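
The claim that the total area sums to 1 can be checked numerically. The sketch below uses base R's hist() on hypothetical values, since it exposes the densities and bin edges directly.

```r
# Hypothetical values (made up for illustration)
x <- c(1.2, 2.5, 2.7, 3.1, 4.8, 5.0, 5.5, 7.9, 8.3, 9.6)

# Compute the histogram without drawing it
h <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

# Area of each bar = density (height) times bin width
widths <- diff(h$breaks)
areas <- h$density * widths

sum(areas)            # total area of all bars
areas                 # share of observations in each bin
h$counts / length(x)  # the same shares computed from the counts
```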

Your Turn!


  1. Pick a variable that you want to explore the distribution of
  2. Make a histogram
    1. Only specify x = in aes()
    2. Specify geom as geom_histogram
  3. Choose color for bars
  4. Choose appropriate labels
  5. Change number of bins
  6. Change from count to density