Module 4.1

Making Tables

Prework

Get a U.S. Census api key
Install tidycensus to retrieve the data we will use
Install kableExtra and gt, out packages for making tables
install webshot2 for exporting gt tables

install.packages(c("tidycensus", "kableExtra", "gt", "webshot2"))

We will be using the stringr package, which is part of the Tidyverse, so you probably already have it installed. But spend some time reading about its usage and features.
Install webshot2 for the purposes of saving a .png of your table (install.packages("webshot2"))

Overview

This week we are going to be talking about making tables in R. Tables can be a great way to summarize data for your audience. While there are no hard and fast rules about when to use a table versus a plot, typically we use tables when we want to present summary statistics that we want to compare across groups or when we want to show the precise values for individual data points. This can be true when we have a small number of cases that we want to discuss.

In this module we are going to be working with the tidycensus package to download income data from the American Community Survey (ACS) and the kableExtra and gt packages to visualize it. Along the way, we discuss “the grammar of tables” and some situations where a table would be less appropriate than other methods of visualizing our data.

Working with tidycensus

We are going to start by using tidycensus to download some income data. To use tidycensus you need a Census API key, which you can get here.

library(tidycensus)
census_api_key("YOUR API KEY") # enter your census api key here in quotes

Use the load_variables() function to import data from the census or ACS for a particular year. There is a cache = TRUE option if you want to cache the data for faster retrieval in the future. We can save our data in an object called v21 and then click on it or use View() and the search function in the data frame viewer to see what data are available.

v21 <- load_variables(2021, "acs5", cache = TRUE)

#View(v21)

We want data on income quintiles, so let’s search for “quintile” in the search field. From there we use get_acs() to retrieve the data based on the codes for the five quintiles and the top five percent of earners.

In our call, we specify “state” as the geography and “2021” as the year. This will ensure that data from the 2017-2021 ACS is retrieved. Note that by default, tidycensus returns data such that rows represent a unit-variable combination. To get the data with census variables in the columns, we have to specify wide form with output = "wide". We will select all of the variables except for the margin of error and GEOID. Let’s rename NAME so it is in lower case. And for some reason, tidycensus puts an “E” suffix at the end of all of our variables when we specify wide format, so let’s use rename_with() and str_remove from the [stringr] package to get rid of that suffix. We will save the data frame in an object called quintiles.

library(stringr)
library(dplyr)

quintiles <- get_acs(geography = "state", 
                      variables = c(q1 = "B19081_001",
                                    q2 = "B19081_002",
                                    q3 = "B19081_003",
                                    q4 = "B19081_004",
                                    q5 = "B19081_005",
                                    top5 = "B19081_006"),
                      year = 2021,
                      output = "wide") |>
                      select(
                        !ends_with("M"), # eliminate margin of error
                        -GEOID) |> # eliminate geo id
                      rename(name = NAME) |>
                      rename_with(~str_remove(., 'E'))
    

glimpse(quintiles)

Rows: 52
Columns: 7
$ name <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colora…
$ q1   <dbl> 11602, 19344, 15486, 12076, 17433, 18963, 17417, 17052, 12971, 14…
$ q2   <dbl> 31928, 51030, 40774, 31051, 49234, 49711, 48870, 44638, 51060, 37…
$ q3   <dbl> 55270, 81169, 66384, 52280, 84658, 80679, 84042, 73187, 94478, 62…
$ q4   <dbl> 88640, 122652, 102299, 82811, 134560, 123359, 133488, 111915, 157…
$ q5   <dbl> 193311, 242097, 223521, 188510, 309857, 264516, 319533, 238612, 3…
$ top5 <dbl> 336788, 394694, 395620, 344470, 555007, 466181, 602707, 420859, 6…

Explore the data with kableExtra

Now let’s use some dplyr slice functions and kableExtra to subset and explore the data. First let’s see which are the wealthiest states. To do this, we can apply slice_max() to identify the states with the highest incomes among the top 5 percent of wage earners. We will save that list as an object called top_10 and then call kable() to view it.

library(kableExtra)

top_10 <- quintiles |>
  slice_max(top5, n = 10)

kable(top_10)

name	q1	q2	q3	q4	q5	top5
District of Columbia	12971	51060	94478	157803	375792	670768
Connecticut	17417	48870	84042	133488	319533	602707
New York	14054	42220	75647	123318	302676	574063
New Jersey	18458	52339	90337	142858	319140	562886
Massachusetts	16812	50519	89602	142491	316447	558616
California	17433	49234	84658	134560	309857	555007
Maryland	19946	55165	91725	140353	293979	503597
Washington	19367	50803	82817	127003	277165	487950
Virginia	17922	48359	81072	127411	280299	486006
Illinois	15102	42688	72900	114531	258373	466713

Now let’s do the same thing but searching for the poorest states instead. We will use the slice_min() function to identify the states with the lowest incomes in the first quintile of earners.

bottom_10 <- quintiles |>
  slice_min(q1, n = 10)

kable(bottom_10)

name	q1	q2	q3	q4	q5	top5
Puerto Rico	2906	12144	22163	38397	99043	187234
Mississippi	10096	27963	49418	80125	175581	308523
Louisiana	10371	29781	53925	89536	201514	357026
New Mexico	11058	31274	54295	86905	188282	323568
West Virginia	11120	29606	51038	81393	174019	299882
Alabama	11602	31928	55270	88640	193311	336788
Kentucky	11846	32616	55838	88089	194168	350411
Arkansas	12076	31051	52280	82811	188510	344470
South Carolina	12680	34881	58665	92118	207367	374427
District of Columbia	12971	51060	94478	157803	375792	670768

OK now let’s make a table with a selection of states that reflects the full range of household incomes. So first, we will use slice_min() and slice_max() without arguments to select the state with the poorest households in q1 and the state with the wealthiest households in top5. The we will use slice_sample() to take a random sample of five additional states.

We will store these selections in three objects and then combine them into a single data frame called states using the dplyr function bind_rows. bind_rows() appends data frames with different observations for the same set of columns. You can think of its a kind of “verticle merging” of data frames.

After we have done the append, we can view the new data by calling kable(states).

# lowest 
state_min <- quintiles |> 
  slice_min(q1) 

# highest
state_max <- quintiles |> 
  slice_max(top5) 

# randomly select five more
five_more <- quintiles |>
   slice_sample(n = 5) 

states <- bind_rows(state_min, state_max, five_more) |>
  arrange(desc(top5))

kable(states)

name	q1	q2	q3	q4	q5	top5
District of Columbia	12971	51060	94478	157803	375792	670768
Washington	19367	50803	82817	127003	277165	487950
Delaware	17052	44638	73187	111915	238612	420859
North Dakota	15507	41406	68558	104602	217143	380261
Nebraska	16506	41590	66977	100910	213091	378679
South Carolina	12680	34881	58665	92118	207367	374427
Puerto Rico	2906	12144	22163	38397	99043	187234

Display the data with a gt table

Now that we have some good data for a table, let’s make a really beautiful table with the gt package. “gt” stands for “grammar of tables.” So just as we talked about a “grammar of graphics” when we were studying plots, we can talk about the “grammar of tables” and break a table down into its component parts.

gt envisions six main parts of a table that can be customized in various ways. The table header includes the title and possibly a subtitle. Next, we have the stub section that contains our row labels and, above that, a stubhead label, which we could use to provide more information about what is in the rows. Then, we have column labels that tell us about what is in each column and the table body which contains the actual data that we want to present. Finally, we have the table footer, which would contain any notes that we have about information contained in the table as well as information about sources. Check out the gt function reference to get a sense of all the customizations available.

Let’s go ahead and start out by making a basic gt table with a title, subtitle, column labels, source note and format the numbers as dollar figures.

Make a good gt table

library(gt)

goodtable <- gt(states) |> 
  tab_header(
    title = "Mean Household Income of Quintiles, 2021",
    subtitle = "Seven Representative U.S. States"
  ) |> 
  cols_label(
    name = "",
    q1 = "lowest",
    q2 = "second",
    q3 = "third",
    q4 = "fourth",
    q5 = "highest",
    top5 = "top 5%"
  ) |> 
  fmt_currency(
    columns = c(q1:top5),
    currency = "USD", 
    use_subunits = FALSE
  ) |>
  # note that you can use markdown (md) to format the source note for html documents
  tab_source_note(source_note = md("**Source**: US Census Bureau, American Community Survey"))

goodtable

Mean Household Income of Quintiles, 2021
Seven Representative U.S. States
	lowest	second	third	fourth	highest	top 5%
District of Columbia	$12,971	$51,060	$94,478	$157,803	$375,792	$670,768
Washington	$19,367	$50,803	$82,817	$127,003	$277,165	$487,950
Delaware	$17,052	$44,638	$73,187	$111,915	$238,612	$420,859
North Dakota	$15,507	$41,406	$68,558	$104,602	$217,143	$380,261
Nebraska	$16,506	$41,590	$66,977	$100,910	$213,091	$378,679
South Carolina	$12,680	$34,881	$58,665	$92,118	$207,367	$374,427
Puerto Rico	$2,906	$12,144	$22,163	$38,397	$99,043	$187,234
Source: US Census Bureau, American Community Survey

Change column width

Now let’s add some further customization. One thing to pay attention to is the column width. Too narrow of columns can make it difficult to read the information, while too wide of columns can cause the table not to fit on the page. We can adjust the column width of our table with the cols_width() function.

vgoodtable <- goodtable |>
  cols_width(c(q1:top5) ~ px(90))

vgoodtable

Mean Household Income of Quintiles, 2021
Seven Representative U.S. States
	lowest	second	third	fourth	highest	top 5%
District of Columbia	$12,971	$51,060	$94,478	$157,803	$375,792	$670,768
Washington	$19,367	$50,803	$82,817	$127,003	$277,165	$487,950
Delaware	$17,052	$44,638	$73,187	$111,915	$238,612	$420,859
North Dakota	$15,507	$41,406	$68,558	$104,602	$217,143	$380,261
Nebraska	$16,506	$41,590	$66,977	$100,910	$213,091	$378,679
South Carolina	$12,680	$34,881	$58,665	$92,118	$207,367	$374,427
Puerto Rico	$2,906	$12,144	$22,163	$38,397	$99,043	$187,234
Source: US Census Bureau, American Community Survey

Change font

Next we want to make sure that the fonts that we use are legible and accessible, just like we did for our charts. We can do this with opt_table_font().

greattable <- vgoodtable |>
  opt_table_font(font = "verdana")

greattable

Mean Household Income of Quintiles, 2021
Seven Representative U.S. States
	lowest	second	third	fourth	highest	top 5%
District of Columbia	$12,971	$51,060	$94,478	$157,803	$375,792	$670,768
Washington	$19,367	$50,803	$82,817	$127,003	$277,165	$487,950
Delaware	$17,052	$44,638	$73,187	$111,915	$238,612	$420,859
North Dakota	$15,507	$41,406	$68,558	$104,602	$217,143	$380,261
Nebraska	$16,506	$41,590	$66,977	$100,910	$213,091	$378,679
South Carolina	$12,680	$34,881	$58,665	$92,118	$207,367	$374,427
Puerto Rico	$2,906	$12,144	$22,163	$38,397	$99,043	$187,234
Source: US Census Bureau, American Community Survey

Center

Then we want to check to make sure that information is aligned properly in the table. We can use right, left or center justify our text, depending on its purpose, by calling cols_align().

vgreattable <- greattable |>
  cols_align(
  align = "center",
  columns = q1:top5
)

vgreattable

Mean Household Income of Quintiles, 2021
Seven Representative U.S. States
	lowest	second	third	fourth	highest	top 5%
District of Columbia	$12,971	$51,060	$94,478	$157,803	$375,792	$670,768
Washington	$19,367	$50,803	$82,817	$127,003	$277,165	$487,950
Delaware	$17,052	$44,638	$73,187	$111,915	$238,612	$420,859
North Dakota	$15,507	$41,406	$68,558	$104,602	$217,143	$380,261
Nebraska	$16,506	$41,590	$66,977	$100,910	$213,091	$378,679
South Carolina	$12,680	$34,881	$58,665	$92,118	$207,367	$374,427
Puerto Rico	$2,906	$12,144	$22,163	$38,397	$99,043	$187,234
Source: US Census Bureau, American Community Survey

Add borders and lines

Finally, we want to think about how to use borders and lines to separate and identify different elements of the table using tab_options() like this.

awesometable <- vgreattable |>
  tab_options(
    table.border.top.color = "black", 
    table.border.bottom.color = "black",
    heading.border.bottom.color = "black", 
    column_labels.border.bottom.color = "black", 
    table_body.border.bottom.color = "black"
  )

awesometable

Mean Household Income of Quintiles, 2021
Seven Representative U.S. States
	lowest	second	third	fourth	highest	top 5%
District of Columbia	$12,971	$51,060	$94,478	$157,803	$375,792	$670,768
Washington	$19,367	$50,803	$82,817	$127,003	$277,165	$487,950
Delaware	$17,052	$44,638	$73,187	$111,915	$238,612	$420,859
North Dakota	$15,507	$41,406	$68,558	$104,602	$217,143	$380,261
Nebraska	$16,506	$41,590	$66,977	$100,910	$213,091	$378,679
South Carolina	$12,680	$34,881	$58,665	$92,118	$207,367	$374,427
Puerto Rico	$2,906	$12,144	$22,163	$38,397	$99,043	$187,234
Source: US Census Bureau, American Community Survey

Export your table

Now try exporting your table as .png file with the gtsave() function.

gtsave(awesometable, "awesometable.png")

When a plot is better than a table

Be judicious with your use of tables. You would not want to use tables where a plot is more appropriate. For example you would not want to use a table to show a trend over time (a line chart would be more appropriate) or to display the relationship between two variables (where a scatter plot would be more appropriate).

Another case where a table would be less effective than a plot is in showing estimates, margins of error and confidence intervals. Let’s do an example with median income estimates. We can start by searching for “median income” and discover that the code for median income is B06011_001. Let’s use that to extract the median income for counties in the state of Massachusetts.

library(janitor)

mass_med_inc <- get_acs(
  geography = "county", 
  variables = c(median_income = "B06011_001"), 
  state = "MA", 
  year = 2021
  ) |>
  mutate(
    lower_90 = estimate - moe,
    upper_90 = estimate + moe 
  ) |>
  clean_names() |>
  mutate(name = str_replace_all(name, " County, Massachusetts", "")) |>
  select(name, estimate, lower_90, upper_90)

glimpse(mass_med_inc)

Rows: 14
Columns: 4
$ name     <chr> "Barnstable", "Berkshire", "Bristol", "Dukes", "Essex", "Fran…
$ estimate <dbl> 40442, 33040, 36910, 40119, 39756, 34775, 32262, 30795, 51808…
$ lower_90 <dbl> 39554, 32074, 36347, 34791, 39174, 33712, 31707, 29868, 51337…
$ upper_90 <dbl> 41330, 34006, 37473, 45447, 40338, 35838, 32817, 31722, 52279…

We can select the county name, median income estimate and the upper and lower confidence intervals and put those in a table.

kable(mass_med_inc)

name	estimate	lower_90	upper_90
Barnstable	40442	39554	41330
Berkshire	33040	32074	34006
Bristol	36910	36347	37473
Dukes	40119	34791	45447
Essex	39756	39174	40338
Franklin	34775	33712	35838
Hampden	32262	31707	32817
Hampshire	30795	29868	31722
Middlesex	51808	51337	52279
Nantucket	45717	38260	53174
Norfolk	52591	51991	53191
Plymouth	43684	43014	44354
Suffolk	39200	38624	39776
Worcester	39009	38454	39564

But this is not very compelling. So instead, we can plot confidence intervals with ggplot using a combination of geom_errorbar() and geom_point().

library(ggplot2)

mass_med_inc |>
  ggplot(aes(x = estimate, y = reorder(name, estimate))) +
  geom_errorbar(aes(xmin = lower_90, xmax = upper_90)) +
  geom_point(color = "red", size = 2) +
  labs(title = "Household income by county in Massachusetts",
       subtitle = "2017-2021 American Community Survey",
       y = "",
       x = "Median Income", 
       caption = "ACS estimate (bars represent 90% confidence intervals)") +
  theme_minimal()

This conveys a lot more information relative to a table.