Understanding Data

Sources and Structure

September 22, 2024

Preliminaries

Where Does Data Come From?

Thoughts? 😎 💭

Your boss or a client sends you a file
Survey data collected by you or someone else
You can download it from a website
You can scrape it from a website
A data package (e.g. unvotes)
You can access it through an API

Getting Started with Data

Tabular data is data that is organized into rows and columns
- a.k.a. rectangular data
A data frame is a special kind of tabular data used in data science
A variable is something you can measure
An observation is a single unit or case in your data set
The unit of analysis is the level at which you are measuring
- In a cross-section: country, state, county, city, individual, etc.
- In a time-series: year, month, day, etc.
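These terms can be made concrete with a tiny data frame. This is only a sketch: the object name `panel` and all of its values are made up for illustration, and the unit of analysis is assumed to be the country-year.

```r
library(dplyr)

# Made-up values for illustration only; "A" and "B" are placeholder countries
panel <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2001, 2000, 2001),
  gdp_pc  = c(1.2, 1.3, 4.5, 4.7)
)

nrow(panel)    # observations (rows): one per country-year
ncol(panel)    # variables (columns)
glimpse(panel) # column names and types
```

Each row is one observation and each column is one variable, so this small table is rectangular.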

Adjectives for Your Data

The Concept of “Tidy Data”

Each column represents a single variable
Each row represents a single observation
Each cell represents a single value
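One way to see the three rules in action is to reshape a table that breaks them. In this sketch the `wide` table and its values are invented; its year columns hold values of a single variable, so `pivot_longer()` from tidyr is used to give each variable its own column.

```r
library(dplyr)
library(tidyr)

# Made-up wide table: the year columns hold values of one variable (gdp_pc)
wide <- tibble(
  country = c("A", "B"),
  `2000`  = c(1.2, 4.5),
  `2001`  = c(1.3, 4.7)
)

# Reshape so that country, year, and gdp_pc are each a single column
tidy <- wide |>
  pivot_longer(
    cols      = -country,
    names_to  = "year",
    values_to = "gdp_pc"
  ) |>
  mutate(year = as.integer(year))

tidy
```

After reshaping, each row is one country-year observation and each cell holds one value.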

Tidy Data Example

What are Clean Data?

Column names are easy to work with and are not duplicated
Missing values have been dealt with
There are no repeated observations or columns
There are no blank observations or columns
The data are in the proper format, for example dates should be formatted as dates
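The checklist above could be sketched with dplyr roughly as follows. The `messy` table, its column names, and its values are all hypothetical, and this is only one possible order of steps.

```r
library(dplyr)

# Made-up messy data: awkward names, a repeated row, a blank row, dates as text
messy <- tibble(
  `Country Name` = c("A", "A", "B", NA),
  `Survey Date`  = c("2024-01-15", "2024-01-15", "2024-02-01", NA),
  Score          = c(0.8, 0.8, NA, NA)
)

clean <- messy |>
  rename(
    country     = `Country Name`,
    survey_date = `Survey Date`,
    score       = Score
  ) |>
  distinct() |>                               # drop repeated observations
  filter(!if_all(everything(), is.na)) |>     # drop fully blank rows
  mutate(survey_date = as.Date(survey_date))  # store dates as Date values

clean
```

The missing `score` for country "B" is left in place here, since how to deal with missing values depends on the analysis.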

Messy Data Example

Which of These is Likely Tidy/Clean?

Your boss or a client sends you a file
Survey data collected by you or someone else
You can download it from a website
You can scrape it from a website
A curated collection (e.g. unvotes)
You can access it through an API

How Do We Get Tidy/Clean Data?

Wrangle it ourselves
Use a package where it has been wrangled for us
Download via an API

Reading Data

Read Data into R

Use the read_csv() function from the readr package
The readr package is part of the tidyverse
Offers more than base R's read.csv(): it returns a tibble and reports the column types it guessed
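A small side-by-side comparison may help show the difference. This sketch writes a made-up CSV to a temporary file (the column names and values are invented) and reads it with both base R's `read.csv()` and readr's `read_csv()`.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("name,joined,score", "a,2024-01-15,0.5", "b,2024-02-01,0.7"),
  csv_path
)

base_df  <- read.csv(csv_path)
readr_df <- read_csv(csv_path, show_col_types = FALSE)

class(base_df$joined)   # base R keeps the date column as text
class(readr_df$joined)  # readr guesses a Date column
class(readr_df)         # readr returns a tibble
```

Setting `show_col_types = FALSE` only silences the column-type message; leave it out to see what readr guessed.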

R Code Review

<- is the assignment operator
- Use it to assign values to objects
# is the comment operator
- Use it to comment out code or add comments
- In Markdown text, # marks a heading instead
To load a package, use library() with the name of the package
- the name does not need to be in quotes, e.g. library(readr)
- quotes are required only when you install it, e.g. install.packages("readr")
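The points above fit in a few lines of R. This is just an illustrative sketch; the object name `n_regions` is made up.

```r
# Everything after # on a line is a comment and is not run

# install.packages("readr")  # run once to install; note the quotes
library(readr)               # run in each new session; quotes are optional

# Assign a value to an object with <-
n_regions <- 6
n_regions * 2
```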

Read Data into R

# load libraries
library(readr)
library(dplyr)

dem_summary <- read_csv("data/dem_summary.csv") # notice the relative file path

glimpse(dem_summary)

Viewing the Data in R

Use glimpse() to see the columns and data types:

# load libraries
library(readr)
library(dplyr)

dem_summary <- read_csv("data/dem_summary.csv")

glimpse(dem_summary)

Rows: 6
Columns: 5
$ region    <chr> "The West", "Latin America", "Eastern Europe", "Asia", "Afri…
$ polyarchy <dbl> 0.8709230, 0.6371358, 0.5387451, 0.4076602, 0.3934166, 0.245…
$ gdp_pc    <dbl> 37.913054, 9.610284, 12.176554, 9.746391, 4.410484, 21.134319
$ flfp      <dbl> 52.99082, 48.12645, 50.45894, 50.32171, 56.69530, 26.57872
$ women_rep <dbl> 28.12921, 21.32548, 17.99728, 14.45225, 17.44296, 10.21568

Or use View(), or click on the name of the object in your Environment tab, to see the data in a spreadsheet-style viewer.

Try It Yourself!

Open the CSV file to see what it looks like
Then use this code to read it into R and view it

# load libraries
library(readr)
library(dplyr)

dem_summary <- read_csv("data/dem_summary.csv")

glimpse(dem_summary)


Write a New CSV File

Now try writing the same data to a file with a different name

write_csv(dem_summary, "data/your_new_file_name.csv")
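To check that a write worked, one option is a round trip: write the data, then read it back. The sketch below uses a made-up `scores` data frame and a temporary folder so it runs without the course files.

```r
library(readr)

# Made-up data for illustration only
scores <- data.frame(id = 1:3, score = c(0.5, 0.7, 0.9))

# Write to a temporary folder, then read the file back in
out_path <- file.path(tempdir(), "scores_copy.csv")
write_csv(scores, out_path)

file.exists(out_path)
scores_back <- read_csv(out_path, show_col_types = FALSE)
scores_back
```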


Excel Files

Read in Excel File

library(readxl)

dem_summary <- read_excel("data/dem_summary.xlsx")

glimpse(dem_summary)

Try With Excel

Read in the Excel file
Follow the same steps as with the CSV file
- use read_excel() to read in the data
- install writexl and try writing the data back out with write_xlsx()
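A possible starting point for the writexl experiment is sketched below. It assumes both readxl and writexl are installed, and it uses a made-up `scores` data frame and a temporary file rather than the course data.

```r
library(readxl)
library(writexl)

# Made-up data for illustration only
scores <- data.frame(id = 1:3, score = c(0.5, 0.7, 0.9))

# Write an .xlsx file to a temporary location, then read it back in
xlsx_path <- tempfile(fileext = ".xlsx")
write_xlsx(scores, xlsx_path)

scores_xlsx <- read_excel(xlsx_path)
scores_xlsx
```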


Google Sheets

Import Data from Google Sheets

Use the googlesheets4 package
Have a look at these Gapminder data
Use gs4_deauth() to skip authentication when reading a public sheet
Then use read_sheet() to read in the data

Example Code

library(googlesheets4)

# Deauthorize to access public sheets without credentials
gs4_deauth()

# Read in the gapminder Africa data
gapminder_data <- read_sheet("1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY")

Or…

library(googlesheets4)

# Deauthorize to access public sheets without credentials
gs4_deauth()

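# Look up a Sheet named "gapminder" in your Google Drive, then read it
# (the lookup by name generally requires googledrive authentication)
gapminder_data <- googledrive::drive_get("gapminder") |>
  read_sheet()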

Try It Yourself!

Use the code above to read in the data
Try reading in Gapminder data for a different continent (use the sheet argument of read_sheet())


Find Your Own Data

Visit kaggle.com
Find a dataset you like
Download it as a CSV
Upload to your Posit Cloud project
Read it into R
Explore with glimpse() and View()
