Learning objectives

  • Focus on the third tidy data principle
    • Each variable forms a column.
    • Each observation forms a row.
    • Each type of observational unit forms a table.
  • Be able to use dplyr’s join functions to merge tables

Joins

The third tidy data maxim states that each observation type gets its own table. The idea of multiple tables within a dataset will be familiar to anyone who has worked with a relational database but may seem foreign to those who have not.

The idea is this: Suppose we conduct a behavioral experiment that puts individuals in groups, and we measure both individual- and group-level variables. We should have a table for the individual-level variables and a separate table for the group-level variables. Then, should we need to merge them, we can do so using the join functions of dplyr.

The join functions are nicely illustrated in RStudio’s Data wrangling cheatsheet. Each function takes two data.frames and, optionally, the name(s) of columns on which to match. If no column names are provided, the functions match on all shared column names.

The different join functions control what happens to rows that exist in one table but not the other.

  • left_join keeps all the entries that are present in the left (first) table and excludes any that are only in the right table.
  • right_join keeps all the entries that are present in the right table and excludes any that are only in the left table.
  • inner_join keeps only the entries that are present in both tables. inner_join is the only function that guarantees you won’t generate any missing entries.
  • full_join keeps all of the entries in both tables, regardless of whether or not they appear in the other table.
dplyr joins, via RStudio

dplyr joins, via RStudio

Practice with joins

Fabricate some example data

set.seed(12345)
x <- data.frame(key= LETTERS[c(1:3, 5)], 
                value1 = sample(1:10, 4), 
                stringsAsFactors = FALSE)
y <- data.frame(key= LETTERS[c(1:4)], 
                value2 = sample(1:10, 4), 
                stringsAsFactors = FALSE)

x
##   key value1
## 1   A      8
## 2   B     10
## 3   C      7
## 4   E      9
y
##   key value2
## 1   A      5
## 2   B      2
## 3   C      3
## 4   D      4

“Joining” joins

# What's in both x and y?
inner_join(x, y, by = "key")
##   key value1 value2
## 1   A      8      5
## 2   B     10      2
## 3   C      7      3
# What's in X and bring with it the stuff that matches in Y
left_join(x, y, by = "key")
##   key value1 value2
## 1   A      8      5
## 2   B     10      2
## 3   C      7      3
## 4   E      9     NA
# What's in Y and bring with it the stuff that matches in Y
right_join(x, y, by = "key")
##   key value1 value2
## 1   A      8      5
## 2   B     10      2
## 3   C      7      3
## 4   D     NA      4
# Give me everything!
full_join(x, y, by = "key")
##   key value1 value2
## 1   A      8      5
## 2   B     10      2
## 3   C      7      3
## 4   E      9     NA
## 5   D     NA      4

Filtering “joins”

# Give me the stuff in X that is also in Y
semi_join(x, y, by = "key")
##   key value1
## 1   A      8
## 2   B     10
## 3   C      7
# Give me the stuff in X that is not in Y
anti_join(x, y, by = "key")
##   key value1
## 1   E      9
# Want everything that doesn't match?
full_join(anti_join(x,y, by = "key"), anti_join(y,x, by = "key"), by= "key")
##   key value1 value2
## 1   E      9     NA
## 2   D     NA      4
# keys with different names?
x <- data.frame(keyX = LETTERS[c(1:3, 5)], 
                value1 = sample(1:10, 4), 
                stringsAsFactors = FALSE)
y <- data.frame(keyY = LETTERS[c(1:4)], 
                value2 = sample(1:10, 4), 
                stringsAsFactors = FALSE)

x
##   keyX value1
## 1    A      8
## 2    B      9
## 3    C      1
## 4    E      2
y
##   keyY value2
## 1    A      8
## 2    B      1
## 3    C      4
## 4    D     10
full_join(x, y) #should error out
## Error: `by` required, because the data sources have no common variables
full_join(x, y, by=c("keyX" = "keyY"))
##   keyX value1 value2
## 1    A      8      8
## 2    B      9      1
## 3    C      1      4
## 4    E      2     NA
## 5    D     NA     10

Set Operations

df1 <- data_frame(x = LETTERS[1:2], 
                  y = c(1L, 1L))
df2 <- data_frame(x = LETTERS[1:2], 
                  y = 1:2)

df1
## # A tibble: 2 x 2
##       x     y
##   <chr> <int>
## 1     A     1
## 2     B     1
df2
## # A tibble: 2 x 2
##       x     y
##   <chr> <int>
## 1     A     1
## 2     B     2
# Which rows are common in both datasets?
dplyr::intersect(df1, df2)
## # A tibble: 1 x 2
##       x     y
##   <chr> <int>
## 1     A     1
#Want all unique rows between both datasets?
dplyr::union(df1, df2)
## # A tibble: 3 x 2
##       x     y
##   <chr> <int>
## 1     B     2
## 2     B     1
## 3     A     1
#What's unique to df1?
dplyr::setdiff(df1, df2)
## # A tibble: 1 x 2
##       x     y
##   <chr> <int>
## 1     B     1
#What's unique to df2?
dplyr::setdiff(df2, df1)
## # A tibble: 1 x 2
##       x     y
##   <chr> <int>
## 1     B     2

Practice with joins using gapminder

We will practice on our continents data.frame from module 2 and the gapminder data.frame. Note how these are tidy data: We have observations at the level of continent and at the level of country, so they go in different tables. The continent column in the gapminder data.frame allows us to link them now. If continents data.frame isn’t in your Environment, load it and recall what it consists of:

load('data/continents.RDA')
continents
##    continent area_km2 population percent_total_pop
## 1     Africa 30370000 1022234000              15.0
## 2   Americas 42330000  934611000              14.0
## 3 Antarctica 13720000       4490               0.0
## 4       Asia 43820000 4164252000              60.0
## 5     Europe 10180000  738199000              11.0
## 6    Oceania  9008500   29127000               0.4

We can join the two data.frames using any of the dplyr functions. We will pass the results to str to avoid printing more than we can read, and to get more high-level information on the resulting data.frames.

left_join(gapminder, continents) 
## Joining, by = "continent"
## Warning: Column `continent` joining factor and character vector, coercing
## into character vector
## # A tibble: 1,704 x 11
##        country continent  year lifeExp      pop gdpPercap   total_gdp
##         <fctr>     <chr> <int>   <dbl>    <int>     <dbl>       <dbl>
##  1 Afghanistan      Asia  1952  28.801  8425333  779.4453  6567086330
##  2 Afghanistan      Asia  1957  30.332  9240934  820.8530  7585448670
##  3 Afghanistan      Asia  1962  31.997 10267083  853.1007  8758855797
##  4 Afghanistan      Asia  1967  34.020 11537966  836.1971  9648014150
##  5 Afghanistan      Asia  1972  36.088 13079460  739.9811  9678553274
##  6 Afghanistan      Asia  1977  38.438 14880372  786.1134 11697659231
##  7 Afghanistan      Asia  1982  39.854 12881816  978.0114 12598563401
##  8 Afghanistan      Asia  1987  40.822 13867957  852.3959 11820990309
##  9 Afghanistan      Asia  1992  41.674 16317921  649.3414 10595901589
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414 14121995875
## # ... with 1,694 more rows, and 4 more variables: log_gdp <dbl>,
## #   area_km2 <dbl>, population <dbl>, percent_total_pop <dbl>
right_join(gapminder, continents)
## Joining, by = "continent"
## Warning: Column `continent` joining factor and character vector, coercing
## into character vector
## # A tibble: 1,705 x 11
##    country continent  year lifeExp      pop gdpPercap    total_gdp
##     <fctr>     <chr> <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Algeria    Africa  1952  43.077  9279525  2449.008  22725632678
##  2 Algeria    Africa  1957  45.685 10270856  3013.976  30956113720
##  3 Algeria    Africa  1962  48.303 11000948  2550.817  28061403854
##  4 Algeria    Africa  1967  51.407 12760499  3246.992  41433235247
##  5 Algeria    Africa  1972  54.518 14760787  4182.664  61739408943
##  6 Algeria    Africa  1977  58.014 17152804  4910.417  84227416174
##  7 Algeria    Africa  1982  61.368 20033753  5745.160 115097120653
##  8 Algeria    Africa  1987  65.799 23254956  5681.359 132119742845
##  9 Algeria    Africa  1992  67.744 26298373  5023.217 132102425043
## 10 Algeria    Africa  1997  69.152 29072015  4797.295 139467033682
## # ... with 1,695 more rows, and 4 more variables: log_gdp <dbl>,
## #   area_km2 <dbl>, population <dbl>, percent_total_pop <dbl>

These operations produce slightly different results, either 1704 or 1705 observations. Can you figure out why? Antarctica contains no countries so doesn’t appear in the gapminder data.frame. When we use left_join it gets filtered from the results, but when we use right_join it appears, with missing values for all of the country-level variables:

right_join(gapminder, continents) %>% 
  filter(continent == "Antarctica")
## Joining, by = "continent"
## Warning: Column `continent` joining factor and character vector, coercing
## into character vector
## # A tibble: 1 x 11
##   country  continent  year lifeExp   pop gdpPercap total_gdp log_gdp
##    <fctr>      <chr> <int>   <dbl> <int>     <dbl>     <dbl>   <dbl>
## 1      NA Antarctica    NA      NA    NA        NA        NA      NA
## # ... with 3 more variables: area_km2 <dbl>, population <dbl>,
## #   percent_total_pop <dbl>

There’s another problem in this data.frame – it has two population measures, one by continent and one by country and it’s not clear which is which! Let’s rename a couple of columns.

right_join(gapminder, continents) %>% 
  rename(country_pop = pop, continent_pop = population)
## Joining, by = "continent"
## Warning: Column `continent` joining factor and character vector, coercing
## into character vector
## # A tibble: 1,705 x 11
##    country continent  year lifeExp country_pop gdpPercap    total_gdp
##     <fctr>     <chr> <int>   <dbl>       <int>     <dbl>        <dbl>
##  1 Algeria    Africa  1952  43.077     9279525  2449.008  22725632678
##  2 Algeria    Africa  1957  45.685    10270856  3013.976  30956113720
##  3 Algeria    Africa  1962  48.303    11000948  2550.817  28061403854
##  4 Algeria    Africa  1967  51.407    12760499  3246.992  41433235247
##  5 Algeria    Africa  1972  54.518    14760787  4182.664  61739408943
##  6 Algeria    Africa  1977  58.014    17152804  4910.417  84227416174
##  7 Algeria    Africa  1982  61.368    20033753  5745.160 115097120653
##  8 Algeria    Africa  1987  65.799    23254956  5681.359 132119742845
##  9 Algeria    Africa  1992  67.744    26298373  5023.217 132102425043
## 10 Algeria    Africa  1997  69.152    29072015  4797.295 139467033682
## # ... with 1,695 more rows, and 4 more variables: log_gdp <dbl>,
## #   area_km2 <dbl>, continent_pop <dbl>, percent_total_pop <dbl>

Challenge – Putting the pieces together

A colleague suggests that the more land area an individual has, the greater their gdp will be and that this relationship will be observable at any scale of observation. You chuckle and mutter “Not at the continental scale,” but your colleague insists. Test your colleague’s hypothesis by:

  • Calculating the total GDP of each continent,
    • Hint: Use dplyr’s group_by and summarize
  • Joining the resulting data.frame to the continents data.frame,
  • Calculating the per-capita GDP for each continent, and
  • Plotting per-capita gdp versus population density.

Challenge solutions

Solution to Challenge – Putting the pieces together

library(ggplot2)

gapminder %>%
  mutate(GDP = gdpPercap * pop) %>%  # Calculate country-level GDP
  group_by(continent) %>%  # Group by continent
  summarize(cont_gdp = sum(GDP)) %>%  # Calculate continent-level GDP
# Join the continent-GDP data.frame to the continents data.frame
    left_join(continents) %>%  
# Calculate continent-level per-capita GDP
    mutate(per_cap = cont_gdp / population) %>%  
# Plot gdp versus land area
    ggplot(aes(x = area_km2, y = per_cap)) +  
# Draw points
    geom_point() +  
# And label them
    geom_text(aes(label = continent), nudge_y = 5e3)  
## Joining, by = "continent"
## Warning: Column `continent` joining factor and character vector, coercing
## into character vector


This lesson is adapted from the Software Carpentry: R for Reproducible Scientific Analysis Multi-Table Joins materials and Brandon Hurr’s dplyr II: Joins and Set Ops presentation to the Davis R UsersGroup on Februrary 2, 2016.