Learning objectives
- Focus on the third tidy data principle
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
- Be able to use
dplyr
’s join functions to merge tables
The third tidy data maxim states that each observation type gets its own table. The idea of multiple tables within a dataset will be familiar to anyone who has worked with a relational database but may seem foreign to those who have not.
The idea is this: Suppose we conduct a behavioral experiment that puts individuals in groups, and we measure both individual- and group-level variables. We should have a table for the individual-level variables and a separate table for the group-level variables. Then, should we need to merge them, we can do so using the join
functions of dplyr
.
The join functions are nicely illustrated in RStudio’s Data wrangling cheatsheet. Each function takes two data.frames and, optionally, the name(s) of columns on which to match. If no column names are provided, the functions match on all shared column names.
The different join functions control what happens to rows that exist in one table but not the other.
left_join
keeps all the entries that are present in the left (first) table and excludes any that are only in the right table.right_join
keeps all the entries that are present in the right table and excludes any that are only in the left table.inner_join
keeps only the entries that are present in both tables. inner_join
is the only function that guarantees you won’t generate any missing entries.full_join
keeps all of the entries in both tables, regardless of whether or not they appear in the other table.set.seed(12345)
x <- data.frame(key= LETTERS[c(1:3, 5)],
value1 = sample(1:10, 4),
stringsAsFactors = FALSE)
y <- data.frame(key= LETTERS[c(1:4)],
value2 = sample(1:10, 4),
stringsAsFactors = FALSE)
x
## key value1
## 1 A 8
## 2 B 10
## 3 C 7
## 4 E 9
y
## key value2
## 1 A 5
## 2 B 2
## 3 C 3
## 4 D 4
# What's in both x and y?
inner_join(x, y, by = "key")
## key value1 value2
## 1 A 8 5
## 2 B 10 2
## 3 C 7 3
# What's in X and bring with it the stuff that matches in Y
left_join(x, y, by = "key")
## key value1 value2
## 1 A 8 5
## 2 B 10 2
## 3 C 7 3
## 4 E 9 NA
# What's in Y and bring with it the stuff that matches in Y
right_join(x, y, by = "key")
## key value1 value2
## 1 A 8 5
## 2 B 10 2
## 3 C 7 3
## 4 D NA 4
# Give me everything!
full_join(x, y, by = "key")
## key value1 value2
## 1 A 8 5
## 2 B 10 2
## 3 C 7 3
## 4 E 9 NA
## 5 D NA 4
# Give me the stuff in X that is also in Y
semi_join(x, y, by = "key")
## key value1
## 1 A 8
## 2 B 10
## 3 C 7
# Give me the stuff in X that is not in Y
anti_join(x, y, by = "key")
## key value1
## 1 E 9
# Want everything that doesn't match?
full_join(anti_join(x,y, by = "key"), anti_join(y,x, by = "key"), by= "key")
## key value1 value2
## 1 E 9 NA
## 2 D NA 4
# keys with different names?
x <- data.frame(keyX = LETTERS[c(1:3, 5)],
value1 = sample(1:10, 4),
stringsAsFactors = FALSE)
y <- data.frame(keyY = LETTERS[c(1:4)],
value2 = sample(1:10, 4),
stringsAsFactors = FALSE)
x
## keyX value1
## 1 A 8
## 2 B 9
## 3 C 1
## 4 E 2
y
## keyY value2
## 1 A 8
## 2 B 1
## 3 C 4
## 4 D 10
full_join(x, y) #should error out
## Error: `by` required, because the data sources have no common variables
full_join(x, y, by=c("keyX" = "keyY"))
## keyX value1 value2
## 1 A 8 8
## 2 B 9 1
## 3 C 1 4
## 4 E 2 NA
## 5 D NA 10
df1 <- data_frame(x = LETTERS[1:2],
y = c(1L, 1L))
df2 <- data_frame(x = LETTERS[1:2],
y = 1:2)
df1
## # A tibble: 2 x 2
## x y
## <chr> <int>
## 1 A 1
## 2 B 1
df2
## # A tibble: 2 x 2
## x y
## <chr> <int>
## 1 A 1
## 2 B 2
# Which rows are common in both datasets?
dplyr::intersect(df1, df2)
## # A tibble: 1 x 2
## x y
## <chr> <int>
## 1 A 1
#Want all unique rows between both datasets?
dplyr::union(df1, df2)
## # A tibble: 3 x 2
## x y
## <chr> <int>
## 1 B 2
## 2 B 1
## 3 A 1
#What's unique to df1?
dplyr::setdiff(df1, df2)
## # A tibble: 1 x 2
## x y
## <chr> <int>
## 1 B 1
#What's unique to df2?
dplyr::setdiff(df2, df1)
## # A tibble: 1 x 2
## x y
## <chr> <int>
## 1 B 2
We will practice on our continents data.frame from module 2 and the gapminder data.frame. Note how these are tidy data: We have observations at the level of continent and at the level of country, so they go in different tables. The continent column in the gapminder data.frame allows us to link them now. If continents data.frame isn’t in your Environment, load it and recall what it consists of:
load('data/continents.RDA')
continents
## continent area_km2 population percent_total_pop
## 1 Africa 30370000 1022234000 15.0
## 2 Americas 42330000 934611000 14.0
## 3 Antarctica 13720000 4490 0.0
## 4 Asia 43820000 4164252000 60.0
## 5 Europe 10180000 738199000 11.0
## 6 Oceania 9008500 29127000 0.4
We can join the two data.frames using any of the dplyr
functions. We will pass the results to str
to avoid printing more than we can read, and to get more high-level information on the resulting data.frames.
left_join(gapminder, continents)
## Joining, by = "continent"
## Warning: Column `continent` joining factor and character vector, coercing
## into character vector
## # A tibble: 1,704 x 11
## country continent year lifeExp pop gdpPercap total_gdp
## <fctr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530 7585448670
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007 8758855797
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971 9648014150
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811 9678553274
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134 11697659231
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114 12598563401
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959 11820990309
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414 10595901589
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414 14121995875
## # ... with 1,694 more rows, and 4 more variables: log_gdp <dbl>,
## # area_km2 <dbl>, population <dbl>, percent_total_pop <dbl>
right_join(gapminder, continents)
## Joining, by = "continent"
## Warning: Column `continent` joining factor and character vector, coercing
## into character vector
## # A tibble: 1,705 x 11
## country continent year lifeExp pop gdpPercap total_gdp
## <fctr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 Algeria Africa 1952 43.077 9279525 2449.008 22725632678
## 2 Algeria Africa 1957 45.685 10270856 3013.976 30956113720
## 3 Algeria Africa 1962 48.303 11000948 2550.817 28061403854
## 4 Algeria Africa 1967 51.407 12760499 3246.992 41433235247
## 5 Algeria Africa 1972 54.518 14760787 4182.664 61739408943
## 6 Algeria Africa 1977 58.014 17152804 4910.417 84227416174
## 7 Algeria Africa 1982 61.368 20033753 5745.160 115097120653
## 8 Algeria Africa 1987 65.799 23254956 5681.359 132119742845
## 9 Algeria Africa 1992 67.744 26298373 5023.217 132102425043
## 10 Algeria Africa 1997 69.152 29072015 4797.295 139467033682
## # ... with 1,695 more rows, and 4 more variables: log_gdp <dbl>,
## # area_km2 <dbl>, population <dbl>, percent_total_pop <dbl>
These operations produce slightly different results, either 1704 or 1705 observations. Can you figure out why? Antarctica contains no countries so doesn’t appear in the gapminder data.frame. When we use left_join
it gets filtered from the results, but when we use right_join
it appears, with missing values for all of the country-level variables:
right_join(gapminder, continents) %>%
filter(continent == "Antarctica")
## Joining, by = "continent"
## Warning: Column `continent` joining factor and character vector, coercing
## into character vector
## # A tibble: 1 x 11
## country continent year lifeExp pop gdpPercap total_gdp log_gdp
## <fctr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 NA Antarctica NA NA NA NA NA NA
## # ... with 3 more variables: area_km2 <dbl>, population <dbl>,
## # percent_total_pop <dbl>
There’s another problem in this data.frame – it has two population measures, one by continent and one by country and it’s not clear which is which! Let’s rename a couple of columns.
right_join(gapminder, continents) %>%
rename(country_pop = pop, continent_pop = population)
## Joining, by = "continent"
## Warning: Column `continent` joining factor and character vector, coercing
## into character vector
## # A tibble: 1,705 x 11
## country continent year lifeExp country_pop gdpPercap total_gdp
## <fctr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 Algeria Africa 1952 43.077 9279525 2449.008 22725632678
## 2 Algeria Africa 1957 45.685 10270856 3013.976 30956113720
## 3 Algeria Africa 1962 48.303 11000948 2550.817 28061403854
## 4 Algeria Africa 1967 51.407 12760499 3246.992 41433235247
## 5 Algeria Africa 1972 54.518 14760787 4182.664 61739408943
## 6 Algeria Africa 1977 58.014 17152804 4910.417 84227416174
## 7 Algeria Africa 1982 61.368 20033753 5745.160 115097120653
## 8 Algeria Africa 1987 65.799 23254956 5681.359 132119742845
## 9 Algeria Africa 1992 67.744 26298373 5023.217 132102425043
## 10 Algeria Africa 1997 69.152 29072015 4797.295 139467033682
## # ... with 1,695 more rows, and 4 more variables: log_gdp <dbl>,
## # area_km2 <dbl>, continent_pop <dbl>, percent_total_pop <dbl>
Challenge – Putting the pieces together
A colleague suggests that the more land area an individual has, the greater their gdp will be and that this relationship will be observable at any scale of observation. You chuckle and mutter “Not at the continental scale,” but your colleague insists. Test your colleague’s hypothesis by:
- Calculating the total GDP of each continent,
- Hint: Use
dplyr
’sgroup_by
andsummarize
- Joining the resulting data.frame to the
continents
data.frame,- Calculating the per-capita GDP for each continent, and
- Plotting per-capita gdp versus population density.
Solution to Challenge – Putting the pieces together
library(ggplot2) gapminder %>% mutate(GDP = gdpPercap * pop) %>% # Calculate country-level GDP group_by(continent) %>% # Group by continent summarize(cont_gdp = sum(GDP)) %>% # Calculate continent-level GDP # Join the continent-GDP data.frame to the continents data.frame left_join(continents) %>% # Calculate continent-level per-capita GDP mutate(per_cap = cont_gdp / population) %>% # Plot gdp versus land area ggplot(aes(x = area_km2, y = per_cap)) + # Draw points geom_point() + # And label them geom_text(aes(label = continent), nudge_y = 5e3)
## Joining, by = "continent"
## Warning: Column `continent` joining factor and character vector, coercing ## into character vector
This lesson is adapted from the Software Carpentry: R for Reproducible Scientific Analysis Multi-Table Joins materials and Brandon Hurr’s dplyr II: Joins and Set Ops presentation to the Davis R UsersGroup on Februrary 2, 2016.