Learning objectives
- Understand the 6 main data types in R
- Be able to use the six major dplyr verbs (filter, select, arrange, mutate, group_by, summarize)
- Be able to use and understand the advantages of the magrittr pipe: %>%
dplyr is not part of “base R”; rather, it is a package: a library of functions that an R user wrote. This extensibility is part of the beauty of R. As of December 2016, there are 9,600 such packages in the official Comprehensive R Archive Network, better known as CRAN.
dplyr is one of the most popular packages for R. It is part of a suite of R tools that make up “The Tidyverse”. Its author conveniently bundled these tools together in a super-package called tidyverse. To use the tidyverse tools, you first need to download them to your machine (once) and then load them (in each R session in which you want to use them). You can download a package via the RStudio menu bar Tools -> Install Packages…, or with a line of code:
install.packages('tidyverse')
You only have to download the code once. But whenever you want to use a package, you have to load it in your R session. For that, use the library function:
library(tidyverse)
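If you are not sure whether a package is already on your machine, you can check before installing. The following is a minimal sketch rather than part of the original lesson; `pkgs`, `installed` and `missing_pkgs` are names made up for illustration, and the install line is left commented out because it needs an internet connection.

```r
# Packages this lesson uses
pkgs <- c("tidyverse", "gapminder")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be downloaded (possibly none)
missing_pkgs <- pkgs[!installed]
missing_pkgs

# Install only what is missing (uncomment to run)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

Installing is still a one-time step; loading with library() is still needed in every session.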
Challenge – Install and load tidyverse
- Install the tidyverse & gapminder packages, either with install.packages(c('tidyverse', 'gapminder')) or via the menu bar: Tools -> Install Packages…
- Load tidyverse with library(tidyverse)
- Load gapminder with library(gapminder)
- You will see some warnings about conflicts. That’s okay.
There are six main types of data in R. We’ve already covered 2–3 of them. Can anyone help me list them?
A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function. For example, we can create a vector of animal weights and assign it to a new object weight_g:
weight_g <- c(50, 60, 65, 82)
weight_g
## [1] 50 60 65 82
A vector can also contain characters:
animals <- c("mouse", "rat", "dog")
animals
## [1] "mouse" "rat" "dog"
The quotes around “mouse”, “rat”, etc. are essential here. Without the quotes R will assume there are objects called mouse, rat and dog. As these objects don’t exist in R’s memory, there will be an error message.
There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:
length(weight_g)
## [1] 4
length(animals)
## [1] 3
An important feature of a vector is that all of its elements are the same type of data. The function class() indicates the class (the type of element) of an object:
class(weight_g)
## [1] "numeric"
class(animals)
## [1] "character"
The function str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:
str(weight_g)
## num [1:4] 50 60 65 82
str(animals)
## chr [1:3] "mouse" "rat" "dog"
You can also use the c() function to add other elements to your vector:
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
## [1] 30 50 60 65 82 90
In the first line, we take the original vector weight_g, add the value 90 to the end of it, and save the result back into weight_g. Then we add the value 30 to the beginning, again saving the result back into weight_g.
We can do this over and over again to grow a vector or assemble a dataset. As we program, this can be a useful way to add results that we are collecting or calculating.
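As a small illustration of collecting results this way, the sketch below grows a vector inside a loop. The name `squares` and the calculation are made up for this example.

```r
squares <- c()                 # start with an empty vector
for (i in 1:5) {
  squares <- c(squares, i^2)   # append one new result per iteration
}
squares          # 1 4 9 16 25
length(squares)  # 5
```

This pattern is convenient for short vectors; each call to c() makes a new copy, so it can become slow when the vector gets very long.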
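We just saw 2 of the 6 main atomic vector types (or data types) that R uses: “character” and “numeric”. These are the basic building blocks that all R objects are built from. The other 4 are:
- “logical” for TRUE and FALSE (the boolean data type)
- “integer” for integer numbers (e.g., 2L; the L tells R that it is an integer)
- “complex” to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about them
- “raw”, which we won’t discuss further
Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), factors (factor) and arrays (array).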
We’ve seen that atomic vectors can be of type character, numeric, integer, and logical. But what happens if we try to mix these types in a single vector? What will happen in each of these examples? (hint: use class() to check the data type of your objects):
num_char <- c(1, 2, 3, 'a')
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c('a', 'b', 'c', TRUE)
tricky <- c(1, 2, 3, '4')
- Why do you think it happens?
- You’ve probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?
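Once you have drawn your diagram, one way to check it is to combine two types at a time and ask R what came out. This is only a sketch with made-up values (`mixed` is a hypothetical name), not the official answer to the challenge.

```r
class(c(TRUE, 2L))     # logical + integer   -> "integer"
class(c(2L, 3.5))      # integer + numeric   -> "numeric"
class(c(3.5, "a"))     # numeric + character -> "character"

# Mixing everything: each element is converted to character
mixed <- c(TRUE, 2L, 3.5, "a")
mixed                  # "TRUE" "2" "3.5" "a"
```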
When we look at a data set with str(), we can see that columns consist of integers, characters, etc. However, sometimes the columns are of a special class called a factor. Factors are very useful and are part of what makes R particularly well suited to working with data, so we’re going to spend a little time introducing them.
Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.
Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
sex <- factor(c("male", "female", "female", "male"))
R will assign 1 to the level “female” and 2 to the level “male” (because f comes before m, even though the first element in this vector is “male”). You can check this by using the function levels(), and check the number of levels using nlevels():
levels(sex)
## [1] "female" "male"
nlevels(sex)
## [1] 2
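To see the integers “under the hood”, you can ask R for the storage type and the integer codes directly. This is a minimal sketch that reuses the sex factor from above.

```r
sex <- factor(c("male", "female", "female", "male"))

typeof(sex)                    # "integer": how the factor is stored
as.integer(sex)                # 2 1 1 2: each code is a position in levels(sex)
levels(sex)[as.integer(sex)]   # "male" "female" "female" "male"
```

The last line shows how the labels are recovered: each integer code is used to look up the matching level.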
Sometimes the order of the levels does not matter; other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the sex vector would be:
sex # current order
## [1] male female female male
## Levels: female male
sex <- factor(sex, levels = c("male", "female"))
sex # after re-ordering
## [1] male female female male
## Levels: male female
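When the order of the levels is meaningful, as in the “low”, “medium”, “high” case mentioned above, you can also tell R that the factor is ordered. The sketch below uses a made-up vector called `size` to show what that allows.

```r
size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)

levels(size)              # "low" "medium" "high"
as.integer(size)          # 1 3 2 1
size < "high"             # TRUE FALSE TRUE TRUE
as.character(max(size))   # "high"
```

Comparisons such as < and functions such as max() only make sense for ordered factors; on an unordered factor R warns and returns NA.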
In R’s memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self-describing: “female”, “male” is more descriptive than 1, 2. Which one is “male”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the species names in our ecology example dataset).
If you need to convert a factor to a character vector, you use as.character(x).
as.character(sex)
## [1] "male" "female" "female" "male"
Converting factors where the levels appear as numbers (such as concentration levels, or years) to a numeric vector is a little trickier. One method is to convert factors to characters and then numbers. Another method is to use the levels() function. Compare:
f <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(f) # wrong! and there is no warning...
## [1] 3 2 1 4 3
as.numeric(as.character(f)) # works...
## [1] 1990 1983 1977 1998 1990
as.numeric(levels(f))[f] # The recommended way.
## [1] 1990 1983 1977 1998 1990
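To see what the last expression is doing, it can help to run its pieces one at a time. This is a sketch using the same f as above; `lvl_num` is a name introduced only for this illustration.

```r
f <- factor(c(1990, 1983, 1977, 1998, 1990))

levels(f)                         # "1977" "1983" "1990" "1998" (character, sorted)
lvl_num <- as.numeric(levels(f))  # 1977 1983 1990 1998 (now numeric)
as.integer(f)                     # 3 2 1 4 3 (the codes stored inside f)
lvl_num[f]                        # 1990 1983 1977 1998 1990
```

Indexing with f uses its integer codes, so each element picks out the numeric value of its own level.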
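Notice that in the levels() approach, three important steps occur:
- We obtain all the factor levels using levels(f)
- We convert these levels to numeric values using as.numeric(levels(f))
- We then access these numeric values using the underlying integers of the vector f inside the square brackets
stringsAsFactors=FALSE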
By default in R versions before 4.0.0, when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (= converted) into the factor data type. Depending on what you want to do with the data, you may want to keep these columns as character. To do so, read.csv() and read.table() have an argument called stringsAsFactors which can be set to FALSE.
In most cases, it’s preferable to set stringsAsFactors = FALSE when importing your data, and to convert to a factor only the columns that require this data type.
Compare the output of str(surveys) when setting stringsAsFactors = TRUE (the default in R versions before 4.0.0) and stringsAsFactors = FALSE:
We are going to use the R function download.file() to download the CSV file that contains the survey data from figshare, and we will use read.csv() to load into memory the content of the CSV file as an object of class data.frame.
To download the data into the data/ subdirectory, run the following:
download.file("https://ndownloader.figshare.com/files/2292169",
"data/combined.csv")
You are now ready to load the data:
surveys <- read.csv('data/combined.csv')
This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can print the first 6 lines of this data using head(surveys).
Now we can look at how reading in the data in different ways affects the different data types (factor vs. character):
## Compare the difference between when the data are being read as
## `factor`, and when they are being read as `character`.
surveys <- read.csv("data/combined.csv", stringsAsFactors = TRUE)
str(surveys)
surveys <- read.csv("data/combined.csv", stringsAsFactors = FALSE)
str(surveys)
## Convert the column "plot_type" into a factor
surveys$plot_type <- factor(surveys$plot_type)
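If you cannot download the survey file, the same comparison can be tried on a few lines of inline text. This is a sketch with made-up data: `csv_text`, its `species` and `weight` columns, and the two result names are all hypothetical.

```r
# A tiny CSV held in a string instead of a file
csv_text <- "species,weight\nDM,40\nDO,48\nDM,37"

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$species)   # "factor"
class(as_char$species)     # "character"
class(as_char$weight)      # "integer": numeric columns are not affected
```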
We have seen how data frames are created when using read.csv(), but they can also be created by hand with the data.frame() function. There are a few mistakes in this hand-crafted data.frame; can you spot and fix them? Don’t hesitate to experiment!
animal_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
feel=c("furry", "squishy", "spiny"),
weight=c(45, 8 1.1, 0.8))
- Can you predict the class for each of the columns in the following example? Check your guesses using str(country_climate):
- Are they what you expected? Why? Why not?
- What would have been different if we had added stringsAsFactors = FALSE to this call?
- What would you need to change to ensure that each column had the accurate data type?
country_climate <- data.frame(
country=c("Canada", "Panama", "South Africa", "Australia"),
climate=c("cold", "hot", "temperate", "hot/temperate"),
temperature=c(10, 30, 18, "15"),
northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo=c(FALSE, FALSE, FALSE, 1)
)
The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import into R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (a letter in a column that should only contain numbers, for instance).
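As an illustration of using coercion to find data-entry mistakes, the sketch below looks for values that refuse to convert to numbers. The `weights` vector is made up, with a capital letter O typed in place of a zero.

```r
# A column that should be numeric but was read as character
weights <- c("45", "8", "1.1", "O.8")

# Values that cannot be converted become NA (the warning is silenced here)
weights_num <- suppressWarnings(as.numeric(weights))

# The entries that failed to convert are the likely typos
bad <- weights[is.na(weights_num)]
bad   # "O.8"
```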
dplyr
It is an often bemoaned fact that a data scientist spends much, and often most, of her time wrangling data: getting it organized and clean. In this lesson we will learn an efficient set of tools that can handle the vast majority of data management tasks.
Enter dplyr, a package for making data manipulation easier. More on dplyr later. dplyr is part of tidyverse, so it is already installed on your machine. You can load it individually, or with the other tidyverse packages like this:
library(tidyverse)
library(gapminder)
Those messages and conflicts are normal. The conflicts are R telling you that there are two packages with functions named “filter” and “lag”. When R gives you red text, it’s not always a bad thing, but it does mean you should pay attention and try to understand what it’s trying to tell you.
Remember that you only have to install each package once (per computer), but you have to load them for each R session in which you want to use them.
You also have to load any data you want to use each time you start a new R session. So, if it’s not already loaded, read in the gapminder data. We’re going to use tidyverse’s read_csv instead of base R’s read.csv here. It has a few nice features; the most obvious is that it makes a special kind of data.frame that only prints the first ten rows instead of all 1704.
# gapminder <- read_csv('data/gapminder-FiveYearData.csv')
class(gapminder)
## [1] "tbl_df" "tbl" "data.frame"
head(gapminder) # look at first few rows
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
str(gapminder) # look at data structure
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
You can always convert a data.frame into this special kind of data.frame like this:
gapminder <- tbl_df(gapminder)
The package dplyr is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with data frames. The thinking behind it was largely inspired by the package plyr, which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.
This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many hundreds of GB, conduct queries on it directly, and pull back just what you need for analysis in R.
dplyr
There are five actions we often want to apply to a tabular dataset:
- Filter rows
- Select columns
- Arrange rows
- Make new columns
- Summarize groups
We are about to see how to do each of those things using the dplyr package. Everything we’re going to learn to do can also be done using “base R”, but dplyr makes it easier, its syntax is consistent, and it often makes the computations faster.
filter()
Suppose we want to see just the gapminder data for the USA. First, we need to know how “USA” is written in the dataset: Is it USA or United States or what? We can see all the unique values of a variable with the unique function.
unique(gapminder$country)
## [1] Afghanistan Albania
## [3] Algeria Angola
## [5] Argentina Australia
## [7] Austria Bahrain
## [9] Bangladesh Belgium
## [11] Benin Bolivia
## [13] Bosnia and Herzegovina Botswana
## [15] Brazil Bulgaria
## [17] Burkina Faso Burundi
## [19] Cambodia Cameroon
## [21] Canada Central African Republic
## [23] Chad Chile
## [25] China Colombia
## [27] Comoros Congo, Dem. Rep.
## [29] Congo, Rep. Costa Rica
## [31] Cote d'Ivoire Croatia
## [33] Cuba Czech Republic
## [35] Denmark Djibouti
## [37] Dominican Republic Ecuador
## [39] Egypt El Salvador
## [41] Equatorial Guinea Eritrea
## [43] Ethiopia Finland
## [45] France Gabon
## [47] Gambia Germany
## [49] Ghana Greece
## [51] Guatemala Guinea
## [53] Guinea-Bissau Haiti
## [55] Honduras Hong Kong, China
## [57] Hungary Iceland
## [59] India Indonesia
## [61] Iran Iraq
## [63] Ireland Israel
## [65] Italy Jamaica
## [67] Japan Jordan
## [69] Kenya Korea, Dem. Rep.
## [71] Korea, Rep. Kuwait
## [73] Lebanon Lesotho
## [75] Liberia Libya
## [77] Madagascar Malawi
## [79] Malaysia Mali
## [81] Mauritania Mauritius
## [83] Mexico Mongolia
## [85] Montenegro Morocco
## [87] Mozambique Myanmar
## [89] Namibia Nepal
## [91] Netherlands New Zealand
## [93] Nicaragua Niger
## [95] Nigeria Norway
## [97] Oman Pakistan
## [99] Panama Paraguay
## [101] Peru Philippines
## [103] Poland Portugal
## [105] Puerto Rico Reunion
## [107] Romania Rwanda
## [109] Sao Tome and Principe Saudi Arabia
## [111] Senegal Serbia
## [113] Sierra Leone Singapore
## [115] Slovak Republic Slovenia
## [117] Somalia South Africa
## [119] Spain Sri Lanka
## [121] Sudan Swaziland
## [123] Sweden Switzerland
## [125] Syria Taiwan
## [127] Tanzania Thailand
## [129] Togo Trinidad and Tobago
## [131] Tunisia Turkey
## [133] Uganda United Kingdom
## [135] United States Uruguay
## [137] Venezuela Vietnam
## [139] West Bank and Gaza Yemen, Rep.
## [141] Zambia Zimbabwe
## 142 Levels: Afghanistan Albania Algeria Angola Argentina ... Zimbabwe
Okay, now we want to see just the rows of the data.frame where country is “United States”. The syntax for all dplyr functions is the same: The first argument is the data.frame, the rest of the arguments are whatever you want to do in that data.frame.
filter(gapminder, country == "United States")
## # A tibble: 12 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 United States Americas 1952 68.440 157553000 13990.48
## 2 United States Americas 1957 69.490 171984000 14847.13
## 3 United States Americas 1962 70.210 186538000 16173.15
## 4 United States Americas 1967 70.760 198712000 19530.37
## 5 United States Americas 1972 71.340 209896000 21806.04
## 6 United States Americas 1977 73.380 220239000 24072.63
## 7 United States Americas 1982 74.650 232187835 25009.56
## 8 United States Americas 1987 75.020 242803533 29884.35
## 9 United States Americas 1992 76.090 256894189 32003.93
## 10 United States Americas 1997 76.810 272911760 35767.43
## 11 United States Americas 2002 77.310 287675526 39097.10
## 12 United States Americas 2007 78.242 301139947 42951.65
We can also apply multiple conditions, e.g. the US after 2000:
filter(gapminder, country == "United States" & year > 2000)
## # A tibble: 2 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 United States Americas 2002 77.310 287675526 39097.10
## 2 United States Americas 2007 78.242 301139947 42951.65
We can also use “or” conditions with the vertical bar: |. Notice that the variable (column) names don’t go in quotes, but values of character variables do.
filter(gapminder, country == "United States" | country == "Mexico")
## # A tibble: 24 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Mexico Americas 1952 50.789 30144317 3478.126
## 2 Mexico Americas 1957 55.190 35015548 4131.547
## 3 Mexico Americas 1962 58.299 41121485 4581.609
## 4 Mexico Americas 1967 60.110 47995559 5754.734
## 5 Mexico Americas 1972 62.361 55984294 6809.407
## 6 Mexico Americas 1977 65.032 63759976 7674.929
## 7 Mexico Americas 1982 67.405 71640904 9611.148
## 8 Mexico Americas 1987 69.498 80122492 8688.156
## 9 Mexico Americas 1992 71.455 88111030 9472.384
## 10 Mexico Americas 1997 73.670 95895146 9767.298
## # ... with 14 more rows
A handy reference list for the operators (and, or, etc.) is available in R’s built-in help pages: see ?Logic and ?Comparison.
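When an “or” condition tests the same column against several values, %in% is a common shorthand. The sketch below uses a small made-up data.frame called `toy` rather than gapminder so that it runs on its own; it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical miniature data set
toy <- data.frame(
  country = c("Mexico", "Canada", "United States", "Mexico"),
  year    = c(2002, 2002, 2007, 2007)
)

# Two ways to keep rows for either country
either <- filter(toy, country == "United States" | country == "Mexico")
same   <- filter(toy, country %in% c("United States", "Mexico"))

# Conditions separated by commas are combined with "and"
recent_mexico <- filter(toy, country == "Mexico", year > 2005)
```

Here `either` and `same` hold the same three rows, and `recent_mexico` holds the single Mexico row from 2007.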
select()
filter returned a subset of the data.frame’s rows. select returns a subset of the data.frame’s columns.
Suppose we only want to see country and life expectancy.
select(gapminder, country, lifeExp)
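We can also choose which columns we don’t want by putting a minus sign in front of them (here we rename gdpPercap to income at the same time):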
select(gapminder, -continent, income = gdpPercap)
## # A tibble: 1,704 x 5
## country year lifeExp pop income
## <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan 1952 28.801 8425333 779.4453
## 2 Afghanistan 1957 30.332 9240934 820.8530
## 3 Afghanistan 1962 31.997 10267083 853.1007
## 4 Afghanistan 1967 34.020 11537966 836.1971
## 5 Afghanistan 1972 36.088 13079460 739.9811
## 6 Afghanistan 1977 38.438 14880372 786.1134
## 7 Afghanistan 1982 39.854 12881816 978.0114
## 8 Afghanistan 1987 40.822 13867957 852.3959
## 9 Afghanistan 1992 41.674 16317921 649.3414
## 10 Afghanistan 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
And we can rename columns as we select them:
select(gapminder, ThePlace = country, HowLongTheyLive = lifeExp)
## # A tibble: 1,704 x 2
## ThePlace HowLongTheyLive
## <fctr> <dbl>
## 1 Afghanistan 28.801
## 2 Afghanistan 30.332
## 3 Afghanistan 31.997
## 4 Afghanistan 34.020
## 5 Afghanistan 36.088
## 6 Afghanistan 38.438
## 7 Afghanistan 39.854
## 8 Afghanistan 40.822
## 9 Afghanistan 41.674
## 10 Afghanistan 41.763
## # ... with 1,694 more rows
As usual, R isn’t saving any of these outputs; just printing them to the screen. If we want to keep them around, we need to assign them to a variable.
justUS = filter(gapminder, country == "United States")
USdata = select(justUS, -country, -continent)
USdata
## # A tibble: 12 x 4
## year lifeExp pop gdpPercap
## <int> <dbl> <int> <dbl>
## 1 1952 68.440 157553000 13990.48
## 2 1957 69.490 171984000 14847.13
## 3 1962 70.210 186538000 16173.15
## 4 1967 70.760 198712000 19530.37
## 5 1972 71.340 209896000 21806.04
## 6 1977 73.380 220239000 24072.63
## 7 1982 74.650 232187835 25009.56
## 8 1987 75.020 242803533 29884.35
## 9 1992 76.090 256894189 32003.93
## 10 1997 76.810 272911760 35767.43
## 11 2002 77.310 287675526 39097.10
## 12 2007 78.242 301139947 42951.65
Subsetting
- Subset the gapminder data to only Oceania countries post-1980.
- Remove the continent column
- Make a scatter plot of gdpPercap vs. population colored by country
Advanced How would you determine the median population for the North American countries between 1970 and 1980?
Bonus This can be done using base R’s subsetting, but this class doesn’t teach how. Do the original challenge without the filter and select functions. Feel free to consult Google, helpfiles, etc. to figure out how.
arrange()
You can order the rows of a data.frame by a variable using arrange. Suppose we want to see the most populous countries:
arrange(gapminder, pop)
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Sao Tome and Principe Africa 1952 46.471 60011 879.5836
## 2 Sao Tome and Principe Africa 1957 48.945 61325 860.7369
## 3 Djibouti Africa 1952 34.812 63149 2669.5295
## 4 Sao Tome and Principe Africa 1962 51.893 65345 1071.5511
## 5 Sao Tome and Principe Africa 1967 54.425 70787 1384.8406
## 6 Djibouti Africa 1957 37.328 71851 2864.9691
## 7 Sao Tome and Principe Africa 1972 56.480 76595 1532.9853
## 8 Sao Tome and Principe Africa 1977 58.550 86796 1737.5617
## 9 Djibouti Africa 1962 39.693 89898 3020.9893
## 10 Sao Tome and Principe Africa 1982 60.351 98593 1890.2181
## # ... with 1,694 more rows
Hmm, we didn’t get the most populous countries. By default, arrange sorts the variable in increasing order. We could see the most populous countries by examining the tail of the last command, or we can sort the data.frame by descending population by wrapping the variable in desc():
arrange(gapminder, desc(pop))
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 China Asia 2007 72.96100 1318683096 4959.1149
## 2 China Asia 2002 72.02800 1280400000 3119.2809
## 3 China Asia 1997 70.42600 1230075000 2289.2341
## 4 China Asia 1992 68.69000 1164970000 1655.7842
## 5 India Asia 2007 64.69800 1110396331 2452.2104
## 6 China Asia 1987 67.27400 1084035000 1378.9040
## 7 India Asia 2002 62.87900 1034172547 1746.7695
## 8 China Asia 1982 65.52500 1000281000 962.4214
## 9 India Asia 1997 61.76500 959000000 1458.8174
## 10 China Asia 1977 63.96736 943455000 741.2375
## # ... with 1,694 more rows
arrange can also sort by multiple variables. It will sort the data.frame by the first variable, and if there are any ties in that variable, they will be sorted by the next variable, and so on. Here we sort from newest to oldest, and within year from richest to poorest:
arrange(gapminder, desc(year), desc(gdpPercap))
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Norway Europe 2007 80.196 4627926 49357.19
## 2 Kuwait Asia 2007 77.588 2505559 47306.99
## 3 Singapore Asia 2007 79.972 4553009 47143.18
## 4 United States Americas 2007 78.242 301139947 42951.65
## 5 Ireland Europe 2007 78.885 4109086 40676.00
## 6 Hong Kong, China Asia 2007 82.208 6980412 39724.98
## 7 Switzerland Europe 2007 81.701 7554661 37506.42
## 8 Netherlands Europe 2007 79.762 16570613 36797.93
## 9 Canada Americas 2007 80.653 33390141 36319.24
## 10 Iceland Europe 2007 81.757 301931 36180.79
## # ... with 1,694 more rows
Shoutout Q: Would we get the same output if we switched the order of desc(year) and desc(gdpPercap) in the last line?
mutate()
We have learned how to drop rows, drop columns, and rearrange rows. To make a new column we use the mutate function. As usual, the first argument is a data.frame. The second argument is the name of the new column you want to create, followed by an equal sign, followed by what to put in that column. You can reference other variables in the data.frame, and mutate will treat each row independently. E.g. we can calculate the total GDP of each country in each year by multiplying the per-capita GDP by the population.
mutate(gapminder, total_gdp = gdpPercap * pop)
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap total_gdp
## <fctr> <fctr> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530 7585448670
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007 8758855797
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971 9648014150
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811 9678553274
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134 11697659231
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114 12598563401
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959 11820990309
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414 10595901589
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414 14121995875
## # ... with 1,694 more rows
Shoutout Q: How would we view the highest-total-gdp countries?
Note that this didn’t change gapminder: We didn’t assign the output to anything, so it was just printed, with the new column. If we want to modify our gapminder data.frame, we can assign the output of mutate back to the gapminder variable, but be careful doing this: if you make a mistake, you can’t just re-run that line of code; you’ll need to go back to loading the gapminder data.frame.
Also, you can create multiple columns in one call to mutate, even using variables that you just created, separating them with commas:
gapminder = mutate(gapminder,
total_gdp = gdpPercap * pop,
log_gdp = log10(total_gdp))
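To see these points on something small enough to check by hand, here is a sketch using a made-up two-row data.frame (`toy` and `toy2` are hypothetical names); it assumes dplyr is installed. Assigning the result to a new name leaves the original untouched, which sidesteps the re-loading problem described above.

```r
library(dplyr)

# Hypothetical miniature data set
toy <- data.frame(gdpPercap = c(1000, 2000), pop = c(10, 100))

# Two new columns in one call; log_gdp uses the total_gdp created just before it
toy2 <- mutate(toy,
               total_gdp = gdpPercap * pop,
               log_gdp   = log10(total_gdp))

toy2$total_gdp   # 10000 200000 (computed row by row)
names(toy)       # "gdpPercap" "pop": the original is unchanged
```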
MCQ: Data Reduction
Produce a data.frame with only the names, years, and per-capita GDP of countries where per-capita GDP is less than a dollar a day, sorted from most- to least-recent.
- Tip: The gdpPercap variable is annual gdp. You’ll need to adjust.
- Tip: For complex tasks, it often helps to use pencil and paper to write/draw/map the various steps needed and how they fit together before writing any code.
What is the annual per-capita gdp, rounded to the nearest dollar, of the first row in the data.frame?
- $278
- $312
- $331
- $339
Advanced: Use dplyr functions and ggplot to plot per-capita GDP versus population for North American countries after 1970.
- Once you’ve made the graph, transform both axes to a log10 scale. There are two ways to do this, one by creating new columns in the data frame, and another using functions provided by ggplot to transform the axes. Implement both, in that order. Which do you prefer and why?
Suppose we want to look at all the countries where life expectancy is greater than 80 years, sorted from poorest to richest. First, we filter, then we arrange. We could assign the intermediate data.frame to a variable:
lifeExpGreater80 = filter(gapminder, lifeExp > 80)
(lifeExpGreater80sorted = arrange(lifeExpGreater80, gdpPercap))
## # A tibble: 21 x 8
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 New Zealand Oceania 2007 80.204 4115771 25185.01
## 2 Israel Asia 2007 80.745 6426679 25523.28
## 3 Italy Europe 2002 80.240 57926999 27968.10
## 4 Italy Europe 2007 80.546 58147733 28569.72
## 5 Japan Asia 2002 82.000 127065841 28604.59
## 6 Japan Asia 1997 80.690 125956499 28816.58
## 7 Spain Europe 2007 80.941 40448191 28821.06
## 8 Sweden Europe 2002 80.040 8954175 29341.63
## 9 Hong Kong, China Asia 2002 81.495 6762476 30209.02
## 10 France Europe 2007 80.657 61083916 30470.02
## # ... with 11 more rows, and 2 more variables: total_gdp <dbl>,
## # log_gdp <dbl>
In this case it doesn’t much matter, but we make a whole new data.frame (lifeExpGreater80) and only use it once; that’s a little wasteful of system resources, and it clutters our environment. If the data are large, that can be a big problem.
Or, we could nest each function so that it appears on one line:
arrange(filter(gapminder, lifeExp > 80), gdpPercap)
## # A tibble: 21 x 8
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 New Zealand Oceania 2007 80.204 4115771 25185.01
## 2 Israel Asia 2007 80.745 6426679 25523.28
## 3 Italy Europe 2002 80.240 57926999 27968.10
## 4 Italy Europe 2007 80.546 58147733 28569.72
## 5 Japan Asia 2002 82.000 127065841 28604.59
## 6 Japan Asia 1997 80.690 125956499 28816.58
## 7 Spain Europe 2007 80.941 40448191 28821.06
## 8 Sweden Europe 2002 80.040 8954175 29341.63
## 9 Hong Kong, China Asia 2002 81.495 6762476 30209.02
## 10 France Europe 2007 80.657 61083916 30470.02
## # ... with 11 more rows, and 2 more variables: total_gdp <dbl>,
## # log_gdp <dbl>
This would become difficult to read if we are performing a number of operations that would require a repeated nesting. But…
There is a better way, and it makes both writing and reading the code easier. The pipe from the magrittr package (which is automatically installed and loaded with dplyr and tidyverse) takes the output of the first line and plugs it in as the first argument of the next line. Since many tidyverse functions expect a data.frame as the first argument and output a data.frame, this works fluidly.
filter(gapminder, lifeExp > 80) %>%
arrange(gdpPercap)
## # A tibble: 21 x 8
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 New Zealand Oceania 2007 80.204 4115771 25185.01
## 2 Israel Asia 2007 80.745 6426679 25523.28
## 3 Italy Europe 2002 80.240 57926999 27968.10
## 4 Italy Europe 2007 80.546 58147733 28569.72
## 5 Japan Asia 2002 82.000 127065841 28604.59
## 6 Japan Asia 1997 80.690 125956499 28816.58
## 7 Spain Europe 2007 80.941 40448191 28821.06
## 8 Sweden Europe 2002 80.040 8954175 29341.63
## 9 Hong Kong, China Asia 2002 81.495 6762476 30209.02
## 10 France Europe 2007 80.657 61083916 30470.02
## # ... with 11 more rows, and 2 more variables: total_gdp <dbl>,
## # log_gdp <dbl>
To demonstrate how it works, here are some examples where it’s unnecessary.
4 %>% sqrt()
## [1] 2
2 ^ 2 %>% sum(1)
## [1] 5
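Put another way, `x %>% f(y)` is the same as `f(x, y)`. The sketch below checks that equivalence on a few made-up values; it loads magrittr directly, which also happens automatically when you load dplyr or tidyverse.

```r
library(magrittr)

# The piped value becomes the first argument of the next function
identical(4 %>% sqrt(), sqrt(4))                       # TRUE
identical(c(1, 4, 9) %>% sum(1), sum(c(1, 4, 9), 1))   # TRUE

# Pipes can be chained: this is sqrt(sqrt(16))
twice <- 16 %>% sqrt() %>% sqrt()
twice   # 2
```

Note that in `2 ^ 2 %>% sum(1)` above, the exponent is evaluated before the pipe, so it is 4 that gets passed to sum().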
Whatever goes through the pipe becomes the first argument of the function after the pipe. This is convenient, because the main dplyr functions produce a data.frame as their output and take a data.frame as the first argument. Since R lets an expression continue onto the next line when a line ends with an operator such as %>%, we can put each function on a new line, which RStudio will automatically indent, making everything easy to read. Now each line represents a step in a sequential operation. You can read this as “Take the gapminder data.frame, filter to the rows where lifeExp is greater than 80, and arrange by gdpPercap.”
gapminder %>%
filter(lifeExp > 80) %>%
arrange(gdpPercap)
## # A tibble: 21 x 8
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 New Zealand Oceania 2007 80.204 4115771 25185.01
## 2 Israel Asia 2007 80.745 6426679 25523.28
## 3 Italy Europe 2002 80.240 57926999 27968.10
## 4 Italy Europe 2007 80.546 58147733 28569.72
## 5 Japan Asia 2002 82.000 127065841 28604.59
## 6 Japan Asia 1997 80.690 125956499 28816.58
## 7 Spain Europe 2007 80.941 40448191 28821.06
## 8 Sweden Europe 2002 80.040 8954175 29341.63
## 9 Hong Kong, China Asia 2002 81.495 6762476 30209.02
## 10 France Europe 2007 80.657 61083916 30470.02
## # ... with 11 more rows, and 2 more variables: total_gdp <dbl>,
## # log_gdp <dbl>
Making your code easier for humans to read will save you lots of time. The human reading it is usually future-you, and operations that seem simple when you’re writing them will look like gibberish when you’re three weeks removed from them, let alone three months or three years later, or when another person has to read them. Make your code as easy to read as possible by using the pipe where appropriate, leaving white space, using descriptive variable names, being consistent with spacing and naming, and liberally commenting code.
Challenge: Data Reduction with Pipes
Copy the code you (or the instructor) wrote to solve the previous MCQ Data Reduction challenge. Rewrite it using pipes (i.e. no assignment and no nested functions)
summarize()
Often we want to calculate a new variable, but rather than keeping each row as an independent observation, we want to group observations together to calculate some summary statistic. To do this we need two functions, one to do the grouping and one to calculate the summary statistic: group_by
and summarize
. By itself group_by
doesn’t change a data.frame; it just sets up the grouping. summarize
then goes over each group in the data.frame and does whatever calculation you want. E.g. suppose we want the average per-capita GDP across countries for each year. While we’re at it, let’s calculate both the mean and the median and see how they differ.
gapminder %>%
group_by(year) %>%
summarize(mean_gdp = mean(gdpPercap), median_gdp = median(gdpPercap))
## # A tibble: 12 x 3
## year mean_gdp median_gdp
## <int> <dbl> <dbl>
## 1 1952 3725.276 1968.528
## 2 1957 4299.408 2173.220
## 3 1962 4725.812 2335.440
## 4 1967 5483.653 2678.335
## 5 1972 6770.083 3339.129
## 6 1977 7313.166 3798.609
## 7 1982 7518.902 4216.228
## 8 1987 7900.920 4280.300
## 9 1992 8158.609 4386.086
## 10 1997 9090.175 4781.825
## 11 2002 9917.848 5319.805
## 12 2007 11680.072 6124.371
Shoutout Q: Note that summarize
eliminates any other columns. Why? What else can it do? E.g. What country should it list for the year 1952!?
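To see the division of labor between the two functions, here is a small sketch on made-up data. The toy data.frame and its values are hypothetical; only group_by, group_vars and summarize come from dplyr.

```r
library(dplyr)

# Hypothetical toy data: three countries observed in two years
toy <- data.frame(
  year      = c(1952, 1952, 1952, 2007, 2007, 2007),
  gdpPercap = c(100, 200, 900, 500, 600, 1900)
)

# group_by() keeps every row and column; it only records the grouping
grouped <- group_by(toy, year)
nrow(grouped)        # still 6 rows
group_vars(grouped)  # the grouping variable that was recorded: "year"

# summarize() then collapses each group to a single row
result <- summarize(grouped,
                    mean_gdp   = mean(gdpPercap),
                    median_gdp = median(gdpPercap))
result
```

In each toy year one large value pulls the mean above the median, which is the same pattern the gapminder output shows.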
There are several different summary statistics that can be generated from our data. The R base package provides many built-in functions such as mean
, median
, min
, max
, and range
. By default, these functions return NA when the vector they operate on contains missing data. This makes sure that users know they have missing data and make a conscious decision about how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA
(the missing data) is to use na.rm=TRUE
(rm
stands for remove). An alternate option is to use the function is.na()
, which evaluates to TRUE if the value passed to it is missing (NA). This function is most useful as part of a filter, where you can drop every row that has a missing value in a given column. For that purpose you would do something like
gapminder %>%
filter(!is.na(someColumn))
The !
symbol negates it, so we’re asking for everything that is not an NA
.
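The gapminder data has no missing values, so here is a minimal sketch on a made-up data.frame. The toy data and the name complete_rows are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one life expectancy value is missing
toy <- data.frame(country = c("A", "B", "C"),
                  lifeExp = c(70, NA, 80))

mean(toy$lifeExp)                # NA, because one value is missing
mean(toy$lifeExp, na.rm = TRUE)  # drops the NA first, giving 75

is.na(toy$lifeExp)               # TRUE only for the missing entry

# Keep only the rows where lifeExp is present
complete_rows <- toy %>%
  filter(!is.na(lifeExp))
complete_rows
```

Whether to drop the missing rows or only skip them inside one calculation depends on the analysis. na.rm=TRUE leaves the data.frame untouched, while the filter removes the rows for every later step.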
We often want to calculate the number of entries within a group. E.g. we might wonder if our dataset is balanced by country. We can do this with the n()
function, or dplyr
provides a count
function as a convenience:
gapminder %>%
group_by(country) %>%
summarize(number_entries = n())
## # A tibble: 142 x 2
## country number_entries
## <fctr> <int>
## 1 Afghanistan 12
## 2 Albania 12
## 3 Algeria 12
## 4 Angola 12
## 5 Argentina 12
## 6 Australia 12
## 7 Austria 12
## 8 Bahrain 12
## 9 Bangladesh 12
## 10 Belgium 12
## # ... with 132 more rows
count(gapminder, country)
## # A tibble: 142 x 2
## country n
## <fctr> <int>
## 1 Afghanistan 12
## 2 Albania 12
## 3 Algeria 12
## 4 Angola 12
## 5 Argentina 12
## 6 Australia 12
## 7 Austria 12
## 8 Bahrain 12
## 9 Bangladesh 12
## 10 Belgium 12
## # ... with 132 more rows
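The two approaches can be checked against each other on a small example. This is only a sketch: the toy data.frame is made up, and the summary column is named n so that it lines up with the default column name that count uses.

```r
library(dplyr)

# Hypothetical toy data: an unbalanced set of observations
toy <- data.frame(country = c("A", "A", "B", "B", "B"),
                  year    = c(1952, 1957, 1952, 1957, 1962))

# Long form: group, then count the rows in each group
by_hand <- toy %>%
  group_by(country) %>%
  summarize(n = n())

# Short form: count() does the grouping and counting in one step
shortcut <- count(toy, country)

by_hand
shortcut
```

Here country B has one more entry than country A, which is the kind of imbalance this check is meant to reveal.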
We can also do multiple groupings. Suppose we want the maximum life expectancy in each continent for each year. We group by continent and year and calculate the maximum with the max
function:
gapminder %>%
group_by(continent, year) %>%
summarize(longest_life = max(lifeExp))
## # A tibble: 60 x 3
## # Groups: continent [?]
## continent year longest_life
## <fctr> <int> <dbl>
## 1 Africa 1952 52.724
## 2 Africa 1957 58.089
## 3 Africa 1962 60.246
## 4 Africa 1967 61.557
## 5 Africa 1972 64.274
## 6 Africa 1977 67.064
## 7 Africa 1982 69.885
## 8 Africa 1987 71.913
## 9 Africa 1992 73.615
## 10 Africa 1997 74.772
## # ... with 50 more rows
Hmm, we got the longest life expectancy for each continent-year, but we didn’t get the country. To get the country, we have to ask R “Where lifeExp is at a maximum, what is the entry in country?” For that we use the which.max
function. max
returns the maximum value; which.max
returns the location of the maximum value.
max(c(1, 7, 4))
## [1] 7
which.max(c(1, 7, 4))
## [1] 2
Now, back to the question: Where lifeExp is at a maximum, what is the entry in country?
gapminder %>%
group_by(continent, year) %>%
summarize(longest_life = max(lifeExp), country = country[which.max(lifeExp)])
## # A tibble: 60 x 4
## # Groups: continent [?]
## continent year longest_life country
## <fctr> <int> <dbl> <fctr>
## 1 Africa 1952 52.724 Reunion
## 2 Africa 1957 58.089 Mauritius
## 3 Africa 1962 60.246 Mauritius
## 4 Africa 1967 61.557 Mauritius
## 5 Africa 1972 64.274 Reunion
## 6 Africa 1977 67.064 Reunion
## 7 Africa 1982 69.885 Reunion
## 8 Africa 1987 71.913 Reunion
## 9 Africa 1992 73.615 Reunion
## 10 Africa 1997 74.772 Reunion
## # ... with 50 more rows
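One detail to keep in mind with this pattern: which.max returns the position of the first maximum only, so if two countries tie, only the first is reported. A small base R sketch with made-up vectors (life and country here are hypothetical, not the gapminder columns):

```r
# Hypothetical values with a tie for the maximum
life    <- c(70, 82, 82)
country <- c("A", "B", "C")

which.max(life)             # position of the first maximum
country[which.max(life)]    # only the first of the tied countries

# To see every tied entry, compare against the maximum instead
country[life == max(life)]
```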
Challenge – Part 1
- Calculate a new column: the total GDP of each country in each year.
- Calculate the variance –
var()
of countries’ GDPs in each year.
- Is country-level GDP getting more or less equal over time?
Challenge – Part 2
- Modify the code you just wrote to calculate the variance in both country-level GDP and per-capita GDP.
- Do both measures support the conclusion you arrived at above?
That is the core of dplyr
’s functionality, but it does more. RStudio makes a great cheatsheet that covers all the dplyr
functions we just learned, plus what we will learn in the next lesson: keeping data tidy.
Solution to challenge Subsetting
- Subset the gapminder data to only Oceania countries post-1980.
- Remove the continent column
- Make a scatter plot of gdpPercap vs. population colored by country
library(gapminder)
oc1980 = filter(gapminder, continent == "Oceania" & year > 1980)
oc1980less = select(oc1980, -continent)
library('ggplot2')
ggplot(oc1980less, aes(x = gdpPercap, y = pop, color = country)) +
geom_point()
Advanced: How would you determine the median population for the North American countries between 1970 and 1980?
library(gapminder)
library(tidyverse)
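# Parenthesize the country tests: & binds more tightly than |, so without
# the parentheses the year condition would apply only to Puerto Rico
noAm = filter(gapminder, (country == "United States" | country == "Canada" |
  country == "Mexico" | country == "Puerto Rico") & (year > 1970 & year < 1980))
noAmPop = select(noAm, pop)
# select() returns a data.frame, so unlist() the column before taking the median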
median(unlist(noAmPop))
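A filter condition that mixes | and & depends on how R groups the two operators. As a minimal base R sketch (the toy country and year vectors are hypothetical stand-ins for the gapminder columns):

```r
# & is evaluated before |, much as * is evaluated before +
TRUE | FALSE & FALSE      # parsed as TRUE | (FALSE & FALSE)
(TRUE | FALSE) & FALSE    # parentheses force the | to happen first

# Hypothetical toy vectors standing in for the country and year columns
country <- c("Canada", "Canada", "Mexico")
year    <- c(1952, 1977, 1952)

# Without parentheses, the year test attaches only to the last country
unparenthesized <- country == "Canada" | country == "Mexico" & year > 1970

# With parentheses, the year test applies to every listed country
parenthesized <- (country == "Canada" | country == "Mexico") & year > 1970

# %in% avoids the chain of | altogether
with_in <- country %in% c("Canada", "Mexico") & year > 1970

unparenthesized
parenthesized
with_in
```

In the toy data only the 1977 row should survive the year test. The unparenthesized version also keeps the 1952 Canada row.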
Bonus This can be done using base R’s subsetting, but this class doesn’t teach how. Do the original challenge without the filter
and select
functions. Feel free to consult Google, helpfiles, etc. to figure out how.
# The outer parentheses around the country tests are needed because
# & binds more tightly than |
noAm2 = gapminder[((gapminder$country == "United States") |
  (gapminder$country == "Mexico") |
  (gapminder$country == "Canada") |
  (gapminder$country == "Puerto Rico")) &
  ((gapminder$year > 1970) &
  (gapminder$year < 1980)),]
median(noAm2$pop)
Solution to challenge MCQ: Data Reduction
Produce a data.frame with only the names, years, and per-capita GDP of countries where per capita gdp is less than a dollar a day sorted from most- to least-recent.
- Tip: The gdpPercap
variable is annual gdp. You’ll need to adjust.
- Tip: For complex tasks, it often helps to use pencil and paper to write/draw/map the various steps needed and how they fit together before writing any code.
What is the annual per-capita gdp, rounded to the nearest dollar, of the first row in the data.frame?
a. $278 b. $312 c. $331 d. $339
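# Convert annual per-capita GDP to a daily figure
dailyGDP = mutate(gapminder, onedayGDP = gdpPercap / 365)
dailyGDP = filter(dailyGDP, onedayGDP < 1)
dailyGDP = select(dailyGDP, country, year, gdpPercap)
# Sort from most- to least-recent, as the challenge asks
dailyGDP = arrange(dailyGDP, desc(year))
dailyGDP[1,]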
Advanced: Use dplyr functions and ggplot to plot per-capita GDP versus population for North American countries after 1970.
- Once you’ve made the graph, transform both axes to a log10 scale. There are two ways to do this, one by creating new columns in the data frame, and another using functions provided by ggplot to transform the axes. Implement both, in that order. Which do you prefer and why?
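# Parenthesize the country tests: & binds more tightly than |, so otherwise
# the year condition would apply only to Puerto Rico
noAm = filter(gapminder, (country == "United States" |
  country == "Canada" | country == "Mexico" |
  country == "Puerto Rico") & year > 1970
)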
ggplot(noAm, aes(x = gdpPercap, y = pop, color = country)) +
geom_point() +
scale_x_log10() +
scale_y_log10()
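The code above transforms the axes with ggplot's scale functions. The other approach the challenge mentions, creating new columns, could look roughly like this. It is a sketch rather than the official solution, and the names noAmLog, log_gdpPercap and log_pop are made up for illustration.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Same subset as above: four North American countries, years after 1970
noAmLog <- gapminder %>%
  filter(country %in% c("United States", "Canada", "Mexico", "Puerto Rico"),
         year > 1970) %>%
  # Store the log10-transformed values as new columns
  mutate(log_gdpPercap = log10(gdpPercap),
         log_pop       = log10(pop))

# Plot the transformed columns on ordinary (linear) axes
p <- ggplot(noAmLog, aes(x = log_gdpPercap, y = log_pop, color = country)) +
  geom_point()
p
```

One difference to weigh when choosing between the two: with new columns the axis labels show log values, whereas scale_x_log10() and scale_y_log10() keep the labels in the original units.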
Challenge: Data Reduction with Pipes
Copy the code you (or the instructor) wrote to solve the previous MCQ Data Reduction challenge. Rewrite it using pipes (i.e. no assignment and no nested functions)
Previous challenge with pipes
# Original solution, step by step with intermediate assignments
dailyGDP = mutate(gapminder, onedayGDP = gdpPercap / 365)
dailyGDP = filter(dailyGDP, onedayGDP < 1)
dailyGDP = select(dailyGDP, country, year, gdpPercap)
dailyGDP = arrange(dailyGDP, desc(year))
# The same steps chained with pipes
smallGDP = gapminder %>%
  mutate(onedayGDP = gdpPercap / 365) %>%
  filter(onedayGDP < 1) %>%
  select(country, year, gdpPercap) %>%
  arrange(desc(year))
smallGDP[1,]
OR, more fancy (without an intermediate temp variable)
(gapminder %>%
  mutate(onedayGDP = gdpPercap / 365) %>%
  filter(onedayGDP < 1) %>%
  select(country, year, gdpPercap) %>%
  arrange(desc(year)))[1,]