R for reproducible scientific analysis

Learning objectives

Understand the 6 main data types in R

Be able to use the six major dplyr verbs (filter, select, arrange, mutate, group_by, summarize)

Be able to use and understand the advantages of the magrittr pipe: %>%

Installing and loading packages

dplyr is not part of “base R”; rather it is a package – a library of functions that an R user wrote. This extensibility is part of the beauty of R. As of December 2016, there were 9,600 such packages in the official Comprehensive R Archive Network, better known as CRAN.

dplyr is one of the most popular packages for R. It is part of a suite of R tools that make up “The Tidyverse”. Its author conveniently bundled these tools together in a super-package called tidyverse. To use the tidyverse tools, you first need to download them to your machine (once) and then load them (each R session you want to use them). You can download a package via the RStudio menu bar Tools -> Install Packages…, or with a line of code:

install.packages('tidyverse')

You only have to download the code once. But whenever you want to use a package, you have to load it in your R session. For that, use the library function:

library(tidyverse)
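If you rerun a script often, you may not want to download packages every time. Here is a minimal sketch, assuming a helper we are calling missing_packages (a made-up name, not part of R or the tidyverse), that reports which packages still need installing:

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
# requireNamespace() returns TRUE or FALSE instead of stopping with an error
# (as a side effect it loads, but does not attach, packages it finds).
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "gapminder"))
to_install   # an empty character vector once both packages are installed

# Download only what is missing (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the sketch never triggers a download by accident.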

Challenge – Install and load tidyverse

Install the tidyverse and gapminder packages, either with install.packages(c('tidyverse', 'gapminder')) or via the menu bar: Tools -> Install Packages…

Load tidyverse with library(tidyverse)

Load gapminder with library(gapminder)

You will see some warnings about conflicts. That’s okay.

Vectors & Data Types

There are six main types of data in R. We’ve already covered 2–3 of them. Can anyone help me list them?

A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function. For example, we can create a vector of animal weights and assign it to a new object weight_g:

weight_g <- c(50, 60, 65, 82)
weight_g

## [1] 50 60 65 82

A vector can also contain characters:

animals <- c("mouse", "rat", "dog")
animals

## [1] "mouse" "rat"   "dog"

The quotes around “mouse”, “rat”, etc. are essential here. Without the quotes R will assume there are objects called mouse, rat and dog. As these objects don’t exist in R’s memory, there will be an error message.
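To see what that error looks like without halting a script, here is a small sketch that wraps the unquoted version in tryCatch() and keeps the message as text (the exact wording depends on your R version and language settings):

```r
animals <- c("mouse", "rat", "dog")   # quoted: three character values

# Unquoted, R looks for objects named mouse, rat and dog and fails.
# tryCatch() captures the error message so the script keeps running.
msg <- tryCatch(c(mouse, rat, dog),
                error = function(e) conditionMessage(e))
msg
```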

There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:

length(weight_g)

## [1] 4

length(animals)

## [1] 3

An important feature of a vector is that all of the elements are the same type of data. The function class() indicates the class (the type of element) of an object:

class(weight_g)

## [1] "numeric"

class(animals)

## [1] "character"

The function str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:

str(weight_g)

##  num [1:4] 50 60 65 82

str(animals)

##  chr [1:3] "mouse" "rat" "dog"

You can also use the c() function to add other elements to your vector:

weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g

## [1] 30 50 60 65 82 90

In the first line, we take the original vector weight_g, add the value 90 to the end of it, and save the result back into weight_g. Then we add the value 30 to the beginning, again saving the result back into weight_g.

We can do this over and over again to grow a vector, or assemble a dataset. As we program, this may be useful to add results that we are collecting or calculating.

We just saw 2 of the 6 main atomic vector types (or data types) that R uses: “character” and “numeric”. These are the basic building blocks that all R objects are built from. The other 4 are:

“logical” for TRUE and FALSE (the boolean data type)
“integer” for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
“complex” to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about them
“raw” that we won’t discuss further
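As a quick check, we can build one example value of each of the six types and ask R what it is. This is only an illustrative sketch; the example values are arbitrary:

```r
# One example value for each of the six atomic types
examples <- list(TRUE, 2L, 2.5, 1+4i, "a", as.raw(255))

# class() of each value, collected into a character vector
types <- vapply(examples, class, character(1))
types
# "logical" "integer" "numeric" "complex" "character" "raw"

# class() says "numeric", while typeof() reports the storage type "double"
typeof(2.5)
```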

Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), factors (factor) and arrays (array).

Challenge

We’ve seen that atomic vectors can be of type character, numeric, integer, and logical. But what happens if we try to mix these types in a single vector? What will happen in each of these examples? (hint: use class() to check the data type of your objects):

num_char <- c(1, 2, 3, 'a')
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c('a', 'b', 'c', TRUE)
tricky <- c(1, 2, 3, '4')

Why do you think it happens?

You’ve probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?

Factors

Sometimes if we look at a data set with str() we can see columns consist of integers, characters, etc. However, sometimes the columns are of a special class called a factor. Factors are very useful and are actually something that makes R particularly well suited to working with data, so we’re going to spend a little time introducing them.

Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.

Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

sex <- factor(c("male", "female", "female", "male"))

R will assign 1 to the level “female” and 2 to the level “male” (because f comes before m, even though the first element in this vector is “male”). You can check this by using the function levels(), and check the number of levels using nlevels():

levels(sex)

## [1] "female" "male"

nlevels(sex)

## [1] 2
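We can also look under the hood at the integer codes themselves. A short sketch using the same sex factor:

```r
sex <- factor(c("male", "female", "female", "male"))

typeof(sex)              # "integer": the storage type behind the labels
codes <- as.integer(sex)
codes                    # 2 1 1 2, because "female" is level 1 and "male" is level 2

# Looking the codes up in the levels gives back the original labels
levels(sex)[codes]
```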

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the sex vector would be:

sex # current order

## [1] male   female female male  
## Levels: female male

sex <- factor(sex, levels = c("male", "female"))
sex # after re-ordering

## [1] male   female female male  
## Levels: male female


In R’s memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self-describing: “female”, “male” is more descriptive than 1, 2. Which one is “male”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the species names in our ecology example dataset).
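When the order of the levels is meaningful, as with “low”, “medium”, “high”, we can tell R so with ordered = TRUE. The sketch below uses a made-up dose vector (not part of the lesson data) to show that comparisons and sorting then follow the level order rather than the alphabet:

```r
# Hypothetical example data: four dose categories
dose <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)

levels(dose)                    # "low" "medium" "high"

# Comparisons use the level order
is_below_high <- dose < "high"
is_below_high                   # TRUE FALSE TRUE TRUE

# Sorting also follows the level order
sorted_dose <- sort(dose)
as.character(sorted_dose)       # "low" "low" "medium" "high"
```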

Converting factors

If you need to convert a factor to a character vector, you use as.character(x).

as.character(sex)

## [1] "male"   "female" "female" "male"

Converting factors where the levels appear as numbers (such as concentration levels, or years) to a numeric vector is a little trickier. One method is to convert factors to characters and then numbers. Another method is to use the levels() function. Compare:

f <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(f)               # wrong! and there is no warning...

## [1] 3 2 1 4 3

as.numeric(as.character(f)) # works...

## [1] 1990 1983 1977 1998 1990

as.numeric(levels(f))[f]    # The recommended way.

## [1] 1990 1983 1977 1998 1990

Notice that in the levels() approach, three important steps occur:

We obtain all the factor levels using levels(f)
We convert these levels to numeric values using as.numeric(levels(f))
We then access these numeric values using the underlying integers of the vector f inside the square brackets
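The three steps can be written out one at a time. This sketch uses intermediate names (lev, lev_num, codes) chosen only for illustration:

```r
f <- factor(c(1990, 1983, 1977, 1998, 1990))

lev <- levels(f)            # step 1: the levels, as text: "1977" "1983" "1990" "1998"
lev_num <- as.numeric(lev)  # step 2: the same levels as numbers
codes <- as.integer(f)      # the underlying integers: 3 2 1 4 3

years <- lev_num[codes]     # step 3: look up each code in the numeric levels
years                       # 1990 1983 1977 1998 1990
```

Only the handful of levels is converted here, not every element, which is one reason this approach can be quicker on long vectors with few distinct values.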

Using `stringsAsFactors=FALSE`

In versions of R before 4.0.0, when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (= converted) into the factor data type by default. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so these columns stay as character unless you ask otherwise. Either way, read.csv() and read.table() have an argument called stringsAsFactors that lets you choose explicitly.

In most cases, it’s preferable to set stringsAsFactors = FALSE when importing your data, and to convert to a factor only the columns that require this data type.

We will compare the output of str(surveys) when setting stringsAsFactors = TRUE (the default before R 4.0.0) and stringsAsFactors = FALSE. First, we need the data.

We are going to use the R function download.file() to download the CSV file that contains the survey data from figshare, and we will use read.csv() to load into memory the content of the CSV file as an object of class data.frame.

To download the data into the data/ subdirectory, run the following:

download.file("https://ndownloader.figshare.com/files/2292169",
              "data/combined.csv")

You are now ready to load the data:

surveys <- read.csv('data/combined.csv')

This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can print the first 6 rows of the data using head(surveys).
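The download step above assumes that the data/ subdirectory already exists; if it does not, download.file() will typically fail. A minimal sketch of a helper that creates the folder and skips the download when the file is already there (fetch_once is a hypothetical name, not a function from any package):

```r
# Hypothetical helper: download `url` to `dest` only if `dest` is missing
fetch_once <- function(url, dest) {
  # Create the destination folder if needed (no warning if it already exists)
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(dest)) {
    download.file(url, dest)
  }
  invisible(dest)   # return the path so it can be passed on to read.csv()
}

# Usage with the lesson's file (needs an internet connection the first time):
# fetch_once("https://ndownloader.figshare.com/files/2292169", "data/combined.csv")
```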

Now we can look at how reading in the data in different ways affects the different data types (factor vs. character):

## Compare the difference between when the data are being read as
## `factor`, and when they are being read as `character`.
surveys <- read.csv("data/combined.csv", stringsAsFactors = TRUE)
str(surveys)
surveys <- read.csv("data/combined.csv", stringsAsFactors = FALSE)
str(surveys)
## Convert the column "plot_type" into a factor
surveys$plot_type <- factor(surveys$plot_type)
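If you don’t have the survey file at hand, the same comparison can be tried on a few lines of inline text. This sketch uses a tiny made-up table (not the real survey data) passed to read.csv() through its text argument:

```r
# Made-up three-row CSV, for illustration only
csv_text <- "species,plot_type,weight
DM,Control,40
PF,Rodent Exclosure,8
DM,Control,44"

as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_fct$plot_type)   # "factor"
class_before <- class(as_chr$plot_type)
class_before              # "character"

# Convert just the one column that really is categorical
as_chr$plot_type <- factor(as_chr$plot_type)
nlevels(as_chr$plot_type) # 2
```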

Challenge

We have seen how data frames are created when using read.csv(), but they can also be created by hand with the data.frame() function. There are a few mistakes in this hand-crafted data.frame. Can you spot and fix them? Don’t hesitate to experiment!

animal_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                          feel=c("furry", "squishy", "spiny"),
                          weight=c(45, 8 1.1, 0.8))

Can you predict the class for each of the columns in the following example? Check your guesses using str(country_climate):

Are they what you expected? Why? Why not?

What would have been different if we had added stringsAsFactors = FALSE to this call?

What would you need to change to ensure that each column had the accurate data type?

country_climate <- data.frame(
       country=c("Canada", "Panama", "South Africa", "Australia"),
       climate=c("cold", "hot", "temperate", "hot/temperate"),
       temperature=c(10, 30, 18, "15"),
       northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
       has_kangaroo=c(FALSE, FALSE, FALSE, 1)
       )

The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (a letter in a column that should only contain numbers, for instance).
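One way to hunt for such data-entry mistakes is to convert the text to numbers and look for the entries that fail. A small sketch with made-up values, where one entry contains the letter O instead of a zero:

```r
# Made-up column read in as text; the last entry has a letter O, not a zero
raw_weights <- c("45", "8", "1.1", "O.8")

# Entries that are not valid numbers become NA (the warning is expected here)
weights <- suppressWarnings(as.numeric(raw_weights))

# Positions that were not missing before conversion but are NA afterwards
bad <- which(is.na(weights) & !is.na(raw_weights))
raw_weights[bad]   # "O.8"
```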

Data Wrangling with `dplyr`

It is an often bemoaned fact that a data scientist spends much, and often most, of her time wrangling data: getting it organized and clean. In this lesson we will learn an efficient set of tools that can handle the vast majority of data management tasks.

Enter dplyr, a package for making data manipulation easier. More on dplyr later. dplyr is part of tidyverse, so it is already installed on your machine. You can load it individually, or with the other tidyverse packages like this:

library(tidyverse)
library(gapminder)

Those messages and conflicts are normal. The conflicts are R telling you that there are two packages with functions named “filter” and “lag”. When R gives you red text, it’s not always a bad thing, but it does mean you should pay attention and try to understand what it’s trying to tell you.
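When two packages share a function name, the most recently loaded one wins, and the other is still reachable with the package::function form. A brief sketch, assuming dplyr is installed:

```r
library(dplyr)

# After loading dplyr, the bare name `filter` refers to dplyr's version
environmentName(environment(filter))   # "dplyr"

# The masked base version is still available under its full name.
# Here it computes a 3-point moving average of 1:5.
smoothed <- stats::filter(1:5, rep(1/3, 3))
as.numeric(smoothed)                   # NA 2 3 4 NA
```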

Remember that you only have to install each package once (per computer), but you have to load them for each R session in which you want to use them.

You also have to load any data you want to use each time you start a new R session. So, if it’s not already loaded, read in the gapminder data. We’re going to use tidyverse’s read_csv instead of base R’s read.csv here. It has a few nice features; the most obvious is that it makes a special kind of data.frame that only prints the first ten rows instead of all 1704.

# gapminder <- read_csv('data/gapminder-FiveYearData.csv')
class(gapminder)

## [1] "tbl_df"     "tbl"        "data.frame"

head(gapminder) # look at first few rows

## # A tibble: 6 x 6
##       country continent  year lifeExp      pop gdpPercap
##        <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia  1977  38.438 14880372  786.1134

str(gapminder) # look at data structure

## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

You can always convert a data.frame into this special kind of data.frame like this:

gapminder <- as_tibble(gapminder) # older code used tbl_df(), which is now deprecated
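Here is a small sketch of the conversion on a hand-made data frame, using as_tibble() from the tibble package (which is loaded along with the tidyverse). The data frame df is made up for illustration:

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
class(df)    # "data.frame"

tb <- as_tibble(df)
class(tb)    # "tbl_df" "tbl" "data.frame"
```

Because a tibble still has class "data.frame", functions written for ordinary data frames generally accept it too.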

What is dplyr?

The package dplyr is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with data frames. The thinking behind it was largely inspired by the package plyr, which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.

This addresses a common problem with R: all operations are conducted in memory, so the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation, in that you can have a database of hundreds of gigabytes, conduct queries on it directly, and pull back just what you need for analysis in R.
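To make the idea concrete, here is a rough sketch using a throwaway in-memory SQLite database. It assumes the DBI, RSQLite and dbplyr packages are installed (they are not otherwise used in this lesson), and the table name and values are made up:

```r
library(dplyr)

# Assumption: DBI, RSQLite and dbplyr are installed.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "measurements",
                  data.frame(year = c(1997, 2002, 2007),
                             pop  = c(272911760, 287675526, 301139947)))

remote <- tbl(con, "measurements")     # a reference to the table; no rows pulled yet
recent <- filter(remote, year > 2000)  # translated to SQL and run by the database
result <- collect(recent)              # only now are the matching rows brought into R
result

DBI::dbDisconnect(con)
```

With a real database you would connect to a file or server instead of ":memory:", but the dplyr code in the middle would look the same.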

The five tasks of `dplyr`

There are five actions we often want to apply to a tabular dataset:

Filter rows
Filter columns
Arrange rows
Make new columns
Summarize groups

We are about to see how to do each of those things using the dplyr package. Everything we’re going to learn to do can also be done using “base R”, but dplyr makes it easier, its syntax is consistent, and it often makes the computations faster.

`filter()`

Suppose we want to see just the gapminder data for the USA. First, we need to know how “USA” is written in the dataset: Is it USA or United States or what? We can see all the unique values of a variable with the unique function.

unique(gapminder$country)

##   [1] Afghanistan              Albania                 
##   [3] Algeria                  Angola                  
##   [5] Argentina                Australia               
##   [7] Austria                  Bahrain                 
##   [9] Bangladesh               Belgium                 
##  [11] Benin                    Bolivia                 
##  [13] Bosnia and Herzegovina   Botswana                
##  [15] Brazil                   Bulgaria                
##  [17] Burkina Faso             Burundi                 
##  [19] Cambodia                 Cameroon                
##  [21] Canada                   Central African Republic
##  [23] Chad                     Chile                   
##  [25] China                    Colombia                
##  [27] Comoros                  Congo, Dem. Rep.        
##  [29] Congo, Rep.              Costa Rica              
##  [31] Cote d'Ivoire            Croatia                 
##  [33] Cuba                     Czech Republic          
##  [35] Denmark                  Djibouti                
##  [37] Dominican Republic       Ecuador                 
##  [39] Egypt                    El Salvador             
##  [41] Equatorial Guinea        Eritrea                 
##  [43] Ethiopia                 Finland                 
##  [45] France                   Gabon                   
##  [47] Gambia                   Germany                 
##  [49] Ghana                    Greece                  
##  [51] Guatemala                Guinea                  
##  [53] Guinea-Bissau            Haiti                   
##  [55] Honduras                 Hong Kong, China        
##  [57] Hungary                  Iceland                 
##  [59] India                    Indonesia               
##  [61] Iran                     Iraq                    
##  [63] Ireland                  Israel                  
##  [65] Italy                    Jamaica                 
##  [67] Japan                    Jordan                  
##  [69] Kenya                    Korea, Dem. Rep.        
##  [71] Korea, Rep.              Kuwait                  
##  [73] Lebanon                  Lesotho                 
##  [75] Liberia                  Libya                   
##  [77] Madagascar               Malawi                  
##  [79] Malaysia                 Mali                    
##  [81] Mauritania               Mauritius               
##  [83] Mexico                   Mongolia                
##  [85] Montenegro               Morocco                 
##  [87] Mozambique               Myanmar                 
##  [89] Namibia                  Nepal                   
##  [91] Netherlands              New Zealand             
##  [93] Nicaragua                Niger                   
##  [95] Nigeria                  Norway                  
##  [97] Oman                     Pakistan                
##  [99] Panama                   Paraguay                
## [101] Peru                     Philippines             
## [103] Poland                   Portugal                
## [105] Puerto Rico              Reunion                 
## [107] Romania                  Rwanda                  
## [109] Sao Tome and Principe    Saudi Arabia            
## [111] Senegal                  Serbia                  
## [113] Sierra Leone             Singapore               
## [115] Slovak Republic          Slovenia                
## [117] Somalia                  South Africa            
## [119] Spain                    Sri Lanka               
## [121] Sudan                    Swaziland               
## [123] Sweden                   Switzerland             
## [125] Syria                    Taiwan                  
## [127] Tanzania                 Thailand                
## [129] Togo                     Trinidad and Tobago     
## [131] Tunisia                  Turkey                  
## [133] Uganda                   United Kingdom          
## [135] United States            Uruguay                 
## [137] Venezuela                Vietnam                 
## [139] West Bank and Gaza       Yemen, Rep.             
## [141] Zambia                   Zimbabwe                
## 142 Levels: Afghanistan Albania Algeria Angola Argentina ... Zimbabwe

Okay, now we want to see just the rows of the data.frame where country is “United States”. The syntax for all dplyr functions is the same: The first argument is the data.frame, the rest of the arguments are whatever you want to do in that data.frame.

filter(gapminder, country == "United States")

## # A tibble: 12 x 6
##          country continent  year lifeExp       pop gdpPercap
##           <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
##  1 United States  Americas  1952  68.440 157553000  13990.48
##  2 United States  Americas  1957  69.490 171984000  14847.13
##  3 United States  Americas  1962  70.210 186538000  16173.15
##  4 United States  Americas  1967  70.760 198712000  19530.37
##  5 United States  Americas  1972  71.340 209896000  21806.04
##  6 United States  Americas  1977  73.380 220239000  24072.63
##  7 United States  Americas  1982  74.650 232187835  25009.56
##  8 United States  Americas  1987  75.020 242803533  29884.35
##  9 United States  Americas  1992  76.090 256894189  32003.93
## 10 United States  Americas  1997  76.810 272911760  35767.43
## 11 United States  Americas  2002  77.310 287675526  39097.10
## 12 United States  Americas  2007  78.242 301139947  42951.65

We can also apply multiple conditions, e.g. the US after 2000:

filter(gapminder, country == "United States" & year > 2000)

## # A tibble: 2 x 6
##         country continent  year lifeExp       pop gdpPercap
##          <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
## 1 United States  Americas  2002  77.310 287675526  39097.10
## 2 United States  Americas  2007  78.242 301139947  42951.65

We can also use “or” conditions with the vertical bar: |. Notice that the variable (column) names don’t go in quotes, but values of character variables do.

filter(gapminder, country == "United States" | country == "Mexico")

## # A tibble: 24 x 6
##    country continent  year lifeExp      pop gdpPercap
##     <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
##  1  Mexico  Americas  1952  50.789 30144317  3478.126
##  2  Mexico  Americas  1957  55.190 35015548  4131.547
##  3  Mexico  Americas  1962  58.299 41121485  4581.609
##  4  Mexico  Americas  1967  60.110 47995559  5754.734
##  5  Mexico  Americas  1972  62.361 55984294  6809.407
##  6  Mexico  Americas  1977  65.032 63759976  7674.929
##  7  Mexico  Americas  1982  67.405 71640904  9611.148
##  8  Mexico  Americas  1987  69.498 80122492  8688.156
##  9  Mexico  Americas  1992  71.455 88111030  9472.384
## 10  Mexico  Americas  1997  73.670 95895146  9767.298
## # ... with 14 more rows
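Two shortcuts are worth knowing. Conditions separated by commas are combined with “and”, and %in% is a compact way to match any of several values. A sketch, assuming the dplyr and gapminder packages are installed:

```r
library(dplyr)
library(gapminder)

# A comma between conditions means "and", just like &
us_recent_amp   <- filter(gapminder, country == "United States" & year > 2000)
us_recent_comma <- filter(gapminder, country == "United States", year > 2000)
nrow(us_recent_amp)     # 2
nrow(us_recent_comma)   # 2

# %in% keeps rows whose country matches any value in the vector
us_mexico <- filter(gapminder, country %in% c("United States", "Mexico"))
nrow(us_mexico)         # 24, the same rows as the | version above
```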

A good, handy reference list for the operators (and, or, etc) can be found here.

`select()`

filter returned a subset of the data.frame’s rows. select returns a subset of the data.frame’s columns.

Suppose we only want to see country and life expectancy.

select(gapminder, country, lifeExp)
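A quick way to confirm what select did is to compare dimensions: the rows are untouched and only the columns change. The sketch below also shows the first:last shorthand for a run of neighbouring columns, and assumes dplyr and gapminder are installed:

```r
library(dplyr)
library(gapminder)

small <- select(gapminder, country, lifeExp)
dim(gapminder)   # 1704 6
dim(small)       # 1704 2

# A colon selects a run of adjacent columns
names(select(gapminder, year:pop))   # "year" "lifeExp" "pop"
```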

We can also choose which columns we don’t want, and rename a column in the same call:

select(gapminder, -continent, income = gdpPercap)

## # A tibble: 1,704 x 5
##        country  year lifeExp      pop   income
##         <fctr> <int>   <dbl>    <int>    <dbl>
##  1 Afghanistan  1952  28.801  8425333 779.4453
##  2 Afghanistan  1957  30.332  9240934 820.8530
##  3 Afghanistan  1962  31.997 10267083 853.1007
##  4 Afghanistan  1967  34.020 11537966 836.1971
##  5 Afghanistan  1972  36.088 13079460 739.9811
##  6 Afghanistan  1977  38.438 14880372 786.1134
##  7 Afghanistan  1982  39.854 12881816 978.0114
##  8 Afghanistan  1987  40.822 13867957 852.3959
##  9 Afghanistan  1992  41.674 16317921 649.3414
## 10 Afghanistan  1997  41.763 22227415 635.3414
## # ... with 1,694 more rows

And we can rename the columns we keep:

select(gapminder, ThePlace = country, HowLongTheyLive = lifeExp)

## # A tibble: 1,704 x 2
##       ThePlace HowLongTheyLive
##         <fctr>           <dbl>
##  1 Afghanistan          28.801
##  2 Afghanistan          30.332
##  3 Afghanistan          31.997
##  4 Afghanistan          34.020
##  5 Afghanistan          36.088
##  6 Afghanistan          38.438
##  7 Afghanistan          39.854
##  8 Afghanistan          40.822
##  9 Afghanistan          41.674
## 10 Afghanistan          41.763
## # ... with 1,694 more rows
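Renaming inside select keeps only the columns you mention. If you want to rename a column and keep everything else, dplyr also provides rename(). A short sketch, assuming dplyr and gapminder are installed:

```r
library(dplyr)
library(gapminder)

# rename() uses the same new_name = old_name form, but keeps every column
renamed <- rename(gapminder, income = gdpPercap)
names(renamed)
# "country" "continent" "year" "lifeExp" "pop" "income"
```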

As usual, R isn’t saving any of these outputs; just printing them to the screen. If we want to keep them around, we need to assign them to a variable.

justUS = filter(gapminder, country == "United States")
USdata = select(justUS, -country, -continent)
USdata

## # A tibble: 12 x 4
##     year lifeExp       pop gdpPercap
##    <int>   <dbl>     <int>     <dbl>
##  1  1952  68.440 157553000  13990.48
##  2  1957  69.490 171984000  14847.13
##  3  1962  70.210 186538000  16173.15
##  4  1967  70.760 198712000  19530.37
##  5  1972  71.340 209896000  21806.04
##  6  1977  73.380 220239000  24072.63
##  7  1982  74.650 232187835  25009.56
##  8  1987  75.020 242803533  29884.35
##  9  1992  76.090 256894189  32003.93
## 10  1997  76.810 272911760  35767.43
## 11  2002  77.310 287675526  39097.10
## 12  2007  78.242 301139947  42951.65

Subsetting

Subset the gapminder data to only Oceania countries post-1980.

Remove the continent column

Make a scatter plot of gdpPercap vs. population colored by country

Advanced: How would you determine the median population for the North American countries between 1970 and 1980?

Bonus: This can be done using base R’s subsetting, but this class doesn’t teach how. Do the original challenge without the filter and select functions. Feel free to consult Google, helpfiles, etc. to figure out how.

`arrange()`

You can order the rows of a data.frame by a variable using arrange. Suppose we want to see the most populous countries:

arrange(gapminder, pop)

## # A tibble: 1,704 x 6
##                  country continent  year lifeExp   pop gdpPercap
##                   <fctr>    <fctr> <int>   <dbl> <int>     <dbl>
##  1 Sao Tome and Principe    Africa  1952  46.471 60011  879.5836
##  2 Sao Tome and Principe    Africa  1957  48.945 61325  860.7369
##  3              Djibouti    Africa  1952  34.812 63149 2669.5295
##  4 Sao Tome and Principe    Africa  1962  51.893 65345 1071.5511
##  5 Sao Tome and Principe    Africa  1967  54.425 70787 1384.8406
##  6              Djibouti    Africa  1957  37.328 71851 2864.9691
##  7 Sao Tome and Principe    Africa  1972  56.480 76595 1532.9853
##  8 Sao Tome and Principe    Africa  1977  58.550 86796 1737.5617
##  9              Djibouti    Africa  1962  39.693 89898 3020.9893
## 10 Sao Tome and Principe    Africa  1982  60.351 98593 1890.2181
## # ... with 1,694 more rows

Hmm, we didn’t get the most populous countries. By default, arrange sorts the variable in increasing order. We could see the most populous countries by examining the tail of the last command, or we can sort the data.frame by descending population by wrapping the variable in desc():

arrange(gapminder, desc(pop))

## # A tibble: 1,704 x 6
##    country continent  year  lifeExp        pop gdpPercap
##     <fctr>    <fctr> <int>    <dbl>      <int>     <dbl>
##  1   China      Asia  2007 72.96100 1318683096 4959.1149
##  2   China      Asia  2002 72.02800 1280400000 3119.2809
##  3   China      Asia  1997 70.42600 1230075000 2289.2341
##  4   China      Asia  1992 68.69000 1164970000 1655.7842
##  5   India      Asia  2007 64.69800 1110396331 2452.2104
##  6   China      Asia  1987 67.27400 1084035000 1378.9040
##  7   India      Asia  2002 62.87900 1034172547 1746.7695
##  8   China      Asia  1982 65.52500 1000281000  962.4214
##  9   India      Asia  1997 61.76500  959000000 1458.8174
## 10   China      Asia  1977 63.96736  943455000  741.2375
## # ... with 1,694 more rows

arrange can also sort by multiple variables. It will sort the data.frame by the first variable, and if there are any ties in that variable, they will be sorted by the next variable, and so on. Here we sort from newest to oldest, and within year from richest to poorest:

arrange(gapminder, desc(year), desc(gdpPercap))

## # A tibble: 1,704 x 6
##             country continent  year lifeExp       pop gdpPercap
##              <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
##  1           Norway    Europe  2007  80.196   4627926  49357.19
##  2           Kuwait      Asia  2007  77.588   2505559  47306.99
##  3        Singapore      Asia  2007  79.972   4553009  47143.18
##  4    United States  Americas  2007  78.242 301139947  42951.65
##  5          Ireland    Europe  2007  78.885   4109086  40676.00
##  6 Hong Kong, China      Asia  2007  82.208   6980412  39724.98
##  7      Switzerland    Europe  2007  81.701   7554661  37506.42
##  8      Netherlands    Europe  2007  79.762  16570613  36797.93
##  9           Canada  Americas  2007  80.653  33390141  36319.24
## 10          Iceland    Europe  2007  81.757    301931  36180.79
## # ... with 1,694 more rows
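Tie-breaking is easier to see on a tiny table. The scores data frame below is made up for illustration: two teams are tied on wins at each level, so the second variable decides their order.

```r
library(dplyr)

# Made-up data: teams b and d are tied on wins, as are a and c
scores <- data.frame(team   = c("a", "b", "c", "d"),
                     wins   = c(2, 3, 2, 3),
                     points = c(10, 7, 12, 9))

ranked <- arrange(scores, desc(wins), desc(points))
ranked$team   # "d" "b" "c" "a"
```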

Shoutout Q: Would we get the same output if we switched the order of desc(year) and desc(gdpPercap) in the last line?

`mutate()`

We have learned how to drop rows, drop columns, and rearrange rows. To make a new column we use the mutate function. As usual, the first argument is a data.frame. The second argument is the name of the new column you want to create, followed by an equal sign, followed by what to put in that column. You can reference other variables in the data.frame, and mutate will treat each row independently. E.g. we can calculate the total GDP of each country in each year by multiplying the per-capita GDP by the population.

mutate(gapminder, total_gdp = gdpPercap * pop)

## # A tibble: 1,704 x 7
##        country continent  year lifeExp      pop gdpPercap   total_gdp
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>       <dbl>
##  1 Afghanistan      Asia  1952  28.801  8425333  779.4453  6567086330
##  2 Afghanistan      Asia  1957  30.332  9240934  820.8530  7585448670
##  3 Afghanistan      Asia  1962  31.997 10267083  853.1007  8758855797
##  4 Afghanistan      Asia  1967  34.020 11537966  836.1971  9648014150
##  5 Afghanistan      Asia  1972  36.088 13079460  739.9811  9678553274
##  6 Afghanistan      Asia  1977  38.438 14880372  786.1134 11697659231
##  7 Afghanistan      Asia  1982  39.854 12881816  978.0114 12598563401
##  8 Afghanistan      Asia  1987  40.822 13867957  852.3959 11820990309
##  9 Afghanistan      Asia  1992  41.674 16317921  649.3414 10595901589
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414 14121995875
## # ... with 1,694 more rows

Shoutout Q: How would we view the highest-total-gdp countries?

Note that this didn’t change gapminder: we didn’t assign the output to anything, so it was just printed, with the new column. If we want to modify our gapminder data.frame, we can assign the output of mutate back to the gapminder variable, but be careful doing this – if you make a mistake, you can’t just re-run that line of code; you’ll need to go back to loading the gapminder data.frame.
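A cautious alternative is to assign the result to a new name, which leaves the original untouched while you check your work. In this sketch, with_gdp is just an illustrative name, and dplyr and gapminder are assumed to be installed:

```r
library(dplyr)
library(gapminder)

cols_before <- names(gapminder)

# Store the result under a new name instead of overwriting gapminder
with_gdp <- mutate(gapminder, total_gdp = gdpPercap * pop)

identical(names(gapminder), cols_before)   # TRUE: the original is unchanged
"total_gdp" %in% names(with_gdp)           # TRUE: the new column is in the copy
```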

Also, you can create multiple columns in one call to mutate, even using variables that you just created, separating them with commas:

gapminder = mutate(gapminder, 
                   total_gdp = gdpPercap * pop,
                   log_gdp = log10(total_gdp))

MCQ: Data Reduction

Produce a data.frame with only the names, years, and per-capita GDP of countries where per capita gdp is less than a dollar a day sorted from most- to least-recent.

Tip: The gdpPercap variable is annual gdp. You’ll need to adjust.

Tip: For complex tasks, it often helps to use pencil and paper to write/draw/map the various steps needed and how they fit together before writing any code.

What is the annual per-capita gdp, rounded to the nearest dollar, of the first row in the data.frame?

$278

$312

$331

$339

Advanced: Use dplyr functions and ggplot to plot per-capita GDP versus population for North American countries after 1970. Once you’ve made the graph, transform both axes to a log10 scale. There are two ways to do this, one by creating new columns in the data frame, and another using functions provided by ggplot to transform the axes. Implement both, in that order. Which do you prefer and why?

Ceci n’est pas une pipe

Suppose we want to look at all the countries where life expectancy is greater than 80 years, sorted from poorest to richest. First, we filter, then we arrange. We could assign the intermediate data.frame to a variable:

lifeExpGreater80 = filter(gapminder, lifeExp > 80)
(lifeExpGreater80sorted = arrange(lifeExpGreater80, gdpPercap))

## # A tibble: 21 x 8
##             country continent  year lifeExp       pop gdpPercap
##              <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
##  1      New Zealand   Oceania  2007  80.204   4115771  25185.01
##  2           Israel      Asia  2007  80.745   6426679  25523.28
##  3            Italy    Europe  2002  80.240  57926999  27968.10
##  4            Italy    Europe  2007  80.546  58147733  28569.72
##  5            Japan      Asia  2002  82.000 127065841  28604.59
##  6            Japan      Asia  1997  80.690 125956499  28816.58
##  7            Spain    Europe  2007  80.941  40448191  28821.06
##  8           Sweden    Europe  2002  80.040   8954175  29341.63
##  9 Hong Kong, China      Asia  2002  81.495   6762476  30209.02
## 10           France    Europe  2007  80.657  61083916  30470.02
## # ... with 11 more rows, and 2 more variables: total_gdp <dbl>,
## #   log_gdp <dbl>

In this case it doesn’t much matter, but we make a whole new data.frame (lifeExpGreater80) and only use it once; that’s a little wasteful of system resources, and it clutters our environment. If the data are large, that can be a big problem.

Or, we could nest each function so that it appears on one line:

arrange(filter(gapminder, lifeExp > 80), gdpPercap)

## # A tibble: 21 x 8
##             country continent  year lifeExp       pop gdpPercap
##              <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
##  1      New Zealand   Oceania  2007  80.204   4115771  25185.01
##  2           Israel      Asia  2007  80.745   6426679  25523.28
##  3            Italy    Europe  2002  80.240  57926999  27968.10
##  4            Italy    Europe  2007  80.546  58147733  28569.72
##  5            Japan      Asia  2002  82.000 127065841  28604.59
##  6            Japan      Asia  1997  80.690 125956499  28816.58
##  7            Spain    Europe  2007  80.941  40448191  28821.06
##  8           Sweden    Europe  2002  80.040   8954175  29341.63
##  9 Hong Kong, China      Asia  2002  81.495   6762476  30209.02
## 10           France    Europe  2007  80.657  61083916  30470.02
## # ... with 11 more rows, and 2 more variables: total_gdp <dbl>,
## #   log_gdp <dbl>

This would become difficult to read if we are performing a number of operations that would require a repeated nesting. But…

There is a better way, and it makes both writing and reading the code easier. The pipe from the magrittr package (which is automatically installed and loaded with dplyr and tidyverse) takes the output of the first line and plugs it in as the first argument of the next line. Since many tidyverse functions expect a data.frame as the first argument and output a data.frame, this works fluidly.

filter(gapminder, lifeExp > 80) %>%
    arrange(gdpPercap)

## # A tibble: 21 x 8
##             country continent  year lifeExp       pop gdpPercap
##              <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
##  1      New Zealand   Oceania  2007  80.204   4115771  25185.01
##  2           Israel      Asia  2007  80.745   6426679  25523.28
##  3            Italy    Europe  2002  80.240  57926999  27968.10
##  4            Italy    Europe  2007  80.546  58147733  28569.72
##  5            Japan      Asia  2002  82.000 127065841  28604.59
##  6            Japan      Asia  1997  80.690 125956499  28816.58
##  7            Spain    Europe  2007  80.941  40448191  28821.06
##  8           Sweden    Europe  2002  80.040   8954175  29341.63
##  9 Hong Kong, China      Asia  2002  81.495   6762476  30209.02
## 10           France    Europe  2007  80.657  61083916  30470.02
## # ... with 11 more rows, and 2 more variables: total_gdp <dbl>,
## #   log_gdp <dbl>

To demonstrate how it works, here are some examples where it’s unnecessary.

4 %>% sqrt()

## [1] 2

2 ^ 2 %>% sum(1)

## [1] 5
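As a rough check of what the pipe does with those two examples, we can compare each piped call against the ordinary function call it stands for. This is only a sketch; it loads magrittr directly, although loading dplyr or tidyverse also makes the pipe available.

```r
library(magrittr)

# x %>% f(y) is evaluated as f(x, y)
identical(4 %>% sqrt(), sqrt(4))

# ^ binds more tightly than %>%, so 2 ^ 2 is computed first and then piped
identical(2 ^ 2 %>% sum(1), sum(2 ^ 2, 1))
```

Both comparisons should come back TRUE.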

Whatever goes through the pipe becomes the first argument of the function after the pipe. This is convenient, because the dplyr functions we have seen take a data.frame as the first argument and produce a data.frame as their output. Since R continues reading onto the next line when a line ends with %>%, we can put each function on a new line, which RStudio will automatically indent, making everything easy to read. Now each line represents a step in a sequential operation. You can read this as “Take the gapminder data.frame, filter to the rows where lifeExp is greater than 80, and arrange by gdpPercap.”

gapminder %>%
    filter(lifeExp > 80) %>%
    arrange(gdpPercap)

## # A tibble: 21 x 8
##             country continent  year lifeExp       pop gdpPercap
##              <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
##  1      New Zealand   Oceania  2007  80.204   4115771  25185.01
##  2           Israel      Asia  2007  80.745   6426679  25523.28
##  3            Italy    Europe  2002  80.240  57926999  27968.10
##  4            Italy    Europe  2007  80.546  58147733  28569.72
##  5            Japan      Asia  2002  82.000 127065841  28604.59
##  6            Japan      Asia  1997  80.690 125956499  28816.58
##  7            Spain    Europe  2007  80.941  40448191  28821.06
##  8           Sweden    Europe  2002  80.040   8954175  29341.63
##  9 Hong Kong, China      Asia  2002  81.495   6762476  30209.02
## 10           France    Europe  2007  80.657  61083916  30470.02
## # ... with 11 more rows, and 2 more variables: total_gdp <dbl>,
## #   log_gdp <dbl>
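If you want to convince yourself that the three styles really do the same work, a quick comparison along these lines should help. The variable names (step1, with_intermediate, nested, piped) are made up for this sketch.

```r
library(dplyr)
library(gapminder)

# Style 1: intermediate variable
step1 = filter(gapminder, lifeExp > 80)
with_intermediate = arrange(step1, gdpPercap)

# Style 2: nested calls
nested = arrange(filter(gapminder, lifeExp > 80), gdpPercap)

# Style 3: pipes
piped = gapminder %>%
    filter(lifeExp > 80) %>%
    arrange(gdpPercap)

# All three should hold exactly the same data
identical(with_intermediate, nested)
identical(nested, piped)
```

The choice between them is about readability, not about the result.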

Making your code easier for humans to read will save you lots of time. The human reading it is usually future-you, and operations that seem simple when you’re writing them will look like gibberish when you’re three weeks removed from them, let alone three months or three years, or when another person has to read them. Make your code as easy to read as possible by using the pipe where appropriate, leaving white space, using descriptive variable names, being consistent with spacing and naming, and liberally commenting code.

Challenge: Data Reduction with Pipes

Copy the code you (or the instructor) wrote to solve the previous MCQ Data Reduction challenge. Rewrite it using pipes (i.e. no assignment and no nested functions)

`summarize()`

Often we want to calculate a new variable, but rather than keeping each row as an independent observation, we want to group observations together to calculate some summary statistic. To do this we need two functions, one to do the grouping and one to calculate the summary statistic: group_by and summarize. By itself group_by doesn’t change the contents of a data.frame; it just sets up the grouping. summarize then goes over each group in the data.frame and does whatever calculation you want. E.g. suppose we want the average per-capita GDP across countries for each year. While we’re at it, let’s calculate the mean and median and see how they differ.

gapminder %>%
    group_by(year) %>%
    summarize(mean_gdp = mean(gdpPercap), median_gdp = median(gdpPercap))

## # A tibble: 12 x 3
##     year  mean_gdp median_gdp
##    <int>     <dbl>      <dbl>
##  1  1952  3725.276   1968.528
##  2  1957  4299.408   2173.220
##  3  1962  4725.812   2335.440
##  4  1967  5483.653   2678.335
##  5  1972  6770.083   3339.129
##  6  1977  7313.166   3798.609
##  7  1982  7518.902   4216.228
##  8  1987  7900.920   4280.300
##  9  1992  8158.609   4386.086
## 10  1997  9090.175   4781.825
## 11  2002  9917.848   5319.805
## 12  2007 11680.072   6124.371
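To see the division of labor between the two functions, it can help to look at each on its own. The sketch below uses the hypothetical name by_year and dplyr’s n_groups helper to count the groups.

```r
library(dplyr)
library(gapminder)

# group_by alone keeps every row and column; it only records the grouping
by_year = group_by(gapminder, year)
dim(gapminder)
dim(by_year)
n_groups(by_year)   # one group per distinct year

# Without grouping, summarize collapses everything into a single row
summarize(gapminder, mean_gdp = mean(gdpPercap))

# With grouping, summarize returns one row per group
summarize(by_year, mean_gdp = mean(gdpPercap))
```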

Shoutout Q: Note that summarize eliminates any other columns. Why? What else can it do? E.g. What country should it list for the year 1952!?

There are several different summary statistics that can be generated from our data. Base R provides many built-in functions such as mean, median, min, max, and range. By default, these functions return NA when the vector they operate on contains missing data. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm=TRUE (rm stands for remove). An alternate option is to use the function is.na(), which evaluates to TRUE if the value passed to it is missing (NA). This function is more useful as a part of a filter, where you can filter out every row that has a missing value in a given column. For that purpose you would do something like

gapminder %>%
  filter(!is.na(someColumn))

The ! symbol negates it, so we’re asking for everything that is not an NA.
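The gapminder data has no missing values, so here is a small sketch on a made-up vector (called x for illustration) showing the same ideas in base R:

```r
x = c(2, NA, 4, 6)        # made-up values with one missing entry

mean(x)                   # NA, because one value is missing
mean(x, na.rm = TRUE)     # 4, the mean of the three known values
is.na(x)                  # FALSE TRUE FALSE FALSE
x[!is.na(x)]              # keeps only the non-missing values: 2 4 6
```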

We often want to calculate the number of entries within a group. E.g. we might wonder if our dataset is balanced by country. We can do this with the n() function, or dplyr provides a count function as a convenience:

gapminder %>%
    group_by(country) %>%
    summarize(number_entries = n())

## # A tibble: 142 x 2
##        country number_entries
##         <fctr>          <int>
##  1 Afghanistan             12
##  2     Albania             12
##  3     Algeria             12
##  4      Angola             12
##  5   Argentina             12
##  6   Australia             12
##  7     Austria             12
##  8     Bahrain             12
##  9  Bangladesh             12
## 10     Belgium             12
## # ... with 132 more rows

count(gapminder, country)

## # A tibble: 142 x 2
##        country     n
##         <fctr> <int>
##  1 Afghanistan    12
##  2     Albania    12
##  3     Algeria    12
##  4      Angola    12
##  5   Argentina    12
##  6   Australia    12
##  7     Austria    12
##  8     Bahrain    12
##  9  Bangladesh    12
## 10     Belgium    12
## # ... with 132 more rows
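Scanning 142 rows by eye is tedious, so one possible way to answer the “is it balanced?” question directly is to ask whether every country has the same count. The name entries is made up for this sketch.

```r
library(dplyr)
library(gapminder)

entries = count(gapminder, country)

# If the data are balanced, there is only one distinct count
unique(entries$n)
all(entries$n == 12)
```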

We can also do multiple groupings. Suppose we want the maximum life expectancy in each continent for each year. We group by continent and year and calculate the maximum with the max function:

gapminder %>%
    group_by(continent, year) %>%
    summarize(longest_life = max(lifeExp))

## # A tibble: 60 x 3
## # Groups:   continent [?]
##    continent  year longest_life
##       <fctr> <int>        <dbl>
##  1    Africa  1952       52.724
##  2    Africa  1957       58.089
##  3    Africa  1962       60.246
##  4    Africa  1967       61.557
##  5    Africa  1972       64.274
##  6    Africa  1977       67.064
##  7    Africa  1982       69.885
##  8    Africa  1987       71.913
##  9    Africa  1992       73.615
## 10    Africa  1997       74.772
## # ... with 50 more rows

Hmm, we got the longest life expectancy for each continent-year, but we didn’t get the country. To get the country, we have to ask R “Where lifeExp is at a maximum, what is the entry in country?” For that we use the which.max function. max returns the maximum value; which.max returns the location of the maximum value.

max(c(1, 7, 4))

## [1] 7

which.max(c(1, 7, 4))

## [1] 2
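The location is useful because we can use it to index a second vector of the same length. A tiny sketch with made-up vectors (scores and item_names are hypothetical):

```r
scores = c(1, 7, 4)
item_names = c("a", "b", "c")

# which.max(scores) is 2, so this picks the second name
item_names[which.max(scores)]   # "b"
```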

Now, back to the question: Where lifeExp is at a maximum, what is the entry in country?

gapminder %>%
    group_by(continent, year) %>%
    summarize(longest_life = max(lifeExp), country = country[which.max(lifeExp)])

## # A tibble: 60 x 4
## # Groups:   continent [?]
##    continent  year longest_life   country
##       <fctr> <int>        <dbl>    <fctr>
##  1    Africa  1952       52.724   Reunion
##  2    Africa  1957       58.089 Mauritius
##  3    Africa  1962       60.246 Mauritius
##  4    Africa  1967       61.557 Mauritius
##  5    Africa  1972       64.274   Reunion
##  6    Africa  1977       67.064   Reunion
##  7    Africa  1982       69.885   Reunion
##  8    Africa  1987       71.913   Reunion
##  9    Africa  1992       73.615   Reunion
## 10    Africa  1997       74.772   Reunion
## # ... with 50 more rows

Challenge – Part 1

Calculate a new column: the total GDP of each country in each year.

Calculate the variance, using var(), of countries’ GDPs in each year.

Is country-level GDP getting more or less equal over time?

Challenge – Part 2

Modify the code you just wrote to calculate the variance in both country-level GDP and per-capita GDP.

Do both measures support the conclusion you arrived at above?

Resources

That is the core of dplyr’s functionality, but it does more. RStudio makes a great cheatsheet that covers all the dplyr functions we just learned, plus what we will learn in the next lesson: keeping data tidy.

Challenge solutions

Solution to challenge Subsetting

- Subset the gapminder data to only Oceania countries post-1980.
- Remove the continent column
- Make a scatter plot of gdpPercap vs. population colored by country

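library(gapminder)
library(tidyverse)  # provides filter(), select(), and ggplot()

# Keep only the Oceania rows after 1980, then drop the continent column
oc1980 = filter(gapminder, continent == "Oceania" & year > 1980)
oc1980less = select(oc1980, -continent)

# Scatter plot of per-capita GDP vs. population, colored by country
ggplot(oc1980less, aes(x = gdpPercap, y = pop, color = country)) +
    geom_point()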

Advanced: How would you determine the median population for the North American countries between 1970 and 1980?

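library(gapminder)
library(tidyverse)

# The parentheses matter: & binds more tightly than |, so without them
# the year condition would apply only to Puerto Rico
noAm = filter(gapminder,
              (country == "United States" | country == "Canada" |
                   country == "Mexico" | country == "Puerto Rico") &
                  (year > 1970 & year < 1980))
noAmPop = select(noAm, pop)

# select() returns a data.frame, so flatten it to a vector before taking the median
median(unlist(noAmPop))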

Bonus: This can be done using base R’s subsetting, but this class doesn’t teach how. Do the original challenge without the filter and select functions. Feel free to consult Google, helpfiles, etc. to figure out how.

# Wrap the country conditions in parentheses so the year condition applies to all of them
noAm2 = gapminder[((gapminder$country == "United States") |
                       (gapminder$country == "Mexico") |
                       (gapminder$country == "Canada") |
                       (gapminder$country == "Puerto Rico")) &
                      ((gapminder$year > 1970) & (gapminder$year < 1980)), ]
median(noAm2$pop)

Solution to challenge MCQ: Data Reduction

Produce a data.frame with only the names, years, and per-capita GDP of countries where per capita gdp is less than a dollar a day sorted from most- to least-recent.

- Tip: The gdpPercap variable is annual gdp. You’ll need to adjust.
- Tip: For complex tasks, it often helps to use pencil and paper to write/draw/map the various steps needed and how they fit together before writing any code.

What is the annual per-capita gdp, rounded to the nearest dollar, of the first row in the data.frame?

a. $278
b. $312
c. $331
d. $339

dailyGDP = mutate(gapminder, onedayGDP = gdpPercap / 365)
dailyGDP = filter(dailyGDP, onedayGDP < 1)
dailyGDP = select(dailyGDP, country, year, gdpPercap)
# Sort from most- to least-recent, as the challenge asks
dailyGDP = arrange(dailyGDP, desc(year))
dailyGDP[1,]

Advanced: Use dplyr functions and ggplot to plot per-capita GDP versus population for North American countries after 1970. Once you’ve made the graph, transform both axes to a log10 scale. There are two ways to do this, one by creating new columns in the data frame, and another using functions provided by ggplot to transform the axes. Implement both, in that order. Which do you prefer and why?

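# Parentheses around the country conditions make the year filter apply to every country
noAm = filter(gapminder,
              (country == "United States" | country == "Canada" |
                   country == "Mexico" | country == "Puerto Rico") &
                  year > 1970)

# Log10 axes using ggplot's scale functions
ggplot(noAm, aes(x = gdpPercap, y = pop, color = country)) +
    geom_point() +
    scale_x_log10() +
    scale_y_log10()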

Challenge: Data Reduction with Pipes

Copy the code you (or the instructor) wrote to solve the previous MCQ Data Reduction challenge. Rewrite it using pipes (i.e. no assignment and no nested functions)

# Previous challenge without pipes
dailyGDP = mutate(gapminder, onedayGDP = gdpPercap / 365)
dailyGDP = filter(dailyGDP, onedayGDP < 1)
dailyGDP = select(dailyGDP, country, year, gdpPercap)
dailyGDP = arrange(dailyGDP, desc(year))  # most- to least-recent

# The same steps with pipes
smallGDP = gapminder %>%
    mutate(onedayGDP = gdpPercap / 365) %>%
    filter(onedayGDP < 1) %>%
    select(country, year, gdpPercap) %>%
    arrange(desc(year))
smallGDP[1,]

OR, more fancy (without an intermediate temp variable):

(gapminder %>%
    mutate(onedayGDP = gdpPercap / 365) %>%
    filter(onedayGDP < 1) %>%
    select(country, year, gdpPercap) %>%
    arrange(desc(year)))[1,]
