Exploratory data analysis with Yeast RNAseq data

Data

This tutorial was written using data from Schurch et al.[1], where WT and MUT yeast were compared to test the underlying distribution of RNA-seq data, and to test which differential expression programs work best. They had 672 samples, we’ll use 6 of them :)

Prepare for analysis

Differential Expression

Since our data appears to be of high quality, let’s perform differential expression with edgeR like we did before. We need to read in the proper count data, because edgeR expects raw counts, not counts that have been corrected in anyway. We will allow edgeR to perform correction for library size, and then estimate dispersion.

This will give us a DGE object. We can look at the DGE object and see some of the magic of edgeR.

head(dge$samples)

##                                      group lib.size norm.factors
## data/rnaseq/ERR458493.fastq.gz.quant    wt   963028    0.8632726
## data/rnaseq/ERR458494.fastq.gz.quant    wt   955258    0.8649983
## data/rnaseq/ERR458495.fastq.gz.quant    wt   950704    0.8644693
## data/rnaseq/ERR458500.fastq.gz.quant   mut  1620091    1.1574783
## data/rnaseq/ERR458501.fastq.gz.quant   mut  1616974    1.1568370
## data/rnaseq/ERR458502.fastq.gz.quant   mut  1611768    1.1569170

head(dge$counts)

##          Samples
## Tags      data/rnaseq/ERR458493.fastq.gz.quant
##   YAL001C                                   61
##   YAL002W                                   44
##   YAL003W                                 1009
##   YAL005C                                 1864
##   YAL007C                                   86
##   YAL008W                                   40
##          Samples
## Tags      data/rnaseq/ERR458494.fastq.gz.quant
##   YAL001C                                   39
##   YAL002W                                   40
##   YAL003W                                  965
##   YAL005C                                 1916
##   YAL007C                                   93
##   YAL008W                                   34
##          Samples
## Tags      data/rnaseq/ERR458495.fastq.gz.quant
##   YAL001C                                   27
##   YAL002W                                   32
##   YAL003W                                  948
##   YAL005C                                 1775
##   YAL007C                                   77
##   YAL008W                                   46
##          Samples
## Tags      data/rnaseq/ERR458500.fastq.gz.quant
##   YAL001C                                  126
##   YAL002W                                  106
##   YAL003W                                 1103
##   YAL005C                                 1481
##   YAL007C                                  219
##   YAL008W                                   87
##          Samples
## Tags      data/rnaseq/ERR458501.fastq.gz.quant
##   YAL001C                                  129
##   YAL002W                                  117
##   YAL003W                                 1105
##   YAL005C                                 1454
##   YAL007C                                  198
##   YAL008W                                  107
##          Samples
## Tags      data/rnaseq/ERR458502.fastq.gz.quant
##   YAL001C                                  121
##   YAL002W                                   96
##   YAL003W                                 1033
##   YAL005C                                 1489
##   YAL007C                                  240
##   YAL008W                                  116

Lastly, we perform differential expression

Data Visualization

Differential expression outputs a large file, which may leave you thinking…what next??

MDS Plot

We can make an MDS plot to understand how our samples relate to one another. From the edgeR manual: “a multi-dimensional scaling plot of the RNA samples in which distances correspond to leading log-fold-changes between each pair of RNA samples. The leading log-fold-change is the average (root-mean-square) of the largest absolute log-foldchanges between each pair of samples. This plot can be viewed as a type of unsupervised clustering.” Our MUT samples cluster more closely to each other than to the WT samples. Does this agree with the sniff tests from pairwise comparisons we performed above?

plotMDS(dge, labels=labels)

MA Plot

We can also make an MA plot. From wikipedia: “an MA plot is an application of a Bland–Altman plot for visual representation of genomic data. The plot visualises the differences between measurements taken in two samples, by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values.”

etp$table$logFC = -etp$table$logFC # because our comparison is mut vs. wt, let's flip the tags for our plot
plot(etp$table$logCPM, etp$table$logFC, xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3, col = ifelse( etp$table$FDR < .2, "red", "black" ) )

Histogram

We can also make a histogram of the corrected p value (FDR). This shows that almost an equal number of differentially expressed and non-differentially expressed genes.

library(ggplot2)
ggplot(etp$table, aes(FDR)) + geom_histogram(binwidth = .01) + ggtitle("Histogram of FDR for all Differentially Expressed Genes")

Heatmaps

Heatmaps are also a very popular way to visualize data. Let’s make a heat map of the 75 most variable genes!

# Load library for heatmap
library(gplots)
heatmap.2(highly_variable_norm_counts, col=rev(morecols(50)), trace="none", 
              main="Top 75 most variable genes across samples", scale="row",
              labCol = c("WT", "WT", "WT", "MUT", "MUT", "MUT"))

Let’s do the same thing again, but this time we will plot the top 75 least variable genes. Even though we did cpm normalization, a lot of these genes are still very lowly expressed.

This is important to keep in mind when interpretting what this heatmap means.

# Load library for heatmap
library(gplots)
heatmap.2(not_variable_norm_counts, col=rev(morecols(50)), trace="none", 
              main="Top 75 least variable genes across samples", scale="row",
              labCol = c("WT", "WT", "WT", "MUT", "MUT", "MUT"))

This lesson is adapted from the ANGUS: Analyzing High Throughput Sequencing Data Exploratory RNAseq data analysis using RMarkdown materials written by Taylor Reiter and Marian Schmidt.

References

1. Nicholas J. Schurch MG Pietá Schofield, Barton GJ. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA. 2016; 839–851. doi:10.1261/rna.053959.115