Differential gene expression analysis

The main goal of most RNA-seq experiments is to discover which genes are differentially expressed between different groups (treatments, tissues, genotypes): the list of differentially expressed genes (DEGs). After the previous section, we now have a count table with the number of reads that map to each gene in each sample. How do we get to our goal from this table? We will need to use statistical models! In this section, we will use the DESeq2 package in R for differential gene expression analysis. Several other packages with different statistical models and assumptions exist (e.g. EdgeR and Limma): we pick DESeq2 because it is robust, widely-used, and user friendly.

Reading the count table into DESeq2

In this tutorial, we will explore the transcriptomes of A. thaliana plants that experienced microgravity by growing on the International Space Station, while control plants were grown on earth. The experimental design is simple: there are three replicates in both conditions space_flight and ground_control.

Note

Download the data from the repository of this workshop. Store the files in a folder on your computer. See below for a good workshop folder setup:

RNA-seq-workshop
├── data/
   ├── GLDS38_raw_counts.csv
   └── samples_to_condition.csv
├── data_processed/
   └── Normalized counts will go here
├── results/
   └── Graphs will go here
└── RNA-seq.R

First we will load the packages that we need:

R
set.seed(1992)

library(DESeq2)
library(tidyverse)
library(ggrepel)

Then, load the count table and metadata file:

R
raw_counts <- read.csv("data/GLDS38_raw_counts.csv", header = T, stringsAsFactors = F)

raw_counts <- raw_counts %>% column_to_rownames("gene")

metadata <- read.csv("data/samples_to_condition.csv", header = T)

raw_counts[1:4,1:4]
1
Make sure you use the correct path to where your data is stored. Depending on your folder structure, you may not need the ../ and you could use data/GLDS38_raw_counts.csv instead.
2
Here, we store the column gene as row names instead of a dedicated column.
          Atha_WT_Col_0_sl_FLT_Rep1_G1S1 Atha_WT_Col_0_sl_FLT_Rep2_G1S2
AT1G01010                            339                            383
AT1G01020                            126                            117
AT1G01030                            153                            158
AT1G01040                           1238                            442
          Atha_WT_Col_0_sl_FLT_Rep3_G1S3 Atha_WT_Col_0_sl_GC_Rep1_G2S1
AT1G01010                            363                           650
AT1G01020                             89                            60
AT1G01030                            143                            97
AT1G01040                            543                           783

That looks good! Now, it’s important to note that DESeq2 expects the sample names (columns in count table) to exactly match the sample names in the metadata file, and be in the same order! For this small dataset, we can inspect that by eye. In addition, and probably useful for larger datasets, we can use the all() function to check this.

R
head(metadata)
                     sample_name      condition
1 Atha_WT_Col_0_sl_FLT_Rep1_G1S1   space_flight
2 Atha_WT_Col_0_sl_FLT_Rep2_G1S2   space_flight
3 Atha_WT_Col_0_sl_FLT_Rep3_G1S3   space_flight
4  Atha_WT_Col_0_sl_GC_Rep1_G2S1 ground_control
5  Atha_WT_Col_0_sl_GC_Rep2_G2S2 ground_control
6  Atha_WT_Col_0_sl_GC_Rep3_G2S3 ground_control
R
all(colnames(raw_counts) == metadata$sample)
1
If this does not return TRUE, you need to reorder or rename sample names in one of your files.
[1] TRUE

Creating the dds object

We are now ready to create a DESeqDataSet object, commonly abbreviated as dds. The dds object will contain our count tables and metadata, but later, also the normalized counts and differentially expressed gene lists. As such, the dds objects help us keep things neat in our RStudio session. To make one, we need to specify our experimental design ‘formula’. In this tutorial, there’s only one variable: the design formula will be as simple as ~ condition. However, in multi-factor experiments it can include additional variables, can also include unwanted sources of variation such as RNA isolation batch, the researcher who extracted RNA, or on which table the plants were grown. Including these factors in the design formula will help DESeq2 to account for these soures of variation, allowing more accurate estimation of the primary condition’s effect. For example, in an experiment with a potential batch effect, treatments, and different genotypes: ~ batch + treatment + genotype. If you also want to model the interaction, that is whether the treatment effect varies by genotype, change the + to a *: ~ batch + treatment * genotype.

R
dds <- DESeqDataSetFromMatrix(countData = raw_counts, 
                              colData = metadata, 
                              design = ~ condition)
Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors

Assesing the quality of an RNA-seq experiment via PCA

As essential step in RNA-seq analysis is to inspect similarity between samples. In particular, we should confirm that replicates with the same treatment are similar to each other, and make sure that the experimental condition is the major source of variation in the data. In addition, these quality-control explorations will also help identify if any samples behalve as outliers, or whether there may have been a sample swap. We will use Principal Component Analysis (PCA) to do this. PCA is a dimensionality reduction technique that transforms complex high-dimensional data (like expression of thousands of genes) into a limited number of new variables (‘principal components’) that capture the most variation in the dataset.

Performing variance stabilization

Before performing the PCA itself, we need to take an import feature of RNA-seq data into account: the variance of a gene is strongly correlated to the expression level of the gene. In statistics language, our data is not homoscedastic, while PCA assumes homoscedastic data. We can solve this by performing a variance stabilizing transformation vst():

R
variance_stabilized_dataset <- vst(dds, blind = TRUE)

Let’s inspect the average expression and standard deviation of each gene to show that this transformation worked. In the following plots, each dot represents one A. thaliana gene:

Show the code to make the plots
R
library(patchwork)

without_vst <- raw_counts %>% 
    as.data.frame() %>% 
    rownames_to_column("gene") %>% 
    pivot_longer(cols = - gene, names_to = "sample", values_to = "count") %>% 
    group_by(gene) %>% 
    summarise(gene_mean = mean(count), gene_sd = sd(count)) %>% 
    ungroup() %>% 
    ggplot(aes(x = log10(gene_mean), y = log10(gene_sd))) +
    geom_point(alpha = 0.2) +
    labs(x = "Gene count average\n(log10 scale)",
        y = "Gene count standard deviation\n(log10 scale)") +
    ggtitle("No variance stabilization")
    
variance_stabilised_counts <- assay(variance_stabilized_dataset)

with_vst <- variance_stabilised_counts %>% 
  as.data.frame() %>% 
  rownames_to_column("gene") %>% 
  pivot_longer(cols = - gene, names_to = "sample", values_to = "count") %>% 
  group_by(gene) %>% 
  summarise(gene_mean = mean(count), gene_sd = sd(count)) %>% 
  ungroup() %>% 
  ggplot(aes(x = log10(gene_mean), y = log10(gene_sd))) +
  geom_point(alpha = 0.2) +
  labs(x = "Gene count average\n(log10 scale)",
       y = "Gene count standard deviation\n(log10 scale)") +
  ggtitle("Variance stabilized")

without_vst | with_vst

Show the code to make the plots
variance_stabilised_counts_df <- variance_stabilised_counts %>%
  as.data.frame() %>% 
  rownames_to_column("gene")  

write.csv(variance_stabilised_counts_df, "data_processed/variance_stabilized_dataset.csv", row.names=FALSE)

Indeed, we can observe that genes that are highly expressed (have high mean count) also have a high standard deviation. This correlation is no longer there after stabilizing the variance.

Performing the PCA

Okay, finally we are ready to perform the PCA. DESeq2 makes this very easy for us with a simple function, plotPCA(), which directly gives us a PCA plot.

R
plotPCA(variance_stabilized_dataset)
using ntop=500 top features by variance
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the DESeq2 package.
  Please report the issue to the authors.

Let’s break this plot down:

  • We see that principal component 1 (PC1) explains 60% of the variance, while PC2 explains 29%. PC1 does not seem to separate our two conditions. This means that there’s a source of variation in this dataset that we are seemingly unaware of.
  • We see that PC2 nicely separates our two conditions. This is good news.
  • We see that two samples from space_flight cluster very closely together, while one sample is quite a distant from those two. This means that this one replicate behaves a bit differently than the rest. However, since it’s still similar to the other two samples on the PC2 axis, this does not worry me.

If you want to have full control and make the PCA plot yourself in ggplot, you can add returnData=TRUE)

R
pca_data <- plotPCA(variance_stabilized_dataset, returnData=TRUE) 

While it is impossible to give examples of all situations that can occur in PCAs, we highlight a few below in fake PCA plots:

Show the code to make the plots
R
df_swap <- data.frame(
  PC1 = c(rnorm(5, mean = 1, sd = 0.2),   
          rnorm(5, mean = -2, sd = 0.2)), 
  PC2 = c(rnorm(5, mean = 0.4, sd = 0.2),
          rnorm(5, mean = 0.3, sd = 0.2)),
  genotype = c(rep("WT", 4), "mutant", rep("mutant", 4), "WT"))

df_weak_sep <- data.frame(
  PC1 = c(rnorm(5, mean = 0.2, sd = 0.45),   
          rnorm(5, mean = 0, sd = 0.6)), 
  PC2 = c(rnorm(5, mean = 0, sd = 0.25),
          rnorm(5, mean = 0, sd = 0.25)),
  genotype = rep(c("WT", "mutant"), each = 5)
)

p1 <- df_swap %>% ggplot(aes(x = PC1, y = PC2, colour = genotype)) + geom_point() +
  xlab("PC1 (48%)") +
  ylab("PC2 (13%)") +
  ggtitle("Sample swap")

p2 <- df_weak_sep %>% ggplot(aes(x = PC1, y = PC2, colour = genotype)) + geom_point() +
    xlab("PC1 (12%)") +
    ylab("PC2 (4%)") +
    ggtitle("Weak separation")

p1 | p2

In the first plot, we see one WT sample clustering with mutant samples, and vice versa. This is a clear indication that two samples were swapped somewhere in the process: during sampling, RNA extraction, cDNA synthesis, library prep, or in the metadata file. If you can trace this back in your labjournal, you could swap the sample label back. If not… it’s probably better to discard these two samples completely. In the second plot, we can see that there’s no clear separation between WT and mutant samples. In addition, the two PCs explain little of the variance present in the dataset. This is an indication that the genotype actually has little impact on the transcriptome. While worrying, this does not mean that all is lost! You can still proceed to differential expression analysis, maybe the difference between the two genotypes is quite subtle.

Show the code to make the plots
R
df_confounding_1 <- data.frame(
  PC1 = c(rnorm(5, mean = 1, sd = 0.45),   
          rnorm(5, mean = -1, sd = 0.6)), 
  PC2 = c(rnorm(5, mean = 0, sd = 0.25),
          rnorm(5, mean = 0, sd = 0.25)),
  genotype = rep(c("WT", "mutant"), each = 5),
  gender = rep(c("male", "female"), each = 5)
)

p1 <- df_confounding_1 %>% ggplot(aes(x = PC1, y = PC2, colour = genotype)) + geom_point() +
  xlab("PC1 (48%)") +
  ylab("PC2 (13%)") +
  ggtitle("Genotype effect ...")

p2 <- df_confounding_1 %>% ggplot(aes(x = PC1, y = PC2, colour = gender)) + geom_point() +
  xlab("PC1 (48%)") +
  ylab("PC2 (13%)") +
  ggtitle("... or gender effect?")

p1 | p2

In this example, we see separation of our wildtype and mutant samples. Experiment succesful! … or is it? Upon closer inpection, we can see that gender of our samples also separates our samples in the same way. It turns out that all wildtypes were male mice, and all mutants were female. We will therefore never know if differentially expressed genes are caused by the genotype, or simply by the gender of the mice: a clear case of confounding variable. This is an experimental design flaw, and should have been caught before sampling. Yet, it happens!

Question

Besides PCA, how else could you assess whether replicate samples from the same treatment show similar results?

We can use a correlation analysis to do this. For this analysis, we will also need the variance stabilized counts.

R
library(pheatmap)

correlation_matrix <- cor(variance_stabilised_counts)

metadata_2 <- metadata %>% column_to_rownames("sample_name")

pheatmap(correlation_matrix, annotation_row = metadata_2, 
         clustering_distance_rows = "correlation", 
         clustering_distance_cols = "correlation", 
         display_numbers = TRUE, fontsize = 7)

We can see that all the ground_control and space_flight samples cluster together. For example, ground_control samples have a 0.98 or greater correlation with each other, while they have a slightly lower correlation with space_flight samples. However, the samples still have a high correlation also accross the different treatment. This shows that the transcriptomes are actually highly similar, regardless of treatment, probably because the majority of genes is not differentially expressed.

Differential gene expression analysis

DESeq2 handles all steps of DEG analysis, from sample normalization (e.g., to account for difference in sequencing depth per sample) to the statistical models and tests in one function. Easy! We run this function with the dds object as input, while storing the output in the dds object as well. In this way, we will ‘fill’ the dds object with the new analysis. R will now print all the individual steps that the DESeq() function performed for us.

R
dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

Let’s write the normalized counts to a file:

R
norm_counts <- counts(dds, normalized = TRUE)

norm_counts <- norm_counts %>% 
  as.data.frame() %>%
  rownames_to_column("gene_id") 

write.csv(norm_counts, file = "data_processed/normalized_counts.csv", row.names = FALSE)
Question

How would you visualize the effect of the normalization on the distribution of read counts?

You would make a boxplot of all the read counts per sample, for both raw counts and normalized counts.

R
# Get the raw counts and the normalized counts
raw_counts <- counts(dds, normalized=FALSE)
normalized_counts <- counts(dds, normalized=TRUE) 

# Convert matrices to tidy format to visualize difference between raw and scaled counts
tidy_raw <-  raw_counts %>% 
    as.data.frame() %>% 
    rownames_to_column("Gene") %>% 
    pivot_longer(cols = -Gene, names_to = "Sample", values_to = "counts") %>% 
    mutate(dataset = "raw")

tidy_normalized <- normalized_counts %>% 
  as.data.frame() %>% 
  rownames_to_column("Gene") %>% 
  pivot_longer(cols = -Gene, names_to = "Sample", values_to = "counts") %>% 
  mutate(dataset = "normalized")

tidy_total <- rbind(tidy_raw, tidy_normalized)

# boxplot shows that the medians of the samples moved closer to log10(2)
tidy_total %>% 
  ggplot(aes(x = Sample, y = log10(counts + 1))) +
  geom_boxplot(aes(fill = dataset)) + facet_wrap(~ dataset)

Notice how the median read counts of the normalized datasets are closer together than those of the raw read counts.

Lists of DEGs

Now, to extract the list of DEGs from the dds we run another line of code. The argument alpha is used to specify the p-value cutoff for significance, the default value is alpha = 0.1. We will use 0.05 here. We will also sort the table on p-value:

R
res <- results(dds, alpha = 0.05)

summary(res)

results <- res %>%
  as.data.frame() %>% 
  rownames_to_column("genes") %>%
  arrange(padj) 

head(results) 

write.csv(results, 'data_processed/all_genes_spaceflight_vs_ground.csv', quote = FALSE, row.names = FALSE)
1
Generate results table
2
Print a quick summary of the results: how many genes are significantly differentially expressed?
3
Turn results table into a data frame, generate a genes column from the rownames, make a new column with -log10 transformed p-values, then sort by adjusted p-value.
4
Write to a file!

out of 30151 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 1792, 5.9%
LFC < 0 (down)     : 1557, 5.2%
outliers [1]       : 16, 0.053%
low counts [2]     : 7564, 25%
(mean count < 8)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

      genes  baseMean log2FoldChange     lfcSE      stat       pvalue
1 AT5G04120  456.5672       3.626601 0.2462679  14.72625 4.373346e-49
2 AT2G28780 1964.3983      -3.630964 0.3295509 -11.01792 3.132194e-28
3 AT1G62280  217.8345      -3.901515 0.3558769 -10.96310 5.749324e-28
4 AT1G32450 5241.0272      -3.468312 0.3201042 -10.83495 2.350980e-27
5 AT1G30530  708.2953       2.033682 0.1938999  10.48831 9.776462e-26
6 AT1G15380  739.4577      -2.714855 0.2601611 -10.43529 1.710992e-25
          padj
1 9.871079e-45
2 3.534838e-24
3 4.325600e-24
4 1.326599e-23
5 4.413291e-22
6 6.436465e-22

There is a row for each gene in this dataframe. We’ll discuss the most important columns below:

  • baseMean: is the mean of the normalized counts, across all samples.
  • log2FoldChange: is a way to describe how much the gene expression changes between the two conditions tested.
  • pvalue and padj: the result from the Wald test to test whether the expression is different between the two conditions tested. pvalue is the ‘raw’, uncorrected value, while padj is adjusted for multiple testing (of thousands of genes).
Question
  1. What is the biological meaning of a log2 fold change of 1?
  2. Similarly, what is the biological meaning of a log2 fold change of -1?
  3. Compute the log2FoldChange (“treated vs untreated”) of a gene that has an average expression of 230 in treated condition, and 75 in untreated condition.

1 and 2:

A normal fold change would be calculated as such:

\[ \frac{\text{mean expression B}}{\text{mean expression A}} \]

If a gene’s expression is not changed, the fold change would be 1, a fold change of 2 means double expression, while 0.5 means that the expression is halved. Taking the log2() of the fold change:

\[ \log_2\left(\frac{B}{A}\right) \]

helps to make the fold change symmetric and easier to interpret:

log2FC Meaning Fold change
+2 Large increase
+1 Increase
0 No change
−1 Decrease 0.5× (half)
−2 Large decrease 0.25×

3:

log2(230/75)
[1] 1.616671

Shrinkage of fold changes

DESeq2 allows for the shrinkage of log2FoldChanges values towards zero if the information for a gene is low (e.g. low amount of counts). See DESeq2 vignette for more information. In this workshop, we will skip this step.

Extra contrasts

In our simple experimental setup, only one contrast is possible: space_flight vs ground_control. That’s why would could simply call res <- results(dds, alpha = 0.05) to generate our results table. If we would have had another condition, for example, the plants were also grown with Martian gravity levels, we could have specified the contrast we wanted to test:

R
contrast_to_test <-  c("condition", "martian_gravity", "ground_control")
res_martian_vs_ground <- results(dds, contrast=contrast_to_test, alpha = 0.05)

Volcano plots

One way to visualize DEG results is to display them in a Volcano plot. Such a plot shows a measure of effect size (log2FoldChange) versus a measure of significance (padj). There are tools available (developed by Joachim Goedhart, assistant professor at SILS) to help you make such a plot using the DEG list we just saved as a .csv file. Alternatively, we can make one ourselves for full control of the plot:

R
# Define fold change and p-value cutoffs
lfc_cutoff <- 1
padj_cutoff <- 0.05

# Make new categorical variable containing significance information
results <- results %>% 
        mutate(significance = case_when(
          padj < padj_cutoff & log2FoldChange > lfc_cutoff ~ 'Significantly upregulated',
          padj < padj_cutoff & log2FoldChange < -lfc_cutoff ~ 'Significantly downregulated',
          padj < padj_cutoff ~ 'Significant but small effect size',
          TRUE ~ 'Not significant'
        ))

DEGs <- results %>% filter(significance == "Significantly upregulated" | significance == "Significantly downregulated")
write.csv(DEGs, 'data_processed/DEGs_spaceflight_vs_ground.csv', quote = FALSE, row.names = FALSE)

colors <- c("Significantly upregulated" = "#E69F00", 
            "Significantly downregulated" = "#56B4E9", 
            "Not significant" = "gray80", 
            "Significant but small effect size" = 'grey50')

# select top 10 genes to highlight
top_genes <- results[1:10, ]

volcano <- results %>% 
  ggplot(aes(x = log2FoldChange, y = -log10(padj), colour = significance)) +
  geom_point(alpha = 0.5, size = 0.8) + 
  geom_hline(aes(yintercept = -log10(padj_cutoff)), linetype = "dashed") +
  geom_vline(aes(xintercept = lfc_cutoff), linetype = "dashed") +
  geom_vline(aes(xintercept = -lfc_cutoff), linetype = "dashed") +
  geom_point(data = top_genes, shape = 21,fill = NA, color = "black") +  # Highlight top10
  geom_text_repel(data = top_genes, aes(label = genes), size = 2, min.segment.length = 0) +
  scale_color_manual(values=colors) +
  xlim(c(-10,10)) +
  theme_bw() 

ggsave("results/volcano_plot.png", volcano, width = 14, height = 8, units = "cm")
Warning: Removed 10262 rows containing missing values or values outside the scale range
(`geom_point()`).
volcano 
Warning: Removed 10262 rows containing missing values or values outside the scale range
(`geom_point()`).

We can plot the DESeq2-normalized counts of two genes, just to confirm whether the volcano plot is correct. We pick one that is highly upregulated in space_flightconditions (AT5G04120), and one that is strongly downregulated (AT1G62280).

R
gene_1 <- plotCounts(dds, gene="AT5G04120", intgroup="condition", 
                returnData=TRUE)

gene_2 <- plotCounts(dds, gene="AT1G62280", intgroup="condition", 
                        returnData=TRUE)

gene_1 %>% ggplot(aes(x = condition, y = count, colour = condition)) +
  geom_jitter(width = 0.05) +
  theme_bw()

gene_2 %>% ggplot(aes(x = condition, y = count, colour = condition)) +
  geom_jitter(width = 0.05) +
  theme_bw()

Yep, that seems about right.

Heatmap

Alternatively, we could also display the expression levels of the DEGs in a heatmap. With the following code, we make a ‘tidy’ dataframe from the variance stabilised counts, and select only the genes that we consider DEGs. Remember, these are the genes that passed both a padj threshold and log2FoldChange threshold above.

R
variance_stabilised_df <- variance_stabilised_counts %>% 
  as.data.frame() %>% 
  rownames_to_column("gene") %>% 
  pivot_longer(cols = - gene, names_to = "sample", values_to = "count")

selection <- DEGs$genes

vsd_selection <- variance_stabilised_df %>% 
                    filter(gene %in% selection) %>%
                    merge(metadata, by.x = "sample", by.y = "sample_name")

Then, we draw a heatmap using the tidyheatmaps package. Essentially, this is a tidyverse-style wrapper of the more famous heatmap package pheatmap.

R
library(tidyheatmaps)

heatmap <- tidyheatmap(df = vsd_selection,
            rows = gene,
            columns = sample,
            values = count,
            scale = "row",
            annotation_col = c(condition),
            gaps_col = condition,
            cluster_rows = TRUE,
            # cluster_cols = TRUE,
            color_legend_n = 7,
            show_rownames = FALSE,
            show_colnames = FALSE
)
heatmap
1
Scaling is very important here! Try removing this line, and see what happens to the heatmap. It will be dominated by some genes that are very highly expressed.
2
This argument makes sure that genes with a similar expression pattern are clustered together in the heatmap.

That looks pretty good. We can see that there are two clusters of genes: those expressed higher in the spaceflight condition, and those higher expressed in the ground_control condition. We can ask tidyheatmaps to gather genes into such clusters by adding the kmeans_k argument:

R
heatmap <- tidyheatmap(df = vsd_selection,
            rows = gene,
            columns = sample,
            values = count,
            scale = "row", 
            annotation_col = c(condition),
            gaps_col = condition,
            kmeans_k = 2,
            cluster_rows = TRUE, 
            # cluster_cols = TRUE,
            color_legend_n = 7,
            show_rownames = TRUE,
            show_colnames = FALSE
)
heatmap
1
This argument is added to perform k-means clustering and gather different genes into n clusters, in this case, 2 clusters.
2
Since we will be plotting just two clusters, we can set show_rownames to TRUE.

This simplifies our heatmap into two clusters, as expected. As you can imagine, more complex (and more interesting!) clustering patterns may emerge when the experimental design gets more complicated.

What’s next

So far, we managed to find lists of genes that we find interesting. For example, we could be interested in all genes that are differentially expressed between two conditions, or choose to zoom in to only the upregulated or downregulated genes. In the next section, we will look into characterizing lists of genes using overrepresentation analysis. For example, we will look at GO-term enrichment.