Enrichment analysis

In the previous episode, we ended up with lists of differentially expressed genes. We could examine this list manually and look for our favorite genes of interest. However, the power of RNA-seq is that we can move beyond a simple analysis of a few genes and start to understand which processes or pathways are involved in the response to your treatments. To do this, we will use overrepresentation analysis (ORA) of Gene Ontology (GO) terms.

What is a GO term exactly?

GO terms offer a standardized way of describing gene functions and the biological processes they participate in. For example, a GO term could be: nucleotide binding or response to biotic stress. GO terms come in three categories:

Biological Process (BP)
Molecular Function (MF)
Cellular Component (CC)

What is overrepresentation analysis exactly?

In the R package that we will use, ORA uses a hypergeometric test to determine whether a particular GO term is enriched in your list of genes. Conceptually, it answers the following question:

Given my background list of genes, are genes annotated to this GO term observed more often in my input gene list than expected by chance?

The background list of genes, sometimes called the ‘universe’ in ORA jargon, is important. One could decide to use all genes present in annotation of your organism as universe. However, the universe is defined as all the genes that were elegible to be tested for differential expression. For an RNA-seq experiment, this means that genes that were not detected in any of your samples should be removed from the background universe for ORA.

Let’s get started

In case you continued directly from the previous episode, you will only need to load a few more packages.

clusterProfiler is the main package for enrichment analysis in R, extensively documented here.
enrichplot() is a package to visualize ORA results from clusterProfiler.
org.At.tair.db is a database containing functional annotation for Arabidopsis thaliana (At). Functional annotations are available for several model organisms such as Homo sapiens (org.Hs.tair.db), see Bioconductor for all available organisms.

library(clusterProfiler)
library(org.At.tair.db)
library(enrichplot)

In case you didn’t manage to finish the previous episode, or you’re starting here for other reasons, you can run the following code to get back to the starting point of this episode by running. We will focus on the DEGs of the aba treated samples

Show the code to get to the starting point of this episode

set.seed(1992)

# Load packages
library(DESeq2)
library(tidyverse)
library(ggrepel)
library(clusterProfiler)
library(org.At.tair.db)
library(enrichplot)

# Read the data
raw_counts <- read.csv("data/Deforges_2019-raw-counts.csv", header = T, stringsAsFactors = F) 
raw_counts <- raw_counts %>% column_to_rownames("Geneid") 
metadata <- read.csv("data/Deforges_2019-experiment-design.csv", header = T)

# Generate DDS object and run DESeq function:
dds <- DESeqDataSetFromMatrix(countData = raw_counts, 
                              colData = metadata, 
                              design = ~ condition)

dds <- DESeq(dds)

# Run DEG analysis for all hormone treatments for all hormone treatments vs control
contrast_to_test <-  c("condition", "aba", "control")
res_aba_vs_control <- results(dds, contrast=contrast_to_test, alpha = 0.05)

contrast_to_test <-  c("condition", "auxin", "control")
res_auxin_vs_control <- results(dds, contrast=contrast_to_test, alpha = 0.05)

contrast_to_test <-  c("condition", "meja", "control")
res_meja_vs_control <- results(dds, contrast=contrast_to_test, alpha = 0.05)

contrast_to_test <-  c("condition", "acc", "control")
res_acc_vs_control <- results(dds, contrast=contrast_to_test, alpha = 0.05)

# Create DEG lists for all hormone treatments vs control
padj_cutoff = 0.05
lfc_cutoff = 1.5

DEGs_aba <- res_aba_vs_control %>% 
  as.data.frame() %>%
  rownames_to_column("geneID") %>% 
  filter(padj < padj_cutoff) %>%
  filter(abs(log2FoldChange) > lfc_cutoff) %>% 
  pull(geneID)

DEGs_meja <- res_meja_vs_control %>% 
  as.data.frame() %>%
  rownames_to_column("geneID") %>% 
  filter(padj < padj_cutoff) %>%
  filter(abs(log2FoldChange) > lfc_cutoff) %>% 
  pull(geneID)

DEGs_auxin <- res_auxin_vs_control %>% 
  as.data.frame() %>%
  rownames_to_column("geneID") %>% 
  filter(padj < padj_cutoff) %>%
  filter(abs(log2FoldChange) > lfc_cutoff) %>% 
  pull(geneID)

DEGs_acc <- res_acc_vs_control %>% 
  as.data.frame() %>%
  rownames_to_column("geneID") %>% 
  filter(padj < padj_cutoff) %>%
  filter(abs(log2FoldChange) > lfc_cutoff) %>% 
  pull(geneID)

Let’s remind ourselves of what our list of genes of interest now looks like. This list is stored in the variable DEGs_aba:

head(DEGs_aba)

[1] "AT1G01120" "AT1G01240" "AT1G01470" "AT1G01520" "AT1G01560" "AT1G01570"

length(DEGs_aba)

[1] 2044

Ok, this list contains of 2044 entries of Arabidopsis TAIR gene IDs.

GO enrichment

We can perform the GO enrichment analysis with the enrichGO() function, and then plot the results using dotplot()

go_results <- enrichGO(
  gene          = DEGs_aba,         # Your gene list
  OrgDb         = org.At.tair.db,    # Arabidopsis TAIR10 annotation
  keyType       = "TAIR",            # TAIR10 gene IDs (AT1Gxxx)
  ont           = "BP",              # "BP" (Biological Process), "MF", "CC"
  pAdjustMethod = "BH",              # Benjamini-Hochberg correction
  pvalueCutoff  = 0.05,              # Significance threshold
  qvalueCutoff  = 0.05,
  readable      = TRUE               # Convert gene IDs to common names
)

dotplot(go_results, showCategory = 12, x = "FoldEnrichment")

For example, this plot shows us that genes with the GO annotation glucosinolate biosynthetic process occur about 5 times as often in our gene list, compared to random chance (e.g., picking 2044 random genes from our background gene universe). Glucosinolates are defensive compounds of plants, that’s an interesting result! The color of the plotted dots indicates the adjusted p-value, while the size of the dot corresponds to the number of genes belonging to this GO term in our gene list.

Question

Come up with a way to get a better list of the background ‘universe’. Store this list in a variable called universe_list.
Repeat the BP GO enrichment with this universe and see if it changes your results. Hint: you have to add an argument to the enrichGO() function: universe = universe_list.

Solution

universe_list <- res_aba_vs_control %>% 
                  as.data.frame() %>%
                  rownames_to_column("gene") %>%  
                  filter(baseMean > 3) %>% 
                  pull(gene)

head(universe_list)

[1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01046" "AT1G01050"

length(universe_list)

[1] 23316

go_results <- enrichGO(
  gene          = DEGs_aba,
  universe      = universe_list,         
  OrgDb         = org.At.tair.db,    
  keyType       = "TAIR",            
  ont           = "BP",              
  pAdjustMethod = "BH",             
  pvalueCutoff  = 0.05,              
  qvalueCutoff  = 0.05,
  readable      = TRUE               
)

dotplot(go_results, showCategory = 10, x = "FoldEnrichment")

In this case, not so much changed compared to our earlier BP enrichment analysis, although some categories changed a little bit in their FoldEnrichment value. At least, we will sleep better tonight, knowing that we performed a statiscally sound ORA.

KEGG enrichment

KEGG terms describe gene functions at the level of pathways. While there will be quite some overlap with GO terms, it can happen that KEGG terms happen to describe what’s happening in your data a bit more clearly. Running and visualizing KEGG enrichment works very similar as GO term enrichment, but first we will need to find the ‘KEGG code’ of the organism we are studying, by searching by the scientific name.

kegg_result <- search_kegg_organism('Arabidopsis thaliana', by='scientific_name')
head(kegg_result)

    kegg_code      scientific_name common_name
565       ath Arabidopsis thaliana thale cress

Okay, the KEGG code of Arabidopsis thaliana is ath.

KEGG_result <- enrichKEGG(gene = DEGs_aba,
                                universe = universe_list,
                                organism = "ath",
                                minGSSize = 10,
                                maxGSSize = 500,
                                pAdjustMethod = "BH",
                                qvalueCutoff = 0.05,
                                use_internal_data = FALSE)

Reading KEGG annotation online: "https://rest.kegg.jp/link/ath/pathway"...

Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/ath"...

dotplot(KEGG_result, x = "FoldEnrichment")

GSEA (Gene Set Enrichment Analysis)

In the GO and KEGG ORA sections, we started with a binary gene list, that is, ‘significant’ vs ‘not significant’. GSEA is another enrichment test that employs a different strategy. Here, we start with a ranked list of all the genes in our experiment. They might be ranked by Log2FoldChange, or adjusted p-value. Then, GSEA answers the following conceptual question:

Are genes from a certain pathway consistently found toward the top or bottom of the ranked list?

In other words, ORA requires us to make a choice about a threshold of significance and/or fold change, while GSEA is a ‘threshold free’ method.

First, we will prepare the data structure to perform GSEA:

results_aba <- res_aba_vs_control %>% 
                as.data.frame() %>%
                rownames_to_column("gene")

gene_list <- results_aba$log2FoldChange
names(gene_list) <- results_aba$gene

gene_list <- gene_list[!is.na(gene_list)]
gene_list <- sort(gene_list, decreasing = TRUE)

1: First, we take all the fold change values and store it in a list
2: Then, we add the gene IDs as names of the list items
3: We remove all entries where there was no valid fold change values (NA)
4: Then, we sort the list.

Let’s now run the GSEA:

gsea_go <- gseGO(
  geneList     = gene_list,
  OrgDb        = org.At.tair.db,
  keyType      = "TAIR",
  ont          = "BP",
  minGSSize    = 10,
  maxGSSize    = 500,
  pvalueCutoff = 0.05,
  verbose      = FALSE
)

gseaplot2(
  gsea_go,
  geneSetID = 1,
  title = gsea_go@result$Description[1]
)

1: By picking 1 here, we will plot the GO term with the lowest p-value. Make sure to explore a few more than just the first one, by changing the 1 to other numbers. You can inspect all significant GO terms in table format by using summary(gsea_go).

This graph is rather complex but very interesting! Let’s break the most important parts down:

The green line shows the Running Enrichment Score, and the highest point is enrichment score of this GO category. If the peak is at the left, this pathway is enriched among up-regulated genes.
Each vertical tick represents one gene from this GO term. So, there is clustering of (in this case) response to water deprivation annotated genes in the top end of the ranked list.

Exercise

Perform GO term enrichment analysis for different gene sets. It’s up to you what you would like to analyze, but here are some ideas:

We have access to 4 different hormone treatments, perform GO term analysis for each hormone treatment.
Perform GO term analysis on upregulated and downregulated gene lists seperately, for example for aba. Can you identify different biological processes in each analysis?

Solution

Just change input gene list from DEGs_aba to e.g. DEGs_meja.

KEGG_result <- enrichKEGG(gene = DEGs_meja,
                                universe = universe_list,
                                organism = "ath",
                                minGSSize = 10,
                                maxGSSize = 500,
                                pAdjustMethod = "BH",
                                qvalueCutoff = 0.05,
                                use_internal_data = FALSE)

dotplot(KEGG_result, x = "FoldEnrichment")

# Get just the UP regulated genes
DEGs_aba_up <- res_aba_vs_control %>% 
  as.data.frame() %>%
  rownames_to_column("geneID") %>% 
  filter(padj < padj_cutoff) %>%
  filter(log2FoldChange > lfc_cutoff) %>% 
  pull(geneID)

# Get just the DOWN regulated genes
DEGs_aba_down <- res_aba_vs_control %>% 
  as.data.frame() %>%
  rownames_to_column("geneID") %>% 
  filter(padj < padj_cutoff) %>%
  filter(log2FoldChange < -lfc_cutoff) %>% 
  pull(geneID)

KEGG_result <- enrichKEGG(gene = DEGs_aba_up,
                                universe = universe_list,
                                organism = "ath",
                                minGSSize = 10,
                                maxGSSize = 500,
                                pAdjustMethod = "BH",
                                qvalueCutoff = 0.05,
                                use_internal_data = FALSE)

dotplot(KEGG_result, x = "FoldEnrichment")

KEGG_result <- enrichKEGG(gene = DEGs_aba_down,
                                universe = universe_list,
                                organism = "ath",
                                minGSSize = 10,
                                maxGSSize = 500,
                                pAdjustMethod = "BH",
                                qvalueCutoff = 0.05,
                                use_internal_data = FALSE)

dotplot(KEGG_result, x = "FoldEnrichment")

Very interesting! glucosinolate biosynthesis apparently is downregulated by ABA treatment in Arabidopsis roots.

Performing enrichment analysis on webservers

It’s also possible to perform GO term enrichment on webservers. While less reproducible than doing GO term enrichment with code, it is indeed a fast way to explore many gene lists. On these webservers, you can simply paste a list of geneIDs, select your organism, and visually inspect the results quickly.

Some popular options are:

A big advantage of such webservers is that it contains GO annotations of hundreds of organisms, including many that are not available on Bioconductor for R-based analysis.

Concluding remarks

This concludes the bioDSC workshop on RNA-seq. We went all the way from experimental design to mapping reads to differential gene expression in R. If you want to practice more with existing datasets, continue to the next episode.