Session 1.5. 16S rRNA analysis using dada2
The next step after generating sequencing data is the analysis of the raw data. For that we will use a pipeline that takes raw 16S rRNA sequencing data as input and generates a table of unique 16S sequences and their abundances in the samples. This table is often called an OTU table or an ASV table.
Learning Goal
The goal of this session is to understand the primary analysis of 16S sequencing data and execute the analysis pipeline described below on a small set of samples.
Input Data
Raw sequencing data is usually encoded in the FASTQ format, which combines the sequence itself with a value describing the quality of every single base.
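For illustration, here is what a single (made-up) FASTQ record looks like; every record spans exactly four lines:
@READ_1 1:N:0:1
ACGTACGTACGTACGT
+
IIIIIIIIIIIIFFFF
Line 1 starts with @ and holds the read identifier, line 2 holds the base calls, line 3 is a separator starting with +, and line 4 encodes one quality character per base.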
The FASTQ files serve as input for our pipeline. There are always 2 files per sample, one for the R1 reads and one for the R2 reads (see lecture). You can find the FASTQ files corresponding to 2 samples from the AML dataset in the following location (use the Terminal, not the R Console):
ls -l ../../551-1119-00L_masterdata/tutorials/dada2/00_rawData
Exercise:
Use linux commands to access one of the files and have a look at its content. Can you describe the different parts of the content?
Use linux commands to access all files and count the number of sequences in each. How would you count sequences in a FASTQ file?
Hint: you may want to check the wc linux command in Google or in the linux cheatsheet provided in the first session.
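One possible solution, building on the hint (the files are gzip-compressed, so they need to be decompressed before counting; each FASTQ record spans exactly four lines):
for f in ../../551-1119-00L_masterdata/tutorials/dada2/00_rawData/*.fastq.gz; do
  echo "$f: $(( $(zcat $f | wc -l) / 4 )) sequences"
done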
Sequence Adapter Removal
Targeted sequencing, such as 16S sequencing, is executed with specific primers that are used to amplify only the piece of DNA of interest. These primers, even though they are part of the biological sequence, need to be removed prior to analysis. There are two main reasons for that:
The sequence read at the region where the primer binds will most likely be identical to the primer itself. Even if a real DNA sequence in the sample differed from the primer, that difference would not be detected, because the concentration of primers in the PCR reaction is much higher than that of the template sequences.
The primers need to be detected to ensure that all sequences analyzed downstream correspond to the exact same 16S region and that all sequences are properly aligned to the same position.
For that purpose we use the tool cutadapt. It takes the primer sequences and the FASTQ files as input and removes the primers, which are in most cases at the beginning of the reads. We won't be running this initial step, but you have the FASTQ files with the primers already removed in ../../551-1119-00L_masterdata/tutorials/dada2/01_cutAdapt.
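For illustration only (this step has already been run for you), a paired-end cutadapt call could look like the following sketch; the primer sequences shown are the commonly used 341F/805R pair and serve as placeholders, not necessarily the primers used for this dataset:
# -g / -G specify the 5' primers expected on the R1 / R2 reads;
# --discard-untrimmed drops read pairs in which no primer was found
cutadapt -g CCTACGGGNGGCWGCAG -G GACTACHVGGGTATCTAATCC \
  --discard-untrimmed \
  -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
  raw_R1.fastq.gz raw_R2.fastq.gz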
Exercise:
Use linux commands to access all files and count the number of sequences in each. What proportion of sequences had the primers and thus were kept at this step?
Quality Control
Sequencers have massively improved in throughput and quality in recent years. However, they still produce erroneous bases. These usually appear at the end of the reads, but entire cycles can also fail and produce low-quality bases; in that case all reads have lower quality at that specific position.
In order to remove sequences of bad quality, and to trim the low-quality tails of otherwise good sequences, we use the filterAndTrim routine from the dada2 package.
# 1. Load the dada package
library(dada2)
# 2. Create an output directory
dir.create("./dada2/02_filterAndTrim/",recursive=T)
# 3. Execute the command:
infqgz1 <- '../../551-1119-00L_masterdata/tutorials/dada2/01_cutAdapt/AML_Mock_03_SUSHI_METAB_R1.fastq.gz'
infqgz2 <- '../../551-1119-00L_masterdata/tutorials/dada2/01_cutAdapt/AML_Mock_03_SUSHI_METAB_R2.fastq.gz'
outfqgz1 <- './dada2/02_filterAndTrim/AML_Mock_03_SUSHI_METAB_R1.fastq.gz'
outfqgz2 <- './dada2/02_filterAndTrim/AML_Mock_03_SUSHI_METAB_R2.fastq.gz'
filterAndTrim(fwd=infqgz1, filt=outfqgz1, rev=infqgz2, filt.rev=outfqgz2,
              matchIDs=TRUE,       # keep only read pairs whose IDs match
              maxEE=2,             # discard reads with more than 2 expected errors
              truncQ=3,            # truncate at the first base with quality <= 3
              maxN=0,              # discard reads containing ambiguous bases (N)
              rm.phix=TRUE,        # remove reads matching the PhiX spike-in
              compress=TRUE, verbose=TRUE, multithread=1,
              minLen=150,          # discard reads shorter than 150 bp after trimming
              trimRight=c(40,40))  # trim 40 bases off the right end of R1 and R2
Analyze the other pair of FASTQ files in the same way.
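Alternatively, filterAndTrim() accepts vectors of input and output files, so both samples can be processed in one call. A sketch, assuming the file naming pattern shown above; the return value, captured here as track, holds the read counts before and after filtering:
in.dir  <- '../../551-1119-00L_masterdata/tutorials/dada2/01_cutAdapt'
out.dir <- './dada2/02_filterAndTrim'
r1 <- sort(list.files(in.dir, pattern='_R1.fastq.gz$', full.names=TRUE))
r2 <- sort(list.files(in.dir, pattern='_R2.fastq.gz$', full.names=TRUE))
track <- filterAndTrim(fwd=r1, filt=file.path(out.dir, basename(r1)),
                       rev=r2, filt.rev=file.path(out.dir, basename(r2)),
                       matchIDs=TRUE, maxEE=2, truncQ=3, maxN=0, rm.phix=TRUE,
                       compress=TRUE, verbose=TRUE, multithread=1,
                       minLen=150, trimRight=c(40,40))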
Exercise:
How many sequences make it through the quality control?
Do the two samples lose the same amount of sequences?
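If you captured the return value of filterAndTrim() as in the sketch above, the answer is one command away:
track                                      # reads.in / reads.out per sample
track[,'reads.out'] / track[,'reads.in']   # fraction of reads surviving quality control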
Learn the Error Model
The dada2 tool that we will use for this analysis uses a statistical approach that aims to predict whether sequences are true biological sequences or were partly generated by errors that appear during sequencing.
This step learns the error model of this sequencing run; the model is then used in the next step.
This step has to be run twice: once with orientation set to 'R1' and once with 'R2'.
library(ggplot2)
dir.create("./dada2/03_learnErrors/", recursive=TRUE)
# set to 'R1' for the forward reads, then re-run the whole block with 'R2'
orientation <- 'R1'
outfile <- paste0('./dada2/03_learnErrors/', orientation, '.errors.rds')
outfile.plot <- paste0(outfile, '.pdf')
samples <- list.files("./dada2/02_filterAndTrim/", pattern=paste0('_', orientation), full.names=TRUE)
# learn the run-specific error rates from (a subset of) the filtered reads
err <- learnErrors(samples, nbases=10000, multithread=FALSE, randomize=TRUE, verbose=1)
saveRDS(err, file=outfile)
# plot observed error rates vs. quality scores, with the fitted model overlaid
plot <- plotErrors(err, nominalQ=TRUE)
ggsave(outfile.plot, plot=plot)
Exercise:
Inspect the plot that this function generates. What do you see in this plot? What should it look like?
Sample Inference
This step is the actual core of the dada2 tool. dada2 inspects every sequence and decides, based on the error model, whether it is a real biological sequence with no errors or a sequence that contains errors.
This step has to be run twice: once with orientation set to 'R1' and once with 'R2'.
dir.create("./dada2/04_sampleInference/",recursive=T)
samples <- list.files("./dada2/02_filterAndTrim/",pattern="_{orientation}",full.names = T)
outfile.dd <- './dada2/04_sampleInference/sampleInference_{orientation}.rds'
err.rds <- './dada2/03_learnErrors/{orientation}.errors.rds'
err <- readRDS(err.rds)
dd <- dada(samples, err=err, pool='pseudo', multithread = FALSE)
saveRDS(dd, file = outfile.dd)
Exercise:
You’re lucky. No exercises at this step
Read Merging
So far we have been working at the read level. This means that all steps were executed for the R1 and R2 files individually. In this step we will merge reads that come from the same insert (see lecture).
dir.create("./dada2/05_mergeReads/",recursive=T)
samples.r1 <- list.files("./dada2/02_filterAndTrim/",pattern="_R1",full.names = T)
samples.r2 <- list.files("./dada2/02_filterAndTrim/",pattern="_R2",full.names = T)
infile.r1 <- './dada2/04_sampleInference/sampleInference_R1.rds'
infile.r2 <- './dada2/04_sampleInference/sampleInference_R2.rds'
outfile <- './dada2/05_mergeReads/merged_seqtab.rds'
dd.r1 <- readRDS(infile.r1)
dd.r2 <- readRDS(infile.r2)
# align and merge each denoised R1/R2 pair into a single insert sequence
mergers <- mergePairs(dd.r1, samples.r1, dd.r2, samples.r2, verbose = TRUE)
# build an abundance matrix: one row per sample, one column per unique sequence
seqtab.m <- makeSequenceTable(mergers)
saveRDS(seqtab.m, file = outfile)
Exercise:
How many sequences could be merged?
Are there differences between samples?
Hint: Use the output object (seqtab.m)
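Two commands that may help, using the seqtab.m object created above:
dim(seqtab.m)      # number of samples x number of unique merged sequences
rowSums(seqtab.m)  # total number of successfully merged inserts per sample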
Chimera (Bimera) Removal
One step of the preparation of the sequencing library is the amplification of the 16S rRNA fragments. This step is prone to generate so-called chimeras: sequences that were formed from two or more template sequences during the PCR step. They are not biological sequences, but they are also not detectable by sequence quality. dada2 targets the most common case, chimeras formed from exactly two parent sequences, hence the name bimera. These sequences are filtered out in this step.
dir.create("./dada2/06_noBimera/",recursive=T)
wbim.file <- './dada2/05_mergeReads/merged_seqtab.rds'
nobim.file <- './dada2/06_noBimera/nobimera_seqtab.rds'
wbim.tab <- readRDS(wbim.file)
# flag a sequence as chimeric if it can be reconstructed from two more abundant "parent" sequences
nobim.tab <- removeBimeraDenovo(wbim.tab, method="pooled", multithread=FALSE, verbose=TRUE)
saveRDS(nobim.tab, file = nobim.file)
Exercise:
How many sequences were removed?
Are there differences between samples?
Compare how many reads/inserts made it through the pipeline.
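A sketch for this comparison, using the two sequence tables from this step:
sum(wbim.tab); sum(nobim.tab)          # total inserts before / after chimera removal
rowSums(wbim.tab); rowSums(nobim.tab)  # per-sample insert counts before / after
ncol(wbim.tab) - ncol(nobim.tab)       # number of sequence variants flagged as chimeric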
Taxonomic annotation
Once we have inferred the ASVs, we need to taxonomically annotate them. For that purpose we will use the SILVA database, which we will download.
DO NOT RUN THE CODE BELOW YET. IT TAKES A LONG TIME, SO YOU WON'T BE ABLE TO RUN ANYTHING ELSE IN THE MEANTIME.
dir.create("./dada2/07_taxonomy/",recursive=T)
taxa.file<-'./dada2/07_taxonomy/taxa.rds'
download.file(url = "https://zenodo.org/record/3986799/files/silva_nr99_v138_wSpecies_train_set.fa.gz?download=1",destfile = "./dada2/07_taxonomy/silva_nr99_v138_wSpecies_train_set.fa.gz")
# classify each ASV against the SILVA reference using dada2's naive Bayesian classifier
taxa <- assignTaxonomy(nobim.tab, "./dada2/07_taxonomy/silva_nr99_v138_wSpecies_train_set.fa.gz", multithread=F)
saveRDS(taxa, file = taxa.file)
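Once assignTaxonomy() has finished, a quick way to inspect the result; the row names of taxa are the full ASV sequences, which makes the table unwieldy, so they are dropped for display:
taxa.print <- taxa
rownames(taxa.print) <- NULL  # drop the long ASV sequences for readability
head(taxa.print)              # one row per ASV: Kingdom down to Species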