# Session 1.5. 16S amplicon pipeline

## Starting specifications
We are going to:

- define the working directory (`WD`).
- create an `out` folder where results are going to be saved.
- define the path to the taxonomic database.
- load the required modules and check the software versions (a sketch for the check follows the code block below).

The subfolder `data` has to contain FASTQ files with `_R1_` and `_R2_` within the file names.
```bash
WD=/nfs/practical/BlockCourses/551-1119-00L/home/biolcourse-70
DATA=/nfs/practical/BlockCourses/551-1119-00L/masterdata/16S/test_raw/

# Create the output folder
out=${WD}/results
db=/nfs/practical/BlockCourses/551-1119-00L/masterdata/silva/SILVA_132_SSURef_Nr99_tax_silva_trunc.fasta
mkdir -p $out

# Load required modules
ml USEARCH
ml cutadapt

# Number of threads; ${threads} is used by the commands below
threads=4
export OMP_NUM_THREADS=${threads}
```
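A minimal sketch for the version check and a sanity check of the input data (assuming the modules above put `usearch` and `cutadapt` on the PATH):

```bash
# Print the versions of the loaded tools
usearch -version
cutadapt --version

# List the raw FASTQ files that will be processed
ls ${DATA}/*_R1.fastq
```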
## Merging paired reads

We are going to execute the `fastq_mergepairs` command within USEARCH. The `fastq_mergepairs` command merges (assembles) paired-end reads to create consensus sequences. You can find more information in the USEARCH documentation.
- `relabel`: gets the sample identifier from the FASTQ file name by truncating at the first underscore (`_`) or period (`.`).
- `fastq_maxdiffs`: maximum number of mismatches in the alignment.
- `fastq_maxdiffpct`: maximum number of mismatches as an integer percentage.
- `fastq_minovlen`: minimum overlap for merging.
```bash
# Merging paired reads...
usearch -fastq_mergepairs ${DATA}/*_R1.fastq -fastqout $out/merged.fq -fastq_minovlen 16 -relabel @ -fastq_pctid 90 -fastq_maxdiffs 300 -threads ${threads} 2> $out/merge.log
# Done merging reads
```
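To get a quick feel for how well merging went, you can count the merged reads and skim the log (a minimal sketch; a FASTQ record spans four lines):

```bash
# Count merged read pairs
echo "$(( $(wc -l < $out/merged.fq) / 4 )) merged reads"

# USEARCH writes its merging statistics to the log
tail $out/merge.log
```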
## Filtering merged reads

Quality filtering is performed by executing the `fastq_filter` command within USEARCH. More info in the USEARCH documentation.
- `fastq_maxee`: discard reads with > E total expected errors for all bases in the read.
- `fastq_minlen`: minimum length of sequence after filtering.
```bash
# Filtering merged reads
usearch -fastq_filter $out/merged.fq -fastq_maxee 0.1 -fastq_minlen 100 -fastaout $out/filtered.fa -threads ${threads} 2>> $out/filter.log
# Done filtering merged reads
```
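The expected-error count E of a read is the sum of the per-base error probabilities implied by its quality scores, E = sum of 10^(-Q/10), so `-fastq_maxee 0.1` keeps only near error-free reads. To count the survivors:

```bash
# Count the sequences that passed quality filtering
grep -c '^>' $out/filtered.fa
```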
## Selecting reads with perfect primer matches

Only the reads with a perfect match to the primers are kept. This is done by executing Cutadapt.
```bash
# Selecting reads with perfect primer matches:
# -g matches the forward primer, -a the (reverse-complemented) reverse primer;
# -e 0 with -O set to the full primer length requires a perfect, full-length match
cutadapt --discard-untrimmed -g GTGYCAGCMGCCGCGGTAA -O 19 -e 0 --minimum-length 100 -o $out/filtered.tmp.F.fa $out/filtered.fa &> $out/primermatch.log
cutadapt --discard-untrimmed -a ATTAGAWACCCBNGTAGTCC -O 20 -e 0 --minimum-length 100 -o $out/filtered_primermatch.fa $out/filtered.tmp.F.fa &>> $out/primermatch.log
rm $out/filtered.tmp.F.fa

# Reverse complement the reads and repeat the selection, to rescue reads
# sequenced in the opposite orientation
usearch -fastx_revcomp $out/filtered.fa -label_suffix _RC -fastaout $out/filtered.tmp.RC.fa &>> $out/primermatch.log
cutadapt --discard-untrimmed -g GTGYCAGCMGCCGCGGTAA -O 19 -e 0 --minimum-length 100 -o $out/filtered.tmp.RC.F.fa $out/filtered.tmp.RC.fa &>> $out/primermatch.log
cutadapt --discard-untrimmed -a ATTAGAWACCCBNGTAGTCC -O 20 -e 0 --minimum-length 100 -o $out/filtered_primermatch_RC.fa $out/filtered.tmp.RC.F.fa &>> $out/primermatch.log
rm $out/filtered.tmp.RC.F.fa
rm $out/filtered.tmp.RC.fa

# Combine both orientations into a single file
cat $out/filtered_primermatch_RC.fa >> $out/filtered_primermatch.fa
rm $out/filtered_primermatch_RC.fa
# Done selecting reads
```
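A quick check of how many reads carried perfect primer matches (both orientations combined):

```bash
# Reads going into primer selection vs. reads retained
echo "before: $(grep -c '^>' $out/filtered.fa)"
echo "after:  $(grep -c '^>' $out/filtered_primermatch.fa)"
```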
## Dereplication

Dereplication is the process of finding the set of unique sequences in an input file. Input is a FASTA or FASTQ file. Sequences are compared letter-by-letter and must be identical over the full length of both sequences. This is done with the USEARCH `fastx_uniques` command. More info in the USEARCH documentation.
- `sizeout`: specifies that size annotations should be added to the output sequence labels.
- `relabel`: specifies a string that is used to re-label the dereplicated sequences.
```bash
# Dereplicating reads
usearch -fastx_uniques $out/filtered_primermatch.fa -minuniquesize 1 -sizeout -relabel Uniq -fastaout $out/uniques.uparse.fa 2> $out/uniques.uparse.log
# Done dereplicating reads
```
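The `;size=N` annotations added by `sizeout` record how many reads collapsed into each unique sequence; the clustering step below uses them as abundances. You can inspect the labels directly:

```bash
# Show the first few dereplicated sequence labels with their size annotations
grep '^>' $out/uniques.uparse.fa | head -n 5
```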
## Clustering

Different clustering methods exist and new ones are continuously developed. Here we will use the UPARSE-OTU algorithm implemented within USEARCH. The `cluster_otus` command performs 97% OTU clustering using the UPARSE-OTU algorithm. More info in the USEARCH documentation.
- `minsize`: specifies a minimum abundance; for example, you can use `-minsize 2` to discard singletons.
- `relabel`: specifies a string that is used to re-label OTUs.
```bash
# Clustering reads (cluster_otus also removes chimeras)
usearch -cluster_otus $out/uniques.uparse.fa -minsize 2 -otus $out/otus.uparse.fa -relabel Otu 2> $out/cluster_otus.uparse.log
# Done clustering reads
```
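To see how many OTUs were produced, and how many input sequences were discarded as chimeric:

```bash
# Number of OTU representative sequences
grep -c '^>' $out/otus.uparse.fa

# Clustering and chimera statistics reported by USEARCH
tail $out/cluster_otus.uparse.log
```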
## Taxonomic annotation of OTUs

We now search each OTU representative sequence against a reference 16S database, in this case the SILVA database.
```bash
# Annotating OTUs: report all top hits at >= 95% identity in blast6 format
usearch -usearch_global $out/otus.uparse.fa -db ${db} -id 0.95 -maxaccepts 500 -maxrejects 500 -strand both -top_hits_only --output_no_hits -blast6out $out/taxsearch_uparse.tax -threads ${threads} 2> $out/taxsearch_uparse.log

# LCA helper: sed collapses all input lines to their longest common prefix,
# awk then drops the last (possibly truncated) ;-separated rank
function lca(){ cat $@ | sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}' | awk -F ";" '{$NF=""; OFS=";"; print $0}'; return; }

# Per OTU: %identity of the first hit (column 3) plus the LCA of the taxonomy
# strings of all its hits (column 2; cut --complement strips the accession)
for i in $(cut -f 1 -d $'\t' $out/taxsearch_uparse.tax | sort | uniq); do id=$(grep -m 1 -P $i'\t' $out/taxsearch_uparse.tax | cut -f 3 -d$'\t'); res=$(grep -P $i'\t' $out/taxsearch_uparse.tax | cut -f 2 -d$'\t' | cut -f 1 -d ' ' --complement | lca); echo -e $i'\t'$res'\t'$id; done > $out/taxonomy_uparse_lca.txt
# Done annotating OTUs
```
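A toy check of the `lca` helper on two hypothetical taxonomy strings that agree down to the phylum:

```bash
printf 'Bacteria;Proteobacteria;Alphaproteobacteria\nBacteria;Proteobacteria;Gammaproteobacteria\n' | lca
# prints: Bacteria;Proteobacteria;
```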
## Quantification of OTU abundances

Once we have generated and annotated the OTUs, we need to quantify the abundance of each OTU by mapping the reads back to the OTU reference sequences. The mapping is done with the `otutab` command, which searches each read against the OTU sequences (a `usearch_global`-style search, here at 97% identity).
```bash
# Quantifying vs OTUs using all filtered reads;
# -sample_delim . takes the sample name from the read label up to the first '.'
usearch -otutab $out/filtered.fa -otus $out/otus.uparse.fa -strand both -id 0.97 -otutabout $out/otutab_uparse.txt -sample_delim . -threads ${threads} &> $out/make_otutab_uparse.log
# Done quantifying vs OTUs using all reads
```
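The resulting table is tab-separated, with one row per OTU and one count column per sample; a quick look:

```bash
# Header plus the first few OTU rows
head -n 5 $out/otutab_uparse.txt
```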
## Using the qsub command

To run the pipeline on the cluster, wrap the commands in a job script that starts with SGE directives (`#$` lines) and submit it with `qsub`:

```bash
#!/bin/bash
# SGE parameters
#$ -cwd                 # run in current directory
#$ -S /bin/bash         # interpreting shell for the job
#$ -N job1              # name of the job
#$ -V                   # export the current environment variables to the job
#$ -pe smp 10           # number of job slots to be reserved
#$ -l h_vmem=16G        # memory required
#$ -e error.log         # error file
#$ -o out.log           # output file
#$ -m abe               # send an email at the beginning, end and if aborted
#$ -M yourmail@ethz.ch

ml R                    # you can load modules in this same script
date                    # execute any command
```
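Assuming the script above is saved as `job1.sh` (a hypothetical file name), it is submitted and monitored like this:

```bash
qsub job1.sh   # submit the job script to the queue
qstat          # check the status of your jobs
```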