Project 1 - DNA-Seq Analysis#
The overall goal of this project is to practice the work with DNA variants. This will include tasks like genome alignment and variant calling. We will be working with real sequencing data, generated from real individuals. In fact, we will be dealing with human DNA-Sequencing data. This project description will guide you through these tasks. However, you might be required to search and use additional information sources available to you.
Project Task#
You are provided with 12 raw DNA-sequencing read sets taken from 3 different individuals, 3 samples per individual. The raw-sequencing samples have been pre-processed to only contain signal for a part of the reference genome, so the data becomes manageable for this kind of in-class project. The three individuals that were sequenced form a so-called trio (that is they are parents and child). For the three samples it is your task to identify which two samples are the parents and which one is the child. As an additional task, you can try to determine which samples belong to father and mother, respectively, and identify the genetic sex of the child.
Although the task is relatively easy to formulate, there are several analysis steps involved to get from raw data to a list of interpretable sequence variants. Below, we provide an overview on the roadmap:
Data preparation and analysis setup
Quality control
Sequence alignment to the reference genome
Alignment post-processing
Variant calling
Variant filtering
Data gathering and analysis
We will provide some guidelines on each of these steps and suggest possible tools you can use to complete them. However, if you would like to try out a different path or follow a tutorial you found online, please do so.
1. Data preparation and analysis setup#
As already mentioned earlier, we recommend thinking out an analysis plan together with a directory structure for analysis. You will be provided with some storage space on the Cousteau compute server that you can use for your outputs. General inputs will also be provided via that storage location.
Project Task P1.1
Draft a directory structure that allows you to implement the above analysis roadmap. You should think about how to hierarchically organise your code/scripts, inputs, outputs and logs.
Here an example structure:
|- inputs
|- outputs
|- logs
|- alignments
|- variants
|- results
|- alignments
|- variants
|- scripts
For each of the tasks you will also be provided with software. As part of preparation, we invite you to review the manuals of the different software to understand what assumptions are made and which parameters can be chosen. If you would like to use different software than the ones suggested here, you would need to install them yourself in user space. For some tasks, we suggest alternatives and you are free to choose which one(s) to use.
Task |
Tool |
Module Command |
Manual |
---|---|---|---|
Quality control |
fastqc |
|
|
Alignment |
bwa-mem2 |
|
|
Alignment post process |
sambamba |
|
|
Variant calling |
freebayes |
|
|
Variant filtering |
vcftools |
|
Note
Often, the software tools provide a quite accessible help page that can most of the time be accessed via -h
or --help
.
For instance, to see the help message of the FastQC software, you would run fastqc --help
.
Project Task P1.2
Familiarize yourself with the above tools and review what they do. Try to identify key parameters and check how to load them via the Module system available on Cousteau. Try whether you can run them (even without input).
In addition, when preparing a new analysis on an HPC system, a central question is always how much resources (number of CPUs and main memory (RAM)) the process will use and how long it will likely run. This is relevant as these resources need to be requested from the HPC scheduler. The manuals of the software can give important hints at how much resources are needed to run. If the manuals should not provide any information, a good strategy is to search for additional online resources or to have a test run on a single file and slowly increase the provided resources until the job is successful.
Note
To save you the work to figure out the resource usage of the bwa-mem2 alignment program on the human genome index, we state here that the RAM usage will be not more than 40GB.
2. Quality control#
Quality control is an integral part of data analysis. It is important to verify the assumptions we ourselves make of the data, but also those that are made by any analysis algorithms. For our project, at least one of the given files does not meet the required quality standards and should be excluded from further analysis. Can you identify which one(s)?
Project Task P1.3
Use a quality control software (e.g., FastQC) to review the quality statistics of the input data. If a sample severely violates the quality criteria, it should be excluded from further analysis. (If you cannot exclude a sample, you should move ahead with caution and take any possible artifacts into account during interpretation.)
As this is data taken from a published research project, you might find that no sample meets all the test criteria. Review the quality criteria that are not met and research online, whether some of them can be remedied.
All read input data is prepared for you at the following location:
/science/scratch/genomics/project1_DNA_inputs/reads
Note
You can use the pre-installed FastQC module on Cousteau, via module load FastQC
3. Sequence alignment to the reference genome#
As discussed in the lecture, the first step for many bioinformatic analysis based on DNA sequencing data (including variant identification) is the alignment of read data to a reference genome of choice. For aligning the given read data to the human reference genome, we will use the bwa-mem2 alignment software, which uses the Burrows-Wheeler transform of the input genome for fast seed finding and then a local Smith-Waterman algorithm for alignment extension.
Project Task P1.4
For each of the given fastq files, generate an alignment against the given human reference
genome using the alignment program bwa-mem2. We provide the genome refernce sequence under
/science/scratch/genomics/project1_DNA_inputs/references
.
Please review the baw-mem2 manual for how to use the alignment program. You will find that the first
step is to generate an index (which is in fact a Burrows-Wheeler transform based index) from the reference genome
sequence. As this index is very large, we have decided to pre-compute and provide it to you:
/science/scratch/genomics/project1_DNA_inputs/indexes/GRCh38.bwa_index
You can the pass the index to bwa-mem2
as
/science/scratch/genomics/project1_DNA_inputs/indexes/GRCh38.bwa_index/GRCh38.primary_assembly.genome.fa
Also bwa-mem2 is available via the software stack as a module:
module load bwa-mem2
Please also make use of the directory structure you have designed at the beginning of this project to name and place your outputs and any analysis scripts in a sensible manner.
It is useful to generate a separate output directory for each of the indidvidual samples.
Note
The bwa-mem2 tool generates output in sam format. To prevent unnecessarily large uncompressed output files, we recommend that you directly pipe the bwa-mem output into samtools for compression. This would look like the following:
bwa-mem2 mem <options> | samtools view -bS -o <out_filename> -
The software samtools
is available via the Module system, using module load
SAMtools
.
Once the alignments have been generated, you need to post-process them to first sort by
coordinate and generate an index. This again can be accomplished using the samtools
software.
4. Alignment post-processing#
Once the alignments have been generated, we will need to post-process them to make them readily usable for variant calling. The reads come out of the aligner in roughly the same order as they were provided as input. (If the alignment was run in parallel, slight deviations might occur.)
For variant calling, it is important to sort the alignments by genomic coordinates, so the different reads spanning the same genomic region are co-localized.
A second aspect of post-processing is de-duplication. It might happen that identical fragments and thus identical reads have been generated from the sequencing. We would like to recognize and mark those reads, as it is useful to give them a lower weight or even ignore them during variant calling.
Project Task P1.5
Making the alignment files accessible for search and direct access, as needed for variant
calling, requires to first sort and then index them. Both tasks can be completed with the
samtools
software. Familiarize yourself with the samtools
module and sort your alignment
files in bam format by coordinate. Once completed, use samtools
to also generate an index
for each sorted alignment file.
Once you have generated sorted and indexed alignment files, you should remove the unsorted bam files to save space.
Project Task P1.6
As described above, marking duplicate reads in the alignment files can be beneficial for variant calling. This is an optional step. That is, variant calling will also work without it.
Familiarize yourself with the Sambamba
module available on Cousteau and use its markdup
sub-command. Apply the sambamba
tool to mark duplicates in your alignment files.
Once you have successfully generated the alignment files cotaining the marked duplicate reads, you should remove the unmarked input alignment files to save space.
5. Variant calling#
To identify sequence variants, we will use also a Bayesian approach (similar to the one discussed in
the lecture). Our analysis will utilize the freebayes
software (see table above).
Project Task P1.7
Review the manual of the freebayes
tool and use it to generate variant calls for each of the
three given samples. Please make sure that you provide all alignment files for each sample at
once to the variant calling procedure. This will generate a single set of variant calls per
sample, integrating the information across all read sets.
Also this software is provided by the Module system on Cousteau. You can load it via:
module load freebayes
Please also note that you will require the reference genome used for alignment (not the index),
which you can find under: /science/scratch/genomics/project1_DNA_inputs/references
.
Question: Why don’t we need the indexed version of the genome?
As output freebayes
will generate variant calls in VCF format. If you would like to learn more
about the format, we recommend having a look at the format specification.
6. Variant filtering#
Even when using specific call settings, many variant callers (including freebayes
will generate
a quite comprehensive call set. That is, the resulting VCF file will also contain variants with a
relatively low quality or insufficient read support. However, each variant call in the file is
enriched with additional information that can be used for filtering and thus for creating a set of
high-confidence variant calls.
Project Task P1.8
Review structure and content of the VCF files output by the variant caller. Understand the descriptive information provided for each variant call and select quantities that can be used for filtering. If you are uncertain about picking a threshold, you can visualize the distribution of the quantity via Python or R.
Use the software vcftools
to filter your vcf files with the thresholds you picked. (The
vctools
documentation can also give you hints on which fields to use for filtering.)
The software vcftools
is pre-installed on Cousteau and available via the Module system. You can
just use module load vctools
to make it available.
7. Data gathering and analysis#
This last step might need the most creativity, as you cannot use a pre-built tool to accomplish the task. As a reminder, our goal is to identify the sample out of the three given which comes from the child and not the parents.
Project Task 1.9
Based on the filtered VCF variant files collect variant statistics that help you determine which of the three given samples is likely the child. Define an analysis strategy and implement it using a language of your choice.
As an additional task, try to extract information from the variant call files that helps you determine which sex each of the three given samples had. That is, to determine which parent is mother and which father and determine whether the child is a son or a daughter. (Just for clarification, this task is concerned with the genomic sex, not with gender identity.)
If you want to analyse your data directly on Cousteau, you can use a Python script. You can
make Python available via module load SciPy-bundle
.