Project 1 - DNA-Seq Analysis#

The overall goal of this project is to practice the work with DNA variants. This will include tasks like genome alignment and variant calling. We will be working with real sequencing data, generated from real individuals. In fact, we will be dealing with human DNA-Sequencing data. This project description will guide you through these tasks. However, you might be required to search and use additional information sources available to you.

Project Task#

You are provided with 12 raw DNA-sequencing read sets taken from 3 different individuals, 3 samples per individual. The raw-sequencing samples have been pre-processed to only contain signal for a part of the reference genome, so the data becomes manageable for this kind of in-class project. The three individuals that were sequenced form a so-called trio (that is they are parents and child). For the three samples it is your task to identify which two samples are the parents and which one is the child. As an additional task, you can try to determine which samples belong to father and mother, respectively, and identify the genetic sex of the child.

Although the task is relatively easy to formulate, there are several analysis steps involved to get from raw data to a list of interpretable sequence variants. Below, we provide an overview on the roadmap:

  1. Data preparation and analysis setup

  2. Quality control

  3. Sequence alignment to the reference genome

  4. Alignment post-processing

  5. Variant calling

  6. Variant filtering

  7. Data gathering and analysis

We will provide some guidelines on each of these steps and suggest possible tools you can use to complete them. However, if you would like to try out a different path or follow a tutorial you found online, please do so.

1. Data preparation and analysis setup#

As already mentioned earlier, we recommend thinking out an analysis plan together with a directory structure for analysis. You will be provided with some storage space on the Cousteau compute server that you can use for your outputs. General inputs will also be provided via that storage location.

Project Task P1.1

Draft a directory structure that allows you to implement the above analysis roadmap. You should think about how to hierarchically organise your code/scripts, inputs, outputs and logs.

Here an example structure:

|- inputs
|- outputs
     |- logs
         |- alignments
         |- variants
     |- results
         |- alignments
         |- variants
|- scripts

For each of the tasks you will also be provided with software. As part of preparation, we invite you to review the manuals of the different software to understand what assumptions are made and which parameters can be chosen. If you would like to use different software than the ones suggested here, you would need to install them yourself in user space. For some tasks, we suggest alternatives and you are free to choose which one(s) to use.

Task

Tool

Module Command

Manual

Quality control

fastqc

module load FastQC

Link

Alignment

bwa-mem2

module load bwa-mem2

Link

Alignment post process

sambamba

module load Sambamba

Link

Variant calling

freebayes

module load freebayes

Link

Variant filtering

vcftools

module load VCFtools

Link

Note

Often, the software tools provide a quite accessible help page that can most of the time be accessed via -h or --help. For instance, to see the help message of the FastQC software, you would run fastqc --help.

Project Task P1.2

Familiarize yourself with the above tools and review what they do. Try to identify key parameters and check how to load them via the Module system available on Cousteau. Try whether you can run them (even without input).

In addition, when preparing a new analysis on an HPC system, a central question is always how much resources (number of CPUs and main memory (RAM)) the process will use and how long it will likely run. This is relevant as these resources need to be requested from the HPC scheduler. The manuals of the software can give important hints at how much resources are needed to run. If the manuals should not provide any information, a good strategy is to search for additional online resources or to have a test run on a single file and slowly increase the provided resources until the job is successful.

Note

To save you the work to figure out the resource usage of the bwa-mem2 alignment program on the human genome index, we state here that the RAM usage will be not more than 40GB.

2. Quality control#

Quality control is an integral part of data analysis. It is important to verify the assumptions we ourselves make of the data, but also those that are made by any analysis algorithms. For our project, at least one of the given files does not meet the required quality standards and should be excluded from further analysis. Can you identify which one(s)?

Project Task P1.3

Use a quality control software (e.g., FastQC) to review the quality statistics of the input data. If a sample severely violates the quality criteria, it should be excluded from further analysis. (If you cannot exclude a sample, you should move ahead with caution and take any possible artifacts into account during interpretation.)

As this is data taken from a published research project, you might find that no sample meets all the test criteria. Review the quality criteria that are not met and research online, whether some of them can be remedied.

All read input data is prepared for you at the following location: /science/scratch/genomics/project1_DNA_inputs/reads

Note

You can use the pre-installed FastQC module on Cousteau, via module load FastQC

3. Sequence alignment to the reference genome#

As discussed in the lecture, the first step for many bioinformatic analysis based on DNA sequencing data (including variant identification) is the alignment of read data to a reference genome of choice. For aligning the given read data to the human reference genome, we will use the bwa-mem2 alignment software, which uses the Burrows-Wheeler transform of the input genome for fast seed finding and then a local Smith-Waterman algorithm for alignment extension.

Project Task P1.4

For each of the given fastq files, generate an alignment against the given human reference genome using the alignment program bwa-mem2. We provide the genome refernce sequence under /science/scratch/genomics/project1_DNA_inputs/references.

Please review the baw-mem2 manual for how to use the alignment program. You will find that the first step is to generate an index (which is in fact a Burrows-Wheeler transform based index) from the reference genome sequence. As this index is very large, we have decided to pre-compute and provide it to you: /science/scratch/genomics/project1_DNA_inputs/indexes/GRCh38.bwa_index

You can the pass the index to bwa-mem2 as /science/scratch/genomics/project1_DNA_inputs/indexes/GRCh38.bwa_index/GRCh38.primary_assembly.genome.fa

Also bwa-mem2 is available via the software stack as a module: module load bwa-mem2

Please also make use of the directory structure you have designed at the beginning of this project to name and place your outputs and any analysis scripts in a sensible manner.

It is useful to generate a separate output directory for each of the indidvidual samples.

Note

The bwa-mem2 tool generates output in sam format. To prevent unnecessarily large uncompressed output files, we recommend that you directly pipe the bwa-mem output into samtools for compression. This would look like the following:

bwa-mem2 mem <options> | samtools view -bS -o <out_filename> -

The software samtools is available via the Module system, using module load SAMtools.

Once the alignments have been generated, you need to post-process them to first sort by coordinate and generate an index. This again can be accomplished using the samtools software.

4. Alignment post-processing#

Once the alignments have been generated, we will need to post-process them to make them readily usable for variant calling. The reads come out of the aligner in roughly the same order as they were provided as input. (If the alignment was run in parallel, slight deviations might occur.)

For variant calling, it is important to sort the alignments by genomic coordinates, so the different reads spanning the same genomic region are co-localized.

A second aspect of post-processing is de-duplication. It might happen that identical fragments and thus identical reads have been generated from the sequencing. We would like to recognize and mark those reads, as it is useful to give them a lower weight or even ignore them during variant calling.

Project Task P1.5

Making the alignment files accessible for search and direct access, as needed for variant calling, requires to first sort and then index them. Both tasks can be completed with the samtools software. Familiarize yourself with the samtools module and sort your alignment files in bam format by coordinate. Once completed, use samtools to also generate an index for each sorted alignment file.

Once you have generated sorted and indexed alignment files, you should remove the unsorted bam files to save space.

Project Task P1.6

As described above, marking duplicate reads in the alignment files can be beneficial for variant calling. This is an optional step. That is, variant calling will also work without it.

Familiarize yourself with the Sambamba module available on Cousteau and use its markdup sub-command. Apply the sambamba tool to mark duplicates in your alignment files.

Once you have successfully generated the alignment files cotaining the marked duplicate reads, you should remove the unmarked input alignment files to save space.

5. Variant calling#

To identify sequence variants, we will use also a Bayesian approach (similar to the one discussed in the lecture). Our analysis will utilize the freebayes software (see table above).

Project Task P1.7

Review the manual of the freebayes tool and use it to generate variant calls for each of the three given samples. Please make sure that you provide all alignment files for each sample at once to the variant calling procedure. This will generate a single set of variant calls per sample, integrating the information across all read sets.

Also this software is provided by the Module system on Cousteau. You can load it via: module load freebayes

Please also note that you will require the reference genome used for alignment (not the index), which you can find under: /science/scratch/genomics/project1_DNA_inputs/references.

Question: Why don’t we need the indexed version of the genome?

As output freebayes will generate variant calls in VCF format. If you would like to learn more about the format, we recommend having a look at the format specification.

6. Variant filtering#

Even when using specific call settings, many variant callers (including freebayes will generate a quite comprehensive call set. That is, the resulting VCF file will also contain variants with a relatively low quality or insufficient read support. However, each variant call in the file is enriched with additional information that can be used for filtering and thus for creating a set of high-confidence variant calls.

Project Task P1.8

Review structure and content of the VCF files output by the variant caller. Understand the descriptive information provided for each variant call and select quantities that can be used for filtering. If you are uncertain about picking a threshold, you can visualize the distribution of the quantity via Python or R.

Use the software vcftools to filter your vcf files with the thresholds you picked. (The vctools documentation can also give you hints on which fields to use for filtering.)

The software vcftools is pre-installed on Cousteau and available via the Module system. You can just use module load vctools to make it available.

7. Data gathering and analysis#

This last step might need the most creativity, as you cannot use a pre-built tool to accomplish the task. As a reminder, our goal is to identify the sample out of the three given which comes from the child and not the parents.

Project Task 1.9

Based on the filtered VCF variant files collect variant statistics that help you determine which of the three given samples is likely the child. Define an analysis strategy and implement it using a language of your choice.

As an additional task, try to extract information from the variant call files that helps you determine which sex each of the three given samples had. That is, to determine which parent is mother and which father and determine whether the child is a son or a daughter. (Just for clarification, this task is concerned with the genomic sex, not with gender identity.)

If you want to analyse your data directly on Cousteau, you can use a Python script. You can make Python available via module load SciPy-bundle.