Project 2 - RNA-Seq Data Analysis#
The overall goal of this project is to make you familiar with quantitative data analysis using RNA-Sequencing data. In this project, we will use real research data, generated from mice. We will formulate an overall project task / question. To solve this task, you need to implement and carry out several analysis steps on the original data. This project description will guide you through these tasks. However, you might be required to search and use additional information sources available to you.
Project Task#
You are provided with raw sequencing reads from 8 different mouse RNA-Seq samples in FASTQ format. They originate from a published research project that studies the effect of knocking down a specific gene. Such knockdowns (or even knockouts) often have quite visible consequences for the remaining transcriptome (and consequently the whole organism). Hence, from the data alone it will not be straightforward to unambiguously identify the knocked-down gene. In our small project setting, we invite you to narrow down the list of candidate genes and determine the 10 most differentially expressed genes between wild type and knockout samples. You can then check whether the knocked-out gene is amongst the ones you have identified as most different. Likewise, you can compare your results to those of the original publication (which used a different analysis strategy) and check for overlaps in your findings. However, as the project time is very limited, we do not expect you to replicate the full analysis of a scientific publication, but rather to take the first steps towards a differential analysis.
Although the task is relatively easy to formulate, several analysis steps are involved to get from raw data to a list of differentially expressed genes. Below, we provide an overview of the roadmap:
Data preparation and analysis setup
Quality control
Sequence alignment (to genome or transcriptome)
Expression quantification / counting
Data gathering and transformation
Differential analysis
Post-processing and interpretation
We will provide some guidelines on each of these steps and suggest possible tools you can use to complete them. However, if you would like to try out a different path or follow a tutorial you found online, please do so.
1. Data preparation and analysis setup#
Again, we recommend sketching an analysis plan together with a directory structure for the analysis. You will be provided some storage space on the Cousteau compute server that you can use for your outputs. General inputs will also be provided via that storage platform.
Project Task P2.1
Draft a directory structure that allows you to implement the above analysis roadmap. You should think about how to hierarchically organise your code/scripts, inputs, outputs and logs.
Here is an example structure:

```
|- inputs
|- outputs
|  |- logs
|  |  |- alignments
|  |  |- counting
|  |- results
|  |  |- alignments
|  |  |- counting
|- scripts
```
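Such a layout can be created in one go from the project root; a minimal sketch (the directory names follow the example above, so adjust them to your own plan):

```bash
# Create the example project layout; -p creates parent directories as needed
# and does not complain if a directory already exists.
mkdir -p inputs scripts
mkdir -p outputs/logs/alignments outputs/logs/counting
mkdir -p outputs/results/alignments outputs/results/counting
```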
For each of the tasks you will also be provided with software. As part of the preparation, we invite you to review the manuals of the different tools to understand what assumptions they make and which parameters can be chosen. If you would like to use software other than the tools suggested here, you will need to install it yourself in user space. For some tasks, we suggest alternatives and you are free to choose which one(s) to use.
| Task | Tool | Manual |
| --- | --- | --- |
| Quality control | FastQC | |
| Alignment | STAR | |
| Quantification | Salmon | |
| Quantification | kallisto | |
| Quantification | HTSeq | |
| Differential Analysis | DESeq2 | |
Project Task P2.2
Familiarize yourself with the above tools and review what they do. Try to identify key parameters and check how to load the tools via the Module system available on Cousteau. Check whether you can run them (even without input).
In addition, when preparing a new analysis on an HPC system, a central question is always how many resources (CPUs and main memory (RAM)) a process will use and how long it will likely run. This is relevant because these resources need to be requested from the HPC scheduler. The software manuals can give important hints about how many resources are needed. If the manuals do not provide any information, a good strategy is to search for additional online resources, or to do a test run on a single file and slowly increase the provided resources until the job succeeds.
Note
To save you the work of figuring out the resource usage of the STAR alignment program on the mouse genome index, we state here that its RAM usage is approximately 30 GB.
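Assuming the scheduler on Cousteau is Slurm (an assumption; adapt the directives to whatever batch system is actually in use), a job script header requesting resources for a STAR run could be sketched as:

```bash
#!/bin/bash
#SBATCH --job-name=star_align
#SBATCH --cpus-per-task=8          # match the thread count you pass to STAR
#SBATCH --mem=32G                  # ~30 GB for the genome index plus head-room
#SBATCH --time=02:00:00            # placeholder; refine after a test run
#SBATCH --output=outputs/logs/alignments/%x_%j.log

module load STAR
# STAR command goes here (see the alignment step below)
```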
2. Quality control#
Quality control is an integral part of data analysis. It is important to verify the assumptions we ourselves make about the data, but also those made by any analysis algorithms. For our project, at least one of the given files does not meet the required quality standards and should be excluded from further analysis. Can you identify which one(s)?
Project Task P2.3
Use a quality control software (e.g., FastQC) to review the quality statistics of the input data. If a sample severely violates the quality criteria, it should be excluded from further analysis. (If you cannot exclude a sample, you should move ahead with caution and take any possible artifacts into account during interpretation.)
You might find that some samples are worse than others in terms of quality. As this is data taken from a published research project, you might find that no sample meets all the test criteria. Make a decision which samples to exclude and which samples to proceed with.
All read input data is prepared for you at the following location:
/science/scratch/genomics/project2_RNA_inputs/reads
We also provide a text file with sample metadata in the `reads` folder.
Note
You can use the pre-installed FastQC module on Cousteau via `module load FastQC`.
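A minimal FastQC invocation could then look as follows; the `.fastq.gz` suffix is an assumption, so check the actual file extensions in the reads folder:

```bash
module load FastQC

mkdir -p outputs/results/qc
# FastQC writes one HTML report and one zip archive per input file
fastqc --outdir outputs/results/qc \
    /science/scratch/genomics/project2_RNA_inputs/reads/*.fastq.gz
```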
3. Sequence alignment (to genome or transcriptome)#
Whether you need to carry out this step is dependent on your choice for the subsequent step 4. Please review the next section and come back to this once you have decided to perform genome / transcriptome alignments.
An often performed step for generating a quantitative read-out from our given samples is to align them against a reference sequence. This reference can be a genome or transcriptome. For the purpose of this project we recommend using the STAR alignment software. When aligning against a genome, this tool is able to provide two kinds of output at the same time: one set of alignments against the genome and one set of alignments against the transcriptome.
Project Task P2.4
For each of the fastq input files that you decided to keep after your initial quality control, generate an alignment against the given reference genome using the alignment program STAR. You can find the reference genome under:
/science/scratch/genomics/project2_RNA_inputs/references
Please review the STAR manual for how to use the alignment program. You will find that the first
step is to generate an index (which is in fact a suffix array index) from the reference genome
sequence. As this index is uncompressed and thus very large, we have decided to pre-compute it
and provide it to you:
/science/scratch/genomics/project2_RNA_inputs/indexes/GRCm39_star_index
Please also make use of the directory structure you have designed at the beginning of this project to name and place your outputs and any analysis scripts in a sensible manner.
It is useful to generate a separate output directory for each of the individual samples.
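A single-sample STAR run could be sketched as follows; the sample name and the single-end read layout are assumptions, so adapt the `--readFilesIn` arguments to the actual data:

```bash
module load STAR

sample=sample1    # hypothetical sample name; loop over all samples you kept
mkdir -p outputs/alignments/${sample}

STAR \
    --runThreadN 8 \
    --genomeDir /science/scratch/genomics/project2_RNA_inputs/indexes/GRCm39_star_index \
    --readFilesIn /science/scratch/genomics/project2_RNA_inputs/reads/${sample}.fastq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode TranscriptomeSAM \
    --outFileNamePrefix outputs/alignments/${sample}/
# --quantMode TranscriptomeSAM additionally writes
# Aligned.toTranscriptome.out.bam, which Salmon's alignment mode can consume
```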
4. Expression quantification / counting#
There are several options available for collecting gene expression quantifications from RNA-Sequencing data. In the following, we discuss all options, and you can decide whether you would like to try only one or try several and compare their results. For the variants, we will be using the Salmon and HTSeq software (see table above).
Salmon is made available on Cousteau via the Module system. You can load it via:
module load Salmon
Likewise, for HTSeq you can do:
module load HTSeq
After loading the module, you can invoke HTSeq via `htseq-count`.
Alignment-based quantification
The first option is to use quantification based on already prepared read alignments to the
transcriptome. The Salmon software supports quantification of alignments to the
transcriptome. As inputs, this step requires the transcriptome alignments generated by STAR (see
previous step) as well as a copy of the transcriptome sequences. We provide the transcriptome
sequence in the input folder for reference sequences:
/science/scratch/genomics/project2_RNA_inputs/references
For further details on how to use the Salmon tool in alignment mode, we refer to the Salmon Manual.
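Sketched for one sample, an alignment-mode Salmon call could look like this; the transcriptome file name is a placeholder, so check the actual name in the references folder:

```bash
module load Salmon

sample=sample1    # hypothetical sample name
# -l U follows the LIBTYPE stated for the mapping mode below;
# -l A would let Salmon infer the library type instead
salmon quant \
    -t /science/scratch/genomics/project2_RNA_inputs/references/<transcriptome.fa> \
    -l U \
    -a outputs/alignments/${sample}/Aligned.toTranscriptome.out.bam \
    -p 8 \
    -o outputs/counting/salmon_aln/${sample}
```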
A second option is to use quantification based on already prepared read alignments to the genome.
As Salmon does not support genome alignments, we recommend using the `htseq-count` tool from the HTSeq package.
For quantifying expression with `htseq-count` you need the genome alignments from STAR (see
previous step) and the corresponding genome annotation in GTF format. We have provided the
annotation as part of our reference files under:
/science/scratch/genomics/project2_RNA_inputs/references
For further details on how to use `htseq-count`, we refer to the HTSeq Manual.
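A per-sample `htseq-count` call could be sketched as follows; the annotation file name and the `--stranded no` setting are assumptions to verify against the references folder and the library protocol:

```bash
module load HTSeq

sample=sample1    # hypothetical sample name
mkdir -p outputs/counting/htseq
# --order pos: the STAR BAM from the previous step is coordinate-sorted
htseq-count \
    --format bam \
    --order pos \
    --stranded no \
    outputs/alignments/${sample}/Aligned.sortedByCoord.out.bam \
    /science/scratch/genomics/project2_RNA_inputs/references/<annotation.gtf> \
    > outputs/counting/htseq/${sample}.counts.txt
```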
Pseudo-alignment based quantification
An alternative to using genome or transcriptome alignments is to use mappings or pseudo-alignments.
With Salmon you can directly quantify reads from a given raw fastq file. For this, however, it is
necessary to build an index on the given reference transcripts. For your convenience, we have
pre-computed this index already and provide it as part of the reference material:
/science/scratch/genomics/project2_RNA_inputs/indexes
For running Salmon in mapping mode, you also need to provide a `LIBTYPE` option, describing the
nature of your given sequencing library. For the data provided, the correct LIBTYPE is `U`.
For further details on how to use the Salmon tool in mapping mode, we refer to the Salmon Manual.
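A mapping-mode sketch for one single-end sample (the index directory name is a placeholder; check the indexes folder):

```bash
module load Salmon

sample=sample1    # hypothetical sample name
salmon quant \
    -i /science/scratch/genomics/project2_RNA_inputs/indexes/<salmon_index> \
    -l U \
    -r /science/scratch/genomics/project2_RNA_inputs/reads/${sample}.fastq.gz \
    -p 8 \
    -o outputs/counting/salmon_map/${sample}
```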
A second alternative for alignment-free quantification is to use pseudo-alignments. This strategy is
similar to the Salmon mapping mode. The tool `kallisto` (see tool table above) performs
pseudo-alignments. Again, you need to provide your raw reads in fastq format and an index of the
desired transcriptome. We have also provided a pre-indexed transcriptome for kallisto in the
reference material:
/science/scratch/genomics/project2_RNA_inputs/indexes
For further details on how to run the pseudo-alignments we refer to the kallisto Manual.
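A kallisto sketch for one sample; for single-end data, kallisto additionally needs an estimated fragment length mean and standard deviation, and the values below are placeholders:

```bash
module load kallisto

sample=sample1    # hypothetical sample name
kallisto quant \
    -i /science/scratch/genomics/project2_RNA_inputs/indexes/<kallisto_index> \
    -o outputs/counting/kallisto/${sample} \
    --single -l 200 -s 20 \
    /science/scratch/genomics/project2_RNA_inputs/reads/${sample}.fastq.gz
```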
Project Task P2.5
Decide on at least two of the four quantification methods above and generate quantifications for the given annotation. You should generate one quantification per method per given input fastq/alignment sample.
5. Data gathering and transformation#
In the previous analysis step, you have generated expression counts either for transcripts (via Salmon) or for genes (via HTSeq). As the goal is to perform differential gene expression analysis, we need to aggregate transcript expression counts to gene level, if counts are not at gene level yet.
For aggregating transcript-level counts, we recommend the tximeta Bioconductor package, which runs in R. It should already be pre-installed in the R Jupyter server available via the Moodle JupyterHub link.
We recommend following the tximeta user documentation for learning how to import Salmon counts into the R environment for further use with Bioconductor.
For importing HTSeq counts, which are already at gene level, one can rely on the DESeq2 Bioconductor package. Here too, we refer you to the DESeq2 documentation for learning how to import HTSeq counts.
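In R, the two import routes could be sketched as follows; the metadata file name and its column names are assumptions to adapt to the provided sample table:

```r
library(tximeta)
library(DESeq2)

## Salmon/tximeta route: coldata needs the columns 'names' and 'files',
## where 'files' points at each sample's quant.sf
coldata <- read.delim("inputs/sample_metadata.tsv")   # hypothetical file name
se  <- tximeta(coldata)             # transcript-level SummarizedExperiment
gse <- summarizeToGene(se)          # aggregate transcripts to gene level

## HTSeq route: counts are already at gene level
sampleTable <- data.frame(sampleName = coldata$names,
                          fileName   = paste0(coldata$names, ".counts.txt"),
                          condition  = coldata$condition)
dds <- DESeqDataSetFromHTSeqCount(sampleTable,
                                  directory = "outputs/counting/htseq",
                                  design    = ~ condition)
```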
Project Task P2.6
Import the expression counts into the R workspace using the Bioconductor packages mentioned above. If the counts are at transcript level, aggregate them to gene level.
6. Differential analysis#
Once the expression counts are prepared, we can use them for differential analysis. We will compare the wild type vs knockout group. The labels have been provided to you in the initial sample table.
To stay within the R framework, we recommend using the DESeq2 Bioconductor package for the
differential analysis. If you used the tools mentioned in the previous step, you will have already
imported the data into a `SummarizedExperiment` object, which can be used directly by DESeq2.
There is an excellent tutorial available describing how to perform differential gene expression analysis with DESeq2. We would like to invite you to review the tutorial and apply the relevant parts to our differential test setting.
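The core DESeq2 calls could be sketched as follows, assuming a gene-level `SummarizedExperiment` `gse` with a `condition` column whose "wildtype" label (an assumption; check the sample table) marks the baseline group:

```r
library(DESeq2)

dds <- DESeqDataSet(gse, design = ~ condition)
dds$condition <- relevel(dds$condition, ref = "wildtype")  # wild type as baseline

dds <- DESeq(dds)       # size factors, dispersion estimates, Wald test
res <- results(dds)     # log2 fold-changes and raw/adjusted p-values per gene
summary(res)
```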
Project Task P2.7
Perform a differential gene expression analysis between the groups wild type and knockout to determine all genes that are significantly different between the groups.
As you have generated counts using different approaches, you can generate a testing result for each of the different count tables.
7. Post-processing and interpretation#
As a last step, we need to connect the raw results from the differential analysis to the initial project task. We had set out to narrow down the list of candidate genes to figure out which of the genes might have been the direct target of the knockout. From the testing results, you should have received a list of p-values and fold-changes attached to each gene identifier.
Project Task P2.8
Narrow down the list of differential candidate genes using appropriate thresholds for p-value (adjusted) and fold-change. Write your results to disk as a csv or tsv file.
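Filtering and exporting could be sketched like this; the thresholds are illustrative only, and you should justify your own choices:

```r
res_df <- as.data.frame(res)          # 'res' is the DESeq2 results object
res_df <- res_df[!is.na(res_df$padj), ]
hits   <- res_df[res_df$padj < 0.05 & abs(res_df$log2FoldChange) > 1, ]
hits   <- hits[order(hits$padj), ]    # most significant genes first

head(hits, 10)                        # the 10 strongest candidates
write.csv(hits, "outputs/results/differential_candidates.csv")
```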
When repeating the filtering for the test results from the different count tables, how much do your narrowed-down lists overlap? Can you suggest reasons for the possibly different outcomes?