Preliminaries and Setup#

Here we provide some general information on the project work.

Inputs and Outputs#

We provide a central storage location that holds all input data and that you can also use to record your outputs. All data is located under /science/scratch. For the Genomics block specifically, we will be using /science/scratch/genomics/.

Your first setup task is to create the user directory that will contain all your output data. Please name it after your NETHZ ID: /science/scratch/genomics/student_data/<NETHZ_ID>. So if your NETHZ ID is aeinstein, your corresponding path would be /science/scratch/genomics/student_data/aeinstein.

Note

You can use the mkdir command to create directories.
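
For example, assuming the NETHZ ID aeinstein from above, you could create the directory like this:

# create your personal working directory (replace aeinstein with your own NETHZ ID)
mkdir -p /science/scratch/genomics/student_data/aeinstein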

Software#

All software required for the project work is already pre-installed on the Cousteau system. You can see all available packages via module avail. If you are looking for a specific software package, say bwa, you can use module spider bwa to find any relevant modules known to the system.

Once you know the module name, say you found bwa-mem2, you can load it via module load bwa-mem2.
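
Put together, a typical module session might look like the following sketch (bwa-mem2 is just the example from above; the exact module names available on Cousteau may differ):

module avail          # list all packages installed on the system
module spider bwa     # search for modules related to bwa
module load bwa-mem2  # load the module you found
module list           # show which modules are currently loaded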

Practical advice#

In addition to the Linux refresher, we collect here a few points that will be helpful for carrying out the project work and should allow you to focus better on the data processing steps themselves.

Organising input and output data

The input data is generally organised as follows. We will use Project 1 as an example. If you change into /science/scratch/genomics/, you will find all input data under project1_DNA_inputs. This directory has three sub-directories:

  • indexes

  • reads

  • references

We will detail below what they contain. You will only ever need to read from this location and hence do not have write permissions.

After creating your working directory in /science/scratch/genomics/student_data/<NETHZ_ID>, you can place all code (like scripts) and output (like results and logs) there.
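
One possible layout for your working directory, matching the paths used in the example scripts below (the exact sub-directory names are only a suggestion), is:

# example layout for Project 1 (replace <NETHZ_ID> with your own NETHZ ID)
cd /science/scratch/genomics/student_data/<NETHZ_ID>
mkdir -p Project1/scripts Project1/results Project1/logs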

Editing text files

Often you will be required to edit a text file on the server, for instance to write a bash script. For this, you will need a text editor. The following editors are installed on the Cousteau server:

  • nano

  • vim

  • emacs

As an alternative, you can also use the Jupyter Hub Server at http://cousteau-jupyter.ethz.ch/ to create and edit text files at a location of your choice. This server has the same file system as cousteau.ethz.ch.

It is not necessary to learn how to use all these editors. Instead, we recommend you become familiar with one of them. If you have not decided yet, nano and vim are available on most Linux systems.
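
For example, to create or edit a bash script with nano, you pass the file name on the command line (my_script.sh is just a placeholder name):

nano my_script.sh    # opens the file in nano; Ctrl+O writes the file, Ctrl+X exits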

Writing a bash script

For larger analyses it is not only tedious and error-prone but also very slow and inefficient to spell out each analysis step as a command entered into the terminal. Instead, one usually bundles the analysis steps into bash scripts that are then executed together. We provide a few examples here; they will be easier to understand when read in the context of the upcoming project tasks, so please use this section as a reference.

The following script will run a QC analysis with the tool fastqc. We will use the read data in project 1 as an example. (Lines starting with # are comments and not executed.)

#!/bin/bash

# define variables
input_dir="/science/scratch/genomics/project1_DNA_inputs/reads"
result_dir="/science/scratch/genomics/student_data/<NETHZ_ID>/Project1/results/fastqc"
log_dir="/science/scratch/genomics/student_data/<NETHZ_ID>/Project1/logs/fastqc"

# create output directories in case they do not exist
mkdir -p $result_dir
mkdir -p $log_dir

# load relevant software from module library
module load FastQC

# loop over samples (we have 3 samples)
for sample in 1 2 3
do
    # define output directories for current sample / read_file combination
    current_result_dir="${result_dir}/sample${sample}/"
    current_log_dir="${log_dir}/sample${sample}/"

    # create them in case they do not exist
    mkdir -p ${current_result_dir}
    mkdir -p ${current_log_dir}

    # loop over read files in each sample directory (4 read files per sample)
    for read_file in 1 2 3 4
    do
        # loop over read pair mates
        for read_pair in 1 2
        do
            # run fastqc analysis for the current read_pair
            current_fastq=${input_dir}/sample${sample}/sample${sample}_set${read_file}.R${read_pair}.fastq.gz
            current_log=${current_log_dir}/sample${sample}_set${read_file}.R${read_pair}.log
            fastqc -o ${current_result_dir} ${current_fastq} 2> ${current_log}
        done
    done
done
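
Assuming you saved the script above as, for example, fastqc_all.sh in your scripts directory (the file name is only a suggestion), you could run it as follows:

# run the script in your current shell session
bash /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/scripts/fastqc_all.sh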

Using the batch submission system (SLURM)

As all students share the same server, we will use the SLURM scheduler to distribute the (limited) resources fairly among all users. We therefore ask you to submit your computations as job scripts. For an introduction, please review the SLURM section of the Unix refresher.

For convenience, we provide here an example script for submitting a quality-control job via the scheduler. What we would like to achieve is that fastqc, as called in our script above, does not run sequentially but in parallel, scheduled via SLURM. For this to work, we need to separate the logic that selects the input data from the code that actually runs the fastqc analysis. So let us extract from the above script all parts related to running fastqc. We will parametrise the script with two input parameters: the input file and the result directory. We will call the script run_fastqc.sh. Please note that we have also added SLURM-specific comments (#SBATCH directives) to the header section of the script; they tell the scheduler which resources the computation requires.

#!/bin/bash
#SBATCH --job-name fastqc               # Job name
#SBATCH --cpus-per-task 1               # Number of CPUs
#SBATCH --mem-per-cpu=2G                # Memory per CPU
#SBATCH --time=1:00:00                  # Approximate time needed

# this section handles the reading and assignment of input parameters
# in case no input parameters are provided, we are friendly and give a help message
# $0 always contains the name of the script, $1, $2, etc contain any command line parameters

if [ -z "${2}" ]    # this command checks whether the second input parameter given is empty
then
    echo "Usage: ${0} <input_fastq> <output_dir>"
    exit 1
fi
input_fastq="${1}"
output_dir="${2}"

# load relevant software from module library
module load FastQC

# run fastqc
fastqc -o ${output_dir} ${input_fastq}
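
Before wiring the job script into a loop, you could test it by submitting a single job manually. The fastq path below is one of the project 1 read files following the naming pattern from the script above; remember to create the output directory first:

# create the output directory, then submit a single test job
mkdir -p /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/results/fastqc/sample1/
sbatch -o test_fastqc.log \
    /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/scripts/run_fastqc.sh \
    /science/scratch/genomics/project1_DNA_inputs/reads/sample1/sample1_set1.R1.fastq.gz \
    /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/results/fastqc/sample1/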

With this so-called job script in place, we can now modify our original bash script to use the scheduler. Let’s assume you have stored the above job script as /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/scripts/run_fastqc.sh.

#!/bin/bash

# define variables
input_dir="/science/scratch/genomics/project1_DNA_inputs/reads"
result_dir="/science/scratch/genomics/student_data/<NETHZ_ID>/Project1/results/fastqc"
log_dir="/science/scratch/genomics/student_data/<NETHZ_ID>/Project1/logs/fastqc"

# create output directories in case they do not exist
mkdir -p $result_dir
mkdir -p $log_dir

# set path to job script
job_script="/science/scratch/genomics/student_data/<NETHZ_ID>/Project1/scripts/run_fastqc.sh"

# loop over samples (we have 3 samples)
for sample in 1 2 3
do
    # define output directories for current sample / read_file combination
    current_result_dir="${result_dir}/sample${sample}/"
    current_log_dir="${log_dir}/sample${sample}/"

    # create them in case they do not exist
    mkdir -p ${current_result_dir}
    mkdir -p ${current_log_dir}

    # loop over read files in each sample directory (4 read files per sample)
    for read_file in 1 2 3 4
    do
        # loop over read pair mates
        for read_pair in 1 2
        do
            # submit fastqc analysis for the current read_pair to the batch system
            current_fastq=${input_dir}/sample${sample}/sample${sample}_set${read_file}.R${read_pair}.fastq.gz
            current_log=${current_log_dir}/sample${sample}_set${read_file}.R${read_pair}.log
            sbatch -o ${current_log} ${job_script} ${current_fastq} ${current_result_dir}
        done
    done
done
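
After submitting, you can keep an eye on your jobs with the standard SLURM commands (see the SLURM section of the Unix refresher for details):

squeue -u $USER    # list your own pending and running jobs
scancel <jobid>    # cancel a job, using the job ID reported by sbatch or squeue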

If you don’t want to split the submission to SLURM into two scripts, there is also a way to provide all parameters and resource requests to sbatch directly, without a separate job script. For this, you wrap the call to fastqc into a command string and pass it to sbatch via --wrap. In the following, we just show the adapted inner-most loop from our original script from the beginning.

...
    # loop over read pair mates
    for read_pair in 1 2
    do
        # run fastqc analysis for the current read_pair
        current_fastq=${input_dir}/sample${sample}/sample${sample}_set${read_file}.R${read_pair}.fastq.gz
        current_log=${current_log_dir}/sample${sample}_set${read_file}.R${read_pair}.log
        # call to sbatch (note that we need to load the software module as part of the call)
        sbatch -o ${current_log} --job-name=fastqc --cpus-per-task=1 --time=1:00:00 --mem-per-cpu=2G --wrap "module load FastQC; fastqc -o ${current_result_dir} ${current_fastq}"
    done
 ...

The above scripts are just one possible solution to the problem. Bash is a versatile scripting language, and you may well find your own solution. We recommend reviewing one of the many online tutorials on bash scripting.

List of generally useful commands

This list is far from exhaustive, and we invite you to extend it with your own commands that you find worth remembering.

  • ls -lah: List (ls) all (-a) contents of the current directory in long form (-l) with human-readable sizes (-h)

  • pwd: Print the path of the current working directory

  • ln -s <source> <target>: Create a symbolic link at <target> pointing to <source>

  • mkdir -p sample1/read1/analysis1: Create nested directories, forming a whole path