Preliminaries and Setup#
Here we provide some general information on the project work.
Inputs and Outputs#
We provide a central storage location that holds all input data and that you can also use to record your outputs. All data will be located under /science/scratch. For the Genomics block specifically we will be using /science/scratch/genomics/.
Your first setup task is to create your user directory, which will contain all your output data. Please name it after your NETHZ ID: /science/scratch/genomics/student_data/<NETHZ_ID>. So if your ETH ID is aeinstein, your corresponding path would be /science/scratch/genomics/student_data/aeinstein.
Note
You can use the mkdir command to create directories.
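For example, if your ETH ID were aeinstein (as above), the following command would create the directory, including any missing parent directories:
# create your personal working directory (replace aeinstein with your own NETHZ ID)
mkdir -p /science/scratch/genomics/student_data/aeinstein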
Software#
All software required for the project work is already pre-installed on the Cousteau system. You can see all available packages via module avail. If you are looking for a specific software package, let’s assume bwa, you can use module spider bwa to find any relevant modules known to the system. Once you know the module name, let’s say you found bwa-mem2, you can load it via module load bwa-mem2.
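A typical session with the module system might look like the sketch below; module list is a standard command of the module system that shows what is currently loaded in your session.
module avail            # list all available packages
module spider bwa       # search for modules related to bwa
module load bwa-mem2    # load the module once you know its name
module list             # show the modules currently loaded in your session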
Practical advice#
In addition to the Linux refresher, we provide here a few points that will be helpful to carry out the project work and should allow you to better focus on the data processing steps themselves.
Organising input and output data
The input data is generally organised as follows. We will use Project 1 as an example. If you change into /science/scratch/genomics/, you will find all input data under project1_DNA_inputs.
This directory has three sub-directories:
indexes
reads
references
We will detail below what they contain. You will only ever need to read from this location and hence do not have write permissions there.
After creating your working directory in /science/scratch/genomics/student_data/<NETHZ_ID>, you can place all code (like scripts) and output (like results and logs) there.
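One possible way to lay out your working directory, matching the paths used in the example scripts below, would be:
# change into your working directory (replace <NETHZ_ID> with your own ID)
cd /science/scratch/genomics/student_data/<NETHZ_ID>
# create per-project sub-directories for scripts, results and logs
mkdir -p Project1/scripts Project1/results Project1/logs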
Editing text files
Often you will be required to edit a text file on the server, for instance to write a bash script. For editing such files, you will need an editor. The following editors are installed on the Cousteau server:
nano
vim
emacs
As an alternative, you can also use the Jupyter Hub Server at http://cousteau-jupyter.ethz.ch/ to create and edit text files at a location of your choice. This server has the same file system as cousteau.ethz.ch.
It is not necessary to learn how to use all these editors. Instead, we recommend you become familiar with one of them. If you have not decided yet, nano and vim are available on most Linux systems.
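For example, to create or open the job-script used later in this section with nano, you could run:
# open (or create, if it does not yet exist) a script file in the nano editor
nano /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/scripts/run_fastqc.sh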
Writing a bash script
For larger analyses it is not only tedious and error-prone but also very slow and inefficient to spell out each analysis step as a command entered into the terminal. Instead, one usually bundles analysis steps into bash scripts that are then carried out together. We provide a few examples here that will be easier to understand when read in the context of the upcoming project task. So please use this section as a reference.
The following script will run a QC analysis with the tool fastqc. We will use the read data in project 1 as an example. (Lines starting with # are comments and are not executed.)
#!/bin/bash
# define variables
input_dir="/science/scratch/genomics/project1_DNA_inputs/reads"
result_dir="/science/scratch/genomics/student_data/<NETHZ_ID>/Project1/results/fastqc"
log_dir="/science/scratch/genomics/student_data/<NETHZ_ID>/Project1/logs/fastqc"
# create output directories in case they do not exist
mkdir -p $result_dir
mkdir -p $log_dir
# load relevant software from module library
module load FastQC
# loop over samples (we have 3 samples)
for sample in 1 2 3
do
    # define output directories for current sample / read_file combination
    current_result_dir="${result_dir}/sample${sample}/"
    current_log_dir="${log_dir}/sample${sample}/"
    # create them in case they do not exist
    mkdir -p ${current_result_dir}
    mkdir -p ${current_log_dir}
    # loop over read files in each sample directory (4 read files per sample)
    for read_file in 1 2 3 4
    do
        # loop over read pair mates
        for read_pair in 1 2
        do
            # run fastqc analysis for the current read_pair
            current_fastq=${input_dir}/sample${sample}/sample${sample}_set${read_file}.R${read_pair}.fastq.gz
            current_log=${current_log_dir}/sample${sample}_set${read_file}.R${read_pair}.log
            fastqc -o ${current_result_dir} ${current_fastq} 2> ${current_log}
        done
    done
done
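Assuming you have saved the above script in your scripts directory, for example under the (hypothetical) name run_fastqc_all.sh, you can execute it like this:
# run the script with bash ...
bash run_fastqc_all.sh
# ... or make it executable once and then run it directly
chmod +x run_fastqc_all.sh
./run_fastqc_all.sh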
Using the batch submission system (SLURM)
As all students are sharing the same server, we will use the SLURM scheduler to distribute the (limited) resources among the different users. We thus would like to ask you to submit your computations as job scripts. For an introduction, please review the SLURM section of the Unix refresher.
For convenience, we provide here an example script for submitting a quality control job via the scheduler. What we would like to achieve is that fastqc, as called in our script above, does not run sequentially but in parallel, scheduled via SLURM. For this to work, we need to separate the logic that selects the input data from the code that actually runs the fastqc analysis. So let us extract all parts related to running fastqc from the script above. We will parametrise the script with input parameters for the input file and the result directory. We will call the script run_fastqc.sh. Please note that we have also added SLURM-specific comments to the header section of the script; these provide information on the resources required for the computation.
#!/bin/bash
#SBATCH --job-name fastqc # Job name
#SBATCH --cpus-per-task 1 # Number of CPUs
#SBATCH --mem-per-cpu=2G # Memory per CPU
#SBATCH --time=1:00:00 # Approximate time needed
# this section handles the reading and assignment of input parameters
# in case no input parameters are provided, we are friendly and give a help message
# $0 always contains the name of the script, $1, $2, etc contain any command line parameters
if [ -z "${2}" ] # this command checks whether the second input parameter given is empty
then
echo "Usage: ${0} <input_fastq> <output_dir>"
exit 1
fi
input_fastq="${1}"
output_dir="${2}"
# load relevant software from module library
module load FastQC
# run fastqc
fastqc -o ${output_dir} ${input_fastq}
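Before wiring this job-script into a loop, it can be helpful to test it on a single file. For example, using the first read file of sample 1 (following the file-naming pattern from the loop above), a single test submission could look like this:
# make sure the output directory exists, then submit one test job;
# the job's output and error messages go to the file given via -o
mkdir -p /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/results/fastqc/sample1/
sbatch -o test_fastqc.log \
    /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/scripts/run_fastqc.sh \
    /science/scratch/genomics/project1_DNA_inputs/reads/sample1/sample1_set1.R1.fastq.gz \
    /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/results/fastqc/sample1/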
With this so-called job-script in place, we can now modify our original bash script to use the scheduler. Let’s assume you have stored the above job-script as /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/scripts/run_fastqc.sh.
#!/bin/bash
# define variables
input_dir="/science/scratch/genomics/project1_DNA_inputs/reads"
result_dir="/science/scratch/genomics/student_data/<NETHZ_ID>/Project1/results/fastqc"
log_dir="/science/scratch/genomics/student_data/<NETHZ_ID>/Project1/logs/fastqc"
# create output directories in case they do not exist
mkdir -p $result_dir
mkdir -p $log_dir
# set path to job script
job_script="/science/scratch/genomics/student_data/<NETHZ_ID>/Project1/scripts/run_fastqc.sh"
# loop over samples (we have 3 samples)
for sample in 1 2 3
do
    # define output directories for current sample / read_file combination
    current_result_dir="${result_dir}/sample${sample}/"
    current_log_dir="${log_dir}/sample${sample}/"
    # create them in case they do not exist
    mkdir -p ${current_result_dir}
    mkdir -p ${current_log_dir}
    # loop over read files in each sample directory (4 read files per sample)
    for read_file in 1 2 3 4
    do
        # loop over read pair mates
        for read_pair in 1 2
        do
            # submit fastqc analysis for the current read_pair to the batch system
            current_fastq=${input_dir}/sample${sample}/sample${sample}_set${read_file}.R${read_pair}.fastq.gz
            current_log=${current_log_dir}/sample${sample}_set${read_file}.R${read_pair}.log
            sbatch -o ${current_log} ${job_script} ${current_fastq} ${current_result_dir}
        done
    done
done
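After submitting, you can keep an eye on your jobs with the standard SLURM commands, for example:
# list your own pending and running jobs
squeue -u $USER
# cancel a job if something went wrong (the job ID is printed by sbatch and shown by squeue)
scancel <job_id>
# once a job has finished, inspect its log file (path follows the pattern used in the loop above)
less /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/logs/fastqc/sample1/sample1_set1.R1.log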
If you don’t want to separate the submission to SLURM into two scripts, there is also a way to provide all parameters and resource requests to sbatch directly, without a separate job-script. For this, you would need to wrap the call to fastqc into a command and provide it to sbatch. In the following, we just show the adapted inner-most loop from our original script from the beginning.
...
# loop over read pair mates
for read_pair in 1 2
do
    # submit the fastqc analysis for the current read_pair directly via sbatch
    current_fastq=${input_dir}/sample${sample}/sample${sample}_set${read_file}.R${read_pair}.fastq.gz
    current_log=${current_log_dir}/sample${sample}_set${read_file}.R${read_pair}.log
    # call to sbatch (note that we need to load the software module as part of the wrapped command)
    sbatch -o ${current_log} --job-name=fastqc --cpus-per-task=1 --time=1:00:00 --mem-per-cpu=2G --wrap "module load FastQC; fastqc -o ${current_result_dir} ${current_fastq}"
done
...
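The --wrap option takes a command string and turns it into a minimal job script behind the scenes. If you want to check that wrapped submission works for you before running the full loop, a trivial standalone test (independent of the fastqc analysis) could look like this:
# submit a small wrapped command; the job prints the compute node it ran on
sbatch --job-name=hello --cpus-per-task=1 --time=0:05:00 --mem-per-cpu=100M --wrap "hostname; echo Hello from SLURM"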
The above scripts are just one possible solution to the problem. Bash is a versatile scripting language, and you are welcome to develop your own solution. We would recommend reviewing one of the many available online tutorials on bash scripting, such as this one.
List of generally useful commands
This list is by no means exhaustive and we invite you to extend it with your own commands that you find worth remembering.
Command | What it does
---|---
ls <directory> | List (the contents of) a directory
pwd | Print path to current working directory to the screen
ln -s <target> <link_name> | Create a symbolic link between <target> and <link_name>
mkdir -p <path> | Create nested directories, forming a whole path
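To illustrate these commands in the context of this course, the following (purely illustrative) snippet uses them with the paths introduced above:
# where am I?
pwd
# list the input data of project 1
ls /science/scratch/genomics/project1_DNA_inputs/
# create a nested directory tree in one go
mkdir -p /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/results/fastqc
# create a symbolic link called input_reads that points to the read data
ln -s /science/scratch/genomics/project1_DNA_inputs/reads /science/scratch/genomics/student_data/<NETHZ_ID>/Project1/input_reads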