General tasks

Because genomics work is usually resource-intensive, you will use a dedicated compute server to complete the project tasks. (The additional exercises provided alongside the OLM can be completed on the Jupyter Hub that was introduced earlier and is reachable via the link on the Moodle course page.)

Refresh the command line usage

Although you have worked with the command line and the Unix operating system before, we recommend that you at least briefly review the Unix refresher provided in the Online Learning Materials (OLM) of this course.

Specifically, you should feel comfortable completing the following tasks:

  • creating, moving, copying, deleting files

  • creating, moving, copying, deleting directories

  • redirecting the output of a script or process

  • creating / editing files in a text editor

  • connecting different commands into a pipeline

  • creating and executing a Bash or Python script

In addition, you will use the batch compute system Slurm to submit compute jobs to the worker queue of the dedicated compute server. A short introduction to the Slurm scheduler is included at the end of the Unix refresher in the OLM.

Review the Unix intro

Regardless of which project you end up working on, you will use the dedicated Cousteau compute server. Before you get started, please complete the Unix refresher provided in the Online Learning Materials. It gives an overview of how to use the scheduling system and tells you which resources are available to you.

Draft a sensible directory structure

When working with large and complex data sets, a practical and clear directory structure is key to success. The art is to find the right level of abstraction between a single directory containing possibly hundreds of different files and a very deep directory structure that forces you to pass through ten hierarchical levels before reaching the relevant target. Because it is hard to change the overall project structure late in the project, we invite you to think through an overall analysis plan first and to design your directory structure accordingly.

We would like to provide you with the following guidelines / rules of thumb:

  • use short yet expressive directory and file names (e.g., “alignments” rather than “output1”)

  • separate scripts / code from data

  • separate inputs from outputs

  • separate data outputs (e.g., counts or alignments) from logging information (e.g., runtime logs)
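
One possible way to apply these rules of thumb is sketched below. The directory names (scripts, data/raw, results/alignments, results/counts, logs) and the helper create_project_skeleton are illustrative assumptions, not a layout prescribed by the course. Adapt them to your own analysis plan.

```python
from pathlib import Path

# Hypothetical layout: these names are assumptions chosen for illustration.
PROJECT_DIRS = [
    "scripts",             # code, kept apart from data
    "data/raw",            # inputs
    "results/alignments",  # data outputs
    "results/counts",      # data outputs
    "logs",                # runtime logs, kept apart from data outputs
]


def create_project_skeleton(root):
    """Create the directory layout below `root` and return the created paths."""
    root = Path(root)
    created = []
    for relative in PROJECT_DIRS:
        path = root / relative
        # parents=True also creates intermediate directories;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_project_skeleton("my_project") would create the skeleton below a (hypothetical) my_project directory. Because the layout is defined in a single list, changing it later only requires editing one place.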

Follow the DRY rule

One of the main rules in scripting and programming is the DRY rule: Don’t Repeat Yourself! Use parameters wherever possible and sensible, and avoid hard-coding file names or parameter values. For instance, to generate alignments for many read samples, wrap the call to the aligner in a loop that iterates over all input files instead of writing a separate call for each sample. Wherever you see potential to reuse something you have already implemented, do so.
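
The aligner example could look roughly like the following sketch. Everything specific in it is an assumption made for illustration: the program name my_aligner, its flags, the .fastq.gz input suffix, the .bam output suffix, and the thread count are placeholders for whatever your project actually uses.

```python
from pathlib import Path

# Placeholder values (assumptions): replace them with your real aligner and settings.
ALIGNER = "my_aligner"
THREADS = 4
READ_SUFFIX = ".fastq.gz"


def build_alignment_command(reads, reference, out_dir, threads=THREADS):
    """Return the argument list for aligning a single read file."""
    reads = Path(reads)
    name = reads.name
    # Derive the sample name from the input file instead of hard-coding it.
    sample = name[: -len(READ_SUFFIX)] if name.endswith(READ_SUFFIX) else reads.stem
    output = Path(out_dir) / f"{sample}.bam"
    return [
        ALIGNER,
        "--threads", str(threads),
        "--reference", str(reference),
        "--reads", str(reads),
        "--output", str(output),
    ]


def build_all_commands(reads_dir, reference, out_dir):
    """Build one command per read file found in `reads_dir`, in sorted order."""
    read_files = sorted(Path(reads_dir).glob(f"*{READ_SUFFIX}"))
    return [build_alignment_command(r, reference, out_dir) for r in read_files]
```

Each returned list can then be run with subprocess.run(command, check=True) or written into a Slurm job script. Adding a new sample only requires placing its read file in the input directory. No code has to be copied or edited.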