Computing at the Institute of Microbiology#

Learning objectives#

The principal aim of this workshop is to provide an overview of the various aspects of computing at the Institute of Microbiology:

  • What data storage is available and your responsibilities

  • What computing servers are available

  • What software is available

  • How to be a considerate user on a computing server

  • What services are available from the institute bioinformaticians

Data storage#

For information on the Work Folders provided by IT in your Windows environments, see here.

This section is about the storage of scientific data and associated files such as scripts.

Attention

You have a number of responsibilities relating to your own data:

  • Your data should be appropriately backed up.

  • Your data should be organised, documented and accessible to those in your group after you have left.

  • Any data associated with your publications should be made available according to the rules of the journal and funding bodies.

More details on these topics will be covered in the follow-up workshop, Good Practice in Computing.

Gram#

Gram is the name we give to our ‘Network File System’ or NFS. It actually consists of several components that we can imagine as individual drives. You have access to:

  • Your group drive, located (in Windows) at \\gram\biol_micro_<GROUP>

  • The institute drive for internal sharing, \\gram\biol_micro_shared

  • Other drives with specific purposes

The advantage of NFS is that it is backed up several times throughout the day, so if you make a mistake you can restore a previous version of your data; see here for details.

The space on Gram is limited, however: NFS storage is more expensive than other types, partly because of the backup feature. It is not suitable for storing large quantities of sequencing or image data, so we provide a separate archiving system for those.

Note

Please do not use the Gram shared drive for storage; it is intended for transferring files between different groups and machines. Anything left there may be deleted at any time to make space.

Harvest#

Harvest is the name of our long-term storage database for sequencing data. It is currently being expanded to also accommodate microscopy image data, and to provide a self-service interface, meaning that you will no longer have to contact a bioinformatician directly.

You have a responsibility to ensure that any sequencing data you generate is registered for submission to Harvest, and there are instructions here. Your data will need to be put onto a specific Gram drive: \\gram\biol_micro_openbis_dropbox.
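The details of registration and the exact dropbox location come from the Harvest instructions linked above; as a hedged illustration only, staging data for submission might look like the following, with all paths local stand-ins for the real mount.

```shell
# Hypothetical staging example: real dropbox location and naming rules
# come from the Harvest instructions; all paths here are illustrative.

# A dummy run folder standing in for your sequencing output
mkdir -p myrun && echo "@read1" > myrun/sample1.fastq

# Record checksums alongside the data so the transfer can be verified
( cd myrun && md5sum *.fastq > checksums.md5 )

# Stage the folder into a stand-in for the dropbox mount, preserving
# timestamps and permissions with cp -a
mkdir -p dropbox_stage
cp -a myrun dropbox_stage/

# Verify the copy against the recorded checksums
( cd dropbox_stage/myrun && md5sum -c checksums.md5 )
```

Recording and re-checking checksums is cheap insurance against silent corruption during large transfers.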

Archiving#

For other kinds of data that you want to archive for long-term storage, we also have a solution, detailed here.

Publication data#

There will be more detail on this in the follow-up workshop but in brief:

  • We recommend submitting sequence data, assemblies and annotations to the European Nucleotide Archive (ENA)

Software and scripts are best kept in a GitHub repository, but should also be deposited on Zenodo

Any other data, supplementary material and the like that isn’t sequence or code should also be deposited on Zenodo

Computing servers#

As a member of the institute, you have access not only to our internal server Morgan, but also (because the Department of Biology buys into it) the ETH-wide computing cluster Euler.

Information on how to connect to these servers and use them is available here.

Currently, Morgan uses a different queuing system to Euler, but we intend to change this in the future.

The institute maintains some other Linux servers, most of which belong to the Sunagawa group; Fleming, for instance, hosts some web-based tools and services, and Cousteau has been repurposed for use in teaching.

Storage#

When you log into Morgan you are taken to your home folder. This storage space is shared across all Linux servers run by the institute, but it is limited, so it is not suitable for storing large data.

You can access the shared space on Gram at /nfs/shared and we also hope to make each group drive available at /nfs/<GROUP> soon (watch this space).

There is also a built-in drive that can be accessed at /science; however, you will need to talk to Chris to be given a space there with appropriate permissions. It is sensible to use this for larger data that you are actively working on, or when speed of access is important for your software.
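Before placing large data anywhere, it is worth checking how much you are already using and how much free space a filesystem has. A minimal sketch with standard tools (substitute a path such as /science or /nfs/shared once you have access there; $HOME is used here so the commands run anywhere):

```shell
# How much space is a folder using? (-s summarise, -h human-readable)
du -sh "$HOME"

# How much free space is left on the filesystem behind a given path?
df -h "$HOME"
```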

On Euler, there are designated spaces, described here.

Software#

When you log into Morgan, our Software Module System (sometimes called a software stack) is automatically loaded in the background. We provide instructions on how to use this here. New or updated software can be installed by contacting Chris.

Any software that requires a database will have that database installed with it. Further, we make certain broad-use databases available at /nfs/modules/databases.

Euler maintains its own software stacks: an old one (the default) and a new one that behaves the same way as the one on Morgan. When you log into Euler, the easiest way to switch to the new software stack is with the command env2lmod. Our own module system can also be loaded using the following commands:

unset MODULEPATH_ROOT
unset MODULESHOME
unset MODULEPATH
source /nfs/nas22/fs2201/biol_micro_unix_modules/Lmod-7.8/lmod/lmod/init/profile

They also provide certain databases. A lot of information about working on Euler can be found here, and elsewhere on that wiki.

How to be a considerate user#

On any server, there are limited resources for computing, mainly:

  • CPUs, or cores, or threads

  • Memory (RAM)

  • Storage space

  • Read and write capacity

  • Compute time

When you want to run a script or piece of software you should consider what resources it will use. Particularly with software, sometimes it will by default try to use all available resources, which will mean that everyone else on the server will have problems performing even basic tasks.
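One practical safeguard against a tool grabbing every core: many numerical libraries respect standard threading environment variables, so capping them before launch limits the damage. A sketch (the cap of 4 is an arbitrary example; the specific per-tool flag varies):

```shell
# Cap common threading knobs before launching a tool; libraries built on
# OpenMP, MKL or OpenBLAS respect these environment variables.
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4

# Most bioinformatics tools also take an explicit flag such as -t or
# --threads; check the tool's --help rather than trusting its defaults.
echo "thread cap set to $OMP_NUM_THREADS"
```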

Whenever possible therefore, use the job queuing system, instructions for which are here.

However, we know that you want to test things out and sometimes just try commands quickly so not everything has to be queued. As a rule of thumb, a job should definitely be queued if any of the following is true:

  • It uses more than 8 cores

  • It needs more than 50GB RAM

  • It involves reading or writing hundreds of files

  • It will take longer than an hour
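To judge a job against the thresholds above, you first need to know what the server has and what your command consumes. A quick sketch with standard tools; a timed run on a small subset of your data is a good way to estimate the full wall-clock time:

```shell
# How many CPU cores does this server have?
nproc

# How much memory is free right now? (-h human-readable)
free -h

# Estimate runtime by timing a small test case first
# (sleep 1 stands in for your command on a data subset)
time sleep 1
```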

What does my software use?#

If you can’t find out from the software documentation what resources it needs, you can observe what a running command is doing with the command top.

This brings up a continuously updating list of the processes running on the system. To reduce the list to only your own, press u, type your username and press Enter. To exit, press q.

The columns of main interest are %CPU and %MEM. %CPU shows CPU usage as a percentage of one core (so, for instance, 400 means four cores in full use), and %MEM shows memory usage as a percentage of the server’s total memory (which is 1TB).
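top is interactive; if you want a one-off, script-friendly snapshot of the same numbers (for logging, say), ps can produce one. A minimal sketch using GNU ps as found on Linux servers:

```shell
# A one-off snapshot of your own processes, showing the same %CPU and
# %MEM figures as top, sorted by memory use (highest first)
ps -u "$(id -un)" -o pid,%cpu,%mem,comm --sort=-%mem | head -n 5
```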

If your software is using too many resources, you can go to the terminal where it is running and Ctrl-C should kill it. If not, then from top you can type k followed by the PID of the task and then 9 (the number of the SIGKILL signal, which a process cannot catch or ignore). Alternatively, on the terminal you can use the command kill -9 <PID>.
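The same escalation can be done entirely from the shell. A self-contained sketch, using a background sleep as a stand-in for a misbehaving job:

```shell
# Start a long-running command in the background as a stand-in for a
# misbehaving job, and remember its process ID
sleep 1000 &
PID=$!

# Polite first: the default signal (SIGTERM) lets the program clean up
kill "$PID"
wait "$PID" 2>/dev/null || true   # reap the process; ignore its exit status

# Escalate only if needed: SIGKILL (signal 9) cannot be caught or ignored
kill -9 "$PID" 2>/dev/null || true

# Confirm it is gone
ps -p "$PID" > /dev/null || echo "process $PID is gone"
```

Trying SIGTERM before SIGKILL gives the program a chance to flush output files and release resources cleanly.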

Use of screen#

A compromise, if you have problems running software through the queue, is to use the screen command. This allows commands to keep running even after you have logged out of the server.

# Launch a screen
screen

# Reset any modules you have previously loaded
ml purge
ml <MODULE> # load what you need

# Then run the command you want to

# Leave the screen running (detach): press Ctrl+a then d

# Go back to a screen
screen -dr

# If you run multiple screens, you can name them
screen -S first_screen

# And then reattach to them using the name
screen -dr first_screen

Use of screen is appropriate for long-running jobs that don’t use many cores or much memory; anything heavier should definitely be queued.
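For completeness: if screen is unavailable, or you only need a single command to survive logout, the standard nohup utility achieves a similar effect without a reattachable session. A minimal sketch (the log file name job.log is our choice):

```shell
# nohup detaches a command from the terminal's hangup signal, so it
# survives logout; output goes to the file we redirect to (job.log here)
nohup sleep 5 > job.log 2>&1 &
echo "started background job with PID $!"

# Later, check whether it is still running
ps -p "$!" > /dev/null && echo "still running"
```

Unlike screen, you cannot reattach to the command's terminal afterwards, so make sure everything you need is written to the log file.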

Bioinformatic services#

Your friendly neighbourhood bioinformaticians are Chris, Anna and Lilith. Chris works for the whole institute, whilst Anna and Lilith are hired by the NCCR Microbiomes. You can contact any of us via email.

We offer support for your research with advice, analysis and software as the project requires. We have standard pipelines for the following data:

  • 16S sequencing

  • Genome and metagenome assembly

  • Sequence annotation

Transposon or tag-based sequencing experiments (mBARq)

But we also have expertise beyond these topics - it really never hurts to ask whether we can help with a problem you may be having in your sequencing, analysis or statistics.

Note

If you expect our involvement at some point in your project, then talk to us about it right from the beginning. Designing your experiments so that you can make the most out of downstream sequencing or numerical data is vital, and we can help you with that. We can still do our best if you bring data to us after the fact, but there may be only so much we can do if we are not consulted from the beginning about controls, the number of repeats, the type and depth of sequencing and so on.