Introduction to Unix 1

General information

Main objective

In this lecture we will introduce Unix: an operating system that runs on almost all high performance computing (HPC) servers, which we interface with via the command line.

Learning objectives

  • Students can use the command line to issue basic commands with arguments

  • Students can navigate the Unix file system and perform basic file operations

  • Students can get help with commands and programs

  • Students can inspect files on the server

Resources

This section requires the use of the R Workbench.

The structure of a command

Commands are our tool to tell the computer what to do. Most commands have options and arguments. Arguments are often essential for a command to operate properly; they are the pieces of information required by a command, such as a file name. Options are, of course, optional, and offer ways to modify the way the command works.

[Figure: command structure]

For instance, echo will take any text you give it as an argument and then send it back to you as output:

# My first command
echo 'Hello World!'

If you use the option -n, then it will not add a ‘new line’ to the end of the output:

# My second command
echo -n 'Hello World!'

Some commands end up with very complex structures, because they can have many options and arguments. In general, options will be of the format -a where a is a single letter or --word where word is a string (a series of letters, in computer terms).

  • Note: the command line is case-sensitive! So it does matter if you write -a or -A.
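Case sensitivity is easy to see with ls, whose -a and -A options do slightly different things. The sketch below assumes a throwaway directory created with mktemp -d and two made-up file names; mktemp and touch are not covered in this lecture and only set up the demonstration.

```shell
# Create a temporary directory with one hidden and one visible file
# (demo_dir and both file names are hypothetical, chosen for this example)
demo_dir=$(mktemp -d)
touch "$demo_dir/.hidden_file" "$demo_dir/visible_file"

# Lowercase -a: list everything, including the special entries . and ..
ls -a "$demo_dir"

# Uppercase -A: list hidden files too, but leave out . and ..
ls -A "$demo_dir"
```

The two listings should differ only by the . and .. entries, which shows that -a and -A are treated as separate options.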

Useful command line tricks

  • You can use the up and down arrow keys to navigate through previously used commands (known as your history) and repeat or modify them.

  • Windows: To copy text from the terminal you will have to highlight it and right-click to use the in-browser menu and copy. Similarly you have to use the in-browser menu to paste into the terminal. The reason for this is that Ctrl + c and Ctrl + v have effects inside the terminal.

  • Mac: You can fortunately use Cmd + c and Cmd + v to copy and paste as normal. You can use Ctrl and various keys for in-terminal commands.

  • When typing a command or file name, you can press the ‘tab’ key to auto complete what you are typing. If there are multiple commands or files with similar names, auto completion will fill in as far as the first ambiguous character before you have to give it some more input. This method makes it much less likely that you make a spelling error.

  • Pressing Ctrl + c will send an interrupt signal that cancels the currently running command and brings you back to the command line.

  • Pressing Ctrl + r will allow you to search through your command history.

  • Pressing Ctrl + l will clear the screen.

  • See previous commands by typing history and pressing enter.

  • Double-click to select a word, triple-click to select a line

  • Using a # character allows you to make comments that have no effect when run.
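As a small illustration of the last point, the following sketch shows where a # starts a comment and where it does not. The text being echoed is arbitrary.

```shell
# This whole line is a comment and is not executed
echo 'Hello World!'   # a comment can also follow a command on the same line

# Inside quotes, # is ordinary text rather than the start of a comment
echo 'Sample #1'
```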

Exercises

  • Try to echo “My first command”

  • Use the arrow key to execute the same command again

  • Try typing e then pressing tab twice, what do you see?

  • Try adding c to make ec and pressing tab again. What happens?

  • Try to copy/paste your echo command: echo 'My first command'

  • Try to clear the screen, can you still paste your echo command?

  • Try to echo 'My first command' once with the -n option and once with the -N option. What do you notice?

The file system

You may be used to the file system in Windows or Mac OS X, where directories can contain files and more directories. The Unix filesystem is structured in the same way, as a tree that begins at the ‘root’ directory ‘/’. The directory names in a path are separated by slash characters /.

[Figure: filesystem hierarchy]

When you work on the command line, you are located in a directory somewhere in this tree. There are two ways to refer to a location: its absolute path, starting at the root directory, or its relative path.

# Absolute path
/nfs/course/home/<user_name>

# Relative path
../../home/<user_name>

The .. refers to the directory above a location, so the relative path here goes up twice, then back down to your home directory. If a path starts with ~/ then it refers to your home directory. If a path starts with ./ then it refers to the current directory.

# References the level above
../

# References the home directory
~/

# References the current directory
./
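To see absolute and relative paths in action, here is a minimal sketch that builds a small throwaway directory tree and moves around in it. It relies on three commands not introduced above (cd to change directory, pwd to print the current directory, and mkdir -p to create nested directories), and the directory names are invented for the example.

```shell
# Build a small temporary tree: <tmp>/course/home/student (hypothetical names)
tree_root=$(mktemp -d)
mkdir -p "$tree_root/course/home/student"   # -p also creates the missing parents

# Absolute path: starts at the root directory /
cd "$tree_root/course/home/student"
pwd

# Relative path: go up two levels, which lands in the course directory
cd ../..
pwd

# Relative path starting from the current directory
cd ./home/student
pwd

# ~/ is replaced by the path of your home directory
echo ~/
```

Each pwd prints the absolute path of the directory you ended up in, whichever kind of path was used to get there.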

Getting help

man will show a manual for most basic commands, providing the correct syntax to use it and the various options available.

# Read the manual
man ls

Other programs have different ways to provide help on how to use them. An online tutorial or a comprehensive manual is best, but sometimes you only have the command line to help you.

# Help please!
python3 -h
python3 --help

Basic file operations

cp copies a file from one location to another. The example will copy a file containing the genome sequence of E. coli K12 MG1655 to your home directory.

# Copy
cp <source> <destination>
cp /nfs/course/genomes/bacteria/escherichia/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna ~/

mv moves a file from one location to another. The example renames the file you copied to your home directory, because the destination is not an existing directory. Thus you can move and rename a file with the same command.

# Move or rename
mv <source> <destination>
mv ~/GCF_000005845.2_ASM584v2_genomic.fna ~/E.coli_K12_MG1655.fna

rm removes a file, so use it with care.

# Remove
rm <path_to_file>
rm ~/E.coli_K12_MG1655.fna

mkdir creates a new directory with the given name.

# Make directory
mkdir genomes

rmdir removes an empty directory.

# Remove an empty directory
rmdir genomes
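The commands above can be tried safely in a scratch directory before using them on real data. The sketch below is one possible run-through; the file names are invented, and touch (which creates an empty file) and cd are assumptions not covered in this section.

```shell
# Work in a temporary directory so nothing important can be lost
work_dir=$(mktemp -d)
cd "$work_dir"

touch original.fna              # create an empty stand-in file
cp original.fna copy.fna        # copy: both files now exist
mkdir genomes                   # make a new directory
mv copy.fna genomes/            # destination is a directory: the file moves and keeps its name
mv original.fna renamed.fna     # destination is not a directory: the file is renamed
ls                              # shows genomes and renamed.fna
ls genomes                      # shows copy.fna

rm genomes/copy.fna renamed.fna # remove both files (there is no undo)
rmdir genomes                   # works only because the directory is now empty
ls -A                           # prints nothing: the scratch directory is empty again
```

Note that rmdir would refuse to run if genomes still contained a file, which is why the files are removed first.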

Exercises

  • Create a new directory called “genomes”

  • Copy the E. coli file into your new directory “genomes”

  • Rename the file to “E.coli_file”

  • Use the help option of the ls command to find which option gives you the size of the genome file

  • Using man and cp, find out how to copy a directory.

File name conventions

In Unix systems there are really only two types of files: text and binary. The file name ending (such as .txt or .jpg) does not matter to the system the way it does in Windows or Mac OS, but by convention it is used to indicate the file type. Some file types you will encounter include:

  • .txt - A generic text file.

  • .csv - A ‘comma separated values’ file, which is usually a table of data with each line a row and each column separated by a comma.

  • .tsv - A ‘tab separated values’ file, which is the same but separated by tab characters.

  • .fasta or .fa - A fasta formatted sequence file, in which each sequence has a header line starting with ‘>’.

  • .fna - A fasta formatted nucleotide sequence file, usually gene sequences.

  • .faa - A fasta formatted protein sequence file.

  • .sh - A ‘shell script’, which contains commands to run.

  • .r - An R script, which contains R commands to run.

  • .py - A python script, which contains python commands to run.

  • .gz or .tar.gz - A file that has been compressed with ‘gzip’ so that it takes up less space on the disk and transfers over the internet faster. A .tar.gz file is a ‘tar’ archive, which bundles several files into one, compressed with gzip.
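That the file name ending is only a convention can be checked directly. In this sketch a text file is given a misleading .jpg ending and is still read as plain text. The file names are made up, and the > character (which writes the output of echo into a file) is an assumption not covered in this lecture.

```shell
# Work in a temporary directory
conv_dir=$(mktemp -d)
cd "$conv_dir"

echo 'just some text' > notes.txt   # write one line of text into notes.txt
mv notes.txt notes.jpg              # rename it with an image-style ending
cat notes.jpg                       # still prints the text: the content is unchanged
```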

Transferring files

There are many different protocols for transferring files between servers. You may have heard of FTP (File Transfer Protocol), which is commonly used but not secure. A more secure alternative is SCP (Secure Copy Protocol), which programs such as WinSCP use. The command scp is an easy way to transfer a file between the computer you are working on and a server (or between two different servers!). Another command to copy files is rsync, which has many options, such as preserving the ownership and modification time of a file (and much more).

# Secure CoPy
man scp
scp source user@server:destination # local to server
scp user@server:source destination # server to local

# Rsync
man rsync
rsync -a source user@server:destination # local to server
rsync -a user@server:source destination # server to local

# Download an E. coli genome from the server to your local computer
# First open the Windows Command Prompt or the Mac Terminal
scp user@micro-rstudio.ethz.ch:/nfs/course/genomes/bacteria/escherichia/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna .
# or
rsync -a user@micro-rstudio.ethz.ch:/nfs/course/genomes/bacteria/escherichia/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna .

# Copy the E.coli genome (or any file) from your local computer to the home folder on the server
# Again, on your local system, run the following commands in the Windows Command Prompt or the Mac Terminal
scp GCF_000005845.2_ASM584v2_genomic.fna user@micro-rstudio.ethz.ch:~/
rsync -a GCF_000005845.2_ASM584v2_genomic.fna user@micro-rstudio.ethz.ch:~/

Sometimes you want to download a file directly from the internet to the server, rather than going via your local machine. wget allows you to download files in this way.

# Download from the internet
wget source-URL
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/482/265/GCF_000482265.1_EC_K12_MG1655_Broad_SNP/GCF_000482265.1_EC_K12_MG1655_Broad_SNP_genomic.fna.gz

Compressing and decompressing files

Files can be compressed to take up less space on the hard drive (disk), or for transfer over the internet. The file you downloaded is an example, and we can decompress it using the gunzip command:

# Decompress a file
gunzip GCF_000482265.1_EC_K12_MG1655_Broad_SNP_genomic.fna.gz

If you ever need to compress a file, for instance to send it to someone, you can use the gzip command:

# Compress a file
gzip GCF_000482265.1_EC_K12_MG1655_Broad_SNP_genomic.fna
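One detail worth seeing for yourself is that gzip and gunzip replace the file they work on rather than creating a second copy. The following sketch demonstrates this on a tiny invented fasta file; the file name, the sequence, and the use of > and >> to write text into a file are assumptions made for the example.

```shell
# Work in a temporary directory
gz_dir=$(mktemp -d)
cd "$gz_dir"

# Create a two-line stand-in fasta file
echo '>seq1' > example.fna
echo 'ACGTACGTACGT' >> example.fna

gzip example.fna        # example.fna is replaced by example.fna.gz
ls                      # only example.fna.gz is listed

gunzip example.fna.gz   # example.fna.gz is replaced by example.fna
ls                      # only example.fna is listed
cat example.fna         # the contents are unchanged
```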

Exercises

  • On your local computer (Windows: in the Windows Command Prompt; Mac OS X: in the Mac Terminal), upload a file of your choice to the server.

  • On the server, download the E. coli file in the example above to your home folder.

  • Decompress the file.

Working with files

Looking at files

The command cat displays the entire contents of a file directly on the terminal. For large files this can flood the terminal with output for a long time, so remember that you can cancel commands in progress with Ctrl + c.

# ConCATenate
cat E.coli_K12_MG1655.fna

The command head displays only the first 10 lines of a file directly on the terminal. Among the available options, -n x outputs the first x lines instead, and a negative number (-n -x) outputs all lines except the last x.

# Show file head
head E.coli_K12_MG1655.fna
head -n 1 E.coli_K12_MG1655.fna

The command tail displays only the last 10 lines of a file directly on the terminal. It has similar options to head: -n x outputs the last x lines, and a number with a plus sign, -n +x (note the “+” character), outputs everything from line x onwards, that is, all lines except the first x - 1.

# Show file tail
tail E.coli_K12_MG1655.fna

The command less is a versatile way to look at a file in the command line. Instead of showing you the contents of a file directly on the terminal, it ‘opens’ the file to browse. You can use the arrow keys, page up, page down, home, end and the spacebar to navigate the file. Pressing q will quit. A number of useful options exist for the command, such as showing line numbers or displaying without line wrapping. It also has a search feature that we will cover later.

# Browse file
less E.coli_K12_MG1655.fna

The command wc quickly counts the number of lines, words and bytes in a file; for plain text such as a fasta file, the byte count equals the number of characters, including invisible characters like ‘newline’ and whitespace. Its options allow you to specify which value to return; otherwise it gives all three.

# Count things
wc E.coli_K12_MG1655.fna
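The -n options of head and tail are easiest to understand on a file whose contents you can predict. This sketch uses a made-up file, numbers.txt, containing the numbers 1 to 12 on separate lines; printf and > are used only to create it and are not covered in this lecture.

```shell
# Work in a temporary directory
view_dir=$(mktemp -d)
cd "$view_dir"

# Create a file with the numbers 1 to 12, one per line
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > numbers.txt

head -n 3 numbers.txt     # first three lines: 1, 2, 3
tail -n 3 numbers.txt     # last three lines: 10, 11, 12
tail -n +10 numbers.txt   # from line 10 to the end: 10, 11, 12
wc -l numbers.txt         # -l returns only the line count, here 12
```

Because tail -n +10 starts at line 10, it skips the first nine lines, not the first ten.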

Exercises

  • Use cat to look at the E. coli genome file you copied last time. Is it suitable for looking at this file?

  • Use head and tail to examine the first and last 10 lines of the genome file. Now try to look at the first and last 20 lines.

  • Use less to look at the genome file. Navigate through the file with the keys listed above, then return to the Terminal.

  • Use the man command we learned to read about the wc command.

  • Can you find out how many lines are in the genome file?

Wildcards

When providing a file path as an argument to a command, it is often possible to provide multiple file paths using wildcards. These are special characters or strings that can be substituted for a matching pattern.

  • ? matches any single character

  • * matches any number of any characters

  • […] matches any single character listed within the brackets

  • {word1,word2,…} matches any of the comma-separated strings inside the braces

For instance:

# Pattern matching
ls /cluster/home/ssunagaw/teaching/ecoli/*      # lists all files in the ecoli directory
ls /cluster/home/ssunagaw/teaching/ecoli/*.fna  # lists all nucleotide fasta files there
ls /cluster/home/ssunagaw/teaching/ecoli/*.f?a  # lists all nucleotide and protein fasta files there
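If you do not have access to that directory, the same patterns can be tried on a handful of empty files in a scratch directory. The file names below are invented for the example, and touch and cd are assumptions not covered in this lecture. The {word1,word2} form is left out because not every shell supports it.

```shell
# Work in a temporary directory with five empty, made-up files
glob_dir=$(mktemp -d)
cd "$glob_dir"
touch ecoli.fna ecoli.faa ecoli.txt sample1.fna sample2.fna

ls *.fna            # every name ending in .fna
ls *.f?a            # .fna and .faa files: ? stands for exactly one character
ls sample[12].fna   # sample1.fna and sample2.fna
ls ecoli.*          # every name starting with "ecoli."
```

The shell replaces each pattern with the list of matching file names before ls runs, so ls simply receives several file names as arguments.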

Homework

  • Upload a picture of yourself or a pet into your home folder and name it homework.jpg or .png or whatever format it is

  • Find the out-of-place file in /nfs/course/genomes and copy it into your home folder

  • For next time, find out:
    • What happens when you copy a file with the same name as an existing file?

    • What happens when you delete the directory you are currently in?

    • What happens when you create a directory with the same name as an existing one?

    • What happens if you run echo --help? And how can you get the help information for echo?