Session 1.1. - Introduction to Bash and UNIX#

0. UNIX system and terminals#

The Unix operating system has been around since 1969. Back then there was no such thing as a graphical user interface: you typed everything. It may seem archaic to issue commands with a keyboard today, but keyboard tasks are much easier to automate than mouse tasks. There are several variants of Unix and Unix-like systems (including Linux and Apple's macOS).

Increasingly, the raw output of biological research exists as in silico data, usually in the form of large text files. Unix is particularly suited to working with such files and has several powerful (and flexible) commands that can process your data for you. The real strength of learning Unix is that most of these commands can be combined in an almost unlimited fashion. So if you can learn just five Unix commands, you will be able to do a lot more than just five things.

A terminal is the common name for the program that does two main things. It allows you to type input to the computer (i.e. run programs, move/view files etc.) and it allows you to see output from those programs. All Unix machines will have a terminal program and on Apple computers, the terminal application is unsurprisingly named ‘Terminal’.

Unix keeps files arranged in a hierarchical structure. From the top-level of the computer, there will be a number of directories, each of which can contain files and subdirectories, and each of those in turn can of course contain more files and directories and so on, ad infinitum. It is important to note that you will always be in a directory when using the terminal. The default behavior is that when you open a new terminal you start in your own home directory (containing files and directories that only you can modify).

Directory structures#

1. Basic Navigation#

a. pwd#

Tells you which directory you currently are in.

pwd

b. ls#

Lists the files in a directory. ls has many options: -l lists files in ‘long format’, which shows each file's exact size, its owner, its access permissions, and when it was last modified. -a lists all files, including hidden files (those whose names start with a dot). For more information on this command, check this link.

ls option

What do the different arguments do in each example?

ls -la
ls -ltra
ls -ltra <./path/to/directory>
ls -lh <./path/to/directory>

c. cd#

Moves you from one directory to another. Running this

cd

moves you to your home directory. The command also accepts an optional directory name (dirname), which moves you to that directory instead.

cd dirname

Your current directory is ., while .. refers to the parent directory, one level above where you currently are:

cd ..
pwd
cd <./path/to/directory>
pwd
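The navigation commands above can be tried safely in a throwaway directory. The following is a minimal sketch, assuming a scratch tree created with mktemp and mkdir; the directory names project and data are made up for illustration.

```shell
#!/usr/bin/env bash
# Build a small scratch tree so the example does not touch real data.
# The names "project" and "data" are hypothetical.
scratch=$(mktemp -d)
mkdir -p "$scratch/project/data"

cd "$scratch/project/data"    # move down with a path
start=$(basename "$(pwd)")    # last component of the current directory

cd ..                         # move up one level, into project
parent=$(basename "$(pwd)")

cd ./data                     # a relative path, starting from the current directory
back=$(basename "$(pwd)")

echo "$start $parent $back"   # prints: data project data
```

Here basename is only used to shorten the output of pwd to the last directory name.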

d. ssh#

Used to connect to a remote computer (for example, connecting to ETH-morgan):

ssh <username>@morgan.ethz.ch

e. man#

Shows the manual for the specified command (press q to quit).

man command

2. Basic Operations#

a. mkdir#

Makes a new directory.

mkdir dirname

You can use this to create multiple directories at once within your current directory.

mkdir 1stDirectory 2ndDirectory 3rdDirectory

b. cp#

Copies a file from one location to another.

cp filename1 filename2

Where filename1 is the source path to the file and filename2 is the destination path to the file.

When copying directories, we need to add the -R (recursive) option:

cp -R <./path/to/directory> ./

We are going to copy this tutorial directory to your home (done after ssh to morgan):

cp -R  /nfs/nas22/fs2202/biol_micro_teaching/551-1119-00L-2024/s02_bash ./

c. mv#

Moves a file from one location to another.

mv filename1 filename2

Where filename1 is the source path to the file and filename2 is the destination path to the file.

It can also be used to rename a file.

mv old_name new_name

d. rm#

Removes a file. Using this command on a directory gives you an error such as rm: directory: is a directory. To remove a directory, pass -r, which removes the directory and its contents recursively. Optionally, add the -f flag to force the deletion, i.e. without asking for confirmation. Removed files do not go to a trash folder, so check the path before running rm -r.

rm filename
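The four commands of this section can be combined in one short session. This is a sketch under the assumption that everything happens inside a scratch directory from mktemp; the names raw, backup, a.txt and renamed.txt are hypothetical.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
cd "$scratch"

mkdir raw backup                 # create two directories at once
echo "sample line" > raw/a.txt   # create a small file to work with

cp raw/a.txt raw/b.txt           # copy a file
cp -R raw backup/                # copy a whole directory into backup/
mv raw/b.txt raw/renamed.txt     # rename a file within the same directory
rm raw/a.txt                     # remove a single file
rm -r backup                     # remove a directory and everything in it

ls raw                           # prints: renamed.txt
```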

3. Inspecting files#

In Unix systems there are really only two types of files: text or binary. The file name ending (.txt or .jpg) does not matter to the system the way it does in Windows or macOS; it is, however, used by convention to indicate the file type. Some file types you will encounter include:

  • .txt - A generic text file

  • .csv - A ‘comma separated values’ file, which is usually a table of data with each line a row and each column separated by a comma

  • .tsv - A ‘tab separated values’ file, which is the same but with columns separated by tab characters

  • .fasta or .fa - A fasta formatted sequence file, in which each sequence has a header line starting with ‘>’

  • .fna - A fasta formatted nucleotide sequence file, usually gene sequences

  • .faa - A fasta formatted protein sequence file

  • .sh - A ‘shell script’, which contains terminal commands to run sequentially

  • .r - An R script, which contains R commands to run

  • .py - A python script, which contains python commands to run

  • .gz or .tar.gz - A file that has been compressed with a program called ‘gzip’ so that it takes up less space on the disk and transfers over the internet faster (a .tar.gz is a tar archive, which bundles several files, that has then been gzipped)

a. cat#

It can be used for the following purposes under UNIX or Linux.

  • Display text files on screen

  • Copy text files

  • Combine text files

  • Create new text files

cat filename                        # Inspect the file
cat file1 file2                     # Prints the two files one after the other
cat file1 file2 > newcombinedfile   # Create a file with file1 and file2
cat file1 >> newcombinedfile        # Paste file1 after the content in newcombinedfile
cat < file1 > file2                 # Copy file1 to file2
zcat file1.gz                       # To print a compressed file

b. head#

Outputs the first 10 lines of a file.

head filename

To output a different number of lines (ex. 100):

head -100 filename

c. tail#

Outputs the last 10 lines of a file.

tail filename
tail -100 filename
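To see cat, head and tail working together, here is a small sketch on generated files. The file names are hypothetical, and the -n form of the line-count option is used as the portable spelling of head -100 style options.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
cd "$scratch"

printf '%s\n' line{1..20} > numbers.txt   # 20 lines: line1 ... line20
printf 'extra\n' > more.txt               # a second, one-line file

cat numbers.txt more.txt > combined.txt   # concatenate both into a new file
cat more.txt >> combined.txt              # append more.txt once more at the end

first=$(head -n 2 combined.txt)           # the first two lines
last=$(tail -n 2 combined.txt)            # the last two lines
echo "$first"
echo "$last"
```

The first echo prints line1 and line2, the second prints extra twice, since more.txt was added to the end two times.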

d. more#

Shows a file one screen at a time (press space to move forward and q to quit).

more filename

e. less#

Instead of showing you the contents of a file directly on the terminal, it ‘opens’ the file to browse. You can use the arrow keys, page up, page down, home, end and the spacebar to navigate the file. Pressing q will quit.

less filename
less -S filename    # Do not wrap long lines; handy for viewing a tsv file

4. Text Operations#

a. echo#

Displays a line of text.

Display “Hello World”:

echo Hello World
Hello World

Display “Hello World” with a newline between the words (-e enables escapes such as \n, -n suppresses the trailing newline that echo normally adds):

echo -ne "Hello\nWorld\n"
Hello
World

b. grep#

Print lines matching a pattern

Find the exact string ‘AUUACUGACGCUCAUGGACGAA’ in example.fasta

grep 'AUUACUGACGCUCAUGGACGAA' example.fasta

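You can check for more than one pattern by separating the patterns with | and passing -E (extended regular expressions); without -E, grep looks for a literal | character:

grep -E 'AUUACUGACGCUCAUGGACGAA|GACGAAAGCCAGGGGAGCGAAAGGG' example.fasta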

You can read patterns from a file:

grep -f taxa.txt example.fasta
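The three ways of calling grep shown above can be compared on a tiny input. This is a sketch on a made-up file in fasta layout (toy.fasta and patterns.txt are hypothetical names, and the sequences are not real data). Because | is matched literally in grep's default mode, the alternation uses -E.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
cd "$scratch"

# A made-up three-record file in fasta layout.
cat > toy.fasta <<'EOF'
>seq1
AUUACUGACG
>seq2
GGGCCCAAAU
>seq3
AUUACUUUUU
EOF

printf 'GGGCCC\nUUUUU\n' > patterns.txt          # one pattern per line

single=$(grep 'AUUACU' toy.fasta)                # lines containing the string
either=$(grep -E 'GACG|AAAU' toy.fasta)          # -E makes | mean "or"
from_file=$(grep -f patterns.txt toy.fasta)      # patterns read from a file

echo "$single"
echo "$either"
echo "$from_file"
```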

c. sed#

Stream editor for filtering and transforming text

Create an example.txt:

echo 'Hello This is a Test 1 2 3 4' > example.txt

replace all spaces with hyphens

sed 's/ /-/g' example.txt

replace all digits with “d”

sed 's/[0-9]/d/g' example.txt
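For reference, the sketch below runs the same substitutions on the example line and shows the expected results. It feeds the text to sed through standard input instead of a file, and adds one variant without the g flag to show what that flag changes.

```shell
#!/usr/bin/env bash
line='Hello This is a Test 1 2 3 4'

# sed reads standard input when no file name is given.
hyphens=$(echo "$line" | sed 's/ /-/g')      # every space becomes a hyphen
first_only=$(echo "$line" | sed 's/ /-/')    # without g: only the first match on the line
digits=$(echo "$line" | sed 's/[0-9]/d/g')   # every digit becomes the letter d

echo "$hyphens"      # Hello-This-is-a-Test-1-2-3-4
echo "$first_only"   # Hello-This is a Test 1 2 3 4
echo "$digits"       # Hello This is a Test d d d d
```

Note that sed prints the modified text and leaves example.txt itself unchanged unless the output is redirected to a new file.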

d. awk#

awk is one of the most useful commands for handling text files. It processes a file line by line and, by default, splits each line into fields on whitespace. The most common syntax for an awk command is

awk '/search_pattern/ { action_to_take_if_pattern_matches; }' file_to_parse

Let's take the file /etc/passwd as an example. Here’s the sample data that this file contains:

root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync

Now let's get only the usernames from this file. The -F option specifies the character on which the fields are separated; in our case it’s :. { print $1 } means print the first field of each line.

awk -F':' '{ print $1 }' /etc/passwd

After running the above command you will get following output.

root
daemon
bin
sys
sync

For more detail on how to use awk, check following link.
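The general pattern-plus-action form can be tried on the same sample lines. The sketch below works on a copy of the sample data (a hypothetical file passwd_sample) so that it does not depend on the contents of the real /etc/passwd.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)

# A copy of the sample lines shown above.
cat > "$scratch/passwd_sample" <<'EOF'
root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
EOF

# Pattern and action: for lines matching /nologin/, print fields 1 and 7.
nologin=$(awk -F':' '/nologin/ { print $1, $7 }' "$scratch/passwd_sample")

# A condition on a field: user IDs (field 3) greater than 2.
high_uid=$(awk -F':' '$3 > 2 { print $1 }' "$scratch/passwd_sample")

echo "$nologin"
echo "$high_uid"   # prints sys and sync, one per line
```

The comma in print $1, $7 separates the two fields with a space in the output.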

e. sort#

Sort lines of text files

cat example_sort.txt

Sort

sort example_sort.txt

Sort in reverse order (-r reverses the order; use -R instead to shuffle the lines randomly)

sort -r example_sort.txt

f. uniq#

Report or omit repeated lines.

Show only the unique lines of example_sort.txt (you need to sort it first, because uniq only detects repeats on adjacent lines):

sort example_sort.txt | uniq
a
b
c
d

Show each unique line together with the number of times it occurs:

sort example_sort.txt | uniq -c
3 a
2 b
2 c
1 d

The vertical bar |, called a pipe, sends the output of one command to the next command as its input.
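Pipes can chain more than two commands. As a sketch, the example below recreates an input with the same counts as in the output above (the exact contents of example_sort.txt are an assumption) and finds the most frequent line. It also shows why the initial sort matters.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
printf '%s\n' b a c a d b a c > "$scratch/example_sort.txt"

# Count each distinct line, sort the counts numerically in reverse, keep the top one.
top=$(sort "$scratch/example_sort.txt" | uniq -c | sort -rn | head -n 1)
echo "$top"        # the count 3 followed by a (uniq -c pads the count with spaces)

# Without the initial sort, uniq merges only adjacent duplicates.
# This input has none, so all 8 lines are kept.
unsorted=$(uniq "$scratch/example_sort.txt" | wc -l)
echo "$unsorted"
```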

g. cut#

The cut command in Bash is very useful for extracting specific columns of text or fields from files or command outputs.

cut -d',' -f1 data.csv
cut -f1,3 -d$'\t' data.tsv
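The two calls above can be checked on a small table. This is a sketch with made-up data (the columns sample, reads and site are hypothetical). It also relies on the fact that cut uses a tab as its default delimiter, so -d can be omitted for tsv files.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)

# The same hypothetical table, once comma separated and once tab separated.
printf 'sample,reads,site\nS1,1200,lake\nS2,950,soil\n' > "$scratch/data.csv"
printf 'sample\treads\tsite\nS1\t1200\tlake\nS2\t950\tsoil\n' > "$scratch/data.tsv"

names=$(cut -d',' -f1 "$scratch/data.csv")         # first column
name_site=$(cut -d',' -f1,3 "$scratch/data.csv")   # first and third columns
reads=$(cut -f2 "$scratch/data.tsv")               # second column of the tsv

echo "$names"
echo "$name_site"
echo "$reads"
```

One limitation to keep in mind: cut splits on every delimiter, so it does not understand quoted csv fields that themselves contain commas.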

h. wc#

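Tells you how many lines, words and characters there are in a file. (Strictly speaking the third number is the size in bytes, which equals the number of characters for plain ASCII text.)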

wc example.fasta

Example:

$ wc example.fasta
10896 13479 890979 example.fasta

Where 10896 is the number of lines, 13479 the number of words and 890979 the number of characters.

You can also count lines:

$ wc -l example.fasta

Exercise#

Use the previous commands to:

  • Count the number of entries in the fasta file

  • Count the number of times the subsequence ‘AUUACUGACGCUCAUGGACGAA’ appears in the fasta file

  • Which limitations do you see with this approach?

5. Other useful operations#

a. gunzip#

Un-compresses files compressed by gzip.

gunzip filename

b. gzcat#

Lets you look at a gzipped file without actually having to gunzip it. On many Linux systems this command is called zcat; gzcat is the name used on macOS.

gzcat filename

c. gzip#

Compresses files.

gzip filename
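A round trip through gzip and gunzip shows how the three commands relate. This is a sketch on a made-up file; it uses gunzip -c, which writes the uncompressed content to standard output, as a portable stand-in for zcat or gzcat.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
cd "$scratch"
printf '>seq1\nAUUACUGACG\n' > toy.fasta   # hypothetical two-line file

gzip toy.fasta                 # replaces toy.fasta with toy.fasta.gz
ls                             # prints: toy.fasta.gz

# Look inside without unpacking: the .gz file stays as it is.
peek=$(gunzip -c toy.fasta.gz | head -n 1)
echo "$peek"                   # prints: >seq1

gunzip toy.fasta.gz            # restores toy.fasta and removes the .gz file
ls                             # prints: toy.fasta
```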

d. wget#

Downloads a file from a URL and saves it in the current directory.

wget url

6. Basic Shell Programming#

The first line that you will write in a bash script file is called the shebang. It tells the system which interpreter to use, so that the script can be executed like a standalone executable (provided the file is marked as executable) without typing sh, bash, python, php etc. beforehand in the terminal.

#!/usr/bin/env bash

Creating variables in bash is similar to other languages. There are no data types: a variable in bash can contain a number, a character, a string of characters, etc. You do not need to declare a variable; assigning a value to its name creates it. Note that there must be no spaces around the = sign.

Example:

str="hello world"

The above line creates a variable str and assigns “hello world” to it. The value of a variable is retrieved by putting $ in front of the variable name. Wrapping the expansion in double quotes keeps the value together as one piece of text.

Example:

echo "$str"   # hello world

Like other languages, bash also has arrays. An array is a variable containing multiple values. There’s no maximum limit on the size of an array. Arrays in bash are zero based: the first element has index 0. There are several ways of creating arrays in bash, which are given below.

Examples:

array[0]=val
array[1]=val
array[2]=val
array=([2]=val [0]=val [1]=val)
array=(val val val)

To display the value at a specific index, use the following syntax:

${array[i]}     # where i is the index

If no index is supplied, array element 0 is assumed. To find out how many values there are in the array use the following syntax:

${#array[@]}
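The array syntax above can be exercised in a few lines. In this sketch the array name samples and its values are hypothetical.

```shell
#!/usr/bin/env bash
samples=(lake soil gut)        # create an array with three values
samples[3]=ocean               # assign by index; indices start at 0

echo "${samples[0]}"           # lake
echo "${samples[3]}"           # ocean
echo "$samples"                # lake (no index given, so element 0 is used)
echo "${#samples[@]}"          # 4
echo "${samples[@]}"           # lake soil gut ocean

# Quoting "${samples[@]}" keeps each element as one word when looping.
for s in "${samples[@]}"; do
    echo "sample: $s"
done
```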

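Bash also supports conditional parameter expansions, which behave a little like a ternary operator: they return one value or another depending on whether a variable is set. Check some examples below.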

${varname:-word}    # if varname exists and isn't null, return its value; otherwise return word
${varname:=word}    # if varname exists and isn't null, return its value; otherwise set it to word and then return its value
${varname:+word}    # if varname exists and isn't null, return word; otherwise return null
${varname:offset:length}    # performs substring expansion. It returns the substring of $varname starting at offset and up to length characters
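A short sketch of these expansions in action; the variable names outdir and name and their values are made up for illustration.

```shell
#!/usr/bin/env bash
unset outdir                   # make sure the variable starts out unset

echo "${outdir:-results}"      # results (outdir itself stays unset)
echo "${outdir:+set}"          # prints an empty line, because outdir is unset

echo "${outdir:=results}"      # results, and outdir is now assigned
echo "${outdir:+set}"          # set

name="example.fasta"
echo "${name:0:7}"             # example (7 characters starting at offset 0)
```

A common use of the :- form is to give a script argument a default, as in "${1:-default}".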

Check some of the syntax for manipulating strings:

${variable#pattern}         # if the pattern matches the beginning of the variable's value, delete the shortest part that matches and return the rest
${variable##pattern}        # if the pattern matches the beginning of the variable's value, delete the longest part that matches and return the rest
${variable%pattern}         # if the pattern matches the end of the variable's value, delete the shortest part that matches and return the rest
${variable%%pattern}        # if the pattern matches the end of the variable's value, delete the longest part that matches and return the rest
${variable/pattern/string}  # the longest match to pattern in variable is replaced by string. Only the first match is replaced
${variable//pattern/string} # the longest match to pattern in variable is replaced by string. All matches are replaced
${#varname}     # returns the length of the value of the variable as a character string

Bash has multiple shorthand tricks for doing various things to strings.

${variable,,}    #this converts every letter in the variable to lowercase
${variable^^}    #this converts every letter in the variable to uppercase

${variable:2:8}  #this returns a substring of the string, starting at index 2 (strings start at index 0, so this is the 3rd character);
                 #the substring will be 8 characters long, so this returns the 3rd to the 10th characters.
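These expansions are often used to take file paths apart. The sketch below applies them to a hypothetical path and a short made-up sequence. The case conversions ^^ and ,, assume bash version 4 or newer (the default bash on macOS is older and does not support them).

```shell
#!/usr/bin/env bash
path="/data/run1/sample.fastq.gz"   # hypothetical path

echo "${path##*/}"        # sample.fastq.gz            (longest */ removed from the front)
echo "${path#*/}"         # data/run1/sample.fastq.gz  (shortest */ removed: the leading /)
echo "${path%.*}"         # /data/run1/sample.fastq    (shortest .* removed from the end)
echo "${path%%.*}"        # /data/run1/sample          (longest .* removed from the end)
echo "${path/run1/run2}"  # /data/run2/sample.fastq.gz
echo "${#path}"           # 26

seq="acgtACGT"
echo "${seq^^}"           # ACGTACGT
echo "${seq,,}"           # acgtacgt
echo "${seq:2:4}"         # gtAC (4 characters starting at index 2)
```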

Here are some handy pattern matching tricks

if [[ "$variable" == *subString* ]]  #this returns true if the provided substring is in the variable
if [[ "$variable" != *subString* ]]  #this returns true if the provided substring is not in the variable
if [[ "$variable" == subString* ]]   #this returns true if the variable starts with the given subString
if [[ "$variable" == *subString ]]   #this returns true if the variable ends with the given subString

The above can be shortened using a case statement and the in keyword. The patterns are tried from top to bottom and only the first one that matches is executed.

case "$var" in
    begin*)
        #variable begins with "begin"
    ;;
    *subString*)
        #subString is in variable
    ;;

    *otherSubString*)
        #otherSubString is in variable
    ;;
esac
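As a concrete sketch of this kind of pattern matching, the function below sorts file names into groups by their ending. The function name classify and the group labels are hypothetical; several patterns can share one branch when separated by |.

```shell
#!/usr/bin/env bash
# classify is a hypothetical helper used only for this sketch.
classify() {
    case "$1" in
        *.fasta|*.fa)  echo "sequence" ;;
        *.csv|*.tsv)   echo "table" ;;
        *.gz)          echo "compressed" ;;
        *)             echo "unknown" ;;   # default branch: matches anything else
    esac
}

classify genes.fa         # sequence
classify counts.tsv       # table
classify reads.fastq.gz   # compressed
classify notes            # unknown
```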

As in almost any programming language, you can use functions to group pieces of code in a more logical way or practice the divine art of recursion. Declaring a function is just a matter of writing function my_func { my_code }. Calling a function is just like calling another program, you just write its name.

function name() {
    shell commands
}

Example:

#!/bin/bash
function hello {
   echo world!
}
hello

function say {
    echo $1
}
say "hello world!"

When you run the above example, the hello function outputs “world!” and the say function outputs “hello world!”. The two functions are declared in the same way; the difference is that say prints the first argument it receives, which is available inside the function as $1. Arguments within functions are treated in the same manner as arguments given to the script.
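Building on the example, here is a sketch of how functions see their arguments and how their output can be captured. The function names greet and count_args are hypothetical.

```shell
#!/usr/bin/env bash
greet() {
    local name="${1:-world}"   # first argument, with a fallback when none is given
    echo "hello $name!"
}

count_args() {
    echo "$#"                  # number of arguments the function received
}

greet                  # hello world!
greet "ETH"            # hello ETH!
count_args a b c       # 3
count_args "a b" c     # 2 (quoting keeps "a b" together as one argument)

message=$(greet bash)  # capture what the function prints in a variable
echo "$message"        # hello bash!
```

The local keyword keeps name from leaking into the rest of the script.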

The conditional statement in bash is similar to that of other programming languages. Conditions come in several forms; the most basic is if expression then statement, where statement is only executed if expression is true.

if [ expression ]; then
    will execute only if expression is true
else
    will execute if expression is false
fi

Sometimes chains of if conditions become confusing, so you can write the same logic using a case statement.

case expression in
    pattern1 )
        statements ;;
    pattern2 )
        statements ;;
    ...
esac

Expression Examples:

statement1 && statement2  # both statements are true
statement1 || statement2  # at least one of the statements is true

str1 = str2     # str1 matches str2 (the spaces around the operator are required)
str1 != str2    # str1 does not match str2
str1 < str2     # str1 sorts before str2 (use inside [[ ]])
str1 > str2     # str1 sorts after str2 (use inside [[ ]])
-n str1         # str1 is not null (has length greater than 0)
-z str1         # str1 is null (has length 0)

-a file         # file exists
-d file         # file exists and is a directory
-e file         # file exists; same as -a
-f file         # file exists and is a regular file (i.e., not a directory or other special type of file)
-r file         # you have read permission
-s file         # file exists and is not empty
-w file         # you have write permission
-x file         # you have execute permission on file, or directory search permission if it is a directory
-N file         # file was modified since it was last read
-O file         # you own file
-G file         # file's group ID matches yours (or one of yours, if you are in multiple groups)

file1 -nt file2     # file1 is newer than file2
file1 -ot file2     # file1 is older than file2

-lt     # less than
-le     # less than or equal
-eq     # equal
-ge     # greater than or equal
-gt     # greater than
-ne     # not equal
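The sketch below combines a few of these tests in real if statements. The file name reads.txt and the variable names are hypothetical, and the file is created in a scratch directory.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
file="$scratch/reads.txt"
printf 'a\nb\nc\n' > "$file"       # a three-line file

# File tests: a regular file that is not empty.
if [ -f "$file" ] && [ -s "$file" ]; then
    status="non-empty file"
elif [ -d "$file" ]; then
    status="directory"
else
    status="missing or empty"
fi
echo "$status"                     # non-empty file

# Integer comparison on the number of lines.
n=$(wc -l < "$file")
if [ "$n" -ge 3 ]; then
    size="at least 3 lines"
else
    size="fewer than 3 lines"
fi
echo "$size"                       # at least 3 lines

# String comparison inside [[ ]].
a="apple"; b="banana"
if [[ "$a" < "$b" ]]; then
    order="$a sorts before $b"
fi
echo "$order"                      # apple sorts before banana
```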

There are three types of loops in bash: for, while and until.

Different for Syntax:

for x in {1..10}
do
  statements that can use $x
done

for name [in list]
do
  statements that can use $name
done

for (( initialisation ; ending condition ; update ))
do
  statements...
done

while Syntax:

while condition; do
  statements
done

until Syntax:

until condition; do
  statements
done
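To make the syntax forms concrete, here is a sketch with one runnable loop of each kind. The variable names are arbitrary.

```shell
#!/usr/bin/env bash
# for over a list of words
for base in A C G U; do
    echo "base: $base"
done

# C-style for: sum the integers 1 to 5
total=0
for (( i = 1; i <= 5; i++ )); do
    total=$(( total + i ))
done
echo "$total"          # 15

# while: repeats as long as the condition is true
n=3
while [ "$n" -gt 0 ]; do
    n=$(( n - 1 ))
done

# until: repeats as long as the condition is false
m=0
until [ "$m" -ge 3 ]; do
    m=$(( m + 1 ))
done
echo "$n $m"           # 0 3
```

The while and until loops are mirror images: until stops as soon as its condition becomes true.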

Content adapted from:

Exercises:

Cheatsheets: