Session 1.1. - Introduction to Bash and UNIX#
0. UNIX system and terminals#
The Unix operating system has been around since 1969. Back then there was no such thing as a graphical user interface. You typed everything. It may seem archaic to use a keyboard to issue commands today, but it’s much easier to automate keyboard tasks than mouse tasks. There are several variants of Unix (including Linux and Apple).
Increasingly, the raw output of biological research exists as in silico data, usually in the form of large text files. Unix is particularly suited to working with such files and has several powerful (and flexible) commands that can process your data for you. The real strength of learning Unix is that most of these commands can be combined in an almost unlimited fashion. So if you can learn just five Unix commands, you will be able to do a lot more than just five things.
A terminal is the common name for the program that does two main things. It allows you to type input to the computer (i.e. run programs, move/view files etc.) and it allows you to see output from those programs. All Unix machines will have a terminal program and on Apple computers, the terminal application is unsurprisingly named ‘Terminal’.
Unix keeps files arranged in a hierarchical structure. From the top-level of the computer, there will be a number of directories, each of which can contain files and subdirectories, and each of those in turn can of course contain more files and directories and so on, ad infinitum. It is important to note that you will always be in a directory when using the terminal. The default behavior is that when you open a new terminal you start in your own home directory (containing files and directories that only you can modify).
2. Basic Operations#
a. mkdir
#
Makes a new directory.
mkdir dirname
You can use this to create multiple directories at once within your current directory.
mkdir 1stDirectory 2ndDirectory 3rdDirectory
b. cp
#
Copies a file from one location to other.
cp filename1 filename2
Where filename1
is the source path to the file and filename2
is
the destination path to the file.
When copying directories, we need to add the -R
(recursive) option:
cp -R <./path/to/directory> ./
We are going to copy this tutorial directory to your home (done after ssh to morgan):
cp -R /nfs/nas22/fs2202/biol_micro_teaching/551-1119-00L-2024/s02_bash ./
c. mv
#
Moves a file from one location to other.
mv filename1 filename2
Where filename1
is the source path to the file and filename2
is
the destination path to the file.
Also it can be used for rename a file.
mv old_name new_name
d. rm
#
Removes a file. Using this command on a directory gives you an error.
rm: directory: is a directory
To remove a directory you have to pass
-r
which will remove the content of the directory recursively.
Optionally you can use -f
flag to force the deletion i.e. without
any confirmations etc.
rm filename
3. Inspecting files#
In Unix systems there are only really two types of files: text or binary. The file name ending (.txt or .jpg) doesn’t really matter like it does in Windows or MacOS, however it is used to indicate the file type by convention. Some file types you will encounter include:
.txt - A generic text file
.csv - A ‘comma separated values’ file, which is usually a table of data with each line a row and each column separated by a comma
.tsv - A ‘tab separated values’ file, which is the same by separated by tab characters
.fasta or .fa - A fasta formatted sequence file, in which each sequence has a header line starting with ‘>’
.fna - A fasta formatted nucleotide sequence file, usually gene sequences
.faa - A fasta formatted protein sequence file
.sh - A ‘shell script’, which contains terminal commands to run sequentially
.r - An R script, which contains R commands to run
.py - A python script, which contains python commands to run
.gz or .tar.gz - A file that has been compressed using a protocol called ‘gzip’ so that it takes up less space on the disk and transfers over the internet faster
a. cat
#
It can be used for the following purposes under UNIX or Linux.
Display text files on screen
Copy text files
Combine text files
Create new text files
cat filename # Inspect the file
cat file1 file2 # Opens the two files one after the other
cat file1 file2 > newcombinedfile # Create a file with file1 and file2
cat file1 >> newcombinedfile # Paste file1 after the content in newcombinedfile
cat < file1 > file2 # Copy file1 to file2
zcat file1.gz # To print a compressed file
b. head
#
Outputs the first 10 lines of file
head filename
To output a different number of lines (ex. 100):
head -100 filename
c. tail
#
Outputs the last 10 lines of file
tail filename
tail -100 filename
d. more
#
Shows the first part of a file (move with space and type q to quit).
more filename
e. less
#
Instead of showing you the contents of a file directly on the terminal, it ‘opens’ the file to browse. You can use the arrow keys, page up, page down, home, end and the spacebar to navigate the file. Pressing q will quit.
less filename
less -S filename # To nicely print a tsv file
4. Text Operations#
a. echo
#
Display a line of text
display “Hello World”
echo Hello World
Hello World
display “Hello World” with newlines between words
echo -ne "Hello\nWorld\n"
Hello
World
b. grep
#
Print lines matching a pattern
Find the exact string ‘AUUACUGACGCUCAUGGACGAA’ in example.fasta
grep 'AUUACUGACGCUCAUGGACGAA' example.fasta
You can check for more than one pattern:
grep 'AUUACUGACGCUCAUGGACGAA|GACGAAAGCCAGGGGAGCGAAAGGG' example.fasta
You can read patterns from a file:
grep -f taxa.txt example.fasta
c. sed
#
Stream editor for filtering and transforming text
Create an example.txt:
echo 'Hello This is a Test 1 2 3 4' > example.txt
replace all spaces with hyphens
sed 's/ /-/g' example.txt
replace all digits with “d”
sed 's/[0-9]/d/g' example.txt
d. awk
#
awk is the most useful command for handling text files. It operates on an entire file line by line. By default it uses whitespace to separate the fields. The most common syntax for awk command is
awk '/search_pattern/ { action_to_take_if_pattern_matches; }' file_to_parse
Lets take following file /etc/passwd
. Here’s the sample data that
this file contains:
root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
So now lets get only username from this file. Where -F
specifies
that on which base we are going to separate the fields. In our case it’s
:
. { print $1 }
means print out the first matching field.
awk -F':' '{ print $1 }' /etc/passwd
After running the above command you will get following output.
root
daemon
bin
sys
sync
For more detail on how to use awk
, check following
link.
e. sort
#
Sort lines of text files
cat example_sort.txt
Sort
sort example_sort.txt
Randomize
sort -r example_sort.txt
f. uniq
#
Report or omit repeated lines show only unique lines of example.txt (first you need to sort it, otherwise it won’t see the overlap)
sort example_sort.txt | uniq
a
b
c
d
show the unique items for each line, and tell me how many instances it found
sort example_sort.txt | uniq -c
3 a
2 b
2 c
1 d
The vertical line |
, or pipe, allows to link input - output between
functions.
g. cut
#
The cut command in Bash is very useful for extracting specific columns of text or fields from files or command outputs.
cut -d',' -f1 data.csv
cut -f1,3 -d$'\t' data.tsv
h. wc
#
Tells you how many lines, words and characters there are in a file.
wc example.fasta
Example:
$ wc example.fasta
Where 10896
is lines, 13479
is words and 890979
is
characters.
You can also count lines:
$ wc -l example.fasta
Exercise#
Use the previous commands to:
Count the number of entries in the fasta file
Count the number of times the subsequence ‘AUUACUGACGCUCAUGGACGAA’ appears in the fasta file
Which limitations do you see with this approach?
5. Other useful operations#
a. gunzip
#
Un-compresses files compressed by gzip.
gunzip filename
b. gzcat
#
Lets you look at gzipped file without actually having to gunzip it.
gzcat filename
c. gzip
#
Compresses files.
gzip filename
d. wget
#
Downloads file.
wget file
6. Basic Shell Programming#
The first line that you will write in bash script files is called
shebang
. This line in any script determines the script’s ability to
be executed like a standalone executable without typing sh, bash,
python, php etc beforehand in the terminal.
#!/usr/bin/env bash
Creating variables in bash is similar to other languages. There are no data types. A variable in bash can contain a number, a character, a string of characters, etc. You have no need to declare a variable, just assigning a value to its reference will create it.
Example:
str="hello world"
The above line creates a variable str
and assigns “hello world” to
it. The value of variable is retrieved by putting the $
in the
beginning of variable name.
Example:
echo $str # hello world
Like other languages bash has also arrays. An array is a variable containing multiple values. There’s no maximum limit on the size of array. Arrays in bash are zero based. The first element is indexed with element 0. There are several ways for creating arrays in bash which are given below.
Examples:
array[0]=val
array[1]=val
array[2]=val
array=([2]=val [0]=val [1]=val)
array=(val val val)
To display a value at specific index use following syntax:
${array[i]} # where i is the index
If no index is supplied, array element 0 is assumed. To find out how many values there are in the array use the following syntax:
${#array[@]}
Bash has also support for the ternary conditions. Check some examples below.
${varname:-word} # if varname exists and isn't null, return its value; otherwise return word
${varname:=word} # if varname exists and isn't null, return its value; otherwise set it word and then return its value
${varname:+word} # if varname exists and isn't null, return word; otherwise return null
${varname:offset:length} # performs substring expansion. It returns the substring of $varname starting at offset and up to length characters
Check some of the syntax on how to manipulate strings
${variable#pattern} # if the pattern matches the beginning of the variable's value, delete the shortest part that matches and return the rest
${variable##pattern} # if the pattern matches the beginning of the variable's value, delete the longest part that matches and return the rest
${variable%pattern} # if the pattern matches the end of the variable's value, delete the shortest part that matches and return the rest
${variable%%pattern} # if the pattern matches the end of the variable's value, delete the longest part that matches and return the rest
${variable/pattern/string} # the longest match to pattern in variable is replaced by string. Only the first match is replaced
${variable//pattern/string} # the longest match to pattern in variable is replaced by string. All matches are replaced
${#varname} # returns the length of the value of the variable as a character string
Bash has multiple shorthand tricks for doing various things to strings.
${variable,,} #this converts every letter in the variable to lowercase
${variable^^} #this converts every letter in the variable to uppercase
${variable:2:8} #this returns a substring of a string, starting at the character at the 2 index(strings start at index 0, so this is the 3rd character),
#the substring will be 8 characters long, so this would return a string made of the 3rd to the 11th characters.
Here are some handy pattern matching tricks
if [[ "$variable" == *subString* ]] #this returns true if the provided substring is in the variable
if [[ "$variable" != *subString* ]] #this returns true if the provided substring is not in the variable
if [[ "$variable" == subString* ]] #this returns true if the variable starts with the given subString
if [[ "$variable" == *subString ]] #this returns true if the variable ends with the given subString
The above can be shortened using a case statement and the IN keyword
case "$var" in
begin*)
#variable begins with "begin"
;;
*subString*)
#subString is in variable
;;
*otherSubString*)
#otherSubString is in variable
;;
esac
As in almost any programming language, you can use functions to group pieces of code in a more logical way or practice the divine art of recursion. Declaring a function is just a matter of writing function my_func { my_code }. Calling a function is just like calling another program, you just write its name.
function name() {
shell commands
}
Example:
#!/bin/bash
function hello {
echo world!
}
hello
function say {
echo $1
}
say "hello world!"
When you run the above example the hello
function will output
“world!”. The above two functions hello
and say
are identical.
The main difference is function say
. This function, prints the first
argument it receives. Arguments, within functions, are treated in the
same manner as arguments given to the script.
The conditional statement in bash is similar to other programming
languages. Conditions have many form like the most basic form is if
expression then
statement where statement is only executed if
expression is true.
if [ expression ]; then
will execute only if expression is true
else
will execute if expression is false
fi
Sometime if conditions becoming confusing so you can write the same
condition using the case statements
.
case expression in
pattern1 )
statements ;;
pattern2 )
statements ;;
...
esac
Expression Examples:
statement1 && statement2 # both statements are true
statement1 || statement2 # at least one of the statements is true
str1=str2 # str1 matches str2
str1!=str2 # str1 does not match str2
str1<str2 # str1 is less than str2
str1>str2 # str1 is greater than str2
-n str1 # str1 is not null (has length greater than 0)
-z str1 # str1 is null (has length 0)
-a file # file exists
-d file # file exists and is a directory
-e file # file exists; same -a
-f file # file exists and is a regular file (i.e., not a directory or other special type of file)
-r file # you have read permission
-s file # file exists and is not empty
-w file # you have write permission
-x file # you have execute permission on file, or directory search permission if it is a directory
-N file # file was modified since it was last read
-O file # you own file
-G file # file's group ID matches yours (or one of yours, if you are in multiple groups)
file1 -nt file2 # file1 is newer than file2
file1 -ot file2 # file1 is older than file2
-lt # less than
-le # less than or equal
-eq # equal
-ge # greater than or equal
-gt # greater than
-ne # not equal
There are three types of loops in bash. for
, while
and
until
.
Different for
Syntax:
for x := 1 to 10 do
begin
statements
end
for name [in list]
do
statements that can use $name
done
for (( initialisation ; ending condition ; update ))
do
statements...
done
while
Syntax:
while condition; do
statements
done
until
Syntax:
until condition; do
statements
done
Content adapted from:
Idnan/bash-guide (License: CCBY 4.0)
https://sunagawalab.ethz.ch/share/teaching/home/551-1119-00L_Fall2022/documentation/01_02_setup.html
Exercises:
Cheatsheets: