Solutions¶

Unix 1 + 2¶

What happens when you copy a file with the same name as an existing file?

If you use cp to copy a file and give a destination path that already exists, it will simply overwrite the existing file.

What happens when you delete the directory you are currently in?

You appear to still be in the directory, but any commands involving relative paths will fail.

What happens when you create a directory with the same name as an existing one?

An error will occur, ‘File exists’. A useful option is mkdir -p which only tries to create the directory if it doesn’t exist.

How many samples are there in total?

There are various ways to count the lines, ignoring the header line: 2898.

How many different body sites are there? Subsites? How do the subsites correspond to the sites?

Extract only the column with body site information.
Sort the data so that uniq can return only the unique entries.
For correspondence between subsites and sites, extract and sort multiple columns.

What is the distribution of data for each column?

Using uniq -c gives a count of each unique entry.

Are there any clear biases in the distribution of metadata?

For example: looking at the distribution of Sex and Visit No. you can see that many women dropped out of the study after the first visit, which may bias the 2nd visit data.

Sequence Data¶

How large (in bases, or base pairs, bp) is the genome?

Using only command line tools, you could ditch the header line then count characters, subtracting the number of lines because the newline character n is counted by wc -c.
If there were multiple contigs you would have to grep -v “^>” to remove all header lines.
You could open the file in Biopython and then len(seq) gives you a sequence length.
You could also use seqtk comp test.fasta to very easily get sequence length (second column).

What is the GC content of the genome?

Using only command line tools, you can use a similar trick (count characters minus lines) as before, but you have to remove all As and Ts with tr.
Biopython again can do the job - Seq objects act like strings so you can use the count method.
Once more though, seqtk comp does it all for you, columns 3-6 are ACGT counts.

How many genes does the genome contain?

If you have only the gene sequences (cds_genomic or similar) then you can grep “^>”.
If it’s a genbank file you have to find a regex to extract gene features.
Biopython is also quick of course, filtering for features with type ‘gene’ and counting.

What are the lengths of those genes?

For genbank files the easiest way is with Biopython - import the file, extract the gene feature sequences and report their lengths.
If you have gene sequences in a fasta then it IS possible with terminal only tools but involves at least six complicated steps with the commands we have taught.
But with such a file, seqtk comp is easiest as it reports statistics per fasta entry in a file.

Consider how you might write a python script that outputs DNA or amino acid sequences of genes from a genbank file.

Import the file. Create a list to contain the gene sequences.
For each record:
- For each feature:
  Extract the sequence.
  
  Convert to amino acid sequence (if desired).
  
  Adjust the ID/Description, for instance using the ‘locus_tag’ feature qualifier.
  
  Put the new Seq record into the list.
Output the list with SeqIO.

Alignment¶

Given the gene sequences of a single organism, how would we find potential paralogs?

Make a BLAST database from the gene sequences file.
Align the genes against themselves with blastn (or blastp for amino acid sequences).
Filter out alignments where a gene aligns to itself.
Filter out poor and short alignments - choose a cutoff based on the data, as it depends on how far apart the organisms are in evolutionary time.
Cluster the remaining alignments to find paralog groups.

Given the gene sequences of two organisms, how would we find potential orthologs?

In principle you align one set against the other, but you have to take into account paralogs in both organisms. A method for doing this is to find ‘reciprocal best hits’ - when a gene in one organism aligns best with a gene in the other organism and vice-versa.
Put both sets of genes together in one file and make a BLAST database from it.
Align the first set of genes to this database, and then the second.
Filter out all self-, poor and short alignments, as above.
Now for each gene check if the best hit is to a gene in the other organism, then check if that gene’s best hit is the initial gene - a reciprocal best hit.

What additional evidence could we use from our work to support the argument that the gene-pairs are paralogs or orthologs?

Consider how many orthologs are found and how similar the two organisms are overall - more closely related suggests more genuine orthologs.
Paralogs may be resolved with phylogenetics, revealing when the gene duplication occured relative to the speciation.
But it’s a difficult problem, especially with bacteria who frequently horizontally transfer genes among themselves.

Annotation¶

How would you search for exact matches to a restriction enzyme recognition site?

Find out the nucleotide sequence that EcoRI recognises by searching on the internet - GAATTC.
Use grep to find the sequence in the genome with options -b and -o.
If the genome sequence is split across multiple lines, they need to be removed (for instance with tr) before searching with grep.

How would you search for primer binding sites, allowing for one base mismatch?

For exact matches:

Find the nucleotide sequences of the primers 27F and 1492R by searching on the internet (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC106531/).
Note that several (degenerate) versions of primers that are named 27F exist, that is, some of its positions have several possible bases. Combinations of bases are encoded by specific letters (for more information see: https://www.bioinformatics.org/sms/iupac.html).
27F: AGAGTTTGATCMTGGCTCAG or AGAGTTTGATC[CA]TGGCTCAG.
1492R: TACGGYTACCTTGTTACGACTT or TACGG[CT]TACCTTGTTACGACTT.
Exact matches can be found with the grep method we used for the restriction enzyme recognition site.
The sequences can also be encoded on both strands of DNA, therefore we have to also search for the reverse complement sequences of the original primer sequences.
We find 7 primer binding sites in GCF_000005845.2_ASM584v2_genomic.fna.

Then to allow for 1 mismatch:

Grep will search for exact matches only, we have to use a different approach to allow for 1 mismatch.
One option is to use BLAST then filter the results to find full length matches with only 1 mismatch.
You could also use Python to generate a list of all sequences with 1 error relative to the primer and search for exact matches for all such sequences.

Imagine you had identified a new prokaryote - is your organism capable of breaking down glucose into pyruvate by glycolysis?

The first task is to identify which enzymes are required to break down glucose into pyruvate:
- Hexo- or Glucokinase
- Glucose-6-P isomerase
- 6-Phosphofructokinase
- Fructose-bisphosphate aldolase
- Triosephosphate-isomerase
- Glyceraldehyde-3-phosphate dehydrogenase
- Phosphoglycerate-kinase
- Phosphoglycerate-mutase
- Enolase
- Pyruvate kinase
Looking at the output of prokka, the fasta headers of the nucleotide and amino acid sequence files (*.faa, *.ffn) contain gene/protein names. These can be searched to see if all required enzymes are present.
However, since gene names are often not exactly the same across organisms, researchers have assigned systematic identifiers to genes or gene families. One database that contains curated information is the Kyoto Encyclopedia of Genes and Genomes (KEGG). For example, the KEGG website for Glycoysis in E. coli can be found here: https://www.genome.jp/module/eco_M00001. Assuming you know which KEGG identifiers were found in your genome of interest, this information can be used to check for the completeness of a specific metabolic pathway, such as glycolysis.

Structural bioinformatics - Project¶

The structure of Sphingolipid delta(4)-desaturase (DES1)¶

First we need to download the corresponding PDB file into your homefolder on cousteau. In the terminal

terminal
cd
wget https://alphafold.ebi.ac.uk/files/AF-O15121-F1-model_v4.pdb

Afterwards we switch to R-Studio.

Note: It is easier if you use the R-Studio provided on cousteau (http://cousteau-rstudio.ethz.ch) rather than your locally installed R-Studio. On cousteau the necessary modules for the project are pre-installed and you can easily access the files your downloaded into your homefolder on cousteau.

In R-Studio we load the bio3d package and read in the pdb file.

library(bio3d)
pdb <-read.pdb("AF-O15121-F1-model_v4.pdb")

How many chains are in this file?

There are many ways to check for the number of chains. The easiest one is to just look at the pdb-file.

> pdb

What is the x coordinate of the atom in the index number 23?

We use pdb$atom to access the ATOM information. With pdb$atom[Number,”coordinate”] we can access a coordinate of a specific atom. For example, the y coordinate of the 12 atom.

> pdb$atom[12,"y"]

Now you just need to adapt the code above for the x coordinate of the 23 atom.

Using the dm() function in bio3d to calculate all distances between the calpha of the protein. What is the mean distance between all residues, rounded to the first decimal place.

Use the dm() function and store it into a new variable

mydist <- dm(pdb, inds="calpha")

Afterwards use the mean() function to get the mean distance. Do not forget to remove all the NA.

Comparing the structure of human DES1 with other desaturases¶

First, we need to get the necessary files. Switch back to the terminal.

wget https://alphafold.ebi.ac.uk/files/AF-Q8R2F2-F1-model_v4.pdb
wget https://alphafold.ebi.ac.uk/files/AF-Q94515-F1-model_v4.pdb
wget https://alphafold.ebi.ac.uk/files/AF-P38992-F1-model_v4.pdb

Switch back to R-Studio.

Load in the pdb files, align them and calculate Calculate the RMSD.

my_pdbs<- c("./AF-O15121-F1-model_v4.pdb","./AF-P38992-F1-model_v4.pdb","./AF-Q8R2F2-F1-model_v4.pdb","./AF-Q94515-F1-model_v4.pdb")
aln_pdbs <- pdbaln(my_pdbs)
rd <- rmsd(xyz)

What is the RMSD distance between the human (O15121) and the fly (Q94515) structure (round to the first decimal) ?

There are multiple ways to the solution. A simple way would be to print out all of he RMSD values and have a look at the corresponding value.

> rd

What is the corresponding species of the most similar structure to the human structure?

Once again print out the values and find the closest to the mouse

> rd

Mutations in Sphingolipid delta(4)-desaturase¶

This part need to be done in the terminal.

First, we have to make sure that all the requirements are met to run FoldX.

You have to make sure that you are in the same folder as the human structure (probably your homefolder).

cd

You need to have the rotabase.txt file in the same folder

cp /nfs/nas22/fs2202/biol_micro_teaching/software/easybuild/software/FoldX/4.0/bin/rotabase.txt ./

If you have created a separate folder for this (for example “foldxtest”), make sure the necessary files are there.

Load the FoldX module

ml FoldX

Afterwards you have to use the commands learned in Predicting protein folding and the impact of mutations.

In the terminal:

foldx --command=RepairPDB --pdb=AF-O15121-F1-model_v4.pdb

foldx --command=BuildModel --pdb=AF-O15121-F1-model_v4_Repair.pdb --mutant-file=individual_list.txt

The answers to the questions will be in the file:

Dif_AF-O15121-F1-model_v4_Repair.fxout