Project

Overview

This project will test your learning and understanding of the first half of the course. You must complete the project successfully to pass the course. The deadline for submission of your solutions is Thursday 18.04.2024 23:59. During the class on Friday 19.04.2024, we will go through an example solution, after which you will have until Thursday 25.04.2024 23:59 to correct your solutions.

You should submit your solutions on Moodle: https://moodle-app2.let.ethz.ch/mod/quiz/view.php?id=999004

Revisions should be submitted to: https://moodle-app2.let.ethz.ch/mod/quiz/view.php?id=1041332

Part 1

From your problem set you were allocated a fasta file named <your ETH user name>.fasta. As a reminder, this is an individual bacterial genome, specific to you, that you are going to investigate. Using your newly acquired skills from Alignment, Annotation and Phylogenetics, answer the following questions:

Question 1

Use prodigal to predict the protein-coding genes in your genome. How many genes are predicted?
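
One possible way to approach the count is sketched below. It assumes prodigal is available on your PATH (for example after loading its module) and that your genome sits at the hypothetical path ~/genome.fasta. The helper count_fasta_records is an illustrative name, not part of prodigal; it counts the header lines of the protein FASTA written via prodigal's -a option, which contains one record per predicted gene.

```shell
# Count FASTA records by counting header lines (lines starting with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

genome="$HOME/genome.fasta"   # hypothetical path: replace with your own file

# Run prodigal only if it is installed and the input file exists.
if command -v prodigal >/dev/null 2>&1 && [ -f "$genome" ]; then
    # -i input genome, -a predicted proteins (FASTA),
    # -o gene coordinates, -f format of the coordinate file
    prodigal -i "$genome" -a proteins.faa -o genes.gff -f gff
    count_fasta_records proteins.faa
fi
```

The output file names proteins.faa and genes.gff are arbitrary choices for this sketch.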

Question 2

Extract the 16S sequence(s) of your genome with barrnap. You can load the software using ml barrnap and familiarize yourself with its usage by typing barrnap -h. How many copies of the 16S rRNA gene do you find in ‘your’ genome?
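
As a rough sketch of this step: barrnap writes its predictions as GFF3 to standard output, so the 16S hits can be counted from that file. The example assumes the genome is at the hypothetical path ~/genome.fasta, that your barrnap version supports the --outseq option (check barrnap -h), and that 16S features carry the attribute Name=16S_rRNA. The helper count_16s is an illustrative name.

```shell
# Count 16S rRNA features in a barrnap GFF3 file.
# Assumption: 16S hits are labelled with the attribute Name=16S_rRNA.
count_16s() {
    grep -c 'Name=16S_rRNA' "$1"
}

genome="$HOME/genome.fasta"   # hypothetical path: replace with your own file

# Run barrnap only if it is installed and the input file exists.
if command -v barrnap >/dev/null 2>&1 && [ -f "$genome" ]; then
    # GFF3 goes to stdout; --outseq saves the hit sequences as FASTA
    barrnap --outseq rrna.fa "$genome" > rrna.gff
    count_16s rrna.gff
fi
```

Partial hits are also reported in the GFF3, so it is worth inspecting the matching lines rather than relying on the count alone.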

Question 3

Please visit the Microbe Atlas website (https://microbeatlas.org/) and use (one of) the 16S rRNA gene sequence(s) to find out:
  1. What is the taxonomic classification of the organism in the Microbe Atlas database (i.e. target organism) that is most similar to ‘your’ organism (i.e. query organism)?

  2. Exploring the environmental statistics of the target organism, in which habitats has it been detected most frequently and/or at highest abundance?

Part 2

Each of you will find a fasta file named <your ETH user name>.fna in the directory /nfs/teaching/551-0132-00L/7_Project/FS24_Genes - this is an individualised set of between 20 and 40 homologous gene sequences from vertebrates that you are going to investigate. Two of the sequences are homologs from the human genome, but the others could be from a variety of organisms. The data was sourced from the Ensembl database.

Answer the following questions about your gene sequences:

Warning

Before reading further, copy your gene sequences to your home folder and work from this copy - should you ever accidentally overwrite or modify your copy, you can make a new copy from the original.

Question 4

What are the names and functions of the human homologs?

Question 5

Construct a multiple alignment and phylogeny of the gene sequences. Are the two human genes nearest neighbours to one another? Which other organism(s) are nearest to the human genes?

Question 6

Consider your answer to Question 5, the number of gene homologs seen in other organisms, and the phylogeny you built. What do you deduce about the historical timing of the duplication event that created the two gene copies seen in humans?

Part 3

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is an enveloped, positive-sense, single-stranded RNA virus that causes coronavirus disease 2019 (COVID-19). Virus particles include the RNA genetic material and structural proteins needed for invasion of host cells. Once inside the cell the infecting RNA is used to encode structural proteins that make up virus particles, nonstructural proteins that direct virus assembly, transcription, replication and host control, and accessory proteins whose functions have not been determined.

ORF1ab, the largest gene, contains overlapping open reading frames that encode polyproteins PP1ab and PP1a. The polyproteins are cleaved to yield 16 nonstructural proteins, NSP1-16. Production of the longer (PP1ab) or shorter protein (PP1a) depends on a -1 ribosomal frameshifting event. The proteins, based on similarity to other coronaviruses, include the papain-like proteinase protein (NSP3), 3C-like proteinase (NSP5), RNA-dependent RNA polymerase (NSP12, RdRp), helicase (NSP13, HEL), endoRNAse (NSP15), 2’-O-Ribose-Methyltransferase (NSP16) and other nonstructural proteins. SARS-CoV-2 nonstructural proteins are responsible for viral transcription, replication, proteolytic processing, suppression of host immune responses and suppression of host gene expression.

The structural proteins of SARS-CoV-2 include the envelope protein (E), spike or surface glycoprotein (S), membrane protein (M) and the nucleocapsid protein (N). The spike glycoprotein is found on the outside of the virus particle and gives coronaviruses their crown-like appearance. This glycoprotein mediates attachment of the virus particle and entry into the host cell.

Source: https://www.ncbi.nlm.nih.gov/sars-cov-2/

[Figure: structural proteins of SARS-CoV-2]

Source: https://www.prof.uzh.ch/en/news/Coronavirus-(2019-nCoV).html

You will find two fasta files called RdRp.faa and S.faa in the directory /nfs/teaching/551-0132-00L/7_Project/FS24_Virus - these contain collected amino acid sequences of the RNA-dependent RNA polymerase and spike glycoprotein S of sequenced Coronavirus SARS-CoV-2 samples. You will also find the reference sequence for the virus itself, SARS-CoV-2.fa, and an indexed database of reference virus sequences, RefSeq_Virus.fa.

Answer the following questions about the virus sequences:

Warning

Do not copy these files - they are significantly larger than the previous files you have worked on, and it is a waste of space for all of you to copy the same data. Instead, you can create a symlink to the folder in your home directory as follows:

cd
ln -s /nfs/teaching/551-0132-00L/7_Project/FS24_Virus FS24_Virus
ll # should show a line FS24_Virus -> /nfs/teaching/551-0132-00L/7_Project/FS24_Virus

Question 7

Align the reference sequence of the virus to the provided database of virus sequences. Looking at the nearest virus in a non-human host, what is its host?

Question 8

Based on your knowledge of how the immune system works, which of the two proteins, RdRp or S, will have more sequence variants? Formulate a biologically meaningful hypothesis.

Question 9

How many unique sequences are in each file, RdRp.faa and S.faa?
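
One way to count distinct sequences is sketched below, assuming that "unique" means an exact, case-sensitive match over the full sequence and that the files are reachable through the FS24_Virus symlink in your home directory. The helper count_unique_seqs is an illustrative name; it first joins wrapped sequence lines so that line breaks inside a record do not make identical sequences look different.

```shell
# Print the number of distinct sequences in a FASTA file.
# Wrapped records are joined into one line before comparing.
count_unique_seqs() {
    awk '/^>/ { if (seq != "") print seq; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$1" | sort -u | wc -l | tr -d ' '
}

# Report the count for each file, skipping files that are not present.
for f in "$HOME/FS24_Virus/RdRp.faa" "$HOME/FS24_Virus/S.faa"; do
    if [ -f "$f" ]; then
        echo "$f: $(count_unique_seqs "$f")"
    fi
done
```

Headers are ignored on purpose, so two records with different names but the same sequence count once.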

Question 10

Consider how differences in the length of the two genes impact your results, and correct for this in your answers to Question 9. Do the final numbers support your hypothesis?
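
A simple way to think about the correction, sketched under the assumption that dividing the number of unique sequences by the protein length in amino acids is an acceptable normalisation (other corrections are possible). The helper variants_per_residue is an illustrative name, and the numbers in the usage line are placeholders, not results from RdRp.faa or S.faa.

```shell
# Unique sequences per residue: count divided by protein length.
# $1 = number of unique sequences, $2 = protein length in amino acids
variants_per_residue() {
    awk -v n="$1" -v len="$2" 'BEGIN { printf "%.4f\n", n / len }'
}

# Placeholder numbers only.
variants_per_residue 50 1000    # prints 0.0500
```

Comparing the two normalised values, rather than the raw counts, shows whether the longer protein simply had more positions at which a change could occur.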