Hands-on session on genome mining¶
The goal of this tutorial is to provide a brief introduction to bioinformatic approaches for genome mining.
The index of these notebook includes:
Notes on notebooks usage
You are working in a jupyter notebook
. These documents allow to both write executable code, for example in python
, and document it using markdown
in different cells
.
Some useful shortcuts include:
- Shift+Enter : execute cell
- Esc, b : add a cell below
- Esc, a : add a cell above
- Esc, m : transform cell to
markdown
- Esc, d : delete cell
Copying this tutorial:¶
Before starting, run this command from your home to copy the material:
cp -r /nfs/nas22/fs2202/biol_micro_teaching/551-1119-00L-2024/s07_hands_on .
1. Gene annotation¶
This section will introduce you to sequence handling in Python, covering DNA, RNA, proteins, and coding a basic Open Reading Frame (ORF) finder. By the end, you'll be familiar with using Biopython for sequence manipulation and creating a simple ORF detection function.
Core Concepts
- DNA (Deoxyribonucleic Acid): The molecular blueprint of organisms. Consists of nucleotides represented by the letters A, T, C, and G.
- RNA (Ribonucleic Acid): Transcribed from DNA and serves as a template for protein synthesis. It contains U (Uracil) instead of T.
- Protein: Composed of amino acids, translated from RNA through codons (triplets of nucleotides).
- Open Reading Frames (ORFs): DNA segments that have the potential to code for proteins. ORFs start with a start codon (usually ATG) and end with a stop codon (e.g., TAA, TAG, TGA).
1.1. Working with DNA Sequences in Python¶
First, let's load and manipulate DNA sequences using basic Python functions and Biopython. Example DNA Sequence
Let's start with a small sample DNA sequence:
sample_dna = "ATGCGATACGCTTGA"
Task 1: Converting DNA to RNA
To convert DNA to RNA, replace thymine (T) with uracil (U).
1.2. Basic ORF Finder¶
An ORF finder identifies sequences between a start codon (ATG) and a stop codon (TAA, TAG, TGA). Here’s a simple function to find ORFs:
def find_orfs(dna_sequence):
"""
This function scans the sequence for ATG, then continues
in triplets until it finds a stop codon.
It then extracts and saves each ORF.
"""
start_codon = "ATG"
stop_codons = ["TAA", "TAG", "TGA"]
orfs = []
# Complete the code :)
return orfs
# Test the ORF finder with the sample DNA
orfs = find_orfs(sample_dna)
print("ORFs Found:")
for orf in orfs:
print(orf)
1.3. Using Biopython for Sequence Handling¶
Biopython’s Seq object offers powerful tools for sequence manipulation. Here’s how to create sequences and translate them to protein sequences. Example: Creating and Translating a Sequence
from Bio.Seq import Seq
# Create a Seq object
dna_seq = Seq("ATGCGTCTAA")
# Translate the DNA sequence to a protein
protein_seq = dna_seq.translate()
print(f"Protein Sequence: {protein_seq}")
1.4. Loading a genome with Biopython¶
from Bio import SeqIO
# Load the sequence from a FASTA file
with open("./data/test_genome.fna") as file:
sequence_record = SeqIO.read(file, "fasta")
# Display the sequence and its metadata
print(f"ID: {sequence_record.id}")
print(f"Description: {sequence_record.description}")
print(f"Sequence: {sequence_record.seq}")
1.5. Putting It All Together¶
Here’s a final example combining all the steps into a small ORF finder that translates found ORFs into proteins:
from Bio.Seq import Seq
def find_and_translate_orfs(dna_sequence):
start_codon = "ATG"
stop_codons = ["TAA", "TAG", "TGA"]
orf_proteins = []
# Complete the code
return orf_proteins
# Test with a longer DNA sequence
proteins = find_and_translate_orfs(sequence_record.seq)
print("Translated Proteins from ORFs:")
for protein in proteins:
print(protein)
1.6. Using Biopython for multifasta Handling¶
# Install Biopython if needed: !pip install biopython
from Bio.Seq import Seq
from Bio import SeqIO
# Sample DNA sequence
dna_sequence = Seq("ATGCTAGCTAGCTCGTAGCT")
print("Sequence:", dna_sequence)
print("Reverse Complement:", dna_sequence.reverse_complement())
print("Protein Translation:", dna_sequence.translate())
# Load a FASTA file (replace with actual file path if needed)
for record in SeqIO.parse("./data/test_contigs.fna", "fasta"):
print(f"ID: {record.id}, Length: {len(record.seq)}")
Exercise: Load sequences from a FASTA file into a dictionary, compute the reverse complement, and translate to protein and create a new dictionary with the structure {id:[dna_seq, revcomp_seq, aa_seqs]}
2. Alignment-based approaches¶
2.1. Introduction to Sequence Homology and Identity¶
Objective: Explain the concepts of homology and sequence identity, crucial in alignment-based approaches.
- Sequence Homology: Similarity due to shared ancestry; can be orthologous (same function in different species) or paralogous (functionally diverged post-duplication).
- Sequence Identity: Percentage of identical residues in the alignment of two sequences, helping to quantify similarity.
Assumption: we can infer function of a protein by homology given with a set of sequences with known functions.
2.2. Position weight matrices¶
In this example we will create a Position Weight Matrix (PWM) for the sequence "AGN[A|G]GG", use it to scan a larger DNA sequence, and plot the probability of finding "AGGAGG" at each position.
Step 1: Define the PWM for "AGGAGG" A Position Weight Matrix quantifies the likelihood of each nucleotide at each position in the motif. For this example, we'll assume a simple PWM based on "AGGAGG" alone, assigning high probabilities to the observed bases in this motif.
import numpy as np
# Define the target sequence
target_motif = "AGGAGG"
# Define PWM for a motif of 6 bases (length of AGGAGG)
# Rows: A, C, G, T. Columns correspond to each position in "AGGAGG"
pwm = np.array([
[1, 0.25, 0.5, 0, 1, 0], # Probability of 'A' at each position
[0, 0.25, 0 , 0, 0, 0], # Probability of 'C' at each position
[0, 0.25, 0.5, 0, 0, 1], # Probability of 'G' at each position
[0, 0.25, 0 , 1, 0, 0] # Probability of 'T' at each position
], dtype=float)
# To handle probabilities in a real PWM, add a small pseudocount to avoid zeros
pwm = (pwm + 0.01) / pwm.sum(axis=0) # Normalize each column
Step 2: Calculate the Match Score for Each Window Define a function to calculate the probability of observing the pattern in different windows across a larger DNA sequence.
def calculate_pwm_score(sequence, pwm):
scores = []
for i in range(len(sequence) - pwm.shape[1] + 1):
window = sequence[i:i + pwm.shape[1]]
score = 1.0
for j, nucleotide in enumerate(window):
if nucleotide == 'A':
score *= pwm[0, j]
elif nucleotide == 'C':
score *= pwm[1, j]
elif nucleotide == 'G':
score *= pwm[2, j]
elif nucleotide == 'T':
score *= pwm[3, j]
scores.append(score)
return scores
Step 3: Apply the PWM to a Test Sequence Define a test DNA sequence, apply the PWM, and compute the scores
# Define a test sequence
test_sequence = "TTTAGGAGGCTAGGAGGATGGAGGTTAGGAGGGT"
# Calculate the scores for each window
scores = calculate_pwm_score(test_sequence, pwm)
Exercise: Plot the Probability per Window Finally, plot the results to visualize the probability of "AGGAGG" appearing at each position.
import matplotlib.pyplot as plt
# Your code here
2.3. Running BLAST¶
Core Concepts: BLAST (Basic Local Alignment Search Tool) compares sequences to find regions of similarity. We’ll use the Biopython NCBIWWW module to run BLAST against NCBI’s online database.
Note: Running BLAST on NCBI servers may have rate limits. Optionally, download and run BLAST locally for larger datasets.
2.3.1 Running BLAST locally¶
Open a new terminal and run the following commands
{bash}
# Create a dictionary
cd s07_hands_on
mkdir blast_exercise
# Load blast
ml BLAST+
# Inspect the file to make the database
head -20 ./data/PlasticDB.fasta
# Make a database
makeblastdb -in ./data/PlasticDB.fasta -dbtype prot -out ./blast_exercise/plasticdb -title plasticdb -parse_seqids
# Run BlastP
blastp -query ./data/omd2_candidate.faa -db ./data/blast_exercise/plasticdb -outfmt 6 -evalue 0.0001 -out ./data/omd2_candidate_blast.out
2.4. Parsing BLAST Results and Checking Hits with Pandas¶
import pandas as pd
# Read the file with pandas and store it in blast_results variable
blast_results
This seems hard to interpret without a header...
blast_results = pd.read_csv('./data/omd2_candidate_blast.out', sep='\t', header=None)
blast_results.columns = ["qseqid", "sseqid", "pident", "length", "mismatch",
"gapopen", "qstart", "qend", "sstart", "send",
"evalue", "bitscore"]
blast_results
Exercise:
- Subset the previous results to get only results with an e-value <0.0000001.
- Further filter cases with pident >=50 and length >150.
- How many genes satisfy the condition 1 and 2?
- What are the different degrading enzymes identified?
- Can you do a plot showing something interesting about the different biodegradation categories?
Exercise:
- Load the genome collection (and metadata) from OMD2
- Can you identify the genus and species taxonomy of the organism we are analyzing?
- Would you trust the quality of the genome analyzed?
import pandas as pd
df = pd.read_csv('../s0506_python/data/genome_summary.csv.gz', compression='gzip', index_col=0)
md = pd.read_csv('../s0506_python/data/metadata.tsv', sep='\t', index_col=0)
df
3. HMMs-based approaches¶
Core concepts: Run multiple HMMs on a collection of sequences, explore concepts such as cluster completeness, and further developments such as antismash.
We will use the operon Nif as example, important because of the process of nitrogen fixation:
- Nitrogen is a limiting nutrient for plants, and although it makes up ~70% of the atmosphere, the gaseous form (N2) is unusable for them.
- So how does nitrogen get from the atmosphere into a usable form? → “nitrogen fixation”
- bacteria can convert N2 into organic nutrients like ammonium (NH4+) and nitrate (NO3–) that are usable by plants.
- in the ocean, blue-green cyanobacteria are the most abundant type of bacteria to fix nitrogen.
- Collectively, these organisms are called diazotrophs, and account for ~90% of natural nitrogen fixation.
This process involves 3 main genes:
We have produced HMMs for a series of H, D, K proteins. Given the long evolutionary history of these proteins, Blast cannot capture all the diversity and HMMs, with their higher sensitivity, can help in predicting diazotrophs in our genome collection.
To work with HMMs, we will us the python module for HMMER pyhmmer.
import pandas as pd
import pyhmmer
import glob
import pyhmmer.easel as easel
import collections
def retrieve_hits(seqs_path, hmms, fields=["query", "subject", "bitscore", "evalue"]):
# Load cluster proteins
with pyhmmer.easel.SequenceFile(seqs_path, digital=True, alphabet=easel.Alphabet.amino()) as seqs_file:
proteins = seqs_file.read_block()
# Run HMMs
Result = collections.namedtuple("Result", fields)
results = []
for hits in pyhmmer.hmmsearch(hmms, proteins, E=1):
cog = hits.query_name.decode()
for hit in hits:
if hit.included:
results.append(Result(hit.name.decode(), cog, hit.score, hit.evalue))
# Results --> df
hits_df = {}
c = 0
for i in results:
hits_df[c] = list(i)
c += 1
hits_df = pd.DataFrame.from_dict(hits_df, orient='index', columns=fields)
return hits_df
What is the previous code doing?
# Find and load a collection of HMMs
HMMS = []
for fil in glob.glob('./data/hmms_nifHDK/*.hmm'):
with pyhmmer.plan7.HMMFile(fil) as hmm_file:
HMMS.append(hmm_file.read())
HMMS
results = retrieve_hits('./data/omd2_candidate.faa', HMMS)
results
Do you think this bacteria is a diazotroph?
Let's try a different set:
retrieve_hits('./data/cyanobact.faa', HMMS)
Do you think this bacteria is a diazotroph?
- Does it has the 3 genes?
- What would be your interpretation from the raw predictions?
- What do you think has happened here evolutionary speaking?
- What would be your interpretation from the predictions with evalue<5e-10?
- What else you could try to validate your hypothesis?
3.2. Working with antismash outputs¶
Exercise: Given the antismash output from a genome given in the previous file:
- How many BGCs the genome has?
- What are the different products of those BGCs?
- How many of each?
- How many present a regulatory, resistance or transport gene?
- You can produce some graphs to inspect the problem if you are done quick ;)
antismash = pd.read_csv('./data/omd2_candidate-antismash.tsv', sep='\t')
antismash.head(10)
3.3. Working with merged antismash outputs¶
omd_bgcs = pd.read_csv('./data/bgcs.csv.gz', compression='gzip')
omd_bgcs['biosample'] = [i.split('_')[1] for i in omd_bgcs['GENOME']]
md = pd.read_csv('./data/metadata.tsv', sep='\t', index_col=0)
omd_bgcs
Exercise:
- Filter high quality mags (completion >= 90, contamination <5) from the BGC table.
- Which is the GTDB taxonomy with largest number of Biosynthetic Regions?.
- Add metadata information as done in the previous python sessions.
- Can you group genomes by biosample making the average of the columns:
['# Biosynthetic Regions', '# Biosynthetic Products', 'RiPPs', 'RiPPs:Proteusins', 'NRPS', 'PKSI', 'Other PKS', 'Terpenes', 'Saccharides', 'Other', 'temp']
- Can you plot a figure to explore if higher temperature relates to more biosynthetic regions? and by product?
- Extras:
- can you make a barplot showing in the X axis the nr of products for the top 5 producers with their taxonomies as labels in the y-axis?
- can you make a map plot coloring by number of products?
3.4. Working with EggNOG outputs¶
egg = pd.read_csv('/nfs/home/smiravet/bc/bc2024/s07_hands_on/data/eggnog_output.tsv', sep='\t')
Exercise:
- Explore together the eggnog output
- Make a function to parse the PFAM column subsetting Eukaryotic-like proteins
- ankyrin repeat
- tetratrico peptide repeat
- tetratricopeptide repeat
- WD40 region
- WD40-like
- Leucine-rich repeat
- What percentage of the total gene collection are ELPs?
Note use <your_str>.lower() or <your_str>.upper() to ensure the match is case insensitive!
4. Feature based approaches¶
4.1. Defining a Random Forest Classifier to identify AMPs¶
Sometimes the features in a protein that allow for a specific function are not given based the sequence but other composed factors, such as hydrophobicity, secondary structure motifs, a single reactive amino acid, etc.
For these cases, machine learning and feature-based approaches can be integrated to integrate and analyze based on a composition of numerical descriptors. Antimicrobial peptides are one of the classic examples:
from Bio import SeqIO
import pandas as pd
import peptides
# Define a function to load sequences from a FASTA file
def load_sequences(fasta_file):
sequences = {}
for record in SeqIO.parse(fasta_file, "fasta"):
sequences[record.id] = str(record.seq)
return sequences
# Load sequences from the fasta file
training_seqs = load_sequences("./data/amps/training_set.faa") # Positive AMP sequences
training_seqs
Explore and interpret together the output. What does AMP and NAMP mean?
Now check the code below and understand what it does:
aas = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
cols = ['aliphatic_index', 'boman', 'charge'] + [f'count_{i}' for i in aas] + [f'freq_{i}' for i in aas]
cols += ['hydrophobic_moment', 'hydrophobicity', 'instability_index', 'isoelectric_point', 'mass_shift']
cols += ['molecular_weight', 'mz']
def predict_additional(peptide):
aas = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
l = [peptide.aliphatic_index(), peptide.boman(), peptide.charge()]+[peptide.counts().get(i) for i in aas]+[peptide.frequencies().get(i) for i in aas]
l += [peptide.hydrophobic_moment(), peptide.hydrophobicity(), peptide.instability_index(), peptide.isoelectric_point(), peptide.mass_shift()]
l += [peptide.molecular_weight(), peptide.mz() ]
additional_feats = {k:v for k, v in zip(cols, l)}
return additional_feats
def featurize_seq(seq):
peptide = peptides.Peptide(seq)
feats = peptide.descriptors()
feats.update(predict_additional(peptide))
return feats
featurize_seq('MGMRMMFTVFLLVVLATTVVSIPSDRASDGRNAVVHERAPELVVTATTNCCGYNPMTICPPCMCTYSCPPKRKPGRRND')
Work together in producing code for the following:
training_seqs = load_sequences("./data/amps/training_set.faa") # Positive AMP sequences
# Predict features for all the training sequences
df = pd.DataFrame.from_dict({k:featurize_seq(v) for k, v in training_seqs.items()}, orient='index')
df
# Add a label +/-
df['label'] = [0 if 'NAMP' in i else 1 for i in df.index]
df
Step 3: Train a Random Forest Model
With the features prepared, we can split the data and train a Random Forest classifier.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Separate features and labels
X = df.drop(columns=["label"])
y = df["label"]
# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train the Random Forest classifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
print("Random Forest model training complete.")
To evaluate the model, we’ll plot the ROC curve using scikit-learn and matplotlib.
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
# Predict probabilities for the validation set
y_probs = rf_model.predict_proba(X_val)[:, 1] # Probability of the positive class
# Calculate ROC curve
fpr, tpr, _ = roc_curve(y_val, y_probs)
roc_auc = auc(fpr, tpr)
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
4.2. Test the Model on New Data¶
Now, we can use the model to predict labels for the test sequences.
# Extract features from the test sequences
to_test = load_sequences("./data/amps/test_set.faa") # Positive AMP sequences
# Predict features for all the training sequences
df2 = pd.DataFrame.from_dict({k:featurize_seq(v) for k, v in to_test.items()}, orient='index')
# Predict the probability of each test sequence being positive
test_probs = rf_model.predict_proba(df2)[:, 1]
test_predictions = rf_model.predict(df2)
# Display predictions
for i, seq in enumerate(to_test):
print(f"Sequence: {seq}")
print(f"Prediction: {'Positive' if test_predictions[i] == 1 else 'Negative'}")
print(f"Probability of being AMP: {test_probs[i]:.2f}\n")
Exercise: think about potential visualizations of this data. What information could be relevant to highlight?
4.3. Exploring importances from a RF model¶
Exercise 1: Identify and Visualize the Top 10 Most Important Features
In this exercise, you’ll identify the top 10 most important features in the Random Forest model and visualize them to understand which features are most predictive of the positive class.
Calculate Feature Importance: Use the feature_importances_ attribute of the trained Random Forest model to get the importance scores of each feature.
Visualize the Top 10 Features: Plot a bar chart to show the importance of each feature.
import matplotlib.pyplot as plt
import numpy as np
# Get feature importance from the model
feature_importances = rf_model.feature_importances_
feature_names = X.columns
# Sort features by importance
sorted_idx = np.argsort(feature_importances)[::-1]
top_features = sorted_idx[:10]
top_feature_names = [feature_names[i] for i in top_features]
top_feature_importances = feature_importances[top_features]
# Plot the top 10 important features
plt.figure(figsize=(10, 6))
plt.barh(top_feature_names[::-1], top_feature_importances[::-1], color='skyblue')
plt.xlabel("Feature Importance Score")
plt.title("Top 10 Most Important Features")
plt.show()
Exercise 2: Compare Top Features between Positive and Negative Sets
For the top 10 features identified, compare their distributions between the positive and negative sets. This helps you understand how these features differ across classes.
Subset the Top Features: Select only the top 10 features from the data DataFrame. Plot Feature Distributions: For each of the top features, create a boxplot or violin plot to compare the distributions between positive and negative sets.
import seaborn as sns
# Subset data to include only top 10 features and labels
top_features_data = df2[top_feature_names]
top_features_data['label'] = test_predictions
# Plot each feature as a boxplot to compare distributions in positive vs. negative sets
plt.figure(figsize=(12, 10))
for i, feature in enumerate(top_feature_names, 1):
plt.subplot(5, 2, i)
sns.boxplot(x='label', y=feature, data=top_features_data, palette=["#FFA07A", "#8FBC8F"])
plt.title(f"Distribution of {feature} by Class")
plt.xlabel("Class (0 = Negative, 1 = Positive)")
plt.ylabel(feature)
plt.tight_layout()
plt.show()
Explanation:
- Feature Importance Plot: The bar chart shows the importance of each feature in distinguishing between classes, based on the Random Forest model.
- Boxplots: Each boxplot displays the distribution of values for a given feature, with classes 0 (negative) and 1 (positive) shown separately.
Additional Questions for Analysis
- Which features show a clear separation between positive and negative sets?
- Do any of the top features overlap heavily, indicating they may not be strong discriminators?
- Can you hypothesize why certain features are predictive of positive or negative classes?
Additional: statistical tests¶
subdf = top_features_data[['label', top_feature_names[0]]]
subdf
from scipy.stats import mannwhitneyu
U1, p = mannwhitneyu(subdf[subdf['label']==1][top_feature_names[0]], subdf[subdf['label']==0][top_feature_names[0]], method="exact")
p
Exercise:
- Iterate the top features listed in top_features_names and print
4.4. OMD AMPs¶
omd_amps = pd.read_csv('./data/amps/smorfs_summary.tsv.gz', compression='gzip', sep='\t', index_col=0)
omd_bgcs = pd.read_csv('./data/amps/morfs_bgcs_amps.tsv.gz', compression='gzip', sep='\t', index_col=0)
md = pd.read_csv('../s0506_python/data/metadata.tsv', sep='\t', index_col=0)
Exercise:
- Explore together the given tables.
- Add metadata information as done in the previous python sessions to the AMP table.
- Filter high quality mags (completion >= 90, contamination <5) from the AMP table.
- Which are the top 10 GTDB taxonomies with largest number of AMPs normalized by genome size?.
- Can you group genomes by biosample making the average number of total AMPs and by each class?
- Can you plot a figure to explore if GC content relates to more AMPs?
- Extra:
- can you make a map plot coloring by number of products?