The goal of this tutorial is to introduce the standard software tools used to produce, explore and work with microbial GEMs. As most of these are working as Python packages, the tutorial also includes a brief introduction to Python programming.
This tutorial is expected to be run in our jupyter server: http://cousteau-jupyter.ethz.ch/
Before starting, run this command from your home to copy the material:
cp -r /nfs//nas22/fs2202/biol_micro_teaching/course_home/smiravet/hands_on_GEMs .
The index of these notebook includes:
Notes on notebooks usage
You are working in a jupyter notebook
. These documents allow to both write executable code, for example in python
, and document it using markdown
in different cells
.
Some useful shortcuts include:
markdown
Python is a programming language that has been under development for over 25 years.
This Chapter will not cover everything in Python. If you would like, please consider the following resources:
Getting Started with Python:
Learning Python in Notebooks:
This is handy to always have available for reference:
Python Reference:
Python is an imperative language based on statements. That is, programs in Python consists of lines composed of statements. A statement can be:
1 # This a comment: 1 is a number
1
2
-3
1
2
3.14
3.14
print(type(1), type(3.14)) # type() tells you the class of the object
<class 'int'> <class 'float'>
'apple'
'apple'
"apple"
'apple'
True
True
False
False
Python has very useful iterable data structures built into the language:
a = [1, 2, 3] # Equal is used to define a variable 'a'
print(a)
[1, 2, 3]
a[0] # Lists can be accessed using square brackets and the index (it is 0-based!)
1
a.append(10) # append adds a new element to a list
a
[1, 2, 3, 10]
a[0] = 10 # We can edit values in the list
a # No need to use print when working in a notebook
[10, 2, 3]
b = (1, 2, 3) # This is a tuple
b
(1, 2, 3)
c = set([1,2,3,3,3])
c
{1, 2, 3}
Exercise : What happens when you try to change an element in a tuple or a set?
d = {"apple": "a fruit", "banana": "an herb", "monkey": "a mammal"}
d["apple"]
'a fruit'
len(a), len(b), len(c), len(d) # the len() function tells you how long is an iterable (list, tuple, set, dictionary)
(3, 3, 3, 3)
print(1)
print(2)
print(3)
1 2 3
1
2
2
There are two ways to call functions in Python:
Infix operator name:
1 + 2
3
Exercise : What are these operators doing?
# Other operators:
print(1 - 2)
print(1 * 2)
print(1 / 2)
print(4 % 2)
print(4 ** 2)
-1 2 0.5 0 16
# Variable can be operated on the fly
a = 1
a += 10
print(a)
b = 10
b *= 10
print(b)
11 100
Exercise : What are these functions doing?
# Other operations
a = [1,3,-3]
abs(a[2]), sum(a), len(a)
(3, 1, 3)
None
def plus(a, b):
return a + b
plus(3, 4)
7
Exercise : What happens if you do not use return?
def plus(a, b):
a + b
a = 1
plus(a, 4)
a
1
What happened? All functions return something, even if you don't specify it. If you don't specify a return value, then it will default to returning None
.
"a" + 1
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) /home/smiravet/eth/sushitasks/gems/bc_2023/Chapter 02 - Introduction to Python.ipynb Cell 48 line <cell line: 1>() ----> <a href='vscode-notebook-cell://wsl%2Bubuntu/home/smiravet/eth/sushitasks/gems/bc_2023/Chapter%2002%20-%20Introduction%20to%20Python.ipynb#X63sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0'>1</a> "a" + 1 TypeError: can only concatenate str (not "int") to str
How to Read Python Error Messages
TypeError: Can't convert 'int' object to str implicitly
Above the error message is the "traceback" also called the "call stack". This is a representation of the sequence of procedure calls that lead to the error. If the procedure call originated from code from a file, the filename would be listed after the word "File" on each line. If the procedure call originated from a notebook cell, then the word "ipython-input-#-HEX".
1 == 1
True
[] is []
False
list() is list()
False
tuple() is tuple()
True
57663463467 == 57663463469
False
57663463467 != 57663463469
True
The Zen of Python:
import this
The Zen of Python, by Tim Peters Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. Although that way may not be obvious at first unless you're Dutch. Now is better than never. Although never is often better than *right* now. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. Namespaces are one honking great idea -- let's do more of those!
import operator # import can be used to use installed packages or your own functions
operator.add(1, 2)
import operator as op # libraries can be imported to abbreviations
op.add(1,2)
3
from operator import add # specific libraries can be imported
add(1,10)
11
Is not always clear:
y = 0
for x in range(10):
y = x
x
9
[x for x in range(10, 20)]
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
x
9
Python follows the LEGB Rule (after https://www.amazon.com/dp/0596513984/):
x = 3
def foo():
x=4
def bar():
print(x) # Accesses x from foo's scope
bar() # Prints 4
x=5
bar() # Prints 5
foo()
def function():
for i in range(10):
yield i
function()
for y in function():
print(y)
def do_something(a, b, c):
return (a, b, c)
do_something(1, 2, 3)
def do_something_else(a=1, b=2, c=3):
return (a, b, c)
do_something_else()
def some_function(start=[]):
start.append(1)
return start
result = some_function()
result
result.append(2)
other_result = some_function()
other_result
"List comprehension" is the idea of writing some code inside of a list that will generate a list. Using for
we can iterate on a list, string, etc.
Consider the following:
[x ** 2 for x in range(10)]
temp_list = []
for x in range(10):
temp_list.append(x ** 2)
temp_list
But list comprehension is much more concise.
for i in 'hello':
print(i)
h e l l o
a = 3
b = 4
if a>b:
print(f'{a} is greater than {b}')
elif a==b:
print(f'{a} == {b}')
else:
print(f'{b} is greater than {a}')
4 is greater than 3
Exercise : write a function to calculate the GC content of the sequence
seq = 'ACGTATTGCTTAGGCTGAGGCTAGGAGAGGGGACCCCCTAGCTAGGATCGT'
Hint: there are two different ways to do this:
.count(X)
in a string
%matplotlib inline
After the magic, we then need to import the matplotlib library:
import matplotlib.pyplot as plt
To create a simple line plot, just give a list of y-values to the function plt.plot().
plt.plot([5, 8, 2, 6, 1, 8, 2, 3, 4, 5, 6])
plt.xlabel('X')
plt.ylabel('Y')
plt.title('A plot')
Text(0.5, 1.0, 'A plot')
plt.boxplot([[1,2,3], [3,4,5]])
plt.show() # To show the plot, not required in notebooks
plt.savefig('./hands_on_files/boxplot.svg') # You can save in png, svg, pdf, etc
<Figure size 432x288 with 0 Axes>
Read the documentation on matplotlib:
Dataframes are useful structures when working with tables (csv, tsv). pandas
is the main library to work with them:
import pandas as pd # Classical way to import pandas
df = pd.read_csv('./hands_on_files/genomes-summary.csv')
df
Genome | # Biosynthetic Regions | # Biosynthetic Products | RiPPs | RiPPs:Proteusins | NRPS | PKSI | Other PKS | Saccharides | Terpenes | ... | station | depth | depth layer | latitude | longitude | date | size fraction | temperature (°C) | oxygen (µmol/kg) | detected in the water column | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MARD_SAMN10141426_REFG_MMP10141426 | 43.0 | 76.0 | 14.0 | 0.0 | 22.0 | 11.0 | 10.0 | 0.0 | 10.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
1 | MARD_SAMEA2272178_REFG_MMP2272178 | 37.0 | 40.0 | 8.0 | 0.0 | 10.0 | 3.0 | 6.0 | 1.0 | 6.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
2 | MARD_SAMD00040648_REFG_MMP00040648 | 37.0 | 53.0 | 2.0 | 0.0 | 25.0 | 8.0 | 3.0 | 0.0 | 6.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
3 | MARD_SAMN04488110_REFG_MMP04488110 | 36.0 | 46.0 | 5.0 | 0.0 | 11.0 | 14.0 | 3.0 | 3.0 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
4 | MARD_SAMN04571751_REFG_MMP04571751 | 34.0 | 48.0 | 10.0 | 0.0 | 13.0 | 4.0 | 6.0 | 0.0 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34810 | TARA_SAMN05326641_METAG_ANW00074 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
34811 | TARA_SAMN05326641_METAG_ANW00080 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
34812 | TARA_SAMN05326641_METAG_ANW00088 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
34813 | TARA_SAMN05326642_METAG_ASE00032 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
34814 | TARA_SAMN05326642_METAG_ASE00036 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
34815 rows × 46 columns
df.columns
Index(['Genome', '# Biosynthetic Regions', '# Biosynthetic Products', 'RiPPs', 'RiPPs:Proteusins', 'NRPS', 'PKSI', 'Other PKS', 'Saccharides', 'Terpenes', 'Other', 'Mean Completeness', 'Mean Contamination', 'Differential Coverage Index', 'CheckM Completeness', 'CheckM Contamination', 'CheckM Strain heterogeneity', 'Anvio Completion', 'Anvio Redundancy', '# tRNAs (AA)', '16S rRNA', '16S rRNA (partial)', '# CRISPRs', 'GTDB Taxonomy', 'Checkm Taxonomy', 'Anvio Domain', 'SpecI Species Cluster', 'mOTUs Species Cluster', 'dRep Species Cluster', 'dRep Representative Genome', 'Genome size', '# scaffolds', 'N50 (scaffolds)', 'Longest scaffold', 'Coding density', 'GC Content', 'station', 'depth', 'depth layer', 'latitude', 'longitude', 'date', 'size fraction', 'temperature (°C)', 'oxygen (µmol/kg)', 'detected in the water column'], dtype='object')
df['Genome']
0 MARD_SAMN10141426_REFG_MMP10141426 1 MARD_SAMEA2272178_REFG_MMP2272178 2 MARD_SAMD00040648_REFG_MMP00040648 3 MARD_SAMN04488110_REFG_MMP04488110 4 MARD_SAMN04571751_REFG_MMP04571751 ... 34810 TARA_SAMN05326641_METAG_ANW00074 34811 TARA_SAMN05326641_METAG_ANW00080 34812 TARA_SAMN05326641_METAG_ANW00088 34813 TARA_SAMN05326642_METAG_ASE00032 34814 TARA_SAMN05326642_METAG_ASE00036 Name: Genome, Length: 34815, dtype: object
df.loc[22]
Genome TARA_SAMEA2621033_METAG_KKHCHOLN # Biosynthetic Regions 23.0 # Biosynthetic Products 25.0 RiPPs 5.0 RiPPs:Proteusins 0.0 NRPS 8.0 PKSI 1.0 Other PKS 2.0 Saccharides 0.0 Terpenes 3.0 Other 6.0 Mean Completeness 94.98 Mean Contamination 8.12 Differential Coverage Index 5.0 CheckM Completeness 93.55 CheckM Contamination 6.17 CheckM Strain heterogeneity 0.0 Anvio Completion 96.4 Anvio Redundancy 10.07 # tRNAs (AA) 19.0 16S rRNA 0.0 16S rRNA (partial) 1.0 # CRISPRs 4.0 GTDB Taxonomy d__Bacteria;p__Myxococcota;c__Polyangia;o__Pol... Checkm Taxonomy c__Deltaproteobacteria Anvio Domain BACTERIA SpecI Species Cluster cluster_5530 mOTUs Species Cluster gom_003509 dRep Species Cluster 373_0 dRep Representative Genome True Genome size 11250240 # scaffolds 734 N50 (scaffolds) 22805 Longest scaffold 104071 Coding density 0.9128 GC Content 0.7188 station TARA_068 depth 5.0 depth layer EPI latitude -31.0266 longitude 4.665 date 2010-09-14 06:55:00 size fraction <-0.22 temperature (°C) 16.8632 oxygen (µmol/kg) 234.4763 detected in the water column True Name: 22, dtype: object
df[df['Genome']=='TARA_SAMN05326642_METAG_ASE00032']
Genome | # Biosynthetic Regions | # Biosynthetic Products | RiPPs | RiPPs:Proteusins | NRPS | PKSI | Other PKS | Saccharides | Terpenes | ... | station | depth | depth layer | latitude | longitude | date | size fraction | temperature (°C) | oxygen (µmol/kg) | detected in the water column | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
34813 | TARA_SAMN05326642_METAG_ASE00032 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
1 rows × 46 columns
df[(df['Mean Completeness']>=90) & (df['Mean Contamination']<5)]
Genome | # Biosynthetic Regions | # Biosynthetic Products | RiPPs | RiPPs:Proteusins | NRPS | PKSI | Other PKS | Saccharides | Terpenes | ... | station | depth | depth layer | latitude | longitude | date | size fraction | temperature (°C) | oxygen (µmol/kg) | detected in the water column | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MARD_SAMN10141426_REFG_MMP10141426 | 43.0 | 76.0 | 14.0 | 0.0 | 22.0 | 11.0 | 10.0 | 0.0 | 10.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
1 | MARD_SAMEA2272178_REFG_MMP2272178 | 37.0 | 40.0 | 8.0 | 0.0 | 10.0 | 3.0 | 6.0 | 1.0 | 6.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
3 | MARD_SAMN04488110_REFG_MMP04488110 | 36.0 | 46.0 | 5.0 | 0.0 | 11.0 | 14.0 | 3.0 | 3.0 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
4 | MARD_SAMN04571751_REFG_MMP04571751 | 34.0 | 48.0 | 10.0 | 0.0 | 13.0 | 4.0 | 6.0 | 0.0 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
5 | MARD_SAMN03999393_REFG_MMP03999393 | 34.0 | 44.0 | 8.0 | 0.0 | 8.0 | 8.0 | 2.0 | 0.0 | 7.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34644 | MARD_SAMN06006700_REFG_MMP06006700 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
34656 | MARD_SAMN05216170_REFG_MMP05216170 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
34657 | MARD_SAMN05216216_REFG_MMP05216216 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
34664 | MARD_SAMN08998225_REFG_MMP08998225 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
34665 | MARD_SAMN05418212_REFG_MMP05418212 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
6450 rows × 46 columns
df.sort_values('# Biosynthetic Products', ascending=False)
Genome | # Biosynthetic Regions | # Biosynthetic Products | RiPPs | RiPPs:Proteusins | NRPS | PKSI | Other PKS | Saccharides | Terpenes | ... | station | depth | depth layer | latitude | longitude | date | size fraction | temperature (°C) | oxygen (µmol/kg) | detected in the water column | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MARD_SAMN10141426_REFG_MMP10141426 | 43.0 | 76.0 | 14.0 | 0.0 | 22.0 | 11.0 | 10.0 | 0.0 | 10.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
2 | MARD_SAMD00040648_REFG_MMP00040648 | 37.0 | 53.0 | 2.0 | 0.0 | 25.0 | 8.0 | 3.0 | 0.0 | 6.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
4 | MARD_SAMN04571751_REFG_MMP04571751 | 34.0 | 48.0 | 10.0 | 0.0 | 13.0 | 4.0 | 6.0 | 0.0 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
7 | MARD_SAMN09475466_REFG_MMP09475466 | 32.0 | 47.0 | 11.0 | 0.0 | 9.0 | 9.0 | 5.0 | 1.0 | 6.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
3 | MARD_SAMN04488110_REFG_MMP04488110 | 36.0 | 46.0 | 5.0 | 0.0 | 11.0 | 14.0 | 3.0 | 3.0 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34810 | TARA_SAMN05326641_METAG_ANW00074 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
34811 | TARA_SAMN05326641_METAG_ANW00080 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
34812 | TARA_SAMN05326641_METAG_ANW00088 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
34813 | TARA_SAMN05326642_METAG_ASE00032 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
34814 | TARA_SAMN05326642_METAG_ASE00036 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
34815 rows × 46 columns
Exercise : How many MAGs have >15 Biosynthetic Products?
Exercise : How many MAGs have >15 Biosynthetic Products and completion >= 90 and contamination < 5?
You can save a dataframe to a tsv/csv or excel file using:
df.to_csv('./hands_on_files/test.csv') # Comma-separated
df.to_csv('./hands_on_files/test.tsv', sep='\t') # Tab-separated
df.to_excel('./hands_on_files/test.xlsx') # Tab-separated
Finally, seaborn
is a library that makes easy to interact with dataframes and plotting functions between pandas
and matplotlib
:
import seaborn as sns
sns.scatterplot(x='Mean Completeness', y='Mean Contamination', data=df)
<Axes: xlabel='Mean Completeness', ylabel='Mean Contamination'>
sns.jointplot(x='Mean Completeness', y='Mean Contamination', data=df)
<seaborn.axisgrid.JointGrid at 0x7f4943fc8c70>
Exercise : How do you interprete the previous plot?
sns.violinplot(x='depth layer', y='GC Content', data=df)
<Axes: xlabel='depth layer', ylabel='GC Content'>
sns.boxplot(x='depth layer', y='GC Content', hue='size fraction', data=df.head(10000), palette='viridis')
<Axes: xlabel='depth layer', ylabel='GC Content'>
You can explore other representations in the seaborn page:
You need to load carveme first:
ml CarveMe
Then carveme can be run from the terminal by:
carve /input/faa/ -o /path/output/name.xml
Check the documentation of carveme here: https://carveme.readthedocs.io/en/latest/
Exercise : check the documentation an run carveme for the 3 faa files in hands_on_files/faa_files
storing the output in hands_on/gem_files
Exercise : open and explore one of the output files, what does it include? is this human-redable?
Cobra is a huge library to perform multiple function on metabolic models.
Check the documentation for cobra.py here: https://cobrapy.readthedocs.io/en/latest/
# The start of everything is reading the model with built-in function:
from cobra.io import read_sbml_model
btheta = read_sbml_model('./hands_on_files/gem_files/btheta.fa.xml')
btheta
Name | btheta_fa |
Memory address | 7fa1b7f91d90 |
Number of metabolites | 1081 |
Number of reactions | 1634 |
Number of genes | 620 |
Number of groups | 0 |
Objective expression | 1.0*Growth - 1.0*Growth_reverse_699ae |
Compartments | cytosol, periplasm, extracellular space |
Exercise : repeat the previous process assigning the models to ecoli
and arect
. How many metabolites, reactions and genes have each model?
You can access reactions by index or by name:
btheta.reactions[50]
Reaction identifier | AACPS4 |
Name | Acyl-[acyl-carrier-protein] synthetase (n-C16:1) |
Memory address | 0x7f77e9512a30 |
Stoichiometry |
ACP_c + atp_c + hdcea_c --> amp_c + hdeACP_c + ppi_c Acyl carrier protein + ATP C10H12N5O13P3 + Hexadecenoate (n-C16:1) --> AMP C10H12N5O7P + Cis-hexadec-9-enoyl-[acyl-carrier protein] (n-C16:1) + Diphosphate |
GPR | lcl_CP092641_1_prot_UML60641_1_3835 |
Lower bound | 0.0 |
Upper bound | 1000.0 |
btheta.reactions.get_by_id('AACPS4')
Reaction identifier | AACPS4 |
Name | Acyl-[acyl-carrier-protein] synthetase (n-C16:1) |
Memory address | 0x7fa1b5890c40 |
Stoichiometry |
ACP_c + atp_c + hdcea_c --> amp_c + hdeACP_c + ppi_c Acyl carrier protein + ATP C10H12N5O13P3 + Hexadecenoate (n-C16:1) --> AMP C10H12N5O7P + Cis-hexadec-9-enoyl-[acyl-carrier protein] (n-C16:1) + Diphosphate |
GPR | lcl_CP092641_1_prot_UML60641_1_3835 |
Lower bound | 0.0 |
Upper bound | 1000.0 |
The information of the reaction can be explored as strings:
acp = btheta.reactions.get_by_id('AACPS4')
print(acp.name)
print(acp.reaction)
Acyl-[acyl-carrier-protein] synthetase (n-C16:1) ACP_c + atp_c + hdcea_c --> amp_c + hdeACP_c + ppi_c
We can also ensure the reaction is mass balanced. This function will return elements which violate mass balance. If it comes back empty, then the reaction is mass balanced.
acp.check_mass_balance()
{'charge': -2.0}
We can list all the reactions names in a model using comprehension lists:
reactions = [i.id for i in btheta.reactions]
reactions[:10] # Show the 10 first reactions
['12DGR140tipp', '12DGR160tipp', '12DGR161tipp', '12PPDRDH', '13PPDH2', '13PPDH2_1', '14GLUCANabcpp', '14GLUCANtexi', '24DECOAR', '2AGPE120tipp']
There is other information in the the reactions objects, to check the different methods we can apply to an object in python we can use the method dir()
:
print(dir(btheta.reactions[0]))
['__add__', '__class__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__isub__', '__le__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__weakref__', '_annotation', '_associate_gene', '_check_bounds', '_dissociate_gene', '_genes', '_gpr', '_id', '_lower_bound', '_metabolites', '_model', '_repr_html_', '_set_id_with_model', '_update_awareness', '_upper_bound', 'add_metabolites', 'annotation', 'boundary', 'bounds', 'build_reaction_from_string', 'build_reaction_string', 'check_mass_balance', 'compartments', 'copy', 'delete', 'flux', 'flux_expression', 'forward_variable', 'functional', 'gene_name_reaction_rule', 'gene_reaction_rule', 'genes', 'get_coefficient', 'get_coefficients', 'get_compartments', 'gpr', 'id', 'knock_out', 'lower_bound', 'metabolites', 'model', 'name', 'notes', 'objective_coefficient', 'products', 'reactants', 'reaction', 'reduced_cost', 'remove_from_model', 'reverse_id', 'reverse_variable', 'reversibility', 'subsystem', 'subtract_metabolites', 'summary', 'update_genes_from_gpr', 'update_variable_bounds', 'upper_bound', 'x', 'y']
# Useful attributes shared between this object and the following are:
print(btheta.reactions[50].id)
print(btheta.reactions[50].name)
print(btheta.reactions[50].compartments)
print(btheta.reactions[50].reaction)
AACPS4 Acyl-[acyl-carrier-protein] synthetase (n-C16:1) {'C_c'} ACP_c + atp_c + hdcea_c --> amp_c + hdeACP_c + ppi_c
In the same way, you can access metabolites:
btheta.metabolites.get_by_id("atp_c")
Metabolite identifier | atp_c |
Name | ATP C10H12N5O13P3 |
Memory address | 0x7f77e97c6d30 |
Formula | C10H12N5O13P3 |
Compartment | C_c |
In 303 reaction(s) | NAabc, NADS2, UAMAS, NDPK7, RIBabc1, CYTDK1_1, GMPS, RIBabcpp, NAabcO, NDPK8, UAMAGS, ANHMK, XYLBabc, GLYCK_1, PG180abcpp, XYLK, ACTNabc, GLUTRS, PG161abcpp, DTMPK, SULAabc, ASPK, RIBabc,... |
We can print out the metabolite name, formula, compartment (cytosol in this case) directly as string, and charge.
atp = btheta.metabolites.get_by_id("atp_c")
print(atp.name)
print(atp.formula)
print(atp.compartment)
print(atp.charge)
ATP C10H12N5O13P3 C10H12N5O13P3 C_c -4
To get the reactions for a metabolite we can combine the previous methods:
atp_reactions = btheta.metabolites.get_by_id("atp_c").reactions
len(atp_reactions)
303
A metabolite like glucose 6-phosphate will participate in fewer reactions.
len(btheta.metabolites.get_by_id("g6p_c").reactions)
9
Exercise : What happens when you do atp_reactions[1]?
Exercise : correct the following code to print how many reactions use atp in each organism
models = [] # Include here a list of models
len_reactions = []
for model in models:
atp_reactions = model.metabolites.get_by_id("atp").reactions
len_reactions.append(sum(atp_reactions))
303
Exercise : complete the code below to compare the number of reactions using ATP in each model (calculated before). Who has more reactions using ATP?
model_names = ['A', 'B', 'C']
plt.bar(x=model_names, height=) # What should we put in height?
Input In [65] plt.bar(x=model_names, height=) # What should we put in height? ^ SyntaxError: invalid syntax
Alternatively, we can summarize the reactions with the .summary()
method
btheta.metabolites.atp_c.summary()
C10H12N5O13P3
Percent | Flux | Reaction | Definition |
---|---|---|---|
18.84% | 1000 | ATPS4rpp | adp_c + 4.0 h_p + pi_c <=> atp_c + h2o_c + 3.0 h_c |
18.84% | 1000 | GALKr | atp_c + gal_c <=> adp_c + gal1p_c + h_c |
15.92% | 844.9 | NDPK1 | atp_c + gdp_c <=> adp_c + gtp_c |
18.84% | 1000 | PGK | 3pg_c + atp_c <=> 13dpg_c + adp_c |
18.84% | 1000 | PYK | adp_c + h_c + pep_c --> atp_c + pyr_c |
8.58% | 455.4 | SR3 | adp_c + h_c + isobutp_c --> atp_c + isobuta_c |
0.16% | 8.44 | SUCOAS | atp_c + coa_c + succ_c <=> adp_c + pi_c + succoa_c |
Percent | Flux | Reaction | Definition |
---|---|---|---|
18.84% | -1000 | A6PAG | atp_c + gal_c --> adp_c + dgal6p_c + h_c |
0.45% | -23.7 | ADNK1 | adn_c + atp_c --> adp_c + amp_c + h_c |
0.11% | -5.65 | ALAALAr | 2.0 ala__D_c + atp_c <=> adp_c + alaala_c + h_c + pi_c |
0.00% | -0.03254 | APPAT | atp_c + h_c + pan4p_c --> dpcoa_c + ppi_c |
0.31% | -16.71 | ARGabc | arg__L_e + atp_c + h2o_c --> adp_c + arg__L_c + h_c + pi_c |
0.26% | -13.62 | ASNabc | asn__L_e + atp_c + h2o_c --> adp_c + asn__L_c + h_c + pi_c |
0.11% | -5.65 | ASPK | asp__L_c + atp_c <=> 4pasp_c + adp_c |
0.01% | -0.2941 | CA2abc | atp_c + ca2_e + h2o_c --> adp_c + ca2_c + h_c + pi_c |
1.17% | -62.2 | CDPMEK | 4c2me_c + atp_c --> 2p4c2me_c + adp_c + h_c |
1.34% | -71.27 | CYTK1 | atp_c + cmp_c <=> adp_c + cdp_c |
0.00% | -0.04006 | Cuabc | atp_c + cu2_e + h2o_c --> adp_c + cu2_c + h_c + pi_c |
0.03% | -1.526 | DGK1 | atp_c + dgmp_c <=> adp_c + dgdp_c |
0.03% | -1.526 | DGNSK | atp_c + dgsn_c --> adp_c + dgmp_c + h_c |
0.00% | -0.03254 | DPCOAK | atp_c + dpcoa_c --> adp_c + coa_c + h_c |
0.03% | -1.478 | DTMPK | atp_c + dtmp_c <=> adp_c + dtdp_c |
0.01% | -0.4412 | FE3abc | atp_c + fe3_e + h2o_c --> adp_c + fe3_c + h_c + pi_c |
0.00% | -0.0126 | FMNAT | atp_c + fmn_c + h_c --> fad_c + ppi_c |
4.99% | -265 | FTHFLi | atp_c + for_c + thf_c --> 10fthf_c + adp_c + pi_c |
0.28% | -14.87 | GLNabc | atp_c + gln__L_e + h2o_c --> adp_c + gln__L_c + h_c + pi_c |
57.60% | -3058 | Growth | 0.000223 10fthf_c + 0.513689 ala__L_c + 0.000223 amet_c + 0.295792 arg__L_c + 0.241055 asn__L_c + 0.241055 asp__L_c + 54.124831 atp_c + 0.005205 ca2_c + 0.005205 cl_c + 0.000576 coa_c + 0.0001 cobalt2_c + 0.133508 ctp_c + 0.000709 cu2_c + 0.09158 cys__L_c + 0.026166 datp_c + 0.027017 dctp_c + 0.027017 dgtp_c + 0.026166 dttp_c + 0.000223 fad_c + 0.006715 fe2_c + 0.007808 fe3_c + 0.26316 gln__L_c + 0.26316 glu__L_c + 0.612638 gly_c + 0.215096 gtp_c + 48.601527 h2o_c + 0.094738 his__L_c + 0.290529 ile__L_c + 0.195193 k_c + 0.450531 leu__L_c + 0.343161 lys__L_c + 0.153686 met__L_c + 0.008675 mg2_c + 0.000223 mlthf_c + 0.000691 mn2_c + 0.0001 mql8_c + 0.001831 nad_c + 0.000447 nadp_c + 0.185265 phe__L_c + 0.221055 pro__L_c + 0.000223 pydx5p_c + 0.000223 ribflv_c + 0.215792 ser__L_c + 0.004338 so4_c + 0.000223 thf_c + 0.000223 thmpp_c + 0.253687 thr__L_c + 0.056843 trp__L_c + 0.137896 tyr__L_c + 0.1 uaagmda_c + 0.144104 utp_c + 0.423162 val__L_c + 0.000341 zn2_c --> 53.95 adp_c + 53.95 h_c + 53.945662 pi_c + 0.773903 ppi_c |
0.00% | -0.0126 | HETZK | 4mhetz_c + atp_c --> 4mpetz_c + adp_c + h_c |
8.25% | -437.8 | HEX1 | atp_c + glc__D_c --> adp_c + g6p_c + h_c |
0.10% | -5.353 | HISabc | atp_c + h2o_c + his__L_e --> adp_c + h_c + his__L_c + pi_c |
0.48% | -25.46 | LEUabc | atp_c + h2o_c + leu__L_e --> adp_c + h_c + leu__L_c + pi_c |
0.37% | -19.39 | LYSabc | atp_c + h2o_c + lys__L_e --> adp_c + h_c + lys__L_c + pi_c |
0.00% | -0.01825 | METAT | atp_c + h2o_c + met__L_c --> amet_c + pi_c + ppi_c |
0.16% | -8.696 | METabc | atp_c + h2o_c + met__L_e --> adp_c + h_c + met__L_c + pi_c |
0.00% | -0.02526 | NADK | atp_c + nad_c --> adp_c + h_c + nadp_c |
0.00% | -0.1287 | NADS1 | atp_c + dnad_c + nh4_c --> amp_c + h_c + nad_c + ppi_c |
0.00% | -0.1287 | NNATr | atp_c + h_c + nicrnt_c <=> dnad_c + ppi_c |
0.00% | -0.03254 | PANTS | ala_B_c + atp_c + pant__R_c --> amp_c + h_c + pnto__R_c + ppi_c |
0.20% | -10.47 | PHEabc | atp_c + h2o_c + phe__L_e --> adp_c + h_c + phe__L_c + pi_c |
0.00% | -0.0126 | PMPK | 4ampm_c + atp_c --> 2mahmp_c + adp_c |
0.00% | -0.03254 | PNTK | atp_c + pnto__R_c --> 4ppan_c + adp_c + h_c |
0.23% | -12.28 | PRPPS | atp_c + r5p_c <=> amp_c + h_c + prpp_c |
0.00% | -0.0126 | RBFK | atp_c + ribflv_c --> adp_c + fmn_c + h_c |
3.52% | -186.8 | RBK_L1 | atp_c + rbl__L_c --> adp_c + h_c + ru5p__L_c |
0.27% | -14.33 | THRabc | atp_c + h2o_c + thr__L_e --> adp_c + h_c + pi_c + thr__L_c |
0.03% | -1.478 | TMDK1_1 | atp_c + thymd_c --> adp_c + dtmp_c + h_c |
0.00% | -0.0126 | TMPK | atp_c + thmmp_c --> adp_c + thmpp_c |
0.15% | -7.791 | TYRabc | atp_c + h2o_c + tyr__L_e --> adp_c + h_c + pi_c + tyr__L_c |
0.11% | -5.65 | UAAGDS | 26dap__M_c + atp_c + uamag_c --> adp_c + h_c + pi_c + ugmd_c |
0.11% | -5.65 | UAMAGS | atp_c + glu__D_c + uama_c --> adp_c + h_c + pi_c + uamag_c |
0.11% | -5.65 | UAMAS | ala__L_c + atp_c + uamr_c --> adp_c + h_c + pi_c + uama_c |
0.11% | -5.65 | UGMDDS | alaala_c + atp_c + ugmd_c --> adp_c + h_c + pi_c + ugmda_c |
0.26% | -13.79 | UMPK | atp_c + ump_c <=> adp_c + udp_c |
Exercise : check in the BiGG database http://bigg.ucsd.edu/ the symbol for nitrite and check if the three models have this as metabolite
We can list all the metabolites in a model using comprehension lists:
metabolites = [i.id for i in btheta.reactions]
metabolites[:10] # Show the 10 first reactions
['12DGR140tipp', '12DGR160tipp', '12DGR161tipp', '12PPDRDH', '13PPDH2', '13PPDH2_1', '14GLUCANabcpp', '14GLUCANtexi', '24DECOAR', '2AGPE120tipp']
The gene_reaction_rule is a boolean representation of the gene requirements for this reaction to be active.
We will consider the reaction glucose 6-phosphate isomerase, which interconverts glucose 6-phosphate and fructose 6-phosphate. The reaction id for this reaction in our test model is PGI.
pgi = btheta.reactions.get_by_id("PGI")
pgi
Reaction identifier | PGI |
Name | Glucose-6-phosphate isomerase |
Memory address | 0x7f77d330ac40 |
Stoichiometry |
g6p_c <=> f6p_c D-Glucose 6-phosphate <=> D-Fructose 6-phosphate |
GPR | lcl_CP092641_1_prot_UML61488_1_4728 |
Lower bound | -1000.0 |
Upper bound | 1000.0 |
pgi.check_mass_balance() # This reaction is mass balanced
{}
pgi.genes
frozenset({<Gene lcl_CP092641_1_prot_UML61488_1_4728 at 0x7f77d3b1e430>})
pgi_gene = btheta.genes.get_by_id('lcl_CP092641_1_prot_UML61488_1_4728')
Each gene keeps track of the reactions it catalyzes
pgi_gene.reactions
frozenset({<Reaction G6PI3 at 0x7f77d37082b0>, <Reaction G6PI at 0x7f77d36f8cd0>, <Reaction PGI1c at 0x7f77d330ae50>, <Reaction PGI at 0x7f77d330ac40>})
We can list all the genes in a model using comprehension lists:
genes = [i.id for i in btheta.genes]
genes[:10] # Show the 10 first reactions
['lcl_CP092641_1_prot_UML59985_1_3140', 'lcl_CP092641_1_prot_UML61431_1_4671', 'lcl_CP092641_1_prot_UML60641_1_3835', 'lcl_CP092641_1_prot_UML60141_1_3306', 'lcl_CP092641_1_prot_UML59397_1_2519', 'lcl_CP092641_1_prot_UML60650_1_3844', 'lcl_CP092641_1_prot_UML62416_1_957', 'lcl_CP092641_1_prot_UML61752_1_257', 'lcl_CP092641_1_prot_UML61881_1_400', 'lcl_CP092641_1_prot_UML62526_1_1074']
We can produce different visualizations, such as networks, from the produced metabolic models. Escher
is an interactive tool to do this.
In the following code, you will save a model in JSON format (the default input for the library) and then you can interactively add reactions to a map using Edit > Add reaction mode > click on the reaction of interest. You can click in compound to link reactions using those metabolites:
import escher
from escher import Builder
from cobra.io import save_json_model
# This requires JSON format
btheta = read_sbml_model('./hands_on_files/gem_files/btheta.fa.xml')
save_json_model(btheta, "./hands_on_files/gem_files/test_solution.json")
builder = Builder(
model_json= "./hands_on_files/gem_files/test_solution.json"
)
builder
Builder()
This library has very comprehensive and complex options including flux representations or even animations, check their web for more https://escher.readthedocs.io/en/latest/index.html . The common workflow requires the user to subset the reactions of interest, save the JSON model, and then plot it. This is an example from escher
bundled maps:
builder = Builder(
map_name='e_coli_core.Core metabolism',
model_name='e_coli_core',
)
builder
Downloading Map from https://escher.github.io/1-0-0/6/maps/Escherichia%20coli/e_coli_core.Core%20metabolism.json Downloading Model from https://escher.github.io/1-0-0/6/models/Escherichia%20coli/e_coli_core.json
Builder()
Simulations using flux balance analysis can be solved using Model.optimize(). This will maximize or minimize (maximizing is the default) flux through the objective reactions.
solution = ecoli.optimize()
solution.objective_value
65.78274069086645
Models solved using FBA can be further analyzed by using summary methods, which output printed text to give a quick representation of model behavior. Calling the summary method on the entire model displays information on the input and output behavior of the model (uptake and secretion), along with the optimized objective.
summary = ecoli.summary()
summary
1.0 Growth = 65.78274069086643
Metabolite | Reaction | Flux | C-Number | C-Flux |
---|---|---|---|---|
4abzglu_e | EX_4abzglu_e | 0.04401 | 12 | 0.00% |
LalaDgluMdap_e | EX_LalaDgluMdap_e | 6.578 | 15 | 0.88% |
adn_e | EX_adn_e | 13.45 | 10 | 1.20% |
akg_e | EX_akg_e | 642.2 | 5 | 28.63% |
arbt6p_e | EX_arbt6p_e | 0.04401 | 12 | 0.00% |
arg__L_e | EX_arg__L_e | 19.46 | 6 | 1.04% |
asn__L_e | EX_asn__L_e | 15.86 | 4 | 0.57% |
ca2_e | EX_ca2_e | 0.3424 | 0 | 0.00% |
cl_e | EX_cl_e | 0.3424 | 0 | 0.00% |
cobalt2_e | EX_cobalt2_e | 0.006578 | 0 | 0.00% |
cu2_e | EX_cu2_e | 0.04664 | 0 | 0.00% |
cytd_e | EX_cytd_e | 20.04 | 9 | 1.61% |
dcyt_e | EX_dcyt_e | 72.46 | 9 | 5.81% |
fe2_e | EX_fe2_e | 0.4417 | 0 | 0.00% |
fe3_e | EX_fe3_e | 0.5136 | 0 | 0.00% |
g3ps_e | EX_g3ps_e | 14.23 | 6 | 0.76% |
gam6p_e | EX_gam6p_e | 13.16 | 6 | 0.70% |
gln__L_e | EX_gln__L_e | 17.31 | 5 | 0.77% |
glu__L_e | EX_glu__L_e | 447.6 | 5 | 19.95% |
gly_e | EX_gly_e | 40.27 | 2 | 0.72% |
glyc3p_e | EX_glyc3p_e | 985.8 | 3 | 26.36% |
gsn_e | EX_gsn_e | 16 | 10 | 1.43% |
his__L_e | EX_his__L_e | 6.232 | 6 | 0.33% |
ile__L_e | EX_ile__L_e | 19.11 | 6 | 1.02% |
indole_e | EX_indole_e | 3.739 | 8 | 0.27% |
k_e | EX_k_e | 12.84 | 0 | 0.00% |
leu__L_e | EX_leu__L_e | 29.64 | 6 | 1.59% |
lys__L_e | EX_lys__L_e | 22.57 | 6 | 1.21% |
met__L_e | EX_met__L_e | 10.12 | 5 | 0.45% |
mg2_e | EX_mg2_e | 0.5707 | 0 | 0.00% |
minohp_e | EX_minohp_e | 6.208 | 6 | 0.33% |
mn2_e | EX_mn2_e | 0.04546 | 0 | 0.00% |
nmn_e | EX_nmn_e | 0.1499 | 11 | 0.01% |
no2_e | EX_no2_e | 62.16 | 0 | 0.00% |
no3_e | EX_no3_e | 411.3 | 0 | 0.00% |
o2_e | EX_o2_e | 268.9 | 0 | 0.00% |
phe__L_e | EX_phe__L_e | 12.19 | 9 | 0.98% |
pnto__R_e | EX_pnto__R_e | 0.03789 | 9 | 0.00% |
pro__L_e | EX_pro__L_e | 14.54 | 5 | 0.65% |
s_e | EX_s_e | 6.062 | 0 | 0.00% |
skm_e | EX_skm_e | 0.006578 | 7 | 0.00% |
so4_e | EX_so4_e | 0.2854 | 0 | 0.00% |
thm_e | EX_thm_e | 0.01467 | 12 | 0.00% |
thr__L_e | EX_thr__L_e | 16.69 | 4 | 0.60% |
thymd_e | EX_thymd_e | 1.721 | 10 | 0.15% |
tyr__L_e | EX_tyr__L_e | 9.071 | 9 | 0.73% |
val__L_e | EX_val__L_e | 27.84 | 5 | 1.24% |
zn2_e | EX_zn2_e | 0.02243 | 0 | 0.00% |
Metabolite | Reaction | Flux | C-Number | C-Flux |
---|---|---|---|---|
ala__L_e | EX_ala__L_e | -851 | 3 | 30.36% |
co2_e | EX_co2_e | -1000 | 1 | 11.89% |
for_e | EX_for_e | -13.37 | 1 | 0.16% |
fum_e | EX_fum_e | -72.49 | 4 | 3.45% |
h2o_e | EX_h2o_e | -893.2 | 0 | 0.00% |
hqn_e | EX_hqn_e | -0.04401 | 6 | 0.00% |
hxan_e | EX_hxan_e | -0.006578 | 5 | 0.00% |
inost_e | EX_inost_e | -6.208 | 6 | 0.44% |
mal__L_e | EX_mal__L_e | -56.68 | 4 | 2.70% |
mththf_e | EX_mththf_e | -0.006578 | 5 | 0.00% |
nh4_e | EX_nh4_e | -62.16 | 0 | 0.00% |
pi_e | EX_pi_e | -985.7 | 0 | 0.00% |
succ_e | EX_succ_e | -1000 | 4 | 47.56% |
ura_e | EX_ura_e | -72.46 | 4 | 3.45% |
Secreted metabolites and those that are externally required can be accessed from these summaries:
summary.secretion_flux
flux | reaction | metabolite | |
---|---|---|---|
EX_12ppd__R_e | 0.0 | EX_12ppd__R_e | 12ppd__R_e |
EX_12ppd__S_e | 0.0 | EX_12ppd__S_e | 12ppd__S_e |
EX_14glucan_e | 0.0 | EX_14glucan_e | 14glucan_e |
EX_15dap_e | 0.0 | EX_15dap_e | 15dap_e |
EX_23camp_e | 0.0 | EX_23camp_e | 23camp_e |
... | ... | ... | ... |
sink_amob_c | 0.0 | sink_amob_c | amob_c |
sink_hemeO_c | 0.0 | sink_hemeO_c | hemeO_c |
sink_lipopb_c | 0.0 | sink_lipopb_c | lipopb_c |
sink_mobd_c | 0.0 | sink_mobd_c | mobd_c |
sink_sheme_c | 0.0 | sink_sheme_c | sheme_c |
302 rows × 3 columns
summary.uptake_flux
(48, 3)
Exercise : Which organism requires more external compounds?
Hint: you can use .shape in a dataframe to retrieve (nr_rows, nr_columns)
Exercise : List the external compounds required by E. coli
Hint: you can use access the dataframe[column_name]
This is useful in adjusting metabolic models to simulate conditions such as deletions of genes or adding a specific compound to the media.
Usually, this would be performed editing the model, then optimizing, then comparing with the original model. Editing models is out of the scope of this BC but we can use carveme to run similar processes automatically without editing the models themselves.
Here we use CarveMe to predict the uptake and secretion capabilities of an organism only from genetic evidence, and to produce a simulation-ready model without gap-filling for any particular media.
However, there are situations where you want to guarantee that the model is able to reproduce growth in one, or several, experimentally verified media.
For instance, you can ensure the model reproduces growth on M9 and LB media:
carve /input/faa/ -o /path/output/name.xml --gapfill M9
or short version:
carve /input/faa/ -o /path/output/name.xml -g M9
Note: CarveMe includes different media compositions but the user can specify custom media, check here https://carveme.readthedocs.io/en/latest/advanced.html#media-database to learn how to specify your own media compositions.
Run the following commands from the terminal:
carve hands_on_files/faa_files/btheta.fa -o hands_on_files/gem_media_files/btheta_m9lb.xml -g M9,LB
carve hands_on_files/faa_files/arect.fa -o hands_on_files/gem_media_files/arect_m9lb.xml -g M9,LB
carve hands_on_files/faa_files/ecoli.fa -o hands_on_files/gem_media_files/ecoli_m9lb.xml -g M9,LB
Alternatively you can run the following python code to run all models in a directory:
import os # To run commands in the terminal from python
import glob # Package to find files
os.system('ml CarveMe') # Load carveme from module system
for fil in glob.glob('./hands_on_files/faa_files/*.fa'):
cmd = f"carve {fil} -o ./hands_on_files/gem_media_files/{fil.split('/')[-1]}_m9lb.xml -g M9,LB" # Compose the filename
print(f'Running {fil}')
os.system(cmd) # This call the program
sh: 1: ml: not found
Running ./hands_on_files/faa_files/arect.fa Running ./hands_on_files/faa_files/btheta.fa Running ./hands_on_files/faa_files/ecoli.fa
Now we can explore the produce models to check which 'gaps' have been filled:
# We can cross link lists using zip()
for original_model, gapfill_model in zip(glob.glob('./hands_on_files/gem_files/*.xml'),
glob.glob('./hands_on_files/gem_media_files/*.xml')):
print(original_model, gapfill_model)
./hands_on_files/gem_files/arect.fa.xml ./hands_on_files/gem_media_files/arect_m9lb.xml ./hands_on_files/gem_files/btheta.fa.xml ./hands_on_files/gem_media_files/btheta_m9lb.xml ./hands_on_files/gem_files/ecoli.fa.xml ./hands_on_files/gem_media_files/ecoli_m9lb.xml
Exercise : Go together along the following code and understand what is every line doing
for original_model, gapfill_model in zip(glob.glob('./hands_on_files/gem_files/*.xml'),
glob.glob('./hands_on_files/gem_media_files/*.xml')):
organism = original_model.split('/')[-1].split('.')[0]
# Load model 1 and 2
m1 = read_sbml_model(original_model)
m2 = read_sbml_model(gapfill_model)
# Check for the reactions in each model
r1 = set([i.reaction for i in m1.reactions])
r2 = set([i.reaction for i in m2.reactions])
# Check for the genes in each model
g1 = set([i.name for i in m1.genes])
g2 = set([i.name for i in m2.genes])
print('\n------\n', organism, '\n------\n')
print('Reactions filled:')
print(r2.difference(r1))
print('Genes added:')
print(g2.difference(g1))
------ arect ------ Reactions filled: {'akg_c + h_c + ichor_c <=> 2shchc_c + co2_c + pyr_c', '2shchc_c --> h2o_c + sucbz_c', 'atp_c + coa_c + sucbz_c --> amp_c + ppi_c + sbzcoa_c'} Genes added: set() ------ btheta ------ Reactions filled: {'nadp_c + pphn_c --> 34hpp_c + co2_c + nadph_c', 'gam1p_c <=> gam6p_c', '2shchc_c --> h2o_c + sucbz_c', 'accoa_c + gam1p_c --> acgam1p_c + coa_c + h_c', 'atp_c + coa_c + sucbz_c --> amp_c + ppi_c + sbzcoa_c', 'orn_c --> nh4_c + pro__L_c'} Genes added: set() ------ ecoli ------ Reactions filled: set() Genes added: set()
A commonly asked question when analyzing metabolic models is what will happen if a certain reaction was not allowed to have any flux at all. This can tested using cobrapy by:
from cobra.io import load_model
ecoli_model = load_model("textbook") # text book model bundled in cobra
print('complete model: ', ecoli_model.optimize())
with ecoli_model:
ecoli_model.reactions.PFK.knock_out()
print('pfk knocked out: ', ecoli_model.optimize())
ecoli_model.reactions.PGK.knock_out()
print('pfk knocked out: ', ecoli_model.optimize())
complete model: <Solution 0.874 at 0x7f77be69eb50> pfk knocked out: <Solution 0.704 at 0x7f77be69ee50> pfk knocked out: <Solution 0.000 at 0x7f77be709fd0>
Exercise : How do you interpret the previous results? which conclusion can you extract?
We can also perform all single gene deletions on a model:
from cobra.flux_analysis import single_gene_deletion # Function to evaluate all deletions
deletion_results = single_gene_deletion(ecoli_model)
deletion_results
ids | growth | status | |
---|---|---|---|
0 | {b3736} | 0.374230 | optimal |
1 | {b3925} | 0.873922 | optimal |
2 | {b1101} | 0.873922 | optimal |
3 | {b1612} | 0.873922 | optimal |
4 | {b3962} | 0.873922 | optimal |
... | ... | ... | ... |
132 | {b0726} | 0.858307 | optimal |
133 | {b0767} | 0.863813 | optimal |
134 | {b2465} | 0.873922 | optimal |
135 | {b3733} | 0.374230 | optimal |
136 | {b3735} | 0.374230 | optimal |
137 rows × 3 columns
Or by double deletions
from cobra.flux_analysis import double_gene_deletion # Function to evaluate double deletions
deletion_results = double_gene_deletion(ecoli_model)
deletion_results
ids | growth | status | |
---|---|---|---|
0 | {b3212, b0485} | 0.873922 | optimal |
1 | {b3612, b4154} | 0.873922 | optimal |
2 | {b0875, b2276} | 0.211663 | optimal |
3 | {b4232, b1276} | 0.873922 | optimal |
4 | {b2976, b2926} | 0.000000 | optimal |
... | ... | ... | ... |
9448 | {b1603, b2415} | NaN | infeasible |
9449 | {b2284, b0485} | 0.211663 | optimal |
9450 | {b1621, b2458} | 0.873922 | optimal |
9451 | {b1773, b0720} | 0.000000 | optimal |
9452 | {b2282, b2279} | 0.211663 | optimal |
9453 rows × 3 columns
The output of these functions are pandas dataframes that can be parsed as described above.
memote
goal is to achieve two major shifts in the metabolic model building community:
The memote
tool therefore performs four subfunctions:
And in order to make this process as easy as possible the generated repository can easily be integrated with continuous integration testing providers such as Travis CI, which means that anytime you push a model change to GitHub, the test suite will be run automatically and a report will be available for you to look at via GitHub pages for your repository.
From: https://memote.readthedocs.io/en/latest/index.html
A useful pipeline on how and why use memote in specific conditions can be found here: https://memote.readthedocs.io/en/latest/flowchart.html
To benchmark the performance of a single model, run this command in your terminal:
ml memote # load the tool
memote report snapshot path/to/model.xml
This will generate the performance report as index.html
.
The output filename can be changed by adding the following option. To illustrate here it is changed to report.html
.
memote report snapshot --filename "report.html" path/to/model.xml
This will produce a html file with the information. Alternatively, the output can be directed to the terminal for quick inspection:
memote run path/to/model.xml
Exercise : Run and inspect a model. Discuss the content together.
To compare the performance of two (or more) models, run this command in your terminal:
memote report diff path/to/model1.xml path/to/model2.xml [path/to/model3.xml ...]
Exercise : Compare the original model of Agathobacter rectalis with the gap filled one.
Final command involves tracking every change penformed in a model
memote report history
This is extra useful when curating models or exploring different phenotype. The purpose of this block course is not to curate models and cobra can be used for most of the requirements but it is good to know (and good practice) to use tools such as memote
to keep track of the changes and quality of a model.
We can combine different GEMs into a community model. Carveme allows to generate models from single models by:
merge_community ./hands_on_files/community/*.xml -o ./hands_on_files/community/community.xml
This generates an SBML file with a community where each organism is assigned to its own compartment and a common community biomass equation is also generated. You can import the merged model into any simulation tool, just as any normal constraint-based model and apply different types of simulation methods (FBA, FVA, etc…). You can initialize the community with a pre-defined medium (just like during single-species reconstruction):
Alternatively, SMETANA (https://smetana.readthedocs.io/en/latest/) can be used to predict interactions between models and provide metrics that can be used in direct comparisons. You can get these by running in the bash terminal:
# Load SMETANA
ml SMETANA
# For the global (simplified) mode
smetana ./hands_on_files/gem_files/*.xml -o ./hands_on_files/community/community -g
# For the detailed mode
smetana ./hands_on_files/gem_files/*.xml -o ./hands_on_files/community/community -d
The produced files can be inspected with pandas.
import pandas as pd
df_global = pd.read_csv('./hands_on_files/community/community_global.tsv', sep='\t')
df_global
community | medium | size | mip | mro | |
---|---|---|---|---|---|
0 | all | complete | 3 | 2 | 0.619048 |
From previous lecture:
Scenarios in which this mode can be useful include:
Here is another example (predicted in a community of 8 members):
df_global = pd.read_csv('./hands_on_files/community_examples/example_global.tsv', sep='\t')
df_global
community | medium | size | mip | mro | |
---|---|---|---|---|---|
0 | all | complete | 8 | 8 | 0.600907 |
Exercise : what community compete less for the resources? what community is less dependent on media composition?
In detailed mode the following metrics are reported:
df_detail = pd.read_csv('./hands_on_files/community/community_detailed.tsv', sep='\t')
df_detail
community | medium | receiver | donor | compound | scs | mus | mps | smetana | |
---|---|---|---|---|---|---|---|---|---|
0 | all | minimal | arect.fa | ecoli.fa | M_ala__L_e | 1.0 | 0.01 | 1 | 0.01 |
1 | all | minimal | arect.fa | ecoli.fa | M_arg__L_e | 1.0 | 0.03 | 1 | 0.03 |
2 | all | minimal | arect.fa | ecoli.fa | M_asp__L_e | 1.0 | 0.36 | 1 | 0.36 |
3 | all | minimal | arect.fa | ecoli.fa | M_cgly_e | 1.0 | 0.01 | 1 | 0.01 |
4 | all | minimal | arect.fa | ecoli.fa | M_cys__L_e | 1.0 | 0.09 | 1 | 0.09 |
5 | all | minimal | arect.fa | ecoli.fa | M_fe3_e | 1.0 | 0.90 | 1 | 0.90 |
6 | all | minimal | arect.fa | ecoli.fa | M_fe3pyovd_kt_e | 1.0 | 0.10 | 1 | 0.10 |
7 | all | minimal | arect.fa | ecoli.fa | M_g3pg_e | 1.0 | 0.65 | 1 | 0.65 |
8 | all | minimal | arect.fa | ecoli.fa | M_glc__D_e | 1.0 | 0.99 | 1 | 0.99 |
9 | all | minimal | arect.fa | ecoli.fa | M_glu__L_e | 1.0 | 0.02 | 1 | 0.02 |
10 | all | minimal | arect.fa | ecoli.fa | M_glyc3p_e | 1.0 | 0.30 | 1 | 0.30 |
11 | all | minimal | arect.fa | ecoli.fa | M_gua_e | 1.0 | 0.02 | 1 | 0.02 |
12 | all | minimal | arect.fa | ecoli.fa | M_h2s_e | 1.0 | 0.02 | 1 | 0.02 |
13 | all | minimal | arect.fa | ecoli.fa | M_h_e | 1.0 | 0.03 | 1 | 0.03 |
14 | all | minimal | arect.fa | ecoli.fa | M_ile__L_e | 1.0 | 0.01 | 1 | 0.01 |
15 | all | minimal | arect.fa | ecoli.fa | M_lys__L_e | 1.0 | 0.04 | 1 | 0.04 |
16 | all | minimal | arect.fa | ecoli.fa | M_nh4_e | 1.0 | 0.06 | 1 | 0.06 |
17 | all | minimal | arect.fa | ecoli.fa | M_orn_e | 1.0 | 0.02 | 1 | 0.02 |
18 | all | minimal | arect.fa | ecoli.fa | M_pi_e | 1.0 | 0.05 | 1 | 0.05 |
19 | all | minimal | arect.fa | ecoli.fa | M_thr__L_e | 1.0 | 0.07 | 1 | 0.07 |
20 | all | minimal | arect.fa | ecoli.fa | M_tyr__L_e | 1.0 | 0.02 | 1 | 0.02 |
21 | all | minimal | arect.fa | ecoli.fa | M_ura_e | 1.0 | 0.02 | 1 | 0.02 |
22 | all | minimal | btheta.fa | ecoli.fa | M_ala__D_e | 1.0 | 0.01 | 1 | 0.01 |
23 | all | minimal | btheta.fa | ecoli.fa | M_ala__L_e | 1.0 | 0.01 | 1 | 0.01 |
24 | all | minimal | btheta.fa | ecoli.fa | M_arg__L_e | 1.0 | 0.02 | 1 | 0.02 |
25 | all | minimal | btheta.fa | ecoli.fa | M_asp__L_e | 1.0 | 0.01 | 1 | 0.01 |
26 | all | minimal | btheta.fa | ecoli.fa | M_cgly_e | 1.0 | 0.19 | 1 | 0.19 |
27 | all | minimal | btheta.fa | ecoli.fa | M_cys__L_e | 1.0 | 0.03 | 1 | 0.03 |
28 | all | minimal | btheta.fa | ecoli.fa | M_fe3_e | 1.0 | 0.60 | 1 | 0.60 |
29 | all | minimal | btheta.fa | ecoli.fa | M_fe3pyovd_kt_e | 1.0 | 0.40 | 1 | 0.40 |
30 | all | minimal | btheta.fa | ecoli.fa | M_fum_e | 1.0 | 0.03 | 1 | 0.03 |
31 | all | minimal | btheta.fa | ecoli.fa | M_glc__D_e | 1.0 | 0.03 | 1 | 0.03 |
32 | all | minimal | btheta.fa | ecoli.fa | M_glu__L_e | 1.0 | 0.02 | 1 | 0.02 |
33 | all | minimal | btheta.fa | ecoli.fa | M_gly_e | 1.0 | 0.02 | 1 | 0.02 |
34 | all | minimal | btheta.fa | ecoli.fa | M_glyc3p_e | 1.0 | 0.99 | 1 | 0.99 |
35 | all | minimal | btheta.fa | ecoli.fa | M_gsn_e | 1.0 | 0.02 | 1 | 0.02 |
36 | all | minimal | btheta.fa | ecoli.fa | M_h_e | 1.0 | 0.01 | 1 | 0.01 |
37 | all | minimal | btheta.fa | ecoli.fa | M_ile__L_e | 1.0 | 0.01 | 1 | 0.01 |
38 | all | minimal | btheta.fa | ecoli.fa | M_leu__L_e | 1.0 | 0.02 | 1 | 0.02 |
39 | all | minimal | btheta.fa | ecoli.fa | M_mal__L_e | 1.0 | 0.01 | 1 | 0.01 |
40 | all | minimal | btheta.fa | ecoli.fa | M_met__L_e | 1.0 | 0.01 | 1 | 0.01 |
41 | all | minimal | btheta.fa | ecoli.fa | M_nh4_e | 1.0 | 0.02 | 1 | 0.02 |
42 | all | minimal | btheta.fa | ecoli.fa | M_pi_e | 1.0 | 0.01 | 1 | 0.01 |
43 | all | minimal | btheta.fa | ecoli.fa | M_pyovd_kt_e | 1.0 | 0.01 | 1 | 0.01 |
44 | all | minimal | btheta.fa | ecoli.fa | M_rib__D_e | 1.0 | 0.02 | 1 | 0.02 |
45 | all | minimal | btheta.fa | ecoli.fa | M_ser__L_e | 1.0 | 0.01 | 1 | 0.01 |
46 | all | minimal | btheta.fa | ecoli.fa | M_tyr__L_e | 1.0 | 1.00 | 1 | 1.00 |
47 | all | minimal | btheta.fa | ecoli.fa | M_val__L_e | 1.0 | 0.01 | 1 | 0.01 |
Exercise : based on the 4 metrics reported, which conclusions can we extract from the community?
import networkx as nx
import matplotlib.pyplot as plt
# Load the model
df = pd.read_csv('./hands_on_files/community/community_detailed.tsv', sep='\t')
# Create an empty graph
G = nx.Graph()
# Add nodes and edges to the graph
for _, row in df.iterrows():
receiver = row[2]
donor = row[3]
compound = row[4]
smetana = row[8]
G.add_edge(receiver, donor, compound=compound, smetana=smetana)
# Generate a layout for the graph
pos = nx.spring_layout(G)
# Create a colormap based on the receiver column
receivers = set(df['receiver']).union(set(df['donor']))
colors = plt.cm.rainbow(range(len(receivers)+1))
receiver_color_map = dict(zip(list(receivers)+['compound'], colors))
# Draw nodes and edges with labels and sizes
for (u, v, d) in G.edges(data=True):
compound = d["compound"]
smetana = d["smetana"]
edge_size = smetana * 100 # Scale edge size
nx.draw_networkx_edges(
G,
pos,
edgelist=[(u, v)],
width=edge_size,
alpha=0.5,
edge_color="gray",
)
nx.draw_networkx_nodes(
G,
pos,
node_color=[receiver_color_map[node] for node in G.nodes()],
node_size=300,
alpha=0.8,
)
nx.draw_networkx_labels(G, pos, font_size=8, font_color="black")
# Show the plot
plt.axis("off")
plt.show()
Exercise : what are your conclusions based on this network?
# Create an empty graph
G = nx.Graph()
# Add edges between receiver and donor nodes
for index, row in df.iterrows():
receiver = row['receiver']
donor = row['donor']
smetana = row['smetana']*10
compound = row['compound']
G.add_node(compound)
G.add_node(receiver)
G.add_node(donor)
G.add_edge(receiver, compound, weight=smetana)
G.add_edge(compound, donor, weight=smetana)
# Create a layout for the graph
pos = nx.spring_layout(G)
# Draw nodes
nx.draw_networkx_nodes(G, pos, node_color='lightgray', node_size=20)
# Draw edges
edge_sizes = [data['weight'] for u, v, data in G.edges(data=True)]
nx.draw_networkx_edges(G, pos, width=edge_sizes, edge_color='gray')
# Add labels to nodes
nx.draw_networkx_labels(G, pos, font_size=8, font_color='black')
# Show the plot
plt.axis('off')
plt.show()
Exercise : go through the previous code and explain what are we plotting in the plot above
import networkx as nx
import matplotlib.pyplot as plt
# Create an empty graph
G = nx.Graph()
# Add edges between receiver and donor nodes
for index, row in df.iterrows():
receiver = row['receiver']
donor = row['donor']
smetana = row['smetana']*10
compound = row['compound']
if donor=='ecoli.fa':
G.add_node(compound)
G.add_node(donor)
G.add_edge(compound, donor, weight=smetana)
# Draw nodes
pos = nx.random_layout(G)
nx.draw_networkx_nodes(G, pos, node_color='lightgray', node_size=20)
# Draw edges
edge_sizes = [data['weight'] for u, v, data in G.edges(data=True)]
nx.draw_networkx_edges(G, pos, width=edge_sizes, edge_color='gray')
# Add labels to nodes
nx.draw_networkx_labels(G, pos, font_size=8, font_color='black')
# Show the plot
plt.axis('off')
plt.show()
Exercise : go through the previous code and explain what are we plotting in the plot above
Exercise : can you edit the following code to show the compounds received by btheta?
import networkx as nx
import matplotlib.pyplot as plt
# Create an empty graph
G = nx.Graph()
# Add edges between receiver and donor nodes
for index, row in df.iterrows():
receiver = row['receiver']
donor = row['donor']
smetana = row['smetana']*10
compound = row['compound']
if donor=='ecoli.fa':
G.add_node(compound)
G.add_node(donor)
G.add_edge(compound, donor, weight=smetana)
# Draw nodes
pos = nx.random_layout(G)
nx.draw_networkx_nodes(G, pos, node_color='lightgray', node_size=20)
# Draw edges
edge_sizes = [data['weight'] for u, v, data in G.edges(data=True)]
nx.draw_networkx_edges(G, pos, width=edge_sizes, edge_color='gray')
# Add labels to nodes
nx.draw_networkx_labels(G, pos, font_size=8, font_color='black')
# Show the plot
plt.axis('off')
plt.show()
Heatmaps are also useful data representations, we can use seaborn to quickly plot the number of compounds shared between individuals in the community:
import seaborn as sns
# Pivot the DataFrame to count the number of compounds shared between donors and receivers
heatmap_data = df.pivot_table(index='donor', columns='receiver', values='compound', aggfunc='count', fill_value=0)
# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, cmap='YlGnBu', annot=True, fmt='d')
plt.title('Compound Sharing Heatmap')
plt.xlabel('Receiver')
plt.ylabel('Donor')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()
Exercise : What are the conclusion from the previous graph?
# Pivot the DataFrame to count the number of compounds shared between donors and receivers
heatmap_data = df[df['donor']=='ecoli.fa'].pivot_table(index='compound', columns='receiver', values='smetana', aggfunc='count', fill_value=0)
# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, cmap='YlGnBu', annot=True, fmt='d')
plt.title('Compound sharing between E. coli and the two receivers')
plt.xlabel('Receiver')
plt.ylabel('Compound')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()
Exercise :
Exercise : Can you edit the previous code to represent btheta as receiver showing in the columns who is the donor?
# Pivot the DataFrame to count the number of compounds shared between donors and receivers
heatmap_data = df[df['donor']=='ecoli.fa'].pivot_table(index='compound', columns='receiver', values='smetana', aggfunc='count', fill_value=0)
# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, cmap='YlGnBu', annot=True, fmt='d')
plt.title('Compound sharing between E. coli and the two receivers')
plt.xlabel('Receiver')
plt.ylabel('Compound')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()
That was not very meaningful, right? Can you try loading the example in hands_on_files/community_examples/example_detailed.tsv
and repeating the example?