Hands-on Tutorial on GEMs¶

The goal of this tutorial is to introduce the standard software tools used to produce, explore and work with microbial GEMs. As most of these are working as Python packages, the tutorial also includes a brief introduction to Python programming.

Where to work?¶

This tutorial is expected to be run in our jupyter server: http://cousteau-jupyter.ethz.ch/

Copying this tutorial:¶

Before starting, run this command from your home to copy the material:

cp -r /nfs//nas22/fs2202/biol_micro_teaching/course_home/smiravet/hands_on_GEMs .

INDEX¶

The index of these notebook includes:

  1. Python Introduction
  2. Building and exploring GEMs with CarveMe and CobraPy
  3. Evaluation and version control of models using memote
  4. Defining metabolic interactions with SMETANA

Notes on notebooks usage

You are working in a jupyter notebook. These documents allow to both write executable code, for example in python, and document it using markdown in different cells.

Some useful shortcuts include:

  • Shift+Enter : execute cell
  • Esc, b : add a cell below
  • Esc, a : add a cell above
  • Esc, m : transform cell to markdown
  • Esc, d : delete cell

1. Python Introduction¶

Python is a programming language that has been under development for over 25 years.

This Chapter will not cover everything in Python. If you would like, please consider the following resources:

Getting Started with Python:

  • https://www.codecademy.com/learn/python
  • http://docs.python-guide.org/en/latest/intro/learning/
  • https://learnpythonthehardway.org/book/
  • https://www.codementor.io/learn-python-online
  • https://websitesetup.org/python-cheat-sheet/

Learning Python in Notebooks:

  • http://mbakker7.github.io/exploratory_computing_with_python/

This is handy to always have available for reference:

Python Reference:

  • https://docs.python.org/3.5/reference/

1.1 Statements¶

Python is an imperative language based on statements. That is, programs in Python consists of lines composed of statements. A statement can be:

  • a single expression
  • an assignment
  • a function call
  • a function definition
  • a statement; statement

1.1.1 Expressions¶

  • Numbers
    • integers
    • floating-point
    • complex numbers
  • strings
  • boolean values
  • lists and dicts

1.1.1.1 Numbers¶

In [ ]:
1     # This a comment: 1 is a number
Out[ ]:
1
In [ ]:
2
In [ ]:
-3
In [ ]:
1
2
In [ ]:
3.14
Out[ ]:
3.14
In [ ]:
print(type(1), type(3.14))    # type() tells you the class of the object
<class 'int'> <class 'float'>

1.1.1.2 Strings¶

In [ ]:
'apple'
Out[ ]:
'apple'
In [ ]:
"apple"
Out[ ]:
'apple'

1.1.1.3 Boolean Values¶

In [ ]:
True
Out[ ]:
True
In [ ]:
False
Out[ ]:
False

1.1.1.4 Iterables¶

Python has very useful iterable data structures built into the language:

  • lists: [] --> mutable list of items
  • tuples: (item, ...) --> read-only data structure (immutable)
  • sets: set([item, ...]) --> like a tuple but keeping only unique items
  • dictionaries (hash tables): {} --> Dictionaries can relate a set of unique keys with a value (this can be a list, a number, etc.)
In [ ]:
a = [1, 2, 3]    # Equal is used to define a variable 'a'
print(a)
[1, 2, 3]
In [ ]:
a[0]   # Lists can be accessed using square brackets and the index (it is 0-based!)
Out[ ]:
1
In [ ]:
a.append(10)  # append adds a new element to a list
a
Out[ ]:
[1, 2, 3, 10]
In [ ]:
a[0] = 10    # We can edit values in the list
a            # No need to use print when working in a notebook
Out[ ]:
[10, 2, 3]
In [ ]:
b = (1, 2, 3) # This is a tuple
b    
Out[ ]:
(1, 2, 3)
In [ ]:
c = set([1,2,3,3,3])
c
Out[ ]:
{1, 2, 3}

Exercise : What happens when you try to change an element in a tuple or a set?

In [ ]:
 
In [ ]:
d = {"apple": "a fruit", "banana": "an herb", "monkey": "a mammal"}
In [ ]:
d["apple"]
Out[ ]:
'a fruit'
In [ ]:
len(a), len(b), len(c), len(d)    # the len() function tells you how long is an iterable (list, tuple, set, dictionary)
Out[ ]:
(3, 3, 3, 3)

1.1.2 Function Calls¶

1.1.2.1 Print¶

Evaluating and display result as an Out, versus evaluating and printing result (side-effect).

In [ ]:
print(1)
print(2)
print(3)
1
2
3
In [ ]:
1
2
Out[ ]:
2

1.1.2.2 Operators¶

There are two ways to call functions in Python:

  1. by pre-defined infix operator name
  2. by function name, followed by parentheses

Infix operator name:

In [ ]:
1 + 2
Out[ ]:
3

Exercise : What are these operators doing?

In [ ]:
# Other operators:
print(1 - 2)
print(1 * 2)
print(1 / 2)
print(4 % 2)
print(4 ** 2)
-1
2
0.5
0
16
In [ ]:
# Variable can be operated on the fly
a = 1
a += 10
print(a)

b = 10
b *= 10
print(b)
11
100

Exercise : What are these functions doing?

In [ ]:
# Other operations
a = [1,3,-3]

abs(a[2]), sum(a), len(a)
Out[ ]:
(3, 1, 3)

1.1.3 Special Values¶

In [ ]:
None

1.1.4 Defining Functions¶

In [ ]:
def plus(a, b):
    return a + b    
In [ ]:
plus(3, 4)
Out[ ]:
7

Exercise : What happens if you do not use return?

In [ ]:
def plus(a, b):
    a + b
In [ ]:
a = 1
plus(a, 4)
a
Out[ ]:
1

What happened? All functions return something, even if you don't specify it. If you don't specify a return value, then it will default to returning None.

In [ ]:
"a" + 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/smiravet/eth/sushitasks/gems/bc_2023/Chapter 02 - Introduction to Python.ipynb Cell 48 line <cell line: 1>()
----> <a href='vscode-notebook-cell://wsl%2Bubuntu/home/smiravet/eth/sushitasks/gems/bc_2023/Chapter%2002%20-%20Introduction%20to%20Python.ipynb#X63sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0'>1</a> "a" + 1

TypeError: can only concatenate str (not "int") to str

How to Read Python Error Messages

TypeError: Can't convert 'int' object to str implicitly

Above the error message is the "traceback" also called the "call stack". This is a representation of the sequence of procedure calls that lead to the error. If the procedure call originated from code from a file, the filename would be listed after the word "File" on each line. If the procedure call originated from a notebook cell, then the word "ipython-input-#-HEX".

1.2 Equality¶

1.2.1 ==¶

In [ ]:
1 == 1
Out[ ]:
True

1.2.2 is¶

In [ ]:
[] is []
Out[ ]:
False
In [ ]:
list() is list()
Out[ ]:
False
In [ ]:
tuple() is tuple()
Out[ ]:
True
In [ ]:
57663463467 == 57663463469
Out[ ]:
False
In [ ]:
57663463467 != 57663463469
Out[ ]:
True

1.3 Advanced Topics¶

The Zen of Python:

In [ ]:
import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
In [ ]:
import operator    # import can be used to use installed packages or your own functions
operator.add(1, 2)
In [ ]:
import operator as op     # libraries can be imported to abbreviations
op.add(1,2)
3
In [ ]:
from operator import add  # specific libraries can be imported 
add(1,10)
11

2.1 Scope of variables¶

Is not always clear:

In [ ]:
y = 0
for x in range(10):
    y = x
In [ ]:
x
Out[ ]:
9
In [ ]:
[x for x in range(10, 20)]
Out[ ]:
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
In [ ]:
x
Out[ ]:
9

2.2 Scope¶

Python follows the LEGB Rule (after https://www.amazon.com/dp/0596513984/):

  • L, Local: Names assigned in any way within a function (def or lambda)), and not declared global in that function.
  • E, Enclosing function locals: Name in the local scope of any and all enclosing functions (def or lambda), from inner to outer.
  • G, Global (module): Names assigned at the top-level of a module file, or declared global in a def within the file.
  • B, Built-in (Python): Names preassigned in the built-in names module : open, range, SyntaxError,...
In [ ]:
x = 3
def foo():
    x=4
    def bar():
        print(x)  # Accesses x from foo's scope
    bar()  # Prints 4
    x=5
    bar()  # Prints 5
In [ ]:
foo()

2.3 Generators¶

In [ ]:
def function():
    for i in range(10):
        yield i
In [ ]:
function()
In [ ]:
for y in function():
    print(y)

2.4 Default arguments¶

In [ ]:
def do_something(a, b, c):
    return (a, b, c)
In [ ]:
do_something(1, 2, 3)
In [ ]:
def do_something_else(a=1, b=2, c=3):
    return (a, b, c)
In [ ]:
do_something_else()
In [ ]:
def some_function(start=[]):
    start.append(1)
    return start
In [ ]:
result = some_function()
In [ ]:
result
In [ ]:
result.append(2)
In [ ]:
other_result = some_function()
In [ ]:
other_result

2.5 List comprehension¶

"List comprehension" is the idea of writing some code inside of a list that will generate a list. Using for we can iterate on a list, string, etc.

Consider the following:

In [ ]:
[x ** 2 for x in range(10)]
In [ ]:
temp_list = []
for x in range(10):
    temp_list.append(x ** 2)
temp_list

But list comprehension is much more concise.

In [ ]:
for i in 'hello':
    print(i)
h
e
l
l
o

2.6 Conditions¶

In [ ]:
a = 3
b = 4

if a>b:
    print(f'{a} is greater than {b}')
elif a==b:
    print(f'{a} == {b}')
else:
    print(f'{b} is greater than {a}')
4 is greater than 3

Exercise : write a function to calculate the GC content of the sequence

seq = 'ACGTATTGCTTAGGCTGAGGCTAGGAGAGGGGACCCCCTAGCTAGGATCGT'

Hint: there are two different ways to do this:

  • iterating and checking the character
  • using the method .count(X) in a string
In [ ]:
 

1.4 Plotting¶

In [ ]:
%matplotlib inline

After the magic, we then need to import the matplotlib library:

In [ ]:
import matplotlib.pyplot as plt

To create a simple line plot, just give a list of y-values to the function plt.plot().

In [ ]:
plt.plot([5, 8, 2, 6, 1, 8, 2, 3, 4, 5, 6])
plt.xlabel('X')
plt.ylabel('Y')
plt.title('A plot')
Out[ ]:
Text(0.5, 1.0, 'A plot')
In [ ]:
plt.boxplot([[1,2,3], [3,4,5]])
plt.show()    # To show the plot, not required in notebooks
plt.savefig('./hands_on_files/boxplot.svg')    # You can save in png, svg, pdf, etc
<Figure size 432x288 with 0 Axes>

Read the documentation on matplotlib:

http://matplotlib.org/api/pyplot_api.html

1.5 Dataframes¶

Dataframes are useful structures when working with tables (csv, tsv). pandas is the main library to work with them:

In [ ]:
import pandas as pd    # Classical way to import pandas
In [ ]:
df = pd.read_csv('./hands_on_files/genomes-summary.csv')
In [ ]:
df
Out[ ]:
Genome # Biosynthetic Regions # Biosynthetic Products RiPPs RiPPs:Proteusins NRPS PKSI Other PKS Saccharides Terpenes ... station depth depth layer latitude longitude date size fraction temperature (°C) oxygen (µmol/kg) detected in the water column
0 MARD_SAMN10141426_REFG_MMP10141426 43.0 76.0 14.0 0.0 22.0 11.0 10.0 0.0 10.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
1 MARD_SAMEA2272178_REFG_MMP2272178 37.0 40.0 8.0 0.0 10.0 3.0 6.0 1.0 6.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
2 MARD_SAMD00040648_REFG_MMP00040648 37.0 53.0 2.0 0.0 25.0 8.0 3.0 0.0 6.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
3 MARD_SAMN04488110_REFG_MMP04488110 36.0 46.0 5.0 0.0 11.0 14.0 3.0 3.0 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
4 MARD_SAMN04571751_REFG_MMP04571751 34.0 48.0 10.0 0.0 13.0 4.0 6.0 0.0 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34810 TARA_SAMN05326641_METAG_ANW00074 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True
34811 TARA_SAMN05326641_METAG_ANW00080 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True
34812 TARA_SAMN05326641_METAG_ANW00088 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True
34813 TARA_SAMN05326642_METAG_ASE00032 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True
34814 TARA_SAMN05326642_METAG_ASE00036 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True

34815 rows × 46 columns

In [ ]:
df.columns
Out[ ]:
Index(['Genome', '# Biosynthetic Regions', '# Biosynthetic Products', 'RiPPs',
       'RiPPs:Proteusins', 'NRPS', 'PKSI', 'Other PKS', 'Saccharides',
       'Terpenes', 'Other', 'Mean Completeness', 'Mean Contamination',
       'Differential Coverage Index', 'CheckM Completeness',
       'CheckM Contamination', 'CheckM Strain heterogeneity',
       'Anvio Completion', 'Anvio Redundancy', '# tRNAs (AA)', '16S rRNA',
       '16S rRNA (partial)', '# CRISPRs', 'GTDB Taxonomy', 'Checkm Taxonomy',
       'Anvio Domain', 'SpecI Species Cluster', 'mOTUs Species Cluster',
       'dRep Species Cluster', 'dRep Representative Genome', 'Genome size',
       '# scaffolds', 'N50 (scaffolds)', 'Longest scaffold', 'Coding density',
       'GC Content', 'station', 'depth', 'depth layer', 'latitude',
       'longitude', 'date', 'size fraction', 'temperature (°C)',
       'oxygen (µmol/kg)', 'detected in the water column'],
      dtype='object')
In [ ]:
df['Genome']
Out[ ]:
0        MARD_SAMN10141426_REFG_MMP10141426
1         MARD_SAMEA2272178_REFG_MMP2272178
2        MARD_SAMD00040648_REFG_MMP00040648
3        MARD_SAMN04488110_REFG_MMP04488110
4        MARD_SAMN04571751_REFG_MMP04571751
                        ...                
34810      TARA_SAMN05326641_METAG_ANW00074
34811      TARA_SAMN05326641_METAG_ANW00080
34812      TARA_SAMN05326641_METAG_ANW00088
34813      TARA_SAMN05326642_METAG_ASE00032
34814      TARA_SAMN05326642_METAG_ASE00036
Name: Genome, Length: 34815, dtype: object
In [ ]:
df.loc[22]
Out[ ]:
Genome                                           TARA_SAMEA2621033_METAG_KKHCHOLN
# Biosynthetic Regions                                                       23.0
# Biosynthetic Products                                                      25.0
RiPPs                                                                         5.0
RiPPs:Proteusins                                                              0.0
NRPS                                                                          8.0
PKSI                                                                          1.0
Other PKS                                                                     2.0
Saccharides                                                                   0.0
Terpenes                                                                      3.0
Other                                                                         6.0
Mean Completeness                                                           94.98
Mean Contamination                                                           8.12
Differential Coverage Index                                                   5.0
CheckM Completeness                                                         93.55
CheckM Contamination                                                         6.17
CheckM Strain heterogeneity                                                   0.0
Anvio Completion                                                             96.4
Anvio Redundancy                                                            10.07
# tRNAs (AA)                                                                 19.0
16S rRNA                                                                      0.0
16S rRNA (partial)                                                            1.0
# CRISPRs                                                                     4.0
GTDB Taxonomy                   d__Bacteria;p__Myxococcota;c__Polyangia;o__Pol...
Checkm Taxonomy                                            c__Deltaproteobacteria
Anvio Domain                                                             BACTERIA
SpecI Species Cluster                                                cluster_5530
mOTUs Species Cluster                                                  gom_003509
dRep Species Cluster                                                        373_0
dRep Representative Genome                                                   True
Genome size                                                              11250240
# scaffolds                                                                   734
N50 (scaffolds)                                                             22805
Longest scaffold                                                           104071
Coding density                                                             0.9128
GC Content                                                                 0.7188
station                                                                  TARA_068
depth                                                                         5.0
depth layer                                                                   EPI
latitude                                                                 -31.0266
longitude                                                                   4.665
date                                                          2010-09-14 06:55:00
size fraction                                                              <-0.22
temperature (°C)                                                          16.8632
oxygen (µmol/kg)                                                         234.4763
detected in the water column                                                 True
Name: 22, dtype: object
In [ ]:
df[df['Genome']=='TARA_SAMN05326642_METAG_ASE00032']
Out[ ]:
Genome # Biosynthetic Regions # Biosynthetic Products RiPPs RiPPs:Proteusins NRPS PKSI Other PKS Saccharides Terpenes ... station depth depth layer latitude longitude date size fraction temperature (°C) oxygen (µmol/kg) detected in the water column
34813 TARA_SAMN05326642_METAG_ASE00032 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True

1 rows × 46 columns

In [ ]:
df[(df['Mean Completeness']>=90) & (df['Mean Contamination']<5)]
Out[ ]:
Genome # Biosynthetic Regions # Biosynthetic Products RiPPs RiPPs:Proteusins NRPS PKSI Other PKS Saccharides Terpenes ... station depth depth layer latitude longitude date size fraction temperature (°C) oxygen (µmol/kg) detected in the water column
0 MARD_SAMN10141426_REFG_MMP10141426 43.0 76.0 14.0 0.0 22.0 11.0 10.0 0.0 10.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
1 MARD_SAMEA2272178_REFG_MMP2272178 37.0 40.0 8.0 0.0 10.0 3.0 6.0 1.0 6.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
3 MARD_SAMN04488110_REFG_MMP04488110 36.0 46.0 5.0 0.0 11.0 14.0 3.0 3.0 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
4 MARD_SAMN04571751_REFG_MMP04571751 34.0 48.0 10.0 0.0 13.0 4.0 6.0 0.0 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
5 MARD_SAMN03999393_REFG_MMP03999393 34.0 44.0 8.0 0.0 8.0 8.0 2.0 0.0 7.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34644 MARD_SAMN06006700_REFG_MMP06006700 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
34656 MARD_SAMN05216170_REFG_MMP05216170 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
34657 MARD_SAMN05216216_REFG_MMP05216216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True
34664 MARD_SAMN08998225_REFG_MMP08998225 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
34665 MARD_SAMN05418212_REFG_MMP05418212 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False

6450 rows × 46 columns

In [ ]:
df.sort_values('# Biosynthetic Products', ascending=False)
Out[ ]:
Genome # Biosynthetic Regions # Biosynthetic Products RiPPs RiPPs:Proteusins NRPS PKSI Other PKS Saccharides Terpenes ... station depth depth layer latitude longitude date size fraction temperature (°C) oxygen (µmol/kg) detected in the water column
0 MARD_SAMN10141426_REFG_MMP10141426 43.0 76.0 14.0 0.0 22.0 11.0 10.0 0.0 10.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
2 MARD_SAMD00040648_REFG_MMP00040648 37.0 53.0 2.0 0.0 25.0 8.0 3.0 0.0 6.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
4 MARD_SAMN04571751_REFG_MMP04571751 34.0 48.0 10.0 0.0 13.0 4.0 6.0 0.0 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
7 MARD_SAMN09475466_REFG_MMP09475466 32.0 47.0 11.0 0.0 9.0 9.0 5.0 1.0 6.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
3 MARD_SAMN04488110_REFG_MMP04488110 36.0 46.0 5.0 0.0 11.0 14.0 3.0 3.0 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34810 TARA_SAMN05326641_METAG_ANW00074 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True
34811 TARA_SAMN05326641_METAG_ANW00080 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True
34812 TARA_SAMN05326641_METAG_ANW00088 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True
34813 TARA_SAMN05326642_METAG_ASE00032 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True
34814 TARA_SAMN05326642_METAG_ASE00036 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN True

34815 rows × 46 columns

Exercise : How many MAGs have >15 Biosynthetic Products?

Exercise : How many MAGs have >15 Biosynthetic Products and completion >= 90 and contamination < 5?

In [ ]:
 

You can save a dataframe to a tsv/csv or excel file using:

In [ ]:
df.to_csv('./hands_on_files/test.csv')  # Comma-separated
df.to_csv('./hands_on_files/test.tsv', sep='\t')  # Tab-separated
df.to_excel('./hands_on_files/test.xlsx')  # Tab-separated

1.6 Plotting directly from dataframes¶

Finally, seaborn is a library that makes easy to interact with dataframes and plotting functions between pandas and matplotlib:

In [ ]:
import seaborn as sns
In [ ]:
sns.scatterplot(x='Mean Completeness', y='Mean Contamination', data=df)
Out[ ]:
<Axes: xlabel='Mean Completeness', ylabel='Mean Contamination'>
In [ ]:
sns.jointplot(x='Mean Completeness', y='Mean Contamination', data=df)
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x7f4943fc8c70>

Exercise : How do you interprete the previous plot?

In [ ]:
sns.violinplot(x='depth layer', y='GC Content', data=df)
Out[ ]:
<Axes: xlabel='depth layer', ylabel='GC Content'>
In [ ]:
sns.boxplot(x='depth layer', y='GC Content', hue='size fraction', data=df.head(10000), palette='viridis')
Out[ ]:
<Axes: xlabel='depth layer', ylabel='GC Content'>

You can explore other representations in the seaborn page:

  • https://seaborn.pydata.org
  • https://seaborn.pydata.org/examples/index.html

2. Building and exploring GEMs with CarveMe and CobraPy¶

2.1 Running carve me in our server¶

You need to load carveme first:

ml CarveMe

Then carveme can be run from the terminal by:

carve /input/faa/ -o /path/output/name.xml

Check the documentation of carveme here: https://carveme.readthedocs.io/en/latest/

Exercise : check the documentation an run carveme for the 3 faa files in hands_on_files/faa_files storing the output in hands_on/gem_files

Exercise : open and explore one of the output files, what does it include? is this human-redable?

In [ ]:
 

2.2 Exploring models with Cobra¶

Cobra is a huge library to perform multiple function on metabolic models.

Check the documentation for cobra.py here: https://cobrapy.readthedocs.io/en/latest/

In [ ]:
# The start of everything is reading the model with built-in function:

from cobra.io import read_sbml_model 

btheta = read_sbml_model('./hands_on_files/gem_files/btheta.fa.xml')

btheta
Out[ ]:
Name btheta_fa
Memory address 7fa1b7f91d90
Number of metabolites 1081
Number of reactions 1634
Number of genes 620
Number of groups 0
Objective expression 1.0*Growth - 1.0*Growth_reverse_699ae
Compartments cytosol, periplasm, extracellular space

Exercise : repeat the previous process assigning the models to ecoli and arect. How many metabolites, reactions and genes have each model?

In [ ]:
 

2.2.1 Accessing reactions¶

You can access reactions by index or by name:

In [ ]:
btheta.reactions[50]
Out[ ]:
Reaction identifierAACPS4
NameAcyl-[acyl-carrier-protein] synthetase (n-C16:1)
Memory address 0x7f77e9512a30
Stoichiometry

ACP_c + atp_c + hdcea_c --> amp_c + hdeACP_c + ppi_c

Acyl carrier protein + ATP C10H12N5O13P3 + Hexadecenoate (n-C16:1) --> AMP C10H12N5O7P + Cis-hexadec-9-enoyl-[acyl-carrier protein] (n-C16:1) + Diphosphate

GPRlcl_CP092641_1_prot_UML60641_1_3835
Lower bound0.0
Upper bound1000.0
In [ ]:
btheta.reactions.get_by_id('AACPS4')
Out[ ]:
Reaction identifierAACPS4
NameAcyl-[acyl-carrier-protein] synthetase (n-C16:1)
Memory address 0x7fa1b5890c40
Stoichiometry

ACP_c + atp_c + hdcea_c --> amp_c + hdeACP_c + ppi_c

Acyl carrier protein + ATP C10H12N5O13P3 + Hexadecenoate (n-C16:1) --> AMP C10H12N5O7P + Cis-hexadec-9-enoyl-[acyl-carrier protein] (n-C16:1) + Diphosphate

GPRlcl_CP092641_1_prot_UML60641_1_3835
Lower bound0.0
Upper bound1000.0

The information of the reaction can be explored as strings:

In [ ]:
acp = btheta.reactions.get_by_id('AACPS4')
print(acp.name)
print(acp.reaction)
Acyl-[acyl-carrier-protein] synthetase  (n-C16:1)
ACP_c + atp_c + hdcea_c --> amp_c + hdeACP_c + ppi_c

We can also ensure the reaction is mass balanced. This function will return elements which violate mass balance. If it comes back empty, then the reaction is mass balanced.

In [ ]:
acp.check_mass_balance()
Out[ ]:
{'charge': -2.0}

We can list all the reactions names in a model using comprehension lists:

In [ ]:
reactions = [i.id for i in btheta.reactions]
reactions[:10]   # Show the 10 first reactions
Out[ ]:
['12DGR140tipp',
 '12DGR160tipp',
 '12DGR161tipp',
 '12PPDRDH',
 '13PPDH2',
 '13PPDH2_1',
 '14GLUCANabcpp',
 '14GLUCANtexi',
 '24DECOAR',
 '2AGPE120tipp']

There is other information in the the reactions objects, to check the different methods we can apply to an object in python we can use the method dir():

In [ ]:
print(dir(btheta.reactions[0]))
['__add__', '__class__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__isub__', '__le__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__weakref__', '_annotation', '_associate_gene', '_check_bounds', '_dissociate_gene', '_genes', '_gpr', '_id', '_lower_bound', '_metabolites', '_model', '_repr_html_', '_set_id_with_model', '_update_awareness', '_upper_bound', 'add_metabolites', 'annotation', 'boundary', 'bounds', 'build_reaction_from_string', 'build_reaction_string', 'check_mass_balance', 'compartments', 'copy', 'delete', 'flux', 'flux_expression', 'forward_variable', 'functional', 'gene_name_reaction_rule', 'gene_reaction_rule', 'genes', 'get_coefficient', 'get_coefficients', 'get_compartments', 'gpr', 'id', 'knock_out', 'lower_bound', 'metabolites', 'model', 'name', 'notes', 'objective_coefficient', 'products', 'reactants', 'reaction', 'reduced_cost', 'remove_from_model', 'reverse_id', 'reverse_variable', 'reversibility', 'subsystem', 'subtract_metabolites', 'summary', 'update_genes_from_gpr', 'update_variable_bounds', 'upper_bound', 'x', 'y']
In [ ]:
# Useful attributes shared between this object and the following are:
print(btheta.reactions[50].id)
print(btheta.reactions[50].name)
print(btheta.reactions[50].compartments)
print(btheta.reactions[50].reaction)
AACPS4
Acyl-[acyl-carrier-protein] synthetase  (n-C16:1)
{'C_c'}
ACP_c + atp_c + hdcea_c --> amp_c + hdeACP_c + ppi_c

2.2.2 Accessing metabolites¶

In the same way, you can access metabolites:

In [ ]:
btheta.metabolites.get_by_id("atp_c")
Out[ ]:
Metabolite identifieratp_c
NameATP C10H12N5O13P3
Memory address 0x7f77e97c6d30
FormulaC10H12N5O13P3
CompartmentC_c
In 303 reaction(s) NAabc, NADS2, UAMAS, NDPK7, RIBabc1, CYTDK1_1, GMPS, RIBabcpp, NAabcO, NDPK8, UAMAGS, ANHMK, XYLBabc, GLYCK_1, PG180abcpp, XYLK, ACTNabc, GLUTRS, PG161abcpp, DTMPK, SULAabc, ASPK, RIBabc,...

We can print out the metabolite name, formula, compartment (cytosol in this case) directly as string, and charge.

In [ ]:
atp = btheta.metabolites.get_by_id("atp_c")

print(atp.name)
print(atp.formula)
print(atp.compartment)
print(atp.charge)
ATP C10H12N5O13P3
C10H12N5O13P3
C_c
-4

To get the reactions for a metabolite we can combine the previous methods:

In [ ]:
atp_reactions = btheta.metabolites.get_by_id("atp_c").reactions
In [ ]:
len(atp_reactions)
Out[ ]:
303

A metabolite like glucose 6-phosphate will participate in fewer reactions.

In [ ]:
len(btheta.metabolites.get_by_id("g6p_c").reactions)
Out[ ]:
9

Exercise : What happens when you do atp_reactions[1]?

In [ ]:
 

Exercise : correct the following code to print how many reactions use atp in each organism

In [ ]:
models = []  # Include here a list of models
len_reactions = []
for model in models:
    atp_reactions = model.metabolites.get_by_id("atp").reactions
    len_reactions.append(sum(atp_reactions))
In [ ]:
 
Out[ ]:
303

Exercise : complete the code below to compare the number of reactions using ATP in each model (calculated before). Who has more reactions using ATP?

In [ ]:
model_names = ['A', 'B', 'C']
plt.bar(x=model_names, height=)   # What should we put in height?
  Input In [65]
    plt.bar(x=model_names, height=)   # What should we put in height?
                                  ^
SyntaxError: invalid syntax

Alternatively, we can summarize the reactions with the .summary() method

In [ ]:
btheta.metabolites.atp_c.summary()
Out[ ]:

atp_c

C10H12N5O13P3

Producing Reactions

Percent Flux Reaction Definition
18.84% 1000 ATPS4rpp adp_c + 4.0 h_p + pi_c <=> atp_c + h2o_c + 3.0 h_c
18.84% 1000 GALKr atp_c + gal_c <=> adp_c + gal1p_c + h_c
15.92% 844.9 NDPK1 atp_c + gdp_c <=> adp_c + gtp_c
18.84% 1000 PGK 3pg_c + atp_c <=> 13dpg_c + adp_c
18.84% 1000 PYK adp_c + h_c + pep_c --> atp_c + pyr_c
8.58% 455.4 SR3 adp_c + h_c + isobutp_c --> atp_c + isobuta_c
0.16% 8.44 SUCOAS atp_c + coa_c + succ_c <=> adp_c + pi_c + succoa_c

Consuming Reactions

Percent Flux Reaction Definition
18.84% -1000 A6PAG atp_c + gal_c --> adp_c + dgal6p_c + h_c
0.45% -23.7 ADNK1 adn_c + atp_c --> adp_c + amp_c + h_c
0.11% -5.65 ALAALAr 2.0 ala__D_c + atp_c <=> adp_c + alaala_c + h_c + pi_c
0.00% -0.03254 APPAT atp_c + h_c + pan4p_c --> dpcoa_c + ppi_c
0.31% -16.71 ARGabc arg__L_e + atp_c + h2o_c --> adp_c + arg__L_c + h_c + pi_c
0.26% -13.62 ASNabc asn__L_e + atp_c + h2o_c --> adp_c + asn__L_c + h_c + pi_c
0.11% -5.65 ASPK asp__L_c + atp_c <=> 4pasp_c + adp_c
0.01% -0.2941 CA2abc atp_c + ca2_e + h2o_c --> adp_c + ca2_c + h_c + pi_c
1.17% -62.2 CDPMEK 4c2me_c + atp_c --> 2p4c2me_c + adp_c + h_c
1.34% -71.27 CYTK1 atp_c + cmp_c <=> adp_c + cdp_c
0.00% -0.04006 Cuabc atp_c + cu2_e + h2o_c --> adp_c + cu2_c + h_c + pi_c
0.03% -1.526 DGK1 atp_c + dgmp_c <=> adp_c + dgdp_c
0.03% -1.526 DGNSK atp_c + dgsn_c --> adp_c + dgmp_c + h_c
0.00% -0.03254 DPCOAK atp_c + dpcoa_c --> adp_c + coa_c + h_c
0.03% -1.478 DTMPK atp_c + dtmp_c <=> adp_c + dtdp_c
0.01% -0.4412 FE3abc atp_c + fe3_e + h2o_c --> adp_c + fe3_c + h_c + pi_c
0.00% -0.0126 FMNAT atp_c + fmn_c + h_c --> fad_c + ppi_c
4.99% -265 FTHFLi atp_c + for_c + thf_c --> 10fthf_c + adp_c + pi_c
0.28% -14.87 GLNabc atp_c + gln__L_e + h2o_c --> adp_c + gln__L_c + h_c + pi_c
57.60% -3058 Growth 0.000223 10fthf_c + 0.513689 ala__L_c + 0.000223 amet_c + 0.295792 arg__L_c + 0.241055 asn__L_c + 0.241055 asp__L_c + 54.124831 atp_c + 0.005205 ca2_c + 0.005205 cl_c + 0.000576 coa_c + 0.0001 cobalt2_c + 0.133508 ctp_c + 0.000709 cu2_c + 0.09158 cys__L_c + 0.026166 datp_c + 0.027017 dctp_c + 0.027017 dgtp_c + 0.026166 dttp_c + 0.000223 fad_c + 0.006715 fe2_c + 0.007808 fe3_c + 0.26316 gln__L_c + 0.26316 glu__L_c + 0.612638 gly_c + 0.215096 gtp_c + 48.601527 h2o_c + 0.094738 his__L_c + 0.290529 ile__L_c + 0.195193 k_c + 0.450531 leu__L_c + 0.343161 lys__L_c + 0.153686 met__L_c + 0.008675 mg2_c + 0.000223 mlthf_c + 0.000691 mn2_c + 0.0001 mql8_c + 0.001831 nad_c + 0.000447 nadp_c + 0.185265 phe__L_c + 0.221055 pro__L_c + 0.000223 pydx5p_c + 0.000223 ribflv_c + 0.215792 ser__L_c + 0.004338 so4_c + 0.000223 thf_c + 0.000223 thmpp_c + 0.253687 thr__L_c + 0.056843 trp__L_c + 0.137896 tyr__L_c + 0.1 uaagmda_c + 0.144104 utp_c + 0.423162 val__L_c + 0.000341 zn2_c --> 53.95 adp_c + 53.95 h_c + 53.945662 pi_c + 0.773903 ppi_c
0.00% -0.0126 HETZK 4mhetz_c + atp_c --> 4mpetz_c + adp_c + h_c
8.25% -437.8 HEX1 atp_c + glc__D_c --> adp_c + g6p_c + h_c
0.10% -5.353 HISabc atp_c + h2o_c + his__L_e --> adp_c + h_c + his__L_c + pi_c
0.48% -25.46 LEUabc atp_c + h2o_c + leu__L_e --> adp_c + h_c + leu__L_c + pi_c
0.37% -19.39 LYSabc atp_c + h2o_c + lys__L_e --> adp_c + h_c + lys__L_c + pi_c
0.00% -0.01825 METAT atp_c + h2o_c + met__L_c --> amet_c + pi_c + ppi_c
0.16% -8.696 METabc atp_c + h2o_c + met__L_e --> adp_c + h_c + met__L_c + pi_c
0.00% -0.02526 NADK atp_c + nad_c --> adp_c + h_c + nadp_c
0.00% -0.1287 NADS1 atp_c + dnad_c + nh4_c --> amp_c + h_c + nad_c + ppi_c
0.00% -0.1287 NNATr atp_c + h_c + nicrnt_c <=> dnad_c + ppi_c
0.00% -0.03254 PANTS ala_B_c + atp_c + pant__R_c --> amp_c + h_c + pnto__R_c + ppi_c
0.20% -10.47 PHEabc atp_c + h2o_c + phe__L_e --> adp_c + h_c + phe__L_c + pi_c
0.00% -0.0126 PMPK 4ampm_c + atp_c --> 2mahmp_c + adp_c
0.00% -0.03254 PNTK atp_c + pnto__R_c --> 4ppan_c + adp_c + h_c
0.23% -12.28 PRPPS atp_c + r5p_c <=> amp_c + h_c + prpp_c
0.00% -0.0126 RBFK atp_c + ribflv_c --> adp_c + fmn_c + h_c
3.52% -186.8 RBK_L1 atp_c + rbl__L_c --> adp_c + h_c + ru5p__L_c
0.27% -14.33 THRabc atp_c + h2o_c + thr__L_e --> adp_c + h_c + pi_c + thr__L_c
0.03% -1.478 TMDK1_1 atp_c + thymd_c --> adp_c + dtmp_c + h_c
0.00% -0.0126 TMPK atp_c + thmmp_c --> adp_c + thmpp_c
0.15% -7.791 TYRabc atp_c + h2o_c + tyr__L_e --> adp_c + h_c + pi_c + tyr__L_c
0.11% -5.65 UAAGDS 26dap__M_c + atp_c + uamag_c --> adp_c + h_c + pi_c + ugmd_c
0.11% -5.65 UAMAGS atp_c + glu__D_c + uama_c --> adp_c + h_c + pi_c + uamag_c
0.11% -5.65 UAMAS ala__L_c + atp_c + uamr_c --> adp_c + h_c + pi_c + uama_c
0.11% -5.65 UGMDDS alaala_c + atp_c + ugmd_c --> adp_c + h_c + pi_c + ugmda_c
0.26% -13.79 UMPK atp_c + ump_c <=> adp_c + udp_c

Exercise : check in the BiGG database http://bigg.ucsd.edu/ the symbol for nitrite and check if the three models have this as metabolite

In [ ]:
 

We can list all the metabolites in a model using comprehension lists:

In [ ]:
metabolites = [i.id for i in btheta.reactions]
metabolites[:10]   # Show the 10 first reactions
Out[ ]:
['12DGR140tipp',
 '12DGR160tipp',
 '12DGR161tipp',
 '12PPDRDH',
 '13PPDH2',
 '13PPDH2_1',
 '14GLUCANabcpp',
 '14GLUCANtexi',
 '24DECOAR',
 '2AGPE120tipp']

2.2.3 Accessing genes¶

The gene_reaction_rule is a boolean representation of the gene requirements for this reaction to be active.

We will consider the reaction glucose 6-phosphate isomerase, which interconverts glucose 6-phosphate and fructose 6-phosphate. The reaction id for this reaction in our test model is PGI.

In [ ]:
pgi = btheta.reactions.get_by_id("PGI")
pgi
Out[ ]:
Reaction identifierPGI
NameGlucose-6-phosphate isomerase
Memory address 0x7f77d330ac40
Stoichiometry

g6p_c <=> f6p_c

D-Glucose 6-phosphate <=> D-Fructose 6-phosphate

GPRlcl_CP092641_1_prot_UML61488_1_4728
Lower bound-1000.0
Upper bound1000.0
In [ ]:
pgi.check_mass_balance()   # This reaction is mass balanced
Out[ ]:
{}
In [ ]:
pgi.genes
Out[ ]:
frozenset({<Gene lcl_CP092641_1_prot_UML61488_1_4728 at 0x7f77d3b1e430>})
In [ ]:
pgi_gene = btheta.genes.get_by_id('lcl_CP092641_1_prot_UML61488_1_4728')

Each gene keeps track of the reactions it catalyzes

In [ ]:
pgi_gene.reactions
Out[ ]:
frozenset({<Reaction G6PI3 at 0x7f77d37082b0>,
           <Reaction G6PI at 0x7f77d36f8cd0>,
           <Reaction PGI1c at 0x7f77d330ae50>,
           <Reaction PGI at 0x7f77d330ac40>})

We can list all the genes in a model using comprehension lists:

In [ ]:
genes = [i.id for i in btheta.genes]
genes[:10]   # Show the 10 first reactions
Out[ ]:
['lcl_CP092641_1_prot_UML59985_1_3140',
 'lcl_CP092641_1_prot_UML61431_1_4671',
 'lcl_CP092641_1_prot_UML60641_1_3835',
 'lcl_CP092641_1_prot_UML60141_1_3306',
 'lcl_CP092641_1_prot_UML59397_1_2519',
 'lcl_CP092641_1_prot_UML60650_1_3844',
 'lcl_CP092641_1_prot_UML62416_1_957',
 'lcl_CP092641_1_prot_UML61752_1_257',
 'lcl_CP092641_1_prot_UML61881_1_400',
 'lcl_CP092641_1_prot_UML62526_1_1074']

2.2.4 Basic visualization¶

We can produce different visualizations, such as networks, from the produced metabolic models. Escher is an interactive tool to do this.

In the following code, you will save a model in JSON format (the default input for the library) and then you can interactively add reactions to a map using Edit > Add reaction mode > click on the reaction of interest. You can click in compound to link reactions using those metabolites:

In [ ]:
import escher
from escher import Builder
from cobra.io import save_json_model

# This requires JSON format 
btheta = read_sbml_model('./hands_on_files/gem_files/btheta.fa.xml')
save_json_model(btheta, "./hands_on_files/gem_files/test_solution.json")

builder = Builder(
    model_json= "./hands_on_files/gem_files/test_solution.json"
)
builder
Builder()

This library has very comprehensive and complex options including flux representations or even animations, check their web for more https://escher.readthedocs.io/en/latest/index.html . The common workflow requires the user to subset the reactions of interest, save the JSON model, and then plot it. This is an example from escher bundled maps:

In [ ]:
builder = Builder(
    map_name='e_coli_core.Core metabolism',
    model_name='e_coli_core',
)

builder
Downloading Map from https://escher.github.io/1-0-0/6/maps/Escherichia%20coli/e_coli_core.Core%20metabolism.json
Downloading Model from https://escher.github.io/1-0-0/6/models/Escherichia%20coli/e_coli_core.json
Builder()

2.3 Model optimization¶

Simulations using flux balance analysis can be solved using Model.optimize(). This will maximize or minimize (maximizing is the default) flux through the objective reactions.

In [ ]:
solution = ecoli.optimize()
solution.objective_value
Out[ ]:
65.78274069086645

Models solved using FBA can be further analyzed by using summary methods, which output printed text to give a quick representation of model behavior. Calling the summary method on the entire model displays information on the input and output behavior of the model (uptake and secretion), along with the optimized objective.

In [ ]:
summary = ecoli.summary()
summary
Out[ ]:

Objective

1.0 Growth = 65.78274069086643

Uptake

Metabolite Reaction Flux C-Number C-Flux
4abzglu_e EX_4abzglu_e 0.04401 12 0.00%
LalaDgluMdap_e EX_LalaDgluMdap_e 6.578 15 0.88%
adn_e EX_adn_e 13.45 10 1.20%
akg_e EX_akg_e 642.2 5 28.63%
arbt6p_e EX_arbt6p_e 0.04401 12 0.00%
arg__L_e EX_arg__L_e 19.46 6 1.04%
asn__L_e EX_asn__L_e 15.86 4 0.57%
ca2_e EX_ca2_e 0.3424 0 0.00%
cl_e EX_cl_e 0.3424 0 0.00%
cobalt2_e EX_cobalt2_e 0.006578 0 0.00%
cu2_e EX_cu2_e 0.04664 0 0.00%
cytd_e EX_cytd_e 20.04 9 1.61%
dcyt_e EX_dcyt_e 72.46 9 5.81%
fe2_e EX_fe2_e 0.4417 0 0.00%
fe3_e EX_fe3_e 0.5136 0 0.00%
g3ps_e EX_g3ps_e 14.23 6 0.76%
gam6p_e EX_gam6p_e 13.16 6 0.70%
gln__L_e EX_gln__L_e 17.31 5 0.77%
glu__L_e EX_glu__L_e 447.6 5 19.95%
gly_e EX_gly_e 40.27 2 0.72%
glyc3p_e EX_glyc3p_e 985.8 3 26.36%
gsn_e EX_gsn_e 16 10 1.43%
his__L_e EX_his__L_e 6.232 6 0.33%
ile__L_e EX_ile__L_e 19.11 6 1.02%
indole_e EX_indole_e 3.739 8 0.27%
k_e EX_k_e 12.84 0 0.00%
leu__L_e EX_leu__L_e 29.64 6 1.59%
lys__L_e EX_lys__L_e 22.57 6 1.21%
met__L_e EX_met__L_e 10.12 5 0.45%
mg2_e EX_mg2_e 0.5707 0 0.00%
minohp_e EX_minohp_e 6.208 6 0.33%
mn2_e EX_mn2_e 0.04546 0 0.00%
nmn_e EX_nmn_e 0.1499 11 0.01%
no2_e EX_no2_e 62.16 0 0.00%
no3_e EX_no3_e 411.3 0 0.00%
o2_e EX_o2_e 268.9 0 0.00%
phe__L_e EX_phe__L_e 12.19 9 0.98%
pnto__R_e EX_pnto__R_e 0.03789 9 0.00%
pro__L_e EX_pro__L_e 14.54 5 0.65%
s_e EX_s_e 6.062 0 0.00%
skm_e EX_skm_e 0.006578 7 0.00%
so4_e EX_so4_e 0.2854 0 0.00%
thm_e EX_thm_e 0.01467 12 0.00%
thr__L_e EX_thr__L_e 16.69 4 0.60%
thymd_e EX_thymd_e 1.721 10 0.15%
tyr__L_e EX_tyr__L_e 9.071 9 0.73%
val__L_e EX_val__L_e 27.84 5 1.24%
zn2_e EX_zn2_e 0.02243 0 0.00%

Secretion

Metabolite Reaction Flux C-Number C-Flux
ala__L_e EX_ala__L_e -851 3 30.36%
co2_e EX_co2_e -1000 1 11.89%
for_e EX_for_e -13.37 1 0.16%
fum_e EX_fum_e -72.49 4 3.45%
h2o_e EX_h2o_e -893.2 0 0.00%
hqn_e EX_hqn_e -0.04401 6 0.00%
hxan_e EX_hxan_e -0.006578 5 0.00%
inost_e EX_inost_e -6.208 6 0.44%
mal__L_e EX_mal__L_e -56.68 4 2.70%
mththf_e EX_mththf_e -0.006578 5 0.00%
nh4_e EX_nh4_e -62.16 0 0.00%
pi_e EX_pi_e -985.7 0 0.00%
succ_e EX_succ_e -1000 4 47.56%
ura_e EX_ura_e -72.46 4 3.45%

Secreted metabolites and those that are externally required can be accessed from these summaries:

In [ ]:
summary.secretion_flux
Out[ ]:
flux reaction metabolite
EX_12ppd__R_e 0.0 EX_12ppd__R_e 12ppd__R_e
EX_12ppd__S_e 0.0 EX_12ppd__S_e 12ppd__S_e
EX_14glucan_e 0.0 EX_14glucan_e 14glucan_e
EX_15dap_e 0.0 EX_15dap_e 15dap_e
EX_23camp_e 0.0 EX_23camp_e 23camp_e
... ... ... ...
sink_amob_c 0.0 sink_amob_c amob_c
sink_hemeO_c 0.0 sink_hemeO_c hemeO_c
sink_lipopb_c 0.0 sink_lipopb_c lipopb_c
sink_mobd_c 0.0 sink_mobd_c mobd_c
sink_sheme_c 0.0 sink_sheme_c sheme_c

302 rows × 3 columns

In [ ]:
summary.uptake_flux
Out[ ]:
(48, 3)

Exercise : Which organism requires more external compounds?

Hint: you can use .shape in a dataframe to retrieve (nr_rows, nr_columns)

In [ ]:
 

Exercise : List the external compounds required by E. coli

Hint: you can use access the dataframe[column_name]

In [ ]:
 

This is useful in adjusting metabolic models to simulate conditions such as deletions of genes or adding a specific compound to the media.

Usually, this would be performed editing the model, then optimizing, then comparing with the original model. Editing models is out of the scope of this BC but we can use carveme to run similar processes automatically without editing the models themselves.

2.3.1 Gap Filling:¶

2.3.2 Example of model optimization in specific media conditions¶

Here we use CarveMe to predict the uptake and secretion capabilities of an organism only from genetic evidence, and to produce a simulation-ready model without gap-filling for any particular media.

However, there are situations where you want to guarantee that the model is able to reproduce growth in one, or several, experimentally verified media.

For instance, you can ensure the model reproduces growth on M9 and LB media:

carve /input/faa/ -o /path/output/name.xml --gapfill M9

or short version:

carve /input/faa/ -o /path/output/name.xml -g M9

Note: CarveMe includes different media compositions but the user can specify custom media, check here https://carveme.readthedocs.io/en/latest/advanced.html#media-database to learn how to specify your own media compositions.

Run the following commands from the terminal:

carve hands_on_files/faa_files/btheta.fa -o hands_on_files/gem_media_files/btheta_m9lb.xml -g M9,LB
carve hands_on_files/faa_files/arect.fa -o hands_on_files/gem_media_files/arect_m9lb.xml -g M9,LB
carve hands_on_files/faa_files/ecoli.fa -o hands_on_files/gem_media_files/ecoli_m9lb.xml -g M9,LB

Alternatively you can run the following python code to run all models in a directory:

In [ ]:
import os   # To run commands in the terminal from python
import glob # Package to find files

os.system('ml CarveMe')     # Load carveme from module system

for fil in glob.glob('./hands_on_files/faa_files/*.fa'):
    cmd = f"carve {fil} -o ./hands_on_files/gem_media_files/{fil.split('/')[-1]}_m9lb.xml -g M9,LB"    # Compose the filename
    print(f'Running {fil}')
    os.system(cmd)   # This call the program
sh: 1: ml: not found
Running ./hands_on_files/faa_files/arect.fa
Running ./hands_on_files/faa_files/btheta.fa
Running ./hands_on_files/faa_files/ecoli.fa

Now we can explore the produce models to check which 'gaps' have been filled:

In [ ]:
# We can cross link lists using zip()
for original_model, gapfill_model in zip(glob.glob('./hands_on_files/gem_files/*.xml'), 
                                         glob.glob('./hands_on_files/gem_media_files/*.xml')):
    print(original_model, gapfill_model)
./hands_on_files/gem_files/arect.fa.xml ./hands_on_files/gem_media_files/arect_m9lb.xml
./hands_on_files/gem_files/btheta.fa.xml ./hands_on_files/gem_media_files/btheta_m9lb.xml
./hands_on_files/gem_files/ecoli.fa.xml ./hands_on_files/gem_media_files/ecoli_m9lb.xml

Exercise : Go together along the following code and understand what is every line doing

In [ ]:
for original_model, gapfill_model in zip(glob.glob('./hands_on_files/gem_files/*.xml'), 
                                         glob.glob('./hands_on_files/gem_media_files/*.xml')):
    organism = original_model.split('/')[-1].split('.')[0]
    
    # Load model 1 and 2
    m1 = read_sbml_model(original_model)
    m2 = read_sbml_model(gapfill_model)

    # Check for the reactions in each model
    r1 = set([i.reaction for i in m1.reactions])
    r2 = set([i.reaction for i in m2.reactions])

    # Check for the genes in each model
    g1 = set([i.name for i in m1.genes])
    g2 = set([i.name for i in m2.genes])

    print('\n------\n', organism, '\n------\n')
    print('Reactions filled:')
    print(r2.difference(r1))

    print('Genes added:')
    print(g2.difference(g1))
------
 arect 
------

Reactions filled:
{'akg_c + h_c + ichor_c <=> 2shchc_c + co2_c + pyr_c', '2shchc_c --> h2o_c + sucbz_c', 'atp_c + coa_c + sucbz_c --> amp_c + ppi_c + sbzcoa_c'}
Genes added:
set()

------
 btheta 
------

Reactions filled:
{'nadp_c + pphn_c --> 34hpp_c + co2_c + nadph_c', 'gam1p_c <=> gam6p_c', '2shchc_c --> h2o_c + sucbz_c', 'accoa_c + gam1p_c --> acgam1p_c + coa_c + h_c', 'atp_c + coa_c + sucbz_c --> amp_c + ppi_c + sbzcoa_c', 'orn_c --> nh4_c + pro__L_c'}
Genes added:
set()

------
 ecoli 
------

Reactions filled:
set()
Genes added:
set()

2.3.3 Example of model optimization simulating deletions¶

A commonly asked question when analyzing metabolic models is what will happen if a certain reaction was not allowed to have any flux at all. This can tested using cobrapy by:

In [ ]:
from cobra.io import load_model

ecoli_model = load_model("textbook")    # text book model bundled in cobra

print('complete model: ', ecoli_model.optimize())
with ecoli_model:
    ecoli_model.reactions.PFK.knock_out()
    print('pfk knocked out: ', ecoli_model.optimize())
    ecoli_model.reactions.PGK.knock_out()
    print('pfk knocked out: ', ecoli_model.optimize())
complete model:  <Solution 0.874 at 0x7f77be69eb50>
pfk knocked out:  <Solution 0.704 at 0x7f77be69ee50>
pfk knocked out:  <Solution 0.000 at 0x7f77be709fd0>

Exercise : How do you interpret the previous results? which conclusion can you extract?

2.3.3.1 Single deletions¶

We can also perform all single gene deletions on a model:

In [ ]:
from cobra.flux_analysis import single_gene_deletion     # Function to evaluate all deletions

deletion_results = single_gene_deletion(ecoli_model)
deletion_results
Out[ ]:
ids growth status
0 {b3736} 0.374230 optimal
1 {b3925} 0.873922 optimal
2 {b1101} 0.873922 optimal
3 {b1612} 0.873922 optimal
4 {b3962} 0.873922 optimal
... ... ... ...
132 {b0726} 0.858307 optimal
133 {b0767} 0.863813 optimal
134 {b2465} 0.873922 optimal
135 {b3733} 0.374230 optimal
136 {b3735} 0.374230 optimal

137 rows × 3 columns

2.3.3.2 Double deletions¶

Or by double deletions

In [ ]:
from cobra.flux_analysis import double_gene_deletion     # Function to evaluate double deletions

deletion_results = double_gene_deletion(ecoli_model)
deletion_results
Out[ ]:
ids growth status
0 {b3212, b0485} 0.873922 optimal
1 {b3612, b4154} 0.873922 optimal
2 {b0875, b2276} 0.211663 optimal
3 {b4232, b1276} 0.873922 optimal
4 {b2976, b2926} 0.000000 optimal
... ... ... ...
9448 {b1603, b2415} NaN infeasible
9449 {b2284, b0485} 0.211663 optimal
9450 {b1621, b2458} 0.873922 optimal
9451 {b1773, b0720} 0.000000 optimal
9452 {b2282, b2279} 0.211663 optimal

9453 rows × 3 columns

The output of these functions are pandas dataframes that can be parsed as described above.


3. Evaluation and version control of models using memote¶

memote goal is to achieve two major shifts in the metabolic model building community:

  1. Models should be version-controlled such that changes can be tracked and if necessary reverted. Ideally, they should be available through a public repository such as GitHub that will allow other researchers to inspect, share, and contribute to the model.
  2. Models should, for the benefit of the community and for research gain, live up to certain standards and minimal functionality.

The memote tool therefore performs four subfunctions:

  1. Create a skeleton git repository for the model.
  2. Run the current model through a test suite that represents the community standard.
  3. Generate an informative report which details the results of the test suite in a visually appealing manner.
  4. (Re-)compute test statistics for an existing version controlled history of a metabolic model.

And in order to make this process as easy as possible the generated repository can easily be integrated with continuous integration testing providers such as Travis CI, which means that anytime you push a model change to GitHub, the test suite will be run automatically and a report will be available for you to look at via GitHub pages for your repository.

From: https://memote.readthedocs.io/en/latest/index.html

A useful pipeline on how and why use memote in specific conditions can be found here: https://memote.readthedocs.io/en/latest/flowchart.html

3.1 Snapshot of a model¶

To benchmark the performance of a single model, run this command in your terminal:

ml memote # load the tool
memote report snapshot path/to/model.xml

This will generate the performance report as index.html.

The output filename can be changed by adding the following option. To illustrate here it is changed to report.html.

memote report snapshot --filename "report.html" path/to/model.xml

This will produce a html file with the information. Alternatively, the output can be directed to the terminal for quick inspection:

memote run path/to/model.xml

Exercise : Run and inspect a model. Discuss the content together.

3.2 Difference between models¶

To compare the performance of two (or more) models, run this command in your terminal:

memote report diff path/to/model1.xml path/to/model2.xml [path/to/model3.xml ...]

Exercise : Compare the original model of Agathobacter rectalis with the gap filled one.

3.3 History of a model¶

Final command involves tracking every change penformed in a model

memote report history

This is extra useful when curating models or exploring different phenotype. The purpose of this block course is not to curate models and cobra can be used for most of the requirements but it is good to know (and good practice) to use tools such as memote to keep track of the changes and quality of a model.


4. Defining metabolic interactions with SMETANA¶

We can combine different GEMs into a community model. Carveme allows to generate models from single models by:

merge_community ./hands_on_files/community/*.xml -o ./hands_on_files/community/community.xml

This generates an SBML file with a community where each organism is assigned to its own compartment and a common community biomass equation is also generated. You can import the merged model into any simulation tool, just as any normal constraint-based model and apply different types of simulation methods (FBA, FVA, etc…). You can initialize the community with a pre-defined medium (just like during single-species reconstruction):

Alternatively, SMETANA (https://smetana.readthedocs.io/en/latest/) can be used to predict interactions between models and provide metrics that can be used in direct comparisons. You can get these by running in the bash terminal:

# Load SMETANA 
ml SMETANA

# For the global (simplified) mode
smetana ./hands_on_files/gem_files/*.xml -o ./hands_on_files/community/community -g

# For the detailed mode
smetana ./hands_on_files/gem_files/*.xml -o ./hands_on_files/community/community -d

The produced files can be inspected with pandas.

4.1 Exploration of global dependencies¶

In [ ]:
import pandas as pd
df_global = pd.read_csv('./hands_on_files/community/community_global.tsv', sep='\t')
df_global
Out[ ]:
community medium size mip mro
0 all complete 3 2 0.619048

From previous lecture:

  • MRO (metabolic resource overlap): calculates how much the species compete for the same metabolites.
  • MIP (metabolic interaction potential): calculates how many metabolites the species can share to decrease their dependency on external resources.

Scenarios in which this mode can be useful include:

  • Testing different species in and out of the community to reduce the MRO (less competition for the same metabolites).
  • Testing different species in and out of the community to increase the MIP (more metabolites are being shared) to simplify media compositions.
  • Explore how dependency change in different media compositions

Here is another example (predicted in a community of 8 members):

In [ ]:
df_global = pd.read_csv('./hands_on_files/community_examples/example_global.tsv', sep='\t')
df_global
Out[ ]:
community medium size mip mro
0 all complete 8 8 0.600907

Exercise : what community compete less for the resources? what community is less dependent on media composition?

4.2 Studying interaction in SMETANA detailed mode¶

In detailed mode the following metrics are reported:

  • SCS (species coupling score): measures the dependency of one species in the presence of the others to survive
  • MUS (metabolite uptake score): measures how frequently a species needs to uptake a metabolite to survive
  • MPS (metabolite production score): measures the ability of a species to produce a metabolite
  • SMETANA: the individual smetana score is a combination of the 3 scores above, it gives a measure of certainty on a cross-feeding interaction (species A receives metabolite X from species B).
In [ ]:
df_detail = pd.read_csv('./hands_on_files/community/community_detailed.tsv', sep='\t')
df_detail
Out[ ]:
community medium receiver donor compound scs mus mps smetana
0 all minimal arect.fa ecoli.fa M_ala__L_e 1.0 0.01 1 0.01
1 all minimal arect.fa ecoli.fa M_arg__L_e 1.0 0.03 1 0.03
2 all minimal arect.fa ecoli.fa M_asp__L_e 1.0 0.36 1 0.36
3 all minimal arect.fa ecoli.fa M_cgly_e 1.0 0.01 1 0.01
4 all minimal arect.fa ecoli.fa M_cys__L_e 1.0 0.09 1 0.09
5 all minimal arect.fa ecoli.fa M_fe3_e 1.0 0.90 1 0.90
6 all minimal arect.fa ecoli.fa M_fe3pyovd_kt_e 1.0 0.10 1 0.10
7 all minimal arect.fa ecoli.fa M_g3pg_e 1.0 0.65 1 0.65
8 all minimal arect.fa ecoli.fa M_glc__D_e 1.0 0.99 1 0.99
9 all minimal arect.fa ecoli.fa M_glu__L_e 1.0 0.02 1 0.02
10 all minimal arect.fa ecoli.fa M_glyc3p_e 1.0 0.30 1 0.30
11 all minimal arect.fa ecoli.fa M_gua_e 1.0 0.02 1 0.02
12 all minimal arect.fa ecoli.fa M_h2s_e 1.0 0.02 1 0.02
13 all minimal arect.fa ecoli.fa M_h_e 1.0 0.03 1 0.03
14 all minimal arect.fa ecoli.fa M_ile__L_e 1.0 0.01 1 0.01
15 all minimal arect.fa ecoli.fa M_lys__L_e 1.0 0.04 1 0.04
16 all minimal arect.fa ecoli.fa M_nh4_e 1.0 0.06 1 0.06
17 all minimal arect.fa ecoli.fa M_orn_e 1.0 0.02 1 0.02
18 all minimal arect.fa ecoli.fa M_pi_e 1.0 0.05 1 0.05
19 all minimal arect.fa ecoli.fa M_thr__L_e 1.0 0.07 1 0.07
20 all minimal arect.fa ecoli.fa M_tyr__L_e 1.0 0.02 1 0.02
21 all minimal arect.fa ecoli.fa M_ura_e 1.0 0.02 1 0.02
22 all minimal btheta.fa ecoli.fa M_ala__D_e 1.0 0.01 1 0.01
23 all minimal btheta.fa ecoli.fa M_ala__L_e 1.0 0.01 1 0.01
24 all minimal btheta.fa ecoli.fa M_arg__L_e 1.0 0.02 1 0.02
25 all minimal btheta.fa ecoli.fa M_asp__L_e 1.0 0.01 1 0.01
26 all minimal btheta.fa ecoli.fa M_cgly_e 1.0 0.19 1 0.19
27 all minimal btheta.fa ecoli.fa M_cys__L_e 1.0 0.03 1 0.03
28 all minimal btheta.fa ecoli.fa M_fe3_e 1.0 0.60 1 0.60
29 all minimal btheta.fa ecoli.fa M_fe3pyovd_kt_e 1.0 0.40 1 0.40
30 all minimal btheta.fa ecoli.fa M_fum_e 1.0 0.03 1 0.03
31 all minimal btheta.fa ecoli.fa M_glc__D_e 1.0 0.03 1 0.03
32 all minimal btheta.fa ecoli.fa M_glu__L_e 1.0 0.02 1 0.02
33 all minimal btheta.fa ecoli.fa M_gly_e 1.0 0.02 1 0.02
34 all minimal btheta.fa ecoli.fa M_glyc3p_e 1.0 0.99 1 0.99
35 all minimal btheta.fa ecoli.fa M_gsn_e 1.0 0.02 1 0.02
36 all minimal btheta.fa ecoli.fa M_h_e 1.0 0.01 1 0.01
37 all minimal btheta.fa ecoli.fa M_ile__L_e 1.0 0.01 1 0.01
38 all minimal btheta.fa ecoli.fa M_leu__L_e 1.0 0.02 1 0.02
39 all minimal btheta.fa ecoli.fa M_mal__L_e 1.0 0.01 1 0.01
40 all minimal btheta.fa ecoli.fa M_met__L_e 1.0 0.01 1 0.01
41 all minimal btheta.fa ecoli.fa M_nh4_e 1.0 0.02 1 0.02
42 all minimal btheta.fa ecoli.fa M_pi_e 1.0 0.01 1 0.01
43 all minimal btheta.fa ecoli.fa M_pyovd_kt_e 1.0 0.01 1 0.01
44 all minimal btheta.fa ecoli.fa M_rib__D_e 1.0 0.02 1 0.02
45 all minimal btheta.fa ecoli.fa M_ser__L_e 1.0 0.01 1 0.01
46 all minimal btheta.fa ecoli.fa M_tyr__L_e 1.0 1.00 1 1.00
47 all minimal btheta.fa ecoli.fa M_val__L_e 1.0 0.01 1 0.01

Exercise : based on the 4 metrics reported, which conclusions can we extract from the community?

In [ ]:
 

4.3 Visualizing interactions¶

4.3.1 Networks¶

One common way to visualize meaningful information from the model is using a network. networkx is a powerful library to define custom network visualizations from a community model:

In [ ]:
import networkx as nx
import matplotlib.pyplot as plt

# Load the model 
df = pd.read_csv('./hands_on_files/community/community_detailed.tsv', sep='\t')

# Create an empty graph
G = nx.Graph()

# Add nodes and edges to the graph
for _, row in df.iterrows():
    receiver = row[2]
    donor = row[3]
    compound = row[4]
    smetana = row[8]
    G.add_edge(receiver, donor, compound=compound, smetana=smetana)

# Generate a layout for the graph
pos = nx.spring_layout(G)

# Create a colormap based on the receiver column
receivers = set(df['receiver']).union(set(df['donor']))
colors = plt.cm.rainbow(range(len(receivers)+1))
receiver_color_map = dict(zip(list(receivers)+['compound'], colors))

# Draw nodes and edges with labels and sizes
for (u, v, d) in G.edges(data=True):
    compound = d["compound"]
    smetana = d["smetana"]
    edge_size = smetana * 100  # Scale edge size

    nx.draw_networkx_edges(
        G,
        pos,
        edgelist=[(u, v)],
        width=edge_size,
        alpha=0.5,
        edge_color="gray",
    )

nx.draw_networkx_nodes(
    G,
    pos,
    node_color=[receiver_color_map[node] for node in G.nodes()],
    node_size=300,
    alpha=0.8,
)

nx.draw_networkx_labels(G, pos, font_size=8, font_color="black")

# Show the plot
plt.axis("off")
plt.show()

Exercise : what are your conclusions based on this network?

In [ ]:
 
In [ ]:
# Create an empty graph
G = nx.Graph()

# Add edges between receiver and donor nodes
for index, row in df.iterrows():
    receiver = row['receiver']
    donor = row['donor']
    smetana = row['smetana']*10
    compound = row['compound']
    
    G.add_node(compound)
    G.add_node(receiver)
    G.add_node(donor)

    G.add_edge(receiver, compound, weight=smetana)
    G.add_edge(compound, donor, weight=smetana)


# Create a layout for the graph
pos = nx.spring_layout(G)

# Draw nodes
nx.draw_networkx_nodes(G, pos, node_color='lightgray', node_size=20)

# Draw edges
edge_sizes = [data['weight'] for u, v, data in G.edges(data=True)]
nx.draw_networkx_edges(G, pos, width=edge_sizes, edge_color='gray')

# Add labels to nodes
nx.draw_networkx_labels(G, pos, font_size=8, font_color='black')

# Show the plot
plt.axis('off')
plt.show()

Exercise : go through the previous code and explain what are we plotting in the plot above

In [ ]:
import networkx as nx
import matplotlib.pyplot as plt

# Create an empty graph
G = nx.Graph()

# Add edges between receiver and donor nodes
for index, row in df.iterrows():
    receiver = row['receiver']
    donor = row['donor']
    smetana = row['smetana']*10
    compound = row['compound']
    
    if donor=='ecoli.fa':
        G.add_node(compound)
        G.add_node(donor)
        G.add_edge(compound, donor, weight=smetana)


# Draw nodes
pos = nx.random_layout(G)

nx.draw_networkx_nodes(G,  pos, node_color='lightgray', node_size=20)

# Draw edges
edge_sizes = [data['weight'] for u, v, data in G.edges(data=True)]
nx.draw_networkx_edges(G, pos, width=edge_sizes, edge_color='gray')

# Add labels to nodes
nx.draw_networkx_labels(G, pos, font_size=8, font_color='black')

# Show the plot
plt.axis('off')
plt.show()

Exercise : go through the previous code and explain what are we plotting in the plot above

Exercise : can you edit the following code to show the compounds received by btheta?

In [ ]:
import networkx as nx
import matplotlib.pyplot as plt

# Create an empty graph
G = nx.Graph()

# Add edges between receiver and donor nodes
for index, row in df.iterrows():
    receiver = row['receiver']
    donor = row['donor']
    smetana = row['smetana']*10
    compound = row['compound']
    
    if donor=='ecoli.fa':
        G.add_node(compound)
        G.add_node(donor)
        G.add_edge(compound, donor, weight=smetana)


# Draw nodes
pos = nx.random_layout(G)

nx.draw_networkx_nodes(G,  pos, node_color='lightgray', node_size=20)

# Draw edges
edge_sizes = [data['weight'] for u, v, data in G.edges(data=True)]
nx.draw_networkx_edges(G, pos, width=edge_sizes, edge_color='gray')

# Add labels to nodes
nx.draw_networkx_labels(G, pos, font_size=8, font_color='black')

# Show the plot
plt.axis('off')
plt.show()

4.3.2 Heatmaps¶

Heatmaps are also useful data representations, we can use seaborn to quickly plot the number of compounds shared between individuals in the community:

In [ ]:
import seaborn as sns
# Pivot the DataFrame to count the number of compounds shared between donors and receivers
heatmap_data = df.pivot_table(index='donor', columns='receiver', values='compound', aggfunc='count', fill_value=0)

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, cmap='YlGnBu', annot=True, fmt='d')
plt.title('Compound Sharing Heatmap')
plt.xlabel('Receiver')
plt.ylabel('Donor')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()

Exercise : What are the conclusion from the previous graph?

In [ ]:
# Pivot the DataFrame to count the number of compounds shared between donors and receivers
heatmap_data = df[df['donor']=='ecoli.fa'].pivot_table(index='compound', columns='receiver', values='smetana', aggfunc='count', fill_value=0)

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, cmap='YlGnBu', annot=True, fmt='d')
plt.title('Compound sharing between E. coli and the two receivers')
plt.xlabel('Receiver')
plt.ylabel('Compound')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()

Exercise :

  • Are there compounds given by E. coli taken by the other two members?
  • Are there any compound only taken by another member?

Exercise : Can you edit the previous code to represent btheta as receiver showing in the columns who is the donor?

In [ ]:
# Pivot the DataFrame to count the number of compounds shared between donors and receivers
heatmap_data = df[df['donor']=='ecoli.fa'].pivot_table(index='compound', columns='receiver', values='smetana', aggfunc='count', fill_value=0)

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, cmap='YlGnBu', annot=True, fmt='d')
plt.title('Compound sharing between E. coli and the two receivers')
plt.xlabel('Receiver')
plt.ylabel('Compound')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()

That was not very meaningful, right? Can you try loading the example in hands_on_files/community_examples/example_detailed.tsv and repeating the example?

In [ ]: