Network bioinformatics II#
General information#
Main objective#
The goal of this exercise is to gain hands-on experience with a network-based integration approach. We picked one approach that allows to experiment and also understand some of the common limitations and issues. Please go through the document in a sequential way. The text guides you through the exercise providing additional context information.
Actions are highlighted in :hlblue:`BLUE`, and imply that you have to do something and generate an output.
Questions are highlighted in :hlred:`RED`, and imply that you have to think and formulate an answer (ideally in small groups). Exemplary answers and additional information are provided at the end of the document. Before referring to the answers, it is crucial that you make an active effort to come up with own answers.
Learning objectives#
Gain hands-on experience with a versatile tool for network-embedded analysis of omics data
Understand the relevance of the network as a tool
Understand the implications of input parameters and data filters
Prerequisites#
For this exercise, you need a browser connected to the web. A spreadsheet software (e.g. Excel or Tables) is helpful to navigate through results and reproduce some of the data. Download the accompanying data from Moodle here
Task#
The task today is to analyze the metabolic differences between two breast cancer cell lines: SK-BR-3 and Hs 578T. They differ in physiology and in the set of receptor they express on the surface: SK-BR3 overexpresses HER2 that can be treated with e.g. Herceptin, HS578T is a so-called triple negative line that lacks peculiar overexpression of hormone receptors (no HER2, no PR, no ER) and, thus, is more difficult to treat. We measured the expression of > 17’000 genes by RNA sequencing and the levels of ca. 700 metabolites by non-targeted metabolomics. We are interested in finding differences between the two cell lines.
Available data#
The full data is available in the DE_genes.xlsx and DE_metabolites.xlsx files. The chosen network integration tool reads only tab separated text files, i.e. .tsv files. Therefore, the worksheet has to be saved accordingly or copy-pasted in a text editor. To reproduce the majority of experiments described in the following, we prepared numerous tsv files which all derived from the full data.
Genes are identified by their symbol. Metabolites by the common HMDB1 identifier. The data is provided in the form of a differential analysis: for each measured feature (gene or metabolite), we calculated the fold-change (reported as log2) and the p-value from a two-tailed student t-test. Additional columns for Benjamini-Hochberg adjusted p-values, q-values, average absolute expression level, etc. are provided for completeness.
Canonical differential analysis#
Before we move on to the network analysis, it is interesting to have a look at the univariate analysis. By plotting p-values and fold-changes, we note that thousands of genes and hundreds of metabolites pass common significance criteria (e.g. adjusted p-value < 0.01, red line). This is an overwhelming amount of hits. With a network-centric approach, we hope to obtain an integrated view of the underlying metabolic differences.
GAM#
The network-analysis approach chosen is GAM. GAM stands for “Genes And Metabolites”, and is a method that was first presented by Jha et al (Immunity 2015). The method was developed by the group of Maxim Artyomov at the Washington University School of Medicine in St. Louis. GAM can be accessed using a browser at the address https://artyomovlab.wustl.edu/shiny/gam/. Computations are run “on the fly” and results are returned typically within a minute. They can be visualized in the browser or exported to the local drive in form of tables or pictures. GAM is a conceptually simple technique to analyze in parallel omics data for metabolites and enzymatic reactions (associated to genes). The core task is to identify the portions of the metabolic network that are mostly affected when comparing two states, e.g. a treated and an untreated set of samples. Hence,GAM always requires differential data for metabolites (eg from metabolomics) and for reactions (e.g.from transcriptomics, proteomics, or even allele information from genomics). Upon projecting the differential data to the nodes and edges of the network, GAM seeks for subgraphs enriched in changes.
Part 1 – Execute GAM analysis with default setting#
To demonstrate how the workflow works and that it is simple to be used, the first task is to run the analysis with default settings
:hlblue:`1.3 Hit “Run step 1, autogenerate FDRs and run step 2”`
1.4 After a few seconds, you should see the result in the lower panel. A network similar to the one shown below. If yes, congratulations! If not, you likely did something wrong in step 1.2.
Understanding the output#
The output of the GAM analysis is the “most regulated module” (in the lower part of the page). By default, edges stand for reactions (genes) and nodes their corresponding metabolites. You can navigate with the cursor on the network to see more information. Clicking on an object opens a new page with additional details on the information parsed. You can download the network as PDF. For the example above, you should obtain a file named DE_genes.re.mf=-1.4.rf=-6.6.ams=-13.4.pdf. Note that the name includes the TSV used for genes and the key parameters used for module extraction (more on this topic later). You can also download the network structure as XGMML, and open it in Cytoscape.
The so-called “module” is an extract of the whole network. In the example above, the whole network has 1929 nodes and 2004 edges. If you are interested in the whole network, it can be downloaded as XGMML and visualized in e.g. Cytoscape. If you want to overlay colors, you’ll need the Cytoscape GAM_VizMap.xml file that defines the style.
Compared to the thousands of significantly different genes and metabolites found by simple differential analysis, the “most regulated module” obtained by GAM is concise and visual. This is possible because all genes and metabolites are put in their molecular context (their enzymatic reactions), which allows (i) triangulating network regions that are enriched for changes and (ii) filter spurious and isolated changes which are unlikely to be of biological relevance.
Part 2 – The effect of the network size#
The metabolic network is prior information and used as a scaffold to project measurement data. In principle, networks can be generated from scratch at will. For simplicity, the GAM web service comes with four prepared networks (i.e. human, yeast, Arabidopsis, mouse). The networks were taken from KEGG database. This is a long-standing bioinformatics resource that includes genomic information for ~ 6’000 species (organisms). In parallel, KEGG offers catalogues of biochemical reactions, humanassembled pathways, compounds of biomedical relevance, disease-relevant mutations, etc. Metabolic networks in KEGG are inferred from the genome sequence, i.e. by providing putative annotation of genes based on sequence similarity with orthologue genes that encode for a protein with known function. In practice, it is possible to download a list of biochemical reactions and metabolites for any of the organisms listed in KEGG. A reaction in the model might be associated with 0, 1, or multiple genes.
Full KEGG models typically include thousands of reactions and metabolites, and tens of thousands of genes. To simplify the task and speed up computation and visualization, GAM recommends reducing the number of genes/reactions from the species-generic model and keep only data-relevant genes/reactions. How to define relevant genes? There are some obvious options:
Including genes that are expressed. Non-expressed genes are prone to noisy results and confound the analysis. As suggested in the paper (page W196 Adjusting the global network), only 10’000-15’000 genes are typically well-expressed in a cell. If we plot the average expression levels on the two cell lines in a scatter plot (e.g. column G and H in DE_genes.xlsx), we obtain two main clouds. The cloud closer to the origin of the axes contains low or nonexpressed genes. The cloud in the upper right corner includes the genes with large count numbers. Based on these plots or histogram plots, we set a cutoff of 5 on the expression level. In the column I of the DE_genes.xlsx, we indicate what genes pass the cutoff of 5 in at least one of the cell lines. Out of 18752, we found 13068 expressed genes, i.e. in line with the expectations.
Including genes that are significantly affected. These can be filtered on p-values or similar measures, or on the magnitude of the change (log2FC). In the data, we found 12007 genes with p-value < 0.05. 6912 genes have p-value < 0.05 and |log2FC| > 1 (i.e. two-fold).
Combining both filters: this leads to a list of 5694 genes.
What is the effect of including these filters on the size of the network and on the “most regulated module”?
:hlblue:`2.1 Hit “Reset all”, load file “DE_genes.tsv”, load file “DE_metabolites.tsv”`
:hlblue:`2.2 Verify that “Interpret reactions as” is set to “edges” and that “Use RPAIRS” is checked.` These options are chosen to get comparable and interpretable results. More details can be found in the paper
:hlblue:`2.3 Hit “Run step 1: autogenerate FDR ans run step 2”, Note the number of nodes and edges in the network summary. Note the FDRs for genes and metabolites. Save the network as a PDF.` The results for the first row are reported in the table below. You should obtain similar results.
:hlblue:`2.4 Repeat the steps 2.1-2.3 for the 2nd – 4th row with different transcriptomics data.` Note the results
As it can be observed from the results, filtering for significance has the largest strongest effect. The majority of significantly expressed genes is also expressed. This is expected given the large amount of differential genes found in the univariate differential analysis.
:hlred:`Question 2: Why is the FDR for genes changing, and then FDR for metabolites constant?`
Now focus on the modules, have a look at the extracted modules (PDFs).
Part 3 - The role of false discovery rates#
Next to the network, GAM uses additional parameters to govern extraction of the most regulated modules: the false discovery rate (FDR) for genes and metabolites. We want to test their relevance. (There is a third parameter named “Score for absent metabolites”, but in this exercise we will neglect itand keep the default value of -13.4 throughout. The same goes for removing the option ‘adding transedges’.
Note that to extract the most regulated modules, raw p-values for genes and metabolites are first transformed into scores and ranked (page W197 Scoring). FDRs are estimated by fitting p-value to a predefined distribution. Rising the log10(FDR) values relaxes the filter, leading to larger modules. A value of 0 virtually abolishes the filter.
Test by yourself the effect of FDR thresholding on the module…
:hlblue:`3.1 Hit “Reset all”, load file “DE_genes.tsv”, load file “DE_metabolites.tsv”`
:hlblue:`3.3 Hit “Run step 1: autogenerate FDR and run step 2”, Note the number of nodes and edges from the module summary.` The results for the first row are reported in the table below. You should obtain similar results.
s you can observe, it is simple to reduce or increase the size of the “most regulated” network by modifying the three parameters. GAM was tuned such that the autogenerated FDRs tend to isolate about 100-150 genes and metabolites (page W197 Selecting FDR). This is an interesting size in practical terms: it is large enough to include 4-5 different pathways for hypothesis generation but - at the same time - it remains a manageable size for visualization or for rational interpretation.
An independent example is provided by the first study in which GAM was used, i.e. the paper by Jha et al. The GAM approach (under the name CoMBI-T) was applied as above and lead to a “most regulated” module that entails about 100 genes and metabolites. Without going into the details of the study, it is quite obvious that this size allows to organize nodes and highlight a handful of “hotspots” to be investigated in follow-ups.
Apart of practical aspects, it is very important to realize that the autogenerated FDR or the size of ~100 edges/nodes are fully arbitrary: they are cherry-picked to highlight a manageable number of changes, but do not say anything about the magnitude or significance of such changes. This has several implications: comparing two cell lines that are indeed very similar or two that differ a lot (as SK-BR-3 and Hs 578T) will lead to similarly sized “most regulated” modules. Hence, module size is di per se not a metric for difference.
This limitation of the approach raises a key question:
:hlred:`Question 4: How to determine cutoffs (and thus modules) that are not arbitrarily set to achieve a predefined size, but build on absolute criteria and can be used to chart all changes of relevance?` This is crucial question with multiple answers. Any ideas? Take at least 5 minutes to think about options before moving on…
Hints are provided below in the question. Full answers and discussions can be found at the end. Similar principles can be applied to different types of omic analysis, so it is worth to spend time to understand the principles.
:hlred:`Question 7:` The estimation of FDR is entirely based on p-values and statistical assumptions (e.g. beta distribution). :hlred:`How could one use data randomization to infer the robustness of edges/nodes instead of using thresholds on scores?`
Part 4 - Interpretation of the results#
From the previous parts, we conclude that the module below best recapitulates the metabolic changes in gene expression between the two cell lines. (images and network are available in the folder Part4)
:hlred:`Question 8: What do you conclude from this “most regulated module”?`
:hlred:`Question 8:` The interpretation of the module is complicated by an erratic mix-up of red and green edges/nodes. Biologically, we would rather expect to so coherent changes that point to increase or decrease. :hlred:`How could GAM be run to retrieve coherent patterns?`
References#
Jha AK, Huang SC, Sergushichev A, Lampropoulou V, Ivanova Y, Loginicheva E, Chmielewski K, Stewart KM, Ashall J, Everts B, Pearce EJ, Driggers EM, Artyomov MN. Network integration of parallel metabolic and transcriptional data reveals metabolic modules that regulate macrophage polarization. Immunity. 2015 Mar 17;42(3):419-30. doi: 10.1016/j.immuni.2015.02.005. PMID: 25786174.
Sergushichev AA, Loboda AA, Jha AK, Vincent EE, Driggers EM, Jones RG, Pearce EJ, Artyomov MN. GAM: a web-service for integrated transcriptional and metabolic network analysis. Nucleic Acids Res. 2016 Jul 8;44(W1):W194-200. doi: 10.1093/nar/gkw266. PMID: 27098040
GAM Server: https://artyomovlab.wustl.edu/shiny/gam/
Solutions#
Question 1#
A reaction can also remain unlinked if the gene encoding for the enzyme is still unknown. This is oftentimes the case for newly or poorly characterized organisms. Even without knowing what the enzyme is, it is possible to directly prove that a reaction exists by testing the enzymatic activity in crude cellular extracts.
A reaction can be linked to multiple genes in the case that multiple copies of gene with similar enzymatic activity exist in the genome. Enzymes with identical catalytic activity are called “isoenzymes”.
Question 2#
:hlred:`Why is the FDR for genes changing, and then FDR for metabolites constant?`
FDRs are estimated from the entire distribution of p-values. Hence, they change as soon as the list of input genes with p-value change (e.g. by using different DE_genes_*.tsv files), even if the additional or removed p-values are non-significant. Whenever FDR are estimated, it is important to NOT provide only highly significant changes because the resulting FDR will misrepresent the data. Instead, may supposedly non-significant features should be included to provide a representative estimate of the full p-value distribution.
The FDR value for metabolites are constant because the same list (and p-values) where used throughout the tests.
Question 3#
Apart of some differences in the layout (e.g. orientation), all results are largely similar. The same genes and metabolites can be found across all modules. There are minor differences on how genes are connected, but >75% is in fact conserved. This is illustrated below for the two most extreme cases of longest (left) and smallest (right) list of genes. Note that the module extracted with the smallest list has a few more edges, indicating that there is no trivial correlation between the size of input data and size of output network.
This apparent “robustness” of results is due primarily to the use of autogenerated FDR. They are tuned based on the data to ensure that a network of similar size is produced (more below)
Should one better used the full dataset or smaller subsets? Given that the results are similar, it could be argued that smaller is fine. In general, however, it is better to always use the full dataset and avoid filtering data based on arbitrary decisions (such as cutoffs for significance or expression levels). As soon as filters are introduced, there is an inherent risk of biasing the solution. Any type of filter or cutoff should be justified and verified (more below). In the following, we will use the full list of genes DE_genes.tsv.
Question 4#
Multiple answers are provided below (Question 5 to 7). Question 5 ^^^^^^^^^^^^
A simple approach is to generate a network and module using solely edges and nodes that change substantially. As a reminder, the fold-change provided with the data is neglected by GAM, which uses solely the p-value. Fold-changes are overlaid on the resulting networks and modules in form of colors, but do not have any role in the generation of the modules. Hence, the strategy is to include only the genes and metabolites that pass a fold-change cutoff. It will results in a smaller network (after step 1) and, thus, a smaller module in step 2.
We can illustrate the effect of FC-thresholding with a gene list that only includes genes with |log2FC| > 3 (i.e. at least 8-fold, list available as DE_genes_very_large.tsv). This is an extreme filter, but suited for an example. In our data, only ~2600 genes out of 18000 pass the cut. The modules obtained from step 2 had only 30 edge/nodes without FDR filters (20 edge/nodes with autogenerated FDRs). This is much less than obtained with the full gene list without FC filtering (figures below).
Fold-change filtering has a similar effect than FDR filtering: it reduces the module size. The advantage of fold-change filtering over the autogenerated FDR filtering is that it is biologically more meaningful and, thus, intuitive. For metabolites, it is quite common to consider fold-changes that are as low as 30%. For gene expression, changes than don’t exceed two-fold are often disregarded because of little physiological significance. Importantly, fold-change threshold can be determined experimentally in every systems by multiple measurements of control samples.From a biological/biochemical perspective, replacing fold-change filtering can replace FDR based filtering allows to build network analysis on absolute criteria.
We illustrate the effect of replacing FDR-filters with fold-change filters with our test case. We used a mild filtering criteria on genes only, including only the 6912 genes with p-value < 0.05 and |log2FC| > 1 (two-fold, DE_genes_significant_large.tsv file). Without any additional filtering, GAM isolated a network with 129 nodes and 166 edges. To compare, only 42 nodes and 48 edges were isolated above with autogenerated FDR cutoffs and without fold-change filter.
The resulting fold-change filtered module generated by GAM contains hundreds of edges and nodes. Therefore, it is quite crowded and packed. Nevertheless, it summarizes better (then FDR-filtered networks) ALL significant network changes. Considering that the two cell lines are quite different, it likely reflects better the underlying biological differences.
To further explore the data, the module can be exported as XGMML network and imported in Cytoscape. A style file is also provided (Cytoscape VizMap).
Question 6#
Yes, prior knowledge can be used to optimize any threshold such that known changes or known nonchanges (in the sense that it is known that two reactions aren’t different across the tested groups) are recovered.
The principle is simple: perform the analysis with threshold values (e.g. the FDR or fold-change cutoffs) spanning the whole range of values. For each analysis, one has to quantify sensitivity-specificity by comparing whether known changes (true positives) and known non-changes (true negatives) are identified or not. Once the analyses have been performed for all thresholds, a so-called ROC or (better) precision-recall curve can be built to guide the selection of the threshold that offers the best performance. . For more information, read https://www.nature.com/articles/nmeth.3945.
The approach, however, hides recurring problems. First, truth must be known for a substantial number of reactions/metabolites. Hence, substantial prior information is necessary. Second, prior information is rarely comprehensive (otherwise we wouldn’t even use something like GAM). Therefore, it is risky to assume that changes that have not been described in the past are false results. Third, imbalanced class sizes (false >> true) are frequent but problematic.
Question 7#
Yes, permutation or data is often used to determine empirical p-values for almost any observation, i.e. the likelihood to observe the same with random data.
The principle is to randomize data by shuffling the names of metabolites, of genes, or both. A GAM analysis is performed with randomized data. The resulting module is stored. Randomization and GAM analysis is repeated thousands of times.
For any of the reactions or metabolites that are included in the “most regulated module” obtained with the correct (non-randomized) data and without FDR filters, one can count the frequency in the GAM results with randomized data (counting how many times the same reaction was found). The frequency corresponds to the p-value of the permutation test and can be used to filter only significant reactions/metabolites. An example are highly connected metabolites. Because of their numerous link they tend to show up in network-based analyses that don’t correct for connectivity Question 8 ^^^^^^^^^^
:hlred:`What do you conclude from this “most regulated module”?`
The interpretation of the module is still driven by human observation. Based on our judgement and knowledge of genes/metabolites, we highlight some metabolic hotspots of interest.
First, there is a coherent cluster of lipids changes. Some key cholines and ethanolamines are much higher in SK-BR-3 (red), whereas sphingosines and ceramides are higher in Hs578T (green). The differences in metabolites are associated to a general but opposed change in enzyme levels (PLD1, PLA2G10, CYP4B1, ALOX15, GALC). Note of caution: these are multifunctional enzymes (i.e. they catalyze multiple reactions) and, thus, their appearance in the module might be caused by their overrepresentation
A second hotspot is about sugars: the hexose-phosphates in upper glycolysis are altered (PFKFB2, HK1). More importantly, there is a change in enzymes of fructose metabolism (AKR1B1, SORD, and KHK). This is a very interesting pathway that in cancer and pathophysiological states (e.g. heart hypertrophy) allows to bypass glycolysis and canonical regulation of sugar metabolism. Perhaps, the two cell lines differ in the preference for glucose vs fructose as a carbon source.
Third, there is a slight by generic difference in serine metabolism, both on metabolite and in the expression of key enzymes (PSAT1, PSP). This is a very important pathway that fuels the entire cell with C1-groups for biosynthesis. It has been associated oftentimes to cancer growth.
Fourth, many changes are concentrated around the TCA cycle. Most metabolites (succinate, fumarate, malate, oxaloacetate, citrate, aconitate, …) are altered. In addition, several of the involved enzymes are also affected. Collectively, these results suggest that the metabolic activity of the TCA cycle or mitochondria is different. . Unfortunately, it remains almost prohibitive to speculate how the two cell lines differ because the direction of the changes are not coherent (red and green edges/nodes are mixed). Follow ups are needed to clarify the nature of the differences.
Question 9#
:hlred:`How could GAM be run to retrieve coherent patterns?`
A simple approach is to run the analysis with the subset of genes (metabolites) that are either higher in SK-BR-3 (positive sign) or lower. We generate two specific lists for genes and obtain the following results. As it can be appreciated, the result now focus on modules that are coherent in sign.