Introduction to Python Programming, Data Analysis and Data Visualization¶
The goal of this tutorial is to provide a brief introduction to Python programming and basic data exploratory analyses that can be performed with diverse packages such as pandas, matplotlib, seaborn or plotly.
The index of this notebook includes:
Notes on notebooks usage
You are working in a jupyter notebook. These documents let you both write executable code, for example in python, and document it using markdown, in separate cells.
Some useful shortcuts include:
- Shift+Enter : execute cell
- Esc, b : add a cell below
- Esc, a : add a cell above
- Esc, m : transform cell to markdown
- Esc, d : delete cell
Copying this tutorial:¶
Before starting, run this command from your home to copy the material:
cp -r /nfs/nas22/fs2202/biol_micro_teaching/551-1119-00L-2024/s0506_python .
1. Python Introduction¶
Python is a programming language that has been under development for over 25 years.
This chapter will not cover everything in Python. If you would like to learn more, consider the following resources:
Getting Started with Python:
- https://www.codecademy.com/learn/python
- http://docs.python-guide.org/en/latest/intro/learning/
- https://learnpythonthehardway.org/book/
- https://www.codementor.io/learn-python-online
- https://websitesetup.org/python-cheat-sheet/
Learning Python in notebooks:
The official Python Reference is handy to always have available.
1.1 Statements¶
Python is an imperative language based on statements. That is, programs in Python consist of lines composed of statements. A statement can be:
- a single expression
- an assignment
- a function call
- a function definition
- a statement; statement
1.1.1 Expressions¶
- Numbers
- integers
- floating-point
- complex numbers
- strings
- boolean values
- lists and dicts
1.1.1.1 Numbers¶
1 # This is a comment: 1 is a number
2
-3
1
2
3.14
print(type(1), type(3.14)) # type() tells you the class of the object
1.1.1.2 Strings¶
'ap"ple'
"apple"
1.1.1.3 Boolean Values¶
True
False
1.1.1.4 Iterables¶
Python has very useful iterable data structures built into the language:
- lists: [] --> mutable list of items
- tuples: (item, ...) --> read-only data structure (immutable)
- sets: set([item, ...]) --> like a tuple but keeping only unique items
- dictionaries (hash tables): {} --> Dictionaries can relate a set of unique keys with a value (this can be a list, a number, etc.)
a = [1, 2, 3] # The equals sign assigns the list to the variable 'a'
a
a[::2] # Lists can be accessed using square brackets and the index (it is 0-based!)
a.append(10) # append adds a new element to a list
a
a[0] = 10 # We can edit values in the list
a # No need to use print when working in a notebook
b = (1, 2, 3, 3, 3) # This is a tuple
b
c = set([1,2,3,3,3])
c
Exercise : What happens when you try to change an element in a tuple or a set?
c[0] = 1
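Running the line above raises an error. A minimal sketch that catches it for both container types (the exact error messages vary between Python versions):

```python
b = (1, 2, 3)   # a tuple is immutable
c = {1, 2, 3}   # a set has no indexing at all

try:
    b[0] = 10   # tuples do not support item assignment
except TypeError as err:
    print('tuple:', err)

try:
    c[0] = 1    # sets are not even subscriptable
except TypeError as err:
    print('set:', err)
```

In both cases Python raises a TypeError rather than silently ignoring the assignment.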
1.1.4.1 Dictionaries¶
A dictionary is a container that holds pairs of objects - keys and values.
capitals = {'France': 'Paris', 'England': 'London', 'Canada': 'Toronto'}
capitals['Canada']
capitals['Canada'] = 'Ottawa' # What happens here?
complement = {'A': 'T',
              'C': 'G',
              'T': 'A',
              'G': 'C'}
complement['A']
seq = 'AATTGCAT'
seq[0:3]
len(seq)
reverse_seq = seq[::-1]
reverse_seq
for letter in reverse_seq:
    print(f'{letter} -> {complement[letter]}')
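The loop above prints the complement of each base one at a time; joining the characters gives the reverse complement as a single string. A sketch, redefining `seq` and `complement` so it runs standalone:

```python
complement = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}
seq = 'AATTGCAT'

# Reverse the sequence, look up each base, and join into one string
rev_comp = ''.join(complement[letter] for letter in seq[::-1])
print(rev_comp)  # ATGCAATT
```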
print(1)
print(2)
print(3)
1.1.2.2 Operators¶
There are two ways to call functions in Python:
- by pre-defined infix operator name
- by function name, followed by parentheses
Infix operator name:
1 + 2
Exercise : What are these operators doing?
# Other operators:
print(1 - 2)
print(1 * 2)
print(1 / 2)
print(4 % 2)
print(4 ** 2)
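One more operator worth knowing, not shown above, is floor division with `//`; a quick sketch of how it differs from true division `/`:

```python
print(7 / 2)    # true division: 3.5 (always a float)
print(7 // 2)   # floor division: 3
print(-7 // 2)  # -4: // floors toward negative infinity, it does not truncate
```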
# Variables can be updated in place
a = 1
a += 10
print(a)
b = 10
b *= 10
print(b)
Exercise : What are these functions doing?
# Other operations
a = [1,3,-3]
abs(a[2]), sum(a), len(a)
1.1.3 Special Values¶
None
1.1.4 Defining Functions¶
def plus(a_l, b):
    a = a_l[2]
    return a + b
plus([3, 4, 5], 4)
Exercise : What happens if you do not use return?
def plus(a, b):
    a + b
a = 1
plus(a, 4)
a
What happened? All functions return something, even if you don't specify it. If you don't specify a return value, then it will default to returning None.
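You can verify this by capturing the return value; a sketch using the same `plus` without a `return` statement:

```python
def plus(a, b):
    a + b  # the sum is computed, but never returned

result = plus(1, 4)
print(result)          # None
print(result is None)  # True
```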
"a" + 1
How to Read Python Error Messages
TypeError: can only concatenate str (not "int") to str
Above the error message is the "traceback", also called the "call stack". This is a representation of the sequence of procedure calls that led to the error. If a procedure call originated from code in a file, the filename is listed after the word "File" on each line. If it originated from a notebook cell, you will instead see something like "ipython-input-#-HEX".
1.2 Equality¶
1.2.1 ==¶
1 != 0
1.2.2 is¶
a = 1
[] is []
list() is list()
tuple() is tuple()
57663463467 == 57663463469
57663463467 != 57663463469
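The difference between `==` (same value) and `is` (same object in memory) shows up clearly with lists; a minimal sketch:

```python
a = [1, 2, 3]
b = [1, 2, 3]
c = a

print(a == b)  # True: same contents
print(a is b)  # False: two distinct list objects
print(a is c)  # True: two names for the same object
```

As a rule of thumb, use `==` to compare values and reserve `is` for identity checks such as `x is None`.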
1.3 Advanced Topics¶
The Zen of Python:
import this
import operator # import can be used to use installed packages or your own functions
operator.add(1, 2)
import operator as op # libraries can be imported to abbreviations
op.add(1,2)
from operator import add # specific libraries can be imported
add(1,10)
2.1 Scope of variables¶
The scope of a variable is not always obvious:
y = 0
for x in range(10):
    y = x
y
{x:x+10 for x in range(10, 20)}
x
2.2 Scope¶
Python follows the LEGB Rule (after https://www.amazon.com/dp/0596513984/):
- L, Local: Names assigned in any way within a function (def or lambda), and not declared global in that function.
- E, Enclosing function locals: Name in the local scope of any and all enclosing functions (def or lambda), from inner to outer.
- G, Global (module): Names assigned at the top-level of a module file, or declared global in a def within the file.
- B, Built-in (Python): Names preassigned in the built-in names module : open, range, SyntaxError,...
x = 3
def foo():
    x = 4
    def bar():
        print(x)  # Accesses x from foo's scope
    bar()  # Prints 4
    x = 5
    bar()  # Prints 5
foo()
a = 0
def test(a):
    a += 1
    return a
test(a)
2.3 Generators¶
def function():
    for i in range(10):
        yield i
[i for i in function()]
for y in function():
    print(f'current_number {y}')
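Generators produce values lazily; you can also pull values one at a time with `next()`, and a generator is exhausted after a full pass. A sketch:

```python
def function():
    for i in range(10):
        yield i

gen = function()
print(next(gen))  # 0
print(next(gen))  # 1
print(list(gen))  # [2, 3, 4, 5, 6, 7, 8, 9] -- the remaining values
print(list(gen))  # [] -- a generator can only be consumed once
```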
2.4 Default arguments¶
def do_something(a, b, c=3):
    return (a, b, c)
do_something(1, 2, 6)
def do_something_else(a=1, b=2, c=3):
    return (a, b, c)
do_something_else()
def some_function(start=[]):
    start.append(1)
    return start
result = some_function()
result
result.append(2)
other_result = some_function()
other_result
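The surprise above happens because the default list is created once, when the function is defined, and then shared across calls. The usual workaround is a None default; a sketch of this common idiom:

```python
def some_function(start=None):
    if start is None:  # create a fresh list on every call
        start = []
    start.append(1)
    return start

print(some_function())  # [1]
print(some_function())  # [1], not [1, 1]
```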
2.5 List comprehension¶
"List comprehension" is the idea of writing some code inside of a list that will generate a list. Using for
we can iterate on a list, string, etc.
Consider the following:
[x ** 2 for x in range(10)]
temp_list = []
for x in range(10):
    temp_list.append(x ** 2)
temp_list
But list comprehension is much more concise.
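Comprehensions can also filter with an `if` clause, which replaces a loop plus a condition in a single line; a sketch:

```python
# Squares of the even numbers only
evens_squared = [x ** 2 for x in range(10) if x % 2 == 0]
print(evens_squared)  # [0, 4, 16, 36, 64]
```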
for i in ['hello']:
    print(i)
2.6 Conditions¶
a = 3
b = 4
if a > b:
    print(f'{a} is greater than {b}')
elif a == b:
    print(f'{a} == {b}')
else:
    print(f'{b} is greater than {a}')
Exercise : write a function to calculate the GC content of the sequence
seq = 'ACGTATTGCTTAGGCTGAGGCTAGGAGAGGGGACCCCCTAGCTAGGATCGT'
Hint: there are two different ways to do this:
- iterating over the sequence and checking each character
- using the string method .count(X)
def gc1(seq):
    c = 0
    for i in seq:
        if i in ['C', 'G']:
            c += 1
    return 100 * c / len(seq)
def gc2(seq):
    c = 0
    for i in range(len(seq)):
        if seq[i] == 'C':
            c += 1
        elif seq[i] == 'G':
            c += 1
        else:
            pass
    return 100 * c / len(seq)
def gc3(seq):
    """
    Calculates the GC content of <seq>
    """
    return 100 * (seq.count('C') + seq.count('G')) / len(seq)
seq = 'ACGTATTGCTTAGGCTGAGGCTAGGAGAGGGGACCCCCTAGCTAGGATCGT'
print(gc1(seq), gc2(seq), gc3(seq))
2. Introduction to data analysis with pandas¶
Thanks Anna Sintsova for the example analysis used as reference in this section!
Libraries¶
Most of the power of a programming language is in its libraries.¶
- A library is a collection of files (called modules) that contains functions for use by other programs.
- May also contain data values (e.g., numerical constants) and other things.
- Library’s contents are supposed to be related, but there’s no way to enforce that.
- The Python standard library is an extensive suite of modules that comes with Python itself.
- Many additional libraries are available from PyPI (the Python Package Index).
A program must import a library module before using it.
- Use import to load a library module into a program's memory.
- Then refer to things from the module as module_name.thing_name.
- Python uses . to mean "part of".
- We will use numpy, one of the most important libraries for scientific computing.
- To use a function or a constant from a module, you always have to refer to that module's name.
import numpy
# Display value of pi
numpy.pi
# Calculate log2 of a number
# log2 is a function from numpy library. It takes one argument.
numpy.log2(8)
# Learn more about the function and it's arguments
?numpy.log2
# argument doesn't have to be a number, can be a list
numpy.log2([8, 16, 32])
Import specific items from a library¶
- Use from ... import ... to load only specific functions/items from a library.
- Then you can refer to them directly in your code.
from numpy import floor
# floor function returns floor of a number (largest integer i, such that i <= x)
floor([4.5, 3.2, 1.9])
Create an alias for a library¶
- Use import ... as ... to give a library an alias, then refer to items using that shortened name.
- Some popular libraries have well-established aliases that are in common use.
import numpy as np
# ceiling function returns ceiling of a number (smallest integer i, such that i >= x)
np.ceil([4.5, 3.2, 1.9])
Reading tabular data into DataFrames¶
- Data for this workshop comes from Cross-cohort gut microbiome associations with immune checkpoint inhibitor response in advanced melanoma
Abstract
The composition of the gut microbiome has been associated with clinical responses to immune checkpoint inhibitor (ICI) treatment, but there is limited consensus on the specific microbiome characteristics linked to the clinical benefits of ICIs. We performed shotgun metagenomic sequencing of stool samples collected before ICI initiation from five observational cohorts recruiting ICI-naive patients with advanced cutaneous melanoma (n = 165). Integrating the dataset with 147 metagenomic samples from previously published studies, we found that the gut microbiome has a relevant, but cohort-dependent, association with the response to ICIs. A machine learning analysis confirmed the link between the microbiome and overall response rates (ORRs) and progression-free survival (PFS) with ICIs but also revealed limited reproducibility of microbiome-based signatures across cohorts. Accordingly, a panel of species, including Bifidobacterium pseudocatenulatum, Roseburia spp. and Akkermansia muciniphila, associated with responders was identified, but no single species could be regarded as a fully consistent biomarker across studies. Overall, the role of the human gut microbiome in ICI response appears more complex than previously thought, extending beyond differing microbial species simply present or absent in responders and nonresponders. Future studies should adopt larger sample sizes and take into account the complex interplay of clinical factors with the gut microbiome over the treatment course.
# import and give an alias
import pandas as pd
%ls data
# Read in a csv file
data = pd.read_csv('data/2022-04-13.LeeKA_2022.colData.csv', index_col=0)
# Look at the top 5 rows of the data
# The columns in a dataframe are the observed variables, and the rows are the observations.
data.head()
# View the last five rows
data.tail()
# Check number of rows and columns
data.shape
# Find out more about a dataframe
data.info()
# Are all the indices (i.e. row names) unique?
data.index
data.index.unique()
len(data.index.unique()) == len(data.index)
data.index.nunique() == len(data.index)
# Use DataFrame.columns to access or change column names
data.columns
# Get summary statistics
# Only for numerical columns
data.describe()
Indexing, slicing, and subsetting¶
- We use square brackets [] to select a subset of a Python object. For example, we can select all data from a column named gender from the data DataFrame by name. There are two ways to do this:
data['gender']
data.gender
data['gender'].unique()
data['gender'].value_counts()
data['gender'].value_counts(normalize=True)
# Calculate what % of subjects belonged to each of the subcohorts (Hint, there is a column 'subcohort')
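One possible answer, sketched on a toy DataFrame so it runs standalone (on the real data you would replace `toy` with `data`, assuming the 'subcohort' column mentioned in the hint exists):

```python
import pandas as pd

# Toy stand-in for the real cohort table
toy = pd.DataFrame({'subcohort': ['A', 'A', 'B', 'B', 'B', 'C']})

# value_counts(normalize=True) gives fractions; multiply by 100 for %
pct = toy['subcohort'].value_counts(normalize=True) * 100
print(pct)
```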
# You can select more than one column
columns_to_show = ['subject_id', 'country', 'gender', 'age_category', 'BMI', 'PFS12']
data = data[columns_to_show]
data.head()
- Slicing using the [ ] operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, you use the following syntax: data[start:stop]. When slicing in pandas the start bound is included in the output, while the stop bound is one step BEYOND the last row you want to select. So if you want to select rows 0, 1 and 2 your code would look like this:
data[0:3]
data[-1:]
We can select specific ranges of our data in both the row and column directions using either label- or integer-based indexing:
- loc is primarily label-based indexing; integers may be used, but they are interpreted as labels.
- iloc is primarily integer-based indexing.
To select a subset of rows and columns from our DataFrame, we can use the iloc method. For example, we can select the first two rows and the first two columns like this:
data.iloc[0:2, 0:2]
data.loc[['BCN-01_1', 'BCN-02_1'], ['subject_id', 'country']]
Subsetting Data using Criteria¶
We can use the syntax below when querying data by criteria from a DataFrame. Experiment with selecting various subsets of the data.
- Equals: ==
- Not equals: !=
- Greater than, less than: > or <
- Greater than or equal to: >=
- Less than or equal to: <=
# Look at all the entries from Spain
data[data.country == 'ESP']
# Look at all the entries not from Spain
data[data.country != 'ESP']
# Look at all the female entries from Spain
data[(data.country == 'ESP') & (data.gender == 'female')]
data[(data.BMI < 26) & (data.BMI > 25)]
data[(data.BMI.between(25, 26))]
data[data.subject_id.str.contains('BCN')]
data[data['country'].isin(['GBR', 'ESP'])]
Calculating Statistics From Data In A Pandas¶
# You can calculate any summary statistics on the numeric columns with min(), max(), mean(), std(), count(), etc.
data['BMI'].min()
data['BMI'].median()
- But if we want to summarize by one or more variables, for example gender, we can use pandas' .groupby method. Once we've created a grouped DataFrame, we can quickly calculate summary statistics by a group of our choice.
data.groupby('gender').median(numeric_only=True)
# Calculate BMI for different genders for each country
data.groupby(['country', 'gender']).BMI.mean()
# Calculate the gender distribution for each country
data.groupby(['country', 'gender']).subject_id.count()
# For each country and gender, calculate % with PFS12
data.groupby(['country', 'gender']).PFS12.value_counts(normalize=True)
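The multi-indexed result of a grouped `value_counts` can be easier to read after `.unstack()`, which pivots the inner index level into columns. A sketch on a toy table so it runs standalone (on the real data the same call chain applies to `data`):

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['ESP', 'ESP', 'GBR', 'GBR'],
    'PFS12':   ['yes', 'no',  'yes', 'yes'],
})

# Fraction of each PFS12 value per country, as a country x PFS12 table
table = toy.groupby('country').PFS12.value_counts(normalize=True).unstack()
print(table)
```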
Missing values¶
data[data.PFS12.isnull()]
# Drop all observations with missing values
data.dropna()
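Dropping is not the only option: `fillna()` replaces missing values instead. A sketch on a toy column (the same method works on any column of `data`):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print(s.fillna(0))         # replace NaN with a constant
print(s.fillna(s.mean()))  # or with a summary statistic (here the mean, 2.0)
```

Which strategy is appropriate depends on the analysis; filling categorical columns like PFS12 with a placeholder keeps those rows available for other comparisons.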
Joining dataframes¶
rel_abundance = pd.read_csv("data/2022-04-13.LeeKA_2022.relative_abundance.SAMPLE.csv", index_col=0)
rel_abundance
df = data.merge(rel_abundance, left_index=True, right_index=True)
df.head()
df.groupby('gender').mean(numeric_only=True).T
df.to_csv("data/ploting_ws.csv")
3. Data visualization in Python¶
3.1 Basic graphs with matplotlib¶
import matplotlib.pyplot as plt
%matplotlib inline
To create a simple line plot, just give a list of y-values to the function plt.plot().
plt.plot([5, 8, 2, 6, 1, 8, 2, 3, 4, 5, 6])
plt.xlabel('X')
plt.ylabel('Y')
plt.title('A plot')
plt.boxplot([[1,2,3], [3,4,5]])
# plt.show() # To show the plot, not required in notebooks
plt.savefig('./data/boxplot.svg') # You can save in png, svg, pdf, etc
subdf = df.groupby('gender').mean(numeric_only=True).T.copy()
subdf
plt.boxplot([subdf['female'], subdf['male']], labels=['female', 'male'])
plt.show()
Read the documentation on matplotlib:
3.2 Interactive graphs with plotly¶
import plotly.express as px
df = pd.read_csv("data/ploting_ws.csv", index_col=0)
df.head()
Plotly express¶
Histogram¶
px.histogram(df, x='BMI', color='gender', nbins=100)
Scatter plot¶
fig = px.scatter(df, x="BMI", y="Dorea_longicatena")
fig.show()
fig = px.scatter(df, x="BMI", y="Dorea_longicatena", color='gender',
hover_data=['country', 'subject_id', 'BMI', 'gender', 'age_category'])
fig.show()
Box plot¶
# Is there difference in PFS12 across countries/genders/age categories
fig = px.box(df, x='country', y='BMI')
fig
# Is there difference in PFS12 across countries/gender/age categories
fig = px.box(df, x='country', y='BMI',
facet_col='gender',
facet_row='age_category',
color='PFS12',
width=800, height=600,
template='simple_white')
fig
Bar plots¶
# Pivoting dataframe for plotting
bar_df = df.melt(id_vars = ['subject_id', 'country', 'gender', 'age_category', 'BMI', 'PFS12'], value_name = 'RelAb',
var_name = 'species')
bar_df.head()
new_df = bar_df.groupby(['PFS12', 'species']).RelAb.mean().reset_index()
px.bar(new_df, x='PFS12', y= 'RelAb', color='PFS12', facet_col='species', facet_col_wrap=4, height=1000)
fig = px.bar(bar_df[bar_df.subject_id.isin(df.subject_id.unique()[0:5])], x="subject_id", y="RelAb", color="species",
template='plotly_white',
hover_data =['gender', 'age_category'], width=800, height=600,
)
fig.show()
Strip plot¶
px.strip(bar_df[bar_df.species == 'Akkermansia_muciniphila'], x='country', y='RelAb', color='PFS12',
facet_col='gender', log_y=True, template = 'plotly_white')
import pandas as pd
df = pd.read_csv('./data/genome_summary.csv.gz', compression='gzip', index_col=0)
md = pd.read_csv('./data/metadata.tsv', sep='\t', index_col=0)
df.columns
md.columns
The metadata is organized by sample; let's integrate that information into the genome table. The column biosample relates them:
# Assuming md and df are your dataframes and both contain a 'biosample' column
# We select only the necessary columns from md: 'biosample' plus the environmental variables
md_subset = md[['biosample', 'lat', 'lon', 'date', 'temp', 'depth_str', 'depth', 'log_depth', 'depth_layer', 'size_fraction']]
# Merge md_subset into df based on the 'biosample' column
df = df.merge(md_subset, on='biosample', how='left')
df
Basic inspection:
df['genus']
df.iloc[7]
df[df['genome']=='ZORZ22-1_SAMN30647033_MAG_00000149']
df[(df['completeness']>=90) &
(df['contamination']<5) &
(df['genome'].str.contains('TARA'))]
df.sort_values('completeness', ascending=False).head(100)
Exercise : How many MAGs have completeness > 90, contamination < 5 and gc_content > 40?
Exercise : How many MAGs from the previous subset are in each environment?
from collections import Counter
Counter(df['environment'])
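A possible approach for both exercises, sketched on a toy table so it runs standalone (on the real data you would replace `toy` with `df`; the column names follow the exercise text):

```python
import pandas as pd

toy = pd.DataFrame({
    'completeness':  [95, 92, 80, 99],
    'contamination': [1,  6,  2,  3],
    'gc_content':    [45, 50, 42, 38],
    'environment':   ['marine', 'marine', 'soil', 'marine'],
})

# Combine the three criteria with & (each condition in parentheses)
subset = toy[(toy['completeness'] > 90) &
             (toy['contamination'] < 5) &
             (toy['gc_content'] > 40)]
print(len(subset))                           # number of MAGs passing all filters
print(subset['environment'].value_counts())  # per-environment counts of that subset
```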
3.3.2. Plotting directly from dataframes¶
Finally, seaborn is a library that makes it easy to combine pandas dataframes with matplotlib plotting functions:
import seaborn as sns
sns.scatterplot(x='genome_size', y='completeness', data=df, alpha=0.2)
sns.jointplot(x='genome_size', y='completeness', data=df)
Exercise : How do you interpret the previous plot?
3.3.3. Basic analysis example¶
Is there a relation between the depth at which an organism is found and its GC content?
sns.boxplot(x='depth_layer', y='gc_content', data=df)
sns.violinplot(x='depth_layer', y='gc_content', data=df)
sns.stripplot(x='depth_layer', y='gc_content', data=df, alpha=0.1)
Exercise: Given the three previous plots, which one is the most informative?
You can also give a 'hue' to separate on a categorical variable:
sns.boxplot(x='depth_layer', y='gc_content', hue='domain', data=df.head(10000), palette='viridis')
What can you interpret from the previous plot?
3.3.4. Other visualizations¶
heatmap_data = df[df['environment']=='coral metagenome'][['completeness', 'contamination', 'genome_size']] # What is happening here?
Heatmaps are also useful data representations; we can use seaborn to quickly visualize the quality metrics of the selected genomes:
heatmap_data
import seaborn as sns
normalized_heatmap = (heatmap_data - heatmap_data.min()) / (heatmap_data.max() - heatmap_data.min()) # What is this line doing?
plt.figure(figsize=(10, 8))
sns.heatmap(normalized_heatmap, cmap='YlGnBu')
plt.show()
Exercise : What are the conclusions from the previous graph?
plt.figure(figsize=(10, 8))
sns.clustermap(normalized_heatmap, cmap='YlGnBu')
plt.show()
Exercise : What are the conclusions you can extract from the previous graph?
Exercise : Explore by yourself a question from the data given and produce the code to visualize one figure to explain the most out of your question
Finally, given our context, another useful representation is using maps:
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
# Direct URL to the shapefile
url = "https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip"
def plot_lat_lon_on_world_map(df):
    # Convert latitude and longitude in DataFrame to GeoDataFrame
    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['lon'], df['lat']))
    # Load world shapefile directly from URL
    world = gpd.read_file(url)
    # Plotting
    world.plot(figsize=(10, 10), edgecolor=None, color='lightgrey')
    gdf.plot(marker='o', color='#4884af', markersize=5, ax=plt.gca())
    # Labeling the axes and title
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.title('Latitude and Longitude Points on World Map')
    plt.show()
# Call the function with your DataFrame, 'md'
plot_lat_lon_on_world_map(md)
# Or modify it to plot more data as color, shape...
import matplotlib.colors as mcolors
def plot_lat_lon_with_temp(df):
    # Create a GeoDataFrame from the input DataFrame
    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['lon'], df['lat']))
    # Load the world map shapefile
    world = gpd.read_file(url)
    # Plot the world map
    fig, ax = plt.subplots(figsize=(10, 10))
    world.plot(ax=ax, edgecolor=None, color='lightgrey')
    # Plot points with color mapped to the temp column
    norm = mcolors.Normalize(vmin=df['temp'].min(), vmax=df['temp'].max())
    gdf.plot(
        ax=ax,
        marker='o',
        column='temp',    # Use the 'temp' column for color
        cmap='coolwarm',  # You can change the colormap if desired
        markersize=5,
        norm=norm,
        legend=True
    )
    # Set plot labels and title
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.title('Latitude and Longitude Points on World Map (Colored by Temp)')
    plt.show()
# Example usage:
plot_lat_lon_with_temp(md)
You can explore other representations in the seaborn page: