Using Seq and SeqIO objects#

Loading the package#

First of all, you have to load the packages. You load the required packages with standard python syntax:

from Bio.Seq import Seq
from Bio import SeqIO

There are two main components you are likely going to use, the Seq object from the Seq module (confusing, yes) and the SeqIO object.

Seq objects#

The Seq object combines a Python string with biological methods relevant to sequence-specific information. Thus by saving a string of letters as a Seq object (e.g. a DNA sequence), you can retrieve the complementary strand, the transcribed RNA sequence or the translated protein sequence. When using these methods, keep in mind that Biopython does not know whether a sequence object “AGCTAGGT” you created is a DNA sequence or a protein sequence rich in alanines, cysteines, guanines and threonines.

To declare a new Seq object is straightforward:

my_seq = Seq("AGCTTTTCATTCTGACTG")

In many ways, Seq objects behave like strings, with find and count methods:

# Find the first position of a particular subsequence
my_seq.find("ACT")
my_seq.find("AAA") # returns -1 if not found

# Count the number of occurrences of a particular subsequence
my_seq.count("A")
my_seq.count("TT") # only non-overlapping sequences are counted

They also have useful, sequence-specific methods:

# Complement
my_seq.complement() # complementary sequence

# Reverse complement
my_seq.reverse_complement() # complementary sequence with reversed order of bases

# Transcription and reverse transcription
my_rna = my_seq.transcribe() # transcription from DNA to RNA
my_dna = my_rna.back_transcribe() # reverse transcription from RNA to DNA

# Translation works on both DNA and RNA
my_rna.translate() # translation from RNA to protein
my_dna.translate() # like transcription from DNA to RNA and translation from RNA to protein

Sequences can also be concatenated (merged) like strings, thus multiple sequences can be added after each other to create one large sequence. Furthermore, sequences can be sliced (cut into pieces) like strings. By choosing an index range of bases you can extract shorter subsequences from one large sequence. For slicing sequences, remember that Python uses 0-based indexing, so whenever you want to generate a subsequence, your index count will start from 0 thus the first letter in a sequence will be my_seq[0] (not my_seq[1]).

# Add some made up sequence
my_newseq = Seq("ATG") + my_seq

# Get the first 10bp
my_subseq = my_seq[0:10]

# Get the last 10bp
my_subseq = my_seq[-10:]

More information about Seq objects can be found here.

Reading files with SeqIO#

SeqIO provides a simple interface to read and output sequence file formats (also multi-fasta files) in a uniform manner. SeqIO provides a function parse() that allows you to read a multi-fasta file by specifying the file name and file format. Thus parse() allows you to break down sequence files into their components by creating an iterator of SeqRecord objects (which you learn more about below).

Instead of using the file name, you can use a handle to specify which file you read. A handle is typically a file opened for reading, but it can also be an output of a previous command or data from the internet. The advantage of using a handle is guaranteeing that the file is closed correctly after reading.

More information can be found here at parse(), iterator and handle.

# As an iterator
records = SeqIO.parse("myfile.fasta", "fasta")

# Using a handle
with open("myfile.fasta") as handle:
    for record in SeqIO.parse(handle, "fasta")
        <do things>

Records read in by SeqIO are SeqRecord objects, which contain a seq variable that is a Seq object and additional information such as the record ID and description. Many of the methods for Seq objects work identically for SeqRecords. More information about SeqRecords can be found here.

Sometimes you don’t want to work through the records in file order, in which case you can use list() to convert the iterator to a python list, but be careful with very large files as this will put every record into memory at the same time. You can also convert the iterator to a dictionary with record IDs as keys using a provided function.

# As a list object
records = list(SeqIO.parse("myfile.fasta", "fasta"))

# As a dictionary
records = SeqIO.to_dict(SeqIO.parse("myfile.fasta", "fasta"))

Note that the SeqIO.parse examples above specify the file format as “fasta”. Many other formats are supported, but the correct format must be explicitly given as an argument, for instance fastq is “fastq” and GenBank is “genbank” or “gb”. Sadly, GFF format is not yet supported and requires an additional package or parsing it yourself. The full list of formats is available here.

Accessing feature information#

If you import a GenBank file with SeqIO, the Seq object will also contain information about the record’s features, stored as SeqFeature objects.

# Import a genbank file and inspect its features
records = list(SeqIO.parse("myfile.gbk", "gb"))
record = records[0]

# List of features
record.features

# Inspect a feature
print(record.features[1])
record.features[1].location
record.features[1].qualifiers

# Extract the sequence for the feature from the parent record
feature_seq = record.features[1].extract(record)

As features are a list, you can of course sort them using list comprehension by type, position, or similar. Note that when you slice a sequence to create a subsequence, only features that are contained completely within the subsequence are kept by it.

Writing files with SeqIO#

SeqIO can also be used to output records to file, in the supported format of your choice. Obviously if you convert file format you might lose information, for instance fastq to fasta, or genbank to fasta. Again, the file can be written using a handle if desired.

# Write to fasta
SeqIO.write(records, "myrecords.fasta", "fasta")

# Write to fasta with a handle
with open("myrecords.fasta", w) as handle:
    SeqIO.write(records, handle, "fasta")

Converting file formats#

If you use SeqIO to read in a file in one format, you can convert it by writing to another format. There are some things to note when doing this however:

  • If you output to a format that does not support features, such as fasta, then you lose that information

  • If you extract a feature sequence or slice a sequence, the new SeqRecord inherits the additional properties such as ID and description of the parent sequence

  • If you translate a SeqRecord from nucleotide to amino acid sequence, the additional record information such as ID and description are lost and replaced with awkward ‘<unknown x>’ strings

Exercise 3.4#

Exercise 3.4

  • Using SeqIO, read in the GenBank file located at

/nfs/teaching/551-0132-00L/1_Unix/genomes/bacteria/escherichia/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gbff
# Read in the file
from Bio import SeqIO

records = list(SeqIO.parse("/nfs/teaching/551-0132-00L/1_Unix/genomes/bacteria/escherichia/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gbff", 'gb'))
record = records[0]
  • What is the GC content (the percentage of bases that are G or C) of the genome?

# Calculate GC content
from Bio.Seq import Seq

gc = (record.seq.count('G') + record.seq.count('C')) / len(record)
gc # GC content is 50.8%
  • How many genes are there in the genome?

# Count genes
genes = [feature for feature in record.features if feature.type=='gene']
len(genes) # 4609 genes
  • Pick any gene and write the sequence out to a new fasta file

# Output a gene
my_gene = genes[0]
my_gene_seqrec = my_gene.extract(record)                                       # Note the difference between a feature and a sequence record containing features
my_gene_seqrec.id = my_gene.qualifiers['gene'][0]                              # Change the ID of the new sequence record
my_gene_seqrec.description = 'extracted from ' + my_gene_seqrec.description    # Change the description of the new record
SeqIO.write(my_gene_seqrec, 'my_gene.fna', 'fasta')
  • For the same gene, write the translated amino acid sequence out to another fasta file

# Output a translation
my_gene_trans = my_gene_seqrec.translate()
my_gene_trans                                               # See that the metadata is messed up
my_gene_trans.id = my_gene_seqrec.id                        # Copy the metadata from the original sequence record
my_gene_trans.description = my_gene_seqrec.description      # Ditto
SeqIO.write(my_gene_trans, 'my_gene.faa', 'fasta')

Note: More thorough explanations and additional material about Biopython can be found `here <http://biopython.org/DIST/docs/tutorial/Tutorial.html>`__. (it is worth a look)