Sequence Data

There are three predominant forms of sequencing available to researchers today. The oldest is Sanger sequencing, which is highly accurate and produces reads of approximately 800bp; it is typically used to check constructs made by molecular biology techniques. The second oldest is next generation sequencing or NGS, dominated by Illumina, which is fairly accurate and provides millions of short reads from 50bp up to 300bp paired end. It’s used for everything high-throughput: genome sequencing, metagenomics, RNA-Seq and other more specific techniques. Finally, the most recent development is long read sequencing, mainly from Pacific Biosciences and Oxford Nanopore, which provides millions of long reads that are less accurate than short reads but typically 5-10kbp and in extreme cases over 1Mbp. Still in development, long reads are mostly used for genome sequencing, but can also be used in all the applications of short reads and, in the long run, potentially more.

Data formats

Raw

Fortunately, sequence data formats are relatively standardised, and you should familiarise yourself with them.

The raw data from Sanger sequencing is usually a .ab1 file, which can be viewed as a chromatogram showing the strength of the A, C, G and T signals at each position, and hence how confident each base call is. Similar raw formats exist for Illumina data and the two long read technologies, but you will probably never see them. This is because raw data goes through a process called basecalling, which converts the optical or electrical signals from the sequencing devices into bases.


FASTQ

Thus the most likely format in which to receive sequencing data is fastq, in which a read is defined by four lines:

  1. The header line begins with @ and is the read identifier, usually generated by the device

  2. The sequence line

  3. +

  4. The quality line, corresponding one-to-one with the sequence line

The quality scores are encoded such that the ASCII value of the character minus 33 provides the phred quality score of the base. For instance, F has an ASCII value of 70 and therefore corresponds to a quality score of 37. The quality score Q is determined by the probability of error P as follows:

\[Q = -10 \ \log_{10} P\]
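As a short worked example (the scores here are chosen purely for illustration), rearranging the relationship gives the error probability for a given score:

\[P = 10^{-Q/10}, \qquad Q = 20 \Rightarrow P = 0.01, \qquad Q = 30 \Rightarrow P = 0.001\]

In other words, every increase of 10 in the quality score corresponds to a tenfold drop in the probability that the base is wrong.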

An example read is shown below. Files containing fastq data will end in .fastq or .fq, possibly followed by .gz to indicate that they have been compressed with gzip.

@A00460:311:HJMCYDRXX:1:1101:20130:1000 1:N:0:ATTGGCTTCT+TGACAATGTC
CNCACCAGTCTGGCGCATGCTGCAAAATATCTTCGAGAGCCTCTTTTGATATGACAAAAACCGGAATATCCAGACCAAACTGTTCTTTTATCATCGTCTCA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFF:FFFFF,FFFFFFFFFFFFF,FFF,FFFFFFFFFFFFFFFFF
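To make the four-line layout and the quality encoding concrete, here is a minimal Python sketch that reads fastq records and decodes their quality strings. It assumes the simple four-lines-per-read layout and the offset of 33 described above; the function names and the five-base read are hypothetical, made up for illustration, and a real analysis would more likely use an established library.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, scores) for each four-line FASTQ record."""
    # Drop line endings and skip blank lines (for example at the end of a file).
    records = [line.rstrip("\n") for line in lines if line.strip()]
    if len(records) % 4 != 0:
        raise ValueError("expected a multiple of four non-blank lines")
    for i in range(0, len(records), 4):
        header, sequence, separator, quality = records[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record at non-blank line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence and quality lengths differ in {header}")
        # Offset-33 encoding: ASCII value of each character minus 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


def error_probability(score):
    """Invert Q = -10 log10(P) to recover the error probability P."""
    return 10 ** (-score / 10)


# Hypothetical five-base read, made up for illustration.
example = ["@read_1", "ACGTN", "+", "FFF:#"]
reads = list(parse_fastq(example))
for identifier, sequence, scores in reads:
    worst = max(error_probability(q) for q in scores)
    print(identifier, sequence, scores, f"worst error probability {worst:.2f}")
```

For a compressed file, the same function can be given a handle opened with gzip.open(path, "rt") from the Python standard library, since it only needs an iterable of text lines.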

FASTA

The standard format for final, processed sequences, nucleotide or protein, is fasta, in which a sequence is defined by two lines:

  1. The header line begins with > and is the sequence identifier, which could include all sorts of information

  2. The sequence line or lines - originally the format had a fixed character width and you will still frequently see sequences split across multiple lines

There is no information about the quality of the sequence in the fasta format, but it does allow for ambiguous bases according to the IUPAC system. An example file is shown below. Files containing fasta data will end in .fasta or .fa, whilst .fna should contain nucleotide sequence and .faa should contain amino acid sequence. Originally, fasta was designed to hold only one sequence per file, so you will sometimes see files containing many sequences referred to as multifasta.

>B21_02754
TTCCACTTAGATATTGTGCCTATGTGGCTTCCCGTGTCGTCATTCACCGGCTGCATGGATGAAGGCAATGCGCTCTGGTATAACTTAGCGCAACCGCCGTCAGTTGGCCTGGCGGCTCCCGTGGAGCGTTTGTTACAGCAGTTACGCACTGGCGCGCCGGTTTAG
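Because a sequence may be split across several lines and a file may hold many records, reading fasta takes slightly more care than reading fastq. The sketch below is one simple way to do it in Python; the function name and the two short records are hypothetical and exist only to illustrate the format.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs, joining sequences split across lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            if header is None:
                raise ValueError("sequence data found before any header line")
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical multifasta content: the first sequence is wrapped over two lines.
example = [
    ">seq_1 wrapped example",
    "ATGGCG",
    "TTAG",
    ">seq_2",
    "ATGNNRTAA",
]
sequences = dict(parse_fasta(example))
for name, sequence in sequences.items():
    print(name, len(sequence), sequence)
```

Everything after the > is kept as the header here; splitting it into an identifier and a description is left to the caller, since conventions for header content vary.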

GENBANK

NCBI Genbank format is a comprehensive format for both sequence data and accompanying annotations, typically in either .gbk or .gbff files. In brief, it allows you to define features of a sequence and properties of those features, for instance a feature could be a gene with properties such as its name and function. Below are the first few lines of an example.

LOCUS       CP021288               26585 bp    DNA     linear   BCT 05-SEP-2017
DEFINITION  Escherichia coli strain PA45B chromosome, complete genome.
ACCESSION   CP021288 REGION: 3324728..3351312
VERSION     CP021288.1  GI:1239663444
DBLINK      BioProject: PRJNA385892
            BioSample: SAMN06920354
KEYWORDS    .
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 26585)
  AUTHORS   Goh,K.G.K., Phan,M.-D., Forde,B.M., Ulett,G.C., Sweet,M.J.,
            Beatson,S.A. and Schembri,M.A.
  TITLE     Novel genes associated with capsule production in uropathogenic
            Escherichia coli
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 26585)
  AUTHORS   Goh,K.G.K., Phan,M.-D., Forde,B.M., Ulett,G.C., Sweet,M.J.,
            Beatson,S.A. and Schembri,M.A.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-MAY-2017) School of Chemistry and Molecular
            Biosciences, University of Queensland, Building 76 Cooper road,
            Brisbane, Queensland 4072, Australia
COMMENT     Bacteria and source DNA available from Mark Schembri, The
            University of Queensland, Australia.

            ##Genome-Assembly-Data-START##
            Assembly Method        :: HGAP v. 2.0
            Expected Final Version :: yes
            Genome Coverage        :: 120.0x
            Sequencing Technology  :: PacBio
            ##Genome-Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..26585
                     /organism="Escherichia coli"
                     /mol_type="genomic DNA"
                     /strain="PA45B"
                     /serotype="O2:K1:H7"
                     /host="Homo sapiens"
                     /db_xref="taxon:562"
                     /country="Australia: Brisbane"
                     /collection_date="2010"
     gene            <1..478
                     /gene="yeeW_3"
                     /locus_tag="PA45B_3249"
     CDS             <1..478
                     /gene="yeeW_3"
                     /locus_tag="PA45B_3249"
                     /note="CP4-44 prophage"
                     /codon_start=3
                     /transl_table=11
                     /product="yeeW_3"
                     /protein_id="ASW61198.1"
                     /db_xref="GI:1239666534"
                     /translation="MKLALTLEADSVNVQALNMGRIVVDVDGVNLSELINKVSENGYL
                     LRVVDKSDQHATSTPPPLTTLTCIRCSTAHITETDNAWLYSLSHQTNDDGESEWIHFT
                     GSGYLLRTDAWSYPVLRLKRLGLSKTFRCLVVTLTRRYGVSLIHLDASAECLPGLPTF
                     NW"
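The feature table has a fixed indentation: feature keys such as gene or CDS are indented by five spaces, while their qualifiers are indented further. As a rough illustration of that structure, the Python sketch below lists the feature keys and locations in a record. It is a deliberate simplification (it ignores qualifiers and locations that wrap onto a second line), the function name is hypothetical, and a dedicated parser would be a better choice for real work.

```python
def list_features(lines):
    """Return (key, location) pairs from the FEATURES table of a GenBank record."""
    features = []
    in_table = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if not in_table:
            continue
        if line and not line.startswith(" "):
            break  # an unindented line starts the next top-level section
        # Feature keys sit after exactly five spaces; qualifiers are indented more.
        if line.startswith(" " * 5) and len(line) > 5 and line[5] != " ":
            parts = line.split()
            location = parts[1] if len(parts) > 1 else ""
            features.append((parts[0], location))
    return features


# A few lines taken from the example record above, plus a closing section line.
example = [
    "KEYWORDS    .",
    "FEATURES             Location/Qualifiers",
    "     source          1..26585",
    '                     /organism="Escherichia coli"',
    "     gene            <1..478",
    '                     /gene="yeeW_3"',
    "     CDS             <1..478",
    '                     /gene="yeeW_3"',
    "ORIGIN",
]
features = list_features(example)
for key, location in features:
    print(key, location)
```

The < in a location such as <1..478 marks a feature that starts before the first base of the record, which is why the location is kept as text rather than converted to numbers.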

GFF

Another format that contains annotations, but without sequence, is the general feature format or gff, which consists of a tab-separated table with 9 columns:

  1. Sequence name

  2. Source of annotation

  3. Feature type (for example gene, CDS or region)

  4. Feature start position (1-based)

  5. Feature end position (1-based)

  6. Feature score (if available)

  7. Feature strand (+ or -)

  8. Phase of the coding sequence (0, 1 or 2)

  9. Attributes (which can contain almost anything, and whose conventions vary between related formats)

An example is shown below.

##sequence-region HE654724.1 1 93842
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=216597
HE654724.1      EMBL    region  1       93842   .       +       .       ID=HE654724.1:1..93842;Dbxref=taxon:216597;Is_circular=true;Name=pSLT_SL1344;gbkey=Src;genome=plasmid;mol_type=genomic DNA;plasmid-name
HE654724.1      EMBL    gene    21      584     .       -       .       ID=gene-SL1344_P1_0001;Name=finO;gbkey=Gene;gene=finO;gene_biotype=protein_coding;locus_tag=SL1344_P1_0001
HE654724.1      EMBL    CDS     21      584     .       -       0       ID=cds-CCF76709.1;Parent=gene-SL1344_P1_0001;Dbxref=EnsemblGenomes-Gn:SL1344_P1_0001,EnsemblGenomes-Tr:CCF76709,GOA:H8WUJ4,InterPro:IPR
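Since gff is a plain tab-separated table, a line can be split into its nine columns directly. The Python sketch below does this for a single line and assumes that the attributes column uses semicolon-separated key=value pairs, as in the example above; the function name is hypothetical, and the sample line is rebuilt with explicit tabs because tabs are easily lost when text is copied.

```python
def parse_gff_line(line):
    """Split one GFF line into a dictionary; return None for comment lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attributes = fields
    # Assumes key=value pairs separated by semicolons in the last column.
    parsed = {}
    for pair in attributes.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            parsed[key] = value
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": parsed,
    }


# The gene line from the example above, rebuilt with explicit tab characters.
gene_line = "\t".join([
    "HE654724.1", "EMBL", "gene", "21", "584", ".", "-", ".",
    "ID=gene-SL1344_P1_0001;Name=finO;gbkey=Gene;gene=finO;"
    "gene_biotype=protein_coding;locus_tag=SL1344_P1_0001",
])
record = parse_gff_line(gene_line)
# Both coordinates are inclusive, so the length is end - start + 1.
length = record["end"] - record["start"] + 1
print(record["attributes"]["Name"], record["strand"], length)
```

Score and phase are left as text because a . in those columns means that no value is available.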

Sequence Databases

You can find files in these formats everywhere that provides sequence data, but if you are looking for reference data or the sequencing results from an experiment, there are perhaps two major resource databases to consider: the NCBI (US) and the ENA (Europe).

The NCBI contains a number of different databases, including:

  • Assembly: for genome assemblies

  • GenBank: a collection of all publicly available DNA sequences

  • Genome: includes complete and partial genome sequences

  • RefSeq: a curated collection of genome, transcript and protein sequences for selected organisms

  • SRA: raw sequencing data storage

  • Taxonomy: the names and phylogenetic relationships of organisms

The ENA is aimed more at storing raw sequencing data and assemblies, with a specific focus on the associated metadata that describes the experimental workflow.

There are of course many more selective databases aimed at specific applications or organisms, for example:

  • UniProt: for protein sequences, structures and function

  • Pfam: for protein families and sequence alignments

  • Function and pathway databases such as KEGG and BioCyc

  • And many more

Exercises

  • Take a look at the files in /science/ecoli/ and you should be able to see the various formats discussed above (except for raw data)

  • Using the NCBI website, https://www.ncbi.nlm.nih.gov/ , find the two reference genomes available for Salmonella enterica

  • Download the associated files to Morgan, decompress them if required and take a look