BLAST on the command line#

The various BLAST programs are also available for download to run yourself. The BLAST+ suite is installed on the server Cousteau and Euler. There are several reasons you might want to run BLAST offline:

  • You may have a large number of searches you want to perform, which is clumsy via the web interface

  • You have sensitive data you want to search for or with that you shouldn’t submit online

  • You want to perform BLAST searches as part of a larger program or software pipeline

Making a BLAST database#

In order to run a search offline, you need a database to search. You can download one of the NCBI databases if you have the storage space, but hopefully your friendly local bioinformatician has already done that for you - and indeed we have made the RefSeq Prokaryotic Reference Genome database available here:

/nfs/teaching/databases/ncbi/ref_prok_rep_genomes

It’s also likely that you want to search a much smaller and specific set of sequences that you have already prepared in FASTA format. For this, there is the command makeblastdb (which is typically used as soon as you have more than 2 sequences to align):

# Make a nucleotide sequence database
makeblastdb -in dna_sequences_to_search.fasta -dbtype nucl

The -in argument should point to a fasta file that you want to create the database from, and the -dbtype must be either ‘nucl’ for nucleotide sequence or ‘prot’ for protein sequence. Other options allow you to use different types of input file, to mask sequences or parts of sequences, to index taxonomic information for the database and more. However you run it, makeblastdb produces some additional files if you check the directory containing the FASTA file that index the sequences ready for searching.

Running BLAST#

The four BLAST algorithms highlighted on the front page of the NCBI BLAST homepage have identically-named command line equivalents. The software suite is available via our module system, called BLAST+. We will demonstrate here with blastn, and some of the arguments vary slightly for the other algorithms, but the most important thing is to choose the correct one based on the nature of your query and database (as described above).

Firstly, you can perform pairwise alignment just as with the online interface by specifying a query (sequence of interest) and subject (for comparison) sequence, as in this minimal example:

# Load the software module
ml BLAST+

# Pairwise alignment with blastn
blastn -query /nfs/teaching/551-0132-00L/4_Alignment/pairwise1.fasta -subject /nfs/teaching/551-0132-00L/4_Alignment/pairwise2.fasta

Secondly, you can of course search by alignment by providing a database to look through, as in this minimal example:

# Run blastn
blastn -db /path/to/db -query sequence_to_look_for.fasta

You can recreate any of the options available via the online interface with the right set of command line options. A full listing will be shown by running blastn -h or blastn -help.

Without any additional options, the two examples above output directly to the command line. You can direct the output to a file with the -out filename option. The output is also quite extensive, and although it is human-readable, it isn’t easy to process for a computer. There are ways to modify the output with the option -outfmt n where n is a number, and perhaps the most useful is -outfmt 6, which produces a tab-separated table summarising the hits found.

# Run blastn for nice output
blastn -db /path/to/db -query sequence_to_look_for.fasta -out blastn_results.txt -outfmt 6

The output columns, in order, are:

1.   qseqid      query or source (e.g., gene) sequence id
2.   sseqid      subject  or target (e.g., reference genome) sequence id
3.   pident      percentage of identical matches
4.   length      alignment length (sequence overlap)
5.   mismatch    number of mismatches
6.   gapopen     number of gap openings
7.   qstart      start of alignment in query
8.   qend        end of alignment in query
9.   sstart      start of alignment in subject
10.  send        end of alignment in subject
11.  evalue      expect value
12.  bitscore    bit score

For more information on this output format, look here

If you are searching a very large database, blastn can take a very long time to run. There are a couple of options that you can use to improve speed:

# Use a faster but less sensitive algorithm
-task megablast

# Use more compute threads if available
-num_threads 32

Exercise 4.4#

Exercise 4.4

  • Make sure you are in your homefolder before executing the commands

# Go to your homefolder
cd
  • Run BLAST on the command line to determine what organism the sequence in the file /nfs/teaching/551-0132-00L/4_Alignment/mystery_sequence_02.fasta might originate from - be careful to choose the correct algorithm

# It is always good to have a quick look at the sequence you want to find out
less /nfs/teaching/551-0132-00L/4_Alignment/mystery_sequence_02.fasta

# First we have to load the software. If it is not already loaded
ml BLAST+

# Since we want to analyse a protein sequence with a nucleotide database, we have to use tblastn
tblastn -db /nfs/teaching/databases/ncbi/some_prok_rep_genomes -query /nfs/teaching/551-0132-00L/4_Alignment/mystery_sequence_02.fasta -out tblastn_results.txt -outfmt 6 -num_threads 8

# The sequence analysis reveals that the sequence is from the Thermus thermophilus bacterium
  • Run BLAST on the command line to determine the similarity between mystery_sequence_02.fasta and mystery_file_03.faa

# It is always good to have a quick look at the sequences you want to investigate
less /nfs/teaching/551-0132-00L/4_Alignment/mystery_sequence_02.fasta
less /nfs/teaching/551-0132-00L/4_Alignment/mystery_sequence_03.faa

# Since we want to align two protein sequences, we have to use blastp
blastp -query /nfs/teaching/551-0132-00L/4_Alignment/mystery_sequence_02.fasta -subject /nfs/teaching/551-0132-00L/4_Alignment/mystery_sequence_03.faa -out blastp_results.txt -outfmt 6

# The sequence analysis reveals that the sequences are 50% identical, thus there is some similarity but there are also a lot of differences in encoded amino acids
  • How could you have answered the previous question more computationally efficient? Hint: Remember that initially you only had the files mystery_sequence_01.fasta and mystery_sequence_02.fasta

# Since we want to compare a protein sequence (mystery_sequence_02.fasta) with a nucleotide sequence (mystery_sequence_01.fasta), we have to use tblastn
tblastn -query /nfs/teaching/551-0132-00L/4_Alignment/mystery_sequence_02.fasta -subject /nfs/teaching/551-0132-00L/4_Alignment/mystery_sequence_01.fasta -out tblastn_results_2.txt -outfmt 6

# The result is the same as for the previous question