Searching the NCBI#

The NCBI’s primary text search and retrieval system, Entrez, comprises 39 molecular and literature databases and is usually accessed via the search bar (red frame in Figure 1). Nearly all search boxes on NCBI pages access the Entrez system.

Because Entrez searches a large number of databases and accepts almost any input (single words, short phrases, sentences, database identifiers, gene symbols, names, etc.), even simple searches can return an overwhelming number of results. It is therefore useful to know a few tricks that make searching more efficient.

Boolean Operators: You should be familiar with Boolean Operators from Statistics. They can be used in Entrez to make your search more specific:
- AND: Finds documents that contain terms on both sides of the operator, the intersection of both searches.
- OR: Finds documents that contain either term, the union of both searches.
- NOT: Finds documents that contain the term on the left but not the term on the right of the operator, the subtraction of the right hand search from the one on the left.
Please note that these Boolean operators must be written in uppercase to work and are processed from left to right. Use parentheses to group terms that should be evaluated together.

Phrases: Individual search terms separated by a space are joined as if an AND were placed between them, unless the words match a phrase indexed by the database, in which case the phrase is searched for as written. To force a phrase search, put the words in quotation marks “like this”. You can also place * at the end of a word stem as a wildcard that matches any number of trailing characters.

Indexed Fields: Each database indexes the metadata of its entries in a number of fields to improve and speed up searching. A field can be searched specifically by putting its name in square brackets immediately after a search term. For instance, entries in Nucleotide are associated with an Organism and a Publication Date (among many other fields), which you can search like so:

"Escherichia coli"[Organism] AND 2020/1/1[Publication Date]
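The query syntax above can also be assembled piece by piece. The following is a minimal bash sketch, assuming two helper functions, `entrez_term` and `entrez_join`, that are hypothetical and not part of NCBI or any package used in this course. It only builds a query string that you can paste into the search bar; it does not contact NCBI.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for composing Entrez query strings (illustration only).

# entrez_term VALUE FIELD  ->  VALUE[FIELD]
# Values containing a space are wrapped in straight quotes to force a phrase search.
entrez_term() {
    local value="$1" field="$2"
    case "$value" in
        *" "*) value="\"$value\"" ;;
    esac
    printf '%s[%s]' "$value" "$field"
}

# entrez_join OPERATOR TERM...  ->  (TERM OPERATOR TERM ...)
# The parentheses make the grouping explicit instead of relying on
# left-to-right processing.
entrez_join() {
    local op="$1"; shift
    local out="$1"; shift
    local term
    for term in "$@"; do
        out="$out $op $term"
    done
    printf '(%s)' "$out"
}

organism=$(entrez_term "Escherichia coli" "Organism")
pubdate=$(entrez_term "2020/1/1" "Publication Date")
query=$(entrez_join AND "$organism" "$pubdate")

echo "$query"
# prints: ("Escherichia coli"[Organism] AND 2020/1/1[Publication Date])
```

Nesting calls to `entrez_join` gives nested parentheses, which is one way to combine OR and AND without depending on the order in which the operators are processed.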

If you want to know more about Entrez click here.

Accessing sequencing data with the package Bio#

For this part you have to switch back once again to the terminal (Unix) as described here.

We have installed a useful package for the terminal called bio that makes the process of getting hold of sequence data much easier. You can load it as follows:

ml Bio
bio

Exercise 3.3#

Using NCBI search tools, find the genome record for Escherichia coli K12 MG1655.

# Genome record E.coli K12
        # We start at the NCBI homepage (https://www.ncbi.nlm.nih.gov)
        # Change the database to Genome (https://www.ncbi.nlm.nih.gov/genome/)
        # Search for Escherichia coli K12 MG1655 (Escherichia coli K12 works too)
        # An overview of Escherichia coli appears
        # Scroll down to representatives
        # Click on the identifier (ASM...) of the first reference genome entry
        # Scroll down and click on the number under RefSeq
        # The genome record appears. The K12 genome has the accession number NC_000913.3

Using NCBI’s genome database (https://www.ncbi.nlm.nih.gov/genome/), find the prokaryotic RefSeq reference genomes whose assembly is marked ‘Complete’ (there should be 15).

# Complete prokaryotic genomes
        # We start at the NCBI genome page (https://www.ncbi.nlm.nih.gov/genome/)
        # Select Browse by Organism
        # Select prokaryotes
        # Open the Filter and, under RefSeq category, select reference. The 15 genomes should now be listed
        # The 15 genomes are:

        Acinetobacter pittii PHEA-2                                         GCA_000191145.1
        Bacillus subtilis subsp. subtilis str. 168                          GCA_000009045.1
        Campylobacter jejuni subsp. jejuni NCTC 11168 = ATCC 700819         GCA_000009085.1
        Caulobacter vibrioides NA1000                                       GCA_000022005.1
        Chlamydia trachomatis D/UW-3/CX                                     GCA_000008725.1
        Coxiella burnetii RSA 493                                           GCA_000007765.2
        Escherichia coli O157:H7 str. Sakai                                 GCA_000008865.2
        Escherichia coli str. K-12 substr. MG1655                           GCA_000005845.2
        Klebsiella pneumoniae subsp. pneumoniae HS11286                     GCA_000240185.2
        Listeria monocytogenes EGD-e                                        GCA_000196035.1
        Mycobacterium tuberculosis H37Rv                                    GCA_000195955.2
        Pseudomonas aeruginosa PAO1                                         GCA_000006765.1
        Salmonella enterica subsp. enterica serovar Typhimurium str. LT2    GCA_000006945.2
        Shigella flexneri 2a str. 301                                       GCA_000006925.2
        Staphylococcus aureus subsp. aureus NCTC 8325                       GCA_000013425.1

Using NCBI’s taxonomy database (https://www.ncbi.nlm.nih.gov/taxonomy/), explore the record with the taxonomy ID 28901. At which phylogenetic level is this taxonomy ID assigned? Are there taxonomy IDs at other phylogenetic levels?

# Explore taxonomy id 28901
        # We start at the NCBI taxonomy page (https://www.ncbi.nlm.nih.gov/taxonomy/)
        # Search for the taxonomy 28901
        # A species-level record of Salmonella enterica appears
        # If you click on this entry, a very long list of subspecies appears, so this taxonomy ID is assigned at species level
        # Click on an entry within the subspecies list (e.g. the first Salmonella enterica subsp. arizonae serovar)
        # Check the taxonomy ID of this entry: it is 2577202, which differs from the species-level ID
        # Thus a taxonomy ID is not specific to one genome; IDs exist at different phylogenetic levels

If, on the other hand, you are interested in one specific genome for your work, which type of information (instead of the taxonomy ID or an organism name) can you use as a unique identifier?

# Find unique identifier of genome
        # Continue from the subspecies entry you opened for answering the last question
        # Under Entrez records in the top right corner you can see how many entries each database holds for this taxon
        # Click on the 4 next to Assembly to look at the individual genomes of that subspecies
        # The results list 4 items (assemblies), each with its own GenBank accession number
        # For a description of the accession number, go back to the section GenBank flat file format
        # This GenBank accession number is unique to each genome and can be used to trace back specific genomes
        # You can also enter this GenBank accession number (e.g. GCA_011634505.1 from the first assembly) on the NCBI homepage to search directly for a specific genome

# Note: There are a lot of different ways to find the solution. These are just examples.
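The assembly accessions used in this exercise (for example GCA_011634505.1) share a fixed layout: the prefix GCA_ (GenBank) or GCF_ (RefSeq), nine digits, and a version number after the dot. As a rough sketch, a small bash function can check whether a string has this layout before you use it in a search. The function name `is_assembly_accession` is hypothetical, and the check only looks at the format; it does not confirm that the accession exists at NCBI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the argument looks like an assembly accession,
# i.e. GCA_ or GCF_, nine digits, a dot, and a version number.
is_assembly_accession() {
    [[ "$1" =~ ^GC[AF]_[0-9]{9}\.[0-9]+$ ]]
}

for candidate in GCA_011634505.1 NC_000913.3 28901; do
    if is_assembly_accession "$candidate"; then
        echo "$candidate: looks like an assembly accession"
    else
        echo "$candidate: not an assembly accession"
    fi
done
```

Of the three candidates, only the first matches: NC_000913.3 is a sequence record accession and 28901 is a taxonomy ID, so both are reported as not matching.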

Find and read the available help information for bio.

# Explore the package
bio # typing bio without any arguments gives you an overview of the commands in this package
# Note that bio --help will give an error because bio is not a command but a package containing many commands

# Explore a function within the package
bio search --help # typing a command (bio search) with --help leads you to its help page (including possible arguments for the command)

# Other options to find documentation and help resources
# Check the official documentation page of the package or command
# Ask Google (search for "bio package documentation" or for a problem-related question)
# Ask Stack Overflow (https://stackoverflow.com/)
# Ask ChatGPT or use GitHub Copilot, but be warned that they can easily make mistakes

Choose one of the 15 genomes found in the exercise above and use bio to download its FASTA and GenBank files to your homework folder.

# For example
bio fetch GCA_000191145.1 > GCA_000191145.1.gbk
bio fasta GCA_000191145.1.gbk > GCA_000191145.1.fasta
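If you want to repeat these two steps for several genomes, they can be wrapped in a small bash function. This is only a sketch: the function name `download_genome` and the folder name in the usage line are hypothetical, and it assumes that `bio fetch` and `bio fasta` behave exactly as in the two commands above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the two commands shown above.
# Usage: download_genome ACCESSION OUTDIR
download_genome() {
    local acc="$1" outdir="$2"

    # Create the target folder if it does not exist yet
    mkdir -p "$outdir" || return 1

    # Fetch the GenBank record, then convert it to FASTA
    bio fetch "$acc" > "$outdir/$acc.gbk" || return 1
    bio fasta "$outdir/$acc.gbk" > "$outdir/$acc.fasta" || return 1

    # -s is true only for files that exist and are not empty
    [ -s "$outdir/$acc.gbk" ] && [ -s "$outdir/$acc.fasta" ]
}

# Example call (the folder name is a placeholder):
# download_genome GCA_000191145.1 homework
```

The final test makes the function return a non-zero status when either output file is empty, which is a cheap way to notice a failed download.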
