Searching#

Searching for a file#

When you need to find a file on your system, the find command offers a range of options to help. The first argument is the directory where the search starts (find looks recursively inside all directories below it); the options that follow specify the search criteria. Without any criteria, find lists everything below the starting directory.

# Finding files ("." stands for the current directory)
find . -name "*.txt" -type f  # finds files whose names end in .txt; -type f restricts the search to regular files
find . -mtime -2              # finds files modified within the last two days
find . -mtime +365            # finds files last modified more than a year ago
find . -size +1G              # finds files larger than 1 GiB
find . -maxdepth 1            # searches only the starting directory itself, without descending into subdirectories
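
Search criteria can be combined, in which case a file is only listed if it satisfies all of them. The following is a minimal sketch in a scratch directory; the directory and file names are made up for illustration, and -maxdepth is assumed to be available (it is in GNU and BSD find).

```shell
# Build a small scratch directory (all names here are hypothetical)
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/notes.txt" "$demo/data.csv" "$demo/sub/deep.txt"

# All regular files ending in .txt, at any depth
all_txt=$(find "$demo" -type f -name "*.txt" | sort)

# Only the top level: -maxdepth 1 stops find from descending into sub/
top_txt=$(find "$demo" -maxdepth 1 -type f -name "*.txt")

echo "$all_txt"
echo "$top_txt"
```

The first search lists both notes.txt and sub/deep.txt, while the second lists only notes.txt.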

Exercise 0.6

  • Use cp to copy all files from the directory /nfs/teaching/551-0132-00L/1_Unix/genomes/bacteria/escherichia/GCF_000005845.2_ASM584v2 into a new directory in your home directory

# Make a directory for the new files
cd ~
mkdir ecoli

# Copy all the files
cp /nfs/teaching/551-0132-00L/1_Unix/genomes/bacteria/escherichia/GCF_000005845.2_ASM584v2/* ~/ecoli/
  • Navigate to the /nfs/teaching/551-0132-00L/1_Unix/genomes directory

# Navigation
cd /nfs/teaching/551-0132-00L/1_Unix/genomes
  • Use man to read about the find command

# Looking at find
man find
  • Use find to get a list of everything stored in the /nfs/teaching/551-0132-00L/1_Unix/genomes directory

# Getting a list with find
find /nfs/teaching/551-0132-00L/1_Unix/genomes/
  • Use find to look for all .faa files there

# Looking for .faa files
find . -name "*.faa"
  • Use find to look for all files larger than 5MB

# Looking for files larger than 5MB
find . -size +5M
  • Now combine these criteria to find all .faa files larger than 5MB

# Looking for .faa files larger than 5MB
find . -name "*.faa" -size +5M

Searching in less#

When you open a file to look at it using less, it is also possible to search within that file by pressing / (search forwards) or ? (search backwards) followed by a pattern.

# Finding strings
/AAAA  # finds the next instance of "AAAA"
?TTTT  # finds the previous instance of "TTTT"

These same commands will also work with man, helping you to find a particular argument more easily.

But what happens when you search for “.”? The entire document will be highlighted! Why is this?

Regular Expressions#

The reason this happens is that in the context of these search functions, “.” represents any character. It acts as a wildcard, but one from a different set than the shell wildcards discussed in Unix1.
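
The difference between “.” as a wildcard and a literal full stop can be checked on the command line. This is a small sketch using grep with the -o option (both are introduced in the Grep section further down); the test string is made up.

```shell
# A made-up string containing exactly two full stops
line='file.v2.txt'

# Unescaped "." matches every character, so -o prints one match per character
any_char=$(printf '%s\n' "$line" | grep -o '.' | wc -l)

# Escaped "\." matches only literal full stops
literal_dot=$(printf '%s\n' "$line" | grep -o '\.' | wc -l)

echo "any character: $any_char, literal dots: $literal_dot"
```

The unescaped pattern matches all 11 characters, whereas the escaped one matches only the 2 full stops.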

This set of wildcards is part of a system for defining search patterns called a regular expression, or regex. Such a pattern can consist of wildcards, groups and quantifiers, and may involve some complex logic which we will not cover here. Further, the exact set of wildcards available depends on the programming language or tool being used.

# Wildcards and quantifiers
.   any character
\d  any digit
\w  any letter, digit or underscore
\s  any whitespace

^   the start of the string
$   the end of the string

*   pattern is seen 0 or more times
+   pattern is seen 1 or more times
?   pattern is seen 0 or 1 times
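
To see how the quantifiers differ, the sketch below filters a few made-up strings with grep -E (grep is introduced further down). The anchors ^ and $ force the pattern to describe the whole line rather than just a part of it.

```shell
# Made-up test strings, one per line
strings=$(printf '%s\n' 'AT' 'AAT' 'T' 'CAT')

# ^A*T$ : zero or more A, then T, and nothing else
star=$(printf '%s\n' "$strings" | grep -E '^A*T$')

# ^A+T$ : one or more A, then T
plus=$(printf '%s\n' "$strings" | grep -E '^A+T$')

# ^A?T$ : at most one A, then T
opt=$(printf '%s\n' "$strings" | grep -E '^A?T$')

printf 'star:\n%s\nplus:\n%s\nopt:\n%s\n' "$star" "$plus" "$opt"
```

Here the * pattern keeps AT, AAT and T, the + pattern keeps AT and AAT, and the ? pattern keeps AT and T. CAT is rejected every time because the line must start with A or T.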

These are just a few of the possibilities available. An example regular expression that would search for email addresses, for instance, would be:

# name@domain.net can be matched as: \w+@\w+\.\w+

Let’s break this down:

  • The first part \w+ matches one or more letters or digits, i.e. the name part of the email address. Note that \w does not match punctuation like “.” but does match underscores “_”.

  • Then we ask for an at symbol @.

  • The second part \w+ again matches any alphanumeric string, i.e.: the domain part of the email address.

  • Then we ask for a literal full stop \., which has to be escaped with a backslash because an unescaped “.” matches any character.

  • The third and final part is the same as the first and second and should match the net part of the email address.

This is therefore not a perfect regex for all email addresses: the name part can contain full stops, and domains can have more than two parts.
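
The limitation can be seen by trying the pattern on a few made-up addresses. This sketch assumes a grep that understands \w in extended mode (GNU grep does, as an extension) and uses -o to print only the matched text.

```shell
# Made-up input: one simple address, one with extra dots, one line with no address
emails=$(printf '%s\n' 'name@domain.net' 'first.last@mail.example.org' 'no address here' \
  | grep -oE '\w+@\w+\.\w+')

echo "$emails"
# prints:
# name@domain.net
# last@mail.example
```

The simple address is matched in full, but only the fragment last@mail.example is recovered from the second one, because \w+ stops at each full stop.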

Instead of searching for a character class such as \w (any letter, digit or underscore), you can search for a specific expression such as the sequence ACGT. Enclosing this expression in parentheses () turns it into a group. You can then search for repeated occurrences of the group by adding a quantifier, as in (ACGT)+, or search for several patterns at once using the pipe character |. The pipe acts as a logical OR and divides the expressions within the group into alternatives.

# Multiple occurrences
ACGT        would match ACGT
(ACGT)+     would match ACGT, ACGTACGT, ACGTACGTACGT etc.
(AC|CG|GT)  would match AC, CG, GT
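
These patterns can be tried out directly with grep -oE (covered in the next section), which prints each match on its own line. The input strings below are made up.

```shell
# A repeated group: consecutive copies of ACGT are reported as one match
repeats=$(echo 'TTACGTACGTTTACGT' | grep -oE '(ACGT)+')

# Alternation: each of AC, CG and GT is matched wherever it occurs
alts=$(echo 'ACCGGT' | grep -oE '(AC|CG|GT)')

echo "$repeats"
# prints:
# ACGTACGT
# ACGT

echo "$alts"
# prints:
# AC
# CG
# GT
```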

Grep#

The command grep allows you to search within files without opening them first with another program. It also uses regular expressions to allow for powerful searches, and has a number of useful options to help give you the right output.

# A simple grep
grep "AAAAAAAAA" E.coli.fna        # shows all lines containing "AAAAAAAAA" highlighted

# Using grep with a regex
grep -E "(ACGT)(ACGT)+" E.coli.fna # shows all lines containing "ACGTACGT.." highlighted

# Useful options
grep -o  # show only the matching parts of each line, one match per output line
grep -c  # show only a count of the lines that contain a match
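
Note that -c counts lines containing a match, not individual matches. If a line can contain the pattern more than once, one common way to count every occurrence is to combine -o with wc -l. The following sketch illustrates the difference on a small temporary file with made-up content.

```shell
# Temporary file with made-up sequence lines; the first line contains ACGT twice
demo_fna=$(mktemp)
printf '%s\n' 'ACGTTTACGT' 'TTTT' 'ACGT' > "$demo_fna"

# Number of lines that contain at least one match
line_count=$(grep -c 'ACGT' "$demo_fna")

# Number of individual matches: -o prints one match per line, wc -l counts them
match_count=$(grep -o 'ACGT' "$demo_fna" | wc -l)

echo "lines: $line_count, matches: $match_count"
rm -f "$demo_fna"
```

Two lines contain ACGT, but there are three matches in total.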

Exercise 0.7

  • Navigate to the directory you copied the E. coli files to earlier.

# Navigation
cd ~/ecoli
  • Use less to look at the GCF_000005845.2_ASM584v2_cds_from_genomic.fna file, containing nucleotide gene sequences.

# Look at the file
less GCF_000005845.2_ASM584v2_cds_from_genomic.fna
  • Search within less to find the sequence for dnaA.

# Type this within less:
/dnaA
# Type 'n' or 'N' after to see if there are more search hits
# Press q to quit
  • Use man to look at the grep command

# Looking at grep
man grep
  • Use grep to find the same entry in the file.

# Using grep to search for dnaA
grep 'dnaA' GCF_000005845.2_ASM584v2_cds_from_genomic.fna
  • Use grep to count how many FASTA entries the file has. As a reminder, a FASTA header always starts with a ‘>’.

# Use grep to count header lines (^ anchors the match to the start of the line)
grep -c '^>' GCF_000005845.2_ASM584v2_cds_from_genomic.fna
  • Find the locus tag for the gene dnaA.

# Which entry?
grep '>.*dnaA.*' GCF_000005845.2_ASM584v2_cds_from_genomic.fna
  • If you are interested in learning regular expressions, try the exercises here
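
As a recap of the grep steps in this exercise, here is a minimal sketch on a tiny made-up FASTA file. The sequence names, gene names and locus tags are hypothetical, and the header layout (fields in square brackets) is only an assumption about what such headers may look like.

```shell
# Tiny made-up FASTA file (hypothetical names and tags)
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 [gene=geneA] [locus_tag=x0001]
ATGACCGT
>seq2 [gene=geneB] [locus_tag=x0002]
ATGTTTGA
EOF

# Count entries: header lines are the ones starting with ">"
n_entries=$(grep -c '^>' "$fasta")

# Pick the header for geneB (-F treats the pattern as a fixed string),
# then keep only the locus_tag field with -o
locus=$(grep '^>' "$fasta" | grep -F '[gene=geneB]' | grep -oE 'locus_tag=[^]]+')

echo "$n_entries entries; $locus"
rm -f "$fasta"
```

Chaining several simple grep calls like this is often easier to read than one complicated regular expression.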