Searching for and within Files#

Wildcards#

When providing a file path as an argument to a command, it is often possible to provide multiple file paths using wildcards. These are special characters or strings that can be substituted for a matching pattern.

  • ? matches any single character

  • * matches any number of any characters

  • […] matches any character within the brackets

  • {word1,word2,…} matches any string inside the brackets

For instance:

# Pattern matching
ls /science/teaching/ecoli/*      # lists all files in the ecoli directory
ls /science/teaching/ecoli/*.fna  # lists all nucleotide fasta files there
ls /science/teaching/ecoli/*.f?a  # lists all nucleotide and protein fasta files there

Searching for a file#

When you are trying to find a file in your system, the command find offers a number of options to help you. The first argument is where to start looking (it looks recursively inside all directories from there), and then an option must be given to specify the search criteria.

# Finding files
find . -name "*.txt"  # searches for files ending in .txt
find . -mtime -2      # searches for files modified in the last two days
find . -mtime +365    # searches for files modified at least one year ago
find . -size +1G      # searches for files at least 1GB or larger
find . -maxdepth 1    # searches only here, i.e.: doesn't look inside directories

Exercises#

  • Use ls to get a list of all files in the /science/teaching/ecoli directory

  • Use cp to copy all files from the ecoli directory into a directory in your home directory

  • Navigate to the /science/teaching directory

  • Use find to look for all .txt files there

  • Use find to look for all files larger than 1MB

  • Now combine these criteria to find all .txt files larger than 1MB

Searching in less#

When you open a file to look at it using less it is also possible to search within that file by pressing / (search forwards) or ? (search backwards) followed by a pattern.

# Finding strings
/AAAAAAAAA  # finds the next instance of "AAAA"
?TTTTTTTTT  # finds the previous instance of "TTTT"

These same commands will also work with man, helping you to find a particular argument more easily.

But what happens when you search for “.”? The entire document will be highlighted! Why is this?

Regular Expressions#

The reason this happens is that in the context of these search functions, “.” represents any character. It is acting as a wildcard, from a different set of wildcards to those discussed above.

This set of wildcards is part of a system of defining search patterns called regular expressions or regex. Such a pattern can consist of wildcards, groups and quantifiers, and involve some complex logic which we will not cover here. Further, the exact set of wildcards available depends on the language being used.

# Wildcards and quantifiers
.   any character
\d  any digit
\w  any letter or digit
\s  any whitespace

^   the start of the string
$   the end of the string

*   pattern is seen 0 or more times
+   pattern is seen 1 or more times
?   pattern is seen 0 or 1 times

These are just a few of the possibilities available. An example regular expression that would search for email addresses, for instance, would be:

# name@domain.net
\w+@\w+\.\w+

Grep#

Grep is a program that allows you to search within files without opening them first with another program. It also uses regular expressions to allow for powerful searches, and has a number of useful options to help give you the right output.

# A simple grep
grep "AAAAAAAAA" E.coli.fna        # shows all lines containing "AAAAAAAAA" highlighted

# Using a regex
grep -E "(ACGT)(ACGT)+" E.coli.fna # shows all lines containing "ACGTACGT.." highlighted

# Useful options
grep -o  # show only the matches
grep -c  # show only a count of the matches

Exercises#

  • Navigate to the directory you copied the E. coli files to earlier

  • Use less to look at the GCF_000482265.1_EC_K12_MG1655_Broad_SNP_cds_from_genomic.fna file, containing nucleotide gene sequences

  • Search within less to find the sequence for dnaA

  • Use grep to find the same entry in the file

  • Use grep to count how many fasta entries the file has - as a reminder, a faster header always starts with a ‘>’

  • If you are interested in learning regular expressions, try the exercises here