Data wrangling
A lot of time and effort in bioinformatics is spent arranging data in the correct way or correct format (aka “data wrangling”). Consequently, it is very useful to know how to filter and rearrange data files. In these exercises, we will learn some of the commands we use to do this.
The command sort sorts the lines of a file, alphabetically by default, but other orderings (such as numeric) are available.
# Sort some example files
cat /nfs/teaching/551-0132-00L/2_Good_practices/sort_words.txt
sort /nfs/teaching/551-0132-00L/2_Good_practices/sort_words.txt
# Sort numerically with the -n option
cat /nfs/teaching/551-0132-00L/2_Good_practices/sort_nums.txt
sort -n /nfs/teaching/551-0132-00L/2_Good_practices/sort_nums.txt
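sort has a number of other useful options; two worth trying (entirely optional, and documented in man sort) are -r to reverse the order and -u to keep only unique lines.
# Sort numerically in reverse order (largest value first)
sort -rn /nfs/teaching/551-0132-00L/2_Good_practices/sort_nums.txt
# Sort and keep only one copy of each distinct line
sort -u /nfs/teaching/551-0132-00L/2_Good_practices/sort_nums.txt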
The command cut allows you to extract one or more columns of data from a file, for instance a .csv or .tsv file. The parameter -f specifies which column(s) to extract.
# Look at some experimental metadata and extract the column we are interested in
less /nfs/teaching/551-0132-00L/2_Good_practices/metadata.tsv
# Extract the 4th column (counting from the left)
cut -f 4 /nfs/teaching/551-0132-00L/2_Good_practices/metadata.tsv
# Extract multiple columns
cut -f 4,5 /nfs/teaching/551-0132-00L/2_Good_practices/metadata.tsv
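By default cut expects the columns to be separated by tabs, as in a .tsv file. For a comma-separated .csv file you can set the delimiter with -d; the file name below is just a placeholder to show the syntax.
# Extract the 2nd column from a hypothetical comma-separated file
cut -d ',' -f 2 some_table.csv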
The command paste allows you to combine data from different files side by side, with each input file becoming a column of the output.
# Put together two files into one
paste /nfs/teaching/551-0132-00L/2_Good_practices/sort_words.txt /nfs/teaching/551-0132-00L/2_Good_practices/sort_nums.txt
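paste separates the combined columns with a tab by default; if you prefer a different separator, the -d option lets you choose one, for example a comma.
# Combine the same two files, separating the columns with a comma
paste -d ',' /nfs/teaching/551-0132-00L/2_Good_practices/sort_words.txt /nfs/teaching/551-0132-00L/2_Good_practices/sort_nums.txt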
The command tr will replace a given character set with another character set, but to use it properly you need to know how to combine commands (below).
# For instance, this command requires you to type the input in
tr 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 'abcdefghijklmnopqrstuvwxyz'
# Then try typing AN UPPER CASE SENTENCE
# Remember: to exit a running program, use ctrl + c
# It can also be used to delete characters
tr -d 'a'
# Then try typing a sentence with the letter 'a' in it.
# Remember: to exit a running program, use ctrl + c
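As a small preview of combining commands, you can also send text to tr through a pipe instead of typing it in; the command then processes the string and exits by itself.
# Convert a piped string to lower case - no ctrl + c needed
echo 'AN UPPER CASE SENTENCE' | tr 'A-Z' 'a-z'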
The command uniq compresses adjacent repeated lines into one line, and is best used with sort when combining commands (see below).
# Look at a file and remove adjacent repeated lines
less /nfs/teaching/551-0132-00L/2_Good_practices/uniq_nums.txt
uniq /nfs/teaching/551-0132-00L/2_Good_practices/uniq_nums.txt
# Count how many times each value is repeated
uniq -c /nfs/teaching/551-0132-00L/2_Good_practices/uniq_nums.txt
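If identical lines are not next to each other, uniq will not merge them; the usual trick, again a preview of combining commands, is to sort the file first so that repeats become adjacent.
# Sort first, then count how many times each line occurs
sort /nfs/teaching/551-0132-00L/2_Good_practices/uniq_nums.txt | uniq -c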
Exercise 0.7
Use the sort examples above and see what happens when you try to sort the sort_nums.txt file without the -n flag.
# Sort sort_nums.txt without -n
sort /nfs/teaching/551-0132-00L/2_Good_practices/sort_nums.txt
# The file will be sorted alphabetically
Look at the file /nfs/teaching/551-0132-00L/2_Good_practices/sort_tab.txt.
# Look at sort_tab.txt
less /nfs/teaching/551-0132-00L/2_Good_practices/sort_tab.txt
Extract the second column of this file using cut.
# Extract the second column
cut -f 2 /nfs/teaching/551-0132-00L/2_Good_practices/sort_tab.txt
Looking at the manual for sort, can you figure out how to sort sort_tab.txt according to the second column, or ‘key’?
# Look at the manual
man sort
# Sort the table by second column
sort -n -k 2 /nfs/teaching/551-0132-00L/2_Good_practices/sort_tab.txt
# Note that if you forget the -n then the numbers are sorted alphabetically, not numerically
Use paste to combine the two files sort_words.txt and sort_nums.txt (in the directory /nfs/teaching/551-0132-00L/2_Good_practices/) into a single two-column output.
# Use paste to combine files
paste /nfs/teaching/551-0132-00L/2_Good_practices/sort_words.txt /nfs/teaching/551-0132-00L/2_Good_practices/sort_nums.txt
Use tr so that when you enter the word banana it comes out as rococo.
# Use tr to convert one word into another
tr 'ban' 'roc'
# Then input banana and back comes rococo!
# Use ctrl + c to kill the command
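If you would rather not run tr interactively, the same substitution works on a piped string, again previewing combining commands.
# Pipe the word in instead of typing it
echo 'banana' | tr 'ban' 'roc'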
Use the uniq examples above, then check with uniq -c that each line in sort_tab.txt is unique.
# Check file with uniq
uniq -c /nfs/teaching/551-0132-00L/2_Good_practices/sort_tab.txt
# Each count in the first column is 1 - no repeated lines!