Data wrangling#

A lot of time and effort in bioinformatics is spent arranging data into the correct shape or format (aka “data wrangling”). Consequently, it is very useful to know how to filter and rearrange data files. In these exercises, we will learn some of the commands used to do this.

The command sort sorts the lines of a file, alphabetically by default; other orderings are available through its options.

# Sort some example files
cat  /nfs/teaching/551-0132-00L/2_Good_practices/sort_words.txt
sort /nfs/teaching/551-0132-00L/2_Good_practices/sort_words.txt

# Sorting numerically with the -n option
cat /nfs/teaching/551-0132-00L/2_Good_practices/sort_nums.txt
sort -n /nfs/teaching/551-0132-00L/2_Good_practices/sort_nums.txt
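If you do not have access to the teaching server, you can reproduce the difference between the two sort modes with some made-up numbers generated by printf (a sketch; the values are arbitrary):

```shell
# Default sort is alphabetical, so "10" comes before "2"
printf '10\n2\n1\n' | sort
# With -n the values are compared as numbers instead
printf '10\n2\n1\n' | sort -n
```

The first command prints 1, 10, 2; the second prints 1, 2, 10.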

The command cut allows you to extract one or more columns of data from a file, for instance a .csv or .tsv file. The parameter -f specifies which column to extract and can also list several columns at once.

# Look at some experimental metadata and extract the column we are interested in
less /nfs/teaching/551-0132-00L/2_Good_practices/metadata.tsv
# Extract the 4th column (counting from the left)
cut -f 4 /nfs/teaching/551-0132-00L/2_Good_practices/metadata.tsv
# Extract multiple columns
cut -f 4,5 /nfs/teaching/551-0132-00L/2_Good_practices/metadata.tsv
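By default cut expects tab-separated input, which is why it works directly on a .tsv file. For a .csv you can set the delimiter with -d. A minimal sketch on made-up comma-separated data (the column names here are invented for illustration):

```shell
# -d ',' tells cut the fields are comma-separated; -f 2 takes the second field
printf 'name,species,count\nsampleA,E. coli,12\n' | cut -d ',' -f 2
```

This prints the header "species" followed by "E. coli".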

The command uniq collapses adjacent repeated lines into a single line, so it is best combined with sort when building pipelines (see below).

# Look at a file and remove adjacent repeated lines
less /nfs/teaching/551-0132-00L/2_Good_practices/uniq_nums.txt
uniq /nfs/teaching/551-0132-00L/2_Good_practices/uniq_nums.txt

# Count how many times each value is repeated
uniq -c /nfs/teaching/551-0132-00L/2_Good_practices/uniq_nums.txt
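Because uniq only compares adjacent lines, repeats scattered through a file are missed unless you sort first. A small demonstration on made-up data (no course files needed):

```shell
# No two equal lines are adjacent here, so uniq -c collapses nothing
printf 'b\na\nb\na\n' | uniq -c
# Sorting first brings the repeats together, so they are counted correctly
printf 'b\na\nb\na\n' | sort | uniq -c
```

The second pipeline reports a count of 2 for each of a and b; the first reports four lines with a count of 1 each.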

Exercise 1.8#


  • Use the sort examples above and see what happens when you try to sort the sort_nums.txt file without the -n flag.

# Sort sort_nums.txt without -n
sort /nfs/teaching/551-0132-00L/2_Good_practices/sort_nums.txt
# The file is sorted alphabetically, so e.g. 10 sorts before 2
  • Look at the file /nfs/teaching/551-0132-00L/2_Good_practices/sort_tab.txt.

# Look at sort_tab.txt
less /nfs/teaching/551-0132-00L/2_Good_practices/sort_tab.txt
  • Extract the second column of this file using cut.

# Extract the second column
cut -f 2 /nfs/teaching/551-0132-00L/2_Good_practices/sort_tab.txt
  • Use the uniq examples above, then check with uniq -c that each line in sort_tab.txt is unique.

# Check file with uniq
uniq -c /nfs/teaching/551-0132-00L/2_Good_practices/sort_tab.txt
# Every count in the first column is 1 - no repeats!
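The three commands from this section can be chained into one pipeline. As a sketch on made-up tab-separated data (the values are invented; with the course files you would use sort_tab.txt or metadata.tsv instead), this counts how often each value appears in the second column:

```shell
# Extract column 2, sort it, then count each distinct value
printf 'a\tx\nb\ty\nc\tx\n' | cut -f 2 | sort | uniq -c
```

This reports that x appears twice and y once.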