Linux & scripting for data wrangling

Jimmy Christensen

October 31, 2022

Overview

Slides available at https://dusted.dk/pages/scripting/slides.html

Linux ?

Linux is an operating system, like Windows or OSX. It allows you to run programs.

Bash ?

Popular “command line” for Linux & OSX

Python ?

Popular scripting language

All of them ?

Bash, grep, awk, sed, tr, cut:

Bash scripts using above programs to automate extraction.

Python scripts to remodel and consolidate extracted data, calculate and prepare results.

Overview: Linux

Editing with nano

Nano is a (bit too) simple text editor

Files / Directories

/home/chrisji/theWork/important.txt

Peek at unknown file: hd strangeFile | head

Check size: ls -lh strangeFile

Only beginning: head strangeFile

Page through it: more strangeFile

Pipes & redirects

Composition, also useful in bash scripting

# A pipe directs data from stdout of one program to stdin of the next
ls | wc -l 

# A redirect directs data from stdout of a program to a file.
ls > myFiles.txt

# An appending redirect will not truncate the file first.
echo "Line 1" > file.txt
echo "Line 2" >> file.txt

Script vs program

Same difference, almost

Overview: Bash

Two quick notes

# This line is a comment, because it starts with the hash mark
# comments are to help the humans (us)

both bash and python # ignores a hash mark and any following text
but only for rest of that line

Making & running a bash script

Creating the script with the nano editor

nano shee.sh

Making & running a bash script

# Look at our new script file, it is not marked as being "executable"
ls -l shee.sh

# We make it executable!
chmod +x shee.sh

# Check that it has the x flag (and pretty green color)
ls -l shee.sh

# Now we can run it
./shee.sh
Hello World!

Bash: Variables

Holds something and has a name, like a box with a label on it

#!/bin/bash

placeToGreet="World"

echo "Hello $placeToGreet!"
Hello World!

Bash: Variables

They can be re-assigned!

#!/bin/bash

placeToGreet="World"

placeToGreet="Earth"

echo "Hello $placeToGreet!"

What’s the result ? (remember, line-by-line)

Bash: Variables

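Rule of thumb: keep bash variables small (passing more than about 2 megabytes to another program as arguments typically fails on Linux); Python variables are limited mainly by available memory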

#!/bin/bash

# Create some variables and assign them values
timesToCall=3
animal="dog"
name="viggo"

# Output to the screen
echo "Calling for $name the $animal $timesToCall times!"
Calling for viggo the dog 3 times!

Bash: Parameters

Variables assigned from command-line

# The shee.sh script is executed with four parameters:
# First parameter: 3
# Second parameter: dog
# Third parameter: viggo
# Fourth parameter: heere doggie doggie! 
./shee.sh 3 dog viggo "heere doggie doggie!"
Calling for viggo the dog 3 times!

Bash: Parameters

#!/bin/bash

# $1 = first parameter
# $2 = second parameter
# $3 = third parameter

timesToCall="$1"
animal="$2"
name="$3"

echo "Calling for $name the $animal $timesToCall times!"

Variables: There’s more

Bash has strong string-manipulation capabilities, but they quickly degenerate into alphabet soup

Don’t go there, use python

For the curious: https://tldp.org/LDP/abs/html/string-manipulation.html

It also has arrays and associative arrays (but no real sets), and bash syntax makes them painful

Conditionals

Conditionals let us decide if we want to do something

if CONDITION
then
    # do
    # some
    # things
fi

Conditionals

The “else” word lets us do something if the condition is not met

if CONDITION
then
    # do
    # some
    # things
else
    # do some
    # other
    # things
fi

Conditionals

The elif is optional, allows chains

if CONDITION_A
then
    # something to do if A..
elif CONDITION_B
then
    # This must not happen if A, but if B.
elif CONDITION_C
then
    # If neither A nor B, but C, then do this
else
    # If not A, B nor C, then do this..
fi

else is also optional

if CONDITION_A
then
    # Do this if A
elif CONDITION_B
then
    # Do this if B
fi

Conditionals

In bash, a condition is met if the [ ] builtin or a program exits with the “success” status code.

true # A program that returns success
false # A program that returns failure (not success)
grep # Will return success if it finds at least one match
[ "$name" == "viggo" ] # Will succeed if the variable contains the text viggo
[ $a -gt $b ] # success if a is Greater Than b
[ $a -lt $b ] # success if a is Less Than b
[ $a -eq $b ] # success if a is EQual to b
[ -z ${a+x} ] # success if a is not set at all; an empty a still counts as set (weird syntax, I know)
[ "$name" == "viggo" ] && [ "$animal" == "dog" ] # success only when name is viggo AND animal is dog (both must be true)
[ "$name" == "viggo" ] || [ "$animal" == "dog" ] # success if name is viggo or animal is dog (one or both must be true)

Conditionals

if echo "secret string" | grep "secret" &> /dev/null
then
    echo "The secret string was found!"
fi

Conditionals

if [ "$name" == "viggo" ] && [ "$animal" == "dog" ]
then
    echo "Viggo is a very special dog!"
elif [ "$animal" == "dog" ]
then
    echo "Dogs are just great, even if they're $name instead of viggo!"
else
    echo "I'm sure $name is a great $animal!"
fi

Loops

Loop: for

Repeats for each item in sequence

for VARIABLE in SEQUENCE
do
    # things to do
done

Loop: for

Example

for number in 1 2 3 4
do
    echo "Number is $number"
done

Loop: for

animals="dog fish cat bird"

for animal in $animals
do
    echo "$animal is an animal"
done
dog is an animal
fish is an animal
cat is an animal
bird is an animal

Loop: for

animals="dog fish cat bird"

for animal in "$animals"
do
    echo "$animal is an animal"
done
dog fish cat bird is an animal

Loop: for

Use expansion for files: *.txt produces a sequence of the .txt files in the current directory

for file in *.txt
do
    echo "Text file: $file"
done

Loop: while

Repeats as long as CONDITION is met (potentially forever)

while CONDITION
do
    # thing to do
done

Interjection

This is a good time to remind you that pressing ctrl+c (eventually) terminates a running script.

Loop: while

When you need to repeat something until “things are done”

while CONDITION
do
    # Thing to do until CONDITION is no longer met
done

Loop: while

while read line
do
    echo "This just in: $line"
done

Reading data

Reading data: The read builtin

read VARIABLE # Creates a variable and reads data from standard input into it

Read data from human

Useful for interactive scripts

echo "Welcome to super cool script v 24!"

# The -p parameter is a prompt to show to the human
read -p "File to corrupt: " fileName

echo "Don't worry, I was just kidding, nothing happened to $fileName"
Welcome to super cool script v 24!
File to corrupt: homework.md
Don't worry, I was just kidding, nothing happened to homework.md

Read data from human (in a loop)


answer=""
while [ "$answer" != "y" ] && [ "$answer" != "n" ]
do
    read -p "Continue? [y/n] ?" answer
done

if [ "$answer" == "n" ]
then
    echo "Stopping here."
    exit
fi

echo "Continuing"
Continue? [y/n] ?yes
Continue? [y/n] ?...
Continue? [y/n] ?yeees
Continue? [y/n] ?noooo ?
Continue? [y/n] ?n
Stopping here.

Read data from file

For small files, we can read the entire file into a variable by running the cat program and capturing its output in a variable

someFileContents=$(cat secrets.txt)

echo "File content: $someFileContents"

Read data from program

For small outputs, we can capture the entire standard (non-error) output into a variable

gloriousResult=$(grep 'someRegex' secrets.txt)

Read data from stream

read exits with success as long as there is another line to read; it fails at the end of the input, which ends the loop

grep 'someRegex' secrets.txt | while read line
do
    echo "A line of the result: $line"
done

Read data from stream

We can compose complex pipelines now

grep 'someRegex' secrets.txt | while read line
do
    # Extract some field from the line (the awk step is only a placeholder for further processing)
    importantThing=$(echo "$line" | cut -f 2 -d ',' | awk '{ print $1 }')
    # Look for that field in some other file, put the result in a variable
    if resultFromOtherFile=$(grep "$importantThing" otherSecrets.txt)
    then
        # Eureka! some correlation is interesting when the field from subset of secrets.txt is in otherSecrets.txt!
        # Maybe combine on one line and print it out?
        echo "$line,$resultFromOtherFile"
    fi
done > results.txt

Read data from stream

We can also write a script that takes a stream from a pipe, for use with other scripts

ourCoolScript.sh

#!/bin/bash

while read ourLine
do
    # Do interesting things with ourLine (here we just pass it along)
    echo "$ourLine"
done

We’d use it like this

grep 'someRegex' someFile.txt | ./ourCoolScript.sh > results.txt

Write data

We just use the > and >> after any command, or after the script itself.

> # Will create an empty file (or truncate an existing one)
>> # Will create an empty file (or append to an existing one)

Functions

function removeRepeatSpaces {
    tr -s " "
}

echo "Badly formatted    data right  here." | removeRepeatSpaces

function largestOne {
    if [ "$1" -gt "$2" ]
    then
        echo "$1"
    elif [ "$1" -lt "$2" ]
    then
        echo $2
    fi
}

largest=$(largestOne $A $B)

Debugging

See EVERYTHING that happens during script execution by adding set -x

#!/bin/bash
set -x
echo "My script"
read -p "What is your name ?" name

if [ "$name" == "viggo" ]
then
    echo "Good dog!"
fi

Tips

#!/bin/bash

# This script needs 3 parameters: name, animal and number of legs

name=$1
animal=$2
nlegs=$3

# nlegs is empty when fewer than 3 parameters were given
if [ -z "$nlegs" ]
then
    echo "Usage: $0 NAME ANIMAL NUM_LEGS"
    echo "    NAME - The name of the pet (example: viggo)"
    echo "    ANIMAL - What kind of animal is it? (example: dog)"
    echo "    NUM_LEGS - How many legs does it have? (example: 4)"
    exit 1
fi

Even if you’re the only user, you might forget how to use it later

$0 is special: it always holds the name the script was started with.

Tips

Suggestion for treating multiple files while keeping track of the source of each result

Let there be binary files data1.bin data2.bin … dataN.bin that can be translated by a “translatebinarytotext” program and filtered by some regular expression to find relevant lines:

#!/bin/bash

for fileName in data*.bin
do
    translatebinarytotext "$fileName" \
        | grep 'someRegex' \
        | ./ourAwesomeScript.sh "$fileName" \
        >> result.txt
done
# Alternative first line if the files are not named consistently
for fileName in data1.bin data_2.bin thirdPartOfData.bin

A polished version of the above alternative could be used like a real program.

./processThePeskyFiles.sh data1.bin data_2.bin thirdPartOfData.bin > result.txt

Then the script would look like:

#!/bin/bash

for fileName in "$@"
do
    translatebinarytotext "$fileName" \
        | grep 'someRegex' \
        | ./ourAwesomeScript.sh "$fileName"
done

Tips

It can be useful to write “debug” information to the standard error channel. This way it goes “around” pipes and redirects and is still shown on screen.


#!/bin/bash

# This function writes a message to stderr, so our script output can be piped while we still see information/errors.
function msg {
    echo "$@" >&2
}

msg "Script for doing cool stuff running in $(pwd)"
echo "Important data output"
bash coolStuffDoer.sh > test.txt
Script for doing cool stuff running in /home/chrisji/theWork/
cat test.txt 
Important data output

Tips

Scripts and data do not need to be in the same place; you can write the script as if it is next to the files.

.
├── data
│   └── seta
│       └── wool.txt
└── scripts
    └── sheep.sh

scripts/sheep.sh:

#!/bin/bash

sheep=$(grep sheep wool.txt | wc -l)
echo "There are $sheep sheep."
cd data/seta
../../scripts/sheep.sh

Tips

Alternatively, write a script that uses parameters:

.
├── data
│   └── seta
│       └── wool.txt
└── scripts
    └── sheep.sh

scripts/sheep.sh:

#!/bin/bash

file="$1"
sheep=$(grep sheep "$file" | wc -l)
echo "There are $sheep sheep in $file"
./scripts/sheep.sh data/seta/wool.txt

Tips

Alternatively, write a script that uses pipes:

.
├── data
│   └── seta
│       └── wool.txt
└── scripts
    └── sheep.sh

scripts/sheep.sh:

#!/bin/bash

sheep=$(grep sheep | wc -l)
echo "There are $sheep sheep."
cat data/seta/wool.txt | ./scripts/sheep.sh

Tips

A script that reads which files to work on from a file

filesToWorkOn.txt

file1.txt
file2.txt
file3.txt

process.sh

#!/bin/bash

listFileName="$1"

filesToProcess=$(cat "$listFileName")

for fileToProcess in $filesToProcess
do
    echo "Processing file: $fileToProcess"
done
./process.sh filesToWorkOn.txt 
Processing file: file1.txt
Processing file: file2.txt
Processing file: file3.txt

Overview: Python

Why?!

Make a script & run it

nano hello.py
#!/usr/bin/python3
print("Hello World!")
./hello.py
Hello World!

Variables

#!/usr/bin/python3

placeToGreet = "Earth"

# Apologies, the f is not a typo, it tells python we want variables substituted in the text string
print(f"Hello {placeToGreet}!")

Variable

Types & objects

Variables

# integer
length=23

# floating point
pi=3.1415
tau=pi*2

# set
instances={ pi, tau, tau, 'dog', 'viggo' }
# How many times is tau in the set? Once. It's a set.
# What order are the items in? None, sets are unordered.

# Convenient for accumulating unique occurrences of stuff
instances.add( 42 )

if tau in instances:
    print("Tau is in the set!")

# list
aList = [ 5,3,12,5 ]
# Lists are ordered, duplicates are allowed. (5 occurs twice, at the start and end)
# In python and most other languages, the first element of a list has index 0
print(f"The second element in the list is {aList[1]}")

Variables

Dictionaries make it easy to structure data

dogs = dict(
    viggo = dict(
        age = 4,
        legs= 4,
        tail=True
    ),
    henry = dict(
        age = 9,
        legs= 3.9,
        tail=True
    )
)

dogs.update(
    bob = dict(
        age = 2,
        legs= 4,
        tail=False # poor guy!
    )
)

print( f"Viggo is {dogs['viggo']['age']} years old." )
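A small sketch of how such a nested dictionary might be walked through. The data mirrors the dogs above, written compactly; picking the oldest dog with max() is just one illustrative use, not something the slides prescribe.

```python
dogs = dict(
    viggo = dict(age=4, legs=4, tail=True),
    henry = dict(age=9, legs=3.9, tail=True),
    bob = dict(age=2, legs=4, tail=False),
)

# items() yields each key together with its value
for name, info in dogs.items():
    print(f"{name} is {info['age']} years old.")

# Pick the name whose entry has the highest age
oldest = max(dogs, key=lambda name: dogs[name]['age'])
print(f"The oldest dog is {oldest}.")
```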

Variables


# Sets can accumulate unique things, dicts can count them
numAnimals = dict()

# -- snip --

if 'tau' not in numAnimals:
    numAnimals['tau'] = 0

numAnimals['tau'] += 1
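As a rough sketch of the counting pattern, the loop below tallies a made-up list of species names (observedAnimals is invented for illustration). Using dict.get with a default of 0 is one way to avoid the explicit "not in" check.

```python
# Hypothetical input: species names as they might appear in extracted data
observedAnimals = ["dog", "cat", "dog", "bird", "dog", "cat"]

numAnimals = dict()

for animal in observedAnimals:
    # get() returns 0 when the key has not been seen yet
    numAnimals[animal] = numAnimals.get(animal, 0) + 1

print(numAnimals)
```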

Variables

More than everything you need to know about Python variables at

https://docs.python.org/3/library/stdtypes.html

Conditionals

Conditionals are easier to read in Python. We generally don’t run external programs, so the boolean result of an expression is all we worry about.

A condition is met when the result is True, any non-zero number, or a non-empty string or collection.

A condition is not met when the result is False or 0, or empty (such as the empty string "" or list [])
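To make these rules concrete, here is a small sketch that runs a handful of values through bool(), which is the same check an if statement performs. The list of example values is chosen only for illustration.

```python
# Each value is paired with the result we expect from bool(value)
examples = [
    (True, True),
    (42, True),
    ("viggo", True),
    ([0], True),     # a list with one element is not empty, even if that element is 0
    (False, False),
    (0, False),
    ("", False),
    ([], False),
    (None, False),
]

for value, expected in examples:
    met = bool(value)
    print(f"{value!r:10} -> condition met: {met}")
    assert met == expected
```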

Conditionals

a > b # Is True if a is greater than b
a < b # Is True if a is less than b
a == b # Is True if a is equal to b
a != b # Is True if a is not equal to b
a < b and b > c # Is True if a is less than b AND b is greater than c
a > b or b < c # Is true if a is greater than b or if b is less than c
a in c # Is true if a is an element in c
a not in c # Is true if a is not an element in c

Conditionals

if CONDITION:
    # Do something that should be done
    # if CONDITION is met

if CONDITION:
    # Do something
else:
    # Do something else

Attention!

Python is indentation sensitive: the amount of indentation (“tabs” or spaces) indicates which code block a line belongs to.


if False:
    print("This is never shown")
    print("And this is also never shown")
print("This is shown")

Conditionals

name = 'viggo'

if name == 'viggo':
    print("The dog's name is Viggo!")

Conditionals

name = 'Viggo'

# 'Viggo' is not the same as 'viggo', but python strings provide a method for lowercasing for comparisons
if name.casefold() == 'viggo':
    print("The dog's name is Viggo!")

Loops

# For is great for iteration over "iterables"
for VARIABLE in ITERABLE:
    # do thing for each item
for number in range( 10, 20 ):
    print(f"Number: {number}")
Number: 10
Number: 11
Number: 12
Number: 13
Number: 14
Number: 15
Number: 16
Number: 17
Number: 18
Number: 19

More about the CSV module at https://docs.python.org/3/library/csv.html

Loops

#!/usr/bin/python3

import sys
import csv

# This csv file used ; instead of ,
reader = csv.reader(sys.stdin, delimiter=';')

rowNumber=1 # Start at 1 so the first row is reported as row 1

for row in reader:
    name=row[0]
    age=int(row[1]) # Convert from text to number 
    print(f"Row {rowNumber}, name: {name} age: {age}")
    if age > 5:
        print(f"{name} is an old doggie!")
    rowNumber += 1
cat csvfi | ./hello.py 
Row 1, name: viggo age: 6
viggo is an old doggie!
Row 2, name: henry age: 4
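A variation on the same idea, sketched so it can run without a pipe: io.StringIO stands in for standard input, and enumerate() does the row counting. The two rows of sample data are assumed from the output shown above.

```python
import csv
import io

# Stand-in for the data that would arrive on standard input
csvText = "viggo;6\nhenry;4\n"

reader = csv.reader(io.StringIO(csvText), delimiter=';')

oldDogs = []

# enumerate() hands us the row number, counting from 1 here
for rowNumber, row in enumerate(reader, start=1):
    name = row[0]
    age = int(row[1])  # Convert from text to number
    print(f"Row {rowNumber}, name: {name} age: {age}")
    if age > 5:
        oldDogs.append(name)

print(oldDogs)
```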

Loops

Much more about loops here: https://wiki.python.org/moin/ForLoop

Reading from a file

With an open file, readline will read a single line from the file, useful for large files

#!/usr/bin/python3

fileName="poe.txt"

with open(fileName) as textFile:
    while True:
        line = textFile.readline()
        if not line: # If the line was empty, file end reached
            break    # so break the loop
        line = line.strip()
        print(f"{line}")
Once upon a midnight dreary,
while I pondered, weak and weary,
Over many a quaint and curious
volume of forgotten lore—

Reading from a file

With an open file, readlines will read from the file into a list

#!/usr/bin/python3

lineNumber=0
fileName="poe.txt"

with open(fileName) as textFile:
    lines = textFile.readlines()
    for line in lines:
        lineNumber += 1
        line = line.strip() # remove newline from the line
        print(f"{fileName} {lineNumber}: {line}")
poe.txt 1: Once upon a midnight dreary,
poe.txt 2: while I pondered, weak and weary,
poe.txt 3: Over many a quaint and curious
poe.txt 4: volume of forgotten lore—
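A third option, sketched here as an alternative rather than the approach from the slides: an open file can be looped over directly, which reads one line at a time like readline but with less code. The temporary file only exists to make the example self-contained.

```python
import os
import tempfile

# Write a small throwaway file so the example runs anywhere
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Once upon a midnight dreary,\nwhile I pondered, weak and weary,\n")
    fileName = tmp.name

numberedLines = []

with open(fileName, encoding="utf-8") as textFile:
    # A file object yields one line at a time, so large files are fine too
    for lineNumber, line in enumerate(textFile, start=1):
        numberedLines.append(f"{lineNumber}: {line.strip()}")

os.remove(fileName)  # Clean up the throwaway file

for entry in numberedLines:
    print(entry)
```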

Writing files

reverser.py

#!/usr/bin/python3

inFileName="poe.txt"
outFileName="eop.txt"

with open(inFileName) as inFile:
    with open(outFileName, 'w') as outFile:
        while True:
            line = inFile.readline()
            if not line: # If the line was empty, file end reached
                break    # so break the loop
            line = line.strip()
            line = line[::-1] # Reverse the string
            line = line + "\n"
            outFile.write(line)
./reverser.py
cat eop.txt 
,yraerd thgindim a nopu ecnO
,yraew dna kaew ,derednop I elihw
suoiruc dna tniauq a ynam revO
—erol nettogrof fo emulov

Writing CSV files

#!/usr/bin/python3
import csv

ourResults = []                                                                 
ourResults.append( ['Johnny', 'Cat', 4 ] )                                      
ourResults.append( ['Viggo', 'Dog', 5 ] )                                       
ourResults.append( ['Mat', 'Dog', 7 ] )                                         
ourResults.append( ['Markus', 'Cat', 7 ] )                                      

with open("out.csv", "w", newline='') as csvOutFile: # newline='' lets the csv module control line endings
    writer = csv.writer(csvOutFile, dialect='excel')
    writer.writerow(['Name', 'Species', 'Age'])
    for row in ourResults:
        writer.writerow(row)
cat out.csv 
Name,Species,Age
Johnny,Cat,4
Viggo,Dog,5
Mat,Dog,7
Markus,Cat,7

Writing CSV files

#!/usr/bin/python3
import csv

csvHeader = [ 'Name', 'Species', 'Age' ]

ourResults = []
ourResults.append( { 'Name': 'Johnny', 'Species': 'Cat', 'Age': 4 } )
ourResults.append( { 'Name': 'Viggo',  'Species': 'Dog', 'Age': 5 } )
ourResults.append( { 'Name': 'Mat',    'Species': 'Dog', 'Age': 7 } )
ourResults.append( { 'Name': 'Markus', 'Species': 'Cat', 'Age': 7 } )


with open("out.csv", "w", newline='') as csvOutFile: # newline='' lets the csv module control line endings
    writer = csv.DictWriter(csvOutFile, fieldnames=csvHeader, dialect='excel')
    writer.writeheader()
    for row in ourResults:
        writer.writerow(row)
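For reading such a file back, csv.DictReader is the counterpart to DictWriter. The sketch below does a round trip in memory with io.StringIO instead of a real out.csv, using a shortened version of the data above.

```python
import csv
import io

csvHeader = ['Name', 'Species', 'Age']
ourResults = [
    {'Name': 'Viggo', 'Species': 'Dog', 'Age': 5},
    {'Name': 'Markus', 'Species': 'Cat', 'Age': 7},
]

# io.StringIO behaves like a file but lives in memory
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=csvHeader, dialect='excel')
writer.writeheader()
writer.writerows(ourResults)

# Rewind and read the rows back as dictionaries keyed by the header
buffer.seek(0)
rowsReadBack = list(csv.DictReader(buffer))

# Everything read from CSV is text, so Age must be converted again
ages = [int(row['Age']) for row in rowsReadBack]
print(rowsReadBack[0]['Name'], ages)
```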

String slicing

string[ BEGIN : END : STEP ]
# Slicing (extract substring), END itself is not included

myString = "abcdefg"

subString = myString[0:3:1] # First 3 characters (character 0,1,2)
subString = myString[3:6:1] # From fourth character to sixth (def)
subString = myString[-2::]  # Last 2 characters
subString = myString[::-1]  # Step through string in reverse   
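Slicing is handy for fixed-width data. The record layout below is invented for this sketch (8 characters of date, a dash, a 3 character ID, a dash, a 4 digit value).

```python
# Hypothetical fixed-width record
record = "20221031-A17-0042"

date = record[0:8]        # characters 0 to 7
year = record[:4]         # leaving out BEGIN means "from the start"
sensor = record[9:12]     # characters 9, 10 and 11
value = int(record[-4:])  # last 4 characters, converted to a number

print(date, year, sensor, value)
```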

String splitting

string.split( DELIMITER )
string.split( DELIMITER, NUMBER_OF_SPLITS)

# rsplit is the same, but goes right to left
myString = "a,b,c,d,e,f,g"

subString = myString.split(',') # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
subString = myString.split(',', 1) # ['a', 'b,c,d,e,f,g']
subString = myString.rsplit(',', 1) # ['a,b,c,d,e,f', 'g']
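The NUMBER_OF_SPLITS argument matters when a value may contain the delimiter itself. A minimal sketch, assuming a made-up line of key=value pairs separated by semicolons:

```python
# Hypothetical input line; note the '=' inside the last value
line = "name=viggo;species=dog;note=likes a=b puzzles"

fields = {}
for part in line.split(';'):
    # Split only at the first '=', so the value may contain '=' itself
    key, value = part.split('=', 1)
    fields[key] = value

print(fields)
```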

Regular expressions

import re # regex module

text = "He was carefully disguised but captured quickly by police."

subString = re.split(r"but", text) # ['He was carefully disguised ', ' captured quickly by police.']

subString = re.findall(r"\w+ly\b", text) # ['carefully', 'quickly']

More about regex at https://docs.python.org/3/library/re.html
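For data wrangling, regex groups are often the most useful part: they pull named pieces out of a line. The log format and the pattern below are assumptions made up for this sketch.

```python
import re

# Hypothetical log lines; the format is made up for this example
logLines = [
    "2022-10-31 sensor=A17 temp=21.5",
    "2022-10-31 sensor=B02 temp=19.0",
    "this line does not match",
]

# Parentheses create groups that can be pulled out of a match
pattern = re.compile(r"sensor=(\w+) temp=([\d.]+)")

readings = {}
for line in logLines:
    match = pattern.search(line)
    if match:  # search() returns None when nothing matches
        readings[match.group(1)] = float(match.group(2))

print(readings)
```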

Searching in list

#!/usr/bin/python3

listOfAnimals = [ 
        { "Name": "Viggo", "Species": "Dog", "Age": 5 },
        { "Name": "Mat", "Species": "Dog", "Age": 6 },
        { "Name": "Oliver", "Species": "Cat", "Age": 5 },
        { "Name": "Luggie", "Species": "Dog", "Age": 5 }
]

fiveYearOldDogs = []

for animal in listOfAnimals:
    if animal["Species"] == "Dog" and animal["Age"] == 5:
        fiveYearOldDogs.append(animal["Name"])

print(fiveYearOldDogs)
['Viggo', 'Luggie']
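The same search can also be written as a list comprehension, which some find easier to read once the pattern is familiar. This is a stylistic alternative to the loop, not a replacement for it.

```python
listOfAnimals = [
    { "Name": "Viggo", "Species": "Dog", "Age": 5 },
    { "Name": "Mat", "Species": "Dog", "Age": 6 },
    { "Name": "Oliver", "Species": "Cat", "Age": 5 },
    { "Name": "Luggie", "Species": "Dog", "Age": 5 }
]

# Read as: collect the name of each animal that is a five year old dog
fiveYearOldDogs = [
    animal["Name"]
    for animal in listOfAnimals
    if animal["Species"] == "Dog" and animal["Age"] == 5
]

print(fiveYearOldDogs)
```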

Functions

def FUN_NAME( ARGUMENTS ):
    BODY

Functions

#!/usr/bin/python3

# repeater will return a textToRepeat repeated timesToRepeat times.
# If timesToRepeat is not provided, a default value of 5 is used
def repeater( textToRepeat, timesToRepeat=5 ):
    return textToRepeat * timesToRepeat

repeatedString = repeater("test", 3)

print(repeatedString)
testtesttest

Functions

#!/usr/bin/python3

listOfAnimals = [ 
        { "Name": "Viggo", "Species": "Dog", "Age": 5 },
        { "Name": "Mat", "Species": "Dog", "Age": 6 },
        { "Name": "Oliver", "Species": "Cat", "Age": 5 },
        { "Name": "Luggie", "Species": "Dog", "Age": 5 }
]

def isLuggie( animal ):
    return animal["Species"] == "Dog" and animal["Age"] == 5 and animal["Name"] == "Luggie"

num = 0

for animal in listOfAnimals:
    num += 1
    if isLuggie(animal):
        print(f"Found {animal['Name']} he is number {num} in the list")
Found Luggie he is number 4 in the list

Sorting lists

list.sort()
list.sort( reverse=True )
list.sort( key=FUNCTION )
#!/usr/bin/python3

myList = [ 2, 6, 1 ,4 ]
myList.sort() # Sort will sort the list "in place"
print(myList)

myList = [ 
    { 'Age': 5, 'Name': 'Viggo' },
    { 'Age' : 2, 'Name': 'Mat' }
]

def getAge(animal):
    return animal['Age']

myList.sort( key=getAge)

print(myList)
[1, 2, 4, 6]
[{'Age': 2, 'Name': 'Mat'}, {'Age': 5, 'Name': 'Viggo'}]
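When the original order should be kept, the built-in sorted() returns a new list instead of sorting in place. The sketch below also shows a lambda as a short key function and a tuple key for breaking ties; the third animal is added only to create a tie.

```python
animals = [
    {'Age': 5, 'Name': 'Viggo'},
    {'Age': 2, 'Name': 'Mat'},
    {'Age': 5, 'Name': 'Luggie'},
]

# sorted() returns a new list and leaves the original untouched
oldestFirst = sorted(animals, key=lambda animal: animal['Age'], reverse=True)

# A tuple key sorts by age first, then by name to break ties
byAgeThenName = sorted(animals, key=lambda animal: (animal['Age'], animal['Name']))

print([a['Name'] for a in oldestFirst])
print([a['Name'] for a in byAgeThenName])
```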

The end