Awk: `BEGIN {` Part 1

The other day, I was watching Bryan Cantrill’s 2018 talk, Rust, and Other Interesting Things, and he made an offhanded comment while discussing values of different programming languages and communities. He said, “If you get the awk programming language manual…you’ll read it in about two hours and then you’re done. That’s it. You know all of awk.”

Only two hours to learn an entire language?! …. Challenge accepted!

I had previously used snippets of awk here and there. Most of them were given to me by Stack Overflow answers when googling for niche data file manipulations. But, I did not know enough to successfully write an awk program from scratch. I definitely did not have a real grasp on the language nor its power. And, a couple of hours sounded like a relatively small time investment to learn what Bryan Cantrill said was a language he used three times a day.

It turns out it takes more than two hours to learn awk, and I am by no means an expert… yet (growth mindset!). But, I now know enough to write a little about the essentials. Here goes!

What is awk useful for?

Awk is useful for data file manipulation. Already, having used it for a few days only, I wish I had invested time in learning it earlier. My usual workflow when encountering a data file is to import it into Google Sheets and use their builtin functions. If those weren’t enough, I would write little code snippets to somewhat awk..wardly get the information I want. Awk is way more powerful than what I was doing before. Let’s take a look:

Running awk programs

If we’re going to learn awk, we need to know how to run an awk program. The syntax to run an awk program in a shell is:

$ awk 'awk_program_contents' data-file-1 data-file-2

For longer programs, we can save the awk code in a file instead of writing it inline, and then pass that file to awk with -f:

$ awk -f awk-program.awk data-file-1 data-file-2
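As a minimal sketch of the -f form, the snippet below writes a one-rule program to a scratch file and runs it against a two-line data file. The file names (fruit.txt, label.awk) and the sample data are made up for illustration; $0, which shows up properly later in this post, just means "the whole line".

```shell
# Hypothetical scratch files, created only for this demonstration.
tmpdir=$(mktemp -d)
printf 'apple 3\nbanana 5\n' > "$tmpdir/fruit.txt"

# The program file holds the same text we would otherwise put inside the quotes.
printf '%s\n' '{ print "saw:", $0 }' > "$tmpdir/label.awk"

labelled=$(awk -f "$tmpdir/label.awk" "$tmpdir/fruit.txt")
echo "$labelled"

rm -r "$tmpdir"
```

Keeping the program in its own file also sidesteps shell quoting, which gets fiddly once the awk code contains quotes of its own.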

awk program contents

Well, what is an awk program? We know it is best used for simple data reformatting or manipulation. The way it does this is by performing different actions on lines that match different patterns within a data file. The basic syntax of an awk program is built from these patterns and actions.

pattern { action }
pattern { action }
...

We can give as many pattern { action } pairs as we want. Each pair will be executed independently of the others. This means if a line matches more than one pattern, it will have more than one corresponding action. In the example above we use newlines to separate distinct pairs. Similar to bash, we can also use ; to separate commands and put everything on one line: pattern { action }; pattern { action }
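Here is a small sketch of the "more than one pattern" case. The input numbers are made up, and the data is piped in rather than read from a file (awk reads standard input when no data file is given). The line 12 satisfies both patterns, so both actions run for it.

```shell
# Two independent pattern { action } pairs, one per line.
matches=$(printf '3\n12\n7\n' | awk '
  $1 > 5      { print $1, "is greater than 5" }
  $1 % 2 == 0 { print $1, "is even" }
')
echo "$matches"
```

The line 3 matches neither pattern and produces no output, 7 matches only the first, and 12 triggers both actions in the order the pairs are written.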

awk with data files

But, it turns out awk is much more useful (and fun!) with a data file. The UN has a few publicly available datasets. I picked this one on education at the primary, secondary and tertiary levels to delve into first.

Let’s start by using awk to get a sense of what the data looks like. NR is a predefined variable that holds the number of records (by default, lines) read so far. We can use it to look at the first few lines of a file. In this case, our pattern will be NR <= 5, and by not including an action, the implied action will be print:

$ awk 'NR <= 5' education.csv
T07,"Enrolment in primary, secondary and tertiary education levels",,,,,
Region/Country/Area,,Year,Series,Value,Footnotes,Source
1,"Total, all countries or areas",2005,Students enrolled in primary education (thousands),"678,991.6070",,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
1,"Total, all countries or areas",2005,Gross enrollement ratio - Primary (male),104.9360,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
1,"Total, all countries or areas",2005,Gross enrollment ratio - Primary (female),99.9214,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."

Okay, so looks like this is giving us a bit of information about our file. Notably:

  1. There are two header rows: a title row, and a row telling us what the fields are
  2. The file is comma separated
  3. … except there are sometimes commas within double quoted strings: "Total, all countries or areas"

Let’s address these one by one!

We can ignore the first two header rows by using our nifty NR moving forward. We can pattern match that NR > 2. Note: awk is 1-indexed.

$ awk 'NR > 2' education.csv
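To see the header-skipping pattern without downloading anything, here is a sketch on a made-up five-line input. It also shows that conditions on NR can be combined with &&.

```shell
# Made-up five-line input: two header rows, then three data rows.
sample=$(printf 'title row\ncolumn names\nrecord A\nrecord B\nrecord C\n')

# Skip both headers, keep everything else.
no_headers=$(printf '%s\n' "$sample" | awk 'NR > 2')

# Combine conditions with &&: only the first two data rows.
first_two=$(printf '%s\n' "$sample" | awk 'NR > 2 && NR <= 4')

echo "$no_headers"
echo "---"
echo "$first_two"
```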

Field Separators

awk’s default field separator is whitespace: any run of spaces or tabs counts as a single separator. We can actually see this by printing the first field. To access the value of a field, we use $. So $1 is the first field, $2 the second, and so on. $0 refers to the entire row.

If we try this:

$ awk 'NR <= 5 { print $1 }' education.csv
T07,"Enrolment
Region/Country/Area,,Year,Series,Value,Footnotes,Source
1,"Total,
1,"Total,
1,"Total,

we can confirm that we’re splitting on spaces. awk has the option to specify a different field separator with the -F 'separator' flag:

$ awk -F ',' 'NR <= 5 { print $1 }' education.csv
T07
Region/Country/Area
1
1
1
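A side-by-side sketch of the two behaviours on a single made-up line. Note how the default separator treats the run of three spaces as one separator.

```shell
line='alpha   beta,gamma delta'

# Default separator: fields are alpha / beta,gamma / delta
default_second=$(printf '%s\n' "$line" | awk '{ print $2 }')

# Comma separator: fields are "alpha   beta" / "gamma delta"
comma_second=$(printf '%s\n' "$line" | awk -F ',' '{ print $2 }')

echo "default: $default_second"
echo "comma:   $comma_second"
```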

Great! But…. we had commas embedded within strings with double quotes. Sure enough, if we print the second field ($2), we see:

$ awk -F ',' 'NR <= 5 { print $2 }' education.csv
"Enrolment in primary

"Total
"Total
"Total

Hmmm. What we want here is to define fields by their content rather than by the separator between them. Standard awk can't do that, but gawk (GNU awk) can, with FPAT! From the gawk manual: “All properly written awk programs should work with gawk. So most of the time, we don’t distinguish between gawk and other awk implementations.”

Sounds like we can use gawk here instead, then. Let’s try pattern matching. I’m not going to go into regex here, but the pattern we want, "[^,]*|\"[^\"]+\"", matches either a run of non-comma characters (possibly empty) or a double-quoted string: an opening double quote, one or more non-quote characters, and a closing double quote. The inner quotes are escaped with backslashes because the pattern sits inside an awk string:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" } NR <= 5 { print $2 }' education.csv
"Enrolment in primary, secondary and tertiary education levels"

"Total, all countries or areas"
"Total, all countries or areas"
"Total, all countries or areas"

I snuck a BEGIN in there without explaining it. Let’s go on a brief tangent…

BEGIN { tangent }

Beyond the pattern and actions, awk also has a concept of BEGIN and END blocks. The BEGIN is executed before any of the data is processed. It can be useful for declaring variables or printing text to appear at the beginning. Analogously, the END is executed after the data is processed. It can be useful for performing manipulations on aggregates of the data, like averaging a sum.

This means if we wanted to write a little “Hello, awk!” program, we could do it without even needing a data file.

$ awk 'BEGIN { print "Hello, awk!" }'
Hello, awk!
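And here is a sketch of BEGIN, a regular rule, and END working together to do the "averaging a sum" mentioned above. The three input values are made up, and sum is just a variable name chosen for this example.

```shell
report=$(printf '10\n20\n60\n' | awk '
  BEGIN { print "values seen:" }              # runs once, before any input
        { sum += $1; print "  " $1 }          # runs for every line
  END   { if (NR > 0) print "average:", sum / NR }  # runs once, after all input
')
echo "$report"
```

awk variables start out empty (treated as 0 in arithmetic), so sum needs no declaration. The NR > 0 check avoids a division by zero on empty input.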

END { tangent }

…back to our example. In our case, we used a BEGIN block to declare the FPAT before reading our data file.

But, we’ve only looked at the first 5 lines. For all we know, the rest of the file could look completely different. Let’s use NR again to see some more of the file. First, let’s figure out how long the file is. We can use the END block here. After we’ve parsed the whole file, we can see what the value of NR is, and that’ll tell us how many lines it is:

$ awk 'END { print NR }' education.csv
8630

Okay, so maybe if we print every 500 lines, we’ll get a sense of what data we’re looking at. We can set our pattern to be only if NR is a multiple of 500:

$ awk 'NR % 500 == 0' education.csv
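To try the sampling pattern without the dataset, this sketch generates 2000 made-up rows with one awk program and samples them with another. Printing NR next to each line makes it easy to see which rows were picked.

```shell
# First awk: generate "row 1" .. "row 2000". Second awk: keep every 500th.
sampled=$(awk 'BEGIN { for (i = 1; i <= 2000; i++) print "row " i }' |
  awk 'NR % 500 == 0 { print NR ": " $0 }')
echo "$sampled"
```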

… and I’m going to leave this blog post on a real cliffhanger. Mostly because it already feels too long! There’s a second post about awk that actually looks at the data and manipulates it to figure out which countries have stark differences in the number of males and females that they educate.

TL;DR or TL;Skimmed far enough to get here, please give me the shorter version:

To rehash what we’ve learned about awk:

  • awk is run using awk 'awk_program' data-file
  • awk programs have the form pattern { action }; pattern { action };
  • BEGIN blocks are executed before reading data files
  • END blocks are executed after reading data files
  • NR is a variable that tells us the number of records (lines, by default) read so far
  • -F 'separator' is how we can define a field separator for a file
  • Whitespace (runs of spaces or tabs) is the default field separator
  • FPAT="..." is a way to use regex to define a pattern for each field
  • FPAT is only defined in gawk
