A Crash Course In AWK

Bill Phillips

A while back, MarkD wrote a great series of posts on DTrace. I'd never been exposed to DTrace—I assumed it was similar to strace. It's a whole other animal, though—an event-based engine suitable for everything from debugging to systems scripting.

The coolest thing was that the most powerful aspects of DTrace come from its wholesale copying of the programming model of one of my favorite UNIX tools: AWK. If I'm hacking together something on the command line, chances are good that I'm using AWK for some part of it. AWK is much simpler than DTrace—it's a general-purpose tool built around one big idea.

There are plenty of great resources on how to use AWK. Rather than write another one, this short post will show you the basics of what AWK is, and what it's good for. You'll need to know some command line basics, as well as what a regular expression is. All the examples in this post were written assuming an OS X environment.

Not An Operation—A Programming Model

They say that the UNIX way is to compose together small tools that do one thing well. AWK definitely does that, but not in the same way as head or tail do.

Let me show you what I mean. Let's say that I have a little text file that is an inventory of all my worldly possessions:

bash-3.2$ cat inventory
beans and celery
beans and oatmeal
beans and beans
quinoa

Even if you've never seen the head command before, the following example will probably make sense:

bash-3.2$ cat inventory | head -1
beans and celery

AWK is different. If you saw this next example in a shell script, you'd have a hard time knowing what it meant without reading up on AWK:

bash-3.2$ cat inventory | awk '/oatmeal/ { print $1 ": featuring " $3 }'
beans: featuring oatmeal

That's because AWK's job isn't to do one small thing. It's to allow you to use one small idea: event-based programming.

Event-based Programming

In a normal procedural shell script or command line session, you're telling the computer to do a sequence of things in a specific order. That's not how AWK works. In AWK, you tell the computer how to look for events, and then tell it what to do when it finds an event you're interested in.

Let's take another look at that AWK program. This time, I'll format it a bit more nicely:

/oatmeal/ {
    print $1 ": featuring " $3;
}

The first part of this program — /oatmeal/ — is the event that you're looking for. Events can be specified in a few different ways: you can use a C-style conditional expression, or a special event like BEGIN that is triggered before the first line is read. However, the most common kind of event to see is a regular expression event, which is what /oatmeal/ is. If "oatmeal" appears in a line of text, then our event will be triggered.
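Here is a rough sketch of those three kinds of events side by side. It leans on two built-in AWK variables this post doesn't otherwise use: NF (the number of fields on the current line) and NR (the current line number). The inventory lines are fed in with printf so the example stands on its own:

```shell
printf 'beans and celery\nbeans and oatmeal\nbeans and beans\nquinoa\n' | awk '
# Special event: fires once, before the first line is read.
BEGIN    { print "inventory:" }
# Conditional expression: fires on lines with exactly one field.
NF == 1  { print "one word: " $0 }
# Regular expression: fires on lines containing "celery".
/celery/ { print "celery on line " NR }
'
```

This should print "inventory:", then "celery on line 1", then "one word: quinoa".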

The action is the second part of this program, the part between the braces. This part is a procedural set of instructions to perform when your event occurs. Here, you have a small C-like programming language at your disposal, with for/while loops, if statements, and global variables.
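To give a feel for that action language, here is a small sketch that counts how many times the word "beans" appears in the inventory. The count variable is my own name for illustration, and the END event (the counterpart of BEGIN, fired after the last line) isn't covered elsewhere in this post:

```shell
printf 'beans and celery\nbeans and oatmeal\nbeans and beans\nquinoa\n' | awk '
/beans/ {
    # Walk every field on the line; NF is the number of fields.
    for (i = 1; i <= NF; i++) {
        if ($i == "beans") {
            count++    # global variable; starts out empty, which counts as 0
        }
    }
}
END { print "beans: " count }
'
```

With the four inventory lines above, this should print "beans: 4".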

When AWK runs your program, it will read each line of input in, one after the other. Each time it reads in a line, it will see if your event occurred. If it has, then it performs your event's action. You can define as many events as you like. If more than one event occurs, each event's action is performed in the order they appear in your program.
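A quick way to see that ordering is to give AWK two events that can both fire on the same line. The "A:" and "B:" labels here are just for illustration:

```shell
printf 'beans and celery\nbeans and oatmeal\nquinoa\n' | awk '
/beans/   { print "A: " $0 }
/oatmeal/ { print "B: " $0 }
'
```

The second line matches both events, so it should be printed twice, first by A and then by B. The quinoa line matches neither and produces no output.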

Here's a slightly more complicated example: an implementation of FizzBuzz on the command line, using seq and awk: (updated: now correct! I should know what FizzBuzz is before I write it. -Bill)

bash-3.2$ seq 1 100 | awk '
> ($1 % 3 == 0) {
>     printf("Fizz");
> }
> ($1 % 5 == 0) {
>     printf("Buzz");
> }
> ($1 % 3 != 0 && $1 % 5 != 0) {
>     printf($1);
> }
> {
>     printf("\n")
> }'

Simple String Processing

Our first AWK program didn't use any loops or conditionals, but it did use a couple of other features specific to AWK. Here's our first action again:

print $1 ": featuring " $3;

Since AWK is mainly used for wrangling text, it automatically does a bit of that work for you. It splits each line of text up into whitespace-separated words and stashes them in variables named $N, where N is the index of the word starting from 1 ($0 gives you the entire line of text).
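As a small illustration of that splitting, this one-liner has no event at all, which means the action runs on every line:

```shell
printf 'beans and oatmeal\n' | awk '{ print $1; print $3; print $0 }'
```

It should print "beans", then "oatmeal", then the whole line, "beans and oatmeal".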

AWK also makes it easy to paste two strings together. All you have to do is put them next to one another. So the line of code above pastes together $1 ("beans"), ": featuring ", and $3 ("oatmeal").
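One thing that tends to trip people up is the difference between putting values side by side and separating them with a comma in print. A short sketch of both:

```shell
printf 'beans and oatmeal\n' | awk '{
    print $1 $3                  # adjacent values are pasted together
    print $1, $3                 # a comma puts the output separator (a space) between them
    print $1 ": featuring " $3   # string literals paste together the same way
}'
```

This should print "beansoatmeal", "beans oatmeal", and "beans: featuring oatmeal".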

Getting Fancy: Multiple Events And Variables

Lots of AWK scripts do little more than look for a particular line in a file and print out a specific field, but you can use it to do simple parsing of structured text, too. For example, as an Android developer, I'm often working with XML layout files that look like this:

<FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"
  android:layout_width="match_parent"
  android:layout_height="match_parent"
  >

  <android.support.v4.view.ViewPager
    android:id="@+id/fragment_pager_viewPager"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:padding="24dp" />

</FrameLayout>

In my Java code, nine times out of ten I'm going to want to pull out a reference to the ViewPager I defined above by writing a line of code like this:

final ViewPager viewPager = (ViewPager)v.findViewById(R.id.fragment_pager_viewPager);

Now, if I were writing production code that translated that XML into that line of Java code, I'd want to use a real programming language with a real XML parser to avoid any parsing pitfalls. If I'm writing a tool for myself, though, that doesn't sound like a lot of fun. It'd also be nice to be able to parse sloppy input, like a small fragment of the XML file containing just a few views. That won't work with a beefier XML parser, which will yell at me if it doesn't receive perfect input.

In its own slapdash way, AWK is handy with this kind of thing. I've got an AWK script I use for just this task. It uses the gensub function, which is specific to gawk. (You can install gawk with Homebrew or MacPorts if you're on a Mac.) Here's the script:

#!/usr/bin/env gawk -f

BEGIN {
    # appropriate for an onCreateView
    spacing = "        ";
}

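/<[a-zA-Z]/ {
    # gensub takes (regex, replacement, how, target); digits are allowed so "v4" packages parse.
    tagName = gensub(/^.*<([a-zA-Z0-9_.]*\.)?([a-zA-Z0-9_]*).*/, "\\2", 1, $0);
}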

/android:id=\"@\+id\// {
    rawId = gensub(/^.*:id=\"@\+id\/([a-zA-Z0-9_]*)\".*/, "\\1", 1, $0);
    fieldName = gensub(/^.*_/, "", 1, rawId);
    if (tagName == "include") {
        tagName = "View";
    }
    print spacing "final " tagName " " fieldName " = (" tagName ")v.findViewById(R.id." rawId ");"
}

This script has three events. The first one, BEGIN, happens before processing any text. It defines the amount of leading whitespace, which we'll need later on.

The second event looks for opening XML tags. Whenever it finds one, it uses the gensub function to pull out the last part of the class name with regex matching. It then stashes that class name in the tagName variable. So tagName will always store the last class name we read in.
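If gawk isn't handy, the same extraction can be approximated with the standard sub function, which edits a variable in place. This is a sketch of an alternative, not the script above, and it assumes tag names contain only letters, digits, underscores, and dots:

```shell
printf '  <android.support.v4.view.ViewPager\n' | awk '
/<[a-zA-Z]/ {
    tagName = $0
    sub(/^.*</, "", tagName)                # drop everything through "<"
    sub(/[^a-zA-Z0-9_.].*$/, "", tagName)   # drop anything after the tag name
    sub(/^.*\./, "", tagName)               # drop the package prefix
    print tagName
}'
```

For the ViewPager line from the layout above, this should print "ViewPager".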

The last event looks for the android:id attribute we're interested in. When this happens, we should spit out a line of Java code. We can do that by using gensub again, first to pull out the id, then to strip out the underscored portion to get our variable name.
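The two-step id handling can be tried in isolation. The sketch below uses the standard sub function rather than gensub so that it runs in any awk, which makes it an approximation of the script's logic rather than a copy of it:

```shell
printf '    android:id="@+id/fragment_pager_viewPager"\n' | awk '
/android:id="@\+id\// {
    rawId = $0
    sub(/^.*:id="@\+id\//, "", rawId)   # drop everything through "@+id/"
    sub(/".*$/, "", rawId)              # drop the closing quote and the rest
    fieldName = rawId
    sub(/^.*_/, "", fieldName)          # keep what follows the last underscore
    print rawId, fieldName
}'
```

This should print "fragment_pager_viewPager viewPager": the full id, followed by the variable name derived from it.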

This script isn't perfect—it's easy to create an XML file that will break it. As long as the XML looks like the kind of XML my team writes, though, it's great.

I'm Sold, Bill. Where Can I Buy An Awk?

There's a little bit more to AWK than I've covered here, but those are the basics. If you're interested in more, check out Bruce Barnett's tutorial and short reference.
