Oliver Baumann


Three patterns for speedy data processing with AWK

Many a time have I been bitten by data that is too large to fit into memory. Whether it’s filtering one dataset against another, or tabulating counts of things, reading 40 GB of data into 16 GB RAM is no bueno.

So, without further ado, let’s dive into three patterns that I’ve figured out during my recent foray into learning AWK.

Filtering

I really like dplyr’s filtering joins, specifically, the anti_join. Whereas joining usually involves gluing together two separate frames on a common field, the anti-join removes records from one frame that occur in another:

df <- readr::read_csv("HUGE.csv")
exclude <- readr::read_csv("exclude.csv")

df |>
    dplyr::anti_join(exclude, by = dplyr::join_by(id))

The result is all rows from df that are not in exclude. Now, if HUGE.csv is a misnomer and only a couple of gigabytes, this may work well. But if it’s larger than the memory available to you, you guessed it: no bueno.

Instead, let’s look at how we can construct such a filter in AWK:

#!/usr/bin/env -S awk -f
BEGIN {
  FS = ",";
  exclude[""] = 0;
}

NR == FNR {
    exclude[$1] = 0;
    next;
}
!($1 in exclude) { print $0; }
# the default action in AWK is to print, so the previous line can be shortened
# to simply:
# !($1 in exclude)

We initialize an exclude-array that will hold, as keys, the IDs we want to drop. The values are simply set to 0; we won’t need them. NR == FNR is a nifty AWK-trick for processing two files at once: NR tracks the total number of records read so far, whereas FNR tracks the number of records for the current file and is reset for every new file. So this clause will be true for the first file, but not for any subsequent ones. We populate exclude using the first column of the first file.
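
If the NR == FNR comparison feels like magic, a throwaway one-liner makes it visible. Here, a.txt and b.txt stand in for any two small files of two lines each:

$ awk '{ print FILENAME, NR, FNR }' a.txt b.txt
a.txt 1 1
a.txt 2 2
b.txt 3 1
b.txt 4 2

NR == FNR holds only while the first file is being read.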

Then, once we get to the second file, we print the current line only if the first column does not appear in the exclude-array.

Save this as filter.awk, make it executable with chmod +x filter.awk, then run it on the files with ./filter.awk exclude.csv HUGE.csv, potentially redirecting output via > result.csv.
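
To convince yourself it works, here is a toy run (the file contents, IDs, and values are made up purely for illustration):

$ cat exclude.csv
2
5
$ cat HUGE.csv
1,apple
2,banana
3,cherry
5,durian
$ ./filter.awk exclude.csv HUGE.csv
1,apple
3,cherry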

As a bonus, if you want to achieve the inverse, i.e., only keep the records that appear in another file, all you have to do is change the last line to $1 in exclude.
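
For completeness, here is that keep-only variant as its own small script (the name keep.awk is my own suggestion); it mirrors filter.awk, with the last pattern flipped:

#!/usr/bin/env -S awk -f
BEGIN {
    FS = ",";
}

NR == FNR {
    keep[$1] = 0;
    next;
}
$1 in keep

Run it the same way, passing the list of IDs to keep as the first file.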

Bueno :)

Fan-out

Imagine you have a large file, and you want to count how often a certain record appears; in SQL, this is COUNT(*) with GROUP BY. For illustration, let’s say each line in your file is a user clicking on an item, and you want to count how many distinct items each user clicks on (so really a COUNT(DISTINCT item) grouped by user).

In Python, you’d typically keep track of users and items in a dict, with user-IDs as the keys pointing to a set of item-IDs:

from collections import defaultdict

counter = defaultdict(set)
with open(file, 'r') as f:
    for line in f:
        parts = line.strip().split(',')  # strip the trailing newline before splitting
        user = parts[0]
        item = parts[1]
        counter[user].add(item)

Looks good! However, with a “large” number of users, this exhausts the available memory. No bueno.

Instead, what if we wrote some code that would process the data sequentially, and immediately wrote the results out to another file? That way, we wouldn’t have to allocate memory to track anything. Then, in a second step, we could process all the small files in parallel and accumulate.

An AWK-script I recently came up with does just this: the first column is the user-ID, the fourth the item-ID, and the count is set to a placeholder value:

BEGIN {
    FS = ",";
    OFS = ",";
}
{
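    # one output file per user-ID; note that ">>" appends, so clear out any
    # old user-*.csv files before re-running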
    outfile = "user-" $1 ".csv";
    print $1, $4, "1" >> outfile;
}

Now you may be wondering why we do the file redirection from within AWK: couldn’t we just redirect in the shell? Not really! We want one file per user, and which user we’re processing only becomes evident by looking at the first column; a single shell redirection would dump everything into one file.

Fan-in and filter could easily be combined into a single AWK-script. I believe this could also be parallelized with e.g. GNU parallel; I’m thinking along the lines of cat HUGE.csv | parallel --pipe filter.awk. Maybe I’ll try that out; if I do, I’ll report back!

We now have several files, each containing the items a user interacts with. What we’re still missing is the tabulation. Almost bueno.
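
To make that concrete, one of those per-user files might look roughly like this (the script name, input file, and IDs here are placeholders):

$ awk -f fan-out.awk clicks.csv
$ cat user-42.csv
42,1001,1
42,1007,1
42,1001,1

Note the repeated item 1001: the user clicked it twice, and it is precisely the job of the next step to count it only once.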

Fan-in

The last pattern is really the second half of the previous pattern. I haven’t put much thought into it, but I get the feeling fan-out isn’t worth much without a matching fan-in.

Now this is where GNU parallel does come in handy. We have a handful of files, each independent of the others: why not handle several of them at the same time?

What we can parallelize over are the individual files, so let’s look at what we need to do for each file:

  • read a line
  • tally the item-ID
  • when done, write the tally to a central sink-file

Easy peasy lemon squeezy:

#!/usr/bin/env -S awk -f
BEGIN {
    FS = ",";
    OFS = ",";
}
{
    items[$2] = 0;
}
END {
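    # $1 still holds the user-ID of the last record read, which is the same
    # on every line of a per-user file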
    print $1, length(items) >> "sink.csv";
}

Note two things. For one, we don’t initialize the items array with an empty string as we did in Filtering, since we won’t be accessing the array by key (and the empty key would result in an additional item reported by length). For another, we can only write our results once we’ve processed the entire file, which is why the print has to happen in END: only then do we know how many distinct items the user interacted with.
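
The empty-key gotcha is easy to demonstrate with a one-liner:

$ awk 'BEGIN { items[""] = 0; items["A"] = 0; print length(items) }'
2

The empty string counts as a perfectly ordinary key, so the reported length would be one too high.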

Now the cherry on top: with the fan-in script above saved as process.awk (and made executable, like before), all you need to do in order to max out your cores is pass all the files to parallel:

# the user-*.csv files were created in "fan-out"
find . \
      -type f \
      -name "user-*.csv" \
      -print0 \
    | parallel \
      -q \
      -0 \
      --eta \
      ./process.awk

This will locate all your user-item-interaction files and hand them over to GNU parallel, which will distribute them fairly across as many process.awk-instances as your CPU can handle (by default, one job per core).
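
Once everything has run, sink.csv holds one user,count line per input file, so a plain sort answers questions like "who touched the most items?" (the -k2,2nr key sorts numerically, descending, on the count column):

$ sort -t, -k2,2nr sink.csv | head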

Bueno :)

Conclusion

Three simple AWK patterns that require a manageable amount of memory and are mostly trivial to parallelize: what’s not to like?

To be fair, I had read about replacing fancy data processing pipelines with simple Unix utilities some time ago, but forgot about it because I didn’t understand what AWK was capable of. However, once I’d exhausted my knowledge of R, Python, and the multicore-capabilities of each, I had to take a step back and figure out another approach. So I returned to the post above, and primed myself with the bare minimum of AWK-syntax using this wonderful compendium. Mirco suggested this resource, which I’ve only skimmed, but it looks like a treasure trove of snippets!

My two take-aways from all this are these:

  1. there is beauty (and speed!) in stupidly simple approaches composed of primitive building blocks, and
  2. know your Unix utilities.
