Run regular expressions over a large text dataset through Python scripts
This is the second article in a series that walks through how to use Meadowrun to quickly run regular expressions over a large text dataset, using English-language Wikipedia as our example data set. If you want to follow along with this second article, you’ll need the simplified article extracts that we produce in the first article. Alternatively, it should be pretty simple to translate the code in this post to work with any data set in any data format.
Background and motivation
This series is inspired by past projects at hedge funds for parsing credit card transaction data. To oversimplify quite a bit, we would look at the description fields of each transaction to see if it matched a tradeable public company, add up the amounts for all the transactions for each company, and use that to try to predict revenue for each company.
There were a lot of challenges to making these revenue forecasts accurate. The problem we’ll focus on this article is that for some reason (which patio11 could probably explain in depth), the description field for credit card transactions would come to us completely garbled. For a company like McDonald’s, we would see variations like
mcd, and even misspellings and typos like
mcdnalds. Our solution was to create regular expressions that covered all of the common variations of all of the brands of each company we were interested in.
This dataset was about a terabyte uncompressed, and we had hundreds of regular expressions, which meant that we needed two main pieces of infrastructure: a really fast regular expression library and a distributed computation engine. For regular expressions we used Hyperscan which we’ll introduce here. Our distributed computation engine isn’t publicly available, but we’ll introduce Meadowrun which works in a similar way.
The credit card dataset we used obviously isn’t available publicly as well, so we’ll use English-language Wikipedia as a stand-in (~67 GB uncompressed), as the goal of this article is to walk through the engineering aspects of this analysis.
Getting up to speed
If you didn’t follow along with the first article in this series, you should be able to follow this article with your own dataset as long as you install smart_open and Meadowrun. smart_open is an amazing library that lets you open objects in S3 (and other cloud object stores) as if they’re files on your filesystem, and Meadowrun makes it easy to run your Python code on the cloud.
poetry add smart_open[s3]poetry add meadowrun
poetry run meadowrun-manage-ec2 install
We’ll assume you’re using Poetry here, but feel free to use any environment manager. We’ll also assume you’ve configured your AWS CLI. (See the docs for more context, as well as for using Meadowrun with Azure, pip, and/or conda.)
Some realistic data
We’ll use simplified extracts of English-language Wikipedia as generated by the code in the first article in this series. As a quick overview, that produces 222 files like s3://wikipedia-meadowrun-demo/extracted-200000.tar.gz, which is a tar.gz file containing the 200,000th through 299,999th Wikipedia article. The filenames are the titles of the articles and the contents of each file are the text of the corresponding article. We’ll need a little function that can read one of these tar files, iterate_extract.
For our regular expressions, we’ll do a simplified version of what I describe above and just take the names of the companies in the S&P500, which we get, of course, from Wikipedia. (If you’re curious, this is the 1379418th article in the 2022–06–20 index, so you can also find it in s3://wikipedia-meadowrun-demo/extracted-1300000.tar.gz.) I used a little bit of elbow grease to get this into a file called companies.txt, and the first few lines look like:
A. O. Smith
We’ll upload this to S3 with
aws s3 cp companies.txt s3://wikipedia-meadowrun-demo/companies.txt, and use this bit of code to turn this into a simple regex:
This will give us a regex like
3M|A.O. Smith|Abbott|... which lets us look for any occurrence of these company names.
Regex engines: re vs re2 vs Hyperscan
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. (Jamie Zawinski)
If you haven’t spent too much time in the world of regular expressions, you’ll probably start with the re standard library — Python comes with batteries included, after all. Let’s see how this does. We’ll use Meadowrun’s
run_function to run some exploratory code directly on EC2:
scan_re will just extract the first 100 articles from the specified .tar.gz file and run our company name regex on those articles, and tell us how many bytes per second it’s able to process.
scan_re_ec2 uses Meadowrun to run
scan_re on an EC2 instance — here we’re asking it to start an EC2 instance that has at least 1 CPU and 2 GB of memory, and that we’re okay with spot instances up to an 80% chance of interruption. We could just run
scan_re locally, but this will actually be faster overall because downloading data from S3 is significantly faster from EC2 than over the internet. In other words, we’re sending our code to our data rather than the other way around.
Running this gives us:
Scanned 1,781,196 bytes in 10.38 seconds (171,554 B/s)
So for this regular expression we’re getting about 170 KB/s. We have about 67 GB, so distributing this over many EC2 instances is still going to be slow and expensive.
Let’s try re2, which is a regular expression engine built by Google primarily with the goal of taking linear time to search a string for any regular expression. For context, python’s built-in re library uses a backtracking approach, which can take exponential time to search a string. re2 uses a Thompson NFA approach, which can guarantee the linear time search, but offers fewer features.
The official python package on PyPI is google-re2, but pyre2 has nicely pre-compiled wheels for Windows and Mac as well (in addition to Linux):
poetry add pyre2
pyre2 is designed as a drop-in replacement, so we can just replace
import re with
import re2 as re in our last snippet and rerun to see what re2’s performance is like. Note that Meadowrun is automatically syncing the changes to our code and environment on the remote machine, and we don’t have to worry about manually rebuilding a virtual environment or a container image with our new re2 library in it.
Rerunning with re2 and 10,000 articles, we get:
Scanned 190,766,485 bytes in 6.94 seconds (27,479,020 B/s)
A massive speedup to 27 MB/s!
Our last stop is Hyperscan, which is a regular expression engine originally built with an eye towards deep packet inspection by a startup called Sensory Networks which was acquired by Intel in 2013. Hyperscan has a ton of really cool parts to it — there’s a good overview by a maintainer Geoff Langdale, and the paper goes into more depth. I’ll just highlight one of my favorites, which is its extensive use of SIMD instructions for searching strings.
There’s a compiled version of Hyperscan with python bindings in pypi thanks to python-hyperscan. python-hyperscan only has pre-built wheels for Linux, but that’s fine, as the EC2 instances Meadowrun creates for us are running Linux. We can even tell Poetry to just install Hyperscan if we’re on Linux, as installing it on Windows or Mac will probably fail:
poetry add hyperscan --platform linux
(Pip requirements.txt files also support “environment markers” which allow you to accomplish the same thing.)
Hyperscan’s API isn’t a drop-in replacement for re2, so we’ll need to adjust our code:
scan_hyperscan shows a really basic usage of the python-hyperscan API, and
scan_hyperscan_ec2 is doing the same thing of using Meadowrun to run this on EC2. Running this gives us:
Scanned 190,766,485 bytes in 2.74 seconds (69,679,969 B/s)
Which is another solid improvement on top of re2.
Putting it all together
Now we can use Meadowrun’s
run_map to run this over all of Wikipedia:
run_map has similar semantics to python’s built-in map function. If you haven’t seen it before,
map(f, xs) is roughly equivalent to
[f(x) for x in xs].
run_map is doing the same thing as
map but in parallel on the cloud. So we’re requesting the same CPU/memory per task as before, and we’re asking Meadowrun to start enough EC2 instances that we can run 64 tasks in parallel. Each task will run
scan_hyperscan on a different extract file, although we’re actually going over the dataset twice to synthetically make the dataset a bit larger.
The exact instance type that Meadowrun selects will vary based on spot instance availability and real-time pricing, but in an example run Meadowrun prints out:
Launched 1 new instance(s) (total $0.6221/hr) for the remaining 64 workers:
ec2-18-117-89-205.us-east-2.compute.amazonaws.com: c5a.16xlarge (64 CPU/128.0 GB), spot ($0.6221/hr, 2.5% chance of interruption), will run 64 job/worker
The doubled Wikipedia data set is about 135GB and searching for any occurrence of a company name in the S&P 500 takes about 5 minutes and only costs us about 5 cents!
- Hyperscan can be a bit annoying to use, as it has a different API from re and there aren’t any precompiled versions for Windows or Mac (it is possible to compile for these platforms, it just requires a bit of work). In these very unscientific benchmarks it’s “only” 2.5x faster than re2, but in my experience it’s worth it because as your regular expressions get larger and more complicated, the gap in performance compared to re2 gets larger. And it’s really quite a marvel of both computer science theory and engineering.
- Meadowrun makes it easy to use really powerful machines in EC2 (or Azure) to process your data. Obviously, for this exact workload it would be faster to use Google or Wikipedia’s own search functionality, but the approach we’re showing here can be used on any large text dataset with arbitrarily complicated regular expressions. For anything that isn’t already indexed by Google or another tech giant, I don’t think there’s a better combination of tools.
- The complete code for this series is here in case you want to use it as a template.