I’ve been working on some code that uses a set of supplied regular expressions to search through log files (I know, regex isn’t that efficient, yadda, yadda, yadda, but these were the requirements). The issue I was running into was the sheer volume of data: for example, 10 regexes searching 36 gzipped files averaging 1.2 million lines each. Worse, these logs came in hourly, so if the search couldn’t finish within an hour, it was going to get backed up.
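For a rough idea of what this kind of search looks like, here is a minimal sketch, assuming Python 3 and UTF-8 (or close enough) log lines. The names search_log and timed_search are hypothetical and are not taken from alerter.py; the real script may be structured quite differently.

```python
import gzip
import re
import time


def search_log(path, patterns):
    """Return (lines_searched, matching_lines) for one gzipped log file."""
    # Compile each supplied pattern once, outside the per-line loop.
    compiled = [re.compile(p) for p in patterns]
    searched = 0
    matches = []
    # "rt" opens the gzip stream in text mode, so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            searched += 1
            # Keep the line if any one of the patterns matches it.
            if any(rx.search(line) for rx in compiled):
                matches.append(line.rstrip("\n"))
    return searched, matches


def timed_search(path, patterns):
    """Hypothetical helper: run search_log and report the wall-clock time."""
    start = time.perf_counter()
    searched, matches = search_log(path, patterns)
    elapsed = time.perf_counter() - start
    print("Searched %d lines in %.3f seconds" % (searched, elapsed))
    return matches
```

Reading the file line by line keeps memory use flat no matter how large the log is, which matters when each file runs to over a million lines.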
Being a good Pythonista, I followed the cardinal rules of:
1. Get it right.
2. Test it’s right.
3. Profile if slow.
4. Repeat from 2.
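The "profile if slow" step can be sketched with the standard library's cProfile and pstats modules. This is only an illustration: count_matches is a hypothetical stand-in for the slow part of a log search, and profile_call is a made-up helper name, not something from the original script.

```python
import cProfile
import io
import pstats
import re


def count_matches(lines, pattern):
    """Hypothetical stand-in for the slow part of a log search."""
    rx = re.compile(pattern)
    return sum(1 for line in lines if rx.search(line))


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()
```

For a whole script, running python -m cProfile -s cumulative alerter.py produces the same kind of report without touching the code.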
The problem was, after a while I sort of hit a wall. Nothing I did could make this code appreciably faster. (Of course, this was with my limited knowledge. I’m sure that more experienced Python programmers could optimize this code a lot better than I can, rewrite the regex bottleneck in C, etc.) I was at the end of my rope.
One thing led to another and I remembered reading about PyPy. PyPy is an implementation of Python that uses a JIT (just-in-time) compiler and other things that have lost their meaning for me since I did systems programming in college. “What the heck,” I thought, “I’ll give it a try.” PyPy is supposed to be highly compatible with CPython (the regular Python implementation), and my code didn’t use any exotic libraries.
So I dumped the tarball on my Linux machine, unpacked it, and ran my unmodified code against it, and DAMN was it fast.
sean@linux1:~/code/python/hourly_alerts$ python alerter.py
Loaded regexes
Processing known_bad.log.gz
Searched 817051 lines in 119.398156166 seconds using filter

sean@linux1:~/code/python/hourly_alerts$ /home/sean/bin/pypy-1.7/bin/pypy alerter.py
Loaded regexes
Processing known_bad.log.gz
Searched 817051 lines in 51.1275110245 seconds using filter
More than twice as fast (about 2.3x)! Now, I know it was a totally unscientific test and all, but it’s great to see such an improvement right away.
I may also try Cython, but it looks like it doesn’t have quite the drop-in functionality of PyPy.
Here is the link to PyPy: http://pypy.org/