At one of my clients we had a Bash script that grepped a huge log file 20 times in order to generate a report. It created a lot of load on the server, as grep was reading the entire file 20 times.
As we were converting our shell scripts to Python anyway, I thought I could rewrite it in Python, go over the file once instead of 20 times, and use Python's regex engine to extract the same information.
The Python version should be faster, since, as we all know, file I/O is way more expensive than in-memory operations.
Once we started the conversion it turned out that this assumption was incorrect. Our code became way slower. Let's look at a simulation of it.
Generate the big log file
In order to make the case easy to reproduce, I created a script that generates a big text file.
examples/python/create-big-file.py
import sys
import random

if len(sys.argv) != 4:
    exit(f"{sys.argv[0]} FILENAME NUMBER-OF-ROWS LENGTH-OF-ROWS")
_, filename, rows, length = sys.argv

line = "x" * int(length) + "\n"
# Pick a random row that will contain the single "y".
match = random.randint(0, int(rows) - 1)

with open(filename, 'w') as fh:
    for i in range(int(rows)):
        if i == match:
            fh.write("x" * (int(length)-2) + "yx\n")
        else:
            fh.write(line)
We can run it like this, giving the name of the file we would like to create, the number of rows, and the length of the rows.
python create-big-file.py FILENAME NUMBER-OF-ROWS LENGTH-OF-ROWS
It will create a file full of the character "x", with a single "y" somewhere.
I think this is going to be good enough for our simple example.
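If you'd like to verify that the generated file really contains a single "y", a quick sketch like this will do (it assumes the file was created by the script above and is not part of the benchmark):

import sys

# Count the rows of the generated file that contain a "y".
filename = sys.argv[1]
count = 0
with open(filename) as fh:
    for row, line in enumerate(fh):
        if 'y' in line:
            count += 1
            print(f"Row {row} contains the 'y'")
print(f"Rows containing 'y': {count}")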
Using grep
In the original shell script we had some 20 different calls to grep, but to make it simpler I made this shell script that runs the same regex multiple times.
filename=$1
limit=$2
for ((i=1; i<=$limit; i++))
do
    grep y "$filename"
done
You can pass the name of the data file and the number of times you'd like to run grep.
Grep with Python regexes
I have an implementation in Python as well.
import sys
import re

if len(sys.argv) != 3:
    exit(f"{sys.argv[0]} FILENAME LIMIT")
_, filename, limit = sys.argv

with open(filename) as fh:
    for line in fh:
        for _ in range(int(limit)):
            if re.search(r'y', line):
                print(line)
I know that in the simple case of finding a single "y" character I could use the index or find string methods, and those would probably be faster, but in our real case we had more complex regexes.
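For the record, the find-based variant mentioned above could look something like this sketch (I did not include it in the measurements below):

import sys

if len(sys.argv) != 3:
    exit(f"{sys.argv[0]} FILENAME LIMIT")
_, filename, limit = sys.argv

# Same structure as above, but str.find avoids the regex engine entirely.
# find returns -1 when the substring is not present.
with open(filename) as fh:
    for line in fh:
        for _ in range(int(limit)):
            if line.find('y') != -1:
                print(line)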
Comparing the speed
python create-big-file.py a.txt 1000000 50
Verify the file:
$ wc a.txt
1000000 1000000 51000000 a.txt
$ grep y a.txt
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyx
$ time bash examples/grep_speed.sh a.txt 20
real 0m0.227s
user 0m0.055s
sys 0m0.172s
$ time python examples/grep_speed.py a.txt 20
real 0m9.509s
user 0m9.477s
sys 0m0.032s
grep was more than 40 times faster than Python, even though grep had to read the file 20 times while Python read it only once.
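Keep in mind that the simulation runs the same regex 20 times per line; the real code applied 20 different patterns during its single pass over the file. A sketch of that structure, with made-up placeholder patterns, would look roughly like this:

import sys
import re

if len(sys.argv) != 2:
    exit(f"{sys.argv[0]} FILENAME")
_, filename = sys.argv

# Placeholder patterns standing in for the 20 different greps of the real report.
patterns = [re.compile(r'y'), re.compile(r'(.)y\1')]

with open(filename) as fh:
    for line in fh:
        for regex in patterns:
            if regex.search(line):
                print(line)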
More complex grep
In the previous case we used a very simple regex. Now let's change it to a slightly more complex expression in which we are not only looking for a single character, but also making sure it is between two identical characters.
filename=$1
limit=$2
for ((i=1; i<=$limit; i++))
do
    grep '\(.\)y\1' "$filename"
done
More complex python
import sys
import re

if len(sys.argv) != 3:
    exit(f"{sys.argv[0]} FILENAME LIMIT")
_, filename, limit = sys.argv

with open(filename) as fh:
    for line in fh:
        for _ in range(int(limit)):
            if re.search(r'(.)y\1', line):
                print(line)
You can try it yourself:
grep '\(.\)y\1' a.txt
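The same check can be done with the Python regex in the interactive interpreter; the sample strings here are made up just to show what matches and what does not:

import re

# "xyx" matches: the "y" is between two identical characters.
print(re.search(r'(.)y\1', 'xxxyx'))   # <re.Match object; span=(2, 5), match='xyx'>
# Here the "y" is between two different characters, so there is no match.
print(re.search(r'(.)y\1', 'xyzzz'))   # None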
Comparing the speed of the more complex examples
$ time bash examples/grep_speed_oxo.sh a.txt 20
real 0m0.196s
user 0m0.035s
sys 0m0.161s
$ time python examples/grep_speed_oxo.py a.txt 20
real 0m25.067s
user 0m24.972s
sys 0m0.016s
The speed of grep did not change, but Python became even slower. This time grep was more than 100 times faster than Python.
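One obvious question is whether precompiling the regex with re.compile would help. Here is a sketch of that variant; I have not timed it here, and since the re module caches compiled patterns anyway, I would not expect it to close the gap:

import sys
import re

if len(sys.argv) != 3:
    exit(f"{sys.argv[0]} FILENAME LIMIT")
_, filename, limit = sys.argv

# Compile once instead of passing the pattern string to re.search on every call.
regex = re.compile(r'(.)y\1')

with open(filename) as fh:
    for line in fh:
        for _ in range(int(limit)):
            if regex.search(line):
                print(line)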
Version information
$ python -V
Python 3.8.2
$ grep -V
grep (GNU grep) 3.4
Other cases
The results are consistent with what I saw at work, but I wonder what the results would be if the file were larger than the available memory of my computer.
Conclusion
grep is so much faster than the regex engine of Python that even reading the whole file several times does not matter.
Or I made a mistake somewhere that impacts the results.
Oh, and one more thing: I also created a Perl version of the code, and Perl was much faster than Python, though still slower than grep.