Grepping a huge file (80GB) any way to speed it up?
Bash Problem Overview
grep -i -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
This has been running for an hour on a fairly powerful Linux server that is otherwise not overloaded. Is there any alternative to grep? Can anything about my syntax be improved (would egrep or fgrep be better)?
The file is in a directory that is shared via a mount with another server, but the actual disk space is local, so that shouldn't make any difference, should it?
The grep process is using up to 93% CPU.
Bash Solutions
Solution 1 - Bash
Here are a few options:
- Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.
- Use fgrep because you're searching for a fixed string, not a regular expression.
- Remove the -i option, if you don't need it.
So your command becomes:
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
It may also be faster if you copy the file to a RAM disk, provided you have enough memory to hold it and plan to search it more than once.
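Before dropping -i or switching to a fixed-string search on the full file, it may be worth confirming on a small sample that the faster form still finds the same lines. The following is a minimal sketch under stated assumptions: the sample file and its contents are made up for illustration and are not taken from the real dump.

```shell
# Build a small throwaway sample (hypothetical data, not the real dump).
sample=$(mktemp)
for i in $(seq 1 1000); do
  echo "INSERT INTO other_table VALUES ($i);"
done > "$sample"
echo "INSERT INTO db_pd.Clients VALUES (1);" >> "$sample"

# Original form: case-insensitive regex match in the current locale.
slow_count=$(grep -i -c 'db_pd.Clients' "$sample")

# Suggested form: case-sensitive fixed-string match in the C locale.
fast_count=$(LC_ALL=C grep -F -c 'db_pd.Clients' "$sample")

# If the two counts differ, the data contains other capitalizations
# (or near-matches via the regex dot) that the faster form would miss.
echo "regex/-i: $slow_count, fixed/C locale: $fast_count"
rm -f "$sample"
```

If the counts agree on a representative sample, the same two commands can be run on a larger slice of the real file under bash's time keyword to compare their speed.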
Solution 2 - Bash
If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:
< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'
Depending on your disks and CPUs it may be faster to read larger blocks:
< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'
It's not entirely clear from your question, but other options for grep include:
- Dropping the -i flag.
- Using the -F flag for a fixed string.
- Disabling NLS with LANG=C.
- Setting a max number of matches with the -m flag.
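To illustrate the last option: -m makes grep stop reading once it has found the given number of matching lines, which can skip most of the scan when only the first occurrence is needed. A minimal sketch on a made-up sample file (the file contents are assumptions for illustration):

```shell
# Hypothetical three-line sample standing in for the real dump.
sample=$(mktemp)
printf '%s\n' \
  'CREATE TABLE db_pd.Clients (id INT);' \
  'INSERT INTO db_pd.Clients VALUES (1);' \
  'INSERT INTO db_pd.Clients VALUES (2);' > "$sample"

# Without -m, every matching line is counted.
all_matches=$(LC_ALL=C grep -F -c 'db_pd.Clients' "$sample")

# With -m 1, grep stops after the first matching line.
first_match=$(LC_ALL=C grep -F -m 1 'db_pd.Clients' "$sample")

echo "total matching lines: $all_matches"
echo "first match only: $first_match"
rm -f "$sample"
```

This only helps when the matches of interest are near the start of the file; if every occurrence is needed, -m does not apply.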
Solution 3 - Bash
Some trivial improvements:
- Remove the -i option if you can; case-insensitive matching is quite slow.
- Replace the . with \. A single dot is the regex symbol that matches any character, which is also slow.
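The effect of the unescaped dot can be seen on a two-line sample. This is a small sketch with made-up input; it also shows that -F gives the same result as escaping the dot by hand.

```shell
# Two hypothetical lines: one with a literal dot, one with another character.
sample=$(mktemp)
printf '%s\n' 'db_pd.Clients' 'db_pdXClients' > "$sample"

# Unescaped dot: regex wildcard, so both lines match.
unescaped=$(grep -c 'db_pd.Clients' "$sample")

# Escaped dot: only the literal period matches.
escaped=$(grep -c 'db_pd\.Clients' "$sample")

# -F treats the whole pattern as a literal string.
fixed=$(grep -F -c 'db_pd.Clients' "$sample")

echo "unescaped=$unescaped escaped=$escaped fixed=$fixed"
rm -f "$sample"
```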
Solution 4 - Bash
Two lines of attack:
- Are you sure you need the -i, or is there a way to get rid of it?
- Do you have more cores to play with? grep is single-threaded, so you might want to start several of them at different offsets.
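Starting several greps at different offsets could look roughly like the sketch below. It is only an illustration under stated assumptions: the sample file, the job count of 4, and the variable names are made up, and the byte-based split is naive. A line that straddles a chunk boundary is cut in two and may be missed, and context lines (-A/-B) near a boundary would be truncated.

```shell
# Stand-in for the real dump (hypothetical data).
file=$(mktemp)
for i in $(seq 1 1000); do echo "row $i"; done > "$file"
echo "INSERT INTO db_pd.Clients VALUES (1);" >> "$file"

jobs=4                                   # assumed number of parallel greps
size=$(wc -c < "$file")
chunk=$(( (size + jobs - 1) / jobs ))    # bytes per chunk, rounded up
outdir=$(mktemp -d)

i=0
while [ "$i" -lt "$jobs" ]; do
  start=$(( i * chunk + 1 ))             # tail -c +N counts bytes from 1
  # Each background pipeline reads one byte range and greps it.
  tail -c "+$start" "$file" | head -c "$chunk" \
    | LC_ALL=C grep -F 'db_pd.Clients' > "$outdir/part.$i" &
  i=$(( i + 1 ))
done
wait                                     # block until all greps finish

# Concatenate the per-chunk results in offset order.
matches=$(cat "$outdir"/part.* | wc -l)
echo "matching lines found: $matches"
rm -rf "$file" "$outdir"
```

Because of the boundary problem, a tool that splits on line boundaries, such as GNU parallel with --pipe, is usually the safer way to get the same effect.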
Solution 5 - Bash
< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'
If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The command above is a translation of something that I am currently testing. The -j and -n option values seemed to work best for my use case. The -F option to grep also made a big difference.
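As a rough illustration of the grep -f suggestion, the sketch below searches for two fixed strings in a single pass instead of running grep once per string. The sample data and the second table name (db_pd.Orders) are made up, and a temporary file plays the role of strings.txt.

```shell
sample=$(mktemp)
patterns=$(mktemp)   # plays the role of strings.txt
printf '%s\n' \
  'INSERT INTO db_pd.Clients VALUES (1);' \
  'INSERT INTO db_pd.Orders VALUES (7);' \
  'INSERT INTO db_pd.Logs VALUES (9);' > "$sample"

# One search string per line.
printf '%s\n' 'db_pd.Clients' 'db_pd.Orders' > "$patterns"

# One pass over the file, matching any of the listed fixed strings.
hits=$(LC_ALL=C grep -F -c -f "$patterns" "$sample")
echo "lines matching any pattern: $hits"
rm -f "$sample" "$patterns"
```

On an 80GB file the saving comes from reading the data once rather than once per search string.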
Solution 6 - Bash
Try ripgrep.
It often performs much better than grep.