Grepping a huge file (80GB) any way to speed it up?

BashGrep

Bash Problem Overview


 grep -i -A 5 -B 5 'db_pd.Clients'  eightygigsfile.sql

This has been running for an hour on a fairly powerful linux server which is otherwise not overloaded. Any alternative to grep? Anything about my syntax that can be improved, (egrep,fgrep better?)

The file is actually in a directory which is shared with a mount to another server but the actual diskspace is local so that shouldn't make any difference?

the grep is grabbing up to 93% CPU

Bash Solutions


Solution 1 - Bash

Here are a few options:

  1. Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

  2. Use fgrep because you're searching for a fixed string, not a regular expression.

  3. Remove the -i option, if you don't need it.

So your command becomes:

LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql

It will also be faster if you copy your file to RAM disk.

Solution 2 - Bash

If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:

< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'

Depending on your disks and CPUs it may be faster to read larger blocks:

< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'

It's not entirely clear from you question, but other options for grep include:

  • Dropping the -i flag.
  • Using the -F flag for a fixed string
  • Disabling NLS with LANG=C
  • Setting a max number of matches with the -m flag.

Solution 3 - Bash

Some trivial improvement:

  • Remove the -i option, if you can, case insensitive is quite slow.

  • Replace the . by \.

    A single point is the regex symbol to match any character, which is also slow

Solution 4 - Bash

Two lines of attack:

  • are you sure, you need the -i, or do you habe a possibility to get rid of it?
  • Do you have more cores to play with? grep is single-threaded, so you might want to start more of them at different offsets.

Solution 5 - Bash

< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'  

If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing. the -j and -n option value seemed to work best for my use case. The -F grep also made a big difference.

Solution 6 - Bash

Try ripgrep

It provides much better results compared to grep.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionzzapperView Question on Stackoverflow
Solution 1 - BashdogbaneView Answer on Stackoverflow
Solution 2 - BashSteveView Answer on Stackoverflow
Solution 3 - BashBeniBelaView Answer on Stackoverflow
Solution 4 - BashEugen RieckView Answer on Stackoverflow
Solution 5 - Bashuser584583View Answer on Stackoverflow
Solution 6 - BashShaileshView Answer on Stackoverflow