Fastest possible grep

Bash, Unix, Grep

Bash Problem Overview


I'd like to know if there are any tips for making grep as fast as possible. I have a rather large base of text files to search through as quickly as possible. I've made them all lowercase so that I could get rid of the -i option, which makes the search much faster.

Also, I've found out that the -F and -P modes are quicker than the default one. I use the former when the search string is not a regular expression (just plain text), and the latter when a regex is involved.

Does anyone have any experience in speeding up grep? Maybe compile it from scratch with some particular flag (I'm on Linux CentOS), organize the files in a certain fashion or maybe make the search parallel in some way?

Bash Solutions


Solution 1 - Bash

Try with GNU parallel, which includes an example of how to use it with grep:

> grep -r greps recursively through directories. On multicore CPUs GNU parallel can often speed this up.
>
>     find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
>
> This will run 1.5 jobs per core, and give 1000 arguments to grep.

For big files, it can split the input into several chunks with the --pipe and --block arguments:

 parallel --pipe --block 2M grep foo < bigfile

You could also run it on several different machines through SSH (ssh-agent needed to avoid passwords):

parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile

Solution 2 - Bash

If you're searching very large files, then setting your locale can really help.

GNU grep goes a lot faster in the C locale than with UTF-8.

export LC_ALL=C
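If you only want the faster locale for a single search rather than for the whole shell session, you can set it per invocation (a minimal sketch; the pattern and file name are placeholders):

    # apply the C locale to this grep run only
    LC_ALL=C grep -F 'needle' bigfile.txt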

Solution 3 - Bash

Ripgrep claims to now be the fastest.

https://github.com/BurntSushi/ripgrep

It also includes parallelism by default:

 -j, --threads ARG
              The number of threads to use.  Defaults to the number of logical CPUs (capped at 6).  [default: 0]

From the README

> It is built on top of Rust's regex engine. Rust's regex engine uses finite automata, SIMD and aggressive literal optimizations to make searching very fast.
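A minimal invocation sketch (the pattern, path, and thread count are placeholders, not from the original answer; -F treats the pattern as a literal string and -j overrides the default thread count):

    # search a directory tree for a literal string using 8 threads
    rg -F -j 8 'needle' /path/to/files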

Solution 4 - Bash

Apparently using --mmap can help on some systems:

http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
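If your grep build supports it, the flag is passed like any other option (a minimal sketch with placeholder arguments; note that recent GNU grep releases accept --mmap only for compatibility and silently ignore it, so this mainly applies to older or BSD-derived builds):

    grep --mmap 'pattern' bigfile.txt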

Solution 5 - Bash

Not strictly a code improvement but something I found helpful after running grep on 2+ million files.

I moved the operation onto a cheap SSD drive (120GB). At about $100, it's an affordable option if you are crunching lots of files regularly.

Solution 6 - Bash

If you don't care which files contain the string, you might want to separate reading and grepping into two jobs, since it can be costly to spawn grep many times (once for each small file).

  1. If you have one very large file:

    parallel -j100% --pipepart --block 100M -a <very large SEEKABLE file> grep <...>

  2. Many small compressed files (sorted by inode):

    ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j80% --group "gzcat {}" | parallel -j50% --pipe --round-robin -u -N1000 grep <..>

I usually compress my files with lz4 for maximum throughput (see the sketch after this list).

  3. If you want just the filename with the match:

    ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j100% --group "gzcat {} | grep -lq <..> && echo {}"
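A hedged sketch of the same pipeline for lz4-compressed files (file selection and pattern are placeholders; lz4 -dc decompresses to stdout much like gzcat does for gzip):

    # decompress each .lz4 file and fan the text out to parallel greps
    ls *.lz4 | parallel -j80% --group "lz4 -dc {}" | parallel -j50% --pipe --round-robin -u -N1000 grep 'pattern'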

Solution 7 - Bash

Building on Sandro's response, I looked at the reference he provided here and played around with BSD grep vs. GNU grep. My quick benchmark showed that GNU grep is way, way faster.

So my recommendation for the original question "fastest possible grep": make sure you are using GNU grep rather than BSD grep (which is the default on macOS, for example).
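A quick way to check which implementation you have (a minimal sketch; on macOS, GNU grep installed via Homebrew is typically available as ggrep):

    grep --version   # GNU grep reports "GNU grep x.y"; BSD grep identifies itself as BSD grep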

Solution 8 - Bash

I personally use ag (the silver searcher) instead of grep and it's way faster; you can also combine it with parallel and --pipe/--block, as sketched below.

https://github.com/ggreer/the_silver_searcher
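A hedged sketch of that combination, assuming ag searches its standard input when data is piped to it (pattern and block size are placeholders):

    parallel --pipe --block 2M ag 'foo' < bigfile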

Update: I now use https://github.com/BurntSushi/ripgrep, which is faster than ag depending on your use case.

Solution 9 - Bash

One thing I've found faster for grep searches (especially with changing patterns) in a single big file is to use split + grep + xargs with its parallel flag. For instance:

Suppose you have a file of IDs to search for, my_ids.txt, and a big file to search, bigfile.txt.

Use split to split the file into parts:

# Use split to split the file into x number of files, consider your big file
# size and try to stay under 26 split files to keep the filenames 
# easy from split (xa[a-z]), in my example I have 10 million rows in bigfile
split -l 1000000 bigfile.txt
# Produces output files named xa[a-t]

# Now use split files + xargs to iterate and launch parallel greps with output
for id in $(cat my_ids.txt) ; do ls xa* | xargs -n 1 -P 20 grep "$id" >> matches.txt ; done
# Here you can tune your parallel greps with -P, in my case I am being greedy
# Also be aware that there's no point in allocating more greps than x files

In my case this cut what would have been a 17-hour job down to 1 hour 20 minutes. There is surely a point of diminishing returns, and going beyond the available cores obviously won't help, but for my requirements this was a much better solution than the approaches above. An added benefit over GNU parallel is that it uses mostly native (Linux) tools.

Solution 10 - Bash

cgrep, if it's available, can be orders of magnitude faster than grep.

Solution 11 - Bash

MCE 1.508 includes a dual chunk-level {file, list} wrapper script supporting many C binaries: agrep, grep, egrep, fgrep, and tre-agrep.

https://metacpan.org/source/MARIOROY/MCE-1.509/bin/mce_grep

https://metacpan.org/release/MCE

One does not need to convert to lowercase when wanting -i to run fast. Simply pass --lang=C to mce_grep.
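A hedged invocation sketch: --lang=C is the flag mentioned above, and the assumption is that mce_grep passes the remaining options straight through to the underlying grep binary (the pattern and paths are placeholders):

    mce_grep --lang=C -i 'pattern' /path/to/files/*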

Output order is preserved, and the -n and -b output is also correct. Unfortunately, that is not the case for the GNU parallel approaches mentioned on this page; I was really hoping GNU Parallel would work here. In addition, mce_grep does not sub-shell (sh -c /path/to/grep) when calling the binary.

Another alternative is the MCE::Grep module included with MCE.

Solution 12 - Bash

A slight deviation from the original topic: the indexed-search command-line utilities from Google's codesearch project (https://github.com/google/codesearch) are way faster than grep.

Once you compile it (the Go toolchain is needed), you can index a folder with:

# index current folder
cindex .

The index will be created under ~/.csearchindex

Now you can search:

# search folders previously indexed with cindex
csearch eggs

I'm still piping the results through grep to get colorized matches.
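A minimal sketch of that piping step (the search term is just a placeholder; grep colors the matches when writing to a terminal):

    csearch eggs | grep --color eggs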

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | pistacchio | View Question on Stackoverflow |
| Solution 1 - Bash | Chewie | View Answer on Stackoverflow |
| Solution 2 - Bash | daveb | View Answer on Stackoverflow |
| Solution 3 - Bash | rado | View Answer on Stackoverflow |
| Solution 4 - Bash | Sandro Pasquali | View Answer on Stackoverflow |
| Solution 5 - Bash | the wanderer | View Answer on Stackoverflow |
| Solution 6 - Bash | Alex V | View Answer on Stackoverflow |
| Solution 7 - Bash | Chris | View Answer on Stackoverflow |
| Solution 8 - Bash | Jinxmcg | View Answer on Stackoverflow |
| Solution 9 - Bash | user6504312 | View Answer on Stackoverflow |
| Solution 10 - Bash | xhtml | View Answer on Stackoverflow |
| Solution 11 - Bash | Mario Roy | View Answer on Stackoverflow |
| Solution 12 - Bash | ccpizza | View Answer on Stackoverflow |