How does grep run so fast?


Unix Problem Overview


I am really amazed by the functionality of grep in the shell. I used to search with substring methods in Java, but now I use grep for it and it executes in a matter of seconds; it is blazingly faster than the Java code I used to write (in my experience, at least; I might be wrong).

That being said, I have not been able to figure out how this happens, and there is not much available on the web about it either.

Can anyone help me with this?

Unix Solutions


Solution 1 - Unix

Assuming your question regards GNU grep specifically, here's a note from the author, Mike Haertel:

> GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE.
>
> GNU grep is fast because it EXECUTES VERY FEW INSTRUCTIONS FOR EACH BYTE that it does look at.
>
> GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character.
>
> GNU grep also unrolls the inner loop of Boyer-Moore, and sets up the Boyer-Moore delta table entries in such a way that it doesn't need to do the loop exit test at every unrolled step. The result of this is that, in the limit, GNU grep averages fewer than 3 x86 instructions executed for each input byte it actually looks at (and it skips many bytes entirely).
>
> GNU grep uses raw Unix input system calls and avoids copying data after reading it. Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!
>
> So instead of using line-oriented input, GNU grep reads raw data into a large buffer, searches the buffer using Boyer-Moore, and only when it finds a match does it go and look for the bounding newlines. (Certain command line options like -n disable this optimization.)
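To make the skip-table idea concrete, here is a minimal sketch in C of Boyer-Moore-Horspool, a simplified Boyer-Moore variant that keeps only the bad-character skip table. It only illustrates the technique the note describes; it is not GNU grep's actual code, and the function name and test strings are invented for the example:

```c
/* Minimal Boyer-Moore-Horspool search: an illustration of the skip-table
 * idea described above, NOT GNU grep's actual code. */
#include <stdio.h>
#include <string.h>

/* Return a pointer to the first occurrence of pat in buf, or NULL. */
static const char *bmh_search(const char *buf, size_t buflen,
                              const char *pat, size_t patlen)
{
    size_t skip[256];

    if (patlen == 0 || patlen > buflen)
        return NULL;

    /* A byte that does not occur in the pattern lets us skip a full
     * pattern length; bytes that do occur allow a shorter skip. */
    for (size_t c = 0; c < 256; c++)
        skip[c] = patlen;
    for (size_t i = 0; i + 1 < patlen; i++)
        skip[(unsigned char)pat[i]] = patlen - 1 - i;

    for (size_t i = 0; i + patlen <= buflen; ) {
        /* Compare from the last character of the pattern backwards. */
        size_t j = patlen;
        while (j > 0 && buf[i + j - 1] == pat[j - 1])
            j--;
        if (j == 0)
            return buf + i;                      /* full match */
        /* Mismatch: jump ahead by the skip of the byte aligned with the
         * pattern's last position -- many input bytes are never examined. */
        i += skip[(unsigned char)buf[i + patlen - 1]];
    }
    return NULL;
}

int main(void)
{
    const char *haystack = "GET /cc/merchant.json tg=f_c HTTP/1.1";
    const char *needle   = "tg=f_c";
    const char *hit = bmh_search(haystack, strlen(haystack),
                                 needle, strlen(needle));
    if (hit)
        printf("match at offset %td\n", hit - haystack);
    else
        printf("no match\n");
    return 0;
}
```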

This answer is a subset of the information taken from here.
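The note's last point, reading raw data into a large buffer and only hunting for the bounding newlines once a match is found, can be sketched roughly as follows. This is a simplification of the approach as I understand it, not grep's real source: memmem() (a glibc extension) stands in for the Boyer-Moore search, a single read() is used, and matches spanning buffer boundaries are not handled.

```c
#define _GNU_SOURCE              /* for memmem() on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s PATTERN FILE\n", argv[0]);
        return 2;
    }
    const char *pat = argv[1];
    size_t patlen = strlen(pat);

    int fd = open(argv[2], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 2;
    }

    /* Read one large block instead of one line at a time. */
    static char buf[1 << 20];
    ssize_t n = read(fd, buf, sizeof buf);
    close(fd);
    if (n <= 0)
        return 1;

    /* Search the whole block; no line splitting yet. */
    const char *hit = memmem(buf, (size_t)n, pat, patlen);
    if (!hit)
        return 1;

    /* Only now, after a match, look for the bounding newlines. */
    const char *start = hit;
    while (start > buf && start[-1] != '\n')
        start--;
    const char *end = memchr(hit, '\n', (size_t)(buf + n - hit));
    if (!end)
        end = buf + n;
    fwrite(start, 1, (size_t)(end - start), stdout);
    putchar('\n');
    return 0;
}
```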

Solution 2 - Unix

To add to Steve's excellent answer.

It may not be widely known, but grep is almost always faster when grepping for a longer pattern string than for a short one, because with a longer pattern Boyer-Moore can skip forward in longer strides and achieve even better sublinear speeds:

Example:

# after running these twice to ensure apples-to-apples comparison
# (everything is in the buffer cache) 

$ time grep -c 'tg=f_c' 20140910.log
28
0.168u 0.068s 0:00.26

$ time grep -c ' /cc/merchant.json tg=f_c' 20140910.log
28
0.100u 0.056s 0:00.17

The longer form is 35% faster!

How come? Boyer-Moore constructs a skip-forward table from the pattern string. It compares from the last character of the pattern backwards, and whenever there's a mismatch, the table tells it how far it can safely shift the pattern forward before comparing again; the longer the pattern, the longer those shifts can be.
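For a rough sense of the numbers: the default entry in the bad-character table is the full pattern length, so the 25-byte pattern from the timing above can jump roughly four times as far per mismatch as the 6-byte one. A tiny sketch (patterns copied from the example above):

```c
/* Sketch: the default skip in the bad-character table equals the pattern
 * length, so a longer fixed pattern lets the search stride further per
 * mismatch. Patterns taken from the timing example above. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *patterns[] = { "tg=f_c", " /cc/merchant.json tg=f_c" };
    for (int i = 0; i < 2; i++)
        printf("pattern \"%s\" (%zu bytes): max skip per mismatch = %zu bytes\n",
               patterns[i], strlen(patterns[i]), strlen(patterns[i]));
    return 0;
}
```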

Here's a video explaining Boyer-Moore (credit to kommradHomer).

Another common misconception (for GNU grep) is that fgrep is faster than grep. The f in fgrep doesn't stand for 'fast', it stands for 'fixed' (see the man page), and since both are the same program and both use Boyer-Moore, there's no difference in speed between them when searching for fixed strings without regexp special chars. The only reason I use fgrep is when there's a regexp special char (like ., [], or *) that I don't want interpreted as such. And even then, the more portable/standard form grep -F is preferred over fgrep.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
|---|---|---|
| Question | Rohit | View Question on Stackoverflow |
| Solution 1 - Unix | Steve | View Answer on Stackoverflow |
| Solution 2 - Unix | arielf | View Answer on Stackoverflow |