How to grep a text file which contains some binary data?


Shell Problem Overview


grep returns

Binary file test.log matches

For example

echo    "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log  # in zsh
echo -e "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log  # in bash
grep re test.log

I would like the result to show line1 and line3 (two lines in total).

Is it possible to use tr to convert the unprintable data into readable data, so that grep works again?

Shell Solutions


Solution 1 - Shell

grep -a

It can't get simpler than that.
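As a minimal sketch of how this applies to the question's sample data: the block below recreates the file with printf (an assumption, chosen so the escapes behave the same in bash and zsh) in a scratch file named by the hypothetical variable log, then searches it with -a.

```shell
# Hypothetical scratch file standing in for the question's test.log.
log=$(mktemp)

# Write the sample data: a NUL byte on line 1 and CRLF line endings.
printf 'line1 re \000\r\nline2\r\nline3 re\r\n' > "$log"

# -a (the short form of --text) makes grep treat the file as text,
# so it prints the matching lines instead of "Binary file ... matches".
grep -a re "$log"

# Count the matching lines: line1 and line3 both contain "re".
count=$(grep -a -c re "$log")
echo "$count"

rm -f "$log"
```

Note that the printed lines still contain the raw NUL and carriage-return bytes.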

Solution 2 - Shell

One way is to simply treat binary files as text anyway, with grep --text, but this may well result in binary information being sent to your terminal. That's not really a good idea if you're running a terminal that interprets the output stream (such as VT/DEC or many others).

Alternatively, you can send your file through tr with the following command:

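tr '\000-\011\013-\037\177-\377' '.' <test.log | grep whatever

This will change anything less than a space character (except newline) and anything greater than 126, into a . character, leaving only the printables. The ranges are written without enclosing square brackets because GNU and POSIX tr treat [ and ] as ordinary characters and would translate them to . as well.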
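Here is a sketch of that tr filter run against the question's sample data. It assumes a byte-oriented tr, so LC_ALL=C is set to keep the octal ranges from being interpreted as multibyte characters; the variable name out is only for illustration.

```shell
# Feed the sample data through tr, replacing every control byte
# (except newline) and every byte above 126 with a dot, then grep.
out=$(printf 'line1 re \000\r\nline2\r\nline3 re\r\n' |
      LC_ALL=C tr '\000-\011\013-\037\177-\377' '.' |
      grep re)

# The NUL and the carriage returns now show up as dots:
#   line1 re ..
#   line3 re.
echo "$out"
```

Because the NUL byte is gone before grep sees the data, grep no longer classifies the input as binary.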


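If you want every "illegal" character replaced by a different one, you can use something like the following C program, a classic standard input filter:

#include <stdio.h>

int main (void) {
    int ch;
    /* Copy standard input to standard output one byte at a time. */
    while ((ch = getchar()) != EOF) {
        if ((ch == '\n') || ((ch >= ' ') && (ch <= '~'))) {
            /* Newline and printable ASCII pass through unchanged. */
            putchar (ch);
        } else {
            /* Everything else is shown as {{NN}}, NN being its hex code. */
            printf ("{{%02x}}", ch);
        }
    }
    return 0;
}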

This will give you {{NN}}, where NN is the hex code for the character. You can simply adjust the printf for whatever style of output you want.

You can see that program in action here:

pax$ printf 'Hello,\tBob\nGoodbye, Bob\n' | ./filterProg
Hello,{{09}}Bob
Goodbye, Bob

Solution 3 - Shell

You could run the data file through cat -v, for example:

$ cat -v tmp/test.log | grep re
line1 re ^@^M
line3 re^M

The output could then be post-processed to remove the junk; this is most analogous to your query about using tr for the task.

-v simply tells cat to display non-printing characters in a visible notation, such as ^@ for a null byte and ^M for a carriage return.
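One possible form of that post-processing, as a sketch only: strip the ^@ and ^M markers with sed after grep has run. The sed expression is an assumption about what counts as junk here, and it would also remove a literal ^M or ^@ that happened to appear in the original text.

```shell
# cat -v turns NUL into ^@ and carriage return into ^M, grep selects
# the matching lines, and sed deletes those two visible markers.
out=$(printf 'line1 re \000\r\nline2\r\nline3 re\r\n' |
      cat -v |
      grep re |
      sed 's/\^[@M]//g')

# Prints the two matching lines without the ^@ and ^M markers
# (line1 keeps the trailing space that was in the data).
echo "$out"
```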

Solution 4 - Shell

You can use "strings" to extract the printable strings from a binary file, for example:

strings binary.file | grep foo

Solution 5 - Shell

You can force grep to look at binary files with:

grep --binary-files=text

You might also want to add -o (--only-matching) so you don't get tons of binary gibberish that will bork your terminal.
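A short sketch of the two options combined, using the question's sample data. The pattern line[0-9] re is a hypothetical choice made so that the match covers the readable part of each line.

```shell
# --binary-files=text forces text handling; -o prints only the part of
# each line that matched, so the NUL and carriage returns are left out.
out=$(printf 'line1 re \000\r\nline2\r\nline3 re\r\n' |
      grep --binary-files=text -o 'line[0-9] re')

# Prints:
#   line1 re
#   line3 re
echo "$out"
```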

Solution 6 - Shell

Starting with grep 2.21, binary files are treated differently:

> When searching binary data, grep now may treat non-text bytes as line terminators. This can boost performance significantly.

So what happens now is that with binary data, all non-text bytes (including newlines) are treated as line terminators. If you want to change this behavior, you can:

  • use --text. This will ensure that only newlines are line terminators

  • use --null-data. This will ensure that only null bytes are line terminators
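The two options can be sketched as follows, assuming GNU grep. The NUL-separated records alpha, beta and gamma are made-up sample data, and the final tr is only there to make the NUL-terminated output readable.

```shell
# --text: only newlines end a line, so the sample data from the
# question yields two matching lines.
text_count=$(printf 'line1 re \000\r\nline2\r\nline3 re\r\n' |
             grep --text -c re)
echo "$text_count"

# --null-data: only NUL bytes end a record. grep also writes NUL after
# each matching record, so tr converts those to newlines for display.
records=$(printf 'alpha re\000beta\000gamma re\000' |
          grep --null-data re |
          tr '\000' '\n')
echo "$records"
```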

Solution 7 - Shell

grep -a will force grep to search and output from a file that grep thinks is binary:

grep -a re test.log

Solution 8 - Shell

As James Selvakumar already said, grep -a does the trick. -a or --text forces grep to handle the input stream as text. See the man page: http://unixhelp.ed.ac.uk/CGI/man-cgi?grep

Try:

cat test.log | grep -a somestring

Solution 9 - Shell

You can do:

strings test.log | grep -i re

This extracts the readable strings from the file and passes them to grep.

Solution 10 - Shell

Here's what I used on a system that didn't have the "strings" command installed:

cat yourfilename | tr -cd "[:print:]\n"

This prints the text and removes unprintable characters in one fell swoop, unlike "cat -v filename", which requires some post-processing to remove unwanted stuff. The \n keeps newlines, which [:print:] alone would delete, so the output stays line-oriented for grep. Note that some of the binary data may be printable, so you'll still get some gibberish between the good stuff. I think strings removes this gibberish too, if you can use that.
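One caveat worth illustrating: the [:print:] class does not include newline, so deleting its complement on its own joins every line into one. The sketch below compares both forms on the question's sample data; LC_ALL=C is an assumption that keeps tr working byte by byte, and the variable names are only for illustration.

```shell
# Without \n in the kept set, newlines are deleted too, so grep sees
# a single long line.
collapsed=$(printf 'line1 re \000\r\nline2\r\nline3 re\r\n' |
            LC_ALL=C tr -cd '[:print:]' |
            grep -c re)

# Keeping \n preserves the line structure, so grep reports line1 and
# line3 separately.
kept=$(printf 'line1 re \000\r\nline2\r\nline3 re\r\n' |
       LC_ALL=C tr -cd '[:print:]\n' |
       grep -c re)

echo "$collapsed $kept"
```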

Solution 11 - Shell

You can also try the Word Extractor tool. It can be used with any file on your computer to separate the strings that contain human-readable text or words from binary code (exe applications, DLLs).

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author
Question | Daniel YC Lin
Solution 1 - Shell | James Selvakumar
Solution 2 - Shell | paxdiablo
Solution 3 - Shell | vielmetti
Solution 4 - Shell | moodywoody
Solution 5 - Shell | A B
Solution 6 - Shell | Zombo
Solution 7 - Shell | Kevin Buchs
Solution 8 - Shell | DerKnorr
Solution 9 - Shell | Mrid
Solution 10 - Shell | Muurder
Solution 11 - Shell | MattCollW