grepping binary files and UTF16


Unicode Problem Overview


Standard grep/pcregrep etc. can conveniently be used with binary files containing ASCII or UTF-8 data. Is there a simple way to make them try UTF-16 too (preferably at the same time, but a separate invocation will do)?

The data I'm trying to find is all ASCII anyway (references in libraries etc.); it just doesn't get found because sometimes there's a 00 between every two characters, and sometimes there isn't.

I don't see any way to get it done semantically, but these 00s should do the trick, except that I cannot easily use them on the command line.
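A quick way to see the problem: the same ASCII text stored as UTF-16LE carries a 00 byte after every character, so a plain grep for the ASCII string finds nothing (a minimal sketch; sample.bin is a throwaway file name):

```shell
# Write "query" as UTF-16LE: the bytes become q\0 u\0 e\0 r\0 y\0
printf 'query' | iconv -f utf-8 -t utf-16le > sample.bin
# Plain grep cannot match across the interleaved 00 bytes
grep -q query sample.bin || echo "not found"
```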

Unicode Solutions


Solution 1 - Unicode

The easiest way is to just convert the text file to utf-8 and pipe that to grep:

iconv -f utf-16 -t utf-8 file.txt | grep query

I tried to do the opposite (convert my query to utf-16) but it seems as though grep doesn't like that. I think it might have to do with endianness, but I'm not sure.
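On the endianness point: with iconv, -f utf-16 uses the BOM to pick the byte order, so for BOM-less Windows-style files it is safer to name the byte order explicitly (a sketch; file.txt stands in for your input):

```shell
# Little-endian input without a BOM: spell out the encoding
iconv -f utf-16le -t utf-8 file.txt | grep query
```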

It seems as though grep will convert a query that is utf-16 to utf-8/ascii. Here is what I tried:

grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt

If test.txt is a utf-16 file this won't work, but it does work if test.txt is ascii. I can only conclude that grep is converting my query to ascii.

EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:

hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`

How does it work? Well, it converts your file to hex (without the extra formatting that hexdump usually applies) and pipes that into grep. The query is constructed by echoing your string (without a newline) into iconv, which converts it to utf-16. That is piped into sed to remove the BOM (the first two bytes of a utf-16 file, used to determine endianness), and then into hexdump so that the query and the input are in the same form.

Unfortunately I think this will end up printing out the ENTIRE file if there is a single match. Also this won't work if the utf-16 in your binary file is stored in a different endianness than your machine.

EDIT2: Got it!!!!

grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt

This searches for the hex version of the string Test (in utf-16) in the file test.txt
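The BOM-stripping sed can be skipped by converting straight to utf-16le (which emits no BOM) and having hexdump print the \xNN escapes directly; a sketch of the same idea:

```shell
# Build a \xNN byte pattern for "Test" in UTF-16LE, then grep for those bytes.
# -v stops hexdump from collapsing repeated bytes into "*".
pattern=$(printf 'Test' | iconv -f utf-8 -t utf-16le | hexdump -v -e '/1 "\\x%02x"')
grep -aP "$pattern" test.txt
```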

Solution 2 - Unicode

I found the below solution worked best for me, from https://www.splitbits.com/2015/11/11/tip-grep-and-unicode/

Grep does not play well with Unicode, but it can be worked around. For example, to find,

Some Search Term

in a UTF-16 file, use a regular expression whose dots match the null byte paired with each character,

S.o.m.e. .S.e.a.r.c.h. .T.e.r.m 

Also, tell grep to treat the file as text, using '-a', the final command looks like this,

grep -a 'S.o.m.e. .S.e.a.r.c.h. .T.e.r.m' utf-16-file.txt
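If typing the dots by hand is tedious, the pattern can be generated with sed (a sketch; note that . also matches real characters, so false positives are possible):

```shell
# Turn "Some Search Term" into S.o.m.e. .S... (a dot after every char but the last)
dotted=$(printf '%s' 'Some Search Term' | sed 's/./&./g; s/\.$//')
grep -a "$dotted" utf-16-file.txt
```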

Solution 3 - Unicode

You can explicitly include the nulls (00s) in the search string, though you will get results with nulls, so you may want to redirect the output to a file so you can look at it with a reasonable editor, or pipe it through sed to replace the nulls. To search for "bar" in *.utf16.txt:

grep -Pa "b\x00a\x00r" *.utf16.txt | sed 's/\x00//g'

The "-P" tells grep to accept Perl regexp syntax, which allows \x00 to match a null byte, and the -a tells it to treat the file as text even though the UTF-16 content looks like binary to it.
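The null-padded pattern can likewise be generated from a plain string (a sketch for little-endian files; for big-endian data the \x00 would go before each character instead):

```shell
# Insert \x00 after every character of the query: bar -> b\x00a\x00r\x00
pattern=$(printf '%s' bar | sed 's/./&\\x00/g')
grep -Pa "$pattern" *.utf16.txt | sed 's/\x00//g'
```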

Solution 4 - Unicode

I use this one all the time after dumping the Windows registry, since regedit's export is UTF-16. This is running under Cygwin.

$ regedit /e registry.data.out
$ file registry.data.out
registry.data.out: Little-endian UTF-16 Unicode text, with CRLF line terminators

$ sed 's/\x00//g' registry.data.out | egrep "192\.168"
"Port"="192.168.1.5"
"IPSubnetAddress"="192.168.189.0"
"IPSubnetAddress"="192.168.102.0"
[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Print\Monitors\Standard TCP/IP Port\Ports\192.168.1.5]
"HostName"="192.168.1.5"
"Port"="192.168.1.5"
"LocationInformation"="http://192.168.1.28:1215/"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"StandaloneDhcpAddress"="192.168.173.1"
"ScopeAddressBackup"="192.168.137.1"
"ScopeAddress"="192.168.137.1"
"DhcpIPAddress"="192.168.1.24"
"DhcpServer"="192.168.1.1"
"0.0.0.0,0.0.0.0,192.168.1.1,-1"=""
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Print\Monitors\Standard TCP/IP Port\Ports\192.168.1.5]
"HostName"="192.168.1.5"
"Port"="192.168.1.5"
"LocationInformation"="http://192.168.1.28:1215/"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"StandaloneDhcpAddress"="192.168.173.1"
"ScopeAddressBackup"="192.168.137.1"
"ScopeAddress"="192.168.137.1"
"DhcpIPAddress"="192.168.1.24"
"DhcpServer"="192.168.1.1"
"0.0.0.0,0.0.0.0,192.168.1.1,-1"=""
"MRU0"="192.168.16.93"
[HKEY_USERS\S-1-5-21-2054485685-3446499333-1556621121-1001\Software\Microsoft\Terminal Server Client\Servers\192.168.16.93]
"A"="192.168.1.23"
"B"="192.168.1.28"
"C"="192.168.1.200:5800"
"192.168.254.190::5901/extra"=hex:02,00
"00"="192.168.254.190:5901"
"ImagePrinterPort"="192.168.1.5"

Solution 5 - Unicode

ripgrep

Use the ripgrep utility to grep UTF-16 files.

> ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)

Example syntax:

rg sometext file

To dump all lines, run: rg -N . file.

Solution 6 - Unicode

I needed to do this recursively, and here's what I came up with:

find -type f | while read l; do iconv -s -f utf-16le -t utf-8 "$l" | nl -s "$l: " | cut -c7- | grep 'somestring'; done

This is absolutely horrible and very slow; I'm certain there's a better way and I hope someone can improve on it -- but I was in a hurry :P

What the pieces do:

find -type f

gives a recursive list of filenames with paths relative to current

while read l; do ... done

Bash loop; for each line of the list of file paths, put the path into $l and do the thing in the loop. (Why I used a shell loop instead of xargs, which would've been much faster: I need to prefix each line of the output with the name of the current file. Couldn't think of a way to do that if I was feeding multiple files at once to iconv, and since I'm going to be doing one file at a time anyway, shell loop is easier syntax/escaping.)

iconv -s -f utf-16le -t utf-8 "$l"

Convert the file named in $l: assume the input file is utf-16 little-endian and convert it to utf-8. The -s makes iconv shut up about any conversion errors (there will be a lot, because some files in this directory structure are not utf-16). The output from this conversion goes to stdout.

nl -s "$l: " | cut -c7-

This is a hack: nl inserts line numbers, but it happens to have a "use this arbitrary string to separate the number from the line" parameter, so I put the filename (followed by colon and space) in that. Then I use cut to strip off the line number, leaving just the filename prefix. (Why I didn't use sed: escaping is much easier this way. With a sed expression I'd have to worry about regular-expression characters in the filenames, of which in my case there were a lot. nl is much dumber than sed and takes the -s parameter entirely literally, and the shell handles the escaping for me.)

So, by the end of this pipeline, I've converted a bunch of files into lines of utf-8, prefixed with the filename, which I then grep. If there are matches, I can tell which file they're in from the prefix.

Caveats

  • This is much, much slower than grep -R, because I'm spawning a new copy of iconv, nl, cut, and grep for every single file. It's horrible.
  • Everything that isn't utf-16le input will come out as complete garbage, so if there's a normal ASCII file that contains 'somestring', this command won't report it -- you need to do a normal grep -R as well as this command (and if you have multiple unicode encoding types, like some big-endian and some little-endian files, you need to adjust this command and run it again for each different encoding).
  • Files whose name happens to contain 'somestring' will show up in the output, even if their contents have no matches.
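A simpler way to keep the filename prefix without the nl/cut trick is GNU grep's --label option, which names stdin in the output (a sketch with the same caveats about non-UTF-16 files; 'somestring' is the placeholder query):

```shell
# Recursive UTF-16LE grep, labeling each match with its source file
find . -type f | while read -r f; do
  iconv -s -f utf-16le -t utf-8 "$f" | grep -H --label="$f" 'somestring'
done
```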

Solution 7 - Unicode

ugrep (Universal grep) fully supports Unicode, UTF-8/16/32 input files, detects invalid Unicode to ensure proper results, displays text and binary files, and is fast and free:

> ugrep searches UTF-8/16/32 input and other formats. Option --encoding permits many other file formats to be searched, such as ISO-8859-1 to 16, EBCDIC, code pages 437, 850, 858, 1250 to 1258, MacRoman, and KOI8.

See ugrep on GitHub for details.

Solution 8 - Unicode

The sed statement is more than I can wrap my head around. I have a simplistic, far-from-perfect Tcl script that I think does an OK job with my single test case:

#!/usr/bin/tclsh

# Build a pattern with "." after every character but the last,
# so the dots match the null bytes in UTF-16 text.
set insearch [lindex $argv 0]

set search ""

for {set i 0} {$i<[string length $insearch]-1} {incr i} {
    set search "${search}[string range $insearch $i $i]."
}
set search "${search}[string range $insearch $i $i]"

# grep -a each remaining argument; print matches prefixed with the filename
for {set i 1} {$i<$argc} {incr i} {
    set file [lindex $argv $i]
    if {! [catch {exec grep -a $search $file} results options]} {
        puts "$file: $results"
    }
}

Solution 9 - Unicode

I added this as a comment to the accepted answer above, but here it is in a more readable form. It lets you search for text in a bunch of files while also displaying the names of the files the text is found in. All of my files have a .reg extension since I'm searching through exported Windows Registry files; just replace .reg with any file extension.

# Define grepreg in bash by pasting at a bash command prompt
grepreg ()
{
    find -name '*.reg' -exec echo {} \; -exec iconv -f utf-16 -t utf-8 {} \; | grep "$1\|\.reg"
}

# Sample usage
grepreg SampleTextToSearch

Solution 10 - Unicode

You can use the following Ruby's one-liner:

ruby -e "puts File.open('file.txt', mode:'rb:BOM|UTF-16LE').readlines.grep(Regexp.new 'PATTERN'.encode(Encoding::UTF_16LE))"

For simplicity, this can be defined as the shell function like:

grep-utf16() { ruby -e "puts File.open('$2', mode:'rb:BOM|UTF-16LE').readlines.grep(Regexp.new '$1'.encode(Encoding::UTF_16LE))"; }

Then it can be used in a similar way to grep:

grep-utf16 PATTERN file.txt

Source: <https://stackoverflow.com/q/54729247/55075>

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: taw
Solution 1 - Unicode: Niki Yoshiuchi
Solution 2 - Unicode: nirmal
Solution 3 - Unicode: Ethan Bradford
Solution 4 - Unicode: Mike Cush
Solution 5 - Unicode: kenorb
Solution 6 - Unicode: Felix
Solution 7 - Unicode: Dr. Alex RE
Solution 8 - Unicode: user1117791
Solution 9 - Unicode: Andrew Stern
Solution 10 - Unicode: kenorb