How to check if the file is a binary file and read all the files which are not?

Tags: Shell, Unix, Binaryfiles

Shell Problem Overview


How can I know if a file is a binary file?

For example, a compiled C file.

I want to read all files in some directory, but I want to ignore binary files.

Shell Solutions


Solution 1 - Shell

Use the file utility. Sample usage:

 $ file /bin/bash
 /bin/bash: Mach-O universal binary with 2 architectures
 /bin/bash (for architecture x86_64):	Mach-O 64-bit executable x86_64
 /bin/bash (for architecture i386):	Mach-O executable i386

 $ file /etc/passwd
 /etc/passwd: ASCII English text

 $ file code.c
 code.c: ASCII c program text

file manual page
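
To cover the second half of the question (reading every non-binary file in a directory), the file check can be wrapped in a loop. The following is a minimal sketch, not a definitive implementation: the function names is_text and cat_text_files are made up for illustration, and it assumes a file implementation that supports -b and --mime-type and reports text files with a text/* MIME type. Some textual formats may be reported under application/* instead, in which case this sketch would skip them.

```shell
# is_text FILE: succeed if file(1) reports a text/* MIME type.
# Hypothetical helper; assumes file supports -b and --mime-type.
is_text() {
    case "$(file -b --mime-type -- "$1")" in
        text/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# cat_text_files DIR: print the contents of every regular text file in DIR.
cat_text_files() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue      # skip directories, devices, etc.
        is_text "$f" && cat -- "$f"
    done
    return 0
}
```

It could then be called as cat_text_files /some/directory.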

Solution 2 - Shell

Adapted from excluding binary file

find . -exec file {} \; | grep text | cut -d: -f1

Solution 3 - Shell

I use

! grep -qI . "$path"

The only drawback I can see is that it considers an empty file binary, but then again, who decides whether that is wrong?
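
A minimal sketch of how this check might be wrapped in a function, with the empty-file case handled explicitly. The name is_binary is hypothetical, and treating empty files as text is an assumption, not part of the original answer. It relies on the -I option (treat binary files as non-matching), which GNU and BSD grep provide.

```shell
# is_binary FILE: succeed if grep considers FILE binary.
# Assumption: an empty (or missing) file is treated as text, not binary.
is_binary() {
    [ -s "$1" ] || return 1          # zero-size file: report "not binary"
    ! grep -qI -e . -- "$1"          # -I makes binary files never match
}
```

Usage would look like is_binary "$path" || cat "$path". Note that a file consisting only of newline characters is still reported as binary, because the pattern . never matches an empty line.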

Solution 4 - Shell

BSD grep

Here is a simple solution to check for a single file using BSD grep (on macOS/Unix):

grep -q "\x00" file && echo Binary || echo Text

which checks whether the file contains a NUL character.

Using this method, you can read all non-binary files recursively with the find utility:

find . -type f -exec sh -c 'grep -q "\x00" "$1" || cat "$1"' _ {} ";"

Or even simpler using just grep:

grep -rv "\x00" .

For just the current folder, use:

grep -v "\x00" *

Unfortunately, the above examples won't work with GNU grep; however, there is a workaround.

GNU grep

Since GNU grep ignores NUL characters, it's possible to check for non-ASCII characters instead, like:

$ grep -P "[^\x00-\x7F]" file && echo Binary || echo Text

Note: It won't work for files containing only NUL characters.
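
Since the two grep implementations disagree on how to match a NUL byte, one alternative is to avoid grep for this step. The sketch below is an illustration rather than part of the original answer: the hypothetical helper has_nul compares a file's byte count before and after tr strips NUL bytes, which should behave the same with GNU and BSD tools. The names has_nul and cat_non_binary are assumptions.

```shell
# has_nul FILE: succeed if FILE contains at least one NUL byte.
# Hypothetical helper: compares byte counts with and without NULs.
has_nul() {
    total=$(wc -c < "$1")
    without_nul=$(tr -d '\000' < "$1" | wc -c)
    [ "$total" -ne "$without_nul" ]
}

# cat_non_binary DIR: print every regular file under DIR that has no NUL byte.
# Caveat: file names containing newlines are not handled by this loop.
cat_non_binary() {
    find "$1" -type f | while IFS= read -r f; do
        has_nul "$f" || cat -- "$f"
    done
}
```

For example, cat_non_binary . would read the non-binary files below the current directory.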

Solution 5 - Shell

perl -E 'exit((-B $ARGV[0])?0:1);' file-to-test

This can be used to check whether "file-to-test" is binary. The above command exits with code 0 for binary files; otherwise the exit code is 1.

The reverse check for text file can look like the following command:

perl -E 'exit((-T $ARGV[0])?0:1);' file-to-test

Likewise the above command will exit with status 0 if the "file-to-test" is text (not binary).

Read more about the -B and -T checks using the command perldoc -f -X.
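
Because these commands communicate through the exit status, they slot directly into shell conditionals. A small sketch, assuming perl is on PATH; the wrapper names is_text_perl and print_text_files are made up for this example, and the -f test is added so that directories and devices are rejected before -T runs.

```shell
# is_text_perl FILE: succeed if FILE is a plain file that Perl's -T
# heuristic classifies as text. Hypothetical wrapper name.
is_text_perl() {
    perl -e 'exit((-f $ARGV[0] && -T $ARGV[0]) ? 0 : 1)' -- "$1"
}

# print_text_files FILE...: print the contents of each argument that passes.
print_text_files() {
    for f in "$@"; do
        if is_text_perl "$f"; then
            cat -- "$f"
        fi
    done
}
```

A call such as print_text_files ./* would then read only the files in the current directory that Perl considers text.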

Solution 6 - Shell

Use Perl’s built-in -T file test operator, preferably after ascertaining that it is a plain file using the -f file test operator:

$ perl -le 'for (@ARGV) { print if -f && -T }' \
    getwinsz.c a.out /etc/termcap /bin /bin/cat \
    /dev/tty /usr/share/zoneinfo/UTC /etc/motd
getwinsz.c
/etc/termcap
/etc/motd

Here’s the complement of that set:

$ perl -le 'for (@ARGV) { print unless -f && -T }' \
    getwinsz.c a.out /etc/termcap /bin /bin/cat \
    /dev/tty /usr/share/zoneinfo/UTC /etc/motd
a.out
/bin
/bin/cat
/dev/tty
/usr/share/zoneinfo/UTC

Solution 7 - Shell

cat+grep

Assuming that binary means a file containing NUL characters, this shell command can help:

(cat -v file.bin | grep -q "\^@") && echo Binary || echo Text

or:

grep -q "\^@" <(cat -v file.bin) && echo Binary

This is a workaround for grep -q "\x00", which works with BSD grep but not with the GNU version.

Basically, the -v option of cat converts all non-printing characters so that they are visible in caret notation, for example:

$ printf "\x00\x00" | hexdump -C
00000000  00 00                                             |..|
$ printf "\x00\x00" | cat -v
^@^@
$ printf "\x00\x00" | cat -v | hexdump -C
00000000  5e 40 5e 40                                       |^@^@|

where each ^@ represents a NUL character. So once these sequences are found, we assume the file is binary.

The disadvantage of the above method is that it can generate false positives when the file contains a literal ^@ sequence that does not represent a control character. For example:

$ printf "\x00\x00^@^@" | cat -v | hexdump -C
00000000  5e 40 5e 40 5e 40 5e 40                           |^@^@^@^@|

See also: How do I grep for all non-ASCII characters.
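
One way to sidestep that false positive is to inspect the raw bytes instead of their cat -v rendering. The following sketch is not from the original answer, and the helper name contains_nul is hypothetical. It dumps the file as hex with od and looks for a 00 byte, so a literal ^@ in the text (bytes 5e 40) is not mistaken for NUL. It assumes a grep that supports -w, as GNU and BSD grep do.

```shell
# contains_nul FILE: succeed if the hex dump of FILE contains a 00 byte.
# Hypothetical helper; od -An -v -t x1 prints one two-digit hex value per
# byte, and grep -w matches 00 only as a whole token.
contains_nul() {
    od -An -v -t x1 < "$1" | grep -qw 00
}
```

It can be used in the same way as the commands above: contains_nul file.bin && echo Binary || echo Text.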

Solution 8 - Shell

Going off Bach's suggestion, I think --mime-encoding is the best flag to get something reliable from file.

file --mime-encoding [FILES ...] | grep -v '\bbinary$'

will print the files that file believes have a non-binary encoding. You can pipe this output through cut -d: -f1 to trim the : encoding if you just want the filenames.


Caveat: as @yugr reports below, .doc files report an encoding of application/mswordbinary. This looks like a bug to me: the MIME type is erroneously being concatenated with the encoding.

$ for flag in --mime --mime-type --mime-encoding; do
    echo "$flag"
    file "$flag" /tmp/example.{doc{,x},png,txt}
  done
--mime
/tmp/example.doc:  application/msword; charset=binary
/tmp/example.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary
/tmp/example.png:  image/png; charset=binary
/tmp/example.txt:  text/plain; charset=us-ascii
--mime-type
/tmp/example.doc:  application/msword
/tmp/example.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document
/tmp/example.png:  image/png
/tmp/example.txt:  text/plain
--mime-encoding
/tmp/example.doc:  application/mswordbinary
/tmp/example.docx: binary
/tmp/example.png:  binary
/tmp/example.txt:  us-ascii
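
A per-file variant of the same idea, sketched here under the assumption that file supports -b (brief output) together with --mime-encoding; the function name non_binary_files is hypothetical. Matching on a trailing "binary" also covers the application/mswordbinary output mentioned in the caveat.

```shell
# non_binary_files FILE...: print the names of the regular files whose
# reported MIME encoding does not end in "binary". Hypothetical helper name.
non_binary_files() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        case "$(file -b --mime-encoding -- "$f")" in
            *binary) ;;                     # binary, incl. "application/mswordbinary"
            *)       printf '%s\n' "$f" ;;
        esac
    done
}
```

For instance, non_binary_files /tmp/example.* would list only the files that file considers to have a text encoding.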

Solution 9 - Shell

Try the following command-line:

file "$FILE" | grep -vq 'ASCII' && echo "$FILE is binary"

Solution 10 - Shell

It's kind of brute force to exclude binary files with tr -d "[[:print:]\n\t]" < file | wc -c, but it is not heuristic guesswork either.

find . -maxdepth 1 -type f -exec /bin/sh -c '
   for file in "$@"; do
      if [ $(LC_ALL=C LANG=C tr -d "[[:print:]\n\t]" < "$file" | wc -c) -gt 0 ]; then
         echo "${file} is no ASCII text file (UNIX)"
      else
         echo "${file} is ASCII text file (UNIX)"
      fi
   done
' _ '{}' +

The following brute-force approach using grep -a -m 1 $'[^[:print:]\t]' file seems quite a bit faster, though.

find . -maxdepth 1 -type f -exec /bin/sh -c '
   tab="$(printf "\t")"
   for file in "$@"; do
      if LC_ALL=C LANG=C grep -a -m 1 "[^[:print:]${tab}]" "$file" 1>/dev/null 2>&1; then
         echo "${file} is no ASCII text file (UNIX)"
      else
         echo "${file} is ASCII text file (UNIX)"
      fi
   done
' _ '{}' + 

Solution 11 - Shell

Solution 12 - Shell

grep

Assuming that binary means a file containing non-printable characters (excluding blank characters such as spaces and tabs, and newline characters), this may work (both BSD and GNU):

$ grep '[^[:print:][:blank:]]' file && echo Binary || echo Text

Note: GNU grep will report a file containing only NUL characters as text, but the check works correctly with the BSD version.

For more examples, see: How do I grep for all non-ASCII characters.

Solution 13 - Shell

Perhaps this would suffice:

if ! file /path/to/file | grep -iq ASCII ; then
    echo "Binary"
fi

if file /path/to/file | grep -iq ASCII ; then
    echo "Text file"
fi

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Refael | View Question on Stackoverflow
Solution 1 - Shell | Adam Siemion | View Answer on Stackoverflow
Solution 2 - Shell | gongzhitaao | View Answer on Stackoverflow
Solution 3 - Shell | Alois Mahdal | View Answer on Stackoverflow
Solution 4 - Shell | kenorb | View Answer on Stackoverflow
Solution 5 - Shell | Onlyjob | View Answer on Stackoverflow
Solution 6 - Shell | tchrist | View Answer on Stackoverflow
Solution 7 - Shell | kenorb | View Answer on Stackoverflow
Solution 8 - Shell | dimo414 | View Answer on Stackoverflow
Solution 9 - Shell | user1985553 | View Answer on Stackoverflow
Solution 10 - Shell | vron | View Answer on Stackoverflow
Solution 11 - Shell | tonix | View Answer on Stackoverflow
Solution 12 - Shell | kenorb | View Answer on Stackoverflow
Solution 13 - Shell | Mike Q | View Answer on Stackoverflow