How to get only the first ten bytes of a binary file

Bash, Binary

Bash Problem Overview


I am writing a bash script that needs to get the header (first 10 bytes) of a file and then in another section get everything except the first 10 bytes. These are binary files and will likely have \0's and \n's throughout the first 10 bytes. It seems like most utilities work with ASCII files. What is a good way to achieve this task?

Bash Solutions


Solution 1 - Bash

To get the first 10 bytes, as noted already:

head -c 10

To get all but the first 10 bytes (at least with GNU tail):

tail -c+11
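
As a quick check that both commands are byte-safe, the sketch below builds a small sample file with a NUL and a newline inside its first 10 bytes, splits it, and confirms that the two parts join back into the original. The file names (sample.bin, header.bin, body.bin) and the work variable are made up for this illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail
work=$(mktemp -d)

# 17 bytes of sample data; the first 10 contain a NUL and a newline.
printf 'AB\0CD\nEFGHIJKLMN\n' > "$work/sample.bin"

head -c 10  "$work/sample.bin" > "$work/header.bin"   # bytes 1..10
tail -c +11 "$work/sample.bin" > "$work/body.bin"     # byte 11 to the end

wc -c < "$work/header.bin"        # size of the header part
wc -c < "$work/body.bin"          # size of the remainder
od -An -tx1 "$work/header.bin"    # hex dump shows the NUL and newline survived

# Joining the two parts must reproduce the original file.
cat "$work/header.bin" "$work/body.bin" | cmp - "$work/sample.bin" && echo "round trip OK"
```

Note the off-by-one: tail -c +11 means "start at byte 11", which is what skips exactly 10 bytes.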

Solution 2 - Bash

head -c 10 does the right thing here.

Solution 3 - Bash

You can use the dd command (http://www.gnu.org/software/coreutils/manual/html_node/dd-invocation.html) to copy an arbitrary number of bytes from a binary file.

dd if=infile of=outfile1 bs=10 count=1
dd if=infile of=outfile2 bs=10 skip=1
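
The two dd commands can be tried on a throwaway file, as in the sketch below. With bs=10 the second command copies the whole body 10 bytes at a time, which gets slow on large files. The third command is a GNU-only variant (an assumption: it needs GNU coreutils dd, where iflag=skip_bytes makes skip= count bytes instead of blocks). The work variable and the outfile2_fast name are invented for this demo.

```shell
#!/usr/bin/env bash
set -euo pipefail
work=$(mktemp -d)
infile=$work/infile

# 10-byte header with NUL and newline bytes, followed by 100000 'x' bytes.
printf 'HDR\0\n\0\n123' > "$infile"
head -c 100000 /dev/zero | tr '\0' 'x' >> "$infile"

# Header: one block of 10 bytes (2>/dev/null hides dd's transfer statistics).
dd if="$infile" of="$work/outfile1" bs=10 count=1 2>/dev/null

# Body, as in the answer: skip one 10-byte block, then copy in 10-byte blocks.
dd if="$infile" of="$work/outfile2" bs=10 skip=1 2>/dev/null

# GNU-only variant: skip is counted in bytes, so a large block size can be used.
dd if="$infile" of="$work/outfile2_fast" bs=64K iflag=skip_bytes skip=10 2>/dev/null

cmp "$work/outfile2" "$work/outfile2_fast" && echo "both body copies match"
```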

Solution 4 - Bash

How to split a stream (or a file) under bash

Reading SO request:

> get the header (first 10 bytes) of a file and then in another section get everything except the first 10 bytes.

I understand:

> How to split a file at specific point

As all the other answers here access the same file twice, instead of splitting it like a stream, here are my two cents:

The interesting thing about Un*x is that the whole job can be treated as a filter, so it's easy to split a stream using unbuffered I/O. Most standard Un*x tools (cat, grep, awk, sed, python, perl ...) work as filters.

Using head but in a single pass

{ head -c 10 >head_part; cat >tail_part;} <file

This is more efficient, as your file is read only once: the first 10 bytes go to head_part and the rest goes to tail_part.

Note: the second redirection, >tail_part, could be placed outside of the whole list ({ ...;}) as well...

You could do the same using dd:

{ dd count=1 bs=10 of=head_part; cat;} <file >tail_part

This remains more efficient than running two dd processes that each open the same file, and cat still uses a standard block size for the rest of the file.
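
One caveat when the input is a pipe rather than a regular file: dd with count=1 issues a single read, and a read on a pipe may return fewer than 10 bytes. The sketch below is a cautious variant that reads the header one byte at a time (bs=1 count=10), so the stream is left positioned exactly after the tenth byte. The producer and the work variable are invented for the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail
work=$(mktemp -d)

# Producer: a stream that cannot be rewound (a pipe), 10-byte header plus body.
{ printf '0123456789'; printf 'body line %d\n' 1 2 3; } |
    {
        # Ten 1-byte reads: never consumes more than the header.
        dd bs=1 count=10 of="$work/head_part" 2>/dev/null
        # cat picks up the stream right after the tenth byte.
        cat > "$work/tail_part"
    }

cat "$work/head_part"; echo
cat "$work/tail_part"
```

Reading byte by byte is only reasonable because the header is tiny; the bulk of the data still goes through cat with its normal buffer size.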

Another example, based on reading line by line:

Split an HTTP (or mail) stream at the nearly empty line (a line containing only a carriage return, \r):

nc google.com 80 <<<$'GET / HTTP/1.0\r\nHost: google.com\r\n\r' |
    { sed -u '/^\r$/q' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw

or, to drop the empty last line of the head:

nc google.com 80 <<<$'GET / HTTP/1.0\r\nHost: google.com\r\n\r' |
    { sed -nu '/^\r$/q;p' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw

This will produce two files:

ls -l so_*.raw
-rw-r--r-- 1 root    root           307 Apr 25 11:40  so_head.raw
-rw-r--r-- 1 root    root           219 Apr 25 11:40  so_body.raw

grep www so_*.raw
so_body.raw:<A HREF="http://www.google.com/">here</A>.
so_head.raw:Location: http://www.google.com/

Pure bash way:

If the goal is to obtain the values of the first 10 bytes in a usable bash variable, here is a nice and efficient way:

Because ten bytes are few, the fork to head can be avoided. From "Read a file by bytes in BASH":

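read8() {
    # $1: name of the variable to fill (default: OUTBIN)
    local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
    # Read one byte; a NUL byte acts as the delimiter and leaves _r8_car empty
    read -r -d '' -n 1 _r8_car || { printf -v "$_r8_var" '';return 1;}
    # "'X" makes printf use the numeric value of X; an empty value gives 00
    printf -v "$_r8_var" %02X "'$_r8_car"
}
{
    first10=()
    for i in {0..9};do
        read8 "first10[i]" || break
    done
    cat    # copy everything after the tenth byte
} < "$infile" >"$outfile"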

This will create an array ${first10[@]} containing the hexadecimal values of the first ten bytes of $infile and store the rest of the data in $outfile.

declare -p first10

declare -a first10=([0]="25" [1]="50" [2]="44" [3]="46" [4]="2D" [5]="31" [6]="2E"
[7]="34" [8]="0A" [9]="25")

This was a PDF (%PDF -> 25 50 44 46)... Here's another sample:

{
    first10=()
    for i in {0..9};do
        read8 first10[i] || break
    done
    cat
} <<<"Hello world!"
d!

As I didn't redirect the output, the string d! is printed on the terminal.

echo ${first10[@]}
48 65 6C 6C 6F 20 77 6F 72 6C

printf '%b%b%b%b%b%b%b%b%b%b\n' ${first10[@]/#/\\x}
Hello worl

About binary

You said:

> These are binary files and will likely have \0's and \n's throughout the first 10 bytes.

{
    first10=()
    for i in {0..9};do
        read8 first10[i] || break
    done
    cat
} < <(gzip <<<"Hello world!") >/dev/null 

echo ${first10[@]}
1F 8B 08 00 00 00 00 00 00 03

(A sample containing a \n is at the bottom of this answer ;)

As a function

read8() { local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
    read -r -d '' -n 1 _r8_car || { printf -v $_r8_var '';return 1;}
    printf -v $_r8_var %02X "'"$_r8_car ;}
get10() {
    local -n result=${1:-first10}     # 1st arg is array name
    local -i _i
    result=()
    for ((_i=0;_i<${2:-10};_i++));do  # 2nd arg is number of bytes
        read8 result[_i] || { unset result[_i] ; return 1 ;}
    done
    cat
}

Then (here, the special character ⛶ marks that there was no trailing newline):

get10 pdf 4 <$infile >$outfile
printf %b ${pdf[@]/#/\\x}
%PDF⛶

echo $(( $(stat -c %s $infile) - $(stat -c %s $outfile) ))
4

get10 test 8 <<<'Hello World!'
rld!

printf %b ${test[@]/#/\\x}
Hello Wo⛶

get10 test 24 <<<'Hello World!'
printf %b ${test[@]/#/\\x}
Hello World!

( And the last character printed is a \n! ;)

Final binary demo:
get10 test 256 < <(gzip <<<'Hello world!')

printf '%b' ${test[@]/#/\\x} | gunzip 
Hello world!

printf "  %s %s %s %s  %s %s %s %s    %s %s %s %s  %s %s %s %s\n" ${test[@]}
  1F 8B 08 00  00 00 00 00    00 03 F3 48  CD C9 C9 57
  28 CF 2F CA  49 51 E4 02    00 41 E4 A9  B2 0D 00 00
  00                    

Note: this works fine and is very quick as long as the number of bytes to read stays low, even when processing large files. It could be used for file recognition, for example. But for splitting files into larger parts, you have to use split, head, tail and/or dd.
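
The file-recognition idea could look roughly like the sketch below. It is not the pure-bash read8 approach from this answer: for brevity it swaps in head and od to obtain the hex values. The function names magic_of and guess_type, the work variable, and the short list of signatures are all made up for this illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# magic_of FILE: print the first 4 bytes of FILE as uppercase hex digits.
# (Hypothetical helper; uses head and od instead of read8.)
magic_of() {
    head -c 4 < "$1" | od -An -tx1 | tr -d ' \n' | tr 'a-f' 'A-F'
}

# guess_type FILE: map a few well-known signatures to a label.
guess_type() {
    case $(magic_of "$1") in
        25504446*) echo pdf ;;       # "%PDF"
        1F8B*)     echo gzip ;;      # gzip streams start with 1F 8B
        89504E47*) echo png ;;       # 0x89 followed by "PNG"
        *)         echo unknown ;;
    esac
}

# Demo on three throwaway files.
work=$(mktemp -d)
printf '%%PDF-1.4\n%%\n'            > "$work/a"
printf 'Hello world!\n' | gzip      > "$work/b"
printf 'plain text\n'               > "$work/c"
for f in a b c; do
    printf '%s: %s\n' "$f" "$(guess_type "$work/$f")"
done
```

Only the first few bytes are read, so this stays cheap even on very large files.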

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        Original Author   Original Content on Stackoverflow
Question            User1             View Question on Stackoverflow
Solution 1 - Bash   psmears           View Answer on Stackoverflow
Solution 2 - Bash   moonshadow        View Answer on Stackoverflow
Solution 3 - Bash   Mark Ransom       View Answer on Stackoverflow
Solution 4 - Bash   F. Hauri          View Answer on Stackoverflow