How to get only the first ten bytes of a binary file
Bash Problem Overview
I am writing a bash script that needs to get the header (first 10 bytes) of a file and then, in another section, get everything except the first 10 bytes. These are binary files and will likely have \0's and \n's throughout the first 10 bytes. It seems like most utilities work with ASCII files. What is a good way to achieve this task?
Bash Solutions
Solution 1 - Bash
To get the first 10 bytes, as noted already:
head -c 10
To get all but the first 10 bytes (at least with GNU tail):
tail -c+11
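As a quick check that both commands are byte-exact on binary data, here is a minimal sketch. The file names sample.bin, header.bin and body.bin are made up for illustration, and it assumes a head that supports -c and a tail that accepts -c +N (as GNU coreutils does).

```bash
# Hypothetical file names, used only for this illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 16-byte sample whose first 10 bytes include NUL (\0) and newline (\n).
printf 'AB\0CD\nEF\0\nrest!\n' > sample.bin

head -c 10 sample.bin > header.bin     # bytes 1..10
tail -c +11 sample.bin > body.bin      # byte 11 to the end

# Byte counts of each part ($(( )) strips any padding printed by wc).
header_size=$(( $(wc -c < header.bin) ))
body_size=$(( $(wc -c < body.bin) ))

# Check that the two parts, concatenated, rebuild the original file.
if cat header.bin body.bin | cmp -s - sample.bin; then
    roundtrip=ok
else
    roundtrip=failed
fi
echo "header=$header_size body=$body_size roundtrip=$roundtrip"
```

With the sample above, the header should be 10 bytes and the body 6 bytes, and the round trip should report ok.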
Solution 2 - Bash
head -c 10 does the right thing here.
Solution 3 - Bash
You can use the `dd` command (http://www.gnu.org/software/coreutils/manual/html_node/dd-invocation.html) to copy an arbitrary number of bytes from a binary file.
dd if=infile of=outfile1 bs=10 count=1
dd if=infile of=outfile2 bs=10 skip=1
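One drawback of bs=10 in the second command is that the whole remainder is copied in 10-byte blocks. Assuming GNU dd (these flags are GNU extensions and are not portable), iflag=count_bytes and iflag=skip_bytes let count= and skip= be given in bytes while a larger block size is kept. A sketch, with the same infile, outfile1 and outfile2 names as in the commands above and a made-up sample input:

```bash
# Assumption: GNU coreutils dd; count_bytes/skip_bytes are GNU extensions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Small sample input (27 bytes) so the sketch is self-contained.
printf 'AB\0CD\nEF\0\nrest of the data\n' > infile

# First 10 bytes: count= is interpreted as a byte count.
dd if=infile of=outfile1 bs=64K count=10 iflag=count_bytes status=none

# Everything after byte 10: skip= is a byte offset, copied in 64 KiB blocks.
dd if=infile of=outfile2 bs=64K skip=10 iflag=skip_bytes status=none

# Confirm the two parts rebuild the original.
if cat outfile1 outfile2 | cmp -s - infile; then
    dd_roundtrip=ok
else
    dd_roundtrip=failed
fi
echo "roundtrip=$dd_roundtrip"
```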
Solution 4 - Bash
How to split a stream (or a file) under bash
Reading SO request:
> get the header (first 10 bytes) of a file and then in another section get everything except the first 10 bytes.
I understand:
> How to split a file at specific point
As all the answers here access the same file twice instead of splitting it like a stream, here is my two cents:
The interesting thing about Un*x is considering the whole job as a filter: it's easy to split a stream using unbuffered I/O. Most standard Un*x tools (cat, grep, awk, sed, python, perl...) work as filters.
head, but in a single pass
Using:
{ head -c 10 >head_part; cat >tail_part;} <file
This is more efficient, as your file is read only once: the first 10 bytes go to head_part and the rest goes to tail_part.
Note: the second redirection >tail_part could be placed outside of the whole list ({ ...;}) as well...
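The single-pass split relies on head not reading past the bytes it was asked for. GNU head behaves that way as far as I know, but other implementations may read ahead, so a round-trip check like the sketch below is worth running on the target system. The sample content is made up; head_part and tail_part are the names used above.

```bash
# Assumption: head consumes no more than the 10 bytes requested.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'AB\0CD\nEF\0\nrest of the stream\n' > file

# head and cat share one open file description, so cat continues at byte 11.
{ head -c 10 > head_part; cat > tail_part; } < file

# Verify that nothing was lost or duplicated between the two parts.
if cat head_part tail_part | cmp -s - file; then
    split_status="lossless"
else
    split_status="data lost: head read ahead"
fi
echo "$split_status"
```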
dd:
You could do the same using:
{ dd count=1 bs=10 of=head_part; cat;} <file >tail_part
This stays more efficient than running two dd processes that open the same file twice, and the rest of the file is still copied with a standard block size.
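One caveat when the input is a pipe rather than a regular file: a single read may return fewer than 10 bytes, and dd with count=1 would then stop short. GNU dd offers iflag=fullblock (a GNU extension, so an assumption about the environment) to keep reading until the block is full. A minimal sketch with a deliberately slow, made-up producer:

```bash
# Assumption: GNU dd; iflag=fullblock is not available in every dd.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The producer writes the first 10 bytes in two pieces with a pause between
# them, so a plain "dd bs=10 count=1" could see a short first read.
{ printf 'Hello'; sleep 1; printf ' world! and the rest\n'; } |
    { dd bs=10 count=1 iflag=fullblock of=head_part status=none; cat > tail_part; }

head_size=$(( $(wc -c < head_part) ))
echo "head_part holds $head_size bytes"
```

With fullblock, head_part should receive all 10 bytes ("Hello worl") and cat picks up the remainder.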
Another sample, based on reading line by line:
Split an HTTP (or mail) stream at the nearly empty line (a line containing only a carriage return: \r):
nc google.com 80 <<<$'GET / HTTP/1.0\r\nHost: google.com\r\n\r' |
{ sed -u '/^\r$/q' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw
or, to drop the empty last header line:
nc google.com 80 <<<$'GET / HTTP/1.0\r\nHost: google.com\r\n\r' |
{ sed -nu '/^\r$/q;p' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw
This will produce two files:
ls -l so_*.raw
-rw-r--r-- 1 root root 307 Apr 25 11:40 so_head.raw
-rw-r--r-- 1 root root 219 Apr 25 11:40 so_body.raw
grep www so_*.raw
so_body.raw:<A HREF="http://www.google.com/">here</A>.
so_head.raw:Location: http://www.google.com/
Pure bash way:
If the goal is to obtain the values of the first 10 bytes in a usable bash variable, here is a nice and efficient way:
Because ten bytes are few, the fork to head can be avoided. From Read a file by bytes in BASH:
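read8() {
    # Read one byte from stdin; store it as two hex digits in the variable named by $1.
    local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
    # -d '' makes NUL the delimiter, so a NUL byte is read as an empty string (stored as 00).
    read -r -d '' -n 1 _r8_car || { printf -v "$_r8_var" '';return 1;}
    # Quoted so that bytes such as * or ? are never glob-expanded.
    printf -v "$_r8_var" %02X "'$_r8_car"
}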
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} < "$infile" >"$outfile"
This will create an array ${first10[@]} containing the hexadecimal values of the first ten bytes of $infile and store the rest of the data in $outfile.
declare -p first10
declare -a first10=([0]="25" [1]="50" [2]="44" [3]="46" [4]="2D" [5]="31" [6]="2E"
[7]="34" [8]="0A" [9]="25")
This was a PDF (%PDF -> 25 50 44 46)... Here's another sample:
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} <<<"Hello world!"
d!
As I didn't redirect the output, the string d! is printed on the terminal.
echo ${first10[@]}
48 65 6C 6C 6F 20 77 6F 72 6C
printf '%b%b%b%b%b%b%b%b%b%b\n' ${first10[@]/#/\\x}
Hello worl
About binary
You said:
> These are binary files and will likely have \0's and \n's throughout the first 10 bytes.
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} < <(gzip <<<"Hello world!") >/dev/null
echo ${first10[@]}
1F 8B 08 00 00 00 00 00 00 03
(A sample containing a \n is at the bottom of this answer ;)
As a function
read8() { local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
    read -r -d '' -n 1 _r8_car || { printf -v "$_r8_var" '';return 1;}
    printf -v "$_r8_var" %02X "'$_r8_car" ;}
get10() {
local -n result=${1:-first10} # 1st arg is array name
local -i _i
result=()
for ((_i=0;_i<${2:-10};_i++));do # 2nd arg is number of bytes
read8 result[_i] || { unset result[_i] ; return 1 ;}
done
cat
}
Then (here, the special character ⛶ marks the absence of a trailing newline):
get10 pdf 4 <"$infile" >"$outfile"
printf %b ${pdf[@]/#/\\x}
%PDF⛶
echo $(( $(stat -c %s "$infile") - $(stat -c %s "$outfile") ))
4
get10 test 8 <<<'Hello World!'
rld!
printf %b ${test[@]/#/\\x}
Hello Wo⛶
get10 test 24 <<<'Hello World!'
printf %b ${test[@]/#/\\x}
Hello World!
(And the last character printed is a \n! ;)
Final binary demo:
get10 test 256 < <(gzip <<<'Hello world!')
printf '%b' ${test[@]/#/\\x} | gunzip
Hello world!
printf " %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s\n" ${test[@]}
1F 8B 08 00 00 00 00 00 00 03 F3 48 CD C9 C9 57
28 CF 2F CA 49 51 E4 02 00 41 E4 A9 B2 0D 00 00
00
Note: this works fine and is very quick as long as the number of bytes to read stays low, even when processing large files. It could be used for file recognition, for example. But for splitting files into larger parts, you have to use split, head, tail and/or dd.
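For the file-recognition use mentioned above, and where forking is acceptable, the header can also be dumped with head and od. The sketch below is only an illustration: hex_header and detect_type are hypothetical helper names, and it covers just the two signatures that appear in this answer (PDF and gzip). It assumes head -c is available.

```bash
# hex_header and detect_type are hypothetical helpers for illustration.

# Print the first N bytes (default 10) of FILE as lowercase hex pairs.
hex_header() {
    head -c "${2:-10}" "$1" | od -An -v -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Guess a file type from its leading "magic" bytes.
detect_type() {
    case "$(hex_header "$1" 4)" in
        "25 50 44 46") echo "PDF" ;;      # %PDF
        "1f 8b"*)      echo "gzip" ;;     # gzip magic number
        *)             echo "unknown" ;;
    esac
}

# Usage with two small sample files.
workdir=$(mktemp -d)
printf '%%PDF-1.4\n' > "$workdir/sample.pdf"
printf 'Hello world!\n' | gzip > "$workdir/sample.gz"
detect_type "$workdir/sample.pdf"
detect_type "$workdir/sample.gz"
```

The two calls at the end should report PDF and gzip respectively; anything else falls through to unknown.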