Identifying and removing null characters in UNIX
Unix Problem Overview
I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi, I see ^@ symbols interleaved with the normal text. How can I:
- Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
- Remove the null characters? Running strings on the file cleaned it up, but I'm wondering whether this is the best way.
Unix Solutions
Solution 1 - Unix
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering whether input redirection in the middle of the command arguments works, it does. Most shells recognize and handle I/O redirection (<, >, …) anywhere on the command line.
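To check that the tr command really removed everything, one option is to count the NUL bytes before and after. The following is a minimal sketch, not part of the original answer: the sample file contents and the workdir, before, and after names are made up for illustration.

```shell
# Hypothetical sample file: "a", NUL, "b", newline, "c", newline.
workdir=$(mktemp -d)
printf 'a\000b\nc\n' > "$workdir/file-with-nulls"

# Count NUL bytes: -c complements the set, so -cd deletes everything
# except NUL, and wc -c counts what is left.
before=$(( $(tr -cd '\000' < "$workdir/file-with-nulls" | wc -c) ))

# Remove the NULs with the same tr invocation as in the answer.
tr -d '\000' < "$workdir/file-with-nulls" > "$workdir/file-without-nulls"

after=$(( $(tr -cd '\000' < "$workdir/file-without-nulls" | wc -c) ))
echo "NUL bytes before: $before, after: $after"
```

The arithmetic expansion around wc -c only strips the leading spaces that some wc implementations print.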
Solution 2 - Unix
Use the following sed command (GNU sed syntax) to remove the null characters from a file:
sed -i 's/\x0//g' null.txt
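This solution edits the file in place. Note that sed -i works by writing a new file and renaming it over the original, so a program that already has the file open will keep seeing the old contents. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.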
Solution 3 - Unix
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
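A conversion along these lines might look like the sketch below. It assumes the input starts with a byte order mark so that iconv can detect the byte order; if your file has none, you may need to name the encoding explicitly (UTF-16LE or UTF-16BE). The file names and sample text are hypothetical.

```shell
workdir=$(mktemp -d)

# Hypothetical sample: "hi" plus newline in little-endian UTF-16,
# preceded by the byte order mark FF FE (octal 377 376).
printf '\377\376h\000i\000\n\000' > "$workdir/utf16.txt"

# Convert to UTF-8; the byte order mark tells iconv the byte order.
iconv -f UTF-16 -t UTF-8 "$workdir/utf16.txt" > "$workdir/utf8.txt"

# Count the NUL bytes that remain after the conversion.
nul_count=$(( $(tr -cd '\000' < "$workdir/utf8.txt" | wc -c) ))
echo "NUL bytes after conversion: $nul_count"
```

Unlike deleting the NULs with tr, this also converts any non-ASCII characters correctly.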
Solution 4 - Unix
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
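Also, an octal dump can tell you whether there are nulls. Use -b so that od prints one octal value per byte; the default two-byte words can hide a NUL:
od -b file-with-nulls | grep ' 000'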
Solution 5 - Unix
If the lines in the file end with \r\n\000, what works is to delete every \n and \000 and then replace each \r with \n:
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
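The effect of this pipeline can be checked on a small input. The sketch below builds a hypothetical two-line file with CR LF NUL line endings and dumps the result byte by byte; the file contents are made up for illustration.

```shell
workdir=$(mktemp -d)

# Hypothetical input: two lines, each terminated by CR, LF, NUL.
printf 'one\r\n\000two\r\n\000' > "$workdir/infile"

# Delete all LF and NUL bytes, then turn each remaining CR into LF.
tr -d '\n\000' < "$workdir/infile" | tr '\r' '\n' > "$workdir/outfile"

# od -c prints every byte, so the new line endings can be inspected.
od -c "$workdir/outfile"
```

Because the first tr deletes every \n and \000, not only those at line ends, this approach fits files where those bytes appear only as part of the terminator.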
Solution 6 - Unix
Here is an example of how to remove NUL characters using ex (in place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
To recurse into subdirectories, you can use the globbing pattern **/*.txt (if your shell supports it).
This is useful for scripting, since the -i option of sed is a non-standard extension whose syntax differs between GNU and BSD sed.
See also: How to check if the file is a binary file and read all the files which are not?
Solution 7 - Unix
I used:
recode UTF-16..UTF-8 <filename>
to get rid of the zeroes in the file.
Solution 8 - Unix
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16:
f=cd.open(filePath,'r','utf-16')