How to remove non-UTF-8 characters from a text file
Tags: Linux, Bash, Text, UTF-8, Character Encoding
Problem Overview
I have a number of Arabic, English, and Russian files, all encoded in UTF-8. When I try to process these files with a Perl script, I get this error:
Malformed UTF-8 character (fatal)
Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.
Is there any way to do it?
Linux Solutions
Solution 1 - Linux
This command:
iconv -f utf-8 -t utf-8 -c file.txt
will clean up your UTF-8 file, skipping all invalid characters.
-f is the source encoding
-t is the target encoding
-c skips any invalid byte sequence
The cleaned text is written to standard output, so redirect it into a new file rather than overwriting the original.
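As a minimal sketch of the effect of -c (sample.txt is a hypothetical file created here purely for demonstration):

```shell
# Create a sample file with an invalid UTF-8 byte (0xFF) in the middle
printf 'hello \xffworld\n' > sample.txt

# -c silently drops the invalid sequence; the result goes to stdout,
# so redirect it into a new file rather than overwriting the original
iconv -f utf-8 -t utf-8 -c sample.txt > sample.clean.txt

cat sample.clean.txt
# prints: hello world
```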
Solution 2 - Linux
iconv can do it. For example, if the strange bytes are actually Windows-1252 (cp1252):
iconv -f cp1252 foo.txt
With no -t option, iconv converts to the encoding of the current locale.
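To make the target encoding explicit, the same conversion can be written with -t utf-8. A short sketch (legacy.txt is a hypothetical example file created here for illustration):

```shell
# 0x93 and 0x94 are curly quotes in Windows-1252 but invalid as UTF-8
printf 'say \x93hi\x94\n' > legacy.txt

# Convert from cp1252 to UTF-8 explicitly and write to a new file
iconv -f cp1252 -t utf-8 legacy.txt > legacy-utf8.txt
```

Unlike the -c approach in Solution 1, this reinterprets the bytes rather than discarding them, so it is the better choice when the "strange characters" are really text in another encoding.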
Solution 3 - Linux
Any reliable method must read the file byte by byte and understand how UTF-8 characters are constructed from those bytes. The simplest approach is to use an editor that will read anything but only write valid UTF-8 characters; TextPad is one option.
Solution 4 - Linux
None of the methods here or on any other similar questions worked for me. In the end, what worked was simply opening the file in Sublime Text 2: go to File > Reopen with Encoding > UTF-8, then copy the entire content of the file into a new file and save it.
It may not be the expected solution, but I'm putting this out here in case it helps anyone, since I struggled with this for hours.
Solution 5 - Linux
cat foo.txt | strings -n 8 > bar.txt
will do the job, though note that strings keeps only runs of at least 8 printable characters, so shorter lines are discarded along with the invalid bytes.
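A quick sketch of that lossiness (foo.txt is created here just for illustration):

```shell
# 'short' is under 8 printable characters; the 0xFF byte interrupts the long line
printf 'short\nthis line is long enough \xffto survive\n' > foo.txt

# strings -n 8 emits only printable runs of 8 or more characters
cat foo.txt | strings -n 8 > bar.txt

cat bar.txt
# 'short' is gone; the printable text around the invalid byte survives
```

Because of this, strings is best reserved for files where losing short fragments is acceptable; the iconv-based solutions above preserve more of the original text.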