How to remove non-UTF-8 characters from a text file

Linux, Bash, Text, UTF-8, Character Encoding

Linux Problem Overview


I have a bunch of Arabic, English, and Russian files that are encoded in UTF-8. When I try to process these files with a Perl script, I get this error:

Malformed UTF-8 character (fatal)

Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.

Is there any way to do it?

Linux Solutions


Solution 1 - Linux

This command:

iconv -f utf-8 -t utf-8 -c file.txt

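writes a cleaned copy of file.txt to standard output, dropping all invalid byte sequences. The file itself is left untouched, so redirect the output to a new file to keep the result.

-f is the source encoding
-t is the target encoding
-c omits characters that cannot be converted, including invalid sequences, from the output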
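
As a rough sketch of how this might be used in practice (the file names sample.txt and sample.clean.txt are made up for illustration), the following builds a small file with one invalid byte, checks it, and then writes a cleaned copy:

```shell
# Build a test file: valid UTF-8 ("café"), one invalid byte (0xFF), then " ok".
# The file names are hypothetical, chosen only for this example.
printf 'caf\303\251 \377 ok\n' > sample.txt

# Check: without -c, iconv stops at the first invalid sequence and exits non-zero.
if iconv -f utf-8 -t utf-8 sample.txt > /dev/null 2>&1; then
    echo "sample.txt is valid UTF-8"
else
    echo "sample.txt contains invalid UTF-8"
fi

# Clean: -c drops the invalid byte. Write to a new file rather than redirecting
# onto the input, which would truncate it before iconv reads it.
# "|| true" is there because some iconv builds still exit non-zero after dropping bytes.
iconv -f utf-8 -t utf-8 -c sample.txt > sample.clean.txt || true
```

Running the check again on sample.clean.txt should then report the file as valid.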

Solution 2 - Linux

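iconv can do it, provided the stray bytes are really Windows-1252 (cp1252) text rather than damaged UTF-8. The following converts foo.txt from cp1252 to the current locale's encoding and prints the result to standard output: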

iconv -f cp1252 foo.txt
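
If the file really is cp1252, a slightly more explicit variant might name the target encoding as well, so the result does not depend on the current locale. A minimal sketch, with made-up file names:

```shell
# Build a test file containing "café" encoded as cp1252 (0xE9 is é there).
# The file names are hypothetical, chosen only for this example.
printf 'caf\351\n' > legacy.txt

# Convert cp1252 to UTF-8 explicitly and save the result in a new file.
iconv -f cp1252 -t utf-8 legacy.txt > legacy.utf8.txt
```

Note that this reinterprets every byte above 0x7F as a cp1252 character, so it would mangle a file that is mostly valid UTF-8.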

Solution 3 - Linux

Your method must read the file byte by byte and understand how UTF-8 characters are built from bytes. The simplest approach is to use an editor that will read anything but write only valid UTF-8. TextPad is one choice.
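
To make the bytewise construction concrete: in UTF-8, a byte whose first bit is 0 is a one-byte (ASCII) character, bytes starting with 110, 1110, or 11110 open a two-, three-, or four-byte character, and bytes starting with 10 are continuation bytes. A byte such as 0xFF fits none of these patterns, so it can never appear in valid UTF-8. One way to look at the raw bytes, sketched here with od on a made-up input:

```shell
# Dump the bytes of "Aé" followed by a stray 0xFF byte and a newline.
# 41    = 'A' (one-byte character)
# c3 a9 = 'é' (lead byte 110xxxxx plus one continuation byte 10xxxxxx)
# ff    = matches no UTF-8 pattern, so it is invalid
# 0a    = newline
printf 'A\303\251\377\n' | od -An -tx1
```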

Solution 4 - Linux

None of the methods here or in similar questions worked for me. In the end, what worked was opening the file in Sublime Text 2, going to File > Reopen with Encoding > UTF-8, copying the entire content of the file into a new file, and saving it.

This may not be the expected solution, but I'm putting it out here in case it helps anyone, since I struggled with this for hours.

Solution 5 - Linux

cat foo.txt | strings -n 8 > bar.txt

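will do the job for mostly-ASCII files. Note that strings keeps only runs of at least 8 printable characters and, by default, treats only ASCII as printable, so it would also drop the Arabic and Russian text in files like the ones in the question.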

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type          Original Author    Original Content on Stack Overflow
Question              Hakim              View Question on Stack Overflow
Solution 1 - Linux    Palantir           View Answer on Stack Overflow
Solution 2 - Linux    Zombo              View Answer on Stack Overflow
Solution 3 - Linux    Charles Knell      View Answer on Stack Overflow
Solution 4 - Linux    Mythos             View Answer on Stack Overflow
Solution 5 - Linux    atul jha           View Answer on Stack Overflow