Best way to convert text files between character sets?

Tags: Text, Unicode, UTF-8, Character Set

Text Problem Overview


What is the fastest, easiest tool or method to convert text files between character sets?

Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.

Everything goes: one-liners in your favorite scripting language, command-line tools or other utilities for OS, web sites, etc.

Best solutions so far:

On Linux/UNIX/OS X/cygwin:

  • GNU iconv, suggested by Troels Arvin, is best used as a filter. It seems to be universally available. Example:

      $ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
    

    As pointed out by Ben, there is an online converter using iconv.

  • recode, suggested by Cheekysoft, will convert one or several files in place. Example:

      $ recode UTF8..ISO-8859-15 in.txt
    

    This one uses shorter aliases:

      $ recode utf8..l9 in.txt
    

    Recode also supports surfaces which can be used to convert between different line ending types and encodings:

    Convert newlines from LF (Unix) to CR-LF (DOS):

      $ recode ../CR-LF in.txt
    

    Base64 encode file:

      $ recode ../Base64 in.txt
    

    You can also combine them.

    Convert a Base64-encoded UTF-8 file with Unix line endings to a Base64-encoded Latin-1 file with DOS line endings:

      $ recode utf8/Base64..l1/CR-LF/Base64 file.txt
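
The combined conversion above can be pictured as a chain of small steps. The following is a minimal Python sketch of that chain, not a description of how recode works internally; the function name `recode_base64_text` is hypothetical, and unlike recode's Base64 surface, `base64.b64encode` emits one unwrapped line.

```python
import base64


def recode_base64_text(data: bytes) -> bytes:
    """Base64 UTF-8 text with LF endings -> Base64 Latin-1 text with CR-LF endings."""
    # 1. Remove the Base64 layer, then decode the charset.
    text = base64.b64decode(data).decode("utf-8")
    # 2. Normalise to LF first so existing CR-LF pairs are not doubled.
    text = text.replace("\r\n", "\n").replace("\n", "\r\n")
    # 3. Encode to the target charset and re-apply the Base64 layer.
    return base64.b64encode(text.encode("latin-1"))
```

Characters that do not exist in Latin-1 raise `UnicodeEncodeError` here instead of being silently dropped.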
    

On Windows with PowerShell (Jay Bazuzi):

  • PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt

(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)

Edit

Do you mean ISO-8859-1 support? Using the "String" encoding does this, e.g. for the reverse direction:

gc -en string in.txt | Out-File -en utf8 out.txt

Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".

Text Solutions


Solution 1 - Text

Stand-alone utility approach

iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt

-f ENCODING  the encoding of the input
-t ENCODING  the encoding of the output

You don't have to specify either of these arguments. They will default to your current locale, which is usually UTF-8.
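
For readers who prefer a script, the same filter idea can be sketched in Python. This is an illustration under assumptions, not iconv itself: the function name `transcode` is hypothetical, and falling back to `locale.getpreferredencoding(False)` only approximates iconv's locale default.

```python
import locale


def transcode(data: bytes, from_enc: str = None, to_enc: str = None) -> bytes:
    """Decode bytes from one charset and re-encode them in another."""
    # Either side falls back to the locale's preferred encoding,
    # loosely mirroring iconv when -f or -t is omitted.
    default = locale.getpreferredencoding(False)
    text = data.decode(from_enc or default)
    return text.encode(to_enc or default)
```

Both steps are strict, so invalid input bytes or characters missing from the target charset raise an exception rather than producing damaged output.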

Solution 2 - Text

Try VIM

If you have Vim, you can use the command below. It has not been tested with every encoding.

The nice part is that you don't have to know the source encoding:

vim +"set nobomb | set fenc=utf8 | x" filename.txt

Be aware that this command modifies the file in place.


Explanation:
  1. + : Used by Vim to run a command directly when opening a file. Usually used to open a file at a specific line: vim +14 file.txt
  2. | : Separator between multiple commands (like ; in bash)
  3. set nobomb : Do not write a UTF-8 BOM
  4. set fenc=utf8 : Set the new file encoding to UTF-8
  5. x : Save and close the file
  6. filename.txt : Path to the file
  7. " : The quotes are needed because of the pipes (otherwise bash would treat them as shell pipes)
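
Vim can get away without a source encoding because it tries a list of candidate encodings when it opens the file. A rough Python sketch of that idea follows; the candidate list and the name `to_utf8_no_bom` are assumptions for illustration, and the guess is only as good as the order of the candidates.

```python
import codecs

# Assumed candidate order. Strict encodings go first; single-byte
# encodings such as ISO-8859-15 accept nearly any input, so they go last.
CANDIDATES = ("utf-8", "iso-8859-15")


def to_utf8_no_bom(data: bytes, candidates=CANDIDATES) -> bytes:
    """Guess the source encoding from a candidate list and return BOM-free UTF-8."""
    # Drop a leading UTF-8 BOM, like "set nobomb".
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    for enc in candidates:
        try:
            return data.decode(enc).encode("utf-8")
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings matched")
```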

Solution 3 - Text

Under Linux you can use the very powerful recode command to try and convert between the different charsets as well as any line ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.
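
A comparable check exists in Python, whose codec registry is separate from recode's list but serves the same purpose of telling you which names are understood. The helper name `canonical_name` below is hypothetical; it is a small sketch for testing whether a charset name or alias is known before attempting a conversion.

```python
import codecs


def canonical_name(name: str):
    """Return Python's canonical codec name, or None if the name is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None
```

Aliases resolve to the same canonical name, so `canonical_name("latin9")` and `canonical_name("ISO-8859-15")` compare equal.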

Solution 4 - Text

iconv(1)

iconv -f FROM-ENCODING -t TO-ENCODING file.txt

Also there are iconv-based tools in many languages.

Solution 5 - Text

Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT

The shortest version, if you can assume that the input BOM is correct:

gc FILE.TXT | Out-File -en utf7 file-utf7.txt
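
Relying on the BOM amounts to inspecting the first few bytes of the file. A minimal Python sketch of that check is shown here; the function name `sniff_bom` and the UTF-8 fallback are assumptions, and this is not how PowerShell itself is implemented.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(data: bytes, default: str = "utf-8") -> str:
    """Return a codec name that matches the BOM, or the default if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return default
```

The returned codec names consume the BOM while decoding, so `data.decode(sniff_bom(data))` yields text without a leading U+FEFF.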

Solution 6 - Text

Try iconv Bash function

I've put this into .bashrc:

utf8()
{
    # Convert via a temporary file; replace the original only if iconv succeeds
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

..to be able to convert files like so:

utf8 MyClass.java
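
The same convert-then-replace pattern can be sketched in Python. The function name `convert_in_place` and its defaults are assumptions chosen to match the function above; note that the temporary file is created with restrictive permissions, so the original file mode is not preserved.

```python
import os
import tempfile


def convert_in_place(path, from_enc="iso-8859-1", to_enc="utf-8"):
    """Re-encode a file, replacing it only after the new copy is fully written."""
    # newline="" keeps the original line endings untouched.
    with open(path, encoding=from_enc, newline="") as fin:
        text = fin.read()
    # Create the temporary file next to the target so the final rename
    # stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding=to_enc, newline="") as fout:
            fout.write(text)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # leave the original untouched on failure
        raise
```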

Solution 7 - Text

Try Notepad++

On Windows I was able to use Notepad++ to do the conversion from ISO-8859-1 to UTF-8. Click "Encoding" and then "Convert to UTF-8".

Solution 8 - Text

One-liner using find, with automatic character set detection

The character encoding of all matching text files gets detected automatically and all matching text files are converted to utf-8 encoding:

$ find . -type f -iname '*.txt' -exec sh -c 'iconv -f $(file -bi "$1" |sed -e "s/.*[ ]charset=//") -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;

To perform these steps, a sub shell sh is used with -exec, running a one-liner with the -c flag, and passing the filename as the positional argument "$1" with -- {}. In between, the utf-8 output file is temporarily named converted.

Whereby file -bi means:

  • -b, --brief Do not prepend filenames to output lines (brief mode).

  • -i, --mime Causes the file command to output mime type strings rather than the more traditional human readable ones. Thus it may say for example text/plain; charset=us-ascii rather than ASCII text. The sed command cuts this to only us-ascii as is required by iconv.

The find command is very useful for this kind of file management automation.
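
The sed step is the part most likely to need adjusting, so here is a small Python sketch of the same extraction. The names `charset_from_mime` and `NOT_CONVERTIBLE` are hypothetical; the set lists charset labels that file is known to report for data it cannot identify as text, and which iconv will not accept as a source encoding.

```python
# Labels reported by `file` that do not name a usable source encoding.
NOT_CONVERTIBLE = {"binary", "unknown-8bit"}


def charset_from_mime(mime_line: str):
    """Extract the charset from output such as 'text/plain; charset=us-ascii'.

    Returns None when there is no charset or it cannot be used for conversion.
    """
    # Like sed "s/.*[ ]charset=//": keep what follows the last " charset=".
    _, sep, charset = mime_line.rpartition(" charset=")
    charset = charset.strip()
    if not sep or charset in NOT_CONVERTIBLE:
        return None
    return charset
```

Skipping files for which this returns None avoids handing iconv an encoding name it will reject.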

Solution 9 - Text

DOS/Windows: use Code page

chcp 65001>NUL
type ascii.txt > unicode.txt

The chcp command changes the active code page. Code page 65001 is Microsoft's name for UTF-8. Once the code page is set, output generated by the commands that follow uses that code page.

Solution 10 - Text

PHP iconv()

iconv("UTF-8", "ISO-8859-15", $input);
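
One thing the call above does not show is what happens when a character has no equivalent in ISO-8859-15. The Python sketch below illustrates the choice between failing and substituting; the helper name `to_latin9` is hypothetical, and the `errors` values are those of Python's `str.encode`, not PHP options.

```python
def to_latin9(text: str, errors: str = "strict") -> bytes:
    """Encode text as ISO-8859-15 (Latin-9).

    errors="strict" raises UnicodeEncodeError on unmappable characters,
    errors="replace" substitutes "?", and errors="ignore" drops them.
    """
    return text.encode("iso-8859-15", errors=errors)
```

The euro sign is a useful test character: it exists in ISO-8859-15 (byte 0xA4) but not in ISO-8859-1, which is one of the main differences between the two charsets.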

Solution 11 - Text

Assuming you don't know the input encoding and still wish to automate most of the conversion, I arrived at this one-liner by combining previous answers.

iconv -f $(chardetect input.text | awk '{print $2}') -t utf-8 -o output.text input.text

Solution 12 - Text

To write Java properties files, I normally use this on Linux (Mint and Ubuntu distributions):

$ native2ascii filename.properties

For example:

$ cat test.properties 
first=Execução número um
second=Execução número dois

$ native2ascii test.properties 
first=Execu\u00e7\u00e3o n\u00famero um
second=Execu\u00e7\u00e3o n\u00famero dois

PS: I wrote "Execution number one/two" in Portuguese to force special characters.

In my case, the first time I ran it I received this message:

$ native2ascii teste.txt 
The program 'native2ascii' can be found in the following packages:
 * gcj-5-jdk
 * openjdk-8-jdk-headless
 * gcj-4.8-jdk
 * gcj-4.9-jdk
Try: sudo apt install <selected package>

After I installed the first option (gcj-5-jdk), the problem was solved.

I hope this helps someone.
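
If no JDK is available, the escaping shown in the example output can be approximated in a few lines of Python. This is a sketch of the visible behavior only, not a reimplementation of native2ascii; the function name `native_to_ascii` is hypothetical.

```python
def native_to_ascii(text: str) -> str:
    """Replace every non-ASCII character with \\uXXXX escapes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Encode as UTF-16 so characters outside the BMP become a
        # surrogate pair, which is what Java string escapes expect.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)
```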

Solution 13 - Text

In IntelliJ IDEA, simply change the encoding of the loaded file using the charset indicator on the right of the status bar (bottom). It prompts you to Reload or Convert; choose Convert. Make sure you have backed up the original file in advance.

Solution 14 - Text

Try EncodingChecker

EncodingChecker on github

File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.

File Encoding Checker requires .NET 4 or above to run.

For encoding detection, File Encoding Checker uses the UtfUnknown Charset Detector library. UTF-16 text files without byte-order-mark (BOM) can be detected by heuristics.


Solution 15 - Text

In PowerShell:

function Recode($InCharset, $InFile, $OutCharset, $OutFile)  {
    # Read input file in the source encoding
    $Encoding = [System.Text.Encoding]::GetEncoding($InCharset)
    $Text = [System.IO.File]::ReadAllText($InFile, $Encoding)
    
    # Write output file in the destination encoding
    $Encoding = [System.Text.Encoding]::GetEncoding($OutCharset)    
    [System.IO.File]::WriteAllText($OutFile, $Text, $Encoding)
}

Recode Windows-1252 "$pwd\in.txt" utf8 "$pwd\out.txt" 

For a list of supported encoding names:

https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding

Solution 16 - Text

There is also a web tool to convert file encoding: https://webtool.cloud/change-file-encoding

It supports a wide range of encodings, including some rare ones, like IBM code page 37.

Solution 17 - Text

With ruby:

ruby -e "File.write('output.txt', File.read('input.txt').encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: ''))"

Source: https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences
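
Because the Ruby one-liner treats the input as binary and replaces everything undefined with an empty string, every non-ASCII byte is dropped. The Python sketch below shows that behavior next to a gentler alternative that keeps valid UTF-8 and marks only the invalid bytes; both function names are hypothetical.

```python
def scrub_to_ascii(data: bytes) -> bytes:
    """Drop every non-ASCII byte, as the Ruby one-liner effectively does."""
    return data.decode("ascii", errors="ignore").encode("utf-8")


def scrub_invalid_utf8(data: bytes) -> bytes:
    """Keep valid UTF-8 and replace invalid bytes with U+FFFD."""
    return data.decode("utf-8", errors="replace").encode("utf-8")
```

The second variant is usually the better fit when the file is mostly valid UTF-8 with a few stray bytes, since accented characters survive.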

Solution 18 - Text

Use this Python script: https://github.com/goerz/convert_encoding.py. It works on any platform and requires Python 2.7.

Solution 19 - Text

My favorite tool for this is jEdit (a Java-based text editor), which has two very convenient features:

  • One which enables the user to reload a text with a different encoding (and, as such, to control visually the result)
  • Another one which enables the user to explicitly choose the encoding (and end of line char) before saving

Solution 20 - Text

If macOS GUI applications are your bread and butter, SubEthaEdit is the text editor I usually go to for encoding-wrangling. Its "conversion preview" allows you to see all invalid characters in the output encoding, and fix or remove them.

And it's open-source now, so yay for them.

Solution 21 - Text

Visual Studio Code

  1. Open your file in Visual Studio Code
  2. Reopen with Encoding: In the bottom status bar, to the right, you should see your current file encoding (e.g. "UTF-8"). Click this and select "Reopen with Encoding".
  3. Select the correct encoding of the file (e.g. ISO 8859-2).
  4. Confirm that your content is displaying as expected.
  5. Save with Encoding: The bottom status bar should now display your new encoding format (e.g. ISO 8859-2). Click this, choose "Save with Encoding", and select UTF-8 (or whatever new encoding you want).

NOTE: THIS WILL OVERWRITE YOUR ORIGINAL FILE. MAKE A BACKUP FIRST.

Solution 22 - Text

As described on https://stackoverflow.com/questions/132318/how-do-i-correct-the-character-encoding-of-a-file, Synalyze It! lets you easily convert between all encodings supported by the ICU library on OS X.

Additionally you can display some bytes of a file translated to Unicode from all the encodings to see quickly which is the right one for your file.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Antti Kissaniemi | View Question on Stackoverflow
Solution 1 - Text | Troels Arvin | View Answer on Stackoverflow
Solution 2 - Text | Boop | View Answer on Stackoverflow
Solution 3 - Text | Cheekysoft | View Answer on Stackoverflow
Solution 4 - Text | Daniel Papasian | View Answer on Stackoverflow
Solution 5 - Text | Jay Bazuzi | View Answer on Stackoverflow
Solution 6 - Text | Arne Evertsson | View Answer on Stackoverflow
Solution 7 - Text | Jeremy Glover | View Answer on Stackoverflow
Solution 8 - Text | Serge Stroobandt | View Answer on Stackoverflow
Solution 9 - Text | lalthomas | View Answer on Stackoverflow
Solution 10 - Text | user15096 | View Answer on Stackoverflow
Solution 11 - Text | Marcelo Teixeira Ruggeri | View Answer on Stackoverflow
Solution 12 - Text | Maciel Escudero Bombonato | View Answer on Stackoverflow
Solution 13 - Text | Nikolai Varankine | View Answer on Stackoverflow
Solution 14 - Text | Amr Ali | View Answer on Stackoverflow
Solution 15 - Text | Amr Ali | View Answer on Stackoverflow
Solution 16 - Text | Pavel Morshenyuk | View Answer on Stackoverflow
Solution 17 - Text | Dorian | View Answer on Stackoverflow
Solution 18 - Text | kinORnirvana | View Answer on Stackoverflow
Solution 19 - Text | yota | View Answer on Stackoverflow
Solution 20 - Text | tiennou | View Answer on Stackoverflow
Solution 21 - Text | Alex Czarto | View Answer on Stackoverflow
Solution 22 - Text | pi3 | View Answer on Stackoverflow