How to determine encoding table of a text file
Problem Overview
I have .txt and .java files and I don't know how to determine their encoding (Unicode, UTF-8, ISO-8859, …). Does there exist any program to determine the file encoding, or to see the encoding?
Solutions
Solution 1 - Text
If you're on Linux, try file -i filename.txt.
$ file -i vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
For reference, here is my environment:
$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic
Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:
$ file -I vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
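Since the question mentions both .txt and .java files, here is a small sketch (the paths are placeholders) that runs file -i over every such file under the current directory; file accepts multiple arguments, and -exec ... {} + batches them into as few invocations as possible:
find . \( -name '*.txt' -o -name '*.java' \) -exec file -i {} +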
Solution 2 - Text
Open the file with Notepad++ and you will see the name of the encoding in the lower-right corner of the window. In the Encoding menu you can change the encoding and save the file.
Solution 3 - Text
You can't reliably detect the encoding of a text file - what you can do is make an educated guess, by searching for non-ASCII characters and trying to determine whether they form byte sequences that make sense in the languages you are parsing.
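As a rough sketch of that educated guess, you can at least test whether the bytes decode cleanly as UTF-8 (somefile.txt is a placeholder name); iconv exits with an error on the first invalid byte sequence:
if iconv -f UTF-8 -t UTF-8 somefile.txt > /dev/null 2>&1; then
    echo "valid UTF-8 (or plain ASCII)"
else
    echo "not valid UTF-8"
fi
A clean pass is strong but not conclusive evidence - a short file in some other encoding can happen to be valid UTF-8 as well.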
Solution 4 - Text
See this question and the selected answer. There's no sure-fire way of doing it; at most, you can rule things out. You're unlikely to get false positives on the UTF encodings, but the 8-bit encodings are tough, especially if you don't know the source language. No tool out there currently handles all the common 8-bit encodings from Macs, Windows, and Unix, but the selected answer provides an algorithmic approach that should work adequately for a certain subset of encodings.
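One of the few reliable signals for ruling encodings in or out is a byte-order mark, when one is present. A sketch (somefile.txt is a placeholder) that dumps the first bytes for inspection:
head -c 4 somefile.txt | od -An -tx1
# ef bb bf    -> UTF-8 with BOM
# ff fe       -> UTF-16 little-endian
# fe ff       -> UTF-16 big-endian
Most UTF-8 files carry no BOM, though, so its absence rules nothing out.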
Solution 5 - Text
A text file has no header that records its encoding. You can try the Linux/Unix command file, which tries to guess the encoding:
file -i unreadablefile.txt
or on some systems
file -I unreadablefile.txt
But that often gives you text/plain; charset=iso-8859-1 even though the file is unreadable (cryptic glyphs).
Here is what I did to find the correct encoding for an unreadable file and then transcode it to UTF-8, after installing iconv. First I tried all encodings, displaying (via grep) any line that contained the word www. (a website address):
for ENCODING in $(iconv -l); do echo -n "$ENCODING "; iconv -f $ENCODING -t utf-8 unreadablefile.txt 2>/dev/null| grep 'www'; done | less
This command line shows the tested file encoding followed by the translated/transcoded line.
Some encodings produced readable and consistent (one language at a time) results. I then tried some of them manually, for example:
ENCODING=WINDOWS-936; iconv -f $ENCODING -t utf-8 unreadablefile.txt -o test_with_${ENCODING}.txt
In my case it was a Chinese Windows encoding, and the file is now readable (if you know Chinese).
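To repeat that last step for several candidate encodings, a minimal sketch (try_decode is a hypothetical helper, not a standard command; the candidate names come from iconv -l):
try_decode() {
    local enc="$1" infile="$2"
    iconv -f "$enc" -t utf-8 "$infile" -o "test_with_${enc}.txt" &&
        echo "wrote test_with_${enc}.txt"
}
try_decode WINDOWS-936 unreadablefile.txt
try_decode GB18030 unreadablefile.txt
Each call writes a transcoded copy that you can open and judge by eye, exactly as above.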
Solution 6 - Text
> Does there exist any program to determine the file encoding or to see the encoding?
This question is 10 years old as I write this, and the answer is still "No" - at least not reliably. Unfortunately, there has not been much improvement. My recent experience suggests the file -I command is very much hit-or-miss. For example, when checking a text file on macOS 10.15.6:
% file -i somefile.asc
somefile.asc: application/octet-stream; charset=binary
somefile.asc was a text file; all characters in it were encoded in UTF-16 Little Endian. How did I know this? I used BBEdit - a competent text editor. Determining the encoding used in a file is certainly a tough problem, but...?
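For what it's worth, UTF-16 without a byte-order mark is exactly the kind of input that file tends to misreport as binary. A cheap heuristic sketch (somefile.asc as above): ASCII-range text stored as UTF-16 LE has a NUL in every other byte, so counting NULs is telling:
nuls=$(tr -dc '\000' < somefile.asc | wc -c)
total=$(wc -c < somefile.asc)
echo "NUL bytes: $nuls of $total"
If roughly half of the bytes are NUL, UTF-16 is a very good guess; a dedicated editor such as BBEdit can then confirm it.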