Java FileReader encoding issue

JavaFileUnicodeEncoding

Java Problem Overview


I tried to use java.io.FileReader to read some text files and convert them into a string, but I found the result is wrongly encoded and not readable at all.

Here's my environment:

  • Windows 2003, OS encoding: CP1252

  • Java 5.0

My files are UTF-8 encoded or CP1252 encoded, and some of them (UTF-8 encoded files) may contain Chinese (non-Latin) characters.

I use the following code to do my work:

   private static String readFileAsString(String filePath)
    throws java.io.IOException{
        StringBuffer fileData = new StringBuffer(1000);
        FileReader reader = new FileReader(filePath);
        //System.out.println(reader.getEncoding());
        BufferedReader reader = new BufferedReader(reader);
        char[] buf = new char[1024];
        int numRead=0;
        while((numRead=reader.read(buf)) != -1){
            String readData = String.valueOf(buf, 0, numRead);
            fileData.append(readData);
            buf = new char[1024];
        }
        reader.close();
        return fileData.toString();
    }

The above code doesn't work. I found the FileReader's encoding is CP1252 even if the text is UTF-8 encoded. But the JavaDoc of java.io.FileReader says that:

> The constructors of this class assume > that the default character encoding > and the default byte-buffer size are > appropriate.

Does this mean that I am not required to set character encoding by myself if I am using FileReader? But I did get wrongly encoded data currently, what's the correct way to deal with my situtaion? Thanks.

Java Solutions


Solution 1 - Java

Yes, you need to specify the encoding of the file you want to read.

Yes, this means that you have to know the encoding of the file you want to read.

No, there is no general way to guess the encoding of any given "plain text" file.

The one-arguments constructors of FileReader always use the platform default encoding which is generally a bad idea.

Since Java 11 FileReader has also gained constructors that accept an encoding: new FileReader(file, charset) and new FileReader(fileName, charset).

In earlier versions of java, you need to use new InputStreamReader(new FileInputStream(pathToFile), <encoding>).

Solution 2 - Java

FileReader uses Java's platform default encoding, which depends on the system settings of the computer it's running on and is generally the most popular encoding among users in that locale.

If this "best guess" is not correct then you have to specify the encoding explicitly. Unfortunately, FileReader does not allow this (major oversight in the API). Instead, you have to use new InputStreamReader(new FileInputStream(filePath), encoding) and ideally get the encoding from metadata about the file.

Solution 3 - Java

For Java 7+ doc you can use this:

BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);

Here are all Charsets doc

For example if your file is in CP1252, use this method

Charset.forName("windows-1252");

Here is other canonical names for Java encodings both for IO and NIO doc

If you do not know with exactly encoding you have got in a file, you may use some third-party libs like this tool from Google this which works fairly neat.

Solution 4 - Java

Since Java 11 you may use that:

public FileReader(String fileName, Charset charset) throws IOException;

Solution 5 - Java

FileInputStream with InputStreamReader is better than directly using FileReader, because the latter doesn't allow you to specify encoding charset.

Here is an example using BufferedReader, FileInputStream and InputStreamReader together, so that you could read lines from a file.

List<String> words = new ArrayList<>();
List<String> meanings = new ArrayList<>();
public void readAll( ) throws IOException{
    String fileName = "College_Grade4.txt";
    String charset = "UTF-8";
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(
            new FileInputStream(fileName), charset)); 
   
    String line; 
    while ((line = reader.readLine()) != null) { 
        line = line.trim();
        if( line.length() == 0 ) continue;
        int idx = line.indexOf("\t");
        words.add( line.substring(0, idx ));
        meanings.add( line.substring(idx+1));
    } 
    reader.close();
}

Solution 6 - Java

For another as Latin languages for example Cyrillic you can use something like this:

FileReader fr = new FileReader("src/text.txt", StandardCharsets.UTF_8);

and be sure that your .txt file is saved with UTF-8 (but not as default ANSI) format. Cheers!

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionnybonView Question on Stackoverflow
Solution 1 - JavaJoachim SauerView Answer on Stackoverflow
Solution 2 - JavaMichael BorgwardtView Answer on Stackoverflow
Solution 3 - JavaAndreas GeleverView Answer on Stackoverflow
Solution 4 - JavaRadoslav IvanovView Answer on Stackoverflow
Solution 5 - JavaGuangtong ShenView Answer on Stackoverflow
Solution 6 - JavaIefimenko IevgenView Answer on Stackoverflow