All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"?

Java | Character Encoding

Java Problem Overview


I'm creating a simple wordcount program in Java that reads through a directory's text-based files.

However, I keep on getting the error:

java.nio.charset.MalformedInputException: Input length = 1

from this line of code:

BufferedReader reader = Files.newBufferedReader(file, Charset.forName("UTF-8"));

I know I probably get this because I used a Charset that didn't include some of the characters in the text files, some of which included characters of other languages. But I want to include those characters.

I later learned from the JavaDocs that the Charset is optional and only used for more efficient reading of the files, so I changed the code to:

BufferedReader reader = Files.newBufferedReader(file);

But some files still throw the MalformedInputException. I don't know why.

I was wondering if there is an all-inclusive Charset that will allow me to read text files with many different types of characters?

Thanks.

Java Solutions


Solution 1 - Java

You probably want to have a list of supported encodings. For each file, try each encoding in turn, perhaps starting with UTF-8. Whenever you catch a MalformedInputException, try the next encoding.
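A minimal sketch of that fallback loop, assuming a candidate list of UTF-8 then ISO-8859-1 (adjust the list to whatever encodings your files may use):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CharsetFallback {

    // Candidate encodings, most likely first; adjust to your data.
    private static final List<Charset> CANDIDATES = List.of(
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);

    /** Reads the whole file with the first charset that decodes it cleanly. */
    public static String readWithFallback(Path file) throws IOException {
        for (Charset cs : CANDIDATES) {
            try (BufferedReader reader = Files.newBufferedReader(file, cs)) {
                StringBuilder sb = new StringBuilder();
                int c;
                while ((c = reader.read()) != -1) {
                    sb.append((char) c);
                }
                return sb.toString();
            } catch (MalformedInputException e) {
                // Wrong guess -- discard partial output and try the next charset.
            }
        }
        throw new IOException("No candidate charset could decode " + file);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        // ISO-8859-1 bytes that are NOT valid UTF-8 (0xE9 = 'é' in Latin-1).
        Files.write(tmp, new byte[] {'c', 'a', 'f', (byte) 0xE9});
        System.out.println(readWithFallback(tmp)); // café
        Files.deleteIfExists(tmp);
    }
}
```

Note that a wrong encoding can still decode "successfully" and produce mojibake, so ordering the candidates from strictest to most permissive matters.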

Solution 2 - Java

Creating a BufferedReader with Files.newBufferedReader,

Files.newBufferedReader(Paths.get("a.txt"), StandardCharsets.UTF_8);

may throw the following exception when the application runs:

java.nio.charset.MalformedInputException: Input length = 1

But

new BufferedReader(new InputStreamReader(new FileInputStream("a.txt"),"utf-8"));

works well.

The difference is that the former uses the CharsetDecoder's default action:

> The default action for malformed-input and unmappable-character errors is to report them.

while the latter uses the REPLACE action:

cs.newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE)

Solution 3 - Java

ISO-8859-1 is an all-inclusive charset, in the sense that it's guaranteed not to throw MalformedInputException. So it's good for debugging, even if your input is not actually in this charset. So:

req.setCharacterEncoding("ISO-8859-1");

I had some double-right-quote/double-left-quote characters in my input, and both US-ASCII and UTF-8 threw MalformedInputException on them, but ISO-8859-1 worked.
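The reason this can never throw: ISO-8859-1 maps every one of the 256 possible byte values to exactly one character (U+0000 to U+00FF), so no byte sequence is ever malformed. A quick sketch:

```java
import java.nio.charset.StandardCharsets;

public class Latin1Demo {
    public static void main(String[] args) {
        // Build a buffer containing every possible byte value.
        byte[] bytes = new byte[256];
        for (int i = 0; i < 256; i++) {
            bytes[i] = (byte) i;
        }
        // Each byte decodes to exactly one char, so decoding cannot fail;
        // with UTF-8 or US-ASCII, the same input would throw.
        String s = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(s.length()); // 256
    }
}
```

The trade-off is that bytes from another encoding decode to the *wrong* characters rather than failing, so this is a diagnostic tool, not a fix.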

Solution 4 - Java

I also encountered this exception, with the error message:

java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at sun.nio.cs.StreamEncoder.implWrite(Unknown Source)
at sun.nio.cs.StreamEncoder.write(Unknown Source)
at java.io.OutputStreamWriter.write(Unknown Source)
at java.io.BufferedWriter.flushBuffer(Unknown Source)
at java.io.BufferedWriter.write(Unknown Source)
at java.io.Writer.write(Unknown Source)

and found that a strange bug occurs when trying to use

BufferedWriter writer = Files.newBufferedWriter(Paths.get(filePath));

to write a String "orazg 54" cast from a generic type in a class.

//key is of generic type <Key extends Comparable<Key>>
writer.write(item.getKey() + "\t" + item.getValue() + "\n");

This String is of length 9 containing chars with the following code points:

111 114 97 122 103 9 53 52 10

However, if the BufferedWriter in the class is replaced with:

FileOutputStream outputStream = new FileOutputStream(filePath);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(outputStream));

it can successfully write this String without exceptions. In addition, if I write the same String created directly from the characters, it also works:

String string = new String(new char[] {111, 114, 97, 122, 103, 9, 53, 52, 10});
BufferedWriter writer = Files.newBufferedWriter(Paths.get("a.txt"));
writer.write(string);
writer.close();

I had never previously encountered any exception when using the first BufferedWriter to write Strings. It's a strange bug that occurs with a BufferedWriter created from java.nio.file.Files.newBufferedWriter(path, options).
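The writing side mirrors Solution 2: Files.newBufferedWriter uses a UTF-8 encoder whose default action is to report errors, so a char sequence the encoder cannot represent (for example a lone surrogate that slipped in via a generic value) throws. A hedged sketch of a writer configured with the REPLACE action instead (the class and method names here are illustrative):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LenientWriter {

    /** A UTF-8 writer that replaces unencodable chars instead of throwing. */
    public static BufferedWriter open(Path file) throws IOException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedWriter(
                new OutputStreamWriter(Files.newOutputStream(file), encoder));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        try (BufferedWriter w = open(tmp)) {
            // A lone surrogate would make a default Files.newBufferedWriter throw;
            // here the encoder silently substitutes its replacement byte.
            w.write("ok\uD800");
        }
        System.out.println(Files.size(tmp));
        Files.deleteIfExists(tmp);
    }
}
```

Replacing rather than reporting hides data corruption, so this is only appropriate when a best-effort write beats a failed one.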

Solution 5 - Java

Try this. I had the same issue, and the implementation below worked for me:

Reader reader = Files.newBufferedReader(Paths.get(<yourfilewithpath>), StandardCharsets.ISO_8859_1);

Then use the Reader wherever you want.

For example:

CsvToBean<anyPojo> csvToBean = null;
try {
    Reader reader = Files.newBufferedReader(Paths.get(csvFilePath),
            StandardCharsets.ISO_8859_1);
    csvToBean = new CsvToBeanBuilder(reader)
            .withType(anyPojo.class)
            .withIgnoreLeadingWhiteSpace(true)
            .withSkipLines(1)
            .build();
} catch (IOException e) {
    e.printStackTrace();
}

Solution 6 - Java

ISO_8859_1 worked for me! I was reading a text file with comma-separated values.

Solution 7 - Java

I wrote the following to print a list of results to standard out based on the available charsets. Note that it also tells you which line fails, as a 0-based line number, in case you are troubleshooting which character is causing issues.

public static void testCharset(String fileName) {
    SortedMap<String, Charset> charsets = Charset.availableCharsets();
    for (String k : charsets.keySet()) {
        int line = 0;
        boolean success = true;
        try (BufferedReader b = Files.newBufferedReader(Paths.get(fileName), charsets.get(k))) {
            while (b.readLine() != null) {
                line++;
            }
        } catch (IOException e) {
            success = false;
            System.out.println(k + " failed on line " + line);
        }
        if (success)
            System.out.println("*************************  Success " + k);
    }
}

Solution 8 - Java

Well, the problem is that Files.newBufferedReader(Path path) is implemented like this:

public static BufferedReader newBufferedReader(Path path) throws IOException {
    return newBufferedReader(path, StandardCharsets.UTF_8);
}

so there is basically no point in specifying UTF-8 unless you want to be explicit in your code. If you want to try a "broader" charset you could try StandardCharsets.UTF_16, but you can't be 100% sure of handling every possible character anyway.

Solution 9 - Java

UTF-8 works for me with Polish characters.

Solution 10 - Java

You can try something like this, or just copy and paste the piece below.

boolean exception = true;
Charset charset = Charset.defaultCharset(); // Try the default one first.
int index = 0;

while (exception) {
    try {
        lines = Files.readAllLines(f.toPath(), charset);
        for (String line : lines) {
            line = line.trim();
            if (line.contains(keyword))
                values.add(line);
        }
        // No exception, just returns
        exception = false;
    } catch (IOException e) {
        Charset[] available = Charset.availableCharsets().values().toArray(new Charset[0]);
        if (index < available.length) {
            // Try the next charset
            charset = available[index];
            index++;
        } else {
            // No charsets left to try; stop rather than loop forever
            exception = false;
        }
    }
}

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionJonathan LamView Question on Stackoverflow
Solution 1 - JavaDawood ibn KareemView Answer on Stackoverflow
Solution 2 - JavaXin WangView Answer on Stackoverflow
Solution 3 - JavaTim CooperView Answer on Stackoverflow
Solution 4 - JavaTomView Answer on Stackoverflow
Solution 5 - JavaVinView Answer on Stackoverflow
Solution 6 - JavaShahid Hussain AbbasiView Answer on Stackoverflow
Solution 7 - JavaEngineerWithJava54321View Answer on Stackoverflow
Solution 8 - Javafrancesco forestiView Answer on Stackoverflow
Solution 9 - JavaAdrianoView Answer on Stackoverflow
Solution 10 - JavaPengxiangView Answer on Stackoverflow