Reading UTF-8 - BOM marker

Tags: Java, File, Encoding

Java Problem Overview


I'm reading a file through a FileReader. The file is UTF-8 encoded (with BOM). My problem: I read the file and output a string, but unfortunately the BOM marker is output too. Why does this occur?

fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text;
    text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}

Output after the first line:

?<style>

Java Solutions


Solution 1 - Java

In Java, you have to consume the UTF-8 BOM manually if it is present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it would break existing tools like JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.

Take a look at this solution: Handle UTF8 file with BOM
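If you'd rather stay on the standard library, consuming the BOM manually is a few lines: once the stream is decoded as UTF-8, the BOM appears as the single character U+FEFF at the start, so you can peek at the first character and discard it. A minimal sketch (the `skipBom` helper name and class are my own, not from the original answers):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class BomSkipExample {
    // Wraps a Reader and discards a single leading U+FEFF, if present.
    static BufferedReader skipBom(Reader in) throws IOException {
        BufferedReader br = new BufferedReader(in);
        br.mark(1);                // remember the start of the stream
        if (br.read() != 0xFEFF) { // first char is not a BOM:
            br.reset();            // rewind so the caller still sees it
        }
        return br;
    }

    public static void main(String[] args) throws IOException {
        // For a real file you would pass
        // new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)
        System.out.println(skipBom(new StringReader("\uFEFFhello")).readLine());
    }
}
```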

Solution 2 - Java

The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.

tmp = tmp.replace("\uFEFF", "");

Also see this Guava bug report
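Note that replace() removes U+FEFF wherever it occurs in the string. If you want to be stricter and strip it only when it is the very first character, a small sketch (the `stripLeadingBom` helper is hypothetical, not part of any library):

```java
public class StripLeadingBom {
    // Remove U+FEFF only when it is the first character of the string.
    static String stripLeadingBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingBom("\uFEFFhello")); // prints "hello"
        System.out.println(stripLeadingBom("hello"));       // prints "hello"
    }
}
```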

Solution 3 - Java

Use the Apache Commons library.

Class: org.apache.commons.io.input.BOMInputStream

Example usage:

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
	BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
	ByteOrderMark bom = bOMInputStream.getBOM();
	String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
	InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
	//use reader
} finally {
	inputStream.close();
}

Solution 4 - Java

Here's how I use the Apache Commons BOMInputStream; it uses a try-with-resources block. The "false" argument tells the stream to exclude (strip) any of the listed BOMs from the data (we use "BOM-less" text files for safety reasons, haha):

try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new BOMInputStream(new FileInputStream(file),
            false, ByteOrderMark.UTF_8,
            ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
            ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)))) {
    // use br here
} catch (Exception e) {
    e.printStackTrace();
}

Solution 5 - Java

Consider UnicodeReader from Google which does all this work for you.

Charset utf8 = Charset.forName("UTF-8"); // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8)) {
    ....
}

Maven Dependency:

<dependency>
    <groupId>com.google.gdata</groupId>
    <artifactId>core</artifactId>
    <version>1.47.1</version>
</dependency>

Solution 6 - Java

Use Apache Commons IO.

For example, here is the code I used for reading a text file containing both Latin and Cyrillic characters:

String defaultEncoding = "UTF-16";
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));

BOMInputStream bomInputStream = new BOMInputStream(inputStream);

ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);

List<String> ari = new ArrayList<>();
int data = reader.read();
while (data != -1) {
    char theChar = (char) data;
    data = reader.read();
    ari.add(Character.toString(theChar));
}
reader.close();

As a result we have an ArrayList named "ari" containing every character from the file "1.txt" except the BOM.

Solution 7 - Java

If somebody wants to do it with the standard library only, this would be a way (note it assumes the string still holds the raw, undecoded bytes):

public static String cutBOM(String value) {
	if (value.length() < 3)
		return value;
	// UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
	String bom = String.format("%x", new BigInteger(1, value.substring(0, 3).getBytes()));
	if (bom.startsWith("efbbbf"))
		// UTF-8
		return value.substring(3);
	else if (bom.startsWith("feff") || bom.startsWith("fffe"))
		// UTF-16BE or UTF-16LE
		return value.substring(2);
	else
		return value;
}

Solution 8 - Java

It's mentioned here that this is usually a problem with files created on Windows.

One possible solution is to run the file through a tool like dos2unix first.
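As a sketch of that approach: recent versions of dos2unix remove a UTF-8 BOM during conversion, and where that isn't available, GNU sed can strip the three BOM bytes from the first line directly (the `\xEF` hex escapes and `-i` flag assume GNU sed; the file path is just an example):

```shell
# Create a sample file starting with the UTF-8 BOM (EF BB BF)
printf '\357\273\277hello\n' > /tmp/with_bom.txt

# Strip the BOM from the first line in place, if present
sed -i '1s/^\xef\xbb\xbf//' /tmp/with_bom.txt

cat /tmp/with_bom.txt   # now contains just "hello"
```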

Solution 9 - Java

The easiest way I found to bypass the BOM:

BufferedReader br = new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8));
String currentLine;
while ((currentLine = br.readLine()) != null) {
    // if present, remove the UTF-8 BOM
    currentLine = currentLine.replace("\uFEFF", "");
}
Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | onigunn | View Question on Stackoverflow
Solution 1 - Java | RealHowTo | View Answer on Stackoverflow
Solution 2 - Java | finnw | View Answer on Stackoverflow
Solution 3 - Java | peenut | View Answer on Stackoverflow
Solution 4 - Java | snakedoctor | View Answer on Stackoverflow
Solution 5 - Java | Adrian Smith | View Answer on Stackoverflow
Solution 6 - Java | pawman | View Answer on Stackoverflow
Solution 7 - Java | Markus | View Answer on Stackoverflow
Solution 8 - Java | Drake Sobania | View Answer on Stackoverflow
Solution 9 - Java | David | View Answer on Stackoverflow