Reading UTF-8 - BOM marker
Java Problem Overview
I'm reading a file through a FileReader. The file is UTF-8 encoded (with a BOM). My problem is that when I read the file and output the string, the BOM marker is output too. Why does this occur?
fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text;
    text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}
Output after the first line:
?<style>
Java Solutions
Solution 1 - Java
In Java, you have to consume the UTF-8 BOM manually if it is present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it would break existing tools like JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.
Take a look at this solution: Handle UTF8 file with BOM
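For readers who want to see what "consuming the BOM manually" can look like with only the JDK, here is a minimal sketch. The class name Utf8BomSkipper and the method openUtf8 are hypothetical names chosen for illustration, and the sketch assumes the file is UTF-8; it does not try to detect UTF-16 or UTF-32 BOMs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the JDK or Commons IO.
class Utf8BomSkipper {

    /** Returns a UTF-8 reader over the stream, dropping a leading EF BB BF if present. */
    static Reader openUtf8(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; a single read() may return fewer than requested.
        int n = 0;
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break; // end of stream before three bytes were available
            }
            n += r;
        }

        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }
}
```

The reader returned by openUtf8(new FileInputStream(file)) can then be wrapped in a BufferedReader as in the question. Passing the charset explicitly also avoids depending on the platform default, which is what FileReader uses.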
Solution 2 - Java
The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.
tmp = tmp.replace("\uFEFF", "");
Also see this Guava bug report
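If removing every U+FEFF in the text feels too broad, a more conservative variant strips the character only when it is the very first one. This is a sketch under the assumption that the string was decoded as UTF-8 (so the BOM arrives as the single char \uFEFF); the class and method names are hypothetical.

```java
// Hypothetical helper for illustration.
class LeadingBomStripper {

    /** Removes a single leading U+FEFF, leaving any later occurrences untouched. */
    static String stripLeadingBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }
}
```

In the loop from the question, this would only need to be applied to the first line that readLine() returns.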
Solution 3 - Java
Use the Apache Commons library.
Class: org.apache.commons.io.input.BOMInputStream
Example usage:
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    //use reader
} finally {
    inputStream.close();
}
Solution 4 - Java
Here's how I use the Apache BOMInputStream in a try-with-resources block. The "false" argument tells the stream to exclude (strip) any of the listed BOMs it detects instead of passing them through (we use "BOM-less" text files for safety reasons, haha):
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new BOMInputStream(new FileInputStream(file),
                false, ByteOrderMark.UTF_8,
                ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
                ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)))) {
    // use br here
    // note: InputStreamReader without a charset argument decodes with the platform default
} catch (Exception e) {
    // handle or log the exception
}
Solution 5 - Java
Consider UnicodeReader from Google which does all this work for you.
Charset utf8 = Charset.forName("UTF-8"); // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8)) {
....
}
Maven Dependency:
<dependency>
<groupId>com.google.gdata</groupId>
<artifactId>core</artifactId>
<version>1.47.1</version>
</dependency>
Solution 6 - Java
Use Apache Commons IO.
For example, take a look at my code below (used for reading a text file with both Latin and Cyrillic characters):
String defaultEncoding = "UTF-16";
List<String> ari = new ArrayList<>(); // collects the characters read from the file
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));
BOMInputStream bomInputStream = new BOMInputStream(inputStream);
ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
int data = reader.read();
while (data != -1) {
    char theChar = (char) data;
    data = reader.read();
    ari.add(Character.toString(theChar));
}
reader.close();
As a result, we have an ArrayList named "ari" with all characters from the file "1.txt" except the BOM.
Solution 7 - Java
If somebody wants to do it with the standard library only, this would be a way:
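// needs: import java.math.BigInteger; import java.nio.charset.StandardCharsets;
public static String cutBOM(String value) {
    // UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
    // Assumption: the text was decoded with a single-byte charset such as ISO-8859-1,
    // so every BOM byte ended up as one char at the start of the string.
    if (value.length() < 3)
        return value;
    byte[] head = value.substring(0, 3).getBytes(StandardCharsets.ISO_8859_1);
    // %06x keeps leading zeros so the hex string is always six digits long
    String bom = String.format("%06x", new BigInteger(1, head));
    if (bom.equals("efbbbf"))
        // UTF-8
        return value.substring(3);
    else if (bom.startsWith("feff") || bom.startsWith("fffe"))
        // UTF-16BE or UTF-16LE
        return value.substring(2);
    else
        return value;
}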
Solution 8 - Java
It's mentioned here that this is usually a problem with files on Windows.
One possible solution would be running the file through a tool like dos2unix first.
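If running an external tool is not an option, the same preprocessing idea can be sketched in Java. This is only an illustration: BomFileCleaner and removeUtf8Bom are hypothetical names, the sketch handles the UTF-8 BOM only, it does not convert line endings the way dos2unix does, and it loads the whole file into memory, so it assumes reasonably small files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration.
class BomFileCleaner {

    /** Rewrites the file without its leading UTF-8 BOM; returns true if a BOM was removed. */
    static boolean removeUtf8Bom(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file); // whole file in memory: small files only
        boolean hasBom = bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
        if (hasBom) {
            // Overwrite the file with everything after the three BOM bytes.
            Files.write(file, Arrays.copyOfRange(bytes, 3, bytes.length));
        }
        return hasBom;
    }
}
```

After this step the file can be read with any UTF-8 reader without special BOM handling.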
Solution 9 - Java
The easiest way I found to bypass the BOM:
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
while ((currentLine = br.readLine()) != null) {
    // if present, remove the UTF-8 BOM (decoded as U+FEFF)
    currentLine = currentLine.replace("\uFEFF", "");