UTF-8 byte[] to String
Java Problem Overview
Let's suppose I have just used a BufferedInputStream to read the bytes of a UTF-8 encoded text file into a byte array. I know that I can use the following routine to convert the bytes to a string, but is there a more efficient/smarter way of doing this than just iterating through the bytes and converting each one?
public String openFileToString(byte[] _bytes)
{
String file_string = "";
for(int i = 0; i < _bytes.length; i++)
{
file_string += (char)_bytes[i];
}
return file_string;
}
Java Solutions
Solution 1 - Java
Look at the constructor for String
String str = new String(bytes, StandardCharsets.UTF_8);
And if you're feeling lazy, you can use the Apache Commons IO library to convert the InputStream to a String directly:
String str = IOUtils.toString(inputStream, StandardCharsets.UTF_8);
Solution 2 - Java
Java's String class has a built-in constructor for converting a byte array to a string.
byte[] byteArray = new byte[] {87, 79, 87, 46, 46, 46};
String value = new String(byteArray, "UTF-8");
Solution 3 - Java
To convert UTF-8 data, you can't assume a one-to-one correspondence between bytes and characters. Try this:
String file_string = new String(bytes, "UTF-8");
(Bah. I see I'm way too slow in hitting the Post Your Answer button.)
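To make the point about bytes versus characters concrete, here is a minimal sketch. The class and method names (Utf8DecodeDemo, castEachByte, decodeUtf8) are made up for illustration; the only library call it relies on is the String(byte[], Charset) constructor. It compares the per-byte cast from the question with a single UTF-8 decode:

```java
import java.nio.charset.StandardCharsets;

final class Utf8DecodeDemo {
    // Naive approach from the question: one char per byte (only correct for ASCII).
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Decodes the whole array as UTF-8 in a single call.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is encoded as two bytes in UTF-8: 0xC3 0xA9.
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);                 // 5 bytes for 4 characters
        System.out.println(decodeUtf8(bytes).length());   // 4
        System.out.println(castEachByte(bytes).length()); // 5: the two-byte sequence became two wrong chars
    }
}
```

The cast-based version happens to work for pure ASCII input, which is probably why the problem often goes unnoticed until the first accented character shows up.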
To read an entire file as a String, do something like this:
public String openFileToString(String fileName) throws IOException
{
InputStream is = new BufferedInputStream(new FileInputStream(fileName));
try {
InputStreamReader rdr = new InputStreamReader(is, "UTF-8");
StringBuilder contents = new StringBuilder();
char[] buff = new char[4096];
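int len;
// Read again on every iteration; read() returns -1 at end of stream.
while ((len = rdr.read(buff)) != -1) {
contents.append(buff, 0, len);
}
// Return the accumulated text, not the char[] buffer.
return contents.toString();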
} finally {
try {
is.close();
} catch (Exception e) {
// log error in closing the file
}
}
}
Solution 4 - Java
You can use the String(byte[] bytes) constructor for that. See this link for details.
EDIT: You also have to consider your platform's default charset, as per the Javadoc:
> Constructs a new String by decoding the specified array of bytes using
> the platform's default charset. The length of the new String is a
> function of the charset, and hence may not be equal to the length of
> the byte array. The behavior of this constructor when the given bytes
> are not valid in the default charset is unspecified. The
> CharsetDecoder class should be used when more control over the
> decoding process is required.
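As a rough illustration of the CharsetDecoder route the Javadoc mentions, the sketch below decodes strictly and throws on invalid input. The class and method names (StrictUtf8, decodeStrict) are hypothetical; the calls used are CharsetDecoder, CodingErrorAction, and ByteBuffer.wrap from the standard library. For comparison, the String(byte[], Charset) constructor is documented to replace malformed sequences rather than report them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Decodes bytes as UTF-8 and throws on invalid input instead of substituting characters.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(decodeStrict("h\u00e9llo".getBytes(StandardCharsets.UTF_8)));
        try {
            decodeStrict(new byte[] {(byte) 0xC3}); // truncated two-byte sequence
        } catch (CharacterCodingException e) {
            System.out.println("invalid UTF-8: " + e);
        }
    }
}
```

Whether strict decoding is worth the extra code depends on the input: for files you control, the plain constructor is usually enough; for untrusted data, failing loudly may be preferable.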
Solution 5 - Java
Knowing that you are dealing with a UTF-8 byte array, you'll definitely want to use the String constructor that accepts a charset name. Otherwise you may leave yourself open to some charset encoding based security vulnerabilities. Note that it throws UnsupportedEncodingException, which you'll have to handle. Something like this:
public String openFileToString(byte[] _bytes) {
String file_string;
try {
file_string = new String(_bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
// this should never happen because "UTF-8" is hard-coded.
throw new IllegalStateException(e);
}
return file_string;
}
Solution 6 - Java
You could use the methods described in this question (especially since you start off with an InputStream): https://stackoverflow.com/q/309424/372643
In particular, if you don't want to rely on external libraries, you can try this answer, which reads the InputStream via an InputStreamReader into a char[] buffer and appends it into a StringBuilder.
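The linked answer is not reproduced here, but the approach it describes could look roughly like the following sketch. The names StreamToString and readUtf8 are placeholders, and the 4096-char buffer size is an arbitrary choice:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class StreamToString {
    // Reads the whole stream as UTF-8 text; closes the stream when done.
    static String readUtf8(InputStream in) throws IOException {
        StringBuilder contents = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int len;
            while ((len = reader.read(buffer)) != -1) {
                contents.append(buffer, 0, len);
            }
        }
        return contents.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "na\u00efve \u2603".getBytes(StandardCharsets.UTF_8);
        System.out.println(readUtf8(new ByteArrayInputStream(data)));
    }
}
```

Because the InputStreamReader does the decoding, a multi-byte character that straddles two underlying reads is still decoded correctly.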
Solution 7 - Java
Here's a simplified function that will read in bytes and create a string. It assumes you probably already know what encoding the file is in (and otherwise defaults).
static final int BUFF_SIZE = 2048;
static final String DEFAULT_ENCODING = "utf-8";
public static String readFileToString(String filePath, String encoding) throws IOException {
if (encoding == null || encoding.length() == 0)
encoding = DEFAULT_ENCODING;
StringBuilder content = new StringBuilder();
char[] buffer = new char[BUFF_SIZE];
// Decode through a Reader: decoding each byte chunk separately would corrupt
// multi-byte characters that straddle a chunk boundary.
try (InputStream fis = new FileInputStream(filePath);
Reader reader = new InputStreamReader(fis, encoding)) {
int charsRead;
while ((charsRead = reader.read(buffer)) != -1)
content.append(buffer, 0, charsRead);
}
return content.toString();
}
Solution 8 - Java
String has a constructor that takes a byte[] and a charset name as parameters :)
Solution 9 - Java
This also involves iterating, but it is much better than concatenating strings, which is very costly. (Note that casting each byte to a char only gives correct results for ASCII data.)
public String openFileToString(byte[] _bytes)
{
StringBuilder s = new StringBuilder(_bytes.length);
for(int i = 0; i < _bytes.length; i++)
{
s.append((char)_bytes[i]);
}
return s.toString();
}
Solution 10 - Java
Why not get what you are looking for from the get go and read a string from the file instead of an array of bytes? Something like:
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream("foo.txt"), Charset.forName("UTF-8")));
Then call readLine on in until it returns null.
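A minimal sketch of that read loop, assuming you want the lines collected into a list (ReadLinesDemo and readLines are made-up names, and the temp file in main exists only to make the example runnable):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class ReadLinesDemo {
    // Reads a UTF-8 text file line by line; readLine() returns null at end of file.
    static List<String> readLines(String fileName) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "first\nsecond \u00e9\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLines(tmp.toString()));
        Files.delete(tmp);
    }
}
```

Note that readLine strips the line terminators, so joining the lines back together will not necessarily reproduce the original file byte for byte.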
Solution 11 - Java
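I do it this way (note that this overload decodes with the platform's default charset; for UTF-8 data, pass StandardCharsets.UTF_8 as a fourth argument):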
String strIn = new String(_bytes, 0, numBytes);