How to check whether a file is valid UTF-8?

Validation Problem Overview


I'm processing some data files that are supposed to be valid UTF-8 but aren't, which causes the parser (not under my control) to fail. I'd like to add a stage of pre-validating the data for UTF-8 well-formedness, but I've not yet found a utility to help do this.

There's a web service at W3C which appears to be dead, and I've found a Windows-only validation tool that reports invalid UTF-8 files but doesn't report which lines/characters to fix.

I'd be happy with either a tool I can drop in and use (ideally cross-platform), or a Ruby/Perl script I can make part of my data loading process.

Validation Solutions


Solution 1 - Validation

You can use GNU iconv:

$ iconv -f UTF-8 your_file -o /dev/null; echo $?

Or with older versions of iconv, such as on macOS:

$ iconv -f UTF-8 your_file > /dev/null; echo $?

The command will return 0 if the file could be converted successfully, and 1 if not. Additionally, it will print out the byte offset where the invalid byte sequence occurred.

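Edit: The output encoding doesn't have to be specified. If -t is omitted, iconv uses the encoding of the current locale, which is often UTF-8 on current systems. Pass -t UTF-8 explicitly if the locale might be something else, because a valid UTF-8 file can otherwise fail to convert.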
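
Where shelling out to iconv is not convenient, the same check (succeed or fail, plus the byte offset of the first problem) can be sketched in Python 3. This is a minimal sketch and not part of the original answer: the name first_invalid_offset and the chunk_size parameter are invented for illustration, and it assumes CPython's incremental UTF-8 decoder, which keeps the bytes of an incomplete trailing sequence in the first element of getstate().

```python
import codecs


def first_invalid_offset(stream, chunk_size=65536):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    `stream` is any binary file object. Hypothetical helper, not a library API.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()  # strict error handling
    consumed = 0  # number of bytes read before the current chunk
    while True:
        chunk = stream.read(chunk_size)
        final = not chunk  # an empty read means end of input
        # Bytes of an incomplete sequence carried over from the previous chunk.
        pending = len(decoder.getstate()[0])
        try:
            decoder.decode(chunk, final=final)
        except UnicodeDecodeError as exc:
            # exc.start is relative to (carried-over bytes + current chunk).
            return consumed - pending + exc.start
        if final:
            return None
        consumed += len(chunk)
```

To use it on a real file, open the file with open(path, "rb") and pass the file object. A script could then call sys.exit(0 if result is None else 1) to mimic the exit status of iconv. Reading in chunks keeps memory use flat for large files.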

Solution 2 - Validation

Use Python and the decode method of byte strings. The session below is Python 2.5, where str is a byte string:

>>> a="γεια"
>>> a
'\xce\xb3\xce\xb5\xce\xb9\xce\xb1'
>>> b='\xce\xb3\xce\xb5\xce\xb9\xff\xb1' # note second-to-last char changed
>>> print b.decode("utf_8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 6: unexpected code byte

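The exception raised carries the requested information in its .args attribute: the codec name, the input bytes, the start and end offsets of the invalid sequence, and a reason.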

>>> try: print b.decode("utf_8")
... except UnicodeDecodeError, exc: pass
...
>>> exc
UnicodeDecodeError('utf8', '\xce\xb3\xce\xb5\xce\xb9\xff\xb1', 6, 7, 'unexpected code byte')
>>> exc.args
('utf8', '\xce\xb3\xce\xb5\xce\xb9\xff\xb1', 6, 7, 'unexpected code byte')
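
The session above is Python 2. In Python 3 the same information is available as attributes of the exception (start, end, reason), and the byte offset can be turned into a line number, which is what the question asks for. A minimal sketch, assuming the whole file fits in memory. The name locate_invalid_utf8 is a hypothetical helper, and the column is counted in bytes, not characters:

```python
def locate_invalid_utf8(data: bytes):
    """Return None if data is valid UTF-8, else (offset, line, column, reason).

    offset is 0-based; line and column are 1-based, column counted in bytes.
    """
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad sequence
    except UnicodeDecodeError as exc:
        offset = exc.start
        # Count the newlines before the bad byte to get its line number.
        line = data.count(b"\n", 0, offset) + 1
        # rfind returns -1 when there is no earlier newline, giving line_start 0.
        line_start = data.rfind(b"\n", 0, offset) + 1
        column = offset - line_start + 1
        return offset, line, column, exc.reason
    return None
```

For a file, something like locate_invalid_utf8(open(path, "rb").read()) would do. Python 3's strict UTF-8 decoder also rejects overlong encodings and encoded surrogates, so it is stricter than a check that only looks at byte patterns.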

Solution 3 - Validation

You can use isutf8 from the moreutils collection.

$ apt-get install moreutils
$ isutf8 your_file

In a shell script, use the --quiet switch and check the exit status, which is zero for files that are valid UTF-8.

Solution 4 - Validation

How about the GNU iconv library? The iconv() function reports exactly this case: "An invalid multibyte sequence is encountered in the input. In this case it sets errno to EILSEQ and returns (size_t)(-1). *inbuf is left pointing to the beginning of the invalid multibyte sequence."

Edit: I missed the part where you want a scripting language. For command-line work, though, the iconv utility should validate for you too.
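
The behaviour quoted above, where the input pointer stops at the start of the invalid sequence and the caller decides how to continue, can be imitated in Python 3 to list every invalid sequence instead of only the first. A rough sketch under that assumption: iter_invalid_utf8 is a name invented here, and it relies on the standard decoder's UnicodeDecodeError, not on iconv itself.

```python
def iter_invalid_utf8(data: bytes):
    """Yield (offset, bad_bytes, reason) for each invalid UTF-8 sequence in data."""
    pos = 0
    while pos < len(data):
        try:
            # Slicing copies the remainder; acceptable for a sketch, but slow
            # for large inputs that contain many errors.
            data[pos:].decode("utf-8")
            return  # the rest of the input is valid
        except UnicodeDecodeError as exc:
            start = pos + exc.start
            end = pos + exc.end
            yield start, data[start:end], exc.reason
            pos = end  # resume just past the invalid sequence
```

For example, list(iter_invalid_utf8(b"a\xffb\xfec")) reports the two stray bytes at offsets 1 and 3. Resuming after exc.end is one possible recovery policy; iconv() itself leaves that choice to the caller.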

Solution 5 - Validation

You can also use recode, which will exit with an error if it tries to decode UTF-8 and encounters invalid characters.

if recode utf8/..UCS < "$FILE" >/dev/null 2>&1; then
    echo "Valid utf8 : $FILE"
else
    echo "NOT valid utf8: $FILE"
fi

This tries to recode to the Universal Character Set (UCS) which is always possible from valid UTF-8.

Solution 6 - Validation

Here is a bash script that checks whether a file is valid UTF-8:

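#!/bin/bash

inputFile="./testFile.txt"

# iconv exits with a non-zero status if the input contains an invalid UTF-8
# sequence. -t UTF-8 is given explicitly so the result does not depend on
# the current locale.
if iconv -f UTF-8 -t UTF-8 "$inputFile" -o /dev/null
then
	echo "Valid UTF-8 file."
else
	echo "Invalid UTF-8 file!"
fi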

Description:

  • --from-code, -f encoding (Convert characters from encoding)
  • --to-code, -t encoding (Convert characters to encoding. If omitted, the encoding of the current locale is used, so pass -t UTF-8 unless the locale is known to be UTF-8.)
  • --output, -o file (Write output to file instead of stdout)

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type             Original Author   Original Content on Stack Overflow
Question                 Ian Dickinson     View Question on Stack Overflow
Solution 1 - Validation  Torsten Marek     View Answer on Stack Overflow
Solution 2 - Validation  tzot              View Answer on Stack Overflow
Solution 3 - Validation  Roger Dahl        View Answer on Stack Overflow
Solution 4 - Validation  AShelly           View Answer on Stack Overflow
Solution 5 - Validation  mivk              View Answer on Stack Overflow
Solution 6 - Validation  Sherzad           View Answer on Stack Overflow