Can I make git recognize a UTF-16 file as text?


Git Problem Overview


I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16.

Can git be taught to recognize that this file is text and handle it appropriately?

I'm using git under Cygwin, with core.autocrlf set to false. I could use msysGit or git under UNIX, if necessary.

Git Solutions


Solution 1 - Git

I've been struggling with this problem for a while, and just discovered (for me) a perfect solution:

$ git config --global diff.tool vimdiff      # or merge.tool to get merging too!
$ git difftool commit1 commit2

git difftool takes the same arguments as git diff would, but runs a diff program of your choice instead of git's built-in diff. So pick a multibyte-aware diff (in my case, vim in diff mode) and just use git difftool instead of git diff.

Find "difftool" too long to type? No problem:

$ git config --global alias.dt difftool
$ git dt commit1 commit2

Git rocks.
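
A minimal sketch of the same setup, scoped to a single scratch repository instead of --global. The repository location and the difftool.prompt setting (which skips the "Launch vimdiff?" question for each file) are additions for illustration, not part of the answer above.

```shell
#!/bin/bash
set -e

# Throwaway repository so the global configuration stays untouched.
dt_repo=$(mktemp -d)
cd "$dt_repo"
git init -q .

# Same settings as above, stored in .git/config of this repository only.
git config diff.tool vimdiff
git config difftool.prompt false   # assumption: you do not want a prompt per file
git config alias.dt difftool

# Read the values back to confirm what git will use.
configured_tool=$(git config --get diff.tool)
configured_alias=$(git config --get alias.dt)
echo "diff.tool=$configured_tool alias.dt=$configured_alias"
```

With this in place, git dt commit1 commit2 opens each changed file in vimdiff without asking first.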

Solution 2 - Git

There is a very simple solution that works out of the box on Unices.

For example, with Apple's .strings files just:

  1. Create a .gitattributes file in the root of your repository with:

     *.strings diff=localizablestrings
    
  2. Add the following to your ~/.gitconfig file:

     [diff "localizablestrings"]
     textconv = "iconv -f utf-16 -t utf-8"
    

Source: Diff .strings files in Git (and older post from 2010).
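
The two steps above can be tried end to end in a throwaway repository. This is only a sketch: the file name, the sample strings, and the placeholder identity are made up, and the textconv setting is written to the repository's own config rather than ~/.gitconfig. It assumes git and iconv are installed.

```shell
#!/bin/bash
set -e

# Scratch repository for the experiment.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"              # placeholder identity for the demo commit
git config user.email "example@example.com"

# Step 1: route *.strings files through the "localizablestrings" diff driver.
echo '*.strings diff=localizablestrings' > .gitattributes

# Step 2: repository-local equivalent of the ~/.gitconfig section.
git config diff.localizablestrings.textconv "iconv -f utf-16 -t utf-8"

# Commit a UTF-16 file, then modify it in the working tree.
printf '"greeting" = "Hello";\n' | iconv -f utf-8 -t utf-16 > Localizable.strings
git add .gitattributes Localizable.strings
git commit -q -m "Add UTF-16 strings file"
printf '"greeting" = "Hello, world";\n' | iconv -f utf-8 -t utf-16 > Localizable.strings

# git diff now compares the UTF-8 text that iconv produces for each side.
diff_output=$(git diff)
printf '%s\n' "$diff_output"
```

The textconv output is for viewing only; a diff produced this way cannot be applied as a patch to the UTF-16 file.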

Solution 3 - Git

Have you tried setting your .gitattributes to treat it as a text file?

e.g.:

*.vmc diff

More details at http://www.git-scm.com/docs/gitattributes.html.

Solution 4 - Git

By default, it looks like git won't work well with UTF-16; for such a file you have to make sure that no CRLF processing is done on it, but you want diff and merge to work as a normal text file (this is ignoring whether or not your terminal/editor can handle UTF-16).

But looking at the .gitattributes manpage, here is how the built-in binary macro attribute is defined:

[attr]binary -diff -crlf

So it seems to me that you could define a custom attribute in your top level .gitattributes for utf16 (note that I add merge here to be sure it is treated as text):

[attr]utf16 diff merge -crlf

From there you would be able to specify in any .gitattributes file something like:

*.vmc utf16
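
To check that such a macro resolves the way you expect, git check-attr reports the effective attributes for a path. The sketch below uses a scratch repository and a hypothetical file name, machine.vmc; the file does not need to exist for the lookup to work.

```shell
#!/bin/bash
set -e

attr_repo=$(mktemp -d)
cd "$attr_repo"
git init -q .

# The macro definition goes in the top-level .gitattributes, as described above.
cat > .gitattributes <<'EOF'
[attr]utf16 diff merge -crlf
*.vmc utf16
EOF

# Ask git which values the macro expands to for a matching path.
diff_attr=$(git check-attr diff -- machine.vmc)
merge_attr=$(git check-attr merge -- machine.vmc)
echo "$diff_attr"
echo "$merge_attr"
```

Each line of output has the form "path: attribute: value", so a typo in the macro name or pattern shows up as "unspecified" instead of "set".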

Also note that you should still be able to diff a file, even if git thinks it's binary with:

git diff --text

Edit

This answer basically says that GNU diff with UTF-16 or even UTF-8 doesn't work very well. If you want to have git use a different tool to see differences (via --ext-diff), that answer suggests Guiffy.

But what you likely need is just to diff a UTF-16 file that contains only ASCII characters. A way to get that to work is to use --ext-diff and the following shell script:

#!/bin/bash
# git passes: path old-file old-hex old-mode new-file new-hex new-mode
diff <(iconv -f utf-16 -t utf-8 "$2") <(iconv -f utf-16 -t utf-8 "$5")
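
One possible way to wire such a script into git is the GIT_EXTERNAL_DIFF environment variable, sketched below in a scratch repository. When git itself invokes an external diff program it passes seven arguments (path old-file old-hex old-mode new-file new-hex new-mode), so this version reads the two files from "$2" and "$5". The wrapper name utf16-diff.sh, the sample file, and the trailing exit 0 are assumptions made for the illustration.

```shell
#!/bin/bash
set -e

work=$(mktemp -d)
cd "$work"
git init -q .
git config user.name "Example"              # placeholder identity for the demo commit
git config user.email "example@example.com"

# Hypothetical wrapper script; left untracked so it does not appear in the diff.
cat > utf16-diff.sh <<'EOF'
#!/bin/bash
# $2 is the old version of the file, $5 the new one.
diff -u <(iconv -f utf-16 -t utf-8 "$2") <(iconv -f utf-16 -t utf-8 "$5")
# diff exits with 1 when the files differ; report success so git keeps going.
exit 0
EOF
chmod +x utf16-diff.sh

# Commit a UTF-16 file, then change it.
printf 'memory=512\n' | iconv -f utf-8 -t utf-16 > machine.vmc
git add machine.vmc
git commit -q -m "Add UTF-16 file"
printf 'memory=1024\n' | iconv -f utf-8 -t utf-16 > machine.vmc

# Run git diff with the wrapper as the external diff program.
ext_output=$(GIT_EXTERNAL_DIFF="$work/utf16-diff.sh" git diff --ext-diff -- machine.vmc)
printf '%s\n' "$ext_output"
```

Setting diff.external in the git configuration to the script's path should have the same effect without the environment variable.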

Note that converting to UTF-8 might work for merging as well, you just have to make sure it's done in both directions.

As for the output to the terminal when looking at a diff of a UTF-16 file:

> Trying to diff like that results in binary garbage spewed to the screen. If git is using GNU diff, it would seem that GNU diff is not unicode-aware.

GNU diff doesn't really care about unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters).

Solution 5 - Git

Git has understood encodings such as UTF-16 since version 2.18. See the gitattributes docs and search for working-tree-encoding.

[Make sure your man page matches since this is quite new!]

If (say) the file is UTF-16 without a BOM on a Windows machine, add this to your .gitattributes file:

*.vmc text working-tree-encoding=UTF-16LE eol=CRLF

If it is UTF-16 (with a BOM) on *nix, make it:

*.vmc text working-tree-encoding=UTF-16-BOM eol=LF

(Replace *.vmc with *.whatever for whatever type files you need to handle)

See: Support working-tree-encoding "UTF-16LE-BOM".
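
A small experiment shows what the attribute does: git converts the file to UTF-8 when it is added, so diffs are ordinary text diffs. This is a sketch under a few assumptions: Git 2.18 or later, an iconv whose plain utf-16 output starts with a BOM (which is what the UTF-16 setting expects), and a made-up file named machine.vmc. The eol part is left out to keep the example short.

```shell
#!/bin/bash
set -e

wte_repo=$(mktemp -d)
cd "$wte_repo"
git init -q .
git config user.name "Example"              # placeholder identity for the demo commit
git config user.email "example@example.com"

# Declare that *.vmc files are UTF-16 (with BOM) in the working tree.
echo '*.vmc text working-tree-encoding=UTF-16' > .gitattributes

# Create and commit a UTF-16 file.
printf 'memory=512\n' | iconv -f utf-8 -t utf-16 > machine.vmc
git add .gitattributes machine.vmc
git commit -q -m "Add UTF-16 file"

# The committed blob is plain UTF-8 text.
stored=$(git cat-file -p HEAD:machine.vmc)
echo "$stored"

# Change the working-tree file; git diff shows a normal text diff.
printf 'memory=1024\n' | iconv -f utf-8 -t utf-16 > machine.vmc
wte_diff=$(git diff -- machine.vmc)
printf '%s\n' "$wte_diff"
```

Because the conversion happens on add and checkout, every clone of the repository needs a git version that understands the attribute.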


Added later

Following @Hackslash, one may find that this is insufficient

 *.vmc text working-tree... 

To get nice text-diffs you need

 *.vmc diff working-tree...

Putting both works as well

 *.vmc text diff working-tree... 

But it's arguably

  • Redundant — eol=... implies text
  • Verbose — a large project could easily have dozens of different text file types

The Problem

Git has a macro attribute binary, which means -diff -merge -text. The opposite (+text +diff) is not built in, but git gives us the tools (I think!) for synthesizing it.

The solution

Git allows one to define new macro attributes.

I'd propose that at the top of the .gitattributes file you have

 [attr]textfile text diff

Then for all paths that need to be text and diff do

 path textfile working-tree-encoding= eol=...

Note that in most cases we want the default encoding (UTF-8) and the default eol (native), so those settings may be dropped.

Most lines should look like

*.c textfile
*.py textfile
Etc

Why not just use diff?

Practical: In most cases we want native eol. Which means no eol=... . So text won't get implied and needs to be put explicitly.

Conceptual: Text Vs binary is the fundamental distinction. eol, encoding, diff etc are just some aspects of it.

Disclaimer

Due to the bizarre times we are living in I don't have a machine with a current working git. So I'm unable at the moment to check the latest addition. If someone finds something wrong, I'll emend/remove.

Solution 6 - Git

Solution is to filter through cmd.exe /c "type %1". cmd's type builtin will do the conversion, and so you can use that with the textconv ability of git diff to enable text diffing of UTF-16 files (should work with UTF-8 as well, although untested).

Quoting from gitattributes man page:


Performing text diffs of binary files

Sometimes it is desirable to see the diff of a text-converted version of some binary files. For example, a word processor document can be converted to an ASCII text representation, and the diff of the text shown. Even though this conversion loses some information, the resulting diff is useful for human viewing (but cannot be applied directly).

The textconv config option is used to define a program for performing such a conversion. The program should take a single argument, the name of a file to convert, and produce the resulting text on stdout.

For example, to show the diff of the exif information of a file instead of the binary information (assuming you have the exif tool installed), add the following section to your $GIT_DIR/config file (or $HOME/.gitconfig file):

[diff "jpg"]
        textconv = exif

This solution is for mingw32; Cygwin fans may have to alter the approach. The issue is with passing the filename to convert to cmd.exe: git passes it with forward slashes, and cmd assumes backslash directory separators.

Step 1:

Create the single argument script that will do the conversion to stdout. c:\path\to\some\script.sh:

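#!/bin/bash
# Replace each forward slash in the path with escaped backslashes for cmd.exe.
SED='s/\//\\\\\\\\/g'
FILE=$(echo "$1" | sed -e "$SED")
# cmd's type builtin prints the file as text on stdout.
cmd.exe /c "type $FILE"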

Step 2:

Set up git to be able to use the script file. Inside your git config (~/.gitconfig or .git/config or see man git-config), put this:

[diff "cmdtype"]
textconv = c:/path/to/some/script.sh

Step 3:

Point out the files to apply this workaround to by using .gitattributes files (see man 5 gitattributes):

*.vmc diff=cmdtype

then use git diff on your files.

Solution 7 - Git

I have written a small git-diff driver, to-utf8, which should make it easy to diff any non-ASCII/UTF-8 encoded files. You can install it using the instructions here: https://github.com/chaitanyagupta/gitutils#to-utf8 (the to-utf8 script is available in the same repo).

Note that this script requires both file and iconv commands to be available on the system.

Solution 8 - Git

Had this problem on Windows recently, and the dos2unix and unix2dos binaries that ship with Git for Windows did the trick. By default they're located in C:\Program Files\Git\usr\bin\. Note that this will only work if your file doesn't need to be UTF-16. For example, someone accidentally encoded a Python file as UTF-16 when it didn't need to be (in my case).

PS C:\Users\xxx> dos2unix my_file.py
dos2unix: converting UTF-16LE file my_file.py to ANSI_X3.4-1968 Unix format...

and

PS C:\Users\xxx> unix2dos my_file.py
unix2dos: converting UTF-16LE file my_file.py to ANSI_X3.4-1968 DOS format...
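
Where dos2unix is not available, the same one-time re-encoding can be sketched with iconv instead. This swaps in a different tool from the one used above, and the file name and contents are made up. As with dos2unix, it only makes sense when the file does not need to stay UTF-16.

```shell
#!/bin/bash
set -e

conv_dir=$(mktemp -d)

# Stand-in for a Python file that was accidentally saved as UTF-16.
printf 'print("hi")\n' | iconv -f utf-8 -t utf-16 > "$conv_dir/my_file.py"

# Re-encode to UTF-8 through a temporary file, then replace the original.
iconv -f utf-16 -t utf-8 "$conv_dir/my_file.py" > "$conv_dir/my_file.py.tmp"
mv "$conv_dir/my_file.py.tmp" "$conv_dir/my_file.py"

# The file is now plain UTF-8 text that git can diff directly.
converted=$(cat "$conv_dir/my_file.py")
echo "$converted"
```

Unlike dos2unix and unix2dos, this leaves the line endings exactly as they were.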

Solution 9 - Git

As described in other answers, git diff doesn't handle UTF-16 files as text, which makes them unviewable in Atlassian SourceTree, for example. If the file name or suffix is known, the fix below will make those files viewable and comparable normally under SourceTree.

If the file suffix of the UTF-16 files is known (*.uni for example) then all files with that suffix can be associated with UTF-16 to UTF-8 converter with the following two changes:

  1. Create or modify the .gitattributes file in the root directory of the repository with the following line:

     *.uni diff=utf16
    
  2. Then modify the .gitconfig file in the user's home directory (C:\Users\yourusername\.gitconfig) with the following section:

    [diff "utf16"]
        textconv = "iconv -f utf-16 -t utf-8"
    

These two changes should take effect immediately without reloading the repository into SourceTree. It applies the text conversion to all *.uni files which makes them viewable and comparable like other text files. If other files need this conversion you can add additional lines to the .gitattributes file. (If the designated file(s) are NOT UTF-16 you will get unreadable results for that file.)

Note that this answer is a simplified rewrite of Tony Kuneck's answer.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type      Original Author    Original Content on Stackoverflow
Question          skiphoppy          View Question on Stackoverflow
Solution 1 - Git  Sam Stokes         View Answer on Stackoverflow
Solution 2 - Git  IlDan              View Answer on Stackoverflow
Solution 3 - Git  Chealion           View Answer on Stackoverflow
Solution 4 - Git  Jared Oberhaus     View Answer on Stackoverflow
Solution 5 - Git  Rusi               View Answer on Stackoverflow
Solution 6 - Git  Tony Kuneck        View Answer on Stackoverflow
Solution 7 - Git  Chaitanya Gupta    View Answer on Stackoverflow
Solution 8 - Git  Matt Messersmith   View Answer on Stackoverflow
Solution 9 - Git  Rod Dewell         View Answer on Stackoverflow