Can I make git recognize a UTF-16 file as text?
Git Problem Overview
I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16.
Can git be taught to recognize that this file is text and handle it appropriately?
I'm using git under Cygwin, with core.autocrlf set to false. I could use msysGit or git under UNIX, if necessary.
Git Solutions
Solution 1 - Git
I've been struggling with this problem for a while, and just discovered what is, for me, a perfect solution:
$ git config --global diff.tool vimdiff # or merge.tool to get merging too!
$ git difftool commit1 commit2
git difftool takes the same arguments as git diff would, but runs a diff program of your choice instead of git's built-in diff. So pick a multibyte-aware diff (in my case, vim in diff mode) and just use git difftool instead of git diff.
Find "difftool" too long to type? No problem:
$ git config --global alias.dt difftool
$ git dt commit1 commit2
Git rocks.
Solution 2 - Git
There is a very simple solution that works out of the box on Unices.
For example, with Apple's .strings files, just:
- Create a .gitattributes file in the root of your repository with:
*.strings diff=localizablestrings
- Add the following to your ~/.gitconfig file:
[diff "localizablestrings"]
textconv = "iconv -f utf-16 -t utf-8"
Source: Diff .strings files in Git (and older post from 2010).
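If iconv is not available, the same textconv idea can be sketched in Python. This is only an illustration under assumptions: the script name and the function utf16_to_utf8 are made up here, and it relies on git calling the textconv program with a single argument (the file to convert) and reading the converted text from stdout.

```python
#!/usr/bin/env python3
"""Hypothetical textconv helper: print a UTF-16 file as UTF-8 on stdout."""
import sys


def utf16_to_utf8(raw: bytes) -> bytes:
    # The "utf-16" codec honours a BOM if one is present; without a BOM
    # Python falls back to the machine's native byte order.
    return raw.decode("utf-16").encode("utf-8")


def main(argv):
    # git passes exactly one argument: the path of the file to convert.
    with open(argv[1], "rb") as handle:
        sys.stdout.buffer.write(utf16_to_utf8(handle.read()))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

To try it, the textconv line in ~/.gitconfig would point at wherever you saved the script, for example textconv = python3 /path/to/utf16_to_utf8.py (a placeholder path).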
Solution 3 - Git
Have you tried setting your .gitattributes to treat it as a text file?
e.g.:
*.vmc diff
More details at http://www.git-scm.com/docs/gitattributes.html.
Solution 4 - Git
By default, it looks like git won't work well with UTF-16; for such a file you have to make sure that no CRLF processing is done on it, but you want diff and merge to work as a normal text file (this is ignoring whether or not your terminal/editor can handle UTF-16).
But looking at the .gitattributes manpage, here is the custom attribute that is binary:
[attr]binary -diff -crlf
So it seems to me that you could define a custom attribute in your top level .gitattributes for utf16 (note that I add merge here to be sure it is treated as text):
[attr]utf16 diff merge -crlf
From there you would be able to specify in any .gitattributes file something like:
*.vmc utf16
Also note that you should still be able to diff a file, even if git thinks it's binary, with:
git diff --text
Edit
This answer basically says that GNU diff with UTF-16 or even UTF-8 doesn't work very well. If you want to have git use a different tool to see differences (via --ext-diff), that answer suggests Guiffy.
But what you likely need is just to diff a UTF-16 file that contains only ASCII characters. A way to get that to work is to use --ext-diff and the following shell script:
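#!/bin/bash
# git passes: path old-file old-hex old-mode new-file new-hex new-mode
diff <(iconv -f utf-16 -t utf-8 "$2") <(iconv -f utf-16 -t utf-8 "$5")
exit 0  # diff exits 1 on differences, which git would treat as a failure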
Note that converting to UTF-8 might work for merging as well, you just have to make sure it's done in both directions.
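For anyone without bash process substitution (plain cmd.exe, say), here is a rough Python sketch of the same external-diff idea. The function name utf16_unified_diff is invented for illustration, and the argument positions assume git's documented GIT_EXTERNAL_DIFF convention of seven arguments (path old-file old-hex old-mode new-file new-hex new-mode).

```python
#!/usr/bin/env python3
"""Sketch of an external diff helper for UTF-16 files (names are illustrative)."""
import difflib
import sys


def utf16_unified_diff(old_raw: bytes, new_raw: bytes, label: str = "file") -> str:
    # Decode both sides first so the comparison works on characters,
    # not on the NUL-padded UTF-16 bytes.
    old_lines = old_raw.decode("utf-16").splitlines(keepends=True)
    new_lines = new_raw.decode("utf-16").splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines, "a/" + label, "b/" + label)
    )


def main(argv):
    # argv[1] is the path, argv[2] the old file, argv[5] the new file.
    path, old_file, new_file = argv[1], argv[2], argv[5]
    with open(old_file, "rb") as old, open(new_file, "rb") as new:
        result = utf16_unified_diff(old.read(), new.read(), path)
    sys.stdout.buffer.write(result.encode("utf-8"))


if __name__ == "__main__" and len(sys.argv) == 8:
    main(sys.argv)
```

The script always exits 0 on success, since as far as I know git treats a non-zero exit from the helper as a failure by default.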
As for the output to the terminal when looking at a diff of a UTF-16 file:
> Trying to diff like that results in binary garbage spewed to the screen. If git is using GNU diff, it would seem that GNU diff is not unicode-aware.
GNU diff doesn't really care about unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters).
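To see where the "garbage" comes from, here is a small Python illustration. The helper looks_binary is hypothetical and only approximates what I understand git's heuristic to be (a NUL byte near the start of the content marks it as binary); the 8000-byte probe size is my recollection, not something stated above.

```python
def looks_binary(raw: bytes, probe: int = 8000) -> bool:
    # Assumption: approximates git's check, which (as I understand it)
    # treats content as binary when a NUL byte appears near the start.
    return b"\x00" in raw[:probe]


text = "<vm>ok</vm>\n"
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

print(looks_binary(as_utf8))   # False
print(looks_binary(as_utf16))  # True
print(as_utf16[:8])            # b'<\x00v\x00m\x00>\x00'
```

Every ASCII character in UTF-16 carries a zero byte alongside it, and those zero bytes are what end up interleaved with the plain ASCII diff markers on the terminal.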
Solution 5 - Git
git recently has begun to understand encodings such as utf16.
See gitattributes docs, search for working-tree-encoding
[Make sure your man page matches since this is quite new!]
If (say) the file is UTF-16 without BOM on a Windows machine, then add to your .gitattributes file:
*.vmc text working-tree-encoding=UTF-16LE eol=CRLF
If UTF-16 (with BOM) on *nix make it:
*.vmc text working-tree-encoding=UTF-16 eol=LF
(Replace *.vmc with *.whatever for whatever type of files you need to handle)
See: Support working-tree-encoding "UTF-16LE-BOM".
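To decide between the without-BOM and with-BOM variants, it helps to look at the first two bytes of the file. A minimal Python sketch, where utf16_bom_kind is a made-up helper name; it only reports what it finds and does not pick the attribute value for you:

```python
import codecs


def utf16_bom_kind(raw: bytes) -> str:
    """Report which UTF-16 byte order mark, if any, starts the data."""
    if raw.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian BOM"
    if raw.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian BOM"
    return "no UTF-16 BOM"


# Example: check only the start of a file (hypothetical file name).
# with open("machine.vmc", "rb") as handle:
#     print(utf16_bom_kind(handle.read(2)))
```

Note that a UTF-32 little-endian BOM also begins with FF FE, so this check is only meaningful when you already know the file is some flavour of UTF-16.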
Added later
Following @Hackslash, one may find that this is insufficient
*.vmc text working-tree...
To get nice text-diffs you need
*.vmc diff working-tree...
Putting both works as well
*.vmc text diff working-tree...
But it's arguably
- Redundant: eol=... implies text
- Verbose: a large project could easily have dozens of different text file types
The Problem
Git has a macro-attribute binary which means -text -diff. The opposite +text +diff is not available built-in but git gives the tools (I think!) for synthesizing it.
The solution
Git allows one to define new macro attributes.
I'd propose that at the top of the .gitattributes file you have
[attr]textfile text diff
Then for all paths that need to be text and diff do
path textfile working-tree-encoding= eol=...
Note that in most cases we would want the default encoding (utf-8) and default eol (native), so those settings may be dropped.
Most lines should look like
*.c textfile
*.py textfile
Etc
Why not just use diff?
Practical: In most cases we want native eol, which means no eol=... setting. So text won't get implied and needs to be put explicitly.
Conceptual: Text Vs binary is the fundamental distinction. eol, encoding, diff etc are just some aspects of it.
Disclaimer
Due to the bizarre times we are living in I don't have a machine with a current working git. So I'm unable at the moment to check the latest addition. If someone finds something wrong, I'll emend/remove.
Solution 6 - Git
The solution is to filter through cmd.exe /c "type %1". cmd's type builtin will do the conversion, and so you can use that with the textconv ability of git diff to enable text diffing of UTF-16 files (should work with UTF-8 as well, although untested).
Quoting from gitattributes man page:
Performing text diffs of binary files
Sometimes it is desirable to see the diff of a text-converted version of some binary files. For example, a word processor document can be converted to an ASCII text representation, and the diff of the text shown. Even though this conversion loses some information, the resulting diff is useful for human viewing (but cannot be applied directly).
The textconv config option is used to define a program for performing such a conversion. The program should take a single argument, the name of a file to convert, and produce the resulting text on stdout.
For example, to show the diff of the exif information of a file instead of the binary information (assuming you have the exif tool installed), add the following section to your $GIT_DIR/config file (or $HOME/.gitconfig file):
[diff "jpg"]
textconv = exif
This is a solution for mingw32; cygwin fans may have to alter the approach. The issue is with passing the filename to convert to cmd.exe: it will be using forward slashes, and cmd assumes backslash directory separators.
Step 1: Create the single-argument script that will do the conversion to stdout, c:\path\to\some\script.sh:
#!/bin/bash
SED='s/\//\\\\\\\\/g'
FILE=$(echo "$1" | sed -e "$SED")
cmd.exe /c "type $FILE"
Step 2: Set up git to be able to use the script file. Inside your git config (~/.gitconfig or .git/config or see man git-config), put this:
[diff "cmdtype"]
textconv = c:/path/to/some/script.sh
Step 3: Point out the files to apply this workaround to by utilizing .gitattributes files (see man gitattributes(5)):
*vmc diff=cmdtype
Then use git diff on your files.
Solution 7 - Git
I have written a small git-diff driver, to-utf8, which should make it easy to diff any non-ASCII/UTF-8 encoded files. You can install it using the instructions here: https://github.com/chaitanyagupta/gitutils#to-utf8 (the to-utf8 script is available in the same repo).
Note that this script requires both file and iconv commands to be available on the system.
Solution 8 - Git
Had this problem on Windows recently, and the dos2unix and unix2dos bins that ship with git for windows did the trick. By default they're located in C:\Program Files\Git\usr\bin\. Observe this will only work if your file doesn't need to be UTF-16. For example, someone accidentally encoded a python file as UTF-16 when it didn't need to be (in my case).
PS C:\Users\xxx> dos2unix my_file.py
dos2unix: converting UTF-16LE file my_file.py to ANSI_X3.4-1968 Unix format...
and
PS C:\Users\xxx> unix2dos my_file.py
unix2dos: converting UTF-16LE file my_file.py to ANSI_X3.4-1968 DOS format...
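If those binaries aren't at hand, roughly the same one-off conversion can be sketched in Python. The function names here are invented for illustration, and unlike the dos2unix run above (which wrote ASCII), this sketch writes UTF-8 with LF line endings.

```python
from pathlib import Path


def reencode_utf16_as_utf8(raw: bytes) -> bytes:
    # Decode UTF-16 (BOM-aware), normalise CRLF to LF, re-encode as UTF-8.
    text = raw.decode("utf-16")
    return text.replace("\r\n", "\n").encode("utf-8")


def convert_in_place(path: str) -> None:
    # Overwrites the file, so only use it when the file does not
    # actually need to stay UTF-16.
    target = Path(path)
    target.write_bytes(reencode_utf16_as_utf8(target.read_bytes()))
```

Calling convert_in_place("my_file.py") would rewrite the file from the example above; commit or back up first, since the change is destructive.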
Solution 9 - Git
As described in other answers, git diff doesn't handle UTF-16 files as text, and this makes them unviewable in Atlassian SourceTree, for example. If the file name or suffix is known, the fix below will make those files viewable and comparable normally under SourceTree.
If the file suffix of the UTF-16 files is known (*.uni for example) then all files with that suffix can be associated with UTF-16 to UTF-8 converter with the following two changes:
- Create or modify the .gitattributes file in the root directory of the repository with the following line:
*.uni diff=utf16
- Then modify the .gitconfig file in the user's home directory (C:\Users\yourusername\.gitconfig) with the following section:
[diff "utf16"]
textconv = "iconv -f utf-16 -t utf-8"
These two changes should take effect immediately without reloading the repository into SourceTree. It applies the text conversion to all *.uni files which makes them viewable and comparable like other text files. If other files need this conversion you can add additional lines to the .gitattributes file. (If the designated file(s) are NOT UTF-16 you will get unreadable results for that file.)
Note that this answer is a simplified rewrite of Tony Kuneck's answer.