git is very very slow when tracking large binary files

Git

Git Problem Overview


My project is six months old and git is very very slow. We track around 30 files which are of size 5 MB to 50 MB. Those are binary files and we keep them in git. I believe those files are making git slow.

Is there a way to kill all files of size > 5 MB from the repository? I know I would lose all of these files, and that is okay with me.

Ideally I would like a command that would list all the big files (> 5 MB). I could review the list and then say, okay, go ahead and delete those files and make git faster.

I should mention that git is slow not only on my machine; deploying the app to the staging environment now takes around 3 hours.

So the fix should be something that affects the server, not only the users of the repository.

Git Solutions


Solution 1 - Git

Do you garbage collect?

git gc

This makes a significant difference in speed, even for small repos.
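A minimal sketch of this, run inside your repository (the size check afterwards is an addition, not part of the original answer):

```shell
# Repack loose objects and prune unreachable ones.
git gc --prune=now

# Inspect the resulting on-disk size of the object store.
git count-objects -vH
```

The `size-pack` line in the output shows how much space the packed objects occupy after collection.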

Solution 2 - Git

Explanation

Git is really good at huge histories of small text files because it can store them and their changes efficiently. At the same time, git is very bad at binary files, and will naïvely store separate copies of the file (by default, at least). The repository gets huge, and then it gets slow, as you've observed.

This is a common problem among DVCSs, exacerbated by the fact that you download every version of every file ("the whole repository") every time you clone. The guys at Kiln are working on a plugin to treat these large files more like Subversion, which only downloads historical versions on-demand.

Solution

This command will list all files under the current directory larger than 5 MB (find's +5000000c means "more than 5,000,000 bytes").

find . -size +5000000c 2>/dev/null -exec ls -l {} \;
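The find command only sees the working tree; to find large blobs anywhere in the repository's history, git itself can produce the list. This is a sketch using standard plumbing commands; the 5242880 threshold (5 MB in bytes) is an assumption matching the question:

```shell
# List every blob in history larger than 5 MB (5242880 bytes), biggest
# first. Run inside the repository. Output: <size-in-bytes> <path>.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" && $3 > 5242880 { print $3, $4 }' |
  sort -rn
```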

If you want to remove the files from the entire history of the repository, you can use this idea with git filter-branch to walk the history and get rid of all traces of large files. After doing this, all new clones of the repository will be leaner. If you want to lean-up a repository without cloning, you'll find directions on the man page (see "Checklist for Shrinking a Repository").

git filter-branch --index-filter \
    'find . -size +5000000c 2>/dev/null -exec git rm --cached --ignore-unmatch {} \;'

A word of warning: this will make your repository incompatible with other clones, because the trees and indices have different files checked in; you won't be able to push or pull from them anymore.
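For reference, the man page's "Checklist for Shrinking a Repository" boils down to removing the backup refs and reflog entries that still point at the old, large objects, then pruning. A sketch, run inside the rewritten repository:

```shell
# filter-branch keeps backup refs under refs/original/; drop them.
rm -rf .git/refs/original/

# Expire all reflog entries that still reference the old commits.
git reflog expire --expire=now --all

# Repack and actually delete the now-unreachable large objects.
git gc --prune=now --aggressive
```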

Solution 3 - Git

Here is a censored revision intended to be less negative and inflammatory:

Git has a well-known weakness when it comes to files that are not line-by-line text files. There is currently no solution, and no plans announced by the core git team to address this. There are workarounds if your project is small, say, 100 MB or so. There exist branches of the git project to address this scalability issue, but these branches are not mature at this time. Some other revision control systems do not have this specific issue. You should consider this issue as just one of many factors when deciding whether to select git as your revision control system.

Solution 4 - Git

There is nothing specific about binary files in the way git handles them. When you add a file to a git repository, a header is added, the content is compressed with zlib, and the result is stored under its SHA-1 hash. This is exactly the same regardless of file type, and there is nothing in zlib compression that makes it problematic for binary files.

But at some points (pushing, gc) git starts to look at the possibility of delta-compressing content. If git finds files that are similar (by filename, etc.), it puts them in RAM and starts compressing them together. If you have 100 files of, say, 50 MB each, it will try to hold 5 GB in memory at the same time, plus some overhead to make things work. Your computer may not have that much RAM, so it starts to swap, and the process takes time.

You can limit the depth of the delta compression so that the process doesn't use as much memory, at the cost of less efficient compression (see core.bigFileThreshold, the delta attribute, pack.window, pack.depth, pack.windowMemory, etc.).
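A sketch of how those knobs might be set; the specific values and the *.bin pattern here are illustrative assumptions, not recommendations:

```shell
# Skip delta compression entirely for files above 50 MB.
git config core.bigFileThreshold 50m

# Cap the memory the delta search window may use while packing.
git config pack.windowMemory 100m

# Shrink the delta search window and chain depth (defaults: 10 and 50).
git config pack.window 5
git config pack.depth 10

# Or disable delta compression per file pattern via the delta attribute.
echo '*.bin -delta' >> .gitattributes
```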

So there are lots of things you can do to make git work very well with large files.

Solution 5 - Git

One way of speeding things up is to use the --depth 1 flag with git clone. See the man page for details. I am not a great git guru, but I believe this is the equivalent of a p4 get or an svn checkout: it gives you only the latest files, instead of "give me all of the revisions of all the files throughout all time," which is what git clone does by default.
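A sketch of a shallow clone (the URL is a placeholder):

```shell
# Fetch only the most recent commit instead of the full history.
git clone --depth 1 https://example.com/project.git
```

Note that a shallow clone speeds up the initial download, but it does not shrink the repository on the server.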

Solution 6 - Git

Have you told git those files are binary?

e.g. add *.ext binary to your repository's .gitattributes file
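A sketch, using *.psd as an assumed example extension (the built-in binary macro attribute expands to -diff -merge -text):

```shell
# Mark Photoshop files as binary so git skips text diff/merge handling.
echo '*.psd binary' >> .gitattributes

# Verify the attribute is picked up for a matching path.
git check-attr binary design.psd
```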

Solution 7 - Git

You can also consider BFG Repo Cleaner, a faster, easier way to clean up large files.

https://rtyley.github.io/bfg-repo-cleaner/

Solution 8 - Git

I have been running Git since 2008, both on Windows and GNU/Linux, and most of the files I track are binary files. Some of my repos are several GB and contain JPEGs and other media. I have many computers both at home and at work running Git.

I have never had the symptoms described by the original post. But just a couple of weeks ago I installed MsysGit on an old Win-XP laptop and, almost whatever I did, it brought git to a halt. Even a test with just two or three small text files was ridiculously slow. We are talking about 10 minutes to add a file of less than 1 KB... it seemed like the git processes stayed alive forever. Everything else worked as expected on this computer.
I downgraded from the latest version to 1.6-something and the problems were gone...
I have other laptops of the same brand, also with Win-XP installed by the same IT department from the same image, where Git works fine regardless of version... So there must be something odd with that particular computer.

I have also done some tests with binary files and compression. If you have a BMP picture, make small changes to it, and commit them, git gc will compress it very well. So my conclusion is that the compression does not depend on whether the files are binary or not.

Solution 9 - Git

Just set the files up to be ignored. See the link below:

http://help.github.com/git-ignore/
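Note that .gitignore only affects untracked files; files already committed remain in the repository and its history. A sketch, with assumed example patterns:

```shell
# Ignore large media files going forward (untracked files only).
echo '*.mp4' >> .gitignore
echo '*.psd' >> .gitignore

# check-ignore exits 0 and names the matching rule if a path is ignored.
git check-ignore -v video.mp4
```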

Solution 10 - Git

That's because git isn't scalable.

This is a serious limitation in git that is drowned out by git advocacy. Search the git mailing lists and you'll find hundreds of users wondering why just a meager 100 MB of images (say, for a web site or application) brings git to its knees. The problem appears to be that nearly all of git relies on an optimization they refer to as "packing". Unfortunately, packing is inefficient for all but the smallest text files (i.e., source code). Worse, it grows less and less efficient as the history increases.

It's really an embarrassing flaw in git, which is touted as "fast" (despite lack of evidence), and the git developers are well aware of it. Why haven't they fixed it? You'll find responses on the git mailing list from git developers who won't acknowledge the problem because Photoshop documents (*.psd) are a proprietary format. Yes, it's really that bad.

Here's the upshot:

Use git for tiny, source-code only projects for which you don't feel like setting up a separate repo. Or for small source-code only projects where you want to take advantage of git's copy-the-entire-repo model of decentralized development. Or when you simply want to learn a new tool. All of these are good reasons to use git, and it's always fun to learn new tools.

Don't use git if you have a large code base, binaries, huge history, etc. Just one of our repos is a TB. Git can't handle it. VSS, CVS, and SVN handle it just fine. (SVN bloats up, though.)

Also, give git time to mature. It's still immature, yet it has a lot of momentum. In time, I think the practical nature of Linus will overcome the OSS purists, and git will eventually be usable in the larger field.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Nick Vanderbilt | View Question on Stackoverflow
Solution 1 - Git | kubi | View Answer on Stackoverflow
Solution 2 - Git | Andres Jaan Tack | View Answer on Stackoverflow
Solution 3 - Git | John | View Answer on Stackoverflow
Solution 4 - Git | martin | View Answer on Stackoverflow
Solution 5 - Git | David | View Answer on Stackoverflow
Solution 6 - Git | sml | View Answer on Stackoverflow
Solution 7 - Git | David I. | View Answer on Stackoverflow
Solution 8 - Git | martin | View Answer on Stackoverflow
Solution 9 - Git | joshlrogers | View Answer on Stackoverflow
Solution 10 - Git | John | View Answer on Stackoverflow