Git with large files

Tags: Git, Large Files, GitLab

Git Problem Overview


Situation

I have two servers, Production and Development. On the Production server there are two applications and multiple (6) MySQL databases which I need to distribute to developers for testing. All source code is stored in GitLab on the Development server, and developers work only with this server; they don't have access to the Production server. When we release an application, the master logs into Production and pulls the new version from Git. The databases are large (over 500 MB each and counting) and I need to distribute them as easily as possible to the developers for testing.

Possible solutions

  • After a backup script dumps the databases, each to a single file, execute a script which pushes each database to its own branch. A developer pulls one of these branches if he wants to update his local copy.

    This one was found not to work.

  • A cron job on the production server saves the binary logs every day and pushes them into the branch for that database. So, in the branch, there are files with the daily changes, and a developer pulls only the files he doesn't have. The current SQL dump is sent to the developer another way. And when the size of the repository becomes too large, we will send a full dump to the developers, flush all data in the repository, and start from the beginning. (A sketch of such a job follows this list.)
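For reference, here is a minimal sketch of what such a nightly dump-and-push job could look like. The database names, paths, and one-branch-per-database layout are assumptions for illustration, not a tested setup; MySQL credentials are assumed to be configured elsewhere (e.g. in ~/.my.cnf):

#!/bin/sh
# Hypothetical nightly dump-and-push job (all names and paths are assumptions).
# Assumes one pre-created branch per database in a dedicated repository.
set -e
BACKUP_DIR=/var/backups/mysql
REPO_DIR=/srv/db-dumps                  # dedicated repo, separate from the application code
cd "$REPO_DIR"
for DB in app1_db app2_db; do           # assumed database names
  mysqldump "$DB" > "$BACKUP_DIR/$DB.sql"
  git checkout "dump/$DB"               # one branch per database
  cp "$BACKUP_DIR/$DB.sql" "$DB.sql"
  git add "$DB.sql"
  git commit -m "Nightly dump of $DB ($(date +%F))" || true   # no-op when nothing changed
  git push origin "dump/$DB"
done

As the answers below explain, this works mechanically but scales poorly, because every dump adds its full, barely delta-compressible content to the repository history.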

Questions

  • Is the solution possible?
  • When git pushes to / pulls from a repository, does it upload/download whole files, or just the changes in them (i.e. new lines or edits to existing ones)?
  • Can Git manage such large files? No.
  • How to set how many revisions are preserved in a repository? Doesn't matter with the new solution.
  • Is there any better solution? I don't want to force the developers to download such large files over FTP or anything similar.

Git Solutions


Solution 1 - Git

Update 2017:

Microsoft is contributing to Microsoft/GVFS: a Git Virtual File System which allows Git to handle "the largest repo on the planet"
(i.e. the Windows code base, which comprises approximately 3.5M files and, when checked into a Git repo, results in a repo of about 300 GB, and produces 1,760 daily "lab builds" across 440 branches, in addition to thousands of pull-request validation builds)

> GVFS virtualizes the file system beneath your git repo so that git and all tools see what appears to be a normal repo, but GVFS only downloads objects as they are needed.

Some parts of GVFS might be contributed upstream (to Git itself).
But in the meantime, all new Windows development is now (August 2017) on Git.


Update April 2015: GitHub proposes: Announcing Git Large File Storage (LFS)

Using git-lfs (see git-lfs.github.com) and a server supporting it, such as lfs-test-server, you can store only metadata in the git repo, with the large files kept elsewhere. There is a maximum of 2 GB per commit.

(Animated demo: https://cloud.githubusercontent.com/assets/1319791/7051226/c4570828-ddf4-11e4-87eb-8fc165e5ece4.gif)

See git-lfs/wiki/Tutorial:

git lfs track '*.bin'             # record the pattern in .gitattributes
git add .gitattributes "*.bin"    # stage the attributes file and the matching binaries
git commit -m "Track .bin files"
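After tracking, large files are committed and pushed like any other file; only a small pointer ends up in the repository history while the content goes to the LFS store. The file name below is just an example matching the tracked pattern:

git add large-dataset.bin           # example file matching the tracked *.bin pattern
git commit -m "Add dataset via LFS"
git push origin master              # the binary content is uploaded to the LFS server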

Original answer:

Regarding git's limitations with large files, you can consider bup (presented in detail in GitMinutes #24).

The design of bup highlights the three issues that limit a git repo:

  • huge files (the xdelta for packfiles is in memory only, which isn't good for large files)
  • a huge number of files, which means one file per blob, and a slow git gc to generate one packfile at a time
  • huge packfiles, with a packfile index that is inefficient for retrieving data from the (huge) packfile

Handling huge files and xdelta

> The primary reason git can't handle huge files is that it runs them through xdelta, which generally means it tries to load the entire contents of a file into memory at once.
If it didn't do this, it would have to store the entire contents of every single revision of every single file, even if you only changed a few bytes of that file.
That would be a terribly inefficient use of disk space, and git is well known for its amazingly efficient repository format.

> Unfortunately, xdelta works great for small files and gets amazingly slow and memory-hungry for large files.
For git's main purpose, i.e. managing your source code, this isn't a problem.

> What bup does instead of xdelta is what we call "hashsplitting."
We wanted a general-purpose way to efficiently back up any large file that might change in small ways, without storing the entire file every time. We read through the file one byte at a time, calculating a rolling checksum of the last 128 bytes.

> rollsum seems to do pretty well at its job. You can find it in bupsplit.c.
Basically, it converts the last 128 bytes read into a 32-bit integer. What we then do is take the lowest 13 bits of the rollsum, and if they're all 1's, we consider that to be the end of a chunk.
This happens on average once every 2^13 = 8192 bytes, so the average chunk size is 8192 bytes.
We're dividing up those files into chunks based on the rolling checksum.
Then we store each chunk separately (indexed by its sha1sum) as a git blob.

> With hashsplitting, no matter how much data you add, modify, or remove in the middle of the file, all the chunks before and after the affected chunk are absolutely the same.
All that matters to the hashsplitting algorithm is the 32-byte "separator" sequence, and a single change can only affect, at most, one separator sequence or the bytes between two separator sequences.
Like magic, the hashsplit chunking algorithm will chunk your file the same way every time, even without knowing how it had chunked it previously.

> The next problem is less obvious: after you store your series of chunks as git blobs, how do you store their sequence? Each blob has a 20-byte sha1 identifier, which means the simple list of blobs is going to be 20/8192 = 0.25% of the file length.
For a 200GB file, that's 488 megs of just sequence data.

> We extend the hashsplit algorithm a little further using what we call "fanout." Instead of checking just the last 13 bits of the checksum, we use additional checksum bits to produce additional splits.
What you end up with is an actual tree of blobs - which git 'tree' objects are ideal to represent.

Handling huge numbers of files and git gc

> git is designed for handling reasonably-sized repositories that change relatively infrequently. You might think you change your source code "frequently" and that git handles much more frequent changes than, say, svn can handle.
But that's not the same kind of "frequently" we're talking about.

> The #1 killer is the way it adds new objects to the repository: it creates one file per blob. Then you later run 'git gc' and combine those files into a single file (using highly efficient xdelta compression, and ignoring any files that are no longer relevant).

> 'git gc' is slow, but for source code repositories, the resulting super-efficient storage (and associated really fast access to the stored files) is worth it.

> bup doesn't do that. It just writes packfiles directly.
Luckily, these packfiles are still git-formatted, so git can happily access them once they're written.
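By way of illustration, a minimal bup backup run could look like the sketch below (the paths and the backup name are assumptions); the resulting packfiles live in an ordinary git-readable repository:

export BUP_DIR=/srv/bup-repo               # assumed location of the bup repository
bup init
bup index -u /var/backups/mysql            # assumed directory holding the database dumps
bup save -n mysql-dumps /var/backups/mysql
git --git-dir="$BUP_DIR" log mysql-dumps   # the saved tree is visible to plain git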

Handling huge repositories (meaning huge numbers of huge packfiles)

> Git isn't actually designed to handle super-huge repositories.
Most git repositories are small enough that it's reasonable to merge them all into a single packfile, which 'git gc' usually does eventually.

> The problematic part of large packfiles isn't the packfiles themselves - git is designed to expect the total size of all packs to be larger than available memory, and once it can handle that, it can handle virtually any amount of data about equally efficiently.
The problem is the packfile index (.idx) files.

> each packfile (*.pack) in git has an associated idx (*.idx) that's a sorted list of git object hashes and file offsets.
If you're looking for a particular object based on its sha1, you open the idx, binary search it to find the right hash, then take the associated file offset, seek to that offset in the packfile, and read the object contents.

> The performance of the binary search is about O(log n) with the number of hashes in the pack, with an optimized first step (you can read about it elsewhere) that somewhat improves it to O(log(n)-7).
Unfortunately, this breaks down a bit when you have lots of packs.

> To improve performance of this sort of operation, bup introduces midx (pronounced "midix" and short for "multi-idx") files.
As the name implies, they index multiple packs at a time.
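To see concretely what an idx maps, you can dump a pack index with plain git (standard commands; the pack file name is simply whatever exists in your repository):

git count-objects -v                               # shows how many packs the repo currently has
git verify-pack -v .git/objects/pack/pack-*.idx    # lists each object's SHA-1, type, size and offset in the pack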

Solution 2 - Git

You really, really, really do not want large binary files checked into your Git repository.

Each update will cumulatively add to the overall size of your repository, meaning that down the road your Git repo will take longer and longer to clone and use up more and more disk space. Git stores the entire history of the branch locally, so when someone checks out the branch, they don't just download the latest version of the database; they also have to download every previous version.

If you need to provide large binary files, upload them to some server separately, and then check in a text file with a URL where the developer can download the large binary file. FTP is actually one of the better options, since it's specifically designed for transferring binary files, though HTTP is probably even more straightforward.
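A minimal sketch of that approach, assuming an HTTP server and a checked-in manifest named db-dumps.txt; the URL, file names, and checksums are placeholders:

# db-dumps.txt, committed to git, holds one "<url> <sha256>" pair per line, e.g.:
#   https://files.example.com/dumps/app1_db-2017-08-01.sql.gz  0123abcd...
# Developers download and verify with something like:
while read -r url sha; do
  curl -fLO "$url"
  echo "$sha  $(basename "$url")" | sha256sum -c -
done < db-dumps.txt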

Solution 3 - Git

rsync could be a good option for efficiently updating the developers' copies of the databases.

It uses a delta algorithm to incrementally update the files, so it only transfers the blocks of the file that have changed or that are new. Developers will, of course, still need to download the full file the first time, but later updates will be quicker.

Essentially you get an incremental update similar to a git fetch, without the ever-expanding initial copy that a git clone would give. The loss is not having the history, but it sounds like you don't need that.

rsync is a standard part of most Linux distributions; if you need it on Windows, there is a packaged port available: http://itefix.no/cwrsync/

To push the databases to a developer you could use a command similar to:

rsync -avz path/to/database(s) HOST:/folder

Or the developers could pull the database(s) they need with:

rsync -avz DATABASE_HOST:/path/to/database(s) path/where/developer/wants/it
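A possible variation for large transfers; the flags are standard rsync options, and the paths are assumptions:

rsync -avz --partial --progress DATABASE_HOST:/var/backups/mysql/ ~/db-dumps/   # resume interrupted transfers and show progress

One caveat worth noting: compressed dumps (e.g. .sql.gz) tend to defeat rsync's delta algorithm, because a small change in the SQL rewrites most of the compressed stream. Transferring plain .sql files, or compressing with gzip --rsyncable where that option is available, keeps the incremental updates small.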

Solution 4 - Git

You can look at a solution like git-annex, which is about managing (big) files with git, without checking the file contents into git(!)
(Feb 2015: a hosting service like GitLab integrates it natively:
see "Does GitLab support large files via git-annex or otherwise?")

git doesn't manage big files, as explained by Amber in her answer.

That doesn't mean git won't be able to do better one day, though.
From GitMinutes episode 9 (May 2013, see also below), from Peff (Jeff King), at 36'10'':

(transcript)

> There is a whole other realm of large repositories where people are interested in storing, you know, 20 or 30 or 40 GB, sometimes even TB-sized repositories, and yeah, it comes from having a lot of files, but a lot of it comes from having really big files and really big binary files that don't deal so well with each other.

> That's sort of an open problem. There are a couple of solutions: git-annex is probably the most mature of those, where they basically don't put the asset into git, they put the large asset on an asset server, and put a pointer into git.

> I'd like to do something like that, where the asset is conceptually in git, that is, the SHA1 of that object is part of the SHA1 that goes into the tree, that goes into the commit ID and all those things.
So from git's perspective, it is part of the repository, but at a level below, at the object storage level, at a level below the conceptual history graph, where we already have multiple ways of storing an object: we have loose objects, we have packed objects, I'd like to have maybe a new way of storing an object, which is to say "we don't have it here, but it is available from an asset server", or something like that.

> (Thomas Ferris Nicolaisen) Oh cool...

> The problem with things like git-annex is: once you use them, you're... locked-in to the decisions you made at that time forever. You know, that if you decide oh 200 MB is big, and we are gonna store on an asset server, and then, later you decide, aah it should have been 300 MB, well tough luck: that's encoded in your history forever.
And so by saying conceptually, at the git level, this object is in the git repository, not some pointer to it, not some pointer to an asset server, the actual object is there, and then taking care of those details at a low-level, at the storage level, then that frees you up to make a lot of different decisions, and even change your decision later about how you actually want to store the stuff on disk.

Not a high-priority project for now...


Three years later, in April 2016, Git Minutes 40 includes an interview with Michael Haggerty from GitHub, at around 31' (thank you Christian Couder for the interview).

He has specialized in the reference back end for quite a while.
He cites David Turner's work on back ends as the most interesting at the moment. (See David's current "pluggable-backends" branch of his git/git fork.)

(transcript)

> Christian Couder (CC): The goal is to have git refs stored in a database, for example? Michael Haggerty (MH): Yeah, I see it as two interesting aspects: the first is simply having the ability to plug in different reference storage back ends. Currently, references are stored in the filesystem, as a combination of loose references and packed references.
A loose reference is one file per reference, and a packed reference is one big file containing a list of many, many references.

> So that's a good system, especially for local usage, as it doesn't have any real performance problem for normal people, but it does have some problems, like the fact that you can't keep reflogs for references after the references have been deleted, because there can be conflicts with newer references created with similar names. There is also a problem where reference names are stored on the filesystem, so you can have references which are named similarly but differ only in capitalization.
So those are things which could be fixed by having a different reference back-end system in general.
And the other aspect of David Turner's patch series is a change to store references in a database called LMDB; this is a really fast memory-based database that has some performance advantages over the file back end.

[follows other considerations around having faster packing, and reference patch advertisement]

Solution 5 - Git

Having auxiliary storage for files referenced from your git-stashed code is where most people go. git-annex does look pretty comprehensive, but many shops just use an FTP or HTTP (or S3) repository for the large files, like SQL dumps. My suggestion would be to tie the code in the git repo to the names of the files in the auxiliary storage by stuffing some of the metadata - specifically a checksum (probably SHA) - into the name, as well as a date.

  • So each aux file gets a basename, a date, and a SHA (for some version n) sum.
  • If you have wild file turnover, using only a SHA poses a tiny but real threat of hash collision, hence the inclusion of a date (epoch time or an ISO date).
  • Put the resulting filename into the code, so that the aux chunk is included, very specifically, by reference.
  • Structure the names in such a way that a little script can easily git grep all the aux file names, so that the list for any commit is trivial to obtain (a sketch follows this list). This also allows the old ones to be retired at some point, and can be integrated with the deployment system to pull the new aux files out to production without clobbering the old ones (yet), prior to activating the code from the git repo.
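A hypothetical sketch of such naming and lookup; the layout, extensions, and store path are invented for illustration:

f=app1_db.sql.gz
name="$(basename "$f" .sql.gz)-$(date +%Y%m%d)-$(sha256sum "$f" | cut -c1-12).sql.gz"
cp "$f" "/mnt/auxstore/$name"      # or an FTP/HTTP/S3 upload to the auxiliary store
echo "$name" > db/app1_db.ref      # the only thing committed to git is this small reference

# List every aux file referenced by a given commit (HEAD here; any commit works):
git grep -h -E '[a-z0-9_]+-[0-9]{8}-[0-9a-f]{12}\.sql\.gz' HEAD -- 'db/*.ref'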

Cramming huge files into git (or most repos) has a nasty impact on git's performance after a while - a git clone really shouldn't take twenty minutes, for example. Whereas using the files by reference means that some developers will never need to download the large chunks at all (a sharp contrast to the git clone), since the odds are that most are only relevant to the deployed code in production. Your mileage may vary, of course.

Solution 6 - Git

Uploading large files can create issues and errors, and this happens quite often. GitHub, for example, warns about files larger than 50 MB. To upload bigger files (.mp4, .mp3, .psd, etc.) to a git repository, you need to install an additional helper that handles big files.

There are some basic git commands to know before uploading a big file; this is the configuration for uploading to GitHub. It requires installing the Git LFS client.

Install it from git-lfs.github.com.



Then you use the basic git commands, along with a few LFS-specific ones:

git lfs install                      # set up the Git LFS hooks
git init
git lfs track "*.mp4"                # track large file types via .gitattributes
git lfs track "*.mp3"
git lfs track "*.psd"
git add .
git add .gitattributes
git config lfs.https://github.com/something/repo.git/info/lfs.locksverify false
git commit -m "Add design file"
git push origin master

If you push without that configuration, the output of the push command may instruct you to set lfs.https://github.com/something/repo.git/info/lfs.locksverify to false, as in the git config line above.
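As an optional sanity check (a standard git-lfs command), you can confirm which files are actually being handled by LFS after the commit:

git lfs ls-files    # lists the files stored as LFS pointers on the current branch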

Solution 7 - Git

As stated in many other answers, storing big files in git is strongly discouraged. I will not reiterate that here.

Your question seems to be more about database persistence than about git. If the database info is not that large, then:

  1. For Java, you can use Flyway (flywaydb) to store the diff of the database between each release.
  2. For Django, you can dump the DB content to JSON (python manage.py dumpdata your_app > datadump.json) and reload it somewhere else (python manage.py loaddata datadump.json).

However, since your DB is large, you should consider popular binary stores such as Nexus or Artifactory, which can store binary files or serve as the backing store for Git LFS. Then, to lessen the burden on the developers, because you don't want them to download the files explicitly, you need to build your own CI/CD pipeline that enables devs to publish them in one click.
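For example, a pipeline step could publish a dump to a generic/raw binary repository with a plain HTTP PUT; the URL, repository name, and credentials below are placeholders:

curl -u "$REPO_USER:$REPO_PASS" --upload-file app1_db-2017-08-01.sql.gz \
  "https://artifactory.example.com/artifactory/db-dumps/app1_db-2017-08-01.sql.gz"

Developers (or a deployment script) can then fetch the same URL instead of pulling the dump through git.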

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

  • Question: Jakub Riedl (View Question on Stackoverflow)
  • Solution 1 - Git: VonC (View Answer on Stackoverflow)
  • Solution 2 - Git: Amber (View Answer on Stackoverflow)
  • Solution 3 - Git: PeterSW (View Answer on Stackoverflow)
  • Solution 4 - Git: VonC (View Answer on Stackoverflow)
  • Solution 5 - Git: Alex North-Keys (View Answer on Stackoverflow)
  • Solution 6 - Git: Ariful Islam (View Answer on Stackoverflow)
  • Solution 7 - Git: R. Liu (View Answer on Stackoverflow)