Find files in git repo over x megabytes, that don't exist in HEAD
Git Problem Overview
I have a Git repository I store random things in. Mostly random scripts, text files, websites I've designed and so on.
There are some large binary files (generally 1-5MB) that I have deleted over time; I don't need them in the revision history, but they are still sitting around inflating the size of the repository.
Basically I want to be able to do..
me@host:~$ [magic command or script]
aad29819a908cc1c05c3b1102862746ba29bafc0 : example/blah.psd : 3.8MB : 130 days old
6e73ca29c379b71b4ff8c6b6a5df9c7f0f1f5627 : another/big.file : 1.12MB : 214 days old
..then be able to go through each result, checking whether it's still required, and removing it if not (probably using git filter-branch).
Git Solutions
Solution 1 - Git
This is an adaptation of the git-find-blob script I posted previously:
#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;

sub usage { die "usage: git-large-blob <size[b|k|m]> [<git-log arguments ...>]\n" }

@ARGV or usage();
my ( $max_size, $unit ) = ( shift =~ /^(\d+)([bkm]?)\z/ ) ? ( $1, $2 ) : usage();

# convert the threshold to bytes (b, k or m suffix)
my $exp = 10 * ( $unit eq 'b' ? 0 : $unit eq 'k' ? 1 : 2 );
my $cutoff = $max_size * 2**$exp;

# recursively collect every blob in a tree that is at least $cutoff bytes
sub walk_tree {
    my ( $tree, @path ) = @_;
    my @subtree;
    my @r;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => -l => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            my ( $type, $sha1, $size, $name ) = /\A[0-7]{6} (\S+) (\S+) +(\S+)\t(.*)/;
            if ( $type eq 'tree' ) {
                push @subtree, [ $sha1, $name ];
            }
            elsif ( $type eq 'blob' and $size >= $cutoff ) {
                push @r, [ $size, @path, $name ];
            }
        }
    }

    push @r, walk_tree( $_->[0], @path, $_->[1] )
        for @subtree;

    return @r;
}

memoize 'walk_tree';

open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %cr'
    or die "Couldn't open pipe to git-log: $!\n";

my %seen;
while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $age ) = split " ", $_, 3;
    my $is_header_printed;
    for ( walk_tree( $tree ) ) {
        my ( $size, @path ) = @$_;
        my $path = join '/', @path;
        next if $seen{ $path }++;    # report each path only once
        print "$commit $age\n" if not $is_header_printed++;
        print "\t$size\t$path\n";
    }
}
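Save it somewhere on your PATH as git-large-blob and pass a size threshold; any further arguments go straight to git log. A sketch of an invocation (the 1m threshold is arbitrary, and the output line just re-uses the question's example file, following the script's print format):
$ git-large-blob 1m
aad2981 4 months ago
        3990000 example/blah.psd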
Solution 2 - Git
A more compact Ruby script:
#!/usr/bin/env ruby

head, threshold = ARGV
head ||= 'HEAD'
Megabyte = 1000 ** 2
threshold = (threshold || 0.1).to_f * Megabyte

big_files = {}

# walk every commit reachable from head, recording oversized blobs by SHA
IO.popen("git rev-list #{head}", 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= threshold
    end
  end
end

big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end
Usage:
ruby big_file.rb [rev] [size in MB]
$ ruby big_file.rb master 0.3
3.8M example/blah.psd (aad2981: 4 months ago)
1.1M another/big.file (6e73ca2: 2 weeks ago)
Solution 3 - Git
Python script to do the same thing (based on this post):
#!/usr/bin/env python
import os, sys

def getOutput(cmd):
    return os.popen(cmd).read()

if len(sys.argv) != 2:
    print "usage: %s size_in_bytes" % sys.argv[0]
else:
    maxSize = int(sys.argv[1])
    revisions = getOutput("git rev-list HEAD").split()
    bigfiles = set()
    for revision in revisions:
        files = getOutput("git ls-tree -zrl %s" % revision).split('\0')
        for file in files:
            if file == "":
                continue
            # fields: mode, type, sha, size, path (path may contain spaces)
            splitdata = file.split(None, 4)
            commit = splitdata[2]
            if splitdata[3] == "-":
                # trees and submodules have no size
                continue
            size = int(splitdata[3])
            path = splitdata[4]
            if size > maxSize:
                bigfiles.add("%10d %s %s" % (size, commit, path))
    bigfiles = sorted(bigfiles, reverse=True)
    for f in bigfiles:
        print f
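Per the usage message, the single argument is a size in bytes. Assuming the script is saved as find_big_files.py (the name is illustrative), finding everything over 1 MB would be:
$ python find_big_files.py 1048576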
Solution 4 - Git
Ouch... that first script (by Aristotle) is pretty slow. On the git.git repo, looking for files > 100k, it chews up the CPU for about 6 minutes.
It also appears to print several wrong SHAs -- often a SHA will be printed that has nothing to do with the filename mentioned on the next line.
Here's a faster version. The output format is different, but it is very fast, and it is also -- as far as I can tell -- correct.
The program is a bit longer but a lot of it is verbiage.
#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

use File::Temp qw(tempdir);
END { chdir( $ENV{HOME} ); }
my $tempdir = tempdir( "git-files_tempdir.XXXXXXXXXX", TMPDIR => 1, CLEANUP => 1 );

my $min = shift;
$min =~ /^\d+$/ or die "need a number";

# ----------------------------------------------------------------------

my @refs = qw(HEAD);
@refs = @ARGV if @ARGV;

# first, find blob SHAs and names (no sizes here)
open( my $objects, "-|", "git", "rev-list", "--objects", @refs ) or die "rev-list: $!";
open( my $blobfile, ">", "$tempdir/blobs" ) or die "blobs out: $!";

my ( $blob, $name );
my %name;
my %size;

while (<$objects>) {
    next unless / ./;    # no commits or top level trees
    ( $blob, $name ) = split;
    $name{$blob} = $name;
    say $blobfile $blob;
}
close($blobfile);

# next, use cat-file --batch-check on the blob SHAs to get sizes
open( my $sizes, "-|", "< $tempdir/blobs git cat-file --batch-check | grep blob" ) or die "cat-file: $!";
my ( $dummy, $size );
while (<$sizes>) {
    ( $blob, $dummy, $size ) = split;
    next if $size < $min;
    # remember the largest size each path has ever had
    $size{ $name{$blob} } = $size if ( $size{ $name{$blob} } || 0 ) < $size;
}

my @names_by_size = sort { $size{$b} <=> $size{$a} } keys %size;

say "
The size shown is the largest that file has ever attained. But note
that it may not be that big at the commit shown, which is merely the
most recent commit affecting that file.
";

# finally, for each name being printed, find when it was last updated on each
# branch that we're concerned about and print stuff out
for my $name (@names_by_size) {
    say "$size{$name}\t$name";
    for my $r (@refs) {
        system("git --no-pager log -1 --format='%x09%h%x09%x09%ar%x09$r' $r -- $name");
    }
    print "\n";
}
print "\n";
Solution 5 - Git
You want to use the BFG Repo-Cleaner, a faster, simpler alternative to git-filter-branch, specifically designed for removing large files from Git repos.
Download the BFG jar (requires Java 6 or above) and run this command:
$ java -jar bfg.jar --strip-blobs-bigger-than 1M my-repo.git
Any files over 1M in size (that aren't in your latest commit) will be removed from your Git repository's history. You can then use git gc to clean away the dead data:
$ git gc --prune=now --aggressive
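For context, the BFG's own docs suggest running it against a fresh --mirror clone and pushing the cleaned history back once you're happy with it; a typical end-to-end session (the repository URL is illustrative) looks roughly like this:
$ git clone --mirror git://example.com/my-repo.git
$ java -jar bfg.jar --strip-blobs-bigger-than 1M my-repo.git
$ cd my-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
$ git push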
The BFG is typically 10-50x faster than running git-filter-branch, and the options are tailored around these two common use-cases:
- Removing Crazy Big Files
- Removing Passwords, Credentials & other Private data
Full disclosure: I'm the author of the BFG Repo-Cleaner.
Solution 6 - Git
Aristotle's script will show you what you want. You also need to know that deleted files will still take up space in the repo.
By default, Git keeps changes around for 30 days before they can be garbage-collected. If you want to remove them now:
$ git reflog expire --expire=1.minute refs/heads/master
# all deletions up to 1 minute ago available to be garbage-collected
$ git fsck --unreachable
# lists all the blobs (file contents) that will be garbage-collected
$ git prune
$ git gc
A side comment: While I am a big fan of Git, Git doesn't bring any advantages to storing your collection of "random scripts, text files, websites" and binary files. Git tracks changes in content, particularly the history of coordinated changes among many text files, and does so very efficiently and effectively. As your question illustrates, Git doesn't have good tools for tracking individual file changes. And it doesn't track changes within binaries, so any revision stores another full copy in the repo.
Of course this use of Git is a perfectly good way to get familiar with how it works.
Solution 7 - Git
#!/bin/bash
if [ "$#" != 1 ]
then
    echo 'git-large.sh [size]'
    exit
fi

declare -A big_files
big_files=()
echo printing results

# collect every blob over the threshold, across all commits
while read commit
do
    while read bits type sha size path
    do
        if [ "$size" != "-" ] && [ "$size" -gt "$1" ]
        then
            big_files[$sha]="$sha $size $path"
        fi
    done < <(git ls-tree --abbrev -rl $commit)
done < <(git rev-list HEAD)

# print only the blobs that still exist in HEAD
for file in "${big_files[@]}"
do
    read sha size path <<< "$file"
    if git ls-tree -r HEAD | grep -q $sha
    then
        echo $file
    fi
done
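A sketch of an invocation, assuming the script is saved as git-large.sh; the threshold is compared against git ls-tree's size column, so it is in bytes:
$ bash git-large.sh 1000000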
Solution 8 - Git
This bash "one-liner" displays all blob objects in the repository that are larger than 10 MiB and are not present in HEAD, sorted from smallest to largest.
It's very fast, easy to copy and paste, and only requires standard GNU utilities.
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk -v min_mb=10 '/^blob/ && $3 >= min_mb*2^20 {print substr($0,6)}' \
| grep -vFf <(git ls-tree -r HEAD | awk '{print $3}') \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
This will generate output like this:
2ba44098e28f 12MiB path/to/hires-image.png
bd1741ddce0d 63MiB path/to/some-video-1080p.mp4
For more information, including an output format more suitable for further script processing, see my original answer on a similar question.
macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes, or brew install coreutils.
Solution 9 - Git
My Python simplification of https://stackoverflow.com/a/10099633/131881:
#!/usr/bin/env python
import os, sys

bigfiles = []
for revision in os.popen('git rev-list HEAD'):
    for f in os.popen('git ls-tree -zrl %s' % revision).read().split('\0'):
        if f:
            mode, type, commit, size, path = f.split(None, 4)
            # size is "-" for trees and submodules
            if size != '-' and int(size) > int(sys.argv[1]):
                bigfiles.append((int(size), commit, path))

for f in sorted(set(bigfiles)):
    print f
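As with the script it simplifies, the single argument is a byte threshold; assuming it is saved as big_files.py (name illustrative):
$ python big_files.py 1000000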
Solution 10 - Git
A little late to the party, but git-fat has this functionality built in. Just install it with pip and run git fat -a find 100000, where the number at the end is in bytes.
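Put together, that is (assuming git-fat is the package name on PyPI):
$ pip install git-fat
$ git fat -a find 100000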