Find files in git repo over x megabytes, that don't exist in HEAD

Git

Git Problem Overview


I have a Git repository I store random things in. Mostly random scripts, text files, websites I've designed and so on.

There are some large binary files I have deleted over time (generally 1-5MB), which are sitting around increasing the size of the repository, which I don't need in the revision history.

Basically I want to be able to do...

me@host:~$ [magic command or script]
aad29819a908cc1c05c3b1102862746ba29bafc0 : example/blah.psd : 3.8MB : 130 days old
6e73ca29c379b71b4ff8c6b6a5df9c7f0f1f5627 : another/big.file : 1.12MB : 214 days old

...then be able to go through each result, checking whether it's still required, and removing it if not (probably using filter-branch).
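
For the removal step, I'm assuming something like this filter-branch invocation would do it (untested sketch; the path is just the example from above, and rewriting published history is destructive):

$ git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch example/blah.psd' \
    --prune-empty -- --all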

Git Solutions


Solution 1 - Git

This is an adaptation of the git-find-blob script I posted previously:

#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;

sub usage { die "usage: git-large-blob <size[b|k|m]> [<git-log arguments ...>]\n" }

@ARGV or usage();
my ( $max_size, $unit ) = ( shift =~ /^(\d+)([bkm]?)\z/ ) ? ( $1, $2 ) : usage();

# convert the optional b/k/m suffix into a byte count
my $exp = 10 * ( $unit eq 'b' ? 0 : $unit eq 'k' ? 1 : 2 );
my $cutoff = $max_size * 2**$exp;

# recursively collect every blob at or above the cutoff under $tree
sub walk_tree {
    my ( $tree, @path ) = @_;
    my @subtree;
    my @r;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => -l => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            my ( $type, $sha1, $size, $name ) = /\A[0-7]{6} (\S+) (\S+) +(\S+)\t(.*)/;
            if ( $type eq 'tree' ) {
                push @subtree, [ $sha1, $name ];
            }
            elsif ( $type eq 'blob' and $size >= $cutoff ) {
                push @r, [ $size, @path, $name ];
            }
        }
    }

    push @r, walk_tree( $_->[0], @path, $_->[1] )
        for @subtree;

    return @r;
}

# trees shared between commits are only walked once
memoize 'walk_tree';

open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %cr'
    or die "Couldn't open pipe to git-log: $!\n";

my %seen;
while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $age ) = split " ", $_, 3;
    my $is_header_printed;
    for ( walk_tree( $tree ) ) {
        my ( $size, @path ) = @$_;
        my $path = join '/', @path;
        next if $seen{ $path }++;
        print "$commit $age\n" if not $is_header_printed++;
        print "\t$size\t$path\n";
    }
}
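
Assuming the script is saved as git-large-blob on your PATH (the name its own usage message suggests), an invocation might look like:

$ git-large-blob 1m master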

Solution 2 - Git

A more compact Ruby script:

#!/usr/bin/env ruby
head, threshold = ARGV
head ||= 'HEAD'
Megabyte = 1000 ** 2
threshold = (threshold || 0.1).to_f * Megabyte

big_files = {}

IO.popen("git rev-list #{head}", 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= threshold
    end
  end
end

big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end

Usage:

ruby big_file.rb [rev] [size in MB]
$ ruby big_file.rb master 0.3
3.8M  example/blah.psd  (aad2981: 4 months ago)
1.1M  another/big.file  (6e73ca2: 2 weeks ago)

Solution 3 - Git

Python script to do the same thing (based on this post):

#!/usr/bin/env python

import os, sys

def getOutput(cmd):
    return os.popen(cmd).read()

if len(sys.argv) != 2:
    print("usage: %s size_in_bytes" % sys.argv[0])
else:
    maxSize = int(sys.argv[1])

    revisions = getOutput("git rev-list HEAD").split()

    bigfiles = set()
    for revision in revisions:
        files = getOutput("git ls-tree -zrl %s" % revision).split('\0')
        for file in files:
            if file == "":
                continue
            splitdata = file.split(None, 4)  # mode, type, sha, size, path
            sha = splitdata[2]  # this is the blob SHA, not a commit
            if splitdata[3] == "-":
                continue  # submodule (gitlink) entries have no size
            size = int(splitdata[3])
            path = splitdata[4]
            if size > maxSize:
                bigfiles.add("%10d %s %s" % (size, sha, path))
    
    bigfiles = sorted(bigfiles, reverse=True)
    
    for f in bigfiles:
        print(f)

Solution 4 - Git

Ouch... that first script (by Aristotle) is pretty slow. On the git.git repo, looking for files > 100k, it chews up the CPU for about 6 minutes.

It also appears to print several wrong SHAs -- often the SHA shown has nothing to do with the filename mentioned on the next line.

Here's a faster version. The output format is different, but it is very fast, and it is also -- as far as I can tell -- correct.

The program is a bit longer, but a lot of it is verbiage.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

use File::Temp qw(tempdir);
END { chdir( $ENV{HOME} ); }
my $tempdir = tempdir( "git-files_tempdir.XXXXXXXXXX", TMPDIR => 1, CLEANUP => 1 );

my $min = shift;
$min =~ /^\d+$/ or die "need a number";

# ----------------------------------------------------------------------

my @refs = qw(HEAD);
@refs = @ARGV if @ARGV;

# first, find blob SHAs and names (no sizes here)
open( my $objects, "-|", "git", "rev-list", "--objects", @refs) or die "rev-list: $!";
open( my $blobfile, ">", "$tempdir/blobs" ) or die "blobs out: $!";

my ( $blob, $name );
my %name;
my %size;
while (<$objects>) {
    next unless / ./;    # no commits or top level trees
    ( $blob, $name ) = split;
    $name{$blob} = $name;
    say $blobfile $blob;
}
close($blobfile);

# next, use cat-file --batch-check on the blob SHAs to get sizes
open( my $sizes, "-|", "< $tempdir/blobs git cat-file --batch-check | grep blob" ) or die "cat-file: $!";

my ( $dummy, $size );
while (<$sizes>) {
    ( $blob, $dummy, $size ) = split;
    next if $size < $min;
    $size{ $name{$blob} } = $size if ( $size{ $name{$blob} } || 0 ) < $size;
}

my @names_by_size = sort { $size{$b} <=> $size{$a} } keys %size;

say "
The size shown is the largest that file has ever attained.  But note
that it may not be that big at the commit shown, which is merely the
most recent commit affecting that file.
";

# finally, for each name being printed, find when it was last updated on each
# branch that we're concerned about and print stuff out
for my $name (@names_by_size) {
    say "$size{$name}\t$name";

    for my $r (@refs) {
        system("git --no-pager log -1 --format='%x09%h%x09%x09%ar%x09$r' $r -- $name");
    }
    print "\n";
}
print "\n";

Solution 5 - Git

You want to use the BFG Repo-Cleaner, a faster, simpler alternative to git-filter-branch specifically designed for removing large files from Git repos.

Download the BFG jar (requires Java 6 or above) and run this command:

$ java -jar bfg.jar  --strip-blobs-bigger-than 1M  my-repo.git

Any files over 1M in size (that aren't in your latest commit) will be removed from your Git repository's history. You can then use git gc to clean away the dead data:

$ git gc --prune=now --aggressive

The BFG is typically 10-50x faster than running git-filter-branch and the options are tailored around these two common use-cases:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data
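
For the second use-case, the BFG also offers a --replace-text option; a sketch, where private-strings.txt is a hypothetical file listing the expressions to scrub:

$ java -jar bfg.jar --replace-text private-strings.txt my-repo.git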

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Solution 6 - Git

Aristotle's script will show you what you want. You also need to know that deleted files will still take up space in the repo.

By default, Git keeps unreachable objects that are still referenced from the reflog around for 30 days before they can be garbage-collected. If you want to remove them now:

$ git reflog expire --expire=1.minute refs/heads/master
     # all deletions up to 1 minute ago become available to be garbage-collected
$ git fsck --unreachable 
     # lists all the blobs(file contents) that will be garbage-collected 
$ git prune 
$ git gc
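
Note that the reflog command above only expires refs/heads/master. To make unreachable objects from every ref collectable in one go, something like this should work:

$ git reflog expire --expire-unreachable=now --all
$ git gc --prune=now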

A side comment: While I am a big fan of Git, Git doesn't bring any advantages to storing your collection of "random scripts, text files, websites" and binary files. Git tracks changes in content, particularly the history of coordinated changes among many text files, and does so very efficiently and effectively. As your question illustrates, Git doesn't have good tools for tracking individual file changes, and it doesn't track changes within binaries, so any revision of a binary stores another full copy in the repo.

Of course this use of Git is a perfectly good way to get familiar with how it works.

Solution 7 - Git

#!/bin/bash
if [ "$#" != 1 ]
then
  echo "usage: $0 <size-in-bytes>"
  exit
fi

declare -A big_files
big_files=()
echo printing results

while read commit
do
  while read bits type sha size path
  do
    if [ "$size" -gt "$1" ]
    then
      big_files[$sha]="$sha $size $path"
    fi
  done < <(git ls-tree --abbrev -rl $commit)
done < <(git rev-list HEAD)

for file in "${big_files[@]}"
do
  read sha size path <<< "$file"
  if ! git ls-tree -r HEAD | grep -q "$sha"   # keep only blobs no longer present in HEAD
  then
    echo $file
  fi
done
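
Saved as, say, git-large.sh (the name is arbitrary), it takes a byte threshold:

$ bash git-large.sh 1048576   # list big blobs, 1 MiB threshold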


Solution 8 - Git

This bash "one-liner" displays all blob objects in the repository that are larger than 10 MiB and are not present in HEAD sorted from smallest to largest.

It's very fast, easy to copy & paste and only requires standard GNU utilities.

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk -v min_mb=10 '/^blob/ && $3 >= min_mb*2^20 {print substr($0,6)}' \
| grep -vFf <(git ls-tree -r HEAD | awk '{print $3}') \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

This will generate output like this:

2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4
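
To trace where one of these blobs entered history, Git 2.16 and later can search commits for an object ID:

$ git log --all --oneline --find-object=2ba44098e28f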

For more information, including an output format more suitable for further script processing, see my original answer on a similar question.

macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes or brew install coreutils.

Solution 9 - Git

My Python simplification of https://stackoverflow.com/a/10099633/131881:

#!/usr/bin/env python
import os, sys

bigfiles = []
for revision in os.popen('git rev-list HEAD'):
    for f in os.popen('git ls-tree -zrl %s' % revision.strip()).read().split('\0'):
        if f:
            mode, objtype, sha, size, path = f.split(None, 4)
            if size == '-':
                continue  # submodule (gitlink) entries have no size
            if int(size) > int(sys.argv[1]):
                bigfiles.append((int(size), sha, path))

for f in sorted(set(bigfiles)):
    print(f)
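
It takes a single byte-size argument; assuming the script is saved as big_files.py:

$ python big_files.py 1048576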

Solution 10 - Git

A little late to the party, but git-fat has this functionality built in.

Just install it with pip and run git fat -a find 100000, where the number at the end is the size in bytes.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        Original Author        Original Content on Stackoverflow
Question            dbr                    View Question on Stackoverflow
Solution 1 - Git    Aristotle Pagaltzis    View Answer on Stackoverflow
Solution 2 - Git    mislav                 View Answer on Stackoverflow
Solution 3 - Git    SigTerm                View Answer on Stackoverflow
Solution 4 - Git    Sitaram Chamarty       View Answer on Stackoverflow
Solution 5 - Git    Roberto Tyley          View Answer on Stackoverflow
Solution 6 - Git    Paul                   View Answer on Stackoverflow
Solution 7 - Git    Zombo                  View Answer on Stackoverflow
Solution 8 - Git    raphinesse             View Answer on Stackoverflow
Solution 9 - Git    Collin Anderson        View Answer on Stackoverflow
Solution 10 - Git   Caustic                View Answer on Stackoverflow