How to find the N largest files in a git repository?

Git Problem Overview


I wanted to find the 10 largest files in my repository. The script I came up with is as follows:

REP_HOME_DIR=<top level git directory>
max_huge_files=10

cd "${REP_HOME_DIR}"
git verify-pack -v "${REP_HOME_DIR}"/.git/objects/pack/pack-*.idx | \
  grep blob | \
  sort -r -k 3 -n | \
  head -n "${max_huge_files}" | \
  awk '{ system("printf \"%-80s \" `git rev-list --objects --all | grep " $1 " | cut -d\" \" -f2`"); printf "Size:%5d MB Size in pack file:%5d MB\n", $3/1048576,  $4/1048576; }'
cd -

Is there a better/more elegant way to do the same?

By "files" I mean the files that have been checked into the repository.

Git Solutions


Solution 1 - Git

I found another way to do it:

> git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 | tail -n 10

Quoted from: SO: git find fat commit

Solution 2 - Git

This bash "one-liner" displays the 10 largest blobs in the repository, sorted from smallest to largest. In contrast to the other answers, this includes all files tracked by the repository, even those not present in any branch tip.

It's very fast, easy to copy & paste and only requires standard GNU utilities.

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| tail -n 10 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

The first four lines implement the core functionality, the fifth limits the number of results, while the last two lines provide the nice human-readable output that looks like this:

...
0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4

For more information, including further filtering use cases and an output format more suitable for script processing, see my original answer to a similar question.

macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes or brew install coreutils.
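
If installing coreutils is not an option, the size column can also be formatted with plain awk. The following is only a sketch, not part of the original answer: humanize_sizes is a hypothetical helper name, and it assumes input lines of the form "<hash> <bytes> <path>", which is what the pipeline above produces after the cut step.

```shell
# Hypothetical helper (not from the original answer): reads
# "<hash> <bytes> <path>" lines on stdin and rewrites the byte count
# in IEC units (KiB, MiB, ...) using only awk.
humanize_sizes() {
  awk 'BEGIN { n = split("B KiB MiB GiB TiB", unit, " ") }
  {
    size = $2
    i = 1
    while (size >= 1024 && i < n) { size /= 1024; i++ }
    # Keep the path exactly as it appears, including any spaces.
    path = $0
    sub(/^[^ ]+ +[^ ]+ ?/, "", path)
    if (i == 1) printf "%s %8s %s\n", $1, size unit[i], path
    else        printf "%s %8s %s\n", $1, sprintf("%.1f", size) unit[i], path
  }'
}

# Intended use: replace the final numfmt line of the pipeline with
#   | humanize_sizes
```

Unlike the numfmt call, this sketch always prints one decimal place for sizes of 1 KiB and above, so the output differs slightly from the sample shown earlier.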

Solution 3 - Git

How about

git ls-files | xargs ls -l | sort -nrk5 | head -n 10
  • git ls-files: List all the files in the repo
  • xargs ls -l: perform ls -l on all the files returned in git ls-files
  • sort -nrk5: Numerically reverse sort the lines based on 5th column
  • head -n 10: Print the top 10 lines
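
One caveat with the pipeline above is that file names containing spaces are split into several arguments by xargs. A possible workaround, sketched here under the assumption that your xargs supports -0 (GNU and BSD versions do), is to pass NUL-delimited names. The function name largest_files is made up for this example.

```shell
# Hypothetical helper: reads NUL-delimited file names on stdin and prints
# the N largest (default 10) as `ls -l` lines, largest first.
# -d makes ls print a directory entry itself instead of its contents.
largest_files() {
  xargs -0 ls -ld | sort -nrk5 | head -n "${1:-10}"
}

# Intended use, from the top level of the work tree:
#   git ls-files -z | largest_files 10
```

The -z option of git ls-files terminates each name with a NUL byte and disables path quoting, so names with spaces or non-ASCII characters reach ls unchanged.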

Solution 4 - Git

I cannot comment, so here is ypid's answer modified for PowerShell:

git ls-tree -r -l --abbrev --full-name HEAD | Sort-Object {[int]($_ -split "\s+")[3]} | Select-Object -last 10

Edit: raphinesse's solution, roughly adapted for PowerShell:

git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | Where-Object {$_ -like "blob*"} | Sort-Object {[int]($_ -split "\s+")[2]} | Select-Object -last 10

Solution 5 - Git

An improvement to raphinesse's answer, sort by size with largest first:

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '/^blob/ {print substr($0,6)}' \
| sort --numeric-sort --key=2 --reverse \
| head \
| cut --complement --characters=13-40 \
| numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

Solution 6 - Git

On Windows, I started with @pix64's answer (thanks!) and modified it to handle files with spaces in the path, and also to output objects instead of strings:

git rev-list --objects --all |
 git cat-file --batch-check='%(objecttype)|%(objectname)|%(objectsize)|%(rest)' |
 Where-Object {$_ -like "blob*"} |
 % { $tokens = $_ -split "\|"; [pscustomobject]@{ Hash = $tokens[1]; Size = [int]($tokens[2]); Name = $tokens[3] } } |
 Sort-Object -Property Size -Descending |
 Select-Object -First 50

Even better, if you want to output the file sizes with nice file size units, you can add the DisplayInBytes function from here to your environment, and then pipe the above to:

Format-Table Hash, Name, @{Name="Size";Expression={ DisplayInBytes($_.Size) }}

This gives you output like:

Hash                                     Name                                        Size
----                                     ----                                        ----
f51371aa843279a1efe45ff14f3dc3ec5f6b2322 types/react-native-snackbar-component/react 95.8 MB
84f3d727f6b8f99ab4698da51f9e507ae4cd8879 .ntvs_analysis.dat                          94.5 MB
17d734397dcd35fdbd715d29ef35860ecade88cd fhir/fhir-tests.ts                          11.5 KB
4c6a027cdbce093fd6ae15e65576cc8d81cec46c fhir/fhir-tests.ts                          11.4 KB

Lastly, if you'd like to see which file types (by extension) take up the most space in total, you can do so with:

git rev-list --objects --all |
 git cat-file --batch-check='%(objecttype)|%(objectname)|%(objectsize)|%(rest)' |
 Where-Object {$_ -like "blob*"} |
 % { $tokens = $_ -split "\|"; [pscustomobject]@{ Size = [int]($tokens[2]); Extension = [System.IO.Path]::GetExtension($tokens[3]) } } |
 Group-Object -Property Extension |
 % { [pscustomobject]@{ Name = $_.Name; Size = ($_.Group | Measure-Object Size -Sum).Sum } } |
 Sort-Object -Property Size -Descending |
 select -First 20 -Property Name, @{Name="Size";Expression={ DisplayInBytes($_.Size) }}
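
For readers on bash rather than PowerShell, the per-extension totals can be approximated with awk. This is a rough sketch and not a port of the code above: size_by_extension is a hypothetical name, it prints raw byte totals, and it assumes input lines of the form "<type> <bytes> <path>".

```shell
# Hypothetical helper: reads "<type> <bytes> <path>" lines on stdin and
# prints "<total bytes> <extension>" for blobs, largest total first.
# Dotfiles such as .gitignore are counted as "(none)"; that is a choice
# made for this sketch, not behavior taken from the PowerShell version.
size_by_extension() {
  awk '$1 == "blob" {
    path = $0
    sub(/^[^ ]+ +[^ ]+ ?/, "", path)   # drop type and size, keep the path
    name = path
    sub(/^.*\//, "", name)             # strip leading directories
    ext = "(none)"
    if (match(name, /\.[^.]+$/) && RSTART > 1) ext = substr(name, RSTART)
    total[ext] += $2
  }
  END { for (e in total) printf "%.0f %s\n", total[e], e }' | sort -nr
}

# Intended use:
#   git rev-list --objects --all \
#     | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
#     | size_by_extension | head -n 20
```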

Solution 7 - Git

For completeness, here's the method I found:

ls -lSh `git ls-files` | head

The optional -h prints the size in human-readable format.

Solution 8 - Git

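You can also use du. Example: du -a objects | sort -n -r | head -n 10. Here du reports the size of each object, sort orders them numerically from largest to smallest, and head picks the top 10. The -h flag is left out because sort -n does not order human-readable sizes such as 4.0K and 1.2M correctly.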

Solution 9 - Git

Adding my 5 cents on how to do this for the whole repo history (useful before BFGing out large blobs committed by accident):

git rev-list --all | while read rev ; do git ls-tree -rl --full-name $rev ; done | sort -k4 -nr | uniq

Example output (from the dte repo on GitHub) reveals one screenshot in the history that could probably be removed to keep the whole repo a bit smaller:

100644 blob 3147cb8d0780442f70765a005f1a114442f24e9b   67942	Documentation/screenshot.png
100644 blob 36ea7701a6d58185800e22c39cac78d979f4375a   62575	Documentation/screenshot.png
100644 blob c0cd355f06a093cd762339b76f0e726edf22fca1   49046	src/command.c
100644 blob 76d20c2e4a80cd3f417d15c130ee6968e99d6d7f   48601	src/command.c
100644 blob c476fbf2fda71ebd4b337e62fb76922d18aeb1f3   48588	src/command.c
100644 blob 24465d1fab54e48817780338f8206baf47e98091   48451	src/command.c
100644 blob 74494b6020b2eff223dfaeed39bbfca414f2b359   48429	src/command.c
100644 blob fb8f13abe39ca8ff0e98aa65f95c336c9253b487   47838	src/command.c
100644 blob c2ce190eb428c3aeb12d40cf902af2a433324dee   47835	src/command.c
...

...but this precise repo is okay, no blobs of extreme size were found.

EDIT: How to find the commits that touch a given object (adding for my own reference, haha):

git log --all --find-object=3147cb8d07

Solution 10 - Git

You can use find to find files larger than a given threshold, then pass them to git ls-files to exclude untracked files (e.g. build output):

find * -type f -size +100M -print0 | xargs -0 git ls-files

Adjust 100M (100 megabytes) as needed until you get results.

Minor caveat: this won't search top-level "hidden" files and folders (i.e. those whose names start with .). This is because I used find * instead of just find to avoid searching the .git database.

I was having trouble getting the sort -n solutions to work (on Windows under Git Bash). I'm guessing it's due to indentation differences between the batches that xargs splits the arguments into, which xargs -0 seems to do automatically to work around Windows' command-line length limit of 32767 characters.
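
Note that find only sees files in the current working tree. If you want the same threshold idea applied to every blob in the history, one possible sketch is to filter the output of git cat-file --batch-check. The helper name blobs_at_least is hypothetical, and the input format "<type> <hash> <bytes> <path>" is an assumption matching the format string in the usage comment.

```shell
# Hypothetical helper: reads "<type> <hash> <bytes> <path>" lines on stdin
# and prints "<hash> <bytes> <path>" for blobs of at least $1 bytes.
blobs_at_least() {
  awk -v min="$1" '$1 == "blob" && $3 + 0 >= min + 0 {
    line = $0
    sub(/^blob /, "", line)   # drop the object type, keep the rest as is
    print line
  }'
}

# Intended use (100 MiB threshold):
#   git rev-list --objects --all \
#     | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
#     | blobs_at_least $((100 * 1024 * 1024))
```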

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type       Original Author   Original Content on Stackoverflow
Question           Sumit             View Question on Stackoverflow
Solution 1 - Git   ypid              View Answer on Stackoverflow
Solution 2 - Git   raphinesse        View Answer on Stackoverflow
Solution 3 - Git   pranithk          View Answer on Stackoverflow
Solution 4 - Git   pix64             View Answer on Stackoverflow
Solution 5 - Git   studog            View Answer on Stackoverflow
Solution 6 - Git   UnionP            View Answer on Stackoverflow
Solution 7 - Git   tsvikas           View Answer on Stackoverflow
Solution 8 - Git   First Zero        View Answer on Stackoverflow
Solution 9 - Git   exa               View Answer on Stackoverflow
Solution 10 - Git  Joey Adams        View Answer on Stackoverflow