Remove folder and its contents from git/GitHub's history
Problem Overview
I was working on a repository on my GitHub account and this is a problem I stumbled upon.
- Node.js project with a folder containing a few installed npm packages
- The packages were in the node_modules folder
- Added that folder to the git repository and pushed the code to GitHub (wasn't thinking about the npm part at that time)
- Realized that you don't really need that folder to be a part of the code
- Deleted that folder and pushed the change
At that point, the total size of the git repo was around 6 MB, where the actual code (everything except that folder) was only around 300 KB.
What I am looking for in the end is a way to remove that package folder from git's history, so that someone who clones the repo doesn't have to download 6 MB worth of history when the files they actually get as of the last commit total only about 300 KB.
I looked up possible solutions for this and tried these 3 methods
- https://stackoverflow.com/questions/2164581/remove-file-from-git-repository-history
- http://help.github.com/remove-sensitive-data/
- https://gist.github.com/1588371
The Gist seemed to work: after running the script, it reported that it got rid of that folder and that 50 different commits were modified. But it didn't let me push that code. When I tried to push, it said Branch up to date, yet git status showed that 50 commits were modified. The other 2 methods didn't help either.
Even though it reported that the folder's history was gone, when I checked the size of the repo on my local machine, it was still around 6 MB. (I also deleted the refs/original folder but didn't see any change in the size of the repo.)
What I am looking to clarify is whether there's a way to get rid of not only the commit history (which is the only thing I think happened) but also the files git keeps around in case one wants to roll back.
Let's say a solution is presented for this and works on my local machine but can't be reproduced on the GitHub repo. Is it possible to clone that repo, roll back to the first commit, perform the trick and push it? Or does that mean git will still have the history of all those commits (i.e. the 6 MB)?
My end goal here is to basically find the best way to get rid of the folder contents from git so that a user doesn't have to download 6MB worth of stuff and still possibly have the other commits that never touched the modules folder (that's pretty much all of them) in git's history.
How can I do this?
Git Solutions
Solution 1 - Git
WARNING: git filter-branch is no longer officially recommended
If you are here to copy-paste code:
This is an example which removes node_modules from history
git filter-branch --tree-filter "rm -rf node_modules" --prune-empty HEAD
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
echo node_modules/ >> .gitignore
git add .gitignore
git commit -m 'Removing node_modules from git history'
git gc
git push origin master --force
What git actually does:
The first line rewrites every commit reachable from HEAD (your current branch). For each of those commits, --tree-filter checks out the commit's tree and runs the command rm -rf node_modules in it. This command deletes the node_modules folder (-r is required, since without -r, rm won't delete folders), with no prompt given to the user (-f). The added --prune-empty drops commits that no longer change anything once the folder is gone.
The second line deletes the backup references to the old history, which filter-branch keeps under refs/original/.
The rest of the commands are relatively straightforward.
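If you would rather see the effect before touching a real repository, the same sequence can be tried in a throwaway repo. This is only a sketch: the temporary directory, the file names and the demo identity are made up for illustration, and FILTER_BRANCH_SQUELCH_WARNING is set only to skip the deprecation pause that recent git versions add to filter-branch.

```shell
#!/bin/sh
# Sketch: reproduce the problem in a scratch repo, then rewrite its history.
# Every name below (directory, files, identity) is hypothetical.
export FILTER_BRANCH_SQUELCH_WARNING=1

demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit 1: real code plus an accidentally committed node_modules folder
mkdir node_modules
echo "console.log('app')" > app.js
echo "module.exports = {}" > node_modules/dep.js
git add .
git commit -q -m "Add app and node_modules"

# Commit 2: delete the folder (it still lives in the history)
git rm -r -q node_modules
git commit -q -m "Remove node_modules"

# Number of commits that touch node_modules before the rewrite
before=$(git log --oneline -- node_modules | wc -l | tr -d ' ')

# Rewrite every commit reachable from HEAD without the folder
git filter-branch --tree-filter "rm -rf node_modules" --prune-empty HEAD >/dev/null 2>&1

# Drop the backup refs that filter-branch leaves under refs/original/
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d

# Number of commits that touch node_modules after the rewrite
after=$(git log --oneline -- node_modules | wc -l | tr -d ' ')
echo "commits touching node_modules: before=$before after=$after"
```

In this scratch repo both commits touch node_modules before the rewrite. Afterwards none do, and the second commit disappears because --prune-empty drops it once it no longer changes anything.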
Solution 2 - Git
I find that the --tree-filter option used in other answers can be very slow, especially on larger repositories with lots of commits.
Here is the method I use to completely remove a directory from the git history using the --index-filter option, which runs much quicker:
# Make a fresh clone of YOUR_REPO
git clone YOUR_REPO
cd YOUR_REPO
# Create tracking branches of all branches
for remote in `git branch -r | grep -v /HEAD`; do git checkout --track $remote ; done
# Remove DIRECTORY_NAME from all commits, then remove the refs to the old commits
# (repeat these two commands for as many directories as you want to remove)
git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch DIRECTORY_NAME/' --prune-empty --tag-name-filter cat -- --all
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
# Ensure all old refs are fully removed
rm -Rf .git/logs .git/refs/original
# Perform a garbage collection to remove commits with no refs
git gc --prune=all --aggressive
# Force push all branches to overwrite their history
# (use with caution!)
git push origin --all --force
git push origin --tags --force
You can check the size of the repository before and after the gc with:
git count-objects -vH
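The size only drops once nothing refers to the old commits any more, which is probably why the repo in the question stayed at 6 MB after the rewrite. A minimal sketch of that behaviour in a scratch repo follows; the file names, the 200000-byte dummy file and the demo identity are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: the removed blob stays in the object store until the backup refs and
# reflog entries are gone and gc has pruned it. All names are hypothetical.
export FILTER_BRANCH_SQUELCH_WARNING=1

demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir node_modules
echo "console.log('app')" > app.js
head -c 200000 /dev/urandom > node_modules/big.bin   # stand-in for a bulky package
git add .
git commit -q -m "Add app and node_modules"

# Remember the object id of the bulky file so we can look for it later
blob=$(git rev-parse HEAD:node_modules/big.bin)

git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch node_modules/' \
    --prune-empty --tag-name-filter cat -- --all >/dev/null 2>&1

# The history is rewritten, but the object is still stored locally
if git cat-file -e "$blob" 2>/dev/null; then present_before_gc=yes; else present_before_gc=no; fi
git count-objects -vH

# Remove everything that still points at the old commits, then prune
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then present_after_gc=yes; else present_after_gc=no; fi
git count-objects -vH
echo "blob present before gc: $present_before_gc, after gc: $present_after_gc"
```

Comparing the two count-objects reports shows where the space went. Here git reflog expire is used instead of deleting .git/logs by hand; both approaches aim at the same thing, namely leaving no reference to the old commits.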
Solution 3 - Git
It appears that the up-to-date answer to this is to not use filter-branch directly (at least git itself does not recommend it anymore), and defer that work to an external tool. In particular, git-filter-repo is currently recommended. The author of that tool provides arguments on why using filter-branch directly can lead to issues.
Most of the multi-line scripts above to remove dir from the history could be re-written as:
git filter-repo --path dir --invert-paths
The tool is more powerful than just that, apparently. You can apply filters by author, email, refname and more (full manpage here). Furthermore, it is fast. Installation is easy - it is distributed in a variety of formats.
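For anyone who wants to try the one-liner safely first, here is a rough sketch that runs it against a scratch repo. It assumes git-filter-repo is installed and on your PATH, and simply skips the rewrite otherwise. The directory, file names and identity are invented for the demo, and --force is passed only because the scratch repo is not a fresh clone, which filter-repo otherwise insists on.

```shell
#!/bin/sh
# Sketch: try git filter-repo on a scratch repo. All names are hypothetical.
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir node_modules
echo "console.log('app')" > app.js
echo "module.exports = {}" > node_modules/dep.js
git add .
git commit -q -m "Add app and node_modules"
git rm -r -q node_modules
git commit -q -m "Remove node_modules"

if command -v git-filter-repo >/dev/null 2>&1; then
    # Keep every path except node_modules
    git filter-repo --path node_modules --invert-paths --force >/dev/null 2>&1
    # Count the commits that still touch node_modules on any ref
    remaining=$(git log --all --oneline -- node_modules | wc -l | tr -d ' ')
    result="rewritten, commits still touching node_modules: $remaining"
else
    remaining=skipped
    result="git-filter-repo not found on PATH, nothing rewritten"
fi
echo "$result"
```

On a real project the usual advice is to run the tool in a fresh clone instead of passing --force, so that a mistake costs nothing more than cloning again.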
Solution 4 - Git
In addition to the popular answer above I would like to add a few notes for Windows systems.
- The command git filter-branch --tree-filter 'rm -rf node_modules' --prune-empty HEAD works perfectly without any modification! Therefore, you do not need to use Remove-Item, del or anything else instead of rm -rf.
- If you need to specify a path to a file or directory, use forward slashes like ./path/to/node_modules
Solution 5 - Git
The best and most accurate method I found was to download the bfg.jar file: https://rtyley.github.io/bfg-repo-cleaner/
Then run the commands:
git clone --bare https://project/repository project-repository
cd project-repository
java -jar bfg.jar --delete-folders DIRECTORY_NAME
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push --mirror https://project/new-repository
If you want to delete files then use the delete-files option instead:
java -jar bfg.jar --delete-files "*.pyc"
Solution 6 - Git
Complete copy-and-paste recipe: the copy-paste solution plus the commands suggested in its comments, after testing them:
git filter-branch --tree-filter 'rm -rf node_modules' --prune-empty HEAD
echo node_modules/ >> .gitignore
git add .gitignore
git commit -m 'Removing node_modules from git history'
git gc
git push origin master --force
After this, you can remove the line "node_modules/" from .gitignore
Solution 7 - Git
For Windows users, please note to use " instead of '. I also added -f to force the command if another backup is already there.
git filter-branch -f --tree-filter "rm -rf FOLDERNAME" --prune-empty HEAD
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
echo FOLDERNAME/ >> .gitignore
git add .gitignore
git commit -m "Removing FOLDERNAME from git history"
git gc
git push origin master --force
Solution 8 - Git
I removed the bin and obj folders from old C# projects using git on Windows. Be careful with
git filter-branch --tree-filter "rm -rf bin" --prune-empty HEAD
It destroys the integrity of the git installation by deleting the usr/bin folder in the git install folder.
Solution 9 - Git
For copypasters (from here):
git filter-repo --invert-paths --path PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA
echo "YOUR-FILE-WITH-SENSITIVE-DATA" >> .gitignore
git add .gitignore
git commit -m "Add YOUR-FILE-WITH-SENSITIVE-DATA to .gitignore"
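# git filter-repo removes the origin remote, so re-add it first if the push fails:
# git remote add origin YOUR-REMOTE-URL
git push origin --force --all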
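Whichever rewrite you use, the force push is the step that makes new clones smaller. The sketch below simulates it with a local bare repository standing in for GitHub. The directory layout, file names and identity are assumptions for the demo, and filter-branch is used here only because it ships with git.

```shell
#!/bin/sh
# Sketch: rewrite locally, force push, then confirm that a fresh clone no longer
# receives the removed file. A local bare repo plays the role of the remote.
export FILTER_BRANCH_SQUELCH_WARNING=1

root=$(mktemp -d)
git init -q --bare "$root/remote.git"
git init -q "$root/work"
cd "$root/work" || exit 1
git config user.email "demo@example.com"
git config user.name "Demo"
git remote add origin "$root/remote.git"

mkdir node_modules
echo "console.log('app')" > app.js
echo "module.exports = {}" > node_modules/dep.js
git add .
git commit -q -m "Add app and node_modules"

branch=$(git symbolic-ref --short HEAD)          # master or main, depending on config
blob=$(git rev-parse HEAD:node_modules/dep.js)   # object id of the unwanted file
git push -q origin "$branch"

# Rewrite the local history, then overwrite the remote branch
git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch node_modules/' \
    --prune-empty HEAD >/dev/null 2>&1
git push -q --force origin "$branch"

# A fresh clone over file:// transfers only objects reachable from the refs
git clone -q "file://$root/remote.git" "$root/fresh"
cd "$root/fresh" || exit 1
if git cat-file -e "$blob" 2>/dev/null; then in_fresh_clone=yes; else in_fresh_clone=no; fi

# The remote itself may still store the old object until it runs its own gc
if git -C "$root/remote.git" cat-file -e "$blob" 2>/dev/null; then in_remote=yes; else in_remote=no; fi
echo "old blob in fresh clone: $in_fresh_clone, still stored on remote: $in_remote"
```

A hosted service behaves along the same lines as far as I can tell: new clones only receive what the rewritten branches reference, while the server may keep the unreachable objects around until it cleans up on its own schedule.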