Why would I want to stage before committing in Git?

Git Problem Overview


I'm new to version control and I understand that "committing" is essentially creating a backup while updating the new 'current' version of what you're working on.

What I don't understand is what staging is for from a practical perspective. Is staging something that exists in name only or does it serve a purpose? When you commit, it's going to commit everything anyway, right?

Edit: I think I may be confusing the terminology. Is a 'staged' file the same thing as a 'tracked' file?

Git Solutions


Solution 1 - Git

When you commit it's only going to commit the changes in the index (the "staged" files). There are many uses for this, but the most obvious is to break up your working changes into smaller, self-contained pieces. Perhaps you fixed a bug while you were implementing a feature. You can git add just that file (or git add -p to add just part of a file!) and then commit that bugfix before committing everything else. If you are using git commit -a then you are just forcing an add of everything right before the commit. Don't use -a if you want to take advantage of staging files.

You can also treat the staged files as an intermediate working copy by passing the --cached option to many commands. For example, git diff --cached will show you how the stage differs from HEAD so you can see what you're about to commit without mixing in your other working changes.
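As a rough sketch of the bugfix-first workflow described above: the script below builds a throwaway repository (the temp directory, identity, and file names feature.txt and bugfix.txt are all made up for illustration), edits two files, stages only one of them, and inspects the result with git diff --cached before committing.

```shell
# Scratch repository; every name here is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'feature v1\n' > feature.txt
printf 'buggy\n' > bugfix.txt
git add feature.txt bugfix.txt
git commit -q -m "Initial commit"

# Edit both files, but stage only the bug fix.
printf 'feature v2 (unfinished)\n' > feature.txt
printf 'fixed now\n' > bugfix.txt
git add bugfix.txt

# Index vs HEAD: what the next commit will contain.
staged=$(git diff --cached --name-only)
# Work tree vs index: what is being left out for now.
unstaged=$(git diff --name-only)
echo "staged:   $staged"
echo "unstaged: $unstaged"

git commit -q -m "Fix bug"

# List the files recorded in the commit that was just made.
committed=$(git diff-tree --no-commit-id --name-only -r HEAD)
echo "committed: $committed"
```

The unfinished edit to feature.txt stays in the work tree, ready for a later commit of its own. To split changes within a single file, git add -p does the same thing interactively, hunk by hunk.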

Solution 2 - Git

  • The staging area gives you control over making commits smaller. Make one logical change in the code, add the changed files to the staging area, and then either commit the changes or, if they turn out to be bad, check out the previous commit's version of the files. This gives you the flexibility to split a task into smaller tasks and commit smaller changes. With the staging area it is easier to focus on small tasks.
  • It also lets you take a break without having to remember how much work you had done before it. Suppose you need to change three files to make one logical change, and you have changed the first file but need a long break before you start on the others. At this point you cannot commit, but you want to track which files you are done with so that, when you come back, you do not need to remember how much work has been done. So add the file to the staging area, and it will save your work. When you come back, just run git diff --staged to check which files you changed and where, then start making the other changes.
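The "take a break" scenario might look something like this. It is only a sketch in a scratch repository, with made-up file names (file1.txt to file3.txt) and contents. It also shows that the staged snapshot is unaffected by further edits to the same file, using git show :path to print the index copy.

```shell
# Scratch repository; every name here is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'one\n' > file1.txt
printf 'two\n' > file2.txt
printf 'three\n' > file3.txt
git add file1.txt file2.txt file3.txt
git commit -q -m "Initial commit"

# Finish the first of three files and stage it before the break.
printf 'one, updated\n' > file1.txt
git add file1.txt

# After the break: which files are already done?
done_files=$(git diff --staged --name-only)
echo "already staged: $done_files"

# Keep experimenting in the same file; the staged snapshot does not change.
printf 'one, half-finished experiment\n' > file1.txt
staged_content=$(git show :file1.txt)
echo "staged copy of file1.txt: $staged_content"
```

Keep in mind that the staged snapshot lives only in the local repository's index. It is not a commit, so it is not part of the history and is not pushed anywhere.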

Solution 3 - Git

One practical purpose of staging is logical separation of file commits.

Because staging lets you keep editing files in the working directory and commit in parts when you think things are ready, you can stage and commit logically unrelated edits separately.

Suppose you have 4 files: fileA.html, fileB.html, fileC.html and fileD.html. You make changes to all 4 files and are ready to commit, but the changes in fileA.html and fileB.html are logically related (for example, the same new feature implemented in both files), while the changes in fileC.html and fileD.html are separate and logically unrelated to the previous two files. You can first stage fileA.html and fileB.html and commit those.

git add fileA.html
git add fileB.html
git commit -m "Implemented new feature XYZ"

Then, in the next step, you stage and commit the changes to the remaining two files.

git add fileC.html
git add fileD.html
git commit -m "Implemented another feature EFG"

Solution 4 - Git

To expand on Ben Jackson's answer, which is fine, let's look at the original question closely. (See his answer for "why bother" type questions; this is more about what is going on.)

> I'm new to version control and I understand that "committing" is essentially creating a backup while updating the new 'current' version of what you're working on.

This isn't quite right. Backups and version control are certainly related (exactly how strongly depends on some things that are to some extent matters of opinion), but there are certainly some differences, if only in intent: Backups are typically designed for disaster recovery (machine fails, fire destroys entire building including all storage media, etc.). Version control is typically designed for finer-grained interactions and offers features that backups don't. Backups are typically stored for some time, then jettisoned as "too old": a fresher backup is all that matters. Version control normally saves every committed version forever.

> What I don't understand is what staging is for from a practical perspective. Is staging something that exists in name only or does it serve a purpose? When you commit, it's going to commit everything anyway, right?

Yes and no. Git's design here is somewhat peculiar. There exist version control systems that don't require a separate staging step. For instance, Mercurial, which is otherwise a lot like Git in terms of usage, doesn't require a separate hg add step, beyond the very first one that introduces an all-new file. With Mercurial, you use the hg command that selects some commit, then you do your work, then you run hg commit, and you're done. With Git, you use git checkout,[1] then you do your work, then you run git add, and then git commit. Why the extra git add step?

The secret here is what Git calls, variously, the index, or the staging area, or sometimes—rarely these days—the cache. These are all names for the same thing.

> Edit: I think I may be confusing the terminology. Is a 'staged' file the same thing as a 'tracked' file?

No, but these are related. A tracked file is one that exists in Git's index. To properly understand the index, it's good to start with understanding commits.


[1] Since Git version 2.23, you can use git switch instead of git checkout. For this particular case, these two commands do exactly the same thing. The new command exists because git checkout got over-stuffed with too many things; they got split out into two separate commands, git switch and git restore, to make it easier and safer to use Git.


Commits

In Git, a commit saves a full snapshot of every file that Git knows about. (Which files does Git know about? We'll see that in the next section.) These snapshots are stored in a special, read-only, Git-only, compressed and de-duplicated form, that in general only Git itself can read. (There's more stuff in each commit than just this snapshot, but that's all we will cover here.)

The de-duplication helps with space: we normally only change a few files, then make a new commit. So most of the files in a commit are mostly the same as the files in the previous commit. By simply re-using those files directly, Git saves lots of space: if we only touched one file, the new commit only takes space for one new copy. Even then it's compressed—sometimes very compressed, though this actually happens later—so that a .git directory can actually be smaller than the files it contains, once they're expanded out to normal everyday files. The de-duplication is safe because the committed files are frozen for all time. Nobody can go change one, so it's safe for commits to depend on each others' copies.

Because the stored files are in this special, frozen-for-all-time, Git-only format, though, Git has to expand out each file into an ordinary everyday copy. This ordinary copy isn't Git's copy: it is your copy, to do with as you will. Git will just write to these when you tell it to do so, so that you have your copies to work with. These usable copies are in your working tree or work-tree.

What this means is that when you check out some particular commit, there are automatically two copies of each file:

  • Git has a frozen-for-all-time, Git-ified copy in the current commit. You can't change this copy (though you can of course select a different commit, or make a new commit).

  • You have, in your work-tree, a normal-format copy. You can do anything you want to this, using any of the commands on your computer.

Other version control systems (including Mercurial as mentioned above) stop here, with these two copies. You just modify your work-tree copy, then commit. Git ... doesn't.

The index

In between these two copies, Git stores a third copy[2] of every file. This third copy is in the frozen format, but unlike the frozen copy in the commit, you can change it. To change it, you use git add.

The git add command means make the index copy of the file match the work-tree copy. That is, you are telling Git: Replace the frozen-format, de-duplicated copy that's in the index now, by compressing my updated work-tree copy, de-duplicating it, and getting it ready to be frozen into a new commit. If you don't use git add, the index still holds the frozen-format copy from the current commit.

When you run git commit, Git packages up whatever is in the index right then to use as the new snapshot. Since it's already in the frozen format, and pre-de-duplicated, Git does not have to do a lot of extra work.

This also explains what untracked files are all about. An untracked file is a file that is in your work-tree but isn't in Git's index right now. It doesn't matter how the file wound up in this state. Maybe you copied it from some other place on your computer, into your work-tree. Maybe you created it fresh here. Maybe there was a copy in Git's index, but you removed that copy with git rm --cached. One way or another, there is a copy here in your work-tree, but there isn't a copy in Git's index. If you make a new commit now, that file won't be in the new commit.
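To make the tracked/untracked distinction concrete, here is a small sketch in a scratch repository (the file names tracked.txt and untracked.txt are invented). It uses git ls-files to list what is in the index and git ls-files --others --exclude-standard to list work-tree files that are not, then shows git rm --cached turning a tracked file into an untracked one.

```shell
# Scratch repository; every name here is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'tracked\n' > tracked.txt
git add tracked.txt
git commit -q -m "Add tracked.txt"
printf 'new\n' > untracked.txt

# Files in the index are tracked; work-tree files missing from it are untracked.
tracked=$(git ls-files)
untracked=$(git ls-files --others --exclude-standard)
echo "tracked:   $tracked"
echo "untracked: $untracked"

# Remove only the index copy; the work-tree copy stays on disk.
git rm -q --cached tracked.txt
tracked_after=$(git ls-files)
# xargs just joins the names onto one line for printing.
untracked_after=$(git ls-files --others --exclude-standard | xargs)
echo "tracked after rm --cached:   $tracked_after"
echo "untracked after rm --cached: $untracked_after"
```

After the git rm --cached, tracked.txt still exists in the work-tree, but a commit made at that point would no longer contain it.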

Note that git checkout initially fills in Git's index from the commit you check out. So the index starts out matching the commit. Git also fills in your work-tree from this same source. So, initially, all three match. When you change files in your work-tree and git add them, well, now the index and your work-tree match. Then you run git commit and Git makes a new commit from the index, and now all three match again.

Because Git makes new commits from the index, we can put things this way: Git's index holds the next commit you plan to make. This ignores the expanded role that Git's index takes on during a conflicted merge, but we'd like to ignore that for now anyway. :-)
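One way to watch the three copies drift apart and come back together is to compare blob IDs. This is a sketch, not something from the answer itself: the helper name ids and the file name file.txt are made up, and it assumes a scratch repository with no line-ending or filter conversions configured. git rev-parse HEAD:file.txt gives the ID of the committed copy, git rev-parse :file.txt the ID of the index copy, and git hash-object file.txt the ID the work-tree copy would get.

```shell
# Scratch repository; every name here is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'version 1\n' > file.txt
git add file.txt
git commit -q -m "First version"

# Print three blob IDs on one line: commit copy, index copy, work-tree copy.
ids() {
    printf '%s %s %s\n' \
        "$(git rev-parse HEAD:file.txt)" \
        "$(git rev-parse :file.txt)" \
        "$(git hash-object file.txt)"
}

after_checkout=$(ids)      # all three match
printf 'version two\n' > file.txt
after_edit=$(ids)          # only the work-tree ID differs
git add file.txt
after_add=$(ids)           # index now matches the work-tree
git commit -q -m "Second version"
after_commit=$(ids)        # all three match again

printf 'start:        %s\n' "$after_checkout"
printf 'after edit:   %s\n' "$after_edit"
printf 'after add:    %s\n' "$after_add"
printf 'after commit: %s\n' "$after_commit"
```

Reading the four lines top to bottom follows the cycle described above: edit changes the third column, git add copies it into the second, and git commit copies the second into the first.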

That's all there is to it, but it's still pretty complicated! It's particularly tricky because there's no easy way to see exactly what is in Git's index.[3] But there is a Git command that tells you what's going on, in a way that's pretty useful, and that command is git status.


[2] Technically, this isn't actually a copy at all. Instead, it's a reference to the Git-ified file, pre-de-duplicated and everything. There's more stuff in here as well, such as the mode, file name, a staging number, and some cache data to make Git go fast. But unless you get into working with some of Git's low-level commands (git ls-files --stage and git update-index in particular), you can just think of it as a copy.

[3] The git ls-files --stage command will show you the names and staging numbers of every file in Git's index, but usually this isn't very useful anyway.


git status

The git status command actually works by running two separate git diff commands for you (and also doing some other useful stuff, such as telling you which branch you're on).

The first git diff compares the current commit—which, remember, is frozen for all time—to whatever is in Git's index. For files that are the same, Git will say nothing at all. For files that are different, Git will tell you that this file is staged for commit. This includes all-new files—if the commit doesn't have sub.py in it, but the index does have sub.py in it, then this file is added—and any removed files, that were (and are) in the commit but aren't in the index any more (git rm, perhaps).

The second git diff compares all the files in Git's index to the files in your work-tree. For files that are the same, Git says nothing at all. For files that are different, Git will tell you that this file is not staged for commit. Unlike the first diff, this particular list doesn't include files that are all-new: if a file named untracked exists in your work-tree, but not in Git's index, Git just adds it to the list of untracked files.[4]

At the end, having accumulated these untracked files in a list, git status will announce those files' names too, but there's a special exception: if a file's name is listed in a .gitignore file, that suppresses this last listing. Note that listing a tracked file (one that's in Git's index) in a .gitignore has no effect here: the file is in the index, so it gets compared, and gets committed, even if it's listed in .gitignore. The ignore file only suppresses the "untracked file" complaints.[5]
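The two comparisons and the untracked-file list can be reproduced by hand. The sketch below is an approximation of what git status reports, not a description of its internals. It uses a scratch repository with invented file names and an assumed *.log ignore rule, and puts one file in each state: staged, modified but not staged, untracked, and ignored.

```shell
# Scratch repository; every name here is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf '*.log\n' > .gitignore
printf 'a\n' > staged.txt
printf 'b\n' > unstaged.txt
git add .gitignore staged.txt unstaged.txt
git commit -q -m "Initial commit"

printf 'a, changed\n' > staged.txt
git add staged.txt                      # commit and index now differ
printf 'b, changed\n' > unstaged.txt    # index and work-tree now differ
printf 'c\n' > new.txt                  # untracked
printf 'noise\n' > debug.log            # untracked, but matched by .gitignore

first_diff=$(git diff --cached --name-only)    # current commit vs index
second_diff=$(git diff --name-only)            # index vs work-tree
untracked=$(git ls-files --others --exclude-standard)

echo "staged for commit:     $first_diff"
echo "not staged for commit: $second_diff"
echo "untracked:             $untracked"

# The same three files should show up here, and debug.log should not.
git status --short
```

Note that debug.log appears in none of the lists: it is not in the index, so neither diff sees it, and the ignore rule suppresses it from the untracked listing.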


[4] When using the short version of git status (git status -s), the untracked files aren't as separated-out, but the principle is the same. Accumulating the files like this also lets git status summarize a bunch of untracked files' names by just printing a directory name, sometimes. To get the full list, use git status -uall or git status -u.

[5] Listing a file also makes en-masse "add many files" operations like git add . or git add * skip over the untracked file. This part gets a little more complicated, since you can use git add --force to add a file that would normally be skipped. There are some other normally-minor special cases, all of which add up to this: the file .gitignore might be more properly called .git-do-not-complain-about-these-untracked-files-and-do-not-auto-add-them or something equally unwieldy. But that's too ridiculous, so .gitignore it is.


git add -u, git commit -a, etc

There are several handy shortcuts to know about here:

  • git add . will add all updated files in the current directory and any sub-directory. This respects .gitignore, so if a file that is currently untracked is not complained-about by git status, it won't be auto-added.

  • git add -u will auto-add all updated files anywhere in your work-tree.[6] This affects only tracked files. Note that if you've removed the work-tree copy, this will remove the index copy too (git add does this as part of its make the index match the work-tree thing).

  • git add -A is like running git add . from the top level of your work-tree (but see footnote 6).

Besides these, you can run git commit -a, which is roughly equivalent[7] to running git add -u and then git commit. That is, this gets you the same behavior that is convenient in Mercurial.

I generally advise against the git commit -a pattern: I find that it's better to use git status often, look closely at the output, and if the status is not what you expected, figure out why that's the case. Using git commit -a, it's too easy to accidentally modify a file and commit a change you didn't intend to commit. But this is mostly a matter of taste / opinion.
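The difference between the shortcuts above is easiest to see with one modified file, one deleted file, and one brand-new file. This sketch assumes Git 2.0 or later and a scratch repository; the file names are invented. After git add -u only the tracked files (the modification and the deletion) are staged; git add -A picks up the new file as well.

```shell
# Scratch repository; every name here is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'keep\n' > tracked.txt
printf 'remove me\n' > gone.txt
git add tracked.txt gone.txt
git commit -q -m "Initial commit"

printf 'keep, edited\n' > tracked.txt   # modify a tracked file
rm gone.txt                             # delete a tracked file from the work-tree
printf 'brand new\n' > new.txt          # create an untracked file

# -u: update the index for tracked files only (edits and deletions).
git add -u
after_u=$(git diff --cached --name-only | xargs)
echo "staged after git add -u: $after_u"

# -A: also stage files that were untracked (still honoring .gitignore).
git add -A
after_A=$(git diff --cached --name-only | xargs)
echo "staged after git add -A: $after_A"
```

Since git commit -a is described above as roughly git add -u followed by git commit, it would likewise leave new.txt out of the commit.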


[6] If your Git version predates Git 2.0, be careful here: git add -u only works on the current directory and sub-directories, so you must climb to the top level of your work-tree first. The git add -A option has a similar issue.

[7] I say roughly equivalent because git commit -a actually works by making an extra index, and using that other index to do the commit. If the commit works, you get the same effect as doing git add -u && git commit. If the commit doesn't work (if you make Git skip the commit in any of the many ways you can do that), then no files are git add-ed afterward, because Git throws out the temporary extra index and goes back to using the main index.

There are additional complications that come in if you use git commit --only here. In this case, Git creates a third index, and things get very tricky, especially if you use pre-commit hooks. This is another reason to use separate git add operations.

Solution 5 - Git

It is easier to understand the use of the git commands add and commit if you imagine a log file being maintained in your repository on GitHub. A typical project's log file for me may look like:

---------------- Day 1 --------------------
Message: Complete Task A
Index of files changed: File1, File2

Message: Complete Task B
Index of files changed: File2, File3
-------------------------------------------

---------------- Day 2 --------------------
Message: Correct typos
Index of files changed: File3, File1
-------------------------------------------
...
...
...and so on

I usually start my day with a git pull and end it with a git push. So everything inside a day's record corresponds to what occurs between them. During each day, there are one or more logical tasks that I complete, each of which requires changing a few files. The files edited during that task are listed in an index.

Each of these sub-tasks (Task A and Task B here) is an individual commit. The git add command adds files to the 'Index of Files Changed' list. This process is also called staging. The git commit command records/finalizes the changes and the corresponding index list along with a custom message.

Remember that you're still only changing the local copy of your repository and not the one on GitHub. Only when you do a git push do all these recorded changes, along with the index of files for each commit, get logged on the main repository (on GitHub).

As an example, to obtain the Day 2 entry in that imaginary log file, I would have done:

git pull
# Make changes to these files
git add File3 File1
# Verify changes, run tests etc.
git commit -m 'Correct typos'
git push

In a nutshell, git add and git commit let you break down a change to the main repository into systematic logical sub-changes. As other answers and comments have pointed out, there are of course many more uses for them. However, this is one of the most common usages and a driving principle behind Git being a multi-stage revision control system, unlike other popular ones like SVN.

Solution 6 - Git

The staging area helps us craft commits with greater flexibility. By crafting, I mean breaking up the commits into logical units. This is crucial if you want maintainable software. The most obvious way you can achieve this:

You can work on multiple features/bugs in a single working directory and still craft meaningful commits. Having a single working directory which contains all of our active work is also very convenient. (This can be done without a staging area, only as long as the changes don't ever overlap a file. And you also have the added responsibility of manually tracking whether they overlap)

You can find more examples here: Uses of Index

And the best part is, the advantages do not stop with this list of workflows. If a unique workflow does come up, you can be almost sure that the staging area will help you out.

Solution 7 - Git

I see the point of using the staging area to make commits smaller, as mentioned by @Ben Jackson and @Tapashee Tabassum Urmi, and sometimes I use it for that purpose, but I mainly use it to make my commits larger! Here is my point:

Say I want to add a small feature which requires several smaller steps. I don't see any point in having a separate commit for each smaller step and flooding my timeline. However, I want to save each step and be able to go back if necessary.

I simply stage the smaller steps on top of each other, and when I feel the result is worthy of a commit, I commit. This way I keep the unnecessary commits out of the timeline, yet I am still able to undo (checkout) the last step.
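A minimal sketch of that "stage as a checkpoint" idea, in a scratch repository with a made-up file feature.txt: step 1 is staged but not committed, a broken step 2 is thrown away by restoring the work-tree copy from the index, and only the finished result is committed.

```shell
# Scratch repository; every name here is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'base\n' > feature.txt
git add feature.txt
git commit -q -m "Base"

# Step 1 works: checkpoint it in the index, without committing.
printf 'base\nstep 1\n' > feature.txt
git add feature.txt

# Step 2 goes wrong: restore the work-tree copy from the index (back to step 1).
printf 'base\nstep 1\nstep 2 (broken)\n' > feature.txt
git checkout -- feature.txt
after_undo=$(cat feature.txt)
echo "after undo:"
echo "$after_undo"

# Redo step 2 properly, stage it, and make a single commit for the feature.
printf 'base\nstep 1\nstep 2\n' > feature.txt
git add feature.txt
git commit -q -m "Add small feature"
commit_count=$(git rev-list --count HEAD)
echo "commits in history: $commit_count"
```

On Git 2.23 or later, git restore feature.txt does the same job as git checkout -- feature.txt here. The index holds only one version of each file, so this gives a single level of undo: staging step 2 overwrites the step 1 checkpoint.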

I see other ways for doing this (simplifying the git history) which you might use depending on your preference:

  1. git commit --amend (which changes your last commit), which is not something you want for this specific purpose (I see it mostly as doing a bad commit and then fixing it)

  2. git rebase, which is an afterthought and can cause serious problems for you and others who use your repository.

  3. creating a temporary branch, merging it, and then deleting it afterwards (which is also a good option; it requires more steps but gives you more control)

Solution 8 - Git

It's like a checkbox that lets you choose which files to commit.

For example, say I have edited fileA.txt and fileB.txt, but I want to commit only the changes to fileA.txt because I am not finished with fileB.txt yet.

I can simply run git add fileA.txt, commit with git commit -m "changed fileA.txt", and continue working on fileB.txt. After finishing, I can commit fileB.txt just as easily.

Solution 9 - Git

Who wrote this terrible code and didn't maintain comments? Oh my, there's even a bug that won't let the code compile. Even if it does compile, it runs slow.

Done.

$ git diff

someTerribleCode

- for(int i = 0; i < 20; i++)
-   this.assets[i] = clear();
+ foreach(var asset in assets) {
+   asset = clear()
+ }

undocumentedApi

+ //This Api has only one public method and only handles degrees Celsius
+ //It is the onus of the user of the Api to conduct conversions

buggyCode

+ size_t NumberOfElements = sizeof(arr)/sizeof(arr[0]);
+ balance[NumberOfElements - 1];

slowCode

+ i = * ( long * ) &y; // evil floating point bit level hacking
+ i = 0x5f3759df - ( i >> 1 ); // what the ****?

I got carried away. I wanted to make a small change but ended up just doing everything in one go. Lemme commit.

$ git commit -m "fixed some terrible code and added api documentation and fixed some compile time errors. Also introduced fast inverse square root"

Right let me hit enter and commi... Arg wait. I actually worked on four completely different things.

  • Tracing the git log will be a nightmare. And what if I introduced some regression errors that are found only in the future? Finding the "exact commit" of the error introduction will be more difficult.

Well, that's easy. Let me commit them one by one. Let me start with:

$ git commit someTerribleCode.foo -m "cleaned up the iteration loop"
  • No wait. I need to sit back and see what I'm about to commit and need to know the difference between both someTerribleCode and undocumentedApi and HEAD.

  • Also, despite what Carmack and the Id guys claim, the confusion sown in the slowCode update outweighs performance gains. I shouldn't actually commit that.

If only that Finnish/Swedish dude had taken these design criteria into account when giving me this tool. Ah, wait, he did.

$ git add undocumentedApi buggyCode

$ git diff --cached

undocumentedApi

+ //This Api has only one public method and only handles degrees Celsius
+ //It is the onus of the user of the Api to conduct conversions

buggyCode

+ size_t NumberOfElements = sizeof(arr)/sizeof(arr[0]);
+ balance[NumberOfElements - 1];

Looks good. I'll commit those two in one go. They're not related, but one's a set of comments so I won't stress.

$ git commit -m "fixed bug and added documentation"

Next one's a separate commit:

$ git add someTerribleCode
$ git commit -m "refactored for loop"

Lemme check what's left in the working tree:

$ git status
Changes not staged for commit:
modified: slowCode

I can get rid of those changes.

$ git reset --hard

Git could have avoided the staging area entirely and rather designed "selective direct commits", but this would be less intuitive when interfacing with the command line and the mental model of understanding the various commands. When one builds a mental model of the staging area, the commands become more palatable and easier to grasp.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Citizen | View Question on Stackoverflow
Solution 1 - Git | Ben Jackson | View Answer on Stackoverflow
Solution 2 - Git | Tapashee Tabassum Urmi | View Answer on Stackoverflow
Solution 3 - Git | DarthWader | View Answer on Stackoverflow
Solution 4 - Git | torek | View Answer on Stackoverflow
Solution 5 - Git | Cibin Joseph | View Answer on Stackoverflow
Solution 6 - Git | Andrew Nessin | View Answer on Stackoverflow
Solution 7 - Git | Ali80 | View Answer on Stackoverflow
Solution 8 - Git | Ramoun | View Answer on Stackoverflow
Solution 9 - Git | Dean P | View Answer on Stackoverflow