What's the best way to become familiar with a large codebase?

Legacy Problem Overview

Joining an existing team with a large codebase already in place can be daunting. What's the best approach;

Broad; try to get a general overview of how everything links together, from the code
Narrow; focus on small sections of code at a time, understanding how they work fully
Pick a feature to develop and learn as you go along
Try to gain insight from class diagrams and uml, if available (and up to date)
Something else entirely?

I'm working on what is currently an approx 20k line C++ app & library (Edit: small in the grand scheme of things!). In industry I imagine you'd get an introduction by an experienced programmer. However if this is not the case, what can you do to start adding value as quickly as possible?

--
Summary of answers:

Step through code in debug mode to see how it works
Pair up with someone more familiar with the code base than you, taking turns to be the person coding and the person watching/discussing. Rotate partners amongst team members so knowledge gets spread around.
Write unit tests. Start with an assertion of how you think code will work. If it turns out as you expected, you've probably understood the code. If not, you've got a puzzle to solve and or an enquiry to make. (Thanks Donal, this is a great answer)
Go through existing unit tests for functional code, in a similar fashion to above
Read UML, Doxygen generated class diagrams and other documentation to get a broad feel of the code.
Make small edits or bug fixes, then gradually build up
Keep notes, and don't jump in and start developing; it's more valuable to spend time understanding than to generate messy or inappropriate code.

this post is a partial duplicate of the-best-way-to-familiarize-yourself-with-an-inherited-codebase

Legacy Solutions

Solution 1 - Legacy

Start with some small task if possible, debug the code around your problem. Stepping through code in debug mode is the easiest way to learn how something works.

Solution 2 - Legacy

Another option is to write tests for the features you're interested in. Setting up the test harness is a good way of establishing what dependencies the system has and where its state resides. Each test starts with an assertion about the way you think the system should work. If it turns out to work that way, you've achieved something and you've got some working sample code to reproduce it. If it doesn't work that way, you've got a puzzle to solve and a line of enquiry to follow.

Solution 3 - Legacy

One thing that I usually suggest to people that has not yet been mentioned is that it is important to become a competent user of the existing code base before you can be a developer. When new developers come into our large software project, I suggest that they spend time becoming expert users before diving in trying to work on the code.

Maybe that's obvious, but I have seen a lot of people try to jump into the code too quickly because they are eager to start making progress.

Solution 4 - Legacy

This is quite dependent on what sort of learner and what sort of programmer you are, but:

Broad first - you need an idea of scope and size. This might include skimming docs/uml if they're good. If it's a long term project and you're going to need a full understanding of everything, I might actually read the docs properly. Again, if they're good.
Narrow - pick something manageable and try to understand it. Get a "taste" for the code.
Pick a feature - possibly a different one to the one you just looked at if you're feeling confident, and start making some small changes.
Iterate - assess how well things have gone and see if you could benefit from repeating an early step in more depth.

Solution 5 - Legacy

Pairing with strict rotation.

If possible, while going through the documentation/codebase, try to employ pairing with strict rotation. Meaning, two of you sit together for a fixed period of time (say, a 2 hour session), then you switch pairs, one person will continue working on that task while the other moves to another task with another partner.

In pairs you'll both pick up a piece of knowledge, which can then be fed to other members of the team when the rotation occurs. What's good about this also, is that when a new pair is brought together, the one who worked on the task (in this case, investigating the code) can then summarise and explain the concepts in a more easily understood way. As time progresses everyone should be at a similar level of understanding, and hopefully avoid the "Oh, only John knows that bit of the code" syndrome.

From what I can tell about your scenario, you have a good number for this (3 pairs), however, if you're distributed, or not working to the same timescale, it's unlikely to be possible.

Solution 6 - Legacy

I would suggest running Doxygen on it to get an up-to-date class diagram, then going broad-in for a while. This gives you a quickie big picture that you can use as you get up close and dirty with the code.

Solution 7 - Legacy

I agree that it depends entirely on what type of learner you are. Having said that, I've been at two companies which had very large code-bases to begin with. Typically, I work like this:

If possible, before looking at any of the functional code, I go through unit tests that are already written. These can generally help out quite a lot. If they aren't available, then I do the following.

First, I largely ignore implementation and look only at header files, or just the class interfaces. I try to get an idea of what the purpose of each class is. Second, I go one level deep into the implementation starting with what seems to be the area of most importance. This is hard to gauge, so occasionally I just start at the top and work my way down in the file list. I call this breadth-first learning. After this initial step, I generally go depth-wise through the rest of the code. The initial breadth-first look helps to solidify/fix any ideas I got from the interface level, and then the depth-wise look shows me the patterns that have been used to implement the system, as well as the different design ideas. By depth-first, I mean you basically step through the program using the debugger, stepping into each function to see how it works, and so on. This obviously isn't possible with really large systems, but 20k LOC is not that many. :)

Solution 8 - Legacy

Work with another programmer who is more familiar with the system to develop a new feature or to fix a bug. This is the method that I've seen work out the best.

Solution 9 - Legacy

I think you need to tie this to a particular task. When you have time on your hands, go for whichever approach you are in the mood for.

When you have something that needs to get done, give yourself a narrow focus and get it done.

Solution 10 - Legacy

First, if you have team members available who have experience with the code you should arrange for them to do an overview of the code with you. Each team member should provide you with information on their area of expertise. It is usually valuable to get multiple people explaining things, because some will be better at explaining than others and some will have a better understanding than others.

Then, you need to start reading the code for a while without any pressure (a couple of days or a week if your boss will provide that). It often helps to compile/build the project yourself and be able to run the project in debug mode so you can step through the code. Then, start getting your feet wet, fixing small bugs and making small enhancements. You will hopefully soon be ready for a medium-sized project, and later, a big project. Continue to lean on your team-mates as you go - often you can find one in particular who is willing to mentor you.

Don't be too hard on yourself if you struggle - that's normal. It can take a long time, maybe years, to understand a large code base. Actually, it's often the case that even after years there are still some parts of the code that are still a bit scary and opaque. When you get downtime between projects you can dig in to those areas and you'll often find that after a few tries you can figure even those parts out.

Good luck!

Solution 11 - Legacy

Get the team to put you on bug fixing for two weeks (if you have two weeks). They'll be happy to get someone to take responsibility for that, and by the end of the period you will have spent so much time problem-solving with the library that you'll probably know it pretty well.

Solution 12 - Legacy

If it has unit tests (I'm betting it doesn't). Start small and make sure the unit tests don't fail. If you stare at the entire codebase at once your eyes will glaze over and you will feel overwhelmed.

If there are no unit tests, you need to focus on the feature that you want. Run the app and look at the results of things that your feature should affect. Then start looking through the code trying to figure out how the app creates the things you want to change. Finally change it and check that the results come out the way you want.

You mentioned it is an app and a library. First change the app and stick to using the library as a user. Then after you learn the library it will be easier to change.

From a top down approach, the app probably has a main loop or a main gui that controls all the action. It is worth understanding the main control flow of the application. It is worth reading the code to give yourself a broad overview of the main flow of the app. If it is a GUI app, creating a paper that shows which screens there are and how to get from one screen to another. If it is a command line app, how the processing is done.

Even in companies it is not unusual to have this approach. Often no one fully understands how an application works. And people don't have time to show you around. They prefer specific questions about specific things so you have to dig in and experiment on your own. Then once you get your specific question you can try to isolate the source of knowledge for that piece of the application and ask it.

Solution 13 - Legacy

You may want to consider looking at source code reverse engineering tools. There are two tools that I know of:

SWAG Kit (Linux only) link
Bauhaus academic commercial

Both tools offer similar feature sets that include static analysis that produces graphs of the relations between modules in the software.

This mostly consists of call graphs and type/class decencies. Viewing this information should give you a good picture of how the parts of the code relate to one another. Using this information, you can dig into the actual source for the parts that you are most interested in and that you need to understand/modify first.

Solution 14 - Legacy

Start by understanding the 'problem domain' (is it a payroll system? inventory? real time control or whatever). If you don't understand the jargon the users use, you'll never understand the code.

Then look at the object model; there might already be a diagram or you might have to reverse engineer one (either manually or using a tool as suggested by Doug). At this stage you could also investigate the database (if any), if should follow the object model but it may not, and that's important to know.

Have a look at the change history or bug database, if there's an area that comes up a lot, look into that bit first. This doesn't mean that it's badly written, but that it's the bit everyone uses.

Lastly, keep some notes (I prefer a wiki).

The existing guys can use it to sanity check your assumptions and help you out.
You will need to refer back to it later.
The next new guy on the team will really thank you.

Solution 15 - Legacy

I had a similar situation. I'd say you go like this:

If its a database driven application, start from the database and try to make sense of each table, its fields and then its relation to the other tables.
Once fine with the underlying store, move up to the ORM layer. Those table must have some kind of representation in code.
Once done with that then move on to how and where from these objects are coming from. Interface? what interface? Any validations? What preprocessing takes place on them before they go to the datastore?

This would familiarize you better with the system. Remember that trying to write or understand unit tests is only possible when you know very well what is being tested and why it needs to be tested in only that way.

And in case of a large application that is not driven towards databases, I'd recommend an other approach:

What the main goal of the system?
What are the major components of the system then to solve this problem?
What interactions each of the component has among them? Make a graph that depicts component dependencies. Ask someone already working on it. These componentns must be exchanging something among each other so try to figure out those as well (like IO might be returning File object back to GUI and like)
Once comfortable to this, dive into component that is least dependent among others. Now study how that component is further divided into classes and how they interact wtih each other. This way you've got a hang of a single component in total
Move to the next least dependent component
To the very end, move to the core component that typically would have dependencies on many of the other components which you've already tackled
While looking at the core component, you might be referring back to the components you examined earlier, so dont worry keep working hard!

For the first strategy: Take the example of this stackoverflow site for instance. Examine the datastore, what is being stored, how being stored, what representations those items have in the code, how an where those are presented on the UI. Where from do they come and what processing takes place on them once they're going back to the datastore.

For the second one Take the example of a word processor for example. What components are there? IO, UI, Page and like. How these are interacting with each other? Move along as you learn further.

Be relaxed. Written code is someone's mindset, froze logic and thinking style and it would take time to read that mind.

Solution 16 - Legacy

I find that just jumping in to code can be a a bit overwhelming. Try to read as much documentation on the design as possible. This will hopefully explain the purpose and structure of each component. Its best if an existing developer can take you through it but that isn't always possible.

Once you are comfortable with the high level structure of the code, try to fix a bug or two. this will help you get to grips with the actual code.

Solution 17 - Legacy

I like all the answers that say you should use a tool like Doxygen to get a class diagram, and first try to understand the big picture. I totally agree with this.

That said, this largely depends on how well factored the code is to begin with. If its a gigantic mess, it's going to be hard to learn. If its clean, and organized properly, it shouldn't be that bad.

Solution 18 - Legacy

See https://stackoverflow.com/questions/3224752/how-to-navigate-around-and-get-familiar-with-large-c-or-c-code-base/3224863#3224863">this answer on how to use test coverage tools to locate the code for a feature of interest, without knowing anything about where that feature is, or how it is spread across many modules.

Solution 19 - Legacy

(shameless marketing ahead)

You should check out nWire. It is an Eclipse plugin for navigating and visualizing large codebases. Many of our customers use it to break-in new developers by printing out visualizations of the major flows.

Content Type	Original Author	Original Content on Stackoverflow
Question	RJFalconer	View Question on Stackoverflow
Solution 1 - Legacy	Brian R. Bondy	View Answer on Stackoverflow
Solution 2 - Legacy	user8599	View Answer on Stackoverflow
Solution 3 - Legacy	Russell Bryant	View Answer on Stackoverflow
Solution 4 - Legacy	Draemon	View Answer on Stackoverflow
Solution 5 - Legacy	Grundlefleck	View Answer on Stackoverflow
Solution 6 - Legacy	Paul Nathan	View Answer on Stackoverflow
Solution 7 - Legacy	user29227	View Answer on Stackoverflow
Solution 8 - Legacy	Avdi	View Answer on Stackoverflow
Solution 9 - Legacy	KevDog	View Answer on Stackoverflow
Solution 10 - Legacy	unintentionally left blank	View Answer on Stackoverflow
Solution 11 - Legacy	Leigh Caldwell	View Answer on Stackoverflow
Solution 12 - Legacy	Cervo	View Answer on Stackoverflow
Solution 13 - Legacy	Doug	View Answer on Stackoverflow
Solution 14 - Legacy	Robin Bennett	View Answer on Stackoverflow
Solution 15 - Legacy	Mohsin Shafique	View Answer on Stackoverflow
Solution 16 - Legacy	Dave Turvey	View Answer on Stackoverflow
Solution 17 - Legacy	dicroce	View Answer on Stackoverflow
Solution 18 - Legacy	Ira Baxter	View Answer on Stackoverflow
Solution 19 - Legacy	zvikico	View Answer on Stackoverflow