Hardest types of bugs to track?
DebuggingLanguage AgnosticTestingDebugging Problem Overview
What are some of the nastiest, most difficult bugs you have had to track and fix and why?
I am both genuinely curious and knee deep in the process as we speak. So as they say - misery likes company.
Debugging Solutions
Solution 1 - Debugging
[Heisenbugs](http://en.wikipedia.org/wiki/Unusual_software_bug#Heisenbug "Wikipedia"):
> A heisenbug (named after the Heisenberg Uncertainty Principle) is a computer bug that disappears or alters its characteristics when an attempt is made to study it.
Solution 2 - Debugging
Race conditions and deadlocks. I do a lot of multithreaded processes and that is the hardest thing to deal with.
Solution 3 - Debugging
Bugs that happen when compiled in release mode but not in debug mode.
Solution 4 - Debugging
Any bug based on timing conditions. These often come when working with inter-thread communication, an external system, reading from a network, reading from a file, or communicating with any external server or device.
Solution 5 - Debugging
Bugs that are not in your code per se, but rather in a vendor's module on which you depend. Particularly when the vendor is unresponsive and you are forced to hack a work-around. Very frustrating!
Solution 6 - Debugging
We were developing a database to hold words and definitions in another language. It turns out that this language had only recently been added to the Unicode standard and it didn't make it into SQL Server 2005 (though it was added around 2005). This had a very frustrating effect when it came to collation.
Words and definitions went in just fine, I could see everything in Management Studio. But whenever we tried to find the definition for a given word, our queries returned nothing. After a solid 8 hours of debugging, I was at the point of thinking I had lost the ability to write a simple SELECT query.
That is, until I noticed English letters matched other English letters with any amount of foreign letters thrown in. For example, EnglishWord would match E!n@gl##$ish$&Word. (With !@#$%^&* representing foreign letters).
When a collation doesn't know about a certain character, it can't sort them. If it can't sort them, it can't tell whether two string match or not (a surprise for me). So frustrating and a whole day down the drain for a stupid collation setting.
Solution 7 - Debugging
Threading bugs, especially race conditions. When you cannot stop the system (because the bug goes away), things quickly get tough.
Solution 8 - Debugging
The hardest ones I usually run into are ones that don't show up in any log trace. You should never silently eat an exception! The problem is that eating an exception often moves your code into an invalid state, where it fails later in another thread and in a completely unrelated manner.
That said, the hardest one I ever really ran into was a C program in a function call where the calling signature didn't exactly match the called signature (one was a long, the other an int). There were no errors at compile time or link time and most tests passed, but the stack was off by sizeof(int), so the variables after it on the stack would randomly have bad values, but most of the time it would work fine (the values following that bad parameter were generally being passed in as zero).
That was a BITCH to track.
Solution 9 - Debugging
Memory corruption under load due to bad hardware.
Solution 10 - Debugging
- Bugs that happen on one server and not another, and you don't have access to the offending server to debug it.
- Bugs that have to do with threading.
Solution 11 - Debugging
The most frustrating for me have been compiler bugs, where the code is correct but I've hit an undocumented corner case or something where the compiler's wrong. I start with the assumption that I've made a mistake, and then spend days trying to find it.
Edit: The other most frustrating was the time I got the test case set slightly wrong, so my code was correct but the test wasn't. That took days to find.
In general, I guess the worst bugs I've had have been the ones that aren't my fault.
Solution 12 - Debugging
The hardest bugs to track down and fix are those that combine all the difficult cases:
- reported by a third party but you can't reproduce it under your own testing conditions;
- bug occurs rarely and unpredictably (e.g. because it's caused by a race condition);
- bug is on an embedded system and you can't attach a debugger;
- when you try to get logging information out the bug goes away;
- bug is in third-party code such as a library ...
- ... to which you don't have the source code so you have to work with disassembly only;
- and the bug is at the interface between multiple hardware systems (e.g. networking protocol bugs or bus contention bugs).
I was working on a bug with all these features this week. It was necessary to reverse engineer the library to find out what it was up to; then generate hypotheses about which two devices were racing; then make specially-instrumented versions of the program designed to provoke the hypothesized race condition; then once one of the hypotheses was confirmed it was possible to synchronize the timing of events so that the library won the race 100% of the time.
Solution 13 - Debugging
There was a project building a chemical engineering simulator using a beowulf cluster. It so happened that the network cards would not transmit one particular sequence of bytes. If a packet contained that string, the packet would be lost. They solved the problem by replacing the hardware - finding it in the first place was much harder.
Solution 14 - Debugging
One of the hardest bugs I had to find was a memory corruption error that only occurred after the program had been running for hours. Because of the length of time it took to corrupt the data, we assumed hardware and tried two or three other computers first.
The bug would take hours to appear, and when it did appear it was usually only noticed quite a length of time after when the program got so messed up it started misbehaving. Narrowing down in the code base to where the bug was occurring was very difficult because the crashes due to corrupted memory never occurred in the function that corrupted the memory, and it took so damned long for the bug to manifest itself.
The bug turned out to be an off-by-one error in a rarely called piece of code to handle a data line that had something wrong with it (invalid character encoding from memory).
In the end the debugger proved next to useless because the crashes never occurred in the call tree for the offending function. A well sequenced stream of fprintf(stderr, ...) calls in the code and dumping the output to a file was what eventually allowed us to identify what the problem was.
Solution 15 - Debugging
Concurrency bugs are quite hard to track, because reproducing them can be very hard when you do not yet know what the bug is. That's why, every time you see an unexplained stack trace in the logs, you should search for the reason of that exception until you find it. Even if it happens only one time in a million, that does not make it unimportant.
Since you can not rely on the tests to reproduce the bug, you must use deductive reasoning to find out the bug. That in turn requires a deep understanding of how the system works (for example how Java's memory model works and what are possible sources of concurrency bugs).
Here is an example of a concurrency bug in Guice 1.0 which I located just some days ago. You can test your bug finding skills by trying to find out what is the bug causing that exception. The bug is not too hard to find - I found its cause in some 15-30 min (the answer is here).
java.lang.NullPointerException
at com.google.inject.InjectorImpl.injectMembers(InjectorImpl.java:673)
at com.google.inject.InjectorImpl$8.call(InjectorImpl.java:682)
at com.google.inject.InjectorImpl$8.call(InjectorImpl.java:681)
at com.google.inject.InjectorImpl.callInContext(InjectorImpl.java:747)
at com.google.inject.InjectorImpl.injectMembers(InjectorImpl.java:680)
at ...
P.S. Faulty hardware might cause even nastier bugs than concurrency, because it may take a long time before you can confidently conclude that there is no bug in the code. Luckily hardware bugs are rarer than software bugs.
Solution 16 - Debugging
A friend of mine had this bug. He accidentally put a function argument in a C program in square brackets instead of parenthesis like this: foo[5]
instead of foo(5)
. The compiler was perfectly happy, because the function name is a pointer, and there is nothing illegal about indexing off a pointer.
Solution 17 - Debugging
One of the most frustrating for me was when the algorithm was wrong in the software spec.
Solution 18 - Debugging
Probably not the hardest, but they are extremely common and not trivial:
- Bugs concerning mutable state. It is hard to maintain invariants in a data structure if it has many mutable fields. And you have operation order dependency - swap two lines and something bad occurs. One of my recent hard-to-find bugs was when I found that previous developer of the system I maintained used mutable data for hashtable keys - in some rare conditions it lead to infinite loops.
- Order of initialization bugs. Can be obvious when found, but not so when coding.
Solution 19 - Debugging
The hardest one ever was actually a bug I was helping a friend with. He was writing C in MS Visual Studio 2005, and forgot to include time.h. He further called time without the required argument, usually NULL. This implicitly declared time like: int time(); This corrupted the stack, and in a completely unpredictable way. It was a large amount of code, and we didn't think to look at the time() call for quite some time.
Solution 20 - Debugging
Buffer overflows ( in native code )
Solution 21 - Debugging
Last year I spent a couple of months tracking a problem that ended up being a bug in a downstream system. The team lead from the offending system kept claiming that it must be something funny in our processing even though we passed the data just like they requested it from us. If the lead would have been a little more cooperative we might have nailed the bug sooner.
Solution 22 - Debugging
Uninitialized variables. (Or have modern languages done away with this?)
Solution 23 - Debugging
For embedded systems:
Unusual behaviour reported by customers in the field, but which we're unable to reproduce.
After that, bugs which turn out to be due to a freak series or concurrence of events. These are at least reproducable, but obviously they can take a long time - and a lot of experimentation - to make happen.
Solution 24 - Debugging
Difficulty of tracking:
- off-by-one errors
- boundary condition errors
Solution 25 - Debugging
When objects are cached and their equals and hashcode implementations are implemented so poorly that the hash code value isn't unique and the equals returns true when it isn't equal.
Solution 26 - Debugging
Machine dependent problems.
I'm currently trying to debug why an application has an unhandled exception in a try{} catch{} block (yes, unhandled inside of a try / catch) that only manifests on certain OS / machine builds, and not on others.
Same version of software, same installation media, same source code, works on some - unhandled exception in what should be a very well handled part of code on others.
Gak.
Solution 27 - Debugging
Cosmetic web bugs involving styling in various browser O/S configurations, e.g. a page looks fine in Windows and Mac in Firefox and IE but on the Mac in Safari something gets messed up. These are annoying sometimes because they require so much attention to detail and making the change to fix Safari may break something in Firefox or IE so one has to tread carefully and realize that the styling may be a series of hacks to fix page after page. I'd say those are my nastiest ones that sometimes just don't get fixed as they aren't viewed as important.
Solution 28 - Debugging
WAY back in the days, memory leaks. Thankfully, there's a lot of tools to find them, these days.
Solution 29 - Debugging
Memory issues, particularly on older systems. We have some legacy 16-bit C software that must remain 16-bit for the time being. The 64K memory blocks are royal pain to work with, and we constantly add statics or code logic that pushes us past the 64K group limits.
To make matters worse, memory errors usually don't cause the program to crash, but cause certain features to sporadically break (and not always the same features). Debugging is a non-option - the debugger doesn't have the same memory constraints so the programs always run fine in debug mode ... plus, we can't add inline printf statements for testing since that bumps the memory usage even higher.
As a result, we can sometimes spend DAYS trying to find a single block of code to rewrite, or hours moving static chars to files. Luckily the system is slowly being moved offline.
Solution 30 - Debugging
Multithreading, memory leaks, anything requiring extensive mocks, interfacing with third-party software.
Solution 31 - Debugging
Without a doubt memory leaks. Especially when you do things like dynamically create controls and add handlers in ASP.NET. On Page Load.
Solution 32 - Debugging
thread leaks, you often forget to count the number of threads
Solution 33 - Debugging
This is purely fictional, but The Bug by Ellen Ullman is a great tale of a hard to find bug that had tragic consequences.
Solution 34 - Debugging
character conversion problems on edbdic systems where the system mechanism to handle this has been disabled - urgh
Solution 35 - Debugging
Handling scenarios where users open more than one tab when interfacing with a website. Very difficult to avoid data continuity problems. Depends on the app I guess but this is something we spend a lot of time unravelling!