On Thursday or Friday of last week, Steve had discovered a nasty Heisenbug. The release version of our code was sometimes giving different results, with no changes to input data, execution parameters, or executable. (The debug version was not giving any discrepancies.)
Any computer jock will understand that this gave us the heebie-jeebies. This sort of bug is the worst type of bug to track down and debug.
Steve, though, is a big QA stud. Over the weekend, he managed to trim down a massive input set to find a single document that would fairly regularly have variant output. We owe a lot to him; without his narrowing down the problem like that, I would probably have just thrown up my hands in defeat.
Steve had also found a discrepancy in the output from the tracer that corresponded to the discrepancy in the output. (Steve had gotten very familiar with this bug, so familiar that he had named the bug "Werner".)
That was where things stood when I came into work today. I looked at the discrepant tracer output. It was just an indicator of problems elsewhere, not a direct evidence of the problem. But it gave me a basis for putting trace statements elsewhere, and those trace statements led us nearer to the heart of the discrepancy.
With that new evidence, we were better able to understand the discrepancy. We saw that one of the results was clearly wrong--it was identifying an email address of the form 'firstname.lastname@example.org' as 'email@example.com'. (The other result was just somewhat wrong.) This twigged a memory--I had seen that misbehavior in the course of debugging. We fired up the debugger and confirmed that we were able to replicate that misbehavior.
We were still uncertain about why it would be nondeterministic, but we decided to pursue this bug. We asked Mike to look at it with us. This turned out to be just the right thing to do, because the bug turned out to be in a portion of the code within his domain of expertise. It turned out to be an off-by-one error in one of the most deeply hackish portions of the code. In fact, this turned out to explain a screwy bug we had observed in an entirely different product.
It also turned out that a workaround for another bug would also be a workaround for this bug, so we decided to take no action on this bug before the imminent release.
But we were still nervous about the nondeterminism. (I quipped, "As computer jocks, we are used to things being wrong, but we hate being surprised." Steve's girlfriend Raven said, "Thank you! That explains a lot about Steve.")
We devised some traces that demonstrated that the nondeterminism was within a single module. Steve decided to pursue that with a carpetbombing of trace statements.
Steve would narrow down the discrepancy a bit further, we would take a look, and we'd suggest further traces.
On the second time I went in to look at the narrowed discrepancy, I spotted the problem. The code was foolishly assuming null-terminated strings, when the actual strings were not null-terminated. And it so happened that different types of garbage after the last real letter were giving different results. Aha! This is just the sort of thing that causes nondeterministic behavior.
Steve worked on a quick hack that tested whether this was the problem, and I worked on rewriting the code to eliminate that problem. It even turned out that jpbl had written a unit test for that code that caught some of the errors in my rewrite. Jennifer rocks.
When I left, Steve was testing my rewritten code. All in all, it took us about a man-week of labor to identify these bugs and fix the nondeterminism. For a subtle heisenbug like this, I think that's excellent speed.