November 20th, 2001


6648 and all that

In response to a request from one of my fans (would y'all consider yourselves my fans?) I'm going to explain the bug I was tracking down.

I suppose this is a violation of Clairvoyance's IP, but I think the only person who could accomplish anything useful with this information is the original author of the code.

So: we have this disk-based data structure that stores sequences of elements. (In the case we're using it, the elements are each four bytes.) Small sequences (less than a pageful) are stored without overhead. Sequences larger than an 8K page are stored as a linked list of pages. The leftover bit that doesn't fill a page is stored in a 'segment' within another page. Segment sizes are rounded up to the nearest 2K, so four segments can fit in a page.

The code does a typical trick for storing the extra information for the linked-list maintenance: it stores its information in the low 8 bytes of a page or segment, and it returns (address + 8) as the beginning of the segment.

The error was that when it was using a segment, it was adding 8 to the address of the segment twice, instead of once.

Normally, this doesn't cause any problems. If it's the first, second, or third segment on a page, there's no problem, because all the segments are shifted by the same amount, so there's no overlap. If it's the fourth segment on a page, there's usually not a problem, either. There's only a problem if the sequence in the fourth segment has exactly (2K - 8) bytes worth of data. In this case, filling out that sequence eventually writes beyond the end of the page. In the debug case, this gets caught by an assert; in the optimized case, much more mysterious things happen.

This is why the structure of 4 terms appearing 6648 times is so significant. The number of elements on a page is (8192 - 8) / 4 = 2046; the number of elements in a segment is (2048 - 8) / 4 = 510. 6648 is 3 * 2046 + 510, so a sequence of 6648 elements will use three pages and one completely full segment.

Now that I've explained the mystery, I'll talk about how I debugged it.

As I explained, this bug only comes up in a fairly obscure case. But once it came up, it would come up every time. And it was coming up in one of our very large databases.

We ran it in the debugger, and after 30 minutes or an hour, it hit that assert.

Unfortunately, due to failings of MSVC, I didn't have complete debugging information available in that class. I could have gotten that information available, but it would have taken just as long to get back to that crash point. So I wanted to get as much debugging information as possible before restarting the program.

I ended up looking up all the data by tracing pointers through a hex dump of memory. It was fairly arduous, but eventually successful. In doing so, I discovered that it was writing just beyond the end of the page.

There were two possibilities: either it was misallocating sequences, or it was writing more entries into the sequence than it had allocated. With more memory-tracing, I discovered that it had specified a sequence length of 6648, and it was writing element 6647.

I vaguely remembered the layout of this structure from a talk back in 1998; I reconstructed more of the layout from reading the code. I tried a few times to come up with numbers that would force the introduction of new segments--and then I realized that I already knew that 6648 would result in the introduction of a new segment, and used that four times. Once I had the smaller test case, it was a relatively straightforward matter of tracking it down.

I feel pretty proud. I think I nailed this bug much more quickly than many people might have.