Monday afternoon, Mike joined me to tackle my current bug. The problem: on one database, the program was strangely crashing while trying to get text for an entity. With a bit of debugging, we discovered that it didn't happen if we didn't ask for extraction annotations, and it didn't happen if we were able to use cached entities from the database instead of re-extracting.
After a lot of staring at the crash site and pondering, I added a flush to the output generation, so that we could look at the annotations from earlier documents. We discovered that every single entity annotation was wrong--sometimes they were just blank, sometimes they were showing nasty sequences of unprintable high-ASCII characters. This was a clue that darker things were afoot.
And sure enough, we found those darker things. The locations that were being used to extract the text were supposed to be relative to the source file. In fact, they were equal to 'offset in an internal buffer containing a distorted copy of the document's text' plus 'offset of that document in the source'. And then the locations were being applied to another buffer that contained a copy of part of the text in the previously-mentioned buffer--but not the part of the text that the location came from.
It was at that point that I started imitating seppuku with the rubber fish.
Fixing this was complicated by the fact that three different classes each held a different piece of this information. But we finally managed to fix the bug. Part of the fix involved delicately munging the whole-document buffer to insert garbage corresponding to the document markup in the source, so that the buffer would have its text parts at the locations that corresponded to the source locations. Sigh... this is a house-of-cards solution at best.
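For the curious, here is a minimal sketch of the padding idea in Python. It is not our actual code: the angle-bracket markup, the NUL filler, and the name pad_to_source_offsets are all made up for illustration, and it builds the padded buffer straight from the source rather than inserting garbage into an existing buffer. The point is only that once the buffer is as long as the source, a source-relative location picks out the right text.

```python
import re

# Hypothetical markup pattern: anything in angle brackets counts as markup.
MARKUP = re.compile(r"<[^>]*>")

def pad_to_source_offsets(source: str, filler: str = "\0") -> str:
    """Replace each run of markup with filler of the same length, so every
    text character stays at the offset it has in the source."""
    return MARKUP.sub(lambda m: filler * len(m.group(0)), source)

source = "<doc><p>Alice met Bob.</p></doc>"
start = source.index("Bob")           # a source-relative location

# Text-only buffer: source-relative locations no longer line up.
stripped = MARKUP.sub("", source)

# Padded buffer: same length as the source, text at the same offsets.
padded = pad_to_source_offsets(source)

print(repr(stripped[start:start + 3]))  # blank, like our bad annotations
print(repr(padded[start:start + 3]))
```

In this toy case the stripped buffer is shorter than the location, so the slice comes back empty, which is the same flavor of "sometimes they were just blank" that we saw in the annotations.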
One common observation in software testing is that the number of bugs still to be found in a piece of code tends to be proportional to the number of bugs already found in it. So once the number of defects in a given piece of code becomes sufficiently high, it's time to rewrite that code instead of continuing to patch it. We have definitely hit that defect level with locations and with the way the extraction subsystem deals with locations. Mike has fought with these location issues before. But we probably won't have time to rewrite it any time soon.
As a postscript: on Tuesday, we pursued a similar bug. We discovered (with much less difficulty, thank Bog) that another code path was double-correcting for the document offset. So we had to restructure things yet again...
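One way to make a double correction harder to write, sketched in Python under my own assumptions (the names BufferOffset, SourceOffset, and to_source are hypothetical, and this is not how our code is structured today): give buffer-relative and source-relative locations different types, so that the document offset can only be added once.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BufferOffset:
    """Position relative to the start of a document's internal buffer."""
    value: int

@dataclass(frozen=True)
class SourceOffset:
    """Position relative to the start of the source file."""
    value: int

def to_source(offset: BufferOffset, doc_start: int) -> SourceOffset:
    """Add the document's offset in the source, exactly once."""
    if not isinstance(offset, BufferOffset):
        # A SourceOffset here would mean the correction was already applied.
        raise TypeError("expected a buffer-relative offset")
    return SourceOffset(offset.value + doc_start)

location = to_source(BufferOffset(10), doc_start=100)
print(location)
```

Passing the result back through to_source raises a TypeError instead of quietly adding the document offset a second time.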