My Favorite Bug

Mark Dalrymple's Headshot
Mark Dalrymple

At the end of last year's storytelling session I mentioned that my favorite story is "Two programmers fixing a show-stopper bug over 24 hours, with the fix being a two-character change in the source code that resulted in a two-bit change in the compiled program." Here it is.

The Problem

It's a Wednesday afternoon in 2006. I was just about to call it day when I get a call from the mothership. It's the project manager of a product I was drafted to work on for a couple of months, mainly helping port some parts of it from Windows to the Mac. Even though I hadn't done any work for them in about six months, I knew calls from the PM means there's work involved. "Oh hi MarkD! We've got a problem I was hoping you could help us on." The coming monday was a big unveiling of the product at a conference, along with a simultaneous Mac and Linux release along with the long-established Windows version. The regular Mac people were tied up doing big fix-its, but there was an easy-to-repro (yet pretty much show-stopping) bug on the brand-new Intel Macs.

One of the features involves dragging an image file from the desktop into the main window. It worked fine on the PowerPC - jut drag in the image file and it gets added to the document. On the hot new Intel macs, it was a no-op. No error reported. Nothing logged. Just... nothing. So I got drafted, along with my cohort Dan who also did some work for that team a while back integrating an image processing library into the product.

Step 1 - Get the attitude

Even though I knew my next couple of days would be totally sidetracked working on this bug, I was pretty excited. I love work like this. "Hey, can you swoop in and reduce entropy for us?" Luckily at the company, working on "Code Red" situations like this is not a problem. It was ok for me to lose a couple of days on my regular work to pitch in.

Step 2 - Make a Damage Control Plan

Before embarking on a project like this, it's good to know what's expected of you, and what'll happen if you don't make it. "Hey PM, assuming we can't fix this by Monday, what happens?" It's entirely possible that we find the bug, but it's something that's not fixable in a tight time frame. Maybe it's a compiler bug or some deep-seated architectural problem that only manifests itself on the ICBMs (Intel Chip Based Macintoshes). For this case, if we didn't solve it by friday lunchtime they'd take one of the regular engineers off of their work and put them on the problem and work over the weekend. Obviously it'd be much better if they could continue with what they were working on.

Step 3 - Get an Accurate Symptom Description

Before I open a source file or set a break point, I want to get a reproducible test case. I might not always get that reproducible test case, but I want it. Luckily this test case was really easy, and always reproduced:

  1. Download the software from the continuous integration server.

  2. Run it on an ICBM and drag an image file to the document window. Nothing happens. Nothing happens with all standard image formats like gif, png, jpg, etc. Some of the fancier fringe-element formats work OK, though.

  3. On a PowerPC, do the same thing, but the image gets added to the document.

Step 4 - Reproduce the symptom

I downloaded a version from the continuous integration machine, tried it on my PowerPC PowerBook, and the image drag worked. Tried it on my Intel MacBook, and the image drag didn't work. Success! At least, the bug reproduced.

Step 5 - Do appropriate corrective maintenance

It had been several months since I was in that codebase. I checked it out, got it to build on both chip architectures and cranked up some warnings. We had different versions of the compiler (gcc 3.x on PowerPC, gcc 4.x on Intel) which means they're picky about different stuff. We made sure there wasn't anything obviously bad going on in the product in general, or the stuff that looked like graphic file processing. There were some warnings, but they weren't bad. I double-checked that the bug still happened with top-of-tree builds I just made.

Step 6 - Narrow it down to the root cause

Now it's time to get into the code. The application is a big C++ codebase that uses Qt as its foundation toolkit. As far as big C++ codebases go, this one is surprisingly good. The original architects picked a nice subset of C++ to center on so there wasn't too much language insanity going on.

It's architected as a main program that handles a lot of the common grunt work of the application with other features built as shared libraries that are loaded at run time. This helps reduce build and link times because you don't have to Link All The Things on every spin around the debug cycle. Just rebuild the shared library and then run the app. It also helps keep the source code compartmentalized. Each shared lib got its own folder for source code, graphics, resource files, and localizations. You could do a lot of work in there and not affect anyone else by stomping on common files.

The process of debugging went kind of like this :

Hypothesis - we might not be getting the drop events at all, due to bugs in the Intel Finder. Experiment - put some caveman debugging in the apple event handler to record information about file drops. Result - we get the drop events. The location to the files are good. So looks like the Finder is off the hook.

Hypothesis - the file's path isn't being passed into the image handling framework (which was in a shared library). Experiment - put some breakpoints in that code to see if it's getting called. Result - Yep getting called.

After a couple of hours of that kind of work we got fatigued (It was at the end of the work day when this landed on us) and scheduled an early time to get back to debugging.

The next day, after a couple more hours of caveman debugging and breakpointing, we found the root cause:

  1. The drop happened

  2. The code in the the shared library that handled the complicated image file format got first crack. If it didn't recognize the file type, it threw an GraphicsFormatException.

  3. An exception handler in the generic image handling shared library caught the exception:

    try { ... } catch (GraphicsFormatException e) { // See if it's a regular graphics file // If so, add it to the document. } catch (...) { // Give up. }

This is the crux of the bug. On PowerPC, the "See if it's a regular graphics file" branch got taken. On Intel, the "..." catch-all part got hit. Huh. Granted, throwing exceptions across shared library boundaries isn't necessarily the best thing to do, but so far it's worked great on Windows, PowerPC, and Linux. In this case, though, we put breakpoints in the catch-all portion, examined the exception, and it totally was a GraphicsFormatException.

Step 6.5 - Get smarter friends

This is where we admitted a bit of defeat and posted a question to our company's Mac IRC channel. Most everyone who had an interest in Macs kept it open, being our communal water cooler of frivolity, along with grousing about what other parts of the company were doing wrong. "Hey ya'll, we've got a PowerPC app that's happily catching exceptions, but on Intel it's going to the not-found case. Anyone have any ideas?"

That's when someone piped up "Oh, GCC changed the way it represents runtime metadata from version 3 to version 4." This includes things like RTTI and exception information. "It used to be strings, now it's a structure."

Veeery interesting.

Hypothesis: It's the compiler run-time metadata representation that's causing the exceptions to be handled differently. Experiment: compile the program with gcc4 on Power PC machines. Result: The PowerPC version now has the exact same bug. Conclusion: It's not a platform issue at all, but a compiler issue.

Step 7 - Repair or replace the defective component

Cool. It' a compiler metadata issue. In gcc 3, "GraphicsFormatException" (a string) was being thrown in one shared library while the other shared library was looking for "GraphicsFormatException" (a string). These two strings compare the same, so hey, someone threw a graphics format exception

In gcc 4, GraphicsFormatException (a structure, say at address 0x12345670) was being thrown. It was being caught, and compared to the other shared library's GraphicsFormatException (a structure, say at address 0x98765430). These two pointers are not equal, so it's not the same exception.

Why was the metadata pointer different? We were loading the shared libraries using dlopen(), kind of like:

void *token = dlopen (pathToSharedLibrary, RTLD_LAZY | RTLD_LOCAL);

That is, lazy-load the symbols from the shared library. Prefer to keep symbols from the library private, and segregate them from everyone else. This included the RTTI / Exception metadata, o each shared library got its own copy of things like object types and exception information. What we really need to do is have that stuff coalesced.

One way to do that is to move the exception declarations and whatnot into the main app and have the shared libraries link against the app. That turned out to be a fair amount of high-risk work (it was now Thursday afternoon), so we went for a temporary work around until after the grand unveiling at the conference: changing from local to global symbol processing:

void *token = dlopen (pathToSharedLibrary, RTLD_LAZY | RTLD_GLOBAL);

That means symbols exported from each of the shared libraries will be available to other shared libraries. And it so happens that the various duplicated exception types got coalesced, and now we could throw a GraphicsFormatException from one shared library and have it properly caught and handled in another.

Step 8 - Test

There was some risk with this change - we bounced it off of the regular team members and other smarter-than-I people. Luckily the engineers on this product have very good discipline so there weren't any opportunities for name or data aliasing that shouldn't have already happened. Luckily we had dedicated QA crews, both internal to the company and an outside vendor. There weren't any new problems introduced. Yay!

Step 9 - Take Pride in your Solution

We checked in our change on Thursday evening and got it reviewed and landed. So about two engineer-days tracking down and fixing the problem. Our fixed work. The program shipped on schedule, everyone was happy, and I got a thank-you note from the PM.

Here's those two lines of code again:

void *token = dlopen (pathToSharedLibrary, RTLD_LAZY | RTLD_LOCAL);
void *token = dlopen (pathToSharedLibrary, RTLD_LAZY | RTLD_GLOBAL); 

There's only two characters different:

void *token = dlopen (pathToSharedLibrary, RTLD_LAZY | RTLD_LO<strong>C</strong>AL);
void *token = dlopen (pathToSharedLibrary, RTLD_LAZY | RTLD_<strong>G</strong>LO<strong>B</strong>AL); 

Added a G, and changed a C to a B. In terms of values:

RTLD_LAZY | RTLD_LOCAL  -> 0x1 | 0x4 -> 0x5 -> <strong>01</strong>01
RTLD_LAZY | RTLD_GLOBAL -> 0x1 | 0x8 -> 0x9 -> <strong>10</strong>01

Two bits. One bit per engineer-day in terms of efficiency.

Step 10 - Prevent Future Occurrences of this problem

Once we made this fix, we searched the code for other uses of dlopen, to see if they were passing similar functions. Because this was a pretty well engineered product, this was the only point in the code that came close to this issue. We put comments before the dlopen call explaining why the particular parameters were chosen, in case someone in the future decides to change the flags.

So, if you ever ask me what my favorite bug is, I'll probably tell you this exact story.

Recent Comments

comments powered by Disqus