
Blogs from the Ranch


Thoughts on Debugging, Part 2

You can find Part 1 of this series here.

This has happened to everyone. A colleague walks into your space, erases your white board and starts scribbling on it. They say, “Hey, I’ve got this problem, and I wanted to know if you could help me with it,” followed by five minutes of boxes and circles and arrows.

And then comes, “Oh yeah, I see the problem! Thanks!” They walk out, and you haven’t even said a word. The same thing happens when you post a question to a mailing list or Stack Overflow, and then discover the solution five minutes later. This is where you get the standard response of, “You just organized your thoughts and discovered the solution,” followed by sage advice and recommendations to talk to teddy bears and rubber ducks. (I have a Bill the Cat that I talk to.)

Most of the time, though, a colleague walks into my space, erases the white board, and starts scribbling on it. Five minutes of boxes and circles and arrows later, I get an expectant look that says, “So, did you solve it yet? I’m still totally stumped.” I usually blink back in confusion.

Information overload. I’ve received a barrage of data and I haven’t time to assimilate any of it. Too much detail. Some may be worthwhile. Some may be noise. I don’t know!

Define the Problem

I like Problem Statements, even though that term sounds very process-oriented: “Before signing up to address a software defect, one needs an applicable problem statement.” But problem statements are simple. It’s good to have one before walking into someone else’s space. Actually, it’s good to have a good problem statement even if you’re debugging on your own. If you’ve gotten to the point of debugging frustration, Everything You Know Is Wrong, including what you think is wrong.

A good problem statement (as it relates to debugging):

  • Phrases the bug as simply and concisely as possible.
  • Does not pre-form any conclusions.
  • Just enumerates what’s wrong, and what would make things right.

Good problem statements are actionable and include victory conditions so you know when to stop debugging.

Object Lessons

An extremely smart Rancher came to me with this problem one day:

“Hey MarkD, can you help me figure out some ARC memory management issues with some Sprite Kit Sprites?”

The question surprised me on a couple of levels. Recall my levels of probability for a bug (new code, old code, library code, Cocoa, the compiler and runtime). ARC comes from Apple’s compiler team, who are awesome. The odds of an actual bug there are pretty small. I’d blame the prerelease Sprite Kit first, since it falls squarely into the “New Code” category.

“What makes you think you have memory management issues?” was my question. The response? “My Sprites are just white rectangles, so ARC must be releasing the images.”

A good problem statement here would be, “I have white rectangles for sprites. I want the sprites to have proper textures.” This defines the problem. It also defines the victory conditions. Now we can start working.

Zero in on the main issue

One thing I’ve discovered when helping beginning programmers debug, whether in a classroom situation or 1:1 mentoring, is an obsession with data. The description of the problem comes in a torrent of information, usually explaining everything the programmer has already looked at.

When I come to a problem with a clean mind, I don’t want to know everything you’ve already considered. The details will just bounce off my head. My mind is made of meat like yours, and until I’ve assimilated the basics, I’m subject to the “magical number seven, plus or minus two” cognitive limit. Plus, in an instructional capacity, I think there’s pedagogical benefit in the student seeing a debugging process from the ground up.

This happened a couple of months ago:

“Hey MarkD, can you help me figure out this problem I’m having with layout? Let me describe the full algorithm.”

(… ten minutes bouncing around Xcode later, my eyes glazing over …)

“And oh, yeah, and it’s crashing on this line.”

To which I responded,

“This crash—that’s the problem?”

“Yeah.”

Here, the best thing to do is to say, “My program is crashing on this line, and I don’t know why,” with a victory condition of “it doesn’t crash.” That way you avoid going over the river and through the woods.

During a debugging session, we’ll probably hit the rest of the layout algorithm, but it’ll be at a speed more amenable to mental assimilation. Then again, the bug might just be a null pointer dereference that takes 30 seconds to figure out.

Don’t Pre-Solve the problem

“Hey MarkD, when I’m on WiFi I can delete an object from the datastore. On 3G I can’t. We have a networking problem, but I can’t figure it out.”

I think you know how this conversation is going to go. “What about networking do you think is broken?”

“I print the data we get back, and it’s the same in both cases, but the behavior is different between 3G and WiFi, ergo it’s a networking issue.”

The first part of this is a great problem description. “I can’t delete an object from the datastore on 3G. WiFi works great.” The second part pre-solves the problem. Remember, Everything You Know Is Wrong. If you first assume the networking stack is to blame, and it really isn’t, then you’ve wasted a lot of time looking in the wrong direction.

What’s the victory condition? “I can delete the object from the datastore.”

Science!

Why go through all the trouble to build a problem statement? Two reasons: the first is to prevent losing time by going down a rabbit hole. The second is that it’s easier to apply the good old scientific method that we learned in high school:

  1. Ask a question (based on the problem statement)
  2. Construct a hypothesis to explain the question
  3. Test the hypothesis with an experiment
  4. Analyze the data. (Do you get closer to the victory condition?)
  5. Repeat (until fixed)
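The loop above is easy to sketch in code. Here is a minimal, hypothetical Python version (the function names and the toy "bug" are mine, not from any real tool): each experiment pairs a hypothesis with a test, every result goes into the lab notebook, and the loop stops once the victory condition holds.

```python
# A minimal sketch of the debugging loop above. Everything here is
# hypothetical scaffolding, not a real debugging framework.

def debug_loop(victory, experiments):
    """Run (hypothesis, experiment) pairs until the victory condition holds."""
    notebook = []                              # the lab notebook log
    for hypothesis, run in experiments:
        result = run()                         # 3. test the hypothesis
        notebook.append((hypothesis, result))  # 4. record the data for analysis
        if victory():                          # closer to the victory condition?
            break
    return notebook

# Toy usage: the "bug" is a flag that one experiment happens to fix.
state = {"fixed": False}
log = debug_loop(
    victory=lambda: state["fixed"],
    experiments=[
        ("it's the cache", lambda: "no change"),
        ("it's the stale helper", lambda: state.update(fixed=True) or "fixed"),
    ],
)
```

The notebook matters as much as the loop: a failed experiment that ruled something out is data, not wasted time.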

I was a science nerd in high school, thanks to great chemistry, biology and physics teachers, which is why the basic scientific method is what I use for debugging. Ask a question. Construct a hypothesis. Run a test or two or twelve. Keep a lab notebook log of the data for later analysis.

I’ve written about the Universal Troubleshooting Process before—it is a specialization of the general scientific method, giving a bit more structure to the approach.

Applications

How did my colleagues and I apply this process in tracking down some of these bugs?

For the blank sprites, the first question was, “Is ARC really implicated?” The hypothesis was, “ARC is not implicated, due to it being pretty low in the hierarchy of potential blame.” The experiment? We took a quick trip through the Allocations instrument and verified that the memory management of the sprite images was sane. The next question was, “Are the images getting loaded from disk into UIImages correctly?” The experiment was a simple breakpoint at the load location and a look at the image data in memory. It seemed sane. These days, we can QuickLook image pointers in Xcode and see rasterized goodness. Eventually the problem was tracked down deep inside of Sprite Kit, radars filed, and things were fixed before the first public release.

For the networking/removing an object from the collection issue, an early question during debugging was, “Does this problem really only manifest itself on 3G and not WiFi?” Experiments were made on different networks and indeed only 3G showed the problem.

The next question was, “3G is implicated. Why is it failing on that network?” Hypothesis was, “Network responses are not coming back.” Simple caveman debugging showed that for all outgoing requests, we got a response. Next hypothesis was, “Network responses are coming back in different orders on WiFi and 3G”. The same caveman debugging showed that yes, indeed, the responses were being processed in a different order.

It turned out that the product had a caching layer that was losing coherency because of a race condition in the order responses returned. On 3G, a slower response was confusing the cache, causing the deleted object to immediately resurrect itself. It wasn’t a networking problem, just one exacerbated by the network.
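A toy sketch (with made-up names, not the actual product code) shows how arrival order alone can resurrect a deleted object when a cache applies responses naively:

```python
# Hypothetical sketch of the cache-coherency race described above.
# The cache applies responses in arrival order; a slow read that was
# issued before a delete can arrive after it and resurrect the object.

cache = {}

def apply_response(kind, key, value=None):
    """Apply a server response to the local cache in arrival order."""
    if kind == "read":
        cache[key] = value       # a stale read re-inserts the key
    elif kind == "delete":
        cache.pop(key, None)

# WiFi: the read comes back before the delete; the cache ends up correct.
apply_response("read", "obj1", "data")
apply_response("delete", "obj1")
wifi_correct = "obj1" not in cache       # True

# 3G: the slow read arrives after the delete; obj1 resurrects itself.
apply_response("delete", "obj1")
apply_response("read", "obj1", "data")
threeg_correct = "obj1" not in cache     # False: the bug
```

One common fix is to apply responses in the order the requests were issued (or by a server-side sequence number) rather than in arrival order.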

Be single minded

The point of this whole process is to focus on one problem at a time. Don’t bounce around between problems. Don’t get distracted. Just investigate one thing at a time. If you come across some unrelated data that is interesting or is pertinent to another bug, put it into your log.

I have the pleasure of having friends in higher education who are helping train the next generation of developers. I like to chat with them at conferences because they often have interesting insights into this whole teaching thing, and how it differs from the bootcamps we do at the Ranch. One thing they’ve pointed out is that their novice debuggers flail around, chasing different bugs at the same time, wasting a lot of time and generally getting very frustrated. This is understandable, because the folks in Programming 101 don’t yet have a grasp of how the different layers in a system should work, much less what is not working. Unfocused effort is Bad Mojo.

Good advice, should you find yourself in similar circumstances, is to Be Consistent. Be consistent with your test data. Don’t change configurations, databases or the documents you use to test the problem. Be consistent in your bug reproduction steps. Click the same buttons in the same order each time. You could be chasing after a number of different bugs that just happen to manifest themselves in the same way. (It happens. It’s happened to me.) By concentrating on one problem at a time, you have much less opportunity for distraction and burning up that most precious resource—your time.

Be Relentless

You can’t be nice to bugs. They’re tenacious. Keep after them. Tools like the Universal Troubleshooting Process are fundamentally binary searches through your program. Nothing can hide from a binary search for long.
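The search itself can be sketched as a bisection over whatever axis is in front of you: commits, inputs, or chunks of code. A hedged sketch, assuming the failure is monotonic (everything before the culprit is good, everything from the culprit on is bad, like `git bisect` assumes):

```python
def bisect_first_bad(items, is_bad):
    """Return the first item for which is_bad(item) is True.

    Assumes the items are ordered so that everything before the culprit
    passes and everything from the culprit on fails (git bisect style).
    """
    lo, hi = 0, len(items) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(items[mid]):
            hi = mid           # culprit is at mid or earlier
        else:
            lo = mid + 1       # culprit is after mid
    return items[lo]

# Toy usage: find the first "commit" that breaks the build.
commits = list(range(100))
culprit = bisect_first_bad(commits, lambda c: c >= 42)   # culprit == 42
```

One hundred commits take seven probes; a thousand take ten. That is why nothing hides from a binary search for long.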

Remember that it’s OK to Hack. Source code control has you covered. Maybe a hypothesis is, “Callers of this function are actually ignoring its return value”. In that case, the experiment is to cut out 200 lines of code and replace it with return 12; and see if the callers behave differently.
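In Python, that experiment is a one-line monkeypatch. The names below are hypothetical, invented for the sketch:

```python
# The "return 12;" experiment, sketched in Python with made-up names:
# replace the suspect function with a constant stub and see whether
# callers actually react to its return value.

def parse_retry_count(config_text):
    # ...imagine 200 lines of real parsing logic here...
    return 3

def connect_with_retries(config_text):
    return "retries=%d" % parse_retry_count(config_text)

original = parse_retry_count
parse_retry_count = lambda config_text: 12    # the hack
stubbed = connect_with_retries("any config")  # "retries=12" if callers listen
parse_retry_count = original                  # put things back afterward
```

If the callers’ behavior doesn’t change when the stub goes in, the hypothesis that they ignore the return value just picked up strong evidence. And since source code control has you covered, reverting the hack is free.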

Be Weird

One last bit of babble on this topic. When I’m working with someone on a bug, I’ll sometimes have them do weird stuff. “It’s an experiment, we’ll get some data out of it.”

There have been times where I’ve set a breakpoint that never gets hit. But I know that code has to run for things to work. Remembering that everything I know is wrong, I’ll fall back to more basic principles. I’ll ask, “Is this code I’m looking at right now even getting included into the program?” To test this, I’ll do a cat-stomp on the keyboard:

- (IBAction)startStopListening:(NSButton *)sender {

    sjdfnsjdnfjfnsldnfljsdnfljasdnf

    if (receivingSocket) {
        CFSocketInvalidate(receivingSocket);
        receivingSocket = NULL;
        [sender setState:1];
    }
}

And then do a build. If the program compiles and links OK, I know this code is not getting compiled at all. Maybe the file wasn’t added to the target. Maybe there’s an #ifdef in the file, or in an included header, that’s preventing compilation. But now I know I’m not insane: there’s a real reason the breakpoint didn’t trigger.

Similarly, I was helping a friend track down an XPC/privileged helper tool problem. We were caveman debugging stuff and seeing the log text come through on the console, but it seemed like all of the code fixes we were doing were 100% ineffectual.

“Hey, can you add a NERDS RULE THE UNIVERSE NSLog right before the XPC connection is set up?”

“Uh, why?”

In my mind, I wasn’t convinced that our new code was actually getting run, just the same old code. So the question was, “Is the new code getting run?”, the hypothesis was, “It is not getting run. In fact, we keep reusing an old version that’s stuck in the matrix.” The experiment added that new log statement. If we saw NERDS RULE THE UNIVERSE in the console we would know that our new code was getting deployed properly, and we’re just terrible at fixing things. If we didn’t see it, we’d know our new code was never getting executed. Sure enough, it was a stale privileged helper tool that was not getting replaced by new code.

What’s the Takeaway?

When debugging, it’s easy to get mired in details and to thrash around uncontrollably.

By knowing levels of pain (five-minute, one-day, omgwtf), you can spot when a bug has moved from one category to another, and you need to pull out more powerful tools.

By realizing that everything you know is wrong, you can help purge your mind of the preconceptions that are giving the bug a place to hide.

By having a hierarchy of blame (new code, old code, third-party library code, toolkit code, the compiler), you know to pay more attention to the things at the top of the list and trust the things at the bottom of the list (until real evidence indicates otherwise), freeing up mental resources.

By having a good problem statement, you can avoid going down rabbit holes and solving the wrong problem.

By using some flavor of the scientific method, you can acquire “laser”-like focus in finding the bug.

By concentrating on one problem, you don’t waste time chasing your tail.

By being consistent, you’re always attacking the same bug from run to run.

Don’t be afraid to hack code, or to do weird stuff to answer questions. If you can glean a new piece of data about the way your system operates, you’ll be ahead of the game.

And don’t forget. Debugging is tough. As you get more experience, the kinds of bugs you’ve already fixed just move into an easier bracket, while new and improved bugs move into omgwtf territory, waiting for you to find them.
