Own Your Crash Logs with PLCrashReporter: Part 2
Creating Crash Logs
This is part two of our PLCrashReporter series. In this post, we will examine how crashes are created and learn more about specific crash types.
A crash handler has a three-phase life cycle:
- Preparing to handle crashes – There are a lot of secondary responsibilities to account for, but the essential task of this phase is to register a function (or functions) with the OS that will be executed if and when a crash occurs.
- Handling an actual crash – As a crash occurs, the code that was registered during the first phase is executed. During this phase, information about the nature of the crash (e.g., the call stack for the crashed thread) is captured. This information is written to disk for later use.
- Recovering a crash log – On iOS this phase does not occur until the user launches the app again. Without a subsequent launch, there is no opportunity for an in-process crash reporter to process the crash log. On macOS, it is possible for an app to run separate processes that detect and report crashes.
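The third phase can be as simple as checking, early in launch, whether the previous run left a crash log behind. Here is a minimal sketch in C of that check. The function names and the idea of a single log file at a known path are assumptions made for illustration; they are not PLCrashReporter's API.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if a non-empty crash log from a previous
 * run exists at `path`, and 0 otherwise. */
int crash_log_is_pending(const char *path) {
    struct stat info;
    if (stat(path, &info) != 0) {
        return 0; /* No file means no crash was recorded. */
    }
    return S_ISREG(info.st_mode) && info.st_size > 0;
}

/* Hypothetical helper: deletes the log once it has been processed, so the
 * same crash is not reported again on the following launch.
 * Returns 0 on success. */
int crash_log_discard(const char *path) {
    return remove(path);
}
```

This code runs during a normal launch, not inside a crash handler, so none of the restrictions on crash-time code apply to it.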
Next, let’s look closer at the first phase. How does one prepare to intercept crashes? To answer that, we need to look at how crashes are propagated. On Apple platforms, there are two pathways via which application crashes flow: POSIX signals and Mach exceptions.
First up are POSIX signals. When an illegal instruction or a request for termination occurs, the kernel sends a POSIX signal to the offending thread. These signals have a shortlist of usual suspects:
- SIGSEGV – Memory errors
- SIGILL – Illegal instructions
- SIGABRT – Usually when the process itself calls
- SIGKILL – For example, when you issue the killall -9 command in a shell. That “9” is the value of SIGKILL.
Once that signal is delivered, the OS terminates the process by default. Between when the signal is sent and when the process is terminated, any signal handlers that were registered for the process are given an opportunity to respond. SIGKILL is the exception: it cannot be caught, so no handler ever runs for it.
Apple’s operating systems run atop a Darwin kernel. Darwin is a descendant of the Mach kernel. Mach kernels, much like Darwin, use exception messages, rather than POSIX signals, to communicate about unexpected errors in the program flow. Mach exceptions are messages sent over IPC ports, which can be subscribed to by any interested observer with sufficient permissions for the process they’re interested in.
On Darwin, Mach exception messages are actually the underlying mechanism beneath the implementation of POSIX signals. Darwin registers a Mach exception handler that reflects Mach exceptions into POSIX signals. This is why, for example, when you look at a memory access error crash, the description of the crash includes EXC_BAD_ACCESS (SIGSEGV): the Mach exception and the POSIX signal, respectively.
It is possible for your app to register its own Mach exception server, but a thorough exception server implementation requires use of undocumented or private API and is fraught with even more peril than writing your own POSIX signal handler.
When writing a custom crash logger, you have to decide which of these mechanisms you will use to intercept crashes. While this might seem like a purely academic choice, there is at least one salient edge case: POSIX signal handlers are run on the crashed thread, not a separate thread, thus using the same stack as the crashed thread. If that thread encountered a stack overflow, there will by definition be no available space on top of the stack for your signal handler to be executed, and you will be unable to capture the crash. A Mach exception handler is immune to this edge case because your handler — or, more accurately, your exception server — is listening for exceptions on a dedicated thread, which likely has enough room on its call stack to execute crash-handling code.
PLCrashReporter can use either POSIX signals or Mach exceptions, but the authors strenuously recommend against using a Mach exception server in production code.
Async Signal Safety
Within your crash-handling code, there are truly profound limitations on what APIs you are able to use. There is almost nothing available to you: only a shortlist of what are called “async-signal-safe” APIs can be used within a signal handler. Because the heap could be corrupted, your handler can rely only on stack memory and on memory that was set aside before the crash. Fortunately, the shortlist includes essential file-system functions like write(2), which is how a custom crash handler is able to save a crash log to disk.
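As an illustration, here is a minimal sketch in C of saving bytes to disk using only write(2) at crash time. It assumes the log file is opened during launch, so that the handler never has to do that work itself. The names crash_log_open and crash_log_write are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* File descriptor opened ahead of time, during launch. */
static int g_crash_log_fd = -1;

/* Call during launch, NOT from the signal handler.
 * Returns 0 on success and -1 on failure. */
int crash_log_open(const char *path) {
    g_crash_log_fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    return g_crash_log_fd >= 0 ? 0 : -1;
}

/* Intended to be callable from a signal handler: it touches no heap memory
 * and calls nothing but write(2). Loops until every byte is written. */
int crash_log_write(const char *bytes, size_t length) {
    int saved_errno = errno; /* Leave the interrupted code's errno intact. */
    int result = 0;
    while (length > 0) {
        ssize_t written = write(g_crash_log_fd, bytes, length);
        if (written < 0 && errno == EINTR) {
            continue; /* Interrupted before writing anything; try again. */
        }
        if (written <= 0) {
            result = -1;
            break;
        }
        bytes += written;
        length -= (size_t)written;
    }
    errno = saved_errno;
    return result;
}
```

Note the absence of fopen, fprintf, and friends. The buffered stdio functions are not on the async-signal-safe list, which is why the sketch goes straight to the file descriptor.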
The “async” part is misleading from the perspective of a practicing iOS or macOS developer. It doesn’t mean they’re safe for concurrent access from multiple threads. Rather, an API is considered async-signal-safe if it is guaranteed to be fully re-entrant. A crash could occur at any moment during program execution, including somewhere within a call to a function that your crash handler might need to call itself! If your crash handler then called that same function during the course of handling the crash, and that function isn’t async-signal-safe, your crash handler might deadlock, leading to a lost crash log. You need async-signal-safe turtles all the way down.
Some additional things you cannot do because they aren’t async-signal-safe:
- You cannot allocate memory because malloc isn’t async-signal-safe, and also because the heap might be corrupted. Any memory your crash handler needs to perform its duties must be allocated during app launch when the handler is first registered. Its memory budget is thus fixed and predetermined, even though a crash log could contain any amount of information. Consider how many megabytes of data could be in a single crash log if there are deep call stacks and hundreds of libraries.
- Signal handlers cannot use Objective-C or Swift. Period. This is because the Objective-C runtime makes extensive use of non-recursive locking and because both Objective-C and Swift provide no way to prevent their runtimes from calling
- You cannot dispatch to other threads or queues. Allowing program execution to continue beyond the crash could lead to corruption of the user’s data.
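The fixed memory budget from the list above can be sketched in C as a statically allocated buffer plus a formatter that avoids both malloc and printf-style functions. The capacity, the function names, and the fixed-width hex format are all assumptions chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* The whole memory budget is reserved up front, in static storage. */
#define CRASH_BUFFER_CAPACITY 4096

static char g_crash_buffer[CRASH_BUFFER_CAPACITY];
static size_t g_crash_buffer_length = 0;

/* Number of bytes currently stored in the buffer. */
size_t crash_buffer_length(void) {
    return g_crash_buffer_length;
}

/* Appends bytes to the buffer and returns how many were stored.
 * Input that does not fit is truncated, never reallocated. */
size_t crash_buffer_append(const char *bytes, size_t length) {
    size_t available = CRASH_BUFFER_CAPACITY - g_crash_buffer_length;
    if (length > available) {
        length = available;
    }
    for (size_t i = 0; i < length; i++) {
        g_crash_buffer[g_crash_buffer_length + i] = bytes[i];
    }
    g_crash_buffer_length += length;
    return length;
}

/* Writes `value` as "0x" followed by 16 lowercase hex digits into `out`,
 * which must hold at least 18 bytes. No terminator is added. Returns 18. */
size_t format_hex64(uint64_t value, char *out) {
    static const char digits[] = "0123456789abcdef";
    out[0] = '0';
    out[1] = 'x';
    for (int i = 0; i < 16; i++) {
        out[2 + i] = digits[(value >> (60 - 4 * i)) & 0xF];
    }
    return 18;
}

/* Appends an address (for example, one frame of a call stack) as text. */
size_t crash_buffer_append_address(uint64_t address) {
    char text[18]; /* Stack memory only. */
    size_t length = format_hex64(address, text);
    return crash_buffer_append(text, length);
}
```

Truncating when the buffer fills up is one possible policy. It loses data, but it keeps the handler from ever needing more memory than was set aside at launch.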
Popping Back Off The Stack…
Putting this all together, to write a proper crash reporter requires:
- Knowledge of UNIX/Mach/Darwin systems beyond the ken of mere mortals
- Writing mission-critical code without the benefit of any higher-level abstractions
- Heroic reasoning about code that cannot be debugged with a debugger
- Being able to finesse many megabytes of data (binary images, call stacks, etc.) through a tiny synchronous window of CPU time
- Doing all of the above on a shoestring memory budget
By now, you should be sufficiently terrified of writing your own crash logger. If you aren’t, you’re braver than most, or you weren’t paying attention. The rest of us are fortunate that there are existing implementations.
A reliable, well-maintained, open-source library for capturing crashes on Apple platforms is PLCrashReporter. It has changed hands a few times as its ownership hopped from Plausible Labs to HockeyApp to Microsoft, but the fundamental design of the library remains the same.
The next post in our series will walk you through adding the PLCrashReporter library to an application so that you can obtain crash logs directly on your device without having to resort to a third-party service.
Note: Landon Fuller, the primary author of PLCrashReporter, gives an example of how allowing program execution to continue after the crash can corrupt or destroy user data in the section “Failure Case: Async-Safety and Data Corruption” of Reliable Crash Reporting.