Recently, we encountered an interesting bug on one of our multi-platform app development projects: an older Android (4.1) client was quietly failing to connect to the backend API. Upon checking the server logs, we discovered the cryptic error: “SSL Handshake Failure.” Oh boy.
We rarely run into low-level errors of this nature since we usually build our applications on a Platform as a Service (PaaS). Part of the allure of using a PaaS on this and other projects is that we get to delegate our infrastructure needs to the platform, which frees us up to focus on developing the applications themselves. Encountering this type of error has an upside, though: it’s a good opportunity to peel back some abstraction layers, learn a bit more about the underlying server infrastructure of our app, and (my personal favorite) employ the bug hammer.
The first question: what exactly is an SSL handshake? Well, it’s the series of negotiation steps to establish a secure connection between two computing devices.
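To make the negotiation part concrete, here is a minimal sketch in Python of how two sides might settle on a protocol version and cipher suite. It is a simplified model, not a real TLS implementation: the function name `negotiate`, the `HandshakeFailure` exception, and the sample protocol and cipher lists are all assumptions made for illustration, and a real handshake also covers certificates and key exchange.

```python
# Simplified model of the negotiation half of a handshake.
# Assumption: the newest shared protocol wins, and the server's
# cipher preference order decides the cipher suite.

PROTOCOL_ORDER = ["SSLv3", "TLSv1.0", "TLSv1.1", "TLSv1.2"]  # oldest to newest


class HandshakeFailure(Exception):
    """Raised when client and server share no protocol or cipher suite."""


def negotiate(client_protocols, client_ciphers, server_protocols, server_ciphers):
    # Keep only the protocol versions that both sides have enabled.
    shared = [p for p in PROTOCOL_ORDER
              if p in client_protocols and p in server_protocols]
    if not shared:
        raise HandshakeFailure("no protocol version in common")
    protocol = shared[-1]  # newest version both sides speak

    # Walk the server's preference order and take the first suite
    # the client also offers.
    for cipher in server_ciphers:
        if cipher in client_ciphers:
            return protocol, cipher
    raise HandshakeFailure("no cipher suite in common")


# Illustrative values only; they are not taken from our actual server.
result = negotiate(
    client_protocols=["SSLv3", "TLSv1.0"],
    client_ciphers=["TLS_RSA_WITH_AES_128_CBC_SHA"],
    server_protocols=["TLSv1.0", "TLSv1.1", "TLSv1.2"],
    server_ciphers=["TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
                    "TLS_RSA_WITH_AES_128_CBC_SHA"],
)
print(result)  # ('TLSv1.0', 'TLS_RSA_WITH_AES_128_CBC_SHA')
```

If either list of shared options comes up empty, the sketch raises `HandshakeFailure`, which is roughly the situation a server log reports as a handshake failure.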
Okay, now we understand the handshake steps, but we’re still not sure when exactly the failure is occurring. Let’s start with the first half of the handshake: agreeing upon the best possible connection protocol and cipher suite. In our PaaS’ configuration, there’s a single flag to toggle subsets of allowed SSL protocols and cipher suites. After adjusting it to the most permissive setting, we observe that our older Android client can connect. Hurray! But wait, what’s the root cause? And is this a good solution?
In the case of our PaaS, there wasn’t good documentation around the effects of this configuration toggle. So how do we inspect the SSL capabilities of the server? There are a few tool options, but I leaned on these two: nmap on the command line, and SSL Server Test by Qualys SSL Labs on the web. Both tools give you a list of the supported protocols and cipher suites, but SSL Server Test also tests your server against a variety of clients and rates the security of your server in terms of common SSL vulnerabilities.
Running nmap against a host outputs a result that looks like this:
Running SSL Server Test gives you a report that looks like this:
When we configure our server to the most permissive setting and run nmap, we see SSLv3, TLSv1.0, TLSv1.1, and TLSv1.2 as the supported protocols. When we toggle the setting to a stricter value, we see SSLv3 disappear from the list. And for good reason: SSLv3 is considered “fundamentally insecure” due to its vulnerability to attack and inability to support strong cipher suites (RFC 7525). But it would appear that our older Android client is relying on this protocol to connect.
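For readers without nmap at hand, a rough version of the same protocol check can be sketched with Python's standard `ssl` module. This is a sketch under assumptions, not a replacement for the tools above: the helper names `pinned_context` and `probe` are made up for this example, SSLv3 is left out because most current OpenSSL builds cannot attempt it at all, and a negative result may reflect your local OpenSSL policy or a network problem rather than the server's configuration.

```python
import socket
import ssl
import warnings

# Versions to try. SSLv3 is omitted: modern OpenSSL builds usually
# ship without it, so Python typically cannot attempt that handshake.
VERSIONS = [
    ssl.TLSVersion.TLSv1,
    ssl.TLSVersion.TLSv1_1,
    ssl.TLSVersion.TLSv1_2,
    ssl.TLSVersion.TLSv1_3,
]


def pinned_context(version):
    """Build a client context that will only speak one protocol version."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # We are probing protocol support, not validating the server's identity.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with warnings.catch_warnings():
        # Newer Python versions warn when old TLS versions are selected.
        warnings.simplefilter("ignore", DeprecationWarning)
        context.minimum_version = version
        context.maximum_version = version
    return context


def probe(host, port=443, timeout=5.0):
    """Return {version name: bool} showing which handshakes succeeded."""
    results = {}
    for version in VERSIONS:
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                with pinned_context(version).wrap_socket(sock, server_hostname=host):
                    results[version.name] = True
        except (OSError, ValueError):
            # OSError covers ssl.SSLError and connection errors; ValueError
            # can occur if the local build cannot select this version.
            results[version.name] = False
    return results


# Example (needs network access; the hostname is a placeholder):
# print(probe("example.com"))
```

Pinning the minimum and maximum version to the same value forces each attempt to use exactly one protocol, so a successful handshake means the server accepted that version from this particular client.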
Aha! We’ve gotten to the root cause of the problem and even found a very simple fix. However, is this solution the right one? Completely securing an application that exists on the open web is next to impossible. Rather, we strive to minimize the attack surface of our applications; that is, we aim to develop software with the fewest and hardest-to-exploit vulnerabilities. Enabling SSLv3 on our server would run counter to that goal.
With this in mind, the better solution is to configure our client device to use a more secure protocol. In the case of an Android 4.x device, though all protocols are supported, only SSLv3 and TLSv1.0 are enabled by default (Android SSLSocket docs). So to finally squash the bug, our Android developers wrote a custom connection specification to enable the newer protocols and cipher suites.
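The gap between "supported" and "enabled by default" is easy to lose track of, so here is a small Python sketch that models it. The API-level ranges are my transcription of the table in the Android SSLSocket docs and should be checked against the docs before you rely on them; the names `PROTOCOLS`, `supported`, `enabled_by_default`, and `needs_opt_in` are hypothetical and exist only for this illustration. It is not the Android code our developers wrote.

```python
# Protocol availability by Android API level, transcribed from the
# Android SSLSocket reference (assumption: verify against the docs).
# Each entry maps protocol -> (supported range, enabled-by-default range);
# None as the upper bound means "no upper limit".
PROTOCOLS = {
    "SSLv3":   ((1, 25), (1, 22)),
    "TLSv1":   ((1, None), (1, None)),
    "TLSv1.1": ((16, None), (20, None)),
    "TLSv1.2": ((16, None), (20, None)),
}


def _in_range(api_level, bounds):
    low, high = bounds
    return api_level >= low and (high is None or api_level <= high)


def supported(api_level):
    """Protocols the platform can speak at this API level."""
    return [name for name, (sup, _) in PROTOCOLS.items()
            if _in_range(api_level, sup)]


def enabled_by_default(api_level):
    """Protocols a socket offers without any extra configuration."""
    return [name for name, (_, enabled) in PROTOCOLS.items()
            if _in_range(api_level, enabled)]


def needs_opt_in(api_level):
    """Protocols the app must enable explicitly before they are offered."""
    defaults = enabled_by_default(api_level)
    return [name for name in supported(api_level) if name not in defaults]


# Android 4.1 corresponds to API level 16.
print(enabled_by_default(16))  # ['SSLv3', 'TLSv1']
print(needs_opt_in(16))        # ['TLSv1.1', 'TLSv1.2']
```

The second list is what a custom connection specification has to switch on: protocols the device already knows how to speak but does not offer unless the app asks for them.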
Using a Platform as a Service is a wonderful abstraction that enables us as developers to not worry about the infrastructural “plumbing” of our apps. But sometimes that plumbing–ahem–abstraction springs a leak, and requires us to go investigating in lesser-known territory. Luckily (for the sake of developer morale), the components underlying the abstraction aren’t scarily opaque, and, as we’ve seen, there’s a rich toolset to leverage when solving problems at this level of your application architecture. So grab your galoshes and bug hammer, and go boldly in search of a fix.