Hadoop Across Availability Zones
Hadoop is a great solution to the big data problem, and with instant access to servers and storage in the cloud, it's easier than ever to spin up and manage your own cluster. If you haven't heard much about it yet, Hadoop provides a distributed file system along with a framework for running MapReduce jobs over your data; it takes care of replicating chunks of data to each node and running jobs in parallel for you. However, when you want to expand your Hadoop cluster across availability zones, you can run into some unexpected problems. So let's dig into the ideas we tried and the final solution that worked best for our configuration.
Communicating across data centers is an old problem with a lot of different solutions, so we started with the simplest thing we could think of: assigning an Elastic IP to each node and adding those addresses to the security group of the opposite availability zone. This basically allows open communication from certain public addresses on the internet, which isn't a bad solution. But whenever you need to replace or spin up a node, it takes a decent bit of manual intervention to reassign the Elastic IP, or to create a new one and add it back to the security groups. Another downside is that, due to Hadoop's fancy hostname resolution, we could only utilize TaskTrackers within the same availability zone as the NameNode.
Next we tried using OpenVPN to create a bridge across the two networks, but there was one small problem: because we didn't have full control over which IP addresses our servers were assigned, two machines on separate networks could end up with the same IP, which would cause all kinds of weird problems. Amazon's VPC can solve this by giving you more control over the network, but we decided to go a different route.
Instead of a classic VPN solution, we gave tinc a try. Tinc is still a VPN daemon, but rather than using it for a single point-to-point connection, it's lightweight enough to install on every server, where it exposes a new network interface to route traffic over. This allowed us to easily configure the VPN on the production system and then simply switch the interface Hadoop was listening on once everything was up and running. Here's how we set it up.
Three files make up the basic configuration. The tinc-up script is responsible for creating the new network interface. The tinc.conf file names the local node and keeps track of all the other nodes it needs to talk to. Finally, there is a host file, named after the local node set in tinc.conf, which holds the attributes for joining the VPN: the external IP, the VPN IP, the VPN subnet, and the node's public key. Since we already had everything set up in Chef, we wrote a cookbook that ran a few searches across our servers and generated all the configuration files we needed.

We also took advantage of Hadoop's rack awareness feature to make sure data is always streamed from the closest node and that replicas are spread across different data centers.

There was only one problem we came across, which again involved Hadoop's hostname resolution: MapReduce jobs were periodically failing because the TaskTracker would sometimes use the EC2 hostname and resolve a DataNode's IP incorrectly. So we had to set up a tiny DNS server to make sure hostnames always pointed to the correct tinc address. After fixing that, everything worked great. As an added benefit, we can now connect through the VPN and access files in HDFS directly using Hue.
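To make the three files more concrete, here's a rough sketch of what they might look like on a single node. The network name hadoopvpn, the node names, and the 10.x addresses are all assumptions for illustration, not our actual configuration:

```
# /etc/tinc/hadoopvpn/tinc-up -- executable script that brings up the
# interface; tincd sets $INTERFACE before calling it
ifconfig $INTERFACE 10.1.0.5 netmask 255.255.0.0

# /etc/tinc/hadoopvpn/tinc.conf -- names the local node and its peers
Name = node_a1
AddressFamily = ipv4
Interface = tun0
ConnectTo = node_b1

# /etc/tinc/hadoopvpn/hosts/node_a1 -- this node's connection attributes:
# external (Elastic) IP, VPN subnet, and public key
Address = 203.0.113.10
Subnet = 10.1.0.5/32
-----BEGIN RSA PUBLIC KEY-----
...
-----END RSA PUBLIC KEY-----
```

Each node keeps a copy of every peer's host file, which is exactly the kind of cross-server data a Chef search can gather and template out for you.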
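The post doesn't name the tiny DNS server, but one lightweight way to get the same effect is dnsmasq with an extra hosts file that answers for the internal EC2 hostnames with tinc addresses. The hostnames and addresses below are hypothetical:

```
# /etc/dnsmasq.d/tinc.conf
# Consult an extra hosts file before resolving anything upstream
addn-hosts=/etc/hosts.tinc

# /etc/hosts.tinc
# Map each node's internal EC2 hostname to its tinc VPN address, so
# Hadoop's hostname resolution always lands on the VPN interface
10.1.0.5  ip-10-32-1-5.ec2.internal
10.2.0.7  ip-10-64-2-7.ec2.internal
```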
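Rack awareness is driven by a script that Hadoop calls with a batch of IPs or hostnames, expecting one rack name per line. A minimal sketch, assuming two hypothetical tinc subnets (10.1.x for one zone, 10.2.x for the other) and made-up zone names:

```shell
#!/bin/sh
# Hypothetical Hadoop topology script (wired up via the
# topology.script.file.name property). Mapping each zone's tinc subnet
# to its own rack keeps replicas spread across data centers and lets
# Hadoop stream from the closest node.
rack_for() {
  case "$1" in
    10.1.*) echo "/us-east-1a" ;;   # zone A's tinc subnet (assumed)
    10.2.*) echo "/us-east-1b" ;;   # zone B's tinc subnet (assumed)
    *)      echo "/default-rack" ;; # anything we don't recognize
  esac
}

# Hadoop may pass several addresses at once; answer each in order.
for host in "$@"; do
  rack_for "$host"
done
```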
Since we quickly covered a lot of new technology, here are some useful resources for when you want to try this out yourself:
What other tips do you have on breaking the single availability zone barrier?