Hadoop Across Availability Zones
Hadoop is a great solution to the big data problem, and with instant access to servers and storage in the cloud, it's easier than ever to spin up and manage your own cluster. If you haven't heard much about it yet, Hadoop provides a distributed file system along with a framework for running MapReduce jobs over your data; it takes care of replicating chunks of data to each node and running jobs in parallel for you. However, when you want to expand your Hadoop cluster across availability zones, you can run into some unexpected problems. So let's dig into the ideas we tried and the final solution that worked best for our configuration.
Communicating across data centers is an old problem with a lot of different solutions, so we started with the simplest thing we could think of: assigning an Elastic IP to each node and adding those addresses to the security group of the opposite availability zone. This basically allows open communication from certain public addresses on the internet, which isn't a bad solution. But whenever you need to replace or spin up a node, it takes a decent bit of manual intervention to reassign the Elastic IP, or to create a new one and add it back to the security groups. Another downside is that, due to Hadoop's fancy hostname resolution, we could only utilize TaskTrackers within the same availability zone as the NameNode.
Next we tried using OpenVPN to create a bridge across the two networks, but there was one small problem: because we didn't have full control over which IP addresses our servers were assigned, two machines on separate networks could end up with the same IP, which would cause all kinds of weird problems. Amazon's VPC can solve this by giving you more control over the network, but we decided to go a different route.
Instead of a classic VPN solution, we gave tinc a try. Tinc is still a VPN daemon, but rather than using it for a single point-to-point connection, it's lightweight enough to install on every server, where it exposes a new network interface to route traffic over. This allowed us to easily configure the VPN on the production system and then simply switch the interface Hadoop was listening on once everything was up and running. Here's how we set it up.
Three files make up the basic configuration. The tinc-up script is responsible for creating the new network interface. The tinc.conf file names the local node and keeps track of all the other nodes it needs to talk to. Finally, there is a host file, named after the local node set in tinc.conf, which holds the attributes for joining the VPN: the external IP, the VPN IP, the VPN subnet, and the node's public key. Since we already had everything set up in Chef, we wrote a cookbook that ran a few searches across our servers and generated all the configuration files we needed.

We also took advantage of Hadoop's rack awareness feature to make sure data is always streamed from the closest node and that replicas are spread across different data centers.

There was only one problem we came across, which again involved Hadoop's hostname resolution: MapReduce jobs were periodically failing because the TaskTracker would sometimes use the EC2 hostname and resolve a DataNode's IP incorrectly. So we had to set up a tiny DNS server to make sure hostnames always pointed to the correct tinc address. After fixing that, everything worked great. As an added benefit, we can now connect through the VPN and access files in HDFS directly using Hue.
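To make the three files more concrete, here's a rough sketch of what they might look like on a single node. The network name hadoopvpn, the node names, and the 10.x addresses are all assumptions for illustration, not our actual configuration:

```
# /etc/tinc/hadoopvpn/tinc-up -- executable script that brings up the
# interface; tincd sets $INTERFACE before calling it
ifconfig $INTERFACE 10.1.0.5 netmask 255.255.0.0

# /etc/tinc/hadoopvpn/tinc.conf -- names the local node and its peers
Name = node_a1
AddressFamily = ipv4
Interface = tun0
ConnectTo = node_b1

# /etc/tinc/hadoopvpn/hosts/node_a1 -- this node's connection attributes:
# external (Elastic) IP, VPN subnet, and public key
Address = 203.0.113.10
Subnet = 10.1.0.5/32
-----BEGIN RSA PUBLIC KEY-----
...
-----END RSA PUBLIC KEY-----
```

Each node keeps a copy of every peer's host file, which is exactly the kind of cross-server data a Chef search can gather and template out for you.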
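The post doesn't name the tiny DNS server, but one lightweight way to get the same effect is dnsmasq with an extra hosts file that answers for the internal EC2 hostnames with tinc addresses. The hostnames and addresses below are hypothetical:

```
# /etc/dnsmasq.d/tinc.conf
# Consult an extra hosts file before resolving anything upstream
addn-hosts=/etc/hosts.tinc

# /etc/hosts.tinc
# Map each node's internal EC2 hostname to its tinc VPN address, so
# Hadoop's hostname resolution always lands on the VPN interface
10.1.0.5  ip-10-32-1-5.ec2.internal
10.2.0.7  ip-10-64-2-7.ec2.internal
```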
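Rack awareness is driven by a script that Hadoop calls with a batch of IPs or hostnames, expecting one rack name per line. A minimal sketch, assuming two hypothetical tinc subnets (10.1.x for one zone, 10.2.x for the other) and made-up zone names:

```shell
#!/bin/sh
# Hypothetical Hadoop topology script (wired up via the
# topology.script.file.name property). Mapping each zone's tinc subnet
# to its own rack keeps replicas spread across data centers and lets
# Hadoop stream from the closest node.
rack_for() {
  case "$1" in
    10.1.*) echo "/us-east-1a" ;;   # zone A's tinc subnet (assumed)
    10.2.*) echo "/us-east-1b" ;;   # zone B's tinc subnet (assumed)
    *)      echo "/default-rack" ;; # anything we don't recognize
  esac
}

# Hadoop may pass several addresses at once; answer each in order.
for host in "$@"; do
  rack_for "$host"
done
```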
Since we quickly covered a lot of new technology, here are some useful resources for when you want to try this out yourself:
What other tips do you have on breaking the single availability zone barrier?