How to choose a load balancer for the cloud

If you are hosting a scalable application (e.g., a web application) in the cloud, you will have to choose a load balancing solution so that you can spread your workload across many cloud machines. Even though there are dedicated solutions out there already, how to choose one is still far from obvious. You will have to evaluate a potential solution from both the cost and performance perspectives. We illustrate these considerations with two examples.

First, let us take Amazon’s Elastic Load Balancing (ELB) offering and evaluate its cost implications. Assume you have an application that sends/receives 25Mbps of traffic on average. The data processing charge alone will cost you $0.008/GB * (25Mbps / 8 bits per byte) * 3600 sec/hour = $0.09/hour, already more than the cost of a small EC2 Linux instance in N. Virginia. At this traffic level and above, the cost makes ELB unsuitable for most applications. If your application does not have a lot of traffic, ELB makes sense economically. But for that small amount of traffic (< 25Mbps), you most likely do not need a load balancer.

We have run performance studies based on the SpecWeb benchmark, a suite of benchmarks designed to simulate realistic web applications. Even for the most computation-intensive benchmark in the suite (the banking benchmark), a small EC2 instance can handle 60Mbps of traffic. A larger c1.xlarge instance is able to process 310Mbps. This means that even if your application is 10 times more CPU intensive per unit of traffic, a c1.xlarge instance could still handle about 31Mbps, so you can comfortably host a 25Mbps application on it.

If your application has a larger amount of traffic (> 25Mbps), it is more economical to roll your own load balancer. In our test, a small EC2 instance is able to forward 400Mbps of traffic even for a chatty application with a lot of small user sessions. Based on the current pricing scheme, ELB only makes sense if your application is very CPU intensive, or if the expected traffic fluctuates widely. You can refer to our benchmarking results (CloudCom paper section 2) and calculate the tradeoff based on your own application’s profile.
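
To redo the $0.09/hour estimate for your own traffic level, here is a minimal Python sketch. It covers only the per-GB data processing charge used in the calculation above, and it assumes decimal units (1GB = 10^9 bytes, 1Mbps = 10^6 bits per second). The function name and the default price are illustrative, so plug in the current price before relying on the result.

```python
def elb_data_cost_per_hour(avg_mbps, price_per_gb=0.008):
    """Hourly ELB data-processing charge (USD) for an average traffic rate.

    Assumes decimal units: 1 Mbps = 1e6 bits/s and 1 GB = 1e9 bytes.
    """
    bytes_per_hour = avg_mbps * 1_000_000 / 8 * 3600  # bits/s -> bytes/hour
    gb_per_hour = bytes_per_hour / 1_000_000_000
    return gb_per_hour * price_per_gb


# 25 Mbps is about 11.25 GB per hour
print(f"${elb_data_cost_per_hour(25):.2f}/hour")  # prints $0.09/hour
```

Compare the result against the hourly price of the instance you would otherwise run your own load balancer on.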

Second, we have to look at the performance a load balancing solution can deliver. You cannot simply assume a solution will meet your performance requirements until you test it. For example, Google App Engine (GAE) promises unlimited scalability, where you can simply drop in your web application and Google handles the automatic scaling. Alternatively, you can run a load balancer in Google App Engine and load balance an unlimited amount of traffic. Even though it sounds promising on paper, our test shows that it cannot support more than 100 simultaneous SpecWeb sessions (< 5Mbps) due to its burst quota. To put this into perspective, we are able to run tests that support 1,000 simultaneous sessions even on a small Amazon EC2 instance. We worked with the GAE team for a while trying to resolve the limitation, but we were never able to get it working. Others have noticed its performance limitation as well. Note that this happened between Feb. and Apr. of 2009, so its limit may have improved since then.

The two examples illustrate that you have to do your homework to understand both the cost and performance implications. You have to understand your application’s profile and conduct performance studies for each potential solution. Although setting up performance testing is time consuming, we have fortunately done some of the legwork already for the common solutions. You can leverage our performance report (section 2 of our CloudCom paper). We have set up a fully automated performance testing harness, so if you have a scenario that is not covered, we will be happy to help you test it out.

The two examples also illustrate that you cannot rely on a cloud provider’s solution. In many cases, you still need to roll your own load balancing solution, for example, by running a software load balancer inside a cloud VM. The existing software load balancers differ in design, and hence in their performance characteristics. In the following, we discuss some tradeoffs in choosing the right software load balancer.

A load balancer can forward traffic either at layer 4 or at layer 7. At layer 4 (the TCP layer), the load balancer only sees packets. It inspects the header of each packet and then decides where to forward it. The load balancer does not need to terminate TCP sessions with the users and originate TCP sessions with the backend web servers; therefore, it can be implemented efficiently. Note that not all layer 4 load balancers would work in the Amazon cloud. Amazon disallows source IP spoofing, so if a load balancer just forwards the incoming packet as it is (i.e., keeping the source IP address intact), the packet would be dropped by Amazon because the source IP does not match the load balancer’s IP address.

At layer 7 (the application layer), a load balancer has to terminate a TCP connection, receive the HTTP content, and then relay the content to the web servers. For each incoming TCP session from a user, the load balancer not only has to open a socket to terminate the incoming TCP session, it also has to generate a new TCP session to one of the web servers to relay the content. Because of the extra state, a layer 7 load balancer is less efficient. This is especially bad if SSL is enabled because the load balancer has to terminate the incoming SSL connection, and possibly generate a new SSL connection to the web servers, which is a very CPU-intensive operation.
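
The source IP restriction on layer 4 forwarding can be illustrated with a toy model in Python. This is only a sketch of the idea, not how Amazon or any particular load balancer implements it: the Packet class, both functions, and all addresses are hypothetical, and the check is reduced to "the source address must be the sender's own address".

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str  # source IP address
    dst: str  # destination IP address


def forward(packet, lb_ip, backend_ip, rewrite_source):
    """Layer 4 forward: retarget the packet at a backend.

    With rewrite_source=True the load balancer puts its own IP in the
    source field; otherwise the user's IP is kept intact.
    """
    src = lb_ip if rewrite_source else packet.src
    return replace(packet, src=src, dst=backend_ip)


def passes_spoof_check(packet, sender_ip):
    """Toy anti-spoofing rule: the source IP must match the sending machine."""
    return packet.src == sender_ip


incoming = Packet(src="203.0.113.7", dst="10.0.0.5")  # user -> load balancer
kept = forward(incoming, "10.0.0.5", "10.0.0.9", rewrite_source=False)
rewritten = forward(incoming, "10.0.0.5", "10.0.0.9", rewrite_source=True)
print(passes_spoof_check(kept, "10.0.0.5"))       # False: would be dropped
print(passes_spoof_check(rewritten, "10.0.0.5"))  # True: can be delivered
```

In this model, a layer 4 load balancer has to rewrite the source address to get its packets delivered, which also means the backend sees the load balancer's address rather than the user's.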

Now that the general theory is behind us, let us look at some free load balancers out there and their performance tradeoffs.

HaProxy

HaProxy can operate in either layer 4 or layer 7 mode. However, if you want session persistence (the same user is always load balanced to the same backend server), you have to operate it at layer 7. This is because HaProxy uses cookies to implement session persistence, and to manipulate cookies, it has to operate at layer 7. Using cookies removes the need to keep local session state. Probably due to this (at least partly), HaProxy performs really well in our test. It has almost the same efficiency as layer 4 load balancers for non-SSL traffic.
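
To see why cookies remove the need for local session state, here is a minimal Python sketch of cookie-based persistence. It is an illustration of the general technique, not HaProxy's actual code: the cookie name SERVERID, the backend table, and the choose_backend function are all assumptions made for the example.

```python
import itertools

# Hypothetical backend pool: server id -> address
BACKENDS = {"s1": "10.0.0.1:80", "s2": "10.0.0.2:80"}
_round_robin = itertools.cycle(sorted(BACKENDS))


def choose_backend(cookies):
    """Pick a backend for one HTTP request.

    cookies is a dict of the request's cookies. Returns a tuple
    (server_id, set_cookie), where set_cookie is the cookie to add to
    the response, or None if the user already carries a valid one.
    """
    server_id = cookies.get("SERVERID")
    if server_id in BACKENDS:
        # Returning user: the cookie alone tells us where to send them.
        return server_id, None
    # New user (or stale cookie): pick the next server and remember it
    # in the user's browser instead of in a local session table.
    server_id = next(_round_robin)
    return server_id, f"SERVERID={server_id}"
```

The load balancer keeps no per-user table here; the only state is the round-robin position. Reading and writing the cookie requires parsing HTTP, which is why this only works at layer 7.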

One drawback of HaProxy is that it does not support SSL termination. Therefore, you have to run a front end (e.g., an Apache web server) to terminate the SSL first. If the front end is hosted on the same server, it reduces how much traffic can be load balanced. In fact, SSL termination and origination (to the backend web servers) can significantly drain the CPU capacity. If it is hosted on a different server, the traffic between the SSL terminator and the load balancer is in the clear, making eavesdropping easy.

Nginx

Nginx operates at layer 7. It can run either as a web server or as a load balancer. In our performance test, Nginx consumes roughly twice as many CPU cycles as the layer 4 load balancers. The overhead is much greater when SSL termination is enabled.

Unlike HaProxy, Nginx natively supports SSL termination. Unfortunately, the backend traffic from the load balancer to the web servers is in the clear. Depending on how much eavesdropping you believe could happen in a cloud’s internal network, this may or may not be acceptable to you.

Rock Load Balancer

Rock Load Balancer operates at layer 4. Among the three load balancers we have evaluated, it has the highest performance. In particular, it seems that it can forward SSL traffic without terminating and re-originating connections. This saves a lot of CPU cycles for SSL traffic. Unfortunately, Rock Load Balancer still has an open bug where it cannot effectively utilize all cores in a multi-core machine. Thus, it is not suitable for very high-bandwidth (> 400Mbps) web applications, which require multi-core CPUs in the load balancer.

I have quickly summarized the key pros and cons of the software load balancers we have evaluated. I hope it helps you decide which load balancer to choose. If you have a good estimate of your application’s profile, please feel free to ping me and we would be happy to help.

10 Responses to How to choose a load balancer for the cloud

  1. Arthur says:

    You can run haproxy with pound as the front end to handle https to great effect and with little additional CPU load.

    • huanliu says:

      Pound itself is a great load balancer. I wonder why one would not use Pound alone instead of using it along with HaProxy? Although we have not evaluated Pound in our performance setup, our experience is that SSL termination is quite expensive. It would be great if you could comment on Pound’s overhead quantitatively. Thanks for your comment.

  2. Michael Lenaghan says:

    I think that some of your Elastic Load Balancer (ELB) conclusions aren’t quite right.

    ELB provides you with a DNS name. That DNS name gets resolved by clients to an IP address. That’s where the “Elastic” in “Elastic Load Balancer” comes in: as your traffic increases Amazon will invisibly add more ELBs behind that DNS name.

    That, of course, means that you have to test ELB the right way. First, you have to make sure you don’t cache DNS names throughout the entire test. (Some testing tools–and some runtimes–do that by default.) Second, you need more than one test client in order to see and use more than one resolved IP address. Third, you have to ramp up traffic over a period of time in order to see ELB itself scale–ie, in order for there to be multiple IP addresses behind the DNS name.

    Ultimately, there is no known limit to what ELB can handle.

    I think your comments miss the mark in some other key areas too. Since a load balancer sits right at your front door it can’t fail. That means that if you’re running a site that matters you’d need a fail-over instance–doubling costs and increasing complexity.

    Now take it one step further and imagine pursuing high availability by running servers in multiple availability zones. There are few mechanisms to distribute traffic evenly across zones. You could use DNS round-robin, of course–but now you’ve torpedoed your high availability *and* you’d need a load balancer with hot fail-over in each zone. (Similar issues would arise if you had too much traffic for a single instance to handle.)

    ELB isn’t perfect–and the documentation is lacking (to be polite). But in many cases it might actually be more economical than the alternatives.

    • huanliu says:

      I do not think I made any conclusion on ELB. The whole point of the post is that there is no one thing that fits all.

      In fact, I did not talk about ELB’s performance at all, and I did it for a reason — because we do not have enough insight into how it works. As far as we can tell, it is using DNS load balancing, as you have pointed out. We have done some preliminary performance test and our initial observation is that ELB is very scalable.

      I like your point about comparing cost with ELB for a high-availability scenario. If you have a hot standby load balancer, you have to double the cost for your load balancing solution. However, in my dealings with many clients, I have never come across anyone who is interested in more than two availability zones. Given how fast you can start another standby, and given how unlikely it is for two availability zones to fail at the same time, it seems that more than one hot standby is overkill.

      I think you may be arguing that roll-your-own costs more because you have to spend the engineering effort. In reality, you would want to use a third-party pre-built solution. For example, there is Scalr. We also have a pre-built solution, called WebScalar, for our clients, free of charge. Regardless, there is no escape from doing one’s homework. One has to understand the application’s profile and the tradeoffs of the various load balancing solutions, and then choose wisely.

      • Michael Lenaghan says:

        “In fact, I did not talk about ELB’s performance at all…”

        Sorry; I wasn’t very clear. I was referring to section 2.5 of the CloudCom paper you link to above. In that paper you say:

        “Amazon ELB fails to meet the QoS requirement above 850 sessions and refuses to serve more than 950 sessions complaining that the server is too busy. Therefore, we conclude that Amazon ELB’s current scalability is about the same as a single instance Rock load balancer.”

      • huanliu says:

        We did more investigation after the paper was published. It appears ELB takes quite some time to scale beyond a single node. While it is scaling out, our automated test probably is still banging the same node, causing the test to fail. In a real-life scenario, users may experience momentary disruption, but we believe ELB can scale up further.

  3. globehost.in says:

    Thanks for such great posts.

  4. Pingback: Migrating your servers to Amazon EC2: Load balancing « Off the record - Craig Box

  5. At Jovian Networks, we’re working to build a loadbalancer as a cloud service that is functionally equivalent to an F5 or Netscaler. I think your points were salient and I’m always happy to see quantitative analysis, but I think load-balancers as they’re currently constituted in the cloud also lack essential features.

    I’d welcome any and all opinions on the subject.

  6. Joe Marty says:

    I appreciate your evaluation of alternatives to ELB! Thanks.

    I just wanted to reply because I have been very confused about the pricing structure, and your cost analysis wasn’t clear enough to make me understand. So for anybody who is still confused, I believe I finally understand how to calculate how much ELB actually costs:

    There is a single hourly charge: $0.025/hour (slightly more than a micro instance)
    Then there is a data processing charge: $0.008/GB
    THEN there is the regular data transfer charge, which depends on volume and can be found here, under Data Transfer: http://aws.amazon.com/ec2/pricing/

    Since you will be using the regular data transfer charge whether you have a load-balancer or not, that cost always balances out, and can be left out of the equation. So what you have to compare is the first two ELB charges (hourly and data processing) against the cost of whatever EC2 instance(s) you would use to replace ELB. Hence the comparison in the article🙂
