Open sourcing Cloud MapReduce

After a lengthy review process, I finally received approval to open source Cloud MapReduce, an implementation of MapReduce on top of the Amazon cloud Operating System (OS). It was developed as part of a research project at Accenture Technology Labs. This shows that Accenture is not only committed to using open source technology, but also to continuing its contributions to the community.

MapReduce was invented by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google’s production indexing system was converted to MapReduce. Since then, it has quickly proven to be applicable to a wide range of problems. For example, roughly 10,000 MapReduce programs had been written at Google by June 2007, and there were 2,217,000 MapReduce job runs in the month of September 2007 alone.

MapReduce has enjoyed wide adoption outside of Google too. Many enterprises are increasingly facing the same challenge of dealing with large amounts of data. They want to analyze and act on their data quickly to gain a competitive advantage, but their existing technology cannot keep up with the workload. MapReduce could be the perfect answer to this challenge.

There are already several open source implementations of MapReduce. The most popular one is Hadoop, which has recently gained a lot of traction in the market; Amazon even offers an Elastic MapReduce service that provides Hadoop on demand. However, even after three years of dedicated work by many engineers, Hadoop still has many limitations. For example, Hadoop is still based on a master/slave architecture, where the master node is not only the scalability bottleneck but also a single point of failure. The reason is that implementing a fully distributed system is very difficult.

Cloud MapReduce is not just another implementation; it is not a clone of Hadoop. Instead, it is based on a totally different concept. Hadoop is complex and inefficient because it is designed to run on bare-bones hardware; as a result, Hadoop has to implement many functionalities to make a cluster of servers appear as a single big server. In comparison, Cloud MapReduce is built on top of the Amazon cloud Operating System (OS), using cloud services such as S3, SQS and SimpleDB. Even though a cloud service could be running on many servers behind the scenes, Amazon presents a single-big-server abstraction, which greatly simplifies a MapReduce implementation.

By building on the Amazon cloud OS, Cloud MapReduce achieves three key advantages over Hadoop.

  • It is faster. In one case, it is 60 times faster than Hadoop (the actual speedup depends on the application and the input data).
  • It is more scalable and more resistant to failure. It is fully distributed, with no scalability bottleneck and no single point of failure.
  • It is dramatically simpler. It has only 3,000 lines of code, two orders of magnitude smaller than Hadoop.

All these advantages translate directly into lower cost, higher reliability and faster turnaround, helping enterprises gain a competitive advantage.

On the surface, it looks surprising that a simple implementation like Cloud MapReduce could outperform Hadoop. However, if you factor in the efforts of the hundreds of Amazon engineers whose work underlies the cloud services, it is natural that we are able to build a more scalable and higher-performance system. Cloud MapReduce demonstrates the power of leveraging cloud services for application design.

Cloud MapReduce has an ambitious vision, so there are many areas where we are looking for help from the community. Even though Cloud MapReduce was initially developed on the Amazon cloud OS, we envision that it will run on many cloud services in the future. For example, it could be ported to Windows Azure, filling a gap in Azure, which currently has no large-scale data processing framework at all (Hadoop does not run on Azure). The ultimate goal is to run Cloud MapReduce inside a private cloud. We envision that an enterprise would deploy similar cloud services behind the firewall, so that Cloud MapReduce can build directly on top of them. There are already open source projects working toward that vision, such as Project Voldemort for storage and ActiveMQ for queuing.

Check out the Cloud MapReduce project. We welcome your contributions. Please join the Cloud MapReduce discussions to share your thoughts.

What is a Cloud Operating System?

You know a word is new if you cannot find a definition for it on Wikipedia. By that test, "Cloud Operating System" is clearly not well known yet. Like the word "Cloud" it derives from, it does not have a precise definition yet, but I will give you a sense of what it is.

We are all familiar with an Operating System (OS), since we use one every day. Be it Microsoft Windows, Apple Mac OS or even Linux, it is the indispensable software that makes our PCs run. An operating system manages the machine's resources, abstracts away the underlying hardware complexity and exposes useful interfaces to upper-layer applications. A traditional OS manages resources within the machine boundary (such as the CPU, memory, hard disk, and network), but it has no visibility beyond the box.

Like a traditional OS, a cloud OS also manages hardware resources, but at a much higher level. It manages many servers, not only within a single data center, but potentially spanning multiple data centers at geographically distributed locations. Cloud OSs operate at a scale much larger than what we are used to. For example, Google manages millions of servers, while Amazon manages hundreds of thousands of servers.

There are already two cloud OSs open for public usage: Amazon and Microsoft Azure. The services they offer have clear parallels in a traditional OS.

  • Amazon EC2 & Microsoft Azure workers: These services manage computing resources. They are similar to processes and threads in a traditional OS. But instead of scheduling processes/threads on a CPU, the cloud OS schedules the computing resources in a cluster of servers.
  • Amazon S3 & Microsoft Azure Blob: These services manage the storage resources. They are similar to the file system in a traditional OS.
  • Amazon SimpleDB & Microsoft Azure table: These services provide a central persistent state storage. This is similar to the Registry in a Windows OS. Other traditional OSs have similar mechanisms to store persistent states.
  • Amazon SQS & Microsoft Azure queue: These services provide a mechanism that allows different processes to communicate asynchronously. It is much like the service provided by a pipe in a Unix OS, such as Linux (see the sketch after this list).
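
To make the pipe analogy concrete, here is a minimal sketch of two processes communicating through SQS. It assumes boto3; the queue name, region and message body are placeholders I made up for illustration.

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# "Writer" process: create (or look up) the queue and push a message into it.
queue_url = sqs.create_queue(QueueName="demo-pipe")["QueueUrl"]
sqs.send_message(QueueUrl=queue_url, MessageBody="hello from process A")

# "Reader" process: pull the message, then delete it so it is not redelivered.
resp = sqs.receive_message(QueueUrl=queue_url,
                           MaxNumberOfMessages=1,
                           WaitTimeSeconds=5)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

Unlike a Unix pipe, a standard SQS queue does not guarantee strict message ordering, but the asynchronous producer/consumer pattern is the same.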

There are other cloud OSs besides Amazon's and Microsoft's. Google has a cloud OS running internally; for example, its GFS file system manages storage at the data-center level. I am not talking about Chrome here, but about the software stack Google uses internally. Although marketed as a cloud OS, Chrome is really just a browser, i.e., an application running on a traditional OS. In addition, VMware is working hard on its vCloud offering, which promises to manage an internal cloud (although it covers only compute resources so far; no other services are provided yet).

Compared to the myriad services offered by a traditional OS, a cloud OS is still immature. More services are likely to appear in the future to make a cloud OS easier to use. Watch out for more offerings from the likes of Amazon, Microsoft and Google.

Eventual consistency, its manifestation in Amazon Cloud and how to design your applications around it

A cloud is such a large-scale system that its implementation requires tradeoffs among its design goals. The Amazon cloud has adopted a weaker consistency model called eventual consistency, which has implications for how we architect and design applications to overcome its limitations. This post describes the problems this weaker consistency model exposes and how we designed our applications around them.

Eric Brewer, a professor at UC Berkeley and founder of Inktomi, conjectured that a large system such as a cloud can only achieve two out of three properties: data consistency, system availability and tolerance to network partitions. Being a public infrastructure, a cloud cannot sacrifice system availability. Due to its scale and performance requirements, a cloud has to rely on a distributed implementation, so it cannot sacrifice its tolerance of network partitioning either. Therefore, a cloud has to trade off data consistency, and that is exactly what the Amazon cloud has done.

Eventual consistency means that when you write to and then immediately read from the cloud, the result may not be exactly what you expect. However, the system guarantees that it will "eventually" return what you expect; the time it takes, though, is nondeterministic.

In the Amazon implementation, the effects of eventual consistency are noticeable on only two fronts, both in the SQS (Simple Queue Service) implementation.

  • When two clients ask SQS to create the same queue at the same time, both could be notified that the queue has been created successfully. Normally, if you create the same queue twice at different times, the second attempt gets an error message stating that the queue already exists. This could be a problem if your application relies on the fact that a queue can only be created once. In one of the applications we developed, we had to completely re-architect it to get around the problem. Since the solution is application specific, I will not elaborate on how we solved it.
  • When a client reads from an SQS queue, especially when the queue is getting close to empty, the queue may report that it is empty even though there are still messages in it. This could be a problem if your application assumes the queue is empty when it is not. The Amazon documentation says that, in most cases, you only need to wait "a few seconds", but it is unclear how many is "a few", since it depends on many factors, such as SQS load, data distribution and degree of replication.

We have devised a solution for the second problem. The solution is more widely applicable; hence, we describe it here in the hope that it is helpful to someone.

The idea is simple: a message producer counts how many messages it has generated for a particular queue, so that the consumers of that queue know how many messages to expect. In the solution we used for one application, we used Amazon SimpleDB to host this count data. Since we have many producers for each queue, we require each producer to write a separate row in SimpleDB, with all counts for a queue falling under the same column. A sample SimpleDB table is shown below:

              queue1    queue2    queue3    ...
producer1     c_11      c_12      c_13
producer2     c_21      c_22      c_23
...
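
For concreteness, here is a minimal sketch of the producer side. It assumes boto3's SimpleDB client; the domain name "mr_counts", the producer ID and the count are placeholders I made up, not part of the original application.

import boto3

sdb = boto3.client("sdb", region_name="us-east-1")
sdb.create_domain(DomainName="mr_counts")        # placeholder domain name

def record_count(producer_id, queue_name, count):
    # Each producer writes its own row (item); the queue name is the column.
    sdb.put_attributes(
        DomainName="mr_counts",
        ItemName=producer_id,                    # e.g. "producer1"
        Attributes=[{"Name": queue_name,         # e.g. "queue1"
                     "Value": str(count),
                     "Replace": True}],
    )

record_count("producer1", "queue1", 42)          # hypothetical count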

When the consumers start to process (a separate status flag may be needed so they know that all producers have already finished), they can run a query against SimpleDB, such as

select queue1 from domain_name

The query returns a list of items. The consumers just need to iterate through the list and sum up all the numbers under the same column (e.g., queue1). The sum is what the consumers should check against: if the number of messages dequeued is less than the sum, the consumers should continue polling SQS. This solution avoids setting an arbitrary wait time, which would either result in wasted idle time or in missed messages.
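
Continuing the assumptions from the producer sketch above (boto3 and the placeholder "mr_counts" domain), a minimal consumer loop could look like this; SimpleDB result pagination and duplicate deliveries are ignored for brevity.

import boto3

sdb = boto3.client("sdb", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

def expected_count(queue_name):
    # Sum all producers' counts stored under this queue's column.
    resp = sdb.select(
        SelectExpression="select `%s` from `mr_counts`" % queue_name)
    total = 0
    for item in resp.get("Items", []):
        for attr in item["Attributes"]:
            total += int(attr["Value"])
    return total

def drain_queue(queue_url, queue_name):
    # Keep polling until we have consumed as many messages as were produced,
    # even if SQS occasionally reports the queue as empty.
    target, seen = expected_count(queue_name), 0
    while seen < target:
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=5)
        for msg in resp.get("Messages", []):
            # ... process msg["Body"] here ...
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
            seen += 1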

I hope this solution is helpful. Let me know if there are other symptoms of eventual consistency that I have missed.

Cloud is more secure than your own data center

I have given many cloud computing presentations to our clients in the past year. Everyone is interested, but all are concerned that the cloud is not secure. My answer to them is that the cloud, at least the Amazon cloud, is more secure than your own data center, for two reasons.

The first reason is that Amazon gives you greater, instant control over your firewall settings. In a typical IT infrastructure, we may have one firewall for the whole organization, or at best one for each division, and this firewall is controlled by some central IT staff. There are three problems with this architecture. First, the firewall only protects you from people outside your organization; it is ineffective if the person in the next cubicle decides to attack you or if his laptop is infected with a virus. Second, you have no visibility into the firewall settings. A port could be left open for hackers to exploit, but since you do not know about it, you would not have put the necessary countermeasures in place. Third, you cannot easily change the security settings when necessary. For example, suppose you find a security hole and want to block future exploits. You have to submit a form to IT to request a firewall change, and it takes at least a week to implement. Meanwhile, your application remains vulnerable.

In contrast, Amazon gives you as many security groups (their term for their software-based firewall) as you want, and, as the application owner, you have direct control over their settings. You can take advantage of security groups to get fine-grained security control, even at the level of individual application components. For example, let us consider a simple two-tier intranet application with several application servers and one database server. We create two security groups, called "appserver" and "dbserver" respectively. By default, all access is off, and you have to explicitly enable permissions for each security group. We add one rule to the "appserver" security group which says that only people from your IP range (e.g., x.y.z.w/24) can access port 80. Then we add another rule to the "dbserver" security group which says that only instances in the "appserver" security group can access it. Once the rules are set up, you can fire up the application servers in the "appserver" security group and then the database server in the "dbserver" security group. Now people from your organization can access your application as you designed, but a hacker has to jump through extra hoops to get to your database. First, the hacker has to gain access to a server in your intranet; from there, he can only exploit port 80 to gain access to the application servers. Even if he is successful, he still has to reach the database server through the "appserver" group, because that is the only group allowed to talk to the "dbserver" group. The chance of breaking into one machine is already low; the probability of breaking into all three, in sequence, is essentially zero.
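
Here is a minimal sketch of that setup, assuming boto3's EC2 client. The group names come from the example above, while the CIDR block (203.0.113.0/24 standing in for your x.y.z.w/24 range) and the database port are assumptions I made for illustration.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

app_id = ec2.create_security_group(
    GroupName="appserver", Description="application tier")["GroupId"]
db_id = ec2.create_security_group(
    GroupName="dbserver", Description="database tier")["GroupId"]

# Rule 1: only the organization's IP range may reach the app tier on port 80.
ec2.authorize_security_group_ingress(
    GroupId=app_id,
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
                    "IpRanges": [{"CidrIp": "203.0.113.0/24"}]}])

# Rule 2: only members of the "appserver" group may reach the database tier
# (port 3306 is an assumption; the example above names no port).
ec2.authorize_security_group_ingress(
    GroupId=db_id,
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
                    "UserIdGroupPairs": [{"GroupId": app_id}]}])

Instances launched into the "appserver" and "dbserver" groups then pick up these rules automatically.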

The second reason that Amazon is more secure is that they have disabled all layer-2 functionality. Layer-2 capabilities are key to many security exploits. For example, you cannot spoof your IP address from an Amazon server: if you send packets with a source IP address different from the one you were assigned, the packets are simply dropped by the hypervisor. Also, if you enable "promiscuous" mode to snoop traffic on the network, it is simply ineffective. Lastly, if you ARP for any IP address to find out who is nearby, you always get back the gateway's MAC address, so you cannot tell who is sitting in the same subnet.

Obviously, I am not alone in saying that the cloud is secure. Check out Gnucitizen, for example.

Cloud is more secure than your own hard disk

I received several pieces of feedback from my colleagues on my last post about the Outlook Attachment Remover. The number one response was: "Do not put our clients' data there, even if encrypted; it is against the policy." In this post, I will discuss why the cloud is secure and what a sensible company policy should be.

When the CIO gives us a company laptop, we promise to take full responsibility for it. We are expected to set a strong password so that no one else can log on to our machine, and to lock our screen whenever we are away. When clients send us their confidential data, they expect us to secure it in places where only we have access. We do not need client permission to store the data on our hard drive, because we have promised our CIO and our clients that we will guard our laptop and hard drive.

When we request a bucket from Amazon S3, the bucket is, by default, readable and writable by us only. Similar to a password, our access to the bucket is guarded by our Amazon credentials, which include both a 20-character alphanumeric Access Key ID and a 40-character alphanumeric Secret Access Key. We promise to keep the keys to ourselves, and Amazon promises that the access controls work as designed. So, just like our hard disk, the bucket is ours and ours alone. Why should we not be able to store our own and our clients' data there? Why do we need client permission?

As much as we promise, accidents do happen. Our laptops could be infected with viruses and Trojan horses, we could lose our laptops, or Amazon's security could be breached. In the past year alone, I know of at least two incidents in which our company laptops were stolen. In contrast, I have not heard of ANY S3 security breach since the service launched three years ago. The contrast is more dramatic than you might think, because S3 has millions of customers and hosts 29 billion objects, whereas our company has far fewer employees and far fewer laptops. So, is our hard disk really more secure than S3?

Since no one can claim their system is 100% secure, we have to put measures in place to guard against rare events. Our company laptops have encryption software installed, so when a laptop is lost, we are safe because no one can read the data.

Now, if I encrypt my email attachments, including client data, put them in my own S3 bucket that is readable and writable by me only, and keep the password to myself, why would I need client permission? Why is it not secure? Why is it against the company's policy? If anything, based on the past track record, the CIO should ban us from storing data on our hard drives instead.
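
As a minimal sketch of that workflow, assuming boto3 and the cryptography package (the bucket name and file names are placeholders I made up):

import boto3
from cryptography.fernet import Fernet

s3 = boto3.client("s3")
bucket = "my-private-attachments"   # placeholder; new buckets are private by default
s3.create_bucket(Bucket=bucket)

key = Fernet.generate_key()         # keep this key to yourself, like a password
with open("attachment.pdf", "rb") as f:
    ciphertext = Fernet(key).encrypt(f.read())

# Only the holder of the AWS credentials can read the object, and only the
# holder of the encryption key can decrypt it.
s3.put_object(Bucket=bucket, Key="attachment.pdf.enc", Body=ciphertext)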

An “unlimited” email inbox in the Cloud

Do you work for one of those stingy companies that give you only a tiny email inbox? My CIO gives me 120MB, which ran out two months after I joined the company. Even if you have a bigger one, it will run out fast enough, because everyone likes to send large attachments around.

If you are like me, you spend hours each week cleaning up your inbox, archiving your emails and backing up your data. Well, I am happy to report that help is finally here. There is a new Outlook Attachment Remover from a startup that can detach your attachments and embedded images and put them in the Amazon cloud (i.e., S3). For $0.15/GB/month (the price Amazon charges), you can get rid of all the hassle and have unlimited storage.

When I first started using the software, I ran a few experiments to see how well it performs. I have an archive folder with 12,000 messages, about 890MB in size. The size is small because I had deleted most attachments before putting the emails into my archive folder. After I converted them, my archive folder shrank to about 400MB, which is very impressive given that I did not have many attachments in the folder. I guess those embedded images take up quite a bit of space.

Next, I ran the same test on my working folder, which holds all the important emails I need to keep around for reference. They all have their original attachments, because that is why I keep them in my working folder. It has 400 messages totaling about 200MB. After the conversion, it is down to 25MB. Wow! That is a size that fits into the mailbox my CIO gives me.

After I converted my old mail, I just enabled the "auto-detach" option, so whenever a new mail arrives, the attachments are automatically stripped and stored in Amazon S3. If I want to convert a message back for whatever reason, all I have to do is click the "re-attach" button.

I have been using the product for a few weeks and I am quite happy with it. I hope you find the tool useful too.