Cloud MapReduce — MapReduce built on a cloud operating system

We have finally finished a cool project, Cloud MapReduce, that we have been working on, on and off, for almost the whole past year. It is a new MapReduce implementation built on top of a cloud operating system. I described what a cloud operating system is in an earlier post. We looked hard to understand how a cloud operating system (OS) differs from a traditional OS. I think we have found the key difference: a cloud OS’s scalability. Unlike a traditional OS, a cloud OS has to be much more scalable because it must manage a large infrastructure (much bigger than a PC) and it must serve many customers. By exploiting a cloud OS’s scalability, Cloud MapReduce achieves three advantages over other MapReduce implementations, such as Hadoop:

Faster: Cloud MapReduce is faster than Hadoop, by up to 60 times in one case.

More scalable: There is no single bottleneck, i.e., no single master node that coordinates everything. It is a fully distributed implementation.

Simpler: Only 3000 lines of Java code, which means it is very easy to change to suit your needs. Have you ever thought about changing Hadoop? I get a headache just thinking about the 280K lines of code in Hadoop.

I encourage you to read the Cloud MapReduce technical report to learn more about what we have done.

Cloud is time sharing?

I was talking to Jeanne Harris, the author of the book “Competing on Analytics”, last year. I was trying to explain what Cloud Computing is and what its various benefits are. It took her a while to peel away the marketing hype, until she finally said: “I got it, it is time sharing!”

For those not familiar, time sharing is a concept introduced in 1957, and it was the prominent model of computing in the 1970s. Computers, such as mainframes, were expensive back then. To share the expensive machine more efficiently, programmers used remote terminals whose accesses were multiplexed onto a single computer.

Since the advent of PCs in the 80’s, computers have become cheaper over time and the prominent model of computing has shifted toward the client side. The main drivers are bandwidth and latency. Can you imagine running an interactive, UI-rich application over a remote terminal on a mainframe? Computing on the client buys responsiveness at the cost of lower machine utilization, which is a good tradeoff because the machine is cheap anyway.

Things have been changing again over the last decade. With the advances in search, social networking, business intelligence, etc., we are all of a sudden inundated with data to analyze. The problems we try to solve, and hence the computation capacity we need, have become dramatically bigger. Economics again favors a time-sharing model in which many people share the expensive Cloud. Because of its scale, only a handful of companies, such as Amazon, Google, Yahoo, and Microsoft, can afford to build such a Cloud.

History does come around, doesn’t it?

Cloud is more secure than your own hard disk

I received several pieces of feedback from my colleagues on my last post about the Outlook Attachment Remover. The number one response was: “Do not put our client’s data there, even if encrypted; it is against the policy”. In this post, I will discuss why the Cloud is secure and what a sensible company policy should be.

When the CIO gives us a company laptop, we promise to take full responsibility for it. We are expected to set a strong password so that no one else can log on to our machine, and we are expected to lock our screen whenever we are away. When clients send us their confidential data, they expect us to keep it in places that only we can access. We do not need client permission to store the data on our hard drive because we have promised our CIO and our clients that we will guard our laptop and hard drive.

When we request a bucket from Amazon S3, the bucket, by default, is readable and writable by us only. Similar to a password, our access to the bucket is guarded by our Amazon credentials, which consist of a 20-character alphanumeric Access Key ID and a 40-character alphanumeric Secret Access Key. We promise to keep the keys to ourselves, and Amazon promises that the access rights work as designed. So, just like our hard disk, the bucket is ours and ours alone. Why shouldn’t we be able to store our own data and our clients’ data there? Why do we need client permission?

As much as we promise, accidents do happen. Our laptops could be infected with viruses and Trojan horses, we could lose our laptops, and Amazon’s security could be breached. In the past year alone, I know of at least two incidents in which our company laptops were stolen. In contrast, I have not heard of ANY S3 security breach since the service launched three years ago. The contrast is more dramatic than you might think, because S3 has millions of customers and hosts 29 billion objects, whereas our company has far fewer employees and far fewer laptops. So, is our hard disk more secure than S3?

Since no one can say their system is 100% secure, we have to put measures in place to guard against the rare events. Our company laptops have encryption software installed. When a laptop is lost, we are safe because no one can read the data.

Now, if I encrypt my email attachments, including client data, put them in my own S3 bucket that is readable and writable by me only, and keep the password to myself, why would I need client permission? Why is it not secure? Why is it against the company’s policy? If anything, based on the track record, the CIO should ban us from storing data on our hard drives instead.
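
To make the “encrypt first, then upload” idea concrete, here is a minimal Java sketch of the encryption step only. It is not how the Outlook Attachment Remover actually works; the class name AttachmentCrypto, the choice of AES-GCM with a 128-bit key, and the blob layout (IV followed by ciphertext) are all my assumptions for illustration. The upload to S3 is deliberately left out, and the key is assumed to stay on my own machine.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

// Hypothetical helper: encrypts an attachment locally so that only
// ciphertext would ever be stored in the bucket.
final class AttachmentCrypto {
    private static final int IV_BYTES = 12;   // recommended IV size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length

    // Generate a fresh AES key; in this sketch it never leaves the laptop.
    static SecretKey newKey() throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(128);
        return gen.generateKey();
    }

    // Returns IV || ciphertext+tag. A new random IV is drawn for every call.
    static byte[] encrypt(SecretKey key, byte[] plain) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = cipher.doFinal(plain);
        byte[] blob = new byte[iv.length + ct.length];
        System.arraycopy(iv, 0, blob, 0, iv.length);
        System.arraycopy(ct, 0, blob, iv.length, ct.length);
        return blob;
    }

    // Reads the IV from the front of the blob, then decrypts and verifies the tag.
    static byte[] decrypt(SecretKey key, byte[] blob) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, blob, 0, IV_BYTES));
        return cipher.doFinal(blob, IV_BYTES, blob.length - IV_BYTES);
    }
}
```

With something like this, the bucket only ever holds the output of encrypt, so even if the S3 access keys leaked, the attachment would still be unreadable without the AES key.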