Google’s MapReduce patent and its impact on Hadoop and Cloud MapReduce

It is widely covered that Google finally received its patent on MapReduce, after several rejections. Derrick argued that Google would not enforce its patent because Google would not “risk the legal and monetary consequences of losing any hypothetical lawsuit“. Regardless of its business decision (whether to risk or not), I want to comment on the technical novelty aspects. Before I proceed, I have to disclaim that I am not a lawyer, and the following does not constitute a legal advice. It is purely a personal opinion based on my years of experience as a consulting expert in patent litigations.

First of all, in my view, the patent is an implementation patent, where it covers the Google implementation of the MapReduce programming model, but not the programming model itself. The independent claims 1 (system claim) and 9 (method claim) both describe in details the Google implementation including the processes used, how the operators are invoked and how to coordinate the processing.

The reason that Google did not get a patent on the programming model is because the model is not novel, at least in legal terms (that is probably why the patent took so long to be granted). First, it borrows ideas from functional programming, where the idea of “map” and “reduce” has been around for a long time. As pointed out by the database community, MapReduce is a step backward partly because it is “not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago”. Second, the User Defined Function (UDF) aspect is also a well known idea in the database community,  which has been implemented in several database product before Google’s invention.

Even though it is arguable whether the programming model is novel in  legal terms, it is clear to me that the specific Google implementation is novel. For example, the fine grain fault tolerance capability is clearly missing in other products. A recent debate on MapReduce vs. DBMS would shed light on what aspects of MapReduce is novel, see CACM articles here, and here, so I would not elaborate further.

Let us first talk about what the patent means to Cloud MapReduce. The answer is: Cloud MapReduce does not infringe. The independent claims 1 and 9 state that “the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, and worker processes“. Since Cloud MapReduce does not have any master node, it clearly does not infringe. Cloud MapReduce uses a totally different architecture than what Google described in their MapReduce paper, so it only implements the MapReduce programming model, but does not copy the implementation.

For Hadoop, my personal opinion is that it infringes the patent, because Hadoop exactly copies the Google implementation as described in the Google paper. If Google enforces the patent, Hadoop can do several things. First, Hadoop can find an invalidity argument, but I personally think it is hard. The Google patent is narrow, it only covers the specific Google implementation of MapReduce. Given how widely MapReduce is known, if there were a similar system, we would have known about it by now. Second, Hadoop could change its implementation. The patent claim language includes many “wherein” clauses. If Hadoop does not meet any one of those “wherein” clauses, it can be off the hook. The downside, though, is that a change in implementation could introduce a lot of inefficiencies. Last, Hadoop can adopt an architecture like Cloud MapReduce‘s. Hadoop is already moving in this direction. The latest code base moved HDFS into a separate module. This is the right move to separate out functions into independent cloud services. Now only if Hadoop can implement a queue service, Cloud MapReduce can port right over :-).

Top five reasons you should adopt Cloud MapReduce

There are a lot of MapReduce implementations out there, including the popular Hadoop project. So why would you want to adopt a new implementation like Cloud MapReduce? I list the top five reasons here.

1. No single failure point.

Almost all other MapReduce implementations adopted a master/slave architecture as described in Google’s MapReduce paper. The master node presents a single point of failure. Even though there are secondary nodes, failure recovery is still a hassle at best. For example, in the Hadoop implementation, the secondary node only keeps a log. When the primary master fails, you have to bring back up the primary, then replay the log file in the secondary master. Many enterprise clients we work with simply cannot accept a single point of failure for their critical data.

2. Single storage location.

When running MapReduce in a cloud, most people store their data permanently in the cloud storage (e.g., Amazon S3), and copy over their data to the Hadoop file system before they start the analysis. The copy stage not only wastes valueable time, but it is also a hassle to maintain two copies of the same data. In comparison, Cloud MapReduce stores everything in a single location (e.g., Amazon S3) and all accesses during analysis go directly to the storage location. In our test, Amazon S3 can sustain a high throughput and it is not a bottleneck in analysis.

3. No cluster configuration.

Unlike other MapReduce implementations, you do not have to setup a cluster first, e.g., setup a master and then add in slaves. You simply launch a number of machines and each will be working away on the job. Further, there is no hassle when you need to dynamically reconfigure your cluster. If you feel the job progress is too slow, you can simply launch more machines, and they will join the computation right away. No complicated cluster reconfiguration is needed.

4. Simple to change.

Some applications do not fit the MapReduce programming model. One can try to change the application to fit the rigid programming model, which will result in either inefficiency or complicated change or setup on the framework (e.g., Hadoop). With Cloud MapReduce, you can easily change the framework to suit your needs. Since there are only 3,000 lines of code, it is easy to change.

5. Higher performance.

Cloud MapReduce is faster than Hadoop in our study. The exact speed up really depends on the application. In one representative case, we saw a 60x speedup. This is neither the maximum nor the minimum speedup you can get. We could massage the data (e.g., having more and even smaller files) to show a much bigger speedup, but we decide to make the experiment more realistic (uses the “reverse index” application — the application the MapReduce framework was designed for — and a public set of data to enable easy replication). One may argue that the comparison is unfair becasue Hadoop is not designed to handle small files. It is true that we can apply bandit to Hadoop to close the gap, but the experiment is really a scaled down version of a large-scale test with many large files and many slave nodes. The experiment highlights a bottleneck in the master/slave architecture that you will eventually encounter. Even without hitting the scalability bottleneck, Cloud MapReduce is faster than Hadoop. The detailed reasons are listed in the paper.

    Open sourcing Cloud MapReduce

    After a lengthy review process, I finally received the approval to open source Cloud MapReduce — an implementation of MapReduce on top of the Amazon cloud Operating System (OS). It was developed as part of a  research project we have done at Accenture Technology Labs. This shows that Accenture is not only committed to using open source technology, but we are also committed to continue our contribution to the community.

    MapReduce was first invented by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google’s production index system was converted to MapReduce. Since then, it is quickly proven to be applicable to a wide range of problems. For example, there are roughly 10,000 MapReduce jobs written in Google by June 2007, and there are 2,217,000 MapReduce job runs in the month of September 2007.

    MapReduce enjoyed wide adoption outside of Google too. Many enterprises are increasingly facing the same challenges of dealing with a large amount of data. They want to analyze and act on their data quickly to gain competitive advantages, but their existing technology could not keep up with the workload. MapReduce could be the perfect answer to address the challenge.

    There are already several open source implementations of MapReduce. The most popular one is Hadoop. Recently, it has gained a lot of tractions in the market. Even Amazon is offering an Elastic MapReduce service which is providing Hadoop on-demand. However, even after 3 years of many engineer’s dedication, Hadoop still has many limitations. For example, Hadoop is still based on a master/slave architecture, where the master node is not only the scalability bottleneck, but it is also a single point of failure. The reason is that implementing a fully distributed system is very difficult.

    Cloud MapReduce is not just another implementation — it is not a clone of Hadoop. Instead, it is based on a totally different concept. Hadoop is complex and inefficient because it is designed to run on bare-bone hardware; therefore, Hadoop has to implement many functionalities to make a cluster of servers appear as a single big server. In  comparison, Cloud MapReduce is built on top of the Amazon cloud Operating System(OS), using cloud services such as S3/SQS/SimpleDB. Even though a cloud service could be running on many servers behind the scene, Amazon presents a single big server abstraction, which greatly simplifies a MapReduce implementation.

    By building on the Amazon cloud OS, Cloud MapReduce achieves three key advantages over Hadoop.

    • It is faster. In one case, it is 60 times faster than Hadoop (Actual speedup depends on the application and the input data).
    • It is more scalable and failure resistant. It is fully distributed and there is not a single point of bottleneck or a single point of failure.
    • It is dramatically simpler. It has only 3,000 lines of code, two orders of magnitude smaller than Hadoop.

    All these advantages directly translate into lower cost, higher reliability and faster turn-around for enterprises to gain competitive advantages.

    On the surface, it looks surprising that a simple implementation like Cloud MapReduce could outperform Hadoop. However, if you count in the efforts from hundreds of Amazon engineers, it is natural that we are able to develop a more scalable and higher performance system. Cloud MapReduce demonstrates the power of leveraging cloud services for application design.

    Cloud MapReduce has an ambitious vision, so there are many areas that we are looking for help on from the community. Even though Cloud MapReduce was only developed on Amazon OS initially, we envision it will run on many cloud services in the future. For example, it could be ported to Windows Azure, filling a missing capability in Azure that there is no large-scale processing framework at all (Hadoop does not run in Azure). The ultimate goal is to run Cloud MapReduce inside a private cloud. We envision an enterprise would deploy similar cloud services behind the firewall, so that Cloud MapReduce can just build on top. There are already open source projects filling that vision, such as project Voldemort for storage and ActiveMQ for queuing.

    Check out the Cloud MapReduce project. We welcome your contributions. Please join the Cloud MapReduce discussions to share you thoughts.

    Cloud MapReduce — MapReduce built on a cloud operating system

    We have finally finished a cool project — Cloud MapReduce — that we have been working on on-off for almost the whole past year. It is a new MapReduce implementation built on top of a cloud operating system. I described what is a cloud operating system before. We looked hard to understand how different is a cloud operating system (OS) from a traditional OS.  I think we have found the key difference — a cloud OS’s scalability. Unlike a traditional OS, a cloud OS has to be much more scalable because it must manage a large infrastructure (much bigger than a PC) and it must serve many customers.  By exploiting a cloud OS’ scalability, Cloud MapReduce achieves three advantages against other MapReduce implementations, such as Hadoop:

    Faster: Cloud MapReduce is faster than Hadoop, up to 60 times in one case.

    More scalable: No single point of bottleneck, i.e., no single master node that coordinates everything. It is a fully distributed implementation.

    Simpler: Only 3000 lines of Java code. Which means it is very easy to change it to suit your needs. Have you ever thought about changing Hadoop? I got a headache even thinking about the 280K lines of code in Hadoop.

    I encourage you to read the Cloud MapReduce technical report to learn more about what we have done.