Google’s MapReduce patent and its impact on Hadoop and Cloud MapReduce

It has been widely reported that Google finally received its patent on MapReduce, after several rejections. Derrick argued that Google would not enforce its patent because Google would not “risk the legal and monetary consequences of losing any hypothetical lawsuit”. Regardless of the business decision (whether to take that risk or not), I want to comment on the technical novelty aspects. Before I proceed, I have to disclaim that I am not a lawyer, and the following does not constitute legal advice. It is purely a personal opinion based on my years of experience as a consulting expert in patent litigation.

First of all, in my view, the patent is an implementation patent: it covers the Google implementation of the MapReduce programming model, but not the programming model itself. The independent claims 1 (system claim) and 9 (method claim) both describe the Google implementation in detail, including the processes used, how the operators are invoked and how the processing is coordinated.

The reason that Google did not get a patent on the programming model is that the model is not novel, at least in legal terms (that is probably why the patent took so long to be granted). First, it borrows ideas from functional programming, where “map” and “reduce” have been around for a long time. As pointed out by the database community, MapReduce is a step backward partly because it is “not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago”. Second, the User Defined Function (UDF) aspect is also a well-known idea in the database community, and it had been implemented in several database products before Google’s invention.
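To make the functional programming lineage concrete, here is a minimal sketch of a word count written with nothing but Python's built-in map and functools.reduce. The names map_fn, reduce_fn and word_count are my own, chosen for illustration; this is not code from Google, Hadoop or Cloud MapReduce, and it deliberately leaves out everything the patent is about (processes, coordination, fault tolerance).

```python
from functools import reduce


def map_fn(line):
    """Hypothetical user-defined map: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(counts, pair):
    """Hypothetical user-defined reduce: fold one pair into the running totals."""
    word, n = pair
    updated = dict(counts)  # copy so the fold stays side-effect free
    updated[word] = updated.get(word, 0) + n
    return updated


def word_count(lines):
    # "map" phase: apply the user function to every input record
    mapped = map(map_fn, lines)
    # flatten the per-line lists of pairs into a single sequence
    pairs = [pair for line_pairs in mapped for pair in line_pairs]
    # "reduce" phase: fold all pairs into one dictionary of counts
    return reduce(reduce_fn, pairs, {})
```

Calling word_count(["a b a", "B c"]) gives a count of 2 for "a", 2 for "b" and 1 for "c". The point is only that the two operators, and the idea of plugging user-defined functions into them, fit in a few lines of any functional-style language.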

Even though it is arguable whether the programming model is novel in legal terms, it is clear to me that the specific Google implementation is novel. For example, the fine-grained fault tolerance capability is clearly missing in other products. A recent debate on MapReduce vs. DBMS sheds light on which aspects of MapReduce are novel (see the CACM articles here and here), so I will not elaborate further.

Let us first talk about what the patent means to Cloud MapReduce. The answer is: Cloud MapReduce does not infringe. The independent claims 1 and 9 state that “the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, and worker processes”. Since Cloud MapReduce does not have any master node, it clearly does not infringe. Cloud MapReduce uses a totally different architecture from the one Google described in their MapReduce paper, so it only implements the MapReduce programming model, but does not copy the implementation.
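The architectural difference can be sketched in a few lines. The toy below is an assumption-laden illustration, not Cloud MapReduce's actual code: an in-process queue.Queue stands in for a cloud queue service, threads stand in for worker nodes, and run_masterless is a hypothetical name. What it shows is that no process hands out tasks; each worker pulls its next chunk from the shared queue on its own.

```python
import queue
import threading


def run_masterless(chunks, num_workers=3):
    """Sum all numbers in `chunks` using workers that pull from a shared queue."""
    tasks = queue.Queue()    # stand-in for a cloud queue holding input splits
    results = queue.Queue()  # stand-in for a queue holding partial results
    for chunk in chunks:
        tasks.put(chunk)

    def worker():
        # Each worker decides for itself what to do next: take a task if one
        # is left, otherwise stop. Nothing assigns work to it.
        while True:
            try:
                chunk = tasks.get_nowait()
            except queue.Empty:
                return
            results.put(sum(chunk))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # All workers have finished, so the results queue is no longer changing.
    total = 0
    while not results.empty():
        total += results.get()
    return total
```

For example, run_masterless([[1, 2], [3], [4, 5, 6]]) returns 21. The calling thread here only launches the workers and collects the output; it never schedules or tracks individual tasks, which is the job the claim language gives to the master process.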

For Hadoop, my personal opinion is that it infringes the patent, because Hadoop exactly copies the Google implementation as described in the Google paper. If Google enforces the patent, Hadoop can do several things. First, Hadoop can look for an invalidity argument, but I personally think that would be hard. The Google patent is narrow: it only covers the specific Google implementation of MapReduce. Given how widely MapReduce is known, if there had been a similar earlier system, we would have known about it by now. Second, Hadoop could change its implementation. The patent claim language includes many “wherein” clauses. If Hadoop fails to meet any one of those “wherein” clauses, it is off the hook. The downside, though, is that a change in implementation could introduce a lot of inefficiencies. Last, Hadoop can adopt an architecture like Cloud MapReduce’s. Hadoop is already moving in this direction. The latest code base moved HDFS into a separate module. This is the right move, since it separates functions out into independent cloud services. Now if only Hadoop would implement a queue service, Cloud MapReduce could port right over :-).