Introducing Cloud MapReduce on Appistry

When we introduced Cloud MapReduce (CMR), many people were very interested in the technology. However, some would not consider using it because it runs only in the Amazon environment. Even though the Amazon cloud is popular among startups and SMBs, large enterprises are often concerned with data security and privacy, and thus are not willing to put their data outside their firewall.

For those not familiar with Cloud MapReduce, it is an implementation of the MapReduce programming model popularized by Google. CMR innovates by employing a different architecture, which results in a number of nice advantages over other implementations, such as Hadoop and Google’s own implementation.

Taking these clients’ feedback into consideration, we have been quietly working on the next version of Cloud MapReduce. I am happy to announce that we now have a version of Cloud MapReduce that runs inside an enterprise. This version was jointly developed with Appistry, a company focused on building the cloud platform for enterprises. You can read Appistry’s announcement, which was also covered by GigaOM and the New York Times. Appistry’s product can provide a full stack of cloud services within an enterprise, just as a public cloud does. The capabilities provided by the Appistry platform include:

  • Compute service. This is similar to a cloud compute service such as Amazon EC2. However, instead of getting a virtual machine, a user submits a task for execution in Appistry’s Fabric (their term for the cluster of servers under management). Unlike virtual-machine-based cloud offerings, Appistry Fabric can guarantee that a task executes successfully, using mirroring and checkpointing.
  • Storage service. Appistry offers two storage services:
    • Similar to Amazon S3, Appistry’s CloudIQ storage product offers reliable key-value pair storage. It transparently handles failure through replication.
    • Appistry also offers a tuple space-oriented storage called FAM (Fabric Accessible Memory). Although the data model is different, it can serve the same purpose as the Amazon SimpleDB service.
  • Communication service. Similar to Amazon SQS, Appistry offers a queue service. The queue is stored entirely in memory; thus, the queue size is limited by the aggregate memory size across all servers in the fabric. Even though the queue is held in memory, replication makes it reliable.

You will notice that there is a one-to-one mapping from Amazon services to Appistry services, which made it easy for us to port Cloud MapReduce to run on top of Appistry’s platform.
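
To illustrate why such a mapping helps with porting, here is a minimal sketch in Python. It is not CMR's actual code: the `BlobStore` and `MessageQueue` interfaces, the in-memory backends, and the `word_count` job are all hypothetical names invented for this example. The idea is only that a job written against a storage role and a queue role does not care which provider fills those roles.

```python
from abc import ABC, abstractmethod
from collections import deque


class BlobStore(ABC):
    """Hypothetical storage role (the part played by S3 or CloudIQ storage)."""

    @abstractmethod
    def put(self, key, value): ...

    @abstractmethod
    def get(self, key): ...


class MessageQueue(ABC):
    """Hypothetical queue role (the part played by SQS or Appistry's queue)."""

    @abstractmethod
    def send(self, message): ...

    @abstractmethod
    def receive(self):
        """Return the next message, or None when the queue is empty."""


class InMemoryBlobStore(BlobStore):
    """Stand-in backend for this sketch; a real port would wrap a provider API."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class InMemoryQueue(MessageQueue):
    """Stand-in backend for this sketch."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


def word_count(store, queue, input_keys):
    """A toy job that depends only on the two interfaces above."""
    # Map phase: read each input blob and emit (word, 1) pairs to the queue.
    for key in input_keys:
        for word in store.get(key).split():
            queue.send((word, 1))
    # Reduce phase: drain the queue and sum the counts per word.
    counts = {}
    while (message := queue.receive()) is not None:
        word, n = message
        counts[word] = counts.get(word, 0) + n
    return counts


store, queue = InMemoryBlobStore(), InMemoryQueue()
store.put("a", "map reduce map")
store.put("b", "reduce cloud")
counts = word_count(store, queue, ["a", "b"])
```

Moving this toy job to another platform would mean writing two new backend classes and leaving `word_count` untouched, which is the kind of benefit a one-to-one service mapping provides.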

In addition to giving customers a choice of whether to run CMR inside or outside their firewall, the latest CMR also offers several new capabilities:

  • Locality optimization. One of the concerns with the early CMR is that it uses the network exclusively for I/O. Given the limited network bisection bandwidth in today’s data center networks, this could limit throughput. In Appistry, a server in the Fabric provides both the storage service and the compute service. By leveraging their affinity primitive, we try to place the computation on the server that holds the corresponding data, and thus fully leverage the local hard disk bandwidth.
  • Streaming support. MapReduce is designed for batch applications, but with a simple modification, it can support continuous streaming applications and incremental batch applications. We modified the original CMR so that it stores the intermediate data in a key-value storage instead of in the queues, and keeps only a pointer to the blob in the queues. This change makes it easy to keep intermediate data around when you want to do incremental MapReduce later. Our design is similar in concept to the HOP work, but it uses a different architecture and is implemented on a commercial-grade platform.
  • Rolling window support. Many time series applications compute statistics in a rolling window fashion, e.g., computing the average over the last 30 minutes. In the near future, we will support rolling window updates. Because we use a push iterator interface, we can easily take out results contributed by older data and add in results contributed by newer data.
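
The last two ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CMR's implementation: `BlobStore`, `emit_intermediate`, `drain`, and `RollingAverage` are hypothetical names, the store and queue are plain in-memory structures, and the 30-minute window is just the example from the text. The first half keeps intermediate data in a key-value store and puts only a pointer on the queue; the second half is a push-style reducer that adds each new sample's contribution and removes the contribution of samples that have left the window.

```python
from collections import deque
import uuid


class BlobStore:
    """Hypothetical in-memory stand-in for a key-value blob store."""

    def __init__(self):
        self.blobs = {}

    def put(self, data):
        key = uuid.uuid4().hex  # opaque pointer that goes on the queue
        self.blobs[key] = data
        return key

    def get(self, key):
        return self.blobs[key]


def emit_intermediate(store, queue, records):
    """Store the records as one blob and enqueue only a pointer to it."""
    queue.append(store.put(records))


def drain(store, queue):
    """Follow each queued pointer; blobs stay in the store for later runs."""
    while queue:
        yield from store.get(queue.popleft())


class RollingAverage:
    """Push-style reducer keeping a running sum over a time window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def push(self, timestamp, value):
        # Add in the contribution of the newer data.
        self.samples.append((timestamp, value))
        self.total += value
        # Take out the contribution of data that left the window.
        while self.samples[0][0] <= timestamp - self.window:
            _, old_value = self.samples.popleft()
            self.total -= old_value
        return self.total / len(self.samples)


store, queue = BlobStore(), deque()
emit_intermediate(store, queue, [(0, 10.0), (600, 20.0)])
emit_intermediate(store, queue, [(1200, 30.0), (2400, 50.0)])

average = RollingAverage(window_seconds=1800)  # 30-minute window
results = [average.push(t, v) for t, v in drain(store, queue)]
```

Here `results` holds the rolling average after each sample. At the last sample (timestamp 2400) the samples at 0 and 600 have fallen out of the window, so only the last two values are averaged. The queue is empty once drained, while both blobs are still in the store and could feed a later incremental run.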

With Appistry support, you can now run CMR in several setups:

  • You can run CMR inside your firewall on top of the Appistry platform.
  • You can run CMR outside your firewall in any cloud, such as Amazon EC2 or GoGrid, on top of the Appistry platform. You benefit from Appistry’s management platform and the locality optimization we discussed above.
  • You can run CMR natively on Amazon EC2. You benefit from the native integration and Amazon’s cheap pay-per-use storage without running your own storage cluster.

We are in the process of writing up what we have developed as a research paper, where we will be able to discuss more technical details than we can cover in this blog. Stay tuned if you are technically inclined.