Cloud Computing Testbeds

July 29, 2009

Cloud computing is still an immature field: there are lots of interesting research problems, no standards, few benchmarks, and very limited interoperability between different applications and services.

The network infrastructure for the Phase 1 of the Open Cloud Testbed.

Currently, there are relatively few testbeds available to the research community for research in cloud computing and few resources available to developers for testing interoperability. I expect this will change over time, but below are the testbeds that I am aware of and a little bit about each of them. If you know of any others, please let me know so that I can keep the list current (at least for a while until cloud computing testbeds become more common).

Before discussing the testbeds per se, I want to highlight one of the lessons that I have learned while working with one of the testbeds — the Open Cloud Testbed (OCT).

Disclaimer: I am one of the technical leads for the OCT and one of the Directors of the Open Cloud Consortium.

Currently the OCT consists of 120 identical nodes and 480 cores. All were purchased and assembled at the same time by the same team. One thing that caught me by suprise is that there are enough small differences between the nodes that the results of some experimental studies can vary by 5%, 10%, 20%, or more, depending upon which nodes are used within the testbed. This is because even one or two nodes with slightly inferior performance can impact the overall end-to-end performance of an application that uses some of today’s common cloud middleware.

Amazon Cloud. Although not usually thought of as a testbed, Amazon’s EC2, S3, SQS, EBS and related services are economical enough that they they can serve as the basis for an on-demand testbed for many experimental studies. In addition, Amazon provides grants so that their cloud services can be used for teaching and research.

Open Cloud Testbed (OCT). The Open Cloud Testbed is a testbed managed by the Open Cloud Consortium. The testbed currently consists of 4 racks of servers, located in 4 data centers at Johns Hopkins University (Baltimore), StarLight (Chicago), the University of Illinois (Chicago), and the University of California (San Diego). Each rack has 32 nodes and 128 cores. Two Cisco 3750E switches connect the 32 nodes, which then connects to the outside by a 10Gb/s uplink. In contrast to other cloud testbeds, the OCT utilizes wide area high performance networks, not the familiar commodity Internet. There are 10Gb/s networks that connect the various data centers. This network is provided by Cisco’s CWave national testbed infrastructure and through a partnership with the National Lambda Rail. Over the next few months the OCT will double in size to 8 racks and over 1000 cores. In the OCT, a variety of cloud systems and services are installed and available for research, including Hadoop, Sector/Sphere, CloudStore (KosmosFS), Eucalyptus, and Thrift. The OCT is a testbed designed to support systems-level, middleware and application level research in cloud computing, as well as the development of standards and interoperability frameworks. A technical report described the OCT is available from

Open Cirrus(tm) Testbed. The Open Cirrus Testbed is a joint initiative sponsored by HP, Intel and Yahoo! in collaboration with the NSF, the University of Illinois at Urbana-Champaign (UIUC), Karlsruhe Institute of Technology, and the Infocomm Development Authority (IDA) of Singapore. Each of the six sites consists of at least 1000 cores and associated storage. The Open Cirrus Testbed is a federated system designed to support systems-level research in cloud computing. A technical report describing the testbed can be found here.

Eucalyptus Public Cloud. The Eucalyptus Public Cloud is a testbed for Eucalyptus applications. Eucalyptus shares the same APIs as Amazon’s web services. Currently, users are limited to no more than 4 virtual machines and experimental studies that require 6 hours or less.

Google-IBM-NSF CLuE Resource. Another cloud computing testbed is the IBM-Google-NSF Cluster Exploratory or CluE Resource. The IBM-Google NSF CLuE resource appears to be a testbed for cloud computing applications in the sense that Hadoop applications can be run on the testbed but that the testbed does not support systems research and experiments involving cloud middleware and cloud services per se, as is possible with the OCT and the Open Cirrus Testbed. (At least this was the case the last time I checked. It may be different now. If it is possible to do systems level research on the testbed, I would appreciate it if someone would let me know.) NSF has awarded nearly $5 million in grants to 14 universities through its Cluster Exploratory (CLuE) program to support research on this testbed.

Large Data Clouds FAQ

July 16, 2009

This is a post that contains some questions and answers about large data clouds that I expect to update and expand from time to time.

What is large data? From the point of view of the infrastructure required to do analytics, data comes in three sizes:

  • Small data. Small data fits into the memory of a single machine. A good example of a small dataset is the dataset for the Netflix Prize. The Netflix Prize dataset consists of over 100 million movie rating files by 480 thousand randomly-chosen, anonymous Netflix customers that rated over 17 thousand movie titles. This dataset (although challenging enough to keep anyone from winning the grand prize for over 2 years) is just 2 GB of data and fits into the memory of a laptop. I discuss some lessons in analytic strategy that you learn from this contest in this post.

    Building the ATLAS Detector at Cern's Large Hadron Collider

  • Medium data. Medium data fits into a single disk or disk array and can be managed by a database. It is becoming common today for companies to create 1 to 10 TB or large data warehouses.
  • Large data. Large data is so large that it is challenging to manage it in a database and instead specialized systems are used. We’ll discuss some examples of these specialized systems below. Scientific experiments, such as the Large Hadron Collider (LHC), produce large datasets. Log files produced by Google, Yahoo and Microsoft and similar companies are also examples of large datasets.

There have always been large datasets, but until recently, most large datasets were produced by the scientific and defense communities. Two things have changed: First, large datasets are now being produced by a third community: companies that provide internet services, such as search, on-line advertising and social media. Second, the ability to analyze these datasets is critical for advertising systems that produce the bulk of the revenue for these companies. This provides a measure (dollars of online revenue produced) by which to measure the effectiveness of analytic infrastructure and analytic models. Using this metric, companies such as Google, settled upon analytic infrastructure that was quite different than the grid-based infrastructure that is generally used by the scientific community.

What is a large data cloud? There is no standard definition of a large data cloud, but a good working definition is that a large data cloud
provides i) storage services and ii) compute services that are layered over the storage services that scale to a data center and that have the reliability associated with a data center. You can find some background information on clouds on this page containing an overview about clouds.

What are some of the options for working with large data? There are several options, including:

  • The most mature large data cloud application is the open source Hadoop system, which consists of the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce. An important advantage of Hadoop is that it has a very robust community supporting it and there are a large number of Hadoop projects, including Pig, which provides simple database-like operations over data managed by HDFS.
  • Another option is Sector, which consists of the Sector Distributed File System (SDFS) and a compute service called Sphere that allows users to execute arbitrary User Defined Functions (UDFs) over the data managed by SDFS. Sector supports MapReduce as a special case of a user-defined Map UDF, followed by Shuffle and Sort UDFs provided by Sphere, followed by a user-defined Reduce UDF. Sector is a C++ open source application. Unlike Hadoop, Sector includes security. There is public Sector cloud for those interested in trying out Sector without downloading it and installing it.
  • Greenplum uses a shared-nothing MPP (massively parallel processing) architecture based upon commodity hardware. The Greenplum architecture also integrates MapReduce-like functionality into its platform.
  • Aster has a MPP-based data warehousing appliance that supports MapReduce. They have an entry level system that manages up to 1 TB of data and an enterprise level system that is designed to support up to 1 PB of data.
  • How do I get started? The easiest way to get started is to download one of the applications and to work through some basic examples. The example that most people work through is word count. Another common example is the terasort example (soring 10 billion 100 byte records where the first 10 bytes is the key that is sorted and the remaining 90 bytes is the payload). A simple analytic to try is MalStone, which I have described in another post.

    What are some of the issues that arise with large data cloud applications? The first issue is mapping your problem to the MapReduce or generalized MapReduce (like Spheres UDFs) frameworks. Although this type of data parallel framework may seem quite special initially, it is surprising how many problems can be mapped to this framework with a bit effort.

    The second issue is that tuning Hadoop clusters can be challenging and time consuming. This is not surprising, considering the power provided by Hadoop to tackle very large problems.

    The third issue is that with medium (100 nodes) and large (1000 node) clusters, even a few under performing nodes can impact the overall performance. There can also be problems with switches that impact performance in subtle ways. Dealing with these types of hardware issues can also be time consuming. It is sometimes helpful to run a known benchmark such as terasort or MalStone to distinguish hardware issues from programming issues.

    What is the significance of large data clouds? Just a short time ago, it required specialized proprietary software to analyze 100 TB or more of data. Today, a competant team should be able to do this relatively straightforwardly with a 100 node large data cloud powered by Hadoop, Sector or similar software.

    Getting involved. I just set up a Google Group for large data clouds: Please use this group to discuss issues related to large data clouds, including lessons learned, questions, annoucements, etc. (no advertising please). In particular, if you have software you would like added to the list below, please comment below or send a node to the large data cloud google group.

Test Drive the Sector Public Cloud

June 23, 2009

Sector is an open source cloud written in C++ for storing, sharing and processing large data sets.   Sector is broadly similar to the Google File System and the Hadoop Distributed File System, except that it is designed to utilize wide area high performance  networks.

Sphere is middleware that is designed to process data managed by Sector.  Sphere implements a framework for distributed computing that allows any User Defined Function (UDF) to be applied to a Sector dataset.

One way to think about this is as a generalized MapReduce. With MapReduce, users work with pairs and define a Map function and a Reduce function, and the MapReduce application creates a workflow consisting of a Map, Shuffle, Sort and Reduce. With Sector, users can create a workflow consisting of any sequence of User Define Functions (UDFs) and apply these to any datasets managed by Sector. In particular, Sphere has predefined Shuffle and Sort UDFs that can be applied to datasets consisting of pairs so that MapReduce applications can be implemented once a user defines a Map and Reduce UDF.

Sector also implements security and we are currently using it to bring up a HIPAA-compliant private cloud.

Since Sector/Sphere is written in C++, it is straightforward to support C++ based data access tools and programming APIs.

If you have access to high speed research network (for example if you network can reach StarLight, the National Lambda Rail, ESNet, or Internet2), then you can try out the Sector Public Cloud.

You can reach the Sector Public Cloud from the Sector home page

There is a technical report on the design of Sector on arXiv: arXiv:0809.1181v2.

There is some information on the performance of Sector/Sphere in my post on the MalStone Benchmark, a benchmark for clouds that support data intensive computing.

Sector – When You Really Need to Process 10 Billion Records

April 19, 2009

As is well known by now, Google demonstrated the power of a layered stack of cloud services that are designed for commodity computers that fill a data center. The stack consists of a storage service (the Google File System (GFS)), a compute service based upon MapReduce, and a table service (BigTable).

Although the Google stack of services is not directly available, the open source Hadoop system, which has a broadly similar architecture, is available.

The Google stack, consisting of GFS/MapReduce/Bigtable, and the Hadoop system, consisting of the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce, are examples of clouds designed for data intensive computing — these types of clouds provide computing capacity on demand, with capacity scaling all the way up to the size of a data center.

There are still many open questions about how best to design clouds for data intensive computing. During the best several years, I have been involved with a cloud designed for data intensive computing called Sector. The lead developer of Sector is Yunhong Gu of the University of Illinois at Chicago. Sector was developed independently of Hadoop and the Google cloud services and makes several different design choices (see the table below).

To quantify the impact of some of these choices, I have been involved with the development of a benchmark for data intensive computing called MalStone. I will talk more about MalStone in a future post, but briefly, MalStone is a stylized analytic computing that can be done simply using MapReduce, as well as variants and generalizations of MapReduce. The open source MalStone code comes with a generator of synthetic records and one benchmark (called MalStone B) generates 10 billion 100-byte records (similar to terasort).

MalStone B Benchmarks

System Time (min)
Hadoop MapReduce 799 min
Hadoop Streaming with Python 143 min
Sector 44 min

Tests were done using 20 nodes on the Open Cloud Testbed. Each node contained 500 million 100-byte records.

Comparing Sector and Hadoop

Hadoop Sector
Storage cloud block-based file system file-based
Programming model MapReduce user defined functions and MapReduce
Protocol TCP UDP
Security NA HIPAA capable
Replication at time of writing periodically
Language Java C++

I’ll be giving a talk on Sector at CloudSlam ’09 on Monday, April 20, 2009 at 4pm ET. CloudSlam is a virtual conference, so that it is easy to listen to any of the talks that interest you.