Building Your Own Large Data Clouds (Raywulf Clusters)

September 27, 2009

We recently added four new racks to the Open Cloud Testbed. The racks are designed to support cloud computing, both clouds that support on demand VMs as well as those that support data intensive computing. Since there is not a lot of information available describing how to put together these types of clouds, I thought I would share how we configured our racks.

These are two of the four racks that were added to the Open Cloud Testbed as part of the Phase 2 build out.  Photograph by Michal Sabala.

These are two of the four racks that were added to the Open Cloud Testbed as part of the Phase 2 build out. Photograph by Michal Sabala.

These racks can be used as a basis for private clouds, hybrid clouds, or condo clouds.

There is a lot of information about building Beowulf clusters, which are designed for compute intensive computing. Here is one of the first tutorials and some more recent information.

In contrast, our racks are designed to support data intensive computing. We sometimes call these Raywulf clusters. Briefly, the goal is to make sure that there are enough spindles moving data in parallel with enough cores to process the data being moved. (Our data intensive middleware is called Sector, Graywulf is already taken, and there are not many words that rhyme with Beo- left. Other suggestions are welcome. Please use the comments below.)

The racks cost about $85,000 (with standard discounts), consist of 32 nodes and 124 cores with 496 GB of RAM, 124 TB of disk & 124 spindles, and consume about 10.3 kW of power (excluding the power required for cooling).

With 3x replication, there is about 40 TB of usable storage available, which means that the cost to provide balanced long term storage and compute power is about $2,000 per TB. So, for example, a single rack could be used as a basis for a private cloud that can manage and analyze approximately 40 TB of data. At the end of this note, is some performance information about a single rack system.

Each rack is a standard 42U computer rack and consists of a head node and 31 compute/storage nodes. We installed GNU/Debian Linux 5.0 as the operating system. Here is the configuration of the rack and of the compute/storage nodes.

In contrast, there are specialized configurations, such as designed by Backblaze, that provide 67TB for $8,000. This is 1/2 the storage for 1/10 the cost. The difference is that Raywulf clusters are designed for data intensive computing using middleware such as Hadoop and Sector/Sphere, not just storage.

Rack Configuration

  • 31 compute/storage nodes (see below)
  • 1 head node (see below)
  • 2 Force10 S50N switches, with 2 10 Gbps uplinks so that the inter-rack bandwidth is 20 Gbps
  • 1 10GE module
  • 2 optics and stacking modules
  • 1 3Com Baseline 2250 switch to provide to provide additional cat5 ports for IPMI management interfaces.
  • cabling

Compute/storage node.

  • Intel Xeon 5410 Quad Core CPU with 16GB of RAM
  • SATA RAID controller
  • four (4) SATA 1TB hard drives in RAID-0 configuration
  • 1 Gbps NIC
  • IPMI management

Benchmarks. We benchmarked these new racks using the Terasort Benchmark and version 0.20.1 of Hadoop and version 1.24a of Sector/Sphere. Replication was turned off in both Hadop and Sector. All the racks were located within one data center. It is clear from these tests that the new versions of Hadoop and Sector/Sphere are both faster than the previous versions.

Configuration Sector/Sphere Hadoop
1 rack (32 nodes) 28m 25s 85m 49s
2 racks (64 nodes) 15m 20s 37m 0s
3 racks (96 nodes) 10m 19s 24m 14s
4 racks (128 nodes) 7m 56s 17m 45s

The Raywulf clusters were designed by Michal Sabala and Yunhong Gu of the National Center for Data Mining at the University of Illinois at Chicago.

We are working on putting together more information of how to build a Raywulf cluster.

Sector/Sphere and our Raywulf Clusters were selected as one of the Disruptive Technologies that will be highlighted at SC 09.

The photograph above of two racks from the Open Cloud Testbed was taken by Michal Sabala.


Large Data Clouds FAQ

July 16, 2009

This is a post that contains some questions and answers about large data clouds that I expect to update and expand from time to time.

What is large data? From the point of view of the infrastructure required to do analytics, data comes in three sizes:

  • Small data. Small data fits into the memory of a single machine. A good example of a small dataset is the dataset for the Netflix Prize. The Netflix Prize dataset consists of over 100 million movie rating files by 480 thousand randomly-chosen, anonymous Netflix customers that rated over 17 thousand movie titles. This dataset (although challenging enough to keep anyone from winning the grand prize for over 2 years) is just 2 GB of data and fits into the memory of a laptop. I discuss some lessons in analytic strategy that you learn from this contest in this post.

    Building the ATLAS Detector at Cern's Large Hadron Collider

  • Medium data. Medium data fits into a single disk or disk array and can be managed by a database. It is becoming common today for companies to create 1 to 10 TB or large data warehouses.
  • Large data. Large data is so large that it is challenging to manage it in a database and instead specialized systems are used. We’ll discuss some examples of these specialized systems below. Scientific experiments, such as the Large Hadron Collider (LHC), produce large datasets. Log files produced by Google, Yahoo and Microsoft and similar companies are also examples of large datasets.

There have always been large datasets, but until recently, most large datasets were produced by the scientific and defense communities. Two things have changed: First, large datasets are now being produced by a third community: companies that provide internet services, such as search, on-line advertising and social media. Second, the ability to analyze these datasets is critical for advertising systems that produce the bulk of the revenue for these companies. This provides a measure (dollars of online revenue produced) by which to measure the effectiveness of analytic infrastructure and analytic models. Using this metric, companies such as Google, settled upon analytic infrastructure that was quite different than the grid-based infrastructure that is generally used by the scientific community.

What is a large data cloud? There is no standard definition of a large data cloud, but a good working definition is that a large data cloud
provides i) storage services and ii) compute services that are layered over the storage services that scale to a data center and that have the reliability associated with a data center. You can find some background information on clouds on this page containing an overview about clouds.

What are some of the options for working with large data? There are several options, including:

  • The most mature large data cloud application is the open source Hadoop system, which consists of the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce. An important advantage of Hadoop is that it has a very robust community supporting it and there are a large number of Hadoop projects, including Pig, which provides simple database-like operations over data managed by HDFS.
  • Another option is Sector, which consists of the Sector Distributed File System (SDFS) and a compute service called Sphere that allows users to execute arbitrary User Defined Functions (UDFs) over the data managed by SDFS. Sector supports MapReduce as a special case of a user-defined Map UDF, followed by Shuffle and Sort UDFs provided by Sphere, followed by a user-defined Reduce UDF. Sector is a C++ open source application. Unlike Hadoop, Sector includes security. There is public Sector cloud for those interested in trying out Sector without downloading it and installing it.
  • Greenplum uses a shared-nothing MPP (massively parallel processing) architecture based upon commodity hardware. The Greenplum architecture also integrates MapReduce-like functionality into its platform.
  • Aster has a MPP-based data warehousing appliance that supports MapReduce. They have an entry level system that manages up to 1 TB of data and an enterprise level system that is designed to support up to 1 PB of data.
  • How do I get started? The easiest way to get started is to download one of the applications and to work through some basic examples. The example that most people work through is word count. Another common example is the terasort example (soring 10 billion 100 byte records where the first 10 bytes is the key that is sorted and the remaining 90 bytes is the payload). A simple analytic to try is MalStone, which I have described in another post.

    What are some of the issues that arise with large data cloud applications? The first issue is mapping your problem to the MapReduce or generalized MapReduce (like Spheres UDFs) frameworks. Although this type of data parallel framework may seem quite special initially, it is surprising how many problems can be mapped to this framework with a bit effort.

    The second issue is that tuning Hadoop clusters can be challenging and time consuming. This is not surprising, considering the power provided by Hadoop to tackle very large problems.

    The third issue is that with medium (100 nodes) and large (1000 node) clusters, even a few under performing nodes can impact the overall performance. There can also be problems with switches that impact performance in subtle ways. Dealing with these types of hardware issues can also be time consuming. It is sometimes helpful to run a known benchmark such as terasort or MalStone to distinguish hardware issues from programming issues.

    What is the significance of large data clouds? Just a short time ago, it required specialized proprietary software to analyze 100 TB or more of data. Today, a competant team should be able to do this relatively straightforwardly with a 100 node large data cloud powered by Hadoop, Sector or similar software.

    Getting involved. I just set up a Google Group for large data clouds:
    groups.google.com/group/large-data-clouds. Please use this group to discuss issues related to large data clouds, including lessons learned, questions, annoucements, etc. (no advertising please). In particular, if you have software you would like added to the list below, please comment below or send a node to the large data cloud google group.