Building Your Own Large Data Clouds (Raywulf Clusters)

September 27, 2009

We recently added four new racks to the Open Cloud Testbed. The racks are designed to support cloud computing, both clouds that provide on-demand VMs and clouds that support data intensive computing. Since there is not a lot of information available describing how to put together these types of clouds, I thought I would share how we configured our racks.

These are two of the four racks that were added to the Open Cloud Testbed as part of the Phase 2 build out.  Photograph by Michal Sabala.


These racks can be used as a basis for private clouds, hybrid clouds, or condo clouds.

There is a lot of information about building Beowulf clusters, which are designed for compute intensive computing. Here is one of the first tutorials and some more recent information.

In contrast, our racks are designed to support data intensive computing. We sometimes call these Raywulf clusters. Briefly, the goal is to make sure that there are enough spindles moving data in parallel with enough cores to process the data being moved. (Our data intensive middleware is called Sector, Graywulf is already taken, and there are not many words that rhyme with Beo- left. Other suggestions are welcome. Please use the comments below.)

The racks cost about $85,000 (with standard discounts), consist of 32 nodes and 124 cores with 496 GB of RAM, 124 TB of disk & 124 spindles, and consume about 10.3 kW of power (excluding the power required for cooling).

With 3x replication, there is about 40 TB of usable storage available, which means that the cost to provide balanced long-term storage and compute power is about $2,000 per TB. So, for example, a single rack could be used as the basis for a private cloud that can manage and analyze approximately 40 TB of data. At the end of this note is some performance information about a single-rack system.
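As a sanity check, the arithmetic behind these figures can be sketched in a few lines (the inputs are the numbers quoted in this post; Python is used just for illustration):

```python
# Back-of-the-envelope check of the rack economics quoted above.
rack_cost_usd = 85_000        # cost of one rack, with standard discounts
raw_storage_tb = 124          # 31 nodes x 4 x 1 TB drives
replication = 3               # 3x replication in the storage middleware

usable_tb = raw_storage_tb / replication          # about 41 TB, i.e. "about 40 TB"
cost_per_usable_tb = rack_cost_usd / usable_tb    # a bit over $2,000 per TB

print(f"usable storage: ~{usable_tb:.0f} TB")
print(f"cost per usable TB: ~${cost_per_usable_tb:,.0f}")
```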

Each rack is a standard 42U computer rack and consists of a head node and 31 compute/storage nodes. We installed Debian GNU/Linux 5.0 as the operating system. Here is the configuration of the rack and of the compute/storage nodes.

In contrast, there are specialized configurations, such as the one designed by Backblaze, that provide 67 TB for $8,000. That is about half the storage at one-tenth the cost. The difference is that Raywulf clusters are designed for data intensive computing using middleware such as Hadoop and Sector/Sphere, not just storage.

Rack Configuration

  • 31 compute/storage nodes (see below)
  • 1 head node (see below)
  • 2 Force10 S50N switches, with 2 10 Gbps uplinks so that the inter-rack bandwidth is 20 Gbps
  • 1 10GE module
  • 2 optics and stacking modules
  • 1 3Com Baseline 2250 switch to provide additional cat5 ports for the IPMI management interfaces
  • cabling

Compute/Storage Node Configuration

  • Intel Xeon 5410 Quad Core CPU with 16GB of RAM
  • SATA RAID controller
  • four (4) SATA 1TB hard drives in RAID-0 configuration
  • 1 Gbps NIC
  • IPMI management

Benchmarks. We benchmarked these new racks using the Terasort benchmark with version 0.20.1 of Hadoop and version 1.24a of Sector/Sphere. Replication was turned off in both Hadoop and Sector. All the racks were located within one data center. It is clear from these tests that the new versions of Hadoop and Sector/Sphere are both faster than the previous versions.

Configuration          Sector/Sphere   Hadoop
1 rack (32 nodes)      28m 25s         85m 49s
2 racks (64 nodes)     15m 20s         37m 0s
3 racks (96 nodes)     10m 19s         24m 14s
4 racks (128 nodes)    7m 56s          17m 45s
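For readers interested in the scaling behavior, the table above can be reduced to speedups and parallel efficiencies with a short script (times are transcribed from the table; Python is used just for illustration):

```python
# Scaling implied by the Terasort numbers above, relative to the 1-rack run.
def secs(m, s):
    """Convert minutes and seconds to seconds."""
    return 60 * m + s

runs = {
    "Sector/Sphere": {32: secs(28, 25), 64: secs(15, 20), 96: secs(10, 19), 128: secs(7, 56)},
    "Hadoop":        {32: secs(85, 49), 64: secs(37, 0),  96: secs(24, 14), 128: secs(17, 45)},
}

for system, times in runs.items():
    base = times[32]
    for nodes, t in times.items():
        speedup = base / t
        efficiency = speedup / (nodes / 32)   # 1.0 = perfect linear scaling
        print(f"{system:13s} {nodes:3d} nodes: speedup {speedup:4.2f}, efficiency {efficiency:4.2f}")
```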

The Raywulf clusters were designed by Michal Sabala and Yunhong Gu of the National Center for Data Mining at the University of Illinois at Chicago.

We are working on putting together more information on how to build a Raywulf cluster.

Sector/Sphere and our Raywulf Clusters were selected as one of the Disruptive Technologies that will be highlighted at SC 09.


Cloud Computing Testbeds

July 29, 2009

Cloud computing is still an immature field: there are lots of interesting research problems, no standards, few benchmarks, and very limited interoperability between different applications and services.

The network infrastructure for the Phase 1 of the Open Cloud Testbed.

Currently, there are relatively few testbeds available to the research community for research in cloud computing and few resources available to developers for testing interoperability. I expect this will change over time, but below are the testbeds that I am aware of and a little bit about each of them. If you know of any others, please let me know so that I can keep the list current (at least for a while until cloud computing testbeds become more common).

Before discussing the testbeds per se, I want to highlight one of the lessons that I have learned while working with one of the testbeds — the Open Cloud Testbed (OCT).

Disclaimer: I am one of the technical leads for the OCT and one of the Directors of the Open Cloud Consortium.

Currently the OCT consists of 120 identical nodes and 480 cores. All were purchased and assembled at the same time by the same team. One thing that caught me by surprise is that there are enough small differences between the nodes that the results of some experimental studies can vary by 5%, 10%, 20%, or more, depending upon which nodes are used within the testbed. This is because even one or two nodes with slightly inferior performance can impact the overall end-to-end performance of an application that uses some of today’s common cloud middleware.
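A toy model illustrates the effect: if a parallel job finishes only when its slowest node finishes, then even one or two slightly degraded nodes stretch the end-to-end time. This is a simplified sketch, not a model of any particular middleware:

```python
# Toy model: a parallel job completes only when its slowest node completes.
def job_time(node_speeds, work_per_node=100.0):
    # Each node processes its share independently; the job waits for the slowest.
    return max(work_per_node / s for s in node_speeds)

uniform = [1.0] * 120                    # 120 identical nodes
degraded = [1.0] * 118 + [0.8, 0.85]    # two slightly inferior nodes

slowdown = job_time(degraded) / job_time(uniform)
print(f"end-to-end slowdown: {slowdown:.0%}")  # 125% of the uniform runtime
```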

Amazon Cloud. Although not usually thought of as a testbed, Amazon’s EC2, S3, SQS, EBS and related services are economical enough that they can serve as the basis for an on-demand testbed for many experimental studies. In addition, Amazon provides grants so that their cloud services can be used for teaching and research.

Open Cloud Testbed (OCT). The Open Cloud Testbed is a testbed managed by the Open Cloud Consortium. The testbed currently consists of 4 racks of servers, located in 4 data centers at Johns Hopkins University (Baltimore), StarLight (Chicago), the University of Illinois (Chicago), and the University of California (San Diego). Each rack has 32 nodes and 128 cores. Two Cisco 3750E switches connect the 32 nodes, which then connect to the outside by a 10 Gb/s uplink. In contrast to other cloud testbeds, the OCT utilizes wide area high performance networks, not the familiar commodity Internet. There are 10 Gb/s networks that connect the various data centers. This network is provided by Cisco’s CWave national testbed infrastructure and through a partnership with the National Lambda Rail. Over the next few months the OCT will double in size to 8 racks and over 1000 cores. In the OCT, a variety of cloud systems and services are installed and available for research, including Hadoop, Sector/Sphere, CloudStore (KosmosFS), Eucalyptus, and Thrift. The OCT is a testbed designed to support systems-level, middleware and application level research in cloud computing, as well as the development of standards and interoperability frameworks. A technical report describing the OCT is available.

Open Cirrus(tm) Testbed. The Open Cirrus Testbed is a joint initiative sponsored by HP, Intel and Yahoo! in collaboration with the NSF, the University of Illinois at Urbana-Champaign (UIUC), Karlsruhe Institute of Technology, and the Infocomm Development Authority (IDA) of Singapore. Each of the six sites consists of at least 1000 cores and associated storage. The Open Cirrus Testbed is a federated system designed to support systems-level research in cloud computing. A technical report describing the testbed can be found here.

Eucalyptus Public Cloud. The Eucalyptus Public Cloud is a testbed for Eucalyptus applications. Eucalyptus shares the same APIs as Amazon’s web services. Currently, users are limited to no more than 4 virtual machines and experimental studies that require 6 hours or less.

Google-IBM-NSF CLuE Resource. Another cloud computing testbed is the IBM-Google-NSF Cluster Exploratory or CLuE Resource. The IBM-Google-NSF CLuE resource appears to be a testbed for cloud computing applications in the sense that Hadoop applications can be run on the testbed, but the testbed does not support systems research and experiments involving cloud middleware and cloud services per se, as is possible with the OCT and the Open Cirrus Testbed. (At least this was the case the last time I checked. It may be different now. If it is possible to do systems level research on the testbed, I would appreciate it if someone would let me know.) NSF has awarded nearly $5 million in grants to 14 universities through its Cluster Exploratory (CLuE) program to support research on this testbed.

Some Reasons to Consider Condominium Clouds (Condo Clouds)

June 8, 2009

In this post, I’ll introduce condominium clouds and discuss some of their potential for changing computing. From an architectural point of view, condominium clouds are essentially the same as private clouds. Condominium clouds have a different business model, though, which in certain circumstances provides some definite advantages.

I argue here that condominium clouds and related offerings represent a fundamental shift in our computing platforms. To explain this, I’ll take a short detour and recall a computing experience I had about a decade ago and the business model (condominium fiber) that made these types of experiences available to a broader community.

Some racks in data center.

One of the most exciting technical experiences I have had occurred in 2000 when I ran a distributed data intensive computing application over a dedicated 155 Mbps network link connecting clusters located at NCAR in Boulder and the University of Michigan in Ann Arbor. Prior to that, I only had access to 1.5 Mbps networks, and these networks were shared by the rest of the campus. The application was able to perform sustained computation at about 96 Mbps, which was not bad considering that each computer was limited by a 100 Mbps NIC. Reaching 96 Mbps over a wide area network was quite difficult at that time, but we did this using a new network protocol that was the precursor to UDT. The reason for our excitement was that one day we were limited to distributed computations that rarely reached 1 Mbps, while the next day we reached 96 Mbps, almost two orders of magnitude improvement.

By 2003, with improved protocols and 10 Gbps networks, sustained distributed computations reached 6.8 Gbps. Within a four-year span, we had passed through an inflection point in which high performance distributed computing improved by over 3 orders of magnitude. Three things were required:

  • A new computing platform, in this case, clusters connected by wide area, high performance networks.
  • A new network protocol and associated libraries, since TCP was not effective at data intensive computing over wide area high performance networks.
  • A new business model, which made high performance wide area networks more broadly available.

Let’s turn now to cloud computing. Cloud computing has two faces: the most familiar face offers utility-based pricing, on-demand elastic availability, and infrastructure as a service. There is no doubt that this combination is changing the face of computing. On the other hand, the other side of cloud computing is just as important. This side is about thinking of the data center as your unit of computing. Previously you probably thought of computing as requiring a certain number of racks. With cloud computing, you now think of computing as requiring a certain number of data centers. This is computing measured with Data Center Units or DCUs.

The problem is that acquiring computing at the scale of data centers is prohibitive for all but a handful of companies (Google, Microsoft, Yahoo, IBM, …).

This is where the condominium clouds enter. But first, here is a description of customer owned and condominium fiber from a 2002 FAQ titled “FAQ about Community Dark Fiber Networks” written by Bill St Arnaud:

Dark fiber is optical fiber, dedicated to a single customer and where the customer is responsible for attaching the telecommunications equipment and lasers to “light” the fiber. Traditionally optical fiber networks have been built by carriers where they take on the responsibility of lighting the fiber and provide a managed service to the customer.

Professional 3rd party companies who specialize in dark fiber systems take care of the actual installation of the fiber and also maintain it on behalf of the customer. Technically these companies actually own the fiber, but sell IRUs (Indefeasible Rights of Use) for up to 20 years for unrestricted use of the fiber.

All across North America businesses, school boards and municipalities are banding together to negotiate deals to purchase customer owned dark fiber. A number of next generation service providers are now installing fiber networks and will sell strands of fiber to any organization who wish to purchase and manage their own dark fiber.

Many of these new fiber networks are built along the same model as a condominium apartment building. The contractor advertises the fact that they intend to build a condominium fiber network and offers early participants special pricing before the construction begins. That way the contractor is able to guarantee early financing for the project and demonstrate to bankers and other investors that there are some committed customers to the project.

The condominium fiber is operated like a condominium apartment building. The individual owners of fiber strands can do whatever they want with their individual fiber strands. They are free to carry any type of traffic and terminate the fiber any way they so choose. The company that installs the fiber network is responsible for overall maintenance and repairing the fiber in case of breaks, moves, adds or changes. The “condominium manager” charges the owners of the individual strands of fiber a small annual maintenance fee which covers all maintenance and right of way costs.

The initial primary driver for dark fiber by individual customers is the dramatic savings in telecommunication costs. The reduction in telecommunication costs can be in excess of 1000% depending on your current bandwidth requirements.

It is now easy to explain condominium clouds. For those who cannot afford private clouds at the scale of data centers, condominium clouds provide a way to share the expense with the other members of the condominium.

The condominium cloud model is also attractive if there are compliance issues or security issues that make a private cloud desirable, but your scale is such that justifying your own private cloud at the scale of a data center does not make sense.

As with condominium fiber, professionals would build and operate the data center. One way of looking at condominium clouds is as more cost-effective private clouds for certain organizations or associations that might benefit from the scale and operational control that data centers offer.

Condominium clouds might make sense for companies in a regulated industry that belong to an association that can manage the condominium. They would also make sense for scientific collaborations, especially those with large data. Also, although the business model would be slightly different, government organizations that couldn’t justify their own cloud could work together and jointly manage a condominium cloud.

The image above is courtesy of Cory Doctorow.

Running R on Amazon’s EC2

May 17, 2009

This is a note for those who use R, but haven’t yet used Amazon’s EC2 cloud services.

Amazon’s EC2 is a type of cloud that provides on-demand computing infrastructures called Amazon Machine Images or AMIs. In general, these types of clouds provide several benefits:

  • Simple and convenient to use. An AMI contains your applications, libraries, data and all associated configuration settings. You simply access it. You don’t need to configure it. This applies not only to applications like R, but also can include any third-party data that you require.
  • On-demand availability. AMIs are available over the Internet whenever you need them. You can configure the AMIs yourself without involving the service provider. You don’t need to order any hardware and set it up.
  • Elastic access. With elastic access, you can rapidly provision and access the additional resources you need. Again, no human intervention from the service provider is required. This type of elastic capacity can be used to handle surge requirements when you might need many machines for a short time in order to complete a computation.
  • Pay per use. The cost of 1 AMI for 100 hours and 100 AMI for 1 hour is the same. With pay per use pricing, which is sometimes called utility pricing, you simply pay for the resources that you use.
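The pay-per-use point can be made concrete with a trivial calculation. Note that the hourly rate below is a placeholder for illustration, not an actual Amazon price:

```python
# Utility pricing: total cost depends only on instance-hours consumed.
rate_per_instance_hour = 0.10  # hypothetical $/hour, not a real Amazon price

def cost(instances, hours):
    return instances * hours * rate_per_instance_hour

# 1 AMI for 100 hours costs the same as 100 AMIs for 1 hour.
print(f"1 x 100h: ${cost(1, 100):.2f}, 100 x 1h: ${cost(100, 1):.2f}")
```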

Here are the main steps to use R on a pre-configured AMI.

Setup.
The setup needs to be done just once.

  1. Set up an Amazon Web Services (AWS) account by going to:

    If you already have an Amazon account for buying books and other items from Amazon, then you can use this account also for AWS.

  2. Login to the AWS console
  3. Create a “key-pair” by clicking on the link “Key Pairs” in the Configuration section of the Navigation Menu on the left hand side of the AWS console page.
  4. Click on the “Create Key Pair” button, about a quarter of the way down the page.
  5. Name the key pair and save it to a working directory, say /home/rlg/work.

Launching the AMI. These steps are done whenever you want to launch a new AMI.

  1. Login to the AWS console. Click on the Amazon EC2 tab.
  2. Click the “AMIs” button under the “Images and Instances” section of the left navigation menu of the AWS console.
  3. Enter “opendatagroup” in the search box and select the AMI labeled
    “opendatagroup/r-timeseries.manifest.xml”, which
    is AMI instance “ami-ea846283”.
  4. Enter the number of instances to launch (1), the name of the key pair that you have previously created, and select “web server” for the security group. Click the launch button to launch the AMI. Be sure to terminate the AMI when you are done.
  5. Wait until the status of the AMI is “running.” This usually takes about 5 minutes.

Accessing the AMI.

  1. Get the public IP address of the new AMI. The easiest way to do this is to select the AMI by checking the box. This provides some additional information about the AMI at the bottom of the window. You can copy the IP address there.
  2. Open a console window and cd to your working directory which contains the key-pair that you previously downloaded.
  3. Type the command:
    ssh -i testkp.pem -X root@<public-ip-address>

    Here we assume that the name of the key-pair you created is “testkp.pem” and that <public-ip-address> is the public IP address you obtained in step 1. The flag “-X” starts a session that supports X11. If you don’t have X11 on your machine, you can still login and use R, but the graphics in the example below won’t be displayed on your computer.

Using R on the AMI.

  1. Change your directory and start R:

    #cd examples
    #R
  2. Test R by entering a R expression, such as:

    > mean(1:100)
    [1] 50.5
  3. From within R, you can also source one of the example scripts to see some time series computations:

    > source('NYSE.r')

  4. After a minute or so, you should see a graph on your screen. After the graph is finished being drawn, you should see a prompt:

    CR to continue

    Enter a carriage return and you should see another graph. You will need to enter a carriage return 8 times to complete the script (you can also choose to break out of the script if you get bored with all the graphs).
  5. To plot the time series xts.return and write the result to a file called ‘ret-plot.pdf’ use:

    > pdf("ret-plot.pdf")
    > plot(xts.return)

    You can then copy the file from the instance to your local machine (again substituting your instance’s public IP address) using the command:

    scp -i testkp.pem root@<public-ip-address>:ret-plot.pdf .
  6. When you are done, exit your R session with a control-D. Exit your ssh session with an “exit” and terminate your AMI from the Amazon AWS console. You can also choose to leave your AMI running (it is only a few dollars a day).

Acknowledgements: Steve Vejcik from Open Data Group wrote the R scripts and configured the AMI.

One day course. I’ll be covering this example as well as several other case studies in a one day course taking place in San Mateo on July 14. See the courses page for more details.

Open Source Analytics Reaches Main Street (and Some Other Trends in Analytics)

May 11, 2009

This is the first of three posts about systems, applications, services and architectures for building and deploying analytics. Sometimes this is called analytic infrastructure. This post is primarily directed at the analytic infrastructure needs of companies. Later posts will look at analytic infrastructure for the research community.

In this first post of the series, we discuss five important trends impacting analytic infrastructure.

Trend 1. Open source analytics has reached Main Street. R, which was first released in 1996, is now the most widely deployed open source system for statistical computing. A recent article in the New York Times estimated that over 250,000 individuals use R regularly. Dice News has created a video called “What’s Up with R” to inform job hunters using their services about R. In the language of Geoffrey A. Moore’s book Crossing the Chasm, R has reached “Main Street.”

Some companies still either ban the use of open source software or require an elaborate approval process before open source software can be used. Today, if a company does not allow the use of R, it puts the company at a competitive disadvantage.

Trend 2. The maturing of open, standards based architectures for analytics. Many of the common applications used today to build statistical models are stand-alone applications designed to be used by a single statistician. It is usually a challenge to deploy the model produced by the application into operational systems. Some applications can express statistical models as C++ or SQL, which makes deployment easier, but it can still be a challenge to transform the data into the format expected by the model.

The Predictive Model Markup Language (PMML) is an XML language for expressing statistical and data mining models that was developed to provide an application-independent and platform-independent mechanism for importing and exporting models. PMML has become the dominant standard for statistical and data mining models. Many applications now support PMML.
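To give a flavor of the language, here is a small PMML fragment sketching a linear regression model with a single predictor. The field names and coefficients are invented for illustration:

```xml
<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2">
  <Header description="Illustrative example only"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="example" functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <!-- y = 1.0 + 2.5 * x -->
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x" coefficient="2.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the model is expressed in an application-independent form like this, it can be exported from the statistician’s tool and imported directly into an operational scoring system.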

By using these applications, it is possible to build an open, modular standards based environment for analytics. With this type of open analytic environment, it is quicker and less labor-intensive to deploy new analytic models and to refresh currently deployed models.

Disclaimer: I’m one of the many people that has been involved in the development of the PMML standard.

Trend 3. The emergence of systems that simplify the analysis of large datasets. Analyzing large datasets is still very challenging, but with the introduction of Hadoop, there is now an open source system supporting MapReduce that scales to thousands of processors.

The significance of Hadoop and MapReduce is not only the scalability, but also the simplicity. Most programmers, with no prior experience, can have their first Hadoop job running on a large cluster within a day. Most programmers find that it is much easier and much quicker to use MapReduce and some of its generalizations than it is to develop and implement an MPI job on a cluster, which is currently the most common programming model for clusters.
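To illustrate the simplicity of the programming model, here is a sketch of MapReduce-style word counting in plain Python. This mimics the map/shuffle/reduce structure; it is not the actual Hadoop API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # map: emit (key, value) pairs -- here, (word, 1) for each word
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # shuffle: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: combine all values for one key
    return (key, sum(values))

records = ["large data clouds", "data intensive clouds"]
pairs = chain.from_iterable(map_phase(r) for r in records)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'large': 1, 'data': 2, 'clouds': 2, 'intensive': 1}
```

In Hadoop, the programmer supplies only the map and reduce functions; the framework handles the shuffle, distribution, and fault tolerance across the cluster.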

Trend 4. Cloud-based data services. Over the next several years, cloud-based services will begin to impact analytics significantly. A later post in this series will show, for example, how simple it is to use R in a cloud. Although there are security, compliance and policy issues to work out before it becomes common to use clouds for analytics, I expect that these and related issues will all be worked out over the next several years.

Cloud-based services provide several advantages for analytics. Perhaps the most important is elastic capacity — if 25 processors are needed for one job for a single hour, then these can be used for just the single hour and no more. This ability of clouds to handle surge capacity is important for many groups that do analytics. With the appropriate surge capacity provided by clouds, modelers can be more productive, and this can be accomplished in many cases without requiring any capital expense. (Third party clouds provide computing capacity that is an operating and not a capital expense.)

Trend 5. The commoditization of data. Moore’s law applies not only to CPUs, but also to the chips that are used in all of the digital devices that produce data. The result has been that the cost to produce data has been falling for some time. Similarly, the cost to store data has also been falling for some time.

Indeed, more and more datasets are being offered for free. For example, end of day stock quotes from Yahoo, gene sequence data from NCBI, and public data sets hosted by Amazon, including the U.S. Census Bureau, are all available now for free.

The significance to analytics is that the cost to enrich data with third party data, which often produces better models, is falling. Over time, more and more of this data will be available in clouds, so that the effort to integrate this data into modeling will also decrease.