Large Data Clouds FAQ

July 16, 2009

This post contains some questions and answers about large data clouds; I expect to update and expand it from time to time.

What is large data? From the point of view of the infrastructure required to do analytics, data comes in three sizes:

  • Small data. Small data fits into the memory of a single machine. A good example of a small dataset is the dataset for the Netflix Prize. The Netflix Prize dataset consists of over 100 million movie ratings from about 480 thousand randomly chosen, anonymous Netflix customers who rated over 17 thousand movie titles. This dataset (although challenging enough to keep anyone from winning the grand prize for over 2 years) is just 2 GB of data and fits into the memory of a laptop. I discuss some lessons in analytic strategy that you can learn from this contest in this post.

    [Image: Building the ATLAS Detector at CERN’s Large Hadron Collider]

  • Medium data. Medium data fits on a single disk or disk array and can be managed by a database. It is becoming common today for companies to create data warehouses of 1 to 10 TB or larger.
  • Large data. Large data is so large that it is challenging to manage in a database; instead, specialized systems are used. We’ll discuss some examples of these specialized systems below. Scientific experiments, such as the Large Hadron Collider (LHC), produce large datasets. The log files produced by companies such as Google, Yahoo and Microsoft are also examples of large datasets.

There have always been large datasets, but until recently most large datasets were produced by the scientific and defense communities. Two things have changed. First, large datasets are now being produced by a third community: companies that provide internet services, such as search, on-line advertising and social media. Second, the ability to analyze these datasets is critical for the advertising systems that produce the bulk of the revenue for these companies. This provides a metric (dollars of online revenue produced) by which to measure the effectiveness of analytic infrastructure and analytic models. Using this metric, companies such as Google settled upon analytic infrastructure that is quite different from the grid-based infrastructure generally used by the scientific community.

What is a large data cloud? There is no standard definition of a large data cloud, but a good working definition is that a large data cloud provides i) storage services and ii) compute services that are layered over the storage services, where both scale out to a data center and have the reliability associated with a data center. You can find some background information on clouds on this page, which contains an overview of clouds.

What are some of the options for working with large data? There are several options, including:

  • The most mature large data cloud application is the open source Hadoop system, which consists of the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce. An important advantage of Hadoop is that it has a very robust community supporting it and there are a large number of Hadoop projects, including Pig, which provides simple database-like operations over data managed by HDFS.
  • Another option is Sector, which consists of the Sector Distributed File System (SDFS) and a compute service called Sphere that allows users to execute arbitrary User Defined Functions (UDFs) over the data managed by SDFS. Sector supports MapReduce as a special case: a user-defined Map UDF, followed by the Shuffle and Sort UDFs provided by Sphere, followed by a user-defined Reduce UDF. Sector is an open source application written in C++. Unlike Hadoop, Sector includes security. There is a public Sector cloud for those interested in trying out Sector without downloading and installing it.
  • Greenplum uses a shared-nothing MPP (massively parallel processing) architecture based upon commodity hardware. The Greenplum architecture also integrates MapReduce-like functionality into its platform.
  • Aster has an MPP-based data warehousing appliance that supports MapReduce. They have an entry-level system that manages up to 1 TB of data and an enterprise-level system that is designed to support up to 1 PB of data.

How do I get started? The easiest way to get started is to download one of the applications and to work through some basic examples. The example that most people work through is word count (a minimal sketch of the pattern appears below). Another common example is terasort (sorting 10 billion 100-byte records, where the first 10 bytes of each record are the key that is sorted and the remaining 90 bytes are the payload). A simple analytic to try is MalStone, which I have described in another post.
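
To make the programming model concrete, here is a minimal, single-process Python sketch of the word count pattern. It is not Hadoop or Sphere code; it simply shows the map, shuffle and reduce phases that those systems run in parallel across a cluster, with made-up sample input.

    # A local, single-process sketch of the MapReduce word count pattern.
    # Hadoop MapReduce and Sphere UDFs run the same three phases -- map,
    # shuffle/sort, reduce -- in parallel across many nodes.

    from collections import defaultdict

    def map_phase(lines):
        # Map: emit a (word, 1) pair for every word in every input line.
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def shuffle_phase(pairs):
        # Shuffle: group all values by key (the framework does this for you).
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(groups):
        # Reduce: sum the counts for each word.
        for word, counts in groups:
            yield word, sum(counts)

    if __name__ == "__main__":
        sample = ["the quick brown fox", "the lazy dog", "the fox"]
        for word, count in sorted(reduce_phase(shuffle_phase(map_phase(sample)))):
            print(word, count)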

What are some of the issues that arise with large data cloud applications? The first issue is mapping your problem to the MapReduce framework or to a generalized MapReduce framework (such as Sphere’s UDFs). Although this type of data parallel framework may seem quite specialized at first, it is surprising how many problems can be mapped to it with a bit of effort.

The second issue is that tuning Hadoop clusters can be challenging and time consuming. This is not surprising, considering the power provided by Hadoop to tackle very large problems.

The third issue is that with medium (100-node) and large (1,000-node) clusters, even a few underperforming nodes can impact the overall performance. There can also be problems with switches that impact performance in subtle ways. Dealing with these types of hardware issues can also be time consuming. It is sometimes helpful to run a known benchmark, such as terasort or MalStone, to distinguish hardware issues from programming issues.

What is the significance of large data clouds? Just a short time ago, it required specialized proprietary software to analyze 100 TB or more of data. Today, a competent team should be able to do this relatively straightforwardly with a 100-node large data cloud powered by Hadoop, Sector or similar software.

Getting involved. I just set up a Google Group for large data clouds: groups.google.com/group/large-data-clouds. Please use this group to discuss issues related to large data clouds, including lessons learned, questions, announcements, etc. (no advertising please). In particular, if you have software you would like added to the list of options above, please comment below or send a note to the large data clouds Google Group.


The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing

May 25, 2009

The TPC Benchmarks have played an important role in comparing databases and transaction processing systems. Currently, there are no similar benchmarks for comparing clouds.

[Image: Benchmark]

The CloudStone Benchmark is a first step towards a benchmark for clouds designed to support Web 2.0 type applications. In this note, we describe the MalStone Benchmark, which is a first step towards a benchmark for clouds, such as Hadoop and Sector, designed to support data intensive computing.

MalStone is a stylized analytic computation of a type that is common in data intensive computing. The open source code to generate data for MalStone, together with a technical report describing MalStone and providing some sample implementations, can be found at code.google.com/p/malgen (look in the featured downloads section along the right-hand side).

Detecting Drive-By Exploits from Log Files

We introduce MalStone with a simple example. Consider visitors to web sites. As described in the paper The Ghost in the Browser by Provos et al., which was presented at HotBots ’07, approximately 10% of web pages have exploits installed that can infect certain computers when users visit those pages. These are sometimes called “drive-by exploits.”

The MalStone benchmark assumes that there are log files that record the date and time that users visited web pages. Assume that the log files of visits have the following fields:

   | Timestamp | Web Site ID | User ID

We further assume that if a computer becomes infected, perhaps at a later time, then this is known. That is, for each computer, which we assume is identified by the ID of the corresponding user, it is known whether that computer became compromised at some later time:

   | User ID | Compromise Flag

Here the Compromise Flag field is a flag, with 1 denoting a compromise. A very simple statistic that provides some insight into whether a web site is a possible source of compromises is, for each web site, the ratio of the visits after which the visiting computer subsequently became compromised to the total number of visits to that site.

We call MalStone stylized since we do not argue that this is a useful or effective algorithm for finding compromised sites. Rather, we point out that if the log data is so large that it requires large numbers of disks to manage it, then computing something as simple as this ratio can be computationally challenging. For example, if the data spans 100 disks, then the computation cannot be done easily with any of the databases that are common today. On the other hand, if the data fits into a database, then this statistic can be computed easily using a few lines of SQL.
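
To make the computation concrete, here is a small in-memory Python sketch that joins the visit log to the compromise flags and computes the per-site statistic. Only the two record layouts above are assumed; the sample data and field names are made up for illustration. At the scale MalStone targets, the same join and aggregation must be distributed across many disks and nodes.

    # A small in-memory sketch of the per-site statistic described above:
    # the fraction of visits after which the visiting computer was later
    # compromised. The sample records below are made up for illustration.

    from collections import defaultdict

    # | Timestamp | Web Site ID | User ID
    visits = [
        ("2009-05-01T10:00", "site-1", "user-a"),
        ("2009-05-01T10:05", "site-1", "user-b"),
        ("2009-05-01T10:10", "site-2", "user-a"),
    ]

    # | User ID | Compromise Flag   (1 = compromised at some later time)
    compromised = {"user-a": 1, "user-b": 0}

    visit_counts = defaultdict(int)
    compromised_counts = defaultdict(int)

    for _timestamp, site_id, user_id in visits:
        visit_counts[site_id] += 1
        compromised_counts[site_id] += compromised.get(user_id, 0)

    for site_id in sorted(visit_counts):
        print(site_id, compromised_counts[site_id] / visit_counts[site_id])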

The MalStone benchmarks use records of the following form:

   | Event ID | Timestamp | Site ID | Compromise Flag | Entity ID

Here “site” abstracts the web site and “entity” abstracts the possibly infected computer. We assume that each record is 100 bytes long.

In the MalStone A Benchmarks, for each site, the number of records for which an entity visited the site and subsequently became compromised is divided by the total number of records for which an entity visited the site. The MalStone B Benchmark is similar, but this ratio is computed for each week (using a window from the beginning of the period to the end of the week of interest). MalStone A-10 uses 10 billion records, so that in total there is 1 TB of data. Similarly, MalStone A-100 uses 100 billion records and MalStone A-1000 uses 1 trillion records. MalStone B-10, B-100 and B-1000 are defined in the same way.
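
Here is a Python sketch of how MalStone A maps onto the MapReduce pattern over records of the form above. It is illustrative only, not the reference implementation distributed with MalGen: it treats the Compromise Flag as marking whether a visit was later followed by a compromise, and it ignores the timestamps that MalStone B needs for its weekly windows.

    # Illustrative sketch only (not the MalStone reference implementation):
    # MalStone A expressed in the map/reduce pattern over records of the form
    #   Event ID | Timestamp | Site ID | Compromise Flag | Entity ID

    from collections import defaultdict

    def mapper(records):
        # Map: emit (site_id, compromise_flag) for each visit record.
        for record in records:
            _event, _ts, site_id, flag, _entity = (
                field.strip() for field in record.split("|")
            )
            yield site_id, int(flag)

    def reducer(pairs):
        # Reduce: per site, divide compromised visits by all visits.
        # (A real framework groups the pairs by key before the reduce step.)
        visits = defaultdict(int)
        compromises = defaultdict(int)
        for site_id, flag in pairs:
            visits[site_id] += 1
            compromises[site_id] += flag
        for site_id in sorted(visits):
            yield site_id, compromises[site_id] / visits[site_id]

    if __name__ == "__main__":
        sample = [
            "1 | 2009-05-01T10:00 | site-1 | 1 | entity-a",
            "2 | 2009-05-01T10:05 | site-1 | 0 | entity-b",
            "3 | 2009-05-01T10:10 | site-2 | 0 | entity-a",
        ]
        for site_id, ratio in reducer(mapper(sample)):
            print(site_id, ratio)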

TeraSort Benchmark

One of the motivations for choosing 10 billion 100-byte records is that the TeraSort Benchmark (sometimes called the Terabyte Sort Benchmark) also uses 10 billion 100-byte records.

In 2008, Hadoop became the first open source program to hold the record for the TeraSort Benchmark. It sorted 1 TB of data using 910 nodes in 209 seconds, breaking the previous record of 297 seconds. Hadoop set a new record in 2009 by sorting 100 TB of data at 0.578 TB/minute using 3800 nodes. For some background about the TeraSort Benchmark, see the blog post Hadoop Wins Terasort by James Hamilton.

Note that the TeraSort Benchmark is now deprecated and has been replaced by the Minute Sort Benchmark. Currently, 1 TB of data can be sorted in about a minute given the right software and sufficient hardware.

Generating Data for MalStone Using MalGen

We have developed a generator of synthetic data for MalStone called MalGen. MalGen is open source and available from code.google.com/p/malgen. Using MalGen, data can be generated with power law distributions, which is useful when modeling web sites (a few sites have a lot of visitors, but most sites have relatively few visitors).
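
The following is a minimal Python sketch of the idea, not the MalGen code itself: it draws sites from a Zipf-like (1/k) distribution so that a few sites receive most of the visits, and then emits records in the MalStone layout. The constants and the fixed timestamp are made up for illustration.

    # A minimal sketch of power law synthetic data in the spirit of MalGen.
    # This is not the MalGen code; see code.google.com/p/malgen for that.

    import random

    NUM_SITES = 1000
    NUM_ENTITIES = 10000
    NUM_RECORDS = 100000
    COMPROMISE_PROB = 0.01  # made-up probability that a visit leads to a compromise

    # Zipf-like weights: site k is visited with probability proportional to 1/k,
    # so a few sites receive most of the visits.
    weights = [1.0 / k for k in range(1, NUM_SITES + 1)]
    sites = random.choices(range(1, NUM_SITES + 1), weights=weights, k=NUM_RECORDS)

    for event_id, site in enumerate(sites, start=1):
        entity = random.randint(1, NUM_ENTITIES)
        flag = 1 if random.random() < COMPROMISE_PROB else 0
        print(f"{event_id} | 2009-05-01T10:00 | site-{site} | {flag} | entity-{entity}")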

Using MalStone to Study Design Tradeoffs

Recently, we did several experimental studies comparing different implementations of MalStone on 10 billion 100-byte records. The experiments were done on 20 nodes of the Open Cloud Testbed. Each node was a Dell 1435 computer with 12 GB of memory, a 1 TB disk, two dual-core 2.0 GHz AMD Opteron 2212 processors, and 1 Gb/s network interface cards.

We compared three different implementations: 1) Hadoop HDFS with Hadoop’s implementation of MapReduce; 2) Hadoop HDFS using Streams and coding MalStone in Python; and 3) the Sector Distributed File System (SDFS) and coding the algorithm using Sphere User Defined Functions (UDFs).

                           MalStone A      MalStone B
Hadoop MapReduce           454m 13s        840m 50s
Hadoop Streams/Python       87m 29s        142m 32s
Sector/Sphere UDFs          33m 40s         43m 44s

Please note that these timings are still preliminary and may be revised in the future as we better optimize the implementations.

If you have 1000 nodes and want to run a data intensive or analytic computation, then Hadoop is a very good choice. What these preliminary benchmarks indicate, though, is that you may want to compare the performance of Hadoop MapReduce and Hadoop Streams. In addition, you may also want to consider using Sector.

The image above is from Strolling everyday and is available via a Creative Commons license.

Disclaimer: I am involved in the development of Sector.


Sector – When You Really Need to Process 10 Billion Records

April 19, 2009

As is well known by now, Google demonstrated the power of a layered stack of cloud services that are designed for commodity computers that fill a data center. The stack consists of a storage service (the Google File System (GFS)), a compute service based upon MapReduce, and a table service (BigTable).

Although the Google stack of services is not directly available, the open source Hadoop system, which has a broadly similar architecture, is available.

The Google stack, consisting of GFS/MapReduce/Bigtable, and the Hadoop system, consisting of the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce, are examples of clouds designed for data intensive computing — these types of clouds provide computing capacity on demand, with capacity scaling all the way up to the size of a data center.

There are still many open questions about how best to design clouds for data intensive computing. During the past several years, I have been involved with a cloud designed for data intensive computing called Sector. The lead developer of Sector is Yunhong Gu of the University of Illinois at Chicago. Sector was developed independently of Hadoop and the Google cloud services and makes several different design choices (see the table below).

To quantify the impact of some of these choices, I have been involved with the development of a benchmark for data intensive computing called MalStone. I will talk more about MalStone in a future post, but briefly, MalStone is a stylized analytic computation that can be done simply using MapReduce, as well as with variants and generalizations of MapReduce. The open source MalStone code comes with a generator of synthetic records, and one benchmark (called MalStone B) generates 10 billion 100-byte records (similar to terasort).

MalStone B Benchmarks

System                          Time (min)
Hadoop MapReduce                799
Hadoop Streaming with Python    143
Sector                           44

Tests were done using 20 nodes on the Open Cloud Testbed. Each node contained 500 million 100-byte records.

Comparing Sector and Hadoop

                      Hadoop                     Sector
Storage cloud         block-based file system    file-based
Programming model     MapReduce                  user defined functions and MapReduce
Protocol              TCP                        UDP
Security              NA                         HIPAA capable
Replication           at time of writing         periodically
Language              Java                       C++

I’ll be giving a talk on Sector at CloudSlam ’09 on Monday, April 20, 2009 at 4pm ET. CloudSlam is a virtual conference, so that it is easy to listen to any of the talks that interest you.