This is a post that contains some questions and answers about large data clouds that I expect to update and expand from time to time.
What is large data? From the point of view of the infrastructure required to do analytics, data comes in three sizes:
- Small data. Small data fits into the memory of a single machine. A good example of a small dataset is the dataset for the Netflix Prize. The Netflix Prize dataset consists of over 100 million movie rating files by 480 thousand randomly-chosen, anonymous Netflix customers that rated over 17 thousand movie titles. This dataset (although challenging enough to keep anyone from winning the grand prize for over 2 years) is just 2 GB of data and fits into the memory of a laptop. I discuss some lessons in analytic strategy that you learn from this contest in this post.
- Medium data. Medium data fits into a single disk or disk array and can be managed by a database. It is becoming common today for companies to create 1 to 10 TB or large data warehouses.
- Large data. Large data is so large that it is challenging to manage it in a database and instead specialized systems are used. We’ll discuss some examples of these specialized systems below. Scientific experiments, such as the Large Hadron Collider (LHC), produce large datasets. Log files produced by Google, Yahoo and Microsoft and similar companies are also examples of large datasets.
There have always been large datasets, but until recently, most large datasets were produced by the scientific and defense communities. Two things have changed: First, large datasets are now being produced by a third community: companies that provide internet services, such as search, on-line advertising and social media. Second, the ability to analyze these datasets is critical for advertising systems that produce the bulk of the revenue for these companies. This provides a measure (dollars of online revenue produced) by which to measure the effectiveness of analytic infrastructure and analytic models. Using this metric, companies such as Google, settled upon analytic infrastructure that was quite different than the grid-based infrastructure that is generally used by the scientific community.
What is a large data cloud? There is no standard definition of a large data cloud, but a good working definition is that a large data cloud
provides i) storage services and ii) compute services that are layered over the storage services that scale to a data center and that have the reliability associated with a data center. You can find some background information on clouds on this page containing an overview about clouds.
What are some of the options for working with large data? There are several options, including:
- The most mature large data cloud application is the open source Hadoop system, which consists of the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce. An important advantage of Hadoop is that it has a very robust community supporting it and there are a large number of Hadoop projects, including Pig, which provides simple database-like operations over data managed by HDFS.
- Another option is Sector, which consists of the Sector Distributed File System (SDFS) and a compute service called Sphere that allows users to execute arbitrary User Defined Functions (UDFs) over the data managed by SDFS. Sector supports MapReduce as a special case of a user-defined Map UDF, followed by Shuffle and Sort UDFs provided by Sphere, followed by a user-defined Reduce UDF. Sector is a C++ open source application. Unlike Hadoop, Sector includes security. There is public Sector cloud for those interested in trying out Sector without downloading it and installing it.
- Greenplum uses a shared-nothing MPP (massively parallel processing) architecture based upon commodity hardware. The Greenplum architecture also integrates MapReduce-like functionality into its platform.
- Aster has a MPP-based data warehousing appliance that supports MapReduce. They have an entry level system that manages up to 1 TB of data and an enterprise level system that is designed to support up to 1 PB of data.
How do I get started? The easiest way to get started is to download one of the applications and to work through some basic examples. The example that most people work through is word count. Another common example is the terasort example (soring 10 billion 100 byte records where the first 10 bytes is the key that is sorted and the remaining 90 bytes is the payload). A simple analytic to try is MalStone, which I have described in another post.
What are some of the issues that arise with large data cloud applications? The first issue is mapping your problem to the MapReduce or generalized MapReduce (like Spheres UDFs) frameworks. Although this type of data parallel framework may seem quite special initially, it is surprising how many problems can be mapped to this framework with a bit effort.
The second issue is that tuning Hadoop clusters can be challenging and time consuming. This is not surprising, considering the power provided by Hadoop to tackle very large problems.
The third issue is that with medium (100 nodes) and large (1000 node) clusters, even a few under performing nodes can impact the overall performance. There can also be problems with switches that impact performance in subtle ways. Dealing with these types of hardware issues can also be time consuming. It is sometimes helpful to run a known benchmark such as terasort or MalStone to distinguish hardware issues from programming issues.
What is the significance of large data clouds? Just a short time ago, it required specialized proprietary software to analyze 100 TB or more of data. Today, a competant team should be able to do this relatively straightforwardly with a 100 node large data cloud powered by Hadoop, Sector or similar software.
Getting involved. I just set up a Google Group for large data clouds:
groups.google.com/group/large-data-clouds. Please use this group to discuss issues related to large data clouds, including lessons learned, questions, annoucements, etc. (no advertising please). In particular, if you have software you would like added to the list below, please comment below or send a node to the large data cloud google group.