Open Source Analytics Reaches Main Street (and Some Other Trends in Analytics)

May 11, 2009

This is the first of three posts about systems, applications, services and architectures for building and deploying analytics. Sometimes this is called analytic infrastructure. This post is primarily directed at the analytic infrastructure needs of companies. Later posts will look at analytic infrastructure for the research community.

In this first post of the series, we discuss five important trends impacting analytic infrastructure.

Trend 1. Open source analytics has reached Main Street. R, which was first released in 1996, is now the most widely deployed open source system for statistical computing. A recent article in the New York Times estimated that over 250,000 individuals use R regularly. Dice News has created a video called “What’s Up with R” to inform job hunters using their services about R. In the language of Geoffrey A. Moore’s book Crossing the Chasm, R has reached “Main Street.”

Some companies still either ban the use of open source software or require an elaborate approval process before it can be used. Today, a company that does not allow the use of R puts itself at a competitive disadvantage.

Trend 2. The maturing of open, standards-based architectures for analytics. Many of the common applications used today to build statistical models are stand-alone applications designed to be used by a single statistician. It is usually a challenge to deploy the model produced by the application into operational systems. Some applications can express statistical models as C++ or SQL, which makes deployment easier, but it can still be a challenge to transform the data into the format expected by the model.

The Predictive Model Markup Language (PMML) is an XML language for expressing statistical and data mining models that was developed to provide an application-independent and platform-independent mechanism for importing and exporting models. PMML has become the dominant standard for statistical and data mining models. Many applications now support PMML.
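Because PMML is just XML, a model exported by one application can be inspected or imported by another with nothing more than an XML parser. The fragment below is a heavily simplified sketch of a PMML regression model (a real document carries a fuller data dictionary and schema details), parsed here with Python's standard library:

```python
import xml.etree.ElementTree as ET

# A simplified PMML-style fragment for illustration; real PMML
# documents are richer, but use the same element names shown here.
pmml = """
<PMML version="3.2">
  <DataDictionary>
    <DataField name="income" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="risk_model" functionName="regression">
    <RegressionTable intercept="0.5">
      <NumericPredictor name="income" coefficient="-0.02"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

root = ET.fromstring(pmml)
model = root.find("RegressionModel")
print(model.get("modelName"))                     # risk_model
for p in model.iter("NumericPredictor"):
    print(p.get("name"), p.get("coefficient"))    # income -0.02
```

The point is that the consuming application needs no knowledge of the tool that produced the model, only of the PMML element names.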

By using these applications, it is possible to build an open, modular, standards-based environment for analytics. With this type of open analytic environment, it is quicker and less labor-intensive to deploy new analytic models and to refresh currently deployed models.

Disclaimer: I’m one of the many people that has been involved in the development of the PMML standard.

Trend 3. The emergence of systems that simplify the analysis of large datasets. Analyzing large datasets is still very challenging, but with the introduction of Hadoop, there is now an open source system supporting MapReduce that scales to thousands of processors.

The significance of Hadoop and MapReduce lies not only in their scalability, but also in their simplicity. Most programmers, with no prior experience, can have their first Hadoop job running on a large cluster within a day. Most programmers find that it is much easier and quicker to use MapReduce and its generalizations than to develop and implement an MPI job on a cluster, which is currently the most common programming model for clusters.
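To illustrate the simplicity: a complete word-count job in the Hadoop Streaming style is just a mapper that emits key/value pairs and a reducer that folds each sorted group. The sketch below simulates Hadoop's sort-and-shuffle step locally with a plain `sorted` call; on a cluster, Hadoop supplies that step and the parallelism:

```python
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every word seen; on Hadoop, input would
    # arrive on stdin and pairs would be written to stdout.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Pairs arrive grouped by key after the shuffle; sum each group.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    data = ["the cat sat", "the cat ran"]
    shuffled = sorted(mapper(data))   # simulate Hadoop's sort phase
    print(dict(reducer(shuffled)))    # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

The programmer writes only the two functions; distribution, fault tolerance, and the shuffle are the framework's problem.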

Trend 4. Cloud-based data services. Over the next several years, cloud-based services will begin to impact analytics significantly. A later post in this series will show, for example, how simple it is to use R in a cloud. Although there are security, compliance and policy issues to work out before using clouds for analytics becomes common, I expect that these and related issues will be worked out over the next several years.

Cloud-based services provide several advantages for analytics. Perhaps the most important is elastic capacity — if 25 processors are needed for one job for a single hour, then these can be used for just the single hour and no more. This ability of clouds to handle surge capacity is important for many groups that do analytics. With the appropriate surge capacity provided by clouds, modelers can be more productive, and this can be accomplished in many cases without requiring any capital expense. (Third party clouds provide computing capacity that is an operating and not a capital expense.)
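The arithmetic behind this point is simple. Using an illustrative on-demand price (the hourly rate below is an assumption for the sake of the example, not a quote from any provider), the surge in the example above costs only a few dollars of operating expense:

```python
# Illustrative on-demand pricing; the hourly rate is an assumption.
HOURLY_RATE = 0.10   # dollars per instance-hour (assumed)

def surge_cost(instances, hours, rate=HOURLY_RATE):
    """Operating cost of renting capacity only for the hours used."""
    return instances * hours * rate

# 25 processors for a single hour:
print(surge_cost(25, 1))   # 2.5
```

Owning 25 machines to cover the same occasional peak would be a capital expense that sits idle most of the time.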

Trend 5. The commoditization of data. Moore’s law applies not only to CPUs, but also to the chips used in all of the digital devices that produce data. The result is that the cost to produce data has been falling for some time. Similarly, the cost to store data has also been falling for some time.

Indeed, more and more datasets are being offered for free. For example, end-of-day stock quotes from Yahoo, gene sequence data from NCBI, and the public data sets hosted by Amazon, including data from the U.S. Census Bureau, are all available now for free.

The significance to analytics is that the cost to enrich data with third party data, which often produces better models, is falling. Over time, more and more of this data will be available in clouds, so that the effort to integrate this data into modeling will also decrease.

Sector – When You Really Need to Process 10 Billion Records

April 19, 2009

As is well known by now, Google demonstrated the power of a layered stack of cloud services that are designed for commodity computers that fill a data center. The stack consists of a storage service (the Google File System (GFS)), a compute service based upon MapReduce, and a table service (BigTable).

Although the Google stack of services is not directly available, the open source Hadoop system, which has a broadly similar architecture, is available.

The Google stack, consisting of GFS/MapReduce/Bigtable, and the Hadoop system, consisting of the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce, are examples of clouds designed for data intensive computing — these types of clouds provide computing capacity on demand, with capacity scaling all the way up to the size of a data center.

There are still many open questions about how best to design clouds for data intensive computing. During the past several years, I have been involved with a cloud designed for data intensive computing called Sector. The lead developer of Sector is Yunhong Gu of the University of Illinois at Chicago. Sector was developed independently of Hadoop and the Google cloud services and makes several different design choices (see the table below).

To quantify the impact of some of these choices, I have been involved with the development of a benchmark for data intensive computing called MalStone. I will talk more about MalStone in a future post, but briefly, MalStone is a stylized analytic computation that can be done simply using MapReduce, as well as variants and generalizations of MapReduce. The open source MalStone code comes with a generator of synthetic records, and one benchmark (called MalStone B) generates 10 billion 100-byte records (similar to TeraSort).
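As a sketch of what generating fixed-width synthetic records involves (the field layout below is invented for illustration; the actual record format is defined by the open source MalStone code), producing 100-byte records is straightforward:

```python
import random

RECORD_LEN = 100   # MalStone B uses fixed 100-byte records

def make_record(rng):
    # Hypothetical field layout for illustration only; the real
    # MalStone layout is defined by the MalStone generator code.
    fields = [
        "%019d" % rng.randrange(10**19),   # event id
        "%09d" % rng.randrange(10**9),     # site id
        "%09d" % rng.randrange(10**9),     # entity id
        "2008-%02d-%02d" % (rng.randrange(1, 13), rng.randrange(1, 29)),
        "%d" % rng.randrange(2),           # mark flag
    ]
    record = "|".join(fields)
    return record.ljust(RECORD_LEN)        # pad to exactly 100 bytes

rng = random.Random(42)                    # seeded for reproducibility
print(len(make_record(rng)))               # 100
```

At 100 bytes per record, the 10 billion records of MalStone B amount to roughly a terabyte of input, which is why the benchmark exercises the storage layer as much as the compute layer.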

MalStone B Benchmarks

System                          Time (min)
Hadoop MapReduce                   799
Hadoop Streaming with Python       143
Sector                              44

Tests were done using 20 nodes on the Open Cloud Testbed. Each node contained 500 million 100-byte records.

Comparing Sector and Hadoop

                    Hadoop                     Sector
Storage cloud       block-based file system    file-based
Programming model   MapReduce                  user defined functions and MapReduce
Protocol            TCP                        UDP
Security            NA                         HIPAA capable
Replication         at time of writing         periodically
Language            Java                       C++

I’ll be giving a talk on Sector at CloudSlam ’09 on Monday, April 20, 2009 at 4pm ET. CloudSlam is a virtual conference, so it is easy to listen to any of the talks that interest you.

Cloud Interoperability – Five Challenge Problems

March 30, 2009

I attended the OMG Workshop on Cloud Interoperability in Arlington, Virginia that took place on March 23, 2009.

There were presentations about cloud interoperability from the Cloud Computing Interoperability Forum (CCIF), the Open Cloud Consortium (OCC), the Open Grid Forum (OGF), the Distributed Management Task Force (DMTF), as well as from several other companies and organizations.

(Full disclosure: I gave the OCC presentation and have some opinions of my own about cloud interoperability.)

To say the least, it was clear from the meeting that there was no consensus yet about how to begin to draft standards for clouds and for their interoperability. This was not surprising. But what was surprising, at least to me, was that there was not even consensus about what were the real problems and challenges.

After the meeting, I decided that it might be helpful to create a list containing some challenge problems that, hopefully, with the right standards, would be much easier to solve. Here is a very rough first attempt at creating such a list.

Problem 1. Assume I have a web application that uses three communicating virtual machines in Cloud 1. Move the application transparently to Cloud 2.

Problem 2. Assume I have a web application that uses three communicating virtual machines in Cloud 1. Increase the capacity of the application by adding three more virtual machine instances from Cloud 2.

Problem 3. Assume I have a cloud application in Cloud 1 that relies upon a cloud storage service, such as a storage service provided by the Hadoop Distributed File System (HDFS). Assume that Cloud 2 has cloud storage services, but that they are not implemented using Hadoop. Move the cloud application transparently to Cloud 2.

Problem 4. Assume I have a cloud application in Cloud 1 that relies upon a cloud storage service as well as a cloud compute service that is based upon MapReduce or one of its variants or generalizations. Assume that Cloud 2 has cloud storage and compute services, but that they are not implemented using Hadoop. Move the cloud application transparently to Cloud 2.

Problem 5. Assume that there is a HIPAA compliant application in Cloud 1 (managed by Provider P) and another in Cloud 2 (managed by Provider Q). Assume that a patient has given permission to Cloud Provider P to copy any of his information (say, a file) that is in Cloud 2 and move it to Cloud 1 in order to create a consolidated record in Cloud 1. Create the consolidated record (say, the concatenation of the two files) automatically in Cloud 1.

The first two problems concern clouds that provide on-demand computing instances, such as those provided by Amazon’s EC2 and S3 services. The next two problems concern clouds designed for data intensive cloud applications, such as applications using Hadoop services. The fifth problem concerns sharing information between two clouds.

As always with any new standards, the goal is to rely on existing standards as much as possible and only to introduce something new when absolutely required. Therefore the first challenge is to identify the gaps in current standards for these five problems.

None of the problems is precise enough yet to determine whether new standards are required, but perhaps they are suggestive of problems that can be made precise enough for this purpose.