Learning About Cloud Analytics

April 6, 2009

Clouds are changing the way that analytic models get built and the way they get deployed.

Neither analytics nor clouds have standard definitions yet.

I like to define analytics as the analysis of data to support decisions. For example, analytics is used in marketing to develop statistical models for acquiring customers and predicting their future profitability. Analytics is used in risk management to identify fraud, to discover compromises in operations, and to reduce risk. Analytics is used in operations to improve business and operational processes.

For cloud computing, a good working definition is that clouds are racks of commodity computers that provide on-demand resources and services over a network, usually the Internet, with the scale and reliability of a data center.

There are two different, but related, types of clouds: the first type provides computing instances on demand, while the second type provides computing capacity on demand. Both use the same underlying hardware, but the first is designed to scale out by providing additional computing instances, while the second is designed to support data- or compute-intensive applications by scaling capacity. Amazon’s EC2 and S3 services are examples of the first type of cloud. The Hadoop system is an example of the second type of cloud.

Currently, as a platform for analytics, clouds offer several advantages:

  1. Building analytic models on very large datasets. “Hadoop style clouds” provide an effective platform for developing analytic models on datasets that are too large to process comfortably on a single machine.
  2. Scoring data using analytic models. Given an analytic model and some data (either a file of data or a stream of data), “Amazon style clouds” provide a simple and effective platform for scoring data. The Predictive Model Markup Language (PMML) has proved to be a very effective mechanism for moving a statistical or analytic model built using one analytic system into a cloud for scoring. The application that builds the model is sometimes called the PMML Producer, and the application that scores new data using the model is called the PMML Consumer. Using this terminology, “Amazon style clouds” can act as PMML Consumers that score data easily using PMML models built elsewhere.
  3. Simplifying modeling environments. Computing instances in a cloud can be built that incorporate all the analytic software required for building models, including preconfigured connections to all the data required for modeling. At least for small to medium size datasets, preconfiguring computing instances in this way can simplify the development of analytic models.
  4. Easy access to data. Clouds can also make it much easier to access data for modeling. Amazon has recently made available a variety of public datasets. For example, using Amazon’s EBS service, the U.S. Census data can be accessed immediately.
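
To make the PMML Producer and PMML Consumer idea in item 2 a bit more concrete, here is a minimal sketch in Python of what a consumer does. It is an illustration under stated assumptions, not how any particular scoring engine works: the embedded document is a heavily simplified, hypothetical PMML fragment (a real PMML file also carries an XML namespace, a Header, a DataDictionary and a MiningSchema), the field names visits and age are made up, and only a linear regression model is handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML fragment standing in for a model that a
# PMML Producer exported elsewhere. Field names and numbers are made up.
PMML_DOC = """<PMML version="4.1">
  <RegressionModel functionName="regression" modelName="profit_sketch">
    <RegressionTable intercept="10.0">
      <NumericPredictor name="visits" coefficient="2.0"/>
      <NumericPredictor name="age" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_linear_model(pmml_text):
    """Read the intercept and per-field coefficients from the fragment."""
    root = ET.fromstring(pmml_text)
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(record, intercept, coefficients):
    """Score one record: intercept plus the sum of coefficient * value."""
    return intercept + sum(
        coefficient * record[name] for name, coefficient in coefficients.items()
    )


def score_stream(records, pmml_text):
    """Act as a tiny PMML Consumer: load the model once, then score each record."""
    intercept, coefficients = load_linear_model(pmml_text)
    for record in records:
        yield score(record, intercept, coefficients)


if __name__ == "__main__":
    incoming = [{"visits": 3, "age": 40}, {"visits": 0, "age": 20}]
    for result in score_stream(incoming, PMML_DOC):
        print(result)
```

Because the model travels as a document and not as code, the same scoring loop can run unchanged on as many computing instances as the incoming data requires, which is the property that makes instance-on-demand clouds a natural fit for scoring.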

I’ll be one of the lecturers in two upcoming courses that introduce clouds as well as cloud analytics.

The first course will be taught in Chicago on June 22, 2009, and the second one in San Mateo on July 14, 2009. You can register for the Chicago course using this registration link and the San Mateo course using this registration link.

This one-day course gives a quick introduction to cloud computing and analytics. It describes several different types of clouds and what is new about cloud computing, and discusses some of the advantages and disadvantages that clouds offer when building and deploying analytic models. It includes three case studies, a survey of vendors, and information about setting up your first cloud.

The course syllabus can be found here: www.opendatagroup.com/courses.htm.