In Analytics, It’s the Actions that Matter

April 28, 2009

In this note, let’s define analytics as the analysis of data in order to take actions. (This is a narrow definition of analytics, but one that is useful here.) If you don’t have day-to-day work experience with analytics, it is easy to have the mistaken impression that analytics is only about data and statistical models.

Although understanding data and developing statistical models are certainly important components of an analytic project, they are just one aspect of analytics. This aspect includes cleaning data, enriching data, exploring data, developing features, building models, validating models, and iterating the process. From a broad perspective, it is a process in which the input is data and the output is a statistical model. When most people think of modeling, this is what they think of. Yet for many analytic projects, it is just a small part of what is required for a successful engagement.

The second aspect of analytics is what I am concerned with in this note. This is the aspect of analytics concerned with:

  • developing an appropriate score for a statistical model;
  • using the score to define useful actions;
  • determining which measures are best for evaluating the effectiveness of these actions;
  • tracking these measures (often with a dashboard) and making sure that they advance the strategic objectives of the company or organization.

One way to remember this is with the mnemonic SAMS: Scores, Actions, Measures, and Strategies.

For example, a threshold is often used with a response model: if the score from the response model is above the threshold, an offer is made (this is the action); if not, no offer is made.
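As a minimal sketch of what this score-to-action step might look like in code (the threshold value and the function and variable names below are illustrative assumptions, not taken from any particular system):

```python
# Minimal sketch: turning a response-model score into an action.
# THRESHOLD and all names here are illustrative assumptions.

THRESHOLD = 0.65  # e.g., chosen from validation data

def decide_offer(score: float) -> str:
    """Map a response-model score to one of two actions."""
    return "make_offer" if score >= THRESHOLD else "no_offer"

# Score a small batch of visitors and record the action taken, so that
# measures (e.g., response rate per offer shown) can be computed later.
scored_visitors = [("visitor_1", 0.82), ("visitor_2", 0.41)]
for visitor_id, score in scored_visitors:
    print(visitor_id, score, decide_offer(score))
```

Recording the score alongside the action taken is what later makes it possible to compute the measures discussed below.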

Here are some examples of SAMS:

Model: on-line response model
Score: likelihood to respond to an offer
Action: display to the visitor the offer with the highest likelihood of response that has available inventory
Measure: revenue per day generated by the website
Strategy: increase revenue from a website by improving the targeting of offers

Model: fraud model
Score: likelihood that a transaction is fraudulent
Action: approve, decline, or obtain more information
Measure: detection and false positive rates
Strategy: reduce costs and improve customer experience by lowering fraud rates

Model: data quality model
Score: likelihood that a data source has data quality problems
Action: if the score is above a threshold, manually investigate the data to check whether there is in fact a data quality problem
Measure: detection and false positive rates
Strategy: improve operational efficiencies by detecting data quality problems more quickly

A successful analytics project requires a careful study of which actions are possible; which of those actions can be deployed into operational systems; and how those systems can be instrumented so that the data required to compute the measures is available.

The organizational challenge when developing and deploying analytics is that four groups must work together to complete a successful analytic project:

  • The IT group must provide the required data to build the model.
  • The analytics group must build the appropriate models and develop the appropriate scores.
  • The operations group must decide which actions are possible and how these actions can be integrated with current systems and business processes.
  • An executive sponsor must make sure that the measures have strategic relevance and that the three groups above collaborate effectively.

Sector – When You Really Need to Process 10 Billion Records

April 19, 2009

As is well known by now, Google demonstrated the power of a layered stack of cloud services that are designed for commodity computers that fill a data center. The stack consists of a storage service (the Google File System (GFS)), a compute service based upon MapReduce, and a table service (BigTable).

Although the Google stack of services is not directly available, the open source Hadoop system, which has a broadly similar architecture, is available.

The Google stack, consisting of GFS/MapReduce/BigTable, and the Hadoop system, consisting of the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce, are examples of clouds designed for data-intensive computing. These types of clouds provide computing capacity on demand, with capacity scaling all the way up to the size of a data center.

There are still many open questions about how best to design clouds for data-intensive computing. During the past several years, I have been involved with a cloud designed for data-intensive computing called Sector. The lead developer of Sector is Yunhong Gu of the University of Illinois at Chicago. Sector was developed independently of Hadoop and the Google cloud services and makes several different design choices (see the table below).

To quantify the impact of some of these choices, I have been involved with the development of a benchmark for data-intensive computing called MalStone. I will talk more about MalStone in a future post, but briefly, MalStone is a stylized analytic computation that can be done simply using MapReduce, as well as with variants and generalizations of MapReduce. The open source MalStone code comes with a generator of synthetic records; one benchmark (called MalStone B) uses 10 billion 100-byte records (similar to terasort).
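To give a feel for what generating fixed-width synthetic records involves, here is a toy sketch in Python. It is not the actual MalStone generator; the record layout (an 8-byte key plus random filler) is an assumption made purely for illustration.

```python
# Toy generator of fixed-width 100-byte synthetic records, in the spirit of
# terasort-style data. NOT the actual MalStone generator; the record layout
# here (8-byte key + random filler) is an assumption for illustration.
import os
import random
import struct

RECORD_SIZE = 100  # bytes per record, as in MalStone B

def make_record(rng: random.Random) -> bytes:
    key = rng.getrandbits(64)              # 8-byte pseudo-random key
    filler = os.urandom(RECORD_SIZE - 8)   # pad the record out to 100 bytes
    return struct.pack(">Q", key) + filler

def write_records(path: str, n_records: int, seed: int = 0) -> None:
    rng = random.Random(seed)
    with open(path, "wb") as f:
        for _ in range(n_records):
            f.write(make_record(rng))

if __name__ == "__main__":
    # Tiny sample; MalStone B uses 10 billion records spread across many nodes.
    write_records("records.bin", 1_000)
```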

MalStone B Benchmarks

System                          Time (min)
Hadoop MapReduce                   799
Hadoop Streaming with Python       143
Sector                              44

Tests were done using 20 nodes on the Open Cloud Testbed. Each node contained 500 million 100-byte records.
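For readers who have not used the Hadoop Streaming variant listed in the table, the sketch below shows the general shape of a streaming job in Python: a mapper and a reducer that read and write tab-separated lines on standard input and output. This is not the MalStone B code itself; the per-key record count is just an illustrative aggregate.

```python
# mapper.py -- minimal Hadoop Streaming mapper sketch (illustrative only).
# Assumes tab-separated input lines whose first field is a key.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print(fields[0] + "\t1")
```

```python
# reducer.py -- minimal Hadoop Streaming reducer sketch: sums counts per key.
# Hadoop delivers the mapper output sorted by key, so a running total suffices.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(current_key + "\t" + str(count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print(current_key + "\t" + str(count))
```

A job like this is typically launched with Hadoop's streaming jar, passing the two scripts via the -mapper and -reducer options.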

Comparing Sector and Hadoop

                   Hadoop                     Sector
Storage cloud      block-based file system    file-based
Programming model  MapReduce                  user-defined functions and MapReduce
Protocol           TCP                        UDP
Security           NA                         HIPAA capable
Replication        when data is written       periodically
Language           Java                       C++

I’ll be giving a talk on Sector at CloudSlam ’09 on Monday, April 20, 2009 at 4pm ET. CloudSlam is a virtual conference, so it is easy to listen to any of the talks that interest you.


Learning About Cloud Analytics

April 6, 2009

Clouds are changing the way that analytic models get built and the way they get deployed.

Neither analytics nor clouds have standard definitions yet.

A definition I like is to define analytics as the analysis of data to support decisions. For example, analytics is used in marketing to develop statistical models for acquiring customers and predicting the future profitability of customers. Analytics is used in risk management to identify fraud, to discover compromises in operations, and to reduce risk. Analytics is used in operations to improve business and operational processes.

Cloud computing also doesn’t yet have a standard definition. A good working definition is to define clouds as racks of commodity computers that provide on-demand resources and services over a network, usually the Internet, with the scale and the reliability of a data center.

There are two different, but related, types of clouds: the first category of clouds provides computing instances on demand, while the second provides computing capacity on demand. Both use the same underlying hardware, but the first is designed to scale out by providing additional computing instances, while the second is designed to support data- or compute-intensive applications by scaling capacity. Amazon’s EC2 and S3 services are examples of the first type of cloud. The Hadoop system is an example of the second type of cloud.

Currently, as a platform for analytics, clouds offer several advantages:

  1. Building analytic models on very large datasets. “Hadoop style clouds” provide a very effective platform for developing analytic models on very large datasets.
  2. Scoring data using analytic models. Given an analytic model and some data (either a file of data or a stream of data), “Amazon style clouds” provide a simple and effective platform for scoring data. The Predictive Model Markup Language (PMML) has proved to be a very effective mechanism for moving a statistical or analytic model built using one analytic system into a cloud for scoring. Sometimes the terminology PMML Producer is used for the application that builds the model and PMML Consumer is used for the application that scores new data using the model. Using this terminology, “Amazon style clouds” can be used to score data easily using PMML models built elsewhere (a simple consumer is sketched just after this list).
  3. Simplifying modeling environments. Computing instances in a cloud can be built that incorporate all the analytic software required for building models, including preconfigured connections to all the data required for modeling. At least for small to medium size datasets, preconfiguring computing instances in this way can simplify the development of analytic models.
  4. Easy access to data. Clouds can also make it much easier to access data for modeling. Amazon has recently made available a variety of public datasets. For example, using Amazon’s EBS service, the U.S. Census data can be accessed immediately.
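As a toy illustration of the consumer role mentioned in item 2, the sketch below reads the intercept and coefficients of a PMML RegressionModel using only the Python standard library and applies them to a record. A real deployment would use a complete PMML scoring engine; the file name and field names in the usage comment are invented for the example.

```python
# Toy PMML "consumer": scores records against a PMML linear RegressionModel.
# Illustration only -- production systems would use a full PMML scoring engine.
import xml.etree.ElementTree as ET

def load_regression_table(pmml_path):
    """Read the intercept and coefficients from the first RegressionTable."""
    root = ET.parse(pmml_path).getroot()
    table = root.find(".//{*}RegressionTable")   # '{*}' matches any namespace
    intercept = float(table.get("intercept", "0"))
    coeffs = {p.get("name"): float(p.get("coefficient"))
              for p in table.findall("{*}NumericPredictor")}
    return intercept, coeffs

def score(record, intercept, coeffs):
    """Linear score: intercept plus coefficient * field value for each field."""
    return intercept + sum(c * float(record.get(name, 0.0))
                           for name, c in coeffs.items())

# Hypothetical usage: "model.pmml" was built elsewhere by a PMML producer.
# intercept, coeffs = load_regression_table("model.pmml")
# print(score({"recency": 12, "frequency": 3}, intercept, coeffs))
```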

I’ll be one of the lecturers in two upcoming courses that introduce clouds as well as cloud analytics.

The first course will be taught in Chicago on June 22, 2009 and the second one in San Mateo on July 14, 2009. You can register for the Chicago course using this registration link and for the San Mateo course using this registration link.

This one-day course will give a quick introduction to cloud computing and analytics. It describes several different types of clouds and what is new about cloud computing, and discusses some of the advantages and disadvantages that clouds offer when building and deploying analytic models. It includes three case studies, a survey of vendors, and information about setting up your first cloud.

The course syllabus can be found here: www.opendatagroup.com/courses.htm.