The Three Most Important Interfaces in Analytics

June 17, 2009

If your data is small, your statistical model is simple, your only output is a report, and the work needs to be done just once, then there are quite a few statistical and data mining applications that will satisfy your requirements. On the other hand, if your data is large, your model is complicated, your output is a model that needs to be deployed into operational systems, or parts of the work need to be done more than once, then you might benefit from using some of the infrastructure components, services, applications and systems that have been developed over the years to support analytics. I use the term analytic infrastructure to refer to these components, services, applications and systems.

For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that with this definition, analytic infrastructure does not need to be used exclusively for modeling, but simply needs to be useful as part of the modeling process.

There are several fundamental steps when building and deploying analytic models that are directly relevant to analytic infrastructure:

Step             Inputs                         Outputs
---------------  -----------------------------  -------------------
Preprocessing    dataset (data fields)          dataset of features
Modeling         dataset of features            model
Scoring          dataset (data fields), model   scores
Postprocessing   scores                         actions
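The four steps above can be sketched as a minimal pipeline. The function names and the toy threshold model below are hypothetical, chosen only to illustrate how the output of each step becomes the input of the next:

```python
# A minimal sketch of the four steps; the functions and the toy
# threshold model are hypothetical stand-ins for real components.

def preprocess(record):
    # dataset (data fields) -> features: derive a ratio feature
    return {"ratio": record["spend"] / record["visits"]}

def build_model(feature_rows):
    # dataset of features -> model: an average-based threshold
    avg = sum(r["ratio"] for r in feature_rows) / len(feature_rows)
    return {"threshold": avg}

def score(record, model):
    # data fields + model -> score
    features = preprocess(record)
    return 1.0 if features["ratio"] > model["threshold"] else 0.0

def postprocess(s):
    # score -> action
    return "send_offer" if s == 1.0 else "no_action"

data = [{"spend": 100, "visits": 10}, {"spend": 300, "visits": 10}]
model = build_model([preprocess(r) for r in data])
actions = [postprocess(score(r, model)) for r in data]
print(actions)  # ['no_action', 'send_offer']
```

In a real deployment, each of these functions would typically live in a different piece of analytic infrastructure, which is exactly why the interfaces between them matter.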

Perhaps the most important interface in analytics is the interface between the components in the analytic infrastructure that produce models, such as statistical packages (which have a human in the loop), and the components that score data using models, which often reside in operational environments. The former are examples of what are sometimes called model producers, while the latter are sometimes called model consumers. The Predictive Model Markup Language, or PMML, is a widely deployed XML standard for describing statistical and data mining models so that model producers and model consumers can exchange models in an application-independent fashion.
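The producer/consumer split can be illustrated with a small sketch: one side serializes a toy linear model as an XML fragment, and the other side scores data using only that fragment. The element names echo PMML's RegressionTable and NumericPredictor, but this is a simplified illustration, not the full PMML schema:

```python
import xml.etree.ElementTree as ET

# Producer side: serialize a toy linear model (y = 2*x + 1) as a
# PMML-like XML fragment. This is a simplified sketch, not the
# complete PMML 4.0 schema.
def export_model(intercept, coefficient):
    root = ET.Element("RegressionTable", intercept=str(intercept))
    ET.SubElement(root, "NumericPredictor", name="x",
                  coefficient=str(coefficient))
    return ET.tostring(root, encoding="unicode")

# Consumer side: parse the fragment and score new data, with no
# knowledge of the application that produced the model.
def score(xml_model, x):
    table = ET.fromstring(xml_model)
    y = float(table.get("intercept"))
    for pred in table.findall("NumericPredictor"):
        y += float(pred.get("coefficient")) * x
    return y

model_xml = export_model(1.0, 2.0)
print(score(model_xml, 3.0))  # 7.0
```

The point is that the consumer depends only on the model description, not on the producer's software, which is what lets a statistical package and an operational scoring engine be developed and deployed independently.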

On June 16, the Data Mining Group released version 4.0 of the Predictive Model Markup Language or PMML. Version 4.0 is the first release of PMML since Version 3.2 was released in May, 2007.

Version 4.0 of PMML adds the following new features:

  • support for time series models;
  • support for multiple models, which includes support for both
    segmented models and ensembles of models;
  • improved support for preprocessing data, which will help simplify
    deployment of models;
  • new models, such as survival models;
  • support for additional information about models, called model
    explanation, which includes information for visualization, model
    quality, gains and lift charts, confusion matrices, and related
    measures.

Since Version 2.0 of PMML, which was released in 2001, PMML has included a rich enough set of transformations that data preprocessing can be described using PMML. Using these transformations, it is possible to use PMML to define an interface between the analytic infrastructure components and services that produce features (such as data preprocessing components) and those that consume features (such as models). This is probably the second most important interface in analytics.

With Version 4.0 released, the PMML working group is now working on Version 4.1. One of the goals is to enable PMML to describe the postprocessing of scores. This would allow PMML to be used as an interface between the analytic infrastructure components and services that produce scores (such as modeling engines) and those that consume scores (such as recommendation engines). This is probably the third most important interface in analytics.

Today, by using PMML to describe these interfaces, it is straightforward for analytic infrastructure components and services to run on different systems. For example, a modeler might use a statistical application to build a model, but scoring might be done in a cloud, or a cloud might be used for preprocessing the data to produce features for the modeler.

If you are interested in getting involved in the PMML working group, please visit the web site:

Disclaimer: I’m a member of the PMML working group and worked on PMML Version 4.0.

Five Common Mistakes in Analytic Projects

June 1, 2009

Managing projects is often challenging. Developing predictive models can be very challenging. Managing projects that develop analytic models can present some especially difficult challenges. In this post, I’ll describe some of the most common mistakes that occur when managing analytic projects.
Mistake 1. Underestimating the time required to get the data. This is probably the most common mistake in modeling projects. Getting the data required for analytic projects usually requires a special request to the IT department. Any special requests made to IT departments can take time. Usually, several meetings are required between the business owners of the analytic problem, the statisticians building the models, and the IT department in order to decide what data is required and whether it is available. Once there is agreement on what data is required, then the special request to the IT department is made and the wait begins. Project managers are sometimes under the impression that good models can be built without data, just as statisticians are sometimes under the impression that modeling projects can be managed without a project plan.

Mistake 2. There is not a good plan for deploying the model. There are several phases in a modeling project. In one phase, data is acquired from the IT department and the model is built; a statistician is usually in charge of building the model. In the next phase, the model is deployed; this is the responsibility of the IT department. Deployment requires providing the model with the appropriate data, post-processing the scores produced by the model to compute the associated actions, and then integrating these actions into the required business processes. Deploying models is in many cases just as complicated as building them, or more so, and it requires a plan. A good standards-compliant architecture can help here. It is often useful for the statistician to export the model as PMML; the model can then be imported by the application used in the operational system.

Mistake 3. Working backwards, instead of starting with an analytic strategy. To say it another way: first, decide on an analytic strategy; then, check that the data that is available supports the analytic strategy; then, make sure that there are modelers (or statisticians) available to develop the models; and, finally, make sure that the modelers have the right (software) tools. The most important factor affecting the success of an analytic project is choosing the right analytic project and approaching it in the right way. This is a matter of analytic strategy. Once the right project is chosen, the success of the project depends most on the data that is available; next on the talent of the modeler who is developing the models; and then on the software that is used. In general, companies new to modeling proceed in precisely the opposite direction. First, they buy software they don’t need (for many problems, open source analytic software works just fine). Then, when the IT staff has trouble using the modeling software, they hire a statistician to build models. Next, once a statistician is on board, someone looks at the data and realizes (often) that the data will not support the model required. Finally, much later, the business owners of the problem realize they started with the wrong analytic problem. This is usually because they didn’t start with an analytic strategy.

Mistake 4. Trying to build the perfect model. Another common mistake is trying to build the perfect statistical model. Usually, the impact of a model will be much higher if a model that is good enough is deployed and then a process is put in place that: i) reviews the effectiveness of the model frequently with the business owner of the problem; ii) refreshes the model on a regular basis with the most recent data; and, iii) rebuilds the model on a periodic basis with the lessons learned from the reviews.

Mistake 5. The predictions of the model are not actionable. This was the subject of a recent post about an approach that I call the SAMS methodology. Recall that SAMS is an abbreviation for Scores/Actions/Measures/Strategy. From this point of view, the model is evaluated not just by its accuracy but by measures that directly support a specified strategy. For example, the strategy might be to increase sales by recommending another product after an initial product is selected. Here the relevant measure might be the incremental revenue generated by the recommendations. The action would be to present up to three additional products to the shopper. Each candidate product might be assigned a score from 1 to 1000, and the three products with the highest scores are then presented. This is a simple example. Unfortunately, in most of the projects that I have been involved with, determining the appropriate actions and measures requires an iterative process to get right.
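The scores-to-actions step in that example is simple enough to sketch. The product identifiers and score values below are hypothetical; the point is only that the action (present the top three products) is computed from the scores, not reported alongside them:

```python
# A sketch of the scores -> actions step: hypothetical product
# scores on a 1-1000 scale, with the top three presented.
def select_recommendations(scores, limit=3):
    # scores: {product_id: score in 1..1000}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:limit]

scores = {"A": 940, "B": 310, "C": 875, "D": 990}
print(select_recommendations(scores))  # ['D', 'A', 'C']
```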

Please share in the comments below any lessons you have learned building analytic models. I would like to expand this list over time to include many of the common mistakes that occur in analytic projects.


Open Source Analytics Reaches Main Street (and Some Other Trends in Analytics)

May 11, 2009

This is the first of three posts about systems, applications, services and architectures for building and deploying analytics. Sometimes this is called analytic infrastructure. This post is primarily directed at the analytic infrastructure needs of companies. Later posts will look at analytic infrastructure for the research community.

In this first post of the series, we discuss five important trends impacting analytic infrastructure.

Trend 1. Open source analytics has reached Main Street. R, which was first released in 1996, is now the most widely deployed open source system for statistical computing. A recent article in the New York Times estimated that over 250,000 individuals use R regularly. Dice News has created a video called “What’s Up with R” to inform job hunters using their services about R. In the language of Geoffrey A. Moore’s book Crossing the Chasm, R has reached “Main Street.”

Some companies still either ban the use of open source software or require an elaborate approval process before open source software can be used. Today, if a company does not allow the use of R, it puts the company at a competitive disadvantage.

Trend 2. The maturing of open, standards based architectures for analytics. Many of the common applications used today to build statistical models are stand-alone applications designed to be used by a single statistician. It is usually a challenge to deploy the model produced by the application into operational systems. Some applications can express statistical models as C++ or SQL, which makes deployment easier, but it can still be a challenge to transform the data into the format expected by the model.

The Predictive Model Markup Language (PMML) is an XML language for expressing statistical and data mining models that was developed to provide an application-independent and platform-independent mechanism for importing and exporting models. PMML has become the dominant standard for statistical and data mining models. Many applications now support PMML.

By using these applications, it is possible to build an open, modular standards based environment for analytics. With this type of open analytic environment, it is quicker and less labor-intensive to deploy new analytic models and to refresh currently deployed models.

Disclaimer: I’m one of the many people that has been involved in the development of the PMML standard.

Trend 3. The emergence of systems that simplify the analysis of large datasets. Analyzing large datasets is still very challenging, but with the introduction of Hadoop, there is now an open source system supporting MapReduce that scales to thousands of processors.

The significance of Hadoop and MapReduce is not only the scalability, but also the simplicity. Most programmers, even with no prior experience, can have their first Hadoop job running on a large cluster within a day. Most programmers find that it is much easier and much quicker to use MapReduce and some of its generalizations than it is to develop and implement an MPI job on a cluster, which is currently the most common programming model for clusters.
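The simplicity comes from the shape of the programming model: the programmer writes only a map function and a reduce function, and the framework handles distribution. The classic word-count example, simulated here in a single process rather than on a cluster, shows that shape:

```python
from collections import defaultdict

# A local sketch of the MapReduce programming model (word count).
# Real Hadoop jobs have the same map/reduce shape but run across a
# cluster; this simulates the framework's grouping step in-process.
def map_phase(line):
    for word in line.split():
        yield (word, 1)

def reduce_phase(word, counts):
    return (word, sum(counts))

def run_mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                     # map + shuffle
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(run_mapreduce(["a b a", "b a"]))  # {'a': 3, 'b': 2}
```

An MPI version of the same job would require the programmer to manage the data partitioning, communication and fault handling explicitly, which is precisely the work the MapReduce framework absorbs.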

Trend 4. Cloud-based data services. Over the next several years, cloud-based services will begin to impact analytics significantly. A later post in this series will show, for example, how simple it is to use R in a cloud. Although there are security, compliance and policy issues to work out before it becomes common to use clouds for analytics, I expect that these and related issues will all be worked out over the next several years.

Cloud-based services provide several advantages for analytics. Perhaps the most important is elastic capacity — if 25 processors are needed for one job for a single hour, then these can be used for just the single hour and no more. This ability of clouds to handle surge capacity is important for many groups that do analytics. With the appropriate surge capacity provided by clouds, modelers can be more productive, and this can be accomplished in many cases without requiring any capital expense. (Third party clouds provide computing capacity that is an operating and not a capital expense.)
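The elastic-capacity arithmetic is worth making concrete. The per-instance-hour rate below is a hypothetical illustration, not any actual provider's price; the point is that the job's cost is proportional to exactly the capacity used:

```python
# Back-of-the-envelope surge-capacity cost. The rate is a
# hypothetical illustration, not an actual provider's price.
RATE_PER_INSTANCE_HOUR = 0.10

def cloud_cost(instances, hours):
    # Pay only for what the job uses (an operating expense,
    # with no capital expense for idle hardware).
    return instances * hours * RATE_PER_INSTANCE_HOUR

print(cloud_cost(25, 1))  # 2.5 -- 25 processors for a single hour
```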

Trend 5. The commoditization of data. Moore’s law applies not only to CPUs, but also to the chips that are used in all of the digital devices that produce data. The result is that the cost to produce data has been falling for some time. Similarly, the cost to store data has also been falling for some time.

Indeed, more and more datasets are being offered for free. For example, end of day stock quotes from Yahoo, gene sequence data from NCBI, and public data sets hosted by Amazon, including data from the U.S. Census Bureau, are all available now for free.

The significance to analytics is that the cost to enrich data with third party data, which often produces better models, is falling. Over time, more and more of this data will be available in clouds, so that the effort to integrate this data into modeling will also decrease.

Learning About Cloud Analytics

April 6, 2009

Clouds are changing the way that analytic models get built and the way they get deployed.

Neither analytics nor clouds have standard definitions yet.

A definition I like is to define analytics as the analysis of data to support decisions. For example, analytics is used in marketing to develop statistical models for acquiring customers and predicting the future profitability of customers. Analytics is used in risk management to identify fraud, to discover compromises in operations, and to reduce risk. Analytics is used in operations to improve business and operational processes.

Cloud computing also doesn’t yet have a standard definition. A good working definition is to define clouds as racks of commodity computers that provide on-demand resources and services over a network, usually the Internet, with the scale and the reliability of a data center.

There are two different, but related, types of clouds: the first category of clouds provide computing instances on demand, while the second category of clouds provide computing capacity on demand. Both use the same underlying hardware, but the first is designed to scale out by providing additional computing instances, while the second is designed to support data- or compute-intensive applications by scaling capacity. Amazon’s EC2 and S3 services are an example of the first type of cloud. The Hadoop system is an example of the second type of cloud.

Currently, as a platform for analytics, clouds offer several advantages:

  1. Building analytic models on very large datasets. “Hadoop style clouds” provide a very effective platform for developing analytic models on very large datasets.
  2. Scoring data using analytic models. Given an analytic model and some data (either a file of data or a stream of data), “Amazon style clouds” provide a simple and effective platform for scoring data. The Predictive Model Markup Language (PMML) has proved to be a very effective mechanism for moving a statistical or analytic model built using one analytic system into a cloud for scoring. Sometimes the terminology PMML Producer is used for the application that builds the model and PMML Consumer is used for the application that scores new data using the model. Using this terminology, “Amazon style clouds” can be used to score data easily using PMML models built elsewhere.
  3. Simplifying modeling environments. Finally, computing instances in a cloud can be built that incorporate all the analytic software required for building models, including preconfigured connections to all the data required for modeling. At least for small to medium size datasets, preconfiguring computing instances in this way can simplify the development of analytic models.
  4. Easy access to data. Clouds can also make it much easier to access data for modeling. Amazon has recently made available a variety of public datasets. For example, using Amazon’s EBS service, the U.S. Census data can be accessed immediately.

I’ll be one of the lecturers in two upcoming courses that introduce both clouds and cloud analytics.

The first course will be taught in Chicago on June 22, 2009 and the second one in San Mateo on July 14, 2009. You can register for the Chicago course using this registration link and the San Mateo course using this registration link.

This one day course will give a quick introduction to cloud computing and analytics. It describes several different types of clouds and what is new about cloud computing, and discusses some of the advantages and disadvantages that clouds offer when building and deploying analytic models. It includes three case studies, a survey of vendors, and information about setting up your first cloud.

The course syllabus can be found here: