If your data is small, your statistical model is simple, your only output is a report, and the work needs to be done just once, then there are quite a few statistical and data mining applications that will satisfy your requirements. On the other hand, if your data is large, your model is complicated, your output is a model that needs to be deployed into operational systems, or parts of the work need to be done more than once, then you might benefit from using some of the infrastructure components, services, applications and systems that have been developed over the years to support analytics. I use the term analytic infrastructure to refer to these components, services, applications and systems.
For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that under this definition, analytic infrastructure does not need to be used exclusively for modeling; it only needs to be useful as part of the modeling process.
There are several fundamental steps when building and deploying analytic models that are directly relevant to analytic infrastructure:
| Step | Inputs | Outputs |
|---|---|---|
| Preprocessing | dataset (data fields) | dataset of features |
| Modeling | dataset of features | model |
| Scoring | dataset (data fields), model | scores |
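To make the inputs and outputs of these three steps concrete, here is a minimal Python sketch. The field name `amount`, the feature `amount_k`, and the toy threshold model are all hypothetical and chosen only for illustration; a real system would use a statistical package for the modeling step.

```python
from statistics import mean


def preprocess(records):
    """Preprocessing: dataset (data fields) -> dataset of features."""
    # Hypothetical feature: the raw amount rescaled to thousands.
    return [
        {"amount_k": r["amount"] / 1000.0, "label": r.get("label")}
        for r in records
    ]


def build_model(features):
    """Modeling: dataset of features -> model."""
    # Toy model (an assumption): threshold halfway between the class means.
    pos = [f["amount_k"] for f in features if f["label"] == 1]
    neg = [f["amount_k"] for f in features if f["label"] == 0]
    return {"threshold": (mean(pos) + mean(neg)) / 2.0}


def score(records, model):
    """Scoring: dataset (data fields), model -> scores."""
    # Scoring starts from raw data fields, so it repeats the preprocessing.
    return [
        1 if f["amount_k"] > model["threshold"] else 0
        for f in preprocess(records)
    ]


training = [
    {"amount": 100, "label": 0},
    {"amount": 300, "label": 0},
    {"amount": 1700, "label": 1},
    {"amount": 1900, "label": 1},
]
model = build_model(preprocess(training))
scores = score([{"amount": 250}, {"amount": 2500}], model)
```

Note that `score` takes raw data fields rather than features, which is why the same preprocessing has to be available wherever scoring runs.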
Perhaps the most important interface in analytics is the interface between components in the analytic infrastructure that produce models, such as statistical packages (which have a human in the loop), and components in the analytic infrastructure that score data using models and often reside in operational environments. The former are examples of what are sometimes called model producers, while the latter are sometimes called model consumers. The Predictive Model Markup Language, or PMML, is a widely deployed XML standard for describing statistical and data mining models so that model producers and model consumers can exchange models in an application-independent fashion.
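As a rough illustration of the producer/consumer split, the sketch below embeds a small PMML regression model (as a producer might export it) and scores a record against it using only Python's standard library. The model itself, its coefficients, and the function name `score_regression` are hypothetical, and the consumer handles only the few elements shown here, not the full PMML specification.

```python
import xml.etree.ElementTree as ET

# A hypothetical model as a producer might export it: y = 1.5 + 2.0 * x
PMML_DOC = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header description="Hypothetical linear regression"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_0"}


def score_regression(pmml_text, record):
    """Minimal model consumer: evaluates a single RegressionTable."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    result = float(table.get("intercept"))
    for pred in table.findall("pmml:NumericPredictor", NS):
        # The exponent attribute is optional and defaults to 1.
        exponent = int(pred.get("exponent", "1"))
        value = record[pred.get("name")]
        result += float(pred.get("coefficient")) * value ** exponent
    return result


prediction = score_regression(PMML_DOC, {"x": 3.0})  # 1.5 + 2.0 * 3.0
```

The consumer never needs to know which application built the model; it only needs the PMML document.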
Version 4.0 of PMML adds the following new features:
- support for time series models;
- support for multiple models, which includes support for both segmented models and ensembles of models;
- improved support for preprocessing data, which will help simplify deployment of models;
- new models, such as survival models;
- support for additional information about models called model explanation, which includes information for visualization, model quality, gains and lift charts, confusion matrix, and related information.
Since Version 2.0 of PMML, which was released in 2001, PMML has included a rich enough set of transformations that data preprocessing can be described using PMML models. Using these transformations, it would be possible to use PMML to define an interface between analytic infrastructure components and services that produce features (such as data preprocessing components) and those that consume features (such as models). This is probably the second most important interface in analytics.
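As one example of such a transformation, the sketch below applies a PMML `NormContinuous` element, which rescales a continuous field by piecewise linear interpolation between `LinearNorm` points. The field `age`, the derived feature `age_norm`, and the function `derive_features` are hypothetical; the code handles only this one transformation type and simply extrapolates from the nearest segment for values outside the listed points, which is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical preprocessing step: rescale age from [0, 100] to [0, 1].
TRANSFORMS = """<TransformationDictionary xmlns="http://www.dmg.org/PMML-4_0">
  <DerivedField name="age_norm" optype="continuous" dataType="double">
    <NormContinuous field="age">
      <LinearNorm orig="0" norm="0"/>
      <LinearNorm orig="100" norm="1"/>
    </NormContinuous>
  </DerivedField>
</TransformationDictionary>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_0"}


def derive_features(transform_text, record):
    """Feature producer: data fields -> features, driven by PMML."""
    root = ET.fromstring(transform_text)
    features = {}
    for derived in root.findall("pmml:DerivedField", NS):
        norm = derived.find("pmml:NormContinuous", NS)
        if norm is None:
            continue  # other transformation types are not handled here
        points = [
            (float(p.get("orig")), float(p.get("norm")))
            for p in norm.findall("pmml:LinearNorm", NS)
        ]
        x = record[norm.get("field")]
        # Pick the segment containing x; values outside the listed range
        # fall back to the first or last segment (extrapolation).
        seg = len(points) - 2
        for i in range(len(points) - 1):
            if x <= points[i + 1][0]:
                seg = i
                break
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        features[derived.get("name")] = y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    return features


features = derive_features(TRANSFORMS, {"age": 25})
```

Because the transformation travels with the PMML document, the component that consumes the features does not have to reimplement the preprocessing by hand.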
With Version 4.0 released, the PMML working group is now working on Version 4.1. One of the goals is to enable PMML to describe postprocessing of scores. This would allow PMML to be used as an interface between analytic infrastructure components and services that produce scores (such as modeling engines) and those that consume scores (such as recommendation engines). This is probably the third most important interface in analytics.
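To show what postprocessing of scores might involve, here is a small sketch in plain Python rather than PMML, since the Version 4.1 syntax is not described here. The rule (keep scores above a threshold, then return the top few items), the function name `postprocess`, and the sample scores are all hypothetical.

```python
def postprocess(scores, threshold=0.5, top_n=2):
    """Score consumer: turn raw scores into a ranked recommendation list."""
    # Drop items whose score falls below the (hypothetical) threshold.
    kept = [(item, s) for item, s in scores.items() if s >= threshold]
    # Rank the remaining items from highest to lowest score.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in kept[:top_n]]


# Raw scores as a modeling or scoring engine might emit them.
raw_scores = {"book_a": 0.91, "book_b": 0.42, "book_c": 0.77, "book_d": 0.65}
recommendations = postprocess(raw_scores)
```

Describing a rule like this in a standard format would let the scoring engine and the recommendation engine agree on it without sharing code.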
Today, by using PMML to describe these interfaces, it is straightforward for analytic infrastructure components and services to run on different systems. For example, a modeler might use a statistical application to build a model, but scoring might be done in a cloud, or a cloud might be used for preprocessing the data to produce features for the modeler.
If you are interested in getting involved in the PMML working group, please visit the web site: www.dmg.org
Disclaimer: I’m a member of the PMML working group and worked on PMML Version 4.0.