Test Drive the Sector Public Cloud

June 23, 2009

Sector is an open source cloud written in C++ for storing, sharing and processing large data sets. Sector is broadly similar to the Google File System and the Hadoop Distributed File System, except that it is designed to utilize wide area, high performance networks.

Sphere is middleware that is designed to process data managed by Sector.  Sphere implements a framework for distributed computing that allows any User Defined Function (UDF) to be applied to a Sector dataset.

One way to think about this is as a generalized MapReduce. With MapReduce, users work with key-value pairs and define a Map function and a Reduce function, and the MapReduce application creates a workflow consisting of a Map, Shuffle, Sort and Reduce. With Sector, users can create a workflow consisting of any sequence of User Defined Functions (UDFs) and apply these to any datasets managed by Sector. In particular, Sphere has predefined Shuffle and Sort UDFs that can be applied to datasets consisting of key-value pairs, so that MapReduce applications can be implemented once a user defines a Map and a Reduce UDF.
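To make the idea concrete, here is a minimal Python sketch (not Sector/Sphere's actual C++ API) of a UDF workflow, with Shuffle and Sort UDFs that turn a Map UDF and a Reduce UDF into a word-count MapReduce:

```python
# A minimal sketch of the Sphere idea: a dataset is processed by applying an
# arbitrary sequence of user-defined functions (UDFs), each consuming the
# previous UDF's output. Function names here are invented for illustration.
from collections import defaultdict

def apply_udfs(dataset, udfs):
    """Apply a sequence of UDFs to a dataset, one after another."""
    for udf in udfs:
        dataset = udf(dataset)
    return dataset

# Predefined Shuffle and Sort UDFs over (key, value) pairs.
def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return list(groups.items())

def sort_pairs(pairs):
    return sorted(pairs, key=lambda kv: kv[0])

# With a user-supplied Map UDF and Reduce UDF, the sequence
# map -> shuffle -> sort -> reduce is a MapReduce (here, word count).
def map_udf(lines):
    return [(w, 1) for line in lines for w in line.split()]

def reduce_udf(grouped):
    return [(k, sum(vs)) for k, vs in grouped]

result = apply_udfs(["a b a", "b a"], [map_udf, shuffle, sort_pairs, reduce_udf])
# result is [("a", 3), ("b", 2)]
```

In a real Sector/Sphere deployment the UDFs run in parallel on the nodes holding the data; the chaining shown here is only the logical model.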

Sector also implements security and we are currently using it to bring up a HIPAA-compliant private cloud.

Since Sector/Sphere is written in C++, it is straightforward to support C++-based data access tools and programming APIs.

If you have access to a high speed research network (for example, if your network can reach StarLight, the National Lambda Rail, ESNet, or Internet2), then you can try out the Sector Public Cloud.

You can reach the Sector Public Cloud from the Sector home page sector.sourceforge.net.

There is a technical report on the design of Sector on arXiv (arXiv:0809.1181v2).

There is some information on the performance of Sector/Sphere in my post on the MalStone Benchmark, a benchmark for clouds that support data intensive computing.

The Three Most Important Interfaces in Analytics

June 17, 2009

If your data is small, your statistical model is simple, your only output is a report, and the work needs to be done just once, then there are quite a few statistical and data mining applications that will satisfy your requirements. On the other hand, if your data is large, your model is complicated, your output is a model that needs to be deployed into operational systems, or parts of the work need to be done more than once, then you might benefit by using some of the infrastructure components, services, applications and systems that have been developed over the years to support analytics. I use the term analytic infrastructure to refer to these components, services, applications and systems.

The Data Mining Group, which develops the Predictive Model Markup Language.

For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that with this definition, analytic infrastructure need not be used exclusively for modeling; it need only be useful as part of the modeling process.

There are several fundamental steps when building and deploying analytic models that are directly relevant to analytic infrastructure:

Step             Inputs                          Outputs
Preprocessing    dataset (data fields)           dataset of features
Modeling         dataset of features             model
Scoring          dataset (data fields), model    scores
Postprocessing   scores                          actions
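The steps in the table above compose naturally: each step's output is the next step's input. Below is a toy Python sketch of that pipeline; the field names, the threshold "model", and the cutoff of 500 are all invented for illustration.

```python
# Toy sketch of the four steps: preprocessing, modeling, scoring, postprocessing.
def preprocess(records):
    # dataset (data fields) -> dataset of features
    return [{"spend_ratio": r["spend"] / max(r["income"], 1)} for r in records]

def build_model(features):
    # dataset of features -> model (here just a threshold learned as the mean)
    mean = sum(f["spend_ratio"] for f in features) / len(features)
    return {"threshold": mean}

def score(records, model):
    # dataset (data fields), model -> scores
    return [1000 if f["spend_ratio"] > model["threshold"] else 0
            for f in preprocess(records)]

def postprocess(scores):
    # scores -> actions
    return ["flag" if s >= 500 else "ignore" for s in scores]

data = [{"spend": 50, "income": 100}, {"spend": 90, "income": 100}]
model = build_model(preprocess(data))
actions = postprocess(score(data, model))
# actions is ["ignore", "flag"]
```

In practice each stage may run on a different system, which is exactly why the interfaces between them matter.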

Perhaps the most important interface in analytics is the interface between components in the analytic infrastructure that produce models, such as statistical packages (which have a human in the loop), and components in the analytic infrastructure that score data using models and often reside in operational environments. The former are examples of what are sometimes called model producers, while the latter are sometimes called model consumers. The Predictive Model Markup Language, or PMML, is a widely deployed XML standard for describing statistical and data mining models so that model producers and model consumers can exchange models in an application-independent fashion.

On June 16, the Data Mining Group released Version 4.0 of the Predictive Model Markup Language (PMML). Version 4.0 is the first release of PMML since Version 3.2 was released in May 2007.
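As a rough illustration of the producer/consumer split, the sketch below emits a PMML-flavored XML document for a trivial regression model using only the Python standard library. The element names follow the PMML vocabulary, but the fragment is schematic and not schema-validated:

```python
# Producer side: serialize a trivial linear model as PMML-like XML.
# Consumer side: parse the XML without ever seeing the producer's code.
import xml.etree.ElementTree as ET

def export_regression(intercept, coefficients):
    """Emit a schematic, PMML-flavored description of a linear model."""
    pmml = ET.Element("PMML", version="4.0")
    dd = ET.SubElement(pmml, "DataDictionary")
    model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
    table = ET.SubElement(model, "RegressionTable", intercept=str(intercept))
    for field, coef in coefficients.items():
        ET.SubElement(dd, "DataField", name=field, optype="continuous")
        ET.SubElement(table, "NumericPredictor", name=field,
                      coefficient=str(coef))
    return ET.tostring(pmml, encoding="unicode")

# The model producer exports...
doc = export_regression(1.5, {"age": 0.2})

# ...and a model consumer (e.g. a scoring engine) imports.
root = ET.fromstring(doc)
predictor = root.find(".//NumericPredictor")
```

The point of the standard is exactly this decoupling: the consumer depends only on the document, not on the statistical package that produced it.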

Version 4.0 of PMML adds the following new features:

  • support for time series models;
  • support for multiple models, which includes support for both
    segmented models and ensembles of models;
  • improved support for preprocessing data, which will help simplify
    deployment of models;
  • new models, such as survival models;
  • support for additional information about models, called model
    explanation, which includes information for visualization, model
    quality, gains and lift charts, confusion matrices, and related
    information.

Since Version 2.0, which was released in 2001, PMML has included a rich enough set of transformations that data preprocessing can be described using PMML. Using these transformations, it is possible to use PMML to define an interface between analytic infrastructure components and services that produce features (such as data preprocessing components) and those that consume features (such as models). This is probably the second most important interface in analytics.
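The idea behind using transformations as an interface can be sketched as follows: the preprocessing is described declaratively, so any component that understands the description can recompute the same features. The dictionary format here is invented, standing in for PMML's transformation elements:

```python
# Sketch of declarative preprocessing: the transformation *description* is
# data, so producer and consumer can share it. The format below is made up
# for illustration; PMML defines its own transformation elements in XML.
import math

TRANSFORMS = {
    "norm_income": {"op": "minmax", "field": "income", "lo": 0.0, "hi": 200.0},
    "log_spend":   {"op": "shifted_log", "field": "spend"},
}

def apply_transforms(record, transforms):
    """Recompute features from a raw record using a shared description."""
    out = {}
    for name, t in transforms.items():
        x = record[t["field"]]
        if t["op"] == "minmax":
            out[name] = (x - t["lo"]) / (t["hi"] - t["lo"])
        elif t["op"] == "shifted_log":
            out[name] = math.log(1 + x)
    return out

features = apply_transforms({"income": 100.0, "spend": 0.0}, TRANSFORMS)
# features is {"norm_income": 0.5, "log_spend": 0.0}
```

Because the feature definitions travel with the model, the preprocessing component and the scoring component cannot silently drift apart.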

With Version 4.0 released, the PMML working group is now working on Version 4.1. One of the goals is to enable PMML to describe postprocessing of scores. This would allow PMML to be used as an interface between analytic infrastructure components and services that produce scores (such as modeling engines) and those that consume scores (such as recommendation engines). This is probably the third most important interface in analytics.

Today, by using PMML to describe these interfaces, it is straightforward for analytic infrastructure components and services to run on different systems. For example, a modeler might use a statistical application to build a model, but scoring might be done in a cloud, or a cloud might be used for preprocessing the data to produce features for the modeler.

If you are interested in getting involved in the PMML working group, please visit the web site: www.dmg.org

Disclaimer: I’m a member of the PMML working group and worked on PMML Version 4.0.

Some Reasons to Consider Condominium Clouds (Condo Clouds)

June 8, 2009

In this post, I’ll introduce condominium clouds and discuss some of their potential for changing computing. From an architectural point of view, condominium clouds are essentially the same as private clouds. Condominium clouds have a different business model, though, which in certain circumstances provides some definite advantages.

I argue here that condominium clouds and related offerings represent a fundamental shift in our computing platforms. To explain this, I’ll take a short detour and recall a computing experience I had about a decade ago and the business model (condominium fiber) that made these types of experiences available to a broader community.

Some racks in data center.

One of the most exciting technical experiences I have had occurred in 2000, when I ran a distributed data intensive computing application over a dedicated 155 Mbps network link connecting clusters located at NCAR in Boulder and the University of Michigan in Ann Arbor. Prior to that, I only had access to 1.5 Mbps networks, and these networks were shared with the rest of the campus. The application was able to perform sustained computation at about 96 Mbps, which was not bad considering that each computer was limited by a 100 Mbps NIC. Reaching 96 Mbps over a wide area network was quite difficult at that time, but we did it using a new network protocol that was the precursor to UDT. The reason for our excitement was that one day we were limited to distributed computations that rarely reached 1 Mbps, while the next day we reached 96 Mbps, almost two orders of magnitude improvement.

By 2003, with improved protocols and 10 Gbps networks, sustained distributed computations reached 6.8 Gbps. Within a four-year span, we had passed through an inflection point in which high performance distributed computing improved by over three orders of magnitude. Three things were required:

  • A new computing platform, in this case, clusters connected by wide area, high performance networks.
  • A new network protocol and associated libraries, since TCP was not effective at data intensive computing over wide area high performance networks.
  • A new business model, which made high performance wide area networks more broadly available.

Let’s turn now to cloud computing. Cloud computing has two faces: the most familiar face offers utility-based pricing, on-demand elastic availability, and infrastructure as a service. There is no doubt that this combination is changing the face of computing. On the other hand, the other side of cloud computing is just as important. This side is about thinking of the data center as your unit of computing. Previously you probably thought of computing as requiring a certain number of racks. With cloud computing, you now think of computing as requiring a certain number of data centers. This is computing measured with Data Center Units or DCUs.

The problem is that acquiring computing at the scale of data centers is prohibitive for all but a handful of companies (Google, Microsoft, Yahoo, IBM, …).

This is where condominium clouds enter. But first, here is a description of customer-owned and condominium fiber from a 2002 FAQ titled “FAQ about Community Dark Fiber Networks” written by Bill St. Arnaud:

Dark fiber is optical fiber, dedicated to a single customer and where the customer is responsible for attaching the telecommunications equipment and lasers to “light” the fiber. Traditionally optical fiber networks have been built by carriers where they take on the responsibility of lighting the fiber and provide a managed service to the customer.

Professional third-party companies who specialize in dark fiber systems take care of the actual installation of the fiber and also maintain it on behalf of the customer. Technically these companies actually own the fiber, but sell IRUs (Indefeasible Rights of Use) for up to 20 years for unrestricted use of the fiber.

All across North America businesses, school boards and municipalities are banding together to negotiate deals to purchase customer owned dark fiber. A number of next generation service providers are now installing fiber networks and will sell strands of fiber to any organization that wishes to purchase and manage its own dark fiber.

Many of these new fiber networks are built along the same model as a condominium apartment building. The contractor advertises the fact that they intend to build a condominium fiber network and offers early participants special pricing before the construction begins. That way the contractor is able to guarantee early financing for the project and demonstrate to bankers and other investors that there are some committed customers to the project.

The condominium fiber is operated like a condominium apartment building. The individual owners of fiber strands can do whatever they want with their individual fiber strands. They are free to carry any type of traffic and terminate the fiber any way they so choose. The company that installs the fiber network is responsible for overall maintenance and repairing the fiber in case of breaks, moves, adds or changes. The “condominium manager” charges the owners of the individual strands of fiber a small annual maintenance fee which covers all maintenance and right of way costs.

The initial primary driver for dark fiber by individual customers is the dramatic savings in telecommunication costs. The reduction in telecommunication costs can be in excess of 1000% depending on your current bandwidth requirements.

It is now easy to explain condominium clouds. For those who cannot afford private clouds at the scale of data centers, condominium clouds become a way to share the expense with the other members of the condominium.

The condominium cloud model is also attractive if there are compliance issues or security issues that make a private cloud desirable, but your scale is such that justifying your own private cloud at the scale of a data center does not make sense.

As with condominium fiber, professionals would build and operate the data center. One way of looking at condominium clouds is as more cost-effective private clouds for organizations or associations that might benefit from the scale and operational control that data centers offer.

Condominium clouds might make sense for companies in a regulated industry that belong to an association that can manage the condominium. They would also make sense for scientific collaborations, especially those with large data. Also, although the business model would be slightly different, government organizations that couldn’t justify their own cloud could work together and jointly manage a condominium cloud.

The image above is courtesy of Cory Doctorow.

Five Common Mistakes in Analytic Projects

June 1, 2009

Managing projects is often challenging. Developing predictive models can be very challenging. Managing projects that develop analytic models can present some especially difficult challenges. In this post, I’ll describe some of the most common mistakes that occur when managing analytic projects.
Managing projects involving analytics can be difficult.

Mistake 1. Underestimating the time required to get the data. This is probably the most common mistake in modeling projects. Getting the data required for analytic projects usually requires a special request to the IT department. Any special requests made to IT departments can take time. Usually, several meetings are required between the business owners of the analytic problem, the statisticians building the models, and the IT department in order to decide what data is required and whether it is available. Once there is agreement on what data is required, then the special request to the IT department is made and the wait begins. Project managers are sometimes under the impression that good models can be built without data, just as statisticians are sometimes under the impression that modeling projects can be managed without a project plan.

Mistake 2. There is not a good plan for deploying the model. There are several phases in a modeling project. In one phase, data is acquired from the IT department and the model is built; a statistician is usually in charge of building the model. In the next phase, the model is deployed; this is the responsibility of the IT department. Deployment requires providing the model with the appropriate data, post-processing the scores produced by the model to compute the associated actions, and then integrating these actions into the required business processes. Deploying models is in many cases just as complicated as building them, or more so, and requires a plan of its own. A good standards-compliant architecture can help here. It is often useful for the statistician to export the model as PMML. The model can then be imported by the application used in the operational system.

Mistake 3. Working backwards, instead of starting with an analytic strategy. To say it another way: first, decide on an analytic strategy; then, check that the data that is available supports the analytic strategy; then, make sure that there are modelers (or statisticians) available to develop the models; and, finally, make sure that the modelers have the right (software) tools. The most important factor affecting the success of an analytic project is choosing the right analytic project and approaching it in the right way. This is a matter of analytic strategy. Once the right project is chosen, the success of the project depends most on the data that is available; next on the talent of the modeler developing the models; and then on the software that is used. In general, companies new to modeling proceed in precisely the opposite direction. First, they buy software they don’t need (for many problems open source analytic software works just fine). Then, when the IT staff has trouble using the modeling software, they hire a statistician to build models. Next, once a statistician is on board, someone looks at the data and realizes (often) that the data will not support the model required. Finally, much later, the business owners of the problem realize they started with the wrong analytic problem. This is usually because they didn’t start with an analytic strategy.

Mistake 4. Trying to build the perfect model. Another common mistake is trying to build the perfect statistical model. Usually, the impact of a model will be much higher if a model that is good enough is deployed and then a process is put in place that: i) reviews the effectiveness of the model frequently with the business owner of the problem; ii) refreshes the model on a regular basis with the most recent data; and, iii) rebuilds the model on a periodic basis with the lessons learned from the reviews.

Mistake 5. The predictions of the model are not actionable. This was the subject of a recent post about an approach that I call the SAMS methodology. Recall that SAMS is an abbreviation for Scores/Actions/Measures/Strategy. From this point of view, the model is evaluated not just by its accuracy but by measures that directly support a specified strategy. For example, the strategy might be to increase sales by recommending another product after an initial product is selected. Here the relevant measure might be the incremental revenue generated by the recommendations. The actions would be to present up to three additional products to the shopper. The scores might range from 1 to 1000, and the three products with the highest scores would be presented. This is a simple example. Unfortunately, in most of the projects that I have been involved with, determining the appropriate actions and measures requires an iterative process to get right.
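The scores-to-actions step in that example can be sketched in a few lines. This is a toy illustration (the product names and scores are made up): products are scored from 1 to 1000 and the three highest-scoring ones are presented to the shopper.

```python
# Toy scores -> actions step: pick the three highest-scoring products.
def top_three(scores):
    """scores: dict of product -> score in [1, 1000]; returns products to present."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:3]

actions = top_three({"laptop": 120, "mouse": 870, "dock": 640, "bag": 990})
# actions is ["bag", "mouse", "dock"]
```

The point of SAMS is that this mapping from scores to actions, and the measure used to judge it (here, incremental revenue), must be designed deliberately rather than bolted on after the model is built.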

Please share in the comments below any lessons you have learned building analytic models. I would like to expand this list over time to include many of the common mistakes that occur in analytic projects.

The image above is from www.flickr.com/photos/doobybrain/360276843 and is available under a Creative Commons license.