I have moved my blog and will no longer be keeping this site up to date. You can find my blog at my web site: rgrossman.com.
The SC 09 Conference took place early this month in Portland. The Bandwidth Challenge (BWC) is an interesting and friendly rivalry between research groups to develop high performance network protocols and interesting applications that use them. The Bandwidth Challenge was started ten years ago at SC 99, which also took place in Portland.
Some of the history is available at the web site scinet.supercomputing.org. For example, in 2000, there were 2 OC-48 (2.5 Gbps) circuits that connected the research exhibits at the conference to external research networks and the challenge was to develop network protocols and applications that could fill these circuits. The winner of the BWC (called the Network Bandwidth Challenge in 2000) was a scientific visualization application called Visapult that reached 1.48 Gbps and transferred 262 GB in 1 hour (providing 582 Mbps of sustained bandwidth utilization).
This year, there were approximately twenty-four 10 GE circuits and one 40 GE circuit connecting research exhibits to external research networks, and one of the applications reached a bandwidth utilization of over 114 Gbps.
I have had an interest in the BWC over the years because you cannot analyze data without accessing it, and accessing and transporting large remote datasets has always been a challenge. To put it slightly differently, for large datasets and high performance networks, network transport protocols are an important element of the analytic infrastructure.
It’s useful to know the bandwidth delay product of a network, which is the network capacity (in Mbps, say) multiplied by the round trip time (RTT) of a packet (in seconds). This measures the amount of data on the network that has been transmitted but not yet received, which can amount to many MB of data on wide area high performance networks. This data must be buffered so that it can be resent if a packet is not received.
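As a rough illustration (the link numbers here are examples chosen for the arithmetic, not measurements from the conference), the bandwidth delay product of a 10 Gbps link with a 200 ms round trip time works out to 250 MB of in-flight data:

```python
def bandwidth_delay_product_bytes(capacity_bps, rtt_seconds):
    """Data in flight = capacity x RTT; divide by 8 to convert bits to bytes."""
    return capacity_bps * rtt_seconds / 8

# A 10 Gbps wide area link with a 200 ms round trip time:
bdp = bandwidth_delay_product_bytes(10e9, 0.200)
print(bdp / 1e6)  # 250.0 -- the sender must buffer roughly 250 MB of unacknowledged data
```

This is why wide area high performance transfers need protocols and buffer settings tuned for the bandwidth delay product; default TCP buffers are typically far smaller than this.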
Challenges that have been worked out over the past decade include:
For the past several years, it has been relatively routine for applications using FAST TCP or UDT to fill a wide area 10 Gbps network link or multiple 10 Gbps network links, if these are available.
Today’s problems include:
I ran into the first problem just after I got back from SC 09. At SC 09, we ran a number of wide area data intensive applications, and in fact won the 2009 BWC for these applications. For example, a new variant of UDT called UDX reached 9.2 Gbps over a network link with 200 ms RTT. In contrast, as soon as I got back to Chicago, I worked for a couple of days trying to get access to 200 GB of sequence data, since the sequencing instrument that produced it was not connected to a high performance network. With the device connected to a high performance research network, the data would have been available in a few minutes.
To summarize, today network experts are comfortable designing systems that can easily fill wide area 10 GE networks, but most analytic applications are not designed to use the required protocols or to take advantage of high performance networks, and most do not have access to the required networks, even if the applications could benefit from them.
In disciplines like biology that are becoming data intensive, this type of analytic infrastructure will provide distinct competitive advantages.
If your data is small, your statistical model is simple, your only output is a report, and the work needs to be done just once, then there are quite a few statistical and data mining applications that will satisfy your requirements. On the other hand, if your data is large, your model is complicated, your output is a model that needs to be deployed into operational systems, or parts of the work need to be done more than once, then you might benefit by using some of the infrastructure components, services, applications and systems that have been developed over the years to support analytics. I use the term analytic infrastructure to refer to these components, services, applications and systems.
For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that with this definition analytic infrastructure does not need to be used exclusively for modeling, but simply needs to be useful as part of the modeling process.
There are several fundamental steps when building and deploying analytic models that are directly relevant to analytic infrastructure:
| Step | Input | Output |
|------|-------|--------|
| Preprocessing | dataset (data fields) | dataset of features |
| Modeling | dataset of features | model |
| Scoring | dataset (data fields), model | scores |
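These steps can be sketched as simple function interfaces. This is a minimal illustration with made-up function names and a deliberately trivial "model" (a mean threshold), not any particular product's API:

```python
def preprocess(dataset):
    """Preprocessing: dataset of data fields -> dataset of features."""
    return [{"x": float(record["raw_value"])} for record in dataset]

def build_model(features):
    """Modeling: dataset of features -> model (here, just a mean threshold)."""
    threshold = sum(f["x"] for f in features) / len(features)
    return {"threshold": threshold}

def score(dataset, model):
    """Scoring: dataset of data fields plus a model -> scores."""
    return [1 if f["x"] > model["threshold"] else 0 for f in preprocess(dataset)]

data = [{"raw_value": "1.0"}, {"raw_value": "3.0"}]
model = build_model(preprocess(data))
print(score(data, model))  # [0, 1]
```

The point of the sketch is the boundaries: each stage consumes and produces a well-defined artifact, which is exactly where standards like PMML, discussed below, apply.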
Perhaps the most important interface in analytics is the one between components in the analytic infrastructure that produce models, such as statistical packages (which have a human in the loop), and components that score data using models, which often reside in operational environments. The former are examples of what are sometimes called model producers, while the latter are sometimes called model consumers. The Predictive Model Markup Language (PMML) is a widely deployed XML standard for describing statistical and data mining models so that model producers and model consumers can exchange models in an application independent fashion.
Version 4.0 of PMML adds the following new features:
Since Version 2.0 of PMML, which was released in 2001, PMML has included a rich enough set of transformations that data preprocessing can be described using PMML models. Using these transformations, it would be possible to use PMML to define an interface between analytic infrastructure components and services that produce features (such as data preprocessing components) and those that consume features (such as models). This is probably the second most important interface in analytics.
With Version 4.0 now released, the PMML working group is working on Version 4.1. One of the goals is to enable PMML to describe postprocessing of scores. This would allow PMML to be used as an interface between analytic infrastructure components and services that produce scores (such as modeling engines) and those that consume scores (such as recommendation engines). This is probably the third most important interface in analytics.
Today, by using PMML to describe these interfaces, it is straightforward for analytic infrastructure components and services to run on different systems. For example, a modeler might use a statistical application to build a model, but scoring might be done in a cloud, or a cloud might be used for preprocessing the data to produce features for the modeler.
If you are interested in getting involved in the PMML working group, please visit the web site: www.dmg.org
Disclaimer: I’m a member of the PMML working group and worked on PMML Version 4.0.
The TPC Benchmarks have played an important role in comparing databases and transaction processing systems. Currently, there are no similar benchmarks for comparing two clouds.
The CloudStone Benchmark is a first step towards a benchmark for clouds designed to support Web 2.0 type applications. In this note, we describe the MalStone Benchmark, which is a first step towards a benchmark for clouds, such as Hadoop and Sector, designed to support data intensive computing.
MalStone is a stylized analytic computation of a type that is common in data intensive computing. The open source code to generate data for MalStone and a technical report describing MalStone and providing some sample implementations can be found at: code.google.com/p/malgen (look in the feature downloads section along the right hand side).
We introduce MalStone with a simple example. Consider visitors to web sites. As described in the paper The Ghost in the Browser by Provos et al., presented at HotBots ’07, approximately 10% of web pages have exploits installed that can infect certain computers when users visit the web pages. Sometimes these are called “drive-by exploits.”
The MalStone benchmark assumes that there are log files that record the date and time that users visited web pages. Assume that the log files of visits have the following fields:
| Timestamp | Web Site ID | User ID |
There is a further assumption that if a computer becomes infected, this is known, perhaps at a later time. That is, for each computer, which we assume is identified by the ID of the corresponding user, it is known whether that computer was compromised at some later time:
| User ID | Compromise Flag |
Here the Compromise Flag is 1 if the computer was compromised and 0 otherwise. A very simple statistic that provides some insight into whether a web page is a possible source of compromises is to compute, for each web site, the ratio of visits in which the computer subsequently becomes compromised to those in which the computer remains uncompromised.
We call MalStone stylized since we do not argue that this is a useful or effective algorithm for finding compromised sites. Rather, we point out that if the log data is so large that it requires large numbers of disks to manage it, then computing something as simple as this ratio can be computationally challenging. For example, if the data spans 100 disks, then the computation cannot be done easily with any of the databases that are common today. On the other hand, if the data fits into a database, then this statistic can be computed easily using a few lines of SQL.
The MalStone benchmarks use records of the following form:
| Event ID | Timestamp | Site ID | Compromise Flag | Entity ID |
Here site abstracts web site and entity abstracts the possibly infected computer. We assume that each record is 100 bytes long.
In the MalStone A Benchmark, for each site, the number of records for which an entity visited the site and subsequently became compromised is divided by the total number of records for which an entity visited the site. The MalStone B Benchmark is similar, but this ratio is computed for each week (a window is used from the beginning of the period to the end of the week of interest). MalStone A-10 uses 10 billion records so that in total there is 1 TB of data. Similarly, MalStone A-100 requires 100 billion records and MalStone A-1000 requires 1 trillion records. MalStone B-10, B-100 and B-1000 are defined in the same way.
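In memory, the MalStone A statistic is only a few lines of code; the point of the benchmark is doing this when the records span many disks. Here is a single-machine sketch (the field layout follows the record format above; the sample records are made up):

```python
from collections import defaultdict

def malstone_a(records):
    """Per-site ratio of visits by subsequently compromised entities to all visits.

    Each record is (event_id, timestamp, site_id, compromise_flag, entity_id).
    """
    visits = defaultdict(int)       # total visits per site
    compromised = defaultdict(int)  # visits by entities later compromised
    for _, _, site_id, flag, _ in records:
        visits[site_id] += 1
        if flag == 1:
            compromised[site_id] += 1
    return {site: compromised[site] / visits[site] for site in visits}

records = [
    (1, "2009-11-01T10:00", "site-a", 1, "user-1"),
    (2, "2009-11-01T11:00", "site-a", 0, "user-2"),
    (3, "2009-11-02T09:00", "site-b", 0, "user-3"),
]
print(malstone_a(records))  # {'site-a': 0.5, 'site-b': 0.0}
```

MalStone B would repeat this per-week over a growing window, which is what makes it noticeably more expensive than MalStone A in the timings below.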
One of the motivations for choosing 10 billion 100-byte records is that the TeraSort Benchmark (sometimes called the Terabyte Sort Benchmark) also uses 10 billion 100-byte records.
In 2008, Hadoop became the first open source program to hold the record for the TeraSort Benchmark. It was able to sort 1 TB of data using 910 nodes in 209 seconds, breaking the previous record of 297 seconds. Hadoop set a new record in 2009 by sorting 100 TB of data at 0.578 TB/minute using 3800 nodes. For some background about the TeraSort Benchmark, see the blog posting Hadoop Wins Terasort by James Hamilton.
Note that the TeraSort Benchmark is now deprecated and has been replaced by the Minute Sort Benchmark. Currently, 1 TB of data can be sorted in about a minute given the right software and sufficient hardware.
We have developed a generator of synthetic data for MalStone called MalGen. MalGen is open source and available from code.google.com/p/malgen. Using MalGen, data can be generated with power law distributions, which is useful when modeling web sites (a few sites have a lot of visitors, but most sites have relatively few visitors).
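To illustrate what a power law distribution of visits looks like, here is a small sketch. This is not MalGen's actual algorithm, and the site names, exponent and seed are made up for the example; it simply weights each site by a power of its rank so that a few sites dominate:

```python
import random

def power_law_visits(num_sites, total_visits, alpha=1.5, seed=42):
    """Assign visits to sites with probability proportional to 1 / rank**alpha."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_sites + 1)]
    sites = ["site-%d" % i for i in range(num_sites)]
    return rng.choices(sites, weights=weights, k=total_visits)

visits = power_law_visits(num_sites=1000, total_visits=10000)
# The top-ranked site receives far more visits than a mid-ranked one.
```

A skewed distribution like this matters for the benchmark because the records for popular sites concentrate load on whichever nodes hold them.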
Recently, we did several experimental studies comparing different implementations of MalStone on 10 billion 100-byte records. The experiments were done on 20 nodes of the Open Cloud Testbed. Each node was a Dell 1435 computer with 12 GB of memory, a 1 TB disk, two dual-core 2.0 GHz AMD Opteron 2212 processors, and a 1 Gb/s network interface card.
We compared three different implementations: 1) Hadoop HDFS with Hadoop’s implementation of MapReduce; 2) Hadoop HDFS using Streams and coding MalStone in Python; and 3) the Sector Distributed File System (SDFS) and coding the algorithm using Sphere User Defined Functions (UDFs).
MalStone A-10:

| Implementation | Time |
|----------------|------|
| Hadoop MapReduce | 454m 13s |
| Hadoop Streams/Python | 87m 29s |
| Sector/Sphere UDFs | 33m 40s |

MalStone B-10:

| Implementation | Time |
|----------------|------|
| Hadoop MapReduce | 840m 50s |
| Hadoop Streams/Python | 142m 32s |
| Sector/Sphere UDFs | 43m 44s |
Please note that these timings are still preliminary and may be revised in the future as we better optimize the implementations.
If you have 1000 nodes and want to run a data intensive or analytic computation, then Hadoop is a very good choice. What these preliminary benchmarks indicate though is that you may want to compare the performance of Hadoop MapReduce and Hadoop Streams. In addition, you may also want to consider using Sector.
Disclaimer: I am involved in the development of Sector.
This is the first of three posts about systems, applications, services and architectures for building and deploying analytics. Sometimes this is called analytic infrastructure. This post is primarily directed at the analytic infrastructure needs of companies. Later posts will look at analytic infrastructure for the research community.
In this first post of the series, we discuss five important trends impacting analytic infrastructure.
Trend 1. Open source analytics has reached Main Street. R, which was first released in 1996, is now the most widely deployed open source system for statistical computing. A recent article in the New York Times estimated that over 250,000 individuals use R regularly. Dice News has created a video called “What’s Up with R” to inform job hunters using their services about R. In the language of Geoffrey A. Moore’s book Crossing the Chasm, R has reached “Main Street.”
Some companies still either ban the use of open source software or require an elaborate approval process before open source software can be used. Today, if a company does not allow the use of R, it puts the company at a competitive disadvantage.
Trend 2. The maturing of open, standards based architectures for analytics. Many of the common applications used today to build statistical models are stand-alone applications designed to be used by a single statistician. It is usually a challenge to deploy the model produced by the application into operational systems. Some applications can express statistical models as C++ or SQL, which makes deployment easier, but it can still be a challenge to transform the data into the format expected by the model.
The Predictive Model Markup Language (PMML) is an XML language for expressing statistical and data mining models that was developed to provide an application-independent and platform-independent mechanism for importing and exporting models. PMML has become the dominant standard for statistical and data mining models. Many applications now support PMML.
By using these applications, it is possible to build an open, modular standards based environment for analytics. With this type of open analytic environment, it is quicker and less labor-intensive to deploy new analytic models and to refresh currently deployed models.
Disclaimer: I’m one of the many people that has been involved in the development of the PMML standard.
Trend 3. The emergence of systems that simplify the analysis of large datasets. Analyzing large datasets is still very challenging, but with the introduction of Hadoop, there is now an open source system supporting MapReduce that scales to thousands of processors.
The significance of Hadoop and MapReduce is not only the scalability, but also the simplicity. Most programmers, with no prior experience, can have their first Hadoop job running on a large cluster within a day. Most programmers find that it is much easier and much quicker to use MapReduce and some of its generalizations than it is to develop and implement an MPI job on a cluster, which is currently the most common programming model for clusters.
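As an illustration of that simplicity, a complete Hadoop Streaming job can be two short Python scripts, a mapper and a reducer, that read lines on stdin and write tab-separated key/value pairs; Hadoop supplies the shuffle (sorting by key) between them. A self-contained word-count sketch, simulating the shuffle with a local sort:

```python
from itertools import groupby

def mapper(lines):
    # In a real Streaming job, `lines` would be sys.stdin.
    # Emit one "word<TAB>1" pair per word seen.
    for line in lines:
        for word in line.split():
            yield word + "\t1"

def reducer(sorted_pairs):
    # Hadoop delivers mapper output to the reducer sorted by key;
    # here we just sum the counts for each word.
    pairs = (p.split("\t") for p in sorted_pairs)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word + "\t" + str(sum(int(count) for _, count in group))

# Local simulation of map -> shuffle (sort) -> reduce:
output = list(reducer(sorted(mapper(["a b a", "b"]))))
```

The same two-function shape is how the Hadoop Streams/Python implementation of MalStone mentioned earlier is structured, with the mapper keyed by site ID instead of by word.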
Trend 4. Cloud-based data services. Over the next several years, cloud-based services will begin to impact analytics significantly. A later post in this series will show how simple it is to use R in a cloud, for example. Although there are security, compliance and policy issues to work out before it becomes common to use clouds for analytics, I expect that these and related issues will all be worked out over the next several years.
Cloud-based services provide several advantages for analytics. Perhaps the most important is elastic capacity — if 25 processors are needed for one job for a single hour, then these can be used for just the single hour and no more. This ability of clouds to handle surge capacity is important for many groups that do analytics. With the appropriate surge capacity provided by clouds, modelers can be more productive, and this can be accomplished in many cases without requiring any capital expense. (Third party clouds provide computing capacity that is an operating and not a capital expense.)
Trend 5. The commoditization of data. Moore’s law applies not only to CPUs, but also to the chips used in all of the digital devices that produce data. The result has been that the cost to produce data has been falling for some time. Similarly, the cost to store data has also been falling for some time.
Indeed, more and more datasets are being offered for free. For example, end of day stock quotes from Yahoo, gene sequence data from NCBI, and public data sets hosted by Amazon, including the U.S. Census Bureau, are all available now for free.
The significance to analytics is that the cost to enrich data with third party data, which often produces better models, is falling. Over time, more and more of this data will be available in clouds, so that the effort to integrate this data into modeling will also decrease.
In this note, let’s define analytics as the analysis of data in order to take actions. (This is a narrow definition of analytics, but one that is useful here.) If you don’t have day to day work experience with analytics, it is easy to have the mistaken impression that analytics is only about data and statistical models.
Although understanding data and developing statistical models is certainly an important component of an analytic project, this is just one aspect of analytics. This aspect includes cleaning data, enriching data, exploring data, developing features, building models, validating models, and iterating the process. From a broad perspective, this is a process in which the input is data and the output is a statistical model. When most people think of modeling, this is what they think of. For many analytic projects, this is just a small part of what is required for a successful engagement.
The second aspect of analytics is what I am concerned with in this note. This is the aspect of analytics concerned with:
One way to remember this is using the mnemonic SAMS for Scores, Actions, Measures and Strategies.
For example, with a response model, often a threshold is used. If the score from the response model is above the threshold, an offer is made (this is the action); if not, no offer is made.
Here are some examples of SAMS:
| Model | Score | Action | Measure | Strategy |
|-------|-------|--------|---------|----------|
| on-line response model | likelihood to respond to an offer | display the offer to the visitor that has the highest likelihood of response and available inventory | revenue per day generated by the web site | increase revenue from a website by improving targeting of offers |
| fraud model | likelihood that a transaction is fraudulent | approve, decline, or obtain more information | detection and false positive rates | reduce costs and improve customer experience by lowering fraud rates |
| data quality model | likelihood that a data source has data quality problems | if the score is above a threshold, manually investigate the data to check whether there is in fact a data quality problem | detection and false positive rates | improve operational efficiencies by detecting data quality problems more quickly |
A successful analytics project requires a careful study of what actions are possible; of the possible actions, which can be deployed into operational systems; and how the systems can be instrumented so that the data required to compute the required measures is available.
The organizational challenge when developing and deploying analytics is that four groups must work together to complete a successful analytic project: