Java™ Data Mining (JSR-73):
Overview and Status
Abstract

Mark Hornick
JSR-73 Specification Lead
Senior Manager, Development
Data Mining Technologies
Oracle Corporation

August 27, 2003

Data mining continues to gain mainstream acceptance for providing strategic advantage in areas such as customer relationship management, fraud detection, national security, and life sciences.  Java technology, especially as leveraged within the scalable J2EE architecture, facilitates integration with applications such as B2C / B2B web sites, customer care centers, campaign management, and genomic / molecular pattern recognition and discovery.

Historically, application developers coded homegrown data mining algorithms into applications, or used sophisticated end-user GUIs. These GUIs packaged a suite of algorithms complete with support for data transformation, model building, testing, and scoring. However, it was difficult, if not impossible, to embed data mining end-to-end in applications using commercial data mining products due to inadequate APIs. If a vendor had an API, it was proprietary, making the development of a product using that API risky. If a different vendor's solution was later required, rewriting that product was also potentially costly.

The ability to leverage data mining functionality via a standard API greatly reduces risk and potential cost. With a standard API, customers can use multiple products for solving business problems by applying the most appropriate algorithm implementation without investing resources to learn each vendor's proprietary API. Moreover, a standard API makes data mining more accessible to developers while making developer skills more transferable. Vendors can now differentiate themselves on price, performance, accuracy, and features. Java Data Mining (JDM) addresses this need for Java.

As with any standard, defining compliance for vendor implementations raises a myriad of issues. Should all implementations be required to support all algorithms and features? Should the results of data mining operations, e.g., rules in a decision tree model or scoring results, be the same for the same datasets across vendor implementations? For JDM, compliance is based on a core feature set with optional packages for each mining function and algorithm. This enables vendors that specialize in certain algorithms, e.g., neural networks, to conform to the JDM standard while implementing only the relevant packages. JDM also provides supportsCapability methods that allow applications to determine at runtime whether a vendor implementation supports a finer-grained feature, e.g., whether classification model build accepts a cost matrix specification, or whether the clustering algorithm produces hierarchically arranged clusters. Regarding absolute correctness of results, JDM does not specify a particular result for mining algorithms; instead, it focuses on the structure of the results. For example, if a decision tree is produced, each tree node can have at most one parent and there can be at most one root node.
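
The decision tree example can be made concrete with a small structural check of the kind a conformance test might apply. The following is only a sketch under stated assumptions: the class TreeStructureCheck, the Edge type, and the hasValidShape method are hypothetical names invented for illustration and are not part of the JDM API. The check looks only at the shape of the tree (parent and root counts), not at the rules the tree contains, and it does not attempt to detect cycles.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; none of these names come from the JDM API.
class TreeStructureCheck {

    /** A directed parent -> child link between two node identifiers. */
    static final class Edge {
        final String parent;
        final String child;

        Edge(String parent, String child) {
            this.parent = parent;
            this.child = child;
        }
    }

    /**
     * Returns true when every node has at most one parent and at most
     * one node has no parent (the root). Cycles are not checked here.
     */
    static boolean hasValidShape(List<Edge> edges) {
        Set<String> nodes = new HashSet<>();
        Map<String, String> parentOf = new HashMap<>();
        for (Edge e : edges) {
            nodes.add(e.parent);
            nodes.add(e.child);
            // Record the first parent seen for this child.
            String previous = parentOf.putIfAbsent(e.child, e.parent);
            if (previous != null && !previous.equals(e.parent)) {
                return false; // child is already linked to a different parent
            }
        }
        // A root is any node that never appears as a child.
        long roots = nodes.stream().filter(n -> !parentOf.containsKey(n)).count();
        return roots <= 1;
    }

    public static void main(String[] args) {
        List<Edge> tree = List.of(
                new Edge("root", "a"), new Edge("root", "b"), new Edge("a", "c"));
        List<Edge> twoParents = List.of(
                new Edge("root", "a"), new Edge("root", "b"),
                new Edge("a", "c"), new Edge("b", "c"));
        System.out.println(hasValidShape(tree));
        System.out.println(hasValidShape(twoParents));
    }
}
```

Two vendors could produce different trees for the same dataset and both pass a check like this, which is the sense in which compliance concerns structure rather than specific results.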

In JDM, we defined one specialized conformance option: the scoring engine. Vendors who specialize in scoring data, or who provide a scoring engine option in their product offering, may support a subset of JDM features, namely tasks for model import and apply. A scoring engine minimally must support the import of models, either in a proprietary format or using a standard format such as PMML, and the scoring itself, either in batch or in real time. The scoring engine option fits well into enterprise architectures where model building is performed in one location and scoring in other, possibly geographically distant, locations. To this end, JDM also specifies an export task to move mining objects between data mining systems.
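
As a rough sketch of how an application might work with a scoring-engine-only installation, the code below models an engine that supports just the import and apply tasks and routes every other task to the site where models are built. All names here (ScoringEngineSketch, MiningTask, MiningEngine, supportsTask, chooseSite) are hypothetical and chosen for illustration; they are not the JDM interfaces. Only the five task names are taken from the first-release task list.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration; none of these types come from the JDM API.
class ScoringEngineSketch {

    /** The five first-release mining tasks. */
    enum MiningTask { BUILD, APPLY, TEST, IMPORT, EXPORT }

    /** Stand-in for a vendor implementation that can be queried at runtime. */
    interface MiningEngine {
        boolean supportsTask(MiningTask task);
    }

    /** A scoring-engine-only product: model import and apply, nothing else. */
    static final class ScoringOnlyEngine implements MiningEngine {
        private final Set<MiningTask> supported =
                EnumSet.of(MiningTask.IMPORT, MiningTask.APPLY);

        @Override
        public boolean supportsTask(MiningTask task) {
            return supported.contains(task);
        }
    }

    /**
     * Picks where a task should run: at the scoring site when its engine
     * supports the task, otherwise at the site where models are built.
     */
    static String chooseSite(MiningTask task, MiningEngine scoringSiteEngine) {
        return scoringSiteEngine.supportsTask(task) ? "scoring site" : "build site";
    }

    public static void main(String[] args) {
        MiningEngine engine = new ScoringOnlyEngine();
        for (MiningTask task : MiningTask.values()) {
            System.out.println(task + " -> " + chooseSite(task, engine));
        }
    }
}
```

In this arrangement a model would be built and exported at the build site, then imported and applied at each scoring site, matching the distributed architecture described above.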

In the course of designing the API, the expert group has made numerous tough choices about which features to include in the first release. We have selected classification, regression, associations, clustering, and attribute importance as first release mining functions, and build, apply, test, import, and export as first release mining tasks. Features being considered for the next release of JDM include: a web services interface, transformations, mining unstructured data such as text and images, multi-target models, ensembles, and additional mining functions such as feature extraction and forecasting.

As a Java Specification Request under Sun's Java Community Process (JCP), JDM must go through several reviews and a final vote by the JCP Executive Committee before being accepted as a Java standard. In addition, the JCP requires that the API have a Reference Implementation (RI) and a Technology Compatibility Kit (TCK). The RI ensures that the specification is implementable while providing potential users and implementors a working system for understanding intended behavior. Vendors implementing the standard must certify their implementation by executing and passing the TCK, a test suite ensuring compliance. Both the RI and TCK pose certain challenges for a data mining API in terms of completeness and depth of conformance specification. Feedback from the JDM Community Review and Public Review has been positive and supportive. Initial experience with the RI has further helped to refine the model.