Abstract

 

 

The XELOPES Data Mining Library

 

 

Michael Thess

prudsys AG, thess@prudsys.com

 

 

 

1. Introduction

 

The XELOPES library is an open and platform-independent Data Mining library that supports the main Data Mining standards. XELOPES is available under the GNU GPL.

 

XELOPES was developed in close cooperation between the German Data Mining vendor prudsys AG and the Russian MDA specialist ZSoft Ltd.

 

XELOPES is characterized by the following unique features:

 

·         Maximum utilization of reusable components: highly modular architecture based on the CWM standard,

·         Platform independence: specified in UML; implementations exist for Java, C++, and C#, together with interfaces for CORBA and Web Services,

·         Independence of data sources: an abstraction of the data source allows different data sources such as memory, files, and databases to be treated in a unified way,

·         Support of main Data Mining standards and libraries: CWM, PMML, OLE DB for DM, (JDMAPI), MLC++, Weka.

 

XELOPES can be advantageous for many Data Mining users and specialists for the following reasons:

 

·         XELOPES can be used as a bridge to existing Data Mining standards, especially as a comfortable PMML converter,

·         XELOPES can be considered as one of the first pilot implementations of the emerging CWM standard,

·         XELOPES can be used to include Data Mining models and algorithms into third-party software,

·         XELOPES can be used as a source of inspiration for one's own Data Mining software.

 

In this abstract we describe the design of XELOPES and its relation to existing Data Mining standards and libraries.

 

 

2. Motivation

 

The creation of XELOPES was motivated by the following requirements:

 

·         Take part in the worldwide development of Data Mining specifications,

·         Provide a uniform algorithm basis for all prudsys products,

·         Since prudsys is a small company, the use of open standards is crucial for the acceptance of its Data Mining products,

·         prudsys software is coded in Java, C++, and also uses CORBA; therefore a platform-independent approach was required,

·         Provide a library that allows the users of prudsys products to integrate the Data Mining models generated by the prudsys products into their applications,

·         Increase the popularity of prudsys products.

 

To meet these requirements, XELOPES was built on a modern and complex architecture that is described in the next section.

 

 

3. Architecture

 

In order to meet the above requirements, it was decided to build the library completely on the (extended) CWM standard. In addition, (extended) PMML was chosen as the format for XML serialization of the Data Mining models.

 

The library was modelled strictly following the MDA (Model Driven Architecture) approach of the OMG (Figure 1).

 

 

Figure 1: Layers of the MDA model.

 

The basic idea behind MDA is to develop a base Platform-Independent Model (PIM) that maps to one or more Platform-Specific Models (PSMs) which can be implemented. These models are defined in UML (Unified Modelling Language), making OMG's standard modelling language the foundation of the MDA. 

 

From the PSMs the platform-specific interfaces can be derived. This is usually done by UML tools like Rational Rose. In the final step, for programming languages the interfaces must be implemented manually. The whole MDA process for XELOPES is shown in Figure 3.    

 

3.1 The PIM of XELOPES

 

The starting point was the CWM 1.0 standard, which also contains a UML-based description of Data Mining models (Chapter 5). Using the corresponding XMI format, the complete CWM specification including the Data Mining part was imported into Rational Rose. As a result, three basic Data Mining UML diagrams were derived. These diagrams were extended according to further Data Mining requirements such as PMML export. The three diagrams are:

 

1.      Model: This diagram represents the Data Mining models produced by Data Mining algorithms. The central class is MiningModel. This CWM class was extended by further XELOPES classes such as AssociationRulesMiningModel and ClusteringMiningModel, which represent basic mining model types (often referred to as functions). In addition, further classes for describing the specific models were added. These classes were strongly influenced by the PMML and JDM standards.

2.      Settings: The Settings diagram represents the mining settings of the Data Mining algorithms on the function level, including specific mining attributes. The central CWM class MiningSettings was extended by some function classes, such as SequentialSettings and TimeSeriesMiningSettings, that are not yet included in the CWM specification.

3.      Attributes: The Attributes diagram defines the Data Mining attributes. The basic class is MiningAttribute which is subclassed into CategoricalAttribute and NumericAttribute. XELOPES extensions mainly specify the category hierarchy that can be assigned to categorical attributes.
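The Attributes diagram can be pictured with a small Java sketch. Only the class names MiningAttribute, CategoricalAttribute, and NumericAttribute come from the description above; every field and method (for example addCategory, indexOf, and pathToRoot) is an assumption made for illustration and is not taken from the XELOPES API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Class names follow the text; all fields and methods are assumptions.
abstract class MiningAttribute {
    private final String name;

    MiningAttribute(String name) { this.name = name; }

    String getName() { return name; }
}

class NumericAttribute extends MiningAttribute {
    NumericAttribute(String name) { super(name); }
}

class CategoricalAttribute extends MiningAttribute {
    private final List<String> categories = new ArrayList<>();
    private final Map<String, String> parentOf = new HashMap<>();

    CategoricalAttribute(String name) { super(name); }

    /** Registers a category without a parent (no effect if already known). */
    void addCategory(String category) {
        if (!categories.contains(category)) {
            categories.add(category);
        }
    }

    /** Registers a category and places it below the given parent category. */
    void addCategory(String category, String parent) {
        addCategory(parent);
        addCategory(category);
        parentOf.put(category, parent);
    }

    /** Position of the category in registration order, or -1 if unknown. */
    int indexOf(String category) { return categories.indexOf(category); }

    /** Walks from a category up to its top-level ancestor (no cycle check). */
    List<String> pathToRoot(String category) {
        List<String> path = new ArrayList<>();
        for (String c = category; c != null; c = parentOf.get(c)) {
            path.add(c);
        }
        return path;
    }
}
```

In this sketch the category hierarchy is stored as a simple child-to-parent map, which is one of several possible ways to attach a hierarchy to a categorical attribute.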

 

These three conceptual areas describe a mining model produced by an algorithm. However, XELOPES intends to model the whole Data Mining process including data access, transformations, algorithms, and automation. For this purpose, the following four diagrams were added to the PIM:

 

4.      DataAccess: The DataAccess diagram describes the data source of a mining algorithm. The basic idea is that any Data Mining algorithm operates in the (often very high-dimensional) Cartesian space of the mining attributes and always requires a data matrix, which can of course be sparse. This data matrix is represented by the abstract class MiningInputStream, which offers a wide spectrum of data access mechanisms, ranging from simple cursor-based access through block-wise access to random access. For some data storage types such as memory, files, and databases, the corresponding extensions of MiningInputStream are already predefined (classes MiningStoredData, MiningFileStream, and MiningSqlStream, respectively). The use of MiningInputStream makes XELOPES independent of specific data sources. The read method of MiningInputStream returns a MiningVector, which represents a vector of the data matrix (the counterpart to MiningAttribute). Following the approach of the Weka library, MiningVector is extended by MiningSparseVector and MiningBinarySparseVector in order to handle sparse data.

5.      Transformation: The Transformation diagram is used to model different types of data transformations. This is the most complicated diagram because it must ensure three compatibilities at the same time: support the Transformation package of CWM, combination with any MiningInputStream, and export in PMML. The central classes are MiningTransformationActivity for a sequence of transformation steps and MiningTransformationStep for simultaneous transformations. Different types of transformations are supported: inner and outer, dynamic and static, on vector and stream level, uni- and multidimensional. Transformations are also tracked in the metadata of the models (diagrams Model and Settings) and transformation elements are included in most of the other PIM diagrams.

6.      Algorithm: This diagram describes the mining algorithm itself. It is the central element of the XELOPES PIM. The main class is MiningAlgorithm, and extended classes such as ClassificationAlgorithm and ClusteringAlgorithm are included for the basic function models. The algorithm-specific mining settings are defined through the class MiningAlgorithmSpecification. Any MiningAlgorithm receives a MiningInputStream object as data input and MiningSettings and MiningAlgorithmSpecification objects as mining settings on the function and algorithm level, and returns a MiningModel object representing the Data Mining model produced. In addition, the class MiningListener allows listeners to be attached to a mining algorithm in order to monitor its progress.

7.      Automation: The Automation diagram was included into the PIM because the integration of mining algorithms into applications often requires that they work fully automatically and tune their mining settings autonomously. The diagram contains two basic classes: The class MiningModelAssessment assesses the quality of a mining model, whereas MiningAutomationCallback uses the assessment result to generate new mining settings for restarting the mining algorithm. Thus, the three classes MiningAlgorithm, MiningModelAssessment, and MiningAutomationCallback form a closed loop of running a mining process (Figure 2). The class MiningModelAssessment is further subclassed into IntrospectiveModelAssessment, which only assesses the model itself (examples are the number of association rules or the VC dimension for SVMs), and ExternalDataModelAssessment, which also uses new data to assess the model (for example, the classification rate). The main purpose of the Automation diagram was to build all automation strategies, such as pruning of decision trees, cross-validation techniques, or optimisation of regularization parameters, from a small number of building blocks. The diagram is complicated and still requires substantial improvement.
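The interplay of the DataAccess and Algorithm diagrams can be sketched in Java as follows. The class names MiningVector, MiningInputStream, MiningStoredData, MiningSettings, MiningAlgorithmSpecification, MiningModel, MiningListener, and MiningAlgorithm come from the description above. All method signatures, the choice to show only cursor-based access, and the toy classes MeanModel and MeanAlgorithm are assumptions for illustration and do not reproduce the actual XELOPES interfaces.

```java
import java.util.ArrayList;
import java.util.List;

/** One row of the data matrix (member names are assumptions). */
class MiningVector {
    final double[] values;
    MiningVector(double... values) { this.values = values; }
}

/** Cursor-based view of the data matrix; other access modes are omitted. */
abstract class MiningInputStream {
    abstract void reset();         // move the cursor before the first vector
    abstract boolean next();       // advance; false when the data is exhausted
    abstract MiningVector read();  // vector at the current cursor position
}

/** In-memory data source. */
class MiningStoredData extends MiningInputStream {
    private final List<MiningVector> rows = new ArrayList<>();
    private int cursor = -1;

    void add(MiningVector v) { rows.add(v); }
    @Override void reset() { cursor = -1; }
    @Override boolean next() {
        if (cursor + 1 >= rows.size()) return false;
        cursor++;
        return true;
    }
    @Override MiningVector read() { return rows.get(cursor); }
}

class MiningSettings { }                // function-level settings (empty here)
class MiningAlgorithmSpecification { }  // algorithm-level settings (empty here)
interface MiningModel { }
interface MiningListener { void progress(int vectorsProcessed); }

abstract class MiningAlgorithm {
    protected final MiningInputStream input;
    protected final MiningSettings settings;
    protected final MiningAlgorithmSpecification specification;
    private final List<MiningListener> listeners = new ArrayList<>();

    MiningAlgorithm(MiningInputStream input, MiningSettings settings,
                    MiningAlgorithmSpecification specification) {
        this.input = input;
        this.settings = settings;
        this.specification = specification;
    }

    void addListener(MiningListener listener) { listeners.add(listener); }

    protected void fireProgress(int vectorsProcessed) {
        for (MiningListener l : listeners) l.progress(vectorsProcessed);
    }

    abstract MiningModel buildModel();
}

/** Toy model for the sketch: the mean of every column. */
class MeanModel implements MiningModel {
    final double[] means;
    MeanModel(double[] means) { this.means = means; }
}

/** Toy algorithm; assumes that all vectors have the same length. */
class MeanAlgorithm extends MiningAlgorithm {
    MeanAlgorithm(MiningInputStream in, MiningSettings s, MiningAlgorithmSpecification spec) {
        super(in, s, spec);
    }

    @Override MiningModel buildModel() {
        input.reset();
        double[] sums = null;
        int n = 0;
        while (input.next()) {
            double[] v = input.read().values;
            if (sums == null) sums = new double[v.length];
            for (int i = 0; i < v.length; i++) sums[i] += v[i];
            n++;
            fireProgress(n);  // notify listeners after every vector
        }
        if (sums == null) return new MeanModel(new double[0]);
        for (int i = 0; i < sums.length; i++) sums[i] /= n;
        return new MeanModel(sums);
    }
}
```

Because MeanAlgorithm only talks to the abstract MiningInputStream, the same code would run unchanged on a file-based or database-based stream, which is the point of the data source abstraction.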

 

 

Figure 2: The closed loop process of XELOPES mining automation.
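The closed loop can be sketched in a few lines of Java. This is a stand-alone illustration: the names MiningAlgorithm, MiningModelAssessment, MiningAutomationCallback, MiningSettings, and MiningModel come from the text, but they are redefined here as minimal stand-ins. The single tunable parameter complexity, all method signatures, the convention that the callback returns null to stop, and the helper class AutomationLoop are assumptions.

```java
/** Stand-in settings with one tunable parameter (an assumption). */
class MiningSettings {
    final int complexity;
    MiningSettings(int complexity) { this.complexity = complexity; }
}

/** Stand-in model that only remembers the settings it was built with. */
class MiningModel {
    final int complexity;
    MiningModel(int complexity) { this.complexity = complexity; }
}

interface MiningAlgorithm {
    MiningModel buildModel(MiningSettings settings);
}

interface MiningModelAssessment {
    /** Returns a quality score; higher is better in this sketch. */
    double assess(MiningModel model);
}

interface MiningAutomationCallback {
    /** Proposes settings for the next run, or null to stop the loop. */
    MiningSettings nextSettings(MiningSettings current, double quality);
}

class AutomationLoop {
    /** Runs build, assess, and re-tune until the callback stops or maxRounds is hit. */
    static MiningModel run(MiningAlgorithm algorithm,
                           MiningModelAssessment assessment,
                           MiningAutomationCallback callback,
                           MiningSettings initial,
                           int maxRounds) {
        MiningSettings settings = initial;
        MiningModel best = null;
        double bestQuality = Double.NEGATIVE_INFINITY;
        for (int round = 0; round < maxRounds && settings != null; round++) {
            MiningModel model = algorithm.buildModel(settings);
            double quality = assessment.assess(model);
            if (quality > bestQuality) {  // keep the best model seen so far
                best = model;
                bestQuality = quality;
            }
            settings = callback.nextSettings(settings, quality);
        }
        return best;
    }
}
```

A callback that increases the parameter as long as the quality improves, and returns null once it drops, yields a simple hill-climbing strategy. Strategies such as pruning or cross-validation would plug into the same three interfaces with different assessment and callback implementations.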

 

All seven diagrams of the PIM are combined into a single large UML class diagram, which is also available for download in XMI format. The PIM specification of XELOPES is still very general. During the development of the library, all platform-specific methods and classes were removed from the PIM.

 

3.2 The PSMs of XELOPES

 

The PSMs of XELOPES contain the platform-specific UML diagrams and are based on the PIM. Examples of platform-specific elements are getter and setter methods of attributes defined in the PIM or the specific PMML handling (for example, the Java version uses DTDs and Data Binding, the C++ version uses DTDs and DOM parsers, and the C# version works with schemas and Data Binding). Therefore, the PSMs are much more complex than the PIM. Compatibility between the PSMs is defined through the PIM and PMML, which allows mining models to be exchanged between the different platforms (see also Figure 3).
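How an XML format can carry model metadata between platforms may be illustrated with a minimal Java sketch. It writes and re-reads a simplified fragment modelled on the PMML DataDictionary element with DataField entries. The fragment is not a complete, valid PMML document, the class PmmlDictionarySketch and its methods are hypothetical, and the DOM parser is used here only for brevity, not because the Java version of XELOPES works this way.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PmmlDictionarySketch {

    /**
     * Writes attribute name / optype pairs as a DataDictionary fragment.
     * Assumes that names and optypes contain no XML special characters.
     */
    static String write(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("<DataDictionary numberOfFields=\"")
          .append(fields.size()).append("\">");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<DataField name=\"").append(e.getKey())
              .append("\" optype=\"").append(e.getValue()).append("\"/>");
        }
        sb.append("</DataDictionary>");
        return sb.toString();
    }

    /** Reads the fragment back into a map, preserving the field order. */
    static Map<String, String> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("DataField");
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                fields.put(field.getAttribute("name"), field.getAttribute("optype"));
            }
            return fields;
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("Could not parse fragment", e);
        }
    }
}
```

Any platform that can produce and parse the same elements can exchange this information, regardless of whether it uses DTDs, schemas, data binding, or a DOM parser internally.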

 

3.3 Deriving the Interfaces for different Platforms

 

Once the PSMs have been defined, they must be translated into the corresponding skeleton of their platform. For XELOPES, Rational Rose was used to derive the Java skeleton, Together ControlCenter for C++, and Rational Rose XDE for C#. For CORBA and Web Services, simplified PSMs were used, which are not yet fully compliant with the PIM.

 

Figure 3: Scheme of the different specification levels of XELOPES.

 

3.4 Implementations of XELOPES

 

At present, there are three implementations of XELOPES available: for Java, C++, and C#. The Java implementation is the most extensive one and contains algorithms of all main function levels. The C++ implementation focuses mainly on classification and regression methods. The C# implementation is still at an early stage and currently contains only decision trees.

 

All three implementations are continuously improved and extended by new algorithms.

 

 

4. Relation to Data Mining Standards and Libraries

 

The creation of XELOPES was necessary because none of the existing Data Mining standards and libraries was able to meet the main requirements listed in Section 2.

 

Indeed, the CWM standard (including Version 1.1) is still not extensive enough to serve as the full basis of a Data Mining library, which is why it was extended for XELOPES. The PMML standard only specifies Data Mining models. The SQL / MM and OLE DB for Data Mining standards are limited to databases. The JDM standard is not platform-independent. The same holds for the popular Data Mining libraries MLC++ and Weka.

 

The following list describes how XELOPES is related to the existing standards and libraries:

 

·         CWM: XELOPES is entirely based on the CWM 1.0 standard and hence fully compatible with it. However, the Data Mining part of the proposed CWM 1.1 is not compatible with that of CWM 1.0, and hence neither is XELOPES. There are no plans to rewrite the PIM of XELOPES for CWM 1.1. Instead, there will be an XML connector for the XML serialization of XELOPES for CWM 1.1.

·         PMML: XELOPES uses PMML as the format for XML serialization of its Data Mining models. XELOPES fully supports the PMML 2.0 standard, and the C# version even supports PMML 2.1. As soon as PMML 3 is adopted, it will be supported by XELOPES.

·         OLE DB for Data Mining: Via a special adapter, XELOPES supports the main part of the syntax of OLE DB for Data Mining, but many functions of the specification have been left out.

·         JDM: A JDM interface for XELOPES is in development and will be available soon.

·         SQL / MM: Still not supported.

·         MLC++ library: The C++ version of XELOPES contains an adapter to the MLC++ library, which allows MLC++ algorithms to be called from XELOPES.

·         Weka library: The Java version of XELOPES has an adapter to the Weka library, which allows Weka algorithms to be called from XELOPES.
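The adapter idea behind the MLC++ and Weka entries can be sketched generically in Java. Nothing in this sketch is taken from the Weka, MLC++, or XELOPES APIs: ForeignLearner stands for an arbitrary third-party algorithm that expects a dense array, VectorSource is a minimal cursor-style data source, and ForeignAlgorithmAdapter is a hypothetical name for the bridging class.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a third-party learner with its own input convention. */
interface ForeignLearner {
    void train(double[][] rows);
    double predict(double[] row);
}

/** Minimal cursor-style data source assumed for this sketch. */
interface VectorSource {
    boolean next();    // advance to the next vector; false when exhausted
    double[] read();   // values of the current vector
}

/** Adapter: drains the source and hands the data to the foreign learner. */
class ForeignAlgorithmAdapter {
    private final ForeignLearner learner;

    ForeignAlgorithmAdapter(ForeignLearner learner) {
        this.learner = learner;
    }

    /** Trains the wrapped learner and returns it as the resulting model. */
    ForeignLearner buildModel(VectorSource source) {
        List<double[]> rows = new ArrayList<>();
        while (source.next()) {
            rows.add(source.read());
        }
        // Convert to the dense layout that the foreign learner expects.
        learner.train(rows.toArray(new double[0][]));
        return learner;
    }
}
```

The adapter isolates the format conversion in one place, so the calling code does not need to know which library actually performs the training.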

 

The list shows that XELOPES truly extends nearly all Data Mining standards. Thus, it is also an interesting instrument to compare these standards and libraries.

 

 

5. Experiences and Future Developments

 

The first version of XELOPES was released in September 2002. Since then, prudsys has been integrating XELOPES into all of its products. This has revealed a number of shortcomings in the library. The most problematic was the transformation concept, which has therefore been extended repeatedly. The integration process will be finished in July.

 

The GPL version of the library has attracted a lot of attention. Since the first XELOPES release there have been over 6000 downloads, and the number is growing continuously. Through many emails and regular surveys, prudsys has obtained a clear picture of the users' impressions. It turned out that XELOPES strongly polarizes its users: while many Data Mining experts have praised the library as an ambitious Data Mining project ("Why does a commercial company like prudsys offer the library under the GPL?"), other Data Mining users have strongly criticised the library as being too complicated ("What does a reasonably priced consultant cost who can explain to me how to work with XELOPES?"). The latter problem will probably also apply to the JDM standard, which has a complex foundation, too. Nevertheless, prudsys is satisfied with the response to XELOPES and has signed the first commercial licence agreements for the library.

 

Based on the first experiences with XELOPES, the following main directions for the future development of the library have been set:

 

·         Provision of a simplified interface for using XELOPES and improvement of the GUI.

·         Integration of new algorithms into XELOPES.

·         Internal rewriting of the data access classes of XELOPES using the CWM resources layer.

·         XML serialization of the XELOPES classes ("CWML"), which will allow the whole Data Mining process, including data access and automation, to be stored in XML format. At present, this is only possible for the models using PMML.

·         Integration of the remaining standards JDM and SQL/MM as well as new versions of standards like PMML 3.0.

 

The list shows that, as in the past, the main direction of future development of XELOPES is to deepen the CWM and PMML integration.

 

For more information and downloads of XELOPES, please visit the web site

http://www.prudsys.com/Produkte/Algorithmen/Xelopes/Produktinfo/.

 

 

About prudsys

 

prudsys AG is a German Data Mining vendor that was founded in 1998. It develops distributed Data Mining applications for large data volumes. prudsys customers include BASF, Bayer, Commerzbank, Deutsche Post, GlaxoSmithKline, Home Shopping Europe, and Westfalia.