PMML Version 3 - Overview and Status

Gregor Meyer
IBM

August 27, 2003

The Predictive Model Markup Language (PMML) is developed by the Data Mining Group, a vendor-led consortium, and consists of the following components:

  1. Data Dictionary: defines the input attributes to models and specifies the type and value range for each attribute.
  2. Mining Schema: Each model contains one mining schema that lists the attributes used in the model. These attributes are a subset of the attributes in the Data Dictionary. The mining schema contains information specific to a certain model, while the data dictionary contains data definitions that do not vary with the model. For example, the Mining Schema specifies the usage type of an attribute, which may be active (an input of the model), predicted (an output of the model), or supplementary (holding descriptive information and ignored by the model).
  3. Transformation Dictionary: defines derived attributes. The derived attributes may be defined by normalization, which maps continuous or discrete values to numbers; by discretization, which maps continuous values to discrete values; by value mapping, which maps discrete values to discrete values; or by aggregation, which summarizes or collects groups of values, for example by computing averages.
  4. Model Statistics: contain univariate statistics about the attributes used in the model.
  5. Models: the parameters that define the statistical and data mining models themselves. Models in PMML Version 2.0 include regression models, clustering models, trees, neural networks, Bayesian models, association rules, and sequence models.
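
To make the four kinds of derived attribute in the Transformation Dictionary item concrete, the following is a minimal Python sketch of each one. It is an illustration only, not PMML syntax and not part of the specification: the function names, the [0, 1] target range for normalization, and the record layout are all assumptions chosen for this example.

```python
from statistics import mean


def normalize(value, low, high):
    """Normalization: linearly map a continuous value from [low, high] onto [0, 1].

    The [0, 1] target range is an assumption made for this sketch.
    """
    return (value - low) / (high - low)


def discretize(value, bins):
    """Discretization: map a continuous value to a discrete label.

    `bins` is a list of (lower_inclusive, upper_exclusive, label) tuples.
    Returns None when no interval contains the value.
    """
    for lower, upper, label in bins:
        if lower <= value < upper:
            return label
    return None


def map_value(value, table, default=None):
    """Value mapping: map one discrete value to another via a lookup table."""
    return table.get(value, default)


def aggregate_average(records, group_key, value_key):
    """Aggregation: collect values per group and summarize each group by its average."""
    groups = {}
    for record in records:
        groups.setdefault(record[group_key], []).append(record[value_key])
    return {group: mean(values) for group, values in groups.items()}
```

For instance, under these assumptions `normalize(30, 20, 40)` maps the midpoint of the range to 0.5, and `discretize(35, [(0, 18, "minor"), (18, 65, "adult")])` yields the label "adult".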

PMML Version 1.0 was concerned mainly with defining standards for common data mining models, on the assumption that the inputs to the models had already been defined. These inputs are called DataFields. PMML Version 2.0 introduced the TransformationDictionary, which contains DerivedFields. The inputs to models in PMML Version 2.0 may be DataFields or DerivedFields. In principle, this approach is powerful enough to capture the process of preparing data for statistical and data mining models.
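
The distinction between DataFields and DerivedFields can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the PMML data model itself: the field names (`age`, `gender`, `age_norm`), the dictionary layout, and the `resolve_input` helper are hypothetical.

```python
# DataFields: raw inputs declared up front, with their types (hypothetical fields).
data_fields = {"age": "continuous", "gender": "categorical"}

# DerivedFields: computed from DataFields; here a single normalization,
# assuming ages fall in the range 0 to 100.
derived_fields = {"age_norm": lambda record: record["age"] / 100}


def resolve_input(name, record):
    """Return the value of a model input, whether it is a DataField or a DerivedField."""
    if name in data_fields:
        return record[name]                    # read directly from the record
    if name in derived_fields:
        return derived_fields[name](record)    # compute from the raw fields
    raise KeyError(f"unknown field: {name}")
```

A model that lists `age_norm` among its inputs then never needs to know how the value was prepared; calling `resolve_input("age_norm", {"age": 50, "gender": "F"})` applies the transformation on demand.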

PMML Version 2.1 was released as a SourceForge package in March 2003 and provided improved support for cleaning, transforming, and aggregating data. These operations can be used both to prepare data and to shape it in real time, which enables PMML Version 2.1 to be used for scoring data in one application with models developed by another application.
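
As a rough illustration of scoring in one application with a model developed in another, the sketch below treats a model as plain data (parameters plus the transformations its inputs need) that a separate scoring function can apply. It uses a Python dictionary as a stand-in for the exchanged document; the layout, the field names, and the choice of a linear regression model are assumptions for this example, not PMML syntax.

```python
# Stand-in for a model exported by a "producer" application (hypothetical layout).
exported_model = {
    "derived_fields": {
        # name: (source field, low, high) for a linear normalization onto [0, 1]
        "age_norm": ("age", 0.0, 100.0),
    },
    "intercept": 1.0,
    "coefficients": {"age_norm": 2.0},
}


def score(model, record):
    """Score one raw record in a "consumer" application using only the exported data."""
    # Step 1: shape the raw record by computing each derived field.
    prepared = dict(record)
    for name, (source, low, high) in model["derived_fields"].items():
        prepared[name] = (record[source] - low) / (high - low)
    # Step 2: apply the regression parameters to the prepared inputs.
    return model["intercept"] + sum(
        weight * prepared[name] for name, weight in model["coefficients"].items()
    )
```

Because the data preparation travels with the model, the consumer only has to supply raw records: `score(exported_model, {"age": 50})` normalizes the age to 0.5 and returns 1.0 + 2.0 * 0.5 = 2.0.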

In this talk, we provide a quick overview of PMML Version 2.1 and describe some of the changes being considered for PMML Version 3.0.

This abstract is based in part on the article: Robert Grossman, Mark Hornick, and Gregor Meyer, Data Mining Standards Initiatives, Communications of the ACM, Volume 45, Number 8, 2002, pages 59-61.