PMML Version 3 - Overview and Status
Gregor Meyer
IBM
August 27, 2003
The Predictive Model Markup Language (PMML) is developed by
the Data Mining Group, a vendor-led consortium, and consists of the following
components:
- Data Dictionary: defines the input attributes to models and specifies the
type and value range for each attribute.
- Mining Schema: Each model contains one mining schema that lists the
attributes used in the model. These attributes are a subset of the attributes
in the Data Dictionary. The mining schema contains information specific to a
certain model, while the data dictionary contains data definitions that do not
vary with the model. For example, the Mining Schema specifies the usage type
of an attribute, which may be active (an input of the model), predicted (an
output of the model), or supplementary (holding descriptive information and
ignored by the model).
- Transformation Dictionary: defines derived attributes. The derived
attributes may be defined by normalization, which maps continuous or
discrete values to numbers; by discretization, which maps continuous
values to discrete values; by value mapping, which maps discrete values
to discrete values; or by aggregation, which summarizes or collects
groups of values, for example by computing averages.
- Model Statistics: contain univariate statistics about the
attributes used in the model.
- Models: contain the parameters that define the statistical and data mining
models themselves. Models in PMML Version 2.0 include regression models, cluster
models, trees, neural networks, Bayesian models, association rules, and sequence
models.
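The four kinds of derived attributes listed under the Transformation Dictionary can be illustrated with a minimal Python sketch. This is not PMML syntax and not taken from the specification; the function names, the linear scaling onto [0, 1], and the bin layout are assumptions chosen only to show what each transformation does.

```python
from statistics import mean

def normalize(value, low, high):
    """Normalization: map a continuous value linearly onto [0, 1].
    The linear form and the [low, high] range are illustrative assumptions."""
    return (value - low) / (high - low)

def discretize(value, bins, default):
    """Discretization: map a continuous value to a discrete label.
    bins is a list of (upper_bound, label) pairs sorted by upper_bound;
    default is returned when the value is at or above every bound."""
    for upper_bound, label in bins:
        if value < upper_bound:
            return label
    return default

def map_value(value, table, default=None):
    """Value mapping: map one discrete value to another discrete value."""
    return table.get(value, default)

def aggregate(rows, group_key, value_key):
    """Aggregation: summarize groups of values, here by averaging."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {key: mean(values) for key, values in groups.items()}

# Hypothetical sample data, for illustration only.
age_bins = [(30, "young"), (60, "middle")]
purchases = [
    {"customer": "a", "amount": 10},
    {"customer": "a", "amount": 20},
    {"customer": "b", "amount": 5},
]

scaled_age = normalize(30, 20, 60)
age_group = discretize(35, age_bins, "senior")
gender = map_value("M", {"M": "male", "F": "female"})
average_amount = aggregate(purchases, "customer", "amount")
```

In an actual PMML document these definitions are expressed declaratively, so that a consuming application can apply them without running code from the producing application.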
PMML Version 1.0 was concerned mainly with defining standards
for various common data mining models, assuming that the inputs to the models had
already been defined. The inputs are called DataFields. PMML Version 2.0
introduced the TransformationDictionary which contains DerivedFields. The inputs
to models in PMML Version 2.0 may be DataFields or DerivedFields. In principle,
this approach is powerful enough to capture the process of preparing data for
statistical and data mining models.
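The idea that a model input may be either a DataField or a DerivedField can be sketched as follows. This is a hypothetical Python illustration, not the PMML schema: the field names, the representation of a DerivedField as a function of the raw record, and the helper names are all assumptions.

```python
# Hypothetical DerivedFields: each one is computed from the raw record.
derived_fields = {
    "income_scaled": lambda record: record["income"] / 100000.0,
}

def resolve_field(name, record, derived):
    """Return the value of a model input, whether it is a DataField
    (present in the raw record) or a DerivedField (computed from it)."""
    if name in record:
        return record[name]
    if name in derived:
        return derived[name](record)
    raise KeyError(f"unknown field: {name}")

def model_inputs(active_fields, record, derived):
    """Collect the values of all active fields for one record."""
    return {name: resolve_field(name, record, derived) for name in active_fields}

# One raw record containing DataFields only (illustrative values).
record = {"age": 40, "income": 50000}
inputs = model_inputs(["age", "income_scaled"], record, derived_fields)
```

Here the model sees `age` directly and `income_scaled` through its derivation, so the data preparation step travels with the model definition.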
PMML Version 2.1 was released as a SourceForge package in March 2003
and provided improved support for cleaning, transforming, and aggregating
data. These operations can be used both to prepare data and to shape it
in real time, enabling PMML Version 2.1 to be used for scoring data in
one application using models developed by another application.
In this talk, we provide a quick overview of PMML Version 2.1 and
describe some of the changes being considered for PMML Version 3.0.
This abstract is based in part on the article: Robert
Grossman, Mark Hornick, and Gregor Meyer, Data Mining Standards
Initiatives, Communications of the ACM, Volume 45, Number 8, 2002, pages
59-61.