Here is a synopsis of my thesis. A short abstract is also available.
The Minimum Description Length Principle
and Reasoning under Uncertainty
Peter Grünwald
ILLC-Dissertation Series nr. DS-1998-03
Most research reported in the thesis concerns the so-called
Minimum Description Length (MDL) Principle. Here we briefly describe
this principle and summarize the main research questions we posed
and the conclusions we reached.
1. The MDL Principle
To be able to forecast future events, science wants to infer general
laws and principles from particular instances. This process of
inductive inference is a central theme in statistics, pattern
recognition and the branch of Artificial Intelligence called `machine
learning'. The Minimum Description Length (MDL) Principle is a
relatively recent method for inductive inference. It has its roots
in information theory and theoretical computer science (Kolmogorov
complexity) rather than statistics.
The fundamental idea behind the MDL Principle is that any regularity
in a given set of data can be used to compress the data,
i.e. to describe it using fewer symbols than needed to describe the
data literally. The more regularities there are in the data, the more
we can compress it. This leads to the view (which is just a version of
Occam's famous razor) that the more we can compress a given set of
data, the more we can say we have learned about the data.
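The link between regularity and compression can be made concrete with a small sketch (an illustration added here, not taken from the thesis): a general-purpose compressor such as zlib stands in for an idealised code. A highly regular sequence compresses to a small fraction of its literal length, while a random sequence of the same length barely compresses at all.

```python
import random
import zlib

random.seed(0)

# 1000 bytes of very regular data: a repeating pattern.
regular = ("0123456789" * 100).encode()
# 1000 bytes of irregular data: uniformly random bytes.
irregular = bytes(random.randrange(256) for _ in range(1000))

len_regular = len(zlib.compress(regular, 9))
len_irregular = len(zlib.compress(irregular, 9))

# The regular sequence compresses drastically; the random one does not.
print(len_regular, len_irregular)
```

In MDL terms: the repeating pattern contains much regularity, so a short description of it exists; the random bytes contain none, so their shortest description is essentially the literal data.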
2. Contents of the Thesis
The thesis consists of three parts. Part I contains an introduction to
MDL intended for a general audience, followed by some theoretical
research on MDL. In Part II, we apply the MDL Principle in a
practical context and we empirically compare it to other well-known
methods of inductive inference. Part III reports on some research we
performed that is not directly related to MDL; this research concerns
non-monotonic logic applied to common-sense reasoning about
action and change (the so-called `frame problem' of Artificial
Intelligence). In an Epilogue to Part III, we show that the
nonmonotonic logic we used there can be interpreted from a
probabilistic/MDL point of view (see below), thereby establishing a
connection to the first two parts. Below we briefly consider the two
central themes of the thesis.
3. Research Goals and Conclusions: `Safe' and `Risky' Statistics
First Central Theme: can we use models that are too simple?
The result of statistical analysis of a given set of data is nearly
always a model for this data that is really a gross simplification of
the process that actually underlies these data. Nevertheless, such
overly simple models are often used with great success to classify
and/or predict (aspects of) future data generated by the same
process. For example, we use linear models for data that are not
really linearly related; we assume that `errors' (the discrepancy
between the actual data and the assumed underlying functional
relationship) are normally distributed whereas closer inspection
reveals they are not; we assume data to be independent when they are
not; and so on. Yet such
simplifying - and wrong - assumptions often lead to acceptable
prediction, interpolation and extrapolation. How is this
possible? And can we identify situations in which this is
possible and situations in which it is not?
These are the central
questions of the first part of the thesis.
It turns out that they are closely related to the question of whether
the MDL Principle can be theoretically justified: when only few data
are available, the MDL Principle will often select a model for these
data that, once more data have become available, turns out to be too
simple. Can one show that this nevertheless leads to acceptable
(or even, in a sense, optimal) results?
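A minimal sketch of such MDL model selection, using a crude two-part code for binary sequences (the particular coding scheme below is an illustrative choice, not the thesis's construction): the `biased coin' model pays roughly (1/2) log2 n extra bits to encode its fitted parameter, so it is selected only when the data are regular enough to repay that cost.

```python
import math

def two_part_codelength(xs, biased):
    """Total two-part codelength (in bits) for a binary sequence."""
    n, k = len(xs), sum(xs)
    if not biased:
        # `fair coin' model: P(1) = 1/2, no parameter to encode,
        # so each symbol costs exactly one bit.
        return float(n)
    # `biased coin' model: first encode the maximum-likelihood
    # parameter p = k/n at the usual precision of (1/2) log2 n bits,
    # then encode the data under that p.
    p = k / n
    if p in (0.0, 1.0):
        data_bits = 0.0
    else:
        data_bits = -(k * math.log2(p) + (n - k) * math.log2(1 - p))
    return 0.5 * math.log2(n) + data_bits

skewed = [1] * 90 + [0] * 10   # strongly regular: 90% ones
balanced = [1, 0] * 50         # indistinguishable from a fair coin

# For the skewed data the biased model compresses enough to repay its
# parameter cost; for the balanced data the simpler fair model wins.
print(two_part_codelength(skewed, biased=True) <
      two_part_codelength(skewed, biased=False))    # True
print(two_part_codelength(balanced, biased=False) <
      two_part_codelength(balanced, biased=True))   # True
```

With very short sequences the parameter cost dominates, so the simpler model is selected even when the source is in fact biased, which is exactly the behaviour whose justification is at issue here.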
Briefly, we reach the following conclusion: overly simple models can
be applied to make predictions and decisions in two different ways: a
`safe' one and a `risky' one. If a model is used in the `safe' way,
then it will be `reliable' in the following sense: the model gives a
correct impression of the prediction error that will be made if the
model is used to predict future data, even in the case where the model
is a gross simplification of the process that truly underlies the
given data. If the model is used in the `risky' way, there is no such
guarantee (nevertheless, such usage of a model often makes sense). We
state and prove several theorems which show that incorrect models can
be `reliable' in the sense indicated above under many
circumstances. The concept of `reliability' is based on a non-standard
interpretation of probabilities. This is the second main theme of the
thesis:
Second Central Theme: the Coding Theoretic Interpretation of Probability
It so happens that the notions of `description method' and
`probability distribution' are very closely connected: every
description method or code can be re-interpreted as a
probability distribution and vice versa. If data can be coded with the
help of model M in only a few bits, then this may be reinterpreted
as saying `the data has a high probability under M'. From the MDL
point of view, a model of the data is really a means of
describing properties of the data, and hence a `model' coincides with
a `description method'. Because of the correspondence referred to
above, a probability distribution can also be seen simply as a means
to describe properties of the data.
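This correspondence is the standard one given by Kraft's inequality: the codelengths L(x) = -log2 P(x) define an idealised code, and conversely P(x) = 2^(-L(x)) defines a probability distribution. A small sketch (added here as an illustration):

```python
import math

# A probability distribution over four symbols...
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# ...reinterpreted as idealised codelengths L(x) = -log2 P(x):
# high probability corresponds to a short description.
L = {x: -math.log2(p) for x, p in P.items()}
print(L)  # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}

# Kraft's inequality holds with equality for a complete code:
# the lengths are realisable as an actual prefix code.
print(sum(2 ** -l for l in L.values()))  # 1.0
```

Reading the table in either direction gives the two views in the text: `the data has high probability under M' and `the data has a short description with the help of M' are the same statement.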
This view of probabilities connects MDL to more traditional methods of
statistical inference. MDL is closely related to (yet different from)
Bayesian statistics and the controversial yet successful `Principle of Maximum
Entropy'. A crucial difference between MDL and these related methods
is the different way in which probabilities are interpreted. MDL's
coding theoretic interpretation of probability sheds new light on the
debate between those who hold a subjectivist view of probability
(probability as a degree of belief) and those who hold the objectivist
or `frequentist' view. Here, we (roughly) reach the following
conclusion: probabilities can - and should - indeed be used in many
situations where they are not related to frequencies; the `maximum
entropy principle' can indeed be used to assign probabilities in the
presence of ignorance. However, the way such probabilities should
be used to arrive at predictions and decisions is different from the
way probabilistic knowledge is usually applied!
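How maximum entropy assigns probabilities under ignorance can be sketched with the classic `Brandeis dice' illustration (the code is an added example, not from the thesis): among all distributions on a die's six faces with a prescribed mean, the maximum-entropy one has the exponential (Gibbs) form p_i proportional to exp(lam * i), and the multiplier lam can be found by a one-dimensional bisection search.

```python
import math

def maxent_dice(target_mean, lo=-10.0, hi=10.0, tol=1e-12):
    """Maximum-entropy distribution on faces {1,...,6} with given mean.

    The solution has the Gibbs form p_i ~ exp(lam * i); since the mean
    is strictly increasing in lam, bisection finds the right lam.
    """
    def mean(lam):
        w = [math.exp(lam * i) for i in range(1, 7)]
        z = sum(w)
        return sum(i * wi for i, wi in zip(range(1, 7), w)) / z

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * i) for i in range(1, 7)]
    z = sum(w)
    return [wi / z for wi in w]

# With mean 3.5 (complete ignorance) no tilt is needed: uniform.
print([round(pi, 3) for pi in maxent_dice(3.5)])
# With mean 4.5 the distribution tilts towards the high faces.
print([round(pi, 3) for pi in maxent_dice(4.5)])
```

Note that knowing only the mean, maximum entropy commits to a full distribution; the thesis's point is that how such assigned probabilities may then be used for prediction and decision differs from the usual use of probabilistic knowledge.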