SCIENCE AT THE SHINE DOME 2004: ANNUAL SYMPOSIUM
A celebration of Australian science
7 May 2004
The computational and statistical science of intelligent
systems
by Professor Peter Bartlett
My research is at the intersection of computer science and statistics.
The broad aim of what I do is understanding how we can develop systems
to solve complex information processing problems, and in particular the
sorts of things that people can do much more effectively than computers:
things like recognising a face or understanding a sentence or learning
to fly a helicopter.
In particular, I am interested in how we can get learning systems to
solve these problems that can improve their performance through experience,
and that's where the statistics comes in.
We are here celebrating a 50th anniversary. It was also almost 50 years
ago that a group of scientists and mathematicians, including John McCarthy
and Claude Shannon, submitted a proposal to the Rockefeller Foundation
for funding for a meeting at Dartmouth. They said, 'The study is to proceed
on the basis of the conjecture that every aspect of learning or any other
feature of intelligence can in principle be so precisely described that
a machine can be made to simulate it.' This was the famous Dartmouth meeting;
it was the first concerted effort at artificial intelligence (AI) and
understanding how we can develop computer systems that can solve the sorts
of complex problems that people can solve easily.
I think a lot has happened over that 50 years. In the last perhaps 10
years, especially, AI has become much more quantitative, and learning
and adaptation have taken on a much more central role. The reason for
this is that these problems typically involve incomplete information and
uncertain and changing environments, and those characteristics mean that
the right way to formalise the problems is in the language of probability
theory, and they beg for statistical approaches. So the key scientific
issues here can be split into the statistical and the computational ones.
In this talk I want to start by giving a few examples of the sorts of
complex decision problems that we would like to get computers to solve,
and look at some of these key scientific issues. I want to focus on one
example, and that is in the area of pattern classification. There is a
class of techniques called large margin classifiers that are a computational
simplification of the pattern classification problem and illustrate, I
think, some of these scientific issues rather nicely.
So let me start by giving some examples of these complex decision problems.
The first is from the arena of computer vision. Here we want to be able
to recognise handwriting. Imagine that we had an image of somebody's handwriting.
We would like to provide that as input to a system and have as output
a sequence of characters, this sentence that was written.
This is certainly a complex decision problem. There is a lot of uncertainty.
So there is no way we could come up with a complete model for everybody's
handwriting. And it is something that changes over time, there are all
sorts of unmeasured influences: we have all had the experience of trying
to sign our credit card slip with parcels in our hands. So there are certainly
many aspects to this problem that we are never going to be able to model
accurately.
The second is from robotics. Suppose we wanted to have a system that
could take measurements of the state of something like a helicopter. We
would like this system to take measurements of various velocities, positions
and so on, and use those to make decisions about the actions to take,
so as to take the helicopter through some kind of manoeuvre.
Once again it is an uncertain and changing environment. There are all
sorts of influences that are difficult to model. The predominant one in
this case is the wind and air movements.
Another example is from bioinformatics. Suppose we have a protein and
we would like to predict the function that that protein performs in the
cell. We have various kinds of information about it: we might know the
sequence of amino acids, we might know how that sequence aligns with various
other proteins or types of proteins, we might know how it interacts with
other proteins, we might know how the gene that codes for this protein
gets expressed under various different experimental conditions.
From this information we would like to be able to predict what kind of
function the protein performs in the cell. Is it involved in metabolism,
or the energy cycle, and so on? And, of course, as is inherently a problem
with incomplete information, we are wanting to use this sort of data that
is easy to gather in an automated way so as to make predictions about
the function that the protein performs that we can then go and verify
experimentally later.
The final example is from natural language processing. Wouldn't it be
great if we could communicate with computers in plain language? So when
we wanted to find out some fact, instead of going to Google and trying
to work out the right sequence of search terms to use, and then sifting
through the information, the computer could understand what our intention
was just from a sentence that would be a great advance.
A key component of that is the parsing problem. So here the input to
our system would be a sentence, 'The pedestrian crossed the road,' and
the output would be something like a parse tree that tells us what are
the different components of the sentence, who does what to whom. In this
case the sentence consists of a noun phrase, 'the pedestrian', and a verb
phrase, which is what the pedestrian does.

Natural language processing: Parsing
(Click on image for a larger version)
This is actually a very complex decision problem. This is a particularly
simple sentence. If you look at the average sentence from a newspaper,
there is an enormous number of candidate parse trees that are all syntactically
legitimate but you would typically only accept one of them as the correct
parse for the sentence.
All of these problems are complex decision problems. They involve incomplete
information we don't have a model for everybody's handwriting
and the environment is often changing: the wind, for instance, in the
helicopter control problem. And these are characteristics that tell us
that the right way to model these problems is in terms of probability
theory and that adaptive and learning systems, or statistical methods,
are an effective approach to solving these problems.
In all of these cases we can gather data to give us information about
the kind of function that our system should perform. In the case of character
recognition, we could get a bunch of sentences and the images of the corresponding
handwriting, and use that to try to infer the relationship between these
images and the sentences that are conveyed in them.
In helicopter control we could gather together measurements of the state
of the helicopter, and the actions that were taken, and use that information
in order to determine the appropriate control actions to take in different
situations.
In the protein classification problem we could go and look at various
databases for yeast protein, say. For many of those proteins we might
have information about the function that the protein performs, and we
would use that to try to determine the relationship between these other
sorts of data the sequence and gene expression data, for instance
and the function, so that for subsequent proteins whose functions
we don't know we can make that prediction.
And in the natural language processing example, there is for instance
a corpus of sentences from the Wall Street Journal that somebody
has gone and produced parse trees for. We could use that the pairs of
sentences, together with the correct parse tree to try to infer the
relationship between those two so that subsequently we can take a sentence
and make a prediction about the correct parse tree.
The key scientific issues can be split up into the statistical and the
computational ones. On the statistical side we can ask the question, 'Do
we have an appropriate model class? Is our system powerful enough to represent
a good approximation to the relationship between, for instance, images
of handwritten sentences and the sentence? If it is, how can we estimate
that from the data, and what are the things that affect the performance
of the system?' And, of course, computational complexity is always a key
constraint.
The example that I want to focus on for the rest of the talk is a good
one that illustrates the trade-off between these two issues. It is from
the area of pattern classification; it is a class of methods called large
margin classifiers.

Large margin classifiers
(Click on image for a larger version)
Just for the purposes of illustration let's consider a very simple pattern
classification problem. So rather than thinking of classifying images
of handwriting and producing sentences, let's think about looking at single
characters, and let's think about just a binary decision: we want to look
at an image of a character and make a decision, 'Is this an A or not?'
To further simplify things, rather than looking at the full image of
the character suppose we extracted two simple features. This one [Feature
1, toward the right-hand end of the x-axis] might be some measure of the
presence of a loop in the character, and this one [Feature 2, nearer the
y-axis] might be some measure of the presence of a horizontal at the top
of the character. One of these dots corresponds to an image where this
value [for Feature 1] is the loopiness feature and this one [for Feature
2] is the presence of the horizontal bar feature.
From that pattern, the image of the character, we want to predict the
binary label: is it an A or not? We might have a bunch of labelled data,
in the form of these XY pairs images together with whether it corresponds
to an A or not and for instance the red dots [for Feature 1] might correspond
to the As and the blue dots [for Feature 2] to the other characters. So
we want to use that data to infer a good decision rule, a mapping that
takes us from the pattern to the label.
How might we do that? We might just build a model of the probability
distribution. We might take the reds, the images of As, and build some
sort of probability model maybe it's a Gaussian or a mixture of
Gaussians or something like that and then we would use those models
to try to infer the optimal decision, and perhaps it corresponds to this
decision boundary.

Pattern classification (1)
(Click on image for a larger version)
There is another approach to this problem, and that is to consider instead
a class of decision rules for instance, straight lines in the simple
two-dimensional example shown here and directly optimise some performance
criterion of interest.

Pattern classification (2)
(Click on image for a larger version)
So we might instead consider this decision rule, and this one and this
one, and choose the one that, for instance, minimises the number of errors
that we make on the training data if we are interested in minimising the
misclassification probability.
It turns out that under reasonable assumptions you can show that you
get better statistical performance out of this approach than the plug-in
estimate of estimating probability distributions, and the intuition there
is that you are not using your information to model nuances of the probability
distribution that are far from the decision boundary, that are not important
in that sense. So that is encouraging.
Unfortunately, the optimisation problem that emerges in all but the most
trivial cases ends up being formally a hard problem to solve. There are
results that suggest that the number of features does not have to get
very large before finding the optimal solution is something that you can
expect would take the lifetime of the universe. So that is an issue.
The approach that is taken with large margin classifiers is to take that
discrete, combinatorial optimisation problem, where we are counting up
the number of mistakes that are made versus things that are correctly
classified to the correct side of the decision boundary, and to produce
some sort of convex relaxation of that problem. So we are changing this
discrete and hence combinatorial problem into one involving a convex criterion,
and since it is convex then we have a unique solution; we can come up
with efficient algorithms to find that solution. So that is a good thing.

Large margin classifiers
(Click on image for a larger version)
But of course we have changed the problem, and so the issue arises, 'Well,
what is the statistical cost of this computational convenience of changing
a problem into a convex optimisation problem?' And you can ask questions
like, 'Do you have asymptotic optimality?' As the sample size gets large,
or as the amount of data gets large, are we going to be making the optimal
decision, and for what kinds of relaxations of our original computationally
hard problem is that the case? And you can ask more refined questions:
How does the performance depend on the sample size, and on various other
properties of the learning system?
These are the sorts of questions that I have worked on. I think for simple
cases like the two-class classification problem we have a fairly complete
understanding of these sorts of things. I want to give you a couple of
illustrations of the success of these large margin classification techniques
in practice.

Example: Character recognition
(Click on image for a larger version)
The first of them is in a simple character recognition problem. These
are images taken from envelopes of digits of handwritten zip-codes. Somebody
has labelled them, and so we have a set of some 7000 of these digits and
we can use those to infer the relationship between images and the correct
label, and for various plug-in estimates where we model the probability
distribution, we get some level of performance. So this is a performance
figure here smaller is better (I won't go into the details)
and there are various non-parametric methods. This can be viewed as a
differentiable relaxation of the hard combinatorial optimisation problem
and this one is the convex relaxation, and we can get performance that
is something close to the human level of recognition here, where of course
all of these methods get access to 7000 characters. That is all they know
about this classification problem. People have perhaps significantly more
information than that.
One thing I didn't tell you about that last technique was that it involves
a particular approach to classification, an approach involving similarity
measures called kernels. So let's imagine this is my one equation
for the talk, f(x) = Σ,α,k(xi,x) that we have
a pattern x here, an image, and we are going to make our decision of some
score function that we assign to x, and that function is a linear combination
of measures of similarity, so this k(xi,x), is how similar
x is to xi, some pattern that we have seen in the training
set. So you can think of them that formally they are inner products in
some space; you can think of them just as a measure of similarity.
It turns out when you do things in this way that the parameterisation
is linear, that we still have a nice convex optimisation problem to solve
and we can solve it efficiently, but we can have a much richer class of
classifiers. The other key point is that it is only inner products or
these similarities between patterns that are relevant.

Example: Protein function prediction
(Click on image for a larger version)
In the protein function prediction problem, for instance, we can come
up with various notions of similarity. We can look at the protein-protein
interaction data, and that gives us some measure of the similarity between
different proteins. We can look at gene expression data, which gives us
another notion of similarity, and sequence data and so on. We can ask
the question, 'Can we combine these different sources of information in
order to come up with a more effective decision?' and if you pose that
in the right way it turns out there is a convex optimisation formulation
of that. It reduces to something called a semi-definite program, which
has become of more interest in optimisation theory lately.

Predicting membrane proteins
(Click on image for a larger version)
In this problem it turns out I won't go into any details
that you can get significant improvements in performance, by combining
these different sources of information in this way, over any one type
of data that you have about the proteins in a particular protein prediction
application.
To summarise: I think this class of methods, large margin classifiers,
gives us a good illustration of the trade-offs between the statistical
and the computational issues in the design of learning systems. I think
in future the key direction here is to look at more complex decision problems.
For instance, the natural language processing problem that I mentioned
earlier is one that a lot of us are engaged in.
More generally, I think that learning and adaptive systems are going
to see application across all areas of computer science such as computer
vision, robotics, natural language processing and so on. I think this
is a very exciting thing. We will see a shift from program systems to
adaptive and autonomous systems.
Questions/discussion
Question: I would like to ask whether the recent
methods you have been describing could elicit structure. Let me concretise
that question in the following way. Suppose you are doing text recognition,
and suppose it is Swedish or Norwegian or Danish and you don't know. You
build a classifier that can take all that sort of stuff. Will the structure
of that classifier have something about it which clearly indicates that
first you sort out the language and then you sort out the content of the
sentence?
Certainly you can incorporate prior information that you have about distinct
languages in a sense, exploit structure that you are already aware of in these sorts of techniques. The question, I guess, is whether that
is something that would emerge in some sort of automatic way.
It may well be the case, but the sorts of techniques that I have been
talking about are non-parametric techniques. It is typically very hard
to extract information like that after the fact. I think the reason for
that is that what you are shooting for with these sorts of techniques
is good predictive accuracy, and that is not especially compatible with
being able to understand the basis for the decision that has been made.
For instance, in these examples that you described, it may well be that
there are characteristics that distinguish these types of languages. It
is not clear on what basis they would be distinguished, using a technique
such as this.
Question: These things that you are describing must
be very important for national security. What puzzles me in that respect and you may know about these sorts of things is that I have been told
that fingerprint recognition, for example, is not very good, even for
police. Can your processes bring extra information into something like
that? Can they bring extra information which could make that not 95 per
cent reliable but 100 per cent or 99.9 per cent or something of that kind?
For fingerprint recognition, I guess that is possible. I am not really
aware of work that has gone on in that area and what the limitations are.
Question: In your field it is probably well known,
but with optical character recognition of, say, a faxed document, does
the software try to discriminate and cut out each character, or does it
look at things like word shapes? My understanding is that the dropping
characters and the high-rising characters give a lot of words a little
shape profile from which it is very easy to say, 'That word is "the" and
that word is "who"', for instance, rather than actually saying 't-h-e'.
In fact, those things work at several levels. There are systems that
are looking for individual characters and then these are combined with
probability models for text. So, for instance, the word 'the' is known
to be a very common word. If you see something that has some plausible
likelihood of being a 'the', then this is the conclusion that you would
make. So, in fact, down at a low level there is some sort of recognition
engine, and then at a higher level you would try to construct words, for
instance, out of the sequences of conjectures you can think of
it that way.
Question: The benchmark should be, I think, how
well human beings do these tasks. What is the situation? In general, do
these things do things better than human beings, or about the same, or
are they not yet nearly as good? That is a very broad question.
I gave an example of a digit recognition problem, and there it is close
to human performance not as good. It is a very constrained experimental
set-up there. The adaptive systems have access to 7000 digits in that
case; it is not clear that that contains as much information as a human
would get about the relationship between images of digits and their correct
classification. But the performance is close in that case.
In other cases the performance is superior. I did not talk about it in
any detail here, but in the helicopter control problem there are examples
where a human pilot has enormous difficulty in, for instance, going through
a particular manoeuvre in the presence of disturbances such as wind as
accurately as one of these systems can do.
So it really varies between different tasks. We don't have a complete
understanding of what are the characteristics that allow us to get a better
performance in one case rather than another.
Question: In physical measurement, one of the first
elementary lessons one learns is to just give a result; it is not a result
unless you also give an uncertainty for that result. I am just wondering
whether, in this character recognition, you address the uncertainty in
recognition, and can you somehow express that and quantify it in any way?
In character recognition, for example, I imagine you haven't reached 100
per cent accuracy, it is not always just right. In some cases you would
be much more confident than in other cases, and so on. How do you deal
with that general concept of uncertainty?
That's interesting. There are advantages in coming up with estimates
of probabilities in that way. Particularly if in the fax example
you are wanting to use those estimates for subsequent decisions,
then it is great to have the detailed information about the probability
distribution.
If you pose the problem as, 'I want to make the most accurate decision,'
in the sense of minimising the misclassification probability, it turns
out that you can do better by throwing away that information not
throwing it away, but by not trying to estimate probabilities. The intuition
for that is that, if you try to estimate probabilities, then you can be
putting your effort into using information to try to estimate nuances
of the probabilities that are far from the decision boundary. There are
some formal results about the benefits of not trying to estimate probabilities
in that sense.
This is if the problem that you want to solve is precisely stated as
this decision problem.
Question: One case where I understand there has
been some success in pattern recognition is in the recognition of melanomas
in the skin, in recognising the difference between ones that are benign
and ones that are likely to be dangerous. I just wondered whether there
was something about the storage of the expert information, you might say.
As I understand it, what they did was to take large numbers of these things
and get experts to decide whether or not they thought it was going to
become malignant, and then they followed it up later to find out and compared
the performance with that of the computer. And the computer was marginally
better than most of their experts.
And whether the prognostic type of data that they were dealing with for
the computer was comparable, yes. They were using statistical techniques;
I don't know the details of those techniques in that example.
Question: Intelligent agents of exactly the type
that Peter has been discussing are used now by BHP Billiton, or being
developed, for blast furnace performance, the company uses them an awful
lot in the oil industry for mapping oil-bearing structures, and it is
even automating a helicopter to take the exploration instrumentation around.
So I wondered if you had any applications of that type that you would
like to comment on.
The kinds of applications that you have mentioned I have not had involvement
in personally. I have done some work with graduate students in Berkeley,
looking at spatio-temporal data not geological predictions but predicting,
for instance, rainfall from limited information. The essential statistical
and computational problems I think are very much the same.
One of the exciting things, I think, about this area is that there is
an incredibly broad array of problems to which these techniques are applicable.
|