SCIENCE AT THE SHINE DOME, Canberra, 5–7 May 2004

Symposium: A celebration of Australian science

Friday, 7 May 2004

Professor Peter Bartlett
Division of Computer Science and Department of Statistics, University of California at Berkeley, USA

Peter Bartlett is a Professor in the Division of Computer Science and the Department of Statistics at the University of California at Berkeley, USA. He is the co-author, with Martin Anthony, of the book Neural Network Learning: Theoretical Foundations. He has served as an associate editor of the journals Machine Learning, Mathematics of Control, Signals and Systems, Journal of Machine Learning Research, and Journal of Artificial Intelligence Research. He was a Miller Institute Visiting Research Professor in Computer Science and Statistics at UC Berkeley in 2001. He received his PhD and undergraduate degrees from the University of Queensland, where he is an Honorary Professor, and he has held positions in the Research School of Information Sciences and Engineering at the Australian National University's Institute of Advanced Studies and at a bioinformatics start-up in Berkeley. He received the 2001 Malcolm McIntosh Prize for Physical Scientist of the Year. His research focus is the design and analysis of intelligent systems, and in particular machine learning and statistical learning theory.

The computational and statistical science of intelligent systems

My research is at the intersection of computer science and statistics. The broad aim of what I do is understanding how we can develop systems to solve complex information processing problems, and in particular the sorts of things that people can do much more effectively than computers: things like recognising a face or understanding a sentence or learning to fly a helicopter.

In particular, I am interested in how we can build learning systems to solve these problems – systems that can improve their performance through experience – and that is where the statistics comes in.

We are here celebrating a 50th anniversary. It was also almost 50 years ago that a group of scientists and mathematicians, including John McCarthy and Claude Shannon, submitted a proposal to the Rockefeller Foundation for funding for a meeting at Dartmouth. They said, 'The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.' This was the famous Dartmouth meeting; it was the first concerted effort at artificial intelligence (AI) and understanding how we can develop computer systems that can solve the sorts of complex problems that people can solve easily.

I think a lot has happened over that 50 years. In the last perhaps 10 years, especially, AI has become much more quantitative, and learning and adaptation have taken on a much more central role. The reason for this is that these problems typically involve incomplete information and uncertain and changing environments, and those characteristics mean that the right way to formalise the problems is in the language of probability theory, and they beg for statistical approaches. So the key scientific issues here can be split into the statistical and the computational ones.

In this talk I want to start by giving a few examples of the sorts of complex decision problems that we would like to get computers to solve, and look at some of these key scientific issues. I want to focus on one example, and that is in the area of pattern classification. There is a class of techniques called large margin classifiers that are a computational simplification of the pattern classification problem and illustrate, I think, some of these scientific issues rather nicely.

So let me start by giving some examples of these complex decision problems.

The first is from the arena of computer vision. Here we want to be able to recognise handwriting. Imagine that we had an image of somebody's handwriting. We would like to provide that as input to a system and have as output a sequence of characters: the sentence that was written.

This is certainly a complex decision problem. There is a lot of uncertainty, so there is no way we could come up with a complete model for everybody's handwriting. And it is something that changes over time; there are all sorts of unmeasured influences: we have all had the experience of trying to sign a credit card slip with parcels in our hands. So there are certainly many aspects of this problem that we are never going to be able to model accurately.

The second is from robotics. Suppose we wanted to have a system that could take measurements of the state of something like a helicopter. We would like this system to take measurements of various velocities, positions and so on, and use those to make decisions about the actions to take, so as to take the helicopter through some kind of manoeuvre.

Once again it is an uncertain and changing environment. There are all sorts of influences that are difficult to model; the predominant ones in this case are the wind and air movements.

Another example is from bioinformatics. Suppose we have a protein and we would like to predict the function that that protein performs in the cell. We have various kinds of information about it: we might know the sequence of amino acids, we might know how that sequence aligns with various other proteins or types of proteins, we might know how it interacts with other proteins, we might know how the gene that codes for this protein gets expressed under various different experimental conditions.

From this information we would like to be able to predict what kind of function the protein performs in the cell. Is it involved in metabolism, or the energy cycle, and so on? And, of course, this is inherently a problem with incomplete information: we want to use this sort of data, which is easy to gather in an automated way, to make predictions about the function of the protein that we can then go and verify experimentally later.

The final example is from natural language processing. Wouldn't it be great if we could communicate with computers in plain language? So when we wanted to find out some fact, instead of going to Google and trying to work out the right sequence of search terms to use, and then sifting through the information, the computer could understand what our intention was just from a sentence – that would be a great advance.

A key component of that is the parsing problem. So here the input to our system would be a sentence, 'The pedestrian crossed the road,' and the output would be something like a parse tree that tells us what the different components of the sentence are: who does what to whom. In this case the sentence consists of a noun phrase, 'the pedestrian', and a verb phrase, which is what the pedestrian does.
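
To make the idea concrete, here is a minimal sketch, in Python, of how such a parse tree might be represented as a nested structure. The node labels (S, NP, VP and so on) are standard phrase-structure categories; the representation itself is an illustrative assumption, not the format of any particular parser.

```python
# A parse tree for "The pedestrian crossed the road", represented as nested
# (label, children...) tuples. Purely illustrative.
parse = (
    "S",                                              # sentence
    ("NP", ("DT", "The"), ("NN", "pedestrian")),      # noun phrase: who
    ("VP",                                            # verb phrase: does what
        ("VBD", "crossed"),
        ("NP", ("DT", "the"), ("NN", "road"))),
)

def leaves(tree):
    """Recover the words of the sentence from the tree."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

print(" ".join(leaves(parse)))   # -> The pedestrian crossed the road
```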

Figure 1
Natural language processing: Parsing


This is actually a very complex decision problem. This is a particularly simple sentence. If you look at the average sentence from a newspaper, there is an enormous number of candidate parse trees that are all syntactically legitimate but you would typically only accept one of them as the correct parse for the sentence.

All of these problems are complex decision problems. They involve incomplete information – we don't have a model for everybody's handwriting – and the environment is often changing: the wind, for instance, in the helicopter control problem. And these are characteristics that tell us that the right way to model these problems is in terms of probability theory and that adaptive and learning systems, or statistical methods, are an effective approach to solving these problems.

In all of these cases we can gather data to give us information about the kind of function that our system should perform. In the case of character recognition, we could get a bunch of sentences and the images of the corresponding handwriting, and use that to try to infer the relationship between these images and the sentences that are conveyed in them.

In helicopter control we could gather together measurements of the state of the helicopter, and the actions that were taken, and use that information in order to determine the appropriate control actions to take in different situations.

In the protein classification problem we could go and look at various databases for yeast protein, say. For many of those proteins we might have information about the function that the protein performs, and we would use that to try to determine the relationship between these other sorts of data – the sequence and gene expression data, for instance – and the function, so that for subsequent proteins whose functions we don't know we can make that prediction.

And in the natural language processing example, there is for instance a corpus of sentences from the Wall Street Journal that somebody has gone and produced parse trees for. We could use that – the pairs of sentences, together with the correct parse tree – to try to infer the relationship between those two so that subsequently we can take a sentence and make a prediction about the correct parse tree.

The key scientific issues can be split up into the statistical and the computational ones. On the statistical side we can ask the question, 'Do we have an appropriate model class? Is our system powerful enough to represent a good approximation to the relationship between, for instance, images of handwritten sentences and the sentences they convey? If it is, how can we estimate that from the data, and what are the things that affect the performance of the system?' And, of course, computational complexity is always a key constraint.

The example that I want to focus on for the rest of the talk is a good one that illustrates the trade-off between these two issues. It is from the area of pattern classification; it is a class of methods called large margin classifiers.

Figure 2
Large margin classifiers


Just for the purposes of illustration let's consider a very simple pattern classification problem. So rather than thinking of classifying images of handwriting and producing sentences, let's think about looking at single characters, and let's think about just a binary decision: we want to look at an image of a character and make a decision, 'Is this an A or not?'

To further simplify things, rather than looking at the full image of the character suppose we extracted two simple features. This one [Feature 1, toward the right-hand end of the x-axis] might be some measure of the presence of a loop in the character, and this one [Feature 2, nearer the y-axis] might be some measure of the presence of a horizontal at the top of the character. One of these dots corresponds to an image where this value [for Feature 1] is the loopiness feature and this one [for Feature 2] is the presence of the horizontal bar feature.

From that pattern, the image of the character, we want to predict the binary label: is it an A or not? We might have a bunch of labelled data, in the form of these XY pairs – images together with whether each corresponds to an A or not – and, for instance, the red dots might correspond to the As and the blue dots to the other characters. So we want to use that data to infer a good decision rule, a mapping that takes us from the pattern to the label.

How might we do that? We might just build a model of the probability distribution. We might take the reds, the images of As, and build some sort of probability model – maybe it's a Gaussian or a mixture of Gaussians or something like that – and then we would use those models to try to infer the optimal decision, and perhaps it corresponds to this decision boundary.
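
As a concrete illustration of this plug-in approach, here is a minimal sketch that fits one Gaussian per class to two made-up features and classifies a new pattern by whichever class model assigns it the higher density. The synthetic data, the choice of a single Gaussian per class, and the implicit assumption of equal class priors are all assumptions made only for the example.

```python
import numpy as np

# Plug-in approach sketch: model each class ('A' / not 'A') with a Gaussian
# over the two features, then classify a new pattern by the higher density.
rng = np.random.default_rng(0)
X_a     = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))   # images of 'A'
X_not_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))   # other characters

def fit_gaussian(X):
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_density(x, mean, cov):
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

params_a, params_not_a = fit_gaussian(X_a), fit_gaussian(X_not_a)

def is_a(x):
    # Decide 'A' if the 'A' model assigns the higher log density
    # (equal class priors assumed).
    return log_density(x, *params_a) > log_density(x, *params_not_a)

print(is_a(np.array([1.8, 2.1])))   # likely True: near the 'A' cluster
```

The implied decision boundary of such a rule is the set of points where the two fitted densities are equal.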

Figure 3
Pattern classification (1)


There is another approach to this problem, and that is to consider instead a class of decision rules – for instance, straight lines in the simple two-dimensional example shown here – and directly optimise some performance criterion of interest.

Figure 4
Pattern classification (2)


So we might instead consider this decision rule, and this one and this one, and choose the one that, for instance, minimises the number of errors that we make on the training data if we are interested in minimising the misclassification probability.

It turns out that under reasonable assumptions you can show that you get better statistical performance out of this approach than out of the plug-in approach of estimating probability distributions, and the intuition there is that you are not using your information to model nuances of the probability distribution far from the decision boundary, which are not important in that sense. So that is encouraging.

Unfortunately, the optimisation problem that emerges in all but the most trivial cases ends up being formally a hard problem to solve. There are results that suggest that the number of features does not have to get very large before finding the optimal solution is something that you can expect would take the lifetime of the universe. So that is an issue.

The approach that is taken with large margin classifiers is to take that discrete, combinatorial optimisation problem, where we are counting up the number of patterns that fall on the wrong side of the decision boundary versus those that are correctly classified, and to produce some sort of convex relaxation of that problem. So we are changing this discrete and hence combinatorial problem into one involving a convex criterion, and since it is convex we have a unique solution and we can come up with efficient algorithms to find that solution. So that is a good thing.
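
As a concrete illustration of that relaxation, the hinge loss used in support vector machines is one well-known convex surrogate for the discrete misclassification count. The sketch below minimises it by plain subgradient descent for a linear classifier; the synthetic data, the step size and the number of iterations are assumptions made only for the example.

```python
import numpy as np

# Convex relaxation sketch for a linear classifier f(x) = w.x + b.
# The 0-1 loss counts mistakes (discrete, hard to optimise directly);
# the hinge loss max(0, 1 - y f(x)) is a convex upper bound on it,
# so it can be minimised efficiently, here by subgradient descent.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.5, 1.0, (50, 2)), rng.normal(-1.5, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])        # labels: +1 ('A'), -1 (not 'A')

def zero_one_loss(w, b):
    return np.mean(np.sign(X @ w + b) != y)       # fraction of training mistakes

def hinge_loss(w, b):
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w + b)))   # convex surrogate

w, b = np.zeros(2), 0.0
for _ in range(500):                              # subgradient descent on the hinge loss
    active = y * (X @ w + b) < 1.0                # points violating the margin
    grad_w = -(y[active, None] * X[active]).sum(axis=0) / len(y)
    grad_b = -y[active].sum() / len(y)
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

print(f"training error {zero_one_loss(w, b):.2f}, hinge loss {hinge_loss(w, b):.2f}")
```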

Figure 5
Large margin classifiers


But of course we have changed the problem, and so the issue arises, 'Well, what is the statistical cost of this computational convenience of changing a problem into a convex optimisation problem?' And you can ask questions like, 'Do you have asymptotic optimality?' As the sample size gets large, or as the amount of data gets large, are we going to be making the optimal decision, and for what kinds of relaxations of our original computationally hard problem is that the case? And you can ask more refined questions: How does the performance depend on the sample size, and on various other properties of the learning system?

These are the sorts of questions that I have worked on. I think for simple cases like the two-class classification problem we have a fairly complete understanding of these sorts of things. I want to give you a couple of illustrations of the success of these large margin classification techniques in practice.

Figure 6
Example: Character recognition


The first of them is in a simple character recognition problem. These are images of handwritten zip-code digits taken from envelopes. Somebody has labelled them, and so we have a set of some 7000 of these digits and we can use those to infer the relationship between images and the correct label. For various plug-in estimates, where we model the probability distribution, we get some level of performance. So this is a performance figure here – smaller is better (I won't go into the details) – and there are various non-parametric methods. This one can be viewed as a differentiable relaxation of the hard combinatorial optimisation problem and this one is the convex relaxation, and we can get performance that is something close to the human level of recognition here, where of course all of these methods get access to only 7000 characters. That is all they know about this classification problem. People have perhaps significantly more information than that.

One thing I didn't tell you about that last technique is that it involves a particular approach to classification, an approach involving similarity measures called kernels. So let's imagine – this is my one equation for the talk, f(x) = Σ_i α_i k(x_i, x) – that we have a pattern x here, an image, and we are going to base our decision on some score function that we assign to x. That function is a linear combination of measures of similarity, so this k(x_i, x) is how similar x is to x_i, some pattern that we have seen in the training set. Formally these are inner products in some space, but you can think of them just as measures of similarity.

It turns out that when you do things in this way the parameterisation is linear, so we still have a nice convex optimisation problem that we can solve efficiently, but we have a much richer class of classifiers. The other key point is that it is only the inner products, these similarities between patterns, that are relevant.
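
To make this concrete, here is a minimal sketch of a decision function of exactly this form, f(x) = Σ_i α_i k(x_i, x), using a Gaussian (RBF) kernel as an illustrative choice of similarity. Fitting the coefficients by regularised least squares is a simple stand-in for, not a faithful implementation of, the large margin training discussed in the talk, and the data is synthetic.

```python
import numpy as np

# Kernel classifier sketch: f(x) = sum_i alpha_i k(x_i, x), where k is a
# similarity measure. Here k is a Gaussian (RBF) kernel; the alphas are fit
# by regularised least squares on the +/-1 labels. Data is synthetic.
rng = np.random.default_rng(2)
X_train = np.vstack([rng.normal(1.0, 0.7, (40, 2)), rng.normal(-1.0, 0.7, (40, 2))])
y_train = np.hstack([np.ones(40), -np.ones(40)])

def rbf_kernel(A, B, gamma=0.5):
    # k(a, b) = exp(-gamma * ||a - b||^2): similarity decays with distance.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K + 1e-2 * np.eye(len(K)), y_train)   # ridge-style fit

def f(x):
    # Score of a new pattern: a weighted sum of its similarities to the
    # training patterns; classify as 'A' when the score is positive.
    return rbf_kernel(x[None, :], X_train)[0] @ alpha

print(f(np.array([1.2, 0.8])) > 0)   # likely True: near the positive cluster
```

Note that the data enters only through the kernel evaluations k(x_i, x), which is the point just made: only similarities between patterns are needed.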

Figure 7
Example: Protein function prediction


In the protein function prediction problem, for instance, we can come up with various notions of similarity. We can look at the protein-protein interaction data, and that gives us some measure of the similarity between different proteins. We can look at gene expression data, which gives us another notion of similarity, and sequence data and so on. We can ask the question, 'Can we combine these different sources of information in order to come up with a more effective decision?' and if you pose that in the right way it turns out there is a convex optimisation formulation of that. It reduces to something called a semi-definite program, which has become of more interest in optimisation theory lately.
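
The formulation just mentioned learns the combination weights as part of a semi-definite program; the much simpler sketch below shows only the underlying idea, which is that a non-negative combination of kernel (similarity) matrices built from different data sources is again a valid kernel. The three matrices and the fixed weights are hypothetical placeholders.

```python
import numpy as np

# Combining similarity measures: a non-negative weighted sum of positive
# semidefinite kernel matrices is itself positive semidefinite, so it can be
# used wherever a single kernel could. All quantities here are placeholders.
rng = np.random.default_rng(3)
n = 5                                    # number of proteins (illustrative)

def random_psd(n):
    # Random symmetric positive semidefinite matrix standing in for a
    # similarity matrix computed from one data source.
    A = rng.normal(size=(n, n))
    return A @ A.T

K_interaction = random_psd(n)            # from protein-protein interaction data
K_expression  = random_psd(n)            # from gene expression data
K_sequence    = random_psd(n)            # from sequence similarity

weights = np.array([0.5, 0.3, 0.2])      # hypothetical non-negative weights
K_combined = (weights[0] * K_interaction
              + weights[1] * K_expression
              + weights[2] * K_sequence)

# The combined matrix has no negative eigenvalues (up to numerical error).
print(np.all(np.linalg.eigvalsh(K_combined) >= -1e-9))   # True
```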

Figure 8
Predicting membrane proteins


In this problem it turns out – I won't go into any details – that, in a particular protein prediction application, you can get significant improvements in performance by combining these different sources of information in this way, compared with using any one type of data that you have about the proteins.

To summarise: I think this class of methods, large margin classifiers, gives us a good illustration of the trade-offs between the statistical and the computational issues in the design of learning systems. I think in future the key direction here is to look at more complex decision problems. For instance, the natural language processing problem that I mentioned earlier is one that a lot of us are engaged in.

More generally, I think that learning and adaptive systems are going to see application across all areas of computer science, such as computer vision, robotics, natural language processing and so on. I think this is a very exciting thing. We will see a shift from programmed systems to adaptive and autonomous systems.


Questions/discussion

Question: I would like to ask whether the recent methods you have been describing could elicit structure. Let me concretise that question in the following way. Suppose you are doing text recognition, and suppose it is Swedish or Norwegian or Danish and you don't know. You build a classifier that can take all that sort of stuff. Will the structure of that classifier have something about it which clearly indicates that first you sort out the language and then you sort out the content of the sentence?

Certainly you can incorporate prior information that you have about distinct languages – in a sense, exploit structure that you are already aware of – in these sorts of techniques. The question, I guess, is whether that is something that would emerge in some sort of automatic way.

It may well be the case, but the sorts of techniques that I have been talking about are non-parametric techniques. It is typically very hard to extract information like that after the fact. I think the reason for that is that what you are shooting for with these sorts of techniques is good predictive accuracy, and that is not especially compatible with being able to understand the basis for the decision that has been made.

For instance, in these examples that you described, it may well be that there are characteristics that distinguish these types of languages. It is not clear on what basis they would be distinguished, using a technique such as this.

Question: These things that you are describing must be very important for national security. What puzzles me in that respect – and you may know about these sorts of things – is that I have been told that fingerprint recognition, for example, is not very good, even for police. Can your processes bring extra information into something like that? Can they bring extra information which could make that not 95 per cent reliable but 100 per cent or 99.9 per cent or something of that kind?

For fingerprint recognition, I guess that is possible. I am not really aware of work that has gone on in that area and what the limitations are.

Question: In your field it is probably well known, but with optical character recognition of, say, a faxed document, does the software try to discriminate and cut out each character, or does it look at things like word shapes? My understanding is that the dropping characters and the high-rising characters give a lot of words a little shape profile from which it is very easy to say, 'That word is "the" and that word is "who"', for instance, rather than actually saying 't-h-e'.

In fact, those things work at several levels. There are systems that are looking for individual characters and then these are combined with probability models for text. So, for instance, the word 'the' is known to be a very common word. If you see something that has some plausible likelihood of being a 'the', then this is the conclusion that you would make. So, in fact, down at a low level there is some sort of recognition engine, and then at a higher level you would try to construct words, for instance, out of the sequences of conjectures – you can think of it that way.

Question: The benchmark should be, I think, how well human beings do these tasks. What is the situation? In general, do these things do things better than human beings, or about the same, or are they not yet nearly as good? That is a very broad question.

I gave an example of a digit recognition problem, and there it is close to human performance – not as good. It is a very constrained experimental set-up there. The adaptive systems have access to 7000 digits in that case; it is not clear that that contains as much information as a human would get about the relationship between images of digits and their correct classification. But the performance is close in that case.

In other cases the performance is superior. I did not talk about it in any detail here, but in the helicopter control problem there are examples where a human pilot has enormous difficulty in, for instance, going through a particular manoeuvre in the presence of disturbances such as wind as accurately as one of these systems can do.

So it really varies between different tasks. We don't have a complete understanding of the characteristics that allow us to get better performance in one case rather than another.

Question: In physical measurement, one of the first elementary lessons one learns is to just give a result; it is not a result unless you also give an uncertainty for that result. I am just wondering whether, in this character recognition, you address the uncertainty in recognition, and can you somehow express that and quantify it in any way? In character recognition, for example, I imagine you haven't reached 100 per cent accuracy, it is not always just right. In some cases you would be much more confident than in other cases, and so on. How do you deal with that general concept of uncertainty?

That's interesting. There are advantages in coming up with estimates of probabilities in that way. Particularly if – in the fax example – you are wanting to use those estimates for subsequent decisions, then it is great to have the detailed information about the probability distribution.

If you pose the problem as, 'I want to make the most accurate decision,' in the sense of minimising the misclassification probability, it turns out that you can do better by throwing away that information – not throwing it away, but by not trying to estimate probabilities. The intuition for that is that, if you try to estimate probabilities, then you can be putting your effort into using information to try to estimate nuances of the probabilities that are far from the decision boundary. There are some formal results about the benefits of not trying to estimate probabilities in that sense.

This is if the problem that you want to solve is precisely stated as this decision problem.

Question: One case where I understand there has been some success in pattern recognition is in the recognition of melanomas in the skin, in recognising the difference between ones that are benign and ones that are likely to be dangerous. I just wondered whether there was something about the storage of the expert information, you might say. As I understand it, what they did was to take large numbers of these things and get experts to decide whether or not they thought it was going to become malignant, and then they followed it up later to find out and compared the performance with that of the computer. And the computer was marginally better than most of their experts.

And whether the prognostic type of data that they were dealing with for the computer was comparable, yes. They were using statistical techniques; I don't know the details of those techniques in that example.

Question: Intelligent agents of exactly the type that Peter has been discussing are now used, or being developed, by BHP Billiton for blast furnace performance; the company uses them an awful lot in the oil industry for mapping oil-bearing structures, and it is even automating a helicopter to take the exploration instrumentation around. So I wondered if you had any applications of that type that you would like to comment on.

The kinds of applications that you have mentioned I have not had involvement in personally. I have done some work with graduate students in Berkeley, looking at spatio-temporal data – not geological predictions but predicting, for instance, rainfall from limited information. The essential statistical and computational problems I think are very much the same.

One of the exciting things, I think, about this area is that there is an incredibly broad array of problems to which these techniques are applicable.