AUSTRALIAN FRONTIERS OF SCIENCE, 2008

The Shine Dome, Canberra, 21–22 February

Session 8: Discussion

Question: This is a general question for both of you. We see lots of two-dimensional representations of data, and occasionally we see three-dimensional representations, such as the mortality ones, with motion and colour added in. Does it ever go above three dimensions, to four and beyond? And how do you represent that, if it is useful to do so? If we are not doing that, are there some really handy bits of data that we are not getting to look at?

Rob Hyndman: People do look at multiple dimensions. What normally happens is that they take a very high-dimensional space and look at projections onto two- and three-dimensional subspaces, and then they change the projection so that they can look at many dimensions. Physically you can't look at more than three at once, or four if you add in time, but if you do these projections onto smaller subspaces then it is possible to start exploring things in many dimensions. That is often what people do when they are trying to find outliers in very high-dimensional spaces.

There is some free software around that does it, called GGobi, which now has an R interface so it is a bit more usable. That is the main way that I have seen it done. I don't know whether Ian has seen other tools.
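[Editor's note: a minimal sketch of the projection idea described above. It uses base R's prcomp() to produce two static two-dimensional projections, whereas GGobi (via its rggobi interface) animates a continuous 'tour' through many such projections; the data here are simulated purely for illustration.]

```r
# Illustrative sketch: projecting high-dimensional data onto 2-D subspaces.
# GGobi animates a "tour" through many such projections; here we show two
# static principal-component projections instead.
set.seed(1)
X <- matrix(rnorm(200 * 10), ncol = 10)   # 200 points in 10 dimensions (simulated)
X[1, ] <- X[1, ] + 5                      # plant one outlier

pca <- prcomp(X, scale. = TRUE)
par(mfrow = c(1, 2))
plot(pca$x[, 1:2], main = "PCs 1-2")      # one 2-D projection
plot(pca$x[, 3:4], main = "PCs 3-4")      # another; the outlier may show in either
```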

Ian Wood: I have seen some other methods, but I think it would take a while to explain how to interpret them, and people might see things that aren't really there, because there are still lots of dimensions that haven't been shown. I have seen up to six dimensions or so. It's difficult.

Question: Ian, when you build a classifier and use it in a microarray situation, as those microarrays are released into the real world to be used, are they sold with a contract whereby data are returned to the manufacturer when there are independent data to show whether a diagnosis or prognosis was correct, and does that cycle continue? Or once these microarrays are released, is that the end of it?

Ian Wood: The MammaPrint example? I am not entirely sure about that commercialisation. In general, data tend to be released very reliably by most groups; a lot of funding contracts say the data must be put out there pretty quickly, and there are good repositories around. I am not sure about the MammaPrint example, which is currently an FDA (Food and Drug Administration) approved classifier for this. I wouldn't be surprised.

Sue Wilson: Basic data is available.

Ian Wood: Is it available for every new case?

Sue Wilson (cont.): Not for the new ones, only for the published ones. If anything is published these days using microarrays, the standard is for the data to be made available online.

Question: Rob, I wonder if you could tell us a bit more about how the forecast values are actually calculated. In the plots it seemed that for the most part the forecast values for phi were a linear extrapolation from the past, based on the past 10 to 20 years, but then there was one case, the phi 2 for fertility, where the forecast value turned over, even though for the last 50 years it seemed to be going straight. Could you phenomenologically describe what is going on there?

Rob Hyndman: For each of those betas we are trying to forecast, we fit a univariate time series model. There are several classes that we can choose from: ARIMA models are one, and that is what I was using for the slides. An ARIMA model is essentially like a weighted average of the past, with higher weights on the most recent observations.

I think for the fertility I actually imposed some damping on it. If you don't do that, it tends to shoot off into areas that don't seem physically likely. It is common in forecasting to use damping when you want to rein in the most extravagant forecasts, so that is what was going on there.

The two classes that we use are ARIMA models or exponential smoothing state space models. They both seem to work pretty well for univariate time series.
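[Editor's note: a sketch of the two model classes mentioned above, using the forecast package for R, which Rob Hyndman maintains. The series beta2 is a simulated stand-in for one of the estimated coefficient series; the damped = TRUE option illustrates the damping described above.]

```r
# Sketch: fitting the two univariate model classes to one coefficient series.
library(forecast)
set.seed(1)
beta2 <- ts(cumsum(rnorm(100)), start = 1908)  # simulated annual series

fit_arima <- auto.arima(beta2)                 # ARIMA class
fit_ets   <- ets(beta2, damped = TRUE)         # exponential smoothing state space
                                               # class, with a damped trend to rein
                                               # in the most extravagant forecasts
plot(forecast(fit_arima, h = 20))
plot(forecast(fit_ets, h = 20))
```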

Question: Following on from that, Rob: situations like wars are obviously very hard to predict, but there may be some other situations where you have some physical constraints or restraints that may or may not be reflected in existing trends but that you may be able to predict with existing knowledge. I'm not sure if this is an urban myth or not, but economists are often accused of having models that operate in 'infinite worlds'. In population analysis, for example, I'd love to know whether your population models ever taper off within error, or whether they just keep going up. And can you build in some of these physical constraints?

Rob Hyndman: The advantage of using a decomposition method as we have done is that you focus the time dimension down to only a few components and you can do whatever you like with them. Because they are univariate time series, if you have some external information about things that you think will happen and you want your forecast to reflect that, then you build that into the model. You can build whatever model you like for those components, and then it just all feeds back in when you reconstruct the curves at the end.
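[Editor's note: a schematic of the decomposition-and-reconstruction flow described above, using ordinary principal components and auto.arima as stand-ins; the actual method involves smoothed curves and functional principal components, but the flow is the same. The rates matrix is simulated purely for illustration.]

```r
# Schematic: reduce a set of curves to a few components, forecast each
# component's coefficient series univariately, then reconstruct the curves.
library(forecast)
set.seed(1)
rates <- matrix(rnorm(50 * 20), nrow = 50)      # 50 years of 20-point curves (simulated)

mu  <- colMeans(rates)                          # mean curve
pca <- prcomp(sweep(rates, 2, mu), center = FALSE, rank. = 3)  # first 3 components

h <- 25
scores_fc <- sapply(1:3, function(k)            # univariate forecast of each
  forecast(auto.arima(ts(pca$x[, k])), h = h)$mean)  # coefficient series

# Reconstruct forecast curves: mean + forecast scores x component curves
curves_fc <- matrix(mu, nrow = h, ncol = 20, byrow = TRUE) +
  scores_fc %*% t(pca$rotation)
```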

Actually, the demographers don't like my fertility forecasts, because they say they are far too wide. I say to them, 'Well, you're looking at the last 20 years, where it has been extremely stable, but if you look at the last 100 years you see that there are occasions when it just changes drastically – and it's not due to a war or an epidemic or anything, it is just that social changes mean that fertility changes. Over the course of five years that can cause a big change.' If things have happened in the past, then the variations in the model should allow that to be seen in the prediction intervals. For things that haven't happened in the past but that you believe are going to happen, you have to try to build a model that explicitly incorporates that information.

Question (cont.): But what do your population models look like? Do they taper off?

Rob Hyndman: Do you mean do they dampen?

Question (cont.): Well, do they keep going up seemingly exponentially from what we are seeing at present?

Rob Hyndman: For population I wouldn't forecast more than about 25 years ahead, and for the next 25 years they taper slightly but not much. If you did predict several hundred years ahead using these models, then yes, it would look crazy because it would exponentially grow out of control. I wouldn't do that, because the model is not really built for it. It is built for short- to medium-term forecasting, not long-term forecasting.

Question: Have you tested your forecasts by, say, truncating your data at 1950 and seeing how well the forecasts matched what actually happened?

Rob Hyndman: Absolutely. We do it with a rolling forecast origin. You hold the data at 1950 and predict the next 10 years, then hold it at 1951 and predict the next 10, and do that over and over again. For our mortality forecasts we did this against all of the competing methods that others had proposed, and for those forecasts we had the smallest errors by comparison with the other methods. That is essentially how you test whether you are doing well: you don't know for the real forecasts, but you can test on historical data using that sort of back-testing.
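[Editor's note: a sketch of the rolling-origin evaluation described above, again assuming the forecast package and a simulated annual series. Each origin year's model is fitted only to data up to that year, and the errors over the following 10 years are accumulated.]

```r
# Sketch: rolling-origin evaluation. Hold the data at successive origins,
# forecast the next 10 years each time, and accumulate the errors.
library(forecast)
set.seed(1)
beta2 <- ts(cumsum(rnorm(100)), start = 1908)   # simulated series, 1908-2007

h <- 10
origins <- 1950:1990
errors <- sapply(origins, function(origin) {
  train <- window(beta2, end = origin)          # data up to the origin year
  test  <- window(beta2, start = origin + 1, end = origin + h)
  fc    <- forecast(auto.arima(train), h = h)
  as.numeric(test) - as.numeric(fc$mean)        # errors at horizons 1 to 10
})
mean(abs(errors))                               # overall mean absolute error
```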

Question: Rob, can the models actually be updated very regularly to provide an early warning of an outlying period?

Rob Hyndman: It is easy to update them; it only takes a few seconds to fit these things. As to whether you would get early warning of weird things, I guess you would, because if what you saw was sufficiently different from what you forecast, then that would constitute an early warning. Yes, you could do that.

For the applications that I am looking at, we are dealing with data that is coming in once a year, so it is not really something we worry about. But if you had an application where you were sampling these curves much more regularly, then that might be something that would be worth doing.
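[Editor's note: a sketch of the early-warning idea just described: refit the model as each observation arrives and flag any value that falls outside the one-step-ahead prediction interval. The 95% level and the series are illustrative assumptions.]

```r
# Sketch: flag a new observation as an early warning if it falls outside
# the one-step-ahead 95% prediction interval of the refitted model.
library(forecast)
set.seed(1)
y <- ts(cumsum(rnorm(60)), start = 1948)   # simulated history to date
new_obs <- 25                              # hypothetical newly arrived value

fc <- forecast(auto.arima(y), h = 1, level = 95)
if (new_obs < fc$lower[1] || new_obs > fc$upper[1]) {
  message("Early warning: observation outside the 95% prediction interval")
}
```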

Question: Australia is obviously not a closed system – as you said, you have people coming in and people leaving, and I guess there may be biases in there from people's perceptions of fertility and so on as well. But, equally obviously, the world is a closed system. To what degree can you scale these models up to predict population changes in different parts of the world? Have there been attempts to do this on a global scale?

Rob Hyndman: Because mortality and fertility rates vary so much across sub-populations, the best forecasts tend to come from working on relatively homogeneous populations and then adding them up to get world forecasts. The trouble is that we don't have very good data on a lot of third world countries. For a country like, say, Bangladesh we would have at best 10 years of reasonable data. It is pretty hard to produce 50-year forecasts on 10 years of data. So what tends to happen for those long-term global forecasts is that they use much more subjective methods, whereas our methods are much more data-intensive, so they are more suitable for countries that have long histories of good data collection, such as the western European countries and Australia.