AUSTRALIAN FRONTIERS OF SCIENCE, 2003

The Shine Dome, Canberra, 31 July to 1 August

Data challenges in astronomy
by Dr David Barnes


David Barnes is a Research Fellow in Astronomy and Astrophysics at the University of Melbourne, where he also received his BSc and PhD degrees. After graduating in 1998, David spent two years working as a research astronomer and visualisation programmer at the CSIRO's Australia Telescope National Facility, before moving to Swinburne University of Technology in Melbourne. There he continued astronomy research while building and maintaining the University's supercomputer facility, which now ranks as one of the top three in the country. In 2002, David returned to the University of Melbourne as a Research Fellow, and is presently Project Scientist for the Australian Virtual Observatory. His research interests include galaxies, groups of galaxies, sky surveys and advanced algorithms for radio astronomy data processing. David is an author on 24 papers, and has been granted time on many of the world's radio astronomy telescopes.

Firstly, astronomy is theoretical and observational. There are two basic approaches, and they have to meet at some point. Theory involves taking existing models or producing new models, generally simulating them on a high-performance computer and comparing the results with what you might go out and observe in today's universe.

Secondly, a basic tool in all of this is doing surveys of the universe: not just going out and looking at one interesting object, but going and imaging the entire sky, or a substantial fraction of it, and seeing what you see and whether it matches up with today's theories. Surveys continue to challenge theory, and vice versa at some level.

Thirdly, there are present and future surveys that are going to collect an enormous amount of data, data that is basically unmanageable by current systems. So, fourthly, towards the end of this talk I want to talk about some of the new and innovative systems and technologies we will need to take advantage of, just to manage and make sense of the data that we are confronted with.

Fifthly, I will also point out at the end that I suspect many of these challenges are encountered across disciplines, and I will give an example of how the UK is approaching this problem.

Let me first talk about theoretical astronomy, just briefly. I am mostly going to talk about observational, but I just need to set the scene here.

With theory, we produce models of phenomena that we might discover in observations, or that we have discovered in observations. So theory is predictive as well as descriptive. Many of the theories are simply non-analytic: you can't just write them down and then do a couple of lines of eliminating variables and deduce the answer. What we have to do is take a high-performance computer (I am talking here about something with 100 processors, a distributed cluster) and run a simulation on it for perhaps weeks or months at a time. You then end up with some realisation of a synthetic universe, and you want to compare that with observation.

I am telling you those timescales because I suspect a lot of us here do computer models, and it is interesting to see what different groups think are tough computational problems. There are problems in astronomy that have taken years to compute on supercomputers.

We need to link the theory to the observations. The trick here is that we do not expect to produce our universe with a simulation; we expect to produce something that looks statistically like it. A good model will be one that is statistically indistinguishable from our universe for the sort of phenomena you are trying to investigate. If we are going to do statistical comparisons, we need numbers.

The most exciting and interesting parts of astronomy are the bits we don't understand yet, and all too frequently these are the parts that, when you bin the data into some nice representation, will be the bins with two objects or one object in them. So for some period of time, people will go and study those objects, those particular one-offs, as a follow-up of a survey. But at some point you need to go and get another survey that finds you 10, 100, 1000 of those objects. This is the basic motivation, or one of the motivations, for building bigger telescopes and taking larger surveys.

A quick excursion through observational astronomy: most people are familiar with optical telescopes, but today we have telescopes that, taken as a set, span more than 12 decades in wavelength. So we go from very low-energy kilometre-wavelength radio telescopes through to extremely high-energy gamma ray observatories that must be space-based.

This is a real back-of-the-envelope idea, but roughly one half of all telescope time is dedicated to survey projects, or key projects, where you actually are not looking at particular objects but at a whole part of phase space, basically, to get a big sample to do statistics with. And these usually have more than one science goal in mind.

Figure 1

One of these projects is the HI Parkes All Sky Survey (HIPASS). HI is neutral atomic hydrogen, and by HI we often mean a hyperfine transition that we see in neutral hydrogen atoms, which has a wavelength of about 21 cm. This is perfect for detection by something like the Parkes Telescope. So the Parkes Telescope is here (figure 1), and just to give you scale, that is roughly 64 metres across from edge to edge. This is like a 14-storey building that you can point in any direction.

The survey we did was to look for atomic hydrogen, HI, in galaxies that were within a few hundred million light-years of ours. This took roughly 300 days of observing, altogether, spread over four years. These are big team projects, so this involved roughly 40 observers. The end product here is a set of image 'cubes' of the sky. We don't just get a flat image; we get an image at different frequencies which correspond, effectively, to different velocities of the gas that we see. And that velocity can correspond to Doppler shift, so distance to the galaxy, or kinematic information within the galaxy.
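The link between frequency and velocity can be sketched in a few lines. This is a minimal illustration, not the survey's actual pipeline: it assumes the common 'radio' velocity convention, and the function name and the example frequency are hypothetical.

```python
# Rest frequency of the neutral-hydrogen hyperfine line (about 21 cm wavelength).
HI_REST_FREQ_MHZ = 1420.40575
SPEED_OF_LIGHT_KM_S = 299792.458


def radio_velocity_km_s(observed_freq_mhz: float) -> float:
    """Velocity under the radio convention: v = c * (f0 - f) / f0.

    A positive value means the gas is receding (the line is shifted
    to a lower frequency than the rest frequency).
    """
    shift = HI_REST_FREQ_MHZ - observed_freq_mhz
    return SPEED_OF_LIGHT_KM_S * shift / HI_REST_FREQ_MHZ


# Hypothetical channel of an image cube, centred at 1400 MHz.
velocity = radio_velocity_km_s(1400.0)
print(f"{velocity:.0f} km/s")
```

Each frequency plane of a cube then maps to one velocity, which is why a stack of images at different frequencies carries distance and kinematic information.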

Figure 2

The catalogue of around 4300 galaxies will be published, or at least submitted, in the next month or so. Mostly we see spiral disc galaxies that have lots of hydrogen in the disc. Hydrogen is your basic fuel for producing stars. And we get all sorts of information. What I thought I would do is just show you a quick visualisation of this (figure 2). We have only imaged the southern sky. It is all we can really see easily from the south. The red colours correspond to distant galaxies, blue to nearer galaxies. This is just to give you an idea of the large-scale structure that we might see in a survey like this.

As this rotates you can see different planes, which are large-scale structures where galaxies have formed in sheets, for some reason. There are also voids where there are no galaxies forming (or at least no galaxies that are visible in HI), and there are also clustered areas.

Figure 3

What's the basic result we get out of this? This is the HI mass function (figure 3). This is a plot of how frequently galaxies occur with different amounts of neutral hydrogen in them. The advantage this survey has over previous ones is that it has the most galaxies detected in HI. (They have to be observed and found that way. You can't go and look for them in the optical and then measure their neutral gas mass.) And we have improved this low-mass end. The bottom bin is something like 10 million times the mass of the sun in neutral gas.
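To make the binning concrete, here is a rough sketch of counting galaxies per bin of logarithmic mass. It is a simplification under stated assumptions: a real mass function also divides each count by the volume in which such a galaxy could have been detected, which this sketch leaves out. The function name, the bin width and the masses are all hypothetical.

```python
import math
from collections import Counter


def log_mass_histogram(masses_solar, bin_width_dex=0.5):
    """Count galaxies per bin of log10(neutral gas mass / solar mass).

    Keys are the lower edge of each bin in log10 units.
    """
    counts = Counter()
    for mass in masses_solar:
        bin_index = math.floor(math.log10(mass) / bin_width_dex)
        counts[bin_index * bin_width_dex] += 1
    return dict(sorted(counts.items()))


# Hypothetical masses in solar units, for illustration only.
toy_masses = [1.2e7, 3.0e8, 4.5e8, 2.0e9, 2.5e9, 6.0e9, 8.0e9, 1.5e10]
histogram = log_mass_histogram(toy_masses)
print(histogram)
# {7.0: 1, 8.0: 1, 8.5: 1, 9.0: 2, 9.5: 2, 10.0: 1}
```

The lowest bin holds a single object, which is exactly the situation where no statistics are possible and a larger survey is needed.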

There are only two objects in those two bins, but we have beaten everyone else. They are two interesting objects, and follow-up will occur on those. But what we would like to do is plan a survey where there will be 1000 of them, and then we can do some statistics. This particular measure is interesting because this tells you about how much gas galaxies are formed with, and how they burn that in their lifetime into stars and turn it into other products.

Figure 4

Here is one example of a survey where we would like some more data (figure 4). We have already seen this image today, and you should now know that this is an interesting image and it is one of the main results in astronomy in the last 10 years, let's say, or probably ever. You have heard that this bottom map here is a map of the original radiation in the universe that was a result of the Big Bang. (See Brian Schmidt's paper)

Figure 5

The science result that comes out of this (I don't want to describe this too much) is simply a power spectrum of those fluctuations: how much power is in low-level, broad fluctuations and how much power is in the smaller angular-scale ones. This is a plot of that (figure 5). And the point here is that here is another survey, a magnificent survey which probably cost a billion dollars, and again there are bins in this data which are the most interesting, where we need more points. So one day another survey will dwarf this and give us more data here.

I could give you a list of 20 other major surveys that are planned in astronomy. In the interests of time I am just going to skip over this one here, but basically we are looking at all these surveys picking up terabytes, tens of terabytes, hundreds of terabytes of data.

Figure 6

One particular example is the Two Micron All Sky Survey, whose images were catalogued by an automated algorithm, and we end up with half a billion point sources across the entire sky. Each of these point sources has something like 100 parameters associated with it, so if you actually want to process this entire catalogue and do statistics on it, you are going to need a pretty chunky system to do it (figure 6).
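A back-of-the-envelope calculation shows the scale. The source count and parameter count are the ones quoted above; the 4 bytes per stored value is an assumption made only for this sketch, and the function name is hypothetical.

```python
def catalogue_size_gb(n_sources, n_parameters, bytes_per_value=4):
    """Raw catalogue size in gigabytes (10**9 bytes), ignoring any
    indexes, metadata or compression."""
    return n_sources * n_parameters * bytes_per_value / 1e9


# Half a billion sources with about 100 parameters each.
size_gb = catalogue_size_gb(500_000_000, 100)
print(f"{size_gb:.0f} GB")  # 200 GB
```

Under that assumption the catalogue alone, before any images, is of order a couple of hundred gigabytes, and every full-catalogue statistic has to read all of it.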

Let me go on from here and say that there are dozens of surveys planned. They are going to collect an amount of data that we are just not used to handling (in fact, an amount that is almost beyond hardware technology, but not quite), and we have roughly 10 to 100 researchers per survey. So the 90 of those researchers who are based around the world and not at the base site of the survey are going to be a bit unhappy if all the data is just in one place. These are exclusively international projects, and just as they can't be done by one person any more, they can't actually be handled by a single storage facility or computational system.

This is where grid computing and the Virtual Observatory come in, and what I want to do is briefly describe grid computing and then tell you about how the Virtual Observatory will build on that to enable astronomy with these massive datasets.

You may have heard the grid mentioned. Even the definition of the grid itself is evolving. It is basically to data and researchers and simulation projects what the Internet is to the World Wide Web. The World Wide Web is built on top of the Internet, and the Internet provides the foundation for sharing web pages and this sort of thing. The grid will provide the foundation for sharing data, for sharing resources and for communicating in a scientific way, and in a commercial way as well, of course.

Figure 7

We need a little picture to describe this to us (figure 7). The idea is that you use this grid to solve problems. (We are not just going to make a grid and think it's pretty cool.) You come in with a complex problem, and in our case it can often be a very simple statistic that you want to calculate on a very big dataset. That is considered a complex problem, but it can obviously get a whole lot harder than that. The grid is this mechanism in the middle that gathers together computing resources and data, knowledge as well (so this could be journals or expert systems or things like this), people and instruments. So this could even be hooked up directly to a telescope and commanded to go and gather an observation if you needed some more information to solve your problem. The grid is some piece of protocol that allows you to bring all of this together and come up with a solution to your problem.

What is a Virtual Observatory? Well, a Virtual Observatory will try and use the grid to solve problems that are specific to astronomers, and in particular the idea is to provide a uniform interface to data archives so that, rather than being an expert in the HIPASS survey, an astronomer can go to a uniform interface and immediately understand what the data means. It is described properly to them, and they can handle that data.

The thing is that, having been involved in the HIPASS survey, I have access to that data for some period of time. But once that time elapses, we make this data freely available to anyone else in the community who wants to use it. We promote sharing information like this, effectively.

That is all fine and good, but no-one is forcing me to give that data to someone in a format they can understand. So if I want to extend my proprietary period on the data I can sort of do it in a mean and unfair way.

The Virtual Observatory will make us use a uniform standard so that everyone, from day one, will understand the data that we have published and made available to them. As well as that, if you have got a uniform interface, one tool that works on optical data will also work on radio data or X-ray data. And you will be able to correlate and combine the processing of one dataset that was otherwise completely incompatible with another. The idea is to build this all using data grids and computational grids, the reason being that we have massive amounts of data and we have massive computational needs.

So what does the uniform interface get us? Well, at the moment every domain has its own set of analysis tools. To handle a radio astronomy observation I routinely use about four different packages, and at least one of those is made up of a million lines of code. So these are not small products. That is probably more than the code base of Excel or Word. And every domain has this sort of software. Then each key project has custom software to satisfy particular processing needs. We can't take that away, but we can actually start to share that even more than we have been.

By providing the processed data in common formats you assist both the developer of the survey and the astronomer in some important ways. You only have to write a tool once, and you can then apply it to another set of data. You only have to learn it once (most of us don't write tools, we use them) and you can use it many times. Or you can learn it once and use it for many different domains, or different problems. You can integrate otherwise disparate data sources without having to think too hard, and the expertise requirements are simply reduced.

If I publish the HIPASS data in a nice format that is well described, it makes it available and accessible to a much wider range of astronomers, who may have completely different ideas about what to do with that data and come up with a new discovery as a result. This does not take away the requirement on the astronomer to think about the data they are using and understand it; it just helps them.

Figure 8

Here is the old way of doing stuff (figure 8). These are three excerpts from papers containing data on a few sources. This particular astronomer is interested in M100 (an exciting name). They have had to go and find this object in these catalogues and figure out which columns contain the data they want to compare, and manually do that. That's fine for a one-off, except in this case this object probably appears in two or three hundred catalogues, because it is a popular one. But if you want to do 1000 objects, or 10,000 objects, or a million objects, this is out of the question.

Figure 9
Figure 10

So there are services now that enable you to do this sort of thing in a relatively easy way (figures 9 and 10). Here I have said, 'I want to find catalogues that have a radio flux at 850 MHz.' I press Go and it comes back and tells me that there are six catalogues and I might like to investigate them. I can then go and check some boxes here and perhaps make some plots of various sources selected from those catalogues.

Virtual observatories are under way around the world. There is an International Virtual Observatory Alliance; Australia is a member of it. One particular example is the United States National Virtual Observatory, where they have compared 15 million sources from one survey with 160 million from another. In two minutes they managed to discover a brown dwarf. This is a relatively rare object; we only know of about 200. But to find this object the old way would have taken months or years. So this is an example of a Virtual Observatory that had a particular science goal in mind: they were actually just attempting to rediscover existing sources, and what they actually ended up doing was finding a new one. So that's pretty neat, and that's a good example of what is to come.
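The core operation in a comparison like that is a positional cross-match between two catalogues. The sketch below is only an illustration of the idea, not the method the National Virtual Observatory used: it is a brute-force loop that would be far too slow for millions of sources, where a spatial index would be needed. The function names, source names, coordinates and match radius are all hypothetical.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(catalogue_a, catalogue_b, radius_deg):
    """Pair each source in A with its nearest source in B within radius_deg.

    Each catalogue is a list of (name, ra_deg, dec_deg) tuples.
    Returns (matched_pairs, unmatched_names_from_a).
    """
    matched, unmatched = [], []
    for name_a, ra_a, dec_a in catalogue_a:
        best_name, best_sep = None, radius_deg
        for name_b, ra_b, dec_b in catalogue_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_name, best_sep = name_b, sep
        if best_name is None:
            unmatched.append(name_a)
        else:
            matched.append((name_a, best_name))
    return matched, unmatched


# Hypothetical toy catalogues; positions in degrees.
survey_a = [("a1", 10.0, -30.0), ("a2", 50.0, 20.0)]
survey_b = [("b1", 10.0005, -30.0003), ("b2", 200.0, 5.0)]
matched, unmatched = cross_match(survey_a, survey_b, radius_deg=5.0 / 3600.0)
print(matched)    # [('a1', 'b1')]
print(unmatched)  # ['a2']
```

Sources that appear in one survey but have no counterpart in the other are the ones worth a closer look, which is how a search for known objects can turn up a new one.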

I think I have motivated the need for data storage and access, and to use a grid to do this, as well as data processing. In the interests of time I will just skip past these, but what we are building is the 'Australian Astronomy Grid'. We have funding this year from LIEF and we hope to get a little bit more next year. This is built on infrastructure that Australia already has. It is a system called GrangeNet, which is a high-performance network linking Brisbane, Sydney, Canberra and Melbourne. What we are doing is building on top of that backbone and using the expertise that resides here at APAC or down at VPAC to build this grid. This is a rough idea of where we think we are going.

The last thing I want to tell you is that if you hadn't heard of these things called grids before this talk, you probably will start hearing a bit more about them. In the UK there are about 30 projects funded across all disciplines that are using grids to solve these tough data problems and to get researchers working together on the same datasets, speaking the same language. And common across all of the disciplines will be data storage and access, and data description and access to processing resources. That is my message, that to stay at the frontiers of science, at least in astronomy, we will be adopting the grid paradigm and trying to solve our problems that way, and we will drive the future of the grid within Australia.

