SCIENCE AT THE SHINE DOME Canberra 5 - 7 May 2004

Symposium: A celebration of Australian science

Friday, 7 May 2004

Dr Bostjan Kobe
Department of Biochemistry and Molecular Biology, School of Molecular and Microbial Sciences, University of Queensland

Bostjan Kobe is an Associate Professor in Structural Biology at the Department of Biochemistry and Molecular Biology, with a joint appointment with the Institute of Molecular Bioscience, both at the University of Queensland. He received his BSc at the University of Ljubljana, Slovenia, and his PhD at the University of Texas Southwestern Medical Center at Dallas, USA. Since his PhD studies, he has worked at the Howard Hughes Medical Institute in Dallas, and St Vincent's Institute of Medical Research in Melbourne. He moved to the University of Queensland in 2000. His research interests involve the structures of biological macromolecules, particularly the interactions between proteins. He is the recipient of an NHMRC Senior Research Fellowship. He received the 2001 Science Minister's Prize for Life Scientist of the Year. He has published more than fifty scientific papers and accumulated more than two thousand citations. He is a member of the editorial board of the Journal of Structural and Functional Genomics.

Use of protein three-dimensional structural data in functional genomics

What I will try to do today is to give you a few snapshots of the role that structural biology is playing in molecular biology today, and a couple of snapshots of how this is evolving in projects in our lab as well.

Figure 1
Post-genomic era

We live in what is called the post-genomic era. The reason it is called that is that quite a few genomes have now been sequenced, so we basically have complete catalogues of all the proteins that make up cells and organisms, and now the next big task is understanding what these proteins that are encoded by these genes actually do and how they do it together. It is this functional annotation that is a big task ahead of us.

Figure 2
The structure of a macromolecule determines its function

What does three-dimensional (3D) structure do for us? That is the main thing that I am going to be talking about today. It tells us how proteins function. Basically, the function of a protein – for example, this one binds a certain molecule – depends on its three-dimensional structure. The amino acid sequence determines what sort of 3D structure the protein folds into, and this 3D structure then defines what the protein does.

We can use the structural information for other reasons. It is very helpful when we are trying to design drugs, for medically relevant proteins, and we can also redesign proteins. We are trying to understand molecular folding; it can help us with that, and various other things.

Figure 3
X-ray crystallography

In our lab we use mainly the method of X-ray crystallography to look at protein structures. This is a very common method; the complementary method is NMR, nuclear magnetic resonance. Here I will just go very briefly through what we do using this sort of technique.

We have to isolate a protein and purify it, and if we do that we have a chance of actually crystallising it – we have to obtain a three-dimensional crystal. The reason we have to do that is that we then put this crystal in the X-ray beam, and from the X-rays we get a diffraction pattern, and that diffraction pattern helps us determine the 3D structure. So we go through this step here [of phase estimation].

This is a relatively old method, although it was not applied to proteins until later on. This [shown at right-hand side of slide] is one of the first protein structures actually solved. In the old days, the 1950s, when the first structures were solved, they had to build physical models of these things; these days we use computers to look at the protein structures.

Figure 4
X-ray sources

The X-rays that we use have to be very strong and bright, and that is why we use synchrotrons. Undoubtedly all of you have heard that we are building a synchrotron in Australia; it is going to be down in Melbourne. When we don't use a synchrotron we can use a slightly smaller machine which we can put in our lab, which produces X-rays of an order of magnitude or so lower intensity.

What I will do today is to basically give you a snapshot of three different projects which we have going on in the lab, and they will give you an insight perhaps into the different things that structural biology is good for and how it is evolving.

The first one is more of a traditional approach, when we know an important biological system and we try to understand it by looking at three-dimensional structures. In the second one we are trying to look at a system in a slightly more comprehensive sense, looking at a number of different proteins that contribute to how this system works. Finally, I will mention a project where we are trying to use structural information in a comprehensive sense to understand how a lot of different proteins function.

Figure 5
Nucleocytoplasmic transport

The first project involves nuclear transport. All eukaryotic cells have a nucleus and the transport into the nucleus, where a lot of the transcription et cetera goes on, is an important selective process, an essential process in every cell.

Figure 6
Eukaryotic cell

This is a cell. Here is a nucleus.

Figure 7
Cell nucleus

If we actually look at the nucleus under an electron microscope, in its membrane we can see little pores. These are the only places where proteins can actually get into and out of the nucleus.

Figure 8
Classical nuclear import pathway

They can't do that by themselves; they actually need help from other proteins which are transport factors. There are several different pathways by which the proteins can get in, but the most common pathway is called the classical pathway, and this is the one we will be talking about today.

It involves basically two major proteins, importin-α and importin-β, which form a complex. They bind a cargo protein, and the way they recognise the cargo protein is that it has a little signal on it, called the NLS. That is a little stretch of sequence and NLS stands for 'nuclear localisation signal'. Together they can then get through the pore, and once they are in there is another protein that helps unload this cargo. It is a GTPase called Ran.

Figure 9
Importin-α armadillo repeats

One thing we were particularly interested in is how this selection occurs. How does importin-α, which is the primary receptor for these cargo proteins, recognise only the correct proteins and not others?

A few years ago we started this project, first looking at the structure of importin-α itself. This is its structure. It is actually quite an unusual protein, because it is basically folded as a spiral, unlike most other proteins, which are folded in much more complicated ways. Every turn of the spiral is basically just three alpha helices. The individual repeats that form these turns of the spiral are called the 'armadillo repeats'. Here is an armadillo. You can obviously see the great resemblance between the two! – although the name doesn't come from that at all, but from a protein in Drosophila.

Figure 10
Importin-α armadillo repeats

The next thing we did was to look at how these sequences, NLSs, actually bind importin-α. We determined a few complexes of importin-α with little peptides corresponding to these NLSs, and here we can see different types of peptides bound to the same protein.

Figure 11
Binding of NLSs to importin-α

From that we can get some idea of how this recognition occurs. It turns out that in this system the way this recognition occurs is very elegant, one of the most elegant ways I have seen molecular recognition happen in any system. Because of the repetitive nature of this protein, we basically have these repeats, and in the same position in every repeat we have certain conserved residues.

It turns out that in this spiral structure they line up in a ladder. You can see here asparagine, asparagine, asparagine, also tryptophan, tryptophan, tryptophan. These are at exactly the right spacing to bind an extended amino acid chain. So this NLS peptide shown here can bind in an extended conformation, in such a way that it is pinned down by this conserved ladder of amino acids. The specificity then comes from other amino acids in other locations.

Figure 12
Binding of NLSs to importin-α

If we take our structures that show how these peptides bind and line the peptides up the way they are in space, we can find out which are the most important bits of these peptides and basically come up with a consensus sequence. So now we can much better understand what actually makes up this NLS. We can see that there has to be a lysine in a certain position, arginine, we don't need anything in particular here, any amino acid, lysine again here, and then we have a couple of them up here. This type of information helps us define this NLS, which in turn helps us search protein databases and identify other proteins that go into the nucleus because they have this type of signal, even though we might not know that experimentally yet.
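As a rough illustration of how a consensus of this kind can be used to search protein sequences, here is a minimal Python sketch. The pattern `K[KR].[KR]` is an assumption chosen for illustration (a simplified monopartite pattern), not the consensus shown on the slide, and the function name and toy sequences are hypothetical.

```python
import re

# Assumed, simplified NLS pattern for illustration only:
# lysine, then lysine or arginine, then any residue, then lysine or arginine.
# The lookahead lets us report overlapping matches.
NLS_PATTERN = re.compile(r"(?=(K[KR].[KR]))")


def find_nls_candidates(sequence):
    """Return (start_index, matched_segment) for every match, 0-based."""
    return [(m.start(), m.group(1)) for m in NLS_PATTERN.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical toy sequences. The first contains the segment PKKKRKV,
    # which is the NLS known from the SV40 large T antigen.
    toy_proteins = {
        "toy_nuclear": "MAPKKKRKVED",
        "toy_cytoplasmic": "MASGLLEDAG",
    }
    for name, seq in toy_proteins.items():
        print(name, find_nls_candidates(seq))
```

A real database search would use a position-specific scoring scheme rather than a strict pattern, since a strict pattern cannot express how strongly each position contributes.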

Figure 13
NLS peptide library binding to importin-α

We did another type of experiment to even better define this NLS consensus. What we did was to make a soup of peptides, called a peptide library, which has all different amino acids – randomly, in certain positions, they can have an amino acid. We then bind them to importin-α and find out which are the ones that bind best. Then we can look in every one of these positions and we can see for every amino acid how enriched it is there. We can see here that phenylalanine is the highest binder, lysine here, lysine here again, et cetera. So we can come up with a consensus which is very similar to the one that we inferred from the structural information. Both of these now help to define a very good consensus for NLSs and to find new nuclear proteins.
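The per-position enrichment described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bound peptides are made-up toy data, the starting library is assumed to contain all twenty amino acids at equal frequency in every position, and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_enrichment(bound_peptides, background=None):
    """For each position, return {amino_acid: observed_frequency / background_frequency}."""
    if not bound_peptides:
        raise ValueError("need at least one peptide")
    length = len(bound_peptides[0])
    if any(len(p) != length for p in bound_peptides):
        raise ValueError("all peptides must have the same length")
    if background is None:
        # Assumption: a uniform library, each amino acid equally likely.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    total = len(bound_peptides)
    table = []
    for pos in range(length):
        counts = Counter(p[pos] for p in bound_peptides)
        table.append({aa: (counts[aa] / total) / background[aa] for aa in AMINO_ACIDS})
    return table


def consensus(table):
    """Pick the most enriched amino acid at each position."""
    return "".join(max(column, key=column.get) for column in table)


if __name__ == "__main__":
    # Hypothetical peptides recovered as the best binders.
    bound = ["KKAK", "KRAK", "KKSR", "KKTK"]
    table = position_enrichment(bound)
    print("consensus:", consensus(table))
    print("enrichment of K at position 0:", table[0]["K"])
```

An enrichment above 1 means the amino acid is over-represented among binders at that position relative to the library; values near 1 suggest the position tolerates that residue without preferring it.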

Just to summarise this part of the talk: NLS peptides take advantage of this particular specific structure that importin-α has, with these armadillo repeats, to bind. It has two major sites where it is binding. NLSs utilise different features of the binding groove to accomplish binding and its regulation. And, finally, the structure and these peptide library experiments that I showed you help us define the NLS consensus and help us find new nuclear proteins from the genome.

Figure 14
Structural proteomics of macrophage proteins

This was an example of a specific traditional-type project. We concentrated on importin-α, we were looking at it, we were looking at how it interacts, et cetera. Now I am going to illustrate to you a different type of approach which we started a couple of years ago.

What we said here is that we were going to take an important process – in particular we chose macrophages – and look at a bunch of proteins that are involved in that process and try to understand how they work together and how they contribute to their function.

Figure 15
Traditional approach / Structural proteomics approach

Here is an illustration of how this differs from what I told you before. Here [in traditional approach] we concentrate on one protein; usually we know the function – we get the structure and we find out more about the function. Here [in structural proteomics approach] we go with a bunch of proteins, for many of which we have no idea what they do. Then we go and funnel, and basically this is slightly more effective, because we are picking out the ones that are a bit easier to work with. And then finally we get the structures of them and that helps us define the functions.

Figure 16
Microarray to structure structural proteomics pipeline

The way we have gone about it, as Julie [Campbell] mentioned, is that we are actually using microarrays to help us come up with the initial list of proteins. With the microarrays we define the proteins that are involved in a certain process, and then we choose the ones that we want to put through this pipeline for the structures – usually proteins that don't have a known structure and are not homologous to proteins with a known structure. We chose a system where, hopefully, this will also lead to new therapeutics.
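The selection step described above can be pictured as a simple filter. The following Python sketch is an illustration only: the record fields, the two-fold induction threshold, and the toy entries (apart from the name latexin, which is mentioned in the talk) are assumptions, not the lab's actual criteria or data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    fold_induction: float            # from the microarray experiment
    has_known_structure: bool        # a structure is already available
    homologous_to_known_structure: bool


def select_targets(candidates, min_fold=2.0):
    """Keep induced proteins with no known structure and no homologue of known structure."""
    return [
        c.name
        for c in candidates
        if c.fold_induction >= min_fold
        and not c.has_known_structure
        and not c.homologous_to_known_structure
    ]


if __name__ == "__main__":
    # Hypothetical values for illustration.
    toy = [
        Candidate("latexin", 6.0, False, False),
        Candidate("toy_protein_a", 8.5, True, False),
        Candidate("toy_protein_b", 1.1, False, False),
        Candidate("toy_protein_c", 4.2, False, True),
    ]
    print(select_targets(toy))
```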

The systems that we chose are macrophages. These are specific, very important cells in mammals, and they are involved in immunity, particularly in innate immunity, which is the very old immune system. They actually detect pathogens by the receptors they have, and respond to the pathogens by engulfing them and trying to destroy them. So they have a very good, important function. But often this function goes wrong and they are actually responsible for a lot of diseases. They are associated with inflammatory disease and cancer; that is when this sort of process goes wrong or becomes continuous and does not stop when it is supposed to stop. That means that macrophage proteins will be targets for two important classes of therapeutics: one is when we want to boost the immune system and the other one is when we want to target inflammatory disease, for example.

Figure 17
Experimental protocol

Here we started a more high-throughput approach. This is just an experimental protocol. We select proteins and then we put them through this process, which we have pretty much automated so a lot of this can be done very quickly using some robotic equipment or other high-throughput approaches. Basically, also what we are doing here is that the ones that are technically difficult, we leave aside – maybe in a few years the technology will improve and we can go back to those – and continue with the ones that are a bit easier to work with, which gives us high throughput and we can more effectively get the new structures and new data.

Figure 18
Latexin

To illustrate what we get out of this program, here is one of the structures that resulted from it: this is the structure of a protein called latexin. Latexin is the only known mammalian carboxypeptidase inhibitor. Carboxypeptidase is an enzyme that cleaves bits off the carboxy terminus of a protein, and this is an inhibitor of that. It is expressed at high levels in the brain and mast cells, and we are interested in it because it is expressed highly when macrophages are activated. It is also homologous to some other proteins which are tumour suppressors, which is quite interesting. This is why we chose it for our program; in mouse macrophages it is inducible by LPS, which is a marker for pathogens.

It has two domains, an interesting structure. There are two separate bits with one linker. It turns out that these particular domains are homologous to some other structures which we have seen before, but something we could not find out just from sequence comparisons. These domains were first found in cystatins, which are cysteine protease inhibitors. So there is obviously a connection here. We have got a carboxypeptidase inhibitor and it turns out to be similar in structure to some cysteine protease inhibitors.

Figure 19
Conservation / Likely binding surface

How does it bind carboxypeptidase? We don't really know yet, because we have not been able to crystallise the complex with carboxypeptidase, but by analysing the structure of latexin we can find possible surfaces where it binds. We can look at conserved regions on the surface, which are usually the important functional sites – you can see here, for example, that there is one green patch, which coincides with the green patch you find by other methods that just look at the surface features usually found in protein-protein interaction sites. So we have a pretty good idea where it binds carboxypeptidase.

What our task is now is learning as much as we can from these structures about the function and linking it back to why they would be important in macrophages. Why do we need a carboxypeptidase inhibitor in macrophages?

Here is a summary of all that data that we have combined together. We have found likely carboxypeptidase interaction sites, which gives us the biochemical molecular function of this protein. The structural similarity with cystatin suggests to us that it may actually interact with other proteins. Maybe it is also a cysteine protease inhibitor. Nobody has actually tested that function before; this now gives us a hypothesis and we can go and test it.

Microarray experiments show its induction in macrophage activation, and it is co-induced with a bunch of other proteins. So it is one of about 400 proteins that get induced during that activation process. So we can look at all those other proteins. Are there similar proteins that are co-induced? It turns out that a bunch of cysteine protease inhibitors are also co-induced, so this gives us another hypothesis about its cellular function.

One suggestion is that latexin may be a regulator of apoptosis and other signalling processes, together with the other cysteine protease inhibitors, because cysteine proteases are very important in apoptosis, programmed cell death. It is obviously playing an important role, and it now gives us a lot to go on with in a focused way to try to find out exactly what it does.

To summarise the second type of approach: The structural proteomics method gives us a comprehensive approach to functional annotation of proteins. We have established this high-throughput methodology so we can go through a lot of proteins quickly. And, finally, I told you a little bit about one example, the structure of latexin.

I have just a couple of minutes left, in which I will quickly try to tell you about the last thing that I want to mention: how we use structural information in the most comprehensive sense to learn about the functions of proteins, in particular about a program that we developed called PREDIKIN. It has to do with protein kinases. Protein kinases are particular enzymes that put phosphates onto other proteins, and this is the most important type of signalling in a cell.

Figure 20
Signalling diagram

These are a few signalling pathways in a cell, and a lot of these proteins here are protein kinases.

So an important thing that we want to know is what these protein kinases – or a particular protein kinase, of which there are hundreds – actually phosphorylate. What are the substrates? In this way we can place a kinase in a certain biological process. In order to do so, we need to be able to do it cheaply and quickly, and that is why we want to develop rapid bioinformatic tools. That is what we have tried to do here.

Figure 21
Design principles of PREDIKIN

What we have tried to do is to use structural information, together with other types of information – to combine that – and come up with rules so we can predict what the substrates for particular kinases will be.

Figure 22a Figure 22b Figure 22c Figure 22d
Protein kinase A

This is an example of a protein kinase. It has a little groove where the substrate binds, and here is a structure of a substrate bound to a protein kinase. If you look under this surface you can find what actually determines what binds: what determines which amino acid fits at each of these positions, that is, what the specificity is. So there are these different pockets.

Figure 23a Figure 23b Figure 23c
Protein kinase substrate prediction

Here are the rules that we came up with. There is a lot of work that went into developing these rules, but what we did in the end was to put these rules into a program, so somebody can just enter the kinase sequence here and the program tells you what its optimal substrate will be. You can then go and search protein databases to find proteins that contain it.
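To give a feel for the last step, searching sequences for a predicted substrate motif, here is a minimal Python sketch. It does not reproduce the PREDIKIN rules. The motif used (arginine at positions -3 and -2 relative to a serine or threonine) is the textbook consensus for protein kinase A, taken here as an assumed example of what a prediction might look like; the function and variable names are hypothetical.

```python
# Assumed example motif: offset relative to the phosphorylated residue
# mapped to the residues allowed at that offset.
PKA_LIKE_MOTIF = {-3: "R", -2: "R", 0: "ST"}


def find_candidate_sites(sequence, motif):
    """Return 0-based indices of residues whose surroundings satisfy the motif."""
    sequence = sequence.upper()
    hits = []
    for i in range(len(sequence)):
        matches = True
        for offset, allowed in motif.items():
            j = i + offset
            # A site fails if the motif runs off either end of the sequence
            # or the residue at this offset is not one of the allowed ones.
            if j < 0 or j >= len(sequence) or sequence[j] not in allowed:
                matches = False
                break
        if matches:
            hits.append(i)
    return hits


if __name__ == "__main__":
    # LRRASLG is the synthetic PKA substrate peptide known as Kemptide.
    print(find_candidate_sites("LRRASLG", PKA_LIKE_MOTIF))
```

A strict match like this is the simplest possible scoring; a weighted score per position would allow near-matches to be ranked rather than discarded.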

Figure 24
Yeast: DNA damage checkpoint pathways

I will just mention a couple of applications. We looked at yeast pathways, and using our methodology we can predict a lot of new connections. All these dashed lines are for connections we can predict.

Figure 25
Yeast phospho-proteome analysis

In yeast a lot of phospho-proteins have been already experimentally determined, and we can now associate protein kinases with every one of these phospho-proteins, as we have done here. This gives us a tremendous insight into how the yeast cell works.

I will just do a summary of the whole thing. Protein function is determined by three-dimensional structure. Structural biology is moving from a case-to-case traditional approach to a more comprehensive approach. The way we can use structural information is in structural proteomics, the type of project that we are doing with macrophages, or by using structural information in bioinformatics.


Questions/discussion

Question: Macrophages are a very diverse type of cell, as you pointed out. So first of all I wanted to ask you: What is the source of your macrophages, and what sort of excitatory state are they in at the time when you look at your proteins?

To follow that up, I would make the comment that David Hume believes that the macrophage may indeed be an adult stem cell, that it is highly plastic and it can become other cell types. He cites, for example, that 15 per cent of cells in virtually every organ of your body are macrophages. I want to ask you, as a secondary question: Have you observed any proteins that would be consistent with that hypothesis that macrophages are highly plastic cells?

To answer the first part of your question: there are a lot of different processes – there are different ways macrophages can get activated, there are different types of macrophages. Obviously, I did not go into a lot of detail regarding this sort of thing. At the moment what we are actually doing is that all the work that we have done with the structural approach has been in one particular process where we use one particular type of macrophage, which is bone marrow derived macrophages, and one particular activation process, which is by lipopolysaccharide, or LPS. That is a kind of bacterial marker, on the surface of bacteria, which mimics bacterial response. So basically that is what we are doing at the moment.

David Hume's group are doing a lot of cell biology. They are looking at this in a much broader context. They are looking at different types of activation, using microarray technology, and also different types of macrophages, also the related cell type osteoclasts, and so on.

I am probably not qualified to answer the second part of your question. This is a problem that all people have when they are using these kinds of large-scale approaches, such as microarray technology. You come up with a large number of proteins that are involved in a particular process. A lot of work has gone into how you can now analyse that: how do you extract the important bits, how do you confirm that everything you are seeing in the microarray actually is important, is induced? Obviously there is another step here. We are looking at which RNA molecules are induced, and then they have to be translated into proteins. That doesn't always correlate with which proteins are induced, et cetera. So there are all those questions there.

I don't think at this point we are actually at the stage where we can answer your questions of which ones of those particular proteins that play a role go back to the stem cells and play a role there.

Question: I love your approach for looking at the protein-protein interactions, but we know that most of the genome doesn't code for proteins. It is sort of like the Dark Matter and the Dark Energy of the genome. Are we going to miss something important by concentrating solely on protein-protein interactions?

This is one approach to understanding how the cell works, and it doesn't cover everything, just as you are saying. Proteins obviously play an important role; they are probably the major work force in the cell. But many other things go on in the cell, and as you are saying, not all RNAs are actually translated into proteins. There are a lot of RNAs that have important functions. In Australia there is a lot of good research going on in that area. In particular, John Mattick has been trying to get the idea across that there are a lot of RNAs which function as RNAs and they are probably important in regulating how cells work.

Then there are all these other bits which do not even become RNA, and why they are there and what they are doing is another big question. We can't address everything simultaneously. We are concentrating on one bit, which is proteins, and we are not really addressing other things. But other researchers are addressing those areas. Together, hopefully, we can some day come up with a picture of how the whole cell works together.

Question: I am interested in the microarray technology that you mentioned. What is the basis of actually detecting the proteins? I am a bit familiar with the gene arrays concept, but in proteins do you use antibodies, or is it a gel formulation? What is the standard mechanism that is used?

Sorry, I didn't really explain this properly because I was going through it fast and I explained it very much at the surface. We are not using protein arrays at all. In our application we are using cDNA microarrays, pieces of DNA. This technology, as you are alluding to, is actually extendable to proteins, looking at protein-protein interactions, but it is at a very experimental stage at this point. It is not really generally adapted, I think, at this point.

Question: In your work where you are wanting to look at the structure of many proteins, I am just wondering whether what we are doing in Australia with the synchrotron will be right for you or not. I have heard overseas that they are going to have automatic crystallising systems, and the crystal will be automatically put onto the synchrotron, and all of these things. Are we thinking about that in Australia?

Yes, we are. In fact, the synchrotron is projected to come on line about 2007 or so. I think at that point in time we will be in a pretty good position where a lot of these methodologies are going to be developed and automated. So perhaps from Brisbane, for example, we won't even have to travel to Melbourne to use the synchrotron ourselves. I think it is very likely that at that point in time we will already have these automated systems to do this, and perhaps we will just have to send our crystals down there and there is going to be a technician putting our crystals on, and the data will be measured and beamed back to Brisbane. This is possibly how a lot of the projects are going to work, but probably not all. There are always going to be experimentally difficult projects which robots et cetera will not be able to deal with. For that reason I think it is going to be still very essential that we have a synchrotron in Australia with easier access: we can go there quickly and do our own experiments.

But you are right that this automation is really picking up. The types of projects like the macrophage project that I described – they are usually called structural genomics or structural proteomics – are what is driving this methodology. The big aim of this whole initiative, by people all over the world, is to get the structure of a representative protein from every family, which will basically be a structurome or a structural proteome, so we will know, basically, all the structures that exist in nature. That is what is driving all that, and to do that we have to do structures much more quickly than we do at the present time, or than we did a decade ago. That is what is driving all this automation. There are a lot of crystallisation robots et cetera coming up; a lot of the things that we used to do by hand can now be done by machines.

Question: Listening to what you are saying, I am just wondering if there is a possibility, and an approach, to try and then use the structural information you get to predict protein-protein interactions, which is another level on but clearly these proteins are not working by themselves but are interacting with other families of proteins. Do you know of initiatives that will take your sort of data and then go to the next level of prediction?

I totally agree with you that proteins don't work alone and they interact with each other. There are a lot of approaches to how to detect which proteins actually interact with each other, which interact with which, on a large scale. They have been used for yeast cells and in other systems as well. These are the yeast two-hybrid system or affinity capture type approaches. So we can do that.

This complements our structural analysis. We – and most people all over the world, if they are doing high-throughput – are currently just going for single proteins or even fragments of proteins, to get structural information. But other information is out there on what these proteins interact with, and also from the structural data on the complexes we are actually starting to learn what makes a protein-protein interaction site. For the structure of latexin that I mentioned, we used one such program that looks at the surface and finds out which are the surface features which are most likely to be involved in a protein-protein interaction. That is how I predicted what might be the carboxypeptidase binding site.

Putting these things together, using them together, is where we can find out which proteins interact with each other and how they do that, and then we have to go to larger scales, look at big complexes in the cell, maybe using cryo-electron microscopy and methods like that. All that together will give us a combined picture of what interacts with what, where it is in the cell, and all that kind of stuff. We hope to get a virtual cell in the end.