AUSTRALIAN FRONTIERS OF SCIENCE, 2003

Canberra, 31 July to 1 August 2003

The role of protein structure in functional annotation of proteins
by Dr Bostjan Kobe

Bostjan Kobe is an Associate Professor in Structural Biology at the Department of Biochemistry and Molecular Biology, with a joint appointment with the Institute of Molecular Bioscience, both at the University of Queensland. He received his BSc at the University of Ljubljana, Slovenia, and his PhD at the University of Texas Southwestern Medical Center at Dallas, USA. Since his PhD studies, he has worked at the Howard Hughes Medical Institute in Dallas, and St Vincent's Institute of Medical Research in Melbourne. He moved to the University of Queensland in 2000. His research interests involve the structures of biological macromolecules, particularly the interactions between proteins. He is the recipient of an NHMRC Senior Research Fellowship. He received the Science Minister's Prize for Achievement in Life Sciences in 2001. He has published over 50 scientific papers and accumulated over 2000 citations. He is a member of the Editorial Board of the Journal of Structural and Functional Genomics.

We have entered something called the post-genomic era. We have now sequenced a lot of genomes, so we know about a lot of genes and a lot of the proteins they produce, but for many of those proteins we have no idea what they do. So one of the big issues today is to try to understand what every one of these proteins does and how they work together to make our cells and our bodies work. The functional annotation of every gene product is one of the big issues of today's biology.

Figure 1

How does structural biology play a role in all this? Here is a quote from every textbook: 'The structure of the protein determines its function'. The only reason different proteins have different functions is that they have a certain three-dimensional structure. For example, this guy has a little cavity here and it can bind an organic molecule here to work as an enzyme, for example (figure 1). Every protein, because it has a different and unique structure, can do different things.

This is why it is important to know three-dimensional structures. The most important thing is that we can then understand how it actually functions. There are a few other uses of knowing the three-dimensional structures. One of them, which is obviously the topic of the title of this session, is that we can then design drugs against these proteins a bit more efficiently. Basically, because we know what the structure is of a pocket where we want to bind a drug, we can then rationally design a drug that will bind there. And there are other reasons why we might want to know and use three-dimensional structures. For example, we might want to redesign proteins, we might want to understand how they actually fold up together and take these unique shapes, or understand what holds them together and makes them stable.

Figure 2

In our lab we mostly use a technique called X-ray crystallography to determine these three-dimensional structures. It is quite an old technique, and a very useful one, because it places very few limits on the size or type of protein whose structure we can determine.

The way X-ray crystallography works is as follows. We need to purify relatively large amounts of protein – make it quite pure – and then an important thing is that we have to make crystals. We need crystals because of the next step, where we put these crystals in an X-ray beam and we get a diffraction pattern (figure 2). Unfortunately, we need the crystal because otherwise we do not get a sufficient-quality diffraction pattern to be able to use it. So we need a crystal specimen which has these ordered arrays of proteins in there. That, basically, increases the signal of the diffraction.
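Why an ordered array boosts the signal can be illustrated with a toy calculation. The sketch below is a one-dimensional simplification, not a description of a real experiment: it assumes identical point scatterers at a fixed spacing and sums their scattered waves. The function name and the unit spacing are illustrative assumptions.

```python
import cmath
import math

def lattice_intensity(n_cells, q, spacing=1.0):
    """Scattered intensity |A(q)|^2 from n_cells identical scatterers in a row.

    Toy 1D model: each scatterer contributes a wave exp(i*q*position).
    """
    amplitude = sum(cmath.exp(1j * q * n * spacing) for n in range(n_cells))
    return abs(amplitude) ** 2

# At the Bragg condition (q * spacing = 2*pi) all waves add in phase,
# so the intensity grows as the square of the number of repeats.
bragg_q = 2 * math.pi
for n_cells in (1, 10, 100):
    print(n_cells, round(lattice_intensity(n_cells, bragg_q)))
```

In this model a single scatterer gives an intensity of 1, while 100 ordered scatterers give 10000 at the Bragg condition; away from that condition the waves largely cancel.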

Once we get the diffraction, then we can calculate something called the electron density, and basically then we can interpret that as a model of a protein. In this electron density we put the individual atoms and link them together. That is how, in the end, we get the three-dimensional structure. What is shown here is the first three-dimensional structure of a protein, myoglobin, which was solved in the 1950s by John Kendrew and co-workers.
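The step from diffraction to electron density is a Fourier synthesis. The following is a minimal one-dimensional sketch, assuming two hypothetical point atoms in a unit cell of length 1 and assuming that both the amplitudes and the phases of the structure factors are known (in a real experiment only intensities are measured, and the phases have to be obtained separately). All names and numbers here are illustrative, and the density is computed only up to a constant scale factor.

```python
import cmath
import math

def structure_factors(atoms, h_max):
    """F_h = sum_j f_j * exp(2*pi*i*h*x_j) for h in -h_max..h_max."""
    return {
        h: sum(f * cmath.exp(2j * math.pi * h * x) for x, f in atoms)
        for h in range(-h_max, h_max + 1)
    }

def electron_density(factors, n_grid):
    """rho(x) = sum_h F_h * exp(-2*pi*i*h*x), sampled at x = k / n_grid."""
    density = []
    for k in range(n_grid):
        x = k / n_grid
        value = sum(F * cmath.exp(-2j * math.pi * h * x) for h, F in factors.items())
        density.append(value.real)
    return density

def strongest_peaks(density, count):
    """Grid indices of the highest local maxima (periodic boundary)."""
    n = len(density)
    peaks = [k for k in range(n)
             if density[k] > density[k - 1] and density[k] > density[(k + 1) % n]]
    peaks.sort(key=lambda k: density[k], reverse=True)
    return peaks[:count]

# Hypothetical atoms: (fractional position, scattering weight).
toy_atoms = [(0.20, 6.0), (0.70, 8.0)]
factors = structure_factors(toy_atoms, h_max=10)
density = electron_density(factors, n_grid=100)
peak_indices = strongest_peaks(density, count=2)
print([k / 100 for k in peak_indices])
```

The two strongest peaks of the synthesised density fall at the positions of the toy atoms, which is the one-dimensional analogue of placing atoms into the electron density.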

Figure 3

How do we collect these X-ray data? We need a very strong X-ray beam, and in the lab we use something like this, which is an X-ray generator. This is one called Rigaku FR-E, which we recently obtained at the University of Queensland (figure 3). It is one of the strongest home sources available; there are just a few of those available in the world. An even stronger beam can be obtained from a synchrotron. A synchrotron is a huge building where electrons or positrons go round in a circle, and whenever they change direction we get very strong X-rays coming out. I am sure all of you would have heard that Australia has recently decided to build its own synchrotron, which is going to be located at Monash University.

Today I want to address the issue of how we use structure to get some information about the functions of proteins. I want to illustrate it with three examples. Obviously, if I want to go through those three examples I can't go into very much detail on every one of them, but hopefully I will give you a little snapshot of where this is going and how it works. Basically, what I want to do is show how every one of these projects is slightly different in the way we look at things.

The first project is on a protein involved in nuclear transport. This is an example of a traditional approach: we know the function of the protein but we want to understand it better. Knowing the structure, we want to understand the function, and how it fits into the cell, better. That is the traditional approach, how we used to do structural biology.

The second project I want to touch on is trying to look at a number of proteins at the same time, or in a more high-throughput manner, trying to understand the process by looking basically at every protein that is involved in it. This is something we call structural proteomics, and in particular we are targeting a particular cell line called the macrophage, which is very important in our bodies.

The third topic is slightly different: how do we now use this structural information that we have obtained in a more general sense? This one is an example of one of the ways we can do that. In this case we are trying to predict substrates for an important class of proteins called protein kinases, which have a very important role in signalling.

Project no. 1 is nucleocytoplasmic transport. This is actually a collaboration with several other groups.

Figure 4

In the cell we have an important organelle called the nucleus, and that is where most of the transcription happens and a lot of other important things. But all the proteins are synthesised outside this nucleus and they have to get in to the nucleus from the cytoplasm (figure 4).

Figure 5

Figure 5 is an electron micrograph of a section through the nucleus, and if you look very carefully you can see little holes in this membrane which separates the nucleus from the cytoplasm. These holes are called nuclear pores, and that is the only place where the molecules can actually get in.

Figure 6

Figure 6 is a schematic diagram of what people think the nuclear pore looks like, from the information we have. It is a huge protein complex. There is a ring of proteins, and inside there is a little pore. The way the proteins get in is by using transporters – they use other proteins which have a particular function as transporters.

The most abundant pathway to get in is a pathway which involves importin-alpha and importin-beta, a dimer of two proteins, and then a cargo protein has to have a little signal on it, called the NLS or nuclear localisation signal. Importin-alpha recognises this NLS, binds to this, and this trimeric complex gets in to the nucleus.

Figure 7

In order to understand how this recognition happens, we looked at the structure of importin-alpha, which is the part of the complex responsible for recognising the NLS. We solved the structure. It is a very interesting structure, because the whole protein is made of sequence repeats, called the armadillo repeats. You can see that the resemblance to a real armadillo is absolutely amazing (figure 7).

Figure 8

We studied the way this protein binds these little signals by making small peptides corresponding to the signal sequences and determining the structures of the complexes. In figure 8 you can see that the little peptide binds to this protein, in one case in two places, and in the other case there is one longer peptide which binds across. So there are different modes, and this tells us how this happens. It is a very, very elegant way for proteins to be able to recognise another amino acid sequence.

Because this protein is repetitive, as I said, we can represent it like this: every repeat is one of these things. It will have conserved amino acids in particular places, in exactly the same place when we get to the next repeat.

Figure 9

This makes something like a zipper (figure 9). As you can see, there are tryptophans here, there are asparagines here, and these signal peptides bind into the zipper, with every amino acid of the signal going into a different pocket in the zipper. It is a very elegant way to bind these peptides.

Figure 10

What we can do now that we have this structural information, knowing how they bind to the protein, is to line them up as we have done here (figure 10). I don't want you to read all these letters, but the bottom line is that in the end we can decide which are the important things and which are the less important things in this binding, and we can come up with something called a consensus sequence. This tells us that these amino acids are absolutely necessary in the signal sequence for it to be recognised efficiently.

Why is that of use? That is of use because now, knowing this consensus sequence, we can look for the sequence motif in other proteins – proteins with unknown function – so we can basically identify other proteins which will be transported in to the nucleus and become resident there.
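Searching other proteins for a consensus motif can be sketched with a regular expression. The pattern below, K-(K/R)-X-(K/R), is a commonly quoted simplified pattern for a short basic nuclear localisation signal; it is an assumption used for illustration and is not the consensus derived in this work. The protein names and sequences are made up, except that the first one embeds the well-known SV40 large T antigen signal PKKKRKV.

```python
import re

# Assumption: simplified basic-cluster pattern K-(K/R)-X-(K/R).
# This stands in for a real consensus and is for illustration only.
NLS_PATTERN = re.compile(r"K[KR].[KR]")

def find_candidate_nls(sequence):
    """Return (start_index, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in NLS_PATTERN.finditer(sequence)]

# Hypothetical sequences; toy_protein_A contains the SV40 signal PKKKRKV.
toy_proteins = {
    "toy_protein_A": "MSTPPKKKRKVEDLQ",
    "toy_protein_B": "MSTAGLLDEVASGQ",
}

hits = {name: find_candidate_nls(seq) for name, seq in toy_proteins.items()}
for name, found in hits.items():
    print(name, found)
```

A match only flags a candidate. Whether such a protein is really imported into the nucleus still has to be tested experimentally.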

Figure 11

Very recently we did another experiment to try to find this consensus sequence in a slightly different way (figure 11). What we did here was to make a soup of different peptides, where we had random amino acids at particular positions. We made them bind to importin-alpha and then we just selected out the ones that bound better – the ones that actually bound – and discarded all the ones that didn't. We then sequenced that, and we found the consensus sequence here, or, to put it in a different way, what the optimal binding motif is. This is the motif we got from our structural studies and from knowing the signal peptides in various proteins. This [H-K-S/Q-K-K-K] is the one we got out of this peptide library. They are not quite the same but there are similarities between them, and this is what we are trying to address now, to find how these compare and why they are different. This will, hopefully, help us even better identify new nuclear proteins.
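Reading an optimal binding motif out of the selected peptides amounts to counting which residue is preferred at each position. A minimal sketch follows, assuming the bound peptides have already been sequenced and aligned. The peptide list is invented toy data, and the 50% threshold is an arbitrary assumption rather than the criterion used in the experiment.

```python
from collections import Counter

def position_preferences(peptides):
    """Count residues at each position of equal-length, aligned peptides."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("peptides must all have the same length")
    return [Counter(p[i] for p in peptides) for i in range(length)]

def optimal_motif(peptides, min_fraction=0.5):
    """Most common residue per position, or 'x' where nothing dominates."""
    motif = []
    for counts in position_preferences(peptides):
        residue, n = counts.most_common(1)[0]
        motif.append(residue if n / len(peptides) >= min_fraction else "x")
    return "".join(motif)

# Hypothetical peptides that survived the binding selection (toy data).
toy_selected = ["AKKRKV", "GKKRKL", "SKRRKV", "TKKKKI"]
print(optimal_motif(toy_selected))
```

Positions where no residue reaches the threshold are reported as `x`, meaning that the selection shows no clear preference there.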

To summarise this part: NLS peptides take advantage of this interesting structure, importin-alpha, to bind very elegantly to the surface. They utilise different features of the binding groove – I did not really go into much detail – to accomplish this binding and regulation. And finally, what interests us in the end is that we can use this information to identify new nuclear proteins now that we know what is important in these peptides. This was an example where we knew the function of the protein, we studied the structure, we found out new things and this can give us some general implications for functionally annotating other proteins. So that is one approach.

Another approach is that we actually take a process and try to identify what the function is of every single protein that seems to have a role in there. This is a relatively recent project; it started in the last two years. So we are basically just going and setting everything up.

Figure 12

This is the traditional approach (figure 12). When we have a protein with a known function, we then determine its structure. This is how things have traditionally been done, because it is not that easy to determine the structure.

But in recent years, with all the technical advances, it has become easier and what we can do now instead is to take a bunch of proteins and, in a more high-throughput way, go through a pipeline where we try to purify and crystallise them, and then we determine the structures of the ones for which it is technically feasible to do so, doing that much more quickly. In this way we can actually target some proteins where we have no idea what their function is. When it used to be hard to do that, nobody would give us money to do something like that. But now, because it is easier, we can do that. And now we can take advantage of the fact that when we know the structure we can actually learn something about the function. So, basically, we are going backwards as compared with the way we used to do it.

Figure 13

In particular, how we are trying to do this is by taking advantage of another technique by which we can look at things comprehensively. That is something that was mentioned yesterday too: microarrays (figure 13). We use microarrays to identify all the proteins that are involved in a certain process. In particular, at the moment what we are trying to address is the process of activation of macrophages.

Then we have this huge list of proteins which are involved in that. We purify them and we go through this pipeline and try to determine their structures, and that helps us understand what they do in that process.
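One simple way to turn microarray measurements into a target list is to keep the genes whose expression rises strongly on activation. The sketch below assumes positive expression values for resting and activated cells and a log2 fold-change cut-off of 1 (a two-fold increase). The gene names, values and cut-off are all hypothetical; the talk does not state the selection criterion actually used.

```python
import math

def activated_genes(resting, activated, min_log2_fold=1.0):
    """Genes whose log2(activated / resting) meets the cut-off, strongest first."""
    selected = []
    for gene, resting_level in resting.items():
        if gene not in activated:
            continue  # no measurement in the activated sample
        log2_fold = math.log2(activated[gene] / resting_level)
        if log2_fold >= min_log2_fold:
            selected.append((gene, log2_fold))
    selected.sort(key=lambda item: item[1], reverse=True)
    return selected

# Hypothetical expression levels (arbitrary units).
toy_resting = {"gene_a": 100.0, "gene_b": 200.0, "gene_c": 50.0, "gene_d": 400.0}
toy_activated = {"gene_a": 800.0, "gene_b": 210.0, "gene_c": 100.0, "gene_d": 100.0}

targets = activated_genes(toy_resting, toy_activated)
print(targets)
```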

So why macrophages, and what are they? Macrophages are very important cells. They are quite abundant cells in most tissues, and they are cells involved in defence, basically. They are involved in a process called innate immunity: they recognise pathogens and they try to destroy them. Macrophages are associated with inflammatory disease and cancer. The reason for that is that sometimes this defence goes wrong and in fact leads to disease.

In summary, macrophages are really, really good targets in medical terms, because on the one hand we might want to activate them to make ourselves more protected against pathogens; on the other hand we might want to make inhibitors in cases when we have inflammatory diseases and actually want to stop this response. So it is a really good medical target.

In one project, we set up a pipeline where we put all these proteins through – we purify them, crystallise them, et cetera – with a few decision points where if things don't work, we just discard them or put them aside to deal with them later, and so on.
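A pipeline with decision points can be sketched as an ordered list of stages, where a target that fails a stage is put aside together with the name of the stage that stopped it. The stage names, target records and success flags below are hypothetical placeholders, not the actual pipeline.

```python
def run_pipeline(targets, stages):
    """Push each target through the stages in order.

    Returns (solved, set_aside), where set_aside maps a target name to the
    stage at which it failed so that it can be revisited later.
    """
    solved, set_aside = [], {}
    for target in targets:
        for stage_name, succeeds in stages:
            if not succeeds(target):
                set_aside[target["name"]] = stage_name
                break
        else:
            solved.append(target["name"])  # passed every stage
    return solved, set_aside

# Hypothetical stages; each check simply reads a flag on the toy record.
toy_stages = [
    ("purification", lambda t: t["soluble"]),
    ("crystallisation", lambda t: t["crystallises"]),
    ("data collection", lambda t: t["diffracts"]),
]

toy_targets = [
    {"name": "target_1", "soluble": True, "crystallises": True, "diffracts": True},
    {"name": "target_2", "soluble": True, "crystallises": False, "diffracts": False},
    {"name": "target_3", "soluble": False, "crystallises": False, "diffracts": False},
]

solved, set_aside = run_pipeline(toy_targets, toy_stages)
print(solved, set_aside)
```

Recording the failing stage, rather than just dropping the target, matches the idea of putting proteins aside to deal with them later.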

In another case we are dealing with a large number of proteins. In fact, there are about 100 now that we are trying to deal with. Some of them are completely unknown genes; some of them have some function associated.

Figure 14

We go through the pipeline and just recently we solved one structure. So this is our first structure to come out of this pipeline, a protein called latexin (figure 14). It is a protein with very little known about the function. From the structure what we can do now is to compare it with other protein structures already known. We can find interesting similarities, and from those similarities we can now make a hypothesis of what this protein actually does and how it does it. In this particular case, it is similar to some protease inhibitors so our hypothesis is that it is involved in the regulation of degradation of proteins, and now we can do functional experiments to test that hypothesis.

Just to quickly summarise that little project: structural proteomics is a method that offers a comprehensive approach to the functional annotation of proteins. Instead of going for one protein, now we are going for a bunch of them, using this information that we get out of the structures to understand how they work.

We have established this pipeline, where we go from lists of proteins which we get out of the microarray technology through to structure determination, and the first structure that has come out of that has already given us some interesting information about how this particular protein works in the activation of macrophages.

Finally, I want to mention the third project, which is again a slightly different approach to how we use structural information. In this case, basically what we devised is a computational tool, or a bioinformatic computer program, which helps us, in this case, to predict what the substrates are for protein kinases. It is obvious why we chose PREDIKIN as the name of this, for predicting kinase, but it turns out that in the Dutch-related language Afrikaans predikin actually means 'preacher'. So we thought this was a very, very appropriate name for a program like that.

For our bodies to function the way they do, there needs to be a lot of signalling happening. The cells have to respond to what is happening around them; they have to respond to what is happening inside them. There have to be all these signal transduction pathways going on. When something happens in the cell, the signal has to be passed on for there to be some response to it. In these signal transduction pathways, the most important class of proteins is the protein kinases. Protein kinases are proteins that put phosphate groups on other proteins – they are involved in protein phosphorylation.

There are lots of these protein kinases. In humans there are over 500 of them. For many of them we have no idea what they phosphorylate. Basically, phosphorylation is the most abundant way to signal, and protein kinases are the enzymes that are responsible for putting phosphates on other proteins. The problem is that experimental approaches to finding out what the substrate is for a particular kinase are difficult. They are not easy. What we need is something that is easy and quick.

This is how we went about it. We know some three-dimensional structure information on kinases; we know something about the specificity in some particular cases of how kinases recognise substrates. Hopefully, if we put this information together we can extract rules, and then we can predict for a new kinase what its possible substrate might be.

 

Figure 15
Figure 16
Figure 17
Figure 18

Figure 15 shows one such protein, protein kinase A. You can see in figure 16 that there is a sort of cleft here. In this cleft is where the phosphorylated protein binds – this is just a little peptide from such a protein, where it binds – and if we look under this surface (figure 17) we can see that these particular amino acids, shown here in magenta, are responsible for recognising the different sequence of this peptide. For example, this side chain binds to these purple amino acids, and that is why only this type of side chain can fit in here. It is the same in all these other pockets. So we can identify several pockets (figure 18) in which amino acids are responsible for recognising these particular amino acids.

So we went through all this, extracted these rules and put them in a computer program so we can easily use it. I will give you a couple of snapshots of this program. It is a very simple program. We put in a sequence of a new kinase for which we have no idea what it does, and then this program predicts what the optimal sequence is for recognising that. Then we can use that to find a real protein that has this signal motif in it.
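The general idea of going from pocket residues to a predicted substrate motif can be sketched as follows. This is not the PREDIKIN rule set: the complementarity rule (acidic pockets prefer basic substrate residues, and so on), the pocket positions and the pocket residues are all simplifying assumptions invented for illustration.

```python
import re

ACIDIC = set("DE")
BASIC = set("KR")
HYDROPHOBIC = set("AVLIMFWY")

def preferred_class(pocket_residues):
    """Toy complementarity rule for one pocket, returned as a regex class."""
    acidic = sum(r in ACIDIC for r in pocket_residues)
    basic = sum(r in BASIC for r in pocket_residues)
    hydrophobic = sum(r in HYDROPHOBIC for r in pocket_residues)
    if acidic > max(basic, hydrophobic):
        return "[KR]"
    if basic > max(acidic, hydrophobic):
        return "[DE]"
    if hydrophobic > max(acidic, basic):
        return "[AVLIMFWY]"
    return "."  # no clear preference

def predict_motif(pockets):
    """Build a regex; position 0 is the phosphorylated Ser/Thr."""
    first = min(min(pockets), 0)
    last = max(max(pockets), 0)
    parts = []
    for position in range(first, last + 1):
        if position == 0:
            parts.append("[ST]")
        elif position in pockets:
            parts.append(preferred_class(pockets[position]))
        else:
            parts.append(".")
    return "".join(parts), -first  # motif and offset of the phosphosite

# Hypothetical kinase residues lining the pockets for substrate positions -3, -2, +1.
toy_pockets = {-3: "EEY", -2: "ED", 1: "LFP"}
motif, offset = predict_motif(toy_pockets)

toy_substrate = "MAGRRASLEE"  # made-up sequence
sites = [m.start() + offset for m in re.finditer(motif, toy_substrate)]
print(motif, sites)
```

For these toy pockets the predicted motif requires basic residues at -3 and -2 and a hydrophobic residue at +1, and the scan reports the serine at index 6 of the made-up sequence as a candidate site.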

Figure 19

Before I finish I will give you a couple of examples of how we can use this. We have used yeast, which has somewhat fewer protein kinases, to test how this works. This is one particular pathway, a DNA damage checkpoint. It is a very complicated pathway (figure 19). All these solid lines and the arrows show what is already known about what phosphorylates what; the yellow guys are the protein kinases involved. With our program we can then identify all these dashed lines of new phosphorylation events which happen for these proteins which we already know are involved in this process, so we find new connections.

Another use of our program came from the work of another group, who tried to identify all the phospho-proteins in a yeast cell. They came up with 383 phosphorylation sites in 282 proteins, and we can now use our program to try to identify which protein kinase is linked to every one of these phosphorylation sites. From that, for some unknown proteins, we can learn what they actually do. For other ones we can find new functional correlations, et cetera.
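Linking known phosphorylation sites to candidate kinases can be sketched as a motif check around each site. The two motifs below are simplified textbook-style patterns (basic residues before the site for a PKA-like kinase, an acidic residue three positions after the site for a CK2-like kinase). They, the kinase labels and the sequence are assumptions for illustration; the program itself is not reduced to this kind of exact matching.

```python
import re

# Assumption: (pattern, index of the phosphorylated residue within the pattern).
TOY_KINASE_MOTIFS = {
    "PKA-like": (re.compile(r"R[RK].[ST]"), 3),
    "CK2-like": (re.compile(r"[ST]..[DE]"), 0),
}

def kinases_for_site(sequence, site_index):
    """Names of toy kinases whose motif fits around the given site."""
    matches = []
    for kinase, (pattern, offset) in TOY_KINASE_MOTIFS.items():
        start = site_index - offset
        if start < 0:
            continue  # motif would begin before the sequence starts
        if pattern.match(sequence, start):
            matches.append(kinase)
    return matches

# Hypothetical protein with phosphorylation sites observed at indices 6 and 12.
toy_sequence = "MAGRRASLAGTPSDDEK"
toy_sites = [6, 12]

assignments = {site: kinases_for_site(toy_sequence, site) for site in toy_sites}
print(assignments)
```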

Just to summarise that part: we developed this program that can help us predict substrates for these protein kinases. We tested it and the accuracy is reasonably high. This is the only method currently available to do anything like this, and this approach may also be applicable to other biological systems. What we did for protein kinases we can do for other types of signalling molecules.

As a final overall summary, I want to leave you with this: a protein's function depends on its 3D structure. That is why we do structural biology. And structural biology is now moving from a case-by-case approach to a more comprehensive approach. That is probably the only way we can tackle this problem of functional annotation of every protein. I illustrated that in two cases, one using structural proteomics to look at a lot of proteins in a high-throughput way, and another showing how we can use structural information in a comprehensive sense to find out more about the functions of proteins.
