SCIENCE AT THE SHINE DOME 2004: ANNUAL SYMPOSIUM
A celebration of Australian science
7 May 2004
Use of protein three-dimensional structural data
in functional genomics
by Dr Bostjan Kobe
What I will try to do today is to give you a few snapshots of the role
that structural biology is playing in molecular biology today, and a couple
of snapshots of how this is evolving in projects in our lab as well.

Post-genomic era
We live in what is called the post-genomic era. The reason it is called
that is that quite a few genomes have now been sequenced, so we basically
have complete catalogues of all the proteins that make up cells and organisms,
and now the next big task is understanding what these proteins that are
encoded by these genes actually do and how they do it together. It is
this functional annotation that is a big task ahead of us.

The structure of a macromolecule determines its function
What does three-dimensional (3D) structure do for us? That is the main
thing that I am going to be talking about today. It tells us how proteins
function. Basically, the function of a protein (for example, this
one binds a certain molecule) depends on its three-dimensional structure.
The amino acid sequence determines what sort of 3D structure the protein
folds into, and this 3D structure then defines what the protein does.
We can use the structural information for other reasons. It is very helpful
when we are trying to design drugs, for medically relevant proteins, and
we can also redesign proteins. We are trying to understand molecular folding;
it can help us with that, and various other things.

X-ray crystallography
In our lab we use mainly the method of X-ray crystallography to look
at protein structures. This is a very common method; the complementary
method is NMR, nuclear magnetic resonance. Here I will just go very briefly
through what we do using this sort of technique.
We have to isolate a protein and purify it, and if we do that we have
a chance of actually crystallising it; we have to obtain a three-dimensional
crystal. The reason we have to do that is that we then put this crystal
in the X-ray beam, and from the scattered X-rays we can get a diffraction
pattern, and that diffraction pattern actually helps us determine its 3D structure.
So we go through this step here [of phase estimation].
This is a relatively old method, although it wasn't really adapted
to proteins that early on. This [shown at right-hand side of slide] is
one of the first protein structures actually solved. In the old days,
the 1950s, when the first structures were solved, they had to make up
real-life models of these things; these days we use computers to look
at the protein structures.

X-ray sources
The X-rays that we use have to be very strong and bright, and that is
why we use synchrotrons. Undoubtedly all of you have heard that we are
building a synchrotron in Australia; it is going to be down in Melbourne.
When we don't use a synchrotron we can use a slightly smaller machine
which we can put in our lab, which produces X-rays of an order of magnitude
or so lower intensity.
What I will do today is to basically give you a snapshot of three different
projects which we have going on in the lab, and they will give you an
insight perhaps into the different things that structural biology is good
for and how it is evolving.
The first one is more of a traditional approach, when we know an important
biological system and we try to understand it by looking at three-dimensional
structures. In the second one we are trying to look at a system in a slightly
more comprehensive sense, looking at a number of different proteins that
contribute to how this system works. Finally, I will mention a project
where we are trying to use structural information in a comprehensive sense
to understand how a lot of different proteins function.

Nucleocytoplasmic transport
The first project involves nuclear transport. All eukaryotic cells have
a nucleus and the transport into the nucleus, where a lot of the transcription
et cetera goes on, is an important selective process, an essential process
in every cell.

Eukaryotic cell
This is a cell. Here is a nucleus.

Cell nucleus
If we actually look at the nucleus under an electron microscope, in its
membrane we can see little pores. These are the only places where proteins
can actually get into and out of the nucleus.

Classical nuclear import pathway
They can't do that by themselves; they actually need help from other
proteins which are transport factors. There are several different pathways
by which the proteins can get in, but the most common pathway is called
the classical pathway, and this is the one we will be talking about today.
It involves basically two major proteins, importin-α and importin-β,
which form a complex. They bind a cargo protein, and the way they recognise
the cargo protein is that it has a little signal on it, called the NLS.
That is a little stretch of sequence and NLS stands for 'nuclear localisation
signal'. Together they can then get through the pore, and once they are
in, there is another protein that helps unload this cargo. It is a GTPase
called Ran.

Importin-α armadillo repeats
One thing we were particularly interested in is how this selection occurs.
How does importin-α, which is the primary receptor for these cargo
proteins, recognise only the correct proteins and not others?
A few years ago we started this project, first looking at the structure
of importin-α itself. This is its structure. It is actually quite
an unusual protein, because it is basically folded as a spiral, unlike
most other proteins, which are folded in much more complicated ways. Every
turn of the spiral is basically just three alpha helices. The individual
repeats that form these turns of the spiral are called the 'armadillo
repeats'. Here is an armadillo. You can obviously see the great resemblance
between the two, although the name doesn't come from that at all,
but from a protein in Drosophila.

Importin-α armadillo repeats
The next thing we did was to look at how these sequences, NLSs, actually
bind importin-α. We determined a few complexes of importin-α
with little peptides corresponding to these NLSs, and here we can see
different types of peptides bound to the same protein.

Binding of NLSs to importin-α
From that we can get some idea of how this recognition occurs. It turns
out that in this system the way this recognition occurs is very elegant,
one of the most elegant ways I have seen molecular recognition happen
in any system. Because of the repetitive nature of this protein, we basically
have these repeats, and in the same position in every repeat we have certain
conserved residues.
It turns out that in this spiral structure they line up in a ladder.
You can see here asparagine, asparagine, asparagine, also tryptophan,
tryptophan, tryptophan. These are exactly the right spacing to bind an
extended amino acid chain. So this NLS peptide shown here can bind in an
extended conformation, in such a way that it is pinned down along this
conserved ladder of amino acids. The specificity then comes from other
amino acids in other locations.

Binding of NLSs to importin-α
If we take our structures that show how these peptides bind and
line them up the way they are in space, we can find out which are the
most important bits of these peptides and basically come up with a consensus
sequence. So now we can much better understand what actually makes up
this NLS. We can see that there has to be a lysine in a certain position,
then an arginine; we don't need anything specific here, any amino acid
will do; lysine again here, and then we have a couple of them up here.
This type of information helps us define this NLS, which in turn helps
us search protein databases and identify other proteins that we might not
yet know experimentally go into the nucleus, because they have this type
of signal.
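As a rough illustration of this kind of database search, here is a minimal Python sketch. It assumes a simplified consensus (lysine, then lysine or arginine, then any residue, then lysine or arginine), which is an illustrative stand-in and not the consensus derived from the structures in the talk. The protein names and sequences are invented, apart from the well-known SV40 large T antigen signal PKKKRKV, which is embedded in a made-up context.

```python
import re

# ASSUMPTION: illustrative consensus only, not the one derived in the talk.
# The lookahead lets us report overlapping matches.
NLS_PATTERN = re.compile(r"(?=(K[KR].[KR]))")


def find_nls_candidates(sequence):
    """Return (start_index, matched_segment) for every overlapping match."""
    return [(m.start(), m.group(1)) for m in NLS_PATTERN.finditer(sequence)]


# Toy "database": hypothetical names and sequences.
toy_proteins = {
    "toy_nuclear": "MSTAPKKKRKVEDGS",
    "toy_cytosolic": "MSTAGLVEDGSALLE",
}

hits = {name: find_nls_candidates(seq) for name, seq in toy_proteins.items()}

for name, found in hits.items():
    print(name, found)
```

A short pattern like this one would match many proteins by chance, so a real search would probably need a longer, position-weighted consensus and some check that the motif is exposed on the protein surface.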

NLS peptide library binding to importin-α
We did another type of experiment to even better define this NLS consensus.
What we did was to make a soup of peptides, called a peptide library,
in which certain positions are randomised, so that any amino acid can
occur there. We then bind them to importin-α and
find out which are the ones that bind best. Then we can look in every
one of these positions and we can see for every amino acid how enriched
it is there. We can see here that phenylalanine is the highest binder,
lysine here, lysine here again, et cetera. So we can come up with a consensus
which is very similar to the one that we inferred from the structural
information. Both of these now help to define a very good consensus for
NLSs and to find new nuclear proteins.
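The enrichment calculation described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the peptides are invented three-residue toys, every bound peptide is assumed to come from the starting library, and enrichment is taken to be the ratio of a residue's frequency among the bound peptides to its frequency in the library at the same position. The talk does not say how enrichment was actually computed.

```python
from collections import Counter


def positional_enrichment(bound, library):
    """Per position: residue frequency among bound peptides divided by
    its frequency in the starting library (assumes bound is drawn from library)."""
    length = len(library[0])
    result = []
    for pos in range(length):
        bound_counts = Counter(p[pos] for p in bound)
        library_counts = Counter(p[pos] for p in library)
        result.append({
            aa: (bound_counts[aa] / len(bound)) / (library_counts[aa] / len(library))
            for aa in bound_counts
        })
    return result


# Invented toy data: a tiny "library" and the subset that bound.
library = ["AKA", "AKG", "GKA", "GAG", "AAG", "GAA", "ARA", "GRG"]
bound = ["AKA", "GKA", "AKG", "ARA"]

enrichment = positional_enrichment(bound, library)
# The most enriched residue at each position gives a crude consensus.
best = [max(scores, key=scores.get) for scores in enrichment]
print("".join(best))
```

Reading off the most enriched residue at each position gives the kind of consensus that can be compared with the one inferred from the structures.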
Just to summarise this part of the talk: NLS peptides take advantage
of this particular specific structure that importin-α has, with these
armadillo repeats, to bind. It has two major sites where it is binding.
NLSs utilise different features of the binding groove to accomplish binding
and its regulation. And, finally, the structures and the peptide library
experiments that I showed you help us define the NLS consensus and help
us find new nuclear proteins from the genome.

Structural proteomics of macrophage proteins
This was an example of a specific traditional-type project. We concentrated
on importin-α, we were looking at it, we were looking at how it interacts,
et cetera. Now I am going to illustrate to you a different type of approach
which we started a couple of years ago.
What we said here is that we were going to take an important process
(in particular, we chose macrophages) and look at a bunch of
proteins that are involved in that process and try to understand how they
work together and how they contribute to their function.

Traditional approach / Structural proteomics approach
Here is an illustration of how this differs from what I told you before.
Here [in traditional approach] we concentrate on one protein; usually
we know the function, we get the structure, and we find out more
about the function. Here [in structural proteomics approach] we go with
a bunch of proteins, for many of which we have no idea what they do. Then
we go and funnel, and basically this is slightly more effective, because
we are picking out the ones that are a bit easier to work with. And then
finally we get the structures of them and that helps us define the functions.

Microarray to structure structural proteomics pipeline
The way we have gone about it, as Julie [Campbell] mentioned, is that
we are actually using microarrays to help us come up with the initial
list of proteins. With the microarrays we define the proteins that are
involved in a certain process, and then we choose the ones that we want
to put through this pipeline for the structures, usually proteins
that don't have a known structure and are not homologous to proteins with
a known structure. We chose a system where, hopefully, this will also lead to
new therapeutics.
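The target-selection step can be pictured as a simple filter. The Python sketch below is an illustration only: the record fields, the protein names and the three boolean criteria are assumptions made for the example, not the lab's actual selection rules.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one protein coming off the microarray list."""
    name: str
    induced: bool                      # up-regulated in the process of interest
    has_known_structure: bool          # a structure is already available
    homolog_has_known_structure: bool  # a homologous protein has a structure


def select_targets(candidates):
    """Keep induced proteins with no structure and no structurally known homologue."""
    return [
        c.name
        for c in candidates
        if c.induced
        and not c.has_known_structure
        and not c.homolog_has_known_structure
    ]


# Invented examples.
candidates = [
    Candidate("protein_1", True, False, False),
    Candidate("protein_2", True, True, False),
    Candidate("protein_3", True, False, True),
    Candidate("protein_4", False, False, False),
]

targets = select_targets(candidates)
print(targets)
```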
The system that we chose is macrophages. These are specific, very important
cells in mammals, and they are involved in immunity, particularly in innate
immunity, which is the very old immune system. They actually detect pathogens
by the receptors they have, and respond to the pathogens by engulfing
them and trying to destroy them. So they have a very good, important function.
But often this function goes wrong and they are actually responsible for
a lot of diseases. They are associated with inflammatory disease and cancer;
that is when this sort of process goes wrong or becomes continuous and
does not stop when it is supposed to stop. That means that macrophage
proteins will be targets for two important classes of therapeutics: one
is when we want to boost the immune system and the other one is when we
want to target inflammatory disease, for example.

Experimental protocol
Here we started a more high-throughput approach. This is just an experimental
protocol. We select proteins and then we put them through this process,
which we have pretty much automated so a lot of this can be done very
quickly using some robotic equipment or other high-throughput approaches.
Basically, what we are also doing here is that the ones that are technically
difficult we leave aside (maybe in a few years the technology will
improve and we can go back to those) and continue with the ones
that are a bit easier to work with, which gives us high throughput, and
we can more effectively get the new structures and new data.

Latexin
To illustrate what we get out of this program, here is one of the resulting
structures: this is the structure of a protein called latexin. Latexin
is the only known mammalian carboxypeptidase inhibitor. Carboxypeptidase
is an enzyme that cleaves bits off the carboxy terminus of a protein, and
this is an inhibitor of that. It is expressed at high levels in the brain
and mast cells, and we are interested in it because it is expressed highly
when macrophages are activated. It is also homologous to some other proteins
which are tumour suppressors, which is quite interesting. This is why
we chose it for our program; in mouse macrophages it is inducible by LPS,
which is the marker for pathogens.
It has two domains, an interesting structure. There are two separate
bits with one linker. It turns out that these particular domains are homologous
to some other structures which we have seen before, something we could
not find out just from sequence comparisons. These domains were first
found in cystatins, which are cysteine protease inhibitors. So there is
obviously a connection here. We have got a carboxypeptidase inhibitor
and it turns out to be similar in structure to some cysteine protease
inhibitors.

Conservation / Likely binding surface
How does it bind carboxypeptidase? We don't really know yet, because
we have not been able to crystallise the complex with carboxypeptidase,
but by analysing the structure of latexin we can find out possible surfaces
where it binds. We can look at conserved regions on the surface, which
are usually the important functional sites. You can see here, for
example, that there is one green patch, which is identical to the green
patch that you find by other methods, just by looking for the surface features
that are usually found in protein-protein interaction sites. So
we have a pretty good idea where it binds carboxypeptidase.
Our task now is learning as much as we can from these structures
about the function and linking it back to why they would be important in
macrophages. Why do we need a carboxypeptidase inhibitor in macrophages?
Here is a summary of all the data that we have combined together. We
have found likely carboxypeptidase interaction sites, which gives us the biochemical
molecular function of this protein. The structural similarity with cystatins
suggests to us that it may actually interact with other proteins. Maybe
it is also a cysteine protease inhibitor. Nobody has actually tested that
function before; this now gives us a hypothesis and we can go and
test it.
Microarray experiments show this induction in macrophage activation,
and it is co-induced with a bunch of other proteins. So it is one of about
400 proteins that get induced during that activation process. So we can
look at all those other proteins. Are there similar proteins that are co-induced?
It turns out that a bunch of cysteine protease inhibitors are also co-induced,
so this gives us another hypothesis about its cellular function.
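Looking for similar proteins among the co-induced set amounts to grouping induced genes by annotation. The Python sketch below shows one way that could look. The gene names, fold changes, annotations and the induction threshold are all invented for illustration and are not taken from the microarray data described in the talk.

```python
from collections import defaultdict


def co_induced_by_annotation(log2_fold_change, annotation, threshold=1.0):
    """Group genes induced at or above the (assumed) threshold by annotation."""
    groups = defaultdict(list)
    for gene, change in sorted(log2_fold_change.items()):
        if change >= threshold:
            groups[annotation.get(gene, "unannotated")].append(gene)
    return dict(groups)


# Invented toy data.
log2_fold_change = {
    "gene_a": 2.5,
    "gene_b": 0.2,
    "gene_c": 1.8,
    "gene_d": 3.1,
    "gene_e": -0.7,
}
annotation = {
    "gene_a": "cysteine protease inhibitor",
    "gene_b": "kinase",
    "gene_c": "cysteine protease inhibitor",
    "gene_d": "carboxypeptidase inhibitor",
}

groups = co_induced_by_annotation(log2_fold_change, annotation)
for label, genes in groups.items():
    print(label, genes)
```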
One suggestion is that latexin may be a regulator of apoptosis and other
signalling processes, together with the other cysteine protease inhibitors,
because cysteine proteases are very important in apoptosis, programmed cell
death. It is obviously playing an important role, and it now gives us
a lot to go on with in a focused way to try to find out exactly what it
does.
To summarise the second type of approach: The structural proteomics method
gives us a comprehensive approach to functional annotation of proteins.
We have established this high-throughput methodology so we can go through
a lot of proteins quickly. And, finally, I told you a little bit about
an example, the structure of latexin.
I have just a couple of minutes left, in which I will quickly try to
tell you about the last thing that I want to mention: how we use structural
information in the most comprehensive sense to learn about the functions
of proteins, in particular about a program that we developed called PREDIKIN.
It has to do with protein kinases. Protein kinases are particular enzymes
that put phosphates onto other proteins, and this is the most important
type of signalling in a cell.

Signalling diagram
These are a few signalling pathways in a cell, and a lot of these proteins
here are protein kinases.
So an important thing that we want to know is what these protein kinases
(or a particular protein kinase, of which there are hundreds)
actually phosphorylate. What are the substrates? In this way we can place
a kinase in a certain biological process. In order to do so, we need to be able
to do it cheaply and quickly, and that is why we want to develop rapid
bioinformatic tools. That is what we have tried to do here.

Design principles of PREDIKIN
What we have tried to do is to use structural information, together with
other types of information, to combine that and come up with
rules so we can predict what the substrates for particular kinases will
be.

Protein kinase A
This is an example of a protein kinase. It has a little groove where
the substrate binds, and here is a structure of a substrate bound to a
protein kinase. If you look under this surface you can find what actually
determines what binds, and what determines which amino acid goes in each
of these positions, that is, what the specificity is. So there are these
different pockets.

Protein kinase substrate production
Here are the rules that we came up with. There is a lot of work that
went into developing these rules, but what we get in the end is to put
these rules in a program so somebody can just place the sequence in here
and the program tells you what its optimal substrate will be. You can
then go search protein databases to find out about them.
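To make the idea of rule-based prediction concrete, here is a minimal Python sketch. It is not PREDIKIN and does not reproduce its rules: the rule table, the choice of three "specificity-determining" kinase columns and the example sequences are all hypothetical. The sketch only shows the general shape of such a program, where residues at chosen kinase positions are mapped to preferred substrate residues around the phosphorylated serine or threonine, and the result is used as a search pattern.

```python
import re

# HYPOTHETICAL rule table, for illustration only.
# (kinase column, residue in that column) ->
#     (substrate position relative to the phospho-site, preferred residues)
HYPOTHETICAL_RULES = {
    (0, "E"): (-3, "RK"),
    (0, "L"): (-3, "LIV"),
    (1, "D"): (-2, "RK"),
    (2, "F"): (+1, "LIVM"),
}


def predict_substrate_pattern(determinants, window=(-3, 1)):
    """Build a regex for the substrate around a phosphorylated S/T.

    determinants: kinase residues at the assumed specificity-determining
    columns, in column order.
    """
    preferences = {}
    for column, residue in enumerate(determinants):
        rule = HYPOTHETICAL_RULES.get((column, residue))
        if rule is not None:
            position, allowed = rule
            preferences[position] = allowed

    parts = []
    for position in range(window[0], window[1] + 1):
        if position == 0:
            parts.append("[ST]")          # the phosphorylated residue
        elif position in preferences:
            parts.append("[" + preferences[position] + "]")
        else:
            parts.append(".")             # no rule: any residue
    return "".join(parts)


pattern = predict_substrate_pattern("EDF")
# Search an invented sequence for the predicted substrate motif.
matches = [m.start() for m in re.finditer(pattern, "GGRRASLGG")]
print(pattern, matches)
```

The resulting pattern can then be run over a protein database in the same way as any other sequence motif.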

Yeast: DNA damage checkpoint pathways
I will just mention a couple of applications. We looked at yeast pathways,
and using our methodology we can predict a lot of new connections. All
these dashed lines are for connections we can predict.

Yeast phospho-proteome analysis
In yeast a lot of phospho-proteins have been already experimentally determined,
and we can now associate protein kinases with every one of these phospho-proteins,
as we have done here. This gives us a tremendous insight into how the
yeast cell works.
I will just do a summary of the whole thing. Protein function is determined
by three-dimensional structure. Structural biology is moving from a case-by-case
traditional approach to a more comprehensive approach. The way we can
use structural information is in structural proteomics, the type of project
that we are doing with macrophages, or by using structural information
in bioinformatics.
Questions/discussion
Question: Macrophages are a very diverse type of cell, as you
pointed out. So first of all I wanted to ask you: What is the source of
your macrophages, and what sort of excitatory state are they in at the
time when you look at your proteins?
To follow that up, I would make the comment that David Hume believes
that the macrophage may indeed be an adult stem cell, that it is highly
plastic and it can become other cell types. He cites, for example, that
15 per cent of cells in virtually every organ of your body are macrophages.
I want to ask you, as a secondary question: Have you observed any proteins
that would be consistent with that hypothesis that macrophages are highly
plastic cells?
To answer the first part of your question: there are a lot of different
processes; there are different ways macrophages can get activated, and
there are different types of macrophages. Obviously, I did not go into
a lot of detail regarding this sort of thing. At the moment, all the work
that we have done with the structural approach has been on one particular
process, where we use one particular type of macrophage, which is bone
marrow derived macrophages, and one particular activation process, which
is by lipopolysaccharide, or LPS. That is a kind of bacterial marker, on
the surface of bacteria, so it mimics exposure to bacteria. So basically
that is what we are doing at the moment.
David Hume's group are doing a lot of cell biology. They are looking
at this in a much broader context. They are looking at different types
of activation, using microarray technology, and also different types of
macrophages, also the related cell type osteoclasts, and so on.
I am probably not qualified to answer the second part of your question.
This is a problem that everyone has when they are using these kinds
of large-scale approaches, such as microarray technology. You come up with
a large number of proteins that are involved in a particular process.
A lot of work has gone into how you can now analyse that: how do you extract
the important bits, how do you confirm that all this you are seeing in
the microarray actually is important, is induced? Obviously there is another
step here. We are looking at RNA molecules being induced, and then they have
to be translated into proteins. RNA induction doesn't always necessarily
correlate with which proteins are induced, et cetera. So there are all
those questions there.
I don't think we are actually at the stage where we can answer your
question of which of those particular proteins relate back to stem
cells and play a role there.
Question: I love your approach for looking at the protein-protein
interactions, but we know that most of the genome doesn't code for proteins.
It is sort of like the Dark Matter and the Dark Energy of the genome.
Are we going to miss something important by concentrating solely on protein-protein
interactions?
This is one approach to understanding how the cell works, and it doesn't
cover everything, just as you are saying. Proteins obviously play an important
role; they are probably the major work force in the cell. But many other
things go on in the cell, and as you are saying, not all RNAs are actually
translated into proteins. There are a lot of RNAs that have important
functions. In Australia there is a lot of good research going on in that
area. In particular, John Mattick has been trying to get the idea across
that there are a lot of RNAs which function as RNAs and they are probably
important in regulating how cells work.
Then there are all these other bits which do not even become RNA, and
why they are there and what they are doing is another big question. We
can't address everything simultaneously. We are concentrating on one bit,
which is proteins, and we are not really addressing other things. But
other researchers are addressing those areas. Together, hopefully, we
can some day come up with a picture of how the whole cell works together.
Question: I am interested in the microarray technology that
you mentioned. What is the basis of actually detecting the proteins? I
am a bit familiar with the gene arrays concept, but in proteins do you
use antibodies, or is it a gel formulation? What is the standard mechanism
that is used?
Sorry, I didn't really explain this properly because I was going through
it fast and I explained it very much at the surface. We are not using
protein arrays at all. In our application we are using cDNA microarrays,
pieces of DNA. This technology, as you are alluding to, is actually extendable
to proteins, looking at protein-protein interactions, but it is at a very
experimental stage at this point. It is not really generally adopted,
I think, at this point.
Question: In your work where you are wanting to look at the
structure of many proteins, I am just wondering whether what we are doing
in Australia with the synchrotron will be right for you or not. I have
heard overseas that they are going to have automatic crystallising systems,
and the crystal will be automatically put onto the synchrotron, and all
of these things. Are we thinking about that in Australia?
Yes, we are. In fact, the synchrotron is projected to come on line about
2007 or so. I think at that point in time we will be in a pretty good
position where a lot of these methodologies are going to be developed
and automated. So perhaps from Brisbane, for example, we won't even have
to travel to Melbourne to use the synchrotron ourselves. I think it is
very likely that at that point in time we will already have these automated
systems to do this, and perhaps we will just have to send our crystals
down there and there is going to be a technician putting our crystals
on, and the data will be measured and beamed back to Brisbane. This is
possibly how a lot of the projects are going to work, but probably not
all. There are always going to be experimentally difficult projects which
robots et cetera will not be able to deal with. For that reason I think
it is going to be still very essential that we have a synchrotron in Australia
with easier access: we can go there quickly and do our own experiments.
But you are right that this automation is really picking up. The types
of projects like the macrophage project that I described (they are
usually called structural genomics or structural proteomics) are
what is driving this methodology. The big aim of this whole initiative,
by people all over the world, is to get the structure of a representative
protein from every family, which
will basically be a structurome or a structural proteome, so we will know,
basically, all the structures that exist in nature. That is what is driving
all that, and to do that we have to do structures much more quickly than
we do at the present time, or than we did a decade ago. That is what is
driving all this automation. There are a lot of crystallisation robots
et cetera coming up; a lot of the things that we used to do by hand can
now be done by machines.
Question: Listening to what you are saying, I am just wondering
if there is a possibility, and an approach, to try and then use the structural
information you get to predict protein-protein interactions, which is
another level on; clearly these proteins are not working by themselves
but are interacting with other families of proteins. Do you know of initiatives
that will take your sort of data and then go to the next level of prediction?
I totally agree with you that proteins don't work alone and they interact
with each other. There are a lot of approaches to detecting which
proteins interact with which, on a
large scale. They have been used for yeast cells and in other systems
as well. These are yeast two-hybrid systems or affinity capture type approaches.
So we can do that.
This complements our structural analysis. We (and most people all
over the world, if they are doing high-throughput work) are currently
just going for single proteins or even fragments of proteins, to get structural
information. But other information is out there on what these proteins
interact with, and also from the structural data on the complexes we actually
are starting to learn what makes a protein-protein interaction site. For
the structure of latexin that I mentioned, we used one such program
that looks at the surface and finds out which are the surface features
which are most likely to be involved in a protein-protein interaction.
That is how I predicted what might be the carboxypeptidase binding site.
Putting these things together, using them together, is where we can find
out which proteins interact with each other and how they do that, and
then we have to go to larger scales, look at big complexes in the cell,
maybe using cryo-electron microscopy and methods like that. All that together
will give us a combined picture of what interacts with what, where it
is in the cell, and all that kind of stuff. We hope to get a visual cell
in the end.