My name is Bob Tjian, I'm a professor at the University of California at Berkeley, where I've taught many years in Molecular Biology and Biochemistry, and more recently, I've also taken on the job of being the President of the Howard Hughes Medical Institute. And it's my pleasure today to continue with my second lecture in this series, to describe to you some exciting ideas about how gene regulation works, particularly in more complex organisms.
Now, in my last set of lectures, I left you with this view of the type of complexity that has to be evolving to allow the type of gene expression patterns that we see in the many, many organisms that we know exist on this planet. And so, there's some really intriguing questions that I'm going to address in this second lecture. And one thing I left you with was an image of the interplay of many molecules that have to come together, and to land on a particular site of the DNA molecule that's part of the chromosome of an organism or within a cell of an organism, and how this process might work. But, I think the question that's plagued us for decades, now that we had a better idea of what this molecular machinery looks like that's involved in decoding DNA information into gene expression, we wondered why is it so complex? And to sort of begin to address this issue, let me just take you back to a simple concept.
And you remember that different organisms have different sizes of their genomes, that is, the amount of DNA that is required to encode the particular organism. And here are some examples of both bacteria, simple, single-celled prokaryotic organisms, as well as single-cell eukaryotic organisms like the baker's yeast, and then there's the little, round soil worm C. elegans, and then you can go up to mammals and vertebrates. And you'll see first of all that the amount of DNA can vary a lot, from a few million base pairs all the way up to 3 billion base pairs or more.
To go along with this sort of expanding level of DNA and chromosome length, you also have different levels of genes. Now, you'll notice that the range of genes is a lot less than the range of DNA length, so this partly informs us about maybe why we need the complexity that we ultimately discovered is involved in forming this molecular machinery that's responsible for reading the genetic information. So this is just a little table to reemphasize that these more complex genomes, which also means more complex organisms, which really means a lot of different cell types, many different behaviors, complex interactions with their environment and so forth, how is all this information really decoded from our genomes? And on one side here, you see the prokaryotic core gene regulatory machinery, or the core transcription machinery, and in almost all bacteria, it's only a few polypeptides…
5, 6, 7 polypeptides. Then, on this side, you'll see that in the so-called eukaryotic organisms, and particularly when you talk about multicellular metazoan organisms, now you see huge diversity and number of proteins or, as we call them, transcription factors, that are necessary to assemble into very large, multi-subunit ensembles that are required to transcribe the 10,000 to 30,000 genes that define these more complex organisms. So, right away you can see that there's this proliferation of the subunits and the machinery and the complexity. So, in this lecture, I'm going to give you a little sense of maybe why this is the case, and what's special about the more complex, multicellular organism, and why this machinery may have had to become more elaborate through evolution, compared to simpler organisms.
Now, one of the first things that you realize when you look into the cell, or particularly the nucleus of a higher organism, let's say our own cells, versus a bacterium, is that the DNA, the very molecule that makes up the genetic information, is kind of packaged away in a very different way. So, in all eukaryotes, the double-stranded DNA doesn't sit there in the form that we would call the "naked" DNA, which is shown up at the top here. But rather, this DNA is wrapped around a set of very basic proteins to form structures called "nucleosomes," and these are in turn further packaged all the way to a highly condensed form that ultimately forms the chromosomes that you'll be able to see under a microscope. And the blue figures over here and green figures just give you a view of the high-resolution structure of a nucleosome with DNA wrapped around it. So, what is the consequence of having all of our DNA, all our chromosomes, condensed and wrapped up in this way? You can think of it as packaged away.
Well, one thing is that you can shove all this down into a small nucleus, so if we strung out our DNA in every cell in our body, out from end to end and stretched it out like a string, it's almost a meter long. And yet, you have to cram all that into a tiny, little volume. And part of the way that that happens is that you can compact the DNA by these structures. Now, the consequence of that is, of course, you somehow have to negotiate through this highly compacted form of DNA to get access to the DNA information and the genes. So, to put it another way, you have to have a machinery, a transcriptional apparatus whose job is to read DNA and, you remember from the first lecture, convert that DNA information into RNA, an intermediate molecule which ultimately then gets translated into a protein product. Well, clearly one of the reasons we have this highly elaborated transcriptional machinery is in part to deal with having to navigate through a chromatin template, as opposed to a naked DNA template.
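As a rough check on the "almost a meter" figure, the arithmetic can be sketched in a few lines of Python. The 3 billion base pair count is the one quoted earlier in this lecture; the 0.34 nm rise per base pair (the textbook value for B-form DNA) and the 10 micrometer nuclear diameter are assumptions added here for illustration, not numbers from the lecture.

```python
# Back-of-the-envelope estimate of stretched-out DNA length versus nucleus size.
BASE_PAIRS = 3.0e9            # ~3 billion bp, as quoted in the lecture
RISE_PER_BP_NM = 0.34         # assumption: textbook rise per base pair, B-form DNA
NUCLEUS_DIAMETER_UM = 10.0    # assumption: order-of-magnitude nuclear diameter

def dna_length_m(base_pairs, rise_nm=RISE_PER_BP_NM):
    """Length in meters of DNA laid end to end, given a base pair count."""
    return base_pairs * rise_nm * 1e-9

haploid_m = dna_length_m(BASE_PAIRS)
# How many nucleus diameters long the stretched DNA would be.
ratio = haploid_m / (NUCLEUS_DIAMETER_UM * 1e-6)

print(f"Stretched length for one genome copy: about {haploid_m:.2f} m")
print(f"That is roughly {ratio:,.0f} times the assumed nuclear diameter")
```

Under these assumptions one copy of the genome comes out at about a meter, on the order of a hundred thousand times longer than the nucleus is wide, which is the scale of compaction that the packaging has to achieve.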
And so there are various proteins and protein complexes that are called "chromatin remodeling complexes," "chromatin modifying complexes," and these have to coordinate with the transcriptional machinery down here in the yellow and the orange, in order to navigate and basically express a series of interactions that are transactions between the protein machinery and the DNA. So this is a very challenging problem. So that's part of the problem, or part of the reason why we think there's such complexity.
So, how did we come to this picture? How did we finally get to figuring out that there were over 85 proteins that all have to assemble on a chromatin template, to give you gene expression and transcription, in the right place, in the right time? And I want to just give you one sort of quick look into a technology that one can use to address the issue of, how do we break down this complex machinery into understandable units? And as I said in the first lecture, there are many tools that molecular biologists and biochemists can use to try to tease out these complex molecular transactions. One of them, of course, is to use genetics, which is to use genetic mutation to either remove or alter one particular gene product and then ask what is the consequence. The other way to do it is to actually take a cell with all of its complexity and break it down literally into its component parts, and then try to put it back together again in a functional form.
And that's what I'm going to show you today. And it's a technology I kind of call the "biochemical complementation assay." And it's very simple: You ask, what are the minimal components, for example, in the case of a human gene… what are the minimal protein components of the transcriptional apparatus that you can extract from the nucleus of a cell that you need to put into a test tube that will allow you to essentially reconstruct or, as we say, reconstitute the activity that will allow you to read the gene in an accurate fashion? And you can keep adding or taking away different proteins, the yellow ones, the green ones, the orange ones, and so forth, and ask, does it make any difference? And by playing this adding and subtracting, or "biochemical complementation," assay, you can very quickly discover what are the minimal components you need to activate a gene in a regulated fashion, and what are other things that might be necessary to support this activity.
So, the first question that was asked was from the biochemical analysis of about four dozen different proteins: What are really necessary and sufficient? In other words, what's the minimal component set that you need to give you regulated transcription? So we're now asking a more complicated question. Not only what is necessary to just simply give you transcription, in other words the conversion of DNA into RNA, but to do it in a regulated fashion. Because after all, that's what's really interesting… is why one cell does it in one way and a different cell has a different program. And this experiment here says that our sequence-specific classical transcription factor that binds DNA at its regulatory promoter region, together with what we will call the "core" or "basal" machinery of transcription, is necessary but not sufficient. So, plus or minus the activator Sp1 doesn't make any difference, even though we know that in a living cell, Sp1 is highly activating this gene that we're looking at.
So, that means there's something missing in this reconstitution experiment. So, how do we go find what's missing? And this biochemical complementation really relies on our ability to take the cells that contain the necessary components and the sufficient components, and then start to extract it and to find which molecules are missing that we're not adding to our reaction yet. And to do that, we basically have to take the cells, in this case, human cells, break the cells apart, extract the nucleus, remove all the proteins from the nucleus, and begin to separate the thousands of different proteins that are in the nucleus into different pools, if you like.
And we separate them based on their physical and chemical properties, and some of you probably have had some experience in running column chromatographs. This is basically a way of separating proteins based on their positive charge, negative charge, molecular size, hydrophobicity (in other words, how greasy they are… how well they interact with water), and so forth. So if you do that iteratively, as is shown here in a series of different anion exchange and cation exchange, as well as gel filtration, chromatographs, you can eventually separate the thousands of different components of a nuclear extract into its individual parts.
And then you can test each one to see if they're the missing piece. And when you do that, lo and behold, you find that there are a couple of missing pieces that are necessary for you to add back, in other words, reconstitute, the reaction so that now you have regulated transcription. So unlike the previous data that I showed you, now you can see that the machinery is more complex and, most importantly, you can see also that the machinery is now responsive to the activator. So, the signal with the activator, plus Sp1, is much darker than the signal without Sp1. That means that there is activated transcription that is dependent on Sp1, a classical transcription factor. So that allowed us to identify two very important, key components that we didn't know about before we did this experiment: One is a multi-subunit complex called "Transcription factor II D," or TFIID, and the other one is called the Mediator complex. And these turn out to actually define an entirely new class of transcription factors, which are the so-called co-factors.
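The add-and-subtract logic of the biochemical complementation assay can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: the readout function, the tenfold activation value, and the two "unrelated fraction" names are hypothetical and made up for illustration; only the names Sp1, TFIID, Mediator, and the core machinery come from the lecture.

```python
# Toy model of a biochemical complementation assay (illustrative only).
COFACTORS = {"TFIID", "Mediator"}  # the two co-factors named in the lecture

def transcription_signal(fractions, sp1_present):
    """Hypothetical readout: 0 without core machinery, 1.0 for basal
    transcription, and an arbitrary 10.0 when Sp1 and both co-factors
    are present together."""
    fractions = set(fractions)
    if "core machinery" not in fractions:
        return 0.0
    if sp1_present and COFACTORS <= fractions:
        return 10.0  # arbitrary value standing in for the darker band
    return 1.0

def responds_to_sp1(fractions):
    """True if adding Sp1 increases the signal for this set of fractions."""
    return transcription_signal(fractions, True) > transcription_signal(fractions, False)

def required_for_response(fractions):
    """Leave-one-out test: which fractions abolish the Sp1 response when removed."""
    fractions = set(fractions)
    if not responds_to_sp1(fractions):
        return set()
    return {f for f in fractions if not responds_to_sp1(fractions - {f})}

# "unrelated fraction A/B" are hypothetical stand-ins for pools that do not matter.
all_fractions = {"core machinery", "TFIID", "Mediator",
                 "unrelated fraction A", "unrelated fraction B"}

print(responds_to_sp1({"core machinery"}))   # core alone: no Sp1 response
print(responds_to_sp1(all_fractions))        # full mix: Sp1 response restored
print(sorted(required_for_response(all_fractions)))
```

In this toy, the core machinery alone gives the same signal with or without Sp1, while removing any one of the core machinery, TFIID, or Mediator from the full mix abolishes the response, and removing an unrelated fraction does not. The real experiment reads this out from band intensities rather than from a rule written in advance.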
So I'm going to tell you a little bit more about one of these co-factors, because they both really perform similar functions, but we happen to know quite a bit more about one of them than the other. So this so-called TFIID complex has roughly 15 subunits, in other words, 15 separate proteins that have to mesh together to form a complex. And it's a very large macromolecule, so it's a million daltons… that's a very, very large, floppy molecule, with many pieces to it. One of its functions you already know about, because it contains as one of its subunits the so-called "TATA-binding protein." That's that saddle-shaped molecule that binds to double-stranded DNA, at the AT-rich sequence called a TATA box, which is associated with many genes in animal cells.
But what we've come to learn in the last decade or so is that this little complex is doing much more than just simply binding to the TATA box; it's doing a whole bunch of other things that we didn't have any idea about. And now that we knew the existence of this activity and that it was critical not only for TATA binding, but also for mediating or potentiating transcription activation, we then could break down more of its functions of individual subunits, because you remember there's 15 different polypeptides here.
And this is just a little summary showing you that this complex of proteins is doing a lot of different functions. It's recognizing the nucleosomes, which have a basic protein called a "histone," and so it recognizes histones only when it's got a certain chemical modification called an acetylation event. This big orange complex also itself has enzymatic activity, including kinase activity, which can put phosphate groups on other proteins and enzymes. It has acetylase activity, and of course, it has to interact directly with activators in order to potentiate their function in turning on transcriptional activation. And I'm probably safe in speculating that there are yet unknown functions of this large complex that we still have to discover, because we've really only understood maybe half of the subunits, and even there, only partially understood the functions of that half of the subunits that are part of this complex.
So, there's clearly much more work to be done, but I think what's clear from these experiments is that these proteins are doing a lot more than just binding DNA. They're what I would think of as integrators of information. So, this integrator of information means that this structure and the function is very complex, and so, one of the things that we've had to do… it's been a very challenging problem that remains challenging, because we haven't solved all the technical problems… is that because it's a large, megadalton, floppy molecule, solving the three-dimensional structure of such large assemblies has proven to be rather technically challenging.
And we have to use many different techniques to try to address this: X-ray crystallography, NMR… but one of the techniques that's emerging, that's very, very powerful for solving the structures of these large assemblies is something called "cryo-electron microscopy." It's basically a way of freezing these large assemblies in place, and then solving their structure by microscopy. And this is a structural determination at about 25 angstroms, so relatively low resolution, of the human TFIID complex and, most importantly, its relationship to two other transcription factors that are part of the assembly that has to align itself on the promoter to start transcription, and those are the other two transcription factors, TFIIA and TFIIB, which are shown in green and purple here. So you can slowly start building up the entire complex in pretty accurate three-dimensional space to figure out what its shape will tell us about its function, and that's something that's an ongoing project in many laboratories in molecular biology.
So, this cartoon… and again I want to emphasize that all the figures there and the colored blobs are more a part of our imagination at this point, although, as I just showed you, we actually have real structures of some components of this pre-initiation complex. This slide just emphasizes the point that there's a lot of information integration going on, and that there are protein-protein and protein-nucleic acid interactions that are critical for the regulatory functions of these large, macromolecular assemblies. And this also reminds you that there are at least three separate classes of transcription factors that are playing a key role in the regulation of genes: the classical activators and repressors, which are sequence-specific DNA-binding proteins, like the Sp1 protein I talked to you about earlier, just shown here in pink; there are the components of the core machinery, which are shown in yellow; and then you have these things we call co-factors or co-activators, which are integrating information between the activators and the core machinery. So this kind of gives you a slightly better view of why there's this kind of complexity, but it still doesn't really address all of the issues with respect to: Why do you need 85 proteins to do this? So, let me dig a little deeper into this.
So, first, let me just pose some of the questions that are really still largely unresolved in the field, even though this is a pretty mature area of study; we've been trying to address these issues for a couple of decades, and it goes to show how difficult it is to really tease apart this complex molecular machinery. And I should say that the complexity of this machinery is not unique to the transcriptional apparatus. Many other biological processes are also dependent on macromolecular machines that are very similar in complexity to this one.
So I think things that we learn about the transcriptional machinery could be applied in principle to many other machineries. So, a couple of interesting questions: What are the transcriptional mechanisms that regulate complex cell types? Because, after all, multicellular organisms evolved to have many, many different cell types, so our bodies are made up of many different cell types, which means that each cell's performing a different function. Our hair follicle cells are producing hair, our red blood cells are producing hemoglobin and doing something else, our skin cells are protecting us. Each cell type is doing a different thing, so how does this happen, how do we generate this diversity of cell types through the gene regulatory networks? And then, knowing what we now know about the first level of complexity of the machinery that's responsible for decoding this information, what more can we learn about the process of regulation now? Particularly, what is the division of labor between the core machinery (which binds to the promoter), the activators, and the co-activators? So, what is their relationship, and what are their respective roles in defining cell type-specific gene expression? That's really the last topic that I want to cover in this lecture.
So, let's review a few basic facts about individual cell types. So, let's take two well-recognized cell types: fat cells and muscle cells. Very different cells that perform very different functions, but every cell in a particular organism has the same genetic information. It has the same DNA, it has the same set of chromosomes. That means that these two cells have to be using different parts of the information from the genome to give it their distinct identities. So, each cell must only express some subset of the genes, and that particular subset would define the function of a fat cell versus a muscle cell. And, so then the question becomes: Okay, that makes sense, but how do you get there? How do you get cell type-dependent differential gene expression patterns? How do you turn on the right genes to make fat versus keeping the muscle cell gene functions turned off, and vice versa? So that is a fundamental question of trying to understand the process of cellular differentiation, cell-specific function, and really, developmental biology.
Another set of interesting points to make is that, of the 20,000 to 30,000 genes that a typical metazoan organism encodes, a pretty big chunk of it is devoted to the very machinery that I'm talking about, in other words, the transcription factors. So roughly somewhere between 5 and 10% of the entire coding capacity of genes in a genome is devoted to encoding transcription factors. So this is clearly a very important class of molecules. So that means there are several thousand transcription factors. But now if you start thinking about the many, many thousands of cell types and the behavior of different cells, are a few thousand transcription factors, in and of themselves, enough to generate the diversity of function? And this is where we have to start thinking about, how do you create really large numbers of distinct transcriptional networks? And they really are networks, as you'll see in a minute. And one thing that became clear as we defined what genes look like and what a promoter as a transcriptional unit looks like, we come to understand that the only way to create the kind of huge levels of diversity of distinct transcriptional components and patterns, is to do it by combinatorial regulation.
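The numbers in this argument can be sketched with a short Python calculation. The gene counts and the 5 to 10% fraction are the ones quoted above; the pool sizes and set sizes passed to the helper (10 choose 4, and 3 factors out of 1,000) are arbitrary illustrations chosen here, not figures from the lecture, and the helper name is hypothetical.

```python
from math import comb

# Figures quoted in the lecture.
GENES_LOW, GENES_HIGH = 20_000, 30_000
TF_PERCENT_LOW, TF_PERCENT_HIGH = 5, 10

# Rough range for the number of transcription factor genes.
tf_low = GENES_LOW * TF_PERCENT_LOW // 100
tf_high = GENES_HIGH * TF_PERCENT_HIGH // 100
print(f"Transcription factor genes: roughly {tf_low:,} to {tf_high:,}")

def unordered_sets(pool_size, set_size):
    """Number of distinct sets of `set_size` factors drawn from a pool,
    ignoring order (a binomial coefficient)."""
    return comb(pool_size, set_size)

# Illustrative pool and set sizes (assumptions, not lecture data).
print(unordered_sets(10, 4))        # a small pool already gives 210 sets
print(unordered_sets(tf_low, 3))    # sets of three factors from the low estimate
```

The point of the sketch is only the scaling: the number of possible factor sets grows far faster than the number of factors, so even sets of three drawn from a thousand factors run into the hundreds of millions. It says nothing about which combinations actually occur in cells.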
And what do I mean by that? So, one way to think about it is that you might only have ten cards, but if you shuffle those ten cards and pick four at a time, you can have many, many combinations. So here's a perfect example of three different cell types, could be in the same organism, and each of those symbols represents binding sites, and then the little boxes and triangles above them represent the binding proteins. And you can see that those three cell types might express these sets of genes in similar ways, but they use different combinations of proteins to do it. And this is really the notion of combinatorial mechanisms for gene regulation, and we now know that that is indeed the way, at least in part, that gives us the ability to create many different specific transcription patterns. I have to now also tell you about another, I would say, defining, unusual property of transcription in animal cells, and this is a hard one sometimes to get your head around.
And that is that these different little units of DNA that specify the activity of a gene don't have to be sitting, linearly and spatially, directly next to the gene that they're activating or repressing. They can sit tens of thousands of base pairs away from the site. So these we call long-distance enhancers or silencers, so they can both upregulate and downregulate a gene… in other words, make more of the gene product or less of the gene product. And the thing that was so surprising was that the intervening DNA can be very, very long; it can be thousands and maybe even millions of base pairs.
So how does this work? How can something that sits so far away actually influence transcription at a very remote site? And this is one of the big conundrums that we still face in the field. We have some models and we have some ideas that we can test, and I'll end my lecture with a few speculations about that. But clearly, we don't fully understand this so-called long-distance regulation, which clearly is regulated by activators and repressors, the same players that we've been talking about, like the Sp1 molecule and other activators. But yet, how they can reach across long distances of the chromosome to grab on to the core machinery to actually impart information and to create the kind of specific regulatory events is still somewhat obscure. So, another thing that I should say is that, because the combinatorial mechanism of generating diversity was so dependent on the distinct sets of sequence-specific DNA-binding proteins, over the last two decades we've come to kind of a traditional model that the core machinery stays relatively invariant. In fact, we kind of think of it as universal, because if you break open a nucleus of a very simple organism like yeast, or you break open the nucleus of a human cell, that machinery looks remarkably similar in both.
And yet, their gene networks are very, very different, so we thought, well, maybe it's all having to do with the sequence-specific DNA-binding proteins, that will generate the diversity through combinatorial regulation. And that's probably true; in fact, there's a lot of evidence to support that. But it was only part of the story. So, a kind of related question would be: Are we really right in thinking that the core machinery is universal and invariant? And that turns out to be an oversimplification.
So it turns out evolution didn't work that way. And when we looked very carefully in the last few years, particularly at individual, different, distinct cell types, let's say muscle versus fat, or neuron, or liver cell, we certainly see differences in the activators, as we would expect, and indeed they are working in combinatorial fashion, but they're not only working combinatorially with each other, but they are combining in different combinations with the core machinery, which is itself variable. And that was kind of a revelation that's really become more clear just in the last few years. So, in addition to the sequence-specific binding proteins and their diversity, there turns out to be a much greater degree of diversity in the core machinery, the parts that we thought were invariant, than we ever imagined. Now, once you realize that that's the case, that opens up a whole other level of generating diversity that we didn't anticipate, and that of course really allows multicellular organisms to diversify in unbelievable ways.
So, let's drill down finally a little bit at how did we find this out, and where are we going? So now, unlike a few decades ago when we first began to study the process of transcription and discovered all of this initial complexity, in those days we mainly worked on just a few different cell types. But today, we have the ability technically to work with just about any cell type, from the most complex, such as embryonic stem cells, to perhaps the simple cell, like the skeletal muscle, and everything in between… liver cells, neuronal cells, and so forth. And this has really opened up our view of just how diverse, interesting, and variable the transcriptional apparatus is, that is probably really necessary from an evolutionary standpoint to drive the diversity of gene expression and cell types that we see. The first hint that this core machinery that we thought was so invariant may not be so invariant, came from studying the development of the skeletal muscle. So when you go from a precursor cell called a myoblast, which looks like most every other mammalian cell, with its standard, prototypic core machinery, and then when you look at it when that cell type differentiates, in other words, specializes into a myotube (which will ultimately form skeletal muscle, which is the muscle around your large bones that makes you be able to move), it turns out that it not only shifts which transcriptional activators it uses, but it also jettisons the prototypic core machinery and substitutes it with some modified versions of that core machinery, which is shown down here in the purple and the bright blue.
So this was really a change in the paradigm of the way we're thinking about regulation, and of course, this was just the first example. One wanted to know if similar things were happening in other different cell types, and very quickly, if you look at hepatocytes or liver cells, if you look at adipocytes or fat cells, if you look at neuronal cells, and you compare what's going on in muscle, in every case, one can find changes in the core machinery, either because a particular component like one of the TBP-associated factors is highly upregulated (that means its concentration went way up, when all the other ones went down), or some other permutation.
In other words, clearly, components of the so-called core machinery were variable from cell type to cell type, and that really changed the way we thought about how regulation of multicellular organisms works. At the same time that we were looking at these, what we would call mature, terminally differentiated cell types, we were also looking at perhaps one of the most interesting cell types that we could study, particularly if we're interested in understanding the process of mammalian development, and those are of course the embryonic stem cells. These are those amazing cells that, when tickled with just the right chemicals or physiological signals, can turn themselves into every cell type of an organism, maybe 10,000 different cell types.
So, this so-called pluripotency made these human and mouse embryonic stem cells very special for all kinds of reasons, partly because they are amazing models to study this process of development and differentiation, but partly because of biomedical possibilities for cell regeneration and therapeutics. So we've studied this, and these are very, very new studies, and I'll just very quickly touch on it. We really were curious, how can these cells be so pluripotent? That is, their capacity to turn into every other cell type seems so amazing, what is the mechanism, what's the machinery that's going to allow these cells to be able to differentiate into every cell type in the body? And so, we began to probe this.
In some cases, we did it by the genetic technology, which is we made mutations in certain candidate regulatory factors and transcription factors, and then asked, does that have a consequence on the development of different cell types? In other cases, we used a standard biochemical complementation technology to figure out what's going on. So, I'll finish with two quick stories. So, using the genetic tools of knocking genes out and asking what effect it has on differentiation and pluripotency, we discovered that a component of the core machinery (or at least we used to think of it as being purely of the core machinery), that is, one of the TBP-associated factors, particularly TAF3, turns out to be extremely important for the regulation and expression of genes that will ultimately define the so-called endoderm. And that's true for both the so-called primitive endoderm and the definitive endoderm, which ultimately will give rise to the placenta, the yolk sac, lungs, liver, pancreas, intestines, and so forth. At the same time, knocking out this TAF3 had the opposite effect on the other two major germ layers, which are the mesoderm and the ectoderm.
So here was a really beautiful case of differential function of a transcription factor that was not a standard sequence-specific binding protein. This core machinery factor, which by the way, probably on its own doesn't even bind to DNA directly, when you knock it out, you lose the ability to form endoderm, but you elevate the probabilities of forming mesoderm and ectoderm. In other words, the balance between these different cell types gets messed up, and of course this will cause major difficulties for a developing embryo. Even more interesting and intriguing, and this really goes to show the level of information that we still lack, although TAF3 was originally defined both genetically and biochemically as part of the TFIID core promoter recognition complex, and it is absolutely true that that is the case, it had another life that it led that we didn't know about.
So TAF3, it turns out, it doesn't have to strictly function as part of this large multi-subunit core promoter complex, but it can also do other jobs, and in this case, it pairs up, or partners up, with a different transcription factor called CTCF (doesn't really matter what the name is) and now it does its job in a completely different way. And in fact, the most recent experiments suggest that TAF3 and CTCF get together to partly allow that amazing property of long-distance regulation. So, regulators bound at thousands of base pairs away from the site of activity can be brought together in three-dimensional space by what's known as "DNA looping," and it turns out that TAF3 is involved in this DNA looping, together with a whole bunch of other proteins, whose relationship to TAF3 is still not entirely clear.
And we find it particularly intriguing and exciting that this type of long-distance function is being carried by a TAF and in the context of embryonic stem cell differentiation potential to form endoderm. So this is a very, very new type of way of thinking about the core transcription factors. Likewise, when we looked at the embryonic stem cell transcriptional circuitry and asked, what other transcriptional co-regulators, or regulators and co-factors, are necessary to allow this so-called pluripotency program? This amazing ability of these cells to be able to differentiate into every other cell type, how does that happen? What is allowing that to happen in this particular cell type, and not in other cell types? And again, using the biochemical complementation technology, we recently were able to identify a new co-factor complex, again a multi-subunit complex, called the SCC, or "stem cell co-factor." And remarkably, this SCC-B turns out to be a well-known protein that again had a different lifestyle in other cell types.
It's a protein complex that had previously been described as XPC, which stands for "Xeroderma pigmentosum, complex C," which means that it's involved in DNA repair. So up until now, we thought XPC was only functioning as a DNA repair complex, and now we know that it's doing something quite different, but only in the context of ES cells, which is to form a co-factor complex that will potentiate the activity of two critical transcriptional activators, Oct4 and Sox2, which define the pluripotent, self-renewing state of ES cells. So these are just two examples of sort of what we're learning about, the continuing saga of how transcriptional machinery evolved and works in animal cells. And I'll finish with this last model slide, which just simply reiterates what I just said: We have to keep in mind that, in generating large sets of combinatorial, specific gene networks, we have to use the diversity not only of sequence-specific DNA-binding proteins, but we more and more see examples that components of the previously thought to be invariant core machinery are an integral part of diversifying the combinatorial regulation of gene expression.
And this of course opens up many new possibilities, and I suspect that there are many question marks yet about what exactly each of these components is doing to drive complex regulation that gives rise to complexity like human beings, the human brain, all the physiology that goes on. And of course, as we understand these mechanisms in greater detail, I think we have a much better chance of tackling the problems of human disease and diseases of other organisms. Because ultimately, we have to understand the molecular basis of disease, and I think a big part of that is understanding the mechanisms of gene regulation.