1. Introduction to Computational and Systems Biology

The following
content is provided under a Creative
Commons license. Your support will help MIT
OpenCourseWare continue to offer high quality
educational resources for free. To make a donation or
view additional materials from hundreds of MIT courses,
visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: Welcome
to Foundations of Computational
and Systems Biology. This course has many numbers. We'll explain all the
differences and similarities. But briefly, there are three
undergrad course numbers, which are all similar in
content, 7.36, 20.390, 6.802. And then, there are
four graduate versions. The 7.91, 20.490, and HST
versions are all very similar, basically identical. But the 6.874 has some
additional AI content that we'll discuss in a moment. So make sure that
you are registered for the appropriate
version of this course. And please interrupt with
questions at any time. The main goal today is to give
an overview of the course, both the content as well as
the mechanics of how the course will be taught. And we want to make sure
that everything is clear. So this course is
taught by myself and Chris Burge from Biology,
Professor Fraenkel from BE, and Professor Gifford from EECS.

We have three TAs, Peter
Freese and Collette Picard, from Computational and Systems
Biology, and Tahin, from EECS. All the TAs have expertise
in computational biology as well as other
quantitative areas like math, statistics,
computer science. So in addition to the lectures
by the regular instructors, we will also have guest
lectures by George Church, from Harvard, toward
the end of the semester. Doug Lauffenburger
will give a lecture in the regulatory network
section of the course. And we'll have a guest
lecture from Ron Weiss on synthetic biology. And just a note
that today's lecture and all the lectures
this semester are being recorded by AMPS,
by MIT's OpenCourseWare.

So the videos, after a
little bit of editing, will eventually end
up on OpenCourseWare. What are these courses? So these course numbers are
the graduate level versions, which are survey courses
in computational biology. Our target audience
is graduate students who have a solid
background and comfort– or, a solid
background in biology, and also, a comfort with
quantitative approaches. We don't assume that
you've programmed before, but there will be
some programming content on the homeworks. And you will therefore need to
learn some Python programming. And the TAs will help with
that component of the course. We also have some online
tutorials on Python programming that are available. The undergrad course
numbers– this is an upper level
undergraduate survey course in computational biology. And our target audience are
upper level undergraduates with solid biology
background and comfort with quantitative approaches. So there's one key difference
between the graduate and undergraduate versions,
which I'll come to in a moment. So the goal of this course
is to develop understanding of foundational methods in
computational biology that will enable you to
contextualize and understand a good portion of research
literature in a growing field.

So if you pick up Science, or
Nature, or PLOS Computational Biology and you want
to read those papers and understand them,
after this course, you will have a better chance. We're not guaranteeing you'll be
able to understand all of them. But you'll be able to recognize,
perhaps, the category of paper, the class, and perhaps,
some of the algorithms that are involved. And for the graduate
version, another goal is to help you gain exposure
to research in this field. So it's actually
possible to do a smaller scale computational biology
research project on your own, on your laptop– perhaps,
on ATHENA– with relatively limited computational
resources and potentially even discover something new.

And so we want to give
you that experience. And that's through
the project component that we'll say more
about in a moment. So just to make sure that
everyone's in the right class– this is not a systems
biology class. There are some more
focused systems biology classes
offered on campus. But we will cover
some topics that are important for
analyzing complex systems. This is also certainly not
a synthetic biology course. Some of the systems methods are
also used in synthetic biology. And there will be this
guest lecturer I mentioned, Ron Weiss, which will
cover synthetic biology. It's also not an
algorithms class.

We don't assume that
you have experience in designing or
analyzing algorithms. We'll discuss various
bioinformatics algorithms. And you'll have the
opportunity to implement at least one bioinformatics
algorithm on your homework. But algorithms and not really
the center of the course. And there's one
exception to this, which is, those of you
who are taking 6.874 will go to a special
recitation that will cover more advanced
algorithm content. And there will be special
homework problems for you as well. So that course really does
have more algorithm content. So the plan for today
is that I will just do a brief, anecdotal history
of computational and systems biology. This is to set the stage and
the context for the class.

And then we'll spend
a significant amount of time reviewing the course
mechanics, organization, and content. Because as you'll see, it's
a little bit complicated. But it'll make sense once
we go over it, hopefully. So this is my take
on computational and systems biology. Again, it's not a
scholarly overview. It doesn't hit everything
important that happened. It just gives you
a flavor of what was happening in computational
biology decade by decade. So first of all,
where does this field fall in the academic
scheme of things? So I consider
computational biology to be actually part of biology. So in the way that genetics or
biochemistry are disciplines that have strategies
for understanding biological questions, so
does computational biology. You can use it to
understand a variety of computational
questions in gene regulation, many other areas.

There is also, some
people make a distinction that bioinformatics is
more about building tools whereas computational biology
is more about using tools, for example. Although many people
don't– it's very blurry– and don't try to– people
use them in various ways. And then, you could
think of bioinformatics as being embedded in the larger
field of informatics, where you include tools for management
and analysis of data in general. And it's certainly true that
many of the core concepts and algorithms in bioinformatics
come from the field, come from computer science,
come from other branches of engineering, from statistics,
mathematics, and so forth.

So it's really a cross
disciplinary field. And then, synthetic biology
cuts across in the sense that it's really an
engineering discipline. Because you're designing
and engineering synthetic molecular
cellular systems. But you can also use
synthetic biology to help understand natural
biological systems, of course. All right, so what was
happening, decade by decade? So in the '70s, there
were not genome sequences available or large sequence
databases of any sort. Except there were starting
to be some protein sequences. And early computational
biologists focused on comparing proteins,
understanding their function, structure, and evolution. And so in order to
compare proteins, you need a protein– an amino
acid substitution matrix– a matrix that describes
how often one amino acid is substituted for another.

And Margaret Dayhoff
was a pioneer in developing these
sorts of matrices. And some of the matrices she
developed, the PAM series, are still used today. And we'll discuss those
matrices early next week. So in terms of asking
evolutionary questions, two big thinkers were Russ
Doolittle and Carl Woese, analyzing both ribosomal RNA
sequences to study evolution. And Carl Woese realized,
looking at these RNA alignments, that actually, the prokaryotes,
which had been– there was this big split
between prokaryotes and eukaryotes was
sort of a false split– that actually,
there was a subgroup of single-celled
anuclear organisms that were closer
to the eukaryotes– and named them the Archaea.

So a whole kingdom of life
was recognized, really, by sequence analysis. And Russ Doolittle also
did a lot of analysis approaching sequences and came
up with this molecular clock idea, or contributed
to that idea, to actually build– instead
of systematics being based on phenotypic characteristics,
do it on a molecular level. So in the '80s, the
databases started to expand. Sequence alignment and
search became more important. And various people
developed fast algorithms to compare protein and DNA
sequences and align them. So the FASTA program
was widely used. BLAST– several of the authors
of BLAST are shown here– David Lipman, Pearson, Webb
Miller, Stephen Altschul. The statistics for knowing
when a BLAST search result is significant were developed
by Karlin and Altschul. And there was also progress
in gapped alignment, in particularly,
Smith-Waterman, shown above. Also progress in RNA
secondary structure prediction from Nusinov and Zuker. We'll talk about all of these
algorithms during the course. And there was also development
of literature databases. I always liked this picture. Many of you probably
used PubMed.

But Al Gore was well coached
here, by these experts, in how to use it. And then, in the '90s,
computational biology really started to expand. It was driven partly
by the development of a microarrays, the first
genome sequences, and questions like how to identify
domains in a protein, how to identify
genes in the genome. It was recognized that
this family of models from electrical engineering,
the hidden Markov model, were quite useful for these sort
of sequence labeling problems. That was really pioneered
by Anders Krogh here, and David Haussler.

And a variety of
algorithms were developed that performed
these useful tasks. There was also
important progress in the earliest comparative
genomic approaches, since you have–
the first genomes were sequenced in the mid
'90s, of free living organisms. And so you could then start
to compare these genomes and learn a lot. We'll talk a little bit
about comparative genomics. And there was important progress
on predicting protein structure from primary sequence. Particularly, David Baker
made notable progress on this Rosetta algorithm.

So it's a biophysics
field, but it's very much part of
computational biology as well. So in the 2000s, definitely,
genome sequencing became very fashionable,
as you can see here. And the genomes of
now larger organisms, including human, it became
possible to sequence them. And then, this
introduced a huge host of computational challenges
in assembling the genomes, annotating the
genomes, and so forth. And we'll hear from
Professor Gifford about some genome
assembly topics.

And annotation will
come up throughout. Actually, let's
just mention, this is Jim Kent, who's the guru
who did the first human genome assembly– at least,
that was widely used– and also was involved in UCSC. And here, Ewan Birney
has started Ensembl and continues to run it today. You know who these other
people are, probably. OK. All right. So in another phase
of the last decade, I would say that much
of biological research became more high throughput
than it was before. So molecular biology
had traditionally, in the '80s and
'90s, mostly focused on analysis of individual
gene or protein products. But now it became possible,
and in widespread use, that you could measure the
expression of all the genes, in theory– using
microarrays, for example– and you could start
to profile all of the transcripts
in the cell, all of the proteins in the
cell, and so forth.

And then, a variety
of groups started to use some of these
high throughput data to study various challenges
in gene expression to understand how transcription
works, how splicing works, how microRNAs work, translation,
epigenetics, and so forth. And you'll hear updates on some
of that work in this course. Bioimage informatics,
particularly for developmental
biology, became popular. It continues to be
a new emerging area. Systems biology was also really
born around 2000, roughly. A very prominent example
would be the development of the first gene
regulatory network models that describe sea urchin
development, here, my Eric Davidson, as well as a whole
variety of models of other gene networks in the cell
that control things like cell proliferation,
apoptosis, et cetera. At the same time, a new
field of synthetic biology was born with the development
of some of the first completely artificial gene
networks that would then program cells to perform
desired behavior. So an example would be this
so-called repressilator, where you have a network of
three transcription factors. Each represses the other. And then, one of
them represses GFP. And you put these into bacteria.

And it causes oscillations
in GFP, expressions that are described by these
differential equations here. And some of the
modeling approaches used in synthetic
biology will be covered by Professor Fraenkel
and Lauffenburger later. All right. So late 2000s, early
2010s, it's still too early to say, for sure, what the most
important developments will be. But certainly, in
the late 2000s, next gen sequencing–
which now probably should be called second
generation sequencing, since there may be
future generations– really started to transform
a whole wide variety of applications in biology,
from making genome sequencing– instead of having to be
done in the genome center, now an individual lab can easily
do microbial genome sequencing.

And when needed,
it's possible, also, to do genome sequencing
of larger organisms. Transcriptome sequencing
is now routine. We'll hear about that. There are applications
for mapping protein-DNA interactions genome
wide, including both sequence specific
transcription factors as well as more general
factors like histones, protein-RNA interactions–
a method called CLIP-Seq– methods for mapping
all the translated messages, the methylated sites in
the genome, open chromatin, and so forth. So many people contributed
to this, obviously. I'm just mentioning,
Barbara Wold was a pioneer in both
RNA-Seq as well as ChIP-Seq. And some of the sequencing
technologies that came out here are shown here. And we will discuss those
at the beginning of lecture on Thursday. So I encourage you to
read this review here, by Metzger, which covers
many of the newer sequencing technologies. And they're pretty interesting. As you'll see, there's
some interesting tricks, interesting chemistry and
image analysis tricks.

All right. So that was not very scholarly. But if you want a proper
history, then this guy, Hallam Stevens, who was a
History of Science PhD student at Harvard and
recently graduated, wrote this history
of bioinformatics. OK. So let's look at the syllabus. So also posted on the
[INAUDIBLE] site is a syllabus. It looks like this. This is quite an
information-rich document. It has all the lecture
titles, all the due dates of all the problem
sets, and so forth. So please print yourself a
copy and familiarize yourself with it. So we'll just try to
look at a high level first, and then zoom
in to the details. So at a high level, if you
look in this column here, we've broken the course
into six different topics. OK? So there's Genomic
Analysis I, that I'll be teaching, which is more
classical computational biology, you could say– local
alignment, global alignment, and so forth. Then, Genomic Analysis II,
which Professor Gifford will be teaching, covers
some newer methods that are required
when you're doing a lot of second
generation sequencing– the standard algorithms
are not fast enough, you need better
algorithms, and so forth.

And then, I will come back and
give a few lectures on modeling biological function. This will have to do with
sequence motifs, hidden Markov models, and RNA
secondary structure. Professor Fraenkel
will then do a unit on proteomics and
protein structure. And then, there will
be an extended unit on regulatory networks. Different types of
regulatory networks will be covered, with most
of the lectures by Ernest, one by David, and one by Doug. And then, we'll finish up
with computational genetics, by David. And there will also be some
guest lecturers, one of them interspersed in
regulatory networks, and then two at the end, from
Ron Weiss and George Church.

So I just wanted to point out
that in all of these topics, we will include some discussion
of motivating question. So, what are the
biological questions that we're seeking to address
with these approaches. And there will also
be some discussion of the experimental method. So for example,
in the first unit, it's heavy on sequence analysis. So we'll talk about
how sequencing is done, and then, quite a bit
about the interaction between the
experimental technology and the computational
analysis, which often involves statistical methods for
estimating the error rate of the experimental
method, and things like that.

So the emphasis is on
the computational part, but we'll have some
discussion about experiments. Everyone with me so far? Any questions? OK. All right. So what are some of these
motivating questions that we'll be talking about? So what are the instructions
encoded in our genomes? You can think of the
genome as a book. But it's in this very
strange language. And we need to understand the
rules, the code that underlies a lot of research
in gene expression. How are chromosomes organized? What genes are present– so
tools for annotating genomes. What regulatory
circuitry is encoded? You'd like to be able to
eventually look at a genome, understand all the
regulatory elements, and be able to predict
that there's some feedback circuit there that responds to–
a particular stimulation that responds to light, or
nutrient deprivation, or whatever it might be. Can the transcriptome be
predicted from the genome? This is a longstanding question. The translatome, if you
will– well, let's say, the proteome can be predicted
from the transcriptome in the sense that we
have a genetic code and we can look
up those triplets.

So there's a dream
that we would be able to model other
steps in gene expression with the precision with which
the genetic code predicts translation– that we'd
be able to predict where the polymerase would start
transcribing, where it will finish transcribing, how a
transcript will be spliced, et cetera– all the other
steps in gene expression. And that motivates a lot
of work in the field. Can protein function be
predicted from sequence? So this is a very
classical problem. But there are a number of new
and interesting developments as resolved from a lot of
this high throughput data generation, both in
nucleic acid sequencing as well as in proteomics. Can evolutionary history be
reconstructed from sequence? Again, this has been a
longstanding goal of the field.

And a lot of progress
has been made here. And now most evolutionary
classifications are actually based on molecular
sequence at some level. And new species are often
defined based on sequence. OK. Other motivating questions. So what would you
need to measure if you wanted to discover
the causes of a disease, the mechanisms of
existing drugs, metabolic pathways
in a micro-organism? So this is a systems
biology question. You've got a new bug. It causes some disease. What should you measure? Should you sequences its genome? Should you sequence
its transcriptome? Should you do proteomics? What type of proteomics? Should you perturb the system in
some way and do a time series? What are the most
efficient ways? What information should
be gathered, and in what quantities? And how should that information
be integrated in order to come up with an
understanding of the physiology of that organisms
so that, then, you can know where to
intervene, what would be suitable drug targets? Yeah.

What kind of modeling
would help you to use the data to
design new therapies, or even, in a synthetic
biology context, to re-engineer organisms
for new purposes? So microbes to
generate– to produce fuel, for example, or
other useful products. What can we currently measure? What does each type of
data mean individually? What are the strengths
and weaknesses of each of the types of
high throughput approaches that we have? And how do we integrate all
the data we have on a system to understand the
functioning of that system? So these are some of
the questions that motivate the latter topics
on regulatory networks. OK. So let's now zoom in
and look more closely at the course syllabus.

I've just broken
it into two halves, just so it's more readable. So today we're going over,
obviously, course mechanics mostly. On Thursday, we'll cover
both some DNA sequencing technologies and we'll talk
about local alignment on BLAST. More on that in a bit. The 6.8047 recitation–
6.874, thank you– recitation will be on Friday. The other recitations
will start next week. And then, as you can see, we'll
move through the other topics. So each of the
instructors is going to briefly review their topic. So I won't go through
all the titles here. But please note,
on the left side here, that their assignment
due dates are marked. OK? And they're all due at
noon on the indicated day. And so some of these
are problem sets. So for example,
problem set 1 will be due on Thursday,
February 20 at noon.

And some of the
other assignments relate to the project component
of the course, which we're going to talk more
about in a moment. In particular,
we're going to ask you to submit a brief
statement of your background and your research interests
related to forming teams. So the projects are
going to be done in teams of one
to five students. And in order to
facilitate especially cross disciplinary
teams– we'd love if you interact
with, maybe, students in a different grad
program, or whatever– you'll post your background.

You know, I'm a
first year BE student and I have a background
in Perl programming, but never done Python, or
whatever– something like that. And then, I'm interested
in doing systems biology modeling in microbial systems,
or something like that. And then you can match
up your interests with others and form teams. And then you'll come up
with your own project ideas so that the team
and initial idea will be due here, February 25th. Then you'll need to do
some aims and so forth. So the project components
here, these are only for those taking the grad
version of the course.

We'll make that clear later. OK. So after the first
three topics here, taught by myself and David,
there will be an exam. More on that later. And then there will be three
more topics, mostly taught by Ernest. And notice there are additional
assignments here related to the project– so,
to research strategy– and the final written report,
additional problems sets. We'll have a guest
lecturer here. This will be Ron Weiss. Then, there will
be the second exam. Exams are non-cumulative,
so the second exam will just cover these three
topics here predominantly. And then there will be
another guest lecturer. This would be
George Church here. And then notice,
here, presentation. So those who are doing
the project component, those teams will be given–
assigned a time to present, to the class, the results
of their research. And you'll be graded–
the presentation will be part of the overall
project grade assigned by the instructors. But you'll also– we'll
also ask all the students in the class to send comments
on the presentations. So you may find that you
get helpful suggestions about interpreting your data
from other people and so forth.

So that'll be a
required component of the course for all students,
to attend the presentations and comment on them. And we hope that
will be a lot of fun. OK. So is this the
right course for me? So I just wanted to
let you know you're fortunate to have a rich
selection of courses in computational systems,
synthetic biology here at MIT. I've listed many of them. Probably not all,
but the ones that I'm aware of that are
available on-campus. 757 is really only for
biology grad students. But the other courses listed
here are generally open. Some are more geared for
graduate students, some more undergrads. Some are more specialized. So for example, Jeff Gore's
systems biology course, it's more focused
on systems biology whereas our course covers both
computational and systems. So keep that in mind. Make sure you're
in the right place, that this is what you want. OK. A few notes on the textbook–
so there is a textbook. It's not required. It's called Understanding
Bioinformatics by Zvelebil and Baum. It's quite good
on certain topics. But it really only covers
about, maybe, a third of what we cover in the course.

So there is good content
on local alignment, global alignment,
scoring matrices– the topics of the
next couple lectures. And I'll point you
to those chapters. But it's very
important to emphasize that the content of
the course is really what happens in lecture,
and on the homeworks, and to some extent, what
happens in recitation. And the textbook is just there
as a backup, if you will, or for those who
would like to get more background on the
topic or want to read a different description
of that topic. So you decide whether you
want to purchase the textbook or not.

It's available at the
Coop or through Amazon. Shop around. You can find it. It's paperback. Pretty good general reference
on a variety of topics, but it doesn't really have
much on systems biology. All right. Another important reference
that was developed specifically for this course a few years
ago is the probability and statistics primer. So you'll notice that
some of the homeworks, particularly in the earlier
parts of the course, will have significant
probability and statistics. And we assume that you have
some background in this area. Many of you do. If you don't, you'll
need to pick that up. And this primer was written
to provide those topics, in probability especially,
that are foundational and most relevant to
computational biology. So for example, there
are some concepts like p-value, probability
density function, probability mass function, cumulative
distribution function, and then, common distributions,
exponential distribution, Poisson distribution,
extreme value distribution. If those are mostly sounding
familiar to you, that's good. If they're familiar, but
you couldn't– you really don't– you get binomial and
Poisson confused or something, then, definitely, you want
to consult this primer.

So I think, looking at the
lectures and the homeworks, it should probably
become pretty clear which aspects are
going to be relevant. And I'll try to point
those out when possible. And you can also
consult your TAs if you're having trouble with
the probability and statistics content. So we are going to
focus, here, on, really, the computational biology,
bioinformatics content. And we might briefly review
a concept from probability, like, maybe, conditional
probability when we talk about Markov chains.

But we're not going to
spend a lot of time. So if that's the
first time you've seen conditional probability,
you might be a little bit lost. So you'd be better off
reading about it in advance. OK? Questions? No questions? All right. Maybe it's that the video
is intimidating people. OK. All right. The TAs know a lot about
probability and statistics and will be able to help you. OK. So homework– so I
apologize, the font is a little bit small here. So I'll try to state it clearly. So there are going
to be five problem sets that are roughly
one per topic. Except you'll see p set two
covers topics two and three. So it might be a
little bit longer. The way we handle students who
have to travel– so many of you might be seniors. You might be interviewing
for graduate schools. Or you might have other
conflicts with the course.

So rather than
doing that on a case by case basis, which, we've
found, gets very complicated and is not necessarily fair,
the way we've set it up is that the total number of
points available on the five homeworks is 120. OK? But the maximum score
that you can get is 100. So if you, for example,
were to get 90% on all five of the homeworks,
that would be 90% of 120, which would
be 108 points. You would get the
full 100– you'd get 100% on your homework. That would be a perfect
score on the homework. OK? But because of that– because
there's more points available than you need– we don't
allow you to drop homeworks, or to do an alternate
assignment, or something like that. So the way it works
is you can basically miss– as long as
you do well on, say, four of the homeworks–
you could actually miss one without
much of a penalty. For example, if each of the
homeworks were worth 24 points and you got a perfect
score on four of them, that would be 96 points.

You would have an almost
perfect score on your homework and you could miss
that fifth homework. Now of course, we
don't encourage you to skip that homework. We think the homeworks are
useful and are a good way to solidify the information
you've gotten from lecture, and reading, and so forth. So It's good to do them. And doing the homeworks
will help you and perhaps prepare you for the exams. But that's the way we
handle the homework policy. Now I should also mention that
not all the homeworks will be the same number of points.

We'll apportion the
points in proportion to the difficulty and length
of the homework assignment. So for example, the
first homework assignment is a little bit easier
than the others. So it's going to have
somewhat fewer points. So late assignments–
so all the homeworks are due at noon on
the indicated day. And if it's within
24 hours after that, you'll be eligible
for 50% credit. And beyond that, you don't
get any points, in part because the TAs will be posting
the answers to the homeworks. OK? And we want to be able to post
them promptly so that you'll get the answers while
those problems are fresh in your mind.

Questions about homeworks? OK. Good. So collaboration
on problem sets– so we want you to
do the problem sets. You can do them independently. You can work with a friend
on them, or even in a group, discuss them together. But write up your
solutions independently. You don't learn
anything by copying someone else's solution. And if the TAs see duplicate
or near identical solutions, both of those
homeworks will get a 0. OK? And this occasionally happens. We don't want this
to happen to you. So just avoid that. So discuss together, but write
up your solutions separately. Similarly, with
programming, if you have a friend who's a more
experienced programmer than you are, by
all means, ask them for advice, general things, how
should I structure my program, do you know of a function
that generates a loop, or whatever it is that you need.

But don't share code
with anyone else. OK? That would be a no-no. So write up your
code independently. And again, the graders will
be looking for identical code. And that will be thrown out. And so we don't want
to have any misconduct of that type occurring. All right. So recitations– there are
three recitation sessions offered each week,
Wednesday at 4:00 by Peter, Thursday at 4:00 by Colette,
Friday at 4:00 by Tahin. And that is a special
recitation that's required for the 6.874
students– David? Yes– and has
additional AI content. So anyone is welcome to
go to that recitation. But those who are
taking 6.874 must go. For students registered for the
other versions of the course, going to recitation is optional
but strongly recommended. Because the TAs will go over
material from the lectures, material that's helpful
for the homeworks or for studying for exams–
in the first weeks, Python, probability as well. So go to the
recitations, particularly if you're having
trouble in the course. So Tahin's recitation
starts this week. And Peter and Colette's
will start next week.

Question. Yes. AUDIENCE: Are they covering
the same material, Peter and Colette? PROFESSOR: Peter and Colette
will cover similar material and Tahin will cover
different material. Python instruction– so
the first problem set, which will be
posted this evening, doesn't have any
programming on it. But if you don't
have programming, you need to start
learning it very soon. And so there will be a
significant programming problem on p set two. And we'll be posting that
problem soon– this week, some time. And you'll want to
look at that and gauge, how much Python
do I need to learn to at least do that problem. So what is this
project component that we've been hearing about? So here's a more
concrete description. So again, this is only
for the graduate versions of the course.

So students will–
basically, we've structured it so you
work incrementally toward the final
research project and so that we can offer
feedback and help along the way if needed. So the first assignment
will be next week. I think it's on Tuesday. I'll have more instructions
on Thursday's lecture. But all the students
registered for the grad version will submit their
background and interests for posting on the
course website. And then you can
look at those and try to find other students who
have, ideally, similar interests but perhaps somewhat
different backgrounds.

It's particularly
helpful to have a strong biologist on the team. And perhaps, a strong
programmer would help as well. So then, you will
choose your teams and submit a project title
and one-paragraph summary, the basic idea of your project. Now we are not providing a
menu of research projects. It's your choice–
whatever you want to do, as long as it's related
to computational and systems biology. So it could be analysis of
some publicly available data. Could be analysis of
some data that you got during your rotation. Or for those of you who are
already in labs that you're actually working
on, it's totally fine and encouraged that the
project be something that's related to your main PhD work
if you've started on that. And it could also be more in
the modeling, some modeling with MATLAB or something, if
you're familiar with that.

But, a variety of possibilities. We'll have more
information on this later. But we want you to form teams. The teams can work independently
or with up to four friends in teams of five. And if people want
to have a giant team to do some really
challenging project, then you can come
and discuss with us. And we'll see if
that would work. So then there's
the initial title and one-paragraph summary. We'll give you a little
feedback on that. And then you'll submit an
actual specific aims document– so with actual,
NIH-style, specific aims– the goal is to understand
whether this organism has operons or not, or some actual
scientific question– and a bit about how you will
undertake that.

And then you'll submit a longer
two-page research strategy, which will include,
specifically, we will use these data. We will use this software,
these statistical approaches– that sort of thing. And then, toward the end of
the semester, a final written report will be due
that'll be five pages. You'll work on it
together, but it'll need to be clear who did what. You'll need an author
contribution statement. So and so did this analysis. So and so wrote this
section– that sort of thing. And then, as I
mentioned before, there will be oral presentations,
by each team, on the last two course sessions. OK? Questions about the projects? There's more information on the
course info document online. OK? Good. Yes, so for those taking 6.874,
in addition to the project, there will also be
additional AI problems on both the p
sets– and the exam? David? Yes. OK. And they're optional for others. All right. Good. OK. So how are we going
to do the exam? So as I mentioned, there's
two 80-minute exams. They're non-cumulative. So the first exam
covers, basically, the first three topics. The second exam is on
the last three topics.

They're 80 minutes. They're during
normal class time. There's no final exam. The grading– so for those
taking the undergrad version, the homeworks will count 36%
out of the maximum 100 points. And the exams will count 62%. And then, this
peer review, where there's two days
where you go, and you listen to presentations,
and you submit comments online counts 2%. For the graduate
Bio BE HST versions, it's 30% homeworks, 48% exams,
20% project, 2% peer review. For the EECS version, 6.874,
25% homework, 48% exams, 20% project, and then, 5% for
these extra AI related problems, and 2% peer review. And those should add up to 100. And then, in addition, we
will reward 1% extra credit for outstanding
class participation– so questions,
comments during class. OK. All right. So a few announcements
about topic one. And then, each of us will review
the topics that are coming up. So p set one will
be posted tonight. It's due February 20th, at noon. It involves basic microbiology,
probability, and statistics.

It'll give you some
experience with BLAST and some of the
statistics associated. P set two will be
posted later this week. So you don't need to, obviously,
start on p set two yet. It's not due for several weeks. But definitely, look at
the programming problem to give you an idea
of what's involved and what to focus on when
you're reviewing your Python. Mentioned the
probability/stats primer. Sequencing technology–
so for Thursday's lecture, it will be very helpful if you
read the review, by Metzger, on next gen sequencing
technologies. It's pretty well written. Covers Illumina,
454, PACBIO, and a few other interesting
sequence technologies. Other background
reading– so we'll be talking about local
alignment, global alignment, statistics, and similarly
matrices for the next two lectures. And chapters four and
five of the textbook provide a pretty good
background on these topics.

I encourage you to take a look. OK. So I'm just going to
briefly review my lectures. And then we'll have David
and Ernest do the same. So sequencing
technologies will be the beginning of lecture two. And then we'll talk about
local ungapped sequence alignment– in
particular, BLAST. So BLAST is something
like the Google search engine of bioinformatics,
if you will. It's one of the most
widely used tools. And it's important to understand
something about how it works, and in particular,
how to evaluate the significance of BLAST
hits, which are described by this extreme value
distribution here. And then, lecture three, we'll
talk about global alignment and introducing gaps
into sequence alignments. We'll talk about some
dynamic programming algorithms– Needleman-Wunsch,
Smith-Waterman. And in lecture four,
we'll talk about comparative genomic analysis
of gene regulation– so using sequence similarity
across genomes two infer location of regulatory elements
such as microRNA target sites, other things like that.

All right. So I think this is now– oh
sorry, a few more lectures. Then, in the next unit,
modeling biological function, I'll talk about the
problem of motif finding– so searching a set of sequences
for a common subsequence, or similar subsequences,
that possess a particular
biological function, like binding to a protein. It's often a complex
search space. We'll talk about
the Gibbs sampling algorithm and some alternatives. And then, in lecture
10, I'll talk about Markov and hidden
Markov models, which have been called the Legos
of bioinformatics, which can be used to model a
variety of linear sequence labeling problems. And then, in the last
lecture of that unit, I'll talk a little
bit about protein– I'm sorry, about RNA–
secondary structure– so the base pairing of
RNAs– predicting it from thermodynamic
tools, as well as comparative genomic approaches.

And you'll learn
about the mfold tool and how you can use
a diagram like that to infer that this RNA may have
different possible structures that I can fold into,
like those shown. All right. So I'm going to pass
it off to David here. PROFESSOR: Thanks
very much, Chris. All Right. So I'm David Gifford. And I'm delighted to be here. It's really a
wonderfully exciting time in computational biology. And one of the reasons
it's so exciting is shown on this slide, which
is the production of DNA base sequence per
instrument over time. And as you can see, it's
just amazingly more efficient as time goes forward. And if you think about the
reciprocal of this curve, the cost per base is basically
becoming extraordinarily low. And this kind of
instrument allows us to produce
hundreds of millions of sequence reads for
a single experiment. And thus do we not only need
new computational methods to handle this kind
of a big data problem, but we need
computational methods to represent the results
in computational models.

And so we have multiple
challenges computationally. Because modern biology
really can't be done outside of a
computational framework. And to summarise the
way people have adapted these high throughput DNA
sequencing instruments, I built this small
figure for you. And you can see that–
Professor Burge will be talking about DNA
sequencing next time. And obviously, you
can use DNA sequencing to sequence your own
genomes, or the genomes of your favorite pet,
or whatever you like. And so we'll talk about,
in lecture six, how to actually do
genome sequencing. And one of the challenges
in doing genome sequencing is how to actually find
what you have sequenced.

And we'll be talking about how
to map sequence reads as well. Another way to
use DNA sequencing is to take the RNA species
present in a single cell, or in a population of
cells, and convert them into DNA using
reverse transcriptase. Then we can sequence the DNA
and understand the RNA component of the cell, which,
of course, either can be used as messenger
RNA to code for protein, or for structural RNAs, or
for non-coding RNAs that have other kinds of functions
associated with chromatin. In lecture seven,
we'll be talking about protein/DNA interactions,
which Professor Burge already mentioned– the idea
that we can actually locate all the regulatory
factors associated with the genome using a single
high throughput experiment.

We do this by isolating the
proteins and their associated DNA fragments and
sequencing the DNA fragments using this DNA
sequencing technology. So briefly, the first
thing that we'll look at in lecture five is,
given a reference genome sequence and a basket
of DNA sequence reads, how do we build
an efficient index so that we can either
map or align those reads back to the
reference genome. That's a very important
and fundamental problem. Because if I give you a
basket of 200 million reads, we need to build its
alignment very, very rapidly, and quickly, and
accurately, especially in the context of
repetitive elements. Because a genome obviously
has many repeats in it.

We need to consider how
our indexing and searching algorithms are going to handle
those sorts of elements. In the next lecture,
we'll talk about how to actually sequence a
genome and assemble it. So the fundamental way that
we approach genome sequencing given today's
sequencing instruments is that we take intact
chromosomes at the top, which, of course, are hundreds
of millions of bases long, and we shatter them into pieces. And then we sequence
size selected pieces in a sequencing instrument. And then we need to put the
jigsaw puzzle back together with a computational assembler.

So we'll be talking about
assembly algorithms, how they work, and furthermore,
how resolve ambiguities as we put that puzzle
together, which often arise in the context of
repetitive sequence. And in the next lecture,
in lecture seven, we'll be looking at how to
actually take those little DNA molecules that are
associated with proteins and analyze them to figure out
where particular proteins are bound to the genome and how they
might regulate target genes.

Here we see two
different occurrences of OCT4 binding events
binding proximally to the SOX2 gene, which
they are regulating. So that's the beginning of our
analysis of DNA sequencing. And then we'll look at RNA
sequencing– once again, with going through
a DNA intermediate– and ask the question,
how can we look at the expression
of particular genes by mapping RNA sequence
reads back onto the genome. And there are two fundamental
questions we can address here, which is, what is the level
of expression of a given gene, and secondarily, what
isoforms are being expressed. The second sets of reads,
you see up on the screen, are split reads that
cross splice junctions. And so by looking at how
reads align to the genome, we can figure out which
particular axons are included or excluded from a
particular transcript. So that's the beginning of
the high throughput biology genomic analysis module. And I'll be returning to
talk, later in the term, about computational genetics,
which, really, is a way to summarize everything
we're learning in the course into
an applicable way to ask fundamental questions
about genome function, which Professor Burge
talked about earlier.

Now we all have about 3
billion bases in our genomes. And as you know, you
differ from the individual sitting next to you in about
one in every 1,000 base pairs, on average. And so one question
is, how do we actually interpret these differences
between genomes. And how can we build
accurate computational models that allow us to infer
function from genome sequence? And that's a pretty
big challenge. So we'll start by
asking questions about, what parts of the
genome are active and how could we annotate them. So we can use, once again,
different kinds of sequencing based assays to identify the
regions of the genome that are active in any
given cellular state. And furthermore, if we
look at different cells, we can tell which
parts of the genome are differentially active.

Here you can see
the active chromatin during the
differentiation of an ESL into a terminal type over
a 50 kilobase window. And the regions of the genome
that are shaded in yellow represent regions that
are differentially active. And so, using this kind of
DNA seq. data and other data, we can automatically
annotate the genome with where the
regulatory elements are and begin to understand what the
regulatory code of the genome is. So once we understand what
parts of the genome are active, we can ask questions
about, how do they contribute to some
overall phenotype. And our next
lecture, lecture 19, we'll be looking
at how we can build a model of a
quantitative trait based upon multiple loci and the
particular alleles that are present at that loci. So here you see an
example of a bunch of different
quantitative trait loci that are contributing to
the growth rate of yeast in a given condition.

So part of our exploration
during this term will be to develop
computational methods to automatically identify
regions of the genome that control such traits and to
assign them significance. And finally, we'd like to
put all this together and ask a very fundamental
question, which is, how do we assign
variations in the human genome to differential risk
for human disease. And associated questions
are, how could we assign those variants to what
best therapy would be applied to the disease–
what therapeutics might be used, for example.

And here we have
a bunch of results from genome wide
association studies, starting at the top
with bipolar disorder. And at the bottom
is type 2 diabetes. And looking along
the chromosomes, we're asking which
locations along the genome have variants that
are highly associated with these particular diseases
in these so-called Manhattan plots. Because the things that
stick up look like buildings. And so these sorts of studies
are yielding very interesting insights into variants that are
associated with human disease. And the next step, of
course, is to figure out how to actually prove that
these variants are causal, and also, to look
at mechanisms where we might be able to address
what kinds of therapeutics might be applied to deal
with these diseases.

So those are my two units. Once again, high throughput
genomic analysis, and secondarily,
computational genetics. And finally, if you have
any questions about 6.874, I'll be here after lecture. Feel free to ask me. Ernest. PROFESSOR: Thank
you very much, Dave. All right. So in the preceding
lectures, you'll have heard a lot about
the amazing things we can learn from
nucleic acid sequencing. And what we're going to do in
the latter part of the course is look at other
modalities in the cell.

Obviously, there's
a lot that goes on inside of cells
that's not taking place at the level of DNA/RNA. And so we're going to start
to look at proteins, protein interactions, and ultimately,
protein interaction networks. So we'll start with
the small scale, looking at intermolecular
interactions of the biophysics– the
fundamental biophysics of a protein structure. Then we'll start to look at
protein-protein interactions. And as I said, the final
level will be networks. There's been an amazing
advance in our ability to predict protein structure. So it's always been the dream
of computational biology to be able to go from
the sequence of a gene to the structure of the
corresponding protein. And ever since Anfinsen,
we know that, at least, that should be theoretically
possible for a lot of proteins. But it's been computationally
virtually intractable until quite recently. And a number of
different approaches have allowed us to
predict protein structure.

So this slide
shows, for example– I don't know which one's
which– but in blue, perhaps, prediction, and in red, the
true structure, or the other way around. But you can see, it
doesn't matter very much. We're getting extremely accurate
predictions of small protein structure. There are a variety
of approaches here that live on a spectrum. On the one hand, there's
the computational approach that tries to make special
purpose hardware to carry out the calculations for
protein structure. And that's been
wildly successful. On the other end of
the extreme, there's been the crowd sourcing
approach to have gamers try to predict
protein structure. And that turns out
to be successful. And there's a lot of interesting
computational approaches in the middle as well. So we'll explore some of
these different strategies. Then, the ability to go from
just seeing a single protein structure, how these
proteins interact.

So now there are
pretty good algorithms predicting protein-protein
interactions as well. And that allows us, then,
to figure out not only how these proteins
function individually, but how they begin to
function as a network. Now one of the things
that we've already– going to be touched on in
the early parts of the course are protein-DNA interactions
through sequencing approaches. We want to now look at
them at a regulatory level. And could you reconstruct
the regulation of a genome by predicting the protein-DNA? And there's been a
lot of work here. We talked earlier
about Eric Davidson's pioneering approaches, a lot
of interesting computational approaches as well, that go from
the relatively simple models we saw on those earlier slides to
these very complicated networks that you see here.

We'll look at what
these networks actually tell us, how much information
is really encoded in them. They are certainly
pretty, whatever they are. We'll look at other kinds
of interactions networks. We'll look at genetic
interaction networks as well, and perhaps, some other kinds. And finally, we'll look
at computable models. And what do we mean by that? We mean model that
make some kind of very specific prediction,
whether it's a Boolean prediction– this gene
will be on or off– or perhaps, an even more
quantitative prediction. So we'll look at
logic based modeling, and probably, Bayesian
networks as well. So that's been a very
whirlwind tour of the course. Just to go back
to the mechanics, make sure you're signed
up for the right course.

The undergraduate
versions of the course do not have a project. And they certainly do not have
the artificial intelligence problems. Then, there are the graduate
versions that have the project, but do not have the AI,
and finally, the 6.874, which has both. So please be sure you're
in the right class so you get credit for
the right assignments. And finally, if you've got
any questions about the course mechanics, we have
a few minutes. We can talk about it here,
in front of everybody. And then, we'll all be available
after class, a little bit, to answer questions. So please, any questions? Yes. AUDIENCE: Can
undergrads do a project? PROFESSOR: Can
undergrads do a project? Well, they can sign up for the
graduate version, absolutely. Other questions? Yes. AUDIENCE: Yeah, I actually can't
make either of those sessions. PROFESSOR: Well, send an
email to the staff list and we'll see what we can do. Other questions? Yes. AUDIENCE: 6.877
doesn't exist anymore– Computational Evolutionary
Biology, the class. PROFESSOR: Oh, the thing we
listed as alternative classes? Sorry. PROFESSOR: We'll fix that. AUDIENCE: Really? I would be so happy.

[LAUGHTER] PROFESSOR: We'll delete
it from our slides, yes. We are powerful, but
not that powerful. Other questions,
comments, critiques? Yes, in the back. AUDIENCE: Are each of the
exams equally weighted? PROFESSOR: Are each of the
exams equally weighted? Yes, they are. Yes. AUDIENCE: If we're in 6.874,
we have the additional AI problems. Does that mean that we have
more questions on the exam, but just as much
time to do them? PROFESSOR: The question
was, if you're in 6.874 and you have the additional
AI questions on the exam, does that mean you
have more questions, but the same amount of time. And I believe the
answer to that is yes. PROFESSOR: We may
revisit that question. PROFESSOR: We will revisit
that question at a later date. But course six students
are just so much smarter than everyone else, right? Yes. AUDIENCE: Can we switch between
different versions of the class by the add deadline? PROFESSOR: I'm sorry,
could you say that again? AUDIENCE: Can we switch
between versions of the class by the add deadline? PROFESSOR: Can you switch
between different versions of class by the
add/drop deadline? In principle, yes.

But if you haven't
done the work for that, then you will be in trouble. And there may not be
a smooth mechanism for making up missed
work that late. Because the add/drop
deadline is rather late. So I'd encourage you,
if you're considering doing the more intensive
version or not, sign up for the more
intensive version. You can always drop back. Once we form groups
though, for the projects, then, obviously,
there's an aspect of letting down your teammates. So think carefully, now, about
which one you want to join. It'll be hard to
switch between them. But we'll do whatever
the registrar requires us to do in terms
of allowing it. Other questions? Yes. AUDIENCE: If the course sites,
can we undergrads access the AI problems just for
fun, to look at them? PROFESSOR: Yes. If the course sites
remain separate, will undergrads be able
to access the AI problems? The answer is yes. Yes. AUDIENCE: Should
students in 6.874 attend both presentations? PROFESSOR: Should
students in 6.874 attend both presentations? Not necessary.

You're welcome to
both if you want. But the course six recitation
should be sufficient. Other questions? Yes. AUDIENCE: So what's covered
in the normal recitations? PROFESSOR: What's covered
in the normal recitations? It'll be reinforcing material
that's in the lectures. So the goal is not introduce any
new material in the course 20, course seven recitations. Just to clarify, not to
introduce any new material.

Other questions? OK. Great. And we'll hang
around a little bit to answer any remaining
questions when they come up.

You May Also Like

1. Introduction to Computational and Systems Biology – Ai Biotechnology