The following

content is provided under a Creative

Commons license. Your support will help MIT

OpenCourseWare continue to offer high quality

educational resources for free. To make a donation or

view additional materials from hundreds of MIT courses,

visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: All right. So today, we're going

to briefly review classical sequencing and

next-gen or second-gen sequencing, which sort of

provides a lot of the data that the analytical methods

we'll be talking about work on. And we'll then introduce

local alignment a la BLAST and some of

the statistics associated with that. So just a few brief

items on topic one. All right. So today, we're going to

talk about sequencing first. Conventional– or Sanger

sequencing– then next-gen or second-gen

sequencing briefly. And then talk about

local alignments. So background for the sequencing

part, the Metzker review covers everything you'll need. And for the alignment– we'll

talk about local alignment today, global

alignment on Tuesday– then chapters four and five of

the text cover it pretty well.

So here's the text. If you haven't decided

whether to get it or not, I'll have it up here. You can come flip

through it after class. Sequencing is mostly

done at the level of DNA. Whether the original

material was RNA or not, you usually convert to DNA and

sequence at the DNA level. So we'll often think about

DNA as sort of a string. But it's important to

remember that it actually has a three dimensional

structure as shown here. And often, it's helpful to

think of it in sort of a two dimensional

representation where you think about the bases and

their hydrogen bonding and so forth as

shown in the middle. My mouse is not working

today for some reason, but hopefully, we won't need it. So the chemistry of sequencing

is very closely related to the chemistry of

the individual bases.

And there are really

three main types that are going to

be relevant here. Ribonucleotides,

deoxyribonucleotides, then for Sanger sequencing,

dideoxyribonucleotides. So who can tell me which

of these structures corresponds to which

of those names? And also, please let

me know your name and I'll attempt

to remember some of your names toward the end

of the semester probably. So, which are which? Yes, what's your name? AUDIENCE: I'm Simona. So the ribonucleotide

is the top right.

The deoxy is the one below it. And the dideoxy is

the one to the left. PROFESSOR: OK, so

that is correct. So one way to keep

these things in mind is the numbering of the bases. So the carbons in the ribo

sugar are numbered one through five, so carbon 1 is the one

where the base is attached. Two is here, which has an OH

in RNA and just an H in DNA. And then three is

very important. Four, and then five. So five connects

to the phosphates, which then will connect the

base to the sugar phosphate backbone. And three is where you extend. That's where you're going to

add the next base in a growing chain. And so what will happen if you

give DNA polymerase a template and some dideoxy nucleotides? It won't be able to extend

because there's no 3-prime OH. And all the chemistry

requires the OH. And so that's the basis

of classical or Sanger sequencing, which Fred

Sanger got the Nobel Prize for in the 1980s– I think

it was developed in the '70s– and it's really

the basis of most of the sequencing, or

pretty much all the DNA sequencing up until

the early 2000s before some newer

technologies came about.

And it takes advantage

of this special property of dideoxy nucleotides that they

terminate the growing chain. So imagine we have

a template DNA. So this is the

molecule whose sequence we want to determine

shown there in black. We then have a primer. And notice the primer's

written in 5-prime to 3-prime direction. The ends would be

primer sequences and then primer-complementary

sequences in the template. So you typically will have

your template cloned– this is in conventional

sequencing– cloned into some vector

like a phage vector for sequencing so you know

the flanking sequences.

And then you do four

sequencing reactions in conventional

Sanger sequencing. And I know some of you have

probably had this before. So let's take the first

chemical reaction. The one here with ddGTP. So what would you

put in that reaction? What are all the

components of that reaction if you wanted to do

conventional sequencing on, say, an acrylonitrile? Anyone? What do you need and

what does it accomplish? Yeah, what's your name? AUDIENCE: I'm Tim. PROFESSOR: Tim? Oh yeah, I know you, Tim. OK, go ahead. AUDIENCE: So you need

the four nucleotides– the deoxynucleotides. You will need the

dideoxy P nucleotides. In addition, you need all

the other [INAUDIBLE]. You need polymerase. Generally, you need a buffer

of some sort, [INAUDIBLE], to [INAUDIBLE]. PROFESSOR: Yeah,

primer and template. Yeah. Great. That's good. It sounds like Tim could

actually do this experiment. And what ratio would you put in? So you said you're

going to put in all four conventional

deoxynucleotides and then one dideoxynucleotide. So let's say dideoxy G

just for simplicity here.

So in what ratio would you

put the dideoxynucleotide compared to the

conventional nucleotides? AUDIENCE: To lower

the concentration. PROFESSOR: Lower? Like how much lower? AUDIENCE: Like, a lot lower. PROFESSOR: Like maybe 1%? AUDIENCE: Yeah. PROFESSOR: Something like that. You want to put it a lot lower. And why is that so important? AUDIENCE: Because you want the

thing to be able to progress. Because you need enough of the

ribonucleotide concentration so that [INAUDIBLE]

every [INAUDIBLE] equivalent or excess and you're

going to terminate [INAUDIBLE]. PROFESSOR: Right. So if you put equimolar

deoxy G and dideoxy G, then it's going to be a

50% chance of terminating every time you hit

a C in the template. So you're going to

have half as much of the material at the second

G, and a quarter as much as the third, and you're going to

have vanishingly small amounts.
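
That geometric fall-off is easy to check with a quick sketch (my own toy function; the ~1% figure is the rough ratio discussed above):

```python
def band_intensity(k, dd_fraction):
    """Fraction of chains that terminate at exactly the k-th G in the
    template: extend through the first k-1 Gs with dGTP, then pick up
    ddGTP at the k-th. A simple geometric distribution."""
    return (1 - dd_fraction) ** (k - 1) * dd_fraction

# Equimolar ddGTP and dGTP: each successive G band has half the material.
print([band_intensity(k, 0.5) for k in (1, 2, 3)])    # [0.5, 0.25, 0.125]

# At ~1% ddGTP, band intensities remain the same order of magnitude
# out past 100 Gs, which is what makes ~600-base reads possible.
print(band_intensity(1, 0.01), band_intensity(100, 0.01))
```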

So you're only going

to be able to sequence the first few C's

in the template. Exactly. So that's a very good point. So now let's imagine you do

these four separate reactions. You typically would

have radiolabeled primer so you can see your DNA. And then you would run

it on some sort of gel. This is obviously not a real

gel, but an idealized version. And then in the lane

where you put dideoxy G, you would see the

smallest products. So you read these guys

from the bottom up. And in this lane there

is a very small product that's just one base longer

than the primer here.

And that's because there was a

C there and it terminated there. And then the next C appears

several bases later. So you have sort of a gap here. And so you can see that the

first base in the template would be a complement

of G, or C. And the second base

would be, you can see, the next smallest product

in this dideoxy T lane, therefore it would

be A. And you just sort of snake your

way up through the gel and read out the sequence. And this works well. So what does it actually

look like in practice? Here are some actual

sequencing gels.

So you run four lanes. And on big polyacrylamide

gels like this. Torbin, you ever

run one of these? AUDIENCE: Yes. PROFESSOR: Yes? They're a big pain to cast. Run for several hours, I think. And you get these

banding patterns. And what limits the

sequence read length? So we normally call

the sequence generated from one run of a

sequencer a read. So that one attempt to sequence

the template is called a read. And you can see

it's relatively easy to read the sequence

toward the bottom, and then it gets

harder as you go up. And so that's really what

fundamentally limits the read length, is that the bands get

closer and closer together. So they'll run inversely

proportional to size with the small ones

running faster. But then the difference between

a 20 base product and a 21 might be significant. But the difference between a

500 base product and a 501 base product is going

to be very small. And so you basically can't

order the lanes anymore.

And therefore, that's sort of

what fundamentally limits it. All right. So here we had to run

four lanes of a gel. Can anyone think of

a more efficient way of doing Sanger sequencing? Is there any way to

do it in one lane? Yeah, what's your name? AUDIENCE: Adrian. You can use four different

types of the entities. Maybe like four

different colors. AUDIENCE: Four different colors. OK, so instead of using radio

labeling on the primer, you use a fluorophore on your dideoxynucleotides, for example. And then you can run them. Depending where that

strand terminated, it'll be a different color. And you can run them

all in one lane.
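
That one-lane, four-color scheme can be sketched in code (a toy simulation of my own; the 5% terminator fraction and chain count are arbitrary, and the read here is the synthesized strand, 5-prime to 3-prime):

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def dye_terminator_bands(template, n_chains=10000, dd_fraction=0.05, seed=0):
    """Simulate one dye-terminator reaction: each chain extends along the
    template until it happens to incorporate a fluor-labeled ddNTP, which
    both terminates the chain and color-codes its final base.
    Returns {fragment_length: dye_color_as_base}."""
    rng = random.Random(seed)
    bands = {}
    for _ in range(n_chains):
        for length, t_base in enumerate(template, start=1):
            if rng.random() < dd_fraction:          # labeled terminator went in
                bands[length] = COMPLEMENT[t_base]  # color identifies the base
                break                               # no 3-prime OH: chain stops
    return bands

def read_one_lane(bands):
    """Read the single lane shortest-to-longest: the colors spell the read."""
    return "".join(bands[k] for k in sorted(bands))
```

For example, `read_one_lane(dye_terminator_bands("TGCA"))` reconstructs the complement of the template, read from the smallest fragment up.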

OK, so that looks like that. And so this was an

important development called dye-terminator sequencing in the '90s. That was the basis of the ABI

3700 machine, which was really the workhorse of

genome sequencing in the late '90s

and early 2000s. Really what enabled the

human genome to be sequenced. And so one of the other

innovations in this technology was that instead of having a

big gel, they shrunk the gel. And then they just had

a reader at the bottom. So the gel was shrunk to as thin

as these little capillaries. I don't know if you

can see these guys. But basically it's like

a little thread here.

And so each one of these

is effectively– oops! Oh no. No worries, this

is not valuable. Ancient technology that I

got for free from somebody. So the DNA would be

loaded at the top. There would be a

little gel in each of these– it's called

capillary sequencing. And then it would

run out the bottom and there would be

a detector which would detect the

four different fluors and read out the sequence. So this basically condensed the

volume needed for sequencing. Any questions about

conventional sequencing? Yes? AUDIENCE: Where

are the [INAUDIBLE] where you'd put the

fluorescent flags? Like the topic from

the [INAUDIBLE]? PROFESSOR: Yeah,

that's a good question. I don't actually remember. I think there are different

options available. And sometimes with some

of these reactions, you need to use modified

polymerases that can tolerate these

modified nucleotides. Yeah, so I don't remember that. It's a good question. I can look that up.

So how long can a

conventional sequencer go? What's the read length? Anyone know? It's about, say, 600 or so. And so that's reasonably long. How long is a typical

mammalian mRNA? Maybe two, three kb? So you have in a typical

exon, maybe 150 bases or so. So you have a chunk. You don't generally

get full length cDNA. But you get a chunk

of a cDNA that's say, three, four exons in length. And that is actually

generally sufficient to uniquely identify the gene

locus that that read came from. And so that was the

basis of EST sequencing– so-called Expressed

Sequence Tag sequencing. And millions of these 600 base

chunks of cDNA were generated and they have been quite

useful over the years. All right. So what is next-gen sequencing? So in next-gen sequencing, you

only read one base at a time. So it's often a

little bit slower. But it's really

massively parallel. And that's the big advantage. And it's orders of

magnitude cheaper per base than

conventional sequencing. Like when it first

came out it, it was maybe two orders

of magnitude cheaper. And now it's probably another

four orders of magnitude.

So it really blows away

conventional sequencing if the output that

you care about is mostly proportional to

number of bases sequenced. If the output is proportional

to the quality of the assembly or something, then

there are applications where conventional

sequencing still is very useful, because next-gen sequencing reads tend to be shorter. But in terms of just volume, it generates many, many more bases in one reaction. And so the basic ideas are

that you have your template DNA molecules. Now typically, tens of thousands

for technologies like PacBio or hundreds of millions for

technologies like Illumina that are immobilized on

some sort of surface– typically a flow

cell– and there are either single

molecule methods where you have a single

molecule of your template or there are methods that

locally amplify your template and produce, say, hundreds

of identical copies in little clusters.

And then you use

modified nucleotides, often with

fluorophores attached, to interrogate the next base at

each of your template molecules for hundreds and hundreds

of millions of them. And so there are several

different technologies. We won't talk about all of them. We'll just talk

about two or three that are interesting

and widely used. And they differ depending

on the DNA template, what types of modified

nucleotides are used, and to some extent, in the

imaging and the image analysis, which differs for single

molecule methods, for example, compared to the ones

that sequence a cluster. So there's a table in

the Metzker review. And so I've just told you

that next-gen sequencing is so cheap. But then you see how

much these machines cost and you could buy lots of

other interesting things with that kind of money. And I also want to emphasize

that that's not even the full cost.

So if you were to buy an

Illumina GA2– this would be like a couple years

ago when the GA2 was the state of the art– for

half a million dollars, the reagents to run

that thing, if you're going to run it continuously

throughout the year, the reagents to run it

would be over a million. So this actually

underestimates the cost. However, the cost per

base is super, super low. Because they generate

so much data at once. All right. So we'll talk

about a couple of these. The first next-gen

sequencing technology to be published and

still used today was from 454–

now Roche– and it was based on what's

called emulsion PCR. So they have these little

beads, the little beads have adapter DNA molecules

covalently attached.

You incubate the beads

with DNA, and you actually make an emulsion. So it's an oil water emulsion. So each bead, which

is hydrophilic, is in the little bubble

of water inside oil. And the reason for that

is so that you do it at a template

concentration that's low enough that only a

single molecule of template is associated with each bead. So the oil then

provides a barrier so that the DNA can't get

transferred from one bead to another. So each bead will have a

unique template molecule. You do sort of a local

PCR-like reaction to amplify that DNA

molecule on the bead, and then you do sequencing

one base at a time using a luciferase based

method that I'll show you on the next slide. So Illumina technology differs

in that instead of an emulsion, you're doing it on the

surface of a flow cell. Again, you start with a

single molecule of template. Your flow cell has

these two types of adapters covalently attached.

The template anneals to

one of these adapters. You extend the adapter molecule

with dNTPs and polymerase. Now you have the complement of

your template; you denature. Now you have the reverse complement of your template

molecule covalently attached to the cell surface. And then at the other end

there's the other adapter. And so what you

could do is what's called bridge amplification

where that complement of the template molecule will bridge over, hybridize to the other

adapter, and then you can extend that adapter. And now you've regenerated

your original template. And so now you have the

complementary strand, and the original

strand; you denature. And then each of those molecules

can undergo subsequent rounds of bridge amplification to

make clusters of typically several hundred

thousand molecules. Is that clear? Question.

Yeah, what's your name? AUDIENCE: Stephanie. How do they get the adapters

onto the template molecules? PROFESSOR: How do

you get the adapters onto the template molecules? So that's typically

by DNA ligation. So we may cover

that in later steps. It depends. There's a few

different protocols. So for example, if you're

sequencing microRNAs, you typically would

isolate the small RNAs and use RNA ligation

to get the adapters on. And then you would do

an RT step to get DNA. With most other applications

like RNA-seq or genome sequencing– so with RNA-seq,

you're starting from mRNA, you typically will isolate

total RNA, do poly(A) selection, you fragment your RNA

to reduce the effects of secondary structure,

you random prime with, like, random hexamers RT enzyme.

So that'll make little bits

of cDNA 200 bases long. You use second strand synthesis. Now you have double

stranded cDNA fragments. And then you do, like, blunt end

ligation to add the adapters. And then you denature so

you have single strand. AUDIENCE: I guess my

question is how do you make sure that the two

ends sandwiching the DNA are different as opposed to– PROFESSOR: That the

two ends are different. Yeah, that's a good question. I'll post some stuff about–

It's a good question. I don't want to sweep

it under the rug.

But I kind of want to move on. And I'll post a

little bit about that. All right so we

did 454 and Illumina. Helicos is sort of like

Illumina sequencing except single molecule. So you have your

template covalently attached to your substrate. You just anneal primer and

just start sequencing it And there's major pros and cons

of single molecule sequencing, which we can talk about.

And then the PacBio

technology is fundamentally different in that the

template is not actually covalently attached

to the surface. The DNA polymerase is covalently

attached to the surface and the template is sort of

threaded into the polymerase. And this is a phage polymerase

that's highly processive and strand displacing. And the template is often

a circular molecule. And so you can

actually read around the template multiple

times, which turns out to be really useful in PacBio

because the error rate is quite high for the sequencing. So in the top, in

the 454, you're measuring luciferase

activity– light.

In Illumina, you're

measuring fluorescence. Four different

fluorescent tags, sort of like the four different tags

we saw in Sanger sequencing. Helicos, it's single

tag one base at a time. And in PacBio, you actually

have a fluorescently labeled dNTP that has

the label on– it's actually

hexaphosphate– it's got the label on the

sixth phosphate. So the dNTP is labeled. It enters the active site

of the DNA polymerase. And the residence

time is much longer if the base is actually

going to get incorporated into that growing chain. And so you measure how much time

you have a fluorescent signal. And if it's long, that

means that that base must have incorporated

into the DNA.

But then, the extension

reaction itself will cleave off the

last five phosphates and the fluorophore tag. And so you'll

regenerate native DNA. So that's another difference. Whereas in Illumina

sequencing, as we'll see, there's this reversible

terminator chemistry. So the DNA is not native

that you're synthesizing. So this is just a

little bit more on 454. Just some pretty pictures. I think I described that before. The key chemistry here is that

you add one dNTP at a time. So only a subset of the wells–

perhaps a quarter of them– that have that next base,

the complementary base free– as the next

one after the primer– will undergo synthesis. And when they undergo synthesis,

you release pyrophosphate. And they have these

enzymes attached to these little micro beads–

the orange beads– sulfurylase and luciferase, that use

pyrophosphate to basically generate light. And so then you have one of

these beads in each well. You look at which wells

lit up when we added dCTP.

And they must have had G as

the next base and so forth. And there's no termination here. The only termination

is because you're only adding one base at a time. So if you have a single

G in the template, you'll add one base. But if you have two Gs in the

template, you'll add two Cs. And in principle, you'll

get twice as much light. But then you have to sort of

do some analysis after the fact to say, OK how much

light do we have? And was that one G,

two G, and so forth. And the amount of

light is supposed to be linear up to

about five or six Gs.
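
The homopolymer behavior is easy to simulate (a sketch of my own; the flow order and sequences are made up, and real base calling has to threshold noisy light intensities rather than exact counts):

```python
from itertools import cycle, islice

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def flowgram(template, flow_order="TACG"):
    """454-style pyrosequencing sketch: flow one dNTP at a time; the light
    emitted is proportional to how many bases of that homopolymer run get
    incorporated in the flow (no terminators involved)."""
    synth = "".join(COMPLEMENT[b] for b in template)  # strand being built
    signals, pos = [], 0
    for nt in islice(cycle(flow_order), 4 * len(synth)):
        run = 0
        while pos < len(synth) and synth[pos] == nt:
            run += 1                                  # each incorporation adds light
            pos += 1
        signals.append((nt, run))
    return signals

def call_bases(signals):
    """Turn light intensities back into sequence; inferring run lengths from
    intensity is exactly the step that gets unreliable past ~5-6 bases."""
    return "".join(nt * run for nt, run in signals)
```

So a template with two Gs produces a single C flow with double the signal, and the caller has to decide whether that intensity meant one, two, or three Cs.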

But that's still a

more error-prone step. And the most common

type of error in 454 is actually insertions

and deletions. Whereas in Illumina

sequencing, it's substitutions. David actually encouraged

me to talk more about sequencing errors

and quality scores. And I need to do a little

bit more background. But I may add that a little

bit later in the semester. OK, so in Illumina

sequencing, you add all four dNTPs

at the same time. But they're non-native.

They have two major

modifications. So one is that they're

three prime blocked. That means that

the OH is not free, I'll show the chemical

structure in a moment. So you can't extend

more than one base. You incorporate that one

base, and the polymerase can't do anything more. And they're also tagged

with four different fluors. So you add all

four dNTPs at once. You let the polymerase

incorporate them. And then you image

the whole flow cell using two lasers

and two filters. So basically, to

image the four fluors. So you have to sort of take

four different pictures of each portion of the flow

cell and then the camera moves and you scan the whole cell.

And so then, those clusters that

incorporated a C, let's say, they will show up in the

green channel as spots. And those that incorporated an A, and so forth. So you basically have these

clusters, each of them represents a distinct

template, and you read one base at a time. So, first you read the

first base after the primer. So it's sequencing

downwards into the template. And you read the

first base so you know what the first base

of all your clusters is. And then you reverse

the termination. You cleave off

that chemical group that was blocking the 3-prime

OH so now it can extend again. And then you add the

four dNTPs again, do another round of extension,

and then image again, and so forth. And so it takes a little while. Each round of imaging

takes about an hour. So if you want to do 100 base

single-end Illumina sequencing, it'll be running on the machine

for about four days or so. Plus the time you have to

build the clusters, which might be several hours

on the day before.
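
That extend-image-cleave loop can be sketched as follows (a toy model of my own; it ignores complementarity and errors, and treats each cycle's image as a perfect one-base call per cluster):

```python
def sbs_run(cluster_templates, n_cycles):
    """Reversible-terminator sequencing sketch: each cycle, every cluster
    incorporates exactly one 3'-blocked, fluor-labeled base (the image gives
    one color per cluster), then block and fluor are cleaved for the next
    cycle. All clusters advance in lockstep, one base per cycle."""
    reads = [""] * len(cluster_templates)
    for cyc in range(n_cycles):
        for i, tmpl in enumerate(cluster_templates):
            if cyc < len(tmpl):
                reads[i] += tmpl[cyc]   # the color imaged this cycle
    return reads

# ~1 hour per cycle: 100 cycles is roughly 100 hours, i.e. about 4 days.
print(sbs_run(["ACGTAC", "TTGA"], 3))   # ['ACG', 'TTG']
```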

So what is this? So actually the whole idea

of blocking termination– basically Sanger's idea–

is carried over here in Illumina sequencing

with a little twist. And that's that you can

reverse the termination. So if you look down

here at the bottom, these are two different

3-prime terminators. Remember your carbon numbering. Carbon one, two, three. So this was the

3-prime OH, now it's got this methyl [INAUDIBLE],

or whatever that is. I'm not much of a chemist,

so you can look that one up.

And then here's another version. And this is sort

of chemistry that can cleave this off

when you're done. And then this whole thing

here, hanging off the base, is the fluor. And you cleave that off as well. So you add this big complicated

thing, you image it, and then you cleave

off the fluor and cleave off

the 3-prime block. These are some actual

sequencing images you would image in

the four channels. They're actually

black and white. These are pseudocolored. And then you can merge

those and you can see then all the clusters

on the flow cell. So this is from a GA2 with

the recommended cluster density back in the day,

like a few years ago. And nowadays, the image analysis software has gotten a lot better, so you can actually

densely and therefore get more sequence

out of the same area.

But imagine just

millions and millions of these little

clusters like this. Notice the clusters are

not all the same size. Basically, you're

doing PCR in situ, and so some molecules are easier

to amplify by PCR than others. And that probably accounts

for these variations in size. So what is the

current throughput? These data are accurate as

of about, maybe, last year. So the HiSeq 2000 instrument

is the most high performance, widely used instrument. Now there's a 2500, but I

think it's roughly similar. You have one flow cell. So a flow cell looks sort

of like a glass slide, except that it has

these tunnels carved in it like eight little

tubes inside the glass slide. And on the surfaces

of those tubes is where the adapters

are covalently attached. And so you have

eight lanes and so you can sequence eight different

things in those eight lanes. You could do yeast genome

in one and fly RNA-seq in another, and so forth. And these days, a single

lane will produce something like 200 million reads. And this is typically routine

to get 200 million reads from a lane.

Sometimes you can get more. You can do up to 100 bases, or 150 these days. On a MiSeq, which is a miniature version, you can do maybe 300 or more. And so that's a whole

lot of sequence. So that's 160 billion bases of

sequence from a single flow cell, all eight lanes together. And a single lane will cost you maybe $2,000 to $3,000, depending

where you're doing it. And the cost doesn't

include the capital cost, that's just the reagent

cost for running that. So 160 billion– the

human genome is 3 billion, so you've now sequenced the

human genome many times over there. You can do more. So you can do

paired-end sequencing, where you sequence both

ends of your template. And that'll basically double

the amount of sequence you get. And you can also,

on this machine, do two flow cells at once.
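
Those throughput figures multiply out as follows (a back-of-the-envelope sketch using the approximate numbers just quoted):

```python
reads_per_lane = 200e6          # ~200 million reads per lane
read_length = 100               # single-end, 100 bases
lanes_per_flow_cell = 8

bases_per_lane = reads_per_lane * read_length               # 20 billion
bases_per_flow_cell = bases_per_lane * lanes_per_flow_cell  # 160 billion

human_genome = 3e9
coverage = bases_per_flow_cell / human_genome               # ~53x over the genome

# Paired-end roughly doubles it; running two flow cells doubles it again.
max_bases = bases_per_flow_cell * 2 * 2
```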

So you can actually

double it beyond that. And so for many applications,

160 billion bases is overkill. It's more than you need. Imagine you're doing

bacterial genome sequencing. Bacterial genome might

be five megabases or so. This is complete overkill. So you can do bar coding

where you add little six base tags to

different libraries, and then mix them together,

introduce them to the machine, sequence the tags

first or second, and then sequence the templates. And then you effectively

sort them out later. And then do many

samples in one lane. And that's what people

most commonly do. So, questions about

next-gen sequencing? There's a lot more to learn. I'm happy to talk about it more. It's very relevant

to this class. But I'm sure it'll come up

later in David's sections, so I don't want to take

too much time on it. So, now once you generate

reads from an Illumina instrument or some

other instrument, you'll want to align

them to the genome to determine, for

example, if you're doing RNA-seq mapping

reads that come from mRNA, you'll want to know what

genes they came from.

So you need to map those

reads back to the genome. What are some other reasons you

might want to align sequences? Just in general, why

is aligning sequences– meaning, matching them up

and finding individual bases or amino acid residues that

match– why is that useful? Diego? AUDIENCE: You can

assemble them if you want. PROFESSOR: You

can assemble them? Yes. So if you're doing

genome sequencing, if you align them to

each other and you find a whole stack that

sort of align this way, you can then assemble

and infer the existence of a longer sequence. That's a good point. Yes, your name? AUDIENCE: Julianne. Looking at homologs.

PROFESSOR: Looking at homologs. Right. So if you, for example, are

doing disease gene mapping, you've identified a human

gene of unknown function that's associated

with a disease. Then you might want to search

it against, say, the mouse database and find

a homolog in mouse and then that might be what you

would want to study further. You might want to then

knock it out in mouse or mutate it or something. So those are some good reasons. There's others. So we're going to first talk

about local alignment, which is a type of alignment

where you want to find shorter stretches

of high similarity. You don't require alignment

of the entire sequence. So there are certain

situations where you might want to do that.

So here's an example. You are studying a recently

discovered human non-coding RNA. As you can see, it's 45 bases. You want to see if

there's a mouse homolog. You run it through NCBI

BLAST, which as we said is sort of the Google search

engine of mathematics– and you're going to get a chance to do it on problem set one, and you get a hit

that looks like this. So notice, this is

sort of BLAST notation. It says Q at the top. Q is for "query," that's

the sequence you put in. S is "subject,"

that's the database you were searching against. You have coordinates,

so 1 to 45. And then, in the subject, it

happened to be base 403 to 447 in some mouse

chromosome or something. And you can see that

it's got some matching.

But it also has some mismatches. So in all, there are 40

matches and five mismatches in the alignment. So is that significant? Remember, the mouse genome

is 2.7 billion bases long. It's big. So would you get a match

this good by chance? So the question is really,

should you trust this? Is this something you

can confidently say, yes mouse is a

homolog, and that's it? Or should you just

be like, well, that's not better

than I get by chance so I have no

evidence of anything? Or is it sort of

somewhere in between? And how would you tell? Yeah, what's your name? AUDIENCE: Chris. You would want to

figure out a scoring function for the alignment. And then, with that

scoring function, you would find whether or not

you have a significant match.

PROFESSOR: OK. So Chris says you want to

define a scoring system and then use the

scoring system to define statistical significance. Do want to suggest

a scoring system? What's the simplest

one you can think of? AUDIENCE: Just if there's a

match, you add a certain score. If it's a mismatch, you

subtract a certain score. PROFESSOR: So let's do

that scoring system. So the notation that's

often used is s(i,i). So that would be a match between nucleotide i and another copy of nucleotide i. We'll call that 1, plus 1 for a match. And s(i,j), where i and j are different, we'll give that a negative score, minus 1. So this is for i not equal to j. So that's a scoring matrix. It's a four by four matrix

with 1 on the diagonal and minus 1 everywhere else. And this is commonly

used for DNA. And then there's a

few other variations on this that are also used. So good, a scoring system. So then, how are we going

to do the statistics? Any ideas? How do we know

what's significant? AUDIENCE: The higher

score would probably be a little more significant

than a lower score. But the scale, I'm not sure– PROFESSOR: The scale

is not so obvious. Yes, question? AUDIENCE: My name is Andrea.

So if you shuffled the RNA,

like permute the sequence, then we'll get the

[INAUDIBLE] genome you get with that

shuffled sequence. And the score is about

the same as you'd get with the non-shuffled

sequence [INAUDIBLE] about very significant scores. PROFESSOR: Yeah, so

that's a good idea. BLAST, as it turns

out– is pretty fast. So you could shuffle

your RNA molecule, randomly permute the

nucleotides many times, maybe even like

1,000 times, search each one against

the mouse genome, and get a distribution

of what's the best score– the top score– that

you get against a genome, look at that

distribution and say whether the score

of the actual one is significantly higher

than that distribution or just falls in the

middle of that somewhere. And that's reasonable. You can certainly do that, and

it's not a bad thing to do. But it turns out there is

an analytical theory here that you can use. And so that you can determine

significance more quickly without doing so

much computation. And that's what

we'll talk about. But another issue, before

we get to the statistics, is how do you actually

find that alignment? How do you find the top scoring

match in a mouse genome? So let's suppose

this guy is your RNA.

OK, of course, we're

using T's, but that's just because you usually

sequence it at the DNA level. But imagine this is your RNA. It's very short. This is like 10 or so, I think. And this is your database. But it goes on a

few billion more. Several more blackboards. And I want to come up

with an algorithm that will find the highest scoring

segment of this query sequence against this database. Any ideas? So this would be like

our first algorithm. And it's not terribly

hard, so that's why it's a good

one to start with. Not totally obvious either. Who can think of an algorithm

or something, some operation that we can do on

this sequence compared to this sequence–

in some way– that will help us find the

highest scoring match? I'm sorry. Yeah? AUDIENCE: You have to consider

insertions and deletions. PROFESSOR: Yeah, OK. So we're going to

keep it simple.

That's true, in general. But we're going

to keep it simple and just say no

insertions and deletions. So we're going to look for

an ungapped local alignment. So that's the

algorithm that I want. First, no gaps. And then we'll do

gaps on Tuesday. Tim? AUDIENCE: You could just

compare your [INAUDIBLE] to [INAUDIBLE] all

across the database and turn off all the

[INAUDIBLE] on that [INAUDIBLE], and then figure out [INAUDIBLE]. PROFESSOR: Yeah, OK. Pretty much. I mean, that's

pretty much right. Although it's not quite as

much of a description as you would need if you want

to actually code that.

Like, how would you

actually do that? So, I want a

description that is sort of more at the

level of pseudocode. Like, here's how you would

actually organize your code. So, let's say you

entertain the hypothesis that the alignment can be

in different registers. The alignment can correspond to

base one of the query and base one of the subject. Or it could be shifted. It could be an alignment

where base 1 of the query matches these two, and so forth.

So there's sort of

different registers. So let's just consider

one register first. The one where base 1 matches. So let's just look

at the matches between corresponding bases. I'm just going to make these

little angle bracket guys here. Hopefully I won't

make any mistakes. I'm going to take this. This is sort of implementing

Tim's idea here. And then I'm going to

look for each of these– so consider it going down here. Now we're sort of looking

at an alignment here. Is this a match or a mismatch? That's a mismatch.

That's a match. That's a mismatch. That's a mismatch. That's a match. Match. Match. Mismatch. Mismatch. Mismatch. So where is the

top scoring match between the query

and the subject? Tim? Anyone? AUDIENCE: 6, 7, 8. PROFESSOR: 6, 7, 8. Good. Oh– AUDIENCE: 5, 6, 7. PROFESSOR: 5, 6, 7. Right. Right here. You can see there's

three in a row. Well, what about this? Why can't we add

this to the match? What's the reason why

it's not 2, 3, 4, 5, 6, 7? AUDIENCE: Because the

score for that is lower. PROFESSOR: Because the

score for that is lower.

Right. We defined top scoring segment. You sum up the scores

across the segment. So you can have

mismatches in there, but this will have a score of 3. And if you wanted to

add these three bases, you would be adding

negative 2 and plus 1, so it would reduce your score. So that would be worse. Any ideas on how to do this in

an automatic, algorithmic way? Yeah? What's your name? AUDIENCE: Simon. So if you keep shifting the

entire database, [INAUDIBLE]. PROFESSOR: OK so you

keep shifting it over, and you generate

one of these lines. But imagine my query was

like 1,000 or something. And my database

is like a billion. How do I look along here? And here it was obvious what

the top scoring match is. But if I had two

matches here, then we would've actually had

a longer match here.

So in general, how do I

find that top match? For each of those

registers, if you will, you'll have a thousand

long diagonal here with 1's and minus 1's on it. How do I process those scores

to find the top scoring segment? What's an algorithm to do that? It's kind of

intuitively obvious, but I want to do something

with, you define a variable and you update it, and you

add to it, and subtract.

Something like that. But something a computer

could actually handle. Yeah? What was your name? Julianne? AUDIENCE: Could you keep track

of what the highest total score is, and then you keep

going down the diagonal, and then you update it? PROFESSOR: OK. You keep track of what the

highest total score was? AUDIENCE: Yeah. The highest test score. PROFESSOR: The

highest segment score? OK. I'm going to put this up here. And we'll define max s. That's the highest segment

score we've achieved to date. And we'll initialize

that to zero, let's say. Because if you had

all mismatches, zero would be the

correct answer. If your query was A's

and your subject was T's.

And then what do you do? AUDIENCE: As you go down the

diagonal, you keep track of– PROFESSOR: Keep track of what? AUDIENCE: So you

look at 1 in 1 first. And then you go 1 in 2, and

you find a score of zero. But that's higher

than negative 1. PROFESSOR: But the score of the

maximum segment at that point, after base 2, is not zero. It's actually 1. Because you could have a

segment of one base alignment. The cumulative score is zero. I think you're onto something

here that may be also something useful

to keep track of. Let's do the cumulative score

and then you tell me more. We'll define cumulative

score variable. We'll initialize that to zero. And then we'll have some for

loops that, as some of you have said, you want to

loop through the subject. All the possible

registers of the subject.

So that would be

maybe j equals 1 to subject length

minus query length. Something like that. Don't worry too much about this. Again, this is not

real code, obviously. It's pseudocode. So then this will be,

say, 1 to query length. And so this will be

going along our diagonal. And we're going to plot

the cumulative score.

So here you would have an

update where cumulative score plus equals the score

of query position i matched against

subject position j. And update that. So that's just cumulative score. So what will it look like? So in this case, I'll

just use this down here. So you have zero, 1,

2, minus 1, minus 2. So you'll start at position

zero in the sequence. At position 1 you're

down here at minus 1 because it was a mismatch. Then at position 2, as you

said, we're back up to zero. And then what happens? Go down to minus

1, down to minus 2. Then we go up three times in a

row until we're up here to 1. And then we go down after that. So where is your highest scoring

match in this cumulative score plot? People said it was from 5 to 7.

Yeah, question? AUDIENCE: So would it be

from like a local minimum to a local maximum? PROFESSOR: Yeah. Exactly. So, what do you want

to keep track of? AUDIENCE: You want to keep track

of the minimum and the maximum. And look for the range which

you maximize to different– PROFESSOR: Yeah, so

this is now sort of more what I was looking

for in terms of– so this was the local minimum,

and that's the local maximum. This is the score. That's your max s there. And you also want to keep

track of where that happened in both the query

and the subject. Does that make sense? So you would keep track of

this running cumulative score variable.

You keep track of

the last minimum. The minimum that

you've achieved so far. And so that would then

be down here to minus 2. And then when your cumulative

score got up to plus 1, you always take that

cumulative score, minus the last minimum

cumulative score. That gives you a

potential candidate for a high scoring segment. And if that is bigger than

your current max high scoring segment, then you update it

and you would update this. And then you would

also have variables that would store where you are.

And also, where did

that last minimum occur. So I'm not spelling it all out. I'm not going to give

you all the variables. But this is an algorithm that

would find the maximum score. Yeah, question? AUDIENCE: So you're

keeping track of the global maximum,

local minimum, so that you can accept the most

recent local minimum following the global maximum? PROFESSOR: I'm not

sure I got all that. But you're keeping track

of the cumulative score.

The minimum that that

cumulative score ever got to. And the maximum

difference, the maximum that you ever in the

past have gone up. Where you've had a

net increment upwards. Like here. So this variable

here, this max s, it would be initialized to zero. When you got to here, your last

minimum score would be minus 1. Your cumulative

score would be zero. You would take the difference

of those, and you'd be like, oh I've got a high scoring

segment of score one. So I'm going to update that. So now, that variable

is now 1 at this point. Then you're going down, so

you're not getting anything. You're just lowering this

minimum cumulative score down to minus 2 here. And then when you

get to here, now you check the cumulative score

minus the last minimum. It's 1. That's a tie. We won't keep track of ties. Now at here, that

difference is 2. So now we've got a new record.

So now we update this maximum

score to 2, and the locations. And then we get here, now

it's 3, and we update that. Does that make sense? AUDIENCE: Imagine the first

dip– instead of going down to negative 1, it went

down to negative 3. PROFESSOR: Negative 3? AUDIENCE: That first dip. PROFESSOR: Right here? So we started back a little bit. So back here, like this? AUDIENCE: No. PROFESSOR: Down to negative 3? AUDIENCE: No. PROFESSOR: But how do

we get to negative 3? Because our scoring is this way. You want this dip to minus 3? AUDIENCE: No. PROFESSOR: This one minus 3? Imagine we are at minus 3 here? AUDIENCE: Yeah.

Imagine it dipped to minus 3. And then the next one dipped to

higher than that, to minus 2. And then it went up to 1. And so, would the difference

you look at be negative 2 to 1, or negative 3 to 1? PROFESSOR: Like that, right? So, minus 3, let's

say, minus 2, 1. Something like that. What do people think? Anyone want to–? AUDIENCE: Minus 3 to 1. PROFESSOR: Minus 3 to 1. It's the minimum

that you ever got to. This might be a stronger match,

but this is a higher scoring match. And we said we want

higher scoring. So you would count that. AUDIENCE: So you keep track

of both the global minimum and the global

maximum, and you take the difference between them.

PROFESSOR: You keep track

of the global minimum and the current

cumulative score, and you take the difference. AUDIENCE: The global maximum– PROFESSOR: It's not

necessarily global maximum because we could be

well below zero here. We could do like this. From here to here. So this is not the

global maximum. This just happens

to be, we went up a lot since our last minimum. So that's your high

scoring segment. Does that make sense? I haven't completely

spelled it out.
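Spelled out a bit further, the pieces the class has assembled– a cumulative score, a running minimum, and a record best– fit together as in this minimal Python sketch. The function name and the plus 1/minus 1 defaults are editorial choices, not anything fixed by the lecture:

```python
def best_ungapped_segment(query, subject, match=1, mismatch=-1):
    """Highest-scoring ungapped local alignment of query vs. subject.

    For each register (offset of the query along the subject), walk the
    diagonal keeping a cumulative score; the best segment ending at the
    current base scores (cumulative - minimum cumulative seen earlier).
    """
    m, n = len(query), len(subject)
    best = 0                 # max s, initialized to zero as in the lecture
    where = None             # (query start, subject start, length), 0-based
    for j in range(n - m + 1):       # each register of the subject
        cum = 0
        min_cum, min_pos = 0, -1     # lowest cumulative score so far, and where
        for i in range(m):           # walk down the diagonal
            cum += match if query[i] == subject[j + i] else mismatch
            if cum - min_cum > best:         # new record segment
                best = cum - min_cum
                where = (min_pos + 1, j + min_pos + 1, i - min_pos)
            if cum < min_cum:                # update the running minimum
                min_cum, min_pos = cum, i
    return best, where
```

Against a real genome you would also scan the reverse strand, and, as the lecture notes shortly, BLAST layers speed tricks on top of this same order-mn scan.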

But I think you guys have given

enough ideas here that there's sort of the core

of an algorithm. And I encourage you to think

this through afterwards and let me know if

there are questions. And we could add an

optional homework where I ask you to do

this, that we've sometimes had in the past. It is a useful thing to look at. This is not exactly how

the BLAST algorithm works. It uses some tricks

for faster speed. But this is sort of

morally equivalent to BLAST in the sense that it has the

same order of magnitude running time. So this algorithm– what is the

running time in Big-O notation? So just for those who

are non-CS people, when you use this Big-O notation,

then you're asking, how does the running

time increase in the size of the input? And so what is the input? So we have two inputs. We have a query of length. And let's say

subject of length n. So clearly, if those are bigger,

it'll take longer to run. But when you compare

different algorithms, you want to know how the run

time depends on those lengths.

Yes. What's your name? AUDIENCE: Sally. m times n. PROFESSOR: So with

this, this is what you would call an

order mn algorithm. And why is that? How can you see that? AUDIENCE: You have two for

loops And for each length, essentially, you're

going through everything in the query. And then, for everything that

you go through in the query, you would [INAUDIBLE].

PROFESSOR: Right. In this second for loop here,

you're going through the query. And you're doing that

nested inside of a for loop that's basically the

length of the subject. And eventually

you're going to have to compare every

base in the query to every base in the subject. There's no way around that. And that takes

some unit of time. And so the actual time will

be proportional to that. So the bigger n gets

and m gets, it's just proportional

to the product. Does that make sense? Or another way to

think about it is, you're clearly going to have to

do something on this diagonal. And then you're going

to have to do something on this diagonal, and

this one, and this one. And actually, you have to

also check these ones here. And in the end, the total

number of computations there is going to

be this times that. You're basically doing

a rectangle's worth of computations. Does that makes sense? So that's not bad, right? It could be worse. It could be, like, mn squared

or something like that.

So that's basically

why BLAST is fast. So what do these things

look like, in general? And what is the condition on

our score for this algorithm to work? What if I gave a score

of plus 1 for a match, and zero for a mismatch? Could we do this? Joe, you're shaking your head. AUDIENCE: It would

just be going up. PROFESSOR: Yeah. The problem is, it might

be flat for a while, but eventually it would go up. And it would just

go up and up and up. And so your highest

scoring segment would, most of the

time, be something that started very

near the beginning and ended very near the end. So that doesn't work. So you have to have

a net negative drift. And the way that's formalized

is the expected score has to be negative.

So why is the expected score

negative in this scoring system that has plus 1 for a match,

and minus 1 for a mismatch? Why does that work? AUDIENCE: It should be wrong

three quarters of the time. PROFESSOR: Yeah. You'll have a mismatch

three quarters of the time. So on average, you

tend to drift down. And then you have these

little excursions upwards, and those are your

high scoring segments. Any questions about that? AUDIENCE: Question. Is there something

better than m times n? PROFESSOR: We've got some

computer scientists here. David? Better than m times n? I don't think so,

because you have to do all those comparisons. And so there's no way around

that, so I don't think so.

All right. But the constant–

you can do better on the constant than

this algorithm thing. AUDIENCE: With

multiple queries– PROFESSOR: With

multiple queries, yeah. Then you can maybe

do some hashing or find some– to speed it up. OK, so what about the

statistics of this? So it turns out that Karlin and

Altschul developed some theory for just exactly this problem. For searching a query sequence. It can be nucleotide or protein

as long as you have integer scores and the average– or the

expected– score is negative, then this theory

tells you how often the highest score of all–

across the entire query database comparison–

exceeds a cut off x using a local alignment

algorithm such as BLAST.

And it turns out

that these scores follow what's called an extreme

value or Gumbel distribution. And it has this kind of

double exponential form here. So x is some cut off. So usually x would be the

score that you actually observed when you searched your

query against the database. That's the one you care about. And then you want

to know, what's the probability we would've

seen something higher than that? Or you might do x is one less

than the score you observed.

So what's the chance we

observed something the same, as good as this, or better? Does that make sense? And so this is going to

be your P value then. So the probability

that S– the score of the highest segment

under a model where you have a random query

against a random database of the same length– exceeds x is 1

minus e to the minus K M N e to the minus lambda x, where M and N are the lengths

of the query and the database. x is the score. And then K and lambda are

two positive parameters that depend actually on the

details of your score matrix and the composition

of your sequences.
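Once K and lambda are in hand, the tail probability is a one-liner to evaluate. A small Python helper– the numeric values in the example are placeholders for illustration, not fitted constants:

```python
import math

def karlin_altschul_pvalue(x, m, n, K, lam):
    """P(best segment score >= x) for a length-m query against a length-n
    database: 1 - exp(-K * m * n * exp(-lambda * x))."""
    return 1.0 - math.exp(-K * m * n * math.exp(-lam * x))

# Illustrative numbers only: lambda = ln 3 goes with the +1/-1 DNA matrix
# at uniform base composition; K = 0.1 is a stand-in value.
p = karlin_altschul_pvalue(x=40, m=1000, n=3_000_000_000, K=0.1,
                           lam=math.log(3))   # a very small P value
```

Note how x enters through e to the minus lambda x: pushing the cutoff up makes the P value shrink geometrically, which is why lambda is the parameter that matters most.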

And it turns out that lambda

is really the one that matters. And you can see

that because lambda is up there in that

exponent multiplying x. So if you double

lambda, that'll have a big effect on the answer. And K, it turns

out, you can mostly ignore it for most purposes. So as a formula, what

does this thing look like? It looks like that. Kind of a funny shape. It sort of looks like

a normal distribution a little bit, but then has a different shape

on the right than the left.

And how do you

calculate this lambda? So I said that lambda is

sort of the key to all this because of its uniquely

important place in that formula,

multiplying the score. So it turns out that lambda is

the unique positive solution to this equation here. So now it actually depends

on the scoring matrix. So you see there's sij there. It depends on the

composition of your query. That's the pi's. The composition of your

subject, that's the rj's. You sum over the i

and j equal to each of the four nucleotides. And that sum has to be 1. So there's a unique positive

solution to this equation.

So how would we solve

an equation like this? First of all, what

kind of equation is this, given that we're

going to set the sij, and we're going to just

measure the pi and the rj? So those are all

known constants, and lambda is what we're

trying to solve for here. So what kind of an

equation is this in lambda? Linear? Quadratic? Hyperbolic? Anybody know what this is? So this is called a

transcendental equation because the unknown

appears in the exponents. That sounds kind of unpleasant. You don't take a class in

transcendental equations probably. So in general, they're not

possible to solve analytically when they get complicated. But in simple cases, you

can solve them analytically. And in fact, let's just do one. So let's take the

simplest case, which would be that all the

pi's are a quarter.

All the ri's are a quarter. And we'll use the scoring system

that we came up with before, where sii is 1,

and sij is minus 1. If i does not equal j. And so when we plug those in to

that sum there, what do we get? We'll get four terms that are

one quarter, times one quarter, times e to the lambda. There's four possible

types of matches, right? They have probability one

quarter times a quarter. That's pi and rj. And the e to the lambda

sii is just e to the lambda because sii is 1. And then there's 12 terms that

are one quarter, one quarter, e to the minus lambda.

Because there's

the minus 1 score. And that has to equal 1. So cancel this, we'll

multiply through by 4, maybe. So now we get e to

the lambda plus 3 e to the minus lambda equals 1. It's still a

transcendental equation, but it's looking

a little simpler. Any ideas how to

solve this for lambda? Sally? AUDIENCE: Wouldn't the 1 be 4? PROFESSOR: I'm sorry. 4. Thank you. Yeah, what's your name? AUDIENCE: [INAUDIBLE] I think

[INAUDIBLE] quadratic equation. If you multiply both sides by

[INAUDIBLE] then [INAUDIBLE]. PROFESSOR: OK, so

the claim is this is basically a

quadratic equation. So you multiply both

sides by e to the lambda. So then you get e to

the 2 lambda plus 3, and then you're going to

move this over and do minus 4 e to the

lambda equals zero. Is that good? So how is it quadratic? What do you actually

do to solve this? AUDIENCE: Well, [INAUDIBLE]. PROFESSOR: Change the variable,

x equals e to the lambda. Then it's quadratic in x.

Solve for x. We all know how to solve

quadratic equations. And then substitute

back to get lambda. OK, everyone got that? If you use 16 different

scores to represent all the different types

of matches and mismatches, this will be very unpleasant. It's not unsolvable,

it's just that you have to use computational

numerical methods to solve it. But in simple cases

where you just have a couple different

types of scores, it will often be a

quadratic equation. All right. So let's suppose that we have

a particular scoring system– particular pi's, rj's– and

we have a value of lambda that satisfies those. So we've solved this

quadratic equation for lambda. I think we get lambda

equals natural log 3, something like that. Remember, it's a unique

positive solution. Quadratic equations

have two solutions, but there's going to be

just one positive one. And then we have that value. It satisfies this equation.
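That solution is easy to verify numerically, and in messier cases– where the quadratic trick does not apply– the same equation yields to bisection. A sketch; the function name is an editorial choice, and the bracketing assumes a negative expected score, as the theory requires:

```python
import math

def solve_lambda(p, r, s, tol=1e-12):
    """Unique positive root of sum_ij p_i * r_j * exp(lambda * s_ij) = 1,
    found by bisection. With a negative expected score, f(lambda) dips
    below 1 just above zero and eventually grows past 1, so a single
    sign change brackets the root."""
    def f(lam):
        return sum(p[i] * r[j] * math.exp(lam * s[i][j])
                   for i in range(4) for j in range(4))
    lo, hi = 1e-9, 1.0
    while f(hi) < 1.0:          # widen the bracket until f crosses 1
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

# +1/-1 scoring at uniform composition: lambda should come out to ln 3.
uniform = [0.25] * 4
scores = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
lam = solve_lambda(uniform, uniform, scores)   # about 1.0986 = ln 3
```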

So then, what if we

double the scores? Instead of plus 1 minus

1, we use plus 2 minus 2? What would then happen? You can see that the

original version of lambda wouldn't necessarily still

satisfy this equation. But if you think

about it a little bit, you can figure out what

new value of lambda would satisfy this equation. We've solved for the lambda

that solves with these scores. Now we're going to

have new scores. sii prime equals 2. sij prime equals minus 2. What is lambda prime? The lambda that goes

with these scores? Yeah, go ahead. AUDIENCE: Half of the original? PROFESSOR: Half of the original? Right. So you're saying that lambda

prime equals lambda over 2. And why is that? Can you explain? AUDIENCE: Because

of the [INAUDIBLE]. PROFESSOR: Yeah, if you think

about these terms in the sum, the s part is all doubling. So if you cut the lambda in half,

the product will equal what it did before. And we haven't changed the pi's

and rj's, so all those terms will be the same. So therefore, it will still

satisfy that equation.

So that's another way

of thinking about it. Yes, you're correct. So if you double

the scores, lambda will be reduced

by a factor of 2. So what does that

tell us about lambda? What is it? What is its meaning? Yeah, go ahead, Jeff. AUDIENCE: Scale of

the distribution to the expectant score? Or the range score? PROFESSOR: Yeah. It basically scales the scores. So we can have the

same equation here used with arbitrary scoring.

It just scales it. You can see the way it appears

as a multiplicative factor in front of the score. So if you double all

the scores, will that change what the highest

scoring segment is? No, it won't change

it because you'll have this cumulative thing. It just changes how

you label the y-axis. It'll make it bigger, but it

won't change what that is. And if you look

at this equation, it won't change the

statistical significance. The x will double in value,

because all the matches are now worth twice as much as

what they were before.

But lambda will be half

as big, and so the product will be the same and therefore,

the final probability will be the same. So it's just a scaling factor

for using different scoring systems. Everyone got that? All right. So what scoring matrix

should we use for DNA? How about this one? So this is now a

slight generalization. So we're going to keep

1 for the matches. You don't lose any generality

by choosing 1 here for matches, because if you use 2,

then lambda is just going to be reduced

to compensate. So 1 for matches. And then we're going to

use m for mismatches.

And m must be negative in

order to satisfy this condition for this theory to work,

that the average score has to be negative. Clearly, you have to have

some negative scores. And the question then

is, should we use minus 1 like we used before? Or should we use like minus 2

or minus 5, or something else? Any thoughts on this? Or does it matter? Maybe it doesn't matter. Yeah, what's your name? AUDIENCE: [INAUDIBLE].

Would it make sense to

not use [INAUDIBLE], because [INAUDIBLE]. PROFESSOR: Yeah, OK. So you want to use a more

complicated scoring system. What particular

mismatches would you want to penalize more and less? AUDIENCE: [INAUDIBLE]

I think [INAUDIBLE] needs to be [INAUDIBLE]. PROFESSOR: Yeah, you are

correct in your intuition. Maybe one of the

biologists wants to offer a suggestion here. Yeah, go ahead. AUDIENCE: So it's a mismatch

between purine and pyrimidine [INAUDIBLE]. PROFESSOR: OK so now we've

got purines and pyrimidines.

So everyone

remember, the purines are A and G. The

pyrimidines are C and T. And the idea is that

this should be penalized, or this should be penalized

less than changing a purine to a pyrimidine. And why does that make sense? AUDIENCE: Well,

structurally they're– PROFESSOR: Structurally, purines

are more similar to each other than they are to pyrimidines. And? More importantly, I think. In evolution? AUDIENCE: [INAUDIBLE]. PROFESSOR: I'm sorry,

can you speak up? AUDIENCE: C to T mutations

happen spontaneously in [INAUDIBLE] chemistry. PROFESSOR: Yes. So C to T mutations

happen spontaneously. So basically, it's

easier because they look more similar structurally. The DNA polymerase is more

likely to make a mistake and substitute another purine. The rate of transitions– purine to

purine, or pyrimidine to pyrimidine– to

transversions, which switch the type, is

about three to one, or two to one, in

different systems. So yeah, that's a good idea. But for simplicity, just

to keep the math simple, we're just going to go

with one mismatch penalty. But that is a good point. In practice, you

might want to do that. So now, I'm saying

I'm going to limit you to one mismatch penalty.

But I'm going to let you

choose any value you want. So what value should you choose? Or does it matter? Or maybe different applications? Tim, yeah? AUDIENCE: I've just

got a question. Does it depend on pi and ri? For example, we could

use all these numbers. But if the overall

wants to be negative, then you couldn't

use negative .1. PROFESSOR: Right,

that's a good point. You can't make it too weak. It may depend on what your

expected fraction of matches is, which actually

depends on pi and ri.

So if you have very biased

sequences, like very AT rich, your expected fraction of

matches is actually higher. When you're searching

an AT rich sequence against another

AT rich sequence, it's actually higher

than a quarter. So even minus one might

not be sufficient there. You might need to go

down more negative. So you may need to use a

more negative value just to make sure that the

expected value is negative. That's true. And yeah, you may want to adjust

it based on the composition. So let's just do a bit more. So it turns out that the

Karlin and Altschul theory, in addition to telling you what

the p value is of your match– the statistical significance–

it also tells you what the matches will

look like in terms of what fraction of

identity they will have.

And this is the so-called

target frequency equation. The theory says that if

I search a query with one particular composition, p,

subject with composition r– here, I've just assumed

they're the same, both p just for simplicity– with a

scoring matrix sij, which has a corresponding lambda. Then, when I take those

very high scoring matches– the ones that are

statistically significant– and I look at those

alignments of those matches, I will get values qij,

given by this formula. So look at the formula. So it's qij. So pipj e to the lambda sij.

So it's basically

the expected chance that you would have base

i matching base j just by chance. That's pipj. But then weighted by

e to the lambda sij. So we notice for a match,

s will be positive, so e to the lambda

s will be bigger than 1. And you'll have more

matches, and you'll have correspondingly

fewer mismatches because the mismatch

has a negative score. So those are the target frequencies. And that also tells you that

the so-called natural scores are actually determined

by the fraction of matches that you want in your

high scoring segments. If we want 90% matches,

we just set qii to be 0.9, and use

this equation here. Solve for sij. For example, if you want to

find regions with R% identities. Little r is just the

r as a proportion. qii is going to be r over 4. This assumes unbiased

base composition. A quarter of the

matches are each of A, C, G, and T. qij, then, is 1 minus r over 12. 1 minus r is the fraction

of non-matching positions. There are 12 different types. Set sii equal to 1, that's

what we said we normally do.
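As a quick sanity check of the target frequency formula before the algebra, the uniform-composition, +1/-1 case can be computed directly; its lambda, as solved earlier, is ln 3:

```python
import math

# q_ij = p_i * p_j * exp(lambda * s_ij), with uniform p_i = 1/4
# and the +1/-1 matrix, whose lambda is ln 3.
lam = math.log(3)
q_match = 0.25 * 0.25 * math.exp(lam * 1)      # 3/16 for each of A, C, G, T
q_mismatch = 0.25 * 0.25 * math.exp(lam * -1)  # 1/48 for each of 12 mismatches

total = 4 * q_match + 12 * q_mismatch  # sums to 1, by the defining equation
identity = 4 * q_match                 # fraction of matching positions: 3/4
```

So a mismatch score of minus 1 targets hits that are about 75% identical– the same minus 1 for 75% identity that falls out of the algebra the professor does next.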

And then you do a little

bit of algebra here. m is sij. And you sort of plug in

this equation twice here. And you get this equation. So it says that m equals

the log of 4 times 1 minus r over 3, divided by the log of 4r. And for this to be

true, this assumes that both the query

and the database have uniform composition

of a quarter, and that r is between

a quarter and 1. The proportion of matches in

your high scoring segment– you want it to be

bigger than a quarter. A quarter is what you

would see by chance. There's something wrong with

your scoring system if you're considering those

to be significant. So it's something above 25%. And so it's just

simple algebra– you can check my work at

home– to solve for m here. And then this equation

then tells you that if I want to find

75% identical matches in a nucleotide

search, I should use a mismatch penalty of minus 1. And if I want 99%

identical matches, I should use a

penalty of minus 3. Not minus 5, but minus 3. And I want you to think

about, does that make sense? Does that not make sense? Because I'm going to ask you

at the beginning of class on Tuesday to

explain and comment on this particular phenomenon

of how when you want higher percent identities, you want a

more negative mismatch score.
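That relationship is easy to play with directly. A short Python sketch of the formula– assuming, as stated, uniform base composition and r between a quarter and 1; the function name is an editorial choice:

```python
import math

def mismatch_penalty(r):
    """Natural mismatch score (with match = +1) that targets high-scoring
    segments with identity fraction r, for uniform base composition."""
    return math.log(4 * (1 - r) / 3) / math.log(4 * r)

# 75% identity targets come out to a penalty of exactly minus 1,
# and 99% identity to roughly minus 3:
m75 = mismatch_penalty(0.75)   # -1.0
m99 = mismatch_penalty(0.99)   # about -3.1
```

Trying a few values of r shows the phenomenon to explain on Tuesday: as the desired percent identity rises, the penalty grows more negative, but slowly.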

Any last questions? Comments? .