These AIs Are About to Revolutionize Biology

[♪ INTRO] There is an unsolved mystery at the heart
of biology that’s been slowing progress in medicine
for half a century. Whether you’re a biochemist trying to understand
life, or a drug designer trying to save lives, you may have run into
the protein folding problem. That is, despite the fact that proteins are
fundamental to life, it’s really hard to predict what they look
like.

But on the 15th of July 2021, two independent
groups announced that they’d cracked it, and it’s all thanks to some
seriously clever artificial intelligence. Which is exciting, because it could eventually
lead to breakthroughs in the fights against cancer and COVID-19, and even toxic
waste. See, proteins are the building blocks of life. Everything in your body, and everything in
organisms all the way down to viruses, is made either
from or by proteins.

Proteins transport oxygen in your blood, they
digest foods, copy DNA, fight infections, build structures… it’s not just the things you eat to gain
muscles here! In fact, DNA, which is the code that makes
you you, boils down to a series of instructions for
making proteins. And proteins are made from a set of twenty building block molecules called amino acids. If proteins are like words, amino acids are the alphabet they’re made from. When your body makes proteins, it reads instructions
from your DNA to make long strings of amino acids, which
fold in on themselves in particular ways, and settle
into specific shapes. That shape determines how the protein works,
because proteins need to fit together with other proteins like puzzle pieces,
or latch onto specific molecules. And that makes proteins different from something
like DNA, where knowing the sequence is the same as
knowing what it does. For a protein, the shape matters too. The neat thing, though, is that the way the
amino acid string folds is determined by the sequence. If you know that, you should be able to work out the protein’s
final shape or shapes.

And it’s easy enough to figure out what
the sequence should be, because that’s given to you by the DNA — and
we can read the genetic code. But working out the shape is way harder. A protein can be made from anywhere from fifty
to two thousand amino acids. And each amino acid has a slightly different
chemical structure, adding to the complexity.
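To make the "easy" half of the problem concrete, here is a minimal sketch of reading a DNA sequence three letters at a time and turning it into a string of amino acids. The `CODON_TABLE` below is a deliberately tiny subset of the real 64-codon genetic code, and the function name `translate` and the example sequence are made up for illustration.

```python
# Toy subset of the genetic code; the real table has 64 codons.
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start signal
    "GCT": "A",  # alanine
    "AAA": "K",  # lysine
    "GAA": "E",  # glutamate
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop signal
}


def translate(dna: str) -> str:
    """Read DNA three letters at a time and return one-letter amino acid codes."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3].upper()
        # "X" marks a codon that is missing from this toy table.
        amino_acid = CODON_TABLE.get(codon, "X")
        if amino_acid == "*":
            break  # a stop codon ends the protein
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCTAAAGAATGGTAA"))  # prints MAKEW
```

Getting from that string of letters to a 3D shape is the part no lookup table can do.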

Individual parts of the amino acids can interact
with all the other nearby amino acids, and even some of the faraway
ones, pushing the folding in seemingly random directions. It still ends up making a useful structure
that is the same every time, but it is very hard to predict how it gets
there. So even though we know the amino acid sequences
of billions of proteins, we’re still stuck stumbling in the dark
when it comes to working out their shapes.

One way we currently study protein structure
is using X-ray crystallography. A solution that contains proteins is slowly evaporated so that the proteins left behind
form crystals. X-rays are fired at the crystals to get images,
and those images are assembled into a 3D model. Another method uses nuclear magnetic resonance,
the same physics behind the MRI scanners used in hospitals to image the
human body. These processes are time-consuming: they can take days or even years depending
on the protein. But in 2020, a team from the London-based, Google-owned company DeepMind made a stunning
announcement. They claimed that their new AI algorithm,
AlphaFold2, could predict the folded shape of proteins from
amino acid sequences about as well as experimental methods. DeepMind had previously made a name for itself
by making AIs that can beat world champions in chess,
Go, shogi, and even StarCraft II. But the game-playing AIs were just a warm-up
for the real challenge of doing science. The proof of their claim came from the results
of a competition held in 2020 called CASP14.

CASP is a competition held once every two
years for people trying to solve the protein folding
problem with computers. In the competition, teams are given the amino
acid sequences of about 100 proteins with known
structures. They’ve been worked out with experiments, like X-ray crystallography, but the structures
haven’t been revealed publicly. The teams then predict what the structure
of the protein will look like after folding, and independent judges compare the predictions
to the experiments.
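One generic way to put a number on "how close is the prediction to the experiment" is the root-mean-square deviation (RMSD) between matching atoms. The sketch below is only an illustration: CASP's own scoring is more involved, the function name `rmsd` and the coordinates are made up, and the code assumes the two structures have already been lined up on top of each other.

```python
import math


def rmsd(predicted, experimental):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) atoms.

    Assumes both structures are already superimposed; no alignment is done here.
    """
    if len(predicted) != len(experimental):
        raise ValueError("both structures need the same number of atoms")
    total = 0.0
    for p, e in zip(predicted, experimental):
        # Squared distance between one predicted atom and its experimental partner.
        total += sum((a - b) ** 2 for a, b in zip(p, e))
    return math.sqrt(total / len(predicted))


# Hypothetical coordinates, in angstroms.
experiment = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
prediction = [(0.1, 0.0, 0.0), (1.5, 0.1, 0.0), (1.4, 1.5, 0.0)]
print(rmsd(prediction, experiment))
```

A smaller value means the predicted atoms sit closer to where the experiment put them.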

Before 2020, no team came close to experimental
accuracy. Not even DeepMind themselves when they entered in 2018 with their original iteration of AlphaFold. But 2020 was different, and not just because
the competition involved proteins from a new virus
called SARS-CoV-2! In CASP14, AlphaFold2 cracked the problem
wide open: two-thirds of their predictions were about
as good as experiments. And it’s not like the experiments were perfect,
either. For some predictions where AlphaFold2 disagreed
with the experiments, it wasn’t actually clear which of the two
was closer to the real structure, since both predictions and experiments always
come with a certain margin of error. Many of AlphaFold2’s predictions were remarkably
precise. For some, the margin of error was the size
of a typical atom. In other words, they guessed the exact spot
where each of the thousands of individual atoms
was placed in the overall structure. It does not get much better than that! When the results were announced, scientists
were stunned by the breakthrough. Some had thought they would never see the
problem solved in their lifetimes, and called it a ‘holy grail’ of sorts.

Basically, AlphaFold2 and the DeepMind team
had kind of ‘solved’ the protein folding problem. But only kind of. The best was yet to come. DeepMind gave a short presentation about their
algorithm, which was seen by a group from the University
of Washington, Seattle, who then applied what they saw to their own
work on the problem. Working with an international team of collaborators, they developed their own algorithm, which
they called RoseTTAFold. They combined the DeepMind team’s ideas
with some of their own, and made a program that was actually better
than AlphaFold in a few ways. Finally, on the same day in July 2021, both
DeepMind and Seattle teams published their full methods, and made
the code free for academics to use.

In a day, we went from having no good way
to predict protein structure on a computer to having two different, highly accurate options. And there are already plenty of ideas for
how to use them. Firstly, DeepMind have published a paper with
predictions for the structure of almost all human proteins,
but there’s more, too. Scientists are always trying to design drugs that stop proteins from binding to other proteins. Which is useful if those proteins are made
by a cancerous cell, or, y’know, the virus that causes COVID-19. That process often relies on being able to
modify and design new proteins. Being able to see the protein you’re designing
on a computer without needing to go out and make it is hugely
useful.

It could potentially streamline the process
from taking years to taking days. And there are more exotic ideas for new proteins
out there, too. Proteins in our bodies break down harmful
chemicals and make useful byproducts all the time, so imagine making
an artificial protein that can break down toxic waste, or produce
biofuels. So, that’s what they did. Now the question is: how did the DeepMind and Seattle teams bring home this
holy grail of biochemistry? If you can’t tell, I, personally, am very
excited about this because it’s what I worked on in research and it was so hard and
I’m so excited to see it get easier.

Both methods use what’s called deep learning
and neural networks. A neural network contains layers of interconnected
nodes, or artificial ‘neurons’ made of data, which use algorithms to mimic
the way biological neurons in our brains send signals back and forth
to one another. Deep learning is a subset of machine learning,
and it’s basically a neural network with at least three layers of nodes: an input
layer, one or more "hidden" intermediary layers,
and an output layer.
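As a rough picture of those three layers, here is a minimal sketch of data flowing from an input layer, through one hidden layer, to an output layer. Everything in it is an assumption for illustration: the weights are made-up numbers rather than learned ones, the sigmoid is just one common choice of node equation, and real protein-folding networks are vastly larger.

```python
import math


def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """Each node takes a weighted sum of all inputs, adds its bias, and squashes it."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """Pass the input layer's values through the hidden layer, then the output layer."""
    hidden = layer(inputs, hidden_weights, hidden_biases)
    return layer(hidden, output_weights, output_biases)


# Two inputs, three hidden nodes, one output node; all numbers are invented.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[1.0, -1.0, 0.5]]
output_b = [0.0]
print(forward([1.0, 2.0], hidden_w, hidden_b, output_w, output_b))
```

The output of each node becomes an input to every node in the next layer, which is all "layers working together" means at this scale.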

The idea is that with all these different
layers working together, the program can learn to recognize and identify
objects in a dataset. A classic example of how this works is training
a neural network to look at an image and work out whether it
includes a cat. You could think of each node in a neural network
as its own tiny mathematical model, using an equation
to predict an outcome. This prediction will be the output of the
node, and this data will be sent along as the input data to another node in the next
layer of the neural network. This process repeats over and over again. In the cat image example, you could use a
set of correctly labeled training data to help “train” the algorithm. Like, say, thousands of images, some with
cats, some without, each labeled as containing a cat or not. The algorithm makes its predictions and checks for errors against the known dataset. Then, it can use the data from the errors
to adjust its predictions. And over time, it can gradually become more
accurate.
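The predict, check, adjust loop described above can be sketched with a single node learning from a handful of labeled examples. This is a toy under stated assumptions: instead of real images, each example is two invented yes/no features (`has_whiskers`, `has_wheels`), and the learning rate and number of passes are arbitrary picks.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(examples, labels, epochs=2000, learning_rate=0.5):
    """Repeatedly predict, measure the error, and nudge the weights to shrink it."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(examples, labels):
            prediction = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = prediction - label  # positive means the guess was too high
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, features):
    """Return True when the node thinks the example is a cat."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias) >= 0.5


# Invented training data: (has_whiskers, has_wheels) and whether it is a cat.
examples = [(1, 0), (1, 0), (0, 1), (0, 0)]
labels = [1, 1, 0, 0]
weights, bias = train(examples, labels)
print(predict(weights, bias, (1, 0)), predict(weights, bias, (0, 1)))
```

Each pass through the data moves the weights a little in whichever direction reduces the error, which is the "gradually become more accurate" part.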

Eventually, you could give it testing data,
asking it to identify cats in images it hasn’t been trained on, and
checking if it’s right. So, applying this to protein-folding, the training and testing data consisted of
correctly folded proteins. The algorithms train themselves by looking
at these proteins and trying to fold the proteins themselves
to match what they see. Now, for the DeepMind and Seattle teams, that
training and testing data came from the Protein Data Bank, which has over 180,000
protein structures and their amino acid sequences. Now at the basic level, most of the CASP14
entries also used a deep learning approach, but the DeepMind and
Seattle teams had some clever tricks up their sleeves.

See, this is a more complex task than looking
at cat pictures. The teams had to do more than just put in
previously-folded proteins and hope the algorithm learned how
to fold more. So when the researchers built their algorithms, they needed to make adjustments specific to
the world of protein folding, which they needed deep biochemistry expertise
to do. For example, biochemists can predict on a
more general level if an amino acid sequence is likely to form
a structure like a coil or a flat sheet. And they can use that knowledge to guide the
neural network to its predictions. In fact, AlphaFold uses multiple neural networks that feed into each other in two stages. AlphaFold starts with a network that reads
and folds the amino acid sequence and adjusts how far apart pairs of amino acids
are in the overall structure. And that’s followed by a structure model
network that reads the whole thing, creates a 3D structure, and makes adjustments
at the end.
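The "how far apart pairs of amino acids are" idea can be pictured as a table of distances, one entry for every pair. The sketch below just computes such a table from 3D coordinates; it is not AlphaFold's code, and the function name `distance_matrix` and the coordinates are invented for illustration.

```python
import math


def distance_matrix(coords):
    """Return a table where entry [i][j] is the distance between residues i and j."""
    return [[math.dist(a, b) for b in coords] for a in coords]


# Hypothetical positions of four amino acids, in angstroms.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
for row in distance_matrix(coords):
    print([round(d, 1) for d in row])
```

Going from coordinates to distances is the easy direction; the networks have to work the other way, from guessed distances to a structure that satisfies them all at once.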

One of RoseTTAFold’s innovations was to
add a simultaneous third neural network: one that tracks where the amino acids are
in 3D space as the structure folds, alongside the 1D and 2D information. On top of that, protein-folding researchers
have started to look at the evolutionary history of closely related proteins
to work out their structures. If you know how closely related two proteins
are, and you know the structure of one of them, you can make good guesses about
the structure of the other. The two teams both took advantage of this
idea. RoseTTAFold isn’t quite as accurate as AlphaFold,
but it works using way less computing power and time, working out in minutes what AlphaFold needs
hours to do.

That’s because the Seattle team didn’t have the
immense computing power of Google at their disposal to do the calculations. But RoseTTAFold can actually do things AlphaFold
can’t do yet. For instance, AlphaFold struggles with the crucial but difficult subject of protein complexes, where multiple proteins interact and their
structures depend on which specific proteins are present. Many proteins require partners to function,
so understanding protein complexes is part of understanding
proteins. But RoseTTAFold can handle proteins where
the amino acid chain is broken into more than one piece, and that
same logic can be used to study interactions of different proteins with
each other, which is needed to analyze complexes. So both methods have interesting things to
bring to the table. However, no one here is envisioning replacing
the biologists who study protein structures. I mean, the AIs wouldn’t have gotten far
without the massive databases of protein structures people
have built up over decades. And the AI is only useful because it will
let humans do more stuff. Like quickly designing and testing artificial
proteins. With that power, biochemists will create exotic
proteins we can only dream of, and biologists will be able to explore the
history and nature of life like never before.

With all of this potential, the future of
biology is starting to take shape. Thank you for watching this episode of SciShow, which was brought to you as always with the
help of our amazing patrons. We would not be able to make ten minute long
videos about the overlap between chemical biology and computer science
without you, and we appreciate the heck out of your support. If you’d like to get involved making things
like this possible, it’s so easy to do that.

You just have to go to patreon.com/scishow. [♪ OUTRO].
