It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.
--- James D. Watson and Francis H. C. Crick, ``Molecular structure of Nucleic Acids,'' Nature 171, 737-738 (1953)
New Shimmer. For the greatest shine you ever tasted!
--- Chevy Chase, Saturday Night Live
Today's puzzler is a cryptogram:
YHQMYMGMYMFMKDMYWMYOMYWFDHWDV
Last column, we raised linguistics questions about Unix.
This time, we take a cryptological approach to looking at DNA
sequences.
Before we do, let's go back to our puzzler, which is
classic cryptology in every sense of the word ``classic.''
You've probably used rot13, which makes jokes illegible (or legible) by rotating each letter halfway around the alphabet, with the command
tr 'a-zA-Z' 'n-za-mN-ZA-M'
Encrypting messages by substituting letters in a rotated
alphabet for the original letters is called a Caesar
cipher, because Julius Caesar used it.
In fact, the message above is from Julius Caesar, who used a shift of three. Using the command
to decrypt the message above, we get, uh,tr 'd-za-cD-ZA-D' 'a-zA-Z'
Well, actually, the Romans didn't have punctuation, lower case, spaces, or the letters, `J', `U', or `W'. Okay, tryingVENJVJDJVJCJHAJVTJVLJVTCAETAS
instead, we gettr 'D-IK-TV-ZA-C' 'A-IK-TV-Z'
VENIVIDIVICIGAIVSIVLIVSCAESAR
Perhaps after Julius Caesar found out he'd have to spell
his name ``GAIVSIVLIVSCAESAR,'' spelling it ``MYOMYWFDHWDV''
didn't seem like a big leap.
Now fast-forward a couple of millennia, and take a look
at your daily paper. Chances are, the page of your paper that
has a daily crossword puzzle also has a daily cryptogram: a quote
that you can decrypt by cracking a substitution cipher.
The substitution ciphers in your paper may be more
sophisticated than Caesar's -- instead of a simple rotation, the
alphabets are often rearranged in an arbitrary ways -- but you
can still crack them easily. Face it: if they were hard, they
wouldn't be a game in your daily paper.
Typical attacks use letter and word frequencies.
The quote has a one-letter word, `W'? Then `W' is the
encrypted version of an `i' or an `a'.
The most frequent letters are `S' and `Z'? Those are
probably `e' and `t,' (The nonsense words "etaoin shrdlu" are a
mnemonic for the most frequent letters in English, in approximate
frequency order.)
Several words end in the trigram ``XRF''? That's
probably ``ing.''
Using these guesses makes one of the three-letter words
``tBe''? `B' is probably an `h'.
This approach works because we know where words start
and end, and because we know the letters in our alphabet,
together with their frequencies and patterns.
If you haven't tried one of these before, and don't have a paper handy, here's a sample:
nyy tnhy vf qvivqrq vagb guerr cnegf, bar bs juvpu gur orytnr vaunovg, gur ndhvgnav nabgure, gur guveq, gubfr jub va gurve bja ynathntr ner pnyyrq prygf, naq va bhef, tnhyf. nyy gurfr qvssre sebz rnpu bgure va ynathntr, phfgbzf naq ynjf. gur evire tnebaar frcnengrf gur tnhyf sebz gur ndhvgnav; gur znear naq gur frvar frcnengr gurz sebz gur orytnr. bs nyy gurfr crbcyr gur oenirfg ner gur orytnr. gurl ner gur shegurfg njnl sebz gur phygher naq pvivyvmrq jnlf bs gur ebzna cebivapr, naq ner yrnfg bsgra ivfvgrq ol gur zrepunagf jub oevat yhkhevrf juvpu graq gb znxr crbcyr fbsg; nyfb gurl ner arnerfg gb gur treznaf npebff gur euvar naq ner pbagvahnyyl ng jne jvgu gurz. -- whyvhf pnrfne
Cryptography Made
Hard
You can probably imagine writing code to help do things
like this. So, for that matter, could Alan Turing, who jumped
into computing to help crack the German Enigma codes, at
Bletchley Park during World War II. So can the sleuths at the
NSA; this week they're probably compiling tables of letter and
word frequencies from Pashto.
Okay, now try this one.
What you're looking at, above, is a message that tells us why John Dalton was colorblind. All you have to do is translate it.atggcccagc agtggagcct ccaaaggctc gcaggccgcc atccgcagga cagctatgag gacagcaccc agtccagcat cttcacctac accaacagca actccaccag aggccccttc gaaggcccga attaccacat cgctcccaga tgggtgtacc acctcaccag tgtctggatg atctttgtgg tcattgcatc cgttttcaca aatgggcttg tgctggcggc caccatgaag ttcaagaagc tgcgccaccc gctgaactgg atcctggtga acctggcggt cgctgacctg gcagagaccg tcatcgccag cactatcagc gttgtgaacc aggtctatgg ctacttcgtg ctgggccacc ctatgtgtgt cctggagggc tacaccgtct ccctgtgtgg gatcacaggt ctctggtctc tggccatcat ttcctgggag agatggatgg tggtctgcaa gccctttggc aatgtgagat ttgatgccaa gctggccatc gtgggcattg ccttctcctg gatctgggct gctgtgtgga cagccccgcc catctttggt tggagcaggt actggcccca cggcctgaag acttcatgcg gcccagacgt gttcagcggc agctcgtacc ccggggtgca gtcttacatg attgtcctca tggtcacctg ctgcatcacc ccactcagca tcatcgtgct ctgctacctc caagtgtggc tggccatccg agcggtggca aagcagcaga aagagtctga atccacccag aaggcagaga aggaagtgac gcgcatggtg gtggtgatgg tcctggcatt ctgcttctgc tggggaccat acgccttctt cgcatgcttt gctgctgcca accctggcta ccccttccac cctttgatgg ctgccctgcc ggccttcttt gccaaaagtg ccactatcta caaccccgtt atctatgtct ttatgaaccg gcagtttcga aactgcatct tgcagctttt cgggaagaag gttgacgatg gctctgaact ctccagcgcc tccaaaacgg aggtctcatc tgtgtcctcg gtatcgcctg catga
This is the central problem of molecular genetics. We
have lots of encoded text. All we have to do is decode it.
How much encoded text? Well, it changes every day --
especially now, with the Human Genome Project and other
sequencing efforts going full-bore -- but a good place to start
looking is the National Institute of Health's biological sequence
database search engine, Entrez.
You want the complete DNA sequence of a human
being? No problem. It's actually only about 3 gigabytes. You
can't even buy a 3 gigabyte hard disk any more, but a 20 gigabyte
one will only set you back $100 -- half a penny a megabyte. You
can store DNA sequences for peanuts. (You want to store DNA
sequences from peanuts, Arachis hypogaea? Entrez
lists dozens.)
Each sequence is a simple string of the four letters
`a', `c', `g', and `t' (for the four molecules adenine, cytosine,
guanine, and thymine), strung together in a series of
instructions.
But you want to know to read those instructions? What
they mean? Ah. Well.
Decoding the
Genome
We have two options, here. We could either use the
remainder of the column for a full, in-depth course on molecular
genetics, ending with a lot of, ``we don't know anything after
this,'' or we could do something weird.
It really doesn't seem like a difficult choice, does it?
Up through World War I, a twist on simple substitution
ciphers, called the Vigniere cipher, was widely used because it
seemed immune to simple attacks like the one outlined above. The
basic idea behind the Vigniere cipher is that each letter is
encrypted by a Caesar cipher, but different parts of the message
use different shifts. For example, you could encrypt even
letters rot4 and odd letters, rot23.
More typically, after choosing a keyword like
``gillian'' the first letter of the message is encrypted with a
shift of 6 (`g'), the second with a shift of 8 (`i'), and so on
until you get to the eighth letter, which starts over with a
shift of 6.
Without knowing which parts of the message use which
shifts, attacks based on simple letter and cluster frequencies
fail.
These periodic substitution ciphers were quite
secure until, in the mid-1920s, William Friedman realized that
periodicity was the scheme's Achilles' heel and figured out how
to crack them.
Here's how.
First, imagine aligning two copies of a long, encrypted message, one under the other like this:
Each letter has the same letter beneath it.YHQMYMGMYMFMKDMYWMYOMYWFDHWDV YHQMYMGMYMFMKDMYWMYOMYWFDHWDV
Next, shift the lower sequence some distance.
Notice that sometimes letters still, by chance, have the same letters under them. (For example, we've cheated outrageously so the first letter of the lower sequence, a `Y', is again directly under a `Y'.)YHQMYMGMYMFMKDMYWMYOMYWFDHWDV ....YHQMYMGMYMFMKDMYWMYOMYWFDHWDV
Without going through the math, we'll claim that the
frequency of matches should be highest when the shift is a
multiple of the length of the key.
Let's try to make that sound more intuitively obvious.
Imagine that the original text has words ending in
``ing'' scattered throughout the text. Suppose, also, that the
encryption key is 10 characters long. Each `n' will be encrypted
with a different shift from the `i' on its left and the `g' on
its right. However, there are only 10 possible encryptions for
``ing''; and two ``ing'' strings separated by 10 (or 120 or 440)
characters from one another will be encrypted the same way: Their
encrypted versions will line up.
By looking at which shifts have the most frequent
matches, you can figure out how long a key is. Once you've done
that, it's easy. For example, if the key-length is 10, start at
one end of the sequence, collect every tenth letter -- all of
which are encoded with the same shift -- and then use techniques
for mono-alphabetic ciphers on them: the most frequent letter is
probably an 'e', and so on.
Okay, now let's try that with DNA. (To try this at home
or work, you'll want to grab GenBank sequences from Entrez. It's
trivial, honest.)
We begin by choosing three sequences: bacteriophage
ox174, bacteriophage lambda, and the human mitochondrial genome.
ox174 is around 5,000 characters long and lambda is about 50,000.
They're reasonable tiny and not-as-tiny test cases. Human
mitochondrial DNA is about 16,000; we throw it in both because
it's not closely related to the other two and to make this a
human interest story.
Plus, each of these has a useful feature: its DNA
sequence is circular. If you line a linear sequence up with
another copy of itself shifted 500 letters to the right, then 500
letters on the right and 500 on the left are unpaired. With a
circular sequence, every letter stays paired no matter how far
you rotate either copy. You can make full calculations for every
possible alignment.
Finding the frequency of matches with different shifts
is pretty easy. Here's our code to do it:
#!/usr/bin/perl -w # $Id: doshifts,v 1.6 2001/11/12 05:27:13 jsh Exp $ require "gs"; use subs qw(getseq); require "c"; sub c_idx; $ARGV[0] ||= "-"; my $s = getseq shift; my @s = @$s; my @t = @s; # shift no more than half way # -- it's a circle my $maxshift = int ((@s+1)/2); for (my $i = 0; $i< $maxshift; $i++) { my $c = c_idx(\@s, \@t); printf "$i\t%2.3f\n", ($c-.25)*100; push @t, shift @t; }
(To make the results easier to look at, we arbitrarily
subtract 0.25 from each value -- the frequency expected from a
random sequence of the four letters `a', `c', `g', and `t' -- and
multiply by 100.)
We realize we've only shown you the main program here.
To see the required modules, grab them with the program bundle
for this column from our web site.
What shifts have the highest frequency of matches? The
top 10 for each sequence:
phix174: 66, 78, 9, 12, 57, 21, 96, 48, 33, 30 lambda: 3, 21, 45, 30, 33, 72, 42, 117, 144, 87 human mitochondria: 3, 12, 6, 21, 27, 153, 15, 132, 9, 186
If these were secret messages, encrypted with the same
Vigniere cipher, we'd say the keylength was three.
How far down each list of shifts do we have to search
before finding one that isn't a multiple of three? This shell
one-liner
tac shiftlist | awk '$1%3 != 0 {print NR; exit}'
gives us these answers
phix174: 66 lambda: 124 human mitochondria: 44
This means that for ox174, the best-matching shift that
isn't a multiple of three -- as it turns out, a shift of 1828.
-- is worse than 65 other shifts that are divisible by 3.
For lambda, the best indivisible-by-three shift is the 124th
best, overall.
tac? It's just a backwards cat. It
reverses the lines in files.
Huh?
Wait just a minute. Are we seriously suggesting God is
Vigniere-encrypting DNA?
Nope. But the periodicity of length three is for a no-less-interesting reason: the genetic code is laid out as
triplets.
In case you barely scraped through high-school biology
(like Haemer), and you only know computers, think of it this way:
each base -- each `a', `c', `g', or `t' -- carries two bits of
information. DNA is laid out in 6-bit bytes, called ``codons.''
The codons themselves are not equal in frequency, so matching is
highest when the sequences are shifted in multiples of threes.
The high-agreement shifts are also relatively small
numbers. Is this because nearby codons tend to be more alike
than far-away codons?
Again, no. There are stretches of DNA that aren't laid
out as codons. The bigger the shift, the more likely it is that
some chunk of DNA that doesn't follow the rules will show up, and
knock you out of phase. Also, DNA is double-stranded, and only
one strand is read. Half the time, the real information is on
the other strand from the one we're trying to match against.
Is this triplet stuff new information?
A third time, no. Marshall Nirenberg, in the
mid-1960's, used wet (bio-)chemistry to break the genetic code
and assign meanings to all 64, three-base codons. It got him the
1968 Nobel prize in Medicine and Physiology.
Are we still amused that we can use a 1925 cryptological
attack on recently obtained DNA sequences to see this regularity?
Yes.
Ideas, anyone?
Bioinformatics is a young field. Like natural history
and astronomy, it has first-rate professionals, but it's also a
place where amateurs can still play.
The ocean of free DNA and protein data out there are an
instant amusement mix. You only need to add a bit of curiosity
and a few odd ideas.
If you want to explore, we can recommend a few places we like:
The NIH's biological sequence database search engine, Entrez. If you like using the CPAN (Comprehensive Perl Archive Network), check this one out.
The home page for Online Mendelian Genetics in
Man, the database of everything we know about human genetic
anomalies. If you want to start with a problem and work toward
the DNA, this is where you should go.
For example, Copeland is color-blind. Are you? Search for ``color-blind,'' or just go directly to http://www.ncbi.nlm.nih.gov/htbin-post/Omim/dispmim?303800#Reference18.
Home page for the bioperl project. If you want to find modules that let you write scripts to grab sequences from the databases, this is a good place to find them. If you want to create a Perl patch that will let you do regular expression matching on circular strings, here's a good place to contribute it.
Tom Schneider's home page at the National Cancer
Institute's Laboratory of Computational and Experimental Biology.
We like this page both because of the interesting stuff Tom has
been able to do, and because it hints at how much no one has even
tried yet.
There are only so many hours in a day.
On the other hand, for cryptography references, we
prefer the dead-tree form. We recommend Helen Fouche Gaines'
classic Cryptanalysis (Dover, 1939, ISBN 0-486-20097-3),
and David Kahn's recently updated The Codebreakers
(Scribners, 1996, ISBN 0-684-83130-9). Alternately, for a less
in-depth treatment, you can try to The Code Book by BBC
producer Simon Singh (Doubleday, 1999, 0-385-49531-5).
We'll leave you with a simple observation we stumbled on
last night.
Yeast mitochondrial DNA is about 80,000 bases long. It
has the added advantage that some of its four bases (letters) are
much rarer than others.
(Why is this good? Imagine an encrypted version of the
King James Bible: English doesn't have very many Z's or Q's, so
if you find a letter in the encrypted version that doesn't occur
very often, it's a good candidate for a Z or a Q. In general,
any sort of asymmetry or non-uniformity like this gives you an
edge.)
What shifts show the highest alignment for this
sequence? Here are the first 20:
3, 6, 4, 2, 12, 7, 9, 15, 18, 21, 8, 10, 14, 5, 11, 20, 24, 27, 13, 16
Fewer than half are divisible by 3. Ideas?
Last, Least.
We end by facing, head-on, the frequently asked question
``Why do you guys write about bizarre combinations, like
linguistics and Unix or cryptology and molecular genetics?''
This time we were inspired some by kind notes from a
pair of former co-workers, Bill Torcaso and Tom Schneider, both
of whom do bioinformatics for a living.
Most of the time, however, we do it because we're deeply
warped individuals.
So, until next month, when we try for another column that's both a dessert topping and a floor wax, happy trails.