Work: The Cryptogram of Life

Jeffreys Copeland & Haemer

(Server/Workstation Expert, January 2002)



It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.
                        --- James D. Watson and Francis H. C. Crick, ``Molecular structure of Nucleic Acids,'' Nature 171, 737-738 (1953)


New Shimmer. For the greatest shine you ever tasted!
                        --- Chevy Chase, Saturday Night Live


Today's puzzler is a cryptogram:

YHQMYMGMYMFMKDMYWMYOMYWFDHWDV

Last column, we raised linguistics questions about Unix. This time, we take a cryptological approach to looking at DNA sequences.

Before we do, let's go back to our puzzler, which is classic cryptology in every sense of the word ``classic.''

You've probably used rot13, which makes jokes illegible (or legible) by rotating each letter halfway around the alphabet, with the command

tr 'a-zA-Z' 'n-za-mN-ZA-M'

Encrypting messages by substituting letters in a rotated alphabet for the original letters is called a Caesar cipher, because Julius Caesar used it.

In fact, the message above is from Julius Caesar, who used a shift of three. Using the command

tr 'd-za-cD-ZA-D' 'a-zA-Z'
to decrypt the message above, we get, uh,
VENJVJDJVJCJHAJVTJVLJVTCAETAS
Well, actually, the Romans didn't have punctuation, lower case, spaces, or the letters, `J', `U', or `W'. Okay, trying
tr 'D-IK-TV-ZA-C' 'A-IK-TV-Z'
instead, we get
VENIVIDIVICIGAIVSIVLIVSCAESAR

Perhaps after Julius Caesar found out he'd have to spell his name ``GAIVSIVLIVSCAESAR,'' spelling it ``MYOMYWFDHWDV'' didn't seem like a big leap.

Now fast-forward a couple of millennia, and take a look at your daily paper. Chances are, the page of your paper that has a daily crossword puzzle also has a daily cryptogram: a quote that you can decrypt by cracking a substitution cipher.

The substitution ciphers in your paper may be more sophisticated than Caesar's -- instead of a simple rotation, the alphabets are often rearranged in an arbitrary ways -- but you can still crack them easily. Face it: if they were hard, they wouldn't be a game in your daily paper.

Typical attacks use letter and word frequencies.

The quote has a one-letter word, `W'? Then `W' is the encrypted version of an `i' or an `a'.

The most frequent letters are `S' and `Z'? Those are probably `e' and `t,' (The nonsense words "etaoin shrdlu" are a mnemonic for the most frequent letters in English, in approximate frequency order.)

Several words end in the trigram ``XRF''? That's probably ``ing.''

Using these guesses makes one of the three-letter words ``tBe''? `B' is probably an `h'.

This approach works because we know where words start and end, and because we know the letters in our alphabet, together with their frequencies and patterns.

If you haven't tried one of these before, and don't have a paper handy, here's a sample:

nyy tnhy vf qvivqrq vagb guerr cnegf, bar bs
juvpu gur orytnr vaunovg, gur ndhvgnav nabgure,
gur guveq, gubfr jub va gurve bja ynathntr ner
pnyyrq prygf, naq va bhef, tnhyf.  nyy gurfr
qvssre sebz rnpu bgure va ynathntr, phfgbzf
naq ynjf. gur evire tnebaar frcnengrf gur tnhyf
sebz gur ndhvgnav; gur znear naq gur frvar
frcnengr gurz sebz gur orytnr. bs nyy gurfr
crbcyr gur oenirfg ner gur orytnr. gurl ner
gur shegurfg njnl sebz gur phygher naq pvivyvmrq
jnlf bs gur ebzna cebivapr, naq ner yrnfg bsgra
ivfvgrq ol gur zrepunagf jub oevat yhkhevrf
juvpu graq gb znxr crbcyr fbsg; nyfb gurl ner
arnerfg gb gur treznaf npebff gur euvar naq
ner pbagvahnyyl ng jne jvgu gurz. -- whyvhf
pnrfne


Cryptography Made Hard

You can probably imagine writing code to help do things like this. So, for that matter, could Alan Turing, who jumped into computing to help crack the German Enigma codes, at Bletchley Park during World War II. So can the sleuths at the NSA; this week they're probably compiling tables of letter and word frequencies from Pashto.

Okay, now try this one.

atggcccagc agtggagcct ccaaaggctc gcaggccgcc
atccgcagga cagctatgag gacagcaccc agtccagcat
cttcacctac accaacagca actccaccag aggccccttc
gaaggcccga attaccacat cgctcccaga tgggtgtacc
acctcaccag tgtctggatg atctttgtgg tcattgcatc
cgttttcaca aatgggcttg tgctggcggc caccatgaag
ttcaagaagc tgcgccaccc gctgaactgg atcctggtga
acctggcggt cgctgacctg gcagagaccg tcatcgccag
cactatcagc gttgtgaacc aggtctatgg ctacttcgtg
ctgggccacc ctatgtgtgt cctggagggc tacaccgtct
ccctgtgtgg gatcacaggt ctctggtctc tggccatcat
ttcctgggag agatggatgg tggtctgcaa gccctttggc
aatgtgagat ttgatgccaa gctggccatc gtgggcattg
ccttctcctg gatctgggct gctgtgtgga cagccccgcc
catctttggt tggagcaggt actggcccca cggcctgaag
acttcatgcg gcccagacgt gttcagcggc agctcgtacc
ccggggtgca gtcttacatg attgtcctca tggtcacctg
ctgcatcacc ccactcagca tcatcgtgct ctgctacctc
caagtgtggc tggccatccg agcggtggca aagcagcaga
aagagtctga atccacccag aaggcagaga aggaagtgac
gcgcatggtg gtggtgatgg tcctggcatt ctgcttctgc
tggggaccat acgccttctt cgcatgcttt gctgctgcca
accctggcta ccccttccac cctttgatgg ctgccctgcc
ggccttcttt gccaaaagtg ccactatcta caaccccgtt
atctatgtct ttatgaaccg gcagtttcga aactgcatct
tgcagctttt cgggaagaag gttgacgatg gctctgaact
ctccagcgcc tccaaaacgg aggtctcatc tgtgtcctcg
gtatcgcctg catga
What you're looking at, above, is a message that tells us why John Dalton was colorblind. All you have to do is translate it.

This is the central problem of molecular genetics. We have lots of encoded text. All we have to do is decode it.

How much encoded text? Well, it changes every day -- especially now, with the Human Genome Project and other sequencing efforts going full-bore -- but a good place to start looking is the National Institute of Health's biological sequence database search engine, Entrez.

You want the complete DNA sequence of a human being? No problem. It's actually only about 3 gigabytes. You can't even buy a 3 gigabyte hard disk any more, but a 20 gigabyte one will only set you back $100 -- half a penny a megabyte. You can store DNA sequences for peanuts. (You want to store DNA sequences from peanuts, Arachis hypogaea? Entrez lists dozens.)

Each sequence is a simple string of the four letters `a', `c', `g', and `t' (for the four molecules adenine, cytosine, guanine, and thymine), strung together in a series of instructions.

But you want to know to read those instructions? What they mean? Ah. Well.


Decoding the Genome

We have two options, here. We could either use the remainder of the column for a full, in-depth course on molecular genetics, ending with a lot of, ``we don't know anything after this,'' or we could do something weird.

It really doesn't seem like a difficult choice, does it?

Up through World War I, a twist on simple substitution ciphers, called the Vigniere cipher, was widely used because it seemed immune to simple attacks like the one outlined above. The basic idea behind the Vigniere cipher is that each letter is encrypted by a Caesar cipher, but different parts of the message use different shifts. For example, you could encrypt even letters rot4 and odd letters, rot23.

More typically, after choosing a keyword like ``gillian'' the first letter of the message is encrypted with a shift of 6 (`g'), the second with a shift of 8 (`i'), and so on until you get to the eighth letter, which starts over with a shift of 6.

Without knowing which parts of the message use which shifts, attacks based on simple letter and cluster frequencies fail.

These periodic substitution ciphers were quite secure until, in the mid-1920s, William Friedman realized that periodicity was the scheme's Achilles' heel and figured out how to crack them.

Here's how.

First, imagine aligning two copies of a long, encrypted message, one under the other like this:

YHQMYMGMYMFMKDMYWMYOMYWFDHWDV
YHQMYMGMYMFMKDMYWMYOMYWFDHWDV
Each letter has the same letter beneath it.

Next, shift the lower sequence some distance.

YHQMYMGMYMFMKDMYWMYOMYWFDHWDV
....YHQMYMGMYMFMKDMYWMYOMYWFDHWDV
Notice that sometimes letters still, by chance, have the same letters under them. (For example, we've cheated outrageously so the first letter of the lower sequence, a `Y', is again directly under a `Y'.)

Without going through the math, we'll claim that the frequency of matches should be highest when the shift is a multiple of the length of the key.

Let's try to make that sound more intuitively obvious.

Imagine that the original text has words ending in ``ing'' scattered throughout the text. Suppose, also, that the encryption key is 10 characters long. Each `n' will be encrypted with a different shift from the `i' on its left and the `g' on its right. However, there are only 10 possible encryptions for ``ing''; and two ``ing'' strings separated by 10 (or 120 or 440) characters from one another will be encrypted the same way: Their encrypted versions will line up.

By looking at which shifts have the most frequent matches, you can figure out how long a key is. Once you've done that, it's easy. For example, if the key-length is 10, start at one end of the sequence, collect every tenth letter -- all of which are encoded with the same shift -- and then use techniques for mono-alphabetic ciphers on them: the most frequent letter is probably an 'e', and so on.

Okay, now let's try that with DNA. (To try this at home or work, you'll want to grab GenBank sequences from Entrez. It's trivial, honest.)

We begin by choosing three sequences: bacteriophage ox174, bacteriophage lambda, and the human mitochondrial genome. ox174 is around 5,000 characters long and lambda is about 50,000. They're reasonable tiny and not-as-tiny test cases. Human mitochondrial DNA is about 16,000; we throw it in both because it's not closely related to the other two and to make this a human interest story.

Plus, each of these has a useful feature: its DNA sequence is circular. If you line a linear sequence up with another copy of itself shifted 500 letters to the right, then 500 letters on the right and 500 on the left are unpaired. With a circular sequence, every letter stays paired no matter how far you rotate either copy. You can make full calculations for every possible alignment.

Finding the frequency of matches with different shifts is pretty easy. Here's our code to do it:

#!/usr/bin/perl -w
# $Id: doshifts,v 1.6 2001/11/12 05:27:13 jsh Exp $

require "gs";
use subs qw(getseq);

require "c";
sub c_idx;

$ARGV[0] ||= "-";
my $s = getseq shift;

my @s = @$s;
my @t = @s;

# shift no more than half way
# -- it's a circle
my $maxshift = int ((@s+1)/2);

for (my $i = 0; $i< $maxshift; $i++) {
  my $c =  c_idx(\@s, \@t);
  printf "$i\t%2.3f\n", ($c-.25)*100;
  push @t, shift @t;
}

(To make the results easier to look at, we arbitrarily subtract 0.25 from each value -- the frequency expected from a random sequence of the four letters `a', `c', `g', and `t' -- and multiply by 100.)

We realize we've only shown you the main program here. To see the required modules, grab them with the program bundle for this column from our web site.

What shifts have the highest frequency of matches? The top 10 for each sequence:

phix174: 66, 78, 9, 12, 57, 21, 96, 48, 33, 30
lambda: 3, 21, 45, 30, 33, 72, 42, 117, 144, 87
human mitochondria: 3, 12, 6, 21, 27, 153, 15,
  132, 9, 186

If these were secret messages, encrypted with the same Vigniere cipher, we'd say the keylength was three.

How far down each list of shifts do we have to search before finding one that isn't a multiple of three? This shell one-liner

tac shiftlist | awk '$1%3 != 0 {print NR; exit}'

gives us these answers

phix174:       66
lambda:             124
human mitochondria: 44

This means that for ox174, the best-matching shift that isn't a multiple of three -- as it turns out, a shift of 1828. -- is worse than 65 other shifts that are divisible by 3. For lambda, the best indivisible-by-three shift is the 124th best, overall.

tac? It's just a backwards cat. It reverses the lines in files.


Huh?

Wait just a minute. Are we seriously suggesting God is Vigniere-encrypting DNA?

Nope. But the periodicity of length three is for a no-less-interesting reason: the genetic code is laid out as triplets.

In case you barely scraped through high-school biology (like Haemer), and you only know computers, think of it this way: each base -- each `a', `c', `g', or `t' -- carries two bits of information. DNA is laid out in 6-bit bytes, called ``codons.'' The codons themselves are not equal in frequency, so matching is highest when the sequences are shifted in multiples of threes.

The high-agreement shifts are also relatively small numbers. Is this because nearby codons tend to be more alike than far-away codons?

Again, no. There are stretches of DNA that aren't laid out as codons. The bigger the shift, the more likely it is that some chunk of DNA that doesn't follow the rules will show up, and knock you out of phase. Also, DNA is double-stranded, and only one strand is read. Half the time, the real information is on the other strand from the one we're trying to match against.

Is this triplet stuff new information?

A third time, no. Marshall Nirenberg, in the mid-1960's, used wet (bio-)chemistry to break the genetic code and assign meanings to all 64, three-base codons. It got him the 1968 Nobel prize in Medicine and Physiology.

Are we still amused that we can use a 1925 cryptological attack on recently obtained DNA sequences to see this regularity?

Yes.


Ideas, anyone?

Bioinformatics is a young field. Like natural history and astronomy, it has first-rate professionals, but it's also a place where amateurs can still play.

The ocean of free DNA and protein data out there are an instant amusement mix. You only need to add a bit of curiosity and a few odd ideas.

If you want to explore, we can recommend a few places we like:


  • The NIH's biological sequence database search engine, Entrez. If you like using the CPAN (Comprehensive Perl Archive Network), check this one out.


  • The home page for Online Mendelian Genetics in Man, the database of everything we know about human genetic anomalies. If you want to start with a problem and work toward the DNA, this is where you should go.

    For example, Copeland is color-blind. Are you? Search for ``color-blind,'' or just go directly to http://www.ncbi.nlm.nih.gov/htbin-post/Omim/dispmim?303800#Reference18.


  • Home page for the bioperl project. If you want to find modules that let you write scripts to grab sequences from the databases, this is a good place to find them. If you want to create a Perl patch that will let you do regular expression matching on circular strings, here's a good place to contribute it.


  • Tom Schneider's home page at the National Cancer Institute's Laboratory of Computational and Experimental Biology. We like this page both because of the interesting stuff Tom has been able to do, and because it hints at how much no one has even tried yet.

    There are only so many hours in a day.


  • On the other hand, for cryptography references, we prefer the dead-tree form. We recommend Helen Fouche Gaines' classic Cryptanalysis (Dover, 1939, ISBN 0-486-20097-3), and David Kahn's recently updated The Codebreakers (Scribners, 1996, ISBN 0-684-83130-9). Alternately, for a less in-depth treatment, you can try to The Code Book by BBC producer Simon Singh (Doubleday, 1999, 0-385-49531-5).

    We'll leave you with a simple observation we stumbled on last night.

    Yeast mitochondrial DNA is about 80,000 bases long. It has the added advantage that some of its four bases (letters) are much rarer than others.

    (Why is this good? Imagine an encrypted version of the King James Bible: English doesn't have very many Z's or Q's, so if you find a letter in the encrypted version that doesn't occur very often, it's a good candidate for a Z or a Q. In general, any sort of asymmetry or non-uniformity like this gives you an edge.)

    What shifts show the highest alignment for this sequence? Here are the first 20:

    3, 6, 4, 2, 12, 7, 9, 15, 18, 21,
    8, 10, 14, 5, 11, 20, 24, 27, 13, 16
    

    Fewer than half are divisible by 3. Ideas?



    Last, Least.

    We end by facing, head-on, the frequently asked question ``Why do you guys write about bizarre combinations, like linguistics and Unix or cryptology and molecular genetics?''

    This time we were inspired some by kind notes from a pair of former co-workers, Bill Torcaso and Tom Schneider, both of whom do bioinformatics for a living.

    Most of the time, however, we do it because we're deeply warped individuals.

    So, until next month, when we try for another column that's both a dessert topping and a floor wax, happy trails.