The evidence in favor of the thesis that proteins are random sequences of amino acids is exiguous; and random words may well be grouped into nonrandom sequences. This suggests that the close study of the statistical properties of certain proteins may involve a kind of dense conceptual myopia, something that reflects a passionate absorption in minutiae. The process by which evolution in strings takes place, on the other hand, is macroscopic and global, an energetic probabilistic swarming over sample spaces that are never specified by means of mechanisms that are never clarified.
Biological paths
Life loiters over two metric spaces. The first is alphabetic; the second, zoological. Evolution comprises a drama in the large, at the zoological level; but the Central Dogma requires that any change in the large be mirrored by an alphabetic change, and so the process is doubled as it is divided. To talk blithely of evolution in strings is to assume the completion of the first two steps in biological evolution:
the emergence of life-like systems from inorganic matter; and the adventitious creation of the modern biological system of replication and genetic information. An explanation of these steps I cede to the forces of the Night: my more limited concern is with evolution as a process that takes place once the genetic machinery is throbbing moistly. In evolution at the molecular level, one amino acid is dropped from a protein string, another is inserted: make way!, move over!, get out!, get lost!, to cast the operations in easily understood terms; even if the process is more complicated, it may mathematically be resolved into discrete and finite steps. Whatever the details, proteins change over time; and the changes leading to their creation may be regarded as a path $P = p_1, p_2, \ldots, p_n$ or protein sequence. Suppose that $A$ comprises the full stock of 20 amino acids; $A'$, the set of all words of amino acids precisely 250 points in length; and $A^*$, the set of all finite sequences drawn over $A'$. I assume (an assumption, note!) that $A^*$ has the structure of a language-like system under the binary and associative operation of protein concatenation, where concatenation has precisely its usual linguistic meaning.
28 D. Berlinski
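The language-like structure assumed above can be made concrete in a few lines. What follows is only a sketch: the one-letter amino acid codes are assumed as the alphabet $A$, and the names `AMINO_ACIDS`, `WORD_LENGTH`, `is_word`, `random_word` and `concat` are illustrative, not the author's.

```python
import random

# Assumption: the 20 standard amino acids, in one-letter codes, stand in for A.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WORD_LENGTH = 250  # length of each word in A', as in the text


def is_word(w: str) -> bool:
    """True if w belongs to A': exactly WORD_LENGTH letters drawn from A."""
    return len(w) == WORD_LENGTH and all(c in AMINO_ACIDS for c in w)


def random_word(rng: random.Random) -> str:
    """Draw one element of A' uniformly at random (for illustration only)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(WORD_LENGTH))


def concat(p: tuple, q: tuple) -> tuple:
    """Concatenate two elements of A*, each a finite sequence of words."""
    return p + q


rng = random.Random(0)
p, q, r = ((random_word(rng),) for _ in range(3))

# Concatenation is associative; if the empty sequence is admitted,
# it acts as an identity.
assert concat(concat(p, q), r) == concat(p, concat(q, r))
assert concat(p, ()) == p

# A' is finite but enormous: print the number of decimal digits of 20**250.
print(len(str(len(AMINO_ACIDS) ** WORD_LENGTH)))
```

Representing an element of $A^*$ as a tuple of words keeps the word boundaries visible, so that concatenation acts on sequences of words and not on raw letters.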
Stochastic processes
Let S be a system and X the set of its states or configurations. State transitions are represented by a transformation $T : X \to X$, an artifice expressing the action of the system's laws of evolution. If $T_{s+t} = T_s T_t$, $\{T_t : t \in \mathbb{R}\}$ is a flow, or group action of $\mathbb{R}$ on X.
On the Darwinian theory, evolution is at its secret heart stochastic; it is natural, therefore, to specialize the concept of a process to the case in which X is a measure space, T a measure-preserving transformation. This is the domain chiefly of ergodic theory. Its underlying, indeed, fundamental, object is a probability space $(X, B, \mu)$, where X is a set of states, B a $\sigma$-algebra of measurable subsets of X, and $\mu$ a countably additive nonnegative set function on B. $\mu(X)$ is, of course, 1. Let T be an invertible injection from X onto X; if $\mu(T^{-1}E) = \mu(E)$ for all E in B, T is a measure-preserving transformation; the system $(X, B, \mu, T)$, a basic probability space. [18]
By the orbit of a measure-preserving transformation T, I mean the extended history of a single point x under T from the infinite past to the infinite future: a trajectory from void to void. Artificially truncated at x, the system is in an initial state or condition. A real-valued function $f : X \to \mathbb{R}$, whose values correspond to $f(x), f(Tx), f(T^2 x), \ldots$, acts to measure a system along its orbit; the class of such measurements is defined only to the extent that f is itself measurable:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} f(T^k x)$$
is thus the time mean of the system;
$$\int_X f \, d\mu$$
its space mean: systems in which the two coincide for every measurable function are ergodic.
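The coincidence of time mean and space mean can be checked numerically on a standard example that is not taken from the text: rotation of the unit interval by an irrational angle, which preserves Lebesgue measure and is ergodic. The sketch below assumes that transformation, the observable $f(x) = x^2$, and the hypothetical names `ALPHA`, `rotate` and `time_mean`; it follows only the forward half of the orbit, and it illustrates the definition without proving anything.

```python
import math

# Assumption: an irrational rotation angle chosen for the example.
ALPHA = (math.sqrt(5) - 1) / 2


def rotate(x: float, k: int = 1) -> float:
    """k-th iterate of T(x) = x + ALPHA (mod 1) on [0, 1)."""
    return (x + k * ALPHA) % 1.0


def f(x: float) -> float:
    """A measurable observable on [0, 1)."""
    return x * x


def time_mean(x0: float, n: int) -> float:
    """Average of f along the first n points of the forward orbit of x0."""
    return sum(f(rotate(x0, k)) for k in range(n)) / n


# Space mean: the integral of x^2 over [0, 1) with respect to Lebesgue measure.
space_mean = 1.0 / 3.0

for n in (10, 1_000, 100_000):
    print(n, time_mean(0.1, n), space_mean)
```

Changing the starting point `0.1` should leave the limiting value unchanged, which is the practical content of ergodicity for this map.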
Example 9.2 Let A be an alphabet of n symbols $a_1, a_2, \ldots, a_n$ with probabilities $p_1, p_2, \ldots, p_n$ such that $p_i > 0$ and $\sum p_i = 1$. The product space $n^{\mathbb{Z}}$ consists of the set of all two-sided sequences in $n$; the various probabilities assigned to each sequence induce a measure $\mu$ on $n^{\mathbb{Z}}$. The shift transformation $(Tx)_i = x_{i+1}$ is measure preserving; the system that results is a finite-valued stationary stochastic process with identically distributed terms.
Example 9.3 Let $M = (a_{ij})$ be an $n \times n$ stochastic matrix. Let $p = (p_1, \ldots, p_n)$ be a row probability vector fixed by M:
$$pM = p.$$
Keep the product space and shift transformation from Example 9.2:
$$\mu\{x : x_m = i_0, x_{m+1} = i_1, \ldots, x_{m+k} = i_k\} = p_{i_0} a_{i_0 i_1} \cdots a_{i_{k-1} i_k}.$$
$\mu$ may be extended to a countably additive measure on the algebra generated by cylinder sets; by the Caratheodory-Hopf theorem, $\mu$ thus forms a measure on the Borel fields of $n^{\mathbb{Z}}$.
Example 9.2 models, say, a doubly infinite series of coin flips, each with probability of one-half; Example 9.3, a regular Markov chain, where p measures the a priori probability of each symbol, M, the transition probabilities from one symbol to another.
The Language of Life 29
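Example 9.3 can be made concrete with a small chain. The sketch below assumes a particular 2 × 2 stochastic matrix and the standard Markov measure on cylinder sets, in which a word $i_0 i_1 \ldots i_k$ receives the measure $p_{i_0} a_{i_0 i_1} \cdots a_{i_{k-1} i_k}$. The matrix entries and the names `M`, `p`, `times_matrix` and `cylinder_measure` are illustrative choices, not the author's. Exact fractions are used so that the invariance checks hold without rounding.

```python
from fractions import Fraction as F
from itertools import product

# Assumed example: a 2 x 2 stochastic matrix (each row sums to 1).
M = [[F(9, 10), F(1, 10)],
     [F(1, 2), F(1, 2)]]

# Row probability vector fixed by M, i.e. pM = p.
p = [F(5, 6), F(1, 6)]


def times_matrix(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def cylinder_measure(word):
    """Measure of the cylinder fixing the symbols of `word` at consecutive coordinates."""
    value = p[word[0]]
    for i, j in zip(word, word[1:]):
        value *= M[i][j]
    return value


assert times_matrix(p, M) == p  # p is stationary

# Cylinders of a given length partition the space, so their measures sum to 1.
assert sum(cylinder_measure(w) for w in product(range(2), repeat=4)) == 1

# Shift invariance: summing over the symbol placed in front of a word
# returns the measure of the word itself, because pM = p.
word = (0, 1, 1, 0)
assert sum(cylinder_measure((j,) + word) for j in range(2)) == cylinder_measure(word)

print(cylinder_measure(word))  # 1/48
```

Setting every row of `M` equal to `p` recovers the independent process of Example 9.2 as a special case.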
Consider a source consisting of a finite alphabet A and an associated string of symbols, $\ldots x_0 x_1 x_2 \ldots$, where each $x_i$ is an element of A. Symbols appear in sequence with a fixed probability $p_i$; if the probabilities are independent, the average entropy per symbol is
$$H = -\sum_{i=1}^{n} p_i \log_2 p_i.$$
H is at its maximum if each $p_i = 1/n$. In general, the probability that a particular symbol appears in sequence may depend on symbols that have gone before. This is true if the source is a finite-state Markov device. Let $A'$ be the ensemble of all doubly infinite sequences drawn on A; the cross section on $A'$ of sequences that coincide at a finite number of points $a_i = x_{t_i}$, where $t_i$ represents any set of integers, is a cylinder set. Now if A contains k letters, the number of n-term sequences over A is $k^n$; and each sequence is a cylinder in the larger space $A'$. It is the cylinder that has a fixed probability $\Pr(C)$: the set of all n-term sequences represents a finite probability space, $k^n$ points in size. The average amount of information per symbol sent out by a source of this sort is the entropy of the source itself:
$$H = \lim_{k \to \infty} -\frac{1}{k} \sum_{C \in C_k} \Pr(C) \log_2 \Pr(C).$$
The concept of a source may be specialized to the case of a measure-preserving system under ergodic constraints. [19]
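The two entropy formulas above can be compared on a small independent source. The sketch below assumes a three-letter alphabet with probabilities 1/2, 1/4, 1/4 (an illustrative choice) and the hypothetical helper names `entropy` and `block_entropy_rate`. For an independent source the per-symbol block entropy should agree with H at every block length, so no limit needs to be taken; a source with memory would require longer and longer blocks.

```python
import math
from itertools import product


def entropy(probs):
    """Average entropy per symbol, H = -sum p_i log2 p_i, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def block_entropy_rate(probs, k):
    """-(1/k) * sum over all k-term sequences C of Pr(C) log2 Pr(C).

    Assumes an independent source, so Pr(C) is the product of the
    probabilities of the symbols in C.
    """
    total = 0.0
    for block in product(probs, repeat=k):
        pr = math.prod(block)
        if pr > 0:
            total += pr * math.log2(pr)
    return -total / k


source = [0.5, 0.25, 0.25]       # assumed symbol probabilities
uniform = [1 / 3, 1 / 3, 1 / 3]  # same alphabet, equal probabilities

print(entropy(source))                 # 1.5 bits per symbol
print(entropy(uniform), math.log2(3))  # the maximum for three symbols
for k in (1, 2, 5):
    print(k, block_entropy_rate(source, k))
```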
The Shannon-McMillan theorem
A source puts out sequences; at any given time, there will be only finitely many ($A^n$ in fact, if A is a finite alphabet, and n is the length of each sequence). Finite length sequences are cylinders in the infinite probability space determined by the source; they inherit a probability structure. If n is sufficiently large, there exist an arbitrarily small $\varepsilon$ and $\delta > 0$ such that the n-term sequences may be separated into two groups. For the first,
$$\left| \frac{1}{n} \log_2 \Pr(C) + H \right| < \varepsilon;$$
for the second,
$$\sum_{C} \Pr(C) < \delta.$$
This is Shannon's theorem, a result in mathematics that appears to add an author in regular periods. In any case, sequences of the first group are characterized by the fact that $(1/n) \log_2 \Pr(C)$ is arbitrarily close to $-H$. The probability of any such sequence $C_i$ is thus $2^{-nH}$: the number of such sequences is $2^{nH}$, and, since $a^n = 2^{n \log_2 a}$, comprises a very small share of the total number $a^n$