
The process

In the document The Language of Life (pages 35-38)

The evidence in favor of the thesis that proteins are random sequences of amino acids is exiguous; and random words may well be grouped into nonrandom sequences. This suggests that the close study of the statistical properties of certain proteins may involve a kind of dense conceptual myopia, something that reflects a passionate absorption in minutiae. The process by which evolution in strings takes place, on the other hand, is macroscopic and global, an energetic probabilistic swarming over sample spaces that are never specified by means of mechanisms that are never clarified.

Biological paths

Life loiters over two metric spaces. The first is alphabetic; the second, zoological. Evolution comprises a drama in the large, at the zoological level; but the Central Dogma requires that any change in the large be mirrored by an alphabetic change, and so the process is doubled as it is divided. To talk blithely of evolution in strings is to assume the completion of the first two steps in biological evolution: the emergence of life-like systems from inorganic matter; and the adventitious creation of the modern biological system of replication and genetic information. An explanation of these steps I cede to the forces of the Night: my more limited concern is with evolution as a process that takes place once the genetic machinery is throbbing moistly. In evolution at the molecular level, one amino acid is dropped from a protein string, another is inserted: make way!, move over!, get out!, get lost!, to cast the operations in easily understood terms; even if the process is more complicated, it may mathematically be resolved into discrete and finite steps. Whatever the details, proteins change over time; and the changes leading to their creation may be regarded as a path $P = p_1, p_2, \ldots, p_n$ or protein sequence. Suppose that $A$ comprises the full stock of 20 amino acids; $A^l$, the set of all words of amino acids precisely 250 points in length; and $A^*$, the set of all finite sequences drawn over $A^l$. I assume (an assumption, note!) that $A^*$ has the structure of a language-like system under the binary and associative operation of protein concatenation, where concatenation has precisely its usual linguistic meaning.

28 D. Berlinski

Stochastic processes

Let $S$ be a system and $X$ the set of its states or configurations. State transitions are represented by a transformation $T : X \to X$, an artifice expressing the action of the system's laws of evolution. If $T_{s+t} = T_s T_t$, $\{T_t : t \in \mathbb{R}\}$ is a flow, or group action of $\mathbb{R}$ on $X$.

On the Darwinian theory, evolution is at its secret heart stochastic; it is natural, therefore, to specialize the concept of a process to the case in which $X$ is a measure space, $T$ a measure-preserving transformation. This is the domain chiefly of ergodic theory. Its underlying, indeed, fundamental, object is a probability space $(X, B, \mu)$, where $X$ is a set of states, $B$ a $\sigma$-algebra of measurable subsets of $X$, and $\mu$ a countably additive nonnegative set function on $B$. $\mu(X)$ is, of course, 1. Let $T$ be an invertible injection from $X$ onto $X$; if $\mu(T^{-1}E) = \mu(E)$ for all $E$ in $B$, $T$ is a measure-preserving transformation; the system $(X, B, \mu, T)$, a basic probability space. [18]

By the orbit of a measure-preserving transformation $T$, I mean the extended history of a single point $z$ under $T$ from the infinite past to the infinite future: a trajectory from void to void. Artificially truncated at $z$, the system is in an initial state or condition. A real-valued function $f : X \to \mathbb{R}$, whose values correspond to $f(z), f(Tz), f(T^2 z), \ldots$, acts to measure a system along its orbit; the class of such measurements is defined only to the extent that $f$ is itself measurable:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} f(T^k z)$$

is thus the time mean of the system;

$$\int_X f \, d\mu$$

its space mean: systems in which the two coincide for every measurable function are ergodic.
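The comparison of time mean and space mean can be tried numerically. The sketch below is only an illustration under stated assumptions: the system (rotation of the unit interval by an irrational angle, which preserves Lebesgue measure and is ergodic), the observable, and the names `time_mean`, `rotate`, and `observable` are all hypothetical choices, not taken from the text.

```python
import math

def time_mean(f, T, z, n):
    """Average of f over the first n points z, Tz, T^2 z, ... of the forward orbit."""
    total = 0.0
    x = z
    for _ in range(n):
        total += f(x)
        x = T(x)
    return total / n

# Assumed example system: rotation of [0, 1) by an irrational angle.
ALPHA = math.sqrt(2) - 1

def rotate(x):
    return (x + ALPHA) % 1.0

def observable(x):
    # A measurable (indeed continuous) function on [0, 1).
    return math.cos(2 * math.pi * x) ** 2

# Space mean: the integral of cos^2(2*pi*x) over [0, 1) is exactly 1/2.
SPACE_MEAN = 0.5

estimate = time_mean(observable, rotate, z=0.1, n=100_000)
print(abs(estimate - SPACE_MEAN))
```

For this system the finite-orbit average settles near the integral from any starting point; a non-ergodic choice, such as a rational rotation angle, would in general not.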

Example 9.2 Let $A$ be an alphabet of $n$ symbols $a_1, a_2, \ldots, a_n$, with probabilities $p_1, p_2, \ldots, p_n$ such that $p_i > 0$ and $\sum p_i = 1$. The product space $n^{\mathbb{Z}}$ consists of the set of all two-sided sequences in $n$; the various probabilities assigned to each sequence induce a measure $\mu$ on $n^{\mathbb{Z}}$. The shift transformation $(Tz)_i = z_{i+1}$ is measure preserving; the system that results is a finite-valued stationary stochastic process with identically distributed terms.
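That the shift preserves the product measure can be checked on cylinder sets, where the measure is a finite product of symbol probabilities. The following is a minimal sketch, assuming a hypothetical three-letter alphabet; the names `P`, `cylinder_measure`, and `shift_preimage` are illustrative only.

```python
from math import prod, isclose

# Hypothetical alphabet with positive probabilities summing to 1.
P = {"a": 0.5, "b": 0.3, "c": 0.2}

def cylinder_measure(cylinder, p=P):
    """Product measure of the cylinder {z : z_i = s for every (i, s) in cylinder}."""
    return prod(p[s] for s in cylinder.values())

def shift_preimage(cylinder):
    """Preimage of a cylinder under the shift, which sends z to the sequence
    whose i-th term is z_{i+1}: the constraint at position i moves to i + 1."""
    return {i + 1: s for i, s in cylinder.items()}

# A cylinder fixing the symbols at positions -1, 0 and 2.
E = {-1: "a", 0: "c", 2: "b"}

print(cylinder_measure(E), cylinder_measure(shift_preimage(E)))
```

Because the measure of a cylinder depends only on which symbols are fixed and not on where, the two printed values agree.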

Example 9.3 Let $M = (a_{ij})$ be an $n \times n$ stochastic matrix. Let $p = (p_1, \ldots, p_n)$ be a row probability vector fixed by $M$:

$$pM = p.$$

Keep the product space and shift transformation from Example 9.2: $\mu$ may be extended to a countably additive measure on the algebra generated by cylinder sets; by the Carathéodory-Hopf theorem, $\mu$ thus forms a measure on the Borel fields of $n^{\mathbb{Z}}$.
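A small numerical sketch of this construction follows. It assumes the standard Markov measure on cylinders, in which a word $w_0 w_1 \ldots w_k$ at consecutive positions receives measure $p_{w_0} a_{w_0 w_1} \cdots a_{w_{k-1} w_k}$, and a hypothetical 2 x 2 matrix; the fixed vector is found by power iteration, which converges for a regular chain. All names are illustrative.

```python
from itertools import product

# Hypothetical stochastic matrix: every row sums to 1.
M = [[0.9, 0.1],
     [0.5, 0.5]]

def stationary_vector(M, iterations=1000):
    """Approximate a row vector p with pM = p by repeated multiplication."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [sum(p[i] * M[i][j] for i in range(n)) for j in range(n)]
    return p

def markov_cylinder_measure(p, M, word):
    """Measure of the cylinder showing `word` at consecutive positions."""
    prob = p[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= M[a][b]
    return prob

p = stationary_vector(M)

# The length-3 cylinders partition the space, so their measures sum to 1.
total = sum(markov_cylinder_measure(p, M, w) for w in product(range(2), repeat=3))

# Shift invariance on one cylinder: summing over the symbol one place earlier
# leaves the measure of the word (0, 1) unchanged, because pM = p.
direct = markov_cylinder_measure(p, M, (0, 1))
shifted = sum(markov_cylinder_measure(p, M, (a, 0, 1)) for a in range(2))
print(p, total, direct, shifted)
```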

Example 9.2 models, say, a doubly infinite series of coin flips, each with probability of one-half; Example 9.3, a regular Markov chain, where $p$ measures the a priori probability of each symbol, $M$, the transition probabilities from one symbol to another.

Consider a source consisting of a finite alphabet $A$ and an associated string of symbols, $\ldots z_0 z_1 z_2 \ldots$, where each $z_i$ is an element of $A$. Symbols appear in sequence with a fixed probability $p_i$; if the probabilities are independent, the average entropy per symbol is

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i.$$

$H$ is at its maximum if each $p_i = 1/n$. In general, the probability that a particular symbol appears in sequence may depend on symbols that have gone before.
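For independent symbols the entropy per symbol is a one-line computation. The sketch below uses two hypothetical four-letter distributions to show that the uniform one attains the maximum, $\log_2 n$; the function name `entropy` is an illustrative choice.

```python
import math

def entropy(probs):
    """Entropy in bits per symbol: -sum p_i log2 p_i, taking 0 log 0 as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # each p_i = 1/n with n = 4
skewed = [0.7, 0.1, 0.1, 0.1]        # same alphabet, unequal probabilities

print(entropy(uniform))  # 2.0, which is log2(4)
print(entropy(skewed))
```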

Such dependence arises if the source is a finite-state Markov device. Let $A'$ be the ensemble of all doubly infinite sequences drawn on $A$; the cross section on $A'$ of sequences that coincide at a finite number of points $a_i = z_{t_i}$, where $t_i$ represents any set of integers, is a cylinder set. Now if $A$ contains $k$ letters, the number of $n$-term sequences over $A$ is $k^n$; and each sequence is a cylinder in the larger space $A'$. It is the cylinder that has a fixed probability $\Pr(C)$: the set of all $n$-term sequences represents a finite probability space, $k^n$ points in size. The average amount of information per symbol sent out by a source of this sort is the entropy of the source itself,

$$H = \lim_{n \to \infty} -\frac{1}{n} \sum_{C \in C_n} \Pr(C) \log_2 \Pr(C),$$

where $C_n$ is the set of $n$-term cylinders.
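The limit can be watched converging on a small case. This sketch assumes a hypothetical two-state stationary Markov source and enumerates every $n$-term sequence directly; for such a source the limit is known to equal $-\sum_i p_i \sum_j a_{ij} \log_2 a_{ij}$, which `entropy_rate` computes for comparison. The matrix, the vector, and all names are illustrative.

```python
import math
from itertools import product

# Hypothetical two-state source: M is the transition matrix and
# p = (5/6, 1/6) is its stationary vector (pM = p).
M = [[0.9, 0.1],
     [0.5, 0.5]]
p = [5 / 6, 1 / 6]

def word_probability(word):
    """Pr(C) for the cylinder C fixed by an n-term sequence."""
    prob = p[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= M[a][b]
    return prob

def block_entropy_per_symbol(n):
    """-(1/n) * sum over all n-term sequences of Pr(C) log2 Pr(C)."""
    total = 0.0
    for word in product(range(len(p)), repeat=n):
        pr = word_probability(word)
        if pr > 0:
            total -= pr * math.log2(pr)
    return total / n

def entropy_rate():
    """Closed form of the limit for a stationary Markov source."""
    return -sum(p[i] * M[i][j] * math.log2(M[i][j])
                for i in range(len(p)) for j in range(len(p)) if M[i][j] > 0)

for n in (1, 2, 4, 8, 12):
    print(n, block_entropy_per_symbol(n))
print("limit", entropy_rate())
```

The enumeration grows as $k^n$, so it is only feasible for short sequences; it is meant to make the definition concrete, not to estimate entropy in practice.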

The concept of a source may be specialized to the case of a measure-preserving system under ergodic constraints. [19]

The Shannon-McMillan theorem

A source puts out sequences; at any given time, there will be only finitely many ($A^n$, in fact, if $A$ is a finite alphabet and $n$ is the length of each sequence).

Finite-length sequences are cylinders in the infinite probability space determined by the source; they inherit a probability structure. If $n$ is sufficiently large, there exist arbitrarily small $\varepsilon > 0$ and $\delta > 0$ such that the $n$-term sequences may be separated into two groups. For the first,

$$\left| \frac{1}{n} \log_2 \Pr(C) + H \right| < \varepsilon;$$

for the second,

$$\sum_{C} \Pr(C) < \delta.$$

This is Shannon's theorem, a result in mathematics that appears to add an author in regular periods. In any case, sequences of the first group are characterized by the fact that $(1/n) \log_2 \Pr(C)$ is arbitrarily close to $-H$. The probability of any such sequence $C$ is thus $2^{-nH}$: the number of such sequences is $2^{nH}$, and comprises a very small share of the total number $a^n = 2^{n \log_2 a}$ of available sequences: a happy result. In coding a channel of communication, attention need be directed only to a tiny sample of the output.
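The separation into two groups can be counted exactly for a simple source. The sketch assumes a hypothetical binary source with independent symbols, a fixed sequence length, and a fixed tolerance; sequences are grouped by their number of ones, since all sequences with the same count share one probability. The constants `P1`, `N`, and `EPS` are illustrative.

```python
import math

P1 = 0.1     # assumed probability of the symbol 1; symbol 0 has probability 0.9
N = 1000     # sequence length n
EPS = 0.05   # tolerance on (1/n) log2 Pr(C) + H

# Entropy per symbol of the source, in bits.
H = -(P1 * math.log2(P1) + (1 - P1) * math.log2(1 - P1))

typical_count = 0          # number of sequences in the first group
typical_probability = 0.0  # total probability carried by the first group
for k in range(N + 1):
    # log2 Pr(C) for any single sequence C containing exactly k ones.
    log_pr = k * math.log2(P1) + (N - k) * math.log2(1 - P1)
    if abs(log_pr / N + H) < EPS:
        count = math.comb(N, k)  # how many sequences have exactly k ones
        typical_count += count
        typical_probability += count * 2.0 ** log_pr

# Exponent of the first group's size, to compare with H (for 2^{nH})
# and with 1 (for the 2^n sequences available over two letters).
exponent = math.log2(typical_count) / N
print(H, exponent, typical_probability)
```

With these settings the first group carries most of the probability while its size, measured by the exponent, stays within `EPS` of $H$ and so far below the exponent 1 of the full set of sequences.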
