New Mathematics for a New Problem

ORRIN E. TAULBEE

Manager, Information Sciences, Goodyear Aerospace Corporation

INTRODUCTION

Perhaps you have wondered what new problem we shall be concerned with in these pages and, once the problem is expressed, what new mathematics has been developed that is applicable to it. Let me say at the outset that the principal concern of my discussion is classification. What is new about this problem? It has been around since the dawn of civilization in one context or another. My primary reason for calling it new is that it receives new emphasis as our information-handling systems increase in complexity. Throughout our discussion we shall explore some of the ramifications of this significant unsolved problem, and we will demonstrate that certain results take a positive step toward finding satisfactory classification schemes.

CLASSES AND CLASSIFICATION

Before beginning a discussion of classification, one must concern oneself at least to some extent with the notion of classes. It is not our purpose here to delve into the philosophical considerations of what classes are; the interested reader should consult Ref. 7. Nor is it easy to decide what the concept of a class should be in general, and in particular what a class should be in the context in which we shall use it. Let us just say here that our use of classes can be thought of as a decomposition of a set of objects into a collection of subordinate groups, which will be called classes. According to the Encyclopaedia Britannica, classification is "the arrangement of things in classes according to the characteristics that they have in common." It is not sufficient to think of classification as placing the objects of a class adjacent to one another, as is done in most library classification schemes, for we must admit the possibility that objects are considered to belong to the same class even though they may be quite widely separated.

We may consider two types of classification, hierarchical and nonhierarchical, the former admitting the possibility that a class may be subordinate to a class other than the entire collection of objects, while the latter does not admit this possibility. It is unfortunate that some individuals interpret classification to always mean hierarchical classification.

INFORMATION HANDLING

For a better understanding of the following discussion it is convenient to give a diagrammatic description of information handling (Fig. 1). To my knowledge, the diagram represents all information-handling systems, including those which are purely manual, those with a man-machine intermix, and those which are completely automatic. Since the diagram is representative of all systems, it is clear that the functions represented by the blocks take on different meanings depending on the particular system under consideration. In fact, for some systems one or more of the functional blocks may not be present. However, the end product of any information-handling system is the same: the presentation of information for decision-making.

[Fig. 1. Block diagram of an information-handling system. Labels recoverable from the figure: Obtaining Representation-of-Item File; Display; Updating; Gaining Information.]

The objects with which our information-handling system is concerned shall be referred to as items. A common information-handling system is one where the items are textual in nature. Our discussion will not be limited to this, however. We shall assume that an item may be a document in the usual sense, a book, or a section, paragraph, or sentence of a document or book; or the item may be an aerial photograph, a structural diagram of a chemical compound, a radar return, a sonar signal, and so forth. A decision must be made to determine those items which are to be included in the system. This decision may be an a priori one, or the decision may be made for each item individually at the time of accession.

Many different representations of the items are possible for a given collection. For example, if the items are textual in nature the representation adopted for the items might be: full text, full text with common words omitted, abstract, extract, keywords in context, title, index terms, first and last paragraph, etc. If the items are chemical compounds the representation might be: structural diagram, chemical name, one of several linear notations, a connection matrix, etc. If the items are signals the representation might consist of an explicit function of time, power spectral density, amplitude and phase spectrum, sampled data representation, etc. Of course, in every case an item may be used to represent itself.

Again, the representation criteria may be established a priori, so that the representation is obtained routinely, or the representation may be obtained for each individual item on a judgmental basis, subject to general criteria established beforehand. Part of the function of obtaining the representation-of-item file is that of recording the results on a searchable medium.

If the item collection reaches any substantial magnitude (the collection is assumed to be dynamic), then consideration must be given to how the file should be organized. This is intended to include the establishment of format, search strategy, and classification of the recorded representation. At this point, updating is complete and the file is ready for searching.

Upon the formulation of a query, an analysis must be performed in order to (1) make the representation of the query compatible with the item representation, and (2) establish an appropriate permissible search strategy.

Following this the file is searched and results of the search are delivered.

Within this framework, we may now describe a series of information-handling systems in which each system is more complex than the previous one with the ultimate being a system requiring no human intervention which operates in real time.

(a) Natural System

First of all let us describe what may be called a "natural system."

An example of this is the individual researcher's personal file. This generally consists of a collection of items relevant to his particular field of endeavor. He is the user to the extent that he decides what items will be added to the collection; he formulates his own query, searches the collection, obtains those items which are responsive to the query, and makes a judgmental evaluation as to their relevance to the query. Dissatisfaction with the items retrieved may lead him to refine or modify his query and iterate the process. Note here that the researcher is using the items to represent themselves and in retrieving he actually retrieves the physical items from the files.

Growth of the collection may require the researcher to develop and organize an auxiliary file, the representation-of-item file.

Libraries, whether public, university, or specialized, have developed a system duplicating in large measure the information system of our individual researcher.

(b) Machine-Aided System

For one or more of the following reasons, one may bring in a machine to assist in the information-handling system: (1) increased speed; (2) magnitude of the representation-of-item file; (3) reduced costs in processing; or (4) avoidance of errors in processing. Machines to assist in processing are generally of three types: (1) tabulating equipment such as sorters, collators, and printers; (2) peek-a-boo devices; and (3) computers. The most common utilization of machines in information-handling systems is in performing the function of searching the file.

(c) Automatic System

Because of the increased complexity of information-handling systems, it is frequently desirable to have a machine system, called an "automatic system," that is behaviorally equivalent to the functions included in the solid rectangle (Fig. 1); this includes all those functions which can be mechanized.

(d) Real-Time System

For a real-time system four "times" appear to be of significance: (1) $\mu$, the average time for updating; (2) $\nu$, the average time for gaining information; (3) $\theta$, the average rate of accession of new items; and (4) $\xi$, the average rate of accession of queries. Obvious conditions on these variables are $\mu \le \theta$ and $\nu \le \xi$.

Most information-handling systems in existence today are of type (a) or (b). For many systems it would be desirable that they be of type (c) or (d).

The use of machines to perform each of the functions in the solid rectangle is in various degrees of development. As was indicated previously, the most highly mechanized function is that of searching the file.

Displays, so far as printed or microform output is concerned, are fairly well mechanized. Much remains to be done for other types of displays.

The other three functions represented are perhaps less well developed, but experiments are going on in each of these areas. For example, in obtaining the representation-of-item file, experiments in auto-abstracting and auto-indexing have been performed. The function of inquiry analysis may be avoided almost completely. An example of such a system is that in which the representation-of-item is full text. Once the file-organization characteristics have been established, a machine may assist in performing this function. However, little has been done in the way of machine classification.

CLASSIFICATION IN INFORMATION HANDLING

When the concept of classification is restricted to information handling, it is clear that the fundamental problem is that of deciding in what sense the items should be considered associated or similar. It is also clear that a classification scheme cannot be universal but will be specialized to the particular collection of items under consideration. For example, the criteria for association of two items will be quite different if the items are, on the one hand, documents, and on the other hand, signals. In fact, we can go further: the classification scheme of the same collection of items will be quite different depending upon the viewpoint of the classifier.

This can be handled theoretically, however, by means of the criteria adopted for association. There are three principal reasons for classification in an information-handling system: (1) size of file; (2) increased speed; and (3) recognition of the appearance of new classes. These reasons are not mutually exclusive. For the first, unless the file is classified, it is necessary to search the entire file, but this may be impractical depending upon the size and mechanism, if any, used in searching. For the second, urgency of gaining access to the information may dictate that the items be decomposed into classes. The third purpose of classification is in identifying new concepts or knowledge that finds its way into the item collection.

TRADITIONAL APPROACH TO CLASSIFICATION

The following is a common approach to classification: From personal knowledge of the item collection, some classes which are felt to be representative of the characteristics of the entire item collection are established a priori. After this, each item, whether in the original collection or a new accession, is considered individually and evaluated to determine the classes to which the item belongs. This is a judgmental evaluation which must be made, yet it cannot be made precisely since initially the definition of the class is vague. New classes are added reluctantly. When new classes are formed, almost without exception there is little or no review of items already in the file to determine whether or not they fit into the new class.

In the usual library situation, classification consists of two functions, that of establishing cross-references and that of classifying, each of these being accomplished within the guidelines of a set of rules. It seems to me the purpose of classifying in this context is to narrow the search resulting from an inquiry to a limited portion of the representation-of-item file, and the cross-referencing or association established increases the possibility of retrieving all pertinent information from the file that is either directly or peripherally relevant to the query. Cross-references are included to the best of the individual's ability to remember and recall.

In order to automate both processes, it is necessary to establish an analytic procedure for making associations and classifications, since classification is at present made on the basis of intuition and experience. Thus, it would be desirable to have a classification scheme which is objective, that is, one which removes the judgmental element and gives complete updating when a new class is formed.

MOTIVATION FOR MATHEMATICAL MODEL

Perhaps the first step away from the traditional approach to classification was included in a paper by Vannevar Bush⁴ in 1945. In this paper he defined a theoretical machine called the "memex." The memex has massive storage capability, the capability of retrieving any item from storage and displaying it, the capability of inserting written comments into storage during the viewing process, and most important, the capability of tying two related items together. This last capability Dr. Bush referred to as "associative indexing," by which he meant a mechanism whereby any item will select immediately and automatically another associated item. Furthermore, the operator of this machine, in viewing items which he wishes to associate, links these together permanently by simply pressing a key and thereby successively builds a trail of association. What this amounts to, in effect, is to put items into a class, as if they were bound together in one volume, from widely separated locations.

Notice that here emphasis is placed upon the association between concepts or ideas, each concept forming a class in the individual's mind.

The need for the associative concept is evident when one considers the selection processes that are available in searching a file. At present there are two types: (1) search the entire file; (2) use a tree structure for searching the file. These methods have been implemented on card equipment and conventional computers. The association concept would be particularly effective for avoiding backtracking if a search is being made in one branch of the tree and it is required to search in another branch of the tree.

Tying together the ideas presented by Dr. Bush and the traditional classification approach of the library, it seems reasonable to think, instead, of reversing the process: first establishing the association between the items and then, through some logical process, forming the appropriate classes. There are two cases to consider: (1) the items are either associated or not, and (2) more generally, the items are associated to a degree. This paper will be concerned only with the first.

THE MATHEMATICAL MODEL

A review of the mathematical literature reveals that little has been done in the way of a mathematical approach to classification. Apparently one of the few concepts in mathematics relating to classification as such is the well-known idea of an equivalence relation.

The fundamental feature of an equivalence relation is that a binary relation $\rho$ is defined on a set $S$ and satisfies the reflexive, symmetric, and transitive properties. By the phrase "a binary relation $\rho$ is defined on a set $S$" is meant that for any pair of elements $a$, $b$ of $S$ a definite rule is prescribed by which it can be determined whether or not $a$ and $b$ are in the relation $\rho$. This may be denoted by $\rho(a, b) = 1$ or $0$. The significant property of an equivalence relation defined on a set $S$ is that the relation separates the set into mutually exclusive, exhaustive classes. That is, each element of $S$ belongs to one and only one class. Because of this, the partitioning of the set $S$ may be thought of as a classification of the elements of $S$.

To be precise, let $S$ be a finite set with elements $s_1, s_2, \ldots, s_m$. Since it will be assumed that, in general, the number of elements varies with time, it will be necessary to require that the set be well-defined, i.e., given a new object $s^*$ there exists a definite rule by which it can be determined whether or not $s^* \in S$. It will be further assumed that $\rho$ is a binary relation defined on $S$ by an explicit rule which determines whether $s_i$ and $s_j$, not necessarily distinct, are in the relation or not. This will be denoted by $\rho(s_i, s_j)$ (or $\rho(s_j, s_i)$ if the order is important), and we agree that $\rho(s_i, s_j)$ has the value 1 if $s_i$ and $s_j$ stand in the relation $\rho$; otherwise, $\rho(s_i, s_j)$ has the value zero. An alternative way of thinking of this is that $\rho$ is a mapping of the cartesian product space $S^2$ onto the set $\{0, 1\}$.

Moreover, it will be assumed that when the set $S$ is augmented by the addition of $s^*$ to form the set $S^*$, the same rule is applicable for evaluating $\rho(s^*, s_j)$ for $j = 1, 2, \ldots, m$ and $\rho(s^*, s^*)$.

The relation $\rho$ is an equivalence relation if $\rho$ is reflexive, symmetric, and transitive, i.e.,

(1) $\rho(s_i, s_i) = 1$, for $i = 1, 2, \ldots, m$.

(2) if $\rho(s_i, s_j) = 1$, then $\rho(s_j, s_i) = 1$ for all $i$ and $j$.

(3) if $\rho(s_i, s_j) = 1$ and $\rho(s_j, s_k) = 1$, then $\rho(s_i, s_k) = 1$.
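The three conditions above can be checked mechanically on a small finite set. The following is a minimal sketch, not taken from the paper: the function names `is_equivalence` and `equivalence_classes`, the sample set, and the two sample relations are all hypothetical and chosen only for illustration. The relation is passed as a function returning 1 or 0.

```python
from itertools import product


def is_equivalence(elements, rho):
    """Return True if rho (a 0/1-valued function on pairs) is reflexive,
    symmetric, and transitive on the given elements."""
    reflexive = all(rho(a, a) == 1 for a in elements)
    symmetric = all(
        rho(a, b) == rho(b, a) for a, b in product(elements, repeat=2)
    )
    transitive = all(
        rho(a, c) == 1
        for a, b, c in product(elements, repeat=3)
        if rho(a, b) == 1 and rho(b, c) == 1
    )
    return reflexive and symmetric and transitive


def equivalence_classes(elements, rho):
    """Partition the elements into classes.

    Comparing against one representative per class is valid only when
    rho really is an equivalence relation."""
    classes = []
    for s in elements:
        for cls in classes:
            if rho(s, cls[0]) == 1:
                cls.append(s)
                break
        else:
            classes.append([s])
    return classes


# Hypothetical example: integers related when they agree modulo 3.
def same_mod3(a, b):
    return 1 if a % 3 == b % 3 else 0


# Hypothetical example of a relation that is NOT transitive.
def near(a, b):
    return 1 if abs(a - b) <= 1 else 0


S = [1, 2, 3, 4, 5, 6]
print(is_equivalence(S, same_mod3))       # True
print(equivalence_classes(S, same_mod3))  # [[1, 4], [2, 5], [3, 6]]
print(is_equivalence(S, near))            # False
```

The second relation is reflexive and symmetric but fails condition (3), which is the situation the remainder of the section addresses.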

In the application of this to information handling such a classification is clearly unsatisfactory, since in general an element may belong to more than one class. This motivates the search for a generalization of the classification induced by an equivalence relation.

A study of the equivalence relation postulates shows that it is the transitive property which decomposes $S$ into mutually exclusive classes. However, the classes induced by an equivalence relation do have the characteristic that the classes are maximal¹ with respect to the property that any pair $s_i$ and $s_j$ belonging to a class implies that $\rho(s_i, s_j) = 1$. This suggests that the transitive property be dropped and the maximality condition, just referred to, be imposed on the classes. The collection of classes determined by such a relation $\rho$ are called "coherence classes." This terminology is consistent with Ref. 6.
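Read as a graph problem, a coherence class is a maximal set of elements that are pairwise related, which is a maximal clique of the graph whose edges are the related pairs. The sketch below enumerates such classes with the standard Bron-Kerbosch procedure. This is a swapped-in, well-known technique used here for illustration; it is not claimed to be the procedure of Ref. 2 or Ref. 6. The name `coherence_classes` and the sample relation `near` are hypothetical, and the relation is assumed to be reflexive and symmetric.

```python
def coherence_classes(elements, rho):
    """Return the maximal subsets in which every pair is related by rho.

    Assumes rho is reflexive and symmetric. Uses Bron-Kerbosch
    enumeration without pivoting (an illustrative choice)."""
    elements = list(elements)
    neighbors = {
        a: {b for b in elements if b != a and rho(a, b) == 1}
        for a in elements
    }
    found = []

    def expand(current, candidates, excluded):
        # current: a set of pairwise-related elements
        # candidates: elements that could still extend current
        # excluded: elements already tried for this current set
        if not candidates and not excluded:
            found.append(frozenset(current))
            return
        for v in list(candidates):
            expand(
                current | {v},
                candidates & neighbors[v],
                excluded & neighbors[v],
            )
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(elements), set())
    return set(found)


# Hypothetical relation: integers related when they differ by at most 1.
def near(a, b):
    return 1 if abs(a - b) <= 1 else 0


for cls in sorted(coherence_classes([1, 2, 3, 4, 5], near), key=min):
    print(sorted(cls))
```

In this example the element 2 belongs to both {1, 2} and {2, 3}, so the classes overlap, which an equivalence relation would not permit.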

Suppose now that a new element $s^*$ is adjoined to the set $S$ to form the set $S^*$. Let $C_k$, $k = 1, 2, \ldots, n$, be the coherence classes of $S$ and $R(s^*)$ be the set of elements in $S^*$ related by $\rho$ to $s^*$. A precise inductive algorithm was given in Ref. 2 for obtaining $C^*$, a coherence class in $S^*$.

The algorithm is based upon $C_k^* = \{s^*\} \cup \{R(s^*) \cap C_k\}$. As $k$ ranges over the values $1, 2, \ldots, n$, $C_k^*$ forms a new coherence class if it is maximal. This yields all new classes; none of the classes $C_k$ of $S$ can disappear.

(If an element is removed from S, then, of course, a class may disappear.) It should be pointed out that the decomposition of the set S into classes, either in the case of equivalence classes or coherence classes, is unique.
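The updating step can be sketched as follows. This is an illustration of the idea only and should not be taken as the algorithm of Ref. 2, whose details are not reproduced here. For each existing class, a candidate is formed from the new element together with the members of that class related to it, and only candidates not properly contained in another candidate are kept. One assumption made in this sketch: an old class all of whose members are related to the new element is replaced by its enlargement, which is one reading of the statement that no class disappears. The name `add_element` and the sample relation are hypothetical.

```python
def add_element(classes, s_new, rho):
    """Update a set of coherence classes (frozensets) when s_new is adjoined.

    Assumes rho is reflexive and symmetric and that `classes` holds the
    coherence classes of the current set."""
    old_elements = set().union(*classes) if classes else set()
    related = {s for s in old_elements if rho(s_new, s) == 1}

    # One candidate per existing class: s_new plus the related members.
    candidates = {frozenset({s_new}) | (related & c) for c in classes}
    candidates.add(frozenset({s_new}))  # covers an empty or unrelated set

    # Keep only candidates that are maximal among the candidates.
    new_classes = {
        c for c in candidates if not any(c < d for d in candidates)
    }
    # Assumption: an old class absorbed by a larger new class is replaced.
    kept = {c for c in classes if not any(c < d for d in new_classes)}
    return kept | new_classes


# Hypothetical relation: integers related when they differ by at most 1.
def near(a, b):
    return 1 if abs(a - b) <= 1 else 0


classes = set()
for item in [1, 2, 3, 4, 5]:
    classes = add_element(classes, item, near)

for cls in sorted(classes, key=min):
    print(sorted(cls))
```

Building the classes one element at a time in this way gives, for this small example, the same four overlapping classes as a direct enumeration of maximal pairwise-related subsets.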

If a different classification is desired then the association criterion may be changed, i.e., the binary relation p defined on S is modified. Because of the well-known correspondence between graphs, relations, and matrices, it is clear that these ideas may be expressed in either of the other forms.

Some interesting matric relationships were given in Ref. 8.
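The correspondence between a relation, its 0/1 matrix, and its graph can be made concrete with a short sketch. The helper names `relation_matrix` and `relation_graph` and the sample relation are hypothetical, and the sketch does not attempt to reproduce the matric relationships of Ref. 8. In matrix form, reflexivity appears as a diagonal of ones and symmetry as equality of the matrix with its transpose.

```python
def relation_matrix(elements, rho):
    """Return the 0/1 matrix whose (i, j) entry is rho(s_i, s_j)."""
    return [[rho(a, b) for b in elements] for a in elements]


def relation_graph(elements, rho):
    """Return an adjacency list: each element maps to the other elements
    related to it (the loop implied by reflexivity is omitted)."""
    return {
        a: [b for b in elements if b != a and rho(a, b) == 1]
        for a in elements
    }


# Hypothetical relation: integers related when they differ by at most 1.
def near(a, b):
    return 1 if abs(a - b) <= 1 else 0


S = [1, 2, 3]
M = relation_matrix(S, near)
G = relation_graph(S, near)

is_reflexive = all(M[i][i] == 1 for i in range(len(S)))
is_symmetric = M == [list(row) for row in zip(*M)]

print(M)  # [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
print(G)  # {1: [2], 2: [1, 3], 3: [2]}
print(is_reflexive, is_symmetric)  # True True
```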

Since initiation of this investigation two papers have come to the author's attention. Hillman⁵ has explored, from a philosophical-logical point of view, Carnap's idea of a "concept-class." Bonner³ develops some computer algorithms for what he calls "clusters." The "concept-class" and Bonner's "tight cluster" appear to be identical to the notion of a coherence class. The present paper establishes a firm mathematical basis for classification in case (1) referred to above and simultaneously affords
