Conjectures on Information Handling in Large-Scale Systems

GEORGE W. N. SCHMIDT

North American Air Defense Command

Conjecture implies formation of an opinion or judgment upon insufficient evidence. After twelve years of experience in the military application of computer systems in the areas of command and control, simulation and intelligence, the best I can do is conjecture. Certain specific information handling problems have been solved. Others await solution and will require the development of software techniques, hardware techniques, or both.

Basically, those information handling problems which are considered to have reached a reasonably successful level of solution are exemplified by:

1. The financial problem, represented by the payroll processing by the various military finance offices. The data is well defined: the individual's serial number, grade, length of service, marital status, dependents, etc. The only field that can cause a storage or retrieval problem is the individual's name, as it is alphabetic and variable in length.

2. The personnel problem, which is now partially automated at the records center in which the service records of Air Force personnel are now maintained, and personnel assignments processed.

3. The supply functions which are being mechanized at base level in order to speed up the resupply and inventory control functions.

4. The aircraft control and warning function as exemplified by SAGE (Semi-Automatic Ground Environment). This system processes the returns from surveillance radars to arrive at a position of the aircraft by latitude, longitude, altitude and time. This data is correlated by the computer program with the flight plan as filed with the FAA.

The data which correlates with the FAA flight plan is reported as known friendly; that which does not correlate is declared either hostile or unknown, and identification procedures are initiated. A minimal sketch of this correlation logic follows the list.

5. The Ballistic Missile Early Warning System, in which radar returns are processed by both wired and stored program logic. The wired logic establishes the validity of the returning signals as coming from a real object in space and also converts the return to azimuth, elevation, range and range rate data form. The stored program logic is used to generate azimuth rate and elevation rate data and to perform the discrimination tests which eliminate nonthreatening objects from the reporting system. The data relating to those objects which are classified as threatening is formatted into 63-bit messages by the program and passed over communications to the Display Information Processor at Colorado Springs. The Display Information Processor program decodes the message and computes the alarm levels, time to go to soonest impact, and the parameters to be passed to the ICONORAMA equipment to drive the display of impact and launch ellipses.
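The talk gives no code, but the correlation step in item 4 can be sketched as follows. The record fields and tolerance values here are hypothetical, chosen only to illustrate the friendly/unknown decision, and are not the SAGE design.

```python
# A minimal sketch of flight-plan correlation as described in item 4:
# a radar track is matched against filed FAA flight plans by position,
# altitude and time. All tolerances are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Track:
    lat: float
    lon: float
    alt_ft: float
    time_s: float

@dataclass
class FlightPlan:
    lat: float
    lon: float
    alt_ft: float
    time_s: float

def correlates(track: Track, plan: FlightPlan,
               pos_tol_deg=0.5, alt_tol_ft=2000, time_tol_s=300) -> bool:
    """A track correlates with a plan if it is close in position,
    altitude and time, within the chosen tolerances."""
    return (abs(track.lat - plan.lat) <= pos_tol_deg
            and abs(track.lon - plan.lon) <= pos_tol_deg
            and abs(track.alt_ft - plan.alt_ft) <= alt_tol_ft
            and abs(track.time_s - plan.time_s) <= time_tol_s)

def classify(track: Track, plans: list[FlightPlan]) -> str:
    """Correlated tracks are reported as known friendly; the rest
    trigger identification procedures as hostile or unknown."""
    return ("known friendly"
            if any(correlates(track, p) for p in plans)
            else "unknown -- initiate identification")

plans = [FlightPlan(40.0, -105.0, 30000, 3600.0)]
print(classify(Track(40.1, -104.9, 29000, 3650.0), plans))  # known friendly
print(classify(Track(45.0, -100.0, 40000, 100.0), plans))   # unknown
```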

Those information handling problems awaiting solution are those which require the processing of narrative text, and photographic indexing and interpretation.

The problems which have yielded to solution are those that have a common characteristic: well-defined organization and structure that can be readily formatted. Those problems which are presenting the most difficulty also have a common characteristic: a complex organization and structure which is permeated with exceptions and is not amenable to formatting.

I feel there are these two basic classes of data available for exploitation: formatted and unformatted. An example of formatted data is BMEWS data, which, because of its origin as radar data, can be formatted at the source. It is no problem to handle the more than 6.3 million messages a year and present the data to the user in summary displays. Other sensors can collect data and furnish it in formatted form for processing. Several of these record their data in a typical magnetic tape format, i.e., 556 bits per inch density, 112.5 inches per second speed, with a 10-second record length. Using 100 word per minute teletype lines to transfer this data, if error-free communications were possible, would require only 17 hours, 37 minutes, 30 seconds per record. More of this later.
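As a rough check on that figure, the arithmetic can be sketched as below. The conventions assumed here (six-bit characters on 7-track tape, six characters per teletype word) are mine, not the talk's, which presumably accounts for the small difference from the quoted 17 hours, 37 minutes, 30 seconds.

```python
# Back-of-envelope check of the tape-vs-teletype transfer figure.
# Assumptions (not stated in the talk): 7-track tape, so 556 bpi means
# 556 six-bit characters per inch, and a teletype "word" is 6 characters.

DENSITY_CPI = 556        # characters per inch of tape
TAPE_SPEED_IPS = 112.5   # inches per second
RECORD_SECONDS = 10      # one record = 10 seconds of tape motion

chars_per_record = DENSITY_CPI * TAPE_SPEED_IPS * RECORD_SECONDS  # 625,500

TTY_WPM = 100            # teletype line speed, words per minute
CHARS_PER_WORD = 6       # assumed convention
tty_chars_per_sec = TTY_WPM * CHARS_PER_WORD / 60  # 10 characters/second

seconds = chars_per_record / tty_chars_per_sec
h, rem = divmod(int(seconds), 3600)
m, s = divmod(rem, 60)
print(f"{chars_per_record:,.0f} characters -> {h}h {m}m {s}s per record")
# -> 625,500 characters -> 17h 22m 30s, in line with the ~17.5 hours quoted
```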

Examples of unformatted data to be processed are incident reports, i.e., descriptive narratives of objects seen or nonstandard activities; scientific treatises; proceedings of symposia and other technical meetings; other information of this kind; and photographs, which must be indexed for retrieval and also interpreted.

In both the formatted and unformatted classes there appear to be two categories of information-processing requirements. One could be called "real-time," the other "deferred." To permit intelligent argument, in the Greek sense of argument, I should define my terms. "Real-time" information handling requires update of the data base, response to queries, and summarization of the data so that the user may react to the changing conditions and affect the environment from which the data is collected, i.e., the data is being processed concurrent with the operation. "Deferred" information handling requires update of the data base, response to queries, and summarization of the data ex post facto so that the user may perform detailed analytical studies to establish criterion measures, patterns and new techniques.

Capability to do "real-time" processing implies that there is available a history of data in depth relating to the problem. Based upon this file of data, the necessary criteria and patterns for quick-look analysis can be established and narrative statements relating to the "real-time" problem can be retrieved. This leads to the problem of the structure of the file.

Several techniques have been used experimentally. In almost all of them the approach has been to establish a dictionary of terms, their synonyms and some code to represent them. Documents are scanned by people who select the meaningful words and encode these words for inclusion in some formatted field, record, or file so that a search can be made of the formatted portion, which will then constitute the retrieval control.

Because word-by-word encoding has proved to be not entirely satisfactory, this technique has been expanded to include phrases or, as sometimes stated, "keywords in context." Again the process is one of human interpretation of what is significant in the document. As encoders change and as individuals' moods change, the index capability changes, introducing inconsistencies which will degrade the retrieval capability.

The English language being what it is, things such as prefixes, suffixes, tenses, etc., present the indexer and the file definer with problems of the type related to unformatted data. With the field length varying from one letter to more than 25 letters, and irregular verbs requiring cross-referencing to their roots, a voluminous dictionary of terms would be required.

Perhaps another approach to the problem could be investigated. Eliminate the human cataloguer or indexer from the system. Rather than look for the significant words or phrases, establish a machine search technique which would identify the "nonsignificant" words, i.e., the, and, but, that, etc. There are probably fewer of these in the English language than the other type of words; therefore, a much more limited dictionary could be used for an initial screening of a document to form the basis of indexing, storage and retrieval. "Nonsignificant" words appear to constitute approximately 50 and up to 65 percent of most documents. The remaining words could then be catalogued by their location within the document, and some formatted file of these words be generated as the retrieval control.
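A minimal sketch of this screening idea follows: a document is filtered against a small dictionary of "nonsignificant" words, and the surviving words are catalogued by their positions. The stop list and sample sentence are illustrative only.

```python
# Screen a document against a limited dictionary of "nonsignificant"
# words, then catalogue each remaining word by its location within the
# document. The resulting map would serve as the retrieval control.

NONSIGNIFICANT = {"the", "and", "but", "that", "a", "of", "to",
                  "is", "are", "in"}

def build_index(document: str) -> dict[str, list[int]]:
    """Map each significant word to the word positions where it occurs."""
    index: dict[str, list[int]] = {}
    for position, raw in enumerate(document.lower().split()):
        word = raw.strip(".,;:\"'()")
        if word and word not in NONSIGNIFICANT:
            index.setdefault(word, []).append(position)
    return index

text = "The radar returns are processed and the threatening objects reported."
print(build_index(text))
# {'radar': [1], 'returns': [2], 'processed': [4], 'threatening': [7],
#  'objects': [8], 'reported': [9]}
```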

Any index of this type of information will be large. One of the applications with which I am working will require the capacity to store between 200 and 300 narratives a day, with an historical depth of not less than one year and preferably two years, to improve both the "deferred" and "real-time" analytical capability of our analysts. The indexing problem is tremendous, and a structure for this index which permits ready access to the desired data without a serial search of the entire file is desired. Tape files with chronological addition of the data to the file generate a tremendous amount of tape spinning, with the associated inefficient use of the central processor.

This has led to the consideration of disk files, tape files, and bulk core memory. During the investigation there has been much emotion and little fact upon which to base our decision. We have sifted through much of the emotion and as much fact as we could find. Our "guestimates," conjectures, if you please, indicate that there are some areas of data retrieval where tape will outperform disk for the retrieval of information for processing purposes. The controlling factors seem to be the record length and its relation to the track length for recording on the disk. Our initial feeling with the announcement of large-volume disks was one of elation. We have now tempered that elation and realize we need more data relative to the definition of the payoff crossover point between disk and tape. One of the applications in which we see the greatest payoff for disks is that of sorting formatted data for purging, merging and updating of the file.
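A toy model can illustrate how record length relative to track length drives that crossover. Every device parameter below is hypothetical, chosen only to make the comparison visible; none comes from the talk or from any particular machine.

```python
# A toy disk-vs-tape retrieval model. Short records fit a track and favor
# disk; very long records spanning many tracks pay a seek and latency per
# track, and a single sequential tape pass can win.

import math

def disk_read_seconds(record_chars, track_chars=4_000,
                      seek_s=0.150, rev_s=0.040):
    """Assume each track's worth of the record costs a seek, half a
    revolution of latency, and one revolution to read."""
    tracks = math.ceil(record_chars / track_chars)
    return tracks * (seek_s + rev_s / 2 + rev_s)

def tape_read_seconds(record_chars, chars_per_inch=556,
                      ips=112.5, position_s=60.0):
    """Assume one long positioning pass, then a sequential read."""
    return position_s + record_chars / (chars_per_inch * ips)

for chars in (2_000, 60_000, 2_000_000):
    d, t = disk_read_seconds(chars), tape_read_seconds(chars)
    print(f"{chars:>9,} chars  disk {d:7.2f}s  tape {t:7.2f}s")
# ->     2,000 chars  disk    0.21s  tape   60.03s   (disk wins)
# ->    60,000 chars  disk    3.15s  tape   60.96s   (disk wins)
# -> 2,000,000 chars  disk  105.00s  tape   91.97s   (tape wins)
```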

The announcement of large-size core memories, in excess of 200,000 words, by several manufacturers is interesting, and many applications in information handling can be seen. Large speedups are possible because bigger batches of data can be processed without repeated input-output interrupts. Large core memories should also allow a larger, more sophisticated index, with greater depth of cross-referencing, for retrieval.

In the application in which I am most interested, several individuals are required to have access to the data base. Under the standard techniques of executive and monitor control, the first one in with the highest priority would be the first to have his job processed, with the resultant queuing problem.

The area in which preliminary investigation shows the greatest payoff for large-scale information handling systems will accrue is multiprocessing capability, both in hardware and software, because several analysts may then be serviced concurrently. Several organizations are now operating such systems, either experimentally or in a limited operational situation.

Some sort of hybrid configuration of the computer with multiprocessor capability and an associative memory device appears to be desirable, the associative memory being the index or library catalogue, which would be computer generated by a technique similar to that previously discussed.

The request for data would be processed by the associative memory device, which would furnish to the central processor the acquisition control data whereby the data could be extracted or the desired documents retrieved. The associative memory device would be a job set-up preprocessor and, effectively, a peripheral unit.
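The arrangement can be sketched as below, with the associative memory modeled as a simple term-to-record map. The record layout (document id, reel, offset) is an assumption for illustration, not a description of any actual device.

```python
# A minimal sketch of the hybrid configuration: the "associative memory"
# holds the computer-generated catalogue and answers a term lookup with
# acquisition-control data; the central processor then uses that data to
# extract the document, never scanning the file serially.

from dataclasses import dataclass

@dataclass
class AcquisitionControl:
    document_id: int
    tape_reel: str      # where the full document is stored
    record_offset: int  # position of the document within the reel

# Catalogue generated by, e.g., the stop-word screening technique above.
catalogue: dict[str, list[AcquisitionControl]] = {
    "radar": [AcquisitionControl(101, "REEL-07", 42)],
    "ellipse": [AcquisitionControl(101, "REEL-07", 42),
                AcquisitionControl(233, "REEL-11", 5)],
}

def lookup(term: str) -> list[AcquisitionControl]:
    """The associative lookup: index term in, acquisition-control data out."""
    return catalogue.get(term.lower(), [])

print(lookup("ellipse"))  # both documents mentioning "ellipse"
```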

Earlier a data-collection system was mentioned which required a large amount of time for data transmission. Before any large information-handling system can be automated to the degree required to handle the "real-time" and "deferred" requirements, some way must be found to summarize the data at the collection point. One technique is to place a data processor at the collection source. This was done at BMEWS.

Secondly, some form of error detection or correction system must be designed into the communications system and terminals. Until this is done, human intervention between the collection source and the input to the data file will be required, with the resultant slowing of the system response time in satisfying the "real-time" requirement.
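The talk does not name a specific code; one simple form of detection from the period is a parity bit per 6-bit character, as on 7-track tape. The sketch below shows only that scheme and is my choice of example, not the system's design.

```python
# Odd parity over 6-bit characters: the check bit makes the total count
# of 1-bits odd. Detects any single-bit error in a character; a real
# system would add longitudinal checks or correction codes on top.

def parity_bit(char6: int) -> int:
    """Return the bit that makes the 7-bit frame's 1-count odd."""
    ones = bin(char6 & 0x3F).count("1")
    return 0 if ones % 2 == 1 else 1

def frame(char6: int) -> int:
    """Pack 6 data bits plus the parity bit into a 7-bit frame."""
    return (parity_bit(char6) << 6) | (char6 & 0x3F)

def check(frame7: int) -> bool:
    """A received frame is valid iff its total 1-bit count is odd."""
    return bin(frame7 & 0x7F).count("1") % 2 == 1

sent = frame(0b101101)
assert check(sent)                 # arrives intact
assert not check(sent ^ 0b000100)  # a single flipped bit is detected
```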

Most systems today require pro forma sheets from which the keypunch operator punches cards which in turn are verified on another keypunch.

We are looking toward elimination of the card punch requirement by substituting a keyboard with a monitor readout, so that the catalogue keypunch operator can correct as he keypunches and get the data more directly to magnetic tape for insertion in the data base. Eventually, as programming techniques are developed, the cataloguing can be automated to a large extent. Consoles of this same type will be available to our analysts for the insertion of their queries.

The organization with which I work is out at the far end of the line; that is, we use the techniques and hardware you people design in an operational environment. We are not aware of all the techniques under study and do not always know where to go to get the information. Perhaps some organization such as the Knowledge Availability Systems Center might act as the central facility for information relative to information-handling techniques. This, in itself, would present an interesting information-handling problem in the area of unformatted data handling.

In this rambling presentation, however, are the basic elements upon which I framed the conjectures which follow:

1. Except for the volume of data involved, formatted files constitute no serious problem to any programming group.

2. Insufficient specific problems related to the handling of unformatted data, i.e., narrative text, have been solved in detail to permit the techniques to be expanded to the general case.

3. Where multiple sensors feed a central file, some summarizing or screening technique at the collection site is required to reduce the communications requirements and prevent cluttering of the central file.

4. Error-detection and correction codes in communications systems will be an absolute necessity before any automated indexing and file generation system will work.

5. Some system for the interchange of information on the status of techniques and hardware development in information handling is required.
