

5 Case Studies: Using the Data Type »Image«

5.1 Semantic Requests to Image Databases in IRIS

A major problem arising in the present “age of the images” is the sheer number of pictures to be dealt with. The most elementary task of finding a certain picture or even a reasonably small group of relevant images in the enormous corpus of pictures available (for example, in a press archive) becomes increasingly hard. It is sometimes easier to produce a completely new picture instead – which then again contributes to clogging up the archive.

The classical tool of pictorial archiving is to index pictures by means of more or less arbitrary annotations associated conventionally with them. The keepers of the archives have to associate the annotations manually, essentially following their understanding of the pictures’ essential features or contents and the cataloguing principles of their profession. It is quite tedious for any human being to describe the content of thousands of images following rather fixed criteria, and to construct a corresponding index.

However, as soon as a new criterion becomes relevant, all images already categorized would have to be revised again: such processes are better performed automatically.

In principle, we have to distinguish between several cases of image retrieval:

1. We want images that contain certain syntactic features, e.g., a red circle or a large patch of grass texture. Although such a request can be quite helpful if no other means of searching is provided, it is relatively uninteresting in most situations. If the archive is managed computationally, sample pictures can be used to mark the features instead of giving them symbolically: IBM’s system QBIC provides exactly such queries by example.

2. We want images that have a certain picture content, e.g., two persons in front of a forest: this is the most interesting case, and we deal with it below. Specifying (partially) a picture content can range from a single sortal object type (‘a chair’) to a fairly precise set of relative locations of several objects of certain types with associated visual features (‘a red sports car with a blond guy sitting inside and a black-haired woman standing at the left side door’).

3. We want images that have certain individual referents in them, e.g., a picture of the Taj Mahal. As was explained in section 4.3.2.4, image reference is problematic unless an unspecific individual is meant: ‘unspecific’ means that the individual in question is not known so far from some other context – it is just a spontaneously generated intentional object. In contrast to that, the specific individual pictured must be known as the same individual in other contexts, as well. The picture per se cannot establish such an identification – another object with high visual similarity could be the referent just as well.

The task of retrieving pictures from a database in a semantic fashion, i.e., by means of giving content descriptions, can be stated in relatively simple terms – nevertheless, it is quite demanding to solve. Some aspects of such a task have already been sketched in section 4.4.1.4; but there, PINEDA assumed that descriptions of the images’ content had been derived in advance, and essentially “by hand”.

5.1.1 Image Retrieval for Information Systems

The project IRIS (Image Retrieval for Information Systems85, 1994–1996, Univ. Bremen) has approached the task of retrieving a group of images currently of interest from a huge image archive in which the pictures are automatically indexed according to their content (up to a certain level of detail). The system developed autonomously describes images by their content in a textual form. Only a specification of the general picture type is necessary, e.g., landscape picture, technical drawing, or sports photograph, since the algorithms performing the image analysis depend on domain-specific parameters that cannot yet be extracted by the computer on its own.

The resulting annotations are fed into a standard textual database together with the references to the corresponding picture files. A user of the system can retrieve the references to images by keywords from the annotations, employing the well-known methods of text retrieval. The keywords can be derived by means of an analysis of a sample picture, as well (Fig. 104).

85 The system has been implemented in C on IBM RS/6000 with AIX. It later became part of IBM’s system “ImageMiner”.

Figure 104: General Architecture of Content-Based Image Retrieval

The most interesting task is to construct the image content that is the basis of the annotations describing the pictures. An overview of this image analysis component is given in Figure 105. As described in section 4.3.2, the first step of picture analysis is to determine elementary pixemes: several types of marker values like colors, texture attributes, and contour elements are extracted from the image. Algorithms based on those described in [HARALICK ET AL. 1973] and [KORN 1988] have been used for categorizing those features.

Elementary pixemes depending on color and texture are based on a grid of adjustable size subdividing the image into grid elements. For every grid element, a color histogram is computed and reduced to a color category: the color category appearing most frequently defines the color of the grid element. Neighboring grid elements with the same color are grouped, and the circumscribing rectangles are determined. The results of color-based segmentation are described qualitatively by means of attributes such as relative size, position relative to the underlying grid, and the color category.
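
To make the color step concrete, the following minimal Python sketch shows grid-based color reduction and grouping. None of it is the IRIS code: the image is assumed to be a nested list of RGB triples, and color_category is a crude stand-in for a proper palette classifier over the ten color categories of Table 3.

    from collections import Counter

    # Hypothetical palette classifier: maps an (r, g, b) triple to one of the
    # ten IRIS color categories of Table 3. A real implementation would measure
    # distances in a perceptual color space; this is a crude stand-in.
    def color_category(rgb):
        r, g, b = rgb
        if r > 200 and g > 200 and b > 200:
            return "white"
        if r < 50 and g < 50 and b < 50:
            return "black"
        return max((("red", r), ("green", g), ("blue", b)), key=lambda p: p[1])[0]

    def dominant_colors(image, grid):
        """Reduce each grid element to its most frequent color category."""
        cell_h, cell_w = len(image) // grid, len(image[0]) // grid
        cells = {}
        for gy in range(grid):
            for gx in range(grid):
                hist = Counter(color_category(image[y][x])
                               for y in range(gy * cell_h, (gy + 1) * cell_h)
                               for x in range(gx * cell_w, (gx + 1) * cell_w))
                cells[(gx, gy)] = hist.most_common(1)[0][0]
        return cells

    def color_rectangles(cells):
        """Group neighboring grid elements of equal color; return the color and
        the circumscribing rectangle (min/max grid coordinates) per group."""
        seen, rects = set(), []
        for start, color in cells.items():
            if start in seen:
                continue
            member, stack = [], [start]
            while stack:  # flood fill over 4-neighbors with the same color
                gx, gy = stack.pop()
                if (gx, gy) in seen or cells.get((gx, gy)) != color:
                    continue
                seen.add((gx, gy))
                member.append((gx, gy))
                stack += [(gx + 1, gy), (gx - 1, gy), (gx, gy + 1), (gx, gy - 1)]
            xs, ys = [p[0] for p in member], [p[1] for p in member]
            rects.append((color, (min(xs), min(ys), max(xs), max(ys))))
        return rects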

Similarly, for every grid element, the system performs some matrix calculations yielding local statistic parameters like entropy, variance, correlation, and angular second moment, on which the texture analysis is based. The mapping between the statistical values and the texture category to be used is performed by means of a neural net and depends on the type of scenes considered: certain statistic parameters may indicate one texture category in landscapes, for example, and another one in indoor scenes. Therefore, the neural net has to be trained in advance by backpropagation with textures typical for the domain chosen (e.g., sky, clouds, sand, forest, grass, stone, snow, ice for landscapes).
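
The four statistics named here are classical texture features in the tradition of [HARALICK ET AL. 1973]. The sketch below, a simplified reading rather than the IRIS implementation, computes them from a normalized gray-level co-occurrence matrix; gray values are assumed to be quantized to 0 .. levels-1, and the subsequent mapping to a texture category by the trained net is omitted.

    import math

    def glcm(gray, levels=8, dx=1, dy=0):
        """Normalized gray-level co-occurrence matrix for one displacement;
        gray values are assumed quantized to 0 .. levels-1."""
        p = [[0.0] * levels for _ in range(levels)]
        h, w = len(gray), len(gray[0])
        n = (h - dy) * (w - dx)
        for y in range(h - dy):
            for x in range(w - dx):
                p[gray[y][x]][gray[y + dy][x + dx]] += 1.0 / n
        return p

    def texture_statistics(p):
        """Entropy, variance, correlation, and angular second moment of a GLCM
        (simplified versions of the Haralick features)."""
        n = len(p)
        cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
        mu = sum(i * v for i, _, v in cells)
        var = sum((i - mu) ** 2 * v for i, _, v in cells)
        asm = sum(v * v for _, _, v in cells)                 # angular 2nd moment
        ent = -sum(v * math.log(v) for _, _, v in cells if v > 0)
        cor = (sum((i - mu) * (j - mu) * v for i, j, v in cells) / var
               if var > 0 else 0.0)
        return {"entropy": ent, "variance": var,
                "correlation": cor, "angular_second_moment": asm}

These four values per grid element then form the feature vector that the domain-specifically trained net maps to a texture category.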

Figure 105: Architecture of Image Analysis in the System IRIS

Again, neighboring grid elements with the same texture type are grouped together so that the circumscribing rectangle can be used as the basis of the qualitative description.

Shape attributes are represented through contour-based region descriptions. Detection of edge elements based on the intensity gradient is a standard tool of image processing.

To avoid the inherent scale-space problem of the gradient-threshold calculation, a “pyramid-structured” approach with several levels of resolution is used in IRIS. Relevant edge points (i.e., not noise) are collected into contours if they continue a contour hypothesis starting with the most prominent edge points. Closed contours are finally used to determine regions.
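
The two ingredients of this step can be sketched as follows; this is an illustrative simplification, not the algorithm of [KORN 1988]: downsample produces one pyramid level by averaging, and trace_contours grows contour hypotheses from the most prominent edge points of a simple central-difference gradient.

    def downsample(gray):
        """One pyramid level: halve the resolution by averaging 2x2 blocks."""
        return [[(gray[y][x] + gray[y][x + 1] +
                  gray[y + 1][x] + gray[y + 1][x + 1]) // 4
                 for x in range(0, len(gray[0]) - 1, 2)]
                for y in range(0, len(gray) - 1, 2)]

    def gradient_magnitude(gray):
        """Central-difference intensity gradient for every inner pixel."""
        h, w = len(gray), len(gray[0])
        return {(x, y): abs(gray[y][x + 1] - gray[y][x - 1]) +
                        abs(gray[y + 1][x] - gray[y - 1][x])
                for y in range(1, h - 1) for x in range(1, w - 1)}

    def trace_contours(gray, threshold):
        """Grow contour hypotheses from the most prominent edge points: a seed
        is extended along neighboring edge points above the threshold."""
        mag = gradient_magnitude(gray)
        seeds = sorted(((m, p) for p, m in mag.items() if m >= threshold),
                       reverse=True)
        visited, contours = set(), []
        for _, seed in seeds:                      # strongest seeds first
            if seed in visited:
                continue
            contour, frontier = [], [seed]
            while frontier:
                x, y = frontier.pop()
                if (x, y) in visited or mag.get((x, y), 0) < threshold:
                    continue
                visited.add((x, y))
                contour.append((x, y))
                frontier += [(x + dx, y + dy)
                             for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
            if len(contour) > 2:                   # discard isolated noise points
                contours.append(contour)
        return contours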

Color rectangles, texture rectangles, and shape descriptions are encoded in a qualitative manner. Take as an example the BNF specification developed by the author for color rectangles given in Table 3.86

Spatial objects as the basic elements of the goal descriptions are associated with sets of segments that are visually perceptible in the picture, but they also involve relations to their parts, or to wholes of which they are parts. The definition of those part-whole relations for a particular type of object in fact determines which sets of pictorial segments show an instance of that type, and which deviations are not to be rated as such instances.

Spatial objects in the intended sense are constituted by the coordination of corresponding segments by means of object schemata relevant for the domain in question (Fig. 106). Context-sensitive techniques are used to guide this process. The goal is to eliminate ambiguity as early as possible by means of expectations.

An association between segments with the same marker values is usually only possible for elementary parts of spatial objects. To that purpose, topological relations between the pixemes found in the previous step are employed in a graph grammar parser to identify candidates for elementary parts of the scene in question [KLAUCK 1994]. Thus, if a certain contour-based region, a white color rectangle, and a snow texture rectangle overlap widely, a region of snow is likely to have been recognized. A color rectangle of either blue or white in the upper part of the picture together with an overlapping cloud texture rectangle gives a good reason for having recognized clouds (Figs. 106 and 107).
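
The actual identification is done by the graph grammar parser over topological relations; the following sketch merely illustrates the kind of rule involved, with a hypothetical “wide overlap” test over circumscribing rectangles (the coordinate convention and the threshold of 0.7 are assumptions, not taken from IRIS).

    def overlap_ratio(a, b):
        """Fraction of rectangle a covered by rectangle b; rectangles are
        (x0, y0, x1, y1) in inclusive grid coordinates."""
        w = min(a[2], b[2]) - max(a[0], b[0]) + 1
        h = min(a[3], b[3]) - max(a[1], b[1]) + 1
        if w <= 0 or h <= 0:
            return 0.0
        return (w * h) / ((a[2] - a[0] + 1) * (a[3] - a[1] + 1))

    # Hypothetical rule in the spirit of the grammar: a snow region is accepted
    # when a contour-based region, a white color rectangle, and a snow texture
    # rectangle overlap widely (threshold chosen here for illustration only).
    def match_snow(region, color_rects, texture_rects, threshold=0.7):
        white = any(c == "white" and overlap_ratio(region, r) >= threshold
                    for c, r in color_rects)
        snowy = any(t == "snow" and overlap_ratio(region, r) >= threshold
                    for t, r in texture_rects)
        return white and snowy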

Note that it is not necessary to call for a precise overlap of color, texture, and shape: as is well known, for example, from aquarelles, contours and colors need not fit exactly and still allow us to determine clearly what is shown.

86 In the later versions of the system, the relative positioning was abandoned; the more precise grid positions of the rectangles are used instead. Furthermore, a “density parameter” was introduced, reflecting the proportion between the grid elements covered by the rectangle that have the corresponding feature and those that have not.

Table 3: Original BNF for color rectangles in IRIS

<color description> := "HOR=<hor>,VER=<ver>,SIZ=<siz>,DIR=<dir>,COM=<com>,COL=<col>"

<hor> := ll | left | middle | right | rr ;; horizontal position

<ver> := uu | up | middle | down | dd ;; vertical position

<siz> := XS | S | M | L | XL ;; qualitative size

<dir> := Ver | Hor | Dec | Inc | none ;; qualitative direction

<com> := Quad | Rect | Path ;; compactness

<col> := white | black | gray | red | yellow | blue | green | orange | violet | brown ;; the actual marker values
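
As an illustration of how such qualitative descriptions can be processed, here is a minimal sketch of a validating parser for the color description strings of Table 3; the function name and the dictionary encoding are illustrative, not part of IRIS.

    # Admissible values per attribute, taken directly from Table 3.
    COLOR_FIELDS = {
        "HOR": {"ll", "left", "middle", "right", "rr"},
        "VER": {"uu", "up", "middle", "down", "dd"},
        "SIZ": {"XS", "S", "M", "L", "XL"},
        "DIR": {"Ver", "Hor", "Dec", "Inc", "none"},
        "COM": {"Quad", "Rect", "Path"},
        "COL": {"white", "black", "gray", "red", "yellow",
                "blue", "green", "orange", "violet", "brown"},
    }

    def parse_color_description(s):
        """Parse a string such as
        'HOR=left,VER=up,SIZ=M,DIR=none,COM=Rect,COL=blue'
        into a dict, validating every value against the BNF of Table 3."""
        fields = dict(kv.split("=", 1) for kv in s.split(","))
        for key, allowed in COLOR_FIELDS.items():
            if fields.get(key) not in allowed:
                raise ValueError("invalid or missing %s: %r" % (key, fields.get(key)))
        return fields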

5.1.2 Results and Queries

The overall result of the parsing is a topological graph of primitive objects. Spatial relations over and above simple topological relations are not yet used in this version of IRIS. The resulting graph is parsed by a second graph grammar dealing in an expectation-driven manner with part-whole relations for more complex object concepts, the definitions of which are encoded in the thesaurus management system TM/2. It even allows the system to classify objects that are only partially pictured. The thesaurus management system forms the knowledge base providing the part-of relations inherent to the object schemata of a certain domain. The most complex “object” types are the scene categories, like landscape, architecture photography, or technical drawing, that are explicitly given when a picture is to be integrated. Stating explicitly the category of a picture when adding it to the database helps significantly to determine the annotations proper, since the expectation-driven parsing can be performed in a more focused manner.87
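
A hypothetical fragment of such a knowledge base, together with a tolerant, expectation-driven acceptance of partially evidenced wholes, might be sketched as follows; the hierarchy shown and the min_share criterion are illustrative assumptions, not the TM/2 encoding.

    # Hypothetical fragment of the part-of hierarchy for the landscape domain;
    # the actual definitions reside in the thesaurus management system TM/2.
    PART_OF = {
        "mountain landscape": ["snowy mountain", "meadow", "lake", "sky"],
        "snowy mountain":     ["snow", "stone"],
        "meadow":             ["grass"],
        "lake":               ["water"],
        "sky":                ["clouds"],
    }

    def classify(found, concept, min_share=0.5):
        """Expectation-driven acceptance of a complex concept: it is accepted
        if at least min_share of its parts are evidenced, so objects that are
        only partially pictured can still be classified."""
        parts = PART_OF.get(concept)
        if parts is None:                      # elementary concept
            return concept in found
        hits = sum(1 for part in parts if classify(found, part, min_share))
        return hits / len(parts) >= min_share

    # classify({"water"}, "mountain landscape")                  -> False
    # classify({"snow", "stone", "water"}, "mountain landscape") -> True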

The overall description of the image is finally given by one or several resulting structures reflecting the topic (e.g., mountain landscape), its particular complex constituents (e.g., snowy mountain, meadow, lake), their elements (snow, water), and the corresponding marker values, which is then fed into the database. That is, a structured document containing not only the final interpretation but all the intermediate descriptions of the image as well is indexed in a text retrieval system; a user, thus, may use both syntactic and semantic descriptions for searching images with the system IRIS.

Queries to the database can be formulated on any level or combination of levels contained in the image annotation (Fig. 108). Specific interfaces have been provided to help the user specify parameters on the lower levels. Color, for example, can be specified either by using text (a partially instantiated color rectangle description), an example “picked” from a given picture, or a color editor.

87 Specifying the category in advance is already necessary for using an appropriate set of parameters for feature extraction.

Figure 106: A Simple Object Schema (“Clouds”) and a Complex Object Schema (“Mountainlake”)

Similarly, texture can be specified verbally by the texture category (in a partially instantiated texture rectangle description) or by an example area from a given picture. Weighted correlation measures are used for computing the similarity between two feature vectors. Of course, the most complex type of request can only be stated verbally by naming the object concepts included in the scene, for example, asking for a picture with “mountain”, “snow”, and “lake”; or, on the most general level, by simply asking for the type of scene – “mountain scene”.
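
The text leaves the exact measure open; one plausible reading of a weighted correlation between two feature vectors is sketched below (the weighting scheme is an assumption).

    import math

    def weighted_correlation(u, v, w):
        """Weighted Pearson correlation of feature vectors u and v with
        per-feature weights w; returns a similarity in [-1, 1]."""
        total = sum(w)
        mu_u = sum(wi * ui for wi, ui in zip(w, u)) / total
        mu_v = sum(wi * vi for wi, vi in zip(w, v)) / total
        cov = sum(wi * (ui - mu_u) * (vi - mu_v) for wi, ui, vi in zip(w, u, v))
        s_u = math.sqrt(sum(wi * (ui - mu_u) ** 2 for wi, ui in zip(w, u)))
        s_v = math.sqrt(sum(wi * (vi - mu_v) ** 2 for wi, vi in zip(w, v)))
        return cov / (s_u * s_v) if s_u > 0 and s_v > 0 else 0.0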

Pragmatic restrictions, in particular concerning user modeling, have not been considered in IRIS. However, the image analysis has been explicitly designed in a relatively simple way so that users can more easily understand the categories used for indexing and are not misled into asking too much “understanding” from the system. Of course, IRIS does not really understand the pictures it analyses; it is able to deliver a coarse approximation to »picture content« based on a simplified picture syntax. This approximation is proposed as a simple-to-use semi-semantic specification for a kind of image retrieval close to common-sense picture understanding.

Beside its use in picture archiving in the strict sense, the automatically derived descriptions can be integrated as special digital watermarks in pictures that are to be published in the World Wide Web: search engines could use that information to find more easily the pictures a user wants, and to block irrelevant ones (or those prohibited for a certain user group).