

5.4.2 Future work

There are several issues we plan to address in the future. The projection view changes abruptly when modified by external events, making it difficult to preserve the mental map of the projected items. We plan to develop methods that reduce the changes from one view to the next and to implement smooth animations that help relate the new projection to the old one.
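One way to realize such smooth transitions is to interpolate item positions between the old and the new layout with an easing function. The following is a minimal sketch under that assumption; the function names and the cubic easing are illustrative choices, not the planned implementation:

```python
# Sketch of a smooth transition between two projection layouts.
# The easing choice (cubic ease-in-out) is an illustrative assumption.

def ease_in_out(t: float) -> float:
    """Cubic ease-in-out: slow start and end, fast middle."""
    return 3 * t**2 - 2 * t**3

def interpolate_layout(old, new, t):
    """Blend two {item_id: (x, y)} layouts at animation time t in [0, 1].

    Items present in both layouts glide between their positions; items
    that appear or disappear could additionally be faded in or out.
    """
    s = ease_in_out(t)
    frame = {}
    for item, (x0, y0) in old.items():
        if item in new:
            x1, y1 = new[item]
            frame[item] = (x0 + s * (x1 - x0), y0 + s * (y1 - y0))
    return frame

old = {"doc1": (0.0, 0.0), "doc2": (1.0, 1.0)}
new = {"doc1": (2.0, 0.0), "doc2": (1.0, 3.0)}
print(interpolate_layout(old, new, 0.5))  # mid-animation frame
```

Rendering a sequence of such frames lets the viewer visually track each item from its old position to its new one.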

As the analysis gets more complex and the user goes through multiple steps, it becomes difficult to remember previous steps and return to interesting states visited earlier. We plan to implement a history and save mechanism that supports this specific need. We also plan an assistance system that helps the user choose a suitable expansion size. Finally, an in-depth investigation of how different fingerprint generation methods and parameter sets influence the resulting distances is of high interest.
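Such a history mechanism can be sketched as a cursor over a list of state snapshots. The class below is a hypothetical illustration, assuming states are simple dictionaries, and is not the planned implementation:

```python
# Minimal sketch of a history mechanism for analysis states.
# The class and its snapshot format are assumptions for illustration.

class AnalysisHistory:
    """Stores visited analysis states and supports stepping back and forth."""

    def __init__(self):
        self._states = []   # chronological snapshots
        self._cursor = -1   # index of the current state

    def record(self, state: dict) -> None:
        """Save a new state; states undone past are discarded."""
        self._states = self._states[: self._cursor + 1]
        self._states.append(dict(state))
        self._cursor += 1

    def back(self) -> dict:
        """Return to the previous state."""
        if self._cursor > 0:
            self._cursor -= 1
        return self._states[self._cursor]

    def forward(self) -> dict:
        """Redo one step, if available."""
        if self._cursor < len(self._states) - 1:
            self._cursor += 1
        return self._states[self._cursor]

history = AnalysisHistory()
history.record({"query": "protein", "expansion": 5})
history.record({"query": "protein", "expansion": 20})
print(history.back())  # {'query': 'protein', 'expansion': 5}
```

A save mechanism would serialize the same snapshots to disk so that sessions can be restored later.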

Chapter 6 Conclusion

This thesis presented approaches for the visualization of large document collections.

Document Cards were introduced as a method to represent single documents at small scale. The evolution of whole corpora over time was shown with context-aware tag clouds. Overlap of data representatives occurs in many visualization applications; we gave an extensive overview of existing methods and presented a new method to resolve this overlap. Finally, with HiTSEE we enabled the investigation of molecular data as abstract objects using an interactive metaphor that fitted the needs of domain experts.

Taking the described methods as a foundation, one future work idea immediately arises: a collection browser system. This browser will be a user-centric and user-administrated view on document collections, similar to the kiosk system of Figure 2.4. Besides the manual arrangement of Document Cards, different automatic positioning methods will shed light on different document constellations in the collection. Input for deriving positions are variants of document content and document meta information. Bridging from molecule exploration to document browsing, the design for such a system, utilizing insights from HiTSEE, is given in Figure 6.1. Creating the system for different devices and evaluating its effectiveness for document exploration and serendipitous findings is future work. To further develop Document Cards, we will look at cross-domain applications considering books or plain-text documents as input.

Another important question is how the combination of images and texts can support users. Psychological experiments already hint at good perceptual performance of images for explanation, but methods for the automatic generation of integrated image-text compilations that outperform single-content-type visualizations for gaining insights need more attention. The question of how well images perform as discriminators or as memory aids has been addressed recently by Isola et al. [IXTO11] but still needs further investigation. Further future ideas are described in detail at the end of each chapter.

Towards the end, we should ask how far we have come from desks filled with documents. The fundamentals are at hand, and the described Document Browser will have the capability to manage large collections. Additionally, the current trend towards high-resolution, high-density displays may enable us in the near future to read more text on computer and tablet screens and to reduce the amount of printed documents. But as Niels Bohr stated: "predictions are hard to make, especially about the future". So my final and minimal hope is that I was able to raise interest in the vibrant research area of document collection visualization.

[Figure omitted: mock-up of the projection view showing Documents A–H and a "Project & Expand" button.]

Figure 6.1: Design for a "Project & Expand" system for document collections. The user can define a degree-of-interest function by typing a search term. A relevance-sorted list of documents is returned. By selecting documents (blue) and hitting the Project & Expand button, the selected documents are projected onto the 2D canvas, enriched with the documents semantically closest to the selection (orange).
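The core of such a "Project & Expand" step, ranking documents against a query and enriching a selection with its semantically closest neighbours, can be sketched as follows. The bag-of-words vectors and cosine similarity used here are simplifying assumptions; the actual system could use any document fingerprint and projection method:

```python
# Sketch of ranking and expansion for a "Project & Expand" system.
# Bag-of-words vectors and cosine similarity are illustrative assumptions.
from collections import Counter
from math import sqrt

def vector(text: str) -> Counter:
    """Naive bag-of-words term vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query: str, docs: dict) -> list:
    """Relevance-sorted document ids for the degree-of-interest query."""
    q = vector(query)
    return sorted(docs, key=lambda d: cosine(q, vector(docs[d])), reverse=True)

def expand(selection, docs, k=1):
    """Enrich a selection with the k semantically closest other documents."""
    sel_vec = Counter()
    for d in selection:
        sel_vec += vector(docs[d])
    rest = [d for d in docs if d not in selection]
    rest.sort(key=lambda d: cosine(sel_vec, vector(docs[d])), reverse=True)
    return list(selection) + rest[:k]

docs = {
    "A": "tag clouds visualize text over time",
    "B": "document cards summarize text documents",
    "C": "molecule screening with interactive views",
}
print(rank("text documents", docs))
print(expand(["B"], docs, k=1))
```

The selected and expanded documents would then be handed to a projection method (e.g., multidimensional scaling of their pairwise distances) to obtain the 2D canvas positions.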

Appendix A

Content Extraction from PDF files

Nowadays, electronic documents are often encoded in the Portable Document Format (PDF), a format developed by Adobe Systems and standardized since 2008 as ISO 32000-1:2008. Unfortunately, the standard does not cover "specific processes for converting paper or electronic documents to the PDF format" or "specific technical design, user interface or implementation or operational details of rendering", as stated by ISO1. That means the creation of PDF files allows freedom in document structuring as long as the resulting documents appear the same in every PDF viewer application. These weak obligations on how to embed content impede its access. For example, consider a scenario where two images should be positioned on one page. The single images can either be placed by geometric instructions (translate, scale, ...) or they can be pre-rendered onto a transparent image of page size, which itself is used as one big background image. Designing a method that can extract single images in both scenarios is difficult and exemplary of PDF access tasks. We give a short overview of related work and describe our method afterwards.
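To illustrate the first scenario, image placement via geometric instructions, the toy parser below interprets a heavily simplified PDF content stream. In PDF, `a b c d e f cm` modifies the current transformation matrix, `Do` draws an XObject (such as an image) into the unit square mapped by that matrix, and `q`/`Q` save and restore the graphics state. The hand-written stream and the replace-instead-of-multiply transform handling are simplifications for illustration; a real extractor must handle the full operator set, matrix composition, and inline images:

```python
# Toy parser for image placement in a simplified PDF content stream.
# Assumes axis-aligned transforms and replaces (rather than multiplies)
# the current transformation matrix -- a deliberate simplification.

def image_placements(content: str):
    """Return (name, x, y, width, height) for each drawn image XObject.

    For an axis-aligned matrix (a 0 0 d e f), an image drawn with 'Do'
    has size (a, d) and lower-left origin (e, f).
    """
    tokens = content.split()
    ctm = (1, 0, 0, 1, 0, 0)   # identity matrix
    stack = []                 # saved graphics states (q / Q)
    placements = []
    for i, t in enumerate(tokens):
        if t == "q":
            stack.append(ctm)
        elif t == "Q":
            ctm = stack.pop()
        elif t == "cm":
            a, b, c, d, e, f = (float(x) for x in tokens[i - 6 : i])
            ctm = (a, b, c, d, e, f)   # simplification: replace, not multiply
        elif t == "Do":
            name = tokens[i - 1]
            a, _, _, d, e, f = ctm
            placements.append((name, e, f, a, d))
    return placements

stream = "q 200 0 0 100 50 700 cm /Im1 Do Q q 200 0 0 100 50 550 cm /Im2 Do Q"
print(image_placements(stream))
```

In the second scenario, a single pre-rendered background image, no such per-image instructions exist in the stream, which is precisely why a uniform extraction method is hard to design.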

A.1 Related Work

Many tools to access text in PDF files are available as web-based online services, but they lack the ability to extract the positions of text and images. Additionally, when uploading files, they do not guarantee security or privacy.

1http://www.iso.org/iso/iso_catalogue/catalogue_detail.htm?csnumber=51502

Therefore, we focus on methods that operate offline on standalone machines. Chao and Fan [CF04] split PDF documents into three components: images, vector images, and text. The different components are extracted from component-only renderings of each document. Maderlechner et al. [MPS06] focus on finding figure captions and mapping them to images. Cohen et al. [KCM03] mention using a modified version of open-source tools to extract images. An approach based on learning methods, involving visual analytics to speed up the training phase, is described by Stoffel et al. [SSKK10]. Recently, Ramakrishnan et al. [RPHB12] proposed a tool (LA-PDFText) based on the open-source Java library JPedal [JPe12]. It uses rules to define heuristics for text extraction from scientific publications, forming text blocks and later classifying these blocks into functional groups. The tool follows an idea similar to ours, although it cannot extract images in its current version.
