
4.4 Visualization for Document Structure Analysis

4.4.2 Visualization of Features

Improving or adapting the structure analysis process for a new document collection or a new analysis task requires analyzing the existing structure analysis features. Instead of using the structure type for coloring the document elements, the color can be derived from a feature value. Figure 4.9 shows example visualizations of the rational Font Size Change and the nominal Paper Keywords features. This visualization helps to verify the implementation of a feature and to understand the problems of the structure analysis models.
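Deriving a color from a feature value can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names, the palette, and the white/red/blue scale are assumptions chosen for the example. A signed numeric feature such as Font Size Change is mapped onto a diverging scale, while a nominal feature such as Paper Keywords gets one palette color per distinct value.

```python
from typing import Dict, Hashable, Iterable, Tuple

RGB = Tuple[int, int, int]

WHITE: RGB = (255, 255, 255)
RED: RGB = (200, 0, 0)    # assumed color for positive values (increase)
BLUE: RGB = (0, 0, 200)   # assumed color for negative values (decrease)

# Hypothetical categorical palette for nominal features.
PALETTE = [(166, 206, 227), (178, 223, 138), (251, 154, 153), (253, 191, 111)]


def lerp_color(low: RGB, high: RGB, t: float) -> RGB:
    """Linearly interpolate between two colors; t is clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return tuple(round(a + (b - a) * t) for a, b in zip(low, high))


def diverging_color(value: float, max_abs: float) -> RGB:
    """Map a signed numeric feature value to a background color.

    Zero maps to white, +max_abs to red, -max_abs to blue.
    Values outside [-max_abs, max_abs] are clamped.
    """
    if max_abs <= 0:
        return WHITE
    t = value / max_abs
    return lerp_color(WHITE, RED, t) if t >= 0 else lerp_color(WHITE, BLUE, -t)


def nominal_colors(values: Iterable[Hashable]) -> Dict[Hashable, RGB]:
    """Assign each distinct nominal value a palette color, in order of first appearance."""
    mapping: Dict[Hashable, RGB] = {}
    for value in values:
        if value not in mapping:
            mapping[value] = PALETTE[len(mapping) % len(PALETTE)]
    return mapping
```

With only four palette entries, the nominal mapping reuses colors once more than four distinct values occur; a real visualization would need a palette at least as large as the number of feature values.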

(a) Low quality model trained with four example documents.

(b) Medium quality model trained with 16 example documents.

Figure 4.7: Logical structure analysis results with two models of different quality.

(a) Highlighting of uncertain results only in the thumbnail view.

(b) Highlighting of uncertain results in the thumbnail and detail views.

Figure 4.8: Logical structure visualization based on model confidence.

The Font Size Change feature shown in Figure 4.9a is suited for recognizing headlines in general. A headline usually starts with a large increase in font size followed by a large decrease. This feature allows headlines to be detected, but it cannot distinguish the different headline levels from each other. The same pattern can also occur by coincidence in vector graphics. The feature therefore leads to high recall for headlines, but additional features must be considered to reach high precision. The case of the Paper Keywords feature, shown in Figure 4.9b, is similar. Detecting keywords such as “Figure” or “Table” at the beginning of lines results in high recall for captions, but also in false positives in the running text. Reliable recognition of captions with high precision needs additional features, for instance ones that measure the line width and justification.
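The two features discussed above can be illustrated with a short sketch. It is a simplified reading of the description, not the feature implementation of this work: the function names, the threshold of 2 points, and the definition of the change as the difference to the previous line are assumptions made for the example.

```python
from typing import List, Optional, Sequence


def font_size_changes(sizes: Sequence[float]) -> List[float]:
    """Font size of each line minus that of the previous line (0 for the first line)."""
    if not sizes:
        return []
    return [0.0] + [cur - prev for prev, cur in zip(sizes, sizes[1:])]


def headline_candidates(sizes: Sequence[float], threshold: float = 2.0) -> List[int]:
    """Indices of lines that start with a large increase in font size
    and are followed by a large decrease (hypothetical threshold)."""
    changes = font_size_changes(sizes)
    n = len(changes)
    candidates: List[int] = []
    i = 0
    while i < n:
        if changes[i] >= threshold:
            # Skip lines that keep the enlarged font size (multi-line headline).
            j = i + 1
            while j < n and abs(changes[j]) < 1e-9:
                j += 1
            if j < n and changes[j] <= -threshold:
                candidates.extend(range(i, j))
                i = j
                continue
        i += 1
    return candidates


def paper_keyword(line: str, keywords: Sequence[str] = ("Figure", "Table")) -> Optional[str]:
    """Nominal feature: the keyword a line starts with, or None."""
    stripped = line.lstrip()
    for keyword in keywords:
        if stripped.startswith(keyword):
            return keyword
    return None
```

The sketch reproduces both weaknesses named in the text. `headline_candidates` reports any enlarged span regardless of headline level, and `paper_keyword("Figure 4.9 shows example visualizations")` returns `"Figure"` even though the line is running text rather than a caption, which is why further features such as line width and justification are needed for high precision.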

4.5 Summary

Document structures and structure analysis features can be visualized through coloring the background of a document. Thumbnails help in navigation
