A Formal, Declarative Approach to Data Format Description

Dissertation approved by the Department of Computer Science

of the Technische Universität Darmstadt

in fulfilment of the requirements for the academic degree of Dr.-Ing.

submitted by

Dipl.-Inform. (FH) Michael Hartle

born in Frankfurt am Main

Date of submission: 7 June 2010

Date of defence: 16 July 2010

Referees: Prof. Dr. Max Mühlhäuser, Darmstadt

Prof. Dr. Andreas Rauber, Vienna

Declaration of Honour

I hereby declare that I have written the submitted thesis for the academic degree of Dr.-Ing., entitled "A Formal, Declarative Approach to Data Format Description", independently and exclusively with the use of the stated resources. I have not previously made any attempt at obtaining a doctorate.

Darmstadt, 7 June 2010

Dipl.-Inform. (FH) Michael Hartle

Academic Career of the Author

09/1998 to 03/2003: Studies in Computer Science at the Fachhochschule Darmstadt, final grade 1.3

09/2002 to 03/2003: Diploma thesis at the Fachhochschule Darmstadt, title "Real-Time Generation of Planetary Landscapes"

07/2004 to 07/2010: Research associate at the Telecooperation group (Prof. Dr. Mühlhäuser) of the Technische Universität Darmstadt

07/2004 to 07/2010: Technical staff member at the e-learning center in the Hochschulrechenzentrum of the Technische Universität Darmstadt

09/2006 to 02/2010: Associated member of the Graduiertenkolleg 1223 "Qualitätsverbesserung im E-Learning durch rückgekoppelte Prozesse" (quality improvement in e-learning through feedback-driven processes)

Abstract

The concept of data formats is central to information storage and exchange, as it determines how information is written to and read back from format-compliant data by senders and receivers. In contrast to the widespread use of natural-language descriptions intended for human engineers, and of procedural definitions of format-compliant components, describing data format knowledge in a formal, declarative manner is necessary for making this knowledge machine-processible, enabling its flexible, automated application to format-compliant data. To that effect, data format knowledge is considered both on the level of format-compliant data as a data format instance and on the level of a data format consisting of such instances.

In a survey of the current State of the Art in Data Format Description, the examined related work from the data-centric research domains of Digital Preservation, Multimedia and Telecommunication shows a lack of suitable formalised models for universal applicability. In addition, the examined related work provides only a subset of the four descriptive capabilities necessary to describe data which may be primitive, structured, transcoded or fragmented.

In the analysis, a formalisation is presented which is based on the research hypothesis that a data format defines a normative set of lossless information representations, where there exists a bijective mapping between internal representations of senders / receivers and external representations that are exchanged as format-compliant data. The formalisation is universally applicable to arbitrary data formats, is suitable for both so-called lossless and lossy data formats, and leads to the notion of four elementary descriptive capabilities, which exactly match those used in the State of the Art survey. A valid Portable Network Graphics (PNG) raster image is given as a litmus test for data format description, as its description exercises all four elementary descriptive capabilities. Based on the formalisation, it is shown that a universal approach to data format description is too powerful in computational terms to be able to guarantee termination, that the tractability of bijective mapping functions and their inverses is neither given nor necessarily related, and that one-to-one correspondence of a bijective mapping function can be guaranteed using information-preserving, Turing-complete Reversible Turing Machines.

Building on the formalisation given in the analysis, the thesis defines the Bitstream Segment Graph (BSG) model for describing arbitrary data format instances. For BSG instances, representations are defined both for visualisation and for storage and exchange through machine-processible, RDF-based representations. Incremental construction and modification of BSG instances is enabled through a closed set of operations, and the coverage of a BSG instance is defined as a measure of its completeness. Actual tool support for the construction, modification and exploration of BSG instances on arbitrary data is provided through the Apeiron BSG Editor. Applications of the BSG model are demonstrated through the description of the PNG image litmus test from the previous analysis, and through the description of an exploit in the context of IT Security.

Building on the BSG model, the thesis defines the BSG Reasoning approach for describing arbitrary data formats as potentially infinite sets of data format instances. Using logic rules, a BSG instance can be inferred on a given bit sequence that is initially assumed to be format-compliant, and the approach defines a representation of rulesets for storage and exchange. Applications of BSG Reasoning are demonstrated through the description of a PNG image file format subset and through an outlined approach for format-aware fuzzing of bitstreams in IT Security. The PNG image file format subset described through BSG Reasoning exercises all elementary descriptive capabilities previously identified in the analysis, and it is shown that the resulting set of logic rules, despite a low number of format-specific rules, already yields a high coverage of inferred BSG instances on a number of valid PNG images.

The thesis closes with a retrospection, conclusions and an outlook on potential future research on the BSG model and the BSG Reasoning approach, focusing on aspects such as the computer-aided reverse-engineering of data format rules, or the use of reversible programming languages for the definition of lossless coding and [...]

Summary (Zusammenfassung)

The storage and exchange of information is closely tied to the concept of a data format. A data format specifies how information can be written as format-compliant data by a sender and read back from that data by a receiver. Although natural-language descriptions for human engineers are in widespread use today, and format-compliant processes are partly described procedurally, a formal, declarative description of data format knowledge would have the advantage of being machine-processible without being bound to a concrete procedure and without a detour via humans, so that it can be applied to format-compliant data flexibly and automatically. In this dissertation, data format knowledge is considered both on the level of format-compliant data as a data format instance and on the level of a data format, which consists of data format instances.

In a review of related work on data format description, approaches from the data-centric research domains of Digital Preservation, Multimedia and Telecommunication are examined, and it is found that suitable formalised models that are universally applicable to the description of data formats are lacking. Furthermore, the approaches considered offer only part of the descriptive capabilities needed to describe the composition of data that contains primitive values, represents a structure, has undergone a block transformation, or is present in fragmented form.

An analysis therefore develops a formalisation of the concept of a data format, based on the assumption that a data format is a normative set of lossless information representations. For such a set, there exists a bijective mapping between the internal representation of a sender / receiver and the corresponding external representation, which is exchanged as format-compliant data. This formalisation is universally applicable to arbitrary data formats, including both so-called lossy and lossless data formats, and leads to the concept of elementary descriptive capabilities, which coincide exactly with those used in the review. On the basis of these capabilities, a valid image in the Portable Network Graphics (PNG) file format is presented as a litmus test for approaches to data format description, since its description requires all four elementary descriptive capabilities. Based on the formalisation, it is then shown that a universally applicable approach to data format description is too powerful for its termination to still be guaranteed. It is further shown that bijective mapping functions and their inverses need not be efficient, and that the efficiency of a bijective mapping function and that of its inverse need not be related. Finally, it is shown that the correspondence between internal and external representations required for a bijective mapping can be guaranteed by defining it through an information-preserving, Turing-complete Reversible Turing Machine.

Building on the formalisation from the analysis, this dissertation defines the Bitstream Segment Graph (BSG) model, which serves to describe arbitrary data format instances. For BSG instances, representations are defined for visualisation, and a machine-processible, RDF-based representation is defined for storage and exchange. The stepwise construction and modification of BSG instances is enabled by a closed set of operations, and the completeness of a BSG instance can be determined by means of its coverage measure. The Apeiron BSG Editor makes the construction, modification and inspection of BSG instances on one's own data possible in practice. The application of the BSG model is demonstrated by describing the PNG image from the litmus test of the analysis, and by explaining the structure of an exploit in the context of IT Security by means of a description.

Building on the BSG model, this dissertation describes the BSG Reasoning approach for describing arbitrary data formats as potentially infinite sets of data format instances. Using logic rules, a BSG instance can be inferred on a given bit sequence that is initially assumed to be format-compliant. The approach also defines the representation of rule sets for storage and exchange. The application of the BSG Reasoning approach is demonstrated through the description of a subset of the PNG data format, complemented by the description of an approach to format-specific fuzzing of binary data in the context of IT Security. The description of the PNG data format subset using the BSG Reasoning approach makes use of all four elementary descriptive capabilities previously identified in the analysis, and it was shown that the set of logic rules used for this purpose, despite its small size, is already able to infer BSG instances with a high degree of coverage for a number of valid PNG images.

The dissertation closes with a retrospection of the entire work, draws conclusions and offers an outlook on future research regarding the BSG model and the BSG Reasoning approach, especially with regard to aspects such as the computer-aided analysis of data and the respective underlying data format rules, or the use of reversible [...]

Contents

1 Introduction
1.1 Motivation
1.1.1 Formal descriptions over natural-language descriptions
1.1.2 Declarative approaches over procedural approaches
1.1.3 Research domains
1.1.4 Data format instances and data formats
1.2 Research Problems
1.3 Contributions and Outline
1.4 Acknowledgements

2 State of the Art
2.1 Introduction
2.2 Digital Preservation
2.2.1 Data Format Registries
2.2.2 Open Archival Information System Reference Model
2.2.3 Typed Object Model
2.2.4 eXtensible Characterisation Language
2.2.5 Data Format Description Language
2.3 Multimedia
2.3.1 MPEG 1/2 Methodology
2.3.2 Formal Language for Audio-Video Object Representation
2.3.3 Bitstream Syntax Description Language
2.3.4 BFlavor and gBFlavor
2.4 Telecommunication
2.4.1 Abstract Syntax Notation One
2.4.2 Concrete Syntax Notation 1
2.5 Discussion
2.5.1 Consideration of examined Related Work
2.5.2 General observations
2.5.3 Other approaches
2.6 Summary

3 Analysis
3.1 Introduction
3.2 Research Hypothesis
3.3 Formalising Data Format Description

3.3.2 Representing complex information
3.3.3 Validating representations
3.3.4 Mapping between representations
3.4 Properties of the Formalisation
3.4.1 Suitability of bijective mapping functions for data formats
3.4.2 Sufficiency for lossless and lossy data formats
3.4.3 Sufficiency and necessity of descriptive capabilities
3.4.4 Using a PNG raster image as litmus test
3.5 Limits to Data Format Description
3.5.1 Overview
3.5.2 Computability and decidability of functions
3.5.3 Tractability of functions
3.5.4 One-to-one correspondence of sets
3.5.5 Summary
3.7 Summary

4 Describing Data Format Instances
4.1 Introduction
4.2 Definition of the Bitstream Segment Graph model
4.2.1 Defining codings and transformations
4.2.2 Defining bitstream segments
4.2.3 Defining a Bitstream Segment Graph
4.3 Representation of BSG instances
4.3.1 Visual representations
4.3.2 Digital representation
4.4 Construction and modification of BSG instances
4.4.1 Modifying BSG instances through operations
4.4.2 Measuring completeness of a description
4.4.3 Providing tool support with the Apeiron BSG Editor
4.5 Applications of the BSG model
4.5.1 Description of the PNG raster image litmus test
4.5.2 Describing exploits in IT Security
4.7 Summary

5 Describing Data Formats
5.1 Introduction
5.2 Definition of the BSG Reasoning approach
5.2.1 Making propositions
5.2.2 Using predicates
5.2.3 Defining rules
5.2.4 Matching rule conditions
5.2.5 Inference process
5.3 Representation of BSG Reasoning rulesets

5.4.2 Format-aware fuzzing of bitstreams
5.5 Discussion
5.6 Summary

6 Finale
6.1 Introduction
6.2 Retrospection
6.3 Conclusions
6.4 Outlook
6.4.1 Computer-aided reverse-engineering of data format rules
6.4.2 Use of reversible programming languages
6.4.3 Analysis of space-efficiency regarding existing data formats

A BSG Reasoning ruleset for PNG subset
B BSG Reasoning results of PNG ruleset

List of Figures

2.1 Overview of OAIS Functional Entities
2.2 Overview of OAIS concept relations
3.1 Abstraction of the information transport from a sender to a receiver
3.2 Mapping steps ordered by input and output cardinality
3.3 Bijective mapping between $m_\alpha$ and $m_d$
3.4 Bijective mapping between internal and external representations
3.5 Containment of grammar classes based on Chomsky
4.1 Representation of bitstream segments in the simple variant
4.2 Representation of bitstream segments in the extended variant
4.3 Representation of bitstream segments in the interactive variant
4.4 Graph grammar rule for initial_split/final_join
4.5 Graph grammar rule for split/join
4.6 Graph grammar rule for tie/untie
4.7 Graph grammar rule for declare_primitive/undeclare_primitive
4.8 Graph grammar rule for declare_fragment/undeclare_fragment
4.9 Graph grammar rule for compose/decompose
4.10 Graph grammar rule for expand/compress
4.11 Apeiron BSG Editor showing a BSG instance
4.12 Partial bitstream segment graph for file oi2n0g16.png
4.13 Structure of a crafted PNG image exploit shown in Apeiron (1/3)

2.1 Excerpt of an XCEL description for a PNG IDAT chunk
2.2 Excerpt of a sample data structure defined using DFDL
2.3 Definition of a data structure using the MPEG 1/2 methodology
2.4 Definition of an ISO File Type Box using MSDL / Flavor
2.5 Definition of a MPEG-2 Systems Program Stream (PS) using Flavor
2.6 XFlavor representation of a MPEG-2 Systems Program Stream
2.7 Definition of a BS Schema for JPEG2000 using BSDL
2.8 Definition of a BS Description for JPEG2000 using BSDL
2.9 Definition of a gBS Description for JPEG2000 using BSDL
2.10 Excerpt of a X.509 certificate definition given in ASN.1
2.11 An ASN.1 assignment for use with ECN
2.12 An ECN encoding object assignment
2.13 Excerpt of a 3GPP message definition in CSN.1
2.14 Comparison of supported elementary descriptive capabilities
4.1 Namespace declarations
4.2 Bitstream segment types
4.3 RDF classes for bitstream segments
4.4 RDF properties for bitstream segments
5.1 Example BSG rule
5.2 List of model-specific BSG rules
5.3 Excerpt of format-specific rules for a PNG subset (1/2)
5.4 Excerpt of format-specific rules for a PNG subset (2/2)
B.1 Results for PNG subset rules (1/5)

1 Introduction

1.1 Motivation

The concept of data formats is central to information storage and exchange. A data format defines how information is represented digitally as bits, bytes or characters, forming higher-level data structures. It therefore determines how the syntax and semantics of data are established in order to access the information represented by it, and to process it in a meaningful manner.

To pass digitally represented information between a sender and a receiver, an agreement is needed on the semantics of data to be transmitted, and on the data format to be used. The agreement is used by a sender to determine how to encode and serialise information to be sent into a sequence of bits, which is then passed on as a message. The actual composition of format-compliant data forming the message depends on the data format agreed for the exchange, and is thus only given implicitly. Therefore, the agreement is also used by a receiver of the message, where the bit sequence is parsed and decoded, enabling access to represented information for further processing. In this process, data format knowledge is applied to make the actual composition of format-compliant data explicit again.
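The sender and receiver roles described above can be illustrated with a minimal sketch. The message layout used here (a 16-bit identifier followed by a length-prefixed UTF-8 string) and the function names `serialise` and `parse` are assumptions made purely for illustration; they are not taken from the thesis.

```python
import struct
from typing import Tuple

# Hypothetical agreement between sender and receiver (illustrative only):
# big-endian unsigned 16-bit id, unsigned 8-bit length, then UTF-8 text.

def serialise(msg_id: int, text: str) -> bytes:
    """Sender side: encode the values and serialise them into a byte sequence."""
    payload = text.encode("utf-8")
    if len(payload) > 255:
        raise ValueError("text too long for an 8-bit length prefix")
    return struct.pack(">HB", msg_id, len(payload)) + payload

def parse(data: bytes) -> Tuple[int, str]:
    """Receiver side: parse the byte sequence and decode the values again."""
    msg_id, length = struct.unpack_from(">HB", data, 0)
    payload = data[3:3 + length]
    if len(payload) != length or len(data) != 3 + length:
        raise ValueError("data is not format-compliant")
    return msg_id, payload.decode("utf-8")

message = serialise(7, "hello")
```

Without the shared layout, the bytes in `message` carry no recoverable structure; applying the same agreement in `parse` is what makes the composition explicit again.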

It is therefore necessary that some representation of the data format is given and shared between these parties as part of the agreement, leading to the problem of describing data format knowledge.

1.1.1 Formal descriptions over natural-language descriptions

The way in which data formats are described and represented depends on the intended audience. Both natural-language descriptions for human engineers and formal descriptions for machine-processing exist:

Natural-language descriptions intended for human engineers are still dominant at the time of writing. Translating data format knowledge from natural-language descriptions to machine-executable implementations, made necessary by the sheer volume of format-compliant data as well as by the complexity of its representation, depends on qualified human labour.

Formal descriptions intended for machine-processing are currently present in [...] data formats in general depends on the availability of a formalised model that guarantees this property.

Accessing represented information in format-compliant data strongly depends on format-compliant implementations. Existing implementations are threatened by rapid technological change, which necessitates constant adaptation or replacement for them to remain usable. Yet, the problem with natural-language descriptions is that the application of data format knowledge to a problem depends on qualified human labour, which is limited by its cost and availability. Lacking format-compliant implementations, the illegibility of represented information effectively results in its loss.

Employing formal descriptions of data formats in a machine-processible manner is a desirable alternative for making data format knowledge machine-accessible and applicable without inherently depending on human labour. Yet, a suitable, formalised model still remains essential for the universal applicability of formal descriptions.

1.1.2 Declarative approaches over procedural approaches

The problem of formally describing data format knowledge can be approached in either a procedural or declarative manner:

In a procedural manner: Using languages similar to existing programming languages like Java or C/C++, procedural approaches define fixed, specific algorithms for processing format-compliant data, such as for parsing and decoding, or for encoding and serialising. In contrast to declarative approaches, the underlying rules and constraints of format-compliant data are given only indirectly, albeit being present in the defined algorithms.

In a declarative manner: Approaches using a declarative manner define the underlying rules and constraints that govern format-compliant data. In contrast to procedural approaches, no specific algorithm for processing format-compliant data is given.

For a specific purpose, procedural approaches make it straightforward to write algorithms to process format-compliant data. Yet, the problem with procedural approaches is that different purposes lead to different algorithmic implementations, even when the same underlying rules and constraints of the very same data format still apply.

Using a declarative approach, these underlying rules and constraints can be exposed for a wide range of applications, retaining the freedom of using the same format-specific knowledge for different purposes. Although more complex, the declarative approach has the benefit of increased flexibility.

For example, it may be desirable to access only a certain subset of information contained in large volumes of format-compliant data, thereby using resources more efficiently and speeding up the parsing process. Likewise, when part of [...] strongly depends on the availability of data format knowledge that can be adapted and applied in a flexible manner.
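The difference between the two manners can be made concrete with a small sketch. Here a single declarative field list drives three different purposes: full parsing, serialising, and selective access to one field. The field list `HEADER_FIELDS` and all function names are hypothetical and chosen only for illustration; the sketch is not the description language developed in this thesis.

```python
import struct

# Hypothetical declarative description: (field name, struct format code).
# It states which fields exist and in which order; it contains no algorithm.
HEADER_FIELDS = [("width", "I"), ("height", "I"), ("depth", "B")]

def _layout(fields):
    # Big-endian layout without padding, derived from the description.
    return ">" + "".join(code for _, code in fields)

def parse(fields, data):
    """Purpose 1: decode all fields into a dictionary."""
    values = struct.unpack(_layout(fields), data)
    return dict(zip((name for name, _ in fields), values))

def serialise(fields, record):
    """Purpose 2: encode a dictionary back into bytes."""
    return struct.pack(_layout(fields), *(record[name] for name, _ in fields))

def read_field(fields, data, wanted):
    """Purpose 3: access one field only, skipping the others by their size."""
    offset = 0
    for name, code in fields:
        if name == wanted:
            return struct.unpack_from(">" + code, data, offset)[0]
        offset += struct.calcsize(">" + code)
    raise KeyError(wanted)

record = {"width": 32, "height": 16, "depth": 8}
data = serialise(HEADER_FIELDS, record)
```

A procedural parser would have to be rewritten for each of these purposes, whereas here only the generic functions differ and the format-specific knowledge stays in one place.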

1.1.3 Research domains

The problem of describing data formats is of concern to data-centric domains of Computer Science, especially Digital Preservation, Multimedia and Telecommunication:

In Digital Preservation, the problem of preserving long-term access to digital information for future generations threatens digitally represented cultural heritage [RH05]. A major use case for related work in this domain is the migration of data between data formats to prevent technological obsolescence and the subsequent loss of information.

In Multimedia, two use cases related to data formats are the normative definition of data formats as well as the high-level adaptation of digital objects for Universal Media Access [VCE03]. Regarding the former use case, normative definition of data formats is required for the specification of new data formats, allowing the interoperability of systems working with multimedia data. Concerning the latter use case, Universal Media Access depends on the timely delivery of multimedia resources over heterogeneous networks to end-user devices with varying decoding and playback capabilities. High-level, on-the-fly adaptation of digital objects to the capabilities and limitations of network and end-user devices strongly depends on data format knowledge to achieve meaningful adaptations.

In Telecommunication, a primary use case is the normative definition of protocol data units (PDUs). Similar to Multimedia, there is the problem of describing communication protocols for interoperability of parties, which also covers the data format of PDUs to be transmitted.

Related work exists for formally describing data formats in a declarative and procedural manner, yet the underlying models are often based on domain-specific assumptions that do not hold in general. Examined related work does not provide a formalised model that is geared towards universal applicability, although universal applicability is sometimes claimed.

1.1.4 Data format instances and data formats

When considering the problem of formally describing data format knowledge, it is helpful to distinguish between a data format instance and a data format:

Data format instance: A data format instance has a bit sequence, where information is represented conforming to its data format. Describing a data format instance in a machine-processible manner provides a means to authoritatively express how specific information is represented, correcting misunderstandings present in applications and in the understanding of human engineers. For example, a Portable Network Graphics (PNG) raster image is a data format [...]

Data format: A data format has a potentially infinite number of data format instances, which conform to a common set of underlying rules and constraints. Again, describing a data format in a machine-processible manner provides means to authoritatively express how a specific type of information is represented. For example, the PNG file format is a data format which defines rules on where the image header is located and from which fields it is composed, thus defining the placement of fields containing the width of arbitrary PNG images.

Data format instances and data formats are closely linked, since underlying rules and constraints of a data format manifest themselves in its data format instances. In order to address and consider both levels, it is desirable to have suitable, matching models for describing data format instances as well as data formats.
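The PNG example above can be illustrated with a short sketch that applies the format-level rule (the 8-byte signature is followed by an IHDR chunk whose first data field is the width, stored as a 4-byte big-endian integer) to one concrete byte sequence. The helper `minimal_png_header` is hypothetical and only builds the leading bytes needed for the demonstration, not a complete image; the procedural style is used for brevity and is not the approach developed in this thesis.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_width(data: bytes) -> int:
    """Return the image width stored in the IHDR chunk of a PNG byte sequence."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    # Chunk layout: 4-byte length, 4-byte type, data, 4-byte CRC.
    length, chunk_type = struct.unpack_from(">I4s", data, 8)
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR chunk")
    width, _height = struct.unpack_from(">II", data, 16)
    return width

def minimal_png_header(width: int, height: int) -> bytes:
    """Hypothetical helper: signature plus IHDR chunk only (not a full image)."""
    body = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    crc = zlib.crc32(b"IHDR" + body) & 0xFFFFFFFF
    return (PNG_SIGNATURE + struct.pack(">I", len(body)) + b"IHDR"
            + body + struct.pack(">I", crc))

header = minimal_png_header(640, 480)
```

The rule lives on the level of the data format, while `header` is a (partial) data format instance to which the rule is applied.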

1.2 Research Problems

This thesis addresses the formal description of arbitrary data formats in a declarative manner, based on the formal description of arbitrary data format instances. Describing a data format as a class through its data format instances raises the following two research problems for this thesis:

Describing data format instances: How to describe the composition of format-compliant data, considering the syntax and semantics of its bit sequence or segments thereof?

Describing data formats: How to describe a data format with a potentially infinite set of data format instances through its underlying rules and constraints?

Since a data format serves the purpose of representing information for its storage and transmission over time, this thesis assumes that the representation of information is lossless, that is, represented information must actually be present in its representation. This thesis therefore also assumes that for every data format instance, there exists a bijective mapping between a format-compliant bit sequence and the information it represents.
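The assumption can be stated compactly as follows. The symbols $I$, $B$ and $f$ are introduced here only for illustration and are not necessarily the notation used in the analysis of this thesis.

```latex
% I : set of information items that can be represented (assumed symbol)
% B \subseteq \{0,1\}^* : set of format-compliant bit sequences (assumed symbol)
f : I \to B \ \text{bijective}, \qquad
f^{-1}(f(i)) = i \quad \forall i \in I, \qquad
f(f^{-1}(b)) = b \quad \forall b \in B
```

Read this way, writing corresponds to applying $f$ and reading to applying $f^{-1}$; losslessness means that neither direction discards information.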

1.3 Contributions and Outline

The four contributions of this thesis are a state-of-the-art survey on data formats and their description, an analysis on data format description, the Bitstream Segment Graph (BSG) model for describing data format instances, and the BSG Reasoning approach for describing data format classes:

Survey on current state-of-the-art in data formats and their description: The survey in Chapter 2 covers definitions and provides a basic systematisation of related work in terms of their descriptive capabilities, focusing on the data-centric research domains of Digital Preservation, Multimedia and Telecommunication.

Analysis of data format description: The analysis provided in Chapter 3 presents an abstract model for describing data formats and data format instances in order to address the inherent properties and limitations of data format description. While the reversibility of a bijective mapping does not restrict its computational complexity, the analysis states that describing arbitrary data formats comes at the cost of losing guaranteed termination of parsing and decoding processes in case of erroneous data format rules. The analysis presents the concept of elementary descriptive capabilities, which align with the descriptive capabilities previously used for comparing related approaches and identifying shortcomings.

Bitstream Segment Graph (BSG) model: Drawing on the set of elementary descriptive capabilities established in the analysis, the graph-based model presented in Chapter 4 is used to describe data format instances through a bijective mapping for a format-compliant bit sequence. The chapter includes methods for the construction of a BSG instance and its evaluation, e.g. for the extraction of contained information. The chapter also presents the Apeiron BSG Editor tool for the manual annotation of data and the definition of a RDF-based representation for BSG instances. This contribution has been published at the International Conference on Software and Data Technologies (ICSOFT) 2008 [HMT+08], at the International Multi-Conference on Computing in the Global Information Technology (ICCGI) 2008 [HSB+08] and extended in an article in the International Journal on Advances in IT Security (IJAS) [HFS+09], with the latter two focusing on applications in the context of IT Security.

BSG Reasoning approach: The approach presented in Chapter 5 builds on the BSG model and describes a data format with a possibly infinite set of data format instances using rules for the computation of a least fixed point similar to Datalog [CGT89], thereby inferring a BSG instance from a format-compliant bit sequence. The chapter also presents a syntax for data format rules and a BSG-based reasoning engine. The contribution has been evaluated on a subset of the Portable Network Graphics (PNG) image format which exercises all elementary descriptive capabilities identified in the previous analysis. This contribution has been published at the International Conference on Digital Preservation (iPRES) 2008 [HBSM08] and extended in the IJAS article [HFS+09].

1.4 Acknowledgements

As Malcolm Gladwell so aptly put it, behind every story of success, there is also a story of a fertile ground provided, of chances offered, of opportunities seized and of hard work being done even in the face of setbacks and constant frustrations.

Although writing a PhD thesis is a personal endeavour, it cannot be considered and understood as an isolated task. Looking back, I had a lot of helping hands when I needed them. In this spirit, I want to thank:

my PhD advisor Prof. Dr. Max Mühlhäuser for taking the risk, accepting me as his PhD student, and giving me the freedom to pursue my lines of research in his [...]

my parents Astrid and Joachim and my sister Svea for keeping me grounded, for encouraging me all along the way, for supporting me in countless ways, and for listening to quite similar stories over and over again (finally, it is done),

all of my friends for being there, and for both distracting me and pushing me forward as it was necessary (especially, but neither limited to nor in a special order, Sandra and Hans, Tobias and Nina, Andreas and Verena, Marek and Ste, Ariane, Maren, Kerstin, Annette and Claudia),

my TK research team head Dr. Guido Rößling for a sound education in scientific procedures, and for his endless patience to both read and comment on countless drafts,

my colleagues at the Telecooperation (TK) research group (especially Gina Häußge, Kai Höver, Henning Bär, and Gundolf von Bachhaus from the DLH group, as well as Sebastian Ries and others),

my previous e-learning center (elc) team head Dr. Susanne Offenbartl for her support and her refreshingly pragmatic viewpoints on the better side of morale-building,

my colleagues at the e-learning center (especially Irina Reuter, Anika Hartmann, Evelyn Ehrich, Klaus Steitz, Anne Bieberstein, Iris Wegmann, Thorsten Maas, Leif Pullich, Jens Japes and Julia Sonnberger) and at the Hochschulrechenzentrum (HRZ) (especially Dr. Jürgen Ohrnberger),

my colleagues at the Rechnerbetriebsgruppe (RBG) group (namely Sven Schenkel, Claudia Kleber, Qun Herz, Manfred Schnitt, Gerhard Fladerer and Uwe Langendorff),

my colleagues at the research training group GKEL (especially Doreen Böhnstedt),

Prof. Dr.-Ing. Ralf Steinmetz and his KOM research group,

all of my student helpers with whom I worked in close collaboration (especially Andreas Doms, Tim Klein, Hristo Lulev, Amir Neziri, Martin Pinto-Bazurco Mendieta, Viet Hung Tu and Thomas Wilker), as well as

all of my students whose bachelor, diploma and master theses I supervised, and

whocontributedtomyPhDresearchandmyprojects(namelyinalphabeticorder

JanAdler,BenediktAntoni,PaulBaumann,ArseneBotchak,AlexBulach,Julian

Dean, Jens Hatlak, Clayton Hoss, Tim Klein (again), Benno Kröger, F

riedrich-DanielMöller,BerndPapachrissanthou,DanielSchumann,MarcusStänder,Björn

State of the Art

2.1 Introduction

The previous Chapter 1 presented the research problem of formally expressing both the composition of format-compliant data for data format instances and the underlying rules for a data format in a machine-processible, declarative manner. To assess the current state of the art in this regard, this chapter contributes a survey of existing models for expressing such data format knowledge. The survey focuses on the following aspects:

Definitions and models for data format knowledge: Related work directly concerned with data format knowledge regarding the composition of data gives definitions and models for its expression, either implicitly or explicitly. Existing definitions and models are based on their specific concepts and constraints, leading to inherent properties and limitations on their expressiveness.

References to existing related work: Related work only indirectly concerned with the composition of data, but still in need of expressing such data format knowledge, provides references to existing definitions and models as well as insights into their adoption.

When related work provides a model for describing data formats or data format instances not on the level of meta-information, but focusing on data format rules and the structure of data format instances, the survey provides a classification:

Classification: It is considered whether such an approach is declarative or procedural, whether it is machine-processible and whether it has a formalised model for universal applicability:

Declarative or procedural approach: In order to classify existing approaches for this thesis, it is considered whether they describe data format knowledge in a declarative or procedural manner.

Machine-processible approach: Some approaches in related work introduce descriptive means intended for human engineers and do not focus on being machine-processible. Machine-processible approaches themselves require a minimum degree of formalisation which may be given only indirectly through

Formalised model for universal applicability: Some approaches introduce a number of concepts to provide specific descriptive capabilities, yet do not address the completeness or orthogonality of their concepts for describing arbitrary data formats. The existence of a formalised model supports potential claims for universal applicability.

Last but not least, if a suitable model is presented for an approach, the survey examines its descriptive capabilities:

Descriptive capabilities: In order to compare different models, their expressivity is considered regarding the handling of primitive data, structured data, transcoded data or fragmented data:

Primitive data is a single piece of information, such as a floating-point number, a character string or a three-bit unsigned integer stored in least-significant bit first order. It is represented in an encoded form, which has to be decoded in order to access it.

Structured data is a continuous sequence of bit sequences, each with a separate, distinct meaning in its context. It has to be segmented to access the separate constituents.

Transcoded data is a bit sequence which is the result of a transformation of an original bit sequence, such as compression, encryption or some similar block transformation. It has to be transformed in order to access the original bit sequence.

Fragmented data is a bit sequence which is only a fragment of a larger, original bit sequence. In order to access the original bit sequence in its entirety, fragmented data has to be concatenated in the right order.

Handling these kinds of data properly requires the matching descriptive capabilities: decoding primitive data, segmenting structured data, transforming transcoded data and concatenating fragmented data. Based on these descriptive capabilities, the survey compares related work, allowing statements to be made regarding their suitability for describing arbitrary data formats in general.
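The four capabilities can be illustrated with a minimal Python sketch. The toy record layout (a 2-byte big-endian length field followed by a body), the use of zlib as the transformation, and all function names are assumptions made for illustration only; they are not taken from any of the surveyed approaches.

```python
import struct
import zlib


def decode_primitive(data: bytes) -> int:
    """Decode primitive data: here a 16-bit unsigned big-endian integer."""
    (value,) = struct.unpack(">H", data)
    return value


def segment_structured(data: bytes) -> tuple:
    """Segment structured data into a 2-byte length field and the body."""
    return data[:2], data[2:]


def transform_transcoded(data: bytes) -> bytes:
    """Undo a transformation; zlib stands in for any transcoding."""
    return zlib.decompress(data)


def concatenate_fragments(fragments: list) -> bytes:
    """Restore the original sequence from fragments given in the right order."""
    return b"".join(fragments)


# Build a toy instance: the payload is compressed, split into two fragments,
# and each fragment is prefixed with its own length field.
payload = zlib.compress(b"hello, data formats")
half = len(payload) // 2
records = [struct.pack(">H", len(part)) + part
           for part in (payload[:half], payload[half:])]

# Read it back by applying the four capabilities in turn.
fragments = []
for record in records:
    length_field, body = segment_structured(record)
    assert decode_primitive(length_field) == len(body)
    fragments.append(body)
restored = transform_transcoded(concatenate_fragments(fragments))
```

Note that the sketch is procedural: it hard-codes the order of operations, which is exactly the kind of knowledge a declarative description would have to state explicitly.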

Since processing data is a central, recurring aspect of Computer Science, related work on data formats can be sought and found in a number of research domains. This chapter focuses on contributions from a data-centric subset of research domains in separate sections, namely Digital Preservation in Section 2.2, Multimedia in Section

2.2 Digital Preservation

Overview

Digital Preservation is concerned with the long-term preservation of digital information. Data formats play a crucial role, since digital information is stored as data in a specific data format, while format-specific hardware and software provides access to the contained information. Over time, both hardware and software tend to become obsolete due to technological advances. As obsolete hardware may fail in the future when there is no replacement available, and as obsolete software may not be available anymore or fail to interoperate with newer hardware and software, technological obsolescence is a constant threat to Digital Preservation efforts. The current rapid pace of technological change amplifies this threat.

In the literature, three preservation strategies are described for protecting digital information against information loss through technological obsolescence on the logical level: the migration of data, the emulation of hardware / software, and digital archaeology:

Migration of data: Information stored as data in a specific data format is migrated to a suitable target data format. Due to a typical mismatch of different data formats, some information often cannot be retained during a migration and is usually lost. For a migration of data, it usually must be decided which information to retain, for which the migration process must be monitored [Arm00].

Emulation of hardware / software: Hardware and/or software representing the original technological environment or parts thereof are replaced by an emulation. Through using the emulation, original hardware and/or software components remain capable of providing access to contained information [Rot99, Arm00]. A variation thereof is the virtualisation of software, where software for a specific technological environment is replaced with software that targets a virtual machine as an intermediate platform, which is available for the original technological environment. In case of technological obsolescence of the original environment, it is only necessary to port the virtual machine implementation to a new platform rather than porting each and every specific software. The Universal Virtual Computer serves as an example of such an approach [Lor01].

Digital archaeology: Software, data and related documentation are analysed in order to reverse-engineer both syntax and semantics of data. Its description then serves as a basis for implementing new hardware or writing new software, which again provides access to contained information [RG99, Arm00].

The former two strategies depend on preparatory actions taken prior to the event of technological obsolescence. Without suitable preparation in advance, digital archaeology is the only remaining option, although usually costly [Wet98]. In contrast to the emulation of hardware / software, both migration of data and digital archaeology are concerned with data format knowledge on the composition of data.

Outline

Maintaining format-related meta-information is addressed by data format registries such as the US Library of Congress (LoC) Digital Preservation project, the Global Digital Format Registry (GDFR), PRONOM, and the recent merger of the latter two, the Unified Digital Formats Registry (UDFR).

Addressing issues related to the representation of information according to defined data formats, the Open Archival Information System (OAIS) Reference Model serves as a de-facto standard model regarding the long-term archival and preservation of data.

For the migration of data, the selection and execution of suitable paths for conversion is a non-trivial problem. The Typed Object Model (TOM) is an approach for mediating data between different data formats in a distributed system, automating the process of migrating data between different formats.

For measuring the quality of data migration in terms of retained information, the eXtensible Characterisation Language (XCL) project contributes both the eXtensible Characterisation Extraction Language (XCEL) for extracting information from data as a property, and the eXtensible Characterisation Definition Language (XCDL) for the description of such properties for later comparison.

Related to aspects of Digital Preservation, but not explicitly framing itself into this domain, the Data Format Description Language (DFDL) is an approach for describing the composition of data according to a data format which focuses on leveraging existing XML technologies.

2.2.1 Data Format Registries

The preservation strategies previously presented in Section 2.2 depend on maintaining detailed meta-information on data formats and format-compliant applications in the long term. Therefore, there is a need for data format registries as custodians of meta-information on data formats.

Overview

Prominent data format registries are the US Library of Congress Digital Preservation Project (LoCDP) [AF05], the Global Digital Format Registry (GDFR) [AS03], and the PRONOM [Bro05] data format registry. At the time of writing, both GDFR and PRONOM are in the process of merging into the Unified Digital Formats Registry (UDFR) [UDF09].

Rich models for categorising data formats and for managing related meta-information in high detail are common to LoCDP, GDFR [Abr07b, Abr07a, AG08] and PRONOM [Bro05]. The model of GDFR extends to complex relationships between different data formats, such as extensions or versions, on a highly formal level. In varying forms, these registries provide means for referencing a specific data format, such as

Discussion

Data format registries allow their users to identify and associate data formats with metadata such as references to their specification, format-compliant applications or relations between formats required for the migration of data or digital archaeology in case of technological obsolescence. Examined related work provides definitions and models concerning format-related meta-information as well as references to other approaches.

Definitions and models: Besides their rich models on format-related meta-data, all registries at least consider natural-language descriptions such as textual specifications suitable for human consumption as a baseline. In the case of PRONOM, format-related knowledge suitable for machine-processing is provided on the level of file signatures, allowing for the identification of a file's data format through automated tools such as Digital Record Object Identification (DROID).

References: In terms of approaches for describing the composition of data, GDFR references a number of format description languages. It explicitly refers to the eXtensible Characterisation Extraction Language (XCEL), the Bitstream Syntax Description Language (BSDL) and the Data Format Description Language (DFDL) among other approaches, including formal grammar notations and XML schema languages.
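The idea of identifying a data format through file signatures, as mentioned above for PRONOM, can be sketched in a few lines of Python. This is only an illustration of the general principle: the table `SIGNATURES` and the function `identify_format` are hypothetical names, and the sketch reflects neither PRONOM's actual signature model nor DROID's matching algorithm. The byte patterns are the commonly documented leading bytes of the respective formats.

```python
from typing import Optional

# Hypothetical signature table (assumption for illustration): format name
# mapped to the byte sequence expected at the very start of a file.
SIGNATURES = {
    "PNG": b"\x89PNG\r\n\x1a\n",
    "PDF": b"%PDF-",
    "GIF": b"GIF8",
}


def identify_format(data: bytes) -> Optional[str]:
    """Return the first format whose signature matches the start of data."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None  # unknown format


png_like = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
detected = identify_format(png_like)
```

Such a signature only identifies a format; it says nothing about the composition of the remaining data, which is why it counts as format-related meta-information rather than as a format description.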

2.2.2 Open Archival Information System Reference Model

Operating an archive for long-term archival and preservation of data poses a number of problems, including the management of how information is represented. Within short-term transactions, producers and consumers can negotiate the representation of information to be exchanged. Yet, for long-term archival and preservation, producers cannot foresee future representations of information, and thus need to delegate the negotiation with future consumers to suitable archival systems, which manage a potentially required migration of data.

Overview

The Open Archival Information System (OAIS) provides a reference model for long-term archival and preservation systems, and serves as the de-facto standard regarding long-term archival and preservation of digital information [CCS02]. Among other aspects, OAIS addresses the issue of managing the representation of information.

The OAIS reference model defines a number of processes surrounding information stored in an OAIS archive between a Producer and its Consumers, as well as the archive Management. Besides the processes of Preservation Planning and Administration that coordinate and manage archival operations between involved parties, actual processes related to archival and preservation are Ingest, Archival Storage, Data Management and Access, shown in Figure 2.1:

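[Figure: Producer, Consumer and Management surrounding the functional entities Ingest, Archival Storage, Data Management, Administration, Preservation Planning and Access, with SIP, AIP and DIP as the information packages involved.]

Figure 2.1: Overview of OAIS Functional Entities, based on [CCS02].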

Archival Information Package (AIP) conforming to archive policies, such as using only publicly disclosed data formats. It then generates descriptive information and coordinates potential updates with Data Management.

Archival Storage: The Archival Storage process is responsible for receiving AIPs to be put into storage, management of the storage, error checking and replacement of storage media, their backup for disaster recovery, and finally for providing an AIP upon request.

Data Management: The Data Management process is responsible for administering, updating and querying archival databases as well as for general reporting for the archive.

Access: Upon request by a Consumer, the Access process generates a suitable

Dissemination Information Package (DIP) for the requested AIP and delivers it

to the Consumer.

As both Archival Storage and Data Management are not central to the role of data formats in OAIS, these processes are mentioned for completeness and are not further explored.

As can be seen from the Ingest and Access process descriptions, the concepts of SIP, AIP and DIP are of importance to the OAIS reference model. These are specialisations of the Information Package, distinguished by their role in the respective process. As shown in Figure 2.2, an Information Package contains the Content Information as the actual content to be preserved, as well as related Preservation Description Information. The Content Information consists of a Data Object and its Representation Information, which can be used to obtain an Information Object:

Data Object: A Data Object specialises into either a Physical Object as a

[Figure: an Information Package containing Content Information (Data Object and Representation Information) and Preservation Description Information, with Submission, Archival and Dissemination Information Package as its specialisations.]

Figure 2.2: Overview of OAIS concept relations, based on [CCS02].

Representation Information: Representation Information maps the Data Object to an Information Object as a more meaningful concept. Representation Information can be specialised into Structure Information which defines the mapping of bit sequences into data types, and into Semantic Information which defines the meaning of data. Representation Information itself may again be represented as a Data Object, and thus depends on other Representation Information, forming a Representation Network required for fully describing the meaning of an original Data Object.

Information Object: An Information Object is obtained by interpreting a Data Object according to Representation Information.

Representation Information can be considered as a form of data format knowledge, which may be present in various forms such as textual descriptions in natural language, formal grammars, or some derivative work thereof, such as software implementations or even their source code.
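The recursive nature of a Representation Network can be sketched as follows. The class `DataObject`, the function `representation_network` and the example objects are hypothetical and serve only to illustrate the dependency described above; for brevity, the sketch assumes that each Data Object depends on at most one piece of Representation Information and that the network contains no cycles, which simplifies the general case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataObject:
    """A Data Object together with the Representation Information it needs."""
    name: str
    # The Representation Information is itself held as a Data Object;
    # None means no further description is recorded (an assumption here).
    representation: Optional["DataObject"] = None


def representation_network(obj: DataObject) -> list:
    """Follow Representation Information links until none remain."""
    chain = []
    current = obj.representation
    while current is not None:
        chain.append(current.name)
        current = current.representation
    return chain


# Hypothetical example: a report stored as XML, described by a schema,
# which in turn requires knowledge of XML and of the character encoding.
ascii_spec = DataObject("ASCII specification")
xml_spec = DataObject("XML specification", representation=ascii_spec)
schema = DataObject("document schema", representation=xml_spec)
report = DataObject("report.xml", representation=schema)

network = representation_network(report)
```

The sketch shows why an archive cannot treat Representation Information as a flat attribute: interpreting one object may require preserving a whole chain of further descriptions.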

Discussion

The OAIS reference model is intended for a long-term preservation archive, which also has to support the migration of data between different formats. The reference model thus depends on the availability and applicability of data format knowledge. It therefore refers to such knowledge in various forms as Representation Information, which is contained in Content Information in submission, archival and dissemination IPs.

ex- a definition of how information is represented. Furthermore, the OAIS reference model itself does not mandate a specific form for data format knowledge as Representation Information. It allows for a variety of ways to represent data format knowledge. The model explicitly mentions the option of formal description languages containing well-defined constructs with which to describe data structures, referring to formalised approaches in general. Although the OAIS effectively is a de-facto standard in its domain, it does not provide more specific references or make a statement regarding the expressivity of existing approaches. For actually migrating data, an OAIS archive relies on software as a form of Representation Information to perform the migration.

2.2.3 Typed Object Model

Processing format-compliant data requires suitable, format-compliant applications. Although an application may be conceptually capable of processing a certain type of data such as video/audio recordings or text documents, it strongly depends on the specific digital representation to be parsed, decoded and processed. The diversity of different data formats thus can force users to mediate data between multiple data formats as necessary, therefore making it desirable to automate this process.

Overview

The Typed Object Model (TOM) automates the process of migrating data between different formats in a distributed manner and was published in the PhD thesis of John Ockerbloom [Ock98]. TOM defines both a distributed system and a data model:

Distributed system: For mediating data between different data formats in a decentralised, scalable manner, TOM describes a distributed system consisting of agents that handle the processing and conversion of data in heterogeneous data formats, operating on a distributed type graph. TOM defines the Typed Object Protocol (TOP) for communicating in such a distributed mediating system. The actual mediation is executed by type brokers as specialised agents, which offer their services and perform the actual processing on behalf of clients. Other agents such as clients can query for type information, get attributes and call methods on objects or request conversions from a type broker.

Data model: In the TOM data model, information is represented as an object, which is immutable and has both a value and a type. A value is not restricted to a digital representation such as finite byte sequences alone, but may also include abstract forms of representation.

A type defines how the object and its value are to be interpreted. Here, an important type is the ByteSeqType. Objects of this type have finite byte sequences as values and thus can be stored and transmitted in a digital form. A type may define one or more attributes as well as one or more methods for its objects. An attribute extracts information from an object through a function without

information or use additional parameters. TOM offers subtyping, so a type may have more specialised subtypes. It also enables substitutability, where given a type T, a subtype S of T and two objects t and s of types T and S, s can substitute for t, which allows objects to be considered at different levels of abstraction. The aggregated typing information defines a type graph on which TOM operates.

For obtaining different representations of the same information in TOM, an encoding describes a relationship between a pair of objects, the original encoded object and the resulting encoding object in a different representation. In order to cope with multiple, semantically equivalent encoding objects for a given encoded object, encodings in TOM are considered as one-to-many relations. The inverse of an encoding is a decoding, which is a many-to-one function. TOM considers encoding as the refinement of abstract objects. Likewise, it considers decoding as the abstraction of concrete objects.

Based on these concepts, TOM defines a format as a sequence of encodings to be applied on objects of a given type, which yields objects of the type ByteSeqType. A format makes it possible to define a shipped object, which is an object including its format and thus allows a receiver to decode the object to the type indicated by the format.

For mediating data between different data formats in TOM, a conversion is a migration of data, which takes a shipped object as input and produces a shipped object as output. Often, a conversion between different formats cannot preserve all present information but only a subset. Therefore, a conversion tries to approximate the input shipped object. To manage the loss of information of such a conversion, TOM defines the concept of intersubstitutability, which is given for a conversion c and a type T if every input and output of the conversion c cannot be distinguished with regards to the attributes and methods of type T. The degree of information preservation in a conversion increases with every level down in the type hierarchy of T due to the specialisation of subtypes. Both substitutability and intersubstitutability aid in the automated composition and conversion of data between different formats in TOM.
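The notions of a format as a sequence of encodings and of a shipped object can be made concrete with a small Python sketch. The classes `Encoding` and `ShippedObject` and the functions `ship` and `receive` are hypothetical names chosen for illustration and are not part of TOM or TOP. The sketch also simplifies the model: each encoding is a single function here, whereas TOM treats encodings as one-to-many relations.

```python
import zlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Encoding:
    """encode refines an object; decode abstracts it again."""
    name: str
    encode: Callable
    decode: Callable


@dataclass(frozen=True)
class ShippedObject:
    """A byte sequence together with the format that produced it."""
    data: bytes
    fmt: tuple  # the sequence of Encoding steps that was applied


def ship(value, fmt):
    """Apply the encodings of a format in order, yielding a byte sequence."""
    for enc in fmt:
        value = enc.encode(value)
    return ShippedObject(value, tuple(fmt))


def receive(shipped):
    """Undo the encodings in reverse order to recover the original object."""
    value = shipped.data
    for enc in reversed(shipped.fmt):
        value = enc.decode(value)
    return value


utf8 = Encoding("utf-8", lambda s: s.encode("utf-8"), lambda b: b.decode("utf-8"))
deflate = Encoding("deflate", zlib.compress, zlib.decompress)

shipped = ship("Typed Object Model", [utf8, deflate])
recovered = receive(shipped)
```

Because the shipped object carries its format, the receiver needs no out-of-band agreement about which decodings to apply, which is the property the text attributes to shipped objects.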

The TOM approach has been implemented through the Format REgistry Demonstration (FRED), which also served as a prototype for the Global Digital Format Registry (GDFR) [Ock06].

Discussion

TOM uses format-related knowledge to assist the migration of data between different data formats. Contrary to other approaches, its main characteristic is the distributed setup of format-related knowledge among type brokers which provide migration services to other agents.

Definitions and models: In TOM, information is considered in type-specific representations. The data format of such a type-specific information is defined as a sequence of encodings that converts information from its type-specific

Rather than describing the actual coding and structure of format-compliant data itself, it addresses the migration of data by describing encoding / decoding relationships between types. Through its use of interfaces, TOM considers digital objects on varying levels of abstraction, and is therefore able to indirectly manage the loss of information during a migration of data. External to TOM and its model, the actual process of accessing and conversion is performed by software tools which follow the underlying rules and constraints of a data format that shapes format-compliant data.

Rather than describing the composition of data format instances or the underlying rules and constraints of a data format itself, TOM's model exposes the type network composed from software operated in a distributed system. The composability as a property of TOM refers to the overall type network, and not to the composition of data.

2.2.4 eXtensible Characterisation Language

Technological obsolescence of file formats threatens long-term accessibility of contained information. Although data can be migrated in advance in order to prevent technological obsolescence, migration typically leads to information loss due to mismatching representational capabilities of different data formats. One approach to handle such information loss is to identify significant information and to monitor its retention after a migration in order to judge whether the migration was successful.

Overview

For estimating the success of data migration, the eXtensible Characterisation Language (XCL) project defines the eXtensible Characterisation Extraction Language (XCEL) as well as the eXtensible Characterisation Definition Language (XCDL) for extracting and comparing significant information represented in different data formats:

eXtensible Characterisation Extraction Language (XCEL): XCEL intends to describe characteristics of format-compliant data through significant information, which is represented as a property with a name and a value [SHC08]. Towards that goal, XCEL describes the composition of data through an XML-based, schema-like definition, where declarative definitions of data types are mixed with procedural processing instructions.

The basic building blocks of XCEL are XCEL elements which are used to build an XCEL Tree as a representation which matches with actual data. A symbol element defines both the encoding and semantics of a byte sequence, may define constraints for matching and is a leaf of such a tree. A symbol has information on the placement of its data and employs an absolute addressing scheme, using the number of bytes consumed so far. An item element defines a logical, structural or semantic group for one or more child elements, describing either a sequence of elements, their permutation or a choice of alternative elements. Last but not least, a

copying another XCEL element to the current position (pullXCEL) or for reconfiguring the parser during its operation (configureParser). Moreover, processing elements also allow the addition of a filter to a filter chain of an element, allowing for the translation of data into another representation. Elements and their contents can be referenced through identifiers and may originate from a separate file (externalSource) or from another XCEL element (internalSource).

These XCEL elements are used in the schema-like XCEL document, which consists of the four parts preProcessing, formatDescription, postProcessing and templates. It is used by an XCEL processor, which processes both the XCEL document and an input file, and produces an XCEL Tree as Result Tree. In a following step, an extractor extracts significant information from the Result Tree which characterises the format-compliant data contained in the input file, and stores it as an XCDL document.

Example 2.2.1: An example of XCEL is given in Table 2.1 for the PNG IDAT chunk data structure, which carries compressed, transformed, and in some cases even fragmented, data representing the actual image. The data structure starts with the chunkDataLength symbol, which is a 32-bit unsigned integer, followed by the pngIDATIdentifier, which carries the four-byte ASCII string IDAT to distinguish it from other chunk data structures. The next processing statement sets the length of the yet-to-come normDataSymbol identifier to the value of chunkDataLength, so the normDataSymbol has a defined length. Last but not least, the remaining crc symbol carries a four-byte CRC value.

eXtensible Characterisation Definition Language (XCDL): XCDL describes data as a collection of significant information that has been extracted previously from a Result Tree [BHST08].

During the migration of data, some original input file is migrated to a new data format, producing a migrated file. After significant information has been extracted from both the original and the migrated file as XCDL documents, the degree to which the migration has been successful is measured through a comparison of the retained significant information. This is performed by a comparator, which processes both XCDL documents and compares their significant information through some domain-specific metric for judging the success of a migration.
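For comparison with the declarative XCEL description in Example 2.2.1, the same chunk layout can be read procedurally. The following Python sketch is an illustration, not part of XCEL: the function names are hypothetical, and the big-endian byte order as well as the CRC covering chunk type and data are taken from the PNG specification rather than from the text above. The comments map each step to the XCEL symbols of the example.

```python
import struct
import zlib


def parse_chunk(buf: bytes, offset: int = 0):
    """Parse one PNG chunk at offset; return (type, data, next_offset)."""
    (length,) = struct.unpack_from(">I", buf, offset)    # chunkDataLength
    chunk_type = buf[offset + 4:offset + 8]               # pngIDATIdentifier
    data = buf[offset + 8:offset + 8 + length]            # normData
    (crc,) = struct.unpack_from(">I", buf, offset + 8 + length)  # crc
    if zlib.crc32(chunk_type + data) != crc:
        raise ValueError("CRC mismatch")
    return chunk_type, data, offset + 12 + length


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Helper for this demonstration only: serialise a chunk."""
    return (struct.pack(">I", len(data)) + chunk_type + data
            + struct.pack(">I", zlib.crc32(chunk_type + data)))


# Two IDAT chunks carrying fragments of a single compressed stream.
original = b"\x00" + bytes(range(16))
stream = zlib.compress(original)
buf = make_chunk(b"IDAT", stream[:5]) + make_chunk(b"IDAT", stream[5:])

offset, fragments = 0, []
while offset < len(buf):
    chunk_type, data, offset = parse_chunk(buf, offset)
    if chunk_type == b"IDAT":
        fragments.append(data)

# Concatenate the fragments first, then undo the compression.
image_data = zlib.decompress(b"".join(fragments))
```

The sketch touches all four kinds of data discussed in Section 2.1: primitive length and CRC fields, the structured chunk, the fragmented stream spread over several chunks, and the transcoded (compressed) payload.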

Besides XCEL and XCDL, the XCL project also tries to address aspects of semantic mismatch between different XCDL documents through the definition of an ontology. Moreover, it tries to extend the comparison of significant information beyond the semantics of data to whether the actual rendering of data to human observers still carries the same significant information through the use of an information model.
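The role of the comparator described above can be illustrated with a deliberately simple metric. The function `compare_properties`, the property names and the metric itself (the fraction of original properties retained with identical values) are assumptions made for this sketch; the XCL project's actual comparator and its domain-specific metrics are not reproduced here.

```python
def compare_properties(original: dict, migrated: dict) -> float:
    """Return the fraction of original properties retained with equal values."""
    if not original:
        return 1.0  # nothing to lose
    retained = sum(1 for name, value in original.items()
                   if name in migrated and migrated[name] == value)
    return retained / len(original)


# Hypothetical property sets, as they might be extracted from an original
# image file and its migrated counterpart.
original = {"imageWidth": 640, "imageHeight": 480, "bitDepth": 16}
migrated = {"imageWidth": 640, "imageHeight": 480, "bitDepth": 8}

score = compare_properties(original, migrated)
```

In this example two of three properties survive the migration unchanged, so whether the migration counts as successful depends on whether the bit depth is regarded as significant.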

Discussion

The use case of XCEL and XCDL is the evaluation to which degree a migration of

<item identifier="pngIDAT" xsi:type="structuringItem"
      multiple="true">
  <symbol identifier="chunkDataLength" interpretation="uint32"
          length="4"/>
  <symbol identifier="pngIDATIdentifier" interpretation="ASCII"
          optional="false" value="IDAT"/>
  <processing type="pushXCEL" xcelRef="normDataSymbol">
    <processingMethod name="setLength">
      <param valueRef="chunkDataLength"/>
    </processingMethod>
  </processing>
  <symbol identifier="normDataSymbol" interpretation="uint8"
          name="normData"/>
  <symbol identifier="crc" length="4"/>
</item>

Table 2.1: Excerpt of an XCEL description for a PNG IDAT chunk data structure, carrying transformed and compressed image data, taken from [SHC08].

Definitions and models: The XCL project gives no explicit formal definition of what a data format actually is. Through the definition of XCEL, a quite complex model is given for describing a data format as a tree-based hierarchical representation of information. XCEL has a number of interesting properties, such as the support of filters for handling the translation between different representations of information, of partial descriptions which cover only part of an input file, and of dependencies such as the placement of elements in the original data to be evaluated dynamically at runtime, which is of interest for address-based references in data formats such as the ISO Base Media File Format. Regarding XCDL, it is interesting to note that [BHST08] states that it is not intended as a language for multi-purpose representation of information.

Classification: The XCEL approach is both declarative and machine-processible, but the examined publications provide no formalised model for universal applicability.

Descriptive capabilities: XCEL clearly provides support for segmenting structured data through items and for decoding primitive data through symbols, although the length of primitive data is limited to multiples of entire octets rather than having bit granularity. Due to its concept of filters used in a filter chain, and through using the internalSource attribute, XCEL can transform transcoded data and enable further processing of the result. Using the special normData symbol name for processing fragmented PNG IDAT chunks [SHC08], and again using the internalSource attribute, XCEL provides at least partial support for

2.2.5 Data Format Description Language

For processing data in XML-based representations, a number of standardised technologies exist, such as transforming documents using Extensible Stylesheet Language Transformations (XSLT), or filtering elements using XPath. By translating the representation of data from the binary domain to XML, these technologies can be leveraged for use on data from the binary domain as well.

Overview

The Data Format Description Language (DFDL) is an extension to the W3C XML Schema Definition Language (XSDL) and intends to describe arbitrary data formats to enable the translation from format-compliant data to an XML representation and vice versa. At the time of writing, the current version of DFDL is 1.0 as defined in Draft 038, where several parts of the language specification are designated to be in flux and to be changed in upcoming versions of the draft [PHB+10].

In the DFDL approach, a processor processes data given in a format as described by a schema:

Processor: A processor typically is either a DFDL Parser or a DFDL Unparser, where the former parses the format-compliant representation and serialises it to an XML representation. Conversely, the latter parses the XML representation and unparses it to its format-compliant binary representation.

Schema: A DFDL schema describes the composition of data through XML Schema extended with DFDL annotations. While XML Schema provides the means to describe both primitive and complex data types, DFDL annotations describe additional information, such as the length or the binary encoding of a data type. For handling dynamic dependencies, where parsed and decoded information is used for further parsing, DFDL employs a subset of XPath 2.0 as expression language, including functions for boolean, string and date operations. Since DFDL has the explicit goal of round-trip support for data parsed and unparsed, every schema is required to be unambiguous during unparsing, that is, only one binary representation may exist.

Example 2.2.2: A DFDL example is shown in Table 2.2, where structured data is described as a sequence of four primitive data, namely an integer w, an integer x, a double-precision floating-point number y and a single-precision floating-point number z, all in big-endian byte order.
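The byte layout described in Example 2.2.2 can be mirrored with Python's `struct` module to show what parsing and unparsing amount to for this structure. This is a sketch under assumptions, not a DFDL processor: the functions `parse` and `unparse` are hypothetical, a dictionary stands in for the XML representation, and the integers are assumed to be 32-bit signed values as for the XML Schema type int with an implicit length.

```python
import struct

# Big-endian layout: w (int32), x (int32), y (float64), z (float32).
RECORD = struct.Struct(">iidf")


def parse(data: bytes) -> dict:
    """Translate the binary representation into named fields."""
    w, x, y, z = RECORD.unpack(data)
    return {"w": w, "x": x, "y": y, "z": z}


def unparse(fields: dict) -> bytes:
    """Translate named fields back into the binary representation."""
    return RECORD.pack(fields["w"], fields["x"], fields["y"], fields["z"])


raw = unparse({"w": 1, "x": -2, "y": 3.5, "z": 0.25})
fields = parse(raw)
round_trip = unparse(fields)
```

The round trip returns the identical byte sequence here because the chosen values are exactly representable; this corresponds to the round-trip goal stated for DFDL schemas, where only one binary representation may exist for given content.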

Discussion

DFDL assists the processing of data for binary data formats through standardised XML tools by translating information from its format-compliant representation to an XML representation, and vice versa.

<xs:complexType name="example1">
  <xs:sequence>
    <xs:element name="w" type="int">
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:element representation="binary"
                        byteOrder="bigEndian"
                        lengthKind="implicit"/>
        </xs:appinfo>
      </xs:annotation>
    </xs:element>
    <xs:element name="x" type="int">
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:element representation="binary"
                        byteOrder="bigEndian"
                        lengthKind="implicit"/>
        </xs:appinfo>
      </xs:annotation>
    </xs:element>
    <xs:element name="y" type="double">
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:element representation="binary"
                        byteOrder="bigEndian"
                        lengthKind="implicit"/>
        </xs:appinfo>
      </xs:annotation>
    </xs:element>
    <xs:element name="z" type="float">
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:element representation="binary"
                        byteOrder="bigEndian"
                        lengthKind="implicit"
                        binaryFloatRep="ieee"/>
        </xs:appinfo>
      </xs:annotation>
    </xs:element>
  </xs:sequence>
</xs:complexType>

Table 2.2: Excerpt of a sample data structure defined using DFDL, taken from

through the extension of XML Schema. It assumes that a data format defines the composition of hierarchically-nested data, at the same time explicitly claiming its applicability to the description of any data format.

The DFDL specification includes two noteworthy concepts. It distinguishes between approaches for making data format knowledge explicit as either prescriptive or descriptive. The specification document categorises approaches such as ASN.1 into the former category, and itself into the latter. Moreover, DFDL distinguishes between data as either content or framing, depending on its purpose in the format, where the language allows framing to be hidden from later processing.

Classification: DFDL is a declarative approach which is machine-processible. In examined publications, no formalised model for universal applicability is given.

Descriptive capabilities: DFDL clearly supports the segmentation of structured data and the decoding of primitive data, also supporting data with lengths of bit granularity through dfdl:lengthUnits. Although XML Schema is a powerful basis for DFDL to extend, concerning data format description, there are limitations to DFDL despite its explicit claim of universal applicability. Most notably, DFDL itself acknowledges its lack of support for cases where one element's value becomes the representation of another element, termed layering by DFDL, which has been confirmed as a limitation and explicitly deferred to a later revision.

Yet, layering is required for handling transformed or fragmented data, as in these cases, the value of one or more bit sequences represents another bit sequence when processed accordingly. For example, to completely describe video and audio streams typically stored as interleaved fragments in multimedia containers such as the MPEG-4 File Format, fragments of a specific stream have to be concatenated in order to analyse the stream according to its own format-specific rules, e.g. for MPEG-4 Video or MPEG-4 Advanced Audio Coding (AAC).
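The need for concatenation can be sketched in Python with a deliberately simplified container. The layout used here is hypothetical and is not the MPEG-4 File Format: each fragment consists of a 1-byte stream identifier, a 2-byte big-endian payload length and the payload itself. Only after the payloads of one stream have been joined does a bit sequence exist that could be analysed according to that stream's own format.

```python
import struct
from collections import defaultdict

# Hypothetical fragment header: 1-byte stream id, 2-byte big-endian length.
HEADER = struct.Struct(">BH")

def split_fragments(container: bytes):
    """Yield (stream_id, payload) for each fragment in the container."""
    offset = 0
    while offset < len(container):
        stream_id, length = HEADER.unpack_from(container, offset)
        offset += HEADER.size
        payload = container[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated fragment")
        yield stream_id, payload
        offset += length

def reassemble(container: bytes) -> dict:
    """Concatenate the payloads of each stream in order of appearance."""
    streams = defaultdict(bytearray)
    for stream_id, payload in split_fragments(container):
        streams[stream_id] += payload
    return {sid: bytes(buf) for sid, buf in streams.items()}

def fragment(stream_id: int, payload: bytes) -> bytes:
    """Build one fragment in the hypothetical layout."""
    return HEADER.pack(stream_id, len(payload)) + payload

# Two interleaved streams, standing in for video (1) and audio (2).
container = (fragment(1, b"VID") + fragment(2, b"AU")
             + fragment(1, b"EO") + fragment(2, b"DIO"))
streams = reassemble(container)
```

In this sketch, `streams[1]` is a value computed from several elements of the container, and it would in turn serve as the representation to be parsed by a second, stream-specific description.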

Part of this limitation can be attributed to DFDL's explicit assumption that data is hierarchically nested, which fits well with the tree-based structural model of XML, where a logical node may have multiple children, but has at most one parent. Yet, for the concatenation of fragmented data, a logical node is required to have multiple parents as well, pointing towards a graph-based structural model.
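The structural difference can be made explicit with a small Python sketch. It rests on one possible reading, which is an assumption made here: the reassembled stream is modelled as a node derived from every fragment that contributes to it, so that each contributing fragment is one of its parents. The class `Node` and the function `is_tree` are illustrative names only.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    """A logical node that records both its children and its parents."""
    name: str
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def add_child(self, child: "Node") -> None:
        self.children.append(child)
        child.parents.append(self)

def is_tree(nodes) -> bool:
    """A tree-based model allows at most one parent per node."""
    return all(len(node.parents) <= 1 for node in nodes)

container = Node("container")
frag_a = Node("fragment A")
frag_b = Node("fragment B")
stream = Node("reassembled stream")

# Hierarchical nesting: the container holds its fragments.
container.add_child(frag_a)
container.add_child(frag_b)
nested = [container, frag_a, frag_b]

# Concatenation: the stream is derived from both fragments.
frag_a.add_child(stream)
frag_b.add_child(stream)
layered = nested + [stream]
```

Here `is_tree(nested)` holds, while `is_tree(layered)` does not, because the stream node has two parents. This is the point at which a tree-based model no longer suffices.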