Dissertation
approved by the Department of Computer Science (Fachbereich Informatik)
of Technische Universität Darmstadt
in fulfilment of the requirements for the academic degree of Dr.-Ing.
submitted by
Dipl.-Inform. (FH) Michael Hartle
born in Frankfurt am Main
Date of submission: 7 June 2010
Date of defence: 16 July 2010
Referees: Prof. Dr. Max Mühlhäuser, Darmstadt
Prof. Dr. Andreas Rauber, Vienna
Declaration of Honour
I hereby declare that I have written the submitted thesis for the academic degree of
Dr.-Ing., entitled "A Formal, Declarative Approach to Data Format Description",
independently and exclusively using the aids indicated. I have not previously
attempted to obtain a doctorate.
Darmstadt, 7 June 2010
Dipl.-Inform. (FH) Michael Hartle
Academic Career of the Author
09/1998 - 03/2003 Studies in Computer Science at Fachhochschule Darmstadt,
graduated with grade 1.3
09/2002 - 03/2003 Diploma thesis at Fachhochschule Darmstadt,
entitled "Real-Time Generation of Planetary Landscapes"
07/2004 - 07/2010 Research associate at the Telecooperation group (Fachgebiet
Telekooperation, Prof. Dr. Mühlhäuser) at Technische Universität Darmstadt
07/2004 - 07/2010 Technical staff member at the e-learning center of the
Hochschulrechenzentrum (university computing centre) of Technische Universität Darmstadt
09/2006 - 02/2010 Associated member of the research training group Graduiertenkolleg 1223
"Qualitätsverbesserung im E-Learning durch rückgekoppelte Prozesse"
(quality improvement in e-learning through feedback processes)
Abstract
The concept of data formats is central to information storage and exchange, as
it shapes the process of how information is written to and read back from
format-compliant data by senders and receivers. In contrast to the widespread use of
natural-language descriptions intended for human engineers, and of procedural
definitions of format-compliant components, describing data format knowledge in a
formal, declarative manner is necessary for making this knowledge machine-processible,
enabling its flexible, automated application to format-compliant data. To that
effect, data format knowledge is considered both on the level of format-compliant
data as a data format instance and on the level of a data format consisting of such
instances.
In a survey of the current state of the art in data format description, the examined
related work from the data-centric research domains of Digital Preservation,
Multimedia and Telecommunication shows a lack of suitable formalised models for
universal applicability. In addition, the examined related work provides only a subset
of the four necessary descriptive capabilities to describe data which may be primitive,
structured, transcoded or fragmented.
In the analysis, a formalisation is presented which is based on the research
hypothesis that a data format defines a normative set of lossless information
representations, where there exists a bijective mapping between internal representations
of senders / receivers, and external representations that are exchanged as
format-compliant data. The formalisation is universally applicable for arbitrary data
formats, is suitable for both so-called lossless and lossy data formats, and leads to the
notion of four elementary descriptive capabilities, which exactly match those used in
the state-of-the-art survey. A valid Portable Network Graphics (PNG) raster image
is given as a litmus test for data format description, as its description exercises all
four elementary descriptive capabilities. Based on the formalisation, it is shown that
a universal approach to data format description is too powerful in computational
terms to be able to guarantee termination, that the tractability of bijective
mapping functions and their inverses is neither given nor necessarily related, and that
one-to-one correspondence of a bijective mapping function can be guaranteed using
information-preserving, Turing-complete Reversible Turing Machines.
Building on the formalisation given in the analysis, the thesis defines the
Bitstream Segment Graph (BSG) model for describing arbitrary data format instances.
For BSG instances, representations are defined both for visualisation and
for storage and exchange through machine-processible, RDF-based representations.
Incremental construction and modification of BSG instances is enabled through a
closed set of operations, and the coverage of a BSG instance is defined as a measure
of its completeness. Actual tool support for the construction, modification and
exploration of BSG instances on arbitrary data is provided through the Apeiron BSG
Editor. Applications of the BSG model are demonstrated through the description
of the PNG image litmus test from the previous analysis, and through the description
of an exploit in the context of IT Security.
Building on the BSG model, the thesis defines the BSG Reasoning approach for
describing arbitrary data formats as potentially infinite sets of data format instances.
Using logic rules, a BSG instance can be inferred on a given bit sequence which is
initially assumed to be format-compliant. The approach also defines the
representation of rulesets for storage and exchange. Applications of BSG Reasoning
are demonstrated through the description of a PNG image file format subset and
through an outlined approach for format-aware fuzzing of bitstreams in IT Security.
The PNG image file format subset described through BSG Reasoning exercises all
elementary descriptive capabilities previously identified in the analysis, and it is
shown that the resulting set of logic rules, despite a low number of format-specific
rules, already yields a high coverage of inferred BSG instances on a number of valid
PNG images.
The thesis closes with a retrospection, conclusions and an outlook on potential
future research on the BSG model and the BSG Reasoning approach, focusing on
aspects such as the computer-aided reverse-engineering of data format rules, or the
use of reversible programming languages for the definition of lossless coding and [...]
Summary (Zusammenfassung)
The storage and exchange of information are closely tied to the notion of a
data format. A data format determines how information can be written as data by a
sender in a format-compliant way, and read back from that data by a receiver.
Although natural-language descriptions for human engineers are in frequent use today,
and format-compliant procedures are partly described procedurally, a formal, declarative
description of data format knowledge would have the advantage of being
machine-processible without being bound to a concrete procedure and without a detour
via humans, so that it can be applied to format-compliant data in a flexible and
automated manner. In this dissertation, data format knowledge is considered both on the
level of format-compliant data as a data format instance and on the level
of a data format, which consists of data format instances.
In a review of related work in the field of data format description, approaches
from the data-oriented research areas of Digital Preservation, Multimedia and
Telecommunication are examined, and it is found that suitable, formalised models
which are universally applicable to the description of data formats are lacking.
Furthermore, the approaches considered turn out to have only part of the
descriptive capabilities that are required to describe the composition of data which
contain primitive values, represent a structure, have undergone a block transformation
or are present in fragmented form.
An analysis therefore develops a formalisation of the notion of a data format,
which starts from the assumption that a data format constitutes a normative set
of lossless information representations. For such a set, there exists a bijective
mapping between the internal representation of a sender / receiver and the
corresponding external representation, which is exchanged in the form of
format-compliant data. This formalisation is universally applicable to
arbitrary data formats, that is, to so-called lossy as well as lossless data
formats, and leads to the concept of elementary descriptive capabilities,
which coincide exactly with those used in the review. Based on these
capabilities, a valid image in the Portable Network Graphics (PNG) file format
is presented as a litmus test for approaches to data format description,
since its description requires all four elementary descriptive capabilities.
Based on the formalisation, it is then shown that a universally applicable
approach to data format description is too powerful for its termination to
still be guaranteed. It is further shown that bijective mapping functions and
their inverses need not be efficient, and that the efficiency of a bijective
mapping function and that of its inverse need not be related. Finally, it is
shown that the correspondence of internal and external representations required
for a bijective mapping can be guaranteed by defining it via an
information-preserving, Turing-complete Reversible Turing Machine.
Building on the formalisation of the analysis, this dissertation defines the
Bitstream Segment Graph (BSG) model, which serves the description of arbitrary data
format instances. For BSG instances, a representation for visualisation and a
representation for storage and exchange are defined. The step-wise construction and
modification of BSG instances is enabled by a closed set of operations, and the
completeness of a BSG instance can be determined by means of its coverage measure.
With the Apeiron BSG Editor, the construction, modification and inspection of BSG
instances on one's own data is possible in practice. The application of the BSG model
is demonstrated by describing the PNG image from the litmus test of the analysis,
and by explaining the structure of an exploit in the context of IT Security in more
detail by means of a description.
Building on the BSG model, this dissertation describes the BSG Reasoning
approach for describing arbitrary data formats as potentially infinite sets of data
format instances. Using logic rules, a BSG instance can be inferred on a given bit
sequence which is initially assumed to be format-compliant. The approach also
defines the representation of rulesets for storage and exchange. The application
of the BSG Reasoning approach is demonstrated through the description of a subset
of the PNG data format, complemented by the description of an approach to
format-specific fuzzing of binary data in the context of IT Security. The description
of the PNG data format subset by means of the BSG Reasoning approach uses all four
elementary descriptive capabilities previously identified in the analysis, and
it was shown that the set of logic rules used for this purpose, despite its small
size, is already able to infer BSG instances with a high degree of coverage for a
number of valid PNG images.
The dissertation closes with a retrospection on the entire work, draws
conclusions and offers an outlook on future research regarding the BSG model and
the BSG Reasoning approach, in particular with regard to aspects such as the
computer-aided analysis of data and the respective underlying data format rules,
or the use of reversible [...]
1 Introduction
1.1 Motivation
1.1.1 Formal descriptions over natural-language descriptions
1.1.2 Declarative approaches over procedural approaches
1.1.3 Research domains
1.1.4 Data format instances and data formats
1.2 Research Problems
1.3 Contributions and Outline
1.4 Acknowledgements
2 State of the Art
2.1 Introduction
2.2 Digital Preservation
2.2.1 Data Format Registries
2.2.2 Open Archival Information System Reference Model
2.2.3 Typed Object Model
2.2.4 eXtensible Characterisation Language
2.2.5 Data Format Description Language
2.3 Multimedia
2.3.1 MPEG 1/2 Methodology
2.3.2 Formal Language for Audio-Video Object Representation
2.3.3 Bitstream Syntax Description Language
2.3.4 BFlavor and gBFlavor
2.4 Telecommunication
2.4.1 Abstract Syntax Notation One
2.4.2 Concrete Syntax Notation 1
2.5 Discussion
2.5.1 Consideration of examined Related Work
2.5.2 General observations
2.5.3 Other approaches
2.6 Summary
3 Analysis
3.1 Introduction
3.2 Research Hypothesis
3.3 Formalising Data Format Description
3.3.2 Representing complex information
3.3.3 Validating representations
3.3.4 Mapping between representations
3.4 Properties of the Formalisation
3.4.1 Suitability of bijective mapping functions for data formats
3.4.2 Sufficiency for lossless and lossy data formats
3.4.3 Sufficiency and necessity of descriptive capabilities
3.4.4 Using a PNG raster image as litmus test
3.5 Limits to Data Format Description
3.5.1 Overview
3.5.2 Computability and decidability of functions
3.5.3 Tractability of functions
3.5.4 One-to-one correspondence of sets
3.5.5 Summary
3.6 Discussion
3.7 Summary
4 Describing Data Format Instances
4.1 Introduction
4.2 Definition of the Bitstream Segment Graph model
4.2.1 Defining codings and transformations
4.2.2 Defining bitstream segments
4.2.3 Defining a Bitstream Segment Graph
4.3 Representation of BSG instances
4.3.1 Visual representations
4.3.2 Digital representation
4.4 Construction and modification of BSG instances
4.4.1 Modifying BSG instances through operations
4.4.2 Measuring completeness of a description
4.4.3 Providing tool support with the Apeiron BSG Editor
4.5 Applications of the BSG model
4.5.1 Description of the PNG raster image litmus test
4.5.2 Describing exploits in IT Security
4.6 Discussion
4.7 Summary
5 Describing Data Formats
5.1 Introduction
5.2 Definition of the BSG Reasoning approach
5.2.1 Making propositions
5.2.2 Using predicates
5.2.3 Defining rules
5.2.4 Matching rule conditions
5.2.5 Inference process
5.3 Representation of BSG Reasoning rulesets
5.4.2 Format-aware fuzzing of bitstreams
5.5 Discussion
5.6 Summary
6 Finale
6.1 Introduction
6.2 Retrospection
6.3 Conclusions
6.4 Outlook
6.4.1 Computer-aided reverse-engineering of data format rules
6.4.2 Use of reversible programming languages
6.4.3 Analysis of space-efficiency regarding existing data formats
A BSG Reasoning ruleset for PNG subset
B BSG Reasoning results of PNG ruleset
2.1 Overview of OAIS Functional Entities
2.2 Overview of OAIS concept relations
3.1 Abstraction of the information transport from a sender to a receiver
3.2 Mapping steps ordered by input and output cardinality
3.3 Bijective mapping between m_α and m_d
3.4 Bijective mapping between internal and external representations
3.5 Containment of grammar classes based on Chomsky
4.1 Representation of bitstream segments in the simple variant
4.2 Representation of bitstream segments in the extended variant
4.3 Representation of bitstream segments in the interactive variant
4.4 Graph grammar rule for initial_split/final_join
4.5 Graph grammar rule for split/join
4.6 Graph grammar rule for tie/untie
4.7 Graph grammar rule for declare_primitive/undeclare_primitive
4.8 Graph grammar rule for declare_fragment/undeclare_fragment
4.9 Graph grammar rule for compose/decompose
4.10 Graph grammar rule for expand/compress
4.11 Apeiron BSG Editor showing a BSG instance
4.12 Partial bitstream segment graph for file oi2n0g16.png
4.13 Structure of a crafted PNG image exploit shown in Apeiron (1/3)
4.14 Structure of a crafted PNG image exploit shown in Apeiron (2/3)
4.15 Structure of a crafted PNG image exploit shown in Apeiron (3/3)
2.1 Excerpt of an XCEL description for a PNG IDAT chunk
2.2 Excerpt of a sample data structure defined using DFDL
2.3 Definition of a data structure using the MPEG 1/2 methodology
2.4 Definition of an ISO File Type Box using MSDL / Flavor
2.5 Definition of a MPEG-2 Systems Program Stream (PS) using Flavor
2.6 XFlavor representation of a MPEG-2 Systems Program Stream
2.7 Definition of a BS Schema for JPEG2000 using BSDL
2.8 Definition of a BS Description for JPEG2000 using BSDL
2.9 Definition of a gBS Description for JPEG2000 using BSDL
2.10 Excerpt of an X.509 certificate definition given in ASN.1
2.11 An ASN.1 assignment for use with ECN
2.12 An ECN encoding object assignment
2.13 Excerpt of a 3GPP message definition in CSN.1
2.14 Comparison of supported elementary descriptive capabilities
4.1 Namespace declarations
4.2 Bitstream segment types
4.3 RDF classes for bitstream segments
4.4 RDF properties for bitstream segments
5.1 Example BSG rule
5.2 List of model-specific BSG rules
5.3 Excerpt of format-specific rules for a PNG subset (1/2)
5.4 Excerpt of format-specific rules for a PNG subset (2/2)
B.1 Results for PNG subset rules (1/5)
B.2 Results for PNG subset rules (2/5)
B.3 Results for PNG subset rules (3/5)
B.4 Results for PNG subset rules (4/5)
Introduction
1.1 Motivation
The concept of data formats is central to information storage and exchange. A data
format defines how information is represented digitally as bits, bytes or characters,
forming higher-level data structures. It therefore shapes the process of how to
determine the syntax and semantics of data in order to access the information
represented by it, and to process it in a meaningful manner.
To pass digitally represented information between a sender and a receiver, an
agreement is needed on the semantics of the data to be transmitted, and on the data
format to be used. The agreement is used by a sender to determine how to encode
and serialise information to be sent into a sequence of bits, which is then passed
on as a message. The actual composition of the format-compliant data forming the
message depends on the data format agreed for the exchange, and is thus only given
implicitly. Therefore, the agreement is also used by a receiver of the message, where
the bit sequence is parsed and decoded, enabling access to the represented information
for further processing. In this process, data format knowledge is applied to make
the actual composition of format-compliant data explicit again.
It is therefore necessary that some representation of the data format is given and
shared between these parties as part of the agreement, leading to the problem of
describing data format knowledge.
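The exchange between sender and receiver described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the toy
format (a 16-bit big-endian length field followed by UTF-8 text) and the function
names encode and decode are hypothetical and do not appear in this thesis.

```python
import struct

# Hypothetical toy format (an assumption for illustration only):
# a 16-bit unsigned big-endian length, followed by that many bytes of UTF-8 text.

def encode(text: str) -> bytes:
    """Sender side: encode the information and serialise it into a byte sequence."""
    payload = text.encode("utf-8")
    if len(payload) > 0xFFFF:
        raise ValueError("payload too long for a 16-bit length field")
    return struct.pack(">H", len(payload)) + payload

def decode(message: bytes) -> str:
    """Receiver side: parse the byte sequence and decode the represented information."""
    (length,) = struct.unpack_from(">H", message, 0)
    payload = message[2:2 + length]
    if len(payload) != length or len(message) != 2 + length:
        raise ValueError("message is not format-compliant")
    return payload.decode("utf-8")

message = encode("width=32")
# The round trip only works because both sides share the same format agreement.
assert decode(message) == "width=32"
```

The composition of the message (where the length ends and the text begins) is not
visible in the bytes themselves; it only becomes explicit again once the receiver
applies the shared format knowledge.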
1.1.1 Formal descriptions over natural-language descriptions
The way in which data formats are described and represented depends on the
intended audience. Both natural-language descriptions for human engineers and
formal descriptions for machine processing exist:
Natural-language descriptions intended for human engineers are still
dominant at the time of writing. Translating data format knowledge from
natural-language descriptions to machine-executable implementations, made necessary by
the sheer volume of format-compliant data as well as by the complexity of its
representation, depends on qualified human labour.
Formal descriptions intended for machine processing are currently present in [...]
data formats in general depends on the availability of a formalised model that
guarantees this property.
Accessing represented information in format-compliant data strongly depends on
format-compliant implementations. Existing implementations are threatened by
rapid technological change, which necessitates constant adaptation or replacement
for them to remain usable. Yet, the problem with natural-language descriptions is
that the application of data format knowledge to a problem depends on qualified
human labour, which is limited by its cost and availability. Lacking format-compliant
implementations, the illegibility of represented information effectively results in its
loss.
Employing formal descriptions of data formats in a machine-processible
manner is a desirable alternative for making data format knowledge machine-accessible
and applicable without inherently depending on human labour. Yet, a suitable,
formalised model still remains essential for the universal applicability of formal
descriptions.
1.1.2 Declarative approaches over procedural approaches
The problem of formally describing data format knowledge can be approached in
either a procedural or a declarative manner:
In a procedural manner: Using languages similar to existing programming
languages like Java or C/C++, procedural approaches define fixed, specific
algorithms for processing format-compliant data, such as for parsing and decoding, or
for encoding and serialising. In contrast to declarative approaches, the
underlying rules and constraints of format-compliant data are given only indirectly,
albeit being present in the defined algorithms.
In a declarative manner: Declarative approaches define the
underlying rules and constraints that govern format-compliant data. In contrast
to procedural approaches, no specific algorithm for processing format-compliant
data is given.
For a specific purpose, procedural approaches make it straightforward to write
algorithms to process format-compliant data. Yet, the problem with procedural
approaches is that different purposes lead to different algorithmic implementations,
even when the same underlying rules and constraints of the very same data format
still apply.
Using a declarative approach, these underlying rules and constraints can be
exposed for a wide range of applications, retaining the freedom of using the same
format-specific knowledge for different purposes. Although more complex, the
declarative approach has the benefit of increased flexibility.
For example, it may be desirable to access only a certain subset of the information
contained in large volumes of format-compliant data, thereby using resources more
efficiently and speeding up the parsing process. Likewise, when part of [...]
strongly depends on the availability of data format knowledge that can be adapted
and applied in a flexible manner.
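The difference between the two manners can be illustrated with a minimal Python
sketch. It assumes a hypothetical fixed-layout record; the field names, the FIELDS
table and both functions are illustrative and are not the description language
developed in this thesis. The layout is stated once as data, and generic code then
reuses it for full parsing, for partial access and for serialisation.

```python
import struct

# Hypothetical declarative description of a fixed-layout record:
# each entry is (field name, struct format code). Names and layout are assumptions.
FIELDS = [
    ("width", ">I"),      # 32-bit unsigned, big-endian
    ("height", ">I"),     # 32-bit unsigned, big-endian
    ("bit_depth", ">B"),  # 8-bit unsigned
]

def parse(data: bytes, wanted=None) -> dict:
    """Walk the description and decode only the wanted fields (all if None)."""
    result, offset = {}, 0
    for name, fmt in FIELDS:
        if wanted is None or name in wanted:
            (result[name],) = struct.unpack_from(fmt, data, offset)
        offset += struct.calcsize(fmt)
    return result

def serialise(values: dict) -> bytes:
    """The same description also drives the inverse direction."""
    return b"".join(struct.pack(fmt, values[name]) for name, fmt in FIELDS)

record = serialise({"width": 32, "height": 16, "bit_depth": 8})
print(parse(record))                    # all fields
print(parse(record, wanted={"width"}))  # only a subset of the information
```

A procedural parser for the same record would hard-code the offsets inside one
algorithm, so partial access or serialisation would each need separately written
code, even though the underlying layout rules are identical.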
1.1.3 Research domains
The problem of describing data formats is of concern to data-centric domains of
Computer Science, especially Digital Preservation, Multimedia and
Telecommunication:
In Digital Preservation, the problem of preserving long-term access to digital
information for future generations threatens digitally represented cultural heritage
[RH05]. A major use case for related work in this domain is the migration of data
between data formats to prevent technological obsolescence and the subsequent
loss of information.
In Multimedia, two use cases related to data formats are the normative definition
of data formats as well as the high-level adaptation of digital objects for Universal
Media Access [VCE03]. Regarding the former use case, normative definition of
data formats is required for the specification of new data formats, allowing the
interoperability of systems working with multimedia data. Concerning the latter
use case, Universal Media Access depends on the timely delivery of multimedia
resources over heterogeneous networks to end-user devices with varying decoding
and playback capabilities. High-level, on-the-fly adaptation of digital objects to
the capabilities and limitations of network and end-user devices strongly depends
on data format knowledge to achieve meaningful adaptations.
In Telecommunication, a primary use case is the normative definition of protocol
data units (PDUs). Similar to Multimedia, there is the problem of describing
communication protocols for interoperability of parties, which also covers the
data format of PDUs to be transmitted.
Related work exists for formally describing data formats in a declarative and
procedural manner, yet their underlying models are often based on domain-specific
assumptions that do not hold in general. Examined related work does not provide a
formalised model that is geared towards universal applicability, although universal
applicability is sometimes claimed.
1.1.4 Data format instances and data formats
When considering the problem of formally describing data format knowledge, it is
helpful to distinguish between a data format instance and a data format:
Data format instance: A data format instance has a bit sequence, where
information is represented conforming to its data format. Describing a data
format instance in a machine-processible manner provides a means to
authoritatively express how specific information is represented, correcting
misunderstandings present in applications and in the understanding of human engineers. For
example, a Portable Network Graphics (PNG) raster image is a data format [...]
Data format: A data format has a potentially infinite number of data format
instances, which conform to a common set of underlying rules and constraints.
Again, describing a data format in a machine-processible manner provides a means
to authoritatively express how a specific type of information is represented. For
example, the PNG file format is a data format which defines rules on where the
image header is located and from which fields it is composed, thus defining the
placement of fields containing the width of arbitrary PNG images.
Data format instances and data formats are closely linked, since underlying rules
and constraints of a data format manifest themselves in its data format instances. In
order to address and consider both levels, it is desirable to have suitable, matching
models for describing both data format instances and data formats.
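The PNG example can be made concrete with a short Python sketch. It relies on the
layout given by the PNG specification: an 8-byte signature, followed by an IHDR
chunk whose 13 data bytes begin with the width and height as 32-bit big-endian
integers. The function names read_png_size and make_png_prefix are hypothetical
helpers introduced here for illustration.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def read_png_size(data: bytes) -> tuple:
    """Apply format-level rules to one instance: locate IHDR, read width and height."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG signature")
    # Each chunk starts with a 4-byte length and a 4-byte type.
    length, chunk_type = struct.unpack_from(">I4s", data, 8)
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR")
    width, height = struct.unpack_from(">II", data, 16)
    return width, height

def make_png_prefix(width: int, height: int) -> bytes:
    """Build signature plus IHDR chunk (8-bit greyscale) as a test instance prefix."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    crc = zlib.crc32(b"IHDR" + ihdr)  # CRC covers chunk type and chunk data
    return PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + ihdr + struct.pack(">I", crc)

prefix = make_png_prefix(32, 32)
assert read_png_size(prefix) == (32, 32)
```

Here the rules encoded in read_png_size hold for every PNG image and thus belong
to the level of the data format, whereas the concrete bytes in prefix, and the
fact that its width field holds the value 32, belong to the level of a single
data format instance.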
1.2 Research Problems
This thesis addresses the formal description of arbitrary data formats in a
declarative manner, based on the formal description of arbitrary data format
instances. Describing a data format as a class through its data format instances raises
the following two research problems for this thesis:
Describing data format instances: How to describe the composition of
format-compliant data, considering the syntax and semantics of its bit sequence or
segments thereof?
Describing data formats: How to describe a data format with a potentially
infinite set of data format instances through its underlying rules and constraints?
Since a data format serves the purpose of representing information for its storage and
transmission over time, this thesis assumes that the representation of information
is lossless - represented information must actually be present in its representation.
This thesis therefore also assumes that for every data format instance, there exists
a bijective mapping between a format-compliant bit sequence and the information
it represents.
1.3 Contributions and Outline
The four contributions of this thesis are a state-of-the-art survey on data formats
and their description, an analysis on data format description, the Bitstream Segment
Graph (BSG) model for describing data format instances, and the BSG Reasoning
approach for describing data format classes:
Survey on the current state of the art in data formats and their description:
The survey in Chapter 2 covers definitions and provides a basic systematisation
of related work in terms of their descriptive capabilities, focusing on the
data-centric research domains of Digital Preservation, Multimedia and
Telecommunication.
Analysis of data format description: The analysis provided in Chapter 3
presents an abstract model for describing data formats and data format instances
in order to address the inherent properties and limitations of data format
description. While the reversibility of a bijective mapping does not restrict its
computational complexity, the analysis states that describing arbitrary data formats
comes at the cost of losing guaranteed termination of parsing and decoding
processes in case of erroneous data format rules. The analysis presents the concept of
elementary descriptive capabilities, which align with the descriptive capabilities
previously used for comparing related approaches and identifying shortcomings.
Bitstream Segment Graph (BSG) model: Drawing on the set of elementary
descriptive capabilities established in the analysis, the graph-based model
presented in Chapter 4 is used to describe data format instances through a bijective
mapping for a format-compliant bit sequence. The chapter includes methods for
the construction of a BSG instance and its evaluation, e.g. for the extraction of
contained information. The chapter also presents the Apeiron BSG Editor tool
for the manual annotation of data and the definition of an RDF-based
representation for BSG instances. This contribution has been published at the International
Conference on Software and Data Technologies (ICSOFT) 2008 [HMT+08], at the
International Multi-Conference on Computing in the Global Information
Technology (ICCGI) 2008 [HSB+08] and extended in an article in the International
Journal on Advances in IT Security (IJAS) [HFS+09], with the latter two
focusing on applications in the context of IT Security.
BSG Reasoning approach: The approach presented in Chapter 5 builds on
the BSG model and describes a data format with a possibly infinite set of data
format instances using rules for the computation of a least fixed point similar to
Datalog [CGT89], thereby inferring a BSG instance from a format-compliant bit
sequence. The chapter also presents a syntax for data format rules and a
BSG-based reasoning engine. The contribution has been evaluated on a subset of the
Portable Network Graphics (PNG) image format which exercises all elementary
descriptive capabilities identified in the previous analysis. This contribution has
been published at the International Conference on Digital Preservation (iPRES)
2008 [HBSM08] and extended in the IJAS article [HFS+09].
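The idea of computing a least fixed point over rules, as in Datalog, can be
illustrated with a generic Python sketch. It is not the rule syntax or reasoning
engine of Chapter 5: the toy format (chunks consisting of a 1-byte length followed
by a payload), the fact names and the function infer are assumptions made purely
for illustration.

```python
def infer(data: bytes) -> set:
    """Naive forward chaining: apply the rule until no new facts are derived."""
    # Initial fact (assumption of the toy format): a chunk starts at offset 0.
    facts = {("chunk_start", 0)}
    while True:
        new = set()
        for fact in facts:
            # Rule: a chunk starting at `offset` has a 1-byte length field,
            # a payload of that length, and is followed by the next chunk.
            if fact[0] == "chunk_start" and fact[1] < len(data):
                offset = fact[1]
                length = data[offset]
                end = offset + 1 + length
                if end <= len(data):
                    new.add(("length_field", offset, length))
                    new.add(("payload", offset + 1, end))
                    new.add(("chunk_start", end))
        if new <= facts:   # nothing new: the least fixed point is reached
            return facts
        facts |= new

facts = infer(bytes([2, 0xAA, 0xBB, 1, 0xCC]))
for fact in sorted(facts):
    print(fact)
```

In this sketch the loop terminates because every derived offset is strictly larger
than the one it was derived from and bounded by the length of the data. For
arbitrary rules no such guarantee exists, which corresponds to the loss of
guaranteed termination stated in the analysis.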
1.4 Acknowledgements
As Malcolm Gladwell so aptly put it, behind every story of success, there is also a
story of a fertile ground provided, of chances offered, of opportunities seized and of
hard work being done even in the face of setbacks and constant frustrations.
Although writing a PhD thesis is a personal endeavour, it cannot be considered
and understood as an isolated task. Looking back, I had a lot of helping hands when
I needed them - in this spirit, I want to thank:
my PhD advisor Prof. Dr. Max Mühlhäuser for taking the risk, accepting me as
his PhD student, and giving me the freedom to pursue my lines of research in his [...]
my parents Astrid and Joachim and my sister Svea for keeping me grounded, for
encouraging me all along the way, for supporting me in countless ways, and for
listening to quite similar stories over and over again - finally, it is done -,
all of my friends for being there, and for both distracting me and pushing me
forward as it was necessary (especially, but neither limited to nor in a special
order, Sandra and Hans, Tobias and Nina, Andreas and Verena, Marek and Steffi,
Ariane, Maren, Kerstin, Annette and Claudia),
my TK research team head Dr. Guido Rößling for a sound education in scientific
procedures, and for his endless patience to both read and comment on countless
drafts,
my colleagues at the Telecooperation (TK) research group (especially Gina Häußge,
Kai Höver, Henning Bär, and Gundolf von Bachhaus from the DLH group, as well
as Sebastian Ries and others),
my previous e-learning center (elc) team head Dr. Susanne Offenbartl for her
support and her refreshingly pragmatic viewpoints on the better side of
morale-building,
my colleagues at the e-learning center (especially Irina Reuter, Anika Hartmann,
Evelyn Ehrich, Klaus Steitz, Anne Bieberstein, Iris Wegmann, Thorsten Maas,
Leif Pullich, Jens Japes and Julia Sonnberger) and at the Hochschulrechenzentrum
(HRZ) (especially Dr. Jürgen Ohrnberger),
my colleagues at the Rechnerbetriebsgruppe (RBG) group (namely Sven Schenkel,
Claudia Kleber, Qun Herz, Manfred Schnitt, Gerhard Fladerer and Uwe
Langendorff),
my colleagues at the research training group GKEL (especially Doreen Böhnstedt),
Prof. Dr.-Ing. Ralf Steinmetz and his KOM research group,
all of my student helpers with whom I worked in close collaboration (especially
Andreas Doms, Tim Klein, Hristo Lulev, Amir Neziri, Martin Pinto-Bazurco
Mendieta, Viet Hung Tu and Thomas Wilker), as well as
all of my students whose bachelor, diploma and master theses I supervised, and
who contributed to my PhD research and my projects (namely in alphabetic order
Jan Adler, Benedikt Antoni, Paul Baumann, Arsene Botchak, Alex Bulach, Julian
Dean, Jens Hatlak, Clayton Hoss, Tim Klein (again), Benno Kröger,
Friedrich-Daniel Möller, Bernd Papachrissanthou, Daniel Schumann, Marcus Ständer, Björn [...]
State of the Art
2.1 Introduction
The previous Chapter 1 presented the research problem of formally expressing both
the composition of format-compliant data for data format instances and the
underlying rules for a data format in a machine-processible, declarative manner.
To assess the current state of the art in this regard, this chapter contributes
a survey of existing models for expressing such data format knowledge. The
survey focuses on the following aspects:
Definitions and models for data format knowledge: Related work directly
concerned with data format knowledge regarding the composition of data gives
definitions and models for its expression, either implicitly or explicitly. Existing
definitions and models are based on their specific concepts and constraints, leading
to inherent properties and limitations on their expressiveness.
References to existing related work: Related work only indirectly concerned
with the composition of data, but still in need to express such data format
knowledge, provides references to existing definitions and models as well as
insights into their adoption.
When related work provides a model for describing data formats or data format
instances not on the level of meta-information, but focusing on data format rules
and the structure of data format instances, the survey provides a classification:
Classification: It is considered whether such an approach is declarative or
procedural, whether it is machine-processible and whether it has a formalised model
for universal applicability:
Declarative or procedural approach: In order to classify existing
approaches for this thesis, it is considered whether they describe data format
knowledge in a declarative or procedural manner.
Machine-processible approach: Approaches exist in related work which
introduce descriptive means intended for human engineers, which do not focus on
being machine-processible. Machine-processible approaches themselves require
a minimum degree of formalisation which may be given only indirectly through
Formalised model for universal applicability: Approaches exist which
introduce a number of concepts to provide specific descriptive capabilities, yet
which do not address the completeness or orthogonality of their concepts for
describing arbitrary data formats. The existence of such a formalised model
supports potential claims for universal applicability.
Last but not least, if a suitable model is presented for an approach, the survey
examines its descriptive capabilities:
Descriptive capabilities: In order to compare different models, their
expressivity is considered regarding the handling of primitive data, structured data,
transcoded data or fragmented data:
Primitive data is a single piece of information, such as a floating-point
number, a character string or a three-bit unsigned integer stored in
least-significant bit first order. It is represented in an encoded form, which has to be
decoded in order to access it.
Structured data is a continuous sequence of bit sequences, each with a
separate, distinct meaning in its context. It has to be segmented to access the
separate constituents.
Transcoded data is a bit sequence which is the result of a transformation of
an original bit sequence, such as compression, encryption or some similar block
transformation. It has to be transformed in order to access the original bit
sequence.
Fragmented data is a bit sequence which is only a fragment of a larger,
original bit sequence. In order to access the original bit sequence in its entirety,
fragmented data has to be concatenated in the right order.
Handling these kinds of data properly requires the matching descriptive
capabilities of decoding primitive data, segmenting structured data, transforming
transcoded data and concatenating fragmented data. Based on these descriptive
capabilities, the survey compares related work, allowing statements to be made
regarding their suitability for describing arbitrary data formats
in general.
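The four capabilities can be illustrated with a small procedural sketch. It is not a model from the surveyed literature; the function names are hypothetical, byte granularity is assumed for segmentation, and zlib compression merely stands in for an arbitrary transformation.

```python
import struct
import zlib


def decode_primitive(bits: str) -> int:
    """Decode an unsigned integer from a bit string stored least-significant bit first."""
    return sum(1 << position for position, bit in enumerate(bits) if bit == "1")


def segment_structured(data: bytes, lengths: list[int]) -> list[bytes]:
    """Split a contiguous byte sequence into consecutive parts of the given lengths."""
    parts, offset = [], 0
    for length in lengths:
        parts.append(data[offset:offset + length])
        offset += length
    return parts


def transform_transcoded(data: bytes) -> bytes:
    """Recover the original bytes; zlib stands in for any block transformation."""
    return zlib.decompress(data)


def concatenate_fragments(fragments: dict[int, bytes]) -> bytes:
    """Join fragments, keyed by their sequence index, in ascending index order."""
    return b"".join(fragments[index] for index in sorted(fragments))


# Decoding: "011" read LSB-first is 0*1 + 1*2 + 1*4.
three_bit_value = decode_primitive("011")

# Segmenting: a 2-byte field followed by a 1-byte field.
header = struct.pack(">HB", 513, 7)
length_field, flag_field = segment_structured(header, [2, 1])

# Transforming and concatenating: a compressed payload stored as two
# fragments that arrive out of order.
payload = b"data format knowledge"
compressed = zlib.compress(payload)
fragments = {1: compressed[5:], 0: compressed[:5]}
restored = transform_transcoded(concatenate_fragments(fragments))
```

Note that the last step combines two capabilities: the fragments must be concatenated before the result can be transformed, which is the kind of dependency a descriptive model has to be able to express.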
Since processing data is a central, recurring aspect of Computer Science, related work
on data formats can be sought and found in a number of research domains. This
chapter focuses on contributions from a data-centric subset of research domains in
separate sections, namely Digital Preservation in Section 2.2, Multimedia in Section
2.2 Digital Preservation
Overview
Digital Preservation is concerned with the long-term preservation of digital
information. Data formats play a crucial role, since digital information is stored as data
in a specific data format, while format-specific hardware and software provides
access to the contained information. Over time, both hardware and software tend to
become obsolete due to technological advances. As obsolete hardware may fail in
the future when there is no replacement available, and as obsolete software may
not be available anymore or fail to interoperate with newer hardware and software,
technological obsolescence is a constant threat to Digital Preservation efforts. The
current rapid pace of technological change amplifies this threat.
In literature, three preservation strategies for protecting digital information against
information loss through technological obsolescence on the logical level are the
migration of data, the emulation of hardware / software, and digital archaeology:
Migration of data: Information stored as data in a specific data format is
migrated to a suitable target data format. Because data formats typically differ
in what they can represent, some information often cannot be retained during a
migration and is lost. For a migration of data, it usually must be decided which
information to retain, for which the migration process must be monitored [Arm00].
Emulation of hardware / software: Hardware and/or software representing
the original technological environment or parts thereof are replaced by an
emulation. Through using the emulation, original hardware and/or software components
remain capable of providing access to contained information [Rot99, Arm00]. A
variation thereof is the virtualisation of software, where software for a specific
technological environment is replaced with software that targets a virtual
machine as an intermediate platform, which is available for the original technological
environment. In case of technological obsolescence of the original environment,
it is only necessary to port the virtual machine implementation to a new
platform rather than porting each and every specific software. The Universal Virtual
Computer serves as an example of such an approach [Lor01].
Digital archaeology: Software, data and related documentation are analysed in
order to reverse-engineer both syntax and semantics of data. Its description then
serves as a basis for implementing new hardware or writing new software, which
again provides access to contained information [RG99, Arm00].
The former strategies, namely migration, emulation and virtualisation, depend on
preparatory actions taken prior to the event of technological obsolescence. Without
suitable preparation in advance, digital archaeology is the only remaining option,
although it is usually costly [Wet98]. In contrast to the emulation of hardware /
software, both migration of data and digital archaeology are concerned with data
format knowledge on the composition of data.
Outline
Maintaining format-related meta-information is addressed by data format
registries such as the US Library of Congress (LoC) Digital Preservation project, the
Global Digital Format Registry (GDFR), PRONOM, and the recent merger of the
latter two, the Unified Digital Format Registry (UDFR).
Addressing issues related to the representation of information according to defined
data formats, the Open Archival Information System (OAIS) Reference Model
serves as a de-facto standard model regarding the long-term archival and
preservation of data.
For the migration of data, the selection and execution of suitable paths for
conversion is a non-trivial problem. The Typed Object Model (TOM) is an approach
for mediating data between different data formats in a distributed system,
automating the process of migrating data between different formats.
For measuring the quality of data migration in terms of retained information, the
eXtensible Characterisation Language (XCL) project contributes both the
eXtensible Characterisation Extraction Language (XCEL) for extracting information
from data as a property, and the eXtensible Characterisation Definition Language
(XCDL) for the description of such properties for later comparison.
Related to aspects of Digital Preservation, but not explicitly framing itself into
this domain, the Data Format Description Language (DFDL) is an approach for
describing the composition of data according to a data format which focuses on
leveraging existing XML technologies.
2.2.1 Data Format Registries
The preservation strategies previously presented in Section 2.2 depend on
maintaining detailed meta-information on data formats and format-compliant applications
in the long term. Therefore, there is a need for data format registries as custodians
of meta-information on data formats.
Overview
Prominent data format registries are the US Library of Congress Digital Preservation
Project (LoCDP) [AF05], the Global Digital Format Registry (GDFR) [AS03], and the
PRONOM [Bro05] data format registry. At the time of writing, both GDFR and
PRONOM are in the process of merging into the Unified Digital Format Registry (UDFR)
[UDF09].
Rich models for categorising data formats and for managing related meta-information
in high detail are common to LoCDP, GDFR [Abr07b, Abr07a, AG08] and PRONOM
[Bro05]. The model of GDFR extends to complex relationships between different
data formats, such as extensions or versions, on a highly formal level. In varying
forms, these registries provide means for referencing a specific data format, such as
Discussion
Data format registries allow their users to identify and associate data formats with
metadata such as references to their specification, format-compliant applications or
relations between formats required for the migration of data or digital archaeology in
case of technological obsolescence. Examined related work provides definitions and
models concerning format-related meta-information as well as references to other
approaches.
Definitions and models: Besides their rich models on format-related meta-data,
all registries at least consider natural-language descriptions such as textual
specifications suitable for human consumption as a baseline. In the case of PRONOM,
format-related knowledge suitable for machine-processing is provided on the level
of file signatures, allowing for the identification of a file's data format through
automated tools such as Digital Record Object Identification (DROID).
References: In terms of approaches for describing the composition of data,
GDFR refers to a number of format description languages. It explicitly refers
to the eXtensible Characterisation Extraction Language (XCEL), the Bitstream
Syntax Description Language (BSDL) and the Data Format Description Language
(DFDL) among other approaches, including formal grammar notations and XML
schema languages.
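Identification by file signature can be pictured with a minimal sketch. The table below is a hypothetical, hand-written stand-in and not the PRONOM signature format or the DROID implementation; it only matches a few well-known leading byte sequences, whereas registry signatures may be considerably richer.

```python
from typing import Optional

# Hypothetical signature table: format name -> leading "magic" bytes.
SIGNATURES = {
    "PNG": b"\x89PNG\r\n\x1a\n",
    "GIF": b"GIF8",   # common prefix of "GIF87a" and "GIF89a"
    "PDF": b"%PDF-",
}


def identify_format(data: bytes) -> Optional[str]:
    """Return the first format whose signature matches the start of the data."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


png_guess = identify_format(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")
pdf_guess = identify_format(b"%PDF-1.4\n")
unknown_guess = identify_format(b"plain text without a signature")
```

Such a signature identifies a format but says nothing about how the remaining data is composed, which is why it counts as meta-information rather than as a description of the data format itself.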
2.2.2 Open Archival Information System Reference Model
Operating an archive for long-term archival and preservation of data poses a
number of problems, which include the management of how information is represented.
Within short-term transactions, producers and consumers can negotiate the
representation of information to be exchanged. Yet, for long-term archival and preservation,
producers cannot foresee future representations of information, and thus need to
delegate the negotiation with future consumers to suitable archival systems, which
manage a potentially required migration of data.
Overview
The Open Archival Information System (OAIS) provides a reference model for
long-term archival and preservation systems, and serves as the de-facto standard
regarding long-term archival and preservation of digital information [CCS02]. Among other
aspects, OAIS addresses the issue of managing the representation of information.
The OAIS reference model defines a number of processes surrounding
information stored in an OAIS archive between a Producer and its Consumers, as well as the
archive Management. Besides the processes of Preservation Planning and
Administration that coordinate and manage archival operations between involved parties,
actual processes related to archival and preservation are Ingest, Archival Storage,
Data Management and Access, shown in Figure 2.1:
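[Figure 2.1: diagram relating Producer, Consumer and Management to the functional entities Ingest, Archival Storage, Data Management, Access, Preservation Planning and Administration, with the information packages SIP, AIP and DIP.]
Figure 2.1: Overview of OAIS Functional Entities, based on [CCS02].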
Archival Information Package (AIP) conforming to archive policies, such as using
only publicly disclosed data formats. It then generates descriptive information and
coordinates potential updates with Data Management.
Archival Storage: The Archival Storage process is responsible for receiving AIPs
to be put into storage, management of the storage, error checking and replacement
of storage media, their backup for disaster recovery, and finally for providing an
AIP upon request.
Data Management: The Data Management process serves for administrating,
updating and querying of archival databases as well as for general reporting for
the archive.
Access: Upon request by a Consumer, the Access process generates a suitable
Dissemination Information Package (DIP) for the requested AIP and delivers it
to the Consumer.
As both Archival Storage and Data Management are not central to the role of data
formats in OAIS, these processes are mentioned for completeness and are not further
explored.
As can be seen from the Ingest and Access process descriptions, the concepts
of SIP, AIP and DIP are of importance to the OAIS reference model. These are
specialisations of the Information Package, distinguished by their role in the
respective process. As shown in Figure 2.2, an Information Package contains the
Content Information as the actual content to be preserved, as well as related
Preservation Description Information. The Content Information consists of a Data Object
and its Representation Information, which can be used to obtain an Information
Object:
Data Object: A Data Object specialises into either a Physical Object as a
[Figure 2.2: diagram relating the Information Package and its specialisations Submission, Archival and Dissemination Information Package to Content Information and Preservation Description Information, with Content Information consisting of a Data Object and Representation Information.]
Figure 2.2: Overview of OAIS concept relations, based on [CCS02].
Representation Information: Representation Information maps the Data
Object to an Information Object as a more meaningful concept. Representation
Information can be specialised into Structure Information which defines the
mapping of bit sequences into data types, and into Semantic Information which defines
the meaning of data. Representation Information itself may again be represented
as a Data Object, and thus depends on other Representation Information, forming
a Representation Network required for fully describing the meaning of an original
Data Object.
Information Object: An Information Object is obtained by interpreting a Data
Object according to Representation Information.
Representation Information can be considered as a form of data format knowledge,
which may be present in various forms such as textual descriptions in natural
language, formal grammars, or some derivative work thereof, such as software
implementations or even their source code.
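The recursive nature of a Representation Network can be sketched as a simple graph traversal. The function and the dependency table below are hypothetical illustrations and not part of the OAIS reference model; each key is a Data Object and each value lists the Representation Information it depends on.

```python
def representation_network(start: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Collect all Representation Information needed, directly or indirectly, for `start`."""
    needed, stack, seen = [], [start], {start}
    while stack:
        current = stack.pop()
        for info in depends_on.get(current, []):
            if info not in seen:      # guards against cyclic dependencies
                seen.add(info)
                needed.append(info)
                stack.append(info)
    return needed


# Hypothetical example: an image needs its format specification, which is
# itself a document in another format, and so on down to a character set.
depends_on = {
    "image.png": ["PNG specification (PDF)"],
    "PNG specification (PDF)": ["PDF specification (text)"],
    "PDF specification (text)": ["ASCII definition"],
}
network = representation_network("image.png", depends_on)
```

In practice the recursion has to end somewhere, for example at knowledge the archive's designated users are assumed to have; the sketch simply stops at entries without further dependencies.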
Discussion
The OAIS reference model is intended for a long-term preservation archive, which
also has to support the migration of data between different formats. The reference
model thus depends on the availability and applicability of data format knowledge.
It therefore refers to such knowledge in various forms as Representation Information,
which is contained in Content Information in submission, archival and dissemination
IPs.
a definition of how information is represented. Furthermore, the OAIS reference
model itself does not mandate a specific form for data format knowledge as
Representation Information. It allows for a variety of ways to represent data format
knowledge. The model explicitly mentions the option of formal description
languages containing well-defined constructs with which to describe data structures,
referring to formalised approaches in general. Although the OAIS effectively is a
de-facto standard in its domain, it does not provide more specific references or
make a statement regarding the expressivity of existing approaches. For actually
migrating data, an OAIS archive relies on software as a form of Representation
Information to perform the migration.
2.2.3 Typed Object Model
Processing format-compliant data requires suitable, format-compliant applications.
Although an application may be conceptually capable of processing a certain type of
data such as video/audio recordings or text documents, it strongly depends on the
specific digital representation to be parsed, decoded and processed. The diversity of
different data formats thus can force users to mediate data between multiple data
formats as necessary, therefore making it desirable to automate this process.
Overview
The Typed Object Model (TOM) serves for automating the process of migrating
data between different formats in a distributed manner and has been published in
the PhD thesis of John Ockerbloom [Ock98]. TOM defines both a distributed system
and a data model:
Distributed system: For mediating data between different data formats in a
decentralised, scalable manner, TOM describes a distributed system consisting of
agents that handle the processing and conversion of data in heterogeneous data
formats, operating on a distributed type graph. TOM defines the Typed Object
Protocol (TOP) for communicating in such a distributed mediating system. The
actual mediation is executed by type brokers as specialised agents, which offer their
services and perform the actual processing on behalf of clients. Other agents
such as clients can query for type information, get attributes and call methods on
objects or request conversions from a type broker.
Data model: In the TOM data model, information is represented as an object,
which is immutable and has both a value and a type. A value is not restricted to
a digital representation such as finite byte sequences alone, but may also include
abstract forms of representation.
A type defines how the object and its value are to be interpreted. Here, an
important type is the ByteSeqType. Objects of this type have finite byte sequences
as values and thus can be stored and transmitted in a digital form. A type may
define one or more attributes as well as one or more methods for its objects.
An attribute extracts information from an object through a function without
information or use additional parameters. TOM offers subtyping, so a type may
have more specialised subtypes. It also enables substitutability, where given a type
T, a subtype S of T and two objects t and s of types T and S, s can substitute
for t, which allows objects to be considered at different levels of abstraction. The
aggregated typing information defines a type graph on which TOM operates.
For obtaining different representations of the same information in TOM, an
encoding describes a relationship between a pair of objects, the original encoded
object and the resulting encoding object in a different representation. In order to
cope with multiple, semantically equivalent encoding objects for a given encoded
object, encodings in TOM are considered as one-to-many relations. The inverse is
a decoding, which is a many-to-one function. TOM considers encoding as the
refinement of abstract objects. Likewise, it considers decoding as the abstraction
of concrete objects.
Based on these concepts, TOM defines a format as a sequence of encodings to be
applied on objects of a given type, which yields objects of the type ByteSeqType.
A format makes it possible to define a shipped object, which is an object including
its format and thus allows a receiver to decode the object to the type indicated by
the format.
For mediating data between different data formats in TOM, a conversion is a
migration of data, which takes a shipped object as input and produces a shipped
object as output. Often, a conversion between different formats cannot preserve
all present information but only a subset. Therefore, a conversion tries to
approximate the input shipped object. To manage the loss of information of such a
conversion, TOM defines the concept of intersubstitutability, which is given for a
conversion c and a type T if every input and output of the conversion c cannot be
distinguished with regards to the attributes and methods of type T. The degree of
information preservation in a conversion increases with every level down in the
type hierarchy of T due to the specialisation of subtypes. Both substitutability and
intersubstitutability aid in the automated composition and conversion of data
between different formats in TOM.
The TOM approach has been implemented through the Format REgistry
Demonstration (FRED), which also served as a prototype for the Global Digital Format
Registry (GDFR) [Ock06].
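The automated composition of conversions can be pictured as a path search over a graph of available conversions. The following sketch is an assumption-laden illustration and not TOM's actual algorithm or protocol: the format names and the conversion table are hypothetical, and a plain breadth-first search is used, which ignores how much information each step preserves.

```python
from collections import deque
from typing import Optional


def find_conversion_path(conversions: dict[str, list[str]],
                         source: str, target: str) -> Optional[list[str]]:
    """Breadth-first search for the shortest chain of conversions from source to target."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for next_format in conversions.get(path[-1], []):
            if next_format not in visited:
                visited.add(next_format)
                queue.append(path + [next_format])
    return None  # no chain of conversions reaches the target


# Hypothetical conversions offered by type brokers.
conversions = {
    "WordPerfect": ["RTF"],
    "RTF": ["HTML", "PlainText"],
    "HTML": ["PDF"],
}
path = find_conversion_path(conversions, "WordPerfect", "PDF")
```

A search that took intersubstitutability into account would additionally have to track, for each candidate path, the most specific type whose attributes and methods survive every step.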
Discussion
TOM uses format-related knowledge to assist the migration of data between different
data formats. Contrary to other approaches, its main characteristic is the distributed
setup of format-related knowledge among type brokers which provide migration
services to other agents.
Definitions and models: In TOM, information is considered in type-specific
representations. The data format of such a type-specific information is defined
as a sequence of encodings that converts information from its type-specific
Rather than describing the actual coding and structure of format-compliant data
itself, it addresses the migration of data by describing encoding / decoding
relationships between types. Through its use of interfaces, TOM considers digital
objects on varying levels of abstraction, and is therefore able to indirectly
manage the loss of information during a migration of data. External to TOM and
its model, the actual process of accessing and conversion is performed by
software tools which follow the underlying rules and constraints of a data format that
shapes format-compliant data.
Rather than describing the composition of data format instances or the
underlying rules and constraints of a data format itself, TOM's model exposes the type
network composed from software operated in a distributed system. The
composability as a property of TOM refers to the overall type network, and not to the
composition of data.
2.2.4 eXtensible Characterisation Language
Technological obsolescence of file formats threatens long-term accessibility of
contained information. Although data can be migrated in advance in order to prevent
technological obsolescence, this typically leads to information loss due to mismatching
representational capabilities of different data formats. One approach to handle such
information loss is to identify significant information and to monitor its successful
retention after a migration for it to be deemed successful.
Overview
For estimating the success of data migration, the eXtensible Characterisation
Language (XCL) project defines the eXtensible Characterisation Extraction Language
(XCEL) as well as the eXtensible Characterisation Definition Language (XCDL)
for extracting and comparing significant information represented in different data
formats:
eXtensible Characterisation Extraction Language (XCEL): XCEL
intends to describe characteristics of format-compliant data through significant
information, which is represented as a property with a name and a value [SHC08].
Towards that goal, XCEL describes the composition of data through an
XML-based, schema-like definition, where declarative definitions of data types are mixed
with procedural processing instructions.
The basic building blocks of XCEL are XCEL elements which are used to build
an XCEL Tree as a representation which matches with actual data. A symbol
element defines both the encoding and semantics of a byte sequence, may define
constraints for matching and is a leaf of such a tree. A symbol has information on
the placement of its data and employs an absolute addressing scheme, using the
number of bytes consumed so far. An item element defines a logical, structural or
semantic group for one or more child elements, describing either a sequence of
elements, their permutation or a choice of alternative elements. Last but not least, a
copying another XCEL element to the current position (pullXCEL) or for
reconfiguring the parser during its operation (configureParser). Moreover, processing
elements also allow the addition of a filter to a filter chain of an element,
allowing for the translation of data into another representation. Elements and their
contents can be referenced through identifiers and may originate from a separate
file (externalSource) or from another XCEL element (internalSource).
These XCEL elements are used in the schema-like XCEL document, which
consists of the four parts preProcessing, formatDescription, postProcessing and
templates. It is used by an XCEL processor, which processes both the XCEL
document and an input file, and produces an XCEL Tree as Result Tree. In a
following step, an extractor extracts significant information from the Result Tree
which characterises the format-compliant data contained in the input file, and
stores it as an XCDL document.
Example 2.2.1: An example of XCEL is given in Table 2.1 for the PNG IDAT
chunk data structure, which carries compressed, transformed, and in some cases
even fragmented, data representing the actual image. The data structure starts
with the chunkDataLength symbol, which is a 32-bit unsigned integer, followed
by the pngIDATIdentifier, which carries the four-byte ASCII string IDAT
to distinguish it from other chunk data structures. The next processing
statement sets the length of the yet-to-come normDataSymbol identifier to the value
of chunkDataLength, so the normDataSymbol has a defined length. Last but not
least, the remaining crc symbol carries a four-byte CRC value.
eXtensible Characterisation Definition Language (XCDL): XCDL
describes data as a collection of significant information that has been extracted
previously from a Result Tree [BHST08].
During the migration of data, some original input file is migrated to a new data
format, producing a migrated file. After significant information has been extracted
from both the original and the migrated file as XCDL documents, the degree to
which the migration has been successful is measured through a comparison of
the retained significant information. This is performed by a comparator, which
processes both XCDL documents and compares their significant information through
some domain-specific metric for judging the success of a migration.
Besides XCEL and XCDL, the XCL project also tries to address aspects of semantic
mismatch between different XCDL documents through the definition of an ontology.
Moreover, it tries to extend the comparison of significant information beyond the
semantics of data to whether the actual rendering of data to human observers still
carries the same significant information through the use of an information model.
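The comparison step can be illustrated with a deliberately simple metric. The sketch below is not the XCL comparator and does not read XCDL documents; it assumes that the significant properties of each file have already been extracted into name/value pairs, and both the property names and the metric are hypothetical.

```python
def retained_fraction(original: dict[str, object], migrated: dict[str, object]) -> float:
    """Share of the original's properties that reappear with an equal value after migration."""
    if not original:
        return 1.0  # nothing to lose
    kept = sum(1 for name, value in original.items()
               if name in migrated and migrated[name] == value)
    return kept / len(original)


# Hypothetical properties extracted before and after a migration that
# reduced the sample depth of an image.
original = {"imageWidth": 640, "imageHeight": 480, "bitsPerSample": 16, "colourSpace": "RGB"}
migrated = {"imageWidth": 640, "imageHeight": 480, "bitsPerSample": 8, "colourSpace": "RGB"}
score = retained_fraction(original, migrated)
```

A domain-specific metric would typically weight properties differently or tolerate small numeric deviations instead of requiring exact equality.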
Discussion
The use case of XCEL and XCDL is the evaluation to which degree a migration of
    <item identifier="pngIDAT" xsi:type="structuringItem"
          multiple="true">
      <symbol identifier="chunkDataLength" interpretation="uint32"
              length="4"/>
      <symbol identifier="pngIDATIdentifier" interpretation="ASCII"
              optional="false" value="IDAT"/>
      <processing type="pushXCEL" xcelRef="normDataSymbol">
        <processingMethod name="setLength">
          <param valueRef="chunkDataLength"/>
        </processingMethod>
      </processing>
      <symbol identifier="normDataSymbol" interpretation="uint8"
              name="normData"/>
      <symbol identifier="crc" length="4"/>
    </item>
Table 2.1: Excerpt of an XCEL description for a PNG IDAT chunk data structure,
carrying transformed and compressed image data, taken from [SHC08].
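For comparison with the declarative excerpt in Table 2.1, the same chunk layout can be read procedurally. The sketch below is not an XCEL processor; the helper names are hypothetical, and it relies on the PNG conventions that the length field is a big-endian 32-bit integer and that the CRC covers the chunk type and chunk data.

```python
import struct
import zlib


def make_chunk(chunk_type: bytes, body: bytes) -> bytes:
    """Build a chunk: 4-byte length, 4-byte type, data, 4-byte CRC over type and data."""
    crc = zlib.crc32(chunk_type + body)
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


def parse_chunk(data: bytes, offset: int = 0):
    """Read one chunk at `offset`; return type, data, CRC validity and the next offset."""
    (length,) = struct.unpack_from(">I", data, offset)
    chunk_type = data[offset + 4:offset + 8]
    body = data[offset + 8:offset + 8 + length]
    (crc,) = struct.unpack_from(">I", data, offset + 8 + length)
    crc_ok = zlib.crc32(chunk_type + body) == crc
    return chunk_type, body, crc_ok, offset + 12 + length


# One compressed stream, fragmented over two IDAT chunks.
pixels = b"\x00" + bytes(range(12))
stream = zlib.compress(pixels)
half = len(stream) // 2
blob = make_chunk(b"IDAT", stream[:half]) + make_chunk(b"IDAT", stream[half:])

offset, parts, all_valid = 0, [], True
while offset < len(blob):
    chunk_type, body, crc_ok, offset = parse_chunk(blob, offset)
    all_valid = all_valid and crc_ok and chunk_type == b"IDAT"
    parts.append(body)

# The fragments have to be concatenated before the stream can be decompressed.
restored = zlib.decompress(b"".join(parts))
```

The procedural version hard-wires the order of segmenting, concatenating and transforming, which is exactly the knowledge a declarative description has to capture explicitly.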
Definitions and models: The XCL project gives no explicit formal
definition of what a data format actually is. Through the definition of XCEL, a quite
complex model is given for describing a data format as a tree-based hierarchical
representation of information. XCEL has a number of interesting properties, such
as the support of filters for handling the translation between different
representations of information, for partial descriptions which cover only part of an input
file, or for allowing dependencies such as the placement of elements in the original
data to be evaluated dynamically at runtime, which is of interest for address-based
references in data formats such as the ISO Base Media File Format. Regarding XCDL,
it is interesting to note that [BHST08] states it not to be intended as a language
for multi-purpose representation of information.
Classification: The XCEL approach is both declarative and machine-processible,
but the examined publications provide no formalised model for universal
applicability.
Descriptive capabilities: XCEL clearly provides support for segmenting
structured data through items and for decoding primitive data through symbols,
although the length of primitive data is limited to multiples of entire octets rather
than having bit granularity. Due to its concept of filters used in a filter chain, and
through using the internalSource attribute, XCEL can transform transcoded
data and enable further processing of the result. Using the special normData
symbol name for processing fragmented PNG IDAT chunks [SHC08], and again
using the internalSource attribute, XCEL provides at least partial support for
2.2.5 Data Format Description Language
For processing data in XML-based representations, a number of standardised
technologies exist, such as transforming documents using Extensible Stylesheet
Language Transformations (XSLT), or filtering elements using XPath. By translating
the representation of data from the binary domain to XML, these technologies can
be leveraged for use on data from the binary domain as well.
Overview
The Data Format Description Language (DFDL) is an extension to the W3C XML
Schema Definition Language (XSDL) and intends to describe arbitrary data
formats to enable the translation from format-compliant data to an XML representation
and vice versa. At the time of writing, the current version of DFDL is 1.0 as defined
in Draft 038, where several parts of the language specification are designated to be
in flux and to be changed in upcoming versions of the draft [PHB+10].
In the DFDL approach, a processor processes data given in a format as described
by a schema:
Processor: A processor typically is either a DFDL Parser or a DFDL Unparser,
where the former parses the format-compliant representation and serialises it to
an XML representation. Vice versa, the latter parses the XML representation
and unparses its format-compliant binary representation, performing the reverse
direction.
Schema: A DFDL schema describes the composition of data through XML
Schema extended with DFDL annotations. While XML Schema provides the
means to describe both primitive and complex data types, DFDL annotations
describe additional information, such as the length or the binary encoding of a data
type. For handling dynamic dependencies, where parsed and decoded information
is used for further parsing, DFDL employs a subset of XPath 2.0 as expression
language, including functions for boolean, string and date operations. Since DFDL
has the explicit goal of round-trip support for data parsed and unparsed, every
schema is required to be unambiguous during unparsing, that is, only one binary
representation may exist.
Example 2.2.2: A DFDL example is shown in Table 2.2, where structured data
is described as a sequence of four primitive data, namely an integer w, an integer
x, a double-precision floating-point number y and a single-precision floating-point
number z, all in big-endian byte order.
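The byte layout implied by Example 2.2.2 can be mirrored with a fixed-layout sketch. This is not a DFDL processor; it assumes that the two integers occupy four bytes each (the XML Schema int type is a 32-bit signed integer) and that the floating-point numbers use IEEE 754 with eight and four bytes respectively, and the function names are hypothetical.

```python
import struct

# ">" selects big-endian byte order without padding; "i" is a 4-byte signed
# integer, "d" an 8-byte double and "f" a 4-byte float.
LAYOUT = struct.Struct(">iidf")


def parse_example1(data: bytes) -> dict:
    """Parse the 20-byte record into its four named values."""
    w, x, y, z = LAYOUT.unpack(data)
    return {"w": w, "x": x, "y": y, "z": z}


def unparse_example1(values: dict) -> bytes:
    """Serialise the four named values back into the binary record."""
    return LAYOUT.pack(values["w"], values["x"], values["y"], values["z"])


# Values chosen to be exactly representable, so the round trip is lossless.
record = {"w": 1, "x": -2, "y": 3.25, "z": 0.5}
encoded = unparse_example1(record)
decoded = parse_example1(encoded)
```

The round trip between `unparse_example1` and `parse_example1` corresponds to the round-trip goal stated for DFDL, with the difference that the layout here is fixed in code rather than described by a schema.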
Discussion
DFDL assists the processing of data for binary data formats through standardised
XML tools by translating information from its format-compliant representation to
an XML representation, and vice versa.
1 <xs:complexType name="example1"> 2 <xs:sequence> 3 <xs:element name="w" type="int"> 4 <xs:annotation> 5 <xs:appinfo source="http://www.ogf.org/dfdl/"> 6 <dfdl:element representation="binary" 7 byteOrder="bigEndian" 8 lengthKind="implicit"/> 9 </xs:appinfo> 10 </xs:annotation> 11 </xs:element> 12 <xs:element name="x" type="int "> 13 <xs:annotation> 14 <xs:appinfo source="http://www.ogf.org/dfdl/"> 15 <dfdl:element representation="binary" 16 byteOrder="bigEndian" 17 lengthKind="implicit"/> 18 </xs:appinfo> 19 </xs:annotation> 20 </xs:element> 21 <xs:element name="y" type="double"> 22 <xs:annotation> 23 <xs:appinfo source="http://www.ogf.org/dfdl/"> 24 <dfdl:element representation="binary" 25 byteOrder="bigEndian" 26 lengthKind="implicit"/> 27 </xs:appinfo> 28 </xs:annotation> 29 </xs:element> 30 <xs:element name="z" type="float" > 31 <xs:annotation> 32 <xs:appinfo source="http://www.ogf.org/dfdl/"> 33 <dfdl:element representation="binary" 34 byteOrder="bigEndian" 35 lengthKind="implicit" 36 binaryFloatRep="ieee" /> 37 </xs:appinfo> 38 </xs:annotation> 39 </xs:element> 40 </xs:sequence> 41 </xs:complexType>
Table 2.2: Excerpt of a sample data structure defined using DFDL, taken from
through the extension of XML Schema. It assumes that a data format defines the
composition of hierarchically-nested data, at the same time explicitly claiming
its applicability to the description of any data format.
The DFDL specification includes two noteworthy concepts. It distinguishes
between approaches for making data format knowledge explicit as either prescriptive
or descriptive. The specification document categorises approaches such as ASN.1
into the former category, and itself into the latter. Moreover, DFDL distinguishes
between data as either content or framing, depending on its purpose in the format,
and the language allows framing to be hidden from later processing.
Classification: DFDL is a declarative approach which is machine-processible. In
the examined publications, no formalised model for universal applicability is given.
Descriptive capabilities: DFDL clearly supports the segmentation of
structured data and the decoding of primitive data, also supporting data with lengths
of bit granularity through dfdl:lengthUnits. Although XML Schema is a
powerful basis for DFDL to extend, there are limitations to DFDL concerning
data format description, despite its explicit claim of universal applicability. Most
notably, DFDL itself acknowledges its lack of support for cases where one
element's value becomes the representation of another element, termed layering
by DFDL, which has been confirmed as a limitation and explicitly deferred to a
later revision.
Yet, layering is required for handling transformed or fragmented data, as in these
cases, the value of one or more bit sequences represents another bit sequence
when processed accordingly. For example, to completely describe video and audio
streams typically stored as interleaved fragments in multimedia containers such as
the MPEG-4 File Format, fragments of a specific stream have to be concatenated
in order to analyse the stream according to its own format-specific rules, e.g. for
MPEG-4 Video or MPEG-4 Advanced Audio Coding (AAC).
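The need for concatenation can be made concrete with a minimal Python sketch. It assumes a deliberately simplified, hypothetical container layout that is not the MPEG-4 File Format: a sequence of fragments, each consisting of a 1-byte stream identifier, a 1-byte payload length and the payload itself. The function names are illustrative only. Only after the fragments of one stream have been concatenated does a bit sequence exist that could be analysed according to the rules of that stream's own format.

```python
from collections import defaultdict


def split_fragments(container: bytes) -> list[tuple[int, bytes]]:
    """Split a hypothetical container into (stream_id, payload) fragments."""
    fragments = []
    pos = 0
    while pos < len(container):
        if pos + 2 > len(container):
            raise ValueError("truncated fragment header")
        stream_id, length = container[pos], container[pos + 1]
        payload = container[pos + 2:pos + 2 + length]
        if len(payload) != length:
            raise ValueError("truncated fragment payload")
        fragments.append((stream_id, payload))
        pos += 2 + length
    return fragments


def reassemble_streams(container: bytes) -> dict[int, bytes]:
    """Concatenate the interleaved fragments of each stream in file order."""
    streams = defaultdict(bytearray)
    for stream_id, payload in split_fragments(container):
        streams[stream_id] += payload
    return {stream_id: bytes(buf) for stream_id, buf in streams.items()}


# Two interleaved streams, with identifier 1 and identifier 2.
container = (bytes([1, 2]) + b"vi" + bytes([2, 2]) + b"au"
             + bytes([1, 3]) + b"deo" + bytes([2, 3]) + b"dio")
assert reassemble_streams(container) == {1: b"video", 2: b"audio"}
```

The first function corresponds to segmentation of the container, whereas the second produces new bit sequences whose representation is spread over several fragments, which is the step referred to as layering.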
Part of this limitation can be attributed to DFDL's explicit assumption of data to
be hierarchically-nested, which fits well with the tree-based structural model
of XML, where a logical node may have multiple children, but has at most one
parent. Yet, for the concatenation of fragmented data, a logical node is required
to have multiple parents as well, pointing towards a graph-based structural model.
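The structural difference can be illustrated with a short Python sketch over a hypothetical set of parent-child edges. The node names and the helper function are assumptions made for illustration. In a tree every node has at most one parent, so any node reported by the helper indicates that a graph-based model is needed.

```python
def nodes_with_multiple_parents(edges):
    """Return the sorted names of nodes that have more than one parent."""
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    return sorted(child for child, ps in parents.items() if len(ps) > 1)


# A stream whose representation is the concatenation of two fragments
# is a child of both fragments, which a tree cannot express.
edges = [
    ("container", "fragment_1"),
    ("container", "fragment_2"),
    ("fragment_1", "video_stream"),
    ("fragment_2", "video_stream"),
]
assert nodes_with_multiple_parents(edges) == ["video_stream"]
```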