Using XML for Long-term Preservation
Experiences from the DiVA Project
Peter Hansson, peter.hansson@ub.uu.se
Uwe Klosa, uwe.klosa@ub.uu.se
Outline
• DiVA Project
• DiVA Document Format
• DiVA Archive
DiVA Project
• Goals
– Create searchable archives for digital documents
– Disseminate research information to researchers and the public
– Develop an electronic publishing system
– Rationalize workflow
(author typing -> print, electronic publishing)
• Participants
– Uppsala University, Umeå University, Stockholm University, Södertörn University, Örebro University and more to come!
DiVA Project
• Links
– http://publications.uu.se
– http://publications.uu.se/portal
– http://publications.uu.se/etd2003
Document format?
• Decided early on to use XML
• Unicode support was crucial
• A simple and human-readable format
• A format definition was developed (defined by an XML Schema)
– 99 elements for metadata
– DocBook is used for the content
DiVA Document Format
• Subsystem interface
• Source for other formats (TEI, MARC 21, DC etc.)
• Long-term preservation format
DiVA Document Format - root element
<?xml version="1.0" encoding="UTF-8"?>
<documents>
  <document> [..] </document>
  <document> [..] </document>
</documents>
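As a rough illustration of the root element, the sketch below walks the `<document>` children of `<documents>` using Python's standard library. The sample input is a hypothetical stand-in with empty documents, the function name `iter_documents` is made up for this example, and the sketch assumes the elements carry no XML namespace, which the actual schema may well do.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the root-element slide; real files hold
# full metadata and content inside each <document>.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<documents>
  <document/>
  <document/>
</documents>"""


def iter_documents(xml_bytes):
    """Yield each <document> element found directly under <documents>."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "documents":
        raise ValueError(f"unexpected root element: {root.tag}")
    yield from root.findall("document")


docs = list(iter_documents(SAMPLE))
print(len(docs))
```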
DiVA Document Format - properties
[..]
<document>
  [..]
  <properties>
    <property>book</property>
    <property>thesis</property>
    <property>postgraduateThesis</property>
    <property>doctoralThesis</property>
    <property>comprehensiveDissertation</property>
  </properties>
  [..]
</document>
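A subsystem reading this format might test a document's type by looking up its property values. The following is only a sketch: it assumes all `<property>` elements sit inside a single `<properties>` block directly under `<document>`, that no XML namespace is in use, and the helper name `document_properties` is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built from the property values on the slide.
SAMPLE = """<document>
  <properties>
    <property>book</property>
    <property>thesis</property>
    <property>postgraduateThesis</property>
    <property>doctoralThesis</property>
    <property>comprehensiveDissertation</property>
  </properties>
</document>"""


def document_properties(document):
    """Return the text of every <property> under the document's <properties>."""
    found = document.findall("./properties/property")
    return [p.text.strip() for p in found if p.text]


doc = ET.fromstring(SAMPLE)
props = document_properties(doc)
is_doctoral_thesis = "doctoralThesis" in props
print(props, is_doctoral_thesis)
```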
DiVA Document Format - manifestations
[..]
<document>
  [..]
  <manifestations>
    <manifestation>
      [series, publication date, publishers, distributors, archivers, identifiers]
    </manifestation>
  </manifestations>
  [..]
</document>
DiVA Document Format - specifics
[..]
<specifics type="thesis">
[..]
<disputation>
[place, disputation date, disputation time, language, opponents]
</disputation>
[..]
</specifics>
[..]
DiVA Document Format - person, organisation
<person>
[names, addresses, identifiers]
</person>
<organisation>
[organisation names, addresses, identifiers, parent organisation]
</organisation>
DiVA Document Format - creator
<creator>
<properties>
<property type="role">author</property>
</properties>
<person>
[names, addresses, identifiers]
</person>
</creator>
URN:NBN:se
(Uniform Resource Name:National Bibliographic Number:Sweden)
• Persistent identifiers for electronic resources
• Resolver at Swedish Royal Library
http://urn.kb.se/resolve?urn=
URN:NBN:se
(Uniform Resource Name:National Bibliographic Number:Sweden)
urn:nbn:se:uu:diva-3344
– uu: Uppsala University
– diva: archive name
– 3344: serial number
Resolvable at http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-3344
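The identifier layout above can be sketched in a few lines of Python. This is a minimal illustration that handles only the form shown here (institution code, archive name, numeric serial number). The function names and the returned dictionary keys are hypothetical, and the resolver URL is built by plain concatenation, as in the example above.

```python
RESOLVER = "http://urn.kb.se/resolve?urn="


def parse_diva_urn(urn):
    """Split e.g. 'urn:nbn:se:uu:diva-3344' into institution, archive, serial."""
    parts = urn.split(":")
    if len(parts) != 5 or [p.lower() for p in parts[:3]] != ["urn", "nbn", "se"]:
        raise ValueError(f"not a urn:nbn:se identifier of the expected form: {urn}")
    archive, sep, serial = parts[4].partition("-")
    if not sep or not serial.isdigit():
        raise ValueError(f"missing or non-numeric serial number: {urn}")
    return {"institution": parts[3], "archive": archive, "serial": int(serial)}


def resolver_url(urn):
    """Return the address at which the URN can be resolved."""
    return RESOLVER + urn


print(parse_diva_urn("urn:nbn:se:uu:diva-3344"))
print(resolver_url("urn:nbn:se:uu:diva-3344"))
```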
Archive Structure
$archive_home
  urn
    nbn
      se
        uu
          diva
            1441
              urn_nbn_se_uu_diva-1441__metadata.xml
              1
                urn_nbn_se_uu_diva-1441-1__fulltext.pdf
                urn_nbn_se_uu_diva-1441-1__fulltext.pdf.sha
                urn_nbn_se_uu_diva-1441-1__metadata.xml
                urn_nbn_se_uu_diva-1441-1__metadata.xml.sha
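The mapping from identifier to archive location, and the role of the `.sha` companion files, might look roughly like the sketch below. Several things here are assumptions rather than facts from the slides: that the `.sha` files hold a hexadecimal SHA-1 digest as their first whitespace-separated token, that directory names follow directly from the URN components, and all function names, which are invented for this example.

```python
import hashlib
from pathlib import PurePosixPath


def archive_dir(archive_home, urn):
    """Map 'urn:nbn:se:uu:diva-1441' to <archive_home>/urn/nbn/se/uu/diva/1441."""
    *prefix, last = urn.split(":")
    archive, _, serial = last.partition("-")
    return PurePosixPath(archive_home, *prefix, archive, serial)


def metadata_name(urn):
    """Build the metadata file name by replacing ':' with '_' in the URN."""
    return urn.replace(":", "_") + "__metadata.xml"


def matches_checksum(data, sha_file_text):
    """Compare data against a stored digest (assumed here to be SHA-1 hex)."""
    tokens = sha_file_text.split()
    return bool(tokens) and hashlib.sha1(data).hexdigest() == tokens[0].lower()


print(archive_dir("/archive", "urn:nbn:se:uu:diva-1441"))
print(metadata_name("urn:nbn:se:uu:diva-1441"))
```

Verifying a stored file against its checksum on every read, or on a schedule, is one way to detect silent corruption in long-term storage.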
DiVA Archive
[Diagram: DiVA Manager, Web portal, local long-term storage, national long-term storage; connections labeled "xml" and "packages"]
Summary
• XML for long-term preservation - DiVA Document Format
• URN:NBN identifiers
• DiVA Archive