Declarative Access to Filesystem Data : New application domains for XML database management systems


Alexander Holupirek

Dissertation submitted for the academic degree of Doctor of Natural Sciences (Dr. rer. nat.)
Department of Computer and Information Science, Section of Mathematics and Natural Sciences, Universität Konstanz

Referees: Prof. Dr. Marc H. Scholl, Prof. Dr. Marcel Waldvogel
Date of the oral examination: 17 July 2012

Konstanzer Online-Publikations-System (KOPS) URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-206486


Abstract

XML and state-of-the-art XML database management systems (XML-DBMSs) can play a leading role in far more application domains than is currently the case. Even in their basic configuration, they contain all components necessary to act as central systems for complex search and retrieval tasks. They provide language-specific indexing of full-text documents and can store structured, semi-structured, and binary data. In addition, they offer a great variety of standardized languages (XQuery, XSLT, XQuery Full Text, etc.) for developing applications inside a pure XML technology stack. The benefits are obvious: data, logic, and presentation tiers can operate on a single data model, and no conversions have to be applied when switching between them.

This thesis deals with the design and development of XML/XQuery-driven information architectures that process formerly heterogeneous data sources in a standardized and uniform manner. Filesystems and their vast numbers of different file types are a prime example of such a heterogeneous dataspace. A new XML dialect, the Filesystem Markup Language (FSML), is introduced to construct a database view of the filesystem and its contents. FSML provides a uniform view of the filesystem’s contents and allows developers to leverage the complete XML technology stack on filesystem data.

BaseX, a high-performance, native XML-DBMS developed at the University of Konstanz, is pushed to new application domains. We interface the database system with the operating system kernel and implement a database/filesystem hybrid (BaseX-FS), which works on FSML database instances. A joint storage for both the filesystem and the database is established, which allows both developers and users to access data via the conventional and proven filesystem interface and, in addition, through a novel declarative, database-supported interface. As a direct consequence, XML languages such as XQuery can be used by applications and developers to analyze and process filesystem data. Smarter ways of accessing personal information stored in filesystems are achieved by retrieval strategies that work with no, partial, or full knowledge about the structure, format, and content of the data (“Query the filesystem like a database”).

In combination with BaseX-Web, a database extension that facilitates the development of desktop-like web applications, we present a system architecture that makes it easier for application developers to build content-oriented (data-centric) retrieval and search applications dealing with files and their contents. The proposed architecture is ready to drive (expert) information systems that work with distinct data sources, using an XQuery-driven development approach. As a concluding proof of concept, a complete development cycle for an OPAC (Online Public Access Catalogue) system is presented in detail.


Zusammenfassung (German Abstract)

XML on the one hand and modern XML database management systems (XML-DBMSs) on the other can, as a base technology, achieve far more than they are currently given credit for. Even in their basic configuration, they contain all the components needed to build and operate complex search and information services. Handling full texts and indexing them in a language-specific way is as much a task of a modern XML-DBMS as storing structured, semi-structured, or binary data. These systems come with a rich arsenal of XML processing languages (XQuery, XSLT, XQuery Full Text, etc.) and thus offer a complete technology stack that allows applications to be developed within a pure environment, that is, one based solely on XML technology. The advantages are obvious: from storage through processing to the presentation of results, the same data model can be used without transformation between the individual layers of a system architecture.

This thesis explores the use of XML-DBMSs on previously unknown terrain and investigates their potential applications within modern operating systems. We show how XML-DBMSs can be used to create a declarative interface for querying filesystem contents via XQuery, and we implement a hybrid database filesystem (BaseX-FS). This technology study makes it possible to work on filesystem data both conventionally, that is, via the system calls and the filesystem namespace offered by the operating system, and with the declarative access methods offered by the database system. In particular, this means that files stored in BaseX-FS can be retrieved and processed semantically and by content via XQuery, and also that content-related data of a file can be exported through the directory hierarchy and processed with conventional file I/O.

Using BaseX-FS as the base architecture, we can show that numerous services, such as desktop search engines, can be implemented in a far more lightweight fashion and functionally extended compared to what has been possible so far. Together with BaseX-Web, a database extension that allows desktop-like web applications to be developed, we show that the presented extended database architecture is very well suited for building expert search systems such as an Online Public Access Catalogue (OPAC).


Contents

Abstract
Zusammenfassung

1 Introduction
  1.1 Motivation
    1.1.1 Intrinsic Motivation - Personal Data Mess
    1.1.2 Professional Challenge - Retrieval Support for Filesystems
  1.2 Problem Description
  1.3 Research Approach
  1.4 Contribution and Outline

2 The BaseX Filesystem View
  2.1 Joint Storage for Filesystem and Database
    2.1.1 The pre/distance/size Encoding
    2.1.2 The Encoded File Hierarchy
  2.2 Leverage Tacit Information Hidden in Files
    2.2.1 Transducers – Filetype-specific Data Extractors
    2.2.2 Implementation of a Transducer
  2.3 A Deeper Filesystem – The Metadata Hierarchy
  2.4 Related Work
  2.5 In a Nutshell

3 An XML Database as Filesystem
  3.1 On Filesystem Prototyping
    3.1.1 Stackable Filesystems
    3.1.2 Filesystem in Userspace
  3.2 Mounting the Database as a Filesystem
    3.2.1 System Architecture
    3.2.2 Implementation Details
    3.2.3 Assessment
  3.3 Database-aware Applications
    3.3.1 XQuery your Filesystem
    3.3.2 Visual Access to Large Filesystem Data
  3.4 Considerations

4 XQuery Application Framework
  4.1 Maturity of Web Applications
  4.2 Related Work
    4.2.1 Sausalito – XQuery in the Cloud
    4.2.2 eXist – the XQuery Servlet
  4.3 System Overview
    4.3.1 Model-View-Controller
    4.3.2 Application Layout
    4.3.3 Request-Response Cycle
  4.4 Summary

5 Kickstarting an Infrastructure
  5.1 Online Public Access Catalog (OPAC)
  5.2 Konstanz Online Publication System (KOPS)
  5.3 XML OPAC
    5.3.1 Intention
    5.3.2 Foundation: General System Setup
    5.3.3 Configuration: Shaping a Retrieval Application
  5.4 Evaluation Setup
  5.5 Queries and Performance Results
    5.5.1 Keyword Search
    5.5.2 Phrase Search
    5.5.3 Boolean Search
  5.6 Summary

6 Conclusion

List of Figures
List of Listings
List of Tables
Bibliography
Appendix

1 Introduction

Today, almost exactly 14 years after the W3C released the first XML Recommendation on February 10, 1998 [8], XML has become an integral part of modern information systems. The markup language was originally envisioned as a language for defining new document formats and is especially well suited for that purpose. In addition, it offers a rich set of accompanying standards (see http://www.w3.org/standards/xml/), languages, and processing techniques for dealing with XML data:

XQuery 1.0: An XML Query Language “that uses the structure of XML intelligently [to] express queries across all […] kinds of data, whether physically stored in XML or viewed as XML via middleware. XQuery is a full declarative programming language, and supports user-defined functions, external function libraries (modules) referenced by URI, and system-specific native functions.” [53]

XQuery Update Facility (XQUF) “provides expressions that can be used to make persistent changes to instances of the XQuery 1.0 and XPath 2.0 Data Model.” [54]

XQuery and XPath Full Text (XQFT) “a language that extends XQuery 1.0 and XPath 2.0 with full-text search capabilities. XML documents may contain highly structured data (fixed schemas, known types such as numbers, dates), semi-structured data (flexible schemas and types), markup data (text with embedded tags), and unstructured data (untagged free-flowing text). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.” [12]

XSLT (XSL Transformations) 2.0 “a language for transforming XML documents into other XML documents. The term stylesheet reflects the fact that one of the important roles of XSLT is to add styling information to an XML source document, by transforming it into a document consisting of XSL formatting objects (see Extensible Stylesheet Language (XSL)), or into another presentation-oriented format such as HTML, XHTML, or SVG. However, XSLT is used for a wide range of transformation tasks, not exclusively for formatting and presentation applications.” [43]

XSL: The Extensible Stylesheet Language. “Given a class of arbitrarily structured XML documents or data files, designers use an XSL stylesheet to express their intentions about how that structured content should be presented; that is, how the source content should be styled, laid out, and paginated onto some presentation medium, such as a window in a Web browser or a hand-held device, or a set of physical pages in a catalog, report, pamphlet, or book.” [4]

XProc: An XML Pipeline Language “for describing operations to be performed on XML documents. An XML Pipeline specifies a sequence of operations to be performed on zero or more XML documents. Pipelines generally accept zero or more XML documents as input and produce zero or more XML documents as output.” [66] The operations can differ in nature but typically include validation, transformation, or querying of XML data.

XQSE: XQuery Scripting Extensions. “XQuery is a functional language that is Turing-complete and well suited to write code that ranges from simple queries to complete applications. However, some categories of applications are more easily implemented by combining XQuery capabilities with some imperative features, such as the ability to explicitly manage internal states. The same issue stands for XQuery enriched with the XQuery Update Facility [...]. The scripting extension is intended to overcome this problem, and allow programmers to write such applications without relying on embedding XQuery into an external language.” [15], [60]

Fourteen years ago, a key feature, the easy definition of new document formats, paved the way for the huge success of XML as a data exchange format. Compared to ASN.1 (a standard for the abstract definition of data types and, at that time, an established method for communication between heterogeneous systems), the uniform description of data in XML and its subsequent processing is a straightforward task. XML, as a textual format, is easy to edit, simple to parse, and may represent structured, semi-structured, or unstructured data. Associating XML files with a schema allows XML contents to be validated, but a less complicated ad hoc approach, the so-called schema-agnostic processing, is possible as well and widely used in practice.

While XML, in its early years, was mostly used for data exchange (for example as a replacement for older formats in Electronic Data Interchange (EDI)), it was soon accepted as a suitable data storage format by many applications. In the beginning, only small files such as the famous .ini or other textual configuration files were replaced. But other applications, such as Apache’s ant(1) software build tool, chose XML from the start. (Apache Ant is very similar to the popular Unix make(1) tool. Its mission is to orchestrate processes that depend upon each other and are described in build files as targets and extension points. XML is used in the build files to define the rules to compile, assemble, test, and run applications.) The integration of XML parsers into just about every common programming language made this an understandable choice. It allows the use of a standardized toolchain to parse files, check their validity against a Document Type Definition (DTD) or XML Schema Definition (XSD), and subsequently process the data accordingly. The files are readable by both humans and machines and can therefore easily be modified and adapted manually or automatically.

As the story goes on, more applications joined the party and chose XML as a storage format. A prime example is the OASIS Open Document Format (ODF). It “is an open XML-based document file format for office applications to be used for documents containing text, spreadsheets, charts, and graphical elements. The file format makes transformations to other formats simple by leveraging and reusing existing standards wherever possible.” [67]

What we can observe in general today is an ever increasing number of XML collections emerging in different areas of application. The established practice of storing XML files in the filesystem is more and more becoming a bottleneck, and an increasing interest in supporting database technologies can be observed, especially in the industrial sector. During the first hype of XML, processing XML with dedicated database systems could not fulfill people’s expectations. Systems were unstable and not ready for production, or did not meet the demands in terms of processing speed or scalability. After a poor start, the situation has changed. Now, a decade later, we find market-ready XML databases in the portfolio of just about every big player. Besides well-established database providers, such as Oracle, IBM, and Microsoft, several smaller companies and open source projects, solely focused on XML, have emerged and matured:

MarkLogic is the leading company in the niche market of XML database management systems. Their credo is to provide “21st century technology for 21st century challenges” [45]. For MarkLogic “traditional relational databases were built for another era and organizations are seeking alternatives to address today’s information management challenges” [45]. “Organizations are struggling to manage and leverage Big Data. Unstructured information and other complex, valuable data can be particularly difficult to capitalize on. Examples of unstructured information include: documents, rich media like images or videos, metadata, content, user-generated content, RSS feeds, e-mail, geospatial data, and XML among others. Typically, unstructured information has one or more of the following characteristics:
• Heterogeneous (different formats, varying standards, irregular lengths, etc.)
• Constantly evolving in ways that may be unanticipated
• Growing exponentially
These characteristics make it difficult to manage unstructured information using previous technologies, such as relational databases, which typically expect reasonably-sized data that is normalized and conforms to a pre-defined schema. MarkLogic 5 is the company’s flagship product: a next generation database for managing and leveraging Big Data and unstructured information. Such information may be textual, irregular, hierarchical, de-normalized, time-varying, or structured in an unexpected way.” [46]

Documentum xDB is offered by EMC Corporation as a “high-performance and scalable native XML database designed for software developers who require advanced XML data processing and storage functionality within their applications. xDB enables high-speed storage and manipulation of very large numbers of XML documents. Using xDB, programmers can build custom XML content management solutions and store XML documents in an integrated, highly scalable, high-performance, object-oriented database.” [14]

Qizx is “a fast XML database engine fully supporting XQuery. Qizx is designed from the ground up to perform fast queries, without requiring specific efforts from users. Queries run at full speed out of the box without the need to manually define indexes, tweak parameters, or add a new index.” [51]

eXist-db “is an open source database management system. It stores XML data according to the XML data model and features efficient, index-based XQuery processing. It supports many Web 2.0 technology standards, making it an excellent platform for developing web-based applications.” [48]

Sedna “is a free native XML database which provides a full range of core database services - persistent storage, ACID transactions, security, indices, hot backup. Flexible XML processing facilities include W3C XQuery implementation, tight integration of XQuery with full-text search facilities and a node-level update language.” [36]

BaseX “is a very light-weight, high-performance and scalable XML Database engine and XPath/XQuery Processor, including full support for the W3C Update and Full Text extensions. Various interactive and user-friendly graphical user interfaces give great insight into stored XML documents. BaseX is developed at the Chair of Databases and Information Systems at the University of Konstanz as an open source system under the terms of the BSD license.” [13]

In addition, there are Saxon (http://www.saxonica.com/) and Zorba (http://zorba-xquery.com), powerful XQuery processors of high renown in the community.

Given these developments, we want to tap the full potential of current XML-DBMSs and put them to the test in somewhat unfamiliar territory. Our bold statement is that XML databases with their current characteristics can serve as core components for search and retrieval systems on heterogeneous data sources. Filesystems are prime examples: they store vast amounts of data in heterogeneous formats. Providing means to programmatically query, process, and analyze personal data stored in filesystems would be a major improvement over current filesystem abilities.

1.1 Motivation

1.1.1 Intrinsic Motivation - Personal Data Mess

Trying to find things that I definitely know I have is a common task for me. This is true for my real life (but, lucky me, there are always nice people around helping me out) and even more so for my digital self. As a matter of fact, it is getting worse all the time. Cloud storage and multiple mobile devices do not simplify matters in this respect. With every new machine, the disk space in which I can mislay things increases. Of course, it may be considered a bad habit to just copy data over from my old machine to the new laptop instead of curating, archiving, and purging data from the working system. But I seem to be in good company and in line with established practice: back in 2002, Jim Gray, while talking about data curation, pointed out that a “decade ago, 100 GB was considered a huge database. Today it is about 1/2 of a disk drive and is quite manageable. […] so it is both economical and desirable to bring the old data forward and store it on newer technology.” [20]

In fact, current operating systems do a remarkably good job at the mere storage of (personal) data in state-of-the-art filesystems. Convenient access to and information retrieval from such data, however, is crucial to leverage the stored information. Recent variants of operating systems therefore come with integrated search capabilities (e.g., Instant Search on the Windows platform, the Spotlight architecture on Macintosh, or Tracker and Beagle on Linux systems) or can easily be equipped with a third-party desktop search application. These tools clearly offer a smarter way to access personal information stored in the filesystem (and tell me that I am not alone in needing help to recover assets I once stored).

1.1.2 Professional Challenge - Retrieval Support for Filesystems

Working in a database group, however, I cannot be satisfied. We, by profession, want to store anything we consider useful in a database and have it ready to be queried.

The way we want to explore our data is via a standardized and established database query language (DQL). Finding things using a keyword-based search expression is just the beginning of what we would expect from an information system that keeps track of our (personal) data. Since the beginning of database management systems, there has been a desire to store all data in a database and have it ready to be queried. Several industrial and research efforts, such as WinFS or the Be Filesystem, have been made to push the filesystem into a database. None of them reached production quality. Offshoots, like Microsoft’s Instant Search or Apple’s Spotlight architecture, however, can be found in all recent operating system variants, and user demand for products that help to find relevant content can be inferred from the increasing popularity of desktop search engines, such as Google’s or Yahoo’s Desktop Search. While these tools offer a smarter way to access personal information stored in the filesystem, the keyword-driven search approach used by today’s search engines is, while perfectly suitable for everyday business, just the beginning of what can be expected. Additional support for database-style query languages to “filter, select, search, join, sort, group, aggregate, transform, and restructure”, in short, to analyze and programmatically process stored data, would be a logical further development.

1.2 Problem Description

We generally face the fact that the amount of data stored in filesystems on personal computers is growing steadily. This comes as no surprise since, contrary to common belief, data gets copied from old machines to new ones instead of being archived. This may be considered a bad habit, but it surely is a side effect of storage capacities increasing at low cost, and thus cannot be condemned. As a result, filesystems contain a significant number of text documents, images, and multimedia files. While the mere storage is an easy-to-manage task, convenient access to and information retrieval from huge amounts of data is crucial to leverage the stored information. Current filesystems and their proven, but basic, interface (VFS) support neither.

Donald Norman coined the phrase “Attractive things work better” [50]. While Norman’s statement primarily aimed at pushing aesthetics and attractiveness into user interfaces, it applies well to any human-centered design approach. Without usability, joy of use cannot evolve. Ease of use, on the other hand, is crucial, and for a data storage system it is determined by the ability to search/find and access/use stored data. In fact, the challenge we now face (and will face even more in the future) is to enhance storage systems in such a way that users can make full use of their data. Finding and programmatically processing relevant content in this ever growing amount of data is a major aspect. Filesystems still focus on mere storage and tend to be conservative regarding feature enhancements [70]. Consequently, they do not offer solutions to this demanding task.

Current solutions for finding files are developed outside the filesystem as separate, concurrent systems. Redundant storage of metadata is common practice in modern operating systems and applications. Integrated file indexing services, such as Windows Search or Apple’s Spotlight, crawl the filesystem in order to harvest metadata. Domain-specific applications, such as audio players or e-mail applications, harvest relevant information for their file types and store it in accompanying index structures. The extracted information is used to provide retrieval and search functionality to the user. In times of ever increasing masses of personal data, this obviously is a frequently demanded and useful feature.

Today’s solutions do not exploit the collected metadata to the full. Application-specific solutions fail to offer the extracted information again via a public interface. As such, they hide relevant data, and peer applications have to perform the same work again. System-wide APIs for accessing the stored information use an imperative programming style only and, while suited to accessing single data items, do not allow for sophisticated declarative programming. While we consider keyword-driven search a suitable approach for end users and ad hoc queries, we postulate that it is not enough to cope with the explosive growth of personal information and the full variety of present and future search and retrieval tasks.
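The difference between imperative, item-by-item access and a declarative query can be sketched in a few lines. The following is only an illustration under stated assumptions: the element and attribute names (fsroot, dir, file, suffix, subject) are hypothetical and do not come from any existing metadata service, and Python's xml.etree.ElementTree with its limited XPath subset merely stands in for a full query processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML view of harvested file metadata. All element and
# attribute names are illustrative assumptions, not an existing format.
METADATA = """
<fsroot>
  <dir name="mail">
    <file name="a.eml" suffix="eml"><subject>Thesis draft</subject></file>
    <file name="b.eml" suffix="eml"><subject>Lunch?</subject></file>
  </dir>
  <dir name="music">
    <file name="c.mp3" suffix="mp3"><title>Song</title></file>
  </dir>
</fsroot>
"""

root = ET.fromstring(METADATA)


def subjects_imperative(node):
    """Imperative style: walk the hierarchy and inspect every item by hand."""
    found = []
    for child in node:
        if child.tag == "dir":
            found.extend(subjects_imperative(child))
        elif child.tag == "file" and child.get("suffix") == "eml":
            subject = child.find("subject")
            if subject is not None:
                found.append(subject.text)
    return found


def subjects_declarative(node):
    """Declarative style: state what is wanted as one path expression."""
    return [s.text for s in node.findall(".//file[@suffix='eml']/subject")]


print(subjects_imperative(root))
print(subjects_declarative(root))
```

Both functions return the same subjects; the declarative variant leaves the traversal strategy to the query engine, which is the property a database-backed interface could exploit with indexes.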

A more general and ideally standardized storage facility for the harvested data would make it easier for applications and developers to profit from the tediously collected information. We opt for XML database technology to provide such an infrastructure and to establish a system-wide service that exports harvested metadata in a standardized, well-defined format suitable for further processing. Choosing XML makes it possible to leverage the complete and feature-rich X-technology stack developed and standardized by the World Wide Web Consortium (W3C). We therefore propose to describe the filesystem’s contents and metadata in an XML dialect, and to use a high-performance and scalable XML store, together with a full-fledged and highly compliant XQuery processor, to expose the collected data system-wide.

[Figure 1.1: Dual access to filesystem data. Applications, users, and developers reach a joint DB/FS storage (XML storage plus binary backing store) along two paths: DB-unaware applications use conventional file I/O and metadata-aware file access through the Unix filesystem (“Filesystem Trail”); DB-aware applications use declarative (query) access through the XML database (“Database Road”).]

The approach has two main effects. First, it will provide an additional semantic, content-related view of the filesystem and its stored content. Using XML, it becomes possible to express the logical structure of files (as also proposed by semantic desktop or filesystem approaches). That way, the system knows about the insides of a data source and can retrieve parts of it (for instance, just the subject of an e-mail). This stands in stark contrast to current filesystems, which treat files as a mere sequence of bytes. Database-aware applications are able to directly query and process this information and have a more fine-grained view of the system. Interconnections between data assets of different kinds, for example, can be explored more easily using a declarative query language that is designed to this end. Second, we will re-export the collected data back into the filesystem namespace, so that legacy applications can profit as well. In the end, conventional file I/O for database-unaware applications as well as database-enhanced access to the same data is provided. Figure 1.1 illustrates the concept.

1.3 Research Approach

Despite the fact that several database-driven filesystem attempts have already failed, the advent of XML brought some significant enhancements to DBMSs that inspired us to dare another attempt. In a preliminary study [34], we evaluated the mapping of a file hierarchy and its content to XML and emulated filesystem operations using XPath/XQuery/XQUF operations. We found it possible to perform basic filesystem commands, as well as content-based retrieval, in interactive time on the constructed filesystem mappings with an off-the-shelf XML database. Motivated by these results, we pushed the idea forward.

The tree-based XML model has spawned efforts on relational storage and processing techniques for hierarchically structured data, and DBMSs have meanwhile learned to work with tree-shaped data (e.g., [5, 6, 22–24]). This is of direct benefit, as the hierarchic nature of filesystems can now consistently be mapped to relational storage (see Figure 1.2) and leverage the associated algorithms (an elaborate discussion of relational XML storage and algorithms can be found in [64, Chapter 2]). BaseX, the database we use within this project, is also built on a relational encoding scheme, as will be discussed in Chapter 2.

[Figure 1.2: Basic (simplified) idea of storing trees (such as file hierarchies, XML documents) in a RDBMS [21]. The figure shows an XML document with the elements a to j, the corresponding tree annotated with preorder and postorder ranks, and the resulting table:]

pre  post  n
0    9     a
1    3     b
2    2     c
3    0     d
4    1     e
5    8     f
6    4     g
7    7     h
8    5     i
9    6     j

A major problem of storing files in a DBMS (apart from using BLOBs) has been the basic necessity of providing a schema first. With an unmanageable number of file formats, this appears to be impossible. Schema-oblivious storage techniques made it possible for XML data to be stored in the database without previous knowledge of its interior structure. (An additional XML Schema specification for the file type may be of advantage when formulating queries against the document, but is not mandatory.) As mentioned, more and more applications use XML as their native storage format anyway. Data of this kind is already prepared to be handled with database technology. From our point of view, these documents are nothing but serialized database instances. Consequently, they are not only stored as plain text, but directly shredded into the DBMS. (“Shredding” is a technical term for the conversion of XML from its textual representation to an internal format used by the database. Read it as “import”.) Legacy applications are still able to process them conventionally by requesting them in their serialized, i.e., textual, representation. In direct communication with the database, however, XML processing languages such as XPath/XQuery can be used on the data. Following the concept through, and as filesystems are structured hierarchically, it seems natural to also map the file hierarchy into tree-aware DBMSs.
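The idea behind the pre/post table of Figure 1.2 can be sketched as follows. This is a minimal illustration, not the encoding BaseX actually uses (that is the subject of Chapter 2): the XML document is reconstructed here from the figure's table, the function names are hypothetical, and element names are assumed to be unique so that they can serve as dictionary keys.

```python
import xml.etree.ElementTree as ET

# Example document reconstructed from the pre/post table of Figure 1.2.
DOC = "<a><b><c><d/><e/></c></b><f><g/><h><i/><j/></h></f></a>"


def encode(root):
    """Return {tag: (pre, post)} computed in one depth-first traversal."""
    table = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]  # preorder rank: assigned when entering the node
        counters["pre"] += 1
        for child in node:
            visit(child)
        # postorder rank: assigned when leaving the node
        table[node.tag] = (pre, counters["post"])
        counters["post"] += 1

    visit(root)
    return table


def is_descendant(table, v, u):
    """v lies below u iff v is entered after u and left before u."""
    return table[v][0] > table[u][0] and table[v][1] < table[u][1]


table = encode(ET.fromstring(DOC))
for tag, (pre, post) in sorted(table.items(), key=lambda kv: kv[1][0]):
    print(pre, post, tag)
```

With the two ranks stored as ordinary columns, a structural question such as "is j below f?" becomes a pair of numeric comparisons, which is what allows a relational engine to evaluate tree navigation with range predicates.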

(20) 1.4 Contribution and Outline

We will design and develop an XML/XQuery driven information architecture that works on formerly heterogeneous data sources in a standardized and uniform manner, leveraging semi-structured database techniques. The system will provide both proven and stable access to the data using filesystem techniques and query support for all stored files. As a consequence, our architecture will provide the following novel features:

• Database query capabilities on filesystem data as a general system service
• Unified view on (formerly heterogeneous) filesystem contents
• Declarative API to work with file objects
• Metadata-aware file access through the filesystem namespace

Furthermore, we will present work at different layers of a suitable DBMS architecture and show how application development inside a pure X-Technology Stack can be achieved.

Foundation. In a first step, we provide an xmlified, database-centric view of the filesystem’s content. We gather file contents and express them in a new XML dialect designed for that purpose: FSML, the Filesystem Markup Language. The result is stored as a BaseX-FS database instance and ready to be queried via XQuery and related languages. That way we provide a unified view on filesystem data. It is the base for processing heterogeneous filesystem data with semi-structured database technology and will be described in the next chapter.

The Database as Filesystem. Chapter 3 will dig down and contribute an implementation that establishes a link between DBMS and OS and shows how a database with XML/XQuery support can be used as a user-level filesystem. As a result, the DBMS is mounted as a Unix filesystem by the operating system kernel. Consequently, access via the established filesystem interface as well as database-enhanced access to the same data is provided (joint storage for filesystem and database).
The database filesystem hybrid will provide metadata-aware file access (“deep access”) over the conventional filesystem interface.

Declarative Application Programming Interface. Having established the uniform view on heterogeneous filesystem data, we move up to the database frontend and show how the new database filesystem infrastructure can be used to facilitate, support and, in

(21) the final analysis, change application development. The system now provides a declarative application programming interface, and database-aware applications can directly profit from the database infrastructure. A selection of user interfaces implemented as BaseX views will demonstrate this and show how databases can be turned into primary processors for users and application developers dealing with information stored in files.

An XQuery Application Framework. In Chapter 4, we push the idea even further and develop an XQuery application framework to enable developers to implement database-aware applications inside a clean XML technology stack. Our expectation is that implementations developed on top of a pure W3C technology stack will show simpler, more flexible, and more efficient application code.

Kickstarting an Infrastructure. To finally demonstrate the feasibility of our approach and to put our architecture to the test, we provide an actual example application and describe the development of an expert retrieval system from scratch. It leverages declarative access on documents stored in the filesystem (BaseX-FS), makes use of our application framework (BaseX-Web), and is built solely on XML technology.

Big picture. Figure 1.3 on page 23 illustrates the big picture. Users, developers, and applications gain two access paths to filesystem data. Database-enhanced (declarative, “queryable”) access to data is provided as illustrated in the upper half of the figure, entitled “Database Road”. It allows for a uniform view on formerly heterogeneous data stored in the filesystem (Chapter 2). An XQuery Application Framework (Chapter 4) permits application development using a declarative programming style and allows developers to stay inside a clean standardized technology stack approved by the W3C. Legacy access to filesystem data is still provided.
A joint storage for filesystem and database is set up and the database is mounted as a filesystem by the operating system kernel (Chapter 3). That way, both stable and proven access via the established filesystem interface and database-enhanced access can be achieved, since the filesystem is the database and the database is the filesystem. This additionally allows for improved, metadata-aware (so-called deep) access via the filesystem namespace, as it allows navigating into the file.

Generalization. We will apply XML technology to implement what is typically considered to be solved by a variety of different languages and concepts, including low-level system programming. As such, we target new application domains for XML database

(22) management systems and propose enhancements for XML database architectures. The existing BaseX XML database management system is taken as a representative. Finally, the techniques discussed will serve as a general blueprint on how to design and develop XML/XQuery driven information architectures that work on formerly heterogeneous data sources in a standardized and uniform manner.

(23) BX-W. Access via filesystem namespace. home |-- Images |-- Documents | `-- news.pdf `-- Music.  <dir name="home" …> … <file name="song.mp3" …> … </file> </dir>. Metadata-aware file access. .news.pdf.deepfs |-- author |-- pages | `-- page.txt `-- subject. BX-FS. U XML V O H F D.  <album>…</album> <artist>Bob Dylan</artist> <title>…</title>. Declarative access to files using X-technology stack (Desktop Search, Personal Information Management, …). XQ A F. Binary Backing Store. XML STORAGE. J DB/FS S. Figure 1.3: Big picture and ultimate goal: Applications, users, and developers gain two access paths to file contents. Proven and stable access via the filesystem interface is retained. An enhanced, metadata-aware (deep) file access is provided as the data is stored in a joint storage for filesystem and database. The database is mounted as a filesystem by the operating system kernel (“Filesystem Trail”). Database-enhanced (declarative, “queryable”) access can leverage the complete range of XML technologies on filesystem data. Additionally an application framework to build software inside a unified W3C stack is proposed (“Database Road”). DB-unaware applications (Filesystem Trail). Applications Users Developers. DB-aware applications (Database Road). 1.4 Contribution and Outline. 23.

(24)

(25) 2 The BaseX Filesystem View

Ever-growing data volumes demand storage systems that go beyond the abilities of current filesystems, particularly with regard to a powerful querying capability. With the rise of XML, the database community has been challenged by semi-structured data processing, enhancing their field of activity. Since filesystems are structured hierarchically, they can be mapped to XML and as such stored in and queried by an XML-aware database system. Filesystems typically store vast amounts of heterogeneous file and data formats. The lack of a unified representation, however, makes it difficult for query languages to work through the data.

In the following, we present FSML, the Filesystem Markup Language, a novel XML dialect that maps filesystem entities to XML nodes. BaseX, a native XML database system, is used to store FSML instances in order to provide a standardized, uniform, and high-level representation of a filesystem. The proposed mapping will later on be used to:

• Work on filesystem data using a database query language. XQuery, for example, can be used to search through, program with, or analyze the data of a filesystem.
• Mount the database as a filesystem by the operating system. We will establish a link between database and operating system. The database will be mounted as a conventional filesystem by the operating system kernel.
• Implement applications using a declarative/functional programming style whenever it comes to the processing of file data.

Traditionally, files are roughly classified as either text or binary. We add XML as a third type and expose formerly locked away content of files in a well-defined format with both its structure and content. The unified representation of a filesystem can be leveraged by applications, developers

(26) and users. It is, however, in the first place targeted at application developers, to finally offer a new declarative way of dealing with filesystem data. In Chapter 3 we show how XQuery can be used in the domain of system programming. We will hook the database into the operating system and re-export the database content via the filesystem namespace.

2.1 Joint Storage for Filesystem and Database

In Figure 1.3 on page 23 we gave a high-level overview of the system’s architecture. A key element is the joint storage system used by both the filesystem and the database. BaseX supports the storage of semi-structured XML and binary data. We will use its storage layer to assemble all data necessary to drive a filesystem (file hierarchy, filesystem metadata, user data). The XML Store supports updates and, besides the usual name, path, and value indexes, maintains two full-text index structures: a fuzzy index is centered on specialized approximate matches, and a trie index supports wildcard queries. Both versions yield fast results for exact queries [28]. The system is an early adopter of the XQuery Full Text Recommendation [12] and supports sequential scanning, index-based, and hybrid processing of full-text queries. The support for textual retrieval at the core of the database engine makes BaseX a good choice to power our content-aware filesystem representation.

2.1.1 The pre/distance/size Encoding

BaseX’ storage layer uses a pre/distance/size encoding for XML data with various compactification techniques, such as attribute and integer inlining [25]. It is derived from the XPath Accelerator encoding [21], which is used in the MonetDB/XQuery system¹. Those flat tree encodings have proven to show excellent query performance [6, 23, 25, 26]. Figure 2.1 shows a pre/distance/size encoded tree. The pre value is dense and ordered for the complete tree structure, and it is implicitly given by its position. dist defines the
¹ http://www.monetdb-xquery.org/

(27) relative distance to the parent pre value, and size contains the number of descendants of a node.

    $ tree ./a
    a
    |-- b
    |   `-- c
    |       |-- d
    |       `-- e
    `-- f
        |-- g
        `-- h
            |-- i
            `-- j

    n     a  b  c  d  e  f  g  h  i  j
    pre   0  1  2  3  4  5  6  7  8  9
    dist  0  1  1  1  2  5  1  2  1  2
    size  9  3  2  0  0  4  0  2  0  0

Figure 2.1: Storing trees (such as file hierarchies, XML documents) in the pre/distance/size encoding

To facilitate updates, the table structure is organized in disk blocks. A block directory references the first pre value of each block. The dist and size values have to be modified if deletions/insertions are performed: The size values are updated for all ancestors of that node (which means that a maximum of log(n) nodes in the tree has to be accessed), and the dist values are updated for the following siblings and the following siblings of the ancestor nodes. In comparison, the storage of absolute parent references, for example, would require a complete renumbering of all nodes in the tree table that follow a deleted/inserted node, rendering it unsuitable for updates in filesystems.
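To make the relationship between the three columns concrete, the following Python fragment derives the dist and size values of Figure 2.1 from the example tree and walks the parent chain that a size update has to touch. It is a minimal sketch under simplifying assumptions (an in-memory list instead of disk blocks, no attribute or integer inlining); the names encode, parent, and ancestors are hypothetical and do not refer to BaseX internals.

```python
from typing import Dict, List, Optional

# Example tree from Figure 2.1 as a parent -> children mapping (hypothetical helper data).
TREE: Dict[str, List[str]] = {
    "a": ["b", "f"], "b": ["c"], "c": ["d", "e"],
    "f": ["g", "h"], "h": ["i", "j"],
}


def encode(root: str, tree: Dict[str, List[str]]) -> List[dict]:
    """Return one row per node; the list index is the (implicit) pre value."""
    table: List[dict] = []

    def visit(node: str, parent_pre: Optional[int]) -> None:
        pre = len(table)
        # dist is the relative distance to the parent's pre value (0 for the root).
        dist = 0 if parent_pre is None else pre - parent_pre
        table.append({"name": node, "dist": dist, "size": 0})
        for child in tree.get(node, []):
            visit(child, pre)
        # size is the number of descendants appended below this node.
        table[pre]["size"] = len(table) - pre - 1

    visit(root, None)
    return table


def parent(table: List[dict], pre: int) -> Optional[int]:
    """Follow dist back to the parent row; the root has no parent."""
    dist = table[pre]["dist"]
    return None if dist == 0 else pre - dist


def ancestors(table: List[dict], pre: int) -> List[int]:
    """Rows whose size value changes when a node is inserted or deleted at pre."""
    chain = []
    current = parent(table, pre)
    while current is not None:
        chain.append(current)
        current = parent(table, current)
    return chain


TABLE = encode("a", TREE)
print([row["dist"] for row in TABLE])
print([row["size"] for row in TABLE])
print(ancestors(TABLE, 9))  # ancestor rows of node j
```

Because dist is relative, a row never stores an absolute parent position; this is the property the text relies on when it contrasts the encoding with absolute parent references.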

(28) 2.1.2 The Encoded File Hierarchy

As the pre/distance/size encoding is essentially a storage for tree structures, it can be seamlessly used to store the file hierarchy of a filesystem. The hierarchical mapping of filesystems is straightforward, as illustrated in Figure 2.1. A more detailed view of the joint storage [33] is shown in Figure 2.2.

[Figure 2.2 (image not reproduced). The figure shows a sample XML/XQuery view, a <dir name="home" …> element whose <file> children reference shredded content (<content db-name="…" …/>) or extracted metadata (<metadata transducer="…/>), on top of the joint storage: the file hierarchy table (FSML) and further XML storage tables with pre, dist, and size columns, and the binary backing store.]

Figure 2.2: Joint storage for filesystem and database.

Uniform XML representation of filesystem content. From the XML/XQuery perspective, the system stores an XML representation of the filesystem, which is valid against a W3C XML Schema Definition. A BaseX-FS database instance consists of three components:

• The FSML database that contains the file hierarchy tree.
• XML databases for well-formed native XML documents and extracted information from files.
• The binary backing store that stores raw data for any file in the file hierarchy, except those XML files turned into a database.

Any information relevant to operate a traditional filesystem is stored in the “File Hierarchy Table” and is accessible for the XQuery processor as well as for operating system

(29) requests, as we will see later. Following the Unix tradition, there are block and character, directory, fifo, (symbolic) link, socket, and regular file types. Each file type is expressed as an XML element, e.g., <file/>, and augmented with its file attributes (file size, access time, protection mode, …).

    <file name="05_like_a_rolling_stone.mp3" suffix="mp3"
          st_mode="100644" st_uid="501" st_gid="20"
          st_size="8943004" st_nlink="1"
          st_mtime="1323101051" st_ctime="1323168268" st_atime="1324470592"/>

Listing 1: FSML file element with file attributes.

When the database is mounted as a filesystem at a later point, those attributes are used to provide file status information to the operating system kernel (as, for instance, necessary to process the stat(2) system call).

2.2 Leverage Tacit Information Hidden in Files

Another look at Figure 2.2 on the facing page reveals that content and inherent metadata of files are taken into account and explicitly represented in the BaseX-FS database instance. Our mapping breaks with the long tradition of considering files as just a sequence of bytes. A central point of the integration of contents into the XML representation of the filesystem is to allow the full range of XQuery retrieval features on the data. While XML files are ready to be included without additional effort (unless schema validation is demanded), commonly used binary files, such as images or audio files, contain metadata, which is quite relevant for querying. Our basic approach is that any information is considered an asset of interest that may, at a later point, be useful for information retrieval, personal information management, or related tasks. Assets of interest are to be defined for file types, e.g., audio files, e-mails, pictures, etc.,
and typically belong to one of the four categories:

• Inherent file metadata (encapsulated within the file, e.g., ID3 information for music data, EXIF annotations for image files)
• System metadata (filename, file size, modification times, …)
• File content (full-text contained in office documents, e-mails etc.)

(30) • User annotations, like tags, etc.

Often, a number of different assets of interest exist for a given file. Those are bundled together and form what we call a metadata entry (MDE). A metadata entry is the XML encoded view of a file and is well suited for querying. Together with the original regular file in the backing store, it forms the database view of a file.

2.2.1 Transducers – Filetype-specific Data Extractors

Metadata entries are constructed by so-called transducers (which first appeared in the context of the Semantic File System [17]). Transducers are file-specific metadata extractors. They exist for various file types and can be plugged into the system to expose file-specific metadata. An extensible and configurable architecture has been chosen for the implementation of transducers to

• facilitate the support for new filetypes
• enable developers and organizations to produce metadata entries that contain exactly the (meta)data of files they want to query.

BaseX-FS provides the framework in which transducers can feed metadata entries in order to build user-defined views of the filesystem in XML. Transducers are triggered by the detected file MIME type. Transducer plugins can register for various MIME types and will be invoked once a file of that type is processed. The detected metadata is added as separate XML documents to the database, and the file hierarchy mapping is augmented by a reference to the metadata entry. Listing 2 on the next page is an example of what is stored in the native XML database. A transducer for audio files has detected some ID3 information, and the <file> element is augmented with file attributes taken from the operating system.
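The MIME-type driven dispatch described above can be illustrated with a small registry. The Python fragment below is a hedged sketch only: TransducerRegistry and audio_transducer are hypothetical names, MIME detection is delegated to the standard mimetypes module (suffix-based guessing, which is an assumption and not necessarily how BaseX-FS detects types), and the "transducer" merely returns a stub metadata element instead of parsing the file.

```python
import mimetypes
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Transducer = Callable[[str], ET.Element]


class TransducerRegistry:
    """Maps MIME types to the transducers registered for them (illustrative only)."""

    def __init__(self) -> None:
        self._by_mime: Dict[str, List[Transducer]] = defaultdict(list)

    def register(self, mime_types: Iterable[str], transducer: Transducer) -> None:
        # A plugin may register for several MIME types at once.
        for mime in mime_types:
            self._by_mime[mime].append(transducer)

    def extract(self, path: str) -> List[ET.Element]:
        # Assumption: the MIME type is guessed from the file name suffix.
        mime, _encoding = mimetypes.guess_type(path)
        # All transducers registered for the type run in registration order.
        return [transducer(path) for transducer in self._by_mime.get(mime, [])]


def audio_transducer(path: str) -> ET.Element:
    """Stub transducer: returns a metadata entry without reading the file."""
    entry = ET.Element("metadata", {"transducer": "demo-audio"})
    ET.SubElement(entry, "source").text = path
    return entry


registry = TransducerRegistry()
registry.register(["audio/mpeg"], audio_transducer)
for entry in registry.extract("05_Like_A_Rolling_Stone.mp3"):
    print(ET.tostring(entry, encoding="unicode"))
```

A file whose type has no registered transducer simply yields no metadata entries, so unknown formats still end up in the file hierarchy with their system metadata only.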

(31)
    <fsml version="1.0">…
      <dir name="Music" st_size="…">
        <file name="05_Like_A_Rolling_Stone.mp3" suffix="…">
          <metadata transducer="exiftool"
                    db-size="10144" db-nodes="73"
                    db-name="fsml-522dd6df-169d-4edf-aaf3-e2396e18dfab"
                    db-timestamp="16.01.2012 10:57:48"
                    doc-size="2499 Bytes" doc-encoding="UTF-8"
                    whitespace-chopping="true"/>
        </file>
      </dir>…
    </fsml>

    <metadata transducer-toolkit="Image::ExifTool 8.68"
              xmlns:MPEG="http://ns.exiftool.ca/MPEG/MPEG/1.0/"
              xmlns:ID3v2_3="http://ns.exiftool.ca/ID3/ID3v2_3/1.0/"
              xmlns:Composite="http://ns.exiftool.ca/Composite/1.0/">
      <MPEG:MPEGAudioVersion>1</MPEG:MPEGAudioVersion>
      <MPEG:AudioLayer>3</MPEG:AudioLayer>
      <MPEG:AudioBitrate>192 kbps</MPEG:AudioBitrate>
      <MPEG:SampleRate>44100</MPEG:SampleRate>
      <MPEG:ChannelMode>Stereo</MPEG:ChannelMode>
      <MPEG:MSStereo>Off</MPEG:MSStereo>
      <MPEG:IntensityStereo>Off</MPEG:IntensityStereo>
      <MPEG:CopyrightFlag>False</MPEG:CopyrightFlag>
      <MPEG:OriginalMedia>False</MPEG:OriginalMedia>
      <MPEG:Emphasis>None</MPEG:Emphasis>
      <ID3v2_3:Title>Like A Rolling Stone</ID3v2_3:Title>
      <ID3v2_3:Artist>Bob Dylan</ID3v2_3:Artist>
      <ID3v2_3:Composer>Bob Dylan</ID3v2_3:Composer>
      <ID3v2_3:Album>Greatest Hits</ID3v2_3:Album>
      <ID3v2_3:Track>5/10</ID3v2_3:Track>
      <ID3v2_3:PartOfSet>1/1</ID3v2_3:PartOfSet>
      <ID3v2_3:Year>1965</ID3v2_3:Year>
      <ID3v2_3:Genre>Folk</ID3v2_3:Genre>
      <ID3v2_3:Comment>(iTunPGAP) 0</ID3v2_3:Comment>
      <ID3v2_3:EncodedBy>iTunes 8.0.2</ID3v2_3:EncodedBy>
      <ID3v2_3:Comment>(iTunNORM) 00 … 00042A05</ID3v2_3:Comment>
      <ID3v2_3:Comment>(iTunSMPB) 00 … 000000</ID3v2_3:Comment>
      <ID3v2_3:Comment>(iTunes_CDDB_IDs) 10 … 750289</ID3v2_3:Comment>
      <Composite:DateTimeOriginal>1965</Composite:DateTimeOriginal>
      <Composite:Duration>0:06:12 (approx)</Composite:Duration>
    </metadata>

Listing 2: Metadata extracted for .mp3 file using ExifTool transducer
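Since the metadata entry of Listing 2 is ordinary namespaced XML, any XML-aware tool can address individual values in it. As an illustration outside the XQuery stack, the following Python sketch reads a shortened copy of the entry with the standard library; the helper name id3_value and the prefix id3 are hypothetical, whereas the namespace URIs and element names are taken from Listing 2.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Shortened copy of the metadata entry from Listing 2.
METADATA = """<metadata xmlns:ID3v2_3="http://ns.exiftool.ca/ID3/ID3v2_3/1.0/"
          xmlns:Composite="http://ns.exiftool.ca/Composite/1.0/">
  <ID3v2_3:Title>Like A Rolling Stone</ID3v2_3:Title>
  <ID3v2_3:Artist>Bob Dylan</ID3v2_3:Artist>
  <ID3v2_3:Year>1965</ID3v2_3:Year>
  <Composite:Duration>0:06:12 (approx)</Composite:Duration>
</metadata>"""

# Prefix chosen for this sketch; only the namespace URI has to match the document.
NS = {"id3": "http://ns.exiftool.ca/ID3/ID3v2_3/1.0/"}


def id3_value(entry: ET.Element, tag: str) -> Optional[str]:
    """Return the text of the ID3v2.3 element named tag, or None if it is absent."""
    return entry.findtext(f"id3:{tag}", namespaces=NS)


entry = ET.fromstring(METADATA)
print(id3_value(entry, "Artist"), "-", id3_value(entry, "Title"))
```

Inside the database, the same lookup would be expressed as a path expression over the stored entry; the point here is only that the extracted data no longer requires a format-specific parser.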

(32) Transducers externalize data formerly siloed in filesystems. The extraction of tacit information, encapsulated in various file formats, leads to a standardized and easily accessible representation. Content and structure of file data are exposed and can now be queried together, as the extracted data is presented in a homogeneous manner. The data is indexed, and we can search on anything that has been loaded without knowing the questions ahead of time. Think, for instance, about finding an e-mail with a known sender, a big attachment, and some keywords:

    for $mail in //file/Mail
    let $attach := $mail/Attachment
    where $mail/From = 'jim.walker@mail.com'
      and $mail/Section contains text 'Hansson' ftand 'report'
      and $attach/@size > 3000000
    return fsml:path($mail)

Listing 3: XQuery pseudo-code to retrieve relevant e-mails.

Queries may combine filesystem metadata (such as file size, directory names) with file content and use both filesystem commands and languages for semi-structured data, such as XQuery, to request and manipulate data. In the case of e-mails, comparable functionality is already offered by advanced e-mail applications. However, each application has to provide its own implementation, leading to highly redundant code for similar functionality. Our approach strives to provide such capabilities as a basic system service. Furthermore, the search is not restricted to application-defined communication paths (such as the often connected e-mail, calendar, address book applications), but can include any stored data.

2.2.2 Implementation of a Transducer

Several sophisticated tools in the open-source domain focus on metadata extraction. ExifTool [30] is a good example. It has been in operation and under constant development since 2003 and supports an astonishing number of more than 130 file types. Following established software engineering practice, we want to put those tools to use for our goal

(33) to externalize information in a homogeneous manner. A plugin architecture has been chosen for that purpose. While it can be quite difficult to write extraction code to get information from raw data, it is easy to deploy a new transducer and to integrate it into our architecture, as it boils down to supporting a simple interface:

• register(list of mime types supported by transducer)
• <metadata/> extract(fileref)
• inject(fileref, <updates/>)

The implementation has to be thread-safe; the functions are called back by the system when appropriate. Plugins are initially loaded into the BaseX-FS Database Server on startup. They may be provided as external dynamic libraries or included into the project code. Additional plugins can be loaded and removed from the server during runtime. If multiple transducers are registered for the same MIME type, they are executed in sequence.

2.3 A Deeper Filesystem – The Metadata Hierarchy

Back in 1998, Simon St. Laurent published a short essay [61] that contained the following Figure 2.3:

Figure 2.3: Simon St. Laurent’s vision of an enhanced, “deeper” filesystem.

St. Laurent writes: “Implementing this requires a drastic rethinking of the file system

(34) and database structures as well. Supporting retrieval at the element level breaks down formerly monolithic binary files (or, in database terms, Binary Large Objects or BLOBs) into separate, often tiny chunks which may themselves continue other chunks, which contain other chunks, and so forth. At this point, the file system is no longer a file system in the traditional sense, but an object store which is capable of storing large chunks of information as well as hierarchies built of tiny data sets. The document still exists - but only as one layer of the object store, an object containing other objects much as directories contain files at present.” [61]

We were excited about the idea of letting the filesystem immerse into files to have an enhanced, deeper, and more fine-grained access to data. And given a BaseX-FS instance, other (more specific or application-tailored) views on a filesystem can be created easily. We came up with another XML representation, called DeepFS, that integrates selected items of metadata entries into the file hierarchy. Besides the well-known file and directory hierarchy, DeepFS establishes a second metadata hierarchy. Assets of interest are structured in <fact/> and <folder/> elements. Facts are leaf nodes in the metadata hierarchy and contain values, such as ’Bob Dylan’ in an ’artist’ fact of an audio file or the full-text of a PDF in the ’page’ fact. Folders recursively contain, analogous to directories, zero or more facts or folders. They, for instance, group the individual page facts of a PDF document into a pages folder. An example is given in Listing 4 on the next page. Facts and folders form the metadata hierarchy that is exposed in the Unix filesystem namespace:

    $ tree -a /var/tmp/mnt/
    /var/tmp/mnt/
    |-- a.mp3
    `-- .a.mp3.deepfs
        |-- artist
        |-- sub
        |   `-- genre
        `-- title
    2 directories, 4 files
By convention, a known file type is expected to expose its metadata in a folder with an annotation of type="metadata". It denotes the root of the metadata hierarchy along which deep access to the regular file is established. DeepFS prolongs the conventional file hierarchy with a metadata hierarchy. When mounting the database as a filesystem, this metadata hierarchy is reflected in the filesystem namespace again in order to navigate into the file.
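How a DeepFS fragment translates into names in the filesystem namespace can be sketched in a few lines. The Python fragment below is an illustration under assumptions: the XML sample mirrors the a.mp3 example (the fact values are made up for the sketch), and namespace_paths is a hypothetical helper, not part of BaseX-FS. Folders become directories (marked with a trailing slash) and facts become regular files.

```python
import xml.etree.ElementTree as ET
from typing import List

# Illustrative DeepFS fragment; the fact values are placeholders for this sketch.
DEEPFS = """<file name="a.mp3">
  <folder name=".a.mp3.deepfs" type="metadata">
    <fact name="artist">Bob Dylan</fact>
    <folder name="sub"><fact name="genre">Folk</fact></folder>
    <fact name="title">Like A Rolling Stone</fact>
  </folder>
</file>"""


def namespace_paths(file_elem: ET.Element) -> List[str]:
    """List the names a <file> element and its metadata hierarchy would expose."""
    paths = [file_elem.get("name")]

    def walk(elem: ET.Element, prefix: str) -> None:
        for child in elem:
            if child.tag not in ("folder", "fact"):
                continue
            path = prefix + child.get("name")
            if child.tag == "folder":
                paths.append(path + "/")   # folders appear as directories
                walk(child, path + "/")
            else:
                paths.append(path)         # facts appear as regular files

    walk(file_elem, "")
    return paths


for name in namespace_paths(ET.fromstring(DEEPFS)):
    print(name)
```

The resulting names correspond to the tree output shown for the mounted example: the regular file, its hidden .deepfs directory, and the facts and folders below it.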

(35)
    <dir name="Documents" st_mode="040755" ...="...">
      <file name="BBC_News-Mars_Nasa_images.pdf" suffix="pdf" ... >
        <folder name=".BBC_News-Mars_Nasa_images.pdf.deepfs" type="metadata">
          <fact name="pagecount">2</fact>
          <fact name="title">Mars: Nasa images show signs of flowing water</fact>
          <fact name="author">Hamish Pritchard (Science Reporter)</fact>
          <fact name="subject">Science & Environment</fact>
          <fact name="keywords"/>
          <fact name="creator">Google Chrome</fact>
          <fact name="producer">Mac OS X 10.6.8 Quartz PDFContext</fact>
          <fact name="creationdate">2011-08-10T15:11:03.000Z</fact>
          <fact name="modificationdate">2011-08-10T15:11:03.000Z</fact>
          <folder name="pages">
            <fact name="page" number="1">SCIENCE & ENVIRONMENT 4 August 2011 Last updated at 18:11 GMT Mars: Nasa images show signs of flowing water Striking new images from the mountains of Mars may be the best evidence yet of flowing, liquid water, an essential ingredient for life. The findings, reported today in the journal Science, come from a joint US-Swiss study. ... </fact>
            <fact name="page" number="2">Salty water ...

Listing 4: DeepFS with facts and folder elements that establish a metadata hierarchy.

Navigation into the file along the metadata hierarchy can be achieved once the database is mounted as a filesystem. Folders naturally appear as regular directories that inherit the access rights of the original file. The root folder of the metadata hierarchy is a hidden Unix directory (dot notation) and named after the corresponding file name suffixed with .deepfs. Facts show up as regular files and can be treated as such, i.e., a write to a “file” in the metadata hierarchy translates into an update of its <fact/> element. To go through with the concept of a deeper filesystem, updates of facts in the DeepFS view propagate back into the original files. For that purpose we introduced the concept of bi-directional transducers.
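The translation of a write on a fact "file" into an update of its <fact/> element can be pictured as follows. This Python fragment is a simplified sketch: resolve and write_fact are hypothetical helpers operating on an in-memory tree, and the propagation of the change back into the original file (the task of a bi-directional transducer) is deliberately left out.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def resolve(file_elem: ET.Element, path: str) -> Optional[ET.Element]:
    """Walk folder/fact names below a <file> element; None if the path is unknown."""
    elem: Optional[ET.Element] = file_elem
    for name in path.strip("/").split("/"):
        elem = next(
            (child for child in elem
             if child.tag in ("folder", "fact") and child.get("name") == name),
            None,
        )
        if elem is None:
            return None
    return elem


def write_fact(file_elem: ET.Element, path: str, data: str) -> None:
    """Emulate a write(2) on a fact: replace the text content of the <fact/> element."""
    target = resolve(file_elem, path)
    if target is None or target.tag != "fact":
        # Folders behave like directories and cannot be written as files.
        raise FileNotFoundError(path)
    target.text = data


doc = ET.fromstring(
    '<file name="a.pdf">'
    '<folder name=".a.pdf.deepfs" type="metadata">'
    '<fact name="title">Old title</fact>'
    '</folder></file>'
)
write_fact(doc, ".a.pdf.deepfs/title", "New title")
print(ET.tostring(doc, encoding="unicode"))
```

In the actual system the same update would additionally have to reach the raw file in the backing store, which is why the transducer interface includes an inject function.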
Pushing metadata updates back into files. As indicated by the inject(fileref, <update/>) function, we, in contrast to existing metadata harvesters, allow users and applications to actually work on the extracted data. This means, while Desktop Search

(36) Engines or application-specific indexes collect metadata in order to provide search functionality and lock away the metadata otherwise, we maintain a strong relationship between the XML view and the original data file. Whenever the file is updated, its redundant, externalized XML representation is updated as well. The same holds vice versa: if a metadata entry is updated by a database query, those changes are propagated back into the original file. Since the original, raw file is kept in a backing store, the homogeneous representation comes with the cost of storing data redundantly: the original metadata in the file and its counterpart in the XML representation.

2.4 Related Work

Various ideas have been proposed for including file contents into information systems. One of the earliest attempts, the Semantic File System (SFS) [17], extracted attribute-value pairs for specific file types via so-called transducers. Content queries could be formulated by entering directory paths and extending them with AND combined query terms. The result was a virtual path, resembling a default directory path and including symbolic links to the result documents. While SFS offered only limited retrieval functionality and ways of representing the query results, it has influenced numerous subsequent filesystem projects, including Shore [11] and HAC [19].

An interesting approach to bring XML and filesystems together was presented by IBM’s XMLFS [3]. The underlying prototype implementation offered access to XML documents via an NFS server, and a simple path language allowed querying tags and text nodes across several documents. Nevertheless, the project was not extended to full XPath/XQuery support, and document storage was apparently limited to XML instances and to the existence of DTDs.

The visionary paper “From databases to dataspaces: a new abstraction for information management” [16] proposes dataspaces as a new data management abstraction.
It led to various promising research efforts regarding the development of software platforms to facilitate a heterogeneous and distributed mix of personal information, such as Semex [10]. Approaches like this are far more prospective and target the development of so-called

(37) DataSpace Support Platforms (DSSPs). These are supposed to meet the criteria defined in “Principles of dataspace systems” [29].

IBM’s Virtual XML Garden [55] and the draft of File System XML (FSX) [68] share the common idea to have a unified view over heterogeneous data sources. Since filesystems are structured hierarchically, they can easily be mapped to an XML structure as sketched in [68]. Together with the idea to let the filesystem immerse into the file [61], these provide the basis for the construction of our representations. An extensive discussion of semantic technologies applied to the problem of personal information management can be found in [56, Chapter 2].

2.5 In a Nutshell

mkfs.basexfs(1) takes an existing file hierarchy as input and creates a BaseX-FS database instance. This bulk loading operation serves well as a short summary of the points discussed so far. A depth-first preorder tree traversal, starting from the topmost directory, is performed in order to produce a unified representation of the file hierarchy. While traversing the file hierarchy, each file is visited and analyzed. The following operations take place:

• Encountered files are represented as XML elements in the FSML database. They are augmented with operating system specific metadata attributes, such as file access time, file size, file protection mode and the like. This is done for all file types, including regular files, directories, links, etc. When the database is mounted as a filesystem at a later date, those attributes are used to obtain information about the file.
• File-type specific metadata (such as EXIF information for images, ID3 data for audio files, …) is stored in separate databases using an XML representation constructed by transducers.
• File-type specific content, such as the full-text of a PDF file, or an e-mail message, is included as well. The transducers are responsible for deciding what assets of interest should be represented.

(38) • The original data file is copied to the binary backing store of BaseX and a unique reference is added to the corresponding file element (<file bsid='uuid'>).
• XML files are treated the same way. Metadata about the document is added to FSML, i.e., statistics about the document (how many nodes, how many elements of a specific tagname, etc.). The document itself is shredded into the database. This holds for any well-formed XML instance. In the case of an incorrect, corrupt XML document, the document is put into the backing store and the FSML metadata entry contains information about the problem.

We created a suitable format to store and operate on filesystem data using a DBMS. While being straightforward, it adds semantics and exposes formerly hidden contents of files with both its structure and values. The approach allows leveraging all components (storage, indexes, query capabilities) of an XML-DBMS. XML processing languages, such as XPath and XQuery, allow for unprecedented search capabilities and flexibility on the data. Conventional approaches using full-text engines can only perform full-text queries, using proprietary syntax. With XQuery and its Full-Text extension, we can easily combine full-text search criteria and queries based on values of any XML element or attribute.

In the next chapter, we will explore the integration of BaseX-FS instances into Unix operating systems in order to build filesystems on top of the unified XML representation. Since the database will be mounted as a conventional filesystem by the operating system kernel, access via the established (virtual) filesystem interface as well as database-enhanced access to the same data will be provided.
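The first step of the bulk load summarized above, the depth-first preorder traversal that turns a file hierarchy into <dir> and <file> elements with stat attributes, can be sketched in Python. This is not the mkfs.basexfs implementation: build_fsml and stat_attributes are hypothetical names, only directories and "everything else" are distinguished (the other Unix file types get no element of their own here), and transducers, the backing store copy, and XML shredding are omitted. The attribute names follow Listing 1.

```python
import os
import xml.etree.ElementTree as ET
from typing import Dict

# Attribute names as used in Listing 1; values come from lstat(2).
STAT_FIELDS = ("st_mode", "st_uid", "st_gid", "st_size", "st_nlink",
               "st_mtime", "st_ctime", "st_atime")


def stat_attributes(path: str) -> Dict[str, str]:
    """Collect the operating system metadata of one file as XML attributes."""
    st = os.lstat(path)
    attrs = {"name": os.path.basename(os.path.normpath(path)) or path}
    for field in STAT_FIELDS:
        value = getattr(st, field)
        # st_mode is written in octal (e.g. 100644); timestamps are whole seconds.
        attrs[field] = format(value, "06o") if field == "st_mode" else str(int(value))
    return attrs


def build_fsml(path: str) -> ET.Element:
    """Depth-first preorder traversal: a node is emitted before its children."""
    if os.path.isdir(path) and not os.path.islink(path):
        elem = ET.Element("dir", stat_attributes(path))
        for name in sorted(os.listdir(path)):
            elem.append(build_fsml(os.path.join(path, name)))
        return elem
    elem = ET.Element("file", stat_attributes(path))
    elem.set("suffix", os.path.splitext(elem.get("name"))[1].lstrip("."))
    return elem


if __name__ == "__main__":
    root = ET.Element("fsml", {"version": "1.0"})
    root.append(build_fsml(os.getcwd()))
    print(ET.tostring(root, encoding="unicode")[:400])
```

Because the elements are created in preorder, serializing the tree yields the rows of the file hierarchy table in exactly the order in which the pre/distance/size encoding numbers them.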

(39) 3 An XML Database as Filesystem

Given an instance of a BaseX-FS database that is connected to the operating system, file metadata that is normally accessible only with dedicated tools can now be represented as regular files and directories. Applications completely unaware of the database can use the Unix filesystem interface to access the file data stored uniformly in the DBMS. The database is mounted as a filesystem, and its data appear in the filesystem namespace. The database becomes the filesystem and the filesystem is the database.

Establishing a link between the database management system and the operating system kernel is crucial to achieving our ultimate goal: to provide both proven and stable access to the data via filesystem techniques for database-unaware applications, and enhanced, declarative access (including query support) to all stored files for database-aware applications. Via the Unix filesystem interface we can provide:

• Conventional file I/O to all files in the FS/DB Server (legacy interface)
• Access to the formerly locked-in metadata of files via a proven and well-known interface (metadata-aware filesystem)
• Manipulation of database content in a BaseX-FS instance, using any tool capable of reading and writing files (file I/O to database)
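To make the idea of answering filesystem requests from database content more concrete, the following Python sketch resolves a path against a small XML tree and returns stat-style information from the attributes stored on the matching element. It is only an illustration under stated assumptions: the sample document, the element and attribute names, and the helpers resolve and getattr_like are hypothetical, and no actual mounting or kernel interaction takes place.

```python
import xml.etree.ElementTree as ET

# Hypothetical FSML-like instance; names and attributes are assumptions.
FSML = ET.fromstring(
    "<dir name='' mode='0o755' size='0'>"
    "<dir name='music' mode='0o755' size='0'>"
    "<file name='song.mp3' mode='0o644' size='4096'/>"
    "</dir>"
    "</dir>"
)


def resolve(root, path):
    """Walk the tree one path component at a time; return the element or None."""
    node = root
    for part in [p for p in path.split("/") if p]:
        for child in node:
            if child.get("name") == part:
                node = child
                break
        else:
            return None  # no child with that name: the path does not exist
    return node


def getattr_like(root, path):
    """Answer a stat-style request from the attributes stored on the element."""
    node = resolve(root, path)
    if node is None:
        raise FileNotFoundError(path)
    return {
        "is_dir": node.tag == "dir",
        "size": int(node.get("size")),
        "mode": int(node.get("mode"), 8),  # stored as an octal string
    }


if __name__ == "__main__":
    print(getattr_like(FSML, "/music/song.mp3"))
```

A database-unaware tool would only ever see the result of such a lookup through the usual filesystem calls, while a database-aware application could query the same tree directly.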

(40) [Figure 3.1 labels: Applications, Users, Developers; Conventional File I/O; Metadata-aware File Access; DB-unaware applications (Filesystem Trail); Unix filesystem; Declarative (Query) Access; DB-aware applications (Database Road); XML database; XML Storage; Binary Backing Store]

Figure 3.1: Ultimate goal: Database-enhanced (“Database Road”) and conventional access (“Filesystem Trail”) to filesystem data.

3.1 On Filesystem Prototyping

Developing a filesystem from scratch is reported to be difficult and error-prone [71], [52]. Rajgarhia et al. from Stanford University summarize it as follows:

“Developing in-kernel file systems for Unix is a challenging task, due to a variety of reasons. This approach requires the programmer to understand and deal with complicated kernel code and data structures, making new code prone to bugs caused by programming errors. Moreover, there is a steep learning curve for doing kernel development due to the lack of facilities that are available to application programmers. For instance, the kernel code lacks memory protection, requires careful use of synchronization primitives, can be written only in C, and that too without being linked against the standard C library. Debugging kernel code is also tedious, and errors can require rebooting the system. Even a fully functional in-kernel file system has several disadvantages. Porting a file system written for a particular flavor of Unix to a different one can require significant changes in the design and implementation of the file system, even though the use of similar file system interfaces (such as the VFS layer) on several Unix-like systems makes the task

(41) somewhat easier. Besides, an in-kernel file system can be mounted only with superuser privileges. This can be a hindrance for file system development and usage on centrally administered machines, such as those in universities and corporations.” [52]

At least two approaches strive to overcome this burden and provide frameworks suitable for rapidly prototyping new filesystem concepts and ideas:

• Stackable Filesystems (paired with the FiST framework)
• Filesystem in USErspace (FUSE) approach

3.1.1 Stackable Filesystems

Stackable filesystems offer a way to add new functionality to existing filesystems without modifying kernel or existing filesystem code. The basic idea of stacking can be summarized as follows: Most operating systems separate their filesystem code into two components, a native filesystem and a general-purpose layer, the Virtual File System (VFS). The VFS provides a uniform access mechanism to filesystems at a higher abstraction level and is unaware of the underlying filesystems’ details. When filesystems are initialized in the kernel, a set of function pointers is installed in the VFS. The VFS, in turn, generically calls these function pointers without knowing which specific filesystem they represent. For example, an unlink system call gets translated into a service routine sys_unlink. It invokes the VFS function (vfs_unlink), which in turn invokes the filesystem-specific method by using its installed function pointer: ext4_unlink for ext4, nfs_unlink for NFS, or the appropriate function for other filesystems.

This allows yet another filesystem to be inserted right between the existing VFS and the base filesystem. Figure 3.2 shows such an inserted filesystem (CryptFS). It is called stackable because it is stacked on top of another, underlying filesystem. With the stackable filesystem approach, new functionality is layered on top of existing filesystems.
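The dispatch through installed function pointers and the insertion of a stacked layer described above can be modelled in a few lines of Python. This is a user-level analogy, not kernel code: the classes BaseFS, ToyCryptLayer and VFS, the method names, and the XOR transformation are all assumptions chosen for illustration, and the XOR step stands in for whatever a real layer such as CryptFS would do.

```python
class BaseFS:
    """Stands in for a native filesystem such as ext4; keeps file data in a dict."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data
        return len(data)

    def unlink(self, path):
        del self.files[path]


class ToyCryptLayer:
    """Hypothetical stacked layer: transforms data, then calls the layer below."""

    def __init__(self, lower, key=0x5A):
        self.lower = lower
        self.key = key

    def write(self, path, data):
        scrambled = bytes(b ^ self.key for b in data)  # toy XOR, not real encryption
        return self.lower.write(path, scrambled)

    def unlink(self, path):
        return self.lower.unlink(path)  # passed through unchanged


class VFS:
    """Dispatches generic calls through the operations installed at mount time."""

    def __init__(self):
        self.ops = {}

    def mount(self, fs):
        # Analogous to installing function pointers: only callables are stored,
        # so the VFS does not know which filesystem they belong to.
        self.ops = {"write": fs.write, "unlink": fs.unlink}

    def vfs_write(self, path, data):
        return self.ops["write"](path, data)

    def vfs_unlink(self, path):
        return self.ops["unlink"](path)


def demo():
    """Write and unlink one file through VFS -> stacked layer -> base filesystem."""
    base = BaseFS()
    vfs = VFS()
    vfs.mount(ToyCryptLayer(base))
    written = vfs.vfs_write("/note", b"abc")
    stored = base.files["/note"]
    vfs.vfs_unlink("/note")
    return written, stored, dict(base.files)


if __name__ == "__main__":
    print(demo())
```

Neither the VFS stand-in nor BaseFS needs to change when the extra layer is mounted, which is the property that makes stacking attractive for prototyping.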
Before the lower-level filesystem is called, a stackable filesystem can modify an operation and/or its arguments, and perform arbitrary operations before, after, or instead of the underlying filesystem’s actions. Thereby, the underlying filesystem could

(42) be any other filesystem (Ext4, NFS, another stackable FS). The features implemented in the stackable filesystem are separate from the filesystem module; thus, a stackable filesystem allows for portability to different environments.

[Figure 3.2 (diagram): a user process calls write() in userspace; in the kernel, the call passes through vfs_write() in the Virtual Filesystem (VFS), cryptfs_write() in CryptFS, and ext4fs_write() in ext4.]

Figure 3.2: Information and execution flow in a stackable filesystem

The File-System Translator (FiST) [69] is a high-level language developed by Erez Zadok from Stony Brook University to describe stackable filesystems. Given a FiST description as input, a dedicated compiler can generate kernel filesystem modules for different platforms. Erez Zadok and Jason Nieh explain in [70] why they consider FiST a good choice for prototyping new filesystems using the stackable filesystem approach:

“To ease the problems of developing and porting stackable file systems that perform well, we propose a high-level language to describe such file systems. There are three benefits to using a language:

1. Simplicity: A file system language can provide familiar higher-level primitives that simplify file system development. The language can also define suitable defaults automatically. These reduce the amount of code that developers need to write, and lessen their need for extensive knowledge of kernel internals, allowing even non-experts to develop file systems.

2. Portability: A language can describe file systems using an interface abstraction that is common to operating systems. The language compiler can bridge the gaps among different systems’ interfaces. From a single description of a file system, we could generate file system code for different platforms. This improves portability.