Overview of Challenges and Solutions in the Primo Implementation at OBV


DIE ÖSTERREICHISCHE BIBLIOTHEKENVERBUND UND SERVICE GMBH


Victor Babitchev, Ulrike Krabo

Primo Developer Meeting, Dresden, 08.11.2011


Contents

• Primo at OBV

• Data preparation of central resources in the heterogeneous OBV environment

• Data corrections and data validation

• Overview of other tools

• Frontend extensions

• Layout changes

• PNX enrichment

• Dynamic integration of further services (Wikipedia, Google Books)


Primo at OBV: participants - 8 (+4)

Institution                                 Start
Universität Innsbruck                       autumn 2009
Universität Wien                            March 2010
Consortium view (Verbundsicht)              April 2010
Veterinärmedizinische Universität Wien      July 2010
Universität Graz                            February 2011
Österreichische Nationalbibliothek          May 2011
Wirtschaftsuniversität Wien                 July 2011
Universität für angewandte Kunst            November 2011
Technische Universität Wien                 2012 Q1 ?
Universität Klagenfurt                      2012 Q1-2 ?
Medizinische Universität Wien               2012 Q3 ?
Universität Salzburg                        2012 Q4 ?


Central Primo instance: data model / data flow

[Diagram, labels only: local systems (UBW, ACC, UBI, ...; BIB01/BIB02, HOL, ADM, Z30, SE) with local publishing (Publish Lokal 1, Publish Lokal 2); union system (ACC01; BIB, HOL, Z300) with central publishing and preparation (PPS), including moving local BIB fields to HOL; eDOC (electronic objects linked to ACC01); further sources (ML, KB SFX, digital repositories); exports for Primo; Primo: harvesting, normalization, enrichment (plug-in 1), loading, dedup, FRBR, indexing (plug-in 2), PNX.]


Data preparation of central resources in the heterogeneous OBV environment


Challenges of heterogeneous data supply for Primo

• Primo is flexible in processing and indexing data in heterogeneous environments

• But it is your job to prepare the data and, what is even more challenging, to keep the data consistent!


Challenges of heterogeneous data supply for Primo

A typical heterogeneous environment: catalog and "catalog enrichment repository"

• The catalog is the master system for bibliographic data (8.3 million records); some of these records have linked objects that can be indexed (abstracts, TOCs, full texts)

• The repository is the master system for digital objects (our "eDOC" maintains ~575,000 items)

• Each system has its own data management workflow

Relations between the two systems:

• Catalog record (0..1) has digital objects: repository object (1..N)

• Repository objects (1..N) are linked to one catalog record (1)


Challenges of heterogeneous data supply for Primo

To implement full-text indexing in Primo, data from both sources must be supplied and maintained consistently!

Processing of bibliographic and linked "full text" data in Primo (top-to-bottom flow):

• Catalog record -> Primo pipe -> pnx_record

• Repository object -> import pnx_extensions -> pnx_extension (TOC), pnx_extension ("full text")

• Indexing (full document: bibliographic data + full text)

Note: OBVSG uses the Primo BO "import pnx_extensions" tool; an alternative would be to use Primo file splitters (run in pipes).


Challenges of heterogeneous data supply for Primo

The challenge

• Catalog and repository are master systems for the data types which they manage

• Primo is a „slave“ system consuming, transforming and linking data that it receives from the master systems

• Changes in each master system must be registered consistently and prepared for Primo with their interrelationships in mind. The latter is not trivial to achieve.


Challenges of heterogeneous data supply for Primo

OBVSG approach – consistent data extraction for Primo

• Data changes in the catalog are prepared by the Ex Libris "Aleph Publishing Mechanism"

• Object changes in eDOC are prepared by PPS, which also fetches the plain texts of the objects (*.pdf etc.) from eDOC

→ PPS prepares the input for Primo, "merging" the data according to their relations

[Diagram: along a time scale (Day 1; Day 2 with t1 to t4), the central catalog supplies a delta as MAB XML and eDOC supplies a delta as eDOC XML (plain texts) to PPS (Primo Proc. System); the PNX records go through the Primo pipe, the pnx_ext records through "Import pnx_extensions", followed by Primo indexing.]
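The "merging" step can be pictured with a minimal Python sketch. It is not the PPS implementation, which the slides do not show: the record shape `PrimoInput`, the function `merge_deltas` and the sample AC numbers are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class PrimoInput:
    """Hypothetical shape of what a PPS-like step hands to Primo for one record."""
    ac_number: str
    republish_record: bool = False                        # bibliographic data changed
    extensions: list[str] = field(default_factory=list)   # plain texts for pnx_extensions


def merge_deltas(catalog_delta: set[str],
                 edoc_delta: dict[str, list[str]]) -> dict[str, PrimoInput]:
    """Combine both deltas so that every affected record is prepared exactly once.

    catalog_delta: AC numbers whose catalog record changed since the last run.
    edoc_delta:    AC number -> plain texts of objects that changed in the repository.
    """
    merged: dict[str, PrimoInput] = {}
    for ac in catalog_delta:
        merged.setdefault(ac, PrimoInput(ac)).republish_record = True
    for ac, texts in edoc_delta.items():
        merged.setdefault(ac, PrimoInput(ac)).extensions.extend(texts)
    return merged


# Made-up sample data: AC002 changed in both master systems.
result = merge_deltas({"AC001", "AC002"},
                      {"AC002": ["toc text"], "AC003": ["full text"]})
for ac in sorted(result):
    print(ac, result[ac].republish_record, len(result[ac].extensions))
```

Keying everything by the catalog identifier is the design choice that matters here: a record touched in both systems on the same day yields one combined input instead of two independent ones.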

Data corrections and data validation

Consistent data changes in heterogeneous environment

Sample problem

A librarian receives a request to remove access to a full-text dissertation

• The text object is a link in the catalog record. Removing the link (tag 655) is not a solution, but it happens …

The correct solution requires considering the role of each system component ("master" or "slave") and its interconnections

… but such requests are hard to process manually → we decided to automate it:

• protected the eDOC 655 tags against manual changes (indicator "o")

• implemented a "language of requests" that librarians place into the catalog's "memo record"


Implementation of consistent data changes

REQUESTS are placed into standard "memo records" from the Aleph GUI client:

- "delete this object"
- "replace that scan with a better one"
- …

"Command requests" vs. direct changes in the 655 links to eDOC

"Transaction": a Z104 request activates controlled changes in all three systems (ACC01, eDOC, Primo)

Components:

1. acc01.Z104 (request)
2. Queue manager (queue)
3. eDOC module
4. ACC01 module
5. Primo module

Data affected: full texts, 655 links to eDOC, eDOC objects

Perl, Oracle, MySQL

Resources: 4 m/month
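A queue manager of this kind can be sketched in a few lines of Python. This is an illustration only, not the OBVSG code (which is written in Perl): the `Request` shape, the command name `delete` and the module callables are hypothetical, and a real "transaction" would also need rollback or retry handling, which is omitted here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Request:
    """Hypothetical shape of a librarian's request read from a memo record."""
    ac_number: str
    command: str       # e.g. "delete" (made-up command name)
    object_id: str


def process_queue(queue: list[Request],
                  modules: list[tuple[str, Callable[[Request], None]]]
                  ) -> list[tuple[Request, str]]:
    """Apply each request to every module in order; stop a request at its first failure."""
    log = []
    for request in queue:
        status = "done"
        for name, apply in modules:
            try:
                apply(request)
            except Exception as exc:
                status = f"failed in {name}: {exc}"
                break
        log.append((request, status))
    return log


# Stand-in modules that only record what they were asked to do.
calls = []

def make_module(name):
    def apply(request):
        calls.append((name, request.command, request.object_id))
    return apply

modules = [(name, make_module(name)) for name in ("eDOC", "ACC01", "Primo")]
log = process_queue([Request("AC00000001", "delete", "obj-1")], modules)
print(log[0][1], [name for name, _, _ in calls])
```

The point of routing every change through one queue is that the three systems are always updated in the same order and every outcome is logged, instead of relying on a manual edit in one of them.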


Implementation of consistent data changes

Have all problems been solved so far?

Almost… Good control over "the situation" is still needed:

→ check and validate your data regularly

→ prepare, or better "generate", corrections where possible

To achieve this, we developed another tool: "eDocXray4Primo"


Validation of data consistency

eDocXray4Primo checks data related to full text indexing in all three systems:

• eDOC

• Aleph

• Primo

Check sequence: 1. Bib. IDs (bibl. records having links to eDOC), 2. Aleph, 3. eDOC, 4. Primo -> Reports

Error codes:

er-1  : NO SysNr was found for given ACNrs
er-3  : XREF exists but NO eDOC objects
er-4  : eDOC object(s) exist but NO XREF found
er-21 : NO objects in eDOC but pnx-ext. and XREF exist
er-22 : NO pnx-ext. found for eDOC object(s) and XREF exists
er-23 : Nr. of eDOC objects greater than nr. of pnx-ext. (XREF exists)
er-24 : Nr. of eDOC objects less than nr. of pnx-ext. (XREF exists)
er-25 : ACC DS: NO PNX record exists in Primo (for ACC DS only)
er-26 : non-ACC: NO PNX record exists in Primo (XREF for ACC DS exists!)
er-27 : non-ACC: NO PNX record exists in Primo (XREF for ACC DS does NOT exist!)
er-28 : NO PNX record exists due to failure at harvesting/nep
wrn-2 : Nr. of eDOC objects not equal to nr. of V_enrichm. records

Sample of report

Perl, Oracle, MySQL / ~ 2 m/month
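A subset of these checks can be expressed as a small decision function. The Python sketch below covers only the count-based codes (er-3, er-4, er-21 to er-24). The function name, its inputs and the precedence between codes are assumptions; the slides do not say whether the real tool reports one code or several per record.

```python
from typing import Optional


def classify(has_xref: bool, n_edoc: int, n_pnx_ext: int) -> Optional[str]:
    """Return one error code for a record, or None if the counts are consistent.

    has_xref:  the catalog record carries a link (XREF) to eDOC
    n_edoc:    number of objects found in eDOC
    n_pnx_ext: number of pnx extensions found in Primo
    Assumption: at most one code per record, with the precedence coded below.
    """
    if has_xref and n_edoc == 0:
        return "er-21" if n_pnx_ext > 0 else "er-3"
    if not has_xref:
        return "er-4" if n_edoc > 0 else None
    # From here on: XREF exists and eDOC holds at least one object.
    if n_pnx_ext == 0:
        return "er-22"
    if n_edoc > n_pnx_ext:
        return "er-23"
    if n_edoc < n_pnx_ext:
        return "er-24"
    return None


# Made-up sample records: (id, has_xref, n_edoc, n_pnx_ext)
samples = [("rec-a", True, 2, 2), ("rec-b", True, 0, 1), ("rec-c", False, 1, 0)]
for rec_id, xref, n_edoc, n_ext in samples:
    print(rec_id, classify(xref, n_edoc, n_ext) or "ok")
```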

Running data supply and validation tools

All three tools run daily:

• Object data corrections ("655 requests") run before PPS

• PPS starts at 22:00, followed by the Primo pipes (10,000 to 40,000 records daily)

• Data validation and corrections run after the Primo update


Primo data supply summary

• If complex or intensive processing of data for Primo is needed, you can do it on the [MAB] XML files produced by the Aleph Publishing Mechanism and then pass them to the Primo pipes

• Implement a stable data supply from heterogeneous sources, and plan consistent data changes together with regular validation → this saves hours of analysis of complex data problems!

• Full-text management in Primo does not yet support some repository operations (e.g. the deletion of objects, which we do with our own tools)

• We avoid pipes for bringing full texts into Primo; we import objects outside the pipes (this scales well and brings other advantages for us)


Other Primo Tools

Samples

• Statistics tools for institutions

• Back office tools ("attempts to compensate" for its lack of multi-tenancy): backup of configuration data, reports on changes, finding differences between production and staging systems

• Single Sign On (Shibboleth)

• Other consortium specific tools

Note. SQL access to Primo Oracle is used often in our tools

OBV frontend extensions

Overview

Examples: layout changes + goodies

• Separation of document-type icons and cover images, plus links in the brief display

• Navigation in the brief display

• Permalink, e.g. http://permalink.obvsg.at/AC07023679

• Individual help pages for each view (HTML and CSS)

• No tabs for FRBRized hits

Examples: PNX enrichments

Examples: dynamic integration of new services

• New tabs (Wikipedia, Google Books, Ebooks On Demand)

• Addthis.com

• Tooltips

Technical challenges

Concept: dynamic loading of additional data

Components: Primo UI (brief, full, eshelf), OBV web service, Primo DB, cache, external APIs (RVK Online API, Google Books API, DBPedia)

1: HTTP request (JS AJAX onReady)

2: Web service (FastCGI) -> read PNX from DB -> cache -> HTTP request to external APIs

3: Evaluate the JSON response and integrate the services -> jQuery -> Tab API
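Step 2 of this flow might look roughly like the following Python sketch. It is a hedged illustration, not the OBV web service: `handle_request`, the dict-shaped PNX record and the stand-in lookups are all assumptions, and no database or network access takes place.

```python
import json
from typing import Callable

Pnx = dict  # hypothetical: a PNX record reduced to a plain dict


def handle_request(record_id: str,
                   read_pnx: Callable[[str], Pnx],
                   services: dict[str, Callable[[Pnx], object]],
                   cache: dict[str, dict]) -> str:
    """Read the PNX, ask each external service once, cache and return JSON."""
    if record_id not in cache:
        pnx = read_pnx(record_id)
        cache[record_id] = {name: lookup(pnx) for name, lookup in services.items()}
    return json.dumps({"id": record_id, "services": cache[record_id]}, sort_keys=True)


# Stand-ins for the Primo database and an external API (made-up data).
external_calls = []

def fake_read_pnx(record_id):
    return {"id": record_id, "isbn": ["0000000000"], "pnd": None}

def fake_google_books(pnx):
    external_calls.append("google_books")
    return {"preview": bool(pnx["isbn"])}

cache = {}
services = {"google_books": fake_google_books}
first = handle_request("rec-1", fake_read_pnx, services, cache)
second = handle_request("rec-1", fake_read_pnx, services, cache)  # answered from cache
print(first)
print(len(external_calls))
```

The browser side (steps 1 and 3) would then only need to request this JSON after page load and add the tabs for whichever services returned data.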


Implementation details

• showPnx=true is too slow

• -> a server-side component is needed (X-Service or DB request)

• External services

• Different APIs

• Google Books API, RVK Online API, …

• SPARQL (Linked Data) would be the ideal solution, but has technical problems (availability and performance) -> hence the workaround:

• Download the data regularly -> work with local data!

• Cache the results of queries to the external APIs (currently weekly)

• FastCGI

• Runs on a different host -> cross-domain requests -> hence JSONP
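The two techniques named last, weekly caching and JSONP, can be illustrated with a short Python sketch. The class `TtlCache`, the function `to_jsonp` and the callback name are hypothetical; the slides give no details of the actual FastCGI service.

```python
import json
import re
import time

WEEK_SECONDS = 7 * 24 * 3600


class TtlCache:
    """Keep fetched results for a fixed time (one week by default)."""

    def __init__(self, ttl=WEEK_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock
        self._store = {}          # key -> (time stored, value)

    def get_or_fetch(self, key, fetch):
        entry = self._store.get(key)
        now = self._clock()
        if entry is None or now - entry[0] >= self._ttl:
            entry = (now, fetch(key))   # missing or expired: ask the external API
            self._store[key] = entry
        return entry[1]


_CALLBACK = re.compile(r"^[A-Za-z_$][A-Za-z0-9_$.]*$")


def to_jsonp(callback, payload):
    """Wrap a JSON payload in a call to the client-supplied callback function."""
    if not _CALLBACK.match(callback):
        raise ValueError("invalid JSONP callback name")
    return f"{callback}({json.dumps(payload)});"


# Demo with a fake clock so that expiry can be shown without waiting a week.
now = [0.0]
fetches = []

def fetch(key):
    fetches.append(key)
    return {"key": key}

ttl_cache = TtlCache(clock=lambda: now[0])
ttl_cache.get_or_fetch("isbn:1", fetch)
ttl_cache.get_or_fetch("isbn:1", fetch)   # served from the cache
now[0] += WEEK_SECONDS
ttl_cache.get_or_fetch("isbn:1", fetch)   # expired, fetched again
print(len(fetches))
print(to_jsonp("obvCallback", {"tabs": ["wikipedia"]}))
```

Validating the callback name is worth the extra lines: the name arrives as a request parameter and is echoed into executable JavaScript.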


Wikipedia tab

• Prerequisites:

• PND in the PNX record (e.g. display/lds34)

• Short abstracts from DBPedia: http://dbpedia.org/Downloads

• PNDs used in Wikipedia: http://toolserver.org/~apper/pnd.txt

• URL redirect: http://toolserver.org/~apper/pd/person/pnd-redirect/de/[PND]

• Workflow: if the PNX contains a PND and that PND is used in Wikipedia, the Wikipedia tab is created

• Possible improvements:

• Use the DBPedia SPARQL endpoint

• More information in the tab, for example images
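The workflow above reduces to a set lookup plus a URL template. In the Python sketch below, the redirect URL pattern is taken from the slide; everything else is an assumption, in particular that the downloaded PND list holds one PND per line, and the sample PND values are made up.

```python
from typing import Iterable, Optional

# URL pattern as given on the slide; [PND] is replaced by the identifier.
REDIRECT_TEMPLATE = "http://toolserver.org/~apper/pd/person/pnd-redirect/de/{pnd}"


def load_pnd_list(lines: Iterable[str]) -> set[str]:
    """Build the local lookup set (assumed file format: one PND per line)."""
    return {line.strip() for line in lines if line.strip()}


def wikipedia_tab_url(pnx_pnd: Optional[str], pnds_in_wikipedia: set[str]) -> Optional[str]:
    """Return the tab's target URL, or None if no Wikipedia tab should be created."""
    if pnx_pnd and pnx_pnd in pnds_in_wikipedia:
        return REDIRECT_TEMPLATE.format(pnd=pnx_pnd)
    return None


# Made-up sample values; the real list is downloaded regularly and kept locally.
known = load_pnd_list(["100000001\n", "100000002\n", "\n"])
print(wikipedia_tab_url("100000001", known))
print(wikipedia_tab_url("999999999", known))
print(wikipedia_tab_url(None, known))
```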


Book preview tab

• Prerequisites:

• ISBN in the PNX record (e.g. display/identifier)

• Google Books API

• Workflow: if the PNX contains at least one ISBN, the Google Books API is queried by ISBN. If Google Books offers a partial or full view for that ISBN, the Google Books tab is generated.

• Possible improvements:

• Preview directly in a lightbox within Primo
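The decision whether to generate the tab can be sketched as a pure function over the API response. The Python example below assumes the response shape of the Google Books API v1 volumes search (`items[].accessInfo.viewability` with values such as `PARTIAL` and `ALL_PAGES`); the slides do not say which Google Books interface was used in 2011, so treat the field names as an assumption. The lookup is injected, so the sketch runs without network access.

```python
from typing import Callable, Iterable

VIEWABLE = ("PARTIAL", "ALL_PAGES")   # assumed values for partial and full view


def has_preview(api_response: dict) -> bool:
    """True if any returned volume offers a partial or full view."""
    for item in api_response.get("items", []):
        if item.get("accessInfo", {}).get("viewability") in VIEWABLE:
            return True
    return False


def should_show_tab(isbns: Iterable[str], lookup: Callable[[str], dict]) -> bool:
    """Query each ISBN from the PNX until one of them has a preview."""
    return any(has_preview(lookup(isbn)) for isbn in isbns)


# Stand-in for the HTTP call, returning made-up responses.
fake_responses = {
    "0000000001": {"items": [{"accessInfo": {"viewability": "NO_PAGES"}}]},
    "0000000002": {"items": [{"accessInfo": {"viewability": "PARTIAL"}}]},
}

def fake_lookup(isbn):
    return fake_responses.get(isbn, {})

print(should_show_tab(["0000000001", "0000000002"], fake_lookup))
print(should_show_tab(["0000000001"], fake_lookup))
print(should_show_tab([], fake_lookup))
```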

Overview of plugins used

(+ snippets) + addthis.com + Tab API

Summary: frontend extensions

• JSP changes are difficult

• Consortium environment and maintenance

• Limits of JS

• Performance and possible conflicts with the Ex Libris JS

• Old jQuery version

• Maintenance

• PNX enrichment vs. dynamic extensions with JS

• Working with plugins

• Ideas (but no resources)

• RTA for the central view

• More PushTo formats -> collaboration in the Primo community?
