Visual-Interactive Querying for Multivariate Research Data Repositories Using Bag-of-Words


Maximilian Scherer

TU Darmstadt, Interactive Graphics Systems Group

Fraunhoferstr. 5, 64283 Darmstadt, Germany

maximilian.scherer@gris.tu-darmstadt.de

Tatiana von Landesberger

TU Darmstadt, Interactive Graphics Systems Group

Fraunhoferstr. 5, 64283 Darmstadt, Germany

tatiana.von_landesberger@gris.tu-darmstadt.de

Tobias Schreck

University of Konstanz, Data Analysis and Visualization Group

Universitaetsstr. 10, 78457 Konstanz, Germany

tobias.schreck@uni-konstanz.de

ABSTRACT

Large amounts of multivariate data are collected in different areas of scientific research and industrial production. These data are collected, archived and made publicly available by research data repositories. In addition to meta-data based access, content-based approaches are highly desirable to effectively retrieve, discover and analyze data sets of interest. Several such methods, which allow users to search for particular curve progressions, have been proposed. However, a major challenge when providing content-based access - interactive feedback during query formulation - has not received much attention yet. This is important because it can substantially improve the user's search effectiveness.

In this paper, we present a novel interactive feedback approach for content-based access to multivariate research data. Thereby, we enable query modalities that were not available for multivariate data before. We provide instant search results and highlight query patterns in the result set. Real-time search suggestions give an overview of important patterns to look for in the data repository. For this purpose, we develop a bag-of-words index for multivariate data as the back-end of our approach.

We apply our method to a large repository of multivariate data from the climate research domain. We describe a use-case for the discovery of interesting patterns in maritime climate research using our new visual-interactive query tools.

Categories and Subject Descriptors

H.3.7 [Information Storage and Retrieval]: Digital Libraries; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Indexing methods

Keywords

Research Data Repositories; Content-Based Retrieval; Bag-of-Words; Query Interfaces; Multivariate Data

1. INTRODUCTION

Multivariate data can be described as tabular data with dimensionality m × n, where n is the number of variables (e.g., water density, water depth or pressure) and m is the number of observations (e.g., time of day or location). Such data arises in many areas of research, industrial production and other commercial applications. Due to increasing efforts in the digital library community over the last decade, such data, particularly that obtained for research purposes, is made publicly available in specialized research data repositories. For example, the PANGAEA repository [6] is a digital library for data-intensive environmental sciences. It hosts very large amounts of earth observation data of various kinds (e.g., time series, multivariate observations, image data, etc.), which are provided for public access. Similar to the search and access paradigms for multimedia databases, content-based access to such repositories has started to receive attention from the Digital Library community. Such access supports users in searching and exploring data patterns, in addition to annotated textual meta-data. Previous work has considered similarity functions and feature extraction techniques for relevant aspects of research data, including time series, functional, and bivariate data [14, 10, 7, 25], as well as their evaluation [13, 26]. Little research, however, has focused on interactive methods which use these new similarity functions to help the user with the query formulation process based on data content. Such methods include highlighting of results to show why a document was retrieved, as well as search suggestions to provide the user with an overview of meaningful terms she can search for next. These functions are typically located on the front-end of a visual-interactive retrieval system, but require indexing structures in the back-end to be efficient.

In this work, we present a novel approach for providing the user with interactive search suggestions and result highlighting when querying multivariate data. Such visual-interactive tools are already successfully used in textual search engines and yield similar advantages to users querying non-textual research data documents. Search suggestions provide users with an overview of (often complex) data patterns and variable names to search for or refine their search; result highlighting shows the user which part of a document matched her query and thus explains why it was retrieved. Figure 1 shows a screenshot of the proposed system in action and illustrates these benefits for the users (the use-case is detailed in Section 4). By searching for a specific pattern of water depth versus temperature, we can find measurements that were obtained in areas with a similar maritime climate. Using only meta-data (the geo-location in this case) such a query would not have been possible.

Figure 1: Case study. We queried for a specific pattern between temperature and water depth to see whether the documents containing this pattern were measured at locations with a similar maritime climate. The first document is situated in the Norwegian Sea, while documents 2 to 5 were obtained in the Antarctic Southern Ocean. Both regions have a maritime subarctic climatic zone, explaining why the same pattern was found there. Map data is attributed to Google Maps.

Published in: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, Indianapolis, IN, USA, July 22-26, 2013 / J. Stephen Downie et al. (eds.). New York, NY: Association for Computing Machinery, 2013, pp. 285-294. ISBN 978-1-4503-2077-1

Akin to search suggestions on popular web or e-commerce search engines, we present the user with search suggestions and completions, based on her partial query as it is being entered. Figure 4 shows an example of this suggestion approach. Furthermore, we can provide the user with instant search results and also highlight those parts of a retrieved multivariate data set that correspond to the user's query. Similar to paragraph highlighting in text retrieval, we propose to show those scatter-plots of a retrieved data set that contain parts of the query (e.g., a textual hit on the axis label or a particular scatter-plot pattern) and to highlight these parts. That way a user can see why a particular document was retrieved in the first place, and quickly skim through the results to find the data sets she is most interested in. Another example in Figure 5 shows a result list and the highlighted scatter-plots.

To allow for this kind of visual-interactive querying in multivariate data, we develop a novel indexing method based on a bag-of-words approach. The bag-of-words approach has been shown to yield state-of-the-art retrieval performance in multimedia databases, e.g., for images, videos or music [12]. We propose to adapt it for retrieval in multivariate research data repositories. The basic idea is shown in Figure 2. By extracting bivariate features from each pair of variables in multivariate data, we obtain a set of local features for each document. We then quantize each feature vector by assigning the id of the closest cluster centroid (obtained, e.g., via k-means clustering) to each feature vector. Thus we can represent a document with multivariate data by a set of content-based tokens obtained from this quantization. Such a representation allows us to leverage efficient indexing using inverted lists. The details of this indexing approach are described in Section 3.

We provide a case study of our proposed approach in Section 4. We index all publicly available research data documents of the data repository PANGAEA [6]. We show how our proposed indexing scheme and the newly enabled visual-interactive search tools can be used for this kind of research data documents.

Figure 2: Overview of our bag-of-words approach for indexing multivariate research data. [Pipeline diagram with the labels: Multivariate Documents, Extraction, Local Features, Quantization, Tokenization, Bag-of-Words, Indexing, Inverted Index.]

2. RELATED WORK

This work is related to several aspects of digital libraries, multimedia information retrieval and data mining. In the following two subsections we outline recent work related to this paper.

2.1 Content-based and Visual Access Methods

Content-based analysis and indexing is an important research domain within digital libraries to provide additional access paradigms to documents besides access based on annotated meta-data [20].

Examples of recent digital library systems that provide different means of content-based access include systems for 3D models and classical music [3], images [23, 5], time-series data [1], climate data [25] and chemical data [16]. On top of access via annotated meta-data, these digital library systems extract domain-specific descriptors from the underlying data as a basis to implement distance functions in support of search and access functionality. Such access includes query-by-example, e.g., supplying an example image and retrieving similar images [23, 5]; query-by-sketch, e.g., drawing a shape and retrieving similar 3D models; or content-based layouts, e.g., clustering time-series by data similarity and presenting the user with an overview [1].

Visual access methods have been shown to be highly successful for providing overview and search functionality for users in the Digital Library domain [9]. Effective interfaces can help to more effectively browse, search or analyze huge data repositories [31]. The idea behind many approaches is based on Shneiderman's Visual Information Seeking Mantra to provide overview first and details on demand [27]. A recent example of such a system in the digital library context was presented in [2]. There, by analyzing meta-data and time-series based content at the same time, the system generates an interactive layout of research data to enable the discovery of interesting co-occurrences of meta-data based and time-series based patterns. Such approaches can combine traditional meta-data based and content-based methods and can extend the standard search support with elements of explorative search systems useful for hypothesis generation [30].

2.2 Bag-of-Words

The focus of this work is to provide the user with a set of interactive retrieval tools which can respond in real-time to user interaction. Our interactive approaches include instant display of search results, highlighting and search suggestions for querying multivariate research data. All of these functions require an efficient computation of similarities. A suitable, efficient content-based indexing method to this end is the bag-of-words (BOW) approach that has become highly popular in multimedia information retrieval. It has been shown to yield state-of-the-art retrieval performance in different domains, including image and music retrieval [17, 22, 12]. In this paper we transfer it for the first time to the domain of multivariate research data. BOW approaches originate from text retrieval and natural language processing, where the inherent tokenization of textual documents was used for proposing efficient indexing and term weighting methods [24].

It was first applied to multimedia documents by Sivic and Zisserman for content-based image retrieval [28]. The basic approach is to extract local features, e.g., SIFT [19] or SURF features for images, quantizing these features via k-means or other suitable clustering methods [32], and finally indexing / weighting these tokens using techniques like tf-idf [24], (probabilistic) latent semantic indexing [11] or latent Dirichlet allocation [4]. This allows for similarity measurements between multimedia objects via their associated bag-of-words (usually the terms are encoded as a histogram), as well as querying or clustering the documents via specific terms (e.g., a predominant color in an image). Most recently, such a bag-of-words approach was also applied successfully to the retrieval of 3D models and 3D scenes [8] as well as to the retrieval of time-series data [18].

Figure 3: Bivariate feature extraction: Given bivariate input data (a), estimate Gaussian kernel density (b), apply Canny edge detector (c) and compute edge histogram descriptor (d). This algorithm by Scherer et al. [26] has been shown to yield state-of-the-art performance for bivariate data retrieval. [Panels: (a) Input data, (b) Gaussian kernel density, (c) Detected edges, (d) Edge histogram.]

3. APPROACH

In our approach, we provide search suggestions and highlighting for querying multivariate data documents. To perform the required computations at interactive rates, we need efficient similarity functions for multivariate data. Therefore, we base our approach on constructing and utilizing a bag-of-words index.

In the following subsections, we first describe the construction of the index itself and then describe the interactive feedback functions for retrieval of multivariate data. We present several examples for retrieval, suggestions and highlighting with our proposed approach, using data described in Section 4.

3.1 Bag-of-Words Indexing

As a basis of our approach we provide data indexing, whereby we adapt the bag-of-words approach to multivariate (tabular) data. The flow-chart in Figure 2 gives an overview of the required algorithmic steps. We describe how we adapted each of these steps for indexing multivariate data.

1. Feature Extraction: extract a set of n local feature vectors v_i for each data object (scatter plot in our case)

2. Quantization: quantize each of the feature vectors v_i for all documents. (Offline step: training a quantizer model q(V))

3. Tokenization: combine the quantized features with additional categorical information

4. Term Weighting: assign a suitable weight to each obtained term

5. Indexing: build inverted lists containing the relevance of a given token (quantized feature vector) for every document

Step 1: Feature Extraction.

We consider multivariate data documents that contain tabular data with dimensionality m × n. In practice that means n different variables (like water density, water depth or pressure) were measured m times. To extract a set of feature vectors from such a document containing multivariate data, we propose to compute all bivariate variable combinations, and compute a feature vector from each of these two-dimensional point-clouds (scatter-plots). Much like the previously mentioned SIFT or SURF features for images, these feature vectors are also local in the sense that they represent a local pattern (bivariate) in the whole (multivariate) document.

Based on previous results on feature extraction and benchmarking for bivariate data [26], we use the algorithm that yielded the best overall results in the benchmark: EDH. It is based on the MPEG-7 edge histogram descriptor used in shape retrieval. The basic idea of EDH is to render the actual scatter-plot of the bivariate data using Gaussian kernel-density estimation. During this process the scatter-plot is min/max-normalized, resulting in translation and linear-trend invariance. Then an edge filter is applied to this rendered image and the orientations of the resulting edges are extracted as a histogram. Figure 3 shows an illustration of this extraction process.
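The extraction idea described above (normalize, render a density image, detect edges, histogram the edge orientations) can be illustrated with a simplified sketch. This is not the EDH implementation of [26] nor the MPEG-7 descriptor: it swaps the Canny detector for plain image gradients, and the function name `edge_histogram_sketch`, the 64 x 64 grid, the kernel width and the 4 x 4 cells with 5 orientation bins (chosen only so that the output has 80 dimensions) are all assumptions made for illustration.

```python
import numpy as np

def edge_histogram_sketch(x, y, grid=64, sigma=2.0, cells=4, bins=5):
    """Simplified edge-orientation histogram of a scatter-plot (illustrative only)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Min/max-normalize each axis to [0, 1]; constant variables map to 0.
    def normalize(v):
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.zeros_like(v)

    # Render the point cloud as a 2D histogram and smooth it with a
    # separable Gaussian kernel (a grid-based density estimate).
    density, _, _ = np.histogram2d(normalize(x), normalize(y),
                                   bins=grid, range=[[0, 1], [0, 1]])
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()

    def smooth(row):
        return np.convolve(row, kernel, mode="same")

    density = np.apply_along_axis(smooth, 0, density)
    density = np.apply_along_axis(smooth, 1, density)

    # Edge strength and orientation from image gradients (stand-in for Canny).
    g0, g1 = np.gradient(density)
    magnitude = np.hypot(g0, g1)
    orientation = np.mod(np.arctan2(g0, g1), np.pi)
    bin_idx = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)

    # Magnitude-weighted orientation histogram per image cell.
    step = grid // cells
    feature = np.zeros((cells, cells, bins))
    for i in range(cells):
        for j in range(cells):
            block = (slice(i * step, (i + 1) * step),
                     slice(j * step, (j + 1) * step))
            feature[i, j] = np.bincount(bin_idx[block].ravel(),
                                        weights=magnitude[block].ravel(),
                                        minlength=bins)
    feature = feature.ravel()
    total = feature.sum()
    return feature / total if total > 0 else feature
```

Calling `edge_histogram_sketch(depth, temperature)` on two columns of a document would return one normalized 80-dimensional vector for that scatter-plot.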

Figure 4: An example query using the PANGAEA data repository. We searched for a linear relationship between water depth and water pressure. The tri-gram (variable x, variable y, curve form) we searched for is highlighted in each retrieved data set. [Screenshot: Query Term 1 with Variable X = Depth_water_[m] and Variable Y = Press_[dbar]; results shown include doi:10.1594/PANGAEA.143952, doi:10.1594/PANGAEA.408922, doi:10.1594/PANGAEA.144061 and doi:10.1594/PANGAEA.207220.]

The result of this feature extraction step is a set of 80-dimensional feature vectors for each multivariate document. The number of feature vectors equals the number of possible scatter-plots, n · (n - 1) for multivariate data with n dimensions.
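As a small illustration of this count, the ordered variable pairs of a document can be enumerated as follows. The helper name `variable_pairs` and the example variable names are hypothetical; counting ordered pairs (so that a-versus-b and b-versus-a are distinct plots) is what yields n · (n - 1).

```python
from itertools import permutations

def variable_pairs(variables):
    """Return all ordered (x, y) variable pairs, i.e. all scatter-plots."""
    return list(permutations(variables, 2))

# Four variables give 4 * 3 = 12 scatter-plots.
pairs = variable_pairs(["depth", "pressure", "temperature", "salinity"])
print(len(pairs))
```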

Please note that the benchmark used for evaluation of bivariate descriptors [26] cannot be directly applied to our case, as we consider multivariate data retrieval.

Step 2: Quantization.

The result of the feature extraction is a set of feature vectors for each document. Since we need to obtain a set of tokens for each document, we train a quantization model that is suitable to project each input feature vector to a categorical integer value (the id of the codebook entry). There is a wealth of clustering algorithms suitable for this task [32]. We chose k-means clustering as this has shown good performance for image-retrieval tasks at reasonable computational costs. We choose a random subset of all feature vectors v_i and compute a k-means clustering on this subset. The number of clusters k was set to 5000 based on the literature for a compromise between discriminativeness and computational cost [33].

We can then represent an unlabeled feature vector by computing the nearest of the k centroids and assigning the ID of this centroid as the token for this feature vector.
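A minimal sketch of this quantization step, assuming plain Lloyd-style k-means written with NumPy; the function names `train_codebook` and `quantize`, the iteration count and the initialization are illustrative assumptions, not the authors' implementation. The brute-force distance matrix is only practical for small inputs; with k = 5000 and a large sample, a dedicated clustering library would be the more realistic choice.

```python
import numpy as np

def quantize(features, centroids):
    """Return (nearest centroid id, distance to it) for each feature vector."""
    features = np.atleast_2d(np.asarray(features, dtype=float))
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    ids = dists.argmin(axis=1)
    return ids, dists[np.arange(len(features)), ids]

def train_codebook(sample, k, iterations=20, seed=0):
    """Train k centroids on a (random) sample of feature vectors."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    # Initialize with k distinct sample points.
    centroids = sample[rng.choice(len(sample), size=k, replace=False)].copy()
    for _ in range(iterations):
        ids, _ = quantize(sample, centroids)
        for c in range(k):
            members = sample[ids == c]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids
```

The second return value of `quantize` is the distance to the assigned centroid, which is the quantity the term weighting in Step 4 builds on.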

Step 3: Tokenization.

Once we have quantized the feature vectors of each document and obtained the categorical cluster ids, we tokenize the document. Since we are not only interested in bivariate data patterns (which are now encoded in the quantized features), but also in the variable combination that exhibits this pattern, we index the data tokens as all possible uni-, bi- and tri-gram terms. For example, if the feature vector of the scatter plot of variable a versus variable b was quantized to cluster id c, we would obtain the terms a, b, c, a_b, b_c, a_c, a_b_c.
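The term construction in this example can be sketched as follows. The function name `make_terms`, the `pattern` prefix for cluster ids and the underscore separator are assumptions for illustration; a real index would need a separator that cannot occur inside variable names.

```python
from itertools import combinations

def make_terms(var_x, var_y, cluster_id, sep="_"):
    """Build all uni-, bi- and tri-gram terms for one quantized scatter-plot."""
    parts = [var_x, var_y, f"pattern{cluster_id}"]
    terms = []
    for size in (1, 2, 3):
        for combo in combinations(parts, size):
            terms.append(sep.join(combo))
    return terms

terms = make_terms("a", "b", 7)
```

Each scatter-plot thus contributes seven terms: three uni-grams, three bi-grams and one tri-gram.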

Step 4: Term Weighting.

After obtaining a set of terms for each document, we have to choose a weighting scheme for these terms to allow for ranked retrieval (instead of just Boolean retrieval). A straightforward scheme to measure the relevance of a term to a given document is term frequency - the number of occurrences of a term in a document. This, however, is not suitable for our terms as the tri-grams - by construction - occur at most once in a given document. Hence, we propose to use the distance to the closest cluster centroid in feature space as the relevance of those terms. As an alternative, we also experimented with introducing an inverse-document frequency (idf) weight, which had little to no effect on retrieval performance and was thus discarded.
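The text does not state how a centroid distance is turned into a relevance score. One possible mapping, shown purely as an assumption, is a monotonically decreasing function so that feature vectors close to their centroid receive a higher weight; `term_weight` and the 1 / (1 + d) form are hypothetical choices.

```python
def term_weight(distance):
    """Map a centroid distance (>= 0) to a relevance weight in (0, 1].

    Assumed mapping for illustration: smaller distance -> larger weight.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)
```

With this choice, a feature vector lying exactly on its centroid gets weight 1.0, and the weight decays smoothly as the vector moves away from it.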

Step 5: Indexing.

The final step is to index the set of weighted terms of each document. For each distinct term we build an inverted list. This means that we save a hashed look-up from each term to each document that contains this term, along with the associated weight. This allows for ranked retrieval by intersecting the inverted lists of each search term, aggregating the term weights and sorting them in descending order. This approach scales very well. The required main memory for this indexing structure increases linearly with the number of indexed documents. Retrieval time is constant with respect to the index look-up and is dominated by the time required to read the document data from the hard disks.
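The inverted-list look-up described in this step can be sketched in a few lines. The class name `InvertedIndex`, its methods and the use of a plain sum to aggregate term weights are assumptions for illustration; the text only states that weights are aggregated and sorted in descending order.

```python
from collections import defaultdict

class InvertedIndex:
    """Hashed look-up from each term to the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: weight}

    def add(self, doc_id, weighted_terms):
        """Index one document given a {term: weight} mapping."""
        for term, weight in weighted_terms.items():
            self.postings[term][doc_id] = weight

    def search(self, terms):
        """Return (doc_id, score) for documents containing ALL terms."""
        lists = [self.postings.get(term, {}) for term in terms]
        if not lists:
            return []
        # Intersect the inverted lists of all search terms.
        common = set(lists[0])
        for plist in lists[1:]:
            common &= set(plist)
        # Aggregate the term weights (here: a plain sum) and rank.
        scored = [(doc, sum(plist[doc] for plist in lists)) for doc in common]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because every term maps directly to its posting list, the cost of a look-up depends on the length of the lists involved rather than on scanning all documents.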

Figure 5: Search result highlighting on our front-end: For each of the retrieved multivariate documents, the scatter-plots that contain the search terms are highlighted by coloring the axis labels and/or the data points in yellow. We searched for multivariate documents that contain a sigmoid-like relationship between water density (sigma-theta) and water depth, as well as an arbitrary relationship of water depth versus temperature and salinity versus pressure. Note that the retrieved documents contain all of these search terms; we highlight the results accordingly to show the user why these documents were retrieved.

3.2 Retrieval

We use the bag-of-words index described in the previous subsection for content-based retrieval. A user can search for arbitrary meta-data, parameters, parameter combinations, data pattern id or a specific relationship of a versus b with pattern c. Figure 4 shows an example search query. In this example, we searched for a complete tri-gram by specifying both axis labels (water depth versus pressure) and a data pattern (a linear relationship between those two variables).

As with any full-text index that is based on inverted lists, we can efficiently combine several search terms by intersecting the associated lists. Thus, the default behavior of our approach is to look for all search terms, to return only those documents that contain every search term, and to rank them according to their aggregated term weight as described above.

3.3 Instant Results

Due to the full-text-like indexing of our bag-of-words approach, we are able to perform search queries in less than 300 milliseconds, which is generally accepted as "instantaneous" in retrieval applications (see Section 4.1 for our test setup). Thus, while the user is still formulating her query, we provide her with immediate results, as this has been shown to speed up the retrieval process.

As long as the full-text index and the primary key index of the database fully reside in the system's main memory, the look-up part of the query time is independent of the number of documents and is dominated by the time required to read the result data from the hard disk.

3.4 Result Highlighting

Highlighting of search results is very important to explain to the user why a particular document is being returned.

Figure 6: Search suggestions on our front-end: (a) After specifying one axis label for our search term, the system presents search terms for the other axis. In this example we specified water density as the y axis. The system suggests to search for salinity, water depth, pressure or temperature - precisely those four variables that water density is functionally dependent on [29]. (b) After selecting one of the suggested x-axis labels, the system suggests search terms for the curve progression by visualizing small scatter-plots to the user.

For text retrieval, highlighting the search terms and showing a few surrounding sentences is a very suitable way to do so. We adapt this to our retrieval scenario. For each retrieved multivariate document, we show up to five plots of its scatter-plot matrix. These scatter-plots visualize those bivariate patterns in each document that matched the user query. We further highlight those scatter-plots by coloring the axis labels and/or the data points, depending on their match with the search term. Figure 5 shows an example query, where scatter-plots of each returned document highlight the user's query matches.
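The selection of plots to highlight could look roughly like the sketch below. The data layout (each scatter-plot as a dictionary with `x`, `y` and `pattern` keys), the function name `plots_to_highlight` and the ordering by number of matched query parts are hypothetical; only the limit of five plots and the distinction between axis-label and data-point highlighting come from the text.

```python
def plots_to_highlight(plots, query, max_plots=5):
    """Pick up to max_plots scatter-plots of one document that match the query.

    plots: list of dicts with keys "x", "y", "pattern" (assumed layout).
    query: dict with any subset of those keys.
    """
    hits = []
    for plot in plots:
        matched = {key for key, value in query.items() if plot.get(key) == value}
        if matched:
            hits.append({
                "plot": plot,
                # Axis labels to color because the variable name matched.
                "highlight_axes": sorted(matched & {"x", "y"}),
                # Color the data points if the curve pattern matched.
                "highlight_points": "pattern" in matched,
            })
    # Show the plots with the most matching query parts first.
    hits.sort(key=lambda h: len(h["highlight_axes"]) + int(h["highlight_points"]),
              reverse=True)
    return hits[:max_plots]
```

A front-end could then render each returned entry and color exactly the parts listed in `highlight_axes` and `highlight_points`.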

3.5 Search Suggestions

We provide users with search suggestions as they provide a major increase in usability for most retrieval systems. Search suggestions and auto-completions gained particular popularity due to their introduction into Google's and Amazon's search front-ends. Since then, search suggestions have also become central to the user's expectations (or mental model) of how a search engine works [15]. As such, failure to provide the user with this functionality often leads to queries with no results due to the search for non-existent patterns.

In other cases, users do not have a precise pattern in mind to search for (or to continue / refine their search with). In these cases, search suggestions provide the user with a much needed overview of patterns she can search for next.

Our approach for search suggestions works as follows:

• retrieve d documents that contain all query terms the user searched for so far

• sum up the ranking scores for all tri-gram terms in the result set

• restrict possible tri-grams according to a partial search term (e.g., a partial axis label) the user supplied

• return the h tri-grams with the highest score as search suggestions and visualize them accordingly

The visualization shows a pop-up of potential x- and y-axis labels, as well as up to nine scatter-plots as small preview images the user can select from (see Figure 6).

The parameter d, the number of documents retrieved to select search suggestions from, influences the quality of the suggestions (higher d is better) but also the computational cost of the suggestions (lower d is better). We found d = 50 to be a good compromise between run-time and search suggestion accuracy.
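The four suggestion steps listed above can be sketched as follows. This is one reading of the procedure, under stated assumptions: tri-grams are represented as (x label, y label, pattern id) tuples, the "ranking scores" are taken to be the per-document term weights of each tri-gram, the partial term is matched case-insensitively against the axis labels, and h defaults to nine to mirror the nine preview plots. The function name `suggest_trigrams` is hypothetical.

```python
from collections import defaultdict

def suggest_trigrams(result_docs, trigram_weights, partial_label="", d=50, h=9):
    """Suggest the h highest-scoring tri-grams among the top d result documents.

    result_docs: ranked list of doc ids matching the query so far.
    trigram_weights: doc_id -> {(x_label, y_label, pattern_id): weight}.
    """
    # Sum the scores of every tri-gram over the first d result documents.
    totals = defaultdict(float)
    for doc_id in result_docs[:d]:
        for trigram, weight in trigram_weights.get(doc_id, {}).items():
            totals[trigram] += weight
    # Keep only tri-grams whose axis labels contain the partial search term.
    needle = partial_label.lower()
    matching = [(tri, score) for tri, score in totals.items()
                if any(needle in label.lower() for label in tri[:2])]
    # Return the h best-scoring tri-grams as suggestions.
    matching.sort(key=lambda item: item[1], reverse=True)
    return [tri for tri, _ in matching[:h]]
```

Raising `d` lets more documents contribute to the totals, at the price of more work per keystroke, which is the trade-off discussed above.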

4. APPLICATION

4.1 Data Source and Setup

We show the applicability and scalability of our proposed approach on real-world data from the PANGAEA Data Library [6, 21]. PANGAEA is a digital library for the environmental sciences; it archives, publishes, and distributes geo-referenced primary research data from scientists all over the world. It is operated by the Alfred-Wegener-Institute for Polar and Marine Research in Bremerhaven, and the Center for Marine Environmental Sciences in Bremen, Germany.

For this use-case, we considered every document that is currently available under the Creative Commons Attribution License 3.0 and downloadable from http://www.pangaea.de. In total, we were able to obtain and index 98,416 such documents. The raw uncompressed data of these documents requires approximately 35 GB of disk space. Using our approach, we computed and indexed approximately 2.5 million terms. This requires about 2 GB of RAM to keep the index fully within the main memory of our test setup.

Each document is uniquely identified with a DOI (digital object identifier) and consists of a table of multivariate measurements, which include radiation levels, temperature progressions and ozone values, among many more. Each document available at PANGAEA is carefully annotated by the scientist who conducted the measurements. A data curator controls the quality of this annotation process. These meta-data annotations include standardized names along with base units for each measurement variable in the data table, which we use for tokenization as described in Section 3.

4.2 Case Study

We use this wealth of environmental data indexed with our approach for a case study. Assuming no prior knowledge of the contents of this data repository, we look for data sets that show similar measurement patterns as part of an explorative search process. This can be the basis to hypothesize about the reason for the observed similarities. First, we enter two intuitive variables, namely water temperature and water depth. Since we do not know what kind of pattern a measurement between these two variables should look like, we let the system provide us with an overview of important patterns (see upper left part of Figure 1). We initially assumed temperature to either drop or rise with water depth (depending on whether the environment is warm or cold). However, we were surprised that the system suggested a pattern that indicates a temperature drop up until a certain water depth, and then an increase (see scatter-plot "Suggestions" in upper left of Figure 1). Thus, we searched for this pattern. Moreover, we wanted to highlight water density versus water depth in the result set, as we assumed this to be similar across documents as well. The result of this query can be seen in Figure 1.

The result set that was retrieved did exhibit the pattern we queried for and was highlighted accordingly. On top of that, the relationship of density versus depth was also similar. Looking at the locations where those measurements were taken, we were surprised to see such different locations as the Norwegian Sea and the Antarctic Southern Ocean. However, looking into some details about the Norwegian Sea did reveal that it is a maritime subarctic climatic zone, explaining the similarity of the temperature patterns.

Further refinement of our search by selecting a specific pattern for the relationship between water density and water depth as well did reduce the result set to a more homogeneous region, as we expected at that point (see Figure 7). Using that query, all retrieved documents were measured in the Antarctic Southern Ocean, approximately at the longitude of New Zealand (169°).

5. CONCLUSION AND FUTURE WORK

In this paper, we presente<l a novel approach for content- based indexing of Ulultiwlriate data by using a bag-of-words appro."lch. On this basis, we developed visual-interactive query modalities that were not a\'ailable for this kind of doc-

ument before. In particular, our approach provides the user with result highlighting !Ilul search suggestions. We showed the applicability and scalability of our appro."lCh by indexing the complete collection of multivuriate research data that is publicly available from II data library for the environmental sciences. \Ve provided an exemplary usc-case on this repository by retrieving data documents measured in similar maritime climate zones, which would not have been possible using meta-dal.a alone.

For future work, we plan to improve the similarity functions. One extension is to consider hierarchical relationships, which mIL)" exist betwccn the observlLtioll variables. These could be included to provide approximate matches for variable queries in case; where queried variables do not match, bllt other related variables exist: An example, that has also been di.'iCusscd in the applicatiou, is the \'1\riable of water pressure, which may be closely related to wuter depth. An- other sl>ecific tllSk for futllre work is to explore the parameter space for the algorithms u5(.'(1 ill this appronch. such as the number of clusteTl:l for the k-means algorithm.

More generally, future work will include defining a suitable similarity measure between multivariate documents by using the bag-of-words index. Such a similarity measure will allow for query-by-example of complete multivariate data documents, as well as new interactive tools for exploratory search by laying out multivariate documents based on similar content. Finally, an evaluation of the effectiveness of the proposed expansion- and highlighting-based search from a user perspective would be interesting. We note that ground-truth benchmarking is not directly applicable, as our approach supports explorative search. Ultimately, we need to measure the degree of insight or information increase that is brought about by the approach in a given domain problem.
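As a rough illustration of what a document-level similarity on the bag-of-words index could look like, the sketch below uses cosine similarity between word-count histograms. This is a common choice in text retrieval and an assumption here, not the measure we propose; the function names and the toy word labels are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: Counter, doc_b: Counter) -> float:
    """Cosine similarity between two bag-of-words histograms."""
    # A Counter returns 0 for words that are missing from doc_b.
    dot = sum(count * doc_b[word] for word, count in doc_a.items())
    norm_a = math.sqrt(sum(c * c for c in doc_a.values()))
    norm_b = math.sqrt(sum(c * c for c in doc_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_example(example: Counter, repository: dict) -> list:
    """Query-by-example: return (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(example, words))
              for doc_id, words in repository.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy repository with hypothetical word labels and counts.
repository = {
    "doc_x": Counter({"temperature|depth:3": 1, "density|depth:7": 1}),
    "doc_y": Counter({"temperature|depth:5": 4}),
}
example = Counter({"temperature|depth:3": 2, "density|depth:7": 2})
ranking = rank_by_example(example, repository)
```

The same pairwise scores could also serve as input to a layout algorithm that places similar documents close together, as envisioned for exploratory search.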

Acknowledgments

We thank the Alfred-Wegener-Institute (AWI) in Bremerhaven, particularly Rainer Sieger, Hannes Grobe and Gert König-Langlo, and everyone involved with PANGAEA for supporting this research effort. We are especially grateful to the many scientists that contributed the data available through BSRN and other research projects.

6. REFERENCES

[1] J. Bernard, J. Brase, D. W. Fellner, O. Koepler, J. Kohlhammer, T. Ruppert, T. Schreck, and I. Sens. A visual digital library approach for time-oriented scientific primary data. Int. J. on Digital Libraries, 11(2):111-123, 2010.

[2] J. Bernard, T. Ruppert, M. Scherer, J. Kohlhammer, and T. Schreck. Content-based layouts for exploratory metadata search in scientific research data. In K. B. Boughida, B. Howard, M. L. Nelson, H. V. de Sompel, and I. Solvberg, editors, JCDL, pages 139-148. ACM, 2012.

[3] R. Berndt, I. Blümel, M. Clausen, D. Damm, J. Diet, D. W. Fellner, C. Fremerey, R. Klein, F. Krahl, M. Scherer, T. Schreck, I. Sens, V. Thomas, and R. Wessel. The probado project - approach and lessons learned in building a digital library system for heterogeneous non-textual documents. In ECDL, pages 376-383, Sept. 2010.

Figure 7: Case Study: We refined our original search query (see Figure 1) by also selecting a specific relationship between water density and water depth that was suggested by the system. We see that this combination of patterns only occurs in documents that originate from the Antarctic Southern Ocean at longitude 169°. Map data is attributed to Google Maps.

[4] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.

[5] R. Datta, J. Li, and J. Z. Wang. Content-based image retrieval: approaches and trends of the new age. In Proceedings ACM International Workshop on Multimedia Information Retrieval, pages 253-262. ACM Press, 2005.

[6] M. Diepenbrock, H. Grobe, M. Reinke, U. Schindler, R. Schlitzer, R. Sieger, and G. Wefer. Pangaea - an information system for environmental sciences. Computers & Geosciences, 28(10):1201-1210, 2002.

[7] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment, 1(2):1542-1552, 2008.

[8] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and M. Alexa. Sketch-based shape retrieval. ACM Trans. Graph. (Proc. SIGGRAPH), 31(4):31:1-31:10, 2012.

[9] M. A. Hearst. Search User Interfaces. Cambridge University Press, 1 edition, 2009.

[10] G. Hebrail, B. Hugueney, Y. Lechevallier, and F. Rossi. Exploratory analysis of functional data via clustering and optimal segmentation. Neurocomput., 73(7-9):1125-1141, 2010.

[11] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50-57. ACM, 1999.

[12] H. Jegou, M. Douze, and C. Schmid. Improving bag-of-features for large scale image search. International Journal of Computer Vision, 87(3):316-336, 2010.

[13] E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Mining and Knowledge Discovery, 7(4):349-371, 2003.

[14] E. Keogh, J. Lin, and A. Fu. Hot sax: Efficiently finding the most unusual time series subsequence. In IEEE International Conference on Data Mining, pages 226-233, 2005.

[15] M. Khoo and C. Hall. What would 'google' do? users' mental models of a digital library search engine. In P. Zaphiris, G. Buchanan, E. Rasmussen, and F. Loizides, editors, Theory and Practice of Digital Libraries, volume 7489 of Lecture Notes in Computer Science, pages 1-12. Springer, 2012.

[16] B. Köhncke, S. Tönnies, and W.-T. Balke. Catching the drift - indexing implicit knowledge in chemical digital libraries. In P. Zaphiris, G. Buchanan, E. Rasmussen, and F. Loizides, editors, Theory and Practice of Digital Libraries, volume 7489 of Lecture Notes in Computer Science, pages 383-395. Springer, 2012.

[17] M. Lew, N. Sebe, C. Djeraba, and R. Jain. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 2(1):1-19, 2006.

[18] J. Lin, R. Khade, and Y. Li. Rotation-invariant similarity in time series using bag-of-patterns representation. Journal of Intelligent Information Systems, pages 1-29, 2011.

[19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91-110, Nov. 2004.

[20] C. A. Lynch. Jim gray's fourth paradigm and the construction of the scientific record. In T. Hey, S. Tansley, and K. M. Tolle, editors, The Fourth Paradigm, pages 177-183. Microsoft Research, 2009.

[21] PANGAEA Publishing Network for Geoscientific & Environmental Data. http://www.pangaea.de/.

[22] M. Riley, E. Heinen, and J. Ghosh. A text retrieval approach to content-based audio retrieval. In Int. Symp. on Music Information Retrieval (ISMIR), pages 295-300, 2008.

[23] R. Rowley-Brooke, F. Pitie, and A. Kokaram. A ground truth bleed-through document image database. In P. Zaphiris, G. Buchanan, E. Rasmussen, and F. Loizides, editors, Theory and Practice of Digital Libraries, volume 7489 of Lecture Notes in Computer Science, pages 185-196. Springer, 2012.

[24] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, 1988.

[25] M. Scherer, J. Bernard, and T. Schreck. Retrieval and exploratory search in multivariate research data repositories using regressional features. In Proceeding of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, pages 363-372, New York, NY, USA, 2011. ACM.

[26] M. Scherer, T. von Landesberger, and T. Schreck. A benchmark for content-based retrieval in bivariate data collections. In Proceedings of the Second international conference on Theory and Practice of Digital Libraries, TPDL'12, pages 286-297, Berlin, Heidelberg, 2012. Springer-Verlag.

[27] B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In IEEE Visual Languages, number UMCP-CSD CS-TR-3665, pages 336-343, College Park, Maryland 20742, U.S.A., 1996.

[28] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 1470-1477. IEEE, 2003.

[29] R. Stewart. Introduction to physical oceanography. Texas A & M University, 2004.

[30] R. W. White and R. A. Roth. Exploratory Search: Beyond the Query-Response Paradigm. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers, 2009.

[31] B. Wong, S. Choudhury, C. Rooney, R. Chen, and K. Xu. Invisque: technology and methodologies for interactive information visualization and analytics in large library collections. Research and Advanced Technology for Digital Libraries, pages 227-235, 2011.

[32] R. Xu, D. Wunsch, et al. Survey of clustering algorithms. Neural Networks, IEEE Transactions on, 16(3):645-678, 2005.

[33] J. Yang, Y. Jiang, A. Hauptmann, and C. Ngo. Evaluating bag-of-visual-words representations in scene classification. In Proceedings of the international workshop on Workshop on multimedia information retrieval, pages 197-206. ACM, 2007.
