Information Integration:
Indexing
Index Construction
Manual Indexing
- Indexers assign documents to categories by hand or determine index terms
- Yahoo: assign each document to the "best" category
- National Library of Medicine: assign each document to as many categories of the Medical Subject Headings (MeSH) catalog as possible
- Scaling problem
- Specialization as a way out
Automatic Indexing
Necessary because of the size of the Web
- Crawling
- Indexing
- Query management
Normalization
- Document preparation
  - Parse/remove HTML
  - Identify information relevant for indexing
    - alt attribute of <img>
    - <meta> tags
    - lang attribute
    - …
  - Handle character encodings
  - Expand entities
  - Remove punctuation
- Split into tokens
  - Remove stop words
  - Stemming
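The preparation steps listed under normalization can be sketched with Python's standard library. This is a minimal sketch, not a production parser: the class name `TextAndAttributes` and the function `normalize` are hypothetical, the encoding is assumed to be known in advance, and script or style content is not filtered out.

```python
import re
from html.parser import HTMLParser


class TextAndAttributes(HTMLParser):
    """Collects visible text plus index-relevant attributes (alt, meta content, lang)."""

    def __init__(self):
        # convert_charrefs=True expands entities such as &amp; in text content
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.lang = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("alt"):
            self.chunks.append(attrs["alt"])
        elif tag == "meta" and attrs.get("content"):
            self.chunks.append(attrs["content"])
        if attrs.get("lang"):
            self.lang = attrs["lang"]

    def handle_data(self, data):
        self.chunks.append(data)


def normalize(raw_bytes, encoding="utf-8"):
    """Return (tokens, lang) for an HTML document given as bytes."""
    text = raw_bytes.decode(encoding, errors="replace")  # handle the character encoding
    parser = TextAndAttributes()
    parser.feed(text)                                    # parse/remove HTML, expand entities
    parser.close()
    plain = " ".join(parser.chunks)
    plain = re.sub(r"[^\w\s]", " ", plain)               # remove punctuation
    tokens = plain.lower().split()                       # split into tokens
    return tokens, parser.lang


page = b'<html lang="en"><body><p>Rye &amp; wheat, please!<img src="x.png" alt="Manna"></p></body></html>'
tokens, lang = normalize(page)
print(tokens, lang)
```

Replacing punctuation with spaces splits contractions such as "don't" into two tokens; whether that is acceptable depends on the stop word list used afterwards.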
Limerick Example
[Figure: the limerick as an HTML page and as extracted plain text]
Stop Words
- Not all words are decisive for the content of a document
  (their idf is so low that processing them further is not worthwhile)
- These stop words are removed from the document
- What remains are the content words
English Stop Words
a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by c c'mon c's came can can't cannot cant cause causes certain certainly changes clearly co com come comes concerning
consequently consider considering contain containing contains
corresponding could couldn't course currently d definitely described despite did didn't different do does doesn't doing don't done down downwards during e each edu eg eight either else elsewhere
enough entirely especially et etc even ever every everybody
everyone everything everywhere ex exactly example except f far
few fifth first five followed following follows for former formerly
forth four from further furthermore g get gets getting given gives
go goes going gone got gotten greetings h had hadn't happens
(approx. 570 words in total)
hardly has hasn't have haven't having he he's hello help hence her here here's hereafter hereby herein hereupon hers herself hi him himself his hither hopefully how howbeit however i i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed indicate indicated
indicates inner insofar instead into inward is isn't it it'd it'll it's its itself j just k keep keeps kept know knows known l last lately later latter latterly least less lest let let's like liked likely little look looking looks ltd m mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself n name namely nd near nearly necessary need needs neither never
nevertheless new next nine no nobody non none noone nor
normally not nothing novel now nowhere o obviously of off often oh
ok okay old on once one ones only onto or other others otherwise
ought our ours ourselves out outside over overall own p particular
particularly per perhaps placed please plus possible presumably
probably provides q que quite qv r rather rd re really reasonably
regarding regardless regards relatively respectively right s said
same saw say saying says second secondly see seeing seem
seemed seeming seems seen self selves sensible sent serious
(source: ftp://ftp.cs.cornell.edu/pub/smart/english.stop)
seriously seven several shall she should shouldn't since six so some somebody somehow someone something sometime sometimes
somewhat somewhere soon sorry specified specify specifying still sub such sup sure t t's take taken tell tends th than thank thanks thanx that that's thats the their theirs them themselves then thence there there's thereafter thereby therefore therein theres thereupon these they they'd they'll they're they've think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying twice two u un under unfortunately unless unlikely until unto up upon us use used useful uses using usually uucp v value various very via viz vs w want wants was wasn't way we we'd we'll we're we've
welcome well went were weren't what what's whatever when
whence whenever where where's whereafter whereas whereby
wherein whereupon wherever whether which while whither who
who's whoever whole whom whose why will willing wish with within
without won't wonder would would wouldn't x y yes yet you you'd
you'll you're you've your yours yourself yourselves z zero
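Stop word removal itself is a simple set lookup. The sketch below assumes a small excerpt of the list above (the constant `STOP_WORDS` and the function `remove_stop_words` are hypothetical names); a real system would load the full list.

```python
# Small excerpt of the English stop word list shown above (assumption: a real
# system would load all of the roughly 570 entries).
STOP_WORDS = {
    "a", "and", "for", "her", "in", "of", "on", "or",
    "she", "some", "the", "to", "was", "who", "with",
}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the content words, comparing case-insensitively."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "She put rye and wheat in her query".split()
print(remove_stop_words(tokens))  # ['put', 'rye', 'wheat', 'query']
```

Using a set keeps each lookup constant-time, which matters when every token of every crawled document is checked.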
Stemming
- Variants of the same word can be mapped to a single term:
  - beauty, beautiful, beautify -> beaut
- This requires stemming
- Language-dependent
- Porter stemmer:
  Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137
  Definition and implementations at
  http://www.tartarus.org/~martin/PorterStemmer/
Porter Stemmer
Definitions
- A sequence C = ccc... of consonants
- A sequence V = vvv... of vowels
- Every word has one of the forms CVCV...C, CVCV...V, VCVC...C, VCVC...V
- [C]VCVC...[V], where [x] means x is optional
- [C](VC){m}[V], where (x){m} means x repeated m times
- m: the "measure" of the word
    m=0: TR, EE, TREE, Y, BY
    m=1: TROUBLE, OATS, TREES, IVY
    m=2: TROUBLES, PRIVATE, OATEN, ORRERY
- Rules: (condition) S1 -> S2, e.g. (m > 1) EMENT -> (empty)
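The measure m can be computed by classifying each letter, collapsing runs into C and V, and counting VC pairs. The sketch below is one possible way to do this; the function names are hypothetical, and the treatment of Y (a vowel when it follows a consonant) is taken from Porter's paper rather than from the definitions above.

```python
import re


def is_consonant(word, i):
    """Classify word[i]; 'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Return m in the form [C](VC){m}[V]."""
    word = word.lower()
    pattern = "".join("c" if is_consonant(word, i) else "v" for i in range(len(word)))
    # Collapse runs of consonants and vowels into single C and V symbols.
    collapsed = re.sub(r"c+", "C", re.sub(r"v+", "V", pattern))
    return collapsed.count("VC")


for w in ["TR", "EE", "TREE", "Y", "BY", "TROUBLE", "OATS", "TREES", "IVY",
          "TROUBLES", "PRIVATE", "OATEN", "ORRERY"]:
    print(w, measure(w))
```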
Rules for Step 1
Notation: (*v*) the stem contains a vowel; (*d) the stem ends in a double consonant; (*o) the stem ends consonant-vowel-consonant, where the last consonant is not W, X or Y; (*L) the stem ends in L
- 1a:
    SSES -> SS          caresses -> caress
    IES  -> I           ponies -> poni, ties -> ti
    SS   -> SS          caress -> caress
    S    ->             cats -> cat
- 1b:
    (m>0) EED -> EE     feed -> feed, agreed -> agree
    (*v*) ED  ->        plastered -> plaster, bled -> bled
    (*v*) ING ->        motoring -> motor, sing -> sing
Rules for Step 1 (continued)
- Additionally, after success with the second or third rule of 1b:
    AT -> ATE           conflat(ed) -> conflate
    BL -> BLE           troubl(ed) -> trouble
    IZ -> IZE           siz(ed) -> size
    (*d and not (*L or *S or *Z)) -> single letter
                        hopp(ing) -> hop, tann(ed) -> tan,
                        fall(ing) -> fall, hiss(ing) -> hiss, fizz(ed) -> fizz
    (m=1 and *o) -> E   fail(ing) -> fail, fil(ing) -> file
- 1c:
    (*v*) Y -> I        happy -> happi, sky -> sky
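The step 1 rules translate almost directly into code. The following is a partial sketch under stated assumptions: it covers 1a, the three main rules of 1b and 1c, but leaves out the follow-up rules that apply after a successful ED or ING removal. All function names are hypothetical, and the Y handling follows Porter's paper.

```python
import re


def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    pattern = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    return re.sub(r"c+", "C", re.sub(r"v+", "V", pattern)).count("VC")


def contains_vowel(stem):
    """The (*v*) condition."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step_1a(word):
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]          # IES -> I
    if word.endswith("ss"):
        return word               # SS -> SS
    if word.endswith("s"):
        return word[:-1]          # S -> (empty)
    return word


def step_1b(word):
    # Follow-up rules after ED/ING removal (AT -> ATE etc.) are omitted here.
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and contains_vowel(stem):
            return stem
    return word


def step_1c(word):
    if word.endswith("y") and contains_vowel(word[:-1]):
        return word[:-1] + "i"    # (*v*) Y -> I
    return word


for w in ["caresses", "ponies", "cats"]:
    print(w, "->", step_1a(w))
```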
Rules for Step 2
- 2:
    (m>0) ATIONAL -> ATE    relational -> relate
    (m>0) TIONAL  -> TION   conditional -> condition, rational -> rational
    (m>0) ENCI    -> ENCE   valenci -> valence
    (m>0) ANCI    -> ANCE   hesitanci -> hesitance
    (m>0) IZER    -> IZE    digitizer -> digitize
    (m>0) ABLI    -> ABLE   conformabli -> conformable
    (m>0) ALLI    -> AL     radicalli -> radical
    (m>0) ENTLI   -> ENT    differentli -> different
    (m>0) ELI     -> E      vileli -> vile
    (m>0) OUSLI   -> OUS    analogousli -> analogous
    (m>0) IZATION -> IZE    vietnamization -> vietnamize
    (m>0) ATION   -> ATE    predication -> predicate
    (m>0) ATOR    -> ATE    operator -> operate
    (m>0) ALISM   -> AL     feudalism -> feudal
    (m>0) IVENESS -> IVE    decisiveness -> decisive
    (m>0) FULNESS -> FUL    hopefulness -> hopeful
    (m>0) OUSNESS -> OUS    callousness -> callous
    (m>0) ALITI   -> AL     formaliti -> formal
    (m>0) IVITI   -> IVE    sensitiviti -> sensitive
    (m>0) BILITI  -> BLE    sensibiliti -> sensible
Rules for Step 3
- 3:
    (m>0) ICATE -> IC       triplicate -> triplic
    (m>0) ATIVE ->          formative -> form
    (m>0) ALIZE -> AL       formalize -> formal
    (m>0) ICITI -> IC       electriciti -> electric
    (m>0) ICAL  -> IC       electrical -> electric
    (m>0) FUL   ->          hopeful -> hope
    (m>0) NESS  ->          goodness -> good
Rules for Step 4
- 4:
    (m>1) AL    ->          revival -> reviv
    (m>1) ANCE  ->          allowance -> allow
    (m>1) ENCE  ->          inference -> infer
    (m>1) ER    ->          airliner -> airlin
    (m>1) IC    ->          gyroscopic -> gyroscop
    (m>1) ABLE  ->          adjustable -> adjust
    (m>1) IBLE  ->          defensible -> defens
    (m>1) ANT   ->          irritant -> irrit
    (m>1) EMENT ->          replacement -> replac
    (m>1) MENT  ->          adjustment -> adjust
    (m>1) ENT   ->          dependent -> depend
    (m>1 and (*S or *T)) ION ->   adoption -> adopt
    (m>1) OU    ->          homologou -> homolog
    (m>1) ISM   ->          communism -> commun
    (m>1) ATE   ->          activate -> activ
    (m>1) ITI   ->          angulariti -> angular
    (m>1) OUS   ->          homologous -> homolog
    (m>1) IVE   ->          effective -> effect
    (m>1) IZE   ->          bowdlerize -> bowdler
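Steps 2 to 4 share one mechanism: find the longest matching suffix, then replace it only if the measure of the remaining stem exceeds a threshold. The sketch below illustrates that mechanism with a subset of the rules; `apply_rules` and the two rule tables are hypothetical names, and rules with extra conditions (such as ION) are left out.

```python
import re


def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    pattern = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    return re.sub(r"c+", "C", re.sub(r"v+", "V", pattern)).count("VC")


# Subsets of the rule tables, for illustration only.
STEP_2_SUBSET = [("ational", "ate"), ("tional", "tion"), ("izer", "ize"),
                 ("ization", "ize"), ("fulness", "ful")]
STEP_4_SUBSET = [("al", ""), ("ance", ""), ("ence", ""), ("er", ""),
                 ("ement", ""), ("ment", ""), ("ent", "")]


def apply_rules(word, rules, min_measure):
    """Apply the rule with the longest matching suffix if measure(stem) > min_measure."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Only the longest match is considered; if its condition fails,
            # the word is left unchanged.
            return stem + replacement if measure(stem) > min_measure else word
    return word


print(apply_rules("relational", STEP_2_SUBSET, 0))   # relate
print(apply_rules("rational", STEP_2_SUBSET, 0))     # rational
print(apply_rules("replacement", STEP_4_SUBSET, 1))  # replac
```

Trying the longest suffix first is what keeps "rational" intact: ATIONAL matches, its condition fails because the stem "r" has m=0, and the shorter TIONAL rule is never tried.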
Rules for Step 5
- 5a:
    (m>1) E ->                  probate -> probat, rate -> rate
    (m=1 and not *o) E ->       cease -> ceas
- 5b:
    (m>1 and *d and *L) -> single letter
                                controll -> control, roll -> roll
Effect
- Test with 10,000 words
- 6,370 distinct stems remain -> vocabulary reduced by about 1/3

Words reduced in step 1:   3597
Words reduced in step 2:    766
Words reduced in step 3:    327
Words reduced in step 4:   2424
Words reduced in step 5:   1373
Words not reduced:         3650
Example Split into 10 Documents
 1  There once was a searcher named Hanna
 2  Who needed some info on manna
 3  She put rye and wheat in her query
 4  Along with potato or cranbeery
 5  But no mention of sourdough or banana
 6  Instead of rye cranberry or wheat
 7  The results had more spiritual meat
 8  So Hanna was not pleased
 9  Nor was her hunger eased
10  Cause she was looking for something to eat
Extracted Terms / Document File
After stemming and stop word removal:

Document  Terms
1         searcher, Hanna
2         manna
3         rye, wheat, query
4         potato, cranbeery
5         sourdough, banana
6         rye, cranberry, wheat
7         spiritual, meat
8         Hanna
9         hunger
10        -
Term List for the Document Collection / Dictionary

Term        Global/document frequency
banana      1
cranb       2
Hanna       2
hunger      1
manna       1
meat        1
potato      1
query       1
rye         2
sourdough   1
spiritual   1
wheat       2
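The dictionary can be derived from the document file by counting, for each term, the number of documents it occurs in. A minimal sketch, assuming the terms are already stemmed and lowercased (so cranberry and cranbeery both appear as cranb); `DOCUMENT_FILE` and `document_frequency` are hypothetical names.

```python
from collections import Counter

# Document file from the example, with terms assumed to be already stemmed
# and lowercased.
DOCUMENT_FILE = {
    1: ["searcher", "hanna"],
    2: ["manna"],
    3: ["rye", "wheat", "query"],
    4: ["potato", "cranb"],
    5: ["sourdough", "banana"],
    6: ["rye", "cranb", "wheat"],
    7: ["spiritual", "meat"],
    8: ["hanna"],
    9: ["hunger"],
    10: [],
}


def document_frequency(document_file):
    """Count in how many documents each term occurs."""
    df = Counter()
    for terms in document_file.values():
        df.update(set(terms))  # set(): a term counts once per document
    return dict(sorted(df.items()))


for term, count in document_frequency(DOCUMENT_FILE).items():
    print(f"{term:10} {count}")
```

The sketch also reports searcher, which appears in the document file but is not listed in the dictionary table.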
Inversion List
- Inverted list of the documents
- States in which documents, and where in them, a term occurs
- Format here: (document, position)
- Various options for the representation

Term        Occurrences
banana      (5,7)
cranb       (4,5); (6,4)
Hanna       (1,7); (8,2)
hunger      (9,4)
manna       (2,6)
meat        (7,6)
potato      (4,3)
query       (3,8)
rye         (3,3); (6,3)
sourdough   (5,5)
spiritual   (7,5)
wheat       (3,5); (6,6)
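An inversion list in the (document, position) format can be built in a single pass over the documents. The sketch below is a simplified illustration: it lowercases tokens but skips stop word removal and stemming, so cranberry and cranbeery remain separate entries. The names `DOCUMENTS` and `build_inversion_list` are hypothetical.

```python
from collections import defaultdict

DOCUMENTS = [
    "There once was a searcher named Hanna",
    "Who needed some info on manna",
    "She put rye and wheat in her query",
    "Along with potato or cranbeery",
    "But no mention of sourdough or banana",
    "Instead of rye cranberry or wheat",
    "The results had more spiritual meat",
    "So Hanna was not pleased",
    "Nor was her hunger eased",
    "Cause she was looking for something to eat",
]


def build_inversion_list(documents):
    """Map each token to a list of (document, position) pairs, both counted from 1."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents, start=1):
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].append((doc_id, position))
    return dict(index)


index = build_inversion_list(DOCUMENTS)
print(index["wheat"])  # [(3, 5), (6, 6)]
print(index["hanna"])  # [(1, 7), (8, 2)]
```

Storing positions as well as document numbers is one of the representation options; it costs space but allows phrase and proximity queries.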
Collaborative Indexing
Harvest System
Indexes each read the complete information from the servers
Harvest system:
- Gatherers extract information from the sources
- Brokers build the index
- Brokers answer queries
Cascaded Gatherers:
- Reduced network load / increased flexibility
Gatherers on the server side:
- Reduced cost of reading out information
- Methods:
  - Creating summaries of the information
  - Transferring indexed information in compressed form
  - Incremental update of changed pages
  - Caching of fetched information
Cascaded Brokers:
- Flexible composition into topic-specific indexes
- Possible filters for other Brokers
System Architecture
- SOIF (Summary Object Interchange Format)
[Figure: Harvest system architecture]
Gatherer Architecture
- Examines objects on the server
  - based on an enumeration in the configuration
  - by traversal
- The Essence system creates summaries
- Configurable:
  - How the type is recognized (file extension convention, content, etc.)
  - Which objects are indexed (binary/source)
  - How information is extracted (<title> etc.)
- Nested object representation (foo.tar.gz)
Essence
SOIF Example
Broker Architecture
- Collector
  - Requests updates from Gatherers or Brokers
- Registry
  - List of the stored documents
- Storage Manager
  - Stored documents (file system) as SOIF
- Index/Search Engine
  - IR system for the documents
  - Different systems:
    - Glimpse: full-text index
    - Nebula: standing queries
- Query Manager
  - Accepts queries and forwards them to the index
  - Accepts queries and forwards them to other Brokers
Replication
- Brokers can be replicated
- Replication: providing several identical copies at different locations
- Effect: better availability, better scalability
- Problem: maintaining consistency when changes occur
- Harvest:
  - Weak consistency: after changes, the replicas will eventually converge
  - Each replica belongs to a group
  - The group master determines the necessary adjustments
  - Replicas synchronize pairwise
Caching
- Objects (from servers) are temporarily stored in caches
- The cache is organized hierarchically
Caching Strategy
- Requests for an object go through the cache
- Hit: the cache has the object; miss: the cache does not have the object
- On a local miss, the request is sent to
  - parent and neighbor caches
  - the echo port of the origin server (which replies like a hit)
- Hit replies from caches:
  - fetch from the fastest responder
- Hit reply from the origin server and miss replies from all caches:
  - If the origin is slower than the fastest parent cache:
    have the parent cache fetch the object
  - Otherwise: fetch from the origin server
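The decision after a local miss can be written as a small function over the collected replies. This sketch is one reading of the strategy above, not the Harvest implementation: the `Reply` record, its fields, and `choose_source` are hypothetical, and it assumes that a reply from the origin server's echo port is always present.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reply:
    source: str        # hypothetical name of the responder
    kind: str          # "parent", "neighbor", or "origin"
    hit: bool          # True if the responder can deliver the object
    latency_ms: float  # time until the reply arrived


def choose_source(replies):
    """Pick where to fetch an object from after a local cache miss."""
    origin = next(r for r in replies if r.kind == "origin")
    cache_hits = [r for r in replies if r.hit and r.kind != "origin"]
    if cache_hits:
        # Some cache has the object: fetch from the fastest responder with a hit.
        candidates = cache_hits + ([origin] if origin.hit else [])
        return min(candidates, key=lambda r: r.latency_ms).source
    parents = [r for r in replies if r.kind == "parent"]
    if parents:
        fastest_parent = min(parents, key=lambda r: r.latency_ms)
        if origin.latency_ms > fastest_parent.latency_ms:
            # Origin is slower than the fastest parent: let the parent fetch it.
            return fastest_parent.source
    return origin.source


replies = [
    Reply("parent-a", "parent", False, 20.0),
    Reply("neighbor-b", "neighbor", True, 35.0),
    Reply("origin", "origin", True, 80.0),
]
print(choose_source(replies))  # neighbor-b
```

Routing a miss through a fast parent, rather than going to a slow origin directly, also fills the parent's cache for later requests from its other children.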
Literature
Michael W. Berry and Murray Browne: Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, 1999.