THE FUTURE OF INTERNET SEARCH

Academic year: 2022

Herwig Unger

(2)

The perfect world of Google…

???

(3)

How do search engines work?

• They “copy” the web.
  → completeness
  → freshness, resources, energy
• They “rank” results.
  → independence, ‘bubbles’, the Google effect
• They “present” results.
  → ASCII list in the 21st century
  → session management
  → user support
  → interaction
  → trustworthiness
• They “earn” big money.
  → advertisement
  → company interests
  → privacy vs. NSA

(4)

MI6, KGB and Stasi were yesterday.

→ Today’s secret service is Google:
• A whole copy of the WWW
• Search histories, chats …
• Private traffic and movement data, Street View images
• Public transport schedules
• Health data

(5)

Librarians …

… are active intermediaries between users and resources:
• Obtain, organise and maintain information
• Manage access
  → pathfinders, bibliographies
• Specialised knowledge
• Masters of information literacy, i.e. “... the ability to know when there is a need for information, to be able to identify, locate, evaluate, and effectively use that information for the issue or problem at hand.”

(6)

A more direct comparison

Harry !?

Just Harry ???

Crazy ---

(7)

But search …

• … is an iterative process
• … needs to consider many alternatives
• … has its own (very personal) context and history
• … comes with learning effects derived from positive and negative feedback
• … also influences the objects being searched and retrieved
• … is possibly a cooperative activity
• … is seldom centralised in nature
  (think of foraging, ants, partner search)

(8)

Local improvements suggested:

→ From D. Weiss and S. Osinski, Carrot Search:
http://project.carrot2.org/release-3.5.0-notes.html

• Graphical clustering of results
• User feedback
• Additional keywords and background search
• Picture search

(9)

Today is Google. And tomorrow?

→ Search is understood as a process with several participants, a given context and a history
→ Learning and adaptation, driven by multiple feedback sources
→ Decentralisation and emergence of structures
→ No central control or knowledge
→ Brain-like structures and processes, in which the connections between different instances matter most

(10)

Jeff Hawkins: “On Intelligence”

(11)

Going decentralised…

(12)

Alternatives

GNUTELLA
• Broadcast-based, i.e. many messages
• Keeps the user anonymous until download
• Relatively fast
• Little overhead thanks to a simple protocol

FREENET
• No broadcasts, single search messages
• Keeps full anonymity
• Gets faster over time through copies and new links
• Still a simple protocol

Distributed Hash Tables (DHT)
• Scalable, with logarithmic cost
• Just a content directory
• Fast
• High overhead due to a complex protocol
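
The trade-off between broadcast search and DHT lookup can be made concrete with a small sketch. It is only an illustration under simplifying assumptions, not the actual Gnutella or DHT protocols: `flood_search` and `dht_route` are hypothetical names, the flooding is limited by a TTL, and the DHT is modelled as a fully populated ring of 2**bits node ids in which every node can jump ahead by any power of two (in the spirit of Chord-style finger tables).

```python
from collections import deque


def flood_search(neighbours, start, has_item, ttl):
    """Broadcast search: forward the query to all neighbours until the TTL expires.

    Returns (set of peers holding the item, number of messages sent).
    """
    seen = {start}
    hits = {start} if has_item(start) else set()
    messages = 0
    frontier = deque([(start, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue                      # TTL used up, do not forward further
        for nxt in neighbours[peer]:
            messages += 1                 # every forwarded copy costs one message
            if nxt in seen:
                continue                  # receiver drops a query it already saw
            seen.add(nxt)
            if has_item(nxt):
                hits.add(nxt)
            frontier.append((nxt, remaining - 1))
    return hits, messages


def dht_route(source, key, bits):
    """Greedy routing on a full ring of 2**bits ids using power-of-two jumps.

    Returns the list of visited node ids; it has at most `bits` hops.
    """
    size = 1 << bits
    path = [source]
    current = source
    while current != key:
        distance = (key - current) % size
        jump = 1 << (distance.bit_length() - 1)   # largest power of two <= distance
        current = (current + jump) % size
        path.append(current)
    return path


# Hypothetical example: 8 peers in a ring overlay, the item is stored on peer 3.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
hits, messages = flood_search(ring, 0, lambda p: p == 3, ttl=3)
print(hits, messages)
print(dht_route(0, 13, bits=4))
```

In this toy setting the flood needs many messages to find one item, while the DHT route takes at most `bits` hops. The sketch does not show the cost named on the slide: keeping the routing state consistent while peers join and leave.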

(13)

YaCy

Source: www.yacy.net

(14)

Decentralised search engines (see also YaCy and FAROO)

WWW pages + web servers on the Internet + P2P level = NEW SYSTEM

(15)

“Our” Preliminaries…

(16)

The basics: co-occurrence analysis

→ Significant co-occurrences are pairs of terms that appear together with a probability above a specific threshold in sentences (sentence level), in paragraphs (paragraph level) or in the whole text (document level).
→ The set of all significant co-occurrences can be represented as a co-occurrence graph (usually undirected): nodes = terms, edges = relations.

Source: corpora.uni-leipzig.de
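
A minimal sketch of sentence-level co-occurrence analysis might look as follows. The significance test is an assumption made for illustration: a pair counts as significant when it occurs in at least `min_share` of all sentences, whereas real corpus tools typically use statistical significance measures. The function name `cooccurrence_graph` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(sentences, min_share=0.5):
    """Build an undirected co-occurrence graph at sentence level.

    Assumption: a term pair is "significant" if it appears together in at
    least `min_share` of all sentences. Returns {frozenset({a, b}): share}.
    """
    pair_counts = Counter()
    for sentence in sentences:
        terms = sorted(set(sentence.lower().split()))   # each term once per sentence
        pair_counts.update(frozenset(pair) for pair in combinations(terms, 2))
    total = len(sentences)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_share
    }


# Hypothetical toy corpus
graph = cooccurrence_graph(
    ["teacher school classroom", "school teacher", "computer school"]
)
print(graph)
```

Each key of the returned dictionary is one undirected edge of the graph. Switching to paragraph or document level only changes what is passed in as a "sentence".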

Herwig Unger, Seminar Groningen, 15.05.2015

(17)

Document centroids

The physical analogy:
→ the centre of mass
words = mass points
distance vector = distance in the co-occurrence graph

The centroid of a document is the term with the minimal average distance to all words of that document in the co-occurrence graph.

E.g. “school” is the centroid of a document containing “classroom”, “students” and “teacher”, but also “computer”.
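
The centroid definition can be sketched in a few lines. The code below assumes an unweighted co-occurrence graph given as an adjacency dictionary and measures distance in hops; a weighted graph would need a shortest-path algorithm such as Dijkstra instead. `EXAMPLE_GRAPH`, `hop_distances` and `centroid` are hypothetical names, and the graph itself is invented to mirror the school example.

```python
from collections import deque

# Hypothetical co-occurrence graph (undirected, stored as adjacency lists)
EXAMPLE_GRAPH = {
    "school": ["classroom", "students", "teacher", "computer"],
    "classroom": ["school"],
    "students": ["school"],
    "teacher": ["school"],
    "computer": ["school", "internet"],
    "internet": ["computer"],
}


def hop_distances(graph, source):
    """Breadth-first search: hop distance from `source` to every reachable term."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def centroid(graph, document_terms):
    """Return (term, average distance) minimising the mean distance to the document's terms."""
    best_term, best_avg = None, float("inf")
    for candidate in sorted(graph):               # sorted for a deterministic tie-break
        dist = hop_distances(graph, candidate)
        if not all(term in dist for term in document_terms):
            continue                              # candidate cannot reach every term
        avg = sum(dist[term] for term in document_terms) / len(document_terms)
        if avg < best_avg:
            best_term, best_avg = candidate, avg
    return best_term, best_avg


print(centroid(EXAMPLE_GRAPH, ["classroom", "students", "teacher", "computer"]))
# → ('school', 1.0)
```

Note that “school” wins although it does not occur in the document's term list: every graph node is a candidate, not only the document's own words.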

(18)

Properties of centroids

The centroid can be a word that is not contained in any of the documents.

Often, generalising terms will be found.

Theoretically, a document may have more than one centroid.

The distance between two document centroids in the co-occurrence graph can be used to define the similarity of the documents.

A centroid term can be assigned even to short queries.
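
One way to turn centroid distance into a similarity score is sketched below. The mapping 1 / (1 + d) is an assumption chosen for illustration; the slides only state that the distance can be used to define similarity. `hop_distance` and `centroid_similarity` are hypothetical names, and the graph is an unweighted adjacency dictionary.

```python
from collections import deque


def hop_distance(graph, start, goal):
    """Hop distance between two terms, or None if they are not connected."""
    if start == goal:
        return 0
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                if nxt == goal:
                    return dist[nxt]
                queue.append(nxt)
    return None


def centroid_similarity(graph, centroid_a, centroid_b):
    """Assumed mapping: identical centroids give 1.0, unconnected ones 0.0."""
    d = hop_distance(graph, centroid_a, centroid_b)
    return 0.0 if d is None else 1.0 / (1.0 + d)


# Hypothetical graph: a chain of related terms plus one isolated term
graph = {
    "school": ["university"],
    "university": ["school", "research"],
    "research": ["university"],
    "football": [],
}
print(centroid_similarity(graph, "school", "university"))   # → 0.5
```

Because a query can also be given a centroid, the same function could compare a query with a document.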

(19)

Towards a Librarian of the WWW

(20)

The librarian of the web

Empty bookshelf → … growth process … → full shelf

→ Catalogue or ordering algorithm: classify & sort!

(21)

Top-down: Building a self-specialising hierarchy

[Figure: co-occurrence graph at level 0, partitioned into refined co-occurrence graphs at level 1]

Rules of the game
• If a level is full, the local co-occurrence graph is partitioned.
• Each document link is handed to one node of the lower level, depending on the location of the document's centroid. (Some words of a document may lie in the other partition, however.)
• The upper levels remain as a coarse classification for newly arriving documents or queries, which is refined later.
• The co-occurrence graph at the lower level is refined by the documents assigned to the respective node.
• If the next node becomes full, the game is repeated successively.
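
The rules of the game above can be sketched as a small tree structure. This is an illustration under loud assumptions, not the author's algorithm: the class `Node` and its methods are hypothetical, each document is represented only by its centroid term, and the graph partitioning step is replaced by a placeholder that splits the sorted centroid terms into two halves. A real system would partition the local co-occurrence graph instead.

```python
class Node:
    """One node of the hierarchy: collects document links until it is full."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.documents = []      # list of (centroid_term, doc_id)
        self.children = None     # after a split: (left_terms, left_node, right_node)

    def insert(self, centroid_term, doc_id):
        if self.children is not None:
            # Upper level acts as a coarse classifier and hands the link down.
            self._child_for(centroid_term).insert(centroid_term, doc_id)
            return
        self.documents.append((centroid_term, doc_id))
        if len(self.documents) > self.capacity:
            self._split()

    def _split(self):
        terms = sorted({term for term, _ in self.documents})
        if len(terms) < 2:
            return               # a single centroid term cannot be partitioned
        # Placeholder partition (assumption): first half of the sorted terms
        # goes left, everything else, including unseen terms, goes right.
        left_terms = frozenset(terms[: len(terms) // 2])
        self.children = (left_terms, Node(self.capacity), Node(self.capacity))
        pending, self.documents = self.documents, []
        for term, doc_id in pending:
            self._child_for(term).insert(term, doc_id)

    def _child_for(self, term):
        left_terms, left, right = self.children
        return left if term in left_terms else right

    def lookup(self, centroid_term):
        """Route a query centroid down the hierarchy and return matching doc ids."""
        if self.children is None:
            return [doc for term, doc in self.documents if term == centroid_term]
        return self._child_for(centroid_term).lookup(centroid_term)


# Hypothetical usage: the root overflows on the third document and splits.
root = Node(capacity=2)
for term, doc in [("school", "d1"), ("school", "d2"), ("music", "d3"), ("sport", "d4")]:
    root.insert(term, doc)
print(root.lookup("school"))
```

Queries and new documents take the same path: both are classified coarsely at the upper level and refined at the lower levels, which is the behaviour the rules describe.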

(22)

Bottom-up: Agent play

[Figure labels: agents 1 and 3; Any Peer; Merging; 1,3]

(23)

Bottom-up: Agent play

[Figure labels: agents 1, 2, 3, 4; groups 1,3 and 2,4; Any Peer; Compute; groups 1,2 and 3,4]

(24)

Bottom-up: Agent play

[Figure labels: Any Peer; agents 1, 2, 3, 4; groups 1,2 and 3,4; “+ z random” (twice); c1, c2, C1,2; “The same procedure … and go on playing”]

(25)

Properties of the agent play

→ New peers are included automatically. If needed, new agents and peers are added.
→ Peers leaving the community are tolerated.
→ Agent faults are no problem: a lost agent may be replaced and included without major problems for the remaining community.
→ A fully connected cluster makes the system more fault tolerant. In addition, several peers may act as surrogates for the whole (local) sub-cluster, which increases fault tolerance further.
→ The size of the structure adapts automatically to changing needs.
→ A search request can be routed in a calculable time, even if it does not enter at the root node.

(26)

Peer architecture

(27)

Summary

→ Today’s search engines are far from replacing a librarian’s functionality.
→ Small, decentralised systems are more flexible and competitive.
→ Business models also exist for P2P.
→ Copying the WWW is not a good approach, except for the NSA and secret services.
→ A new, fully decentralised concept of search is being investigated, offering new services, interfaces and a high degree of privacy. This approach is scalable and can adapt to changing needs in the WWW.
→ It shows similarities to the way the human brain works. This must be examined in more detail in the future.

(28)

Prof. Dr.-Ing. habil. Herwig Unger Herwig.Unger@gmail.com WeChat: pdu1966 LINE: hu2106 +49 176 8183 2106 / +66 979 722 070
