
Congress of Catalan Archivists 2015 Thomas Risse 25/11/15

Challenges and Approaches for Web Archive Creation and Usage

Thomas Risse

L3S Research Center

Congress of Catalan Archivists 2015 Lleida, 29. May 2015


The Web is a core part of our daily life

World Wide Web = 50 billion pages + 1 billion users

The Web and the Social Web play a crucial role

– Information and services for all domains

– Allows contributions by every citizen

– Gives room for the articulation of a multitude of stakeholders

– Reflects all types of events, opinions and developments within society, science, politics, environment, business, …


Spam Attack on Copts

Gun running from Sudan

Are we losing the past of the Web?


The Web is Changing and Forgetting

The Web is a quickly changing, ever-growing information space [1]

– It is growing by >8% per week

– After 1 year only 40% of the pages are still accessible, while 60% of the pages are new

A Web archive as a collective memory is a cultural necessity for the future

But “archive and store everything” is not a practical approach

[1] A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proceedings of the 13th International Conference on World Wide Web (WWW '04)


Where are we now?

• A number of tools are available (e.g. Heritrix)

• Crawl descriptions are currently lists of URLs

• >42 Web archive initiatives worldwide, with different scopes [2]

• A lot of manual effort is still necessary, but only some 100 people are involved worldwide [2]

• Increasing interest from Digital Humanities, Journalism, etc.

→ More support is necessary

– Crawl by events, topics and entities

– Using the “Wisdom of the Crowds” for selection and appraisal

[2] D. Gomes, J. Miranda, and M. Costa. A survey on web archiving initiatives. In Proceedings of the 1st International Conference on Theory and Practice of Digital Libraries 2011 (TPDL 2011)


Agenda

Motivation

The Crawl Process and its Challenges

• Process Overview

• Dynamic Pages

• Basic Crawl Strategies

• Termination of Crawls

Next Generation Web Archiving

• Requirements

• Topical Crawling

• The ARCOMEM Approach

• iCrawl - Integrated Crawling

Web Archive Access and Usage Examples

• Current methods

• The ARCOMEM Approach

• Time and Topic aware diversification

• Language Evolution

Conclusions


Web Archiving and some challenges

[Figure: the Web archiving process and its challenges]

Process steps

• Selection: Preparation, Discovery, Filtering

• Capture: Link Extraction, Fetching

• Archiving: Storage, Index

• Access

• Quality Review

Roles: Archivist, User

Challenges

• Noise Filtering

• Dynamic Pages (JavaScript, Flash)

• Multimedia Content

• Temporal Coherence of Crawls

• Data Volume

• Temporal Aspects

• Deep Web

• Content Selection

• Content Appraisal

• Social Media API Crawling

Legend: Contributions of the European Projects


Dynamic Pages


Follow the links ...

<a href="http://www.gnu.org/philosophy/free-sw.html">free software</a>
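Following static links like the one above can be sketched with Python's standard library. This is only a minimal illustration, not how Heritrix or any other archive crawler does it; the class name `LinkExtractor`, the helper `extract_links` and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> elements in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample page: one absolute and one relative link.
page = (
    '<p>See <a href="http://www.gnu.org/philosophy/free-sw.html">free software</a> '
    'and <a href="/about.htm">about</a>.</p>'
)
links = extract_links(page, "http://www.news.de/")
print(links)
```

Every extracted URL would then be added to the crawler queue. This only works for links that are present in the HTML source itself.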


But what about these pages?

(function() {
    // Domains that count as internal referrers
    var fcb_referrers = [
        "fcbarcelona.com", "fcbarcelona.cat", "fcbarcelona.es",
        "fcbarab.com", "fcbarcelona.fr", "fcbarcelona.jp",
        "fcbarcelona.cn", "fcbarcelona.qq.com"
    ];
    var current_language = "en";
    // Language-specific entry pages
    var routes = {"ca":"http://www.fcbarcelona.cat"};
    var navigator_language = navigator.language || navigator.userLanguage;
    var language = (navigator_language || "").split("-")[0];
    var s = "^https?:\/\/[^\/]*(" + fcb_referrers.join("|").replace(/\./g, '\\.') + ")";
    var fcb_referrer = document.referrer.match(new RegExp(s));
    var redirect_url = routes[language];
    // Redirect by browser language unless the visitor comes from one of the domains above
    if (language && language != current_language && !fcb_referrer && redirect_url) {
        window.location = redirect_url;
    }
})();


“Endless” Pages


Automatically reload content while browsing the page

• Requires execution of JavaScript

• When can the fetching be stopped?

• Examples: Facebook, Twitter, Tumblr, etc.
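One possible answer to the question of when to stop fetching is a combined rule: stop when several consecutive reloads bring nothing new, or when a fixed budget of reloads is used up. The sketch below only simulates this. `fake_feed` stands in for a real timeline, and the parameters `max_batches` and `patience` are assumptions for illustration, not settings of an existing crawler.

```python
def fetch_endless_page(load_batch, max_batches=50, patience=2):
    """Collect items until nothing new arrives for `patience` rounds or the budget is used up."""
    items, seen = [], set()
    rounds_without_news = 0
    for batch_no in range(max_batches):
        fresh = [item for item in load_batch(batch_no) if item not in seen]
        if fresh:
            rounds_without_news = 0
            items.extend(fresh)
            seen.update(fresh)
        else:
            rounds_without_news += 1
            if rounds_without_news >= patience:
                break
    return items


def fake_feed(batch_no, total=10, batch_size=3):
    """Stand-in for a real timeline: returns up to three post ids per simulated scroll."""
    start = batch_no * batch_size
    return list(range(start, min(start + batch_size, total)))


posts = fetch_endless_page(fake_feed)
print(len(posts))
```

A small budget cuts the capture short, which is the typical situation for very long timelines.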


Embedded Social Media

Embedded Social Media Content

• JavaScript embeds

• Dynamically integrated content on the browser side

• Often standard scripts

→ Possibility to extract parameters

• Content fetching requires an API crawler


Handling of Dynamic Pages

The Problem

• Embedded “programs”

• Adobe Flash is impossible to handle

• JavaScript is readable

• Dynamic creation of links and content

Approaches

• Guessing of links

– “Guessing” by assembling any fragments that look like links into URLs

– Can be very noisy: lots of wrong URLs

• Extraction of parameters from program code

– Applicable for known code libraries, e.g. Facebook, Twitter

• Execution of JavaScript

– Simulate user activities: “press” the links and see what comes out

– Execute the code in a browser engine (e.g. WebKit, Firefox)

– Extract links from the resulting DOM tree

Status

• Some implementations exist (e.g. LiWA Service, Browser Monkey)

• API crawlers exist

• No integration into standard Web crawlers
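The “guessing of links” approach can be illustrated with two regular expressions over the script source. This is a deliberately naive sketch and not the heuristic of any particular crawler; the patterns, the function `guess_links` and the sample script are assumptions. The last guessed URL shows the kind of noise the approach produces.

```python
import re

# Quoted strings that already look like absolute URLs.
URL_CANDIDATE = re.compile(r"""["'](https?://[^"'\s]+)["']""")
# Quoted strings that look like bare host names (noisy on purpose).
HOST_CANDIDATE = re.compile(r"""["']((?:[a-z0-9-]+\.)+[a-z]{2,})["']""")


def guess_links(script):
    """Assemble anything in the script that looks like a link into a URL."""
    urls = URL_CANDIDATE.findall(script)
    hosts = ["http://" + host for host in HOST_CANDIDATE.findall(script)]
    guessed = []
    for url in urls + hosts:
        if url not in guessed:
            guessed.append(url)
    return guessed


# Hypothetical script fragment, loosely modelled on a language redirect.
script = """
var routes = {"ca":"http://www.fcbarcelona.cat"};
var fcb_referrers = ["fcbarcelona.com", "fcbarcelona.jp"];
var lib = "jquery.min.js";
var current_language = "en";
"""
guessed = guess_links(script)
print(guessed)
```

The file name `jquery.min.js` is wrongly turned into a URL, so a crawler using such guesses has to tolerate many failed fetches.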


Basic Crawl Strategies


Crawl Strategy

[Figure: link graph of the example site www.news.de with the pages about.htm, journalists.htm, thomas.htm, joe.htm, ukraine_crisis.htm, ukraine0105.htm, ukraine0205.htm, sports.htm, ukraine_sports0105.htm and spain_sports0205.htm]


Crawl Strategy: Depth-first

[Figure: the www.news.de example graph, crawled depth-first]


Crawl Strategy: Breadth-first

[Figure: the www.news.de example graph, crawled breadth-first]


Termination of a Crawl

When is the harvesting of Web pages complete?

Problem: The total number of pages is unknown

Typical approaches

• Crawler queue is empty (only for very focused crawls)

• Termination after a number of pages / an amount of data

• Termination after a time limit

→ Most crawls end up incomplete, in a way that depends on the strategy


Possible Consequences after stopping the Crawl

[Figure: the www.news.de example graph shown twice, side by side: Depth-first and Breadth-first]
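The effect of stopping a crawl early can be reproduced on a toy version of the example site. The link structure in `SITE` is an assumption (the slides only name the pages), and `crawl` is a minimal sketch of the two basic strategies with a page budget as termination criterion.

```python
from collections import deque

# Assumed link structure of the example site; only the page names come from the slides.
SITE = {
    "www.news.de": ["about.htm", "journalists.htm", "ukraine_crisis.htm", "sports.htm"],
    "journalists.htm": ["thomas.htm", "joe.htm"],
    "ukraine_crisis.htm": ["ukraine0105.htm", "ukraine0205.htm"],
    "sports.htm": ["ukraine_sports0105.htm", "spain_sports0205.htm"],
}


def crawl(seed, strategy, max_pages=None):
    """Return the pages fetched, in order, until the queue is empty or the budget is used up."""
    frontier = deque([seed])
    seen = {seed}
    fetched = []
    while frontier and (max_pages is None or len(fetched) < max_pages):
        # Breadth-first takes the oldest queued URL, depth-first the newest one.
        url = frontier.popleft() if strategy == "breadth-first" else frontier.pop()
        fetched.append(url)
        links = SITE.get(url, [])
        if strategy == "depth-first":
            links = reversed(links)  # so that the first link on a page is followed first
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


depth_first = crawl("www.news.de", "depth-first", max_pages=6)
breadth_first = crawl("www.news.de", "breadth-first", max_pages=6)
print(depth_first)
print(breadth_first)
```

With a budget of six pages, the depth-first crawl never reaches the sports section, while the breadth-first crawl captures all section pages but almost none of the articles below them.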


Alternative Crawl Strategies

- Selection by popularity

- Ranking of the waiting queue according to the popularity of the content

- Requires knowledge about the Web graph

- For example: PageRank

- Mainly useful to optimize regular crawls

- Content-based selection (will be discussed afterwards)

- Topics, events, entities

- Requires a semantic crawl specification

- The selection of the right strategy depends on the crawl intention
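Selection by popularity can be sketched with a textbook PageRank power iteration whose scores are used to order the waiting queue. The graph below is a made-up fragment that a previous crawl might have produced; the function name and the parameter values are assumptions for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power iteration over a dict that maps each page to the pages it links to."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank


# Hypothetical Web graph known from an earlier crawl.
GRAPH = {
    "www.news.de": ["sports.htm", "ukraine_crisis.htm", "about.htm"],
    "sports.htm": ["ukraine_crisis.htm", "www.news.de"],
    "ukraine_crisis.htm": ["www.news.de"],
    "about.htm": ["www.news.de"],
}
rank = pagerank(GRAPH)

# Order the waiting queue so that the most popular page is fetched first.
waiting = ["about.htm", "sports.htm", "ukraine_crisis.htm"]
queue = sorted(waiting, key=lambda url: rank[url], reverse=True)
print(queue[0])
```

The sketch also shows why this strategy needs an existing Web graph: without link data from earlier crawls there is nothing to rank.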


Growing Scientific Interest in Web Archive Content (1/2)

Historians

- Official publications (e.g. government)

- Journalistic resources

- Important topics and events with high media coverage

- Multi-cultural or controversial topics

- Optimal: continuous observation of topics on the Web

Social Sciences

- Observations of topics and events on major sites are good starting points

- Identified topic

- Official publications, journalistic and social media sources

- Changes on the topic should be identified

- Metadata / context (e.g. author, organizations and their interests, gender, location)

- Demographic information about social sites

- Provenance: transparent and detailed documentation of content selection


Growing Scientific Interest in Web Archive Content (2/2)

Law

- Research is based on official publications and protocols of parliaments, or on comments released by publishers

- Social media (especially blogs) are increasingly used

- Only used as background information

- Reason: missing citability and authenticity of resources

- Genesis of laws

- Used to understand the original intention of laws

- A democratic system requires a complete documentation of the genesis of laws

- Currently different degrees of documentation

- Official publications: parliament and committee meetings

- Public discourse


Derived Requirements

Topical Dimension

- Crawl intentions are mainly focused on events and rarely on entities

- What is the intention of the researcher?

- Easy monitoring by the researcher, with the possibility to correct

Flexible Crawling Strategies

- Shallow observation crawls

- Focused crawls with prioritization (e.g. PageRank and/or semantics)

Social Web Crawling

- General interest with different media focus

- Integrated with the Web crawler to capture the full context

Authenticity

- See a web page as the user saw it (e.g. including the ads and tweets shown at that point in time)

Context and Provenance

- Demographics of sites

- Documentation of crawl specification and history


Topical / Event Crawling

[Figure: the www.news.de example graph with the pages about.htm, journalists.htm, ukraine_crisis.htm, thomas.htm, joe.htm, ukraine0105.htm, ukraine0205.htm, ukraine_sports0105.htm, spain_sports0205.htm and sports.htm]

Crawl Specification

- Terms: Ukraine, Crisis, …

- Seed List: www.news.de, …
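A crawl specification of this kind can be turned into a simple relevance score per page. The sketch below uses plain term matching, which is an assumption chosen for brevity and not the analysis used in ARCOMEM or iCrawl; the class `CrawlSpec`, the function `relevance` and the page texts are invented for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CrawlSpec:
    """A crawl specification: descriptive terms plus a seed list."""
    terms: list
    seeds: list = field(default_factory=list)


def relevance(text, spec):
    """Fraction of specification terms that occur in the text (0.0 to 1.0)."""
    words = set(re.findall(r"\w+", text.lower()))
    hits = sum(1 for term in spec.terms if term.lower() in words)
    return hits / len(spec.terms) if spec.terms else 0.0


spec = CrawlSpec(terms=["Ukraine", "Crisis"], seeds=["www.news.de"])

# Made-up page texts for three pages of the example site.
pages = {
    "ukraine_crisis.htm": "Ukraine crisis: talks continue",
    "ukraine_sports0105.htm": "Ukraine wins the match",
    "spain_sports0205.htm": "Spain cup final tonight",
}
scores = {url: relevance(text, spec) for url, text in pages.items()}
print(scores)
```

Links found on high-scoring pages would be queued with high priority, links from pages scoring 0.0 would be dropped or fetched last.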


ARCOMEM Crawling Phases

[Figure: ARCOMEM crawling phases. Labels: Internet, Crawling, Online Processing, Offline Processing, Cross Crawl Processing, Appraisal, Selection, ARCOMEM Storage, Archive, SARA for Broadcaster and Parliaments]

Crawl specification shown in the figure

• Entities: Obama, Romney, Biden, Ryan, Republicans, Democrats

• Keywords: US Election, CommitToMitt, Teaparty, Budget deficit

• Social Media Seedlist: https://twitter.com/whitehouse, https://twitter.com/blog44, https://twitter.com/BarackObama, ...

• Seedlist: http://news.bbc.co.uk/, http://telegraph.co.uk/, ...


Architecture Overview

[Figure: ARCOMEM architecture. Components shown:]

• Crawler Cockpit

• Intelligent Crawl Definition

• Crawler: Queue Management, Application-Aware Helper, Resource Selection & Prioritization, Resource Fetching

• Online Processing: GATE Online Analysis, Social Web Analysis, Relevance Analysis & Prioritization, Image/Video Analysis

• Offline Processing: GATE Offline Analysis, Social Web Analysis, Named Entity Evolution Recognition, Consolidation, Enrichment

• Cross Crawl Analysis: Twitter Dynamics

• ARCOMEM Storage (HBase, H2RDF): URLs, Extracted Social Web Information

• WARC Export, WARC Files

• Applications: SARA, SOLR Index, Broadcaster, Parliament


Crawler Cockpit


Relevance Distribution Time


New Strategies lead to …

Example: German Elections Crawl 2013

• Completely user-defined crawl (German broadcasters)

• Many focused terms and entities in German

• Many full names, e.g. “Angela Merkel”

• Few, less focused keywords in English

• 1st phase

– Many English pages with no relation to the German elections

• 2nd phase

– Refinement of the crawl specification

– Focused English terms

– Last names instead of full names

– Finding a good cut-off point for the archive threshold

– Smaller result set with higher focus


… new Experiences …

A good specification of the crawl intention becomes critical

– Crawl specifications are rather complex compared to seed lists

– More experience, observation and analysis are necessary from users and developers

– Room for more sophisticated guidance in the 1st phase

Finding the right cut-off point

– Depends on the user requirements

– Highly focused → smaller archives, but a higher risk of losing interesting information

– Less focused → lower risk of missing information, but larger archives

– Translation into scoring of individual crawls
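The trade-off behind the cut-off point can be made concrete with a few scored pages. All numbers below are invented for illustration: each entry is a relevance score together with a flag saying whether a user would actually have wanted the page.

```python
# (relevance score, page is actually interesting) - made-up values.
scored_pages = [
    (0.95, True), (0.80, True), (0.55, True), (0.45, False),
    (0.30, True), (0.10, False), (0.05, False),
]


def archive_stats(pages, threshold):
    """Return (archive size, number of interesting pages lost) for a cut-off."""
    kept = [page for page in pages if page[0] >= threshold]
    missed = sum(1 for score, interesting in pages if interesting and score < threshold)
    return len(kept), missed


for threshold in (0.9, 0.5, 0.2):
    size, missed = archive_stats(scored_pages, threshold)
    print(threshold, size, missed)
```

A high threshold keeps the archive small but loses interesting pages; a low threshold loses nothing here but archives more noise. In practice the “interesting” flag is unknown, which is why the choice depends on the user requirements.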


… and new Limitations

Social Media and Web crawling are separate systems

• Process

1. Crawling of Social Media content

2. Extraction of links

3. Crawling of Web pages

• Static integration of Social Media

• Uni-directional path: Social Media → Web content

• Missing path: Web content → Social Media

In addition

• Complex system

• Required many changes in the Heritrix queue handling


Integrated crawling with the L3S iCrawl System

Apache Nutch based Web archive crawler (under development)

• Learning the intention of the crawl

• Integration of Web and Social Media crawling

• Content-based monitoring of the crawl process

[Figure: iCrawl workflow. Phases: Crawl Preparation, Crawl Execution, Crawl Finalization. Components: Initial Seedlist, Semantic Crawl Description, Learning the Crawl Specification, Crawl Specification, Specification Refinement, Crawler (Web Crawler, API Crawler, Scheduler), Crawl Analysis & Enrichment, Crawl Monitor, Provenance, Archive Creation & Cataloguing, Web Archive]


iCrawl Wizard


Twitter #Ukraine Feed

Example for Integrated Crawling

[Figure: URLs from the Twitter feed and URLs extracted from Web links enter the crawler queue step by step. Legend: High Page Relevance, Medium Page Relevance, Low Page Relevance, Web Link, Extracted URL]

Crawler Queue (final state)

ID    Batch   Priority   URL

UK1   1       1.00       http://www.foxnews.com/world/2014/11/07/ukraine-accuses-russia-sending-in-dozens-tanks-other-heavy-weapons-into-rebel/

UK2   1       1.00       http://missilethreat.com/media-ukraine-may-buy-french-exocet-anti-ship-missiles/

UK3   x       0.40       http://missilethreat.com/us-led-strikes-hit-group-oil-sites-2nd-day/

UK4   y       0.05       http://missilethreat.com/turkey-missile-talks-france-china-disagreements-erdogan/

…     …
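A crawler queue that mixes URLs from a social media feed with URLs extracted from fetched pages can be sketched as a priority queue. This is a generic illustration based on Python's `heapq`, not the iCrawl or Apache Nutch implementation; the class `CrawlerQueue` is an assumption, while the URLs and priorities are the ones from the slide.

```python
import heapq
import itertools


class CrawlerQueue:
    """Priority queue of URLs; the highest priority is fetched first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that keeps insertion order
        self._seen = set()

    def add(self, url, priority):
        if url in self._seen:
            return  # never queue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        negated, _, url = heapq.heappop(self._heap)
        return url, -negated

    def __len__(self):
        return len(self._heap)


queue = CrawlerQueue()
# URLs taken directly from the Twitter feed get the highest priority.
queue.add("http://www.foxnews.com/world/2014/11/07/ukraine-accuses-russia-sending-in-dozens-tanks-other-heavy-weapons-into-rebel/", 1.00)
queue.add("http://missilethreat.com/media-ukraine-may-buy-french-exocet-anti-ship-missiles/", 1.00)
# URLs extracted from fetched pages are scored by the relevance of the linking page.
queue.add("http://missilethreat.com/turkey-missile-talks-france-china-disagreements-erdogan/", 0.05)
queue.add("http://missilethreat.com/us-led-strikes-hit-group-oil-sites-2nd-day/", 0.40)

order = [queue.pop() for _ in range(len(queue))]
for url, priority in order:
    print(priority, url)
```

Because both sources feed the same queue, a link found on a Web page and a link found in a tweet compete on priority alone.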


Wayback Machine

• Basic tool to access Web archives

• Also used in combination with other tools

• Only URL-based access

• Provides

– Capture overviews

– Details about crawl dates


Searching the Web Archive


Other Ways to Explore Web Archives


Web Archive Access

Different user groups with different needs

• Not the typical Web Search Engine user

Explorative Search

• Web Archive Search ≠ Web Search

• Time Dimension

• Many versions of the same page

• Comparing versions

Large scale analysis across time

• Information extraction (Entities, Events, Topics, Opinions)

• Interlinking

• Visualizations
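The time dimension and the comparison of versions can be illustrated with two small helpers from Python's standard library. The captures below are invented, and `capture_at` is a hypothetical helper for this example, not part of any Web archive tool.

```python
import bisect
import difflib

# Made-up captures of one page, sorted by capture timestamp (YYYYMMDD).
captures = [
    ("20141107", ["Ukraine crisis: talks continue", "Sports: Spain wins"]),
    ("20141108", ["Ukraine crisis: ceasefire agreed", "Sports: Spain wins"]),
    ("20141201", ["Ukraine crisis: ceasefire agreed", "Sports: season ends"]),
]


def capture_at(captures, timestamp):
    """Return the latest capture taken at or before the timestamp, or None."""
    keys = [ts for ts, _ in captures]
    index = bisect.bisect_right(keys, timestamp)
    return captures[index - 1] if index else None


# Which version did a user see on 15 November 2014?
timestamp, lines = capture_at(captures, "20141115")
print(timestamp)

# What changed between the first two captures?
diff = list(difflib.unified_diff(
    captures[0][1], captures[1][1],
    fromfile=captures[0][0], tofile=captures[1][0], lineterm="",
))
print("\n".join(diff))
```

A Web search engine needs neither step, because it only indexes the current version of each page.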


The ARCOMEM Approach to Web Archive Access


Historical Search on News Archives*

Consider a journalist interested in the history of ... Rudolph Giuliani

• News articles encode history as it happens

• Aspects are diverse across time

• Time windows can be diverse in aspects

[Figure: number of documents over time, annotated with the aspects Mayoral Campaign (three times), Mayor, 9/11, Senate, Cancer, Allegations and Post-politics endeavours]

*Jaspreet Singh, Avishek Anand; Historical Search on News Archives; L3S Report 2015


Large Scale Data Analytics Example: Named Entity Evolution Recognition


Language changes over time!

Our language is dynamic and changes with our culture, politics, technology, social media, etc.

– Different spellings over time

– New words are introduced

– Words change their meanings

In long-term digital archives → problems!

[Timeline: St. Petersburg → Petrograd (1914) → Leningrad (1924) → St. Petersburg (1991 until today)]
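The problem for archive search can be shown with a hand-written table of the city's names: a query has to be expanded with every name that was in use during the period of interest. The renaming years are those on the timeline; the table and the function `names_between` are assumptions for this sketch. The approach presented in this talk aims to find such name changes automatically, without a hand-written table.

```python
# (name, first year of use or None, year of renaming or None)
NAME_PERIODS = [
    ("St. Petersburg", None, 1914),
    ("Petrograd", 1914, 1924),
    ("Leningrad", 1924, 1991),
    ("St. Petersburg", 1991, None),
]


def names_between(start_year, end_year):
    """Names that were in use at some point between the two years."""
    names = []
    for name, first, last in NAME_PERIODS:
        begins_before_end = first is None or first <= end_year
        ends_after_start = last is None or last > start_year
        if begins_before_end and ends_after_start and name not in names:
            names.append(name)
    return names


print(names_between(1920, 1930))
print(names_between(2000, 2015))
```

A search for documents from the 1920s that only uses the current name would miss everything written about Petrograd and Leningrad.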


Problem: Finding Documents

A scholar writing a thesis about “Pope Benedict”:

“I want to know more about Pope Benedict”


Named Entity Evolution

– Named Entities (NE): people, places, companies, ...

– Characteristics of Named Entity Evolution (NEE)

– Same thing but different terms over time

– Change occurs over short periods of time (change periods)

– Small or no concept shift

– Announced to the public repeatedly

Goal: Find a method for named entity evolution recognition that is independent of external knowledge sources

Example terms for one entity: Joseph Ratzinger, Joseph Aloisius Ratzinger, Cardinal Ratzinger, Cardinal Joseph Ratzinger, Pope Benedict, Pope Benedict XVI, Benedict XVI, Pope emeritus Benedict XVI


Named Entity Evolution Recognizer (NEER)

Processing steps shown

• Identifying Change Periods (Burst Detection)

• Extract Text

• NLP Processing

• Context Creation

• Finding Temporal Co-references

• Filtering Co-References

Example: Benedict XVI

• Ranked terms: 1. Pope Benedict XVI, 2. Pope Benedict, 3. Benedict XVI, 4. Cardinal Ratzinger, 5. Pope, 6. Benedict

• Co-references: Joseph Ratzinger, Cardinal Ratzinger

• Sample text: “In his latest address to American bishops visiting Rome, Pope Benedict XVI stressed that Catholic educators should remain true to the faith -- a reminder issued just in time for another tense season of commencement addresses. No, the pope did not mention Georgetown University by name when discussing the Catholic campus culture wars.”

Evaluation Results

• Burst detection found 73% of all change periods in total

• High recall for an unsupervised method

• Machine learning boosts precision

• Data set: http://www.l3s.de/neer-dataset/

Further examples

• Barack Obama: Senator, State Senator Barack Obama, Senator-elect Barack Obama, Senator Barack Obama, Illinois Democrat

• Vladimir Putin: President-elect Vladimir V Putin, Minister Vladimir Putin, Acting President Vladimir V Putin, President Vladimir V Putin
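The idea of identifying change periods through bursts can be sketched with a very simple rule: a period counts as a burst if a term is mentioned far more often than usual. This is a strongly simplified stand-in and not the burst detection algorithm used in NEER; the mention counts are invented, and the threshold rule (mean plus `k` standard deviations) is an assumption.

```python
from statistics import mean, pstdev


def detect_bursts(counts, k=1.5):
    """Return the periods whose count exceeds mean + k * standard deviation."""
    values = list(counts.values())
    threshold = mean(values) + k * pstdev(values)
    return [period for period, count in sorted(counts.items()) if count > threshold]


# Made-up number of documents per year that mention "Ratzinger".
mentions = {2001: 12, 2002: 15, 2003: 11, 2004: 14, 2005: 95, 2006: 20, 2007: 18}
bursts = detect_bursts(mentions)
print(bursts)
```

Documents from a detected period would then be passed on to the text extraction and co-reference steps listed above.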


ALEXANDRIA

Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives

Motivation

• Optimal access to Web archives taking into account

– the temporal dimension of Web archives

– structured semantic information available on the Web

– social media and network information

Objectives

• Evolution-Aware Entity-Based Enrichment and Indexing

• Aggregating Social Networks and Streams

• Temporal Retrieval and Ranking

• Collaborative Exploration and Analytics

Testbeds

• Temporal Wikipedia

• Academic Web Archive

• Politics on the Web

More information: http://alexandria-project.eu/


Conclusions

Web Crawling

• Standard Web crawlers are mature, but there are still many challenges (e.g. Social Media, dynamics)

• Manual interventions will always be necessary

• New user communities with different interests and requirements

• Additional crawling strategies are necessary

Web Archive Access

• Current approaches are far behind the state of the art

• Increasing interest in Web archives will lead to increasing user expectations

• Different usage categories: explorative search vs. Big Data analysis

• Legal aspects are still a big issue


Thank You!

Dr. Thomas Risse

Forschungszentrum L3S

Leibniz Universität Hannover

Appelstrasse 9a

30167 Hannover, Germany

E-Mail: risse@L3S.de

Phone: +49-511-762 17764

Fax: +49-511-762 17779
