
Presentation at KAIST, 20. April 2015

The Role of Language Evolution in Digital Archives

Thomas Risse, Helge Holzmann (L3S); Nina Tahmasebi (University of Gothenburg)

Web Science @ L3S

“Preserving, understanding and shaping the Web”

Computer Science and interdisciplinary research on all aspects of the Web

• Internet: Communication and networks

• Information: Accessing information and knowledge on and through the Web

• Community: Supporting communities and groups on the Web, for research, education, production and entertainment

• Society: Requirements (technological, social, legal) for the Web

• Selected Projects

– Open Data and Education

– K3: The more information, the better?

– Retrieval, Exploration and Analytics for Web Archives (ERC)

– Towards a Unified Information & Queuing Theory (ERC)

– MAPPING: Privacy, Copyright and Internet Governance

– Managed Forgetting and Contextualized Remembering

L3S Research Center


ALEXANDRIA

Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives

Motivation

Optimal access to Web archives taking into account

– the temporal dimension of Web archives

– structured semantic information available on the Web

– social media and network information

Objectives

– Evolution-Aware Entity-Based Enrichment and Indexing

– Aggregating Social Networks and Streams

– Temporal Retrieval and Ranking

– Collaborative Exploration and Analytics

Testbeds

– Temporal Wikipedia

– Academic Web Archive

– Politics on the Web


Language changes over time!

Our language is DYNAMIC and changes with our culture, politics, technology, social media, etc.

– Different spellings over time

– New words are introduced

– Words change their meanings

In long-term digital archives → Problems!


Problem: Finding Documents

A scholar writing a thesis “Economy of Stalingrad”:

“I want to know more about Stalingrad”

What is Google saying?


… and Wikipedia?

http://en.wikipedia.org/wiki/Stalingrad


Problem: Finding Documents

• Manual Query Formulation

– Sometimes the solution is in Wikipedia or a dictionary!

– Other times:

• Not important enough?

• User generated content

• Slang/jargon/domain specific

• Better: Search Engine Support

– Needs knowledge about evolution

– How can we extract this knowledge?

Tsaritsyn (1589-1925) → Stalingrad (1925-1961) → Volgograd
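A minimal sketch of what such search engine support could look like: a table of name periods that expands a query to all temporal variants of a name. The table layout and the names `NAME_PERIODS`, `expand_query` and `names_in_year` are illustrative assumptions, not part of any system described in the talk; the start year for Volgograd is inferred from the end of the Stalingrad period.

```python
# Hypothetical evolution knowledge: (name, first year, last year or None if still in use).
NAME_PERIODS = [
    ("Tsaritsyn", 1589, 1925),
    ("Stalingrad", 1925, 1961),
    ("Volgograd", 1961, None),  # start year inferred, not stated on the slide
]

def expand_query(term, periods=NAME_PERIODS):
    """Return all known temporal variants of `term`, or just `term` if it is unknown."""
    names = [name for name, _, _ in periods]
    return names if term in names else [term]

def names_in_year(year, periods=NAME_PERIODS):
    """Return the names that were in use during `year`."""
    return [name for name, start, end in periods
            if start <= year and (end is None or year <= end)]

print(expand_query("Stalingrad"))
print(names_in_year(1943))
```

With such a table, a query for "Stalingrad" could also retrieve documents that only mention Tsaritsyn or Volgograd.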


Available Information: The Time Dimension

← Backward (from today)

– Lack of dictionaries

– Few experts

– OCR errors

– Data sparsity

Forward (from today) →

– Incremental changes

– Resources exist (Wikipedia, WordNet)

– People can help

– Noisy data (Web)

– Data flood


Named Entity Evolution Recognition (for historical collections)


Named Entity Evolution

Definitions:

– Context C_{w_i}: all terms related to word w at time i

– Temporal co-references: names used for the same entity at the same or different points in time

– Direct co-reference: co-references with lexical overlap

– Indirect co-reference: co-references without lexical overlap

– Change period (CP): period of time where change occurs

Example: Cardinal Joseph Ratzinger → Pope Benedict XVI, CP = 2005


Our Method - Overview

1. Identify change periods

2. Create one context per CP

3. Capture at least two co-references

→ No need to compare vastly different contexts!

[Figure: timeline with points t1, t2, t3. Each change period has its own context: C_{walkman-discman}, C_{discman-minidisc}, C_{minidisc-mp3 player}, C_{mp3 player-ipod}.]


Finding Change Periods

• Kleinberg’s burst detection

• Out of the box Java implementation from CIShell

For Evaluation:

Compare to manually found change periods (Known CPs)
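To illustrate the idea behind burst detection, here is a simplified two-state variant in the spirit of Kleinberg's batched model, solved with a Viterbi pass. It is not the CIShell implementation mentioned above; the function names, the parameters `s` and `gamma`, and the example counts are assumptions made for this sketch.

```python
import math

def two_state_bursts(relevant, totals, s=2.0, gamma=1.0):
    """Label each time batch 0 (baseline) or 1 (burst).

    relevant[t]: documents mentioning the name in batch t; totals[t]: all documents in batch t.
    State 1 expects a rate s times higher than the overall rate; entering it costs gamma.
    """
    R, D = sum(relevant), sum(totals)
    if R == 0:
        return [0] * len(relevant)
    p0 = min(R / D, 0.999)      # baseline rate
    p1 = min(s * p0, 0.9999)    # burst rate, kept below 1

    def fit(p, r, d):
        # Negative log-likelihood of r hits in d documents (binomial coefficient omitted,
        # since it is identical for both states).
        return -(r * math.log(p) + (d - r) * math.log(1 - p))

    cost = [0.0, math.inf]      # the automaton starts in the baseline state
    back = []
    for r, d in zip(relevant, totals):
        into0 = min((cost[0], 0), (cost[1], 1))          # leaving a burst is free
        into1 = min((cost[0] + gamma, 0), (cost[1], 1))  # entering a burst costs gamma
        back.append((into0[1], into1[1]))
        cost = [into0[0] + fit(p0, r, d), into1[0] + fit(p1, r, d)]

    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    return states[::-1]

def burst_intervals(states):
    """Turn a 0/1 state sequence into (first_batch, last_batch) intervals."""
    intervals, start = [], None
    for i, s in enumerate(states + [0]):
        if s and start is None:
            start = i
        elif not s and start is not None:
            intervals.append((start, i - 1))
            start = None
    return intervals

# Made-up yearly counts: the name spikes in batches 3 and 4.
states = two_state_bursts([1, 1, 1, 10, 12, 1, 1], [100] * 7)
print(burst_intervals(states))
```

Each detected interval would then be treated as a candidate change period and compared against the manually found ones.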


Finding Direct Co-references

1. Extract text for each change period

2. Term & NE extraction

3. Build co-occurrence graph

4. Rules to merge terms from dictionary and graph

Sub-Term Rule: Cardinal Joseph Ratzinger ↔ Joseph Ratzinger

Prefix/suffix Rule: Cardinal Joseph Ratzinger ↔ Cardinal Ratzinger

Prolongation Rule: Pope John Paul + John Paul II = Pope John Paul II

Example: Cardinal Joseph Ratzinger, Cardinal Ratzinger, Joseph Ratzinger
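The three merging rules can be sketched on whitespace-tokenized terms as follows. The exact rule definitions are assumptions read off the slide examples (in particular, the prefix/suffix rule is taken to mean "same first and same last token"); the function names are hypothetical.

```python
def is_subterm(short, long):
    """Sub-term rule: `short` occurs in `long` as a contiguous token sequence."""
    s, l = short.split(), long.split()
    return len(s) < len(l) and any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

def shares_prefix_and_suffix(a, b):
    """Prefix/suffix rule (assumed reading): different terms with equal first and last tokens."""
    x, y = a.split(), b.split()
    return a != b and min(len(x), len(y)) >= 2 and x[0] == y[0] and x[-1] == y[-1]

def prolong(a, b):
    """Prolongation rule: join `a` and `b` if the end of `a` overlaps the start of `b`."""
    x, y = a.split(), b.split()
    for k in range(min(len(x), len(y)) - 1, 0, -1):  # try the longest proper overlap first
        if x[-k:] == y[:k]:
            return " ".join(x + y[k:])
    return None

print(is_subterm("Joseph Ratzinger", "Cardinal Joseph Ratzinger"))
print(shares_prefix_and_suffix("Cardinal Joseph Ratzinger", "Cardinal Ratzinger"))
print(prolong("Pope John Paul", "John Paul II"))
```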


Detailed Merging

• Merge one token terms (Co-ref classes):

– Pope Benedict and Benedict = coref_Benedict {Pope Benedict, Benedict}

– Benedict XVI and Benedict = coref_Benedict {Benedict XVI, Benedict}

→ choose Benedict as representative (highest frequency)

• Merge co-reference classes

→ coref_Benedict {Pope Benedict, Benedict, Benedict XVI}

• Apply remaining rules:

– Merge coref_Benedict with Pope Benedict XVI (subterm rule)

→ coref_Benedict {Pope Benedict, Benedict, Benedict XVI, Pope Benedict XVI}
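The merging steps for the Benedict example can be traced with a short sketch. The helper names and the frequency counts are made up for illustration; only the terms and the resulting class come from the slide.

```python
from collections import defaultdict

def merge_one_token_classes(pairs):
    """Group (multi-token term, one-token term) pairs into classes keyed by the one-token term."""
    classes = defaultdict(set)
    for term, token in pairs:
        classes[token].update({term, token})
    return dict(classes)

def representative(members, freq):
    """Pick the member with the highest frequency as class representative."""
    return max(members, key=lambda term: freq.get(term, 0))

def apply_subterm_rule(members, candidate):
    """Add `candidate` if some class member occurs in it as a contiguous token sequence."""
    cand = candidate.split()
    for member in members:
        m = member.split()
        if any(cand[i:i + len(m)] == m for i in range(len(cand) - len(m) + 1)):
            return members | {candidate}
    return members

pairs = [("Pope Benedict", "Benedict"), ("Benedict XVI", "Benedict")]
freq = {"Benedict": 120, "Benedict XVI": 75, "Pope Benedict": 40}  # hypothetical counts

classes = merge_one_token_classes(pairs)  # steps 1 and 2: one class per shared token
coref = apply_subterm_rule(classes["Benedict"], "Pope Benedict XVI")  # step 3
print(representative(coref, freq), sorted(coref))
```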


Filtering and Evaluation

• NEER - unfiltered results

• Correlation (NEER + Corr)

– Pearson's correlation between co-references (e.g., Pope Benedict XVI, Vatican)

– minCorr = 0.4

• Document Frequencies (NEER + DF)

– Keep co-references with df < df_query × scale factor (df_max)

• Machine Learning (NEER + ML)

– Random forest with features: correlation, covariance, rank correlation and normalized rank correlation

– 15-fold cross-validation

• Co-occurrence (baseline)

– Using all co-occurring terms
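A minimal sketch of the correlation filter, assuming that each term is represented by a frequency time series and that a candidate is kept when its Pearson correlation with the query term reaches minCorr. Both the series representation and the made-up numbers are assumptions; the slides only state the measure and the threshold.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def filter_by_correlation(query_series, candidates, min_corr=0.4):
    """Keep candidate co-references whose series correlates with the query term."""
    return [name for name, series in candidates.items()
            if pearson(query_series, series) >= min_corr]

# Hypothetical yearly frequencies.
query = [1, 2, 3, 4]
candidates = {"rises with query": [2, 4, 6, 8], "falls as query rises": [4, 3, 2, 1]}
print(filter_by_correlation(query, candidates))
```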


Experiments

• Two settings:

– Using Known CP (Manual) & Found CP (Burst Detection)

• Measure Precision and Recall

– Precision is straightforward

• Recall:

– Group names: (Joseph Ratzinger, Cardinal Joseph Ratzinger, ...) (Pope Benedict XVI, Benedict XVI, ...)

– 100% recall: find all direct co-references and at least one indirect co-reference from each group


Data and Test set

• New York Times Corpus (1986-2007)

• Manually created test set

–Have a change period in NYTimes

–Occur at least 5 times in at least one change period

33 names

86 co-references (44 indirect/42 direct)


Results (I/II)

• Burst detection

– Found 73% of all change periods in total

– On average 3.2 bursts per name

– Using the top 6 bursts we found 66%


Results (II/II)


Lessons learned

• Lemmatization did not work well.

• Big difference between a minimum term length of 1 and 2 words

– (1 → higher recall, 2 → higher precision)

• The order of rule execution is important.

– Different results if we start with Rule 1, 2 or 3

• Allow stop words to be lowercase

– e.g., Union of Myanmar


Handling of Name Changes on Wikipedia


Only structured evolution information is revision history

– Does not reflect the evolution of corresponding entity

According to Wikipedia guidelines:

– In case of a name change: new article, redirect from former name

– Reason is missing, only explanation in revision history (sometimes)

– Former names should be mentioned at the beginning of an article

List pages provide semi-structured information

– e.g., List of city name changes: “Edo → Tokyo (1868)”

– Easy to parse

– Served as a starting point for our analysis

Not all list pages look the same

– e.g., List of renamed products: “Dime Bar, a confectionery product from Kraft Foods was rebranded Daim bar in the United Kingdom in September 2005”

Do excerpts of limited length exist in articles that are dedicated to name changes?
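To show why entries such as “Edo → Tokyo (1868)” are easy to parse while free-text entries are not, here is a small regular-expression sketch. The pattern and the function name `parse_change` are illustrative assumptions; they cover only the single "old → new (year)" form quoted above.

```python
import re

# Matches one change of the form "<old name> → <new name> (<four-digit year>)".
CHANGE = re.compile(r"^\s*(?P<old>.+?)\s*→\s*(?P<new>.+?)\s*\((?P<year>\d{4})\)\s*$")

def parse_change(line):
    """Return (former name, new name, year) or None if the line has a different shape."""
    m = CHANGE.match(line)
    return (m["old"], m["new"], int(m["year"])) if m else None

print(parse_change("Edo → Tokyo (1868)"))
print(parse_change("Dime Bar, a confectionery product from Kraft Foods was rebranded "
                   "Daim bar in the United Kingdom in September 2005"))
```

The second call returns `None`, which is the case where pattern learning on free text becomes necessary.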


Dataset

19 semi-structured seed lists (9 redundant) → 10 remaining (only geo names):

– Geographical renaming

– List of city name changes

– List of administrative division name changes

– 7 lists dedicated to renamings of cities in certain countries

1,926 distinct entities

– 2,852 name changes

– 2,782 articles (of 1,898 entities)

– 766 entities with names resolvable to multiple articles

– 28 entities could not be resolved

Analysis

– Most changes: 11 (Plovdiv, Bulgaria)

Kendros (Kendrisos/Kendrisia) → Odryssa → Eumolpia → Philipopolis → Trimontium → Ulpia → Flavia → Julia → Paldin/Ploudin → Poulpoudeva → Filibe → Plovdiv

– On average: 1.48 changes, 2.39 different names


Analysis: Goal / Objectives

Knowledge Base dedicated to Entity Evolution

– Extracted from Wikipedia articles

– Learning the patterns to extract structures

– Three components: former name, new name, date of change

Sentence distance of an evolution

– e.g., Thamesdown → Swindon (1997)

“On 1 April 1997 it was made administratively independent of Wiltshire County Council, with its council becoming a new unitary authority. It adopted the name Swindon on 24 April 1997. The former Thamesdown name and logo are still used by the main local bus company of Swindon, called Thamesdown Transport Limited.”

– Minimum sentence distance: 1

How many sentences do excerpts span that cover all three components?
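The sentence distance of the Thamesdown example can be reproduced with a short sketch. It assumes that the distance is the index difference between the first and last sentence of the smallest window that mentions all three components, and it uses naive substring matching; both are simplifications for illustration, and `min_sentence_distance` is a hypothetical name.

```python
from itertools import product

def min_sentence_distance(sentences, components):
    """Smallest index span of sentences that together mention every component, or None."""
    positions = [[i for i, s in enumerate(sentences) if c in s] for c in components]
    if any(not p for p in positions):
        return None  # at least one component is never mentioned
    return min(max(combo) - min(combo) for combo in product(*positions))

sentences = [
    "On 1 April 1997 it was made administratively independent of Wiltshire County "
    "Council, with its council becoming a new unitary authority.",
    "It adopted the name Swindon on 24 April 1997.",
    "The former Thamesdown name and logo are still used by the main local bus company "
    "of Swindon, called Thamesdown Transport Limited.",
]
print(min_sentence_distance(sentences, ("Thamesdown", "Swindon", "1997")))
```

A distance of 0 means that a single sentence contains the former name, the new name and the date.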


1. From evolution lists to excerpts

696 entities remaining with 918 name changes annotated with dates

Results

572 complete name changes are mentioned in articles (preceding name, succeeding name and date)

62.3% of the 918 considered name changes


2. Analyzing excerpts

Sentence distance

More than 85% of the 572 considered changes are completely mentioned in excerpts with 10 sentences or less

There are passages in Wikipedia articles dedicated to describing an entity’s evolution!


More than two-thirds of the found changes have a sentence distance of less than 3 (excerpts spanning 3 sentences or less)

– 389 of the 572 considered changes (68.0%)

– 79.7% of the 488 changes whose excerpts span 10 sentences or less

Results: sentence distance [number of excerpts]

– 0 [226]

– 1 [118]

– 2 [45]

– ≥3 and <10 [99]

– ≥10 [84]
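The shares quoted on this slide follow directly from the excerpt counts in the chart. The sketch below only redoes that arithmetic; the bucket labels are shorthand, and reading the "≥3" bucket as "3 up to, but not including, the ≥10 bucket" is an assumption supported by the counts summing to 572.

```python
# Excerpt counts per sentence distance, as given in the chart.
counts = {"0": 226, "1": 118, "2": 45, "3+": 99, "10+": 84}

total = sum(counts.values())                          # all considered changes
below_3 = counts["0"] + counts["1"] + counts["2"]     # sentence distance < 3
short_excerpts = total - counts["10+"]                # everything outside the >=10 bucket

print(total, below_3, short_excerpts)
print(round(100 * below_3 / total, 1))           # share of all changes with distance < 3
print(round(100 * below_3 / short_excerpts, 1))  # same, relative to the short excerpts only
print(round(100 * short_excerpts / total, 1))    # share of changes outside the >=10 bucket
```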


Generalization

So far results only for geographic entities

Manually parsed “List of renamed products” (unstructured)

– Same analysis

Similar results

– 80% of name changes reported in articles (vs. 62.3%)

– 91.7% span 10 sentences or less (vs. 85.3%)

– Again more than two-thirds of the changes have a sentence distance < 3: 79.7% vs. 66.7%

Assumption: Similar text patterns

– Enables automatic classification / detection of evolutions

Demo @ Digital Libraries 2014

http://evobase.l3s.de/DL2014_demo


Generalization: Evolution Base

http://evobase.l3s.de/DL2014_demo/Accenture

Conclusion & Outlook

Conclusions

NEER provides a good starting point for historical documents

– Depends on the collection quality, type, etc.

– Need improved linguistic processing

– OCR quality affects NEER

Wikipedia

– A large majority of name changes are mentioned in Wikipedia articles

– More than two-thirds of the found changes have a sentence distance of less than three

– Extracted excerpts can serve as a training set for classifying similar passages

– Allows automatic extraction of name evolution knowledge from Wikipedia

Outlook

– Create larger ground truth

– Using large text corpus for dating of changes

– We are going to use this knowledge for creating an evolution knowledge base

http://evobase.l3s.de/DL2014_demo


References - NEER

• Holzmann, H.; Tahmasebi, N. & Risse, T.: Named entity evolution recognition on the Blogosphere. International Journal on Digital Libraries, Springer Berlin Heidelberg, 1-27, 2014

• Tahmasebi, N.; Niklas, K.; Zenz, G. & Risse, T.: On the applicability of word sense discrimination on 201 years of modern English. International Journal on Digital Libraries, Springer-Verlag, 1-19, 2013

• Tahmasebi, N.; Gossen, G. & Risse, T.: Which Words Do You Remember? Temporal Properties of Language Use in Digital Archives. Theory and Practice of Digital Libraries, Springer, Vol. 7489, 32-37, 2012

• Tahmasebi, N.; Gossen, G.; Kanhabua, N.; Holzmann, H. & Risse, T.: NEER: An Unsupervised Method for Named Entity Evolution Recognition. COLING, Indian Institute of Technology Bombay, 2553-2568, 2012


References - Wikipedia

• Holzmann, H. & Risse, T.: Insights into Entity Name Evolution on Wikipedia. Web Information Systems Engineering – WISE 2014, Springer International Publishing, Vol. 8787, 47-61, 2014

• Holzmann, H. & Risse, T.: Named Entity Evolution Analysis on Wikipedia. Proc. of the 2014 ACM Conference on Web Science, ACM, 241-242, 2014

• Holzmann, H. & Risse, T.: Extraction of Evolution Descriptions from the Web. Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, IEEE Press, Piscataway, NJ, USA, pp. 413-414


Thank You!

Dr. Thomas Risse

Forschungszentrum L3S

Appelstrasse 9a

30167 Hannover

E-mail: risse@L3S.de

Phone: 0511 – 762 17764

Fax: 0511 – 762 17779
