Evolution in Digital Archives

(1)

Presentation at KAIST, 20. April 2015

The Role of Language


Thomas Risse, Helge Holzmann (L3S); Nina Tahmasebi (University of Gothenburg)


(2)


Web Science @ L3S

"Preserving, understanding and shaping the Web"

Computer Science and interdisciplinary research on all aspects of the Web

• Internet: Communication and networks

• Information: Accessing information and knowledge on and through the Web

• Community: Supporting communities and groups on the Web, for research, education, production and entertainment

• Society: Requirements (technological, social, legal) for the Web

• Selected Projects

– Open Data and Education

– K3: The more information, the better? (original German title: "Je mehr Informationen, desto besser?")

– Retrieval, Exploration and Analytics for Web Archives (ERC)

– Towards a Unified Information & Queuing Theory (ERC)

– MAPPING: Privacy, Copyright and Internet Governance

– Managed Forgetting and Contextualized Remembering

L3S Research Center

(3)


Motivation

• Optimal access to Web archives taking into account

• the temporal dimension of Web archives

• structured semantic information available on the Web

• social media and network information

Objectives

• Evolution-Aware Entity-Based Enrichment and Indexing

• Aggregating Social Networks and Streams

• Temporal Retrieval and Ranking

• Collaborative Exploration and Analytics

Testbeds

• Temporal Wikipedia

• Academic Web Archive

• Politics on the Web

ALEXANDRIA

Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives

(4)

Language changes over time!

Our language is DYNAMIC and changes with our culture, politics, technology,

social media, etc.

– Different spellings over time

– New words are introduced

– Words change their meanings

In long-term digital archives → Problems!

(5)

Problem: Finding Documents

A scholar writing a thesis on the "Economy of Stalingrad":

→ "I want to know more about Stalingrad"

(6)

What is Google saying?


(7)

… and Wikipedia?

http://en.wikipedia.org/wiki/Stalingrad

(8)

Problem: Finding Documents

• Manual Query Formulation

– Sometimes the solution is in Wikipedia or a dictionary!

– Other times:

• Not important enough?

• User-generated content

• Slang / jargon / domain-specific

• Better: Search Engine Support

– Needs knowledge about evolution

– How can we extract this knowledge?

Tsaritsyn (1589-1925) → Stalingrad (1925-1961) → Volgograd
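To make the idea of search engine support concrete, here is a minimal Python sketch of query expansion over a name history. The data structure, the function names and the open-ended end year for Volgograd are assumptions made for illustration; the slides do not describe an implementation.

```python
# Name history from the slide: (name, first year, last year).
# None marks an open-ended period (an assumption for the current name).
STALINGRAD_HISTORY = [
    ("Tsaritsyn", 1589, 1925),
    ("Stalingrad", 1925, 1961),
    ("Volgograd", 1961, None),
]


def expand_query(term, histories):
    """Return all known names of the entity that `term` refers to."""
    for history in histories:
        names = [name for name, _, _ in history]
        if term in names:
            return names
    return [term]  # no evolution knowledge: leave the query unchanged


def names_for_period(history, start, end):
    """Return the names that were valid at some point in [start, end]."""
    return [
        name
        for name, first, last in history
        if first <= end and (last is None or last >= start)
    ]
```

A query for "Stalingrad" would then be expanded to all three names, while restricting the archive to 1930-1950 would keep only "Stalingrad".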

(9)

Available Information: The Time Dimension

Backward (from today)

• Lack of dictionaries

• Few experts

• OCR errors

• Data sparsity

Forward (from today)

• Incremental changes

• Resources exist (Wikipedia, WordNet)

• People can help

• Noisy data (Web)

• Data flood

(11)

Named Entity Evolution Recognition (for historical collections)

(12)

Named Entity Evolution

Definitions:

– Context $C_{w_i}$: all terms related to word w at time i

– Temporal co-references: names used for the same entity at the same or different points in time

– Direct co-reference: co-references with lexical overlap

– Indirect co-reference: co-references without lexical overlap

– Change period (CP): period of time in which a change occurs

Cardinal Joseph Ratzinger → Pope Benedict XVI

CP = 2005

(13)

Our Method - Overview

1. Identify change periods

2. Create one context per change period (CP)

3. Capture at least two co-references

→ No need to compare vastly different contexts!

[Figure: timeline with change periods (t₁, t₂, t₃), each with its own context: C(walkman-discman), C(discman-minidisc), C(minidisc-mp3 player), C(mp3 player-ipod)]

(14)

Finding Change Periods

• Kleinberg’s burst detection

• Out of the box Java implementation from CIShell

For Evaluation:

Compare to manually found change periods (Known CPs)
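The slides rely on the CIShell Java implementation of Kleinberg's burst detection. Purely as an illustration of the underlying idea, the sketch below implements a simplified two-state, batched variant in Python. The function names, the default parameters `s` and `gamma`, and the toy counts in the usage note are assumptions, not values from the talk.

```python
import math


def burst_states(relevant, totals, s=2.0, gamma=1.0):
    """Label each batch 0 (baseline) or 1 (burst) with a two-state automaton.

    relevant[t] = documents mentioning the name in batch t,
    totals[t]   = all documents in batch t.
    Assumes the overall rate is strictly between 0 and 1.
    """
    n = len(relevant)
    p0 = sum(relevant) / sum(totals)   # baseline rate
    p1 = min(s * p0, 0.9999)           # elevated rate of the burst state
    probs = (p0, p1)
    up_cost = gamma * math.log(n)      # cost of moving from state 0 to 1

    def fit_cost(state, r, d):
        # Negative log-likelihood of r hits in d documents (binomial).
        p = probs[state]
        log_binom = math.lgamma(d + 1) - math.lgamma(r + 1) - math.lgamma(d - r + 1)
        return -(log_binom + r * math.log(p) + (d - r) * math.log(1 - p))

    # Viterbi search for the cheapest state sequence, starting in state 0.
    cost = [0.0, math.inf]
    back = []
    for r, d in zip(relevant, totals):
        new_cost, pointers = [], []
        for state in (0, 1):
            options = [
                cost[prev] + (up_cost if prev == 0 and state == 1 else 0.0)
                for prev in (0, 1)
            ]
            prev = 0 if options[0] <= options[1] else 1
            new_cost.append(options[prev] + fit_cost(state, r, d))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)

    # Trace the best path backwards.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    states.reverse()
    return states


def burst_periods(states, labels):
    """Return the labels (e.g. years) of all batches in the burst state."""
    return [label for label, state in zip(labels, states) if state == 1]
```

With toy yearly counts `[1, 1, 1, 10, 12, 1, 1]` out of 100 documents per year, the two high-count years are labelled as a burst; such bursts are the candidate change periods that the evaluation compares against the manually found ones.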

(15)

Finding Direct Co-references

1. Extract text for each change period

2. Term & NE extraction

3. Build co-occurrence graph

4. Rules to merge terms from dictionary and graph

Sub-Term Rule: Cardinal Joseph Ratzinger ↔ Joseph Ratzinger

Prefix/Suffix Rule: Cardinal Joseph Ratzinger ↔ Cardinal Ratzinger

Prolongation Rule: Pope John Paul + John Paul II = Pope John Paul II

Direct co-references: Cardinal Joseph Ratzinger, Cardinal Ratzinger, Joseph Ratzinger
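The three merging rules can be sketched as token-level checks. This is one possible reading of the slide examples, not the NEER implementation itself; the function names and the exact matching conditions (contiguous sub-terms, proper overlap for prolongation) are assumptions.

```python
def is_subterm(short, long):
    """Sub-term rule: `short` occurs as a contiguous token run in `long`."""
    s, l = short.split(), long.split()
    return len(s) < len(l) and any(
        l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1)
    )


def shares_prefix_and_suffix(a, b):
    """Prefix/suffix rule: different terms with the same first and last token."""
    ta, tb = a.split(), b.split()
    return a != b and ta[0] == tb[0] and ta[-1] == tb[-1]


def prolong(left, right):
    """Prolongation rule: join two terms whose ends overlap, else None."""
    tl, tr = left.split(), right.split()
    # Try the longest proper overlap first.
    for k in range(min(len(tl), len(tr)) - 1, 0, -1):
        if tl[-k:] == tr[:k]:
            return " ".join(tl + tr[k:])
    return None
```

Applied to the slide examples, `is_subterm("Joseph Ratzinger", "Cardinal Joseph Ratzinger")` and `shares_prefix_and_suffix("Cardinal Joseph Ratzinger", "Cardinal Ratzinger")` both hold, and `prolong("Pope John Paul", "John Paul II")` yields "Pope John Paul II".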

(16)

Detailed Merging

• Merge one-token terms (co-reference classes):

– Pope Benedict and Benedict = $\text{coref}_{\text{Benedict}}$ {Pope Benedict, Benedict}

– Benedict XVI and Benedict = $\text{coref}_{\text{Benedict}}$ {Benedict XVI, Benedict}

→ Choose Benedict as representative (highest frequency)

• Merge co-reference classes

→ $\text{coref}_{\text{Benedict}}$ {Pope Benedict, Benedict, Benedict XVI}

• Apply remaining rules:

– Merge $\text{coref}_{\text{Benedict}}$ with Pope Benedict XVI (sub-term rule)

→ $\text{coref}_{\text{Benedict}}$ {Pope Benedict, Benedict, Benedict XVI, Pope Benedict XVI}
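The merging steps for the Benedict example can be traced with a small Python sketch. The helper names and the frequency values are hypothetical, and the grouping logic is only one plausible reading of the slide (classes are keyed by the most frequent shared one-token term).

```python
from collections import defaultdict


def build_coref_classes(terms, freq):
    """Group multi-token terms under a shared one-token representative."""
    single = [t for t in terms if len(t.split()) == 1]
    classes = defaultdict(set)
    for term in terms:
        tokens = term.split()
        if len(tokens) < 2:
            continue
        shared = [s for s in single if s in tokens]
        if not shared:
            continue
        # Representative = shared one-token term with the highest frequency.
        rep = max(shared, key=lambda s: freq.get(s, 0))
        classes[rep].update({term, rep})
    return dict(classes)


def contains_run(short, long):
    """True if the tokens of `short` occur contiguously in `long`."""
    s, l = short.split(), long.split()
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def absorb_superterms(classes, candidates):
    """Sub-term rule: add candidates that contain a member of a class."""
    for members in classes.values():
        for cand in candidates:
            if any(contains_run(m, cand) for m in members):
                members.add(cand)
    return classes


# Hypothetical frequencies, for illustration only.
freq = {"Benedict": 120, "Pope Benedict": 40, "Benedict XVI": 60}
classes = build_coref_classes(["Pope Benedict", "Benedict", "Benedict XVI"], freq)
classes = absorb_superterms(classes, ["Pope Benedict XVI"])
```

After both steps, `classes["Benedict"]` contains the four names listed on the slide.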

(17)


Filtering and Evaluation

• NEER: unfiltered results

• Correlation (NEER + Corr)

– Pearson's correlation between co-references (e.g., Pope Benedict XVI, Vatican)

– minCorr = 0.4

• Document Frequencies (NEER + DF)

– Keep co-references with $df < df_{query} \cdot$ scale factor ($df_{max}$)

• Machine Learning (NEER + ML)

– Random forest with features: correlation, covariance, rank correlation and normalized rank correlation

– 15-fold cross-validation

• Co-occurrence (baseline)

– Using all co-occurring terms
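The correlation and document-frequency filters can be illustrated as follows. This is a sketch under assumptions: the slides do not say which series are correlated, so the code assumes per-period frequency series for the query term and each candidate; the function names and the toy numbers in the test are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long series (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def filter_by_correlation(query_series, candidates, min_corr=0.4):
    """NEER + Corr: keep candidates whose series correlates with the query."""
    return [
        name
        for name, series in candidates.items()
        if pearson(query_series, series) >= min_corr
    ]


def filter_by_df(df_query, candidate_dfs, scale):
    """NEER + DF: keep candidates with df < df_query * scale."""
    return [name for name, df in candidate_dfs.items() if df < df_query * scale]
```

The default `min_corr=0.4` mirrors the minCorr value on the slide; the scale factor has no default because the slide does not state one.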

(18)

Experiments

• Two settings:

– Using Known CP (Manual) & Found CP (Burst Detection)

• Measure Precision and Recall

– Precision: straightforward

• Recall:

– Group names: (Joseph Ratzinger, Cardinal Joseph Ratzinger, ...), (Pope Benedict XVI, Benedict XVI, ...)

– 100%: find all direct co-references and at least 1 indirect co-reference from each group

(19)

Data and Test set

• New York Times Corpus (1986-2007)

• Manually created test set of names that

– have a change period in the NYTimes corpus

– occur at least 5 times in at least one change period

• 33 names

• 86 co-references (44 indirect / 42 direct)

(20)

Results (I/II)

• Burst detection

– Found 73% of all change periods in total

– Avg. 3.2 bursts/name

– Using the top 6 bursts we found 66%

(21)

Results (II/II)

(22)

Lessons learned

• Lemmatization did not work well.

• Big difference between a 1-word and a 2-word minimum

– 1 → higher recall, 2 → higher precision

• The order of rule execution is important

– Different results if we start with Rule 1, 2 or 3

• Allow stop words to be lowercase

– e.g., Union of Myanmar

(24)

Handling of Name Changes on Wikipedia


(25)

Handling of Name Changes on Wikipedia

• The only structured evolution information is the revision history

– It does not reflect the evolution of the corresponding entity

• According to Wikipedia guidelines:

– In case of a name change: new article, with a redirect from the former name

– The reason is missing; an explanation appears only in the revision history (sometimes)

– Former names should be mentioned at the beginning of an article

• List pages provide semi-structured information

– e.g., List of city name changes: "Edo → Tokyo (1868)"

– Easy to parse

– Served as a starting point for our analysis

• Not all list pages look the same

– e.g., List of renamed products: "Dime Bar, a confectionery product from Kraft Foods was rebranded Daim bar in the United Kingdom in September 2005"

Do excerpts of limited length exist in articles that are dedicated to name changes?
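To show why entries such as "Edo → Tokyo (1868)" are easy to parse while free-text entries are not, here is a minimal sketch with a regular expression. The pattern and the function name are assumptions; real list pages will need more variants than this single form.

```python
import re

# Matches entries of the form "<former name> → <new name> (<year>)".
CHANGE_PATTERN = re.compile(r"^\s*(.+?)\s*(?:→|->)\s*(.+?)\s*\((\d{3,4})\)\s*$")


def parse_change(entry):
    """Return (former name, new name, year) or None if the entry does not match."""
    match = CHANGE_PATTERN.match(entry)
    if match is None:
        return None
    former, new, year = match.groups()
    return former, new, int(year)
```

The semi-structured city entry yields `("Edo", "Tokyo", 1868)`, whereas the free-text "Dime Bar" sentence returns `None` and would need manual parsing or a learned extractor.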

(26)

Handling of Name Changes on Wikipedia


(27)

Dataset

• 19 semi-structured seed lists (9 redundant) → 10 remaining (only geographic names):

– Geographical renaming

– List of city name changes

– List of administrative division name changes

– 7 lists dedicated to renamings of cities in certain countries

• 1,926 distinct entities

• 2,852 name changes

• 2,782 articles (of 1,898 entities)

– 766 entities with names resolvable to multiple articles

– 28 entities could not be resolved

Analysis

• Most changes: 11 (Plovdiv, Bulgaria)

– Kendros (Kendrisos/Kendrisia) → Odryssa → Eumolpia → Philipopolis → Trimontium → Ulpia → Flavia → Julia → Paldin/Ploudin → Poulpoudeva → Filibe → Plovdiv

• On average: 1.48 changes, 2.39 different names


(28)

Analysis: Goal / Objectives

• Knowledge base dedicated to entity evolution

– Extracted from Wikipedia articles

– Learning the patterns to extract structures

• Three components: former name, new name, date of change

• Sentence distance of an evolution

– e.g., Thamesdown → Swindon (1997)

– "On 1 April 1997 it was made administratively independent of Wiltshire County Council, with its council becoming a new unitary authority. It adopted the name Swindon on 24 April 1997. The former Thamesdown name and logo are still used by the main local bus company of Swindon, called Thamesdown Transport Limited."

– Minimum sentence distance: 1

How many sentences do excerpts span that cover all three components?
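The minimum sentence distance can be computed with a short Python sketch. It assumes naive sentence splitting on end punctuation and plain substring matching for the three components; both are simplifications, and the function name is hypothetical.

```python
import re


def min_sentence_distance(text, components):
    """Smallest (last - first) sentence index over windows covering all components.

    Returns None if no window of sentences mentions every component.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    best = None
    for start in range(len(sentences)):
        missing = set(components)
        for end in range(start, len(sentences)):
            missing -= {c for c in missing if c in sentences[end]}
            if not missing:
                if best is None or end - start < best:
                    best = end - start
                break
    return best


EXCERPT = (
    "On 1 April 1997 it was made administratively independent of Wiltshire "
    "County Council, with its council becoming a new unitary authority. "
    "It adopted the name Swindon on 24 April 1997. "
    "The former Thamesdown name and logo are still used by the main local "
    "bus company of Swindon, called Thamesdown Transport Limited."
)
```

For the Thamesdown excerpt and the components "Thamesdown", "Swindon" and "1997", the second and third sentences together cover everything, which gives the distance of 1 stated on the slide.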

(29)

1. From evolution lists to excerpts

Results

• 696 entities remaining, with 918 name changes annotated with dates

• 572 complete name changes are mentioned in articles (preceding name, succeeding name and date)

– 62.3% of the 918 considered name changes

(30)

2. Analyzing excerpts

• Sentence distance

• More than 85% of the 572 considered changes are completely mentioned in excerpts with 10 sentences or less

→ There are passages in Wikipedia articles dedicated to describing an entity’s evolution!

Results

(31)

2. Analyzing excerpts

• Sentence distance

• More than two-thirds (68.0%) of the 572 found changes have a sentence distance of less than 3 (excerpts spanning 3 sentences or less)

• This corresponds to 79.7% of the 488 changes mentioned in excerpts with 10 sentences or less

Results

Sentence distance: number of excerpts

• 0: 226

• 1: 118

• 2: 45

• ≥3 (below 10): 99

• ≥10: 84
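The percentages quoted for the excerpt analysis can be checked directly from the chart counts. The sketch below assumes that the "≥3" bucket covers distances 3 to 9, which is consistent with the five buckets summing to the 572 considered changes; the variable names are illustrative.

```python
# Excerpt counts per sentence distance, as read from the chart.
distance_counts = {"0": 226, "1": 118, "2": 45, "3-9": 99, ">=10": 84}

total = sum(distance_counts.values())             # all found changes
within_ten = total - distance_counts[">=10"]      # excerpts of 10 sentences or less
below_three = sum(distance_counts[k] for k in ("0", "1", "2"))


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


share_within_ten = pct(within_ten, total)                  # 85.3
share_below_three = pct(below_three, total)                # 68.0
share_below_three_of_short = pct(below_three, within_ten)  # 79.7
```

Under this reading, 85.3% of the 572 changes fall within 10 sentences, 68.0% of all changes have a distance below 3, and 79.7% is the share of distance-below-3 excerpts among the 488 short excerpts.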


(32)

Generalization

• So far, results only for geographic entities

• Manually parsed "List of renamed products" (unstructured)

– Same analysis

• Similar results

– 80% of name changes reported in articles (vs. 62.3%)

– 91.7% span 10 sentences or less (vs. 85.3%)

– Again, more than two-thirds of the changes have a sentence distance < 3

– 79.7% vs. 66.7%

• Assumption: similar text patterns

– Enables automatic classification / detection of evolutions

• Demo @ Digital Libraries 2014

– http://evobase.l3s.de/DL2014_demo

(33)

Generalization: Evolution Base

http://evobase.l3s.de/DL2014_demo/Accenture


(34)

Conclusion & Outlook

Conclusions

• NEER provides a good starting point for historical documents

– Depends on the collection quality, type, etc.

– Need improved linguistic processing

– OCR quality affects NEER

• Wikipedia

– A large majority of name changes are mentioned in Wikipedia articles

– More than two-thirds of the found changes have a sentence distance of less than three

– Extracted excerpts can serve as a training set for classifying similar passages

– Allows automatic extraction of name evolution knowledge from Wikipedia

Outlook

• Create a larger ground truth

• Use a large text corpus for dating of changes

• We are going to use this knowledge for creating an evolution knowledge base

http://evobase.l3s.de/DL2014_demo

(35)

References - NEER

• Holzmann, H.; Tahmasebi, N. & Risse, T. Named entity evolution recognition on the Blogosphere. International Journal on Digital Libraries, Springer Berlin Heidelberg, 1-27, 2014

• Tahmasebi, N.; Niklas, K.; Zenz, G. & Risse, T. On the applicability of word sense discrimination on 201 years of modern English. International Journal on Digital Libraries, Springer-Verlag, 1-19, 2013

• Tahmasebi, N.; Gossen, G. & Risse, T. Which Words Do You Remember? Temporal Properties of Language Use in Digital Archives. Theory and Practice of Digital Libraries, Springer, Vol. 7489, 32-37, 2012

• Tahmasebi, N.; Gossen, G.; Kanhabua, N.; Holzmann, H. & Risse, T. NEER: An Unsupervised Method for Named Entity Evolution Recognition. COLING, Indian Institute of Technology Bombay, 2553-2568, 2012

(36)

References - Wikipedia

• Holzmann, H. & Risse, T. Insights into Entity Name Evolution on Wikipedia. Web Information Systems Engineering – WISE 2014, Springer International Publishing, Vol. 8787, 47-61, 2014

• Named Entity Evolution Analysis on Wikipedia. Proc. of the 2014 ACM Conference on Web Science, ACM, 241-242, 2014

• Holzmann, H. & Risse, T. Extraction of Evolution Descriptions from the Web. Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, IEEE Press, Piscataway, NJ, USA, pp. 413-414

(37)

Thank You!

Dr. Thomas Risse

Forschungszentrum L3S

Appelstrasse 9a

30167 Hannover

E-mail: risse@L3S.de

Phone: 0511 – 762 17764

Fax: 0511 – 762 17779