Presentation at KAIST, 20. April 2015
The Role of Language Evolution in Digital Archives
Thomas Risse, Helge Holzmann (L3S); Nina Tahmasebi (University of Gothenburg)
Web Science @ L3S
“Preserving, understanding and shaping the Web”
Computer Science and interdisciplinary research on all aspects of the Web
• Internet: Communication and networks
• Information: Accessing information and knowledge on and through the Web
• Community: Supporting communities and groups on the Web, for research, education, production and entertainment
• Society: Requirements (technological, social, legal) for the Web
Selected Projects
• Open Data and Education
• K3: The more information, the better?
• Retrieval, Exploration and Analytics for Web Archives (ERC)
• Towards a Unified Information & Queuing Theory (ERC)
• MAPPING: Privacy, Copyright and Internet Governance
• Managed Forgetting and Contextualized Remembering
L3S Research Center
ALEXANDRIA
Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives
Motivation
• Optimal access to Web archives, taking into account
  – the temporal dimension of Web archives
  – structured semantic information available on the Web
  – social media and network information
Objectives
• Evolution-Aware Entity-Based Enrichment and Indexing
• Aggregating Social Networks and Streams
• Temporal Retrieval and Ranking
• Collaborative Exploration and Analytics
Testbeds
• Temporal Wikipedia
• Academic Web Archive
• Politics on the Web
Language changes over time!
Our language is DYNAMIC and changes with our culture, politics, technology, social media, etc.
– Different spellings over time
– New words are introduced
– Words change their meanings
In long-term digital archives: Problems!
Problem: Finding Documents
A scholar writing a thesis on the “Economy of Stalingrad”:
“I want to know more about Stalingrad”
What is Google saying?
… and Wikipedia?
http://en.wikipedia.org/wiki/Stalingrad
Problem: Finding Documents
• Manual Query Formulation
  – Sometimes the solution is in Wikipedia or a dictionary!
  – Other times:
    • Not important enough?
    • User-generated content
    • Slang/jargon/domain-specific
• Better: Search Engine Support
  – Needs knowledge about evolution
  – How can we extract this knowledge?
Tsaritsyn (1589-1925) → Stalingrad (1925-1961) → Volgograd
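To illustrate what search engine support could look like once name evolution is known, here is a minimal sketch of time-aware query expansion. The name periods come from the timeline above (the start year 1961 for Volgograd is implied by it); the `NamePeriod` structure and the functions `expand_query` and `names_in_use` are hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NamePeriod:
    name: str
    start: int           # first year the name was in use
    end: Optional[int]   # last year the name was in use, None if still current


# Name history of one entity, taken from the timeline on the slide.
VOLGOGRAD = [
    NamePeriod("Tsaritsyn", 1589, 1925),
    NamePeriod("Stalingrad", 1925, 1961),
    NamePeriod("Volgograd", 1961, None),
]


def expand_query(term: str, history: List[NamePeriod]) -> List[str]:
    """Return every known name of the entity if `term` is one of them."""
    names = [p.name for p in history]
    return names if term in names else [term]


def names_in_use(history: List[NamePeriod], year: int) -> List[str]:
    """Return the names that were valid in the given year."""
    return [
        p.name
        for p in history
        if p.start <= year and (p.end is None or year <= p.end)
    ]


print(expand_query("Stalingrad", VOLGOGRAD))
print(names_in_use(VOLGOGRAD, 1950))
```

A query for "Stalingrad" is expanded to all three names, while `names_in_use` restricts the names to those valid for a given document date.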
Available Information: The Time Dimension
Backward (from today):
• Lack of dictionaries
• Few experts
• OCR errors
• Data sparsity
Forward (from today):
• Incremental changes
• Resources exist (Wikipedia, WordNet)
• People can help
• Noisy data (Web)
• Data flood
Named Entity Evolution Recognition (for historical collections)
Named Entity Evolution
Definitions:
– Context C_wi: all terms related to word w at time i
– Temporal co-references: names used for the same entity at the same or different points in time
– Direct co-reference: co-references with lexical overlap
– Indirect co-reference: co-references without lexical overlap
– Change period (CP): period of time in which the change occurs
Example: Cardinal Joseph Ratzinger → Pope Benedict XVI, CP = 2005
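The distinction between direct and indirect co-references can be sketched in a few lines. This sketch assumes that "lexical overlap" means sharing at least one whitespace-separated token, compared case-insensitively; the slides do not spell out the exact test, and the function names are hypothetical.

```python
def tokens(name: str) -> set:
    """Lower-cased whitespace tokens of a name."""
    return set(name.lower().split())


def coreference_type(a: str, b: str) -> str:
    """Classify a pair of co-referring names by lexical overlap (assumed: shared token)."""
    return "direct" if tokens(a) & tokens(b) else "indirect"


print(coreference_type("Cardinal Joseph Ratzinger", "Joseph Ratzinger"))
print(coreference_type("Cardinal Joseph Ratzinger", "Pope Benedict XVI"))
```

The first pair shares the tokens "joseph" and "ratzinger" and is therefore direct; the second pair shares no token and is indirect.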
Our Method - Overview
1. Identify change periods (CPs)
2. Create one context per CP
3. Capture at least two co-references
No need to compare vastly different contexts!
[Figure: timeline with change periods t1, t2, t3 for the sequence walkman, discman, minidisc, mp3 player, ipod, with one context per change: C_walkman-discman, C_discman-minidisc, C_minidisc-mp3 player, C_mp3 player-ipod]
Finding Change Periods
• Kleinberg’s burst detection
• Out-of-the-box Java implementation from CIShell
For evaluation:
Compare to manually identified change periods (known CPs)
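The idea of finding candidate change periods from bursts and checking them against known CPs can be sketched as follows. Note that this is not Kleinberg's algorithm: as a stand-in, the sketch simply flags years whose frequency exceeds the mean by `k` standard deviations. The yearly counts are invented toy data, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def simple_bursts(counts: dict, k: float = 1.0) -> list:
    """Years whose frequency exceeds mean + k * standard deviation.

    Simplified stand-in for burst detection, not Kleinberg's automaton model.
    """
    values = list(counts.values())
    threshold = mean(values) + k * pstdev(values)
    return sorted(year for year, c in counts.items() if c > threshold)


def found_known_cps(bursts: list, known_cps: list) -> list:
    """Known change periods that coincide with a detected burst."""
    return [cp for cp in known_cps if cp in bursts]


# Toy yearly document frequencies for one name (invented for illustration).
counts = {2001: 3, 2002: 4, 2003: 2, 2004: 5, 2005: 40, 2006: 6}
bursts = simple_bursts(counts)
print(bursts)
print(found_known_cps(bursts, known_cps=[2005]))
```

With these toy counts only 2005 is flagged, which matches the manually known change period used in the example.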
Finding Direct Co-references
1. Extract text for each change period
2. Term & NE extraction
3. Build co-occurrence graph
4. Rules to merge terms from dictionary and graph
Sub-term rule: Cardinal Joseph Ratzinger ↔ Joseph Ratzinger
Prefix/suffix rule: Cardinal Joseph Ratzinger ↔ Cardinal Ratzinger
Prolongation rule: Pope John Paul + John Paul II = Pope John Paul II
Merged terms: Cardinal Joseph Ratzinger, Cardinal Ratzinger, Joseph Ratzinger
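The three merging rules can be sketched on token lists. The exact rule definitions are not given on the slide, so the following reading is an assumption derived from the examples: the sub-term rule checks for a contiguous token subsequence, the prefix/suffix rule checks for an identical first and last token, and the prolongation rule joins two terms where the end of one overlaps the start of the other.

```python
from typing import Optional


def sub_term(a: str, b: str) -> bool:
    """True if the shorter name occurs as a contiguous token run in the longer one."""
    ta, tb = a.split(), b.split()
    short, long_ = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
    n = len(short)
    return any(long_[i:i + n] == short for i in range(len(long_) - n + 1))


def prefix_suffix(a: str, b: str) -> bool:
    """True if two multi-token names share their first and their last token."""
    ta, tb = a.split(), b.split()
    return len(ta) > 1 and len(tb) > 1 and ta[0] == tb[0] and ta[-1] == tb[-1]


def prolongation(a: str, b: str) -> Optional[str]:
    """Join a and b on the longest overlap of a's suffix with b's prefix."""
    ta, tb = a.split(), b.split()
    for k in range(min(len(ta), len(tb)), 0, -1):
        if ta[-k:] == tb[:k]:
            return " ".join(ta + tb[k:])
    return None  # no overlapping tokens


print(sub_term("Cardinal Joseph Ratzinger", "Joseph Ratzinger"))
print(prefix_suffix("Cardinal Joseph Ratzinger", "Cardinal Ratzinger"))
print(prolongation("Pope John Paul", "John Paul II"))
```

Under this reading, "Cardinal Ratzinger" is not a sub-term of "Cardinal Joseph Ratzinger" (the tokens are not contiguous), which is why a separate prefix/suffix rule is needed.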
Detailed Merging
• Merge one-token terms (co-reference classes):
  – Pope Benedict and Benedict → coref_Benedict {Pope Benedict, Benedict}
  – Benedict XVI and Benedict → coref_Benedict {Benedict XVI, Benedict}
  – Choose Benedict as representative (highest frequency)
• Merge co-reference classes:
  – coref_Benedict {Pope Benedict, Benedict, Benedict XVI}
• Apply remaining rules:
  – Merge coref_Benedict with Pope Benedict XVI (sub-term rule)
  – coref_Benedict {Pope Benedict, Benedict, Benedict XVI, Pope Benedict XVI}
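The merging steps above can be sketched as follows. The frequencies are invented toy values, the function names are hypothetical, and the sub-term check assumes a contiguous token match; the sketch only mirrors the Benedict example and is not the original implementation.

```python
def representative(terms: set, freq: dict) -> str:
    """Pick the most frequent term of a class as its representative."""
    return max(terms, key=lambda t: freq.get(t, 0))


def merge_classes(pairs: list, freq: dict) -> dict:
    """Merge pairwise classes that share the same representative."""
    merged = {}
    for pair in pairs:
        rep = representative(pair, freq)
        merged.setdefault(rep, set()).update(pair)
    return merged


def contains(longer: str, shorter: str) -> bool:
    """True if `shorter` occurs as a contiguous token run inside `longer`."""
    l, s = longer.split(), shorter.split()
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def extend_with_subterm(cls: set, candidate: str) -> set:
    """Add `candidate` if any class member is a sub-term of it."""
    return cls | {candidate} if any(contains(candidate, m) for m in cls) else cls


# Toy frequencies (invented); only their order matters here.
freq = {"Benedict": 120, "Benedict XVI": 60, "Pope Benedict": 40}
pairs = [{"Pope Benedict", "Benedict"}, {"Benedict XVI", "Benedict"}]

classes = merge_classes(pairs, freq)
final = extend_with_subterm(classes["Benedict"], "Pope Benedict XVI")
print(sorted(final))
```

Both pairwise classes pick "Benedict" as representative and are therefore merged; the sub-term rule then adds "Pope Benedict XVI".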
Filtering and Evaluation
• NEER: unfiltered results
• Correlation (NEER + Corr)
  – Pearson's correlation between co-references (e.g., Pope Benedict XVI, Vatican)
  – minCorr = 0.4
• Document frequencies (NEER + DF)
  – Keep co-references with df < df_max, where df_max = df_query × scale factor
• Machine learning (NEER + ML)
  – Random forest with features: correlation, covariance, rank correlation and normalized rank correlation
  – 15-fold cross-validation
• Co-occurrence (baseline)
  – Using all co-occurring terms
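The correlation filter can be sketched as follows. The sketch assumes that each term is represented by a frequency time series and that a candidate is kept when its Pearson correlation with the query term is at least minCorr; the slide gives only the threshold of 0.4, so the series representation, the inclusive comparison and all names are assumptions.

```python
from math import sqrt


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equally long series (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def filter_by_correlation(query: list, candidates: dict, min_corr: float = 0.4) -> list:
    """Keep candidate co-references whose series correlates with the query series."""
    return [name for name, series in candidates.items()
            if pearson(query, series) >= min_corr]


# Toy frequency series (invented for illustration).
query = [1, 2, 3, 4, 5]
candidates = {
    "candidate_a": [2, 4, 6, 8, 10],   # rises with the query
    "candidate_b": [5, 4, 3, 2, 1],    # falls while the query rises
    "candidate_c": [3, 3, 3, 3, 3],    # constant, no correlation
}
print(filter_by_correlation(query, candidates))
```

Only the candidate whose usage rises together with the query term passes the filter.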
Experiments
• Two settings:
  – Using known CPs (manual) and found CPs (burst detection)
• Measure precision and recall
  – Precision is straightforward
• Recall:
  – Group names: (Joseph Ratzinger, Cardinal Joseph Ratzinger, ...) (Pope Benedict XVI, Benedict XVI, ...)
  – 100%: find all direct co-references and at least one indirect co-reference from each group
Data and Test set
• New York Times Corpus (1986-2007)
• Manually created test set of names that
  – have a change period in the NYTimes corpus
  – occur at least 5 times in at least one change period
• 33 names
• 86 co-references (44 indirect / 42 direct)
Results (I/II)
• Burst detection
  – Found 73% of all change periods in total
  – On average 3.2 bursts per name
  – Using the top 6 bursts, we found 66%
Results (II/II)
Lessons learned
• Lemmatization did not work well.
• Big difference between a minimum of 1 or 2 words per term (1: higher recall, 2: higher precision).
• The order of rule execution is important.
  – Different results if we start with Rule 1, 2 or 3
• Allow stop words to be lowercase, e.g., Union of Myanmar.
Handling of Name Changes on Wikipedia
• The only structured evolution information is the revision history
  – It does not reflect the evolution of the corresponding entity
• According to Wikipedia guidelines:
  – In case of a name change: new article, redirect from the former name
  – The reason is missing; the only explanation is in the revision history (sometimes)
  – Former names should be mentioned at the beginning of an article
• List pages provide semi-structured information
  – e.g., List of city name changes: “Edo → Tokyo (1868)”
  – Easy to parse
  – Served as a starting point for our analysis
• Not all list pages look the same
  – e.g., List of renamed products: “Dime Bar, a confectionery product from Kraft Foods was rebranded Daim bar in the United Kingdom in September 2005”
• Do excerpts of limited length exist in articles that are dedicated to name changes?
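To show why entries such as “Edo → Tokyo (1868)” are easy to parse while free-text entries are not, here is a minimal regular-expression sketch. The pattern and the function `parse_entry` are hypothetical; the sketch handles only a single arrow followed by a year in parentheses and is not the parser used in the analysis.

```python
import re
from typing import Optional, Tuple

# Assumed entry shape: "<former name> → <new name> (<year>)"
ENTRY = re.compile(
    r"^\s*(?P<old>.+?)\s*→\s*(?P<new>.+?)\s*\((?P<year>\d{3,4})\)\s*$"
)


def parse_entry(line: str) -> Optional[Tuple[str, str, int]]:
    """Return (former name, new name, year), or None if the line does not match."""
    m = ENTRY.match(line)
    if m is None:
        return None
    return m.group("old"), m.group("new"), int(m.group("year"))


print(parse_entry("Edo → Tokyo (1868)"))
print(parse_entry(
    "Dime Bar, a confectionery product from Kraft Foods was rebranded "
    "Daim bar in the United Kingdom in September 2005"
))
```

The semi-structured entry yields all three components, whereas the unstructured product entry does not match and would need a different extraction approach.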
Dataset
• 19 semi-structured seed lists (9 redundant), 10 remaining (only geographic names):
  – Geographical renaming
  – List of city name changes
  – List of administrative division name changes
  – 7 lists dedicated to renamings of cities in certain countries
• 1,926 distinct entities
• 2,852 name changes
• 2,782 articles (of 1,898 entities)
  – 766 entities with names resolvable to multiple articles
  – 28 entities could not be resolved
Analysis
• Most changes: 11 (Plovdiv, Bulgaria)
  – Kendros (Kendrisos/Kendrisia) → Odryssa → Eumolpia → Philipopolis → Trimontium → Ulpia → Flavia → Julia → Paldin/Ploudin → Poulpoudeva → Filibe → Plovdiv
• On average: 1.48 changes, 2.39 different names
Analysis: Goal / Objectives
• Knowledge base dedicated to entity evolution
  – Extracted from Wikipedia articles
  – Learning the patterns to extract structures
  – Three components: former name, new name, date of change
• Sentence distance of an evolution
  – e.g., Thamesdown → Swindon (1997)
    “On 1 April 1997 it was made administratively independent of Wiltshire County Council, with its council becoming a new unitary authority. It adopted the name Swindon on 24 April 1997. The former Thamesdown name and logo are still used by the main local bus company of Swindon, called Thamesdown Transport Limited.”
  – Minimum sentence distance: 1
• How many sentences do excerpts span that cover all three components?
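The sentence distance can be sketched as the smallest index difference between the first and the last sentence of a window that mentions all three components. The sketch assumes pre-split sentences, plain substring matching, and the year alone as the date component; these simplifications and the function name are assumptions, not the procedure used in the study.

```python
from typing import List, Optional


def sentence_distance(sentences: List[str], components: List[str]) -> Optional[int]:
    """Smallest (last - first) index over sentence windows covering all components."""
    best = None
    for start in range(len(sentences)):
        missing = set(components)
        for end in range(start, len(sentences)):
            # Drop every component mentioned in the current sentence.
            missing = {c for c in missing if c not in sentences[end]}
            if not missing:
                dist = end - start
                if best is None or dist < best:
                    best = dist
                break  # a longer window from this start cannot be better
    return best


sentences = [
    "On 1 April 1997 it was made administratively independent of Wiltshire "
    "County Council, with its council becoming a new unitary authority.",
    "It adopted the name Swindon on 24 April 1997.",
    "The former Thamesdown name and logo are still used by the main local bus "
    "company of Swindon, called Thamesdown Transport Limited.",
]
print(sentence_distance(sentences, ["Thamesdown", "Swindon", "1997"]))
```

For the Thamesdown example the second and third sentence together cover the former name, the new name and the year, which gives the minimum sentence distance of 1 stated above.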
Results
1. From evolution lists to excerpts
• 696 entities remaining, with 918 name changes annotated with dates
• 572 complete name changes are mentioned in articles (preceding name, succeeding name and date)
  – 62.3% of the 918 considered name changes
Results
2. Analyzing excerpts
• Sentence distance
  – More than 85% of the 572 considered changes are completely mentioned in excerpts of 10 sentences or fewer
→ There are passages in Wikipedia articles dedicated to describing an entity’s evolution!
Results
2. Analyzing excerpts
• Sentence distance
  – More than two-thirds of the found changes have a sentence distance of less than 3 (excerpts spanning 3 sentences or fewer)
  – 79.7% of the 572 considered changes
[Chart: sentence distance [number of excerpts]: 0 [226], 1 [118], 2 [45], ≥3 [99], ≥10 [84]]
Generalization
• So far, results only for geographic entities
• Manually parsed “List of renamed products” (unstructured)
  – Same analysis
• Similar results
  – 80% of name changes reported in articles (vs. 62.3%)
  – 91.7% span 10 sentences or fewer (vs. 85.3%)
  – Again more than two-thirds of the changes have a sentence distance < 3 (79.7% vs. 66.7%)
• Assumption: similar text patterns
  – Enables automatic classification / detection of evolutions
• Demo @ Digital Libraries 2014: http://evobase.l3s.de/DL2014_demo
Generalization: Evolution Base
http://evobase.l3s.de/DL2014_demo/Accenture
Conclusion & Outlook
Conclusions
• NEER provides a good starting point for historical documents
  – Depends on the collection quality, type, etc.
  – Needs improved linguistic processing
  – OCR quality affects NEER
• Wikipedia
  – A large majority of name changes are mentioned in Wikipedia articles
  – More than two-thirds of the found changes have a sentence distance of less than three
  – Extracted excerpts can serve as a training set for classifying similar passages
  – Allows automatic extraction of name evolution knowledge from Wikipedia
Outlook
• Create a larger ground truth
• Use a large text corpus for dating of changes
• We are going to use this knowledge for creating an evolution knowledge base
http://evobase.l3s.de/DL2014_demo
References - NEER
• Holzmann, H.; Tahmasebi, N. & Risse, T.
Named entity evolution recognition on the Blogosphere
International Journal on Digital Libraries, Springer Berlin Heidelberg, 1-27, 2014
• Tahmasebi, N.; Niklas, K.; Zenz, G. & Risse, T.
On the applicability of word sense discrimination on 201 years of modern English
International Journal on Digital Libraries, Springer-Verlag, 1-19, 2013
• Tahmasebi, N.; Gossen, G. & Risse, T.
Which Words Do You Remember? Temporal Properties of Language Use in Digital Archives
Theory and Practice of Digital Libraries, Springer, Vol. 7489, 32-37, 2012
• Tahmasebi, N.; Gossen, G.; Kanhabua, N.; Holzmann, H. & Risse, T.
NEER: An Unsupervised Method for Named Entity Evolution Recognition
COLING, Indian Institute of Technology Bombay, 2553-2568, 2012
References - Wikipedia
• Holzmann, H. & Risse, T.
Insights into Entity Name Evolution on Wikipedia
Web Information Systems Engineering – WISE 2014, Springer International Publishing, Vol. 8787, 47-61, 2014
• Named Entity Evolution Analysis on Wikipedia
Proc. of the 2014 ACM Conference on Web Science, ACM, 241-242, 2014
• Holzmann, H. & Risse, T.
Extraction of Evolution Descriptions from the Web
Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, IEEE Press, Piscataway, NJ, USA, pp. 413-414
Thank You!
Dr. Thomas Risse
Forschungszentrum L3S
Appelstrasse 9a
30167 Hannover
Email: risse@L3S.de
Phone: 0511 – 762 17764
Fax: 0511 – 762 17779