
Language Resources, Language Technologies, Text Mining, the Semantic Web: How Interoperability of Machines can help Humans in the Multilingual Web


(1)

Language Resources, Language Technology, Text Mining, the Semantic Web: How interoperability of machines can help humans in the multilingual web

Felix Sasaki

DFKI / University of Appl. Sciences Potsdam
W3C German-Austrian Office

felix.sasaki@dfki.de

W3C Workshop “The Multilingual Web - Where Are We?”, 26-27 October 2010, Madrid

(2)

Purpose of this talk (1)

• Show gaps
  – Between machines
  – Between machines and humans
• … which we need to fill to bridge gaps between humans


(3)

Purpose of this talk (2)

• Identify groups / communities
  – To fill gaps
  – To come together in new alliances


(4)

Basics:

What are machines doing (not only on the Web)?


(5)

Language Technology

• Summarization

[Diagram: LT → “These texts are about ...”]


(6)

Language Technology

• Machine Translation

[Diagram: “このワークショップ で開催される” → LT → “The workshop takes place in …”]


(7)

Language Technology

• Spell and grammar checking

[Diagram: “The worksop take place in …” → LT → “The workshop takes place in …”]

• And many more applications
• Coreference resolution, discourse analysis, named entity recognition, natural language generation, question answering, …


(8)

Text mining

• Finding out things you did not know

[Diagram: Text mining → visualization of results]
• “Text A and text B are similar”
• “The text collection has clusters of topics: …”


(9)

Basics:

What are machines doing (not only on the Web)?

How are they doing it?

They are using resources


(10)

Resources in language technology

• Sample resources for summarization

[Diagram: LT → “These texts are about ...”]
[Resources: NLG output, text mining output, stop word list, …]


(11)

Language Technology

• Sample resources in Machine Translation

[Diagram: “このワークショップ で開催される” → LT → “The workshop takes place in …”]
[Resources: Lexicon, Grammar, (Training) corpora, …]
[Label: Generation]


(12)

Language Technology

• Sample resources for spell and grammar checking

[Diagram: “The worksop take place in …” → LT → “The workshop takes place in …”]
[Resources: Lexicon, Grammar, …]


(13)

Text mining

• Sample resources for text mining

[Diagram: Text mining]
• “Text A and text B are similar”
• “The text collection has clusters of topics: …”
[Resources: Lexicon, Stop word list, …]


(14)

In general: you need three types of data: input, resources, workflow

[Diagram: Input → Workflow → Output; the workflow draws on Resources, Resources, …]

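The three types of data named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step functions, the `run_workflow` helper and the stop word list are hypothetical and do not come from the talk.

```python
from typing import Callable, Dict, List, Set

# Hypothetical types: a resource collection and a workflow step.
Resources = Dict[str, Set[str]]
Step = Callable[[str, Resources], str]

def lowercase(text: str, resources: Resources) -> str:
    """A step that needs no resource at all."""
    return text.lower()

def drop_stop_words(text: str, resources: Resources) -> str:
    """A step that depends on a resource (a stop word list)."""
    stop_words = resources.get("stop_words", set())
    return " ".join(word for word in text.split() if word not in stop_words)

def run_workflow(text: str, workflow: List[Step], resources: Resources) -> str:
    """Send the input through every step of the workflow, in order."""
    for step in workflow:
        text = step(text, resources)
    return text

resources = {"stop_words": {"the", "in", "a"}}   # resources
workflow = [lowercase, drop_stop_words]          # workflow
output = run_workflow("The workshop takes place in Madrid", workflow, resources)  # input
print(output)  # workshop takes place madrid
```

Nothing in `output` records which steps ran or which stop word list was used, which is the kind of missing information the following slides call gaps.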

(15)

What gaps need to be filled for truly “multilingual content processing”?

• Gap 1: machines don’t use metadata available in the input
• Gap 2: machines don’t know about the workflow (input) data goes through
• Gap 3: machines don’t make explicit
  – “Who” they are
  – What resources they are using


(16)

Gap 1: machines don’t use metadata available in the input

• Input from www.postbank.de
„Ob Postbank direkt, Online-Banking, Online-Brokerage oder myBHW. Die häufigsten Fragen zu unseren Transaktionssystemen finden Sie an dieser Stelle.“
• Output via Google Translate
“Whether Postbank direct, online banking, online brokerage or myBHW. Frequently asked questions about our transaction systems can be found at this location.”


(17)

Gap 1: machines don’t use metadata available in the input

• Input from www.postbank.de
„Ob Postbank direkt, Online-Banking, Online-Brokerage oder myBHW. Die häufigsten Fragen zu unseren Transaktionssystemen finden Sie an dieser Stelle.“
• Output via Google Translate
“Whether Postbank direct, online banking, online brokerage or myBHW. Frequently asked questions about our transaction systems can be found at this location.”

Fixed terminology should not have been translated.
But the MT tool had no chance to “know” that. Why?


(18)

Gap 2: machines don’t know about processes data goes through

• Input from the database – the “hidden web”:
„Ob <term>Postbank direkt</term>, <term>Online-Banking</term>, <term>Online-Brokerage</term> …“
• Output on the Web:
„Ob <em>Postbank direkt</em>, <em>Online-Banking</em>, <em>Online-Brokerage</em> …“

[Arrow from input to output: publication process]
Fixed terminology (= metadata) … is lost on the Web

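The loss described on this slide can be reproduced with a short Python sketch. The `publish` function is a hypothetical stand-in for a real publication process; only the `<term>` and `<em>` markup is taken from the slide.

```python
import re

# Matches the terminology markup used in the database example on the slide.
TERM = re.compile(r"<term>(.*?)</term>")

def fixed_terms(markup: str) -> list:
    """Return the strings marked as fixed terminology."""
    return TERM.findall(markup)

def publish(markup: str) -> str:
    """Hypothetical publication step: terms are rendered as plain emphasis."""
    return TERM.sub(r"<em>\1</em>", markup)

source = ("Ob <term>Postbank direkt</term>, <term>Online-Banking</term>, "
          "<term>Online-Brokerage</term> …")
published = publish(source)

print(fixed_terms(source))     # ['Postbank direkt', 'Online-Banking', 'Online-Brokerage']
print(fixed_terms(published))  # []
```

After publication the text looks the same to a human reader, but a tool can no longer tell emphasis that marks a fixed term from emphasis used for any other reason.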

(19)

Gap 3: no common identification …

• Of metadata and process chains (previous slides)
• Of resources – e.g. what is a lexicon
  – In machine translation?
  – In localization?
  – For a human reader?
  – Ability to combine tools depends on knowing about them (capabilities, resources) in detail


(20)

Who can fill these gaps – people dealing with multilingual content

• Content producers
  – Allow for terminology identification in source formats / CMS
• Localizers
  – Make localization workflows aware of (process / source content) metadata
• “Machine” experts
  – Make their tools sensitive to source content metadata and expose their capabilities (what resources / workflows) in a clearly defined way

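One way a tool could become sensitive to source content metadata is to shield marked terms before translation and restore them afterwards. The sketch below assumes that approach; `translate_protected`, the placeholder format and `toy_translate` are all hypothetical, and `toy_translate` is a word-replacement stand-in, not a real MT system.

```python
from typing import Callable, List

def translate_protected(text: str, terms: List[str],
                        translate: Callable[[str], str]) -> str:
    """Translate text while leaving the given fixed terms untouched."""
    masked = text
    for index, term in enumerate(terms):
        # Assumption: the translation step passes these placeholders through unchanged.
        masked = masked.replace(term, f"__TERM{index}__")
    translated = translate(masked)
    for index, term in enumerate(terms):
        translated = translated.replace(f"__TERM{index}__", term)
    return translated

def toy_translate(text: str) -> str:
    """Hypothetical stand-in for an MT system (German to English, three words only)."""
    return text.replace("Ob", "Whether").replace("oder", "or").replace("direkt", "direct")

text = "Ob Postbank direkt oder myBHW"
unprotected = toy_translate(text)
protected = translate_protected(text, ["Postbank direkt", "myBHW"], toy_translate)

print(unprotected)  # Whether Postbank direct or myBHW
print(protected)    # Whether Postbank direkt or myBHW
```

The list of terms has to come from somewhere, which is why the content producers and localizers on this slide need to keep the terminology metadata available.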

(21)

Who can fill these gaps – people dealing with multilingual content

• Users
  – Add metadata to source content
  – Use (machine translation) tools without knowing the details – e.g. in the browser!
• Browser vendors
  – Create APIs which make use of automatic tools / resource and workflow descriptions / source code metadata
• …

The people in this room!


(22)

How can they fill the gaps?

• All these groups need to agree upon one machine readable information space for filling the gaps
• It’s actually already here – the Semantic Web!


(23)

What is the Semantic Web

• The Web as humans see it: Identification of “meaning” e.g. via (typographic or other) conventions

„Ob Postbank direkt …“


(24)

What is the Semantic Web

• The Web as machines see it: Identification of meaning via RDF-based mechanisms (here via RDFa)

„Ob <span property="its:term">Postbank direkt</span> …“

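A machine can read the marked term back out of the published page. The following is a minimal sketch that uses only Python's standard `html.parser` module; the `TermCollector` class is hypothetical, and it simply matches the literal attribute value `its:term` from the slide rather than doing full RDFa processing (no prefix resolution, no handling of unclosed void elements inside a term).

```python
from html.parser import HTMLParser

class TermCollector(HTMLParser):
    """Collect the text of elements whose property attribute is 'its:term'."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self._depth = 0     # greater than 0 while inside a marked element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1            # nested element inside a term
        elif dict(attrs).get("property") == "its:term":
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.terms.append("".join(self._buffer))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

collector = TermCollector()
collector.feed('Ob <span property="its:term">Postbank direkt</span> …')
collector.close()
print(collector.terms)  # ['Postbank direkt']
```

Because the marker is carried in the page itself, any tool that receives the page can recover it, which is not the case for the purely typographic convention on the previous slide.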

(25)

What is the Semantic Web – RDF in 30 seconds

• A framework for making statements about resources, using URIs
• RDF can help to fill our gaps
  1. Metadata in the input
  2. Metadata for workflows
  3. Identify 1., 2. and language technology resources uniquely
• In one information space – the machine readable Web

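The idea of statements about resources, identified by URIs, can be sketched without any RDF library as a list of subject, predicate, object triples. All URIs below sit in the hypothetical `http://example.org/` namespace and the vocabulary terms are invented for illustration; a real deployment would use an agreed vocabulary and an RDF toolkit.

```python
from typing import List, NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

EX = "http://example.org/"  # hypothetical namespace, not from the talk

graph: List[Triple] = [
    # 1. Metadata in the input: a text span is a fixed term
    Triple(EX + "text/faq#span1", EX + "vocab#isTerm", "true"),
    # 2. Metadata for workflows: the text was processed by a given tool
    Triple(EX + "text/faq", EX + "vocab#processedBy", EX + "tools/mt1"),
    # 3. Unique identification of tools and the resources they use
    Triple(EX + "tools/mt1", EX + "vocab#usesResource", EX + "resources/lexicon-de-en"),
]

def objects(graph: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in graph if t.subject == subject and t.predicate == predicate]

used = objects(graph, EX + "tools/mt1", EX + "vocab#usesResource")
print(used)  # ['http://example.org/resources/lexicon-de-en']
```

All three kinds of statement live in the same list and can be queried the same way, which is the point of the single information space named on the slide.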

(26)

Instead of a summary – call for (participating in) project proposals

• Who needs to come together
  – Content producers, localizers, “machine” experts, browser vendors, users
• What should their work be based upon
  – Semantic Web technologies
  – Clear interfaces to the human (e.g. browser) Web, like RDFa
• What we do not need
  – Web-centred standardization of formats for language resources themselves – that is already done elsewhere (see this session)
• Where is the place to do that work?
  – W3C, since it needs to be part of core Web technologies
• For making it happen, we need a strong alliance of Web technologies, other fields and machine technologies


(27)

META-NET

• EU-funded project, closely related to “Multilingual Web”
• Main aim: build an alliance for improving language technologies in Europe
• Laaarge: soon 40+ participating organizations in 30+ countries
• Very important: bring users of language technology in


(28)

META-NET

• Users and language technology companies: in Europe these are not only large companies, but more and more SMEs
• META-NET targets these small and fast units, including you
• EU has started special funding programs for SMEs – see http://tinyurl.com/eu-lt-sme (“objective 4.1”)


(29)

META-NET

• Event: META-NET Forum
• Brussels, November 17th/18th
• Aim: Bring users / language technology developers / policy makers together
• Discuss a road map for the next 10 years of language technology and its applications
• Details and registration at http://www.meta-net.eu/events


