The objective of
the LC-STAR project is to improve human-to-human and man-machine
communication in multilingual environments. The project aims to create
lexica and corpora needed for transferring speech-to-speech translation
(SST) components, i.e. flexible vocabulary speech recognition, high
quality text-to-speech synthesis (TTS) and speech centered translation
into selected languages. SST components are targeted to be integrated
into speech driven interfaces embedded in mobile appliances and network
servers. LC-STAR
will concentrate on the one hand on the creation of language resources,
i.e. pronunciation lexica with phonetic, prosodic and morpho-syntactic
content and on the creation of bilingual aligned text corpora.
Furthermore all language resources will be validated by external
validation centers.
On the other hand, speech-to-speech translation technologies will be
investigated with respect to their demands on language resources. The
transfer will be shown by a demonstrator translating among three languages.
Summary of 2004 activities
The main activities proceeded in two parallel tracks.
Progress in Track I
During the year all deliverables and nearly all lexica have been
finalized. The following lexica for ASR and TTS were
submitted for full validation to external validation centers, namely
SPEX, Netherlands
and CST, Denmark: Catalan, Finnish, German, Hebrew, Italian, Mandarin
Chinese, Russian, Spanish, Turkish, US-English.
A conference paper describing validation issues was
presented at LREC 2004 in Lisbon.
Progress in Track II
The development and revision of a tri-lingual aligned corpus in
Catalan, Spanish and US-English for demonstrating speech-to-speech
translation technology has been completed. In addition, a reference
corpus of 10,000 phrases in US-English covering tourist domains has been
created for the development of lexica for statistical machine
translation and for ASR and TTS components. The translation of the
reference corpus into Catalan, Finnish, German, Hebrew, Italian,
Russian, Slovenian and Spanish has been finished. A common DTD for
these lexica has been developed. The lexica will be submitted for
validation to an external validation center, namely CST, Denmark.
To demonstrate the suitability of different structured language
resources, various experiments have been performed. Part-of-speech tags
were shown to improve the performance of single-word-based models
significantly. Furthermore, methods for the integration of
morpho-syntactic information have been investigated. The results have
been published at various conferences. In addition, methods for
automatic evaluation of the translation have been investigated and
developed.
The demonstrator platform 'Gaia' has been finished. It is a telephone
server which can offer translation in the three research languages
covering a tourist domain defined in the project. The following engines
have been included in the demonstrator platform: a speech recognizer,
speech synthesis systems and statistical translation engines.
The University of Maribor joined the project as an external partner for
both Track I and Track II.
Progress of work
Track I:
A detailed description of the corpus collection and word list
extraction requirements can be found on the web page (deliverable
D1.1). The linguistic specifications include phonetic and linguistic
information also for non-European languages, among them Turkish,
Hebrew and Standard Arabic, for which descriptions were until now rare
or did not exist at all (cf. deliverable D2.1.-4). A common linguistic
DTD (document type definition) for all languages has been finished, and
language-specific DTDs were specified as subsets of the common one. The
descriptions can be found in the public part of the web page.
The development of the lexica has been completed for nearly all
languages. A subset of all lexica has been pre-validated. The
pre-validation checks are performed to detect major errors in
production before laborious (and possibly wasted) effort is invested in
lexicon creation. Pre-validation results indicated that all lexica are
of high quality. During the year nearly all lexica have subsequently
been submitted for full validation. The validation checks carried out
can be divided into two groups: formal and manual checking of the
input. A more precise description of the criteria can be found in
deliverable D6.1, which is publicly available on the web page.
Track II:
The tri-lingual corpus has been finished: approximately 800 dialogues
were selected from transcriptions of Spanish and Catalan speech
recordings in a tourist domain (TALP tourism) and from public parts of
the Verbmobil corpus. All dialogues have been translated for the core
language pairs Spanish/Catalan, Catalan/Spanish, Spanish/US-English,
Catalan/US-English, US-English/Spanish and US-English/Catalan. The
whole reference corpus covers scenarios such as requests to travel
agencies and hotels, timetable information and information typically
asked for in tourist offices. Furthermore, the validation
specifications for the tri-lingual corpus have been created and will be
published on the web page. The corpus will be submitted for validation
to an external validation center.
Additionally, a corpus of 10,000 phrases in US-English has been
collected from various sources such as web pages and tourist guides; it
will be the basis for the translation lexica and the lexica for ASR and
TTS for the tourist domain. The translation into the target languages
(Catalan, Finnish, German, Hebrew, Italian, Russian, Slovenian and
Spanish) has been finished. A common DTD for the lexica has been
provided, and final work on completing these lexica is ongoing.
Validation criteria have been developed and are available on the web
page. The lexica will be submitted for validation to CST, Denmark.
For statistical machine translation, experiments were performed using
parts of the tri-lingual aligned corpora, mono-lingual lexica with
enriched morphological information in Spanish and Catalan (additional
databases), as well as first versions of phrasal translation lexica.
One major focus has been to integrate all available resources for
experiments and tests. A summary of the experiments is given below.
1. Word Alignment using Lexicon Smoothing:
The standard lexicon model is based on full-form words only. For highly
inflected languages such as German this might cause problems, because
many full-form words occur only a few times in the training corpus.
Compared to English, the token/type ratio for German is usually much
lower (e.g. on the Verbmobil corpus: English 99.4, German 56.3). The
information that multiple full-form words share the same base form is
not used in the lexicon model. To take this information into account,
the lexicon model is smoothed with a backing-off lexicon that is based
on word base forms. The smoothing method that was applied is absolute
discounting with interpolation.
This smoothing method was tested on the German-English Verbmobil
corpus, because manually aligned data is available for this task. It
results in an improvement of the alignment error rate (AER) for both
translation directions. For the source-to-target direction the AER
improves from 5.7% to 5.2% and for the target-to-source direction from
9.9% to 9.1%. These improvements of the AER are statistically
significant at the 95% level.
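The smoothing idea described above can be illustrated with a minimal
Python sketch. It assumes the textbook form of absolute discounting with
interpolation: a fixed discount d is subtracted from every observed
count, and the freed probability mass is redistributed according to a
distribution pooled over base forms. The function names, the discount
value and the toy counts are assumptions made for illustration; they are
not taken from the project's implementation.

```python
from collections import Counter


def token_type_ratio(tokens):
    """Average number of running words (tokens) per distinct word form (type)."""
    return len(tokens) / len(set(tokens))


def smoothed_lexicon(pair_counts, base_form, d=0.5):
    """Build p(f | e) with absolute discounting and interpolation.

    pair_counts: Counter of (f, e) co-occurrence counts, e being a full-form word
    base_form:   dict mapping each full form e to its base form (toy data here)
    d:           discount with 0 < d < 1 (illustrative value, an assumption)
    """
    n_e = Counter()    # N(e): total count of full form e
    n_fb = Counter()   # N(f, b): counts pooled over all full forms with base b
    n_b = Counter()    # N(b): total count of base form b
    freed = Counter()  # mass removed from the counts of e by discounting
    for (f, e), c in pair_counts.items():
        b = base_form[e]
        n_e[e] += c
        n_fb[(f, b)] += c
        n_b[b] += c
        freed[e] += min(d, c)

    def prob(f, e):
        b = base_form[e]
        discounted = max(pair_counts[(f, e)] - d, 0.0) / n_e[e]
        alpha = freed[e] / n_e[e]      # interpolation weight for full form e
        beta = n_fb[(f, b)] / n_b[b]   # backing-off distribution on base forms
        return discounted + alpha * beta

    return prob


# Toy counts (invented for illustration): two inflected forms of "gehen".
counts = Counter({("go", "gehe"): 3, ("walk", "gehe"): 1,
                  ("go", "gehst"): 1, ("leave", "gehst"): 1})
bases = {"gehe": "gehen", "gehst": "gehen"}
p = smoothed_lexicon(counts, bases)
print(p("leave", "gehe"))  # non-zero although this pair was never observed
```

In this toy setting the pair ("leave", "gehe") receives probability mass
only because "gehst" shares the base form "gehen", which is the effect
the backing-off lexicon is meant to achieve for rare full forms.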
2. Word Alignment using Morpho-Syntactic Information:
Existing translation systems usually treat different derivations of the
same base form as if they were independent of each other. A method was
proposed which takes these interdependencies into account during the EM
training of the statistical alignment models. A hierarchical
representation of the lexicon model is used. It contains additional
information for each word, i.e. base form and sequence of POS tags
(e.g. the German word "gehe" ("(I) go") is represented as (gehe,
gehen-V-IND-PRES, gehen)). The experiments were done on the Verbmobil
corpus, because manually aligned data is available for this task. The
alignment error rate (AER) was reduced from 5.5% to 5.0% on the full
corpus of 34k training sentences, and for the small corpus containing
500 sentences the error rate went down from 15.9% to 14.8%.
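As a rough illustration of the hierarchical representation, the Python
sketch below stores each word at three levels (full form, base form with
POS tags, base form) and collects alignment counts at every level, so
that evidence is shared between inflections of the same base form. Only
the entry for "gehe" comes from the text; the other entries, the tag for
"ging" and the helper names are hypothetical, and the actual EM training
of the alignment models is not reproduced here.

```python
from collections import Counter
from typing import NamedTuple


class HierWord(NamedTuple):
    full: str      # full form, e.g. "gehe"
    base_pos: str  # base form plus POS tag sequence, e.g. "gehen-V-IND-PRES"
    base: str      # base form only, e.g. "gehen"


def levels(word):
    """Return the three hierarchy levels, from most to least specific."""
    return [word.full, word.base_pos, word.base]


def hierarchical_counts(aligned_pairs):
    """Count (level, key, target) events for aligned (HierWord, target) pairs."""
    counts = Counter()
    for word, target in aligned_pairs:
        for level, key in enumerate(levels(word)):
            counts[(level, key, target)] += 1
    return counts


# "gehe" follows the example in the text; the other two entries are assumed.
gehe = HierWord("gehe", "gehen-V-IND-PRES", "gehen")
gehst = HierWord("gehst", "gehen-V-IND-PRES", "gehen")
ging = HierWord("ging", "gehen-V-IND-PAST", "gehen")

counts = hierarchical_counts([(gehe, "go"), (gehst, "go"), (ging, "go")])
print(counts[(0, "gehe", "go")])              # evidence for the full form
print(counts[(1, "gehen-V-IND-PRES", "go")])  # shared by "gehe" and "gehst"
print(counts[(2, "gehen", "go")])             # shared by all three forms
```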
3. The use of morpho-syntactic information, mainly for statistical
translation from English into Spanish, has been investigated.
Translation from English into transformed Spanish, where the full forms
of the verbs are replaced with their base forms, is easier; error rates
are lower by 6-10% relative.
The following experiments were performed: first, translate English into
transformed Spanish, and then map the verb base forms to the correct
full forms. The latter can be done by finding the corresponding POS tag
for each verb base form. We applied the RWTH maximum-entropy POS
tagger. First experiments have been done with the base set of features
(e.g. lexical, suffix).
Results will be published shortly.
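The two-step procedure above can be sketched in Python as follows. The
tiny verb lexicon, the tag names and both function names are
hypothetical. In the experiments described, the POS tag for each base
form is predicted by a maximum-entropy tagger; here the tags are simply
supplied by hand to show how a (base form, tag) pair selects a full form.

```python
# Hypothetical toy lexicon: Spanish verb full form -> (base form, POS tag).
ANALYSIS = {
    "quiero": ("querer", "V-IND-PRES-1SG"),
    "quiere": ("querer", "V-IND-PRES-3SG"),
    "tiene": ("tener", "V-IND-PRES-3SG"),
}
# Inverse mapping used for generation: (base form, POS tag) -> full form.
GENERATE = {(base, tag): full for full, (base, tag) in ANALYSIS.items()}


def to_transformed(tokens):
    """Replace each known verb full form by its base form."""
    return [ANALYSIS[t][0] if t in ANALYSIS else t for t in tokens]


def restore(tokens, tags):
    """Map base forms back to full forms, given one tag (or None) per token."""
    return [GENERATE.get((tok, tag), tok) for tok, tag in zip(tokens, tags)]


sentence = ["quiero", "una", "habitación"]
transformed = to_transformed(sentence)
print(transformed)  # verbs appear as base forms in the transformed text
print(restore(transformed, ["V-IND-PRES-1SG", None, None]))
```

Tokens without a matching (base form, tag) entry are passed through
unchanged, so a wrong or missing tag leaves the base form in the output
rather than producing an incorrect inflection.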
A new version of the demonstrator platform has been developed. The
demonstrator platform, Gaia, is a telephone server which can offer
translation in the three research languages covering a tourist domain
defined in the project. The platform can be configured to be used
either for one channel (one person speaks in the source language and
the system provides the translation) or for two channels (two persons
speak through the platform, and the platform performs the translation).
All the software has been finished. The kernel of the platform
communicates with different types of servers:
- Terminal servers: they collect the input of the user and provide the
output. A telephone terminal for interaction by telephone and a speech
console terminal for sending speech through an IP connection have been
integrated, as well as a text console terminal which is mainly used to
test the translation engine.
- Speech Technology servers: from UPC, ASR, TTS and SST components have
been integrated. From RWTH, the statistical machine translation
component has been integrated. If feasible, the Spanish recognizer from
RWTH will also be integrated. For US-English synthesis the Festival TTS
component will be used.
- Additionally, debugging, configuration and visualization servers have
been developed.
The translation models are provided by RWTH. The latest version is
currently being developed, using the final version of the tri-lingual
aligned corpus.
- Acoustic models for ASR for Spanish and Catalan have been trained
using either the TALP-tourism corpus or a combination with the
SpeechDat databases.
For Spanish, RWTH are also training acoustic models using the
tri-lingual speech corpus.
For English the MACROPHONE corpus has been used.
- The language models for speech recognition are trained from the
TALP-tourism corpus, using both the source sentences and the translated
sentences. N-grams with classes defined for hotels, person names, etc.
have been used. Several trials were run to include data from Verbmobil
or tourist web pages, but there was no significant decrease in
perplexity.
- LC-STAR lexicons for Spanish, English and Catalan
have been integrated and are used for ASR and TTS components.
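The class-based N-grams and the perplexity measure mentioned in the
language model item above can be illustrated with a small Python sketch.
It is a simplification under stated assumptions: a bigram model with
add-one smoothing, a hand-made class map and toy sentences, with
perplexity computed on the class-mapped sequence only (word
probabilities within a class are ignored). None of the names or data
come from the project.

```python
import math
from collections import Counter

# Hypothetical class map: specific names are replaced by a class symbol.
CLASSES = {"Hilton": "<HOTEL>", "Ritz": "<HOTEL>", "Anna": "<PERSON>"}


def to_classes(tokens):
    """Map each token to its class symbol, if it has one."""
    return [CLASSES.get(t, t) for t in tokens]


def train_bigram(sentences):
    """Collect history and bigram counts over class-mapped sentences."""
    history, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        toks = ["<s>"] + to_classes(sentence) + ["</s>"]
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            history[a] += 1
            bigrams[(a, b)] += 1
    return history, bigrams, len(vocab)


def perplexity(model, sentences):
    """Perplexity of the add-one smoothed bigram model on the given sentences."""
    history, bigrams, vocab_size = model
    log_sum, n = 0.0, 0
    for sentence in sentences:
        toks = ["<s>"] + to_classes(sentence) + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            prob = (bigrams[(a, b)] + 1) / (history[a] + vocab_size)
            log_sum += math.log(prob)
            n += 1
    return math.exp(-log_sum / n)


train = [["a", "room", "at", "Hilton"]]
model = train_bigram(train)
# "Ritz" never occurs in the training data, but it shares the class <HOTEL>.
print(perplexity(model, [["a", "room", "at", "Ritz"]]))
```

Because both hotel names map to the same class symbol, the unseen
sentence is scored exactly like the training sentence, which is the
motivation for defining classes for open sets such as hotel and person
names.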
In order to increase translation performance, methods for improving the
speech recognition and translation models (RWTH) are being
investigated.
Overview of the demonstrator
Future work
Track I:
- Completion of validation of large lexica.
Track II:
- Validation of the translation lexica and tri-lingual corpus.
- Demonstrator testing.
- Finalizing dissemination and exploitation reports.
Outcome
Quasi-industrial standards for language resources have been
established. A common DTD for thirteen European and non-European
languages, which is easily extendible to new languages, has been
created for all lexica. A set of validation criteria for all language
resources has been developed which can be used by others. The project
is the first one in which lexica have been consistently validated to
such a large extent by external validation centers.
All specifications concerning linguistic contents of the lexica as
well as specifications of the validation criteria can be downloaded
from
the web page.
As prototypes:
- 13 lexicons for speech recognition and synthesis have been finished
and are validated.
- 9 lexicons suited for statistical machine translation have been
created.
- Tri-lingual aligned text corpora in three languages, i.e. Catalan,
Spanish and US-English will be made available.
Experimental results for speech centered translation approaches
concerning their requirements on language resources have been reported
and will be made publicly available on the web page.
The language transfer will be shown with a demonstrator translating
between Catalan, Spanish and US-English.
Dissemination and Awareness
In order to promote the project to the international community our
website is updated regularly (http://www.lc-star.com). It provides
information on the project in general (objectives,
milestones
and expected results) as well as a description of the consortium and
further details. Documents like specifications, technical reports,
research papers are publicly available via the website. Furthermore all
relevant major events, press releases and presentations of the
demonstrator as well as links to other projects and institutes can be
found at the site.
The following papers were presented:
H. Fersøe, E. Hartikainen, H. van den Heuvel, G. Maltese, A. Moreno, S. Shammass, U. Ziegenhain: Creation and Validation of Large Lexica for Speech-to-Speech Translation Purposes. In: Proc. of LREC 2004, Lisbon, Portugal, May 2004.
Maja Popovic and Hermann Ney: Towards the use of word stems and suffixes for SMT. In: Proc. of LREC 2004, Lisbon, Portugal, May 2004.
Victoria Arranz, Núria Castell, Josep Maria Crego, Jesús Giménez, Adrià de Gispert and Patrik Lambert: Bilingual Connections for Trilingual Corpora: An XML Approach. In: Proc. of LREC 2004, Lisbon, Portugal, 2004.
Victoria Arranz, Núria Castell and Jesús Giménez: Creació de recursos lingüístics per a la traducció automàtica. In: 2n Congrés d'Enginyeria en Llengua Catalana (CELC'04), Andorra, 2004.
Ute Ziegenhain, Asuncion Moreno, Nuria Castell: Creation of Lexica for Speech-Centered Translation. In: Proc. of the 10. ASR Workshop, Maribor, Slovenia, July 2004.
Maja Popovic and Hermann Ney: Improving Word Alignment Quality using Morpho-Syntactic Information. In: Proc. of CoLing 2004, Geneva, Switzerland, August 2004.
Richard Zens, Evgeny Matusov and Hermann Ney: Improved Word Alignment Using a Symmetric Lexicon Model. In: Proc. of CoLing 2004, Geneva, Switzerland, August 2004.
Evgeny Matusov, Maja Popovic, Richard Zens and Hermann Ney: Statistical Machine Translation of Spontaneous Speech with Scarce Resources. In: Proc. of IWSLT 2004, Kyoto, Japan, Sept./Oct. 2004.
Folkert de Vriend, Núria Castell, Jesús Giménez and Giulio Maltese: LC-STAR: XML-coded Phonetic Lexica and Bilingual Corpora for Speech-to-Speech Translation. In: Proc. of the Coling 2004 Papillon Workshop on Multilingual Lexical Databases, Grenoble, France, 2004.
Available documents can be downloaded from the webpage.
During the year a cooperation with ELRA on a new project called
'Unified Lexicon Approach' has been started. The aim of the project is
to join different levels of information (pronunciation, morphology,
syntax, semantics) from various lexica already available (e.g. Parole /
Simple lexica) in order to create ever larger lexical databases.
Exploitation Prospects
All lexica created within the project will be distributed via ELRA (http://www.elra.info/) no later than
18 months after the official end of the project.
All data thus will be made available to research institutes and
companies worldwide for further exploitation in research and commercial
applications. All documents describing linguistic and phonetic
specifications and validation criteria will be available from public
parts of the web page.