Lexica and Corpora for Speech-to-Speech Translation Components     

 



http://www.lc-star.com

LC-STAR Annual Report 2004

The objective of the LC-STAR project is to improve human-to-human and man-machine communication in multilingual environments. The project aims to create the lexica and corpora needed for transferring speech-to-speech translation (SST) components, i.e. flexible-vocabulary speech recognition, high-quality text-to-speech synthesis (TTS) and speech-centered translation, into selected languages. SST components are targeted for integration into speech-driven interfaces embedded in mobile appliances and network servers. On the one hand, LC-STAR concentrates on the creation of language resources, i.e. pronunciation lexica with phonetic, prosodic and morpho-syntactic content, and on the creation of bilingual aligned text corpora. Furthermore, all language resources will be validated by external validation centers.
On the other hand, speech-to-speech translation technologies will be investigated with respect to their demands on language resources. The transfer will be shown by a demonstrator translating between three languages.


Summary of 2004 activities

The main activities proceeded in two parallel tracks.


Progress in Track I
During the year all deliverables and nearly all lexica have been finalized. The following lexica for ASR and TTS were submitted for full validation to external validation centers, namely SPEX, Netherlands and CST, Denmark: Catalan, Finnish, German, Hebrew, Italian, Mandarin Chinese, Russian, Spanish, Turkish, US-English.
A conference paper describing validation issues was presented at LREC 2004 in Lisbon.

Progress in Track II
The development and revision of a tri-lingual aligned corpus in Catalan, Spanish and US-English for the demonstration of speech-to-speech translation technology has been completed. In addition, a reference corpus of 10,000 phrases in US-English covering tourist domains has been created for the development of lexica for statistical machine translation and for ASR and TTS components. The translation of the reference corpus into Catalan, Finnish, German, Hebrew, Italian, Russian, Slovenian and Spanish has been finished. A common DTD for these lexica has been developed. The lexica will be submitted for validation to an external validation center, namely CST, Denmark.

To demonstrate the suitability of different structured language resources, various experiments have been performed. It could be shown that the use of part-of-speech tags significantly improved the performance of single-word based models. Furthermore, methods for the integration of morpho-syntactic information have been investigated. The results have been published at various conferences. In addition, methods for the automatic evaluation of translation have been investigated and developed.
The demonstrator platform 'Gaia' has been finished. It is a telephone server which can offer translation in the three research languages covering a tourist domain defined in the project. The following engines have been included in the demonstrator platform: a speech recognizer, speech synthesis systems and statistical translation engines.

The University of Maribor joined the project both for Track I and Track II as external partner.


Progress of work

Track I :
A detailed description of the corpus collection and word list extraction requirements can be found on the web page (deliverable D1.1). The linguistic specifications include phonetic and linguistic information also for non-European languages, among them Turkish, Hebrew and Standard Arabic, for which descriptions were until now rare or did not exist at all (cf. deliverable D2.1.-4).
A common linguistic DTD (document type definition) for all languages has been finished, and language-specific DTDs were specified as subsets of the common one. The descriptions can be found in the public part of the web page.
The development of the lexica has been completed for nearly all languages. A subset of all lexica has been pre-validated. The pre-validation checks are performed to detect major production errors before laborious (but possibly useless) effort is spent on lexicon creation. Pre-validation results indicated that all lexica are of high quality. During the year nearly all lexica were subsequently submitted for full validation. The validation checks carried out can be divided into two groups: formal and manual checking of the input. A more precise description of the criteria can be found in deliverable D6.1, which is publicly available on the web page.

Track II:
The tri-lingual corpus has been finished: approximately 800 dialogues were selected from transcriptions of Spanish and Catalan speech recordings in a tourist domain (TALP tourism) and from public parts of the Verbmobil corpus. All dialogues have been translated for the core language pairs Spanish/Catalan, Catalan/Spanish, Spanish/US-English, Catalan/US-English, US-English/Spanish and US-English/Catalan. The whole reference corpus covers scenarios such as requests to travel agencies and hotels, timetable information and the typical information asked for in tourist offices. Furthermore, the validation specifications for the tri-lingual corpus have been created and will be published on the web page. The corpus will be submitted for validation to an external validation center.

Additionally, a corpus of 10,000 phrases in US-English has been collected from various sources such as web pages and tourist guides; it will be the basis for the translation lexica and for the ASR and TTS lexica for the tourist domain. The translation into the target languages (Catalan, Finnish, German, Hebrew, Italian, Russian, Slovenian and Spanish) has been finished. A common DTD for the lexica has been provided, and final work on completing these lexica is ongoing. Validation criteria have been developed and are available on the web page. The lexica will be submitted for validation to CST, Denmark.

For statistical machine translation, experiments were performed using parts of the tri-lingual aligned corpora, mono-lingual lexica with enriched morphological information in Spanish and Catalan (additional databases) as well as first versions of phrasal translation lexica. One major focus has been to integrate all available resources for experiments and tests. A summary of the experiments is given below.
1. Word Alignment using Lexicon Smoothing:
The standard lexicon model is based on full form words only. For highly inflected languages such as German this might cause problems, because many full form words occur only a few times in the training corpus. Compared to English, the token/type ratio for German is usually much lower (e.g. on the Verbmobil corpus: English 99.4, German 56.3). The information that multiple full form words share the same base forms is not used in the lexicon model. To take this information into account, the lexicon model is smoothed with a backing-off lexicon that is based on word base forms. The smoothing method that was applied is absolute discounting with interpolation.
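The smoothing idea described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the function name, the data layout, the discount value and the choice to back off on the base form of the conditioning (inflected) word are all assumptions made for illustration. The sketch uses one common form of absolute discounting with interpolation, in which a constant d is subtracted from every observed count and the freed probability mass is redistributed according to a distribution estimated on base forms.

```python
from collections import defaultdict


def smoothed_lexicon(pair_counts, base_form, d=0.5):
    """Return a function prob(t, s) giving a smoothed p(t | s).

    pair_counts: dict mapping (s, t) -> count N(s, t), where s is a full-form
                 word of the inflected language and t a word of the other
                 language (hypothetical data layout).
    base_form:   dict mapping each full form s to its base form (assumed given).
    d:           discount, 0 < d <= 1 (illustrative value, not from the report).
    """
    n_s = defaultdict(float)    # N(s): total count of full form s
    freed = defaultdict(float)  # probability mass freed by discounting, per s
    n_bt = defaultdict(float)   # N(base(s), t): counts pooled over base forms
    n_b = defaultdict(float)    # N(base(s))
    for (s, t), c in pair_counts.items():
        b = base_form.get(s, s)
        n_s[s] += c
        freed[s] += min(d, c)
        n_bt[(b, t)] += c
        n_b[b] += c

    def prob(t, s):
        b = base_form.get(s, s)
        # Backing-off distribution estimated on base forms.
        backoff = n_bt.get((b, t), 0.0) / n_b[b] if n_b.get(b) else 0.0
        if not n_s.get(s):
            return backoff  # unseen full form: rely on the base form only
        direct = max(pair_counts.get((s, t), 0.0) - d, 0.0) / n_s[s]
        return direct + (freed[s] / n_s[s]) * backoff

    return prob
```

With this form, a rare full form such as a genitive that was seen only once still receives a sensible distribution, because most of its probability mass comes from the counts pooled over all forms sharing its base form.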
This smoothing method was tested on the German-English Verbmobil corpus, because manually aligned data is available for this task. It results in an improvement of the alignment error rate (AER) for both translation directions. For the source-to-target direction the AER improves from 5.7% to 5.2% and for the target-to-source direction from 9.9% to 9.1%. These improvements of the AER are statistically significant at the 95% level.
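The report does not state how the AER was computed. A minimal sketch, assuming the widely used definition based on a manual reference with sure links S and possible links P (where S is contained in P), AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), could look as follows. The function name and the representation of links as (source position, target position) pairs are illustrative assumptions.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """Compute AER for one set of hypothesized alignment links.

    All arguments are iterables of (source_index, target_index) pairs.
    Sure links are treated as possible links as well.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s
    if not a and not s:
        return 0.0  # nothing to align and nothing hypothesized
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

A hypothesis that reproduces exactly the sure links yields an AER of 0; links outside the possible set and missing sure links both increase the error.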
2. Word Alignment using Morpho-Syntactic Information:
Existing translation systems usually treat different derivations of the same base form as if they were independent of each other. A method was proposed which takes these interdependencies into account during the EM training of the statistical alignment models. A hierarchical representation of the lexicon model is used. It contains additional information for each word, i.e. the base form and the sequence of POS tags (e.g. the German word "gehe" (English: "I go") is represented as (gehe, gehen-V-IND-PRES, gehen)). The experiments were done on the Verbmobil corpus, because manually aligned data is available for this task. The alignment error rate (AER) was reduced from 5.5% to 5.0% on the full corpus of 34k training sentences; for the small corpus containing 500 sentences, the error rate went down from 15.9% to 14.8%.
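The three-level representation in the example above can be sketched in Python as follows. This is only an illustration of the data structure: the function names are hypothetical, the tag inventory is taken from the single example in the text, and the EM training itself is not shown. The second function indicates why the hierarchy helps, since counts collected at the more general levels are shared by all full forms of the same base form.

```python
from collections import Counter


def hierarchy(full_form, base_form, pos_tags):
    """Return (full form, base form plus POS tags, base form)."""
    tagged = "-".join([base_form, *pos_tags])
    return (full_form, tagged, base_form)


def level_counts(entries):
    """Count occurrences at each of the three hierarchy levels.

    entries: iterable of (full_form, base_form, pos_tags) triples.
    Returns a list of three Counters, from most specific to most general.
    """
    counts = [Counter(), Counter(), Counter()]
    for full_form, base_form, pos_tags in entries:
        for level, key in enumerate(hierarchy(full_form, base_form, pos_tags)):
            counts[level][key] += 1
    return counts
```

For instance, hierarchy("gehe", "gehen", ["V", "IND", "PRES"]) gives the triple shown in the text, and a corpus containing several different forms of "gehen" accumulates all of them under the base form at the most general level.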
3. The use of morpho-syntactic information, mainly for statistical translation from English into Spanish, has been investigated. Translation from English into transformed Spanish, where the full forms of the verbs are replaced with their base forms, is easier; error rates are lower by 6-10% relative.
The following two-step procedure was investigated: first, English is translated into transformed Spanish; then the verb base forms are mapped to the correct full forms. The latter can be done by finding the corresponding POS tag for each verb base form. We applied the RWTH maximum-entropy POS tagger. First experiments have been done with the base set of features (e.g. lexical, suffix).
Results will be published soon.
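The second step of this procedure, mapping verb base forms back to full forms once a POS tag has been predicted, can be sketched as a simple table lookup. Everything in the sketch is hypothetical: the tag names, the tiny table and the function name are invented for illustration, and the tagger that predicts the tags is assumed to exist and is not shown.

```python
# Hypothetical (base form, POS tag) -> full form table; a real system would
# derive it from a morphologically annotated lexicon.
FULL_FORMS = {
    ("querer", "V-IND-PRES-1S"): "quiero",
    ("tener", "V-IND-PRES-3S"): "tiene",
}


def restore_full_forms(tokens, predicted_tags, table=FULL_FORMS):
    """Replace verb base forms by full forms according to predicted tags.

    tokens:         output of the translation into transformed Spanish.
    predicted_tags: one tag per token (None for tokens that are not verbs).
    Tokens without a matching table entry are left unchanged.
    """
    if len(tokens) != len(predicted_tags):
        raise ValueError("tokens and predicted_tags must have the same length")
    return [table.get((tok, tag), tok) for tok, tag in zip(tokens, predicted_tags)]
```

The quality of the final output then depends on both steps: the translation into base forms and the accuracy of the predicted tags.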

A new version of the demonstrator platform has been developed. The demonstrator platform, Gaia, is a telephone server which can offer translation in the three research languages covering a tourist domain defined in the project. The platform can be configured for one channel (one person speaks in the source language and the system provides the translation) or for two channels (two persons speak through the platform, and the platform performs the translation). All the software has been finished. The kernel of the platform communicates with different types of servers:
- Terminal servers: they collect the input of the user and provide the output. A telephone terminal for interaction by telephone and a speech console terminal for sending speech through an IP connection have been integrated, as well as a text console terminal which is mainly used to test the translation engine.
- Speech Technology servers: ASR, TTS and SST components from UPC have been integrated. From RWTH, the statistical machine translation component has been integrated. If feasible, the Spanish recognizer from RWTH will also be integrated. For US-English synthesis, the Festival TTS component will be used.
- Additionally, debugging, configuration and visualization servers have been developed.
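The interplay between the kernel and the speech technology servers in the one-channel and two-channel configurations can be pictured with the following Python sketch. It does not describe the actual Gaia software: the class, the function names and the call signatures are hypothetical, and the engines are passed in as plain functions so that any recognizer, translation engine or synthesizer could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Engines:
    """Hypothetical bundle of the three engine types named in the report."""
    recognize: Callable[[bytes, str], str]      # (audio, language) -> text
    translate: Callable[[str, str, str], str]   # (text, source, target) -> text
    synthesize: Callable[[str, str], bytes]     # (text, language) -> audio


def translate_turn(engines, audio, source, target):
    """One-channel use: recognize, translate, then synthesize one utterance."""
    text = engines.recognize(audio, source)
    translated = engines.translate(text, source, target)
    return engines.synthesize(translated, target)


def two_channel_session(engines, turns, lang_a, lang_b):
    """Two-channel use: speakers A and B talk in different languages.

    turns: iterable of (speaker, audio) with speaker equal to "A" or "B".
    Yields (speaker, translated audio) for each turn.
    """
    for speaker, audio in turns:
        if speaker == "A":
            source, target = lang_a, lang_b
        elif speaker == "B":
            source, target = lang_b, lang_a
        else:
            raise ValueError(f"unknown speaker: {speaker!r}")
        yield speaker, translate_turn(engines, audio, source, target)
```

In this picture the two-channel configuration is simply the one-channel pipeline applied in alternating directions, which is why the same engines can serve both modes.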
The translation models are provided by RWTH. The latest version is currently being developed using the final version of the tri-lingual aligned corpus.
- Acoustic models for ASR for Spanish and Catalan have been trained using either the TALP-tourism corpus or a combination with the SpeechDat databases. For Spanish, RWTH is also training acoustic models using the tri-lingual speech corpus. For English, the MACROPHONE corpus has been used.
- The language models for speech recognition are trained on the TALP-tourism corpus, using both the source sentences and the translated sentences. N-grams with classes defined for hotels, person names, etc. have been used. Several trials were made to include corpora from Verbmobil or tourist web pages, but there was no significant decrease in perplexity.
- LC-STAR lexicons for Spanish, English and Catalan have been integrated and are used for the ASR and TTS components.
In order to increase translation performance, methods for improving the speech recognition and translation models (RWTH) are being investigated.
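The language modeling approach mentioned above, n-grams over word classes evaluated by perplexity, can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the models used in the demonstrator: it uses bigrams with add-one smoothing, an "<unk>" token for unknown words, and it scores the class-mapped token sequence only, omitting the probability of a word given its class, so word-level and class-level perplexities are not strictly comparable. All names and the class table are hypothetical.

```python
import math
from collections import Counter


def map_classes(tokens, classes):
    """Replace words that belong to a class (e.g. hotel names) by the class label."""
    return [classes.get(tok, tok) for tok in tokens]


def train_bigram(sentences):
    """Collect context and bigram counts; return (contexts, bigrams, vocabulary)."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<unk>"}
    for sent in sentences:
        padded = ["<s>", *sent, "</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, vocab


def perplexity(sentences, model):
    """Perplexity of the sentences under an add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    log_sum, n = 0.0, 0
    for sent in sentences:
        padded = ["<s>", *sent, "</s>"]
        padded = [tok if tok in vocab else "<unk>" for tok in padded]
        for prev, cur in zip(padded, padded[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
            log_sum += math.log(p)
            n += 1
    return math.exp(-log_sum / n)
```

Mapping names to classes lets a sentence with an unseen hotel name reuse the statistics of seen hotel names, which is the motivation for defining such classes in a small domain corpus.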


[Figure: Overview of the demonstrator]





Future work

Track I:
- Completion of validation of large lexica.
Track II:
- Validation of the translation lexica and tri-lingual corpus
- Demonstrator testing.
- Finalizing dissemination and exploitation  reports.


Outcome

Quasi-industrial standards for language resources have been established. A common DTD for thirteen European and non-European languages, which is easily extendible to new languages, has been created for all lexica. A set of validation criteria for all language resources has been developed which can be used by others. The project is the first in which lexica have been consistently validated to such a large extent by external validation centers.
All specifications concerning linguistic contents of the lexica as well as specifications of the validation criteria can be downloaded from the web page.
As prototypes:
- 13 lexicons for speech recognition and synthesis have been finished and are validated.
- 9 lexicons suited for statistical machine translation have been created.
- Tri-lingual aligned text corpora in Catalan, Spanish and US-English will be made available.
Experimental results for speech centered translation approaches concerning their requirements on language resources have been reported and will be made publicly available on the web page.
The language transfer will be shown with a demonstrator translating between Catalan, Spanish and US-English.


Dissemination and Awareness

In order to promote the project to the international community, our website (http://www.lc-star.com) is updated regularly. It provides information on the project in general (objectives, milestones and expected results) as well as a description of the consortium and further details. Documents such as specifications, technical reports and research papers are publicly available via the website. Furthermore, all relevant major events, press releases and presentations of the demonstrator as well as links to other projects and institutes can be found on the site.
The following papers were presented:
H. Fersøe, E. Hartikainen, H. van den Heuvel, G. Maltese, A. Moreno, S. Shammass, U. Ziegenhain: Creation and Validation of Large Lexica for Speech-to-Speech Translation Purposes. In: Proc. of LREC2004, Lisbon, Portugal, May 2004.
Maja Popovic and Hermann Ney: Towards the use of word stems and suffixes for SMT. In: Proc. of LREC2004, Lisbon, Portugal, May 2004.
Victoria Arranz, Núria Castell, Josep Maria Crego, Jesús Giménez, Adrià de Gispert and Patrik Lambert: Bilingual Connections for Trilingual Corpora: An XML Approach. In: Proc. of LREC 2004, Lisbon, Portugal, 2004.
Victoria Arranz, Núria Castell and Jesús Giménez: Creació de recursos lingüístics per a la traducció automàtica. In: 2n Congrés d'Enginyeria en Llengua Catalana (CELC'04), Andorra, 2004.
Ute Ziegenhain, Asuncion Moreno, Nuria Castell: Creation of Lexica for Speech-Centered Translation. In: Proc. of the 10. ASR Workshop, Maribor, Slovenia, July 2004.
Maja Popovic and Hermann Ney: Improving Word Alignment Quality using  Morpho-Syntactic Information. In: Proc. of CoLing 2004, Geneva, Switzerland, August 2004.
Richard Zens, Evgeny Matusov and Hermann Ney: Improved Word Alignment Using a Symmetric Lexicon Model. In: Proc. of CoLing 2004, Geneva, Switzerland, August 2004.
Evgeny Matusov, Maja Popovic, Richard Zens and Hermann Ney: Statistical Machine Translation of Spontaneous Speech with Scarce Resources. In: Proc. of IWSLT 2004, Kyoto, Japan, Sept./Oct. 2004.
Folkert de Vriend, Núria Castell, Jesús Giménez and Giulio Maltese: LC-STAR: XML-coded Phonetic Lexica and Bilingual Corpora for Speech-to-Speech Translation. In: Proc. of the Coling 2004 Papillon Workshop on Multilingual Lexical Databases, Grenoble, France, 2004.
Available documents can be downloaded from the webpage.
During the year, a cooperation with ELRA on a new project called 'Unified Lexicon Approach' was started. The aim of the project is to combine different levels of information (pronunciation, morphology, syntax, semantics) from various lexica already available (e.g. Parole / Simple lexica) in order to create ever larger lexical databases.


Exploitation Prospects

All lexica created within the project will be distributed via ELRA (http://www.elra.info/) no later than 18 months after the official end of the project. All data will thus be made available to research institutes and companies worldwide for further exploitation in research and commercial applications. All documents describing linguistic and phonetic specifications and validation criteria will be available from the public parts of the web page.