Lexica and Corpora for Speech-to-Speech Translation Components     

 



http://www.lc-star.com

LC-STAR Annual Report 2003



The objective of the LC-STAR project is to improve human-to-human and man-machine communication in multilingual environments. The project aims to create the lexica and corpora needed to transfer speech-to-speech translation (SST) components, i.e. flexible-vocabulary speech recognition, high-quality text-to-speech synthesis (TTS) and speech-centered translation, to selected languages. The SST components are intended for integration into speech-driven interfaces embedded in mobile appliances and network servers. On the one hand, LC-STAR concentrates on the creation of language resources, i.e. pronunciation lexica with phonetic, prosodic and morpho-syntactic content, and bilingual aligned text corpora. All language resources will be validated by external validation centers.
On the other hand, speech-to-speech translation technologies will be investigated with respect to their demands on language resources. The transfer will be shown by a demonstrator translating between three languages.

 

Summary of 2003 activities

Main activities

Progress in Track I:
Major achievements were the development of a detailed common DTD for all languages and the specification of validation criteria for wordlists and lexica, which will be made publicly available on the homepage. The corpora collection for all languages in seven major semantic domains was finished. Large wordlists for all lexica have been extracted and cleaned. To ensure high quality, the wordlists have been validated by two external validation centers. For each language, a subset of the final lexicon has been pre-validated. The final work on lexicon production for ASR and TTS components is ongoing.
Progress in Track II:
The development of a trilingual aligned corpus in Catalan, Spanish and US-English for the demonstration of speech-to-speech translation technology has been nearly completed. A reference corpus of 10,000 phrases in US-English covering tourist domains has been created for the development of resources suitable for statistical machine translation and for ASR and TTS components. The translation into Catalan, Finnish, German, Hebrew, Italian, Russian and Spanish has been started.
To demonstrate the suitability of differently structured language resources, various experiments have been performed. It was shown that the use of part-of-speech tags significantly improved the performance of single-word-based models. Furthermore, a method for the integration of morpho-syntactic information has been investigated. The results will be published at LREC 2004. In addition, methods for the automatic evaluation of translations have been investigated and developed.
The following engines have been included in the demonstrator platform: a speech recogniser, speech synthesis systems and a statistical translation engine.
One additional partner joined the project for both Track I and Track II.


Progress of work

Track I:
The collection of the corpora in seven major semantic domains has been completed. Very large corpora of varying sizes have been collected for all languages. Because of the coverage requirements, the size of the wordlists also varies considerably between languages: for highly inflecting languages such as Finnish, 140,000 entries had to be extracted, whereas for Mandarin Chinese and US-English 30,000 entries covered nearly 100% of the corpora. A detailed description of the corpora collection and wordlist extraction requirements can be found on the webpage. A common linguistic DTD (document type definition) for all languages has been finished. It also includes phonetic and linguistic information for non-European languages, among them Turkish, Hebrew and Standard Arabic, for which descriptions were until now rare or did not exist at all.
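The relation between wordlist size and corpus coverage described above can be illustrated with a small sketch. It assumes a corpus that is already tokenized into word forms and measures coverage as the share of running tokens accounted for by the most frequent word forms; the function names and the toy data are hypothetical, and the project's actual extraction requirements are those described on the webpage, not this code.

```python
from collections import Counter


def coverage_curve(tokens, sizes):
    """Map each wordlist size to the fraction of corpus tokens it covers.

    A wordlist of size n is assumed to contain the n most frequent word forms.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    # Token frequencies ordered from most to least frequent word form.
    freqs = [count for _, count in counts.most_common()]
    return {
        size: (sum(freqs[:size]) / total if total else 0.0)
        for size in sizes
    }


def entries_needed(tokens, target):
    """Smallest number of most-frequent word forms reaching `target` coverage."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return n
    return len(counts)


if __name__ == "__main__":
    # Toy corpus (hypothetical); a real corpus would hold millions of tokens.
    corpus = "the cat sat on the mat the end".split()
    print(coverage_curve(corpus, [1, 3]))
    print(entries_needed(corpus, 0.5))
```

On real data, a highly inflecting language spreads its tokens over many more distinct word forms, so `entries_needed` grows much faster with the coverage target than for a language with little inflection.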
The wordlists have been validated by two validation centers, namely SPEX and CST. Production of the large lexica is ongoing. A subset of the large lexica has been pre-validated for each language. These validation checks are intended to detect major production errors before laborious (and possibly wasted) effort is invested in lexicon creation. The checks fall into two groups: formal and manual checking of the input. A more precise description can be found in D6.1, which will be made publicly available on the webpage.


Track II:
The transcriptions of the speech database in Spanish and Catalan have been finished. All corpora are now translated into the core languages Spanish, Catalan and US-English. By combining all databases, a US-English reference text corpus (about 500K) has been extracted, covering scenarios such as requests to a travel agency, hotel, timetable information and tourist information office. Additionally, a corpus of 10,000 phrases in US-English has been collected, which will be the basis for the translation lexica and the lexica for ASR and TTS for the tourist domain. The translation process into Catalan, Finnish, German, Hebrew, Italian, Russian, Slovenian and Spanish has been started.
For statistical machine translation, the use of morpho-syntactic information was investigated, mainly for translation from Spanish and Catalan into English. Spanish and Catalan have a richer inflectional morphology than English, which poses problems for the translation engine. Methods for handling this have been investigated and will be published at LREC 2004 (Popovic, Ney "Towards the use of word stems and suffixes for SMT", to be published in Proc. LREC 2004).
For the translation from English into Spanish or Catalan, RWTH analysed the use of POS information (Ueffing, Ney "Using POS Information for Statistical Machine Translation into Morphologically Rich Languages", Proc. EACL03). This method significantly improves the performance of single-word-based models.
Furthermore, a new string-to-string distance measure has been developed which can be used as an evaluation criterion in machine translation (Leusch, Ueffing, Ney "A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation", Proc. MT Summit IX). Since cheap and reliable evaluation of machine translation quality is still an open problem, the correlation between different automatic evaluation criteria and human judgment was systematically investigated.
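As a point of reference for this kind of evaluation, the sketch below computes the standard word error rate (word-level Levenshtein distance normalized by reference length) and the Pearson correlation between automatic scores and human judgments. This is explicitly not the novel distance measure of the cited paper; it is the conventional baseline, and all function names, sentences and scores are hypothetical illustrations.

```python
import math


def word_edit_distance(hyp, ref):
    """Levenshtein distance between two token lists (unit costs)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a missing reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def word_error_rate(hyp, ref):
    """Edit distance divided by the number of reference words."""
    ref_tokens = ref.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp.split(), ref_tokens) / len(ref_tokens)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical system output and reference translation.
    print(word_error_rate("where is the hotel", "where is the nearest hotel"))
    # Hypothetical per-sentence error rates and human adequacy scores.
    print(pearson([0.1, 0.4, 0.8], [5.0, 3.0, 1.0]))
```

Because error rates fall as quality rises, a useful automatic criterion of this kind should correlate strongly and negatively with human quality scores.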
A new version of the demonstrator platform has been developed. It supports not only speech-to-speech translation between two users but also speech acquisition and speech translation with a single user. This latter functionality has been added so that the translation capabilities can easily be demonstrated by one person. Furthermore, a graphical applet has been developed for monitoring the platform and the engines from any Internet browser.
The following engines have been integrated: UPC Speech Recognition, UPC Speech Synthesis, Festival Speech Synthesis, RWTH Translation.
Some experiments have been carried out to adapt the Spanish acoustic models to the tourist domain using the speech corpus from UPC. The Spanish and Catalan language models for speech recognition have been estimated from the transcriptions.

Overview of the demonstrator 





Future work

Track I:
- Production of large lexicons suited for speech recognition and synthesis
- Validation of large lexica
Track II:
- Specifications on content of translation lexica suited for statistical machine translation
- Specifications of validation criteria for translation lexica
- Creation of lexicons for ASR and TTS components in a tourist domain
- Creation of translation lexica suited for statistical machine translation
- Continuation of translation experiments
- Demonstrator testing


Outcome

- Quasi-industrial standards for language resources will be established
- 13 lexicons for speech recognition and synthesis will be created
- Text corpora and databases for three languages to demonstrate transfer are created
- 9 lexicons suited for statistical machine translation will be created
- Experimental results for speech centered translation approaches concerning their requirements on language resources
- Language transfer will be shown with a demonstrator translating between Catalan, Spanish and US-English


Dissemination and Awareness


In order to promote the project to the international community, our website (http://www.lc-star.com) is updated regularly. It provides information on the project in general (objectives, milestones and expected results) as well as a description of the consortium and further details. Documents such as specifications, technical reports and research papers are publicly available via the website. Furthermore, all relevant major events, press releases and presentations of the demonstrator, as well as links to other projects and institutes, can be found on the site. A new version of the leaflet was created and distributed at the Eurospeech Conference in Geneva in September, at SEPLN, Madrid 2003, and at RANLP, Borovets, Bulgaria.
The following papers were presented:
Ueffing N., Ney H. (2003): Using POS Information for Statistical Machine Translation into Morphologically Rich Languages. In: Proc. of EACL, Budapest, Hungary, p. 347-354.
Ueffing N., Macherey K., Ney H. (2003): Confidence Measures for Statistical Machine Translation. In: Proc. MT Summit IX, New Orleans, LA, September 2003.
Leusch G., Ueffing N., Ney H. (2003): A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation. In: Proc. MT Summit IX, New Orleans, LA, September 2003.
Hartikainen E., Maltese G., Moreno A., Shammass Sh., Ziegenhain U. (2003): Large Lexica for Speech-to-Speech Translation: From Specification to Creation. In: Proc. of Eurospeech, Geneva, p. 1529-1532.
Conejero D. et al. (2003): Lexica and Corpora for Speech-to-Speech Translation: A Trilingual Approach. In: Proc. of Eurospeech, Geneva, 2003, p. 1593-1596.
Bisani M., Bonafonte A., Castell N., Hartikainen E., Maltese G., Moreno A., Shammass Sh., Ziegenhain U. (2003): Lexica and Corpora for Speech-To-Speech Translation (LC-STAR). In: Proc. of SEPLN, Madrid, September, 2003.
Available documents can be downloaded from the webpage.



Exploitation Prospects

All lexica created within the project will be distributed via ELRA (http://www.elra.info/) no later than 18 months after the official end of the project. All data will thus be made available to research institutes and companies worldwide for further exploitation in research and commercial applications. The specifications will be made publicly available on the webpage.