The objective of the LC-STAR project is to improve human-to-human and man-machine communication in multilingual environments. The project aims to create the lexica and corpora needed for transferring speech-to-speech translation (SST) components, i.e. flexible-vocabulary speech recognition, high-quality text-to-speech synthesis (TTS) and speech-centered translation, into selected languages. The SST components are intended for integration into speech-driven interfaces embedded in mobile appliances and network servers. On the one hand, LC-STAR will concentrate on the creation of language resources, i.e. pronunciation lexica with phonetic, prosodic and morpho-syntactic content, and of bilingual aligned text corpora. On the other hand, speech-to-speech translation technologies will be investigated with respect to their demands on language resources. The transfer will be shown by a demonstrator translating among three languages.
Summary of 2002 activities
Main activities
Development of 12 large pronunciation lexica. The languages covered are Catalan, Finnish, German, Greek, Hebrew, Italian, Mandarin Chinese, Russian, Spanish, Standard Arabic, Turkish and US-English. Each lexicon will consist of 50,000 common-word entries and 50,000 proper-name entries. Subtasks include the specification of corpora and wordlist extraction, and the specification of linguistic features for pronunciation lexica.
Goals of Track II: Development of the corpora and language resources needed for speech-to-speech translation, as well as of an experimental baseline speech-to-speech translation system. The languages covered are Catalan, Finnish, German, Hebrew, Italian, Russian and Spanish for the development of translation lexica, and Catalan, Spanish and US-English for the demonstration of speech-to-speech translation technology. Development of a demonstration system to show the language transfer.
Corpora and domains for common words and proper names were specified. A common approach to extracting wordlists from corpora was developed. The linguistic features required for speech recognition and synthesis were specified. Corpus collection and wordlist extraction were started for all languages. The main difficulty was to find common approaches that account for the language-specific features of all 12 languages: they range from non-inflected (Mandarin Chinese) through highly inflected (Russian, Finnish) to agglutinative (Turkish) languages, and a common approach to wordlist extraction and to the description of linguistic features had to be established for all of them.
For Catalan and Spanish, speech databases for a tourist domain were collected. The scenarios include travel agency, hotel, timetable information and tourist information office. 125 pairs of speakers were recorded, with several dialogues per recording, resulting in a total of 60 hours of speech material. The transcription is already finished and the translation into the three target languages has started. The US-English reference database (from Verbmobil (http://verbmobil.dfki.de/)) was translated into Catalan and Spanish.
Development of translation lexica was started using a subset of a US-English reference wordlist, which was extracted from the Verbmobil corpus. Words are balanced with respect to syntactic categories. Trigram information as well as semantic information from WordNet synsets (http://www.cogsci.princeton.edu/) is provided with the list to facilitate translation into the 7 target languages.
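The common wordlist extraction approach is not described in detail here. As a rough illustration only, the sketch below assumes a purely frequency-based selection: it counts word forms in a text corpus, keeps the most frequent ones up to a target size (50,000 common words per lexicon in the project), and reports what share of the running text the resulting list covers. The function names, the letter-based tokenizer and the toy sentences are hypothetical and are not taken from the LC-STAR specifications; in particular, this tokenizer would not segment Mandarin Chinese text and it ignores morphology entirely.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters. Real wordlist
# extraction would need language-specific tokenization.
TOKEN = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and split it into letter-only tokens."""
    return TOKEN.findall(text.lower())


def extract_wordlist(texts, size):
    """Return the `size` most frequent word forms.

    Ties in frequency are broken alphabetically so the result is stable.
    """
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]


def coverage(wordlist, texts):
    """Fraction of running tokens in `texts` that appear in `wordlist`."""
    vocab = set(wordlist)
    tokens = [tok for text in texts for tok in tokenize(text)]
    if not tokens:
        return 0.0
    return sum(tok in vocab for tok in tokens) / len(tokens)


# Toy corpus (hypothetical sentences, not project data).
corpus = ["El hotel está cerca", "el hotel es caro", "Hotel Barcelona"]
words = extract_wordlist(corpus, 2)
print(words)                    # ['hotel', 'el']
print(coverage(words, corpus))  # 0.5
```

For a highly inflected or agglutinative language, counting surface forms like this spreads one lexeme over many entries, which is one reason a shared extraction approach across all 12 languages is harder than the sketch suggests.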
Different approaches to speech-centered translation technology were investigated and reported. These experiments were carried out for all six language pairs of Spanish, Catalan and US-English. Two translation engines were used: a single-word based approach, namely the IBM4 model, and a phrase-based approach, namely alignment templates. Additional experiments were carried out to incorporate more linguistic information into the translation process by using a part-of-speech (POS) based language model. A baseline system was established. First experiments on the relation between language resources and translation quality were carried out.
A demonstrator for the system has been established. A distributed platform was designed and the kernel was developed. The translation platform will support Catalan, Spanish and US-English. The demonstrator will show the language transfer based on speech-to-speech translation components using the created language resources in a tourist domain. Multiple user interfaces (audio and text) are supported. The major components of the platform are shown in the following picture.
Figure: Overview of the demonstrator
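To give a feel for what a single-word based translation model learns from bilingual aligned text, here is a minimal sketch of IBM Model 1 trained with expectation-maximization. This is a deliberately simplified stand-in: the experiments reported above used the IBM4 model, which adds fertility and distortion components that are not shown, and the sketch also omits the empty (NULL) word. The function name and the three toy sentence pairs are hypothetical and are not project data.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate word translation probabilities with IBM Model 1 (EM).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict-like t where t[(s, w)] approximates P(s | w),
    the probability of source word s given target word w.
    """
    src_vocab = {s for src, _ in pairs for s in src}
    # Uniform initialisation over the source vocabulary.
    t = defaultdict(lambda: 1.0 / len(src_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (s, w)
        total = defaultdict(float)  # expected counts for w

        # E-step: distribute each source word over the target words
        # of the same sentence pair, proportionally to the current t.
        for src, tgt in pairs:
            for s in src:
                norm = sum(t[(s, w)] for w in tgt)
                for w in tgt:
                    frac = t[(s, w)] / norm
                    count[(s, w)] += frac
                    total[w] += frac

        # M-step: renormalise the expected counts per target word.
        t = defaultdict(
            float, {(s, w): c / total[w] for (s, w), c in count.items()}
        )
    return t


# Toy parallel corpus (hypothetical).
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
t = train_ibm1(pairs)
# "das" should come out as a clearly more likely counterpart of "the"
# than "haus" does.
print(t[("das", "the")], t[("haus", "the")])
```

Because the model is estimated only from word co-occurrence in aligned sentence pairs, the amount and domain of the bilingual corpus directly limit what it can learn, which is the kind of dependency on language resources that the experiments set out to measure.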
Future work
- Creation and validation of large lexicons suited for speech recognition and synthesis
Track II:
- Specification and creation of translation lexicons in 6 languages
- Development of ASR and TTS components for the demonstrator
- Continuation of translation experiments
- Incorporation of additional corpora and additional linguistic information
- Experiments with new language models
- 12 lexicons for speech recognition and synthesis will be created
- Text corpora and databases for three languages to demonstrate transfer will be created
- 6 lexicons suited for translation will be created
- Experimental results for speech-centered translation approaches concerning their requirements on language resources
- Language transfer will be shown with a demonstrator translating between Catalan, Spanish and US-English
Dissemination
In order to promote the project to the international community, a website was established (http://www.lc-star.com) which covers objectives, milestones and expected results as well as a description of the consortium and further details. Documents such as specifications, technical reports and research papers are publicly available via the website. Furthermore, all relevant major events, press releases and presentations of the demonstrator, as well as links to other projects and institutes, can be found on the site.
The project was presented:
- at the LangTech Forum in Berlin, 26-27 September (http://www.lang-tech.org/)
- at the FP6 Launch - European Research in Brussels, 11-13 November (http://www.hltcentral.org/)
- at the "Informatiktage 2002" of the "Gesellschaft für Informatik e.V.", Bad Schussenried, Germany, November 2002.
Exploitation Prospects
All lexica created within the project will be distributed via ELRA (http://www.elra.info/) no later than 18 months after the official end of the project. All data will be made available to research institutes and companies worldwide for further exploitation in research and commercial applications.