An Integration of Machine Translation and Speech Synthesis




Abstract

This paper presents the integration of machine translation and speech synthesis for converting English text to Tamil speech within an English-to-Tamil speech-to-speech translation (S2ST) system. An S2ST system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques for integrating speech recognition and machine translation have been proposed, but the speech synthesis component has received far less attention. In this paper, we focus on the integration of machine translation and speech synthesis, and report a subjective evaluation analyzing the impact of the speech synthesis component alongside the BLEU score of the machine translation component. We implement a rule-based machine translation system and a syllable-based concatenative speech synthesis technique. The results of this investigation demonstrate that the naturalness and intelligibility of the synthesized speech are strongly influenced by the fluency and correctness of the translated text.

Index terms— concatenative speech synthesis, Rule Based Machine Translation (RBMT), Speech to Speech Translation (S2ST), subjective evaluation.

1. Introduction

The goal of speech-to-speech translation research is to facilitate real-time, interpersonal communication via natural spoken language for people who do not share a common language. Speech translation (ST) is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language. This differs from phrase translation, in which the system only translates a fixed, finite set of phrases that have been manually entered into it. Speech translation technology enables speakers of different languages to communicate, and is therefore of tremendous value for humankind in terms of science, cross-cultural exchange and global business. Today, speech translation systems are used throughout the world, for example in medical facilities, schools, police departments, hotels, retail stores and factories; they are applicable anywhere spoken language is used to communicate. Speech translation technology is now available as products that instantly translate free-form, multi-lingual conversations in continuous speech. Challenges in accomplishing this include speaker-dependent variations in speaking style and pronunciation, which must be handled in order to provide high-quality translation for all users. Moreover, in real-world use, speech recognition systems must cope with external factors such as acoustic noise or speech from other speakers. Because the user does not understand the target language when speech translation is used, a method "must be provided for the user to check whether the translation is correct, by such means as translating it again back into the user's language" [1].

In order to achieve the goal of erasing the language barrier worldwide, multiple languages have to be supported. This requires speech corpora, bilingual corpora and text corpora for each of the estimated 6,000 languages said to exist on our planet today. As the collection of corpora is extremely expensive, collecting data from the Web would be an alternative to conventional methods. "Secondary use of news or other media published in multiple languages would be an effective way to improve performance of speech translation." However, "current copyright law does not take secondary uses such as these types of corpora into account" and thus "it will be necessary to revise it so that it is more flexible."

A speech-to-speech translation system comprises three components: speech recognition, machine translation and speech synthesis. In the simplest S2ST system, only the single-best output of one module is used as input to the next component; errors of the previous component therefore strongly affect the performance of the next. Due to errors in speech recognition, the machine translation component cannot achieve the same level of translation performance as it would for correct text input. To overcome this problem, many techniques for integrating speech recognition and machine translation have been proposed [2, 3]. In these, the impact of speech recognition errors on machine translation is alleviated by using the N-best list or word lattice output of the speech recognition component as input to the machine translation component. Consequently, these approaches can improve the performance of S2ST significantly. However, the speech synthesis component, which generates the output speech for the translated sentences, is not usually considered. If the quality of the synthesized speech is poor, users will not understand what the system said: the quality of synthesized speech is clearly important for S2ST, and any integration method intended to improve the end-to-end performance of the system should take the speech synthesis component into account.
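To make this coupling concrete, the following minimal Python sketch (hypothetical, not the actual system: the ASR, MT, and TTS components are passed in as placeholder callables) contrasts the simplest 1-best cascade with an N-best coupling in which the machine translation component rescores several recognition hypotheses:

from typing import Callable, List, Tuple

def s2st_nbest(
    audio: bytes,
    recognize_nbest: Callable[[bytes, int], List[Tuple[str, float]]],  # ASR: (hypothesis, log-prob) pairs
    translate: Callable[[str], Tuple[str, float]],                     # MT: (translation, log-prob)
    synthesize: Callable[[str], bytes],                                # TTS: translated text -> waveform
    n: int = 10,
) -> bytes:
    # N-best coupling: let MT rescore several ASR hypotheses instead of
    # committing to the single best, which can recover from recognition
    # errors. With n=1 this reduces to the simplest 1-best cascade.
    best_translation, best_score = "", float("-inf")
    for hyp_text, asr_logprob in recognize_nbest(audio, n):
        translation, mt_logprob = translate(hyp_text)
        combined = asr_logprob + mt_logprob  # log-domain score combination
        if combined > best_score:
            best_translation, best_score = translation, combined
    return synthesize(best_translation)

Note that even in this improved cascade the speech synthesis component still receives only the single chosen sentence; extending the search to account for synthesis quality is exactly the gap discussed above.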

The EMIME project [4] is developing personalized S2ST, in which a user's speech input in one language is used to produce speech output in another language. The speech characteristics of the output are adapted to those of the input using cross-lingual speaker adaptation techniques [5]. While personalization is an important area of research, this paper focuses on the impact of the machine translation and speech synthesis components on the end-to-end performance of an S2ST system. In order to understand the degree to which each component affects performance, we investigate integration methods. We first conducted a subjective evaluation divided into three sections: speech synthesis, machine translation, and speech-to-speech translation. Various translated sentences, taken from the N-best output of the machine translation component, were evaluated. The individual impacts of the machine translation and speech synthesis components are analyzed from the results of this subjective evaluation.
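As a sketch of how the listener ratings from such a subjective evaluation might be aggregated (the conditions and scores below are hypothetical, not our actual results), each 1-5 rating is averaged per condition to give a mean opinion score (MOS):

from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

def mean_opinion_scores(ratings: List[Tuple[str, int]]) -> Dict[str, float]:
    # Average 1-5 listener ratings per condition (e.g. per N-best rank).
    by_condition: Dict[str, List[int]] = defaultdict(list)
    for condition, score in ratings:
        by_condition[condition].append(score)
    return {cond: mean(scores) for cond, scores in by_condition.items()}

# Hypothetical ratings for speech synthesized from 1-best vs. 5-best translations.
print(mean_opinion_scores([("1-best", 3), ("1-best", 4), ("5-best", 4), ("5-best", 5)]))
# {'1-best': 3.5, '5-best': 4.5}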

2. Related Work

In the field of spoken dialog systems, the quality of synthesized speech is one of the most important features, because users cannot understand what the system said if that quality is low. Therefore, integration of natural language generation and speech synthesis has been proposed [6, 7, 8]. In [6], a method was proposed for integrating natural language generation and unit-selection speech synthesis which allows the choice of wording and prosody to be jointly determined by the language generation and speech synthesis components. A template-based language generation component passes a word network expressing the same content to the speech synthesis component, rather than a single word string. To perform the unit selection search on this word-network input efficiently, weighted finite-state transducers (WFSTs) are employed, with weights determined by join costs, prosodic prediction costs, and so on. In an experiment, this system achieved higher-quality speech output. However, this method cannot be used with most existing speech synthesis systems, because they do not accept word networks as input. An alternative to the word-network approach is to re-rank the N-best sentences output by the natural language generation component [7]. N-best output can be used in conjunction with any speech synthesis system, although the natural language generation component must be able to produce N-best sentences. In this method, a re-ranking model selects the sentences that are predicted to sound most natural when synthesized with the unit-selection speech synthesis component. The re-ranking model is trained on subjective scores of synthesized speech quality assigned in a preliminary evaluation, together with features from the natural language generation and speech synthesis components such as word N-gram model scores, join costs, and prosodic prediction costs. Experimental results demonstrated higher-quality speech output. Similarly, a re-ranking model for N-best output was also proposed in [8]. In contrast to [7], this model used a much smaller data set for training and a larger set of features, but reached the same performance as reported in [7]. These are integration methods for natural language generation and speech synthesis in spoken dialog systems. In contrast, our focus is on the integration of machine translation and speech synthesis for S2ST. To this end, we first conducted a subjective evaluation using Amazon Mechanical Turk [9], then analyzed the impact of machine translation and speech synthesis on S2ST.
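A minimal sketch of such a re-ranking model is given below (the feature names and weights are hypothetical; in [7, 8] the weights are trained on subjective quality judgements). Each N-best candidate is scored as a weighted sum of its features, and the highest-scoring candidate is selected:

from typing import Dict, List

def rerank_nbest(candidates: List[Dict[str, float]], weights: Dict[str, float]) -> int:
    # Return the index of the candidate with the highest weighted feature
    # score; costs are entered as negative values so that larger is better.
    def score(feats: Dict[str, float]) -> float:
        return sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))

# Hypothetical features for three N-best sentences.
nbest = [
    {"lm_score": -12.3, "join_cost": -4.1, "prosody_cost": -2.0},
    {"lm_score": -11.8, "join_cost": -6.5, "prosody_cost": -1.2},
    {"lm_score": -13.0, "join_cost": -3.0, "prosody_cost": -1.5},
]
weights = {"lm_score": 1.0, "join_cost": 0.5, "prosody_cost": 0.8}
print(rerank_nbest(nbest, weights))  # 2: the best trade-off of fluency and join cost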

This paper describes recent progress in research on integrating machine translation and speech synthesis to develop an English-text-to-Tamil-speech system, including the development of an English-to-Tamil hybrid machine translation system (a combination of rule-based and statistical machine translation) and a syllable-based concatenative speech synthesis system. Section 3 describes the hybrid machine translation system. Section 4 discusses the development of the syllable-based concatenative text-to-speech synthesis system. The integration of these component technologies into the English-text-to-Tamil-speech system is described in Section 5. Section 6 discusses the objective and subjective performance measures of the developed components.

Finally, we draw our conclusion in Section 7.
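As a simple illustration of the syllable-based concatenative approach developed in Section 4, the sketch below (the unit inventory and romanized syllables are hypothetical, not our actual unit database) concatenates one pre-recorded waveform unit per syllable; a real system would additionally smooth the joins:

from typing import Dict, List

def synthesize_syllables(syllables: List[str], unit_db: Dict[str, bytes]) -> bytes:
    # Look up one waveform unit per syllable and append them in order.
    # Missing syllables fall back to silence (empty bytes) in this sketch.
    return b"".join(unit_db.get(s, b"") for s in syllables)

# Hypothetical inventory keyed by romanized Tamil syllables.
unit_db = {"va": b"<wave:va>", "nak": b"<wave:nak>", "kam": b"<wave:kam>"}
speech = synthesize_syllables(["va", "nak", "kam"], unit_db)  # "vanakkam" (hello)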

References

1. S. Nakamura, "Overcoming the Language Barrier with Speech Translation Technology," Science & Technology Trends - Quarterly Review, No. 31, April 2009.

2. E. Vidal, "Finite-State Speech-to-Speech Translation," Proc. ICASSP, pp. 111–114, 1997.

3. H. Ney, "Speech Translation: Coupling of Recognition and Translation," Proc. ICASSP, pp. 1149–1152, 1999.

4. The EMIME project, http://www.emime.org/

5. Y.-J. Wu, Y. Nankaku, and K. Tokuda, "State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis," Proc. Interspeech, pp. 528–531, 2009.

6. I. Bulyko and M. Ostendorf, "Efficient integrated response generation from multiple targets using weighted finite state transducers," Computer Speech and Language, vol. 16, pp. 533–550, 2002.

7. C. Nakatsu and M. White, "Learning to say it well: Reranking realizations by predicted synthesis quality," Proc. ACL, pp. 1113–1120, 2006.

8. C. Boidin, V. Rieser, L. van der Plas, O. Lemon, and J. Chevelu, "Predicting how it sounds: Re-ranking dialogue prompts based on TTS quality for adaptive spoken dialogue systems," Proc. Interspeech, pp. 2487–2490, 2009.

9. Amazon Mechanical Turk, https://www.mturk.com/
