The Speech Driven Computer Aided Articulatory Movements

02 Nov 2017


ABSTRACT

The aim of this research is to improve the phonetic skills of Tamil language learners. The Automatic Speech Recognition (ASR) system is designed mainly to determine how words are perceived by second-language and early language learners, and to assess speech fluency. Improving the accuracy and efficiency of ASR systems has been a long-term goal of researchers seeking efficient human-machine interaction. In the ASR front end, a feature extraction technique parameterizes the speech signal, and a hidden Markov model (HMM) approach performs the recognition. The recognizer output is given to a semi-transparent face model comprising lip and tongue models that perform the articulation. Parameters extracted from MRI drive the articulation by adjusting control points on the modelled lips and tongue. Visualizing the inner articulators helps learners perceive and understand pronunciation efficiently. The proposed system understands spoken isolated Tamil speech and drives the inner articulation based on the speech input. Thus, a speech-driven face animation with inner articulators is modelled, and this acts as a response system.

INTRODUCTION

Speech is the most efficient way of communication among humans. Speech is composed of a set of phonemes, which form words out of consonants and vowels. The interaction between humans and machines leads a path towards man-machine interaction. Speech is a natural form of communication that serves better than primary interfaces such as keyboards and pointing devices. An automatic speech recognition (ASR) system receives speech input from the subject and converts it to text; in other words, it translates spoken words into text. This is also known as "computer speech recognition", "speech to text", or "STT". A speech recognition system converts an acoustic signal captured by a microphone into a sequence of words. People with disabilities that prevent them from typing feel comfortable adopting speech recognition systems [1]. Due to unnatural positions, other modalities require more concentration, restrict movement and cause body strain.

Speech recognition systems are mainly used for command recognition or in dictation applications. Because communication using spoken language is predominant, it is natural for people to expect speech interfaces. Despite the research invested in automatic speech recognition, many questions remain unanswered, owing to the complexity of the problems and the variety of disciplines the solutions require. It is an observable fact that in everyday conversation we mispronounce speech sounds a large percentage of the time [2], and speaking style and word pronunciation differ from person to person. Research in ASR has always aimed to develop techniques that allow computers to recognize speech in real time, with efficient use of CPU and memory, at 100% accuracy for all utterances by any person [3]. The main focus of today's research is on developing speech recognition systems for native languages, as countries like India are multilingual.

Visualizing the internal speech articulators gives subjects a better perception of speech, and computer-aided speech visualization makes the inner articulation clearly visible. Inner speech articulators such as the lips and tongue describe the words to be pronounced. The lip-sync model visualizes the articulator movements as described by articulation parameters such as the lips, jaw and tongue [4]. For each speech sound, the visual articulation module provides a co-articulated target position, which is held for a fixed fraction of the speech duration [5].

RELATED WORKS

This section reviews speech recognition and inner articulation. Speech technologies are commercially available for a range of tasks. The most important resource for developing an effective recognizer is the speech corpus. The Linguistic Data Consortium for Indian Languages (LDC-IL) has collected a large speech corpus in different Indian languages and distributes the database to researchers for application development [5]. Speech recognition systems are mainly categorized by the type of utterances, the speaker model, and the recognition approach used. Various classification techniques have been applied to speech recognition. Early ASR systems used the acoustic-phonetic approach, which describes the phonetic elements of speech and explains how they are acoustically realized in a spoken utterance [1]. Later work moved on to template-based, knowledge-based, neural-network-based, pattern-matching, support-vector, dynamic time warping, and statistical methods, among others.

HMMs are used to represent complete words, which can easily be constructed from phone HMMs. The word-sequence probabilities are added, and the network is searched completely for the best path corresponding to the optimal word sequence [7]. The importance of HMMs lies in the dynamic linking of models through graphical representations, which provides a better representation of speech characteristics; small models are assembled into larger ones while recognition is performed [8]. New words can be learned without affecting already trained words, which greatly reduces the time and complexity of the recognition process when training a large vocabulary [7]. Feature extraction is used to distinguish one speech sound from another: the characteristics of the speech are identified from the utterances. Comparisons of various speech recognition approaches and their feature extraction techniques have found that MFCC provides a better recognition rate [7, 9].

Speech sounds are produced by various articulators such as the jaw, tongue and lips. Lip and tongue movements are the most expressive components, and several skeletons have been created for realistic animation [10]. The 3D model is connected using polygons, resulting in smoother shapes and structures [11, 12, 13]; these polygon models make the animation more natural. The two portions of the lips (upper lip and lower lip) are soft, protruding and movable, and serve primarily for the articulation of speech [11]. The movements are created from articulatory targets for each phoneme [12]. Realistic speech animation requires a tongue model with clearly visible tongue motion, which is an important cue in speech reading [13]. Several methods have been used to acquire data on the inner articulators: Magnetic Resonance Imaging (MRI), and kinematic data from Electropalatography (EPG) and Electromagnetic Articulography (EMA) [14]. MRI does not pose any health risks to subjects. The articulatory movements in the synthetic face are synthesized from phoneme recognition of the spoken utterance, and the animations are presented synchronized with the acoustic signal [15].

SYSTEM OVERVIEW

The system provides a speech recognizer which in turn drives the inner articulatory movements. It possesses two essential parts: the speech interface and the visualization of the inner articulators.

3.1. SPEECH INTERFACE

The recognizer is constructed for Tamil words. The selected utterance type is isolated words, with a small vocabulary and a speaker-independent model. Speech recognition is formulated as mapping speech signals to their corresponding textual representations. An ASR (Automatic Speech Recognition) system mainly consists of modules for feature extraction, acoustic models, language models and a dictionary, and its operation is divided into pre-processing and post-processing. This system uses Sphinx-4 to recognize isolated Tamil words, with MFCC for feature extraction and HMMs for acoustic modelling.

3.1.1 FEATURE EXTRACTION

Feature extraction extracts the parameter information of each utterance; these parameters, known as feature vectors, are used for analysis. The process retains only the relevant and needed information. The Mel-scaled Frequency Cepstral Coefficients (MFCCs), derived from the Fourier transform and filter-bank analysis, are perhaps the most widely used front end in state-of-the-art speech recognition systems [16]. The overall MFCC process consists of a pre-emphasizer, windowing, DFT, mel filter bank, DCT and the resulting feature vectors. Pre-emphasis compensates for the high-frequency part that is suppressed during the human sound-production mechanism; the speech after pre-emphasis sounds sharper with a smaller volume [17], and pre-emphasis also amplifies the importance of the high-frequency formants. The pre-emphasized speech signal is segmented into frames. The next block is the Hamming window, applied to each frame to preserve the continuity of the speech signal at the frame boundaries; the FFT is then computed over each windowed frame. Two main strategies are applied to overcome the discontinuity problem [17]. The mel filter bank uses triangular band-pass filters; each filter output is the sum of its filtered spectral components. The mel scale for a given frequency f in Hz is given in equation (1):

Mel(f) = 2595 * log10(1 + f/700)          (1)
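As a quick sanity check, equation (1) and its inverse (used to place filter-bank edges) can be written directly in Python; this is an illustrative sketch, not part of the described system:

```python
import math

def hz_to_mel(f_hz):
    """Equation (1): convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, useful for positioning triangular filter edges."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

By design, 1000 Hz maps to approximately 1000 mel; above that, the scale grows logarithmically, mirroring the ear's decreasing frequency resolution.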

Using the DCT, the log mel spectrum is converted back into the time domain; the result is the Mel Frequency Cepstral Coefficients, and these coefficients are known as the feature vectors. The MFCC scale approximates the critical bandwidths of the human ear. Figure 1 shows the feature extraction steps followed in MFCC. These features are used for further analysis in speech recognition.

Figure 1. Steps in MFCC
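The MFCC chain described above (pre-emphasis, windowing, FFT, mel filter bank, log, DCT) can be sketched end-to-end for a single frame. This is a minimal illustrative front end: the frame size, 26 filters and 13 coefficients are conventional assumed values, not parameters taken from the paper's system.

```python
import numpy as np

def mfcc_sketch(signal, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    """Minimal single-frame MFCC: pre-emphasis, window, FFT, mel bank, log, DCT."""
    # 1. Pre-emphasis boosts the high-frequency formants (alpha = 0.97 is conventional)
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. One Hamming-windowed frame (a real front end slides frames every ~10 ms)
    frame = emphasized[:n_fft] * np.hamming(n_fft)
    # 3. Power spectrum via the FFT
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
    # 4. Triangular mel filter bank; centre frequencies are equally spaced in mel
    #    using equation (1) and mapped back to FFT bins
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    log_energy = np.log(fbank @ power + 1e-10)
    # 5. DCT-II decorrelates the log filter-bank energies into cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_energy
```

A production recognizer such as Sphinx-4 performs the same steps over every sliding frame and typically appends delta and acceleration coefficients.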

3.1.2 DICTIONARY

The dictionary specifies how the words are to be pronounced. Also known as a lexicon, it contains the phoneme sequences used in speech recognition; a single word can have multiple pronunciations. Hence, the lexicon lists all the words that the speech recognition system is to recognize.
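A lexicon can be represented as a simple mapping from words to one or more phoneme sequences. The romanized Tamil entries and phone symbols below are hypothetical, chosen only to illustrate the structure (a real Sphinx dictionary file marks alternate pronunciations as word(2), word(3), and so on):

```python
# Hypothetical romanized Tamil entries; each word maps to a list of
# pronunciation variants, each variant being a phoneme sequence.
lexicon = {
    "vanakkam": [["v", "a", "n", "a", "kk", "a", "m"]],
    "amma":     [["a", "m", "m", "aa"], ["a", "mm", "a"]],  # two variants
}

def pronunciations(word):
    """Return every phoneme sequence listed for a word, or [] if out of vocabulary."""
    return lexicon.get(word, [])
```

Out-of-vocabulary words return an empty list, which is why every word in the language model must also appear here.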

3.1.3 LANGUAGE MODEL

The language model helps the recognizer find the likelihood of a word sequence. It contains a representation (often statistical) of the probability of occurrence of words [18]. With the help of the pronunciation dictionary, the words are detected, and the word search can be restricted based on the frequency of utterance. Every word in the language model must have its pronunciation in the dictionary. There are several methods of creating word grammars.
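A minimal statistical language model can be sketched as bigram probabilities estimated from counts. The tiny English corpus below is purely illustrative; a deployed model would add smoothing for unseen word pairs:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate maximum-likelihood bigram probabilities P(w2 | w1) from sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()  # <s> marks the sentence start
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

p = train_bigram(["open the door", "open the window"])
# p("open", "the") == 1.0 ("open" is always followed by "the" in this corpus)
# p("the", "door") == 0.5 ("the" is followed by "door" half the time)
```

The recognizer multiplies such probabilities along candidate word sequences to restrict the search to likely paths.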

3.1.4 ACOUSTIC MODEL

The goal of acoustic modelling is to characterize the statistical variability of the feature set. The acoustic model uses either word models or phone models. With sub-word units, all words are composed of a small set of sounds such as syllables or phonemes, which are modelled once and shared across different words. During recognition, the input sound is matched against each word present in the model, and the best match is taken to be the spoken word.

3.2. VISUALIZATION OF INNER ARTICULATORS

The system consists of a transparent face model with both front and side views. It possesses both a tongue and lips to perform the inner articulations, constructed as polygon models. The model parameters include lip opening and closing, lip protrusion, lip rounding, tongue-body raise, tongue front and back, tongue contact with the palate, and tongue-tip raise. The transparent face model makes the movements of the inner articulators clearly visible. The tongue model consists of 86 polygons with 98 control points arranged in two 7×7 grids: the top and bottom layers have 49 control points each, of which 7 major points are identified for performing the articulatory movements. Likewise, the lip model consists of 28 polygons with 42 control points arranged in a 6×7 grid; 6 major control points are selected for the lip movements. For each selected word, the control-point information is extracted from an MRI video recorded while the word or letter is pronounced by [AR]. When the system recognizes a Tamil word, it drives the inner articulation; hence the modelled lips and tongue perform the articulatory movements of the recognized word.
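The control-point idea can be sketched as a grid of 3D points in which a major parameter such as "tongue-tip raise" moves one edge of the grid while nearby rows follow. The linear falloff scheme below is an assumption made for illustration, not the paper's actual parameterization (which derives displacements from MRI):

```python
import numpy as np

GRID = 7  # one 7x7 layer of the tongue surface, as in the paper's model

# Rest position: an (x, y, z) coordinate for each control point
tongue = np.zeros((GRID, GRID, 3))
tongue[..., 0] = np.arange(GRID)[:, None]  # front-back axis
tongue[..., 1] = np.arange(GRID)[None, :]  # left-right axis

def raise_tongue_tip(grid, amount):
    """Raise the tip row (row 0) by `amount`; nearby rows follow with linear falloff."""
    out = grid.copy()
    for row in range(GRID):
        falloff = max(0.0, 1.0 - row / 3.0)  # rows beyond the third stay put
        out[row, :, 2] += amount * falloff
    return out
```

Driving a handful of major control points and interpolating the rest keeps the animation data small: only the major-point trajectories need to be extracted from the MRI video per word.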

PERFORMANCE

The performance of speech recognition is measured using accuracy and speed: accuracy is measured by the Word Error Rate (WER), and speed by the real-time factor.

The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The general difficulty in measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence.

The word error rate can then be computed as in equation (2):

WER = (S + D + I) / N          (2)

where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the number of words in the reference.
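The WER formula can be implemented directly with a word-level Levenshtein distance, since the minimum edit distance jointly counts the substitutions, deletions and insertions. This is a standard sketch, not the paper's evaluation code:

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, recognizing "open the door" as "open a door" is one substitution in three reference words, giving a WER of 1/3.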

If it takes time P to process an input of duration I, the real-time factor is defined in equation (3) as:

RTF = P / I          (3)

Thus, the constructed speaker-independent, small-vocabulary isolated-word recognizer recognizes the Tamil words with a recognition rate of --------%.

DISCUSSIONS

Speech is an effective mode of communication among humans. Substantial improvements have taken place over the years, starting from digit recognizers; today we have speech-enabled multimodal devices.


