Structure of Speech Recognition System


This chapter presents the fundamentals of continuous speech recognition with the Hidden Markov Model. The first discussion describes the basic structure of a continuous speech recognition system, which includes five stages: feature analysis, unit matching, lexical decoding, syntactic analysis and semantic analysis. Then, we present three approaches to speech recognition: the acoustic-phonetic, pattern recognition and artificial intelligence approaches. Within the pattern recognition approach, we highlight only the statistical method, since this method is adopted by the Hidden Markov Model and is used in this project. In the HMM section, we present the three basic problems that are central to HMMs and that form the strength of the HMM as a speech recognizer for different languages. In the last section, we discuss speech recognition in the Standard Malay language and in other languages.

Speech recognition basically allows humans to speak to a computer, to have the computer recognize what the user is saying, and to do this in real time. The process fundamentally functions as a pipeline that converts Pulse Code Modulation (PCM) digital audio from a sound card into recognized speech. Figure 3.1 presents the basic structure of a speech recognition system. It contains a front-end unit, an acoustic model unit, a language model unit and a search unit.

[Figure 3.1 shows the block diagram: the input speech enters the front-end, which produces acoustic features; the search unit combines the acoustic model P(A|W) and the language model P(W) to produce the recognized utterance.]

Figure 3.1: Block diagram of speech recognizer.

In the recognition process, the speech signal first has to be digitized before the computer can process it. An acoustic front end performs feature extraction, transforming the raw speech corpus into more usable information for both training and testing files. In feature extraction, the signal is converted into a sequence of feature vectors based on spectral and temporal measurements. In this project, Mel Frequency Cepstral Coefficients (MFCC) were used; MFCC is the most commonly used acoustic front end in the field of speech recognition.
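As a brief illustration of such a front end, the following Python sketch computes MFCC feature vectors with the open-source librosa library. The file name and frame settings are illustrative assumptions, not taken from the project.

```python
# A minimal sketch of MFCC feature extraction using librosa.
# File name and parameter values are illustrative, not from the study.
import librosa

# Load the digitized speech signal (16 kHz is a common rate for ASR).
signal, sr = librosa.load("utterance.wav", sr=16000)

# Convert the signal into a sequence of 13-dimensional MFCC feature
# vectors, one per analysis frame (25 ms window, 10 ms hop).
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)

# Delta (velocity) coefficients are commonly appended to capture the
# temporal dynamics of the spectrum.
delta = librosa.feature.delta(mfcc)

print(mfcc.shape)  # (13, number_of_frames)
```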

The speech recognition engine requires two types of files to recognize speech: an acoustic model and a language model. The acoustic model represents sub-word units, often called phonemes. It is created by compiling audio recordings of speech and their transcriptions into statistical representations of the sounds that make up each word; the acoustic model is represented by P(A|W). Meanwhile, the language model is a file containing the probabilities of sets of words, and it controls which models are hypothesized; the language model is represented by P(W). The following Bayesian formula for speech recognition is used, with the objective of minimizing the word error rate by maximizing P(W|A), where A refers to the acoustic symbols and W refers to the word sequence:

W* = argmax over W of P(W|A) = argmax over W of P(A|W) P(W)
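The decision rule can be illustrated with a toy example. The sketch below picks, among a few hypothetical candidate word sequences with invented log-probabilities, the one maximizing log P(A|W) + log P(W); it is not the system's actual search.

```python
# A toy illustration of the Bayesian decision rule
# W* = argmax_W P(A|W) P(W), using log-probabilities.
import math

# Hypothetical scores for three candidate word sequences W given the
# observed acoustics A; the numbers are invented for illustration.
candidates = {
    "satu dua":  {"log_p_a_given_w": -120.3, "log_p_w": math.log(0.020)},
    "satu tiga": {"log_p_a_given_w": -118.9, "log_p_w": math.log(0.005)},
    "dua tiga":  {"log_p_a_given_w": -125.1, "log_p_w": math.log(0.015)},
}

# Pick the word sequence maximizing log P(A|W) + log P(W).
best = max(candidates,
           key=lambda w: candidates[w]["log_p_a_given_w"]
                         + candidates[w]["log_p_w"])
print("Recognized utterance:", best)
```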

The most crucial step in a speech recognition system is the search step. In this step, many combinations of words must be investigated to find the most probable word sequence. Depending on the chosen criterion, speech recognition systems can be classified according to the tree structure presented in Figure 3.2.

[Figure 3.2 shows a tree with four classification criteria: speaking style (dictation, spontaneous), vocabulary size (small, medium, large), speaker mode (speaker dependent, speaker independent, speaker adaptive) and speech mode (isolated speech, continuous speech).]

Figure 3.2: Speech recognition classification

Types of Speech Recognition Approach

Basically, there are three approaches to speech recognition, namely the acoustic-phonetic approach, the pattern recognition approach and the artificial intelligence approach.

Acoustic-phonetic Approach

The acoustic-phonetic approach is the earliest approach to speech recognition; it is based on finding speech sounds and providing appropriate labels to these sounds (Anusuya and Katti, 2009). This approach draws on the distinctive feature theory first proposed by Jakobson, Fant and Halle in 1952 and later expanded by Chomsky and Halle in 1968. Between 1971 and 1976, this approach was applied by the Advanced Research Projects Agency (ARPA) to develop continuous speech recognition systems (Wilson, 1987).

The theory behind the acoustic-phonetic approach is acoustic phonetics, which postulates that there exist finite, distinctive phonetic units (phonemes) in spoken language and that these units are broadly characterized by a set of properties that are manifest in the speech signal, or its spectrum, over time (Rabiner and Juang, 1991).

The architecture of the acoustic-phonetic approach is shown in Figure 3.3. The first step is the speech analysis system, which provides an appropriate (spectral) representation of the characteristics of the time-varying speech signal. The most common techniques of spectral analysis are the Discrete Fourier Transform (DFT), Linear Predictive Coding (LPC), and Mel-Scaled Frequency Cepstral Coefficients (MFCC). The second step is the feature-detection stage. The spectral measurements are converted in a parallel fashion to a set of features describing the broad acoustic properties of the various phonetic units, e.g. nasality (nasal resonance), frication (random excitation), formant locations (frequencies of the first three resonances), voiced/unvoiced classification (periodic or aperiodic excitation), and energy ratios. The third step is the segmentation and labeling phase, in which the system tries to find stable feature regions and then labels those regions according to how well the features within each region match those of individual phonetic units (Ting, 2007). This approach is suitable for implementing applications with semantic and grammatical constraints, such as voice dictation.

Figure 3.3: Block diagram of the acoustic-phonetic approach for ASR

(Rabiner and Juang, 1993)
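As a toy illustration of the segmentation step (not taken from the cited works), the sketch below marks frames whose feature vectors change little from one frame to the next as "stable" and groups consecutive stable frames into candidate regions; the threshold and the random stand-in features are invented.

```python
# A toy illustration of finding stable feature regions: frames whose
# spectral feature vectors change little between consecutive frames are
# grouped into candidate regions for later phonetic labeling.
import numpy as np

def stable_regions(features, threshold=0.5):
    """features: (num_frames, dim) array; returns list of (start, end)."""
    # Euclidean distance between consecutive frame feature vectors.
    dist = np.linalg.norm(np.diff(features, axis=0), axis=1)
    stable = dist < threshold          # True where the spectrum is steady
    regions, start = [], None
    for t, s in enumerate(stable):
        if s and start is None:
            start = t                  # a stable run begins
        elif not s and start is not None:
            regions.append((start, t)) # the run ends
            start = None
    if start is not None:
        regions.append((start, len(stable)))
    return regions

# Example with random "features" standing in for real spectral frames:
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 13)).cumsum(axis=0) * 0.05
print(stable_regions(feats, threshold=0.5))
```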

Pattern Recognition Approach

Pattern recognition is the study of how machines can observe the environment, learn to distinguish patterns of interest from their background, and make sound and reasonable decisions about the categories of the patterns. The four best known approaches for pattern recognition are: (1) Template Matching, (2) Statistical classification, (3) Syntactic or structural matching, and (4) Neural network (Amit and Rama, 2007). In this section, the statistical classification approach will be discussed further.

Unlike the acoustic-phonetic approach, the statistical pattern recognition approach neither segments the speech pattern nor checks its properties. In this approach, each pattern is represented by a set of d features, or attributes, and is viewed as a d-dimensional feature vector (Anil et al., 2000). The speech patterns are input directly into the system and compared with the patterns stored in the system during training. The performance of the system is sensitive to the amount of training data, the speaking environment and the transmission characteristics of the medium used to create the speech (Rabiner and Juang, 1991).

[Figure 3.4 shows two paths: a training path (training pattern → preprocessing → feature extraction/selection → learning) with a feedback loop to the designer, and a classification path (test pattern → preprocessing → feature measurement → classification).]

Figure 3.4: Model for statistical pattern recognition (Anil et al., 2000)

As can be seen in Figure 3.4, the recognition system operates in two modes: training mode and classification mode. In the training mode, the feature extraction/selection module finds the appropriate features for representing the input patterns, and the classifier is trained to partition the feature space. The feedback path allows the designer to optimize the preprocessing and feature extraction strategies. In the classification mode, the trained classifier assigns the input pattern to one of the pattern classes under consideration based on the measured features (Anil et al., 2000). One of the best-known statistical models in ASR research is the Hidden Markov Model (HMM), which is used in this project.

Artificial Intelligence Approach

The artificial intelligence (AI) approach exploits the concepts of both the acoustic-phonetic approach and the pattern recognition approach. It attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualising, analysing and finally making a decision on the measured acoustic features. The main idea of AI is to collect and employ knowledge from a number of sources to solve the problem in question (Rabiner and Juang, 1993).

There are two approaches to incorporating knowledge sources into speech recognition: the bottom-up approach and the top-down approach. The two differ in how they tackle the problem. The most commonly used is the bottom-up approach, in which the lowest-level processes, such as feature extraction and phonetic decoding, precede higher-level processes, such as lexical decoding and the language model, as shown in Figure 3.5. Meanwhile, the top-down approach integrates the word hypothesis matching, lexical decoding and syntactic analysis blocks into a consistent framework, as shown in Figure 3.6 (Ting, 2007). Expert systems are widely used with the AI approach.

Figure 3.5: Bottom-up approach to knowledge integration

(Rabiner and Juang, 1993)

Figure 3.6: Top-down approach to knowledge integration

(Rabiner and Juang, 1993)

Summary of Speech Recognition Approach

The acoustic-phonetic approach has not been widely used in most commercial applications (Rabiner and Juang, 1991). It has achieved only limited success because of the lack of good knowledge of acoustic phonetics and related areas (Wilson, 1989). Statistical pattern recognition has been used successfully to design a number of commercial recognition systems (Anil et al., 2000). This approach is a popular choice for most ASR systems nowadays because it is simple and computationally feasible. Since the artificial intelligence approach is a hybrid of the acoustic-phonetic and pattern recognition approaches, it is the most difficult approach to use. It has had only limited success, largely due to the difficulty in quantifying expert knowledge. Another difficulty is the integration of the many levels of human knowledge, such as phonetics, phonotactics, lexical access, syntax, semantics and pragmatics (Anusuya and Katti, 2009). The three approaches discussed above are summarized in Table 3.1.

Table 3.1: Speech recognition approaches

Approach | Representation | Recognition Function | Typical Criterion
Acoustic-phonetic approach | Phonemes / segmentation and labeling | Probabilistic lexical access procedure | Log likelihood ratio
Statistical pattern recognition approach | Feature vectors | Clustering functions (codebook) and discriminant functions | Classification error
Artificial intelligence approach | Knowledge based | - | Word error probability

HMM for Speech Recognition

As mentioned in the previous section, the statistical pattern recognition approach is widely used in the development of speech recognition systems; the HMM is the statistical model adopted in this project.

Three Basic Problems of HMMs

In the development of the HMM methodology, there are three fundamental problems of interest: the evaluation problem, the decoding problem and the learning problem.

Problem 1 : The Evaluation Problem

Given an HMM λ and a sequence of observations O = O1 O2 … OT, what is the probability that the observations are generated by the model, P(O|λ)?

In the first problem, given a sequence of observations, the goal is to compute the likelihood that this sequence is produced by a given HMM; this gives a way to compare HMMs and find the one that best fits the observations (Germain, 2009). The problem can be solved using dynamic programming, such as the forward algorithm, and it can be used to solve isolated word recognition.
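As an illustration, the following Python sketch implements the forward algorithm for a discrete-observation HMM; the toy parameters A (transitions), B (emissions) and pi (initial probabilities) are invented for the example.

```python
# A minimal sketch of the forward algorithm for Problem 1 (evaluation)
# on a discrete-observation HMM with invented toy parameters.
import numpy as np

A  = np.array([[0.7, 0.3],       # state transition probabilities
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],  # emission probabilities per state
               [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state probabilities

def forward(obs):
    """Return P(O | lambda) for a sequence of observation symbols."""
    alpha = pi * B[:, obs[0]]          # initialization
    for o in obs[1:]:                  # induction over time
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()                 # termination

print(forward([0, 1, 2]))  # likelihood of the sequence under the model
```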

Problem 2: The Decoding Problem

Given an HMM λ and a sequence of observations O = O1 O2 … OT, what is the most likely state sequence Q = q1 q2 … qT in the model that produced the observations?

In the second problem, given a sequence of observations, the goal is to find the most likely sequence of states that could have generated them in the HMM. Thus, we "recognize" the original information of the underlying Markov process (Germain, 2009). This problem can be solved using the Viterbi algorithm, and it is related to continuous recognition and segmentation.
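A minimal sketch of the Viterbi algorithm for the same kind of discrete-observation toy HMM follows; the parameters are again illustrative.

```python
# A minimal sketch of the Viterbi algorithm for Problem 2 (decoding)
# on a discrete-observation HMM with invented toy parameters.
import numpy as np

def viterbi(obs, A, B, pi):
    """Return the most likely state sequence for the observations."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))       # best path probability ending in state j
    psi   = np.zeros((T, N), int)  # back-pointers for path recovery
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1, :, None] * A  # (from-state, to-state) scores
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    # Backtrack from the most probable final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(viterbi([0, 1, 2], A, B, pi))  # e.g. [0, 0, 1]
```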

Problem 3: The Learning Problem

Given an HMM λ = (A, B, π) and a sequence of observations O = O1 O2 … OT, how should we adjust the model parameters in order to maximize P(O|λ)?

In the third problem, given sequences of observations, the goal is to find the optimal HMM, i.e. the one under which these sequences are most likely to be produced (Germain, 2009). This problem can be solved approximately using the Baum-Welch algorithm, also known as the forward-backward algorithm. The learning problem must be solved first if we want to train an HMM for subsequent use in a recognition task.
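The sketch below shows a single Baum-Welch re-estimation step for a discrete-observation HMM. It omits the numerical scaling used in practice, so it is suitable only for short toy sequences.

```python
# A minimal sketch of one Baum-Welch (forward-backward) re-estimation
# step for Problem 3 (learning) on a discrete-observation HMM.
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One EM update of (A, B, pi) that increases P(O | lambda)."""
    T, N = len(obs), len(pi)
    # Forward (alpha) and backward (beta) probabilities.
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()
    # State posterior gamma[t, i] and transition posterior xi[t, i, j].
    gamma = alpha * beta / p_obs
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs
    # Re-estimated parameters.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = (np.array(obs) == k)
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi
```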

According to Germain (2009), these three problems have efficient algorithms because they can be optimized using dynamic programming. The two algorithms most widely used in recognition are the Viterbi algorithm and the forward-backward algorithm.

The Strength of HMMs for Speech Recognition

HMMs have been widely used in speech recognition due to the strength of the models themselves. Rabiner and Juang (1991) stated that the strengths of HMMs lie in two broad areas, their mathematical framework and their implementation structure, which can be summarized as follows:

The inherent statistical (mathematically precise) framework, with the ease and availability of training algorithms for estimating the parameters of the models from finite training sets of speech data,

The flexibility of the resulting recognition system in which one can easily change the size, type, or architecture of the models to suit particular words, sounds and so forth, and

The ease of implementation of the overall recognition system.

Another possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piece-wise stationary or short-time stationary signal. HMMs can also be trained automatically and are simple and computationally feasible to use (Holmes and Huckvale, 1994).

Summary of HMMs

Since the 1970s, the idea that machines might talk with humans has driven the speech recognition area, and the statistical pattern recognition approach via HMMs has become dominant in the development of ASR systems. Based on the three basic problems of HMMs explained in Section 3.4.1, new systems can be developed for real-world applications. The strengths of HMMs, which include a sound mathematical framework, a simple architecture, computational feasibility and high recognition accuracy, make this model an ideal solution for ASR systems. As a result, there are many applications that use speech recognition, such as machine translation and bio-medicine.

Standard Malay ASRs

According to Ting (2007), speech recognition research in Malaysia is still in its infancy, being limited to small-vocabulary, isolated-word applications and lacking speaker independence. However, there is great potential for the application of speech technology in Malaysia, especially in the context of Malay speech. Currently, the Malay language has been applied in commercial speech recognizer products such as teliSpeech Recognizer 2.0, which shows that Malay has great potential to be applied in many more speech recognition applications. In this section, several research efforts on Malay ASR are presented.

Malay ASR - Research Paper 1

Title: Malay Continuous Speech Recognition Using Density Hidden Markov Model

A study by Ting (2007) aims to overcome the constraints of current Malay speech recognizers, which are speaker dependent, small vocabulary and isolated words. It also provides a basis for developing a medium-vocabulary, speaker-independent Malay continuous speech recognition system. The study used Continuous Density Hidden Markov Models (CDHMM) with mixture densities, which are more capable of modeling inter-speaker acoustic variability than alternative techniques such as the Discrete Hidden Markov Model (DHMM), neural networks and Dynamic Time Warping. A word-based Malay connected-word recognition system was designed and developed by extending an existing Malay isolated-word recognition system to Malay continuous speech recognition tasks. Meanwhile, Malay phonetic segmentation and classification experiments were included as preliminary research on using sub-word models as the modeling unit, which is needed for developing a large-vocabulary system. This task provides the basis for developing a sub-word-unit-based Malay medium- and large-vocabulary continuous speech recognition system.

As the study focused on three major limitations of current Malay speech recognition, namely speaker independence, continuous speech, and medium and large vocabulary, the following experiments were carried out.

Speaker Independency

In this experiment, the performance of DHMM and CDHMM with different training algorithms was evaluated on a speaker-dependent, multi-speaker Malay isolated digit recognition task. The results showed that the recognition accuracy of CDHMM is higher than that of DHMM, as shown in Table 3.2.

Table 3.2: Comparison of DHMM and CDHMM recognizers with different training algorithms

Ting (2007) states that the better result achieved by CDHMM in the speaker-dependent, multi-speaker test motivates its use in speaker-independent tasks in the future.

Continuous Speech

In this experiment, a whole-word-based CDHMM connected-word recognition system for a Malay connected digit recognition task was developed. Ting (2007) gives the specification of the system as follows:

MFCC feature extraction.

CDHMM with multivariate Gaussian mixture densities.

Five-state left-to-right whole-word models.

Segmental K-means connected-word training procedure with Viterbi/segmental K-means as the word model re-estimation algorithm.

Unigram and bigram language modeling.

Full Viterbi Search for decoding.

In this experiment, the effect of the segmental K-means training procedure versus manual segmentation on recognition accuracy was evaluated in single-speaker and multi-speaker tests. The results show that the word accuracy achieved using the segmental K-means algorithm is higher than that of manual segmentation in both tests. Ting (2007) explained that this is because the segmental K-means algorithm can converge to a more optimal string segmentation than manual segmentation and thus generates more reliable and robust models. Meanwhile, the bigram model performs better than the unigram model in the multi-speaker test with segmental K-means training.

Medium and Large Vocabulary

The experiment attempted the use of sub-word-unit modeling in Malay phonetic classification and segmentation tasks on a medium-vocabulary continuous speech database. The phone model was chosen because the small size of such units enables the models to be well trained on limited training data. Meanwhile, CDHMM with mixture densities was used so that several maxima in the probability density function can model the contextual variability of the same phoneme. A set of 35 Malay phones was chosen. Experiments on Malay phoneme classification showed the following results (Ting, 2007):

Incorporation of energy and differential cepstrum increases the classification rates.

Increasing the number of Gaussian mixture components further improves the accuracy.

The accuracy for intra-phone classification is higher than for all-phoneme classification.

Greater improvement was observed for consonants than for the vowel category.

Furthermore, HMM-based Viterbi forced alignment was used in phonetic segmentation to investigate the effect of using different feature sets and varying the number of mixture components. The following results were observed (Ting, 2007):

For small tolerances (5-10 ms), HMMs with fewer Gaussians perform better.

For large tolerances (>20 ms), HMMs with more Gaussians generally perform better.

For medium tolerances (15 ms), the results show an intermediate regime between the small and large tolerances, where increasing the number of Gaussians starts to perform better.

In conclusion, Ting (2007) states that the reasonably good phoneme classification accuracy and phonetic segmentation performance enable the extension to Malay phoneme recognition and will finally lead to large-vocabulary Malay continuous speech recognition.

Malay ASR - Research Paper 2

Title: Isolated Malay Digit Recognition Using Pattern Recognition Fusion of Dynamic Time Warping and Hidden Markov Models

The project by Al-Hadad (2008) presents a pattern recognition fusion method for isolated Malay digit recognition using Dynamic Time Warping (DTW) and HMM. DTW is used to detect the nearest recorded voice, while the HMM emits a new feature vector for each frame according to an emission probability density function associated with each state. The goals of the project are: (1) to increase the accuracy of Malay speech recognition, (2) to develop reference patterns for the Malay digits in the recognition database using DTW and HMM, and (3) to fuse DTW and HMM using a weight mean vector to improve recognition.

The algorithm is tested on the Malay digits from 0 to 9, which are 'KOSONG', 'SATU', 'DUA', 'TIGA', 'EMPAT', 'LIMA', 'ENAM', 'TUJUH', 'LAPAN', and 'SEMBILAN'. Each digit is repeated 10 times by 15 male and 15 female speakers, and while recording, a speaker pauses for 1 second between each digit. The system begins with the input speech, followed by end point detection, framing, normalization, filtering, MFCC, and time normalization, and then uses DTW and HMM to calculate the reference patterns, as can be seen in Figure 3.7. In the final stage, a weight mean vector is applied to the results from DTW and HMM to obtain the final decision output. The weight mean vector is computed from the query recognition rate in the HMM test phase, the query recognition rate in the DTW test phase, and the real-time value of the recorded speeches.

Figure 3.7: Block diagram for decision fusion on Malay isolated digit recognition using DTW and HMM
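To illustrate the fusion idea, the sketch below combines per-digit scores from the two recognizers with weights proportional to each recognizer's own test-phase recognition rate. This is our illustrative reading, not Al-Hadad's exact formulation; the function and all scores are hypothetical.

```python
# An illustrative sketch (not Al-Hadad's exact equation) of fusing DTW
# and HMM outputs with weights derived from each recognizer's
# test-phase recognition rate.
def fuse_scores(hmm_scores, dtw_scores, hmm_rate, dtw_rate):
    """Weighted combination of per-digit scores from the two recognizers.

    hmm_scores / dtw_scores: dicts mapping each candidate digit to a
    normalized confidence score; hmm_rate / dtw_rate: recognition rates
    observed for each recognizer in its own test phase.
    """
    w_hmm = hmm_rate / (hmm_rate + dtw_rate)   # weights sum to one
    w_dtw = dtw_rate / (hmm_rate + dtw_rate)
    fused = {d: w_hmm * hmm_scores[d] + w_dtw * dtw_scores[d]
             for d in hmm_scores}
    return max(fused, key=fused.get)           # final decision output

# Hypothetical scores for three Malay digits:
hmm = {"SATU": 0.80, "DUA": 0.15, "TIGA": 0.05}
dtw = {"SATU": 0.40, "DUA": 0.50, "TIGA": 0.10}
print(fuse_scores(hmm, dtw, hmm_rate=0.907, dtw_rate=0.805))
```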

The results show that the fusion technique can be used to fuse the pattern recognition outputs of DTW and HMM. The refinement normalization using the weight mean vector gives better performance, with an accuracy of 94%, compared to the accuracies of DTW and HMM alone, which are 80.5% and 90.7% respectively. The results obtained are shown in Table 3.3.

Table 3.3: Comparison of digit recognition accuracy test results

Summary of Standard Malay ASRs

Table 3.4 summarizes the studies on speech recognition for the Malay language in Malaysia. All of these studies utilized HMM techniques to recognize isolated words and continuous speech.

Table 3.4: Several Malay speech recognition studies

Developer: Ting Chee-Ming, Sheikh Hussain Shaikh Salleh and A. K. Ariff (2009)
Research title: Malay Continuous Speech Recognition Using Fast HMM Match Algorithm
Overview: The research describes the implementation of a fast HMM match algorithm in a phoneme-based Malay continuous speech recognition system. The strategy decouples the phone HMM state-likelihood computations from the main search, which is bounded by syntactical and lexical constraints, and avoids the redundant state-likelihood computations of the same phone HMM across word models that occur in a conventional search.
Results: The fast decoding algorithm yields a speedup factor of 31.8 compared to decoding without decoupling, while maintaining word accuracy at 91.9% for a test-set perplexity of 15.45 in speaker-dependent mode.

Developer: Noraini Seman and Kamaruzaman Jusoff (2008)
Research title: Acoustic Pronunciation Variations Modeling for Standard Malay Speech Recognition
Overview: Two types of pronunciation variation are defined: complete (phone change) and partial (sound change). A Standard Malay speech database was built using HTK version 3.2, containing 4 hours of utterances from news broadcasts. Two approaches are evaluated: a probabilistic pronunciation variation dictionary that augments the base-form lexicon, and introducing pronunciation variation information during the decoding process.
Results: The experiments show that there is no clear boundary separating a phone change from a sound change, and the proposed techniques for handling sound change are not as effective as the methods for handling phone change.

Developer: Fadhilah Rosdi and Raja N. Ainon (2008)
Research title: Isolated Malay Speech Recognition Using Hidden Markov Models
Overview: An automatic isolated-word speech recognizer for the Malay language was developed using HMM, focusing on words with a five-phoneme structure such as empat (four), lapan (eight), rekod (record), tidak (no), tujuh (seven) and tutup (close).
Results: The system can identify a spoken word at an average rate of 88.67%, which is acceptable for speech recognition.

Developer: Tian-Swee Tan, Helbin-Liboh, A. K. Ariff, Chee-Ming Ting and Sheikh Hussain Shaikh Salleh (2007)
Research title: Application of Malay Speech Technology in Malay Speech Therapy Assistance Tools
Overview: The project designed training software that assists therapists in diagnosing Malaysian children with stuttering problems. The speech recognition system utilizes HMM techniques; the voice patterns of normal and stuttering children are used to train the HMM models for classifying speech stuttering. A Malay text-to-speech system and a talking head are also utilized in the project.
Results: The study shows that the average correct recognition rate is 96% for normal speech and 90% for artificial stuttered speech.

Existing Speech Recognition Using HMM for Other Languages

In this section, several existing HMM-based speech recognition systems for languages other than English are presented. We choose continuous speech recognizers for under-resourced languages and highlight the objectives, methods and performance of each.

Tamil Speech Recognition

Tamil is a Dravidian language spoken predominantly in the state of Tamil Nadu in India and in Sri Lanka. A study by Thangarajan et al. (2008) aims to build a small-vocabulary word-based and a medium-vocabulary triphone-based speech recognizer for the Tamil language. The Tamil speech recognizer was developed in three modules: the dictionary, the language model and the acoustic model. They are built on CMU Sphinx-4, the fourth version of the Sphinx software from Carnegie Mellon University. This application is a state-of-the-art HMM-based speech recognizer featuring continuous-speech and speaker-independent large-vocabulary recognition, and it is written entirely in the Java programming language.

In this study, a statistical tri-gram language model was built for 341 words and 1700 phonemes. A total of 22.5 hours of continuous speech for training and 7.5 hours for testing were recorded by 25 speakers. For the training stage, CMU's acoustic model trainer, SphinxTrain, was employed for the word-based and triphone-based models. The triphone-based model follows a generic training strategy with the following steps: (1) flat-start monophone training, (2) Baum-Welch training of monophones, (3) triphone creation, (4) training of context-dependent untied models, (5) building decision trees and parameter sharing, and (6) mixture generation. The word-based model follows only the first two steps in its training procedure. Table 3.5 shows the training parameters for both models. Notice that, for the word model, the number of states in the HMM is 20, since the duration of words is longer than that of phones.

Table 3.5: Training parameters

At the end of the training stage, SphinxTrain generates the parameter files of the HMMs, namely the probability distributions and transition matrices of all the HMM models. Then the language model, dictionary and acoustic model were deployed on the Sphinx-4 decoder, configured with the components shown in Table 3.6.

Table 3.6: Components of ASR

The performance of the Tamil speech recognizers is tested for both the word and triphone models in batch mode with three trials: (1) trained voice on trained sentences, (2) trained voice on test sentences, and (3) new voice on test sentences. The results are reported in terms of word error and word accuracy. Word errors are categorized into the numbers of insertions (I), substitutions (S) and deletions (D), and the word accuracy is computed by the following equation, where N is the number of words in the reference transcription:

Word Accuracy = (N - S - D - I) / N x 100%
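The error counts come from aligning the recognized word sequence against the reference with dynamic programming, where the optimal edit distance equals S + D + I. A minimal Python sketch, with invented example sequences:

```python
# A minimal sketch of word accuracy scoring: align the hypothesis
# against the reference by edit distance, whose optimal cost equals
# the number of substitutions (S) + deletions (D) + insertions (I).
def word_accuracy(ref, hyp):
    """Return word accuracy (%) = (N - S - D - I) / N * 100."""
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit cost aligning ref[:i] with hyp[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                               # i deletions
    for j in range(m + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return (n - d[n][m]) / n * 100                # d[n][m] = S + D + I

ref = "satu dua tiga empat".split()
hyp = "satu tiga tiga".split()
print(round(word_accuracy(ref, hyp), 2))          # 50.0
```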

The accuracy of both models is very high for the trained voice with trained sentences: 94.55% for the word model and 93.70% for the triphone model. For the trained voice on test sentences, the results are 71.05% for the word model and 88.82% for the triphone model. Meanwhile, the results for a new voice on test sentences are 70.08% for the word model and 92.06% for the triphone model. From these observations, Thangarajan (2008) pointed out that the word errors are dominated by deletions in the word-based model and by substitutions in the triphone model, and concluded that for medium and large vocabularies a triphone-based approach is best suited to the Tamil language.

Uyghur Speech Recognition

Uyghur is an agglutinative language belonging to the Turkic language family, which is grouped into the Altaic language system. A study by Silamu and Tursun (2009) aimed to develop an HMM-based continuous speech recognition system called UASRS using HTK 3.3 and MS Visual C++ 8.0.

In this study, the text corpus was collected from various sources; the original text corpus includes 30000 sentences. According to the phonetic context, a greedy algorithm was used to choose sentences from the original corpus, and 1018 sentences were selected, each mostly containing 5 to 10 words. According to the characteristics of the Uyghur language, a 5-state HMM topology was used for monophones and triphones and a 3-state HMM topology for silence and short pause, and the acoustic model for each unit was trained using data from 54 speakers. A bigram language model was built using the Uyghur text corpus, including the original text.

The development of the speech recognizer consisted of two parts. The first part included the acoustic modeling, language modeling and the recognizer, based on HTK 3.3, while the second part was the user interface, based on Microsoft Visual Studio 2005. The user interface handles speech data input, speech recognition and result display. The recognizer can take speech input from a microphone or read it from speech files. The input speech is recognized using the Viterbi algorithm and converted to Latin-format Uyghur text. Before the results are displayed, the Latin-format Uyghur text is converted into Arabic-format Uyghur text (Unicode format) by a text converter. The structure of the user interface is shown in Figure 3.8.

Figure 3.8: Structure of the Uyghur speech recognition system

From this study, two sets of experimental results were presented. The first experiment used a test set comprising speech files from 10 speakers. The results in Table 3.7 show that triphone-unit recognition performs better than monophone-unit recognition, demonstrating the effectiveness of triphone modeling.

Table 3.7: Recognition performance for test data

Meanwhile, the second experiment was on real-time recognition, where 6 speakers read 100 sentences chosen randomly from the text corpora using a microphone. Table 3.8 shows the results of the unit-type experiment for real-time recognition. The second experiment shows poorer recognition performance than the first, indicating that the training speech data should include more unit models and that the robustness of the recognizer in noisy environments must be considered for real-time recognition.

Table 3.8: Recognition performance for real-time recognition

Bangla Speech Recognition

The Bangla language is an eastern Indo-Aryan language and the state language of Bangladesh. A study by Hasnat et al. (2007) aimed to present two different types of Bangla speech recognition, isolated and continuous, from the implementation, performance and application perspectives. Both systems are implemented using HTK.

The isolated and continuous speech recognizers applied different strategies for incorporating the language model and dictionary into the overall system model, and they used different styles of data representation for both training and recognition. However, the basic theory of both types of speech recognition system under the pattern recognition approach is quite similar. The entire model is divided into three major parts: (1) speech data extraction or preprocessing, (2) feature extraction, and (3) pattern recognition. In the first part, the researchers used an adaptive filter to eliminate noise from the recorded speech signal. A generalized end point detection algorithm is used to detect the presence of voice and to remove pauses and silences in the background noise. For continuous speech, this algorithm is used only for start and end point detection, while for isolated speech the intermediate noisy and unwanted signals within the speech are also eliminated. In the feature extraction part, MFCC features are used for both recognizers. The final part is pattern recognition, whose tasks are divided into two phases: training and recognition. In the training phase, word-based HMM models are created for isolated speech and phoneme-based HMM models for continuous speech recognition. The following training algorithm is used for creating the HMM models.

Step 1: Initialize λ = (A, B, π), where

λ refers to the HMM model,

π refers to the initial state probabilities,

A refers to the state transition probabilities,

B refers to the emission probabilities.

Step 2: Compute probabilities using λ.

Step 3: Adjust λ.

Step 4: Repeat steps 2 to 3 until convergence.

To create the models for isolated words, the researchers used an HMM prototype with 4 active states and 2 non-emitting states. After initializing the HMM model for each word with HInit from HTK, the models were trained on the feature data set and re-estimated with HRest from HTK. This process produces a separate model for each word in the dictionary.

To create the models for continuous speech, the researchers used an HMM prototype with 3 active states and 2 non-emitting states. The HMM models were initialized using HInit from HTK, and an HTK Master Macro File (MMF) covering all monophones was created from the prototype HMM files. After re-estimation, a separate model for each phoneme is produced.

A vocabulary of 100 words and test samples from five speakers were used. Table 3.9 shows the performance of the isolated and continuous speech recognizers. For isolated speech, the recognizer is capable of recognizing each spoken word in the dictionary only when the words are spoken by the same speaker in the same mood; for a different speaker, the performance decreases by almost 20%. Meanwhile, for continuous speech, Hasnat et al. (2007) state that the average accuracy rate is 85%, which is higher than previous continuous speech recognition results using an ANN (73.36%) or Euclidean distance (80%), and close to a three-layer back-propagation neural network (86.67%). From these observations, the language specification is not significant for isolated speech recognition; however, it is very important for continuous speech recognition, since language-specific issues determine the phoneme set, the phonetically balanced sentences and the regular grammar for the Bangla language.

Table 3.9: Performance analysis for Bangla speech recognizers

Amharic Speech Recognition

Amharic is the official language of communication of the federal government of Ethiopia. A study by Seid and Gambäck (2005) aimed to examine and demonstrate the performance of a hybrid HMM/ANN system for a speaker-independent continuous Amharic speech recognizer.

The recognizer uses phonemes as base units and is built using the HMM/ANN hybrid model approach. Development was performed using the Center for Spoken Language Understanding (CSLU) Toolkit installed on the Microsoft Windows 2000 platform. This toolkit supports the development of HMM- or HMM/ANN-hybrid-based speech recognition systems; it requires a consistent organization and naming of directories and files, which has to be strictly followed.

As Amharic is an under-resourced language, no standard speech corpus has been developed for it. Therefore, the corpus was prepared using SpeechView from CSLU, which is used to record, display, save and edit speech signals in wave format. It also provides spectrograms and other speech-related data such as pitch and energy contours, neural net outputs and phonetic labels. A total of 120 sentences and 778 words from 12 speakers were used. In detail, ten spoken sentences each from ten female speakers were annotated at the phoneme level for the training corpus, and the time-aligned word-level transcriptions were generated automatically. Recordings from two more speakers were annotated for evaluation purposes. Long silences at the beginning and end of each wave file were trimmed off, and the boundaries of the word-level transcriptions were adjusted accordingly.

Seid and Gambäck (2005) explained that after the recognizer was created, the frame vectors were generated automatically in the toolkit and the recognizer was trained on the phone-part files. The ANN of the recognizer contained an output layer with the phone parts, while the input layer was a 180-node grid representing 20 features from each of nine time frames (t ± 4×10 ms).

A total of 20 sentences and 236 words from ten speakers present in the training data were evaluated by the recognizer. The best results obtained were 78.6% word-level and 44.1% sentence-level accuracy, as shown in Table 3.10. Another 20 sentences and 218 words were tested from ten speakers who were not included in the training data; here the recognizer achieved 74.3% word accuracy and 39.7% sentence accuracy, as shown in Table 3.11. Compared to previous work on Amharic using pure HMM-based methods, the relative error reduction is thus 9.92% at the word level and 24.36% at the sentence level. This shows that CSLU can be a good vehicle for developing hybrid HMM/ANN-based recognizers.

Table 3.10: Recognition accuracy on known speakers

Table 3.11: Recognition accuracy on unknown speakers

Vietnamese Speech Recognition

Vietnamese is an Austroasiatic language and has been the official language of Vietnam since 1954. A study by Duc et al. (2003) presented continuous digit recognition for Vietnamese over the telephone line. The corpus used in this study consists of 442 sentences with 2340 words, recorded from 78 female and 130 male speakers.

The recognizers were trained, developed and tested using the CSLU Toolkit. The feature vectors were computed from the hand-labeled training data for each 10 ms frame. The feature set contains features of the frame to be classified and of the frames at -60, -30, 30 and 60 ms relative to it. The feature vectors were used to train a three-layer feed-forward neural network with an error back-propagation procedure. The neural network has 200 hidden nodes; the number of input nodes depends on the number of features, while the number of output nodes depends on the number of categories of each recognizer. The training was adjusted by negative penalty modification and was run for 30 iterations, and the development set was used to evaluate the trained networks and find the best iteration. To improve the recognition results, forward-backward training of the best neural network was used. The researchers explained that in this method the targets used to train the neural network are derived from posterior state occupation probabilities, and the forward-backward re-estimation algorithm is used to regenerate the targets for the training sentences. The network trained on hand-labeled data served as the initial neural network. Unlike other hybrid systems, this hybrid HMM/ANN used within-phone model transitions. Forward-backward training was performed twice, and the best resulting recognizer was used for testing.

In the Vietnamese language, a syllable may be considered a combination of Initial, Final and Tone components. The Initial component is always a consonant, and it may be omitted in some syllables. The Final component may be decomposed into Onset, Nucleus and Coda; the Onset and Coda are optional and may not exist in a syllable. The Nucleus consists of a vowel or a diphthong, and the Coda consists of a consonant or a semivowel.

In the first experiment, the researchers compared the recognition performance of three systems based on different basic speech units. The first recognizer was a context-dependent demisyllable recognizer, in which the Initial was defined as right-dependent to take into account coarticulation effects of the first vowel of the Final on the Initial, and the Final was split into 3 categories. The second recognizer was a context-independent phoneme recognizer, where each phoneme was defined as one part. The last recognizer was a context-dependent phoneme recognizer, where each phoneme was divided into one, two or three parts: the vowels were split into three parts, while all stops were defined as one part because they are very short. All three recognizers were trained, developed and tested using 12 MFCC coefficients with cepstral mean subtraction, plus energy and their delta values. Table 3.12 presents the performance of each recognizer, where WA stands for word-level accuracy in percent and SC indicates sentence-level correctness in percent. The context-dependent phoneme recognizer has better accuracy than the others, demonstrating the effectiveness of context-dependent modeling for continuous Vietnamese digit recognition.

Table 3.12: Recognition performance of three recognizers

In the second experiment, the study focused on the influence of feature extraction on recognition performance. Different feature sets were applied to the context-dependent phoneme recognizer: 13 PLP coefficients and 13 MFCC coefficients were computed with one of two preprocessing techniques, RASTA (RelAtive SpecTrAl) or CMS (Cepstral Mean Subtraction). Table 3.13 shows that the context-dependent phoneme recognizer with 13 PLP coefficients with CMS, plus energy and their delta values, achieves the best results. However, adding second-order delta coefficients to the feature set does not significantly improve the performance of the recognizer.

Table 3.13: Recognition performance of the context-dependent phoneme recognizer with eight different feature sets

Summary

The ultimate goal of most speech recognition research is to develop a system which allows natural communication by speech between people and machines (Wilson, 1987). Nowadays, there are a tremendous number of studies in the field of speech recognition, and speech recognition has been studied for many languages of the world. In 20 years' time, we may look back on the mouse and keyboard as input devices the same way we currently look back on punch cards (Feld, 2010), and see speech recognition systems as the norm.


