Dynamic Bayesian Networks


N. Dyer

Department of Mathematics and Statistics

University of Reading

Part 3 Project (Project Report) N. Dyer

Contents

1 Introduction
2 Markov Processes
  2.1 Absorbing Markov Chains
  2.2 Ergodic Markov Chains
3 Hidden Markov Models
  3.1 The Forward-Backward Algorithm
  3.2 The Viterbi Algorithm
  3.3 The Baum-Welch Algorithm
  3.4 The Discrete Kalman Filter
    3.4.1 Time Update Step
    3.4.2 Observation Update Step
4 Bayesian Networks
  4.1 Causal Networks
  4.2 Bayesian Networks
5 Dynamic Bayesian Networks
  5.1 Hierarchical HMMs (HHMMs)
  5.2 HHMMs in Speech Recognition
6 Conclusion

List of Figures

1 A basic HMM.
2 The most likely state sequences at t = 2.
3 The most likely state sequences at t = 3.
4 The most likely state sequences at t = 4.
5 A serial connection.
6 A diverging connection.
7 A converging connection.
8 A causal network with G instantiated.
9 A causal network with E instantiated.
10 A causal network with a feedback cycle.
11 A Bayesian network with probabilities to specify: P(A), P(B|A), P(C|A,B), P(D|C).
12 A Markov blanket around A.
13 A Bayesian network.
14 A Bayesian network.
15 A Dynamic Bayesian Network.
16 A HMM where the parameters do not vary with time.
17 An HHMM state transition diagram where solid arcs represent horizontal transitions between states, dashed arcs represent vertical transitions (going to a sub-HHMM), red arcs represent emissions from production states and dotted arcs from double-ringed states represent the end of a sub-HHMM (there is one for each sub-HHMM).
18 The HHMM state transition diagram shown in Figure 17, flattened to a regular HMM, where $\epsilon$ represents the empty string.
19 A 3-level HHMM as a DBN, where $X_t^l$ is the state at time t and level l, and $F_t^l$ is 1 if the HMM at level l has finished, 0 otherwise.
20 A DBN for the pronunciation of a single word.
21 A DBN for speech recognition for more than one word.


Abstract

Dynamic Bayesian networks (DBNs) are particularly useful in areas such as robotics and speech recognition. A DBN is a Bayesian network (BN) for a dynamical system (i.e. a sequence of variables), with causal links between time steps. This project will begin by outlining the basics of Markov processes and chains. We are then able to consider hidden Markov models, which become a useful tool when moving on to Bayesian networks and, finally, dynamic Bayesian networks.

1 Introduction

There are several key topics that shall be discussed within this project, namely Markov models, hidden Markov models, Bayesian networks and finally, dynamic Bayesian networks. Markov models are particularly important as not only do they form the foundations required when looking at dynamic Bayesian networks, but they also form part of the motivation behind the need for dynamic Bayesian networks. Bayesian networks themselves were not formally established as a field of study until recently, although Bayesian analysis has been developing for a long time: it is named after the Reverend Thomas Bayes, who developed the famous Bayes' theorem during the development of probability theory in the 18th century (McGrayne, 2011). As such, there have been many variations within the subject; for instance, John Henry Wigmore created Wigmore charts in 1913 to analyse evidence in trials, and there is currently some discussion as to whether it might be beneficial to combine Bayesian networks and Wigmore charts (Leucari, Dawid, and Schum, 2005). Another variation is the path diagram and path analysis, developed by Sewall Wright in the 1920s to explore the effects of hypotheses in phylogenetic studies (Lleras, 2004, pg25). However, these variations are not the primary focus of this project, and we must begin laying the foundations for Bayesian networks by exploring Markov processes.

2 Markov Processes

A Markov process is a probabilistic model used for complex systems (such as weather forecasting and economic modelling), and its key features are states and state transitions (Howard, 2007, pg1). A state describes the condition of the system being modelled; for instance, if we are looking at how a substance reacts to temperature changes, we might be interested in whether it is a solid, a liquid or a gas, in which case there would be three states: solid, liquid and gas. The state is a random variable because it is the variable we are interested in and it is generated by a random observation (such as the outcome of tossing a coin). As each time step passes, the system has the opportunity to change to another state, which is where state transitions come in. Notably, for a first order Markov chain (which is all that shall be considered here), the current state is the only information necessary in order to predict how the process develops, so, whilst past history may be known, it is not used (Tijms, 2007, pg385); in other words

$$P(X_{t+1} = S_i \mid X_t, \ldots, X_0) = P(X_{t+1} = S_i \mid X_t)$$

(where $X_t$ is the state at time $t$ and $S_i$ is a particular state) - this is known as the Markov property. With each state there is a state transition probability, which is the probability of being in state $S_i$ and moving to state $S_j$; this is denoted $a_{ij} = P(S_j \text{ at } t+1 \mid S_i \text{ at } t)$. Thus we can form the transition matrix:

$$A = \{a_{ij}\} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{pmatrix}$$

which satisfies $\sum_{j=1}^{N} a_{ij} = 1$, where all $a_{ij} \geq 0$. Two types of Markov chain that are of particular interest are absorbing and ergodic. In an absorbing Markov chain, the state will become recurrent at the time of absorption (Iosifescu, 1980, pg99) (in other words, state $S_i$ is absorbing if and only if $a_{ii} = 1$ (Kemeny, and Snell, 1960, pg35)), and there may in fact be more than one absorbing state (Tijms, 2007, pg398), whereas an ergodic Markov chain is irreducible and has converging transition probabilities as $n \to \infty$ (Iosifescu, 1980, pg122).

2.1 Absorbing Markov Chains

As stated, for an absorbing Markov chain, the system reaches a particular state (the point of absorption) and "remains in that state permanently" (Tijms, 2007, pg398). Tijms (2007, pg398) also states that absorbing Markov chains can be applied to population models, with the absorbing state being extinction, and to gambling models, where the point of absorption is bankruptcy. As noted above, it is possible for an absorbing Markov chain to have more than one absorbing state, so we need a way of calculating the probabilities of ending up in each of these states. This is demonstrated in an example from Tijms (2007, pg400), which has been adapted to create the following:

Example 2.1. Suppose Daniel, D, wishes to play in a competition of noughts and crosses against the computer, C. The competition will continue until one of them has won two consecutive games. For the first game (and any game following a draw), Daniel will: win with probability 0.3, draw with probability 0.5 and lose with probability 0.2. If Daniel has won a game, then in the following game he will: win with probability 0.5, draw with probability 0.35 and lose with probability 0.15. If, however, Daniel loses a game, then in the following game he will: win with probability 0.2, draw with probability 0.5 and lose with probability 0.3. Thus we can calculate:

- the probability that the competition will last more than 5 games;
- the probability that Daniel will eventually win;
- how many games the competition is expected to last for.

We form an absorbing Markov chain with two absorbing states (one being the case where D has won the competition and the other being the case where C has won the competition). Let us say the system is in state $S_1 = (1,D)$ if Daniel has won one game, $S_2 = (2,D)$ if Daniel has won two games in a row, $S_0 = 0$ if the first game is about to commence or if the previous game was a draw, $S_3 = (1,C)$ if the computer has won one game and $S_4 = (2,C)$ if the computer has won two games in a row. Then we have the random variable $X_n$, which is the system state after $n$ games. We have thus formed a process $\{X_n\}$ with five states, so we can form the transition matrix (with states ordered $0, (1,D), (1,C), (2,D), (2,C)$, rows giving the current state and columns the next state):

$$A = \begin{pmatrix} 0.5 & 0.3 & 0.2 & 0 & 0 \\ 0.35 & 0 & 0.15 & 0.5 & 0 \\ 0.5 & 0.2 & 0 & 0 & 0.3 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$

Then let $L$ be the random variable for the number of games in a competition and let $r = 5$. For $L > r$ the Markov chain must not visit either state $(2,D)$ or state $(2,C)$ in the first $r$ games, i.e.

$$P(L > r) = P(X_k \neq (2,D), (2,C) \text{ for } k = 1,\ldots,r \mid X_0 = 0) = P(X_r \neq (2,D), (2,C) \mid X_0 = 0) = 1 - p^{(r)}_{0,(2,D)} - p^{(r)}_{0,(2,C)},$$

so we need only multiply $A$ by itself $r$ times in order to find $P(L > r)$. So if $r = 5$ then, to four decimal places,

$$A^5 = \begin{pmatrix} 0.2464 & 0.1099 & 0.0803 & 0.3947 & 0.1688 \\ 0.1390 & 0.0619 & 0.0454 & 0.6501 & 0.1035 \\ 0.1844 & 0.0822 & 0.0600 & 0.2863 & 0.3871 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$

So after 5 games the probability that the competition will still be ongoing is

$$P(L > 5) = 1 - 0.3947 - 0.1688 = 0.4365 \; (4 \text{ d.p.}).$$

However, using this method to find the eventual absorption probabilities (by raising $A$ to higher and higher powers until $A^r = A^{r+1}$) is very inefficient, as $r$ could be very large. Instead we can solve a system of linear equations. We define $f_{S_k}$ as the probability that Daniel will be the competition winner when starting in state $S_k$ ($S_k$ can be $0$, $(1,D)$, $(2,D)$, $(1,C)$ or $(2,C)$). It is clear that $f_{(2,D)} = 1$ and that $f_{(2,C)} = 0$. For a general $f_{S_k}$, though, either state $(2,D)$ is reached directly, or it is reached via some other state $S_l$. We then note that the joint probability of passing from $S_k$ to $S_l$ and then from $S_l$ to $(2,D)$ is given by $a_{S_k S_l} f_{S_l}$, as they are independent events. Using the law of conditional probabilities we see that $f_{S_k} = \sum_{S_l} a_{S_k S_l} f_{S_l}$. Thus we have

$$\begin{aligned}
f_0 &= 0.5 f_0 + 0.3 f_{(1,D)} + 0.2 f_{(1,C)} \\
f_{(1,D)} &= 0.35 f_0 + 0.15 f_{(1,C)} + 0.5 f_{(2,D)} \\
f_{(1,C)} &= 0.5 f_0 + 0.2 f_{(1,D)} + 0.3 f_{(2,C)}
\end{aligned}$$

where, as previously stated, $f_{(2,D)} = 1$ and $f_{(2,C)} = 0$. We then solve this set of linear equations to find that $f_0 = \tfrac{340}{487} = 0.6982$, $f_{(1,D)} = \tfrac{400}{487} = 0.8214$ and $f_{(1,C)} = \tfrac{250}{487} = 0.5133$ (all to 4 d.p.).

Thus the probability that Daniel will win the competition is 0.6982 (to 4 d.p.). Now let $\mu_{S_k}$ be the expected number of games that need to be played before the competition ends, starting from state $S_k$. The state will change to some state $S_l$ with probability $a_{S_k S_l}$, and the number of games still to be played from there until the point of absorption is expected to be $\mu_{S_l}$. Hence we see that $\mu_{S_k} = \sum_{S_l} (1 + \mu_{S_l}) a_{S_k S_l}$, which gives

$$\begin{aligned}
\mu_0 &= 1 + 0.5\mu_0 + 0.3\mu_{(1,D)} + 0.2\mu_{(1,C)} \\
\mu_{(1,D)} &= 1 + 0.35\mu_0 + 0.15\mu_{(1,C)} + 0.5\mu_{(2,D)} \\
\mu_{(1,C)} &= 1 + 0.5\mu_0 + 0.2\mu_{(1,D)} + 0.3\mu_{(2,C)} \\
\mu_{(2,D)} &= \mu_{(2,C)} = 0.
\end{aligned}$$

In order to find the total expected number of games before the competition ends we must find $\mu_0$, which can be found by solving the above system of linear equations: $\mu_0 = 6.3860$, $\mu_{(1,D)} = 3.9836$ and $\mu_{(1,C)} = 4.9897$ (to 4 d.p.). Thus we expect the competition to last approximately 6.4 games on average.
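The calculations above can also be checked numerically. The following short Python sketch (my own illustration, not part of the original report) reproduces the three quantities of Example 2.1 with NumPy; rather than solving the linear equations by hand, it uses the standard fundamental-matrix approach $N = (I - Q)^{-1}$, where $Q$ is the block of transitions between the three transient states.

```python
# Sketch: absorption analysis for the noughts-and-crosses chain of Example 2.1.
import numpy as np

# States ordered as: 0, (1,D), (1,C), (2,D), (2,C)
A = np.array([
    [0.50, 0.30, 0.20, 0.00, 0.00],   # from 0
    [0.35, 0.00, 0.15, 0.50, 0.00],   # from (1,D)
    [0.50, 0.20, 0.00, 0.00, 0.30],   # from (1,C)
    [0.00, 0.00, 0.00, 1.00, 0.00],   # from (2,D), absorbing
    [0.00, 0.00, 0.00, 0.00, 1.00],   # from (2,C), absorbing
])

A5 = np.linalg.matrix_power(A, 5)
print("P(L > 5) =", 1 - A5[0, 3] - A5[0, 4])            # ~0.4365

# Fundamental matrix N = (I - Q)^{-1}, with Q the transient-to-transient block.
Q, R = A[:3, :3], A[:3, 3:]
N = np.linalg.inv(np.eye(3) - Q)
print("P(Daniel wins | start) =", (N @ R)[0, 0])         # ~0.6982
print("Expected number of games =", N.sum(axis=1)[0])    # ~6.386
```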

The above example has demonstrated the key aspects of an absorbing Markov chain, such as how we can find the probability of absorption into each absorbing state, which allows us to move on to ergodic Markov chains.


2.2 Ergodic Markov Chains

A Markov chain is ergodic (or irreducible) if it is possible to move from each state to every other state (Anderson, 2008, pg1). This means it cannot be absorbing, as it will never be forced to remain in just one state. Notably, ergodic Markov chains can be classified as one of two types, either regular or cyclic. A regular Markov chain occurs if there is some power, $n$, of the transition matrix, $A$, such that all entries, $a_{ij}$, are positive. This power, $n$, is called the order of the Markov chain. For example, suppose a Markov chain had the transition matrix

$$A_1 = \begin{pmatrix} 0 & 0.5 & 0.5 \\ 0.75 & 0.1 & 0.15 \\ 0.25 & 0.4 & 0.35 \end{pmatrix} \quad\Rightarrow\quad A_1^2 = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.1125 & 0.445 & 0.4425 \\ 0.3875 & 0.305 & 0.3075 \end{pmatrix}.$$

So it is clear that for $A_1^2$ all entries are positive, hence $A_1$ is both ergodic and regular, and we see that $A_1$ has order 2. If, however, a Markov chain had the transition matrix

$$A_2 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},$$

then

$$A_2^2 = A_2^4 = \cdots = A_2^{2n} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \quad\text{and}\quad A_2 = A_2^3 = \cdots = A_2^{2n+1} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

where $n \in \mathbb{N}$, so we see that $A_2$ is ergodic but not regular, which means that $A_2$ must be cyclic. A cyclic Markov chain is defined by Kemeny, and Snell (1960, pg37) as a chain with period, $d$, with states subdivided into $d$ cyclic sets if $d > 1$, which "will move through its cyclic sets in a definite order, returning to the starting state after $d$ steps". As previously mentioned, these Markov chains can be used to predict things such as the weather, so let us see how this might work:

Example 2.2. Suppose we wanted to build a basic model of the weather in Reading. Let us consider three states, which are monitored every 12 hours: rain, R, cloud, C, and sun, S. Now suppose we have the following transition matrix for this (with states ordered S, C, R):

$$A = \begin{pmatrix} 0.45 & 0.35 & 0.2 \\ 0.4 & 0.2 & 0.4 \\ 0.15 & 0.45 & 0.4 \end{pmatrix}.$$

We can now calculate the probability of the chain being in any of these states after $n$ steps (where each step represents 12 hours) by defining $u$ as the probability vector representing the probability distribution at the start and then using the formula

$$u^{(n)} = u A^n$$

as explained by Grinstead, and Snell (1998, pg409). For instance, suppose we wish to calculate the probability that the weather will be sunny in 2.5 days' time (that is, 60 hours, so $n = 5$) if at the starting point it is raining. We begin by calculating the probability distribution with $n = 5$:

$$u^{(5)} = u A^5 = (0.15, 0.45, 0.4) \begin{pmatrix} 0.3338 & 0.3333 & 0.3329 \\ 0.3334 & 0.3331 & 0.3334 \\ 0.3327 & 0.3336 & 0.3337 \end{pmatrix} = (0.3332, 0.3333, 0.3334) \quad \text{(to 4 d.p.)}$$


Thus we see that the probability that it will be sunny in two and a half days, if it is raining to begin with, is 0.3332 to 4 decimal places. We also note that all of the entries in the vector $u^{(5)}$ are approximately the same, which means we have (approximately) reached a stationary distribution, and this happens for large enough $n$ regardless of the starting distribution. This makes sense in the context of our example, since the further ahead in time you look, the less informative the initial observation becomes, so there will be a point in time where it is roughly equally likely to rain, be sunny or be cloudy, regardless of our initial data.

3 Hidden Markov Models

Hidden Markov models (HMMs) are similar to Markov chains; in each state a visible observation is generated, but the actual state cannot be observed (Jensen, 2001, pg65). A HMM is a sequence of states $X_0, X_1, \ldots, X_t$, up to time $t$, where each $X_t$ can be any one of a number of states, $S = \{S_0, S_1, \ldots, S_N\}$, and which generates a sequence of observed variables, $Y_0, Y_1, \ldots, Y_t$, where each $Y_t$ can be any of the possible observations $O = \{O_0, O_1, \ldots, O_M\}$, according to given probability distributions. We can represent this chain of events up to some time, $T$, in the form of a Bayesian network (which will be looked at later), where shaded nodes are observed and clear nodes are hidden:

Figure 1: A basic HMM.

A basic example is as follows:

Example 3.1. Suppose we have three biased coins, A, B and C, with the following probabilities:

        P(Head)   P(Tail)
  A      0.6       0.4
  B      0.35      0.65
  C      0.8       0.2

Then suppose we let $Z$ be A or B and we toss it ($Z$ is A or B at the start with equal probability), write down the result and then toss coin C. If C shows a head, then $Z$ changes to A if it was B, and to B if it was A. If C shows a tail, then $Z$ remains the same. We then repeat this process until we are satisfied. If we denote a head for A as H, a tail for A as T, a head for B as h and a tail for B as t, then we can compute the transition matrix for this Markov chain (with states ordered H, T, h, t):

$$A = \begin{pmatrix} 0.12 & 0.08 & 0.28 & 0.52 \\ 0.12 & 0.08 & 0.28 & 0.52 \\ 0.48 & 0.32 & 0.07 & 0.13 \\ 0.48 & 0.32 & 0.07 & 0.13 \end{pmatrix}.$$

Now imagine that the data we receive from this experiment do not distinguish between H and h, or between T and t. Then we have a hidden Markov model. Suppose we were given the sequence of data $Y = \{Y_1 = H, Y_2 = T, Y_3 = H\}$; we do not know which coin was used at each point, so we would want to calculate the probability of a particular coin being used at each point.
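To make the construction of this transition matrix explicit, here is a small Python sketch (my own, not part of the report) that rebuilds the 4x4 matrix of Example 3.1 from the three coin probabilities; the state order H, T, h, t and the helper function `next_probs` are assumptions made for illustration.

```python
# Sketch: building the Example 3.1 transition matrix from the coin probabilities.
import numpy as np

p_A = {"H": 0.6, "T": 0.4}      # coin A emission probabilities
p_B = {"h": 0.35, "t": 0.65}    # coin B emission probabilities
p_switch = 0.8                  # coin C shows a head, so we swap between A and B

def next_probs(current_coin):
    """Distribution over the next outcome (H, T, h, t) given the coin just used."""
    stay, swap = 1 - p_switch, p_switch
    if current_coin == "A":
        return [stay * p_A["H"], stay * p_A["T"], swap * p_B["h"], swap * p_B["t"]]
    return [swap * p_A["H"], swap * p_A["T"], stay * p_B["h"], stay * p_B["t"]]

A = np.array([next_probs("A"), next_probs("A"), next_probs("B"), next_probs("B")])
print(A)   # matches the 4x4 matrix given in Example 3.1
```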


In general, a HMM consists of five main elements, described by Dymarski (2011, pg6) as:

1. The $N$ model states, $S = \{S_1, \ldots, S_N\}$.

2. The $M$ observation symbols for each state, $O = \{O_1, \ldots, O_M\}$. For continuous observations, $M$ is infinite.

3. A state transition matrix, $A = \{a_{ij}\}$, $a_{ij} = P(S_j \text{ at } t+1 \mid S_i \text{ at } t)$, as described in the previous section.

4. The probability distribution of $O$ for each state (otherwise known as the emission probabilities), $B = \{b_j(k)\}$, where $b_j(k)$ is the probability that $O_k$ is the observation at time $t$ if the model is in state $S_j$ at time $t$, and

$$b_j(k) = P(Y_t = O_k \mid X_t = S_j), \quad 1 \leq j \leq N, \; 1 \leq k \leq M,$$

where $O_k$ denotes the $k$th observation symbol, $Y_t$ is the current observation and $X_t$ is the current state. The constraints

$$b_j(k) \geq 0, \quad 1 \leq j \leq N, \; 1 \leq k \leq M, \qquad \text{and} \qquad \sum_{k=1}^{M} b_j(k) = 1$$

must also be satisfied. However, if we are working with continuous data we must instead use a continuous probability density function, whose parameters we specify. This is usually approximated using a weighted sum of $M$ Gaussian distributions $N$:

$$b_j(Y_t) = \sum_{m=1}^{M} c_{jm} N(\mu_{jm}, \Sigma_{jm}; Y_t)$$

where the $c_{jm}$ are weighting coefficients, the $\mu_{jm}$ are mean vectors and the $\Sigma_{jm}$ are covariance matrices. The constraints $c_{jm} \geq 0$, $1 \leq j \leq N$, $1 \leq m \leq M$, and $\sum_{m=1}^{M} c_{jm} = 1$ must also be satisfied by the $c_{jm}$.

5. The initial state distribution $u = \{u_1, \ldots, u_N\}$, where $u_i = P(X_0 = S_i)$, $1 \leq i \leq N$, is the probability that the model is in state $S_i$ at time $t = 0$.

However, HMMs present three key problems, as described by Dymarski (2011, pg8):

1. The Evaluation Problem. For a HMM, $\lambda$, how do we calculate the probability of the observations $Y = \{Y_1 = O_i, Y_2 = O_j, \ldots, Y_T = O_k\}$ being generated, i.e. $P(Y \mid \lambda)$?

2. The Decoding Problem. If we have observations $Y = \{Y_1 = O_i, Y_2 = O_j, \ldots, Y_T = O_k\}$ for a model, $\lambda$, then what is the most likely state sequence?

3. The Learning Problem. How should the model parameters be adjusted to maximise $P(Y \mid \lambda)$?

These three problems are solved using a series of algorithms, namely: the forward-backward algorithm, the Viterbi algorithm and the Baum-Welch algorithm.


3.1 The Forward-Backward Algorithm

The forward-backward algorithm is a combination of the forward algorithm and the backward algorithm, which enables us to find $P(Y_{1:T} \mid \lambda)$. We start with the forward algorithm, with forward variable, $\alpha_t(X_t)$, defined by Rabiner, and Juang (1986, pg9) as

$$\alpha_t(S_i) = P(Y_1, Y_2, \ldots, Y_t, X_t = S_i \mid \lambda);$$

this is the probability of the observation sequence up until time $t$, together with the state $S_i$, given that we have a HMM, $\lambda$. Then we have that

$$\alpha_1(S_i) = P(Y_1, X_1 = S_i \mid \lambda) = u_i b_i(Y_1), \quad 1 \leq i \leq N,$$

which is the initial probability of being in state $S_i$ multiplied by the emission probability of the observation $Y_1$ from state $S_i$. Rabiner, and Juang (1986, pg9) then use induction to show that, for $t = 1, 2, \ldots, T-1$,

$$\begin{aligned}
\alpha_{t+1}(S_j) &= P(Y_1, Y_2, \ldots, Y_{t+1}, X_{t+1} = S_j) \\
&= \sum_{S_i} P(Y_1, Y_2, \ldots, Y_{t+1}, X_t = S_i, X_{t+1} = S_j) \\
&= \sum_{S_i} P(Y_1, Y_2, \ldots, Y_t, X_t = S_i)\, a_{ij}\, b_j(Y_{t+1}) \\
&= \left[ \sum_{i=1}^{N} \alpha_t(S_i)\, a_{ij} \right] b_j(Y_{t+1}), \quad 1 \leq j \leq N, \; 1 \leq t \leq T-1,
\end{aligned}$$

and thus

$$P(Y_{1:T} \mid \lambda) = \sum_{i=1}^{N} \alpha_T(S_i).$$

The latter of these results follows because, by the original definition,

$$\alpha_T(S_i) = P(Y_1, Y_2, \ldots, Y_T, X_T = S_i \mid \lambda).$$

Similarly, for the backward algorithm, Rabiner, and Juang (1986, pg9) define the backward variable, $\beta_t(S_i)$, as

$$\beta_t(S_i) = P(Y_{t+1}, Y_{t+2}, \ldots, Y_T \mid X_t = S_i, \lambda)$$

and state that $\beta_T(S_i) = 1$, $1 \leq i \leq N$, which allows the solution for $\beta_t(S_i)$ to be found similarly by induction, giving

$$\beta_t(S_i) = \sum_{j=1}^{N} a_{ij}\, b_j(Y_{t+1})\, \beta_{t+1}(S_j).$$

Here the backward algorithm gives the probability of the partial observation sequence $Y_{t+1}, Y_{t+2}, \ldots, Y_T$, given that, for HMM $\lambda$, the state $S_i$ occurred at time $t$. Thus combining the two algorithms allows us to find the probability of being in a particular state at time $t$ given the observations $Y_1, Y_2, \ldots, Y_T$.
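A minimal NumPy sketch of the two recursions just described is given below; it is my own illustration rather than code from the report, and it assumes the observations are supplied as integer indices into the columns of the emission matrix `B` (so `B[j, k]` plays the role of $b_j(O_k)$).

```python
# Sketch: forward and backward recursions for a discrete HMM.
import numpy as np

def forward(u, A, B, obs):
    T, N = len(obs), len(u)
    alpha = np.zeros((T, N))
    alpha[0] = u * B[:, obs[0]]                        # alpha_1(i) = u_i b_i(Y_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # induction step
    return alpha                                       # P(Y | lambda) = alpha[-1].sum()

def backward(A, B, obs):
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))                             # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```

Multiplying `alpha[t]` by `beta[t]` element-wise (and normalising) then gives the probability of each state at time $t$ given the whole observation sequence, which is the quantity mentioned above.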

3.2 The Viterbi Algorithm

The Viterbi algorithm allows us to calculate the most likely sequence of states, up to time $t$, given our observation data. In other words, we can calculate $X^* = \arg\max_{X} P(X \mid Y)$.


We can do this by defining the auxiliary variable

$$\begin{aligned}
\delta_t(S_j) &= \max_{X_1,\ldots,X_{t-1}} P(X_1, X_2, \ldots, X_t, Y_1, Y_2, \ldots, Y_t) \\
&= \max_{X_1,\ldots,X_{t-1}} \left[ P(Y_t \mid X_t) P(X_t \mid X_{t-1}) P(X_{1:t-1}, Y_{1:t-1}) \right] \quad \text{by the Markov property} \\
&= \max_{X_{t-1}} P(Y_t \mid X_t) P(X_t \mid X_{t-1}) \max_{X_1,\ldots,X_{t-2}} \left[ P(X_{1:t-1}, Y_{1:t-1}) \right],
\end{aligned}$$

which gives us the recursion relationship

$$\begin{aligned}
\delta_t(S_j) &= \max_{X_{t-1}} \left[ P(Y_t \mid X_t = S_j)\, P(X_t = S_j \mid X_{t-1})\, \delta_{t-1}(X_{t-1}) \right] \\
&= b_j(Y_t) \max_{i} \left[ \delta_{t-1}(S_i)\, a_{ij} \right],
\end{aligned}$$

with $\delta_1(S_j) = u_j b_j(Y_1)$ for $1 \leq j \leq N$ (Dymarski, 2011, pg10). Then to find $X^*$ we make a note of the most likely paths at each time step. This is best demonstrated by example.

Example 3.2. Suppose that we have a set of states, $S = \{x, y, z\}$, a set of observations, $O = \{a, b, c\}$, an initial distribution, $u = (0.3, 0.4, 0.3)$, and the following transition matrix, $A$ (rows and columns ordered $x, y, z$), and emission matrix, $B$ (rows ordered $a, b, c$, columns ordered $x, y, z$):

$$A = \begin{pmatrix} 0.1 & 0.4 & 0.5 \\ 0.3 & 0.3 & 0.4 \\ 0.6 & 0.3 & 0.1 \end{pmatrix}, \qquad B = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.1 & 0.6 & 0.3 \\ 0.2 & 0.2 & 0.6 \end{pmatrix}.$$

Then suppose we receive the observation sequence $acbb$ and we wish to calculate the most likely state sequence that occurred. We begin by calculating values for $\delta_1(X_1) = P(X_1) P(Y_1 \mid X_1)$:

$$\begin{aligned}
\delta_1(x) &= u_x P(a \mid x) = 0.3 \times 0.7 = 0.21 \\
\delta_1(y) &= u_y P(a \mid y) = 0.4 \times 0.2 = 0.08 \\
\delta_1(z) &= u_z P(a \mid z) = 0.3 \times 0.1 = 0.03.
\end{aligned}$$

So we see that the most likely state at $t = 1$ is $x$. Next we calculate values for $\delta_2(X_2) = \max_{X_1} \left[ P(Y_2 \mid X_2) P(X_2 \mid X_1) \delta_1(X_1) \right]$:

$$\begin{aligned}
\delta_2(x) &= \max_{X_1} \left[ P(c \mid x) P(x \mid X_1) \delta_1(X_1) \right] = \max \begin{Bmatrix} 0.2 \times 0.1 \times 0.21 = 0.0042 \\ 0.2 \times 0.4 \times 0.08 = 0.0064 \\ 0.2 \times 0.5 \times 0.03 = 0.003 \end{Bmatrix} = 0.0064 \\
\delta_2(y) &= \max_{X_1} \left[ P(c \mid y) P(y \mid X_1) \delta_1(X_1) \right] = \max \begin{Bmatrix} 0.2 \times 0.3 \times 0.21 = 0.0126 \\ 0.2 \times 0.3 \times 0.08 = 0.0048 \\ 0.2 \times 0.4 \times 0.03 = 0.0024 \end{Bmatrix} = 0.0126 \\
\delta_2(z) &= \max_{X_1} \left[ P(c \mid z) P(z \mid X_1) \delta_1(X_1) \right] = \max \begin{Bmatrix} 0.6 \times 0.6 \times 0.21 = 0.0756 \\ 0.6 \times 0.3 \times 0.08 = 0.0144 \\ 0.6 \times 0.1 \times 0.03 = 0.0018 \end{Bmatrix} = 0.0756.
\end{aligned}$$

Thus we see that the most likely sequence of states so far, given the first two observations of $acbb$, is $xz$. We can also represent these chains of events graphically, adding to the diagram at the end of each time step (see Figure 2).

Figure 2: The most likely state sequences at t = 2.

We continue for the next time step:

$$\begin{aligned}
\delta_3(x) &= \max_{X_2} \left[ P(b \mid x) P(x \mid X_2) \delta_2(X_2) \right] = \max \begin{Bmatrix} 0.1 \times 0.1 \times 0.0064 = 0.000064 \\ 0.1 \times 0.4 \times 0.0126 = 0.000504 \\ 0.1 \times 0.5 \times 0.0756 = 0.00378 \end{Bmatrix} = 0.00378 \\
\delta_3(y) &= \max_{X_2} \left[ P(b \mid y) P(y \mid X_2) \delta_2(X_2) \right] = \max \begin{Bmatrix} 0.6 \times 0.3 \times 0.0064 = 0.001152 \\ 0.6 \times 0.3 \times 0.0126 = 0.002268 \\ 0.6 \times 0.4 \times 0.0756 = 0.018144 \end{Bmatrix} = 0.018144 \\
\delta_3(z) &= \max_{X_2} \left[ P(b \mid z) P(z \mid X_2) \delta_2(X_2) \right] = \max \begin{Bmatrix} 0.3 \times 0.6 \times 0.0064 = 0.001152 \\ 0.3 \times 0.3 \times 0.0126 = 0.001134 \\ 0.3 \times 0.1 \times 0.0756 = 0.002268 \end{Bmatrix} = 0.002268.
\end{aligned}$$

So now our most likely state sequence is $xzy$, which is represented in Figure 3.

Figure 3: The most likely state sequences at t = 3.

Finally we calculate the last set of probabilities at $t = 4$:

$$\begin{aligned}
\delta_4(x) &= \max_{X_3} \left[ P(b \mid x) P(x \mid X_3) \delta_3(X_3) \right] = \max \begin{Bmatrix} 0.1 \times 0.1 \times 0.00378 = 0.0000378 \\ 0.1 \times 0.4 \times 0.018144 = 0.00072576 \\ 0.1 \times 0.5 \times 0.002268 = 0.0001134 \end{Bmatrix} = 0.00072576 \\
\delta_4(y) &= \max_{X_3} \left[ P(b \mid y) P(y \mid X_3) \delta_3(X_3) \right] = \max \begin{Bmatrix} 0.6 \times 0.3 \times 0.00378 = 0.0006804 \\ 0.6 \times 0.3 \times 0.018144 = 0.00326592 \\ 0.6 \times 0.4 \times 0.002268 = 0.00054432 \end{Bmatrix} = 0.00326592 \\
\delta_4(z) &= \max_{X_3} \left[ P(b \mid z) P(z \mid X_3) \delta_3(X_3) \right] = \max \begin{Bmatrix} 0.3 \times 0.6 \times 0.00378 = 0.0006804 \\ 0.3 \times 0.3 \times 0.018144 = 0.00163296 \\ 0.3 \times 0.1 \times 0.002268 = 0.00006804 \end{Bmatrix} = 0.00163296.
\end{aligned}$$

Thus we now have Figure 4 and the most likely state sequence for the observation sequence acbb

is xzyy.

Figure 4: The most likely state sequences at t = 4.
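The same recursion can be written compactly in code. The sketch below is my illustration of the Viterbi algorithm described above (not code from the report); it follows the Section 2 convention that row $i$ of the transition matrix holds the probabilities of moving from state $S_i$, and it keeps back-pointers so that the most likely state sequence can be recovered at the end.

```python
# Sketch: Viterbi algorithm for a discrete HMM (obs are integer symbol indices).
import numpy as np

def viterbi(u, A, B, obs):
    T, N = len(obs), len(u)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)             # back-pointers
    delta[0] = u * B[:, obs[0]]                   # delta_1(j) = u_j b_j(Y_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)            # best previous state for each j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                 # trace the pointers backwards
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()
```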

3.3 The Baum-Welch Algorithm

The Baum-Welch algorithm is also a type of forward-backward algorithm, which aims to maximise the probability of an observation sequence. We shall follow the explanation of Rabiner, and Juang (1986, pg11) for this section and begin with the definition of $\xi_t(S_i, S_j)$,

$$\xi_t(S_i, S_j) = P(X_t = S_i, X_{t+1} = S_j \mid Y, \lambda) = \frac{\alpha_t(S_i)\, a_{ij}\, b_j(Y_{t+1})\, \beta_{t+1}(S_j)}{P(Y \mid \lambda)},$$

and of $\gamma_t(S_i)$,

$$\gamma_t(S_i) = P(X_t = S_i \mid Y, \lambda) = \frac{\alpha_t(S_i)\, \beta_t(S_i)}{P(Y \mid \lambda)},$$

which relate to each other through

$$\gamma_t(S_i) = \sum_{j=1}^{N} \xi_t(S_i, S_j).$$

Summing $\gamma_t(S_i)$ over $t$ gives the expected number of transitions made from state $S_i$, and summing $\xi_t(S_i, S_j)$ over $t$ gives the expected number of transitions from $S_i$ to $S_j$; in other words,

$$\sum_{t=1}^{T-1} \gamma_t(S_i) = \text{expected number of transitions from } S_i,$$

$$\sum_{t=1}^{T-1} \xi_t(S_i, S_j) = \text{expected number of transitions from } S_i \text{ to } S_j.$$

We can now calculate the re-estimation formulas for $u$, $A$ and $B$:

$$\bar{u}_{S_i} = \gamma_1(S_i), \quad 1 \leq i \leq N,$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(S_i, S_j)}{\sum_{t=1}^{T-1} \gamma_t(S_i)},$$

$$\bar{b}_j(O_k) = \frac{\sum_{t=1,\, Y_t = O_k}^{T} \gamma_t(S_j)}{\sum_{t=1}^{T} \gamma_t(S_j)},$$

where $\bar{u}_{S_i}$ is the probability of being in state $S_i$ at time $t = 1$, $\bar{a}_{ij}$ is the expected number of transitions from $S_i$ to $S_j$ divided by the expected number of transitions from $S_i$, and $\bar{b}_j(O_k)$ is the expected number of times of being in $S_j$ and observing $O_k$ divided by the expected number of times of being in $S_j$. Now define the re-estimated model, $\bar{\lambda}$, with parameters $\bar{u}_{S_i}$, $\bar{a}_{ij}$ and $\bar{b}_j(O_k)$. It can be shown that either

- the initial model, $\lambda$, defines a critical point of the probability function, in which case $\bar{\lambda} = \lambda$, or

- $P(Y \mid \bar{\lambda}) > P(Y \mid \lambda)$, in which case we use $\bar{\lambda}$ instead of $\lambda$ and repeat the re-estimation calculation until we reach a limiting point.

Thus we have found the estimated model and have completed the algorithm.
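As an illustration (again my own, not the report's), a single Baum-Welch re-estimation pass can be written on top of the forward and backward sketches given after Section 3.1; the `forward` and `backward` functions below refer to that earlier sketch, and the observations are assumed to be integer symbol indices.

```python
# Sketch: one Baum-Welch re-estimation step, reusing forward() and backward()
# from the earlier forward-backward sketch.
import numpy as np

def baum_welch_step(u, A, B, obs):
    alpha, beta = forward(u, A, B, obs), backward(A, B, obs)
    T, N = len(obs), len(u)
    likelihood = alpha[-1].sum()                              # P(Y | lambda)
    gamma = alpha * beta / likelihood                         # gamma_t(i)
    xi = np.zeros((T - 1, N, N))                              # xi_t(i, j)
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / likelihood
    u_new = gamma[0]                                          # u-bar
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # a-bar_ij
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):                               # b-bar_j(O_k)
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return u_new, A_new, B_new
```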

3.4 The Discrete Kalman Filter

As we have seen, the forward-backward, Viterbi and Baum-Welch algorithms are all very valuable, but they only allow us to look at discrete state HMMs. If we wish to consider similar problems for HMMs with continuous state variables and observations ($S_i$ and $O_j$ are now vectors), with Gaussian (normal) distributions and a set of discrete times, then the Kalman filter is required. Each state $X_t$ can no longer be a discrete state, such as $S_4$, as before; now each $X_t$ has an $m \times 1$ mean vector $\hat{X}_t$ and an $m \times m$ covariance matrix, $C_t$, where $m$ is the number of parameters that describe the state. Similarly, each observation $Y_t$ is no longer discrete and is now an $n \times 1$ vector. This new method of describing the states is known as a state-space model or SSM. We would want to use a HMM with continuous state variables to predict things such as the position of an airplane in dense fog or cloud, where the parameters may be the $x$, $y$ and $z$ components describing position, as well as perhaps a parameter for velocity or acceleration. Effectively, the Kalman filter is "an estimator used to estimate the state of a linear dynamic system perturbed by Gaussian white noise using measurements that are linear functions of the system state but corrupted by additive Gaussian white noise" (Grewel, and Andrews, 2001, pg22). White noise means that there is no connection between the noise and time, so knowledge of its value at a particular time will not help you predict its value at any other time (Maybeck, 1979, pg7). For the Kalman filter we require a linear transition function for the state transition from $t$ to $t+1$ and a linear function that describes the relationship between state and observation; we will use those described by Funk (2003, section 2.3.1). The linear transition function is

$$X_{t+1} = A X_t + w_t$$

where $A$ is the state transition matrix and $w_t$ is the noise at time $t$. The noise term, $w_t$, is independent of $X_t$ and has the Gaussian probability distribution

$$P(w) \sim N(0, Q)$$


where $0$ is the mean and $Q$ is the covariance matrix, known as the process noise covariance matrix. For the relationship between observation and state we have the linear equation

$$Y_t = B X_t + v_t$$

where $B$ is the emission matrix and $v_t$ is noise with a normal distribution

$$P(v) \sim N(0, R),$$

which has zero mean and covariance matrix, $R$, known as the measurement noise covariance matrix. We might wish to find the expected observation given all the previous observations, $E[Y_{t+1} \mid Y_0, \ldots, Y_t]$, but this requires us to first calculate the expected state at $t+1$ given all of the previous observations, $E[X_{t+1} \mid Y_0, \ldots, Y_t]$ - this is similar to the previous decoding problem for discrete HMMs. So again we can use recursive relations, but here two steps, as described by Funk (2003, pg8), are required within each recursion:

1. The time update step: computes a forecast for the value of $X_{t+1}$, denoted $X'_{t+1}$, and a forecast for its covariance matrix, denoted $C'_{t+1}$.

2. The observation update step: computes $\hat{X}_{t+1}$ and $C_{t+1}$ given $Y_{t+1}$.

3.4.1 Time Update Step

This section will use the derivation described by Leondes (1970, pg48). As stated, within the time update step we aim to find forecasts for $X_{t+1}$ and its covariance matrix $C_{t+1}$. Let $\hat{X}_t$ be the best estimate of $X_t$; then

$$\hat{X}_t = X_t + \epsilon_t$$

where $\epsilon_t$ is the error. Since $w_t$ has zero mean, the best forecast must be

$$X'_{t+1} = A \hat{X}_t = A X_t + A \epsilon_t = X_{t+1} - w_t + A \epsilon_t.$$

Since $\epsilon_t$ is the error of $X_t$, it has the same variance, $C_t$, which means that the variance of $A \epsilon_t$ is $A C_t A^T$. Also note that $w_t$ and $\epsilon_t$ are independent, which allows us to calculate the forecast for the variance:

$$C'_{t+1} = A C_t A^T + Q.$$

Thus we have successfully calculated forecasts for $X_{t+1}$ and $C_{t+1}$.

3.4.2 Observation Update Step

For this section we shall continue to use the derivation described by Leondes (1970, pg48). At this point the observation $Y_{t+1}$ has been made and can now be compared to the forecast $X'_{t+1}$ to obtain the best estimate $\hat{X}_{t+1}$. The best estimate is found to be

$$\hat{X}_{t+1} = X'_{t+1} - K_{t+1}(B X'_{t+1} - Y_{t+1}),$$

where

$$K_{t+1} = C'_{t+1} B^T (B C'_{t+1} B^T + R)^{-1}$$

is known as the optimal gain matrix. Finally, the variance of $\hat{X}_{t+1}$ is

$$C_{t+1} = C'_{t+1} - K_{t+1} B C'_{t+1}.$$

Hence, we have found that the Kalman filter updates $\hat{X}_t$ and $C_t$ by using the equations:

$$\begin{aligned}
X'_{t+1} &= A \hat{X}_t \\
C'_{t+1} &= A C_t A^T + Q \\
K_{t+1} &= C'_{t+1} B^T (B C'_{t+1} B^T + R)^{-1} \\
\hat{X}_{t+1} &= X'_{t+1} - K_{t+1}(B X'_{t+1} - Y_{t+1}) \\
C_{t+1} &= C'_{t+1} - K_{t+1} B C'_{t+1}.
\end{aligned}$$
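One recursion of these update equations translates directly into code. The following NumPy sketch is my own illustration under the notation above ($A$, $B$, $Q$, $R$ as defined in Section 3.4); it is not taken from the report.

```python
# Sketch: one time update plus one observation update of the discrete Kalman filter.
import numpy as np

def kalman_step(x_hat, C, y_next, A, B, Q, R):
    # Time update: forecast the next state and its covariance.
    x_fore = A @ x_hat
    C_fore = A @ C @ A.T + Q
    # Observation update: correct the forecast with the new measurement.
    K = C_fore @ B.T @ np.linalg.inv(B @ C_fore @ B.T + R)   # optimal gain matrix
    x_hat_next = x_fore - K @ (B @ x_fore - y_next)
    C_next = C_fore - K @ B @ C_fore
    return x_hat_next, C_next
```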

4 Bayesian Networks

Before fully defining a Bayesian Network we must first understand some terminology and features

that will occur. We do this by looking at causal networks.

4.1 Causal Networks

A causal network is a type of directed graph, with variables and directed links (Jensen, 2001, pg6). Jensen (2001, pgs6-8) also states that the three basic types of connection in these graphs are serial, diverging and converging, shown respectively in Figure 5, Figure 6 and Figure 7.

Figure 5: A serial connection.

Figure 6: A diverging connection.

Figure 7: A converging connection.

If a link goes from one variable, A say, to another, B, then A is referred to as the parent of B and, similarly, B is the child of A. We must also define the term d-separation. According to Jensen (2001, pg10),

"Two distinct variables A and B in a causal network are d-separated if, for all paths between A and B, there is an intermediate variable V (distinct from A and B) such that either

- the connection is serial or diverging and V is instantiated (has received evidence), or

- the connection is converging and neither V nor any of V's descendants have received evidence."

Any A and B that are not d-separated are referred to as d-connected. For example, in the following figures, we see that in Figure 8 we have a d-connecting path A - B - C - D - E - F, and in Figure 9, A is d-separated from D.

We are interested in d-separation because if "A and B are d-separated, then changes in the certainty of A have no impact on the certainty of B" (Jensen, 2001, pg11). This will ultimately affect conditional probabilities, as seen when looking at the forward-backward algorithm.


Figure 8: A causal network with G instantiated.

Figure 9: A causal network with E instantiated.

4.2 Bayesian Networks

A key difference between causal and Bayesian networks is that a Bayesian network cannot contain

a feedback cycle, whilst a causal network can. In other words Figure 10 is not a Bayesian network,

but it is a causal network.

Figure 10: A causal network with a feedback cycle.

In general a Bayesian network, according to Jensen (2001, pg19), consists of:

- variables with directed edges between them;

- a set of states for each variable;

- for each variable A with parents $B_1, B_2, \ldots, B_n$, a conditional probability $P(A \mid B_1, B_2, \ldots, B_n)$.

Together the variables and edges form a directed acyclic graph (DAG); acyclic means that there can be no feedback, i.e. there is no directed path $A_1 \to A_2 \to \cdots \to A_n$ such that $A_1 = A_n$ (Jensen, 2001, pg19). An example of a Bayesian network can be seen in Figure 11.

Figure 11: A Bayesian network with probabilities to specify: P(A), P(BjA), P(CjA;B), P(DjC).

A Bayesian network need not refer to causality, but the d-separation properties implied by the structure must hold. For instance, in Figure 11, if B were instantiated, then A would be d-separated from D. We must also note the existence of the chain rule for Bayesian networks. The theorem and proof of the chain rule are given by Jensen (2001, pg21):


Theorem 4.1 (The chain rule for BNs). Let $B$ be a Bayesian network over $U = \{A_1, A_2, \ldots, A_n\}$. Then $P(U)$, the joint probability distribution over $U$, is given by the product

$$P(U) = \prod_{i} P(A_i \mid pa(A_i)),$$

where $pa(A_i)$ is the parent set of $A_i$.

Proof. The proof proceeds by induction, and we begin with the trivial case $U = \{A_1\}$, where $U$ has only one variable. Then the theorem simply states that $P(A_1) = \prod_{i=1}^{1} P(A_i \mid pa(A_i)) = P(A_1 \mid pa(A_1)) = P(A_1)$, since $A_1$ has no parents. Now we assume that the chain rule is true for all BNs with $n-1$ variables, and let $U$ be a BN with $n$ variables. Because the network is a DAG it must be acyclic, which means there must be at least one variable, $A_j$, with no children. Now consider the same network but with $A_j$ removed, $U \setminus \{A_j\}$. Then, from the assumption, $P(U \setminus \{A_j\})$ is the product of all the specified probabilities with the exception of $P(A_j \mid pa(A_j))$. By the fundamental rule ($P(A \cap B) = P(A \mid B) P(B)$) we see that

$$P(U) = P((U \setminus \{A_j\}) \cap A_j) = P(A_j \mid U \setminus \{A_j\})\, P(U \setminus \{A_j\}).$$

Note that $A_j$ is d-separated from $U \setminus (\{A_j\} \cup pa(A_j))$, given $pa(A_j)$. Thus we see that

$$P(U) = P(A_j \mid U \setminus \{A_j\})\, P(U \setminus \{A_j\}) = P(A_j \mid pa(A_j))\, P(U \setminus \{A_j\}),$$

and we know that $P(U \setminus \{A_j\})$ is the product of all the other specified probabilities; hence we have found $P(U)$ and have completed the induction.

We also note that a node is independent of all other nodes given its Markov blanket (Murphy, 2002, pg124), which for a BN means given its parents, its children and its children's parents. In Figure 12 we see the Markov blanket for a general node A in a BN; therefore, in Figure 13, the Markov blanket for E contains B, D and F. Another interesting property of BNs is the directed factorization property.

Figure 12: A Markov blanket around A.

Figure 13: A Bayesian network.

The directed factorization property states that if the nodes are ordered topologically, with parents before children, as $1, \ldots, N$, then the joint distribution is given by

$$\begin{aligned}
P(X_1, \ldots, X_N) &= P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1, X_2) \cdots P(X_N \mid X_1, \ldots, X_{N-1}) \quad \text{by the chain rule for probability} \\
&= \prod_{i=1}^{N} P(X_i \mid X_{1:i-1}) \\
&= \prod_{i=1}^{N} P(X_i \mid pa(X_i)), \quad \text{as node } X_i \text{ is independent of its remaining predecessors, given its parents}
\end{aligned}$$

(Murphy, 2002, pg125). For example, in Figure 13,

$$P(A,B,C,D,E,F) = P(A)\, P(B)\, P(D \mid A,B)\, P(C \mid A,D)\, P(E \mid B)\, P(F \mid D,E).$$
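The directed factorization property is easy to express in code. The sketch below is a hypothetical illustration (not from the report): given a full assignment of values to the nodes, it multiplies each node's conditional probability given its parents. The parents dictionary follows the structure implied by the factorization given for Figure 13, while the conditional probability tables themselves are left as placeholders to be supplied.

```python
# Sketch: joint probability of a full assignment via the directed factorization property.
def joint_probability(assignment, parents, cpt):
    """assignment: {node: value}; parents: {node: (parent, ...)};
    cpt[node][(value, parent_values)] gives P(node = value | parents = parent_values)."""
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[node])
        p *= cpt[node][(value, parent_values)]
    return p

# Node-to-parent structure matching P(A) P(B) P(D|A,B) P(C|A,D) P(E|B) P(F|D,E):
parents = {"A": (), "B": (), "D": ("A", "B"), "C": ("A", "D"), "E": ("B",), "F": ("D", "E")}
```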


5 Dynamic Bayesian Networks

In this section we shall be considering dynamic Bayesian networks (DBNs) as described by Murphy (2002, chapter 2). A DBN is, as the name suggests, an extension of BNs. DBNs are commonly used in speech recognition, robotics and digital forensics, and they can produce solutions analogous to those from HMMs and Kalman filters. In particular, a DBN models probability distributions over random variables, $Z_1, Z_2, \ldots$, which we shall denote in terms of an SSM, $Z_t = (V_t, X_t, Y_t)$, where $V_t$ is the input variable, $X_t$ is the hidden variable and $Y_t$ is the output variable. A dynamic Bayesian network is formally defined by Murphy (2002, pg14) to be a pair $(B_1, B_{\to})$, where $B_1$ defines the prior $P(Z_1)$, and $B_{\to}$ is a two-slice temporal BN (2TBN) which defines $P(Z_t \mid Z_{t-1})$ using a DAG such that

$$P(Z_t \mid Z_{t-1}) = \prod_{i=1}^{N} P(Z_t^i \mid pa(Z_t^i))$$

where $pa(Z_t^i)$ are the parents of $Z_t^i$ in the graph, and $Z_t^i$ is the $i$th node at time $t$, which could be a component of $X_t$, $Y_t$ or $V_t$. In this 2TBN, the nodes in the first slice have no associated parameters, but in the second slice each node has an associated conditional probability distribution, which defines $P(Z_t^i \mid pa(Z_t^i))$ for $t > 1$. Note that we will only be considering first order Markov processes, so the parents of a node, $pa(Z_t^i)$, must either be in the same time slice or in the previous time slice. A node is defined as persistent if there exists an arc from $Z_{t-1}^i$ to $Z_t^i$. A DBN can also contain arcs that represent instantaneous causation - this is when a directed arc goes from one node to another within the same time slice. The 2TBN can also give us the joint probability distribution, $P(Z_{1:T})$, by being "unrolled" to give $T$ time slices (each a BN); then

$$P(Z_{1:T}) = \prod_{t=1}^{T} \prod_{i=1}^{N} P(Z_t^i \mid pa(Z_t^i)).$$

It is useful at this point to see what a DBN might look like, for instance if each time slice is

represented by the BN shown in Figure 14, then the DBN would look like that shown in Figure 15.

Figure 14: A Bayesian network.

Figure 15: A Dynamic Bayesian Network.

We can represent HMMs as DBNs, but should first note the key difference between the two. A HMM has just a single variable $X_t$ as its hidden state, whilst a DBN represents the hidden state with a set of random variables, $X_t^1, \ldots, X_t^N$. As we have already seen, a HMM can be represented as shown in Figure 1. So we begin by defining the conditional probability distribution, which requires us to know $P(X_1)$, $P(X_t \mid X_{t-1})$ and $P(Y_t \mid X_t)$. These values have all been previously defined in the definition of an HMM: $P(X_1) = u$ (the initial distribution), $P(X_t \mid X_{t-1}) = A$ (the state transition matrix) and $P(Y_t \mid X_t) = B$ (the emission matrix). If the parameters do not change over time then it is only necessary, for the conditional probability distribution, to specify $P(X_1)$, $P(X_2 \mid X_1)$ and $P(Y_1 \mid X_1)$. In this case the network is now given by Figure 16.

Figure 16: A HMM where the parameters do not vary with time.

As stated, DBNs are particularly useful in speech recognition, which is certainly an area of great

interest in today's technology. However, in order to take a closer look into the specifics behind

speech recognition we must first introduce hierarchical HMMs.

5.1 Hierarchical HMMs (HHMMs)

In this section we shall consider HHMMs as described by Murphy (2002, section 2.3.9). An HHMM is used to model things with hierarchical structure, and many states of an HHMM are HHMMs themselves. Within this structure there are two types of state that can occur, namely production states and abstract states. A production state emits a single observation, whilst an abstract state emits a string of observations, and it is these strings that are then governed by sub-HHMMs. It is best to see how an HHMM might look in terms of a diagram; see Figure 17.

Figure 17: An HHMM state transition diagram where solid arcs represent horizontal transitions between states, dashed arcs represent vertical transitions (going to a sub-HHMM), red arcs represent emissions from production states and dotted arcs from double-ringed states represent the end of a sub-HHMM (there is one for each sub-HHMM).

All HHMMs can also be converted into regular flat HMMs (see Figure 18). However, it is now much more complex to calculate the transition probabilities. In fact, the probability of going from node $i$ to node $j$ in the flat HMM is given by the sum over all paths from $i$ to $j$ in the HHMM which pass only through abstract states. For example, in Figure 18 the probability of a transition from state c back to state c is given by

$$\begin{aligned}
P_{flat}(c \to c) &= P_h(8 \to 8) + P_h(8 \to 9 \to 10 \to 3 \to 8) + P_h(8 \to 10 \to 3 \to 8) \\
&= P_h(8 \mid 8) + P_h(9 \mid 8) P_h(10 \mid 9) P_h(8 \mid 3) + P_h(10 \mid 8) P_h(8 \mid 3)
\end{aligned}$$

where $P_{flat}$ represents probabilities in the flat HMM and $P_h$ represents probabilities in the HHMM.

Figure 18: The HHMM state transition diagram shown in Figure 17, flattened to a regular HMM, where $\epsilon$ represents the empty string.

We can also represent HHMMs as DBNs; the general case of this can be seen in Figure 19.

Figure 19: A 3-level HHMM as a DBN, where $X_t^l$ is the state at time $t$ and level $l$, and $F_t^l$ is 1 if the HMM at level $l$ has finished, and 0 otherwise.

It is now necessary to define the conditional probability distributions for each of the node types seen in Figure 19, for a general HHMM with $L$ layers up to time $T$. To do this we shall look at the bottom, middle and top layers of the hierarchy separately, as well as looking at the first, middle and last time slices separately:

Bottom layer: here $l = L$ and $t = 2, \ldots, T-1$. $X^L$ is a Markov chain with parameters determined by its sub-HHMM, which is encoded by $X_t^{1:L-1}$, which we shall represent by $k$. State $X^L$ does not enter its end state, but sets $F^L = 1$ to signal the completion of that sub-HHMM. This signals that HMMs in higher layers can change state, as well as representing a vertical transition back to the prior distribution. This can be written as

$$P(X_t^L = S_j \mid X_{t-1}^L = S_i, F_{t-1}^L = f, X_t^{1:L-1} = k) = \begin{cases} \tilde{A}_k^L(i,j) & \text{if } f = 0 \\ u_k^L(j) & \text{if } f = 1 \end{cases}$$

where $S_i$ and $S_j$ are not end states for this HMM, $\tilde{A}_k^L$ is a rescaled version of $A_k^L$, the transition matrix for layer $L$ given that the parent variables are in state $k$, and $u_k^L$ is the initial distribution for layer $L$ given that the parent variables are in the set of states $k$. Also, the probability that $F_t^L = 1$ is given by

$$P(F_t^L = 1 \mid X_t^{1:L-1} = k, X_t^L = S_i) = A_k^L(i, \text{end}),$$

where end is the end state for the HMM.

Intermediate levels: here $l = 2, \ldots, L-1$ and $t = 2, \ldots, T-1$. This is the same as before, except that now we also get a signal, $F^{l+1}$, from below; if this signal specifies that the model below has finished, then the state is free to change, otherwise the state must remain the same. This is equivalent to

$$P(X_t^l = S_j \mid X_{t-1}^l = S_i, F_{t-1}^{l+1} = b, F_{t-1}^l = f, X_t^{1:l-1} = k) = \begin{cases} \delta(i,j) & \text{if } b = 0 \\ \tilde{A}_k^l(i,j) & \text{if } b = 1 \text{ and } f = 0 \\ u_k^l(j) & \text{if } b = 1 \text{ and } f = 1 \end{cases}$$

where $\delta(i,j)$ is a delta function, and $F^l = 1$ only if $X^l$ is able to enter a final state, which depends on $X^{1:l-1}$. This can be written as

$$P(F_t^l = 1 \mid X_t^l = S_i, X_t^{1:l-1} = k, F_t^{l+1} = b) = \begin{cases} 0 & \text{if } b = 0 \\ A_k^l(i, \text{end}) & \text{if } b = 1. \end{cases}$$

Top layer: here $l = 1$ and $t = 2, \ldots, T-1$. In this layer each node has no parent to specify which distribution to use; the equations are the same as before, but without the conditioning on $X_t^{1:l-1} = k$.

Initial time slice: here $t = 1$ and $l = 1, \ldots, L$. For the first layer the conditional probability distribution is given by $P(X_1^1 = S_j) = u^1(j)$. For all layers thereafter ($l = 2, \ldots, L$), $P(X_1^l = S_j \mid X_1^{1:l-1} = k) = u_k^l(j)$.

Final time slice: here $t = T$ and $l = 1, \ldots, L$. We must ensure that all of the sub-HMMs have reached their end states by the end of the sequence. To do this we set $F_T^l = 1$ for all layers.

Now we can put this knowledge to good use with an example in speech recognition.

5.2 HHMMs in Speech Recognition

The passage in Murphy (2002, pg27) shall be used here. It begins by explaining how we might model the pronunciation of a single word: each word is split up into a sequence of sounds, known as phones, which become the states in the context of HHMMs. Typically, it is best to use HHMMs to model this because each phone has its own underlying HMM. To see how this might work, look at the DBN in Figure 20. In this figure, $X^h$ is the hidden state in the word's HMM, $X$ is the phone, $Z$ is the state within the phone HMM (the subphone) and $Y$ is the observed state, here an acoustic vector. Again, $F^z$ is used to determine when a subphone HMM is complete. So what is happening is that each state of the model for the overall word, $X_t^h$, emits $X_t$, which starts the HMM for that phone. The state in the phone HMM is given by the subphone, $Z_t$. We require $F^z$ to tell us when the subphone HMM is finished because we do not know how long each subphone might last for. If, instead of modelling a single word, we wished to model a sentence or sequence of words, then all we have to do is add another layer to the hierarchy (see Figure 21), where $W$ is the state of the word.


Figure 20: A DBN for the pronunciation of a single word.

Figure 21: A DBN for speech recognition for more than one word.

We can now precisely define all the conditional probability distributions when $t > 1$ for the model in Figure 21, as stated by Murphy (2002, pg29):

$$P(W_t = w' \mid W_{t-1} = w, F^W_{t-1} = f) = \begin{cases} \delta(w, w') & \text{if } f = 0 \\ A(w, w') & \text{if } f = 1 \end{cases}$$

$$P(F^W_t = f \mid X^h_t = x, W_t = w, F^z_t = b) = \begin{cases} \delta(f, 0) & \text{if } b = 0 \\ 1 - A_w(x, \text{end}) & \text{if } b = 1 \text{ and } f = 0 \\ A_w(x, \text{end}) & \text{if } b = 1 \text{ and } f = 1 \end{cases}$$

$$P(X^h_t = x' \mid X^h_{t-1} = x, W_t = w, F^W_{t-1} = f, F^z_{t-1} = b) = \begin{cases} \delta(x, x') & \text{if } b = 0 \\ A_w(x, x') & \text{if } b = 1 \text{ and } f = 0 \\ u_w(x') & \text{if } b = 1 \text{ and } f = 1 \end{cases}$$

$$P(X_t = k \mid X^h_t = x, W_t = w) = \delta(B_w(x), k)$$

$$P(F^z_t = 1 \mid Z_t = S_j, X_t = k) = A_k(j, \text{end})$$

$$P(Z_t = S_j \mid Z_{t-1} = S_i, X_t = k, F^z_{t-1} = f) = \begin{cases} A_k(i, j) & \text{if } f = 0 \\ u_k(j) & \text{if } f = 1 \end{cases}$$

where $A(w, w')$ is the word bigram probability, $A_w(x, x')$ is the transition probability between states in word model $w$, $u_w(x)$ is the initial state distribution and $B_w(x)$ is the phone emission probability from state $x$ in the HMM for the word $w$. Thus we have seen how speech recognition can be modelled using HHMMs and DBNs.

6 Conclusion

In this project we have seen how we can build up layers of mathematical knowledge, starting with Markov chains, in order to explore DBNs and finally discuss their application in modern technology in the form of speech recognition. After considering the section on dynamic Bayesian networks, we can see how significant the earlier sections on Markov chains and HMMs are as foundations for this relatively new area of mathematics. As shown, DBNs are particularly useful in modern technology, not just in speech recognition (as considered here), but in many other areas such as tracking devices and quantitative risk assessments for various diseases.



