Experimental Setup for HBIDS


The Masquerade dataset [110, 111] has been chosen to evaluate HBIDS. The author of this dataset, Dr. Schonlau, collected masquerading user data for training and testing masquerade-detection methods [110, 112]. A sample of the dataset is shown in table 5.2.

This dataset consists of 50 data files, one file per user. Each file contains 15,000 commands (collected using the UNIX audit tool acct [113]). The first 5,000 commands are from the original user and serve as training data. The following 10,000 commands are seeded with a masquerader's commands and serve as test data. The test data can be viewed as 100 command blocks, with 100 commands in each block. The training data provided by Schonlau contains only normal behavior and no intrusion behavior. This is sufficient for our model, since it is an anomaly-based technique, which does not require signatures of intrusion behavior.
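As a concrete illustration, the following minimal Python sketch (assuming the standard layout of the Schonlau files, one command per line and file names User1 to User50) splits one user file into training commands and test blocks:

    from pathlib import Path

    TRAIN_LEN = 5000   # first 5,000 commands: clean data from the genuine user
    BLOCK = 100        # remaining 10,000 commands form 100 blocks of 100

    def load_user(path):
        """Split one Schonlau user file (one UNIX command per line)."""
        commands = Path(path).read_text().split()
        train = commands[:TRAIN_LEN]
        test = commands[TRAIN_LEN:]
        blocks = [test[i:i + BLOCK] for i in range(0, len(test), BLOCK)]
        return train, blocks

    train, blocks = load_user("User1")   # file name follows the dataset convention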

We mapped the dataset to the HMM parameters, namely the number of hidden states (N), the number of observation symbols (M), the transition probability matrix (A), the emission probability matrix (B), and the initial distribution (π). A brief description of how these parameters are set follows. The HMM state transitions corresponding to the Masquerade dataset are shown in figure 5.2, and the mapping between the HMM parameters and the Masquerade dataset is as follows:

In the Masquerade data, traces T1 to T7 correspond to the HMM state variables.

Figure 5.2 State Transition diagram for Masquerade data (a directed graph over the trace states T1 to T7)

5.3.1 Number of States (N)

Estimating the number of hidden states in an HMM for a given application is more art than science. We have chosen to build the HMM with a number of states exactly equal to the number of traces; therefore, in our experiments the number of states is 7, because there are 7 traces in the dataset.

5.3.2 Number of Observation Symbols (M)

This parameter corresponds to the number of distinct system calls in each trace. In the Masquerade dataset, each trace has between 100 and 160 unique system calls. A higher M raises the probability of detection at the cost of more computation. For simplicity we have taken M = 10, i.e., ten unique system calls per trace. The chosen system-call list for trace T1 is:

awk, basename, binhex, cal, cat, col, compress, cpp, csh, date.
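One way to encode a command stream over this 10-symbol vocabulary is sketched below (the handling of out-of-vocabulary commands is our assumption; the text does not specify it):

    # The M = 10 observation symbols chosen for trace T1, mapped to indices 0..9.
    T1_SYMBOLS = ["awk", "basename", "binhex", "cal", "cat",
                  "col", "compress", "cpp", "csh", "date"]
    SYM_INDEX = {cmd: i for i, cmd in enumerate(T1_SYMBOLS)}

    def encode(commands):
        """Map a command sequence to observation-symbol indices,
        skipping commands outside the chosen vocabulary."""
        return [SYM_INDEX[c] for c in commands if c in SYM_INDEX]

    print(encode(["cat", "ls", "awk", "date"]))   # -> [4, 0, 9]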

Table 5.2 Sample records of the Masquerade dataset

The chosen unique system-call lists for all traces T1 to T7 are shown in table 5.3.

Table 5.3 Unique System calls list for traces T1 to T7.

Table 5.4 State Transition Probability Matrix A for the state transition diagram (figure 5.2)

Table 5.5 Emission Probability Matrix B for the state transition diagram

5.3.3 HMM State Transition diagram and Initialization of Matrices (A, B, and π)

The behavior of an HMM resulting from the training procedure depends on the training data as well as on the initial values of the matrices A and B. Many strategies are available for initializing A and B [114]. We built a Bayesian network structure for the chosen training traces; this structure can be taken as the hidden Markov model state transition diagram shown in figure 5.2. A is initialized from the conditional probabilities of the state variables (traces T1 to T7), shown in table 5.4. B is initialized with the frequencies of trace emissions, shown in table 5.5; the size of this matrix is 7x34, where seven corresponds to the number of traces and thirty-four to the total number of unique system calls. The variable π is the initial probability distribution over the traces.
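The initialization step can be sketched as follows; the counts here are stand-ins, since the actual probabilities come from tables 5.4 and 5.5:

    import numpy as np

    TRACES = ["T1", "T2", "T3", "T4", "T5", "T6", "T7"]
    N, M = len(TRACES), 34   # 7 trace states, 34 unique system calls in total

    def row_normalise(counts):
        """Turn a non-negative count matrix into a row-stochastic matrix."""
        counts = np.asarray(counts, dtype=float)
        return counts / counts.sum(axis=1, keepdims=True)

    # Stand-in counts only: in the experiment A is built from the conditional
    # probabilities of trace-to-trace transitions (table 5.4) and B from the
    # frequency of each system call within each trace (table 5.5).
    A = row_normalise(np.ones((N, N)))
    B = row_normalise(np.ones((N, M)))
    pi = np.full(N, 1.0 / N)   # initial distribution over the traces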

5.4 Framework for IDS using Bayes Network and HMM

The framework for building an IDS using a Bayes network and an HMM is shown in figure 5.3. The framework consists of several levels of processing: reading the dataset, preprocessing the data, building the Bayes network, initializing the HMM parameters, generating sequences and states, estimating the state transition and emission probability matrices, and evaluating the model. We have used the tabu search algorithm to find a well-scoring Bayes network structure. Tabu search [115] is a higher-level heuristic procedure for solving optimization problems, designed to guide other methods to escape the trap of local optimality.

It uses flexible structured memory to permit search information to be exploited more thoroughly than rigid memory systems allow, strategically constrains and frees the search process through tabu restrictions and aspiration criteria [115], and uses memory functions of varying time spans to intensify and diversify the search. Tabu search hill-climbs until an optimum is reached, and the best network found in the traversal is returned. Once the structure has been learned, the conditional probability tables of the Bayes network can be estimated. Network structures have been created for the normal and attack record types, shown in figures 5.4 and 5.5 respectively. The conditional probability tables (CPTs) for the normal-records dataset of figure 5.4 are shown in tables 5.8 to 5.11, and the CPTs for the attack-records dataset of figure 5.5 in tables 5.12 to 5.15.
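The sketch below illustrates the tabu search mechanics described above (single-arc add/delete/reverse moves, a fixed-tenure tabu list, and an aspiration criterion). The scoring function is a placeholder of our own: a real structure learner would score each candidate network against the training data with, for example, a BIC or BDeu metric.

    import itertools
    from collections import deque

    VARS = ["protocol_type", "src_bytes", "dst_bytes", "count"]

    def is_acyclic(edges):
        """Kahn's algorithm on a set of directed arcs (parent, child)."""
        indeg = {v: 0 for v in VARS}
        children = {v: [] for v in VARS}
        for p, c in edges:
            children[p].append(c)
            indeg[c] += 1
        stack = [v for v in VARS if indeg[v] == 0]
        seen = 0
        while stack:
            v = stack.pop()
            seen += 1
            for c in children[v]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    stack.append(c)
        return seen == len(VARS)

    def neighbours(edges):
        """All structures one arc-change away: add, delete, or reverse an arc."""
        for p, c in itertools.permutations(VARS, 2):
            arc = (p, c)
            if arc in edges:
                yield ("del", arc), edges - {arc}
                reversed_ = (edges - {arc}) | {(c, p)}
                if is_acyclic(reversed_):
                    yield ("rev", arc), reversed_
            elif is_acyclic(edges | {arc}):
                yield ("add", arc), edges | {arc}

    def score(edges):
        # Placeholder: a real learner scores the structure against the data
        # (e.g. BIC/BDeu); this stub merely favours sparse graphs.
        return -len(edges)

    def tabu_search(start=frozenset(), tenure=7, max_iter=100):
        current = best = start
        tabu = deque(maxlen=tenure)   # recently applied moves are forbidden
        for _ in range(max_iter):
            candidates = [
                (score(n), move, n) for move, n in neighbours(current)
                if move not in tabu or score(n) > score(best)   # aspiration
            ]
            if not candidates:
                break
            s, move, current = max(candidates, key=lambda t: t[0])
            tabu.append(move)
            if s > score(best):
                best = current
        return best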

5.4.1 Read the Training Data

Many standard datasets are available for building an IDS; we have chosen the standard KDDCUP dataset, which is described in chapter 3. The IDS model is trained and tested with the KDDCUP99 dataset. After the dataset is chosen, it must be preprocessed.

Figure 5.3 Framework for NIDS using Bayesian Network and HMM (flowchart: Start → select training dataset → preprocess the dataset → replace numeric discrete values with symbols → build Bayes networks for normal and attack connection records separately (HMM architecture) → initialize HMM parameters based on the Bayesian networks → classify test records into normal and attack types)

5.4.2 Pre-processing of KDDCUP dataset

The standard KDDCUP [53] dataset is in text format. It has 41 attributes; the full set contains about 4,900,000 records, of which a 10% subset is commonly used. The attributes are of continuous, categorical, and binary types. We have taken a sample of around 35,000 records with a small subset of the attributes. The selected attributes and sample connection records for the discretization process are listed in table 5.6. The text data is converted to comma-separated values (CSV), which is easy to read and analyze. With this we achieve a data-size reduction from 41 attributes to 4 and from 4,900,000 records to 35,000. The discrete values have been replaced by symbols.

The pre-processed data has been used for training and testing the model. The chosen attributes are protocol_type, src_bytes, dst_bytes, and count. We have discretized the continuous variables and represented the discrete values by symbols; the discretized values and their symbols are shown in table 5.7.
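A preprocessing sketch along these lines follows; the file name, column positions, and quantile-based cut points are our assumptions, since the thesis does not state its exact bin boundaries:

    import pandas as pd

    # 0-indexed positions in the 42-column KDDCUP'99 file:
    # 1 = protocol_type, 4 = src_bytes, 5 = dst_bytes, 22 = count, 41 = label.
    df = pd.read_csv("kddcup.data_10_percent", header=None)
    df = df.iloc[:, [1, 4, 5, 22, 41]].sample(n=35_000, random_state=0)
    df.columns = ["protocol_type", "src_bytes", "dst_bytes", "count", "label"]

    def to_symbols(series, labels):
        """Discretize a continuous attribute into three symbol levels by rank,
        which avoids qcut failures on heavily repeated values."""
        return pd.qcut(series.rank(method="first"), q=3, labels=labels)

    df["src_bytes"] = to_symbols(df["src_bytes"], ["L", "M", "H"])
    df["dst_bytes"] = to_symbols(df["dst_bytes"], ["l", "m", "h"])
    df["count"] = to_symbols(df["count"], ["low", "mid", "high"])

    df.to_csv("kddcup_discretised.csv", index=False)   # CSV for later stages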

5.4.3 Build the Bayesian Network

The simplest kind of dynamic Bayesian network (DBN) is the hidden Markov model (HMM). A dynamic Bayesian network is a Bayesian network that represents sequences of state variables; the state variables are the nodes of the graph, and the sequences are often time series or sequences of symbols. The dependency between variables (nodes) is shown by a directed edge, and each node carries a conditional probability table (CPT) that relates it to its parent variables. The Bayesian networks for the normal and attack records of the KDDCUP99 dataset are shown in figures 5.4 and 5.5.
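To make the CPT idea concrete, a node's table can be held as a mapping from parent configurations to a distribution over the node's own values; the sketch below uses two rows taken from table 5.9:

    # P(src_bytes | dst_bytes, count): each parent configuration indexes one
    # row of the CPT, i.e. a distribution over the child's values.
    cpt_src_bytes = {
        ("H", "low"): {"M": 0.9263, "H": 0.045779, "L": 0.027851},
        ("M", "mid"): {"M": 0.0731, "H": 0.024390, "L": 0.902439},
        # ... remaining parent configurations from table 5.9
    }

    p = cpt_src_bytes[("H", "low")]["M"]
    print(p)   # P(src_bytes = M | dst_bytes = H, count = low) = 0.9263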

Figure 5.4 Bayes Network for Normal Records (nodes: protocol_type, src_bytes, dst_bytes, count)

Figure 5.5 Bayes Network for Intrusion Records (nodes: protocol_type, src_bytes, dst_bytes, count)

5.4.4 HMM Parameter Initialization

The IDS has been designed using the KDDCUP'99 [53] dataset described above, which has 41 features. We have chosen five of the 41 features as hidden state variables, namely protocol_type, flag, src_bytes, dst_bytes, and count. Each of these hidden state variables has a set of distinct values, each of which can emit a symbol: a state variable with K distinct values, say V1, V2, …, VK, produces the observations X(t). The state transition diagrams of the HMM for normal records and attack records are shown in figures 5.4 and 5.5 respectively.

The number of hidden state variables N is 5, based on the number of chosen variables; therefore the size of the state transition matrix is 5x5.

The number of distinct emission symbols M is 18; therefore the size of the emission probability matrix is 5x18.

The initial probability distribution is π = {…, 0.261902, 0.090411, 0.375828, 0.271858}.

5.4.5 Estimation of Transition and Emission Probabilities

To determine the parameters of an HMM [16], it is necessary to make a rough guess at the state transition probability matrix and the emission probability matrix. Once this is done, more accurate (in the maximum-likelihood sense) parameters can be found by applying the Baum-Welch re-estimation formulae [105]. In general, the learning problem is how to adjust the HMM parameters so that the given set of observations (the training set) is represented by the model in the best way for the intended application. Here we use the Baum-Welch algorithm [105], also known as the forward-backward algorithm, to train the IDS model. The HMM parameters A and B estimated with this algorithm are shown in tables 5.16 and 5.17 respectively.
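As an illustration of this training step, the sketch below uses the CategoricalHMM class from the third-party hmmlearn library (a tooling choice on our part; the thesis does not name an implementation, and the class assumes a recent hmmlearn version), with randomly generated stand-in observations:

    import numpy as np
    from hmmlearn import hmm   # third-party HMM library (our choice)

    N, M = 5, 18
    rng = np.random.default_rng(0)

    # Rough initial guesses; in the experiment these come from the Bayes network.
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(N, 1.0 / N)

    # init_params="" keeps our initial matrices instead of re-initializing them.
    model = hmm.CategoricalHMM(n_components=N, n_iter=100, tol=1e-4, init_params="")
    model.startprob_, model.transmat_, model.emissionprob_ = pi, A, B

    obs = rng.integers(0, M, size=(1000, 1))   # stand-in for encoded records
    model.fit(obs)                             # Baum-Welch re-estimation
    print(np.round(model.transmat_, 3))        # re-estimated A (cf. table 5.16)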

5.4.6 Model Evaluation: The Evaluation Problem and the Forward Algorithm

We want to find the probability of an observed sequence when the HMM [114] parameters (π, A, B) are known. For our problem, this means finding the probabilities of sequences corresponding to connection records of the KDDCUP observations, whose record types are normal, DoS, R2L, U2R, and Probe. We have taken only normal connection records, for which we have built the model. For this data the hidden states are shown as a trellis in figure 5.1 [16]. The forward algorithm [116] can be used to calculate the probability of the sequence.

Table 5.6 Sample dataset before discretization of the attribute values

Table 5.7 Sample dataset after discretization of the attribute values

We have a model λ and a sequence of observations O = O1, O2, …, OT, and p(O | λ) must be found. We can calculate this quantity using simple probabilistic arguments, but the direct calculation involves a number of operations on the order of N^T, which is very large even when the sequence length T is moderate. Therefore we look for another method of calculation. Fortunately there exists one with considerably lower complexity, which makes use of an auxiliary variable called the forward variable.

The forward variable αt(i) is defined as the probability of the partial observation sequence O1, O2, …, Ot that terminates in state i at time t. Mathematically,

αt(i) = p{O1, O2, …, Ot, qt = i | λ} (5.13)

The complexity of this method, known as the forward algorithm, is proportional to N^2 T, which is linear in T, whereas the direct calculation mentioned earlier has exponential complexity.
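A direct NumPy implementation of this forward recursion (a minimal sketch of the standard algorithm):

    import numpy as np

    def forward(pi, A, B, obs):
        """alpha[t, i] = p(O_1..O_t, q_t = i | lambda).

        pi: (N,) initial distribution; A: (N, N) transition matrix;
        B: (N, M) emission matrix; obs: sequence of symbol indices.
        """
        T, N = len(obs), A.shape[0]
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                      # initialisation
        for t in range(1, T):                             # induction: O(N^2) per step
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        return alpha, alpha[-1].sum()                     # p(O | lambda)

In practice the logarithm of p(O | λ) is what gets compared, to a threshold or across competing models, when classifying a test sequence, since the raw probability underflows for long sequences.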

In a similar way we can define the backward variable βt(i) as the probability of the partial observation sequence Ot+1, Ot+2, …, OT, given that the current state is i. Mathematically,

βt(i) = p{Ot+1, Ot+2, …, OT | qt = i, λ} (5.14)
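The matching backward recursion, under the same conventions as the forward sketch above:

    def backward(A, B, obs):
        """beta[t, i] = p(O_{t+1}..O_T | q_t = i, lambda)."""
        T, N = len(obs), A.shape[0]
        beta = np.zeros((T, N))
        beta[-1] = 1.0                                   # termination convention
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        return beta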

Experimental Setup for IDS using KDDCUP’99 to evaluate Integrated Bayes Network and HMM

The continuous attributes src_bytes, dst_bytes, and count have been discretized as follows:

src_bytes : ‘L’, ’M’, and ‘H’,

dst_bytes : ‘l’, ‘m’, ‘h’,

count: ‘low’, ‘mid’, ‘high’.

The dataset has 22 attack types, such as smurf, neptune, and satan. These 22 attacks have been grouped into four categories, namely DoS, R2L, U2R, and Probe, which are explained above.
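This grouping can be expressed as a simple lookup; the category assignments below follow the standard KDDCUP'99 taxonomy, and only a few of the 22 attack names are shown:

    # Standard KDDCUP'99 grouping of attack labels into four categories.
    ATTACK_CATEGORY = {
        "smurf": "DoS", "neptune": "DoS", "back": "DoS",
        "satan": "Probe", "ipsweep": "Probe", "portsweep": "Probe",
        "guess_passwd": "R2L", "warezclient": "R2L",
        "buffer_overflow": "U2R", "rootkit": "U2R",
        # ... remaining attack names map similarly
    }

    def categorise(label):
        """Map a raw record label to normal / DoS / R2L / U2R / Probe."""
        label = label.rstrip(".")            # labels in the file end with '.'
        return "normal" if label == "normal" else ATTACK_CATEGORY.get(label)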

The Bayes network (BN) parameters map to the HMM parameters as follows:

The parameters N, M, A, and B are common to both the BN and the HMM.

The values of these parameters have been initialized for the KDDCUP99 dataset as N = 4, M = 12.

The parameters A and B have been initialized to some random values, shown in tables 5.16 and 5.17.

Table 5.8 CPT for count variable

count:   Low           Mid           High
         0.997028652   0.002759109   2.12E-04

Table 5.9 CPT for src_bytes variable

dst_bytes   count   src_bytes=M   src_bytes=H   src_bytes=L
H           low     0.9263        0.045779      0.027851
H           mid     0.3333        0.333333      0.333333
H           high    0.3333        0.333333      0.333333
M           low     0.2932        0.233365      0.473416
M           mid     0.0731        0.024390      0.902439
M           high    0.6           0.2           0.2
L           low     0.1101        0.183134      0.706733
L           mid     0.3333        0.333333      0.333333
L           high    0.3333        0.333333      0.333333

Table 5.10 CPT for dst_bytes variable

count   dst_bytes=H   dst_bytes=M   dst_bytes=L
Low     0.6647        0.22270       0.112593118
Mid     0.0243        0.95121       0.024390244
High    0.2           0.6           0.2

Table 5.11 CPT for protocol_type variable

src_bytes   dst_bytes   tcp      udp        icmp
M           H           0.9997   1.15E-04   1.15E-04
M           M           0.9935   0.005393   0.001078
M           L           0.7740   0.005649   0.220338
H           H           0.9953   0.002320   0.002320
H           M           0.9972   0.001360   0.001360
H           L           0.9931   0.003412   0.003412
L           H           0.9923   0.003802   0.003802
L           M           0.0583   0.940983   6.56E-04
L           L           0.6035   0.301333   0.095111

Table 5.12 CPT for count variable

count:   high       low         mid
         0.623876   0.2276258   0.148498

Table 5.13 CPT for protocol_type variable

count   icmp        tcp        udp
high    0.9365916   0.063347   6.13E-05
low     0.1231726   0.831961   0.04487
mid     0.003863    0.995364   7.73E-04

Table 5.14 CPT for src_bytes variable

count   protocol_type   src_bytes=H   src_bytes=L   src_bytes=M
high    icmp            0.910835      6.55E-05      0.0891
high    tcp             9.66E-04      0.99807       9.66E-04
high    udp             0.333333      0.333333      0.333333
low     icmp            0.755102      0.24354       0.001361
low     tcp             0.108015      0.86836       0.023622
low     udp             0.003717      0.97777       0.018587
mid     icmp            0.882353      0.05882       0.058824
mid     tcp             2.59E-04      0.99948       2.59E-04
mid     udp             0.2           0.6           0.2

Table 5.15 CPT for dst_bytes variable

count   src_bytes   dst_bytes=L   dst_bytes=H   dst_bytes=M
high    H           0.998068      9.66E-04      9.66E-04
high    L           0.998533      7.34E-04      7.34E-04
high    M           0.523373      0.47571       9.17E-04
low     H           0.952351      0.01708       0.030571
low     L           0.853659      0.13821       0.00813
low     M           0.882353      0.05882       0.058824
mid     H           0.999483      2.58E-04      2.58E-04
mid     L           0.333333      0.333333      0.333333
mid     M           0.333333      0.333333      0.333333

Table 5.16 State Transition Probability Matrix A

                protocol_type   src_bytes   dst_bytes   count
protocol_type   0.50            0           0.30        0.20
src_bytes       0.50            0.3         0.1         0.1
dst_bytes       0.20            0.1         0.5         0.2
count           0.60            0.2         0.1         0.1

Table 5.17 Emission Probability Matrix B

                udp     icmp   tcp     L      M        H         l      m        h      low      mid   high
protocol_type   0.10    0.21   0.177   0.52   0.0477   0.02785   0      0        0      0        0     0
src_bytes       0.133   0.13   0       0.1    0.1333   0.133     0      0.1333   0      0.1333   0.1   0
dst_bytes       0       0      0       0.13   0.1333   0.1333    0.27   0.01     0.02   0.1      0.1   0.1
count           0.1     0.1    0.1     0.1    0.133    0.173     0      0        0      0.128    0.1   0.624

Result Analysis

The experiments have been conducted using two datasets, KDDCUP'99 and Masquerade data, to evaluate the model's performance. The best features (attributes) for each dataset were chosen in the first phase of the model using the entropy statistical measure. The performance of any model depends on the chosen feature vectors.

In the second phase the model is built using the chosen features; the training dataset is used to train the model, i.e., to estimate the model parameters. The model is then tested with test samples. During test-sample classification, the third phase of the model, the any-path method of the HMM was used, giving a high percentage of true-positive classifications. The performance of the model was also checked, and the mean squared error (MSE) is very low.

Summary

In this thesis the proposed work has described the use of a hidden Markov model for an intrusion detection system. The KDD Cup 1999 dataset was used to train and test the IDS built on an HMM. We have taken only four of the 41 features in our dataset. As described above, the HMM has been trained on normal TCP connection records of the KDD Cup 1999 dataset. While training the model, it is necessary to initialize appropriate values, because the performance of the model depends mainly on these values; for this reason we initialized the parameter B using the chosen dataset. After the initialization of the A, B, and π parameters, model selection is a major issue. Training is performed using the standard Baum-Welch algorithm, and the forward algorithm is used to test the network traffic, which is classified as normal or intrusion. Thus, hidden Markov methodology, with suitable parameter estimation and training, represents a powerful approach for creating an intrusion detection system that can determine at runtime whether traffic is normal or intrusive, addressing a major concern of computer security.

The proposed IDS model is designed with a Bayes hidden Markov model for intrusion detection. Attributes with high entropy have been selected, which reduces the data size while preserving performance. In the first phase, the joint probabilities were estimated from the full data; this model-learning step uses the Bayes network, whose parameters were initialized and some of which are the same as the HMM parameters. In the second phase, the model was tested and showed high performance in detecting intrusions on large, high-dimensional datasets. The same model remains to be tested on other real-time data streams. The proposed approach is simple and easy to implement in real-time network-security applications. We are extending our work to a more rigorous dataset in order to build a highly reliable intrusion detection system.


