Machine Learning Based Anomaly Detection

This chapter describes the basics of Intrusion Detection Systems (IDSs) and the techniques used to build them. The IDS concept was introduced by Anderson [6], later formalised by Denning [18], and has been the subject of continuous research for the past 20 years. IDSs are systems that aim at detecting intrusions: sets of actions that attempt to compromise the integrity, confidentiality or availability of a computer resource [19]. There are two approaches to intrusion detection: anomaly detection and misuse detection. Misuse detection maintains pre-recorded, known intrusion patterns (signatures) and tries to find similar patterns in ongoing activity; if a match is found, it is reported as an intrusion. A classifier is trained on known intrusion patterns and relies on network traffic attributes to discriminate regular patterns from intrusion patterns. In the anomaly detection approach, intrusions are found by examining network behaviour: if the behaviour of the network deviates significantly from its normal or regular behaviour, the network is considered to be under intrusion.

2.1 Misuse-based IDS

Misuse detection is also known as signature-based or knowledge-based detection. It works on a principle similar to that followed by most anti-virus software: it relies on accumulated knowledge about past attacks and vulnerabilities to detect intrusion attempts [20, 21]. Misuse detection systems monitor hosts and networks and compare current activities with the "signatures" of past attacks. If the current activity matches any of the known signatures, an alarm is triggered [22].

Signature-based detection is normally used for detecting known attacks. Attack signatures can be represented in different ways [23]. One of these is the content signature, a string of characters that appears in the payload of attack packets. No knowledge of normal traffic is required, but a signature database is needed for this type of detection system. In the case of worm detection, for example, the system does not care how a worm finds its target, how it propagates itself or what transmission scheme it uses; it simply screens the payload and identifies whether or not it contains a worm.
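
To make the idea concrete, the following minimal Python sketch screens a packet payload against a small signature database. The signature names, byte patterns and payload are invented examples rather than rules from any deployed system.

# Minimal content-signature matching sketch: report every known signature whose
# byte pattern occurs in the packet payload. Signatures and payload are examples.
signatures = {
    "codered-worm": b"GET /default.ida?NNNNNNNN",
    "shell-escape": b"/bin/sh",
}


def match_signatures(payload: bytes):
    """Return the names of all signatures found in the payload."""
    return [name for name, pattern in signatures.items() if pattern in payload]


packet_payload = b"GET /default.ida?NNNNNNNNNNNNNNNN HTTP/1.0"
hits = match_signatures(packet_payload)
if hits:
    print("ALARM:", hits)          # triggers: the Code-Red-style pattern matches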

Advantages and Limitations:

i. Low Rate of False Alarms: Misuse detection systems are able to detect known attacks with a low false alarm rate when signatures are properly defined in the form of rules [24].

ii. Only Known Attacks Detection: The main drawback of misuse detection systems is that they are unable to detect unknown attacks.

2.2 Anomaly based IDS

Anomaly detection systems are also known as behaviour-based systems. In these systems, intrusions are found by observing system behaviour and checking whether it matches what is expected. The "expected" behaviour is drawn from observations made in the past or from forecasts produced by various techniques. Behaviour that does not correspond to the expected behaviour is flagged as anomalous. The core of the anomaly detection process is knowing what normal or expected behaviour is, rather than what is anomalous [25].

Consider an intruder who does not know the legitimate users' activity patterns and tries to intrude into the system; there is then a strong probability that the intruder's activity will be detected as anomalous. In the ideal case, the set of anomalous activities is identical to the set of intrusive activities, and flagging all anomalous activities as intrusive results in no false positives and no false negatives. However, intrusive activity is not always recognised as anomalous activity. Kumar and Spafford [26] suggested that there are four possibilities, each with a non-zero probability:

Intrusive but not anomalous: These are false negatives. An intrusion detection system fails to detect this type of activity as the activity is not anomalous. These are called false negatives because the intrusion detection system falsely reports the absence of intrusions.

Not intrusive but anomalous: These are false positives. In other words, the activity is not intrusive, but because it is anomalous, an intrusion detection system reports it as intrusive. These are called false positives because an intrusion detection system falsely reports intrusions.

Not intrusive and not anomalous: These are true negatives; the activity is not intrusive and is not reported as intrusive.

Intrusive and anomalous: These are true positives; the activity is intrusive and is reported as such.
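
These four cases are simply the cells of a confusion matrix. The short Python sketch below makes the mapping explicit; the function and the example events are hypothetical and introduced only for illustration.

# Map ground truth ("intrusive") and detector output ("anomalous") to the four
# cases listed above. Purely illustrative.
def categorize(intrusive: bool, anomalous: bool) -> str:
    if intrusive and not anomalous:
        return "false negative"   # missed attack
    if not intrusive and anomalous:
        return "false positive"   # false alarm
    if not intrusive and not anomalous:
        return "true negative"    # benign, not flagged
    return "true positive"        # attack, flagged


for intrusive, anomalous in [(True, False), (False, True), (False, False), (True, True)]:
    print(intrusive, anomalous, "->", categorize(intrusive, anomalous))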

Advantages and Limitations:

i) Unknown Attacks Detection: The main advantage of anomaly detection systems is that, contrary to misuse detection systems, they can detect unknown or novel attacks, since they do not rely on any a priori knowledge of specific intrusions. It is also important to note that anomaly detection systems are not intended to replace misuse detection systems but to complement them, exploiting the efficiency of misuse detection for known attacks.

ii) High Rate of False Alarms: The main drawback of anomaly detection systems is their potentially high rate of false alarms: legitimate behaviour that has not been observed before deviates from the learned profile and may be flagged as intrusive, which degrades the accuracy of the system.

2.2.1 Statistical anomaly detection

In statistical methods, the system observes the activity of subjects and generates profiles to represent their behaviour. A profile typically includes measures such as activity intensity measures, audit record distribution measures, categorical measures and ordinal measures (such as CPU usage). Typically, two profiles are maintained for each subject: the current profile and the stored profile. As system and network events (viz. audit log records, incoming packets, etc.) are processed, the intrusion detection system updates the current profile and periodically calculates an anomaly score (indicating the degree of irregularity of the specific event) by comparing the current profile with the stored profile using a function of the abnormality of all measures within the profile. If the anomaly score is higher than a certain threshold, the intrusion detection system generates an alert.
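
The following Python sketch illustrates this profile-comparison idea in its simplest form. The measure names, the stored profile statistics, the z-score-style combination and the threshold are all illustrative assumptions; real systems use far richer measures and weighting functions.

# Compare a current profile against a stored profile and raise an alert when the
# combined abnormality exceeds a threshold. All numbers are invented.
stored_profile = {"cpu_usage": (0.20, 0.05),           # (mean, std) learned earlier
                  "logins_per_hour": (3.0, 1.0),
                  "audit_records_per_min": (40.0, 10.0)}
THRESHOLD = 3.0


def anomaly_score(current_profile: dict) -> float:
    """Average per-measure abnormality, here a simple z-score."""
    score = 0.0
    for measure, value in current_profile.items():
        mean, std = stored_profile[measure]
        score += abs(value - mean) / std
    return score / len(current_profile)


current = {"cpu_usage": 0.75, "logins_per_hour": 2.0, "audit_records_per_min": 42.0}
score = anomaly_score(current)
print("ALERT" if score > THRESHOLD else "normal", round(score, 2))   # ALERT 4.07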

Statistical approaches to anomaly detection have a number of advantages. Firstly, these systems, like most anomaly detection systems, do not require prior knowledge of security flaws or of the attacks themselves. As a result, such systems can detect "zero day" or the very latest attacks. In addition, statistical approaches can provide accurate notification of malicious activities that typically occur over extended periods of time and are good indicators of future denial-of-service (DoS) attacks. A very common example is the portscan: the distribution of portscans is typically highly anomalous in comparison with the usual traffic distribution. This is particularly true when a packet has unusual features (e.g., a crafted packet). With this in mind, even portscans that are distributed over a lengthy time frame will be recorded, because they are inherently anomalous.

Shortcomings:

Skilled attackers can gradually train a statistical anomaly detection system to accept abnormal behaviour as normal.

It can also be difficult to determine thresholds that balance the likelihood of false positives with the likelihood of false negatives.

Statistical methods need accurate statistical distributions, but not all behaviours can be modelled using purely statistical methods.

Haystack [27] is one of the earliest examples of a statistical anomaly-based intrusion detection system. It uses both user-based and group-based anomaly detection strategies, and models system parameters as independent Gaussian random variables. Haystack defined a range of values considered normal for each feature. If, during a session, a feature fell outside its normal range, the score for the subject was raised. Assuming the features are independent, the probability distribution of the score can be calculated, and an alarm is raised if the score is too large. Haystack also maintained a database of user groups and individual profiles. If a user had no existing profile, a new profile with minimal capabilities was created using restrictions based on the user's group membership.

It was designed to detect six types of intrusions:

attempted break-ins by unauthorized users,

masquerade attacks,

penetration of the security control system,

leakage,

DoS attacks, and

malicious use.

One drawback of Haystack is that it was designed to work offline; performing its statistical analysis on real-time traffic fails because it requires high-performance systems. Secondly, it is difficult for an administrator to determine which attributes to maintain in the profiles so that they serve as good indicators of intrusive activity.

The Intrusion Detection Expert System (IDES) is another early intrusion detection system. It was developed at the Stanford Research Institute (SRI) in the early 1980s [28, 29]. IDES continuously monitors user behaviour and detects suspicious events as they occur. In IDES, intrusions are identified by detecting deviations from established normal behaviour patterns of individual users. As the analysis methodologies developed for IDES evolved, scientists at SRI developed an improved version called the Next-Generation Intrusion Detection Expert System (NIDES) [30, 31].

NIDES can operate in real-time mode or batch mode. In real-time mode it continuously monitors user activity, whereas in batch mode it performs periodic analysis of audit data; however, NIDES was designed primarily to run in real time. Unlike IDES, NIDES is a hybrid system with an upgraded statistical analysis engine. In both IDES and NIDES, the statistical analysis unit maintains profiles of normal behaviour based on a selected set of variables. The frequency distribution of each variable, with the probabilities of its possible ranges of values, is kept in the form of a histogram. The cumulative frequency distribution is then calculated from the ordered set of bin probabilities. Using this frequency distribution and the value of the corresponding measure in the current audit record, it is straightforward to compute a value that reflects how far the current value lies from the "normal" value of the measure.

The actual computation in NIDES [31] produces a value reflecting how abnormal each measure is. By combining the values produced for each measure and taking the correlations between measures into account, the statistical analysis unit computes a single value representing how far the current record deviates from normal behaviour. Any record whose value exceeds the threshold is considered a possible intrusion.
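
The histogram idea can be sketched in Python as follows. The bin edges, bin probabilities and the scoring rule (the probability mass of all bins at least as likely as the observed one) are simplified assumptions and do not reproduce the actual NIDES computation.

import bisect

# Long-term frequency distribution of one measure, kept as a histogram.
bin_edges = [0, 10, 20, 40, 80, 160]          # e.g. audit records per minute
bin_probs = [0.05, 0.25, 0.40, 0.25, 0.05]    # learned bin probabilities


def abnormality(value: float) -> float:
    """Probability mass of all bins at least as likely as the observed one;
    a value falling in a rare bin therefore scores close to 1."""
    idx = bisect.bisect_right(bin_edges, value) - 1
    idx = max(0, min(idx, len(bin_probs) - 1))
    p = bin_probs[idx]
    return sum(q for q in bin_probs if q >= p)


print(abnormality(25))    # value in a common bin -> 0.40
print(abnormality(150))   # value in a rare bin   -> 1.00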

However, the above techniques have several drawbacks [28-31]. Firstly, they are sensitive to the normality assumption: if the data on a measure are not normally distributed, the techniques give a high false alarm rate. Secondly, the techniques are predominantly univariate, in that a statistical norm profile is built for only one measure of the activities in a system, whereas intrusions often affect multiple activity measures together.

The Statistical Packet Anomaly Detection Engine (SPADE) [32] is a statistical anomaly detection system used as a plug-in for SNORT [33] for the automatic detection of stealthy port scans. Instead of the traditional approach of looking for p connection attempts over q seconds, SPADE computes an anomaly score for each packet using a frequency-based approach [32]: the fewer times a given packet has been seen, the higher its anomaly score. In other words, the anomaly score is a degree of strangeness based on recent past activity. Once the anomaly score crosses a threshold, the packets are forwarded to a correlation engine designed to detect port scans. The major drawback of SPADE is its very high false alarm rate, which stems from the fact that SPADE classifies all unseen packets as attacks regardless of whether they are actually intrusions.
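
A minimal Python sketch of this frequency-based scoring idea follows. The choice of features (destination IP address and port), the logarithmic score and the fixed threshold are illustrative assumptions rather than SPADE's exact configuration.

from collections import Counter
import math

history = Counter()       # how often each (dst IP, dst port) pair has been seen
THRESHOLD = 2.5


def observe_and_score(dst_ip: str, dst_port: int) -> float:
    """Return -log2 of the observed relative frequency: rarer combinations score higher."""
    key = (dst_ip, dst_port)
    history[key] += 1
    p = history[key] / sum(history.values())
    return -math.log2(p)


for port in [80, 80, 80, 443, 22, 31337]:
    score = observe_and_score("10.0.0.5", port)
    flag = "portscan candidate" if score > THRESHOLD else "ok"
    print(port, round(score, 2), flag)       # only the rarely seen port 31337 is flagged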

2.2.2 Machine learning based anomaly detection

Machine learning can be defined as the ability of a program or system to learn and improve its performance on a certain task or group of tasks over time. Machine learning aims to answer many of the same questions as statistics or data mining [34]. However, unlike statistical approaches, which tend to focus on understanding the process that generated the data, machine learning techniques focus on building a system that improves its performance on the basis of previous results. In other words, systems based on the machine learning paradigm are able to change their execution strategy on the basis of newly acquired information.

2.2.3 System call based sequence analysis

One of the widely used machine learning techniques for anomaly detection involves learning the behaviour of a program and recognising significant deviations from it. Forrest et al. [35] established an analogy between the human immune system and intrusion detection. To build a normal profile, they analysed the system call traces of programs such as sendmail and lpr, and showed that correlations in fixed-length sequences of system calls could be used to build a normal profile of a program. Programs whose sequences deviate from the normal sequence profile can therefore be considered victims of an attack. The system operates off-line on previously collected data and uses a fairly simple table-lookup algorithm to learn the profiles of programs. This work was extended by Hofmeyr et al. [36], who build a database of the normal behaviour of each program of interest. Once a stable database is constructed for a given program in a particular environment, the database is used to monitor the program's behaviour. The sequences of system calls form the set of normal patterns in the database, and sequences not found in the database indicate anomalies.
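
The fixed-length sequence idea can be sketched as follows. The window length, the training traces and the test trace are invented examples, and real systems such as that of Hofmeyr et al. use more elaborate mismatch counting than the simple set lookup shown here.

# Build a database of all length-K system-call windows seen in normal runs and
# flag windows of a monitored trace that never occurred during training.
K = 3


def windows(trace, k=K):
    return {tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)}


normal_traces = [                                  # hypothetical "normal" runs
    ["open", "read", "mmap", "read", "close"],
    ["open", "mmap", "read", "write", "close"],
]
normal_db = set()
for trace in normal_traces:
    normal_db |= windows(trace)

test_trace = ["open", "read", "mmap", "execve", "close"]
mismatches = [w for w in windows(test_trace) if w not in normal_db]
print(len(mismatches), "anomalous windows:", mismatches)   # 2 unseen windows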

2.2.4 Bayesian networks

A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, Bayesian networks have several advantages for data analysis [37]. Firstly, they can handle missing data, because they encode the interdependencies between variables. Secondly, Bayesian networks can represent causal relationships and can therefore be used to predict the consequences of an action. Lastly, because Bayesian networks capture both causal and probabilistic relationships, they can be used to model problems where prior knowledge must be combined with data. Several researchers have adapted ideas from Bayesian statistics to create models for anomaly detection [38-40]. Valdes et al. [39] developed an anomaly detection system that employs naive Bayesian networks to perform intrusion detection on traffic bursts. Used as part of EMERALD [41], it has the capability to detect distributed attacks in which each individual attack session is not suspicious enough to generate an alert on its own. However, this scheme also has a few disadvantages. First, as pointed out in [38], the classification capability of naive Bayesian networks is identical to that of a threshold-based system that computes the sum of the outputs obtained from the child nodes. Secondly, because the child nodes do not interact among themselves and their output only influences the probability of the root node, incorporating additional information becomes difficult, as the variables that carry that information cannot interact directly with the child nodes.
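
As a rough illustration of how independent child-node observations update the belief in the root hypothesis, the naive Bayes sketch below scores a traffic burst as attack or normal. The feature names, priors and conditional probabilities are invented for illustration and are not the model used in EMERALD.

import math

priors = {"normal": 0.95, "attack": 0.05}
cond = {                      # P(feature observed | class), invented values
    "normal": {"high_syn_rate": 0.05, "many_dst_ports": 0.02, "odd_flags": 0.01},
    "attack": {"high_syn_rate": 0.70, "many_dst_ports": 0.60, "odd_flags": 0.40},
}


def posterior(observed):
    """Naive Bayes: combine the prior with the likelihood of each observed feature."""
    log_post = {c: math.log(priors[c]) + sum(math.log(cond[c][f]) for f in observed)
                for c in priors}
    m = max(log_post.values())
    unnorm = {c: math.exp(v - m) for c, v in log_post.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


print(posterior(["high_syn_rate", "many_dst_ports"]))   # attack probability ~0.96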

Another area within the domain of anomaly detection where Bayesian techniques have frequently been used is the classification and suppression of false alarms. Kruegel et al. [38] proposed a multi-sensor fusion approach in which the outputs of different IDS sensors are aggregated to produce a single alarm, based on the assumption that no single anomaly detection technique can classify a set of events as an intrusion with sufficient confidence. Although using Bayesian networks for intrusion detection or intruder behaviour prediction can be effective in certain applications, their limitations should be considered in any actual implementation. Since the accuracy of this method depends on assumptions that are typically based on a behavioural model of the target system, deviating from those assumptions decreases its accuracy. Selecting an inaccurate model leads to an inaccurate detection system, so selecting an accurate model is the first step towards solving the problem; unfortunately, this is not an easy task, as typical systems and networks are complex.

2.2.5 Hidden Markov models

The Hidden Markov Model (HMM) is one of the most popular statistical techniques [42]. The system being modelled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable ones. Unlike a regular Markov model, where the state transition probabilities are the only parameters and the state of the system is directly observable, in a Hidden Markov Model the only visible elements are variables of the system that are influenced by the state, while the state itself is hidden. The states of a Hidden Markov Model represent some unobservable condition of the system being modelled. In each state there is a certain probability of producing any of the observable system outputs, together with a separate probability distribution over the likely next states. By having different output probability distributions in each of the states, and by allowing the system to change states over time, the model is capable of representing non-stationary sequences.

To estimate the parameters of a Hidden Markov Model for modelling normal system behaviour, sequences of events collected during normal system operation are used as training data, and an expectation-maximisation (EM) algorithm estimates the parameters. Once the model has been trained, probability measures computed on test data can be compared against thresholds for anomaly detection. In order to use Hidden Markov Models for anomaly detection, three key problems need to be addressed. The first, known as the evaluation problem, is to determine the probability that a given sequence of observations was generated by the model. The second is the learning problem, which involves building from the audit data a model, or a set of models, that correctly describes the observed behaviour. The third, known as the decoding problem, is to determine, given a Hidden Markov Model and the associated observations, the most likely sequence of hidden states that led to those observations.
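
The evaluation problem can be sketched with the forward algorithm as below. The two-state model parameters, the training sequences and the decision rule (flag a test sequence whose likelihood falls below the minimum likelihood observed on training data) are invented for illustration.

import numpy as np

A = np.array([[0.9, 0.1],        # state transition probabilities
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],   # P(observation symbol | state)
              [0.1, 0.3, 0.6]])
pi = np.array([0.8, 0.2])        # initial state distribution


def sequence_likelihood(obs):
    """Forward algorithm: P(obs | model)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()


normal_seqs = [[0, 0, 1, 0], [0, 1, 0, 0]]                 # "normal" training data
threshold = min(sequence_likelihood(s) for s in normal_seqs)

test = [2, 2, 2, 1]
print("anomalous" if sequence_likelihood(test) < threshold else "normal")   # anomalous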

Warrender et al. [43] compare the performance of four methods, viz. simple enumeration of observed sequences, comparison of the relative frequencies of different sequences, a rule-induction technique, and Hidden Markov Models, at representing normal behaviour accurately and recognising intrusions in system call datasets. They find that while Hidden Markov Models outperform the other three methods, the higher performance comes at a greater computational cost. In the proposed model, they use a Hidden Markov Model with fully connected states, i.e., transitions are allowed from any state to any other state. A process that issues S system calls will therefore have S states, which implies roughly 2S² values in the state transition matrix. In a computer system or network, a process typically issues a very large number of system calls, so modelling all of the processes in this way would be computationally infeasible.

In another paper, Yeung et al. [44] describe the use of Hidden Markov Models for anomaly detection based on profiling system call sequences and shell command sequences. During training, their model computes the sample likelihood of an observed sequence using the forward or backward algorithm, and a threshold on this probability, based on the minimum likelihood among all training sequences, is used to discriminate between normal and anomalous behaviour. One major problem with this approach is that it lacks generalisation and support for users who are not uniquely identified by the system under consideration.

Mahoney et al. [45-47] presented several methods that address the problem of detecting anomalies in the usage of network protocols by inspecting packet headers. The common denominator of all of them is the systematic application of learning techniques to automatically obtain profiles of normal behavior for protocols at different layers. Mahoney et al. experimented with anomaly detection over the DARPA network data [48] by range matching network packet header fields. Packet Header Anomaly Detector (PHAD) [49], Learning Rules for Anomaly Detection (LERAD) [46] and Application Layer Anomaly Detector (ALAD) [47] use time-based models in which the probability of an event depends on the time since it last occurred. For each attribute, they collect a set of allowed values and flag novel values as anomalous.

PHAD, ALAD, and LERAD differ in the attributes that they monitor. PHAD monitors 33 attributes from the Ethernet, IP and transport-layer packet headers. ALAD models incoming server TCP requests: source and destination IP addresses and ports, opening and closing TCP flags, and the list of commands (the first word on each line) in the application payload. Depending on the attribute, it builds separate models for each target host, port number (service), or host/port combination. LERAD also models TCP connections. Even though the data set is multivariate network traffic data containing fields extracted from the packet headers, the authors break the multivariate problem down into a set of univariate problems and sum the weighted results of range matching along each dimension. While the advantage of this approach is that it makes the technique more computationally efficient and effective at detecting network intrusions, breaking multivariate data into univariate data has significant drawbacks, especially for attacks that manifest only in combinations of attributes. For example, in a typical SYN flood attack, an indicator of the attack is seeing more SYN requests than usual while observing a lower-than-normal ACK rate. Because a higher SYN rate or a lower ACK rate alone can each occur in normal usage (when the network is busy or idle), it is the combination of a higher SYN rate and a lower ACK rate that signals the attack.
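
The per-attribute "allowed values" model with a time-based score can be sketched as follows. The attribute (IP time-to-live), the score formula and the example values are simplified assumptions and do not reproduce the exact rules used by PHAD, ALAD or LERAD.

# One model per packet-header attribute: remember the values seen in training,
# score a novel value by t * n / r, where t is the time since this attribute last
# produced an anomaly, n the number of training observations and r the number of
# distinct allowed values. Known values score zero.
class AttributeModel:
    def __init__(self):
        self.allowed = set()
        self.observations = 0
        self.last_anomaly_time = 0.0

    def train(self, value):
        self.allowed.add(value)
        self.observations += 1

    def score(self, value, now):
        if value in self.allowed:
            return 0.0
        t = now - self.last_anomaly_time
        self.last_anomaly_time = now
        return t * self.observations / len(self.allowed)


ttl_model = AttributeModel()
for ttl in [64, 64, 128, 64, 128, 255]:      # TTL values seen during training
    ttl_model.train(ttl)

print(ttl_model.score(64, now=100.0))   # known value -> 0.0
print(ttl_model.score(1, now=100.0))    # novel value -> 100.0 * 6 / 3 = 200.0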

The major drawback of many machine learning techniques, such as the system-call-based sequence analysis approach and the Hidden Markov Model approach mentioned above, is that they are resource-expensive. For example, an anomaly detection technique based on the Markov chain model is computationally expensive because it uses parametric estimation techniques based on Bayes' algorithm to learn the normal profile of the host or network under consideration. Given the large amount of audit data and the relatively high frequency of events in today's computers and networks, such a technique is not scalable for real-time operation.

2.2.6 Fuzzy Inference System

Fuzzy inference systems (FIS) are one of the best-known applications of fuzzy logic and fuzzy set theory [50]. They are widely used for process simulation and control, and can be designed either from expert knowledge or from data. For complex systems, a FIS based on expert knowledge alone may suffer from a loss of accuracy, which is the main incentive for using fuzzy rules inferred from data. Designing a FIS from data has two main phases [51]:

i) automatic rule generation

ii) system optimization.

Rule generation leads to a basic system with a given partitioning of the input space and the corresponding set of rules. System optimization can then be done at various levels: variable selection can be an overall selection or can be managed rule by rule, and rule-base optimization aims to select the most useful rules and to optimize their conclusions. Fuzzy inference systems are based on fuzzy logic: they take one or more inputs and produce results for one or more outputs. The mapping of input and output values to a degree of membership (a number between 0 and 1) is performed by membership functions [52], and the relation between inputs and outputs is declared by if-then rules.
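
A small Mamdani-style sketch of a fuzzy inference step for intrusion detection is given below: two inputs are fuzzified with triangular membership functions, combined by if-then rules using min as the AND operator, and defuzzified by a weighted average. The membership functions, the rules and the intrusion-detection interpretation are invented for illustration.

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def infer(conn_rate, fail_ratio):
    # Fuzzification: map crisp inputs to membership degrees in [0, 1].
    rate_high = tri(conn_rate, 50, 100, 150)
    rate_low = tri(conn_rate, 0, 0, 60)
    fail_high = tri(fail_ratio, 0.3, 0.6, 0.9)
    fail_low = tri(fail_ratio, 0.0, 0.0, 0.4)

    # Rules: IF rate is high AND failures are high THEN alert is high (and vice versa).
    alert_high = min(rate_high, fail_high)
    alert_low = min(rate_low, fail_low)

    # Defuzzification: weighted average of the rule conclusions (high = 1, low = 0).
    if alert_high + alert_low == 0:
        return 0.5                      # no rule fires: stay neutral
    return (alert_high * 1.0 + alert_low * 0.0) / (alert_high + alert_low)


print(infer(conn_rate=110, fail_ratio=0.7))   # 1.0 -> alert
print(infer(conn_rate=20, fail_ratio=0.1))    # 0.0 -> normal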

2.3 Summary

In this chapter, different intrusion detection approaches were discussed, together with various other techniques useful for mining security information. As security becomes more challenging day by day, machine learning and soft computing techniques are used in this thesis to develop novel approaches. One such technique, a novel rule-based classifier for IDS, is discussed in the next chapter.


