Concept Drift in Data Streams


02 Nov 2017


Abstract- Data streams arise in dynamic and emerging environments such as e-commerce, financial data analysis, sensor systems, social networking and many other fields. The term concept refers to the whole distribution of the problem at a certain point in time, and concept drift therefore represents a change in that distribution. A data stream changes constantly over time because of hidden concepts that exhibit varying degrees of drift, and the magnitude and frequency of the drift are often not known a priori. Such streams are very difficult to handle with traditional techniques, which assume that the underlying distribution is stationary. With the advent of streaming data and long-life classification systems, it has become clear that this assumption no longer holds. Training an accurate, fast and light classifier for an unpredictable, large and growing environment is therefore very important and remains an open research problem.

Keywords- Classifier, Stream Data, Change Detection, Concept Drift.

I. introduction

In the last decade, continuous, high-speed and dynamic data, referred to as data streams, have been generated in emerging environments such as telecommunication, social networking, web mining, scientific data, financial data and many other applications. A data stream is an ordered sequence of instances that shows evidence of varying degrees of change. Stream data mining is a difficult process because the data arrive as continuous, high-speed, large and time-varying streams, and the processing of such streams is subject to a real-time constraint. Only a small summary can therefore be computed and stored. Because the data arrive very quickly, each element has to be processed effectively in real time and then discarded. The problem of data stream classification has been widely studied over the last decade, and the dynamic and evolving nature of data streams has attracted considerable research attention. Researchers have identified the two most challenging characteristics of a data stream as its infinite length and concept drift. Concept drift occurs when the underlying concepts of the stream change over time. Online environments are often non-stationary, and the variables to be predicted by the learning machine may change with time. For example, the users of an information filtering system may change their subjects of interest over time. Learning machines should therefore be able to model such changes and adapt to these environments quickly and accurately.

Several approaches for handling concept drift have been proposed over the last few years and are discussed in [8]. One example is the Flora framework, which uses the Window Adjustment Heuristic method by Widmer G. This method is based on the matching conditions between the description items and the samples: the concept description items in the training set are divided into three categories, namely a positive descriptor set, a negative descriptor set and an uncertain descriptor set. Its benefit is that it maintains a dynamic window to keep track of the occurrence of drift. However, it deals with only one sample at a time, which limits the speed of arriving data that it can handle. Several ensemble classifier methods, such as the streaming ensemble algorithm and the dynamic weighted ensemble, have also been developed so far.

However, with the advent of data streams and long-life classification systems, it has become clear that the assumption of a stationary environment no longer holds. Training an accurate, fast and light classifier for an unpredictable, large and growing environment is therefore very important and remains an open research problem. There is a need for new advances in this field that are capable of handling concept drift in data streams quickly and accurately.

The paper is organized as follows. The theoretical foundation, including the background on data streams and the types of concept drift, is discussed in Section II. Section III reviews several methodologies that handle concept drift in data streams. Issues concerning data sets are discussed in Section IV, and Section V concludes the paper.

II. theoretical foundation

Data stream: A data stream is an ordered sequence of instances. Data streams pose several challenges for the design of data mining algorithms. One is that the algorithms must be capable of working with limited resources, both time and memory. Another is that, by necessity, they must deal with data whose nature or distribution changes over time. In turn, dealing with time-changing data requires strategies for detecting and quantifying change. Stream mining is a difficult process because the data arrive as continuous, high-speed and time-varying streams, and the processing of such streams is subject to a real-time constraint. Zaslavsky et al. [10] discuss data stream mining and the importance of its applications, and review several techniques that have been applied in the field, such as sampling, load shedding and windowing. These techniques have their roots in statistics and theoretical computer science. There are two categories of stream mining techniques: data-based and task-based. Based on these two categories, a number of classification, clustering, time series and frequency counting algorithms have been developed.
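As a minimal sketch of two of the techniques named above, windowing and sampling, the following Python fragment keeps a bounded-memory summary of a stream. The class and function names are illustrative assumptions and are not taken from [10]; reservoir sampling is used here only as one well-known sampling scheme.

```python
import random
from collections import deque


class SlidingWindowMean:
    """Keeps the mean of the most recent `size` values in bounded memory."""

    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def add(self, value):
        # When the window is full, the oldest value is about to be evicted.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value

    def mean(self):
        return self.total / len(self.window) if self.window else 0.0


def reservoir_sample(stream, k, rng=None):
    """Uniform sample of k items from a stream of unknown length."""
    rng = rng or random.Random(0)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = item       # replace with probability k / (i + 1)
    return sample
```

Both structures process each instance once and then discard it, which matches the one-pass, limited-memory setting described above.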

Concept Drift: The fundamental problem in learning drifting concepts is how to identify, in a timely manner, those data in the training set that are no longer consistent with the current concept. Several criteria are used to measure concept drift, one of which is speed.

Speed: This is the inverse of the time taken for the old concept to be completely replaced by the new concept. Two forms exist: gradual and abrupt.

Gradual Concept Drift: The old concept is replaced by the new concept slowly, over many time steps.

Abrupt Concept Drift: The old concept is replaced by the new concept immediately, within a single time step.

Figure 1 shows the two major types of concept change: abrupt (sudden) and gradual.

Fig 1: Types of concept change in stream data
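The difference between the two types can be illustrated with a small Python sketch that gives, for each time step, the probability that an instance is drawn from the new concept. The function names, the sigmoid shape of the gradual transition and the `width` parameter are assumptions made for illustration; here, speed corresponds to the inverse of `width`.

```python
import math


def abrupt_profile(t, change_point):
    """Probability that instance t comes from the new concept (sudden switch)."""
    return 1.0 if t >= change_point else 0.0


def gradual_profile(t, change_point, width):
    """Sigmoid transition centred on change_point.

    Equivalent to 1 / (1 + exp(-4 * (t - change_point) / width)), written
    with tanh so that it cannot overflow far away from the change point.
    """
    return 0.5 * (1.0 + math.tanh(2.0 * (t - change_point) / width))
```

A small `width` makes the gradual profile approach the abrupt one, while a large `width` spreads the replacement of the old concept over many time steps.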

III. methodology that handle concept drift

Data streams have gained considerable attention in the research community. Research in this field focuses mainly on areas such as query processing and data stream mining. Much work has been done on data stream mining, including methods for classification and clustering, but mining concept-drifting data streams is still an open research problem. Several methods that can handle concept-drifting data streams are therefore reviewed here.

A. CSHT- Haitian Chen et al. [1]: CSHT, in which the classifiers in the ensemble are selected for integration on the basis of a hypothesis test, is an ensemble learning framework for supervised learning under concept drift. The system is based on detecting concept drift and adapting to it by classifier selection. The technique aims to identify the usable base classifiers, that is, those representing the same concept as the current one or a similar concept, in order to improve accuracy. This approach uses Naïve Bayes as the base classifier. A hypothesis test is used to monitor the stability of the distribution underlying the data batches over an extended period of time. The benefit of this method is that it adapts to different kinds of drift and can achieve better performance; however, it does not deal with abruptly changing or conflicting concepts.
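The idea of selecting base classifiers with a hypothesis test can be sketched as follows. The specific test used in [1] is not described here, so this sketch assumes a two-proportion z-test on a classifier's error rates over two batches; the function names and the critical value 1.96 are illustrative assumptions.

```python
import math


def two_proportion_z(errors_a, n_a, errors_b, n_b):
    """z statistic for the null hypothesis that both batches share one error rate."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 0.0  # both batches are all-correct or all-wrong
    return (p_b - p_a) / se


def is_usable(errors_ref, n_ref, errors_cur, n_cur, z_crit=1.96):
    """Assumed selection rule: keep a base classifier when its error on the
    current batch does not differ significantly from its reference error."""
    z = two_proportion_z(errors_ref, n_ref, errors_cur, n_cur)
    return abs(z) <= z_crit
```

Under this assumed rule, a classifier with 20 errors in 200 reference instances and 22 errors in 200 current instances would be kept, whereas one whose errors rise to 80 in 200 would be excluded from the ensemble.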

B. CDBD [2]: Confidence Distribution Batch Detection was proposed by Mortaza Zi Hayat et al. It uses a Support Vector Machine and the Kullback-Leibler divergence. CDBD is a concept drift handling approach that explicitly detects changes in the data without using labeled data. The classifier, built from the initial training data, classifies the instances in the stream, and its outputs are stored in batches. The detection algorithm calculates an indicator value on the current batch and flags a change in concept. The benefit of this approach is that it has an explicit detection mechanism for concept change and so does not require labeled instances to detect concept drift; however, its rebuild policy is less suitable.
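A rough Python sketch of this kind of unlabeled batch detection is given below. It assumes that the classifier outputs are confidence values in [0, 1], that they are summarized by a smoothed histogram, and that the indicator is the Kullback-Leibler divergence of the current batch from a reference batch; the bin count, the threshold and all function names are assumptions, not details taken from [2].

```python
import math


def histogram(values, bins=10):
    """Normalised histogram of values in [0, 1] with add-one smoothing."""
    counts = [1.0] * bins  # smoothing avoids log(0) and division by zero
    for v in values:
        idx = max(0, min(int(v * bins), bins - 1))
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def drift_flag(reference_conf, batch_conf, threshold=0.1, bins=10):
    """Flag drift when the batch confidences diverge from the reference ones."""
    indicator = kl_divergence(histogram(batch_conf, bins),
                              histogram(reference_conf, bins))
    return indicator > threshold
```

No class labels appear anywhere in the sketch: only the distribution of the classifier's own confidence values is compared from batch to batch.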

C. ACoBE- Teo Susnjak et al. [4]: This is a layered approach in which individual classifiers are combined into ensemble clusters and assigned competence weights based on their performance on the training data in each layer. At run time, all ensemble clusters produce a collective value, which is compared with a threshold to formulate the final classification. The benefit of this approach is that it can handle different types of drift, such as gradual and recurring drift. However, because of the layer threshold update, it is difficult to determine its responsiveness to concept drift, that is, to distinguish gradual adaptation from performance degradation.

D. AEBC [5]: This adaptive ensemble boosting approach was proposed by K Wankhede et al. It uses an adaptive sliding window and a Hoeffding tree with adaptive naïve Bayes as the base learner. The Adaptive Ensemble Boosting Classifier has several distinct features: it is dynamically adaptive, uses less memory and processes data quickly. The sliding window is used for change detection and for time and memory management. In this algorithm the sliding window is parameter-free and assumption-free in the sense that it automatically detects and adapts to the current rate of change. If a change is detected, it raises a change alarm. The method can adapt to gradual concept drift but not to abrupt drift.
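The behaviour of a window that shrinks when a change is detected can be sketched in Python as follows. This is a simplified illustration in the spirit of adaptive windowing, not the detector used in [5]: it checks every split of the window in linear time and uses a Hoeffding-style bound, and the class name, `delta` and `max_len` are assumptions introduced for the sketch.

```python
import math
from collections import deque


class SimpleAdaptiveWindow:
    """Simplified adaptive window over values in [0, 1] (e.g. 0/1 errors)."""

    def __init__(self, delta=0.002, max_len=1000):
        self.delta = delta                    # confidence parameter (assumed)
        self.window = deque(maxlen=max_len)   # memory cap, oldest dropped first

    def add(self, value):
        """Append a value; return True if a change alarm is raised."""
        self.window.append(value)
        alarm = False
        while self._drop_stale_prefix():
            alarm = True
        return alarm

    def _drop_stale_prefix(self):
        """Try every split old|recent; on a significant gap drop the old part."""
        values = list(self.window)
        n = len(values)
        total = sum(values)
        head = 0.0
        for n0 in range(1, n):
            head += values[n0 - 1]
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(head / n0 - (total - head) / n1) > eps:
                for _ in range(n0):
                    self.window.popleft()
                return True
        return False
```

While the stream is stable the window grows up to `max_len`; when the mean of the recent part differs significantly from the older part, the older part is discarded and an alarm is returned.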

E. WEAP-I [6]: This weighted ensemble and averaging probability integrated method was proposed by Tao Wang. It is based on a weighted ensemble and an averaging probability ensemble under the learnable assumption. The method trains a weighted ensemble on the most recent n data chunks and an averaging probability ensemble on the most recent data chunk. This approach can handle continuously occurring concept drift, and it is more robust than the averaging probability ensemble.

F. Learn++.NSE- Ryan Elwell et al. [7]: This approach allows the algorithm to identify changes in the data distribution and act accordingly, as well as to recognize a possible recurrence of an earlier distribution. Learn++.NSE is an ensemble-based batch learning algorithm that uses weighted majority voting, in which the weights are dynamically updated according to the classifiers' time-adjusted errors on current and past environments. It employs a passive drift detection mechanism and uses only current data for training. It can handle a variety of non-stationary environments, including drift that is slow or fast, gradual, cyclical or even of variable rate. It is also one of the few algorithms that can handle the addition of a new class or the deletion of an existing class. However, it does not provide a statistical analysis of possible performance guarantees for different NSE scenarios.
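Weighted majority voting with time-adjusted errors can be sketched in Python as follows. The exact weighting scheme of Learn++.NSE is not reproduced here: this sketch assumes an exponential decay over past batches and a log-odds voting weight, and the function names, `decay` and `floor` are illustrative assumptions.

```python
import math
from collections import defaultdict


def time_adjusted_error(errors, decay=0.5):
    """Weighted average of per-batch errors (oldest first).

    The error observed j batches ago receives weight decay**j, so recent
    environments count more than old ones.
    """
    n = len(errors)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)


def voting_weight(errors, decay=0.5, floor=1e-6):
    """Log-odds weight; a classifier no better than chance gets weight 0."""
    err = min(max(time_adjusted_error(errors, decay), floor), 0.5)
    return math.log((1.0 - err) / err)


def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total voting weight."""
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w
    return max(scores, key=scores.get)
```

Because the weights depend on recent errors, a classifier that was accurate on an earlier environment but performs poorly on the current one loses influence, and it can regain that influence if the earlier environment recurs.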

IV. discussion

Data Sets: A data set has several characteristics that define its structure, including the types of its attributes and various statistical measures. In statistics, data sets usually come from actual observations obtained by sampling a statistical population, and each record corresponds to the observations on one element. Analyzing concept-drifting data streams is difficult when working with real-world data sets, because the magnitude and frequency of the drift, the type of drift and the time at which it starts are not known a priori. It may not even be known whether drift is present at all. It is therefore not possible to analyze the detailed behavior of a method or algorithm in the presence of concept drift using only real-world data sets, and it is necessary to use synthetic data containing simulated drift. Synthetic data sets can be generated by algorithms for testing purposes; the generation of synthetic data is discussed in [12][14].
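As a minimal sketch of synthetic data with simulated drift, the following Python generator switches between two labeling rules at a known change point. The two concepts, the sigmoid transition and the parameter names are assumptions chosen for illustration and do not reproduce the generators described in [12][14].

```python
import math
import random


def synthetic_drift_stream(n, change_point, width, seed=0):
    """Yield (features, label, is_new_concept) triples with simulated drift.

    Old concept: label is 1 when x0 + x1 > 1.0.
    New concept: label is 1 when x0 - x1 > 0.0.
    width <= 1 gives an abrupt switch at change_point; larger values
    give a gradual transition centred on change_point.
    """
    rng = random.Random(seed)
    for t in range(n):
        if width <= 1:
            p_new = 1.0 if t >= change_point else 0.0
        else:
            p_new = 0.5 * (1.0 + math.tanh(2.0 * (t - change_point) / width))
        is_new = rng.random() < p_new      # which concept labels this instance
        x = (rng.random(), rng.random())
        if is_new:
            label = int(x[0] - x[1] > 0.0)
        else:
            label = int(x[0] + x[1] > 1.0)
        yield x, label, is_new
```

Because the change point, the type of drift and its speed are set by the experimenter, the behavior of a method before, during and after the drift can be examined in detail, which is not possible with real-world data alone.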

V. conclusion

Mining concept-drifting data streams is a challenging research area. This paper has reviewed data streams, the types of concept drift and several important approaches used in mining concept-drifting data streams. In addition, real-world data sets and synthetic data sets with simulated concept drift have been discussed as a means of analyzing the detailed behavior of concept change in data streams.


