Constraint Based Data Cleaning


02 Nov 2017


Data management is one of the issues that requires a great deal of consideration if routine management activities are to be carried out effectively. At the same time, too much data can become the biggest hurdle to successful management. Data cleaning prior to data processing is a way of segregating usable data from raw data. Ciszak (2008) defines data cleaning as a process that attempts to maintain data quality in information systems. In his article he examines the finer technicalities of the data cleaning process, asserting that cleaning primarily means identifying duplicate and incorrect entries made at the time of data insertion.

The article begins by identifying the parameters whose presence ensures data quality: completeness, validity, consistency, timeliness, accuracy, relevance and accessibility. These factors determine the extent to which data possesses quality. Covering this many dimensions, however, also pushes the data collector towards a comprehensive approach, and many other authors have presented the same view.

On the other hand, a few key problems stand in the way of ideal-quality data satisfying the parameters above. Data does not always come from a single source; more often it comes from several sources, and managing multiple sources with the same level of accuracy remains a demanding task. When data is entered from more than one source it is very difficult to minimise redundancy caused by duplicate entries, so one task of a data cleaning solution is to identify duplicates and merge them into a single record in a prescribed, structured manner. Secondly, data standardisation becomes a tricky issue when the aim is to unify the sources into one correct set of values used in the target system.

The article then moves to its real purpose, which is to examine data mining methods, with attribute correction as the primary concern. Human involvement remains critical to data validity. For attribute correction the article presents two applications of data mining techniques: context-independent attribute correction implemented with clustering techniques, and context-dependent attribute correction using association rules. In the first, context-independent, approach all record attributes are examined and cleaned in isolation, without considering the values of the other attributes of a given record. The algorithm used is based on the observation that in most data sets there is a small number of values with a large number of occurrences and a very large number of values with very few occurrences. In the second, context-dependent, approach attribute values are corrected with regard not only to the reference value they are most similar to, but also taking into account the values of the other attributes within the record. The algorithm rests on the assumption that the data itself contains relationships and correlations that can be used as validation checks; it generates association rules from the dataset and uses them as a source of validity constraints and reference data.
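A rough, hypothetical sketch of both ideas in Python is given below (the records, the frequency threshold, and the use of difflib string similarity are assumptions made for illustration, not the algorithms from the article): frequent attribute values act as reference data for context-independent correction, while a simple city-to-country co-occurrence rule mined from the data serves as a context-dependent validity check.

```python
from collections import Counter
from difflib import SequenceMatcher

# Toy records: (city, country) pairs with a typo and an inconsistent entry.
records = [("London", "UK"), ("London", "UK"), ("Lodnon", "UK"),
           ("Paris", "France"), ("Paris", "France"), ("Paris", "UK")]

# Context-independent correction: map rare values to the most similar frequent value.
city_counts = Counter(city for city, _ in records)
frequent = {c for c, n in city_counts.items() if n >= 2}   # assumed frequency threshold

def correct_city(value):
    if value in frequent:
        return value
    # Pick the frequent value with the highest string similarity.
    return max(frequent, key=lambda ref: SequenceMatcher(None, value, ref).ratio())

# Context-dependent check: mine a simple association rule city -> country.
rules = {}
for city, country in records:
    rules.setdefault(correct_city(city), Counter())[country] += 1

for city, country in records:
    fixed_city = correct_city(city)
    expected = rules[fixed_city].most_common(1)[0][0]
    flag = "" if country == expected else f"  <- violates rule {fixed_city} -> {expected}"
    print(f"{city:8s} {country:8s} -> {fixed_city:8s}{flag}")
```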

To conclude, the researcher finds that data mining is a viable approach to data cleaning and definitely requires more research attention. Data mining methodologies applicable to data cleaning can also be used where no reference data is provided; in such cases the reference data can be inferred directly from the dataset.

Identification of bad data

Nowadays most parts of society depend on critical computer-based infrastructure, which may include security, electricity, communication and so on. Billions of dollars are being invested in transforming our infrastructure into smart systems that are supposed to provide a better life and environment, and some of these systems have a key impact on society. Water systems, like other infrastructure, are also becoming more and more "smart". All types of data are assembled, and errors can then be detected by processing that data; any suspect data can be eliminated from the data sets at any step of processing.

Data Cleaning

There are several well-established and productive ways to clean data, some of which are described below.

Constraint based data cleaning

Constraint-based approaches can be described in terms of two aspects:

Identification of constraints to be followed by data

Application of constraints in order to find a consistent database

Most of this research was based on denial and full dependencies, which generalise functional dependencies. One repair algorithm derived repairs with the help of inclusion dependencies and FDs (Bohannon et al., 2005): equivalence classes were used to distribute attributes into classes so as to obtain a fully compliant, consistent database. A repair approach based on CFDs can be regarded as a non-trivial extension of that repair algorithm; the proposed algorithm relies on a greedy, cost-based heuristic to repair errors. Kolahi and Lakshmanan (2009) used FDs to cast the repair problem as a hyper-graph optimisation problem, in which a heuristic vertex cover helps to find the minimal number of attribute modifications needed to make the database consistent with the FDs. The drawback of this approach is that the data sets must be validated by domain experts so that there is an authoritative basis for arriving at a database consistent with the FDs.
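As a hypothetical illustration of how an FD can expose candidate repairs (the toy relation and the FD zip -> city below are invented, and the majority-value repair is a naive stand-in for the cost-based heuristics cited above), tuples can be grouped into equivalence classes on the FD's left-hand side and flagged when the right-hand side is not unique:

```python
import pandas as pd

# Assumed toy relation; the FD zip -> city should hold.
df = pd.DataFrame({
    "zip":  ["10001", "10001", "90210", "90210"],
    "city": ["New York", "New York", "Beverly Hills", "Los Angeles"],
})

# Group tuples into equivalence classes on the FD's left-hand side.
for zip_code, group in df.groupby("zip"):
    cities = group["city"].unique()
    if len(cities) > 1:
        # A violation: one simple repair is to pick the majority value for the class.
        repair = group["city"].mode().iloc[0]
        print(f"FD zip -> city violated for zip={zip_code}: {list(cities)}; "
              f"candidate repair: set city to '{repair}'")
```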

The literature has introduced several classes of data quality rules. One is conditional functional dependencies (CFDs) (Bohannon et al., 2005), which extend functional dependencies (FDs) with conditions that define the subsets of tuples, or the context, on which the underlying FD must hold. Matching dependencies (MDs) (Fan, 2008) are similar to FDs except that they require values to match approximately rather than exactly; duplicate records can be traced back through matching rules and Dedupalog. CFDs have been studied extensively because of their usefulness as integrity constraints and for identifying inconsistencies in data.
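A CFD can be read as an FD that only has to hold on the subset of tuples matching a pattern. The following sketch, with an invented relation and the assumed rule (country = 'UK') => zip -> city, shows how such a conditional constraint can be checked:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["UK", "UK", "US", "US"],
    "zip":     ["EH8", "EH8", "10001", "10001"],
    "city":    ["Edinburgh", "Glasgow", "New York", "NYC"],
})

# CFD: (country = 'UK') => zip -> city, i.e. the FD only has to hold on UK tuples.
subset = df[df["country"] == "UK"]
violations = subset.groupby("zip")["city"].nunique()
for zip_code in violations[violations > 1].index:
    print(f"CFD violated for UK zip={zip_code}: "
          f"{sorted(subset.loc[subset['zip'] == zip_code, 'city'])}")
```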

The earlier CFD literature focused on consistency analysis of CFDs (Fan, Geerts, Jia, & Kementsietsidis, 2008), data integration of databases with CFDs (W, S, Y, J, & Y, 2008), and extensions of CFDs through negation and disjunction (Bravo, Fan, Geerts, & Ma, 2008) or the addition of ranges (Golab, Karloff, Korn, Srivastava, & Yu, 2008). Finally, algorithms for CFDs have been introduced in the most recent literature (Chiang & Miller, 2008; Fan, Geerts, Lakshmanan, & Xiong, 2009; Golab et al., 2008).

Machine Learning Techniques for Data Cleaning

Data cleansing methods based on machine learning techniques have mainly comprised deduplication (Elmagarmid, Ipeirotis, & Verykios, 2007), error detection (Zhu & Wu, 2004) and data imputation (W. Fan et al., 2009). How ML techniques can be used to repair bad databases has not been addressed. Data imputation uses relational learning to look for relationships between attributes in relational databases; missing values can then be calculated from the learnt model, which requires prior knowledge of attribute relationships to construct Bayesian networks. The challenges involved in this technique are:

Scalability to large databases when learning correlations in the existing data

The accuracy of the predicted replacement values

A further difficulty arises from inconsistencies between local and global views of the data. Scalable ML techniques exist, but they are either data- or model-dependent, and scalability is also an issue for constraint-based cleansing. A minimal sketch of the imputation idea is given below.
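The sketch below is a minimal, hypothetical illustration of learning-based imputation (the toy data and the choice of a decision-tree regressor are assumptions, not the cited method): a model is trained on records where the attribute is present and then predicts it where it is missing.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy relation: columns [age, income]; some income values are missing (NaN).
data = np.array([
    [25, 30000.0], [32, 48000.0], [41, 61000.0],
    [52, 75000.0], [29, np.nan],  [47, np.nan],
])

known = ~np.isnan(data[:, 1])
model = DecisionTreeRegressor(max_depth=2).fit(data[known, :1], data[known, 1])

# Impute missing income values from the learnt relationship with age.
data[~known, 1] = model.predict(data[~known, :1])
print(data)
```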

Involving Users in the Data Cleaning Process

Most present-day systems offering data exploration and transformation techniques do not benefit from automatic data repair: repair actions are "explicitly specified by the user". For instance, Elmagarmid et al. (2007) proposed a language that eliminates duplicate attributes during transformation, and Potter's Wheel (Raman & Hellerstein, 2001) combines data transformation with the detection of irregularities. None of these systems feeds user responses back through a learning mechanism. Repair of critical data with guaranteed quality has also been introduced (Elmagarmid et al., 2007), assuming that correct reference data exists and that the user specifies attributes in order to correct the dataset as a whole; the system is built on predefined editing rules. The literature pursues two objectives:

Identification of completely matched references in LSI data (Doan, Domingos, & Halevy, 2001) and elimination of duplicates from the data (Doan & McCann, 2003)

Improvement of prediction quality on the basis of a learning model (Bohannon et al., 2005; Doan & McCann, 2003; Fan et al., 2008; W. Fan et al., 2009)

Doan et al. (2001) and W. Fan et al. (2009) addressed the incorporation of user feedback into a schema matching scheme. Doan and McCann (2003) provided users with a framework that eases the matching of candidates, with no requirement for selection or ranking, and then combines the responses to arrive at a correct dataset. A decision-theoretic approach is used to rank candidates so as to improve the quality of feedback across the data space. The limitation of this approach is that user feedback is confined to a candidate-matching problem drawn from the data space, so it cannot be applied to relational databases in a constrained repair framework.

Bravo et al. (2008) proposed an active learning approach to formulating a generic matching function for identifying duplicate records. Selective supervision combines supervision with active learning in a decision-theoretic setting (W. Fan et al., 2009), using an information value to decide which unclassified cases to label. Selective labelling (Bravo et al., 2008) assumes that user feedback may be unreliable and selects instances for repeated labelling by combining label uncertainty with active learning. The aim of these approaches is to reduce the uncertainty of the predicted output, regardless of how much value that has for the quality of the underlying database. A rough sketch of the selective-labelling idea follows.
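The following fragment is a hypothetical sketch of the selective-labelling principle only (the candidate pairs and match probabilities are invented): rather than asking the user about every candidate pair, the system asks first about the pairs whose predicted match probability is least certain.

```python
# Candidate record pairs with a predicted probability that they are duplicates.
candidates = {
    ("r1", "r2"): 0.95,   # almost certainly a duplicate
    ("r3", "r4"): 0.52,   # very uncertain
    ("r5", "r6"): 0.07,   # almost certainly distinct
    ("r7", "r8"): 0.48,   # very uncertain
}

# Selective labelling: ask the user about the pairs closest to 0.5 first.
by_uncertainty = sorted(candidates, key=lambda pair: abs(candidates[pair] - 0.5))
print("Ask the user about:", by_uncertainty[:2])
```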

For approaches that focus mainly on user behaviour or entity deduplication, the most closely related area is adaptive systems for information retrieval and web navigation (W. Fan et al., 2009; W et al., 2008). These techniques mostly rely on statistical modelling of user interaction to understand user preferences by mining specific domains. The models depend on the statistical significance of the extracted model and take into account the sequence of actions performed by users, but they ignore the time dimension when determining patterns of repeated user actions. Such models are best applied to identifying groups with common behaviour and can be used to find similarities among entities, but they offer little benefit when computing pair-wise matches of entities on the basis of their recorded actions.

WWW for Data Integration and Data Cleaning

The most familiar work using the WWW was developed in the Octopus system by Cafarella et al. [20]. The operation pattern presented by this approach is similar to finding missing values through web tables. It uses a search API to retrieve matching tables, but lacks well-defined semantics, and web search is not restricted to tabular results: in many cases there were no relevant results among the first 1000 retrieved URLs. Octopus also issues further API requests to record information for each user and applies a clustering approach to table matching, which limits performance even on small databases. This motivates the need for an approach with a well-defined semantic method for matching web tables to user tables.

The approach requires "heavy lifting" to be performed in the initial steps in order to obtain fast responses later. Researchers have proposed various techniques for annotating web tables with relationship names and column names (Bravo et al., 2008; Fan, 2008). These methods support the construction of a best-matching structure known as the Semantic Matching Web tables (SMW) graph. Building the SMW graph draws on a large body of work on schema matching (Chiang & Miller, 2008; Golab et al., 2008). Most recent techniques use base matchers, for example linguistic matching, detection of data overlap, or attribute-name similarity, and combine these to arrive at a final match; both the base matchers and the combination step may be learning-based or non-learning-based (Chiang & Miller, 2008; Fan, 2008).

Clustering

Clustering is the partitioning of a large body of data into small groups, or clusters, such that objects within one cluster are similar while objects in different clusters are dissimilar. Clustering methods are normally divided into three types: hierarchical, partitioning and density-based. Partitioning methods produce a flat decomposition of the data into a specified number of clusters, hierarchical methods provide a hierarchical representation of the data, and density-based clustering identifies clusters of objects lying in the same dense region, separated by regions of lower density. Two partitioning methods for complete data are listed below, followed by a short k-means sketch.

k-Means Algorithm

Fuzzy c-Means Algorithm (FCM)
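A minimal sketch of the first of these, using scikit-learn's KMeans on invented two-dimensional points (the data and the choice of k are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points.
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# Partition the data into k = 2 clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("labels:   ", km.labels_)
print("centroids:", km.cluster_centers_)
```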

Clustering of incomplete data

Various statistical methods exist for clustering incomplete data; some of them are discussed briefly below.

k-Means Clustering with Soft Constraints

k-Means Clustering with Partial Distance Strategy

Whole-Data Strategy (WDS)

Partial Distance Strategy (PDS)

Optimal Completion Strategy (OCS)

Nearest Prototype Strategy (NPS)

Distance Estimation Strategy (DES)

k-Means Clustering with Soft Constraints

Wagstaff (2004) used a k-means clustering approach based on soft constraints. The data under observation in that study contained missing values; the approach defines soft constraints over the attributes with missing values and uses them as supplementary, additional information during clustering.

k-Means Clustering with Partial Distance Strategy

Another method that can be used when values are missing is an adaptation of the k-means algorithm that uses a partial distance function, rather than the Euclidean distance, to compute the similarity of two data points. This is referred to as the partial distance strategy (PDS). It was used in (J, 1979) in the context of the k-nearest-neighbour rule, and Hathaway and Bezdek employed the same technique for fuzzy clustering of incomplete data.
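A minimal sketch of such a partial distance between two vectors with missing components (marked NaN), scaling by the fraction of dimensions observed in both vectors; the example values are assumptions:

```python
import numpy as np

def partial_distance(x, y):
    """Squared partial distance: use only dimensions observed in both vectors,
    scaled up by n / (number of usable dimensions)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    usable = ~np.isnan(x) & ~np.isnan(y)
    if not usable.any():
        raise ValueError("no common observed dimensions")
    scale = x.size / usable.sum()
    return scale * np.sum((x[usable] - y[usable]) ** 2)

print(partial_distance([1.0, np.nan, 3.0], [2.0, 5.0, 1.0]))  # uses dims 0 and 2 only
```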

Whole-Data Strategy (WDS)

One way of adapting the fuzzy c-means algorithm to handle data containing missing values is the whole-data strategy (WDS) (Elmagarmid et al., 2007). Only complete data items are considered during clustering: items with missing values are removed from the data, and the remaining items are clustered with the fuzzy c-means algorithm.

Partial Distance Strategy (PDS)

This approach also adapts the fuzzy c-means algorithm to handle missing values in the data set. It is based on computing partial distances for data items with missing values and is termed the Partial Distance Strategy Fuzzy C-Means Algorithm (PDSFCM): the squared Euclidean distance function is replaced by the partial distance function during the calculation.

Optimal Completion Strategy (OCS)

This is another adaptation of the fuzzy c-means algorithm to handle missing values. It estimates the missing values from the clusters at every iteration step and is termed the optimal completion strategy (OCS); the resulting algorithm is called the Optimal Completion Strategy Fuzzy C-Means Algorithm (OCSFCM). The change consists of an additional third iteration step: missing values are initialised with random values at the start and are then re-estimated in each iteration.
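A hedged sketch of this extra step is given below (the prototypes, memberships and fuzzifier are invented placeholders rather than a full OCSFCM implementation): each missing component is re-estimated as the membership-weighted average of the corresponding prototype components.

```python
import numpy as np

# Assumed state after one FCM iteration: 2 clusters, 3-dimensional data.
prototypes = np.array([[1.0, 2.0, 3.0],
                       [8.0, 9.0, 7.0]])
u_i = np.array([0.8, 0.2])          # fuzzy memberships of one data point
m = 2.0                             # fuzzifier
x_i = np.array([1.2, np.nan, 2.9])  # the point's second component is missing

# OCS third step: replace each missing component with the membership-weighted
# average of the cluster prototype components (missing values start out random).
w = u_i ** m
missing = np.isnan(x_i)
x_i[missing] = (w @ prototypes[:, missing]) / w.sum()
print(x_i)
```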

Nearest Prototype Strategy (NPS)

"Nearest prototype strategy" (NPS) ismodified form of OCSFCM. Missing values are substituted with corresponding values of related cluster prototype with smaller partial distance. The algorithm resulted is known as "Nearest Prototype Strategy Fuzzy C-Means Algorithm" (NPSFCM) and it can be attained through OCSFCM through change in third iteration step.

Distance Estimation Strategy (DES)

The last approach considered in this study also adapts FCM to deal with missing values in the data set. It is based on estimating the distances between incomplete data items and the cluster prototypes, and is referred to as the Distance Estimation Strategy (DES); the resulting algorithm is the Distance Estimation Strategy Fuzzy C-Means Algorithm (DESFCM).

Conclusion: In a nutshell, the literature describes various methods that can be used to clean data collected from primary and secondary sources. The statistical methods are not completely defined, but this study will focus on techniques for error correction without manual intervention.


