The Consequences Of Dirty Data


ABSTRACT:

Over time, most real-world applications have become technology oriented. With performance demands rising continuously, a considerable number of approaches are being framed to make these applications more efficient and better suited to everyday use. Many of them require a great deal of data, and efforts have been made to collect every possible source of data to support their operation. The increasing volume of information available in digital media poses a challenging problem for data administrators: digital repositories are affected by the presence of duplicate or near-duplicate entries, and this duplication becomes a bottleneck. The central difficulty is that the same data may be represented in different ways. Removing replicas from repositories yields higher-quality information and saves processing time. This paper presents a thorough analysis of the metrics and algorithms used to remove duplicate records from databases.

INTRODUCTION:

Over the years, databases have come to play a vital role in the IT industry, whose operations ultimately depend on the accuracy of the underlying data. When data are integrated from different sources to build a data warehouse, a problem called data heterogeneity arises, because each database maintains its own structure and semantics. Heterogeneity is also caused by variations in representation such as misspellings, typographical errors and abbreviations. Heterogeneity is of two types, structural and lexical. Structural heterogeneity occurs when the fields of records are structured differently in each database; in lexical heterogeneity the fields share a similar structure but differ in how their contents are represented. Data cleaning and standardization are performed to solve the heterogeneity problem.

The main objective of deduplication is to identify two or more records that represent the same real-world entity even though their representations are not identical. The problem has also been called record linkage and record matching, and the replicas themselves are referred to as dirty data. Government agencies and companies have spent large amounts of money cleaning the dirty data in their repositories in order to obtain quality content. Replica-free repositories improve efficiency and save processing time. As the amount of data stored in a database grows, problems such as security, slow response time, quality assurance and availability grow with it.

CONSEQUENCES OF DIRTY DATA:

1) Because duplicates add a large amount of extra data, even a simple query takes considerably longer to answer, which degrades performance.

2) Presence of replicas increases the operational costs.

3) Dirty data also needs additional processing power.

METHODS:

1) DATA PREPARATION:

The duplicate record detection process starts with a data preparation phase, in which data are stored in the repository in a consistent manner so that the heterogeneity problem is resolved. Data preparation consists of parsing, data transformation and data standardization. The main goal of parsing is to locate, identify and isolate the individual data elements in the source files; parsing enables the comparison of individual components, which is what record matching relies on. Data transformation refers to data type conversion, that is, converting an element of one data type to another, and it focuses on one element at a time. The final step is data standardization, in which data elements held in various formats are converted to a specific, uniform representation across databases; address fields are a classic example. Standardization must be done before record matching starts, because without it many duplicates are wrongly designated as non-duplicates. After the data preparation stage, data with their similar fields are stored in a table so that it is clear which field should be compared with which; comparing the field "first name" with the field "address", for instance, would provide no useful information for detecting duplicates. Even so, certain fields such as dates, times, names and titles still pose standardization problems.
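As a rough illustration of the standardization step, the sketch below normalizes free-text address values before comparison. It is a minimal example: the abbreviation table, the regular expressions and the sample addresses are purely illustrative and not drawn from any of the systems discussed in this paper.

```python
import re

# Illustrative abbreviation table; a real standardization step would use a much larger one.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue", "apt": "apartment"}

def standardize_address(raw: str) -> str:
    """Lower-case, strip punctuation, collapse whitespace and expand common abbreviations."""
    value = raw.lower()
    value = re.sub(r"[^\w\s]", " ", value)      # replace punctuation with spaces
    value = re.sub(r"\s+", " ", value).strip()  # collapse runs of whitespace
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in value.split()]
    return " ".join(tokens)

# Two differently formatted addresses standardize to the same representation.
print(standardize_address("44 Main St., Apt. 7"))
print(standardize_address("44  MAIN STREET apt 7"))
```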

2) ACTIVE LEARNING:

A duplicate detection system named ALIAS was designed by Sarawagi and Bhamidipaty. By discovering challenging training pairs, ALIAS supports the automatic construction of a deduplication function. The main difference between an ordinary learner and an active learner is that the former depends on a static training set, while the latter actively picks a subset of instances from the unlabelled data which, when labelled, will provide the highest information gain to the learner. The role of ALIAS is to separate duplicate and non-duplicate pairs clearly, and it requires human input only when the uncertainty about whether a pair is a duplicate is very high. ALIAS is not an ideal approach, however, because it always requires some training data, which is often unavailable in real-world settings.
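ALIAS itself is not reproduced here, but the sketch below shows the general active-learning idea it builds on: train a classifier on a small labelled seed set, then ask a human to label only the unlabelled pairs the classifier is most uncertain about. The feature vectors, the seed labels and the use of scikit-learn's logistic regression are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature vectors: each row holds field-similarity scores for one record pair.
seed_X = np.array([[0.95, 0.90], [0.10, 0.05], [0.92, 0.88], [0.15, 0.20]])
seed_y = np.array([1, 0, 1, 0])  # 1 = duplicate, 0 = non-duplicate
unlabeled_X = np.array([[0.55, 0.48], [0.05, 0.02], [0.97, 0.99], [0.50, 0.60]])

model = LogisticRegression().fit(seed_X, seed_y)

# Uncertainty sampling: pairs whose predicted probability is closest to 0.5 carry
# the most information, so they are the ones handed to a human expert for labelling.
probs = model.predict_proba(unlabeled_X)[:, 1]
uncertainty = np.abs(probs - 0.5)
to_label = np.argsort(uncertainty)[:2]  # indices of the two most ambiguous pairs
print("Ask the expert to label pairs:", to_label)
```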

3) SIMILARITY METRICS:

Most mismatches in databases occur because of typographical variations in string data. Similarity metrics are used to handle the inconsistencies introduced during typing. They include character-based, token-based and phonetic similarity metrics, all of which work well for matching individual fields of records.

CHARACTER BASED SIMILARITY METRICS:

Character-based similarity metrics handle typographical errors.

EDIT DISTANCE:

Edit distance transforms one string into another using the minimum number of edit operations; it is also called the Levenshtein distance. There are three edit operations (an implementation sketch follows this list):

Inserting a character into the string

Deleting a character from the string

Replacing a character by another character in the string.
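A standard dynamic-programming implementation of the Levenshtein distance is sketched below; the function name and test strings are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca from a
                            curr[j - 1] + 1,      # insert cb into a
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3 edit operations
```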

AFFINE GAP DISTANCE:

The affine gap distance adds two further edit operations, opening a gap and extending a gap. This makes it better suited than plain edit distance to matching strings that have been truncated or shortened.

SMITH WATERMAN DISTANCE:

The Smith-Waterman distance performs substring matching: it finds matching substrings in the middle of the strings while ignoring mismatches at their beginnings and ends. The distance between two strings can be computed with a dynamic programming algorithm in the style of Needleman and Wunsch. For example, "Prof. Hillary R. Clinton, U.S." and "Hillary R. Clinton, Prof." can be matched using the Smith-Waterman distance.

JARO DISTANCE METRIC:

The Jaro distance metric is used to compare first and last names. It counts the common characters between two strings and the number of transpositions among them. The Jaro-Winkler variant assigns a higher weight to prefix matches, which is particularly useful for matching surnames.
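A plain implementation of the Jaro similarity (without the Winkler prefix bonus) might look like the sketch below; the pair "martha"/"marhta" is the classic example of shared characters plus a single transposition.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 when no characters match."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    s1_matched = [False] * len(s1)
    s2_matched = [False] * len(s2)

    # Count characters that match within the sliding window.
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not s2_matched[j] and s2[j] == c:
                s1_matched[i] = s2_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count transpositions among the matched characters.
    transpositions, k = 0, 0
    for i, c in enumerate(s1):
        if s1_matched[i]:
            while not s2_matched[k]:
                k += 1
            if c != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2

    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3.0

print(round(jaro("martha", "marhta"), 3))  # roughly 0.944
```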

TOKEN BASED SIMILARITY METRICS:

Token-based similarity metrics identify strings that convey the same meaning even when their tokens have been rearranged.

Q-GRAMS WITH TF.IDF:

Q-grams with TF.IDF uses short character q-grams as tokens instead of whole words. This allows matches to be found despite spelling errors and despite words being inserted into or deleted from a string, because tokens with low TF.IDF weight contribute little to the similarity. For example, the strings "Gateway Communications" and "Communication Gateway International" are matched with high similarity, because the word "International" carries a low weight.
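A minimal sketch of this idea, assuming scikit-learn is available: represent each string by its character 3-grams, weight them with TF.IDF, and compare the resulting vectors with cosine similarity. The sample strings echo the example above; the choice of 3-grams is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

strings = [
    "Gateway Communications",
    "Communication Gateway International",
    "Pacific Bell Telephone",
]

# Character 3-grams as tokens, weighted by TF.IDF so that frequent,
# low-information tokens contribute little to the similarity.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(strings)
sims = cosine_similarity(tfidf)

# The first two strings score far higher with each other than with the third,
# despite the changed word order and the extra word "International".
print(round(sims[0, 1], 2), round(sims[0, 2], 2))
```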

ATOMIC STRINGS:

An atomic string is a sequence of alphanumeric characters delimited by punctuation characters. Two atomic strings match if they are identical or if one is a prefix of the other, and the similarity of two fields is the ratio of their matched atomic strings to the mean number of atomic strings.
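A small sketch of this measure with illustrative field values; the prefix rule and the division by the mean number of atomic strings follow the description above.

```python
import re

def atomic_strings(value: str) -> list:
    """Split a field into atomic strings: runs of alphanumerics delimited by punctuation or spaces."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", value) if tok]

def atomic_match(a: str, b: str) -> bool:
    """Two atomic strings match if they are identical or one is a prefix of the other."""
    return a == b or a.startswith(b) or b.startswith(a)

def atomic_similarity(s1: str, s2: str) -> float:
    """Ratio of matched atomic strings to the mean number of atomic strings."""
    toks1, toks2 = atomic_strings(s1), atomic_strings(s2)
    unmatched = list(toks2)
    matches = 0
    for t1 in toks1:
        for t2 in unmatched:
            if atomic_match(t1, t2):
                matches += 1
                unmatched.remove(t2)
                break
    mean_len = (len(toks1) + len(toks2)) / 2.0
    return matches / mean_len if mean_len else 0.0

print(atomic_similarity("Comput. Sci. Dept.", "Computer Science Dept"))  # every token matches
```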

PHONETIC SIMILARITY METRICS:

Character-based and token-based metrics address differences in the written representation of strings. Strings that are not matched by those metrics may still be matched by phonetic metrics because they sound alike. For example, the strings Kageonne and Cajun are phonetically similar even though their representations are very different.

SOUNDEX:

Soundex is a phonetic coding scheme commonly used for matching surnames. It works by grouping consonants that are phonetically similar and assigning them the same code.
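A simplified Soundex sketch is shown below; production implementations differ slightly in how they treat the letters h and w and other edge cases.

```python
def soundex(name: str) -> str:
    """Four-character Soundex code: first letter plus up to three digits for consonant groups."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    digits = [codes.get(name[0], "")]
    for c in name[1:]:
        digit = codes.get(c, "")
        if digit and digit != digits[-1]:
            digits.append(digit)  # new consonant group
        elif not digit and c not in "hw":
            digits.append("")     # vowels separate consonant groups
    code = name[0].upper() + "".join(d for d in digits[1:] if d)
    return (code + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # phonetically similar surnames share the code R163
```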

METAPHONE AND DOUBLE METAPHONE:

Metaphone is an alternative to Soundex. It uses 16 consonant sounds that can describe a large number of words used in English and in non-English languages. Double Metaphone is an improved version of Metaphone that allows multiple encodings for names with several likely pronunciations.

OXFORD NAME COMPRESSION ALGORITHM:

The Oxford Name Compression Algorithm (ONCA) groups similar names in two steps: the first performs compression, while the second applies Soundex to the compressed output.

NUMERIC SIMILARITY METRICS:

Numeric similarity metrics treat each number as a string, after which similarity can be computed using the metrics described above.

SUMMARY OF SIMILARITY METRICS:

Similarity metrics are used to identify matching strings for duplicate detection.

MARLIN FRAMEWORK:

The MARLIN architecture consists of a training phase and a duplicate detection phase. In the training phase, learnable distance metrics are trained for each field of a record using paired field-level duplicates and non-duplicates. Some individual fields of duplicate records may not be equivalent, which introduces noise into the training data, but this does not cause a serious problem. The duplicate detection phase generates potential duplicate pairs: MARLIN employs a clustering method to separate records into clusters that contain overlapping duplicates. The learned distance metrics are used to compute distance features for each field of a pair of records, and these features are fed to a classifier, which identifies duplicates with an associated confidence. Because choosing a single similarity threshold is difficult, two thresholds are used: one marks high-confidence duplicates, and the other marks possible matches that require a human expert to label.
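The two-threshold idea at the end of this description can be pictured with the sketch below; the threshold values and record identifiers are purely illustrative.

```python
# Illustrative thresholds: pairs above HIGH are accepted as duplicates outright,
# pairs between LOW and HIGH are queued for a human expert, the rest are rejected.
HIGH, LOW = 0.9, 0.6

def triage(scored_pairs):
    """Split classifier-scored record pairs into duplicates, pairs to review and non-duplicates."""
    duplicates, review, non_duplicates = [], [], []
    for pair, confidence in scored_pairs:
        if confidence >= HIGH:
            duplicates.append(pair)
        elif confidence >= LOW:
            review.append(pair)
        else:
            non_duplicates.append(pair)
    return duplicates, review, non_duplicates

scored = [(("r1", "r2"), 0.97), (("r3", "r4"), 0.72), (("r5", "r6"), 0.10)]
print(triage(scored))
```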

4) SECURE DATA DEDUPLICATION:

The main idea behind this approach is to find the chunks in documents and encrypt the identified chunks using keys; identical content can then be detected by matching the encrypted data. The drawback is that if the same content is encrypted with two different keys it produces two different ciphertexts, which makes the duplicates hard to identify. For this reason it is not a particularly strong approach.
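The chunk-matching step can be pictured with content fingerprints, as in the sketch below, which hashes fixed-size chunks with SHA-256; the chunk size and the sample documents are assumptions. The closing comment restates the drawback noted above.

```python
import hashlib

def chunk_fingerprints(data: bytes, chunk_size: int = 4096) -> set:
    """Fixed-size chunking with SHA-256 fingerprints, a minimal stand-in for the chunk-matching step."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }

doc_a = b"common paragraph " * 1000 + b"unique tail of document A"
doc_b = b"common paragraph " * 1000 + b"a different tail for document B"

shared = chunk_fingerprints(doc_a) & chunk_fingerprints(doc_b)
print(len(shared), "identical chunks detected")
# If each party encrypted its chunks under its own key, the ciphertexts of the shared
# chunks would differ, and this comparison would find no duplicates at all.
```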

5) INDEXING TECHNIQUES:

Peter Christen discussed several indexing techniques for record linkage that use blocking as the key technique. Records are placed into blocks according to their blocking key values (BKVs): records with similar BKVs are inserted into the same block, while dissimilar records are placed in different blocks. The record linkage process consists of two phases, build and retrieve.

BUILD:

When linking two databases, one of two approaches may be used: 1) a separate index data structure for each database, or 2) a single data structure with common key values. Either form can be implemented with an index table or a hash table.

RETRIEVE:

Records that share the same blocking key value are compared with the other records in the same block, or with records in the other database, to generate candidate record pairs. A classifier then compares the candidate record pairs to find the matches.
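A minimal sketch of both phases over a single shared index structure is given below; the blocking key (a surname prefix plus a postcode prefix) and the toy records are assumptions made for the illustration.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    """Illustrative blocking key: first four letters of the surname plus the postcode prefix."""
    return record["surname"][:4].lower() + record.get("postcode", "")[:3]

def build_index(records: dict) -> dict:
    """Build phase: map each blocking key value to the ids of the records in that block."""
    index = defaultdict(list)
    for rec_id, record in records.items():
        index[blocking_key(record)].append(rec_id)
    return index

def candidate_pairs(records: dict) -> list:
    """Retrieve phase: only records that fall into the same block become candidate pairs."""
    pairs = []
    for block in build_index(records).values():
        pairs.extend(combinations(sorted(block), 2))
    return pairs

records = {
    "a1": {"surname": "Smith", "postcode": "2600"},
    "a2": {"surname": "Smyth", "postcode": "2600"},
    "a3": {"surname": "Smith", "postcode": "2601"},
    "b1": {"surname": "Jones", "postcode": "2600"},
}
# Only a1 and a3 share a blocking key; a2 ("Smyth") lands in a different block,
# which illustrates why the choice of blocking key matters.
print(candidate_pairs(records))
```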


TRADITIONAL INDEXING:

Records that share the same key are placed in one block and compared with one another. The drawbacks of this approach are that some records can end up in the wrong block, and that it is hard to predict the total number of record pairs that will be generated.

SORTED NEIGHBORHOOD INDEXING:

The sorted neighbourhood approach sorts the database on the blocking key values and slides a fixed-size window over the sorted records; candidate pairs are generated only from the records that fall inside the current window. The drawback lies in choosing the window size: if the window is too small, it will not cover all the records that share the same key values.
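A compact sketch of the sliding-window idea, with an illustrative window size of three and toy sorting keys:

```python
from itertools import combinations

def sorted_neighbourhood_pairs(records: dict, key_field: str, window: int = 3) -> set:
    """Sort record ids on their key value and pair up records inside each sliding window."""
    order = sorted(records, key=lambda rec_id: records[rec_id][key_field])
    pairs = set()
    for start in range(len(order) - window + 1):
        pairs.update(combinations(order[start:start + window], 2))
    return pairs

records = {
    1: {"key": "smith2600"},
    2: {"key": "smyth2600"},
    3: {"key": "smith2601"},
    4: {"key": "jones2600"},
}
# Records 1 and 2 are paired even though their keys differ slightly, because they
# become neighbours once the keys are sorted.
print(sorted_neighbourhood_pairs(records, "key", window=3))
```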

Q-GRAM:

Q-gram based indexing inserts records whose blocking key values share q-grams into the same block, so that records with similar rather than strictly identical keys are compared. It can be implemented effectively in a relational database management system.

SUFFIX ARRAY:

A suffix-array based index is built by inserting the blocking key values and their suffixes into a suffix-array based inverted index. A suffix array is a collection of strings and their suffixes held in alphabetical order; this approach is well suited to bibliographic databases.

CANOPY CLUSTERING:

Overlapping clusters, called canopies, are created using either a threshold-based or a nearest-neighbour mechanism, and candidate record pairs are generated from the overlapping clusters. The nearest-neighbour approach yields a larger number of true matches among the generated record pairs and outperforms the threshold-based approach.
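The threshold-based variant can be sketched as follows; the loose and tight thresholds, and the use of one-dimensional points with absolute difference as the cheap distance, are illustrative only.

```python
def make_canopies(items, distance, loose, tight):
    """Group items into overlapping canopies using a loose and a tight threshold (loose >= tight)."""
    remaining = list(items)
    canopies = []
    while remaining:
        center = remaining.pop(0)               # next item seeds a new canopy
        canopy = {center}
        still_candidates = []
        for other in remaining:
            d = distance(center, other)
            if d <= loose:
                canopy.add(other)               # inside this canopy, but may join others too
            if d > tight:
                still_candidates.append(other)  # far enough to seed or join later canopies
        remaining = still_candidates
        canopies.append(canopy)
    return canopies

# Toy usage: the point 1.4 ends up in two overlapping canopies, which is exactly
# what allows candidate pairs to be generated across canopy boundaries.
points = [1.0, 1.2, 1.4, 5.0, 5.1, 9.0]
print(make_canopies(points, lambda a, b: abs(a - b), loose=1.0, tight=0.3))
```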

STRING MAP BASED INDEXING:

Records are treated as strings and mapped onto a multidimensional Euclidean space, followed by a mapping into a second, lower-dimensional space. A k-d tree is then used for efficient matching.

6) GENETIC PROGRAMMING USING KFIND:

KFIND improves the accuracy of the classifier by finding the most representative data samples. A distance is calculated from the mean value of the most representative samples; if the minimum distance is less than the mean value, the duplicates can be removed and the centroid of the new set of data samples is recalculated. These steps are repeated until the required data samples have been selected.

7) UNSUPERVISED DUPLICATE DETECTION:

Unsupervised duplicate detection (UDD) is designed for Web databases. UDD uses two classifiers, a weighted component similarity summing (WCSS) classifier and a support vector machine (SVM) classifier. WCSS assigns a weight to each field based on its distance, that is, by calculating the dissimilarity among the records; its main function is record matching. It identifies record pairs from the different data sources by adjusting the weights, which yields a positive data set (duplicates) and a negative data set (non-duplicates). The SVM classifier then identifies new duplicates from the given positive and negative data sets. Finally, using the resulting duplicates and non-duplicates, the field weights assigned in the first step are adjusted and a new iteration begins to identify further duplicates; the iterations continue until no new duplicates are found. The first classifier requires no pre-labelled training data, while the second requires only limited training data.
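The weighted summing step of WCSS can be pictured with the sketch below; the field names, weights and threshold are illustrative and not taken from the UDD paper.

```python
def wcss_score(field_sims: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities, assuming similarities in [0, 1] and weights summing to 1."""
    return sum(weights[field] * field_sims[field] for field in weights)

# Hypothetical fields of a web database record and their current weights.
weights = {"title": 0.5, "author": 0.3, "price": 0.2}
pair_sims = {"title": 0.92, "author": 0.85, "price": 1.0}

score = wcss_score(pair_sims, weights)
print(round(score, 3), "duplicate" if score >= 0.85 else "non-duplicate")
```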

8) IDENTIFICATION AND REMOVAL OF DUPLICATED RECORDS:

Bibal proposed a deduplicator algorithm to identify and remove duplicate records. In this approach, all data values or fields, such as strings and dates, are converted into numeric form so that clusters can be created. First, clusters are formed using the k-means clustering algorithm to reduce the number of comparisons; records within each cluster are then matched using a divide-and-conquer method. The approach is efficient at identifying fully duplicated, partially duplicated and erroneous duplicated records, but it is more suitable for single tables than for multiple tables. Performance is estimated using precision, recall and F-score.

9) GENETIC PROGRAMMING APPROACH:

The key element of the genetic programming (GP) approach is evidence: a combination of an attribute and a similarity function extracted from the data content. By using the evidence efficiently, GP generates an automatic deduplication function that checks whether two or more entries in a repository represent the same real-world entity. The suggested deduplication function also adapts to different identification boundaries according to a parameter. GP is reported to outperform the other existing methods, including MARLIN, the state-of-the-art approach.

10) PSO:

Deepa suggested a particle swarm optimisation (PSO) algorithm for deduplication, whose central idea is the swarm. The algorithm has two phases, training and duplicate record detection. It uses cosine similarity and Levenshtein distance to find matches between record pairs, and the resulting data form feature vectors representing the elements that require duplicate checking. The duplicate detection phase then identifies the duplicates from the feature vectors by means of the PSO algorithm. It is reported to outperform the genetic algorithm by providing higher accuracy.

DUPLICATE RECORD DETECTION TOOLS:

1) FEBRL:

FEBRL stands for Freely Extensible Biomedical Record Linkage. It is a data cleaning toolkit available under an open source software licence. Several techniques are encapsulated behind a graphical user interface (GUI) that performs data cleaning, data standardization, record linkage and deduplication, allowing both new and experienced users to learn about record linkage techniques, and it can handle large data sets. Febrl comprises two components, data standardization and duplicate detection: the former relies on hidden Markov models, while the latter relies on similarity metrics such as Jaro, edit distance and q-grams to find the duplicates.

2) TAILOR:

TAILOR is a record matching toolbox that allows users to apply different duplicate detection methods to their data sets. Because it supports multiple models, it is also called a flexible tool. The record linkage process in TAILOR consists of two steps: the first generates a comparison vector for each record pair, and the second ascertains the matching status of each pair. Metrics such as performance and accuracy can also be compared across models.

3) WHIRL:

WHIRL is an open source duplicate detection system that combines the cosine similarity with the tf-idf weighting scheme. It uses this token-based similarity metric to find the similarity between two strings in the lists being matched, and it can be used in academic and research studies.

4) BIGMATCH:

The role of BigMatch is to perform deduplication between two files, a large file A and a moderately sized file B. The process works by choosing the records from A that correspond to records in file B. Several parameters, such as the matching and blocking fields, can be specified by the user in the BigMatch program. For each blocking criterion a key list is created by reading file B, after which the program moves on to A. It finds matches between the files through their shared key values and also computes the comparison weights, then continues by reading the next records from file A.


