Privacy Preservation For Microdata Computer Science Essay

02 Nov 2017

DEPARTMENT OF CSE,

JAYARAM COLLEGE OF ENGINEERING AND TECHNOLOGY, THURAIYUR.

ABSTRACT

The publication of data requires balancing privacy against utility. Microdata is a collection of records that typically contains personally identifiable information, so releasing it may violate individual privacy. Several anonymization techniques, which remove or modify the identifying variables in a dataset, have been designed for privacy-preserving microdata publishing. The two most common are generalization and bucketization. Generalization loses a considerable amount of information, especially for high-dimensional data. Bucketization does not protect against membership disclosure because it publishes the quasi-identifying attributes in their original form, so an adversary can easily find out whether an individual has a record in the published dataset. To overcome these drawbacks, the slicing technique is proposed. Slicing is a novel technique that publishes and shares microdata in a privacy-preserving manner by partitioning the data both horizontally and vertically. It preserves better data utility than generalization, provides membership disclosure protection, and can handle high-dimensional data. This work also shows how slicing can be used for attribute disclosure protection.

1. INTRODUCTION

OVERVIEW

Data are increasingly being collected and used. Privacy-preserving data mining tries to strike a balance between two opposing forces: the objective of discovering valuable information and knowledge versus the responsibility of protecting individuals' privacy. Microdata contain records, each of which holds information about an individual entity such as a person, a household, or an organization. For example, microdata are collected and used by various government agencies (e.g., the U.S. Census Bureau and Departments of Motor Vehicles) and by many commercial companies (e.g., health organizations, insurance companies, and retailers). The companies and agencies that collect such data often need to publish and share it for research and other purposes. However, such data usually contains sensitive personal information, the disclosure of which may violate individual privacy, so privacy has become an important problem in data publishing and data sharing. Several microdata anonymization techniques have been proposed; the most popular are generalization for k-anonymity and bucketization for ℓ-diversity.

MICRODATA PUBLISHING

Consider microdata such as census data and medical data. Typically, microdata is stored in a table, and each record (row) corresponds to one individual. Each record has a number of attributes, which can be divided into three categories. Identifiers are attributes that clearly identify individuals; examples include Social Security Number and Name. Quasi-identifiers are attributes whose values, when taken together, can potentially identify an individual; examples include Zip-code, Birthdate, and Gender. An adversary may already know the quasi-identifier values of some individuals in the data, either from personal contact or from other publicly available databases (e.g., a voter registration list). Sensitive attributes are attributes whose values should not be associated with an individual by the adversary. A microdata table is shown in Table 1.1.

Name     Zip-code  Age  Disease
Ali      4712      29   Heart Disease
Inan     4787      22   Heart Disease
Sweety   4767      27   Heart Disease
David    4801      43   Flu
Ganesh   4833      52   Heart Disease
Frank    4890      67   Cancer
Ghinita  4760      31   Heart Disease
Harini   4776      36   Cancer
Ian      4791      32   Cancer

Table 1.1 Microdata Table
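As a small illustration of these attribute categories, the sketch below labels the attributes of Table 1.1 and strips the directly identifying one before any further anonymization. The role lists and the helper name `drop_identifiers` are illustrative, not part of any standard API.

```python
# Attribute roles for the microdata in Table 1.1 (illustrative labels).
IDENTIFIERS = ["Name"]                    # removed before publication
QUASI_IDENTIFIERS = ["Zip-code", "Age"]   # may be known to an adversary
SENSITIVE = ["Disease"]                   # must not be linkable to a person

def drop_identifiers(records):
    """First anonymization step: strip directly identifying attributes."""
    return [{k: v for k, v in r.items() if k not in IDENTIFIERS}
            for r in records]

record = {"Name": "Ali", "Zip-code": "4712", "Age": 29,
          "Disease": "Heart Disease"}
print(drop_identifiers([record]))
# [{'Zip-code': '4712', 'Age': 29, 'Disease': 'Heart Disease'}]
```

Removing identifiers alone is not enough, of course; the quasi-identifiers that remain are exactly what the disclosure attacks below exploit.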

When releasing microdata, it is necessary to prevent the sensitive information of individuals from being disclosed. Three types of information disclosure have been identified: membership disclosure, identity disclosure, and attribute disclosure. Membership disclosure arises when the data to be published is selected from a larger population and the selection criteria are themselves sensitive (e.g., when publishing datasets about diabetes patients for research purposes); it is then important to prevent an adversary from learning whether an individual's record is in the data. Identity disclosure (also called re-identification) occurs when an individual is linked to a particular record in the released data.

Identity disclosure is what society views as the clearest form of privacy violation: if one can correctly identify an individual's record in the anonymized data, people agree that privacy is violated and "anonymity" is broken. Attribute disclosure occurs when new information about some individuals is revealed, i.e., the released data makes it possible to infer the characteristics of an individual more accurately than would have been possible before the release. Identity disclosure often leads to attribute disclosure, but attribute disclosure can occur with or without identity disclosure. In some scenarios, the adversary is assumed to already know who is and who is not in the data (the membership information) and tries to learn additional sensitive information about the individuals.

ANONYMIZATION FRAMEWORK

The anonymization framework mainly concerns privacy models and anonymization methods. A number of privacy models have been proposed, including k-anonymity and ℓ-diversity. K-anonymity is the property that each record is indistinguishable from at least k−1 other records with respect to the quasi-identifier; equivalently, each QI group must contain at least k records. The protection k-anonymity provides is simple and easy to understand: if a table satisfies k-anonymity for some value k, then anyone who knows only the QI values of one individual cannot identify that individual's record with confidence greater than 1/k. However, two attacks against k-anonymity have been identified: the homogeneity attack and the background knowledge attack. To counter them, a QI group is said to satisfy ℓ-diversity if it contains at least ℓ "well-represented" values for the sensitive attribute.
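The k-anonymity condition can be checked mechanically by counting QI groups. The following is a minimal sketch; the records and attribute names mirror Table 1.2 but are otherwise illustrative.

```python
from collections import Counter

def is_k_anonymous(records, qi_attrs, k):
    """k-anonymity holds if every combination of quasi-identifier
    values appears in at least k records."""
    groups = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(count >= k for count in groups.values())

# Two QI groups of three records each, as in Table 1.2.
table = [
    {"Zip": "47**", "Age": "2*"}, {"Zip": "47**", "Age": "2*"},
    {"Zip": "47**", "Age": "2*"},
    {"Zip": "48**", "Age": ">=40"}, {"Zip": "48**", "Age": ">=40"},
    {"Zip": "48**", "Age": ">=40"},
]
print(is_k_anonymous(table, ["Zip", "Age"], 3))  # True
print(is_k_anonymous(table, ["Zip", "Age"], 4))  # False
```

Note that the homogeneity attack succeeds even on a table that passes this check, which is exactly what motivates the ℓ-diversity requirement.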

Several anonymization methods are in common use. Generalization replaces a value with a "less-specific but semantically consistent" one. A number of generalization schemes have been proposed; they fall into three categories: global recoding, regional recoding, and local recoding. In global recoding, all values of an attribute are generalized to the same level of the hierarchy. Regional recoding allows different values of an attribute to be generalized to different levels. Local recoding allows the same value to be generalized differently in different records.
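The global recoding used in Table 1.2 (zip codes masked to the same level, ages reduced to decades) can be sketched as follows; the helper names are illustrative.

```python
def generalize_zip(zip_code, masked_digits=2):
    """Global recoding: mask the trailing digits of every zip code
    to the same level of the hierarchy (4712 -> 47**)."""
    return zip_code[:-masked_digits] + "*" * masked_digits

def generalize_age(age):
    """Global recoding: keep only the decade of the age (29 -> 2*)."""
    return str(age // 10) + "*"

print(generalize_zip("4712"))  # 47**
print(generalize_age(29))      # 2*
```

Regional and local recoding would instead choose the masking level per region or per record, trading uniformity for lower information loss.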

Another anonymization method is bucketization (also known as anatomy or permutation-based anonymization). Bucketization first partitions the tuples in the table into buckets and then separates the quasi-identifiers from the sensitive attribute by randomly permuting the sensitive attribute values within each bucket. The anonymized data consists of a set of buckets with permuted sensitive attribute values, and can be produced directly from the microdata table. Because the quasi-identifiers are published unmodified, bucketization does not prevent membership disclosure.
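A minimal bucketization sketch in Python (the records, bucket size, and function name are illustrative): tuples are partitioned into buckets and only the sensitive values are shuffled, so the QI values remain exact.

```python
import random

def bucketize(records, qi_attrs, sa, bucket_size):
    """Publish exact QI values per bucket, but randomly permute the
    sensitive attribute values within each bucket."""
    published = []
    for bid, start in enumerate(range(0, len(records), bucket_size)):
        bucket = records[start:start + bucket_size]
        sensitive = [r[sa] for r in bucket]
        random.shuffle(sensitive)          # break the QI-SA linkage
        for r, s in zip(bucket, sensitive):
            row = {a: r[a] for a in qi_attrs}
            row["Bucket"], row[sa] = bid, s
            published.append(row)
    return published

data = [
    {"Zip": "4712", "Age": 29, "Disease": "Heart Disease"},
    {"Zip": "4787", "Age": 22, "Disease": "Heart Disease"},
    {"Zip": "4767", "Age": 27, "Disease": "Flu"},
    {"Zip": "4801", "Age": 43, "Disease": "Cancer"},
    {"Zip": "4833", "Age": 52, "Disease": "Heart Disease"},
    {"Zip": "4890", "Age": 67, "Disease": "Cancer"},
]
released = bucketize(data, ["Zip", "Age"], "Disease", 3)
```

Within each bucket the multiset of diseases is unchanged, so aggregate statistics survive, but the exact Zip and Age values are still published, which is why membership disclosure is not prevented.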

2. RELATED WORK

Ali Inan et al. [15] proposed a new approach for building classifiers over anonymized data by modeling anonymized data as uncertain data, without assuming any probability distribution over it. Experiments spanning various alternatives in both local and distributed data mining settings reveal that their distance-based classification method performs better than heuristic approaches for handling anonymized data. The investigated methods include instance-based classifiers (also called k-nearest neighbor classification) and support vector machines (SVMs). The experiments compare the accuracy of (1) various classification models built over anonymized data sets transformed using the different heuristics and (2) the approach of modeling anonymized data as uncertain data using expected distance functions. SVM classification is trickier than IB1 because there are several different choices for the feature representation of anonymized data.

G. Ghinita et al. [13] suggested a novel anonymization method for sparse high-dimensional data. It employs a representation that captures the correlation in the underlying data and facilitates the formation of anonymized groups with low information loss. They proposed CAHD (Correlation-aware Anonymization of High-dimensional Data), a greedy heuristic that capitalizes on data correlation to group transactions together. The algorithm stops when no ungrouped sensitive transactions remain or when no new groups can be formed. Additionally, if all remaining transactions are non-sensitive, no information loss is incurred by publishing them as a single group, regardless of their number, since their QIDs can be published directly.

Random perturbation has been proposed to prevent re-identification of records by adding noise to the data. However, an attacker can filter out the random noise, and hence breach data privacy, unless the noise is correlated with the data. The approach has two further limitations: (i) the usability of the data is constrained by the rules that the owner decides to disclose, and (ii) it assumes the data owner has the resources and expertise to perform advanced data mining tasks.

Justin Brickell et al. [9] asked whether generalization and suppression of quasi-identifiers offer any benefits over trivial sanitization, which simply separates quasi-identifiers from sensitive attributes. Previous work showed that k-anonymous databases can be useful for data mining, but k-anonymization does not guarantee any privacy. They therefore measured the trade-off between privacy and utility, where utility is the accuracy of data-mining algorithms executed on the sanitized records. Their finding: even modest privacy gains require almost complete destruction of the data-mining utility.

Tiancheng Li et al. [12] proposed the Injector framework for modeling and integrating background knowledge. Injector mines background knowledge that is consistent with the original data from the data itself: if certain facts exist in the data (e.g., males cannot have ovarian cancer), they should manifest themselves in the data and should be discoverable with data mining techniques. Two approaches are proposed for modeling background knowledge within the general Injector framework: rule-based Injector and distribution-based Injector. Rule-based Injector models background knowledge as negative association rules, i.e., inferences stating that some combination of QI values cannot entail certain sensitive attribute values. Distribution-based Injector models background knowledge as probability distributions, representing the adversary's prior belief about each individual as a distribution that incorporates the different types of knowledge present in the data.

David J. Martin et al. [6] proposed a language that can express any background knowledge about the data, together with a polynomial-time algorithm that measures the worst-case amount of sensitive-information disclosure, given that the attacker has at most k pieces of information in this language. They also provide a method to efficiently sanitize the data so that the worst-case disclosure stays below a specified threshold. In their setting, a data publisher (such as a hospital) has collected useful information about a group of individuals (such as patient records that would help medical researchers) and would like to publish this data while preserving the privacy of the individuals involved, limiting the disclosure of sensitive values against an attacker who may already know some facts about the table. Some utility of the data is given up in order to preserve privacy for specific future uses. When the attacker knows full identification information, generalization provides no more privacy than bucketization.

Two popular sanitization methods are used. Bucketization partitions the tuples in T into buckets and then separates the sensitive attribute from the non-sensitive ones by randomly permuting the sensitive attribute values within each bucket. The second technique is full-domain generalization, which coarsens the non-sensitive attribute domains; the sanitized data consists of the coarsened table along with the generalization used.

2.1 GENERALIZATION AND BUCKETIZATION

In privacy-preserving data mining, several microdata anonymization techniques have been proposed; the most popular is generalization for k-anonymity. Generalization first removes identifiers from the data and partitions the tuples into buckets, then transforms the QI values in each bucket into "less specific but semantically consistent" values so that tuples in the same bucket cannot be distinguished by their QI values. Table 1.2 shows the generalized table constructed from the microdata in Table 1.1 (the original table). Multiset-based generalization instead replaces each attribute value with the multiset of values in the bucket, which likewise confuses an adversary; the proposed slicing technique can then be applied to the table and provides more privacy than generalization and bucketization. Data perturbation is another anonymization method, which perturbs each record in the dataset in turn; however, the perturbed data contains a lot of noise and is not useful for data analysis, and background knowledge can still lead to disclosure of sensitive information.

No  Zip Code  Age   Disease
1   47**      2*    Heart Disease
2   47**      2*    Heart Disease
3   47**      2*    Heart Disease
4   48**      ≥40   Flu
5   48**      ≥40   Heart Disease
6   48**      ≥40   Cancer
7   47**      3*    Heart Disease
8   47**      3*    Cancer
9   47**      3*    Cancer

Table 1.2 Example of Generalization.

No  Zip Code  Age  Sex  Disease
1   4712      29   M    Cancer
2   4787      22   M    Cancer
3   4767      27   F    Heart Disease
4   4801      43   M    Cancer
5   4833      52   M    Heart Disease
6   4890      67   M    Heart Disease
7   4760      31   F    Heart Disease
8   4776      36   F    Flu
9   4791      32   M    Cancer

Table 1.3 Example of Bucketization

3. SLICING TECHNIQUE

The slicing approach improves on the current state of the art. Slicing partitions the dataset both vertically and horizontally. Vertical partitioning groups attributes into columns based on the correlations among them, so each column contains a subset of highly correlated attributes. Horizontal partitioning groups tuples into buckets. Finally, within each bucket, the values in each column are randomly permuted to break the linking between different columns. The basic idea of slicing is thus to break the associations across columns while preserving the associations within each column. This reduces the dimensionality of the data and preserves better utility than generalization and bucketization. Slicing preserves utility because it groups highly correlated attributes together and preserves the correlations between them; it protects privacy because it breaks the associations between uncorrelated attributes, which are infrequent and thus identifying. By contrast, when the dataset contains QIs and one SA, bucketization has to break their correlation.
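The vertical/horizontal partitioning idea can be sketched as follows. This is a simplified illustration, not the full algorithm: the column grouping here is chosen by hand, whereas the actual attribute partitioning would be driven by measured correlations, and the records and function name are illustrative.

```python
import random

def slice_table(records, columns, bucket_size):
    """Slicing sketch: attributes are grouped into columns (vertical
    partitioning), tuples into buckets (horizontal partitioning), and
    each column's value-tuples are permuted independently per bucket."""
    sliced = []
    for start in range(0, len(records), bucket_size):
        bucket = records[start:start + bucket_size]
        permuted = {}
        for col in columns:
            # Permuting whole tuples keeps within-column associations;
            # independent permutations break cross-column associations.
            vals = [tuple(r[a] for a in col) for r in bucket]
            random.shuffle(vals)
            permuted[tuple(col)] = vals
        sliced.append(permuted)
    return sliced

people = [
    {"Age": 29, "Sex": "M", "Zip": "4712", "Disease": "Heart Disease"},
    {"Age": 22, "Sex": "M", "Zip": "4787", "Disease": "Flu"},
    {"Age": 27, "Sex": "F", "Zip": "4767", "Disease": "Cancer"},
    {"Age": 43, "Sex": "M", "Zip": "4801", "Disease": "Cancer"},
]
# Correlated pairs are kept together in one column.
released = slice_table(people, [["Age", "Sex"], ["Zip", "Disease"]], 2)
```

Each (Age, Sex) pair survives intact, preserving that correlation, while the link from any (Age, Sex) pair to its (Zip, Disease) pair is hidden within the bucket.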

3.1 FORMALIZATION OF SLICING

The algorithm consists of three phases: attribute partitioning, column generalization, and tuple partitioning. In attribute partitioning, the algorithm places highly correlated attributes in the same column, which is good for both utility and privacy. Column generalization may be required for identity/membership disclosure protection. In the tuple partitioning phase, tuples are partitioned into buckets; the main task of the tuple-partition algorithm is to check, via a diversity-check subroutine, whether the resulting sliced table satisfies ℓ-diversity.
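The diversity check in the full algorithm is probabilistic, based on matching degrees between tuples and buckets. As a simplified illustration only, a distinct ℓ-diversity check per bucket can be sketched as:

```python
def satisfies_l_diversity(buckets, sa, l):
    """Distinct l-diversity (a simplification of the probabilistic
    check): every bucket must contain at least l distinct sensitive
    values."""
    return all(len({r[sa] for r in bucket}) >= l for bucket in buckets)

buckets = [
    [{"Disease": "Flu"}, {"Disease": "Cancer"}, {"Disease": "Flu"}],
    [{"Disease": "Flu"}, {"Disease": "Flu"}, {"Disease": "Flu"}],
]
print(satisfies_l_diversity(buckets, "Disease", 2))      # False: bucket 2 fails
print(satisfies_l_diversity(buckets[:1], "Disease", 2))  # True
```

A tuple-partition algorithm would split or merge buckets until every bucket passes this check.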

3.2 COMPARISON WITH GENERALIZATION

There are several types of recoding for generalization; the one that preserves the most information is local recoding. In local recoding, one first groups tuples into buckets and then, for each bucket, replaces all values of one attribute with a generalized value. Such a recoding is local because the same attribute value may be generalized differently when it appears in different buckets. Slicing preserves more information than this local recoding approach, and even more than an enhancement of it in which, rather than replacing specific attribute values with a generalized value, one uses the multiset of exact values in each bucket.
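The multiset-based enhancement mentioned above can be sketched as follows (the function name is illustrative): each value is replaced by the multiset of values its attribute takes in the bucket, instead of a coarser generalized value.

```python
def multiset_generalize(bucket, attrs):
    """Replace each attribute value with the multiset of values that
    the attribute takes in the bucket, instead of a coarser
    generalized value such as 2* or 47**."""
    out = []
    for r in bucket:
        new = dict(r)
        for a in attrs:
            new[a] = sorted(r2[a] for r2 in bucket)  # bucket multiset
        out.append(new)
    return out

bucket = [{"Age": 29, "Disease": "Flu"},
          {"Age": 22, "Disease": "Cancer"},
          {"Age": 27, "Disease": "Flu"}]
print(multiset_generalize(bucket, ["Age"])[0]["Age"])  # [22, 27, 29]
```

The exact ages {22, 27, 29} carry more information than the generalized value 2*, yet within the bucket no age can be tied to a particular record.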

3.3 COMPARISON WITH BUCKETIZATION

Bucketization can be viewed as a special case of slicing with exactly two columns: one column contains only the SA, and the other contains all the QIs. The advantages of slicing over bucketization are as follows. First, by partitioning attributes into more than two columns, slicing can prevent membership disclosure, which bucketization cannot. Second, unlike bucketization, which requires a clear separation between QI attributes and the SA, slicing can be used without such a separation; for datasets such as census data, one often cannot cleanly separate QIs from SAs because there is no single external public database from which to determine which attributes the adversary already knows. Finally, by allowing a column to contain both QI attributes and the sensitive attribute, slicing preserves the correlations between the SA and those QI attributes.

Slicing provides protection against membership disclosure and attribute disclosure. It is somewhat unclear how identity disclosure should be defined for sliced data (or for data anonymized by bucketization), since each tuple resides within a bucket and, within the bucket, the associations across different columns are hidden. In any case, because identity disclosure leads to attribute disclosure, protection against attribute disclosure also suffices as protection against identity disclosure.

4. CONCLUSION AND FUTURE WORK

The literature survey shows that generalization does not handle high-dimensional data well and that bucketization cannot protect against membership disclosure. To overcome these limitations, a slicing approach is proposed. It provides privacy in microdata publishing and preserves better data utility while protecting against privacy threats: slicing prevents attribute disclosure and membership disclosure, preserves data utility through the associations between highly correlated attributes, and protects privacy by breaking the associations between uncorrelated attributes in the dataset. The slicing technique is expected to be more effective than generalization and bucketization.


