
Abstract

The quality of data on various parameters influences decision making in any process. The complexity of business processes is increasing in the globalised world, and data is growing at an alarming rate along with it. Data quality management therefore becomes an imperative, necessitating a framework. Data quality and its management have assumed significant importance, as decisions based on poor quality data result in both financial losses and loss of credibility. Total Data Quality Management (TDQM) is one such data quality management framework, involving four phases: Define, Measure, Analyse and Improve. Although TDQM is built on Deming's widely accepted and practised Total Quality Management (TQM), its define phase still needs a better understanding. The define phase is further complicated by the fact that data quality is a multidimensional concept, requiring a better understanding of the relations between the dimensions for better decision making.

In this thesis, an attempt is made to give better insight into the define phase of TDQM by studying the relational aspects of the multiple dimensions of data quality, which is expected to help in choosing appropriate dimensions. The relative importance of the different dimensions and their relevance to the information system are decided using the Analytic Hierarchy Process (AHP). AHP is a multi-criteria decision-making technique that works on a reciprocal matrix built from pair-wise comparisons. It is therefore necessary to build pair-wise comparisons between the dimensions of data quality. In the present work, these are obtained through a survey of experts in the related information system. For n dimensions, the survey questionnaire contains (n-1) questions. Since the survey is administered to experts who use the related information systems, the relation between each pair of dimensions can be established more reliably, and a data consumer's perspective can be brought in to decide the appropriate data quality dimensions and the relations between them.
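To make the construction concrete, the following is a minimal sketch, not the implementation used in the thesis, of how a full reciprocal comparison matrix could be assembled from (n-1) adjacent pair-wise judgements under an assumption of transitive ratios, with the priority vector approximated by the standard geometric-mean method. The number of dimensions and the judgement values are hypothetical.

```python
import numpy as np

def reciprocal_matrix_from_chain(chain):
    """Expand (n-1) adjacent judgements d1:d2, d2:d3, ... into a full n x n
    reciprocal comparison matrix, assuming the ratios are transitive."""
    n = len(chain) + 1
    A = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            ratio = np.prod(chain[i:j])   # implied ratio of dimension i to dimension j
            A[i, j] = ratio
            A[j, i] = 1.0 / ratio
    return A

def ahp_priorities(A):
    """Approximate the AHP priority vector by the geometric-mean method."""
    gm = np.prod(A, axis=1) ** (1.0 / A.shape[0])
    return gm / gm.sum()

# Hypothetical judgements for n = 4 dimensions, e.g. completeness vs accuracy = 3,
# accuracy vs timeliness = 2, timeliness vs consistency = 1/2.
chain = [3.0, 2.0, 0.5]
A = reciprocal_matrix_from_chain(chain)
w = ahp_priorities(A)
print(np.round(w, 3))   # relative importance of the four dimensions
```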

Optimising the number of data quality dimensions is done by eliminating the less important dimensions using the AHP method. Further, validation of the stakeholders' perception of data quality is also attempted by constructing an idealised reciprocal matrix. The correlation between the comparison matrix of the optimised set of dimensions and the idealised reciprocal matrix is much better than that obtained without optimisation, which clearly indicates the advantages of the method proposed (an illustrative sketch of this optimisation and validation step is given after the list of systems below). To demonstrate the above method, the following information systems are considered for study:

Distributed Information System and

Cooperative Information System

The results and analysis of the above two systems are presented in the thesis.
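As an illustration of the optimisation and validation steps described above, the sketch below (hypothetical values, not the thesis's actual data or procedure) drops the least important dimension on the basis of its AHP weight, builds the idealised reciprocal matrix implied by the weights (entry i,j equal to w_i / w_j), and compares the pair-wise judgements of the full and the optimised matrices against their idealised counterparts using a simple correlation.

```python
import numpy as np

# Hypothetical survey-derived comparison matrix for four dimensions and its AHP weights.
A = np.array([[1.0, 3.0, 5.0, 7.0],
              [1/3, 1.0, 2.0, 3.0],
              [1/5, 1/2, 1.0, 2.0],
              [1/7, 1/3, 1/2, 1.0]])
w = np.prod(A, axis=1) ** (1.0 / 4)
w /= w.sum()

def idealised(weights):
    """Perfectly consistent reciprocal matrix implied by a weight vector."""
    v = np.asarray(weights)
    return v[:, None] / v[None, :]

def upper_corr(X, Y):
    """Pearson correlation of the upper-triangular judgements of two matrices."""
    iu = np.triu_indices(X.shape[0], k=1)
    return float(np.corrcoef(X[iu], Y[iu])[0, 1])

keep = np.sort(np.argsort(w)[::-1][:3])        # retain the three highest-priority dimensions
A_opt = A[np.ix_(keep, keep)]

print(upper_corr(A, idealised(w)))             # full set of dimensions vs its idealised matrix
print(upper_corr(A_opt, idealised(w[keep])))   # optimised set vs its idealised matrix
```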

The thesis also presents work on the aggregational semantic heterogeneity problem, which complements the contextual issues of data quality.

Introduction


The data in the digital era is growing at an enormous rate, and the dependence of organisations on data for decision making is increasing day by day. The problems related to data grow along with it. Data problems are observed to result in lost revenues and market share, reduced profits, and customer dissatisfaction. Poor quality data is estimated to increase operational cost by at least 10% (and probably as much as 20%) of revenue [Redman 2004].

In a recent study by Gartner [2009], involving more than 140 companies in various industries and geographic regions, participating organisations estimated that companies lose an average of US$8.2 million annually as a result of data quality issues. Annual losses of US$20 million or more were cited by 22% of the organisations, and 4% indicated annual losses as high as US$100 million [Gartner 2009].

Today, most organizations use data in two ways: transactional/operational use, and analytic use. The knowledge gained as a result of analysis influences further improvements. Both usage scenarios rely on high quality data, which emphasises the need for processes to ensure that data is of sufficient quality to meet all the business needs. Therefore, organisations need to move from reactive and ad-hoc approaches to data quality issues and initiate proactive and comprehensive processes to protect themselves from data related disasters.

Data Quality – Different Perspectives

Traditionally, data quality has often been described from the perspective of accuracy. Research and practice, however, indicate that data quality comprises a number of dimensions beyond accuracy [Huang et al 1999]. There is no unanimity among researchers and practitioners about the definition of data quality [Klein 1998]. The dimensions of data quality fall into categories such as intrinsic, representational and contextual.

Drawing a parallel with the manufacturing industry, data quality is defined from the data consumer's viewpoint of fitness for use, as "data that are fit for use by data consumers" [Wang and Strong 1996]. Further, the user defines good data quality for each proposed use of the data, within its context of use; hence data quality is a contextual issue [Strong et al 1997].

Therefore, data are of high quality if they are fit for their intended uses in operations, decision making and planning. Data that are defect-free and possess the required features are considered fit for use [Redman 2001]. Data quality improvements cannot be achieved through technology intervention alone; they involve processes and people in addition to technology [Karr et al 2000].

Although data and information are different concepts, data quality is often treated as the same as information quality in real-world practice. Therefore, in this research, data quality and information quality are considered as synonymous.

Impact of Poor Data Quality on the Performance of Enterprises

The DemandGen Report survey, conducted in October 2010, reveals that poor quality data can have an impact on all areas of an organisation's operations. The summary of survey responses indicating the impact of poor quality data on different areas of organisations is shown in Figure 1.1. The survey found that more than 62% of organisations rely on marketing/prospect data that is 20 to 40% incomplete or inaccurate. Additionally, almost 85% of businesses reported that they operate customer relationship management (CRM) and/or sales force automation (SFA) databases with 10 to 40% bad records [DemandGen Report 2010].

Figure 1.1 Impact of bad data on different areas of enterprises

Present Solutions for Data Quality Problems

Several approaches are followed to ensure the quality of data. Some of these approaches focus on the selection of data quality dimensions specific to a few systems, whereas others propose frameworks for data quality management. TDQM is one such framework, designed in line with TQM, with four phases: Define, Measure, Analyse and Improve. While there is substantial work on measuring the impact of poor quality data, analysing root causes and improving processes, there is a lack of unanimity among researchers and practitioners on the very definitions of data quality dimensions, and this influences the processes that follow. Further, the choice of appropriate dimensions of data quality for a system under consideration is mostly made based on the experience of the designers of the system, so the possibility of missing the data consumer's perspective is high. Bringing in the data consumer's perspective to decide the appropriate dimensions is desirable, as quality is fitness for use, yet no systematic method or process appears to exist to achieve this.

Thus, there is a need for a methodology for selecting appropriate dimensions of data quality, which forms the basis of the rest of the activities of the data life cycle. Such a methodology should help not only in selecting appropriate dimensions but also in optimising their number, since adding dimensions increases design complexity while removing a dimension may reduce quality. Further, there may be a gap between the expectations of the user of the information system and the services or products offered by the system; this could be due to semantic heterogeneity. There is therefore also a need to find solutions to problems arising from semantic heterogeneity, as they contribute towards improving the quality of data from the data consumer's perspective.

A large number of researchers are working in this direction, viz. on the dimensions of data quality and the importance of data quality [Wang 1996, Wang & Strong 1996, Pipino et al 2002, Kahn et al 2002 and references therein]. However, there seems to be little work available in the literature on the following points connected with TDQM:

Well accepted definitions of data quality dimensions in TDQM

Characterising information systems through relevant data quality dimensions

Optimising their number by removing dimensions irrelevant to the given information system, and

Impact of semantic heterogeneity on data quality

Evolving characteristics of information systems pose new requirements with respect to data quality. There is a need to find ways and means of evaluating which data quality dimensions are the most relevant to the characteristics of a specific information system [Batini and Pernici 2006].

The Problem Definition

The present research focuses on choosing appropriate dimensions of data quality for an information system under consideration, based on inputs from the data consumer, and on optimising their number.

The key contributions of the present research work in line with the above observations are:

Use of the Analytic Hierarchy Process (AHP) as a novel way of identifying appropriate data quality dimensions for an information system.

Identification of data quality dimensions for a distributed information system and a cooperative information system.

A solution to the aggregational semantic heterogeneity problem

Organisation of the Report

The thesis is organised in seven chapters as given below:

Chapter 1: An introductory chapter describing the impact of poor data quality on systems and enterprises and the importance of data quality. It also briefly describes the problem identified and the key contributions.

Chapter 2: This chapter presents the literature survey: a formal understanding of the terms involved, which helps in defining the scope, the different frameworks for data quality, and the identification of the research gap.

Chapter 3: This chapter deals with the framework chosen, the choice of appropriate dimensions and the development of a survey questionnaire for two information systems.

Chapter 4: This chapter deals with the Analytic Hierarchy Process (AHP) approach, with an example.

Chapter 5: This chapter deals with the results of the survey and of applying AHP to the data collected through the survey. The results so obtained are presented and discussed for the distributed information system and the cooperative information system. The results of optimising the dimensions of data quality are presented, and the impact of this optimisation on the above two information systems is discussed.

Chapter 6: The aggregational semantic heterogeneity problem and its relation to data quality are discussed, and a possible solution is proposed.

Chapter 7: This chapter deals with the conclusions and future scope of the work proposed in the thesis.

After Chapter 7, the bibliography is presented which is followed by Annexures containing the survey questionnaire and related data.

Literature Survey

This chapter presents a literature survey of three data quality management frameworks that work towards continuous data quality improvement, along with the methodology used to compare them; the comparison is used to identify the gaps. A study of data quality dimensions is also carried out. As data quality is a multi-dimensional concept, a decision-making tool that ranks the various dimensions by importance helps in choosing appropriate data quality dimensions and hence in improving the quality of data. The literature on decision-making tools suggests that AHP is a widely accepted tool in this regard; hence it has been studied and is presented here. Finally, semantic heterogeneity as a cause of data quality problems is examined.

Data Quality Management Frameworks

Various approaches can be seen in data quality research. Data Quality Dimensions are listed and categorised based on

1. Empirical research [Wang & Strong 1996],

2. Experience from professional and consultancy practice [Redman 1996, English 1999A, Loshin 2001], and

3. Theoretical conclusions [Wand & Wang 1996, Price & Shanks 2005].

Data Quality Management encompasses initiatives aimed at improving the quality of data [Batini & Scannapieco 2006]. It goes beyond a purely reactive improvement of data quality which involves identification and elimination of data defects [Shankaranarayanan & Cai 2006]. Data Quality Management consists of the proactive and preventive improvement of data quality through a continuous cycle of definition, measurement, analysis and improvement as well as the design of the required framework [Wang et al. 1998, English 1999A, Eppler & Helfert 2004].

According to [Eppler 2000], a data quality management framework should facilitate the following:

1. A concise set of criteria for data evaluation.

2. A scheme to analyze and solve data quality problems.

3. The basis for data quality measurement and proactive management.

4. A conceptual map for the community of researchers that can be used to structure a variety of approaches, theories, and data quality related phenomena.

Apart from this, it is felt that optimising the number of data quality dimensions and understanding the relations between them are also important. Instead of reactive and ad-hoc solutions to data quality problems, proactive and evolutionary approaches leading to a cycle of continuous improvement are naturally preferred, as mentioned earlier.

A literature study of three such data quality management frameworks is attempted in the following subsections, viz. Total Data Quality Management (TDQM), Total Quality Data Management (TQdM) and the Information Quality Management Framework (IQMF).

Total Data Quality Management (TDQM)

TDQM is a data quality management framework designed to govern the complete data quality (DQ) life cycle. It was developed at the Massachusetts Institute of Technology by Richard Wang and others [Wang and Strong 1996, Shankaranarayanan et al. 2003] by borrowing concepts and practices from the TQM proposed by Deming [1982]. The objective of the work of Wang and others was to apply knowledge from the manufacturing domain to the information systems domain: the best practices related to quality improvement in the manufacturing world are applied to data and information processes to achieve improved quality. The central concept is the Information Product (IP), which treats data like products in a manufacturing environment. Based on their definition of data and the analogy between information systems and manufacturing systems, data are considered the raw material for information: data are the input sent through the information system, just as raw material is sent as input to a manufacturing system. Further, the outputs of the information system, such as reports, databases, forms and graphs, are analogous to the tangible products of the manufacturing industry after processing. Table 2.1 shows this equivalence for data quality based on product manufacturing principles.

Table 2.1 Product vs Information Product

               Product Manufacturing     Information Manufacturing
Input          Raw Materials             Raw Data
Process        Assembly Line             Information System
Output         Physical Products         Information Product
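To illustrate the analogy in Table 2.1, the following toy sketch (hypothetical types and fields, not part of TDQM itself) treats raw records as the raw material, a processing function as the "assembly line", and a summary report as the resulting information product.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RawDatum:
    """Raw material of the 'information factory': one unvalidated record."""
    customer_id: str
    amount: float

@dataclass
class InformationProduct:
    """Tangible output of the information system, analogous to a manufactured product."""
    total_customers: int
    total_revenue: float

def information_system(raw: List[RawDatum]) -> InformationProduct:
    """The 'assembly line': processes raw data into an information product (a summary report)."""
    valid = [r for r in raw if r.amount >= 0]   # a trivial quality gate on the raw material
    return InformationProduct(
        total_customers=len({r.customer_id for r in valid}),
        total_revenue=sum(r.amount for r in valid),
    )

print(information_system([RawDatum("c1", 120.0), RawDatum("c2", -5.0), RawDatum("c3", 30.0)]))
```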

The work by Wang and others referred to above naturally uses four phases similar to the Plan, Do, Check and Act of TQM, namely Definition, Measurement, Analysis and Improvement. A further similarity is that both operate in a cyclic manner, as shown in Figure 2.1.

Figure 2.1 Phases of TDQM (the TDQM cycle)

The TDQM methodology suggests that the vision must be clearly understood by top management, and that it is top management's responsibility to convey and reinforce that data quality is an important consideration throughout the organisation. The following key roles are suggested for better implementation [Wang and Strong 1996]: i) Data Providers: the people who create or collect raw information; they are also known as suppliers. ii) Data Custodians: the people responsible for designing, developing and maintaining the IT infrastructure of data and computer systems; they are responsible for data storage, maintenance, integrity, security, recovery and applications, and are also known as manufacturers. iii) Data Consumers: the users of the information, who work on the immediate output of the system; the data consumers are the ones who determine the product quality. iv) Manager: the person responsible for the overall process and the IP's life cycle, whose responsibilities include coordinating and managing the three stakeholders mentioned above.

As mentioned above, there are four phases in TDQM; for completeness, a brief description of each, as conceived by Wang and others, is presented here. i) Definition Phase: in this step of TDQM, the IP's characteristics are defined, the IP's data quality requirements are assessed, and the IP's information manufacturing system is identified [Wang 2001]. All the DQ dimensions should be considered here to evaluate and define the level of importance and relevance they will have for a specific project at a given time. If even one of the dimensions is taken for granted, many quality problems may arise in later steps of the process, or even when the data are already in use by the consumer. Therefore, in this step all data quality dimensions have to be clearly identified, understood, prioritised and filtered. ii) Measurement Phase: the key task of the measurement phase is to produce metrics for data quality that estimate and define the extent of the problems. The metrics produced in this phase should reflect the organisation's goals for improving the IP, such as key performance indicators (KPIs). Tools like data profiling help to perform the measurement in a structured manner. iii) Analysis Phase: during the analysis phase, the IP team calculates the impact of poor quality data by identifying the root causes of the data quality problems. The tools usually used in this phase are the data profiling methodology, Statistical Process Control (SPC) and the Pareto chart (a minimal illustrative sketch of such measurement and analysis is given after the list of key improvement areas below). iv) Improvement Phase: the IP team needs to identify the key areas for improvement in this phase. These key areas are [Wang 1998]:

Aligning information flow and workflow with infrastructure

Realigning the key characteristics of the IP with business needs.

Process reengineering and solutions to managerial issues are the activities in this phase.
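As a concrete illustration of the measurement and analysis phases, the following minimal sketch (hypothetical records and defect tags, not a TDQM tool) profiles the completeness of individual fields and ranks defect causes in Pareto order.

```python
from collections import Counter

# Hypothetical profiling input: records with possibly missing fields and a tag
# naming the defect observed at entry (None when the record is clean).
records = [
    {"name": "Asha", "email": "a@x.com", "defect": None},
    {"name": "Ravi", "email": None,      "defect": "missing email"},
    {"name": None,   "email": "c@x.com", "defect": "missing name"},
    {"name": "Dev",  "email": None,      "defect": "missing email"},
]

def completeness(rows, field):
    """Measurement: fraction of records in which `field` is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)

def pareto_of_causes(rows):
    """Analysis: rank root causes by frequency (the ordering behind a Pareto chart)."""
    return Counter(r["defect"] for r in rows if r["defect"]).most_common()

for field in ("name", "email"):
    print(field, round(completeness(records, field), 2))
print(pareto_of_causes(records))
```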

More information on this methodology is given later, leading to the conclusion as to why this framework is preferred over the others.

The Total Quality Data Management (TQdM) methodology provides management-oriented improvement solutions; it is reviewed in the next part of this chapter.

Total Quality Data Management (TQdM)

Larry English [1999A] defines data quality (information quality) as "consistently meeting knowledge worker and end customer expectations". He adds consistent data provision to the context-dependent data quality proposed by Wang and Strong [Wang and Strong 1996]. Here, consistency encompasses a) fulfilling the requirements of all data users, and b) providing the same data for the same business objects. Further, English defines information as the aggregate of i) data, ii) the clear meaning of the data (its definition) and iii) a comprehensible presentation, and lists different data quality dimensions for these three components. TQdM, the framework built by English for the sustainable improvement of data quality, distinguishes between processes for measuring data definition quality and data quality. It consists of five discrete processes of measurement and improvement and an umbrella process of implementing a cultural transformation, to grow an environment that values information customers, a mindset of process excellence and a habit of continuous improvement, as presented in his work [Figure 4.1, page 70, English 1999A]. TQdM is also referred to as Total Information Quality Management (TIQM) later in the literature [English 1999B]. The TQdM methodology was initially designed for data warehouse projects, but it has proven to be a helpful general-purpose DQ methodology because of its broad scope and level of detail.

The management improvement solutions suggested by the TQdM methodology are [Batini and Scannapieco 2006]: i) organisation readiness assessment before pursuing DQ processes; ii) a customer satisfaction survey, to discover problems at the source, i.e., directly from service users; iii) focus on a pilot project in the beginning, working towards its success by learning and adapting; this avoids the risk of failure in the initial phase, which is typical of large-scale projects performed in one single phase; iv) definition of information stewardship, i.e., the organisational units and their managers who, with respect to the laws (in public administrations) and rules (in private organisations) that govern business processes, have specific authority over data production and exchange; v) following the results of the readiness assessment, analysis of the main barriers in the organisation to the DQ management perspective, in terms of resistance to change processes, control establishment, information sharing and quality certification; vi) establishment of a specific relationship with senior managers, in order to obtain their consensus and active participation in the process.

Dasu and Johnson [2003] suggest a second set of major managerial principles: i) whenever fresh data arrive, verify the schema constraints and business rules and communicate discrepancies to those concerned; ii) build a relationship with the owners and creators of data to ensure that changes are incorporated; iii) manage non-cooperative constituents through senior management's intervention; iv) avoid repeated data entry through automation, and ensure data are captured as per the schema and business rules; v) identify discrepancies through regular audits; vi) maintain an updated and accurate view of the schema and business rules, using proper software and tools to enable this; vii) appoint a data steward who owns the entire process and is accountable for the quality of data; viii) publish the data where it can be seen and used by as many users as possible, so that discrepancies are more likely to be reported.
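The first of these principles, verifying schema constraints and business rules on fresh data and reporting discrepancies, might look like the following minimal sketch; the schema, rules and record are hypothetical.

```python
# Hypothetical schema constraints and business rules for incoming order records.
schema = {"order_id": int, "quantity": int, "unit_price": float}
business_rules = [
    ("quantity must be positive",   lambda r: r["quantity"] > 0),
    ("unit_price must be positive", lambda r: r["unit_price"] > 0),
]

def check_record(record):
    """Return a list of human-readable discrepancies for one fresh record."""
    issues = []
    for field, expected in schema.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            issues.append(f"{field} should be of type {expected.__name__}")
    for description, rule in business_rules:
        try:
            if not rule(record):
                issues.append(description)
        except (KeyError, TypeError):
            pass  # the offending field is missing or mistyped and already reported above
    return issues

# Discrepancies would be communicated to the data owners, e.g. in an audit report.
print(check_record({"order_id": 7, "quantity": 0}))
```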

2.1.3 Information Quality Management Framework (IQMF)

IQMF, proposed by Caballero et al [2008], is based on their concept of an Information Management Process (IMP), an abstraction addressing the different information manufacturing processes and IQ management activities. IQMF consists of two main components: i) a reference model (the Information Quality Management Maturity Model, IQM3) for the IMP, based on staged maturity levels, as CMMI [SEI 2002] is for software processes, and ii) a methodology for assessing and improving the IMP (the Methodology for the Assessment and Improvement of Information Quality, MAIMIQ), in line with reference models such as SCAMPI [SEI 2001] used for software processes.

The software process concept given by Fuggetta [2000] is applied in this framework in order to propose a new abstraction, named the Information Management Process (IMP), which brings together both information manufacturing [Wang 1998] and IQ management processes in order to generate an Information Product. This product, they claim, satisfies both the user's requirements and the user's IQ requirements; this is also the opinion of Ballou and Tayi [1996].

Another part of this framework is the Information Quality Management Maturity Model (IQM3). It recognises five maturity levels, namely: i) Initial, ii) Defined, iii) Integrated, iv) Quantitatively Managed and v) Optimizing; each level addresses a specific IQ management goal. The method borrows tools from the software engineering and database fields that are thought to be the best fit in the context, though not always the best available. This is probably the reason why researchers in the field have not used this method extensively.

2.1.4 Comparison of Data Quality Management Frameworks

Eppler [2000] has proposed a method for evaluating frameworks. According to this, frameworks are evaluated on two kinds of criteria: i) analytic criteria, which are based on academic standards and require clear definitions of the terms used in a framework, a positioning of the framework within the existing literature, and a consistent and systematic structure; and ii) pragmatic criteria, consisting of dimensions which make the framework applicable, namely conciseness (i.e., whether the framework is easily remembered), whether examples are provided to illustrate the working of the framework, and the availability of tools based on the framework. Table 2.2 lists the meta-criteria for the evaluation of frameworks and the corresponding evaluation questions.

Table 2.2 Eppler and Wittig Framework Evaluation Criteria (meta-criterion: evaluation question)

Definitions: Are the definitions of the dimensions and their categories available, along with their explanations?
Positioning: Are the scope of application of the framework and its limits clearly stated? Is the framework positioned clearly in the existing literature?
Consistency: Are the dimensions mentioned collectively complete and mutually exclusive?
Conciseness: Can the framework be easily remembered?
Examples: Are examples or case studies provided to illustrate the different criteria?
Tools: Are tools available to implement and practise the framework?

A review by Eppler and Wittig [2000] suggests that most of the existing data quality frameworks are often domain specific and either strong on objective or subjective measurements, but not strong on both types of measurements at the same time. Frameworks also fail to analyse interdependencies between the various criteria within the framework. Therefore, Eppler and Wittig [2000] suggest five future directions:

A generic framework, not specific to a single application such as data warehouse or corporate communications

A framework that shows interdependencies between the different quality criteria

A framework that includes a list of problem areas and indicators, thereby going beyond a simple list of quality criteria

The development of tools which are based on information quality frameworks

A framework that is at the same time theoretical and practical

These criteria have been applied to the frameworks listed above; the results of the comparison and the reasons for choosing the framework selected for this study, on the basis of Eppler's work, are presented in Chapter 3. That work suggests a preference for TDQM over the other frameworks.

Data Quality Dimensions

In the frameworks mentioned above, the definition phase is expressed through data quality dimensions. These dimensions consider i) conformance to specifications, ii) user requirements, iii) context of use, etc.; however, a comprehensive list of commonly agreed data quality dimensions has remained an open issue [Batini et al 2006]. Dimensions are typically distinguished into two categories:

a. Schema quality dimensions, referring to the quality of the intensional representation of the data related to the application. The application cannot perform unless these intrinsic characteristics of data quality are satisfied.

b. Data quality dimensions, referring to the quality of the extensional representation of the data, i.e. their values, formats, etc. These contribute to the contextual and representational attributes of data quality. The distinction between the two categories is illustrated by the sketch below.
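The following small sketch (hypothetical attributes and formats) contrasts the two: a schema-level check asks whether the intensional structure promised by the design is present, while a value-level check asks whether the stored values respect the agreed representation.

```python
import re

# Schema-level (intensional) expectation: the design promises that every customer
# record exposes these attributes.
required_attributes = {"customer_id", "date_of_birth"}

# Value-level (extensional) expectation: stored dates follow the agreed ISO-8601 format.
iso_date = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def schema_quality(record):
    """Does the record carry the attributes the schema promises?"""
    return required_attributes <= record.keys()

def value_quality(record):
    """Is the stored value in the agreed representation?"""
    return bool(iso_date.match(record.get("date_of_birth", "")))

r = {"customer_id": "C-101", "date_of_birth": "13/04/1990"}
print(schema_quality(r))   # True: the intensional structure is respected
print(value_quality(r))    # False: the value itself is not in the agreed format
```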

A large amount of work on data quality is available in the literature; this is presented in the review by Naumann and Rolker [2000]. However, work on qualifying the conceptual definitions appears to be limited, probably because of the subjective nature of the definitions of data quality dimensions.

A representative set of proposals on data quality dimensions is listed below. These proposals share a number of defining characteristics regarding their classification of the dimensions of data quality. The references to these works and the characteristics regarding data quality dimensions inferred from them are presented in Table 2.3.

Table 2.3 Summary of Data Quality Dimensions in literature


