OLAP Multidimensional Database Concept


CHAPTER 2

LITERATURE REVIEW

2.1 INTRODUCTION

This chapter provides background information on, and reviews the characteristics of, the data warehouse, the OLAP multidimensional database concept, the data mining model and the applications of data mining. Within this research, the concepts, design and implementation approaches are examined for developing a complete data warehouse technology framework that deploys a successful model integrating an OLAP multidimensional database and a data mining model.

Section 2.2 discusses the fundamentals of the data warehouse, data warehouse models, and the Extract, Transform and Load (ETL) process that moves raw data into the data warehouse. It includes research on the existing data warehouse models authored by William Inmon, Ralph Kimball and various other scholars working on data warehouse models. Section 2.3 introduces background information on OLAP, covering various OLAP models, OLAP architecture, the concepts behind processing multidimensional databases, and the multidimensional database schema design and implementation used in this research. Section 2.4 introduces the fundamentals of data mining, including the techniques, methods and processes available for OLAP data mining. Section 2.5 discusses Mitch Kramer's product comparisons for data warehousing, data mining and OLAP, and explains why Microsoft products are used to design and implement the new proposed model.

This literature review introduces the relationships between the data warehouse, the OLAP multidimensional database and the data mining model used to deploy four experimental applications for benchmarking. The research also shows that the "new proposed model" data warehouse technology framework is ready to transform any type of raw data into useful information, and it reviews the new proposed model against each existing data warehouse and OLAP multidimensional database framework.

2.2 DATA WAREHOUSE

According to William Inmon (1999), known as the "Father of Data Warehousing", a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process. A data warehouse is a database containing data that usually represents the business history of an organization. This historical data is used for analyses that support business decisions at many levels, from strategic planning to performance evaluation of a discrete organizational unit.

A data warehouse is a type of database system aimed at the effective integration of operational databases into an environment that enables the strategic use of data (G. Zhou et al., 1995). The supporting technologies include relational and multidimensional database management systems, client/server architecture, metadata modelling and repositories, graphical user interfaces and much more (J. Hammer et al., 1995; V. Harinarayan et al., 1996).

The data warehouse is currently a much-researched subject: it is not only commonly used in the business and finance sectors but can be applied appropriately in various other sectors. Data warehouses are designed for analyzing and processing data into useful information, using data mining tools, for critical decision making. A data warehouse provides access to data spread across an enterprise's disparate environments.

In this literature study, two authors are identified as the main contributors and co-founders of the data warehouse field: William Inmon (1999; 2005) and Ralph Kimball (1996; 2000). The two authors' perceptions of data warehouse design and architecture differ from one another. According to Inmon (1996), a data warehouse is a centralized structure on which dependent data marts are built, whereas Kimball (1999) defined the data warehouse as a bus structure, a combination of data marts populated together as a data warehouse. Table 2.1 summarizes the differences in data warehouse ideology between William Inmon and Ralph Kimball.

Table 2.1 William Inmon and Ralph Kimball Data Warehouse Differences

 

Paradigm
Inmon: An enterprise has one data warehouse, and data marts source their information from the data warehouse. Information is stored in third normal form.
Kimball: The data warehouse is the collection of heterogeneous data marts within the enterprise. Information is always stored in the dimensional model.

Architecture
Inmon: Top-down approach.
Kimball: Bottom-up approach.

Concept
Inmon: Integration of data from various systems into a centralized repository.
Kimball: Dimensional modelling, bridging between relational and multidimensional databases.

Design
Inmon: The design pattern depends on third normal form; its purpose is data granularity.
Kimball: Data marts are connected in a bus structure; the data marts form the "union" of the data warehouse. This approach is also known as a virtual data warehouse.

ETL Methods
Inmon: Data is extracted from operational data sources and fed into a staging database area. It is then transformed, integrated and consolidated, transferred to an Operational Data Store database, and finally loaded into the data marts.
Kimball: Data is extracted from legacy systems, then consolidated and verified in a staging database. It is fed into the ODS, where further data is added or updated; the Operational Data Store holds a fresh copy of the data, which is integrated and transformed into the data mart structure.

Data Marts
Inmon: Data marts are available as subsets of the data warehouse.
Kimball: Data marts can be placed on different servers or in different geographical locations.

Based on this data warehouse literature, Inmon (2005) and Kimball (2000) hold different philosophies, but they agree that the successful design and implementation of data warehouses and data marts depends mainly on the effective collection of operational data and the validation of the data marts. Both approaches share the same database staging concepts and the same ETL process for data from a database source. They also share the understanding that independent data marts or data warehouses cannot fulfil enterprise-level end users' requirements for precise, timely and relevant data.

2.2.1 DATA WAREHOUSE ARCHITECTURE

Data warehouse architecture is a wide research area with many sub-areas, and it can be treated with different approaches to analysis, design and implementation by different enterprises. The aim of this study is to provide a complete view of data warehouse architecture. Two important scholars from TDWI, Thilini (2005) and Eckerson (2003), discuss the topic of data warehouse architecture in more detail.

According to Eckerson (2003), before a successful business intelligence system can be implemented, where users work upfront with programs such as specialized reporting tools, OLAP tools and data mining tools, the data warehouse architecture model must concentrate on the database staging process that integrates the different OLTP systems; this staging process is responsible for the ETL that makes the whole process workable. Thilini (2005) conducted a two-phase survey study investigating which factors may influence the selection of a data warehouse architecture. In Thilini's literature study, five data warehouse architectures are in practice today, as shown in Table 2.2.

Table 2.2 Data Warehouse Architectures (Adapted from Thilini, 2005)

Data Warehouse Architecture

Independent Data Marts

Independent data marts are also known as localized, small-sized data warehouses. They are mainly used by departments, divisions or regions of a company to provide for their own operational databases. The data marts differ because their structures vary from location to location, and this inconsistent database design makes it difficult to analyze across the data marts. Thilini (2005) cited the work of Winsberg (1996) and Hoss (2002): "It is common for organizational units to develop their own data marts". Data marts are best used as a prototype for an ad hoc data warehouse and for evaluation before building a real data warehouse.

Data Mart Bus Architecture

Kimball (1996) pioneered the design and architecture of the data warehouse as a union of data marts, known as the bus architecture. A bus-architecture data warehouse is derived from the union of the data marts and is also known as a virtual data warehouse.

The bus architecture allows data marts to be located not only on one server but also across different servers. This allows the data warehouse to function more in a virtual mode, gathering all the data marts and processing them as one data warehouse.

Hub-and-spoke architecture

Inmon (2005) developed the hub-and-spoke architecture. The hub is the central server that takes care of information exchange, and the spokes handle data transformation for all regional operational data stores. Hub-and-spoke is mainly focused on building a scalable and maintainable infrastructure for the data warehouse.

Centralized Data Warehouse Architecture

The centralized data warehouse architecture is almost identical to the hub-and-spoke architecture but without the dependent data marts. This architecture copies and stores heterogeneous operational and external data in a single, consistent data warehouse. It has only one data model, which is consistent and complete across all data sources.

According to Inmon (1999) and Kimball (2000), a central data warehouse should have database staging, also known as an Operational Data Store, as an intermediate stage for the operational processing of data integration before the data is transformed into the data warehouse.

Federated Architecture

According to Hackney (2000), a federated data warehouse is an integration of multiple heterogeneous data marts, database staging or operational data stores, and a combination of analytical applications and reporting systems. The federated concept focuses on a framework of integration that makes the data warehouse as effective as possible. Jindal (2004) concludes that the federated data warehouse approach is a practical approach for a data warehouse architecture, as it focuses on higher reliability and provides excellent value if its business rules are "well defined, documented and integrated".

Thilini (2005) concludes that the hub-and-spoke and centralized data warehouse architectures are similar, and their survey scores are almost the same. The centralized architecture, which requires no data marts, is faster and easier to implement; it scored higher than hub-and-spoke where there is an urgent need for a relatively fast implementation approach.

A data warehouse is a read-only data source whose end users are not allowed to change the values or data elements. Inmon's (1999) data warehouse architecture strategy differs from Kimball's (1996): Inmon's model splits off the data marts as copies, distributed as an interface between the data warehouse and the end users, whereas Kimball views the data warehouse as a union of data marts, the collection of data marts combined into one central repository. Diagram 2.1 illustrates the differences between Inmon's and Kimball's data warehouse architectures, adapted from Mailvaganam, H. (2007).

Diagram 2.1 Inmon's and Kimball's Data Warehouse Architecture (adapted from Mailvaganam, 2007)

In this work, it is very important to identify which data warehouse architecture is robust and scalable for building and deploying enterprise-wide systems. According to Laney (2000) and Watson, H. (2005), it is important to understand and select the appropriate data warehouse architecture, given "the success of the various architectures" examined by Watson. Analysis in this research shows that the most popular data warehouse architecture is the hub-and-spoke proposed by Inmon, a centralized data warehouse with dependent data marts, followed by the data mart bus architecture with dimensional data marts proposed by Kimball. The new proposed model therefore combines the hub-and-spoke and data mart bus architectures: it is designed with a centralized data warehouse and with data marts that are used for multidimensional database modelling.

2.2.2 DATA WAREHOUSE EXTRACT, TRANSFORM, LOADING

Data warehouse architecture begins with the extract, transform, load (ETL) process, which ensures the data passes a quality threshold. According to Evin (2001), having the right data is essential and critical for the success of an enterprise. ETL is an important tool in the data warehouse environment, ensuring that data entering the data warehouse is cleansed from the various source systems and locations. ETL is also responsible for running the scheduled tasks that extract data from the OLTP systems. Typically, a data warehouse is populated with historical information from within a particular organization (Bunger, C. J. et al., 2001). The complete ETL process is described in Table 2.3.

Table 2.3 Extract, Transform, and Load Process in Data Warehouse architecture

Process

Descriptions

Extract

Extraction is the first process; it involves moving data from the operational databases into the database staging area, or operational data store, before it is populated into the data warehouse. At this stage, the operational database data is examined as it is extracted into the staging area, so that exceptions can be handled and all errors fixed before the data enters the data warehouse; this saves a great deal of time when loading into the data warehouse.

Transform

Once data extraction into the database staging area is complete, the data is transformed to ensure data integrity within the data warehouse. Transformation can be done by several methods, such as field mapping and algorithm comparisons.

Load

After extraction and transformation, the data is finally loaded into the data warehouse (in Inmon's model) or into the data marts (in Kimball's model). Data loaded into the data warehouse is quality data: erroneous data has been removed during extraction, and the data has been transformed to ensure its integrity.
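
To make the three steps concrete, the following minimal SQL sketch expresses them directly. All table and column names here (sales_oltp as the operational source, stg_sales as the staging area, fact_sales, dim_customer and dim_date in the warehouse) are hypothetical illustrations, not schemas taken from the systems reviewed in this chapter.

    -- Extract: copy an incremental window of raw rows from the
    -- operational source into the staging area.
    INSERT INTO stg_sales (order_id, customer_id, order_date, amount)
    SELECT order_id, customer_id, order_date, amount
    FROM sales_oltp
    WHERE order_date >= DATE '2018-01-01';

    -- Transform: cleanse in staging, e.g. remove rows that fail basic
    -- integrity checks before they can reach the data warehouse.
    DELETE FROM stg_sales
    WHERE customer_id IS NULL OR amount IS NULL;

    -- Load: move the validated rows into the warehouse fact table,
    -- resolving operational keys to warehouse dimension keys.
    INSERT INTO fact_sales (order_id, customer_key, date_key, amount)
    SELECT s.order_id, c.customer_key, d.date_key, s.amount
    FROM stg_sales s
    JOIN dim_customer c ON c.customer_id = s.customer_id
    JOIN dim_date d ON d.calendar_date = s.order_date;

In practice the transform step also covers field mapping and type conversion; the deletion above stands in for whatever cleansing rules an organization defines.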

Calvanese, D. et al. (2001) highlight that an enterprise data warehouse's database tables may be populated from a wide variety of data sources in different locations, often including data that provides information about a competitor's business. Collecting all the different data and storing it in one central location is an extremely challenging task that ETL makes possible. The ETL process, as depicted in Diagram 2.2, begins with data extraction from the operational databases, where data cleansing and scrubbing ensure that all data is validated. The data is then transformed to meet the data warehouse standards before it is loaded into the data warehouse.

Diagram 2.2 Extract, Transform, Load Process

G. Zhou et al. (1995), emphasising data integration in data warehousing, stress that ETL can assist in the import and export of operational data between heterogeneous data sources using an OLE-DB (Object Linking and Embedding, Database) based architecture, where the data is transformed so that only quality data populates the data warehouse. This is important to ensure that there are no restrictions on the size of the data warehouse with this approach.

Kimball's (2000) data warehouse architecture model, depicted in Diagram 2.3, focuses on three modules: "the back room", the "presentation server" and "the front room". In the back room, the data staging services gather all the source systems' operational databases and extract data from those source systems, across different file formats, systems and platforms. The second step runs the transformation process, removing all inconsistencies to ensure data integrity. Finally, the data is loaded into the data marts. The ETL processes are commonly executed from job control via scheduled tasks. The presentation server is the data warehouse where the data marts are stored and processed; data is stored in star schemas consisting of dimension and fact tables. From there the data is processed in the front room, where it is accessed by query services such as reporting tools, desktop tools, OLAP and data mining tools.

Diagram 2.3 Data Warehouse Architecture (adapted from Kimball, 2000)

Nicola, M. (2000) explains that the process of retrieving data from the warehouse can vary greatly depending on the desired results. There are many possible forms of retrieval from a data warehouse, and this flexibility drives how the retrieval process is implemented. Many tools exist for querying the data warehouse, from simple query and reporting through SQL statements to OLAP and data mining, where the structure includes many more third-party tools. There are also inherent problems associated with the data, including limited portability and the often vast amount of data that must be sifted through for each query.

Essentially, ETL is mandatory for a data warehouse to ensure data integrity. Many factors must be considered; complexity and scalability are the two major issues most enterprises face when integrating information from different sources into a clean, reliable source of data for mission-critical business decisions. One way to achieve a scalable, non-complex solution is to adopt a hub-and-spoke architecture for the ETL process. According to Evin (2001), ETL operates best in a hub-and-spoke architecture because of its flexibility and efficiency: its centralized data warehouse design supports maintaining full access control over the ETL processes, and it empowers the use of analytical and data mining tools by knowledge workers.

In this study of ETL for effective data warehouse architecture, hub-and-spoke emerges as best for data integration, as it shares similarities with both the Inmon and Kimball architectures. The hub is the data warehouse, after data has been processed from the operational database into the staging database, and the spokes are the data marts that distribute the data. Inmon and Kimball also recommend the same ETL processes to enable a hub-and-spoke architecture. Sherman, R. (2005) states that the hub-and-spoke approach uses one-to-many interfaces from the data warehouse to many data marts; one-to-many interfaces are simpler to implement, cost-effective in the long run and ensure consistent dimensions, whereas the many-to-many approach is more complicated and costly.

In this work on the new proposed model, the hub-and-spoke architecture is used as the "central repository service", as many scholars, including Inmon, Kimball, Evin, Sherman and Nicola, adopt this data warehouse architecture. This approach allows the hub (data warehouse) and spokes (data marts) to be located centrally or distributed across local or wide area networks, depending on business requirements. In designing the new proposed model, the hub-and-spoke architecture clearly identifies six important components that a data warehouse should have: ETL; a staging database or operational data store; data marts; a multidimensional database; OLAP; and data mining end-user applications such as data query, reporting, analysis and statistical tools. This process may differ from organization to organization: depending on the ETL setup, some data warehouses overwrite old data with new data, while others maintain a history and audit trail of all changes to the data. Diagram 2.4 depicts the concept of the new proposed model's data warehouse architecture.

Diagram 2.4 New Proposed Model Data Warehouse Architecture

2.2.3 DATA WAREHOUSE FAILURE AND SUCCESS FACTORS

Building a data warehouse is indeed challenging, as a data warehouse project has unique characteristics that may impact the overall setup if the analysis, design and implementation phases are not properly done. This research therefore discusses studies of failure and success factors in data warehouse projects. Section 2.2.3.1 investigates data warehouse project failures, and section 2.2.3.2 investigates the success factors for implementing the correct model to support a successful data warehouse project.

2.2.3.1 DATA WAREHOUSE FAILURE FACTORS

Hayen, R.L.'s (2007) studies show that implementing a data warehouse project is costly and risky, as a data warehouse project can cost over $1 million in the first year. It is estimated that one-half to two-thirds of data warehouse project attempts will eventually fail. Hayen, R.L. (2007), citing the work of Briggs (2002), noted three factors in the failure of data warehouse projects: environment, project and technical factors, as shown in Diagram 2.5; Table 2.4 discusses the factors in more detail.

Diagram 2.5 Factors for Data Warehouse Failures (adapted from Briggs, 2002)

Table 2.4 Factors for Data Warehouse Failures (adapted from Briggs, 2002)

Factors

Descriptions

Environment

Environmental factors include organizational changes in business, politics, mergers and takeovers, and a lack of top management support, as well as human error, corporate culture, decision making and change management.

Technical

Technical factors arise when a data warehouse project's complexity and workload are taken too lightly, with high expenses for hardware, software and people. Problems occur when a project manager is assigned who lacks knowledge of, and project experience in, data warehouse costing, which may impede quantifying the return on investment (ROI). Failures in managing a data warehouse project also include:

· The challenge of setting up a competent operational and development team, and of not having a data warehouse manager or expert who is politically sound.

· An extended timeframe for the development and delivery of the data warehouse system, which may be due to a lack of experience and knowledge in selecting data warehouse products and end-user tools.

· Failure to manage the scope of the data warehouse project.

Project

Project factors include poor knowledge of the requirements for data definitions and data quality across the organization's different business departments. Running a data warehouse project with incompetent staff and insufficient knowledge of which technology to use may lead to later problems with data integration, the data warehouse model and the data warehouse applications.

Vassiliadis's (2004) studies show that data warehouse projects are under enormous threat of failure from design, technical, procedural and socio-technical factors, as illustrated in Diagram 2.6. Understanding these failure factors is vital to averting any unwanted outcome. Each factor group is described in Table 2.5.

Diagram 2.6 Factors for Data Warehouse Failures (adapted from Vassiliadis, 2007)

Table 2.5 Factors for Data Warehouse Failures (adapted from Vassiliadis, 2007)

Factors

Descriptions

Design

A data warehouse project may have to put up with "no standard" techniques or design methodologies. During the analysis and design phases, a project may adopt ideas about metadata techniques or languages and data engineering techniques; proprietary solutions and recommendations from vendors or in-house experts may also define the design of the data warehouse blueprint.

Technical

Technical factors relate to a lack of hands-on experience in evaluating and choosing the hardware setup for data warehouse systems.

Procedural

Procedural factors concern imperfections in data warehouse deployment. This factor focuses on extensively training the end users in the new technology and the design of the data warehouse, which are completely different from conventional IT solutions. User communities play a vital and crucial role in this factor.

Socio-Technical

Socio-technical factors in a data warehouse project may cause problems by violating the organization's modus operandi: the data warehouse system leads to restructuring or reorganizing the way the organization operates, introducing changes to the user community.

According to Vassiliadis (2007), another potential factor in the failure of data warehouse projects is "data ownership and access". This is considered sensitive within the organization: one must not share or acquire someone else's data, as this is comparable to losing authority over data ownership and access. Likewise, any departmental declaration or request to take total ownership of pure, clean and error-free data should be restricted, as this may cause problems over data ownership rights.

Watson (2004) stresses that the general factors in data warehouse project failure comprise "weak sponsorship" and top management support, inadequate funding, inadequate user participation, and organizational politics.

2.2.3.2 DATA WAREHOUSE SUCCESS FACTORS

Data warehouse projects can end in disastrous implementations if careful measures are not taken, as discussed in section 2.2.3.1 based on the studies of Briggs (2002) and Vassiliadis (2004). According to Hwang, M.I. (2007), data warehouse implementation is an important area of research and industrial practice, but few studies have assessed the critical success factors of data warehouse implementations. There are certainly procedures for data warehouse design and implementation, but only certain guidelines have been subjected to experimental testing. It is therefore best to choose the proper data warehouse model for implementation success.

In this study, to identify and fill the gaps in the analysis of data warehouse success factors, a number of success factors are gathered from data warehouse scholars and professionals (Watson & Haley, 1997; Chen et al., 2000; Wixom & Watson, 2001; Watson et al., 2001; Hwang & Cappel, 2002; Shin, 2003), each of whom validated their experimental work and research individually on various characteristics of data warehouse success. This study is beneficial for planning and implementing data warehouse projects and guides the successful design and implementation of the new proposed model in this research.

There are several success factors in designing and implementing data warehouse solutions, and the most important depend on the data warehouse model selected, as different organizations may have different scopes and road maps for data warehouse development. The results of building a successful data warehouse are then used to quantify the factors involved and to prioritize those factors that benefit continued research to improve and enhance the data warehouse model's success.

According to Hayen, R.L. (2007), a data warehouse is a complex system that can complicate business procedures; its complexity can prevent companies from changing data or transactions as necessary. It is important, then, to analyze which data warehouse model to use for complex systems that are critical to an organization. Hwang, M.I. (2007) studied data warehousing models and success factors as a critical area of practice and research, but only a few studies have been conducted to measure data warehouse projects and success factors.

Many scholars have conducted profound research in the area of data warehousing and may have succeeded or failed for various reasons, depending on each scholar's research outcomes. It is useful to inspect a few case studies of selected companies' data warehouse implementations and to examine the failure and success factors through surveys (Winter, 2001; Watson et al., 2004).

Hwang, M.I. (2007) conducted a survey of six data warehouse studies (Watson & Haley, 1997; Chen et al., 2000; Wixom & Watson, 2001; Watson et al., 2001; Hwang & Cappel, 2002; Shin, 2003) on the success factors in data warehouse projects. Each study measures different success factors. As Table 2.6 shows, Watson & Haley (1997) measure data warehouse success factors; Chen et al. (2000), Watson et al. (2001) and Shin (2003) measure data warehouse implementation factors; and Hwang & Cappel (2002) measure development and management practices. Only Wixom & Watson (2001), as shown in Diagram 2.7, measure both data warehouse implementation and success factors, and their study can be used as a model for a successful data warehouse implementation. The review of all six studies shows that, without both implementation and success factors, the consequences of any factor for data warehouse success cannot be validated.

Table 2.6 Factors for Data Warehouse Success (adapted from Hwang M.I., 2007)

Study: Watson & Haley (1997)
Data warehouse success factors: Focus on user involvement and support through clear and understandable business needs; use of methodology and modelling methods targeting "clean data"; support from upper management.
Data warehouse implementation factors: N/A
Results reported: Ordered list of success measures.

Study: Chen et al. (2000)
Data warehouse success factors: N/A
Data warehouse implementation factors: Focus on the exactness and preciseness of user satisfaction, through support for and realization of end users' needs.
Results reported: Support for end users affects user satisfaction.

Study: Wixom & Watson (2001)
Data warehouse success factors: Data quality and system quality, contributing to perceived data warehouse success.
Data warehouse implementation factors: Management support, resources, user participation, team skills, source systems and development technology, contributing to implementation success from organizational, project and technical perspectives.
Results reported: Some success factors affect data warehouse success.

Study: Watson et al. (2001)
Data warehouse success factors: N/A
Data warehouse implementation factors: Having users, rather than developers, produce more and better requirements information about the system; a better decision-making approach that improves the business process; support for the accomplishment of strategic business objectives.
Results reported: Ordered list of success measures.

Study: Hwang & Cappel (2002)
Data warehouse success factors: N/A
Data warehouse implementation factors: N/A
Results reported: Development and management practices.

Study: Shin (2003)
Data warehouse success factors: N/A
Data warehouse implementation factors: System, information and service quality to ensure user satisfaction.
Results reported: System quality affects user satisfaction.

Diagram 2.7 Wixom Data Warehouse Research Model (adapted from Wixom, 2001)

In this study of data warehouse failure and success factors for developing the new proposed model, Hwang, M.I.'s (2007) research model, depicted in Diagram 2.8, is applied. Hwang, M.I. selected eight implementation factors that directly affect six selected success variables, combining the lists from the research studies described above.

Diagram 2.8 Hwang M. I. Data Warehouse Research Model (adapted from Hwang M.I. 2007)

The research model serves as an important guideline for implementing successful data warehouse projects. There are certainly many guidelines from other researchers, but this study shows that Hwang, M.I.'s (2007) research model can "facilitate research integration and variable selection for current use and future research. The model is general and new factors or success variables, when identified, can be added easily". This is the major advantage of using this model throughout the research on building effective data warehouse systems.

2.3 ONLINE ANALYTICAL PROCESSING

The OLAP Council (1997) defines OLAP as a group of decision support software technologies that enable business analysts, managers and management executives to gain fast, consistent and interactive access to information that has been reformulated, transformed and summarized from relational data files, mainly from the data warehouse, into multidimensional data views, allowing optimal data retrieval and trend analysis.

Essentially, OLAP creates a "hypercube" or "cube" of information, allowing more complex analysis than a traditional relational database. OLAP analysis techniques include "slice and dice" and "drill down", which allow analysts to segregate the data into many views of the information, depending on the analyst's choice.

According to Nigel, P. (2008), OLAP also implies "digging through tons of data" to uncover patterns and relationships contained within business activity and history. OLAP permits the analysis of database information from multiple database source systems at one time. Processing database information using OLAP requires an OLAP server to organize and transform the data and build the multidimensional database. The multidimensional database is then separated into cubes, on which client OLAP tools perform data analysis aimed at discovering new pattern relationships between the cubes. Popular OLAP server software vendors include Oracle, IBM and Microsoft.

According to Chaudhuri (1997), Burdick, D. (2006) and Vassiliadis, P. (1999), OLAP is rising into a powerful concept for strategic database analysis. OLAP has the ability to analyze large amounts of data and extract valuable information, and the analytical development can serve the business, education or medical sectors. The technologies of the data warehouse, OLAP and analysis tools support that ability.

Madeira (2003) supports the view that OLAP and data warehouses are complementary: a data warehouse stores and manages data, while OLAP transforms data warehouse data into strategic information. OLAP ranges from basic navigation and browsing (often known as "slice and dice") and calculations to more serious analyses such as time series and complex modelling. As decision makers exercise more advanced OLAP capabilities, they move from data access to information, and from information to knowledge.

2.3.1 OLTP vs. OLAP

The design of a data warehouse and its online analytical processing (OLAP) multidimensional cubes is vitally different from that of an online transaction processing (OLTP) database. The data warehouse is specifically designed to facilitate fast query times and multidimensional analysis, whereas OLTP systems are information systems that support daily processing in an organization. An OLTP system's main function is to capture information and get data into computers; OLTP allows effective data manipulation and storage of information for daily operations, resulting in huge quantities of transactional data. Relational databases are designed for OLTP systems, and OLTP maximizes efficiency when operating on a few records concurrently in predictable patterns. Organizations build multiple OLTP systems to handle different aspects of their daily operations, so huge quantities of transactional data can accumulate in a short period of time.

OLAP, or online analytical processing, is a class of software technologies designed for data access and analysis. OLAP is an enabling application technology and database model that grants managerial users strategic decision-making capabilities. The OLAP database model is known as multidimensional database technology. OLAP addresses the analysis problem by aggregating data into a multidimensional view for direct access by OLAP tools, without hindering system performance.

It is complicated to merge OLAP and OLTP into one centralized system. The dimensional data design model used in OLAP is much more effective for querying than the relational model used in OLTP systems. Moreover, OLAP may use one central database as its data source, while OLTP uses different data from several sites. The dimensional design of OLAP is not suitable for an OLTP system, mainly because of redundancy and the loss of referential integrity of the data. Organizations therefore choose to have two separate information systems: one OLTP system and one OLAP system.

According to Poe (1983), OLAP analysis is mainly done by evaluating and analysing patterns and trends in data, mainly in a centralized data warehouse database. In OLTP systems, by contrast, it is difficult to perform such pattern analysis, because the information is stored in diverse systems across several departments of the organization.

One might argue that the purpose of OLTP systems is to get data into computers, whereas the purpose of OLAP is to get data or information out of computers. According to Han (2001), OLTP systems are "customer oriented" and OLAP systems are "market oriented". Table 2.7 summarizes the major differences between OLTP and OLAP systems, adapted from Han (2001).

Table 2.7 Differences between OLTP and OLAP (adapted from Han, 2001)

Characteristic: OLTP = operational processing; OLAP = informational processing.

Orientation: OLTP = transaction; OLAP = analysis.

User: OLTP = clerk, database administrator (DBA), database professional; OLAP = knowledge worker (e.g. manager, executive, analyst).

Function: OLTP = day-to-day operations; OLAP = long-term informational requirements, decision support.

Database design: OLTP = ER-based, application oriented; OLAP = star/snowflake, subject oriented.

Data: OLTP = current, guaranteed up to date; OLAP = historical, accuracy maintained over time.

Summarization: OLTP = primitive, highly detailed; OLAP = summarized, consolidated.

View: OLTP = detailed, flat relational; OLAP = summarized, multidimensional.

Unit of work: OLTP = simple transaction; OLAP = complex query.

Access: OLTP = read/write; OLAP = read.

Focus: OLTP = data in; OLAP = information out.

Operations: OLTP = index/hash on primary key; OLAP = lots of scans.

Number of records accessed: OLTP = tens; OLAP = millions.

Number of users: OLTP = thousands; OLAP = hundreds.

Database size: OLTP = 100 MB to GB; OLAP = 100 GB to TB.

Priority: OLTP = high performance, high availability; OLAP = high flexibility, end-user autonomy.

Metric: OLTP = transaction throughput; OLAP = query throughput.
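
The contrast in Table 2.7 can be illustrated with two queries. This is only a sketch over hypothetical tables (an OLTP account table, and an OLAP star schema with fact_sales, dim_date and dim_product), not schemas from any system discussed here.

    -- OLTP: a short read/write transaction touching a handful of records.
    UPDATE account
    SET balance = balance - 100
    WHERE account_id = 42;

    -- OLAP: a read-only query that scans millions of fact rows and
    -- summarizes them along dimension attributes.
    SELECT d.year, p.category, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, p.category;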

2.3.2 TWELVE RULES OF OLAP

The term OLAP was coined and first used by E.F. Codd, the father of the relational database. In 1985, Codd wrote a paper defining rules for relational database management systems (RDBMS), which modernized the IT database industry. In a later paper (Codd, 1993), Codd not only introduced OLAP but also provided a set of criteria, known as the "Twelve Rules", for assessing an OLAP product's ability, covering data manipulation, unlimited dimensions and aggregation levels, and flexible reporting. OLAP is software and data processing that allows the consolidation and analysis of data in a multidimensional space. E.F. Codd's "Twelve Rules" for OLAP systems are shown in Table 2.8:

Table 2.8 E.F. Codd “Twelve Rules” for OLAP Systems (adapted from Codd, 1993)

Rules

Descriptions

Multi-dimensional Conceptual View

Data should be presented to the user in a multidimensional pattern, supporting pivoting, rotating and "slicing and dicing". The multidimensional representation model is oriented more towards data manipulation than towards merely querying data.

Transparency

The complexity of the OLAP tools and their process orchestration should be hidden from users. A typical user is only interested in, and uses, the graphical user interface of the tools, without complicating the existing productive OLAP system. The system should be transparent, so users access the OLAP database without knowing it.

Accessibility

The OLAP tool should perform all process orchestration, including mappings and translations, without requiring any intervention from users. Users interact only with the relevant data needed for analysis, in a consistent way.

Consistent Reporting Performance

The size of the data being analyzed should not impact reporting performance, regardless of the number of dimensions in use. Reliable reporting performance guarantees that users can use the system in any way, rather than working around problems in the system.

Client-Server Architecture

OLAP systems should be deployed in a client/server architecture, separating the clients, which use the OLAP tools, from the data manipulation in the OLAP server. It is essential that the OLAP server component is capable of combining the various sources effectively and of serving multiple attached clients.

Generic Dimensionality

All dimensions are equivalent and should be functionally and operationally the same. No bias or preference should be given to operations on any of the dimensions.

Dynamic Sparse Matrix Handling

The OLAP tool must adapt its physical schema to accommodate any data distribution; for example, null values should be stored in an efficient manner without causing problems for the user. The OLAP tool should accommodate spontaneously large and sparse data without any undesirable impact on performance.

Multi-User Support

OLAP tools should support a multi-user environment, with concurrent access allowing each user to view information in their own way.

Unrestricted Cross-Dimensional Operations

The OLAP tool should apply all rules, such as understanding hierarchies, arrays of relationships and dependencies, across all dimensions, and provide these features for users to navigate without requiring the user to perform any of these operations.

Intuitive Data Manipulation

OLAP tools should offer an "all-in-one" view of the data, where everything is present without additional options such as menus and command buttons that complicate the interface for users.

Flexible Reporting

Users should be able to present reports in any manner of their choosing.

Unlimited Dimensions and Aggregation Levels

OLAP should not impose any limit on the number of dimensions and levels defined in a model.

In this literature study, Codd's twelve rules of OLAP are applied when designing and implementing the new proposed model's OLAP system. Table 2.9 explains how the new proposed system satisfies Codd's twelve rules for building an OLAP system.

Table 2.9 New Proposed Model Fulfilled E.F. Codd Twelve Rules for OLAP Systems

Multi-dimensional Conceptual View: Yes. The system presents the data to users as multidimensional database cubes.

Transparency: Yes. Users are hidden from knowing that they are using an OLAP multidimensional database; cubes are simply offered to users as selections.

Accessibility: Yes. The OLAP system chooses the best sources of data for tools such as pivoting, writing MDX queries and drilling down into the data.

Consistent Reporting Performance: Yes. System performance is the same regardless of the number of dimensions in use.

Client-Server Architecture: Yes. The system is deployed in a client/server architecture, with separation between the OLAP server and the client tools.

Generic Dimensionality: Yes. Dimensions are treated equally when the system browses the cube data.

Dynamic Sparse Matrix Handling: Yes. The system handles all null values internally without affecting the users.

Multi-User Support: Yes. The system allows multiple users to access the defined cubes.

Unrestricted Cross-Dimensional Operations: Yes. The system allows data roll-up through the levels of a hierarchy.

Intuitive Data Manipulation: Yes. The users' views of the data contain everything required; no additional menus are needed.

Flexible Reporting: Yes. Users can present the data any way they prefer; four types of tools are implemented in this study.

Unlimited Dimensions and Aggregation Levels: Yes. There is no limit to the number of dimensions and levels in the system.

Apart from Codd's twelve rules, Berson, A. (2001) stresses that a powerful, high-quality OLAP system should also provide complete database management tools: an integrated, centralized utility permitting database administrators to distribute databases within the enterprise. The OLAP drilling mechanism within the multidimensional database should allow drilling down right to the source, the root detail record level; this implies that the OLAP tool permits a smooth changeover from the multidimensional database to the detail record level of the source relational database. OLAP systems must also support incremental database refreshes; this is an important feature, preventing stability and usability problems in operation as the size of the database increases.

2.3.3 OLAP FUNCTIONALITY

According to the OLAP Council (1997), OLAP functionality offers dynamic multidimensional analysis that supports end users in analytical activities, including calculations and modelling applied across dimensions, trend analysis over chronological time periods, slicing subsets for on-screen viewing, drill-down/drill-up to deeper levels of records (as depicted in Diagram 2.9), reach-through to underlying detail data, and rotation to new dimensional comparisons in the viewing area. OLAP is implemented in a multi-user client/server environment and provides reliably fast responses to queries, regardless of database size and complexity. OLAP helps the end user integrate enterprise information through comparative, customized viewing and analysis of historical and present data in various "what-if" data model scenarios. This is achieved through the use of an OLAP server, as depicted in Diagram 2.10.
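
In relational terms, drill-down simply re-aggregates the same measure at a finer level of a dimension hierarchy, and roll-up does the reverse. The sketch below assumes the same hypothetical fact_sales/dim_date star schema used in the earlier examples.

    -- Rolled up: total sales at the Year level of the time hierarchy.
    SELECT d.year, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.year;

    -- Drilled down: the same measure one level deeper (Year -> Quarter).
    SELECT d.year, d.quarter, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.year, d.quarter;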

OLAP functionality is provided by an OLAP server, which the OLAP Council (1997) defined as a "high-capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures". The OLAP server's design and data structures are optimized for fast information retrieval in any direction and for flexible calculation and transformation of raw data. The OLAP server may physically store the processed multidimensional information, delivering consistent, fast response times to end users; it may populate its data structures in real time from relational databases; or it may offer a choice of both.

Diagram 2.9 OLAP Drill Down/Drill up Concept

Diagram 2.10 OLAP Server (adapted from Berson, 2001)

2.3.4 OLAP ARCHITECTURE

According to Nigel, P. (2006), it is important to differentiate the capabilities of a data warehouse from those of an OLAP system. In contrast to a data warehouse, which is usually based on relational technology, OLAP uses a multidimensional view of aggregated data to provide rapid access to strategic information for further analysis. OLAP architectures are divided into three main types according to how they store multidimensional data and perform operations on it: multidimensional OLAP (MOLAP), relational OLAP (ROLAP) and hybrid OLAP (HOLAP), as described below (Berson, A., 2001; Burdick, D. et al., 2000; Vassiliadis, P., 1999; Pedersen, T.B., 2001):

• MOLAP

In MOLAP, as depicted in Diagram 2.11, data is stored and summarized in a multidimensional cube. MOLAP is the fastest of the three: the storage is not in a relational database but in proprietary formats holding the multidimensional view. This is the more conventional way of doing OLAP analysis compared to ROLAP.

Diagram 2.11 MOLAP Architecture (adapted from Microsoft, 1999)

MOLAP cubes have excellent performance because they are built and designed for rapid data retrieval, and they are most advantageous for slicing and dicing operations. MOLAP can perform complex calculations, since all calculations are pre-generated when the cube is created. However, MOLAP is restricted in the amount of data it can manage: the calculations are executed when the cube is built, so large amounts of data cannot be included in the cube itself. MOLAP cube technology is also proprietary and requires additional investment in human and capital resources.

• ROLAP

In ROLAP, as depicted in Diagram 2.12, data and aggregations are stored in relational database tables to provide the OLAP slicing and dicing functionality. ROLAP is the slowest of the OLAP flavours: it relies on manipulating the data directly in the relational database to give the appearance of conventional OLAP slicing and dicing. Fundamentally, each slicing and dicing action is equivalent to adding a "WHERE" clause to the SQL statement.

Diagram 2.12 ROLAP Architecture (adapted from Microsoft, 1999)

ROLAP can manage large amounts of data and has no limitation on data size, and it can leverage the intrinsic functionality of the relational database. However, ROLAP is slow, because each ROLAP operation is essentially one or more SQL queries against the relational database; the query time and the number of SQL statements executed grow with the complexity of the statements, and can become a bottleneck if the underlying data size is large. Because ROLAP depends essentially on generated SQL statements to query the relational database, and SQL does not cater to all needs, ROLAP technology is conventionally limited by what SQL functionality can offer.
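
To illustrate the point about the WHERE clause, the following hypothetical sketch (again over an assumed fact_sales star schema with store, product and date dimensions) shows a dice over two dimensions, with a slice expressed as nothing more than an extra predicate.

    -- Dice: aggregate over two dimensions (region and product category).
    -- Slice: restrict the time dimension to a single member (year 2017);
    -- in ROLAP this is simply one more predicate in the WHERE clause.
    SELECT s.region, p.category, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_store s ON s.store_key = f.store_key
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    WHERE d.year = 2017
    GROUP BY s.region, p.category;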

• HOLAP

HOLAP, as depicted in Diagram 2.13, combines the technologies of MOLAP and ROLAP: data is stored in ROLAP relational database tables, and the aggregations are stored in a MOLAP cube. HOLAP can "drill down" from the multidimensional cube into the underlying relational data. For summary-type information, HOLAP leverages cube technology for faster performance; when detailed information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

Diagram 2.13 HOLAP Architecture (adapted from Microsoft, 1999)

According to Pedersen, T.B. (2001), OLAP is a technology that allows users to examine business data and become familiar with the information it contains. The OLAP data model is multidimensional, so it is called a cube instead of a table. OLAP implies "digging through tons of data" to uncover patterns and relationships contained within business activity and history.

Some studies (Burdick et al., 2000; Vassiliadis, P., 1999; Harinarayan, 1996) have presented and discussed the importance of MOLAP specifications, which are based on multiple measures of normalization in the database tables. MOLAP partitions store the aggregations and a copy of the source data (fact and dimension data) in a multidimensional structure on the OLAP server, and all partitions are stored on the OLAP server.

Harinarayan (1996) highlights that MOLAP requires the data to be stored in a multidimensional format, involving the creation of multidimensional blocks called data cubes. A cube in the MOLAP architecture may have three axes (dimensions) or more, each representing a logical category of data: one axis may, for example, represent the geographic location of the data, while others may represent a period of time or a specific school. Each of the categories, which will be described in the following section, can be broken down into successive levels, and it is possible to drill up or down between the levels.

Cabibo (1997) states that MOLAP partitions are normally stored on an OLAP server, with the relational database frequently on a separate server; the OLAP server must then query across the network whenever it needs to access the relational tables to resolve a query, and the impact of doing so depends on the performance characteristics of the network. Even when the relational database is on the same server as the OLAP server, inter-process calls and the associated context switching are required to retrieve relational data. With a MOLAP partition, calls to the relational database, whether local or over the network, do not occur during querying.

2.3.5 MULTIDIMENSIONAL DATABASE SCHEMA

According to Kimball (1999), the basis of every data warehousing system is a relational database built using a dimensional model. A dimensional model consists of fact and dimension tables and is typically described as a star or snowflake schema. A schema is a collection of database objects: tables, views and indexes. To aid understanding of dimensional data modelling, Table 2.10 defines some of the terms commonly used in this type of modelling:

Table 2.10 Multidimensional Database Schema Terms

Terms

Descriptions

Fact Table

A fact table typically has two types of columns: those that contain numeric facts (often called measures), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated.

Dimension Table

A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimension values; they are normally descriptive, textual values. Dimension tables are generally small compared to the fact table.

Attribute

A unique level within a dimension. For example, Month is an attribute of the Time dimension.

Hierarchy

The specification of levels that represents the relationships between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

In designing data models for data warehouses, the most commonly used schema types are the star schema and the snowflake schema. In the star schema design, the fact table sits in the middle and is connected to the surrounding dimension tables like a star. A star schema can be simple or complex: a simple star has one fact table, while a complex star can have more than one.

Most data warehouses use a star schema to represent the multidimensional data model. The database consists of a single fact table and a single table for each dimension. Each tuple in the fact table consists of a pointer (foreign key) to each of the dimensions that provide its multidimensional coordinates, and stores the numeric measures for those coordinates. Each dimension table consists of columns that correspond to attributes of the dimension. Diagram 2.14 shows an example of a star schema for a medical informatics system.

Diagram 2.14: Star Schema
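
A minimal star schema along these lines could be declared as follows. The medical-flavoured table and column names are hypothetical, chosen only to echo the medical informatics example in Diagram 2.14.

    -- Dimension tables: denormalized, descriptive attributes.
    CREATE TABLE dim_date (
        date_key      INT PRIMARY KEY,
        calendar_date DATE,
        month         INT,
        quarter       INT,
        year          INT
    );

    CREATE TABLE dim_patient (
        patient_key INT PRIMARY KEY,
        name        VARCHAR(100),
        gender      CHAR(1),
        city        VARCHAR(50),
        state       VARCHAR(50)
    );

    -- Fact table: one foreign key per dimension plus numeric measures.
    CREATE TABLE fact_treatment (
        date_key    INT REFERENCES dim_date (date_key),
        patient_key INT REFERENCES dim_patient (patient_key),
        cost        DECIMAL(10, 2),
        duration    INT
    );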

Star schemas do not explicitly provide support for attribute hierarchies. The implementation of the applications in this research in the MOLAP environment requires many hierarchies in the dimension tables for efficient drilling of the data; the answer to this problem is to implement both star and snowflake schemas.

The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. The main advantage of the snowflake schema is improved query performance, due to minimized disk storage requirements and joins against smaller lookup tables. The main disadvantage is the additional maintenance effort required by the increased number of lookup tables.

Snowflake schemas refine star schemas by explicitly representing the dimensional hierarchy through normalization of the dimension tables, as shown in Diagram 2.15. This has advantages for maintaining the dimension tables; however, the denormalized structure of the dimension tables in a star schema may be more appropriate for browsing the dimensions.

Diagram 2.15: Snowflake Schema
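
Continuing the hypothetical medical example, snowflaking the patient dimension normalizes its geographic hierarchy (city and state) into separate lookup tables instead of flat columns in the dimension table.

    -- Normalized hierarchy tables replacing the flat city/state columns.
    CREATE TABLE dim_state (
        state_key INT PRIMARY KEY,
        state     VARCHAR(50)
    );

    CREATE TABLE dim_city (
        city_key  INT PRIMARY KEY,
        city      VARCHAR(50),
        state_key INT REFERENCES dim_state (state_key)
    );

    -- Snowflaked patient dimension: it references the hierarchy rather
    -- than repeating city and state values in every row.
    CREATE TABLE dim_patient_sf (
        patient_key INT PRIMARY KEY,
        name        VARCHAR(100),
        gender      CHAR(1),
        city_key    INT REFERENCES dim_city (city_key)
    );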

The multidimensional data model is implemented directly by MOLAP servers. When a ROLAP server is used, the multidimensional model and its operations have to be mapped into relations and SQL queries; this drives the design of database schemas that reflect the multidimensional views of the data. Entity-relationship diagrams and normalization techniques, by contrast, are popularly used for database design in OLTP environments.

Levene, M. (2003) stresses that, in addition to the fact and dimension tables, data warehouses store selected summary tables containing pre-aggregated data. In the simplest cases, the pre-aggregated data corresponds to aggregating the fact table on one or more selected dimensions; such pre-aggregated summary data can be represented in the database in at least two ways. Whether to use a star or a snowflake schema depends mainly on business needs. In this research, snowflakes are preferred: when information is analyzed at a very detailed level, the snowflake schema can be used to drill out the information.
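
One of the two representations mentioned, a separate summary table, can be sketched as a pre-aggregation of the hypothetical fact table on the year level. CREATE TABLE ... AS is used here for brevity; the exact syntax varies by DBMS (SQL Server, for instance, uses SELECT ... INTO).

    -- Pre-aggregate the fact table on one dimension (year) and store the
    -- result, so recurring yearly queries can read this small summary
    -- table instead of scanning the detail-level facts.
    CREATE TABLE summary_sales_by_year AS
    SELECT d.year, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.year;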

2.3.6 SUMMARY

In this work, MOLAP is used to create the multidimensional databases, as the MOLAP architecture can store data, create multidimensional mapping files and calculate aggregations faster than a relational engine can create indexes and calculate aggregations. The primary reason the relational engine is slower is the large table joins it must perform when processing a ROLAP partition. In addition, because the relational engine performs the actual processing tasks, competing demands for resources on the computer hosting the relational tables can negatively affect processing performance for a ROLAP partition.

There is no significant difference in processing performance between MOLAP partitions and HOLAP partitions: in both cases, all fact-level data is read from the relational database and the aggregations are calculated. MOLAP then writes the fact-level data into the MOLAP structure, whereas HOLAP does not store fact-level data. This difference has minimal impact on processing performance but can have a significant impact on query performance: because HOLAP and MOLAP processing speeds are approximately the same and MOLAP query performance is superior, MOLAP is the optimum storage choice. The experiment on OLAP performance assessment in chapter 4.3 best describes the OLAP architecture.

If a real-time ROLAP partition is deployed, the ROLAP partition should be kept as small as possible, with MOLAP used for all other partitions in the cube; this configuration maximizes query responsiveness and minimizes processing time. Given shrinking hardware and processing costs, MOLAP is generally used most often. HOLAP is a better solution when accessing a stand-alone database, and ROLAP is more convenient to set up when query demands are relatively low, also on a stand-alone database.

2.4 DATA MINING

According to Fayyad (1998), data mining is about discovering hidden patterns in historical datasets and applying those patterns for prediction. Data mining involves searching for specific, unknown information within databases, and it is particularly proficient at discovering hidden trends in data by using pattern recognition in addition to statistical and mathematical techniques.
