Overview Of Data Mining


02 Nov 2017

Disclaimer:
This essay was written and submitted by a student. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of EssayCompany.

Data mining is the process of extracting knowledge from databases. It comprises numerous techniques for extracting useful information from large files without any preconceived notions about what might be discovered. The extracted information consists of patterns and relationships that were previously unknown. The data mining process is also called "Knowledge Discovery in Databases".

Data mining is an iterative and interactive technique. Business expertise is used jointly with new technologies to discover features and relationships in the data; with experience and domain knowledge, seemingly useless information can be transformed into valuable information. Data mining techniques are also referred to as "modelling" or "machine learning". As an application in the business community, data mining is supported by three technologies, namely

Data mining algorithms

Multi processor computers

Enormous data collection

Business transactions: The business world generates many transactions, which are often "memorized" for perpetuity. These transactions involve time-related data from intra- and inter-business operations such as assets, banking, purchases, stocks and exchanges. Many prominent department stores depend on the wide use of barcodes to store billions of transactions, representing terabytes of data. Storage space no longer poses a big problem, given the low price of hard disks; the real problem is how to use the data effectively within a reasonable amount of time.

Scientific data: Data repositories hold large amounts of scientific data that need to be analysed. These data are collected from sources such as iceberg surveys (containing details about ocean behaviour), nuclear laboratories counting particles, and university studies investigating human psychology. Unfortunately, it is easier to capture and accumulate new data quickly than to analyse the old data already accumulated.

Medical and personal data: Large amounts of information are gathered about groups and individuals, for example through government censuses. Many organizations, companies and governments store such personal data to manage human resources or to make better use of business markets. When correlated with other data, this information can characterize customers and their behaviour, but it raises privacy concerns.

Surveillance video and pictures: Video cameras are ubiquitous because of their low price. Traditionally, video tapes were recycled and their content lost; the new trend is to store the recordings, digitized if need be, for future analysis.

Satellite sensing: There is an enormous number of satellites around the globe, many of them controlled by NASA; some are geostationary and some orbit the earth. Every satellite sends a continuous stream of data to the surface, where it is received by NASA ground stations. Much of the data and imagery received by NASA is made public, so that it can also be analysed by other researchers.

Games: Sports and gaming databases contain large amounts of data about players, athletes and games: volleyball serves, cricket scores, car-racing laps and football goals. Journalists use this information for reporting, while athletes and coaches would like to explore it to learn more about opponents and their performance.

Digital media: Digital media has exploded due to the growth of scanners, video cameras and digital cameras. Television and film studios now digitize their audio and video to improve the management of their multimedia assets, and associations such as the NBA and NHL are converting their huge game collections into digitized form.

CAD and software engineering data: There are a multitude of Computer-Assisted Design (CAD) systems for architects to design buildings and for engineers to conceive system components or circuits. Software engineering is a source of similar data, with code, libraries, objects, etc., and requires ever more tools for management and maintenance.

Virtual worlds: Many applications make use of three-dimensional virtual spaces consisting of objects and places, with the objects described in the Virtual Reality Modeling Language (VRML). Managing virtual-world repositories and providing content-based search and retrieval become important issues as the size of these virtual spaces grows.

Text reports and memos (e-mail messages): Communication within and between organizations takes place through reports and memos in textual form, frequently exchanged by e-mail. These texts are stored in digital form, making it possible to build digital libraries from them in the future.

The World Wide Web repositories: Documents of various formats, descriptions and content are collected and interconnected with hyperlinks, making the World Wide Web the largest repository of all. Despite its dynamic nature and unique characteristics, the Web is an important data collection used for reference, and in the future it may become the compilation of human knowledge.

HOW TO DISCOVER THE KNOWLEDGE:

The Knowledge Discovery in Databases process consists of a few steps leading from raw data collections to new knowledge. This iterative process consists of the following steps:

Data cleaning: Also known as data cleansing, this step removes noise and irrelevant data from the databases.

Data integration: Multiple heterogeneous data sources are combined into a single database source.

Data selection: Data selection retrieves the data relevant to the analysis from the data collection.

Data transformation: Also known as data consolidation, this step converts the selected data into standard forms appropriate for mining.

Data mining: Data mining is used to extract patterns from the database.

Pattern evaluation: Patterns representing knowledge are identified based on interestingness measures.

Knowledge representation: The phase in which the discovered knowledge is effectively presented to the user in the form of results.
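The steps above can be sketched as a tiny pipeline. Everything in this sketch is hypothetical: the records, field names and the join key are invented for illustration, and each small function stands in for a far richer real-world step.

```python
# A minimal, hypothetical sketch of the KDD steps on toy records.

raw_sales = [
    {"id": 1, "amount": "100", "region": "north"},
    {"id": 2, "amount": None, "region": "south"},   # noisy record
    {"id": 3, "amount": "250", "region": "north"},
]
raw_crm = [{"id": 1, "segment": "retail"}, {"id": 3, "segment": "wholesale"}]

def clean(records):
    # Data cleaning: drop records with missing values.
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(sales, crm):
    # Data integration: join two heterogeneous sources on a shared key.
    by_id = {r["id"]: r for r in crm}
    return [{**s, **by_id.get(s["id"], {})} for s in sales]

def select(records, region):
    # Data selection: keep only records relevant to the analysis.
    return [r for r in records if r["region"] == region]

def transform(records):
    # Data transformation: convert fields to a standard numeric form.
    return [{**r, "amount": float(r["amount"])} for r in records]

prepared = transform(select(integrate(clean(raw_sales), raw_crm), "north"))
```

After the pipeline runs, `prepared` holds two clean, joined, numeric records ready for the mining step proper.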

 

Phases in Data mining:

Data mining consists of six phases. They include Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.

1) Business Understanding: Business understanding is the first and most important phase of data mining. It identifies the objectives and success criteria of the business and performs a situational assessment covering resources, assumptions, benefits, risks and costs. It also determines the goals of the data mining effort in order to produce an effective project plan.

2) Data Understanding:

The data understanding phase analyses the characteristics of the data resources.

It collects initial data, describes and explores the data, and verifies data quality.

3) Data Preparation:

Data preparation includes selection, cleansing, construction, formatting and integration. It is a time-consuming process and includes:

1) Extracting data from a data mart or data warehouse

2) Linking tables within the database

3) Combining data files from heterogeneous systems and reassigning field values

4) Identifying incorrect or missing data values and performing data selection

5) Restructuring data into a standard form for the required analysis
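Steps 3 and 4 above can be illustrated with a small sketch; the two source systems, their status codes and the "unknown" fallback are all invented for the example.

```python
# Hypothetical sketch of data preparation: combining records from two
# heterogeneous systems, reassigning field values to a common coding,
# and repairing missing values.

system_a = [{"cust": "A1", "status": "Y"}, {"cust": "A2", "status": "N"}]
system_b = [{"cust": "B7", "status": "active"}, {"cust": "B9", "status": None}]

STATUS_MAP = {"Y": "active", "N": "inactive"}  # reassign field values

def standardise(record):
    status = STATUS_MAP.get(record["status"], record["status"])
    if status is None:            # identify and repair a missing value
        status = "unknown"
    return {"cust": record["cust"], "status": status}

combined = [standardise(r) for r in system_a + system_b]
```

The result is a single list of records with one consistent status vocabulary, which is the standard form the later analysis requires.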

4) Modeling:

Modeling applies various analytical methods to extract useful information from the database. It starts by selecting a suitable modelling technique and generating test designs for building and assessing models.

5) Evaluation:

Once a model has been chosen, it must be evaluated to check whether it achieves the business objectives. The aim of this phase is to determine whether all the business issues have been addressed; at its end, a decision can be made based on the data mining results.

6) Deployment:

Depending on the requirements, the deployment phase can be simple or complex. The purpose of the model is to gain knowledge, and that knowledge must be organized and presented in a way that supports decision making. In this phase the business understanding can also be revisited, and the rest of the process repeated with a better-defined target.

Data mining Techniques:

Data mining techniques are of four types. They are

Prediction

Classification

Clustering Analysis

Association Rules Discovery

Prediction: Prediction is derived from traditional forecasting and predicts the value of a variable. It involves traditional statistical methods such as discriminant analysis and regression analysis, as well as non-traditional methods from machine learning and artificial intelligence.
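As a minimal sketch of regression-based prediction, simple linear regression (ordinary least squares on one variable) can be written in a few lines; the numbers below are invented for illustration.

```python
# Simple linear regression: fit y = intercept + slope * x by least squares.

xs = [1.0, 2.0, 3.0, 4.0]   # e.g. advertising spend (hypothetical)
ys = [2.1, 3.9, 6.1, 8.0]   # e.g. observed sales (hypothetical)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    # Predicted value of the target variable for a new observation x.
    return intercept + slope * x
```

Real prediction tools wrap far more machinery (multiple variables, diagnostics, regularisation), but the idea of fitting a model to known values and applying it to new ones is the same.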

Classification Tools: Classification methods attempt to distinguish distinct classes of objects or actions. Classification is most widely used with credit card transactions, classifying each one as legitimate or suspect in order to save the credit card company money.
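A toy version of such a classifier is sketched below. The rules and the threshold are invented for the example; a real system would learn its decision boundaries from labelled historical transactions rather than hard-coding them.

```python
# Hypothetical rule-based classifier for card transactions.

def classify(txn):
    # Flag transactions that are unusually large or made outside the
    # cardholder's home country (both thresholds are made up).
    if txn["amount"] > 5000 or txn["country"] != txn["home_country"]:
        return "suspect"
    return "legitimate"

txns = [
    {"amount": 42.0,   "country": "UK", "home_country": "UK"},
    {"amount": 9000.0, "country": "UK", "home_country": "UK"},
    {"amount": 60.0,   "country": "BR", "home_country": "UK"},
]
labels = [classify(t) for t in txns]   # ["legitimate", "suspect", "suspect"]
```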

Clustering Analysis Tools: Records that naturally fall under one category are clustered together; the groups are discovered by the program rather than defined by researchers, and the discovered clusters are then used for business decisions. Clustering tools are widely used in "market segmentation".

Association Rules Discovery: Data mining tools are used to discover associations: for example, which products certain groups of people buy, or which movies certain groups of people watch. Businesses can use this information to target their markets, and it has been used quite effectively by Netflix and Amazon.
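Both ideas can be illustrated on toy data. In the sketch below the spending figures, basket contents and fixed initial centroids are all made up; real tools would use many features, better initialisation and proper support/confidence measures.

```python
from collections import Counter
from itertools import combinations

# Clustering sketch: one-dimensional k-means on hypothetical annual spend,
# with fixed initial centroids so the run is deterministic.
spend = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids = [0.0, 5.0]
for _ in range(10):                       # a few refinement passes
    clusters = {0: [], 1: []}
    for x in spend:
        nearest = min((0, 1), key=lambda c: abs(x - centroids[c]))
        clusters[nearest].append(x)
    centroids = [sum(v) / len(v) if v else centroids[i]
                 for i, v in clusters.items()]
# The two discovered segments centre near 1.0 and 9.5.

# Association sketch: count how often pairs of products co-occur in
# (hypothetical) market baskets.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1
# ("bread", "milk") co-occurs in 2 of the 3 baskets.
```

The key point in both cases is that the structure (the two spending segments, the frequently co-bought pair) is found by the program, not supplied by the analyst.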

Techniques in Data mining:

Artificial neural networks: Linear and non-linear predictive models that learn through training (for example by back-propagation) and resemble biological neural networks in structure.

Decision trees: A decision tree is a flow-chart-like structure in which each branch represents a decision; the decisions along a path from root to leaf generate rules for classifying a data set. Well-known decision tree methods include Chi-squared Automatic Interaction Detection (CHAID) and Classification and Regression Trees (CART).
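A hand-built toy tree makes the structure concrete. The attributes and thresholds below are invented; methods such as CART and CHAID induce trees like this automatically by repeatedly choosing the most informative attribute to split on.

```python
# A hypothetical hand-built decision tree for loan screening.

def decide(applicant):
    # Each if-statement is an internal node; a root-to-leaf path yields a
    # rule, e.g. "IF income <= 30000 AND not homeowner THEN decline".
    if applicant["income"] > 30000:
        return "approve"
    if applicant["homeowner"]:
        return "approve"
    return "decline"
```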

Genetic algorithms: A genetic algorithm is an optimization technique that uses processes based on evolution, such as reproduction, crossover, mutation and selection.
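The evolutionary loop can be sketched on the classic "OneMax" toy problem (maximise the number of ones in a bitstring); the population size, mutation rate and generation count below are arbitrary choices for the example.

```python
import random

# Toy genetic algorithm for OneMax, showing selection, crossover, mutation.
random.seed(0)                                 # deterministic demo run
LENGTH, POP, GENS = 16, 20, 40

def fitness(bits):
    return sum(bits)

def crossover(a, b):
    cut = random.randrange(1, LENGTH)          # single-point crossover
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.05):
    return [1 - b if random.random() < rate else b for b in bits]

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]                   # selection: keep the fittest
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP - len(parents))]   # reproduction
    pop = parents + children

best = max(pop, key=fitness)
```

Because the fittest half survives unchanged each generation, the best fitness never decreases, and over a few dozen generations it climbs close to the maximum of 16.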

Nearest neighbor method: This method classifies each record in a data set based on a combination of the classes of the k records most similar to it. It is also called the k-nearest-neighbour technique.
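A minimal k-nearest-neighbour classifier fits in a few lines; the two-dimensional training points and their "a"/"b" labels are invented for the example.

```python
from collections import Counter

# Minimal k-nearest-neighbour classifier on hypothetical 2-D points.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]

def knn(point, k=3):
    # Squared Euclidean distance is enough for ranking neighbours.
    dist = lambda p: (p[0] - point[0]) ** 2 + (p[1] - point[1]) ** 2
    nearest = sorted(train, key=lambda item: dist(item[0]))[:k]
    # Majority vote among the k nearest labels.
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```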

Rule induction: The extraction of useful if-then rules from data, based on statistical significance.
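A core ingredient of rule induction is measuring how well a candidate if-then rule holds over the data. The toy table and the rule below are invented; real systems search over many candidate rules and also weigh support and significance.

```python
# Sketch: measure the confidence of a candidate if-then rule.

records = [
    {"age": "young", "buys": True},
    {"age": "young", "buys": True},
    {"age": "young", "buys": False},
    {"age": "old",   "buys": False},
]

def confidence(records, antecedent, consequent):
    # Fraction of records matching the IF-part that also satisfy the THEN-part.
    covered = [r for r in records if antecedent(r)]
    if not covered:
        return 0.0
    return sum(consequent(r) for r in covered) / len(covered)

# Candidate rule: IF age = young THEN buys = True
conf = confidence(records,
                  lambda r: r["age"] == "young",
                  lambda r: r["buys"])
```

Here the rule covers three records and holds for two of them, giving a confidence of 2/3; an induction system would keep or discard the rule based on such measures.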

Issues in Data mining:

Data mining algorithms embody techniques that have existed for years, but they have only recently been implemented as scalable, reliable tools that outperform older statistical methods. Nowadays data mining techniques are ubiquitous, yet many issues have to be solved before data mining develops into a trusted and mature discipline.

Security and social issues: Security is an important issue whenever data collections are shared and used for decision making. Large amounts of confidential and private information about individuals and organizations are collected and stored, for example for understanding user behaviour or profiling customers; this becomes controversial when the data are accessed illegally. Data mining can also reveal new knowledge about groups or individuals that conflicts with privacy protections, particularly when the discovered information is disseminated.

User interface issues: The knowledge discovered by data mining tools should be expressed in high-level languages so that it can be used directly by humans. Good visualization techniques ease the rendering of data mining results and help users understand their needs, and several data analysis tasks exist to provide the best visual representation. The main user interface issues are interaction and information rendering: interacting with data mining results is difficult, so the system must support knowledge expression techniques such as tables, rules, trees, charts, graphs, cross-tabs, curves and matrices.

Mining methodology issues: Every user is interested in mining different kinds of knowledge from databases, so data mining covers a wide range of analysis and knowledge discovery tasks, including data characterization, discrimination, association, correlation analysis, classification, prediction, clustering, outlier analysis and evolution analysis. These tasks use the same database in distinct ways and require different data mining techniques.

Data stored in a database may contain noise, irrelevant objects or exceptional cases. During mining, such data can confuse the process and cause the constructed model to overfit, so that the discovered patterns have poor accuracy. To handle noise, exceptional cases and incomplete information, methods such as data cleansing, data analysis and outlier mining are adopted.

Beyond the size of the data stored in the database, the size of the search space is even more critical for data mining methods. It usually depends on the number of dimensions in the domain space: as the number of dimensions increases, the search space grows explosively. This is called the "curse of dimensionality", and the "curse" has itself become one of the major problems to solve.

Performance issues: Artificial intelligence and statistical methods already existed for data analysis and interpretation, but these techniques were not designed for very large data sets, raising the issues of efficiency and scalability of data mining methodologies. These problems are addressed by parallel programming and incremental updating. Parallel programming handles the size problem by subdividing it into smaller pieces that are processed in parallel, with the results merged afterwards. Incremental updating absorbs new data as it arrives, reducing the need to re-mine the database from scratch whenever it changes.
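Incremental updating is easy to illustrate with a summary statistic: a running mean can absorb each new value in constant time instead of rescanning the whole database. The class name and the values fed to it below are purely illustrative.

```python
# Sketch of incremental updating: maintain a running mean so new data can
# be incorporated without recomputing over the entire database.

class RunningMean:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incorporate one new value in O(1) using the standard
        # incremental-mean recurrence.
        self.n += 1
        self.mean += (x - self.mean) / self.n

m = RunningMean()
for x in [2.0, 4.0, 6.0]:
    m.update(x)
# m.mean is now 4.0, without ever holding all the data at once.
```

The same idea extends to counts, variances and many model statistics, which is what makes incremental mining of growing databases practical.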

 

Data source issues: Further issues relate to the data sources themselves, including the diversity of data types and the data flood: there is more data than can be handled, and it is collected at an ever higher rate. The instinct is to gather as much data as possible for processing, but the real questions are whether the user is collecting the right data and whether the important data can be distinguished from the insignificant.


