Text Mining Algorithms Data Mining Computer Science Essay

02 Nov 2017

Lokesh Kumar

Student

Department of IT

ASET, Amity University

Parul K. Bhatia

Asst. Prof.

Department of IT

ASET, Amity University

ABSTRACT

With the advancement of technology, more and more data is available in digital form, and most of it (approx. 85%) is unstructured text. It has therefore become essential to develop better techniques and algorithms to extract useful and interesting information from this large amount of textual data. As a result, text mining, text analytics and information extraction have become popular areas of research in recent years. In this paper, different existing text mining algorithms are briefly reviewed, stating the merits and demerits of each.

General Terms

Text Mining Algorithms, Data Mining.

Keywords

Text Mining Algorithms, Data Mining, Information Retrieval, Information Extraction, Classification Algorithm, Association Algorithms.

INTRODUCTION

Text mining is defined as "the non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data" [1]. Text mining is a burgeoning new field that attempts to glean meaningful information from natural language text. It may be loosely characterized as the process of analyzing text to extract information that is useful for particular purposes. Compared with the kind of data stored in databases, text is unstructured, amorphous, and difficult to deal with algorithmically. Nevertheless, in modern culture, text is the most common vehicle for the formal exchange of information. The field of text mining usually deals with texts whose function is the communication of factual information or opinions, and the motivation for trying to extract information from such text automatically is compelling, even if success is only partial.

Text mining is similar to data mining, except that data mining tools [2] are designed to handle structured data from databases, whereas text mining can work with unstructured or semi-structured data sets such as emails, full-text documents and HTML files. As a result, text mining is the better fit whenever the information of interest is held in free text rather than in database fields.

Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.

Fig 1: Process of Text Mining

The phrase "text mining" is generally used to denote any system that analyzes large quantities of natural language text and detects lexical or linguistic usage patterns in an attempt to extract probably useful (although only probably correct) information.

AREAS OF TEXT MINING

Text analysis involves information retrieval, information extraction, and data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis through the application of natural language processing (NLP) methods.

Fig 2: Text mining areas

Information Retrieval (IR)

Information retrieval might be regarded as an extension to document retrieval in which the documents that are returned are processed to condense or extract the particular information sought by the user. Thus document retrieval could be followed by a text summarization stage that focuses on the query posed by the user, or by an information extraction stage. IR systems allow us to narrow down the set of documents that are relevant to a particular problem.

As text mining involves applying computationally intensive algorithms to large document collections, IR can speed up the analysis considerably by reducing the number of documents to be analyzed.
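The narrowing step described above can be sketched as a simple term-overlap filter. This is only a minimal illustration, not a full IR system: the function name `retrieve`, the `min_overlap` parameter and the sample documents are assumptions made for the example.

```python
import re


def retrieve(documents, query, min_overlap=1):
    """Return indices of documents sharing at least min_overlap terms with the query."""
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    hits = []
    for index, text in enumerate(documents):
        doc_terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        # Keep the document only if enough query terms occur in it.
        if len(query_terms & doc_terms) >= min_overlap:
            hits.append(index)
    return hits


# Hypothetical three-document collection.
documents = [
    "Text mining extracts patterns from unstructured text.",
    "The football match ended in a draw.",
    "Data mining tools scour databases for hidden patterns.",
]
relevant = retrieve(documents, "mining patterns", min_overlap=2)
print(relevant)
```

Only the documents returned by `retrieve` would then be passed to the more expensive mining stages.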

Data Mining (DM)

Data mining can be loosely described as looking for patterns in data. It can be more fully characterized as the extraction of implicit, previously unknown, and potentially useful information from data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Natural Language Processing (NLP)

NLP is one of the oldest and most difficult problems in the field of artificial intelligence. It is the analysis of human language so that computers can understand natural languages as humans do.

NLP research pursues the elusive question of how we understand the meaning of a sentence or a document. What are the clues we use to understand who did what to whom, or when something happened, or what is fact and what is supposition or prediction? While words (nouns, verbs, adjectives and adverbs) are the building blocks of meaning, it is their relationship to each other within the structure of a sentence, within a document, and within the context of what we already know about the world, that conveys the true meaning of a text.

The role of NLP in text mining is to provide the systems in the information extraction phase (see below) with linguistic data that they need to perform their task.

Information Extraction (IE)

Information Extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio and video, can also be seen as information extraction.

Information extraction involves defining the general form of the information that we are interested in as one or more templates, which are then used to guide the extraction process. IE systems rely heavily on the data generated by NLP systems.
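As a rough illustration of template-guided extraction, the sketch below encodes one hypothetical template ("PERSON joined ORGANIZATION as ROLE") as a regular expression with named slots. The template, the pattern `APPOINTMENT` and the sample sentence are assumptions for the example; a real IE system would fill such slots from NLP output rather than from surface patterns alone.

```python
import re

# Hypothetical template: PERSON joined ORGANIZATION as ROLE.
# Each named group is one slot of the template.
APPOINTMENT = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) joined "
    r"(?P<organization>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*) as "
    r"(?P<role>[a-z]+(?: [a-z]+)*)"
)


def extract_appointments(text):
    """Return one filled template (a dict of slots) per match in the text."""
    return [match.groupdict() for match in APPOINTMENT.finditer(text)]


text = "Asha Rao joined Acme Labs as chief analyst. Ravi Sen joined Delta Corp as engineer."
records = extract_appointments(text)
for record in records:
    print(record)
```

Each dictionary is a structured record that could be inserted into a database table with the columns person, organization and role.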

WHAT IS TEXT MINING

THE CONCEPT

Text mining, using manual techniques, was first used during the 1980s. It quickly became apparent that these manual techniques were labor intensive and therefore expensive. Manually processing the already growing quantity of information also took too much time. Over time there was increasing success in creating programs to process the information automatically, and in the last 10 years there has been much progress.

Currently the study of text mining concerns the development of various mathematical, statistical, linguistic and pattern-recognition techniques that allow automatic analysis of unstructured information, the extraction of high-quality and relevant data, and better searchability of the text as a whole.

A text document contains characters that together form words, which can be combined to form phrases. These are all syntactic properties that together represent defined categories, concepts, senses or meanings. Text mining must recognize, extract and use all this information.

Using text mining, instead of searching for words, we can search for linguistic word patterns, and this is therefore searching at a higher level.

PROCESS

Text mining involves a series of activities to be performed in order to efficiently mine the information. These activities are:

Text Pre-processing

It involves a series of steps:

Text Cleanup

Text cleanup means removing any unnecessary or unwanted information, for example removing ads from web pages, normalizing text converted from binary formats, and dealing with tables, figures and formulas.
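One small part of text cleanup, stripping markup from a web page, can be sketched with Python's standard `html.parser` module. The class `_TextExtractor`, the function `clean_html` and the sample page are assumptions for illustration; ad removal and the handling of tables or formulas need heuristics that are outside this sketch.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(markup):
    """Return the visible text of an HTML fragment with whitespace normalized."""
    extractor = _TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return re.sub(r"\s+", " ", " ".join(extractor.parts)).strip()


page = "<html><style>p {color: red}</style><p>Text &amp; data   mining</p><script>track()</script></html>"
cleaned = clean_html(page)
print(cleaned)
```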

Fig 3: Activities / Process of Text Mining

Tokenization

Tokenizing in its simplest form is achieved by splitting the text at white spaces and at punctuation marks that do not belong to abbreviations identified in the preceding step.
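The simplest form described above can be sketched as follows. The abbreviation list `ABBREVIATIONS` is a hypothetical stand-in for the abbreviations identified in the preceding step, and the function name `tokenize` is an assumption.

```python
import re

# Hypothetical abbreviation list; in practice it would come from the preceding step.
ABBREVIATIONS = {"e.g.", "i.e.", "dr.", "prof.", "etc."}


def tokenize(text):
    """Split at white space, then split punctuation off every non-abbreviation."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            # Keep the abbreviation, including its periods, as one token.
            tokens.append(chunk)
            continue
        # Runs of word characters become tokens; each punctuation mark is its own token.
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


tokens = tokenize("Dr. Rao mines text, e.g. emails.")
print(tokens)
```

A limitation of this sketch is that an abbreviation directly followed by another punctuation mark (such as "e.g.,") is not recognized and is split like an ordinary word.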

Part of Speech Tagging

Part-of-Speech (POS) tagging means assigning a word class to each token. Its input is the tokenized text. Taggers have to cope with unknown words (the out-of-vocabulary, or OOV, problem) and ambiguous word-tag mappings. Rule-based approaches like ENGTWOL [4] operate on a) dictionaries containing word forms together with the associated POS labels and morphological and syntactic features and b) context-sensitive rules to choose the appropriate labels during application.
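The two ingredients of a rule-based tagger, a dictionary of candidate labels and context-sensitive rules, can be illustrated with a toy sketch. This is not ENGTWOL: the lexicon `LEXICON`, the tag names and the single context rule are all assumptions chosen to keep the example short.

```python
# Hypothetical dictionary: word form -> candidate POS labels, most likely first.
LEXICON = {
    "the": ["DET"],
    "can": ["AUX", "NOUN"],
    "rusts": ["VERB"],
}


def tag(tokens):
    """Assign one POS label per token using the lexicon and one context rule."""
    tagged = []
    for token in tokens:
        # Unknown (OOV) words fall back to NOUN in this sketch.
        candidates = LEXICON.get(token.lower(), ["NOUN"])
        previous_tag = tagged[-1][1] if tagged else None
        choice = candidates[0]
        # Context rule: directly after a determiner, prefer the noun reading.
        if previous_tag == "DET" and "NOUN" in candidates:
            choice = "NOUN"
        tagged.append((token, choice))
    return tagged


tagged = tag(["The", "can", "rusts"])
print(tagged)
```

Here the ambiguous word "can" is resolved to a noun because it follows a determiner; without the rule it would receive its first dictionary label.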

Text Transformation (Attribute Generation)

A text document is represented by the words (features) it contains and their occurrences. The two main approaches to document representation are a) bag of words and b) vector space.
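Both representations can be sketched in a few lines. The helper names (`bag_of_words`, `to_vector`, `cosine_similarity`) and the two sample documents are assumptions; raw term counts are used as weights here, whereas other weighting schemes are equally possible.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Map each word to its number of occurrences; word order is discarded."""
    return Counter(tokens)


def to_vector(bag, vocabulary):
    """Place the counts on fixed axes, one per vocabulary term."""
    return [bag[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


doc_a = "text mining finds patterns in text".split()
doc_b = "data mining finds patterns in data".split()
bag_a, bag_b = bag_of_words(doc_a), bag_of_words(doc_b)

# A shared, sorted vocabulary gives both documents the same vector space.
vocabulary = sorted(set(bag_a) | set(bag_b))
vec_a, vec_b = to_vector(bag_a, vocabulary), to_vector(bag_b, vocabulary)
similarity = cosine_similarity(vec_a, vec_b)
print(vocabulary, vec_a, vec_b, similarity)
```

Once documents are vectors over a common vocabulary, they can be compared numerically, which is what later mining steps rely on.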

Feature Selection

Feature selection, also known as variable selection, is the process of selecting a subset of relevant features for use in model construction. The central assumption when using a feature selection technique is that the data contains many redundant or irrelevant features. Redundant features are those which provide no more information than the currently selected features, and irrelevant features provide no useful information in any context. Feature selection techniques are a subset of the more general field of feature extraction.
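One simple filter, chosen here only for illustration and not prescribed by the text above, is document-frequency thresholding: terms that occur in almost every document or in too few documents are treated as uninformative and dropped. The function name and both thresholds are assumptions.

```python
from collections import Counter


def select_by_document_frequency(token_lists, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents and at most max_df_ratio of them."""
    total = len(token_lists)
    document_frequency = Counter()
    for tokens in token_lists:
        # Count each term once per document.
        document_frequency.update(set(tokens))
    return sorted(
        term
        for term, count in document_frequency.items()
        if count >= min_df and count / total <= max_df_ratio
    )


# Hypothetical tokenized corpus of four documents.
corpus = [
    "the miner extracts text patterns".split(),
    "the miner stores text".split(),
    "the parser reads tables".split(),
    "the miner reads patterns".split(),
]
selected = select_by_document_frequency(corpus)
print(selected)
```

In this corpus "the" is dropped because it appears in every document, and words seen only once are dropped as too rare.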

Data Mining

At this point the Text mining process merges with the traditional Data Mining process. Classic Data Mining techniques are used on the structured database that resulted from the previous stages.
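As one example of a classic technique applied at this stage, the sketch below counts which pairs of terms occur together in the same document and keeps the pairs whose support reaches a threshold, which is the first step of association analysis. The function name `frequent_pairs`, the `min_support` value and the sample term sets are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support=0.5):
    """Return term pairs whose share of documents (support) is at least min_support."""
    total = len(transactions)
    counts = Counter()
    for terms in transactions:
        # Sorting makes each pair appear in one canonical order.
        counts.update(combinations(sorted(set(terms)), 2))
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical structured output of the earlier stages: one term set per document.
documents = [
    {"mining", "text", "patterns"},
    {"mining", "text"},
    {"mining", "data"},
    {"text", "patterns"},
]
pairs = frequent_pairs(documents)
print(pairs)
```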

Evaluate

Evaluate the result. After evaluation, the result can either be discarded or be used as input for the next sequence of steps.

ACKNOWLEDGMENTS

Our thanks to the experts who have contributed towards development of the template.


