Text Mining Concepts Process And Applications Computer Science Essay


02 Nov 2017


Lokesh Kumar

Student

Department of IT

ASET, Amity University

[email protected]

Parul Kalra Bhatia

Asst. Prof.

Department of IT

ASET, Amity University

[email protected]

ABSTRACT

With the advancement of technology, more and more data is available in digital form. Most of this data (approx. 85%) is in unstructured textual form, so it has become essential to develop better techniques and algorithms to extract useful and interesting information from this large amount of textual data. Hence, text mining, text analytics and information extraction have become popular areas of research. This paper focuses on the concept, process and applications of text mining.

General Terms

Text Mining Algorithms, Data Mining.

Keywords

Text Mining Algorithms, Data Mining, Information Retrieval, Information Extraction, Classification Algorithm, Association Algorithms.

INTRODUCTION

Text mining is defined as "the non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data" [1]. Text mining is a new field that tries to extract meaningful information from natural language text. It can be defined as the process of analyzing text to extract information that is useful for particular purposes. Compared with the type of data stored in databases, text is unstructured, ambiguous, and difficult to process. Nevertheless, in modern culture, text is the most common vehicle for the formal exchange of information. Text mining usually deals with texts whose function is the communication of factual information or opinions, and the motivation for trying to extract information from such text automatically is compelling, even if success is only partial.

Text mining is similar to data mining, except that data mining tools [2] are designed to handle structured data from databases, whereas text mining can work with unstructured or semi-structured data sets such as emails, text documents and HTML files. As a result, text mining is the better fit when the source material is free-form text.

Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.


Fig 1: Basic Process of Text Mining

The phrase "text mining" is generally used to denote any system that analyzes large quantities of natural language text and detects lexical or linguistic usage patterns in an attempt to extract probably useful (although only probably correct) information.

AREAS OF TEXT MINING

Text analysis involves information retrieval, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics [3]. The goal is essentially to turn text (unstructured data) into data (structured format) for analysis, via the use of natural language processing (NLP) methods.


Fig 2: Text mining areas

Information Retrieval (IR)

Information retrieval is regarded as an extension to document retrieval, where the documents that are returned are processed to condense or extract the particular information sought by the user. Thus document retrieval could be followed by a text summarization stage that focuses on the query posed by the user, or by an information extraction stage. IR systems help to narrow down the set of documents that are relevant to a particular problem.

As text mining involves applying very complex algorithms to large document collections, IR can speed up the analysis significantly [4] by reducing the number of documents for analysis.
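The narrowing step can be illustrated with a minimal sketch. It assumes a plain keyword-overlap filter rather than any particular IR system; the function names `terms` and `retrieve`, the `min_overlap` parameter and the sample documents are all hypothetical.

```python
import re


def terms(text):
    """Lowercase a text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(documents, query, min_overlap=1):
    """Return the documents sharing at least `min_overlap` terms with the query."""
    query_terms = terms(query)
    return [doc for doc in documents if len(terms(doc) & query_terms) >= min_overlap]


# Hypothetical mini-collection: only the retrieved subset would be mined further.
docs = [
    "Text mining extracts patterns from documents.",
    "The weather was pleasant on Sunday.",
    "Data mining tools work on structured databases.",
]
subset = retrieve(docs, "mining patterns")
print(len(subset), "of", len(docs), "documents kept for analysis")
```

Raising `min_overlap` makes the filter stricter, which trades recall for a smaller set of documents passed on to the more expensive mining algorithms.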

Data Mining (DM)

Data mining can be loosely described as looking for patterns in data. It can be more fully characterized as the extraction of implicit, previously unknown, and useful information from data. Data mining tools can predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They search databases for hidden and unknown patterns, finding critical information that experts may miss because it lies outside their expectations. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Natural Language Processing (NLP)

NLP is one of the oldest and most challenging problems in the field of artificial intelligence. It is the analysis of human language so that computers can understand natural languages as humans do [4].

NLP research pursues the vague question of how we understand the meaning of a sentence or a document. What are the indications we use to understand who did what to whom, or when something happened, or what is fact and what is supposition or prediction? While words - nouns, verbs, adjectives and adverbs [5] - are the building blocks of meaning, it is their correlation to each other within the structure of a sentence, within a document, and within the context of what we already know about the world, that conveys the true meaning of a text.

The role of NLP in text mining is to supply the information extraction phase with linguistic data as its input.

Information Extraction (IE)

Information Extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity involves processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and mining information out of images, audio and video, can also be seen as information extraction. A widely used practical example of IE is the Google search engine.

It involves defining the general form of the information that we are interested in as one or more templates, which are used to guide the extraction process. IE systems greatly depend on the data generated by NLP systems.
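The idea of a template guiding extraction can be sketched as follows. This is only an illustration under stated assumptions: the `MeetingTemplate` slots, the regular expressions and the sample sentence are hypothetical, and a real IE system would fill slots from NLP output rather than from hand-written patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class MeetingTemplate:
    """Hypothetical template: the slots we want filled from a sentence."""
    person: Optional[str] = None
    date: Optional[str] = None


# One illustrative pattern per slot (an assumption, not a standard scheme).
SLOT_PATTERNS = {
    "person": re.compile(r"\b(?:Dr|Mr|Ms|Prof)\. ([A-Z][a-z]+)"),
    "date": re.compile(r"\b(\d{1,2} [A-Z][a-z]+ \d{4})\b"),
}


def fill_template(text):
    """Fill each slot with the first match of its pattern, if any."""
    template = MeetingTemplate()
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            setattr(template, slot, match.group(1))
    return template


result = fill_template("Dr. Rao will speak on 12 March 2015.")
print(result)
```

Slots that find no match stay `None`, so downstream steps can tell a missing value from an extracted one.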

WHAT IS TEXT MINING

THE CONCEPT

Text mining is a burgeoning new field that tries to extract meaningful information from natural language text. It may be characterized as the process of analyzing text to extract information that is useful for particular purposes. Compared with the kind of data stored in databases, text is unstructured, ambiguous, and difficult to process. Nevertheless, in modern culture, text is the most common vehicle for the formal exchange of information. Text mining usually deals with texts whose function is the communication of factual information or opinions, and the motivation for trying to extract information from such text automatically is compelling, even if success is only partial.

Text mining, using manual techniques, was first used during the 1980s [6]. It quickly became apparent that these manual techniques were labor intensive and therefore expensive. They also required too much time to process the already growing quantity of information. Over time, programs were successfully created to process the information automatically, and the last few years have seen great progress.

Currently the study of text mining concerns the development of various mathematical, statistical, linguistic and pattern-recognition techniques that allow the automatic analysis of unstructured information, the extraction of high-quality and relevant data, and better searchability of the text as a whole.

A text document contains characters that together form words, which can be combined to form phrases [6]. These are all syntactic properties that together represent already defined categories, concepts, senses or meanings. Text mining must recognize, extract and use the information. Instead of searching for words, we can search for semantic patterns, and this is therefore searching at a higher level.

PROCESS

Text mining involves a series of activities to be performed in order to efficiently mine the information. These activities are:

Text Pre-processing

It involves a series of steps:

Text Cleanup

Text cleanup means removing any unnecessary or unwanted information, for example removing ads from web pages, normalizing text converted from binary formats, and dealing with tables, figures and formulas.


Fig 3: Activities / Process of Text Mining

Tokenization

Tokenizing in its simplest form is achieved by splitting the text at white spaces and at punctuation marks that do not belong to abbreviations identified in the preceding step.
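A minimal sketch of this simplest form of tokenizing is given here. The abbreviation list is a hypothetical stand-in for the output of the preceding step, and the function name `tokenize` is an assumption made for illustration.

```python
import re

# Hypothetical abbreviation list standing in for the preceding step's output.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Prof."}


def tokenize(text):
    """Split at white space, then at punctuation not belonging to an abbreviation."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            # Keep a known abbreviation, including its periods, as one token.
            tokens.append(chunk)
        else:
            # Separate runs of word characters from individual punctuation marks.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


tokens = tokenize("Dr. Smith reads emails, e.g. spam.")
print(tokens)
```

This sketch only recognizes an abbreviation when it stands alone between white spaces; an abbreviation directly followed by a comma, for instance, would still be split.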

Part of Speech Tagging

Part-of-Speech (POS) tagging assigns a word class to each token. Its input is the tokenized text. Taggers have to cope with unknown words (the out-of-vocabulary, or OOV, problem) and ambiguous word-tag mappings. Rule-based approaches like ENGTWOL [6] operate on a) dictionaries containing word forms together with the associated POS labels and morphological and syntactic features and b) context-sensitive rules to choose the appropriate labels during application.
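The two ingredients of a rule-based tagger, a dictionary and context-sensitive rules, can be illustrated with a toy sketch. It is not ENGTWOL: the lexicon, the tag names, the single context rule and the NOUN fallback for unknown words are all assumptions chosen to keep the example small.

```python
# Hypothetical lexicon: each word form maps to its candidate POS labels.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "can": ["AUX", "NOUN"],
    "rusts": ["VERB"],
    "we": ["PRON"],
    "mine": ["VERB", "PRON"],
    "text": ["NOUN"],
}


def tag(tokens):
    """Assign one POS label per token using the lexicon and one context rule."""
    tagged = []
    for token in tokens:
        candidates = LEXICON.get(token.lower())
        if candidates is None:
            # Unknown word (OOV): fall back to NOUN as a crude default.
            label = "NOUN"
        elif len(candidates) == 1:
            label = candidates[0]
        else:
            previous = tagged[-1][1] if tagged else None
            # Context-sensitive rule: directly after a determiner, prefer NOUN.
            if previous == "DET" and "NOUN" in candidates:
                label = "NOUN"
            else:
                # Otherwise take the first candidate listed in the lexicon.
                label = candidates[0]
        tagged.append((token, label))
    return tagged


print(tag(["The", "can", "rusts"]))
print(tag(["We", "can", "mine", "text"]))
```

The word "can" receives a different label in each sentence, which is the ambiguity that context rules are meant to resolve.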

Text Transformation (Attribute Generation)

A text document is represented by the words (features) it contains and their occurrences. The two main approaches to document representation are a) bag of words and b) vector space.
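Both representations can be sketched in a few lines. The example assumes raw term counts as vector components and uses cosine similarity to compare two vectors; the source names neither choice, so both are illustrative assumptions, as are the function names and sample documents.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Bag of words: map each word to its number of occurrences."""
    return Counter(tokens)


def to_vector(bag, vocabulary):
    """Vector space: one component per vocabulary term, in a fixed order."""
    return [bag.get(term, 0) for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = "text mining finds patterns in text".split()
doc_b = "data mining finds patterns in data".split()
bag_a, bag_b = bag_of_words(doc_a), bag_of_words(doc_b)
vocabulary = sorted(set(bag_a) | set(bag_b))
vec_a, vec_b = to_vector(bag_a, vocabulary), to_vector(bag_b, vocabulary)
similarity = cosine_similarity(vec_a, vec_b)
print(vocabulary, vec_a, vec_b, similarity)
```

The bag of words discards word order and keeps only counts; the vector form additionally fixes a shared vocabulary so that documents become comparable point by point.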

Feature Selection (Attribute Selection)

Feature selection, also known as variable selection, is the process of selecting a subset of important features for use in model creation. The main assumption when using a feature selection technique is that the data contains many redundant or irrelevant features. Redundant features are those that provide no information beyond the currently selected features, and irrelevant features provide no useful information in any context. Feature selection techniques are a subset of the more general field of feature extraction.
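One simple way to drop uninformative features is document-frequency thresholding, sketched here. The source does not name a specific technique, so this choice, the function name `select_features` and the thresholds `min_df` and `max_df_ratio` are assumptions for illustration only.

```python
from collections import Counter


def select_features(documents, min_df=2, max_df_ratio=0.9):
    """Keep terms whose document frequency lies within the given bounds."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    max_df = max_df_ratio * len(documents)
    return sorted(term for term, df in doc_freq.items() if min_df <= df <= max_df)


# Hypothetical tokenized corpus of four short documents.
corpus = [
    "the mining of text".split(),
    "the mining of data".split(),
    "the retrieval of text".split(),
    "the extraction of facts".split(),
]
selected = select_features(corpus)
print(selected)
```

Terms that occur in almost every document (here "the" and "of") carry little discriminating information, and terms that occur only once are too rare to support a pattern, so both groups are removed.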

Data Mining

At this point the Text mining process merges with the traditional Data Mining process. Classic Data Mining techniques are used on the structured database that resulted from the previous stages.

Evaluate

Evaluate the result. After evaluation, the result can be discarded, or it can be used as input for the next sequence of steps.

APPLICATIONS

Text Mining can be applied in a variety of areas. Some of the most common areas are:

Web Mining

These days the web contains a wealth of information about subjects such as people, companies, organizations and products that may be of wide interest. Web mining is the application of data mining techniques to discover hidden and unknown patterns from the Web.

Web mining can be described as the activity of identifying a term implied in a large document collection, say C, which can be denoted by a mapping C → p [9]. The first step toward any Web-based text mining effort is to gather a substantial number of web pages that mention a subject. The challenge then becomes not only to find all occurrences of the subject, but also to keep only those that have the desired meaning.

Medical

Users actively exchange information with others about subjects of interest or send requests to web-based expert forums, or so-called "ask the doctor" services [9]. People want to understand specific diseases (what they have), to be informed about new therapies, or to ask for a second opinion before deciding on a treatment. In addition, these expert forums also act as seismographs for medical and/or psychological needs that are apparently not met by existing health care systems [11].

E-mails, e-consultations, and requests for medical advice via the Internet have been manually analyzed using quantitative or qualitative methods [11]. To help the medical experts and to make full use of the seismographic function of expert forums, it would be helpful to categorize visitors’ requests automatically. Specific requests could then be directed to the appropriate expert or even answered semi-automatically, thereby providing complete monitoring. By generating "frequently asked questions (FAQs)," similar patient requests and their corresponding answers could be grouped together, even before the expert replies. Machine-based analyses could help both the public to better handle the mass of information and medical experts to give expert feedback.

An automatic classification of amateur requests to medical expert internet forums is a challenging task because these requests can be very long and unstructured as a result of mixing, for example, personal experiences with laboratory data.
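A very rough sketch of automatic request categorization is shown here, assuming a keyword-counting approach. The categories, the cue words and the function name `categorize` are hypothetical; the source does not specify a classification scheme, and a practical system would need a trained classifier to cope with long, mixed requests.

```python
import re

# Hypothetical categories and cue words; the source names no specific scheme.
CATEGORY_KEYWORDS = {
    "therapy": {"therapy", "treatment", "medication", "drug"},
    "diagnosis": {"diagnosis", "symptom", "symptoms", "test", "results"},
    "second_opinion": {"second", "opinion", "alternative"},
}


def categorize(request):
    """Return the category whose cue words overlap most with the request."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    scores = {cat: len(words & cues) for cat, cues in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Requests without any cue word are left for manual handling.
    return best if scores[best] > 0 else "uncategorized"


print(categorize("Is there a new treatment or medication for asthma?"))
print(categorize("I would like a second opinion on my surgery."))
```

Requests that match no cue word fall into "uncategorized", which mirrors the idea that only specific requests are routed automatically while the rest still reach a human expert.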

Resume Filtering

Big enterprises and head-hunters receive thousands of resumes from job applicants every day. Extracting information from resumes with high precision and recall is not an easy task [1]. Although resumes constitute a restricted domain, they can be written in a multitude of formats (e.g. structured tables or plain text), in different languages (e.g. Japanese and English) and in different file types (e.g. plain text, PDF, Word). Moreover, writing styles vary widely. In the initial manual scan of a resume, a recruiter looks for errors, educational qualifications, buzzwords, employment history, frequency of job changes, job titles, and other personal information [13]. Automatically extracting this information can be the first step in filtering resumes. Hence, automating the process of resume selection is an important task.
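As a first step toward such automation, a few fields can be pulled from plain-text resumes with pattern matching, as in the sketch that follows. The patterns, the degree list, the skill list and the function name `extract_resume_fields` are hypothetical, and the sketch assumes English plain text only; it does not address the format and language variety described above.

```python
import re

# Illustrative patterns and word lists (assumptions, not a standard).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DEGREE_PATTERN = re.compile(r"\b(B\.Tech|M\.Tech|MBA|PhD)\b")
KNOWN_SKILLS = {"python", "java", "sql"}


def extract_resume_fields(text):
    """Extract an e-mail address, degree mentions and known skills from a resume."""
    email = EMAIL_PATTERN.search(text)
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {
        "email": email.group(0) if email else None,
        "degrees": DEGREE_PATTERN.findall(text),
        "skills": sorted(KNOWN_SKILLS & words),
    }


sample = "Contact: jane.doe@example.com\nEducation: B.Tech in IT\nSkills: Python, Java"
fields = extract_resume_fields(sample)
print(fields)
```

The extracted dictionary could then feed a filtering rule, for example keeping only resumes that list a required skill.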

ACKNOWLEDGMENTS

Our thanks to the experts who have contributed towards development of the template.


