Applications Of Text Mining


02 Nov 2017


Independent Study-I

Crime Tendency in a Person by Analysis of Their Social Media Profile (Facebook)

Name : Sajjad Ali Khan

Reg.Number : 1073110

I.S. Advisor : Zohaib Jan

Department : Computer Science

I would like to thank Almighty ALLAH, by whose grace I was able to complete this research project.

Special thanks to my supervisor, Mr. Zohaib Jan, for his guidance and support, and to my friends who helped me with this project directly and indirectly.

ABSTRACT

Social networks have become an important part of everyday life, and their use has increased greatly over the last decade. People publish content on the web and convey their ideas by posting statuses, links, videos, images and comments (for example on Facebook). As a result, a huge amount of data is published on these networks, in which people share personal information and insights into their lives. In this paper we use a text mining algorithm to extract the necessary information from text messages (statuses, comments, etc.), from which predictions can be made. The main aim of this paper is to find the crime tendency of a person from the information they share on a social networking site (Facebook).

Table of Contents

ABSTRACT

Table of Contents

1. INTRODUCTION

2. Social networking site (FACEBOOK)

3. LITERATURE REVIEW

4. Results.

4.1 Processing assumptions:

5. Conclusion

REFERENCES

LIST OF TABLES

TABLE 1: Dictionary built through domain specific documents

TABLE 2: TF-IDF of the parsed documents having unknown domain

TABLE 3: Domain where term "barrel" exists

TABLE 4: showing the index of word "Economy"

TABLE 5: showing the index of word "Economy"

TABLE 6: showing the index of romney and obama

INTRODUCTION

Social media is where people share their personal information with the world. In January 2005, a survey of social networking websites estimated that they had roughly 115 million members in total [14]. In the last five years, Facebook alone has reached around 500 million members. People reveal a lot of information about themselves when they create their profile, and afterwards by sharing and commenting. They share statuses, images and interests, which is useful in determining their personality. In this paper we use the information people share on their Facebook profile through comments and status updates, and apply text mining techniques to predict the crime tendency of a person. We search for the angry or negative words people publish on the web, as these words reflect their state of mind. Previous research has suggested that the information people give when creating their profile is an idealized version rather than an actual reflection of their personality, so we use the information people post on the web, not the information they wrote in their profile.

We use the language features of users' Facebook profiles. Previous research has shown that linguistic features can be used to predict personality traits [26, 36]. We combine the linguistic features of each profile, that is status updates, comments and posts containing text, into a single string for analysis.

Text Mining

Text mining is the process of extracting useful information from text. It is the discovery by computer of new, previously unknown information, by automatically extracting it from different written documents. It includes tokenization, stemming, parsing and N-gram structuring of text. Our research is based on prediction and classification, but there are many other problems that can be solved using text mining techniques.

Applications of Text Mining

Text mining is mainly used to extract information from text. The field also covers many other applications and problems, e.g. data retrieval, storage, email support, spam filtering, recommendations and suggestions (e.g. Amazon) and automatic labeling of documents.

Document Classification

Document classification is the process of categorizing unlabeled articles or documents. Sample training data is provided with labels, for example "cricket", "information technology" and "news", and from this information the classifier should be able to assign a new, unseen document to the correct class.

Information Retrieval

Information retrieval mostly concerns online documents. Here we retrieve information from unstructured text, mostly textual documents, using a query that may itself be unstructured. We assign attributes to the written or online documents in order to extract the desired values from them. In short, the idea is to retrieve the documents that are most similar to the query.
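As a rough illustration of retrieving documents with an unstructured query, the sketch below ranks documents by TF-IDF weighted term matching. It is not the method used in this study; the function names tokenize and rank, and the scoring formula (term frequency multiplied by the logarithm of inverse document frequency), are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return document indices ordered from best to worst match for the query."""
    doc_counts = [Counter(tokenize(d)) for d in documents]
    n = len(documents)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())

    def score(counts):
        # Sum tf * idf over the query terms that occur in this document.
        return sum(
            counts[t] * math.log(n / df[t])
            for t in tokenize(query)
            if t in counts
        )

    scores = [score(c) for c in doc_counts]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


docs = ["cricket match today", "new phone released today", "cricket bat and ball"]
print(rank("cricket bat", docs))
```

Rare terms receive a higher weight than terms that occur in most documents, so a document matching an unusual query word is ranked above one that only matches a common word.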

Document Clustering

Document clustering is less powerful than text classification, but it does not need the label assignment that text categorization requires. It involves descriptors and the extraction of descriptors. It is useful for companies that want to know which problem domain is more severe than another.

Information Extraction

Information extraction is the process of automatically extracting information from semi-structured or unstructured text. In most cases this means processing human-language text by means of natural language processing.

Prediction and Evaluation

Prediction and evaluation are concerned with making predictions. The program learns generalized rules from sample documents and then uses those rules to give reasonably accurate answers for new documents.

Social networking site (FACEBOOK)

Facebook is the most famous social networking site. A user has a list of friends, can join groups and can like pages of interest. Users can update their status and comment on their own posts or the posts of others; similarly, they can like each other's activity and share photos, images, links, etc. Users share their emotions and state of mind through status updates and by commenting on pages and groups. Users can share each other's statuses and posts that are public or that they have access to. Facebook also has a chat option, and users can send inbox messages to each other.

Recently Facebook also added voice and video calling, and the graph API, which is being made available to randomly selected users; through it users can search for people or anything else directly by typing in a name. In the last five years Facebook alone has gained roughly 500 million users, so a huge amount of data is being shared on Facebook. Users share a lot of personal information on Facebook that can be used to analyse their personality.

LITERATURE REVIEW

Facebook has become one of the most popular social networks, where many users share their information, and hence a lot of research has been done on detecting the personality of users from their personal information. J. Golbeck, C. Robles and K. Turner showed that a user’s Big Five personality traits can be predicted from the public information they share on Facebook [1]. Yoram Bachrach, David Stillwell and Michal Kosinski showed that personality traits are correlated with patterns of social network use, as reflected by features of the Facebook profile. In previous research, M. Back and J. Stopfer stated that Facebook profiles reflect users’ actual personalities, not an "idealized" version of themselves [3]. It has been shown in [4] that extroversion and conscientiousness positively correlate with the perceived ease of use of social media websites. G. Hodgkinson and J. Ford concluded that people’s personality can be successfully judged by others based on their Facebook profiles [5].

Facebook Dataset Characteristics

The Facebook data for analyzing crime tendency in a person was gathered through many hours of crawling. We collected the data by crawling for the angry and negative words used in profile status updates and other text posts. Data was gathered at frequent intervals using the Facebook Graph API. We gathered information such as user ID, timestamp and text posts for analysis. Some of the important keywords are listed in Table 1.

Keyword     Keyword
murder      Evil
anger       knife
killed      guns
death       blood

The table above shows some of the priority keywords we used for crawling. Finding the words that reflect a person's anger and negativity in their profile is one of the most time-consuming parts of the process. These keywords are the most important part of the research, as all the data collected for mining is based on them. After collecting the data from Facebook, we pre-process the dataset and then apply text mining to it.

Tokenization

Tokenization is the task of cutting a document into pieces, known as tokens, while leaving out irrelevant characters such as special characters or punctuation. An example of tokenization input is

Input: hobby, cricket, sports, kindly give me bike etc.

Tokens are often loosely referred to as terms, but it is sometimes important to keep the two distinct. The terms used for indexing or as identifiers in classification could in principle differ from the tokens; in recent information retrieval systems, however, they are closely related to the tokens in the document.
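The splitting step described above can be sketched in a few lines of Python. This is a minimal sketch, not the tokenizer used in the study: the function name tokenize and the regular expression are assumptions, and the pattern only keeps runs of letters and digits, allowing an apostrophe inside a word.

```python
import re


def tokenize(text):
    """Cut text into word tokens, dropping punctuation and special characters.

    An apostrophe is kept only when it sits inside a word (e.g. "don't").
    """
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


tokens = tokenize("hobby, cricket, sports, kindly give me bike")
print(tokens)
```

Each comma and space acts as a separator, so the input line yields one token per word.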

Dropping common terms

Some words are so common that they are of very little value in helping to select documents, and they are omitted from the vocabulary entirely. These are known as stop words. The general strategy for determining the stop words is to sort the terms by frequency and pick the most frequently used ones. Dropping them reduces the amount of index data the system has to store.
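The frequency-based strategy can be sketched as follows. The function names build_stop_list and remove_stop_words, the whitespace splitting and the cut-off size are illustrative assumptions; in practice the resulting list is usually reviewed by hand.

```python
from collections import Counter


def build_stop_list(documents, size):
    """Return the `size` most frequent terms across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return {term for term, _ in counts.most_common(size)}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stop_words]


docs = ["the cat sat on the mat", "the dog ate the bone", "a cat and the dog"]
stop_words = build_stop_list(docs, size=1)
print(stop_words)
print(remove_stop_words("the cat sat".split(), stop_words))
```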

Capitalization

The usual strategy is to convert all sentences, words and single letters to lower case, as this helps in matching the words in a query, e.g. "Word" and "word" should match.

Concerns in English

There are many other problems and challenges specific to English usage. One might wish to match "don’t" with "do not" and "center" with "centre", and there are similar matching issues with the formatting of times, dates and similar items.
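Case folding and the matching of spelling variants can be sketched together. The VARIANTS table below is a hypothetical two-entry mapping built from the examples in the text; a real system would need a much larger table and separate handling for times and dates.

```python
# Hypothetical variant table, covering only the examples mentioned in the text.
VARIANTS = {"don't": "do not", "centre": "center"}


def normalize(text):
    """Lowercase the text and map known spelling variants to one form."""
    text = text.replace("\u2019", "'")  # treat the curly apostrophe like a straight one
    words = text.lower().split()
    return " ".join(VARIANTS.get(w, w) for w in words)


print(normalize("Don\u2019t go to the Centre"))
print(normalize("Word") == normalize("word"))
```

Applying the same normalization to both the documents and the query is what makes the two forms match.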

Simple Tokenization Algorithm

A simple tokenization algorithm is as follows:

Initialize reading document {

For each statement in document {

For each word in statement {

If (word is a common word)

Continue;

Generate token

If token exists

Increment frequency of token

While (next word is not a stop word) {

Extend token into a multi-word token

}

}

}

}
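One possible reading of the algorithm above, written as runnable Python, is sketched here. The stop word set, the sentence-splitting rule and the function name token_frequencies are assumptions, and the multi-word step is interpreted as joining each run of consecutive non-stop words into one extra token.

```python
import re
from collections import Counter

# Illustrative stop word list; not the list used in the study.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "me", "in"}


def token_frequencies(document, stop_words=STOP_WORDS):
    """Count single-word tokens and multi-word tokens in a document."""
    freq = Counter()
    for sentence in re.split(r"[.!?]+", document):
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        run = []  # current run of consecutive non-stop words
        for word in words + [None]:  # None is a sentinel that closes the last run
            if word is None or word in stop_words:
                if len(run) > 1:
                    freq[" ".join(run)] += 1  # multi-word token
                run = []
                continue
            freq[word] += 1  # new token, or increment an existing one
            run.append(word)
    return freq


print(token_frequencies("He took the knife. The knife is sharp."))
```

A Counter starts every token at zero, so generating a new token and incrementing an existing one are handled by the same statement.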

Sample Data

We applied our procedure to more than 60,000 text posts, using 50 keywords on 50 different profiles, and obtained some interesting figures, a few of which are listed below in Table 2. The keyword counts across the text posts of the overall Facebook dataset give an overview of the data.

Keyword     No. of occurrences
Blood       2230
Killed      4000
Suicide     4985
Murder      2430
Depress     8032
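Counts of this kind can be produced by scanning every post for the keyword list. The sketch below uses made-up example posts, not the study's data, and the function name count_keywords is an assumption. It matches whole words only, so a stem such as "depress" would need a stemming step before it could match "depressed".

```python
import re
from collections import Counter

KEYWORDS = {"blood", "killed", "suicide", "murder", "depress"}


def count_keywords(posts, keywords=KEYWORDS):
    """Count how often each keyword occurs as a whole word across all posts."""
    counts = Counter()
    for post in posts:
        for word in re.findall(r"[a-z]+", post.lower()):
            if word in keywords:
                counts[word] += 1
    return counts


# Made-up posts for illustration only.
example_posts = ["Blood everywhere, blood!", "He was killed", "nice day"]
print(count_keywords(example_posts))
```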

3.5 Discussion and Analysis of Related Work

Personality can be predicted from the personal information in a user's profile. The Big Five personality model, now used by psychologists worldwide, describes five personality traits: Openness, Conscientiousness, Extroversion, Agreeableness and Neuroticism. Researchers have used this model to detect the personality of a person through Facebook. Many researchers have found that personality can affect job performance and other outcomes. Personality can also be important for marriage, and many people look at someone's social network profile before meeting or marrying them.

In previous research, researchers used language features and personal information such as interests and hobbies to judge the personality of users, searching for words within their profiles and determining where those words lie among the Big Five personality traits. They looked for swear, anger, anxiety and other words, as well as the last name and the "about me" section of the profile, and made predictions based on these results [1]. Researchers have also used other information to predict the personality of a user, such as the number of status updates, likes, comments, shared photos and friends [2]. In this research the data was collected through the Facebook API, which saves the time and cost spent on questionnaires.

Some researchers think that the information users give when creating a profile is an idealized version, not an actual representation of their personality. However, research also shows that people do share their emotions somewhere on social networks, and as the use of social networks keeps increasing it becomes easier to determine personality.

After reviewing this literature, we arrived at the idea of detecting the criminal tendency of a person from their social profiles: at some point people show their reactions and state of mind on social networks, so we gather information based on the anger words users use in their posts to find criminal tendency.

3.6 Discussion and Analysis of pre work

Results.

4.1 Processing assumptions:

We applied our process to approximately 60,000 Facebook text posts related to more than 50 keywords on 50 different profiles, and then applied the naive Bayes classification algorithm. First we break the text posts down into tokens and remove the stop words from them. After that we apply stemming and an N-gram algorithm to retrieve multi-word terms, and then train a Naive Bayes model, which is applied to the Facebook dataset.
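The pipeline described above could look roughly like the following sketch, which is an assumption-laden illustration rather than the implementation used in this study. The stop word list, the crude_stem helper (a rough stand-in for a real stemmer such as Porter's), the use of bigrams as the N-gram step, the class labels and the training posts are all hypothetical. The classifier is a multinomial Naive Bayes with add-one smoothing.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop word list; not the list used in the study.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "i", "my",
              "we", "for", "with", "at", "was", "in"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def features(text, n=2):
    """Tokenize, drop stop words, stem, then add n-grams of the stemmed words."""
    words = [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return words + grams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = features(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for feat in features(text):
                if feat in self.vocab:  # ignore features never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training posts and labels, for illustration only.
train_texts = ["I will kill him with a knife", "so much anger and blood",
               "lovely day at the beach", "happy birthday to my friend"]
train_labels = ["negative", "negative", "neutral", "neutral"]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("anger and a knife"))
```

Working in log probabilities avoids numerical underflow when many small probabilities are multiplied, and the add-one smoothing keeps a feature that was never seen in one class from forcing that class's probability to zero.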

Conclusion

We have presented our approach to finding the crime tendency of a person using their social networking site (Facebook) data. It is quite possible to predict a tendency towards crime from personality, but there are a few concerns that affect crime prediction.

First, the data we use is public, and it is possible that users with a criminal mind share such things privately and do not make them visible to the public.

Second, a very large number of people use these sorts of words in a friendly mood or just to irritate friends.

Third, people with a criminal mind share fewer things and are not very socially active.

The issues listed above are challenging and need a lot of further research.

Detecting criminal tendency through social media profiles is an emerging topic, and with it we can become aware of people with a criminal mind at low cost, accurately and efficiently.


