Authorship Analysis And Identification

Published Date: 02 Nov 2017

Saurav Bose

IIIT-Delhi

Saurabh Yadav

2010077

IIIT-Delhi

ABSTRACT

With the rapid proliferation of Internet technologies and applications, misuse of online messages for inappropriate or illegal purposes has become a major concern for society. The anonymous nature of online message distribution makes identity tracing a critical problem. This report provides an overview of authorship analysis and of the process of identifying the authors of online messages. It explains in detail the different types of writing-style features that are extracted to build feature-based classification models for identifying the authorship of online messages. It reviews the efficiency and robustness of these models in a multi-language context (English and Arabic). It also discusses the limitations that arise in authorship analysis of Arabic messages and how these limitations can be overcome.

Keywords

Authorship analysis, Authorship identification, online messages, writing style features.

1. INTRODUCTION

With the advent of the Internet, it has become easier for people to share information across time and space. This brings both advantages and disadvantages, the latter including a new venue for criminal activities, collectively known as cybercrime. Examples include the distribution of illegal content such as pornography and pirated software, as well as material promoting terrorism and hatred. Of late, cyber criminals have been extensively involved in distributing such illegal content and hate speech through Web-based channels such as websites, newsgroups and forums. The Internet's anonymity gives them the upper hand in these activities. Participating in online activity is easy because people usually do not have to provide real identity information such as name, address or gender. As a result, criminal identity tracing poses complex challenges for law enforcement agencies. To make matters worse, the sheer number of users and activities makes a manual approach to identity tracing impractical for cybercrime investigation. What is needed is automated criminal identity tracing in cyberspace, which would allow investigators to prioritize their tasks and focus on the major criminals.

Authorship analysis can assist this activity by automatically extracting linguistic features from online messages and evaluating stylistic details for patterns of terrorist communication. However, work on authorship analysis techniques has mostly remained theoretical, with little real-world application, particularly in online communication. Furthermore, the global nature of terrorist activity has made it necessary to analyze multilingual content. Among foreign languages, Arabic has garnered specific attention in recent years for sociopolitical reasons that include possible ties between certain Middle Eastern groups and terrorism [1]. The morphological characteristics of this language pose several critical problems for current authorship analysis techniques.

2. LITERATURE REVIEW

2.1 Authorship Analysis

Authorship analysis is the process of examining the characteristics of a piece of work in order to draw conclusions about its authorship [2]. The problem can be broken down into three sub-fields: Author Identification (determining the likelihood that a particular author wrote a piece of work by examining other works produced by that author), Author Characterization (summarizing the characteristics of an author and generating his or her profile based on the available work) and Similarity Detection (comparing multiple pieces of work and determining whether or not they were produced by a single author, without actually identifying the author).

2.2 Feature Selection

Primarily, there are four writing style features that facilitate authorship attribution: syntactic, lexical, structural, and content-specific.

Syntactic features refer to the patterns used to form sentences. They consist of the tools used to structure sentences, such as punctuation and function words ("while", "upon"). Usage patterns of function words can be effective features for authorship identification [1]. For example, the difference between using the word "thus" and the word "hence" might seem subtle, but it can constitute a significant stylistic difference.
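As a rough illustration of how such features might be counted, the sketch below computes relative frequencies of a few function words and punctuation marks. The word list, the punctuation set and the feature names are assumptions made for this example; they are not the feature set used in [1].

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lists; a real study would use far more entries.
FUNCTION_WORDS = ("while", "upon", "thus", "hence", "the", "of")
PUNCTUATION = ",.;:!?"

def syntactic_features(text):
    """Return relative frequencies of function words and punctuation marks."""
    words = re.findall(r"[a-z']+", text.lower())
    word_counts = Counter(words)
    n_words = max(len(words), 1)  # guard against empty input
    n_chars = max(len(text), 1)
    features = {f"fw_{w}": word_counts[w] / n_words for w in FUNCTION_WORDS}
    features.update({f"punct_{p}": text.count(p) / n_chars for p in PUNCTUATION})
    return features

sample = "Thus the plan failed; hence, the delay."
feats = syntactic_features(sample)
print(feats["fw_thus"], feats["fw_hence"], feats["punct_;"])
```

Normalizing by message length keeps short and long messages comparable, which matters for forum posts of very different sizes.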

Lexical features can be either word- or character-based. Characteristics such as total number of words, words per sentence, vocabulary richness, etc. can be included under Word-based lexical features. On the other hand, Character-based lexical features include total number of characters, characters per sentence, usage frequency of individual letters, etc.
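A minimal sketch of the word-based and character-based measures listed above follows. The tokenization rules are simplifying assumptions, and vocabulary richness is approximated here by the type-token ratio, which is only one of several possible richness measures.

```python
import re
from collections import Counter

def lexical_features(text):
    """Compute a few word-based and character-based lexical features."""
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    letters = [c for c in text.lower() if c.isalpha()]
    n_sentences = max(len(sentences), 1)  # guard against empty input
    return {
        "total_words": len(words),
        "words_per_sentence": len(words) / n_sentences,
        # Type-token ratio: distinct words divided by total words (assumed measure).
        "vocabulary_richness": len(set(words)) / max(len(words), 1),
        "total_chars": len(text),
        "chars_per_sentence": len(text) / n_sentences,
        "letter_freq": {c: n / len(letters) for c, n in Counter(letters).items()},
    }

lex = lexical_features("The cat sat. The dog ran!")
print(lex["total_words"], lex["words_per_sentence"], lex["vocabulary_richness"])
```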

Structural features deal mainly with the text's organization and layout and have proved to be important in analyzing online messages. Examples include greetings and signatures, the number of paragraphs used and the average paragraph length. These traditional structural features are useful discriminators, but they do not capture the additional information present in online messages. For example, the use of various font sizes and colors requires a deliberate effort, which makes it a style marker as well.
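The structural measures mentioned above can be sketched for plain-text messages as follows. The greeting list and the signature pattern are hypothetical heuristics chosen for this example, and font or color information is not handled because it would require parsing the HTML version of a message.

```python
import re

GREETINGS = ("hi", "hello", "dear", "greetings")  # hypothetical list
SIGNATURE = re.compile(r"^(--|regards|thanks|cheers)", re.IGNORECASE)  # hypothetical pattern

def structural_features(message):
    """Compute simple layout features of a plain-text message."""
    paragraphs = [p for p in re.split(r"\n\s*\n", message) if p.strip()]
    lines = [ln.strip() for ln in message.splitlines() if ln.strip()]
    first = lines[0].lower() if lines else ""
    last = lines[-1] if lines else ""
    total_words = sum(len(p.split()) for p in paragraphs)
    return {
        "num_paragraphs": len(paragraphs),
        "avg_paragraph_length": total_words / max(len(paragraphs), 1),
        "has_greeting": first.startswith(GREETINGS),
        "has_signature": bool(SIGNATURE.match(last)),
    }

msg = "Hello all,\n\nThis is the first paragraph here.\n\nRegards, Sam"
struct = structural_features(msg)
print(struct)
```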

Content-specific features are words that are important within a specific topic domain. For example, in a discussion about computers, words such as "RAM" and "laptop" would be among the most frequently used.

2.3 Techniques for Authorship Analysis

Most of the early work used statistical methods for authorship analysis. These methods were based on the idea that different authors compose text differently, and that each author's composition can be characterized by a probability distribution of word usage. More specifically, given a population of an author's texts, the identification of a new text can be treated as a statistical hypothesis test or a classification problem [2]. However, these techniques have been found to have limitations in scalability, reliability and prediction capability.

The drastic increase in computational power over the years has brought machine learning techniques to the fore. These techniques include Support Vector Machines (SVMs), neural networks and decision trees. They scale better than statistical techniques as the number of features grows, and they are less susceptible to noisy data. As a result, they have gained wider acceptance in authorship analysis studies in recent years. These benefits are important for working with online messages, which involve classifying many authors with a large feature set.

3. ARABIC CHARACTERISTICS

The stylistic and structural properties of Arabic pose several challenges for authorship analysis. Inflection, diacritics, word length and elongation are characteristics that must be handled when applying authorship analysis to Arabic messages.

3.1 Inflection

Arabic has approximately 5000 roots from which words are formed, making it a highly inflected language. These roots are themselves composed of 3-5 consonants. The orthographical and morphological properties of Arabic result in significant lexical variation [1], because words can take on numerous forms. As a result, inflection creates feature extraction problems and weakens vocabulary richness measures.

3.2 Diacritics

Diacritics are the markings above or below the letters used to indicate special phonetic values [1]. In English, for example, a diacritic is the little mark on top of the letter e in the word résumé [1]. Diacritics are used in Arabic to represent short vowels, consonant lengths, and relationships between words. However, diacritics are rarely used in online communication. The lack of diacritics can significantly impact the effectiveness of word-usage-based features such as function words. In Arabic, for example, it's impossible without diacritics to distinguish between the words who and from.
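The ambiguity can be seen in a small sketch that strips combining marks using Python's standard `unicodedata` module. The helper name `strip_diacritics` is an assumption for this example. The two Arabic words are written with explicit code points: "man" (who) and "min" (from) differ only in the short-vowel mark on the first letter.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks, including the Arabic short-vowel marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

who = "\u0645\u064E\u0646\u0652"    # "man" (who): meem + fatha, noon + sukun
from_ = "\u0645\u0650\u0646\u0652"  # "min" (from): meem + kasra, noon + sukun

print(who != from_)                                       # distinct with diacritics
print(strip_diacritics(who) == strip_diacritics(from_))   # identical without them
print(strip_diacritics("r\u00e9sum\u00e9"))               # the English example from the text
```

Once the marks are gone, a function-word counter sees a single token for both words, which is why word-usage features lose discriminating power on undiacritized text.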

3.3 Word Length and Elongation

Compared with their English counterparts, Arabic words are often shorter, which reduces the effectiveness of many lexical features in identifying authorship. In English, the use of long, complex words in a sentence can indicate how well versed a person is in the language. Because Arabic words are of almost the same length whether they are simple or complex, this assumption does not hold. Elongation presents a further complication: Arabic words are sometimes elongated for purely stylistic reasons, using a special character that resembles a dash (the tatweel, U+0640) [1]. Arabic letters are joined in writing, so elongation is possible by lengthening the joins between letters. Elongation cuts both ways: it provides an important style marker, but it also significantly inflates the values of word length features.
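A minimal sketch of handling elongation is shown below, assuming that elongation is written with the Unicode tatweel character (U+0640). The function counts the elongation characters first, so the style marker is preserved, and then removes them so that word length is not inflated. The function name is an assumption for this example.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the joining stroke used for elongation

def remove_elongation(word):
    """Return the word without tatweel and the number of tatweel characters removed."""
    count = word.count(TATWEEL)
    return word.replace(TATWEEL, ""), count

# The word "kitab" (book) with three elongation characters inserted.
elongated = "\u0643\u062A\u0640\u0640\u0640\u0627\u0628"
plain, n_elongations = remove_elongation(elongated)
print(len(elongated), len(plain), n_elongations)
```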

4. EXPERIMENT DESIGN

The test bed of relevant messages was taken from Web forums. The Arabic data set was taken from the Al-Aqsa Martyrs group, while the English data set was taken from the White Knights group. There were 400 messages for each language. Two classification techniques were used in the experiment: C4.5 and SVM. C4.5 is a powerful decision-tree-based classifier whose analytical and explanatory potential makes it well suited to assessing key differences between the English and Arabic feature sets. SVM, on the other hand, is a computational learning method based on structural risk minimization [1]. It has gained popularity over the years due to its classification power and robustness. It is also effective in handling noisy data.

The English and Arabic feature sets consisted of 301 and 418 features respectively. Of the 301 English features, 87 were lexical, 158 syntactic, 45 structural, and 11 content-specific. The 418 Arabic features were distributed as 79 lexical, 262 syntactic, 62 structural, and 15 content-specific.

To construct the Arabic feature set, the language's morphological and orthographical properties were taken into consideration. Inflection was handled by using usage frequencies for a selected set of word roots, which compensated for the losses in vocabulary richness measures. Root frequencies were tracked with a clustering algorithm designed by De Roeck and Al-Fares. Their algorithm calculated a similarity score for each word against a collection of roots, and each word was assigned to the root with which it had the highest similarity score. The SVM technique zeroed in on 50 optimal roots, which were then used for classification. To capture word length precisely, a filter embedded in the Arabic feature extractor removed elongation after it had been tracked. The absence of a feasible semantic tagger prevented the tracking of diacritics.
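The assignment step can be sketched as follows. This is not De Roeck and Al-Fares's algorithm; it is a simplified stand-in that scores each word against each root with a Dice coefficient over character bigrams, an assumption chosen only to show the idea of assigning a word to its best-scoring root. The words are hypothetical Latin transliterations of unvowelled Arabic spellings, used here for readability.

```python
from collections import Counter

def bigrams(s):
    """Return the set of adjacent character pairs in a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Dice coefficient over character bigrams (stand-in similarity score)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def assign_root(word, roots):
    """Assign the word to the root with the highest similarity score."""
    return max(roots, key=lambda root: dice(word, root))

roots = ["ktb", "drs"]                       # hypothetical root list
words = ["kAtb", "mktwb", "mdrsp", "kAtb"]   # hypothetical transliterated words
root_frequencies = Counter(assign_root(w, roots) for w in words)
print(root_frequencies)
```

The resulting root frequencies play the role that raw word frequencies play in English, because many surface forms collapse onto one root.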

5. IDENTIFICATION PROCESS

Collection, extraction and experimentation are the three main steps in the complete online authorship identification process.

5.1 Collection and Extraction

The web forums of interest are identified with the help of spidering programs, which crawl the Internet searching for potentially dangerous or abusive content. Once the forums are identified, collection programs store the messages in text and HTML format. Extraction programs then derive the writing-style characteristics described earlier. The complexity of the extraction programs varies across languages because each language has special features of its own (word length elongation and inflection in the case of Arabic).

5.2 Experiments

After extracting the feature values, the next step is to form feature sets. These sets are formed in a step-wise manner. The first set consisted of lexical features, the second encompassed lexical and syntactic features, and so on. Finally, the fourth set comprised all features: lexical, syntactic, structural, and content-specific. Such a stepwise addition of features helps to identify the relevance of each writing-style category in authorship analysis. Lexical and syntactic features are the most important categories and hence form the foundation for structural and content-specific features.
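The stepwise construction can be sketched as a cumulative merge of per-category feature dictionaries. The category keys and the dummy feature values are assumptions for this example; the category sizes are the English counts reported in the experiment design (87, 158, 45 and 11).

```python
CATEGORIES = ["lexical", "syntactic", "structural", "content_specific"]

def stepwise_feature_sets(features_by_category):
    """Return four cumulative feature sets: F1, F1+F2, F1+F2+F3, F1+F2+F3+F4."""
    sets = []
    combined = {}
    for category in CATEGORIES:
        combined = {**combined, **features_by_category[category]}
        sets.append(combined)
    return sets

# Dummy feature values sized like the English feature set.
sizes = {"lexical": 87, "syntactic": 158, "structural": 45, "content_specific": 11}
dummy = {cat: {f"{cat}_{i}": 0.0 for i in range(n)} for cat, n in sizes.items()}
feature_sets = stepwise_feature_sets(dummy)
print([len(fs) for fs in feature_sets])
```

Training the same classifier on each cumulative set and comparing the scores isolates the contribution of the category added at that step.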

For the experiments on English and Arabic messages, 30 random samples of five authors each were drawn. Each sample of five authors was evaluated using all 20 messages per author [1]. Both classifiers were applied to every sample, one at a time. A 30-fold cross-validation testing method was used in all experiments.
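The source does not describe how the folds were built, so the following is only a sketch of plain k-fold splitting under stated assumptions: one sample of five authors with 20 messages each gives 100 messages, the messages are shuffled with a fixed seed, and no stratification by author is attempted.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 5 authors x 20 messages = 100 messages, split into 30 folds.
splits = list(k_fold_indices(100, 30))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each message is used for testing exactly once, so the reported measures average over every message in the sample.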

Accuracy, recall and precision measures are used to evaluate the prediction performance. The accuracy is a measure which indicates the overall prediction performance of a particular classifier [2].

$$\text{Accuracy} = \frac{\text{Number of messages with correctly identified author}}{\text{Total number of messages}}$$

For a particular author, precision and recall measures are used to measure the effectiveness of the approach for identifying messages that were written by that author. The precision and recall are defined as [2]:

$$\text{Precision} = \frac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages assigned to the author}}$$

$$\text{Recall} = \frac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages written by the author}}$$
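The three measures can be computed directly from the true and predicted author labels, as in the sketch below. The function names and the toy labels are assumptions for this example and are not results from the experiments.

```python
def accuracy(true_authors, predicted_authors):
    """Fraction of messages whose author was identified correctly."""
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return correct / len(true_authors)

def precision_recall(true_authors, predicted_authors, author):
    """Precision and recall for a single author."""
    pairs = list(zip(true_authors, predicted_authors))
    correctly_assigned = sum(t == author and p == author for t, p in pairs)
    assigned = sum(p == author for _, p in pairs)  # messages assigned to the author
    written = sum(t == author for t, _ in pairs)   # messages written by the author
    precision = correctly_assigned / assigned if assigned else 0.0
    recall = correctly_assigned / written if written else 0.0
    return precision, recall

# Toy labels for illustration only.
true_labels = ["A", "A", "A", "B", "B", "C"]
pred_labels = ["A", "A", "B", "B", "C", "C"]
print(accuracy(true_labels, pred_labels))
print(precision_recall(true_labels, pred_labels, "A"))
```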

6. ANALYSIS AND RESULTS

6.1 Feature Type Comparison

The authorship prediction performance varied significantly with different combinations of metrics. Pair-wise t-test results indicated that:

Using style markers and structural features outperformed using style markers only [2]: The results might be explained by the fact that an author's consistent writing patterns show up in the message's structural features.

Using style markers, structural features, and content-specific features did not outperform using style markers and structural features [2]: The results indicated that adding content-specific features did not improve the authorship prediction performance significantly. The weaker performance of content-specific features could be attributable to their small representation in the feature set, which contained only 11 and 15 content-specific features for English and Arabic messages respectively.
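A pair-wise t-test of this kind compares the scores of two feature sets on the same samples. The sketch below computes the paired t statistic and its degrees of freedom with the standard library. The accuracy values are hypothetical toy numbers for illustration, not figures from [1] or [2], and the resulting statistic would still need to be compared against a t distribution to obtain a p-value.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Hypothetical per-sample accuracies for two feature-set combinations.
with_structural = [0.90, 0.92, 0.88, 0.91]
style_only = [0.85, 0.86, 0.84, 0.88]
t_value, dof = paired_t_statistic(with_structural, style_only)
print(round(t_value, 3), dof)
```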

6.2 Classification Technique Comparison

The SVM technique significantly outperformed the decision tree classifier in terms of accuracy, precision and recall. This further supports the results of previous studies which say that SVM is better equipped to handle larger feature sets and noisier data. The difference in accuracy between classifiers across Arabic messages was far greater than it was in English messages: SVM outperformed C4.5 by more than 20 percent on all feature set combinations.

6.3 Decision Tree Analysis

Elongation features and nearly half of the word roots were among the important attributes and should be considered in future analyses. In the English messages, word length played a major role (40 percent) compared with the Arabic messages (20 percent) [1]. The importance of punctuation, function words, and word-based structural features was fairly consistent across both languages, suggesting that syntactic and structural characteristics are fairly robust feature categories across languages. The messages extracted from the Al-Aqsa group made heavy use of varied font sizes, colors, hyperlinks and embedded images, leading to a greater disparity in feature importance in the technical structure category.

7. CONCLUSION AND FUTURE WORK

The experiments have demonstrated that with a set of carefully selected features and an effective learning algorithm, the authors of Internet newsgroup and email messages can be identified with a reasonably high accuracy. There was a significant improvement in performance when structural features were added on top of style markers. Also, the SVM classifier outperformed the other classifiers on all occasions.

The experimental results have shown a promising future for applying automatic authorship analysis approaches in cybercrime investigation to address the identity-tracing problem. Using such techniques, investigators would be able to identify major cyber criminals who post illegal messages on the Internet, even though they may use different identities.

This study can be expanded in the future to include more authors and messages, both to further demonstrate the scalability and feasibility of the SVM classifier and to find other classifiers suitable for the task. The current approach can also be extended to analyze the authorship of other cybercrime-related materials, such as bomb threats, hate speeches, and child-pornography images [2]. Another, more challenging future direction is to automatically generate an optimal feature set that is specifically suited to a given dataset.
