Support System For Predicting Sales Performance


02 Nov 2017


Abstract-Mining online reviews is a popular way to identify the opinions and sentiments of people towards the products or services they have received. Online reviews present a wealth of information on products and services and, if properly utilized, can provide vendors with highly valuable network intelligence and social intelligence to facilitate the improvement of their business. This paper concentrates on the movie domain and helps to move from simple negative or positive classification of reviews towards a deeper understanding of the sentiments in blogs. Text classification enables computers to process unstructured data intelligently, and industries make use of this unstructured data because it is a rich source of information. A review embodies a set of sentiment factors that are not directly observable. To capture the complex nature of sentiments, Sentiment PLSA (S-PLSA) is used, in which a review is considered as a document generated by a number of hidden sentiment factors. Classification and clustering based on these hidden sentiment factors provide better results, since these factors express what the reviewer really feels about the product. Sentiments, past sales performance and the quality of the reviews can be utilized to predict the future sales performance of the movies which are currently playing in the theatres.

Index Terms—Review mining, sentiment analysis, prediction, classification, clustering.

1 INTRODUCTION

The Internet has become a widespread platform where people can share their opinions and sentiments about the products or services they have received. In many cases our decisions are influenced by the opinions of others. Customers can easily express what they feel about particular products or services on e-commerce websites and blogs. People publish their reviews using the facilities provided by different websites, and other customers can make use of the opinions posted by different users, so customers' purchasing decisions depend on each other. Two groups of people can make use of these review websites: customers and manufacturers. Before purchasing a product, a customer can check what other customers think about it and decide according to the reviews posted online, whereas manufacturers can adjust their products, services and marketing strategies based on the reviews that customers publish online.

With the growing availability and popularity of opinion-rich resources such as online review websites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. Unfortunately, 85% of these opinion-rich resources are available only in unstructured format. This has encouraged analysts to develop intelligent systems that can automatically categorize or classify these text documents.

Prior studies focused on simple positive or negative classification of reviews; they did not consider the hidden sentiment factors underlying the reviews. Considering the deeper sentiments embodied in the reviews, however, can contribute to better sales performance prediction. Classification and clustering of online reviews based on these hidden sentiment factors yield better results than the traditional positive or negative classification of online reviews.

Prediction of product sales is a highly domain-driven task, for which several factors need to be analyzed. In this paper, using the movie domain as a case study, the various issues encountered in modeling reviews, producing sales predictions, and deriving actionable knowledge are investigated. Three factors play important roles in predicting future sales performance in the movie domain, namely public sentiments, past sales performance, and review quality, and a framework is proposed for sales prediction that incorporates all of them. Simply classifying reviews as positive or negative, as most current sentiment mining approaches are designed to do, does not provide a comprehensive understanding of the sentiments reflected in reviews. In order to model the multifaceted nature of sentiments, the sentiments embedded in reviews can be viewed as the outcome of the joint contribution of a number of hidden factors, and a novel approach to sentiment mining based on Probabilistic Latent Semantic Analysis (PLSA) is proposed, which is called Sentiment PLSA (S-PLSA). Therefore, instead of taking a "bag-of-words" approach that considers all the words present in the blogs, S-PLSA focuses on the words that are sentiment related.

The second factor that needs to be considered is the past sales performance of the same product or, in the movie domain, the past box office performance of the same movie. This effect can be captured through the use of an Autoregressive (AR) model, which has been widely used in many time series analysis problems. Combining this AR model with sentiment information mined from the reviews, a new model for product sales prediction called the Autoregressive Sentiment Aware (ARSA) model is proposed.

Since online reviews are of different quality and carry different predictive power, each one should be treated separately in producing the prediction. The quality factor is thus of considerable importance in sales prediction. It is incorporated into the ARSA model, resulting in an Autoregressive Sentiment and Quality Aware (ARSQA) model for sales prediction.

In summary, this paper makes the following contributions:

Using the movie domain as a case study, the problem of predicting sales performance using online reviews is addressed as a domain-driven task, and the important factors involved in generating the prediction are identified.

Using the S-PLSA model, classification and clustering of online reviews based on the sentiments is performed.

The rest of the paper is organized as follows. Section 2 provides a brief review of related work. Section 3 discusses the characteristics of decision support systems and data mining. Section 4 describes S-PLSA, a probabilistic approach to sentiment mining, and Section 5 presents ARSA, the sentiment-aware model for predicting future product sales. Section 6 considers the quality factor and presents the ARSQA model. Section 7 explains the classification and clustering of reviews based on the sentiment values, and Section 8 concludes this paper.

2 RELATED WORK

2.1 Review Mining

With the rapid growth of online reviews, review mining has attracted a great deal of attention. Early work in this area was primarily focused on determining the semantic orientation of reviews. Considering the problem of classifying documents not by topic but by overall sentiment, Pang and Lee [5] determined whether a review is positive or negative. They chose movie reviews as the training and testing domain, with IMDB as the source, and their experiments used 700 positive and 700 negative reviews. To classify reviews by sentiment they used three machine learning techniques: Naïve Bayes, Maximum Entropy, and Support Vector Machines. In follow-up work, they propose to first extract the subjective portion of the text with a graph min-cut algorithm [6] and then feed it into the sentiment classifier. They label the sentences in the document as either subjective or objective, discard the latter, and then apply a standard machine learning classifier to the resulting extract. Instead of applying the straightforward frequency-based bag-of-words feature selection methods, Whitelaw et al. [4] defined the concept of "adjectival appraisal groups" headed by an appraising adjective and optionally modified by words like "not" or "very." Each appraisal group was further assigned four types of features: attitude, orientation, graduation, and polarity. We use the same words and phrases from the appraisal groups to compute the reviews' feature vectors, since such adjectival appraisal words play a vital role in sentiment mining and need to be distinguished from other words.

There are also studies that work at a finer level and use words as the classification subject. They classify words into two groups, "good" and "bad," and then use certain functions to estimate the overall "goodness" or "badness" score for the documents. Kamps and Marx [7] propose to evaluate the semantic distance from a word to good/bad with WordNet. Turney [8] measures the strength of sentiment by the difference between the Pointwise Mutual Information (PMI) of the given phrase and "excellent" and the PMI of the given phrase and "poor." Extending previous work on explicit two-class classification, Pang and Lee [9], and Zhang and Varadarajan [10] attempt to determine the author's opinion on different rating scales (i.e., the number of stars). Liu et al. build a framework to compare consumer opinions of competing products using multiple feature dimensions. After deducing supervised rules from product reviews, the strengths and weaknesses of the product are visualized with an "Opinion Observer."
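As an illustration of the PMI-based orientation score attributed to Turney [8] above, the following minimal sketch computes the difference of two PMI values from co-occurrence counts. The function names and the counts are hypothetical and chosen only for this example; they are not taken from the cited work.

```python
import math

def pmi(joint_count, count_a, count_b, total):
    """Pointwise mutual information of two events, estimated from raw counts."""
    p_joint = joint_count / total
    return math.log2(p_joint / ((count_a / total) * (count_b / total)))

def semantic_orientation(near_excellent, near_poor, count_phrase,
                         count_excellent, count_poor, total):
    """PMI(phrase, "excellent") minus PMI(phrase, "poor")."""
    return (pmi(near_excellent, count_phrase, count_excellent, total)
            - pmi(near_poor, count_phrase, count_poor, total))

# Hypothetical counts: the phrase co-occurs 8 times with "excellent"
# and 2 times with "poor" in a corpus of 10,000 text windows.
score = semantic_orientation(8, 2, 10, 100, 100, 10_000)
```

A positive score suggests that the phrase leans towards "excellent", a negative one towards "poor".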

2.2 Economic impact of online reviews

Newly released products need to be marketed effectively to succeed in the long run. Some studies attempt to answer the question of whether the polarity and the volume of reviews available online have a measurable and significant effect on actual customer purchasing [1]. In most of these studies, the sentiments are captured by explicit rating indications such as the number of stars; only a few studies have tried to use text mining strategies for sentiment classification. To fill this gap, Ghose and Ipeirotis [2] argue that review texts contain richer information that cannot be easily captured using simple numerical ratings. In their study, they assign a "dollar value" to a collection of adjective-noun pairs, as well as adverb-verb pairs, and examine how they influence the bidding prices of different products at Amazon. Our work is similar to [2] in the sense that we also utilize the textual information to capture the underlying sentiments in the reviews. However, their approach mainly focuses on quantifying the extent to which the textual content, especially the subjectivity of each review, affects product sales on a market such as Amazon, while our method aims to build a more fundamental framework for predicting sales performance using multiple factors. Foutz and Jank [11] also exploit the wisdom of crowds to predict the box office performance of movies.

3 DECISION SUPPORT SYSTEM

Data mining is a process of pattern and relationship discovery within huge sets of data. It draws on several fields, including pattern recognition, statistics, computer science, and database management. The definition of data mining therefore depends very much on the point of view of the writer who gives it. For example, from the perspective of pattern recognition, data mining is defined as the process of identifying legitimate, new, and easily understood patterns within the data set. In still broader terms, the main goal of data mining is to convert data into meaningful information. More specifically, one primary goal of data mining is to discover new patterns for the users. The discovery of new patterns can serve two purposes: description and prediction. Description focuses on finding patterns and presenting them to users in an interpretable and understandable form. Prediction involves identifying variables or fields in the database and using them to predict future values or behavior of some entities. Data mining is well suited to provide decision support in many areas. For example, healthcare organizations face increasing pressure to improve the quality of care while reducing costs. Because of the large volume of data generated in healthcare settings, it is not surprising that healthcare organizations have been interested in data mining to enhance physician practices, disease management, and resource utilization.

Rapid advances in information and sensor technologies (IT and ST), along with the availability of large-scale scientific and business data repositories and database management technologies, combined with breakthroughs in computing technologies, computational methods and processing speeds, have opened the floodgates to data-dictated models and pattern matching. The use of sophisticated and computationally intensive analytical methods is expected to become even more commonplace with recent research breakthroughs in computational methods and their commercialization by leading vendors. Scientists and engineers have developed innovative methodologies for extracting correlations and associations, dimensionality reduction, clustering or classification, regression and predictive modeling, tools based on expert systems and case-based reasoning, as well as decision support systems for batch or real-time analysis. They have utilized tools from areas like traditional statistics, signal processing and artificial intelligence as well as emerging fields like data mining, machine learning, operations research, systems analysis and nonlinear dynamics.

3.1 Data Mining

In the knowledge management process, data mining techniques can be used to extract and discover valuable and meaningful knowledge from a large amount of data. Nowadays, data mining receives a great deal of attention in the information industry and in society as a whole. Data mining is being used both to increase revenues and to reduce costs, and the potential returns are enormous. Innovative organizations worldwide are already using data mining to locate and appeal to higher-value customers, to reconfigure their product offerings to increase sales, and to minimize losses due to error or fraud. The major tasks in data mining include classification and prediction, concept description, rule association, cluster analysis, outlier analysis, trend and evolution analysis, and statistical analysis. In supervised learning, classification refers to the mapping of data items into one of the predefined classes. Clustering divides a database into different groups. The goal of clustering is to find groups that are very different from each other, and whose members are very similar to each other.

4 S-PLSA: A PROBABILISTIC APPROACH TO SENTIMENT MINING

In this section, we use a probabilistic approach to analyzing sentiments in reviews, which will serve as the basis for predicting sales performance and classification and clustering of reviews.

4.1 Sentiment PLSA

We first consider the problem of feature selection, i.e., how to represent a given review as an input to the mining algorithms. The traditional way to do this is to calculate the (relative) frequencies of various words in a given document (review) and use the resulting multidimensional feature vector as the representation of the document. Here we do not concentrate on all the words but only on the words which are sentiment related. This set of words, called the appraisal words, can be taken from the lexicon constructed by Whitelaw et al. [4], which consists of 2030 words.

Traditional text mining algorithms cannot easily address all the challenges involved in mining opinions and sentiments, because sentiments appear in natural language in complex and subtle ways. Sentiments can differ from one another in polarity, orientation and graduation. Sales performance prediction and the classification and clustering of reviews all require accurate extraction of sentiments. We therefore use a probabilistic model called Sentiment Probabilistic Latent Semantic Analysis (S-PLSA), in which a review is considered as being generated under the combination of a set of hidden sentiment factors.

We now formally present S-PLSA. Suppose we are given a set of reviews B = {b1, …, bN} and a set of words (appraisal words) from a vocabulary W = {w1, …, wM}. The review data can be described as an N × M matrix D = (c(bi, wj))i,j, where c(bi, wj) is the number of times wj appears in review bi. Each row in D is then a frequency vector that corresponds to a review.
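The construction of the matrix D can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the tokenizer is a simple regular expression, and the vocabulary and reviews are hypothetical stand-ins for the appraisal lexicon and the real review data.

```python
import re
from collections import Counter

def build_count_matrix(reviews, vocabulary):
    """Return D, where D[i][j] is the number of times vocabulary[j] occurs in reviews[i]."""
    matrix = []
    for text in reviews:
        # Lowercase and keep alphabetic tokens; a real system would use a proper tokenizer.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in the review.
        matrix.append([counts[word] for word in vocabulary])
    return matrix

# Hypothetical appraisal words and reviews, for illustration only.
vocabulary = ["good", "bad", "brilliant", "boring"]
reviews = [
    "A good film with a brilliant cast. Really good.",
    "Boring plot and bad acting.",
]
D = build_count_matrix(reviews, vocabulary)
```

Words outside the vocabulary are ignored, which mirrors the decision to count only sentiment-related words instead of the full bag of words.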

S-PLSA is a latent variable model for co-occurrence data ((b, w) pairs) that associates with each (b, w) observation an unobserved hidden variable from the set of hidden sentiment factors, Z = {z1, …, zK}. Just as in PLSA, where hidden factors correspond to the "topics" of the documents, in S-PLSA those factors may correspond to the sentiments embodied in the reviews (e.g., joy, surprise, disgust, etc.). Such sentiments are not directly observable in the reviews; rather, they are expressed through the use of combinations of appraisal words.

Our goal is to find out how much each hidden sentiment factor z ∈ Z contributes to a review b, which requires estimating the parameters of the model. A widely used method to perform maximum likelihood parameter estimation for models involving latent variables (such as our S-PLSA model) is the Expectation-Maximization (EM) algorithm [32], which involves an iterative process with two alternating steps.

1. An Expectation step (E-step), where posterior probabilities for the latent variables (in our case, the variable z) are computed, based on the current estimates of the parameters.

2. A Maximization step (M-step), where estimates for the parameters are updated to maximize the complete data likelihood.

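In our model, the parameters are Pr(b|z), Pr(w|z), and Pr(z). As for their initial values, Pr(w|z) is initialized randomly, while Pr(z) and Pr(b|z) are initialized to fixed values [unspecified]. We can show that the algorithm requires alternating between the following two steps.

In the E-step, we compute

$$\Pr(z \mid b, w) = \frac{\Pr(z)\,\Pr(b \mid z)\,\Pr(w \mid z)}{\sum_{z' \in Z} \Pr(z')\,\Pr(b \mid z')\,\Pr(w \mid z')}.$$

In the M-step, we update the model parameters with

$$\Pr(w \mid z) = \frac{\sum_{b \in B} c(b, w)\,\Pr(z \mid b, w)}{\sum_{b \in B} \sum_{w' \in W} c(b, w')\,\Pr(z \mid b, w')},$$

$$\Pr(b \mid z) = \frac{\sum_{w \in W} c(b, w)\,\Pr(z \mid b, w)}{\sum_{b' \in B} \sum_{w \in W} c(b', w)\,\Pr(z \mid b', w)},$$

and

$$\Pr(z) = \frac{\sum_{b \in B} \sum_{w \in W} c(b, w)\,\Pr(z \mid b, w)}{\sum_{b \in B} \sum_{w \in W} c(b, w)}.$$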

It can be shown that each iteration above monotonically increases the complete data likelihood, and the algorithm converges when a local optimal solution is achieved. Once the parameter estimation for the model is completed, we can compute the posterior probability Pr(z|b) using the Bayes rule:

$$\Pr(z \mid b) = \frac{\Pr(b \mid z)\,\Pr(z)}{\sum_{z' \in Z} \Pr(b \mid z')\,\Pr(z')}.$$

Intuitively, Pr(z|b) represents how much a hidden sentiment factor z ∈ Z "contributes" to the review b. Therefore, the set of probabilities {Pr(z|b) | z ∈ Z} can be considered as a succinct summarization of b in terms of sentiments.
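The EM procedure for this kind of model can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard EM updates for a symmetric PLSA model with parameters Pr(z), Pr(b|z) and Pr(w|z), it assumes uniform starting values for Pr(z) and Pr(b|z) with random Pr(w|z), and the names `normalize` and `splsa_em` are hypothetical.

```python
import random

def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they are all zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]

def splsa_em(D, K, iterations=50, seed=0):
    """Fit K hidden factors to the count matrix D (N reviews x M words).

    Returns (Pr(z), Pr(b|z), Pr(w|z), Pr(z|b)) as nested lists.
    """
    rng = random.Random(seed)
    N, M = len(D), len(D[0])
    # Assumed initialization: uniform Pr(z) and Pr(b|z), random Pr(w|z).
    p_z = [1.0 / K] * K
    p_b_z = [[1.0 / N] * N for _ in range(K)]
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(M)]) for _ in range(K)]

    for _ in range(iterations):
        acc_b = [[0.0] * N for _ in range(K)]  # expected counts per (z, b)
        acc_w = [[0.0] * M for _ in range(K)]  # expected counts per (z, w)
        for i in range(N):
            for j in range(M):
                if D[i][j] == 0:
                    continue
                # E-step: posterior Pr(z | b_i, w_j) over the K factors.
                posterior = normalize(
                    [p_z[k] * p_b_z[k][i] * p_w_z[k][j] for k in range(K)]
                )
                for k in range(K):
                    acc_b[k][i] += D[i][j] * posterior[k]
                    acc_w[k][j] += D[i][j] * posterior[k]
        # M-step: re-estimate the parameters from the expected counts.
        p_z = normalize([sum(row) for row in acc_b])
        p_b_z = [normalize(row) for row in acc_b]
        p_w_z = [normalize(row) for row in acc_w]

    # Bayes rule: Pr(z|b) is proportional to Pr(b|z) * Pr(z).
    p_z_b = [normalize([p_b_z[k][i] * p_z[k] for k in range(K)]) for i in range(N)]
    return p_z, p_b_z, p_w_z, p_z_b
```

Each row of the returned Pr(z|b) table is the sentiment summary of one review, and it is this vector that the later prediction, classification and clustering steps consume. Because EM only finds a local optimum, different seeds may give different factors.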

5 ARSA: A SENTIMENT-AWARE MODEL

The box office revenue of the current day is strongly correlated with that of the preceding days. For better sales performance prediction, the sentiments mined from the reviews can be combined with these past revenues.

5.1 The Autoregressive model

The temporal relationship between the box office revenues of the preceding days and the current day can be well modeled by an autoregressive process. The box office revenue of the movie of interest at day t is denoted by $x_t$.

An AR process of order p is as follows:

$$x_t = \sum_{i=1}^{p} \phi_i\, x_{t-i} + \epsilon_t,$$

where $\phi_1, \ldots, \phi_p$ are the parameters of the model, and $\epsilon_t$ is an error term.
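How the parameters of an AR process of order p might be estimated can be sketched with ordinary least squares. This is one common choice and an assumption here, since the text does not state the estimation method; the coefficient list is named `phi` for this example, and all function names are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_ar(series, p):
    """Least-squares estimate of phi[0..p-1], where phi[i-1] multiplies x_{t-i}."""
    rows = [[series[t - i] for i in range(1, p + 1)] for t in range(p, len(series))]
    targets = [series[t] for t in range(p, len(series))]
    # Normal equations: (X^T X) phi = X^T y.
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * y for r, y in zip(rows, targets)) for a in range(p)]
    return solve_linear_system(xtx, xty)

def predict_next(series, phi):
    """One-step forecast from the p most recent values."""
    return sum(phi[i] * series[-1 - i] for i in range(len(phi)))
```

For example, `fit_ar(series, 2)` followed by `predict_next(series, phi)` gives the forecast for the day after the last observation. The solver assumes the normal-equation matrix is non-singular, which can fail for very short or constant series.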

Once the parameters are learned, the box office revenue $x_t$ can be predicted from $x_{t-1}, x_{t-2}, \ldots, x_{t-p}$. An AR model, however, is appropriate only for stationary time series, so some preprocessing steps are required in order to properly model the time series $\{x_t\}$. The first step is to remove the trend. This is achieved by first transforming the time series $\{x_t\}$ into the logarithmic domain, and then differencing the resulting time series. The second step is to remove seasonality. After the preprocessing steps, a new AR model can be formed on the resulting time series, denoted here by $\{y_t\}$.
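The two preprocessing steps can be sketched as below. The log transform and first difference follow the description above; the way seasonality is removed is not specified in the text, so the seasonal differencing shown here, and its default period of 7 days, are assumptions made only for this example.

```python
import math

def log_difference(revenues):
    """Remove the trend: take logarithms, then difference consecutive days."""
    logs = [math.log(x) for x in revenues]
    return [logs[t] - logs[t - 1] for t in range(1, len(logs))]

def seasonal_difference(series, period=7):
    """Remove seasonality by subtracting the value one period earlier (assumed method)."""
    return [series[t] - series[t - period] for t in range(period, len(series))]

# Hypothetical daily revenues that double every day.
revenues = [100.0 * 2 ** t for t in range(10)]
detrended = log_difference(revenues)        # every entry is log(2)
stationary = seasonal_difference(detrended)  # every entry is 0
```

Revenues must be strictly positive for the logarithm to be defined, and each differencing step shortens the series, so the first few days have no preprocessed value.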

5.2 Incorporating sentiments

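After incorporating the sentiments, the ARSA model can be formulated as follows:

$$y_t = \sum_{i=1}^{p} \phi_i\, y_{t-i} + \sum_{i=1}^{q} \sum_{k=1}^{K} \rho_{i,k}\, \omega_{t-i,k} + \epsilon_t,$$

where $y_t$ is the preprocessed box office revenue at day t, $\omega_{t,k}$ is the sentiment information for the kth hidden sentiment factor at day t (the average of $\Pr(z_k \mid b)$ over the reviews b posted on that day), and $\epsilon_t$ is an error term. Here p, q, and K are user-chosen parameters, while $\phi_i$ and $\rho_{i,k}$ are parameters whose values are to be estimated using the training data. Parameter q specifies how many preceding days of sentiment information are taken into account, and K indicates the number of hidden sentiment factors used by S-PLSA to represent the sentiment information.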
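A one-step ARSA forecast can be sketched as follows, assuming the model is the sum of an autoregressive term over the p latest preprocessed revenues and a sentiment term over the q latest days, with the daily sentiment taken as the mean of the per-review S-PLSA posteriors. The names `daily_sentiment`, `arsa_predict`, `omega`, `phi` and `rho` are chosen for this example, and the parameter values are hypothetical rather than fitted.

```python
def daily_sentiment(posteriors_by_day):
    """omega[t][k]: mean of Pr(z_k | b) over the reviews b posted on day t.

    Assumes every day has at least one review.
    """
    omega = []
    for day in posteriors_by_day:
        K = len(day[0])
        omega.append([sum(review[k] for review in day) / len(day) for k in range(K)])
    return omega

def arsa_predict(y, omega, phi, rho):
    """Forecast the next value of y.

    phi[i-1] multiplies y_{t-i}; rho[i-1][k] multiplies omega_{t-i,k}.
    """
    ar_part = sum(phi[i] * y[-1 - i] for i in range(len(phi)))
    sentiment_part = sum(
        rho[i][k] * omega[-1 - i][k]
        for i in range(len(rho))
        for k in range(len(rho[i]))
    )
    return ar_part + sentiment_part

# Two days of hypothetical review posteriors with K = 2 factors.
posteriors_by_day = [[[0.2, 0.8], [0.4, 0.6]], [[1.0, 0.0]]]
omega = daily_sentiment(posteriors_by_day)
forecast = arsa_predict(y=[0.1, 0.2], omega=omega, phi=[0.5], rho=[[1.0, -1.0]])
```

In practice the coefficients would be estimated from training data, for instance by least squares over a design matrix that stacks the lagged revenues and lagged sentiment values.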

6 ARSQA MODEL

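Let $v_t$ be the number of reviews posted at day t. Also, recall that $\omega_{t,j,k}$ is the inferred probability of the kth sentiment factor in the jth review at time t, which we assume can be obtained based on S-PLSA. Denote by $\delta_{t,j}$ the quality of the jth review on day t. Then, the prediction model can be formulated as follows:

$$y_t = \sum_{i=1}^{p} \phi_i\, y_{t-i} + \sum_{i=1}^{q} \sum_{k=1}^{K} \rho_{i,k}\, \frac{1}{v_{t-i}} \sum_{j=1}^{v_{t-i}} \delta_{t-i,j}\, \omega_{t-i,j,k} + \epsilon_t,$$

where $y_t$ is the preprocessed box office revenue at day t, $\phi_i$ and $\rho_{i,k}$ are the parameters to be estimated from the training data, and $\epsilon_t$ is an error term.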
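The quality-aware sentiment term for a single day can be sketched as below. The exact weighting is an assumption made for this example: each review's sentiment probabilities are multiplied by its quality score and the result is averaged over the number of reviews posted that day. The function name and the numbers are hypothetical.

```python
def quality_weighted_sentiment(posteriors, qualities):
    """Average of quality * Pr(z_k | review) over one day's reviews, per factor k.

    posteriors[j][k] is the probability of factor k in review j;
    qualities[j] is the quality score of review j.
    """
    review_count = len(posteriors)
    K = len(posteriors[0])
    return [
        sum(q * review[k] for q, review in zip(qualities, posteriors)) / review_count
        for k in range(K)
    ]

# Two hypothetical reviews: the second is judged to be of lower quality.
psi = quality_weighted_sentiment([[0.2, 0.8], [0.6, 0.4]], [1.0, 0.5])
```

With all quality scores equal to 1 this reduces to the plain daily average of the sentiment probabilities, so the sentiment-only model is recovered as a special case.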

7 CLASSIFICATION AND CLUSTERING

7.1 Classification of reviews by decision tree induction

The classification process has two phases. The first phase is the learning process, in which the training data are analyzed by the classification algorithm and the learned model, or classifier, is represented in the form of classification rules. The second phase is the classification process, in which test data are used to estimate the accuracy of the classifier. If the accuracy is considered acceptable, the rules can be applied to the classification of new data. In classification, a model is constructed to predict categorical labels. These categories are discrete values for which the ordering of values has no meaning.

Decision tree induction adopts a greedy approach in which decision trees are constructed in a top-down recursive divide-and-conquer manner. Most algorithms for decision tree induction follow such a top-down approach, which starts with a training set of tuples and their associated class labels. The training set is recursively partitioned into smaller subsets as the tree is being built. Here we predict the class label of a review based on its sentiment values for the different sentiment factors. We consider four sentiment factors, namely z1, z2, z3 and z4, for the classification of reviews based on the sentiments.

Algorithm: Generate_decision_tree

Input:
    Data partition D, a set of training tuples and their associated class labels
    attribute_list, the set of candidate attributes
    Attribute_selection_method, a procedure to determine the splitting criterion

Output: A decision tree

Method:
    create a node N;
    if tuples in D are all of the same class C then
        return N as a leaf node labeled with class C;
    if attribute_list is empty then
        return N as a leaf node labeled with the majority class in D;
    apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
    label node N with splitting_criterion;
    if splitting_attribute is discrete-valued and multiway splits are allowed then
        attribute_list ← attribute_list − splitting_attribute;
    for each outcome j of splitting_criterion
        let Dj be the set of tuples in D satisfying outcome j;
        if Dj is empty then
            attach a leaf labeled with the majority class in D to node N;
        else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
    endfor
    return N;
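The algorithm leaves the attribute selection procedure open. One common choice is information gain, which is sketched below for a binary split on a numeric attribute. The choice of information gain, the threshold split, and the toy data are assumptions made for this example and are not taken from the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute, threshold):
    """Entropy reduction from splitting on rows[i][attribute] <= threshold."""
    left = [lab for row, lab in zip(rows, labels) if row[attribute] <= threshold]
    right = [lab for row, lab in zip(rows, labels) if row[attribute] > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Hypothetical one-attribute data set with two classes.
rows = [(0.1,), (0.2,), (0.8,), (0.9,)]
labels = ["a", "a", "b", "b"]
best = information_gain(rows, labels, attribute=0, threshold=0.5)
worse = information_gain(rows, labels, attribute=0, threshold=0.1)
```

A selection procedure built on this would evaluate candidate attributes and thresholds and return the pair with the highest gain as the splitting criterion.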

Training set

Tid   Pr(b|z1)   Pr(b|z2)   Pr(b|z3)   Pr(b|z4)   sentiment
1     0.4        0.7        0.2        0.2        z2
2     0.3        0.4        0.9        0.2        z3
3     0.8        0.2        0.6        0.4        z1
4     0.5        0.7        0.4        0.8        z4
5     0.6        0.2        0.3        0.5        z1
6     0.3        0.4        0.7        0.3        z3
7     0.9        0.3        0.7        0.6        z1
8     0.1        0.5        0.8        0.6        z3
9     0.6        0.4        0.7        0.4        z3
10    0.4        0.6        0.7        0.9        z4

Decision tree:

Pr(b|z1) > Pr(b|z2), Pr(b|z3), Pr(b|z4)?
    yes: z1
    no:  Pr(b|z2) > Pr(b|z1), Pr(b|z3), Pr(b|z4)?
        yes: z2
        no:  Pr(b|z3) > Pr(b|z1), Pr(b|z2), Pr(b|z4)?
            yes: z3
            no:  z4
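The decision tree above can be written directly as a small classifier and checked against the ten training tuples. This sketch hard-codes the three tests of the tree instead of inducing them from data, and the names `TRAINING_SET` and `classify` are chosen for this example.

```python
# Rows: (Pr(b|z1), Pr(b|z2), Pr(b|z3), Pr(b|z4), sentiment label)
TRAINING_SET = [
    (0.4, 0.7, 0.2, 0.2, "z2"),
    (0.3, 0.4, 0.9, 0.2, "z3"),
    (0.8, 0.2, 0.6, 0.4, "z1"),
    (0.5, 0.7, 0.4, 0.8, "z4"),
    (0.6, 0.2, 0.3, 0.5, "z1"),
    (0.3, 0.4, 0.7, 0.3, "z3"),
    (0.9, 0.3, 0.7, 0.6, "z1"),
    (0.1, 0.5, 0.8, 0.6, "z3"),
    (0.6, 0.4, 0.7, 0.4, "z3"),
    (0.4, 0.6, 0.7, 0.9, "z4"),
]

def classify(probabilities):
    """Walk the tree: the first of z1, z2, z3 that beats all other factors wins."""
    labels = ["z1", "z2", "z3", "z4"]
    for index in range(3):
        others = probabilities[:index] + probabilities[index + 1:]
        if all(probabilities[index] > other for other in others):
            return labels[index]
    return "z4"  # remaining leaf when none of the three tests succeeds

predictions = [classify(row[:4]) for row in TRAINING_SET]
```

On this training set every tuple is assigned its listed label. Note that a tie for the largest probability falls through to the z4 leaf, because each test uses a strict comparison.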

7.2 Clustering of reviews using k-means algorithm

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. We group the reviews based on their probability values for each sentiment factor, and we can identify as many clusters as the number of sentiment factors we choose. We consider four sentiment factors, namely z1, z2, z3 and z4, for the clustering of reviews based on the sentiments.

Algorithm: k-means

Input:
    k: the number of clusters
    D: a data set containing n objects

Output: A set of k clusters

Method:
    arbitrarily choose k objects from D as the initial cluster centers;
    repeat
        (re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster;
        update the cluster means;
    until no change;
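The k-means procedure can be sketched in plain Python as follows. Squared Euclidean distance is assumed as the similarity measure, the initial centers are drawn with a seeded random generator, and the sample points are hypothetical two-factor sentiment vectors used only for illustration.

```python
import random

def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster points (tuples of floats) into k groups; returns (assignment, centers)."""
    rng = random.Random(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iterations):
        # (Re)assign each object to the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no change, so the clustering has converged
        assignment = new_assignment
        # Update each cluster mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Hypothetical sentiment vectors forming two well-separated groups.
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
assignment, centers = kmeans(points, k=2)
```

To cluster reviews as described above, each point would be a review's vector of sentiment probabilities and k would be set to the number of sentiment factors.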

8 CONCLUSIONS

Online reviews convey the sentiments and opinions of people towards particular products or services they have received. Using the movie domain as a case study, we studied the problem of predicting future sales performance. We have identified important characteristics of movie reviews. The outcome of this work leads to actionable knowledge that can be readily employed by decision makers.

A centerpiece of this work is S-PLSA, which summarizes the reviews in terms of sentiments and helps us move from simple "negative or positive" classification toward a deeper comprehension of the sentiments in blogs. Using the sentiments, past sales performance and the quality factor, the ARSQA model is developed, which predicts the future sales performance of the movies which are currently playing in the theatres. We also explore the use of S-PLSA as a tool for the classification and clustering of reviews based on sentiment, which leads to better performance.


