Basic Business Analytics Measures


Chapter 2

This chapter of this introductory business analytics book has a rather heterogeneous structure. This may puzzle and confuse readers who expect a basic introductory text that takes them step by step through new concepts in a linear fashion.

While this approach is appropriate for classical disciplines, e.g. statistics, econometrics, or marketing, I considered it inappropriate for business analytics. The multidisciplinary nature of this field requires that both quantitative and marketing concepts be covered and spelled out for the novice reader. As these disciplines are essentially different from one another, an intuitive approach would be to convey this knowledge in two distinct parts: the marketing part and the quantitative part.

However, one cannot exist without the other in the context of business analytics. It is essential that the quantitative techniques be mastered; however, their mastery is mostly useless if we do not understand the business need that generated them and cannot translate the results into actionable recommendations. On the other hand, good marketing knowledge and acumen are essentially useless if the business analyst is not able to use the available data and tools to extract new knowledge and generate basic inputs for marketing and business decisions.

This chapter tries to point out and foster this strong co-dependency between business and marketing knowledge on one hand, and the quantitative and data processing aspects on the other hand, from the very beginning. Marketing and quantitative knowledge are presented in turns, an alternation that continues later in the book.

Another characteristic of this chapter is that it introduces the basic knowledge and techniques needed to use the R software effectively, along with basic quantitative and data processing topics. Having a practical focus, the aim of this chapter is to apply some basic quantitative concepts and data manipulations immediately after R and its user-friendly interface are installed and functional. While this is difficult to pull off, it is a tremendous advantage of this text over other introductory books, which are either too focused on software manipulation, or maintain a theoretical focus and leave the practical implementation of statistical methods confined to a closing chapter or to an external reference.

The self-contained purpose and mission of this book led to conceiving a heterogeneous chapter in which all three essential knowledge pillars that hands-on business analytics rests upon are represented: marketing concepts; quantitative and data processing concepts and issues; and installation, setup and basic data manipulations using R.

2.1 Basic concepts used in business analytics

2.1.1 Marketing Concepts

Customer Base

The customer base is defined as the total number of clients a business has at a given moment in time. It is useful to see it as the sales-generating base, to understand its structure (how many customers buy a certain product, or a certain combination of products), and, where feasible, to take a medium- or long-term view of it. It is a must to look at it in detail, knowing how many customers buy certain products and services (or bundles of products and services), how much they spend, how often they do business, and how likely they are to keep doing business with a particular company.

Churn Rate

Often referred to as the attrition rate, this is a key business analytics metric that tells how many customers stop doing business with a company in a given period of time, or stop buying certain products and services.

While ‘churn’ is a simple concept, its relevance and applications vary a lot with the nature of a business. Some businesses have very few repeat customers, in which case the churn rate tells them what they can expect from their customers, and whether anything can be done for those who do business with them again, if this can turn into additional profits. Others face strong competition, in which case it is essential to know how many customers are still doing business with them, how many have left, and how many they have managed to attract from their competitors.

Also, churn rates depend on the length of time customers are likely to use a product or a service. Thus, if a product is expected to be used for three months, it makes sense to compute a churn rate three months after the transaction date to see whether a customer bought the product again. In other cases, when products and services are similar in many respects, it would be wrong to consider that a customer has ‘churned’ if she buys another similar product, or a bundle of products. This behavior is related to what is called ‘cannibalization’.

A refinement of the churn concept is rotational churn. This refers to customers who have stopped doing business with a company at some point and have later come back to do business with it again. Any long-term calculation of churn should consider these customers, who can be an important segment of a customer database. Given their likelihood to go to competitors and then return, they may need to be addressed with special retention policies.
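As a minimal sketch, the calculation below computes a single-period churn rate in R; the customer counts are hypothetical and the object names are ours.

# Hypothetical example: churn rate for one period
customers_start <- 10000   # customers at the start of the period
customers_lost  <- 450     # customers who stopped doing business during the period
churn_rate <- customers_lost / customers_start
churn_rate                 # 0.045, i.e. 4.5% of the customer base churned in the period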

Customer Segmentation

This is a technique that helps classify customers using some of their key characteristics. Customers are often very diverse, with different needs, expectations, and ability and willingness to spend on products and services. Using the results of customer segmentation, a business can identify which customers are likely to buy particular products and services, how much they are willing to pay, and how often they are likely to make purchases from the company.

A basic segmentation framework takes into account the following factors: recency of purchases, frequency of purchases, and the amount usually purchased, often referred to as the RFM (Recency, Frequency, Monetary) framework. Segmentation is often based on geography (location of customers), and also takes into account basic personal characteristics: age, gender, family size, income, occupation, education, religion, race, nationality, living environment (urban, suburban, rural), generation (Y, X, baby boomers), and psychographic factors (lifestyle, personality, values) (Kotler and Keller, 2006).

Very often segmentation is fraught with difficulties, as data on customers is not readily available, is difficult or even impossible to obtain, or is unreliable. A good example is a landline telephone company. The ‘official’ customer that appears in the database is often a middle-aged to elderly individual who uses basic services such as landline voice. However, it is very likely that he/she has relatives living in the same household (sons, nieces, etc.) who may use internet connections, bundle packages with TV or IPTV services, or cellular and mobile data connections. If this is the case, it is often difficult to distinguish between the account holder's own needs and the needs of his/her household, with the latter subject to major changes (e.g. due to sons/daughters getting married, nephews/nieces going to college, etc.).

However, even when basic information is missing or limited, customer behavior may reveal facts that are not obvious and help one gain a better understanding of one's customers.

Market Basket/Association Analysis

This refers to products and services that are purchased together, or are likely to be purchased within relatively short periods of time.

Perhaps the most widely known example is that of major bookstores. They no longer sell only books, CDs, DVDs and other book-related items. They also sell tea, interior decoration items, gift items, even toys. While selling toys may come as no surprise, because bookstores also sell children's books, tea and interior decoration items are a result of this type of analysis. Research involving book lovers has shown that they have particular tastes: they are very likely to purchase tea, enjoy adorning their rooms with decorative items, and so on.

Market basket analysis is often used by retailers when bundling products, or when designing discounts and rebates for purchased items.

Note: these concepts are only a few of the ones you will need to know. For further reference, and to ensure the self-contained nature of this book, I have included a basic glossary of marketing terms at the end of the book as a quick and handy reference to help you with your future study and work.

2.1.2 Quantitative and data-related concepts

Price Elasticity

This concept refers to the impact of a price change on the sales of a given product or service. Price elasticity is useful when designing a pricing policy, or when assessing the impact of an unavoidable price increase.

For example, suppose a 10% increase in the price of widgets decreases sales by 5%. This gives a price elasticity of -5% / 10% = -0.5, which means that sales are relatively inelastic to price changes.

Elasticity = -0.5

Price     90%   92%   94%   96%   98%   100%  102%  104%  106%  108%  110%
Quantity  105%  104%  103%  102%  101%  100%   99%   98%   97%   96%   95%

(Values to the left of 100% correspond to a price decrease, values to the right to a price increase.)
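As a small sketch, the R commands below recompute the elasticity behind the schedule above; the object names are ours, and the percentages are taken from the table.

price    <- seq(0.90, 1.10, by = 0.02)     # price as a share of the current price (90% ... 110%)
quantity <- seq(1.05, 0.95, by = -0.01)    # quantity as a share of current sales (105% ... 95%)
base <- 6                                  # the middle point, where both price and quantity are 100%
elasticity <- (quantity[-base] - 1) / (price[-base] - 1)   # % change in quantity / % change in price
round(elasticity, 2)                       # every entry equals -0.5

# Single-point check matching the example in the text: a 10% price increase and a 5% sales drop
-0.05 / 0.10                               # -0.5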

Regression Analysis

Regression analysis helps one understand how the value of a dependent variable changes with variations in the value of one or more independent variables: for example, how beer sales increase when the temperature rises, or decrease during the rainy season. A closely related measure is correlation, which has limited applicability because it only describes the relationship between two variables, and because it offers few tools to assess whether the relationship is valid.

A typical relation modeled by regression analysis is:

y = α + b1·x1 + b2·x2 + … + bn·xn + ε

where y is the dependent variable, x1, …, xn are the independent or explanatory variables, α is the intercept, bi stands for the slope coefficient (the one that describes to what extent a change in xi will produce a change in y), and ε is the error term, which captures the unexplained part of the variation in y.

One of the advantages of regression analysis is that it allows one to model the evolution of a variable (e.g. sales) with respect to one or several other variables (past sales, weather, urban/rural area, season, etc.). Another advantage is that it comes with specific diagnostics and tools that allow one to see whether the analysis is good or not. Thus, metrics such as t-statistics and p-values help determine whether the hypothesized influence of an explanatory variable is reliable, or whether it has no relevance at all. Another key metric for regression analysis is R2, or the coefficient of determination, which tells you how much of the evolution of the target/dependent variable is captured by the explanatory variables.

Yet another useful output from regression analysis, useful for checking its validity, is the residual plot. This is a graph where the unexplained evolution of the target/dependent variable, given by the residual term ε, is plotted against the predicted value of the dependent variable; it helps you see whether your model is correct and whether there are problems with your analysis. We will revisit all these metrics and diagnostics in chapter 3.
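To make these diagnostics concrete, here is a small self-contained sketch with simulated data (the variable names and numbers are hypothetical, not taken from the book's data sets): a regression of beer sales on temperature and a rainy-day indicator, followed by the summary output and a residual plot.

set.seed(1)
temperature <- runif(100, 5, 35)                  # simulated daily temperatures
rainy       <- rbinom(100, 1, 0.3)                # 1 = rainy day, 0 = dry day
sales       <- 200 + 12 * temperature - 40 * rainy + rnorm(100, sd = 30)   # simulated beer sales

model <- lm(sales ~ temperature + rainy)          # fit the regression

summary(model)                                    # slope coefficients, t-statistics, p-values, R-squared
plot(fitted(model), resid(model),                 # residual plot: residuals against predicted values
     xlab = "Predicted sales", ylab = "Residuals")
abline(h = 0, lty = 2)                            # reference line at zero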

Data Issues and Data Quality

Very often in business analytics, and especially in unstructured environments, it is necessary to get and process data that has several issues. This may sound like a rather vague statement to a beginner in business analytics, but it has far more implications than it appears. Sometimes data is extremely unreliable: for example, for prepay customers of a mobile phone network, information on their employer or marital status may be years old and completely inaccurate. In other cases, data is not comparable over time (e.g. when the entire portfolio of products has changed), and sometimes it is just incomplete or flawed (e.g. sales for a particular customer are missing from the sales report, although they can be retrieved from its account statement).

No matter how these data problems occur, overcoming them is of utmost importance for the analysis. Using bad data makes the analysis useless from the very beginning. In analytics, many practitioners agree that it is much better to have good data and use an OK model than to use an advanced model with data that has a lot of issues. As a guideline, which comes from a basic statistics textbook, it should always be remembered that data is a numeric expression of something real (e.g. sales, customer base, price changes), and that all the calculations and methods applied should characterize it and show how it evolves. For example, statistics on the CPI (consumer price index) enable us to see whether and how much prices are rising, stagnating, or declining; likewise, sales numbers show whether sales have increased, stayed the same, or decreased, and by how much.

Coming back to data issues and data quality, there are a few things that should be considered when data is chosen, processed and used for business analytics purposes. The main aspects of data quality include the following:

• Accuracy: shows whether the data describes well what it is supposed to describe. For example, if we get a sales report for a given period, accuracy means that all transactions are there, are recorded exactly when they occur, that all prices and quantities are included and correct, and that all discounts are correctly applied.

• Completeness: data describes everything that it refers to in a given context. For example, sales data describes sales for all active customers, price data includes all prices for the products that are being sold, purchased or stocked.

• Update status: shows whether and when data has been updated or revised. Some macroeconomic indicators like GDP are first published as a flash estimate, which is followed by a final estimate. In this case it is very important to know when the data was updated so that we know its degree of accuracy, and whether it is likely to differ from previous data releases, if applicable.

Some companies handle large amounts of data stored in databases, which are being updated on a regular basis with a certain frequency. Here it is important to know when updates are available so as to ensure that the data reflects the latest information. For example, customer accounts may be updated only after the month end; in this case if we get past due information one or two days before that, we risk having major embarrassments as we may issue payment reminders for customers who already paid their invoices.

• Relevance: refers to how relevant a given data set is for the purpose for which it is being used. For example, if we want to use inflation as a benchmark for likely price increases for major purchases of building materials, it makes more sense to use a price index for building materials, rather than the standard CPI (consumer price index), which shows price changes for a basket of consumer goods.

• Consistency across data sources: this is an important feature that defines a good data system, and ensures that shortcomings and fallacies are known and accounted for. For example, data from a customer statement showing purchases should match the amounts coming from the sales ledger. If they do not, the person who handles reporting and analysis should know which report is the most accurate, and how, if possible, differences can be isolated and corrected.

Consistency of data sources is also important when looking at public data coming from different surveys. For example, job creation data coming from individuals in the labor force survey may differ from job creation data coming from surveys of businesses.

• Reliability: it is essential to know that the data shows what it is supposed to show. For example, when we get data on survey respondents, we need to be sure that responses are properly recorded and accurately reflect all the answers given. For sales data, we need to be certain that it reflects all sales made. Reliability also involves having a methodology that is followed in data generation and processing, the availability of data quality control, and a reasonable assurance that data will be available for some time in the future and updated at regular intervals.

• Accessibility: refers to how easy it is to get the data. In some cases, obtaining certain transaction data requires querying specific databases and applying filtering criteria. As an example, consider the transactions of an online store, where some customers have not checked out, or their credit card was declined when checking out. If the only available records contain the items that went to checkout, then in order to get the sales for a given day one needs to download a report with all the items that went into the checkout process and the corresponding payment information for each transaction, and match this information against the merchant credit card statement, which includes all transactions that were approved.

Data Formats

Data comes in different formats. Most often it comes in numeric format, for prices, sales, discounts. In other cases it comes in text format, such as product attributes, ratings, etc. It is also often important to have data describing the transaction day and time.

Most data analysis involves working with numeric data; therefore, there is often a need to convert text data into numeric form so as to be able to analyze it. For example, if we have data on customer satisfaction rankings for certain products or attributes, we cannot obtain an average opinion unless we convert the ratings into numeric values. If we would like to cluster on specific attributes, we will likely need to assign distinct numeric values to them so as to run the cluster analysis.
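As a minimal sketch (the rating labels and the 1-3 scale below are hypothetical), text ratings can be mapped to numbers and averaged as follows:

ratings <- c("Poor", "Good", "Excellent", "Good", "Poor")   # text ratings as collected
scores  <- c(Poor = 1, Good = 2, Excellent = 3)             # numeric value assigned to each label
numeric_ratings <- scores[ratings]                          # convert each text rating to its score
mean(numeric_ratings)                                       # average opinion: 1.8 on the 1-3 scale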

Data Issues

Quite often the data available for analysis has flaws. For example, some customer accounts have negative sales figures, or negative amounts due to special transactions that may not reflect sales for a given period. Sometimes data has values that are clearly out of range: for example, customer satisfaction data with rankings on a scale from 1 to 5 may include values such as 99 or 999 for 'not applicable' or 'no answer'. These values need to be removed, or otherwise treated, in order to obtain accurate results and statistics from the data. In other cases, these values are useful for isolating records that should be separated or treated differently from the records that take regular values.

An apparently minor issue may have a major impact on the results of calculations: the distinction between zero values and missing values. Sometimes missing values are interpreted as zeroes; if this is the case, an average or quantile calculation is flawed, as the false zeroes may decrease averages and produce fallacious results for some key statistical parameters.

Properly identifying missing values is helpful in isolating the records for which data exists. Take a database of sales for a large variety of personalized products. In this case, it is important to isolate the records for which data exists for a given product type, analyze them, and then take the records with missing values and figure out what happened in the transactions that did not include that product type.

In other cases, missing data needs appropriate treatment in order to be taken into consideration. In many statistical calculations, records with missing data are deleted. This has important consequences, because records that have almost all data except for one or two fields are removed, and the valid data they contained becomes unusable. In such cases, it is better to assign an extreme value to the missing data that distinguishes it from the non-missing data and thus retains all the useful data for analysis.
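The short sketch below (with made-up sales figures) shows how false zeroes distort an average, and how values marked as missing (NA) can be excluded or isolated instead:

sales_with_na <- c(120, NA, 80, 200, NA)    # NA marks transactions with no recorded amount
sales_as_zero <- c(120, 0, 80, 200, 0)      # the same data with the missing values wrongly coded as 0

mean(sales_as_zero)                         # 80: pulled down by the false zeroes
mean(sales_with_na, na.rm = TRUE)           # 133.33: computed only on the values that exist
sales_with_na[!is.na(sales_with_na)]        # isolating the records for which data exists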

Note of advice: the definitions and examples above are not meant to be exhaustive. The aim is to present the basic elements needed to understand these key concepts and their relevance to the analytical work ahead, and to provide a working background that will be expanded as these issues are analyzed in detail.

Clarifications on data terminology

We often talk about data, but not always in the same way. While analysts are used to certain terms, other people who handle data may use different words to describe it. A quick review of data terminology is useful for describing a data set and what is needed from those who will provide it, and it enables one to read software documentation that is sometimes arcane and technical.

The basic data structure can be seen as a table, often available in the following format:

No    Var1    Var2          Var3        Var4
1     500     Like          Agree       1
2     6000    Don't like                0
3     789     Like          Disagree    0

The first column gives the identifier for a given data record; for example, for the first data record (1) we have a value of 500 for variable 1, 'Like' for variable 2, 'Agree' for variable 3, and 1 for variable 4. Data records are often referred to as observations, mostly by social scientists and marketing researchers, or as cases by life scientists and biostatisticians. Variable, a term that describes a given feature or characteristic of the data, is often used by economists and social scientists. Database professionals often refer to variables as fields, with observations or cases being referred to as rows.
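For readers who want to see how such a table looks inside R, the sketch below rebuilds it as a data frame; the names Var1 to Var4 simply mirror the table above, and the blank Var3 entry is coded as a missing value (NA).

d <- data.frame(
  No   = c(1, 2, 3),
  Var1 = c(500, 6000, 789),
  Var2 = c("Like", "Don't like", "Like"),
  Var3 = c("Agree", NA, "Disagree"),
  Var4 = c(1, 0, 0)
)
str(d)    # lists the variables (fields/columns) and the observations (cases/rows)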

2.2 Practical applications with R

2.2.1 Getting started with R

Perhaps the most advanced and versatile software package for data analysis, one that is free and benefits from the latest implementations of statistical methods, is R. I have to admit that I am an R'nthusiast, because I can use it for free and do a lot of advanced work with it; however, I am aware of its pitfalls and shortcomings, mostly when it comes to comparisons with, and working alongside, commercial software packages such as SAS, SPSS, Stata, EViews and so on.

One thing to be remembered when using R is that in order to use some statistical methods, you will need to download, install, and then load packages which will enable the use of specific methods and algorithms as needed.

Using R without any help requires a bit of a programming background; fortunately, there is a friendly user interface (referred to as the GUI) that enables you to use it in a relatively straightforward way and to rerun or modify commands to suit your analysis purposes.

R is available for download from the following website:

http://cran.r-project.org/bin/windows/base/

Figure 2.1. Download location for the R software for Windows

(The download page shows "R-2.15.1 for Windows (32/64 bit)", the link "Download R 2.15.1 for Windows (47 megabytes, 32/64 bit)", installation and other instructions, and the new features in this version.)

In order to double-check that the downloaded file matches the package distributed by CRAN, you can compare the md5sum of the .exe to the true fingerprint; you will need a version of md5sum for Windows (both graphical and command-line versions are available). For a hassle-free installation, select the default options. If your computer has a 64-bit processor, select both the 32-bit and 64-bit file options.

Once R is installed, the following screen will appear when it is opened, as shown in figure 2.2:

Figure 2.2 The R Console for command-line use

Then the command install.packages("Rcmdr") should be run in order to install the GUI, as shown in figure 2.2.

If R works correctly, it will ask for the selection of a CRAN mirror (i.e. a download site) for the GUI package. R will then download it to a location on your computer, as in figure 2.3.

Figure 2.3 Messages generated during the download of the R graphical user interface R Commander

The next step is to install the GUI through the following steps:

In the Packages menu, select Install packages from local zip file, follow the path indicated in the R window (in my case C:\Users\Local\temp\Rtmp8oA1G4\downloaded_packages), and select Rcmdr_1.8-4.zip. Open it and look for the following message in the regular R window:

> utils:::menuInstallLocal()

package ‘Rcmdr’ successfully unpacked and MD5 sums checked

Figure 2.4 Installation of the R Commander graphical user interface

Launching the GUI interface is done by typing the command at the command prompt:

>library(Rcmdr)

Note: unlike some other software packages, R is case-sensitive, something that needs to be remembered.

A few more steps are needed before the GUI opens, as it will prompt for more packages to be downloaded. Fortunately, these packages download and install automatically. When prompted, click OK to allow them to download and install. After that, the GUI will be up and running. Figure 2.5 shows an instance of the R Commander GUI.

Note: if you are behind a firewall and the installation does not work, you will need to manually download and install the packages required to run the GUI.

For reference, here is the site with information about the GUI.

http://www.sciviews.org/_rgui/

Figure 2.5. The R Commander graphical user interface

At the top, the R Commander interface has several menus (File, Edit, Data, etc) with submenus that contain tools to perform a wide range of operations. You can use these to import the data, do statistical analyses, process variables in the data set, load packages, etc.

It also displays the active data set in the box named "Data set:", and enables the user to change the active data set by clicking on that box and selecting another one. In figure 2.5, the box shows that there is no active data set to display.

The script window shows all the commands that are sent for processing to R. Here one can also input commands and run them by selecting them and clicking on the Submit button. The output window shows, in most cases, the results of your operations. The messages window gives some basic information about the processing of certain commands (e.g. importing data) and displays warnings in red when there is a problem with data processing, or when a command does not work.

Now the GUI is open and functional.

2.2.2 Basic applications with R

2.2.2.1 Importing your data

Please use the data in the Excel file Exercises WK_1.xls, provided with this book, which should be saved in a convenient location.

To import the data, go to the Data menu, then to Import data, then choose from Excel, Access or dBase data set. Enter a name for the data set and click OK to find the file to import, as shown in figure 2.6.

Figure 2.6 The menu used for importing data

After the file is found, click on it and select the worksheet with the data as shown in figure 2.7.

Figure 2.7 Selecting data sets for import

On clicking OK, a message will appear in the Messages part of the GUI window showing that the import went OK (e.g. [3] NOTE: The dataset UE has 18 rows and 2 columns). If there are errors, they will be shown in the same location.

Figure 2.8 R Commander view with the imported data set

Now it is good to check if the data was correctly imported. Clicking on the View data set button will display the imported data, as in figure 2.9.

Figure 2.9 Display of the imported data set

2.2.2.2 Getting basic statistics

Using the Summaries section, the data can be analyzed with one of the options shown in figure 2.10.

Figure 2.10 Menu options for running basic statistics

With a single variable (the unemployment rate, UE), it only makes sense to use the Active data set or Table of statistics options to get means, medians, and standard deviations.
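For reference, the same statistics can also be obtained by typing commands in the script window. The sketch below assumes the active data set is named Dataset and holds the unemployment rate in a column called UE; adjust the names to match your own import.

summary(Dataset$UE)    # minimum, quartiles, median and mean
mean(Dataset$UE)       # average unemployment rate
median(Dataset$UE)
sd(Dataset$UE)         # standard deviation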

2.2.2.3 Merging two datasets

In order to have both GDP and the unemployment rate in the same data set, we need to merge the two data sets into one.

For this, one must access the Data menu, choose the data sets to be combined, and select Merge columns. The name of the merged data set can be changed from the default MergedDataset to another name, e.g. Data.

Figure 2.11 Menu for data merges

The resulting data set, Data, now has both GDP and unemployment data for the same countries.

Observe that the new data set contains two variables created from the variable common to both source data sets. To delete one of them, use the menus as shown in figure 2.12.

Figure 2.12 Deleting variables from the data set

The merge command has several options:

A) Merging rows: this puts one data set on top of the other. It is useful for adding the same kind of data (say GDP) for another set of countries. To avoid confusion, both data sets should have exactly the same variable names; otherwise the result will show missing data (shown as NA) for the rows for which data is not available in one of the data sets. The result is shown in the left picture of figure 2.13.

B) Merging columns: this takes into account all existing rows. If a row is missing from one of the data sets, say the GDP data, the merged data set will contain missing values for GDP in that row. The result is shown in the right picture of figure 2.13.

Figure 2.13 Merge results

2.2.2.4 Joins

While R has easy-to-use options for combining data sets, they are not very advanced and often do not yield appropriate results. Consider the example of a more complex data set with several common variables, say country and year, and with missing data for particular years or countries. A basic merge will then produce inaccurate data that cannot be used.

Another case is where there are several duplicates in the data. In this case, non-duplicated data will appear twice to match the duplicated data, which may result in data inconsistencies.

In order to avoid these and other more complex errors, more elaborate joins can be carried out to control how different data sets are merged according to specific analysis needs. 'Join' is a database processing term that defines the type of relationship to be used when combining two data sets. These relationships are created using 'keys' or 'key variables', which can be found in both data sets. The key variables are used to bring the other variables together in the same data set, and to define the rules to be applied when doing so.

The main types of joins are the following:

The inner join is used for merging data only for values of the key variables that exist in both data sets. If GDP data is missing for one country, say the UK, the merged data set containing GDP and unemployment data will not have any UK data.

The left join is used when there is a 'left table' and a 'right table' and one is interested in getting data from the right table only for the records that exist in the left table. For example, we have a left table with sales data by item for January, and want to get February data for the same items in order to make a comparison.

The right join works the same way as the left join, but keeps all the records in the right table and brings in only the data from the left table that matches the key variables.

For example, suppose we want to merge sales data for January and February by both item and brand. The two data sets are in the Excel file Exercises wk 1.xls, in the worksheets Salesjan and Salesfeb, which are shown in figure 2.14.

Figure 2.14 Data used in performing the merges

Left table (x) Right table (y)

SalesJan worksheet SalesFeb worksheet

The command

a<- merge(Jan, Feb, by=c("Item","Brand"),all=TRUE)

will produce a table containing any record that exists in either of the two tables (a full, or outer, join). The result of this command is shown in figure 2.15. The resulting data set, a, is selected by clicking on the active data set box and choosing it, and is then displayed by pressing the View data set button shown in figure 2.5.

Figure 2.15 Resulting data set from the merge command

This command is to be entered in the Script window, and executed by clicking on the Submit button.

Figure 2.16 Using the script window to submit commands

The resulting data set covers every item that appears in either month, but it has obvious gaps in terms of missing data. To see the evolution of sales between January and February, one could use a data set that contains only the items for which sales for both months are available.

The command

a<- merge(Jan, Feb, by=c("Item","Brand"), all=FALSE)

generates the sales for the items for which data for both months are available. This is an inner join, and its results are shown below in figure 2.17.

Figure 2.17 Results of an inner join

To see what happened for the items for which there are sales in January, one can use the following command to join all January sales with the February sales for the corresponding items.

The following command will perform a left join which will give you the result shown in figure 2.18. Please note that x in the command below stands for the first table used in the join, Jan.

a<- merge(Jan, Feb, by=c("Item","Brand"),all.x=TRUE)

Figure 2.18 Results of a left join

And if we replace all.x with all.y in the command above we will get a right join, as shown below in figure 2.19.

Figure 2.19 Results of a right join

More information on joins and how to perform them can be found at http://stat.ethz.ch/R-manual/R-patched/library/base/html/merge.html, although it is not presented in a very user-friendly format.

2.2.2.5 Getting ready for work with R. Installing libraries and packages

One distinctive feature of R is that, for performing specialized operations, it makes use of specific packages. Packages are built up as collections of functions, data, and code, structured in well-defined formats.

A standard R installation uses a basic set of packages that perform basic functions. To perform specific analyses and other operations, such as running specific models, drawing specialized graphs, or importing and formatting data, the user first needs to download and install the corresponding packages, and then load them every time he/she needs to use them.

Throughout the book, there will be the need to download, install and load several packages. As an example, the tseries package, used in performing time series operations, will be downloaded and installed.

In the R console window, open the Packages menu and then the Install packages submenu; a mirror download site must be selected first, and then the tseries package, as shown in figure 2.20.

Figure 2.20 Using the R console to download packages

The package will download to a location similar to the one shown in figure 2.21. The messages displayed should confirm that the package was successfully unpacked and that the MD5 sums were checked.

Figure 2.21 Downloading the tseries package

Next, go to the Packages menu again, select Install packages from local zip file, and use the path indicated in the console to locate and install the tseries zip file, similar to what is shown in figure 2.21.

If the installation is successful, the package can be loaded. To do so, in the R Commander window, go to the Tools menu, choose Load package(s), select tseries and click OK. A message at the bottom of the GUI window, similar to the one below, will show that the package has been successfully loaded:

[4] NOTE: Packages loaded: tseries
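As an alternative to the menu-driven steps above, the same download, installation and loading can usually be done directly from the R console, assuming a working internet connection and a reachable CRAN mirror:

install.packages("tseries")   # download and install the package from a CRAN mirror
library(tseries)              # load it for the current session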

2.2.2.6 Working with seasonal data. Performing seasonality adjustments

A useful operation when doing data analysis is to isolate the seasonality in the data. Very often the sales of some products and services vary greatly by season, and it can thus be impossible to tell whether sales have really increased, or whether the increase simply reflects the regular rise in a particular season rather than growing demand for the product. In order to make an accurate comparison between monthly or quarterly data, a useful procedure is to remove the seasonal variation so as to see the 'core' evolution of a particular variable and make comparisons that go beyond seasonal shifts.

The following practical example will isolate seasonality and compute seasonally adjusted data series.

The Excel file Exercises wk 1.xls contains a data set for quarterly unemployment in the sheet 'Qtly Unemployment'. After importing the data, the command below should be run by entering it in the script window and clicking on the Submit button. Dataset is the name of the data set where the UE variable is stored, and x is the resulting object, which stores the data in a time series layout.

x <- ts(Dataset$UE, start = c(1996,1), end = c(2011,4), frequency = 4)

The list(x) command will list the transformed data in the output window as a table where the rows are the years and the columns are the quarters:

Figure 2.22 Data put in quarterly format, shown in the R Commander output window

In the next step, data will be decomposed into three components: a trend component, a seasonal component, and an error term, with the following command:

m<- decompose(x)

Listing m gives the output below, with the seasonal component after $seasonal, the trend component after $trend, the random component after $random, the seasonality adjustment coefficients after $figure, and the decomposition type (additive in this case) after $type.

$seasonal

Qtr1 Qtr2 Qtr3 Qtr4

1996 0.9116667 -0.2825000 -0.6116667 -0.0175000

1997 0.9116667 -0.2825000 -0.6116667 -0.0175000

1998 0.9116667 -0.2825000 -0.6116667 -0.0175000

1999 0.9116667 -0.2825000 -0.6116667 -0.0175000

2000 0.9116667 -0.2825000 -0.6116667 -0.0175000

2001 0.9116667 -0.2825000 -0.6116667 -0.0175000

2002 0.9116667 -0.2825000 -0.6116667 -0.0175000

2003 0.9116667 -0.2825000 -0.6116667 -0.0175000

2004 0.9116667 -0.2825000 -0.6116667 -0.0175000

2005 0.9116667 -0.2825000 -0.6116667 -0.0175000

2006 0.9116667 -0.2825000 -0.6116667 -0.0175000

2007 0.9116667 -0.2825000 -0.6116667 -0.0175000

2008 0.9116667 -0.2825000 -0.6116667 -0.0175000

2009 0.9116667 -0.2825000 -0.6116667 -0.0175000

2010 0.9116667 -0.2825000 -0.6116667 -0.0175000

2011 0.9116667 -0.2825000 -0.6116667 -0.0175000

$trend

Qtr1 Qtr2 Qtr3 Qtr4

1996 NA NA 6.3125 5.9250

1997 5.8125 5.7875 5.8875 5.9875

1998 6.0125 6.0875 6.2375 6.4000

1999 6.5125 6.6000 6.6750 6.7875

2000 6.9375 6.9500 6.8250 6.7000

2001 6.5625 6.4375 6.7250 7.2750

2002 7.7500 8.2125 8.1750 7.7750

2003 7.4625 7.1375 7.0625 7.2500

2004 7.5250 7.8750 8.0125 7.9000

2005 7.6375 7.2875 7.0375 6.9375

2006 7.0375 7.2000 7.1500 6.9875

2007 6.8000 6.5375 6.3125 6.1125

2008 5.9250 5.8125 5.8500 6.0125

2009 6.2750 6.6625 7.0250 7.2375

2010 7.3125 7.3000 7.2125 7.2000

2011 7.2875 7.3750 NA NA

$random

Qtr1 Qtr2 Qtr3 Qtr4

1996 NA NA -0.10083333 0.09250000

1997 -0.22416667 -0.20500000 0.02416667 0.13000000

1998 0.27583333 -0.40500000 -0.22583333 0.21750000

1999 0.47583333 -0.31750000 -0.36333333 0.23000000

2000 0.25083333 0.03250000 -0.01333333 -0.08250000

2001 0.02583333 0.14500000 -0.61333333 -0.95750000

2002 1.43833333 0.17000000 -0.06333333 0.24250000

2003 -0.27416667 0.04500000 -0.25083333 -0.53250000

2004 0.36333333 0.10750000 0.19916667 0.21750000

2005 -0.04916667 0.09500000 -0.32583333 -0.12000000

2006 -0.14916667 0.08250000 0.46166667 0.23000000

2007 -0.71166667 0.24500000 0.29916667 0.00500000

2008 -0.53666667 0.07000000 0.16166667 -0.19500000

2009 -0.28666667 -0.08000000 0.38666667 0.28000000

2010 -0.12416667 -0.21750000 0.29916667 0.11750000

2011 -0.59916667 0.10750000 NA NA

$figure

[1] 0.9116667 -0.2825000 -0.6116667 -0.0175000

$type

[1] "additive"

attr(,"class")

[1] "decomposed.ts"

The commands below will graph the data and allow one to see whether there is seasonality in the data:

m$figure

plot(m)

Figure 2.22 Graph of time series components

Seasonality can be spotted by looking at the top graph in figure 2.22. The observed values show obvious spikes. Most of them are removed when the trend is computed, which appears as a relatively smooth line. The seasonal components are 0.9116667, -0.2825000, -0.6116667, and -0.0175000. The random component stands for the variation in the observed data that is not explained by either the trend or the seasonal variation. As a side note, this data series is known to have had an issue around 2002, which shows up clearly in the random graph.

After decomposing the data, seeing that it is seasonal, and being confident that appropriate seasonality coefficients have been obtained, let us compute a seasonally adjusted data series and plot it with the commands below:

ue<-x-m$seasonal

plot(ue)

Figure 2.23 Graph of the seasonally adjusted data

The resulting data, stored under the ue name, is still in table format, as shown in the output below.

[[1]]

Qtr1 Qtr2 Qtr3 Qtr4

1996 8.088333 6.182500 6.211667 6.017500

1997 5.588333 5.582500 5.911667 6.117500

1998 6.288333 5.682500 6.011667 6.617500

1999 6.988333 6.282500 6.311667 7.017500

2000 7.188333 6.982500 6.811667 6.617500

2001 6.588333 6.582500 6.111667 6.317500

2002 9.188333 8.382500 8.111667 8.017500

2003 7.188333 7.182500 6.811667 6.717500

2004 7.888333 7.982500 8.211667 8.117500

2005 7.588333 7.382500 6.711667 6.817500

2006 6.888333 7.282500 7.611667 7.217500

2007 6.088333 6.782500 6.611667 6.117500

2008 5.388333 5.882500 6.011667 5.817500

2009 5.988333 6.582500 7.411667 7.517500

2010 7.188333 7.082500 7.511667 7.317500

2011 6.688333 7.482500 7.811667 7.717500

There are now several options for manipulating this data: to export the series to a file using R, or to copy and paste the data into Excel.

One of the most useful processing options is to get the data back into a column format and merge it with the existing data set. This not only adds the adjusted data to the data set, but also allows us to check that the seasonally adjusted values do not diverge significantly from the unadjusted data.

For this, the tseries package must be loaded.

The command use <- as.data.frame(ue) will store the seasonally adjusted data in a single column labeled x, as shown in the output below:

[[1]]

x

1 8.088333

2 6.182500

3 6.211667

…

The following command will add this column to the dataset with the original unemployment data, and generate the result shown in figure 2.24.

a<-cbind(Dataset,use)

Figure 2.24 The data set containing both the original and the seasonally adjusted series

As a final step, the seasonally adjusted variable can be renamed appropriately in R Commander, using the menu options shown in figure 2.25.

Figure 2.25 Renaming the seasonally adjusted variable in R Commander

2.3 Exercises and questions for review

Exercise 1: Please indicate which analysis method/concept is to be used for the following cases:

1) A company wants to determine sales for a certain product based on past sales and on urban/rural locations of sales points.

2) "We need to gain a better understanding of our clients. So far we have little or no idea who they are and where they come from", said the CEO in a key strategy meeting.

3) We know that our main competitor is going to lower the price of widgets by 5%. Do we know what will happen if we can only match them with a 2% price decrease?

Exercise 2: Can you describe two of the most important data quality issues you have faced in your routine work, and how you dealt with them? Please also indicate which main data quality aspects were not met (e.g. accuracy, etc.).

Exercise 3: Going back to the tables in the exercise data set, please pick two tables (one of which should be the sales data) and describe the data in terms of the data quality aspects described above. Please make specific comments to argue your case with respect to the data

Exercise 4: Downloading, installing and launching a package in the GUI. Please choose one of the following packages, 'urca' or 'tseries', if it is not already among the installed packages. If both are present, please install and launch RcmdrPlugin.BCA. Please inspect the output that indicates successful completion of each step.

Exercise 5: Do a seasonal adjustment of the monthly data in the exercise file UE data.xlsx, provided with this book. Please remember to change the key parameter that sets the monthly periodicity of the data (as indicated in the course). Please plot the decomposed series and state your conclusion: is the data seasonal, and what shape does the trend have?

Please also plot the seasonally adjusted data you obtained against the unadjusted data and the officially seasonally adjusted data (UE_SA), with the following commands

plot(a$UE_SA, type="l", col="red")

lines(a$x, type="o",pch=22, lwd=8,col="blue")

lines(a$UE_NSA, type="o",col="green")

Exercise 6:

Merge the data in the file provided so as to obtain last month's sales, and submit the results. Your manager wants to see all the data available, so please prepare a report using all of it.


