Usefulness Of Data Manipulation Techniques


Its purpose is to provide a well-rounded and effective wrap-up of this book. In this wrap-up we will first go back, review, and apply basic notions from the first chapter, as a useful detour from the analytical content of the previous two chapters and a rehearsal of the data manipulations which may not be especially interesting or fun to learn, but are very useful in day-to-day business analytics work.

The importance and usefulness of data manipulation techniques is further stressed by working through some particular processing tasks. Working with dates and making summaries is, again, not among the most popular topics. The lack of treatment of these topics in a book, and their weak or non-existent theoretical underpinnings, may be frowned upon by readers with a predominantly theoretical background (e.g. students, professionals). However, in practice, one is very likely to do these types of calculations, which may be quite challenging to figure out on one's own, and which are definitely not taught in heavily theoretical, high-profile courses.

It is my belief that this practical component will give readers a head start and an edge over other practitioners.

The quantitative focus of this book is not entirely set aside in this chapter. An example using quantile regression to compute price elasticity is worked through, both as an illustration of how a rather fancy-named quantitative technique can find applicability in everyday work and as an encouragement for readers not to be afraid to go beyond the content of this book and use less-known techniques.

Practical knowledge as a competitive edge is fostered through the study of some retail audit metrics. Here the book covers some metrics that are widely used by FMCG companies and practitioners, yet are seldom found in textbooks unless their focus is retail audit. Hence, the reader is exposed to some niche content that is in many cases essential in understanding the analyses provided by the FMCG industry.

And the last part, often not covered in an analytics book, is a short guide on how to carry out a business analytics project from start to finish. This is another attempt to give the reader an edge over readers of other books and to provide some essential, practical tips that help in keeping focus and staying on track to deliver actionable results that sell themselves well to both customers and bosses.

5.1 Churn Rates

Perhaps this is the most commonly used business metric; it essentially tells what percentage of customers stop doing business with a company or another entity. Churn rates are usually computed as a percentage of the number of customers at the start of the analysis period. For example, if one wants to compute churn rates for 2011, the results obtained are reported relative to the number of customers existing in January 2011.

Timing and consumption patterns are essential in computing meaningful measures of churn rates (or, in short, churn). Much depends on the normal frequency of purchases. For example, a business making personalized promotional items is likely to get orders only once or twice a year from most customers. In this case, it would make sense to compute churn on an annual basis, to see how many of the previous year's customers have come back to place another order. In other cases, churn must consider the duration of a given service. Analysing an online social network service where members can contact others only if they have purchased a service is more complicated, since one must see whether members renew the subscription after the initial membership period has passed.

Also, one may need to compute churn for particular products or services just to see how the customer base is doing in that particular area of business.

Apart from defining the right period of analysis, the computations are relatively straightforward. The relevant client data for the initial period needs to be merged with the data for the last period for which churn is calculated. Then, the number of customers from the initial period who do not appear in the last period is divided by the number of customers in the initial period to give the churn rate.

The computational procedure used is to do a left join, which retains all customers from the first period and brings all records from the last period for the customer numbers that exist in the first period.

This is more or less similar to the work done at the end of the second chapter, so I will not give an additional example here. One basic operation that needs to be done is to ensure unique customer IDs, otherwise all calculations could be wrong. This is done with the command a<-unique(petrol$SG), which stores the unique records in a new variable, as shown in figure 5.1.

Figure 5.1 Getting unique values from a variable

> a

[1] 50.8 40.8 40.0 38.4 40.3 32.2 41.3 38.1 31.8

Now that all the necessary tools and knowledge to do basic churn calculations are in place, let us move further to see refinements of the churn calculation which are linked to rotational churn and purchase/usage behaviour. Rotational churn means that some customers that have stopped doing business with a company will, at a later point in time, resume doing business with that company. They may not appear in the last period's data, but they are still active customers who should not be counted in the churn.

While rotational churn is characteristic of telecom businesses, it has wide applicability elsewhere. Suppose a churn report for customers who buy printer cartridges has to be computed. It is known that the average business needs a new cartridge every 6 months. If one is asked to produce a churn rate for the entire year (December versus January) and only the January and December data are used, customers who bought a cartridge between July and November would wrongly appear as churners; hence, all those that have purchased cartridges between July and December should be excluded from the churners. This is still not rotational churn, as it is not known whether a customer who did not purchase anything between July and December has also not purchased anything between January and June, and may purchase again next year. Such a customer may have bought a cartridge from a competitor between July and December but afterwards switched back to the original supplier, and should not be included in the churn numbers.

Most of the difficulty in computing churn comes from the fact that one needs a good understanding of the business before doing the calculations. The second difficulty is devising a meaningful routine for computing the churn.

Let us take an example of customers from the file customer.csv. There is no data for the next year, and thus it will not be possible to exclude rotational churn.

Using the dataset customer.csv, which includes simulated transaction data, the calculation methodology will be defined first. The file contains only transaction data, so no customer report or other document generated as of January 1st is available to list all existing customers at that time. How, then, is it possible to get the initial customer base? By taking an approximation and assuming that whoever made transactions in January, February or March was an existing customer in our customer base.

This assumption is based on some prior knowledge from the sales people, who know that many repeat customers have placed at least one order during those months. New customers start coming in from April onwards, so the assumption laid down above seems valid. Note that it would be plainly wrong to take only January customers, given the low frequency of purchases.

In R, a way to do this is to select the transactions recorded in the first three months and obtain the unique customer numbers. In this first step, the dim command displays the number of rows/observations in the resulting dataset, followed by the number of columns (in our case 5444 and 1).

b <- subset(Dataset, subset=MONTH<=3)

b$MONTH<- NULL

base<- as.data.frame(unique(b$cust))

dim(base)

Now the next step is to determine the second data part for the churn calculation: the active customer base as at year end. Based on the fact that the average number of transactions is 2, we can conclude that whoever made a transaction in the last 6 months of the year is an active customer. The commands to be used are similar to those before, with only the subset condition changed to MONTH>6. This time a constant column equal to 1 will be added, so as to indicate which records belong to this file and enable us to compute the churn rate.

b <- subset(Dataset, subset=MONTH>6)

active<- as.data.frame(unique(b$cust))

active<-cbind(active,1)

Using the above commands we now have two datasets with only one common variable labeled unique.b.cust.

The next step will be to do a left join linking the customer base with all customers who had transactions between July and December (labeled as active). Before that, however, it is useful to rename the customer variable in both datasets, as the name created by the code unique(b$cust) cannot be used in the merge command. Doing a join using the unique.b.cust variable will yield the following message in the Messages box of the RCommander interface:

ERROR: 'by' must specify uniquely valid column(s)

Renaming the customer number variable is done separately for each dataset, by changing it to "custno" in both datasets through the Data > Manage variables in active data set > Rename variables menus/submenus. One thing to remember is to change the active dataset by clicking on the dataset box and selecting the appropriate data set, as shown in figure 5.2.

Figure 5.2 Selecting the appropriate dataset

The following command will generate the dataset used to compute churn, as illustrated in figure 5.3:

a<-merge(base, active, by=c("custno"), all.x=TRUE)

Figure 5.3 Resulting dataset for churn calculations

This should have as many records as the base file (5444) and a flag variable indicating the records that are present in the active customers’ file.

Using the menus/submenus Data > Active data set > Remove cases with missing data will give the final dataset containing only the active customers. In our case, there are 2467 of them.

Thus the churn rate is computed as follows:

Churn rate = (Number of customers in the initial period - Number of active customers in the last period) / Number of customers in the initial period * 100    (5.1)

The result of applying formula 5.1 in our case is (5444-2467)*100/ 5444 = 54.68%
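For readers who prefer to do this last step in code rather than through the menus, the same result can be obtained with a short sketch, assuming the merged dataset a created above (its second column is the flag brought in from the active file):

base_count <- nrow(a)                # customers in the initial base (5444)
active_count <- sum(!is.na(a[, 2]))  # customers also present in the last period (2467)
(base_count - active_count) * 100 / base_count   # churn rate, as in formula 5.1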

One final observation before we move to the next section. Obtaining the churn value is not the final step in analyzing the churn rate. The final step is its interpretation, and it greatly depends upon the business or industry it refers to. For some businesses, almost all clients are one-time buyers and repeat customers are rare (e.g. in the case of optical shops). In these cases, churn rates are not a viable metric, as they approach 100%, and the right metric would be the retention rate, computed as the number of active/repeat customers divided by the size of the customer base.

In other cases, for example hairdressers, businesses count on a mass of steady and regular customers, and here using churn rates will have to take into account the rates recorded by competitors, or churn rates calculated for the entire industry. Only by comparison with similar results is one able to tell whether the churn rate (or the retention rate, if this measure is more relevant) is high, average, or low, and whether remedial action should be taken if the results are deemed reliable.

5.2 Price elasticity

This is perhaps one of the most widely known concepts in economics, yet it does not get much attention in real life. Despite being largely ignored in practice, it is easy to model and very straightforward to calculate.

Price elasticity tells how much sales will decrease (or, in some marginal cases, increase) as the selling price changes. The basic formula is: percentage change in sales / percentage change in price. However, this is complicated to compute directly, as it assumes comparing quantities and prices across different periods.

Fortunately, there is a simple transformation from calculus which makes our life easier and the estimation simpler. The percentage (relative) change in sales is the derivative of the natural logarithm of sales, noted as log(Sales), as shown in formula 5.2:

dSales / Sales = d(log(Sales))    (5.2)

By applying this transformation to both percentage changes, we get a formula which can be estimated with the linear regression model as follows:

log(Sales_t) = a + b*log(Price_t) + e_t    (5.3)

A note on the interpretation of the results. The elasticity coefficient is the coefficient b on the log of price. An elasticity above 1 in absolute value (less than -1, or more than 1) means that sales are elastic, that is, a 1 percent change in price leads to a change in sales of more than 1 percent.

This may be because customers readily switch to competitors which offer similar products at better prices or, in some cases, because customers swiftly cut back consumption so as to fit their limited incomes. An elasticity below 1 in absolute value (that is, between -1 and 1) indicates that sales are rather insensitive to price changes. This may be good news, as in these cases sales will not be significantly affected by price changes; it can happen if the product is unique, with little or no competition, or if the product is an essential one and customers do not cut back consumption when their incomes fall.

A clarification is needed about the sign of the elasticity coefficient b. It is usually negative but, in many economics textbooks and price elasticity reports, a minus sign is put in front of it so that it becomes positive and easier to interpret. However, there may be some cases in which a rise in price is associated with an increase in the quantity sold. This refers to inferior goods and reflects a counterintuitive, but real, relationship. As the price decreases for these goods, the quantities sold decrease, since consumers have more money to buy goods of better quality and shun purchases of these inferior goods. In other cases, an increase in the price of these goods is a sign of future hardship, so people rush into buying these items as they assume prices are about to increase even more.

Let us do a straightforward example and compute price elasticity for the sales of pork in the US.

Using proxy variables

In real life it is often the case that the data we want for the analysis is not available. There are many reasons for this: the data may not be collected, may not be reliable, or may simply be too expensive to obtain. Given this, an analyst may be in the position of not being able to do the work, as the variable(s) of interest are missing.

However, in many cases, there are other options available in the form of data that is fairly close to the variables of interest. For example, salary data may not be available, but there may be wage indexes computed for a specific area or, say, industry. Similarly, prices may not be available for specific goods, but prices for similar goods, or for a broader category of goods which comprise the goods we are interested in, may be available.

Choosing and using a proxy variable instead of a missing variable of interest requires an educated judgement, which takes into account getting satisfactory answers to the following questions:

- is the nature of the proxy variable close to the required variable?

- what is the difference between the proxy variable and the desired variable?

- what are the potential shortcomings of using the proxy variable?

- is it available for a significant length of time, or for a significant number of similar units, and will it be available in the future on a regular basis?

In our example, sales of pork are proxied by the consumption of pork in the US, assuming that most of the pork consumed comes from purchases and that farmers' own consumption of pork is very small. The dataset pork.xlsx comes from USDA's Red Meat Yearbook 2006.
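The model in formula 5.3 can be estimated with a command along the following lines (a sketch; the variable names Consumption, Bprice and Year, and the additional Year trend term, match the output shown next):

model <- lm(log(Consumption) ~ log(Bprice) + Year, data = Dataset)  # log-log demand model with a time trend
summary(model)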

The output is:

Call:
lm(formula = log(Consumption) ~ log(Bprice) + Year, data = Dataset)

Residuals:
      Min        1Q    Median        3Q       Max
-0.042445 -0.020689  0.005329  0.014687  0.073704

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.283744   3.177574  -5.754 3.55e-06 ***
log(Bprice)  -0.391879   0.048858  -8.021 9.82e-09 ***
Year          0.015224   0.001707   8.918 1.13e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02697 on 28 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.7441,    Adjusted R-squared:  0.7258
F-statistic: 40.71 on 2 and 28 DF,  p-value: 5.158e-09

It looks like demand is fairly inelastic, as a 1% increase in price will reduce consumption by only 0.39%.

In a nicer graphical form, price elasticity can be expressed as a table, as in the example below (rounding the elasticity to -0.4):

Table 5.1 Price elasticity table, standard format

Elasticity = -0.4

                  Price decrease                     Price increase
Price      80%    85%    90%    95%    100%   105%   110%   115%   120%
Quantity  108%   106%   104%   102%    100%    98%    96%    94%    92%
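As a side note, the quantity row of table 5.1 is a simple linear approximation around the current price, and can be reproduced with a few lines of R (the -0.4 elasticity and the price range are those of the table):

elasticity <- -0.4
price_index <- seq(80, 120, by = 5)                       # price as a percentage of the current price
quantity_index <- 100 + elasticity * (price_index - 100)  # each 1% price change moves quantity by -0.4%
data.frame(Price = paste0(price_index, "%"), Quantity = paste0(quantity_index, "%"))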

However, while the calculations are valid, there is a caveat to this coefficient. Essentially, the formula assumes that price elasticity is the same no matter how big or small the changes in prices and consumption are. This is a drawback of linear regression which, while very good at explaining the impact of the average change in prices on sales, may not explain the changes at different levels of consumption, especially when these tend to vary a lot.

In real life, price elasticity may vary a lot for different values of sales and prices. In order to account for this, it may be useful to use a technique called quantile regression, which allows for the different elasticities that may occur across the sales data we want to analyze.

In order to run a quantile regression, the quantreg package must be installed and loaded. The commands for the quantile regression are:

library(quantreg)

a <- rq(log(Consumption) ~ log(Bprice) + Year, tau = c(0.2, 0.4, 0.5, 0.6, 0.8), data = Dataset)

summary(a)

which yields the following output:

Call: rq(formula = log(Consumption) ~ log(Bprice) + Year, tau = c(0.2, 0.4, 0.5, 0.6, 0.8), data = Dataset)

tau: [1] 0.2

Coefficients:

coefficients lower bd upper bd

(Intercept) -16.11455 -24.15162 -9.55874

log(Bprice) -0.38186 -0.49356 -0.27009

Year 0.01410 0.01051 0.01839

Call: rq(formula = log(Consumption) ~ log(Bprice) + Year, tau = c(0.2, 0.4, 0.5, 0.6, 0.8), data = Dataset)

tau: [1] 0.4

Coefficients:

coefficients lower bd upper bd

(Intercept) -14.43706 -22.59664 -10.50698

log(Bprice) -0.37373 -0.48921 -0.29911

Year 0.01324 0.01108 0.01746

Call: rq(formula = log(Consumption) ~ log(Bprice) + Year, tau = c(0.2, 0.4, 0.5, 0.6, 0.8), data = Dataset)

tau: [1] 0.5

Coefficients:

coefficients lower bd upper bd

(Intercept) -15.57513 -25.35833 -13.03027

log(Bprice) -0.37444 -0.46402 -0.31422

Year 0.01382 0.01254 0.01908

Call: rq(formula = log(Consumption) ~ log(Bprice) + Year, tau = c(0.2, 0.4, 0.5, 0.6, 0.8), data = Dataset)

tau: [1] 0.6

Coefficients:

coefficients lower bd upper bd

(Intercept) -17.68794 -22.88871 -14.77597

log(Bprice) -0.37484 -0.47793 -0.33895

Year 0.01489 0.01344 0.01765

Call: rq(formula = log(Consumption) ~ log(Bprice) + Year, tau = c(0.2, 0.4, 0.5, 0.6, 0.8), data = Dataset)

tau: [1] 0.8

Coefficients:

coefficients lower bd upper bd

(Intercept) -21.89313 -24.06022 -15.63642

log(Bprice) -0.46017 -0.48820 -0.34484

Year 0.01721 0.01379 0.01836

In this output we are given the elasticity coefficients at different levels of the distribution of consumption, together with their 90% confidence bounds or, in other words, the values between which the true but unobserved elasticity coefficient lies with a likelihood of 90%. The values are summarized in table 5.2.

Table 5.2 Price elasticity coefficients and confidence bounds

Quantile   coefficients   lower bd    upper bd
0.2        -0.38186       -0.49356    -0.27009
0.4        -0.37373       -0.48921    -0.29911
0.5        -0.37444       -0.46402    -0.31422
0.6        -0.37484       -0.47793    -0.33895
0.8        -0.46017       -0.48820    -0.34484

We are now able to see how the coefficients obtained hold at different points of the data distribution. They are fairly stable around -0.37 to -0.38, which is close to the value given by the linear regression, except for the 0.8 quantile. However, for this quantile the confidence interval is less symmetric, which raises the question of whether the results obtained are reliable. Nevertheless, we can conclude that price elasticity is noticeably higher for high consumption levels, which may lead us to revise our initial calculations and redraw a new elasticity table which captures this difference.

In order to visualize the results, the following command can be used to plot the quantile regression results: plot(summary(a)).

The plot obtained, shown in figure 5.4, contains all the regression coefficients information and confidence interval boundaries plotted in a fairly compact way.

Figure 5.4 Quantile regression plot

5.3 Price Points/Price Dispersion

This is a fairly frequently used metric describing the range of prices at which a product is sold across different stores.

Typically, it is shown either as a table with quantities sold at each price, or as the percentage distribution of all sales at different prices, for one or more products that are fairly similar in terms of size and quality. For example, we may have 6-pack beer sales for brand X and want to see how they compare against similar products.

Figure 5.5 contains a sample built on simulated data, which shows the sales distribution at different prices for different brands. This is usually accompanied by a numeric version, which just shows the quantities sold. It is a good idea to include both of them, so as to make sure that the price dispersion table was built on all available data and that no sales are missing. For each brand, all row values should add up to 100%.

Figure 5.5 Example of a price dispersion table

Let us do a practical example using the cu.summary dataset, included in the rpart package, and available after loading it. The main steps to produce a price dispersion table are:

A. Filtering the data needed. It only makes sense to compare similar products, so we need to pick the products that are fairly similar. In Excel you can do this using the filter command and then ticking the appropriate boxes.

In R the following standard command is used to filter data based on specific values for variables:

da <- subset(cu.summary, Type %in% c("Small","Large"))

The command above will isolate all records of cars whose type is either small or large. The problem with it, often encountered in real-world data, is that in some cases variables have typos or spaces in them. If the Type variable contains text with a space right after, or right before the text, for example " Small" instead of "Small", this command would not work properly.

The solution to this is to use the grep command with a shorter search string that unambiguously contains the sequence of letters identifying the observations we are interested in. Because grep does pattern matching, it finds that string anywhere inside a value, regardless of any characters (including spaces) that precede or follow it. In the example below, the pattern "Sma" will isolate all observations for which the Type variable contains "Sma" somewhere in its text.

da1 <- cu.summary[ grep("Sma", cu.summary$Type) , ]

A single grep call matches one pattern at a time (unless alternation is used, e.g. the pattern "Sma|Lar" would match both types at once). Following the one-pattern-at-a-time approach, in order to get a set of data containing only small and large car types, one has to create a second dataset, e.g. da2, with "Lar" as the search string, and then combine the two datasets with the command

a <- mergeRows(da1, da2, common.only=FALSE)

This merge can also be done using the RCommander menus, as shown in section 5.1 for computing churn rates.

B. The second step is to decide how to process the data in order to get a meaningful range of price points. A summary command may help, or some prior knowledge of what is wanted. There is no hard and fast rule, but if there are few data points/observations, your distribution may not give you much information.

C. Assigning data to the price points. After defining the price points, there comes the need to define the interval for them. This is due to the fact that prices can take any value within the range of the price points considered, and even above and below them.

For example, if the decision is to have price points of 4.9, 5 and 5.1, there will be a need to define which data will be classified as belonging to each of these points. For the price point 5, values between 4.95 and 5.05 need to be selected; for 4.9, values between 4.85 and 4.95, and so on. Should 5.2 and above (noted as >5.2 or 5.2+) be the highest price point considered, it should include values of 5.15 or higher.

After all intervals are defined, all the elements needed to compute a price points table are available. This can be done using MS Excel with a rather complicated formula, but can be coded in R in a relatively easy way.
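As an illustration of the interval assignment just described (a sketch, not the exact approach used in the next step), the base cut() function can classify a vector of prices into such price points in one go; the price vector below is hypothetical:

price <- c(4.87, 4.96, 5.02, 5.11, 5.30)                    # hypothetical prices
pricepoint <- cut(price,
                  breaks = c(-Inf, 4.85, 4.95, 5.05, 5.15, Inf),
                  labels = c("<4.9", "4.9", "5.0", "5.1", "5.2+"),
                  right = FALSE)                            # intervals include their lower bound
table(pricepoint)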

Using the dataset cu.summary you will be asked to do a basic price range on car prices in one of the exercises at the end of this chapter based on the example below.

With the command below, a new variable named Pricerange is created, filled with the value 7.5 if the Price in the dataset cu.summary is between 5,000 and 10,000.

cu.summary$Pricerange[cu.summary$Price > 5000 & cu.summary$Price <= 10000] <- 7.5

D. Doing a summary of the newly created variable as a basic input for the price range calculation. With a command similar to the one below, numeric and character variables can be summarized by obtaining frequency counts for the variable of interest.

a<- as.data.frame(table(cu.summary$Type))

Figure 5.6 Example of a frequency table

The result in figure 5.6 shows the summarized variable as Var1 and the frequencies in the Freq column. This is the basic information for building a numeric price points table. For consistency, the Type variable will need to be replaced with the newly created variable which contains the price points (e.g. Pricerange) in order to build the frequency table.

The percentage version, shown in figure 5.5, is obtained by summing up all frequencies and dividing the individual frequencies by this sum. In figure 5.6, the relative frequency for compact cars is obtained by dividing 22 by 117, the latter being the sum of all frequencies for all car types.
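Putting the last two steps together, a minimal sketch of the percentage version could look as follows (assuming the Pricerange variable created earlier, with the remaining ranges filled in the same way):

a <- as.data.frame(table(cu.summary$Pricerange))   # counts per price point
a$RelFreq <- a$Freq / sum(a$Freq)                  # relative frequencies, which should sum to 1
a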

Two words of caution before moving on to the next section. One is that price points are computed on prices only. The step-by-step examples above are the necessary steps for doing the price points calculation in exercise 2 at the end of this chapter, which uses the Price variable in the cu.summary data. The second is that all relative frequencies should sum up to 1.

5.4 Usual retail sales measures

This section contains the basic measures covered in standard sales tracking reports, provided by retail sales companies such as AC Nielsen or IRI.

The most commonly used metrics are the following:

A. Sales Quantity= Opening Stock + Retail Purchases – Closing Stock

B. Retail sales in volume in litres, kilograms, pounds, ounces, pieces, etc.

C. Retail sales in value

D. Brand or product shares by volume/by value.

These are the essential measures used to report sales, and often the volume measures are preferred as they are not influenced by price fluctuations. Most of the retail reports and studies concerning market size and market forecasts refer to volumes (which is another name for quantities). This is also true for other types of goods such as durable goods; for example most sales of mobile phones are reported as number of units sold.

E. Retail Handling (percent of outlets handling a product or brand in the analysis period). Numeric handling is computed as the ratio of the outlets carrying the product to the total number of known outlets of that kind.

F. Size of Stock is the volume of stock held by the retailers at the end of the period.

G. Stock Coverage in days is volume of stock divided by average daily sales.

H. Retail Out of Stock, is the percentage of outlets with product/brand out of stock at the end of the period (expressed in the numeric or weighted versions).

I. Average Sales per Outlet Handling is given by the sales volume divided by the numeric handling; it shows the average sales of a certain brand within the outlets handling that brand during the analysis period.

For many readers, all of the above measures may look like redundant, if not useless, information. In many cases one is unable to compile these statistics by oneself. There is often the need to use external data, such as the total number of outlets of a particular kind in a certain geographical area, to carry out these calculations.

Nevertheless, these measures are helpful in providing basic metrics for a (retail) business, and help an analyst or manager have a better picture of the sales of a particular product or products. These metrics are to be found in retail reports provided by companies such as AC Nielsen or IRI, and it is good to understand what they mean and how they can be used in making business decisions.

Table 5.3 contains a brief example of sales report data for one month.

Table 5.3 Basic Input data for retail sales

Stores     Items     Opening stock   Purchases   Closing stock   Price
Store 1    Brand A   50              30          40              2
Store 1    Brand B   30              20          20              3
Store 2    Brand B   22              20          20              3
Store 2    Brand E   34              30          20              2
Store 2    Brand F   40              30          30              6

Assume that the company on which the analysis is done produces and sells brands A and B, and that the rest are competitors' brands, to be used in the calculations. Also, assume that there are only two stores in the area for which this sales report was created.

Using this data, one can compute the following metrics:

Sales for brand A: 50+30-40=40 items

Sales value for brand A : 40 items*2= 80.

Numeric distribution: 50% for brand A (one store only carries it) and 100% for brand B (both stores carry it)

Average sales per outlet handling: for brand B it is 26, that is, total sales of 52 divided by the two stores handling that brand.
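For readers who want to reproduce these figures in R, a small sketch built from the data in table 5.3 is given below (the column names are illustrative):

retail <- data.frame(
  Store     = c("Store 1", "Store 1", "Store 2", "Store 2", "Store 2"),
  Item      = c("Brand A", "Brand B", "Brand B", "Brand E", "Brand F"),
  Opening   = c(50, 30, 22, 34, 40),
  Purchases = c(30, 20, 20, 30, 30),
  Closing   = c(40, 20, 20, 20, 30),
  Price     = c(2, 3, 3, 2, 6))
retail$Sales <- retail$Opening + retail$Purchases - retail$Closing    # sales quantity per store and brand
retail$Value <- retail$Sales * retail$Price                           # sales value
tapply(retail$Sales, retail$Item, sum)                                # total sales volume per brand
tapply(retail$Sales, retail$Item, mean)                               # average sales per outlet handling
table(retail$Item) / length(unique(retail$Store))                     # numeric handling (share of stores carrying each brand)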

Exercise 3 at the end of this chapter will require you to compute all retail measures presented in this section.

5.5 Peak transaction periods

In some cases it is worth seeing what the busiest times are for sales or other transactions. It is known that fast food places get busy around lunch time, and that for restaurants there might be another peak period in the evenings. But for other products, the periods with the highest sales may not be that straightforward. For example, online transactions may occur more frequently at particular times and days of the week, and there could be other cases where common knowledge is of no help at all.

In theory, finding out peak transaction periods is fairly simple: it involves creating a new variable where the transaction times are assigned to certain time and/or date intervals. For example, if 30-minute intervals are used, a transaction occurring at 9:49 PM can be assigned the value 21_2 which indicates that it occurred in the second part of hour 21:00.

Then, the next step is to do frequency counts as described in section 5.3 for price points, which will give the peak or bottom transaction periods based on the frequency of the transactions classified in date and/or time intervals.

From a theoretical point of view, this type of calculation is not much different from the one used for price points, except that here the assignment intervals are based on time periods rather than price points.

However, in practice, working with dates is complicated in most software packages, and R is no exception. In order to accommodate the specifics of date and time processing, the chron package must be downloaded, installed, and loaded. All the commands below require it and must be carried out after loading it.

The example below will show the calculations of peak transaction periods with the use of a simulated transaction file transactions.csv, which has only one variable, Date.

After importing it into R, the fields look like this:

Date

1 1/13/2009 20:19

R has imported this data using a date format, which is not very easy to use, especially for new users. In order to carry out the transformations of the data needed to assign all observations to transaction periods, one needs to convert dates to character strings first, with the following commands:

Dataset$Dates<-as.character(Dataset$Date)

is.character(Dataset$Dates)

The first command creates a new variable Dates, where the Date variable is stored in character format, and the second one checks whether this has actually happened by returning a TRUE value in the output window.

The next command will split the dates into date information and time information, and append the seconds information to the data. Appending the seconds is only needed so that the subsequent commands we will apply can work.

Dataset$Datess<- paste0(Dataset$Dates,":00")

dtparts = t(as.data.frame(strsplit(Dataset$Datess,' ')))

row.names(dtparts)= NULL

Working with dates and times in R

One of the most difficult parts in working with R is to process dates. There are different formats available and these formats are fairly standard. You could have a format similar to the one in the file (1/1/2000) which can be expressed as 01Jan00 or January 01, 2000 or 2000-01-01.

The difficulty of working with these formats is to find the right expression for it in order to convert and process it.

Information on dates can be found in several R sources. Among the best one is http://statistics.berkeley.edu/classes/s133/dates.html

and the documentation for the package chron.

However, it is best to begin with this documentation http://rss.acs.unt.edu/Rdoc/library/chron/html/chron.html which explains the basics of the chron command.

The first link contains three tables that show how different date formats are coded in the R commands, for example 2000 is coded as Y whereas y stands for 2-digit years.

Coding time is slightly simpler: h stands for hours, m for minutes and s for seconds. Unfortunately, there is no easy way to deal with these issues, and one needs to spend some time testing the commands to check that they work properly. Usually R gives a warning if dates are not processed correctly, but the best way to check the results is to use a command such as head(yourdata) in order to quickly inspect them.

After the data is split into date and time items, we need to get it in the right format for processing.

theT=chron(dates=dtparts[,1],times=dtparts[,2],format=c('m/d/Y','h:m:s'))

This command takes the two parts produced by the split above (e.g. 02/27/09 for dates, identified as dtparts[,1], and 12:21:00 for times, identified as dtparts[,2]), applies the format m/d/Y for dates and h:m:s for times, and stores the result in the variable theT.

Now calculations can be performed to see the busiest transaction times based on the hours and minutes only. Because only the hour-minute-second part will be used, the time information will be attached to the initial dataset using the following commands:

thetimes=chron(times=dtparts[,2],format='h:m:s')

Dataset$Times<- thetimes

In the next step one needs to label the data so as to determine the most frequent half-hour transaction times. This is done much as for the price points described in section 5.3: a command similar to the one below would be needed to assign a value to the observations that meet the relevant criteria.

Dataset$Label[Dataset$Times>09:00:00 & Dataset$Times<=09:30:00] <- 00_1

However, this is not possible with the time format, and it would also be complicated to assign many time intervals, one at a time, to the label variable. An efficient and more intuitive solution is to 1) extract the hours and minutes, 2) convert the minutes into time intervals, and then 3) concatenate the two with the paste0 command used above.

For getting the hours part of the label, the hours command from the chron package can be used. But this command alone will return blanks for 12 AM, i.e. the 00 hour. In order to correctly process the hour information, a conditional if statement is needed, which assigns 0 for the 12 AM/00 hour.

Dataset$H<- hours(Dataset$Times)

for (i in 1:nrow(Dataset))

{if (Dataset$H[i]>=1) Dataset$H[i] else Dataset$H[i]=0}

In order to get the minute intervals, we first need to get the minutes part. Here no blank value is generated, so we can use only the half-hour condition:

Dataset$M<- minutes(Dataset$Times)

for( i in 1:nrow(Dataset))

{if (Dataset$M[i]<30) Dataset$M[i]=1 else Dataset$M[i]=2}

Now the label variable can be generated, by concatenating the hour and minute parts, with the following command:

Dataset$FLG<- paste0(Dataset$H,"_",Dataset$M)

The last step in obtaining the peak transaction periods is to compute the frequencies for the label values obtained, using the following command:

a<- as.data.frame(table(Dataset$FLG))

In order to get the busiest, or most frequent, times, it is useful to create a new dataset in which the label frequencies are stored in descending order. The minus sign placed before the variable being ordered indicates descending order.

sort.a <- a[order(-a$Freq), ]

To get the least busy transaction times, the data needs to be sorted in ascending order, which can easily be done by removing the minus sign in front of the Freq variable.
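A table like 5.4 can then be pulled from the two sorted orderings, for example (showing 8 rows, as in the table below):

head(sort.a, 8)               # busiest half-hour intervals
head(a[order(a$Freq), ], 8)   # least busy half-hour intervals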

Results for the busiest and the slowest transaction times for our data are shown in table 5.4. Remember that the digits before the underscore in the flag variable give the hour in military format, and the digit after it gives the minutes interval (1 for minutes 0-29 and 2 for 30-59).

Table 5.4 Peak and off-peak transaction periods table

Most frequent            Least frequent
Flag    Freq             Flag    Freq
12_2     272             4_2       25
11_2     263             4_1       27
12_1     258             3_2       28
14_1     258             2_2       31
18_1     256             3_1       35
13_2     247             2_1       38
17_2     245             5_2       38
20_2     244             5_1       45

5.6 Patterns in customer behaviour

Very often it is useful to know the spending patterns of customers, especially of the most loyal ones. If we analyze several locations, it may be interesting to know which customers who shop at one location also shop at other locations, and how their total sales are split across stores.

How to do this depends on how the data is organized. But in most cases there is a list (or table) with customers, their sales and the stores where they made their purchases, as in the dataset customer.csv. In this file data refers to customers from 2 stores only.

Using this data, an attempt will be made to determine what proportion of the customers shop in more than one location (in this case in both stores) and what proportion of the total transactions are carried out in each of the two stores.

With the command:

a<- as.table(xtabs(~cust+Store, data=Dataset))

a cross-table with the following structure will be created:

     Store
cust  1  2
   1  2  0
   2  0  1
   3  1  0
   4  1  0
   5  0  1

The command

b<-as.data.frame(prop.table(a, 1))

computes the percentage of transactions carried out in each of the stores by each customer. The command head(b) gives:

  cust Store Freq
1    1     1    1
2    2     1    0
3    3     1    1
4    4     1    1
5    5     1    0
6    6     1    1

To isolate the customers who do transactions at both stores, a command which flags these customers with a 1 and the rest with 0 will be used. This is done by creating a flag variable through a combined condition on the store share: a share strictly below 1 means the customer did not make all transactions in a single store, and a share strictly above 0 means at least one transaction was made there. Together, the two conditions single out customers who split their transactions between the two stores.

Note that the command that does this will run for a while, especially on computers with relatively slow processors and/or little RAM, so one needs to be patient.

for( i in 1:nrow(b))

{if (b$Freq[i]<1 & b$Freq[i]>0 ) b$Flg[i]=1 else b$Flg[i]=0}

After the new flag variable has been created, the customers with a flag value equal to 1 can be isolated by subsetting the data, and the data can be ordered by customer to obtain a listing of frequencies.

c <- subset(b, subset=Flg==1)

sort.c <- c[order(c$cust), ]

The result looks like this:

       cust Store       Freq Flg
1039   1039     1 0.90384615   1
32463  1039     2 0.09615385   1
3777   3777     1 0.50000000   1
35201  3777     2 0.50000000   1
4182   4182     1 0.93442623   1
35606  4182     2 0.06557377   1
4867   4867     1 0.15384615   1
36291  4867     2 0.84615385   1
14430 14430     1 0.50000000   1

with the first column (the row names) showing the initial observation id as imported from the dataset customer.csv. However, this is not useful for our purposes here, and it will be ignored in the rest of the section.

In order to get the list of customers who shop at both stores, the following command needs to be used.

d<-as.data.frame(unique(sort.c$cust))
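From the objects created above, the proportion of customers shopping at both stores, asked about at the start of this section, can be obtained with a line such as the following (a sketch):

nrow(d) / length(unique(b$cust))   # share of customers who shop at both stores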

In other cases, analyses are more straightforward. I will not go into detailed examples of the calculations needed to get such statistics, as they are already covered in detail in previous sections.

For example, it is useful to see which customers stopped doing business with a company, or made only one or two purchases over a long period of time. A list of them, matched with their contact information from an extended account listing, could be the basis for a marketing campaign to make these customers active again.

In some cases, it is worth seeing which customers only make purchases on set days (e.g. on Saturdays) and accommodating their behavior by making them aware of special deals for that day a bit in advance. Inactive customers may also be of interest, especially those who used to purchase a lot. Maybe there is a serious reason why they switched to a competitor, and by addressing this issue the loss of more customers can be prevented, or some of the lost customers can be enticed to come back to do business with the company targeted by our analysis.

5.7 How to present your results and stand out in the eyes of your audience (or: how to keep the end result in mind when doing the analysis work)

Now that so many methods of analysis have been covered and their application shown, it is useful to know how to wrap the results up and sell them to our bosses, clients or other audiences.

This is more difficult and challenging than it looks. In practice, people, even those holding high-level positions in multinational corporations, are not very number-oriented and tend to frown at more sophisticated content that they do not understand. In other cases it is even worse: if the story behind the results is not clear and the audience does not have any familiarity with statistics, the analysis may end up in the garbage bin.

But most analysts and statisticians face the same issues. This is a tough challenge, because an analyst will mostly use statistical models, which model uncertainty. Statistical calculations will almost never yield a result in the form of an exact answer. And they are not meant to do so; exact answers are given by other sciences, not by statistics and its applications to the business world.


Rather, statistical calculations give results with a certain degree of confidence, and this needs to be used in creating confidence about the obtained results and their applicability.

Let us review below the basic elements that have to be included in a presentation of an analysis.

The first part must be an introduction of the issue. It may seem a bit odd, but many people highly appreciate this. Telling them why a regression is needed to figure out salary levels, what the purpose of customer segmentation is, or what the uses and benefits of market basket analysis are will go a long way in introducing your analysis and setting the expectations of your audience.

This first part also comes with a challenge. Here one needs to start building a relationship with the audience and create a first impression. This is the chance to build rapport with the audience, get to know who they are, and learn how they react. Except in special situations, one should not rush through it. Using humour, sharing past success stories, and giving some anecdotal evidence to add flavor are excellent complements to the introductory slides.

The second part should describe your data and your model. This is where one starts building confidence and making the analysis and findings well understood and well received by the audience.

Usually one will need to start with describing the data and showing its limitations. A degree of diplomacy and fine-tuning is often needed when describing the limitations. For some people, who are not very familiar with data analysis and data issues, limitations are a really bad thing and may lead them to reject an analysis. For others, they are essential in showing that the right data was used and that the analysis correctly took the inherent data issues into account. Among analysts, it is well known that the quality of the data used is essential in running a correct analysis and obtaining appropriate results.

Describing the model can also be challenging. It should be kept simple and non-technical. But remember that it should contain the basic layout of the analysis, as it will be shown later in the results. The best practice is to make sure that whatever goes into your model presentation should be later linked with the actual results obtained. Thus, model description must give a mental blueprint of the analysis results that will follow later in the presentation.

Doing an analysis and presenting it

It is often a misconception that in business analysis (and statistical analysis in general) the analysis part and the presentation part can be treated separately. That misconception is not without reason: the two parts are fairly different, they involve different tools and they make use of different sets of skills. It is often a challenge, especially for some analytically minded people, to present their results. Very often I have seen excellent analysts who did poorly in presenting their results, with their work being largely underrated, or even discarded, by people in operational roles or higher management who assume that a lack of communication skills means a lack of analytical skills.

While everyone has a specific set of skills and abilities and cannot excel at everything, using some basic techniques will ensure that most of the elements needed to be successful in an analysis role are in place.

It is a simple but effective practice to keep in mind, throughout the analysis process, all the key elements that will be included in the presentation. Clearly defining your data and being aware of its limitations will help keep an analysis appropriately focused. Of course, it is recommended to explore the available data a little and figure out which dataset is best for one's purposes. But one needs to be aware that the results must be presented, and that the data description part must be clear for the audience.

Next, the right model needs to be chosen. Again, the model must be clear in the mind of the analyst; this is pretty straightforward when read, but not that obvious when one has tried numerous versions of it and has had a hard time choosing between a few versions that are the best performers. However, keeping in mind how the results will be presented provides a focus additional to the one given by the statistical measures, and helps one choose the model that is the simplest to explain and to get understood by the audience.

Then one needs to have available the elements that will prove that the analysis performed is reliable. First, some of the statistical measures that tell the audience the analysis is sound and can be trusted are needed. The R2, p-values, percent of variance explained and the lift are key elements not only for choosing the right model, but also for selling it. One needs, especially in the beginning, to have a good understanding of these measures and to be able to explain them in layman's terms. Thus, one needs to explain that an R2 of 80% means that the explanatory variables account for about 80% of the evolution of the dependent variable. In some cases only the essential information needs to be conveyed. For example, when regression coefficients are presented, it may be useful to just flag them with a label showing that a certain coefficient can be relied on at a 99% confidence level.

Second, the graphs obtained during an analysis will be of great help for its presentation. Laying out a nice decision tree with elements that are clearly shown, or doing a regression plot, will convince the audience that the work is reliable. And third, having the model validated by comparing the estimation results with the actual results will definitely increase the confidence in one's work and results.

That's about all I wanted to cover with respect to the connection between the analysis work and its presentation. One thing I'd like to add, from a professor's perspective, is an old adage which holds and will hold true for a long time: the highest level of understanding and mastery of an issue comes when you are able to actually explain it step by step to someone else. This was also true when I wrote this course and chose the examples. More specifically, I had the challenge of choosing between nicely behaved examples, which help one understand exactly what was read or calculated, and not so nicely behaved examples, which bear some resemblance to the real-world results that are likely to be obtained in practice.

As an example, let us take the regression analysis done in course 2, which looked at the average expenditure, and analyze it in order to extract the essential details relevant to its presentation to an audience, following the first two steps presented above.

The first thing that should be mentioned is that this analysis tries to explain how average expenditure is determined by age, income and home ownership. Then, the model needs to be laid out. The best way is to lay out a formula which has a mathematical layout but contains variables that are easy to grasp. This is a bit different from the way formulas appear in most parts of this book, but it is much better for impressing the audience.

For writing formulas in a nice layout, an equation editor such as the one in MS Word can be used.

A more difficult challenge is presenting more complicated models such as segmentation, where the formulas are more involved. I did not cover those in the course for this reason. The best way to present them is to give a flavor of the results of the analysis by showing a step-by-step example. For example, segmentation can be shown as a sequence of branches that start to unfold, together with a presentation of the characteristics of the branches.

Another good practice when explaining the model is to give, if available, examples of similar models used by your competition, or ones that appear in presentations by major companies.

The third step is to actually present the results. They should follow the model explained in the part before, contain the basic statistical information that reinforces their reliability, and explain their relevance. A comparison between the predicted results and the actual data is recommended for regression analysis. For other models, a powerful diagnostic graph can help. A clear picture showing the clusters, or a silhouette plot showing the fit of the model, are powerful elements in a presentation.

Picturing confidence and lift should be essential elements of an association analysis presentation. Explaining the lift in a slide which shows how much more likely a relationship is to occur than a randomly selected purchase is the key to selling association analysis to any audience.

The last step is the conclusion. Here all the results presented earlier should be summed up in a concise and crystal-clear way. Only the most relevant and statistically strongest results should find their way here, in a clear, unambiguous form. Apart from the obvious fact that the conclusion is the last part of the presentation and provides the 'fruits' of an analysis, it is often the only part understood or used later by the audience. Hence the importance of getting it into the best possible shape. Another part of the conclusion is the recommendations. These showcase the usefulness of the results and their foreseen implementation. As an example, in the case of segmentation, they should contain practical advice on treating customers differently based on the value of their car, or based on the size and frequency of their purchases.

The recommendation part can be more challenging than it looks. There is an obvious reason for this: as an analyst, one may not be aware of the practical possibilities of implementing the results, or the users of the results may not see their applicability. It is good practice to talk to, or at least know the opinion of, the users of the results and the people in operational positions, who must 'buy' the conclusion and, if they do, take action or change their ways of working. In many cases, they may not be aware of all the possible uses of such analyses. A little bit of research can help generate ideas for the recommendations, and even provide examples for them.

Having reached the end of the course, I wish you good luck in your work, and I hope that you will have the chance to apply most of it. Also remember that this is just the basic material to get you going and that, depending on your work, you may want to read more about all that you have learned. Last but not least, you should complete the last set of exercises in order to build confidence and make sure you have a good handle on the essentials taught in this last chapter.

5.8 Exercises and questions for review

1. Using the churn rate example, let's make some changes to the assumptions. The product the data refers to is used up within three months or so, and the end of the fiscal year is October 31st. What would be the churn rate at the end of the fiscal year?

2. Please carry out the price points calculations described in section 5.3 for the price of cars in the cu.summary dataset, and present your results in Excel.

3. Based on the table in section 5.4, compute all the retail audit measures described in that section.

4. Compute the least frequent 15-minute transaction intervals using the data in section 5.5, and show the results in a table.

5. How can one get a list of the customers who purchase from one of the two stores only? Please write the commands as you would input them in the RCommander window.

6. Please prepare a presentation based on one of the models described in the book, or requested in any of the relevant exercises.

7. Please provide a fair and honest account of this book. How did you find it, what did you like most, what did you not like, and what comments and suggestions can you make?


