Clickstream Data Collection On The Web

Veronica Gomez

New York University

Abstract

TBD.

Introduction

The web today is a key communication channel for organizations (Norguet et al., 2006, p.430). One big advantage of electronic publishing is that, unlike print or broadcast channels, websites can be measured directly (Ogle 2010, p.2604). This is possible because the web allows user events on a site to be logged. Through the analysis of this information, user behaviors become visible and can be used to improve websites and to inform marketing (Burby & Atchison, 2007). It is not surprising, then, that interest in monitoring user activities on websites is said to be as old as the web itself (Spiliopoulou & Pohle, 2001, p.86).

But exactly how is this information collected? Amazon.com, for example, illustrates what makes users wary of how much a website "knows" about its visitors: the technology that allows amazon.com to collect valuable data can also be used to follow users to other sites and track their activity there. An article in The New York Times (Singer & Duhigg, 2012) reported that such tactics were in use in the 2012 presidential campaign. According to the article, both the Obama and the Romney websites used third-party surveillance engines that allowed them to retarget people who had visited their sites by serving ads to them on other websites. These mechanisms can tailor the ads to the user based on the user's web activity. For example, if the digital footprint of a user indicated an interest in environmental issues, the campaigns could serve that user ads in which the candidates addressed the topic. This raises concerns for consumer advocates, who question how securely the information collected by websites is stored and whether users are identifiable. They argue that if a user's identity is revealed, Internet data collected on that person could be used against them. Privacy advocates have long called for stricter regulations involving online data, and in Europe their calls are being heard.

Such secondary data, collected through means such as log files or page tagging, is often referred to as clickstream data.

Purpose of the Study

This chapter first gives an overview of the different data collection techniques available for Web Analytics today. Following this, different metrics are outlined before the process of Web Analytics is described in more detail. A short section then discusses the legal aspects of Web Analytics. The chapter closes with a summary of the main challenges.

How is data collected from our interactions with the Internet?

Is this data private?

Literature Review

The World Wide Web as we know it was born on April 30, 1993, when CERN released its technology for connecting computers around the world as a free and open platform (PCMag article). Twenty years later, it is still growing rapidly. In 2009, about 1 million webpages were published each day, and the number of websites and visitors has increased exponentially (Pani et al., 2011, p.15). By the end of March 2011, 2,095,006,005 people had used the Internet, accounting for 30.2% of the world population (Miniwatts Marketing Group, 2011).

When surfing the web, many different kinds of websites can be found. As different sites have different purposes, it is helpful to classify them into categories. Five basic types are often distinguished: commercial, personal, informational, organizational and political (P. Crowder & D. A. Crowder, 2008, p.15f).

Personal websites - The goal of a personal website is mainly to introduce an individual, his or her interests and opinions, but the actual implementation of a personal website can vary widely. Personal websites range from simple text sites, to picture sharing sites, and to blogs or forums. Personal websites are usually managed by a single individual.

Commercial websites – The purpose of commercial websites is to sell a product or service—that is, to make a profit. Company websites that are not designed to sell fall into the category of organizational websites.

Organizational websites - Organizational websites present information about a particular organization. They are similar to informational websites with the difference that they are managed by an organizational body.

Informational websites - The purpose of informational websites is to provide information on a particular topic. These sites can be free or fee-based. Many of these sites are created by public-minded organizations who want to make people aware of a special topic.

Political websites - Websites that deal with political issues are called political websites. This includes sites dealing with political parties, candidates, legislation, and so on.

No matter which kind of website someone looks at, they all have something in common: all websites consist of three basic elements: text, pictures, and links (Wu et al., 2009, p.168). As users interact with web pages and the elements that make up those pages, their interactions are recorded. This data is called clickstream data (Kosala & Blockeel 2000; Etminani et al. 2009, p.396). Clickstream data collection is grouped into three categories: 1) server-side, 2) client-side, and 3) an alternative method (Kaushik 2007). Server-side data collection refers to data capture from the perspective of the server where the website resides; the data is captured in the form of log files. Client-side collection methods record events from the user's perspective, that is, the client terminal from which the user is calling the website; page tagging and web beacons are client-side data collection methods. The third category involves capturing data in transmission between server and client (Kaushik 2007); packet sniffers are an example of this alternative method of clickstream data collection.

Server-side data collection occurs in the form of log files. A web log is a file created by the web server: each time a particular resource is requested, the server writes information to the log (Pani et al. 2011, p.19). When a user requests data from a server via a URL, the server accepts the request and creates an entry (log) for this particular request before sending the web page to the visitor (Kaushik 2007). The log therefore records information about activity on the server. Originally, log files were developed for debugging purposes (Suneetha & Krishnamoorthi 2009, p.328).

Web server logs are plain text (ASCII) files and are independent of the server platform. Different web servers may save different information in their logs. Table 1 gives a short overview of this basic information.

In addition to the examples outlined above, demographic information such as country can also be captured. It is important to note that it is not the personal information of the user that is captured at this point, but technological details, such as the location of the server used by the user (Weischedel & Huizingh 2006, p.465). Log files can be custom formatted; however, two common formats exist: the W3C Common Log Format (CLF) and the W3C Extended Log File Format (ECLF) (Pani et al. 2011). The ECLF is the most common format in use (Sen et al. 2006, p.85). The following gives an example of what an ECLF log file can look like:

www.lyot.obspm.fr - - [01/Jan/97:23:12:24 +0000] "GET /index.html HTTP/1.0" 200 1220 "http://www.w3perl.com/softs/" "Mozilla/4.01 (X11; I; SunOS 5.3 sun4m)" (W3perl 2011)
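
To make the structure of such an entry concrete, the following is a minimal TypeScript sketch of how a log line in the combined/extended format shown above could be parsed. The regular expression and the field names are illustrative assumptions rather than part of any particular tool.

// Minimal sketch: parse one combined/extended-format log line into named fields.
// The pattern below matches the example entry quoted above; production parsers
// handle many more variations.
interface LogEntry {
  host: string;       // remote host or IP address
  timestamp: string;  // request time as recorded by the server
  method: string;     // HTTP method, e.g. GET
  resource: string;   // requested URL
  status: number;     // HTTP status code
  bytes: number;      // response size in bytes
  referrer: string;   // referring page, if the browser sent one
  userAgent: string;  // browser or bot identification string
}

const LOG_PATTERN =
  /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"/;

function parseLogLine(line: string): LogEntry | null {
  const m = LOG_PATTERN.exec(line);
  if (!m) return null; // the line does not match the expected format
  return {
    host: m[1],
    timestamp: m[2],
    method: m[3],
    resource: m[4],
    status: Number(m[5]),
    bytes: m[6] === "-" ? 0 : Number(m[6]),
    referrer: m[7],
    userAgent: m[8],
  };
}

// Applied to the example entry above, parseLogLine returns an object whose
// resource field is "/index.html" and whose status field is 200.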

Log files can collect huge amounts of data with little effort, but they do have certain limitations. First, because server logs capture every activity of the server when a site is requested, one site request by a user can lead to several log entries: if, for example, a requested page includes a picture, loading the picture produces a separate entry, and the same applies to CSS files or Flash videos. A related problem is that web robots, spiders, and crawlers (bots) produce enormous numbers of log entries (Pani et al. 2011, p.21). Bots are automated programs that visit a website (Heaton, 2002); they are used by web search engines and site monitoring software to see what is available at a site. As the number of bots today is enormous, they can dramatically distort the results (Kohavi et al. 2004, p.95). Secondly, if a page was cached and a user is served the cached copy, no server log entry is written because the server never receives the request. The same applies when users use the browser's back button to switch between pages (Cooley et al. 1997). Third, server logs cannot record the time spent on the last page of a visit: the request for the following website generates a log entry on that site's web server, but no signal is sent back to the first server, so the time spent on the exit page cannot be determined (Hofacker, 2005, p.233).
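
As a simple illustration of the bot problem described above, the sketch below filters parsed log entries by user-agent string. The patterns are illustrative only; real bot lists are far longer and user agents can be spoofed, so this removes only well-behaved bots. It assumes the LogEntry type from the parsing sketch above.

// Illustrative filter that drops log entries whose user agent looks like a bot.
const BOT_PATTERNS = [/bot/i, /crawler/i, /spider/i, /slurp/i];

function isLikelyBot(userAgent: string): boolean {
  return BOT_PATTERNS.some((pattern) => pattern.test(userAgent));
}

function filterHumanTraffic(entries: LogEntry[]): LogEntry[] {
  return entries.filter((entry) => !isLikelyBot(entry.userAgent));
}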

Another limitation of server log files is that the identification of individual users is not reliable for sites that do not require visitors to log in. In such cases, a user's IP address is often used as an identifier. But in a large company network, for example, multiple users share one IP address, and individual users cannot be distinguished (Weischedel & Huizingh 2006, p.465). Users with dynamically assigned IP addresses cannot be identified either, because their IP address will have changed by the time they visit the site a second time. Reliable distinction of users can only be guaranteed through user self-authentication (logins) or by using cookies (Spiliopoulou, 2000, p.129). A cookie is "a message given to a web browser by a web server. The browser stores the message in a text file. The message is then sent back to the server each time the browser requests a page from the server" (Web Analytics Association). The problem with cookies is that more and more users do not allow cookies to be stored or delete them regularly.
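
To illustrate how a cookie can support user identification, the following browser-side sketch stores a random identifier in a first-party cookie and reuses it on later visits. The cookie name visitor_id and the two-year lifetime are assumptions made for this example, not a standard.

// Sketch: recognize a returning visitor with a first-party cookie.
function getCookie(name: string): string | null {
  const match = document.cookie
    .split("; ")
    .find((part) => part.startsWith(name + "="));
  return match ? decodeURIComponent(match.split("=")[1]) : null;
}

function getOrCreateVisitorId(): string {
  let id = getCookie("visitor_id");
  if (!id) {
    // No cookie yet: generate a random identifier and store it for about two years.
    // crypto.randomUUID() is available in current browsers.
    id = crypto.randomUUID();
    const twoYearsInSeconds = 60 * 60 * 24 * 365 * 2;
    document.cookie =
      "visitor_id=" + encodeURIComponent(id) +
      "; max-age=" + twoYearsInSeconds + "; path=/; SameSite=Lax";
  }
  return id;
}

// If the visitor blocks or deletes cookies, a new identifier is generated on the
// next visit, which is exactly the limitation described above.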

In order to address the limitations of web logs, new methods of tracking user interactions with websites were developed. Page tagging and web beacons are two client-side methods for collecting clickstream data. Tagging involves adding a snippet of code, usually JavaScript, to every page on a website. When a visitor requests a URL from a web server, the server sends back the page, including the JavaScript code. This code is executed on the client terminal while the page loads; it captures data such as page views and cookie values and sends it to a data collection server. The variety of data that can be captured is huge, ranging from clicks and cursor position, to mouse movements and keystrokes, to the browser window size and installed plug-ins. Any information that can be captured by log files can also be captured with tagging (Kaushik 2007, p.54). Today, JavaScript tagging is preferred over all other data collection methods (Kaushik, 2007, p.30ff). Table XX gives an example of the JavaScript code for Google Analytics.
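
The actual Google Analytics snippet is not reproduced here; instead, the following is a generic, illustrative sketch of what a page tag can look like. The collection endpoint https://collect.example.com/hit is hypothetical, and the visitor identifier comes from the cookie sketch above.

// Generic page-tag sketch (not the Google Analytics code): gather a few values
// about the page view and send them to a collection server as an image request.
function sendPageTag(): void {
  const data = new URLSearchParams({
    page: location.pathname,                       // page being viewed
    referrer: document.referrer,                   // where the visitor came from
    resolution: screen.width + "x" + screen.height,
    language: navigator.language,
    visitor: getOrCreateVisitorId(),               // cookie-based ID (see above)
  });
  // Requesting a tiny image transfers the data without needing a visible response.
  new Image().src = "https://collect.example.com/hit?" + data.toString();
}

// Executed once the page has finished loading, e.g. from a tag placed in the footer.
window.addEventListener("load", sendPageTag);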

One clear benefit of JavaScript tagging is the possibility of also getting data from cached pages: as the code is executed each time a website loads, it also runs when the page is loaded from a cache. In contrast, if users have switched off JavaScript, which 2 to 6 percent of users have, no data about these users is captured at all. The implementation effort of JavaScript tagging is normally low. Admittedly, the JavaScript code needs to be included in every single page, but it is only a few lines, and it gives control over what data is being collected. The tag should always be placed in the footer of a web page, so that a page does not fail to load just because the tagging causes problems. Furthermore, it is possible that an ISP (Internet Service Provider) vendor collects the data; here the question of who owns the data needs to be clarified. But besides all these benefits, JavaScript tagging also has limitations. What was described about cookies for log files applies to tagging as well: if users switch cookies off or delete them, this information is lost. Furthermore, even though it is possible, it is much harder to capture data about downloads with tagging than with log files. PDF files, for example, do not include executable JavaScript code. If they are requested through a URL on the web page, the request can be captured; but if they are opened directly from a search engine, the request will not be recognized. In addition, if a website already uses a lot of JavaScript, the tagging can cause conflicts and is sometimes not even possible (Kaushik 2007, p.32f; Hassler 2010, p.60f).

The second method of client-side data collection is web beacons. Web beacons are 1x1-pixel transparent images that are sent from the web server along with the website; when requested, they cause data to be sent to a data collection server. They were developed, and are mostly used, to measure data about banners, ads, and e-mails and to track users across multiple websites. Web beacons are easily implemented in web pages with an <img src> (image) HTML tag. They can capture data such as the page viewed, time, cookie values, or referrers. As robots do not execute image requests, they are not visible in web beacon data collection. But if users turn off image requests in their e-mail programs and web browsers, those users are not measured either. Furthermore, as web beacons often come from third-party servers, they raise privacy issues, and antispyware programs, for example, will delete them.

Web beacons are not as powerful as JavaScript tags, so they should not be used as the main data collection method. They stand out, however, when data needs to be collected across multiple websites (Kaushik 2007, p.28f).

The third category of data collection, which involves neither server-side nor client-side collection, is packet sniffing. Packet sniffers are implemented between the user and the website server. When a user requests a page, the request runs through a software- or hardware-based packet sniffer that collects data before the request is passed on to the website server; on the way back to the user, the response again passes through the packet sniffer. This method of data collection yields the most comprehensive data, as all communication is observed, and both technical and business-related data can be captured. Only cached pages are not measured, unless they carry additional JavaScript tags. One clear benefit of packet sniffing in contrast to JavaScript tagging is that the website itself does not need to be touched: nothing on the actual website needs to be changed. On the other hand, additional hardware and software are needed, which have to be installed, controlled, and maintained. Furthermore, packet sniffing raises privacy issues. With this method, raw packets of the web server's Internet traffic are collected, which means that information such as passwords and addresses will be saved. Careful handling is required to process this data correctly (Kaushik 2007, p.35f).

There are different ways to identify users and separate them into categories. If every user needs to log in to a website in order to see or use it, the registration information can easily be used to distinguish between users. Otherwise, the IP address or cookies are often used. When using IP addresses, different methods, or rather different definitions of what constitutes a new user, are possible. Assigning a different user to each distinct IP address might be the simplest approach, but due to the use of proxy servers (Pierrakos et al. 2003, p.328) and the fact that different users can share the same IP address while one user can appear under different IP addresses (Etminani et al. 2009, p.397), this method is quite inaccurate. Other approaches go further and look not only at the IP address but also at the operating system and browsing software used: if the IP address is the same but the systems differ, the assumption is made that each distinct agent type for an IP address is a different user (Suneetha & Krishnamoorthi 2009, p.330). Another method of identifying users is the use of cookies. Today they are a frequently used and reliable source for user identification; even though they have some issues, they are still more accurate for user identification than IP addresses (Hassler 2010, p.54).
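
The heuristic of treating each distinct combination of IP address and user agent as a separate user can be sketched as follows. This is only an approximation, for the reasons given above (proxies, shared machines, changing IP addresses), and it reuses the LogEntry type from the parsing sketch.

// Group parsed log entries by (IP address, user agent) as a rough proxy for users.
function groupByApproximateUser(entries: LogEntry[]): Map<string, LogEntry[]> {
  const users = new Map<string, LogEntry[]>();
  for (const entry of entries) {
    const key = entry.host + "|" + entry.userAgent;
    const bucket = users.get(key) ?? [];
    bucket.push(entry);
    users.set(key, bucket);
  }
  return users;
}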

Conclusion

Historically, the first method used for Web Analytics data collection was to capture the log files generated on the web server (Kaushik 2007; Hasan et al. 2009). Since then, other methods have been developed that try to overcome the issues of log files; all of these methods were outlined above.

From its beginnings in the 1990s, it took more than 10 years before a standard definition of Web Analytics was given by the Web Analytics Association in 2006, which shows how broad the spectrum of Web Analytics is today (Kaushik 2007, p.2). They defined Web Analytics as

"the objective tracking, collection, measurement, reporting, and analysis of quantitative Internet data to optimize websites and web marketing initiatives" (http://www.webanalyticsassociation.org/ 2006).

Looking at this definition, it becomes clear that Web Analytics can help to understand where problems exist, but it does not explain how to solve these problems or optimize the website (Wu et al. 2009, p.164). This further step needs to be carried out by an analyst who understands the metrics produced through Web Analytics and can translate them into actions.

The data collection methods discussed above all have limitations, and there are a few considerations that apply to all of them. Probably the most important point is to make sure that the website keeps working for the user: the customer comes first, not the analysis. If the analysis is not working, that is bad; but if the site is not working because of problems with the data collection, that is far worse. Furthermore, user privacy needs to be maintained: privacy policies need to be in place and should be observed. Using web logs on one's own server means the log files are owned directly, but with all other collection methods attention should be paid to data ownership if the analysis is handed to a third party. Furthermore, if the website is hosted on several servers, the data needs to be aggregated in order to get a complete overview. The issues with cookies and with measuring the time spent on the last page, as already outlined for log files, apply to every collection method (Kaushik 2007, p.36f).

Regardless of the method used for data collection, no approach is 100% accurate, and there is no single way to go (Kaushik 2007, p.37). As outlined in the explanation of the different methods above, each has its own benefits and challenges. Web logs and JavaScript tags are the most used data collection techniques at the moment (Kaushik 2007, p.100), but there is considerable debate about which of the two to use. Hints about when to use which method were given above, but the question to ask is "what do we want to get out of the data?". No technique will collect "all" data, and it is also hard to make solid statements about the quality of clickstream data. The web is a nonstandard environment: websites and their technologies are constantly changing, visitors use different mediums to access pages, and for various reasons not everything is captured in clickstream data (Kaushik 2007, p.108). Kaushik expressed the state of data quality very clearly: "Data quality on the Internet absolutely sucks" (2007, p.109). But beyond data quality itself, what matters more is how confident someone is in the data. If someone knows how the data was generated and what drawbacks the method has, it is still possible to use the data well. The important point is that the data leads to a decision so that it is possible to move forward (Kaushik 2007, p.110).

The data itself is not very useful until it is analyzed. But even with different metrics in hand, interpretation can be problematic. For example, does a user who looked at a lot of pages represent a satisfied user who found a lot of information, or a lost user who clicked around because he could not find what he was searching for (Sullivan 1997)? And what does a long viewing time for a page say? The visitor may actually be reading the text, but it is also possible that he is making a phone call or getting a coffee while the page simply stays open (Weischedel & Huizingh 2006, p.465).

Even though it is not discussed in detail here, qualitative data collection techniques can further help to understand users. With the help of metrics, quantitative data can be analyzed, but it only shows what happened at a specific site in a given timeframe; it does not tell why something happened. To really understand user behavior, the quantitative analysis answering the "what" (clicks, visitor count, etc.) and the qualitative analysis answering the "why" (intent, motivation) need to be combined (Kaushik 2007, p.13f). In addition, competitive data showing how competitors are doing helps to estimate whether one's own business is doing well or not (Kaushik 2007, p.44).

Aside from changes in the environment, such as the introduction of a new product, understanding the user behavior on a website is the main reason why websites are changed today (Weischedel & Huizingh 2006, p.463). In the past, consumers were mostly passive elements in the overall ecosystem; with the rise of the Internet this has changed completely (Burby & Atchison 2007, p.6). Today they are very important participants and can provide a great deal of interesting data, so it is desirable to understand how to support them best. With Web Analytics it is possible to follow the entire process and to see a website from the perspective of its users (Spiliopoulou 2000, p.133). Knowing how users access a site and which paths they take is critical for the development of effective marketing strategies, as it allows organizations to predict user visits; it further helps to optimize the logical structure of a website (Cooley et al. 1997). As Norguet et al. put it, "improving Web communication is essential to better satisfy the objectives of both the Web site and its target audience" (2006, p.430).

In the past, Web Analytics has mostly been used in the commercial field (Wu et al. 2009, p.163), and most analytical investigations described in the literature were done for commercial websites. Nevertheless, there is no reason why Web Analytics cannot also be helpful for other kinds of websites. Some measures are much easier to determine for e-commerce websites, such as counting a visit that ends in a purchase as successful, but many other measures can be applied to other kinds of websites as well.

It needs to be kept in mind that Web Analytics is only an analysis; it is not an exact science, and no numbers will reflect reality 100%. But "It is better to be approximately right than precisely wrong" (Hassler 2010, p.34f), because approximate data can still be used to draw conclusions. Web Analytics cannot give definitive answers, as no direct answers from users are collected. With the help of analytic tools it might be easy to quantify a site, but the data needs to be interpreted by an analyst who takes action for improvements (Ogle 2010, p.2604).

"In web analytics, "going wrong" often means just going halfway." Very often vast amounts of money are invested into tools, but in the end only reports are produced and nothing more (Burby & Atchison 2007, p.43). The part of taking action after the analysis might be the most critical aspect in Web Analytics but is often neglected.

The volume of data to measure can be another problem area. Not all tools are capable of handling very large amounts of data (Sen et al. 2006, p.87). This is especially problematic when the analysis is not done in real time but on large amounts of historical data.

As discussed earlier, there are hundreds of metrics available within Web Analytics tools. It is not always easy to understand the differences between metrics and find the appropriate one for a given purpose (Sen et al. 2006, p.87). It furthermore needs to be clear how the metrics are calculated. The time spent on the last page before leaving, for example, cannot be measured accurately: as no information arrives from the next server, most tools terminate a session after 29 minutes of inactivity (Kaushik 2007, p.36).
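
The inactivity rule mentioned above can be illustrated with a small sessionization sketch: a visitor's requests are split into sessions whenever the gap between consecutive requests exceeds the timeout. The 29-minute value follows the convention cited above; tools differ in the exact cutoff, and the Hit type here is a simplified assumption.

// Sketch: split one visitor's requests into sessions using an inactivity timeout.
interface Hit {
  time: Date;       // when the request occurred
  resource: string; // what was requested
}

const SESSION_TIMEOUT_MS = 29 * 60 * 1000;

function sessionize(hits: Hit[]): Hit[][] {
  const sorted = [...hits].sort((a, b) => a.time.getTime() - b.time.getTime());
  const sessions: Hit[][] = [];
  let current: Hit[] = [];
  for (const hit of sorted) {
    const last = current[current.length - 1];
    if (last && hit.time.getTime() - last.time.getTime() > SESSION_TIMEOUT_MS) {
      sessions.push(current); // the gap was too long: close the current session
      current = [];
    }
    current.push(hit);
  }
  if (current.length > 0) sessions.push(current);
  return sessions;
}

// Note that the time spent on the final page of each session remains unknown,
// as discussed above.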

The market for Web Analytics tools has grown in recent years. But which tool is the appropriate one, and what are the differences? An answer to this question is given in Chapter 6. Beyond the analytical part achieved with a tool, there is a need to understand the website as a whole, including its content and structure (Cooley 2003, p.94). It is necessary to understand the metrics and to generate useful findings from the statistics. Many analyses stop at findings such as how many people visited the website. Really good Web Analytics investigations should go beyond that and conclude with actions, for example redesigning the website or generating ideas for marketing campaigns.

A few years ago, the majority of web metrics were generated from server log files. These are able to create an overview of users' behavior, of existing problems within the website, and of the technology used (Weischedel & Huizingh 2006, p.464). In recent years, however, other data collection techniques for Web Analytics have been developed that try to overcome the challenges and limitations of web server logs and to analyze additional kinds of data. Log files only capture server-side data at the moment of a request; they contain no information about what the user is doing between clicks to a new page, nor about which settings are used (Kaushik 2007, p.54). Because of the limited development of web logs and positive innovations such as JavaScript tags, Kaushik (2007, p.27) recommends using web logs only to analyze search engine robot behavior and to measure success in search engine optimization; in all other cases, other data capture methods should be used.


