Application Of The Page Rank Methodology


02 Nov 2017


In 1998 Sergey Brin and Lawrence Page presented their paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" at the International World Wide Web Conference in Brisbane, Australia (Langville and Meyer 2009). In the article, Brin and Page presented Google as a "prototype of a large-scale search engine" (Brin and Page 1998).

In the article, Brin and Page explained the algorithm behind Google, called PageRank. PageRank was intended to rank web pages using citations, or links, and their relationship with keywords, which was not a common method at the time, as highlighted in the article. The idea derived from the methodology for evaluating the quality of academic literature based on citations from prestigious authors. Web pages with more links referring to them were therefore seen as more important than others (Brin and Page 1998).

PageRank eliminates the "middle man" in a web-based search: the human user does not have to sift through page after page of search results, because PageRank does this work for them, displaying the most relevant and helpful pages first as a result of the underlying algorithm explained below.

Page Rank algorithm and Ergodic Markov Chains

In order to find the PageRank of a certain page A, denoted PR(A), the algorithm uses the existing ranks PR(T1), …, PR(Tn) of the pages T1, …, Tn that contain links (citations) pointing to page A [1]. Using this algorithm, each link can be interpreted as a "vote" for page A, and the weight of that "vote" in the rank computation depends on the importance of the page "voting" (Anderson, Sweeney and Williams 2009), which is why the PageRank PR(Ti) of each quoting page is taken into account in the calculation.

In addition to the PageRank PR(Ti) of each source page, the algorithm also includes a damping factor d, related to the probability that a "random surfer" will stop clicking links and jump to a different random page instead (Brin and Page 1998). The third component included in the algorithm is the total number of outgoing links C(Ti) contained in each source page, which weights each "vote" against the overall number of "votes" available on a certain page. Therefore, the PageRank of a page A is given by the formula (Brin and Page 1998):

PR(A) = (1 − d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
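As a sketch, the formula PR(A) = (1 − d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) can be written directly as a single update step; the page names, link structure and rank values below are hypothetical:

```python
def pagerank_update(page, backlinks, pr, out_count, d=0.85):
    """One application of PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti))
    over the pages Ti that link to `page`."""
    return (1 - d) + d * sum(pr[t] / out_count[t] for t in backlinks[page])

# Hypothetical three-page web: B (2 outgoing links) and C (1) link to A.
backlinks = {"A": ["B", "C"]}
pr = {"B": 1.0, "C": 1.0}     # current rank estimates
out_count = {"B": 2, "C": 1}  # C(B) = 2, C(C) = 1

print(pagerank_update("A", backlinks, pr, out_count))
# 0.15 + 0.85 * (1.0/2 + 1.0/1) = 1.425
```

In practice this update is applied to every page repeatedly until the ranks stop changing.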

The PageRank method was considered revolutionary when it was presented because it did not take into account the user's previous surfing history; instead it calculated the overall probability of a human user visiting a particular site based on the quality and quantity of the links pointing to it. The intuitive explanation of this concept appears in the article, where Brin and Page state that "The probability that the random surfer visits a page is its PageRank" (Brin and Page 1998). This memory-less property is also a key element of their mathematical solution: the probability of a person visiting a certain page is assumed not to depend on the sites previously visited.

The number of links in each source page, C(Ti), offers the second important mathematical property of PageRank. If it is assumed that the probability that a "random surfer" will click any link within a page is the same, then the probability of clicking the link directed to page A is 1/C(Ti). Using the mentioned properties it is possible to represent PageRank as a Markov Chain in which each page is a node and each edge transition probability is given by d/C(Ti), while the (1 − d) expression included in the formula corresponds to the probability of returning to a "restart" state in which the surfer doesn't click any link but types a random address in the browser instead (Li 2005).

Given that the problem can be represented as an Ergodic Markov Chain, it is then possible to prove that the PageRank equation is actually the fundamental equation of Ergodic Markov Chains (Li 2005) [2]:

π = πP, with the components of π summing to 1

where π is the vector of steady-state probabilities (the PageRanks) and P is the probability transition matrix.
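A minimal sketch of this fundamental property: for an ergodic chain, repeatedly applying π ← πP converges to the unique steady-state vector. The two-state chain below is hypothetical:

```python
def stationary(P, iters=1000):
    """Power iteration pi <- pi * P; for an ergodic chain this converges
    to the unique steady-state distribution satisfying pi = pi * P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical two-state ergodic chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P)
# Solving pi = pi * P by hand gives pi = [5/6, 1/6], which matches.
```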

Application of the Page Rank methodology

The following example intends to show how the algorithm works. In this example, four finance webpages and their corresponding links are shown in the following figure:

For the sake of simplicity in this example, we can assume that the damping factor (d) is 1, therefore the probability transition matrix and steady-state probabilities are as follows:
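Since the original figure and matrices are not reproduced here, the following sketch builds the transition matrix for a hypothetical four-page link structure and iterates to the steady state with d = 1:

```python
# Hypothetical four-page link structure standing in for the missing figure.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": ["A"],
}
pages = sorted(links)

# Each row spreads probability uniformly over the page's outgoing links.
P = [[1.0 / len(links[p]) if q in links[p] else 0.0 for q in pages]
     for p in pages]

# With d = 1 the surfer only follows links; iterate pi <- pi * P.
pi = [0.25] * 4
for _ in range(500):
    pi = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
# Steady state: pi ≈ [1/3, 1/6, 1/3, 1/6] for pages A, B, C, D.
```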

The previous example is simple because the probability transition matrix already fulfills all of the requirements of an Ergodic Markov Chain, but in the real world this is not always the case, so some modifications are required to make the algorithm functional. A necessary condition for a Markov Chain to be ergodic (Dartmouth College 2006) is that it must be possible to get from any node to any other node (not necessarily in one step).

If the structure of the Internet is analyzed, it is easy to see that many pages are not interconnected (even through other links). Moreover, many pages have no outgoing links at all, which can be interpreted as dead ends or "Dangling Links". It would not be logical to assume that a random surfer would stay on such a page indefinitely (an absorbing state); instead we can assign a small transition probability of 1/N to every other page. A similar methodology can be applied to clusters of webpages with no connections between each other, which result in loops or "Rank Sinks" (Page, et al. 1999).

In 1999 Page and Brin published a second article providing additional details of Google's implementation, called "The PageRank Citation Ranking: Bringing Order to the Web". In this article, they described PageRank again and its implications on computation resources and practical application for web searching (Page, et al. 1999). They also explained their solution for the problems created by Dangling Links and Rank Sinks: multiplying an adjustment matrix E by the complement of the damping factor, and adding a "Dangling Matrix" D to assign a probability to dead ends, with the intent of linking all existing pages on the web (Thorson 2004). In the article Page explains how the modified expression still fulfills the fundamental equation (π = πP'). The adjustments to the original probability transition matrix P can be expressed as (Langville and Meyer 2009) [3]:

P' = d (P + D) + (1 − d) E

where D replaces each dangling (all-zero) row of P with a uniform distribution of 1/N, and E is the matrix whose entries all equal 1/N.
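The adjustment can be sketched as follows, writing the adjusted matrix as P' = d(P + D) + (1 − d)E, where D gives each dangling (all-zero) row a uniform 1/N distribution and E is the uniform matrix with all entries 1/N (notation following Langville and Meyer); the tiny test matrix is hypothetical:

```python
def google_matrix(P, d=0.85):
    """Adjusted matrix P' = d*(P + D) + (1 - d)*E: dangling (all-zero)
    rows are replaced by a uniform 1/N row, then every entry gets the
    (1 - d)/N teleportation mass, so each row sums to exactly 1."""
    n = len(P)
    adjusted = []
    for row in P:
        base = row if sum(row) > 0 else [1.0 / n] * n  # D fixes dangling rows
        adjusted.append([d * x + (1 - d) / n for x in base])
    return adjusted

# Hypothetical two-page web where page 2 has no outgoing links (dangling).
P = [[0.0, 1.0],
     [0.0, 0.0]]
Pp = google_matrix(P)
# Every row of Pp is now a valid probability distribution (sums to 1).
```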

Modifying the previous example by including an isolated page with no backlinks, we can create a scenario with both the Dangling Link and Rank Sink problems. In this case we use the standard damping factor d = 0.85 (Brin and Page 1998):

If we try to use the simplified method explained in the previous example with the matrix P, it does not converge to a solution. Making the adjustments explained above, we get:

Page and Brin's paper mentions that the adjustment matrix E can also be used for customization, to further prioritize a page. This provides a solution for other practical issues of the algorithm: if E is uniform, all pages are valued equally in the event that a surfer jumps to a different network by typing an address into the browser instead of clicking a link. An alternative proposed by Brin and Page is to assign a higher weight to home-page types of sites, or even to set a specific search or portal page to 1 (Page, et al. 1999).
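A sketch of such a customized E, as a teleportation vector that gives extra weight to chosen "home page" sites; the boost factor and page names are hypothetical:

```python
def teleport_vector(pages, boosted, boost=10.0):
    """Non-uniform E: boosted pages receive `boost` times the base weight;
    the result is normalized so the probabilities sum to 1."""
    weights = {p: (boost if p in boosted else 1.0) for p in pages}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

e = teleport_vector(["portal", "blog", "shop"], boosted={"portal"})
# "portal" now receives ten times the teleportation mass of the others.
```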

Page Rank’s challenges and issues

Perhaps the most critical challenge in the real-life implementation of the algorithm is the computation of the resulting matrix. With billions of webpages available on the Internet, each containing dozens of hyperlinks, solving the Markov Chain matrix using regular algebraic methods is very complex. Therefore, iterative methods are used to find a solution in a short period of time; these methods exploit the fundamental properties of the Ergodic Markov Chain and its convergence characteristics. The iteration method originally selected to solve the matrix was the power method (Kamvar, et al. 2003).
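A sketch of the power method working directly on the link structure, with one sparse pass per iteration rather than a dense matrix multiply; the convergence tolerance and the toy graph are hypothetical choices:

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Power method on the raw link structure. Dangling pages spread
    their rank uniformly; iteration stops when the ranks stop changing."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        dangling = sum(pr[p] for p in pages if not links[p])
        new = {p: (1 - d) / n + d * dangling / n for p in pages}
        for p in pages:
            if links[p]:
                share = d * pr[p] / len(links[p])
                for q in links[p]:
                    new[q] += share
        done = sum(abs(new[p] - pr[p]) for p in pages) < tol
        pr = new
        if done:
            break
    return pr

# Hypothetical four-page web; D is a dangling page with no outgoing links.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []})
# The ranks form a probability distribution (they sum to 1).
```

Working with the adjacency lists instead of a dense N×N matrix is what makes each iteration feasible at web scale.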

Another functional issue we have identified in the practical implementation of the algorithm is the risk of commercial manipulation. In their 1999 paper, Page and Brin underestimated this threat when they stated: "PageRanks are virtually immune to manipulation by commercial interests" (Page, et al. 1999). It is true that the philosophy behind the algorithm makes it less vulnerable to commercial manipulation, because the rank depends on the number of links from sites with high reputation. However, because Google has become the global reference for Internet search, there is an enticing incentive to try to manipulate the rankings, even at a high cost.

Network externalities in platform markets and PageRank

A very promising business application of PageRank is the analysis and prediction of Network Externalities. Platform markets are becoming a very important trend in business and are defined as products or services that bring together two groups of users (Eisenmann, Parker and Van Alstyne 2006). Success in these types of markets is heavily dependent on the network effect, in which the platform becomes more valuable when more users are added to its networks. Furthermore, reference users ("marquee users") can increase and influence the network effect and are a key element of its success (Eisenmann, Parker and Van Alstyne 2006). This dynamic is very similar to the logic described by Page and Brin for PageRank, as the ranking, and therefore the probability of success of a certain platform depends on the number of users and their relative importance. Transition probability in this case could be modeled based on the actual number of subscribers (and their importance), but also on references to the product or services in blogs, public media or social networks.

Since the launch of PageRank, Google has not stopped its endeavors to continuously improve upon the algorithm. In February 2011, Google launched what is known as Google Panda, an update to the algorithm which is aimed at returning more high-quality and high-content sites in the search results. As described on Google’s Blog: "… in the last day or so we launched a pretty big algorithmic improvement to our ranking—a change that noticeably impacts 11.8% of our queries.... This update is designed to reduce rankings for low-quality sites—sites which are low-value add for users, copy content from other websites or sites that are just not very useful. At the same time, it will provide better rankings for high-quality sites—sites with original content and information such as research, in-depth reports, thoughtful analysis and so on." (Google Inc 2011)

Google Correlate

Launched in May 2011, Google Correlate was a follow-up to Google Trends. Google Trends requires the user to input a query and provides the frequency of that query over time. With Google Correlate it is now possible to do the reverse: instead of just showing trends in users' search inputs over time, the user can upload data or type in a search term, and Google Correlate shows the search terms that correspond best with the time series in the uploaded data (Mohebbi 2011). To do this, Google Correlate compares the provided data against every search term used since the beginning of 2003.

Advantages of Google Correlate over traditional market research

In modeling real-world activities such as illness, unemployment, and sales figures, this approach has a major advantage over, for example, standard regression models provided by news or governmental agencies: time. Regression models published by these providers typically lag by weeks (for diseases) or months (for unemployment). Actual searches, as a reflection of users' reactions to current trends, are directly available, and using them has a significant cost advantage because the researcher needs less time and effort (Vanderkam, et al. 2011). Finally, the most significant advantage over traditional research is the size and diversity of the underlying user base. The Nielsen Company reported that of an estimated 212 million Internet users in the United States, 173 million had used Google at least once (Burn-Murdoch 2012). A survey aiming to measure U.S. user reaction to the latest Super Bowl ad by Anheuser-Busch would normally not be financially feasible; Google Correlate provides a faster, cost-efficient and reliable way of finding terms corresponding to recent trends (Eaton 2011).

How Google Correlate works: Time Series and Correlation

A time series is a series of observations taken at regular intervals – such as monthly unemployment figures, daily rainfall, weekly sales, quarterly profit or annual fuel consumption. When you have a time series it is always useful to draw a graph, and a simple scatter diagram shows any underlying patterns (Waters 2011).

To measure how two variables – or in this case, two time series – are related to each other, the Pearson Correlation Coefficient (r) is calculated. The Pearson Correlation Coefficient is a measure of the degree of linear relationship between two variables. While in regression the emphasis is on predicting one variable from the other, in correlation the emphasis is on the degree to which a linear model may describe the relationship between two variables. In correlation the interest is non-directional; the relationship is the critical aspect (Stockburger 2001). The correlation can assume values from r = -1.0 (when one variable increases, the other decreases) to r = +1.0 (when one variable increases, the other also increases); r = 0 indicates no correlation.
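As a sketch, the coefficient can be computed directly from its definition, the covariance divided by the product of the two standard deviations:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # ≈ 1.0  (perfect positive)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # ≈ -1.0 (perfect negative)
```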

The number of times a term is searched through Google varies over time and therefore constitutes a time series. What Google Correlate does is to compute those figures over time and space (only for US states) using the following steps:

Normalization: divide each query count by the total count of all queries in that week or state. This controls for the year-over-year growth in overall internet search use (Vanderkam, et al. 2011);

Standardization: after normalization, the series are standardized to have a mean of zero and a variance of one, so that queries can be easily compared. That is why the y-axis has negative and positive values: the units are standard deviations from the mean (Vanderkam, et al. 2011);

Correlation: the Pearson Correlation Coefficient (r) is calculated between the user's time series and every query in Google's database. The strongest correlations (closest to r = +1.0) are shown, and correlations under r = 0.6 are discarded as weak. Negative correlations are not considered, but can be explored by uploading the series multiplied by -1. Series with missing data can also be correlated, since the timeframes with missing data are ignored for candidate queries (Google Inc. 2013).
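The three steps above can be sketched as follows; this is a simplification of the published description, not Google Correlate's actual implementation:

```python
from math import sqrt

def normalize(counts, totals):
    """Step 1: divide each period's query count by that period's total
    query volume, controlling for overall growth in search use."""
    return [c / t for c, t in zip(counts, totals)]

def standardize(series):
    """Step 2: rescale to mean 0 and variance 1, so the units on the
    y-axis are standard deviations from the mean."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    return [(x - mean) / sqrt(var) for x in series]

def top_correlations(target, candidates, threshold=0.6):
    """Step 3: Pearson r of the target against every candidate series;
    for standardized series, r is just the mean of elementwise products.
    Correlations below the threshold are discarded as weak."""
    n = len(target)
    results = {name: sum(a * b for a, b in zip(target, series)) / n
               for name, series in candidates.items()}
    return {name: r
            for name, r in sorted(results.items(), key=lambda kv: -kv[1])
            if r >= threshold}
```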

To efficiently compare different time series and determine correlation within its colossal database, Google Correlate uses an Approximate Nearest Neighbor (ANN) system, optimized for finding the closest points in a metric space with a "good guess" of the nearest neighbor. A novel asymmetric hashing technique is used to compute the distance between the target series and the database series, so the target can be matched against all database series very quickly. Exact correlation is then computed for the top 1,000 results in a second pass to achieve precision (Vanderkam, et al. 2011).

It is important to note that correlation is not the same as causation. Correlation only describes how similar two series of data are to each other; it cannot establish a relation of cause and effect between them. For example, typing "car" into Google Correlate shows a high correlation (r=0.9005) with the term "Hawaiian". However, "car" neither causes nor has an effect on a person's nationality. Correlation of two terms therefore does not necessarily imply causation.

Application and Google Correlate's value added

How does Google Correlate add value for making decisions within enterprises? An interesting example can be seen by plotting the average monthly temperature in Germany over the last nine years and comparing it with the search terms used by Germans in the same time frame. This result can be seen in exhibits 1 to 4.

German internet users increasingly search for terms related to special cycling tours within Germany, e.g. the Elbe River bike path ("Elberadweg", r=0.9398), the Rhine River bike path ("Rheinradweg", r=0.9192) or the Weser River bike path ("Weserradweg", r=0.9187), in April, May and June, when the temperature rises towards its average maximum. The same applies to search terms related to camping utilities ("campingstühle", r=0.9270) and camping or party tents ("Zelt", r=0.9307, or "Partyzelt", r=0.9172).

These trends can be useful for several purposes. Take, for example, the German domestic holiday industry, which mainly offers domestic holiday packages such as cycling tours along the main rivers. Knowing that Germans are eager to spend money on such tours when the temperature is rising, providers could launch special offers for biking tours. The data also imply that these promotions should start almost immediately, because customers tend to decide spontaneously to go cycling rather than searching long in advance for this type of leisure activity. Nearly the same applies to producers and retailers of camping equipment: especially in the mentioned months, they should be ready with dedicated offers for customers who want to renew their camping equipment for the upcoming sunny season.

Google Correlate also allows the time series to be shifted against each other by several months. In our example, shifting the temperature series by one quarter reveals high correlations with search terms related to winter clothing and winter holidays, with r=0.8449 and r=0.8472 respectively. For the apparel industry this result shows that German Google users start informing themselves about trends in the upcoming winter season three months before it starts, so producers and retailers should decorate their displays and design their websites accordingly. This could also enable producers in more trend-driven industries to acquire a competitive advantage by basing decisions on these data.
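The shifting can be sketched as a lagged correlation; the two series below are hypothetical, constructed so that "searches" trail the temperature series by exactly one period:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def shifted_correlation(x, y, lag):
    """Correlate x against y shifted `lag` periods later, e.g. lag = 3
    compares each month's temperature with searches three months on."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return pearson_r(x, y)

temps = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
searches = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]  # temps delayed by one step
# shifted_correlation(temps, searches, 1) ≈ 1.0
```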

Based on the above, Google Correlate follows the trend of data mining in market research and offers a very powerful complement to the customer data companies already own. The significant time and cost advantages, as well as the magnitude of the underlying data, are noteworthy advantages over traditional market research. Using such a tool, businesses are better equipped to implement marketing campaigns cost-efficiently and to tailor such campaigns to fit precisely with current customer needs.

Exhibits

Exhibit 1 – Search term correlation to average German temperature

Exhibit 2 – Search term correlation to average German temperature (graph)

Exhibit 3 – Search term correlation to average German temperature, 3-month shift

Exhibit 4 – Search term correlation to average German temperature, 3-month shift (graph)


