PageRank Basics: The Science bit
If anyone has spent any time reading information retrieval and search patents you'll know that it doesn't take long until there's a mention of eigenvectors, Markov chains, and various other mathematical principles. Having a strong appreciation, at the very least, of these concepts is critical in getting a true understanding of things like PageRank and other algorithms in the field. In this post I'll be introducing the topic, albeit briefly, for the uninitiated.
Let's take something simple like the football league as our analogy to help explain the maths. How do we determine the top team in a league? Let's denote a team's strength/quality by , and a specific football team by , so the strength of the football team is . Where would be proportional to the sum of the strengths of all the teams' it has beaten. Think of all the potential match-ups of teams playing each other, forming a giant matrix where subscript stands for rows and subscript for columns and the result of the game denoted by . Thus for instance, if team defeated team and of course if team lost to team .
To illustrate, the numbers below following the A= are in 6 rows forming one of these matrices, each one representing a football team and the numbers representing the results of the home games of each team. So this shows, for example, that team 1, at home (the first row) beat teams 3 and 4, team 2 (row 2) beat teams 1, 2,3 and 4 team 3 (row 3) beat teams 4 and 5, etc. One can extend this table with away matches as well, but for simplicity we evaluate only home performance.
Determining whether team 2 is better than team 3 would, theoretically, be easy as you'd just cross reference them to see who beat who when they played. However, this is primitive. A better way would be to determine where each of these teams should rank against each other by looking not only who each team beat, but also who everyone else beat or lost to. For example, if team A beats everyone except for team B, with B losing all its other games, you would still rank team A above team B, because team A beat everyone that beat team B.
Taking it one step further, some teams are tougher to beat than others so the ones that team A does beat is again more important than simply the number of total wins. Essentially, ranking the teams should look at everyone's performance against everyone's performance to determine who the best teams are. The mathematical process, drawing that line, gives us the eigenvector , specifically the eigenvalue eigenvector decomposition. The word eigen is a German word and means nothing else but proper, hence eigenvalue simply means proper value and eigenvector, proper vector.
Performing eigenvalue and eigenvector decomposition on our matrix A gives us a diagonal matrix of eigenvalues and a six on six matrix of eigenvectors. One should not worry about eigenvalues as they merely show us the importance of our eigenvectors. The eigenvector with the largest eigenvalue is the one under consideration and here is the one we received denoted by the letter p:
So, the largest entry in the eigenvector is the best team, the second largest entry is the second best team, etc. Substitute 'team' in the above analogy for 'site/page' and you have PageRank, assuming that 'team strength' is proportional to link equity. Note how this is different from just looking at how many sites link to another to determine ranking, here ranking is determined by focussing on which sites/pages beat all others based on a potential multitude of criteria. The eigenvector method is essentially a super smart way of determining importance within a large dataset. Beyond PageRank this approach is also relevant when it comes to on-page SEO relevancy comparisons between sites and pages.
Note that the matrix used in this example is something called a square matrix but in information retrieval (i.e. natural search in this instance) the matrices are called digraphs which are essentially matrices which show multiple nodes (pages) and the link relationships between them. For example, page A links to page B, Page C links to Page A, Page C links to Page B. These pages (nodes) have links that point in a particular direction (edges). The number of links pointing to a node is called indegree and the number of links pointing out are called outdegree. An array can then be built using all that data which then becomes another type of matrix, called a row-stochastic matrix.
Okay, now we get to random processes and the one we are interested in is what is called a Markov chain. A Markov chain is a random process that occurs over time according to some transition probabilities (don't worry if you don't understand this). I'll use another analogy to try and explain what this is - a Markov chain is a memoryless process, which means it is a series of events that are random and not influenced by the past. For example, if you are sleep walking, where you end up next is largely random and not influenced by where you've been, bar where you are right now before you start moving.
Similarly, PageRank is governed by a Markov chain process as it uses what is called a 'random surfer model', which is the probability of you landing on a particular site by randomly clicking on links from one site to another. For example, by clicking around the web, you are more likely to stumble on the BBC.co.uk than crappyunknownsite.com, and therefore the BBC will have a commensurately higher PageRank score as a result. You'll see the 'random surfer model' explanation in whitepapers all the time - technically at this stage it's a transition matrix or links matrix. An eigenvector is the mathematical bit that makes this all possible.
Now let us come back to our football matrix and imagine that numbers in this matrix are not results of the football games but the number of links from one page to another. Thus for instance the second row in the matrix shows that page number 2 has incoming links from pages 1, 2, 3, and 4 and page number 4 has only one incoming link from page 1. In this example, probabilities and the numbers of links are equal because for simplicity we assume that there is only one link on the page. The results of the eigenvalue eigenvector decomposition are the same, but now they have a completely different meaning as it shows us that page 2 has the highest RageRank and 4 the lowest PageRank, which is based on the number of incoming links.
Understanding these basic mathematical underpinnings of PageRank and other similar methods is key to developing the correct perspective on how the search engines view links and how this then informs rankings. These mathematical foundations not only make PageRank possible but remain some of the key building blocks for search algorithms going forward.