How Google's PageRank Algorithm Works


February 9, 2023


An exceedingly brief history of search

In the early days of the Web, search wasn't even necessary. There were only a handful of websites, so someone could simply maintain a page consisting of nothing but links to every other page, descriptions included. You can still visit Tim Berners-Lee's own web page catalog here. However, as the number of websites grew, this manual indexing became next to impossible. One of the first tools that could even be considered a search engine was Archie, which predates HTML itself. A user typed in the exact name of the FTP file they were looking for, and Archie pointed them to a server they could download it from. Then came tools like Lycos, AltaVista, and Ask Jeeves, which all used fairly simplistic search algorithms and are probably unfamiliar to you if you were born after 1990. Then, in 1998, Google was founded: a search engine built around a revolutionary algorithm, PageRank. Before we look at how the algorithm works, we first need to understand what problem it solves.

The problem

The Web is huge, and finding valuable information about a specific topic is incredibly difficult. Let's imagine that we have a database containing every web page. How would we find the best page for information about dogs? Keep in mind that it is 1998, so we don't have access to fancy LLMs that can summarize and categorize large bodies of text for us. An initial idea might be to find the page that contains the word "dog" more times than any other page. This seems like a pretty solid idea: a page about dogs will feature the word "dog" more often than one about theoretical physics. However, it would be trivial for a bad actor to game our system; they would simply have to make a page consisting of nothing but the word "dog" repeated hundreds of times. So, how can we find a web page that 1) is about what we are searching for, and 2) is a real website that actually contains useful information? Well, we can split these two conditions into two different algorithms and combine their results later. We already have a way to find pages related to our search: just match our search term against the content of a page. But how can we find websites that actually contain useful information?
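To make the weakness of the word-counting idea concrete, here is a minimal sketch in Python. The page names and contents are invented purely for illustration; the point is simply that scoring by raw term frequency lets a spam page win.

```python
# Naive "search": score each page by how many times it contains the query word.
# The pages below are made up; note how trivially the spam page wins.
pages = {
    "dog-care-guide": "How to feed, walk, and groom your dog, plus common dog health tips.",
    "physics-notes": "An introduction to quantum field theory and general relativity.",
    "spam-page": "dog " * 500,  # a bad actor repeating the word to game the ranking
}

def naive_search(query: str) -> list[tuple[str, int]]:
    scores = {name: text.lower().count(query.lower()) for name, text in pages.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(naive_search("dog"))  # the spam page comes out on top
```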

Basic assumption

Essentially, the idea is that websites containing useful information will be more popular, which in turn means that many other websites will link to them. But before we just count how many links a website has pointing to it, there are some nuances to consider. Just as our bad actor exploited us before by creating a website of nothing but the word "dog", they could game our new system by creating a fake website filled with nothing but links to their real website, or worse, hundreds of fake websites that all do this. So, we have to constrain our definition of a popular website slightly: a website is popular if it has many links from other popular websites pointing to it. This may seem somewhat circular: how do we know how popular the pages linking to our page are? Well, we will give each page an initial popularity value, called its rank, and iterate over every page multiple times, so that each iteration brings a page's rank closer to its true value.

Directed graphs

Before we can get into the actual mechanics of our new algorithm, we first need to understand the domain of our problem. Essentially, we can think of the Web as a graph. For the uninitiated, a graph is a data structure made of two things: nodes and edges. Edges connect nodes to other nodes, and can have specified directions or weights. In our case, a node is a web page, and an edge represents a link from one page to another (meaning each edge has a single direction, so the Web forms a directed graph).

[Figure: graph data structure]

So our goal is to find the edges that point to a given node and sum up their contributions, weighted by how popular the linking nodes are. In other words, we care about a page's backlinks. Because each edge is stored in only one direction (from the linking page to the linked page), there is no way to figure out how many backlinks a page has without visiting every page and inspecting its outbound links. To help visualize how our process will work, we can use something called the "random surfer" model.
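As a rough sketch of what this looks like in code, using a made-up four-page graph (not the example graph from the figures), we can store the Web as an adjacency list of outbound links and count backlinks by walking every page's links:

```python
from collections import Counter

# A tiny, made-up web graph as an adjacency list: each key is a page,
# and its value is the list of pages that it links to.
web_graph: dict[str, list[str]] = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A", "B"],
    "D": ["C"],
}

# Counting backlinks (in-degree) requires scanning every page's outbound
# links, because the graph never stores inbound edges directly.
def count_backlinks(graph: dict[str, list[str]]) -> Counter:
    backlinks = Counter({page: 0 for page in graph})
    for page, outlinks in graph.items():
        for target in outlinks:
            backlinks[target] += 1
    return backlinks

print(count_backlinks(web_graph))  # Counter({'B': 2, 'C': 2, 'A': 1, 'D': 0})
```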

Random surfer model

We begin by initializing every page's rank to some starting value and placing a "web surfer" at a random page. The surfer then clicks on a random link on the page, and the destination page's rank increases by an amount that depends on the current page's rank. From that new page, we repeat the process, clicking a random link and adding to the destination's rank each time. As we continue, the rankings stabilize and we are left with our final values.


One problem that can arise is that there may not be a path from any given page to any other: the Web could contain two "islands" that only link within themselves. If this happens, our surfer will only ever visit the pages in whichever island it starts in, and the rankings could be totally wrong. To solve this, the surfer will jump to a completely random page every so often, which guarantees that it eventually visits every page.

[Figure: random surfer model]
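Here is a rough simulation sketch of the random surfer on the same kind of made-up graph. The jump probability corresponds to the damping factor defined in the next section: with probability d the surfer follows a random link, otherwise it teleports to a random page, and the fraction of visits each page receives approximates its rank.

```python
import random

# Simulate a random surfer: with probability d follow a random outbound link,
# otherwise (or if the page has no links) jump to a random page.
# The visit frequencies approximate the pages' ranks.
def simulate_surfer(graph: dict[str, list[str]], d: float = 0.85, steps: int = 100_000) -> dict[str, float]:
    visits = {page: 0 for page in graph}
    current = random.choice(list(graph))
    for _ in range(steps):
        visits[current] += 1
        outlinks = graph[current]
        if outlinks and random.random() < d:
            current = random.choice(outlinks)      # follow a random link
        else:
            current = random.choice(list(graph))   # teleport to a random page
    return {page: count / steps for page, count in visits.items()}

web_graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["C"]}
print(simulate_surfer(web_graph))
```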

Formal calculations

Let’s now formally define a page’s PageRank as the probability that our surfer will land on that page. With that, we can create our formula:

PR(A) = (1 - d)/N + d * ( PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn) )

where:
  • PR(A) is the PageRank of page A
  • PR(Ti) is the PageRank of page Ti, a page which links to A
  • C(Ti) is the number of outbound links on page Ti
  • d is the damping factor, a number between 0 and 1. Intuitively, d is the probability that our surfer keeps clicking links, so 1 - d is the probability that it jumps to a random page. Typically, 0.85 is used.
  • N is the total number of pages being ranked
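If you prefer code to notation, here is a minimal sketch of the iterative computation described below, assuming the normalized form of the formula above. The four-page graph is made up for illustration (it is not the example graph from the figures), and pages with no outbound links are handled by spreading their rank evenly, a common convention the formula itself doesn't cover.

```python
# Iterative PageRank, assuming PR(A) = (1 - d)/N + d * sum(PR(T)/C(T))
# over every page T that links to A. The graph is a made-up example.
def pagerank(graph: dict[str, list[str]], d: float = 0.85, iterations: int = 20) -> dict[str, float]:
    n = len(graph)
    ranks = {page: 1 / n for page in graph}              # start every page at 1/N
    for _ in range(iterations):
        new_ranks = {page: (1 - d) / n for page in graph}
        for page, outlinks in graph.items():
            if not outlinks:                              # dangling page: share its rank with everyone
                for other in graph:
                    new_ranks[other] += d * ranks[page] / n
            else:
                share = d * ranks[page] / len(outlinks)   # each link passes on an equal share
                for target in outlinks:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks

web_graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["C"]}
print(pagerank(web_graph))
```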


Because this formula is defined in terms of other pages' PageRanks, which we don't know yet, we initialize every ranking to 1/N, where N is the total number of pages on the Web, and then apply the formula repeatedly. Let's run our formula on the following simple graph:

[Figure: formula run, iteration 1]

Since all rankings were initialized to the same value, a popular page's endorsement and an irrelevant page's endorsement are weighted equally on this first pass. So, let's run our formula a few more times, using the new rankings each time.

[Figure: formula run, iteration 2]

After only five iterations our rankings have more or less stabilized, and we can see that Page 3 ranks higher than Page 1, even though they have the same number of links pointing to them. This is because Page 2, which has a high ranking, points to Page 3. As you can likely tell, PageRank is a probability distribution, so the sum of all rankings is 1. While it is quite tedious to do by hand, as I have just done, the algorithm itself is actually pretty fast: in 1998 it could calculate the PageRank of 26 million pages in only a few hours.
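If you want to watch the stabilization without the hand arithmetic, a quick check, reusing the pagerank sketch from above on its made-up graph, is to compare runs with different iteration counts and confirm that the ranks always sum to 1:

```python
# Watch the rankings stabilize as the iteration count grows;
# the ranks should always sum to (approximately) 1.
for i in (1, 2, 5, 20):
    ranks = pagerank(web_graph, iterations=i)
    rounded = {page: round(r, 3) for page, r in ranks.items()}
    print(i, rounded, "sum =", round(sum(ranks.values()), 3))
```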

Closing thoughts

Again, it's important to keep in mind that this algorithm does not perform "search"; it just ranks pages based on popularity. Only when combined with an actual search algorithm, even something as simple as a regular expression match, does its value really shine. The formula has use cases in plenty of other spaces, such as social networks, or theoretically anything with a directed graph at its core. Google apparently still uses PageRank, although it is now paired with many other signals, like page quality and the user's search history, so it likely plays a smaller part. Nevertheless, it's pretty interesting that when Larry and Sergey founded Google, this remarkably simple algorithm was the main real improvement they had over other search engines, some of which were very established. Twenty-five years later, Google is just now starting to be challenged in the search space thanks to Microsoft's partnership with OpenAI.


If you’d like to learn more, you can read the original paper here. Thanks for reading, and have a great day!