
Introduction to PageRank for SEO

When Google was launched back in 1998, they introduced a mechanism for ranking web pages that was radically different from how the established search engines at the time worked.


Up to then, most search engines relied exclusively on content and metadata to determine if a webpage was relevant for a given search. Such an approach was easily manipulated, and it resulted in pretty poor search results where the top-ranked pages tended to have a lot of keywords stuffed into the content.


Google radically shook things up by introducing PageRank as a key ranking factor.


Content still mattered to Google, of course, but rather than just looking at which webpage included the keyword most often, Google looked at how webpages linked to one another to determine which page should rank first.


Google’s theory was that a link from one webpage to another counted as a ‘vote’, a recommendation from that webpage for the page that was linked to. And the more ‘votes’ a webpage had – the more links that pointed to it – the more Google felt it could trust that page to be sufficiently good and authoritative. Therefore, pages with the most links deserved to rank the highest in Google’s results.


It’s interesting to note that the PageRank concept was heavily inspired by similar technology developed two years earlier by Robin Li, who later went on to co-found the Baidu search engine. (Thanks to Andreas Ramos for pointing that out to me!)


More than two decades later, Google still relies heavily on PageRank to determine rankings. For a long time, Google allowed us to see an approximation of a webpage’s PageRank through their browser toolbar, which included a PageRank counter that showed the current webpage’s PageRank as an integer between 0 and 10.


Toolbar PageRank

This Toolbar PageRank (TBPR) was a very rough approximation of the actual PageRank that Google had calculated: a linear 0 to 10 scale laid over the logarithmic scale that Google used internally. So going from TBPR 1 to 2 required roughly a tenfold growth in underlying PageRank, going from TBPR 2 to 3 another tenfold growth on top of that (a hundredfold compared to TBPR 1) – and so on, all the way to the mythical and almost unachievable TBPR 10.
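To make that logarithmic idea concrete, here’s a tiny illustrative sketch of how a raw PageRank value might be squashed into a 0 to 10 toolbar-style score. The base-10 scale and the numbers are my own assumptions for illustration; Google never disclosed the actual mapping.

```python
import math

def toolbar_score(raw_pagerank, base=10, max_score=10):
    """Map a raw PageRank value onto a 0-10 toolbar-style scale.

    Assumes each toolbar point represents a tenfold increase in raw
    PageRank - an illustrative guess, not Google's actual formula.
    """
    if raw_pagerank <= 1:
        return 0
    return min(max_score, int(math.log(raw_pagerank, base)))

# A page needs roughly 10x more raw PageRank for each extra toolbar point.
for raw in (5, 50, 500, 5_000, 5_000_000):
    print(raw, "->", toolbar_score(raw))
```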


The problem with TBPR was that folks in the SEO industry obsessed over it, to the detriment of all other areas of SEO (like, you know, publishing content that actually deserved to be read rather than just serve as a platform to manipulate PageRank). Danny Sullivan (now employed by Google as their search liaison but at that time still working for Search Engine Land as their chief search blogger) wrote an obituary for Toolbar PageRank which explains rather well why it was such a terrible metric.


So Google stopped showing Toolbar PageRank. First, the Toolbar scores stopped being updated from 2013 onwards, and thus ceased to accurately reflect a webpage’s PageRank. Then, in 2016, Google retired the Toolbar PageRank icon entirely.


This retirement of TBPR led a huge contingent of SEOs to believe Google had stopped using its internal PageRank metric. You can forgive the SEO community for this conclusion, because Google also stopped talking about PageRank in favour of more conceptual terms like ‘link value’, ‘trust’, and ‘authority’. Moreover, the original patent for PageRank that Google filed in 1998 wasn’t renewed and expired in 2018.


But PageRank never went away. We just stopped being able to see it.


Also in 2018, a former Google engineer admitted on a Hacker News thread that the original PageRank algorithm had been replaced internally in 2006 with a new approach to evaluating links. The patent covering this new approach can easily be seen as the official successor to the original PageRank. In fact I’d highly recommend you read Bill Slawski’s analysis of it, as no one does a better job of analysing Google patents than him.


In this blog post I don’t want to go down the patent analysis route, as that’s firmly Bill’s domain. Instead I want to make an attempt at explaining the concept of PageRank in its current form in a way that makes the theory applicable to the day-to-day work of SEOs, and hopefully clears up some of the mysticism around this crucial ranking factor.


Others have done this before, and I hope others will do it after me, because we need more perspectives and opinions, and we shouldn’t fear retreading existing ground if we believe it helps the industry’s overall understanding of the topic. Hence this 4000-word blog post about a 22-year-old SEO concept.


Note that I am not a computer scientist, I have never worked at Google, and I’m not an expert at analysing patents and research papers. I’m just someone who’s worked in the trenches of SEO for quite a while, and has formed a set of views and opinions over the years. What I’m about to share is very likely wrong on many different levels. But I hope it may nonetheless be useful.


The Basic Concept of PageRank

At its core, the concept of PageRank is fairly simple: page A has a certain amount of link value (PageRank) by virtue of links pointing to it. When page A then links to page B, page B gets a dose of the link value that page A has.


Of course, page B doesn’t get the same PageRank as page A already has. While page A has inbound links that give it a certain amount of PageRank, in my example page B only gets PageRank through one link from page A. So page B cannot be seen as equally valuable as page A. Therefore, the PageRank that page B gets from page A needs to be less than 100% of page A’s PageRank.


This is called the PageRank Damping Factor.


In the original paper that Google published to describe PageRank, they set this damping factor to 0.85. That means the PageRank of page A is multiplied by 0.85 to give the PageRank of page B. Thus, page B gets 85% of the PageRank of page A, and the remaining 15% is lost along the way.


PageRank Damping Factor from webpage A to B

If page B were then to have a link to page C, the damping factor would apply again. The PageRank of page B (85% of page A’s PageRank) is multiplied by 0.85, and so page C gets 72.25% of page A’s original PageRank.


PageRank Damping Factor from webpage A to B to C
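Here’s that chain as a quick back-of-the-envelope calculation in Python, using the simplified ‘flow’ view described above and the 0.85 damping factor from the original paper:

```python
DAMPING = 0.85  # damping factor from the original PageRank paper

pagerank_a = 1.0                   # treat page A's PageRank as 100%
pagerank_b = pagerank_a * DAMPING  # page B gets 85% of page A's value
pagerank_c = pagerank_b * DAMPING  # page C gets 85% of page B's value

print(f"Page B receives {pagerank_b:.2%} of page A's PageRank")  # 85.00%
print(f"Page C receives {pagerank_c:.2%} of page A's PageRank")  # 72.25%
```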

And so on, and so forth, as pages link to one another and PageRank distributes through the entire web. That’s the basic idea behind PageRank: pages link to one another, link value flows through these links and loses a bit of potency with every link, so webpages get different amounts of PageRank from every link that points to them.


Pages that have no inbound links at all get a basic starting amount of PageRank of 0.15, as extrapolated from the original PageRank calculation, so that there’s a jumping-off point for the analysis and we don’t begin with zero (because that would lead to every webpage having zero PageRank).


Damping Factor Modifiers

The above all makes sense if a webpage has just one link pointing to another page. But most webpages will have multiple links to other pages. Does that mean each of those links gets 85% of the starting page’s PageRank?


In its original form, PageRank would distribute evenly across all those links. So if you had ten links to different pages from page A, that 85% of link value would be evenly shared across all of those links so that each link would get 8.5% of page A’s link value (1/10th of 85%). The more links you had on your page, the less PageRank each linked page would receive.
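To see how the even split and damping work together, here’s a minimal sketch of that original approach on a tiny made-up link graph, including the 0.15 base value mentioned earlier. The graph and the number of iterations are arbitrary; this only illustrates the mechanics, not Google’s actual computation.

```python
# Minimal sketch of the original PageRank iteration on a toy link graph.
# Each page starts from a base value and receives an even share of every
# linking page's PageRank, damped by 0.85.

DAMPING = 0.85
BASE = 1 - DAMPING  # the 0.15 'starting amount' mentioned above

# A hypothetical four-page site: page -> pages it links to
links = {
    "home": ["about", "blog", "contact"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": [],
}

pagerank = {page: 1.0 for page in links}  # arbitrary starting values

for _ in range(20):  # iterate until the values settle
    new_pagerank = {}
    for page in links:
        inbound = sum(
            pagerank[src] / len(targets)
            for src, targets in links.items()
            if page in targets
        )
        new_pagerank[page] = BASE + DAMPING * inbound
    pagerank = new_pagerank

for page, score in sorted(pagerank.items(), key=lambda x: -x[1]):
    print(f"{page:8s} {score:.3f}")
```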


This even split led a lot of SEOs to adopt a practice called ‘PageRank Sculpting’, which involved hiding links from Google (with nofollow attributes or other mechanisms) to ensure more PageRank would flow through the links you did want Google to count. It was a widespread practice for many years and seems to have never really gone away.


Reasonable Surfer

But this even distribution of PageRank was a short-lived aspect of the system. The engineers at Google quickly realised that letting link value flow evenly across all links on a page didn’t make a lot of sense. A typical webpage has links that are quite discreet and not likely to be clicked on by actual visitors to the page (such as links to privacy policies or boilerplate content), so why should these links get the same PageRank as a link that is very prominent on the page as part of the main content?


So in 2004 Google introduced an improved mechanism for distributing PageRank across multiple links. This is called the ‘Reasonable Surfer’ model, and it shows that Google started assigning different weights to links depending on the likelihood of a link being clicked on by a ‘reasonable surfer of the web’ – i.e. an average person browsing webpages.


Basically, Google modified the PageRank Damping Factor depending on whether a link was actually likely to be used by an average person. If a link was very prominent and there was a good chance a reader of the webpage would click on it, the damping factor stayed low and a decent chunk of PageRank would flow through that link.


But if a link was hidden or discreetly tucked away somewhere on the page, such as in the footer, it would get a much higher damping factor and so would not have a lot of value flowing through it. Pages linked from such hidden links would not receive much PageRank, as their inbound links were unlikely to be clicked on and so were subjected to very high damping factors.
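A rough sketch of the difference: rather than an even split, each link gets a share of the page’s outgoing PageRank in proportion to an estimated click probability. The weights below are invented purely for illustration; nobody outside Google knows the real ones.

```python
# Hypothetical click-probability weights per link position - illustrative
# guesses only, not values from the Reasonable Surfer patent.
links_on_page = {
    "in-content link, first paragraph": 0.50,
    "sidebar link": 0.15,
    "navigation link": 0.10,
    "footer link (privacy policy)": 0.02,
}

outgoing_pagerank = 0.85  # the share of the page's PageRank that flows out

total_weight = sum(links_on_page.values())
for link, weight in links_on_page.items():
    share = outgoing_pagerank * (weight / total_weight)
    print(f"{link:38s} passes {share:.1%} of the page's PageRank")
```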


This is a key reason why Google wants to render pages as it indexes them. Just looking at the HTML source code of a page doesn’t necessarily reveal the visual importance of a link. Links could easily be hidden with some CSS or JavaScript, or inserted as a page is rendered. By looking at a completely rendered version of a webpage, Google is better able to accurately evaluate the likelihood of a link being clicked, and so can assign the proper PageRank value to each link.


This aspect of Google’s PageRank system has been updated and refined over the years, so it’s safe to assume Google has kept using and improving it.


For me, the Reasonable Surfer approach also makes PageRank Sculpting obsolete. With Reasonable Surfer, the number of links on a page is not the determining factor in how much PageRank each link gets. Instead, the visual prominence of a link is the key factor that decides how much PageRank flows through it.


So I don’t believe you need to ‘hide’ links from Google in any way. Links in the footer of a page are likely to be ignored anyway for PageRank purposes, as users aren’t likely to click on them, so you don’t need to add ‘nofollow’ attributes or hide them in some other way.


Internal versus External Links

What follows is purely speculation on my end, so take it with a grain of salt. But I believe that, in addition to a link’s prominence, the applied PageRank Damping Factor also varies depending on whether the link points to an external site or an internal page.


I believe that internal links to pages within the same website have a lower damping factor – i.e. send more PageRank to the target page – than links that point to external websites.


This belief is supported by anecdotal evidence where improvements in internal linking have a profound and immediate impact on the rankings of affected webpages on a site. With no changes to external links, improving internal linking can help pages perform significantly better in Google’s search results.


I strongly suspect that PageRank flowing within a website diminishes more gradually, whereas PageRank that is sent to other websites dissipates more quickly due to a larger damping factor.


To give it some numbers (which I pulled out of thin air for the purpose of this example, so please don’t take them as any sort of gospel): when a webpage links to an external website, Google may apply a damping factor of 0.75, which means 25% of PageRank is lost and only 75% arrives at the destination page. Whereas if that same webpage links to another page on the same site, the damping factor may be 0.9, so the internal target page receives 90% of the PageRank and only 10% is lost.
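Using those made-up numbers, the difference might look something like this. Again, the 0.9 and 0.75 damping factors are pure speculation on my part, not anything Google has confirmed.

```python
from urllib.parse import urlparse

# Purely speculative damping factors - see the caveat above.
INTERNAL_DAMPING = 0.90
EXTERNAL_DAMPING = 0.75

def passed_pagerank(source_url, target_url, source_pagerank):
    """Estimate how much PageRank flows through a single link,
    assuming internal links keep more value than external ones."""
    same_site = urlparse(source_url).netloc == urlparse(target_url).netloc
    damping = INTERNAL_DAMPING if same_site else EXTERNAL_DAMPING
    return source_pagerank * damping

page = "https://www.polemicdigital.com/pagerank/"
print(passed_pagerank(page, "https://www.polemicdigital.com/services/", 1.0))  # 0.9
print(passed_pagerank(page, "https://www.example.com/", 1.0))                  # 0.75
```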


Now if we accept this as plausible, it raises an interesting question: when is a link considered ‘internal’ versus ‘external’? Or, more simply put, what does Google consider to be a ‘website’?


Subdomains vs Subfolders

This simple question may have a complicated answer. Take a website that uses multiple technology platforms, such as a WordPress site that also uses another CMS to power a forum. If both the WordPress site and the forum exist on the same overall domain, such as www.polemicdigital.com with a /forum/ subfolder, I’m pretty confident that Google will interpret it as just one website, and links between the WordPress pages and the forum pages will be seen as internal links.


But what if the forum exists on a subdomain, like forum.polemicdigital.com? The forum behaves very differently from the main WordPress site on the www subdomain, with a different technology stack and different content. So in that scenario, I strongly suspect Google will treat the forum.polemicdigital.com subdomain as a separate website from the www.polemicdigital.com WordPress site, and any links between them will be seen as external links.


For me, this lies at the heart of the subdomain vs subfolder debate that has raged within the SEO industry for many years. Hosting a section of your site on a subdomain makes it more likely it’ll be interpreted as a separate site, so I believe links from your main site to the subdomain will be seen as external links and be subjected to higher damping factors. Thus your subdomain’s ranking potential is diminished, because it receives less PageRank from your main site.


Whereas if you put extra features of your site, such as a forum or a blog, in a subfolder on the same domain as your main site, it’s likely Google will simply see this as extra pages on your site and links pointing to these features will be internal links and send more PageRank to those areas.

This is why I recommend that my clients never put crucial rankable resources like blog articles and user-generated content on a subdomain, unless there’s really no other choice.


If you do have to use subdomains, you should try to use the same technology stack as your main site where possible, with the same design, boilerplate content, and page resources (images, CSS, JavaScript). This will increase the chances that Google interprets the subdomain as part of the main domain’s website.


Redirects and PageRank

A long standing question in SEO is “how much PageRank is lost through a redirect?” When you change a webpage’s URL and redirect the old URL to the new location, do you lose some of the original URL’s PageRank in that redirect?


Over the years the answers from Googlers have varied a bit. Historically, Google has confirmed that the amount of PageRank lost through a redirect is the same as through a link. This means that a redirect from page A to page B counts as a link from page A to page B, and so the PageRank Damping Factor applies and page B receives less PageRank (by about 15% or whatever damping factor Google chooses to apply in that specific context). This is done to prevent redirects from being used to artificially manipulate PageRank.


However, more recently some Googlers have said that redirects do not necessarily cause PageRank loss. In a 2018 Webmaster Hangout, John Mueller emphasises how PageRank is consolidated on a canonical URL, and that redirects serve as a canonicalisation signal. This would imply that there is no PageRank loss in a redirect, but that the redirect tells Google that there is a canonical URL that all the relevant ranking signals (including PageRank) should be focused on.


Nonetheless, whenever a website goes through an effort to minimise redirects from internal links and ensures all its links point directly to final destination URLs, we tend to see an uplift in rankings and traffic as a result. This may be due less to a decrease in PageRank loss and more to optimised crawling and indexing, but it’s an interesting correlation nonetheless.


Because redirects result in extra crawl effort, and there is a chance that some redirects still cause PageRank loss, I would always recommend that websites minimise internal redirects as much as possible. It’s also good practice to avoid chaining multiple redirects due to the extra crawl effort, and Google tends not to crawl beyond a maximum of five chained redirects.
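If we assume, as a worst case, that each redirect hop is treated like a link and loses the same ~15% a link would, the maths of a chain adds up quickly. That assumption may well be wrong given John Mueller’s comments above, but it shows why flattening redirect chains is cheap insurance:

```python
DAMPING = 0.85  # assume each hop loses the same ~15% a link would

def pagerank_after_redirects(hops, starting_pagerank=1.0):
    """Worst-case estimate of PageRank surviving a redirect chain."""
    return starting_pagerank * (DAMPING ** hops)

for hops in range(1, 6):
    print(f"{hops} hop(s): {pagerank_after_redirects(hops):.1%} of the original value")
# Five chained hops would leave only ~44% - and Google may stop following
# the chain at that point anyway.
```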


PageRank Over Time

Due to the volatile nature of the web, a webpage’s PageRank is never a static number. Webpages come and go, links appear and disappear, pages get pushed deeper into a website, and things are constantly in flux. So Google has to recalculate PageRank all the time.


Anecdotally, many SEOs believe links lose value over time. A newly published link tends to have a stronger positive ranking effect than a link that was published years ago. If we accept that anecdotal evidence as true, it leads to questions about how PageRank damping factors are applied over time.


One possibility is that Google applies higher damping factors to older links, which means those old links pass less PageRank as time goes on.


Another possibility is that the webpages that contain those old links tend to get buried deeper and deeper on a website as new content is published, so there are more layers of clicks that each siphon off a bit of PageRank. That means a link from page A to page B passes less PageRank not because of a higher damping factor, but because page A receives less PageRank itself as it sinks into the website’s archive.


The fact is, we don’t really know what Google does with PageRank from historic links. All we know is that links do tend to lose value over time, which is why we constantly need to get new links pointing to a website to maintain rankings and traffic.


URL Seed Set

There’s one more aspect of PageRank we need to talk about, which the updated PageRank patent mentions but the original didn’t.


The updated PageRank patent frequently mentions a ‘seed set of pages’. This refers to a starting point of URLs to calculate PageRank from. I suspect that this was introduced as a way to better calculate PageRank by starting from webpages that are known and understood to be trusted and reliable, such as Wikipedia articles or high authority news websites.


As per the patent, seed sets “… are specially selected high-quality pages which provide good web connectivity to other non-seed pages.”


Seed Set of URLs

What makes this seed set especially interesting is how it’s used to modify a webpage’s PageRank based on distance, i.e. how many clicks it takes to get from a seed URL to the webpage. As per the patent, “… shortest distances from the set of seed pages to each given page in the link-graph are computed,” and “[t]he computed shortest distances are then used to determine the ranking scores of the associated pages.”


This is entirely separate from how PageRank flows through links. Rather than counting a webpage’s cumulative PageRank as it flows through links, the patent explicitly states that it’s about the ‘shortest distance’ from the seed set to the webpage. So it’s not about an accumulation of PageRank from one or more links, but a single number representing the fewest clicks it would take to reach the webpage from any URL in the seed set.


So by all appearances, it looks like Google uses the number of clicks from a seed URL to a webpage as a PageRank modifier, where fewer clicks means higher PageRank.
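Conceptually this is a shortest-path calculation over the link graph. Here’s a small sketch using breadth-first search on an entirely made-up graph, where the distance from the nearest seed URL could then act as a ranking modifier. The graph and the seed choice are, of course, hypothetical.

```python
from collections import deque

# A hypothetical link graph: URL -> URLs it links to
link_graph = {
    "wikipedia.org/wiki/SEO": ["news-site.com/article", "some-blog.com/post"],
    "news-site.com/article": ["small-site.com/page"],
    "some-blog.com/post": [],
    "small-site.com/page": ["obscure-site.com/deep-page"],
    "obscure-site.com/deep-page": [],
}

seed_urls = {"wikipedia.org/wiki/SEO"}  # illustrative guess at a seed page

def distance_from_seeds(graph, seeds):
    """Breadth-first search: fewest clicks from any seed URL to each page."""
    distances = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for linked_url in graph.get(url, []):
            if linked_url not in distances:
                distances[linked_url] = distances[url] + 1
                queue.append(linked_url)
    return distances

for url, clicks in distance_from_seeds(link_graph, seed_urls).items():
    print(f"{url:35s} {clicks} click(s) from the seed set")
```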


PageRank and Crawling

So far we’ve talked about PageRank exclusively as a ranking factor. But this is just part of the picture. There is another important effect that PageRank has on a URL: it helps determine how often that URL is crawled by Googlebot.


While Google endeavours to crawl the entire web regularly, it’s next to impossible to actually do this. The web is huge and growing at an exponential rate. Googlebot couldn’t possibly keep up with all the newly published content while also keeping track of every change made to existing webpages. So Googlebot has to decide which known URLs to recrawl to find updated content and new links.


PageRank feeds into this decision. Basically, the more PageRank a URL has, the more often Googlebot will crawl it. A page that has a lot of links pointing to it will be seen as more important for Google to crawl regularly.


And the opposite also applies – pages with very low PageRank are seen as less important, so Google will crawl them less often (or not at all). Note that PageRank is only part of that equation, but it’s good to keep in mind when you talk about optimising a site’s crawling.
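As a purely conceptual illustration (Google’s real crawl scheduling is far more involved, and undocumented), you could imagine crawl frequency being bucketed by PageRank roughly like this; every threshold and interval below is invented:

```python
# Purely conceptual: bucket URLs into recrawl intervals by their PageRank.
# The thresholds and intervals are invented for illustration only.
def crawl_interval_days(pagerank):
    if pagerank >= 5.0:
        return 1      # very strong pages: recrawled roughly daily
    if pagerank >= 1.0:
        return 7      # moderately linked pages: roughly weekly
    if pagerank >= 0.2:
        return 30     # weakly linked pages: roughly monthly
    return None       # barely linked pages: may not be recrawled at all

for url, pr in [("/", 8.2), ("/blog/popular-post/", 1.4), ("/tag/misc/page/9/", 0.05)]:
    interval = crawl_interval_days(pr)
    print(url, "->", f"every {interval} days" if interval else "rarely or never")
```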


All this and more is explained elegantly by Dawn Anderson in her Search Engine Land article about crawl budget, which is definitely worth a read.


What This Means For SEO

Understanding all of the above, what does this mean for SEOs? How can you apply this theory to your daily work?


We can distil the theory to a few clear and concise recommendations for improving a website’s use of PageRank:


1. Links are a vital ranking factor

So far nothing has managed to replace PageRank as a reliable measure of a site’s trust and authority. However, Google is very good at ignoring links it doesn’t feel are ‘natural’, so not all links will pass PageRank. In fact, according to Paul Madden from link analysis software Kerboo, as many as 84% of all links on the web pass little to no value.


In a nutshell, it’s not about how many links you have, but how much value a link could pass to your site. Which brings me to the second point:


2. Prominent links carry more weight

The most valuable type of link you can get is one that a user is likely to click on. A prominent link in the opening paragraph of a relevant piece of good content is infinitely more valuable than a link hidden in the footer of a website. Optimise for visually prominent, clickable links.


3. Internal links are golden

It’s easy to obsess over getting more links from external sites, but often there’s just as much gain to be had – or more – from optimising how link value flows through your website. Look at your internal link structure and how PageRank might flow through your site.


Start with the webpages that have the most inbound links (often your homepage and some key pieces of popular content) and find opportunities to send PageRank to the URLs whose rankings you want to boost. This will also help with optimising how Googlebot crawls your site.
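A practical starting point is simply counting internal inbound links per URL from a crawl export. The sketch below assumes a hypothetical CSV with ‘source’ and ‘target’ columns, similar to the link reports most desktop crawlers can produce; adjust the column names to whatever your tool exports.

```python
import csv
from collections import Counter

# Hypothetical export of internal links with 'source' and 'target' columns;
# most crawling tools can produce something equivalent.
inbound_counts = Counter()
with open("internal_links.csv", newline="") as f:
    for row in csv.DictReader(f):
        inbound_counts[row["target"]] += 1

# Pages with the most internal inbound links are your strongest PageRank
# sources; pages near the bottom may need more internal links pointing at them.
for url, count in inbound_counts.most_common(20):
    print(f"{count:5d}  {url}")
```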


4. Links from seed URLs are platinum

We don’t know which URLs Google uses for its seed set, but we can make some educated guesses. Some Wikipedia URLs are likely part of the seed set, as are news publishers like the New York Times and the BBC.


And if they’re not directly part of the seed set, they’re likely to be only one or two clicks from the actual seed URLs. So getting a link from those websites is immensely valuable – and typically very hard.


5. Subfolders are usually superior

In almost any given context, content placed in a subfolder on a site will perform better than content hosted on a subdomain. Try to avoid using subdomains for rankable content unless you really don’t have a choice.


If you’re stuck with subdomains and can’t migrate the content to a subfolder, do your best to make the subdomain look and feel like an integral part of the main site.


6. Minimise redirects

While you can’t avoid redirects entirely, try to minimise your reliance on them. All your internal links should point directly to the destination page with no redirect hops of any kind.


Whenever you migrate URLs and have to implement redirects, make sure they’re one-hop redirects with no chaining. You should also look at pre-existing redirects, for example from older versions of the website, and update those where possible to point directly to the final destination URL in only one redirect hop.


Wrapping Up

There’s a lot more to be said about PageRank and various different aspects of SEO, but the above will hopefully serve as a decent enough introduction to the concept. You may have many more questions about PageRank and link value, for example about links in images or links with the now ambiguous ‘nofollow’ attribute.


Perhaps a Google search can help you on your way, but you’re also welcome to leave a comment or get in touch with me directly. I’ll do my best to give you a helpful answer if I can.


[Toolbar PageRank image credit: Search Engine Land]
