
  • Technical SEO is absolutely necessary

    Note: This post was originally published in 2015. I’ve republished it here, with some minor cleaning up, as I believe its core lessons are still applicable today. Once in a while you hear murmurs about how SEO is not about the technology anymore, how great content and authority signals will be sufficient to drive stellar growth in organic search. And the people who say this can often show formidable success from content and links, so it would be easy to conclude that they’re right. But they’re not. For them to claim that their efforts in content and social have made technical SEO unnecessary is a bit like a Formula 1 driver claiming he won the race purely due to his own driving, ignoring the effort that has gone into the car he’s been whizzing around the track. This is insulting to the Formula 1 engineers who have enabled the racing driver to win, just as claiming that technical SEO is unnecessary is insulting – and dangerously short-sighted – to the folks who build and optimise the platforms that enable all your content and social efforts. The thing is, technical SEO is not ‘sexy’. In the early days of SEO, pretty much all practitioners were coders and IT geeks, and the industry was primarily about using clever technologies to game the algorithm. Nowadays SEO has evolved to such an extent that it aligns closely with classic marketing, and this is reflected in the backgrounds of the industry’s professionals – marketers, copywriters, journalists, and designers are now more common in SEO than computer science graduates. And for these people it’s very comforting to hear that technical SEO is unnecessary, because it’s an aspect they’re not comfortable with and can’t easily navigate. It’s a nice message, putting them at ease and allowing them to remain confident that their content strategies will continue to drive results. Until they start working on a website that is technically deficient. Then the trouble starts.
For most small to medium-sized websites, technical SEO is indeed not a huge priority. Especially if a website is built on a popular CMS like Wix or WordPress, there’s usually not a lot of technical SEO work needed to get the website performing optimally. But for larger websites, it’s an entirely different story. The more complex a website is, the higher the chances that some aspect of its functionality will interfere with SEO, with potentially catastrophic results. There are countless ways that a website’s technical foundation can go wrong and prevent search engines from crawling and indexing the right content. It takes someone skilled in technical SEO to identify, prevent, or fix these problems. Anything from simple things like slightly inaccurate blocking rules in robots.txt and faulty international targeting tags, to major issues like spider traps, automatic URL rewrites with the wrong status codes, or incorrect canonical implementations, can wreak havoc with a website’s performance in organic search results. And if you as an SEO practitioner are not familiar with the ins and outs of technical SEO, you can’t even begin to diagnose the problem, let alone fix it. So let’s be clear once and for all: technical SEO is absolutely necessary. Now if you’re an SEO with a non-technical background, there’s no need to panic. You can have a very successful career in SEO with limited technical know-how. But you do need to know and accept your limitations, and be able to call on expert help when you need it. I do recommend that every SEO practitioner develop at least a rudimentary understanding of technical SEO; I don’t think this is optional. You need to know the basics, if only to enable you to recognise when something technical might cause SEO problems and you need to call in further support. If you have zero technical SEO knowledge, you’re not going to be able to recognise technical issues when they arise, and that’s a dangerous position to be in.
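To illustrate just how subtle a blocking rule mistake can be, here’s a hypothetical robots.txt snippet (the paths are invented for illustration) where a single missing character widens the rule far beyond its intent:

```text
# Intended: block crawling of internal search result pages under /search/
User-agent: *
Disallow: /search

# Because robots.txt rules are prefix matches, the rule above also blocks
# /search-tips/ and /searchable-guides/ – pages we DO want crawled.
# The precise rule keeps the trailing slash:
# Disallow: /search/
```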
Learning Technical SEO So now that we’ve established that a baseline of technical SEO know-how is valuable, how do you go about achieving that? I wish I could tell you that all you need to do is read a Moz starter’s guide and some blog posts and you’re done. But if I did that, I’d be lying. The truth is, learning technical SEO is not easy, especially if you have no technical background at all. This stuff can be challenging to wrap your head around. But if you use the right approach and learn it one step at a time, you’ll be amazed at how quickly you become sufficiently proficient in the technical side of things. More than that, you’ll find yourself applying that technical know-how across many other digital marketing channels as well. This is because, at its core, understanding technical SEO is about understanding how the web works, and understanding how search engines work. And by knowing these things – especially the first one – you will become a more informed, more effective, and more successful digital marketer. How The Web Works The internet has become such an integral part of our daily lives that few of us ever stop to think about how it all works. It’s something we just take for granted. But when you make a living from the internet, it pays to know how it functions under the hood. Every year I give a guest lecture for an MSc programme at a local university. The MSc is in digital communications, and the lecture is an introduction to coding on the web: explaining the differences between static markup like HTML and XML, dynamic code like JavaScript, and server-side code like PHP and ASP.NET. The first time I gave that lecture, it quickly became apparent that we should have started at an even more basic level. The students, most of whom had no technical background, couldn’t really grasp the differences between the different types of code, because they didn’t understand the basic client-server architecture of the internet.
It was really eye-opening for me and the academics organising the course. We had to go back to the drawing board and revise the module to ensure that we started with the basic structure of the internet before we delved into explaining code. We also had to emphasise the difference between the web and the internet, something that gets muddled too easily in many people’s vocabulary. Learning Internet Networking Unfortunately, there’s no ‘Internet Architecture For Dummies’ book I can recommend to get you up to speed. The client-server structure of the internet is usually taught as part of A-level computing, and is assumed to be basic knowledge when someone starts a degree in computer science or a related technical field. Most marketers, of course, don’t necessarily have this foundational knowledge. So if this is unknown territory for you, I would advise you to read up on client-server models and basic internet networking, as you will need at least a rudimentary understanding of these concepts further down the line. Learning Code The next step is to learn the basics of the various types of code that enable the internet – and the web specifically – to function. Now I’m not one of those SEO guys who believe we all need to learn how to code. I don’t think being able to code is a crucial skill. In fact, I consider myself an expert technical SEO guy and I couldn’t string together a coherent {if, then, else} statement if my life depended on it. Rather, I see coding skills as a gateway to a much more important skillset: problem solving. A lot of technical SEO is about troubleshooting and problem solving. Learning to code is one means of acquiring those skills, but by no means the only way. It’s about logic and reason, and knowing where to look. While you don’t need to be able to code, I do think it’s important that you understand markup and are able to troubleshoot it.
HTML and XML are two markup languages (the ‘ML’ in their acronyms means exactly that) which are very important for SEO. It’s incredibly useful to be able to look at the source code of a webpage or sitemap and have a pretty good understanding of what each line of markup actually does. The good news here is that there are plenty of online courses and books to help you get started with HTML and XML. A simple Google search will yield literally thousands of results – plus, there’s a section on Google’s own site dedicated to teaching HTML. So just get started with it – build your own webpage from scratch and learn the various HTML tags, what they do, and how they combine to format a webpage. Once you’ve grasped the basics of HTML, the intricacies of XML will come easily to you. It doesn’t hurt to have a basic understanding of CSS as well, though I don’t consider that as crucial to technical SEO. Google Developers Because Google’s platforms rely on third-party developers to code for them, Google has an entire website devoted to educating developers on best practices for many of its platforms. The Google Developers site has a wealth of resources for a huge range of different development environments. For SEOs, the most useful section there is the Search area, which contains loads of useful tips on how to build websites in such a way that Google’s search engine can work with them. Additionally, the site devotes entire sections to optimising the load speed of your site, the basics of web security, and much more. If you’re comfortable with HTML, it’s definitely worth looking through the site and learning how Google wants us to build and optimise websites. Learning Webservers Lastly, to round off your understanding of the web, you need to understand how webservers work. More importantly, you’ll want to come to grips with the basic configuration options of some of the most common webserver platforms.
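A first hand-coded page of the kind described above needs only a handful of tags – everything in this sketch (title, headings, the example.com link) is invented for illustration:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>My First Hand-Coded Page</title>
</head>
<body>
  <h1>Hello, web</h1>
  <p>A paragraph of text, with a <a href="https://example.com/">link</a> to another page.</p>
  <!-- Save this as index.html, open it in a browser, then use
       "view source" to compare the markup with what is rendered. -->
</body>
</html>
```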
Personally, I’m reasonably proficient with the Apache webserver, as it’s such a widely-used platform; many websites that use open source software like WordPress and Magento will run on an Apache webserver. Get yourself a cheap hosting environment, upload your hand-coded HTML pages to it, and start experimenting with server settings and configuration files. How Search Engines Work Learning the intricacies of how the web works will give you a great basic skillset for technical SEO. However, it’s only part of the picture. Next, you’ll want to learn how search engines work – specifically, how search engines interact with websites to extract, interpret, and rank their content. Whenever I teach SEO, I always start with a brief high-level overview of how search engines work. While search engines are vastly complex pieces of software, at their core they’re made up of three distinct processes. Each process handles a different aspect of web search, and together they combine to provide relevant search results: Crawling: this process is the spider – Googlebot – that crawls the web, follows links, and downloads content from your website. Indexing: the indexer – Google’s is called Caffeine – then takes the content from the spider and analyses it. It also looks at the links retrieved by the spider and maps out the resulting link graph. Ranking: this is the front-end of the search engine, where search queries are processed and interpreted, and results are shown according to hundreds of ranking factors. I give a more in-depth explanation of each of these three processes in this post for State of Digital: The Three Pillars of SEO. When it comes to technical SEO, crawling and indexing are the two most important processes to understand (though the third, ranking, also touches on some aspects of technical SEO). At its core, I believe that technical SEO is primarily about optimising your website so that it can be crawled and indexed efficiently by search engines.
This means ensuring the right content can easily be found by Google’s spiders, removing obstacles that prevent efficient crawling, and limiting the amount of effort it takes search engines to index your content. It’s important to know that search engines are, in effect, applied science – specifically, the science of Information Retrieval, a subfield of Computer Science. If you’re serious about developing your technical SEO skills, you’ll want to invest some time in studying the basics of information retrieval. Stanford University (where Google’s founders studied) has put its entire Introduction to Information Retrieval course online, and I highly recommend it. It won’t teach you the ins and outs of how Google works, but it gives you a clearer understanding of the field and its associated lingo. This will make interpreting the various search engine patents, as often analysed by the incomparable Bill Slawski, much more fun and illuminating. Go Forth and Learn! I hope all that hasn’t discouraged you from embarking on your own journey of learning technical SEO. As I said, it won’t be easy, but I guarantee that it’ll make you a more effective overall digital marketer – not to mention help you become a truly great SEO. You’ll also find that experienced technical SEOs are very willing to help out and provide answers, guidance, and mentorship. Over the years I’ve learned so much from experts like David Harry, Alan Bleiweiss, Aaron Bradley, Rishi Lakhani, Bill Slawski, and so many others I always forget to mention. I make a point of trying to pass this on to the next batch of technical SEO experts-to-be. So if you come across a technical SEO conundrum and you need a bit of help, don’t hesitate to get in touch. If you get demotivated and downbeat, just remember that learning technical SEO is no different from most things worth doing in life: perseverance and ambition will get you there in the end. Good luck!

  • Newsletter: SEO for Google News

    I’ve been neglecting this blog for a few months, which has been entirely unintentional. With the COVID-19 pandemic, our lives have all drastically changed and priorities have shifted. Blogging couldn’t have been further from my mind. But now things have settled into a new routine, and my creative itches have once again emerged. The problem was, I didn’t know what I should blog about. So recently I put out a feeler on Twitter to ask my followers what I should write about, and one reply in particular stood out. I thought that was an excellent suggestion and it really got me thinking. I’ve been crazy busy with work these past few months (I’m counting my blessings, I realise not everyone has been as lucky), and most of those projects have been for publishers. SEO for news is an area I’ve been fortunate enough to build a speciality in; first through a full-time job working for the Belfast Telegraph back in 2010, and then, since I started my own consultancy business, through a series of SEO consulting projects with major publishing organisations like News UK, FOX, Future Publishing, and many more. While I work with all sorts of companies, the publishing industry is the one I enjoy working with the most. So, as a speciality in SEO, news publishing is something I want to double down on. SEO for news publishers is an area I’m deeply excited about. While it shares the same foundational SEO best practices with every other niche, there are some areas of SEO that are unique to publishers. The news-specific elements of Google’s ecosystem are quite different from their regular search results, and this requires a different approach to optimisation. The SEO blogosphere is cluttered, with hundreds of websites regularly writing about all areas of SEO. Except when it comes to Google News. This is a speciality in SEO that is massively under-serviced in terms of content.
There are a handful of good pieces out there, but these tend to focus on the very basics only – for example, on how to get a website into Google News – and there’s very little (if any) content out there that discusses some of the more advanced concepts and challenges that news publishers face when it comes to maximising their search traffic. Which is odd, because SEO is so incredibly valuable for publishers. For most news websites, Google is by far the largest driver of traffic – primarily through Top Stories carousels, but increasingly also via Google Discover. This lack of useful and up-to-date information about SEO for publishers is hindering the media landscape. Publishers are struggling to attract readers and are missing opportunities to claim visibility in Google. This is something I want to help change. Over the years I’ve learned a lot about how publishers can use SEO to boost their traffic. I’ve worked with development and product teams to improve websites for better crawling & indexing of news content, I’ve trained journalists and editors to optimise their content for improved visibility, and I’ve consulted on a range of projects aimed at integrating SEO practices into newsroom workflows. And I have learned new things from each and every project, and continue to learn every day. It’s time I started sharing this knowledge. But rather than just blog an occasional article and hope it reaches the people who might be able to use it, I’ve decided to take a different approach. The firehose of the SEO blogosphere is not conducive to publishing meaningful content on such a relatively narrow niche. So I’m going to be doing this via a newsletter. Inspired by great email newsletters from people like Aleyda Solis and Louis Grenier, I want to try and build a community around an email newsletter specifically about Google News.
I’ve chosen the Substack platform for this, as it has all the bells & whistles I could possibly need – and it takes very little effort for me to set up and manage, so I can focus on the actual content. The first few topics I have in mind are around things like technical optimisation for articles, best practices for syndicated content, and integrating SEO into newsroom workflows. As time goes on I’ll try to cover as many different areas as possible, all around the overarching topic of maximising traffic to news publishers from various search sources. If this sounds interesting to you and you want to read this content, you can sign up using the form below or directly on Substack. The first issue was sent out on 18 November, and I hope many more will follow. If you sign up, I promise I will only use your email address to send you this newsletter – it will never be shared with anyone else in any capacity. Let me know what you think of this newsletter experiment in the comments below. Is it something you’ll sign up for? If not, why? Any and all feedback is welcome.

  • Perfecting XML Sitemaps

    [This article was originally published in August 2019 on Search News Central] An XML sitemap is a file, in XML format, that lists URLs on your site that you want search engines like Google to crawl and index. XML sitemaps have been a staple of good SEO practice for many years now. We know that sites should have valid XML sitemaps to help search engines crawl and index the right pages. Yet despite the ubiquity of XML sitemaps, their exact purpose isn’t always fully understood. And there’s still a lot of confusion about the ‘perfect’ setup for an XML sitemap for optimal crawling and indexing. In this post I’ll share the best practices I’ve learned over the years for fully optimised XML sitemaps, focusing on standard sitemaps for webpages. The Basics I’m not going to explain the basics of XML sitemaps too much, as those have been covered many times over on many other blogs. I’ll just recap the essentials here: XML sitemaps should adhere to the official protocol, otherwise Google will not see them as valid files and will ignore them. XML sitemaps should only contain the canonical URLs on your website that you want search engines to crawl and index. You can submit your XML sitemap to Google and Bing directly through Google Search Console and Bing Webmaster Tools, as well as reference it in your site’s robots.txt file. Search Console and Webmaster Tools will report on the URLs included in your XML sitemaps, whether they are indexed, and whether there are any errors or warnings associated with them. There are separate XML sitemap types for webpages, images, videos, and news articles. In this article we’ll focus only on XML sitemaps for standard webpages. Sitemap Attributes XML sitemaps support multiple attributes for a listed URL. The three main attributes for every listed URL are the last modified date (<lastmod>), the priority from 0.0 to 1.0 (<priority>), and how often the content on the URL is expected to change (<changefreq>).
Many XML sitemaps will have all three of these attributes defined for every URL listed in the sitemap. However, most search engines – Google included – only pay attention to one of those attributes: the <lastmod> date. When a URL has a <lastmod> date that is more recent than the last time the URL was crawled by the search engine, it’s a strong indicator that the URL should be re-crawled to see what has changed. As such, I always recommend making sure the <lastmod> attribute is accurate and updated automatically when a page on the site is changed in a meaningful way. Most XML sitemap generators, like the Yoast SEO plugin for WordPress, will ensure the <lastmod> attribute is automatically updated in the XML sitemap when a page is changed in the site’s backend. The other two attributes, <changefreq> and <priority>, are seen as too ‘noisy’ to be used as proper signals. Often these are set incorrectly or manipulated to try and trick search engines into crawling pages more frequently than necessary, so they tend to be ignored by most crawlers. I tend to recommend leaving out these attributes entirely. It makes the XML sitemap’s file size smaller, and results in less clutter which makes sitemaps easier to troubleshoot. Sitemap Size In Google’s support documentation on XML sitemaps, they say a sitemap file can’t contain more than 50,000 URLs and must be no larger than 50 MB uncompressed. If your site has more than 50,000 URLs, you can break them up into separate sitemaps and submit a so-called sitemap index – an XML sitemap that lists only other XML sitemaps. And you can submit 500 sitemap index files, each listing a maximum of 50,000 individual sitemaps. Which means the total amount of URLs you can submit to Google via XML sitemaps is: 500 x 50,000 x 50,000 = 1,250,000,000,000 (one trillion two hundred fifty billion) That’s more than enough for even the most excessive websites. However, in my experience it’s not ideal to fill XML sitemaps up to their maximum capacity.
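Following this advice, a minimal sitemap entry keeps only the location and the last modified date – the URL and date below are invented for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/some-page/</loc>
    <!-- updated automatically whenever the page meaningfully changes -->
    <lastmod>2019-08-01</lastmod>
    <!-- <priority> and <changefreq> deliberately omitted: most crawlers ignore them -->
  </url>
</urlset>
```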
For larger websites with hundreds of thousands or millions of pages, ensuring Google crawls and indexes all URLs submitted in XML sitemaps is quite challenging. Cramming every XML sitemap with the full 50,000 URLs often leads to incomplete crawling and indexing, with only a small fraction of the submitted URLs included in Google’s index. I have found that limiting sitemaps to only 10,000 URLs leads to more thorough levels of indexing. I’m not sure why – I suspect that smaller lists of URLs are easier for Google to process and crawl – but I’ve seen time and again that smaller sitemaps lead to higher degrees of indexing. As a result, I always urge large websites to use smaller XML sitemaps – but not too small! Some huge websites limit XML sitemaps to 1000 URLs, which means you end up with thousands of individual sitemap files. This too brings complications, as Google Search Console will only list 1000 sitemap files in its Sitemaps reports. If you have more than 1000 individual XML sitemap files, you will not be able to get a complete picture of their indexing performance from Google Search Console. A happy medium is to limit XML sitemap files to 10,000 URLs each: this ensures a higher degree of crawling and indexing than a 50,000-URL sitemap, while avoiding the reporting limitations in Google Search Console. This 10k limit was first explored by Nick Eubanks, and I’ve seen similarly good results from it. Sitemaps by Content Type When analysing indexing problems on websites, XML sitemaps can be very useful. However, if all URLs on a website are simply heaped together in XML sitemaps regardless of the purpose of each URL, then troubleshooting SEO issues becomes more challenging.
A great way to make XML sitemaps more useful and helpful is to separate them out by content type, so that there are different XML sitemap files for different types of pages. For example, on an ecommerce site you should have different XML sitemap files for your static content pages (about us, terms & conditions, etc), your category and subcategory pages (hub pages), and your product pages. Alternatively, you can create separate sitemap files for each category of products, so that you can quickly see which product categories are well-indexed and which ones aren’t. Combining the two approaches also works, where you have separate XML sitemaps for each category’s hub pages and product pages. For news publishers, I recommend listing section pages and articles in separate XML sitemap files. This is because we want to make sure Google has indexed every section page on the site (as these are important for new article discovery), whereas achieving 100% indexing for all individual articles on a news site is extremely difficult. Keeping articles in separate XML sitemaps from section pages means you can troubleshoot potential issues more effectively and get better data on the index performance of both types of pages. Additionally, news publishers should have a news-specific XML sitemap that only lists the articles published in the last 48 hours. This aids Google with discovering your newly published and recently updated articles. Discovery vs PageRank Flow One common misconception about XML sitemaps is that they can replace a regular crawl of the website. Some people think that by having a good XML sitemap, the website itself doesn’t need to be fully crawlable. After all, they reason, the URLs we want Google to crawl and index are listed in the XML sitemap, so the website doesn’t need to have crawlable links to these URLs. This is entirely wrong. The primary mechanism through which search engines discover content is still crawling.
Your website needs to have a good internal link structure that enables crawlers (and your website’s visitors) to find all your important pages with as few clicks as possible. And, more importantly, links enable the flow of PageRank (link value) through your site. Without PageRank, your website’s pages aren’t going to rank in search results. XML sitemaps in no way replace internal links. XML sitemaps don’t distribute any link value, and they don’t guarantee indexing and ranking of your website’s pages. Sitemaps are a supplementary signal for Google and support a website’s internal linking and canonicalisation – they are not intended to replace a proper crawlable website. You should always make sure your website is fully crawlable, and that all URLs listed in your XML sitemap can also be discovered by simply clicking on links on your site. If a URL is listed in a sitemap but doesn’t have any links pointing to it, Google is very unlikely to crawl the URL and even less likely to rank it in its search results. In a Nutshell Well-crafted XML sitemaps can help your website’s crawling and indexing by search engines, but for me the main purpose of sitemaps is to help troubleshoot SEO issues on your site. The data reported in Google Search Console on XML sitemaps is the real reason you want to have good sitemap files. Keep your sitemaps relatively small and focused with no unused attributes and no more than 10,000 URLs. Separate them out for different content types, and always make sure that URLs listed in your sitemaps are also fully discoverable through a web crawl. Good luck and if you have any comments or questions about XML sitemaps, use the comments below and I’ll try to respond as best I can.

  • Online Technical SEO Training Course

    I’ve been delivering my technical SEO training course in person for several years now. It’s been a very rewarding experience, with full classrooms and great feedback from the participants. Delivering these training courses in person has always felt like a competitive advantage. The interactive element of my training is part of the appeal. I encourage my students to ask any questions they want, either during the training, in the breaks, or after the session. I always try to set the ground rule that there is no such thing as a stupid question, and I want every participant to feel empowered to ask whatever they want to make sure they get maximum value from the training. We’ve toyed with delivering this training in an online format for a long time. Now that most of us are stuck at home, it feels like the right time to take the plunge and see if we can do this. So my technical SEO training is going online! I want to preserve the interactive element of my training as much as possible, which means it’ll be delivered live. No pre-recorded videos or anything like that; it’ll be exactly like a classroom, except it’ll be done via Zoom. And instead of one long day, we’ll spread the training out over two half-day sessions. The first online training will be delivered on 27 & 28 August 2020 in two morning sessions (UK/Ireland time). The training content will be the same as my classroom training, though I might tweak it a bit to facilitate the online format – I expect I’ll be able to cover a bit more ground online, so I am likely to put some more content into the training. Because it’ll be delivered live, we want to keep the number of participants limited to encourage interaction and make sure everyone gets maximum value from the sessions. So if you’re interested in the training, make sure you book your spot soon!

  • Introduction to PageRank for SEO

    When Google was launched back in 1998, they introduced a mechanism for ranking web pages that was radically different from how the established search engines at the time worked. Up to then, most search engines relied exclusively on content and meta data to determine if a webpage was relevant for a given search. Such an approach was easily manipulated, and it resulted in pretty poor search results where the top ranked pages tended to have a lot of keywords stuffed into the content. Google radically shook things up by introducing PageRank as a key ranking factor. Content still mattered to Google, of course, but rather than just look at which webpage had the keyword included most often, Google looked at how webpages linked to one another to determine which page should rank first. Google’s theory was that a link from one webpage to another counted as a ‘vote’, a recommendation from that webpage for the page that was linked to. And the more ‘votes’ a webpage had – the more links that pointed to it – the more Google felt it could trust that page to be sufficiently good and authoritative. Therefore, pages with the most links deserved to rank the highest in Google’s results. It’s interesting to note that the PageRank concept was heavily inspired by similar technology developed two years earlier by Robin Li, who later went on to co-found the Baidu search engine. (Thanks to Andreas Ramos for pointing that out to me!) More than two decades later, Google still relies heavily on PageRank to determine rankings. For a long time, Google allowed us to see an approximation of a webpage’s PageRank through their browser toolbar, which included a PageRank counter that showed the current webpage’s PageRank as an integer between 0 and 10. This Toolbar PageRank (TBPR) was a very rough approximation of the actual PageRank that Google had calculated, and was a linear representation of the logarithmic scale that Google used internally to determine PageRank.
So going from TBPR 1 to 2 represented roughly a tenfold growth in actual PageRank, and going from TBPR 2 to 3 required another tenfold increase (a hundredfold compared to TBPR 1) – and so on, all the way to the mythical and almost unachievable TBPR 10.

The problem with TBPR was that folks in the SEO industry obsessed over it, to the detriment of all other areas of SEO (like, you know, publishing content that actually deserved to be read rather than just serve as a platform to manipulate PageRank). Danny Sullivan (now employed by Google as their search liaison, but at that time still working for Search Engine Land as their chief search blogger) wrote an obituary for Toolbar PageRank which explains rather well why it was such a terrible metric.

So Google stopped showing Toolbar PageRank. First, the Toolbar scores were no longer updated from 2013 onwards, and thus ceased to accurately reflect a webpage’s PageRank. Then, in 2016, Google retired the Toolbar PageRank icon entirely. This retirement of TBPR led a huge contingent of SEOs to believe Google had stopped using its internal PageRank metric. You can forgive the SEO community for this conclusion, because Google also stopped talking about PageRank in favour of more conceptual terms like ‘link value’, ‘trust’, and ‘authority’. Moreover, the original patent for PageRank that Google filed in 1998 wasn’t renewed and expired in 2018.

But PageRank never went away. We just stopped being able to see it.

Also in 2018, a former Google engineer admitted on a Hacker News thread that the original PageRank algorithm had been replaced internally in 2006 by a new approach to evaluating links. The patent covering this new approach can easily be seen as the official successor to the original PageRank. In fact, I’d highly recommend you read Bill Slawski’s analysis of it, as no one does a better job of analysing Google patents than him. In this blog post I don’t want to go down the patent analysis route, as that’s firmly Bill’s domain.
Instead, I want to make an attempt at explaining the concept of PageRank in its current form in such a way that makes the theory applicable to the day-to-day work of SEOs, and hopefully clears up some of the mysticism around this crucial ranking factor. Others have done this before, and I hope others will do it after me, because we need more perspectives and opinions, and we shouldn’t be afraid to retread existing ground if we believe it helps the industry’s overall understanding of the topic. Hence this 4000-word blog post about a 22-year-old SEO concept.

Note that I am not a computer scientist, I have never worked at Google, and I’m not an expert at analysing patents and research papers. I’m just someone who’s worked in the trenches of SEO for quite a while, and has formed a set of views and opinions over the years. What I’m about to share is very likely wrong on many different levels. But I hope it may nonetheless be useful.

The Basic Concept of PageRank

At its core, the concept of PageRank is fairly simple: page A has a certain amount of link value (PageRank) by virtue of links pointing to it. When page A then links to page B, page B gets a dose of the link value that page A has. Of course, page B doesn’t get the same PageRank as page A already has. While page A has inbound links that give it a certain amount of PageRank, in my example page B only gets PageRank through one link from page A. So page B cannot be seen as equally valuable as page A. Therefore, the PageRank that page B gets from page A needs to be less than 100% of page A’s PageRank. This is called the PageRank Damping Factor.

In the original paper that Google published to describe PageRank, they set this damping factor to 0.85. That means the PageRank of page A is multiplied by 0.85 to give the PageRank of page B. Thus, page B gets 85% of the PageRank of page A, and 15% of the PageRank is dissolved. If page B were then to have a link to page C, the damping factor would apply again.
The PageRank of page B (85% of page A’s PageRank) is multiplied by 0.85, and so page C gets 72.25% of page A’s original PageRank. And so on, and so forth, as pages link to one another and PageRank distributes through the entire web.

That’s the basic idea behind PageRank: pages link to one another, link value flows through these links and loses a bit of potency with every link, so webpages get different amounts of PageRank from every link that points to them. Pages that have no links at all get a basic starting amount of PageRank of 0.15, as extrapolated from the original PageRank calculation, so that there’s a jumping-off point for the analysis and we don’t begin with zero (because that would lead to every webpage having zero PageRank).

Damping Factor Modifiers

The above all makes sense if a webpage has just one link pointing to another page. But most webpages will have multiple links to other pages. Does that mean each of those links gets 85% of the starting page’s PageRank? In its original form, PageRank would distribute evenly across all those links. So if you had ten links to different pages from page A, that 85% of link value would be shared evenly across all of those links, so that each link would get 8.5% of page A’s link value (1/10th of 85%). The more links you had on your page, the less PageRank each linked page would receive.

This led a lot of SEOs to adopt a practice called ‘PageRank Sculpting’, which involved hiding links from Google (with nofollow attributes or other mechanisms) to ensure more PageRank would flow through the links you did want Google to count. It was a widespread practice for many years and seems to have never really gone away.

Reasonable Surfer

But this even distribution of PageRank across all links on a page was a short-lived aspect of PageRank. The engineers at Google quickly realised that letting link value flow evenly across all links on a page didn’t make a lot of sense.
A typical webpage has links that are quite discreet and not likely to be clicked on by actual visitors of the page (such as links to privacy policies or boilerplate content), so why should these links get the same PageRank as a link that features prominently in the page’s main content?

So in 2004 Google introduced an improved mechanism for distributing PageRank across multiple links. This is called the ‘Reasonable Surfer’ model, and it shows that Google started assigning different weights to links depending on the likelihood of a link being clicked on by a ‘reasonable surfer of the web’ – i.e. an average person browsing webpages. Basically, Google modified the PageRank Damping Factor depending on whether a link was actually likely to be used by an average person. If a link was very prominent and there was a good chance a reader of the webpage would click on it, the damping factor stayed low and a decent chunk of PageRank would flow through that link. But if a link was hidden or discreetly tucked away somewhere on the page, such as in the footer, it would get a much higher damping factor and so would not get a lot of value flowing through it. Pages linked from such hidden links would not receive much PageRank, as their inbound links were unlikely to be clicked on and so were subjected to very high damping factors.

This is a key reason why Google wants to render pages as it indexes them. Just looking at the HTML source code of a page doesn’t necessarily reveal the visual importance of a link. Links could easily be hidden with some CSS or JavaScript, or inserted as a page is rendered. By looking at a completely rendered version of a webpage, Google is better able to accurately evaluate the likelihood of a link being clicked, and so can assign proper PageRank values to each link. This aspect of Google’s PageRank system is still being updated and refined, so it’s safe to assume Google kept using it and improving it.
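To make the arithmetic concrete, here is a minimal sketch of how damped PageRank might flow through weighted links. The 0.85 damping factor comes from the original PageRank paper; the click-likelihood weights are made-up numbers standing in for the Reasonable Surfer idea, not anything Google has published:

```python
# Illustrative sketch only. Under the original model every link would
# carry an equal weight; under Reasonable Surfer a prominent in-content
# link gets a higher weight than a footer link.

DAMPING = 0.85

def distribute_pagerank(page_rank, links):
    """Split a page's outgoing PageRank across its links.

    `links` maps a target page to a relative click likelihood (weight).
    """
    total_weight = sum(links.values())
    flowable = page_rank * DAMPING  # 15% dissipates at every hop
    return {target: flowable * weight / total_weight
            for target, weight in links.items()}

# Page A has PageRank 1.0 and three links of varying prominence.
flows = distribute_pagerank(1.0, {
    "main-content-link": 0.7,   # prominent, likely to be clicked
    "sidebar-link": 0.25,
    "footer-link": 0.05,        # barely any PageRank flows here
})
```

Setting all three weights equal reproduces the original even-split behaviour, which is why, under this model, the sheer number of links matters less than where they sit on the page.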
For me, the Reasonable Surfer approach also makes PageRank Sculpting obsolete. With Reasonable Surfer, the number of links on a page is not a determining factor of how much PageRank each link gets. Instead, the visual prominence of a link is the key factor that decides how much PageRank flows through that link. So I don’t believe you need to ‘hide’ links from Google in any way. Links in the footer of a page are likely to be ignored for PageRank purposes anyway, as users aren’t likely to click on them, so you don’t need to add ‘nofollow’ attributes or hide them in another way.

Internal versus External Links

What follows is purely speculation on my end, so take it with a grain of salt. But I believe that, in addition to a link’s prominence, the applied PageRank Damping Factor also varies depending on whether the link points to an external site or an internal page. I believe that internal links to pages within the same website have a lower damping factor – i.e. send more PageRank to the target page – than links that point to external websites.

This belief is supported by anecdotal evidence where improvements in internal linking have a profound and immediate impact on the rankings of affected webpages on a site. With no changes to external links, improving internal linking can help pages perform significantly better in Google’s search results. I strongly suspect that PageRank flowing within a website diminishes more gradually, whereas PageRank that is sent to other websites dissipates more quickly due to a larger damping factor.

To give it some numbers (which I randomly pulled out of thin air for the purpose of this example, so please don’t take this as any sort of gospel): when a webpage links to an external website, Google may apply a damping factor of 0.75, which means 25% of PageRank is lost and only 75% arrives at the destination page.
Whereas if that same webpage links to another page on the same site, the damping factor may be 0.9, so that the internal target page receives 90% of PageRank and only 10% is lost.

Now if we accept this as plausible, it raises an interesting question: when is a link considered ‘internal’ versus ‘external’? Or, more simply put, what does Google consider to be a ‘website’?

Subdomains vs Subfolders

This simple question may have a complicated answer. Take a website that uses multiple technology platforms, such as a WordPress site that also uses another CMS to power a forum. If both the WordPress site and the forum exist on the same overall domain, such as with a /forum/ subfolder, I’m pretty confident that Google will interpret it as just one website, and links between the WordPress pages and the forum pages will be seen as internal links.

But what if the forum exists on a subdomain? The forum behaves very differently from the main WordPress site on the www subdomain, with a different technology stack and different content. So in that scenario, I strongly suspect Google will treat the subdomain as a separate website from the WordPress site, and any links between them will be seen as external links.

For me, this lies at the heart of the subdomain vs subfolder debate that has raged within the SEO industry for many years. Hosting a section of your site on a subdomain makes it more likely it’ll be interpreted as a separate site, so I believe links from your main site to the subdomain will be seen as external links and be subjected to higher damping factors. Thus your subdomain’s ranking potential is diminished, because it receives less PageRank from your main site. Whereas if you put extra features of your site, such as a forum or a blog, in a subfolder on the same domain as your main site, it’s likely Google will simply see this as extra pages on your site, and links pointing to these features will be internal links and send more PageRank to those areas.
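Plugging the made-up numbers from this section into code shows how starkly the two scenarios diverge. Remember that both damping factors (0.9 internal, 0.75 external) are my speculative illustration, not confirmed Google values:

```python
# Speculative damping factors from the text, for illustration only.
INTERNAL_DAMPING = 0.9
EXTERNAL_DAMPING = 0.75

def pagerank_received(source_pagerank, is_internal):
    """PageRank arriving at a link's target under this speculative model."""
    factor = INTERNAL_DAMPING if is_internal else EXTERNAL_DAMPING
    return source_pagerank * factor

page_rank = 1.0
# A blog in a subfolder is linked internally...
subfolder_blog = pagerank_received(page_rank, is_internal=True)
# ...whereas a blog on a subdomain may be treated as an external site.
subdomain_blog = pagerank_received(page_rank, is_internal=False)
```

Under these assumptions, the subfolder blog would receive 0.9 of the linking page’s PageRank versus 0.75 for the subdomain, and the gap compounds with every additional hop through the site.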
This is why I recommend that my clients never put crucial rankable resources like blog articles and user generated content on a subdomain, unless there’s really no other choice. If you do have to use subdomains, you should try to use the same technology stack as your main site where possible, with the same design, boilerplate content, and page resources (images, CSS, JavaScript). This will increase the chances that Google will interpret the subdomain as being a part of the main domain’s website.

Redirects and PageRank

A long-standing question in SEO is “how much PageRank is lost through a redirect?” When you change a webpage’s URL and redirect the old URL to the new location, do you lose some of the original URL’s PageRank in that redirect? Over the years the answers from Googlers have varied a bit.

Historically, Google has confirmed that the amount of PageRank lost through a redirect is the same as through a link. This means that a redirect from page A to page B counts as a link from page A to page B, and so the PageRank Damping Factor applies and page B receives less PageRank (by about 15%, or whatever damping factor Google chooses to apply in that specific context). This is done to prevent redirects from being used to artificially manipulate PageRank.

However, more recently some Googlers have said that redirects do not necessarily cause PageRank loss. In a 2018 Webmaster Hangout, John Mueller emphasised how PageRank is consolidated on a canonical URL, and that redirects serve as a canonicalisation signal. This would imply that there is no PageRank loss in a redirect, but that the redirect tells Google that there is a canonical URL that all the relevant ranking signals (including PageRank) should be focused on.

Nonetheless, whenever a website goes through an effort to minimise redirects from internal links and ensures all its links point directly to final destination URLs, we tend to see an uplift in rankings and traffic as a result.
This may not necessarily be due to reduced PageRank loss and more due to optimised crawling and indexing, but it’s an interesting correlation nonetheless. Because redirects result in extra crawl effort, and there is a chance that some redirects still cause PageRank loss, I would always recommend that websites minimise internal redirects as much as possible. It’s also good practice to avoid chaining multiple redirects due to the extra crawl effort, and Google tends not to crawl beyond a maximum of five chained redirects.

PageRank Over Time

Due to the volatile nature of the web, a webpage’s PageRank is never a static number. Webpages come and go, links appear and disappear, pages get pushed deeper into a website, and things are constantly in flux. So Google has to recalculate PageRank all the time.

Anecdotally, many SEOs believe links lose value over time. A newly published link tends to have a stronger positive ranking effect than a link that was published years ago. If we accept the relevant anecdotal evidence as true, it leads to questions about applied PageRank damping factors over time. One possibility is that Google applies higher damping factors to older links, which means those old links pass less PageRank as time goes on. Another possibility is that the webpages that contain those old links tend to get buried deeper and deeper on a website as new content is published, so there are more layers of clicks that each siphon off a bit of PageRank. That means a link from page A to page B passes less PageRank not because of a higher damping factor, but because page A receives less PageRank itself as it sinks into the website’s archive.

Fact is, we don’t really know what Google does with PageRank from historic links. All we know is that links do tend to lose value over time, which is why we constantly need to get new links pointing to a website to maintain rankings and traffic.
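The second possibility above – that a linking page sinks deeper into the archive and loses PageRank itself – can be sketched with a few lines of arithmetic. The 0.85 damping factor is from the original paper; the one-link-per-hop chain and the click depths are simplifying assumptions of mine:

```python
# Sketch: an old link passes less PageRank not because the link itself
# is damped more, but because the page hosting it sits deeper in the
# site, and every extra click level applies the damping factor again.

DAMPING = 0.85

def pagerank_at_depth(entry_pagerank, clicks_from_homepage):
    """PageRank reaching a page n clicks deep, assuming one link per hop."""
    return entry_pagerank * (DAMPING ** clicks_from_homepage)

# An article linked straight from the homepage, versus the same article
# years later, buried five archive pages deep:
fresh = pagerank_at_depth(1.0, 1)   # 0.85
buried = pagerank_at_depth(1.0, 5)  # roughly 0.44
```

Any link on the buried page now has barely half the PageRank to pass on that it had at publication, without Google changing anything about the link itself.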
URL Seed Set

There’s one more aspect of PageRank we need to talk about, which the updated PageRank patent mentions but the original didn’t. The updated PageRank patent frequently mentions a ‘seed set of pages’. This refers to a starting point of URLs to calculate PageRank from. I suspect that this was introduced as a way to better calculate PageRank by starting from webpages that are known and understood to be trusted and reliable, such as Wikipedia articles or high authority news websites. As per the patent, seed sets “… are specially selected high-quality pages which provide good web connectivity to other non-seed pages.”

What makes this seed set especially interesting is how it’s used to modify a webpage’s PageRank based on distance, i.e. how many clicks it takes to get from a seed URL to the webpage. As per the patent, “… shortest distances from the set of seed pages to each given page in the link-graph are computed,” and “[t]he computed shortest distances are then used to determine the ranking scores of the associated pages.”

This is entirely separate from how PageRank flows through links. Rather than counting a webpage’s cumulative PageRank as it flows through links, the patent explicitly states that it’s about the ‘shortest distance’ from the seed set to the webpage. So it’s not about an accumulation of PageRank from one or more links, but a singular number that shows how many clicks it would take to get to the webpage from any URL in the seed set. By all appearances, Google uses the number of clicks from a seed URL to a webpage as a PageRank modifier, where fewer clicks means higher PageRank.

PageRank and Crawling

So far we’ve talked about PageRank exclusively as a ranking factor. But this is just part of the picture. There is another important effect that PageRank has on a URL: it helps determine how often it is crawled by Googlebot. While Google endeavours to crawl the entire web regularly, it’s next to impossible to actually do this.
The web is huge and growing at an exponential rate. Googlebot couldn’t possibly keep up with all the newly published content while also keeping track of any changes made to existing webpages. So Googlebot has to decide which known URLs to recrawl to find updated content and new links.

PageRank feeds into this decision. Basically, the more PageRank a URL has, the more often Googlebot will crawl it. A page that has a lot of links pointing to it will be seen as more important for Google to crawl regularly. And the opposite also applies – pages with very low PageRank are seen as less important, so Google will crawl them less often (or not at all). Note that PageRank is only part of that equation, but it’s good to keep in mind when you talk about optimising a site’s crawling. All this and more is explained elegantly by Dawn Anderson in her Search Engine Land article about crawl budget, which is definitely worth a read.

What This Means For SEO

Understanding all of the above, what does this mean for SEOs? How can you apply this theory to your daily work? We can distil the theory into a few clear and concise recommendations for improving a website’s use of PageRank:

1. Links are a vital ranking factor

So far nothing has managed to replace PageRank as a reliable measure of a site’s trust and authority. However, Google is very good at ignoring links it doesn’t feel are ‘natural’, so not all links will pass PageRank. In fact, according to Paul Madden from link analysis software Kerboo, as many as 84% of all links on the web pass little to no value. In a nutshell, it’s not about how many links you have, but how much value a link could pass to your site. Which brings me to the second point:

2. Prominent links carry more weight

The most valuable type of link you can get is one that a user is likely to click on. A prominent link in the opening paragraph of a relevant piece of good content is infinitely more valuable than a link hidden in the footer of a website.
Optimise for visually prominent, clickable links.

3. Internal links are golden

It’s easy to obsess over getting more links from external sites, but often there’s just as much gain to be had – or more – from optimising how link value flows through your website. Look at your internal link structure and how PageRank might flow through your site. Start with the webpages that have the most inbound links (often your homepage and some key pieces of popular content) and find opportunities to send PageRank to the URLs you want to boost rankings for. This will also help with optimising how Googlebot crawls your site.

4. Links from seed URLs are platinum

We don’t know which URLs Google uses for its seed set, but we can make some educated guesses. Some Wikipedia URLs are likely part of the seed set, as are news publishers like the New York Times and the BBC. And if they’re not directly part of the seed set, they’re likely to be only one or two clicks from the actual seed URLs. So getting a link from those websites is immensely valuable – and typically very hard.

5. Subfolders are usually superior

In almost any given context, content placed in a subfolder on a site will perform better than content hosted on a subdomain. Try to avoid using subdomains for rankable content unless you really don’t have a choice. If you’re stuck with subdomains and can’t migrate the content to a subfolder, do your best to make the subdomain look and feel like an integral part of the main site.

6. Minimise redirects

While you can’t avoid redirects entirely, try to minimise your reliance on them. All your internal links should point directly to the destination page with no redirect hops of any kind. Whenever you migrate URLs and have to implement redirects, make sure they’re one-hop redirects with no chaining.
You should also look at pre-existing redirects, for example from older versions of the website, and update those where possible to point directly to the final destination URL in only one redirect hop.

Wrapping Up

There’s a lot more to be said about PageRank and various different aspects of SEO, but the above will hopefully serve as a decent enough introduction to the concept. You may have many more questions about PageRank and link value, for example about links in images or links with the now ambiguous ‘nofollow’ attribute. Perhaps a Google search can help you on your way, but you’re also welcome to leave a comment or get in touch with me directly. I’ll do my best to give you a helpful answer if I can.

[Toolbar PageRank image credit: Search Engine Land]

  • Google Guidance for News Coverage

    In December of last year, Google made some drastic changes to Google News, specifically to how it selects websites that feature in the news vertical and related areas of Google’s search ecosystem, such as the Top Stories carousel and the Discover feed. Previously, news publishers had to apply to be included in Google News, and there was a manual verification process. In the current Google News, sites and articles are automatically selected, and publishers do not need to apply to be included, as per the official support documentation.

This seismic shift in Google’s approach to news publishers was hidden among the support documentation surrounding the new Publisher Center, which replaced the old partner dashboard that Google News-approved publishers had access to. In this new Publisher Center, publishers can control certain aspects of their sites’ visibility in Google’s news-related elements, such as their branding and topical focus areas. Additionally, getting through the new Publisher Center’s approval process means a site can be included in the Newsstand app on Android mobile devices, and opens up additional monetisation opportunities.

I’ve gotten a lot of questions from publishers asking whether they need to go through the verification process in the new Publisher Center to be included in Google News. The answer is no, you don’t need to be verified in the Publisher Center to show up in Google News or other news-focused areas of Google. However, I do recommend going through the process to ensure your site is properly categorised and branded in Google’s news ecosystem.

Since that initial launch of the new Publisher Center and the abandonment of the manual approval process for Google News, the support documentation for publishers has steadily expanded to provide more details on the new approach to news content in Google.
In addition to clearer guidance on how sites can now be considered for Google News, the official webmaster blog has also published information about best practices for publishers regarding news coverage of current events.

In this new guide, Google is of course emphasising their AMP standard (my thoughts on which can be read here; nonetheless, I do recommend publishers implement AMP lest they cut off their nose to spite their face). Google highlights the importance of adding article structured data to your AMP articles, especially the article’s publication date: “We also recommend that you provide a publication date so that Google can expose this information in Search results, if this information is considered to be useful to the user.”

When you’re live-streaming a news event, Google wants you to use the BroadcastEvent structured data markup and submit your relevant content through their Indexing API. “If you are live-streaming a video during an event, you can be eligible for a LIVE badge by marking your video with BroadcastEvent. We strongly recommend that you use the Indexing API to ensure that your live-streaming video content gets crawled and indexed in a timely way. The Indexing API allows any site owner to directly notify Google when certain types of pages are added or removed.”

This public acknowledgement of the Indexing API leads me to believe Google will be putting more focus on that technology. I wouldn’t be surprised if later in 2020 Google will allow news publishers (and perhaps all publishers) to start tapping into the API to get their content quickly into Google’s index. While potentially subject to abuse, a public indexing API makes perfect sense for a search engine that operates at the scale Google does; it basically moves the effort of discovering new content from Google’s crawlers to publishers’ technology stacks. So, essentially, it’ll save Google money.
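As a rough sketch, the live-stream markup Google describes pairs a VideoObject with a BroadcastEvent publication. Here it’s built with Python purely for illustration; all URLs, titles, and dates are placeholders, and you should check Google’s structured data documentation for the properties currently required:

```python
import json

# Sketch of VideoObject markup with a BroadcastEvent publication, the
# pattern described for earning a LIVE badge. Values are placeholders.
live_stream_markup = {
    "@context": "https://schema.org",
    "@type": "VideoObject",
    "name": "Live coverage of the event",
    "description": "Our reporters cover the event as it unfolds.",
    "thumbnailUrl": "https://example.com/live-thumb.jpg",
    "uploadDate": "2020-02-01T09:00:00+00:00",
    "publication": {
        "@type": "BroadcastEvent",
        "isLiveBroadcast": True,
        "startDate": "2020-02-01T09:30:00+00:00",
        "endDate": "2020-02-01T12:00:00+00:00",
    },
}

# Serialise for embedding in a <script type="application/ld+json"> tag.
json_ld = json.dumps(live_stream_markup, indent=2)
```

The resulting JSON-LD goes in the article’s head; notifying Google of the new page via the Indexing API is a separate, authenticated API call.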
Lastly, in this webmaster blog post Google advises publishers to ensure their AMP articles are also updated in Google’s AMP cache whenever changes are made to the articles. This is obviously something Google struggles with, as once an article is cached in the AMP cache it’s not always updated when the publisher’s version changes. Hence Google needs publishers to tell it when an AMP article has changed.

This latest webmaster blog post from Google is quite technical and focused on a narrow niche (news publishers). It shows Google wants publishers to become more technically adept at maximising their content for News visibility. It’s an area of news SEO that I also focus heavily on, and I hope to share more of my insights and experience in the coming months at relevant events and through this blog.

I feel Google is close to perfecting its topical evaluations of news publishers when it comes to which sites to trust and for which topics. Yet the technical realities of news SEO are still somewhat lagging behind Google’s envisaged ideal scenario. Publishers will need to ensure their websites are constantly improved and stay abreast of the demands Google places on their technologies. No matter how good your news content is, it’ll only be surfaced in Google search if the search engine can properly process it. This is not something you want to take for granted. Google’s technology keeps changing and progressing, which means your news site needs to do the same.

  • How SEO for News can help all websites

    The field of search engine optimisation has become so diverse and all-encompassing that we see more and more specialised SEO service offerings. I’m no exception – coming from an IT background, the technical aspects of SEO align well with my skills and interests. Additionally, I’ve always been fascinated by the publishing industry and spent a year working in-house as the SEO specialist at a local newspaper. As a result, my SEO consultancy has developed into a specialised offering focused on technical SEO and SEO services for news publishers.

The latter aspect especially is something of a passion of mine. News publishers occupy a distinct space on the web as go-to sources for what is happening in the world. Search engines like Google have dedicated an entire vertical specifically to news – Google News and its significantly less popular rival Bing News – reflecting its importance to the web. Nowadays most of us get our daily news primarily from the internet, and search plays a huge role in how news is discovered and consumed.

Optimising news websites for visibility in search is different from regular SEO. Not only do search engines have dedicated verticals for news with their own rules, we also see news stories injected as separate boxes (usually at the top) on regular search results pages.

These Top Stories carousels are omnipresent: research from Searchmetrics shows that 11% of all Google desktop results and 9% of mobile results have a news element. This equates to billions of searches every year that show news articles in a separate box on Google’s first page of results. The traffic potential is of course enormous, which is why most news publishers are optimising primarily for that Top Stories carousel. In fact, the traffic potential from Top Stories is so vast that it dwarfs the Google News vertical itself.
As this data shows, visits to news websites from the dedicated vertical are a fraction of the total visits from Google search. That Google search traffic is mostly clicks from the Top Stories carousel. And maximising your visibility in that carousel means you have to play by somewhat different rules than ‘classic’ SEO.

Google News Inclusion

First off, articles showing in the Top Stories carousel are almost exclusively from websites that are part of Google’s separate Google News index. A study by NewsDashboard shows that around 98% of Top Stories articles are from Google News approved publishers. It’s extremely rare to see a news article in Top Stories from a website that’s not included in Google News.

Getting a website into Google News used to be a manual process where you had to submit your site for review, and Google News engineers took a look to see if it adhered to their standards and requirements. In December 2019 this was suddenly changed, and now Google says it will ‘automatically consider Publishers for Top stories or the News tab of Search’.

Inclusion in Google News is no guarantee your articles will show up in Top Stories. Once your site is accepted into Google News, the hard work really begins. First of all, Google News (and, thus, Top Stories) works off a short-term index of articles. Where regular Google search maintains an index of all content it finds on the web, no matter how old that content is, Google News has an index where articles drop out after 48 hours. This means any article older than two days will not be shown in Google News, and not be shown in Top Stories. (In fact, data from the NewzDash tool shows that the average lifespan of an article in Google News is less than 40 hours.)

Maintaining such a short-term index for news makes sense, of course. After two days, an article isn’t really ‘news’ any more. The news cycle moves quickly, and yesterday’s newspaper is today’s fish & chips wrapper.
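Because the window is so hard-edged, freshness can be checked mechanically. A trivial sketch, where the 48-hour cut-off is taken from the reporting above (Google doesn’t publish an exact figure, and the NewzDash data suggests effective lifespans are shorter still):

```python
from datetime import datetime, timedelta, timezone

# Assumed ~48-hour Google News index window, per the discussion above.
NEWS_INDEX_WINDOW = timedelta(hours=48)

def still_in_news_index(published_at, now=None):
    """Rough check: is an article still within the ~48h news index?"""
    now = now or datetime.now(timezone.utc)
    return now - published_at <= NEWS_INDEX_WINDOW

reference = datetime(2020, 2, 3, 12, 0, tzinfo=timezone.utc)
fresh = still_in_news_index(
    datetime(2020, 2, 2, 12, 0, tzinfo=timezone.utc), now=reference)
stale = still_in_news_index(
    datetime(2020, 1, 31, 12, 0, tzinfo=timezone.utc), now=reference)
```

A day-old article is still eligible; a three-day-old one has dropped out of the news index entirely, however well it might still rank in regular web search.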
Real-Time SEO

The implication for news SEO is rather profound. Where regular SEO is very much focused on the long term improvements of a website’s content and authority to steadily grow traffic, in news the effects of SEO are often felt within a few days at most. News SEO is pretty much real-time SEO. When you get something right in news SEO, you tend to know very quickly. The same applies when something goes wrong. This is reflected in traffic graphs; news websites tend to see much stronger peaks and troughs than regular websites:

[Search traffic graph for a regular site showing steady growth over time]
[Search traffic graph for a news publisher showing heavy peaks and drops in short timeframes]

Where most SEO is all about building long term value, in the news vertical SEO is as close to real-time as you can get anywhere in the search industry. Not only is the timeframe of the news index limited to 48 hours, often the publisher that gets a story out first is the one who achieves the first spot in the Top Stories box for that topic. And being first in Top Stories is where you’ll want to be for maximum traffic. So news publishers have to focus on optimising for fast crawling and indexing.

This is where things get interesting. Because despite being part of a separate curated index, websites included in Google News are still crawled and indexed by Google’s regular web search processes.

Google’s Three Main Processes

We can categorise Google’s processes as a web search engine into roughly three parts: crawling, indexing, and ranking. But we know Google’s indexing process has two distinct phases: a first stage where the page’s raw HTML source code is used, and a second stage where the page is fully rendered and client-side code is also executed. This second stage, the rendering phase of Google’s indexing process, is not very fast.
Despite Google’s best efforts, there are still long delays (days to weeks) between when a page is first crawled and when Google has the capacity to fully render that page. For news articles, that second stage is way too slow. Chances are the article has already dropped out of Google’s 48-hour news index long before it gets rendered.

As a result, news websites have to optimise for the first stage of indexing: the pure HTML stage, where Google bases its indexing of a page on the HTML source code and does not execute any client-side JavaScript. Indexing in this first stage is so quick, it happens within seconds of a page being crawled. In fact, I believe that in Google’s ecosystem, crawling and first-stage indexing are pretty much the same process. When Googlebot crawls a page, it immediately parses the HTML and indexes the page’s content.

Optimising HTML

In theory this sounds like it’s easier for SEOs to optimise news articles. After all, many indexing problems originate from that second stage of indexing where the page is rendered. In practice, however, the opposite is true. As it turns out, that first stage of indexing isn’t a particularly forgiving process.

In a previous era, before Google moved everyone over to their new Search Console and removed a lot of reports in the process, news websites had an additional element to the Crawl Errors report in Webmaster Tools. This report showed news-specific crawl errors for websites that had been accepted into Google News. It listed issues that Google encountered while crawling and indexing news articles. The types of errors shown in this report were very different from ‘regular’ crawl errors, and specific to how Google processes articles for its news index.

For example, a common error would be ‘Article Fragmented’. Such an error would occur when the HTML source was too cluttered for Google to properly extract the article’s full content.
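A useful habit this encourages: check what your article looks like in the raw HTML source, with no JavaScript executed, because that’s all the first indexing stage sees. Below is a rough sketch of that source-only view using Python’s standard library; the sample HTML strings are invented for illustration, and this is of course a crude approximation of what any real indexer does.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from raw HTML, roughly the way a
    source-only (first-stage) indexer sees a page -- no JavaScript
    is ever executed."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script> or <style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# Article body present in the raw source: visible in stage one.
server_rendered = "<html><body><article><p>Breaking: markets rally.</p></article></body></html>"
# Article body injected client-side: invisible until the rendering stage.
client_rendered = "<html><body><div id='app'></div><script>renderArticle()</script></body></html>"

print("markets rally" in visible_text(server_rendered))  # True
print("markets rally" in visible_text(client_rendered))  # False
```

If the story text only appears after client-side rendering, a news article may never make it into the 48-hour index in time.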
We found that code snippets for things like image galleries, embedded videos, and related articles could hinder Google’s processing of the entire article, and result in ‘Article Fragmented’ errors. Moving such blocks of code out of the HTML that contained the article content – placing them above or below the article HTML in the source code – tended to solve the problem and massively reduce the number of ‘Article Fragmented’ errors.

Google Has an HTML File Size Limit?

Another news-specific crawl error I frequently came across was ‘Extraction Failed’. This error is basically an admission that Google was unable to find any article content in the HTML code. And it pointed towards a very intriguing limitation within Google’s indexing system: an HTML size limit.

I noticed that ‘Extraction Failed’ errors were common on pages that contained a lot of inline CSS and JavaScript. On these pages, the article’s actual content wouldn’t begin until very late in the HTML source. Looking at the source code, these pages had about 450 KB of HTML above the spot where the article content actually began. Most of that 450 KB was made up of inline CSS and JavaScript – code that, as far as Google was concerned, added no relevancy to the page and was not part of its core content.

For this particular client, that inline CSS was part of their efforts to make the website load faster. In fact, they’d been recommended (ironically, by development advisors from Google) to put all their critical CSS directly into the HTML source rather than in a separate CSS file, to speed up browser rendering. These advisors were evidently unaware of a limitation in Google’s first-stage indexing system: namely, that it stops parsing HTML after a certain number of kilobytes.
When I finally managed to convince the site’s front-end developers to limit the amount of inline CSS, and the code above the article HTML was reduced from 450 KB to around 100 KB, the vast majority of that news site’s ‘Extraction Failed’ errors disappeared.

To me, this showed that Google has a file size limit for webpages. Where exactly that limit lies, I’m not sure; it’s somewhere between 100 KB and 450 KB. Anecdotal evidence from other news publishers I worked with around the same time makes me believe the actual limit is around 400 KB, after which Google stops parsing a webpage’s HTML and just processes what it’s found so far. A complete index of the page’s content has to wait for the rendering phase, where Google doesn’t seem to have such a strict file size limitation.

For news sites, exceeding this HTML size limit can have dramatic effects. It basically means Google cannot index articles in its first-stage indexing process, so articles cannot be included in Google News. And without that inclusion, articles don’t show up in Top Stories either. The traffic loss can be catastrophic.

Now, this particular example happened back in 2017, and Google’s indexing system has likely moved on since then. But to me it emphasised an often-overlooked aspect of good SEO: clean HTML code helps Google process webpages more easily. Cluttered HTML, on the other hand, can make it challenging for Google’s indexing system to make sense of a page’s content. Clean code matters. That was true in the early days of SEO, and in my opinion it’s still true today. Striving for clean, well-formatted HTML has benefits beyond just SEO, and it’s a recommendation I continue to make for many of my clients.

Unfortunately, Google retired the news-specific Crawl Errors report back in 2018, so we’ve lost valuable information about how Google processes and indexes our content. Maybe someone at Google realised this information was a bit too useful for SEOs.
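A simple audit you can run yourself is to measure how many kilobytes of HTML sit above the start of your article markup. This sketch assumes your article reliably begins at an `<article>` tag – adjust the marker for your own templates – and the ~400 KB threshold is my anecdotal estimate from above, not a documented Google limit.

```python
# How many KB of HTML come before the article content begins?
# If that preamble approaches the suspected ~400 KB parsing limit,
# first-stage indexing may truncate the page before the article.

def kb_before_marker(html, marker="<article"):
    """Size in KB of the HTML source preceding the article marker."""
    pos = html.find(marker)
    if pos == -1:
        raise ValueError("article marker not found in source")
    return len(html[:pos].encode("utf-8")) / 1024

# Simulate the client's page: ~450 KB of inline CSS before the article.
bloated = (
    "<html><head><style>" + "a" * (450 * 1024) + "</style></head>"
    "<body><article>Article text here.</article></body></html>"
)
print(round(kb_before_marker(bloated)))  # 450 -- well past the limit
```

In practice you’d feed this the raw HTML of a live article URL and flag anything where the preamble runs into the hundreds of kilobytes.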
;)

Entities and Rankings

It’s been interesting to see how Google has slowly transitioned from a keyword-based approach to relevancy to an entity-based approach. While keywords still matter, optimising content is now more about the entities underlying those words than about the words themselves. Nowhere is this more obvious than in Google News and Top Stories.

In previous eras of SEO, a news publisher could expect to rank for almost any topic it chose to write about, as long as its website was seen as sufficiently authoritative. For example, a website like the Daily Mail could write about literally anything and claim top rankings and a prime position in the Top Stories box. This was a simple effect of Google’s calculations of authority – links, links, and more links. With its millions of inbound links, few websites would be able to beat the Daily Mail on link metrics alone.

Nowadays, news publishers are much more restricted in their ranking potential, and will typically only achieve good rankings and Top Stories visibility for topics they cover regularly. This is all due to how Google has incorporated its knowledge graph (also known as the entity graph) into its ranking systems.

In a nutshell, every topic (like a person, an event, a website, or a location) is a node in Google’s entity graph, connected to other nodes. Where two nodes have a very close relationship, the entity graph will show a strong connection between the two.

For example, we can draw a very simplified entity graph for Arnold Schwarzenegger. We’ll put the node for Arnold in the middle, and draw some example nodes that have a relationship with Arnold in some way or another. He starred in the 1987 movie Predator (one of my favourite action flicks of all time), and was of course a huge bodybuilding icon, so those nodes will have strong connecting relationships with the main Arnold node. And for this example we’ll take a news website and say it only publishes articles about Arnold very infrequently.
So the relationship between Arnold and the website is fairly weak, indicated by a thin connecting line in this example entity graph. Now, if the website expands its coverage of Arnold Schwarzenegger and writes about him frequently over an extended period of time, the relationship between the two becomes stronger and the connection between their nodes is much more emphasised.

How does this impact the website’s Google rankings? Well, if Google considers the site to be strongly related to ‘Arnold Schwarzenegger’, then when it publishes a story about Arnold it’s much more likely to achieve prime positioning in the Top Stories carousel. But if that same website were to write about a topic it rarely covers, such as Jeremy Clarkson, it would be unlikely to achieve good rankings – no matter how strong its link metrics are. Google simply doesn’t see it as a reputable source of information about Jeremy Clarkson compared to news sites like the Daily Express or The Sun, because it hasn’t built that connection in the entity graph over time.

This entity-based approach to rankings is more and more prevalent in Google, and something all website owners should pay heed to. You cannot rely on authority signals from links alone. Websites need to build topical expertise so that they forge strong connections between themselves and the topics they want to rank for in Google’s knowledge graph. Links still serve the purpose of getting a website noticed and trusted, but beyond a certain level the relevancy signals of the entity graph take over when it comes to achieving top rankings for any keyword.

Lessons From News SEO

To summarise: all SEOs can take valuable lessons from vertical-specific SEO tactics. While some areas of news SEO are only useful to news publishers, many aspects of news SEO also apply to general SEO. What I’ve learned about optimising HTML and building entity graph connections while working with news publishers is directly applicable to all websites, regardless of their niche.
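The entity-graph dynamic described above can be reduced to a toy model: repeated coverage strengthens the edge between a site and a topic, and only a strong enough edge makes the site a credible ranking candidate. The weights, threshold, and domain name below are all illustrative assumptions of mine – Google publishes nothing about how its actual edge strengths work.

```python
from collections import defaultdict

class EntityGraph:
    """Toy model: edge strength between a site and an entity grows with
    coverage; only strong edges make the site a credible source.
    Threshold and weights are illustrative, not Google's values."""
    def __init__(self, threshold=5):
        self.edge = defaultdict(int)   # (site, entity) -> strength
        self.threshold = threshold

    def publish(self, site, entity):
        self.edge[(site, entity)] += 1  # each article builds the connection

    def is_credible_source(self, site, entity):
        return self.edge[(site, entity)] >= self.threshold

g = EntityGraph()
# "example-news.com" is a hypothetical publisher.
for _ in range(8):
    g.publish("example-news.com", "Arnold Schwarzenegger")  # sustained coverage
g.publish("example-news.com", "Jeremy Clarkson")            # one-off article

print(g.is_credible_source("example-news.com", "Arnold Schwarzenegger"))  # True
print(g.is_credible_source("example-news.com", "Jeremy Clarkson"))        # False
```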
You can learn similar lessons by looking at other verticals, like Local and Image search. In the end, Google’s search ecosystem is vast and interconnected. A specific tactic that works in one area of SEO may contain valuable insights for other areas of SEO. Look beyond your own bubble, and always be ready to pick up new knowledge. SEO is such a varied discipline, no one person can claim to understand it all. It’s one of the things I like so much about this industry: There’s always more to learn.

  • Are Boris Johnson’s PR People Manipulating Google Search?

Anyone remember the ‘Boris bus’? The pledge plastered across a red London bus to give £350 million to the NHS after the UK leaves the European Union? Here’s a reminder.

For a long time, when you searched for ‘boris bus’ in Google you’d see many references to this Brexit campaign promise. So many references, in fact, that it became a bit of an embarrassment for Boris Johnson, as so far it has seemed to be a rather empty promise. Hence why, in a June 2019 interview, Boris Johnson’s admission that he likes to ‘paint buses’ as a hobby raised some suspicion – primarily because it seemed to be a carefully crafted proclamation designed to game Google’s news algorithms.

First highlighted by the folks at Parallax in Leeds, this tactic did seem to have the intended effect initially, when the ‘boris bus’ search result changed to show the interview’s statement rather than the big red Brexit campaign bus. Then there was a bit of a backlash as some people caught on to the perceived deception; news outlets like the Daily Mail wrote about it, and these stories started to dominate Google’s results. Ironically, doing the same search today yields results about the bus’s manufacturer going into administration. So that initial attempt to game Google’s search results seems to have misfired a bit.

Yet this doesn’t seem to have discouraged the people behind Johnson’s PR spin machine. This week, it seems, the PR folks responsible for scripting Johnson’s public statements are giving it another attempt. Take these two search results for ‘boris model’, screenshotted a few hours apart by TheAndyMaturin. Once again this seems carefully crafted to shift public attention away from an embarrassing story for Boris Johnson, using language designed to make it into article headlines that then replace existing headlines covering a different story altogether.
Keywords in Headlines

This is not particularly difficult to do in Google, especially in Google News, which supplies content to the Top Stories boxes you see in regular Google search results. The news-specific part of Google’s algorithms is focused on speed, i.e. surfacing recent articles, and therefore loses some of its accuracy in topical targeting in favour of simple keyword matching. With a relevant keyword in an article headline on an official Google News-approved publisher’s website, Google is likely to show that article in its news boxes – especially when the only alternatives are articles older than 48 hours, which is the primary window of opportunity for articles to show up in Google News.

Google Steers The Public Debate

It seems Johnson’s PR people have a keen sense of Google’s importance in steering the public debate, as it is among the primary sources of news for the general populace. Moreover, these PR people know how to play the game to their advantage, and have journalists at the UK’s major outlets dancing like puppets by serving up the right words to put into their headlines.

Wherever you stand on the morality of this tactic, it is effective. While those of us working in digital industries tend to spot these efforts rather easily, most of the public won’t notice these shenanigans and will simply consume the headlines they’re shown. Basically, it’s an effective means of burying embarrassing stories in favour of more innocuous articles. Smart use of language gets certain key terms into headlines for Google to then show in its search results.

You could possibly write off the first ‘boris bus’ attempt as a coincidence, but this latest instance seems to show a pattern of deliberate manipulation. Especially considering searches for the actual person involved in the scandal are diminishing, leaving an opportunity to claim Google search real estate for less focused searches.
All is fair in love and war, and UK politics is certainly in a state of war right now. Update: Folks have pointed out to me that this may in fact be the third such instance, as this one is somewhat suspicious too.

  • My Digitalzone’18 talk about SEO for Google News

Last year I was fortunate enough to deliver a talk at the Digitalzone conference in Istanbul. Among a great lineup of speakers on SEO, social media, and online advertising, the organisers asked me to speak about my specialist topic: SEO for Google News.

In my talk I outlined what’s required for websites to be considered for inclusion in the curated Google News index, and how news websites can optimise their visibility in Google News – especially the associated Top Stories box in regular search results. You can view the recording of my entire talk online here.

Since I delivered that talk in November 2018, there have been numerous changes to Google News – specifically to how Google handles original content and determines trust and authority. SEO for news publishers remains a fast-moving field, where publishers need to pay constant attention to the rapidly evolving technical and editorial demands Google places on news sites. If you’re a publisher in need of help with your SEO, give me a shout.

  • How to do a Technical SEO Audit

Since late 2017 Andrew Cock-Starkey, better known as Optimisey, has been organising regular meetups in his native Cambridge where he gets SEOs from all over the world to come and give a talk. While the meetups aren’t huge, usually having a few dozen attendees, Andrew records the talks and puts them online for anyone to watch for free. It’s a great way to share knowledge around the SEO industry, so when Andrew asked if I wanted to come over and do a talk I couldn’t say no. Sharing my experience and expertise with the industry is important to me, as that’s how I learned much about SEO myself.

Hence, earlier this year I made the trip to Cambridge and did a talk about my approach to technical SEO site audits. The video of that talk is free to watch, and I hope people find it useful and worthwhile. There’s also a full transcript available on the Optimisey website if you prefer to read text rather than watch a video.

Make sure you check out some of the other Optimisey meetup videos, which include awesome talks from people like JP Sherman, Marie Haynes, Jennifer Hoffman, Chris Green, Stacey MacNaught, Kevin Indig, and many others.

  • SEO Strategies for Growth: One-Day SEO Training in Belfast on 18 September

I’ve been delivering specialised technical SEO trainings for a few years now, as well as countless bespoke SEO trainings for agencies and in-house teams. Now I’ve teamed up with Growth Marketing Live to deliver a special one-day SEO training as part of their conference, where I’ll teach SEO best practices that deliver lasting growth.

This SEO Strategies for Growth training day is intended for marketers who want to learn how to apply SEO to enhance their business growth through organic search traffic. It’ll be a full day of training that is accessible and actionable. The goal is to empower participants to apply what they’ve learned to their own sites straight away, to help grow their traffic from Google search. All the relevant areas of SEO will be covered, from basic on-page SEO to linkbuilding and technical optimisation. These are the topics we’ll cover in the training:

- SEO in the wider Digital Marketing mix: where does SEO fit in compared to other channels such as paid search and social media.
- On-Page Optimisation: how to optimise your webpages for maximum visibility.
- Linkbuilding and Content Marketing: becoming a trusted source of information that Google can confidently rank high in its search results.
- Technical SEO Basics: ensuring your website can be properly crawled and indexed by Google.
- Structured Data Markup: how to enhance your content with markup to get rich search snippets in Google.
- Load Speed & Mobile SEO: optimising your website experience for mobile users.
- Crawl Optimisation: ensuring large scale websites can be efficiently crawled.
- International SEO: how to make sure Google ranks your international content correctly across the globe.

The early bird price for this SEO training is £349, which also includes a ticket for the conference on the following day. Places for this special one-off training day are limited, so book your spot now on the Growth Marketing Live website.

P.S.
there are still a few seats left for my upcoming Technical SEO Course in Dublin!

  • Preventing Saturation and Preserving Sanity

Over the past few years I’ve spoken at a lot of conferences. I’m not quite as prolific as, for example, the amazing Aleyda Solis, but there have been significant periods where I spoke at an event at least once every month.

I enjoy speaking at conferences. A large part of my enjoyment comes from sharing my knowledge and meeting people in the industry. I get to hang out with old friends and make new ones, and the privilege of going up on stage to have hundreds of people listen to me is one I never take for granted. Thanks to conferences I’ve been able to travel to amazing places and meet up with awesome people. The past few years I’ve travelled to cities like New York, Las Vegas, Paris, Istanbul, Milan, Bonn, Amsterdam, and numerous places in the UK and Ireland – all thanks to events I was invited to speak at.

But I also dislike going to conferences. The travel is never fun (I’m a grumpy traveller at the best of times), I rarely get a good sleep in hotel beds, and my nutrition takes the usual hit. I also feel a lot of pressure to deliver a good talk, one that entertains and informs and is hopefully worthwhile and unique.

And then there’s the socialising bit. At heart, I’m an introvert pretending to be an extrovert. I’m not great at socialising but I make an effort, because I do enjoy hanging out with people I like – and fortunately the SEO industry has plenty of fun people to hang out with. I’ve made several great friends in the industry over the years, thanks to conferences and the surrounding social activities. But there’s only so much I can handle. My reservoir of social interaction is limited, and conferences drain that reservoir very quickly.

I’ve been very lucky that my wife and business partner Alison joins me at many events, and helps make socialising so much easier for me. Contrary to me, she actually likes people in general and enjoys chatting to new folks.
She’s been an incredible support for me over the years as our business has grown and my conference speaking gigs became more numerous and more international.

All in all, despite the fun bits and all the support I’ve received, it’s been taking a toll on me. The travel, the lack of sleep, the pressures of delivering, the socialising, and of course the time away from actual paid work – speaking at conferences comes at a price, and it’s one I’m increasingly reluctant to pay.

I’ve already agreed to a number of events for the remainder of 2019, and I’m genuinely looking forward to each and every one of these:

- Optimisey Cambridge SEO Meetup
- SMX Munich
- BrightonSEO
- eComm Live
- The Tomorrow Lab Presents
- Digital Elite Day
- Digital DNA
- SearchLeeds
- Nottingham Digital Summit
- State of Digital
- Chiang Mai SEO

Some are events I’ve never spoken at but have wanted to, and others are recurring events that I always enjoy being a small part of. So I’m committing to these events and will work damn hard to deliver great talks at every single one. After that, I’m pulling on the brakes.

For a long time I felt that speaking at conferences was a way to prove myself, to show that I knew my stuff and wasn’t half-bad at this SEO malarkey. The bigger the stage, the more I felt affirmed in my knowledge and experience. That aspect of it has lost its lustre for me. I don’t feel I’ve anything left to prove. I’ve become increasingly confident in my own abilities as an SEO, and feel I’ve gotten a good handle on my imposter syndrome.

Also, I sometimes feel that by speaking at a conference I’m taking up a spot that could’ve gone to someone else, someone who is still building their reputation or who has more worthwhile content to share. And, let’s be honest, there are enough white guys speaking at conferences. If I take a step back from the conference circuit, maybe that’ll allow someone else to take a step up.

So from now on I’ll keep my speaking calendar a lot emptier.
I’m not retiring from the conference circuit entirely – I enjoy it too much – but I’ll be speaking much less often. I’ll be on stage at a small handful of events every year at most, and mainly outside of the UK (with one or two exceptions). This will hopefully free me up to focus on my paid client work, as well as my SEO training offering. And I’ll keep showing my face at events like BrightonSEO, as for me those feel more like regular SEO family gatherings.

It’s a selfish move, of course, as much to prevent my name from saturating the conference circuit as to preserve my sanity. I feel I’m at risk of losing appeal as a speaker, as there’ve been so many opportunities to see me speak. Maybe by enforcing some scarcity, I’ll stay attractive to conference organisers while also making sure I can deliver top-notch talks at the few events I choose.

But foremost I want to prevent burning out. I’ve felt quite stretched this last while, always running from one place to the next while trying to meet deadline after deadline. It’s time I slowed down the Barry-train and focused primarily on my client work. Conferences are great fun, but they also consume a lot of time and energy. Those are resources I need to treat with more respect.

I hope to see many of you at the 2019 events still to come, and I’ll do my best to stay in contact with my industry friends. Conferences are a great way to keep in touch, but definitely not the only way. Some of our best industry friends have visited us in Northern Ireland, and I want to make time to do the same and visit our friends where they live. Those are the trips that don’t cost energy, but recharge the batteries. I need to do more of those.

So, in short: I’m not going away, but I’ll become less ubiquitous. It’s win-win for everyone. :)
