
Perfecting XML Sitemaps

[This article was originally published in August 2019 on Search News Central]


An XML sitemap is a file, in XML format, that lists URLs on your site that you want search engines like Google to crawl and index.
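

To make this concrete, here is a minimal valid XML sitemap containing a single URL (example.com is a placeholder):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <!-- placeholder URL -->
      <loc>https://www.example.com/</loc>
      <lastmod>2019-08-01</lastmod>
    </url>
  </urlset>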


XML sitemaps have been a staple of good SEO practice for many years now. We know that sites should have valid XML sitemaps to help search engines crawl and index the right pages.


Yet despite the ubiquity of XML sitemaps, their exact purpose isn’t always fully understood. And there’s still a lot of confusion about the ‘perfect’ setup for an XML sitemap for optimal crawling and indexing.


In this post I’ll share the best practices I’ve learned over the years for fully optimised XML sitemaps, focusing on standard sitemaps for webpages.


The Basics

I’m not going to explain the basics of XML sitemaps in much depth, as they’ve been covered many times over on other blogs. I’ll just recap the essentials here:

  1. XML sitemaps should adhere to the official protocol; otherwise Google will not treat them as valid files and will ignore them.

  2. XML sitemaps should only contain the canonical URLs on your website that you want search engines to crawl and index.

  3. You can submit your XML sitemap to Google and Bing directly through Google Search Console and Bing Webmaster Tools, as well as reference it in your site’s robots.txt file (see the example after this list).

  4. Search Console and Webmaster Tools will report on the URLs included in your XML sitemaps, whether they are indexed and if there are any errors or warnings associated with them.

  5. There are separate XML sitemap types for webpages, images, videos, and news articles. In this article we’ll focus only on XML sitemaps for standard webpages.
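

As mentioned in point 3, referencing your XML sitemap in robots.txt takes a single line, which can appear anywhere in the file – note that the sitemap URL must be absolute (the URL here is a placeholder):

  Sitemap: https://www.example.com/sitemap_index.xml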

Sitemap Attributes

XML sitemaps support multiple attributes for a listed URL. The three main attributes for every listed URL are the last modified date (<lastmod>), the priority from 0.0 to 1.0 (<priority>), and how often the content on the URL is expected to change (<changefreq>). Many XML sitemaps will have all three of these attributes defined for every URL listed in the sitemap.
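

To illustrate, a <url> entry with all three attributes defined looks like this (the values are placeholders):

  <url>
    <!-- placeholder URL and values -->
    <loc>https://www.example.com/blog/xml-sitemaps/</loc>
    <lastmod>2019-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>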


However, most search engines – Google included – only pay attention to one of those attributes: the <lastmod> date. When a URL has a <lastmod> date that is more recent than the last time the URL was crawled by the search engine, it’s a strong indicator that a URL should be re-crawled to see what has changed.


As such, I always recommend making sure the <lastmod> attribute is accurate and updated automatically when a page on the site is changed in a meaningful way. Most XML sitemap generators, like the Yoast SEO plugin for WordPress, will ensure the <lastmod> attribute is automatically updated in the XML sitemap when a page is changed in the site’s backend.


The other two attributes, <priority> and <changefreq>, are seen as too ‘noisy’ to be used as proper signals. They are often set incorrectly or manipulated to try and trick search engines into crawling pages more frequently than necessary, so most crawlers tend to ignore them.


I tend to recommend leaving out these attributes entirely. It makes the XML sitemap’s file size smaller and reduces clutter, which makes sitemaps easier to troubleshoot.
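

Stripped of <priority> and <changefreq>, each entry becomes as lean as this:

  <url>
    <!-- placeholder URL -->
    <loc>https://www.example.com/blog/xml-sitemaps/</loc>
    <lastmod>2019-08-01</lastmod>
  </url>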


Sitemap Size

In Google’s support documentation on XML sitemaps, they say a sitemap file can’t contain more than 50,000 URLs and must be no larger than 50 MB uncompressed. If your site has more than 50,000 URLs, you can break them up into separate sitemaps and submit a so-called sitemap index – an XML sitemap that lists only other XML sitemaps.
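

A sitemap index uses the same protocol, with <sitemapindex> and <sitemap> elements in place of <urlset> and <url> (the file names here are placeholders):

  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <!-- placeholder file names -->
    <sitemap>
      <loc>https://www.example.com/sitemap-1.xml</loc>
      <lastmod>2019-08-01</lastmod>
    </sitemap>
    <sitemap>
      <loc>https://www.example.com/sitemap-2.xml</loc>
      <lastmod>2019-08-01</lastmod>
    </sitemap>
  </sitemapindex>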


And you can submit up to 500 sitemap index files, each listing a maximum of 50,000 individual sitemaps. This means the total number of URLs you can submit to Google via XML sitemaps is:


500 x 50,000 x 50,000 = 1,250,000,000,000 (one trillion two hundred fifty billion)


That’s more than enough for even the most excessive websites. However, in my experience it’s not ideal to fill XML sitemaps up to their maximum capacity.


For larger websites with hundreds of thousands or millions of pages, ensuring Google crawls and indexes all URLs submitted in XML sitemaps is quite challenging. Filling every XML sitemap to its 50,000-URL maximum often leads to incomplete crawling and indexing, with only a small fraction of the submitted URLs making it into Google’s index.


[Image: Index Coverage for XML sitemaps with 50,000 URLs]

I have found that limiting sitemaps to only 10,000 URLs leads to more thorough levels of indexing. I’m not sure why – I suspect that smaller lists of URLs are easier for Google to process and crawl – but I’ve seen it confirmed time and again that smaller sitemaps lead to higher degrees of indexing.



[Image: Index Coverage for XML sitemaps with 10,000 URLs]

As a result, I always urge large websites to use smaller XML sitemaps – but not too small! Some huge websites limit XML sitemaps to 1000 URLs, which means you end up with thousands of individual sitemap files.


This too brings complications, as Google Search Console will only list 1000 sitemap files in its Sitemaps report. If you have more than 1000 individual XML sitemap files, you will not be able to get a complete picture of their indexing performance from Google Search Console.


A happy medium is to limit XML sitemap files to 10,000 URLs each. I’ve found this to be a good compromise on size: it ensures a higher degree of crawling and indexing than a 50,000-URL sitemap, but doesn’t create reporting limitations in Google Search Console. This 10k limit was first explored by Nick Eubanks, and I’ve seen similarly good results from it.


Sitemaps by Content Type

When analysing indexing problems on websites, XML sitemaps can be very useful. However, if all URLs on a website are simply heaped together in XML sitemaps regardless of the purpose of each URL, then troubleshooting SEO issues becomes more challenging.


A great way to make XML sitemaps more useful and helpful is to separate them out by content type, so that there are different XML sitemap files for different types of pages.


For example, on an ecommerce site you should have different XML sitemap files for your static content pages (about us, terms & conditions, etc), your category and subcategory pages (hub pages), and your product pages.


Alternatively, you can create separate sitemap files for each category of products, so that you can quickly see which product categories are well-indexed and which ones aren’t.


Combining the two approaches also works, where you have separate XML sitemaps for each category’s hub pages and product pages.
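

As a sketch, a sitemap index for an ecommerce site organised this way might look like the following, with each child sitemap also kept under the 10,000-URL limit discussed earlier (all file names are hypothetical):

  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <!-- hypothetical file names, split by content type -->
    <sitemap><loc>https://www.example.com/sitemap-static.xml</loc></sitemap>
    <sitemap><loc>https://www.example.com/sitemap-categories.xml</loc></sitemap>
    <sitemap><loc>https://www.example.com/sitemap-products-shoes.xml</loc></sitemap>
    <sitemap><loc>https://www.example.com/sitemap-products-bags.xml</loc></sitemap>
  </sitemapindex>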


[Image: Separate XML sitemap files for each type of page]

For news publishers, I recommend listing section pages and articles in separate XML sitemap files. This is because we want to make sure Google has indexed every section page on the site (as these are important for new article discovery), whereas achieving 100% indexing of all individual articles on a news site is extremely difficult.


Keeping articles in separate XML sitemaps from section pages means you can troubleshoot potential issues more effectively and get better data on the index performance of both types of pages.


Additionally, news publishers should have a news-specific XML sitemap that lists only the articles published in the last 48 hours. This helps Google discover your newly published and recently updated articles.
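

News-specific sitemaps use Google’s news extension of the sitemap protocol. A minimal sketch, with placeholder publication details:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
          xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
    <url>
      <!-- placeholder article URL and publication details -->
      <loc>https://www.example.com/news/example-article/</loc>
      <news:news>
        <news:publication>
          <news:name>Example News</news:name>
          <news:language>en</news:language>
        </news:publication>
        <news:publication_date>2019-08-01T08:00:00+00:00</news:publication_date>
        <news:title>Example article headline</news:title>
      </news:news>
    </url>
  </urlset>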


Discovery vs PageRank Flow

One common misconception about XML sitemaps is that they can replace a regular crawl of the website. Some people think that by having a good XML sitemap, the website itself doesn’t need to be fully crawlable. After all, they reason, the URLs we want Google to crawl and index are listed in the XML sitemap, so the website doesn’t need to have crawlable links to these URLs.


This is entirely wrong.


The primary mechanism through which search engines discover content is still crawling. Your website needs to have a good internal link structure that enables crawlers (and your website’s visitors) to find all your important pages with as few clicks as possible.


And, more importantly, links enable the flow of PageRank (link value) through your site. Without PageRank, your website’s pages aren’t going to rank in search results.


XML sitemaps in no way replace internal links. XML sitemaps don’t distribute any link value, and they don’t guarantee indexing and ranking of your website’s pages. Sitemaps are a supplementary signal for Google and support a website’s internal linking and canonicalisation – they are not intended to replace a proper crawlable website.


You should always make sure your website is fully crawlable, and that all URLs listed in your XML sitemap can also be discovered by simply clicking on links on your site. If a URL is listed in a sitemap but doesn’t have any links pointing to it, Google is very unlikely to crawl the URL and even less likely to rank it in its search results.


In a Nutshell

Well-crafted XML sitemaps can help your website’s crawling and indexing by search engines, but for me the main purpose of sitemaps is to help troubleshoot SEO issues on your site.


The data reported in Google Search Console on XML sitemaps is the real reason you want to have good sitemap files.


Keep your sitemaps relatively small and focused with no unused attributes and no more than 10,000 URLs. Separate them out for different content types, and always make sure that URLs listed in your sitemaps are also fully discoverable through a web crawl.


Good luck, and if you have any comments or questions about XML sitemaps, use the comments below and I’ll try to respond as best I can.
