A lot of web design agencies use online staging environments, where a development version of a website resides for clients to view and comment on. Some agencies use their own domain to host staging environments, usually on a subdomain like staging.agency.com.
There is a risk involved with online staging environments: Google could crawl & index these subdomains and show them in its search results. This is a Bad Thing, as often these staging websites contain unfinished designs and incomplete content. Public access to these staging websites could even damage a business if it leads to premature exposure of a new campaign or business decision, and could get you into legal trouble.
Today, whilst keeping tabs on some competitors of mine, I came across this exact scenario:
The name has been redacted to protect the guilty – I’ve sent them an email to notify them of this problem, because I want to make sure their clients are protected. A business shouldn’t suffer because of an error made by their web agency.
How To Protect Your Staging Sites
Protecting these staging environments is pretty simple, so there really isn’t an excuse to get it wrong.
Robots.txt blocking
For starters, all your staging environments should block search engine access in their robots.txt file:
User-agent: *
Disallow: /
This ensures that the staging website will not be crawled by search engines. However, it doesn’t mean the site won’t appear in Google’s search results; if someone links to the staging site, and that link is crawled, the site could still appear in search results. So you need to add extra layers of protection.
You could use the ‘noindex’ directive in your robots.txt as well:
User-agent: *
Disallow: /
Noindex: /
This directive basically means Google will not only be unable to access the site, it’ll also not be allowed to include any page in its index – even if someone else links to it.
Unfortunately the ‘noindex’ directive isn’t 100% foolproof; tests have shown that Google doesn’t always comply with it. Still, it won’t hurt to include it in your robots.txt file.
Htaccess login
The next step I recommend is to put a password on it. This is easily done on Apache servers with an .htaccess login.
Edit the staging site’s .htaccess file (or, if there isn’t one, create it first) and put the following text in the file:
AuthType Basic
AuthName "Protected Area"
AuthUserFile /path/to/.htpasswd
Require valid-user
Then create a .htpasswd file in the path you’ve specified in the .htaccess file. The .htpasswd file contains the username(s) and password(s) that allow you to access the secured staging site, in the [username]:[password] format. For example:
john:4ccEss123
However, you’ll probably want to store a hashed version of the password for extra security, so that it can’t be read in plain text. A tool like the htpasswd generator will allow you to create hashed passwords to include in your .htpasswd file:
john:$apr1$jRiw/29M$a4r3bNJbrMpPhtVQWeVu30
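If you have shell access to the staging server, you can also generate these entries with Apache’s own htpasswd command-line utility instead of an online generator (the file path and usernames below are just examples):
htpasswd -c /path/to/.htpasswd john
htpasswd /path/to/.htpasswd jane
The -c flag creates the .htpasswd file and adds the first user; leave it off for subsequent users so the existing file isn’t overwritten. You’ll be prompted for each password, and only the hashed version is stored.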
When someone wants to access the staging site, a username and password popup will appear:
This will make your staging environment much more secure and will prevent unauthorised access.
IP address restriction
Lastly, as an additional layer of protection, you can restrict access to your staging sites to specific IP addresses. By limiting access to the staging sites to users coming from specific networks, such as your own internal office network and the client’s network, you can really nail the security down and make it impervious to access for all but the most determined crackers.
First of all, you’ll want to know your office’s public IP address, as well as your client’s. This is pretty simple – you can just Google ‘what is my IP address’ and it’ll be shown straight in the search results:
Have your client do the same and get their office’s IP address from them. Check that you’re both using fixed IP addresses, though – if you’re on a dynamic IP address, yours could change and you’d lose access. Check with your internet service provider to make sure.
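If you prefer the command line, an external service such as ifconfig.me will simply echo back the public IP address your request comes from (any similar service will do; this one is just an example):
curl ifconfig.me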
Once you’ve got the IP addresses that are allowed access, you need to edit the staging website’s .htaccess file again. Simply add the following text to the .htaccess file:
order deny,allow
deny from all
allow from 203.0.113.12
This directive means that your webserver will allow access to the site for the specified IP addresses (and you can have as many as you want there, one per line) and deny access to everyone else.
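Note that this is the older Apache 2.2 syntax. If your staging server runs Apache 2.4 or later, access control is handled by the Require directive from mod_authz_core instead; a rough equivalent, using documentation-range example IP addresses, looks like this:
<RequireAny>
    Require ip 203.0.113.12
    Require ip 198.51.100.7
</RequireAny>
Each Require ip line whitelists one address (or a whole range, such as 203.0.113.0/24), and everyone else is denied.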
With those three security measures in place, your staging environments won’t be so easily found any more – and certainly not with a simple ‘site:’ command.
How To Remove Staging Sites from Google’s Index
Once you’ve secured your staging environments, you’ll also want to remove any staging websites from Google’s search results in case they already show up. There are several ways of doing this:
Use Google’s URL removal tool
In Google Search Console (formerly known as Webmaster Tools) you can manually enter specific URLs that you want removed from Google’s search index:
Simply create a new removal request, enter the URL you want deleted from Google’s index, and submit it. Usually these requests are processed after a few days, though I’ve seen them handled within a few hours of submitting them.
The downside of the URL removal tool is that you need to do it manually for every URL you want deleted. If entire staging sites are in Google’s index, this can be a very cumbersome process.
Noindex meta robots tag
Another way to get pages out of Google’s index is to include a so-called meta robots tag with the ‘noindex’ value in the HTML code of every page on your staging site. This meta tag is specifically intended for crawlers and can provide instructions on how search engines should handle the page.
With the following meta robots tag you instruct all search engines to remove the page from their indices and not show it in search results, even if other sites link to it:
<meta name="robots" content="noindex">
When Google next crawls the staging site, it’ll see the ‘noindex’ tag and remove the page from its index. Note that this will only work if you have not blocked access in your robots.txt file – Google can’t see and act on the noindex tag if it can’t re-crawl the site.
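A quick way to check that the tag is actually present on a staging page is to fetch it from the command line and search the HTML for it (the hostname and credentials below are only examples, taken from earlier in this post):
curl -s -u john:4ccEss123 https://staging.agency.com/ | grep -i 'name="robots"'
If the meta robots tag is in place, grep will print the line containing it.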
X-Robots-Tag HTTP Header
Instead of adding the meta robots tag to your website – and running the risk of forgetting to remove it when you push the site live – you can also use the X-Robots-Tag HTTP header to send a signal to Google that you don’t want the site indexed.
The X-Robots-Tag header is a specific HTTP header that your website can send to bots like Googlebot, providing instructions on how the bot is allowed to interact with the site.
Again you can use the Apache .htaccess file to configure the X-Robots-Tag; this requires the mod_headers module to be enabled. With the following rule you can prevent Google from indexing your staging site and from following its links:
Header set X-Robots-Tag "noindex, nofollow"
With this rule, your Apache webserver will send the ‘noindex, nofollow’ X-Robots-Tag header with every response, and bots like Googlebot will act on it. By having this .htaccess rule active on your staging site, but not on your live site, you can prevent your staging websites from being indexed.
Note that, like the meta noindex tag, the X-Robots-Tag header only works if bots are not blocked from accessing the site in the first place through robots.txt.
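You can confirm that the header is actually being sent by requesting a staging page with curl and inspecting the response headers (again, the hostname and credentials are just examples):
curl -I -u john:4ccEss123 https://staging.agency.com/
The response headers should include a line reading X-Robots-Tag: noindex, nofollow.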
410 Gone status code
Finally, another approach is to serve a 410 HTTP status code. This code tells search engines like Google that the document is not there anymore, and that there is no alternative version so it should be removed from Google’s index. The way to do this is to create a directive in your .htaccess file that detects the Googlebot user-agent, and serves a 410 status code.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} googlebot [NC]
RewriteRule .* - [R=410,L]
This detects the Googlebot user-agent and serves it a 410 HTTP status code. Note that this, too, will only work if there’s no robots.txt blocking in place: Google has to be able to re-crawl the pages and see the new 410 status code before it will remove them from its index. So you might want to move the staging site to a different subdomain and secure it, then serve a 410 on the old subdomain.
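As a side note, mod_rewrite also has a dedicated G (Gone) flag that is shorthand for the same 410 response, and you can widen the user-agent condition to cover more crawlers than just Googlebot; the list of bots below is purely illustrative:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|slurp) [NC]
RewriteRule .* - [G]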
The 410 solution is a bit overkill, as Google will remove a page from its index after a few weeks if it keeps getting a 401 Unauthorized response (from the .htaccess login) and/or if it’s blocked in robots.txt, but it’s probably worth doing just to get the site out of Google’s index as soon as possible.
Security Through Obscurity = Fail
In summary, don’t rely on people not knowing about your staging servers to keep them safe. Be pro-active in securing your clients’ information, and block access to your staging sites for everyone except those who need it.
These days, you simply can’t be too careful. Determined crackers will always find a way in, but with these security measures in place you’ll definitely discourage the amateurs and script kiddies, and will prevent possible PR gaffes that might emerge from a simple ‘site:’ search in Google.