Protect your Staging Environments

A lot of web design agencies use online staging environments, where a development version of a website resides for clients to view and comment on. Some agencies use their own domain to host staging environments, usually on a subdomain like staging.agency.com.

There is a risk involved with online staging environments: Google could crawl & index these subdomains and show them in its search results. This is a Bad Thing, as often these staging websites contain unfinished designs and incomplete content. Public access to these staging websites could even damage a business if it leads to premature exposure of a new campaign or business decision, and could get you into legal trouble.

Today, whilst keeping tabs on some competitors of mine, I came across this exact scenario:

Unprotected staging environments uncovered with a simple site: search

The name has been redacted to protect the guilty – I’ve sent them an email to notify them of this problem, because I want to make sure their clients are protected. A business shouldn’t suffer because of an error made by their web agency.

How To Protect Your Staging Sites

Protecting these staging environments is pretty simple, so there really isn’t an excuse to get it wrong.

Robots.txt blocking

For starters, all your staging environments should block search engine access in their robots.txt file:

User-agent: *
Disallow: /

 
This ensures that the staging website will not be crawled by search engines. However, it doesn’t mean the site won’t appear in Google’s search results; if someone links to the staging site, and that link is crawled, the site could still appear in search results. So you need to add extra layers of protection.

You could use the ‘noindex’ directive in your robots.txt as well:

User-agent: *
Disallow: /
Noindex: /

 
This directive basically means Google will not only be unable to access the site, it’ll also not be allowed to include any page in its index – even if someone else links to it.

Unfortunately the ‘noindex’ directive isn’t 100% foolproof; tests have shown that Google doesn’t always comply with it. Still, it won’t hurt to include it in your robots.txt file.

Htaccess login

The next step I recommend is to put a password on the staging site. This is easily done on Apache servers with an .htaccess login.

Edit the staging site’s .htaccess file (or, if there isn’t one, create it first) and put the following text in the file:

AuthType Basic
AuthName "Protected Area"
AuthUserFile /path/to/.htpasswd
Require valid-user

 
Then create a .htpasswd file in the path you’ve specified in the .htaccess file. The .htpasswd file contains the username(s) and password(s) that allow you to access the secured staging site, in the [username]:[password] format. For example:

john:4ccEss123

 
However, you’ll probably want to store a hashed version of the password rather than plain text, so that it can’t simply be read from the file. A tool like the htpasswd generator will allow you to create hashed entries to include in your .htpasswd file:

john:$apr1$jRiw/29M$a4r3bNJbrMpPhtVQWeVu30
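
If you have shell access to the web server, you can also create these entries with Apache’s own htpasswd utility instead of an online generator; a quick sketch, using the example path from above (the usernames are just placeholders):

# create the file and add the user 'john' (you'll be prompted for a password, which is stored as a hash)
htpasswd -c /path/to/.htpasswd john
# add further users without the -c flag, so the existing file isn't overwritten
htpasswd /path/to/.htpasswd jane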

 
When someone wants to access the staging site, a username and password popup will appear:

htaccess authentication popup

This will make your staging environment much more secure and will prevent unauthorised access.

IP address restriction

Lastly, as an additional layer of protection, you can restrict access to your staging sites to specific IP addresses. By limiting access to the staging sites to users coming from specific networks, such as your own internal office network and the client’s network, you can really nail the security down and make it impervious to access for all but the most determined crackers.

First of all you’ll want to know your office’s public IP address, as well as your client’s. This is pretty simple: you can just Google ‘what is my ip address’ and it’ll be shown straight in the search results:

What is my IP address - Google Search

Have your client do the same and get their office’s IP address from them. Check that you’re both using fixed IP addresses, though – if you’re on a dynamic IP address, yours could change and you’d lose access. Check with your internet service provider to make sure.

Once you’ve got the IP addresses that are allowed access, you need to edit the staging website’s .htaccess file again. Simply add the following text to the .htaccess file:

Order deny,allow
Deny from all
Allow from 203.0.113.10

This tells your webserver to deny access to everyone by default, and then let the specified IP addresses through (you can add as many Allow lines as you want, one per line); replace the example address with your own office and client IP addresses. The order matters here: the Deny rule is evaluated first, and the Allow rules then make exceptions for your whitelisted addresses.
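
One caveat: the Order, Allow and Deny directives are Apache 2.2 syntax. If your staging server runs Apache 2.4, the equivalent rule (again with a placeholder IP address) looks like this:

Require ip 203.0.113.10

On Apache 2.4 you can also wrap this together with the basic auth check in a <RequireAny> block, so that visitors from a whitelisted IP skip the password prompt while everyone else still gets the login popup.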

With those three security measures in place, your staging environments won’t be so easily found any more – and certainly not with a simple ‘site:’ command.

How To Remove Staging Sites from Google’s Index

Once you’ve secured your staging environments, you’ll also want to remove any staging websites from Google’s search results in case they already show up. There are several ways of doing this:

Use Google’s URL removal tool

In Google Search Console (formerly known as Webmaster Tools) you can manually enter specific URLs that you want removed from Google’s search index:

Google Search Console URL removal

Simply create a new removal request, enter the URL you want deleted from Google’s index, and submit it. Usually these requests are processed after a few days, though I’ve seen them handled within a few hours of submitting them.

The downside of the URL removal tool is that you need to do it manually for every URL you want deleted. If entire staging sites are in Google’s index, this can be a very cumbersome process.

Noindex meta robots tag

Another way to get pages out of Google’s index is to include a so-called meta robots tag with the ‘noindex’ value in the HTML code of every page on your staging site. This meta tag is specifically intended for crawlers and can provide instructions on how search engines should handle the page.

With the following meta robots tag you instruct all search engines to remove the page from their indices and not show it in search results, even if other sites link to it:

<meta name="robots" content="noindex">

 
When Google next crawls the staging site, it’ll see the ‘noindex’ tag and remove the page from its index. Note that this will only work if you have not blocked access in your robots.txt file – Google can’t see and act on the noindex tag if it can’t re-crawl the site.

X-Robots-Tag HTTP Header

Instead of adding the meta robots tag to your website – and running the risk of forgetting to remove it when you push the site live – you can also use the X-Robots-Tag HTTP header to send a signal to Google that you don’t want the site indexed.

The X-Robots-Tag header is a specific HTTP header that your website can send to bots like Googlebot, providing instructions on how the bot is allowed to interact with the site.

Again you can use the Apache .htaccess file to configure the X-Robots-Tag. With the following rule (which requires the mod_headers module) you can prevent Google from indexing your staging site:

Header set X-Robots-Tag "noindex, nofollow"

 
With this rule, your Apache webserver will send the X-Robots-Tag: noindex, nofollow header with every response, so any bot that visits the site will see it. By having this .htaccess rule active on your staging site, but not on your live site, you can prevent your staging websites from being indexed.
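
A quick way to check that the header is actually being sent is to request the response headers with curl; a sketch, assuming a placeholder staging URL and the example basic auth user from earlier:

# -I fetches only the response headers; -u supplies the basic auth credentials if you've set those up
curl -I -u john:yourpassword https://staging.example.com/ | grep -i x-robots-tag

If the rule is active, the X-Robots-Tag: noindex, nofollow line should show up in the output.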

Note that, like the meta noindex tag, the X-Robots-Tag header only works if bots are not blocked from accessing the site in the first place through robots.txt.

410 Gone status code

Finally, another approach is to serve a 410 HTTP status code. This code tells search engines like Google that the document is not there anymore, and that there is no alternative version so it should be removed from Google’s index. The way to do this is to create a directive in your .htaccess file that detects the Googlebot user-agent, and serves a 410 status code.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} googlebot [NC]
RewriteRule .* - [R=410,L]

 
This detects the Googlebot user-agent and serves it a 410 HTTP status code. Note that this, too, will only work if there’s no robots.txt blocking in place, as Google won’t quickly remove pages from its index if it can’t crawl them and see the new 410 status code. So you might want to move the staging site to a different subdomain and secure it, then serve a 410 on the old subdomain.
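
You can test this rule in the same way, spoofing the Googlebot user-agent with curl (again, the staging URL is just a placeholder):

# -A sets the User-Agent header; the first line of the response should read 410 Gone
curl -I -A "Googlebot" https://staging.example.com/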

The 410 solution is a bit overkill, as Google will remove a page from its index after a few weeks if it keeps getting a 401 Unauthorized response and/or if it’s blocked in robots.txt, but it’s probably worth doing just to get the site out of Google’s index as soon as possible.

Security Through Obscurity = Fail

In summary, don’t rely on people not knowing about your staging servers to keep them safe. Be pro-active in securing your clients’ information, and block access to your staging sites for everyone except those who need it.

These days, you simply can’t be too careful. Determined crackers will always find a way in, but with these security measures in place you’ll definitely discourage the amateurs and script kiddies, and prevent possible PR gaffes that might emerge from a simple ‘site:’ search in Google.


Comments

  1. Great post.

    If people are coming here with nginx hosting their website instead of apache the same techniques are easily transferable.

    For basic auth restrictions you can add this to your site settings;

    auth_basic "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;

    The .htpasswd file should be the same as you outlined in the post.

    Blocking all bar one IP is also simple with nginx;

    location / {
    allow 123.456.789.012;
    deny all;
    }

    On that point, one of the purposes of a staging server is to get feedback from stakeholders before something goes live. If all those stakeholders are in the same office with a fixed IP then this is a valid method, I would imagine though that this could quickly become unwieldy and provide a lot of extra work for the dev team keeping everything up to date. Something to bear in mind.


    1. Thanks Toby! I only have experience with securing Apache webservers so I’ve no idea how to do it on other platforms. Thanks for sharing!

      And yes it can be a pain in the arse to maintain the various IP addresses, but I do feel it’s a necessary element. If it’s too much hassle, the username & password solution is still quite robust on its own.


      1. Just thinking out loud here. Instead of blocking access what about doing a 302 (potentially even 301) redirect to IP addresses that didn’t make the cut. Just so they are getting something (in the case of some tube linking to the staging site publicly)


        1. Hmm, not a bad idea Toby, it’d definitely be more user-friendly. I’d recommend 301 redirects though, as with a 302 a search engine will keep the original URL in its index.


  2. You really need this nowadays. I used to be lazy enough to work the ‘Security Through Obscurity’ way, but since many testers visit our staging servers using Chrome, Google seems to hop by pretty quickly as well.


  3. Not enough people think about this and it can come back to haunt you if you don’t get it right. Giving a client a password to a dev site too early can also cause problems. Sausages and Websites, the 2 things end users should never see getting made!


    1. Aye putting a noindex meta tag on the site will also help, as it’ll remove the pages from Google’s index after a while, but it won’t keep the content safe from human visitors. I reckon if you prevent access in the first place – to both unauthorised users and search engines – you won’t necessarily need the noindex tag.

      Still a very useful tag so I’ll amend the post to include it. Thanks!


  4. Hi Barry,

    Good post. I recently worked with a client that had a staging site indexed for a website that should have been password protected as it contained private personal information (name, address, previous enquiries) about their customers!

    I’m finding that a lot of staging sites will still show in Google if blocked in robots.txt even if nobody links to them so I prefer using noindex if password protecting isn’t an option.

    Gordon


  5. Just as a word of warning when using htpasswd, some hosts turn off the plain text password on those files and you may also need the full absolute path in the AuthUserFile, if the relative path isn’t working. To find your full absolute path make a php file with this:

    <?php echo __FILE__; ?>

    Lastly, check your error log files to see if any problems arise that you can’t solve.


    1. Thanks for the tip James – lots that can go wrong when messing about with htaccess and htpasswd files, so I’d always advise to tread carefully. And when in doubt, always check with your hosting provider.

