Bit of a strange one today: we usually talk about how to get Google to notice more of our content, not about blocking them from getting at it.
However, there are times when we want Google to stay away from our pages and not crawl or index the content on them. In this post I want to explore a few ways to do this, and the problems they can present.
The problem with noindexing pages is that it doesn’t prevent Google from crawling them. This means Google are spending time crawling parts of your site that you never want in the index! Not only is this a waste of Google’s time, it also uses up your crawl equity.
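To make the noindex approach concrete, here is a minimal Python sketch that checks whether a page carries a robots meta tag with a noindex directive. The sample HTML and the names `RobotsMetaParser` and `has_noindex` are made up for illustration. Note that a crawler has to fetch the page to see this tag at all, which is exactly why noindex does not save any crawling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() in ("robots", "googlebot"):
            # The content attribute is a comma-separated list, e.g. "noindex, follow"
            for directive in attr_map.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page used only for illustration
sample_page = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
print(has_noindex(sample_page))
```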
For those who don’t know much about it, crawl equity is a term that refers to the depth at which Google will crawl your website. Every site has a crawl limit, probably based on incoming links and the authority they pass. If Google is spending bandwidth crawling pages that you don’t want indexing, they could potentially be missing out on pages that are more user focussed or content rich.
On top of all this, Google may ignore (or, in their words, miss) the noindex tag anyway.
This can be a big problem, so how do we deal with thousands of pages we don’t want indexing?
Blocking pages with robots.txt is the preferred way to keep Google away from them: stop Google from crawling them and you save your crawl equity for more useful pages.
However, after a recent post by Google it seems as though they may still index these pages anyway! Google has hinted at this before but Matt Cutts wrote the following:
Be careful about disallowing search engines from crawling your pages. Using the robots.txt protocol on your site can stop Google from crawling your pages, but it may not always prevent them from being indexed. For example, Google may index your page if we discover it by following a link from someone else’s site. To display it in search results, Google will need to display a title of some kind and because we won’t have access to any of your page content, we will rely on off-page content such as anchor text from other sites. (To truly block a URL from being indexed, you can use meta tags.)
OK, so we use robots.txt, but Google follow a link, find the URL, build a title from the anchor text or the URL and index the page anyway! The last thing you want to see in the SERPs is pages from your site with nothing but an ugly title and URL.
So what Google are saying is that they don’t want you to block them from crawling anything; they want to see it all. Noindex it, but let them see everything.
Now you may ask yourself, is crawl equity really an issue? Well, maybe not for small websites, but I have worked with multiple big sites that have struggled to get areas indexed because of all the dynamic URLs Google is having to crawl. Once we began to block these pages in robots.txt, the useful parts of the site began to get indexed.
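As a rough illustration of the robots.txt approach, the sketch below uses Python's standard `urllib.robotparser` to test which URLs a set of rules would stop a crawler from fetching. The rules, the example.com URLs and the helper name `is_crawlable` are assumptions made up for the example, not taken from any real site. Bear in mind that this only models crawling: as the quote above says, a blocked URL can still end up indexed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules blocking two areas full of dynamic URLs
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /basket/
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in (
    "https://www.example.com/search/?q=shoes",
    "https://www.example.com/products/shoes",
):
    print(url, is_crawlable(ROBOTS_TXT, url))
```

Python's parser matches rules by simple path prefix, so this sketch sticks to plain directory rules rather than wildcard patterns.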
So What Can We Do?
Do we block the pages, knowing Google may index them anyway?
There is no definitive answer from Google, but here is my advice, for what it’s worth:
> If you want to block a handful of pages, simply noindex them
> If you have a large area of the site that has no need to be indexed, then:
1. Block it using robots.txt
2. Use URL Parameters in the Site Configuration section of Webmaster Tools
Parameter blocking allows you to block Google from crawling certain areas of your site, and is a strong indication that you don’t want them indexed. If these pages are on multiple levels, try placing them all under one folder and blocking it. However, you will need to be careful not to block any useful pages.
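Before deciding which parameters to block, it helps to know which ones are actually generating your dynamic URLs. The following is a minimal sketch, assuming you have a list of crawled URLs from your server logs or a crawl export. The sample URLs and the function name `count_parameters` are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def count_parameters(urls):
    """Count how many URLs each query parameter name appears in."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # Use a set so a parameter repeated within one URL is counted once
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


# Hypothetical sample of crawled URLs
crawled = [
    "https://www.example.com/shoes?sort=price&page=2",
    "https://www.example.com/shoes?sort=name",
    "https://www.example.com/shoes?sessionid=abc123",
    "https://www.example.com/shoes",
]

for name, total in count_parameters(crawled).most_common():
    print(name, total)
```

Parameters that show up on a large share of URLs without changing the content are the obvious candidates for parameter blocking.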
Now, Google could still find and index some of these pages, but using this method will cover all your bases.