
Stop Google From Getting At Your Content

Bit of a strange one today: we usually talk about how to get Google to notice more of our content, not about blocking them from getting at it.

However, there are times when we want Google to stay away from our pages and not crawl or index the content on them. In this post I want to explore a few ways to do this, and the problems each can present.

No Index

The problem with noindexing pages is that it doesn’t prevent Google from crawling them; this means Google are spending time crawling parts of your site that you never want in the index! Not only is this a waste of Google’s time, it also uses up your crawl equity.
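For anyone who hasn’t used it, the tag itself is a single line in the page’s <head> (a generic example, nothing site-specific):

<!-- Asks search engines not to index this page; they can still crawl it -->
<meta name="robots" content="noindex">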

For those who don’t know much about crawl equity, it is basically a term that refers to the depth at which Google will crawl your website. Every site has a crawl limit, probably based on incoming links and the authority they pass. If Google is spending bandwidth crawling pages that you don’t want indexing, they could potentially be missing out on pages that are more user-focussed or content rich.

On top of all this, Google may ignore, or in their words “miss”, the noindex tag anyway.

This can be a big problem, so how do we deal with thousands of pages we don’t want indexing?

Robots.txt

This is the preferred way to block pages from Google: stop them from crawling and therefore save your crawl equity for more useful pages.
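As a quick illustration (the paths below are made up, swap in your own), a robots.txt block looks like this:

# Example robots.txt – the paths are illustrative only
User-agent: *
Disallow: /internal-search/
Disallow: /print/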

However, after a recent post by Google it seems as though they may still index these pages anyway! Google has hinted at this before, but Matt Cutts wrote the following:

Be careful about disallowing search engines from crawling your pages. Using the robots.txt protocol on your site can stop Google from crawling your pages, but it may not always prevent them from being indexed. For example, Google may index your page if we discover it by following a link from someone else’s site. To display it in search results, Google will need to display a title of some kind and because we won’t have access to any of your page content, we will rely on off-page content such as anchor text from other sites. (To truly block a URL from being indexed, you can use meta tags.)

OK, so we use robots.txt, but Google follow a link, find the URL, create a title based on the anchor text/URL and index the page anyway! The last thing you want to see in the SERPs is pages from your site with nothing but an ugly title and URL.

So what Google are saying is that they don’t want you to block them from crawling anything; they want to see it all. Noindex it, but let them see everything.

Now you may ask yourself, is crawl equity really an issue? Well, maybe not for small websites, but I have worked with multiple big sites that have struggled to get areas indexed because of all the dynamic URLs Google is having to crawl. Once we began to robots.txt these pages, the useful parts of the site began to get indexed.

So What Can We Do?

What do we do then? Block the pages knowing Google may index them anyway?

There is no definitive answer from Google, but here is my advice, for what it’s worth :)

> If you want to block a handful of pages, simply noindex them

> If you have a large area of the site that has no need to be indexed, then:

1. Block it using robots.txt

2. Use URL Parameters in the Site Configuration section of Webmaster Tools

Parameter blocking allows you to block Google from crawling certain areas of your site, and is a strong indication that you don’t want them indexed. If these pages sit at multiple levels, try placing them all under one folder and blocking that; however, you will need to be careful not to block any useful pages.
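For example, if all the unwanted dynamic pages can live under a single folder (the folder name below is only an example), one robots.txt rule covers the lot; Googlebot also understands simple wildcards if the pages are driven by a URL parameter instead:

# Example only – group the unwanted pages under one folder and block it
User-agent: *
Disallow: /filters/
# Or, for parameter-driven URLs, a wildcard pattern (example parameter name)
Disallow: /*?sessionid=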

Now Google could still find and index some of these pages, but using this method will cover all your bases.


Author: Tim (296 Articles)

Tim is the owner and editor of SEO wizz and has been involved in the search engine marketing industry for over 9 years. He has worked with multiple businesses across many verticals, creating and implementing search marketing strategies for companies in the UK, US and across Europe. Tim is also the Director of Search at Branded3, a Digital Marketing & SEO Agency based in the UK.


16 comments

Colin January 16, 2012 at 1:37 am

Hello Tim,

Just wondered what you thought of placing text within an image to ensure that it isn’t indexed? For instance, if you had a section of text which was common across many commercial pages.
In a similar way, could you place the repeating text within an iframe to avoid it being indexed?

All the best

Colin


Tim January 16, 2012 at 1:51 am

Hi Colin,

Never thought about the image option before; I guess the issue might be duplication of an image on multiple pages.

How is Google going to perceive multiple pages with the identical image on them? I’m not sure it would work. An iframe is definitely an option, but some recent research has shown that Google may be crawling and following links within them; having said that, we have had a few Panda-hit clients recover by moving duplicate content into iframes.

Your other option is to somehow merge the pages, or expand the content on them so it is at least 70–80% unique.


seomadhat January 16, 2012 at 1:47 am

And what about using reverse cloaking? Could that be dangerous?
I mean a kind of cloaking where, when the bot arrives, you show it a 301, or something like that.


Tim January 16, 2012 at 1:53 am

I think this would be classed as manipulative, and if found out I’m pretty sure your site would get hit.


seomadhat January 16, 2012 at 2:10 am

OK, sure, if found out!
And what about using a canonical to the home page on all the pages I don’t want indexed?


Tim January 16, 2012 at 2:15 am

Sure, a canonical could work; it wouldn’t stop Google crawling the page, but it should work to block it from the index.
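For reference, the tag would look something like this in the <head> of each page you want pointed at the home page (example.com is just a placeholder):

<!-- Example only – points the duplicate/unwanted page at the home page -->
<link rel="canonical" href="http://www.example.com/" />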

You can use the URL submitter to block individual pages as well.


Johann January 16, 2012 at 3:39 am

Interesting. Blocking or noindexing is easy, but to do both is tricky…
What about a noindex, nofollow? If there are several layers of noindex-nofollow pages, only the first layer would be crawled, am I right?


Tim January 16, 2012 at 3:44 am

Hi Johann,

Technically there is no point in doing both; robots.txt will stop the crawl, so the noindex tag won’t get read anyway. The problem is Google want to index everything they find, so a page that can’t be crawled can still end up indexed.

Noindex, nofollow could be used, but from what I have seen the nofollow simply blocks PageRank, or whatever we want to call it these days; that doesn’t necessarily mean Google don’t crawl and follow nofollow links.
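(For reference, the combined tag is just a single line – a generic example:)

<!-- Example only – asks engines not to index the page or follow its links -->
<meta name="robots" content="noindex, nofollow">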

I think the safest option is to block it in WMT.


Johann January 16, 2012 at 4:04 am

You’re right about using both at the same time – generally a bad idea – and that’s why I find it tricky to try to noindex and save crawl bandwidth…
I guess blocking it with WMT can work, but I’ve seen pages getting reindexed when there are links to them, regardless of WMT suppression.
In some cases, it’s just not possible to have it all, I believe.

Personally I try to never use robots.txt and only the noindex tag, alas for the crawl problem…


Tim January 16, 2012 at 4:42 am

Yeah, I think ultimately Google will index what they want regardless of tags; they don’t want a hidden web :)


seomadhat January 16, 2012 at 4:01 am

OK, I assume that probably the best way to stop Google indexing some pages is to block them in WMT. Also because, just like with noindex and nofollow, Google sometimes takes pages with a canonical into its index anyway…

So, no other way! Only block it in WMT! But I still want to find some other way! I will keep looking and let you know if I find one ;)

Thank you for the article, very good as usual!


Tim January 16, 2012 at 4:40 am

Cheers, and do let me know.


Jali January 23, 2012 at 8:43 am

Hi Tim,

Is it true that Google has slowed down its indexing in protest of SOPA?

Thank you,
Jali


Tim January 23, 2012 at 9:08 am

They did for one day last week.


AndrewGoldy July 27, 2012 at 12:13 am

So Google has indexed thousands of non-existent files on my domain:

stock.php?page=342987234
stock.php?page=342987235
stock.php?page=342987236, etc, etc

(where there are only 20 pages of stock, and there is no physical link to the above pages – I’ve checked)

Can I use the robots.txt file to block this? I was thinking ‘no’, since I still want Google to index stock.php?page all the way up to the end. But at the moment that’s 20 pages. Not a million.


Tim August 23, 2012 at 2:44 pm

Yeah, it would be difficult to block in robots.txt; use the noindex tag on all these pages. Google will still crawl the pages, though.
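A rough sketch of how that could work on a page like stock.php – this assumes the script already knows how many real pages of stock exist (the $totalPages variable below is purely illustrative):

<?php
// Purely illustrative – assumes the script knows the real page count
// and reads ?page= from the URL. Output this inside the <head>.
$totalPages  = 20;
$currentPage = isset($_GET['page']) ? (int) $_GET['page'] : 1;

// Noindex anything outside the real range so the junk URLs drop out
// of the index, while the genuine pages stay indexable.
if ($currentPage < 1 || $currentPage > $totalPages) {
    echo '<meta name="robots" content="noindex">';
}
?>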


