Thursday, February 22, 2007

The Robots Exclusion Protocol



This is the second in a short series of posts about the Robots Exclusion Protocol, the standard for controlling how web pages on your site are indexed. This post provides more details and examples of mechanisms to control access and indexing of your website by Google.

In the first post in this series, I introduced robots.txt and robots META tags, giving an overview of when to use them. In this post, I'll look at some examples of the power of the protocol. These examples illustrate the detailed, fine-grained control online publishers have over how their websites are indexed.

Preventing Googlebot from following a link

Usually when Googlebot finds a page, it reads all the links on that page and then fetches those pages and indexes them. This is the basic process by which Googlebot "crawls" the web. It is useful because it allows Google to include all the pages on your site, as long as they are linked together. Let's say you run the TheHighsteadPost.com website. Here's a map of part of the site:


When Googlebot crawls the index.html file, it finds the links to breakingnews.html and articles.html. From breakingnews.html, it can find valentinesday.html and promnight.html and so on.

What if you didn't want valentinesday.html and promnight.html appearing in Google's index? The articles in the Breaking News section may only appear for a few hours before being updated and moved to the Articles section. In this case you want the full articles indexed, not the breaking news versions. You could put the NOINDEX tag on both of those pages. But if the set of pages in the Breaking News section changes frequently, it would be a lot of work to keep adding the NOINDEX tag to each page and then removing it again when the page moves into the Articles section. Instead, you can add the NOFOLLOW tag to the breakingnews.html page. This tells Googlebot not to follow any links it finds on that page, thus hiding valentinesday.html, promnight.html and any other pages linked from there. Simply add this line to the <HEAD> section of breakingnews.html:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

However, there is an important caveat to NOFOLLOW that you should know about. It only stops Google from following links from one page to another. If one of the linked pages is also linked from somewhere else, Google can still find and index that page via that other link. For example, if promnight.html is also linked from HighsteadCourier.com, Google can still find and index promnight.html when it indexes HighsteadCourier.com and follows the link from there to promnight.html.
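The caveat can be illustrated with a small crawl simulation. This is only a sketch: the link graph below is a simplified, assumed version of the site map described above, the `crawl` function is a hypothetical stand-in for a crawler and not how Googlebot is actually implemented, and the external site is treated as a single page for brevity.

```python
from collections import deque

# Assumed, simplified link graph: page -> pages it links to.
LINKS = {
    "index.html": ["breakingnews.html", "articles.html"],
    "breakingnews.html": ["valentinesday.html", "promnight.html"],
    "articles.html": [],
    # An external site that also links to one of the breaking news pages.
    "HighsteadCourier.com": ["promnight.html"],
}

# Pages carrying <META NAME="ROBOTS" CONTENT="NOFOLLOW">.
NOFOLLOW_PAGES = {"breakingnews.html"}


def crawl(seeds, links, nofollow):
    """Breadth-first discovery that skips the links on NOFOLLOW pages."""
    found = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if page in nofollow:
            # The page itself is known, but its links are not followed.
            continue
        for target in links.get(page, []):
            if target not in found:
                found.add(target)
                queue.append(target)
    return found


# Crawling only your own site: both breaking news articles stay hidden.
own_site_only = crawl(["index.html"], LINKS, NOFOLLOW_PAGES)

# Crawling the other site as well: promnight.html is found via its link.
with_external = crawl(["index.html", "HighsteadCourier.com"], LINKS, NOFOLLOW_PAGES)
```

In this sketch, `own_site_only` contains neither breaking news article, while `with_external` contains promnight.html but still not valentinesday.html, because only promnight.html has a second inbound link.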

Using NOFOLLOW is generally not the best method to ensure content does not appear in our search results. Using the NOINDEX tag on individual pages or controlling access using robots.txt is the best way to achieve this.
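As a rough illustration of the robots.txt approach, the sketch below uses Python's standard `urllib.robotparser` module to check which URLs a crawler identifying itself as Googlebot would be allowed to fetch. The `/breakingnews/` directory, the host name, and the `allowed` helper are assumptions made for this example; the site described above uses flat file names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps Googlebot out of a breaking news directory.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /breakingnews/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="Googlebot"):
    """Return True if `agent` may fetch `path` under the rules above."""
    # The host name is an assumption for illustration only.
    return parser.can_fetch(agent, "http://www.thehighsteadpost.com" + path)


blocked_page = allowed("/breakingnews/promnight.html")
open_page = allowed("/articles/promnight.html")
```

Here `blocked_page` is False and `open_page` is True: the rule applies regardless of how many other pages link to the blocked URL, which is what makes it more dependable than NOFOLLOW for this purpose.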

Controlling Caching and Snippets

The Robots Exclusion Protocol allows you to specify, to some extent, how you would like your web pages to appear in Google's search results. Usually search results show a cached page link and a snippet, two features that our users tell us are very useful. Here, for example, is the first result I got when I searched for "Mallard duck":

The snippet is the extract of text from the web page; in this case it starts "The mallard duck is found mostly in North America...". We know from user studies that users are more likely to visit your site if the search results show the snippet. Why? Because snippets make it much easier for users to see why the result is relevant to their query. If a user isn't able to make this determination quickly, he or she usually moves on to the next search result.

Underneath the snippet is the URL of the page followed by the "cached" link. Clicking on this link takes you to a copy of the page stored on Google's servers. This is useful in a number of cases: for sites that are temporarily unavailable; for news sites that get overloaded in the aftermath of a major event such as 9/11; and for sites that are accidentally deleted. Another advantage is that Google's cached copy highlights the words a person searched for, allowing them to quickly see how the page is relevant to their query.

Usually you want Google to display both the snippet and the cached link. However, there are some cases where you might want to disable one or both of these. For example, say you are a newspaper publisher and you have a page whose content changes several times a day. It may take longer than a day for us to reindex a page, so users may have access to a cached copy of the page that is not the same as the one currently on your site. In this case, you probably don't want the cached link appearing in our results.

Again, the Robots Exclusion Protocol comes to your aid. Add the NOARCHIVE tag to a web page and Google won't show a cached copy of that page in its search results:

<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">

Similarly, you can tell Google not to display a snippet for a page. The NOSNIPPET tag achieves this:

<META NAME="GOOGLEBOT" CONTENT="NOSNIPPET">


Adding NOSNIPPET also has the effect of preventing a cached link from being shown, so if you specify NOSNIPPET you automatically get NOARCHIVE too.
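The interaction between the two tags can be sketched in a few lines of Python. This is a minimal model of the behavior described in this post, not Google's actual implementation; the `RobotsMetaParser` class and `display_policy` function are hypothetical names introduced for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from ROBOTS and GOOGLEBOT META tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for token in attr.get("content", "").split(","):
                token = token.strip().upper()
                if token:
                    self.directives.add(token)


def display_policy(html):
    """Return which result features would be shown for a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    show_snippet = "NOSNIPPET" not in parser.directives
    # As described above, NOSNIPPET also suppresses the cached link.
    show_cached = show_snippet and "NOARCHIVE" not in parser.directives
    return {"snippet": show_snippet, "cached_link": show_cached}


no_snippet = display_policy('<META NAME="GOOGLEBOT" CONTENT="NOSNIPPET">')
no_archive = display_policy('<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">')
```

With these inputs, `no_snippet` has both features turned off, while `no_archive` keeps the snippet and drops only the cached link.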

Learn more

As usual the Google Webmaster Help pages have a lot of useful information:


Next time...

The final post in this series will take some common exclusion problems that webmasters have told us about and show how to solve them using the Robots Exclusion Protocol.
