Friday, July 27, 2007

Robots Exclusion Protocol: now with even more flexibility



This is the third and last in my series of blog posts about the Robots Exclusion Protocol (REP). In the first post, I introduced robots.txt and the robots META tags, giving an overview of when to use them. In the second post, I shared some examples of what you can do with the REP. Today, I'll introduce two new features that we have recently added to the protocol.

As a product manager, I'm always talking to content providers to learn what you need from the REP. We are constantly looking for ways to improve the control you have over how your content is indexed. These new features give you flexible and convenient ways to refine that control with Google.

Tell us if a page is going to expire
Sometimes you know in advance that a page is going to expire in the future. Maybe you have a temporary page that will be removed at the end of the month. Perhaps some pages are available free for a week, but after that you put them into an archive that users pay to access. In these cases, you want the page to show in Google search results until it expires, then have it removed: you don't want users getting frustrated when they find a page in the results but can't access it on your site.

We have introduced a new META tag that allows you to tell us when a page should be removed from the main Google web search results: the aptly named unavailable_after tag. It follows a similar syntax to other REP META tags. For example, to specify that an HTML page should be removed from the search results after 3pm Eastern Standard Time on 25th August 2007, simply add the following tag to the <HEAD> section of the page:

<META NAME="GOOGLEBOT" CONTENT="unavailable_after: 25-Aug-2007 15:00:00 EST">

The date and time are specified in the RFC 850 format.
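
If you generate pages from a script, a small helper can build this tag for you. The following is only a minimal Python sketch: the function name unavailable_after_tag and the MONTHS list are hypothetical, and the timezone abbreviation is passed in as plain text rather than computed.

```python
from datetime import datetime

# English month abbreviations, spelled out so the output does not depend
# on the locale of the machine that runs the script.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def unavailable_after_tag(when: datetime, tz_abbrev: str) -> str:
    """Build a GOOGLEBOT META tag in the same shape as the example above.

    `when` is the local expiry time; `tz_abbrev` (for example "EST") is
    appended as-is, so the caller is responsible for choosing it.
    """
    stamp = (f"{when.day:02d}-{MONTHS[when.month - 1]}-{when.year} "
             f"{when:%H:%M:%S} {tz_abbrev}")
    return f'<META NAME="GOOGLEBOT" CONTENT="unavailable_after: {stamp}">'


if __name__ == "__main__":
    print(unavailable_after_tag(datetime(2007, 8, 25, 15, 0, 0), "EST"))
```

Called with 25th August 2007, 15:00:00 and "EST", this reproduces the tag shown above.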

This information is treated as a removal request: it will take about a day after the removal date passes for the page to disappear from the search results. We currently only support unavailable_after for Google web search results.

After the removal, the page stops showing in Google search results, but it is not removed from our system. If you need a page to be excised from our systems completely, including any internal copies we might have, you should use the existing URL removal tool, which you can read about on our Webmaster Central blog.

Meta tags everywhere
The REP META tags give you useful control over how each webpage on your site is indexed. But they only work for HTML pages. How can you control access to other types of documents, such as Adobe PDF files, video and audio files, and other types? Well, now the same flexibility for specifying per-URL tags is available for all other file types.

We've extended our support for META tags so they can now be associated with any file. Simply add any supported META tag value to a new X-Robots-Tag directive in the HTTP header used to serve the file. Here are some illustrative examples:
  • Don't display a cache link or snippet for this item in the Google search results:
X-Robots-Tag: noarchive, nosnippet
  • Don't include this document in the Google search results:
X-Robots-Tag: noindex
  • Tell us that a document will be unavailable after 7th July 2007, 4:30pm GMT:
X-Robots-Tag: unavailable_after: 7 Jul 2007 16:30:00 GMT
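
How you attach the header depends on your web server. As one hedged illustration, here is a minimal sketch using Python's standard http.server module. The POLICY table, the file extensions in it, and the names robots_tag_for, RobotsTagHandler, and serve are all assumptions made for this example, not part of the protocol.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler
from typing import Optional

# Hypothetical per-extension policy; the extensions and values are
# illustrative only and should be replaced with your own rules.
POLICY = {
    ".pdf": "noarchive, nosnippet",
    ".mp3": "noindex",
}


def robots_tag_for(path: str) -> Optional[str]:
    """Return the X-Robots-Tag value for a request path, or None."""
    # Ignore any query string and compare extensions case-insensitively.
    path = path.split("?", 1)[0].lower()
    for ext, value in POLICY.items():
        if path.endswith(ext):
            return value
    return None


class RobotsTagHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds X-Robots-Tag where the policy says so."""

    def end_headers(self):
        value = robots_tag_for(self.path)
        if value is not None:
            self.send_header("X-Robots-Tag", value)
        super().end_headers()


def serve(port: int = 8000) -> None:
    """Serve the current directory on localhost until interrupted."""
    HTTPServer(("localhost", port), RobotsTagHandler).serve_forever()
```

Calling serve() from the directory that holds your files starts a local test server; you can then inspect the response headers for a PDF to confirm that the X-Robots-Tag line is present.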

You can combine multiple directives in the same document. For example:
  • Do not show a cached link for this document, and remove it from the index after 23rd July 2007, 3pm PST:
X-Robots-Tag: noarchive
X-Robots-Tag: unavailable_after: 23 Jul 2007 15:00:00 PST
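
If you assemble headers in code, the combined form above can be produced by emitting one X-Robots-Tag line per directive. This is only a sketch under that assumption; the function name x_robots_tag_lines is hypothetical.

```python
from typing import Iterable, List


def x_robots_tag_lines(directives: Iterable[str]) -> List[str]:
    """Return one 'X-Robots-Tag: ...' header line per directive, in order."""
    lines = []
    for directive in directives:
        directive = directive.strip()
        if not directive:
            continue  # skip empty entries rather than emit an empty header
        if "\r" in directive or "\n" in directive:
            # A line break inside a value would corrupt the header block.
            raise ValueError("header values must not contain line breaks")
        lines.append(f"X-Robots-Tag: {directive}")
    return lines


if __name__ == "__main__":
    for line in x_robots_tag_lines(
        ["noarchive", "unavailable_after: 23 Jul 2007 15:00:00 PST"]
    ):
        print(line)
```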


Our goal for these features is to provide more flexibility for indexing and inclusion in Google's search results. We hope you enjoy using them.
