Wednesday, October 3, 2012

Google crawlers


See which robots Google uses to crawl the web
"Crawler" is a generic term for any program (such as a robot or spider) used to automatically discover and scan websites by following links from one webpage to another. Google's main crawler is called Googlebot. This table lists information about the common Google crawlers you may see in your referrer logs, and how they should be specified in robots.txt, the robots meta tags, and the X-Robots-Tag HTTP directives.

If several user-agents are recognized in the robots.txt file, Google will follow the most specific one. If you want all of Google to be able to crawl your pages, you don't need a robots.txt file at all. If you want to block or allow all of Google's crawlers' access to some of your content, you can do so by specifying Googlebot as the user-agent. For example, if you want all your pages to appear in Google Search, and you want AdSense ads to appear on your pages, you don't need a robots.txt file. Similarly, if you want to block some pages from Google altogether, blocking the user-agent Googlebot will also block all of Google's other user-agents.

robots.txt

But if you want more fine-grained control, you can get more specific. For example, you might want all your pages to appear in Google Search, but you don't want images in your personal directory to be crawled. In this case, use robots.txt to disallow the user-agent Googlebot-image from crawling the files in your /personal directory (while allowing Googlebot to crawl all files), like this:
User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow: /personal
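Rules like these can be sanity-checked with Python's standard-library robotparser. One caveat, noted in the comments: Python's parser matches the first applicable user-agent group rather than the most specific one, so this sketch lists the Googlebot-Image group first to mirror Google's most-specific behavior.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above. Python's robotparser uses the first
# group whose user-agent matches, so the more specific Googlebot-Image
# group is listed first here; Google itself picks the most specific group
# regardless of order. An empty "Disallow:" means everything is allowed.
robots_txt = """\
User-agent: Googlebot-Image
Disallow: /personal

User-agent: Googlebot
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot may crawl anything, including /personal.
print(rp.can_fetch("Googlebot", "http://example.com/personal/photo.jpg"))        # True
# Googlebot-Image is kept out of /personal.
print(rp.can_fetch("Googlebot-Image", "http://example.com/personal/photo.jpg"))  # False
```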
To take another example, say that you want ads on all your pages, but you don't want those pages to appear in Google Search. Here, you'd block Googlebot, but allow Mediapartners-Google, like this:
User-agent: Googlebot
Disallow: /

User-agent: Mediapartners-Google
Disallow:

robots meta tag

Some pages use multiple robots meta tags to specify directives for different crawlers, like this:
<meta name="robots" content="nofollow">
<meta name="googlebot" content="noindex">
In this case, Google will use the sum of the negative directives, and Googlebot will follow both the noindex and nofollow directives.
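As a rough sketch of how a crawler might implement this combination rule, the snippet below collects the union of directives from every meta tag addressed to a given crawler. The class name and the HTMLParser-based approach are illustrative, not how Googlebot actually works.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect robots directives addressed to a given crawler name.

    A directive applies if the meta tag's name is "robots" (all crawlers)
    or the crawler's own name; the effective set is the union of all
    directives that apply.
    """
    def __init__(self, crawler_name):
        super().__init__()
        self.crawler_name = crawler_name.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in ("robots", self.crawler_name):
            content = (attrs.get("content") or "").lower()
            self.directives.update(d.strip() for d in content.split(","))

html = '<meta name="robots" content="nofollow"><meta name="googlebot" content="noindex">'
parser = RobotsMetaParser("googlebot")
parser.feed(html)
print(sorted(parser.directives))  # ['nofollow', 'noindex']
```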

Controlling how Google crawls and indexes your site


Search engines generally go through two main stages to make content available for users in search results: crawling and indexing. Crawling is when search engine crawlers access publicly available webpages. In general, this involves looking at the webpages and following the links on those pages, just as a human user would. Indexing involves gathering information about a page so that it can be made available ("served") through search results.
The distinction between crawling and indexing is critical. Confusion on this point is common and leads to webpages unexpectedly appearing or not appearing in search results. Note that a page may be crawled but not indexed; and, in rare cases, it may be indexed even if it hasn't been crawled. Additionally, in order to reliably prevent indexing of a page, you must allow its URL to be crawled (or at least attempted), so that the crawler can see your indexing directives.
The methods described in this set of documents help you control aspects of both crawling and indexing, so you can determine how you would prefer your content to be accessed by crawlers as well as how you would like your content to be presented to other users in search results.
In some situations, you may not want to allow crawlers to access areas of a server. This could be the case if accessing those pages would use up limited server resources, or if problems with the URL and linking structure would create a practically infinite number of URLs if all of them were followed.
In some cases it may be preferable to control how content is indexed and made available in search results. For instance, you may not want your pages to be indexed at all, or want them to appear without a snippet (summary of the page shown below the title in search results); or you may not want users to be able to view a cached version of the page.
Warning: Neither of these methods is suitable for controlling access to private content. If content should not be accessible by the general public, it's important that proper authentication mechanisms are in place. Our Help Center has more information on blocking Google from accessing or showing private content.
Note: Pages may be indexed despite never having been crawled: the two processes are independent of each other. If enough information is available about a page, and the page is deemed relevant to users, search engine algorithms may decide to include it in the search results despite never having had access to the content directly. That said, there are simple mechanisms such as robots meta tags to make sure that pages are not indexed.

Controlling crawling

The robots.txt file is a text file that allows you to specify how you would like your site to be crawled. Before crawling a website, crawlers will generally request the robots.txt file from the server. Within the robots.txt file, you can include sections for specific (or all) crawlers with instructions ("directives") that let them know which parts can or cannot be crawled.

Location of the robots.txt file

The robots.txt file must be located at the root of the website host to which it applies. For instance, in order to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt. A robots.txt file can be placed on subdomains (like http://website.example.com/robots.txt) or on non-standard ports (http://example.com:8181/robots.txt), but it cannot be placed in a subdirectory (http://example.com/pages/robots.txt). There are more details regarding the location in the specifications.
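These location rules can be expressed in a few lines. The hypothetical helper below derives the governing robots.txt URL for any page URL: same scheme, host, and port, with the path fixed to /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt URL that governs the given page URL.

    The robots.txt file lives at the root of the same scheme, host,
    and port; the page's own path, query, and fragment are discarded.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.example.com/folder/page.html"))
# http://www.example.com/robots.txt
print(robots_txt_url("http://example.com:8181/a/b.html"))
# http://example.com:8181/robots.txt
```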

Content of the robots.txt file

You can use almost any text editor to create a robots.txt file. The text editor should be able to create standard ASCII or UTF-8 text files; don't use a word processor (word processors often save files in a proprietary format and can add unexpected characters, such as curly quotes, which may cause problems for crawlers). A general robots.txt file might look like this:
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Disallow: /onlygooglebot/

Sitemap: http://www.example.com/sitemap.xml
Assuming this file is located at http://example.com/robots.txt, it specifies the following directives:
  1. Googlebot should not crawl the folder http://example.com/nogooglebot/ or any URLs it contains. The line "User-agent: Googlebot" starts the section with directives for Googlebot.
  2. No other crawler should crawl the folder http://example.com/onlygooglebot/ or any URLs it contains. The line "User-agent: *" starts the section for all crawlers not otherwise specified.
  3. The site's Sitemap file is located at http://www.example.com/sitemap.xml
For more information, see the robots.txt specifications.
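Assuming Python's standard-library robotparser is a close enough approximation of this matching behavior, the directives above can be checked programmatically. The Sitemap line is omitted from the sketch because robotparser's sitemap support varies by Python version.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from above, minus the Sitemap line.
rules = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Disallow: /onlygooglebot/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot is kept out of /nogooglebot/ but may crawl everything else.
print(rp.can_fetch("Googlebot", "http://example.com/nogooglebot/a.html"))    # False
print(rp.can_fetch("Googlebot", "http://example.com/onlygooglebot/a.html"))  # True
# Every other crawler falls into the "*" group.
print(rp.can_fetch("Otherbot", "http://example.com/onlygooglebot/a.html"))   # False
```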

Some sample robots.txt files

These simple samples can help you get started with robots.txt handling.

Allow crawling of all content

User-agent: *
Disallow:
or
User-agent: *
Allow: /
The sample above is valid, but in fact if you want all your content to be crawled, you don't need a robots.txt file at all (and we recommend that you don't use one). If you don't have a robots.txt file, verify that your host returns a proper 404 "Not found" HTTP result code when the robots.txt URL is requested.

Disallow crawling of the whole website

User-agent: *
Disallow: /
Keep in mind that in some situations URLs from the website may still be indexed, even if they haven't been crawled.

Disallow crawling of certain parts of the website

User-agent: *
Disallow: /calendar/
Disallow: /junk/
Remember that you shouldn't use robots.txt to block access to private content: use proper authentication instead. URLs disallowed by the robots.txt file might still be indexed without being crawled, and the robots.txt file can be viewed by anyone, potentially disclosing the location of your private content.

Allowing access to a single crawler

User-agent: Googlebot-news
Disallow:

User-agent: *
Disallow: /

Allowing access to all but a single crawler

User-agent: Unnecessarybot
Disallow: /

User-agent: *
Disallow:

Controlling indexing and serving

Indexing can be controlled on a page-by-page basis using simple information that is sent with each page as it is crawled. For indexing control, you can use either:
  1. a special meta tag that can be embedded in the top of HTML pages
  2. a special HTTP header element that can be sent with all content served by the website
Note: Keep in mind that in order for a crawler to find a meta tag or HTTP header element, the crawler must be able to crawl the page—it cannot be disallowed from crawling with the robots.txt file.

Using the robots meta tag

The robots meta tag can be added to the top of an HTML page, in the <head> section, for instance:
<!DOCTYPE html>
<html><head>
<meta name="robots" content="noindex" />
...
In this example, the robots meta tag specifies that no search engine should index this particular page (noindex). The name robots applies to all search engines. If you want to block or allow a specific search engine, you can specify its user-agent name in place of robots.
For more information, see the robots meta tag specifications.

Using the X-Robots-Tag HTTP header

In some situations, non-HTML content (such as document files) can also be crawled and indexed by search engines. In these cases, it's not possible to add a meta tag to the individual pages; instead, an HTTP header element can be sent with the response. This header element is not directly visible to users, as it is not part of the content itself.
The X-Robots-Tag is included with the other HTTP header tags. You can see these by checking the HTTP headers, for example using "curl":
$ curl -I "http://www.google.com/support/forum/p/Webmasters/search?hl=en&q=test"
HTTP/1.1 200 OK
X-Robots-Tag: noindex
Content-Type: text/html; charset=UTF-8
(...)
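As a sketch of how such a header might be added on the server side, a configuration fragment along these lines could be used on an Apache server, assuming mod_headers is enabled (the PDF file pattern is illustrative, not a drop-in configuration):

```apache
# Send "X-Robots-Tag: noindex, nofollow" with every PDF served,
# so search engines neither index the files nor follow links in them.
<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>
```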
For more information, see the X-Robots-Tag specifications.

Getting started

Most websites will not need to set up restrictions for crawling, indexing or serving, so getting started is simple: you don't have to do anything.
There's no need to modify your pages if you would like to have them indexed.