Robots.txt Format

Author: Wojjie     Posted: 2004-07-19


When search engines crawl the internet, they first check for the existence of a 'robots.txt' file at the root of each site. This file tells the search engine which URLs it should not crawl, and it has a simple format.
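To see this behavior from the crawler's side, here is a minimal sketch using Python's standard urllib.robotparser module (the module and its calls are real; the bot name and site URL are only placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # fetch and parse the robots.txt file

# Ask whether a given robot may fetch a given URL
allowed = rp.can_fetch("ExampleBot", "https://example.com/private/page.html")
print(allowed)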


The 'User-agent' field tells which robot the following lines apply to. A 'User-agent' line may address all robots by using the '*' wildcard.

User-Agent: googlebot
The lines after this one will be considered only by Google's robot (Googlebot), and no other.

User-Agent: *
The lines after this will be considered by all robots.


The 'Disallow' field specifies the part of the URL that should not be crawled. Disallow rules behave like prefixes: any URL whose path begins with the specified value falls under the rule.

Disallow: /the
Would match all of the following example URLs:

/the
/theory.html
/the/files/page.html
/them/index.html
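This prefix behaviour can be verified with the same urllib.robotparser module; in this sketch the rules are fed in directly rather than fetched, and the paths are the example URLs above:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /the",
])

# Every path that begins with /the is disallowed
for path in ("/the", "/theory.html", "/the/files/page.html", "/them/index.html"):
    print(path, rp.can_fetch("*", path))  # prints False for each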

Live 'robots.txt' example: this site uses the following robots.txt definition (as of 2004-07-19):

User-agent: *
Disallow: /files/
Which tells all robots not to crawl into the /files/ directory.
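Under those rules, a well-behaved crawler would skip everything beneath /files/ but could still fetch the rest of the site. A quick check, again with urllib.robotparser (the paths are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /files/",
])

print(rp.can_fetch("*", "/files/archive.zip"))      # False: under /files/
print(rp.can_fetch("*", "/articles/robots.html"))   # True: not disallowed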


