When search engines crawl the web, they check for the existence of a 'robots.txt' file. This file tells the search engine which URLs it should not crawl, and it has a simple format.

User-Agent:
The 'User-agent' field specifies which robot the following lines apply to. The rules may be applied to all robots by using the '*' wildcard. For example:
User-Agent: googlebot

The lines after this will be considered only by Google's robot, and no other.
User-Agent: *

The lines after this will be considered by all robots.
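This per-robot grouping can be checked with Python's standard urllib.robotparser module. A minimal sketch; the rules and URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a group of rules that applies only to googlebot.
rules = [
    "User-Agent: googlebot",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# googlebot matches the group, so /private/ is off limits for it...
print(rp.can_fetch("googlebot", "http://www.example.com/private/page.html"))  # False
# ...while a robot with a different name matches no group and may crawl it.
print(rp.can_fetch("otherbot", "http://www.example.com/private/page.html"))   # True
```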
Disallow:

Specifies the part of the URL that should not be crawled. 'Disallow' behaves like a prefix match: any URL whose path begins with the specified value is covered by the rule. For example:
Disallow: /the

would match all of the following example URLs:
http://www.theserverpages.com/the/
http://www.theserverpages.com/the/index.html
http://www.theserverpages.com/the/scripts/index.html
http://www.theserverpages.com/theatre/
http://www.theserverpages.com/theatre/1.html
http://www.theserverpages.com/the.html
http://www.theserverpages.com/them.html
http://www.theserverpages.com/theFinalExample.html
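This prefix behaviour can be verified with Python's standard urllib.robotparser module; a small sketch using the example rule above (the non-matching URL is made up for contrast):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow: /the",
])

# Every path starting with "/the" is matched, directory or not.
blocked = [
    "http://www.theserverpages.com/the/index.html",
    "http://www.theserverpages.com/theatre/1.html",
    "http://www.theserverpages.com/them.html",
]
for url in blocked:
    print(url, rp.can_fetch("*", url))  # all False

# A path that does not start with "/the" is unaffected.
print(rp.can_fetch("*", "http://www.theserverpages.com/other.html"))  # True
```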
TheServerPages.com uses the following robots.txt definition (as of 07-19-2004):
User-agent: *
Disallow: /files/

which tells all robots not to crawl the /files/ directory.
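As a final illustration, the same standard-library parser confirms what this definition does (the checked URLs are hypothetical examples, not actual pages from the site):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /files/",
])

# Anything under /files/ is off limits to every robot...
print(rp.can_fetch("googlebot", "http://www.theserverpages.com/files/archive.zip"))  # False
# ...while the rest of the site remains crawlable.
print(rp.can_fetch("googlebot", "http://www.theserverpages.com/index.html"))         # True
```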