
Robots.txt Format

Author: Wojjie     Posted: 2004-07-19

Introduction:

When search engines crawl the web, they first check for the existence of a 'robots.txt' file at the root of the site. This file tells the crawler which URLs it should not visit, and it has a simple format.

User-Agent:

The 'User-agent' field specifies which robot the following lines apply to. A record may apply to all robots by using the '*' wildcard.

ex:
User-Agent: googlebot
The lines after this one will be considered only by Google's robot (Googlebot), and no other.

ex:
User-Agent: *
The lines after this one will be considered by all robots.
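Putting the two forms together, a single robots.txt file can contain several records, each introduced by its own 'User-Agent' line. The directory names below are made up for illustration (lines starting with '#' are comments):

```text
# Rules that only Googlebot will follow (hypothetical path)
User-Agent: googlebot
Disallow: /no-google/

# Rules for every other robot (hypothetical path)
User-Agent: *
Disallow: /private/
```

A robot reads only the record whose 'User-Agent' line matches it, falling back to the '*' record if no specific match exists.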

Disallow:

Specifies the part of the URL path that should not be crawled. 'Disallow' matches by prefix: any URL whose path begins with the specified value falls under the rule.

e.g.
Disallow: /the
Would match all of the following example URLs:
http://www.theserverpages.com/the/
http://www.theserverpages.com/the/index.html
http://www.theserverpages.com/the/scripts/index.html
http://www.theserverpages.com/theatre/
http://www.theserverpages.com/theatre/1.html
http://www.theserverpages.com/the.html
http://www.theserverpages.com/them.html
http://www.theserverpages.com/theFinalExample.html
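The prefix behaviour shown above can be sketched in a few lines of Python. This is a simplified illustration that ignores URL-encoding and per-robot records; `is_disallowed` is a made-up helper, not part of any standard library:

```python
def is_disallowed(path, disallow_prefixes):
    """Return True if any Disallow prefix matches the start of the path."""
    return any(path.startswith(prefix) for prefix in disallow_prefixes)

rules = ["/the"]  # the Disallow value from the example above

for path in ["/the/", "/theatre/1.html", "/the.html", "/index.html"]:
    print(path, "blocked" if is_disallowed(path, rules) else "allowed")
```

Note that "/index.html" is allowed: it does not begin with "/the", so the rule never applies to it.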


Live 'robots.txt' example:

TheServerPages.com uses the following robots.txt definition (as of 07-19-2004):

User-agent: *
Disallow: /files/
This tells all robots not to crawl anything under the /files/ directory.

This file can be found here: http://www.theserverpages.com/robots.txt
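Crawlers do not usually implement this matching by hand; Python's standard library, for example, ships a robots.txt parser. The sketch below feeds it the same two lines inline rather than fetching them over the network, and the 'MyBot' user-agent name is invented for the example:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# The same rules as the live example above, supplied inline.
parser.parse([
    "User-agent: *",
    "Disallow: /files/",
])

# 'MyBot' is a hypothetical crawler name; the '*' record applies to it.
print(parser.can_fetch("MyBot", "http://www.theserverpages.com/files/a.zip"))
print(parser.can_fetch("MyBot", "http://www.theserverpages.com/index.html"))
```

The first call returns False (the path falls under /files/), the second True.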



Copyright © 2004-2015: TheServerPages.com