Another method:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
The page will still be indexed, but the spider will not follow any hyperlinks on that page.
For the strongest restriction, combine the two:
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
The page will not be indexed and no links will be followed.
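To see how a spider might read these tags, here is a minimal Python sketch that pulls the ROBOTS directives out of a page. The names RobotsMetaParser and robots_directives are illustrative assumptions, and a real spider may interpret the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <META NAME="ROBOTS" CONTENT="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for item in content.split(","):
            if item.strip():
                self.directives.add(item.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"></head></html>'
found = robots_directives(page)
print("may index page:", "noindex" not in found)
print("may follow links:", "nofollow" not in found)
```

A page with no ROBOTS tag gives an empty set, which this sketch treats as permission to both index and follow.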
Robots.txt File
The robots.txt file is a more powerful strategy. It is a plain text file that tells agents and spiders which parts of
your site they may and may not visit. These rules are known as the Robots Exclusion Standard.
ThinkHost doesn't place a default robots.txt file in your web when you open an account, so you'll need to create one
in a plain text editor such as Notepad and upload it via FTP to your docs directory. If you are using Microsoft FrontPage,
save the file to the root directory of your disk-based web and then upload it via FrontPage's standard HTTP:// publishing function.
Never use a blank robots.txt file, as some search engines may take this as a sign that you don't want your
site spidered at all! Have at least one entry in the file and remember to skip a line between entries. Also make sure
that any spider/agent you ban doesn't turn out to be a legitimate browser.
To prevent specific agents and spiders from having any access to your site, put these lines into the robots.txt file:
User-agent: NameOfAgent
Disallow: /
Use the name of the agent as it appears in your traffic reports. The standard recommends that robots match this name
without version information, so an agent reported as WebZip/4.0 is best listed by name alone:
User-agent: WebZip
Disallow: /
Skip a line between entries. You could do the same to exclude search engine spiders such as Googlebot.
The "/" blocks access to every directory on the site.
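One way to check a rule like this before uploading it is Python's standard urllib.robotparser module. The sketch below is only an illustration: the page URL is a placeholder, and because this parser compares only the agent name in front of the slash, the rule is written as WebZip rather than WebZip/4.0.

```python
from urllib.robotparser import RobotFileParser

# The entry under test: shut WebZip out of the whole site.
rules = """\
User-agent: WebZip
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder URL, used only for illustration.
page = "http://www.example.com/index.html"

# Ask the parser what each visiting agent would be permitted to fetch.
webzip_allowed = parser.can_fetch("WebZip/4.0", page)
googlebot_allowed = parser.can_fetch("Googlebot", page)
print("WebZip/4.0 allowed:", webzip_allowed)
print("Googlebot allowed:", googlebot_allowed)
```

This only shows how a well-behaved robot would read the file. It says nothing about agents that ignore robots.txt.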
You can also prevent access to specific folders:
User-agent: *
Disallow: /cgi-bin/
In this example the * means "all agents". Please note that the wildcard (*) cannot be used on the Disallow line;
use "/" instead.
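The reason no wildcard is needed is that, in the standard, a Disallow value is compared against the start of the requested path. A rough Python sketch of that comparison follows. The function name is_disallowed and the sample paths are assumptions for illustration, and individual spiders may apply extra rules of their own.

```python
def is_disallowed(path, disallow_values):
    """True when the path starts with any of the Disallow values."""
    return any(path.startswith(value) for value in disallow_values)


# Illustrative rule and request paths.
rules = ["/cgi-bin/"]
for path in ["/cgi-bin/form.pl", "/index.html", "/cgi-bin-old/form.pl"]:
    verdict = "blocked" if is_disallowed(path, rules) else "allowed"
    print(path, verdict)

# Every path on a site starts with "/", so "Disallow: /" covers everything.
print(is_disallowed("/any/page.html", ["/"]))
```

Note that the trailing slash matters: /cgi-bin/ does not cover /cgi-bin-old/ in this sketch.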
Example robots.txt file
If you would like some sort of guide and further examples of a robots.txt file, you can take a look at the one we use
on the ThinkHost site. View it here:
http://www.thinkhost.com/robots.txt
Our file is by no means complete, but it does contain entries for a number of "idiot" bots that repeatedly attempt to
strip our main site. Please be aware that robots.txt will not stop all web stripping activity, as many strippers can
fake agent names, but it will help you save on bandwidth.
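If your own list of unwanted bots grows long, the entries can be generated rather than typed by hand. The following Python sketch is one possible approach, not the way the ThinkHost file was produced. The function name build_ban_list and the agent names are hypothetical.

```python
def build_ban_list(agent_names):
    """Return robots.txt text that shuts out each named agent."""
    if not agent_names:
        # The file should never be blank, so refuse to produce one.
        raise ValueError("robots.txt should contain at least one entry")
    entries = ["User-agent: " + name + "\nDisallow: /" for name in agent_names]
    # A blank line separates one entry from the next.
    return "\n\n".join(entries) + "\n"


# Hypothetical agent names, for illustration only.
robots_txt = build_ban_list(["ExampleStripper", "AnotherBot"])
print(robots_txt)
```

The resulting text can be saved as robots.txt and uploaded in the usual way.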
Good Spiders
If you would like to be able to identify the "good" spiders that may visit your site, you can view a listing of the most
popular search engines' robots in our tutorial, "Understanding your web site traffic".
About The Author
Article by Michael Bloch of Team ThinkHost. Thinking Hosting? ThinkHost!