Tuesday, October 25, 2005

Robots, crawling and being nice

robots.txt is a file you put on your web server to tell spiders (like Google's) which parts of your site they should not crawl or index. It also tells programs (like wget) what automated downloaders are not supposed to fetch. You can see for yourself what a website's policy is by pointing your browser at http://WEBSITE/robots.txt. The thing is, it is not security. It is a suggestion, and nothing forces a client to honor it. wget, for example, can be told on the command line to ignore robots.txt entirely (-e robots=off). Cryptome has a little article on it, with robots.txt files from major websites. The best by far is Sun's:


# /robots.txt for www.sun.com

# Mon Feb 2 11:59:27 PST 1998, Fred Elliott
# Bertrand Meyer's excellent "comp.risks" posting about the potential
# for misusing "robots.txt" files
# (http://www.eiffel.com/private/meyer/robots.html) includes a snapshot
# of the contents of this file here on www.sun.com.
# In the article, Bertrand speculates that the directories listed below
# contain proprietary information. Well, they don't. They do, though,
# contain information that we'd prefer people register for before they
# download it.
# The purpose of the "robots.txt" file is to keep these directories
# from being indexed so that the average user doesn't stumble across them
# while performing searches, and those that should be accessing these
# directories will do so through the URL that requires them to register.
# Of course, having the contents of this file advertised in "comp.risks"
# diminishes its purpose. Thanks Bertrand. ;-)
# If you do actually go to the trouble of figuring out how to download
# the files without registering, what you'll end up with is 1 or 2MB of
# stuff that is meaningless to you unless you have purchased an
# Ultra AX board from Sun. So, please do purchase an Ultra AX board,
# but then you might as well use the URL you'll be given along with it.

gleaned from Cryptome
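
To make the "suggestion, not security" point concrete, here is a minimal sketch of how a polite client might consult robots.txt before fetching a page, using Python's standard urllib.robotparser module. The robots.txt content, the "ExampleBot" user agent and the example.com URLs are all made up for illustration; none of them come from Sun's file or from Cryptome.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/private/file.zip",
    ):
        verdict = "allowed" if is_allowed(EXAMPLE_ROBOTS_TXT, "ExampleBot", url) else "disallowed"
        print(url, "->", verdict)
```

Note that the check is entirely voluntary: the client asks the question and then decides for itself whether to respect the answer. A program that skips the call to is_allowed can still request /private/file.zip, and the server will hand it over unless something else (a login, a registration URL) stands in the way.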


At 4:09 AM, Blogger Absintheandcinder said...

Bertrand's quite the little troublemaker, isn't he?


