Website owners use the /robots.txt file to give instructions about their site to web robots. This is known as the Robots Exclusion Protocol.
It works like this: before a robot visits a page on a site, say http://www.example.com/welcome.html, it first checks for http://www.example.com/robots.txt. This is what it will find:
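A minimal example of such a file; the two directives are explained next:

```
User-agent: *
Disallow: /
```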
The “User-agent: *” indicates that this section applies to all robots. The “Disallow: /” tells the robot that it should not visit any pages on the site.
There are two important considerations when using /robots.txt:
- Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email-address harvesters used by spammers, will pay no attention to it.
- The /robots.txt file is publicly available. Anyone can read it and see which sections of your server you do not want robots to use.
So don't try to use /robots.txt to hide information.
Details about /robots.txt
The /robots.txt protocol is a de facto standard; it is not owned by any standards body. Its history is described in two documents:
- The 1994 document A Standard for Robot Exclusion.
- The 1997 Internet Draft specification A Method for Web Robots Control.
There are also external resources that describe the protocol.
Note that the /robots.txt standard is not under active development.
How to make a /robots.txt file
In short, you place the /robots.txt file in the top-level directory of your web server.
More specifically: when a robot looks for the “/robots.txt” file for a URL, it strips the path component from the URL (everything from the first single slash) and puts “/robots.txt” in its place.
For example, for “http://www.example.com/shop/index.html”, the robot strips “/shop/index.html” and substitutes “/robots.txt”, resulting in “http://www.example.com/robots.txt”.
So, as a website owner, you need to put the file in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your site's main “index.html” landing page. Exactly where that is, and how to put the file there, depends on your web server software.
Remember to use all lowercase for the filename: “robots.txt”, not “Robots.TXT”.
What should you put in it?
The “/robots.txt” file is a text file with one or more records. Usually it contains a single record looking like this:
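As a sketch, with illustrative directory names:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
```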
In this example, three directories are excluded.
Note that you need a separate “Disallow” line for every URL prefix you want to exclude; you cannot simply say “Disallow: /cgi-bin/ /tmp/” on a single line. Also, you may not have blank lines within a record, since blank lines are used to delimit multiple records.
Note also that globbing and regular expressions are not supported in either the User-agent or the Disallow lines. The ‘*’ in the User-agent field has the special meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here are some examples:
To exclude all robots from the entire server
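A record like the following excludes every robot from every URL on the site:

```
User-agent: *
Disallow: /
```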
To allow all robots complete access
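An empty Disallow value means nothing is excluded:

```
User-agent: *
Disallow:
```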
(Or just create an empty “/robots.txt” file, or don't use one at all.)
To exclude all robots from part of the server
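For instance, with illustrative directory names, one Disallow line per excluded directory:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
```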
To exclude a single robot
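Here “BadBot” is a placeholder for the actual User-agent name of the robot you want to exclude:

```
User-agent: BadBot
Disallow: /
```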
To allow a single robot
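A sketch, with “GoodBot” as a placeholder for the permitted robot's name: the named robot gets full access, while all other robots are excluded.

```
User-agent: GoodBot
Disallow:

User-agent: *
Disallow: /
```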
To exclude all files except one
This is a little awkward, because there is no “Allow” field. The easiest way is to put all the files you want to disallow into a separate directory, named “stuff” for instance, and leave the one file in the level above this directory:
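Assuming, for illustration, that the disallowed directory lives at “/~joe/stuff/”, the record would be:

```
User-agent: *
Disallow: /~joe/stuff/
```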
Alternatively, you can explicitly disallow each page:
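With illustrative file names, one Disallow line per excluded page:

```
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
```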