A Firefox Add-On to Check robots.txt

(to the German version)


Add roboxt! to Firefox

The Firefox Add-On roboxt! helps determine whether robots.txt is configured correctly. It reports which pages a web crawler may crawl and marks links that a web crawler may not visit.

Features

In the status bar, roboxt! displays whether the chosen crawler may visit the current URL. In the preferences, the user can choose to have blocked links marked; in that case the status bar also displays the number of blocked links:

[Screenshot: roboxt! status bar]

Blocked links are marked by a light red background and a dashed grey border:

[Screenshot: roboxt! blocked links]
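As a rough illustration only (this is not the add-on's actual code; the function name and the exact colour value are assumptions), marking such a link in the page could look like this:

// Hypothetical sketch: apply the visual marking described above to a link
// that the chosen crawler may not visit.
function markBlockedLink(link: HTMLAnchorElement): void {
  link.style.backgroundColor = "#ffd6d6"; // light red background (assumed shade)
  link.style.border = "1px dashed grey";  // dashed grey border
}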

Context Menu

The context menu of the status bar offers a direct link to the current robots.txt file ("Show robots.txt"), which opens robots.txt in a new window or tab. The context menu also lets the user open the preferences window.

Preferences

In the preferences the user may make two adjustments:

  1. The name of the crawler that robots.txt should be tested against. This is "Googlebot" by default.
  2. Whether blocked links should be marked.

Interpretation of robots.txt by roboxt!

Most rules by which a parser should interpret a robots.txt file are relatively clear. In cases of doubt, roboxt! interprets robots.txt as follows (a sketch of the matching logic appears after the list):

    • roboxt! interprets the directives according to Google's extended rules for robots.txt. The add-on, for example, understands wildcard and "Allow" directives. The official robots.txt ruleset is more restrictive.
    • roboxt! treats a crawler's name as case-insensitive, whereas it treats the path as case-sensitive.
    • Only when robots.txt does not define any directives for the chosen crawler are the directives for all crawlers ("*") used.
    • Conflicting directives are resolved according to these rules:
      • More specific directives beat less specific directives; the length of the respective path determines a directive's specificity.
      • Directives that appear later in robots.txt beat earlier directives.
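The following TypeScript sketch illustrates this matching logic under the rules above. It is not roboxt!'s source code; the type and function names (Directive, parseRobotsTxt, pathToRegExp, isAllowed) and the simplified parsing are assumptions made for this example.

// Illustrative sketch of the matching rules described above; not roboxt!'s code.

type Directive = { allow: boolean; path: string; order: number };

// Collect the Allow/Disallow directives that apply to the chosen crawler.
// Crawler names are compared case-insensitively; the "*" group is used only
// when no group names the chosen crawler.
function parseRobotsTxt(content: string, crawler: string): Directive[] {
  const wanted = crawler.toLowerCase();
  const specific: Directive[] = [];
  const generic: Directive[] = [];
  let agents: string[] = [];
  let inDirectives = false;
  let order = 0;

  for (const raw of content.split(/\r?\n/)) {
    const line = raw.split("#")[0].trim();              // strip comments
    const m = line.match(/^([A-Za-z-]+)\s*:\s*(.*)$/);
    if (!m) continue;
    const field = m[1].toLowerCase();
    const value = m[2].trim();

    if (field === "user-agent") {
      if (inDirectives) { agents = []; inDirectives = false; } // a new group starts
      agents.push(value.toLowerCase());
    } else if ((field === "allow" || field === "disallow") && value !== "") {
      inDirectives = true;
      const d: Directive = { allow: field === "allow", path: value, order: order++ };
      if (agents.includes(wanted)) specific.push(d);
      else if (agents.includes("*")) generic.push(d);
    }
  }
  return specific.length > 0 ? specific : generic;
}

// Turn a robots.txt path into a regular expression: "*" matches any character
// sequence and "$" is treated as an end anchor. Paths stay case-sensitive.
function pathToRegExp(path: string): RegExp {
  let pattern = "^";
  for (const ch of path) {
    if (ch === "*") pattern += ".*";
    else if (ch === "$") pattern += "$";
    else pattern += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  }
  return new RegExp(pattern);
}

// A URL path is blocked only if the best-matching directive is a Disallow.
// The longest matching path wins; on equal length, the later directive wins.
function isAllowed(directives: Directive[], urlPath: string): boolean {
  let best: Directive | null = null;
  for (const d of directives) {
    if (!pathToRegExp(d.path).test(urlPath)) continue;
    if (
      best === null ||
      d.path.length > best.path.length ||
      (d.path.length === best.path.length && d.order > best.order)
    ) {
      best = d;
    }
  }
  return best === null || best.allow;
}

// Example with a made-up robots.txt:
const robots = [
  "User-agent: Googlebot",
  "Disallow: /private/",
  "Allow: /private/press/",
].join("\n");

const rules = parseRobotsTxt(robots, "googlebot");            // crawler name: case-insensitive
console.log(isAllowed(rules, "/private/press/launch.html"));  // true: the longer Allow wins
console.log(isAllowed(rules, "/private/internal.html"));      // false: only the Disallow matches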