Documentation

Basic usage

To check a URL like http://www.myhomepage.org/, it is enough to execute linkchecker http://www.myhomepage.org/. This checks the complete domain of www.myhomepage.org recursively. All links pointing outside of the domain are checked for validity as well.

For more options, read the man page linkchecker(1) or execute linkchecker -h.
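
For instance, the following hypothetical invocation limits recursion to one level and lowers the connection timeout; both options are described in the man page, and the values are only examples:

$ linkchecker --recursion-level=1 --timeout=20 http://www.myhomepage.org/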

Performed checks

All URLs have to pass a preliminary syntax test. Minor quoting mistakes result in a warning; all other invalid syntax issues are reported as errors. After the syntax check passes, the URL is queued for connection checking. All connection check types are described below.

Recursion

Before descending recursively into a URL, it has to fulfill several conditions. They are checked in this order:

  1. A URL must be valid.
  2. A URL must be parseable. This currently includes HTML files, Opera bookmark files, and directories. If a file type cannot be determined (for example, it does not have a common HTML file extension and the content does not look like HTML), it is assumed to be non-parseable.
  3. The URL content must be retrievable. This is usually the case, except for example for mailto: links or unknown URL types.
  4. The maximum recursion level must not be exceeded. It is configured with the --recursion-level option and is unlimited by default.
  5. It must not match the ignored URL list. This is controlled with the --ignore-url option (see the example below).
  6. The Robots Exclusion Protocol must allow links in the URL to be followed recursively. This is checked by searching for a "nofollow" directive in the HTML header data.

Note that the directory recursion reads all files in that directory, not just a subset like index.htm*.
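
Both the recursion depth and the ignore list can be given on the command line; the regular expression below is only an illustration:

$ linkchecker --recursion-level=2 --ignore-url=^mailto: http://www.myhomepage.org/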

Frequently asked questions

Q: LinkChecker produced an error, but my web page is ok with Netscape/IE/Opera/... Is this a bug in LinkChecker?

A: Please check your web pages first. Are they really ok? Use a syntax highlighting editor. Use HTML Tidy. Check if you are using a proxy which produces the error.

Q: I still get an error, but the page is definitely ok.

A: Some servers deny access to automated tools (also called robots) like LinkChecker. This is not a bug in LinkChecker but rather a policy of the webmaster running the website you are checking. It might even be possible for a website to send robots different web pages than it sends to normal browsers.

Q: How can I tell LinkChecker which proxy to use?

A: LinkChecker works transparently with proxies. In a Unix or Windows environment, set the http_proxy, https_proxy, ftp_proxy or gopher_proxy environment variables to a URL that identifies the proxy server before starting LinkChecker. For example:

$ http_proxy="http://www.someproxy.com:3128"
$ export http_proxy
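
On Windows (assuming the standard cmd.exe shell), the corresponding command would be

C:\> set http_proxy=http://www.someproxy.com:3128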

In a Macintosh environment, LinkChecker will retrieve proxy information from Internet Config.

Q: The link "mailto:john@company.com?subject=Hello John" is reported as an error.

A: You have to quote special characters (e.g. spaces) in the subject field. The correct link should be "mailto:...?subject=Hello%20John". Unfortunately, browsers like IE and Netscape do not enforce this.

Q: Does LinkChecker have JavaScript support?

A: No, and it never will. If your page does not work without JavaScript, then your web design is broken. Use PHP or Zope or ASP for dynamic content, and use JavaScript only as an add-on for your web pages.

Q: Is LinkChecker's cookie feature insecure?

A: Cookies cannot store more information than is in the HTTP request itself, so you are not giving away any additional system information. Once stored, however, cookies are sent back to the server on subsequent requests, not to every server, but only to the one the cookie originated from. This can be used to "track" subsequent requests to that server, and this is what annoys some people (including me). Cookies are only stored in memory; after LinkChecker finishes, they are lost, so the tracking is restricted to the checking time. The cookie feature is disabled by default.

Q: I want to have my own logging class. How can I use it in LinkChecker?

A: Currently, only the Python API lets you define new logging classes. Define your own logging class as a subclass of StandardLogger or any other logging class in the log module. Then register the new class with the configuration's logger_add method and append a new logger instance to the 'fileoutput' list, as in the following example:

import linkcheck.configuration
import MyLogger

# Name and arguments for the new output logger.
log_format = 'mylog'
log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}

# Register the logger class and add a configured instance
# to the list of file output loggers.
cfg = linkcheck.configuration.Configuration()
cfg.logger_add(log_format, MyLogger.MyLogger)
cfg['fileoutput'].append(cfg.logger_new(log_format, log_args))

Q: LinkChecker does not ignore anchor references when caching URLs.

Q: Some links with anchors are getting checked twice.

A: This is not a bug. It is commonly assumed that if a URL ABC#anchor1 works, then ABC#anchor2 works too. That is not specified anywhere, and I have seen server-side scripts that fail on some anchors and not on others. This is the reason for always checking URLs with different anchors. If you really want to disable this, use the --no-anchor-caching option.
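
For example, to turn this behavior off for a whole check:

$ linkchecker --no-anchor-caching http://www.myhomepage.org/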

Q: I see LinkChecker gets a /robots.txt file for every site it checks. What is that about?

A: LinkChecker follows the robots.txt exclusion standard. To avoid misuse of LinkChecker, you cannot turn this feature off. See the Web Robot pages and the Spidering report for more info.

Q: Ctrl-C does not stop LinkChecker immediately. Why is that so?

A: The Python interpreter has to wait for all threads to finish, and this means waiting for all open connections to close. The default timeout for connections is 30 seconds, hence the delay. You can change the default connection timeout with the --timeout option.
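
For example, lowering the timeout makes open connections, and thus Ctrl-C, terminate sooner; the value of 10 seconds here is only an illustration:

$ linkchecker --timeout=10 http://www.myhomepage.org/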

Q: How do I print unreachable/dead documents of my website with LinkChecker?

A: No can do. This would require file system access to your web repository and access to your web server configuration.

You can instead store the linkchecker results in a database and look for missing files.
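
For example, assuming your LinkChecker version provides the SQL output logger (check linkchecker -h for the available output types), something like the following writes SQL insert statements that can be loaded into a database and compared against your file system:

$ linkchecker --output=sql http://www.myhomepage.org/ > linkchecker.sql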

Q: How do I check HTML/XML syntax with LinkChecker?

A: No can do. Use the HTML Tidy program.