Enterprise Information Portal APIs

com.ibm.gcs.netutil.http
Class RobotsProcessor

java.lang.Object
  |
  +--com.ibm.gcs.netutil.http.RobotsProcessor

public class RobotsProcessor
extends java.lang.Object

RobotsProcessor is a robots.txt file processor for a URL. This implementation is based on the Internet Draft for robots.txt (see the Robots RFC).
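
For illustration only, a minimal usage sketch; the host name "www.example.com" and the agent name "MyCrawler" are placeholders, not part of the API:

import java.io.IOException;
import com.ibm.gcs.netutil.http.RobotsProcessor;

public class RobotsCheck {
    public static void main(String[] args) {
        try {
            // Fetches and parses http://www.example.com/robots.txt
            RobotsProcessor rp = new RobotsProcessor("www.example.com");

            // May the agent "MyCrawler" fetch /private/data.html?
            if (rp.isAllowed("MyCrawler", "/private/data.html")) {
                System.out.println("fetch permitted");
            } else {
                System.out.println("fetch disallowed by robots.txt");
            }
        } catch (IOException e) {
            // java.net.MalformedURLException is a subclass of IOException
            System.err.println("could not read robots.txt: " + e.getMessage());
        }
    }
}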


Constructor Summary
RobotsProcessor(java.io.File fileObj)
          Creates a RobotsProcessor object given a File object containing a robots.txt description.
RobotsProcessor(java.lang.String hostName)
          Creates a RobotsProcessor object given the host name.
RobotsProcessor(java.lang.String hostName, int port)
          Creates a RobotsProcessor object given the host name and port number.
RobotsProcessor(java.net.URL baseURL)
          Creates a RobotsProcessor object given a URL on the host that contains the robots.txt file.
 
Method Summary
 java.util.Enumeration getAllowedPaths(java.lang.String agent)
          Returns an enumeration of the paths that a particular robot agent is allowed to access.
 java.util.Enumeration getDisallowedPaths(java.lang.String agent)
          Returns an enumeration of the paths that a particular robot agent is not allowed to access.
 java.lang.String getHost()
          Returns the name of the host whose robots.txt is being processed.
 int getPort()
          Returns the port number on which the host whose robots.txt is being processed is listening.
 boolean isAllowed(java.lang.String agent, java.lang.String path)
          Given the name of a robot agent and a path, checks whether the agent is allowed to access that path.
 boolean isExpired()
          A RobotsProcessor is expired if VALID_FOR_MS milliseconds have passed since it was constructed.
 com.ibm.gcs.netutil.http.RDFDescription toRDFDescription()
          Produces an RDFDescription of the robots.txt information.
 java.lang.String toString()
          Returns a string representation of the RobotsProcessor data.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

RobotsProcessor

public RobotsProcessor(java.lang.String hostName)
                throws java.net.MalformedURLException,
                       java.io.IOException
Creates a RobotsProcessor object given the host name. The robots.txt data is read from http://hostName/robots.txt.
Parameters:
hostName - name of the HTTP host
Throws:
java.net.MalformedURLException - when the URL (http://hostName/robots.txt) is malformed
java.io.IOException - when an IO exception occurs while reading the robots.txt data
See Also:
MalformedURLException

RobotsProcessor

public RobotsProcessor(java.net.URL baseURL)
                throws java.net.MalformedURLException,
                       IllegalProtocolException,
                       java.io.IOException
Creates a RobotsProcessor object given a URL on the host that contains the robots.txt file.
Parameters:
baseURL - URL of the host containing the robots.txt file
Throws:
java.net.MalformedURLException - when the URL (url+"/robots.txt") is malformed
IllegalProtocolException - if the protocol is not "http"
java.io.IOException - when an IO exception occurs while reading the robots.txt data
See Also:
IllegalProtocolException, IOException, MalformedURLException
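
A hedged sketch of constructing from a URL. The URL value is illustrative, and IllegalProtocolException is assumed to live in com.ibm.gcs.netutil.http:

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import com.ibm.gcs.netutil.http.IllegalProtocolException;  // package location assumed
import com.ibm.gcs.netutil.http.RobotsProcessor;

public class RobotsFromURL {
    public static void main(String[] args) {
        try {
            URL base = new URL("http://www.example.com/docs/index.html");
            // robots.txt is read from the host of the given URL
            RobotsProcessor rp = new RobotsProcessor(base);
            System.out.println(rp.isAllowed("MyCrawler", "/docs/index.html"));
        } catch (IllegalProtocolException e) {
            System.err.println("not an http URL: " + e.getMessage());
        } catch (MalformedURLException e) {
            System.err.println("bad URL: " + e.getMessage());
        } catch (IOException e) {
            System.err.println("could not read robots.txt: " + e.getMessage());
        }
    }
}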

RobotsProcessor

public RobotsProcessor(java.lang.String hostName,
                       int port)
                throws java.net.MalformedURLException,
                       java.io.IOException
Creates a RobotsProcessor object given the host name and port number. The robots.txt data is read from http://hostName:port/robots.txt.
Parameters:
hostName - name of the HTTP host
port - port on which the host is listening
Throws:
java.net.MalformedURLException - when the URL (http://hostName:port/robots.txt) is malformed
java.io.IOException - when an IO exception occurs while reading the robots.txt data
See Also:
MalformedURLException

RobotsProcessor

public RobotsProcessor(java.io.File fileObj)
                throws java.io.IOException
Creates a RobotsProcessor object given a File object containing a robots.txt description.
Parameters:
fileObj - a java.io.File object that contains the robots.txt description
Throws:
java.io.FileNotFoundException - if the file is not found
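
A short sketch reading a local copy of robots.txt; the file name is illustrative. Note that getHost() returns null and getPort() returns zero for a RobotsProcessor built this way:

import java.io.File;
import java.io.IOException;
import com.ibm.gcs.netutil.http.RobotsProcessor;

public class LocalRobotsCheck {
    public static void main(String[] args) throws IOException {
        // Parse a robots.txt description stored on the local file system
        RobotsProcessor rp = new RobotsProcessor(new File("robots.txt"));

        // No host was contacted, so getHost() is null and getPort() is zero
        System.out.println("host = " + rp.getHost() + ", port = " + rp.getPort());
        System.out.println(rp.isAllowed("MyCrawler", "/cgi-bin/search"));
    }
}
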
Method Detail

getDisallowedPaths

public java.util.Enumeration getDisallowedPaths(java.lang.String agent)
Returns an enumeration of the paths that a particular robot agent is not allowed to access.
Parameters:
agent - the agent for whom the test is made
Returns:
an enumeration of String objects containing the disallowed paths

getAllowedPaths

public java.util.Enumeration getAllowedPaths(java.lang.String agent)
Returns an enumeration of the paths that a particular robot agent is allowed to access.
Parameters:
agent - the agent for whom the test is made
Returns:
an enumeration of String objects containing the allowed paths
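
A sketch of how the returned Enumeration objects might be consumed; the agent name is a placeholder:

import java.util.Enumeration;
import com.ibm.gcs.netutil.http.RobotsProcessor;

public class PathLister {
    // Print the disallow and allow rules that apply to the given agent
    static void printPaths(RobotsProcessor rp, String agent) {
        Enumeration disallowed = rp.getDisallowedPaths(agent);
        while (disallowed.hasMoreElements()) {
            // Elements are String objects containing the paths
            System.out.println("Disallow: " + (String) disallowed.nextElement());
        }
        Enumeration allowed = rp.getAllowedPaths(agent);
        while (allowed.hasMoreElements()) {
            System.out.println("Allow: " + (String) allowed.nextElement());
        }
    }
}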

isAllowed

public boolean isAllowed(java.lang.String agent,
                         java.lang.String path)
Given the name of a robot agent and a path, checks whether the agent is allowed to access that path. The method finds the longest prefix of the reference path among the allowed path strings and the longest prefix among the disallowed path strings. If the longest allowed prefix is longer than the longest disallowed prefix, true is returned; otherwise false. If nothing is disallowed (the longest disallowed prefix is ""), everything is allowed. The path "/robots.txt" is always allowed.
Parameters:
agent - the agent for whom the test is made
path - the path for which the access is to be checked
Returns:
true if the agent is allowed, false otherwise
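
The decision rule above can be illustrated with a standalone sketch. This is not the class's actual source, just a plain restatement of the prefix comparison over hypothetical allow and disallow rule lists:

import java.util.Enumeration;
import java.util.Vector;

public class PrefixRuleSketch {
    // Length of the longest rule path that is a prefix of the reference path
    static int longestMatch(Vector rulePaths, String path) {
        int best = 0;
        for (Enumeration e = rulePaths.elements(); e.hasMoreElements();) {
            String rule = (String) e.nextElement();
            if (path.startsWith(rule) && rule.length() > best) {
                best = rule.length();
            }
        }
        return best;
    }

    // Mirror of the decision rule described above
    static boolean isAllowed(Vector allowedPaths, Vector disallowedPaths, String path) {
        if (path.equals("/robots.txt")) {
            return true;                           // "/robots.txt" is always allowed
        }
        int longestDisallowed = longestMatch(disallowedPaths, path);
        if (longestDisallowed == 0) {
            return true;                           // nothing disallowed matches, so everything is allowed
        }
        int longestAllowed = longestMatch(allowedPaths, path);
        return longestAllowed > longestDisallowed; // allow only with a strictly longer allowed prefix
    }
}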

getHost

public java.lang.String getHost()
Returns the name of the host whose robots.txt is being processed. Note that if the RobotsProcessor object was created using the RobotsProcessor(File) constructor, the host name will be null.
Returns:
a string representing the host name

getPort

public int getPort()
Returns the port number on which the host whose robots.txt is being processed is listening. Note that if the RobotsProcessor object was created using the RobotsProcessor(File) constructor, the port will be zero.
Returns:
the port number

toRDFDescription

public com.ibm.gcs.netutil.http.RDFDescription toRDFDescription()
Produces an RDFDescription of the robots.txt information.

toString

public java.lang.String toString()
Returns a string representation of the RobotsProcessor data.
Overrides:
toString in class java.lang.Object
Returns:
a string object

isExpired

public boolean isExpired()
A RobotsProcessor is expired if VALID_FOR_MS milliseconds have passed since it was constructed. Returns true if this RobotsProcessor has expired, false otherwise.
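
A sketch of a refresh pattern based on isExpired(); VALID_FOR_MS is an internal constant of the class, and the surrounding cache logic is hypothetical:

import java.io.IOException;
import com.ibm.gcs.netutil.http.RobotsProcessor;

public class RobotsCache {
    // Re-fetch robots.txt once the cached processor has aged past VALID_FOR_MS
    static RobotsProcessor refreshIfNeeded(RobotsProcessor rp) throws IOException {
        if (rp.isExpired() && rp.getHost() != null) {
            return new RobotsProcessor(rp.getHost(), rp.getPort());
        }
        return rp;
    }
}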

EIP Web Crawler APIs

(c) Copyright International Business Machines Corporation 1996, 2002. All rights reserved.