Enterprise Information Portal APIs

com.ibm.gcs.netutil
Class GCSHttpConnection

java.lang.Object
  |
  +--java.net.URLConnection
        |
        +--com.ibm.gcs.netutil.GCSHttpConnection

public class GCSHttpConnection
extends java.net.URLConnection

GCSHttpConnection is a customization of the java.net.HttpURLConnection class. A GCSHttpConnection object is returned by the openConnection method of the GCSHttpStreamHandler class for the "http" protocol.

See Also:
java.net.HttpURLConnection, GCSHttpStreamHandler

Field Summary
static java.lang.String AUTHORIZATION
           
static boolean cookieDBOn
           
static java.lang.String CRLF
           
static int DEFAULT_ROBOTS_CACHE_SIZE
           
static java.lang.String HTTP_CONNECT
           
static java.lang.String HTTP_DELETE
           
static java.lang.String HTTP_GET
           
static java.lang.String HTTP_HEAD
           
static java.lang.String HTTP_OPTIONS
           
static java.lang.String HTTP_POST
           
static java.lang.String HTTP_PUT
           
static java.lang.String HTTP_TRACE
           
static java.lang.String HTTP_VERSION
           
static java.lang.String IF_MODIFIED_SINCE
           
static java.lang.String LAST_MODIFIED
           
static boolean pdfToHtmlConversionOn
           
static java.util.Map robotsLocksTable
           
static java.lang.String SET_COOKIE
           
static java.lang.String SPC
           
static java.lang.String[] supportedRequestMethods
           
 
Constructor Summary
GCSHttpConnection(java.net.URL u)
          (constructor)
 
Method Summary
 boolean checkIfAllowedToCrawl(java.net.URL u)
           
 void connect()
          Opens a connection to the URL if not already connected.
 void disconnect()
          Disconnects a previously established connection with the HTTP server.
 java.lang.Object getContent()
          Retrieves the content of this URL connection.
 java.lang.String getContentEncoding()
          Returns the content encoding, or null if not found
 int getContentLength()
          Get the length of the content (length of the content header field).
 java.lang.String getContentType()
          Gets the content type of the resource.
 java.lang.String getDateModified()
          Returns the date last modified String from HTTP header, or null if not found
 java.lang.String getHeaderField(java.lang.String fieldName)
          Gets a field value based on the key in the headers that are sent back from the server in response to a connection request.
 java.util.Hashtable getHeaders()
          Gets the headers that are sent back from the server in response to a connection request.
 java.io.InputStream getInputStream()
          Gets an input stream that reads from this open connection. Overrides the superclass's getInputStream method.
 java.lang.String getOutContent()
          Returns the current outcontent
 java.lang.String getRequestMethod()
          Returns the current transaction method
 java.lang.String getRequestProperty(java.lang.String key)
          Description copied from URLConnection. Returns the value of the named general request property for this connection.
 int getResponseCode()
          Gets the response code or the status of a connection request.
 java.lang.String getResponseMessage()
          Gets the response message of a connection request. Response messages are strings such as "OK" or "Not Found", extracted from status lines such as "HTTP/1.0 200 OK" or "HTTP/1.0 404 Not Found".
 RobotsProcessor getRobotsProcessor()
          Get the RobotsProcessor object (if already set up) for this connection
static java.lang.String guessContentTypeFromStream(java.io.InputStream is)
          Guesses the content type from the stream. This is helpful in identifying XML and DTD content that is not sent with the right content type. Overrides the base class method to determine the content more accurately.
 boolean outContentIsEmpty()
          Tells if the outContent is empty
 boolean robotsAllowed()
          Checks if robots are allowed.
 void setAuthorization(java.lang.String username, char[] password)
          If not connected, sets the authorization header according to the basic authentication scheme defined in RFC 2617.
 void setIfModifiedSince(long ifmodifiedsince)
          Calls super, then sets the value in the request header.
 void setOutContent(java.lang.String outContent)
          This sets the content sent during a transaction.
static void setProxy(java.lang.String _proxyHost, int _proxyPort)
          Sets proxy information for all HTTP connections.
 void setRequestMethod(java.lang.String method)
          Sets the method that will be used for the HTTP transaction.
 void setRequestProperty(java.lang.String key, java.lang.String value)
          Description copied from URLConnection.
 
Methods inherited from class java.net.URLConnection
getAllowUserInteraction, getContent, getDate, getDefaultAllowUserInteraction, getDefaultRequestProperty, getDefaultUseCaches, getDoInput, getDoOutput, getExpiration, getFileNameMap, getHeaderField, getHeaderFieldDate, getHeaderFieldInt, getHeaderFieldKey, getIfModifiedSince, getLastModified, getOutputStream, getPermission, getURL, getUseCaches, setAllowUserInteraction, setContentHandlerFactory, setDefaultAllowUserInteraction, setDefaultRequestProperty, setDefaultUseCaches, setDoInput, setDoOutput, setFileNameMap, setUseCaches, toString
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

cookieDBOn

public static boolean cookieDBOn

pdfToHtmlConversionOn

public static boolean pdfToHtmlConversionOn

HTTP_VERSION

public static final java.lang.String HTTP_VERSION

HTTP_GET

public static final java.lang.String HTTP_GET

HTTP_POST

public static final java.lang.String HTTP_POST

HTTP_HEAD

public static final java.lang.String HTTP_HEAD

HTTP_PUT

public static final java.lang.String HTTP_PUT

HTTP_DELETE

public static final java.lang.String HTTP_DELETE

HTTP_TRACE

public static final java.lang.String HTTP_TRACE

HTTP_OPTIONS

public static final java.lang.String HTTP_OPTIONS

HTTP_CONNECT

public static final java.lang.String HTTP_CONNECT

supportedRequestMethods

public static final java.lang.String[] supportedRequestMethods

CRLF

public static final java.lang.String CRLF

SPC

public static final java.lang.String SPC

SET_COOKIE

public static final java.lang.String SET_COOKIE

IF_MODIFIED_SINCE

public static final java.lang.String IF_MODIFIED_SINCE

AUTHORIZATION

public static final java.lang.String AUTHORIZATION

LAST_MODIFIED

public static final java.lang.String LAST_MODIFIED

DEFAULT_ROBOTS_CACHE_SIZE

public static final int DEFAULT_ROBOTS_CACHE_SIZE

robotsLocksTable

public static java.util.Map robotsLocksTable
Constructor Detail

GCSHttpConnection

public GCSHttpConnection(java.net.URL u)
                  throws java.io.IOException
(constructor)
Parameters:
u - URL object for which a connection object is created
Method Detail

getDateModified

public java.lang.String getDateModified()
Returns the date last modified String from HTTP header, or null if not found
Returns:
date last modified String

getContentEncoding

public java.lang.String getContentEncoding()
Returns the content encoding, or null if not found
Overrides:
getContentEncoding in class java.net.URLConnection
Returns:
the content encoding String, or null if not found

setProxy

public static void setProxy(java.lang.String _proxyHost,
                            int _proxyPort)
Sets proxy information for all HTTP connections.

getInputStream

public java.io.InputStream getInputStream()
                                   throws java.io.IOException
Gets an input stream that reads from this open connection. Overrides the superclass's getInputStream method. It creates an input-output stream pipe, serializes the file object associated with this connection into the output stream, and returns the corresponding input stream.
Overrides:
getInputStream in class java.net.URLConnection
Returns:
an input stream for reading
Throws:
java.io.IOException - when a File IO exception happens
See Also:
URLConnection.getInputStream()
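
The pipe arrangement described above can be sketched with the standard java.io piped streams. This is only an illustration under stated assumptions: PipeSketch, openPipe and readAll are hypothetical names, the payload is a plain byte array standing in for the serialized object, and nothing here is taken from the GCSHttpConnection source.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the GCSHttpConnection implementation.
class PipeSketch {

    // Connects an output stream to an input stream and fills the output
    // side from a background thread, then hands back the input side.
    static InputStream openPipe(final byte[] payload) throws IOException {
        final PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);
        Thread writer = new Thread(() -> {
            // Closing the output side signals end-of-stream to the reader.
            try (PipedOutputStream o = out) {
                o.write(payload);
            } catch (IOException e) {
                // The reader went away early; nothing else to do in a sketch.
            }
        });
        writer.setDaemon(true);
        writer.start();
        return in;
    }

    // Drains a stream into a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "payload".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = openPipe(data)) {
            System.out.println(new String(readAll(in), StandardCharsets.US_ASCII));
        }
    }
}
```

The writer runs on its own thread because a piped stream has a bounded buffer; writing and reading on the same thread can block once that buffer fills.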

getContentLength

public int getContentLength()
Get the length of the content (length of the content header field). Overrides the URLConnection's getContentLength method.
Overrides:
getContentLength in class java.net.URLConnection
Returns:
-1
See Also:
private URL getURLAfterConversion(URL urlBefore) throws IOException

connect

public void connect()
             throws java.io.IOException
Opens a connection to the URL if not already connected. Implements the corresponding abstract method in the parent class. It just sets the connected field to true.
Overrides:
connect in class java.net.URLConnection
See Also:
URLConnection.connect(), java.net.URLConnection#connected

getResponseCode

public int getResponseCode()
Gets the response code or the status of a connection request. Response codes are numbers such as 200 or 404, extracted from status lines such as "HTTP/1.0 200 OK" or "HTTP/1.0 404 Not Found".
Returns:
the response code, or -1 if it cannot be deciphered.
See Also:
HTTP RFC for response code values

getResponseMessage

public java.lang.String getResponseMessage()
Gets the response message of a connection request. Response messages are strings such as "OK" or "Not Found", extracted from status lines such as "HTTP/1.0 200 OK" or "HTTP/1.0 404 Not Found".
Returns:
the response string, or null if it cannot be deciphered
See Also:
HTTP RFC for response messages
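
As a rough illustration of how a code and a message can be pulled out of a status line, the following sketch splits the line on spaces and falls back to -1 or null when the line cannot be deciphered. StatusLineSketch, parseCode and parseMessage are hypothetical names; the actual parsing inside GCSHttpConnection is not documented here and may differ.

```java
// Hypothetical sketch; not the GCSHttpConnection implementation.
class StatusLineSketch {

    // Returns the numeric status code, or -1 if the line cannot be deciphered.
    static int parseCode(String statusLine) {
        if (statusLine == null) {
            return -1;
        }
        // Expected shape: HTTP-version SP status-code SP reason-phrase
        String[] parts = statusLine.trim().split(" ", 3);
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // Returns the reason phrase, or null if the line cannot be deciphered
    // or carries no phrase.
    static String parseMessage(String statusLine) {
        if (parseCode(statusLine) == -1) {
            return null;
        }
        String[] parts = statusLine.trim().split(" ", 3);
        return parts.length == 3 ? parts[2] : null;
    }

    public static void main(String[] args) {
        System.out.println(parseCode("HTTP/1.0 404 Not Found"));    // prints 404
        System.out.println(parseMessage("HTTP/1.0 404 Not Found")); // prints Not Found
    }
}
```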

getHeaders

public java.util.Hashtable getHeaders()
Gets the headers that are sent back from the server in response to a connection request. The result is a hash table of key/value pairs.
Returns:
a hashtable of headers
See Also:
getHeaderField(java.lang.String), HTTP RFC for headers
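
One way to picture a header table and a lookup by field name is the sketch below. It is illustrative only: HeaderTableSketch and its methods are hypothetical, and storing names in lower case is an assumption made here because HTTP field names are case-insensitive; the documented class may store them differently.

```java
import java.util.Hashtable;
import java.util.Locale;

// Hypothetical sketch; not the GCSHttpConnection implementation.
class HeaderTableSketch {

    // Builds a table of header name -> value from raw "Name: value" lines.
    static Hashtable<String, String> parseHeaders(String[] headerLines) {
        Hashtable<String, String> headers = new Hashtable<>();
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a header line (for example, the status line)
            }
            // Assumption: names are normalized to lower case for lookup.
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            headers.put(name, value);
        }
        return headers;
    }

    // Looks up one field; returns null when the field is absent.
    static String getHeaderField(Hashtable<String, String> headers, String fieldName) {
        return headers.get(fieldName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Hashtable<String, String> headers = parseHeaders(new String[] {
            "HTTP/1.0 200 OK",
            "Content-Type: text/html",
            "Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT"
        });
        System.out.println(getHeaderField(headers, "Content-Type")); // prints text/html
    }
}
```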

getHeaderField

public java.lang.String getHeaderField(java.lang.String fieldName)
Gets a field value based on the key in the headers that are sent back from the server in response to a connection request. The headers are held in a hash table of key/value pairs.
Overrides:
getHeaderField in class java.net.URLConnection
Parameters:
fieldName - the attribute name in the header whose value is to be obtained
Returns:
the value of the field fieldName
See Also:
getHeaders(), HTTP RFC for headers

disconnect

public void disconnect()
Disconnects a previously established connection with the HTTP server.

getContentType

public java.lang.String getContentType()
Gets the content type of the resource. The type may be obtained from the HTTP header, derived from the file name extension, or guessed from the socket input stream. If no extension can be found, it returns "unknown". All content types are of the kind "http/". Note that this differs from the standard MIME types.
Overrides:
getContentType in class java.net.URLConnection
Returns:
a mime-type style string representation of the content type. All are of the form "http/".

getContent

public java.lang.Object getContent()
                            throws java.io.IOException
Retrieves the content of this URL connection.

This method determines if robots are allowed to crawl the object.

Overrides:
getContent in class java.net.URLConnection
Returns:
a DefaultResourceCollection object containing the retrieved object.
Throws:
java.io.IOException - if an I/O error occurs while getting the content
See Also:
URLConnection.getContent()

setIfModifiedSince

public void setIfModifiedSince(long ifmodifiedsince)
Calls super, then sets the value in the request header.
Overrides:
setIfModifiedSince in class java.net.URLConnection
Parameters:
ifmodifiedsince - the new value.
See Also:
URLConnection.getIfModifiedSince()
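
To show what placing a long timestamp into a request header can look like, the sketch below formats epoch milliseconds as an HTTP date and stores it in a map. The names IfModifiedSinceSketch, toHttpDate and requestHeaders are hypothetical, and the header name "If-Modified-Since" is assumed; the value of the IF_MODIFIED_SINCE constant is not given in this documentation.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the GCSHttpConnection implementation.
class IfModifiedSinceSketch {

    // Fixed-width HTTP date layout, always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    // Assumed stand-in for the connection's request header storage.
    final Map<String, String> requestHeaders = new HashMap<>();

    static String toHttpDate(long epochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    void setIfModifiedSince(long ifModifiedSince) {
        // Assumed header name; the documented constant's value is unknown.
        requestHeaders.put("If-Modified-Since", toHttpDate(ifModifiedSince));
    }

    public static void main(String[] args) {
        System.out.println(toHttpDate(0L)); // prints Thu, 01 Jan 1970 00:00:00 GMT
    }
}
```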

setRequestProperty

public void setRequestProperty(java.lang.String key,
                               java.lang.String value)
Description copied from URLConnection. Sets the general request property. If a property with the key already exists, overwrite its value with the new value.

HTTP requires all request properties that can legally have multiple instances with the same key to use a comma-separated list syntax, which enables multiple properties to be appended into a single property. Values are stored in a hashmap.

Overrides:
setRequestProperty in class java.net.URLConnection
Parameters:
key - the keyword by which the request is known (e.g., "accept").
value - the value associated with it.
See Also:
getRequestProperty(java.lang.String)
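
The overwrite behavior and the comma-separated list syntax can be illustrated with a small map-backed sketch. RequestPropertySketch and joinValues are hypothetical names; only the overwrite-on-same-key behavior and the hashmap storage come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the GCSHttpConnection implementation.
class RequestPropertySketch {

    private final Map<String, String> properties = new HashMap<>();

    // A later call with the same key replaces the earlier value.
    void setRequestProperty(String key, String value) {
        properties.put(key, value);
    }

    // Returns null when the key has never been set.
    String getRequestProperty(String key) {
        return properties.get(key);
    }

    // Helper for callers that need several values under one key:
    // combine them into a single comma-separated value first.
    static String joinValues(String... values) {
        return String.join(", ", values);
    }

    public static void main(String[] args) {
        RequestPropertySketch conn = new RequestPropertySketch();
        conn.setRequestProperty("accept", "text/html");
        conn.setRequestProperty("accept", joinValues("text/html", "application/xml"));
        System.out.println(conn.getRequestProperty("accept")); // prints text/html, application/xml
    }
}
```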

getRequestProperty

public java.lang.String getRequestProperty(java.lang.String key)
Description copied from URLConnection. Returns the value of the named general request property for this connection. Retrieves the value from a hashmap.
Overrides:
getRequestProperty in class java.net.URLConnection
Parameters:
key - the keyword by which the request is known (e.g., "accept").
Returns:
the value of the named general request property for this connection.
See Also:
setRequestProperty(java.lang.String, java.lang.String)

setAuthorization

public void setAuthorization(java.lang.String username,
                             char[] password)
If not connected, sets the authorization header according to the basic authentication scheme defined in RFC 2617.
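
For reference, the basic scheme in RFC 2617 sends the word "Basic" followed by the Base64 encoding of "username:password". The sketch below builds that header value with java.util.Base64. BasicAuthSketch and basicAuthorizationValue are hypothetical names, and the ISO-8859-1 byte encoding is an assumption, since RFC 2617 does not fix a character set for the credentials.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch; not the GCSHttpConnection implementation.
class BasicAuthSketch {

    // Builds the value of an Authorization header for the basic scheme.
    static String basicAuthorizationValue(String username, char[] password) {
        String credentials = username + ":" + new String(password);
        // Assumption: credentials are encoded as ISO-8859-1 bytes.
        byte[] raw = credentials.getBytes(StandardCharsets.ISO_8859_1);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // Credentials from the example in RFC 2617, section 2.
        System.out.println(basicAuthorizationValue("Aladdin", "open sesame".toCharArray()));
        // prints Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
    }
}
```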

guessContentTypeFromStream

public static java.lang.String guessContentTypeFromStream(java.io.InputStream is)
                                                   throws java.io.IOException
Guesses the content type from the stream. This is helpful in identifying XML and DTD content that is not sent with the right content type. Overrides the base class method to determine the content more accurately.
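
A minimal sketch of stream-based sniffing for XML and DTD content follows. It peeks at the first bytes, resets the stream, and otherwise defers to the standard java.net.URLConnection.guessContentTypeFromStream. The class name, the markers checked, and the returned type strings are all assumptions made for illustration; the documented method may use different rules and different type names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the GCSHttpConnection implementation.
class ContentSniffSketch {

    static String guess(InputStream is) throws IOException {
        if (!is.markSupported()) {
            return null; // cannot peek without consuming the stream
        }
        byte[] head = new byte[16];
        int filled = 0;
        is.mark(head.length);
        try {
            while (filled < head.length) {
                int n = is.read(head, filled, head.length - filled);
                if (n == -1) {
                    break;
                }
                filled += n;
            }
        } finally {
            is.reset(); // leave the stream positioned at its start
        }
        String prefix = new String(head, 0, filled, StandardCharsets.US_ASCII);
        // Assumed markers: declarations that normally appear only in a DTD.
        if (prefix.startsWith("<!ELEMENT") || prefix.startsWith("<!ATTLIST")
                || prefix.startsWith("<!ENTITY")) {
            return "application/xml-dtd";
        }
        if (prefix.startsWith("<?xml")) {
            return "application/xml";
        }
        // Anything else is left to the JDK's own heuristics.
        return URLConnection.guessContentTypeFromStream(is);
    }

    public static void main(String[] args) throws IOException {
        byte[] dtd = "<!ELEMENT note (to, from)>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess(new ByteArrayInputStream(dtd))); // prints application/xml-dtd
    }
}
```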

setRequestMethod

public void setRequestMethod(java.lang.String method)
                      throws java.net.ProtocolException
Sets the method that will be used for the HTTP transaction.
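
Since setRequestMethod declares java.net.ProtocolException and the class exposes a supportedRequestMethods array, one plausible reading is that the method name is checked against that array. The sketch below shows such a check. RequestMethodSketch is a hypothetical class, and the string values in SUPPORTED and the default of "GET" are assumptions; the actual values of HTTP_GET and the other constants are not listed in this documentation.

```java
import java.net.ProtocolException;

// Hypothetical sketch; not the GCSHttpConnection implementation.
class RequestMethodSketch {

    // Assumed values; the documented constants do not list their contents.
    static final String[] SUPPORTED = {
        "GET", "POST", "HEAD", "PUT", "DELETE", "TRACE", "OPTIONS", "CONNECT"
    };

    private String method = "GET"; // assumed default

    // Accepts only methods found in SUPPORTED; rejects everything else.
    void setRequestMethod(String requested) throws ProtocolException {
        for (String candidate : SUPPORTED) {
            if (candidate.equals(requested)) {
                method = requested;
                return;
            }
        }
        throw new ProtocolException("Unsupported HTTP method: " + requested);
    }

    String getRequestMethod() {
        return method;
    }

    public static void main(String[] args) throws ProtocolException {
        RequestMethodSketch conn = new RequestMethodSketch();
        conn.setRequestMethod("HEAD");
        System.out.println(conn.getRequestMethod()); // prints HEAD
    }
}
```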

getRequestMethod

public java.lang.String getRequestMethod()
Returns the current transaction method

setOutContent

public void setOutContent(java.lang.String outContent)
Sets the content sent during a transaction. Not all protocols use the content. Note that the content should not be followed by CRLF; this is forbidden by HTTP/1.1.

getOutContent

public java.lang.String getOutContent()
Returns the current outContent.

outContentIsEmpty

public boolean outContentIsEmpty()
Tells if the outContent is empty

robotsAllowed

public boolean robotsAllowed()
Checks if robots are allowed.

getRobotsProcessor

public RobotsProcessor getRobotsProcessor()
Get the RobotsProcessor object (if already set up) for this connection

checkIfAllowedToCrawl

public boolean checkIfAllowedToCrawl(java.net.URL u)

EIP Web Crawler APIs

(c) Copyright International Business Machines Corporation 1996, 2002. IBM Corp. All rights reserved.