Enterprise Information Portal APIs

com.ibm.gcs.db.component
Class DB2URLCollection

java.lang.Object
  |
  +--com.ibm.gcs.db.component.DB2URLCollection
All Implemented Interfaces:
com.ibm.gcs.urlpool.URLCollection

public class DB2URLCollection
extends java.lang.Object
implements com.ibm.gcs.urlpool.URLCollection

This URL collection class enqueues URLs based on depth. By default, it does not save any annotation information. For maximum performance, make sure that there is an index on urlpoolstable(state_id, hide, time).
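
The index recommended above could be created once at setup time. The following is a minimal sketch using plain JDBC, not part of the GCS API: the class name UrlPoolIndexSetup and the index name urlpool_state_idx are hypothetical, and only the table and column names come from the description above.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of the GCS API.
class UrlPoolIndexSetup {
    // The index name is an assumption; the table and columns come from the class description.
    static final String CREATE_INDEX_SQL =
        "CREATE INDEX urlpool_state_idx ON urlpoolstable (state_id, hide, time)";

    // Runs the DDL statement on an already opened JDBC connection.
    static void createIndex(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            statement.executeUpdate(CREATE_INDEX_SQL);
        }
    }
}
```

The column order mirrors the queries documented for size() and get(), which filter on STATE_ID and HIDE and order by time.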


Fields inherited from interface com.ibm.gcs.urlpool.URLCollection
copyright
 
Constructor Summary
DB2URLCollection()
          Default Constructor.
DB2URLCollection(java.util.Hashtable args)
          Constructor.
DB2URLCollection(URLPoolConfig.Pair[] args)
          Constructor.
 
Method Summary
 void cleanup()
          Performs cleanup operations on the collection, such as writing caches.
 com.ibm.gcs.urlpool.URLContainer get()
          Returns the next URL to be crawled.
 com.ibm.gcs.urlpool.URLContainer get(com.ibm.gcs.util.jdp.UnaryPredicate predicate)
          Gets the next URL from the collection that satisfies a particular predicate based on the hashing scheme used.
 boolean isEmpty()
          Returns true if there are no more URLs in the database pool of URLs to be crawled, false otherwise.
 int mySize()
          Returns the number of URLs currently in this collection's cache.
 void put(com.ibm.gcs.urlpool.URLContainer urlC)
          Adds a URLContainer object into the database pool of URLs to be crawled, if the URL has not already been visited and is not waiting to be crawled.
 void put(com.ibm.gcs.urlpool.URLContainer[] urlCArray)
          Adds an array of URLContainer objects into the database pool of URLs.
 int size()
          Returns the total number of visible URLs from the database pool of URLs to be crawled.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DB2URLCollection

public DB2URLCollection()
Default Constructor. By default, this collection uses a cache size of 100 and does not keep annotation information.

DB2URLCollection

public DB2URLCollection(URLPoolConfig.Pair[] args)
Constructor.
Parameters:
args - The arguments to use for this collection. These arguments should be name/value pairs that specify values for cachesize, driver, dbname, user, password, keepannotations, and isdistributed.

DB2URLCollection

public DB2URLCollection(java.util.Hashtable args)
Constructor.
Parameters:
args - The arguments to use for this collection. These arguments should be name/value pairs that specify values for cachesize, driver, dbname, user, password, keepannotations, and isdistributed. This constructor uses a Hashtable to pass parameters. It can therefore be used from the outside. This is for test purposes only and is NOT included in the official version of the GCS.
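
As an illustration of the name/value pairs listed above, the following sketch builds a Hashtable with one entry per documented argument name. The class CollectionArgsExample is hypothetical, every value is a placeholder, and passing the values as strings is an assumption, since this page does not state the expected value types.

```java
import java.util.Hashtable;

// Hypothetical helper, not part of the GCS API.
class CollectionArgsExample {
    // Builds one entry per argument name listed for the constructor.
    // All values are placeholders, and using String values is an assumption.
    static Hashtable<String, String> buildArgs() {
        Hashtable<String, String> args = new Hashtable<>();
        args.put("cachesize", "100");              // matches the documented default cache size
        args.put("driver", "example.jdbc.Driver"); // placeholder driver class name
        args.put("dbname", "crawldb");             // placeholder database name
        args.put("user", "crawler");               // placeholder user
        args.put("password", "secret");            // placeholder password
        args.put("keepannotations", "false");
        args.put("isdistributed", "false");
        return args;
    }
}
```

The resulting table would then be passed as new DB2URLCollection(args) in a test setup where this constructor is available.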
Method Detail

size

public int size()
Returns the total number of visible URLs from the database pool of URLs to be crawled.
If the crawl is distributed, this method uses DB2Queue to execute the following SQL query:
  SELECT COUNT(*)
  FROM urlpoolstable
  WHERE urlpoolstable.STATE_ID=1 AND urlpoolstable.HIDE=0
  

Otherwise, returns the size kept in memory.
Specified by:
size in interface com.ibm.gcs.urlpool.URLCollection
Returns:
int The total number of URLs in the database pool of URLs to be crawled.
Throws:
DB2ComponentException - SQL error caused query to fail.
See Also:
DB2Queue
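
The two code paths described for size() can be sketched as follows. This is not the actual implementation: the class SizeSketch is hypothetical, and an IntSupplier stands in for DB2Queue executing the COUNT(*) query, whose real interface is not shown on this page.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch, not the actual DB2URLCollection implementation.
class SizeSketch {
    // The query text is taken from the method description.
    static final String COUNT_SQL =
        "SELECT COUNT(*) FROM urlpoolstable "
        + "WHERE urlpoolstable.STATE_ID=1 AND urlpoolstable.HIDE=0";

    private final boolean distributed;
    private final IntSupplier databaseCount; // stands in for DB2Queue running COUNT_SQL
    private final int inMemorySize;          // size tracked locally by the collection

    SizeSketch(boolean distributed, IntSupplier databaseCount, int inMemorySize) {
        this.distributed = distributed;
        this.databaseCount = databaseCount;
        this.inMemorySize = inMemorySize;
    }

    // Distributed crawl: ask the database. Otherwise: use the in-memory count.
    int size() {
        return distributed ? databaseCount.getAsInt() : inMemorySize;
    }
}
```

Querying the database only in the distributed case is consistent with the description: when several crawlers share one pool, a locally kept count cannot reflect changes made by the others.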

mySize

public int mySize()
Returns the number of URLs currently in this collection's cache.
Returns:
int - The current number of URLs in this collection's cache.

isEmpty

public boolean isEmpty()
Returns true if there are no more URLs in the database pool of URLs to be crawled, false otherwise.
Specified by:
isEmpty in interface com.ibm.gcs.urlpool.URLCollection
Returns:
boolean True if database collection is empty, false otherwise.
Throws:
java.lang.RuntimeException - SQL error caused query to fail.

get

public com.ibm.gcs.urlpool.URLContainer get()
Returns the next URL to be crawled. The URL is returned as a DB2URLContainer.

This method first checks the cache. If the cache is empty it loads the next set of URLs into the cache through DB2Queue. DB2Queue executes the following SQL query:

  SELECT *
  FROM urlpoolstable
  WHERE urlpoolstable.STATE_ID=1 AND urlpoolstable.HIDE=0
  ORDER BY time
 
It returns the next URL from the cache as a DB2URLContainer.
Specified by:
get in interface com.ibm.gcs.urlpool.URLCollection
Returns:
URLContainer The next URL to be crawled.
See Also:
URLContainer
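
The cache-then-reload behavior described for get() can be sketched as follows. The class CachedUrlSource is hypothetical, plain strings stand in for DB2URLContainer objects, and a Supplier stands in for DB2Queue loading the next set of URLs. Returning null when nothing is left is an assumption, since this page does not say what get() returns for an empty pool.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch, not the actual DB2URLCollection implementation.
class CachedUrlSource {
    private final Deque<String> cache = new ArrayDeque<>();
    private final Supplier<List<String>> loader; // stands in for DB2Queue

    CachedUrlSource(Supplier<List<String>> loader) {
        this.loader = loader;
    }

    // Returns the next URL, reloading the cache only when it is empty.
    // Returns null if the loader has no more URLs (an assumption).
    String get() {
        if (cache.isEmpty()) {
            cache.addAll(loader.get());
        }
        return cache.poll();
    }

    // Number of URLs currently held in the cache, as mySize() reports.
    int mySize() {
        return cache.size();
    }
}
```

With a loader that returns two URLs on its first call and an empty list afterwards, the first two get() calls trigger a single load and the third triggers a second load that yields nothing.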

get

public com.ibm.gcs.urlpool.URLContainer get(com.ibm.gcs.util.jdp.UnaryPredicate predicate)
Gets the next URL from the collection that satisfies a particular predicate based on the hashing scheme used.
Specified by:
get in interface com.ibm.gcs.urlpool.URLCollection
Parameters:
predicate - a unary predicate object
Returns:
URLContainer The next URL to be crawled.
See Also:
URLContainer, UnaryPredicate
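
Predicate-based retrieval can be illustrated with the sketch below. It does not reproduce the hashing scheme mentioned above, which this page does not describe. The class PredicateGetSketch is hypothetical, plain strings stand in for URLContainer objects, and java.util.function.Predicate stands in for com.ibm.gcs.util.jdp.UnaryPredicate, whose methods are not shown here.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not the actual DB2URLCollection implementation.
class PredicateGetSketch {
    private final List<String> cache = new LinkedList<>();

    void add(String url) {
        cache.add(url);
    }

    // Removes and returns the first cached URL accepted by the predicate,
    // or null if no cached URL matches (an assumption).
    String get(Predicate<String> predicate) {
        Iterator<String> it = cache.iterator();
        while (it.hasNext()) {
            String url = it.next();
            if (predicate.test(url)) {
                it.remove();
                return url;
            }
        }
        return null;
    }
}
```

A caller could use such a predicate to let each crawler take only the URLs assigned to it, for example by host name.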

put

public void put(com.ibm.gcs.urlpool.URLContainer[] urlCArray)
         throws DB2ComponentException
Adds an array of URLContainer objects into the database pool of URLs. If a URL has already been visited or is waiting to be crawled, it is not put into the queue.

Calls put( URLContainer, Transaction ) to determine if the URL must be updated. If the method returns true, saves the changes for each URL to the database. It is more efficient to put URLs into the pool as a group than singly.

Specified by:
put in interface com.ibm.gcs.urlpool.URLCollection
Parameters:
urlCArray - An array of urls to add to the database pool
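
The group behavior described above can be sketched as follows. This is an in-memory illustration only: the class BatchPutSketch is hypothetical, plain strings stand in for URLContainer objects, and saving once per group is an assumption about why putting URLs as a group is more efficient, not a statement about the actual implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the actual DB2URLCollection implementation.
class BatchPutSketch {
    private final Set<String> known = new HashSet<>(); // URLs visited or waiting to be crawled
    private int saves = 0;                             // counts simulated database saves

    // Adds every URL that is not already known and saves the group once.
    // Returns the URLs that were actually added.
    List<String> put(String[] urls) {
        List<String> added = new ArrayList<>();
        for (String url : urls) {
            if (known.add(url)) { // add() returns false for a URL seen before
                added.add(url);
            }
        }
        if (!added.isEmpty()) {
            saves++; // one save for the whole group (assumption)
        }
        return added;
    }

    int saves() {
        return saves;
    }
}
```

Under this assumption, putting three URLs in one call costs one save, whereas three single put calls would cost three.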

cleanup

public void cleanup()
Description copied from interface: com.ibm.gcs.urlpool.URLCollection
Performs cleanup operations on the collection, such as writing caches. Call this method at the end of use to ensure proper cleanup of the URLCollection. (Implementations such as DB2URLCollection need this cleanup method.)
Specified by:
cleanup in interface com.ibm.gcs.urlpool.URLCollection
Following copied from interface: com.ibm.gcs.urlpool.URLCollection
See Also:
DB2URLCollection
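
One way to make sure cleanup() runs at the end of use is a try/finally block, sketched below. The class CleanupUsageSketch and its nested UrlSource interface are hypothetical stand-ins for the parts of URLCollection used here, and plain strings stand in for URLContainer objects.

```java
// Hypothetical sketch, not part of the GCS API.
class CleanupUsageSketch {
    // Minimal stand-in for the URLCollection methods used below.
    interface UrlSource {
        boolean isEmpty();
        String get();
        void cleanup();
    }

    // Drains the source and always calls cleanup(), even if processing fails.
    // Returns the number of URLs taken from the source.
    static int crawlAll(UrlSource urls) {
        int processed = 0;
        try {
            while (!urls.isEmpty()) {
                urls.get(); // a real crawler would fetch and process the URL here
                processed++;
            }
        } finally {
            urls.cleanup(); // runs on normal completion and on exceptions
        }
        return processed;
    }
}
```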

put

public void put(com.ibm.gcs.urlpool.URLContainer urlC)
Adds a URLContainer object into the database pool of URLs to be crawled, if the URL has not already been visited and is not waiting to be crawled.

Calls put( URLContainer, Transaction ) to determine if the URL must be updated in the table. If the method returns true, saves the changes to the database.

Specified by:
put in interface com.ibm.gcs.urlpool.URLCollection
Parameters:
urlC - The URL to add to the database pool

EIP Web Crawler APIs

(c) Copyright International Business Machines Corporation 1996, 2002. IBM Corp. All rights reserved.