Enterprise Information Portal APIs

com.ibm.gcs.db.component
Class PriorityDB2URLCollection

java.lang.Object
  |
  +--com.ibm.gcs.db.component.PriorityDB2URLCollection
All Implemented Interfaces:
com.ibm.gcs.urlpool.URLCollection

public abstract class PriorityDB2URLCollection
extends java.lang.Object
implements com.ibm.gcs.urlpool.URLCollection

This URL collection class enqueues URLs by priority group and loads URLs from the database according to these groups. A call to get() returns a URL of the highest available priority, checking sources in this order: priority n from the cache, priority n from the database, priority n-1 from the cache, priority n-1 from the database, and so on. Extending classes must override getPrioritizer(), which provides a Prioritizer implementation to rank the URLs.
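The retrieval order described above can be sketched as follows. This is a minimal illustration only, not the actual implementation: the class PriorityOrderSketch, its in-memory maps standing in for the cache and the database, and the use of plain strings instead of URLContainer objects are all assumptions made for the example. It also assumes that a larger number means a higher priority.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in only; not the PriorityDB2URLCollection code. */
class PriorityOrderSketch {
    // Hypothetical in-memory stand-ins for the cache and the database,
    // each keyed by priority group.
    private final Map<Integer, Deque<String>> cache = new HashMap<>();
    private final Map<Integer, Deque<String>> database = new HashMap<>();
    private final int maxPriority;

    PriorityOrderSketch(int maxPriority) {
        this.maxPriority = maxPriority;
    }

    void addToCache(int priority, String url) {
        cache.computeIfAbsent(priority, k -> new ArrayDeque<>()).add(url);
    }

    void addToDatabase(int priority, String url) {
        database.computeIfAbsent(priority, k -> new ArrayDeque<>()).add(url);
    }

    /** Returns the next URL in the documented order, or null if none is left. */
    String get() {
        for (int p = maxPriority; p >= 0; p--) {
            String url = poll(cache, p);      // priority p from the cache
            if (url == null) {
                url = poll(database, p);      // then priority p from the database
            }
            if (url != null) {
                return url;
            }
        }
        return null;
    }

    private static String poll(Map<Integer, Deque<String>> store, int priority) {
        Deque<String> queue = store.get(priority);
        return (queue == null) ? null : queue.poll();
    }
}
```

In this sketch a priority-2 URL held only in the database is still returned before any priority-1 URL in the cache, which matches the order given in the class description.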


Field Summary
static boolean debug
           
static long FREQ
           
static java.lang.String TIME
           
 
Fields inherited from interface com.ibm.gcs.urlpool.URLCollection
copyright
 
Constructor Summary
PriorityDB2URLCollection()
          Default Constructor.
 
Method Summary
 com.ibm.gcs.urlpool.URLContainer get()
          Returns the next URL to be crawled as a DB2URLContainer object.
 com.ibm.gcs.urlpool.URLContainer get(com.ibm.gcs.util.jdp.UnaryPredicate predicate)
          Gets the next URL from the collection that satisfies a particular predicate based on the hashing scheme used.
abstract  Prioritizer getPrioritizer()
          Returns the prioritizer for this class.
 boolean isEmpty()
          Returns true if there are no more URLs in the database pool of URLs to be crawled, false otherwise.
 int mySize()
          Returns the number of URLs currently in this collection's cache (includes all priorities).
 void put(com.ibm.gcs.urlpool.URLContainer urlC)
          Adds a URLContainer object into the database pool of URLs to be crawled if the URL has not already been visited.
 void put(com.ibm.gcs.urlpool.URLContainer[] urlCArray)
          Adds an array of URLContainer objects into the database pool of URLs according to subclass implementation.
 int size()
          Returns the total number of visible URLs from the database pool of URLs to be crawled.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface com.ibm.gcs.urlpool.URLCollection
cleanup
 

Field Detail

debug

public static boolean debug

TIME

public static final java.lang.String TIME

FREQ

public static final long FREQ
Constructor Detail

PriorityDB2URLCollection

public PriorityDB2URLCollection()
                         throws java.lang.Exception
Default Constructor.
Method Detail

size

public int size()
Returns the total number of visible URLs from the database pool of URLs to be crawled.
This method executes the following SQL query:
SELECT COUNT(*)
FROM urlpoolstable
WHERE urlpoolstable.STATE_ID=1 AND urlpoolstable.HIDE=0
Specified by:
size in interface com.ibm.gcs.urlpool.URLCollection
Returns:
int - The total number of URLs in the database pool of URLs to be crawled.
Throws:
DB2ComponentException - SQL error caused query to fail.

mySize

public int mySize()
Returns the number of URLs currently in this collection's cache (includes all priorities).
Returns:
int - The current number of URLs in this collection's cache.

isEmpty

public boolean isEmpty()
Returns true if there are no more URLs in the database pool of URLs to be crawled, false otherwise.
Specified by:
isEmpty in interface com.ibm.gcs.urlpool.URLCollection
Returns:
boolean - True if database collection is empty, false otherwise.
Throws:
java.lang.RuntimeException - SQL error caused query to fail.

get

public com.ibm.gcs.urlpool.URLContainer get()
Returns the next URL to be crawled as a DB2URLContainer object. This method returns the URLs in order of priority. If the cache is empty, it loads the URLs from the database.
Specified by:
get in interface com.ibm.gcs.urlpool.URLCollection
Returns:
a DB2URLContainer object
Throws:
java.lang.RuntimeException -  
See Also:
URLContainer

get

public com.ibm.gcs.urlpool.URLContainer get(com.ibm.gcs.util.jdp.UnaryPredicate predicate)
Gets the next URL from the collection that satisfies a particular predicate based on the hashing scheme used.
Specified by:
get in interface com.ibm.gcs.urlpool.URLCollection
Parameters:
predicate - a unary predicate object
Returns:
a URLContainer object
See Also:
URLContainer, UnaryPredicate
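A predicate-based lookup of this kind might look like the sketch below. Everything in it is an assumption made for illustration: java.util.function.Predicate stands in for com.ibm.gcs.util.jdp.UnaryPredicate, plain strings stand in for URLContainer objects, the inBucket helper is a hypothetical hashing scheme, and the sketch assumes that a returned URL is removed from the collection.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative stand-in only; not the documented get(UnaryPredicate) code. */
class PredicateGetSketch {
    private final List<String> urls = new LinkedList<>();

    void put(String url) {
        urls.add(url);
    }

    /** Removes and returns the first queued URL that satisfies the predicate, or null. */
    String get(Predicate<String> predicate) {
        Iterator<String> it = urls.iterator();
        while (it.hasNext()) {
            String url = it.next();
            if (predicate.test(url)) {
                it.remove();
                return url;
            }
        }
        return null;
    }

    /**
     * Hypothetical hashing scheme: a URL belongs to the bucket given by
     * its hash code modulo the number of buckets.
     */
    static Predicate<String> inBucket(int bucket, int bucketCount) {
        return url -> Math.floorMod(url.hashCode(), bucketCount) == bucket;
    }
}
```

A crawler thread responsible for one bucket could then call get(PredicateGetSketch.inBucket(0, 2)) to receive only URLs that hash to its bucket.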

put

public void put(com.ibm.gcs.urlpool.URLContainer[] urlCArray)
         throws DB2ComponentException
Adds an array of URLContainer objects into the database pool of URLs according to subclass implementation. For putting many URLs into the pool, this method is more efficient than put(URLContainer).
Specified by:
put in interface com.ibm.gcs.urlpool.URLCollection
Parameters:
urlCArray - An array of URLs to add to the database pool
See Also:
URLContainer, DB2URLContainer

put

public void put(com.ibm.gcs.urlpool.URLContainer urlC)
Adds a URLContainer object into the database pool of URLs to be crawled if the URL has not already been visited.
Specified by:
put in interface com.ibm.gcs.urlpool.URLCollection
Parameters:
urlC - The URL to add to the database pool
See Also:
URLContainer, DB2URLContainer

getPrioritizer

public abstract Prioritizer getPrioritizer()
                                    throws TransactionException
Returns the prioritizer for this class. This method is the first thing called in the constructor.
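Because the constructor calls this overridable method before any subclass state is initialized, an override should probably not depend on the subclass's instance fields. The sketch below illustrates this general Java behavior with hypothetical classes (BaseSketch, FieldBackedSketch, ConstantBackedSketch) and a string in place of a Prioritizer; none of these names belong to the API.

```java
/** Hypothetical base class that calls an abstract method first in its constructor. */
abstract class BaseSketch {
    final String prioritizerName;

    BaseSketch() {
        // Mirrors the documented behavior: the overridable method runs
        // before any subclass field initializer has executed.
        prioritizerName = getPrioritizerName();
    }

    abstract String getPrioritizerName();
}

/** Override that reads an instance field: the field is still null when called. */
class FieldBackedSketch extends BaseSketch {
    // Assigned only after BaseSketch() returns.
    private String name = "by-depth";

    @Override
    String getPrioritizerName() {
        return name;
    }
}

/** Override that does not depend on instance state: safe to call from the constructor. */
class ConstantBackedSketch extends BaseSketch {
    @Override
    String getPrioritizerName() {
        return "by-depth";
    }
}
```

Under these assumptions, new FieldBackedSketch().prioritizerName is null, while new ConstantBackedSketch().prioritizerName holds the expected value. An implementation of getPrioritizer() that builds its result without reading subclass fields avoids the problem.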

EIP Web Crawler APIs

(c) Copyright International Business Machines Corporation 1996, 2002. IBM Corp. All rights reserved.