com.ibm.gcs.db.component
Class PriorityDB2URLCollection
java.lang.Object
|
+--com.ibm.gcs.db.component.PriorityDB2URLCollection
- All Implemented Interfaces:
- com.ibm.gcs.urlpool.URLCollection
- public abstract class PriorityDB2URLCollection
- extends java.lang.Object
- implements com.ibm.gcs.urlpool.URLCollection
This URL collection class enqueues URLs based on priority group.
It loads URLs from the database according to these groups.
A call to get() returns the highest-priority URL available, checking sources in this order:
priority n from cache, priority n from database,
priority n-1 from cache, priority n-1 from database, and so on.
Extending classes must override getPrioritizer(),
which provides the Prioritizer implementation used to
rank the URLs.
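The lookup order described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the class PriorityLookupSketch, its two in-memory maps standing in for the cache and the database, and the assumption that priorities run from 0 up to n with n the highest. The real class presumably refills its cache from the database rather than reading the database directly, so this only shows the ordering, not the actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the documented lookup order; not the IBM class.
class PriorityLookupSketch {
    private final int maxPriority; // assumed: priorities are 0..maxPriority, highest first
    private final Map<Integer, Deque<String>> cache = new HashMap<>();
    private final Map<Integer, Deque<String>> database = new HashMap<>();

    PriorityLookupSketch(int maxPriority) {
        this.maxPriority = maxPriority;
    }

    void addToCache(int priority, String url) {
        enqueue(cache, priority, url);
    }

    void addToDatabase(int priority, String url) {
        enqueue(database, priority, url);
    }

    // Walks priorities from highest to lowest; at each level the cache is
    // consulted before the database. Returns null when nothing is left.
    String get() {
        for (int p = maxPriority; p >= 0; p--) {
            String url = poll(cache, p);
            if (url == null) {
                url = poll(database, p);
            }
            if (url != null) {
                return url;
            }
        }
        return null;
    }

    private static void enqueue(Map<Integer, Deque<String>> store, int priority, String url) {
        store.computeIfAbsent(priority, k -> new ArrayDeque<>()).addLast(url);
    }

    private static String poll(Map<Integer, Deque<String>> store, int priority) {
        Deque<String> queue = store.get(priority);
        return (queue == null) ? null : queue.pollFirst();
    }
}
```

With one URL at each of priorities 2 and 1 in both stores, successive get() calls would drain the priority-2 cache entry, then the priority-2 database entry, before touching priority 1.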
Field Summary
static boolean debug
static long FREQ
static java.lang.String TIME
Fields inherited from interface com.ibm.gcs.urlpool.URLCollection
copyright
Method Summary
com.ibm.gcs.urlpool.URLContainer get()
- Returns the next URL to be crawled as a DB2Container object.
com.ibm.gcs.urlpool.URLContainer get(com.ibm.gcs.util.jdp.UnaryPredicate predicate)
- Gets the next URL from the collection that satisfies a particular predicate based on the hashing scheme used.
abstract Prioritizer getPrioritizer()
- Returns the prioritizer for this class.
boolean isEmpty()
- Returns true if there are no more URLs in the database pool of URLs to be crawled, false otherwise.
int mySize()
- Returns the number of URLs currently in this collection's cache (includes all priorities).
void put(com.ibm.gcs.urlpool.URLContainer urlC)
- Adds a URLContainer object into the database pool of URLs to be crawled if the URL has not already been visited.
void put(com.ibm.gcs.urlpool.URLContainer[] urlCArray)
- Adds an array of URLContainer objects into the database pool of URLs according to subclass implementation.
int size()
- Returns the total number of visible URLs from the database pool of URLs to be crawled.
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.ibm.gcs.urlpool.URLCollection
cleanup
debug
public static boolean debug
TIME
public static final java.lang.String TIME
FREQ
public static final long FREQ
PriorityDB2URLCollection
public PriorityDB2URLCollection()
throws java.lang.Exception
- Default Constructor.
size
public int size()
- Returns the total number of visible URLs from the database
pool of URLs to be crawled.
This method executes the following SQL query:
SELECT COUNT(*)
FROM urlpoolstable
WHERE urlpoolstable.STATE_ID=1 AND urlpoolstable.HIDE=0
- Specified by:
size
in interface com.ibm.gcs.urlpool.URLCollection
- Returns:
- int - The total number of URLs in the database
pool of URLs to be crawled.
- Throws:
DB2ComponentException
- SQL error caused query to fail.
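As a rough illustration of how the query above might be issued, here is a minimal JDBC sketch. The class VisibleUrlCountSketch and the method countVisible are hypothetical names, and the sketch assumes the caller supplies an open java.sql.Connection; how the actual class obtains its connection, and how it wraps SQL failures in DB2ComponentException, is not shown in this documentation.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper; only the SQL text is taken from the documentation.
class VisibleUrlCountSketch {
    static final String COUNT_VISIBLE_SQL =
        "SELECT COUNT(*) FROM urlpoolstable "
        + "WHERE urlpoolstable.STATE_ID=1 AND urlpoolstable.HIDE=0";

    // Runs the count query on a caller-supplied connection and returns the
    // single integer result. Statement and ResultSet are closed automatically.
    static int countVisible(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(COUNT_VISIBLE_SQL)) {
            return rows.next() ? rows.getInt(1) : 0;
        }
    }
}
```

Because the query runs against the database on each call, size() is likely more expensive than mySize(), which only inspects the in-memory cache.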
mySize
public int mySize()
- Returns the number of URLs currently in this collection's
cache (includes all priorities).
- Returns:
- int - The current number of URLs in this collection's
cache.
isEmpty
public boolean isEmpty()
- Returns true if there are no more URLs in the database
pool of URLs to be crawled, false otherwise.
- Specified by:
isEmpty
in interface com.ibm.gcs.urlpool.URLCollection
- Returns:
- boolean - True if database collection is empty,
false otherwise.
- Throws:
java.lang.RuntimeException
- SQL error caused query to fail.
get
public com.ibm.gcs.urlpool.URLContainer get()
- Returns the next URL to be crawled as a DB2Container
object.
This method returns the URLs in order of priority.
If the cache is empty, it loads the URLs from the database.
- Specified by:
get
in interface com.ibm.gcs.urlpool.URLCollection
- Returns:
- a DB2URLContainer object
- Throws:
java.lang.RuntimeException
- See Also:
URLContainer
get
public com.ibm.gcs.urlpool.URLContainer get(com.ibm.gcs.util.jdp.UnaryPredicate predicate)
- Gets the next URL from the collection that satisfies a
particular predicate based on the hashing scheme used.
- Specified by:
get
in interface com.ibm.gcs.urlpool.URLCollection
- Parameters:
predicate
- a unary predicate object
- Returns:
- a URLContainer object
- See Also:
URLContainer, UnaryPredicate
put
public void put(com.ibm.gcs.urlpool.URLContainer[] urlCArray)
throws DB2ComponentException
- Adds an array of URLContainer objects into the database pool of
URLs according to subclass implementation. For putting many
URLs into the pool, this method is more efficient than
put(URLContainer).
- Specified by:
put
in interface com.ibm.gcs.urlpool.URLCollection
- Parameters:
urlCArray
- An array of URLs to add to the database pool
- See Also:
URLContainer, DB2URLContainer
put
public void put(com.ibm.gcs.urlpool.URLContainer urlC)
- Adds a URLContainer object into the database pool of
URLs to be crawled if the URL has not already been visited.
- Specified by:
put
in interface com.ibm.gcs.urlpool.URLCollection
- Parameters:
urlC
- The URL to add to the database pool
- See Also:
URLContainer, DB2URLContainer
getPrioritizer
public abstract Prioritizer getPrioritizer()
throws TransactionException
- Returns the prioritizer for this class. This method
is the first thing called in the constructor.
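Because getPrioritizer() runs inside the superclass constructor, a subclass implementation executes before the subclass's own instance fields have been initialized. The sketch below illustrates that general Java behavior with entirely hypothetical types (SketchPrioritizer, SketchCollection, and the two subclasses); it does not reproduce the real Prioritizer interface, whose methods are not shown in this documentation.

```java
// Hypothetical ranking interface, used only for this illustration.
interface SketchPrioritizer {
    int rank(String url);
}

// Mirrors the documented pattern: the abstract hook is called first in the constructor.
abstract class SketchCollection {
    final SketchPrioritizer prioritizer;

    SketchCollection() {
        prioritizer = getPrioritizer();
    }

    abstract SketchPrioritizer getPrioritizer();
}

// Pitfall: the field initializer runs only after the superclass constructor
// returns, so the hook sees null when it is invoked from that constructor.
class FieldBackedCollection extends SketchCollection {
    private SketchPrioritizer configured = url -> url.length();

    @Override
    SketchPrioritizer getPrioritizer() {
        return configured;
    }
}

// Safer: the hook builds its result without reading subclass instance state.
class SelfContainedCollection extends SketchCollection {
    @Override
    SketchPrioritizer getPrioritizer() {
        return url -> url.length();
    }
}
```

In this sketch, new FieldBackedCollection().prioritizer is null, while new SelfContainedCollection().prioritizer is usable immediately. An override of getPrioritizer() is therefore safest when it does not depend on fields assigned in the subclass.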
******************************************************************
(c) Copyright International Business Machines Corporation 1996, 2002. IBM Corp. All rights reserved.