Enterprise Information Portal APIs

Package com.ibm.gcs.db.component

Provides database implementations for GCS components that cause the gatherer to write to and read from a persistent URL pool.


Interface Summary
DB2Annotation The DB2Annotation interface specifies persistence methods for Annotation objects which enable the object to be read from and written to the database.
DB2AnnotationFactory DB2AnnotationFactory defines methods to create instances of DB2Annotations.
Prioritizer This interface specifies the priority for a URL to be crawled.
 

Class Summary
ConfigTableDef This class defines the constants for the database table CONFIGTABLE.
CrawlPatternPool  
DB2AnnotationHelper DB2AnnotationHelper provides methods to produce DB2Annotation related SQL strings for INSERT, UPDATE, and SELECT operations, and methods that call a DB2AnnotationFactory to reconstruct DB2Annotation objects from data retrieved from the database.
DB2AnnotationsList A DB2AnnotationsList represents a list of all DB2Annotations associated with a specified annotatee.
DB2ConfigTable  
DB2DescriptionAnnotation The DB2DescriptionAnnotation class is the default annotation type. It extends DescriptionAnnotation by implementing methods that enable it to be written to and read from a database table.
DB2DictionaryAnnotation The DB2DictionaryAnnotation class extends DictionaryAnnotation by implementing methods that enable it to be written to and read from a database table.
DB2HiddenPool DB2HiddenPool represents all the URLs in the database which have not been and should not be crawled.
DB2HiddenQueue DB2HiddenQueue represents the URLs in the database which must be crawled but have their hide flag set to true.
DB2Pool DB2Pool is a view or abstract representation of a set of URLs in the url database table.
DB2PriorityQueue DB2PriorityQueue represents the URLs in the database which must be crawled and belong to the specified priority group.
DB2Queue DB2Queue represents the URLs in the database which must be crawled.
DB2RevisitQueue DB2RevisitQueue represents the URLs in the database which must be recrawled.
DB2StatesDef  
DB2TableAdmin  
DB2URLCollection This URL collection class enqueues URLs based on depth. By default, it does not save any annotation information.
DB2URLContainer A DB2URLContainer object provides access to the crawl information for a URL which is stored in DB2 relations.
DB2URLRow A DB2URLRow represents a record in the URLCRAWLTABLE.
DB2VisitedPool DB2VisitedPool represents all the URLs in the database which have already been visited.
DefaultDB2AnnotationFactory DefaultDB2AnnotationFactory provides methods to produce DB2Annotation objects from data retrieved from the database.
LinksTableDef This class provides the constants for column names of the database table LINKS_TABLE.
PriorityDB2URLCollection This URL collection class enqueues URLs based on priority group.
UrlCrawlTableDef This class provides the constants for the column names of the table containing the URLs and crawl info, URLCRAWLTABLE. It also provides a method to construct the SQL string to create the URLCRAWLTABLE.
 

Exception Summary
DB2ComponentException  
ImplementationException  
 

Package com.ibm.gcs.db.component Description

Provides database implementations for GCS components that cause the gatherer to write to and read from a persistent URL pool. The user may configure the gatherer to use this package's classes by setting the appropriate property values in the globals part of the GCS configuration file. The two relevant properties are url-collection-class and urlcontainer-class in url-pool-config.  The values of these properties will be the full Java class names for the chosen implementation (as indicated below).  Using the concrete implementation provided by this package, the gatherer saves and loads URLs to and from relational DB2 tables.  

With the DB2 implementations, the URLs to crawl are limited to 250 characters.  The reason is that the URL is used as a primary key in the database tables, and DB2 requires that this key be less than 250 characters.  If the gatherer encounters a URL that exceeds this limit, it rejects the URL and logs it as an exception.  Additionally, the JVM heap size should be set to about 250M with the java option -Xmx to avoid an OutOfMemoryError.  (See http://java.sun.com/products/jdk/1.2/docs/tooldocs/win32/java.html#options)  Also make sure that the DB2 transaction log size and the number of log files are sufficient (as described in How to configure GCS to crawl with DB2).
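
The length restriction can be checked before a URL is handed to the gatherer. The following is a minimal sketch and is not part of this package: the class UrlKeyLimit, the constant MAX_URL_CHARS, and the method fitsPrimaryKey are hypothetical names. It assumes that the limit is counted in Java characters and that a URL of exactly 250 characters is rejected, since the key must be less than 250 characters.

```java
/** Hypothetical helper; not a class of com.ibm.gcs.db.component. */
final class UrlKeyLimit {
    /** Assumed limit, taken from the 250-character figure quoted above. */
    static final int MAX_URL_CHARS = 250;

    private UrlKeyLimit() {}

    /**
     * Returns true if the URL is short enough to serve as a key.
     * Assumption: the URL must have strictly fewer than 250 characters.
     */
    static boolean fitsPrimaryKey(String url) {
        return url != null && url.length() < MAX_URL_CHARS;
    }
}
```

A caller could use such a check to filter seed URLs up front instead of relying on the exception that the gatherer logs.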

This package provides two concrete DB2 implementations for URLCollection.  The default implementation, DB2URLCollection, returns URLs in the order that they were discovered (FIFO queue). To configure the gatherer to use this implementation, the user may set the configuration properties as follows:

        url-collection-class="com.ibm.gcs.db.component.DB2URLCollection"  
        urlcontainer-class="com.ibm.gcs.db.component.DB2URLContainer"

A second implementation, com.ibm.gcs.db.component.BFSDB2URLCollection, returns URLs in the order of their recursion depth and, within the same depth, in the order that they were discovered (BFS queue).  To configure the gatherer to use this implementation, the user may set the configuration properties as follows:

        url-collection-class="com.ibm.gcs.db.component.BFSDB2URLCollection"  
        urlcontainer-class="com.ibm.gcs.db.component.DB2URLContainer"
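
The difference between the FIFO and BFS orderings can be illustrated with two comparators. This is only a sketch under assumptions: the QueuedUrl class, its fields, and the sequence counter used to record discovery order are hypothetical, and the sketch says nothing about how the package actually stores or queries URLs in DB2.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a queued URL; not a class of this package. */
final class QueuedUrl {
    final String url;
    final int depth;      // recursion depth at which the URL was found
    final long sequence;  // assumed counter that records discovery order

    QueuedUrl(String url, int depth, long sequence) {
        this.url = url;
        this.depth = depth;
        this.sequence = sequence;
    }
}

final class CrawlOrderSketch {
    /** Discovery order only (FIFO queue). */
    static final Comparator<QueuedUrl> FIFO =
            Comparator.comparingLong(e -> e.sequence);

    /** Recursion depth first, then discovery order (BFS queue). */
    static final Comparator<QueuedUrl> BFS =
            Comparator.comparingInt((QueuedUrl e) -> e.depth)
                      .thenComparingLong(e -> e.sequence);

    private CrawlOrderSketch() {}

    /** Returns the URLs of the given entries sorted by the given ordering. */
    static List<String> order(List<QueuedUrl> entries, Comparator<QueuedUrl> ordering) {
        List<QueuedUrl> sorted = new ArrayList<>(entries);
        sorted.sort(ordering);
        List<String> urls = new ArrayList<>();
        for (QueuedUrl e : sorted) {
            urls.add(e.url);
        }
        return urls;
    }
}
```

With this sketch, a URL found late at depth 1 is returned before a URL found earlier at depth 2 under the BFS ordering, but after it under the FIFO ordering.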

Last, this package provides an abstract implementation called PriorityDB2URLCollection. This implementation returns URLs according to a user-defined priority, as computed by a custom Prioritizer implementation.   Classes that extend PriorityDB2URLCollection must return an instance of this user-defined Prioritizer through the method getPrioritizer().  To configure the gatherer to use a custom implementation, the user may set the configuration properties as follows:

        url-collection-class= class-name for implementation  
        urlcontainer-class="com.ibm.gcs.db.component.DB2URLContainer"
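
The relationship between a priority collection and its Prioritizer might look like the following sketch. The real signatures of Prioritizer and PriorityDB2URLCollection are not shown in this description, so the sketch uses local stand-ins: SketchPrioritizer, its priorityFor method, SketchPriorityCollection, and HtmlFirstCollection are all hypothetical. Only the method name getPrioritizer() comes from the text above.

```java
/** Hypothetical stand-in for the Prioritizer interface; the real method set is not shown here. */
interface SketchPrioritizer {
    /** Assumed signature: maps a URL to a priority group number. */
    int priorityFor(String url);
}

/** Hypothetical stand-in for the abstract PriorityDB2URLCollection. */
abstract class SketchPriorityCollection {
    /** Subclasses supply the user-defined prioritizer through this method. */
    abstract SketchPrioritizer getPrioritizer();
}

/** Example subclass: HTML pages go into group 0, everything else into group 1. */
final class HtmlFirstCollection extends SketchPriorityCollection {
    @Override
    SketchPrioritizer getPrioritizer() {
        return url -> url.endsWith(".html") ? 0 : 1;
    }
}
```

A real subclass would extend PriorityDB2URLCollection itself, and its full class name would be the value given for url-collection-class.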

In addition to these component implementations, this package also contains support classes.  In general, these classes provide the mapping to the database tables.   Developers may use and extend these classes to provide their own database implementations for the gatherer.   

The remainder of this document gives an overview of the logical structure of the classes in this package, which fall into 4 categories:

  1. Public classes that implement the GCS component interfaces.
  2. Support classes to update and read from the actual DB tables.
  3. Static classes that contain the DB table definitions.
  4. Exceptions thrown by classes in this package.

 What the com.ibm.gcs.db.component Package Contains

1.  Implementation of GCS Components

The GCS gatherer provides well-defined interfaces for its com.ibm.gcs.crawler.URLCollection and com.ibm.gcs.component.urlc.URLContainer components.  The URLCollection API defines methods to put and to get URLs into and out of a collection of URLs to be crawled.  The URLContainer API defines methods to access the URL and the URL-specific information required to crawl it (e.g. recursion-depth, include-pattern, etc.).  In addition, it provides methods to access link-relationship and annotation information.

This package implements these interfaces to define Java objects that map their internal data to DB2 relational tables.  In brief, these classes store each URL as a row in a DB2 relational table.  Each row of the table contains the URL value, its relevant crawl properties (depth, time-crawled, etc.) and its state-information (to be crawled, to be summarized, etc.).  The URLCollection and URLContainer classes access these tables to execute the public methods (used by the gatherer) and make the database accesses transparent to the gatherer.

URLContainer & Annotations

DB2URLContainer is the default implementation for the gatherer's com.ibm.gcs.component.urlc.URLContainer interface.   

It uses the DB2URLRow and the DB2AnnotationsList support classes to generate and execute SQL statements.  The purpose of the class is to access the data stored in the DB2 tables and map it to the Java objects expected by the gatherer.

The gatherer can keep track of link-relationship and annotation information.  Link-relationship information gives a record of parent-child relationships.  Annotation information is what a parent page says about a child page.  (For more information on annotations, please refer to the research paper Using Annotations to Enhance an Information Gathering System.)  In the gatherer, this information is stored in a com.ibm.gcs.component.urlc.Annotation object, with each URLContainer keeping track of the list of Annotations provided by its parents.  Each annotation object contains the parent URLContainer and what the parent says about the child.

To store annotation information in the database, this package defines a DB2Annotation interface and provides two implementations of it, DB2DictionaryAnnotation and DB2DescriptionAnnotation.

These implementations extend com.ibm.gcs.component.urlc.DictionaryAnnotation and com.ibm.gcs.component.urlc.DescriptionAnnotation by mapping the Java objects to DB2.

URLCollection

Each of the implementations of com.ibm.gcs.crawler.URLCollection defines how the crawler gets and puts URLs from and into its URL pool.  This package provides three implementations: DB2URLCollection, BFSDB2URLCollection, and the abstract PriorityDB2URLCollection.

Each of these three implementations accepts and returns DB2URLContainers.  They keep track of the state of each container using a state variable.  The collections depend on various support classes, which extend DB2Pool, to execute their database accesses.

 

2.  SQL Generation and Database Access

The implementations of the GCS components use a set of support classes to access the database tables.  These classes generate and execute the SQL statements to access data in 3 relational tables:  urlpoolstable, parentstable, and treetable.  The tables are defined in the table definition classes.     

URL_CRAWL_TABLE

Each row of the url crawl table stores the basic properties of the URL container.  The main support class that manages selects, inserts, and updates of the rows in this table is DB2URLRow.  The URL_CRAWL_TABLE also provides state information that specifies whether the URL is waiting to be crawled, has been crawled, etc. The DB2Pool classes (DB2Queue, DB2VisitedPool, DB2HiddenPool) provide various views based on this state information.  URLCollection implementations use these support classes to retrieve URLContainers from the database and to update the URL rows with the appropriate state information.
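
The idea of a pool as a view over the state column can be sketched in memory. Everything in this sketch is hypothetical: the CrawlState values, the UrlRow class, and the PoolViewSketch helper are illustrative names, and the real state constants and SQL are not shown in this description. The actual pool classes read from DB2 rather than filtering a Java list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical crawl states; the real constants are not shown in this description. */
enum CrawlState { TO_CRAWL, VISITED, HIDDEN }

/** Hypothetical in-memory stand-in for one row of the url crawl table. */
final class UrlRow {
    final String url;
    final CrawlState state;

    UrlRow(String url, CrawlState state) {
        this.url = url;
        this.state = state;
    }
}

final class PoolViewSketch {
    private PoolViewSketch() {}

    /** Returns the URLs whose state matches, which is what a pool view exposes. */
    static List<String> view(List<UrlRow> table, CrawlState wanted) {
        List<String> urls = new ArrayList<>();
        for (UrlRow row : table) {
            if (row.state == wanted) {
                urls.add(row.url);
            }
        }
        return urls;
    }
}
```

In this picture, a queue, a visited pool, and a hidden pool differ only in the state value they select, and moving a URL between them amounts to updating the state of its row.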

LINKS_TABLE

The links table contains child-parent link relationships and includes annotation information (or the metadata found around hyperlinks).  Each child and parent in this table references a row in the url crawl table.  The following classes provide support for accessing, inserting, and updating data in this table:

CONFIG_TABLE

The CONFIG_TABLE stores the configuration for the crawl as XML.  Each URL in the url crawl table has a CrawlPattern associated with it, which is retrieved from the configuration information.  The class DB2ConfigTable maps the data in this table to the Java object.

3.  Table Definitions

The following classes provide hard-coded definitions for the DB2 tables: UrlCrawlTableDef, LinksTableDef, and ConfigTableDef.

The DB2 support classes use the constants defined in the table definitions to generate SQL statements.
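
The pattern of building SQL from table-definition constants can be sketched as follows. The table name, column names, and both classes are hypothetical and do not reproduce the constants in this package's table definition classes. The sketch also assumes a parameterized statement whose value is bound later through java.sql.PreparedStatement.

```java
/** Hypothetical table definition; the names do not come from this package. */
final class SketchTableDef {
    static final String TABLE = "SKETCH_URL_TABLE";
    static final String COL_URL = "URL";
    static final String COL_STATE = "STATE";

    private SketchTableDef() {}
}

final class SqlSketch {
    private SqlSketch() {}

    /** Builds a parameterized SELECT from the constants above. */
    static String selectUrlsByState() {
        return "SELECT " + SketchTableDef.COL_URL
                + " FROM " + SketchTableDef.TABLE
                + " WHERE " + SketchTableDef.COL_STATE + " = ?";
    }
}
```

Keeping the names in one constants class means that a renamed column only has to be changed in one place.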

4.  Exceptions

The package classes throw the following exceptions: DB2ComponentException and ImplementationException.


EIP Web Crawler APIs

(c) Copyright International Business Machines Corporation 1996, 2002. IBM Corp. All rights reserved.