Enterprise Information Portal APIs
Interface Summary | |
DB2Annotation | The DB2Annotation interface specifies persistence methods for Annotation objects which enable the object to be read from and written to the database. |
DB2AnnotationFactory | DB2AnnotationFactory defines methods to create instances of DB2Annotations. |
Prioritizer | This interface specifies the priority for a URL to be crawled. |
Class Summary | |
ConfigTableDef | This class defines the constants for the database table CONFIGTABLE. |
CrawlPatternPool | |
DB2AnnotationHelper | DB2AnnotationHelper provides methods to produce DB2Annotation related SQL strings for INSERT, UPDATE, and SELECT operations, and methods that call a DB2AnnotationFactory to reconstruct DB2Annotation objects from data retrieved from the database. |
DB2AnnotationsList | A DB2AnnotationsList represents a list of all DB2Annotations associated with a specified annotatee. |
DB2ConfigTable | |
DB2DescriptionAnnotation | The DB2DescriptionAnnotation class is the default annotation type. It extends DescriptionAnnotation by implementing methods that enable it to be written to and read from a database table. |
DB2DictionaryAnnotation | The DB2DictionaryAnnotation class extends DictionaryAnnotation by implementing methods that enable it to be written to and read from a database table. |
DB2HiddenPool | DB2HiddenPool represents all the URLs in the database which have not been and should not be crawled. |
DB2HiddenQueue | DB2HiddenQueue represents the URLs in the database which must be crawled but have their hide flag set to true. |
DB2Pool | DB2Pool is a view or abstract representation of a set of URLs in the url database table. |
DB2PriorityQueue | DB2PriorityQueue represents the URLs in the database which must be crawled and belong to the specified priority group. |
DB2Queue | DB2Queue represents the URLs in the database which must be crawled. |
DB2RevisitQueue | DB2RevisitQueue represents the URLs in the database which must be recrawled. |
DB2StatesDef | |
DB2TableAdmin | |
DB2URLCollection | This URL collection class enqueues URLs based on depth. By default, it does not save any annotation information. |
DB2URLContainer | A DB2URLContainer object provides access to the crawl information for a URL which is stored in DB2 relations. |
DB2URLRow | A DB2URLRow represents a record in the URLCRAWLTABLE. |
DB2VisitedPool | DB2VisitedPool represents all the URLs in the database which have already been visited. |
DefaultDB2AnnotationFactory | DefaultDB2AnnotationFactory provides methods to produce DB2Annotation objects from data retrieved from the database. |
LinksTableDef | This class provides the constants for column names of the database table LINKS_TABLE. |
PriorityDB2URLCollection | This URL collection class enqueues URLs based on priority group. |
UrlCrawlTableDef | This class provides the constants for the column names of the table containing the URLs and crawl info, URLCRAWLTABLE. It also provides a method to construct the SQL string to create the URLCRAWLTABLE. |
Exception Summary | |
DB2ComponentException | |
ImplementationException |
Provides database implementations for GCS components that cause the gatherer to write to and read from a persistent URL pool. The user may configure the gatherer to use this package's classes by setting the appropriate property values in the globals part of the GCS configuration file. The two relevant properties are url-collection-class and urlcontainer-class in url-pool-config. The values of these properties will be the full Java class names for the chosen implementation (as indicated below). Using the concrete implementation provided by this package, the gatherer saves and loads URLs to and from relational DB2 tables.
With the DB2 implementations, the URLs to crawl are limited to 250 characters. The reason is that the URL is used as a primary key in the database tables, and DB2 requires that this key be less than 250 characters. If the gatherer encounters a URL that exceeds this limit, it rejects the URL and logs it as an exception. Additionally, the JVM heap size should be set to about 250M with the java option -Xmx to avoid OutOfMemoryError. (See http://java.sun.com/products/jdk/1.2/docs/tooldocs/win32/java.html#options) Also make sure that the DB2 transaction log size and the number of log files are sufficient (as described in How to configure GCS to crawl with DB2).
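The length restriction can be checked before a URL is handed to the gatherer. The following is a minimal sketch and not part of the GCS API: the class UrlLengthCheck and the method fitsKeyLimit are hypothetical names, and the sketch assumes that 250 is an inclusive maximum counted with String.length().

```java
// Hypothetical pre-check; UrlLengthCheck is not a class in this package.
final class UrlLengthCheck {
    // Limit stated in this document for URLs that serve as DB2 primary keys.
    // Assumption: the limit is inclusive and counted in Java chars.
    static final int MAX_URL_LENGTH = 250;

    /** Returns true if the URL is non-null and within the documented limit. */
    static boolean fitsKeyLimit(String url) {
        return url != null && url.length() <= MAX_URL_LENGTH;
    }

    public static void main(String[] args) {
        String shortUrl = "http://www.example.com/index.html";

        // Build a URL that is longer than the limit.
        StringBuilder longUrl = new StringBuilder("http://www.example.com/");
        while (longUrl.length() <= MAX_URL_LENGTH) {
            longUrl.append('a');
        }

        System.out.println(shortUrl + " fits: " + fitsKeyLimit(shortUrl));
        System.out.println("long URL (" + longUrl.length() + " chars) fits: "
                + fitsKeyLimit(longUrl.toString()));
    }
}
```

A caller could filter seed URLs with such a check so that over-long URLs are reported before the crawl starts rather than in the gatherer log.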
This package provides two concrete DB2 implementations for URLCollection. The default implementation, DB2URLCollection, returns URLs in the order that they were discovered (FIFO queue). To configure the gatherer to use this implementation, the user may set the configuration properties as follows:
url-collection-class="com.ibm.gcs.db.component.DB2URLCollection"
urlcontainer-class="com.ibm.gcs.db.component.DB2URLContainer"
A second implementation, com.ibm.gcs.db.component.BFSDB2URLCollection, returns URLs in the order of their recursion depth, followed by the order that they were discovered (BFS queue). To configure the gatherer to use this implementation, the user may set the configuration properties as follows:
url-collection-class="com.ibm.gcs.db.component.BFSDB2URLCollection"
urlcontainer-class="com.ibm.gcs.db.component.DB2URLContainer"
Last, this package provides an abstract implementation called PriorityDB2URLCollection. This implementation returns URLs according to a user-defined priority, as computed by a custom Prioritizer implementation. Classes that extend PriorityDB2URLCollection must return an instance of this user-defined Prioritizer through the method getPrioritizer(). To configure the gatherer to use a custom implementation, the user may set the configuration properties as follows:
url-collection-class=<class-name for implementation>
urlcontainer-class="com.ibm.gcs.db.component.DB2URLContainer"
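The relationship between a priority collection and its Prioritizer can be illustrated with an in-memory sketch. This document does not show the real signatures, so everything here is a stand-in: PrioritizerSketch, its method priorityFor, PriorityCollectionSketch, and ShortUrlsFirst are hypothetical names, and the convention that a lower value is returned first is an assumption. Only getPrioritizer() mirrors a method named in the text above.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the package's Prioritizer interface. The real method name
// and signature are not shown in this document and may differ.
interface PrioritizerSketch {
    int priorityFor(String url);
}

// Stand-in for PriorityDB2URLCollection that keeps URLs in memory, not in DB2.
abstract class PriorityCollectionSketch {
    private final List<String> pending = new ArrayList<>();

    // Subclasses supply the user-defined prioritizer, mirroring getPrioritizer().
    protected abstract PrioritizerSketch getPrioritizer();

    void put(String url) {
        pending.add(url);
    }

    // Removes and returns the pending URL with the lowest priority value
    // (assumed convention); ties keep discovery order. Returns null if empty.
    String get() {
        if (pending.isEmpty()) {
            return null;
        }
        PrioritizerSketch prioritizer = getPrioritizer();
        String best = pending.get(0);
        for (String url : pending) {
            if (prioritizer.priorityFor(url) < prioritizer.priorityFor(best)) {
                best = url;
            }
        }
        pending.remove(best);
        return best;
    }
}

// Hypothetical policy: shorter URLs are crawled first.
final class ShortUrlsFirst extends PriorityCollectionSketch {
    @Override
    protected PrioritizerSketch getPrioritizer() {
        return url -> url.length();
    }

    public static void main(String[] args) {
        ShortUrlsFirst urls = new ShortUrlsFirst();
        urls.put("http://example.com/a/b/c");
        urls.put("http://example.com/");
        urls.put("http://example.com/a");
        for (String url = urls.get(); url != null; url = urls.get()) {
            System.out.println(url);
        }
    }
}
```

In the real package the subclass name would be the value given for url-collection-class, and the ordering would be applied to rows in DB2 rather than to an in-memory list.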
In addition to these component implementations, this package also contains support classes. In general, these classes provide the mapping to the database tables. Developers may use and extend these classes to provide their own database implementations for the gatherer.
The remainder of this document gives an overview of the logical structure of the classes in this package, which fall into 4 categories:
com.ibm.gcs.db.component Package Contents
The GCS gatherer provides well-defined interfaces for its com.ibm.gcs.crawler.URLCollection and com.ibm.gcs.component.urlc.URLContainer components. The URLCollection API defines methods to put URLs into and get URLs out of a collection of URLs to be crawled. The URLContainer API defines methods to access the URL and the URL-specific information required to crawl it (e.g., recursion depth, include pattern). In addition, it provides methods to access link-relationship and annotation information.
This package implements these interfaces to define Java objects that map their internal data to DB2 relational tables. In brief, these classes store each URL as a row in a DB2 relational table. Each row of the table contains the URL value, its relevant crawl properties (depth, time-crawled, etc.) and its state-information (to be crawled, to be summarized, etc.). The URLCollection and URLContainer classes access these tables to execute the public methods (used by the gatherer) and make the database accesses transparent to the gatherer.
DB2URLContainer is the default implementation for the gatherer's com.ibm.gcs.component.urlc.URLContainer interface. It uses the DB2URLRow and the DB2AnnotationsList support classes to generate and execute SQL statements. The purpose of the class is to access the data stored in the DB2 tables and map it to the Java objects expected by the gatherer.
The gatherer can keep track of link-relationship and annotation information. Link-relationship information gives a record of parent-child relationships. Annotation information is what a parent page says about a child page. (For more information on annotations, please refer to the research paper Using Annotations to Enhance an Information Gathering System.) In the gatherer, this information is stored in a com.ibm.gcs.component.urlc.Annotation object, with each URLContainer keeping track of the list of Annotations provided by its parents. Each annotation object contains the parent URLContainer and what the parent says about the child.
To store annotation information in the database, this package defines a DB2Annotation interface and provides two implementations of it: DB2DictionaryAnnotation and DB2DescriptionAnnotation. These implementations extend com.ibm.gcs.component.urlc.DictionaryAnnotation and com.ibm.gcs.component.urlc.DescriptionAnnotation, respectively, by mapping the Java objects to DB2.
Each of the implementations of com.ibm.gcs.crawler.URLCollection defines how the crawler gets and puts URLs from and into its URL pool. This package provides three implementations:
com.ibm.gcs.db.component.BFSDB2URLCollection
DB2URLCollection
PriorityDB2URLCollection (abstract; requires a Prioritizer)
Each of these three implementations accepts and returns DB2URLContainers. They keep track of the state of each container using a state variable. The collections depend on various support classes, which extend DB2Pool, to execute their database accesses.
The implementations of the GCS components use a set of support classes to access the database tables. These classes generate and execute the SQL statements to access data in 3 relational tables: urlpoolstable, parentstable, and treetable. The tables are defined in the table definition classes.
Each row of the url crawl table stores the basic properties of the URL container. The main support class that manages selects, inserts, and updates of the rows in this table is DB2URLRow. The URL_CRAWL_TABLE also provides state information that specifies whether the URL is waiting to be crawled, has been crawled, etc. The DB2Pool classes (DB2Queue, DB2VisitedPool, DB2HiddenPool) provide various views based on this state information. URLCollection implementations use these support classes to retrieve URLContainers from the database and to update the URL rows with the appropriate state information.
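The idea of pools as views over one table, selected by state, can be sketched in memory. This is an illustration under stated assumptions and not the package's code: UrlStateViews, Row, and the State values are hypothetical names, and a list stands in for the DB2 table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model; the real classes issue SQL against DB2.
final class UrlStateViews {
    // Assumed state values, chosen for illustration only.
    enum State { TO_CRAWL, VISITED, HIDDEN }

    // Stand-in for one row of the url crawl table.
    static final class Row {
        final String url;
        final int depth;
        State state;

        Row(String url, int depth, State state) {
            this.url = url;
            this.depth = depth;
            this.state = state;
        }
    }

    private final List<Row> table = new ArrayList<>();

    void insert(String url, int depth, State state) {
        table.add(new Row(url, depth, state));
    }

    // A pool is the subset of rows in one state, comparable to a
    // SELECT with a WHERE clause on the state column.
    List<String> view(State wanted) {
        List<String> urls = new ArrayList<>();
        for (Row row : table) {
            if (row.state == wanted) {
                urls.add(row.url);
            }
        }
        return urls;
    }

    // Moves a URL between views by updating its state; false if not found.
    boolean updateState(String url, State newState) {
        for (Row row : table) {
            if (row.url.equals(url)) {
                row.state = newState;
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlStateViews db = new UrlStateViews();
        db.insert("http://example.com/", 0, State.TO_CRAWL);
        db.insert("http://example.com/a", 1, State.TO_CRAWL);
        db.insert("http://example.com/private", 1, State.HIDDEN);
        db.updateState("http://example.com/", State.VISITED);
        System.out.println("queue:   " + db.view(State.TO_CRAWL));
        System.out.println("visited: " + db.view(State.VISITED));
        System.out.println("hidden:  " + db.view(State.HIDDEN));
    }
}
```

Because every view reads the same rows, changing a row's state is enough to move a URL from the queue to the visited pool; no data is copied between pools.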
LINKS_TABLE
The links table contains child-parent link relationships and includes annotation information (or the metadata found around hyperlinks). Each child and parent in this table references a row in the url crawl table. The following classes provide support for accessing, inserting, and updating data in this table:
CONFIG_TABLE
The CONFIG_TABLE stores the configuration for the crawl as XML. Each URL in the url crawl table has a CrawlPattern associated with it, which is retrieved from the configuration information. The class DB2ConfigTable maps the data in this table to the Java object.
The table definition classes (ConfigTableDef, LinksTableDef, and UrlCrawlTableDef) provide hard-coded definitions for the DB2 tables.
The DB2 support classes use the constants defined in the table definitions to generate SQL statements.
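As an illustration of building SQL from table-definition constants, consider the sketch below. The table name, the column names, and the class TableDefSketch are hypothetical and do not reproduce the constants in UrlCrawlTableDef or the other definition classes.

```java
// Hypothetical definition class; names are placeholders, not the real constants.
final class TableDefSketch {
    static final String TABLE = "URL_TABLE_SKETCH";
    static final String COL_URL = "URL";
    static final String COL_DEPTH = "DEPTH";
    static final String COL_STATE = "STATE";

    // Builds an INSERT with parameter markers suitable for a PreparedStatement.
    static String insertSql() {
        return "INSERT INTO " + TABLE
                + " (" + COL_URL + ", " + COL_DEPTH + ", " + COL_STATE + ")"
                + " VALUES (?, ?, ?)";
    }

    // Builds a SELECT that returns the URLs in a given state.
    static String selectByStateSql() {
        return "SELECT " + COL_URL + ", " + COL_DEPTH
                + " FROM " + TABLE
                + " WHERE " + COL_STATE + " = ?";
    }

    public static void main(String[] args) {
        System.out.println(insertSql());
        System.out.println(selectByStateSql());
    }
}
```

Keeping the names in one class means a renamed column changes a single constant, and every generated statement follows.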
The package classes throw the following exceptions: DB2ComponentException and ImplementationException.
EIP Web Crawler APIs