Enterprise Information Portal APIs
Class Summary | |
Config | This represents the GCS configuration, with two sections:
Globals and an array of Groups. |
CrawlPattern | This part of a Group Config represents
a pattern of URLs that should be crawled. |
Globals | This part of the Config represents global parameters,
such as logger configuration, locale,
max URLs, number of threads, temp/content/summary filepool,
URL pool configuration, system property,
and status monitor settings. |
Group | This part of the Config represents a group of resources
that will be crawled and summarized in a particular way. |
HostHandler | This part of a SummarizerConfig
maps a com.ibm.gcs.summarizer.Summarizable
class and a com.ibm.gcs.summarizer.SummaryMaker class
to a specified protocol String. |
HttpSpecific | This represents HTTP-specific information in a URLSeed. |
LoggerConfig | This part of the Globals Config represents the
Logger settings. |
ResourceHandler | This part of a SummarizerConfig
maps a com.ibm.gcs.summarizer.Summarizable
class and a com.ibm.gcs.summarizer.SummaryMaker class
to a specified content-type pattern. |
SummarizerConfig | This part of a Group Config represents
the configuration for a summarizer. |
URLExcIncPattern | This is a pattern used to determine whether a URL will be excluded
from or included in a CrawlPattern. |
URLNamePattern | This type of URLExcIncPattern matches a URL
using a single name string and wildcards. |
URLObjPattern | This type of URLExcIncPattern matches a URL
using protocol, host, port, file, filename, and ref fields and wildcards. |
URLPoolConfig | This part of the Globals Config represents the
settings for the URLPool,
URLContainer, and
URLCollection;
it automatically creates a URLPool instance from these settings. |
URLPredicatePattern | This type of URLExcIncPattern matches a URL
using a separate UnaryPredicate class. |
URLRegExPattern | This type of URLExcIncPattern matches a URL
using a Perl 5 style regular expression (from the IBM regex4j package). |
URLSeed | This part of a CrawlPattern Config
represents a seed URL that is passed to the Crawler,
with optional HttpSpecific information. |
Exception Summary | |
ConfigException | This NLSException indicates
that an error has occurred while setting the GCS configuration. |
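The include/exclude pattern classes summarized above can be illustrated with a minimal sketch. It is not the GCS implementation: every class and method name below is hypothetical, `java.util.regex` stands in for the regex4j package, only `*` is treated as a wildcard, the object-style pattern checks only protocol and host, and the rule "accept a URL when at least one include pattern matches and no exclude pattern matches" is an assumption.

```java
import java.net.URI;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from the GCS API.
class UrlPatternSketch {

    // Name-style pattern: '*' matches any run of characters, all else is literal.
    static Predicate<String> namePattern(String wildcard) {
        String[] parts = wildcard.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        Pattern compiled = Pattern.compile(regex.toString());
        return url -> compiled.matcher(url).matches();
    }

    // Regex-style pattern: java.util.regex is used here in place of regex4j.
    static Predicate<String> regexPattern(String regex) {
        Pattern compiled = Pattern.compile(regex);
        return url -> compiled.matcher(url).matches();
    }

    // Object-style pattern, reduced to two fields: protocol and host wildcard.
    static Predicate<String> objPattern(String protocol, String hostWildcard) {
        Predicate<String> hostMatch = namePattern(hostWildcard);
        return url -> {
            URI uri = URI.create(url);
            return protocol.equalsIgnoreCase(uri.getScheme())
                    && uri.getHost() != null
                    && hostMatch.test(uri.getHost());
        };
    }

    // Assumed rule: at least one include matches and no exclude matches.
    static boolean accepted(String url,
                            List<Predicate<String>> includes,
                            List<Predicate<String>> excludes) {
        boolean included = includes.stream().anyMatch(p -> p.test(url));
        boolean excluded = excludes.stream().anyMatch(p -> p.test(url));
        return included && !excluded;
    }

    public static void main(String[] args) {
        List<Predicate<String>> includes =
                List.of(objPattern("http", "*.example.com"));
        List<Predicate<String>> excludes =
                List.of(namePattern("*.gif"), regexPattern(".*/private/.*"));

        String[] urls = {
            "http://www.example.com/index.html",
            "http://www.example.com/logo.gif",
            "http://www.example.com/private/a.html",
            "ftp://www.example.com/index.html"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + accepted(url, includes, excludes));
        }
    }
}
```

Modeling each pattern as a `Predicate<String>` mirrors the idea behind URLPredicatePattern, where the match is delegated to a separate predicate class; the real classes may combine patterns differently.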
Configuration classes used to specify the run-time parameters for GCS.
The configuration of a particular GCS crawl is represented
by the com.ibm.gcs.component.config.Configuration
interface.
A crawl is started with a
com.ibm.gcs.component.config.ConfigStartEvent.
This tells the Gatherer to load its user parameters from an XML config file,
which is parsed by the Config
class into the various other com.ibm.almaden.gcs.component.config classes.
The basic structure of the XML config tree is shown below, with links pointing to the corresponding Java classes. See the actual config DTD for details on the XML format.
<gcs-config>
<globals>
<group-list>
<group>
<url-pattern-list>
<url-pattern>
<seed-list>
<li>
<url>
<protocol-specific>
<http-specific>
<authentication>
<msg-header>
<content>
<content-type-pattern-list>
<url-regex-pattern>
<predicate>
<include-pattern-list>
<url-obj-pattern>
<url-name-pattern>
<url-regex-pattern>
<predicate>
<exclude-pattern-list>
<url-obj-pattern>
<url-name-pattern>
<url-regex-pattern>
<predicate>
<summarizer-list>
<summarizer>
<mime-summarizer-list>
<summarizer-list>
<summarizer>
<refine-list>
<refine>
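As a rough illustration of how a file built from these elements might be read, the sketch below parses a small sample with the standard JAXP DOM parser. It does not use the Config class. The element names come from the tree above, but the nesting, the use of element text for values, and the absence of attributes are all assumptions; the config DTD remains the authority on the real format. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch; this is not how the GCS Config class is implemented.
class ConfigTreeSketch {

    // Assumed sample: element names from the tree, nesting and values guessed.
    static final String SAMPLE =
          "<gcs-config>"
        + "<globals/>"
        + "<group-list><group><url-pattern-list><url-pattern>"
        + "<seed-list><li><url>http://www.example.com/</url></li></seed-list>"
        + "<include-pattern-list>"
        + "<url-name-pattern>http://www.example.com/*</url-name-pattern>"
        + "</include-pattern-list>"
        + "<exclude-pattern-list>"
        + "<url-regex-pattern>.*\\.gif</url-regex-pattern>"
        + "</exclude-pattern-list>"
        + "</url-pattern></url-pattern-list></group></group-list>"
        + "</gcs-config>";

    // Parse an XML string into a DOM document, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse config", e);
        }
    }

    // Collect the trimmed text of every element with the given tag name.
    static List<String> textOf(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("root:     " + doc.getDocumentElement().getTagName());
        System.out.println("seeds:    " + textOf(doc, "url"));
        System.out.println("includes: " + textOf(doc, "url-name-pattern"));
        System.out.println("excludes: " + textOf(doc, "url-regex-pattern"));
    }
}
```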
EIP Web Crawler APIs