Enterprise Information Portal APIs

Package com.ibm.gcs.component.config

Configuration classes used to specify the run-time parameters for GCS.

Class Summary
Config This represents the GCS configuration, with two sections: Globals and an array of Groups.
CrawlPattern This part of a Group Config represents a pattern of URLs that should be crawled.
Globals This part of the Config represents global parameters, such as logger configuration, locale, max urls, number of threads, temp/content/summary filepool, URL pool configuration, system property, and status monitor settings.
Group This part of the Config represents a group of resources that will be crawled and summarized in a particular way.
HostHandler This part of a SummarizerConfig maps a com.ibm.gcs.summarizer.Summarizable class and a com.ibm.gcs.summarizer.SummaryMaker class to a specified protocol String.
HttpSpecific This represents http-specific information in a URLSeed.
LoggerConfig This part of the Globals Config represents the Logger settings.
ResourceHandler This part of a SummarizerConfig maps a com.ibm.gcs.summarizer.Summarizable class and a com.ibm.gcs.summarizer.SummaryMaker class to a specified content-type pattern.
SummarizerConfig This part of a Group Config represents the configuration for a summarizer.
URLExcIncPattern This is a pattern used to determine whether a URL will be excluded from or included in a CrawlPattern.
URLNamePattern This type of URLExcIncPattern matches a URL using a single name string and wildcards.
URLObjPattern This type of URLExcIncPattern matches a URL using protocol, host, port, file, filename, and ref fields and wildcards.
URLPoolConfig This part of the Globals Config represents the settings for the URLPool, URLContainer, and URLCollection; it automatically creates a URLPool instance from these settings.
URLPredicatePattern This type of URLExcIncPattern matches a URL using a separate UnaryPredicate class.
URLRegExPattern This type of URLExcIncPattern matches a URL using a Perl 5 style regular expression (from the IBM regex4j package).
URLSeed This part of a CrawlPattern Config represents a seed URL that is passed to the Crawler, with optional HttpSpecific information.
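
The include and exclude patterns summarized above can be pictured with a small stand-alone sketch. It does not use the GCS classes: UrlFilterSketch, fromWildcard, include, exclude, and accepts are hypothetical names, java.util.regex stands in for the regex4j package, and the rule that a URL needs at least one matching include pattern and no matching exclude pattern is an assumption, not documented GCS behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; these are not the GCS pattern classes.
class UrlFilterSketch {
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    // Turn a name string with '*' wildcards into a regex. All other
    // characters are quoted so that they match literally.
    static Pattern fromWildcard(String namePattern) {
        StringBuilder regex = new StringBuilder();
        String[] literals = namePattern.split("\\*", -1);
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.compile(regex.toString());
    }

    void include(Pattern pattern) {
        includes.add(pattern);
    }

    void exclude(Pattern pattern) {
        excludes.add(pattern);
    }

    // Assumed rule: at least one include must match and no exclude may match.
    boolean accepts(String url) {
        return matchesAny(includes, url) && !matchesAny(excludes, url);
    }

    private static boolean matchesAny(List<Pattern> patterns, String url) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch();
        filter.include(fromWildcard("http://example.com/*"));
        filter.exclude(Pattern.compile(".*\\.gif"));
        System.out.println(filter.accepts("http://example.com/docs/index.html"));
        System.out.println(filter.accepts("http://example.com/logo.gif"));
    }
}
```

Quoting the literal parts of a wildcard string keeps characters such as "." and "?" from being read as regex operators, which is the main pitfall when wildcard and regular-expression patterns are mixed in one list.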
 

Exception Summary
ConfigException This NLSException indicates that an error has occurred while setting the GCS configuration.
 

Package com.ibm.gcs.component.config Description

Configuration classes used to specify the run-time parameters for GCS. The configuration of a particular GCS crawl is represented by the com.ibm.gcs.component.config.Configuration interface. To start a crawl, a com.ibm.gcs.component.config.ConfigStartEvent is used. This event tells the Gatherer to load its user parameters from an XML config file, which the Config class parses into the various other com.ibm.almaden.gcs.component.config classes.

The basic structure of the XML config tree is shown below, with links pointing to the corresponding Java classes. See the actual config DTD for details on the XML format.

<gcs-config>
  <globals>
  <group-list>
    <group>
      <url-pattern-list>
        <url-pattern>
          <seed-list>
            <li>
              <url>
              <protocol-specific>
                <http-specific>
                  <authentication>
                  <msg-header>
                  <content>
          <content-type-pattern-list>
            <url-regex-pattern>
            <predicate>
          <include-pattern-list>
            <url-obj-pattern>
            <url-name-pattern>
            <url-regex-pattern>
            <predicate>
          <exclude-pattern-list>
            <url-obj-pattern>
            <url-name-pattern>
            <url-regex-pattern>
            <predicate>
      <summarizer-list>
        <summarizer>
          <mime-summarizer-list>
            <summarizer-list>
              <summarizer>
          <refine-list>
            <refine>
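
As a rough illustration of how a document shaped like the tree above could be read, the following sketch parses a minimal sample with the standard JAXP DOM parser and collects the seed URLs. It is not the GCS Config parser. The element names come from the tree, but the class ConfigTreeSketch, the method seedUrls, the sample content, and the assumption that each url element carries its URL as text content are hypothetical; the actual config DTD may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; this is not the GCS Config class.
class ConfigTreeSketch {
    // Minimal sample following the element nesting of the tree.
    // Putting the URL in the text of <url> is an assumption.
    static final String SAMPLE =
        "<gcs-config>"
        + "<globals/>"
        + "<group-list><group><url-pattern-list><url-pattern>"
        + "<seed-list>"
        + "<li><url>http://example.com/</url></li>"
        + "<li><url>http://example.org/docs/</url></li>"
        + "</seed-list>"
        + "</url-pattern></url-pattern-list></group></group-list>"
        + "</gcs-config>";

    // Collect the text of every <url> element found under a <seed-list>.
    static List<String> seedUrls(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> seeds = new ArrayList<>();
            NodeList seedLists = doc.getElementsByTagName("seed-list");
            for (int i = 0; i < seedLists.getLength(); i++) {
                Element seedList = (Element) seedLists.item(i);
                NodeList urls = seedList.getElementsByTagName("url");
                for (int j = 0; j < urls.getLength(); j++) {
                    seeds.add(urls.item(j).getTextContent().trim());
                }
            }
            return seeds;
        } catch (Exception e) {
            // Wrap parser and I/O failures in an unchecked exception.
            throw new IllegalArgumentException("Could not parse config XML", e);
        }
    }

    public static void main(String[] args) {
        for (String seed : seedUrls(SAMPLE)) {
            System.out.println(seed);
        }
    }
}
```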


EIP Web Crawler APIs

(c) Copyright International Business Machines Corporation 1996, 2002. IBM Corp. All rights reserved.