Enterprise Information Portal APIs

com.ibm.gcs.crawler
Class Crawler

java.lang.Object
  |
  +--com.ibm.gcs.component.Component
        |
        +--com.ibm.gcs.crawler.Crawler
All Implemented Interfaces:
GCSThreaded, java.lang.Runnable, Schedulable

public class Crawler
extends Component
implements GCSThreaded

This Component creates Resources by downloading content and header/meta information for each URLContainer in the URLPool. It is started, monitored, and stopped by the Gatherer.

The Crawler has four states:

Construct/Configure

First, it is constructed and configured by the Gatherer with a particular Config. At this point, the Crawler creates the URLPool, adds the seed URLs to it, and either creates the crawler FilePool or is set up to use the temp file pool.

Start

The Crawler is started by the Gatherer. It creates a new GCSThreadGroup of worker threads and starts them running. This puts it in the Run state.

Run

Each Crawler worker thread runs through a loop with the following steps:

  1. get a URLContainer from the URLPool.
  2. create a Resource with the URL header/meta information and content data, using the appropriate Connection, StreamHandler, and ContentHandler classes in the com.ibm.gcs.netutil package. In some cases, it saves the content to a file in the content FilePool and gives the Resource the file name.
  3. add the Resource to the ResourcePool so that the Summarizer can summarize it.
  4. wait until there is another URLContainer available in the URLPool.
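
The loop above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the Crawler's actual code: WorkerLoopSketch, FetchedResource, and Fetcher are hypothetical names, two BlockingQueue instances stand in for the URLPool and the ResourcePool, and a single fetch call stands in for the Connection, StreamHandler, and ContentHandler classes. Saving content to a FilePool is left out.

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-ins only; none of these types are part of com.ibm.gcs.
final class WorkerLoopSketch {

    /** Stand-in for a Resource: the URL plus its downloaded content. */
    static final class FetchedResource {
        final String url;
        final String content;

        FetchedResource(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    /** Stand-in for the download and parsing classes in the netutil package. */
    interface Fetcher {
        String fetch(String url) throws IOException;
    }

    /** Builds the body of one worker thread. */
    static Runnable worker(BlockingQueue<String> urlPool,
                           BlockingQueue<FetchedResource> resourcePool,
                           Fetcher fetcher) {
        return () -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // Steps 1 and 4: take() blocks until a URL is available.
                    String url = urlPool.take();
                    String content;
                    try {
                        // Step 2: download the content for this URL.
                        content = fetcher.fetch(url);
                    } catch (IOException e) {
                        continue; // skip a URL that could not be downloaded
                    }
                    // Step 3: hand the result to the next stage.
                    resourcePool.put(new FetchedResource(url, content));
                }
            } catch (InterruptedException e) {
                // An interrupt ends the loop; restore the flag for the caller.
                Thread.currentThread().interrupt();
            }
        };
    }
}
```

In this sketch, a blocking take() covers both "get" and "wait": the worker sleeps inside the queue until another item arrives or the thread is interrupted.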

Stop

There are two cases where the Crawler is stopped by the Gatherer: (1) there is nothing left to crawl (or summarize), or (2) the Gatherer is told to stop by an external class. When the Crawler is stopped, it will interrupt all of its worker threads and have them join the Gatherer thread.
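
The Start and Stop states can be illustrated with standard java.lang.Thread calls. This is a minimal sketch, assuming a plain ThreadGroup in place of GCSThreadGroup; the class name LifecycleSketch, its method signatures, and the thread names are hypothetical and are not taken from the Crawler API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the real Crawler manages a GCSThreadGroup.
final class LifecycleSketch {
    private final List<Thread> workers = new ArrayList<>();

    /** Start: create the worker threads in one group and set them running. */
    void start(int count, Runnable body) {
        ThreadGroup group = new ThreadGroup("crawler-workers");
        for (int i = 0; i < count; i++) {
            Thread t = new Thread(group, body, "crawler-worker-" + i);
            workers.add(t);
            t.start();
        }
    }

    /**
     * Stop: interrupt every worker, then join each one from the calling
     * thread. Pass a positive timeout; returns true if all workers ended.
     */
    boolean stop(long joinMillisPerThread) throws InterruptedException {
        for (Thread t : workers) {
            t.interrupt();
        }
        boolean allStopped = true;
        for (Thread t : workers) {
            t.join(joinMillisPerThread);
            if (t.isAlive()) {
                allStopped = false;
            }
        }
        return allStopped;
    }
}
```

Interrupting all workers before joining any of them lets the threads wind down in parallel, so the total wait is bounded by the slowest worker rather than the sum of all of them.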

See Also:
com.ibm.gcs.component, Gatherer, com.ibm.gcs.urlpool, com.ibm.gcs.resourcepool, com.ibm.gcs.netutil

Fields inherited from interface com.ibm.gcs.component.GCSThreaded
copyright
 
Fields inherited from interface com.ibm.gcs.component.Schedulable
copyright
 
Constructor Summary
Crawler(java.lang.String name, Component gatherer, Config config)
          (constructor)
 
Method Summary
 java.lang.Thread createThread(GCSThreadGroup tg)
          creates a crawler worker thread in the crawler GCSThreadGroup (from com.ibm.gcs.component.GCSThreaded).
static com.ibm.gcs.util.filepool.FilePool getContentFilePool()
           
 void run()
          runs the crawler threads
 void setResourcePool(com.ibm.gcs.resourcepool.ResourcePool resourcePool)
          tells the crawler where to find the ResourcePool (this should only be called by the Gatherer)
 void start()
          starts the crawler (from Schedulable)
 void stop()
          stops the crawlers (from Schedulable)
 
Methods inherited from class com.ibm.gcs.component.Component
getArgv, getConfig, getName, getTempFilePool, getVersion
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Crawler

public Crawler(java.lang.String name,
               Component gatherer,
               Config config)
        throws ConfigException
(constructor)
Parameters:
name - name of the crawler (usually "Crawler")
gatherer - gatherer that controls this crawler
config - configuration
Method Detail

setResourcePool

public void setResourcePool(com.ibm.gcs.resourcepool.ResourcePool resourcePool)
tells the crawler where to find the ResourcePool (this should only be called by the Gatherer)

start

public void start()
starts the crawler (from Schedulable)

run

public void run()
runs the crawler threads

stop

public void stop()
stops the crawlers (from Schedulable)

createThread

public java.lang.Thread createThread(GCSThreadGroup tg)
creates a crawler worker thread in the crawler GCSThreadGroup (from com.ibm.gcs.component.GCSThreaded). This is used by the thread group to create all of its threads at once.
Specified by:
createThread in interface GCSThreaded
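
The callback arrangement described here, where the group asks its owning component for each thread, might look roughly like the following. The sketch assumes a standard ThreadGroup in place of GCSThreadGroup; ThreadedComponent and createAll are hypothetical names that only mirror the shape of GCSThreaded.createThread and do not reproduce the real classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a "component supplies its own threads" callback.
final class CreateThreadSketch {

    /** Hypothetical analogue of GCSThreaded: builds one worker per call. */
    interface ThreadedComponent {
        Thread createThread(ThreadGroup group);
    }

    /** Asks the component for all of the group's threads in one pass. */
    static List<Thread> createAll(ThreadGroup group,
                                  ThreadedComponent owner,
                                  int size) {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            // The component decides what each worker runs; the caller only counts.
            threads.add(owner.createThread(group));
        }
        return threads;
    }
}
```

Under this arrangement the grouping code needs no knowledge of what a crawler worker does, which is one plausible reason for exposing createThread on the component.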

getContentFilePool

public static com.ibm.gcs.util.filepool.FilePool getContentFilePool()

EIP Web Crawler APIs

(c) Copyright International Business Machines Corporation 1996, 2002. IBM Corp. All rights reserved.