Enterprise Information Portal APIs

com.ibm.gcs.gatherer
Class Gatherer

java.lang.Object
  |
  +--com.ibm.gcs.component.Component
        |
        +--com.ibm.gcs.gatherer.Gatherer
All Implemented Interfaces:
java.lang.Runnable, Schedulable

public class Gatherer
extends Component

This is the main Component of GCS. It starts, monitors, and stops the Crawler and Summarizer. It is called from an external ComponentRunner, such as the executable GCS class.

The Gatherer has four states:

Construct/Configure

The Gatherer is constructed and configured by some external class that implements ComponentRunner (e.g., GCS). It is passed a Config file name (or an actual Config), which it uses to construct and configure Crawler and Summarizer components. The Gatherer is also responsible for configuring the loggers, network properties, text and graph status monitors, and the Temp FilePool.

Start

Next, the Gatherer is started by some external class (e.g., GCS). All it does here is create and run a new "Gatherer" Thread, which automatically brings it to its next state.

Run

The Gatherer thread starts the Crawler and Summarizer, which actually do the work. From then on, the Gatherer thread only checks and reports the crawl status, and stops when everything is done. The status may be output as text to System.out (using the "gcs.status.monitor" logger), or as a graph to the MonitorGraphComponent.
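The Start and Run states described above follow a common thread-plus-polling pattern. The following is a minimal sketch of that pattern only, not the actual Gatherer source: the class MonitorSketch, its pendingWork counter (standing in for the pool sizes), and the poll interval are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for illustration; not com.ibm.gcs.gatherer.Gatherer.
class MonitorSketch implements Runnable {
    private final AtomicInteger pendingWork;      // assumed stand-in for URLPool/ResourcePool sizes
    private final long pollMillis;                // assumed polling interval
    private volatile boolean stopRequested = false;
    private volatile boolean done = false;
    private Thread thread;
    private int statusReports = 0;

    MonitorSketch(int initialWork, long pollMillis) {
        this.pendingWork = new AtomicInteger(initialWork);
        this.pollMillis = pollMillis;
    }

    // Start state: create and run a new thread named "Gatherer".
    public synchronized void start() {
        thread = new Thread(this, "Gatherer");
        thread.start();
    }

    // Run state: report status on each pass and exit when nothing is left
    // or when a stop was requested.
    @Override
    public void run() {
        while (!stopRequested && pendingWork.get() > 0) {
            statusReports++;
            System.out.println("pending=" + pendingWork.get());
            pendingWork.decrementAndGet();        // simulates workers finishing one item per poll
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        done = true;
    }

    public void stop() { stopRequested = true; }

    public boolean isDone() { return done; }

    public int getStatusReports() { return statusReports; }

    // Blocks until the monitoring thread has exited.
    public void awaitFinish() {
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In this sketch the monitoring thread does no real work itself; it only observes a counter, which mirrors how the Gatherer thread is described as leaving the actual crawling and summarizing to its sub-components.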

Stop

There are two cases where the Gatherer is stopped: (1) it was told to stop by some external class, or (2) there is nothing left to crawl or summarize (the URLPool is empty, the ResourcePool is empty, all of the Crawler worker threads are waiting for new URLContainer URLs, and all of the Summarizer threads are waiting for new Resources). When the Gatherer is stopped, it stops the Crawler and Summarizer, and cleans up the temp file pool.
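The two stop cases above can be written as a single predicate. This is a sketch under assumptions: the class StopConditionSketch and its method names are hypothetical, and the four counters are assumed to correspond to the values returned by getURLPoolSize(), getResourcePoolSize(), getNumCrawlers(), and getNumSummarizers() (the latter two count working threads, so zero means every thread is waiting).

```java
// Hypothetical helper for illustration; not part of the GCS API.
class StopConditionSketch {

    // Case (2): both pools are empty and no worker thread is busy.
    static boolean nothingLeft(int urlPoolSize, int resourcePoolSize,
                               int workingCrawlers, int workingSummarizers) {
        return urlPoolSize == 0
            && resourcePoolSize == 0
            && workingCrawlers == 0
            && workingSummarizers == 0;
    }

    // Case (1) or case (2): an external stop request, or nothing left to do.
    static boolean shouldStop(boolean stopRequested, int urlPoolSize, int resourcePoolSize,
                              int workingCrawlers, int workingSummarizers) {
        return stopRequested
            || nothingLeft(urlPoolSize, resourcePoolSize, workingCrawlers, workingSummarizers);
    }
}
```

Note that empty pools alone are not sufficient: a crawler that is still working may add new URLs or Resources, which is why the working-thread counts are part of the condition.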

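If you are writing a program that calls the Gatherer and want to be notified when it is done, you can wait on the gatherer.isDone object (presumably the public java.lang.Boolean field, since the isDone() method returns a primitive boolean, which cannot be waited on).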
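Waiting for completion in this way relies on the standard Java monitor idiom: wait inside a synchronized block, in a loop that rechecks the condition. The sketch below shows that idiom with a hypothetical DoneSignalSketch class; it assumes the finishing thread calls notifyAll() on the same object, which this page does not confirm for the Gatherer.

```java
// Hypothetical stand-in for illustration; not the Gatherer's actual implementation.
class DoneSignalSketch {
    private final Object doneLock = new Object(); // stands in for the object the caller waits on
    private boolean done = false;

    // Called by the finishing thread: set the flag and wake all waiters.
    void markDone() {
        synchronized (doneLock) {
            done = true;
            doneLock.notifyAll();
        }
    }

    // Called by the client: returns true if done was signalled within the timeout.
    boolean awaitDone(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (doneLock) {
            while (!done) {                       // loop guards against spurious wakeups
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                try {
                    doneLock.wait(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    boolean isDone() {
        synchronized (doneLock) {
            return done;
        }
    }
}
```

A client that cannot rely on a notification can instead poll isDone() at an interval, at the cost of some latency.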

See Also:
Crawler, Summarizer

Field Summary
static boolean crawlerStatus
           
 java.lang.Boolean isDone
           
static boolean summarizerStatus
           
static boolean threadStatus
           
 
Fields inherited from interface com.ibm.gcs.component.Schedulable
copyright
 
Constructor Summary
Gatherer(ComponentRunner componentRunner, java.lang.String[] args, Config config)
          (constructor)
Gatherer(ComponentRunner componentRunner, java.lang.String[] args, java.lang.String configFileName)
          (constructor)
 
Method Summary
 void crawlerUpdate(boolean threadWorking)
          update the gatherer that a crawler thread is working or waiting
 Crawler getCrawler()
          returns the crawler sub-component
 int getMaxNumURLsToCrawl()
          get the maximum number of URLs to crawl
 int getNumCrawlers()
          get the number of working crawler threads
 int getNumResourcesSummarized()
          get the number of URLs that have been summarized
 int getNumSummarizers()
          get the number of working summarizer threads
 int getNumURLsCrawled()
          get the number of URLs that have been crawled
 int getResourcePoolSize()
          get the number of URLs waiting to be summarized
 com.ibm.gcs.summarizer.Summarizer getSummarizer()
          returns the summarizer sub-component
 int getURLPoolSize()
          get the number of URLs waiting to be crawled
 boolean isDone()
          returns true if the Gatherer is all done
 void kill(java.lang.String reason)
          kills a crawl!!!
 void run()
          start the crawler and summarizer threads; loop monitoring
 void setCrawlerThreadGroup(GCSThreadGroup crawlerThreadGroup)
          sets the crawler thread group (called by crawler start method)
 void setResourcePool(com.ibm.gcs.resourcepool.ResourcePool resourcePool)
          sets the Resource Pool (called by the summarizer constructor)
 void setSummarizerThreadGroup(GCSThreadGroup summarizerThreadGroup)
          sets the summarizer thread group (called by summarizer start method)
 void setURLPool(com.ibm.gcs.urlpool.URLPool urlPool)
          sets the URL Pool (called by the crawler constructor)
 void start()
          start gatherer
 void stop()
          stop a crawl
 void summarizerUpdate(boolean threadWorking)
          update the gatherer that a summarizer thread is working or waiting
static void threadStatusUpdate(char newStatus)
           
static void threadStatusUpdate(char newStatus, char newStatusExt)
           
 
Methods inherited from class com.ibm.gcs.component.Component
getArgv, getConfig, getName, getTempFilePool, getVersion
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

isDone

public java.lang.Boolean isDone

crawlerStatus

public static boolean crawlerStatus

summarizerStatus

public static boolean summarizerStatus

threadStatus

public static boolean threadStatus

Constructor Detail

Gatherer

public Gatherer(ComponentRunner componentRunner,
                java.lang.String[] args,
                Config config)
         throws ConfigException
(constructor)
Parameters:
componentRunner - the ComponentRunner that is creating this gatherer
args[] - command line arguments (unused?)
config - the configuration to use

Gatherer

public Gatherer(ComponentRunner componentRunner,
                java.lang.String[] args,
                java.lang.String configFileName)
         throws ConfigException
(constructor)
Parameters:
componentRunner - the ComponentRunner that is creating this gatherer
args[] - command line arguments (unused?)
configFileName - the name of the configuration file to use

Method Detail

setURLPool

public void setURLPool(com.ibm.gcs.urlpool.URLPool urlPool)
sets the URL Pool (called by the crawler constructor)

setResourcePool

public void setResourcePool(com.ibm.gcs.resourcepool.ResourcePool resourcePool)
sets the Resource Pool (called by the summarizer constructor)

setCrawlerThreadGroup

public void setCrawlerThreadGroup(GCSThreadGroup crawlerThreadGroup)
sets the crawler thread group (called by crawler start method)

setSummarizerThreadGroup

public void setSummarizerThreadGroup(GCSThreadGroup summarizerThreadGroup)
sets the summarizer thread group (called by summarizer start method)

start

public void start()
start gatherer

run

public void run()
start the crawler and summarizer threads; loop monitoring

stop

public void stop()
stop a crawl

kill

public void kill(java.lang.String reason)
kills a crawl!!!

crawlerUpdate

public void crawlerUpdate(boolean threadWorking)
update the gatherer that a crawler thread is working or waiting
Parameters:
threadWorking - whether the current thread is crawling or waiting for a URL
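One plausible way to back crawlerUpdate (and, symmetrically, summarizerUpdate) is a synchronized counter of working threads, read by getNumCrawlers(). The following is a sketch under that assumption; the class WorkerCountSketch and its method names are hypothetical and are not taken from the GCS source.

```java
// Hypothetical stand-in for illustration; not the Gatherer's actual bookkeeping.
class WorkerCountSketch {
    private int working = 0;

    // A worker thread reports true when it begins work and false when it goes back to waiting.
    synchronized void update(boolean threadWorking) {
        if (threadWorking) {
            working++;
        } else if (working > 0) {   // never drop below zero on an unmatched "waiting" report
            working--;
        }
    }

    // Number of threads currently reported as working.
    synchronized int getNumWorking() {
        return working;
    }
}
```

Both methods are synchronized because many worker threads may report concurrently while the monitoring thread reads the count.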

threadStatusUpdate

public static void threadStatusUpdate(char newStatus)

threadStatusUpdate

public static void threadStatusUpdate(char newStatus,
                                      char newStatusExt)

summarizerUpdate

public void summarizerUpdate(boolean threadWorking)
update the gatherer that a summarizer thread is working or waiting
Parameters:
threadWorking - whether the current thread is summarizing or waiting for a resource

getNumCrawlers

public int getNumCrawlers()
get the number of working crawler threads

getNumSummarizers

public int getNumSummarizers()
get the number of working summarizer threads

getURLPoolSize

public int getURLPoolSize()
get the number of URLs waiting to be crawled

getNumURLsCrawled

public int getNumURLsCrawled()
get the number of URLs that have been crawled

getMaxNumURLsToCrawl

public int getMaxNumURLsToCrawl()
get the maximum number of URLs to crawl

getResourcePoolSize

public int getResourcePoolSize()
get the number of URLs waiting to be summarized

getNumResourcesSummarized

public int getNumResourcesSummarized()
get the number of URLs that have been summarized

getCrawler

public Crawler getCrawler()
returns the crawler sub-component

getSummarizer

public com.ibm.gcs.summarizer.Summarizer getSummarizer()
returns the summarizer sub-component

isDone

public boolean isDone()
returns true if the Gatherer is all done

EIP Web Crawler APIs

(c) Copyright International Business Machines Corporation 1996, 2002. IBM Corp. All rights reserved.