Enterprise Information Portal APIs
java.lang.Object
  |
  +--com.ibm.gcs.component.Component
        |
        +--com.ibm.gcs.crawler.Crawler
This Component creates Resources from the downloaded content and header/meta information for each URLContainer in the URLPool. It is started, monitored, and stopped by the Gatherer.

The Crawler has four states:

First, it is constructed and configured by the Gatherer with a particular Config. At this point, the Crawler creates the URLPool and adds the seed URLs, and creates the crawler FilePool (or tells the crawler to use the temp file pool).

The Crawler is started by the Gatherer. It creates a new GCSThreadGroup of worker threads and starts them running. This puts it in the next state...

Each Crawler worker thread runs through a loop with the following steps:
  1. Gets a URLContainer from the URLPool.
  2. Creates a Resource with the URL header/meta information and content data, using the appropriate Connection, StreamHandler, and ContentHandler classes in the com.ibm.gcs.netutil package. In some cases, it saves the content to a file in the content FilePool and gives the Resource the file name.
  3. Adds the Resource to the ResourcePool so that the Summarizer can summarize it.
  4. ... URLContainer available in the URLPool.
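
The worker loop described above can be sketched roughly as follows. This is a minimal illustration, not the com.ibm.gcs implementation: `SketchResource`, `WorkerLoopSketch`, `fetch`, and `drain` are hypothetical names, a plain queue of strings stands in for the URLPool, and a list stands in for the ResourcePool. It also assumes the loop ends once the pool has no more entries, which the description does not state outright.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for Resource; NOT the com.ibm.gcs class.
class SketchResource {
    final String url;
    final String content;

    SketchResource(String url, String content) {
        this.url = url;
        this.content = content;
    }
}

class WorkerLoopSketch {
    // Stand-in for the download step. The real crawler is described as using
    // Connection, StreamHandler, and ContentHandler classes for this.
    static String fetch(String url) {
        return "content of " + url;
    }

    // One worker's loop: take a URL, build a resource from it, hand the
    // resource on, and repeat until the pool has no more URLs (assumption).
    static List<SketchResource> drain(Queue<String> urlPool) {
        List<SketchResource> resourcePool = new ArrayList<>();
        String url;
        while ((url = urlPool.poll()) != null) {
            resourcePool.add(new SketchResource(url, fetch(url)));
        }
        return resourcePool;
    }

    public static void main(String[] args) {
        Queue<String> urlPool = new ConcurrentLinkedQueue<>();
        urlPool.add("http://example.com/a");
        urlPool.add("http://example.com/b");
        for (SketchResource r : drain(urlPool)) {
            System.out.println(r.url + " -> " + r.content);
        }
    }
}
```

A thread-safe queue is used here because several workers would pull from the same pool at once; how the real URLPool synchronizes access is not covered by this page.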

There are two cases where the Crawler is stopped by the Gatherer: (1) there is nothing left to crawl (or summarize), or (2) the Gatherer is told to stop by an external class. When the Crawler is stopped, it interrupts all of its worker threads and has them join the Gatherer thread.
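
The interrupt-then-join shutdown mentioned above might look like the sketch below. It uses only standard `java.lang.Thread` calls; `StopSketch`, `startWorkers`, and `stopWorkers` are hypothetical names, the sleeping loop is a placeholder for real crawling work, and the calling thread plays the role of the Gatherer thread.

```java
import java.util.ArrayList;
import java.util.List;

class StopSketch {
    // Starts `count` workers that idle until they are interrupted.
    static List<Thread> startWorkers(int count) {
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        Thread.sleep(50); // placeholder for crawling work
                    }
                } catch (InterruptedException e) {
                    // Interrupted: fall through so the thread can end.
                }
            }, "worker-" + i);
            t.start();
            workers.add(t);
        }
        return workers;
    }

    // Interrupts every worker first, then waits for each one to finish.
    static void stopWorkers(List<Thread> workers) {
        for (Thread t : workers) {
            t.interrupt();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status and stop waiting.
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        List<Thread> workers = startWorkers(3);
        stopWorkers(workers);
        System.out.println("all workers stopped");
    }
}
```

Interrupting all workers before joining any of them lets the threads wind down in parallel instead of one after another.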

See Also: Gatherer, com.ibm.gcs.urlpool, com.ibm.gcs.resourcepool, com.ibm.gcs.netutil

Fields inherited from interface com.ibm.gcs.component.GCSThreaded
    copyright

Fields inherited from interface com.ibm.gcs.component.Schedulable
    copyright

Constructor Summary
Crawler(java.lang.String name, Component gatherer, Config config)
    (constructor)

Method Summary
java.lang.Thread  createThread(GCSThreadGroup tg)
    Creates a crawler worker thread in the crawler GCSThreadGroup (from com.ibm.gcs.component.GCSThreaded).
static com.ibm.gcs.util.filepool.FilePool  getContentFilePool()
void  run()
    Runs the crawler threads.
void  setResourcePool(com.ibm.gcs.resourcepool.ResourcePool resourcePool)
    Tells the crawler where to find the ResourcePool (this should only be called by the Gatherer).
void  start()
    Starts the crawler (from Schedulable).
void  stop()
    Stops the crawler (from Schedulable).

Methods inherited from class com.ibm.gcs.component.Component
    getArgv, getConfig, getName, getTempFilePool, getVersion

Methods inherited from class java.lang.Object
    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail
public Crawler(java.lang.String name, Component gatherer, Config config) throws ConfigException
    Parameters:
        name - name of the crawler (usually "Crawler")
        gatherer - gatherer that controls this crawler
        config - configuration

Method Detail
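public void setResourcePool(com.ibm.gcs.resourcepool.ResourcePool resourcePool)
    Tells the crawler where to find the ResourcePool (this should only be called by the Gatherer).

public void start()
    Starts the crawler (from Schedulable).

public void run()
    Runs the crawler threads.

public void stop()
    Stops the crawler (from Schedulable).

public java.lang.Thread createThread(GCSThreadGroup tg)
    Creates a crawler worker thread in the crawler GCSThreadGroup (from com.ibm.gcs.component.GCSThreaded). This is used by the thread group to create all of its threads at once.
    Specified by: createThread in interface GCSThreaded
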
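
The callback arrangement described for createThread, where the thread group asks its owner for each worker thread, can be illustrated with standard `java.lang.ThreadGroup` and `Thread` classes. This is only a sketch under assumptions: `ThreadedSketch`, `GroupSketch`, and `createAll` are hypothetical names, and the real GCSThreadGroup and GCSThreaded types may differ in ways this page does not show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical analogue of GCSThreaded: the owner supplies worker threads on request.
interface ThreadedSketch {
    Thread createThread(ThreadGroup group);
}

class GroupSketch {
    // Asks the owner for `count` threads in one pass, all placed in the same group.
    static List<Thread> createAll(ThreadedSketch owner, ThreadGroup group, int count) {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            threads.add(owner.createThread(group));
        }
        return threads;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger finished = new AtomicInteger();
        ThreadGroup group = new ThreadGroup("crawler-workers");

        // The owner decides what each worker runs; here it only counts completions.
        ThreadedSketch owner = g -> new Thread(g, () -> finished.incrementAndGet(), "worker");

        List<Thread> workers = createAll(owner, group, 4);
        for (Thread t : workers) {
            t.start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("workers finished: " + finished.get());
    }
}
```

Keeping thread creation in the owner means the group does not need to know what a worker does, only how many to request.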
public static com.ibm.gcs.util.filepool.FilePool getContentFilePool()
EIP Web Crawler APIs