IBM Information Integrator for Content V8.2 APIs

com.ibm.mm.beans.infomining
Class CMBWebCrawlerService

java.lang.Object
  |
  +--com.ibm.mm.beans.infomining.CMBInfoMiningBean
        |
        +--com.ibm.mm.beans.infomining.CMBConnectedMiningBean
              |
              +--com.ibm.mm.beans.infomining.CMBWebCrawlerService
All Implemented Interfaces:
com.ibm.mm.beans.CMBConnectionReplyListener, java.util.EventListener, java.io.Serializable

public class CMBWebCrawlerService
extends CMBConnectedMiningBean

CMBWebCrawlerService - Create CMBItems from crawled documents.

This bean monitors the webspace, that is, the crawled files created by the Web Crawler. The files are moved to an "archive" directory, which is created in the webspace directory. CMBItems are created from those files and can then be categorized (with the CMBCategorizationService), summarized (CMBSummarizationService), or imported into the database (CMBCatalogService). Each crawled file must begin with a metadata "headline" containing its URL, last modified date, and so on. To make the crawler write such a headline, set the SAVE_HEADLINES option in the STORE section of the web crawler IMY.INI file. See the web crawler documentation for directory and file details.
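
The archiving step described above can be pictured with a minimal sketch. It is not the bean's implementation: the class name ArchiveSketch, the method archiveCrawledFiles, and the assumption that crawled files sit directly under the webspace directory are all hypothetical and only illustrate moving files into an "archive" subdirectory with the standard java.nio.file API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the IBM API.
public class ArchiveSketch {

    /**
     * Moves every regular file directly under webspaceDir into
     * webspaceDir/archive and returns the new locations.
     * Assumption: crawled files are not nested in subdirectories.
     */
    public static List<Path> archiveCrawledFiles(Path webspaceDir) throws IOException {
        Path archive = webspaceDir.resolve("archive");
        Files.createDirectories(archive); // does nothing if the directory already exists

        // Collect the candidates first so that no file is moved while the
        // directory stream is still open.
        List<Path> pending = new ArrayList<Path>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(webspaceDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    pending.add(entry);
                }
            }
        }

        List<Path> moved = new ArrayList<Path>();
        for (Path file : pending) {
            Path target = archive.resolve(file.getFileName());
            moved.add(Files.move(file, target, StandardCopyOption.REPLACE_EXISTING));
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for a real webspace directory.
        Path webspace = Files.createTempDirectory("webspace");
        Files.write(webspace.resolve("page1.txt"),
                "crawled text".getBytes(StandardCharsets.UTF_8));
        List<Path> moved = archiveCrawledFiles(webspace);
        System.out.println("Archived " + moved.size() + " file(s) to " + webspace.resolve("archive"));
    }
}
```

Whether the bean keeps the archived copies afterwards is controlled by setArchiveEnabled; the sketch always keeps them.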

See Also:
Serialized Form

Constructor Summary
CMBWebCrawlerService()
          Default constructor.
 
Method Summary
 void addCMBResultListener(com.ibm.mm.beans.CMBResultListener l)
          Adds the specified result listener to receive events from this bean.
 java.lang.String getFilterEncoding()
          Gets the filter encoding.
 int getPageSize()
          Gets the number of CMBItems within a single CMBTextAnalysisRequestEvent.
 int getPollCycles()
          Gets overall number of times to poll.
 int getPollMinutes()
          Gets the minutes to wait before beginning the next poll.
 java.lang.String getRootDirectory()
          Gets the root directory where the crawler stores the crawled documents.
 java.lang.String getWebSpace()
          Gets the webspace that is monitored by the Web Crawler.
 boolean isArchiveEnabled()
          Returns whether imported files are kept in the archive.
 void removeCMBResultListener(com.ibm.mm.beans.CMBResultListener l)
          Removes the specified result listener so that it no longer receives events from this bean.
 void setArchiveEnabled(boolean keepInArchive)
          Sets whether imported files are kept in the archive.
 void setFilterEncoding(java.lang.String filterEncoding)
          Sets the filter encoding.
 void setPageSize(int pageSize)
          Sets the number of CMBItems to be carried by a single CMBTextAnalysisRequestEvent.
 void setPollCycles(int newPollCycles)
          Sets overall number of times to poll.
 void setPollMinutes(int newPollMinutes)
          Sets minutes to wait before beginning next poll.
 void setRootDirectory(java.lang.String rootDir)
          Sets the root directory where the crawler stores the crawled documents.
 void setWebSpace(java.lang.String webSpace)
          Sets the webspace that is monitored by the Web Crawler.
 void start()
          Start the polling of the webspace.
 
Methods inherited from class com.ibm.mm.beans.infomining.CMBConnectedMiningBean
getConnection, isConnected, onCMBConnectionReply, setConnection, validateConnection
 
Methods inherited from class com.ibm.mm.beans.infomining.CMBInfoMiningBean
addCMBExceptionListener, addCMBTraceListener, isTraceEnabled, removeCMBExceptionListener, removeCMBTraceListener, setTraceEnabled
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CMBWebCrawlerService

public CMBWebCrawlerService()
Default constructor.
Method Detail

start

public void start()
           throws com.ibm.mm.beans.CMBNoConnectionException
Starts polling the webspace. Customize the webspace and the root directory in which it is located for your setup before calling this method. The method returns only after the configured number of polls has completed (set with setPollCycles; the default is 1 million). Polls are separated by a sleeping phase (set with setPollMinutes; the default is 30 minutes).
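
The interaction between the poll cycle count and the sleeping phase can be sketched as a simple loop. This is an illustration under stated assumptions, not the bean's code: PollLoopSketch, Sleeper, and run are hypothetical names, and the choice not to sleep after the final poll is an assumption that the documentation does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the IBM API.
public class PollLoopSketch {

    /** Abstraction over sleeping so the loop can be exercised without waiting. */
    public interface Sleeper {
        void sleepMinutes(int minutes) throws InterruptedException;
    }

    /**
     * Runs pollOnce exactly pollCycles times, sleeping pollMinutes between
     * consecutive polls, and returns the number of polls performed.
     */
    public static int run(int pollCycles, int pollMinutes, Runnable pollOnce, Sleeper sleeper)
            throws InterruptedException {
        int polls = 0;
        for (int cycle = 0; cycle < pollCycles; cycle++) {
            pollOnce.run();
            polls++;
            // Assumption: no sleeping phase follows the last poll.
            if (cycle < pollCycles - 1) {
                sleeper.sleepMinutes(pollMinutes);
            }
        }
        return polls;
    }

    public static void main(String[] args) throws InterruptedException {
        // Record the requested sleeps instead of actually waiting.
        final List<Integer> sleeps = new ArrayList<Integer>();
        int polls = run(3, 30,
                () -> System.out.println("polling webspace"),
                minutes -> sleeps.add(minutes));
        System.out.println("polls=" + polls + ", sleeps requested=" + sleeps);
    }
}
```

Because a loop of this shape blocks its caller until every cycle has finished, a caller that must stay responsive would probably invoke start() from a dedicated thread or lower the cycle count first.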

addCMBResultListener

public void addCMBResultListener(com.ibm.mm.beans.CMBResultListener l)
Adds the specified result listener to receive events from this bean.
Parameters:
l - the result listener

removeCMBResultListener

public void removeCMBResultListener(com.ibm.mm.beans.CMBResultListener l)
Removes the specified result listener so that it no longer receives events from this bean.
Parameters:
l - the result listener

setFilterEncoding

public void setFilterEncoding(java.lang.String filterEncoding)
Sets the filter encoding.
Parameters:
filterEncoding - the filter encoding
See Also:
getFilterEncoding()

getFilterEncoding

public java.lang.String getFilterEncoding()
Gets the filter encoding.
Returns:
filter encoding
See Also:
setFilterEncoding(String)

setPollCycles

public void setPollCycles(int newPollCycles)
Sets overall number of times to poll.
Parameters:
newPollCycles - overall number of times to poll
See Also:
getPollCycles()

getPollCycles

public int getPollCycles()
Gets overall number of times to poll.
Returns:
overall number of times to poll
See Also:
setPollCycles(int)

setPollMinutes

public void setPollMinutes(int newPollMinutes)
Sets minutes to wait before beginning next poll.
Parameters:
newPollMinutes - minutes to wait before beginning the next poll
See Also:
getPollMinutes()

getPollMinutes

public int getPollMinutes()
Gets the minutes to wait before beginning the next poll.
Returns:
minutes to wait before beginning next poll
See Also:
setPollMinutes(int)

setRootDirectory

public void setRootDirectory(java.lang.String rootDir)
Sets the root directory where the crawler stores the crawled documents.
Parameters:
rootDir - root directory where the crawler stores the crawled documents
See Also:
getRootDirectory()

getRootDirectory

public java.lang.String getRootDirectory()
Gets the root directory where the crawler stores the crawled documents.
Returns:
directory where the crawler stores the crawled documents
See Also:
setRootDirectory(String)

setArchiveEnabled

public void setArchiveEnabled(boolean keepInArchive)
Sets whether imported files are kept in the archive.
Parameters:
keepInArchive - true if imported files should be kept in the archive
See Also:
isArchiveEnabled()

isArchiveEnabled

public boolean isArchiveEnabled()
Returns whether imported files are kept in the archive.
Returns:
true if imported files should be kept in archive
See Also:
setArchiveEnabled(boolean)

setPageSize

public void setPageSize(int pageSize)
Sets the number of CMBItems to be carried by a single CMBTextAnalysisRequestEvent.
Parameters:
pageSize - number of CMBItems to be carried by a single CMBTextAnalysisRequestEvent
See Also:
getPageSize()
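
The effect of the page size can be pictured as splitting a list of items into batches, one batch per event. The sketch below is only an illustration: PageSketch and paginate are hypothetical names, plain strings stand in for CMBItems, and how the bean actually fills a CMBTextAnalysisRequestEvent is not described in this documentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the IBM API.
public class PageSketch {

    /**
     * Splits items into consecutive pages of at most pageSize elements.
     * The last page may be shorter than pageSize.
     */
    public static <T> List<List<T>> paginate(List<T> items, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<List<T>> pages = new ArrayList<List<T>>();
        for (int start = 0; start < items.size(); start += pageSize) {
            int end = Math.min(start + pageSize, items.size());
            // Copy the sublist so each page is independent of the source list.
            pages.add(new ArrayList<T>(items.subList(start, end)));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Strings stand in for the items created from crawled files.
        List<String> items = Arrays.asList("doc1", "doc2", "doc3", "doc4", "doc5");
        for (List<String> page : paginate(items, 2)) {
            System.out.println("one event would carry: " + page);
        }
    }
}
```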

getPageSize

public int getPageSize()
Gets the number of CMBItems within a single CMBTextAnalysisRequestEvent.
Returns:
number of CMBItems to be carried by a single CMBTextAnalysisRequestEvent
See Also:
setPageSize(int)

setWebSpace

public void setWebSpace(java.lang.String webSpace)
Sets the webspace that is monitored by the Web Crawler.
Parameters:
webSpace - name of the webspace to be monitored by the Web Crawler
See Also:
getWebSpace()

getWebSpace

public java.lang.String getWebSpace()
Gets the webspace that is monitored by the Web Crawler.
Returns:
name of webspace to be monitored by the Web crawler
See Also:
setWebSpace(String)


© Copyright International Business Machines Corporation 1996, 2003 IBM Corp. All rights reserved.