IBM Information Integrator for Content V8.2 APIs

com.ibm.mm.beans.infomining
Class CMBDefaultContentProvider

java.lang.Object
  |
  +--com.ibm.mm.beans.infomining.CMBDefaultContentProvider
All Implemented Interfaces:
CMBContentProvider

public class CMBDefaultContentProvider
extends java.lang.Object
implements CMBContentProvider

Default implementation of the CMBContentProvider interface. This ContentProvider implementation allows additional properties to be specified to customize the processing of CMBItems.

It can extract the textual content of a large number of document formats. By default it tries to retrieve text from all parts of a CMBItem, concatenates the pieces, and returns the result. A mask can be set so that only certain parts are processed. This is helpful when the document model is known, i.e. when it is known which parts are important for indexing.
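The default behaviour described above, combined with the size limit documented under setMaxPartSize, can be pictured with the following minimal sketch. It is a model only, not the library's implementation: the class PartSelectionSketch and the method collectText are hypothetical names, and the sketch assumes that a null mask means "all parts", that part indices are zero-based, and that out-of-range indices are skipped. None of these points is stated on this page. The real provider runs a document filter on each part; the sketch simply decodes UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Illustrative model only; hypothetical names, not part of the IBM API.
class PartSelectionSketch {

    /**
     * Concatenates the text of the selected parts.
     * Assumption: mask == null selects every part; indices are zero-based.
     * Parts whose size exceeds maxPartSize bytes are skipped.
     */
    static String collectText(List<byte[]> parts, int[] mask, int maxPartSize) {
        StringBuilder text = new StringBuilder();
        if (mask == null) {
            for (byte[] part : parts) {
                appendIfSmallEnough(text, part, maxPartSize);
            }
        } else {
            for (int index : mask) {
                // Assumption: indices outside the item are ignored silently.
                if (index >= 0 && index < parts.size()) {
                    appendIfSmallEnough(text, parts.get(index), maxPartSize);
                }
            }
        }
        return text.toString();
    }

    private static void appendIfSmallEnough(StringBuilder text, byte[] part, int maxPartSize) {
        if (part.length <= maxPartSize) {
            // Stand-in for the document filter: decode the bytes as UTF-8.
            text.append(new String(part, StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) {
        List<byte[]> parts = Arrays.asList(
            "title ".getBytes(StandardCharsets.UTF_8),
            "a long body that exceeds a small limit ".getBytes(StandardCharsets.UTF_8),
            "notes".getBytes(StandardCharsets.UTF_8));
        System.out.println(collectText(parts, null, 1024));             // all parts
        System.out.println(collectText(parts, new int[] {0, 2}, 1024)); // masked
        System.out.println(collectText(parts, null, 10));               // size limit applied
    }
}
```

In this model the mask and the size limit are independent: a part contributes text only if it is selected by the mask and does not exceed the limit.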


Field Summary
static java.lang.String PLAIN_TEXT_Big5
          filter encoding for plain text, Big5, Traditional Chinese
static java.lang.String PLAIN_TEXT_EUC_CN
          filter encoding for plain text, GB2312, EUC encoding, Simplified Chinese
static java.lang.String PLAIN_TEXT_EUC_KR
          filter encoding for plain text, KS C 5601, EUC encoding, Korean
static java.lang.String PLAIN_TEXT_ISO8859_1
          plain text, ISO 8859-1, Latin alphabet No. 1
static java.lang.String PLAIN_TEXT_SJIS
          filter encoding for plain text, Shift-JIS, Japanese
static java.lang.String PLAIN_TEXT_UnicodeBig
          filter encoding for plain text, Sixteen-bit Unicode Transformation Format, big-endian byte order, with byte-order mark
static java.lang.String PLAIN_TEXT_UnicodeLittle
          filter encoding for plain text, Sixteen-bit Unicode Transformation Format, little-endian byte order, with byte-order mark
static java.lang.String PLAIN_TEXT_UTF16
          filter encoding for plain text, Sixteen-bit Unicode Transformation Format, byte order specified by a mandatory initial byte-order mark
static java.lang.String PLAIN_TEXT_UTF8
          filter encoding for plain text, Eight-bit Unicode Transformation Format
 
Constructor Summary
CMBDefaultContentProvider()
          Creates a new default content provider.
CMBDefaultContentProvider(int[] partIndices)
          Creates a new default content provider that processes only the specified parts.
 
Method Summary
 CMBTextDocument getContent(com.ibm.mm.beans.CMBConnection connection, com.ibm.mm.beans.CMBItem item)
          Returns the text of the specified item to be used for text analysis.
 java.lang.String getFilterEncoding()
          Returns the current filter encoding.
 int getMaxPartSize()
          Get maximal number of bytes in a part.
 int[] getPartsMask()
          Return indices of parts to be processed.
 void setFilterEncoding(java.lang.String encoding)
          Set the encoding for the underlying filter to be used.
 void setMaxPartSize(int v)
          Set maximal number of bytes in a part.
 void setPartsMask(int[] partIndices)
          Set the part indices of the parts to be processed.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

PLAIN_TEXT_ISO8859_1

public static final java.lang.String PLAIN_TEXT_ISO8859_1
plain text, ISO 8859-1, Latin alphabet No. 1

PLAIN_TEXT_UTF8

public static final java.lang.String PLAIN_TEXT_UTF8
filter encoding for plain text, Eight-bit Unicode Transformation Format

PLAIN_TEXT_UTF16

public static final java.lang.String PLAIN_TEXT_UTF16
filter encoding for plain text, Sixteen-bit Unicode Transformation Format, byte order specified by a mandatory initial byte-order mark

PLAIN_TEXT_UnicodeLittle

public static final java.lang.String PLAIN_TEXT_UnicodeLittle
filter encoding for plain text, Sixteen-bit Unicode Transformation Format, little-endian byte order, with byte-order mark

PLAIN_TEXT_UnicodeBig

public static final java.lang.String PLAIN_TEXT_UnicodeBig
filter encoding for plain text, Sixteen-bit Unicode Transformation Format, big-endian byte order, with byte-order mark
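The three sixteen-bit constants above differ only in how the byte order is determined. The following sketch illustrates the byte-order mark convention using the Java standard library alone. It makes no claim about how the underlying filter reads the mark or about the actual values of the constants, which are not documented on this page; ByteOrderMarkSketch and detectByteOrder are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only; hypothetical names, not part of the IBM API.
class ByteOrderMarkSketch {

    /** Names the byte order announced by a leading UTF-16 byte-order mark, if any. */
    static String detectByteOrder(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "big-endian";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "little-endian";
            }
        }
        return "no byte-order mark";
    }

    public static void main(String[] args) {
        byte[] big = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};    // mark + 'A', big-endian
        byte[] little = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // mark + 'A', little-endian
        System.out.println(detectByteOrder(big));
        System.out.println(detectByteOrder(little));
        // Java's UTF-16 decoder reads the mark and selects the byte order itself.
        System.out.println(new String(big, StandardCharsets.UTF_16));
        System.out.println(new String(little, StandardCharsets.UTF_16));
    }
}
```

Plain text files that start with a mark of either byte order would presumably match PLAIN_TEXT_UTF16; the other two constants name one fixed byte order each.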

PLAIN_TEXT_SJIS

public static final java.lang.String PLAIN_TEXT_SJIS
filter encoding for plain text, Shift-JIS, Japanese

PLAIN_TEXT_EUC_CN

public static final java.lang.String PLAIN_TEXT_EUC_CN
filter encoding for plain text, GB2312, EUC encoding, Simplified Chinese

PLAIN_TEXT_EUC_KR

public static final java.lang.String PLAIN_TEXT_EUC_KR
filter encoding for plain text, KS C 5601, EUC encoding, Korean

PLAIN_TEXT_Big5

public static final java.lang.String PLAIN_TEXT_Big5
filter encoding for plain text, Big5, Traditional Chinese

Constructor Detail

CMBDefaultContentProvider

public CMBDefaultContentProvider()
Creates a new default content provider.

CMBDefaultContentProvider

public CMBDefaultContentProvider(int[] partIndices)
Creates a new default content provider that processes only the specified parts.
Parameters:
partIndices - part indices of the parts to be processed

Method Detail

setPartsMask

public void setPartsMask(int[] partIndices)
Set the part indices of the parts to be processed.
Parameters:
partIndices - array of part indices

getPartsMask

public int[] getPartsMask()
Return indices of parts to be processed.
Returns:
array of the indices of the parts to be processed

getMaxPartSize

public int getMaxPartSize()
Gets the maximum number of bytes in a part. A part whose size exceeds this threshold is not processed.
Returns:
value of maxPartSize

setMaxPartSize

public void setMaxPartSize(int v)
Sets the maximum number of bytes in a part. A part whose size exceeds this threshold is not processed.
Parameters:
v - maximal number of bytes in a part

getContent

public CMBTextDocument getContent(com.ibm.mm.beans.CMBConnection connection,
                                  com.ibm.mm.beans.CMBItem item)
                           throws CMBContentProviderException
Description copied from interface: CMBContentProvider
Returns the text of the specified item to be used for text analysis.
Specified by:
getContent in interface CMBContentProvider
Following copied from interface: com.ibm.mm.beans.infomining.CMBContentProvider
Parameters:
connection - an open connection to the server
item - the current item to be processed
Returns:
the text of the specified item
Throws:
CMBContentProviderException - if an error occurred while processing the current item

setFilterEncoding

public void setFilterEncoding(java.lang.String encoding)
Sets the encoding to be used by the underlying filter. For plain text files, use the PLAIN_TEXT_* fields of this class; otherwise use the public fields of DKIKFDocumentFilter.
Parameters:
encoding - the filter encoding; use one of the PLAIN_TEXT_* fields of this class or a public field of DKIKFDocumentFilter
See Also:
DKIKFDocumentFilter, getFilterEncoding()

getFilterEncoding

public java.lang.String getFilterEncoding()
Returns the current filter encoding.
Returns:
current filter encoding
See Also:
setFilterEncoding(String)
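
A configuration sequence using the setters documented above might look like the following sketch. So that it compiles without the product jars, it declares a hypothetical stand-in class, ProviderStandIn, that copies only the signatures listed on this page. The value given to PLAIN_TEXT_UTF8, the part indices, and the size limit are placeholders chosen for illustration, not values taken from the library.

```java
// Illustrative only; ProviderConfigSketch and ProviderStandIn are hypothetical names.
class ProviderConfigSketch {

    /** Builds a provider restricted to two parts, with a size limit and a plain text encoding. */
    static ProviderStandIn configure() {
        ProviderStandIn provider = new ProviderStandIn();
        provider.setPartsMask(new int[] {0, 1});   // placeholder indices
        provider.setMaxPartSize(1000000);          // placeholder limit, in bytes
        provider.setFilterEncoding(ProviderStandIn.PLAIN_TEXT_UTF8);
        return provider;
    }

    public static void main(String[] args) {
        ProviderStandIn provider = configure();
        System.out.println(provider.getPartsMask().length);
        System.out.println(provider.getMaxPartSize());
        System.out.println(provider.getFilterEncoding());
    }
}

// Stand-in that mirrors the documented signatures of CMBDefaultContentProvider.
// It only stores the settings; it performs no content extraction.
class ProviderStandIn {
    // Placeholder: the real constant's value is not documented on this page.
    static final String PLAIN_TEXT_UTF8 = "placeholder-utf8";

    private int[] partsMask;
    private int maxPartSize;
    private String filterEncoding;

    void setPartsMask(int[] partIndices) { this.partsMask = partIndices; }
    int[] getPartsMask() { return partsMask; }
    void setMaxPartSize(int v) { this.maxPartSize = v; }
    int getMaxPartSize() { return maxPartSize; }
    void setFilterEncoding(String encoding) { this.filterEncoding = encoding; }
    String getFilterEncoding() { return filterEncoding; }
}
```

With the real library on the classpath, the stand-in would be replaced by CMBDefaultContentProvider, and the configured provider would then be passed an open CMBConnection and a CMBItem through getContent.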


© Copyright International Business Machines Corporation 1996, 2003 IBM Corp. All rights reserved.