
DXDatastoreTS

Purpose:

This class implements the Text Search (TS) datastore, or Text Search Engine. TS provides text indexing and search mechanisms; it does not store documents or folders. TS indexes the text parts of documents and processes search requests. The results of a text query submitted to TS are item identifiers, which are keys for retrieving the actual documents from the Content Manager datastore.

The execute and evaluate methods of DXDatastoreTS take text query strings expressed in the text query language; the syntax of this query string is described below. The DXTextQuery object accepts queries in this syntax and delegates the low-level query processing tasks to DXDatastoreTS.

Methods:

The methods of DXDatastoreTS are similar to those of DXDatastoreDL, except for those listed below, which are not applicable:

The following methods are either different or are in addition to those in DXDatastoreDL:

connect
connect(LPCTSTR datastoreName [, VARIANT userName]
        [,VARIANT authentication] [, VARIANT connectString]);

This connects to the datastore. The userName and authentication are for the server and the datastoreName is the name of the search service.

The connect string is optional; it is used to provide the communication type and port number, as well as a list of library server, user ID and authentication groupings.

Below is a sample of a connect string an end-user may supply:

Connect string

    [COMMTYPE={T | P}; PORT=portnumber;
    LIBACCESS=(libraryserver, userid, auth;...)]

Additional connect string parameters:

COMMTYPE
communication type. This can be set to T (TCPIP) or P (PIPES).

PORT
port number. This parameter must be included if the COMMTYPE is specified.

LIBACCESS
library access group. If this parameter is passed, do not specify the userName and authentication parameters in the connect method. Each library access group is related to a Content Manager server and consists of the library server name (for example, LIBSRVR2), user ID, and password for the server where the text parts are stored. You can specify one or more library access groups; if only one is specified, the parentheses are not needed.
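
The connect string parameters above can be assembled mechanically. The following is a minimal Python sketch, not part of the DX API: the helper name build_connect_string, the port number, the user ID, and the password are hypothetical, and the exact whitespace and separators accepted by the search service are assumptions based on the syntax shown above.

```python
def build_connect_string(commtype=None, port=None, lib_access=()):
    """Assemble a connect string from the documented parameters.

    lib_access is a sequence of (library_server, user_id, authentication)
    string tuples. This helper is a hypothetical illustration only.
    """
    parts = []
    if commtype is not None:
        if commtype not in ("T", "P"):
            raise ValueError("COMMTYPE must be T (TCPIP) or P (PIPES)")
        if port is None:
            # The reference states that PORT must accompany COMMTYPE.
            raise ValueError("PORT is required when COMMTYPE is specified")
        parts.append(f"COMMTYPE={commtype}")
    if port is not None:
        parts.append(f"PORT={port}")
    groups = [",".join(group) for group in lib_access]
    if groups:
        # Parentheses are optional for a single group; this sketch always
        # emits them. Groups are separated by semicolons (assumed spacing).
        parts.append("LIBACCESS=(" + ";".join(groups) + ")")
    return "; ".join(parts)


if __name__ == "__main__":
    # Port, user ID, and password are placeholder values.
    print(build_connect_string("T", 7502, [("LIBSRVR2", "USERID", "PASSWORD")]))
```

When LIBACCESS is supplied this way, the userName and authentication arguments of connect are left unspecified, as described above.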

There are different ways to engage the connect method. Below is a listing of the different ways to connect with Text Search:

connectPort
connectPort(LPCTSTR server [, VARIANT port]
            [, VARIANT communicationType]); 

This connects to the datastore. The server parameter is the name of the text search server. You need to specify the communication type (LPSTR_DX_TS_CTYP_TCPIP for TCPIP or LPSTR_DX_TS_CTYP_PIPES for PIPES) and the port number.

You can also connect to the datastore if you supply the search service name for server, an empty string for port, and zero as the value of communicationType.

disconnect
disconnect()

Disconnects from a datastore.

getOption
getOption(long option, VARIANT* value);

Gets a datastore option.

setOption
setOption(long option, VARIANT value);

Sets a datastore option.

datastoreName
BSTR datastoreName();

Gets the datastore name.

datastoreType
BSTR datastoreType();

Gets the datastore type.

userName
BSTR userName();

Gets the user name.

execute
LPDISPATCH execute(LPCTSTR command,
                   short commandLangType
                   [,VARIANT paramList]);

Executes a query using a command. The parameter list is in the form of an array of DXNVPairDL objects. The returned LPDISPATCH pointer contains a DXResultSetCursorDL object.

evaluate
VARIANT evaluate(LPCTSTR command,
                 short commandLangType
                 [,VARIANT paramList]);

Evaluates a query using a command. The parameter list is in the form of an array of DXNVPairDL objects. The value of the returned VARIANT is a DXResultsDL object.

evaluateQuery
VARIANT evaluateQuery(LPDISPATCH query);

Evaluates a query using a query object. The input parameter LPDISPATCH pointer contains a DXTextQueryTS object. The value of the returned VARIANT is a DXResultsDL object.

createQuery
LPDISPATCH createQuery(LPCTSTR command,
                       short commandLangType
                       [,VARIANT paramList]);

Creates a query object using a command. The parameter list is in the form of an array of DXNVPairDL objects. The returned LPDISPATCH pointer contains a DXTextQueryTS object.

isConnected
BOOL isConnected();

Returns TRUE if the datastore is connected.

listDataSources
LPDISPATCH listDataSources();

Gets a list of servers. The returned LPDISPATCH pointer contains a DXSequentialCollectionDL object.

listDataSourceNames
VARIANT listDataSourceNames(long* arraySize);

Gets a list of server names. The output parameter arraySize is the size of the array.

listEntities
LPDISPATCH listEntities();

Gets a list of entities. The returned LPDISPATCH pointer contains a DXSequentialCollectionDL object.

listEntityNames
VARIANT listEntityNames(long* arraySize);

Gets a list of entity names. The output parameter arraySize is the size of the array.

datastoreDef
LPDISPATCH datastoreDef();

Gets the datastore definition. The returned LPDISPATCH pointer contains a DXDatastoreDefTS object.

getMatches
LPDISPATCH getMatches(LPDISPATCH cursor,
                      LPCTSTR documentId,
                      LPCTSTR textIndexName,
                      BOOL userDictionary);

Gets match information for an item returned from a text query. The match information contains the text of the document and the highlighting information for all matches of the corresponding query. The input parameter LPDISPATCH pointer contains a DXResultSetCursorDL object. The returned LPDISPATCH pointer contains a DXMatchesInfoTS object.

Important: This process is time consuming because the document is retrieved from the Content Manager datastore and analyzed linguistically, and potential matches are determined. These processes will have an impact on the performance of a text query.

Text Search text query string

The syntax of the text query string is as follows:

Text Search text query syntax

    SEARCH=(COND=(text_search_expression));
    [OPTION=([SEARCH_INDEX={search_index_name | (index_list) }]
             [MAX_RESULTS=maximum_results;]
             [TIME_LIMIT=time_limit;]
             [THES_NAME=thesaurus_index_name;]
             [THES_DEPTH=depth_for_query_expansion;]
             [MATCH_INFO=yes_no;]
             [RANKING=yes_no;]
             [SORT=yes_no;]
             [MATCH_DICT=yes_no]
            )]
 

Words in uppercase are keywords. Lowercase words are parameters supplied by users; they are described below. Note that DBCS (double-byte character set) characters must be enclosed in SBCS single quotes, like a phrase.
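
To illustrate how the SEARCH and OPTION parts fit together, here is a minimal Python sketch that assembles a query string from an expression and a set of options. The helper name build_text_query and the index name TMINDEX are hypothetical, and the separator placement between options (a semicolon and a space) is an assumption; the syntax diagram above remains the authority.

```python
# Option keywords in the order they appear in the syntax diagram.
OPTION_KEYWORDS = (
    "SEARCH_INDEX", "MAX_RESULTS", "TIME_LIMIT", "THES_NAME",
    "THES_DEPTH", "MATCH_INFO", "RANKING", "SORT", "MATCH_DICT",
)


def build_text_query(expression, **options):
    """Wrap an expression in SEARCH=(COND=(...)) and append OPTION=(...)."""
    unknown = set(options) - set(OPTION_KEYWORDS)
    if unknown:
        raise ValueError(f"unknown option keyword(s): {sorted(unknown)}")
    query = f"SEARCH=(COND=({expression}));"
    rendered = []
    for keyword in OPTION_KEYWORDS:
        if keyword not in options:
            continue
        value = options[keyword]
        if isinstance(value, bool):
            # yes_no parameters are written as YES or NO.
            value = "YES" if value else "NO"
        elif isinstance(value, (list, tuple)):
            # An index_list is a parenthesized, comma-separated list.
            value = "(" + ",".join(value) + ")"
        rendered.append(f"{keyword}={value}")
    if rendered:
        query += " OPTION=(" + "; ".join(rendered) + ")"
    return query


if __name__ == "__main__":
    # TMINDEX is a placeholder search index name.
    print(build_text_query("UNIX AND member",
                           SEARCH_INDEX="TMINDEX", MAX_RESULTS=5))
```

The resulting string is what would be passed as the command argument of execute or evaluate.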

text_search_expression

This is an expression composed of either a boolean_query followed by an optional free_text_expression, or a free_text_expression alone. A boolean_query followed by a free_text_expression is known as a hybrid query.

       {boolean_query  [free_text_expression] | free_text_expression}

Notice that at most one boolean_query and one free_text_expression are allowed. If a boolean query is requested, it should be specified first. For more information about options, refer to the EhwSearch chapter of the Text Search Engine Application Programming Reference.
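
The ordering rule for hybrid queries can be sketched in a few lines of Python. The helper name text_search_expression is hypothetical, and free_text_search_option tags are not handled here; the sketch only shows that the boolean part comes first and that free text is enclosed in braces.

```python
def text_search_expression(boolean_query=None, free_text=None):
    """Combine an optional boolean query with optional free text.

    The boolean part, when present, is always placed first, and the
    free text is wrapped in braces.
    """
    if not boolean_query and not free_text:
        raise ValueError("a boolean query or free text is required")
    parts = []
    if boolean_query:
        parts.append(boolean_query.strip())
    if free_text:
        parts.append("{" + free_text.strip() + "}")
    return " ".join(parts)


if __name__ == "__main__":
    print(text_search_expression("WWW AND internet", "web site"))
    print(text_search_expression(free_text="internet commerce is booming"))
```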

boolean_query

       [unary_operator] text_search_criteria
      [[binary_operator [unary_operator] text_search_criteria] ... ]

Binary operators are AND (or &) and OR (or |). NOT is the only unary operator. An expression enclosed in parentheses is treated as a subquery, which changes the default order of processing for the binary operators. For example, a query that includes parentheses would have the following syntax: UNIX AND (ibm OR system). The expression inside the parentheses, "(ibm OR system)", is a subquery contained in the query.
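
As a rough illustration of how boolean queries and subqueries compose, the following Python sketch builds the example above from small string helpers. The function names are hypothetical and the helpers perform no validation of the individual search criteria.

```python
def all_of(*criteria):
    """Join criteria with the AND binary operator."""
    return " AND ".join(criteria)


def any_of(*criteria):
    """Join criteria with the OR binary operator."""
    return " OR ".join(criteria)


def negate(criterion):
    """Apply NOT, the only unary operator."""
    return f"NOT {criterion}"


def subquery(query):
    """Parenthesize a query so it is processed as a subquery."""
    return f"({query})"


if __name__ == "__main__":
    print(all_of("UNIX", subquery(any_of("ibm", "system"))))
    print(all_of("UNIX", negate("system")))
```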

Search Argument

text_search_criteria is one of the following keyword/options, where the dollar sign delimits the keyword/option:

      { search_argument                 |
        $DOC$  '{' proximity_search_argument '}'  |
        $PARA$ '{' proximity_search_argument '}'  |
        $SENT$ '{' proximity_search_argument '}'
      }

The following options specify proximity search conditions, which require search arguments. These consist of at least a pair of words or phrases:

$DOC$
reserved word indicating that the proximity search expression in the search argument has the scope of the whole document

$PARA$
reserved word indicating that the proximity search expression in the search argument has the scope of a paragraph

$SENT$
reserved word indicating that the proximity search expression in the search argument has the scope of a sentence
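
A proximity condition can be sketched as a scope keyword followed by a braced list of at least two words or phrases. The Python helper below is hypothetical and only formats the string; it assumes that phrases are already enclosed in apostrophes by the caller.

```python
SCOPES = ("DOC", "PARA", "SENT")


def proximity(scope, *terms):
    """Format a proximity search condition such as $PARA$ {internet DB2}."""
    if scope not in SCOPES:
        raise ValueError(f"scope must be one of {SCOPES}")
    if len(terms) < 2:
        # Proximity conditions need at least a pair of words or phrases.
        raise ValueError("at least two words or phrases are required")
    return "$" + scope + "$ {" + " ".join(terms) + "}"


if __name__ == "__main__":
    print(proximity("PARA", "internet", "DB2"))
    print(proximity("SENT", "'UNIX Operating'", "member"))
```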

search_argument can be more than one word or phrase:

      [$search_option$] {word | phrase} [$search_option$] [{word | phrase}...]

proximity_search_argument:

      [$search_option$] {word | phrase} [$search_option$] {word | phrase}
      [$search_option$] [{word | phrase}...]

Each word or phrase can be preceded by a $search_option$ tag.

The dollar sign delimits search_option. Options inside a pair of dollar signs are separated by commas, and can have the following values:

SC=symbol
symbol to indicate a single required character, usually a question mark (?). This must come before the MC=symbol if both SC and MC are specified.

MC=symbol
symbol to indicate a sequence of optional characters or for a single optional word; that is, wildcard character, usually an asterisk (*).

SYN
The text search includes synonyms of the current search term.

THES

THES or THES=relation_name

The text search includes a request to also search for thesaurus expansions of the current search term. Text Search looks for thesaurus terms either in the file defined by the THES_NAME option or the default file. The default file is "imlthes" for Linguistic and Precise searches; the default file is "imlnthes" for GTR searches. If relation_name is specified, query expansion by thesaurus is done along branches of the named relation. If no value is specified, all branches are taken into account for query expansion.

If you have multiple terms in your search (words separated with spaces), you must enclose the entire string within apostrophes ('). For example, if you want to search for the words "digital" and "database" using a single query, your query would look like this: 'digital database'. Spaces between words are only recognized when contained within apostrophes.

NOSEQ
The words in the current search term are requested to be in any sequence; if not specified, the words must occur in exactly the same sequence within a single sentence.

SOUND
The words in the current search term sound like words targeted in the search.

MATCH=n
An option that specifies the degree of similarity (GTR). "n" is a number between one and five, inclusive.

BOUND
An option that requests the search to respect word phrase boundaries (GTR).

CSENS
The search is case-sensitive. This is only valid for a GTR-type index with case enabled.

ESTEM
An option that requests tokens with a stem that matches the search term (GTR). With this option, Text Search Engine will also search on "computer" and "computing" from the search term "compute".

word is a word in the specified search language, phrase is one or more words surrounded by apostrophes, and free_text is words inside a pair of braces ({}).
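
The search option rules above (comma-separated options between dollar signs, SC before MC, apostrophes around multi-word terms) can be sketched in Python as follows. The helper names are hypothetical, and the sketch does not check that an option is valid for the index type in use.

```python
def quote_term(term):
    """Enclose a multi-word term in apostrophes so it is read as a phrase."""
    return f"'{term}'" if " " in term else term


def search_option(sc=None, mc=None, flags=()):
    """Render a $...$ search option tag; SC is emitted before MC."""
    options = []
    if sc is not None:
        options.append(f"SC={sc}")
    if mc is not None:
        options.append(f"MC={mc}")
    # Remaining options, for example SYN, NOSEQ, or SOUND.
    options.extend(flags)
    return "$" + ",".join(options) + "$" if options else ""


def search_term(term, sc=None, mc=None, flags=()):
    """Prefix a word or phrase with its search option tag, if any."""
    tag = search_option(sc=sc, mc=mc, flags=flags)
    quoted = quote_term(term)
    return f"{tag} {quoted}" if tag else quoted


if __name__ == "__main__":
    print(search_term("Net*", mc="*"))
    print(search_term("digital database", flags=("SYN",)))
```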

free_text_expression

free_text_expression consists of a free_text_search_criteria string, which has the following form:

    [$free_text_search_option$] '{' free_text '}'
 

The dollar sign delimits free_text_search_option. Options inside a pair of dollar signs are separated by commas, and can currently have the following values:

SYN
the text search includes synonyms of the current search term.

THES

THES or THES=relation_name

The text search includes a request to also search for thesaurus expansions of the current search term. Text Search looks for thesaurus terms either in the file defined by the THES_NAME option or the default file. The default file is "imlthes" for Linguistic and Precise searches; the default file is "imlnthes" for GTR searches. If relation_name is specified, query expansion by thesaurus is done along branches of the named relation. If no value is specified, all branches are taken into account for query expansion.

If you have multiple terms in your search (words separated with spaces), you must enclose the entire string within apostrophes ('). For example, if you want to search for the words "digital" and "database" using a single query, your query would look like this: 'digital database'. Spaces between words are only recognized when contained within apostrophes.

search_index_name
the name of one search index to be searched.

index_list
the list of search index names to be searched, separated by commas.

maximum_results
the desired maximum number of results to be returned.

thesaurus_index_name
specifies the name of a thesaurus index to be used to expand query terms. The default name is imlthes for Linguistic and Precise searches; the default name is imlnthes for GTR searches.

depth_for_query_expansion
specifies the depth to be used in query expansion by looking for matches in the thesaurus. Actual expansion of the query is requested by using the THES search_option. The default depth setting is "1".

time_limit
specifies the maximum processing time of the text search server for a Boolean query or the Boolean part of a hybrid query.

An example of a boolean search expression that searches for documents containing the phrase UNIX Operating and the word member in the same paragraph is as follows:

            'UNIX Operating'   AND
             member                               

An example of a boolean and free-text search expression that searches for documents containing the words WWW and internet, and the free text web site, is as follows:

            WWW AND internet  {web site} 
           

Another example of an expression to search for documents containing the words internet and DB2 in the same paragraph, a word that starts with Net, and the free_text internet commerce is booming is as follows:

    $PARA$ {internet DB2} AND $MC=*$ Net* 
    {internet commerce is booming}

yes_no for MATCH_INFO

The MATCH_INFO indicator. The valid values are:

YES

Returns match information for each item returned from the text query. The match information contains the text of the document and the highlighting information for all matches of the corresponding query.

Important: This process is time consuming because the document is retrieved from the Content Manager datastore and analyzed linguistically, and potential matches are determined. These processes will have an impact on the performance of the text query.

NO
Do not return match information for each item returned from the text query.

When match information is requested, it is returned in a new attribute, DXMATCHESINFO, in the DXDDO returned from a text query. The value of the attribute DXMATCHESINFO will be a DXMatchesInfoTS object.

yes_no for MATCH_DICT

The MATCH_DICT indicator. The valid values are:

YES
Highlighting information will be obtained using a dictionary.

NO
Highlighting information will not be obtained using a dictionary.

(c) Copyright International Business Machines Corporation 1996, 2002. IBM Corp. All rights reserved.