IBM OmniFind Analytics Edition Overview
Edition Notice
First Edition (February 2007)
This edition applies to version 8, release 4 of IBM® OmniFind™ Analytics Edition and to all subsequent releases and modifications until otherwise indicated in new editions. This document contains proprietary information of IBM. This proprietary information is provided in accordance with the license conditions and is protected by copyright. Information contained in this document provides no warranties whatsoever for any products. Also, no descriptions provided in this document should be interpreted as product warranties. Depending on the system environment, the yen symbol may be displayed as the backslash symbol, or the backslash symbol may be displayed as the yen symbol. © Copyright International Business Machines Corporation 2007. All rights reserved. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
1 Introduction
This document describes the functions of the text mining system, IBM OmniFind Analytics Edition, and some basic concepts that are necessary for understanding these functions. Understanding the information provided in this document is crucial for understanding other documents or instruction manuals for IBM OmniFind Analytics Edition. This document is written for application users, system administrators, and operational designers of IBM OmniFind Analytics Edition.
See the List of Instruction Manuals
2 Overview of IBM OmniFind Analytics Edition
This section describes the functions of IBM OmniFind Analytics Edition and the system configuration. It also describes how target data is processed and analyzed. IBM OmniFind Analytics Edition analyzes text in documents and what customers say during interactions with a call center. The IBM OmniFind Analytics Edition language processing program analyzes the text and extracts relevant information from it. The following example shows information about an inquiry to a call center. IBM OmniFind Analytics Edition runs the language processing for each document. In this example, one inquiry produces one document.
In the record above, "Date of inquiry" and "Name" are the items that are attached to the data to be analyzed. The inquiry itself, meanwhile, is recorded as free-form text. This text is what the IBM OmniFind Analytics Edition language processing program analyzes. The following example is a partial result of the analysis of the inquiry by the IBM OmniFind Analytics Edition language processing program.
One of the most significant characteristics of IBM OmniFind Analytics Edition is that by customizing language resources such as dictionaries, it can extract not only words such as software and purchase but also dependency expressions such as noun -> verb and expressions of intention such as want and question. Extracting expressions of intention requires optional dictionaries that are suitable for the target data field.
How, then, can new knowledge be obtained from the extracted information? If, for example, there are 100,000 documents, reading all of them would take a long time.
The following figure shows types of dependencies that are found in the documents that mentioned three particular products (ABC-001, ABC-002, and XYZ-999). These product names were retrieved from the call logs from the previous example. (Among all the retrieved dependency patterns, the five most commonly used patterns are selected.)
This result provides the following facts.
Based on these results, you can ensure that the memory expansion method is explained effectively in the instruction manuals for these products, for example.
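The kind of aggregation behind such a result (counting dependency patterns per product and keeping the five most frequent) can be sketched in a few lines of Python. This is only an illustration, not how IBM OmniFind Analytics Edition computes it: the function name `top_dependencies` and every sample pair are invented for the example, and only the product names come from this document.

```python
from collections import Counter, defaultdict

# Hypothetical sample: (product name, "noun … verb" dependency) pairs, as if
# they had been produced by language processing. All pairs are invented.
extracted = [
    ("ABC-001", "memory … expand"),
    ("ABC-001", "memory … expand"),
    ("ABC-001", "software … uninstall"),
    ("ABC-002", "memory … expand"),
    ("XYZ-999", "screen … freeze"),
    ("XYZ-999", "screen … freeze"),
    ("XYZ-999", "power … turn off"),
]


def top_dependencies(pairs, n=5):
    """Return the n most frequent dependency patterns for each product."""
    counts = defaultdict(Counter)
    for product, dependency in pairs:
        counts[product][dependency] += 1
    # most_common(n) sorts by descending frequency and truncates to n entries.
    return {product: c.most_common(n) for product, c in counts.items()}


result = top_dependencies(extracted)
for product, patterns in result.items():
    print(product, patterns)
```

Grouping by product first and ranking second keeps the comparison between products visible, which is the point of the figure described above.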
IBM OmniFind Analytics Edition offers multiple applications depending on the purpose or use.
This section provides an overview of these applications. For the details of individual applications, see the respective instruction manuals (List of Instruction Manuals).
The following figure shows how the applications mentioned in the previous section relate to each other. Arrows show the flow of data.
Data to be analyzed, including text, must be prepared in CSV (comma separated values) data format. Many standard spreadsheet applications and relational databases support exporting files to a CSV format. CSV data is first converted to the internal data format for IBM OmniFind Analytics Edition, and then analyzed by the natural language processing ("NLP" in the figure). Results of the language processing are stored in the index structure for analysis.
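As a rough illustration of the CSV preparation step, the following Python sketch writes one inquiry per row, with two attached items and one free-text column, and reads it back. The column names, the date, and the inquiry text are assumptions made for this example; the exact columns, header conventions, and encoding that IBM OmniFind Analytics Edition expects are not stated here and should be taken from the product manuals.

```python
import csv
import io

# Hypothetical column names: two attached items plus one free-text column.
FIELDS = ["date_of_inquiry", "name", "inquiry_text"]

# Invented sample record; one row corresponds to one document.
records = [
    {
        "date_of_inquiry": "2007-02-01",
        "name": "Taro Yamada",
        "inquiry_text": 'I bought "ABC-001", and I want to uninstall the software.',
    },
]


def to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes embedded commas and quotes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records)

# Reading the text back yields the original values, one document per row.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
```

Using a CSV library rather than joining strings by hand matters for free-form text, because inquiries routinely contain commas and quotation marks that must be escaped.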
Text Miner
DOCAT GUI
Dictionary Editor
This section provides an example of the standard system operational design for the IBM OmniFind Analytics Edition system environment. See the "Operation Guide" for details. To design an operation, follow these steps:
When you add new data, follow these steps:
This section describes the basic concepts of IBM OmniFind Analytics Edition.
3.1 Database
IBM OmniFind Analytics Edition manages the results of language processing for each type of data to be analyzed. The unit of the management is called a database. When analyzing particular data by using IBM OmniFind Analytics Edition, you must create a database for that data.
A database contains the resources necessary for analyzing the results of the language processing, the results of the language processing, and the index structure required for real-time analysis. The database is viewed by application users simply as "data to be analyzed," and its physical structure is not recognized.
Category is a label name given to a keyword. A category is a perspective of data analysis in IBM OmniFind Analytics Edition. IBM OmniFind Analytics Edition uses multiple categories. Categories can be divided into the standard item category, system category, and user defined category. Their characteristics can be summarized as follows:
Categories can have parent-child relationships with other categories.
A category tree is a collection of all categories defined in IBM OmniFind Analytics Edition. It is called a tree because it has a tree structure as in the following example. When a particular category is considered to semantically contain a different category, these two categories can be registered in the category tree as categories having a parent-child relationship.
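A category tree of this kind can be pictured as a simple recursive data structure. The Python sketch below is illustrative only: the `Category` class, its methods, the "root" node, and the "Part of speech" grouping are assumptions made for the example and do not describe the product's internal representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Category:
    """One node of a category tree; children are semantically contained categories."""
    name: str
    children: List["Category"] = field(default_factory=list)

    def add(self, child: "Category") -> "Category":
        """Register child under this category and return the child."""
        self.children.append(child)
        return child

    def find_path(self, target: str) -> Optional[List[str]]:
        """Return the names from this node down to target, or None if absent."""
        if self.name == target:
            return [self.name]
        for child in self.children:
            sub_path = child.find_path(target)
            if sub_path is not None:
                return [self.name] + sub_path
        return None


# Hypothetical tree: the grouping node "Part of speech" is invented.
root = Category("root")
part_of_speech = root.add(Category("Part of speech"))
part_of_speech.add(Category("Noun"))
part_of_speech.add(Category("Verb"))
root.add(Category("Product name"))

path = root.find_path("Noun")
print(path)
```

Walking from the root to a category gives every ancestor that semantically contains it, which is what the parent-child relationship expresses.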
A keyword is a combination of a category and a character string. When category C and character string S in the example of Section 2.1 are used, the resulting keywords [C, "S"] are: [Name, "Taro Yamada"], [Product name, "ABC-001"], [Noun -> Verb, "software … uninstall"] and so on.
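In code, a keyword in this sense is just a pair. The following Python sketch models it as a named tuple using the three keywords listed above; the `Keyword` type and the `in_category` helper are names invented for the illustration.

```python
from typing import List, NamedTuple


class Keyword(NamedTuple):
    """A keyword is a combination of a category and a character string."""
    category: str
    text: str


keywords = [
    Keyword("Name", "Taro Yamada"),
    Keyword("Product name", "ABC-001"),
    Keyword("Noun -> Verb", "software … uninstall"),
]

# Because the category is part of the keyword, the same string under two
# different categories yields two different keywords.
assert Keyword("Noun", "software") != Keyword("Product name", "software")


def in_category(items: List[Keyword], category: str) -> List[str]:
    """Return the strings of all keywords that belong to the given category."""
    return [k.text for k in items if k.category == category]


product_names = in_category(keywords, "Product name")
print(product_names)
```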
As arbitrary character strings are used as keywords, note that expressions that are not words are also regarded as keywords, such as dependency, as in the expression "software … uninstall" in the last example. The IBM OmniFind Analytics Edition language processing extracts information in the form of keywords from text. See the example in Section 2.1 for extraction examples. See the instructions for individual applications about how to handle application-dependent keywords.
A dictionary is a collection of data that is used when the language processing extracts keywords from text. A dictionary contains two types of information:
By default, IBM OmniFind Analytics Edition has a dictionary (the system dictionary) for extracting nouns and dependencies. By using the system dictionary, the language processing extracts nouns, verbs, and dependencies. Some of the extracted information is as follows:
Now, analyze the following text, which is quite similar to Target text 1 above.
If the system dictionary is the only dictionary that IBM OmniFind Analytics Edition has, it probably returns the following results:
When you look at the output description, you can see that Target text 1 and Target text 2 ask the same question. However, because the extracted keywords are different from those extracted from Target text 1, the application cannot determine whether the extracted information is the same as the information that was extracted previously. This issue can be solved by providing a new dictionary for language processing. Before you run the language processing, you can add a new dictionary (usually called a user dictionary because you edit it) that contains the following knowledge.
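One way to picture a user dictionary of this kind, and the synonym lookup that uses it, is the following Python sketch. It is a simplified illustration under stated assumptions, not the product's dictionary format or algorithm: the names `USER_DICTIONARY`, `build_synonym_index`, and `expand` are invented, "instructions" is treated as a synonym of "manual", and "program" is a made-up synonym of "software".

```python
# Hypothetical user dictionary: canonical keyword -> list of synonym strings.
USER_DICTIONARY = {
    ("Noun", "software"): ["program"],       # "program" is an invented synonym
    ("Noun", "manual"): ["instructions"],
}


def build_synonym_index(dictionary):
    """Map each (category, synonym) pair to its canonical keyword."""
    index = {}
    for canonical, synonyms in dictionary.items():
        category, _ = canonical
        for synonym in synonyms:
            index[(category, synonym)] = canonical
    return index


def expand(keywords, dictionary):
    """Output each keyword as is; if it is a registered synonym, also output
    the canonical keyword it corresponds to."""
    index = build_synonym_index(dictionary)
    output = []
    for keyword in keywords:
        canonical = index.get(keyword)
        if canonical is not None and canonical not in output:
            output.append(canonical)
        if keyword not in output:
            output.append(keyword)
    return output


# Keywords as they might be extracted with the system dictionary alone.
extracted = [("Noun", "software"), ("Noun", "instructions")]
expanded = expand(extracted, USER_DICTIONARY)
print(expanded)
```

With this sketch, a text that says "instructions" also yields the keyword [Noun, "manual"], so it can be matched against a text that says "manual".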
Two keywords [Noun, "software"] and [Noun, "manual"] are registered in this dictionary. Keywords that are equivalent to these keywords are also registered. The language processing looks up the keywords that are found in the text in the synonym list of the dictionary. If synonyms are not found, the keywords are output as they are, but if synonyms are found, both the "keywords that correspond to the retrieved synonyms" and the synonyms are output. In the previous example, new keywords such as [Noun, "software"] and [Noun, "manual"] will be extracted from Target text 2 in addition to the keywords [Noun, "software"] and [Noun, "instructions"]. Because these word-level settings are also used in dependencies, the new dependency [Noun -> Verb, "software … uninstall"] will also be extracted. As this example shows, how keywords and their synonyms are registered in the dictionary is an important element in analyzing the results.
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing IBM Corporation North Castle Drive Armonk, NY 10504-1785 U.S.A. For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to: IBM World Trade Asia Corporation Licensing 2-31 Roppongi 3-chome, Minato-ku Tokyo 106-0032, Japan. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication.
IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact: IBM Corporation Silicon Valley Lab Building 090/H-410 555 Bailey Avenue San Jose, CA 95141-1003 U.S.A. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee. The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information contains examples of data and reports used in daily business operations.
To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.
Copyright License
This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
This topic lists IBM trademarks and certain non-IBM trademarks.
See http://www.ibm.com/legal/copytrade.shtml for information about IBM trademarks. The following terms are trademarks or registered trademarks of other companies: Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Intel, Intel Inside (logos), MMX and Pentium are trademarks of Intel Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product or service names might be trademarks or service marks of others.