Configure the discover relationships function

You use the discover function to find relationships between sources and targets in the mapping editor. By default, the discover function is configured for basic searches of matching elements, so you can run it without changing the configuration. You can refine how the function finds relationships by updating the mapping editor preferences.

You can use the properties of your data and the properties of the structures in the mapping editor (the metadata) to integrate data and to understand the relationships between source and target data sources. For example, by using the metadata relationships, you can build a script that correctly associates data from a legacy database with data in a new acquisition. The metadata can include relationships that are difficult to identify without help from the discover function, especially when schemas are large.

The discover function examines the metadata to find possible matches, so you do not have to inspect the metadata manually. With the configuration, you can change how the discover function searches and which data and metadata the search is based on.

You can define a global configuration for the discover function by setting the preferences in the Workbench Window > Preferences wizard. These global settings persist when you open and close mapping editors, and they become the default values for new mapping models. You can override the global settings for a specific mapping editor instance by using the Advanced configuration. The override settings are lost when you close the mapping editor.

Basic discover function

The discover function provides two methods of controlling and refining the number of matches that you see: Find Best Fit and Find Similar.
Find Best Fit
Select this method first when you look for relationships between objects. This method finds the best overall score across all potential object pairings for all of the elements in the scope of the model. Any object can potentially match any other object. Because the discover function analyzes all of the mapping model participants, the Find Best Fit method produces the most satisfactory matches in terms of the entire model. For each source and each target that you select, the method returns at most one match. It is possible that no matches are found.
Find Similar
If you are not completely satisfied with the results of the Find Best Fit method, you can find other matches by running the Find Similar method. With the default configuration, the method finds the top 5 matches for each target element that you select. You can change that number. Generally, select Find Similar only when you have specified a target object on which to focus your search.
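
The difference between the two methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's implementation: the score matrix, the element names, and the functions best_fit and find_similar are hypothetical, and the brute-force search is practical only for very small models.

```python
from itertools import permutations

# Hypothetical similarity scores (higher is better).
# Rows are source elements, columns are target elements.
SOURCES = ["Eno", "Name", "Dept"]
TARGETS = ["ID", "FullName", "Division"]
SCORES = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.2],
    [0.1, 0.3, 0.6],
]


def best_fit(scores):
    """Return the one-to-one pairing with the highest total score.

    Brute force over every permutation; assumes a small square matrix.
    """
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(scores[i][p[i]] for i in range(n)),
    )
    return [(i, best[i]) for i in range(n)]


def find_similar(scores, target, k=5):
    """Return up to k source indexes for one target, best score first."""
    ranked = sorted(
        range(len(scores)),
        key=lambda i: scores[i][target],
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    for s, t in best_fit(SCORES):
        print("best fit:", SOURCES[s], "->", TARGETS[t])
    for s in find_similar(SCORES, target=0, k=2):
        print("similar to ID:", SOURCES[s])
```

In this sketch, Name scores well against ID (0.8), but the best overall pairing still assigns Eno to ID because that choice gives the highest total for the whole model. The per-target ranking lists both candidates, which is the kind of alternative that Find Similar is meant to surface.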

Advanced configuration: controlling the methods of searching

For more advanced discover techniques, click Discover Relationships > Advanced Configuration to specify which algorithms to use when finding relationships between sources and targets. For each algorithm that you select, you can define parameters to refine the search. The parameters that are available depend on the algorithm that you select.
Lexical similarity
Use this algorithm to find relationships by the longest common subsequence or by a similarity in the values of the elements. This string-matching algorithm finds a maximum-length or maximum-weight subsequence that two or more strings have in common. For example, if you have a short string (the pattern) and a long string (the text), and the letters of the pattern appear in order (but possibly separated) in the text, the pattern is a subsequence of the text. The following example shows this concept:
Pattern=Wood
Text=The World of words.

The pattern is a subsequence of the text.
Lexical similarity is the default algorithm. A value similarity exists between elements if they represent the same entity property, such as the elements Sample.Employee.Eno and OtherSample.EMP.ID. Elements with foreign keys and indexes have similar properties. A distance metric is used to find the similarities and differences between elements. For example, if 10 source elements and 20 target elements need to be matched, a distance metric can potentially return 200 measurements (10 source elements multiplied by 20 target elements). Each measurement is generally a combination of a source element, a target element, and a distance value. The rejection threshold, or the maximum value by which the match is rejected, is a distance value. The suggested value for the rejection threshold is 1.
Semantic name
Use this algorithm to find relationships by thesaurus and ontology. You can use supported thesaurus software applications and glossary models to enhance the semantic name algorithm. The suggested value for the rejection threshold is 0.4. If you want to specify a thesaurus, select it from the list. The list shows supported applications, such as WordNet or SureWord, when they are installed on your system. In addition, you can select any glossary model in the current project that contains synonym information as a thesaurus. If you use an external thesaurus, you do not need any further configuration on the mapping editor preferences page.
Signature
Use this algorithm to find relationships with a search method that is based on a name signature. This algorithm uses data sampling to find the relationships. A weighting value is assigned to certain classes of words in the data. The suggested value for a sampling size is 100 rows. The valid values for sampling size are 50, 100, 150, 200, 250, 300, 350, and 400. The suggested value for a sampling rate is 20 percent. The valid value is any integer between 1 and 100. The suggested value for the rejection threshold is 1. The schemas that are used in this discover function must be DB2 Universal Database™ schemas. To use this algorithm, you must specify some connection and authorization information to access the data. When data sampling is used, the data to run the discover function is cached. You can select a caching database from a list of available databases that are already configured, or you can specify a new caching database.
Regular expressions
Use this algorithm to find relationships with a search method that is based on textual or string searches that use regular expressions or pattern matching. A simple regular expression is an exact character match.
Distributions
Use this algorithm to find relationships with a search method that is based on a similarity in data. The discover function performs some data sampling to find the relationships. The schemas that are used in this discover function must be DB2 Universal Database schemas. To use this algorithm, you must specify some connection and authorization information to access the data. The suggested value for a sampling size is 100 rows. The valid values for sampling size are 50, 100, 150, 200, 250, 300, 350, and 400. The suggested value for a sampling rate is 20 percent. The valid value is any integer between 1 and 100. The suggested value for the rejection threshold is 1.
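
The subsequence and distance ideas described for the lexical similarity algorithm can be sketched as follows. The product's actual distance metric is not documented here, so the normalization in distance, the function names, and the rule that a pair is rejected when its distance reaches the threshold are all assumptions made for illustration.

```python
def is_subsequence(pattern, text):
    """True if the characters of pattern appear in text in order."""
    remaining = iter(text)
    # "ch in remaining" advances the iterator, so order is enforced.
    return all(ch in remaining for ch in pattern)


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def distance(a, b):
    """Hypothetical metric: 0.0 for identical names, 1.0 for no overlap."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 - lcs_length(a, b) / longest if longest else 0.0


def all_measurements(sources, targets):
    """One (source, target, distance) triple for every pair."""
    return [(s, t, distance(s, t)) for s in sources for t in targets]


def accepted(measurements, rejection_threshold=1.0):
    """Assumption: reject a pair when its distance reaches the threshold."""
    return [m for m in measurements if m[2] < rejection_threshold]


if __name__ == "__main__":
    print(is_subsequence("Wood", "The World of words."))
    pairs = all_measurements(["Eno", "EMP_NAME"], ["ID", "Name"])
    for source, target, dist in accepted(pairs):
        print(source, target, round(dist, 2))
```

With 10 source names and 20 target names, all_measurements returns 200 triples, which matches the count given for the distance metric above.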

When you select multiple algorithms, you can choose to combine the algorithms by sequence or by weight. If you combine the algorithms by weight, you can specify the percentage of importance that each algorithm has. You can refine the results of the discover function by sorting the results of the weighted algorithms and keeping only the top percentage. If you combine the algorithms by sequence, you can specify the order of precedence for each algorithm. Selecting multiple algorithms combines the strengths of the selected algorithms to find relationships more accurately.
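
Combining by weight can be pictured as a weighted average of the scores from each algorithm, followed by a cut that keeps only the top share of the ranked results. The following sketch assumes hypothetical confidence scores between 0 and 1 (higher is better) and hypothetical function names. It does not describe how the product computes or stores its results.

```python
import math

# Hypothetical per-algorithm confidence scores for (source, target) pairs.
LEXICAL = {
    ("Eno", "ID"): 0.4,
    ("Name", "FullName"): 0.8,
    ("Dept", "Division"): 0.2,
    ("Name", "ID"): 0.1,
}
SEMANTIC = {
    ("Eno", "ID"): 0.9,
    ("Name", "FullName"): 0.7,
    ("Dept", "Division"): 0.8,
    ("Name", "ID"): 0.1,
}


def combine_by_weight(scores_by_algorithm, weights_percent):
    """Weighted average of scores; weights are percentages of importance."""
    total = sum(weights_percent.values())
    combined = {}
    for algorithm, scores in scores_by_algorithm.items():
        share = weights_percent[algorithm] / total
        for pair, score in scores.items():
            combined[pair] = combined.get(pair, 0.0) + share * score
    return combined


def keep_top_percent(combined, percent):
    """Sort by combined score and keep the top percent of the pairs."""
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    keep = math.ceil(len(ranked) * percent / 100)
    return ranked[:keep]


if __name__ == "__main__":
    combined = combine_by_weight(
        {"lexical": LEXICAL, "semantic": SEMANTIC},
        {"lexical": 60, "semantic": 40},
    )
    for pair, score in keep_top_percent(combined, 50):
        print(pair, round(score, 2))
```

Here the lexical algorithm carries 60 percent of the importance and the semantic algorithm 40 percent. Keeping the top 50 percent of four candidate pairs leaves the two pairs with the highest combined scores.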

You can set a threshold for confidence values to control which matches are considered. You can also run the discover function between specific parts of sources and targets, down to the smallest element on each side.

