You use the discover function to find relationships between sources
and targets in the mapping editor. The discover function is configured for
basic searches of matching elements without any further modifications to the
configuration. But you can refine how the function finds the relationships
by updating the mapping editor preferences.
You can use the properties of your data and the properties of the structures
in the mapping editor, that is, the metadata, to integrate data and to
understand the relationships
between source and target data sources. For example, by using the metadata
relationships, you can build a script that correctly associates data from
a legacy database with data in a new acquisition. The metadata properties
can include relationships that might be difficult to identify, especially
when schemas are large, without help from the discover function.
The discover function examines the metadata to find possible matches without
manual interaction with the metadata. The configuration lets you modify
how the discover function searches and which data and metadata the search
is based on.
You can define a global configuration for the discover function by setting
the preferences in the Workbench wizard. These configurations
persist when you open and close new mapping editors and become the default
values for new mapping models. You can override the global configurations
for a specific mapping editor instance by using the Advanced configuration.
These settings are lost when you close the mapping editor.
Basic discover function
The discover function provides
two methods of controlling and refining the number of matches that you see:
Find Best Fit and Find Similar.
- Find Best Fit
- You should always select this method first in your attempts to find relationships
between objects. This method of running the discover function finds the best
overall score of all potential object pairings in all of the elements in the
scope of the model. There is a potential for any object to match any other
object at any time. But when the discover function analyzes all of the mapping
model participants, the Find Best Fit method produces
the most satisfactory matches in terms of the entire model. The method returns
at most one match for one target and one source that you select. It is possible
to have no matches found.
- Find Similar
- If you are not completely satisfied with the results of the Find
Best Fit method, then you can find other matches by running the Find
Similar method. When you use the default configuration, the method
finds the top 5 matches for each target element that you select. You can change
that number. Generally, you only want to select Find Similar when
you specify a target object on which to focus your search.
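The difference between the two methods can be pictured with a small score table. The following is a minimal sketch, not the product's implementation: it assumes that each source and target pair already has a similarity score (higher is better), that Find Best Fit behaves like a one-to-one assignment that maximizes the total score, and that Find Similar ranks candidates for one target. The element names, scores, `MIN_SCORE`, and both function names are hypothetical.

```python
from itertools import permutations

# Hypothetical similarity scores; rows are sources, columns are targets.
SOURCES = ["EMP.ID", "EMP.NAME", "EMP.DEPT"]
TARGETS = ["Employee.Eno", "Employee.Name", "Employee.Dept"]
SCORES = [
    [0.70, 0.20, 0.10],
    [0.60, 0.90, 0.15],
    [0.05, 0.30, 0.80],
]
MIN_SCORE = 0.25  # hypothetical cutoff; weaker pairings count as "no match"


def find_best_fit(scores, min_score=MIN_SCORE):
    """Return the one-to-one pairing with the highest total score.

    Brute force over all assignments, so it is only suitable for tiny
    examples. Assumes there are at least as many targets as sources.
    """
    n_src, n_tgt = len(scores), len(scores[0])
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(n_tgt), n_src):
        # Drop pairings that are too weak; an element can end up unmatched.
        pairs = [(s, t) for s, t in enumerate(perm) if scores[s][t] >= min_score]
        total = sum(scores[s][t] for s, t in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs


def find_similar(scores, target, top_n=5):
    """Return up to top_n source indexes ranked by score for one target."""
    ranked = sorted(range(len(scores)), key=lambda s: scores[s][target], reverse=True)
    return ranked[:top_n]


for s, t in find_best_fit(SCORES):
    print("best fit:", SOURCES[s], "->", TARGETS[t])
for s in find_similar(SCORES, target=0, top_n=2):
    print("similar to", TARGETS[0] + ":", SOURCES[s])
```

In this sketch, the best fit pairs each source with at most one target, while the similar search returns several ranked candidates for the single target that is selected.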
Advanced configuration: controlling the methods of searching
For more advanced discover techniques, use the Advanced configuration to
specify which algorithms to use when finding relationships between sources
and targets. For each algorithm that you select, you can define parameters
to refine the search. The parameters that are available depend on the algorithm
that you select.
- Lexical similarity
- Use this algorithm to find relationships by the longest common subsequence
or a similarity in the values of the elements. This algorithm is a string
matching algorithm that finds a maximum length or maximum weight subsequence
that is common to two or more strings. For example, if you
have a short string (the pattern) and a long string (the text), and the letters
of the pattern appear in order (but possibly separated) in the text, the pattern
is a subsequence of the text. The following example shows this concept:
Pattern=Wood
Text=The World of words.
The pattern is a subsequence of the text.
Lexical similarity is the default algorithm. For example, there is a value similarity between
elements if they represent the same entity property, such as elements Sample.Employee.Eno
and OtherSample.EMP.ID. Elements with foreign keys and indexes have similar
properties. A distance metric is used to find the similarity and differences
between elements. For example, if 10 source elements and 20 target elements
need to be matched, a distance metric can potentially return 200 measurements
(10 source elements multiplied by 20 target elements). Each measurement is
generally a combination of source element, target element, and a distance
value. The rejection threshold, or the maximum value by which the match is
rejected, is a distance value. The suggested value for the rejection threshold
is 1.
- Semantic name
- Use this algorithm to find relationships by thesaurus and ontology. You
can use supported thesaurus software applications and
glossary models to enhance the semantic name algorithm. The suggested
value for the rejection threshold is 0.4. If you want to specify a thesaurus,
select it from the list. The list shows supported applications, such as WordNet
or SureWord, when they are installed in your system. In
addition, any glossary model with synonym information in the current project
can be selected as a thesaurus. If you use an external thesaurus, you
do not need any further configuration on the mapping editor preferences page.
- Signature
- Use this algorithm to find relationships with a search method that is
based on a name signature. This algorithm uses data sampling to find the relationships.
A weighting value is assigned to certain classes of words in the data. The
suggested value for a sampling size is 100 rows. The valid values for sampling
size are 50, 100, 150, 200, 250, 300, 350, and 400. The suggested value for
a sampling rate is 20 percent. The valid value is any integer between 1 and
100. The suggested value for the rejection threshold is 1. The schemas that
are used in this discover function must be DB2® Universal Database™ schemas.
To use this algorithm, you must specify some connection and authorization
information to access the data. When data sampling is used, the data to run
the discover function is cached. You can select a caching database from a
list of available databases that are already configured, or you can specify
a new caching database.
- Regular expressions
- Use this algorithm to find relationships with a search method that is
based on textual or string searches that use regular expressions or pattern
matching. A simple regular expression is an exact character match.
- Distributions
- Use this algorithm to find relationships with a search method that is
based on a similarity in data. The discover function performs some data sampling
to find the relationships. The schemas that are used in this discover function
must be DB2 Universal Database schemas. To use this algorithm, you must specify some
connection and authorization information to access the data. The suggested
value for a sampling size is 100 rows. The valid values for sampling size
are 50, 100, 150, 200, 250, 300, 350, and 400. The suggested value for a sampling
rate is 20 percent. The valid value is any integer between 1 and 100. The
suggested value for the rejection threshold is 1.
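The subsequence and distance ideas described for the lexical similarity algorithm in the list above can be illustrated with a short sketch. This is not the product's implementation. It assumes a distance that is normalized to the range 0 to 1 and derived from the longest common subsequence, and it assumes that a pair is rejected when its distance is greater than the rejection threshold. All function names are hypothetical.

```python
def is_subsequence(pattern, text):
    """True if the characters of pattern appear in text in order, gaps allowed."""
    remaining = iter(text)
    # Each "in" test consumes the iterator up to and including the match.
    return all(ch in remaining for ch in pattern)


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lexical_distance(a, b):
    """Assumed metric: 0.0 for identical names, 1.0 for nothing in common."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest


def measure_all(sources, targets):
    """One (source, target, distance) measurement per source and target pair."""
    return [(s, t, lexical_distance(s, t)) for s in sources for t in targets]


def apply_rejection_threshold(measurements, threshold):
    """Keep only measurements whose distance does not exceed the threshold."""
    return [m for m in measurements if m[2] <= threshold]


print(is_subsequence("Wood", "The World of words."))  # True

sources = [f"SRC_{i}" for i in range(10)]
targets = [f"TGT_{j}" for j in range(20)]
print(len(measure_all(sources, targets)))  # 200
```

Under these assumptions, a rejection threshold of 1 keeps every measurement, and lower values keep only the closer name pairs.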
When you select multiple algorithms, you can choose
to combine the algorithms by sequence or by weight. If you want to combine
the algorithms by weight, then you can specify the percentage of importance
that each algorithm has. You can refine the results of the discover function
by sorting the results of the weighted algorithms and keeping only the top percentage.
If you want to combine the algorithms by sequence, you can specify the order
of precedence for each algorithm. Selecting multiple algorithms combines the
strength of the selected algorithms to more accurately find relationships.
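Combining by weight can be pictured as a weighted average of the per-algorithm results. The sketch below is an illustration under stated assumptions, not the product's implementation: it assumes that every selected algorithm reports a distance for every candidate pair, that lower distances are better, and that the weights are percentages of importance. The function names, pair names, and distance values are hypothetical. Combining by sequence is not shown.

```python
def combine_by_weight(distances_by_algorithm, weights):
    """Weighted average of distances per (source, target) pair.

    distances_by_algorithm: {algorithm: {(source, target): distance}}
    weights: {algorithm: percentage of importance}
    """
    total_weight = sum(weights.values())
    combined = {}
    for algorithm, distances in distances_by_algorithm.items():
        share = weights[algorithm] / total_weight
        for pair, distance in distances.items():
            combined[pair] = combined.get(pair, 0.0) + share * distance
    return combined


def keep_top_percentage(combined, percent):
    """Sort pairs by combined distance and keep the best percent of them."""
    ranked = sorted(combined.items(), key=lambda item: item[1])
    keep = max(1, round(len(ranked) * percent / 100))
    return ranked[:keep]


# Hypothetical distances from two algorithms for the same candidate pairs.
results = {
    "lexical": {("EMP.ID", "Employee.Eno"): 0.2, ("EMP.ID", "Employee.Name"): 0.8},
    "semantic": {("EMP.ID", "Employee.Eno"): 0.4, ("EMP.ID", "Employee.Name"): 0.6},
}
weights = {"lexical": 50, "semantic": 50}

combined = combine_by_weight(results, weights)
for pair, distance in keep_top_percentage(combined, percent=50):
    print(pair, round(distance, 2))
```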
You can determine a threshold for confidence values so that you control the kinds
of matches to consider. You can run the discover function between specific
parts of sources and targets, down to the smallest element on each side.