class Amatch::PairDistance
The pair distance between two strings is based on the number of adjacent character pairs that are contained in both strings. The similarity metric of two strings s1 and s2 is
2*|intersection(pairs(s1), pairs(s2))| / (|pairs(s1)| + |pairs(s2)|)
If it is 1.0, the two strings are an exact match; the lower the value is below 1.0, the more dissimilar they are. The advantage of considering adjacent character pairs is that the metric takes into account not only the characters themselves, but also their ordering in the original strings.
This metric is well suited to finding similarities in natural-language text. It is explained in more detail in Simon White's article “How to Strike a Match”, located at www.catalysoft.com/articles/StrikeAMatch.html. It is also a special case of the method described in “Using q-grams in a DBMS for Approximate String Processing” (citeseer.lcs.mit.edu/gravano01using.html).
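The metric described above can be sketched in plain Ruby. This is only an illustration, not the library's C implementation: the helper names pairs and pair_similarity are hypothetical, shared pairs are counted as a multiset (each pair can be matched once), and returning 1.0 when neither string has any pairs is an assumption.

```ruby
# Adjacent character pairs of a string, e.g. "abc" -> ["ab", "bc"].
# (Hypothetical helper, not part of Amatch.)
def pairs(string)
  string.chars.each_cons(2).map(&:join)
end

# 2 * |shared pairs| / (|pairs(s1)| + |pairs(s2)|)
# (Hypothetical helper, not part of Amatch.)
def pair_similarity(s1, s2)
  p1 = pairs(s1)
  remaining = pairs(s2)
  total = p1.size + remaining.size
  return 1.0 if total.zero? # assumption: two pair-less strings count as equal

  # Count each pair of s1 that can still be matched in s2, removing it
  # from the remaining pool so that duplicates are matched only once.
  shared = p1.count do |pair|
    index = remaining.index(pair)
    index && remaining.delete_at(index)
  end
  2.0 * shared / total
end

puts pair_similarity("FRANCE", "FRENCH") # shares "FR" and "NC": 2 * 2 / 10
```

For "FRANCE" and "FRENCH" the pair lists are FR, RA, AN, NC, CE and FR, RE, EN, NC, CH, so two of ten pairs are shared and the result is 0.4.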
Public Class Methods
Creates a new Amatch::PairDistance instance from pattern.
static VALUE rb_PairDistance_initialize(VALUE self, VALUE pattern)
{
    GET_STRUCT(PairDistance)
    PairDistance_pattern_set(amatch, pattern);
    return self;
}
Public Instance Methods
Uses this Amatch::PairDistance instance to match #pattern against strings. It returns the pair distance measure: a returned value of 1.0 is an exact match, partial matches are lower values, and 0.0 means no match at all. strings has to be either a String or an Array of Strings. The argument regexp is used to split the pattern and strings into tokens first. It defaults to /\s+/. If the splitting should be omitted, call the method with nil as regexp explicitly.
The returned result is either a Float or an Array of Floats, respectively.
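The calling convention described above (String or Array input, optional token splitting) can be imitated in plain Ruby. This is a rough sketch under assumptions, not the library's code: sketch_match and token_pairs are hypothetical names, and computing pairs per token and then pooling them is an assumption about how the tokens are used.

```ruby
# Pairs of adjacent characters, computed per token and pooled.
# With a nil regexp the whole string is treated as a single token.
# (Hypothetical helper, not part of Amatch.)
def token_pairs(string, regexp)
  tokens = regexp ? string.split(regexp) : [string]
  tokens.flat_map { |token| token.chars.each_cons(2).map(&:join) }
end

# Imitates match: a String yields a Float, an Array of Strings an Array of Floats.
# (Hypothetical helper, not part of Amatch.)
def sketch_match(pattern, strings, regexp = /\s+/)
  score = lambda do |string|
    p1 = token_pairs(pattern, regexp)
    remaining = token_pairs(string, regexp)
    total = p1.size + remaining.size
    next 1.0 if total.zero? # assumption: nothing to compare counts as equal

    shared = p1.count do |pair|
      index = remaining.index(pair)
      index && remaining.delete_at(index)
    end
    2.0 * shared / total
  end

  case strings
  when String then score.call(strings)
  when Array
    strings.map do |string|
      unless string.is_a?(String)
        raise TypeError, "array has to contain only strings (#{string.class} given)"
      end
      score.call(string)
    end
  else
    raise TypeError, "expected a String or an Array of Strings"
  end
end

p sketch_match("ab cd", "abcd")         # tokens "ab", "cd" against "abcd"
p sketch_match("ab cd", "abcd", nil)    # no splitting: the space takes part in pairs
p sketch_match("ab cd", ["abcd", "xy"]) # one score per array element
```

With the default regexp the pattern "ab cd" contributes only the pairs ab and cd, while passing nil also produces pairs that contain the space, so the two calls give different scores for the same input.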
static VALUE rb_PairDistance_match(int argc, VALUE *argv, VALUE self)
{
    VALUE result, strings, regexp = Qnil;
    int use_regexp;
    GET_STRUCT(PairDistance)

    /* one required argument (strings), one optional argument (regexp) */
    rb_scan_args(argc, argv, "11", &strings, &regexp);
    /* true only if no second argument was passed at all */
    use_regexp = NIL_P(regexp) && argc != 2;
    if (TYPE(strings) == T_STRING) {
        result = PairDistance_match(amatch, strings, regexp, use_regexp);
    } else {
        int i;
        Check_Type(strings, T_ARRAY);
        result = rb_ary_new2(RARRAY_LEN(strings));
        for (i = 0; i < RARRAY_LEN(strings); i++) {
            VALUE string = rb_ary_entry(strings, i);
            if (TYPE(string) != T_STRING) {
                rb_raise(rb_eTypeError,
                    "array has to contain only strings (%s given)",
                    NIL_P(string) ? "NilClass" : rb_class2name(CLASS_OF(string)));
            }
            rb_ary_push(result,
                PairDistance_match(amatch, string, regexp, use_regexp));
        }
    }
    /* release the pattern's pair array after matching */
    pair_array_destroy(amatch->pattern_pair_array);
    amatch->pattern_pair_array = NULL;
    return result;
}
Returns the current pattern string of this instance.
Sets the current pattern string of this instance to pattern.