Simetrics
Author: François Vanzeveren
Date: 19-Feb-2006
Contents:
1. Introduction
2. String similarity metrics
2.1 Global refinements
2.2 Edit distance metrics
2.3 The Jaro metric and variants
3. Token-based metrics and document queries
3.1 Tokenizer
3.2 Accumulate statistics
3.3 Jaccard and hybrid distances
3.4 Term Frequency Inverse Document Frequency and variants
1. Introduction
Simetrics stands for "Similarity Metrics".
This library implements a few effective and widely used metrics for measuring string similarities.
The metrics currently implemented are:
- Levenshtein
- Jaro
- Jaro-Winkler
- Jaccard
- Term Frequency Inverse Document Frequency
- Term Frequency Inverse Document Frequency combined with Jaro
- Term Frequency Inverse Document Frequency combined with Jaro-Winkler
Resources
This document is based on the paper "A Comparison of String
Distance Metrics for Name-Matching Tasks" available at
http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
2. String similarity metrics
This class of metrics matches a source string with a target string
and returns the level of similarity as a number between 0
(no similarity) and 1 (perfect match).
These metrics are sensitive to word order.
>> get-similarity/jaro "Francois Vanzeveren" "François Vanzeveren"
== 1
>> get-similarity/jaro "Vanzeveren Francois" "François Vanzeveren"
== 0.582934609250399
2.1 Global refinements
By default, string matching metrics are insensitive to case and
to non-English (accented) characters.
>> get-similarity/levenshtein "François" "francois"
== 1
You can change this behaviour with the following refinements.
| /case | Characters are case-sensitive.
|
| /strict | Matching is sensitive to non-English characters.
|
>> get-similarity/levenshtein/case "françois" "François"
== 0.875
>> get-similarity/levenshtein/strict "François" "Francois"
== 0.875
2.2 Edit distance metrics
An important class of string similarity metrics is that of
"edit distances". The distance between the source and target
strings is the cost of the best sequence of "edit operations"
that converts the source string to the target string.
The edit operations are deletions, insertions and substitutions.
2.2.1 Levenshtein
The Levenshtein metric is a simple unit-cost implementation of
the edit distance metric.
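For illustration, the unit-cost edit distance with configurable operation costs can be sketched in Python (a sketch only, not the library's code; the normalisation by the longer string's length is an assumption inferred from the unit-cost samples, and Simetrics may normalise non-unit costs differently):

```python
def levenshtein(source, target, del_cost=1, ins_cost=1, sub_cost=1):
    """Edit distance from source to target with configurable operation costs."""
    # prev[j] holds the distance between the processed prefix of source and target[:j]
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, 1):
        curr = [i * del_cost]
        for j, t_ch in enumerate(target, 1):
            curr.append(min(
                prev[j] + del_cost,                                # delete s_ch
                curr[j - 1] + ins_cost,                            # insert t_ch
                prev[j - 1] + (0 if s_ch == t_ch else sub_cost),   # substitute
            ))
        prev = curr
    return prev[-1]

def levenshtein_similarity(source, target):
    """Map the distance to a similarity in [0, 1]."""
    if not source and not target:
        return 1.0
    return 1.0 - levenshtein(source, target) / max(len(source), len(target))
```

With unit costs and case folding, this reproduces the samples: "jhonny" vs "johnny" has distance 2 (two substitutions), giving 1 - 2/6 ≈ 0.667, and "willlaim" vs "william" has distance 2 (one substitution plus one deletion), giving 0.75.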
2.2.1.1 Samples
>> get-similarity/levenshtein "François" "François"
== 1
>> get-similarity/levenshtein "Jhonny" "Johnny"
== 0.666666666666667
>> get-similarity/levenshtein "Willlaim" "William"
== 0.75
2.2.1.2 Refinements
The Levenshtein metric comes with three refinements:
| /del-cost | The cost of deleting the next letter in the source string.
|
| /ins-cost | The cost of inserting a new letter that does not appear in the source string.
|
| /sub-cost | The cost of substituting a different letter for the next letter in the source string.
|
By default, all costs are set to 1.
>> get-similarity/levenshtein "Willlaim" "William"
== 0.75
>> get-similarity/levenshtein/del-cost "Willlaim" "William" 2
== 0.25
>> get-similarity/levenshtein/ins-cost "Willlaim" "William" 2
== 0.25
>> get-similarity/levenshtein/sub-cost "Willlaim" "William" 2
== 0.625
>> get-similarity/levenshtein/del-cost/ins-cost/sub-cost "Willlaim" "William" 2 2 2
== 0.5
2.2.1.3 Advantages and Drawbacks
The recursive implementation available in Simetrics, while elegant,
is rather slow and is not suitable for large matching requests.
2.3 The Jaro metric and variants
These metrics are intended primarily to measure the similarity
of short strings.
2.3.1 Jaro
Measures the similarity of two strings based on the
number and the order of common characters between them.
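The standard Jaro computation can be sketched in Python (an illustration of the textbook algorithm, not the library's implementation):

```python
def jaro(s, t):
    """Jaro similarity: common characters within a sliding window,
    penalised by the number of transpositions."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(len(s), len(t)) // 2 - 1
    s_match = [False] * len(s)
    t_match = [False] * len(t)
    m = 0
    for i, ch in enumerate(s):
        # look for ch in t, at most `window` positions away, not matched yet
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_match[j] and t[j] == ch:
                s_match[i] = t_match[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # transpositions: matched characters that disagree in order, counted in halves
    matched_s = [c for c, f in zip(s, s_match) if f]
    matched_t = [c for c, f in zip(t, t_match) if f]
    transpositions = sum(a != b for a, b in zip(matched_s, matched_t)) // 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3
```

After case folding this reproduces the samples: "jhonny" vs "johnny" gives 6 matches with one transposition, i.e. (1 + 1 + 5/6)/3 = 17/18 ≈ 0.944.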
2.3.1.1 Samples
>> get-similarity/jaro "François" "François"
== 1
>> get-similarity/jaro "Jhonny" "Johnny"
== 0.944444444444445
>> get-similarity/jaro "Willlaim" "William"
== 0.910714285714286
2.3.1.2 Advantages and Drawbacks
The Jaro similarity metric is very effective and its implementation
in Simetrics is performant and suitable for large matching requests.
2.3.2 Jaro-Winkler
This is a variant of the Jaro metric that emphasizes matches in the
first few characters.
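The common formulation adds a prefix bonus on top of the Jaro score: jw = jaro + l * p * (1 - jaro), where l is the length of the common prefix (usually capped at 4) and p is a scaling factor (usually 0.1). A self-contained Python sketch (again an illustration, not the library's code):

```python
def jaro(s, t):
    """Jaro similarity (see the previous metric)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(len(s), len(t)) // 2 - 1
    s_match, t_match, m = [False] * len(s), [False] * len(t), 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_match[j] and t[j] == ch:
                s_match[i] = t_match[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    matched_s = [c for c, f in zip(s, s_match) if f]
    matched_t = [c for c, f in zip(t, t_match) if f]
    transpositions = sum(a != b for a, b in zip(matched_s, matched_t)) // 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Boost the Jaro score for strings sharing a common prefix."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)
```

With these common parameter values the samples are reproduced: "jhonny"/"johnny" share a 1-character prefix, giving 17/18 + 0.1 * 1/18 = 0.95; "willlaim"/"william" share the 4-character prefix "will", giving 0.9107 + 0.4 * 0.0893 ≈ 0.9464.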
2.3.2.1 Samples
>> get-similarity/jaro-winkler "François" "François"
== 1
>> get-similarity/jaro-winkler "Jhonny" "Johnny"
== 0.95
>> get-similarity/jaro-winkler "Willlaim" "William"
== 0.946428571428571
2.3.2.2 Advantages and Drawbacks
The Jaro-Winkler metric is very effective and its implementation
in Simetrics is performant and suitable for large matching requests.
3. Token-based metrics and document queries
Like the metrics presented in the previous section, the first metric in this
section (Jaccard) is also designed to compare two strings and return their
level of similarity, but in a word-order-independent manner.
In the remainder of the section, we will discuss metrics to make queries
against a corpus of documents and to return those documents that are the
most relevant with regard to the query.
Corpus of documents
To test the metrics described in this section, we recommend loading
the script library from Rebol.org at
http://www.rebol.org/cgi-bin/cgiwrap/rebol/download-librarian.r
3.1 Tokenizer
All the metrics described in this section require you to provide a
"tokenizer", i.e. a function that converts a string into a token multiset
(where each token is a word). The metrics then measure similarity
on this multiset.
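As an illustration of this contract, a case- and accent-insensitive word tokenizer could be sketched in Python as follows (a sketch only; the library's 'tokenize-text may behave differently):

```python
import unicodedata

def tokenize_text(text):
    """Split text into a multiset (list) of lower-case, accent-folded word tokens."""
    # fold accented characters to their ASCII base letter (e.g. "ç" -> "c")
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    tokens, word = [], []
    for ch in folded.lower():
        if ch.isalnum():
            word.append(ch)
        elif word:
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens

print(tokenize_text("Vanzeveren, François"))  # → ['vanzeveren', 'francois']
```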
3.1.1 Definition
USAGE:
MY-TOKENIZER str
DESCRIPTION:
Converts a string to a token multiset (where each token is a word).
Returns a block! of tokens.
MY-TOKENIZER is a function value.
ARGUMENTS:
str -- (Type: string)
3.1.2 Available tokenizers
Simetrics provides several tokenizers:
| tokenize-rebol-script | Optimized to convert a rebol script into a token multiset.
|
| tokenize-text | Converts text to a token multiset, insensitive to case and
to non-English characters.
|
3.2 Accumulate statistics
Before running queries against the corpus of documents, you must
first compute and accumulate statistics about the corpus. This
is the purpose of the 'accumulate-statistics function.
USAGE:
ACCUMULATE-STATISTICS corpus tokenize
DESCRIPTION:
Accumulates and returns the statistics on the corpus of documents.
ACCUMULATE-STATISTICS is a function value.
ARGUMENTS:
corpus -- Path to the corpus of documents, or the corpus itself. (Type: file block hash)
tokenize -- Splits a string into tokens. (Type: function)
In the following code, we accumulate statistics on the corpus of rebol
scripts from rebol.org using the tokenizer 'tokenize-rebol-script.
>> rebol-org-stats: accumulate-statistics %local/library/scripts/ :tokenize-rebol-script
Do not forget to store the value returned by 'accumulate-statistics.
You will need it when calling the 'get-similarity function for
token-based metrics.
3.3 Jaccard and hybrid distances
Given two word sets S and T, the Jaccard similarity is simply the
ratio of the size of the intersection of S and T to the size of their union.
The Jaccard similarity is designed to compare two strings, without
the need for a document corpus. You do not have to accumulate
statistics to use this metric.
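The definition translates directly into code; a minimal Python sketch over token sets (the simple word splitter used here stands in for any tokenizer):

```python
def jaccard(s_tokens, t_tokens):
    """Jaccard similarity: |S n T| / |S u T| over the two token sets."""
    s, t = set(s_tokens), set(t_tokens)
    if not s and not t:
        return 1.0          # two empty strings: treat as identical
    return len(s & t) / len(s | t)

# a trivial lower-casing word splitter, for demonstration only
tokens = lambda text: text.lower().replace(",", " ").split()

print(jaccard(tokens("Francois Vanzeveren"), tokens("vanzeveren, francois")))  # → 1.0
```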
3.3.1 Jaccard similarity
3.3.1.1 Samples
>> get-similarity/jaccard "François Vanzeveren" "vanzeveren, francois" :tokenize-text
== 1
>> get-similarity/jaccard "François Vanzeveren" "vanseveren, francois" :tokenize-text
== 0.333333333333333
>> get-similarity/jaccard {Vanzeveren, François
{ Avenue de Floréal, 85 b3
{ 1180, Bruxelles} {François Vanzeveren
{ 85, Avenue de Floréal, b3
{ Bruxelles 1180} :tokenize-text
== 1
3.3.1.2 Advantages and Drawbacks
The Jaccard metric is word-order independent and is particularly suitable for
comparing strings like addresses. On shorter strings, however, this metric is
highly sensitive to misspelled terms.
>> get-similarity/jaccard "François Vanzeveren" "vanseveren, francois" :tokenize-text
== 0.333333333333333
3.3.2 Jaccard hybrid metrics
We developed two hybrid versions of the Jaccard metric to reduce its
sensitivity to misspellings.
>> get-similarity/jaccard "François Vanzeveren" "vanseveren, françoiss" :tokenize-text
== 0
>> get-similarity/jaccard/jaro-hybrid "François Vanzeveren" "vanseveren, françoiss" :tokenize-text
== 0.948148148148148
>> get-similarity/jaccard/jaro-winkler-hybrid "François Vanzeveren" "vanseveren, françoiss" :tokenize-text
== 0.969259259259259
>>
3.3.3 Performance tip
The Jaccard metric is very useful for comparing strings like addresses.
Suppose you have a database with 10000 persons and you want to find those
living at a specific address.
You could code it as below:
; s-address is the source address as a string!
result: make block! []
forall addresses [
score: get-similarity/jaccard/jaro-hybrid s-address first addresses :tokenize-text
repend result [score first addresses]
]
Every call to 'get-similarity tokenizes both the source and target
addresses. This means that, overall, 20000 strings will be tokenized.
Now, if you code as below:
; s-address is the source address as a string!
s-bag: tokenize-text s-address
result: make block! []
forall addresses [
score: get-similarity/jaccard/jaro-hybrid s-bag first addresses :tokenize-text
repend result [score first addresses]
]
Here, we "pre-tokenize" the source address and pass the token multiset to 'get-similarity.
Therefore, 'get-similarity will not tokenize the source address again. This leaves
only 10001 strings to be tokenized, which makes a huge difference.
3.4 Term Frequency Inverse Document Frequency and variants
Term Frequency Inverse Document Frequency (TF-IDF) is a simple and effective
weighting scheme that is the starting point for other, more recent algorithms.
The purpose of the TF-IDF implemented in Simetrics is not to compute the
similarity of two strings, but to retrieve from a large set of documents
(the corpus) the most relevant ones with regard to a specific query.
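The general scheme can be sketched in Python as follows. This is one common formulation (document score = sum of term frequency times a log-scaled inverse document frequency over the query terms); the exact weights Simetrics computes, visible in the hash! scores of the samples, may differ:

```python
import math
from collections import Counter

def accumulate_statistics(corpus, tokenize):
    """corpus: mapping document-name -> text.
    Returns per-document term counts plus document frequencies."""
    doc_terms = {name: Counter(tokenize(text)) for name, text in corpus.items()}
    doc_freq = Counter()
    for counts in doc_terms.values():
        doc_freq.update(counts.keys())      # each document counts once per term
    return doc_terms, doc_freq

def tfidf_query(query, stats, tokenize):
    """Rank documents by the sum, over query terms, of tf * idf."""
    doc_terms, doc_freq = stats
    n_docs = len(doc_terms)
    scores = {}
    for name, counts in doc_terms.items():
        score = sum(
            # rare terms (low document frequency) get a higher idf weight
            counts[term] * math.log(1 + n_docs / doc_freq[term])
            for term in tokenize(query)
            if doc_freq[term]
        )
        if score > 0:
            scores[name] = score
    # most relevant documents first
    return sorted(scores.items(), key=lambda item: -item[1])
```

For example, over a toy corpus `{"a.r": "x x y", "b.r": "x z z"}` with `str.split` as tokenizer, the query "z" ranks only "b.r", while "x" ranks "a.r" first because of its higher term frequency.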
3.4.1 TF-IDF
This is the simplest implementation of TF-IDF.
3.4.1.1 Samples
>> do %lib/simetrics/simetrics.r
Script: "Similarity Metrics" (18-Feb-2006)
>> rebol-org-stats: accumulate-statistics %library/scripts/ :tokenize-rebol-script
>> get-similarity/tfidf "bind, append, word!" rebol-org-stats :tokenize-rebol-script
== make hash! [70.6624872199201 %sql-protocol.r 48.6515826223692 %menu-system.r 28.1528683572244 %ez-plot.r 27.8153672667736 %glayo...
>> get-similarity/tfidf "compose" rebol-org-stats :tokenize-rebol-script
== make hash! [29.1715770316469 %wiki.r 27.7129981800646 %sql-protocol.r 21.149393347944 %easy-service.r 19.6908144963617 %easy-soc...
>> get-similarity/tfidf "compose deep" rebol-org-stats :tokenize-rebol-script
== make hash! [55.5549255542205 %wiki.r 38.266337589094 %sql-protocol.r 33.6547545814701 %prolog.r 27.6058190531338 %easy-soccer.r ...
>> get-similarity/tfidf "compose/deep" rebol-org-stats :tokenize-rebol-script
== make hash! [55.5549255542205 %wiki.r 38.266337589094 %sql-protocol.r 33.6547545814701 %prolog.r 27.6058190531338 %easy-soccer.r ...
>>
The metric returns a sorted list of the scripts available in the corpus, from
the most relevant to the least relevant with regard to the query.
3.4.1.2 Advantages
Queries are very fast and accurate. If a user were to input a query on a
particular topic, TF-IDF can find the documents that contain relevant
information on that topic.
Observe the results of the queries below and you will understand why TF-IDF
is so powerful:
>> get-similarity/tfidf "add" rebol-org-stats :tokenize-rebol-script
== make hash! [10.3221918205618 %vid-usage.r 8.73416230970611 %glayout.r 7.94014755427828 %menubar.r 7.94014755427828 %prolog.r 7.1...
>> get-similarity/tfidf "compose add" rebol-org-stats :tokenize-rebol-script
== make hash! [30.095042446348 %sql-protocol.r 29.9655917870747 %wiki.r 21.149393347944 %easy-service.r 19.6908144963617 %easy-socc...
>>
The example below is even clearer:
>> get-similarity/tfidf "func function" rebol-org-stats :tokenize-rebol-script
== make hash! [32.6276130333423 %sql-protocol.r 27.8743152594645 %prolog.r 24.8717681646205 %glayout.r 21.7517420222282 %rugby4.r 1...
>> get-similarity/tfidf "face func function" rebol-org-stats :tokenize-rebol-script
== make hash! [399.762300135641 %glayout.r 115.769315165492 %area-scroll-style.r 91.8702986734476 %spellck.r 85.2856470360527 %layo...
>>
The first query of each example concerns very common words (add, func, function) in rebol scripts.
These words appear very often in almost every script.
The second query adds a more specific word to the initial query, and the result is completely
different in terms of most relevant scripts and relevance scores. This means that TF-IDF
gives more weight to discriminating words in the corpus of documents.
The analogy with plain English documents is straightforward: very common words like "a", "the", etc.
are completely irrelevant in a query, and TF-IDF handles this very efficiently!
3.4.1.3 Drawbacks
Building the statistics can take quite some time.
In its simplest form, TF-IDF cannot handle relationships between
words (e.g. synonyms or plural forms).
We can easily see the impact of this limitation with a query on the corpus of
rebol scripts. Suppose we are looking for scripts using the 'append word. As good
rebolers, we know that 'repend is a variant of 'append, and we do not mind whether
the returned scripts use 'append or 'repend.
>> get-similarity/tfidf "append" rebol-org-stats :tokenize-rebol-script
== make hash! [28.4802272916107 %sql-protocol.r 28.1528683572244 %ez-plot.r 16.0405877849302 %dig.r 14.7311520473849 %glayout.r 13....
>>
In the above query, only 'append is taken into account to compute the score of a
script with regard to the query. Note that both %sql-protocol.r and %dig.r use
'append, but %sql-protocol.r uses it more frequently.
>> get-similarity/tfidf "repend" rebol-org-stats :tokenize-rebol-script
== make hash! [16.2745276063549 %etext.r 12.2058957047662 %content-management.r 12.2058957047662 %site-check.r 12.2058957047662 %sq...
>>
In the above query, only 'repend is taken into account to compute the score of a
script with regard to the query.
>> get-similarity/tfidf "repend append" rebol-org-stats :tokenize-rebol-script
== make hash! [40.6861229963769 %sql-protocol.r 31.204342283416 %ez-plot.r 19.8754758846046 %etext.r 16.0405877849302 %dig.r 14.731...
>>
In the above query, both 'append and 'repend are taken into account to compute the score of a
script with regard to the query. From the result, it turns out that %sql-protocol.r is the most
relevant example of a script using both 'append and 'repend, within the corpus of scripts.
This limitation can sometimes be regarded as an advantage, e.g. if you are looking
for scripts using 'repend specifically.
Tip to circumvent this limitation
You can easily build a specific tokenizer that adds related words to the query,
so that the query "append" turns into "append repend".
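Such a query-expanding tokenizer can be sketched as a wrapper around an existing one (Python sketch; the expansion table and names here are hypothetical, not part of Simetrics):

```python
# hypothetical expansion table: each term maps to the related terms to add
RELATED = {"append": ["repend"], "repend": ["append"]}

def expanding(tokenize):
    """Wrap a tokenizer so every token also contributes its related terms."""
    def wrapped(text):
        tokens = list(tokenize(text))
        for token in list(tokens):
            tokens.extend(RELATED.get(token, []))
        return tokens
    return wrapped

query_tokenize = expanding(str.split)
print(query_tokenize("append"))  # → ['append', 'repend']
```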
3.4.2 Variants to TF-IDF
To be completed.