Documentation for: simetrics.r
Created by: fvzv
on: 17-Feb-2006
Last updated by: fvzv on: 19-Feb-2006
Format: html
Downloaded on: 18-Apr-2024

REBOL

Simetrics

Author: François Vanzeveren
Date: 19-Feb-2006

Contents:

1. Introduction
2. String similarity metrics
2.1 Global refinements
2.2 Edit distance metrics
2.3 The Jaro metric and variants
3. Token-based metrics and document queries
3.1 Tokenizer
3.2 Accumulate statistics
3.3 Jaccard and hybrid distances
3.4 Term Frequency Inverse Document Frequency and variants

1. Introduction

Simetrics stands for "Similarity Metrics".

This library implements a few effective and widely used metrics for measuring string similarities.

The metrics currently implemented are:

  • Levenshtein
  • Jaro
  • Jaro-Winkler
  • Jaccard
  • Term Frequency Inverse Document Frequency
  • Term Frequency Inverse Document Frequency combined with Jaro
  • Term Frequency Inverse Document Frequency combined with Jaro-Winkler

Resources

This document is based on the paper "A Comparison of String Distance Metrics for Name-Matching Tasks" available at http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.

2. String similarity metrics

This class of metrics matches a source string with a target string and returns the level of similarity as a number between 0 (no similarity) and 1 (perfect match).

These metrics are sensitive to word order.

>> get-similarity/jaro "Francois Vanzeveren" "François Vanzeveren"
== 1
>> get-similarity/jaro "Vanzeveren Francois" "François Vanzeveren"
== 0.582934609250399

2.1 Global refinements

By default, string matching metrics are insensitive to case and to non-English characters (accents).

>> get-similarity/levenshtein "François" "francois"
== 1

You can change this behaviour with the following refinements.

 /case -- Characters are case-sensitive.
 /strict -- Matching is sensitive to non-English characters.
>> get-similarity/levenshtein/case "françois" "François"
== 0.875
>> get-similarity/levenshtein/strict "François" "Francois"
== 0.875

2.2 Edit distance metrics

An important class of static string similarity metrics are "edit distances". The distance between the source and target strings is the cost of the best sequence of "edit operations" that converts the source string to the target string.

The edit operations are deletions, insertions and substitutions.

2.2.1 Levenshtein

The Levenshtein metric is a simple unit-cost implementation of the edit distance.

2.2.1.1 Samples
>> get-similarity/levenshtein "François" "François"
== 1
>> get-similarity/levenshtein "Jhonny" "Johnny"
== 0.666666666666667
>> get-similarity/levenshtein "Willlaim" "William"
== 0.75
2.2.1.2 Refinements

The Levenshtein metric comes with three refinements:

 /del-cost -- The cost of deleting the next letter in the source string.
 /ins-cost -- The cost of inserting a new letter that does not appear in the source string.
 /sub-cost -- The cost of substituting a different letter for the next letter in the source string.

By default, all costs are set to 1.

>> get-similarity/levenshtein "Willlaim" "William"
== 0.75
>> get-similarity/levenshtein/del-cost "Willlaim" "William" 2
== 0.25
>> get-similarity/levenshtein/ins-cost "Willlaim" "William" 2
== 0.25
>> get-similarity/levenshtein/sub-cost "Willlaim" "William" 2
== 0.625
>> get-similarity/levenshtein/del-cost/ins-cost/sub-cost "Willlaim" "William" 2 2 2
== 0.5
2.2.1.3 Advantages and Drawbacks

The recursive implementation available in Simetrics, while elegant, is rather slow and is not suitable for large matching requests.
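The same metric can be computed iteratively with dynamic programming in O(n*m) time, which avoids the recursion cost. The sketch below is in Python, not the library's REBOL code; the normalization to a similarity score as 1 - (distance / length of the longer string) is inferred from the unit-cost samples above, and the library may normalize weighted costs differently.

```python
def levenshtein_similarity(s, t, del_cost=1, ins_cost=1, sub_cost=1):
    """Weighted edit distance, normalized to a 0..1 similarity score."""
    m, n = len(s), len(t)
    # d[i][j] = minimal cost of transforming s[:i] into t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            keep_or_sub = 0 if s[i - 1] == t[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,          # delete s[i-1]
                          d[i][j - 1] + ins_cost,          # insert t[j-1]
                          d[i - 1][j - 1] + keep_or_sub)   # keep or substitute
    return 1 - d[m][n] / max(m, n)
```

With unit costs this reproduces the samples above: "Jhonny"/"Johnny" scores 2/3 and "Willlaim"/"William" scores 0.75.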

2.3 The Jaro metric and variants

These metrics are intended primarily to measure the similarity of short strings.

2.3.1 Jaro

Measures the similarity of two strings based on the number and order of the characters they have in common.

2.3.1.1 Samples
>> get-similarity/jaro "François" "François"
== 1
>> get-similarity/jaro "Jhonny" "Johnny"
== 0.944444444444445
>> get-similarity/jaro "Willlaim" "William"
== 0.910714285714286
2.3.1.2 Advantages and Drawbacks

The Jaro similarity metric is very effective, and its implementation in Simetrics is fast and suitable for large matching requests.

2.3.2 Jaro-Winkler

This is a variant of the Jaro metric that emphasizes matches in the first few characters.

2.3.2.1 Samples
>> get-similarity/jaro-winkler "François" "François"
== 1
>> get-similarity/jaro-winkler "Jhonny" "Johnny"
== 0.95
>> get-similarity/jaro-winkler "Willlaim" "William"
== 0.946428571428571
2.3.2.2 Advantages and Drawbacks

Like Jaro, the Jaro-Winkler metric is very effective, and its implementation in Simetrics is fast and suitable for large matching requests.
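Both metrics can be sketched in Python as follows. This is the standard formulation (a matching window, a transposition count, and Winkler's prefix boost, assumed here with scale 0.1 capped at 4 characters), which reproduces the sample scores above; the library's REBOL implementation may differ in details.

```python
def jaro(s, t):
    """Jaro similarity: common characters within a sliding window,
    penalized by transpositions among the matched characters."""
    if s == t:
        return 1.0
    window = max(len(s), len(t)) // 2 - 1
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    m = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_matched[j] and t[j] == c:
                s_matched[i] = t_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # matched characters of each string, in original order
    s_seq = [c for c, ok in zip(s, s_matched) if ok]
    t_seq = [c for c, ok in zip(t, t_matched) if ok]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s, t, scale=0.1, max_prefix=4):
    """Winkler's variant: boost the Jaro score for a shared prefix."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)
```

For example, jaro("Jhonny", "Johnny") gives 0.9444... and jaro_winkler("Jhonny", "Johnny") gives 0.95, matching the samples.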

3. Token-based metrics and document queries

Like the metrics presented in the previous section, the first metric in this section (Jaccard) is also designed to compare two strings and return their level of similarity, but in a word-order-independent manner.

In the remainder of the section, we discuss metrics for running queries against a corpus of documents and returning the documents most relevant to the query.

Corpus of documents

To test the metrics described in this section, we recommend downloading the script library from Rebol.org at http://www.rebol.org/cgi-bin/cgiwrap/rebol/download-librarian.r

3.1 Tokenizer

All the metrics described in this section require you to provide a "tokenizer", i.e. a function that converts a string to a token multiset (where each token is a word). The metrics then measure similarity on this multiset.

3.1.1 Definition

USAGE:
    MY-TOKENIZER str
DESCRIPTION:
     Converts a string to a token multiset (where each token is a word).
     Returns a block! of tokens.
     MY-TOKENIZER is a function value.
ARGUMENTS:
     str -- (Type: string)

3.1.2 Available tokenizers

Simetrics provides several tokenizers:

 tokenize-rebol-script -- Optimized to convert a REBOL script into a token multiset.
 tokenize-text -- Converts text to a token multiset, insensitive to case and non-English characters.
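As a rough Python analogue of such a text tokenizer, assuming "insensitive to case and non-English characters" means case folding plus accent stripping and splitting on non-alphanumeric runs (the REBOL implementation may differ):

```python
import re
import unicodedata

def tokenize_text(s):
    """Lowercase, strip accents, and split on non-alphanumeric runs."""
    # decompose accented characters, then drop the combining marks
    s = unicodedata.normalize("NFD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    return [tok for tok in re.split(r"[^0-9a-zA-Z]+", s.lower()) if tok]

print(tokenize_text("François Vanzeveren"))  # ['francois', 'vanzeveren']
```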

3.2 Accumulate statistics

Before running queries against the corpus of documents, you must first compute and accumulate statistics about the corpus. This is the purpose of the 'accumulate-statistics function.

USAGE:
    ACCUMULATE-STATISTICS corpus tokenize
DESCRIPTION:
     Accumulates and returns the statistics on the corpus of documents.
     ACCUMULATE-STATISTICS is a function value.
ARGUMENTS:
     corpus -- Path to the corpus of documents, or the corpus itself. (Type: file block hash)
     tokenize -- Splits a string into tokens. (Type: function)

In the following code, we accumulate statistics on the corpus of rebol scripts from rebol.org using the tokenizer 'tokenize-rebol-script.

>> rebol-org-stats: accumulate-statistics %local/library/scripts/ :tokenize-rebol-script

Do not forget to store the value returned by 'accumulate-statistics. You will need it when calling the 'get-similarity function for token-based metrics.
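The kind of statistics involved can be pictured with a short Python sketch. The assumption here (not documented above) is that the statistics hold per-document term counts plus corpus-wide document frequencies, which is what TF-IDF scoring needs; the corpus is modelled as an in-memory mapping rather than a directory of files.

```python
from collections import Counter

def accumulate_statistics(corpus, tokenize):
    """Build per-document term counts and corpus-wide document frequencies.

    corpus: mapping of document name -> document text.
    Returns (term_counts, doc_freq, n_docs) for later scoring.
    """
    term_counts = {}        # document name -> Counter of its tokens
    doc_freq = Counter()    # token -> number of documents containing it
    for name, text in corpus.items():
        counts = Counter(tokenize(text))
        term_counts[name] = counts
        doc_freq.update(counts.keys())
    return term_counts, doc_freq, len(corpus)
```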

3.3 Jaccard and hybrid distances

Given two word sets S and T, the Jaccard similarity is simply the ratio of the size of the intersection of S and T to the size of their union.

The Jaccard similarity is designed to compare two strings, without the need for a document corpus. You do not have to accumulate statistics to use this metric.
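As a Python sketch (the tokenizer argument mirrors the tokenizers of section 3.1; this is an illustration, not the library's REBOL code):

```python
def jaccard(source, target, tokenize):
    """Jaccard similarity of the two token sets produced by the tokenizer."""
    s, t = set(tokenize(source)), set(tokenize(target))
    if not s and not t:
        return 1.0  # two empty strings: treat as identical by convention
    return len(s & t) / len(s | t)

# with any case-insensitive tokenizer, one shared token out of three:
simple = lambda x: x.lower().replace(",", " ").split()
print(jaccard("Francois Vanzeveren", "vanseveren, francois", simple))  # 1/3
```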

3.3.1 Jaccard similarity

3.3.1.1 Samples
>> get-similarity/jaccard "François Vanzeveren" "vanzeveren, francois" :tokenize-text
== 1
>> get-similarity/jaccard "François Vanzeveren" "vanseveren, francois" :tokenize-text
== 0.333333333333333
>> get-similarity/jaccard {Vanzeveren, François
{    Avenue de Floréal, 85 b3
{    1180, Bruxelles} {François Vanzeveren
{    85, Avenue de Floréal, b3
{    Bruxelles 1180} :tokenize-text
== 1
3.3.1.2 Advantages and Drawbacks

The Jaccard metric is word-order independent and particularly suitable for comparing strings like addresses. On shorter strings, however, it is highly sensitive to misspelled terms.

>> get-similarity/jaccard "François Vanzeveren" "vanseveren, francois" :tokenize-text
== 0.333333333333333

3.3.2 Jaccard hybrid metrics

We developed two hybrid versions of the Jaccard metric to reduce its sensitivity to misspellings.

>> get-similarity/jaccard "François Vanzeveren" "vanseveren, françoiss" :tokenize-text
== 0
>> get-similarity/jaccard/jaro-hybrid "François Vanzeveren" "vanseveren, françoiss" :tokenize-text
== 0.948148148148148
>> get-similarity/jaccard/jaro-winkler-hybrid "François Vanzeveren" "vanseveren, françoiss" :tokenize-text
== 0.969259259259259
>>
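The hybrid variants appear to replace exact token overlap with a character-level inner metric. One Python guess at the /jaro-hybrid scheme, inferred from the sample above rather than from any documented formula, pairs each token of the smaller token set with its best Jaro match in the other set and averages the scores; with an accent-insensitive tokenizer this yields the 0.948148... of the sample. A standard Jaro implementation is included so the sketch is self-contained, and the /jaro-winkler-hybrid variant presumably substitutes Jaro-Winkler as the inner metric.

```python
def jaro(s, t):
    """Standard Jaro similarity (window matching plus transpositions)."""
    if s == t:
        return 1.0
    window = max(len(s), len(t)) // 2 - 1
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    m = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_matched[j] and t[j] == c:
                s_matched[i] = t_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    s_seq = [c for c, ok in zip(s, s_matched) if ok]
    t_seq = [c for c, ok in zip(t, t_matched) if ok]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaccard_jaro_hybrid(source, target, tokenize):
    """Average, over the smaller token set, of each token's best Jaro
    score against the other set (an inferred scheme, not a documented one)."""
    s, t = set(tokenize(source)), set(tokenize(target))
    if not s or not t:
        return 0.0
    small, large = (s, t) if len(s) <= len(t) else (t, s)
    return sum(max(jaro(a, b) for b in large) for a in small) / len(small)
```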

3.3.3 Performance tip

The Jaccard metric is very useful for comparing strings like addresses. Suppose you have a database of 10000 persons and you want to find those living at a specific address.

You could code it as below:

; s-address is the source address as a string!
result: make block! []
forall addresses [
    score: get-similarity/jaccard/jaro-hybrid s-address first addresses :tokenize-text
    repend result [score first addresses]
]

Every call to 'get-similarity involves tokenizing both the source and target addresses. This means that overall, 20000 strings will be tokenized.

Now, if you code as below:

; s-address is the source address as a string!
s-bag: tokenize-text s-address
result: make block! []
forall addresses [
    score: get-similarity/jaccard/jaro-hybrid s-bag first addresses :tokenize-text
    repend result [score first addresses]
]

Here, we "pre-tokenize" the source address and pass the token multiset to 'get-similarity. Therefore, 'get-similarity will not tokenize the source address. This leads to only 10001 addresses to be tokenized, which makes a huge difference.

3.4 Term Frequency Inverse Document Frequency and variants

Term Frequency Inverse Document Frequency (TF-IDF) is a simple and effective weighting scheme that is the starting point for other, more recent algorithms.

The purpose of the TF-IDF implemented in Simetrics is not to compute the similarity of two strings, but to retrieve from a large set of documents (the corpus) the most relevant ones with regard to a specific query.

3.4.1 TF-IDF

This is the simplest implementation of TF-IDF.

3.4.1.1 Samples
>> do %lib/simetrics/simetrics.r
Script: "Similarity Metrics" (18-Feb-2006)
>> rebol-org-stats: accumulate-statistics %library/scripts/ :tokenize-rebol-script
>> get-similarity/tfidf "bind, append, word!" rebol-org-stats :tokenize-rebol-script
== make hash! [70.6624872199201 %sql-protocol.r 48.6515826223692 %menu-system.r 28.1528683572244 %ez-plot.r 27.8153672667736 %glayo...
>> get-similarity/tfidf "compose" rebol-org-stats :tokenize-rebol-script
== make hash! [29.1715770316469 %wiki.r 27.7129981800646 %sql-protocol.r 21.149393347944 %easy-service.r 19.6908144963617 %easy-soc...
>> get-similarity/tfidf "compose deep" rebol-org-stats :tokenize-rebol-script
== make hash! [55.5549255542205 %wiki.r 38.266337589094 %sql-protocol.r 33.6547545814701 %prolog.r 27.6058190531338 %easy-soccer.r ...
>> get-similarity/tfidf "compose/deep" rebol-org-stats :tokenize-rebol-script
== make hash! [55.5549255542205 %wiki.r 38.266337589094 %sql-protocol.r 33.6547545814701 %prolog.r 27.6058190531338 %easy-soccer.r ...
>>

The metric returns a sorted list of the scripts in the corpus, from the most relevant to the least relevant with regard to the query.
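As an illustration of the scoring idea, here is a toy Python sketch using one standard TF-IDF weighting, tf(term, doc) x log(N / df(term)). The library's exact weighting scheme and statistics layout are not documented here, and the statistics-accumulation step is collapsed into the function for brevity.

```python
import math
from collections import Counter

def tfidf_scores(query, corpus, tokenize):
    """Rank documents by the sum, over query terms, of tf * idf.

    corpus: mapping of document name -> text.
    Uses idf = log(N / df); a term found in every document scores 0,
    and a term found in no document contributes nothing.
    """
    docs = {name: Counter(tokenize(text)) for name, text in corpus.items()}
    n = len(docs)
    doc_freq = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())
    ranking = []
    for name, counts in docs.items():
        score = sum(counts[t] * math.log(n / doc_freq[t])
                    for t in tokenize(query) if doc_freq[t])
        ranking.append((score, name))
    return sorted(ranking, reverse=True)  # most relevant first
```

On a toy corpus this shows the behaviour described below: a word present in every document contributes nothing, while a more specific word reorders the ranking.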

3.4.1.2 Advantages

Queries are very fast and accurate. If a user were to input a query on a particular topic, TF-IDF can find the documents containing relevant information on that topic.

Observe the results of the queries below and you will understand why TF-IDF is so powerful:

>> get-similarity/tfidf "add" rebol-org-stats :tokenize-rebol-script
== make hash! [10.3221918205618 %vid-usage.r 8.73416230970611 %glayout.r 7.94014755427828 %menubar.r 7.94014755427828 %prolog.r 7.1...
>> get-similarity/tfidf "compose add" rebol-org-stats :tokenize-rebol-script
== make hash! [30.095042446348 %sql-protocol.r 29.9655917870747 %wiki.r 21.149393347944 %easy-service.r 19.6908144963617 %easy-socc...
>>

The example below is even clearer:

>> get-similarity/tfidf "func function" rebol-org-stats :tokenize-rebol-script
== make hash! [32.6276130333423 %sql-protocol.r 27.8743152594645 %prolog.r 24.8717681646205 %glayout.r 21.7517420222282 %rugby4.r 1...
>> get-similarity/tfidf "face func function" rebol-org-stats :tokenize-rebol-script
== make hash! [399.762300135641 %glayout.r 115.769315165492 %area-scroll-style.r 91.8702986734476 %spellck.r 85.2856470360527 %layo...
>>

The first query of each example concerns very common words (add, func, function) in rebol scripts. These words appear very often in every script.

The second query adds a more specific word to the initial query, and the result is completely different in terms of the most relevant scripts and their relevance scores. This means that TF-IDF gives more weight to the discriminating words in the corpus of documents.

The analogy with plain English documents is straightforward: very common words like "a", "the", etc. are completely irrelevant in a query, and TF-IDF handles this very efficiently!

3.4.1.3 Drawbacks

Building the statistics can take a rather long time.

In its simplest form, TF-IDF cannot handle relationships between words (e.g. synonyms or plural forms).

We can easily see the impact of this limitation with a query on the corpus of rebol scripts. Suppose we are looking for scripts using the 'append word. As good rebolers, we know that 'repend is a variant of 'append, and we are happy to get scripts using either 'append or 'repend.

>> get-similarity/tfidf "append" rebol-org-stats :tokenize-rebol-script
== make hash! [28.4802272916107 %sql-protocol.r 28.1528683572244 %ez-plot.r 16.0405877849302 %dig.r 14.7311520473849 %glayout.r 13....
>>

In the above query, only 'append is taken into account to compute the score of a script with regard to the query. Note that both %sql-protocol.r and %dig.r use 'append, but %sql-protocol.r uses it more frequently.

>> get-similarity/tfidf "repend" rebol-org-stats :tokenize-rebol-script
== make hash! [16.2745276063549 %etext.r 12.2058957047662 %content-management.r 12.2058957047662 %site-check.r 12.2058957047662 %sq...
>>

In the above query, only 'repend is taken into account to compute the score of a script with regard to the query.

>> get-similarity/tfidf "repend append" rebol-org-stats :tokenize-rebol-script
== make hash! [40.6861229963769 %sql-protocol.r 31.204342283416 %ez-plot.r 19.8754758846046 %etext.r 16.0405877849302 %dig.r 14.731...
>>

In the above query, both 'append and 'repend are taken into account to compute the score of a script with regard to the query. From the result, it turns out that %sql-protocol.r is the most relevant example of a script using both 'append and 'repend, within the corpus of scripts.

This limitation can sometimes be regarded as an advantage, e.g. if you are looking for scripts using 'repend specifically.

Tip to circumvent this limitation

You can easily build a specific tokenizer that adds related words to the query, so that the query "append" turns into "append repend".
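For instance, in Python (the RELATED table and the wrapper function are hypothetical illustrations, not part of Simetrics):

```python
RELATED = {"append": ["repend"]}  # hypothetical table of related words

def expanding_tokenizer(base_tokenize):
    """Wrap a tokenizer so each token also contributes its related words."""
    def tokenize(s):
        tokens = []
        for tok in base_tokenize(s):
            tokens.append(tok)
            tokens.extend(RELATED.get(tok, []))
        return tokens
    return tokenize

tok = expanding_tokenizer(str.split)
print(tok("append word"))  # ['append', 'repend', 'word']
```

Such an expanded tokenizer would typically be applied to the query, so that documents using either variant score against it.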

3.4.2 Variants to TF-IDF

To be completed.

MakeDoc2 by REBOL - 19-Feb-2006