Legend:
Library
Module
Module type
Parameter
Class
Class type
Levenshtein distance
We take inspiration from http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata for the main algorithm and ideas. However some parts are adapted
Abstraction over Strings
Due to the existence of several encodings and string representations we abstract over the type of strings. A string is a finite array of characters (8-bits char, unicode runes, etc.) which provides a length operation and a function to access the n-th character.
This data structure is used to represent a list of result that is evaluated only as far as the user wants. If the user only wants a few elements, she doesn't pay for the remaining ones.
In particular, when matching a string against a (big) set of indexed strings, we return a continuation list so that, even if there are many results, only those actually asked for are evaluated.
type'a klist = unit ->[ `Nil | `Cons of 'a * 'aklist ]
The signature for a given string representation provides 3 main things:
a edit_distance function to compute the edit distance between strings
an automaton type that is built from a string s and a maximum distance n, and only accepts the strings s' such that edit_distance s s' <= n.
an Index module that can be used to map many strings to values, like a regular string map, but for which retrieval is fuzzy (for a given maximal distance).
A possible use of the index could be:
open Batteries;;
let words = File.with_file_in "/usr/share/dict/english"
(fun i -> IO.read_all i |> String.nsplit ~by:"\\n");;
let words = List.map (fun s->s,s) words;;
let idx = Levenshtein.Index.of_list words;;
Levenshtein.Index.retrieve ~limit:1 idx "hell" |> Levenshtein.klist_to_list;;
Edition distance between two strings. This satisfies the classical distance axioms: it is always positive, symmetric, and satisfies the formula distance a b + distance b c >= distance a c
Automaton
An automaton, built from a string s and a limit n, that accepts every string that is at distance at most n from s.
Build an automaton from a string, with a maximal distance limit. The automaton will accept strings whose edit_distance to the parameter is at most limit.
match_with a s matches the string s against a, and returns true if the distance from s to the word represented by a is smaller than the limit used to build a