Mega Code Archive

 
Categories / Delphi / Strings
 

Using the Damerau Levenshtein Metric for string comparsion

Title: Using the Damerau-Levenshtein-Metric for string comparsion. Question: Are two strings similar? Answer: What is Levenshtein Distance/Metric? Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. If x,y are strings then the distances ist defined by LD(x,y) = min(i)(#S(i)+#D(i)+#I(i) ) for all possible edit sessions i. #S(I), #D(i), # I(i) are the number of substitutions, deletions and insertions required in the edit session i. For example, h If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical. h If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t. The greater the Levenshtein distance, the more different the strings are. Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. h Spell checking h Speech recognition h DNA analysis h Plagiarism detection The Damerau-Levenstein metric is a generalisation of the Levenstein metric. It assigns weights to the edit operations. It is defined as DLD(x,y) = min(i)(#S(i)*Ws+#D(i)*Wd+#I(i)*Wi ) for all possible edit sessions i. Again #S(I), #D(i), # I(i) are the number of substitutions, deletions and insertions required in the edit session i. Ws, Wd, Wi are positive numbered weighting factors for the edit operations. For more information search the web for [levenshtein Vphp] using www.google.com. levensthtein is a function in the PHP library. The string Vphp is used to exclude all the PHP manual pages from the search, that obscure the more interesting links. With this metric and a quadruppel of numbers (the weights and a threshold value of likeness) you can implement a StringsAreLike function of your own taste. The below example I used to search for similar names (Meyer, Meier, Mayer, Mayr) in a database. This is usefull if you can not remeber the correct spelling of a persons name. Implementation. uses math, sysutils; const ws=3; // weight for substitution wi=1; // weight for insertion wd=6; // weight for deleting th=4; function StringsAreLike(const s1,s2:string):boolean; begin result:= DamerauLevenshteinLike(s1,s2,ws,wi,wd)end; function DamerauLevenshteinLike(const s1,s2:string;ws,wi,wd:integer):Integer; VAR i,j:Integer; function Pp(x,y:Integer):Integer; begin if AnsiUpperCase(s1[x])=AnsiUpperCase(s2[y]) then Pp:=0 else Pp:=ws; end; var Wmax:integer; d:array of array of integer; begin Wmax:=Max(length(s1),length(s2))+1; SetLength(d,Wmax,Wmax); dec(Wmax); d[0,0]:=0; for j:=1 TO Wmax DO d[0,j]:=d[0,Pred(j)]+wi; for i:=1 TO Wmax DO d[i,0]:=d[Pred(i),0]+wd; for i:=1 TO Length(s1) DO for j:=1 TO Length(s2) DO d[i,j]:=MinIntValue([ d[Pred(i),Pred(j)]+Pp(i,j), //substitution d[ i ,Pred(j)]+wi, //insertion d[Pred(i), j ]+wd //deletion ]); result:=d[Length(s1),Length(s2)]; SetLength(d,0); end{DamerauLevenshteinLike};