Mega Code Archive

 
Categories / Delphi / Multimedia
 

Soundex Revisited Searching Strings by sound

Title: Soundex Revisited - Searching Strings by sound. Question: How to match strings based on the way they sound & not on their spellings. Answer: This article is in continuation of my previous article (http://www.delphi3000.com/article.asp?id=1560) and represents an attempt at making the SoundEx() more versatile so as to theoratically accomodate languages other than English - the only restriction being that the language should use the ASCII character set. Another advantage is that the function can be "tuned" to peculiarities of a language e.g. "Knife" is pronounced as "Nife" in English. There is theoratically no limit to this "tunability" - of course with corresponding decrease in performance. But you can get amazing results which are better than what SoundEx() gives. I have chosen to post a new article rather than update the original one since the original function has been modified quite significantly (in concept) thus making it different from the industry standard SoundEx() function - which was implemented in the original article. Since the function now supports language "tuning", it can give different results than the industry standard SoundEx(). I have thus renamed the function to "Sound()". This also gives me the freedom to implement it differently. Sound() returns the same value (M240) for each of Micael/Maical/Michael/Maichael. Additionally, since it has been (partially) tuned for English, it will give the same result (F500) for "Phone"/"Fone". I guess the "Ultimate" Sound Matching logic will be based on phonemes - of which I currently know very little. If you help me by providing me details of phonemes that you may have, then I will make yet another attempt at improving "Sound()" even further... I thank Toninho Nunes and Joe Meyer for providing me ideas & inputs respectively. Please save the code below in a file called "Sounds.pas". You will need to include the file in your source (Uses Sounds) and then you will have access to the Sound() function. {********************************************************************} {* Description: Modified Soundex function in which it is attempted to include *} {* language pecularities which theoratically makes it adaptable to languages *} {* other than English - the only restriction being that the language in *} {* question should use ASCII character set *} {********************************************************************} {* Date Created : 15-Nov-2000 *} {* Last Modified : 16-Nov-2000 *} {* Version : 0.10 *} {* Author : Paramjeet Reen *} {* eMail : Paramjeet.Reen@EudoraMail.com *} {******************************************************************************} {* This program is based on an algorithm that I had found in a magazine, *} {* merged with an algorithm of a program posted by Joe Meyer. I do not *} {* gurantee the fitness of this program in any way. Use it at your own risk. *} {********************************************************************} {* Category: Freeware. *} {********************************************************************} unit Sounds; interface //Returns a code for InpStr depending upon how it sounds. function Sound(const InpStr :ShortString) :ShortString; implementation type TReplacePos = (pStart, pMid, pEnd); TReplacePosSet = set of TReplacePos; const {********************************************************************} {* The following are selected letters of the alphabet which are divided *} {* into their corresponding code (1-6). You might need to modify these for *} {* different languages depending upon whether the language requires *} {* alphabets other than the ones specified below *} {********************************************************************} Chars1 = ['B','P','F','V']; Chars2 = ['C','S','K','G','J','Q','X','Z']; Chars3 = ['D','T']; Chars4 = ['L']; Chars5 = ['M','N']; Chars6 = ['R']; procedure ReplaceStr(var InpStr :ShortString; const SubStr,WithStr :ShortString; const ReplacePositions :TReplacePosSet); var i :Integer; begin if(pStart in ReplacePositions)then begin i := Pos(SubStr,InpStr); if(i = 1)then begin Delete(InpStr,i,Length(SubStr)); Insert(WithStr,InpStr,i); end; end; if(pMid in ReplacePositions)then begin i := Pos(SubStr,InpStr); while(i 1)and(i begin Delete(InpStr,i,Length(SubStr)); Insert(WithStr,InpStr,i); i := Pos(SubStr,InpStr); end; end; if(pEnd in ReplacePositions)then begin i := Pos(SubStr,InpStr); if(i 1)and(i (Length(InpStr) - Length(SubStr)))then begin Delete(InpStr,i,Length(SubStr)); Insert(WithStr,InpStr,i); end; end; end; function Sound(const InpStr :ShortString) :ShortString; var vStr :ShortString; PrevCh :Char; CurrCh :Char; i :Word; begin {********************************************************************} {* Uppercase & remove invalid characters from given string *} {********************************************************************} {* Please have a long & hard look at this code if you have modified any of *} {* the constants Chars1,Chars2 ... Chars6 by increasing the overall range *} {* of alphabets *} {********************************************************************} vStr := ''; for i := 1 to Length(InpStr)do case InpStr[i] of 'a'..'z': vStr := vStr + UpCase(InpStr[i]); 'A'..'Z': vStr := vStr + InpStr[i]; end; {case} if(vStr '')then begin {**************************************************************************} {* Language Tweaking Section *} {********************************************************************} {* Tweak for language peculiarities e.g. "CAt"="KAt", "KNIfe"="NIfe" *} {* "PHone"="Fone", "PSYchology"="SIchology", "EXcel"="Xcel" etc... *} {* You will need to modify these for different languages. Optionally, you *} {* may choose not to have this section at all, in which case, the output *} {* of Sound() will correspond to that of SoundEx(). Please note however *} {* the importance of what you replace & the order in which you replace. *} {********************************************************************} {* Also, please note that the following replacements are targeted for the *} {* English language & that too is subject to improvements *} {********************************************************************} ReplaceStr(vStr,'CA' ,'KA',[pStart,pMid,pEnd]); //arCAde = arKAde ReplaceStr(vStr,'CL' ,'KL',[pStart,pMid,pEnd]); //CLass = Klass ReplaceStr(vStr,'CK' ,'K' ,[pStart,pMid,pEnd]); //baCK = baK ReplaceStr(vStr,'EX' ,'X' ,[pStart,pMid,pEnd]); //EXcel = Xcel ReplaceStr(vStr,'X' ,'Z' ,[pStart]); //Xylene = Zylene ReplaceStr(vStr,'PH' ,'F' ,[pStart,pMid,pEnd]); //PHone = Fone ReplaceStr(vStr,'KN' ,'N' ,[pStart]); //KNife = Nife ReplaceStr(vStr,'PSY','SI',[pStart]); //PSYche = SIche ReplaceStr(vStr,'SCE','CE',[pStart,pMid,pEnd]); //SCEne = CEne {********************************************************************} {* String Assembly Section *} {********************************************************************} PrevCh := #0; Result := vStr[1]; for i := 2 to Length(vStr) do begin if Length(Result) = 4 then break; CurrCh := vStr[i]; if (CurrCh PrevCh) then begin if CurrCh in Chars1 then Result := Result + '1' else if CurrCh in Chars2 then Result := Result + '2' else if CurrCh in Chars3 then Result := Result + '3' else if CurrCh in Chars4 then Result := Result + '4' else if CurrCh in Chars5 then Result := Result + '5' else if CurrCh in Chars6 then Result := Result + '6'; PrevCh := CurrCh; end; end; end else Result := ''; while(Length(Result) end; end.