visitor (0 QPoints)
  • FR
  • EN
  • NL
  • DE
  • ES
315 experts, 1193 registered users, 1659 questions already answered
European Experts Exchange, the very best site for high-quality IT solutions

New Improved Search!

 


05/10/2011 1h30 : Steve Jobs is dead, the father of Apple ][ is gone, we are all orphaned.

Algorithms :: Maths :: Word approximation algorithm


By: collegeBoy U.S.A.  Date: 21/01/2003 00:00:00  English  Points: 100 Status: Answered
Quality : Excellent
Hi
I want a suitable word approximation search algorithm. My problem is that I have a company database. When new company names arrive I have to check whether it's already there in the Db or not. Now sometimes because of operator mistake or so same name is distorted a bit (sometimes it's abbreviated too). Like Johnson & Nichelson is duplicated as Johnson and Nickelsons or Nickelsons & Johnson. I need to find an algo which will calculate the distance matrix with various words of two names and suggest the names present in the Db if probably matrix < some threshold. Some shareware will also be ok.

Regards
By: sumotimor Date: 21/01/2003 10:24:00 English  Type : Answer
Soundex often works well for names.
Or you could compare by edit distance.
agrep is a fast program for that
<A HREF="ftp://ftp.cs.arizona.edu/agrep/agrep-2.04.tar.Z">ftp://ftp.cs.arizona.edu/agrep/agrep-2.04.tar.Z</a>
By: VGR Date: 24/01/2003 06:37:00 English  Type : Assist
yes to the above

there are also "distance" measures for words, like this one (from PHP)

levenshtein
(PHP 3>= 3.0.17, PHP 4 )

levenshtein -- Calculate Levenshtein distance between two strings
Description

int levenshtein (string str1, string str2)

int levenshtein (string str1, string str2, int cost_ins, int cost_rep, int cost_del)

int levenshtein (string str1, string str2, function cost)


This function returns the Levenshtein-Distance between the two argument strings or -1, if one of the argument strings is longer than the limit of 255 characters (255 should be more than enough for name or dictionary comparison, and nobody serious would be doing genetic analysis with PHP).

The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2. The complexity of the algorithm is O(m*n), where n and m are the length of str1 and str2 (rather good when compared to similar_text(), which is O(max(n,m)**3), but still expensive).

In its simplest form the function will take only the two strings as parameter and will calculate just the number of insert, replace and delete operations needed to transform str1 into str2.

A second variant will take three additional parameters that define the cost of insert, replace and delete operations. This is more general and adaptive than variant one, but not as efficient.

The third variant (which is not implemented yet) will be the most general and adaptive, but also the slowest alternative. It will call a user-supplied function that will determine the cost for every possible operation.

The user-supplied function will be called with the following arguments:


operation to apply: 'I', 'R' or 'D'

actual character in string 1

actual character in string 2

position in string 1

position in string 2

remaining characters in string 1

remaining characters in string 2

The user-supplied function has to return a positive integer describing the cost for this particular operation, but it may decide to use only some of the supplied arguments.

The user-supplied function approach offers the possibility to take into account the relevance of and/or difference between certain symbols (characters) or even the context those symbols appear in to determine the cost of insert, replace and delete operations, but at the cost of losing all optimizations done regarding cpu register utilization and cache misses that have been worked into the other two variants.

See also soundex(), similar_text(), and metaphone().

By: collegeBoy Date: 26/01/2003 18:18:00 English  Type : Comment
Dear sumotimor is Levenshtein free or has to be purchased ?
By: VGR Date: 26/01/2003 18:32:00 English  Type : Comment
I'm not sumotimor ;-)

The function is free as it comes with PHP

The algorithm I don't know, but for sure it's free of royalties or it would not be included in PHP for free

regards
By: collegeBoy Date: 31/01/2003 04:17:00 English  Type : Comment
Sorry for posting so late. I don't know whom to give credit since both sumotimor and VGR deserves it for supplying excellent guidance. To give credit to one means to be unfair to other. I am just out of my wits. May be coordinator will be of help. My sincere thanks and apologies to both.
Regards
By: digitaltree Date: 26/02/2003 20:35:00 English  Type : Comment

hi quest ....

there is a way to share points ... you can even
ask how to do it here in this question.

regarding the question ... I think you should have a
unique ID for each Company name. When a data entry person
enters a name that is not on your list you should raise
a flag...ask data entry person some questions.

It does make sense for google.com to look at alternative
words, but as a database mgr you should not waste time
on that. You should have a well defined list of company
names in your data base...make sure your current data is
correct.

When data entry enters a Company name then check it
against your data base...if it is not in the data base
then start a sequence ... name not in database...
Is this a new company?? then use logic to continue
the questions. If within reason show a list based on
first letter of the entry...data person should see why
it is not new and correct the entry.

The data entry form is very important here ... it must
be designed for easy, logical entry and it must raise
a flag when entry is not wholly consistant with your
database.

sincerely, digiT



By: collegeBoy Date: 02/03/2003 03:49:00 English  Type : Comment
Hi digiT, thanks a lot for the comment. Actually both the answers are equally correct though the first one would suit my requirement but that should not carry away the credit from VGR.

Do register to be able to answer

EContact
browser fav
page generated in 307.621960 milliseconds

Why Google AdSense ads ?

compteur
 Ranking-Hits PageRank for this page