Algorithms :: Maths :: Word approximation algorithm |
|||
| By: collegeBoy |
Date: 21/01/2003 00:00:00 |
Points: 100 | Status: Answered Quality : Excellent |
|
Hi I want a suitable word approximation search algorithm. My problem is that I have a company database. When new company names arrive I have to check whether it's already there in the Db or not. Now sometimes because of operator mistake or so same name is distorted a bit (sometimes it's abbreviated too). Like Johnson & Nichelson is duplicated as Johnson and Nickelsons or Nickelsons & Johnson. I need to find an algo which will calculate the distance matrix with various words of two names and suggest the names present in the Db if probably matrix < some threshold. Some shareware will also be ok. Regards |
|||
| By: sumotimor | Date: 21/01/2003 10:24:00 | Type : Answer |
|
| Soundex often works well for names. Or you could compare by edit distance. agrep is a fast program for that <A HREF="ftp://ftp.cs.arizona.edu/agrep/agrep-2.04.tar.Z">ftp://ftp.cs.arizona.edu/agrep/agrep-2.04.tar.Z</a> |
|||
| By: VGR | Date: 24/01/2003 06:37:00 | Type : Assist |
|
| yes to the above there are also "distance" measures for words, like this one (from PHP) levenshtein (PHP 3>= 3.0.17, PHP 4 ) levenshtein -- Calculate Levenshtein distance between two strings Description int levenshtein (string str1, string str2) int levenshtein (string str1, string str2, int cost_ins, int cost_rep, int cost_del) int levenshtein (string str1, string str2, function cost) This function returns the Levenshtein-Distance between the two argument strings or -1, if one of the argument strings is longer than the limit of 255 characters (255 should be more than enough for name or dictionary comparison, and nobody serious would be doing genetic analysis with PHP). The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2. The complexity of the algorithm is O(m*n), where n and m are the length of str1 and str2 (rather good when compared to similar_text(), which is O(max(n,m)**3), but still expensive). In its simplest form the function will take only the two strings as parameter and will calculate just the number of insert, replace and delete operations needed to transform str1 into str2. A second variant will take three additional parameters that define the cost of insert, replace and delete operations. This is more general and adaptive than variant one, but not as efficient. The third variant (which is not implemented yet) will be the most general and adaptive, but also the slowest alternative. It will call a user-supplied function that will determine the cost for every possible operation. The user-supplied function will be called with the following arguments: operation to apply: 'I', 'R' or 'D' actual character in string 1 actual character in string 2 position in string 1 position in string 2 remaining characters in string 1 remaining characters in string 2 The user-supplied function has to return a positive integer describing the cost for this particular operation, but it may decide to use only some of the supplied arguments. The user-supplied function approach offers the possibility to take into account the relevance of and/or difference between certain symbols (characters) or even the context those symbols appear in to determine the cost of insert, replace and delete operations, but at the cost of losing all optimizations done regarding cpu register utilization and cache misses that have been worked into the other two variants. See also soundex(), similar_text(), and metaphone(). |
|||
| By: collegeBoy | Date: 26/01/2003 18:18:00 | Type : Comment |
|
| Dear sumotimor is Levenshtein free or has to be purchased ? |
|||
| By: VGR | Date: 26/01/2003 18:32:00 | Type : Comment |
|
| I'm not sumotimor ;-) The function is free as it comes with PHP The algorithm I don't know, but for sure it's free of royalties or it would not be included in PHP for free regards |
|||
| By: collegeBoy | Date: 31/01/2003 04:17:00 | Type : Comment |
|
| Sorry for posting so late. I don't know whom to give credit since both sumotimor and VGR deserves it for supplying excellent guidance. To give credit to one means to be unfair to other. I am just out of my wits. May be coordinator will be of help. My sincere thanks and apologies to both. Regards |
|||
| By: digitaltree | Date: 26/02/2003 20:35:00 | Type : Comment |
|
hi quest .... there is a way to share points ... you can even ask how to do it here in this question. regarding the question ... I think you should have a unique ID for each Company name. When a data entry person enters a name that is not on your list you should raise a flag...ask data entry person some questions. It does make sense for google.com to look at alternative words, but as a database mgr you should not waste time on that. You should have a well defined list of company names in your data base...make sure your current data is correct. When data entry enters a Company name then check it against your data base...if it is not in the data base then start a sequence ... name not in database... Is this a new company?? then use logic to continue the questions. If within reason show a list based on first letter of the entry...data person should see why it is not new and correct the entry. The data entry form is very important here ... it must be designed for easy, logical entry and it must raise a flag when entry is not wholly consistant with your database. sincerely, digiT |
|||
| By: collegeBoy | Date: 02/03/2003 03:49:00 | Type : Comment |
|
| Hi digiT, thanks a lot for the comment. Actually both the answers are equally correct though the first one would suit my requirement but that should not carry away the credit from VGR. |
|||
|
Do register to be able to answer |
|||
©2010 These pages are served without commercial sponsorship. (No popup ads, etc...). Bandwidth abuse increases hosting cost forcing sponsorship or shutdown. This server aggressively defends against automated copying for any reason including offline viewing, duplication, etc... Please respect this requirement and DO NOT RIP THIS SITE.
Please DO link to this page!








