A Comprehensive Spell Checker Revisited 4 (Update 12 Aug 2012)

Submitted on: 8/14/2012 9:26:27 AM
By: Rde 
Level: Intermediate
User Rating: By 19 Users
Compatibility: VB 4.0 (32-bit), VB 5.0, VB 6.0
Views: 18655
     This is an improved version of the spell checker from Shelz's COTM "A Comprehensive Spell Checker" at txtCodeId=65992. It includes a modified Russell Soundex phonetic algorithm and the Levenshtein Distance algorithm. It is now a *very* fast and effective spell checker solution...

It impressed me with its most effective and concise spell checking algorithms, a perfect demo project, and the smart way it provided a complete database of words in a 1.3 MB download...

But like many other spell checkers it had a common limitation...

The basic aim of the Soundex algorithm is for names with the same pronunciation to be encoded the same so that matching can occur despite minor differences in spelling. The Soundex for a word consists of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants. Therefore, only words beginning with the same first letter are compared for similar pronunciation using the standard algorithm...
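As a rough illustration of the standard encoding described above, here is a minimal sketch in Python (not the VB code from the download). It follows the commonly documented American Soundex rules; the function name `soundex` and the table `CODES` are just names chosen for this example.

```python
# Letter groups for the standard Soundex digits 1..6.
GROUPS = ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")
CODES = {ch: str(d) for d, group in enumerate(GROUPS, start=1) for ch in group}


def soundex(word):
    """Return the standard 4-character Soundex code (letter + 3 digits)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code blocks a repeat
    for ch in letters[1:]:
        code = CODES.get(ch, "")  # vowels, y, h and w have no code
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    # Pad with zeros, then cut to one letter plus three digits.
    return (first.upper() + "".join(digits) + "000")[:4]
```

With this sketch, soundex("apolstry") gives A142 and soundex("upholstery") gives U142: the digits agree, but the differing first letter keeps the two words from ever matching.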

This version of the Russell Soundex algorithm has been modified to allow the matching of words that start with differing first letters so as not to assume that the first letter is always known. In this version the encoding always begins with the first letter of the word...
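The exact modification is in the download and is not spelled out here, so the following Python sketch is only one plausible reading, stated as an assumption: the first letter is run through the same digit table as the rest of the word instead of being kept as a literal letter. The name `phonetic_key` and the 4-digit length are hypothetical choices for this example.

```python
# Letter groups for the standard Soundex digits 1..6.
GROUPS = ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")
CODES = {ch: str(d) for d, group in enumerate(GROUPS, start=1) for ch in group}


def phonetic_key(word, length=4):
    """Digit-only Soundex-style key: the first letter is encoded too.

    Assumption for illustration only; not necessarily the author's rule.
    """
    digits = []
    prev = ""
    for ch in word.lower():
        if not "a" <= ch <= "z":
            continue  # skip anything that is not a plain letter
        code = CODES.get(ch, "")  # vowels, y, h and w have no code
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    # Pad with zeros, then cut to the requested number of digits.
    return ("".join(digits) + "0" * length)[:length]
```

Under this assumption phonetic_key("apolstry") and phonetic_key("upholstery") both come out as 1423, so the two words can be matched even though they start with different letters.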

The Levenshtein Distance algo marries perfectly with the results to identify the correct spelling for the given (mis-spelt) word every time! A search on the word "apolstry" with the minimum successful Levenshtein Distance returns just four words where one of these is "upholstery"...
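For readers who have not met it, the Levenshtein Distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one word into another. A minimal Python sketch of the textbook two-row dynamic programming version follows (the submission's VB implementation may differ in detail):

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]
```

Candidate words that share a phonetic code can then be ranked with something like sorted(candidates, key=lambda w: levenshtein(word, w)), keeping only those at the smallest distance.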

[Version 2] - Removed the dependence on the DAO library; the words database is now contained in a single text file. The Soundex encoding has been extended to include a 'reverse soundex' of all words (encoding from the end of the words backwards as well). On my 866 MHz PC the Access database took minutes to create the words database and was quite slow processing the lookup query, particularly after adding the reverse soundex to the query. Creating the database now takes 25 seconds and the lookups are now fast enough to run in real time (updated with every text entry change event)...
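To make the 'reverse soundex' idea concrete, here is a small Python sketch of an in-memory index keyed by both a forward and a reversed code. Everything in it is an assumption for illustration: the digit-only `phonetic_key` encoding, the names `build_index` and `candidates`, and the choice to accept a word when either key matches (the submission may well combine the two keys differently).

```python
from collections import defaultdict

# Letter groups for the standard Soundex digits 1..6.
GROUPS = ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")
CODES = {ch: str(d) for d, group in enumerate(GROUPS, start=1) for ch in group}


def phonetic_key(word, length=4):
    """Digit-only Soundex-style key (hypothetical variant)."""
    digits, prev = [], ""
    for ch in word.lower():
        if not "a" <= ch <= "z":
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":
            prev = code
    return ("".join(digits) + "0" * length)[:length]


def build_index(words):
    """Map ('fwd', key) and ('rev', key) to the set of words with that key."""
    index = defaultdict(set)
    for w in words:
        index[("fwd", phonetic_key(w))].add(w)
        index[("rev", phonetic_key(w[::-1]))].add(w)  # encode back to front
    return index


def candidates(index, word):
    """Words sharing the forward key or the reversed key with `word`."""
    fwd = index.get(("fwd", phonetic_key(word)), set())
    rev = index.get(("rev", phonetic_key(word[::-1])), set())
    return fwd | rev
```

Building the index once and looking up by key is what makes per-keystroke lookups cheap; the candidate set is usually small enough to rank by Levenshtein Distance afterwards.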

[Update 18 Feb 09] - Improved speed of data loading at form load from 1.2 to 0.9 seconds on 866MHz PC. On my Athlon 4000+ the database builds in under 10 seconds and the data loads at form load in well under 0.5 seconds. On your average PC this would be fast enough to unload the form every use and re-load it when needed without the user experiencing any delay...

[Update 21 Feb 09] - Added code to normalise words file to expected format (removes empty lines and converts comma delimited files)...
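The kind of normalisation described here can be sketched in a few lines of Python. This is an assumption about the target format (one word per line, no blank lines), and `normalise_words` is a hypothetical name; the VB code in the download may handle more cases.

```python
def normalise_words(text):
    """Return `text` rewritten as one word per line with no empty lines.

    Comma-delimited input is split into separate words.
    """
    words = []
    for line in text.splitlines():
        for field in line.split(","):
            word = field.strip()  # drop surrounding spaces
            if word:              # skip empty lines and empty fields
                words.append(word)
    return "\n".join(words) + ("\n" if words else "")
```

For example, normalise_words("apple,banana\n\n cherry \n") returns "apple\nbanana\ncherry\n".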

[Version 3] - Considerably improved the speed of both the Soundex and Levenshtein Distance algos by eliminating Mid$(s,i,1), which creates a temp string for every character compared, and instead using CopyMemory to grab each character's Unicode value and comparing integers. I think the MidI code used was authored by Bruce McKinney, but if it is yours let me know and I will give you credit...

[Version 4] - Injected some asm machine code into the database creation sub, which now builds the database in a blink! Big thanks to Robert Rayment for your generous help with assembler.

Happy coding, Rd :)


 

 



 


 

Other User Comments

8/18/2006 11:47:11 AM Light Templer

This improvement sounds very good! ;-))
My ***** for and regards!
LiTe

 
8/18/2006 8:58:30 PM Rde

Thanks heaps LiTe :)

Just a simple change to the original algo by Shelz - but definitely a good improvement on an excellent solution.

Happy coding,
Rd :)

 
8/20/2006 11:45:01 AM Shelz

This is amazing dude!

That was a very obvious limitation, and only you caught it. Great. 5* from me.

 
8/21/2006 6:40:18 AM Rde

Hi Shelz

That was exactly what I said to myself when I checked out your submission!

The combination of these two algos produces the perfect spell checker!

I have tested ridiculously mis-spelt words and it still returns the correct word at least by the second smallest LD - AMAZING!

Happy coding,
Rd :)

 
2/26/2007 1:48:16 PM dr raj bonde

thanks i will try to incorporate in my vb
to search pt database bonderaj @ hotmail.com

 
2/22/2008 4:41:57 AM Ran

If I wanted to use a language other than English, could I use Soundex and build another word list? Or is Soundex only available for English? Please reply...

 
2/15/2009 8:15:22 PM Rde

Hi Ran

Sorry for the late response, must have missed your comment

The short answer is 'no' :(

The long answer is 'maybe'. Some other languages do have common letters with similar pronunciation to English letters and so may work but I think you would need to reproduce the soundex rules to apply them to the pronunciation of the letters in your intended language.

I do not have the phonetic knowledge to produce even this English algo, so I could not do so for another language.

Happy coding,
Rd :)

 
3/6/2009 11:23:10 PM Luciano

Hello coder friends,

When I saw that Rde was the author of the code I gave 5 stars without even checking the code ;-)

Regards
Luciano

 
12/28/2009 2:34:50 AM Rde

Thanks Luciano, much appreciated mate.

Please note: the included words list is rather dodgy. It contains many invalid words. I would try to validate all the words if it was not so time consuming.

You can replace the file simply by overwriting the Words.dat file with your own list named Words.dat and the program will update the db with your list.

Happy coding,
Rd :)

 
6/2/2010 4:32:13 AM andrew

Hi,
Could you adapt the program to work in realtime and to automatically hyphenate the words, as and when appropriate? Also, what about adding a thesaurus to it?


 
6/6/2010 4:20:47 AM Rde

Hi Andrew

Yes, it already works in real time! Just pop up the spell checker with a word highlighted in the main window and it will spell check that word - or type the word into the spell checker and it will update with each keystroke.

Hyphenating the words simply requires that you add the hyphenated words to the words database file. You can also find alternative words lists that already include all hyphenated variants of words. The words database is just a text file with each word occupying a single line in the text, so replacing the database is as simple as placing your alternate words list in the program path and renaming that file to WORDS.DAT.

cont'd

 
6/6/2010 4:21:26 AM Rde

...

Note that I have optimized the loading process by sorting the words list so that it does not require this processing on every load of the database - an unsorted list will work correctly but found matches will not be in any particular order which may not appear very professional.

As for a thesaurus - well that is a great idea but is really a whole new and very different project with some type of links for all words to related (and opposite) words. This could be done using the current (or alternate) words list by adding the relationship data - perhaps contained in another file?

Anyway, feel free to correspond on this or any other subject.

Happy coding,
Rd :)

 
6/24/2010 11:18:20 AM Austin Reed Collins

Fantastic! Valuable code here. ; )

 
10/4/2010 6:50:59 AM Rde

Thanks heaps Austin

=====

Correction on my comment above:

The code does (refresh)sort the data at form load, so ignore my previous comment stating otherwise ;)

Happy coding,
Rd :)

 
8/15/2012 8:50:20 AM Rde

Just a quick note about the included words list - I have replaced the original 450,000 word rubbish with a smaller but ideal common-use word list with maybe no rubbish words.

Happy coding,
Rd :)

 
8/15/2012 1:34:18 PM Dave Carter

WOW :D

 
8/15/2012 3:09:28 PM Je

You've probably come across this already but in case you or anyone hasn't

http://wordlist.sourceforge.net/

 
8/17/2012 8:06:06 AM Rde

Thanks for the info Je

Happy coding,
Rd :)

 
8/17/2012 8:24:54 AM Rde

Hi Dave

Yes, I still think 'wow' about how well these two algos work together as a spell checker.

Happy coding,
Rd :)

 
