Tuesday, October 4, 2011

Spell Corrector for Apropos

One of the new features I wrote in apropos is a basic yet reasonably effective spell corrector.  While working on apropos, one big nuisance that I noticed was wrongly spelled keywords in the query. When supporting full text searches, I guess it is the usual expectation to have support for spell correction as well.

The search of apropos is based on the Boolean search model, which means that it will return only those documents which contain all the keywords mentioned in the query. This means that you misspell even one keyword and you will either get all non-relevant search results or no results at all. This behaviour is contradictory to the way conventional apropos did it's search, it would return all the results which matched even one of the keywords.

The user might think that "this new apropos is useless, can't get me any right results." Then he would most likely start experimenting by changing keywords and he might or might not succeed. The point is, apropos should be clever enough to inform the user that probably he misspelled one or more keywords in the query, so that the user doesn't waste time scratching his head.

Implementation Of The Spell Corrector: Writing an industry strength spell corrector (like that of Google, Bing, etc.) is a complex task and I have no idea about their intricacies. I was looking for a fairly basic implementation. I came across two articles which discussed implementation of a relatively simple spell checker. One article was by Jon Bentley in his famous book Programming Pearls and the second was from Prof. Peter Norvig in his famous post "How to write a spell corrector". I decided to go with Peter Norvig's implementation because of it's simplicity and ease of implementation. Before continuing, I would like to thank Prof. Norvig for writing such an insightful article and sharing it :-)

I highly recommend reading Prof. Norvig's article to understand the maths and logic involved properly, I am going to give some insight on what his Python code is doing and then produce the C translation of the code, with some demo.

The idea is to find the word at the least edit distance from the word being checked for. Edit distance here means, the number of characters from the given word you need to add, remove or change position to get the correctly spelled word. Peter Norvig mentioned in his post that for 80-95% cases edit distance 1 is sufficient.

The strategy for finding words at edit distance 1 is very simple. Four different kind of mistakes are possible that can lead to a misspelled word at edit distance 1. These are:
  1. Deletion: You missed a character while typing the word. For example: "speling".
  2. Transposition: You exchanged the positions of two adjacent characters in the word. For example: "teh" instead of "the"
  3. Replace: You replaced an alphabet in the word with some other alphabet (possibly you pressed the wrong key on the keyboard). For example: "dapple" instead of "apple" or "produkt" instead of "product"
  4. Insertions: You probably entered one additional alphabet in the spelling of the word. For example: "filles" when you mean "files".
I will take a simple example and show all it's possible permutations at edit distance 1. Let's say we misspelled "the" as "teh", then following are the different possible permutations:

deletes =  ['eh', 'th', 'te']

transpose =  ['eth', 'the']

#the replaces and inserts list is compacted but you get the idea
replaces =  ['aeh', 'beh', 'ceh', 'deh', 'eeh', 'feh', ..., 'zeh', 
                 'tah', 'tbh', 'tch', 'tdh', 'teh', 'tfh', ..., 'tzh'
                  'tea', 'teb', 'tec', 'ted', 'tee', 'tef', ..., 'tez']

inserts =  ['ateh', 'bteh', 'cteh', 'dteh', 'eteh', 'fteh', ..., 'zteh', 
                'taeh', 'tbeh', 'tceh', 'tdeh', 'teeh', 'tfeh', ..., 'tzeh', 
                'teah', 'tebh', 'tech', 'tedh', 'teeh', 'tefh', ..., 'tezh', 
                'teha', 'tehb', 'tehc', 'tehd', 'tehe', 'tehf', ..., 'tehz']


Once we have generated all these possible permutations of the word at edit distance 1, we check in our dictionary which of these are real and valid words. It is always possible that more than one of these permutations is a valid word in the dictionary, in which case we pick up the word which occurs most frequently in our sample corpus used for building the dictionary (this is the training model used for this spell corrector).

I suppose that explains what we need to do. Now time for some code:
NOTE: The following is a C implementation of Peter Norvig's spell corrector. It is written by me from scratch and is part of the apropos_replacement project, licensed under the two clause BSD license.

/*-
 * Copyright (c) 2011 Abhinav Upadhyay <er.abhinav.upadhyay@gmail.com>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in
 * the documentation and/or other materials provided with the
 * distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
 * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

static char **
edits1 (char *word)
{
 int i;
 int len_a;
 int len_b;
 int counter = 0;
 char alphabet;
 int n = strlen(word);
 set splits[n + 1];

 /* calculate number of possible permutations and allocate memory */
 size_t size = n + n -1 + 26 * n + 26 * (n + 1);
 char **candidates = emalloc (size * sizeof(char *));

 /* Start by generating a split up of the characters in the word */
 for (i = 0; i < n + 1; i++) {
  splits[i].a = (char *) emalloc(i + 1);
  splits[i].b = (char *) emalloc(n - i + 1);
  memcpy(splits[i].a, word, i);
  memcpy(splits[i].b, word + i, n - i + 1);
  splits[i].a[i] = 0;
 }

 /* Now generate all the permutations at maximum edit distance of 1.
 * counter keeps track of the current index position in the array candidates
 * where the next permutation needs to be stored.
 */
 for (i = 0; i < n + 1; i++) {
  len_a = strlen(splits[i].a);
  len_b = strlen(splits[i].b);
  assert(len_a + len_b == n);

  /* Deletes */
  if (i < n) {
   candidates[counter] = emalloc(n);
   memcpy(candidates[counter], splits[i].a, len_a);
   if (len_b -1 > 0)
    memcpy(candidates[counter] + len_a , 
                                              splits[i].b + 1, len_b - 1);
   candidates[counter][n - 1] = 0;
   counter++;
  }

  /* Transposes */
  if (i < n - 1) {
   candidates[counter] = emalloc(n + 1);
   memcpy(candidates[counter], splits[i].a, len_a);
   if (len_b >= 1)
    memcpy(candidates[counter] + len_a, splits[i].b + 1, 1);
   if (len_b >= 1)
    memcpy(candidates[counter] + len_a + 1, splits[i].b, 1);
   if (len_b >= 2)
    memcpy(candidates[counter] + len_a + 2, 
                                               splits[i].b + 2, len_b - 2);
   candidates[counter][n] = 0;
   counter++;
  }

  /* For replaces and inserts, run a loop from 'a' to 'z' */
  for (alphabet = 'a'; alphabet <= 'z'; alphabet++) {
   /* Replaces */
   if (i < n) {
    candidates[counter] = emalloc(n + 1);
   memcpy(candidates[counter], splits[i].a, len_a);
   memcpy(candidates[counter] + len_a, &alphabet, 1);
   if (len_b - 1 >= 1)
    memcpy(candidates[counter] + len_a + 1, 
                                               splits[i].b + 1, len_b - 1);
   candidates[counter][n] = 0;
   counter++;
          }

          /* Inserts */
   candidates[counter] = emalloc(n + 2);
   memcpy(candidates[counter], splits[i].a, len_a);
   memcpy(candidates[counter] + len_a, &alphabet, 1);
   if (len_b >=1)
    memcpy(candidates[counter] + len_a + 1, splits[i].b, len_b);
   candidates[counter][n + 1] = 0;
   counter++;
  }
 }
 return candidates;
}

/*
* known_word--
* Pass an array of strings to this function and it will return the word with
* maximum frequency in the dictionary. If no word in the array list is found
* in the dictionary, it returns NULL
* NOTE: The dictionary in our case is a table in the db with two fields:
*       term, occurrences
*/
static char *
known_word(sqlite3 *db, char **list, int n)
{
 int i, rc;
 char *sqlstr;
 char *termlist = NULL;
 char *correct = NULL;
 sqlite3_stmt *stmt;

 /* Build termlist: a comma separated list of all the words in the list for
 * use in the SQL query later.
 */
 int total_len = BUFLEN * 20; /* total bytes allocated to termlist */
 termlist = emalloc (total_len);
 int offset = 0; /* Next byte to write at in termlist */
 termlist[0] = '(';
 offset++;

 for (i = 0; i < n; i++) {
  int d = strlen(list[i]);
  if (total_len - offset < d + 3) {
   termlist = erealloc(termlist, offset + total_len);
   total_len *= 2;
  }
  memcpy(termlist + offset, "\'", 1);
  offset++;
  memcpy(termlist + offset, list[i], d);
  offset += d;

  if (i == n -1) {
   memcpy(termlist + offset, "\'", 1);
   offset++;
  }
  else {
   memcpy(termlist + offset, "\',", 2);
   offset += 2;
  }

 }
 if (total_len - offset > 3)
  memcpy(termlist + offset, ")", 2);
 else
  concat(&termlist, ")", 1);

 easprintf(&sqlstr, "SELECT term FROM metadb.dict WHERE "
 "occurrences = (SELECT MAX(occurrences) from metadb.dict "
 "WHERE term IN %s) AND term IN %s", termlist, termlist);
 rc = sqlite3_prepare_v2(db, sqlstr, -1, &stmt, NULL);
 if (rc != SQLITE_OK) {
  warnx("%s", sqlite3_errmsg(db));
  return NULL;
 }

 if (sqlite3_step(stmt) == SQLITE_ROW)
  correct = strdup((char *) sqlite3_column_text(stmt, 0));

 sqlite3_finalize(stmt);
 free(sqlstr);
 free(termlist);
 return (correct);
}

static void
free_list(char **list, int n)
{
 int i = 0;
 if (list == NULL)
  return;

 while (i < n) {
  free(list[i]);
  i++;
 }
}

/*
* spell--
* The API exposed to the user. Returns the most closely matched word from the
* dictionary. It will first search for all possible words at distance 1, if no
* matches are found, it goes further and tries to look for words at edit
* distance 2 as well. If no matches are found at all, it returns NULL.
*/
char *
spell(sqlite3 *db, char *word)
{
 int i;
 char *correct;
 char **candidates;
 int count2;
 char **cand2 = NULL;
 char *errmsg;
 const char *sqlstr;
 int n;
 int count;
 
 lower(word);
 
 /* If this word already exists in the dictionary then no need to go further */
 correct = known_word(db, &word, 1);

 if (!correct) {
  n = strlen(word);
  count = n + n -1 + 26 * n + 26 * (n + 1);
  candidates = edits1(word);
  correct = known_word(db, candidates, count);
  /* No matches found ? Let's go further and find matches at edit distance 2.
  * To make the search fast we use a heuristic. Take one word at a time from
  * candidates, generate it's permutations and look if a match is found.
  * If a match is found, exit the loop. Works reasonable fast but accuracy
  * is not quite there in some cases.
  */
  if (correct == NULL) {
   for (i = 0; i < count; i++) {
    n = strlen(candidates[i]);
    count2 = n + n - 1 + 26 * n + 26 * (n + 1);
    cand2 = edits1(candidates[i]);
    if ((correct = known_word(db, cand2, count2)))
     break;
    else {
     free_list(cand2, count2);
     cand2 = NULL;
    }
   }
  }
  free_list(candidates, count);
  free_list(cand2, count2);
 }

 return correct;
}

Demo:
Following are some sample runs of apropos:

$ ./apropos "funckiton for coping stings"
Did you mean "function for copying strings" ?
$ ./apropos "generat termcap databse"
Did you mean "generate termcap database" ?
$ ./apropos idcmp
Did you mean "icmp" ?
$ ./apropos "confguire kernal"
Did you mean "configure kernel" ?
$ ./apropos "packate fillter"
Did you mean "package filter" ?
$ ./apropos reeltek
Did you mean "realtek" ?

Following are some screenshots of apropos_cgi (a CGI version of apropos for browsers):
















Further Scope: There are a few technical glitches in integrating this spell corrector with apropos so those need to be sorted. The suggestions are not always as expected, so probably the model for the spell corrector needs to be fine tuned (like Peter Norvig discussed at the end of his article). And while writing this post, it occurred to me that this implementation could make a fine small scale backend for auto completion feature in a web application (for example the apropos cgi above). ;-)

All this code is in the demo-spell , exp-spell branch of the project on github.

I am not sure if anyone would read this far, but thanks anyway for reading and taking interest. :-)

2 comments:

  1. Awesome article .... keep up the good work :)

    Happy new year ...

    - Pariwesh Gupta

    ReplyDelete