Tuesday, October 4, 2011

Spell Corrector for Apropos

One of the new features I wrote for apropos is a basic yet reasonably effective spell corrector. While working on apropos, one big nuisance I noticed was wrongly spelled keywords in the query. Once you support full text search, I guess users expect spell correction to come along with it.

The search in apropos is based on the Boolean search model, which means it returns only those documents which contain all the keywords mentioned in the query. Misspell even one keyword and you will either get irrelevant results or no results at all. This behaviour contrasts with the way the conventional apropos did its search: it would return all the results which matched even one of the keywords.

The user might think "this new apropos is useless, it can't get me any right results" and would most likely start experimenting with different keywords, with mixed success. The point is, apropos should be clever enough to inform the user that one or more keywords in the query were probably misspelled, so the user doesn't waste time scratching his head.

Implementation Of The Spell Corrector: Writing an industrial-strength spell corrector (like those of Google, Bing, etc.) is a complex task and I have no idea about their intricacies. I was looking for a fairly basic implementation. I came across two articles which discuss the implementation of a relatively simple spell checker: one by Jon Bentley in his famous book Programming Pearls, and the second by Prof. Peter Norvig in his famous post "How to Write a Spelling Corrector". I decided to go with Peter Norvig's approach because of its simplicity and ease of implementation. Before continuing, I would like to thank Prof. Norvig for writing such an insightful article and sharing it :-)

I highly recommend reading Prof. Norvig's article to properly understand the maths and logic involved. Here I will give some insight into what his Python code is doing and then present my C translation of it, along with a short demo.

The idea is to find the dictionary word at the least edit distance from the word being checked. Edit distance here means the number of single-character edits (additions, removals, transpositions or replacements) needed to turn the given word into a correctly spelled one. Peter Norvig mentions in his post that edit distance 1 is sufficient in 80-95% of cases.

The strategy for finding words at edit distance 1 is very simple. Four different kinds of mistakes can lead to a misspelled word at edit distance 1. These are:
  1. Deletion: You missed a character while typing the word. For example: "speling".
  2. Transposition: You exchanged the positions of two adjacent characters in the word. For example: "teh" instead of "the".
  3. Replacement: You replaced a letter in the word with some other letter (possibly you pressed the wrong key on the keyboard). For example: "dapple" instead of "apple", or "produkt" instead of "product".
  4. Insertion: You entered one additional letter in the spelling of the word. For example: "filles" when you meant "files".
I will take a simple example and show all its possible permutations at edit distance 1. Let's say we misspelled "the" as "teh"; following are the different possible permutations:

deletes =  ['eh', 'th', 'te']

transpose =  ['eth', 'the']

#the replaces and inserts list is compacted but you get the idea
replaces =  ['aeh', 'beh', 'ceh', 'deh', 'eeh', 'feh', ..., 'zeh', 
                 'tah', 'tbh', 'tch', 'tdh', 'teh', 'tfh', ..., 'tzh',
                  'tea', 'teb', 'tec', 'ted', 'tee', 'tef', ..., 'tez']

inserts =  ['ateh', 'bteh', 'cteh', 'dteh', 'eteh', 'fteh', ..., 'zteh', 
                'taeh', 'tbeh', 'tceh', 'tdeh', 'teeh', 'tfeh', ..., 'tzeh', 
                'teah', 'tebh', 'tech', 'tedh', 'teeh', 'tefh', ..., 'tezh', 
                'teha', 'tehb', 'tehc', 'tehd', 'tehe', 'tehf', ..., 'tehz']


Once we have generated all these possible permutations of the word at edit distance 1, we check which of them are real, valid words in our dictionary. It is quite possible that more than one of these permutations is a valid dictionary word, in which case we pick the word which occurs most frequently in the sample corpus used for building the dictionary (this corpus is the training model for the spell corrector).

I suppose that explains what we need to do. Now time for some code:
NOTE: The following is a C implementation of Peter Norvig's spell corrector. It is written by me from scratch and is part of the apropos_replacement project, licensed under the two-clause BSD license.

/*-
 * Copyright (c) 2011 Abhinav Upadhyay <er.abhinav.upadhyay@gmail.com>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in
 * the documentation and/or other materials provided with the
 * distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
 * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

/*
 * A set holds one split of the word: the first i characters in a,
 * the remaining characters in b.
 */
typedef struct set {
	char *a;
	char *b;
} set;

static char **
edits1(char *word)
{
	int i;
	int len_a;
	int len_b;
	int counter = 0;
	char alphabet;
	int n = strlen(word);
	set splits[n + 1];

	/* Calculate the number of possible permutations and allocate memory */
	size_t size = n + n - 1 + 26 * n + 26 * (n + 1);
	char **candidates = emalloc(size * sizeof(char *));

	/* Start by generating a split up of the characters in the word */
	for (i = 0; i < n + 1; i++) {
		splits[i].a = emalloc(i + 1);
		splits[i].b = emalloc(n - i + 1);
		memcpy(splits[i].a, word, i);
		memcpy(splits[i].b, word + i, n - i + 1);
		splits[i].a[i] = 0;
	}

	/*
	 * Now generate all the permutations at a maximum edit distance of 1.
	 * counter keeps track of the index in candidates at which the next
	 * permutation needs to be stored.
	 */
	for (i = 0; i < n + 1; i++) {
		len_a = strlen(splits[i].a);
		len_b = strlen(splits[i].b);
		assert(len_a + len_b == n);

		/* Deletes */
		if (i < n) {
			candidates[counter] = emalloc(n);
			memcpy(candidates[counter], splits[i].a, len_a);
			if (len_b - 1 > 0)
				memcpy(candidates[counter] + len_a,
				    splits[i].b + 1, len_b - 1);
			candidates[counter][n - 1] = 0;
			counter++;
		}

		/* Transposes (i < n - 1 guarantees len_b >= 2) */
		if (i < n - 1) {
			candidates[counter] = emalloc(n + 1);
			memcpy(candidates[counter], splits[i].a, len_a);
			candidates[counter][len_a] = splits[i].b[1];
			candidates[counter][len_a + 1] = splits[i].b[0];
			if (len_b >= 2)
				memcpy(candidates[counter] + len_a + 2,
				    splits[i].b + 2, len_b - 2);
			candidates[counter][n] = 0;
			counter++;
		}

		/* For replaces and inserts, run a loop from 'a' to 'z' */
		for (alphabet = 'a'; alphabet <= 'z'; alphabet++) {
			/* Replaces */
			if (i < n) {
				candidates[counter] = emalloc(n + 1);
				memcpy(candidates[counter], splits[i].a, len_a);
				candidates[counter][len_a] = alphabet;
				if (len_b - 1 >= 1)
					memcpy(candidates[counter] + len_a + 1,
					    splits[i].b + 1, len_b - 1);
				candidates[counter][n] = 0;
				counter++;
			}

			/* Inserts */
			candidates[counter] = emalloc(n + 2);
			memcpy(candidates[counter], splits[i].a, len_a);
			candidates[counter][len_a] = alphabet;
			if (len_b >= 1)
				memcpy(candidates[counter] + len_a + 1,
				    splits[i].b, len_b);
			candidates[counter][n + 1] = 0;
			counter++;
		}
	}

	/* The splits are no longer needed */
	for (i = 0; i < n + 1; i++) {
		free(splits[i].a);
		free(splits[i].b);
	}
	return candidates;
}

/*
 * known_word--
 * Takes an array of strings and returns the one with the maximum frequency
 * in the dictionary. If no word from the array is found in the dictionary,
 * it returns NULL.
 * NOTE: The dictionary in our case is a table in the db with two columns:
 *       term, occurrences
 */
static char *
known_word(sqlite3 *db, char **list, int n)
{
	int i, rc;
	char *sqlstr;
	char *termlist = NULL;
	char *correct = NULL;
	sqlite3_stmt *stmt;

	/*
	 * Build termlist: a comma separated list of all the words in list,
	 * for use in the SQL query later.
	 */
	int total_len = BUFLEN * 20;	/* total bytes allocated to termlist */
	int offset = 0;			/* next byte to write at in termlist */
	termlist = emalloc(total_len);
	termlist[offset++] = '(';

	for (i = 0; i < n; i++) {
		int d = strlen(list[i]);
		/* Grow termlist if the next word might not fit */
		if (total_len - offset < d + 3) {
			total_len *= 2;
			termlist = erealloc(termlist, total_len);
		}
		termlist[offset++] = '\'';
		memcpy(termlist + offset, list[i], d);
		offset += d;

		if (i == n - 1) {
			termlist[offset++] = '\'';
		} else {
			memcpy(termlist + offset, "\',", 2);
			offset += 2;
		}
	}
	if (total_len - offset >= 2)
		memcpy(termlist + offset, ")", 2);	/* includes the NUL */
	else
		concat(&termlist, ")", 1);

	easprintf(&sqlstr, "SELECT term FROM metadb.dict WHERE "
	    "occurrences = (SELECT MAX(occurrences) FROM metadb.dict "
	    "WHERE term IN %s) AND term IN %s", termlist, termlist);
	rc = sqlite3_prepare_v2(db, sqlstr, -1, &stmt, NULL);
	if (rc != SQLITE_OK) {
		warnx("%s", sqlite3_errmsg(db));
		free(sqlstr);
		free(termlist);
		return NULL;
	}

	if (sqlite3_step(stmt) == SQLITE_ROW)
		correct = strdup((char *) sqlite3_column_text(stmt, 0));

	sqlite3_finalize(stmt);
	free(sqlstr);
	free(termlist);
	return correct;
}

static void
free_list(char **list, int n)
{
	int i;

	if (list == NULL)
		return;

	for (i = 0; i < n; i++)
		free(list[i]);
	free(list);
}

/*
 * spell--
 * The API exposed to the user. Returns the most closely matched word from the
 * dictionary. It first searches for all possible words at edit distance 1; if
 * no matches are found, it goes further and looks for words at edit distance 2
 * as well. If no matches are found at all, it returns NULL.
 */
char *
spell(sqlite3 *db, char *word)
{
	int i;
	char *correct;
	char **candidates;
	int count2 = 0;
	char **cand2 = NULL;
	int n;
	int count;

	lower(word);

	/* If the word already exists in the dictionary, no need to go further */
	correct = known_word(db, &word, 1);

	if (!correct) {
		n = strlen(word);
		count = n + n - 1 + 26 * n + 26 * (n + 1);
		candidates = edits1(word);
		correct = known_word(db, candidates, count);
		/*
		 * No matches found? Go further and look for matches at edit
		 * distance 2. To keep the search fast we use a heuristic:
		 * take one word at a time from candidates, generate its
		 * permutations and look for a match, exiting the loop as soon
		 * as one is found. This works reasonably fast, but the
		 * accuracy is not quite there in some cases.
		 */
		if (correct == NULL) {
			for (i = 0; i < count; i++) {
				n = strlen(candidates[i]);
				count2 = n + n - 1 + 26 * n + 26 * (n + 1);
				cand2 = edits1(candidates[i]);
				if ((correct = known_word(db, cand2, count2)))
					break;
				free_list(cand2, count2);
				cand2 = NULL;
			}
		}
		free_list(candidates, count);
		free_list(cand2, count2);
	}

	return correct;
}

Demo:
Following are some sample runs of apropos:

$ ./apropos "funckiton for coping stings"
Did you mean "function for copying strings" ?
$ ./apropos "generat termcap databse"
Did you mean "generate termcap database" ?
$ ./apropos idcmp
Did you mean "icmp" ?
$ ./apropos "confguire kernal"
Did you mean "configure kernel" ?
$ ./apropos "packate fillter"
Did you mean "package filter" ?
$ ./apropos reeltek
Did you mean "realtek" ?

Following are some screenshots of apropos_cgi (a CGI version of apropos for browsers):

Further Scope: There are a few technical glitches in integrating this spell corrector with apropos, so those need to be sorted out. The suggestions are not always as expected, so the model for the spell corrector probably needs to be fine-tuned (like Peter Norvig discussed at the end of his article). And while writing this post, it occurred to me that this implementation could make a fine small-scale backend for an auto completion feature in a web application (for example, the apropos CGI above). ;-)

All this code is in the demo-spell and exp-spell branches of the project on GitHub.

I am not sure if anyone would read this far, but thanks anyway for reading and taking interest. :-)

Monday, October 3, 2011

Improvements to makemandb

Over a month has passed since GSoC finished, and in that time I made some improvements and introduced new (experimental) features in apropos. I wanted to write about a few of the things I did in the last month.

Indexing Additional Metadata For Faster Update Operations: Previously makemandb maintained md5 hashes of all the pages indexed. On each run, makemandb would read all the man pages, generate their md5s and compare those with the md5 hashes it already had in its index. It would then parse and store the pages whose md5 hash it did not find in the database, these being the new or modified pages in need of (re)indexing.
    Joerg pointed out that this wasn't a very efficient approach: it required unnecessarily reading all the man pages. He suggested indexing more metadata about the man page files, like the mtime, the device id and the inode number. So rather than reading the pages and generating their md5s, makemandb does a stat(2) on them, reads their {device id, inode, mtime} triplet and checks whether a matching triplet exists in the database to decide whether a page needs to be (re)indexed. This is a more efficient approach when updating the index after installing some new man pages or updating a few existing ones, though when building the index from scratch, doing a stat(2) on every page just proves to be a roadblock.

Faster makemandb With Clever Memory Management: Due to the above mentioned changes, makemandb's runtime had more than doubled. Earlier makemandb could build an index for 8000+ pages in 30-40 seconds, but now it was taking 130-150 seconds to do the same job. The changes which made makemandb slow were necessary and could not be undone, so I had to identify other areas where it could do better.
    As it turns out, makemandb was managing memory very poorly. One operation it performs very frequently is concatenating two strings: one containing previously parsed data from the man page, the other containing newly parsed data. This kind of string manipulation is always tedious in C. The most straightforward way is to call realloc(3) to allocate sufficient space to hold the contents of the new string, and then copy the new string to the end of the old one. I had a function concat() which did just that. An average-length man page could easily trigger over 100 calls to concat(), and for 8000+ pages this added up to a very large number of malloc/realloc calls; worse, as the string containing already parsed data grows, the realloc calls get even more expensive. Clearly this was the bottleneck which needed to be fixed.

Solution: The solution was very simple. Instead of allocating memory every time a new string needs to be concatenated, pre-allocate a large chunk of memory and keep writing to it until you run short of space, in which case reallocate another large chunk and proceed as usual. This reduces the number of allocation calls for a single page from 100+ to around 10.

/*
 * A data structure for holding section specific data.
 */
typedef struct secbuff {
	char *data;
	int buflen;	/* total length of the buffer allocated initially */
	int offset;	/* position of the next byte to write at */
} secbuff;

static void
append(secbuff *sbuff, const char *src, int srclen)
{
	assert(src != NULL);
	if (srclen == -1)
		srclen = strlen(src);

	if (sbuff->data == NULL) {
		sbuff->data = emalloc(sbuff->buflen);
		sbuff->offset = 0;
	}

	/* Double the buffer if we are about to run out of space */
	while ((srclen + 2) >= (sbuff->buflen - sbuff->offset)) {
		sbuff->buflen *= 2;
		sbuff->data = erealloc(sbuff->data, sbuff->buflen);
	}

	/* Separate the new data from the existing contents with a space */
	if (sbuff->offset) {
		memcpy(sbuff->data + sbuff->offset, " ", 1);
		sbuff->offset++;
	}

	/* Now copy src at the end of the buffer */
	memcpy(sbuff->data + sbuff->offset, src, srclen);
	sbuff->offset += srclen;
}
The secbuff data structure keeps track of the offset in the data buffer at which the next byte needs to be written. This way, I could allocate a sufficiently large chunk of memory up front and simply memcpy new data at its end. This approach brought large speed improvements to makemandb: the runtime dropped from 130+ seconds to around 45 seconds.