
Tuesday, October 4, 2011

Spell Corrector for Apropos

One of the new features I added to apropos is a basic yet reasonably effective spell corrector. While working on apropos, one big nuisance I noticed was wrongly spelled keywords in the query. When supporting full text search, I guess it is the usual expectation to have support for spell correction as well.

The search of apropos is based on the Boolean search model, which means that it returns only those documents which contain all the keywords mentioned in the query. This means that if you misspell even one keyword, you will either get irrelevant search results or no results at all. This behaviour is in contrast to the way the conventional apropos did its search: it would return all the results which matched even one of the keywords.

The user might think "this new apropos is useless, it can't get me any right results." Then he would most likely start experimenting by changing keywords, and he might or might not succeed. The point is, apropos should be clever enough to inform the user that he probably misspelled one or more keywords in the query, so that he doesn't waste time scratching his head.

Implementation Of The Spell Corrector: Writing an industrial-strength spell corrector (like that of Google, Bing, etc.) is a complex task and I have no idea about their intricacies. I was looking for a fairly basic implementation. I came across two articles which discussed the implementation of a relatively simple spell checker: one by Jon Bentley in his famous book Programming Pearls, and the second by Peter Norvig in his famous post "How to Write a Spelling Corrector". I decided to go with Peter Norvig's implementation because of its simplicity and ease of implementation. Before continuing, I would like to thank Prof. Norvig for writing such an insightful article and sharing it :-)

I highly recommend reading Prof. Norvig's article to properly understand the maths and logic involved. I am going to give some insight into what his Python code is doing and then present the C translation of the code, with a demo.

The idea is to find the word at the least edit distance from the word being checked. Edit distance here means the number of single-character operations (additions, removals, replacements or transpositions) needed to turn the given word into a correctly spelled word. Peter Norvig mentions in his post that edit distance 1 is sufficient for 80-95% of the cases.

The strategy for finding words at edit distance 1 is very simple. Four different kinds of mistakes can lead to a misspelled word at edit distance 1. These are:
  1. Deletion: You missed a character while typing the word. For example: "speling".
  2. Transposition: You exchanged the positions of two adjacent characters in the word. For example: "teh" instead of "the".
  3. Replacement: You replaced a letter in the word with some other letter (possibly you pressed the wrong key on the keyboard). For example: "dapple" instead of "apple" or "produkt" instead of "product".
  4. Insertion: You entered one additional letter in the spelling of the word. For example: "filles" when you meant "files".
I will take a simple example and show all its possible permutations at edit distance 1. Let's say we misspelled "the" as "teh"; following are the different possible permutations:

deletes =  ['eh', 'th', 'te']

transpose =  ['eth', 'the']

#the replaces and inserts list is compacted but you get the idea
replaces =  ['aeh', 'beh', 'ceh', 'deh', 'eeh', 'feh', ..., 'zeh', 
                 'tah', 'tbh', 'tch', 'tdh', 'teh', 'tfh', ..., 'tzh',
                  'tea', 'teb', 'tec', 'ted', 'tee', 'tef', ..., 'tez']

inserts =  ['ateh', 'bteh', 'cteh', 'dteh', 'eteh', 'fteh', ..., 'zteh', 
                'taeh', 'tbeh', 'tceh', 'tdeh', 'teeh', 'tfeh', ..., 'tzeh', 
                'teah', 'tebh', 'tech', 'tedh', 'teeh', 'tefh', ..., 'tezh', 
                'teha', 'tehb', 'tehc', 'tehd', 'tehe', 'tehf', ..., 'tehz']


Once we have generated all these possible permutations of the word at edit distance 1, we check which of them are real words present in our dictionary. It is always possible that more than one of these permutations is a valid word in the dictionary, in which case we pick the word which occurs most frequently in the sample corpus used for building the dictionary (this is the training model used for this spell corrector).

I suppose that explains what we need to do. Now, time for some code:
NOTE: The following is a C implementation of Peter Norvig's spell corrector. It is written by me from scratch and is part of the apropos_replacement project, licensed under the two-clause BSD license.

/*-
 * Copyright (c) 2011 Abhinav Upadhyay <er.abhinav.upadhyay@gmail.com>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in
 * the documentation and/or other materials provided with the
 * distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
 * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

/*
 * `set' (defined elsewhere in apropos_replacement) holds one split of the
 * word: `a' is the first i characters and `b' is the rest.
 */
static char **
edits1 (char *word)
{
	int i;
	int len_a;
	int len_b;
	int counter = 0;
	char alphabet;
	int n = strlen(word);
	set splits[n + 1];

	/* calculate the number of possible permutations and allocate memory */
	size_t size = n + n - 1 + 26 * n + 26 * (n + 1);
	char **candidates = emalloc(size * sizeof(char *));

	/* Start by generating a split up of the characters in the word */
	for (i = 0; i < n + 1; i++) {
		splits[i].a = (char *) emalloc(i + 1);
		splits[i].b = (char *) emalloc(n - i + 1);
		memcpy(splits[i].a, word, i);
		memcpy(splits[i].b, word + i, n - i + 1);
		splits[i].a[i] = 0;
	}

	/* Now generate all the permutations at maximum edit distance of 1.
	 * counter keeps track of the current index position in the array
	 * candidates where the next permutation needs to be stored.
	 */
	for (i = 0; i < n + 1; i++) {
		len_a = strlen(splits[i].a);
		len_b = strlen(splits[i].b);
		assert(len_a + len_b == n);

		/* Deletes */
		if (i < n) {
			candidates[counter] = emalloc(n);
			memcpy(candidates[counter], splits[i].a, len_a);
			if (len_b - 1 > 0)
				memcpy(candidates[counter] + len_a,
				    splits[i].b + 1, len_b - 1);
			candidates[counter][n - 1] = 0;
			counter++;
		}

		/* Transposes */
		if (i < n - 1) {
			candidates[counter] = emalloc(n + 1);
			memcpy(candidates[counter], splits[i].a, len_a);
			if (len_b >= 1)
				memcpy(candidates[counter] + len_a, splits[i].b + 1, 1);
			if (len_b >= 1)
				memcpy(candidates[counter] + len_a + 1, splits[i].b, 1);
			if (len_b >= 2)
				memcpy(candidates[counter] + len_a + 2,
				    splits[i].b + 2, len_b - 2);
			candidates[counter][n] = 0;
			counter++;
		}

		/* For replaces and inserts, run a loop from 'a' to 'z' */
		for (alphabet = 'a'; alphabet <= 'z'; alphabet++) {
			/* Replaces */
			if (i < n) {
				candidates[counter] = emalloc(n + 1);
				memcpy(candidates[counter], splits[i].a, len_a);
				memcpy(candidates[counter] + len_a, &alphabet, 1);
				if (len_b - 1 >= 1)
					memcpy(candidates[counter] + len_a + 1,
					    splits[i].b + 1, len_b - 1);
				candidates[counter][n] = 0;
				counter++;
			}

			/* Inserts */
			candidates[counter] = emalloc(n + 2);
			memcpy(candidates[counter], splits[i].a, len_a);
			memcpy(candidates[counter] + len_a, &alphabet, 1);
			if (len_b >= 1)
				memcpy(candidates[counter] + len_a + 1, splits[i].b, len_b);
			candidates[counter][n + 1] = 0;
			counter++;
		}
	}

	/* The splits are only needed while generating candidates; free them */
	for (i = 0; i < n + 1; i++) {
		free(splits[i].a);
		free(splits[i].b);
	}
	return candidates;
}

/*
 * known_word--
 * Pass an array of strings to this function and it will return the word with
 * the maximum frequency in the dictionary. If no word in the list is found
 * in the dictionary, it returns NULL.
 * NOTE: The dictionary in our case is a table in the db with two fields:
 *       term, occurrences
 */
static char *
known_word(sqlite3 *db, char **list, int n)
{
	int i, rc;
	char *sqlstr;
	char *termlist = NULL;
	char *correct = NULL;
	sqlite3_stmt *stmt;

	/* Build termlist: a comma separated list of all the words in the
	 * list, for use in the SQL query later.
	 */
	int total_len = BUFLEN * 20;	/* total bytes allocated to termlist */
	termlist = emalloc(total_len);
	int offset = 0;	/* next byte to write at in termlist */
	termlist[0] = '(';
	offset++;

	for (i = 0; i < n; i++) {
		int d = strlen(list[i]);
		if (total_len - offset < d + 3) {
			total_len *= 2;
			termlist = erealloc(termlist, total_len);
		}
		memcpy(termlist + offset, "\'", 1);
		offset++;
		memcpy(termlist + offset, list[i], d);
		offset += d;

		if (i == n - 1) {
			memcpy(termlist + offset, "\'", 1);
			offset++;
		}
		else {
			memcpy(termlist + offset, "\',", 2);
			offset += 2;
		}
	}
	if (total_len - offset > 3)
		memcpy(termlist + offset, ")", 2);
	else
		concat(&termlist, ")", 1);

	easprintf(&sqlstr, "SELECT term FROM metadb.dict WHERE "
	    "occurrences = (SELECT MAX(occurrences) FROM metadb.dict "
	    "WHERE term IN %s) AND term IN %s", termlist, termlist);
	rc = sqlite3_prepare_v2(db, sqlstr, -1, &stmt, NULL);
	if (rc != SQLITE_OK) {
		warnx("%s", sqlite3_errmsg(db));
		free(sqlstr);
		free(termlist);
		return NULL;
	}

	if (sqlite3_step(stmt) == SQLITE_ROW)
		correct = strdup((char *) sqlite3_column_text(stmt, 0));

	sqlite3_finalize(stmt);
	free(sqlstr);
	free(termlist);
	return (correct);
}

static void
free_list(char **list, int n)
{
	int i = 0;
	if (list == NULL)
		return;

	while (i < n) {
		free(list[i]);
		i++;
	}
	free(list);	/* free the array of pointers itself */
}

/*
 * spell--
 * The API exposed to the user. Returns the most closely matched word from the
 * dictionary. It will first search for all possible words at edit distance 1;
 * if no matches are found, it goes further and tries to look for words at
 * edit distance 2 as well. If no matches are found at all, it returns NULL.
 */
char *
spell(sqlite3 *db, char *word)
{
	int i;
	char *correct;
	char **candidates;
	int count2 = 0;
	char **cand2 = NULL;
	int n;
	int count;

	lower(word);

	/* If this word already exists in the dictionary, no need to go further */
	correct = known_word(db, &word, 1);

	if (!correct) {
		n = strlen(word);
		count = n + n - 1 + 26 * n + 26 * (n + 1);
		candidates = edits1(word);
		correct = known_word(db, candidates, count);
		/* No matches found? Let's go further and find matches at edit
		 * distance 2. To make the search fast we use a heuristic: take
		 * one word at a time from candidates, generate its permutations
		 * and see if a match is found. If a match is found, exit the
		 * loop. Works reasonably fast, but accuracy is not quite there
		 * in some cases.
		 */
		if (correct == NULL) {
			for (i = 0; i < count; i++) {
				n = strlen(candidates[i]);
				count2 = n + n - 1 + 26 * n + 26 * (n + 1);
				cand2 = edits1(candidates[i]);
				if ((correct = known_word(db, cand2, count2)))
					break;
				else {
					free_list(cand2, count2);
					cand2 = NULL;
				}
			}
		}
		free_list(candidates, count);
		free_list(cand2, count2);
	}

	return correct;
}

Demo:
Following are some sample runs of apropos:

$ ./apropos "funckiton for coping stings"
Did you mean "function for copying strings" ?
$ ./apropos "generat termcap databse"
Did you mean "generate termcap database" ?
$ ./apropos idcmp
Did you mean "icmp" ?
$ ./apropos "confguire kernal"
Did you mean "configure kernel" ?
$ ./apropos "packate fillter"
Did you mean "package filter" ?
$ ./apropos reeltek
Did you mean "realtek" ?

Following are some screenshots of apropos_cgi (a CGI version of apropos for browsers):

[screenshots of apropos_cgi]
Further Scope: There are a few technical glitches in integrating this spell corrector with apropos, so those need to be sorted out. The suggestions are not always as expected, so the model for the spell corrector probably needs to be fine-tuned (like Peter Norvig discussed at the end of his article). And while writing this post, it occurred to me that this implementation could make a fine small-scale backend for an autocompletion feature in a web application (for example the apropos CGI above). ;-)

All this code is in the demo-spell and exp-spell branches of the project on GitHub.

I am not sure if anyone would read this far, but thanks anyway for reading and taking interest. :-)

Monday, October 3, 2011

Improvements to makemandb

Over a month has passed since GSoC finished, and I have made some improvements and introduced new (experimental) features in apropos. I wanted to write about a few of the things I did in the last one month.

Indexing Additional Metadata For Faster Update Operations: Previously makemandb maintained md5 hashes of all the pages indexed. On each run, makemandb would read all the man pages, generate their md5 hashes and compare those with the hashes it already had in its index. It would then parse and store the pages whose md5 hash it did not find in the database, meaning these were the new or modified pages which needed (re)indexing.
    Joerg pointed out that this wasn't a very efficient approach, as it required unnecessarily reading all the man pages. He suggested indexing more metadata about the man page files: the mtime, the device id and the inode number. So rather than reading the pages and generating their md5 hashes, makemandb now does a stat(2) on them, reads their {device id, inode, mtime} triplet, and checks whether a matching triplet already exists in the database to decide whether the page needs to be indexed. This is a more efficient approach when you are updating the index after installing some new man pages or updating a few existing ones, though when you are building the index from scratch, doing a stat(2) for all the pages just proves to be a roadblock.
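For illustration, here is a minimal sketch of such a freshness check. The table name mandb_meta and its columns are assumptions for this example, not necessarily the real makemandb schema:

static int
is_page_fresh(sqlite3 *db, const char *file)
{
	struct stat sb;
	sqlite3_stmt *stmt;
	int fresh = 0;

	if (stat(file, &sb) == -1)
		return 0;	/* can't stat it; reindex to be safe */

	if (sqlite3_prepare_v2(db,
	    "SELECT 1 FROM mandb_meta WHERE device = ? AND inode = ? "
	    "AND mtime = ?", -1, &stmt, NULL) != SQLITE_OK)
		return 0;

	sqlite3_bind_int64(stmt, 1, (sqlite3_int64) sb.st_dev);
	sqlite3_bind_int64(stmt, 2, (sqlite3_int64) sb.st_ino);
	sqlite3_bind_int64(stmt, 3, (sqlite3_int64) sb.st_mtime);
	if (sqlite3_step(stmt) == SQLITE_ROW)
		fresh = 1;	/* matching triplet found; skip this page */
	sqlite3_finalize(stmt);
	return fresh;
}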



Faster makemandb With Clever Memory Management: Due to the above-mentioned changes, makemandb's runtime had more than doubled. Earlier makemandb could build an index for 8000+ pages in 30-40 seconds, but now it was taking 130-150 seconds to do the same job. The changes which made makemandb slow were necessary and could not be undone, so I had to identify other areas where it could do better.
    As it turns out, makemandb was managing memory very poorly. One operation it performs very frequently is concatenating two strings: one containing previously parsed data from the man page, and one containing newly parsed data. This kind of string manipulation is always tedious in C. The most straightforward way is to call realloc(3) to allocate sufficient space to hold the contents of the new string and then copy the new string to the end of the old one; I had a function concat() which did just that (sketched below). In an average-length man page there could easily be 100+ calls to concat(), and for 8000+ pages this added up to a very large number of malloc/realloc calls; as the length of the string containing already-parsed data grows, the realloc calls get even more expensive. Clearly this was the bottleneck which needed to be fixed.
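For reference, the naive pattern looks roughly like this (a sketch of the idea, not the exact concat() from makemandb):

/*
 * Naive concatenation: every call pays for a realloc(3), and the cost
 * grows with the length of the accumulated string.
 */
static void
concat_naive(char **dst, const char *src)
{
	size_t dlen = *dst ? strlen(*dst) : 0;
	size_t slen = strlen(src);

	*dst = erealloc(*dst, dlen + slen + 2);	/* separator + NUL */
	if (dlen)
		(*dst)[dlen++] = ' ';
	memcpy(*dst + dlen, src, slen + 1);
}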

Solution: The solution was very simple. Instead of allocating memory every time a new string needs to be concatenated, pre-allocate a large chunk of memory and keep writing to it until you fall short of space, in which case reallocate another large chunk and proceed as usual. This reduces the calls to malloc from 100+ to around 10 for a single page.



/*
 * A data structure for holding section specific data.
 */
typedef struct secbuff {
	char *data;
	int buflen;	/* total length of the buffer allocated initially */
	int offset;	/* position of the next byte to write at */
} secbuff;

static void
append(secbuff *sbuff, const char *src, int srclen)
{
	short flag = 0;
	assert(src != NULL);
	if (srclen == -1)
		srclen = strlen(src);

	if (sbuff->data == NULL) {
		sbuff->data = (char *) emalloc(sbuff->buflen);
		sbuff->offset = 0;
	}

	/* Double the buffer until there is room for src, a separating
	 * space and a terminating byte.
	 */
	while ((srclen + 2) >= (sbuff->buflen - sbuff->offset)) {
		sbuff->buflen *= 2;
		sbuff->data = (char *) erealloc(sbuff->data, sbuff->buflen);
		flag++;
	}

	/* Append a space at the end of the buffer */
	if (sbuff->offset || flag) {
		memcpy(sbuff->data + sbuff->offset, " ", 1);
		sbuff->offset++;
	}

	/* Now, copy src at the end of the buffer */
	memcpy(sbuff->data + sbuff->offset, src, srclen);
	sbuff->offset += srclen;
	return;
}
The secbuff data structure keeps track of the offset in the data buffer at which the next byte needs to be written. This way, I could allocate a sufficiently large chunk of memory up front and simply use memcpy to write new data at its end. This approach brings large speed improvements to makemandb: the runtime has come down from 130+ seconds to somewhere around ~45 seconds.
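For illustration, a hypothetical use of secbuff would look something like this (the initial buffer size is arbitrary):

/* Hypothetical usage sketch; append() allocates on the first call. */
secbuff desc;
desc.data = NULL;
desc.buflen = 10 * BUFLEN;	/* pre-allocate a large chunk */
desc.offset = 0;

append(&desc, "The", -1);
append(&desc, "quick brown fox", -1);

/* desc.data now holds "The quick brown fox"; NUL terminate before use */
desc.data[desc.offset] = '\0';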

Wednesday, June 22, 2011

NetBSD GSoC Weekly Report 3

This week I got some more work done: I wrote a barebones implementation of apropos(1) and fixed some nasty (and some not so nasty) issues in makemandb.

Issues Fixed:

  • Handling .Nm macros: As I said in the last post, .Nm macros are a special case. They are supplied an argument only once, at the beginning of the man page, and everywhere else .Nm occurs, the parser replaces it with the previously specified argument value. I just had to add an extra check to see if we have encountered a .Nm macro and substitute its value. Here is the commit which fixed it: bbdab19
  • ENOMEM in traversedir(): This was a nasty memory leak in makemandb which took away my sleep for a couple of nights. I somehow managed to track down the offending piece of code, with Joerg's help of course :-) Here was the problem: code similar to the snippet quoted below was running in a loop:
char *desc = NULL;
/* inside the loop: */
if (desc == NULL)
    desc = strdup(n->string);
else
    asprintf(&desc, "%s %s", desc, n->string); /* leaks the old desc */

So the above asprintf call was leaking the old desc at each step of the loop. This was causing makemandb to consume up to 2.6 GB of memory (3 GB being my total physical memory). After fixing this bug, makemandb consumes around 5 to 6 MB of memory :-)
This is the commit which fixed it: cd53b9b
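The pattern of the fix is the usual one for asprintf(3): build the new string into a temporary pointer, free the old buffer, and only then reassign. A sketch (the actual change is in the commit above):

char *tmp;
char *desc = NULL;

/* inside the loop: */
if (desc == NULL)
    desc = strdup(n->string);
else {
    asprintf(&tmp, "%s %s", desc, n->string);
    free(desc);    /* release the old buffer instead of leaking it */
    desc = tmp;
}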

  • Avoid Hardlinks: After running a few queries against the database, I noticed that some man pages were indexed multiple times; for example, csh had 26 duplicate entries. Joerg told me that this is due to hardlinks: a lot of man page files are nothing but hardlinks to other man pages. To handle this I added a function check_md5 to makemandb. Before we start to parse a new file, we first calculate its md5 hash and check in the database whether it is already indexed (I added a new column for storing the hash as well). Here is the commit: 14b024f
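In outline, the check looks something like the sketch below. The table and column names are illustrative stand-ins, not necessarily the real schema:

/* Return 1 if a page with this md5 hash is already in the index. */
static int
already_indexed(sqlite3 *db, const char *md5sum)
{
	sqlite3_stmt *stmt;
	int found = 0;

	if (sqlite3_prepare_v2(db,
	    "SELECT 1 FROM mandb_md5 WHERE md5_hash = ?",
	    -1, &stmt, NULL) != SQLITE_OK)
		return 0;
	sqlite3_bind_text(stmt, 1, md5sum, -1, SQLITE_STATIC);
	if (sqlite3_step(stmt) == SQLITE_ROW)
		found = 1;	/* a hardlinked copy was already parsed */
	sqlite3_finalize(stmt);
	return found;
}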

Implementation of apropos.c: Besides fixing some issues, I was also able to write a barebones version of apropos(1). The initial version was pretty basic: it would take the user query as a command line argument and simply run it against the database using the FTS engine of Sqlite. The results were not very good; as Sqlite's FTS documentation itself says, it performs a Boolean search, so it is up to us to do the mathematics for finding the more relevant documents and ranking them higher in the results. The master branch on GitHub still has this basic version of apropos.

I have started a new experimental branch, search, on GitHub, where I will experiment with search related code, and after some reviews and feedback, I will cherry-pick the commits which look good.

So currently the search branch has the following two features:

Stopword Filter: I noticed that Sqlite does not filter the user query for stopwords, and tries to match the stopwords as well while performing the search, so I implemented a stopword filter.
It works something like this: we store all the stopwords in a hash table, then scan the user query word by word in a loop; at each iteration we look up the hash table to determine whether the word is a stopword, and if it is, we omit it from the query (a sketch follows). Here is the commit: ec25546
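In outline it looks something like this. lookup_stopword() is a hypothetical helper over whatever hash table implementation is used; the real code is in the commit above:

/* Return a copy of query with all stopwords removed. */
static char *
filter_stopwords(const char *query)
{
	char *buf = strdup(query);
	char *result = emalloc(strlen(query) + 1);
	char *word, *last;

	result[0] = '\0';
	for (word = strtok_r(buf, " ", &last); word != NULL;
	    word = strtok_r(NULL, " ", &last)) {
		if (lookup_stopword(word))	/* present in the hash table? */
			continue;		/* drop the stopword */
		if (result[0] != '\0')
			strcat(result, " ");
		strcat(result, word);
	}
	free(buf);
	return result;
}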

A Ranking Function: As I said above, the plain Sqlite search wasn't much of a help, so we need a ranking function which tells Sqlite which search results are important and should show up higher in the output. Sqlite's FTS documentation provides a sample ranking function which is very simple but effective. I didn't try to fully understand it (I just wanted to see the effect of a ranking function on search results), but it seems to be based on finding the term frequency of the search phrases for each column in the database and multiplying it by a static weight assigned to each column; this procedure is repeated for each term in the query to find the weight of each column. The overall rank of the page is obtained by summing up the weights of the individual columns thus calculated.

Commit for this: 001a679fe9a4b4c04a8d
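For the curious, here is a sketch in the spirit of that sample ranking function, based on my reading of the FTS3 documentation's example (the real code is in the commit above). It would be registered with sqlite3_create_function() and invoked from the query as rank(matchinfo(tbl), w0, w1, ...):

/*
 * For every (phrase, column) pair, weigh the in-row hit count against
 * the global hit count and sum the results. The matchinfo blob starts
 * with the phrase count and column count, followed by three integers
 * per (phrase, column) pair.
 */
static void
rankfunc(sqlite3_context *ctx, int nval, sqlite3_value **apval)
{
	const unsigned int *matchinfo;
	unsigned int nphrase, ncol, i, j;
	double score = 0.0;

	matchinfo = (const unsigned int *) sqlite3_value_blob(apval[0]);
	nphrase = matchinfo[0];
	ncol = matchinfo[1];
	if ((unsigned int) nval != ncol + 1) {
		sqlite3_result_error(ctx, "wrong number of arguments to rank()", -1);
		return;
	}

	for (i = 0; i < nphrase; i++) {
		const unsigned int *hits = &matchinfo[2 + i * ncol * 3];
		for (j = 0; j < ncol; j++) {
			unsigned int here = hits[3 * j];	/* hits in this row */
			unsigned int global = hits[3 * j + 1];	/* hits in all rows */
			double weight = sqlite3_value_double(apval[j + 1]);
			if (global > 0)
				score += ((double) here / global) * weight;
		}
	}
	sqlite3_result_double(ctx, score);
}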

Some Sample Runs: I ran some sample queries to check how this ranking function performs. The results are much improved compared to having no ranking at all, but there is still much scope for improvement. Following is a sample run. If you would like to see a few others, I pasted the output of some queries on pastebin: http://pastebin.com/qhQBRNd5
$ ./apropos "copy string"            
memccpy
The memccpy function copies bytes from string src to string dst . If the character c...

strndup
...copies at most len characters from the string str always NUL terminating the copied 
string...

bcopy
...copies len bytes from string src to string dst . Unlike bcopy 3 the two strings...

strlcat
size-bounded string copying and concatenation

bcopy
...bcopy function copies len bytes from string src to string dst . The two strings may...

memcpy
The memcpy function copies len bytes from string src to string dst . The arguments must...

memmove
...memmove function copies len bytes from string src to string dst . The two strings may...

memmove
...memmove function copies len bytes from string src to string dst . The two strings may...

memcpy
The memcpy function copies len bytes from string src to string dst . The arguments must...

strncpy
...copy the string src to dst (including the terminating \e0 character). The strncpy function 
copies...

You might notice that a few results are repeated here. I believe this is a bug in apropos(1): some man pages have a number of different versions depending on the machine architecture, and I think the duplication in the results is because of that. I need to fix it :-)


How to Test: If you are interested in checking out the functionality of the project, you are welcome to; I would appreciate it even more if you report back any issues you notice or have some feedback on how the search results can be improved.

#Clone the repository:
$ git clone git://github.com/abhinav-upadhyay/apropos_replacement.git

#Run make
$ make

#Run makemandb
$ ./makemandb

#Run apropos
$ ./apropos "list directories"

By default you will be on the master branch, which currently does not have the stopword filter and ranking function features, so you might want to check out the search branch:

$ git checkout -b search origin/search

and run make again to build it.

Prerequisites:
  1. You will need the -current version of man(1) from CVS. Joerg committed my patch adding the -p option to man(1), which is used by makemandb.
  2. You will also want to have the -current version of the man pages in /usr/share/man (at least).
  3. libmandoc. I am using the version of libmandoc available with -current (which at the moment is 1.11.1). You can build it by running make && make install in /usr/src/external/bsd/mdocml

I believe a lot of work and research is now required to make the search better. Any feedback and suggestions will be highly welcome :-)