
Tuesday, October 4, 2011

Spell Corrector for Apropos

One of the new features I added to apropos is a basic yet reasonably effective spell corrector. While working on apropos, one big nuisance I noticed was misspelled keywords in the query. When you support full text search, spell correction is usually expected as well.

The search in apropos is based on the Boolean search model, which means that it will return only those documents which contain all the keywords mentioned in the query. So if you misspell even one keyword, you will either get completely irrelevant search results or no results at all. This behaviour is contrary to the way the conventional apropos did its search: it would return all the results which matched even one of the keywords.

The user might think, "this new apropos is useless, it can't get me any right results", and would then most likely start experimenting with different keywords, and might or might not succeed. The point is, apropos should be clever enough to inform the user that he probably misspelled one or more keywords in the query, so that he doesn't waste time scratching his head.

Implementation Of The Spell Corrector: Writing an industrial-strength spell corrector (like those of Google, Bing, etc.) is a complex task and I have no idea about their intricacies; I was looking for a fairly basic implementation. I came across two articles which discussed the implementation of a relatively simple spell checker: one by Jon Bentley in his famous book Programming Pearls, and the second by Prof. Peter Norvig in his famous post "How to Write a Spelling Corrector". I decided to go with Peter Norvig's approach because of its simplicity and ease of implementation. Before continuing, I would like to thank Prof. Norvig for writing such an insightful article and sharing it :-)

I highly recommend reading Prof. Norvig's article to properly understand the maths and logic involved. I am going to give some insight into what his Python code is doing and then present the C translation of the code, with a short demo.

The idea is to find the word at the least edit distance from the word being checked. Edit distance here means the number of characters you need to add, remove or reposition in the given word to get the correctly spelled word. Peter Norvig mentions in his post that edit distance 1 is sufficient for 80-95% of the cases.

The strategy for finding words at edit distance 1 is very simple. Four different kinds of mistakes can lead to a misspelled word at edit distance 1. These are:
  1. Deletion: You missed a character while typing the word. For example: "speling".
  2. Transposition: You exchanged the positions of two adjacent characters in the word. For example: "teh" instead of "the".
  3. Replacement: You replaced a letter in the word with some other letter (possibly you pressed the wrong key on the keyboard). For example: "dapple" instead of "apple" or "produkt" instead of "product".
  4. Insertion: You entered one additional letter in the spelling of the word. For example: "filles" when you meant "files".
I will take a simple example and show all its possible permutations at edit distance 1. Let's say we misspelled "the" as "teh"; the following are the different possible permutations:

deletes =  ['eh', 'th', 'te']

transpose =  ['eth', 'the']

#the replaces and inserts list is compacted but you get the idea
replaces =  ['aeh', 'beh', 'ceh', 'deh', 'eeh', 'feh', ..., 'zeh',
             'tah', 'tbh', 'tch', 'tdh', 'teh', 'tfh', ..., 'tzh',
             'tea', 'teb', 'tec', 'ted', 'tee', 'tef', ..., 'tez']

inserts =  ['ateh', 'bteh', 'cteh', 'dteh', 'eteh', 'fteh', ..., 'zteh',
            'taeh', 'tbeh', 'tceh', 'tdeh', 'teeh', 'tfeh', ..., 'tzeh',
            'teah', 'tebh', 'tech', 'tedh', 'teeh', 'tefh', ..., 'tezh',
            'teha', 'tehb', 'tehc', 'tehd', 'tehe', 'tehf', ..., 'tehz']


Once we have generated all these possible permutations of the word at edit distance 1, we check which of them are real, valid words in our dictionary. It is quite possible that more than one of these permutations is a valid word in the dictionary, in which case we pick the word which occurs most frequently in the sample corpus used for building the dictionary (this is the training model used for this spell corrector).

I suppose that explains what we need to do. Now time for some code:
NOTE: The following is a C implementation of Peter Norvig's spell corrector. It is written by me from scratch and is part of the apropos_replacement project, licensed under the two clause BSD license.

/*-
 * Copyright (c) 2011 Abhinav Upadhyay <er.abhinav.upadhyay@gmail.com>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in
 * the documentation and/or other materials provided with the
 * distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
 * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

static char **
edits1 (char *word)
{
 int i;
 int len_a;
 int len_b;
 int counter = 0;
 char alphabet;
 int n = strlen(word);
 set splits[n + 1];

 /* calculate number of possible permutations and allocate memory */
 size_t size = n + n -1 + 26 * n + 26 * (n + 1);
 char **candidates = emalloc (size * sizeof(char *));

 /* Start by generating a split up of the characters in the word */
 for (i = 0; i < n + 1; i++) {
  splits[i].a = (char *) emalloc(i + 1);
  splits[i].b = (char *) emalloc(n - i + 1);
  memcpy(splits[i].a, word, i);
  memcpy(splits[i].b, word + i, n - i + 1);
  splits[i].a[i] = 0;
 }

 /* Now generate all the permutations at maximum edit distance of 1.
 * counter keeps track of the current index position in the array candidates
 * where the next permutation needs to be stored.
 */
 for (i = 0; i < n + 1; i++) {
  len_a = strlen(splits[i].a);
  len_b = strlen(splits[i].b);
  assert(len_a + len_b == n);

  /* Deletes */
  if (i < n) {
   candidates[counter] = emalloc(n);
   memcpy(candidates[counter], splits[i].a, len_a);
   if (len_b - 1 > 0)
    memcpy(candidates[counter] + len_a, splits[i].b + 1, len_b - 1);
   candidates[counter][n - 1] = 0;
   counter++;
  }

  /* Transposes */
  if (i < n - 1) {
   candidates[counter] = emalloc(n + 1);
   memcpy(candidates[counter], splits[i].a, len_a);
   if (len_b >= 1)
    memcpy(candidates[counter] + len_a, splits[i].b + 1, 1);
   if (len_b >= 1)
    memcpy(candidates[counter] + len_a + 1, splits[i].b, 1);
   if (len_b >= 2)
    memcpy(candidates[counter] + len_a + 2, splits[i].b + 2, len_b - 2);
   candidates[counter][n] = 0;
   counter++;
  }

  /* For replaces and inserts, run a loop from 'a' to 'z' */
  for (alphabet = 'a'; alphabet <= 'z'; alphabet++) {
   /* Replaces */
   if (i < n) {
    candidates[counter] = emalloc(n + 1);
    memcpy(candidates[counter], splits[i].a, len_a);
    memcpy(candidates[counter] + len_a, &alphabet, 1);
    if (len_b - 1 >= 1)
     memcpy(candidates[counter] + len_a + 1, splits[i].b + 1, len_b - 1);
    candidates[counter][n] = 0;
    counter++;
   }

   /* Inserts */
   candidates[counter] = emalloc(n + 2);
   memcpy(candidates[counter], splits[i].a, len_a);
   memcpy(candidates[counter] + len_a, &alphabet, 1);
   if (len_b >= 1)
    memcpy(candidates[counter] + len_a + 1, splits[i].b, len_b);
   candidates[counter][n + 1] = 0;
   counter++;
  }
 }
 return candidates;
}

/*
* known_word--
* Pass an array of strings to this function and it will return the word with
* maximum frequency in the dictionary. If no word in the array list is found
* in the dictionary, it returns NULL
* NOTE: The dictionary in our case is a table in the db with two fields:
*       term, occurrences
*/
static char *
known_word(sqlite3 *db, char **list, int n)
{
 int i, rc;
 char *sqlstr;
 char *termlist = NULL;
 char *correct = NULL;
 sqlite3_stmt *stmt;

 /* Build termlist: a comma separated list of all the words in the list for
 * use in the SQL query later.
 */
 int total_len = BUFLEN * 20; /* total bytes allocated to termlist */
 termlist = emalloc (total_len);
 int offset = 0; /* Next byte to write at in termlist */
 termlist[0] = '(';
 offset++;

 for (i = 0; i < n; i++) {
  int d = strlen(list[i]);
  if (total_len - offset < d + 3) {
   termlist = erealloc(termlist, 2 * total_len);
   total_len *= 2;
  }
  memcpy(termlist + offset, "\'", 1);
  offset++;
  memcpy(termlist + offset, list[i], d);
  offset += d;

  if (i == n -1) {
   memcpy(termlist + offset, "\'", 1);
   offset++;
  }
  else {
   memcpy(termlist + offset, "\',", 2);
   offset += 2;
  }

 }
 if (total_len - offset > 3)
  memcpy(termlist + offset, ")", 2);
 else
  concat(&termlist, ")", 1);

 easprintf(&sqlstr, "SELECT term FROM metadb.dict WHERE "
 "occurrences = (SELECT MAX(occurrences) from metadb.dict "
 "WHERE term IN %s) AND term IN %s", termlist, termlist);
 rc = sqlite3_prepare_v2(db, sqlstr, -1, &stmt, NULL);
 if (rc != SQLITE_OK) {
  warnx("%s", sqlite3_errmsg(db));
  return NULL;
 }

 if (sqlite3_step(stmt) == SQLITE_ROW)
  correct = strdup((char *) sqlite3_column_text(stmt, 0));

 sqlite3_finalize(stmt);
 free(sqlstr);
 free(termlist);
 return (correct);
}

static void
free_list(char **list, int n)
{
 int i = 0;
 if (list == NULL)
  return;

 while (i < n) {
  free(list[i]);
  i++;
 }
}

/*
* spell--
* The API exposed to the user. Returns the most closely matched word from the
* dictionary. It will first search for all possible words at distance 1, if no
* matches are found, it goes further and tries to look for words at edit
* distance 2 as well. If no matches are found at all, it returns NULL.
*/
char *
spell(sqlite3 *db, char *word)
{
 int i;
 char *correct;
 char **candidates;
 int count2;
 char **cand2 = NULL;
 char *errmsg;
 const char *sqlstr;
 int n;
 int count;
 
 lower(word);
 
 /* If this word already exists in the dictionary then no need to go further */
 correct = known_word(db, &word, 1);

 if (!correct) {
  n = strlen(word);
  count = n + n -1 + 26 * n + 26 * (n + 1);
  candidates = edits1(word);
  correct = known_word(db, candidates, count);
   /* No matches found? Let's go further and look for matches at edit distance 2.
   * To make the search fast we use a heuristic: take one word at a time from
   * candidates, generate its permutations and check whether a match is found.
   * If a match is found, exit the loop. Works reasonably fast, but the accuracy
   * is not quite there in some cases.
   */
  if (correct == NULL) {
   for (i = 0; i < count; i++) {
    n = strlen(candidates[i]);
    count2 = n + n - 1 + 26 * n + 26 * (n + 1);
    cand2 = edits1(candidates[i]);
    if ((correct = known_word(db, cand2, count2)))
     break;
    else {
     free_list(cand2, count2);
     cand2 = NULL;
    }
   }
  }
  free_list(candidates, count);
  free_list(cand2, count2);
 }

 return correct;
}

Demo:
Following are some sample runs of apropos:

$ ./apropos "funckiton for coping stings"
Did you mean "function for copying strings" ?
$ ./apropos "generat termcap databse"
Did you mean "generate termcap database" ?
$ ./apropos idcmp
Did you mean "icmp" ?
$ ./apropos "confguire kernal"
Did you mean "configure kernel" ?
$ ./apropos "packate fillter"
Did you mean "package filter" ?
$ ./apropos reeltek
Did you mean "realtek" ?

[Screenshots of apropos_cgi, a CGI version of apropos for browsers.]

Further Scope: There are a few technical glitches in integrating this spell corrector with apropos, so those need to be sorted out. The suggestions are not always as expected, so the model for the spell corrector probably needs to be fine-tuned (as Peter Norvig discusses at the end of his article). And while writing this post, it occurred to me that this implementation could make a fine small-scale backend for an autocomplete feature in a web application (for example the apropos CGI above). ;-)

All this code is in the demo-spell and exp-spell branches of the project on GitHub.

I am not sure if anyone would read this far, but thanks anyway for reading and taking interest. :-)

Monday, October 3, 2011

Improvements to makemandb

Over a month has passed since GSoC finished, and in the meantime I have made some improvements and introduced new (experimental) features in apropos. I wanted to write about a few of the things I did in the last month.

Indexing Additional Metadata For Faster Update Operations: Previously makemandb maintained md5 hashes of all the pages it had indexed. On each run, makemandb would read all the man pages, generate their md5 hashes and compare those with the hashes it already had in its index. Then it would parse and store the pages whose md5 hash it did not find in the database, meaning these were new or modified pages that needed (re)indexing.
    Joerg pointed out that this wasn't a very efficient approach, since it required unnecessarily reading all the man pages. He suggested indexing more metadata about the man page files: the mtime, the device id and the inode number. So rather than reading the pages and generating their md5 hashes, makemandb does a stat(2) on them, reads the {device id, inode, mtime} triplet and checks whether a matching triplet already exists in the database to decide whether the page needs to be indexed. This is a more efficient approach when you are updating the index after installing some new man pages or updating a few existing ones, although when you are building the index from scratch, doing a stat(2) for every page is just extra overhead.
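
The core of the check is small. Here is a rough sketch of the idea, assuming a table which stores the (device id, inode, mtime) triplet for every indexed page; the table and column names are illustrative and the actual makemandb schema may differ:

#include <sys/stat.h>
#include <sqlite3.h>

/* Return 1 if the page at `file' needs to be (re)indexed, 0 otherwise. */
static int
needs_indexing(sqlite3 *db, const char *file)
{
	struct stat sb;
	sqlite3_stmt *stmt;
	int uptodate;

	if (stat(file, &sb) == -1)
		return 1;	/* cannot stat it, try to (re)index it anyway */

	sqlite3_prepare_v2(db, "SELECT 1 FROM mandb_meta WHERE device = ? "
	    "AND inode = ? AND mtime = ?", -1, &stmt, NULL);
	sqlite3_bind_int64(stmt, 1, sb.st_dev);
	sqlite3_bind_int64(stmt, 2, sb.st_ino);
	sqlite3_bind_int64(stmt, 3, sb.st_mtime);
	uptodate = (sqlite3_step(stmt) == SQLITE_ROW);
	sqlite3_finalize(stmt);
	return !uptodate;
}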



Faster makemandb With Clever Memory Management: Due to the above-mentioned changes, makemandb's runtime had more than doubled. Earlier makemandb could build an index for 8000+ pages in 30-40 seconds, but now it was taking 130-150 seconds to do the same job. The changes which made makemandb slow were necessary and could not be undone, so I had to identify other areas where it could do better.
    As it turns out, makemandb was managing memory very poorly. One operation it needs to perform very frequently is concatenating two strings, one of which contains previously parsed data from the man page while the other contains newly parsed data. This kind of string manipulation is always tedious in C. The most straightforward way is to call realloc(3) to allocate sufficient space to hold the contents of the new string and then copy the new string to the end of the old one. I had a function concat() which was doing just that. An average-length man page could easily produce well over 100 calls to concat(), and for 8000+ pages this adds up to a very large number of malloc/realloc calls; and as the length of the string containing the already parsed data grows, the realloc calls get even more expensive. So clearly this was the bottleneck which needed to be fixed.

Solution: The solution was very simple. Instead of doing a memory allocation every time a new string needs to be concatenated, pre-allocate a large chunk of memory and keep writing to it until you fall short of space, in which case reallocate another large chunk and proceed as usual. This reduces the calls to malloc from 100+ to around 10 for a single page.



/*
 * A data structure for holding section specific data.
 */
    typedef struct secbuff {
        char *data;
        int buflen; //Total length of buffer allocated initially
        int offset; // Position of next byte to write at
    } secbuff;

static void
append(secbuff *sbuff, const char *src, int srclen)
{
	short flag = 0;
	assert(src != NULL);
	if (srclen == -1)
		srclen = strlen(src);

	if (sbuff->data == NULL) {
		sbuff->data = (char *) emalloc (sbuff->buflen);
		sbuff->offset = 0;
	}

	if ((srclen + 2) >= (sbuff->buflen - sbuff->offset)) {
		sbuff->data = (char *) erealloc(sbuff->data, 2 * sbuff->buflen);
		sbuff->buflen += sbuff->buflen;
		flag++;
	}

	/* Append a space at the end of the buffer */
	if (sbuff->offset || flag) {
		memcpy(sbuff->data + sbuff->offset, " ", 1);
		sbuff->offset++;
	}

	/* Now, copy src at the end of the buffer */
	memcpy(sbuff->data + sbuff->offset, src, srclen);
	sbuff->offset += srclen;
	return;
}
The secbuff data structure keeps track of the offset in the data buffer at which the next byte needs to be written. This way I could allocate a sufficiently large chunk of memory to a buffer and simply use memcpy to write new data at its end. This approach brings a large speed improvement to makemandb: the runtime has come down from 130+ seconds to around 45 seconds.

Tuesday, August 23, 2011

Final Report: NetBSD GSoC-2011 Apropos Replacement

This is the final status report for this project for this summer. I will try to summarise what the project was all about and what goals have been achieved.

Objective Of The Project: We are all well aware of the importance of quality documentation, and Unix-like operating systems ship their documentation in the form of manual pages (man pages for short). They have been there since the early days to aid system administrators as well as programmers in configuring and modifying the system. A very basic search utility was provided in the form of apropos(1). I am not aware of the history of apropos(1), but my guess is that it was kept simple because of the hardware limitations of the day. There is usually a plain text file (whatis.db) which indexes the NAME sections of the man pages, and apropos(1) simply searches this file for the keywords specified by the user on the command line.

Today we have much more powerful machines, and research in information retrieval has also progressed a lot. So the objective of this project was to build a replacement for apropos(1) and turn it into a complete search tool: index the complete content of the man pages and provide a tool to search that index.

Full text search of the man pages was the main task, but it entailed a few other small benefits for NetBSD as well. I will discuss them as I go along.

Implementation Strategy:

The implementation had largely two main stages:
  1. Parsing and Indexing Man Pages: The starting point of the project was to develop a utility which would parse all the man pages and build a full text search index. For parsing the man pages I used the libmandoc parser from mdocml, which is really an excellent and innovative project, and for the full text search I used the FTS engine of Sqlite. With these tools I developed makemandb, which traverses the set of directories containing man pages, parses the pages, extracts the relevant data and stores it in an FTS database.
  2. Search: Once the FTS index was there, the next thing to do was to develop a tool to search the database and a mechanism to rank the search results. A long part of the project involved experimenting with which ranking techniques worked well and which did not. This stage led to the development of the new version of apropos.
Deliverables: The project resulted in the following deliverables:

  • makemandb: It is a command line utility to parse the man pages installed on the system and build a full text search database as described above. 
  • An option (-p) to man(1) to print the list of directories containing the man page sources.
  • apropos: This is the second command line utility; it searches the FTS database. It sanitizes the user query, removes any stopwords from it, executes the search, and ranks the results in decreasing order of their weights, which are computed on the basis of a term-frequency and inverse-document-frequency based algorithm.
  • An API: I developed a very small API (consisting of 4 functions or so) to allow building custom interfaces. For example, in future it will be easy to build a CGI application on top of this code, or maybe a desktop GUI client.
  • Documentation: I have also developed detailed documentation in the form of manual pages. 
  • There is also a patch for man(1) to modify the way symlinks/hardlinks are handled. This patch eliminates the need to maintain symlinks/hardlinks of man pages on the file system. A lot of man page files are simply links to other pages, and this has to be specified explicitly in the makefiles of the packages via MKLINKS. The patch takes away that pain and also wipes all the symlinks/hardlinks off the file system.

Challenges on the way: This was a relatively easy project to do, but every project has its challenges.
  • Parsing man pages: I had never looked at how man pages are written before this, so I was overwhelmed. Though libmandoc made my task very easy, my understanding of the syntax kept growing as I progressed through the project. Parsing the man(7) based pages was particularly challenging, while parsing mdoc(7) pages was relatively easy.
  • Ranking Algorithm: Another challenge was to come up with a suitable ranking algorithm which helps bring the relevant results to the top. I did some study for this and experimented with a number of different algorithms, and finally settled on a mixture of different ranking schemes ;-).
  • Testing: On the way to completion I had to make a number of changes in the parsing related code. Initially I started with parsing only the NAME and DESCRIPTION sections, then expanded to parsing all the sections, and there were a lot of similar changes. Each time I made a change, I had to thoroughly review the results to make sure nothing wrong was happening because of it. I had to make sure both mdoc(7) and man(7) pages were getting parsed properly and that the database had sane data. I agree that I should have written unit tests instead; maybe that is something worth doing in the coming days.
Results: Some of the sample queries I ran:

#apropos "add a new user"


#apropos "generate password hash"



#apropos "convert signal number to string"





#apropos -3 "compute log"





#apropos "realtek"





Acknowledgements: 
I owe a big chunk of the success to my mentor Jörg Sonnenberger, who was always there to answer my questions, offer advice and review the code. I have learnt a great deal from him and I am sure I have improved as a programmer. The best thing about working with him was that he never really disclosed the solution; instead he gently guided me in its direction, so I never lost a learning opportunity :-)

David Young also offered valuable guidance during the project. He provided some clever insights and tips to improve the search and the ranking of the results. It was his idea that led me to decompose the database into more columns based on the different sections of a man page.

Thanks to Kristaps Dzonsons as well, who is responsible for the mdocml project. He reviewed the code related to parsing of the pages and pointed out bugs in it. I implemented makemandb based on his utility "mandocdb", so that was also a huge help.

Special thanks go to Thomas Klausner for reviewing the man pages I wrote and also providing patches for the errors/mistakes I had made in them.
I must also thank Julio Merino, Jan Schaumann, Jukka Ruohonen and S.P.Zeidler for the interest they showed in the project and the help they offered throughout :-)

And thanks to lots of other people in the community as well whose names I am forgetting. While I am at it, I would like to thank my friends and family too, for putting up with me while I slept during the day and worked at night.

 

What Next? 

Well, no guesses here. I enjoyed my experience and will try to grab another project and continue working in the NetBSD community. Systems programming has always attracted me, and although I don't have much practical knowledge in the field, I don't mind learning ;-)

    Sunday, July 31, 2011

    NetBSD GSoC: Project Update 5

    First of all, thanks to Joerg, David Young and all the other people involved with GSoC, as I cleared the midterm evaluations. 20 days have passed since I last posted an update on the project and I did not even realize it. I apologize if I seem to have been inactive, but in my defense I would just like to quote my commit log ;-). I have been actively pushing changes, fixing issues and adding new features all this while. As the size of the project grows, it is taking more and more time to make new changes, test them properly and fix any loose ends.

    So, a brief overview of the things I did in the last 3 weeks:

    [New Feature] Search Within Specific Sections: This feature has been on my TODO list for very long, but something or the other kept coming up which was more important to deal with. I wanted to do it this time before I posted this update. So here it is:
    I have added options to support search within one or more specified sections.
    Commit: 966c7ba
    You can do a search like this:
    $apropos -1 "copy files"
    
    #It will search in section 1 only.
    
    $apropos -18 "adding new user"
    
    
    #This will search in section 1 and 8 only.
    #I hope you get the idea :)
    

    Some sample runs: http://paste2.org/p/1554491
                      http://paste2.org/p/1554510
                      http://paste2.org/p/1554509

    Indexing Performance Improvement: Joerg suggested a clever way to bring down the time taken by makemandb to index the pages. Instead of doing a separate transaction for each page, it is better to index all the pages inside a single transaction, which decreases the I/O overhead substantially. I made the changes, and the indexing time came down from 3 minutes to around 30 seconds.
    Commit: 926746
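
    The gist of the change can be sketched in a few lines; index_page() below is a hypothetical stand-in for makemandb's per-page parse-and-insert code:

    /* Wrap the inserts for all the pages in a single transaction so that
     * Sqlite syncs to disk once, instead of once per page. */
    static void
    index_all_pages(sqlite3 *db, char **pages, int npages)
    {
        int i;

        sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, NULL);
        for (i = 0; i < npages; i++)
            index_page(db, pages[i]);    /* hypothetical per-page INSERT */
        sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
    }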

    Parse And Index man(7) Pages: Until now we were indexing only the mdoc(7) pages, and all the man(7) based pages were being ignored. Now that the project was working quite well with mdoc(7) pages, it was time to scale up. Parsing man(7) pages was a bit more difficult than parsing mdoc(7) pages. It took some 2-3 days to implement this code and the next 2-3 days to fix various bugs and to test that it was working with the 7000+ man pages I have.
    Commit: 2014855

    Too Large DB Size (Regression): Parsing both man(7) and mdoc(7) meant that I was indexing a whole lot more man pages (7613 to be exact). This scale-up in the number of indexed pages also scaled up some problems which were not really visible before. One major problem that came up was the size of the DB: it had grown to almost 99M.
    Root Cause: The root cause was that we were also storing all the unique terms in the corpus and their term-weights in a separate table, which almost doubled the space requirements.
    Solution: As a quick solution to the problem I decided to remove the code related to pre-computation of the term-weights and drop this table. This brought the DB size down to around 60M, and with a few optimizations it has come down to the range of 30-40M.
    Drawbacks: The pre-computation of the weights had its advantages. I was using it to implement some advanced ranking algorithms and had plans to improve the ranking further on the basis of this work, but I had to let it go.
    Alternatives: The extra space was only helping to get more accurate results; it was a trade-off between space and search quality. One alternative would be to let the user decide what he/she wants and choose between the two versions.
    Commit: 7928fc5


    Added Compression Option To The DB: To bring the DB size down further, I implemented code for compressing and decompressing data using zlib(3). It was also an exercise in making the zlib interface work with Sqlite.
    As a result the DB size came down to 43M.
    Commit: d878815
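
    For the curious, hooking zlib into Sqlite essentially means registering custom SQL functions. Below is a sketch of a compression function in that spirit; the function name "zip" and the error handling are illustrative, not the exact code in makemandb:

    #include <stdlib.h>
    #include <zlib.h>
    #include <sqlite3.h>

    /* A user-defined SQL function: zip(blob) returns the deflated blob. */
    static void
    zip(sqlite3_context *ctx, int nval, sqlite3_value **apval)
    {
        const void *src = sqlite3_value_blob(apval[0]);
        uLong srclen = (uLong)sqlite3_value_bytes(apval[0]);
        uLongf dstlen = compressBound(srclen);
        Bytef *dst = malloc(dstlen);

        if (dst == NULL || compress(dst, &dstlen, src, srclen) != Z_OK) {
            sqlite3_result_error(ctx, "compress failed", -1);
            free(dst);
            return;
        }
        sqlite3_result_blob(ctx, dst, (int)dstlen, free);
    }

    /* Registered with something like:
     * sqlite3_create_function(db, "zip", 1, SQLITE_ANY, NULL, zip, NULL, NULL);
     * plus a matching unzip() function for reading the data back.
     */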


    Stopword Tokenizer: Implementing a custom tokenizer to filter out stopwords had been on my TODO list for a while, but with the increased DB size it became the priority. I patched the porter tokenizer from the Sqlite source to filter out stopwords.
    The tokenizer seemed to be working fine, and it also helped in bringing the DB size down: with the stopword tokenizer the size came to around 31M.
    Due to a small bug I have disabled this tokenizer for now.
    Commit: 76b4769

    Parsing Additional Sections & Storing Them In Individual Columns: This was a required change. With such a large number of pages (7613) in the db, having all of the content in a single column meant a lot of noise, and the search results were off the mark by a great margin. David Young had also previously suggested giving more weight to prominent sections like "DIAGNOSTICS" than to others like "ERRORS", etc.
    It was a big task. I first started with decomposing the mdoc(7) pages, then the man(7) pages, and then sat down to fix apropos to take the new columns in the DB into account and to fix the ranking function.

    I would say the time taken to implement this was worth it, because it has helped make the code cleaner. If in future there is a requirement to parse another section, it will only require adding a case to a switch statement and a couple of extra lines of code.

    static void
    mdoc_parse_section(enum mdoc_sec sec, const char *string)
    {
        switch (sec) {
        case SEC_LIBRARY:
            concat(&lib, string);
            break;
        case SEC_SYNOPSIS:
            concat(&synopsis, string);
            break;
        case SEC_RETURN_VALUES:
            concat(&return_vals, string);
            break;
        case SEC_ENVIRONMENT:
            concat(&env, string);
            break;
        case SEC_FILES:
            concat(&files, string);
            break;
        case SEC_EXIT_STATUS:
            concat(&exit_status, string);
            break;
        case SEC_DIAGNOSTICS:
            concat(&diagnostics, string);
            break;
        case SEC_ERRORS:
            concat(&errors, string);
            break;
        case SEC_NAME:
            break;
        default:
            concat(&desc, string);
            break;
        }
    }
    


    It also makes it easy to fine-tune the ranking function and play with it. If you want to experiment with the search, you can easily modify the column weights and rebuild to see the effects. The column weights are kept in a double array in the rank_func function.

    double col_weights[] = {
        2.0, // NAME
        2.00, // Name-description
        0.55, // DESCRIPTION
        0.25, // LIBRARY
        0.10, //SYNOPSIS
        0.001, //RETURN VALUES
        0.20, //ENVIRONMENT
        0.01, //FILES
        0.001, //EXIT STATUS
        2.00, //DIAGNOSTICS
        0.05 //ERRORS
    };
    



    [Feature Proposal]: Show additional data with search results: Storing the different sections in separate columns has its advantages as well, one of them being the ability to fetch and show more specific content with the search results. For example, I have already done something like this: if you look at the search results now, you will also see the one line description of each result (the .Nd macro).
    Similarly it is possible to show the library, exit values and return values where applicable. But I was wondering, is this a useful feature? Any views?

    Besides this, there are a lot of other things to be done that I mentioned in my proposal, like a CGI based interface and using the database for managing the man page aliases. These are now on top of my TODO list, and if no big issues come up, I would like to pick them up next.

    Friday, July 8, 2011

    NetBSD GSoC: Midterm Project Update

    Another update on the project after a gap of 2 weeks. I did not have much to write about last week, and come this week, we have reached the stage of midterm evaluations. I will try to explain what new changes I made in the last two weeks, what the overall status of the project is at present, and what other things I am planning to implement.

    Printing the section number along with search results: Last time when I posted a few sample runs of the project, there were no section numbers along with the search results, but now we have them (thanks to Kristaps for suggesting the right way to extract the metadata).

    Improve the ranking algorithm (Implement tf-idf): tf-idf based term weighting schemes are very common in information retrieval systems. Until now I was ranking the results only on the basis of the term frequency (tf). I improved this by including another factor in the ranking, i.e., the inverse document frequency (idf).

    • Term Frequency: Is usually defined as the number of times a given term appears in a particular document.
    • Inverse Document Frequency: IDF of a term indicates in how many documents a given term appears (at least once).

    Term frequency is a local factor; it is concerned only with the number of occurrences of the search terms in one particular document at a time.
    Inverse document frequency, on the other hand, is a global factor, in the sense that it indicates the discriminating power of a term: if a term appears in only a small set of documents, that term separates this set of documents from the rest. So a ranking obtained by combining these two factors brings up more relevant documents.

    So the weight of a term t in document d is calculated by the following formula:

    weight  = tf * idf

    Where tf = Term frequency of term t in document d
    idf = log (N / Nt)

    Where N = Total number of documents in the corpus
    Nt = Number of documents in which term t occurs (at least once).
    So a term which appears in only one document will have
    IDF = log(N)
    while a term which appears in all the documents will have
    IDF = log(1) = 0.

    For example, a term like "the" will have a high term frequency in any document, but at the same time it will have a very low inverse document frequency (almost 0), which nullifies its effect on the quality of the search results.
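
    In code, the weight computation itself is a one-liner; a minimal sketch (the parameter names are illustrative, the real values come from the index):

    #include <math.h>

    /* tf: frequency of the term in the document, ndocs: size of the corpus,
     * nt: number of documents containing the term. */
    static double
    tfidf_weight(double tf, double ndocs, double nt)
    {
        return tf * log(ndocs / nt);
    }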


    Pre-compute the term-weights: While the tf-idf based term-weighting scheme improved the quality of the search, it degraded the performance. I could see apropos taking its time to print the results. The reason was that all the term-weight calculations were being done on the fly while running apropos. An obvious solution to this problem was to pre-compute the term-weights while creating the index and store them in the database. Then, while doing the search, we only need to look up the database rather than do both the lookup and the calculation!

    I implemented the code for pre-computing the term weights in makemandb, but to my surprise these changes made makemandb painfully slow. Earlier makemandb could index the man pages in under 2 minutes, but now it was taking close to 3 hours to do the pre-computation for every unique term in the corpus. In addition there were some bugs which were causing large deviations in the term-weights. I decided to get the calculations right first; that took me 3 days, as after each change in the code I had to re-run makemandb to redo the indexing and check the results. Finally I got it right, and then, after some discussions with Joerg, the performance issue was also fixed. Basically the solution was to bring most of the processing inside Sqlite. Now makemandb does the indexing and the pre-computation of the weights in under 3 minutes on my machine :-)
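
    To give an idea of what "bringing the processing inside Sqlite" means, here is a purely hypothetical sketch: the table and column names are made up, and log() is assumed to be a custom SQL function registered with sqlite3_create_function(), since Sqlite has no built-in logarithm:

    /* Compute all the term weights with one grouped query instead of
     * looping over every unique term in C. */
    sqlite3_exec(db,
        "INSERT INTO term_weights (term, weight) "
        "SELECT term, SUM(tf) * log((SELECT COUNT(*) FROM pages) * 1.0 "
        "            / COUNT(DISTINCT pageid)) "
        "FROM term_frequencies GROUP BY term",
        NULL, NULL, NULL);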

    Further Improve the Ranking Algorithm: In my free time I am doing some study on Information Retrieval. During my studies I came across a very interesting research paper by Salton and Buckley from 1988, in which they discussed different term weighting schemes and their results. According to their study, the following formula for calculating term weights is most effective:


    The weight of a given term in a particular document is then calculated as:

    weight = (tf * idf) / sqrt( sum over all terms in the document of (tf * idf)^2 )

    i.e. the tf * idf product normalized by the length of the document's weight vector. I implemented this in a slightly simpler form: I avoided the calculation of the powers in the denominator (the square root and the squares) to avoid unnecessary overhead, as these calculations are done on the fly by apropos. The results have been pretty good.

    Sample Results: http://pastebin.com/PjdNY68m



    Note: The above mentioned change is in the search branch only at the moment. I did not merge this in master so that, if you guys want to compare the differences before and after the above change, you can easily checkout the master and search branches and see for yourself :-)

    A Keyword Recognizer: I have been thinking of implementing this feature for a while. Basically the idea is to scan the query for certain keywords which would indicate that the user is probably looking for results from a particular section. For example, "functions to copy string" gives an indication that the user is looking for standard library functions from section 3.

    After some discussions with David, we came to the conclusion that a better way to implement a feature like this would probably be to do something like Google does. Google allows you to search within a specific website using a syntax like
    [book on IR site: amazon.com].
    David suggested a similar interface, where the user could specify a section using a colon. So for example:
    apropos "kernel: function to allocate memory"
    would search only within section 9.
    I started some work on this feature but it didn't work out properly, so it is on hold at the moment. I hope to resume work on it soon, but at the same time I would like to know whether this feature is worth it.
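
    For what it's worth, the parsing side of such a syntax is simple; a rough sketch (the keyword-to-section mapping here is made up for illustration):

    #include <string.h>

    /* If the query starts with a known "keyword:" prefix, strip it and
     * return the corresponding section, otherwise return NULL. */
    static const char *
    query_section(char *query, char **rest)
    {
        char *colon = strchr(query, ':');

        *rest = query;
        if (colon == NULL)
            return NULL;
        *colon = '\0';
        if (strcmp(query, "kernel") == 0) {
            *rest = colon + 1;
            while (**rest == ' ')
                (*rest)++;
            return "9";
        }
        *colon = ':';    /* unknown prefix: restore and search normally */
        return NULL;
    }

    The hard part, of course, is deciding which keywords should map to which sections.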


    Where Do We Stand At Midterm Evaluation: As I promised in my proposal, I have accomplished most of the requirements just in time for the midterm evaluation (maybe I need to write some documentation). Now is the time for some community feedback :-). I would love to hear about

    • How good or bad are the search results? If for some query you feel that the right results are not coming up, please write to me about that query and what results you expected to see at the top.
    • If you want to see any improvements or new features, tell me about them.

    What New Features Are Next? Apart from the keyword recognizer, there are a couple of other features that I have in mind, although whether I will implement them or not is a different matter, as I need to make sure it is feasible to implement them.

    A Link Analysis Algorithm For Ranking: Search engines these days do two types of ranking.

    1. Content based ranking: It is concerned with finding relevant documents by matching the content. For example the tf-idf based term-weighting scheme is one way of doing content based ranking.
    2. Popularity Based Ranking: It tries to rank the documents based on their popularity. This popularity is calculated on the basis of a link analysis algorithm. For example Google's PageRank or Jon Kleinberg's HITS algorithm. 

    I am studying the PageRank algorithm and I am tempted to implement it, but I am held back by the fact that Stanford University has a patent on the PageRank process, so I am in a dilemma about whether I should implement it or not.

    A Spell Checker: It is very common for users to make a typo while performing a search, which might lead to no results at all or, in some cases, wrong results. I am thinking of adding a spell checker which, in case no results are found, would suggest some related search terms to the user (assuming that perhaps he made a typo).

    I am held back on this because I have personally never looked at what techniques are involved in spell checkers, but I have heard that it is computationally very expensive.


    Testing out apropos:

    #Clone the repository:
    $git clone git://github.com/abhinav-upadhyay/apropos_replacement.git
    
    
    #Run make
    $make
    
    #Run makemandb
    $./makemandb
    
    #Run apropos
    $./apropos "list directories"
    

    By default you will be on the master branch. The search branch has an improved ranking algorithm, so you might want to check it out and compare the results before and after the algorithm improvement:

    $git checkout -b search origin/search

    and run make again to build it.

    Prerequisites:
    1. You will need the -Current version of man(1) from CVS. Joerg committed my patch for adding the -p option to man(1) which is being used by makemandb.
    2. You will also want to have the -current version of the man pages in /usr/share/man (at least).
    3. libmandoc. I am using the version of libmandoc available with -current (which at the moment is 1.11.1). You can build it by running make && make install in /usr/src/external/bsd/mdocml 
    Feedback is welcome :-)

      Wednesday, June 22, 2011

      NetBSD GSoc Weekly report 3

      This week I got some more work done: I did a bare-bones implementation of apropos(1), and fixed some nasty and some not so nasty issues in makemandb.

      Issues Fixed:

      • Handling .Nm macros: As I said in the last post, .Nm macros are a special case. They are given an argument only once, at the beginning of the man page, and at the other places where .Nm occurs the parser replaces it with the previously specified argument value. I just had to add an extra check to see if we have encountered a .Nm macro and substitute its value. Here is the commit which fixed it: bbdab19
      • ENOMEM in traversedir(): This was a nasty memory leak in makemandb which took away my sleep for a couple of nights. I somehow managed to track down the offending piece of code, with Joerg's help of course :-). Here was the problem: code similar to the snippet below was running in a loop:
      char *desc = NULL;
      if (desc == NULL)
          desc = strdup(n->string);
      else
          asprintf(&desc, "%s %s", desc, n->string);

      So the asprintf call above was leaking desc at each iteration of the loop. This was causing makemandb to consume up to 2.6 GB of memory (3 GB being my total physical memory). After fixing this bug, makemandb consumes around 5 to 6 MB of memory :-)
      This is the commit which fixed it: cd53b9b
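
      The fix follows the usual pattern for this kind of leak: build the new string into a temporary, free the old one, then reassign. Roughly (in the same illustrative style as the snippet above):

      char *tmp;
      if (desc == NULL)
          desc = strdup(n->string);
      else {
          asprintf(&tmp, "%s %s", desc, n->string);
          free(desc);    /* release the old string instead of leaking it */
          desc = tmp;
      }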

      • Avoid Hardlinks: After running a few queries against the database, I noticed that some of the man pages were indexed multiple times; for example csh had 26 duplicate entries. Joerg told me that this is due to hardlinks: a lot of man page files are nothing but hardlinks to other man pages. To handle this I added a function check_md5 to makemandb. Before we start to parse a new file, we first calculate its md5 hash and check in the database whether it is already indexed (I added a new column for storing the hash as well); a rough sketch of the idea follows. Here is the commit: 14b024f
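
      The sketch below assumes NetBSD's MD5File(3) and a hash column in the table; the column name "md5_hash" is illustrative and the real check_md5() may differ in the details:

      #include <md5.h>
      #include <sqlite3.h>

      /* Return 1 if a page with the same md5 hash is already indexed. */
      static int
      already_indexed(sqlite3 *db, const char *file)
      {
          char hash[MD5_DIGEST_STRING_LENGTH];
          sqlite3_stmt *stmt;
          int found;

          if (MD5File(file, hash) == NULL)
              return 0;
          sqlite3_prepare_v2(db, "SELECT 1 FROM mandb WHERE md5_hash = ?",
              -1, &stmt, NULL);
          sqlite3_bind_text(stmt, 1, hash, -1, SQLITE_STATIC);
          found = (sqlite3_step(stmt) == SQLITE_ROW);
          sqlite3_finalize(stmt);
          return found;
      }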

      Implementation of apropos.c: Besides fixing issues, I was also able to write a bare-bones version of apropos(1). The initial version was pretty basic: it would take the user query as a command line argument and simply run it against the database using the FTS engine of Sqlite. The results were not very good; as Sqlite's FTS documentation itself says, it performs a boolean search, so it is up to us to do the mathematics for finding the more relevant documents and ranking them higher in the results. The master branch on Github still has this basic version of apropos.

      I have started a new experimental branch, search, on Github, where I will experiment with search related code, and after some reviews and feedback I will cherry-pick the commits which look good.

      So currently the search branch has the following two features:

      Stopword Filter: I noticed that Sqlite does not filter the user query for stopwords, and tries to match the stopwords as well while performing the search. I have implemented a stopword filter for this.
      It works something like this: we store all the stopwords in a hash table and scan the user query word by word in a loop; at each iteration we look up the hash table to see whether the word is a stopword, and if it is, we omit it from the query. Here is the commit: ec25546
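
      For illustration, a minimal sketch of such a filter using POSIX hsearch(3); the real filter in apropos uses its own hash table and a much larger stopword list:

      #include <search.h>
      #include <stddef.h>

      static char *stopwords[] = { "a", "an", "and", "is", "of", "the", "to" };
      #define NSTOPWORDS (sizeof(stopwords) / sizeof(stopwords[0]))

      static void
      build_stopword_table(void)
      {
          ENTRY e;
          size_t i;

          hcreate(2 * NSTOPWORDS);
          for (i = 0; i < NSTOPWORDS; i++) {
              e.key = stopwords[i];
              e.data = NULL;
              hsearch(e, ENTER);
          }
      }

      /* Returns non-zero if word is a stopword and should be dropped from the query. */
      static int
      is_stopword(char *word)
      {
          ENTRY e;
          e.key = word;
          return hsearch(e, FIND) != NULL;
      }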

      A Ranking Function: As I said above, the plain Sqlite search wasn't much of a help, so we need to write a ranking function which tells Sqlite which search results are important and shows them higher in the output. Sqlite's FTS documentation provides a sample ranking function which is very simple but effective. I didn't try to fully understand it (I just wanted to see the effect of a ranking function on the search results), but to me it seems to be based on finding the term frequency of the search phrases in each column of the database and multiplying it by a static weight assigned to that column; this procedure is repeated for each term in the query to find the weight of each column. The overall rank of the page is obtained by summing up the weights of the individual columns calculated this way.

      Commit for this: 001a679fe9a4b4c04a8d
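
      Conceptually the computation boils down to something like the sketch below; the real ranking function pulls the per-column hit counts out of Sqlite's matchinfo() blob, this only shows the arithmetic:

      /* hits[i] is how often the query terms occur in column i. */
      static double
      rank_page(const int *hits, const double *col_weights, int ncols)
      {
          double rank = 0.0;
          int i;

          for (i = 0; i < ncols; i++)
              rank += hits[i] * col_weights[i];
          return rank;
      }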

      Some Sample Runs: I ran some sample queries to check how this ranking function performs. The results are much improved compared to having no ranking at all, but there is still a lot of scope for improvement. Following is a sample run; if you would like to see a few others, I pasted the output of some queries on pastebin: http://pastebin.com/qhQBRNd5
      $ ./apropos "copy string"            
      memccpy
      The memccpy function copies bytes from string src to string dst . If the character c...
      
      strndup
      ...copies at most len characters from the string str always NUL terminating the copied 
      string...
      
      bcopy
      ...copies len bytes from string src to string dst . Unlike bcopy 3 the two strings...
      
      strlcat
      size-bounded string copying and concatenation
      
      bcopy
      ...bcopy function copies len bytes from string src to string dst . The two strings may...
      
      memcpy
      The memcpy function copies len bytes from string src to string dst . The arguments must...
      
      memmove
      ...memmove function copies len bytes from string src to string dst . The two strings may...
      
      memmove
      ...memmove function copies len bytes from string src to string dst . The two strings may...
      
      memcpy
      The memcpy function copies len bytes from string src to string dst . The arguments must...
      
      strncpy
      ...copy the string src to dst (including the terminating \e0 character). The strncpy function 
      copies...

      You might notice that a few results are repeated here. I believe this is a bug in apropos(1): some man pages have a number of different versions depending on the machine architecture, and I think the duplication in the results is because of that. I need to fix it :-)


      How to Test: If you are interested in checking out the functionality of the project, you are welcome to. I would appreciate it even more if you report back any issues you notice, or any feedback on how the search results can be improved.

      #Clone the repository:
      $git clone git://github.com/abhinav-upadhyay/apropos_replacement.git
      
      
      #Run make
      $make
      
      #Run makemandb
      $./makemandb
      
      #Run apropos
      $./apropos "list directories"
      

      By default you will be on the master branch, which currently does not have the stopword filter and ranking function features, so you might want to check out the search branch:

      $git checkout -b search origin/search

      and run make again to build it.

      Prerequisites:
      1. You will need the -Current version of man(1) from CVS. Joerg committed my patch for adding the -p option to man(1) which is being used by makemandb.
      2. You will also want to have the -current version of the man pages in /usr/share/man (at least).
      3. libmandoc. I am using the version of libmandoc available with -current (which at the moment is 1.11.1). You can build it by running make && make install in /usr/src/external/bsd/mdocml

      I believe a lot of work and research is now required to make the search better. Any feedback and suggestions will be highly welcome :-)

      Tuesday, June 14, 2011

      NetBSD GSoC Weekly Report 2

      This was a relatively productive week compared to the first week. A significant portion of the work towards our first milestone (a working prototype) got done.

      What did I do this week:

      Project Repository: The first thing I did was to create the project repository on Github. Here is the link: https://github.com/abhinav-upadhyay/apropos_replacement

      makemandb: This is one of the crucial components of the project. makemandb is supposed to parse each and every man page installed on the user's system and store them in an Sqlite database. The reason I say it is crucial is that we will be making a lot of changes to the database schema, to how the man pages are parsed and to what information is extracted from them, until we get near perfect search results. It is necessary to get this part right, because a good search experience comes only when the indexing has been done correctly.

      makemandb first calls 'man -p' and recursively traverses the list of directories to get the complete path of the man pages and then passes them on to the libmandoc functions.


      The parsing related code of makemandb is largely inspired by mandoc-db, as I had no clue about how to use libmandoc, let alone for parsing specific portions of the man pages, so it was a huge help for me. Thanks to Kristaps :)

      makemandb creates a new Sqlite database named 'apropos.db' (even if there is already an existing database). It creates a new virtual table in the database before starting to insert data; the present schema of the virtual table is roughly this:


      Table name: mandb


      Column Name    Description
      name           For storing the name of the man page
      name_desc      For storing the one line description of the man page from the NAME section
      desc           For storing the complete DESCRIPTION section
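
      In Sqlite terms the table creation itself is a one-liner. A sketch, assuming the FTS4 full text search module (the exact FTS version and tokenizer options used by makemandb may differ):

      sqlite3_exec(db,
          "CREATE VIRTUAL TABLE mandb USING fts4(name, name_desc, desc)",
          NULL, NULL, NULL);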
       

      Present Issues:

      1. Handling .Nm macros: .Nm macros seem to be special in the syntax of man pages. From what I have seen, the argument for the .Nm macro is specified only at the beginning of the man page (usually in the NAME section), and after that, wherever .Nm is used in the rest of the page, the parser is expected to replace it with the value specified at the top. At the present moment we are unable to handle this, so wherever .Nm is used again it is simply being ignored.
      2. Unable to parse escape sequences: Man pages are filled with a number of escape sequences. Presently our code does not do anything special to handle escape sequences and they are parsed as-is. The current version of mdocml has a new function mandoc_escape(), which I think should help rectify this. I hope to see the latest version of mdocml in -current so that I can use it.
      3. Unable to parse automatically generated man pages: Some man pages are generated automatically, and as a result their syntax is very different from normal man pages, so we are unable to parse them at the moment.
      There are a few more issues which I have listed on Github. (https://github.com/abhinav-upadhyay/apropos_replacement/issues).

      So at the present moment you can clone the repository, run make to compile the source and run './makemandb'. If all goes well, a new sqlite database (apropos.db) will be created in the current directory. You can run some select queries against it to test.

      Feedback will be highly appreciated :)