Wednesday, June 22, 2011

NetBSD GSoC Weekly Report 3

This week I got some more work done. I wrote a barebones implementation of apropos(1) and fixed some nasty and some not-so-nasty issues in makemandb.

Issues Fixed:

  • Handling .Nm macros: As I said in the last post, .Nm macros are a special case. They are given an argument only once, at the beginning of the man page; everywhere else .Nm occurs, the parser replaces it with its previously specified argument value. I just had to add an extra check to see whether we have encountered a .Nm macro and substitute its value. Here is the commit which fixed it: bbdab19
  • ENOMEM in traversedir(): This was a nasty memory leak in makemandb which cost me a couple of nights of sleep. I managed to track down the offending piece of code, with Joerg's help of course :-) Here was the problem: code similar to the following was running in a loop:
char *desc;
if (desc == NULL)
    desc = strdup(n->string);
    /* BUG: asprintf stores a new buffer in desc; the old one is never freed */
    asprintf(&desc, "%s %s", desc, n->string);

So the asprintf call above was leaking the previous desc buffer at each step of the loop: asprintf stores a newly allocated string in desc, and the old one was never freed. This was causing makemandb to consume up to 2.6 GB of memory (3 GB being my total physical memory). After fixing this bug, makemandb consumes around 5 to 6 MB of memory :-)
This is the commit which fixed it: cd53b9b
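
For illustration, here is a minimal sketch of a leak-free way to build up such a string. It is not the code from the commit; append_word is a hypothetical helper name I made up for this example. The idea is simply to let asprintf write into a temporary pointer and free the old buffer once the new one exists:

```c
#define _GNU_SOURCE /* for asprintf(3) on glibc; NetBSD declares it by default */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Append word to *desc, separated by a space.
 * Hypothetical helper, not the code from the commit.
 * Returns 0 on success, -1 on failure (*desc is then left untouched).
 */
int
append_word(char **desc, const char *word)
{
    char *tmp;

    assert(desc != NULL && word != NULL);
    if (*desc == NULL) {
        /* First word: just copy it. */
        if ((tmp = strdup(word)) == NULL)
            return -1;
    } else {
        /* Build the longer string in a fresh buffer... */
        if (asprintf(&tmp, "%s %s", *desc, word) == -1)
            return -1;
        /* ...and release the old one. This free() is what was missing. */
        free(*desc);
    }
    *desc = tmp;
    return 0;
}
```

Starting with char *desc = NULL and calling append_word(&desc, n->string) inside the loop keeps only one buffer alive at any time; the caller frees desc once at the end.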

  • Avoid Hardlinks: After running a few queries against the database, I noticed that some of the man pages were indexed multiple times. For example, csh had 26 duplicate entries. Joerg told me that this is due to hardlinks: a lot of man page files are nothing but hardlinks to other man pages. To handle this I added a function check_md5 to makemandb. Before we start to parse a new file, we first calculate its MD5 hash and check in the database whether a page with that hash is already indexed (I added a new column for storing the hash as well). Here is the commit: 14b024f

Implementation of apropos.c: Besides fixing some issues, I was also able to write a barebones version of apropos(1). The initial version was pretty basic. It would take the user query as a command line argument and simply run it against the database using SQLite's FTS engine. The results were not very good: SQLite's FTS documentation itself says that it performs a boolean search, so it is up to us to do the mathematics for finding the more relevant documents and ranking them higher in the results. The master branch on Github still has this basic version of apropos.

I have started a new experimental branch, search, on Github, where I will experiment with search related code. After some reviews and feedback, I will cherry-pick the commits which look good.

Currently the search branch has the following two features:

Stopword Filter: I noticed that SQLite does not filter stopwords out of the user query, and tries to match the stopwords as well while performing the search. I have implemented a stopword filter for this.
It works something like this: we store all the stopwords in a hash table. We scan the user query word by word in a loop, and at each iteration we look up the word in the hash table to find out whether it is a stopword. If it is, we omit it from the query. Here is the commit: ec25546
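
The steps above can be sketched roughly as follows. This is only an illustration, not the code from the commit: the stopword list is a made-up sample, the function names stopwords_init and remove_stopwords are hypothetical, and I use the POSIX hsearch(3) table here just to keep the example short. Note that the lookup is case sensitive as written.

```c
#define _XOPEN_SOURCE 700 /* exposes hsearch(3), strtok_r(3), strdup(3) */
#include <assert.h>
#include <search.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sample only; the real stopword list may differ. */
static const char *stopwords[] = {
    "a", "an", "and", "the", "of", "to", "in", "is", "for", "how"
};

/* Load the stopwords into the process-wide hsearch(3) table. */
int
stopwords_init(void)
{
    size_t i, n = sizeof(stopwords) / sizeof(stopwords[0]);

    if (hcreate(2 * n) == 0)
        return -1;
    for (i = 0; i < n; i++) {
        ENTRY e;
        e.key = (char *)stopwords[i];
        e.data = NULL;
        if (hsearch(e, ENTER) == NULL)
            return -1;
    }
    return 0;
}

/* Return a newly allocated copy of query without stopwords; caller frees. */
char *
remove_stopwords(const char *query)
{
    char *copy, *out, *word, *last;

    assert(query != NULL);
    copy = strdup(query);
    out = malloc(strlen(query) + 1); /* result is never longer than query */
    if (copy == NULL || out == NULL) {
        free(copy);
        free(out);
        return NULL;
    }
    out[0] = '\0';
    for (word = strtok_r(copy, " \t", &last); word != NULL;
        word = strtok_r(NULL, " \t", &last)) {
        ENTRY e;
        e.key = word;
        e.data = NULL;
        if (hsearch(e, FIND) != NULL)
            continue; /* stopword: omit it from the query */
        if (out[0] != '\0')
            strcat(out, " ");
        strcat(out, word);
    }
    free(copy);
    return out;
}
```

After a successful stopwords_init(), calling remove_stopwords("how to copy a string") with the sample list above yields "copy string", which is what would then be handed to the FTS query.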

A Ranking Function: As I said above, the plain SQLite search wasn't much of a help. So we need to write a ranking function which tells SQLite which search results are important, so that they are shown higher in the output. SQLite's FTS documentation provides a sample ranking function which is very simple but effective. I didn't try to fully understand it (I just wanted to see the effect of a ranking function on search results), but to me it seems to be based on finding the term frequency of the search phrases in each column of the database and multiplying it by a static weight assigned to that column. This procedure is repeated for each term in the query to find the weight of each column. The overall rank of the page is obtained by summing up the column weights calculated this way.

Commit for this: 001a679fe9a4b4c04a8d
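
To make the idea concrete, here is a small sketch of the weighted sum as I described it above. It is a simplification for illustration, not the sample function from the SQLite documentation and not the code in the commit: rank_page and its flat hits array are hypothetical, whereas the real function reads its counts from SQLite's matchinfo() data and, as far as I can tell, also normalizes each count by the number of hits across all rows.

```c
#include <assert.h>
#include <stddef.h>

/*
 * hits is a row-major nterms x ncols array: hits[t * ncols + c] is how
 * often query term t occurs in column c of one page (hypothetical input).
 * weights[c] is the static weight assigned to column c.
 */
double
rank_page(const unsigned int *hits, size_t nterms, size_t ncols,
    const double *weights)
{
    double score = 0.0;
    size_t t, c;

    assert(hits != NULL && weights != NULL);
    for (t = 0; t < nterms; t++)        /* repeat for each query term */
        for (c = 0; c < ncols; c++)     /* frequency times column weight */
            score += hits[t * ncols + c] * weights[c];
    return score;
}
```

For example, with two terms, three columns weighted 2.0, 1.0 and 0.5, and hit counts {1, 0, 2} and {0, 1, 3}, the page scores 3.0 + 2.5 = 5.5. A hit in a heavily weighted column (say, the name of the page) therefore counts for more than a hit in a lightly weighted one.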

Some Sample Runs: I ran some sample queries to check how this ranking function performs. The results are much better than without any kind of ranking, but there is still a lot of room for improvement. If you would like to see more, I pasted the output of some other queries on pastebin. Following is the output of one sample run:
$ ./apropos "copy string"            
The memccpy function copies bytes from string src to string dst . If the character c...

...copies at most len characters from the string str always NUL terminating the copied 

...copies len bytes from string src to string dst . Unlike bcopy 3 the two strings...

size-bounded string copying and concatenation

...bcopy function copies len bytes from string src to string dst . The two strings may...

The memcpy function copies len bytes from string src to string dst . The arguments must...

...memmove function copies len bytes from string src to string dst . The two strings may...

...memmove function copies len bytes from string src to string dst . The two strings may...

The memcpy function copies len bytes from string src to string dst . The arguments must...

...copy the string src to dst (including the terminating \e0 character). The strncpy function 

You might notice that a few results are repeated here. I believe this is a bug in apropos(1): some man pages exist in a number of different versions depending on the machine architecture, and I think the duplication in the results is because of that. I need to fix it :-)

How to Test: If you are interested in checking out the functionality of the project, you are welcome to try it. I would appreciate it even more if you report back any issues you notice, or if you have some feedback on how the search results can be improved.

# Clone the repository:
$ git clone git://

# Run make

# Run makemandb

# Run apropos
$ ./apropos "list directories"

By default you will be on the master branch, which currently does not have the stopword filter and ranking function features. To try them, check out the search branch:

$ git checkout -b search origin/search

and run make again to build it.

Prerequisites:

  1. You will need the -current version of man(1) from CVS. Joerg committed my patch adding the -p option to man(1), which is used by makemandb.
  2. You will also want to have the -current version of the man pages in /usr/share/man (at least).
  3. You will need libmandoc. I am using the version of libmandoc available with -current (which at the moment is 1.11.1). You can build it by running make && make install in /usr/src/external/bsd/mdocml

I believe a lot of work and research is now required to make the search better. Any feedback and suggestions are very welcome :-)