
Monday, October 3, 2011

Improvements to makemandb

Over a month has passed since GSoC finished, and I have made some improvements and introduced a few new (experimental) features in apropos. I wanted to write about some of the things I did over the last month.

Indexing Additional Metadata For Faster Update Operations: Previously, makemandb maintained md5 hashes of all the pages indexed. On each run, makemandb would read all the man pages, generate their md5 hashes, and compare them with the hashes already in its index. It would then parse and store the pages whose md5 hash it did not find in the database, meaning these were new or modified pages that needed (re)indexing.
    Joerg pointed out that this wasn't a very efficient approach; it required unnecessarily reading all the man pages. He suggested indexing more metadata about the man page files: the mtime, device id, and inode number. So rather than reading the pages and generating their md5 hashes, makemandb does a stat(2) on each of them, reads the {device id, inode, mtime} triplet, and checks whether a matching triplet already exists in the database to decide whether the page needs to be indexed. This is a more efficient approach when you are updating the index after installing some new man pages or updating a few existing ones, although when you are building the index from scratch, doing a stat(2) for every page is pure overhead.
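To make the idea concrete, here is a rough sketch of such a check (the struct and function names are hypothetical, not the actual makemandb code):

#include <sys/types.h>
#include <sys/stat.h>
#include <stdbool.h>

/* Metadata previously recorded in the index for a man page file. */
typedef struct {
	dev_t dev;
	ino_t inode;
	time_t mtime;
} page_meta;

/*
 * Return true if the page at `path` needs to be (re)indexed, i.e. if it
 * was never indexed before or its {device id, inode, mtime} triplet has
 * changed since the last run.
 */
static bool
needs_reindex(const char *path, const page_meta *recorded)
{
	struct stat sb;

	if (stat(path, &sb) == -1)
		return true;	/* let the parser report the error later */
	if (recorded == NULL)
		return true;	/* not seen before */
	return sb.st_dev != recorded->dev ||
	    sb.st_ino != recorded->inode ||
	    sb.st_mtime != recorded->mtime;
}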



Faster makemandb With Clever Memory Management: Due to the above-mentioned changes, makemandb's runtime had more than doubled. Earlier, makemandb could build an index for 8000+ pages in 30-40 seconds, but now it was taking 130-150 seconds to do the same job. The changes which made makemandb slow were necessary and could not be undone, so I had to identify other areas where it could do better.
    As it turns out, makemandb was managing memory very poorly. One operation it performs very frequently is concatenating two strings, one containing previously parsed data from the man page and the other containing newly parsed data. This kind of string manipulation is always a tedious task in C. The most straightforward way is to call realloc(3) to allocate sufficient space to hold the contents of the new string and then copy the new string to the end of the old one. I had a function concat() which was doing just that. An average-length man page can easily produce well over 100 calls to concat(), and for 8000+ pages this adds up to a very large number of malloc/realloc calls; and as the string containing the already parsed data grows, the realloc calls get even more expensive. So clearly this was the bottleneck which needed to be fixed.
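The old concat() looked roughly like this (reconstructed for illustration; not the exact code):

/*
 * Grow the destination with realloc(3) on every call and copy the new
 * data at its end. erealloc() is NetBSD's error-checked realloc.
 */
static void
concat(char **dst, const char *src)
{
	size_t dstlen = (*dst != NULL) ? strlen(*dst) : 0;
	size_t srclen = strlen(src);

	/* One reallocation per call: old data + space + new data + NUL */
	*dst = erealloc(*dst, dstlen + srclen + 2);
	if (dstlen > 0)
		(*dst)[dstlen++] = ' ';
	memcpy(*dst + dstlen, src, srclen + 1);
}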

Solution: The solution was very simple. Instead of allocating memory every time a new string needs to be concatenated, pre-allocate a large chunk of memory and keep writing to it until you run short of space, in which case reallocate another large chunk and proceed as usual. This reduces the number of malloc/realloc calls from over 100 to around 10 for a single page.



/*
 * A data structure for holding section specific data.
 */
typedef struct secbuff {
	char *data;
	int buflen;	/* Total length of the buffer allocated initially */
	int offset;	/* Offset at which the next byte should be written */
} secbuff;

static void
append(secbuff *sbuff, const char *src, int srclen)
{
	assert(src != NULL);
	if (srclen == -1)
		srclen = strlen(src);

	if (sbuff->data == NULL) {
		/* emalloc/erealloc are NetBSD's error-checked allocators */
		sbuff->data = emalloc(sbuff->buflen);
		sbuff->offset = 0;
	}

	/* Grow the buffer until the new data (plus a separating space) fits */
	while (srclen + 2 >= sbuff->buflen - sbuff->offset) {
		sbuff->buflen *= 2;
		sbuff->data = erealloc(sbuff->data, sbuff->buflen);
	}

	/* Separate the previously stored data from the new data with a space */
	if (sbuff->offset) {
		memcpy(sbuff->data + sbuff->offset, " ", 1);
		sbuff->offset++;
	}

	/* Now, copy src at the end of the buffer */
	memcpy(sbuff->data + sbuff->offset, src, srclen);
	sbuff->offset += srclen;
}
The secbuff data structure keeps track of the offset in the data buffer at which the next byte needs to be written. In this way, I could allocate a sufficiently large chunk of memory to a buffer and simply use memcpy to write the new data at its end. This approach brings a large speed improvement to makemandb: the runtime has come down from 130+ seconds to somewhere around ~45 seconds.
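For illustration, a hypothetical call sequence could look like the following (the initial buffer size and the strings are made up; makemandb picks its own values while parsing):

/* Hypothetical usage of append(): accumulate the text of one section. */
secbuff desc;
desc.data = NULL;
desc.buflen = 65536;	/* initial chunk; append() doubles it when needed */
desc.offset = 0;

append(&desc, "The memccpy function copies bytes", -1);
append(&desc, "from string src to string dst.", -1);

/* desc.data now holds the concatenated text and desc.offset its length;
 * NUL-terminate it before handing it over to the database. */
desc.data[desc.offset] = '\0';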

Wednesday, June 22, 2011

NetBSD GSoc Weekly report 3

This week I got some more work done. I did a barebones implementation of apropos(1) and fixed some nasty and some not-so-nasty issues in makemandb.

Issues Fixed:

  • Handling .Nm macros: As I said in the last post, .Nm macros are a special case. They are supplied an argument only once, at the beginning of the man page, and everywhere else .Nm occurs the parser replaces it with the previously specified argument value. I just had to add an extra check to see whether we have encountered a .Nm macro and substitute its value. Here is the commit which fixed it: bbdab19
  • ENOMEM in traversedir(): This was a nasty memory leak in makemandb which took away my sleep for a couple of nights. I somehow managed to track down the offending piece of code, with Joerg's help of course :-) Here was the problem: code similar to the snippet below was running in a loop:
char *desc = NULL;
if (desc == NULL)
    desc = strdup(n->string);
else
    /* BUG: asprintf() allocates a fresh string; the old desc is never freed */
    asprintf(&desc, "%s %s", desc, n->string);

The asprintf call above was leaking the old desc at each iteration of the loop. This was causing makemandb to consume up to 2.6 GB of memory (3 GB being my total physical memory). After fixing this bug, makemandb consumes around 5 to 6 MB of memory :-) A sketch of one way to plug such a leak appears after this list.
This is the commit which fixed it: cd53b9b

  • Avoid Hardlinks: After running a few queries against the database, I noticed that some of the man pages were indexed multiple times; for example, csh had 26 duplicate entries. Joerg told me that this is due to hardlinks: a lot of man page files are nothing but hardlinks to other man pages. To handle this I added a function check_md5 to makemandb. Before we start to parse a new file, we first calculate its md5 hash and check in the database whether it is already indexed (I added a new column for storing the hash as well). A sketch of such a check appears after this list. Here is the commit: 14b024f
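Going back to the ENOMEM item above, here is a minimal sketch of one way to plug such a leak (not necessarily the exact change that went into commit cd53b9b): keep the old pointer around and free it once asprintf(3) has built the new string.

/* Sketch of a leak-free version of the loop body shown earlier. */
if (desc == NULL)
    desc = strdup(n->string);
else {
    char *prev = desc;
    asprintf(&desc, "%s %s", prev, n->string);
    free(prev);
}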

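And for the hardlink case, a rough sketch of such an md5 check might look like this (the function name already_indexed and the table/column names are hypothetical; the real makemandb schema may differ):

#include <md5.h>	/* MD5File(3), part of libc on NetBSD */
#include <sqlite3.h>
#include <stdbool.h>

static bool
already_indexed(sqlite3 *db, const char *file)
{
	char hash[33];	/* 32 hex digits + NUL */
	sqlite3_stmt *stmt;
	bool found = false;

	if (MD5File(file, hash) == NULL)
		return false;
	if (sqlite3_prepare_v2(db, "SELECT 1 FROM mandb_md5 WHERE md5_hash = ?",
	    -1, &stmt, NULL) != SQLITE_OK)
		return false;
	sqlite3_bind_text(stmt, 1, hash, -1, SQLITE_STATIC);
	found = (sqlite3_step(stmt) == SQLITE_ROW);
	sqlite3_finalize(stmt);
	return found;
}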
Implementation of apropos.c: Besides fixing issues, I was also able to write a barebones version of apropos(1). The initial version was pretty basic: it takes the user query as a command line argument and simply runs it against the database using the FTS engine of Sqlite. The results were not very good; as Sqlite's FTS documentation itself says, it performs a boolean search, so it is up to us to do the mathematics for finding the more relevant documents and ranking them higher in the results. The master branch on GitHub still has this basic version of apropos.

I have started a new experimental branch, search, on GitHub, where I will experiment with search related code, and after some reviews and feedback I will cherry-pick the commits which look good.

Currently, the search branch has the following two features:

Stopword Filter: I noticed that Sqlite does not filter the user query for stopwords and tries to match the stopwords as well while performing the search, so I have implemented a stopword filter.
It works something like this: we store all the stopwords in a hash table, then scan the user query word by word in a loop; at each iteration we look up the hash table to check whether the word is a stopword, and if it is, we omit it from the query (see the sketch below). Here is the commit: ec25546
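A minimal sketch of that idea, using the POSIX hcreate/hsearch hash table (the word list here is tiny and made up; the real filter in the search branch is larger):

#include <search.h>
#include <stdio.h>
#include <string.h>

static char *stopwords[] = { "a", "an", "and", "how", "is", "of",
	"the", "to", "what", NULL };

static int
is_stopword(const char *word)
{
	ENTRY e, *ep;

	e.key = (char *)word;	/* hsearch does not modify the key */
	ep = hsearch(e, FIND);
	return ep != NULL;
}

int
main(void)
{
	char query[] = "how to copy a string";
	char filtered[sizeof(query)] = "";
	ENTRY e;

	/* Load the stopwords into the hash table. */
	hcreate(128);
	for (size_t i = 0; stopwords[i] != NULL; i++) {
		e.key = stopwords[i];
		e.data = NULL;
		hsearch(e, ENTER);
	}

	/* Scan the query word by word, keeping only non-stopwords. */
	for (char *w = strtok(query, " "); w != NULL; w = strtok(NULL, " ")) {
		if (is_stopword(w))
			continue;
		if (filtered[0] != '\0')
			strlcat(filtered, " ", sizeof(filtered));
		strlcat(filtered, w, sizeof(filtered));
	}
	printf("%s\n", filtered);	/* prints "copy string" */
	return 0;
}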

A Ranking Function: As I said above, the plain Sqlite search wasn't much of a help, so we need a ranking function which tells Sqlite which search results are more important so they can be shown higher in the output. Sqlite's FTS documentation provides a sample ranking function which is simple but effective. I didn't try to fully understand it (I just wanted to see the effect of a ranking function on the search results), but it seems to be based on finding the term frequency of the search phrases in each column of the database and multiplying those frequencies by a static weight assigned to each column; this is repeated for each term in the query to find the weight of each column. The overall rank of the page is obtained by summing up the weights of the individual columns thus calculated.

Commit for this: 001a679fe9a4b4c04a8d
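To illustrate the arithmetic only (not the actual code from the branch, which works on the counts that Sqlite's matchinfo() returns), the rank of a page is essentially computed like this:

#include <stdio.h>

/*
 * score = sum over terms t and columns c of tf[t][c] * weight[c],
 * where tf is the term frequency and weight is the static column weight.
 */
static double
rank_page(int nterms, int ncols, const int *tf, const double *weight)
{
	double score = 0.0;

	for (int t = 0; t < nterms; t++)
		for (int c = 0; c < ncols; c++)
			score += tf[t * ncols + c] * weight[c];
	return score;
}

int
main(void)
{
	/* Two query terms, three columns (e.g. name, one-line description,
	 * full description), with made-up frequencies and weights. */
	const int tf[] = {
		1, 2, 5,	/* term "copy"   */
		0, 1, 3,	/* term "string" */
	};
	const double weight[] = { 2.0, 1.0, 0.5 };

	printf("rank = %.2f\n", rank_page(2, 3, tf, weight));
	return 0;
}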

Some Sample Runs: I ran some sample queries to check how this ranking function performs. The results are much improved compared to having no ranking at all, but there is still much scope for improvement. Following is a sample run; if you would like to see a few others, I pasted the output of some queries on pastebin: http://pastebin.com/qhQBRNd5
$ ./apropos "copy string"            
memccpy
The memccpy function copies bytes from string src to string dst . If the character c...

strndup
...copies at most len characters from the string str always NUL terminating the copied 
string...

bcopy
...copies len bytes from string src to string dst . Unlike bcopy 3 the two strings...

strlcat
size-bounded string copying and concatenation

bcopy
...bcopy function copies len bytes from string src to string dst . The two strings may...

memcpy
The memcpy function copies len bytes from string src to string dst . The arguments must...

memmove
...memmove function copies len bytes from string src to string dst . The two strings may...

memmove
...memmove function copies len bytes from string src to string dst . The two strings may...

memcpy
The memcpy function copies len bytes from string src to string dst . The arguments must...

strncpy
...copy the string src to dst (including the terminating \e0 character). The strncpy function 
copies...

You might notice that a few results are repeated here. I believe this is a bug in apropos(1): some man pages have a number of different versions depending on the machine architecture, and I think this duplication in the results is because of that. I need to fix it :-)


How to Test: If you are interested in checking out the functionality of the project, you are welcome to; I would appreciate it even more if you report back any issues you notice or have some feedback on how the search results can be improved.

#Clone the repository:
$ git clone git://github.com/abhinav-upadhyay/apropos_replacement.git

#Run make
$ make

#Run makemandb
$ ./makemandb

#Run apropos
$ ./apropos "list directories"

By default you will be on the master branch, which currently does not have the stopword filter and ranking function features, so you might want to check out the search branch. For that, run

$ git checkout -b search origin/search

and run make again to build it.

Prerequisites:
  1. You will need the -current version of man(1) from CVS. Joerg committed my patch adding the -p option to man(1), which is used by makemandb.
  2. You will also want to have the -current version of the man pages in /usr/share/man (at least).
  3. libmandoc. I am using the version of libmandoc available with -current (which at the moment is 1.11.1). You can build it by running make && make install in /usr/src/external/bsd/mdocml.

I believe a lot of work and research is now required to make the search better. Any feedback and suggestions will be highly welcome :-)