Monday, October 3, 2011

Improvements to makemandb

Over a month has passed since GSoC finished, and in that time I made some improvements and introduced a few new (experimental) features in apropos. I wanted to write about some of the things I did over the last month.

Indexing Additional Metadata for Faster Update Operations: Previously, makemandb maintained md5 hashes of all the pages it had indexed. On each run, it would read all the man pages, generate their md5 hashes and compare those with the hashes already in its index. It would then parse and store the pages whose md5 hash it did not find in the database, since those are the new or modified pages which need (re)indexing.
    Joerg pointed out that this wasn't a very efficient approach: it required needlessly reading all the man pages. He suggested indexing more metadata about the man page files, such as the mtime, device id and inode number. So rather than reading a page and generating its md5, makemandb can do a stat(2) on it, read its {device id, inode, mtime} triplet and check whether a matching triplet already exists in the database to decide whether the page needs to be indexed. This is much more efficient when updating the index after installing a few new man pages or updating a few existing ones, although when building the index from scratch, doing a stat(2) on every page is pure overhead, since everything needs to be indexed anyway.
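
Roughly, the per-page check could look like this (a minimal sketch of the idea; db_has_triplet() is a hypothetical stand-in for the actual database lookup, not a real makemandb function):

#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>

/* Hypothetical stand-in for the lookup in the index database. */
int db_has_triplet(dev_t dev, ino_t ino, time_t mtime);

/*
 * Returns 1 if the page at path is new or modified and needs
 * (re)indexing, 0 if its {device id, inode, mtime} triplet is
 * already present in the index, -1 on error.
 */
static int
page_needs_indexing(const char *path)
{
	struct stat sb;

	if (stat(path, &sb) == -1) {
		warn("stat failed for %s", path);
		return -1;
	}
	return db_has_triplet(sb.st_dev, sb.st_ino, sb.st_mtime) ? 0 : 1;
}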

Faster makemandb With Clever Memory Management: Due to the above mentioned changes, makemandb's runtime had more than doubled. Earlier it could build an index for 8000+ pages in 30-40 seconds, but now it was taking 130-150 seconds to do the same job. The changes which made makemandb slow were necessary and could not be undone, so I had to identify other areas where it could do better.
    As it turns out, makemandb was managing memory very poorly. One operation it performs very frequently is concatenating two strings: one holding the previously parsed data from the man page and one holding the newly parsed data. This kind of string manipulation is always tedious in C. The most straightforward way is to call realloc(3) to allocate sufficient space to hold the contents of the new string and then copy the new string to the end of the old one. I had a function concat() which did exactly that. An average length man page could easily trigger well over 100 calls to concat(), and for 8000+ pages that adds up to a very large number of malloc/realloc calls; worse, as the string of already parsed data grows, each realloc gets more expensive. Clearly this was the bottleneck which needed to be fixed.
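
For illustration, the naive pattern looks roughly like this (a sketch of the approach, not the original concat() verbatim; it reuses erealloc(3) from NetBSD's libutil):

#include <string.h>
#include <util.h>	/* erealloc(3) from NetBSD's libutil */

/*
 * Sketch of the naive pattern: every call reallocates the destination
 * and copies the new chunk at its end, so the realloc cost keeps
 * growing with the length of the accumulated text.
 */
static char *
concat(char *dst, const char *src)
{
	size_t dlen = dst ? strlen(dst) : 0;
	size_t slen = strlen(src);

	/* +1 for a separating space, +1 for the NUL terminator */
	dst = erealloc(dst, dlen + slen + 2);
	if (dlen)
		dst[dlen++] = ' ';
	memcpy(dst + dlen, src, slen + 1);	/* copies the NUL too */
	return dst;
}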

Solution: The solution was very simple. Instead of allocating memory every time a new string needs to be concatenated, pre-allocate a large chunk of memory and keep writing into it until you run out of space, at which point you reallocate another large chunk and proceed as before. This reduces the malloc/realloc calls from over 100 to around 10 per page.

#include <sys/types.h>
#include <assert.h>
#include <string.h>
#include <util.h>	/* emalloc(3), erealloc(3) from NetBSD's libutil */

/*
 * A data structure for holding section specific data.
 * buflen must be initialized to the desired initial buffer size
 * before the first call to append().
 */
typedef struct secbuff {
	char *data;
	size_t buflen;	/* total length of the buffer allocated */
	size_t offset;	/* position of the next byte to write at */
} secbuff;

static void
append(secbuff *sbuff, const char *src, ssize_t srclen)
{
	assert(src != NULL);
	if (srclen == -1)
		srclen = (ssize_t)strlen(src);

	if (sbuff->data == NULL) {
		sbuff->data = emalloc(sbuff->buflen);
		sbuff->offset = 0;
	}

	/*
	 * Grow the buffer, doubling it each time, until the new chunk
	 * plus a separating space is guaranteed to fit.
	 */
	if ((size_t)srclen + 2 >= sbuff->buflen - sbuff->offset) {
		do {
			sbuff->buflen *= 2;
		} while ((size_t)srclen + 2 >= sbuff->buflen - sbuff->offset);
		sbuff->data = erealloc(sbuff->data, sbuff->buflen);
	}

	/* Separate this chunk from the previously stored data with a space */
	if (sbuff->offset)
		sbuff->data[sbuff->offset++] = ' ';

	/* Now, copy src at the end of the buffer */
	memcpy(sbuff->data + sbuff->offset, src, (size_t)srclen);
	sbuff->offset += (size_t)srclen;
}
The secbuff data structure keeps track of the offset in the data buffer at which the next byte needs to be written. This way, I could allocate a sufficiently large chunk of memory up front and simply memcpy new data at its end. This approach brought a large speed improvement: makemandb's runtime dropped from 130+ seconds to somewhere around 45 seconds.
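
For illustration, using it would look something like this (the initial buffer size here is made up for the example, not the value makemandb actually uses):

/* The initial buflen is illustrative only. */
secbuff desc = { .data = NULL, .buflen = 10 * 1024, .offset = 0 };

append(&desc, "text parsed from one macro block", -1);
append(&desc, "text parsed from the next block", -1);

/*
 * desc.data now holds both chunks separated by a single space;
 * desc.offset is the number of bytes written. Note that append()
 * does not NUL-terminate the buffer.
 */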
