Wednesday, June 22, 2011

NetBSD GSoc Weekly report 3

This week I got some more work done: I did a barebones implementation of apropos(1) and fixed some nasty (and some not so nasty) issues in makemandb.

Issues Fixed:

  • Handling .Nm macros: As I said in the last post, .Nm macros are a special case. They are supplied an argument only once, at the beginning of the man page, and at the rest of the places where .Nm occurs, the parser replaces it with its previously specified argument value. I just had to add an extra check to see if we have encountered a .Nm macro and substitute its value. Here is the commit which fixed it: bbdab19
  • ENOMEM in traversedir(): This was a nasty memory leak in makemandb which took away my sleep for a couple of nights. I somehow managed to track down the offending piece of code, with Joerg's help of course :-) Here was the problem: code similar to the snippet below was running in a loop:
char *desc = NULL;
if (desc == NULL)
    desc = strdup(n->string);
else
    asprintf(&desc, "%s %s", desc, n->string);  /* the old desc is never freed */

So the above asprintf call was leaking the previous desc buffer at every iteration of the loop. This was causing makemandb to consume up to 2.6 GB of memory (3 GB being my total physical memory). After fixing this bug, makemandb consumes around 5 to 6 MB of memory :-) (A sketch of one way to plug such a leak appears after this list.)
This is the commit which fixed it: cd53b9b

  • Avoid Hardlinks: After running a few queries against the database, I noticed that some of the man pages were indexed multiple times. For example, csh had 26 duplicate entries. Joerg told me that this is due to hardlinks: a lot of man page files are nothing but hardlinks to other man pages. To handle this I added a function check_md5 to makemandb. So before we start to parse a new file, we first calculate its MD5 hash and check in the database whether it is already indexed (I added a new column for storing the hash as well; a sketch of such a check appears after this list). Here is the commit: 14b024f

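For illustration, here is one way such a leak can be plugged. This is only a sketch of the idea (the actual fix in commit cd53b9b may look different):

char *desc = NULL, *tmp;    /* declared once, outside the loop */

/* inside the loop: */
if (desc == NULL) {
    desc = strdup(n->string);
} else {
    tmp = desc;                             /* keep a handle on the old buffer */
    if (asprintf(&desc, "%s %s", tmp, n->string) == -1)
        desc = NULL;                        /* allocation failed */
    free(tmp);                              /* release the previous buffer */
}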
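
And here is a rough sketch of the hardlink check described above. The table and column names (mandb, md5_hash) are only placeholders, and I am assuming NetBSD's MD5File(3) here; the actual check_md5 in makemandb may differ:

#include <md5.h>
#include <sqlite3.h>

/* Return 1 if a man page with the same MD5 hash is already indexed. */
static int
already_indexed(sqlite3 *db, const char *file)
{
    char hash[MD5_DIGEST_STRING_LENGTH];    /* 33 bytes: 32 hex chars + NUL */
    sqlite3_stmt *stmt;
    int found = 0;

    if (MD5File(file, hash) == NULL)        /* hash the file's contents */
        return 0;
    if (sqlite3_prepare_v2(db, "SELECT 1 FROM mandb WHERE md5_hash = ?",
        -1, &stmt, NULL) != SQLITE_OK)
        return 0;
    sqlite3_bind_text(stmt, 1, hash, -1, SQLITE_STATIC);
    if (sqlite3_step(stmt) == SQLITE_ROW)   /* a row exists => already indexed */
        found = 1;
    sqlite3_finalize(stmt);
    return found;
}
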
Implementation of apropos.c: Besides fixing some issues, I was also able to write a barebones version of apropos(1). The initial version is pretty basic: it takes the user query as a command line argument and simply runs it against the database using the FTS engine of Sqlite. The results were not very good; as Sqlite's FTS documentation itself says, it performs a boolean search, so it is up to us to do the mathematics for finding the more relevant documents and ranking them higher in the results. The master branch on Github still has this basic version of apropos.

I have started a new experimental branch named search on Github, where I will experiment with search related code, and after some reviews and feedback, I will cherry-pick the commits which look good.

Currently the search branch has the following two features:

Stopword Filter: I noticed that Sqlite does not filter the user query for stopwords and tries to match the stopwords as well while performing the search, so I implemented a stopword filter.
It works something like this: we store all the stopwords in a hash table and scan the user query word by word in a loop; at each iteration we look up the hash table to check whether the word is a stopword, and if it is, we omit it from the query. Here is the commit: ec25546
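
A minimal sketch of that idea, using the POSIX hsearch(3) hash table (the stopword list here is just a tiny placeholder, and the real code in the search branch is organized differently):

#include <search.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *stopwords[] = { "a", "an", "the", "how", "to", "of", NULL };

/* Return a newly allocated copy of query with all stopwords removed. */
static char *
filter_query(const char *query)
{
    char *copy, *word, *result;
    size_t i, len = 0;

    hcreate(128);                               /* load the stopwords once */
    for (i = 0; stopwords[i] != NULL; i++) {
        ENTRY e = { (char *)stopwords[i], NULL };
        hsearch(e, ENTER);
    }

    copy = strdup(query);
    result = calloc(strlen(query) + 1, 1);      /* filtered query is never longer */
    for (word = strtok(copy, " "); word != NULL; word = strtok(NULL, " ")) {
        ENTRY e = { word, NULL };
        if (hsearch(e, FIND) != NULL)
            continue;                           /* drop the stopword */
        len += snprintf(result + len, strlen(query) + 1 - len, "%s%s",
            len ? " " : "", word);
    }
    free(copy);
    hdestroy();
    return result;                              /* caller frees */
}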

A Ranking Function: As I said above, the plain Sqlite search wasn't much of a help, so we need a ranking function which tells Sqlite which search results are important and shows them higher in the output. Sqlite's FTS documentation provides a sample ranking function which is very simple but effective. I didn't try to fully understand it (I just wanted to see the effect of a ranking function on the search results), but it seems to be based on finding the term frequency of the search phrases in each column of the database and multiplying it with a static weight assigned to each column; this procedure is repeated for each term in the query to find the weight of each column. The overall rank of the page is obtained by summing up the weights of the individual columns thus calculated.

Commit for this: 001a679fe9a4b4c04a8d
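
For reference, the core of that sample ranking function, lightly adapted here to the three-column mandb schema, looks roughly like this (see the SQLite FTS3 documentation for the original). It is registered as a custom SQL function and called as, e.g., SELECT name, rank(matchinfo(mandb), 2.0, 1.5, 1.0) AS r FROM mandb WHERE mandb MATCH ? ORDER BY r DESC:

#include <sqlite3.h>

static void
rankfunc(sqlite3_context *ctx, int nval, sqlite3_value **apval)
{
    const unsigned int *matchinfo;
    unsigned int nphrase, ncol, iphrase, icol;
    double score = 0.0;

    (void)nval;                         /* expected to be ncol + 1 */
    matchinfo = (const unsigned int *)sqlite3_value_blob(apval[0]);
    nphrase = matchinfo[0];             /* number of phrases in the query */
    ncol = matchinfo[1];                /* number of columns in the table */

    for (iphrase = 0; iphrase < nphrase; iphrase++) {
        /* 3 integers per (phrase, column) pair follow the two-entry header */
        const unsigned int *phraseinfo = &matchinfo[2 + iphrase * ncol * 3];
        for (icol = 0; icol < ncol; icol++) {
            unsigned int hits_this_row = phraseinfo[3 * icol];
            unsigned int hits_all_rows = phraseinfo[3 * icol + 1];
            double weight = sqlite3_value_double(apval[icol + 1]);
            if (hits_this_row > 0)
                score += ((double)hits_this_row / hits_all_rows) * weight;
        }
    }
    sqlite3_result_double(ctx, score);
}

/* Registered once after opening the database: */
/* sqlite3_create_function(db, "rank", -1, SQLITE_UTF8, NULL, rankfunc, NULL, NULL); */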

Some Sample Runs: I ran some sample queries to check how this ranking function performs. The results are much improved compared to having no ranking at all, but there is still much scope for improvement. Following is a sample run; if you would like to see a few others, I pasted the output of some queries on pastebin: http://pastebin.com/qhQBRNd5
$ ./apropos "copy string"            
memccpy
The memccpy function copies bytes from string src to string dst . If the character c...

strndup
...copies at most len characters from the string str always NUL terminating the copied 
string...

bcopy
...copies len bytes from string src to string dst . Unlike bcopy 3 the two strings...

strlcat
size-bounded string copying and concatenation

bcopy
...bcopy function copies len bytes from string src to string dst . The two strings may...

memcpy
The memcpy function copies len bytes from string src to string dst . The arguments must...

memmove
...memmove function copies len bytes from string src to string dst . The two strings may...

memmove
...memmove function copies len bytes from string src to string dst . The two strings may...

memcpy
The memcpy function copies len bytes from string src to string dst . The arguments must...

strncpy
...copy the string src to dst (including the terminating \e0 character). The strncpy function 
copies...

You might notice that a few results are repeated here. I believe this is a bug in apropos(1): some man pages have a number of different versions depending on the machine architecture, and I think the duplication in the results comes from that. I need to fix it :-)


How to Test: If you are interested in checking out the functionality of the project, you are welcome to. I would appreciate it even more if you report back any issues you notice or if you have some feedback on how the search results can be improved.

# Clone the repository:
$ git clone git://github.com/abhinav-upadhyay/apropos_replacement.git


# Run make
$ make

# Run makemandb
$ ./makemandb

# Run apropos
$ ./apropos "list directories"

By default you will be on the master branch, which currently does not have the stopword filter and ranking function features, so you might want to check out the search branch:

$ git checkout -b search origin/search

and run make again to build it.

Prerequisites:
  1. You will need the -current version of man(1) from CVS. Joerg committed my patch adding the -p option to man(1), which is used by makemandb.
  2. You will also want to have the -current version of the man pages in /usr/share/man (at least).
  3. libmandoc. I am using the version of libmandoc available in -current (1.11.1 at the moment). You can build it by running make && make install in /usr/src/external/bsd/mdocml.

I believe a lot of work and research is now required to make the search better. Any feedback and suggestions will be highly welcome :-)

Monday, June 20, 2011

A memorable week!!

Last week was perhaps one of the most wonderful weeks of my life. I never thought that all the things that happened to me this last week would ever happen to me (although I sure dreamt of them).

First of all, Tomboy 1.7 was released earlier this week. This release included one of my patches, and the Tomboy developers mentioned my name in the ChangeLog [1].

The next day, I updated my Tomboy from 1.6 to 1.7.1 and, to my great surprise, I saw my name in the list of contributors to Tomboy. It's really a great feeling to see your name in the list of contributors to a piece of software that you yourself love and use.



But the week just didn't end there; one more surprise was waiting for me. Famous Ubuntu developer Daniel Holbach contacted me for an interview. He told me that he is starting a series of weekly interviews with new contributors to Ubuntu development, and he would like to start with me. I couldn't have asked for anything more :-) The interview was published on OMG! Ubuntu [2] and ubuntu-news [3].

For my little contributions to Ubuntu and Tomboy, this is more than I ever dreamt of getting in return. I am grateful to the Tomboy devs, Daniel Holbach and everyone else around me who encouraged me all throughout this. I hope many other people get encouraged by this; it's very easy to start contributing, and you learn as you go.


Footnote:
[1]: Tomboy 1.7 release note and changelog
[2]: The interview on OMG! Ubuntu
[3]: The interview on ubuntu-news.org

Tuesday, June 14, 2011

NetBSD GSoC Weekly Report 2

This was a relatively productive week compared to the first one. A significant portion of the work towards our first milestone (a working prototype) got done.

What did I do this week:

Project Repository: The first thing I did was to create the project repository on Github. Here is the link: https://github.com/abhinav-upadhyay/apropos_replacement

makemandb: This is one of the crucial components of the project. makemandb is supposed to parse each and every man page installed on the user's system and store them in an Sqlite database. The reason I say it is crucial is that we will be making a lot of changes to the database schema, the way man pages are parsed, and what information is extracted from them, until we get near perfection in our search results. It is necessary to get this part right, because a good search experience comes only when the indexing is done correctly.

makemandb first calls 'man -p' and recursively traverses the resulting list of directories to get the complete paths of the man pages, which it then passes on to the libmandoc functions.
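
Roughly, that first step looks like the sketch below. The function name is made up, and traversedir() stands in for makemandb's directory-walking code (its exact signature is my assumption):

#include <limits.h>
#include <stdio.h>
#include <string.h>

void traversedir(const char *path);        /* defined elsewhere in makemandb */

/* Read the newline-separated list of man page directories printed by
 * "man -p" and hand each one to the directory traversal code. */
static void
collect_man_dirs(void)
{
    FILE *fp;
    char line[PATH_MAX];

    if ((fp = popen("man -p", "r")) == NULL)
        return;
    while (fgets(line, sizeof(line), fp) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip the trailing newline */
        traversedir(line);                  /* walk the directory, parse pages */
    }
    pclose(fp);
}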


The parsing related code of makemandb is largely inspired by mandoc-db, as I had no clue about how to use libmandoc, let alone how to parse specific portions of the man pages with it, so it was a huge help for me. Thanks to Kristaps :)

makemandb will create a new Sqlite database named 'apropos.db' (overwriting any existing database). It will create a new virtual table in the database before starting to insert data; the present schema of the virtual table is something like this:


Table name: mandb


Column Name    Description
name           For storing the name of the man page
name_desc      For storing the one line description of the man page from the NAME section
desc           For storing the complete DESCRIPTION section
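
In code, creating such a virtual table looks roughly like the snippet below. This is only a sketch assuming the fts4 module; the exact schema and module version used by makemandb may differ:

#include <err.h>
#include <sqlite3.h>

/* Drop any old index and create a fresh FTS virtual table for the man pages. */
static void
create_db(sqlite3 *db)
{
    const char *sql =
        "DROP TABLE IF EXISTS mandb; "
        "CREATE VIRTUAL TABLE mandb USING fts4(name, name_desc, desc)";
    char *errmsg = NULL;

    if (sqlite3_exec(db, sql, NULL, NULL, &errmsg) != SQLITE_OK) {
        warnx("sqlite3_exec: %s", errmsg);
        sqlite3_free(errmsg);
    }
}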

Present Issues:

  1. Handling .Nm macros: .Nm macros seem to be special in the syntax of man pages. From what I have seen, the argument for the .Nm macro is specified only at the beginning of the man page (usually in the NAME section), and wherever .Nm is used in the rest of the page, the parser is supposed to replace it with that original value specified at the top. At the present moment we are unable to handle this, so wherever .Nm is used again, it is simply being ignored.
  2. Unable to parse escape sequences: Man pages are filled with a number of escape sequences. Presently our code does not do anything special to handle the escape sequences, so they are stored as-is. The current version of mdocml has a new function, mandoc_escape(), which I think should help rectify this. I hope to see the latest version of mdocml in -current to be able to use it.
  3. Unable to parse automatically generated man pages: Some man pages are generated automatically, and as a result their syntax is very different from that of normal man pages, so we are unable to parse such pages at the moment.
There are a few more issues which I have listed on Github. (https://github.com/abhinav-upadhyay/apropos_replacement/issues).

So at the present moment, you can clone the repository, run make to compile the sources and run './makemandb'. If all goes well, a new Sqlite database (apropos.db) will be created in the current directory. You can run some select queries against it to test; a couple of examples are shown below.
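
For example, using the sqlite3 shell (assuming the column names from the table above; adjust them if the schema has changed):

$ sqlite3 apropos.db "SELECT name, name_desc FROM mandb WHERE mandb MATCH 'directory';"
$ sqlite3 apropos.db "SELECT name FROM mandb WHERE desc MATCH 'copy string' LIMIT 10;"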

Feedback will be highly appreciated :)

Wednesday, June 8, 2011

NetBSD GSoC Weekly Report 1

The coding period of GSoC started on the 23rd of May and we are now in the 3rd week, but this is my first report because during the first week I was bogged down with my semester exams; I picked up the work from the 1st of June, and just last night (8th June) I finally completed my first task.

The Task: In one of my previous posts I described the first task. The problem was to get the list of directories which contain the man page sources. We will need this information later when creating a database index for searching the man pages.
The information we are seeking is present in the file /etc/man.conf. The following two bits of man.conf are important for us:

_default        /usr/{share,X11R6,X11R7,pkg,local}/man/

and

_subdir        cat1 man1 cat2 man2 cat4 man4... 


From these we need to build the paths of the directories containing man pages, like this:

/usr/share/man/man1
/usr/share/man/man8
/usr/pkg/man/man4
...

I wrote a patch for man(1) to add a new option, -p, which prints this list of directories on the terminal, one per line. It took me a whole week to do this relatively simple task, mostly because of my stupid mistakes.

My initial patch was based on a brute-force approach to the problem; it worked, but it was too complicated for anyone's liking.

It looked something like this: 

+/**
+ *     Tests if the directory at dirpath exists or not
+ */
+static int
+testdir(const char *dirpath)
+{
+       DIR *dp;
+       if ((dp = opendir(dirpath)) == NULL)
+               return 0;
+       closedir(dp);
+       return 1;
+}
+
+/**
+ *     Builds a list of directories containing man pages
+ */
+void
+printmanpath(struct manstate *m)
+{
+       ENTRY *esubd, *epath;
+       char *manpath;          /* it will store the actual manpath as it is built */
+       char *manpath_tokens[3];        /* stores /usr/, {share, pkg, ...}, /man/ */
+       char *defaultpath = NULL;       /* stores the _default tag value obtained from man.conf */
+       char *str, *buf;        /* for storing temporary values */
+       int i;
+
+       TAG *path = m->defaultpath;
+       TAG *subdirs = m->subdirs;
+       if (path == NULL) {
+               manpath = NULL;
+               return;
+       }
+
+       /**
+        * routine code to get the default man path from the TAG.
+        * path is of the form /usr/{share,X11R7,X11R6,pkg,local}/man/ (see /etc/man.conf)
+        * We will first tokenize it into 3 parts
+        *      1. /usr/
+        *      2. share,X11R7,X11R6,pkg,local
+        *      3. /man/
+        * and store them in the array manpath_tokens.
+        */
+       TAILQ_FOREACH(epath, &path->entrylist, q) {
+               defaultpath = strdup(epath->s);
+               for (str = strtok(defaultpath, "{}"), i = 0; str;
+                   str = strtok(NULL, "{}"), i++) {
+                       manpath_tokens[i] = str;
+               }
+               free(str);
+       }
+
+       /**
+        * 1. Tokenize manpath_tokens[1] (share, X11R7, X11R6, ...)
+        * 2. Traverse the tail queue subdirs and get the list of subdirs i.e.:
+        *      man1, man2, man3, ... man9, etc. (see /etc/man.conf)
+        * 3. Finally build the complete path of the directory by concatenating the
+        *      different parts
+        */
+       for (str = strtok(manpath_tokens[1], ","); str; str = strtok(NULL, ",")) {
+               TAILQ_FOREACH(esubd, &subdirs->entrylist, q) {
+                       // we need only path of the actual man pages and not the cat ones
+                       if (strncmp(esubd->s, "man", 3))
+                               continue;
+
+                       asprintf(&buf, "%s%s%s%s/", manpath_tokens[0], str,
+                           manpath_tokens[2], esubd->s);
+
+                       // we should not add non-existing directories to the man path
+                       if (!testdir(buf))
+                               continue;
+
+                       if (manpath == NULL)
+                               asprintf(&manpath, "%s", buf);
+                       else
+                               printf("%s\n", buf);
+                       free(buf);
+               }
+       }
+
+       free(str);
+       free(defaultpath);
+}


My mentors David and Joerg are showing a lot of patience with me. They went through the different versions of the patch and gave their useful reviews. Joerg suggested a more intuitive and efficient algorithm to build this path in a recursive fashion. In the end I discovered glob(3), which provides csh-style brace expansion, and I settled on using it, as it was the easiest option and ensured that nothing goes wrong.


The final version of patch looked something like this:

+
+/*
+ * printmanpath --
+ *    Prints a list of directories containing man pages.
+ */
+static void
+printmanpath(struct manstate *m)
+{
+    ENTRY *esubd;
+    char *defaultpath = NULL; /* _default tag value from man.conf. */
+    char *buf; /* for storing temporary values */
+    char **ap;
+    glob_t pg;
+    struct stat sb;
+    TAG *path = m->defaultpath;
+    TAG *subdirs = m->subdirs;
+    
+    /* the tail queue is empty if no _default tag is defined in man.conf */
+    if (TAILQ_EMPTY(&path->entrylist))
+        errx(EXIT_FAILURE, "Empty manpath");
+        
+    defaultpath = TAILQ_LAST(&path->entrylist, tqh)->s;
+    
+    if (glob(defaultpath, GLOB_BRACE | GLOB_NOSORT, NULL, &pg) != 0)
+        err(EXIT_FAILURE, "glob failed");
+        
+    TAILQ_FOREACH(esubd, &subdirs->entrylist, q) {
+        /* Drop cat page directory, only sources are relevant. */
+        if (strncmp(esubd->s, "man", 3))
+            continue;
+
+        for (ap = pg.gl_pathv; *ap != NULL; ++ap) {
+            if (asprintf(&buf, "%s%s", *ap, esubd->s) == -1) 
+                err(EXIT_FAILURE, "memory allocation error");
+            /* Skip non-directories. */
+            if (stat(buf, &sb) == 0 && S_ISDIR(sb.st_mode))
+                printf("%s\n", buf);
+
+            free(buf);
+        }
+    }
+    globfree(&pg);
+} 

What did I learn in the process: It was a small and relatively simple task, although I believe brace expansion is not a trivial thing to do manually. Overall, it was a great learning experience for me. I learnt about the queue(3) interfaces, glob(3), and a host of string utilities available in the POSIX and ISO C standards that I was previously unaware of.

And the most important lesson was about memory management. I tried to take care of freeing memory at most of the points, but I came to learn about many corner cases which I didn't know about and which might lead to memory leaks. I hope I did learn a lesson here :)

Hopefully this will be my first patch for NetBSD. Joerg promised to commit it soon to the repository.