Tuesday, June 14, 2011

NetBSD GSoC Weekly Report 2

This was a relatively productive week compared to the first one. A significant portion of the work towards our first milestone (a working prototype) got done.

What I did this week:

Project Repository: The first thing I did was create the project repository on GitHub. Here is the link: https://github.com/abhinav-upadhyay/apropos_replacement

makemandb: This is one of the crucial components of the project. makemandb is supposed to parse every man page installed on the user's system and store it in an SQLite database. I call it crucial because we will be making a lot of changes to the database schema, to the way man pages are parsed, and to what information is extracted from them, until the search results are as good as we can make them. It is necessary to get this part right, because a good search experience is possible only when the indexing is done correctly.

makemandb first calls 'man -p' to get the list of man page directories, then recursively traverses those directories to collect the full path of every man page, and passes each path on to the libmandoc functions.
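The traversal step can be sketched roughly as follows. This is only an illustration in Python, not the actual makemandb code (which is written in C): the function name collect_man_pages is made up for this example, and the list of directories is assumed to have been obtained already (for instance from the output of 'man -p').

```python
import os

def collect_man_pages(directories):
    """Return the full path of every file found under the given directories.

    `directories` is assumed to be the list of man page directories,
    for example as reported by 'man -p'. Missing directories are skipped.
    """
    pages = []
    for top in directories:
        # os.walk descends into every subdirectory (man1, man3, ...).
        for dirpath, _dirnames, filenames in os.walk(top):
            for filename in sorted(filenames):
                pages.append(os.path.join(dirpath, filename))
    return pages
```

Each returned path would then be handed to the parser, one file at a time.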


The parsing-related code of makemandb is largely inspired by mandoc-db. I had no clue how to use libmandoc, let alone how to use it to parse specific portions of a man page, so it was a huge help for me. Thanks to Kristaps :)

makemandb will create a new SQLite database named 'apropos.db' (even if a database already exists). Before it starts inserting data, it creates a new virtual table in the database. The present schema of the virtual table is something like this:


Table name: mandb

Column Name   Description
name          For storing the name of the man page
name_desc     For storing the one-line description of the man page from the NAME section
desc          For storing the complete DESCRIPTION section
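As a rough sketch, a table with these three columns could be created and filled like this. The post does not say which virtual table module is used, so FTS4 here is an assumption, and the code falls back to an ordinary table if the SQLite build lacks it. The helper names create_mandb and insert_page are made up for this example.

```python
import sqlite3

def create_mandb(conn):
    """Create the mandb table with the three columns listed above."""
    try:
        # Assumption: a full-text-search virtual table (FTS4).
        conn.execute("CREATE VIRTUAL TABLE mandb USING fts4(name, name_desc, desc)")
    except sqlite3.OperationalError:
        # Fallback for SQLite builds without FTS4: an ordinary table.
        conn.execute('CREATE TABLE mandb (name, name_desc, "desc")')

def insert_page(conn, name, name_desc, desc):
    """Store one parsed man page. "desc" is quoted because DESC is an SQL keyword."""
    conn.execute(
        'INSERT INTO mandb (name, name_desc, "desc") VALUES (?, ?, ?)',
        (name, name_desc, desc),
    )
```

For example, after create_mandb(conn), a call such as insert_page(conn, "ls", "list directory contents", "...") would store a single page.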
 

Present Issues:

  1. Handling .Nm macros: .Nm macros seem to be special in the syntax of man pages. From what I have seen, the argument for the .Nm macro is specified only at the beginning of the man page (usually in the NAME section). Wherever .Nm is used without an argument in the rest of the page, the parser is expected to substitute the value given at the top. We are unable to handle this at the moment, so every later use of .Nm is simply ignored.
  2. Unable to parse escape sequences: Man pages are full of escape sequences. Presently our code does nothing special to handle them, so they are stored as is. The current version of mdocml has a new function, mandoc_escape(), which I think should help fix this. I hope to see the latest version of mdocml in -current so that I can use it.
  3. Unable to parse automatically generated man pages: Some man pages are generated automatically, so their syntax is very different from that of normal man pages, and we are unable to parse them at the moment.
There are a few more issues, which I have listed on GitHub: https://github.com/abhinav-upadhyay/apropos_replacement/issues
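To make the .Nm issue (item 1) concrete, here is a minimal sketch of the substitution that is currently missing. It works on raw source lines purely for illustration, while makemandb itself works on the parse tree from libmandoc. The function name expand_nm is hypothetical, and the sketch ignores details such as punctuation-only arguments.

```python
def expand_nm(lines):
    """Fill in argument-less .Nm macros with the name given by the first .Nm."""
    name = None
    expanded = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == ".Nm":
            if len(fields) > 1 and name is None:
                # First .Nm with an argument: remember the page name.
                name = fields[1]
            elif len(fields) == 1 and name is not None:
                # Bare .Nm later in the page: substitute the remembered name.
                line = ".Nm " + name
        expanded.append(line)
    return expanded
```

With the input lines ".Nm ls" followed later by a bare ".Nm", the second line comes back as ".Nm ls", so the name would no longer be lost from the DESCRIPTION text.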

So at the moment you can clone the repository, run make to compile the source, and run './makemandb'. If all goes well, a new SQLite database (apropos.db) will be created in the current directory. You can run some select queries against it to test.
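A test query could look something like the sketch below. It assumes only the mandb table and the name and name_desc columns described earlier. The search function is a made-up helper, and a plain LIKE match is used to keep the example independent of any full-text-search syntax.

```python
import sqlite3

def search(conn, term):
    """Return (name, name_desc) rows whose name or one-line description contains term."""
    pattern = "%" + term + "%"
    rows = conn.execute(
        "SELECT name, name_desc FROM mandb "
        "WHERE name LIKE ? OR name_desc LIKE ? ORDER BY name",
        (pattern, pattern),
    )
    return rows.fetchall()
```

Against the generated database this would be called as search(sqlite3.connect("apropos.db"), "directory"), or the same SELECT can be typed directly into the sqlite3 shell.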

Feedback will be highly appreciated :)

2 comments:

  1. Your project scares me big time, Abhinav. It seems to be a very challenging task :P Good job; a plethora of work is done :)

  2. Hey Srishti, thanks a lot :-). Hahaha, my project is not at all scary :-P but yes, it is an interesting project :-)
