Tuesday, August 23, 2011

Final Report: NetBSD GSoC-2011 Apropos Replacement

This is the final status report for this project for this summer. I will try to summarise what the project was all about and what goals have been achieved.

Objective Of The Project: We all are well aware of the importance of quality documentation and Unix like operating systems ship with documentation in the form of manual pages (or man pages in short). They have been there since the early days to aid the system administrators as well as programmers in configuring and modifying the system. A very basic search utility was provided in the form of apropos(1). I am not aware the history apropos(1) but my guess is that, it was kept simple because of the hardware limitations of the day. There is usually a plain text file (whatis.db usually) which indexes the NAME section of the man pages and apropos(1) simply searches this file for the keywords specified by the user on the command line.

In today's time we have much more powerful machines and also the research in information retrieval has progressed a lot. So the objective of this project was to develop a replacement for apropos(1) so as to develop a complete search tool. The aim was to index the complete content of the man pages and develop a tool to search that index.

Full text search of the man pages was the main task but this entailed a few small other benefits to NetBSD as well. I will be discussing them as I go along.

Implementation Strategy:

The implementation had largely two main stages:
  1. Parsing and Indexing Man Pages: The starting point of the project was to develop a utility which would parse all the man pages and build a full text search index. For parsing the man pages I used the libmandoc parser from the mdocml which is really an excellent and innovative project. And for the full text search I used the FTS engine of Sqlite. I used these tools to develop makemandb, which would traverse the set of directories containing man pages, parse them, extract the relevant data and store them in an FTS database.
  2. Search: Once the FTS index was there, the next thing to do was to develop a tool to search the database and a mechanism to rank the search results. A long part of the project involved experimenting what techniques for ranking the results well and which ones did not work very well. This stage lead to the development a new version of apropos.
Deliverables: So the project resulted in following deliverables:

  • makemandb: It is a command line utility to parse the man pages installed on the system and build a full text search database as described above. 
  • An option ('p') to man(1) to print the list of directories containing the man page sources.
  • apropos: This is the 2nd command line utility to search the FTS database. It  sanitizes the user query, removes any stopwords from it, executes the search, and ranks the results in decreasing order of their weights which are computed on the basis of a term-frequency and inverse-document-frequency based algorithm.
  • An API: I developed a very small API (consisting of 4 functions or so) to allow building custom interfaces. For example in future it will be easy to build a CGI application on top of this code, or maybe a desktop GUI client.
  • Documentation: I have also developed detailed documentation in the form of manual pages. 
  • There is also a patch for man(1) to modify the way symlinks/hardlinks are handled. This patch would eliminate the need of maintaining symlinks/hardlinks of man pages on the file system. A lot of man page files are simply links to other pages and this has to be specified explicitly in the makefiles of the packages via MKLINKS. This patch would take away this pain and also wipe off all the symlinks/hardlinks from the file system.

 Challenges on the way: This was a relatively easier project to do, but every project has it's challenges.
  • Parsing man pages: I had never looked at how man pages are written before this so I was overwhelmed. Though libmandoc made my task very easy, my understanding about the syntax kept growing as I progressed in the project. It was also particularly challenging the man(7) based pages, while parsing mdoc(7) pages was relatively easy.
  • Ranking Algorithm: Another challenge was to come up with a suitable ranking algorithm which helps in bringing up the relevant results at the top. I did some study for this and experimented with a number of different algorithms, and finally settled with a mixture of different ranking schemes ;-).
  • Testing: On the way towards completion I had to make a number of changes in the code related to parsing. Initially I started with parsing only the NAME and DESCRIPTION section, then expanded to parsing all the sections and a lot of similar changes. Each time I made a change, I had to thoroughly review the results to make sure nothing wrong was happening as a result of the changes. I had to make sure both mdoc(7) and man(7) pages were getting parsed properly and the database had sane data. I agree that I should have written unit tests instead, maybe this is something worth doing in coming days.
Results: Attaching the output of some sample runs.

#apropos "add a new user"

#apropos "generate password hash"

#apropos "convert signal number to string"

#apropos -3 "compute log"

#apropos "realtek"

I owe a big chunk of the success to my mentor Jörg Sonnenberger who was always there to answer my questions, offer advice and review the code. I have learnt a great deal from him and I am sure I have improved as a programmer. The best thing about working with him was that he never really disclosed the solution, instead he gently guided towards the direction of the solution, so I never lost a learning opportunity :-)

David Young also offered valuable guidance during the project. He provided some clever insights and tips to improve the search and ranking of the results. I decided to decompose the database into more columns based on different sections in a man page based on his idea only. 

Thanks to Kristaps Dzonsons as well who is responsible for the mdocml project. He also reviewed the code related to parsing of the pages and pointed out bugs in the code. I implemented makemandb based on his utility "mandocdb", so that was also a huge help.

Special thanks goes to Thomas Klausner for reviewing the man pages I wrote and also proving patches for the errors/mistakes I had made in them.
I must also thank Julio Merino, Jan Schaumann, Jukka Ruohonen, S.P.Zeidler for the interest they showed in the project and offered help throughout :-)

And thanks to lots of other people in the community as well whose names I am forgetting. While there I would like to thank my friends and family as well for keeping up with me when I slept during the day and worked at night.


What Next? 

Well no guesses here, I enjoyed my experience and I would try to grab another project and continue working in the NetBSD community. Systems programming always attracted me and although I don't have much practical knowledge in this field but I don't mind learning ;-)