
Tuesday, August 23, 2011

Final Report: NetBSD GSoC-2011 Apropos Replacement

This is the final status report for this summer's project. I will try to summarise what the project was about and which goals have been achieved.

Objective Of The Project: We are all well aware of the importance of quality documentation, and Unix-like operating systems ship theirs in the form of manual pages (man pages for short). They have been there since the early days to aid system administrators as well as programmers in configuring and modifying the system. A very basic search utility was provided in the form of apropos(1). I am not aware of the history of apropos(1), but my guess is that it was kept simple because of the hardware limitations of the day. There is usually a plain text file (typically whatis.db) which indexes the NAME section of the man pages, and apropos(1) simply searches this file for the keywords specified by the user on the command line.

Today we have much more powerful machines, and research in information retrieval has progressed a lot. So the objective of this project was to develop a replacement for apropos(1) that is a complete search tool: index the full content of the man pages and provide a tool to search that index.

Full text search of the man pages was the main task, but it brought a few other small benefits to NetBSD as well. I will discuss them as I go along.

Implementation Strategy:

The implementation had largely two main stages:
  1. Parsing and Indexing Man Pages: The starting point of the project was to develop a utility which would parse all the man pages and build a full text search index. For parsing the man pages I used the libmandoc parser from mdocml, which is an excellent and innovative project, and for the full text search I used the FTS engine of Sqlite. With these tools I developed makemandb, which traverses the directories containing man pages, parses them, extracts the relevant data and stores it in an FTS database.
  2. Search: Once the FTS index was there, the next thing was to develop a tool to search the database and a mechanism to rank the search results. A long part of the project involved experimenting with which ranking techniques worked well and which did not. This stage led to the development of a new version of apropos (a minimal sketch of the search side follows right after this list).
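
To give an idea of what the search side boils down to, here is a minimal sketch of running an FTS query against the database with the Sqlite C API (it uses the mandb table and name column described further down in these posts; the real apropos additionally layers a custom rank function on top of this):

#include <stdio.h>
#include <sqlite3.h>

int
main(int argc, char **argv)
{
    sqlite3 *db;
    sqlite3_stmt *stmt;
    /* snippet() returns a small excerpt of the matching text */
    const char *sql = "SELECT name, snippet(mandb) FROM mandb WHERE mandb MATCH ?";

    if (argc < 2 || sqlite3_open("apropos.db", &db) != SQLITE_OK)
        return 1;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
        return 1;
    sqlite3_bind_text(stmt, 1, argv[1], -1, SQLITE_STATIC);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        printf("%s\n%s\n\n", (const char *)sqlite3_column_text(stmt, 0),
            (const char *)sqlite3_column_text(stmt, 1));
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
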
Deliverables: The project resulted in the following deliverables:

  • makemandb: It is a command line utility to parse the man pages installed on the system and build a full text search database as described above. 
  • An option (-p) to man(1) to print the list of directories containing the man page sources.
  • apropos: This is the second command line utility; it searches the FTS database. It sanitizes the user query, removes any stopwords from it, executes the search, and ranks the results in decreasing order of their weights, which are computed with a term-frequency and inverse-document-frequency based algorithm.
  • An API: I developed a very small API (consisting of four functions or so) to allow building custom interfaces, for example a CGI application on top of this code, or perhaps a desktop GUI client in the future (a purely illustrative sketch of what such an interface could look like follows after this list).
  • Documentation: I have also developed detailed documentation in the form of manual pages. 
  • There is also a patch for man(1) to change the way symlinks/hardlinks are handled, which would eliminate the need to maintain symlinks/hardlinks of man pages on the file system. A lot of man page files are simply links to other pages, and this has to be specified explicitly in the makefiles of the packages via MLINKS. This patch would take away that pain and also wipe all the symlinks/hardlinks off the file system.
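
Purely as an illustration of the shape of that API (these names are hypothetical; the real ones live in the project's header on Github), it is essentially:

#include <sqlite3.h>

/* callback invoked once per search result */
typedef int (*apropos_callback)(void *data, const char *name,
    const char *section, const char *snippet);

sqlite3 *init_db(const char *dbpath);      /* open the FTS database */
int run_query(sqlite3 *db, const char *query,
    apropos_callback cb, void *data);      /* search, rank, invoke cb per result */
void close_db(sqlite3 *db);                /* close the database */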

Challenges on the way: This was a relatively easy project, but every project has its challenges.
  • Parsing man pages: I had never looked at how man pages are written before this, so I was overwhelmed at first. Though libmandoc made my task much easier, my understanding of the syntax kept growing as I progressed through the project. Parsing the man(7) based pages was particularly challenging, while parsing mdoc(7) pages was relatively easy.
  • Ranking Algorithm: Another challenge was to come up with a suitable ranking algorithm which brings the relevant results to the top. I did some study for this, experimented with a number of different algorithms, and finally settled on a mixture of different ranking schemes ;-).
  • Testing: On the way to completion I had to make a number of changes to the parsing-related code. Initially I started with parsing only the NAME and DESCRIPTION sections, then expanded to parsing all the sections, and so on. Each time I made a change, I had to thoroughly review the results to make sure nothing went wrong as a result. I had to make sure both mdoc(7) and man(7) pages were being parsed properly and that the database contained sane data. I agree that I should have written unit tests instead; maybe this is something worth doing in the coming days.
Results: Below are some of the sample queries I ran (the output screenshots from the original post are not reproduced here):

#apropos "add a new user"

#apropos "generate password hash"

#apropos "convert signal number to string"

#apropos -3 "compute log"

#apropos "realtek"

Acknowledgements: 
I owe a big chunk of the success to my mentor Jörg Sonnenberger, who was always there to answer my questions, offer advice and review the code. I have learnt a great deal from him and I am sure I have improved as a programmer. The best thing about working with him was that he never simply disclosed the solution; instead he gently guided me in the direction of the solution, so I never lost a learning opportunity :-)

David Young also offered valuable guidance during the project. He provided some clever insights and tips to improve the search and the ranking of the results. It was his idea that led me to decompose the database into more columns based on the different sections of a man page.

Thanks to Kristaps Dzonsons as well, who is responsible for the mdocml project. He reviewed the code related to parsing of the pages and pointed out bugs in it. I implemented makemandb based on his utility "mandocdb", so that was also a huge help.

Special thanks goes to Thomas Klausner for reviewing the man pages I wrote and for providing patches for the errors/mistakes I had made in them.
I must also thank Julio Merino, Jan Schaumann, Jukka Ruohonen and S.P.Zeidler for the interest they showed in the project and the help they offered throughout :-)

And thanks to lots of other people in the community as well whose names I am forgetting. While I am at it, I would like to thank my friends and family as well for putting up with me while I slept during the day and worked at night.

 

What Next? 

Well, no guesses here: I enjoyed the experience, and I will try to grab another project and continue working in the NetBSD community. Systems programming has always attracted me, and although I don't have much practical experience in this field, I don't mind learning ;-)

    Sunday, July 31, 2011

    NetBSD GSoC: Project Update 5

    First of all, thanks to Joerg, David Young and everyone else involved with GSoC, as I cleared the midterm evaluations. Twenty days have passed since I last posted an update on the project and I did not even realize it. I apologize if I seem to have been inactive, but in my defense I would simply point to my commit log ;-). I have been actively pushing changes, fixing issues and adding new features all this while. As the project grows, it takes more and more time to make new changes, test them properly and tie up any loose ends.

    So here is a brief overview of the things I did in the last three weeks:

    [New Feature] Search Within Specific Sections: This feature had been on my TODO list for a very long time, but something more important always kept coming up. I wanted to get it done before posting this update, so here it is:
    I have added options to support search within one or more specified sections.
    Commit: 966c7ba
    You can do a search like this:
    $apropos -1 "copy files"
    
    #It will search in section 1 only.
    
    $apropos -18 "adding new user"
    
    
    #This will search in section 1 and 8 only.
    #I hope you get the idea :)
    

    Some sample runs:
        http://paste2.org/p/1554491
        http://paste2.org/p/1554510
        http://paste2.org/p/1554509

    Indexing Performance Improvement: Joerg suggested a clever way to bring down the time taken by makemandb to index the pages. Instead of doing a separate transaction for each page, it is better to index all the pages inside a single transaction, which decreases the I/O overhead substantially. I made the changes, and the indexing time came down from about 3 minutes to roughly 30 seconds (sketched below).
    Commit: 926746
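
    The idea in a nutshell (index_page() here is a hypothetical stand-in for the real parsing and INSERT code):

    #include <sqlite3.h>

    static int
    index_page(sqlite3 *db, const char *path)
    {
        /* stub: the real code parses `path' with libmandoc and INSERTs it */
        (void)db; (void)path;
        return 0;
    }

    static int
    index_all_pages(sqlite3 *db, char **pages, int npages)
    {
        /* one transaction around all the INSERTs instead of one per page */
        if (sqlite3_exec(db, "BEGIN", NULL, NULL, NULL) != SQLITE_OK)
            return -1;
        for (int i = 0; i < npages; i++)
            index_page(db, pages[i]);
        return sqlite3_exec(db, "COMMIT", NULL, NULL, NULL) == SQLITE_OK ? 0 : -1;
    }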

    Parse And Index man(7) Pages: Until now we were indexing only the mdoc(7) pages, and all the man(7) based pages were being ignored. Since the project was working quite well with mdoc(7) pages, it was time to scale up. Parsing man(7) pages was a bit more difficult than parsing mdoc(7) pages. It took some 2-3 days to implement this code and another 2-3 days to fix various bugs and test that it worked with the 7000+ man pages I have.
    Commit: 2014855

    Too Large DB Size (Regression): Parsing both man(7) and mdoc(7) meant that I was indexing a whole lot of man pages (7613 to be exact). This scale-up in the number of indexed pages also scaled up some problems which were not really visible before. One major problem that came up was the size of the DB: it had grown to almost 99M.
    Root Cause: The root cause was that we were also storing all the unique terms in the corpus and their term-weights in a separate table, which was almost doubling the space requirements.
    Solution: As a quick solution I decided to remove the code related to pre-computation of the term-weights and drop this table. This brought the DB size down to around 60M, and with a few optimizations it has come down to the range of 30-40M.
    Drawbacks: The pre-computation of weights had its advantages; I was using it to implement some advanced ranking algorithms and had plans to improve the ranking further on the basis of this work, but I had to let it go.
    Alternatives: The extra space was only helping to get more accurate results, so it was a trade-off between space and search quality. One alternative would be to let the user decide what they want and choose between the two versions.
    Commit: 7928fc5


    Added Compression Option To The DB: To bring the DB size down further, I implemented code for compressing and decompressing the data using zlib(3). It was also an exercise in making the zlib interface work with Sqlite (a sketch of one way to wire it up follows below).
    As a result, the DB size came down to 43M.
    Commit: d878815
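
    For the curious, one way to hook zlib into the table is FTS4's compress=/uncompress= options, which name a pair of custom SQL functions. The sketch below is my illustration of that approach, not the committed code; it stores the original length in a 4-byte prefix so the inverse function can size its buffer:

    #include <stdlib.h>
    #include <sqlite3.h>
    #include <zlib.h>

    /* "zip": deflate the column text, prefixing it with the original size */
    static void
    zip_func(sqlite3_context *ctx, int nval, sqlite3_value **apval)
    {
        const Bytef *src = sqlite3_value_blob(apval[0]);
        uLong srclen = (uLong)sqlite3_value_bytes(apval[0]);
        uLongf dstlen = compressBound(srclen);
        Bytef *dst;

        (void)nval;
        if (src == NULL) {
            sqlite3_result_null(ctx);
            return;
        }
        if ((dst = malloc(dstlen + 4)) == NULL) {
            sqlite3_result_error_nomem(ctx);
            return;
        }
        dst[0] = srclen & 0xff;            /* original size, little endian */
        dst[1] = (srclen >> 8) & 0xff;
        dst[2] = (srclen >> 16) & 0xff;
        dst[3] = (srclen >> 24) & 0xff;
        if (compress(dst + 4, &dstlen, src, srclen) != Z_OK) {
            free(dst);
            sqlite3_result_error(ctx, "compress failed", -1);
            return;
        }
        sqlite3_result_blob(ctx, dst, (int)dstlen + 4, free);
    }

    /*
     * The matching "unzip" function reads the 4-byte prefix, allocates a
     * buffer of that size and calls uncompress(3); it is symmetric and
     * omitted here. Both get registered with sqlite3_create_function() as
     * "zip"/"unzip", and the table is declared with:
     *   CREATE VIRTUAL TABLE mandb USING fts4(..., compress=zip, uncompress=unzip);
     */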


    Stopword Tokenizer: Implementing a custom tokenizer to filter out stopwords was already on my TODO list, but with the increased DB size it became a priority. I patched the porter tokenizer from the Sqlite source to filter out stopwords.
    The tokenizer seemed to work fine, and it also helped bring the DB size down: with the stopword tokenizer the size came to around 31M.
    Due to a small bug I have disabled this tokenizer for now.
    Commit: 76b4769

    Parsing Additional Sections & Storing Them In Individual Columns: This was a required change. With such a large number of pages (7613) in the db, keeping all of the content in a single column meant a lot of noise, and the search results were off the mark by a wide margin. David Young had also suggested this previously, so that prominent sections like "DIAGNOSTICS" can be given more weight than others like "ERRORS", etc.
    It was a big task. I started with decomposing the mdoc(7) pages, then the man(7) pages, and then sat down to fix apropos to take into account the new columns in the DB and to fix the ranking function.

    I would say the time taken to implement this was worth it, because it has helped make the code cleaner. In the future, if there is a requirement to parse another section, it will only require adding a case to a switch statement and a couple of extra lines of code.

    /*
     * lib, synopsis, return_vals, etc. are buffers accumulating the text of
     * the corresponding sections; each of them ends up in its own column of
     * the FTS table, and anything not handled explicitly goes into desc.
     */
    static void
    mdoc_parse_section(enum mdoc_sec sec, const char *string)
    {
        switch (sec) {
        case SEC_LIBRARY:
            concat(&lib, string);
            break;
        case SEC_SYNOPSIS:
            concat(&synopsis, string);
            break;
        case SEC_RETURN_VALUES:
            concat(&return_vals, string);
            break;
        case SEC_ENVIRONMENT:
            concat(&env, string);
            break;
        case SEC_FILES:
            concat(&files, string);
            break;
        case SEC_EXIT_STATUS:
            concat(&exit_status, string);
            break;
        case SEC_DIAGNOSTICS:
            concat(&diagnostics, string);
            break;
        case SEC_ERRORS:
            concat(&errors, string);
            break;
        case SEC_NAME:
            break;
        default:
            concat(&desc, string);
            break;
        }
    }
    


    It also allows the ranking function to be fine-tuned and experimented with easily. If you want to play with the search, you can simply modify the column weights and rebuild to see the effect. The column weights are stored as a double array in the rank_func function.

    double col_weights[] = {
        2.0, // NAME
        2.00, // Name-description
        0.55, // DESCRIPTION
        0.25, // LIBRARY
        0.10, //SYNOPSIS
        0.001, //RETURN VALUES
        0.20, //ENVIRONMENT
        0.01, //FILES
        0.001, //EXIT STATUS
        2.00, //DIAGNOSTICS
        0.05 //ERRORS
    };
    



    [Feature Proposal] Show additional data with search results: Storing the different sections in separate columns has its advantages as well, one of them being the ability to fetch and show more specific content with the search results. I have already done something like this: if you look at the search results now, you will also see the one line description of each result (the .Nd macro).
    Similarly, it is possible to show the library, exit values and return values where available. But I was wondering whether this is a useful feature? Any views?

    Besides this, there are a lot of other things to be done that I mentioned in my proposal, like a CGI based interface and using the database for managing the man page aliases. These are now on top of my TODO list, and if no big issues come up, I would like to pick them up.

    Friday, July 8, 2011

    NetBSD GSoC: Midterm Project Update

    Another update on the project after a gap of two weeks. I did not have much to write about last week, and come this week we have reached the stage of midterm evaluations. I will try to explain what new changes I made in the last two weeks, the overall status of the project, and the other things I am planning to implement.

    Printing the section number along with search results: Last time around when I posted a few sample runs of the project, there were no section numbers along with the search results, but now we have them (thanks to Kristaps for suggesting the right way to extract meta data).

    Improve the ranking algorithm (Implement tf-idf): tf-idf based term weighting schemes are very common in information retrieval systems. Until now, I was ranking the results only on the basis of the term frequency (tf). I improved this by including another factor in the ranking, i.e., the inverse document frequency (idf).

    • Term Frequency: usually defined as the number of times a given term appears in a particular document.
    • Inverse Document Frequency: the IDF of a term is based on how many documents the term appears in (at least once); the fewer the documents, the higher the IDF.

    Term frequency is a local factor; it is concerned only with the number of occurrences of the search terms in one particular document at a time.
    Inverse document frequency, on the other hand, is a global factor, in the sense that it indicates the discriminating power of a term: if a term appears in only a select set of documents, that term separates this set of documents from the rest. So a ranking obtained by combining these two factors brings up more relevant documents.

    So the weight of a term t in document d is calculated by the following formula:

    weight  = tf * idf

    Where tf = Term frequency of term t in document d
    idf = log (N / Nt)

    Where N = Total number of documents in the corpus
    Nt = Number of documents in which term t occurs (at least once).
    So a term which appears in only one document will have
    IDF = log(N),
    while a term which appears in all the documents will have
    IDF = log(1) = 0.

    For example, a term like "the" will have a high term frequency in any document, but at the same time it will have a very low inverse document frequency (close to 0), which will nullify its effect on the quality of the search results.
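
    To put some made-up numbers on it: in a corpus of, say, 5000 pages, a term occurring in only 10 of them gets idf = log(5000/10) ≈ 6.2 (taking natural logarithms), so even a modest term frequency gives it a sizeable weight, whereas a term occurring in 4990 of them gets idf = log(5000/4990) ≈ 0.002, so its weight stays negligible no matter how often it repeats inside a page.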


    Pre-compute the term-weights: While the tf-idf based term-weighting scheme improved the quality of the search, it degraded its performance. I could see apropos taking its time flushing out the results. The reason was that all the term-weight calculations were being done on the fly when running apropos. An obvious solution was to pre-compute the term-weights while creating the index and store them in the database. Then, while searching, we only need to look up the database rather than do both a lookup and the calculations!

    I implemented the code for pre-computing term weights in makemandb, but to my surprise these changes made makemandb painfully slow. Earlier, makemandb could index the man pages in under 2 minutes, but now it was taking close to 3 hours to pre-compute the weights of all the unique terms in the corpus. In addition, there were some bugs causing large deviations in the term-weights. I decided to get the calculations right first; that took me 3 days, since after each change in the code I had to re-run makemandb to redo the indexing and check the results. Finally I got it right, and after some discussions with Joerg the performance issue was also fixed. Basically, the solution was to move most of the processing inside Sqlite. Now makemandb does the indexing and the pre-computation of weights in under 3 minutes on my machine :-)

    Further Improve the Ranking Algorithm: In my free time I am doing some study on Information Retrieval. During my studies I came across a very interesting research paper by Salton and Buckley from 1988, in which they discussed different term weighting schemes and their results. According to their study, the following formula for calculating term weights is most effective:


    for the weight of a given term in a particular document: [the formula image from the original post is not reproduced here]
    I implemented this in a somewhat simpler form: I dropped the calculation of powers in the denominator (the square root and the squares) to avoid unnecessary overhead, as these calculations are done on the fly by apropos. The results have been pretty good.
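
    For reference, my reading of the weighting scheme recommended in that paper (a paraphrase, not necessarily the exact formula used in the code) is the tf * idf product normalized by the length of the document's weight vector, i.e. weight(t, d) = (tf * idf) / sqrt(sum over all terms t' in d of (tf(t') * idf(t'))^2); that denominator is where the square root and squares mentioned above come from.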

    Sample Results: http://pastebin.com/PjdNY68m



    Note: The above-mentioned change is in the search branch only at the moment. I did not merge it into master so that, if you want to compare the results before and after the change, you can easily check out the master and search branches and see for yourself :-)

    A Keyword Recognizer: I have been thinking of implementing this feature for a while. Basically, the idea is to scan the query for keywords which indicate that the user is probably looking for results from a particular section. For example, "functions to copy string" gives an indication that the user is looking for standard library functions from section 3.

    After some discussions with David, we came to the conclusion that probably a better way to implement a feature like this would be to do something like Google does. Google allows you to search within a specific website using a syntax like:
    [book on IR site: amazon.com]. 
    David suggested using a similar interface, where the user could specify a section using a colon. So, for example:
    apropos "kernel: function to allocate memory" 
    will search only within section 9.
    I started some work on this feature but it didn't work out properly, so it is on hold at the moment. I hope to resume work on it soon, but at the same time I would like to know whether this feature is worth it.


    Where Do We Stand At Midterm Evaluation: As I promised in my proposal, I have accomplished most of the requirements just in time for the midterm evaluation (maybe I need to write some documentation). Now is the time for some community feedback :-). I would love to hear about

    • How good or bad are the search results? If for some query you feel that the right results are not coming up, please write to me with that query and the results you expected to see at the top.
    • If you want to see any improvements or new features, tell me about them.

    What New Features Are Next? Apart from the keyword recognizer, there are a couple of other features I have in mind, although whether I will implement them is a different matter, as I first need to make sure they are feasible.

    A Link Analysis Algorithm For Ranking: Search engines these days do two types of ranking.

    1. Content based ranking: It is concerned with finding relevant documents by matching the content. For example the tf-idf based term-weighting scheme is one way of doing content based ranking.
    2. Popularity Based Ranking: It tries to rank the documents based on their popularity. This popularity is calculated on the basis of a link analysis algorithm. For example Google's PageRank or Jon Kleinberg's HITS algorithm. 

    I am studying the PageRank algorithm and I am tempted to implement it, but I am held back by the fact that Stanford University has a patent on the PageRank process, so I am in a dilemma over whether I should implement it or not.

    A Spell Checker: It is very common for users to make a typo while searching, which might lead to no results at all or, in some cases, to wrong results. I am thinking of adding a spell checker which, in case no results are found, would suggest some related search terms to the user (assuming that perhaps they made a typo).

    I am held back on this because I have personally never looked at the techniques involved in spell checkers, but I have heard that they are computationally very expensive.


    Testing out apropos:

    #Clone the repository:
    $git clone git://github.com/abhinav-upadhyay/apropos_replacement.git
    
    
    #Run make
    $make
    
    #Run makemandb
    $./makemandb
    
    #Run apropos
    $./apropos "list directories"
    

    By default you will be on the master branch. The search branch has an improved ranking algorithm, so you might want to check it out and compare the results before and after the algorithm improvement:

    $git checkout -b search origin/search

    and run make again to build it.

    Prerequisites:
    1. You will need the -Current version of man(1) from CVS. Joerg committed my patch for adding the -p option to man(1) which is being used by makemandb.
    2. You will also want to have the -current version of the man pages in /usr/share/man (at least).
    3. libmandoc. I am using the version of libmandoc available with -current (which at the moment is 1.11.1). You can build it by running make && make install in /usr/src/external/bsd/mdocml 
    Feedback is welcome :-)

      Wednesday, June 22, 2011

      NetBSD GSoC Weekly Report 3

      This week I got some more work done: I did a barebones implementation of apropos(1) and fixed some nasty (and some not so nasty) issues in makemandb.

      Issues Fixed:

      • Handling .Nm macros: As I said in the last post, .Nm macros are a special case. They are given an argument only once, at the beginning of the man page, and everywhere else .Nm occurs the parser replaces it with that previously specified argument value. I just had to add an extra check to see whether we have encountered a .Nm macro and substitute its value. Here is the commit which fixed it: bbdab19
      • ENOMEM in traversedir(): This was a nasty memory leak in makemandb which cost me a couple of nights' sleep. I somehow managed to track down the offending piece of code, with Joerg's help of course :-) Here was the problem: code similar to that quoted below was running in a loop:
      char *desc = NULL;
      if (desc == NULL)
          desc = strdup(n->string);
      else
          asprintf(&desc, "%s %s", desc, n->string);

      So the above asprintf call was leaking the previous desc buffer on each iteration of the loop. This was causing makemandb to consume up to 2.6 GB of memory (3 GB being my total physical memory). After fixing this bug, makemandb consumes around 5 to 6 MB of memory :-)
      This is the commit which fixed it: cd53b9b
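
      The actual fix is in the commit above; one way to plug a leak of this shape is to build the new string into a temporary and free the old buffer before replacing it:

      char *tmp = NULL;
      asprintf(&tmp, "%s %s", desc, n->string);
      free(desc);        /* release the previous buffer instead of dropping it */
      desc = tmp;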

      • Avoid Hardlinks: After running a few queries against the database, I noticed that some of the man pages were indexed multiple times; for example, csh had 26 duplicate entries. Joerg told me that this is due to hardlinks: a lot of man page files are nothing but hardlinks to other man pages. To handle this I added a function check_md5 to makemandb, so before we start to parse a new file, we first calculate its md5 hash and check in the database whether it is already indexed (I added a new column for storing the hash as well; a sketch of the idea follows below). Here is the commit: 14b024f
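
      A sketch of that idea (the column name here is illustrative; NetBSD's md5(3) provides MD5File()):

      #include <md5.h>
      #include <sqlite3.h>

      /* return 1 if a page with the same md5 hash is already in the database */
      static int
      is_already_indexed(sqlite3 *db, const char *file)
      {
          char hash[33];
          sqlite3_stmt *stmt;
          int found = 0;

          if (MD5File(file, hash) == NULL)
              return 0;
          if (sqlite3_prepare_v2(db, "SELECT 1 FROM mandb WHERE md5_hash = ?",
              -1, &stmt, NULL) != SQLITE_OK)
              return 0;
          sqlite3_bind_text(stmt, 1, hash, -1, SQLITE_STATIC);
          if (sqlite3_step(stmt) == SQLITE_ROW)
              found = 1;
          sqlite3_finalize(stmt);
          return found;
      }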

      Implementation of apropos.c: Besides fixing some issues, I was also able to write a barebones version of apropos(1). The initial version was pretty basic: it would take the user query as a command line argument and simply run it against the database using the FTS engine of Sqlite. The results were not very good; as Sqlite's FTS documentation itself says, it performs a boolean search, so it is up to us to do the mathematics for finding the more relevant documents and ranking them higher in the results. The master branch on Github still has this basic version of apropos.

      I have started a new experimental branch, search, on Github, where I will experiment with the search related code, and after some reviews and feedback I will cherry-pick the commits which look good.

      So currently the search branch has the following two features:

      Stopword Filter: I noticed that Sqlite does not filter the user query for stopwords, and tries to match the stopwords as well while performing the search, so I have implemented a stopword filter.
      It works something like this: we store all the stopwords in a hash table and scan the user query word by word in a loop; at each iteration we look up the hash table to see whether the word is a stopword, and if it is, we omit it from the query (a minimal sketch follows below). Here is the commit: ec25546
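
      A minimal sketch of that filter, using the POSIX hcreate(3)/hsearch(3) hash table (the stopword list here is obviously truncated):

      #include <search.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      static const char *stopwords[] = { "a", "an", "the", "of", "to", "how", NULL };

      static void
      build_stopword_table(void)
      {
          hcreate(256);
          for (int i = 0; stopwords[i] != NULL; i++) {
              ENTRY e = { (char *)stopwords[i], NULL };
              hsearch(e, ENTER);
          }
      }

      /* return a newly allocated copy of the query with stopwords dropped */
      static char *
      filter_stopwords(const char *query)
      {
          char *copy = strdup(query);
          char *clean = calloc(strlen(query) + 1, 1);
          char *word, *last;

          for (word = strtok_r(copy, " \t", &last); word != NULL;
              word = strtok_r(NULL, " \t", &last)) {
              ENTRY e = { word, NULL };
              if (hsearch(e, FIND) != NULL)
                  continue;                 /* it is a stopword, skip it */
              if (clean[0] != '\0')
                  strcat(clean, " ");
              strcat(clean, word);
          }
          free(copy);
          return clean;
      }

      int
      main(void)
      {
          build_stopword_table();
          char *q = filter_stopwords("how to copy a file");
          printf("%s\n", q);                /* prints "copy file" */
          free(q);
          return 0;
      }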

      A Ranking Function: As I said above, the plain Sqlite search wasn't much help, so we need to write a ranking function which tells Sqlite which search results are important and shows them higher in the output. Sqlite's FTS documentation provides a sample ranking function which is very simple but effective. I didn't try to fully understand it (I just wanted to see the effect of a ranking function on the search results), but it seems to be based on finding the term frequency of the search phrases for each column in the database and multiplying it by a static weight assigned to that column; this procedure is repeated for each term in the query to find the weight of each column. The overall rank of a page is obtained by summing up the weights of the individual columns thus calculated (a sketch of such a function follows below).

      Commit for this: 001a679fe9a4b4c04a8d
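
      For the curious, the sample function from the Sqlite FTS documentation is roughly of the following shape; it is registered with sqlite3_create_function() and used in the query as rank(matchinfo(mandb), w0, w1, ...), with one static weight per column:

      #include <sqlite3.h>

      static void
      rank_func(sqlite3_context *ctx, int nval, sqlite3_value **apval)
      {
          /* matchinfo() blob: number of phrases, number of columns, then
             3 integers per phrase/column pair (hits in this row, hits in
             all rows, rows with at least one hit) */
          const unsigned int *info =
              (const unsigned int *)sqlite3_value_blob(apval[0]);
          int nphrase = info[0];
          int ncol = info[1];
          double score = 0.0;

          for (int p = 0; p < nphrase; p++) {
              const unsigned int *phinfo = &info[2 + p * ncol * 3];
              for (int c = 0; c < ncol && c + 1 < nval; c++) {
                  double hits_this_row = phinfo[3 * c];
                  double hits_all_rows = phinfo[3 * c + 1];
                  double weight = sqlite3_value_double(apval[c + 1]);
                  if (hits_this_row > 0.0)
                      score += (hits_this_row / hits_all_rows) * weight;
              }
          }
          sqlite3_result_double(ctx, score);
      }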

      Some Sample Runs: I ran some sample queries to check how this ranking function performs. The results are much improved compared to having no ranking at all, but there is still much scope for improvement. Following is a sample run. If you would like to see a few others, I pasted the output of some queries on pastebin: http://pastebin.com/qhQBRNd5
      $ ./apropos "copy string"            
      memccpy
      The memccpy function copies bytes from string src to string dst . If the character c...
      
      strndup
      ...copies at most len characters from the string str always NUL terminating the copied 
      string...
      
      bcopy
      ...copies len bytes from string src to string dst . Unlike bcopy 3 the two strings...
      
      strlcat
      size-bounded string copying and concatenation
      
      bcopy
      ...bcopy function copies len bytes from string src to string dst . The two strings may...
      
      memcpy
      The memcpy function copies len bytes from string src to string dst . The arguments must...
      
      memmove
      ...memmove function copies len bytes from string src to string dst . The two strings may...
      
      memmove
      ...memmove function copies len bytes from string src to string dst . The two strings may...
      
      memcpy
      The memcpy function copies len bytes from string src to string dst . The arguments must...
      
      strncpy
      ...copy the string src to dst (including the terminating \e0 character). The strncpy function 
      copies...

      You might notice that a few results are repeated here. I believe this is a bug in apropos(1): some man pages have a number of different versions depending on the machine architecture, and I think the duplication in the results is because of that. I need to fix it :-)


      How to Test: If you are interested in checking out the functionality of the project, you are welcome to; I would appreciate it even more if you report back any issues you notice or any feedback on how the search results can be improved.

      #Clone the repository:
      $git clone git://github.com/abhinav-upadhyay/apropos_replacement.git
      
      
      #Run make
      $make
      
      #Run makemandb
      $./makemandb
      
      #Run apropos
      $./apropos "list directories"
      

      By default you will be on the master branch, which currently does not have the stopword filter and ranking function features. So you might want to check out the search branch; for that:

      $git checkout -b search origin/search

      and run make again to build it.

      Prerequisites:
      1. You will need the -Current version of man(1) from CVS. Joerg committed my patch for adding the -p option to man(1) which is being used by makemandb.
      2. You will also want to have the -current version of the man pages in /usr/share/man (at least).
      3. libmandoc. I am using the version of libmandoc available with -current (which at the moment is 1.11.1). You can build it by running make && make install in /usr/src/external/bsd/mdocml

      I believe a lot of work and research is now required to make the search better. Any feedback and suggestions are highly welcome :-)

      Tuesday, June 14, 2011

      NetBSD GSoC Weekly Report 2

      This was a relatively productive week compared to the first week. A significant portion of the work towards our first milestone (a working prototype) got done.

      What did I do this week:

      Project Repository: The first thing I did was to create the project repository on Github. Here is the link: https://github.com/abhinav-upadhyay/apropos_replacement

      makemandb: This is one of the crucial components of the project. makemandb is supposed to parse each and every man page installed on the user's system and store them in an Sqlite database. The reason I say it is crucial is that we will be making a lot of changes to the database schema, to the way man pages are parsed and to what information is extracted from them, until we come near perfection in our search results. It is necessary to get this part right, because a good search experience comes only when the indexing has been done correctly.

      makemandb first calls 'man -p' and recursively traverses the resulting list of directories to get the complete paths of the man pages, which it then passes on to the libmandoc functions (a minimal sketch of the traversal follows below).
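
      Conceptually, the traversal is as simple as this sketch: read the directory list from 'man -p' and then descend into each directory, handing every page to the parser:

      #include <stdio.h>
      #include <string.h>

      int
      main(void)
      {
          FILE *fp = popen("man -p", "r");   /* one man page directory per line */
          char line[1024];

          if (fp == NULL)
              return 1;
          while (fgets(line, sizeof(line), fp) != NULL) {
              line[strcspn(line, "\n")] = '\0';
              /* descend into `line' (e.g. with fts(3)) and pass each man page
                 file it contains to the libmandoc parser */
              printf("would index pages under %s\n", line);
          }
          pclose(fp);
          return 0;
      }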


      The parsing related code of makemandb is largely inspired by mandoc-db, as I had no clue how to use libmandoc, let alone how to parse specific portions of the man pages, so it was a huge help for me. Thanks to Kristaps :)

      makemandb creates a new Sqlite database named 'apropos.db' (even if there is already an existing database). It creates a new virtual table in the database before starting to insert data; the present schema of the virtual table is something like this:


      Table name: mandb


      Column Name   Description
      name          For storing the name of the man page
      name_desc     For storing the one line description of the man page from the NAME section
      desc          For storing the complete DESCRIPTION section

      Present Issues:

      1. Handling .Nm macros: .Nm macros seem to be special in the syntax of man pages. From what I have seen, the argument for the .Nm macro is specified only at the beginning of the man page (usually in the NAME section); wherever .Nm is used in the rest of the page, the parser replaces it with that original value specified at the top. At the present moment we are unable to handle this, so wherever .Nm is used again it is simply ignored.
      2. Unable to parse escape sequences: Man pages are filled with escape sequences. Presently our code does not do anything special to handle them and they are passed through as-is. The current version of mdocml has a new function, mandoc_escape(), which I think should help rectify this. I hope to see the latest version of mdocml in -current so that I can use it.
      3. Unable to parse automatically generated man pages: Some man pages are generated automatically, and as a result their syntax is very different from that of normal man pages, so we are unable to parse them at the moment.
      There are a few more issues which I have listed on Github. (https://github.com/abhinav-upadhyay/apropos_replacement/issues).

      At the present moment you can clone the repository, run make to compile the source and run './makemandb'. If all goes well, a new sqlite database (apropos.db) will be created in the current directory. You can run some select queries against it to test (the sketch below shows roughly what the table creation behind this schema looks like).
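
      A minimal sketch of creating a virtual table with the schema above (I am assuming the FTS4 module here; treat the exact FTS flavour as an implementation detail):

      #include <sqlite3.h>

      static int
      create_schema(sqlite3 *db)
      {
          /* recreate the virtual table that holds the parsed man page data */
          const char *sql =
              "DROP TABLE IF EXISTS mandb;"
              "CREATE VIRTUAL TABLE mandb USING fts4(name, name_desc, desc)";
          return sqlite3_exec(db, sql, NULL, NULL, NULL) == SQLITE_OK ? 0 : -1;
      }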

      Feedback will be highly appreciated :)

      Friday, May 27, 2011

      GSoC Welcome Package

      I just received my welcome package from Google. I am overjoyed and can't stop looking at it, so I am sharing pictures of the things I received:


      The Citi Bank Prepaid Card:

      My Precioussss ;-)

      The GSoC Notebook, Pen and the Sticker



      The Welcome Letter from Google:


      Yaay!! I am happy :)