Sunday, July 31, 2011

NetBSD GSoC: Project Update 5

First of all, thanks to Joerg, David Young and everyone else involved with GSoC, as I cleared the midterm evaluations. Twenty days have passed since I last posted an update on the project, and I did not even realize it. I apologize if I seem to have been inactive, but in my defense I would just point to my commit log ;-). I have been actively pushing changes, fixing issues and adding new features all this while. As the project grows, it takes more and more time to make new changes, test them properly and tie up any loose ends.

So here is a brief overview of the things I did in the last 3 weeks:

[New Feature] Search Within Specific Sections: This feature has been on my TODO list for a long time, but something more important to deal with kept coming up. I wanted to get it done before posting this update, so here it is:
I have added options to restrict the search to one or more specified sections.
Commit: 966c7ba
You can do a search like this:
$apropos -1 "copy files"

#It will search in section 1 only.

$apropos -18 "adding new user"


#This will search in section 1 and 8 only.
#I hope you get the idea :)

Some sample runs:
http://paste2.org/p/1554491
http://paste2.org/p/1554510
http://paste2.org/p/1554509
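
Under the hood the idea is straightforward: collect the section numbers given as options and restrict the query to those sections. Here is a rough sketch of that option handling (simplified, with made-up names; not the actual apropos code):

/*
 * Sketch: gather section numbers passed as -1 ... -9 options so the
 * search can later be restricted to those sections.  Names and the
 * option handling are illustrative only.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char sections[10] = "";     /* e.g. "18" for -1 -8 */
    int ch;

    while ((ch = getopt(argc, argv, "123456789")) != -1) {
        if (ch == '?')
            return 1;
        if (strchr(sections, ch) == NULL) {
            size_t len = strlen(sections);
            sections[len] = (char)ch;
            sections[len + 1] = '\0';
        }
    }

    /* A later step would add a "section IN (...)" style filter to
       the query for each collected section number. */
    printf("searching in section(s): %s\n",
        sections[0] != '\0' ? sections : "all");
    return 0;
}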

Indexing Performance Improvement: Joerg suggested a clever way to bring down the time makemandb takes to index the pages: instead of running a separate transaction for each page, index all the pages inside a single transaction, which cuts the I/O overhead substantially. I made the changes, and the indexing time came down from around 3 minutes to roughly 30 seconds.
Commit: 926746
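
The gist of the change is easy to illustrate with plain SQLite: wrap all the per-page INSERTs in one explicit transaction instead of letting each statement commit on its own. A minimal sketch with a made-up schema, not the actual makemandb code:

/*
 * Sketch: index many pages inside a single transaction so SQLite
 * syncs to disk once instead of once per page.  The schema and
 * function name are made up for illustration.
 */
#include <sqlite3.h>
#include <stdio.h>

static int
index_all_pages(sqlite3 *db, const char **pages, int npages)
{
    sqlite3_stmt *stmt;
    int i;

    if (sqlite3_exec(db, "BEGIN", NULL, NULL, NULL) != SQLITE_OK)
        return -1;
    if (sqlite3_prepare_v2(db,
        "INSERT INTO mandb_sketch(content) VALUES (?)",
        -1, &stmt, NULL) != SQLITE_OK) {
        sqlite3_exec(db, "ROLLBACK", NULL, NULL, NULL);
        return -1;
    }

    for (i = 0; i < npages; i++) {
        sqlite3_bind_text(stmt, 1, pages[i], -1, SQLITE_STATIC);
        if (sqlite3_step(stmt) != SQLITE_DONE)
            fprintf(stderr, "failed to index page %d\n", i);
        sqlite3_reset(stmt);
        sqlite3_clear_bindings(stmt);
    }
    sqlite3_finalize(stmt);

    /* A single COMMIT at the end: one sync instead of one per page. */
    return sqlite3_exec(db, "COMMIT", NULL, NULL, NULL) == SQLITE_OK ? 0 : -1;
}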

Parse And Index man(7) Pages: Until now we were indexing only the mdoc(7) pages, and all the man(7) based pages were being ignored. Since the project was working quite well with mdoc(7) pages, it was time to scale up. Parsing man(7) pages turned out to be a bit more difficult than parsing mdoc(7) pages. It took some 2-3 days to implement the code and another 2-3 days to fix various bugs and test whether it worked with the 7000+ man pages I have.
Commit: 2014855

Too Large DB Size (Regression): Parsing both man(7) and mdoc(7) meant that I was indexing a whole lot of man pages (7613, to be exact). This scale-up in the number of indexed pages also scaled up some problems which were not really visible before. One major problem was the size of the DB, which had grown to almost 99M.
Root Cause: We were also storing all the unique terms in the corpus and their term-weights in a separate table, which almost doubled the space requirements.
Solution: As a quick fix I decided to remove the code for pre-computing the term-weights and drop this table. This brought the DB size down to around 60M, and with a few optimizations it has come down to the range of 30-40M.
Drawbacks: The pre-computation of weights had its advantages; I was using it to implement some advanced ranking algorithms and had plans to improve the ranking further on the basis of that work, but I had to let it go.
Alternatives: The extra space only helped to get more accurate results, so it was a trade-off between space and search quality. One alternative would be to let users decide what they want and choose between the two versions.
Commit: 7928fc5


Added Compression Option To The DB: To bring the DB size down further, I implemented code for compressing and decompressing data using zlib(3). It was also an exercise in making the zlib interface work with SQLite.
As a result the DB size came down to 43M.
Commit: d878815
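
The basic trick is to wrap zlib's compress routine in a user-defined SQL function so that data can be compressed as it is inserted (and decompressed by a matching function on the way out). A sketch of the compression side, assuming the uncompressed length is stored in a 4-byte prefix; the real implementation may differ:

/*
 * Sketch: a SQL function that deflates its argument with zlib.  The
 * uncompressed size is stored in a 4-byte prefix so that a matching
 * uncompress function knows how big a buffer to allocate.
 */
#include <sqlite3.h>
#include <stdlib.h>
#include <zlib.h>

static void
zcompress_func(sqlite3_context *ctx, int nargs, sqlite3_value **args)
{
    const unsigned char *in = sqlite3_value_blob(args[0]);
    uLong inlen = (uLong)sqlite3_value_bytes(args[0]);
    uLongf outlen = compressBound(inlen);
    unsigned char *out = malloc(outlen + 4);

    if (out == NULL) {
        sqlite3_result_error_nomem(ctx);
        return;
    }
    /* 4-byte little-endian prefix holding the original length */
    out[0] = inlen & 0xff;
    out[1] = (inlen >> 8) & 0xff;
    out[2] = (inlen >> 16) & 0xff;
    out[3] = (inlen >> 24) & 0xff;
    if (compress2(out + 4, &outlen, in, inlen, Z_BEST_COMPRESSION) != Z_OK) {
        free(out);
        sqlite3_result_error(ctx, "compress2 failed", -1);
        return;
    }
    sqlite3_result_blob(ctx, out, outlen + 4, free);
}

/* Registered on the open handle with something like:
 *   sqlite3_create_function(db, "zip", 1, SQLITE_UTF8, NULL,
 *       zcompress_func, NULL, NULL);
 */

The matching decompression function would read back the 4-byte length, allocate a buffer of that size and hand the rest of the blob to uncompress(3).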


Stopword Tokenizer: Implementing a custom tokenizer to filter out stopwords was already on my TODO list, but with the increased DB size it became a priority. I patched the porter tokenizer from the SQLite source to filter out stopwords.
The tokenizer seemed to be working fine, and it also helped bring the DB size down: with the stopword tokenizer the size came out to around 31M.
Due to a small bug I have disabled this tokenizer for now.
Commit: 76b4769
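
Conceptually the patch boils down to one extra check inside the tokenizer: whenever the porter tokenizer emits a token, look it up in a stopword list and skip it if it matches. A rough illustration of that lookup (the list and helper names are made up; the real change sits inside SQLite's porter tokenizer code):

#include <stdlib.h>
#include <string.h>

/* A few entries from a sorted stopword list (illustrative only). */
static const char *stopwords[] = {
    "a", "an", "and", "are", "for", "in", "is", "of", "the", "to"
};

static int
cmp_word(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char * const *)elem);
}

/* Returns non-zero if the token should be dropped from the index. */
static int
is_stopword(const char *token)
{
    return bsearch(token, stopwords,
        sizeof(stopwords) / sizeof(stopwords[0]),
        sizeof(stopwords[0]), cmp_word) != NULL;
}

/* Inside the tokenizer's next-token callback, a token for which
   is_stopword() returns non-zero is simply skipped and the following
   token is returned instead. */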

Parsing Additional Sections & Storing Them In Individual Columns: This was a much-needed change. With such a large number of pages (7613) in the DB, keeping all of the content in a single column meant a lot of noise, and the search results were off the mark by a large margin. David Young had also suggested this previously, so that prominent sections like "DIAGNOSTICS" could be given more weight than others like "ERRORS", etc.
It was a big task. I started by decomposing the mdoc(7) pages, then the man(7) pages, and then sat down to fix apropos to take the new columns in the DB into account and to fix the ranking function.

I would say the time taken to implement this was worth it, because it has helped make the code cleaner. In the future, if another section needs to be parsed, it will only require adding a case to the switch statement below and a couple of extra lines of code.

/*
 * Append the text of the current mdoc(7) section to the buffer for
 * its dedicated DB column; the NAME section is skipped here, and any
 * other section falls through to the general description buffer.
 */
static void
mdoc_parse_section(enum mdoc_sec sec, const char *string)
{
    switch (sec) {
    case SEC_LIBRARY:
        concat(&lib, string);
        break;
    case SEC_SYNOPSIS:
        concat(&synopsis, string);
        break;
    case SEC_RETURN_VALUES:
        concat(&return_vals, string);
        break;
    case SEC_ENVIRONMENT:
        concat(&env, string);
        break;
    case SEC_FILES:
        concat(&files, string);
        break;
    case SEC_EXIT_STATUS:
        concat(&exit_status, string);
        break;
    case SEC_DIAGNOSTICS:
        concat(&diagnostics, string);
        break;
    case SEC_ERRORS:
        concat(&errors, string);
        break;
    case SEC_NAME:
        break;
    default:
        concat(&desc, string);
        break;
    }
}


It also makes it easy to fine-tune the ranking function and play with it. If you want to experiment with the search, you can simply modify the column weights and rebuild to see the effects. The column weights are stored as a double array in the rank_func function.

double col_weights[] = {
    2.00,   // NAME
    2.00,   // NAME description (.Nd)
    0.55,   // DESCRIPTION
    0.25,   // LIBRARY
    0.10,   // SYNOPSIS
    0.001,  // RETURN VALUES
    0.20,   // ENVIRONMENT
    0.01,   // FILES
    0.001,  // EXIT STATUS
    2.00,   // DIAGNOSTICS
    0.05    // ERRORS
};
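
For context, this is roughly how such per-column weights get combined inside an SQLite FTS ranking function: the matchinfo() blob reports, for every phrase and column, how many hits the current row has, and each hit count is scaled by the weight of its column. A simplified sketch modeled on the standard matchinfo ranking example, not the exact rank_func:

#include <sqlite3.h>

extern double col_weights[];    /* the array shown above */

/*
 * Sketch: with the default matchinfo() format the blob starts with
 * the number of phrases and the number of columns, followed by three
 * integers per (phrase, column) pair, the first of which is the hit
 * count in the current row.
 */
static void
rank_sketch(sqlite3_context *ctx, int nargs, sqlite3_value **args)
{
    const unsigned int *mi = sqlite3_value_blob(args[0]);
    int nphrase = mi[0];
    int ncol = mi[1];
    double score = 0.0;
    int p, c;

    for (p = 0; p < nphrase; p++)
        for (c = 0; c < ncol && c < 11; c++)
            score += mi[2 + 3 * (p * ncol + c)] * col_weights[c];

    sqlite3_result_double(ctx, score);
}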



[Feature Proposal] Show Additional Data With Search Results: Storing the different sections in separate columns has its advantages as well, one of them being the ability to fetch and show more specific content along with the search results. I have already done something like this: if you look at the search results now, you will also see the one-line description of each result (the .Nd macro).
Similarly, it is possible to show the library, exit values and return values where available. But I was wondering whether this is a useful feature. Any views?
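
On the query side this mostly means selecting the extra columns along with each match, something along these lines (table and column names are illustrative, not necessarily the actual schema):

/* Sketch: fetch the one-line description (and potentially other
   sections) together with each matching page. */
const char *query =
    "SELECT name, name_desc, snippet(mandb) FROM mandb "
    "WHERE mandb MATCH ? "
    "ORDER BY rank_func(matchinfo(mandb)) DESC";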

Besides this, there are a lot of other things to be done that I mentioned in my proposal, like a CGI based interface and using the database for managing the man page aliases. These are now at the top of my TODO list, and if no big issues come up, I would like to pick them up next.
