Sunday, July 31, 2011

NetBSD GSoC: Project Update 5

First of all, thanks to Joerg, David Young and everyone else involved with GSoC, as I cleared the midterm evaluations. It has been 20 days since I last posted a project update, and I did not even realize it. I apologize if I seem to have been inactive, but in my defense I would just like to quote my Commit Log ;-). I have been actively pushing changes, fixing issues and adding new features all this while. As the size of the project grows, it is taking more and more time to make new changes, test them properly and fix any loose ends.

So here is a brief overview of the things I did in the last three weeks:

[New Feature] Search Within Specific Sections: This feature has been on my TODO list for a very long time, but something more important always kept coming up. I wanted to get it done before posting this update, so here it is:
I have added options to support search within one or more specified sections.
Commit: 966c7ba
You can do a search like this:
$apropos -1 "copy files"
# This will search in section 1 only.

$apropos -18 "adding new user"
# This will search in sections 1 and 8 only.
# I hope you get the idea :)

Some sample runs:
http://paste2.org/p/1554491
http://paste2.org/p/1554510
http://paste2.org/p/1554509
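For the curious, the option handling for this can be quite small. Here is a minimal sketch (not the actual apropos code; the usage string and variable names are just illustrative), relying on the fact that getopt(3) parses a combined flag like -18 as the two options '1' and '8':

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	int sections[10] = {0};	/* sections[i] != 0 => restrict search to section i */
	int ch, i, any = 0;

	/* Each digit is its own flag, so "-18" selects sections 1 and 8. */
	while ((ch = getopt(argc, argv, "123456789")) != -1) {
		if (ch < '1' || ch > '9') {
			fprintf(stderr, "usage: apropos [-123456789] query\n");
			exit(1);
		}
		sections[ch - '0'] = any = 1;
	}
	argc -= optind;
	argv += optind;
	if (argc < 1)
		exit(1);

	for (i = 1; i <= 9; i++)
		if (!any || sections[i])
			printf("would search section %d for \"%s\"\n", i, argv[0]);
	return 0;
}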

Indexing Performance Improvement: Joerg suggested a clever way to bring down the time taken by makemandb to index the pages. Instead of doing a separate transaction for each page, it is better to index all the pages inside a single transaction, which decreases the IO overhead substantially. So I made the changes, and the indexing time came down from about 3 minutes to roughly 30 seconds.
Commit: 926746
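The idea, as a rough sketch in C (index_page() is just a placeholder for the real per-page work in makemandb):

#include <sqlite3.h>

/* Placeholder for the real per-page parsing and INSERT logic. */
static int
index_page(sqlite3 *db, const char *page)
{
	(void)db; (void)page;
	return 0;
}

/* Index every page inside a single transaction so SQLite hits the disk
 * once at COMMIT instead of once per page. */
static int
index_all_pages(sqlite3 *db, char **pages, int npages)
{
	int i;

	if (sqlite3_exec(db, "BEGIN", NULL, NULL, NULL) != SQLITE_OK)
		return -1;
	for (i = 0; i < npages; i++) {
		if (index_page(db, pages[i]) != 0) {
			sqlite3_exec(db, "ROLLBACK", NULL, NULL, NULL);
			return -1;
		}
	}
	return sqlite3_exec(db, "COMMIT", NULL, NULL, NULL) == SQLITE_OK ? 0 : -1;
}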

Parse And Index man(7) Pages: Until now we were indexing only the mdoc(7) pages, and all the man(7) based pages were being ignored. Now that the project was working quite well with mdoc(7) pages, it was time to scale up. Parsing man(7) pages was a bit more difficult than parsing mdoc(7) pages. It took some 2-3 days to implement this code and the next 2-3 days to fix various bugs and to test that it worked with the 7000+ man pages I have.
Commit: 2014855

Too Large DB Size (Regression): Parsing both man(7) and mdoc(7) meant that I was indexing a whole lot of man pages (7613 to be exact). This scale-up in the number of indexed pages also scaled up some problems which were not really visible before. One major problem that came up was the size of the DB: it had grown to almost 99M.
Root Cause: We were also storing all the unique terms in the corpus and their term-weights in a separate table, which was almost doubling the space requirements.
Solution: As a quick solution I decided to remove the code related to pre-computation of the term-weights and drop this table. This brought the DB size down to around 60M, and with a few optimizations it has come down to the range of 30-40M.
Drawbacks: The pre-computation of weights had its advantages. I was using it to implement some advanced ranking algorithms, and I had plans to improve the ranking further on the basis of this work, but I had to let it go.
Alternatives: The extra space was only helping to get more accurate results; it was a trade-off between space and search quality. One alternative could be to let the user decide what he/she wants and choose between the two versions.
Commit: 7928fc5


Added Compression Option To The DB: To bring the DB size down further, I implemented code for compressing and decompressing the data using zlib(3). It was also an exercise in making the zlib interface work with Sqlite.
As a result the DB size came down to 43M.
Commit: d878815
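The general pattern is to expose the zlib routines as custom SQL functions, so the data gets compressed on its way into the table. A rough sketch of the compressing half (not the actual commit; the function name and the length-prefix convention are just one way to do it):

#include <stdint.h>
#include <stdlib.h>
#include <sqlite3.h>
#include <zlib.h>

/* SQL function zip(blob): zlib-compress the argument, prefixed with the
 * original length so a matching unzip() knows how much to allocate. */
static void
zip(sqlite3_context *ctx, int argc, sqlite3_value **argv)
{
	const unsigned char *in = sqlite3_value_blob(argv[0]);
	uLong inlen = (uLong)sqlite3_value_bytes(argv[0]);
	uLongf outlen = compressBound(inlen);
	unsigned char *out;

	(void)argc;
	if ((out = malloc(outlen + sizeof(uint32_t))) == NULL) {
		sqlite3_result_error_nomem(ctx);
		return;
	}
	*(uint32_t *)out = (uint32_t)inlen;
	if (compress2(out + sizeof(uint32_t), &outlen, in, inlen,
	    Z_BEST_COMPRESSION) != Z_OK) {
		free(out);
		sqlite3_result_error(ctx, "compress2 failed", -1);
		return;
	}
	sqlite3_result_blob(ctx, out, (int)(outlen + sizeof(uint32_t)), free);
}

/*
 * Registered once per connection with:
 *   sqlite3_create_function(db, "zip", 1, SQLITE_ANY, NULL, zip, NULL, NULL);
 * and then used directly in SQL, e.g. INSERT INTO ... VALUES (zip(?), ...).
 * The query side needs the inverse function built on zlib's uncompress().
 */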


Stopword Tokenizer: Implementing a custom tokenizer to filter out stopwords was already on my TODO list, but with the increased DB size it became a priority. I patched the porter tokenizer from the Sqlite source to filter out any stopwords.
The tokenizer seemed to be working fine, and it also helped in bringing the DB size down: with the stopword tokenizer the size came out to around 31M.
Due to a small bug I have disabled the use of this tokenizer for now.
Commit: 76b4769
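Conceptually the filter is just a membership test applied to each token before it reaches the index; the tokenizer plumbing around it is the harder part. A tiny sketch of the test itself (the word list is obviously truncated and the helper name is made up):

#include <stdlib.h>
#include <string.h>

/* A (truncated) sorted list of stopwords; the real list is much longer. */
static const char *stopwords[] = {
	"a", "an", "and", "are", "for", "in", "is", "of", "or", "the", "to"
};

static int
cmp(const void *key, const void *elem)
{
	return strcmp((const char *)key, *(const char * const *)elem);
}

/* Return 1 if the token should be dropped instead of indexed. */
static int
is_stopword(const char *token)
{
	return bsearch(token, stopwords,
	    sizeof(stopwords) / sizeof(stopwords[0]),
	    sizeof(stopwords[0]), cmp) != NULL;
}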

Parsing Additional Sections & Storing Them In Individual Columns: This was a required change. With such a large number of pages (7613) in the DB, keeping all of the content in a single column meant a lot of noise, and the search results were off the mark by a great margin. David Young had also previously suggested giving more weight to prominent sections like "DIAGNOSTICS" than to others like "ERRORS", etc.
It was a big task. I started with decomposing the mdoc(7) pages, then the man(7) pages, and then sat down to fix apropos to take into account the new columns in the DB and to fix the ranking function.

I would say the time taken to implement this was worth it, because it has helped in making the code cleaner. In future, if there is a requirement to parse another section, it will only require adding a case to a switch statement and a couple of extra lines of code.

static void
mdoc_parse_section(enum mdoc_sec sec, const char *string)
{
    switch (sec) {
    case SEC_LIBRARY:
        concat(&lib, string);
        break;
    case SEC_SYNOPSIS:
        concat(&synopsis, string);
        break;
    case SEC_RETURN_VALUES:
        concat(&return_vals, string);
        break;
    case SEC_ENVIRONMENT:
        concat(&env, string);
        break;
    case SEC_FILES:
        concat(&files, string);
        break;
    case SEC_EXIT_STATUS:
        concat(&exit_status, string);
        break;
    case SEC_DIAGNOSTICS:
        concat(&diagnostics, string);
        break;
    case SEC_ERRORS:
        concat(&errors, string);
        break;
    case SEC_NAME:
        break;
    default:
        concat(&desc, string);
        break;
    }
}


It also makes it easy to fine-tune the ranking function and play with it. If you want to experiment with the search, you can simply modify the column weights and rebuild to see the effects. The column weights are kept in a double array in the rank_func function.

double col_weights[] = {
    2.0,    // NAME
    2.00,   // Name-description
    0.55,   // DESCRIPTION
    0.25,   // LIBRARY
    0.10,   // SYNOPSIS
    0.001,  // RETURN VALUES
    0.20,   // ENVIRONMENT
    0.01,   // FILES
    0.001,  // EXIT STATUS
    2.00,   // DIAGNOSTICS
    0.05    // ERRORS
};
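To give an idea of how these weights come into play, here is a sketch modeled on the example ranking function in the SQLite FTS3/FTS4 documentation, rather than the project's actual rank_func: the matchinfo() blob reports, for every (phrase, column) pair, how many hits the current row has, and each hit count is scaled by the corresponding entry of col_weights above.

#include <sqlite3.h>

/* Used from SQL roughly as: ORDER BY rank(matchinfo(mandb)) DESC.
 * With the default matchinfo format the blob starts with the number of
 * phrases and columns, followed by three integers per (phrase, column)
 * pair; the first of the three is the hit count in the current row. */
static void
rank_func(sqlite3_context *ctx, int nval, sqlite3_value **apval)
{
	const unsigned int *matchinfo;
	unsigned int nphrase, ncol, i, j;
	double score = 0.0;

	if (nval < 1) {
		sqlite3_result_error(ctx, "rank: wrong number of arguments", -1);
		return;
	}
	matchinfo = (const unsigned int *)sqlite3_value_blob(apval[0]);
	nphrase = matchinfo[0];
	ncol = matchinfo[1];

	for (i = 0; i < nphrase; i++) {
		const unsigned int *hits = &matchinfo[2 + i * ncol * 3];
		for (j = 0; j < ncol &&
		    j < sizeof(col_weights) / sizeof(col_weights[0]); j++)
			score += hits[j * 3] * col_weights[j];	/* hits of phrase i in column j */
	}
	sqlite3_result_double(ctx, score);
}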



[Feature Proposal]: Show additional data with search results- Storing the different sections in separate columns has its advantages as well. One of them is the ability to fetch and show more specific content along with the search results. For example, I have already done something like this: if you look at the search results now, you will also see the one-line description of each result (the .Nd macro).
Similarly, it is possible to show the library, exit values and return values where applicable. But I was wondering whether this is a useful feature? Any views?

Besides this, there are a lot of other things to be done that I mentioned in my proposal, like a CGI based interface and using the database for managing the man page aliases. These are now on top of my TODO list, and if no big issues come up, I would like to pick them up next.

Friday, July 8, 2011

NetBSD GSoC: Midterm Project Update

Another update on the project after a gap of 2 weeks. I did not have much to write about last week, and come this week, we have reached the stage of midterm evaluations. I will try to explain what new changes I made in the last two weeks, the overall present status of the project, and the other things which I am planning to implement.

Printing the section number along with search results: Last time around when I posted a few sample runs of the project, there were no section numbers along with the search results, but now we have them (thanks to Kristaps for suggesting the right way to extract meta data).

Improve the ranking algorithm (Implement tf-idf): tf-idf based term-weighting schemes are very common in information retrieval systems. Until now, I was ranking the results only on the basis of the term frequency (tf). I improved this by including another factor in the ranking, i.e., the inverse document frequency (idf).

  • Term Frequency: usually defined as the number of times a given term appears in a particular document.
  • Inverse Document Frequency: based on the number of documents in which a given term appears (at least once).

Term frequency is a local factor: it is concerned only with the number of occurrences of the search terms in one particular document at a time.
Inverse document frequency is a global factor, in the sense that it indicates the discriminating power of a term: if a term appears in only a select set of documents, then that term separates that set of documents from the rest. A ranking obtained by combining these two factors therefore brings up more relevant documents.

So the weight of a term t in document d is calculated by the following formula:

weight = tf * idf

where:
tf = term frequency of term t in document d
idf = log(N / Nt)
N = total number of documents in the corpus
Nt = number of documents in which term t occurs (at least once)

So a term which appears in only one document will have
idf = log(N)
while a term which appears in all the documents will have
idf = log(1) = 0.

For example, a term like "the" will have a high term frequency in any document, but at the same time it will have a very low inverse document frequency (close to 0), which will nullify its effect on the quality of the search results.
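In code the whole scheme comes down to a couple of lines per term; a tiny illustration (not project code), using the 7613 indexed pages as the corpus size:

#include <math.h>

/*
 * tf : number of times the term occurs in the document
 * n  : total number of documents in the corpus
 * nt : number of documents containing the term at least once
 */
static double
tfidf(double tf, double n, double nt)
{
	return tf * log(n / nt);
}

/* e.g. a term occurring 5 times in one of 7613 pages, found in 20 pages:
 *   tfidf(5, 7613, 20) ~= 5 * log(380.65) ~= 29.7
 * while a term like "the", present in (almost) every page:
 *   tfidf(50, 7613, 7613) = 50 * log(1) = 0
 */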


Pre-compute the term-weights: While the tf-idf based term-weighting scheme improved the quality of search, it degraded the performance. I could see apropos taking noticeable time to flush out the results. The reason was that all the term-weight calculations were being done on the fly while running apropos. An obvious solution to this problem was to pre-compute the term-weights while creating the index and store them in the database. Thus while doing the search, we only need to look up the database rather than do both the lookup and the calculation!

I implemented the code for pre-computing term-weights in makemandb, but to my surprise, these changes made makemandb painfully slow. Earlier makemandb could index the man pages in under 2 minutes, but now it was taking close to 3 hours to do the pre-computation for each unique term in the corpus. In addition there were some bugs which were causing large deviations in the term-weights. I decided to first get the calculations right; it took me 3 days, as after each change in the code I had to re-run makemandb to redo the indexing and check the results. Finally I got it right, and after some discussions with Joerg the performance issue was also fixed. Basically the solution was to move most of the processing inside Sqlite. Now makemandb does the indexing and the pre-computation of weights, all in under 3 minutes on my machine :-)
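The pattern for "bringing the processing inside Sqlite" is roughly the following (a sketch only; the table and column names here are invented, not the real schema): register a small C helper as an SQL scalar function, then let a single SQL statement walk over all the terms instead of fetching every row into the application and writing it back.

#include <math.h>
#include <sqlite3.h>

/* SQL function log_weight(tf, n, nt) -> tf * log(n / nt) */
static void
log_weight(sqlite3_context *ctx, int argc, sqlite3_value **argv)
{
	double tf = sqlite3_value_double(argv[0]);
	double n = sqlite3_value_double(argv[1]);
	double nt = sqlite3_value_double(argv[2]);

	(void)argc;
	sqlite3_result_double(ctx, tf * log(n / nt));
}

static int
precompute_weights(sqlite3 *db)
{
	sqlite3_create_function(db, "log_weight", 3, SQLITE_UTF8, NULL,
	    log_weight, NULL, NULL);

	/* Hypothetical tables: term_stats(term, tf, nt), mandb (one row per page),
	 * term_weights(term, weight). The whole computation runs inside SQLite. */
	return sqlite3_exec(db,
	    "INSERT INTO term_weights (term, weight) "
	    "SELECT term, log_weight(tf, (SELECT count(*) FROM mandb), nt) "
	    "FROM term_stats",
	    NULL, NULL, NULL);
}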

Further Improve the Ranking Algorithm: In my free time I am doing some study on Information Retrieval. During my studies I came across a very interesting research paper by Salton and Buckley from 1988, in which they discussed different term-weighting schemes and their results. According to their study, the most effective scheme computes the weight of a given term in a particular document as a cosine-normalized tf-idf:

weight = (tf * idf) / sqrt( sum over all terms t' in the document of (tf(t') * idf(t'))^2 )

I implemented this in a slightly simpler form: I avoided the calculation of the powers in the denominator (the square root and the squares) to avoid unnecessary overhead, as these calculations are done on the fly by apropos. The results have been pretty good.

Sample Results: http://pastebin.com/PjdNY68m



Note: The above mentioned change is in the search branch only at the moment. I did not merge it into master so that, if you want to compare the differences before and after this change, you can easily check out the master and search branches and see for yourself :-)

A Keyword Recognizer: I have been thinking of implementing this feature for a while. Basically the idea is to scan the query for some keywords which would indicate that the user is probably looking for results from a particular section. For example, "functions to copy string" gives an indication that the user is looking for standard library functions from section 3.

After some discussions with David, we came to the conclusion that a better way to implement a feature like this would probably be to do something like Google does. Google allows you to search within a specific website using a syntax like:
[book on IR site:amazon.com]
David suggested using a similar interface, where the user could specify a section with a keyword followed by a colon. So for example:
apropos "kernel: function to allocate memory"
will search only within section 9.
I started some work on this feature but it didn't work out properly, so it is on hold at the moment. I hope to resume work on it soon, but at the same time I would like to know whether this feature is worth it?
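For what it's worth, the parsing side of the colon syntax is straightforward; a rough sketch (the keyword table and the helper name are made up, and the real difficulty is in wiring the result into the query itself):

#include <string.h>

/* Map a leading "keyword:" prefix of the query to a man page section.
 * Returns the section number, or 0 if no known keyword prefix is present;
 * *rest is set to the remainder of the query. */
static int
keyword_section(const char *query, const char **rest)
{
	static const struct {
		const char *keyword;
		int section;
	} map[] = {
		{ "kernel", 9 },
		{ "function", 3 },
		{ "command", 1 },
	};
	const char *colon = strchr(query, ':');
	size_t i, len;

	if (colon == NULL)
		return 0;
	len = (size_t)(colon - query);
	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
		if (strlen(map[i].keyword) == len &&
		    strncmp(query, map[i].keyword, len) == 0) {
			*rest = colon + 1;
			return map[i].section;
		}
	}
	return 0;
}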


Where Do We Stand At Midterm Evaluation: As I promised in my proposal, I have accomplished most of the requirements just in time for the midterm evaluation (though maybe I still need to write some documentation). Now is the time for some community feedback :-). I would love to hear about:

  • How good or bad are the search results? If for some query you feel that the right results are not coming up, please write to me about that query and what results you expected to see at the top.
  • If you want to see any improvements or new features, tell me about them.

What New Features Are Next? Apart from the keyword recognizer, there are a couple of other features that I have in mind, although whether I will implement them is a different matter, as I first need to make sure they are feasible.

A Link Analysis Algorithm For Ranking: Search engines these days do two types of ranking.

  1. Content based ranking: It is concerned with finding relevant documents by matching the content. For example the tf-idf based term-weighting scheme is one way of doing content based ranking.
  2. Popularity Based Ranking: It tries to rank the documents based on their popularity. This popularity is calculated on the basis of a link analysis algorithm. For example Google's PageRank or Jon Kleinberg's HITS algorithm. 

I am studying the PageRank algorithm and I am tempted to implement it, but I am held back by the fact that Stanford University has a patent on the PageRank process, so I am in a dilemma over whether I should implement it or not.

A Spell Checker: It is very common for users to make a typo while performing a search, which might lead to no results at all, or in some cases wrong results. I am thinking of adding a spell checker which, in case no results are found, would suggest some related search terms to the user (assuming that perhaps they made a typo).

I am held back on this because I have personally never looked at what techniques are involved in spell checkers, but I have heard that it can be computationally very expensive.
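For reference, the usual building block for such suggestions is an edit (Levenshtein) distance between the mistyped query term and the terms in the index; computing one distance is cheap, the expensive part is doing it against every term in the dictionary. A minimal sketch (assuming words shorter than 64 characters):

#include <string.h>

/* Classic dynamic-programming edit distance between two words
 * (number of single-character insertions, deletions and substitutions). */
static int
edit_distance(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b);
	int prev[64 + 1], cur[64 + 1];	/* assumes la, lb <= 64 */
	size_t i, j;

	for (j = 0; j <= lb; j++)
		prev[j] = (int)j;
	for (i = 1; i <= la; i++) {
		cur[0] = (int)i;
		for (j = 1; j <= lb; j++) {
			int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
			int del = prev[j] + 1;
			int ins = cur[j - 1] + 1;
			int sub = prev[j - 1] + cost;

			cur[j] = del < ins ? del : ins;
			if (sub < cur[j])
				cur[j] = sub;
		}
		memcpy(prev, cur, sizeof(prev));
	}
	return prev[lb];
}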


Testing out apropos:

#Clone the repository:
$git clone git://github.com/abhinav-upadhyay/apropos_replacement.git


#Run make
$make

#Run makemandb
$./makemandb

#Run apropos
$./apropos "list directories"

By default you will be on the master branch. The search branch has an improved ranking algorithm, so you might want to check it out and compare the results before and after the algorithm improvement:

$git checkout -b search origin/search

and run make again to build it.

Prerequisites:
  1. You will need the -current version of man(1) from CVS. Joerg committed my patch adding the -p option to man(1), which is used by makemandb.
  2. You will also want to have the -current version of the man pages in /usr/share/man (at least).
  3. libmandoc. I am using the version of libmandoc available with -current (which at the moment is 1.11.1). You can build it by running make && make install in /usr/src/external/bsd/mdocml 
Feedback is welcome :-)