Friday, July 8, 2011

NetBSD GSoC: Midterm Project Update

Another update on the project after a gap of two weeks. I did not have much to write about last week, and this week we have reached the stage of midterm evaluations. I will try to explain the changes I made in the last two weeks, the present status of the project, and the other things I am planning to implement.

Printing the section number along with search results: Last time, when I posted a few sample runs of the project, there were no section numbers along with the search results, but now we have them (thanks to Kristaps for suggesting the right way to extract the metadata).

Improve the ranking algorithm (Implement tf-idf): tf-idf based term-weighting schemes are very common in information retrieval systems. Until now, I was ranking the results only on the basis of the term frequency (tf). I improved this by including another factor in the ranking, i.e., the inverse document frequency (idf).

  • Term Frequency: usually defined as the number of times a given term appears in a particular document.
  • Inverse Document Frequency: based on the number of documents in which a given term appears (at least once); the fewer the documents, the higher the IDF.

Term frequency is a local factor: it is concerned only with the number of occurrences of the search terms in one particular document at a time.
Inverse document frequency, on the other hand, is a global factor, in the sense that it indicates the discriminating power of a term. If a term appears in only a select set of documents, then that term separates this set of documents from the rest. So a ranking obtained by combining these two factors brings up more relevant documents.

So the weight of a term t in document d is calculated by the following formula:

weight  = tf * idf

Where tf = Term frequency of term t in document d
idf = log (N / Nt)

Where N = Total number of documents in the corpus
Nt = Number of documents in which term t occurs (at least once).
So a term which appears in only one document will have
idf = log(N)
while a term which appears in all the documents will have
idf = log(N / N) = log(1) = 0.

For example, a term like "the" will have a high term frequency in any document, but at the same time it will have an inverse document frequency close to 0, which will nullify its effect on the quality of the search results.
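To make the computation concrete, here is a minimal sketch of the weight calculation in C (the function and parameter names are mine for illustration, not the actual makemandb code):

#include <math.h>

/*
 * tf = frequency of the term in the document
 * n  = total number of documents in the corpus (N)
 * nt = number of documents containing the term (Nt)
 */
double
tf_idf_weight(int tf, int n, int nt)
{
    if (nt == 0)
        return 0.0;    /* the term does not occur in the corpus */
    /* idf = log(N / Nt); it is 0 for a term occurring in every document */
    return tf * log((double)n / nt);
}

With N = 8000 pages, a term occurring 5 times in a page but present in only 10 pages gets a weight of 5 * log(800) ≈ 33.4, while "the" with tf = 50 but Nt = N gets 50 * log(1) = 0.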


Pre-compute the term-weights: While the tf-idf based term-weighting scheme improved the quality of search, it degraded its performance. I could see apropos taking its time to flush out the results. The reason was that all the term-weight calculations were being done on the fly when running apropos. An obvious solution to this problem was to pre-compute the term-weights while creating the index and store them in the database. Thus, while doing the search, we only need to look up the database rather than do both a lookup and the calculation!

I implemented the code for pre-computing term weights in makemandb, but to my surprise, these changes made makemandb painfully slow. Earlier, makemandb could index the man pages in under 2 minutes, but now it was taking close to 3 hours to pre-compute the weight of each unique term in the corpus. In addition to that, there were some bugs which were causing large deviations in the term-weights. I decided to get the calculations right first; that took me 3 days, as after each bit of change in the code I had to re-run makemandb to do the indexing and see the results. Finally I got it right, and then, after some discussions with Joerg, the performance issue was also fixed. Basically, the solution was to bring most of the processing inside SQLite. Now makemandb does the indexing and the pre-computation of the weights, all under 3 minutes on my machine :-)
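For the curious, the idea looks roughly like this (a sketch only: the table names mandb_tf, mandb_pages and mandb_weights are hypothetical, not the real schema, and since stock SQLite has no log() function, one has to be registered first):

#include <math.h>
#include <sqlite3.h>

static void
sql_log(sqlite3_context *ctx, int argc, sqlite3_value **argv)
{
    sqlite3_result_double(ctx, log(sqlite3_value_double(argv[0])));
}

static int
precompute_weights(sqlite3 *db)
{
    /* one INSERT...SELECT computes tf * idf for every (term, page) pair */
    const char *sql =
        "INSERT INTO mandb_weights (term, page, weight) "
        "SELECT term, page, tf * log("
        "    (SELECT CAST(COUNT(*) AS REAL) FROM mandb_pages) / "
        "    (SELECT COUNT(*) FROM mandb_tf t2 WHERE t2.term = mandb_tf.term)) "
        "FROM mandb_tf;";

    sqlite3_create_function(db, "log", 1, SQLITE_UTF8, NULL,
        sql_log, NULL, NULL);
    return sqlite3_exec(db, sql, NULL, NULL, NULL);
}

Doing the whole computation in one statement keeps the work inside SQLite, instead of pulling every term out and computing its weight in application code, which is presumably where the 3-hour version was losing its time.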

Further Improve the Ranking Algorithm: In my free time I am doing some study of Information Retrieval. During my studies I came across a very interesting research paper by Salton and Buckley from 1988, in which they discussed different term-weighting schemes and their results. According to their study, the following formula for calculating term weights is the most effective:


For the weight of a given term t in a particular document d:

weight = (tf * idf) / sqrt(sum over all terms s in d of (tf_s * idf_s)^2)

i.e., the tf * idf weight of the term, normalized by the length of the document's weight vector (cosine normalization).
I implemented this in a slightly simpler form: I avoided calculating the powers in the denominator (the square root and the squares) to avoid unnecessary overhead, as these calculations are done on the fly by apropos. The results have been pretty good.
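As a sketch of this simplification (the function and data layout are illustrative, not the actual apropos code), the denominator becomes a plain sum of the precomputed weights instead of the square root of a sum of squares:

/*
 * query_weights[]  = precomputed tf * idf of each query term that
 *                    matches this page
 * page_weight_sum  = precomputed sum of all term weights of the page,
 *                    used instead of the L2 norm of the weight vector
 */
double
rank_page(const double *query_weights, int nmatch, double page_weight_sum)
{
    double score = 0.0;
    int i;

    for (i = 0; i < nmatch; i++)
        score += query_weights[i];
    return page_weight_sum > 0.0 ? score / page_weight_sum : 0.0;
}

Since both the per-term weights and the per-page sum can be stored in the database, the query-time work reduces to a lookup and a division.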

Sample Results: http://pastebin.com/PjdNY68m



Note: The above-mentioned change is in the search branch only at the moment. I did not merge it into master, so if you want to compare the results before and after the change, you can easily check out the master and search branches and see for yourself :-)

A Keyword Recognizer: I have been thinking of implementing this feature for a while. Basically, the idea is to scan the query for keywords which indicate that the user is probably looking for results from a particular section. For example, "functions to copy string" gives an indication that the user is looking for standard library functions from section 3.

After some discussions with David, we came to the conclusion that a better way to implement a feature like this would probably be to do something like Google does. Google allows you to search within a specific website using a syntax like:
[book on IR site:amazon.com]
David suggested using a similar interface, where the user could specify a section using a colon. So, for example:
apropos "kernel: function to allocate memory"
will search only within section 9.
I started some work on this feature, but it didn't work out properly, so it is on hold at the moment. I hope to resume work on it soon, but at the same time I would like to know whether this feature is worth it.
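For reference, here is a rough sketch of how the colon syntax could be parsed (the keyword-to-section mapping below is made up for illustration):

#include <string.h>

static const struct {
    const char *keyword;
    const char *section;
} sectmap[] = {
    { "kernel",    "9" },
    { "functions", "3" },
    { "commands",  "1" },
};

/*
 * If the query starts with "<keyword>:", return the corresponding
 * section and advance *query past the prefix; otherwise return NULL.
 */
static const char *
parse_section(const char **query)
{
    const char *colon = strchr(*query, ':');
    size_t i, len;

    if (colon == NULL)
        return NULL;
    len = (size_t)(colon - *query);
    for (i = 0; i < sizeof(sectmap) / sizeof(sectmap[0]); i++) {
        if (strlen(sectmap[i].keyword) == len &&
            strncmp(*query, sectmap[i].keyword, len) == 0) {
            *query = colon + 1;
            while (**query == ' ')
                (*query)++;
            return sectmap[i].section;
        }
    }
    return NULL;
}

The returned section would then simply become an extra condition on the SQL query.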


Where Do We Stand at the Midterm Evaluation: As I promised in my proposal, I have accomplished most of the requirements just in time for the midterm evaluation (though I may need to write some documentation). Now is the time for some community feedback :-). I would love to hear about:

  • How good or bad are the search results? If for some query you feel that the right results are not coming up, please write to me about that query and what results you expected to see at the top.
  • If you want to see any improvements or new features, tell me about them.

What New Features Are Next? Apart from the keyword recognizer, there are a couple of other features that I have in mind, although whether I will implement them is a different matter, as I first need to make sure that they are feasible.

A Link Analysis Algorithm For Ranking: Search engines these days do two types of ranking.

  1. Content-based ranking: concerned with finding relevant documents by matching their content. The tf-idf based term-weighting scheme, for example, is one way of doing content-based ranking.
  2. Popularity-based ranking: tries to rank documents based on their popularity, which is computed with a link analysis algorithm such as Google's PageRank or Jon Kleinberg's HITS.

I am studying the PageRank algorithm and I am tempted to implement it, but I am held back by the fact that Stanford University has a patent on the PageRank process, so I am in a dilemma over whether I should implement it or not.
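Just to illustrate the idea (a toy sketch, not project code): PageRank can be computed with a simple power iteration over a link graph, where for man pages the SEE ALSO cross references could serve as the links:

#define NPAGES  4
#define DAMPING 0.85
#define NITER   50

/* adj[i][j] != 0 means page i links to page j; dangling pages are ignored */
static void
pagerank(const int adj[NPAGES][NPAGES], double rank[NPAGES])
{
    double next[NPAGES];
    int outdeg[NPAGES];
    int i, j, iter;

    for (i = 0; i < NPAGES; i++) {
        rank[i] = 1.0 / NPAGES;
        outdeg[i] = 0;
        for (j = 0; j < NPAGES; j++)
            outdeg[i] += adj[i][j];
    }
    for (iter = 0; iter < NITER; iter++) {
        for (j = 0; j < NPAGES; j++) {
            /* a page earns rank from every page that links to it */
            next[j] = (1.0 - DAMPING) / NPAGES;
            for (i = 0; i < NPAGES; i++)
                if (adj[i][j] && outdeg[i] > 0)
                    next[j] += DAMPING * rank[i] / outdeg[i];
        }
        for (j = 0; j < NPAGES; j++)
            rank[j] = next[j];
    }
}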

A Spell Checker: It is very common for users to make a typo while searching, which might lead to no results at all, or in some cases to wrong results. I am thinking of adding a spell checker which, in case no results are found, would suggest some related search terms to the user (assuming that they perhaps made a typo).

I am held back on this because I have personally never looked at the techniques involved in spell checkers, but I have heard that they are computationally very expensive.
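One common building block for spelling suggestion is the edit (Levenshtein) distance: suggest indexed terms within a small distance of the query term. A minimal O(m*n) sketch (again, not project code):

#include <string.h>

static int
edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t i, j;
    int d[la + 1][lb + 1];    /* C99 VLA; fine for short search terms */

    for (i = 0; i <= la; i++)
        d[i][0] = (int)i;
    for (j = 0; j <= lb; j++)
        d[0][j] = (int)j;
    for (i = 1; i <= la; i++) {
        for (j = 1; j <= lb; j++) {
            int del = d[i - 1][j] + 1;
            int ins = d[i][j - 1] + 1;
            int sub = d[i - 1][j - 1] + (a[i - 1] != b[j - 1]);
            d[i][j] = del < ins ? del : ins;
            if (sub < d[i][j])
                d[i][j] = sub;
        }
    }
    return d[la][lb];
}

For example, edit_distance("strcpy", "strcy") is 1, so "strcpy" would be a good suggestion for the typo "strcy". The expensive part is the naive comparison of the query term against every indexed term, which is presumably where the computational cost people talk about comes from.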


Testing out apropos:

# Clone the repository:
$ git clone git://github.com/abhinav-upadhyay/apropos_replacement.git

# Run make
$ make

# Run makemandb
$ ./makemandb

# Run apropos
$ ./apropos "list directories"

By default you will be on the master branch. The search branch has an improved ranking algorithm, so you might want to check it out and compare the results before and after the algorithm improvement:

$ git checkout -b search origin/search

and run make again to build it.

Prerequisites:
  1. You will need the -current version of man(1) from CVS. Joerg committed my patch adding the -p option to man(1), which is used by makemandb.
  2. You will also want to have the -current version of the man pages in /usr/share/man (at least).
  3. libmandoc: I am using the version of libmandoc available with -current (which at the moment is 1.11.1). You can build it by running make && make install in /usr/src/external/bsd/mdocml.
Feedback is welcome :-)
