Saturday, May 7, 2016

Teaching Apropos to Rank - A Work in Progress

Last month I deployed man-k.org, a web interface to NetBSD's apropos implementation. After that, I thought I could try using machine learning to improve the ranking algorithm used by apropos. In this post, I describe how the ranking algorithm could be improved with machine learning, the challenges along the way, and the results obtained so far.

Before we start, I would like to say that the results shown here are nowhere close to the final output I want; it's still a work in progress. I will publish the code soon, when I feel the results are noteworthy. The data for these experiments is available in my GitHub repo. Now let's dive in :)


Apropos' Ranking Algorithm:

section_weights = {'NAME': 1.0, 'NAME_DESC': 0.9, 'DESCRIPTION': 0.5}

for each matched document in the result set:
    tf = 0
    idf = 0
    for each section in the document:
        tf = tf + compute_tf_for_section() * section_weights[section]
        idf = idf + compute_idf_for_section() * section_weights[section]
    document_score = (tf * idf) / (k + tf)

This algorithm generates a relevance score for each matching document, and the results are then sorted in decreasing order of that score. For each section of a page, the tf and idf values are multiplied by a section-specific weight. The values of these weights are hardcoded and were determined by running a few arbitrary queries and evaluating the results by hand.
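For concreteness, here is a minimal Python sketch of the same scoring scheme, assuming the per-section tf and idf values are already computed. The function, the data shapes, and the constant K are my own for illustration; apropos itself implements this in C.

K = 2.0  # damping constant; the value apropos actually uses may differ

section_weights = {'NAME': 1.0, 'NAME_DESC': 0.9, 'DESCRIPTION': 0.5}

def score_document(section_tf, section_idf):
    """Score a document from its per-section tf and idf values.

    section_tf and section_idf map section names to numbers,
    e.g. {'NAME': 0.8, 'DESCRIPTION': 0.3}.
    """
    tf = sum(section_weights[s] * v for s, v in section_tf.items())
    idf = sum(section_weights[s] * v for s, v in section_idf.items())
    # Dividing by (K + tf) saturates the term frequency, so long,
    # repetitive pages don't dominate the ranking.
    return (tf * idf) / (K + tf)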

How Can Machine Learning Help?

Ideally, you would determine such weights by evaluating against a standard set of queries, measuring precision and recall to see which values produce the best results. But there was no such dataset of apropos queries and expected results.

Thanks to man-k.org, I was able to get some data: close to 1,000 queries along with the results users clicked, which I sifted through manually to remove anomalies, such as someone going to the last page of results and opening one, or bot traffic.

Now, I wanted to learn the weights which I had hardcoded to arbitrary values in the ranking algorithm above. This seemed like a straightforward regression problem to me. I will explain how.

In machine learning, you use regression when you want to predict a continuous range of values, for example, temperature or housing prices. Usually you have a set of features in your data: in the case of the housing example, these could be the number of bedrooms, the area of the house, and the number of bathrooms, and you want to use them to predict a target value, the price of the house. You need to combine these features in a way that relates them to the price. One way to do that is to assign a weight to each feature and take their linear combination. For example:

w1 * number_bedrooms + w2 * number_bathrooms + w3 * area = price

Now, if we can determine the optimum values of these weights, we can predict the price of a house (to a good approximation) given these features. There are a number of algorithms out there which can learn these weights and use them to predict the output for you.
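To make that concrete, here is a tiny scikit-learn sketch of the housing example; all the numbers are made up purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [number_bedrooms, number_bathrooms, area]; values are made up.
X = np.array([[2, 1, 850],
              [3, 2, 1400],
              [4, 3, 2100],
              [3, 1, 1100]])
y = np.array([150000, 240000, 360000, 190000])  # prices (made up)

model = LinearRegression().fit(X, y)
print(model.coef_)                    # the learned weights w1, w2, w3
print(model.predict([[3, 2, 1300]]))  # estimated price for a new house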

My problem of learning the section-wise weights was similar. My algorithm generates a tf-idf score for each section, multiplies it by a weight, sums them up, and calls that the score for the document. So I had the following:
  •  Features in the form of section-wise scores
  •  And I wanted to learn the weights for these features (the section-specific weights)
Sounds like a machine learning problem.
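In that framing, each training row holds the per-section scores for one (query, document) pair, and the regression coefficients play the role of the section weights. A rough sketch, with made-up numbers:

import numpy as np
from sklearn.linear_model import LinearRegression

# One row per (query, document): scores from NAME, NAME_DESC, DESCRIPTION.
X = np.array([[0.9, 0.7, 0.2],
              [0.1, 0.8, 0.5],
              [0.4, 0.3, 0.9]])
y = np.array([1.2, 0.9, 0.7])  # target scores (manufactured, see below)

# No intercept: the document score is a pure weighted sum of the features.
model = LinearRegression(fit_intercept=False).fit(X, y)
print(dict(zip(['NAME', 'NAME_DESC', 'DESCRIPTION'], model.coef_)))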

Why Learn Weights Instead of Learning to Rank?

Learning to rank is a different problem, and probably a more interesting one too. But I wanted to tackle the problem of learning the weights first, because if I can learn optimum values for the weights, those can be plugged directly into NetBSD's apropos code and immediately improve its search.

The Challenges

I didn't have target values for my data, and I couldn't use the output score of the current algorithm as the target: if I used the output of my ranking algorithm as-is, the weights learned by the model would simply be the same as the current weights. So I decided to manufacture the target values.

For each query in the data set, if the clicked result was not ranked 1, I set its target score to the score of the document ranked 1. For example, for the query "list files", ideally you want ls(1) at the top, but currently the top result is file(1). So I set the target value for ls(1) to the output score of file(1). This way, the model would try to optimize the weights so that the score of ls(1) rises to the current score of file(1).
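In code, the target construction looks roughly like the following; the shape of a log entry here is my own assumption, not the actual format of the man-k.org logs.

def build_training_set(query_logs):
    """Turn click logs into (features, target) training pairs.

    Each entry is assumed to look like:
      {'results': [{'features': [...], 'score': 1.8}, ...],  # in rank order
       'clicked': 3}                                         # index clicked
    """
    X, y = [], []
    for entry in query_logs:
        clicked = entry['results'][entry['clicked']]
        X.append(clicked['features'])           # per-section scores
        y.append(entry['results'][0]['score'])  # target = rank-1 score
    return X, y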

I also had to do a lot of manual work processing the data and writing code to extract the section-wise scores from apropos, but those details are not very important.

The Results So Far

I tried out a few machine learning models, such as linear regression, support vector machines (SVMs), and random forests, which are popular models for this kind of problem. I used leave-one-out cross-validation to make sure the models didn't overfit the data, with mean squared error (MSE) as the evaluation metric. The best results were produced by the random forest model. I still need to tune the parameters of the SVM; maybe it will beat the random forest? I need to try that out, along with many other possibilities.
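The evaluation loop looks roughly like this sketch, with stand-in data in place of the real feature matrix and manufactured targets:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((30, 3))                            # stand-in section scores
y = X @ [1.0, 0.9, 0.5] + rng.normal(0, 0.05, 30)  # stand-in targets

models = {'linear regression': LinearRegression(),
          'SVM': SVR(),
          'random forest': RandomForestRegressor(random_state=0)}
for name, model in models.items():
    # Leave-one-out: train on all samples but one, test on the held-out one.
    mse = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                           scoring='neg_mean_squared_error').mean()
    print(name, mse)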

Admittedly, this approach is not perfect: it is prone to abuse through the traffic to man-k.org and requires manual work. But at least it serves as a validation of the idea. I am considering taking the data from man-k.org, manually refining and annotating it, and using it as a standard dataset for further experiments.

Following are comparisons of some queries with the old weights and the new weights. It's nothing drastic, but the new weights seem to push some of the non-relevant results off the top.

apropos -n 10 -C fork #old weights
fork (2) create a new process
perlfork (1) Perls fork() emulation
cpu_lwp_fork (9) finish a fork operation
pthread_atfork (3) register handlers to be called when process forks
rlogind (8) remote login server
rshd (8) remote shell server
rexecd (8) remote execution server
script (1) make typescript of terminal session
moncontrol (3) control execution profile
vfork (2) spawn new process in a virtual memory efficient way

apropos -n 10 -C fork #new weights
fork (2) create a new process
perlfork (1) Perls fork() emulation
cpu_lwp_fork (9) finish a fork operation
pthread_atfork (3) register handlers to be called when process forks
vfork (2) spawn new process in a virtual memory efficient way
clone (2) spawn new process with options
daemon (3) run in the background
script (1) make typescript of terminal session
openpty (3) tty utility functions
rlogind (8) remote login server

The new weights seem to bring more relevant results up: clone(2) shows up, rshd(8) and rexecd(8) go away, and rlogind(8) moves down.

apropos -n 10 -C create new process #old weights
init (8) process control initialization
fork (2) create a new process
fork1 (9) create a new process
timer_create (2) create a per-process timer
getpgrp (2) get process group
supfilesrv (8) sup server processes
posix_spawn (3) spawn a process
master (8) Postfix master process
popen (3) process I/O
_lwp_create (2) create a new light-weight process

apropos -n 10 -C create new process #new weights
fork (2) create a new process
fork1 (9) create a new process
_lwp_create (2) create a new light-weight process
pthread_create (3) create a new thread
clone (2) spawn new process with options
timer_create (2) create a per-process timer
UI_new (3) New User Interface
init (8) process control initialization
posix_spawn (3) spawn a process
master (8) Postfix master process

You can see that fork(2) moves to number 1, init(8) drops to number 8, clone(2) appears, and so on.

apropos -n 10 -C remove packages #old weights
groff_mdoc (7) reference for groffs mdoc implementation
pkg_add (1) a utility for installing and upgrading software package distributions
pkg_create (1) a utility for creating software package distributions
pkg_delete (1) a utility for deleting previously installed software package distributions
deroff (1) remove nroff/troff, eqn, pic and tbl constructs
pkg_admin (1) perform various administrative tasks to the pkg system
groff_tmac (5) macro files in the roff typesetting system
ci (1) check in RCS revisions
update-binfmts (8) maintain registry of executable binary formats
rpc_svc_reg (3) library routines for registering servers

apropos -n 10 -C remove packages #new weights
pkg_create (1) a utility for creating software package distributions
pkg_add (1) a utility for installing and upgrading software package distributions
pkg_delete (1) a utility for deleting previously installed software package distributions
deroff (1) remove nroff/troff, eqn, pic and tbl constructs
groff_mdoc (7) reference for groffs mdoc implementation
groff_tmac (5) macro files in the roff typesetting system
ci (1) check in RCS revisions
pkg_admin (1) perform various administrative tasks to the pkg system
update-binfmts (8) maintain registry of executable binary formats
rpc_svc_reg (3) library routines for registering servers

pkg_create moves to number 1 and pkg_delete moves up, though I think pkg_admin should have ranked higher.

