Saturday, May 7, 2016

Teaching Apropos to Rank - A Work in Progress

Last month I deployed man-k.org, which is a web interface to NetBSD's apropos implementation. After that, I thought that I could try using machine learning to improve the ranking algorithm used by apropos. In this post, I describe how the ranking algorithm could be improved by machine learning, the challenges in the way, and the results obtained thus far.

Before we start, I would like to say that the results shown here are nowhere close to the final output I want; it's still a work in progress. I will publish the code soon, when I feel the results are noteworthy. The data for these experiments is available in my GitHub repo. Now let's dive in :)


Apropos' Ranking Algorithm:

section_weights = {'NAME': 1.0, 'NAME_DESC': 0.9, 'DESCRIPTION': 0.5}

for each matched document in the result set:

    for each section in the document:

         tf = tf + compute_tf_for_section() * section_weights[section]

         idf = idf + compute_idf_for_section() * section_weights[section]

    document_score = (tf * idf)/ (k + tf)

This algorithm generates a relevance score for each matching document, and the results are then sorted in decreasing order of this score. The algorithm multiplies the tf and idf values of each section by a per-section weight. The values of these weights are hard-coded and were determined by running some arbitrary queries and evaluating the results.
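To make the pseudocode concrete, here is a small Python sketch of the same scoring scheme. The per-section tf and idf values are assumed to have already been computed by apropos' indexer, and the value of k here is just a placeholder, not the constant the real code uses:

```python
# Sketch of apropos' per-document scoring. The section weights follow
# the pseudocode above; K is an arbitrary stand-in saturation constant.
SECTION_WEIGHTS = {'NAME': 1.0, 'NAME_DESC': 0.9, 'DESCRIPTION': 0.5}
K = 1.2

def document_score(section_scores):
    """section_scores maps section name -> (tf, idf) for one document."""
    tf = 0.0
    idf = 0.0
    for section, (section_tf, section_idf) in section_scores.items():
        weight = SECTION_WEIGHTS.get(section, 0.0)
        tf += section_tf * weight
        idf += section_idf * weight
    return (tf * idf) / (K + tf)
```

Sorting the matched documents by `document_score` in decreasing order gives the final result list.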

How Can Machine Learning Help?

Ideally you would determine such weights by evaluating against a standard set of queries, measuring precision and recall to see which values produce the best results, but no such dataset of apropos queries and expected output existed.

Thanks to man-k.org, I was able to get some data: close to 1000 queries and their clicked results, which I sifted through manually to remove anomalies, for example someone going to the last page and opening a result, or bot traffic.

Now, I wanted to learn the weights which I had hard-coded to arbitrary values in the ranking algorithm above. This looked like a straightforward regression problem to me; I will explain how.

In machine learning, you use regression when you want to predict a continuous range of values, for example the temperature, or housing prices. Usually you have a set of features in your data; in the housing example, you could have features like the number of bedrooms, the number of bathrooms, and the area of the house, and you want to use them to predict a target value, the price of the house. You need to combine these features in a way that relates them to the price. One way to do that is to assign a weight to each feature and take their linear combination. For example:

w1 * number_bedrooms + w2 * number_bathrooms + w3 * area = price

Now, if we can determine the optimum values of these weights, we can predict the price of a house (to a good approximation) given its features. There are a number of algorithms out there which can learn these weights and use them to predict the output for you.
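As a quick illustration (this is not apropos code, and the data is made up), here is how scikit-learn's LinearRegression recovers such weights from the housing features above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: [bedrooms, bathrooms, area in sq. ft.] per house
X = np.array([
    [2, 1,  800],
    [3, 2, 1200],
    [4, 2, 1600],
    [3, 1, 1000],
    [5, 3, 2200],
])
y = np.array([150000, 230000, 300000, 190000, 420000])  # made-up prices

model = LinearRegression()
model.fit(X, y)
print(model.coef_)                     # the learned weights w1, w2, w3
print(model.predict([[3, 2, 1300]]))   # predicted price for a new house
```

Once fitted, the learned coefficients play exactly the role of w1, w2, w3 in the linear combination above.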

My problem of learning the section-wise weights was similar. My algorithm generates a tf-idf score for each section, multiplies it by a weight, sums it up, and calls that the score of the document. So I had the following:
  •  Features in the form of section-wise scores
  •  And I wanted to learn the weights for these features (the section-specific weights)
Sounds like a machine learning problem.

Why Learn Weights Instead of Learning to Rank?

Learning to rank is a different problem, and probably a more interesting one. But I wanted to first tackle the problem of learning the weights: if I can learn their optimum values, they can be plugged directly into NetBSD's apropos code and immediately improve its search.

The Challenges

I didn't have a target value for my data. If I used the output of the current ranking algorithm as-is as the target value, the weights learned by the machine learning model would be the same as the current weights. So I decided to manufacture the target value.

For each query in the data set, if the clicked result was not ranked 1, I set its target value to the score of the document ranked 1. For example, for the query "list files", ideally you want ls(1) at the top, but currently the top result is file(1). So I set the target value for ls(1) to the output score of file(1). This way, the machine learning model tries to optimize the weights so as to move ls(1) from its current score to that of file(1).
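A minimal sketch of this target-manufacturing step, with illustrative field names and scores (the real click log and apropos scores differ):

```python
# Hypothetical sketch: for each logged query, the clicked document's
# target score becomes the score of the document that currently ranks 1.
def make_targets(click_log):
    """click_log: list of dicts with keys 'query', 'clicked_rank',
    'clicked_score', 'top_score' (field names are illustrative)."""
    targets = []
    for entry in click_log:
        if entry['clicked_rank'] == 1:
            # Already at the top: the current score is the target.
            target = entry['clicked_score']
        else:
            # Push the clicked result toward the rank-1 score.
            target = entry['top_score']
        targets.append(target)
    return targets

log = [
    {'query': 'list files', 'clicked_rank': 4,
     'clicked_score': 0.31, 'top_score': 0.82},   # ls(1) behind file(1)
    {'query': 'fork', 'clicked_rank': 1,
     'clicked_score': 0.90, 'top_score': 0.90},
]
print(make_targets(log))  # [0.82, 0.9]
```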

I also had to do a lot of manual work processing the data and writing code to extract the section-wise scores from apropos, but those details are not very important.

 

The Results So Far

I tried out a few machine learning models, such as linear regression, support vector machines, and random forests, which are very popular models for these kinds of problems. I used the leave-one-out cross-validation technique to make sure the models didn't overfit the data, and used Mean Squared Error as the evaluation metric. The best results were produced by the random forest model. I still need to tune the parameters of the SVM; maybe it will beat random forests? I need to try that out, along with many other possibilities.
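For illustration, this is roughly what that evaluation loop looks like with scikit-learn. The data here is a synthetic stand-in, not the man-k.org dataset, and since it is generated from a linear rule, linear regression will win on it, unlike on the real data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in data: three section-wise scores per document,
# and a target that is (noisily) a weighted sum of them.
rng = np.random.RandomState(0)
X = rng.rand(30, 3)  # columns: NAME, NAME_DESC, DESCRIPTION scores
y = X @ np.array([1.0, 0.9, 0.5]) + rng.normal(0, 0.05, size=30)

models = {
    'linear regression': LinearRegression(),
    'svm': SVR(),
    'random forest': RandomForestRegressor(n_estimators=50, random_state=0),
}
for name, model in models.items():
    # scikit-learn maximizes scores, so MSE is exposed negated
    scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring='neg_mean_squared_error')
    print(f'{name}: leave-one-out MSE = {-scores.mean():.4f}')
```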

Admittedly, this approach is not perfect: it is prone to abuse via man-k.org's traffic and requires manual work, but at least it works as a validation of the idea. I am considering using the data from man-k.org, manually refining and annotating it, and using it as a standard dataset for further experiments.

Following are comparisons of some queries with the old weights and the new weights. It's nothing drastic, but the new weights seem to get rid of some of the non-relevant results at the top.

apropos -n 10 -C fork #old weights
fork (2)  create a new process
perlfork (1)      Perl's fork() emulation
cpu_lwp_fork (9)  finish a fork operation
pthread_atfork (3)        register handlers to be called when process forks
rlogind (8)       remote login server
rshd (8)  remote shell server
rexecd (8)        remote execution server
script (1)        make typescript of terminal session
moncontrol (3)    control execution profile
vfork (2) spawn new process in a virtual memory efficient way

apropos -n 10 -C fork #new weights
fork (2) create a new process
perlfork (1) Perl's fork() emulation
cpu_lwp_fork (9) finish a fork operation
pthread_atfork (3) register handlers to be called when process forks
vfork (2) spawn new process in a virtual memory efficient way
clone (2) spawn new process with options
daemon (3) run in the background
script (1) make typescript of terminal session
openpty (3) tty utility functions
rlogind (8) remote login server

The new weights seem to bring more relevant results up: for example, clone(2) shows up, rshd(8) and rexecd(8) go away, and rlogind(8) moves down.

apropos -n 10 -C create new process #old weights
init (8)  process control initialization
fork (2)  create a new process
fork1 (9) create a new process
timer_create (2)  create a per-process timer
getpgrp (2)       get process group
supfilesrv (8)    sup server processes
posix_spawn (3)   spawn a process
master (8)        Postfix master process
popen (3) process I/O
_lwp_create (2)   create a new light-weight process

apropos -n 10 -C create new process #new weights
fork (2) create a new process
fork1 (9) create a new process
_lwp_create (2) create a new light-weight process
pthread_create (3) create a new thread
clone (2) spawn new process with options
timer_create (2) create a per-process timer
UI_new (3) New User Interface
init (8) process control initialization
posix_spawn (3) spawn a process
master (8) Postfix master process

You can see that fork(2) moves to number 1, init(8) drops to 8, clone(2) appears, etc.

apropos -n 10 -C remove packages #old weights
groff_mdoc (7)    reference for groff's mdoc implementation
pkg_add (1)       a utility for installing and upgrading software package distributions
pkg_create (1)    a utility for creating software package distributions
pkg_delete (1)    a utility for deleting previously installed software package distributions
deroff (1)        remove nroff/troff, eqn, pic and tbl constructs
pkg_admin (1)     perform various administrative tasks to the pkg system
groff_tmac (5)    macro files in the roff typesetting system
ci (1)    check in RCS revisions
update-binfmts (8)       maintain registry of executable binary formats
rpc_svc_reg (3)   library routines for registering servers

apropos -n 10 -C remove packages #new weights
pkg_create (1) a utility for creating software package distributions
pkg_add (1) a utility for installing and upgrading software package distributions
pkg_delete (1) a utility for deleting previously installed software package distributions
deroff (1) remove nroff/troff, eqn, pic and tbl constructs
groff_mdoc (7) reference for groff's mdoc implementation
groff_tmac (5) macro files in the roff typesetting system
ci (1) check in RCS revisions
pkg_admin (1) perform various administrative tasks to the pkg system
update-binfmts (8) maintain registry of executable binary formats
rpc_svc_reg (3) library routines for registering servers

pkg_create moves to 1 and pkg_delete moves up; I think pkg_admin should have been further up.


Tuesday, October 4, 2011

Spell Corrector for Apropos

One of the new features I wrote in apropos is a basic yet reasonably effective spell corrector.  While working on apropos, one big nuisance that I noticed was wrongly spelled keywords in the query. When supporting full text searches, I guess it is the usual expectation to have support for spell correction as well.

The search in apropos is based on the Boolean search model, which means it returns only those documents which contain all the keywords in the query. Misspell even one keyword and you will get either all non-relevant results or no results at all. This behaviour is contrary to how the conventional apropos did its search: it would return all the results which matched even one of the keywords.
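The difference between the two models can be sketched in a few lines of Python; the toy documents and keyword sets here are made up purely for illustration:

```python
# Toy "index": each man page mapped to the set of terms it contains.
docs = {
    'ls(1)': {'list', 'directory', 'contents'},
    'file(1)': {'determine', 'file', 'type'},
    'fork(2)': {'create', 'new', 'process'},
}

def boolean_and(query_terms):
    """Boolean-model search: every query term must match."""
    return [d for d, terms in docs.items() if query_terms <= terms]

def boolean_or(query_terms):
    """Conventional apropos-style search: any term may match."""
    return [d for d, terms in docs.items() if query_terms & terms]

print(boolean_and({'list', 'directry'}))  # typo kills the whole query: []
print(boolean_or({'list', 'directry'}))   # still matches: ['ls(1)']
```

With AND semantics, a single misspelled keyword empties the result set, which is exactly why a spell corrector matters here.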

The user might think "this new apropos is useless; it can't get me any right results," and then most likely start experimenting with different keywords, which might or might not succeed. The point is, apropos should be clever enough to inform the user that one or more keywords in the query were probably misspelled, so that the user doesn't waste time scratching his head.

Implementation Of The Spell Corrector: Writing an industrial-strength spell corrector (like those of Google, Bing, etc.) is a complex task, and I have no idea about their intricacies; I was looking for a fairly basic implementation. I came across two articles which discuss implementing a relatively simple spell corrector: one by Jon Bentley in his famous book Programming Pearls, and one by Prof. Peter Norvig in his famous post "How to Write a Spelling Corrector". I decided to go with Peter Norvig's approach because of its simplicity and ease of implementation. Before continuing, I would like to thank Prof. Norvig for writing such an insightful article and sharing it :-)

I highly recommend reading Prof. Norvig's article to properly understand the maths and logic involved. I am going to give some insight into what his Python code is doing and then produce the C translation of the code, with a demo.

The idea is to find the dictionary word at the least edit distance from the word being checked. Edit distance here means the number of single-character edits (insertions, deletions, replacements, or transpositions) needed to turn the given word into the correctly spelled word. Peter Norvig mentions in his post that edit distance 1 is sufficient for 80-95% of cases.

The strategy for finding words at edit distance 1 is very simple. Four different kinds of mistakes can produce a misspelled word at edit distance 1:
  1. Deletion: You missed a character while typing the word. For example: "speling".
  2. Transposition: You exchanged the positions of two adjacent characters in the word. For example: "teh" instead of "the".
  3. Replacement: You replaced a letter in the word with some other letter (possibly you pressed the wrong key on the keyboard). For example: "produkt" instead of "product".
  4. Insertion: You entered one additional letter in the word. For example: "filles" when you meant "files".
I will take a simple example and show all its possible permutations at edit distance 1. Let's say we misspelled "the" as "teh"; then the following are the different possible permutations:

deletes =  ['eh', 'th', 'te']

transpose =  ['eth', 'the']

# the replaces and inserts lists are compacted, but you get the idea
replaces = ['aeh', 'beh', 'ceh', 'deh', 'eeh', 'feh', ..., 'zeh',
            'tah', 'tbh', 'tch', 'tdh', 'teh', 'tfh', ..., 'tzh',
            'tea', 'teb', 'tec', 'ted', 'tee', 'tef', ..., 'tez']

inserts = ['ateh', 'bteh', 'cteh', 'dteh', 'eteh', 'fteh', ..., 'zteh',
           'taeh', 'tbeh', 'tceh', 'tdeh', 'teeh', 'tfeh', ..., 'tzeh',
           'teah', 'tebh', 'tech', 'tedh', 'teeh', 'tefh', ..., 'tezh',
           'teha', 'tehb', 'tehc', 'tehd', 'tehe', 'tehf', ..., 'tehz']


Once we have generated all these possible permutations of the word at edit distance 1, we check which of them are valid words in our dictionary. It is always possible that more than one permutation is a valid word, in which case we pick the word which occurs most frequently in the sample corpus used for building the dictionary (this is the training model used for this spell corrector).
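In Python, the candidate generation and dictionary lookup take only a few lines. The following is a close paraphrase of Prof. Norvig's code, with a toy frequency table standing in for a dictionary built from a real corpus:

```python
from collections import Counter

# Toy stand-in frequency table; the real one is built from a corpus.
WORDS = Counter({'the': 500, 'tea': 20})

def edits1(word):
    """All strings at edit distance 1 from word."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Pick the known candidate with the highest corpus frequency."""
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=WORDS.get)

print(correct('teh'))  # prints "the": the transpose beats the rarer 'tea'
```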

I suppose that explains what we need to do. Now time for some code:
NOTE: The following is a C implementation of Peter Norvig's spell corrector. I wrote it from scratch; it is part of the apropos_replacement project and is licensed under the two-clause BSD license.

/*-
 * Copyright (c) 2011 Abhinav Upadhyay <er.abhinav.upadhyay@gmail.com>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in
 * the documentation and/or other materials provided with the
 * distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
 * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

static char **
edits1(char *word)
{
    int i;
    int len_a;
    int len_b;
    int counter = 0;
    char alphabet;
    int n = strlen(word);
    set splits[n + 1];

    /* Calculate the number of possible permutations and allocate memory */
    size_t size = n + n - 1 + 26 * n + 26 * (n + 1);
    char **candidates = emalloc(size * sizeof(char *));

    /* Start by generating a split up of the characters in the word */
    for (i = 0; i < n + 1; i++) {
        splits[i].a = (char *) emalloc(i + 1);
        splits[i].b = (char *) emalloc(n - i + 1);
        memcpy(splits[i].a, word, i);
        memcpy(splits[i].b, word + i, n - i + 1);
        splits[i].a[i] = 0;
    }

    /* Now generate all the permutations at maximum edit distance of 1.
     * counter keeps track of the current index position in the array
     * candidates where the next permutation needs to be stored.
     */
    for (i = 0; i < n + 1; i++) {
        len_a = strlen(splits[i].a);
        len_b = strlen(splits[i].b);
        assert(len_a + len_b == n);

        /* Deletes */
        if (i < n) {
            candidates[counter] = emalloc(n);
            memcpy(candidates[counter], splits[i].a, len_a);
            if (len_b - 1 > 0)
                memcpy(candidates[counter] + len_a, splits[i].b + 1,
                    len_b - 1);
            candidates[counter][n - 1] = 0;
            counter++;
        }

        /* Transposes */
        if (i < n - 1) {
            candidates[counter] = emalloc(n + 1);
            memcpy(candidates[counter], splits[i].a, len_a);
            if (len_b >= 1)
                memcpy(candidates[counter] + len_a, splits[i].b + 1, 1);
            if (len_b >= 1)
                memcpy(candidates[counter] + len_a + 1, splits[i].b, 1);
            if (len_b >= 2)
                memcpy(candidates[counter] + len_a + 2, splits[i].b + 2,
                    len_b - 2);
            candidates[counter][n] = 0;
            counter++;
        }

        /* For replaces and inserts, run a loop from 'a' to 'z' */
        for (alphabet = 'a'; alphabet <= 'z'; alphabet++) {
            /* Replaces */
            if (i < n) {
                candidates[counter] = emalloc(n + 1);
                memcpy(candidates[counter], splits[i].a, len_a);
                memcpy(candidates[counter] + len_a, &alphabet, 1);
                if (len_b - 1 >= 1)
                    memcpy(candidates[counter] + len_a + 1, splits[i].b + 1,
                        len_b - 1);
                candidates[counter][n] = 0;
                counter++;
            }

            /* Inserts */
            candidates[counter] = emalloc(n + 2);
            memcpy(candidates[counter], splits[i].a, len_a);
            memcpy(candidates[counter] + len_a, &alphabet, 1);
            if (len_b >= 1)
                memcpy(candidates[counter] + len_a + 1, splits[i].b, len_b);
            candidates[counter][n + 1] = 0;
            counter++;
        }
    }
    return candidates;
}

/*
 * known_word --
 *  Pass an array of strings to this function and it will return the word
 *  with the maximum frequency in the dictionary. If no word in the list
 *  is found in the dictionary, it returns NULL.
 *  NOTE: The dictionary in our case is a table in the db with two fields:
 *  term, occurrences
 */
static char *
known_word(sqlite3 *db, char **list, int n)
{
    int i, rc;
    char *sqlstr;
    char *termlist = NULL;
    char *correct = NULL;
    sqlite3_stmt *stmt;

    /* Build termlist: a comma separated list of all the words in list,
     * for use in the SQL query later.
     */
    int total_len = BUFLEN * 20;    /* total bytes allocated to termlist */
    termlist = emalloc(total_len);
    int offset = 0;    /* next byte to write at in termlist */
    termlist[0] = '(';
    offset++;

    for (i = 0; i < n; i++) {
        int d = strlen(list[i]);
        if (total_len - offset < d + 3) {
            /* Double the allocation and keep total_len in sync with it */
            termlist = erealloc(termlist, 2 * total_len);
            total_len *= 2;
        }
        memcpy(termlist + offset, "\'", 1);
        offset++;
        memcpy(termlist + offset, list[i], d);
        offset += d;

        if (i == n - 1) {
            memcpy(termlist + offset, "\'", 1);
            offset++;
        } else {
            memcpy(termlist + offset, "\',", 2);
            offset += 2;
        }
    }
    if (total_len - offset > 3)
        memcpy(termlist + offset, ")", 2);
    else
        concat(&termlist, ")", 1);

    easprintf(&sqlstr, "SELECT term FROM metadb.dict WHERE "
        "occurrences = (SELECT MAX(occurrences) FROM metadb.dict "
        "WHERE term IN %s) AND term IN %s", termlist, termlist);
    rc = sqlite3_prepare_v2(db, sqlstr, -1, &stmt, NULL);
    if (rc != SQLITE_OK) {
        warnx("%s", sqlite3_errmsg(db));
        free(sqlstr);
        free(termlist);
        return NULL;
    }

    if (sqlite3_step(stmt) == SQLITE_ROW)
        correct = strdup((char *) sqlite3_column_text(stmt, 0));

    sqlite3_finalize(stmt);
    free(sqlstr);
    free(termlist);
    return correct;
}

static void
free_list(char **list, int n)
{
    int i = 0;
    if (list == NULL)
        return;

    while (i < n) {
        free(list[i]);
        i++;
    }
}

/*
 * spell --
 *  The API exposed to the user. Returns the most closely matched word
 *  from the dictionary. It first searches for all possible words at edit
 *  distance 1; if no matches are found, it goes further and looks for
 *  words at edit distance 2 as well. If no matches are found at all, it
 *  returns NULL.
 */
char *
spell(sqlite3 *db, char *word)
{
    int i;
    char *correct;
    char **candidates;
    int count2 = 0;
    char **cand2 = NULL;
    int n;
    int count;

    lower(word);

    /* If the word already exists in the dictionary, no need to go further */
    correct = known_word(db, &word, 1);

    if (!correct) {
        n = strlen(word);
        count = n + n - 1 + 26 * n + 26 * (n + 1);
        candidates = edits1(word);
        correct = known_word(db, candidates, count);
        /* No matches found? Let's go further and look for matches at edit
         * distance 2. To make the search fast we use a heuristic: take one
         * word at a time from candidates, generate its permutations, and
         * see if a match is found. If a match is found, exit the loop.
         * Works reasonably fast, but accuracy is not quite there in some
         * cases.
         */
        if (correct == NULL) {
            for (i = 0; i < count; i++) {
                n = strlen(candidates[i]);
                count2 = n + n - 1 + 26 * n + 26 * (n + 1);
                cand2 = edits1(candidates[i]);
                if ((correct = known_word(db, cand2, count2)))
                    break;
                free_list(cand2, count2);
                cand2 = NULL;
            }
        }
        free_list(candidates, count);
        free_list(cand2, count2);
    }

    return correct;
}

Demo:
Following are some sample runs of apropos:

$ ./apropos "funckiton for coping stings"
Did you mean "function for copying strings" ?
$ ./apropos "generat termcap databse"
Did you mean "generate termcap database" ?
$ ./apropos idcmp
Did you mean "icmp" ?
$ ./apropos "confguire kernal"
Did you mean "configure kernel" ?
$ ./apropos "packate fillter"
Did you mean "package filter" ?
$ ./apropos reeltek
Did you mean "realtek" ?

Following are some screenshots of apropos_cgi (a CGI version of apropos for browsers):

Further Scope: There are a few technical glitches in integrating this spell corrector with apropos, so those need to be sorted out. The suggestions are not always as expected, so the model for the spell corrector probably needs some fine tuning (as Peter Norvig discusses at the end of his article). And while writing this post, it occurred to me that this implementation could make a fine small-scale backend for an auto-completion feature in a web application (for example the apropos CGI above). ;-)

All this code is in the demo-spell and exp-spell branches of the project on GitHub.

I am not sure if anyone would read this far, but thanks anyway for reading and taking interest. :-)