Saturday, December 5, 2020

Understanding Linear Hashing

Linear Hashing

In my most recent night-time project I am learning how databases work by actually implementing one. I started by implementing a simple key-value store using a hash index, which has required me to dig deep into the hash table literature.

Many of us have implemented a hash table once or a few times in our lives - I've certainly done my fair share of implementations. One of the major problems we worry about in a hash table implementation is hash collisions. To recall, a hash collision occurs when two or more keys hash to the same index in the table. The most straightforward way of handling collisions is a technique called separate chaining. In separate chaining we use a linked list within each bucket to store all the key/value pairs which were mapped to the same bucket by the hash function. An illustration is shown below (sorry for my terrible handwriting and illustration skills):

[Figure: separate chaining, with each bucket pointing to a linked list of the entries that hashed to it]
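To make the idea concrete, here is a minimal sketch of separate chaining, using Python lists in place of linked lists for brevity (the class name ChainedTable is mine; the EntryList/Entry classes in the full implementation at the end of this post play the same role with an actual linked list):

class ChainedTable:
    def __init__(self, nbuckets=8):
        # each bucket is a chain of (key, value) pairs
        self.buckets = [[] for _ in range(nbuckets)]

    def put(self, k, v):
        # all keys hashing to the same index share one chain
        self.buckets[hash(k) % len(self.buckets)].append((k, v))

    def get(self, k):
        # worst case: scan the entire chain in this bucket
        for key, value in self.buckets[hash(k) % len(self.buckets)]:
            if key == k:
                return value
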
The problem with separate chaining is that as the number of entries in the hash table grows, we get more and more collisions and the linked lists tend to get longer. This hurts lookup performance because in the worst case we have to scan an entire linked list to find a key in a bucket, whereas we use hash tables precisely for their O(1) complexity. To keep the average lookup complexity at O(1), we usually expand the table once we cross a threshold load factor. The load factor can be as simple as the ratio of the number of entries in the table to the number of buckets. Once this ratio crosses a threshold, say 0.75, we grow the table so that hash collisions go down and lookups remain fast.

The problem with regrowing is that we have to allocate new memory for the expanded table, rehash all the keys stored in the previous table (because with the increased table size, their bucket index in the new table may change), and finally copy all those entries into the new table. The usual growth factor is 2, i.e., we double the size of the table every time. This gets more expensive every time we regrow the table, although in practice it may work out to be okay.
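In outline, regrowing looks something like this sketch (the function name grow and the hash_fn parameter are mine; the _grow method of the HashMap class at the end of this post does the same thing with real buckets):

def grow(buckets, hash_fn):
    # allocate a table twice the size and rehash every entry, since
    # each entry's bucket index depends on the table size
    new_buckets = [[] for _ in range(2 * len(buckets))]
    for chain in buckets:
        for k, v in chain:
            new_buckets[hash_fn(k) % len(new_buckets)].append((k, v))
    return new_buckets
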

However, for an on-disk hash table (for indexing in a database), it is not that simple. You start by allocating a fixed number of buckets in a file, and the hash function uses the number of buckets to map a key into one of them. When you start getting a high number of collisions or grow beyond a certain load factor, doubling the number of buckets (as in the in-memory case) is not straightforward, because we would have to rehash and rewrite all the entries, and disk I/O is orders of magnitude slower than RAM. This would also have consequences for concurrency: while we are expanding the index and rewriting the entries, all readers and writers would be blocked until it finishes.

So how do we expand the index cheaply in the on-disk case? This was answered in 1980 by Witold Litwin in a paper titled "Linear Hashing: A New Tool for File and Table Addressing". Using linear hashing it is possible to access any record on disk in at most two accesses. Linear hashing grows the table linearly, one bucket at a time, instead of the exponential growth we get when doubling the table size.

The linear hashing technique is part of a family of hashing techniques called dynamic hashing. In dynamic hashing we use a family of hash functions rather than a single fixed hash function (as in a static hash table implementation). Linear hashing is a specific example of dynamic hashing where we use two hash functions at any point in time. Following is an outline of how it works:

We start with the following variables:
n = number of initial buckets (must be a power of 2)
s = 0 (this is the index of the bucket which is to be split next)
i = number of bits required to address the n buckets. 
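For example, with n = 2 these start out as follows (a tiny sketch, mirroring the LinearHashMap constructor at the end of this post):

import math

n = 2                   # initial number of buckets (a power of 2)
s = 0                   # split pointer: index of the next bucket to split
i = int(math.log2(n))   # bits needed to address n buckets; here i = 1
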

For the load factor we will use the following formula:

load_factor = number of entries / (2 * number of buckets)

The 2 in the denominator is to allow on average 2 records in a chain in any given bucket. For 1 entry and 2 buckets the load factor is 1/4; for 2 entries and 2 buckets it is 1/2; for 3 entries and 2 buckets it is 3/4; for 4 entries and 2 buckets it is 1; and so on. We will use 0.8 as the load factor threshold.
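In code, using the numbers above (the helper name load_factor is mine):

def load_factor(nentries, nbuckets):
    # aim for an average chain length of 2, hence the 2 in the denominator
    return nentries / (2 * nbuckets)

load_factor(3, 2)   # 0.75 -> below the 0.8 threshold, no need to grow
load_factor(4, 2)   # 1.0  -> above the threshold, time to add a bucket
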

Let's assume we are using a hash function which gives us a 64-bit unsigned integer, which means we can address up to 2^64 buckets. In the beginning our hash table will only have n buckets, so we will use the lowest-order i bits of the hash to map a key to a bucket. Once we cross the load factor threshold, we will add one more bucket to the table and increment i if needed (we increment i every time the number of buckets crosses a power of 2 and we need an extra bit to address the new bucket). After adding the new bucket we split the keys stored in the bucket at index s between it and the newly added bucket. We increment s by one after every split. Once we have doubled the number of buckets from where we started, we reset s to 0 and repeat.
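Here is a sketch of that bookkeeping, assuming a 64-bit hash function hash_fn and list-based buckets (the names bucket_index, split and hash_fn are mine; the full implementation at the end of this post packages the same logic into a class):

def bucket_index(h, i, nbuckets):
    idx = h & ((1 << i) - 1)       # lowest-order i bits of the hash
    if idx < nbuckets:
        return idx
    return idx ^ (1 << (i - 1))    # bucket doesn't exist yet: drop the top bit

def split(table, s, i, hash_fn):
    table.append([])               # grow by exactly one bucket
    if len(table) > (1 << i):      # crossed a power of 2: need one more bit
        i += 1
    old, table[s] = table[s], []
    for k, v in old:               # rehash only the bucket being split;
        idx = bucket_index(hash_fn(k), i, len(table))
        table[idx].append((k, v))  # entries land in bucket s or the new bucket
    s += 1
    if s * 2 == len(table):        # doubled since the last reset
        s = 0
    return s, i
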

Walkthrough of Inserting Keys

Following is a walk through of how it would work through a toy example. Let's say we start with n=2 buckets, so that each bucket is addressable with just one bit (0 or 1), so i=1 and s=0.

Let's say we insert four keys into the table via the following calls:
put(k1, v1)
put(k2, v2)
put(k3, v3)
put(k4, v4)

Following is an illustration of how they might be arranged in the table:
[Figure: the four entries distributed between buckets B0 and B1]
After adding 4 entries to our table, the load factor becomes 4/4 = 1, which is greater than our threshold of 0.8, so we grow the table. In linear hashing we grow the table by adding just one new bucket at a time, so we add bucket B2. Since addressing a 3rd bucket needs one extra bit, we increment i to 2. We also split the entries of bucket B0 between B0 and B2 by rehashing each of them using 2 bits. Following is how it would look after doing this:

[Figure: the table after adding bucket B2 and splitting the entries of B0 between B0 and B2]
After the split, s is incremented to 1.
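To see why B0's entries can only end up in B0 or B2: every entry in B0 had a low-order bit of 0, so its 2-bit index is either 00 (stay in B0) or 10 (move to B2). Using the bucket_index sketch from above:

bucket_index(0b00, 2, 3)   # -> 0: stays in B0
bucket_index(0b10, 2, 3)   # -> 2: moves to the new bucket B2
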

Now, let's insert another entry into the table. The new entry is key k5, and its 2-bit hash value is 11 (binary), i.e., index 3. As we only have 3 buckets right now, this is an invalid index, so we unset its most significant bit, leaving 01, and store k5 in bucket B1. Following is how it would look:

[Figure: k5 stored in bucket B1]
After this, our load factor is 5/6 ≈ 0.83, which requires adding another bucket (B3), followed by a split of the entries between buckets B1 and B3. We still use i=2, since 2 bits can address 4 buckets. Following is how it would look after that:

[Figure: the table after adding bucket B3 and splitting the entries of B1 between B1 and B3]
Since we have now doubled the number of buckets from where we started, we reset the split pointer s to 0 (we reset it every time we double the number of buckets, as stated previously).

In this way we can continue adding entries, slowly growing the table as required. This works particularly well for on-disk storage because it is cheap to append one bucket and rewrite a few values to adjust the linked lists. It also works well from a concurrency point of view: we can take a lock on just the hash chain being split while readers and writers continue to access the other chains.


Reading Values

Let's also briefly talk about how reading from this hash table would work. Say, after having added the 5 entries as above, we want to read the key k3. How would we do that?

Since i=2, we get h(k3) = 00. That is the first bucket (B0); we can safely read the value from there and return it.

What about the case when we have incremented i but our total number of buckets is less than 2^i? For example, if in the above case we added a 5th bucket, we would have to increment i to 3, but 3 bits can address 8 buckets while we only have 5. How would we read from the hash table in that case? It's pretty simple: we use the lowest 3 bits of the hash as the index. If the resulting index is beyond the buckets we currently have, we unset its top bit (effectively using only 2 bits) and read from the bucket at that index; otherwise we use all three bits.
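Using the bucket_index sketch from earlier, with i=3 and 5 buckets:

bucket_index(0b011, 3, 5)   # -> 3: a valid bucket, use all 3 bits
bucket_index(0b110, 3, 5)   # -> 2: index 6 doesn't exist, drop the top bit
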

Following is an implementation of a vanilla hash table along with linear hashing in Python, in less than 100 lines. I first implemented the usual HashMap which grows exponentially, and then implemented LinearHashMap by extending it and overriding the _grow method. I hope it makes sense, but feel free to leave a comment if something is not clear.



# Copyright (c) 2020,2021 Abhinav Upadhyay
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS
# ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
# TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS
# BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
#
import math
import xxhash

class HashMap:
    def __init__(self, size=32, load_factor=0.75, grow=True):
        self.table = [EntryList() for _ in range(size)]
        self.nentries = 0
        self.load_factor = load_factor
        self.grow = grow

    def put(self, k, v):
        bkt_idx = self._get_bucket_idx(k, len(self.table))
        self.table[bkt_idx].append(k, v)
        self.nentries += 1
        if self.grow and self._compute_load_factor() > self.load_factor:
            self._grow()

    def get(self, k):
        bkt_idx = self._get_bucket_idx(k, len(self.table))
        for entry in self.table[bkt_idx]:
            if entry.key == k:
                return entry.value

    def _compute_load_factor(self):
        # aim for an average chain length of 2, as described above
        return self.nentries / (2 * len(self.table))

    def _get_bucket_idx(self, k, size):
        return xxhash.xxh64(k).intdigest() % size

    def _grow(self):
        # we double the table size and rehash all the entries
        newsize = len(self.table) * 2
        new_table = [EntryList() for _ in range(newsize)]
        for bucket in self.table:
            for e in bucket:
                bucket_idx = self._get_bucket_idx(e.key, newsize)
                new_table[bucket_idx].append(e.key, e.value)
        self.table = new_table

class LinearHashMap(HashMap):
    def __init__(self, size=32, load_factor=0.75):
        super().__init__(size, load_factor)
        self.i = int(math.log2(size))
        self.split_idx = 0

    def _grow(self):
        split_idx = self.split_idx
        self.split_idx += 1
        old_bucket = self.table[split_idx]
        new_bucket = EntryList()
        self.table.append(new_bucket)
        # if we have grown past the next power of 2 number of buckets
        # we need one more bit to address the new bucket
        if len(self.table) > (1 << self.i):
            self.i += 1
        # if we have doubled the number of buckets, we reset s to 0
        if self.split_idx * 2 == len(self.table):
            self.split_idx = 0
        # rehash the entries of the bucket being split: entries whose
        # new index differs move to the new bucket, the rest stay behind
        remaining = EntryList()
        for e in old_bucket:
            new_bucket_id = self._get_bucket_idx(e.key, len(self.table))
            if new_bucket_id != split_idx:
                new_bucket.append(e.key, e.value)
            else:
                remaining.append(e.key, e.value)
        self.table[split_idx] = remaining

    def _get_bucket_idx(self, k, size):
        h = xxhash.xxh64(k).intdigest()
        # we take the lowest-order i bits as the bucket index;
        # if this index is less than the number of buckets we
        # return it as it is, otherwise we unset the MSB so we
        # effectively use i-1 bits and address a valid bucket
        bkt_idx = h & ((1 << self.i) - 1)
        if bkt_idx < size:
            return bkt_idx
        return bkt_idx ^ (1 << (self.i - 1))

class EntryList:
    def __init__(self):
        self.head = None

    def append(self, k, v):
        if self.head is None:
            self.head = Entry(k, v)
            return
        self.head.append(k, v)

    def __iter__(self):
        next = self.head
        while next:
            yield next
            next = next.next

class Entry:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None

    def append(self, k, v):
        new_entry = Entry(k, v)
        new_entry.next = self.next
        self.next = new_entry
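
A quick usage sketch (which bucket each key lands in depends on its xxhash value, but the growth pattern follows the walkthrough above, given the load factor formula with 2 in the denominator):

m = LinearHashMap(size=2, load_factor=0.8)
for k, v in [("k1", "v1"), ("k2", "v2"), ("k3", "v3"),
             ("k4", "v4"), ("k5", "v5")]:
    m.put(k, v)

print(m.get("k3"))    # -> v3
print(len(m.table))   # -> 4: the table grew one bucket at a time, twice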
