Probabilistic Data Structures

Bloom filter is a probabilistic data structure. Before we understand what is a bloom filter, let's first look at what are probabilistic data structures.

Probabilistic data structures, as their name suggests are data structures which have some sort of uncertainty involved with them and thus their results are approximate in nature. For example, a set is a data structure we all are familiar with. When we query a set to check if it contains a value or not, it always returns true if it contains the value, otherwise false. There is no uncertainty in its output. If we imagine a probabilistic counter part for the set data structure, it could be that this probabilistic set returns false with 100% certainty if the value is not contained in it, but if it returns true then it basically means that the value might be contained but it is not guaranteed. So this probabilistic set data structure would have some degree of false positive rate. Most of these probabilistic data structures employ randomization techniques and we can estimate an upper bound on their false positive and false negative rates and tune them to our tolerance.

Now comes the question, why do we need such probabilistic data structures when we have 100% accurate data structures already available. The answer boils down to the scale of data, resource requirements and performance. With normal data structures such as hash tables, sets, trees, linked lists etc, we need to store the complete data in them. As the scale of data grows, so does the cost of these data structures and in some cases the cost of their operations. Probabilistic data structures solve this problem by not storing all the data, but rather some sort of a smaller signature to represent the data they are holding. This makes them very memory efficient but since they are not storing the full data, they compromise in their accuracy.

Let's look at few examples of such probabilistic data structures.

HyperLogLog: This is used for finding cardinality of a dataset. It is typically used in big data and streaming applications where keeping the complete data in memory in order to measure its cardinality is not feasible. A HyperLogLog can find approximate cardinalities of 10^9 with error rate of 2% while consuming 1.5 KB memory.

Locality Sensitive Hashing (LSH): In database systems, sometimes we are interested in finding items which are similar to a given query item. Typically this would require comparing the query item with every item stored in the system which even though linear in nature, can be very expensive if we want low latency. Locality Sensitive Hashing solves this by employing randomization techniques. Essentially LSH is able to do this expensive operation in constant amount of time with a fraction of memory usage than the actual dataset size. The cost is some amount of false positive and false negative rate.

Now that we have looked at few of these, let's look in detail what a Bloom Filter is and how it works.

Bloom Filter

A bloom filter is a data structure designed to answer membership queries. It is similar to the Set data structure, but its answers are approximate. Just like a set we can store data items in a bloom filter and later query it to determine if a particular item is stored in the filter or not. If the filter returns false, then it is 100% guaranteed that it is not present in the filter, however, if the filter returns true then that means that the item might be contained in the set but we cannot be sure. Let's try to understand how it works and implement it alongside to see it in action.

Being a probabilistic data structure, bloom filter does not actually store the data items, but only a signature of the data. The signature can be configured to be of a fixed size for example 4 bits per data item, which means doesn't matter how big our values are, the filter will take a very small amount of memory to hold it.

At its core the filter uses a bit vector to store the signatures. It generates the signature by using a bunch of hash functions. The number of hash functions is also a tuning parameter, the more the number of hash functions we use, the bigger the signature size. Along with the size of the bitvector and the number of hash functions we use, we can tune the filter for an upper bound on its false positive rate. We will see more on this later.

Bloom Filter Put: Let's see how do we put an item in a bloom filter. Let's assume we are using j number of hash functions in the filter. We use each hash function on the data item one at a time and get an index returned. This particular index is turned on (set to 1) in the bit vector.

Bloom filter Query: Similar to the put operation, at the time of querying we do the same thing. We apply each hash function in sequence and check whether the corresponding index in the bit vector is on or not. If any of the indices is off that means the item was never stored in the filter. However if all of the indices are on, it doesn't necessarily mean that the item was stored in the filter, it could be that those indices were on because of storing other data items.

Let's see an example. Let's say our bit vector is 8 bits long and we are using 2 hash functions h1 and h2. Initially the bitvector is going to be empty, i.e. all bits are set to 0.

Now let's say we wish to store the string "apple" in the filter. We will apply the two hash functions in order, like so:

h1("apple") = 3

h2("apple") = 0

So we will set the corresponding indices in the vector.

Let's insert the next item, "banana" in the filter.

h1("banana") = 6

h2("banana") = 3

We update the vector by turning on the corresponding bits.

Now let's try to query for "grapes".

h1("grapes") = 0

h2("grapes") = 7

We see that even though bit 0 is set, but bit 7 is off which means that "grapes" was never stored in the filter.

Let's query for "orange".

h1("orange") = 3

h2("orange") = 6

We can see that both bits 3 and 6 are on in the filter but in our example we never stored "orange". But bit was turned on for both "apple" and "banana" while bit 6 was set for "banana". This explains the false positive element in bloom filter.

Bloom Filter Implementation in C

Now that we understand how the bloom filter works conceptually. Let's implement in C to get a better understanding. I'm choosing C just because we can very efficiently implement a bit vector in it because we have tighter control over the memory. Talking about bit vectors, let's start by implementing that first since the bloom filter builds on top of that.

Bit Vector:

Let's first define the API of the bit vector data structure. We will declare the API in a file called bitvector.h. It will look like this:

#ifndef BITVECTOR_H
#define BITVECTOR_H
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
typedef struct bitvector_t {
    uint8_t *vector;
    size_t size;
} bitvector_t;

bitvector_t *bitvector_allocate(size_t);
void bitvector_free(bitvector_t *);
void bitvector_set(bitvector_t *, size_t);
void bitvector_unset(bitvector_t *, size_t);
bool bitvector_isset(bitvector_t *, size_t);
#endif

The bitvector_t is the main data structure which consists of an array of type uint8_t, which is basically an unsigned byte. We also maintain the size of the bit vector in the struct. Then we have apis to allocate a bit vector and free it, to set a particular bit position and unset it and lastly an API to test whether a particular bit index is set in the vector or not.

Before we start implementing the APIs, let's first write a test following the TDD approach and then we can write the implementation to verify it. We will be writing a lot of tests and to save some effort I am going to write a few utilities for tests. Let's create a file called test_utils.h and put following code in it.

#ifndef TEST_UTILS_H
#define TEST_UTILS_H

#include <assert.h>
#include <stdio.h>

#define test(expr, ...) if (expr) { \
    ; } else { \
    fprintf(stderr, __VA_ARGS__); \
    abort(); \
    }

void
print_test_separator_line(void)
{
    for (int i = 0; i < 100; i++)
        printf("-");
    printf("\n");
}

#endif

The file defines a macro called test to which we can a boolean expression. If the expression is false the macro prints a message on stderr and exits. When writing tests we can call this macro with the condition we are asserting and a message to be printed in case the test fails.

With that behind, let's write the test for bitvector in a file called test_bitvecor.c

#include <stdio.h>
#include "bitvector.h"
#include "test_utils.h"

static void
test_bitvector(void)
{
    printf("Testing bitvector initialized with all bits clear");
    print_test_separator_line();
    bitvector_t *vector = bitvector_allocate(32);
    for (size_t i = 0; i < 32; i++) {
        bool isset = bitvector_isset(vector, i);
        test(isset == false, "Expected bit %zu to be unset", i);
    }
    size_t indices[] = {5, 7, 8, 0, 10, 16};
    for (size_t i = 0; i < sizeof(indices) / sizeof(indices[0]); i++) {
        bitvector_set(vector, indices[i]);
    }

    printf("Testing bitvector set");
    print_test_separator_line();
    for (size_t i = 0; i < sizeof(indices) / sizeof(indices[0]); i++) {
        bool isset = bitvector_isset(vector, indices[i]);
        test(isset == true, "Expected bit %zu to be unset", indices[i]);
    }

    printf("Testing bitvector unset");
    print_test_separator_line();
    for (size_t i = 0; i < sizeof(indices) / sizeof(indices[0]); i++) {
        bitvector_unset(vector, indices[i]);
    }
    for (size_t i = 0; i < 32; i++) {
        bool isset = bitvector_isset(vector, i);
        test(isset == false, "Expected bit %zu to be unset", i);
    }

    bitvector_free(vector);
}

int
main(int argc, char **argv)
{
    test_bitvector();
    return 0;
}

The test code has only one test function which is testing all APIs of the bit vector. In the test_bitvector test function we are doing following things:

Allocating a bit vector of size 32
Then we are asserting that all the bits in the vector should be off since it was just created
Then we set some of the bits on in the vector.
Next we test whether the bits we just set are indeed set.
After that we turn those bits back to off and assert that indeed that now those bits are off
Finally we free the vector.

We can start writing a Makefile in order to compile this.

all: test_bitvector

test_bitvector: test_bitvector.o
    ${CC} ${CFLAGS} -o test_bitvector test_bitvector.o

test_bitvector.o: test_bitvector.c
    ${CC} ${CFLAGS} -c test_bitvector.c

Although right now there isn't much point of trying to compile because we don't have implementations of the bitvector APIs and so the linker will complain about missing functions. Let's implement the bitvector APIs in a file called bitvector.c. Instead of pasting the complete file here, I will reproduce individual functions and try to explain what's happening.

bitvector_allocate:

bitvector_t *
bitvector_allocate(size_t size)
{
    bitvector_t *vector;
    vector = malloc(sizeof(*vector));
    if (vector == NULL)
        err(EXIT_FAILURE, "malloc failed");

    size_t nbytes = size / 8 + 1;
    vector->size = size;
    vector->vector = malloc(nbytes);
    if (vector->vector == NULL)
        err(EXIT_FAILURE, "malloc failed");
    memset(vector->vector, nbytes, 0);
    return vector;
}

This is the API to allocate the bitvector. We pass the size of the vector as an argument. We know that we cannot allocate memory at the level of bits, the smallest amount of memory that we can allocate and address is one byte. Therefore we need to figure out to be able to address the requested number of bits how many bytes we need to allocate. That's what we are doing by dividing the size by 8. Once we allocate the requested memory we need to set it to 0 so that the vector is initialized with all bits set as 0 initially.

bitvector_set:

void
bitvector_set(bitvector_t *vector, size_t index)
{
    size_t byte_index = index / 8;
    size_t byte_offset = index % 8;
    vector->vector[byte_index] |= (1UL << byte_offset);
}

This function is expected to set the bit at the given index position to 1. We know that didn't really allocate a bit array but a byte array. Therefore first we need to identify which byte we need to index into the byte array and then within that byte which bit position we need to turn on. The first line of the function identifies the byte index and the second line identifies the bit index within the byte. The last line uses the bitwise operations to turn on that bit.

bitvector_unset

void
bitvector_unset(bitvector_t *vector, size_t index)
{
    size_t byte_index = index / 8;
    size_t byte_offset = index % 8;
    vector->vector[byte_index] &= ~(1UL << byte_offset);
}

This function is the opposite of bitvector_set. The code is almost identical except the last line where we are unsetting the bit.

bitvector_isset

bool
bitvector_isset(bitvector_t *vector, size_t index)
{
    size_t byte_index = index / 8;
    size_t byte_offset = index % 8;
    return (vector->vector[byte_index] >> byte_offset) & 1U;
}

This function tests whether the bit at the given index is on or not. The logic is same, except the last line where we testing whether the bit is on or off and returning the result.

Now we can update our Makefile to include this file and run our test cases.

all: test_bitvector

test_bitvector: test_bitvector.o bitvector.o
    ${CC} ${CFLAGS} -o test_bitvector test_bitvector.o bitvector.o

bitvector.o: bitvector.c
    ${CC} ${CFLAGS} -c bitvector.c

clean:
    rm -rf *.o test_bitvector

We can compile the programs by running make on the shell and then executing ./test_bitvector.

Now the bitvector is out of the way, let's head towards implementing the bloom filter

Bloom Filter:

Let's start by defining the API in a file called bloomfilter.h

#ifndef BLOOMFILTER_H
#define BLOOMFILTER_H

#include <stdbool.h>
#include <stdint.h>
#include "bitvector.h"

#define NHASH 6

typedef struct bloomfilter_t {
    bitvector_t *bitvector;
    size_t size;
    uint8_t nhash;
} bloomfilter_t;

bloomfilter_t *bloomfilter_init(size_t);
void bloomfilter_put(bloomfilter_t *, const void *, int);
bool bloomfilter_contains(bloomfilter_t *, const void *, int);
void bloomfilter_free(bloomfilter_t *);

#endif

bloomfilter_t is the data structure which consists of the bitvector, size of the filter and the number of hash functions we are using. Then we have APIs to init the structure, put values and a contains API to query the filter. Similar to the bitvector, first let's write some tests for the filter in a file called test_bloomfilter.c

#include "bloomfilter.h"
#include "test_utils.h"

static void
test_bloomfilter(void)
{
    printf("Testing bloomfilter put and contains");
    print_test_separator_line();
    size_t size = 1000000;
    bloomfilter_t *filter = bloomfilter_init(size);
    size_t fp_count = 0;
    for (size_t i = 0; i < size; i++) {
        if (i % 2 == 0)
            bloomfilter_put(filter, &i, sizeof(size_t));
    }
    for (size_t i = 0; i < size; i++) {
        bool contains = bloomfilter_contains(filter, &i, sizeof(i));
        if (i % 2 == 0) {
            test(contains == true, "Expected value %zu to be present in
                 filter", i);
        } else {
            if (contains == true) {
                printf("Expected value %zu to be not present in filter,
                  maybe false positive\n", i);
                fp_count++;
            }
        }
    }
    bloomfilter_free(filter);
    printf("Total number of false positive = %zu, percentage: %f\n", fp_count,
          100.0 * fp_count / size);
}

int
main(int argc, char **argv)
{
    test_bloomfilter();
    return 0;
}

Here also we only have one test function which testing a few things. Here is what's happening in the test:

We are creating a bloom filter of some fixed size
We iterate from 0 to the size of the filter and for every even index we insert that number in the filter
Then we iterate again and now we query the filter for each loop index value.
For every even index value we expect the filter to return true because those values we stored in the filter.
For every odd value we are printing out if the filter returned true. That is a false positive.
At the end of the test we print out how many false positives we encountered and what is the false positive rate.

Let's implement the bloom filter APIs now in a file called bloomfilter.c. Similar to bitvector I will reproduce individual functions and explain them.

bloomfilter_init:

#include "bloomfilter.h"
#include "murmur3.h"

static uint32_t SEEDS[] = {80430271, 89023841, 88060457, 60974549, 50009261, 87906149};

bloomfilter_t *
bloomfilter_init(size_t size)
{
    bloomfilter_t *filter;
    filter = malloc(sizeof(*filter));
    if (filter == NULL) {
        err(EXIT_FAILURE, "malloc failed");
    }
    // we init the bloom filter with size 10 times greater than the requested size
    // to maintain 1% false positive rate
    size_t filter_size = 10 * size;
    filter->size = filter_size;
    // with 10% bigger filter size and 1% fp rate, the optimal number of hash
    // functions comes about to be 6
    filter->nhash = NHASH;
    filter->bitvector = bitvector_allocate(filter_size);
    return filter;
}

This function is expected to initialize the filter. We do that by allocating the underlying bitvector. We are allocating the bitvector with size 10 times greater than the requested filter size. It turns out that based on some calculations if we want to maintain 1% false positive rate then we need to have a bit vector at least 10 times larger than the expected number of elements in the filter and we should use at least 6 hash functions. That's what is happening in the function.

bloomfilter_put:

void
bloomfilter_put(bloomfilter_t *filter, const void *data, int len)
{
    __uint128_t hash = 0;
    for (size_t i = 0; i < NHASH; i++) {
        MurmurHash3_x64_128(data, len, SEEDS[i], &hash);
        size_t index = hash % filter->size;
        bitvector_set(filter->bitvector, index);
        hash = 0;
    }
}

This API is expected to store the given data item in the filter. We are passing a pointer to the data item and its length in bytes. A good bloom filter should use hash functions which are independent and uniformly distribute, which means that they distribute the values uniformly. This reduces the probability of collisions. Also, since bloom filters are used in big data applications where performance is critical the hash functions should also be fast. Murmur3 is one such hash function which satisfies all these criteria and that's why I am using it here. Instead of implementing it myself, I have taken the public domain implementation available on Github here: https://github.com/PeterScott/murmur3. This implementation exposes the MurmurHash3_x64_128 function as its main API which I have called. We can pass a seed to this function, I've created 6 random prime numbers as the seed which results in 6 hash functions.

In essence, we are iterating through each seed, calling Murmur3 to hash the data and setting the corresponding bit to 1 in the bitvector. The Murmur3 hash function returns a 128 bit integer so we need to take a mod with the filter size to get the right bit index.

bloomfilter_contains:

bool
bloomfilter_contains(bloomfilter_t *filter, const void *data, int len)
{
    __uint128_t hash = 0;
    for (size_t i = 0; i < NHASH; i++) {
        MurmurHash3_x64_128(data, len, SEEDS[i], &hash);
        size_t index = hash % filter->size;
        bool isset = bitvector_isset(filter->bitvector, index);
        if (!isset)
            return false;
        hash = 0;
    }
    return true;
}

This is pretty much same as bloomfilter_put. We iterate through each of the hash functions, get the bit index and check if it is on or off. If any of the indices is off we know that the value was never stored in the filter. Otherwise it might have been stored.

That's it, we have finished our bloom filter. Let's integrate this in the Makefile. I've taken the murmur3.h and murmur3.c files from the Github murmur3 implementation and included in the build so that the code compiles and runs.

all: test_bitvector test_bloomfilter

test_bitvector: test_bitvector.o bitvector.o
    ${CC} ${CFLAGS} -o test_bitvector test_bitvector.o bitvector.o

test_bloomfilter: test_bloomfilter.o bitvector.o bloomfilter.o murmur3.o
    ${CC} ${CFLAGS} -o test_bloomfilter test_bloomfilter.o bitvector.o bloomfilter.o murmur3.o

test_bitvector.o: test_bitvector.c
    ${CC} ${CFLAGS} -c test_bitvector.c

test_bloomfilter.o: test_bloomfilter.c
    ${CC} ${CFLAGS} -c test_bloomfilter.c

murmur3.o: murmur3.c
    ${CC} ${CFLAGS} -c murmur3.c

bloomfilter.o: bloomfilter.c
    ${CC} ${CFLAGS} -c bloomfilter.c

bitvector.o: bitvector.c
    ${CC} ${CFLAGS} -c bitvector.c

clean:
    rm -rf *.o test_bitvector test_bloomfilter

Finally let's write a small benchmark program to measure how much memory the filter actually takes. I am going to use a large file containing a list of dictionary words in new line seprated format. The file contains 421124 words. Here is the benchmark program in a file called bloomfilter_benchmark.c

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#include "bloomfilter.h"

const char *FILENAME = "web3";

static void
read_file_and_index(bloomfilter_t *filter)
{
    FILE *f = fopen(FILENAME, "r");
    if (f == NULL)
        err(EXIT_FAILURE, "Failed to open file for reading");
    ssize_t bytes_read;
    char *line = NULL;
    size_t linesize = 0;
    while ((bytes_read = getline(&line, &linesize, f)) != -1) {
        line[bytes_read - 1] = 0;
        bloomfilter_put(filter, line, bytes_read);
        free(line);
        linesize = 0;
        line = NULL;
    }
    free(line);
    fclose(f);
}

static void
read_file_and_query(bloomfilter_t *filter)
{
    FILE *f = fopen(FILENAME, "r");
    if (f == NULL)
        err(EXIT_FAILURE, "Failed to open file for reading");
    ssize_t bytes_read;
    char *line = NULL;
    size_t linesize = 0;
    while ((bytes_read = getline(&line, &linesize, f)) != -1) {
        line[bytes_read - 1] = 0;
        bool contains = bloomfilter_contains(filter, line, bytes_read);
        if (contains == false) {
            printf("Expected for the filter to contain %s\n", line);
        }
        free(line);
        linesize = 0;
        line = NULL;
    }
    free(line);
    fclose(f);
}

static void
print_resource_usage(size_t filter_size)
{
    size_t vector_size = filter_size * 10;
    size_t bytes_reqd = vector_size / 8 + 1;
    printf("Memory used for %f mb\n", 1.0 * bytes_reqd / (1024 * 1024));
}

int
main(int argc, char **argv)
{
    bloomfilter_t *filter = bloomfilter_init(500000);
    read_file_and_index(filter);
    read_file_and_query(filter);
    print_resource_usage(500000);
    bloomfilter_free(filter);
    return 0;
}

We create a filter with size just slightly greater than the number of items in the file. The function read_file_and_index reads the file line by line and puts each word in the filter. The function read_file_and_query reads each word and queries whether it is contained in the filter or not. The function print_resource_usage() prints the max resident size of the process. This is the amount of memory the process is occupying at the moment in the RAM. This is what I get after running this benchmark:

➜ time ./bloomfilter_benchmark

Memory used for 0.596047 mb

________________________________________________________

Executed in 231.93 millis fish external

usr time 232.06 millis 765.00 micros 231.30 millis

sys time 0.00 millis 0.00 micros 0.00 millis

So for holding 421124 words it took just 0.6 MB memory. The file size is 4.3MB for reference.

The complete code is available on Github: https://github.com/abhinav-upadhyay/bloom-filter-et-al

My tech musings

Sunday, January 30, 2022

Implementing a Bloomfilter in C

Probabilistic Data Structures

Bloom Filter

Bloom Filter Implementation in C

Bit Vector:

Bloom Filter:

No comments:

Post a Comment