tag:blogger.com,1999:blog-91855643378920583582024-03-24T00:11:17.160-07:00My tech musingsAbhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.comBlogger32125tag:blogger.com,1999:blog-9185564337892058358.post-4361651831532876852023-07-27T06:35:00.002-07:002023-07-27T06:35:58.725-07:00Classifying Text using Gzip and KNN<p> A new paper came out at ACL 2023 which showed that using gzip compression in combination with the k-nearest neighbours algorithm can achieve accuracy similar to that of state-of-the-art deep learning models such as BERT on text classification. Not only that, due to the fact that this technique is non-parametric, it beats those large deep learning models on out of distribution data samples. If this intrigues you, then read <a href="https://codeconfessions.substack.com/p/decoding-the-acl-paper-gzip-and-knn">my article</a> on substack which explains the findings of the paper in simple terms. </p>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-57790055072868175742023-06-16T00:39:00.005-07:002023-06-16T00:39:40.279-07:00A Deep Dive to Understand The Sorting Algorithm Optimizations of AlphaDev<p>DeepMind's AlphaDev model has been able to find optimized implementations for small sorting functions. In my latest <a href="https://codeconfessions.substack.com/p/creating-chatgpt-plugins-using-the">article</a> I take a deep dive to explain what these optimizations are and how they fare against the benchmark implementations. 
Check it out: <a href="https://codeconfessions.substack.com/p/creating-chatgpt-plugins-using-the">https://codeconfessions.substack.com/p/creating-chatgpt-plugins-using-the</a> </p>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-48181925957715386962023-06-16T00:36:00.004-07:002023-06-16T00:36:48.095-07:00Building ChatGPT Plugins Using Function Calls<p> In my latest article on substack I take you through a <a href="https://codeconfessions.substack.com/p/creating-chatgpt-plugins-using-the">full tutorial where we develop a Flask based chat application</a> and then implement ChatGPT like plugins to support features such as web browsing and Python code interpreter. Check it out: <a href="https://codeconfessions.substack.com/p/creating-chatgpt-plugins-using-the">https://codeconfessions.substack.com/p/creating-chatgpt-plugins-using-the</a></p>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-11872822277859851392023-06-16T00:29:00.000-07:002023-06-16T00:29:11.516-07:00A New Home for Blogging<p>I've moved my blogging practice to a new home. Since April, 23 I have started writing on substack where I am <a href="https://codeconfessions.substack.com/">writing on topics around coding, software engineering and computer science in general</a>. If you stumble upon this old blog, <a href="https://codeconfessions.substack.com/">subscribe to me on Substack</a>. 
See you on the other side!</p><p><br /></p>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-43259014687171076052022-08-21T12:13:00.004-07:002022-08-27T07:27:37.475-07:00Understanding Base64 Encoding<h2 style="text-align: left;"><span style="font-family: Roboto Mono;">Introduction</span></h2><p><span style="font-family: Roboto Mono;">Base64 is used everywhere in web development as well as in lower-level network programming. It is a scheme for encoding arbitrary data, whether binary or plain text. In the early days of the Internet, sending binary data over the wire was complicated because devices and software interpreted byte values outside the printable ASCII range in their own ways. For example, modems interpreted code 6 as an acknowledgement. This was a problem for transmitting binary data such as compressed images or executables, so Base64 was designed as a mechanism to encode data into a subset of the printable ASCII character set.</span></p><p><span style="font-family: Roboto Mono;">In modern web development, Base64 is also used in the Authorization header of an HTTP request to encode the username and password (HTTP Basic authentication). Even though Base64 mangles the input, it is by no means secure. If such a request is transmitted over an insecure network, an attacker can easily capture and decode the username and password. Many people mistake Base64 for an encryption scheme, but it is neither encryption nor hashing; it is a simple, reversible encoding.</span></p><p><span style="font-family: "Roboto Mono";">The name Base64 comes from the way the scheme works. In simple terms, the encoder scans the input 6 bits at a time and maps each 6-bit pattern to one of 64 ASCII symbols (A-Z, a-z, 0-9, +, /). 
Since it works with 6 bits at a time, there are 2^6 = 64 possible patterns, hence the name Base64.</span></p><h2 style="text-align: left;"><span style="font-family: Roboto Mono;">Base64 Encoding Implementation</span></h2><p><span style="font-family: Roboto Mono;">As noted above, the encoding scheme looks at 6 bits of the input at a time and encodes them as one byte (8 bits) of output. Since a byte has 8 bits and we consume 6 bits at a time, the encoder works on groups of 24 bits (the least common multiple of 6 and 8), i.e. 3 input bytes become 4 output characters. When the input is not a multiple of 3 bytes, the leftover bits are padded with 0s to complete the last 6-bit group, and by convention the output is padded with '=' characters: two '=' if the input length is 2 bytes short of a multiple of 3, or a single '=' if it is 1 byte short.</span></p><p><span style="font-family: Roboto Mono;">Since we encode every 6 bits of input as 8 bits of output, the output-to-input size ratio is 8/6 = 4/3, i.e. the output is 4/3 times the size of the input (about 33% larger), before padding. </span></p><p><span style="font-family: Roboto Mono;">With these details out of the way, we can look at the actual implementation. The encoding scheme sounds simple enough, but the low-level bit manipulation makes it a challenging implementation. Let's dig in. 
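</span></p><p><span style="font-family: Roboto Mono;">Before the implementation, the size arithmetic above can be checked with a small sketch. The helper names b64_encoded_len and b64_padding are made up for this illustration and are not part of the code in this post; the sketch assumes standard padded Base64, where every started group of 3 input bytes produces exactly 4 output characters.</span></p>

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post): number of characters
 * in the padded Base64 encoding of `len` input bytes, excluding the
 * terminating NUL byte. */
static size_t
b64_encoded_len(size_t len)
{
	/* Every started group of 3 input bytes yields 4 output characters. */
	return 4 * ((len + 2) / 3);
}

/* Hypothetical helper: number of '=' padding characters that the
 * encoding of `len` input bytes ends with (0, 1 or 2). */
static size_t
b64_padding(size_t len)
{
	return (3 - len % 3) % 3;
}
```

<p><span style="font-family: Roboto Mono;">Under these assumptions a 1-byte input encodes to 4 characters, two of which are '=' padding, a 2-byte input encodes to 4 characters with one '=', and a 3-byte input encodes to 4 characters with no padding.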
</span></p><p><br /></p><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><err.h></span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdlib.h></span></div><br /><div><span style="color: #569cd6;">static</span> <span style="color: #569cd6;">const</span> <span style="color: #569cd6;">char</span> *<span style="color: #9cdcfe;">b64_chars</span> = <span style="color: #ce9178;">"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijkl"</span> <span style="color: #d7ba7d;">\</span></div><div> <span style="color: #ce9178;">"mnopqrstuvwxyz0123456789+/"</span>;</div><br /></div><p><span style="font-family: Roboto Mono;">We have created a static string, b64_chars, of all the Base64 encoding characters so that we can map each 6-bit value to its character by indexing into it at encoding time.</span></p><p><br /></p><div style="background-color: #1e1e1e; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; line-height: 19px; white-space: pre;"><div style="color: #d4d4d4;"><span style="color: #569cd6;">char</span> *</div><div style="color: #d4d4d4;"><span style="color: #dcdcaa;">base64_encode</span>(<span style="color: #569cd6;">const</span> <span style="color: #569cd6;">char</span> *<span style="color: #9cdcfe;">input</span>, <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">len</span>)</div><div style="color: #d4d4d4;">{</div><div style="color: #d4d4d4;"> <span style="color: #6a9955;">// Allocate enough space to hold the base64 encoding</span></div><div style="color: #d4d4d4;"> <span style="color: #6a9955;">// for each 6 bits of input the encoded output is 8 bits,</span></div><div style="color: #d4d4d4;"> <span style="color: #6a9955;">// so the output is 4/3 
times the input.</span></div><div style="color: #d4d4d4;"> <span style="color: #6a9955;">// Additionally, since we encode 6 bits of input at a time,</span></div><div style="color: #d4d4d4;"> <span style="color: #6a9955;">// if the input is not a multiple of 24 bits then we need</span></div><div style="color: #d4d4d4;"> <span style="color: #6a9955;">// to add '=' to the end of the output to make it a multiple</span></div><div style="color: #d4d4d4;"> <span style="color: #6a9955;">// of 24 bits, so the output length is 4/3 times the input</span></div><div style="color: #d4d4d4;"> <span style="color: #6a9955;">// length plus 2 for the '=' and one byte for the null terminator</span></div><div><span style="color: #d4d4d4;"> </span><span style="color: #4ec9b0;">size_t</span><span style="color: #d4d4d4;"> </span><span style="color: #9cdcfe;">output_len</span><span style="color: #d4d4d4;"> = (</span><span style="color: #9cdcfe;">len</span><span style="color: #d4d4d4;"> * </span><span style="color: #b5cea8;">4</span><span style="color: #d4d4d4;"> / </span><span style="color: #b5cea8;">3</span><span style="color: #d4d4d4;"> + </span><span style="color: #b5cea8;">1</span><span style="color: #d4d4d4;">) + </span><span style="color: #b5cea8;">3</span><span style="color: #d4d4d4;">; </span></div><div style="color: #d4d4d4;"> <span style="color: #569cd6;">char</span> *<span style="color: #9cdcfe;">out</span> = <span style="color: #dcdcaa;">malloc</span>(<span style="color: #9cdcfe;">output_len</span>);</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">out</span> == <span style="color: #569cd6;">NULL</span>)</div><div style="color: #d4d4d4;"> <span style="color: #dcdcaa;">err</span>(<span style="color: #569cd6;">EXIT_FAILURE</span>, <span style="color: #ce9178;">"malloc failed"</span>);</div><span style="color: #d4d4d4;"><br /><br /></span><div style="color: #d4d4d4;"> <span style="color: #4ec9b0;">size_t</span> <span 
style="color: #9cdcfe;">idx</span> = <span style="color: #b5cea8;">0</span>;</div><span style="color: #d4d4d4;"><br /></span><div style="color: #d4d4d4;"> <span style="color: #6a9955;">// we scan 6 bits of input at a time and map it to one</span></div><div style="color: #d4d4d4;"> <span style="color: #6a9955;">// of the bytes from base64_chars</span></div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">do</span> {</div><div style="color: #d4d4d4;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">idx</span>++] = <span style="color: #9cdcfe;">b64_chars</span>[(<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">0</span>] & <span style="color: #b5cea8;">0xFC</span>) >> <span style="color: #b5cea8;">2</span>];</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">len</span> == <span style="color: #b5cea8;">1</span>) {</div><div style="color: #d4d4d4;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">idx</span>++] = <span style="color: #9cdcfe;">b64_chars</span>[(<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">0</span>] & <span style="color: #b5cea8;">0x03</span>) << <span style="color: #b5cea8;">4</span>];</div><div style="color: #d4d4d4;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">idx</span>++] = <span style="color: #ce9178;">'='</span>;</div><div style="color: #d4d4d4;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">idx</span>++] = <span style="color: #ce9178;">'='</span>;</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">break</span>;</div><div style="color: #d4d4d4;"> }</div><span style="color: #d4d4d4;"><br /></span><div style="color: #d4d4d4;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">idx</span>++] = <span style="color: #9cdcfe;">b64_chars</span>[(<span style="color: #9cdcfe;">input</span>[<span style="color: 
#b5cea8;">0</span>] & <span style="color: #b5cea8;">0x03</span>) << <span style="color: #b5cea8;">4</span> | (<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">1</span>] & <span style="color: #b5cea8;">0xF0</span>) >> <span style="color: #b5cea8;">4</span>];</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">len</span> == <span style="color: #b5cea8;">2</span>) {</div><div style="color: #d4d4d4;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">idx</span>++] = <span style="color: #9cdcfe;">b64_chars</span>[(<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">1</span>] & <span style="color: #b5cea8;">0x0F</span>) << <span style="color: #b5cea8;">2</span>];</div><div style="color: #d4d4d4;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">idx</span>++] = <span style="color: #ce9178;">'='</span>;</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">break</span>;</div><div style="color: #d4d4d4;"> }</div><span style="color: #d4d4d4;"><br /></span><div><span style="color: #d4d4d4;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">idx</span>++] = <span style="color: #9cdcfe;">b64_chars</span>[(<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">1</span>] & <span style="color: #b5cea8;">0x0F</span>) << </span><span style="color: #b5cea8;">2</span><span style="color: #d4d4d4;"> | (</span><span style="color: #9cdcfe;">input</span><span style="color: #d4d4d4;">[</span><span style="color: #b5cea8;">2</span><span style="color: #d4d4d4;">] & </span><span style="color: #b5cea8;">0xC0</span><span style="color: #d4d4d4;">) >> </span><span style="color: #b5cea8;">6</span><span style="color: #d4d4d4;">];</span></div><div style="color: #d4d4d4;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">idx</span>++] = <span style="color: #9cdcfe;">b64_chars</span>[<span 
style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">2</span>] & <span style="color: #b5cea8;">0x3F</span>];</div><div style="color: #d4d4d4;"> <span style="color: #9cdcfe;">input</span> += <span style="color: #b5cea8;">3</span>;</div><span style="color: #d4d4d4;"><br /></span><div style="color: #d4d4d4;"> } <span style="color: #c586c0;">while</span> (<span style="color: #9cdcfe;">len</span> -= <span style="color: #b5cea8;">3</span>);</div><div style="color: #d4d4d4;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">idx</span>] = <span style="color: #b5cea8;">0</span>;</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">return</span> <span style="color: #9cdcfe;">out</span>;</div><div style="color: #d4d4d4;">}</div><span style="color: #d4d4d4;"><br /></span></div><p><span style="font-family: Roboto Mono;">If you are not used to working with bitmasks, this code may look a bit intimidating at first, but it is not that complicated. Within the do-while loop, the first thing we do is take the first 6 bits of the first byte of the input and look up the corresponding character in b64_chars. The way we select those 6 bits is interesting. We use the bitmask 0xFC, which is hexadecimal for the binary pattern 11111100. When we do a bitwise AND of this mask with the first byte of the input, the last two bits are set to 0 and we are left with only the first 6 bits. We then shift these bits right by 2 to get a 6-bit number (0 to 63) that we can use as an index into b64_chars. Note that because this is a do-while loop, the body always runs at least once, so the function assumes the input is at least 1 byte long.</span></p><p><span style="font-family: Roboto Mono;">Next, if only 1 byte of input remains, we need to encode the remaining 2 bits of that byte and append two '=' characters to the output. This time we use the bitmask 0x03 to select the last 2 bits, because it corresponds to the bit pattern 00000011. 
We shift these 2 bits to the left by 4 so that they become the top 2 bits of the 6-bit group (the remaining 4 bits are 0 because the input has ended). </span></p><p><span style="font-family: Roboto Mono;">On the other hand, if the input has another byte, we need to take the last 2 bits of the first byte and the first 4 bits of the 2nd byte, which is what this bit of gymnastics does:</span></p><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; line-height: 19px; white-space: pre;"><span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">idx</span>++] = <span style="color: #9cdcfe;">b64_chars</span>[(<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">0</span>] & <span style="color: #b5cea8;">0x03</span>) << <span style="color: #b5cea8;">4</span> | (<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">1</span>] & <span style="color: #b5cea8;">0xF0</span>) >> <span style="color: #b5cea8;">4</span>];</div><p><span style="font-family: Roboto Mono;">The logic for selecting the last 2 bits of the first byte is the same as above. The bitmask for selecting the first 4 bits of the 2nd byte is 0xF0, which is 11110000 in binary. We shift the selected bits to the right by 4 and do a bitwise OR with the bits selected from the first byte. Combined, these give us the next 6 bits to encode.</span></p><p><span style="font-family: Roboto Mono;">Now, if only 2 bytes of input remain, we need to encode the remaining 4 bits of the 2nd byte and append one '=' character to the output. It's the same story: we select the last 4 bits of the 2nd byte and shift them left by 2 so that they form the top 4 bits of the 6-bit group (the last 2 bits are 0 since the input has ended).</span></p><p><span style="font-family: Roboto Mono;">However, if the input has not ended, we combine the last 4 bits of the 2nd byte with the first 2 bits of the 3rd byte to form the next 6-bit group. 
Afterwards we are only left with the last 6 bits of the 3rd byte for which we can do the encoding pretty easily without any gymnastics required.</span></p><p><span style="font-family: Roboto Mono;">We keep doing this till we finished scanning the input bytes.</span></p><h2 style="text-align: left;"><span style="font-family: Roboto Mono;">Base64 Decoding</span></h2><p><span style="font-family: Roboto Mono;">The decoding process is just the reverse of what we did above. We scan each byte of the encoded input and map it to 6 bits of output. Since when generating the encoding we know what numeric representation we gave to each individual character of the base64 encoding, we can generate a table like this:</span></p><p><br /></p><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #569cd6;">static</span> <span style="color: #569cd6;">const</span> <span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">unbase64</span> <span style="color: #569cd6;">[]</span> = {</div><div> -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>,</div><div> -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: 
#b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>,</div><div> -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, <span style="color: #b5cea8;">62</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, <span style="color: #b5cea8;">63</span>, <span style="color: #b5cea8;">52</span>,</div><div> <span style="color: #b5cea8;">53</span>, <span style="color: #b5cea8;">54</span>, <span style="color: #b5cea8;">55</span>, <span style="color: #b5cea8;">56</span>, <span style="color: #b5cea8;">57</span>, <span style="color: #b5cea8;">58</span>, <span style="color: #b5cea8;">59</span>, <span style="color: #b5cea8;">60</span>, <span style="color: #b5cea8;">61</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, <span style="color: #b5cea8;">0</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>,</div><div> <span style="color: #b5cea8;">0</span>, <span style="color: #b5cea8;">1</span>, <span style="color: #b5cea8;">2</span>, <span style="color: 
#b5cea8;">3</span>, <span style="color: #b5cea8;">4</span>, <span style="color: #b5cea8;">5</span>, <span style="color: #b5cea8;">6</span>, <span style="color: #b5cea8;">7</span>, <span style="color: #b5cea8;">8</span>, <span style="color: #b5cea8;">9</span>, <span style="color: #b5cea8;">10</span>, <span style="color: #b5cea8;">11</span>, <span style="color: #b5cea8;">12</span>, <span style="color: #b5cea8;">13</span>, <span style="color: #b5cea8;">14</span>, <span style="color: #b5cea8;">15</span>,</div><div> <span style="color: #b5cea8;">16</span>, <span style="color: #b5cea8;">17</span>, <span style="color: #b5cea8;">18</span>, <span style="color: #b5cea8;">19</span>, <span style="color: #b5cea8;">20</span>, <span style="color: #b5cea8;">21</span>, <span style="color: #b5cea8;">22</span>, <span style="color: #b5cea8;">23</span>, <span style="color: #b5cea8;">24</span>, <span style="color: #b5cea8;">25</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>,</div><div> <span style="color: #b5cea8;">26</span>, <span style="color: #b5cea8;">27</span>, <span style="color: #b5cea8;">28</span>, <span style="color: #b5cea8;">29</span>, <span style="color: #b5cea8;">30</span>, <span style="color: #b5cea8;">31</span>, <span style="color: #b5cea8;">32</span>, <span style="color: #b5cea8;">33</span>, <span style="color: #b5cea8;">34</span>, <span style="color: #b5cea8;">35</span>, <span style="color: #b5cea8;">36</span>, <span style="color: #b5cea8;">37</span>, <span style="color: #b5cea8;">38</span>, <span style="color: #b5cea8;">39</span>, <span style="color: #b5cea8;">40</span>, <span style="color: #b5cea8;">41</span>,</div><div> <span style="color: #b5cea8;">42</span>, <span style="color: #b5cea8;">43</span>, <span style="color: #b5cea8;">44</span>, <span style="color: 
#b5cea8;">45</span>, <span style="color: #b5cea8;">46</span>, <span style="color: #b5cea8;">47</span>, <span style="color: #b5cea8;">48</span>, <span style="color: #b5cea8;">49</span>, <span style="color: #b5cea8;">50</span>, <span style="color: #b5cea8;">51</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span>, -<span style="color: #b5cea8;">1</span></div><div>};</div><br /></div><p><span style="font-family: "Roboto Mono";">This is essentially the ASCII table, where the valid base64 character entries have been given their numeric sequence as per the encoding, e.g. 'A' is 0, 'B' is 1 and so on. The non base64 characters are assigned -1.</span></p><p><span style="font-family: Roboto Mono;">Now let's look at the decoding implementation</span></p><div style="background-color: #1e1e1e; color: #d4d4d4; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #569cd6;">char</span> *</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"><span style="color: #dcdcaa;">base64_decode</span>(<span style="color: #569cd6;">const</span> <span style="color: #569cd6;">char</span> *<span style="color: #9cdcfe;">input</span>, <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">len</span>)</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;">{</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #c586c0;">if</span> ((<span style="color: #9cdcfe;">len</span> & <span style="color: #b5cea8;">0x03</span>)) {</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #dcdcaa;">warnx</span>(<span style="color: #ce9178;">"input length expected to be a multiple of 4 bytes"</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #c586c0;">return</span> <span 
style="color: #569cd6;">NULL</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> }</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">output_len</span> = (<span style="color: #9cdcfe;">len</span> * <span style="color: #b5cea8;">3</span>) / <span style="color: #b5cea8;">4</span> + <span style="color: #b5cea8;">2</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #569cd6;">char</span> *<span style="color: #9cdcfe;">out</span> = <span style="color: #dcdcaa;">malloc</span>(<span style="color: #9cdcfe;">output_len</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">out</span> == <span style="color: #569cd6;">NULL</span>)</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #dcdcaa;">err</span>(<span style="color: #569cd6;">EXIT_FAILURE</span>, <span style="color: #ce9178;">"malloc failed"</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">offset</span> = <span style="color: #b5cea8;">0</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> </div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #c586c0;">do</span> {</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #c586c0;">for</span> (<span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> <= <span style="color: #b5cea8;">3</span>; <span style="color: #9cdcfe;">i</span>++) {</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: 
#c586c0;">if</span> ((<span style="color: #569cd6;">unsigned char</span>)<span style="color: #9cdcfe;">input</span>[<span style="color: #9cdcfe;">i</span>] > <span style="color: #b5cea8;">127</span> || <span style="color: #9cdcfe;">unbase64</span>[<span style="color: #9cdcfe;">input</span>[<span style="color: #9cdcfe;">i</span>]] == -<span style="color: #b5cea8;">1</span>) {</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #dcdcaa;">warnx</span>(<span style="color: #ce9178;">"invalid base64 character: </span><span style="color: #9cdcfe;">%c</span><span style="color: #ce9178;">"</span>, <span style="color: #9cdcfe;">input</span>[<span style="color: #9cdcfe;">i</span>]);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #dcdcaa;">free</span>(<span style="color: #9cdcfe;">out</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #c586c0;">return</span> <span style="color: #569cd6;">NULL</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> }</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> }</div><span style="font-family: Droid Sans Mono, monospace, monospace;"><br /></span><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">offset</span>++] = (<span style="color: #9cdcfe;">unbase64</span>[<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">0</span>]] << <span style="color: #b5cea8;">2</span>) | </div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> ((<span style="color: #9cdcfe;">unbase64</span>[<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">1</span>]] & <span style="color: #b5cea8;">0x30</span>) >> <span style="color: #b5cea8;">4</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #c586c0;">if</span> (<span 
style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">2</span>] != <span style="color: #ce9178;">'='</span>) {</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">offset</span>++] = ((<span style="color: #9cdcfe;">unbase64</span>[<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">1</span>]] & <span style="color: #b5cea8;">0x0F</span>) << <span style="color: #b5cea8;">4</span>) | </div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> ((<span style="color: #9cdcfe;">unbase64</span>[<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">2</span>]] & <span style="color: #b5cea8;">0x3C</span>) >> <span style="color: #b5cea8;">2</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> }</div><span style="font-family: Droid Sans Mono, monospace, monospace;"><br /></span><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">3</span>] != <span style="color: #ce9178;">'='</span>) {</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #9cdcfe;">out</span>[<span style="color: #9cdcfe;">offset</span>++] = ((<span style="color: #9cdcfe;">unbase64</span>[<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">2</span>]] & <span style="color: #b5cea8;">0x03)</span> << <span style="color: #b5cea8;">6</span>) | </div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> (<span style="color: #9cdcfe;">unbase64</span>[<span style="color: #9cdcfe;">input</span>[<span style="color: #b5cea8;">3</span>]]);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> }</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: 
#9cdcfe;">input</span> += <span style="color: #b5cea8;">4</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> } <span style="color: #c586c0;">while</span> (<span style="color: #9cdcfe;">len</span> -= <span style="color: #b5cea8;">4</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #c586c0;">return</span> <span style="color: #9cdcfe;">out</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;">}</div><span style="font-family: Droid Sans Mono, monospace, monospace;"><br /></span></div><p><span style="font-family: Roboto Mono;">The first thing we do is validate that the input is a valid base64 encoded string. For example, we expect the input length to be a multiple of 4 bytes. Similarly, we check that every character is a valid base64 character before we actually start decoding.</span></p><p><span style="font-family: Roboto Mono;">The decoding is pretty simple. We take each byte of the encoded input and look up the corresponding decoded value in the table. Since all the numbers in the decoded space are in the range 0 to 63, they can be represented by 6 bits, and hence in their byte representation the top 2 bits will always be 0. We need to form the bytes of the decoded output by stitching these 6 bit values together.</span></p><p><span style="font-family: Roboto Mono;">So, we take the first character of the input and find its corresponding decoded value from the table. We left shift the decoded value by 2 because its top 2 bits are always unset, as mentioned previously. Now that we have the first 6 bits of the decoded output, we need another 2 bits to make a complete decoded byte. For that we take the 2nd byte of the encoded input and look it up in the table. 
Since this will also be a 6 bit number, we shift it to the right by 4 so that its top 2 bits move to the end. Combined with the first 6 bits we got by decoding the first byte, we now have 8 bits of decoded output, which can be stored in the output.</span></p><p><span style="font-family: Roboto Mono;">If the next byte is an '=', that means the input has finished and we are done. Otherwise, we take the remaining 4 bits we got from decoding the 2nd byte, take the first 4 bits from decoding the 3rd byte, and stitch them together.</span></p><p><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">(</span><span style="background-color: #1e1e1e; color: #9cdcfe; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">unbase64</span><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">[</span><span style="background-color: #1e1e1e; color: #9cdcfe; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">input</span><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">[</span><span style="background-color: #1e1e1e; color: #b5cea8; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">1</span><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">]] & </span><span style="background-color: #1e1e1e; color: #b5cea8; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">0x0F</span><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">) << </span><span 
style="background-color: #1e1e1e; color: #b5cea8; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">4</span><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">)</span> <span style="font-family: Roboto Mono;">-> this takes the last 4 bits of the decoded 2nd byte and moves them to the top. </span></p><p><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;"> ((</span><span style="background-color: #1e1e1e; color: #9cdcfe; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">unbase64</span><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">[</span><span style="background-color: #1e1e1e; color: #9cdcfe; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">input</span><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">[</span><span style="background-color: #1e1e1e; color: #b5cea8; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">2</span><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">]] & </span><span style="background-color: #1e1e1e; color: #b5cea8; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">0x3C</span><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">) >> </span><span style="background-color: #1e1e1e; color: #b5cea8; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; 
white-space: pre;">2</span><span style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace; font-size: 14px; white-space: pre;">)</span> <span style="font-family: Roboto Mono;">-> this decodes the 3rd byte and selects the 4 bits at positions 3 to 6 (inclusive), counting from the most significant bit. We select the 4 bits starting at position 3 and not 1 because the first 2 bits are always unset, as we are mapping to a 6 bit space. We then right shift these bits by 2 so that they form the last 4 bits. When this is OR'd with the previous 4 bits we got from the 2nd byte, we have another 8 bits of decoded output.</span></p><p><span style="font-family: Roboto Mono;">If the next input byte is not an '=' character, that means we have another byte to decode. We take the last 2 bits of the decoded 3rd byte, move them to the top by left shifting them by 6, and combine them with the 6 bits of the decoded 4th byte. We don't have to do any bit shifting for the 4th byte because we need all of its 6 bits to make a complete byte of output. </span></p><p><span style="font-family: "Roboto Mono";">We keep doing this until we reach the end of input.</span></p><h2 style="text-align: left;"><span style="font-family: Roboto Mono;">Testing</span></h2><p><span style="font-family: Roboto Mono;">We can also write a small test to verify that the encoding and decoding work as expected. I am going to take a test string, find its base64 encoding using a well known tool, such as the Python REPL which comes with the base64 module, and compare it with the value our encoder generates. Then, when we decode the encoded string, we should get back the original string. 
</span></p><div style="background-color: #1e1e1e; color: #d4d4d4; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #569cd6;">int</span></div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"><span style="color: #dcdcaa;">main</span>(<span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">argc</span>, <span style="color: #569cd6;">char</span> **<span style="color: #9cdcfe;">argv</span>)</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;">{</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #569cd6;">char</span> *<span style="color: #9cdcfe;">input</span> = <span style="color: #ce9178;">"apple"</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #569cd6;">char</span> *<span style="color: #9cdcfe;">expected_output</span> = <span style="color: #ce9178;">"YXBwbGU="</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #569cd6;">char</span> *<span style="color: #9cdcfe;">output</span> = <span style="color: #dcdcaa;">base64_encode</span>(<span style="color: #9cdcfe;">input</span>, <span style="color: #b5cea8;">5</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"output: </span><span style="color: #9cdcfe;">%s</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">, expected output: </span><span style="color: #9cdcfe;">%s</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span>, <span style="color: #9cdcfe;">output</span>, <span style="color: #9cdcfe;">expected_output</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"decoded: </span><span style="color: #9cdcfe;">%s</span><span 
style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span>, <span style="color: #dcdcaa;">base64_decode</span>(<span style="color: #9cdcfe;">output</span>, <span style="color: #dcdcaa;">strlen</span>(<span style="color: #9cdcfe;">output</span>)));</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #dcdcaa;">free</span>(<span style="color: #9cdcfe;">output</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;"> <span style="color: #c586c0;">return</span> <span style="color: #b5cea8;">0</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace;">}</div><span style="font-family: Droid Sans Mono, monospace, monospace;"><br /></span></div><h2 style="text-align: left;"><span style="font-family: Roboto Mono;">Conclusion</span></h2><p><span style="font-family: Roboto Mono;">Even though Base64 encoding is simple and elegant, implementing it is a bit of a handful unless you are used to bit masking and manipulation. It is a fun exercise for getting used to bit twiddling techniques. </span></p>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-49170528284652438062022-01-30T10:30:00.001-08:002022-01-30T10:30:36.162-08:00Implementing a Bloomfilter in C<h1 style="text-align: left;">Probabilistic Data Structures</h1><div>A bloom filter is a probabilistic data structure. Before we understand what a bloom filter is, let's first look at what probabilistic data structures are. </div><div><br /></div><div>Probabilistic data structures, as their name suggests, are data structures which have some uncertainty involved in them, and thus their results are approximate in nature. For example, a set is a data structure we are all familiar with. When we query a set to check whether it contains a value, it always returns true if it contains the value, otherwise false. 
There is no uncertainty in its output. If we imagine a probabilistic counterpart of the set data structure, it could work like this: when it returns false, the value is definitely not contained in it, but when it returns true, it only means that the value <b>might</b> be contained, without any guarantee. So this probabilistic set would have some false positive rate. Most of these probabilistic data structures employ randomization techniques, and we can estimate an upper bound on their false positive and false negative rates and tune them to our tolerance.</div><div><br /></div><div>Now comes the question: why do we need such probabilistic data structures when we already have 100% accurate data structures available? The answer boils down to the scale of data, resource requirements and performance. With normal data structures such as hash tables, sets, trees, linked lists etc., we need to store the complete data in them. As the scale of data grows, so does the cost of these data structures and, in some cases, the cost of their operations. Probabilistic data structures solve this problem by not storing all the data, but rather a smaller signature that represents the data they are holding. This makes them very memory efficient, but since they do not store the full data, they compromise on accuracy. </div><div><br /></div><div>Let's look at a few examples of such probabilistic data structures. </div><div><br /></div><div><b>HyperLogLog</b>: This is used for finding the cardinality of a dataset. It is typically used in big data and streaming applications where keeping the complete data in memory in order to measure its cardinality is not feasible. 
A HyperLogLog can estimate cardinalities of the order of 10^9 with an error rate of about 2% while consuming about 1.5 KB of memory.</div><div><br /></div><div><b>Locality Sensitive Hashing (LSH)</b>: In database systems, sometimes we are interested in finding items which are similar to a given query item. Typically this would require comparing the query item with every item stored in the system, which, even though linear in nature, can be very expensive if we want low latency. Locality Sensitive Hashing solves this by employing randomization techniques. Essentially, LSH is able to do this expensive operation in sublinear (often close to constant) time, using a fraction of the memory of the actual dataset. The cost is some rate of false positives and false negatives. </div><div><br /></div><div>Now that we have looked at a few of these, let's look in detail at what a Bloom Filter is and how it works.</div><div><br /></div><h1 style="text-align: left;">Bloom Filter</h1><div>A bloom filter is a data structure designed to answer membership queries. It is similar to the Set data structure, but its answers are approximate. Just like a set, we can store data items in a bloom filter and later query it to determine whether a particular item is stored in the filter. If the filter returns false, then it is 100% guaranteed that the item is not present in the filter; however, if the filter returns true, the item might be contained in the filter, but we cannot be sure. Let's try to understand how it works and implement it alongside to see it in action.</div><div><br /></div><div>Being a probabilistic data structure, a bloom filter does not actually store the data items, but only a signature of the data. The signature can be configured to be of a fixed size, for example 4 bits per data item, which means that no matter how big our values are, the filter takes a very small amount of memory to hold them. </div><div><br /></div><div>At its core the filter uses a bit vector to store the signatures. 
It generates the signature by using a bunch of hash functions. The number of hash functions is also a tuning parameter: the more hash functions we use, the bigger the signature. By choosing the size of the bit vector and the number of hash functions, we can tune the filter for an upper bound on its false positive rate. We will see more on this later.</div><div><br /></div><div><b>Bloom Filter Put</b>: Let's see how we put an item in a bloom filter. Let's assume we are using j hash functions in the filter. We apply each hash function to the data item, one at a time, and each one returns an index. Each of these indices is turned on (set to 1) in the bit vector. </div><div><br /></div><div><b>Bloom filter Query</b>: At the time of querying we do the same thing as in the put operation. We apply each hash function in sequence and check whether the corresponding index in the bit vector is on or not. If any of the indices is off, that means the item was never stored in the filter. However, if all of the indices are on, it doesn't necessarily mean that the item was stored in the filter; it could be that those indices were turned on while storing other data items.</div><div><br /></div><div>Let's see an example. Let's say our bit vector is 8 bits long and we are using 2 hash functions, h1 and h2. Initially the bit vector is going to be empty, i.e. 
all bits are set to 0.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhaW91ZRsh9RjxroIwpokMGJQGFmtlFBc9pBMuN7bXrPlLZfnP6MO7hJV-GYSSZTjBqDpwrik5zwzaUAnnbwbCbL0CYgfUEDTxntaQV9dG9yIfRwqhl_QDXEWh7hGNHnOi16lIf64BEl0HP32pKvhqTwr82wLSQxoD6azYTEJB2Y4jjXsyxAu7quA=s629" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="49" data-original-width="629" height="31" src="https://blogger.googleusercontent.com/img/a/AVvXsEhaW91ZRsh9RjxroIwpokMGJQGFmtlFBc9pBMuN7bXrPlLZfnP6MO7hJV-GYSSZTjBqDpwrik5zwzaUAnnbwbCbL0CYgfUEDTxntaQV9dG9yIfRwqhl_QDXEWh7hGNHnOi16lIf64BEl0HP32pKvhqTwr82wLSQxoD6azYTEJB2Y4jjXsyxAu7quA=w400-h31" width="400" /></a></div><br /><div><br /></div><div><br /></div><div>Now let's say we wish to store the string "apple" in the filter. We will apply the two hash functions in order, like so:</div><div><br /></div><div><span style="font-family: Roboto Mono;">h1("apple") = 3</span></div><div><span style="font-family: Roboto Mono;">h2("apple") = 0</span></div><div><br /></div><div>So we will set the corresponding indices in the vector.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhuicndZfhH8vQM4p67qCvR-7gNomF-OP1X5w7NBB6S04eEM8Uvnwh8jUbqVYNq0GWaD10z3d4KFCGCqzM4A01Gq8dbLt8Gt3sj1Tm82TWnE859zjly1MTz2G-nHRMj_LHQunSBa83WMW7L9bVJb_k5ESUUv3imkGz7j3KbPVeaLXFpBv_RYdqMEQ=s644" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="56" data-original-width="644" height="35" src="https://blogger.googleusercontent.com/img/a/AVvXsEhuicndZfhH8vQM4p67qCvR-7gNomF-OP1X5w7NBB6S04eEM8Uvnwh8jUbqVYNq0GWaD10z3d4KFCGCqzM4A01Gq8dbLt8Gt3sj1Tm82TWnE859zjly1MTz2G-nHRMj_LHQunSBa83WMW7L9bVJb_k5ESUUv3imkGz7j3KbPVeaLXFpBv_RYdqMEQ=w400-h35" width="400" /></a></div><br /><div class="separator" style="clear: both; text-align: 
center;"><br /></div><br /><div><br /></div><div>Let's insert the next item, "banana", in the filter.</div><div><span style="font-family: Roboto Mono;">h1("banana") = 6</span></div><div><span style="font-family: Roboto Mono;">h2("banana") = 3</span></div><div><br /></div><div>We update the vector by turning on the corresponding bits.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEivLQwmCN6ujs53TXpx6xntExKq6EoBf5wuXqNbmltiVVT3WkuIxD5v_9mwgbfVZlRUIKF0BqLsLnFAs6qDDaEUoPJ_4T62GHQa9qKVH7yUswuoSPtjf-EIk9E4m83LuLddN8coG6RBrHnWgqHMNBWe-7TddNKnSzEyCXbDNi9tVnSsWtGUuu3XjQ=s651" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="73" data-original-width="651" height="45" src="https://blogger.googleusercontent.com/img/a/AVvXsEivLQwmCN6ujs53TXpx6xntExKq6EoBf5wuXqNbmltiVVT3WkuIxD5v_9mwgbfVZlRUIKF0BqLsLnFAs6qDDaEUoPJ_4T62GHQa9qKVH7yUswuoSPtjf-EIk9E4m83LuLddN8coG6RBrHnWgqHMNBWe-7TddNKnSzEyCXbDNi9tVnSsWtGUuu3XjQ=w400-h45" width="400" /></a></div><br /><div><br /></div><div><br /></div><div>Now let's try to query for "grapes".</div><div><span style="font-family: Roboto Mono;">h1("grapes") = 0</span></div><div><span style="font-family: Roboto Mono;">h2("grapes") = 7</span></div><div><br /></div><div>We see that even though bit 0 is set, bit 7 is off, which means that "grapes" was never stored in the filter. </div><div><br /></div><div>Let's query for "orange".</div><div><span style="font-family: Roboto Mono;">h1("orange") = 3</span></div><div><span style="font-family: Roboto Mono;">h2("orange") = 6</span></div><div><br /></div><div>We can see that both bits 3 and 6 are on in the filter, but in our example we never stored "orange". Bit 3 was turned on for both "apple" and "banana", while bit 6 was set for "banana". 
This is how false positives arise in a bloom filter.</div><div><br /></div><h1 style="text-align: left;">Bloom Filter Implementation in C</h1><div>Now that we understand how the bloom filter works conceptually, let's implement it in C to get a better understanding. I'm choosing C because it gives us tight control over memory, which lets us implement a bit vector very efficiently. Speaking of bit vectors, let's start by implementing one first, since the bloom filter builds on top of it. </div><div><br /></div><h3 style="text-align: left;">Bit Vector:</h3><div>Let's first define the API of the bit vector data structure. We will declare the API in a file called <span style="font-family: Roboto Mono;">bitvector.h</span>. It will look like this:</div><div><br /></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #c586c0;">#ifndef</span><span style="color: #569cd6;"> BITVECTOR_H</span></div><div><span style="color: #c586c0;">#define</span><span style="color: #569cd6;"> </span><span style="color: #569cd6;">BITVECTOR_H</span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdint.h></span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdlib.h></span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdbool.h></span></div><div><span style="color: #569cd6;">typedef</span> <span style="color: #569cd6;">struct</span> <span style="color: #4ec9b0;">bitvector_t</span> {</div><div> <span style="color: #4ec9b0;">uint8_t</span> *<span style="color: #9cdcfe;">vector</span>;</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: 
#9cdcfe;">size</span>;</div><div>} <span style="color: #4ec9b0;">bitvector_t</span>;</div><br /><div><span style="color: #4ec9b0;">bitvector_t</span> *<span style="color: #dcdcaa;">bitvector_allocate</span>(<span style="color: #4ec9b0;">size_t</span>);</div><div><span style="color: #569cd6;">void</span> <span style="color: #dcdcaa;">bitvector_free</span>(<span style="color: #4ec9b0;">bitvector_t</span> *);</div><div><span style="color: #569cd6;">void</span> <span style="color: #dcdcaa;">bitvector_set</span>(<span style="color: #4ec9b0;">bitvector_t</span> *, <span style="color: #4ec9b0;">size_t</span>);</div><div><span style="color: #569cd6;">void</span> <span style="color: #dcdcaa;">bitvector_unset</span>(<span style="color: #4ec9b0;">bitvector_t</span> *, <span style="color: #4ec9b0;">size_t</span>);</div><div><span style="color: #569cd6;">bool</span> <span style="color: #dcdcaa;">bitvector_isset</span>(<span style="color: #4ec9b0;">bitvector_t</span> *, <span style="color: #4ec9b0;">size_t</span>);</div><div><span style="color: #c586c0;">#endif</span></div></div></div><div><br /></div><div>The <span style="background-color: #1e1e1e; color: #4ec9b0; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; white-space: pre;">bitvector_t</span> is the main data structure which consists of an array of type <span style="background-color: #1e1e1e; color: #4ec9b0; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; white-space: pre;">uint8_t</span>, which is basically an unsigned byte. We also maintain the size of the bit vector in the struct. 
Then we have APIs to allocate a bit vector and free it, to set a particular bit position and unset it, and lastly an API to test whether a particular bit index is set in the vector or not.</div><div><br /></div><div>Before we start implementing the APIs, let's first write a test, following the TDD approach, and then we can write the implementation and verify it against the test. We will be writing a lot of tests, and to save some effort I am going to write a few utilities for tests. Let's create a file called <span style="font-family: Roboto Mono;">test_utils.h</span> and put the following code in it.</div><div><br /></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #c586c0;">#ifndef</span><span style="color: #569cd6;"> TEST_UTILS_H</span></div><div><span style="color: #c586c0;">#define</span><span style="color: #569cd6;"> </span><span style="color: #569cd6;">TEST_UTILS_H</span></div><br /><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><assert.h></span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdio.h></span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdlib.h></span></div><br /><div><span style="color: #c586c0;">#define</span><span style="color: #569cd6;"> </span><span style="color: #569cd6;">test</span><span style="color: #569cd6;">(</span><span style="color: #9cdcfe;">expr</span><span style="color: #569cd6;">, ...) 
</span><span style="color: #c586c0;">if</span><span style="color: #569cd6;"> (expr) { </span><span style="color: #d7ba7d;">\</span></div><div><span style="color: #569cd6;"> ; } </span><span style="color: #c586c0;">else</span><span style="color: #569cd6;"> { </span><span style="color: #d7ba7d;">\</span></div><div><span style="color: #569cd6;"> </span><span style="color: #dcdcaa;">fprintf</span><span style="color: #569cd6;">(</span><span style="color: #569cd6;">stderr</span><span style="color: #569cd6;">, __VA_ARGS__); </span><span style="color: #d7ba7d;">\</span></div><div><span style="color: #569cd6;"> </span><span style="color: #dcdcaa;">abort</span><span style="color: #569cd6;">(); </span><span style="color: #d7ba7d;">\</span></div><div><span style="color: #569cd6;"> }</span></div><br /><div><span style="color: #569cd6;">void</span></div><div><span style="color: #dcdcaa;">print_test_separator_line</span>(<span style="color: #569cd6;">void</span>)</div><div>{</div><div> <span style="color: #c586c0;">for</span> (<span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #b5cea8;">100</span>; <span style="color: #9cdcfe;">i</span>++)</div><div> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"-"</span>);</div><div> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span>);</div><div>}</div><br /><div><span style="color: #c586c0;">#endif</span></div></div></div><div><br /></div><div>The file defines a macro called test to which we can pass a boolean expression. If the expression is false, the macro prints a message on stderr and aborts the program. When writing tests we can call this macro with the condition we are asserting and a message to be printed in case the test fails. 
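<br /><br />As a side note, here is a minimal sketch of how a macro of this kind can be wrapped in do { ... } while (0), so that it expands to a single statement and stays safe inside an unbraced if/else. The names check and checked_max are hypothetical and used only for illustration; they are not part of the code in this post.<div style="background-color: #1e1e1e; color: #d4d4d4; font-family: monospace; font-size: 14px; line-height: 19px; white-space: pre;">
```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical variant of the test macro, named check() so that it does
 * not clash with the original. The do/while (0) wrapper makes the whole
 * expansion behave like one statement that ends at the caller's semicolon. */
#define check(expr, ...)                      \
    do {                                      \
        if (!(expr)) {                        \
            fprintf(stderr, __VA_ARGS__);     \
            fputc('\n', stderr);              \
            abort();                          \
        }                                     \
    } while (0)

/* Illustrative helper: returns the larger of two ints. It calls check()
 * in both branches of an unbraced if/else, which only compiles cleanly
 * because of the do/while (0) wrapper. */
static int
checked_max(int a, int b)
{
    if (a > b)
        check(a >= b, "expected %d >= %d", a, b);
    else
        check(b >= a, "expected %d >= %d", b, a);
    return a > b ? a : b;
}
```
</div>With this form, a call such as check(ptr != NULL, "allocation failed"); can be used anywhere a normal statement is allowed. 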
</div><div><br /></div><div>With that out of the way, let's write the test for the bit vector in a file called <span style="font-family: Roboto Mono;">test_bitvector.c</span></div><div><br /></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdio.h></span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;">"bitvector.h"</span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;">"test_utils.h"</span></div><br /><div><span style="color: #569cd6;">static</span> <span style="color: #569cd6;">void</span></div><div><span style="color: #dcdcaa;">test_bitvector</span>(<span style="color: #569cd6;">void</span>)</div><div>{</div><div> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"Testing bitvector initialized with all bits clear"</span>);</div><div> <span style="color: #dcdcaa;">print_test_separator_line</span>();</div><div> <span style="color: #4ec9b0;">bitvector_t</span> *<span style="color: #9cdcfe;">vector</span> = <span style="color: #dcdcaa;">bitvector_allocate</span>(<span style="color: #b5cea8;">32</span>);</div><div> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #b5cea8;">32</span>; <span style="color: #9cdcfe;">i</span>++) {</div><div> <span style="color: #569cd6;">bool</span> <span style="color: #9cdcfe;">isset</span> = <span style="color: #dcdcaa;">bitvector_isset</span>(<span style="color: #9cdcfe;">vector</span>, <span style="color: #9cdcfe;">i</span>);</div><div> 
<span style="color: #569cd6;">test</span>(<span style="color: #9cdcfe;">isset</span> == <span style="color: #569cd6;">false</span>, <span style="color: #ce9178;">"Expected bit </span><span style="color: #9cdcfe;">%zu</span><span style="color: #ce9178;"> to be unset"</span>, <span style="color: #9cdcfe;">i</span>);</div><div> }</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">indices</span><span style="color: #569cd6;">[]</span> = {<span style="color: #b5cea8;">5</span>, <span style="color: #b5cea8;">7</span>, <span style="color: #b5cea8;">8</span>, <span style="color: #b5cea8;">0</span>, <span style="color: #b5cea8;">10</span>, <span style="color: #b5cea8;">16</span>};</div><div> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #569cd6;">sizeof</span>(<span style="color: #9cdcfe;">indices</span>) / <span style="color: #569cd6;">sizeof</span>(<span style="color: #9cdcfe;">indices</span>[<span style="color: #b5cea8;">0</span>]); <span style="color: #9cdcfe;">i</span>++) {</div><div> <span style="color: #dcdcaa;">bitvector_set</span>(<span style="color: #9cdcfe;">vector</span>, <span style="color: #9cdcfe;">indices</span>[<span style="color: #9cdcfe;">i</span>]);</div><div> }</div><br /><div> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"Testing bitvector set"</span>);</div><div> <span style="color: #dcdcaa;">print_test_separator_line</span>();</div><div> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #569cd6;">sizeof</span>(<span style="color: #9cdcfe;">indices</span>) / <span style="color: #569cd6;">sizeof</span>(<span style="color: 
#9cdcfe;">indices</span>[<span style="color: #b5cea8;">0</span>]); <span style="color: #9cdcfe;">i</span>++) {</div><div> <span style="color: #569cd6;">bool</span> <span style="color: #9cdcfe;">isset</span> = <span style="color: #dcdcaa;">bitvector_isset</span>(<span style="color: #9cdcfe;">vector</span>, <span style="color: #9cdcfe;">indices</span>[<span style="color: #9cdcfe;">i</span>]);</div><div> <span style="color: #569cd6;">test</span>(<span style="color: #9cdcfe;">isset</span> == <span style="color: #569cd6;">true</span>, <span style="color: #ce9178;">"Expected bit </span><span style="color: #9cdcfe;">%zu</span><span style="color: #ce9178;"> to be set"</span>, <span style="color: #9cdcfe;">indices</span>[<span style="color: #9cdcfe;">i</span>]);</div><div> }</div><br /><div> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"Testing bitvector unset"</span>);</div><div> <span style="color: #dcdcaa;">print_test_separator_line</span>();</div><div> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #569cd6;">sizeof</span>(<span style="color: #9cdcfe;">indices</span>) / <span style="color: #569cd6;">sizeof</span>(<span style="color: #9cdcfe;">indices</span>[<span style="color: #b5cea8;">0</span>]); <span style="color: #9cdcfe;">i</span>++) {</div><div> <span style="color: #dcdcaa;">bitvector_unset</span>(<span style="color: #9cdcfe;">vector</span>, <span style="color: #9cdcfe;">indices</span>[<span style="color: #9cdcfe;">i</span>]);</div><div> }</div><div> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #b5cea8;">32</span>; <span style="color: #9cdcfe;">i</span>++) {</div><div> <span 
style="color: #569cd6;">bool</span> <span style="color: #9cdcfe;">isset</span> = <span style="color: #dcdcaa;">bitvector_isset</span>(<span style="color: #9cdcfe;">vector</span>, <span style="color: #9cdcfe;">i</span>);</div><div> <span style="color: #569cd6;">test</span>(<span style="color: #9cdcfe;">isset</span> == <span style="color: #569cd6;">false</span>, <span style="color: #ce9178;">"Expected bit </span><span style="color: #9cdcfe;">%zu</span><span style="color: #ce9178;"> to be unset"</span>, <span style="color: #9cdcfe;">i</span>);</div><div> }</div><br /><div> <span style="color: #dcdcaa;">bitvector_free</span>(<span style="color: #9cdcfe;">vector</span>);</div><div>}</div><br /><div><span style="color: #569cd6;">int</span></div><div><span style="color: #dcdcaa;">main</span>(<span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">argc</span>, <span style="color: #569cd6;">char</span> **<span style="color: #9cdcfe;">argv</span>)</div><div>{</div><div> <span style="color: #dcdcaa;">test_bitvector</span>();</div><div> <span style="color: #c586c0;">return</span> <span style="color: #b5cea8;">0</span>;</div><div>}</div></div></div><div><br /></div><div><br /></div><div><br /></div><div>The test code has only one test function which is testing all APIs of the bit vector. 
In the test_bitvector test function we are doing the following things:</div><div><ul style="text-align: left;"><li>Allocating a bit vector of size 32</li><li>Then we assert that all the bits in the vector are off, since it was just created</li><li>Then we turn some of the bits on in the vector.</li><li>Next we test whether the bits we just set are indeed set.</li><li>After that we turn those bits back off and assert that they are indeed off now</li><li>Finally we free the vector.</li></ul></div><div><br /></div><div>We can start writing a Makefile in order to compile this.</div><div><br /></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #dcdcaa;">all</span>: test_bitvector</div><br /><div><span style="color: #dcdcaa;">test_bitvector</span>: test_bitvector.o</div><div> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CC</span><span style="color: #ce9178;">}</span> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CFLAGS</span><span style="color: #ce9178;">}</span> -o test_bitvector test_bitvector.o</div><div><br /></div><div><div style="line-height: 19px;"><div><span style="color: #dcdcaa;">test_bitvector.o</span>: test_bitvector.c</div><div> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CC</span><span style="color: #ce9178;">}</span> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CFLAGS</span><span style="color: #ce9178;">}</span> -c test_bitvector.c</div><br /></div></div></div></div><div><br /></div><div>Right now there isn't much point in trying to compile, though, because we don't have implementations of the bitvector APIs yet and the linker will complain about missing functions. 
Let's implement the bitvector APIs in a file called <span style="font-family: Roboto Mono;">bitvector.c</span>. Instead of pasting the complete file here, I will reproduce individual functions and try to explain what's happening.</div><div><br /></div><div><b>bitvector_allocate</b>:</div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #4ec9b0;">bitvector_t</span> *</div><div><span style="color: #dcdcaa;">bitvector_allocate</span>(<span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">size</span>)</div><div>{</div><div> <span style="color: #4ec9b0;">bitvector_t</span> *<span style="color: #9cdcfe;">vector</span>;</div><div> <span style="color: #9cdcfe;">vector</span> = <span style="color: #dcdcaa;">malloc</span>(<span style="color: #569cd6;">sizeof</span>(*<span style="color: #9cdcfe;">vector</span>));</div><div> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">vector</span> == <span style="color: #569cd6;">NULL</span>)</div><div> <span style="color: #dcdcaa;">err</span>(<span style="color: #569cd6;">EXIT_FAILURE</span>, <span style="color: #ce9178;">"malloc failed"</span>);</div><br /><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">nbytes</span> = <span style="color: #9cdcfe;">size</span> / <span style="color: #b5cea8;">8</span> + <span style="color: #b5cea8;">1</span>;</div><div> <span style="color: #9cdcfe;">vector</span>-><span style="color: #9cdcfe;">size</span> = <span style="color: #9cdcfe;">size</span>;</div><div> <span style="color: #9cdcfe;">vector</span>-><span style="color: #9cdcfe;">vector</span> = <span style="color: #dcdcaa;">malloc</span>(<span style="color: #9cdcfe;">nbytes</span>);</div><div> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">vector</span>-><span style="color: 
#9cdcfe;">vector</span> == <span style="color: #569cd6;">NULL</span>)</div><div> <span style="color: #dcdcaa;">err</span>(<span style="color: #569cd6;">EXIT_FAILURE</span>, <span style="color: #ce9178;">"malloc failed"</span>);</div><div> <span style="color: #dcdcaa;">memset</span>(<span style="color: #9cdcfe;">vector</span>-><span style="color: #9cdcfe;">vector</span>, <span style="color: #b5cea8;">0</span>, <span style="color: #9cdcfe;">nbytes</span>);</div><div> <span style="color: #c586c0;">return</span> <span style="color: #9cdcfe;">vector</span>;</div><div>}</div><br /></div></div><div><br /></div><div>This is the API to allocate the bitvector. We pass the size of the vector, in bits, as an argument. We cannot allocate memory at the level of bits; the smallest amount of memory that we can allocate and address is one byte. Therefore we need to figure out how many bytes we must allocate to be able to address the requested number of bits. That's what we are doing by dividing the size by 8 and adding 1, so that any leftover bits still get a byte of their own. 
Once we allocate the requested memory we need to set it to 0 so that the vector starts out with all bits set to 0.</div><div><br /></div><div><b>bitvector_set:</b></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #569cd6;">void</span></div><div><span style="color: #dcdcaa;">bitvector_set</span>(<span style="color: #4ec9b0;">bitvector_t</span> *<span style="color: #9cdcfe;">vector</span>, <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">index</span>)</div><div>{</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">byte_index</span> = <span style="color: #9cdcfe;">index</span> / <span style="color: #b5cea8;">8</span>;</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">byte_offset</span> = <span style="color: #9cdcfe;">index</span> % <span style="color: #b5cea8;">8</span>;</div><div> <span style="color: #9cdcfe;">vector</span>-><span style="color: #9cdcfe;">vector</span>[<span style="color: #9cdcfe;">byte_index</span>] |= (<span style="color: #b5cea8;">1UL</span> << <span style="color: #9cdcfe;">byte_offset</span>);</div><div>}</div><br /></div></div><div><br /></div><div>This function is expected to set the bit at the given index position to 1. We know that we didn't really allocate a bit array but a byte array. Therefore we first need to identify which byte of the byte array to index into, and then which bit position within that byte to turn on. The first line of the function identifies the byte index and the second line identifies the bit index within the byte. 
The last line uses the bitwise operations to turn on that bit.</div><div><br /></div><div><b>bitvector_unset</b></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #569cd6;">void</span></div><div><span style="color: #dcdcaa;">bitvector_unset</span>(<span style="color: #4ec9b0;">bitvector_t</span> *<span style="color: #9cdcfe;">vector</span>, <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">index</span>)</div><div>{</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">byte_index</span> = <span style="color: #9cdcfe;">index</span> / <span style="color: #b5cea8;">8</span>;</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">byte_offset</span> = <span style="color: #9cdcfe;">index</span> % <span style="color: #b5cea8;">8</span>;</div><div> <span style="color: #9cdcfe;">vector</span>-><span style="color: #9cdcfe;">vector</span>[<span style="color: #9cdcfe;">byte_index</span>] &= ~(<span style="color: #b5cea8;">1UL</span> << <span style="color: #9cdcfe;">byte_offset</span>);</div><div>}</div><br /></div></div><div><br /></div><div>This function is the opposite of<span style="font-family: Roboto Mono;"> bitvector_set</span>. 
The code is almost identical except the last line where we are unsetting the bit.</div><div><br /></div><div><b>bitvector_isset</b></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #569cd6;">bool</span></div><div><span style="color: #dcdcaa;">bitvector_isset</span>(<span style="color: #4ec9b0;">bitvector_t</span> *<span style="color: #9cdcfe;">vector</span>, <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">index</span>)</div><div>{</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">byte_index</span> = <span style="color: #9cdcfe;">index</span> / <span style="color: #b5cea8;">8</span>;</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">byte_offset</span> = <span style="color: #9cdcfe;">index</span> % <span style="color: #b5cea8;">8</span>;</div><div> <span style="color: #c586c0;">return</span> (<span style="color: #9cdcfe;">vector</span>-><span style="color: #9cdcfe;">vector</span>[<span style="color: #9cdcfe;">byte_index</span>] >> <span style="color: #9cdcfe;">byte_offset</span>) & <span style="color: #b5cea8;">1U</span>;</div><div>}</div></div></div><div><br /></div><div>This function tests whether the bit at the given index is on or not. 
The logic is the same, except for the last line, where we test whether the bit is on or off and return the result.</div><div><br /></div><div>Now we can update our Makefile to include this file and run our test cases.</div><div><br /></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #dcdcaa;">all</span>: test_bitvector</div><br /><div><span style="color: #dcdcaa;">test_bitvector</span>: test_bitvector.o bitvector.o</div><div> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CC</span><span style="color: #ce9178;">}</span> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CFLAGS</span><span style="color: #ce9178;">}</span> -o test_bitvector test_bitvector.o bitvector.o</div><br /><div><span style="color: #dcdcaa;">bitvector.o</span>: bitvector.c</div><div> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CC</span><span style="color: #ce9178;">}</span> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CFLAGS</span><span style="color: #ce9178;">}</span> -c bitvector.c</div><br /><div><span style="color: #dcdcaa;">clean</span>:</div><div> rm -rf *.o test_bitvector</div><br /></div></div><div><br /></div><div>We can compile the program by running make in the shell and then run the tests by executing ./test_bitvector. 
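<br /><br />To make the index arithmetic used by the three bitvector functions concrete, here is a small standalone sketch that applies the same divide-by-8 and modulo-8 steps to a plain byte array. The helper names (bytes_for_bits, demo_set_bit, demo_clear_bit, demo_test_bit) are made up for this illustration and are not part of bitvector.c. Note also that bytes_for_bits rounds up with (nbits + 7) / 8, which is an alternative to the size / 8 + 1 used in bitvector_allocate, not what the code above does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed to hold nbits bits, rounding up. */
static size_t
bytes_for_bits(size_t nbits)
{
	return (nbits + 7) / 8;
}

/* Hypothetical helper: turn on bit `index` in a plain byte array. */
static void
demo_set_bit(uint8_t *bytes, size_t index)
{
	size_t byte_index = index / 8;	/* which byte holds the bit */
	size_t byte_offset = index % 8;	/* position of the bit inside that byte */
	bytes[byte_index] |= (uint8_t)(1U << byte_offset);
}

/* Hypothetical helper: turn off bit `index`. */
static void
demo_clear_bit(uint8_t *bytes, size_t index)
{
	bytes[index / 8] &= (uint8_t)~(1U << (index % 8));
}

/* Hypothetical helper: return 1 if bit `index` is on, 0 otherwise. */
static int
demo_test_bit(const uint8_t *bytes, size_t index)
{
	return (bytes[index / 8] >> (index % 8)) & 1U;
}
```

For example, bit 10 lives in byte 10 / 8 = 1 at offset 10 % 8 = 2, so setting it on a zeroed array leaves that byte with the value 0x04.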
</div><div><br /></div><div>Now the bitvector is out of the way, let's head towards implementing the bloom filter</div><div><br /></div><h3 style="text-align: left;">Bloom Filter:</h3><div><br /></div><div>Let's start by defining the API in a file called <span style="font-family: Roboto Mono;">bloomfilter.h</span></div><div><br /></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #c586c0;">#ifndef</span><span style="color: #569cd6;"> BLOOMFILTER_H</span></div><div><span style="color: #c586c0;">#define</span><span style="color: #569cd6;"> </span><span style="color: #569cd6;">BLOOMFILTER_H</span></div><br /><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdbool.h></span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdint.h></span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;">"bitvector.h"</span></div><br /><div><span style="color: #c586c0;">#define</span><span style="color: #569cd6;"> </span><span style="color: #569cd6;">NHASH</span><span style="color: #569cd6;"> </span><span style="color: #b5cea8;">6</span></div><br /><div><span style="color: #569cd6;">typedef</span> <span style="color: #569cd6;">struct</span> <span style="color: #4ec9b0;">bloomfilter_t</span> {</div><div> <span style="color: #4ec9b0;">bitvector_t</span> *<span style="color: #9cdcfe;">bitvector</span>;</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">size</span>;</div><div> <span style="color: #4ec9b0;">uint8_t</span> <span style="color: #9cdcfe;">nhash</span>;</div><div>} <span style="color: #4ec9b0;">bloomfilter_t</span>;</div><br /><div><span style="color: 
#4ec9b0;">bloomfilter_t</span> *<span style="color: #dcdcaa;">bloomfilter_init</span>(<span style="color: #4ec9b0;">size_t</span>);</div><div><span style="color: #569cd6;">void</span> <span style="color: #dcdcaa;">bloomfilter_put</span>(<span style="color: #4ec9b0;">bloomfilter_t</span> *, <span style="color: #569cd6;">const</span> <span style="color: #569cd6;">void</span> *, <span style="color: #569cd6;">int</span>);</div><div><span style="color: #569cd6;">bool</span> <span style="color: #dcdcaa;">bloomfilter_contains</span>(<span style="color: #4ec9b0;">bloomfilter_t</span> *, <span style="color: #569cd6;">const</span> <span style="color: #569cd6;">void</span> *, <span style="color: #569cd6;">int</span>);</div><div><span style="color: #569cd6;">void</span> <span style="color: #dcdcaa;">bloomfilter_free</span>(<span style="color: #4ec9b0;">bloomfilter_t</span> *);</div><br /><div><span style="color: #c586c0;">#endif</span></div></div></div><div><br /></div><div><span style="background-color: #1e1e1e; color: #4ec9b0; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; white-space: pre;">bloomfilter_t</span> is the data structure which consists of the bitvector, size of the filter and the number of hash functions we are using. Then we have APIs to init the structure, put values and a contains API to query the filter. 
Similar to the bitvector, first let's write some tests for the filter in a file called <span style="font-family: Roboto Mono;">test_bloomfilter.c</span></div><div><br /></div><div><div style="background-color: #1e1e1e; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div style="color: #d4d4d4;"><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;">"bloomfilter.h"</span></div><div style="color: #d4d4d4;"><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;">"test_utils.h"</span></div><span style="color: #d4d4d4;"><br /></span><div style="color: #d4d4d4;"><span style="color: #569cd6;">static</span> <span style="color: #569cd6;">void</span></div><div style="color: #d4d4d4;"><span style="color: #dcdcaa;">test_bloomfilter</span>(<span style="color: #569cd6;">void</span>)</div><div style="color: #d4d4d4;">{</div><div style="color: #d4d4d4;"> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"Testing bloomfilter put and contains"</span>);</div><div style="color: #d4d4d4;"> <span style="color: #dcdcaa;">print_test_separator_line</span>();</div><div style="color: #d4d4d4;"> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">size</span> = <span style="color: #b5cea8;">1000000</span>;</div><div style="color: #d4d4d4;"> <span style="color: #4ec9b0;">bloomfilter_t</span> *<span style="color: #9cdcfe;">filter</span> = <span style="color: #dcdcaa;">bloomfilter_init</span>(<span style="color: #9cdcfe;">size</span>);</div><div style="color: #d4d4d4;"> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">fp_count</span> = <span style="color: #b5cea8;">0</span>;</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">size_t</span> <span style="color: 
#9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #9cdcfe;">size</span>; <span style="color: #9cdcfe;">i</span>++) {</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">i</span> % <span style="color: #b5cea8;">2</span> == <span style="color: #b5cea8;">0</span>)</div><div style="color: #d4d4d4;"> <span style="color: #dcdcaa;">bloomfilter_put</span>(<span style="color: #9cdcfe;">filter</span>, &<span style="color: #9cdcfe;">i</span>, <span style="color: #569cd6;">sizeof</span>(<span style="color: #4ec9b0;">size_t</span>));</div><div style="color: #d4d4d4;"> }</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #9cdcfe;">size</span>; <span style="color: #9cdcfe;">i</span>++) {</div><div style="color: #d4d4d4;"> <span style="color: #569cd6;">bool</span> <span style="color: #9cdcfe;">contains</span> = <span style="color: #dcdcaa;">bloomfilter_contains</span>(<span style="color: #9cdcfe;">filter</span>, &<span style="color: #9cdcfe;">i</span>, <span style="color: #569cd6;">sizeof</span>(<span style="color: #9cdcfe;">i</span>));</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">i</span> % <span style="color: #b5cea8;">2</span> == <span style="color: #b5cea8;">0</span>) {</div><div style="color: #d4d4d4;"> <span style="color: #569cd6;">test</span>(<span style="color: #9cdcfe;">contains</span> == <span style="color: #569cd6;">true</span>, <span style="color: #ce9178;">"Expected value </span><span style="color: #9cdcfe;">%zu</span><span style="color: #ce9178;"> to be present in</span></div><div style="color: #d4d4d4;"><span style="color: #ce9178;"> filter"</span>, <span style="color: 
#9cdcfe;">i</span>);</div><div style="color: #d4d4d4;"> } <span style="color: #c586c0;">else</span> {</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">contains</span> == <span style="color: #569cd6;">true</span>) {</div><div style="color: #d4d4d4;"> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"Expected value </span><span style="color: #9cdcfe;">%zu</span><span style="color: #ce9178;"> to be not present in filter,</span></div><div style="color: #d4d4d4;"><span style="color: #ce9178;"> maybe false positive</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span>, <span style="color: #9cdcfe;">i</span>);</div><div style="color: #d4d4d4;"> <span style="color: #9cdcfe;">fp_count</span>++;</div><div style="color: #d4d4d4;"> }</div><div style="color: #d4d4d4;"> }</div><div style="color: #d4d4d4;"> }</div><div style="color: #d4d4d4;"> <span style="color: #dcdcaa;">bloomfilter_free</span>(<span style="color: #9cdcfe;">filter</span>);</div><div style="color: #d4d4d4;"> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"Total number of false positive = </span><span style="color: #9cdcfe;">%zu</span><span style="color: #ce9178;">, percentage: </span><span style="color: #9cdcfe;">%f</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span>, <span style="color: #9cdcfe;">fp_count</span>,</div><div><span style="color: #d4d4d4;"> </span><span style="color: #b5cea8;">100.0</span><span style="color: #d4d4d4;"> * </span><span style="color: #9cdcfe;">fp_count</span><span style="color: #d4d4d4;"> / </span><span style="color: #9cdcfe;">size</span><span style="color: #d4d4d4;">);</span></div><div style="color: #d4d4d4;">}</div><span style="color: #d4d4d4;"><br /></span><div style="color: #d4d4d4;"><span style="color: #569cd6;">int</span></div><div style="color: #d4d4d4;"><span style="color: #dcdcaa;">main</span>(<span 
style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">argc</span>, <span style="color: #569cd6;">char</span> **<span style="color: #9cdcfe;">argv</span>)</div><div style="color: #d4d4d4;">{</div><div style="color: #d4d4d4;"> <span style="color: #dcdcaa;">test_bloomfilter</span>();</div><div style="color: #d4d4d4;"> <span style="color: #c586c0;">return</span> <span style="color: #b5cea8;">0</span>;</div><div style="color: #d4d4d4;">}</div></div></div><div><br /></div><div>Here too we have only one test function, which tests a few things. Here is what's happening in the test:</div><div><br /></div><div><ul style="text-align: left;"><li>We create a bloom filter of some fixed size</li><li>We iterate from 0 to the size of the filter and for every even index we insert that number in the filter</li><li>Then we iterate again and now we query the filter for each loop index value.</li><li>For every even index value we expect the filter to return true because we stored those values in the filter.</li><li>For every odd value we print a message if the filter returned true. That is a false positive.</li><li>At the end of the test we print out how many false positives we encountered and what the false positive rate is.</li></ul></div><div><br /></div><div>Let's implement the bloom filter APIs now in a file called <span style="font-family: Roboto Mono;">bloomfilter.c</span>. 
Similar to bitvector I will reproduce individual functions and explain them.</div><div><br /></div><div><b>bloomfilter_init:</b></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;">"bloomfilter.h"</span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;">"murmur3.h"</span></div><br /><div><span style="color: #569cd6;">static</span> <span style="color: #4ec9b0;">uint32_t</span> <span style="color: #9cdcfe;">SEEDS</span><span style="color: #569cd6;">[]</span> = {<span style="color: #b5cea8;">80430271</span>, <span style="color: #b5cea8;">89023841</span>, <span style="color: #b5cea8;">88060457</span>, <span style="color: #b5cea8;">60974549</span>, <span style="color: #b5cea8;">50009261</span>, <span style="color: #b5cea8;">87906149</span>};</div><br /><div><span style="color: #4ec9b0;">bloomfilter_t</span> *</div><div><span style="color: #dcdcaa;">bloomfilter_init</span>(<span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">size</span>)</div><div>{</div><div> <span style="color: #4ec9b0;">bloomfilter_t</span> *<span style="color: #9cdcfe;">filter</span>;</div><div> <span style="color: #9cdcfe;">filter</span> = <span style="color: #dcdcaa;">malloc</span>(<span style="color: #569cd6;">sizeof</span>(*<span style="color: #9cdcfe;">filter</span>));</div><div> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">filter</span> == <span style="color: #569cd6;">NULL</span>) {</div><div> <span style="color: #dcdcaa;">err</span>(<span style="color: #569cd6;">EXIT_FAILURE</span>, <span style="color: #ce9178;">"malloc failed"</span>);</div><div> }</div><div> <span style="color: #6a9955;">// we init the bloom 
filter with size 10 times greater than the requested size</span></div><div> <span style="color: #6a9955;">// to maintain 1% false positive rate</span></div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">filter_size</span> = <span style="color: #b5cea8;">10</span> * <span style="color: #9cdcfe;">size</span>;</div><div> <span style="color: #9cdcfe;">filter</span>-><span style="color: #9cdcfe;">size</span> = <span style="color: #9cdcfe;">filter_size</span>;</div><div> <span style="color: #6a9955;">// with a 10 times bigger filter size and 1% fp rate, the number of hash</span></div><div> <span style="color: #6a9955;">// functions comes out to about 6</span></div><div> <span style="color: #9cdcfe;">filter</span>-><span style="color: #9cdcfe;">nhash</span> = <span style="color: #569cd6;">NHASH</span>;</div><div> <span style="color: #9cdcfe;">filter</span>-><span style="color: #9cdcfe;">bitvector</span> = <span style="color: #dcdcaa;">bitvector_allocate</span>(<span style="color: #9cdcfe;">filter_size</span>);</div><div> <span style="color: #c586c0;">return</span> <span style="color: #9cdcfe;">filter</span>;</div><div>}</div><br /></div></div><div><br /></div><div>This function is expected to initialize the filter. We do that by allocating the underlying bitvector. We are allocating the bitvector with size 10 times greater than the requested filter size. It turns out that, based on the standard false positive calculations for bloom filters, if we want to keep the false positive rate at around 1% then the bit vector needs roughly 10 bits for every element we expect to store, and we should use about 6 hash functions. 
That's what is happening in the function.</div><div><br /></div><div><b>bloomfilter_put:</b></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #569cd6;">void</span></div><div><span style="color: #dcdcaa;">bloomfilter_put</span>(<span style="color: #4ec9b0;">bloomfilter_t</span> *<span style="color: #9cdcfe;">filter</span>, <span style="color: #569cd6;">const</span> <span style="color: #569cd6;">void</span> *<span style="color: #9cdcfe;">data</span>, <span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">len</span>)</div><div>{</div><div> <span style="color: #4ec9b0;">__uint128_t</span> <span style="color: #9cdcfe;">hash</span> = <span style="color: #b5cea8;">0</span>;</div><div> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #569cd6;">NHASH</span>; <span style="color: #9cdcfe;">i</span>++) {</div><div> <span style="color: #dcdcaa;">MurmurHash3_x64_128</span>(<span style="color: #9cdcfe;">data</span>, <span style="color: #9cdcfe;">len</span>, <span style="color: #9cdcfe;">SEEDS</span>[<span style="color: #9cdcfe;">i</span>], &<span style="color: #9cdcfe;">hash</span>);</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">index</span> = <span style="color: #9cdcfe;">hash</span> % <span style="color: #9cdcfe;">filter</span>-><span style="color: #9cdcfe;">size</span>;</div><div> <span style="color: #dcdcaa;">bitvector_set</span>(<span style="color: #9cdcfe;">filter</span>-><span style="color: #9cdcfe;">bitvector</span>, <span style="color: #9cdcfe;">index</span>);</div><div> <span style="color: #9cdcfe;">hash</span> = <span style="color: #b5cea8;">0</span>;</div><div> 
}</div><div>}</div><br /></div></div><div><br /></div><div>This API is expected to store the given data item in the filter. We are passing a pointer to the data item and its length in bytes. A good bloom filter should use hash functions which are independent and uniformly distributed, meaning that they spread the values evenly across the bit vector. This reduces the probability of collisions. Also, since bloom filters are used in big data applications where performance is critical, the hash functions should also be fast. Murmur3 is one such hash function which satisfies all these criteria and that's why I am using it here. Instead of implementing it myself, I have taken the public domain implementation available on Github here: <a href="https://github.com/PeterScott/murmur3">https://github.com/PeterScott/murmur3</a>. This implementation exposes the <span style="background-color: #1e1e1e; color: #dcdcaa; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; white-space: pre;">MurmurHash3_x64_128</span> function as its main API, which is what I call here. We can pass a seed to this function; I've picked 6 random prime numbers as seeds, which gives us 6 hash functions. </div><div><br /></div><div>In essence, we are iterating through each seed, calling Murmur3 to hash the data and setting the corresponding bit to 1 in the bitvector. 
The Murmur3 hash function returns a 128 bit integer so we need to take a mod with the filter size to get the right bit index.</div><div><br /></div><div><b>bloomfilter_contains:</b></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #569cd6;">bool</span></div><div><span style="color: #dcdcaa;">bloomfilter_contains</span>(<span style="color: #4ec9b0;">bloomfilter_t</span> *<span style="color: #9cdcfe;">filter</span>, <span style="color: #569cd6;">const</span> <span style="color: #569cd6;">void</span> *<span style="color: #9cdcfe;">data</span>, <span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">len</span>)</div><div>{</div><div> <span style="color: #4ec9b0;">__uint128_t</span> <span style="color: #9cdcfe;">hash</span> = <span style="color: #b5cea8;">0</span>;</div><div> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #569cd6;">NHASH</span>; <span style="color: #9cdcfe;">i</span>++) {</div><div> <span style="color: #dcdcaa;">MurmurHash3_x64_128</span>(<span style="color: #9cdcfe;">data</span>, <span style="color: #9cdcfe;">len</span>, <span style="color: #9cdcfe;">SEEDS</span>[<span style="color: #9cdcfe;">i</span>], &<span style="color: #9cdcfe;">hash</span>);</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">index</span> = <span style="color: #9cdcfe;">hash</span> % <span style="color: #9cdcfe;">filter</span>-><span style="color: #9cdcfe;">size</span>;</div><div> <span style="color: #569cd6;">bool</span> <span style="color: #9cdcfe;">isset</span> = <span style="color: #dcdcaa;">bitvector_isset</span>(<span style="color: #9cdcfe;">filter</span>-><span style="color: 
#9cdcfe;">bitvector</span>, <span style="color: #9cdcfe;">index</span>);</div><div> <span style="color: #c586c0;">if</span> (!<span style="color: #9cdcfe;">isset</span>)</div><div> <span style="color: #c586c0;">return</span> <span style="color: #569cd6;">false</span>;</div><div> <span style="color: #9cdcfe;">hash</span> = <span style="color: #b5cea8;">0</span>;</div><div> }</div><div> <span style="color: #c586c0;">return</span> <span style="color: #569cd6;">true</span>;</div><div>}</div><br /></div></div><div><br /></div><div>This is pretty much the same as <span style="font-family: Roboto Mono;">bloomfilter_put</span>. We iterate through each of the hash functions, compute the bit index and check whether that bit is set. If any of the bits is off, we know that the value was never stored in the filter. Otherwise it might have been stored: a bloom filter can return false positives, but never false negatives.</div><div><br /></div><div>That's it, we have finished our bloom filter. Let's integrate this into the Makefile. I've taken the murmur3.h and murmur3.c files from the Github murmur3 implementation and included them in the build so that the code compiles and runs. 
</div><div><br /></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #dcdcaa;">all</span>: test_bitvector test_bloomfilter</div><br /><div><span style="color: #dcdcaa;">test_bitvector</span>: test_bitvector.o bitvector.o</div><div> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CC</span><span style="color: #ce9178;">}</span> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CFLAGS</span><span style="color: #ce9178;">}</span> -o test_bitvector test_bitvector.o bitvector.o</div><br /><div><span style="color: #dcdcaa;">test_bloomfilter</span>: test_bloomfilter.o bitvector.o bloomfilter.o murmur3.o</div><div> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CC</span><span style="color: #ce9178;">}</span> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CFLAGS</span><span style="color: #ce9178;">}</span> -o test_bloomfilter test_bloomfilter.o bitvector.o bloomfilter.o murmur3.o</div><br /><div><span style="color: #dcdcaa;">test_bitvector.o</span>: test_bitvector.c</div><div> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CC</span><span style="color: #ce9178;">}</span> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CFLAGS</span><span style="color: #ce9178;">}</span> -c test_bitvector.c</div><br /><div><span style="color: #dcdcaa;">test_bloomfilter.o</span>: test_bloomfilter.c</div><div> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CC</span><span style="color: #ce9178;">}</span> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CFLAGS</span><span style="color: #ce9178;">}</span> -c test_bloomfilter.c</div><br /><div><span style="color: #dcdcaa;">murmur3.o</span>: murmur3.c</div><div> <span style="color: #ce9178;">${</span><span 
style="color: #9cdcfe;">CC</span><span style="color: #ce9178;">}</span> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CFLAGS</span><span style="color: #ce9178;">}</span> -c murmur3.c</div><br /><div><span style="color: #dcdcaa;">bloomfilter.o</span>: bloomfilter.c</div><div> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CC</span><span style="color: #ce9178;">}</span> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CFLAGS</span><span style="color: #ce9178;">}</span> -c bloomfilter.c</div><br /><div><span style="color: #dcdcaa;">bitvector.o</span>: bitvector.c</div><div> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CC</span><span style="color: #ce9178;">}</span> <span style="color: #ce9178;">${</span><span style="color: #9cdcfe;">CFLAGS</span><span style="color: #ce9178;">}</span> -c bitvector.c</div><br /><div><span style="color: #dcdcaa;">clean</span>:</div><div> rm -rf *.o test_bitvector test_bloomfilter</div><br /></div></div><div><br /></div><div><br /></div><div>Finally, let's write a small benchmark program to measure how much memory the filter actually takes. I am going to use a large file containing a list of dictionary words, one word per line. The file contains 421124 words. 
Here is the benchmark program in a file called bloomfilter_benchmark.c</div><div><br /></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div style="line-height: 19px;"><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><err.h></span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdio.h></span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdlib.h></span></div><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><sys/types.h></span></div><br /><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;">"bloomfilter.h"</span></div><br /><div><span style="color: #569cd6;">const</span> <span style="color: #569cd6;">char</span> *<span style="color: #9cdcfe;">FILENAME</span> = <span style="color: #ce9178;">"web3"</span>;</div><br /><div><span style="color: #569cd6;">static</span> <span style="color: #569cd6;">void</span></div><div><span style="color: #dcdcaa;">read_file_and_index</span>(<span style="color: #4ec9b0;">bloomfilter_t</span> *<span style="color: #9cdcfe;">filter</span>)</div><div>{</div><div> <span style="color: #4ec9b0;">FILE</span> *<span style="color: #9cdcfe;">f</span> = <span style="color: #dcdcaa;">fopen</span>(<span style="color: #9cdcfe;">FILENAME</span>, <span style="color: #ce9178;">"r"</span>);</div><div> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">f</span> == <span style="color: #569cd6;">NULL</span>)</div><div> <span style="color: #dcdcaa;">err</span>(<span style="color: #569cd6;">EXIT_FAILURE</span>, <span style="color: 
#ce9178;">"Failed to open file for reading"</span>);</div><div> <span style="color: #4ec9b0;">ssize_t</span> <span style="color: #9cdcfe;">bytes_read</span>;</div><div> <span style="color: #569cd6;">char</span> *<span style="color: #9cdcfe;">line</span> = <span style="color: #569cd6;">NULL</span>;</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">linesize</span> = <span style="color: #b5cea8;">0</span>;</div><div> <span style="color: #c586c0;">while</span> ((<span style="color: #9cdcfe;">bytes_read</span> = <span style="color: #dcdcaa;">getline</span>(&<span style="color: #9cdcfe;">line</span>, &<span style="color: #9cdcfe;">linesize</span>, <span style="color: #9cdcfe;">f</span>)) != -<span style="color: #b5cea8;">1</span>) {</div><div> <span style="color: #9cdcfe;">line</span>[<span style="color: #9cdcfe;">bytes_read</span> - <span style="color: #b5cea8;">1</span>] = <span style="color: #b5cea8;">0</span>;</div><div> <span style="color: #dcdcaa;">bloomfilter_put</span>(<span style="color: #9cdcfe;">filter</span>, <span style="color: #9cdcfe;">line</span>, <span style="color: #9cdcfe;">bytes_read</span>);</div><div> <span style="color: #dcdcaa;">free</span>(<span style="color: #9cdcfe;">line</span>);</div><div> <span style="color: #9cdcfe;">linesize</span> = <span style="color: #b5cea8;">0</span>;</div><div> <span style="color: #9cdcfe;">line</span> = <span style="color: #569cd6;">NULL</span>;</div><div> }</div><div> <span style="color: #dcdcaa;">free</span>(<span style="color: #9cdcfe;">line</span>);</div><div> <span style="color: #dcdcaa;">fclose</span>(<span style="color: #9cdcfe;">f</span>);</div><div>}</div><br /><div><span style="color: #569cd6;">static</span> <span style="color: #569cd6;">void</span></div><div><span style="color: #dcdcaa;">read_file_and_query</span>(<span style="color: #4ec9b0;">bloomfilter_t</span> *<span style="color: #9cdcfe;">filter</span>)</div><div>{</div><div> <span style="color: 
#4ec9b0;">FILE</span> *<span style="color: #9cdcfe;">f</span> = <span style="color: #dcdcaa;">fopen</span>(<span style="color: #9cdcfe;">FILENAME</span>, <span style="color: #ce9178;">"r"</span>);</div><div> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">f</span> == <span style="color: #569cd6;">NULL</span>)</div><div> <span style="color: #dcdcaa;">err</span>(<span style="color: #569cd6;">EXIT_FAILURE</span>, <span style="color: #ce9178;">"Failed to open file for reading"</span>);</div><div> <span style="color: #4ec9b0;">ssize_t</span> <span style="color: #9cdcfe;">bytes_read</span>;</div><div> <span style="color: #569cd6;">char</span> *<span style="color: #9cdcfe;">line</span> = <span style="color: #569cd6;">NULL</span>;</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">linesize</span> = <span style="color: #b5cea8;">0</span>;</div><div> <span style="color: #c586c0;">while</span> ((<span style="color: #9cdcfe;">bytes_read</span> = <span style="color: #dcdcaa;">getline</span>(&<span style="color: #9cdcfe;">line</span>, &<span style="color: #9cdcfe;">linesize</span>, <span style="color: #9cdcfe;">f</span>)) != -<span style="color: #b5cea8;">1</span>) {</div><div> <span style="color: #9cdcfe;">line</span>[<span style="color: #9cdcfe;">bytes_read</span> - <span style="color: #b5cea8;">1</span>] = <span style="color: #b5cea8;">0</span>;</div><div> <span style="color: #569cd6;">bool</span> <span style="color: #9cdcfe;">contains</span> = <span style="color: #dcdcaa;">bloomfilter_contains</span>(<span style="color: #9cdcfe;">filter</span>, <span style="color: #9cdcfe;">line</span>, <span style="color: #9cdcfe;">bytes_read</span>);</div><div> <span style="color: #c586c0;">if</span> (<span style="color: #9cdcfe;">contains</span> == <span style="color: #569cd6;">false</span>) {</div><div> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"Expected for the filter to contain </span><span 
style="color: #9cdcfe;">%s</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span>, <span style="color: #9cdcfe;">line</span>);</div><div> }</div><div> <span style="color: #dcdcaa;">free</span>(<span style="color: #9cdcfe;">line</span>);</div><div> <span style="color: #9cdcfe;">linesize</span> = <span style="color: #b5cea8;">0</span>;</div><div> <span style="color: #9cdcfe;">line</span> = <span style="color: #569cd6;">NULL</span>;</div><div> }</div><div> <span style="color: #dcdcaa;">free</span>(<span style="color: #9cdcfe;">line</span>);</div><div> <span style="color: #dcdcaa;">fclose</span>(<span style="color: #9cdcfe;">f</span>);</div><div>}</div><br /><div><span style="color: #569cd6;">static</span> <span style="color: #569cd6;">void</span></div><div><span style="color: #dcdcaa;">print_resource_usage</span>(<span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">filter_size</span>)</div><div>{</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">vector_size</span> = <span style="color: #9cdcfe;">filter_size</span> * <span style="color: #b5cea8;">10</span>;</div><div> <span style="color: #4ec9b0;">size_t</span> <span style="color: #9cdcfe;">bytes_reqd</span> = <span style="color: #9cdcfe;">vector_size</span> / <span style="color: #b5cea8;">8</span> + <span style="color: #b5cea8;">1</span>;</div><div><div style="line-height: 19px;"> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"Memory used for </span><span style="color: #9cdcfe;">%f</span><span style="color: #ce9178;"> mb</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span>, <span style="color: #b5cea8;">1.0</span> * <span style="color: #9cdcfe;">bytes_reqd</span> / (<span style="color: #b5cea8;">1024</span> * <span style="color: #b5cea8;">1024</span>));</div></div><div>}</div><br /><div><span style="color: #569cd6;">int</span></div><div><span style="color: 
#dcdcaa;">main</span>(<span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">argc</span>, <span style="color: #569cd6;">char</span> **<span style="color: #9cdcfe;">argv</span>)</div><div>{</div><div> <span style="color: #4ec9b0;">bloomfilter_t</span> *<span style="color: #9cdcfe;">filter</span> = <span style="color: #dcdcaa;">bloomfilter_init</span>(<span style="color: #b5cea8;">500000</span>);</div><div> <span style="color: #dcdcaa;">read_file_and_index</span>(<span style="color: #9cdcfe;">filter</span>);</div><div> <span style="color: #dcdcaa;">read_file_and_query</span>(<span style="color: #9cdcfe;">filter</span>);</div><div> <span style="color: #dcdcaa;">print_resource_usage</span>(<span style="color: #b5cea8;">500000</span>);</div><div> <span style="color: #dcdcaa;">bloomfilter_free</span>(<span style="color: #9cdcfe;">filter</span>);</div><div> <span style="color: #c586c0;">return</span> <span style="color: #b5cea8;">0</span>;</div><div>}<br /></div></div></div></div><div><br /></div><div>We create a filter with a capacity slightly greater than the number of items in the file. The function <span style="background-color: #1e1e1e; color: #dcdcaa; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; white-space: pre;">read_file_and_index</span> reads the file line by line and puts each word in the filter. The function <span style="background-color: #1e1e1e; color: #dcdcaa; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; white-space: pre;">read_file_and_query</span> reads each word again and checks that the filter reports it as present. The function <span style="background-color: #1e1e1e; color: #dcdcaa; font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback"; font-size: 14px; white-space: pre;">print_resource_usage</span>() prints the memory needed by the filter's bit vector, computed from the filter size at 10 bits per item. 
Note that this is a calculated figure for the bit vector alone, not the measured resident size of the whole process. This is what I get after running this benchmark:</div><div><br /></div><div><div><div>➜ time ./bloomfilter_benchmark</div><div>Memory used for 0.596047 mb</div><div><br /></div><div>________________________________________________________</div><div>Executed in 231.93 millis fish external</div><div> usr time 232.06 millis 765.00 micros 231.30 millis</div><div> sys time 0.00 millis 0.00 micros 0.00 millis</div></div></div><div><br /></div><div>So a bit vector sized for 500000 items, holding all 421124 words, takes just 0.6 MB of memory. For reference, the file itself is 4.3 MB.</div><div><br /></div><div>The complete code is available on Github: <a href="https://github.com/abhinav-upadhyay/bloom-filter-et-al">https://github.com/abhinav-upadhyay/bloom-filter-et-al</a></div><div><br /></div><div><br /></div>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-31239639085339904512022-01-02T07:54:00.003-08:002022-01-02T08:55:20.786-08:00To Compute or to Cache?<p><span style="font-family: Roboto;">Modern CPUs have become so complicated that it's hard to have an intuition about their performance characteristics. A common habit is to cache the result of a frequently repeated computation, because intuitively computation is expensive and reading a cached result should save time. But that is not always true. Let's see why.</span></p><p><span style="font-family: Roboto;">When writing high-level code and trying to analyze its performance, it helps to have a mental model of how it might be compiled to assembly or machine instructions and what the latency of those instructions might be. 
There is a very helpful infographic that I came across a while back which lists some of the most common CPU operations and their latency in CPU cycles. To put things in perspective, it also shows the distance travelled by light in the time each operation takes to finish.</span></p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjFt2xXFbb2Og60hfXILZpyxMQ8enzElE224hVLavow-WSPwXtlzn2KipTiKhuEsufO5ZL-ns5frgwFI3dxQwAZWtolZ2P_WBFdoVdzgs_77uFiRhAmFZh198iKsK6eLAR0ouKd8ysZ7akFyruTMTkI_BuPZQZ749VMNKhi0o5spAfGyxz0eL9umw=s2400" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="2000" data-original-width="2400" height="534" src="https://blogger.googleusercontent.com/img/a/AVvXsEjFt2xXFbb2Og60hfXILZpyxMQ8enzElE224hVLavow-WSPwXtlzn2KipTiKhuEsufO5ZL-ns5frgwFI3dxQwAZWtolZ2P_WBFdoVdzgs_77uFiRhAmFZh198iKsK6eLAR0ouKd8ysZ7akFyruTMTkI_BuPZQZ749VMNKhi0o5spAfGyxz0eL9umw=w640-h534" width="640" /></a></div><br /><p><span style="font-family: Roboto;">There are a few interesting things here. Simple register-to-register operations, such as adding two numbers, are very fast. That means if the data is already sitting in registers, it can be operated upon in almost no time.</span></p><p><span style="font-family: Roboto;">We can also notice that floating point addition, and multiplication of different numeric types, are very fast, at 1-3 and 1-7 cycles respectively. At the same time, we can see that a read from L1 cache (3-4 cycles) can be slightly slower than those operations. This tells us that it can be faster to multiply two numbers again than to read a stored result of the multiplication back from the cache. Not only would storing it waste precious cache space, it would also be slower. 
This is revealing because most of us would assume the opposite and write code that stores the computed result, thinking it saves CPU cycles.</span></p><p><span style="font-family: Roboto;">At the same time, we should note that division is a very expensive operation (a relatively well known fact), so we should avoid it where possible, or cache the result if we need it often.</span></p><p><span style="font-family: Roboto;">Let's measure how large the difference is in practice. I'm going to use C because the compiler generates code that is very close to the hardware model, as opposed to higher level languages where things like the language runtime can make it hard to understand what's happening. </span></p><div style="background-color: #1e1e1e; color: #d4d4d4; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><time.h></span></div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdio.h></span></div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdlib.h></span></div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><stdint.h></span></div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><unistd.h></span></div><div style="font-family: "Droid Sans Mono", "monospace", monospace, 
"Droid Sans Fallback";"><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><math.h></span></div><span style="font-family: Droid Sans Mono, monospace, monospace, Droid Sans Fallback;"><br /></span><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #c586c0;">#define</span><span style="color: #569cd6;"> </span><span style="color: #569cd6;">MEASURE_COUNT</span><span style="color: #569cd6;"> </span><span style="color: #b5cea8;">200000000</span></div><span style="font-family: Droid Sans Mono, monospace, monospace, Droid Sans Fallback;"><br /><br /></span><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #569cd6;">static</span> <span style="color: #569cd6;">double</span></div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #dcdcaa;">mean</span>(<span style="color: #569cd6;">double</span> *<span style="color: #9cdcfe;">values</span>, <span style="color: #4ec9b0;">uint32_t</span> <span style="color: #9cdcfe;">size</span>)</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";">{</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #569cd6;">double</span> <span style="color: #9cdcfe;">sum</span> = <span style="color: #b5cea8;">0.0</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">uint32_t</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #9cdcfe;">size</span>; <span style="color: #9cdcfe;">i</span>++)</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span 
style="color: #9cdcfe;">sum</span> += <span style="color: #9cdcfe;">values</span>[<span style="color: #9cdcfe;">i</span>];</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #c586c0;">return</span> <span style="color: #9cdcfe;">sum</span> / <span style="color: #9cdcfe;">size</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";">}</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #569cd6;">static</span> <span style="color: #569cd6;">double</span></div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #dcdcaa;">std</span>(<span style="color: #569cd6;">double</span> *<span style="color: #9cdcfe;">values</span>, <span style="color: #4ec9b0;">uint32_t</span> <span style="color: #9cdcfe;">size</span>, <span style="color: #569cd6;">double</span> <span style="color: #9cdcfe;">mean</span>)</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";">{</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #569cd6;">double</span> <span style="color: #9cdcfe;">sum</span> = <span style="color: #b5cea8;">0.0</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">uint32_t</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #9cdcfe;">size</span>; <span style="color: #9cdcfe;">i</span>++)</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">sum</span> += <span style="color: #dcdcaa;">pow</span>(<span style="color: #9cdcfe;">values</span>[<span 
style="color: #9cdcfe;">i</span>] - <span style="color: #9cdcfe;">mean</span>, <span style="color: #b5cea8;">2</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #c586c0;">return</span> <span style="color: #dcdcaa;">sqrt</span>(<span style="color: #9cdcfe;">sum</span> / <span style="color: #9cdcfe;">size</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";">}</div><span style="font-family: Droid Sans Mono, monospace, monospace, Droid Sans Fallback;"><br /></span><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #569cd6;">static</span> <span style="color: #569cd6;">double</span></div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #dcdcaa;">sum</span>(<span style="color: #569cd6;">double</span> *<span style="color: #9cdcfe;">values</span>, <span style="color: #4ec9b0;">uint32_t</span> <span style="color: #9cdcfe;">size</span>)</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";">{</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #569cd6;">double</span> <span style="color: #9cdcfe;">sum</span> = <span style="color: #b5cea8;">0.0</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #c586c0;">for</span> (<span style="color: #4ec9b0;">uint32_t</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #9cdcfe;">size</span>; <span style="color: #9cdcfe;">i</span>++)</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">sum</span> += <span style="color: 
#9cdcfe;">values</span>[<span style="color: #9cdcfe;">i</span>];</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #c586c0;">return</span> <span style="color: #9cdcfe;">sum</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";">}</div><span style="font-family: Droid Sans Mono, monospace, monospace, Droid Sans Fallback;"><br /><br /></span><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #569cd6;">int</span></div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><span style="color: #dcdcaa;">main</span>(<span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">argc</span>, <span style="color: #569cd6;">char</span> **<span style="color: #9cdcfe;">argv</span>)</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";">{</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">a</span> = <span style="color: #b5cea8;">100</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">b</span> = <span style="color: #b5cea8;">200</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #569cd6;">int</span> *<span style="color: #9cdcfe;">prod</span> = <span style="color: #dcdcaa;">malloc</span>(<span style="color: #569cd6;">sizeof</span>(<span style="color: #569cd6;">int</span>));</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> *<span style="color: #9cdcfe;">prod</span> = <span style="color: #9cdcfe;">a</span> * <span style="color: #9cdcfe;">b</span>;</div><div 
style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #569cd6;">double</span> *<span style="color: #9cdcfe;">times1</span> = <span style="color: #dcdcaa;">malloc</span>(<span style="color: #569cd6;">sizeof</span>(<span style="color: #569cd6;">double</span>) * <span style="color: #569cd6;">MEASURE_COUNT</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #569cd6;">double</span> * <span style="color: #9cdcfe;">times2</span> = <span style="color: #dcdcaa;">malloc</span>(<span style="color: #569cd6;">sizeof</span>(<span style="color: #569cd6;">double</span>) * <span style="color: #569cd6;">MEASURE_COUNT</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #569cd6;">struct</span> <span style="color: #4ec9b0;">timespec</span> <span style="color: #9cdcfe;">begin</span>, <span style="color: #9cdcfe;">end</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #569cd6;">double</span> <span style="color: #9cdcfe;">time_taken</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #569cd6;">double</span> <span style="color: #9cdcfe;">mean_time</span>, <span style="color: #9cdcfe;">std_time</span>, <span style="color: #9cdcfe;">sum_time</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"><br /></div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #c586c0;">for</span> (<span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #569cd6;">MEASURE_COUNT</span>; <span style="color: #9cdcfe;">i</span>++) 
{</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #dcdcaa;">clock_gettime</span>(CLOCK_MONOTONIC_RAW, &<span style="color: #9cdcfe;">begin</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"</span><span style="color: #9cdcfe;">%d</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span>, <span style="color: #9cdcfe;">a</span> * <span style="color: #9cdcfe;">b</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #dcdcaa;">clock_gettime</span>(CLOCK_MONOTONIC_RAW, &<span style="color: #9cdcfe;">end</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">time_taken</span> = (<span style="color: #9cdcfe;">end</span>.<span style="color: #9cdcfe;">tv_nsec</span> - <span style="color: #9cdcfe;">begin</span>.<span style="color: #9cdcfe;">tv_nsec</span>) / <span style="color: #b5cea8;">1000000000.0</span> +</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> (<span style="color: #9cdcfe;">end</span>.<span style="color: #9cdcfe;">tv_sec</span> - <span style="color: #9cdcfe;">begin</span>.<span style="color: #9cdcfe;">tv_sec</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">times1</span>[<span style="color: #9cdcfe;">i</span>] = <span style="color: #9cdcfe;">time_taken</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> }</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #c586c0;">for</span> (<span style="color: #569cd6;">int</span> <span 
style="color: #9cdcfe;">i</span> = <span style="color: #b5cea8;">0</span>; <span style="color: #9cdcfe;">i</span> < <span style="color: #569cd6;">MEASURE_COUNT</span>; <span style="color: #9cdcfe;">i</span>++) {</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #dcdcaa;">clock_gettime</span>(CLOCK_MONOTONIC_RAW, &<span style="color: #9cdcfe;">begin</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #dcdcaa;">printf</span>(<span style="color: #ce9178;">"</span><span style="color: #9cdcfe;">%d</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span>, *<span style="color: #9cdcfe;">prod</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #dcdcaa;">clock_gettime</span>(CLOCK_MONOTONIC_RAW, &<span style="color: #9cdcfe;">end</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">time_taken</span> = (<span style="color: #9cdcfe;">end</span>.<span style="color: #9cdcfe;">tv_nsec</span> - <span style="color: #9cdcfe;">begin</span>.<span style="color: #9cdcfe;">tv_nsec</span>) / <span style="color: #b5cea8;">1000000000.0</span> +</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> (<span style="color: #9cdcfe;">end</span>.<span style="color: #9cdcfe;">tv_sec</span> - <span style="color: #9cdcfe;">begin</span>.<span style="color: #9cdcfe;">tv_sec</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">times2</span>[<span style="color: #9cdcfe;">i</span>] = <span style="color: #9cdcfe;">time_taken</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> }</div><span 
style="font-family: Droid Sans Mono, monospace, monospace, Droid Sans Fallback;"><br /></span><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">mean_time</span> = <span style="color: #dcdcaa;">mean</span>(<span style="color: #9cdcfe;">times1</span>, <span style="color: #569cd6;">MEASURE_COUNT</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">sum_time</span> = <span style="color: #dcdcaa;">sum</span>(<span style="color: #9cdcfe;">times1</span>, <span style="color: #569cd6;">MEASURE_COUNT</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">std_time</span> = <span style="color: #dcdcaa;">std</span>(<span style="color: #9cdcfe;">times1</span>, <span style="color: #569cd6;">MEASURE_COUNT</span>, <span style="color: #9cdcfe;">mean_time</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #dcdcaa;">fprintf</span>(<span style="color: #569cd6;">stderr</span>, <span style="color: #ce9178;">"total time taken for product: </span><span style="color: #9cdcfe;">%f</span><span style="color: #ce9178;">, avg time: </span><span style="color: #9cdcfe;">%f</span><span style="color: #ce9178;"> +- </span><span style="color: #9cdcfe;">%f</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span>,</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">sum_time</span>, <span style="color: #9cdcfe;">mean_time</span>, <span style="color: #b5cea8;">2</span> * <span style="color: #9cdcfe;">std_time</span>);</div><span style="font-family: Droid Sans Mono, monospace, monospace, Droid Sans Fallback;"><br /></span><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid 
Sans Fallback";"> <span style="color: #9cdcfe;">mean_time</span> = <span style="color: #dcdcaa;">mean</span>(<span style="color: #9cdcfe;">times2</span>, <span style="color: #569cd6;">MEASURE_COUNT</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">sum_time</span> = <span style="color: #dcdcaa;">sum</span>(<span style="color: #9cdcfe;">times2</span>, <span style="color: #569cd6;">MEASURE_COUNT</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">std_time</span> = <span style="color: #dcdcaa;">std</span>(<span style="color: #9cdcfe;">times2</span>, <span style="color: #569cd6;">MEASURE_COUNT</span>, <span style="color: #9cdcfe;">mean_time</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #dcdcaa;">fprintf</span>(<span style="color: #569cd6;">stderr</span>, <span style="color: #ce9178;">"total time taken for print_number: </span><span style="color: #9cdcfe;">%f</span><span style="color: #ce9178;">, avg time: </span><span style="color: #9cdcfe;">%f</span><span style="color: #ce9178;"> +- </span><span style="color: #9cdcfe;">%f</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span>,</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #9cdcfe;">sum_time</span>, <span style="color: #9cdcfe;">mean_time</span>, <span style="color: #b5cea8;">2</span> * <span style="color: #9cdcfe;">std_time</span>);</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";"> <span style="color: #c586c0;">return</span> <span style="color: #b5cea8;">0</span>;</div><div style="font-family: "Droid Sans Mono", "monospace", monospace, "Droid Sans Fallback";">}</div></div><p><span style="font-family: Roboto;">Here we are 
running two loops to measure the latency of the two operations. The first loop computes the product of two integers a and b and prints it to stdout. </span></p><p><span style="font-family: Roboto;">The second loop dereferences a pointer to an integer that holds the precomputed product of a and b and prints it to stdout. We use a pointer to an integer instead of the integer directly in order to bring the L1 cache into play. The generated assembly code should be such that the CPU has to read the value from the memory address the pointer points to and put it in a register before printing it. The CPU will put the value in the cache after the first run, so the loop effectively measures the latency of reading from the L1 cache.</span></p><p><span style="font-family: Roboto;">We can compile and run this code as follows:</span></p><p><span style="font-family: Roboto;">➜ clang -o latency latency.c -lm</span></p><p><span style="font-family: Roboto;">➜ ./latency 1>/dev/null</span></p><p><span style="font-family: Roboto;">total time taken for product: 11.637324, avg time: 0.000000 +- 0.000000</span></p><p><span style="font-family: Roboto;">total time taken for print_number: 12.010727, avg time: 0.000000 +- 0.000000</span></p><div><span style="font-family: Roboto;"><br /></span></div><div><span style="font-family: Roboto;">We can see that the difference between the two computations is not drastic, but over the whole run just multiplying the numbers was slightly faster than keeping the result cached. The difference is not significant enough to justify worrying about such micro-optimizations. That said, recomputing can save precious space in the L1 cache. We know that the cache is small and scarce, and keeping unnecessary data in it can cause cache misses, which will eventually cost more CPU time. 
So even if the cost of multiplication is on average the same as or only slightly better than getting the result from L1, it may be better to just always compute the product and save the cache space.</span></div>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-92100156771026746752021-08-08T09:55:00.002-07:002021-08-08T10:42:30.903-07:00Quick look at the new vector API in Java-16<p><span style="font-family: Roboto; font-size: medium;">One of the shiny new features of Java 16 is the new vector API, which allows vectorized execution of numerical operations on arrays by JIT compiling the Java code to SIMD instructions. Compiling high-level language code directly to SIMD instructions is a hard challenge, and it is difficult to do as well as someone writing the assembly code by hand. There are libraries which make this easy by providing optimized implementations of these numerical computations, such as the various BLAS (Basic Linear Algebra Subprograms) libraries, usually written in FORTRAN and built on top of lower-level math libraries which eventually deal with the SIMD intrinsics. Many languages have libraries wrapped around the BLAS libraries, e.g. Python/R/Octave work this way. </span></p><p><span style="font-family: Roboto; font-size: medium;">I've been ignorant of the options available for writing high-performance numerical computing code in Java because I didn't have to care. But a few years back, when it was announced that some future release of Java would have a library to compile Java code directly to vectorized instructions, it was very exciting, and the API is finally out in Java 16 (as an incubator module).</span></p><p><span style="font-family: Roboto; font-size: medium;">The vector API in Java is nice, but it is not that straightforward to use. It leaks some details of the underlying hardware, and the programmer has to be aware of them and deal with them. 
For example, the size of the SIMD registers on the CPU determines the stride by which you need to move the vectorized loop forward. It also uses terminology that differs from other languages and libraries, e.g. a vector is made up of lanes, one for each element that fits in the SIMD register, which is not the same thing as the shape of an array in Python. </span></p><p><span style="font-family: Roboto; font-size: medium;">Anyway, I just wanted to do a quick comparison of a simple numerical computation between the vectorized API and the old-school non-vectorized version to see the performance benefits. I'm going to create an array of 100,000 random floats and then compute its mean. The mean computation is repeated several times in a loop. It appears that the performance benefit of the vectorized API kicks in only when the operation is repeated many times; this could be because it takes a while for the JIT compiler to identify and compile the loop.</span></p><p><span style="font-family: Roboto; font-size: medium;"><br /></span></p><p><span style="font-family: Roboto; font-size: medium;">Here is the mean computation using the new vector API:</span></p><p><br /></p><pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: "JetBrains Mono", monospace; font-size: 10.5pt;"><span style="color: #cc7832;">private static final </span>VectorSpecies<Float> <span style="color: #9876aa; font-style: italic;">SPECIES </span>= <br /> FloatVector.<span style="color: #9876aa; font-style: italic;">SPECIES_PREFERRED</span><span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"><br /></span><span style="color: #cc7832;">public static float </span><span style="color: #ffc66d;">vectorizedMean</span>(<span style="color: #cc7832;">float</span>[] values) {<br /> <span style="color: #cc7832;">int </span>i = <span style="color: #6897bb;">0</span><span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> float 
</span>sum = <span style="color: #6897bb;">0.0f</span><span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> for </span>(<span style="color: #cc7832;">; </span>i < <span style="color: #9876aa; font-style: italic;">SPECIES</span>.loopBound(values.<span style="color: #9876aa;">length</span>)<span style="color: #cc7832;">; </span>i += <span style="color: #9876aa; font-style: italic;">SPECIES</span>.length()) {<br /> FloatVector floatVector = FloatVector.<span style="font-style: italic;">fromArray</span>(<span style="color: #9876aa; font-style: italic;">SPECIES</span><span style="color: #cc7832;">, </span>values<span style="color: #cc7832;">, </span>i)<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>sum += floatVector.reduceLanes(VectorOperators.<span style="color: #9876aa; font-style: italic;">ADD</span>)<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>}<br /> <span style="color: grey;">// add the leftover tail elements that do not fill a whole vector<br /></span><span style="color: #cc7832;"> for </span>(<span style="color: #cc7832;">; </span>i < values.<span style="color: #9876aa;">length</span><span style="color: #cc7832;">; </span>i++) {<br /> sum += values[i]<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>}<br /> <span style="color: #cc7832;">return </span>sum / values.<span style="color: #9876aa;">length</span><span style="color: #cc7832;">;<br /></span>}<br /></pre><p><span style="font-family: Roboto; font-size: medium;">This is the non-vectorized version:</span></p><pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: "JetBrains Mono", monospace; font-size: 10.5pt;"><span style="color: #cc7832;">private static float </span><span style="color: #ffc66d;">mean</span>(<span style="color: #cc7832;">float</span>[] arr) {<br /> <span style="color: #cc7832;">float </span>result = <span style="color: #6897bb;">0.0f</span><span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> for </span>(<span style="color: #cc7832;">float </span>v : arr) {<br /> result += v<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>}<br /> <span style="color: #cc7832;">return </span>result / arr.<span style="color: #9876aa;">length</span><span style="color: #cc7832;">;<br /></span>}<br /></pre><p><span 
style="font-family: Roboto; font-size: medium;">Here, we are going to execute both methods a bunch of times and collect the timings</span></p><pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: "JetBrains Mono", monospace; font-size: 10.5pt;"><span style="color: #cc7832;">public static void </span><span style="color: #ffc66d;">main</span>(String[] args) <span style="color: #cc7832;">throws </span>IOException {<br /> <span style="color: #cc7832;">float</span>[] values = <span style="font-style: italic;">generateArray</span>(<span style="color: #6897bb;">100000</span>)<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>List<Long> vecMeanTimes = <span style="color: #cc7832;">new </span>ArrayList<>()<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>List<Long> nonvecMeanTimes = <span style="color: #cc7832;">new </span>ArrayList<>()<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> long </span>start = System.<span style="font-style: italic;">currentTimeMillis</span>()<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> for </span>(<span style="color: #cc7832;">int </span>i = <span style="color: #6897bb;">0</span><span style="color: #cc7832;">; </span>i < <span style="color: #9876aa; font-style: italic;">MAX_RUN_TIMES</span><span style="color: #cc7832;">; </span>i++) {<br /> <span style="font-style: italic;">vectorizedMean</span>(values)<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> final long </span>vectorizedTimeTaken = System.<span style="font-style: italic;">currentTimeMillis</span>() - start<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>vecMeanTimes.add(vectorizedTimeTaken)<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>}<br /> start = System.<span style="font-style: italic;">currentTimeMillis</span>()<span style="color: #cc7832;">;<br /></span><span 
style="color: #cc7832;"> for </span>(<span style="color: #cc7832;">int </span>i = <span style="color: #6897bb;">0</span><span style="color: #cc7832;">; </span>i < <span style="color: #9876aa; font-style: italic;">MAX_RUN_TIMES</span><span style="color: #cc7832;">; </span>i++) {<br /> <span style="color: #cc7832;">float </span>v = <span style="font-style: italic;">mean</span>(values)<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> final long </span>timeTaken = System.<span style="font-style: italic;">currentTimeMillis</span>() - start<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>nonvecMeanTimes.add(timeTaken)<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>}<br /> <span style="font-style: italic;">writeTimesToFile</span>(vecMeanTimes<span style="color: #cc7832;">, </span><span style="color: #6a8759;">"vecMeanTimes.csv"</span>)<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span><span style="font-style: italic;">writeTimesToFile</span>(nonvecMeanTimes<span style="color: #cc7832;">, </span><span style="color: #6a8759;">"nonvecMeanTimes.csv"</span>)<span style="color: #cc7832;">;<br /></span>}<br /></pre><p><br /></p><p><span style="font-family: Roboto; font-size: medium;">MAX_RUN_TIMES was set to 100,000. 
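</span></p><p><span style="font-family: Roboto; font-size: medium;">One way to separate JIT warm-up from steady-state cost is to time each call individually with System.nanoTime() after an explicit warm-up phase. The following is only a minimal sketch under stated assumptions, not the benchmark used for the plots in this post: the class name WarmupBench, the helper timeRuns and the warm-up count of 2,000 are hypothetical, and it times only the scalar mean so that it runs without the incubator module.</span></p>

```java
import java.util.Random;

public class WarmupBench {
    // Scalar mean, same logic as the non-vectorized version in the post.
    public static float mean(float[] arr) {
        float result = 0.0f;
        for (float v : arr) {
            result += v;
        }
        return result / arr.length;
    }

    // Hypothetical helper: per-call latency in nanoseconds for `runs` calls.
    public static long[] timeRuns(float[] values, int runs) {
        long[] times = new long[runs];
        float sink = 0.0f; // accumulate results so the calls are not dead code
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sink += mean(values);
            times[i] = System.nanoTime() - start;
        }
        if (Float.isNaN(sink)) {
            System.err.println("unexpected NaN in benchmark result");
        }
        return times;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        float[] values = new float[100_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = rng.nextFloat();
        }
        // Warm-up phase (assumed count): timings are thrown away.
        timeRuns(values, 2_000);
        // Measured phase: each entry is the latency of a single call.
        long[] times = timeRuns(values, 1_000);
        long total = 0;
        for (long t : times) {
            total += t;
        }
        System.out.println("avg ns per call: " + (total / times.length));
    }
}
```

<p><span style="font-family: Roboto; font-size: medium;">Recording one latency per call, instead of the elapsed time since the start of the loop, makes it easier to see the point at which the compiled code takes over. For serious measurements a dedicated harness such as JMH is usually a better choice than a hand-written loop.</span></p><p><span style="font-family: Roboto; font-size: medium;">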
</span></p><p><span style="font-family: Roboto; font-size: medium;">The following plot shows how the two compare:</span></p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfqVrtB9n8psmYkk_P61-U-fTbVUceeVoJyyOhcWx4dQEnSOXffvORkITRfj8HbDew5J4QwNqpGWvIlXLxde1PUznejwakKYWWPAzXFsAwtpGCPOiukjbR7fRFgmkJObzIulLkkXdNmA/s1920/java_vectorized_vs_nonvec.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="974" data-original-width="1920" height="324" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfqVrtB9n8psmYkk_P61-U-fTbVUceeVoJyyOhcWx4dQEnSOXffvORkITRfj8HbDew5J4QwNqpGWvIlXLxde1PUznejwakKYWWPAzXFsAwtpGCPOiukjbR7fRFgmkJObzIulLkkXdNmA/w640-h324/java_vectorized_vs_nonvec.png" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><p><span style="font-family: Roboto; font-size: medium;">As I said, the vectorized code is slower than the non-vectorized code for a while before it kicks in and gets significantly faster. 
Let's zoom in to find the threshold at which it becomes faster:</span></p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifnJrIP5D_vTdJYWVem2JTAT5SLYBkSrf8azpJH9uLNJrn4UtpbxgWRN4axQim19GWQNmdw0zKkcaIgCkcJX5rJRSyPmjwGD2FZroKcWn4Ku6xkEgujawZsdTr6fLozNvY6hrMJ1fzBg/s1920/java_vectorized_vs_nonvec_zoom.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="974" data-original-width="1920" height="324" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifnJrIP5D_vTdJYWVem2JTAT5SLYBkSrf8azpJH9uLNJrn4UtpbxgWRN4axQim19GWQNmdw0zKkcaIgCkcJX5rJRSyPmjwGD2FZroKcWn4Ku6xkEgujawZsdTr6fLozNvY6hrMJ1fzBg/w640-h324/java_vectorized_vs_nonvec_zoom.png" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><p><span style="font-family: Roboto; font-size: medium;">It appears that it takes about 500 iterations before the JIT compiler realizes it needs to optimize the code, but that's just a guess. </span></p><p><span style="font-family: Roboto; font-size: medium;">In any case, this looks great and exciting, and a lot can be built on top of it.</span></p>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-38724204912076711162021-07-31T06:40:00.005-07:002021-07-31T10:09:43.527-07:00Designing NullPointerException Safe APIs in Java<p><span style="font-family: Roboto; font-size: medium;">I strongly believe that programming is a skill we never truly master. We get better at it with experience, but every project brings new lessons and can sometimes be humbling. One of the things that we can do to get better at this skill is to learn more programming languages. Surely, we need to master the one or few languages that earn us our daily bread, but restricting ourselves to just those is not healthy. 
Learning other languages expands and opens up our minds to different ideas and concepts. That being said, I am guilty of not doing this myself. I've restricted myself to a family of programming languages which essentially look alike (C/C++/C#/Java et al). </span></p><p><span style="font-family: Roboto; font-size: medium;">In the recent past I've started to look at other languages more seriously. Learning new programming languages is a bit difficult for me personally because I get bored very quickly while learning how to define a function or how to write a conditional statement in a new language. But once these hurdles are crossed, it's a more fruitful experience. You start understanding the philosophy of the language: why it was designed the way it is, what new ideas it has that are not available in other languages, and so on. </span></p><p><span style="font-family: Roboto; font-size: medium;">One of the languages that I've found very interesting recently is Rust, and I am always looking for things to do in it. It is a language which is designed around some innovative ideas but at the same time is pragmatic. For a full list of Rust's innovative features, I refer you to this <a href="https://cacm.acm.org/magazines/2021/4/251364-safe-systems-programming-in-rust/fulltext" target="_blank">ACM article</a>. One of its features which probably gets lost in the list of all the other shinier features is that it does not have the problem of NullPointerException which is prevalent in many of the mainstream languages.</span></p><p><span style="font-family: Roboto; font-size: medium;">Tony Hoare (famous for inventing QuickSort) is credited with introducing the concept of </span><span style="font-family: Roboto Mono; font-size: medium;">null</span><span style="font-family: Roboto; font-size: medium;"> in programming languages. 
He introduced the null reference in </span><span style="font-family: Roboto Mono; font-size: medium;">ALGOL W</span><span style="font-family: Roboto; font-size: medium;"> and many years later famously apologised for having done that, calling it his billion-dollar mistake. Many programming languages aped the concept of the </span><span style="font-family: Roboto Mono; font-size: medium;">null</span><span style="font-family: Roboto; font-size: medium;"> pointer, which has been a cause of numerous runtime bugs.</span></p><p><span style="font-family: Roboto; font-size: medium;">Rust avoids </span><span style="font-family: Roboto Mono; font-size: medium;">NullPointerException</span><span style="font-family: Roboto; font-size: medium;"> by not giving the programmer a </span><span style="font-family: Roboto Mono; font-size: medium;">null</span><span style="font-family: Roboto; font-size: medium;"> value to return. The equivalent of not returning any valid result in Rust is the </span><span style="font-family: Roboto Mono; font-size: medium;">None</span><span style="font-family: Roboto; font-size: medium;"> value. But the compiler forbids you from returning </span><span style="font-family: Roboto Mono; font-size: medium;">None</span><span style="font-family: Roboto; font-size: medium;"> from a function declared to return an ordinary type. When writing a function in Rust, you normally declare it to return a value of a specific type, and then the function <b>has</b> to return a value of that type. A function that does not need to return anything is also possible. But if you are writing a function which sometimes has a value to return and at other times has no valid result, you need to declare it to return the </span><span style="font-family: Roboto Mono; font-size: medium;">Option<T></span><span style="font-family: Roboto; font-size: medium;"> type. 
</span></p><p><span style="font-family: Roboto; font-size: medium;"><b>Sidenote: </b>There is another way of getting an invalid pointer error, which is when we dereference a pointer to uninitialized memory or to an object which no longer exists. Rust avoids those as well, but I'm only talking about the problem of null values here.</span></p><p><span style="font-family: Roboto; font-size: medium;">Following is an example of a Rust function which can return a value only in some cases. We have a struct to represent a matrix and a function to find the inverse of the matrix. But we know that some matrices are non-invertible, specifically when their determinant is 0.</span></p><div style="background-color: #1e1e1e; color: #d4d4d4; line-height: 19px; white-space: pre;"><div><span style="color: #569cd6;">pub</span> <span style="color: #569cd6;">struct</span> <span style="color: #4ec9b0;">Matrix</span> {</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #9cdcfe;">nrows</span>: <span style="color: #4ec9b0;">usize</span>,</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #9cdcfe;">ncols</span>: <span style="color: #4ec9b0;">usize</span>,</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #9cdcfe;">vals</span>: <span style="color: #4ec9b0;">Vec</span><<span style="color: #4ec9b0;">f32</span>></div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;">}</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"><br /></div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans 
Fallback"; line-height: 19px;"><span style="color: #569cd6;">impl</span> <span style="color: #4ec9b0;">Matrix</span> {</div></div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px;"><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; line-height: 19px;"><div> <span style="color: #569cd6;">pub</span> <span style="color: #569cd6;">fn</span> <span style="color: #dcdcaa;">inverse</span>(&<span style="color: #569cd6;">self</span>) -> <span style="color: #4ec9b0;">Option</span><<span style="color: #4ec9b0;">Self</span>> {</div><div> <span style="color: #569cd6;">let</span> <span style="color: #9cdcfe;">det</span> = <span style="color: #569cd6;">self</span>.<span style="color: #dcdcaa;">det</span>();</div><div> <span style="color: #c586c0;">if</span> <span style="color: #9cdcfe;">det</span> == <span style="color: #b5cea8;">0.0</span> {</div><div> <span style="color: #c586c0;">return</span> <span style="color: #4fc1ff;">None</span>;</div><div> }</div><div> <span style="color: #569cd6;">let</span> <span style="color: #569cd6;">mut</span> <span style="color: #9cdcfe; text-decoration-line: underline;">inverse_mat</span> = <span style="color: #4ec9b0;">Self</span>::<span style="color: #dcdcaa;">new</span>(<span style="color: #569cd6;">self</span>.<span style="color: #9cdcfe;">nrows</span>, <span style="color: #569cd6;">self</span>.<span style="color: #9cdcfe;">ncols</span>);</div><div> <span style="color: #c586c0;">for</span> <span style="color: #9cdcfe;">i</span> <span style="color: #c586c0;">in</span> <span style="color: #b5cea8;">0</span>..<span style="color: #569cd6;">self</span>.<span style="color: #9cdcfe;">nrows</span> {</div><div> <span style="color: #c586c0;">for</span> <span style="color: #9cdcfe;">j</span> <span style="color: #c586c0;">in</span> <span style="color: #b5cea8;">0</span>..<span style="color: #569cd6;">self</span>.<span style="color: 
#9cdcfe;">ncols</span> {</div><div> <span style="color: #569cd6;">let</span> <span style="color: #9cdcfe;">c</span> = <span style="color: #569cd6;">self</span>.<span style="color: #dcdcaa;">cofactor</span>(<span style="color: #9cdcfe;">i</span>, <span style="color: #9cdcfe;">j</span>);</div><div> <span style="color: #9cdcfe; text-decoration-line: underline;">inverse_mat</span>.<span style="color: #dcdcaa; text-decoration-line: underline;">set</span>(<span style="color: #9cdcfe;">j</span>, <span style="color: #9cdcfe;">i</span>, <span style="color: #9cdcfe;">c</span> / <span style="color: #9cdcfe;">det</span>);</div><div> }</div><div> }</div><div> return <span style="color: #4fc1ff;">Some</span>(<span style="color: #9cdcfe; text-decoration-line: underline;">inverse_mat</span>);</div><div> }</div><br /></div></div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px;">}</div><span style="font-family: Droid Sans Mono, monospace, monospace, Droid Sans Fallback; font-size: 14px;"><br /></span></div><p><span style="font-size: medium;"><span style="font-family: Roboto;">Here we see that the function is defined to return an </span><span style="font-family: Roboto Mono;">Option<Self></span><span style="font-family: Roboto;"> type, which declares that the function may return either an object of type Matrix or None. 
Option is an enum defined in the Rust standard library, which has two variants: </span><span style="font-family: Roboto Mono;">None</span><span style="font-family: Roboto;"> and </span><span style="font-family: Roboto Mono;">Some(T).</span><span style="font-family: Roboto;"> </span></span></p><div style="background-color: #1e1e1e; color: #d4d4d4; line-height: 19px; white-space: pre;"><div><span style="color: #569cd6;">pub</span> <span style="color: #569cd6;">enum</span> <span style="color: #4ec9b0;">Option</span><<span style="color: #4ec9b0;">T</span>> {</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #4fc1ff;">None</span>,</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #4fc1ff;">Some</span>(<span style="color: #4ec9b0;">T</span>),</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;">}</div></div><p><span style="font-family: Roboto; font-size: medium;">When a function is declared to return an </span><span style="font-family: Roboto Mono; font-size: medium;">Option</span><span style="font-family: Roboto; font-size: medium;"> type value, whoever calls that function has to check whether they got None or an actual value. 
Here is an example of how such a function may be used.</span></p><div style="background-color: #1e1e1e; color: #d4d4d4; line-height: 19px; white-space: pre;"><div> <span style="color: #569cd6;">fn</span> <span style="color: #dcdcaa;">test_inverse2</span>() {</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #569cd6;">let</span> <span style="color: #9cdcfe;">a</span> = <span style="color: #4ec9b0;">Matrix</span>::<span style="color: #dcdcaa;">from_array</span>(<span style="color: #b5cea8;">4</span>, <span style="color: #b5cea8;">4</span>, &[<span style="color: #b5cea8;">8.0</span>, <span style="color: #b5cea8;">-</span><span style="color: #b5cea8;">5.0</span>, <span style="color: #b5cea8;">9.0</span>, <span style="color: #b5cea8;">2.0</span>,</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #b5cea8;">7.0</span>, <span style="color: #b5cea8;">5.0</span>, <span style="color: #b5cea8;">6.0</span>, <span style="color: #b5cea8;">1.0</span>,</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #b5cea8;">-</span><span style="color: #b5cea8;">6.0</span>, <span style="color: #b5cea8;">0.0</span>, <span style="color: #b5cea8;">9.0</span>, <span style="color: #b5cea8;">6.0</span>,</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #b5cea8;">-</span><span style="color: #b5cea8;">3.0</span>, <span style="color: #b5cea8;">0.0</span>, <span style="color: #b5cea8;">-</span><span style="color: #b5cea8;">9.0</span>, <span style="color: #b5cea8;">-</span><span style="color: #b5cea8;">4.0</span>]).<span style="color: #dcdcaa;">unwrap</span>();</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span 
style="color: #569cd6;">let</span> <span style="color: #9cdcfe;">inverse_result</span> = <span style="color: #9cdcfe;">a</span>.<span style="color: #dcdcaa;">inverse</span>();</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #c586c0;">match</span> <span style="color: #9cdcfe;">inverse_result</span> {</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #4fc1ff;">Some</span>(<span style="color: #9cdcfe;">inverse</span>) => <span style="color: #569cd6;">println!</span>(<span style="color: #ce9178;">"Inverse result: </span><span style="color: #569cd6;">{</span><span style="color: #569cd6;">:</span><span style="color: #569cd6;">?</span><span style="color: #569cd6;">}</span><span style="color: #ce9178;">"</span>, <span style="color: #9cdcfe;">inverse</span>),</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #4fc1ff;">None</span> => <span style="color: #569cd6;">println!</span>(<span style="color: #ce9178;">"Non invertible matrix"</span>)</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> }</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> }</div><span style="font-family: Droid Sans Mono, monospace, monospace, Droid Sans Fallback; font-size: 14px;"><br /></span></div><p><span style="font-family: Roboto; font-size: medium;">After getting the result of the inverse function, we are using pattern matching to check the actual value received and taking action accordingly. In this way we are guaranteed that the caller of inverse will take care of handling the case when inverse doesn't return any value. 
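</span></p><p><span style="font-family: Roboto; font-size: medium;">For comparison, a rough Java counterpart of this pattern is java.util.Optional. The following is a minimal sketch under stated assumptions rather than a port of the Rust code: the class Matrix2 and its methods are hypothetical, and it handles only 2x2 matrices to stay short. Unlike Rust, Java does not force the caller to handle the empty case, but the return type at least makes the possibility explicit.</span></p>

```java
import java.util.Optional;

public class Matrix2 {
    // Entries of a 2x2 matrix laid out as [a b; c d].
    private final double a, b, c, d;

    public Matrix2(double a, double b, double c, double d) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
    }

    public double det() {
        return a * d - b * c;
    }

    // Returns Optional.empty() instead of null when the matrix is singular.
    public Optional<Matrix2> inverse() {
        double det = det();
        if (det == 0.0) {
            return Optional.empty();
        }
        return Optional.of(new Matrix2(d / det, -b / det, -c / det, a / det));
    }

    @Override
    public String toString() {
        return "[" + a + ", " + b + "; " + c + ", " + d + "]";
    }

    // Mirrors the Rust match: one branch for a value, one for its absence.
    public static String describe(Optional<Matrix2> inverseResult) {
        return inverseResult
                .map(inv -> "Inverse result: " + inv)
                .orElse("Non invertible matrix");
    }

    public static void main(String[] args) {
        System.out.println(describe(new Matrix2(4, 7, 2, 6).inverse()));
        System.out.println(describe(new Matrix2(1, 2, 2, 4).inverse()));
    }
}
```

<p><span style="font-family: Roboto; font-size: medium;">The map and orElse calls play the role of the two match arms: the caller has to go through the Optional to reach the matrix, so the missing-value case is hard to overlook.</span></p><p><span style="font-family: Roboto; font-size: medium;">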
</span></p><p><span style="font-family: Roboto; font-size: medium;">Languages such as Java, which allow returning null directly, have the problem of </span><span style="font-family: Roboto Mono; font-size: medium;">NullPointerException</span><span style="font-family: Roboto; font-size: medium;">, because most of the time we forget to check whether a null value was returned. And sometimes we are too careful and litter the code with null checks all over the place, even in places where there is little chance of getting a null value. For example:</span></p><pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: "JetBrains Mono", monospace; font-size: 11.3pt;">String <span style="color: #9876aa;">managerName </span>= bob.department.manager.name<span style="color: #cc7832;">;</span></pre><p><span style="font-family: Roboto; font-size: medium;">There is a good chance of getting an NPE in this chain. To avoid that we may have to write this monstrosity:</span></p><pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: "JetBrains Mono", monospace; font-size: 11.3pt;"><span style="color: #cc7832;">if</span>(bob != <span style="color: #cc7832;">null</span>) {<br /> Department department = bob.department<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> if</span>(department != <span style="color: #cc7832;">null</span>) {<br /> Employee manager = department.manager<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> if</span>(manager != <span style="color: #cc7832;">null</span>) {<br /> String name = manager.name<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> if</span>(name != <span style="color: #cc7832;">null</span>) {<br /> <span style="color: grey;">//do something<br /></span><span style="color: grey;"> </span>}<br /> }<br /> }<br /> }<br /></pre><p><span style="font-family: Roboto; font-size: medium;">If the language 
provides us with a way to declare when a function can return a null value, not only can we prevent the pest of </span><span style="font-family: Roboto Mono; font-size: medium;">NullPointerExceptions</span><span style="font-family: Roboto; font-size: medium;"> but we can also write clearer code. Java doesn't have any way of preventing </span><span style="font-family: Roboto Mono; font-size: medium;">NullPointerException</span><span style="font-family: Roboto; font-size: medium;"> at the language level, but Java 8 added the </span><span style="font-family: Roboto Mono; font-size: medium;">Optional<T></span><span style="font-family: Roboto; font-size: medium;"> type in the </span><span style="font-family: Roboto Mono; font-size: medium;">java.util</span><span style="font-family: Roboto; font-size: medium;"> package. It is similar in spirit to Rust's </span><span style="font-family: Roboto Mono; font-size: medium;">Option</span><span style="font-family: Roboto; font-size: medium;"> type. Basically we can write methods declared to return an </span><span style="font-family: Roboto Mono; font-size: medium;">Optional<T></span><span style="font-family: Roboto; font-size: medium;"> type, so the caller knows that they have to check whether the function returned some value or not. 
It is still possible to write functions which return null so NPEs are not completely gone but within a team we can agree upon designing APIs which declare to return </span><span style="font-family: Roboto Mono; font-size: medium;">Optional<T></span><span style="font-family: Roboto; font-size: medium;"> type whenever they think there is a possibility of returning null from the function and if a function is declared to return something else we can assume (in good faith) that the function is guaranteed to return a proper result.</span></p><pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: "JetBrains Mono", monospace; font-size: 11.3pt;"><span style="color: #cc7832;">class </span>Scratch {<br /> <span style="color: #cc7832;">public static </span>Optional<Long> <span style="color: #ffc66d;">divide</span>(<span style="color: #cc7832;">long </span>numerator<span style="color: #cc7832;">, long </span>denominator) {<br /> <span style="color: #cc7832;">if </span>(denominator == <span style="color: #6897bb;">0</span>) {<br /> <span style="color: #cc7832;">return </span>Optional.<span style="font-style: italic;">empty</span>()<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>}<br /> <span style="color: #cc7832;">return </span>Optional.<span style="font-style: italic;">of</span>(numerator / denominator)<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>}<br /><br /> <span style="color: #cc7832;">public static void </span><span style="color: #ffc66d;">main</span>(String[] args) {<br /> Optional<Long> result = <span style="font-style: italic;">divide</span>(<span style="color: #6897bb;">100</span><span style="color: #cc7832;">, </span><span style="color: #6897bb;">0</span>)<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>Long quotient = result.orElseThrow(() -><br /> <span style="color: #cc7832;">new </span>IllegalArgumentException(<span style="color: #6a8759;">"Divide 
by 0"</span>))<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>System.<span style="color: #9876aa; font-style: italic;">out</span>.println(<span style="color: #6a8759;">"Result: " </span>+ quotient)<span style="color: #cc7832;">;<br /></span><span style="color: #cc7832;"> </span>}<br />}<br /></pre><p><span style="font-family: Roboto; font-size: medium;">I am not sure if Java is ever going to go as far as to disallow returning </span><span style="font-family: Roboto Mono; font-size: medium;">null</span><span style="font-family: Roboto; font-size: medium;"> from methods but this comes close to getting rid of those pesky NPEs.</span></p><p><span style="font-family: Roboto; font-size: medium;">Optionals have a great many features built in to avoid writing verbose checks. One of them, orElseThrow, is used above: if the result contains a value I get the value, otherwise an exception of my choice is thrown. Similarly, there are other options available, such as orElse, which gives back a default value if the Optional does not contain any value.</span></p><p><br /></p>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-77108476579426311612021-06-12T04:54:00.004-07:002021-06-12T05:22:43.597-07:00Unix File I/O - Beyond read(2) and write(2) - Part I<p><span style="font-family: Roboto; font-size: medium;">The <a href="https://man.netbsd.org/read.2" target="_blank">read(2)</a> and <a href="https://man.netbsd.org/write.2" target="_blank">write(2)</a> system calls are well known for doing file I/O operations. They are usually covered at the very beginning of C or Unix programming courses. But more commonly programmers tend to use the <a href="https://man.netbsd.org/fwrite.3" target="_blank">fread(3)</a>, <a href="https://man.netbsd.org/fwrite.3" target="_blank">fwrite(3)</a>, <a href="https://man.netbsd.org/fprintf.3" target="_blank">fprintf(3)</a> and related functions. 
One of the advantages of using the latter set of functions for file I/O is that they are buffered in user space and thus often offer better performance. The <a href="https://man.netbsd.org/read.2" target="_blank">read(2)</a> and <a href="https://man.netbsd.org/write.2" target="_blank">write(2)</a> system calls, on the other hand, are unbuffered, which means that when you use <a href="https://man.netbsd.org/write.2" target="_blank">write(2)</a> to write some data, it is handed to the kernel right away, unlike <a href="https://man.netbsd.org/fwrite.3" target="_blank">fwrite(3)</a> where it might sit in a buffer inside the process for a while before eventually being written to the file. Not writing data immediately results in better performance because the function can return faster, without making a system call, and the actual writing happens later (e.g. when the buffer gets full, or when the stream is flushed or closed).</span></p><p><span style="font-family: Roboto; font-size: medium;">But there are situations where using unbuffered I/O, such as the <a href="https://man.netbsd.org/read.2" target="_blank">read(2)</a> and <a href="https://man.netbsd.org/write.2" target="_blank">write(2)</a> system calls, makes sense. For example, database systems prefer to have control over their reads and writes and want to make sure that when they write data it is actually written to the disk, and that the data they are reading from the disk is not stale. 
For this reason many databases avoid using buffered I/O functions and stick to the unbuffered versions, while managing the cache themselves so that they have control over the consistency of the data.</span></p><p><span style="font-family: Roboto; font-size: medium;">Using unbuffered I/O for these reasons is fine but the <a href="https://man.netbsd.org/read.2" target="_blank">read(2)</a> and <a href="https://man.netbsd.org/write.2" target="_blank">write(2)</a> system calls are cumbersome to use in complex applications such as databases. Let's see why.</span></p><p><span style="font-family: Roboto; font-size: medium;">Database systems, when reading or writing data, almost always want to access specific offsets in the data file. For instance, when looking up a record, the database will first look up the record in the index file. The index file usually provides the offset of the record in the data file where the actual data for that record is stored. The database system then needs to do a read from that offset in the data file. 
This could translate to something like the following imaginary code</span></p><div style="background-color: #1e1e1e; color: #d4d4d4; line-height: 19px; white-space: pre;"><div><span style="color: #569cd6; font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;">typedef</span><span style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> struct record_metadata {</span></div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #569cd6;">off_t</span> rec_offset;</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #569cd6;">size_t</span> rec_size;</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;">} record_metadata;</div><span style="font-family: Droid Sans Mono, monospace, monospace, Droid Sans Fallback; font-size: 14px;"><br /></span><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;">record_metadata *</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"><span style="color: #dcdcaa;">search_index</span>(<span style="color: #4ec9b0;">db_t</span> *<span style="color: #9cdcfe;">db</span>, <span style="color: #569cd6;">void</span> *<span style="color: #9cdcfe;">key</span>)</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;">{</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #6a9955;">// returns record_metadata object which contains the offset of the record</span></div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #6a9955;">// in the data file 
and the record size</span></div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;">}</div><span style="font-family: Droid Sans Mono, monospace, monospace, Droid Sans Fallback; font-size: 14px;"><br /></span><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; line-height: 19px;"><div><span style="color: #4ec9b0;">record_t</span> *</div><div><span style="color: #dcdcaa;">lookup_record</span>(<span style="color: #4ec9b0;">db_t</span> *<span style="color: #9cdcfe;">db</span>, <span style="color: #569cd6;">void</span> *<span style="color: #9cdcfe;">key</span>)</div><div>{</div><div> record_metadata *record_meta = <span style="color: #dcdcaa;">search_index</span>(db, key);</div><div> <span style="color: #c586c0;">if</span> (record_meta == <span style="color: #569cd6;">NULL</span>) {</div><div> <span style="color: #c586c0;">return</span> <span style="color: #569cd6;">NULL</span>;</div><div> }</div><div> <span style="color: #4ec9b0;">record_t</span> *rec = <span style="color: #dcdcaa;">allocate_record</span>(record_meta); </div><div> <span style="color: #6a9955;">// ignoring error handling for simplicity</span></div><div> <span style="color: #569cd6;">off_t</span> ret = <span style="color: #dcdcaa;">lseek</span>(<span style="color: #9cdcfe;">db</span>-><span style="color: #9cdcfe;">datafd</span>, <span style="color: #9cdcfe;">record_meta</span>-><span style="color: #9cdcfe;">rec_offset</span>, SEEK_SET);</div><div> <span style="color: #c586c0;">if</span> (ret == -<span style="color: #b5cea8;">1</span>) {</div><div> <span style="color: #c586c0;">goto</span> ERROR_HANDLE;</div><div> }</div><div> <span style="color: #569cd6;">ssize_t</span> bytes_read = <span style="color: #dcdcaa;">read</span>(<span style="color: #9cdcfe;">db</span>-><span style="color: #9cdcfe;">datafd</span>, 
<span style="color: #9cdcfe;">rec</span>-><span style="color: #9cdcfe;">data</span>, <span style="color: #9cdcfe;">record_meta</span>-><span style="color: #9cdcfe;">rec_size</span>);</div><div> <span style="color: #c586c0;">if</span> (bytes_read == -<span style="color: #b5cea8;">1</span>) {</div><div> <span style="color: #c586c0;">goto</span> ERROR_HANDLE;</div><div> }</div><div> <span style="color: #dcdcaa;">free</span>(record_meta);</div><div> <span style="color: #c586c0;">return</span> rec;</div><div> ERROR_HANDLE:</div><div> <span style="color: #dcdcaa;">warn</span>(<span style="color: #ce9178;">"record lookup failed"</span>);</div><div> <span style="color: #dcdcaa;">free</span>(record_meta);</div><div> <span style="color: #dcdcaa;">free</span>(rec);</div><div> <span style="color: #c586c0;">return</span> <span style="color: #569cd6;">NULL</span>;</div><div>}</div><br /></div></div></div><p><span style="font-family: Roboto; font-size: medium;">What we see here is that we need to issue two system calls to read from the data file, once to <a href="https://man.netbsd.org/lseek.2" target="_blank">seek</a> to the right offset and then to actually do the read. Similar patterns repeat throughout the system whether reading/writing the index or the data file.</span></p><p><span style="font-family: Roboto; font-size: medium;">Another problem with this arises if the system is multi-threaded. For every open file descriptor in the process, the kernel maintains the value of the current offset for that process in a table. With every read or write for n number of bytes, the kernel increments the current offset by that many bytes, so that the next read or write will happen at that position. Since the threads within the process share the same file descriptors as the main process, they also share the same file offsets. 
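</span></p><p><span style="font-family: Roboto; font-size: medium;">To make the shared offset concrete, here is a minimal sketch (not taken from any real database). It uses a descriptor duplicated with dup(2) as a stand-in for a second thread, since both refer to the same open file and therefore the same offset. The function name offset_after_read_on_dup and the temporary file path are made up for this illustration.</span></p>

```c
#define _XOPEN_SOURCE 700	/* for mkstemp() when compiling in strict ISO C mode */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Writes 8 bytes to a temporary file, reads 3 bytes through the original
 * descriptor, and then asks for the current offset through a dup()ed
 * descriptor. Returns that offset, or -1 if any step fails.
 */
off_t
offset_after_read_on_dup(void)
{
	char path[] = "/tmp/offset_demo_XXXXXX";	/* hypothetical location */
	char buf[3];
	off_t seen = -1;

	int fd = mkstemp(path);
	if (fd == -1)
		return -1;
	unlink(path);	/* the file goes away once both descriptors are closed */

	int fd2 = dup(fd);	/* fd2 shares the file offset with fd */
	if (fd2 != -1 &&
	    write(fd, "abcdefgh", 8) == 8 &&
	    lseek(fd, 0, SEEK_SET) == 0 &&
	    read(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf)) {
		/* Query the offset via the other descriptor without moving it. */
		seen = lseek(fd2, 0, SEEK_CUR);
	}

	if (fd2 != -1)
		close(fd2);
	close(fd);
	return seen;
}
```

<p><span style="font-family: Roboto; font-size: medium;">The function returns 3: the read issued through one descriptor moved the offset that the other descriptor sees. Two threads using the same descriptor are in exactly this position.</span></p><p><span style="font-family: Roboto; font-size: medium;">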
This means that if one thread seeks to a particular offset to read something, another thread may issue its own seek to a different offset before the first thread gets to perform its read, so the first thread ends up reading from the wrong place. This can result in all kinds of chaos and data corruption.</span></p><p><span style="font-family: Roboto; font-size: medium;">To avoid the above two problems, POSIX provides two system calls - <a href="https://man.netbsd.org/read.2" target="_blank">pread(2)</a> and <a href="https://man.netbsd.org/write.2" target="_blank">pwrite(2)</a>. Their signatures and behavior are very similar to <a href="https://man.netbsd.org/read.2" target="_blank">read(2)</a> and <a href="https://man.netbsd.org/write.2" target="_blank">write(2)</a> but with one important difference. The following is their synopsis:</span></p><p><br /></p><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="color: #c586c0;">#include</span><span style="color: #569cd6;"> </span><span style="color: #ce9178;"><unistd.h></span></div><br /><div><span style="color: #569cd6;">ssize_t</span> <span style="color: #dcdcaa;">pread</span>(<span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">fd</span>, <span style="color: #569cd6;">void</span> *<span style="color: #9cdcfe;">buf</span>, <span style="color: #569cd6;">size_t</span> <span style="color: #9cdcfe;">count</span>, <span style="color: #569cd6;">off_t</span> <span style="color: #9cdcfe;">offset</span>);</div><br /><div><span style="color: #569cd6;">ssize_t</span> <span style="color: #dcdcaa;">pwrite</span>(<span style="color: #569cd6;">int</span> <span style="color: #9cdcfe;">fd</span>, <span style="color: #569cd6;">const</span> <span style="color: #569cd6;">void</span> *<span style="color: #9cdcfe;">buf</span>, <span style="color: #569cd6;">size_t</span> <span style="color: #9cdcfe;">count</span>, <span style="color: 
#569cd6;">off_t</span> <span style="color: #9cdcfe;">offset</span>);</div><br /></div><p><span style="font-family: Roboto; font-size: medium;">The difference is that these system calls have an additional fourth argument, called offset. These system calls explicitly ask for the offset from which to start reading or writing the data, thus avoiding the need to manually first seek to that offset. This not only makes the code simpler and saves one system call, but it also means the calls neither depend on nor change the shared file offset, which makes reads and writes from multiple threads safe with respect to that offset. Using these, the above sample code will change to something like this:</span></p><div style="background-color: #1e1e1e; color: #d4d4d4; line-height: 19px; white-space: pre;"><div><span style="color: #dcdcaa; font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;">record_t *</span></div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"><span style="color: #dcdcaa;">lookup_record</span>(<span style="color: #4ec9b0;">db_t</span> *<span style="color: #9cdcfe;">db</span>, <span style="color: #569cd6;">void</span> *<span style="color: #9cdcfe;">key</span>)</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;">{</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> record_metadata *record_meta = <span style="color: #dcdcaa;">search_index</span>(db, key);</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #c586c0;">if</span> (record_meta == <span style="color: #569cd6;">NULL</span>) {</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #c586c0;">return</span> <span style="color: #569cd6;">NULL</span>;</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 
14px;"> }</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #4ec9b0;">record_t</span> *rec = <span style="color: #dcdcaa;">allocate_record</span>(record_meta); </div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #569cd6;">ssize_t</span> bytes_read = <span style="color: #dcdcaa;">pread</span>(<span style="color: #9cdcfe;">db</span>-><span style="color: #9cdcfe;">datafd</span>, <span style="color: #9cdcfe;">rec</span>-><span style="color: #9cdcfe;">data</span>, <span style="color: #9cdcfe;">record_meta</span>-><span style="color: #9cdcfe;">rec_size</span>,</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"><span style="color: #9cdcfe;"> record_meta</span>-><span style="color: #9cdcfe;">rec_offset</span>);</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #c586c0;">if</span> (bytes_read == -<span style="color: #b5cea8;">1</span>) {</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #c586c0;">goto</span> ERROR_HANDLE;</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> }</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #dcdcaa;">free</span>(record_meta);</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #c586c0;">return</span> rec;</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> ERROR_HANDLE:</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 
14px;"> <span style="color: #dcdcaa;">warn</span>(<span style="color: #ce9178;">"record lookup failed"</span>);</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #dcdcaa;">free</span>(record_meta);</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #dcdcaa;">free</span>(rec);</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;"> <span style="color: #c586c0;">return</span> <span style="color: #569cd6;">NULL</span>;</div><div style="font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px;">}</div><span style="font-family: Droid Sans Mono, monospace, monospace, Droid Sans Fallback; font-size: 14px;"><br /></span></div><p><span style="font-family: Roboto; font-size: medium;">Apart from <a href="https://man.netbsd.org/read.2" target="_blank">pread(2)</a> and <a href="https://man.netbsd.org/write.2" target="_blank">pwrite(2)</a>, there are two more exciting (😬) system calls, <a href="https://man.netbsd.org/read.2">readv(2)</a> and <a href="https://man.netbsd.org/write.2" target="_blank">writev(2)</a>, which provide ways to do vectored (scatter/gather) I/O. Database systems tend to exploit these as well; I will talk about them with specific examples in another post. Stay tuned! ⏳</span></p>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-64355752407484696262021-05-15T07:10:00.010-07:002021-05-16T00:22:00.763-07:00Using Queueing Theory to Simulate Server Load<h2 style="text-align: left;"><span style="font-family: Roboto;">Introduction</span></h2><p><span style="font-family: Roboto;">One of the projects that we've been working on at work is about multivariate time series modelling. 
We've been evaluating a particular model on a variety of real and synthetically generated datasets and have been trying to figure out the strengths and weaknesses of the current model. Real and clean datasets are hard to come by and may not provide all the scenarios that you want to test your model on. We have been thinking about a variety of scenarios and how the model would react to them. Of course, you can write some ad-hoc code to introduce those scenarios in your existing dataset but it doesn't scale. For every different scenario you need to write a bunch of complex code, which is time consuming and error prone.</span></p><h2 style="text-align: left;"><span style="font-family: Roboto;">Queueing Theory</span></h2><p><span style="font-family: Roboto;">This led me to look into queueing theory, which is commonly used to model server load for tasks such as capacity planning and estimating resource requirements, as well as simulation. Queueing theory is a branch of study designed to model systems involving queues, scarce resources and delays. It can help answer questions such as, "if from tomorrow the number of job requests to the system doubles, how much more CPU do we need so that the average response latency remains the same?". Queueing theory predates computer science and is a part of Operations Research, where it has been employed to study the resource requirements for providing a service. It was first developed and used in 1909 by Agner Krarup Erlang, a Danish engineer, to model the number of telephone calls arriving at an exchange. It has also been used to model the flow of packets in packet switching networks during the development of TCP/IP in the 70s. </span></p><h2 style="text-align: left;"><span style="font-family: Roboto;">Queueing Theory by Example</span></h2><p><span style="font-family: Roboto;">Let's say we have a system consisting of a single server with a single CPU and a practically unbounded queue size. 
There is a random process which generates jobs at some fixed average rate; the jobs come and sit in the queue on the server. The jobs are served on a first come, first served basis and each job has some resource requirement in order to be serviced. Let's also assume that the server is capable of serving n jobs per second.</span></p><p><span style="font-family: Roboto;">Concretely, we can model the job generating process as a Poisson process with rate <b>λ</b>, i.e. on average there are <b>λ</b> jobs per second. Similarly we can think of the resource requirement of a job as a probability distribution. If all jobs are homogeneous (i.e. require a similar amount of work), then we could model it as a normal distribution with parameters (𝛍, 𝛔).</span></p><p><span style="font-family: Roboto;">Once we have this basic structure in place, we can answer more complex questions. For example, if the number of jobs coming to the server doubles (while the resource requirement of each job stays the same as before), how much more CPU do we need to keep the average response latency the same? Or if the resource requirement of the jobs has doubled, what will be the average response latency of the server?</span></p><h2 style="text-align: left;"><span style="font-family: Roboto;">Simulating Server Load using Queueing Theory</span></h2><p><span style="font-family: Roboto;">Now let's talk about how we can utilize queueing theory to model a server and simulate data using that model. The goal is to model the resource usage on a server hosting a microservice. The server receives requests over the network at some rate and each request requires some amount of resources in order to be served, e.g. 
CPU, memory, network, disk etc.</span></p><h3 style="text-align: left;"><span style="font-family: Roboto;">Modelling the Requests</span></h3><p><span style="font-family: Roboto;">It's possible to model the requests as a random variable following some probability distribution (e.g. exponential or Poisson) but in my case I decided to use an existing time series to represent the request rate. The time series represented the number of requests to a popular ride sharing app. I decided to use it because it was real world data with nice weekly and daily seasonality patterns and reflected the kind of data we wanted to simulate. For those unfamiliar with time-series data, the data consists of observations made at fixed time steps. In this case, the request data consisted of the number of requests every minute.</span></p><h3 style="text-align: left;"><span style="font-family: Roboto;">Not All Requests are the Same:</span></h3><p><span style="font-family: Roboto;">We can definitely model the resource requirement of all the requests as being identical but in real world situations things are not that simple. For example, for a database, some queries might be simple one row index lookups which can be served pretty fast (especially if the data is sitting in cache). But there may be other queries which require reading a large number of rows, and then there might be others which require reading a large number of rows without using any index or primary key, thus resulting in a slow response. We decided to model this kind of behavior in our simulation. In order to do this, we split the requests into 6 categories, with each request belonging to exactly one of them. We defined a categorical (or multinomial) distribution to decide which category a request falls into. 
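</span></p><p><span style="font-family: Roboto;">Sampling from such a distribution takes only a few lines. The following is a minimal sketch rather than the code we actually used; the function name sample_category, and the choice of passing in the uniform random number u so that the function stays deterministic and easy to test, are assumptions made for illustration.</span></p>

```c
#include <stddef.h>

/*
 * Maps a uniform random number u in [0, 1) to a category index in
 * [0, n) by walking the cumulative sum of probs.
 * probs is assumed to hold n non-negative values that sum to 1.
 */
size_t
sample_category(const double *probs, size_t n, double u)
{
	double cumulative = 0.0;

	for (size_t i = 0; i < n; i++) {
		cumulative += probs[i];
		if (u < cumulative)
			return i;
	}
	/* Guard against rounding error when u is very close to 1. */
	return n - 1;
}
```

<p><span style="font-family: Roboto;">At simulation time u would come from a uniform random number generator, for instance drand48() on POSIX systems, and the returned index (0 to n-1) identifies the category of the request.</span></p><p><span style="font-family: Roboto;">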
For example</span></p><p style="text-align: center;"><span style="background-color: #eeeeee; font-family: Roboto Mono;">request_job_sizes ~ Categorical(n=6, probs=[1/6, 1/6, 1/6, 1/10, 1/10, 3/10])</span></p><p style="text-align: left;"><span style="font-family: Roboto;">This means each request has a probability of 1/6 of being in each of the first 3 categories, 0.1 of being in each of categories 4 and 5, and 0.3 of being in category 6. At simulation time we can sample from this distribution to decide the request job category.</span></p><p><span style="font-family: Roboto;">We can also assume that the resource requirements increase with the category, i.e. a category 1 request requires the minimum amount of resources, whereas a category 6 request requires the maximum amount of resources.</span></p><p><span style="font-family: Roboto;">To model a real world business service, we probably want to include the time of the day as a factor as well. For example, during the day there might be more requests of a certain size as compared to others, and during the night the behavior changes. Similar differences can be modelled for weekdays vs weekends as well. As an example we could define the above distribution for the business hours from 9 in the morning to 9 in the evening. And for the night hours we could define a different distribution such as </span><span style="font-family: Roboto Mono;"><span style="background-color: #eeeeee;">probs = [1/3, 1/3, 1/10, 1/5, 0, 0]</span> </span><span style="font-family: Roboto;">(rescaled so that the values sum to 1), which basically means that during night hours the larger requests stop coming and small requests grow in number.</span></p><p><span style="font-family: Roboto;">Another possibility is to model requests as a Markov chain. This is useful, for example, if we are modelling an application or business workflow in which the user follows a certain set of steps. For instance, in an e-commerce application once you log in, there is a high probability of doing a search. 
Similarly after you click on checkout there is a high probability of going to the payments page, rather than going back to search for another item. Such behavior can also be modelled. To model this, we can say that our system consists of 3 (as an example) possible states, and that at the beginning the system starts up in </span><span style="font-family: Roboto Mono;">state-1</span><span style="font-family: Roboto;">. We can define a state transition probability distribution like this, where P(state-i|state-j) is read as the probability of moving from state-i to state-j, so that the three probabilities for each starting state sum to 1:</span></p><p style="text-align: center;"><span style="background-color: #eeeeee; color: #444444; font-family: Roboto Mono;">P(state-1|state-1) = 0.5</span></p><p style="text-align: center;"><span style="background-color: #eeeeee; color: #444444; font-family: Roboto Mono;">P(state-1|state-2) = 0.3</span></p><p style="text-align: center;"><span style="background-color: #eeeeee; color: #444444; font-family: Roboto Mono;">P(state-1|state-3) = 0.2</span></p><p style="text-align: center;"><span style="background-color: #eeeeee; color: #444444; font-family: Roboto Mono;">P(state-2|state-1) = 0.1</span></p><p style="text-align: center;"><span style="background-color: #eeeeee; color: #444444; font-family: Roboto Mono;">P(state-2|state-2) = 0.5</span></p><p style="text-align: center;"><span style="background-color: #eeeeee; color: #444444; font-family: Roboto Mono;">P(state-2|state-3) = 0.4</span></p><p style="text-align: center;"><span style="background-color: #eeeeee; color: #444444; font-family: Roboto Mono;">P(state-3|state-1) = 0.8</span></p><p style="text-align: center;"><span style="background-color: #eeeeee; color: #444444; font-family: Roboto Mono;">P(state-3|state-2) = 0.0</span></p><p style="text-align: center;"><span style="background-color: #eeeeee; color: #444444; font-family: Roboto Mono;">P(state-3|state-3) = 0.2</span></p><p style="text-align: left;"><span style="font-family: Roboto;">And then we could define the categorical distribution of request sizes for each of the 
states.</span></p><h3 style="text-align: left;"><span style="font-family: Roboto;">Modelling Resource Requirements for Request Jobs</span></h3><p><span style="font-family: Roboto;">For each of the 6 request job categories, we can then define the resource requirements.</span></p><h4 style="text-align: left;"><span style="font-family: Roboto;">Modelling CPU:</span></h4><p><span style="font-family: Roboto;">We need to define the CPU requirement for each request type. One possible formulation is this:</span></p><p style="text-align: center;"><span style="background-color: #eeeeee; font-family: Roboto Mono;">cpu | request_category_1 ~ Normal(𝛍=10000, 𝛔=1000)</span></p><p><span style="font-family: Roboto;">This means that the CPU requirement for request category 1 is normally distributed with the given mean and standard deviation. The CPU requirement here is defined in terms of the number of CPU cycles required to service the job. If we know the maximum number of CPU cycles per second, we can convert this into a percentage of CPU by dividing the required cycles by that maximum. We can similarly define the CPU model for the other 5 request job categories.</span></p><p><span style="font-family: Roboto;">Apart from modelling the CPU load caused by the requests, we may also want to factor in some constant noise from background processes and daemons running on the server, which cause some load at all times.</span></p><h4 style="text-align: left;"><span style="font-family: Roboto;">Modelling Memory:</span></h4><p><span style="font-family: Roboto;">We can assume that handling each request adds some pressure on the memory utilization of the server. Similar to CPU, we can define distributions for memory usage for each of the request types. 
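</span></p><p><span style="font-family: Roboto;">As a rough illustration of the category and CPU formulation above, here is a minimal Python sketch that samples a job category and then a CPU requirement for it. Only the category probabilities (with 0.3 for category 6) and the Normal(10000, 1000) parameters for category 1 come from this post. The parameters for categories 2 to 6, the 2 GHz clock used for the percentage conversion, and the names sample_job and cpu_percent are assumptions made up for the example.</span></p>

```python
import random

# Category probabilities from the text: 1/6 for categories 1-3,
# 0.1 for categories 4-5 and 0.3 for category 6.
CATEGORY_PROBS = [1/6, 1/6, 1/6, 1/10, 1/10, 3/10]

# (mean, std) of the CPU cycles needed per category. Only the first pair
# comes from the post; the other five are hypothetical placeholders.
CPU_PARAMS = [
    (10_000, 1_000),
    (20_000, 2_000),
    (40_000, 4_000),
    (80_000, 8_000),
    (160_000, 16_000),
    (320_000, 32_000),
]

MAX_CYCLES_PER_SEC = 2_000_000_000  # assumed 2 GHz CPU


def sample_job(rng):
    """Return (category, cpu_cycles) for one request; category is 1-based."""
    idx = rng.choices(range(len(CATEGORY_PROBS)), weights=CATEGORY_PROBS)[0]
    mean, std = CPU_PARAMS[idx]
    # A normal draw can in principle be negative, so clip it at zero.
    cycles = max(0.0, rng.gauss(mean, std))
    return idx + 1, cycles


def cpu_percent(cycles, max_cycles=MAX_CYCLES_PER_SEC):
    """Express a number of CPU cycles as a percentage of one CPU-second."""
    return 100.0 * cycles / max_cycles


rng = random.Random(0)  # fixed seed so that runs are repeatable
for _ in range(3):
    category, cycles = sample_job(rng)
    print(category, round(cycles), cpu_percent(cycles))
```

<p><span style="font-family: Roboto;">The same pattern would apply to any other per-request resource: one table of distribution parameters per resource, indexed by the sampled category.</span></p><p><span style="font-family: Roboto;">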
For the sake of brevity I will skip the memory distributions here, but they can be modelled using any of the common distributions, such as normal, uniform, Student's t, etc.</span></p><h4 style="text-align: left;"><span style="font-family: Roboto;">Modelling Network:</span></h4><p><span style="font-family: Roboto;">Since the requests arrive at the server over the network, each incoming request causes some inbound traffic, and the response generated by the server causes some outbound traffic. These can also be modelled similarly to the CPU, using distributions of our choice. The interesting part of modelling the network is modelling the background noise, i.e. the traffic which is present at all times because of background processes, daemons and things like health checks (or pings). Usually this is a very small amount of traffic compared to the traffic caused by the actual service requests. It can be modelled well using a long-tailed distribution such as the Pareto or Weibull distribution. These distributions are commonly used for modelling data where the majority of values are small but there are large disruptions once in a while.</span></p><h4 style="text-align: left;"><span style="font-family: Roboto;">Modelling Disk:</span></h4><p><span style="font-family: Roboto;">Disk activity caused by the requests can also be modelled similarly to the other resources. In our case, however, we decided to model the requests such that they don't cause any disk activity on the server, so the disk is independent of the requests being served. Instead, we decided to model disk activity caused by a daemon process that runs on the server at regular intervals and performs disk reads and writes.</span></p><p><span style="font-family: Roboto;">Modelling disk activity is slightly different from modelling CPU or memory. 
Disks are very slow to read and write, and the access patterns can result in variable latencies. For example, in spinning magnetic disks, data residing on one of the outer tracks can be read faster than data residing on an inner track. Also, the disk controller reads ahead and caches nearby data in addition to the requested data, so that if a request for nearby data comes next it can be served from the cache. The operating system also caches disk data in its page cache and can service future reads much faster. We may want to model these behaviors in our simulation.</span></p><p><span style="font-family: Roboto;">In our case, I decided to model the variation in disk latency due to the sector location (outer track vs inner track) using a Bernoulli distribution with a 50% chance of reading from an outer track. Depending on whether we are reading from an outer or an inner track, we can define an effective speed for the disk, for example 7200 RPM for the outer tracks and 5000 RPM for the inner tracks (the platter itself spins at a constant rate, so the lower figure is just a stand-in for the lower throughput of the inner tracks). 
We can model the number of bytes read/written as a normal distribution, and based on the disk speed we can derive the latency of reads and writes.</span></p><p><span style="background-color: #eeeeee; font-family: Roboto Mono;">Disk read ~ Normal(mean=10000 bytes, std=500 bytes)</span></p><p><span style="background-color: #eeeeee; font-family: Roboto Mono;">Disk write ~ Normal(mean=4000 bytes, std=200 bytes)</span></p><p><span style="background-color: #eeeeee; font-family: Roboto Mono;">P(outer track) = 0.5</span></p><p><span style="background-color: #eeeeee; font-family: Roboto Mono;">P(inner track) = 0.5</span></p><p><span style="font-family: Roboto;"><br /></span></p><h3 style="text-align: left;"><span style="font-family: Roboto;">Simulating the Data:</span></h3><p><span style="font-family: Roboto;">Let's first take a look at what the request data looks like. Rather than simulating it, we took it from a real world time series dataset. It serves as the base for simulating the server resource utilization load.</span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlwdEnqqlfRWtzpTUlE8wsu9KUQNvYlS9g6Mhz17gwse7gpc5-kkXuECLz7caQ0YRP2HfsbfhpvxzEubxdFjJJOtIHTCKeD9oGE2OGWBSWARKbqMBsI54CtdflaASdTfFsMFyIp2hzdw/s1920/Figure_1.png" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Roboto;"><img border="0" data-original-height="974" data-original-width="1920" height="323" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlwdEnqqlfRWtzpTUlE8wsu9KUQNvYlS9g6Mhz17gwse7gpc5-kkXuECLz7caQ0YRP2HfsbfhpvxzEubxdFjJJOtIHTCKeD9oGE2OGWBSWARKbqMBsI54CtdflaASdTfFsMFyIp2hzdw/w640-h323/Figure_1.png" width="640" /></span></a></div><span style="font-family: Roboto;"><br /></span><p><span style="font-family: Roboto;">This data represents the number of requests coming to our server at each timestamp. 
In order to simulate the server resources utilized, we follow the process below:</span></p><p><span style="font-family: Roboto;"><br /></span></p><h4 style="text-align: left;"><span style="font-family: Roboto;">Define limits of the system:</span></h4><p><span style="font-family: Roboto;">First we define the maximum resources available on the system and then what part of them can actually be utilized to service these requests. Usually in a real world deployment we don't let 100% of the resources be consumed. We first define the maximum capacity, such as:</span></p><p><span style="font-family: Roboto;">Number of CPU cycles per second: for example, for a 2 GHz system it is 2,000,000,000 cycles per second</span></p><p><span style="font-family: Roboto;">Maximum RAM: e.g. 32 GB</span></p><p><span style="font-family: Roboto;">Maximum Network Bandwidth: e.g. 1 Gbps</span></p><p><span style="font-family: Roboto;">Max disk speed: e.g. 7200 RPM for spinning disks, or 4 MBps for SSD</span></p><p><span style="font-family: Roboto;">Then we define the maximum allowed usage of these resources by the service. For example, we could say that only 80% of the CPU and memory can be utilized by the service.</span></p><h4 style="text-align: left;"><span style="font-family: Roboto;">Actual Simulation:</span></h4><p><span style="font-family: Roboto;">At every timestamp we read the number of requests <b>k</b>. For each of the <b>k</b> requests, we determine its job category by sampling from the categorical distributions we defined in our model.</span></p><p><span style="font-family: Roboto;">Once we know the job category for each of the <b>k</b> requests, we start to simulate the resource utilization caused by each of them. We do this by sampling from the respective probability distributions we defined for each of the resources. </span></p><p><span style="font-family: Roboto;">Each job coming to the server sits in a queue. 
The simulator picks them up and simulates their load. The simulator needs to make sure that the load on the system at no point exceeds the maximum allowed load. If that would happen, the remaining jobs in the queue need to wait till the next timestep, at which point some of the currently running jobs may have finished, making space for new ones.</span></p><p><span style="font-family: Roboto;">Let's take an example and simulate just the CPU load. At the beginning of the simulation the queue is empty and no CPU is being utilized. Now we get <b>k</b> requests with varying job sizes. We go through each of the <b>k</b> jobs and, depending on its size, sample from the corresponding CPU distribution to determine the number of CPU cycles required to service the job, and reduce the available CPU cycles for the current timestamp by that amount. We do this for each of the <b>k</b> jobs. </span></p><p><span style="font-family: Roboto;">If all of the <b>k</b> jobs were satisfied within 80% of the CPU, the queue would be empty at the next timestamp because the CPU would have been able to serve all the requests. On the other hand, if the CPU load reached 80% after just the first <b>m</b> jobs, then at the beginning of the next timestamp we will have <b>k-m</b> jobs waiting in the queue, and we have to give them preference (under a first come, first served policy) before we take on any of the new requests. </span></p><p><span style="font-family: Roboto;">We should note that this is a simplified scenario where we are simulating just the CPU. If we were simulating CPU and memory together, things could get more interesting. For example, even though there might be CPU cycles available to serve more requests, memory could be saturated and we would have to make the rest of the jobs wait.</span></p><p><span style="font-family: Roboto;">We can follow this process for all the system resources and simultaneously simulate an actual server. 
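</span></p><p><span style="font-family: Roboto;">The CPU-only walk-through above can be sketched roughly as follows. This is a simplified illustration and not the simulator's actual code: the name simulate_cpu, the one-second timestamp, and the assumptions that an admitted job finishes within the timestamp in which it was admitted and that every job fits within one timestamp's budget are all mine.</span></p>

```python
from collections import deque

MAX_CYCLES_PER_SEC = 2_000_000_000  # 2 GHz example, one timestamp = 1 second
MAX_UTILIZATION_PCT = 80            # only 80% of the CPU may be used


def simulate_cpu(arrivals):
    """Simulate first come, first served CPU scheduling.

    arrivals: one list per timestamp with the CPU cycles each newly arrived
    job needs (already sampled from the per-category CPU distributions).
    Returns one (cpu_utilization, jobs_left_waiting) pair per timestamp.
    """
    budget = MAX_CYCLES_PER_SEC * MAX_UTILIZATION_PCT // 100
    queue = deque()
    history = []
    for new_jobs in arrivals:
        # New jobs line up behind the ones still waiting from earlier.
        queue.extend(new_jobs)
        used = 0
        # Admit jobs in arrival order until the next one no longer fits.
        # Assumption: every job fits in one timestamp's budget; a larger
        # job would block the queue forever in this sketch.
        while queue and budget - used >= queue[0]:
            used += queue.popleft()
        history.append((used / MAX_CYCLES_PER_SEC, len(queue)))
    return history


# Three timestamps: three jobs arrive, then one more, then none.
arrivals = [[1_000_000_000, 500_000_000, 400_000_000], [300_000_000], []]
for t, (utilization, waiting) in enumerate(simulate_cpu(arrivals)):
    print(t, utilization, waiting)
```

<p><span style="font-family: Roboto;">In this example the third job of the first timestamp does not fit into the 80% budget, so it waits and is served ahead of the job that arrives at the second timestamp. Adding memory would mean tracking a second budget and admitting a job only when it fits within both.</span></p><p><span style="font-family: Roboto;">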
The complicated part is simulating the response latency, because it is not an actual system resource but a function of the collective resource availability of the system, and we need to factor in the time a job spends waiting in the queue. For that we need to define a clock in the simulator, so that we know how many clock ticks have passed between the arrival of a job and the moment it got serviced.</span></p><p><span style="font-family: Roboto;">This is a sample of the simulated data:</span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOA3Mbfxp1jkunrPGzv9F7EByzYDxBxFq4oK4QnSTnGPKfA0uC2PaaY-ldxgdbQrIkX84n6RlMPYFH3Ke2iMJ364GV2VAG4cDfCs9XVlLlTN3uNH1PbBTS8uPB8uP2A6Ps-GujkHYUSw/s1920/sim1.png" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Roboto;"><img border="0" data-original-height="974" data-original-width="1920" height="324" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOA3Mbfxp1jkunrPGzv9F7EByzYDxBxFq4oK4QnSTnGPKfA0uC2PaaY-ldxgdbQrIkX84n6RlMPYFH3Ke2iMJ364GV2VAG4cDfCs9XVlLlTN3uNH1PbBTS8uPB8uP2A6Ps-GujkHYUSw/w640-h324/sim1.png" width="640" /></span></a></div><span style="font-family: Roboto;">We came up with a generic config format to make running simulations easier. The config used for the above data is provided below. 
There is a potential to simplify this by moving to a custom DSL based config which will be simpler at the same time giving more expressive power in defining the behavior of the simulation.</span><div><span style="font-family: Roboto;"><br /></span></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-size: 14px; line-height: 19px; white-space: pre;"><div><span style="font-family: Roboto;">{</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"job_metrics"</span>: [</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"http_requests"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"dist_type"</span>: <span style="color: #ce9178;">"categorical"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"num_categories"</span>: <span style="color: #b5cea8;">10</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"metric_type"</span>: <span style="color: #ce9178;">"job"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"source"</span>: <span style="color: #ce9178;">"uber-aprsep-14.csv"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"categorical_probs"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"comment"</span>: <span style="color: #ce9178;">"using different probabilities during day vs night hours and weekdays and weekends"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"day_of_weeks"</span>:</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"0"</span>: {</span></div><div><span style="font-family: Roboto;"> <span 
style="color: #9cdcfe;">"9"</span>: <span style="color: #ce9178;">"[1/20., 1/20., 1/20., 1/20., 1/10., 1/10., 1/10., 1/10., 2/10., 2/10.]"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"21"</span>: <span style="color: #ce9178;">"[1/10.] * 10"</span></span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"5"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"0"</span>: <span style="color: #ce9178;">"[1/6.] * 6 + [0.0] * 4"</span></span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"affected_resources"</span>: [</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"cpu"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"uniform"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"values_type"</span>: <span style="color: #ce9178;">"absolute"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>: [</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"low"</span>: <span style="color: #ce9178;">"10 ** 6"</span>, <span style="color: #9cdcfe;">"high"</span>: <span style="color: #ce9178;">"5 * 10 ** 6"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"low"</span>: <span style="color: 
#ce9178;">"5 * 10 ** 6"</span>, <span style="color: #9cdcfe;">"high"</span>: <span style="color: #ce9178;">"8 * 10 ** 6"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"low"</span>: <span style="color: #ce9178;">"8 * 10 ** 6"</span>, <span style="color: #9cdcfe;">"high"</span>: <span style="color: #ce9178;">"10 ** 7"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"low"</span>: <span style="color: #ce9178;">"10 ** 7"</span>, <span style="color: #9cdcfe;">"high"</span>: <span style="color: #ce9178;">"5 * 10 ** 7"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"low"</span>: <span style="color: #ce9178;">"5 * 10 ** 7"</span>, <span style="color: #9cdcfe;">"high"</span>: <span style="color: #ce9178;">"10 ** 8"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"low"</span>: <span style="color: #ce9178;">"10 ** 8"</span>, <span style="color: #9cdcfe;">"high"</span>: <span style="color: #ce9178;">"5 * 10 ** 8"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"low"</span>: <span style="color: #ce9178;">"5 * 10 ** 8"</span>, <span style="color: #9cdcfe;">"high"</span>: <span style="color: #ce9178;">"8 * 10 ** 8"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"low"</span>: <span style="color: #ce9178;">"8 * 10 ** 8"</span>, <span style="color: #9cdcfe;">"high"</span>: <span style="color: #ce9178;">"10 ** 9"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"low"</span>: <span style="color: #ce9178;">"10 ** 9"</span>, <span style="color: #9cdcfe;">"high"</span>: <span style="color: #ce9178;">"5 * 10 ** 9"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"low"</span>: <span style="color: #ce9178;">"5 * 10 ** 9"</span>, <span 
style="color: #9cdcfe;">"high"</span>: <span style="color: #ce9178;">"10 ** 10"</span>}</span></div><div><span style="font-family: Roboto;"> ]</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"network_rx_bytes"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"normal"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"values_type"</span>: <span style="color: #ce9178;">"absolute"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>: [</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"10 ** 3"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"100"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"2 * 10 ** 4"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"5000"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"5.0 * 10 ** 5"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"10000"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"8 * 10 ** 5"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"10000"</span>},</span></div><div><span 
style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"10 ** 6"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"10 ** 4"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"5 * 10 ** 6"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"50000"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"10 ** 7"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"50000"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"5 * 10 ** 7"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"30000"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"10 ** 8"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"10 ** 6"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"10 ** 9"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"10 ** 7"</span>}</span></div><div><span style="font-family: Roboto;"> ]</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"network_tx_bytes"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: 
#9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"lognormal"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"values_type"</span>: <span style="color: #ce9178;">"absolute"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>:[</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"1"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.5"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"2"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.6"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"2.5"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.8"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"3"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"4"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1.3"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"4.5"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1.5"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"5"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"2"</span>},</span></div><div><span 
style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"6"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"2.5"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"7"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"3"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"8"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"3.5"</span>}</span></div><div><span style="font-family: Roboto;"> ]</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"memory_used_bytes"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"normal"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"values_type"</span>: <span style="color: #ce9178;">"absolute"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>:[</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"1e3"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1e2"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"5 * 1e3"</span>, <span style="color: 
#9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1e3"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"1e4"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"5 * 1e3"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"5 * 1e4"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"8 * 1e3"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"1e5"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1e4"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"5 * 1e5"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1e4"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"1e6"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"5 * 1e4"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"5 * 1e6"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1e5"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"1e7"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"5 * 1e5"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"5e7"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1e6"</span>}</span></div><div><span 
style="font-family: Roboto;"> ]</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> ]</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"disk_jobs"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"dist_type"</span>: <span style="color: #ce9178;">"categorical"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"num_categories"</span>: <span style="color: #b5cea8;">9</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"source"</span>: <span style="color: #ce9178;">"sim_disk_jobs.csv"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"metric_type"</span>: <span style="color: #ce9178;">"resource"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"categorical_probs"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"comment"</span>: <span style="color: #ce9178;">"we model this as having 9 kind of jobs, representing small, medium, large and read/write/read+write categories"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"day_of_weeks"</span>:{</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"0"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"0"</span>: <span style="color: #ce9178;">"[0.33, 0.33, 0.33, 0, 0, 0, 0, 0, 0]"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"6"</span>: <span style="color: #ce9178;">"[0.33, 0.33, 0.33, 0, 0, 0, 0, 0, 
0]"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"18"</span>: <span style="color: #ce9178;">"[0.0, 0.0, 0.0, 0.3, 0.3, 0.3, 0.025, 0.025, 0.05]"</span></span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"affected_resources"</span>: [</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"cpu"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"normal"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"values_type"</span>: <span style="color: #ce9178;">"absolute"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>: [</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0.08 * 2 * 10 ** 9"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.02 * 2 * 10 ** 9"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0.1 * 2 * 10 ** 9"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.03 * 2 * 10 ** 9"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0.4 * 2 * 10 ** 9"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.1 * 2 * 10 ** 
9"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0.08 * 2 * 10 ** 9 * 10"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.01 * 2 * 10 ** 9 * 10"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0.2 * 2 * 10 ** 9 * 10"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.02 * 2 * 10 ** 9 * 10"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0.3 * 2 * 10 ** 9 * 10"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.02 * 2 * 10 ** 9 * 10"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0.1 * 2 * 10 ** 8 "</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.06 * 2 * 10 ** 8"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0.4 * 2 * 10 ** 8 "</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.1 * 2 * 10 ** 8"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0.5 * 2 * 10 ** 8 "</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.1 * 2 * 10 ** 8"</span>}</span></div><div><span style="font-family: Roboto;"> ]</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: 
#ce9178;">"disk_read_bytes"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"exponential"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"values_type"</span>: <span style="color: #ce9178;">"absolute"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>: [</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"100.0"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"200.0"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"500.0"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"200.0"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"400.0"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"600.0"</span>}</span></div><div><span style="font-family: Roboto;"> ]</span></div><div><span style="font-family: 
Roboto;"> }</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"disk_write_bytes"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"normal"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"values_type"</span>: <span style="color: #ce9178;">"absolute"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>: [</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"1e5"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"2e3"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"1e6"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1e4"</span>},</span></div><div><span 
style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"5e6"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1e4"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"100"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"10"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"1e3"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1e2"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"1e4"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1e3"</span>}</span></div><div><span style="font-family: Roboto;"> ]</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"memory_used_bytes"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"exponential"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"values_type"</span>: <span style="color: #ce9178;">"absolute"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>: [</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: 
#ce9178;">"100"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"500"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1000"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"10"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"20"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"30"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"150"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"600"</span>},</span></div><div><span style="font-family: Roboto;"> {<span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"1000"</span>}</span></div><div><span style="font-family: Roboto;"> ]</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> ]</span></div><div><span style="font-family: Roboto;"> }</span></div><div><span style="font-family: Roboto;"> ],</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"resource_metrics"</span>: [</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"cpu"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"baseline_distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> 
<span style="color: #9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"normal"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>: [{<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"0.08"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"0.02"</span>}]</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"baseline_usage_type"</span>: <span style="color: #ce9178;">"percent"</span></span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"disk_read_bytes"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"baseline_distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"normal"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>: [{<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"10 ** 2"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"20"</span>}]</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"baseline_usage_type"</span>: <span style="color: #ce9178;">"absolute"</span></span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"network_rx_bytes"</span></span></div><div><span style="font-family: 
Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"disk_write_bytes"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"baseline_distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"normal"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>: [{<span style="color: #9cdcfe;">"loc"</span>: <span style="color: #ce9178;">"1e2"</span>, <span style="color: #9cdcfe;">"scale"</span>: <span style="color: #ce9178;">"20"</span>}]</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"baseline_usage_type"</span>: <span style="color: #ce9178;">"absolute"</span></span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"network_tx_bytes"</span></span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"name"</span>: <span style="color: #ce9178;">"memory_used_bytes"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"baseline_distribution_spec"</span>: {</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_type"</span>: <span style="color: #ce9178;">"pareto"</span>,</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"distribution_parameters"</span>: [{<span style="color: #9cdcfe;">"a"</span>: <span 
style="color: #ce9178;">"1.0"</span>}]</span></div><div><span style="font-family: Roboto;"> },</span></div><div><span style="font-family: Roboto;"> <span style="color: #9cdcfe;">"baseline_usage_type"</span>: <span style="color: #ce9178;">"absolute"</span></span></div><div><span style="font-family: Roboto;"> }</span></div><span style="font-family: Roboto;"><br /></span><div><span style="font-family: Roboto;"> ]</span></div><div><span style="font-family: Roboto;">}</span></div><span style="font-family: Roboto;"><br /></span></div><p><span style="font-family: Roboto;"><br /></span></p><p><br /></p></div>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-68343666269874292042020-12-10T11:15:00.003-08:002020-12-12T23:38:15.819-08:00An Impossibility Theorem For Clustering<p><span style="font-family: Roboto;"> Clustering is one of the most commonly employed tasks in data analysis. The objective of clustering is to group a given set of data points based on some distance metric such that similar points are grouped together in a cluster whereas dissimilar points are placed in different clusters. There are myriad clustering algorithms; some of the common ones are k-means, agglomerative clustering, DBSCAN and spectral clustering.</span></p><p><span style="font-family: Roboto;">Even though these techniques are commonly used and taught, there is hardly any discussion about their limitations. Jon Kleinberg changed this with his paper titled "<a href="https://www.cs.cornell.edu/home/kleinber/nips15.pdf" target="_blank">An Impossibility Theorem for Clustering</a>" wherein he formally defined the limits of clustering algorithms in terms of three properties and showed that any clustering algorithm can satisfy at most two of the three. This post is an attempt to summarise and explain the paper while skipping the proof of the theorem(s). 
</span></p><p><span style="font-family: Roboto;">Before someone complains, much of the prose is copied verbatim from the paper. The language used in the paper is very simple and there was no point in twisting it.</span></p><p><br /></p><h2 style="text-align: left;"><span style="font-family: Roboto;">The Impossibility Theorem:</span></h2><p><span style="font-family: Roboto;">Before talking about the theorem, let's define some things concretely:</span></p><h4 style="text-align: left;"><span style="font-family: Roboto;">Data (S):</span></h4><p><span style="font-family: Roboto;">We refer to the data being clustered as the set S of n points.</span></p><h4 style="text-align: left;"><span style="font-family: Roboto;">Distance Function (d): </span></h4><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="font-family: Roboto;">A distance function is any function d : S × S → ℝ such that for all i, j ∈ S, we have d(i, j) ≥ 0, d(i, j) = 0 if and only if i = j, and d(i, j) = d(j, i).</span></p></blockquote><h4 style="text-align: left;"><span style="font-family: Roboto;">Clustering:</span></h4><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="font-family: Roboto;">In terms of the distance function d, a clustering function can be defined as a function ƒ that takes a distance function d on S and returns a partition Γ of S.</span></p></blockquote><p><span style="font-family: Roboto;">The paper then defines three properties: <i>scale-invariance</i>, <i>richness</i> and <i>consistency</i> and states the impossibility theorem in terms of them. 
Let's look at the definitions of these properties.</span></p><h4 style="text-align: left;"><span style="font-family: Roboto;">Scale-invariance:</span></h4><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="font-family: Roboto;">If d is a distance function then α·d is the distance function in which the distance between i and j is α·d(i, j). Scale-invariance for a clustering function ƒ is defined as ƒ(d) = ƒ(α·d) for any distance function d and any α > 0. What this basically means is that even if we change the scale of the distance function or its unit, the output of the clustering should not change. </span></p></blockquote><h4 style="text-align: left;"><span style="font-family: Roboto;">Richness:</span></h4><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="font-family: Roboto;">This simply means that all clusterings of S should be possible. Mathematically, range(ƒ) = the set of all partitions of S. In other words, suppose we are given the names of the points only (i.e. the indices in S) but not the distances between them. Richness requires that for any desired partition Γ, it should be possible to construct a distance function d on S for which ƒ(d) = Γ.</span></p></blockquote><h4 style="text-align: left;"><span style="font-family: Roboto;">Consistency:</span></h4><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="font-family: Roboto;">A clustering function is said to be consistent if, when we shrink the distances between points within a cluster and expand the distances between points in different clusters, the clustering remains unchanged. 
Mathematically, let Γ = ƒ(d) be the clustering of S obtained using distance function d, and let d' be another distance function on S such that for all i, j in the same cluster of Γ we have d'(i, j) ≤ d(i, j), and for all i, j in different clusters of Γ we have d'(i, j) ≥ d(i, j). Then ƒ is consistent if ƒ(d') = Γ for every such d', i.e. we get the same clustering using the two distance functions. Such a d' is called a Γ transformation of d.</span></p></blockquote><h3 style="text-align: left;"><span style="font-family: Roboto;">Impossibility Theorem: </span></h3><p><span style="font-family: Roboto;"><b>There is no clustering function which satisfies all three of <i>scale-invariance</i>, <i>richness</i> and <i>consistency</i>.</b></span></p><p><span style="font-family: Roboto;"><b><br /></b></span></p><h2 style="text-align: left;"><span style="font-family: Roboto;">Understanding with Examples:</span></h2><p><span style="font-family: Roboto;">The paper mathematically proves the theorem in full generality, and also shows how it applies in the context of some common clustering algorithms. Skipping the proof, we can directly look at how it applies to common clustering models.</span></p><h4 style="text-align: left;"><span style="font-family: Roboto;">Single Linkage Clustering: </span></h4><p><span style="font-family: Roboto;">Single linkage is a form of hierarchical clustering where we start by representing every single point as its own cluster and iteratively merge these clusters with one another based on their distances. The algorithm stops based on a stopping criterion. The implication of the theorem is that no matter which stopping criterion we choose, the resulting clustering function will satisfy at most 2 of the 3 properties defined above. 
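</span></p><p><span style="font-family: Roboto;">To make this concrete, here is a minimal Python sketch of single linkage with a pluggable stopping rule. It is an illustration only, not code from the paper: the names <i>single_linkage</i>, <i>should_merge</i>, <i>distance_r</i> and <i>k_clusters</i> are hypothetical, the data is four points on a line, and the two rules correspond to the distance-r and k-clusters stopping conditions described in this section.</span></p>

```python
from itertools import combinations


def single_linkage(points, dist, should_merge):
    """Single linkage: keep merging the two closest clusters while allowed.

    should_merge(gap, k) is a hypothetical hook for the stopping condition;
    it receives the smallest inter-cluster distance and the cluster count.
    """
    clusters = [frozenset([p]) for p in points]
    while len(clusters) > 1:
        # Cluster distance = smallest distance between any two members.
        gap, a, b = min(
            ((min(dist(i, j) for i in x for j in y), x, y)
             for x, y in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        if not should_merge(gap, len(clusters)):
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)


def d(i, j):
    return abs(i - j)  # toy distance: points on a line


def scaled(alpha):
    return lambda i, j: alpha * d(i, j)  # the distance function alpha.d


points = [0, 1, 5, 6]
distance_r = lambda gap, k: gap <= 2  # merge only clusters within r = 2
k_clusters = lambda gap, k: k > 2     # stop once k = 2 clusters remain

print(single_linkage(points, d, distance_r))             # two clusters
print(single_linkage(points, scaled(0.25), distance_r))  # one cluster
print(single_linkage(points, scaled(0.25), k_clusters) ==
      single_linkage(points, d, k_clusters))             # True
```

<p><span style="font-family: Roboto;">In this toy run the distance-r rule gives a different partition once the distances are scaled by 0.25, so it is not scale-invariant. The k-clusters rule is unaffected by the scaling, but it can only ever return partitions with exactly 2 clusters, so it is not rich.</span></p><p><span style="font-family: Roboto;">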
Following are three commonly used stopping conditions for single linkage - let's see which of the 3 properties each one gives up.</span></p><p><span style="font-family: Roboto;"><b>k-clusters</b>: We can stop the clustering as soon as we have reached k clusters. This gives up the richness property, because only partitions with exactly k clusters can be obtained with this condition.</span></p><p><span style="font-family: Roboto;"><b>distance-r stopping condition</b>: In this case we merge two clusters only if their <i>distance</i> ≤ <i>some distance r</i>. This does not satisfy the <i>scale-invariance</i> property: if we obtain a clustering C using a distance function d, and then scale the distance function by </span><span style="font-family: Roboto;">α, the clustering can change</span><span style="font-family: Roboto;">. Depending on the change in scale, clusters which were previously merged may not merge anymore, or clusters which could not be merged previously because their distance was greater than r can now be merged.</span></p><p><span style="font-family: Roboto;"><b>scale-α stopping condition:</b> Let ρ denote the maximum pairwise distance in S using a distance function d1; we will only merge two clusters if their <i>distance</i> is ≤ αρ. For any α &lt; 1, this condition does not satisfy the consistency property. Let's see why - let's say <i>d2</i> is a Γ transformation of <i>d1</i>. Distances between points in different clusters can only grow under a Γ transformation, so the maximum pairwise distance obtained using <i>d2</i> can be greater than ρ. The merge threshold αρ grows with it, so the clustering obtained using <i>d2</i> may not be the same as that obtained using <i>d1</i>.</span></p><h4 style="text-align: left;"><span style="font-family: Roboto;">Centroid Based Clustering:</span></h4><p><span style="font-family: Roboto;"> Centroid based clustering refers to the commonly used <i>k-means</i> and <i>k-median</i> algorithms. 
In these algorithms we start with a predefined number of clusters <i>k</i> by selecting <i>k</i> points in the data as centroids and then assigning each point to its nearest centroid. These algorithms do not satisfy the <i>richness</i> property, since they always produce exactly <i>k</i> clusters. The paper also proves that they don't satisfy the consistency property either. I'm not reproducing the proof for the sake of brevity.</span></p><h3 style="text-align: left;"><span style="font-family: Roboto;"><br /></span></h3><h3 style="text-align: left;"><span style="font-family: Roboto;">Final Thoughts: </span></h3><p><span style="font-family: Roboto;">In practice we are aware that different clustering algorithms may produce different results on the same data. But this paper throws light on this area in an organized manner and highlights the trade-offs involved in choosing different algorithms and their parameters. These results may help one decide which qualities in the output of the clustering algorithm are more important for them and design the model accordingly, e.g., if richness is more important (i.e. we want to make sure that all possible clusterings are viable) but perhaps consistency is not important, then we know what to do.</span></p><p><span style="font-family: Roboto;"><br /></span></p><p><span style="font-family: Roboto;"><br /></span></p><p><span style="font-family: Roboto;"><br /></span></p><p><span style="font-family: Roboto;"><br /></span></p><p><br /></p>Abhinav Upadhyayhttp://www.blogger.com/profile/05017913365335406004noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-74594868875494233922020-12-05T07:36:00.011-08:002020-12-05T10:09:02.955-08:00Understanding Linear Hashing<h1 style="text-align: left;"><span style="font-family: courier;">Linear Hashing</span></h1><div style="text-align: justify;"><span style="font-family: courier;">In my most recent night time project I am learning how databases work by actually implementing one. 
I started by implementing <a href="https://github.com/abhinav-upadhyay/brickdb" target="_blank">a simple key-value store using a hash index</a>, which has required me to dig deep into the hash table literature.</span></div><div style="text-align: justify;"><span style="font-family: courier;"><br /></span></div><div style="text-align: justify;"><span style="font-family: courier;">Many of us may have implemented a hash table once or a few times in our lives - I've certainly done my fair share of implementations. One of the major problems that we tend to worry about when implementing a hash table is hash collision. Just to recall, a hash collision occurs when two or more keys hash to the same index in the table. The most straightforward way of handling collisions is a technique called separate chaining. In separate chaining we use a linked list within each bucket to store all the key/value pairs which were mapped to the same bucket by the hash function. An illustration is shown below (sorry for my terrible handwriting and illustration skills):</span></div><div style="text-align: justify;"><span style="font-family: courier;"><br /></span></div><div class="separator" style="clear: both; text-align: center;"><span style="font-family: courier;"><br /></span></div><span style="font-family: courier;"><br /></span><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYvHOskwWf3cWTh_9uNpSTMdn6awmm5uRq6R107FhwXrVLWpjB6Q3jMyQNKC49ySGs5bxwW3xk_q3QhjKp5CaIjamz_nwwYU2zpjc3r8GugrspbPMpx1DJoUkC7tvca30TgiyZ5UiBbTI/s2048/separate+chaining_4.jpg" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: courier;"><img border="0" data-original-height="1419" data-original-width="2048" height="278" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYvHOskwWf3cWTh_9uNpSTMdn6awmm5uRq6R107FhwXrVLWpjB6Q3jMyQNKC49ySGs5bxwW3xk_q3QhjKp5CaIjamz_nwwYU2zpjc3r8GugrspbPMpx1DJoUkC7tvca30TgiyZ5UiBbTI/w400-h278/separate+chaining_4.jpg" width="400" /></span></a></div><span style="font-family: courier;"><br /></span><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">The problem with separate chaining is that as the number of entries in the hash table grows we get more and more collisions and the linked lists tend to get longer. This impacts the lookup performance because in the worst case we have to scan a whole linked list to find a key in a bucket, whereas we use hash tables for their </span><span style="font-family: Roboto Mono;">O(1)</span><span style="font-family: courier;"> complexity. To make sure that the average complexity of lookup in the hash table remains </span><span style="font-family: Roboto Mono;">O(1)</span><span style="font-family: courier;"> what we usually do is expand the table once we cross a threshold load factor. The load factor could be as simple as the ratio of the number of entries in the table to the number of buckets. Once this ratio crosses a threshold, say 0.75, we may decide to grow the table so that hash collisions go down and lookups remain fast. </span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">The problem with regrowing is that we have to allocate new memory for the expanded table, rehash all the keys stored in the previous table (because with the increased table size, their bucket index in the new table may change) and then finally copy all those entries over into the new table. The usual growth factor for the table is 2, i.e. we double the size of the table every time. This gets more expensive every time we have to regrow the table. 
In practice, though, it may work out to be okay.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">However, for an on disk hash table (for indexing in a database), it is not that simple. You start by allocating a fixed number of buckets in a file and the hash function uses the number of buckets to map a key into one of the buckets. When you start getting a high number of collisions or grow beyond a certain load factor, doubling the number of buckets (like in the case of in-memory hash tables) is not that straightforward, because we would have to rehash and rewrite all the entries, and disk I/O is orders of magnitude slower than RAM access. This would also have consequences in terms of concurrency, i.e., while we are expanding the index and rewriting the entries, all the readers/writers would be blocked until this finishes. </span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">Then how do we expand the index cheaply for the on disk case? This was answered in 1980 by Litwin in a paper titled <a href="https://www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/courses/826-resources/PAPERS+BOOK/linear-hashing.PDF" target="_blank">"Linear Hashing: A New Tool for File and Table Addressing"</a>. Using linear hashing it is possible to access any record from the disk in two accesses. Linear hashing involves linearly growing the table one bucket at a time instead of the exponential growth when we double the table size every time.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">The linear hashing technique is part of a family of hashing techniques called dynamic hashing. In dynamic hashing we use a family of hash functions rather than a single fixed hash function (as is done in a static hash table implementation). 
Linear hashing is a specific example of dynamic hashing where we use two hash functions at any point in time. Following is an outline of how it works:</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">We start with the following variables:</span></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><span style="font-family: Roboto Mono;">n = number of initial buckets (must be a power of 2)</span></div><div><span style="font-family: Roboto Mono;">s = 0 (this is the index of the bucket which is to be split next)</span></div><div><span style="font-family: Roboto Mono;">i = number of bits required to address the n buckets. </span></div></blockquote><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">For load factor we will use the following formula:</span></div><div><span style="font-family: courier;"><br /></span></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div style="text-align: left;"><span style="font-family: Roboto Mono;">load_factor = number of entries / (2 * number of buckets)</span></div></blockquote><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">The number 2 in the denominator is to allow on average 2 records in a chain in any given bucket. </span><span style="font-family: courier;">For 1 entry and 2 buckets, the load factor is 1/4. For 2 entries and 2 buckets it is 1/2, for 3 entries and 2 buckets it is 3/4. For 4 entries and 2 buckets it will be 1. And so on. We will use 0.8 as the threshold for the load factor. </span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">Let's assume we are using a hash function which gives us a 64 bit unsigned integer, which means we can use it to address up to 2^64 buckets. 
In the beginning our hash table will only have </span><span style="font-family: Roboto Mono;">n</span><span style="font-family: courier;"> buckets. So we will use the first </span><span style="font-family: Roboto Mono;">i</span><span style="font-family: courier;"> bits of the hash value to map a key to a bucket. Once we cross the threshold of our load factor, we will add one more bucket to the table and, if needed, increment </span><span style="font-family: Roboto Mono;">i</span><span style="font-family: courier;"> (we only increment </span><span style="font-family: Roboto Mono;">i</span><span style="font-family: courier;"> when the number of buckets has reached a power of 2 and we need an extra bit to address the new bucket). After adding the new bucket we split the keys stored in the bucket at index </span><span style="font-family: Roboto Mono;">s</span><span style="font-family: courier;"> between that bucket and the newly added bucket. We increment </span><span style="font-family: Roboto Mono;">s</span><span style="font-family: courier;"> by one after every split. Once we have doubled the number of buckets from where we started, we reset </span><span style="font-family: Roboto Mono;">s</span><span style="font-family: courier;"> to 0 and repeat.</span></div><div><span style="font-family: courier;"><br /></span></div><h3 style="text-align: left;"><span style="font-family: courier;">Walk Through of inserting keys:</span></h3><div><span style="font-family: courier;">Following is a walk through of how it would work through a toy example. 
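</span></div><div><span style="font-family: courier;">Before stepping through the example, the outline above can be sketched in Python. This is a minimal in-memory sketch under stated assumptions, not the implementation from my key-value store: the class name LinearHashTable is hypothetical, Python's built-in hash() stands in for the 64 bit hash function, the low-order i bits of the hash are used as the bucket address, and plain lists stand in for on disk buckets and their chains.</span></div>

```python
class LinearHashTable:
    """Toy linear hash table (hypothetical name); buckets are Python lists."""

    def __init__(self, n=2, threshold=0.8):
        # n is assumed to be a power of 2, as in the outline above.
        self.buckets = [[] for _ in range(n)]
        self.i = max(1, (n - 1).bit_length())  # bits needed to address n buckets
        self.s = 0                             # index of the next bucket to split
        self.count = 0
        self.threshold = threshold

    def _index(self, key):
        idx = hash(key) & ((1 << self.i) - 1)  # assumption: low-order i bits
        if idx >= len(self.buckets):           # that bucket does not exist yet,
            idx -= 1 << (self.i - 1)           # so drop the top bit
        return idx

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for pos, (k, _) in enumerate(bucket):
            if k == key:                       # overwrite an existing key
                bucket[pos] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        # load_factor = number of entries / (2 * number of buckets)
        if self.count / (2 * len(self.buckets)) > self.threshold:
            self._split()

    def _split(self):
        if len(self.buckets) == 1 << self.i:   # every i-bit address is in use
            self.i += 1                        # one more bit for the new bucket
            self.s = 0                         # a new round of splits begins
        self.buckets.append([])                # grow by exactly one bucket
        old, self.buckets[self.s] = self.buckets[self.s], []
        for k, v in old:                       # rehash only the split bucket
            self.buckets[self._index(k)].append((k, v))
        self.s += 1


table = LinearHashTable(n=2)
for key in range(4):                           # small ints hash to themselves
    table.put(key, "v%d" % key)
print(len(table.buckets), table.i, table.s)
print(table.buckets)
```

<div><span style="font-family: courier;">Note that a split rewrites only the entries of one bucket, which is the point of the technique; every other bucket is left untouched.</span></div><div><span style="font-family: courier;">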
Let's say we start with </span><span style="font-family: Roboto Mono;">n=2</span><span style="font-family: courier;"> buckets, so that each bucket is addressable with just one bit (0 or 1), so </span><span style="font-family: Roboto Mono;">i=1</span><span style="font-family: courier;"> and </span><span style="font-family: Roboto Mono;">s=0</span><span style="font-family: courier;">.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">Let's say we insert four keys into the table via following calls:</span></div><div style="text-align: center;"><span style="font-family: Roboto Mono; font-size: x-small;">put(k1, v1)</span></div><div style="text-align: center;"><span style="font-family: Roboto Mono; font-size: x-small;">put(k2, v2)</span></div><div style="text-align: center;"><span style="font-family: Roboto Mono; font-size: x-small;">put(k3, v3)</span></div><div style="text-align: center;"><span style="font-family: Roboto Mono; font-size: x-small;">put(k4, v4)</span></div><div style="text-align: center;"><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">Following is an illustration of how it might be arranged in the table:</span></div><div><span style="font-family: courier;"><br /></span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs-KbXhjCjkYGI9vIyI8x4s6us58Y_T6gPcfn9O7BCVBC41W7yk-SqjiXtinRlyjqmEBev2rd-LOuex_CT8w6RT681r1n87w4BdLKBjamD6APzxQTKomozWAGIwfuHVIAa3PU2bMUUahw/s2048/Adobe+Scan+05-Dec-2020_1+%25281%2529.jpg" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: courier;"><img border="0" data-original-height="1193" data-original-width="2048" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs-KbXhjCjkYGI9vIyI8x4s6us58Y_T6gPcfn9O7BCVBC41W7yk-SqjiXtinRlyjqmEBev2rd-LOuex_CT8w6RT681r1n87w4BdLKBjamD6APzxQTKomozWAGIwfuHVIAa3PU2bMUUahw/s320/Adobe+Scan+05-Dec-2020_1+%25281%2529.jpg" width="320" /></span></a></div><span style="font-family: courier;"><br /></span><div class="separator" style="clear: both; text-align: center;"><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">After adding 4 entries into our table, the load factor becomes 4/4 = 1 which is greater than our threshold load factor of 0.8 so we grow the table. In linear hashing we grow the table by adding just one new bucket at a time. So we add bucket B2. Since to address the 3rd bucket we need one extra bit, we increment </span><span style="font-family: Roboto Mono;">i</span><span style="font-family: courier;"> to 2. We also split the entries in bucket B0 with B2 by rehashing each of them using 2 bits now. Following is how it would look like after doing this: </span></div><div><span style="font-family: courier;"><br /></span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVe92zfJKBWZjSXayMctZjsP386whQnHMKs6e2RM6hr2D3oip6sDzzo0vQX9-6wlTtIpE7cflQawqDX6GwLZFXcfo1Dhjn7iXzxz2VF-Zucr9fPfx2A64XcYq8LDUY2tMdIDnHflbJ9z0/s2339/Adobe+Scan+05-Dec-2020_2+%25281%2529.jpg" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: courier;"><img border="0" data-original-height="949" data-original-width="2339" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVe92zfJKBWZjSXayMctZjsP386whQnHMKs6e2RM6hr2D3oip6sDzzo0vQX9-6wlTtIpE7cflQawqDX6GwLZFXcfo1Dhjn7iXzxz2VF-Zucr9fPfx2A64XcYq8LDUY2tMdIDnHflbJ9z0/s320/Adobe+Scan+05-Dec-2020_2+%25281%2529.jpg" width="320" /></span></a></div><span style="font-family: courier;"><br /></span><div><span style="font-family: courier;"><br /></span></div><div><span 
style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">After the split </span><span style="font-family: Roboto Mono;">s</span><span style="font-family: courier;"> is incremented to 1.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">Now, let's insert another entry into the table. The new entry is key k5 and it hashes to bucket 11. As right now we only have 3 buckets, 11 is an invalid index, so we just use the first bit of this value and store k5 in bucket B1. Following is how it would look:</span></div><div><span style="font-family: courier;"><br /></span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_sV5LDpMt4Mv1I_72EKaZHit5KCXBxlbRqwR4trBtbkxGac7PsvM1b6bew0v_JCvG3BN4ZEWCmoxoHiiensDxnPYcg9UZQuiiewMVYYiVmPETZBPUlXaadpeH8H_uaXPlsRhAz_sHZ5s/s2339/Adobe+Scan+05-Dec-2020_3+%25281%2529.jpg" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: courier;"><img border="0" data-original-height="882" data-original-width="2339" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_sV5LDpMt4Mv1I_72EKaZHit5KCXBxlbRqwR4trBtbkxGac7PsvM1b6bew0v_JCvG3BN4ZEWCmoxoHiiensDxnPYcg9UZQuiiewMVYYiVmPETZBPUlXaadpeH8H_uaXPlsRhAz_sHZ5s/s320/Adobe+Scan+05-Dec-2020_3+%25281%2529.jpg" width="320" /></span></a></div><span style="font-family: courier;"><br /></span><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">After this again, our load factor would be 5/6=0.83 which would require another addition of a bucket followed by split of entries between bucket 1 and 3. We would still use </span><span style="font-family: Roboto Mono;">i=2</span><span style="font-family: courier;"> since we can address 4 buckets using 2 bits. 
Following is how it would look after that:</span></div><div><span style="font-family: courier;"><br /></span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiL0cdvy8rvZ0jHCCmdZ1sHTaIh_JqRcI8o3d9IrJ7RCOq4PQDNtQiIB9N-yrIjEaGl9xIiaXKUnfjo1h2ZO_auM2HDhjfzY3PLtz42gwtfNEfFtlYGTloPrujMgG3ERKSTkRKQYQKr1gw/s2339/Adobe+Scan+05-Dec-2020_4+%25281%2529.jpg" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: courier;"><img border="0" data-original-height="846" data-original-width="2339" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiL0cdvy8rvZ0jHCCmdZ1sHTaIh_JqRcI8o3d9IrJ7RCOq4PQDNtQiIB9N-yrIjEaGl9xIiaXKUnfjo1h2ZO_auM2HDhjfzY3PLtz42gwtfNEfFtlYGTloPrujMgG3ERKSTkRKQYQKr1gw/s320/Adobe+Scan+05-Dec-2020_4+%25281%2529.jpg" width="320" /></span></a></div><span style="font-family: courier;"><br /></span><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">Since we have doubled the number of buckets from where we started, we reset the split pointer, </span><span style="font-family: Roboto Mono;">s</span><span style="font-family: courier;"> to 0 (we reset it every time we have doubled the number of buckets as stated previously).</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">In this way we can continue adding entries and slowly growing the table as required. This works particularly well for on disk storage because it's cheap to append an entry and just rewrite a few values to adjust the linked lists. 
This works well from concurrency point of view as well because we can just take a lock on the hash chain which is being split and continue to read/write other hash chains.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"><br /></span></div><h3 style="text-align: left;"><span style="font-family: courier;">Reading Values</span></h3><div><span style="font-family: courier;">Let's briefly also talk about how reading from this hash table would work. Let's say after having added 5 entries as above, we want to read the key k3. How would we do that? </span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">Since </span><span style="font-family: Roboto Mono;">i=2</span><span style="font-family: courier;">, we get h(k3) = 00. That is the first bucket, we can safely read the value from there and return.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">What about the case when we have incremented </span><span style="font-family: Roboto Mono;">i</span><span style="font-family: courier;"> but our total number of buckets is less than </span><span style="font-family: Roboto Mono;">2^i</span><span style="font-family: courier;">? For example if in the above case we add a 5th bucket we would have to increment </span><span style="font-family: Roboto Mono;">i</span><span style="font-family: courier;"> to 3. But 3 bits can address 8 buckets while we only have 5. How would we read from the hash table in that case? It's pretty simple: We use 3 bits to hash the key. 
If the resulting value is greater than or equal to the number of buckets we have right now (that is, it is not a valid bucket index), we use only 2 bits and read from the bucket at the resulting index; otherwise we use all three bits.</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">Following is a Python implementation of a vanilla hash table along with Linear Hashing in less than 100 lines. I first implemented the usual HashMap which grows exponentially and then implemented LinearHashMap by extending it and overriding the _grow method. I hope it makes sense, but feel free to leave a comment if something is not clear.</span></div><div><br /></div><div><br /></div><div><div style="background-color: #1e1e1e; color: #d4d4d4; font-family: "Droid Sans Mono", monospace, monospace, "Droid Sans Fallback"; font-size: 14px; line-height: 19px; white-space: pre;"><br /><div><span style="color: #6a9955;"># Copyright (c) 2020,2021 Abhinav Upadhyay</span></div><div> <span style="color: #6a9955;"># All rights reserved.</span></div><div> <span style="color: #6a9955;">#</span></div><div> <span style="color: #6a9955;"># Redistribution and use in source and binary forms, with or without</span></div><div> <span style="color: #6a9955;"># modification, are permitted provided that the following conditions</span></div><div> <span style="color: #6a9955;"># are met:</span></div><div> <span style="color: #6a9955;"># 1. Redistributions of source code must retain the above copyright</span></div><div> <span style="color: #6a9955;"># notice, this list of conditions and the following disclaimer.</span></div><div> <span style="color: #6a9955;"># 2. 
Redistributions in binary form must reproduce the above copyright</span></div><div> <span style="color: #6a9955;"># notice, this list of conditions and the following disclaimer in the</span></div><div> <span style="color: #6a9955;"># documentation and/or other materials provided with the distribution.</span></div><div> <span style="color: #6a9955;">#</span></div><div> <span style="color: #6a9955;"># THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS</span></div><div> <span style="color: #6a9955;"># ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED</span></div><div> <span style="color: #6a9955;"># TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR</span></div><div> <span style="color: #6a9955;"># PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS</span></div><div> <span style="color: #6a9955;"># BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR</span></div><div> <span style="color: #6a9955;"># CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF</span></div><div> <span style="color: #6a9955;"># SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS</span></div><div> <span style="color: #6a9955;"># INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN</span></div><div> <span style="color: #6a9955;"># CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)</span></div><div> <span style="color: #6a9955;"># ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE</span></div><div> <span style="color: #6a9955;"># POSSIBILITY OF SUCH DAMAGE.</span></div><div> <span style="color: #6a9955;">#</span></div><div> </div><div><span style="color: #c586c0;">import</span> math</div><div><span style="color: #c586c0;">import</span> xxhash</div><br /><div><span style="color: #569cd6;">class</span> <span style="color: #4ec9b0;">HashMap</span>:</div><div> <span style="color: 
#569cd6;">def</span> <span style="color: #dcdcaa;">__init__</span>(<span style="color: #9cdcfe;">self</span>, <span style="color: #9cdcfe;">size</span>=<span style="color: #b5cea8;">32</span>, <span style="color: #9cdcfe;">load_factor</span>=<span style="color: #b5cea8;">0.75</span>, <span style="color: #9cdcfe;">grow</span>=<span style="color: #569cd6;">True</span>):</div><div> <span style="color: #569cd6;">self</span>.table = [EntryList() <span style="color: #c586c0;">for</span> _ <span style="color: #c586c0;">in</span> <span style="color: #dcdcaa;">range</span>(size)]</div><div> <span style="color: #569cd6;">self</span>.nentries = <span style="color: #b5cea8;">0</span></div><div> <span style="color: #569cd6;">self</span>.load_factor = load_factor</div><div> <span style="color: #569cd6;">self</span>.grow = grow</div><div> </div><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">put</span>(<span style="color: #9cdcfe;">self</span>, <span style="color: #9cdcfe;">k</span>, <span style="color: #9cdcfe;">v</span>):</div><div> bkt_idx = <span style="color: #569cd6;">self</span>._get_bucket_idx(k, <span style="color: #dcdcaa;">len</span>(<span style="color: #569cd6;">self</span>.table))</div><div> <span style="color: #569cd6;">self</span>.table[bkt_idx].append(k, v)</div><div> <span style="color: #569cd6;">self</span>.nentries += <span style="color: #b5cea8;">1</span></div><div> <span style="color: #c586c0;">if</span> <span style="color: #569cd6;">self</span>.grow <span style="color: #569cd6;">and</span> <span style="color: #569cd6;">self</span>._comput_load_factor() > <span style="color: #569cd6;">self</span>.load_factor:</div><div> <span style="color: #569cd6;">self</span>._grow()</div><br /><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">get</span>(<span style="color: #9cdcfe;">self</span>, <span style="color: #9cdcfe;">k</span>):</div><div> bkt_idx = <span style="color: #569cd6;">self</span>._get_bucket_idx(k, 
<span style="color: #dcdcaa;">len</span>(<span style="color: #569cd6;">self</span>.table))</div><div> <span style="color: #c586c0;">for</span> entry <span style="color: #c586c0;">in</span> <span style="color: #569cd6;">self</span>.table[bkt_idx]:</div><div> <span style="color: #c586c0;">if</span> entry.key == k:</div><div> <span style="color: #c586c0;">return</span> entry.value</div><div> </div><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">_comput_load_factor</span>(<span style="color: #9cdcfe;">self</span>):</div><div> <span style="color: #c586c0;">return</span> <span style="color: #569cd6;">self</span>.nentries / (<span style="color: #b5cea8;">3</span> * <span style="color: #dcdcaa;">len</span>(<span style="color: #569cd6;">self</span>.table))</div><div> </div><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">_get_bucket_idx</span>(<span style="color: #9cdcfe;">self</span>, <span style="color: #9cdcfe;">k</span>, <span style="color: #9cdcfe;">size</span>):</div><div> <span style="color: #c586c0;">return</span> xxhash.xxh64(k).intdigest() % size</div><br /><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">_grow</span>(<span style="color: #9cdcfe;">self</span>):</div><div> <span style="color: #6a9955;"># we double the table size and rehash all the entries</span></div><div> newsize = <span style="color: #dcdcaa;">len</span>(<span style="color: #569cd6;">self</span>.table) * <span style="color: #b5cea8;">2</span></div><div> new_table = [EntryList() <span style="color: #c586c0;">for</span> _ <span style="color: #c586c0;">in</span> <span style="color: #dcdcaa;">range</span>(newsize)]</div><div> <span style="color: #c586c0;">for</span> bucket <span style="color: #c586c0;">in</span> <span style="color: #569cd6;">self</span>.table:</div><div> <span style="color: #c586c0;">for</span> e <span style="color: #c586c0;">in</span> bucket:</div><div> bucket_idx = <span style="color: 
#569cd6;">self</span>._get_bucket_idx(e.key, newsize)</div><div> new_table[bucket_idx].append(e.key, e.value)</div><div> <span style="color: #569cd6;">self</span>.table = new_table</div><br /><div><span style="color: #569cd6;">class</span> <span style="color: #4ec9b0;">LinearHashMap</span>(<span style="color: #4ec9b0;">HashMap</span>):</div><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">__init__</span>(<span style="color: #9cdcfe;">self</span>, <span style="color: #9cdcfe;">size</span>=<span style="color: #b5cea8;">32</span>, <span style="color: #9cdcfe;">load_factor</span>=<span style="color: #b5cea8;">0.75</span>):</div><div> <span style="color: #4ec9b0;">super</span>().<span style="color: #dcdcaa;">__init__</span>(size, load_factor)</div><div> <span style="color: #569cd6;">self</span>.i = <span style="color: #4ec9b0;">int</span>(math.log2(size))</div><div> <span style="color: #569cd6;">self</span>.split_idx = <span style="color: #b5cea8;">0</span></div><div> </div><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">_grow</span>(<span style="color: #9cdcfe;">self</span>):</div><div> split_idx = <span style="color: #569cd6;">self</span>.split_idx</div><div> <span style="color: #569cd6;">self</span>.split_idx += <span style="color: #b5cea8;">1</span></div><div> old_bucket = <span style="color: #569cd6;">self</span>.table[split_idx]</div><div> new_bucket = EntryList()</div><div> <span style="color: #569cd6;">self</span>.table.append(new_bucket)</div><div> <span style="color: #6a9955;"># if we have grown to the next power of 2 number of buckets</span></div><div> <span style="color: #6a9955;"># we increment i</span></div><div> <span style="color: #c586c0;">if</span> <span style="color: #dcdcaa;">len</span>(<span style="color: #569cd6;">self</span>.table) > (<span style="color: #b5cea8;">1</span> << <span style="color: #569cd6;">self</span>.i):</div><div> <span style="color: #569cd6;">self</span>.i += <span 
style="color: #b5cea8;">1</span></div><div> <span style="color: #6a9955;"># if we have doubled the number of buckets, we reset s to 0</span></div><div> <span style="color: #c586c0;">if</span> <span style="color: #569cd6;">self</span>.split_idx * <span style="color: #b5cea8;">2</span> == <span style="color: #dcdcaa;">len</span>(<span style="color: #569cd6;">self</span>.table):</div><div> <span style="color: #569cd6;">self</span>.split_idx = <span style="color: #b5cea8;">0</span></div><div> <span style="color: #6a9955;"># rehash the entries in the old bucket and split with new bucket</span></div><div> prev_e = <span style="color: #569cd6;">None</span></div><div> <span style="color: #c586c0;">for</span> e <span style="color: #c586c0;">in</span> old_bucket:</div><div> new_bucket_id = <span style="color: #569cd6;">self</span>._get_bucket_idx(e.key, <span style="color: #dcdcaa;">len</span>(<span style="color: #569cd6;">self</span>.table))</div><div> <span style="color: #c586c0;">if</span> new_bucket_id != split_idx:</div><div> new_bucket.append(e.key, e.value)</div><div> <span style="color: #6a9955;"># unlink e from the old bucket; the head entry has no predecessor</span></div><div> <span style="color: #c586c0;">if</span> prev_e <span style="color: #569cd6;">is</span> <span style="color: #569cd6;">None</span>:</div><div> old_bucket.head = e.next</div><div> <span style="color: #c586c0;">else</span>:</div><div> prev_e.next = e.next</div><div> <span style="color: #c586c0;">else</span>:</div><div> prev_e = e</div><br /><div> </div><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">_get_bucket_idx</span>(<span style="color: #9cdcfe;">self</span>, <span style="color: #9cdcfe;">k</span>, <span style="color: #9cdcfe;">size</span>):</div><div> h = xxhash.xxh64(k).intdigest()</div><div> <span style="color: #6a9955;"># we take the first i bits as the bucket index</span></div><div> <span style="color: #6a9955;"># if this index is less than the number of buckets</span></div><div> <span style="color: #6a9955;"># we return it as it is. 
Otherwise we unset the MSB</span></div><div> <span style="color: #6a9955;"># so we only use i-1 bits effectively and address the valid bucket</span></div><div> bkt_idx = h & ((<span style="color: #b5cea8;">1</span> << <span style="color: #569cd6;">self</span>.i) - <span style="color: #b5cea8;">1</span>)</div><div> <span style="color: #c586c0;">if</span> bkt_idx < size:</div><div> <span style="color: #c586c0;">return</span> bkt_idx</div><div> <span style="color: #c586c0;">return</span> bkt_idx ^ (<span style="color: #b5cea8;">1</span> << (<span style="color: #569cd6;">self</span>.i - <span style="color: #b5cea8;">1</span>))</div><div> </div><br /><div><span style="color: #569cd6;">class</span> <span style="color: #4ec9b0;">EntryList</span>:</div><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">__init__</span>(<span style="color: #9cdcfe;">self</span>):</div><div> <span style="color: #569cd6;">self</span>.head = <span style="color: #569cd6;">None</span></div><div> </div><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">append</span>(<span style="color: #9cdcfe;">self</span>, <span style="color: #9cdcfe;">k</span>, <span style="color: #9cdcfe;">v</span>):</div><div> <span style="color: #c586c0;">if</span> <span style="color: #569cd6;">self</span>.head <span style="color: #569cd6;">is</span> <span style="color: #569cd6;">None</span>:</div><div> <span style="color: #569cd6;">self</span>.head = Entry(k, v)</div><div> <span style="color: #c586c0;">return</span></div><div> <span style="color: #569cd6;">self</span>.head.append(k, v)</div><div> </div><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">__iter__</span>(<span style="color: #9cdcfe;">self</span>):</div><div> <span style="color: #dcdcaa;">next</span> = <span style="color: #569cd6;">self</span>.head</div><div> <span style="color: #c586c0;">while</span> <span style="color: #dcdcaa;">next</span>:</div><div> <span style="color: 
#c586c0;">yield</span> <span style="color: #dcdcaa;">next</span></div><div> <span style="color: #dcdcaa;">next</span> = <span style="color: #dcdcaa;">next</span>.next</div><br /><div><span style="color: #569cd6;">class</span> <span style="color: #4ec9b0;">Entry</span>:</div><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">__init__</span>(<span style="color: #9cdcfe;">self</span>, <span style="color: #9cdcfe;">key</span>, <span style="color: #9cdcfe;">value</span>):</div><div> <span style="color: #569cd6;">self</span>.key = key</div><div> <span style="color: #569cd6;">self</span>.value = value</div><div> <span style="color: #569cd6;">self</span>.next = <span style="color: #569cd6;">None</span></div><div> </div><div> <span style="color: #569cd6;">def</span> <span style="color: #dcdcaa;">append</span>(<span style="color: #9cdcfe;">self</span>, <span style="color: #9cdcfe;">k</span>, <span style="color: #9cdcfe;">v</span>):</div><div> new_entry = Entry(k, v)</div><div> new_entry.next = <span style="color: #569cd6;">self</span>.next</div><div> <span style="color: #569cd6;">self</span>.next = new_entry</div><div> </div></div></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div>Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-6493199500512403622017-08-17T12:09:00.000-07:002017-08-17T12:44:16.326-07:00Implementing a Toy Chatbot using Machine Learning<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Chatbots are all the rage these days. There are numerous companies offering chatbots as a service (wit.ai, api.ai, etc.). To an outsider it may look like magic how these things work, but for an ML practitioner they are nothing more than simple classifier models. About a year back I made an attempt to create a weather bot + travel bot (a bot which could tell you the weather and also help you book flights). It was a fun learning experiment with some interesting output. A year is long enough that I don't remember much about the code, but in this post I will explain the general design of the bot that I created and show some demos.</span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Essentially a chatbot is like a very simple REPL (Read-Eval-Print Loop), where you read input from a human one sentence at a time, evaluate it and decide what to do with it, print a response, and go back to step 1. We will talk about all 3 of these steps in detail below, in the context of implementing a weather + travel bot, i.e. a bot which tells you the weather of a place and also helps you plan your travel.</span></div>
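<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The read-evaluate-print cycle described above can be sketched roughly as follows. This is only an illustration and not the code from the actual bot: <code>run_bot</code> and <code>echo_respond</code> are hypothetical names, and the evaluation step is a placeholder that echoes the input.</span></div>

```python
from typing import Callable, Iterable, List


def run_bot(sentences: Iterable[str], respond: Callable[[str], str]) -> List[str]:
    """Read sentences one at a time, evaluate each one, and print the reply."""
    replies = []
    for sentence in sentences:             # read
        reply = respond(sentence.strip())  # evaluate
        print(reply)                       # print
        replies.append(reply)              # then loop back for the next sentence
    return replies


def echo_respond(sentence: str) -> str:
    # Hypothetical placeholder for the evaluation step.
    return "You said: " + sentence


if __name__ == "__main__":
    run_bot(["Hi!", "What is the weather in Paris?"], echo_respond)
```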
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">For a weather bot, the most important thing is to understand which place you are asking about. So, if we can simply train a model which is able to extract the location name from a sentence, we are good to go, right? Evidently not! Since these things are called chatbots (bots capable of chatting), expectations from them are greater.</span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The first sentence the user enters is not necessarily a question about the weather. It might just be a simple greeting, such as "Hi!", or "Hello". Our bot should be able to understand these and respond accordingly. Similarly, the user may also try to make other sorts of conversation, such as asking the bot its name, or telling the bot their own name. These are just two examples of the types of conversation which we might want our bot to be able to handle apart from the regular weather or travel questions.</span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">So, in essence, we can't just expect that every sentence entered by the user is about weather. We need to first understand the sentence (i.e. greeting, asking name, or asking weather) and then generate a response. This means every input sentence has to go through a classifier, which classifies the sentence into one of the classes telling you what the sentence is about, e.g., is the user just greeting you, is the user asking a question, is the user saying something off topic, and finally is the user talking about weather or travel. Based on this the bot can decide what response to generate.</span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">I started with this design but I didn't have any training data to start with. To test the idea out, I just wrote some sample sentences about weather, travel, greetings, and some questions (e.g. asking the bot's name) in a text file. But I could only produce some 40 odd sentences overall, with 5-6 sentences for each of the sentence types that I wanted the bot to recognize. This was clearly a very small dataset to train any kind of machine learning model. Most of the models would end up badly overfitting it, resulting in a terribly confused bot.</span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">So, I decided to simplify the problem. I created a hierarchy of sentence types. The first class was intent, which covers the types of sentences the bot is designed to respond to (such as telling the weather, planning travel, answering the user's greetings). The second class was non-intent (anything the bot was not designed to answer, but for which we could provide some hard coded responses if we understood what the user said). See the figure below to get an idea about the hierarchy of the sentence classes.</span></div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUj_FHqdVoqXJgZoOcYpstx4nw6H2HdrbVKf-_KhytDAPYknEzKxiqq1QPzW-5DLTRc6qc9UsVOy8TrtqD0Lqo113fbO6Ob3FPirg46HNDQP4w2ZwA-wYACWrO1Fkk3OMonwWOICN8pko/s1600/Untitled+drawing.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="720" data-original-width="960" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUj_FHqdVoqXJgZoOcYpstx4nw6H2HdrbVKf-_KhytDAPYknEzKxiqq1QPzW-5DLTRc6qc9UsVOy8TrtqD0Lqo113fbO6Ob3FPirg46HNDQP4w2ZwA-wYACWrO1Fkk3OMonwWOICN8pko/s400/Untitled+drawing.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Hierarchy of sentence class types </td></tr>
</tbody></table>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">I tried out this second design, training individual classifiers for each of these different sentence classes. The way the whole thing works is better expressed in the following pseudo code than in any amount of prose I could write:</span></div>
<br />
<pre class="prettyprint"><code>
for each input sentence:
    if sentence not of type 'intent':
        if sentence of type 'question':
            question_class = question_classifier.predict(sentence)
            generate_question_response(question_class)
            return
        if sentence of type 'sentiment':
            sentiment = sentiment_classifier.predict(sentence)
            if sentiment == 'happy':
                generate_happy_response()
            else:
                generate_sad_response()
    else:
        intent = intent_classifier.predict(sentence)
        if intent of type 'weather':
            if location not in sentence ask location else tell weather
        if intent of type 'travel':
            if location not in sentence ask location else get flights
        if intent of type 'goodbye':
            say_goodbye()
        if intent of type 'greeting':
            greet_user()
</code></pre>
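<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">For readers who prefer something they can run, here is a minimal sketch of the same dispatch logic. It is an assumption-laden illustration, not the code in bot.py: simple keyword rules stand in for the trained classifiers, and the keyword sets, the location list and every function name are made up for this example.</span></div>

```python
import re

# Hypothetical keyword rules standing in for the trained classifiers.
INTENT_KEYWORDS = {
    "weather": {"weather", "rain", "sunny", "temperature"},
    "travel": {"flight", "flights", "travel", "book"},
    "goodbye": {"bye", "goodbye"},
    "greeting": {"hi", "hello", "hey"},
}
KNOWN_LOCATIONS = {"london", "paris", "delhi"}  # assumed toy location list


def tokens(sentence):
    """Lowercase the sentence and split it into words."""
    return re.findall(r"[a-z']+", sentence.lower())


def classify_intent(sentence):
    """Return an intent label, or None for a non-intent sentence."""
    words = set(tokens(sentence))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return None


def find_location(sentence):
    """Return the first known location mentioned in the sentence, if any."""
    for word in tokens(sentence):
        if word in KNOWN_LOCATIONS:
            return word.title()
    return None


def respond(sentence):
    intent = classify_intent(sentence)
    if intent is None:
        # Non-intent branch: questions get a canned answer, everything
        # else is treated as a sentiment statement.
        if sentence.strip().endswith("?"):
            return "I am a toy bot."
        if "good" in tokens(sentence):
            return "Glad to hear that!"
        return "Sorry to hear that."
    if intent in ("weather", "travel"):
        location = find_location(sentence)
        if location is None:
            return "Which place do you mean?"
        if intent == "weather":
            return "It is sunny in %s." % location  # hard coded reply
        return "Here are some flights to %s." % location
    if intent == "goodbye":
        return "Goodbye!"
    return "Hello!"
```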
<br />
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The above pseudo code describes the way the bot utilizes the various individually trained classifiers to navigate through the sentence type hierarchy and decide what actions to take. In my implementation I did not actually integrate with any third-party services to get the weather or show flights. I just hard coded some fixed responses, but those could easily be replaced with actual integration with live services.</span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">As for training the models, I tried several, such as logistic regression, support vector machines, random forests and neural networks, and used whichever performed best. Given the small size of the dataset, all of them were prone to overfitting, so I chose the one that seemed least confused when tested on sentences structured differently from those in the training dataset.</span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">To vectorize the sentences, I used pre-trained word embeddings (GloVe vectors, loaded through spaCy). I converted each word of a sentence to its word vector and summed those up to get a single vector for the sentence. Many people also suggest averaging the vectors, but I did not try that. There may also be better ways to vectorize a sentence, such as stacking all the word vectors instead of summing them.</span></div>
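<div style="text-align: justify;">
The summing step can be sketched in a few lines of Python. This is a minimal illustration, not the code from the repo: the three-dimensional vectors in <code>TOY_VECTORS</code> are made up, and the helper name <code>sentence_vector</code> is hypothetical. In the real bot the vectors would come from spaCy.</div>

```python
# Toy 3-dimensional embeddings. Real pre-trained vectors have hundreds of
# dimensions; these numbers are made up for illustration.
TOY_VECTORS = {
    "book": [0.2, 0.1, 0.7],
    "a": [0.0, 0.1, 0.0],
    "flight": [0.9, 0.3, 0.1],
}
DIMENSIONS = 3


def sentence_vector(sentence, vectors=TOY_VECTORS, average=False):
    """Sum (or average) the vectors of the known words in a sentence."""
    total = [0.0] * DIMENSIONS
    known = 0
    for word in sentence.lower().split():
        vec = vectors.get(word)
        if vec is None:
            continue  # skip out-of-vocabulary words
        total = [t + v for t, v in zip(total, vec)]
        known += 1
    if average and known:
        total = [t / known for t in total]
    return total


summed = sentence_vector("Book a flight")
averaged = sentence_vector("Book a flight", average=True)
```

<div style="text-align: justify;">
Summing keeps the vector size fixed regardless of sentence length, which is what lets a single classifier accept any sentence. Averaging, the alternative mentioned above, only rescales the same vector by the word count.</div>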
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">I wrote this code more than a year ago as a proof of concept, so it is not super clean, commented or documented (this blog post is the best documentation I have). The chatbot lives in the file bot.py. The code for training the individual classifiers is in a bunch of IPython notebooks (I used those to experiment easily with various models but never got around to moving the code into actual Python files). I manually created the datasets for the individual sentence classes; they are in text files. Feel free to check out the code at: <a href="https://github.com/abhinav-upadhyay/chatbot-poc">https://github.com/abhinav-upadhyay/chatbot-poc</a>. I have added pickle files of the pre-trained classifiers to the repo, so that bot.py just runs and you don't have to train the classifiers yourself.</span></div>
<br />
Here are some demo conversations (click them to enlarge):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtvIP9OWxCeaogeIx5gsMJyuoDTmy-AEUtuwqI8l7VQqOoAu9T0DJgadElARzuGjg2c4xKaaYxZB07GgUAuatOctCXcyngf5oMgswvdGI0w25gHzyihyphenhyphenRoFc3nXtYIwUGRIEJ4oSRr4kU/s1600/funny.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="574" data-original-width="1272" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtvIP9OWxCeaogeIx5gsMJyuoDTmy-AEUtuwqI8l7VQqOoAu9T0DJgadElARzuGjg2c4xKaaYxZB07GgUAuatOctCXcyngf5oMgswvdGI0w25gHzyihyphenhyphenRoFc3nXtYIwUGRIEJ4oSRr4kU/s400/funny.gif" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhw4TT9fdz7VXjIMPN6ZJ6xolSLuBleXdoROvC0iOa4Q3luNvMdZEbj2AQ-o89cqAHzoRXwoFxaiqdjWEKAg6Yp4vGo5VeVzfQlVPbc4MnntLyi52BXErDy5wAeDdM8Tk5YvOBFoT5LwhI/s1600/name.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="574" data-original-width="1272" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhw4TT9fdz7VXjIMPN6ZJ6xolSLuBleXdoROvC0iOa4Q3luNvMdZEbj2AQ-o89cqAHzoRXwoFxaiqdjWEKAg6Yp4vGo5VeVzfQlVPbc4MnntLyi52BXErDy5wAeDdM8Tk5YvOBFoT5LwhI/s400/name.gif" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxr17wEoxCm8SBDL45zS4rrAsVwf-dYhtzVWTUhmN7nrJZpN_Xji9UJFk6fgy1tQYhgeiuFxX_9VLnOt1rk0SiiyrmhjojBfSbhCb_rihZzj0wImUDaYmvjG1kc9U6MgI1oBX9Th4M6H4/s1600/sad.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="574" data-original-width="1272" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxr17wEoxCm8SBDL45zS4rrAsVwf-dYhtzVWTUhmN7nrJZpN_Xji9UJFk6fgy1tQYhgeiuFxX_9VLnOt1rk0SiiyrmhjojBfSbhCb_rihZzj0wImUDaYmvjG1kc9U6MgI1oBX9Th4M6H4/s400/sad.gif" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixRI-oI9xSWg77lMWU_28gg_oyHOsGhByA4UAosgVQ22KqJtWLqzBQs4mGG3g3TtbJnW1I9ooX3H-NRFebyYRJu_LIqmGPi7tQqFXV-_k0TnQ902PYmNQCK5fkZtbm9a3WDcloOPFoPWQ/s1600/travel.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="574" data-original-width="1272" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixRI-oI9xSWg77lMWU_28gg_oyHOsGhByA4UAosgVQ22KqJtWLqzBQs4mGG3g3TtbJnW1I9ooX3H-NRFebyYRJu_LIqmGPi7tQqFXV-_k0TnQ902PYmNQCK5fkZtbm9a3WDcloOPFoPWQ/s400/travel.gif" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhf4FNM9V5CxGikbUR8_40Cvs1x7HbjCrXMbJfQAm0LKvfJv4zrBW4qQoC9wKEuaihNsKmHOw-9WmGvyfdr06CeclbIR4S5xMX-BpG4bleqZu8uRwhGPh0TqEu8WuxzNlIXeuSq9FMZQMU/s1600/happy.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="574" data-original-width="1272" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhf4FNM9V5CxGikbUR8_40Cvs1x7HbjCrXMbJfQAm0LKvfJv4zrBW4qQoC9wKEuaihNsKmHOw-9WmGvyfdr06CeclbIR4S5xMX-BpG4bleqZu8uRwhGPh0TqEu8WuxzNlIXeuSq9FMZQMU/s400/happy.gif" width="400" /></a></div>
<br />
<br />
<br /></div>
Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com2tag:blogger.com,1999:blog-9185564337892058358.post-20529574166612443022016-10-10T12:32:00.000-07:002016-10-10T12:32:42.995-07:00Understanding Deep Learning as a Stack of Logistic Regression Models<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">So, I had an interesting realization today. I sat down to implement a multi-class classification system (the details of which shall remain classified). I was working with text data and there was no way to map it directly to one of the target classes. So I decided to build a series of classifiers: the first classifies at a broad level, and each subsequent one drills down to a more specific set of classes. Essentially it was like a chain of UNIX shell pipes: you take the output of one classifier, feed it to the next, and so on. For example, first I detect one of the broad classes, then move towards more specific ones, until I reach one of the leaf nodes of this tree of classes.</span></div>
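<div style="text-align: justify;">
A rough sketch of such a chain in Python follows. Everything in it is hypothetical: the class names are invented, and each "classifier" is a simple keyword rule standing in for a separately trained model. It only illustrates the drill-down from a broad class to a leaf.</div>

```python
# Each "classifier" is a stand-in: a plain function that maps text to a label.
# In practice each one would be a separately trained model.
def broad_classifier(text):
    return "question" if text.rstrip().endswith("?") else "statement"


def question_classifier(text):
    return "weather_question" if "weather" in text.lower() else "other_question"


def statement_classifier(text):
    return "greeting" if text.lower().startswith(("hi", "hello")) else "other_statement"


# Maps a label to the classifier that refines it; labels missing here are leaves.
NEXT_STAGE = {
    "question": question_classifier,
    "statement": statement_classifier,
}


def classify(text):
    """Walk from the broadest class down to a leaf, one classifier per level."""
    path = [broad_classifier(text)]
    while path[-1] in NEXT_STAGE:
        path.append(NEXT_STAGE[path[-1]](text))
    return path
```

<div style="text-align: justify;">
Calling <code>classify("What is the weather today?")</code> returns the whole path through the tree, so a mistake at a broad level is easy to spot when debugging.</div>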
<div style="text-align: justify;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">After I was done, I realized that the deep layered neural networks in vogue these days essentially do the same thing for you <i>automatically</i>. For example, a deep convolutional network for face recognition starts by detecting edges in its early layers, then moves on to detecting contours and curves, and then to more complex features. It all makes sense now :-D</span></div>
<div style="text-align: justify;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">Moral of the story: if you have a ton of data, just give it to a deep neural network and it will do all the feature engineering for you. If you don't have enough data, you need to do the feature engineering by hand and build a stack of classifiers, like I had to.</span></div>
</div>
Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-5579651753018308522016-05-07T10:25:00.000-07:002016-05-07T11:24:18.936-07:00Teaching Apropos to Rank - A Work in Progress<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Last month I deployed <a href="http://man-k.org/">man-k.org</a>, which is a web interface to NetBSD's apropos implementation. After that, I thought that I could try using machine learning to improve the ranking algorithm used by apropos. In this post, I describe how the ranking algorithm could be improved by machine learning, the challenges in the way, and the results obtained thus far.</span></span><br />
<br />
</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Before we start, I would like to say that the results shown here are nowhere close to the final output I want; this is still a work in progress. I will publish the code soon, when I feel the results are noteworthy. The data for these experiments is available on my <a href="https://github.com/abhinav-upadhyay/man-nlp-experiments" target="_blank">GitHub repo</a>. Now let's dive in :)</span></span><br />
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"></span></span><br />
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"></span></span></div>
<div style="text-align: justify;">
<h3>
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Apropos' Ranking Algorithm: </span></span></h3>
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"></span></span></div>
<div style="text-align: justify;">
<pre class="prettyprint"><span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><code>section_weights = {'NAME': 1.0, 'NAME_DESC': 0.9, 'DESCRIPTION': 0.5}
for each matched document in the result set:
    tf = idf = 0
    for each section in the document:
        tf = tf + compute_tf_for_section() * section_weights[section]
        idf = idf + compute_idf_for_section() * section_weights[section]
    document_score = (tf * idf) / (k + tf)  # k is a constant
</code>
</span></span></pre>
</div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">This algorithm generates a relevance score for each matching document, and the results are then sorted in decreasing order of this score. The algorithm multiplies the tf and idf values by a weight for each section. The values of these weights are hard-coded and were determined by running some arbitrary queries and evaluating the results. </span></span></div>
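<div style="text-align: justify;">
For readers who prefer runnable code, here is a small Python sketch of the same scoring scheme. It is an illustration and not the C code in apropos: the per-section (tf, idf) pairs are made-up inputs, the function names are hypothetical, and <code>k = 2.0</code> is an arbitrary value chosen for the example.</div>

```python
SECTION_WEIGHTS = {"NAME": 1.0, "NAME_DESC": 0.9, "DESCRIPTION": 0.5}
K = 2.0  # arbitrary constant for this example, not the value apropos uses


def document_score(section_stats, weights=SECTION_WEIGHTS, k=K):
    """section_stats maps a section name to its (tf, idf) pair for the query."""
    tf = idf = 0.0
    for section, (section_tf, section_idf) in section_stats.items():
        weight = weights[section]
        tf += section_tf * weight
        idf += section_idf * weight
    # Dividing by (k + tf) dampens the effect of very large tf values.
    return (tf * idf) / (k + tf)


def rank(documents):
    """Return document names sorted by decreasing score."""
    return sorted(documents, key=lambda name: document_score(documents[name]), reverse=True)


# Made-up per-section statistics for two documents.
docs = {
    "file(1)": {"DESCRIPTION": (2.0, 1.0)},
    "ls(1)": {"NAME": (1.0, 2.0), "DESCRIPTION": (3.0, 1.0)},
}
ranking = rank(docs)
```

<div style="text-align: justify;">
Because the section weights are plain numbers in a table, changing them changes the ranking without touching the rest of the algorithm, which is what makes them a natural target for learning.</div>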
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">How Can Machine Learning Help? </span></span></h3>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Ideally you would like to determine such weights by evaluating against a standard set of queries and measuring the <a href="https://en.wikipedia.org/wiki/Precision_and_recall" target="_blank">precision and recall</a> to see what values get the best results, but we didn't have any such dataset of <span style="font-family: monospace;">apropos</span> queries and output.</span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Thanks to <a href="http://man-k.org/">man-k.org</a>, I was able to get some data. I got close to 1000 queries and click results which I sifted through manually to remove anomalies, for example someone going to the last page and opening the result, or bot traffic.</span></span><br />
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Now, I wanted to learn the weights which I had hard coded to arbitrary values in the ranking algorithm above. This seemed like a straightforward regression problem to me. I will explain how.</span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">In machine learning, you use <a href="https://en.wikipedia.org/wiki/Linear_regression" target="_blank">regression</a> when you want to predict a value from a continuous range, for example the temperature or the price of a house. Usually you have a set of features in your data. In the housing example, you could have features like <i>number of bedrooms</i>, <i>area of the house</i> and <i>number of bathrooms</i>, and you want to use them to predict a target value, which is the <i>price of the house</i>. You need to combine these features in a way that relates them to the price of the house. One way to do that is to assign a weight to each feature and take their linear combination. For example:</span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<pre class="prettyprint"><span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">w1 * number_bedrooms + w2 * number_bathrooms + w3 * area = price</span></span></pre>
</div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Now, if we can determine the optimum value of these weights, we can predict the price of a house (to a good approximation), given these features. There are a number of algorithms out there which can learn these weights and use them to predict the output for you.</span></span></div>
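<div style="text-align: justify;">
As a small worked example, the weights in the formula above can be recovered with ordinary least squares. The sketch below uses only the Python standard library and solves the normal equations directly. The housing data is invented (generated from the weights 10, 5 and 2), the function name <code>fit_linear_weights</code> is hypothetical, and the code assumes the features are not collinear.</div>

```python
def fit_linear_weights(rows, targets):
    """Least-squares weights for target ~ sum(w_i * x_i), via the normal equations."""
    n = len(rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    # (assumes the features are not collinear, so no pivot is zero).
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Invented data: (bedrooms, bathrooms, area), with price = 10*bed + 5*bath + 2*area.
houses = [(2, 1, 8), (3, 2, 12), (4, 2, 15), (3, 1, 10), (5, 3, 20)]
prices = [41, 64, 80, 55, 105]
weights = fit_linear_weights(houses, prices)
```

<div style="text-align: justify;">
With noise-free data the fit recovers the generating weights up to floating point error. With real data it returns the weights that minimize the squared prediction error.</div>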
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">My problem of learning the section-wise weights was similar. My algorithm generates a <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf" target="_blank"><span style="font-family: monospace;">tf-idf</span> score</a> for each section, multiplies it by a weight, sums these up and calls the result the score for that document. So, I had the following:</span></span></div>
<ul style="text-align: left;">
<li><span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"> Features in the form of section wise scores</span></span></li>
<li><span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"> And I wanted to learn the weights for these features (section specific weights)</span></span></li>
</ul>
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Sounds like a machine learning problem.</span></span><br />
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"></span></span><br />
<h4 style="text-align: left;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Why Learn Weights Instead of Learning to Rank?</span></span></h4>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Learning to rank is a different problem and probably more interesting too. But I wanted to first tackle the problem of learning the weights, because if I can learn the optimum value of the weights, those can be directly used in NetBSD's apropos code and immediately improve its search.</span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"></span></span></div>
<h3 style="text-align: left;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">The Challenges</span></span></h3>
<div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">I didn't have a target value for my data. I couldn't use the output score of my ranking algorithm as the target as it is, because then the weights learned by the machine learning model would be the same as the current weights. So I decided to manufacture the target value.</span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">For each query in the data set, if the clicked result was not ranked first, I set its target score to the score of the document that was ranked first. For example, for the query "list files", ideally you want <span style="font-family: monospace;">ls(1)</span> at the top, but currently the top result is <span style="font-family: monospace;">file(1)</span>. So I would set the target value for <span style="font-family: monospace;">ls(1)</span> to the output score of <span style="font-family: monospace;">file(1)</span>. This way, the machine learning model would try to optimize the weights so as to move the score of <span style="font-family: monospace;">ls(1)</span> from its current value to that of <span style="font-family: monospace;">file(1)</span>.</span></span></div>
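<div style="text-align: justify;">
In code, manufacturing the target for one query could look roughly like the sketch below. The scores are invented and the function name <code>manufacture_targets</code> is hypothetical. It also assumes that documents which were not clicked simply keep their current score as the target, which is my simplification for the example.</div>

```python
def manufacture_targets(results, clicked):
    """Build target scores for one query.

    results: list of (document, score) pairs sorted by decreasing score.
    clicked: the document the user actually opened.
    The clicked document gets the score of the top-ranked document as its
    target; every other document keeps its current score (an assumption).
    """
    top_score = results[0][1]
    return {doc: (top_score if doc == clicked else score) for doc, score in results}


# Invented scores for the query "list files".
results = [("file(1)", 0.9), ("ls(1)", 0.6)]
targets = manufacture_targets(results, clicked="ls(1)")
```

<div style="text-align: justify;">
If the clicked document is already ranked first, its target equals its current score and the query adds no pressure to change the weights.</div>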
<div style="text-align: justify;">
<br /></div>
<span style="font-size: small;">I also had to do a lot of manual work of processing the data and writing code to get the section wise scores from apropos, but those details are not very important.</span></div>
<div>
</div>
<div>
<h3 style="text-align: left;">
<span style="font-size: small;">The Results So Far </span></h3>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">I tried out a few machine learning models, such as <a href="https://en.wikipedia.org/wiki/Linear_regression" target="_blank">linear regression</a>, <a href="https://en.wikipedia.org/wiki/Support_vector_machine" target="_blank">support vector machines</a> and <a href="https://en.wikipedia.org/wiki/Random_forest" target="_blank">random forests</a>, which are very popular models for this kind of problem. I used <a href="https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29" target="_blank">leave-one-out cross validation</a> to check that the models didn't overfit the data, and used <a href="https://en.wikipedia.org/wiki/Mean_squared_error" target="_blank">Mean Squared Error</a> as the evaluation metric. The best results were produced by the random forest model. I still need to tune the parameters of the SVM; maybe it will beat the random forest. I need to try that out, along with many other possibilities.</span></span></div>
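<div style="text-align: justify;">
To make the evaluation procedure concrete, here is a minimal leave-one-out cross validation loop in plain Python that reports the mean squared error. The model is deliberately tiny (a one-feature least-squares fit without intercept) and the data is invented. Both function names are hypothetical, and this is not the code used for the experiments.</div>

```python
def fit_slope(xs, ys):
    """Least-squares slope for y ~ w * x (a one-feature model without intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def leave_one_out_mse(xs, ys):
    """Hold out each sample once, fit on the rest, and average the squared errors."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        w = fit_slope(train_x, train_y)
        errors.append((ys[i] - w * xs[i]) ** 2)
    return sum(errors) / len(errors)


xs = [1.0, 2.0, 3.0, 4.0]
clean_mse = leave_one_out_mse(xs, [2.0, 4.0, 6.0, 8.0])  # data lies exactly on y = 2x
noisy_mse = leave_one_out_mse(xs, [2.0, 4.0, 6.0, 9.0])  # last point is off the line
```

<div style="text-align: justify;">
Every sample is predicted by a model that never saw it during fitting, which is why this technique suits small data sets like mine.</div>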
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Admittedly, this approach is not perfect: it is prone to abuse in the traffic to <a href="http://man-k.org/">man-k.org</a> and requires manual work, but at least it works as a validation of my idea. I am considering taking the data from <a href="http://man-k.org/">man-k.org</a>, manually refining and annotating it, and using it as a standard for further experiments.</span></span></div>
<br />
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">Following is a comparison of the results for some queries with the old weights and the new weights. It's not anything drastic, but the new weights seem to get rid of some of the non-relevant results from the top.</span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<pre class="prettyprint"><span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">apropos -n 10 -C fork #old weights
fork (2) create a new process
perlfork (1) Perls fork() emulation
cpu_lwp_fork (9) finish a fork operation
pthread_atfork (3) register handlers to be called when process forks
rlogind (8) remote login server
rshd (8) remote shell server
rexecd (8) remote execution server
script (1) make typescript of terminal session
moncontrol (3) control execution profile
vfork (2) spawn new process in a virtual memory efficient way
apropos -n 10 -C fork #new weights
fork (2) create a new process
perlfork (1) Perls fork() emulation
cpu_lwp_fork (9) finish a fork operation
pthread_atfork (3) register handlers to be called when process forks
vfork (2) spawn new process in a virtual memory efficient way
clone (2) spawn new process with options
daemon (3) run in the background
script (1) make typescript of terminal session
openpty (3) tty utility functions
rlogind (8) remote login server
</span></span></pre>
</div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">The new weights seem to bring more relevant results up: for example, clone(2) shows up, rshd(8) and rexecd(8) go away, and rlogind(8) moves down.</span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<pre class="prettyprint"><span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">apropos -n 10 -C create new process #old weights
init (8) process control initialization
fork (2) create a new process
fork1 (9) create a new process
timer_create (2) create a per-process timer
getpgrp (2) get process group
supfilesrv (8) sup server processes
posix_spawn (3) spawn a process
master (8) Postfix master process
popen (3) process I/O
_lwp_create (2) create a new light-weight process
apropos -n 10 -C create new process #new weights
fork (2) create a new process
fork1 (9) create a new process
_lwp_create (2) create a new light-weight process
pthread_create (3) create a new thread
clone (2) spawn new process with options
timer_create (2) create a per-process timer
UI_new (3) New User Interface
init (8) process control initialization
posix_spawn (3) spawn a process
master (8) Postfix master process
</span></span></pre>
</div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">As you can see, fork(2) moves to number 1, init(8) moves down to number 8, clone(2) appears, and so on.</span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<pre class="prettyprint"><span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">apropos -n 10 -C remove packages #old weights
groff_mdoc (7) reference for groffs mdoc implementation
pkg_add (1) a utility for installing and upgrading software package distributions
pkg_create (1) a utility for creating software package distributions
pkg_delete (1) a utility for deleting previously installed software package distributions
deroff (1) remove nroff/troff, eqn, pic and tbl constructs
pkg_admin (1) perform various administrative tasks to the pkg system
groff_tmac (5) macro files in the roff typesetting system
ci (1) check in RCS revisions
update\-binfmts (8) maintain registry of executable binary formats
rpc_svc_reg (3) library routines for registering servers
apropos -n 10 -C remove packages #new weights
pkg_create (1) a utility for creating software package distributions
pkg_add (1) a utility for installing and upgrading software package distributions
pkg_delete (1) a utility for deleting previously installed software package distributions
deroff (1) remove nroff/troff, eqn, pic and tbl constructs
groff_mdoc (7) reference for groffs mdoc implementation
groff_tmac (5) macro files in the roff typesetting system
ci (1) check in RCS revisions
pkg_admin (1) perform various administrative tasks to the pkg system
update\-binfmts (8) maintain registry of executable binary formats
rpc_svc_reg (3) library routines for registering servers
</span></span></pre>
</div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;">pkg_create moves to number 1 and pkg_delete moves up, although I think pkg_admin should have ranked higher.</span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-size: small;"><span style="font-family: "verdana" , sans-serif;"><br /></span></span></div>
</div>
</div>
Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-83101934810753578012011-10-04T10:10:00.000-07:002011-12-15T03:26:11.835-08:00Spell Corrector for Apropos<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="font-family: Verdana,sans-serif; text-align: justify;">
One of the new features I wrote for apropos is a basic yet reasonably effective spell corrector. While working on apropos, one big nuisance I noticed was misspelled keywords in the query. When you support full text search, I guess spell correction is a usual expectation as well.</div>
<div style="text-align: justify;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
The search in apropos is based on the Boolean search model, which means that it returns only those documents which contain all the keywords mentioned in the query. So if you misspell even one keyword, you will get either only non-relevant search results or no results at all. This behaviour is different from the way the conventional apropos did its search: it would return all the results which matched even one of the keywords.</div>
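<div style="font-family: Verdana,sans-serif; text-align: justify;">
The difference between the two matching models is easy to see in a toy Python sketch. The three-entry corpus and the <code>search</code> function below are hypothetical and only illustrate AND versus OR matching. The real apropos does a full text search over the man pages.</div>

```python
# A toy corpus of man page names and one-line descriptions.
DOCS = {
    "fork(2)": "create a new process",
    "ls(1)": "list directory contents",
    "kill(1)": "terminate or signal a process",
}


def search(query, docs=DOCS, require_all=True):
    """Return names of docs matching all (AND) or any (OR) of the query words."""
    words = query.lower().split()
    combine = all if require_all else any
    return [
        name
        for name, text in docs.items()
        if combine(word in text.split() for word in words)
    ]


exact = search("new process")                          # AND, correct spelling
misspelled = search("new proces")                      # AND, one typo
lenient = search("new proces", require_all=False)      # OR, same typo
```

<div style="font-family: Verdana,sans-serif; text-align: justify;">
With AND matching, the single typo in "proces" empties the result list, while OR matching still finds fork(2) through the word "new".</div>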
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
The user might think that "this new apropos is useless, can't get me any right results." Then he would most likely start experimenting by changing keywords and he might or might not succeed. The point is, apropos should be clever enough to inform the user that probably he misspelled one or more keywords in the query, so that the user doesn't waste time scratching his head.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-size: large;"><b>Implementation Of The Spell Corrector:</b></span> <span style="font-family: Verdana, sans-serif;">Writing an industrial-strength spell corrector (like that of Google, Bing, etc.) is a complex task and I have no idea about their intricacies. I was looking for a fairly basic implementation. I came across two write-ups which discussed the implementation of a relatively simple spell checker. One was by </span><a href="http://en.wikipedia.org/wiki/Jon_Bentley" style="font-family: Verdana,sans-serif;" target="_blank">Jon Bentley</a><span style="font-family: Verdana, sans-serif;"> in his famous book </span><a href="http://cm.bell-labs.com/cm/cs/pearls/" style="font-family: Verdana,sans-serif;" target="_blank">Programming Pearls</a><span style="font-family: Verdana, sans-serif;"> and the second was from <a href="http://www.norvig.com/" target="_blank">Prof. Peter Norvig</a> in his famous post "</span><a href="http://norvig.com/spell-correct.html" style="font-family: Verdana,sans-serif;" target="_blank">How to write a spell corrector</a><span style="font-family: Verdana, sans-serif;">". I decided to go with Peter Norvig's implementation because of its simplicity and ease of implementation. Before continuing, <b>I would like to thank Prof. Norvig for writing such an insightful article and sharing it :-) </b></span></div>
<div style="font-family: Verdana,sans-serif;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
I highly recommend reading Prof. Norvig's article to properly understand the maths and logic involved. Here I am going to give some insight into what his Python code is doing and then present a C translation of the code, with a demo. </div>
<div style="font-family: Verdana,sans-serif;">
<br /></div>
<div style="font-family: Verdana,sans-serif;">
</div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
The idea is to find the word at the least edit distance from the word being checked. Edit distance here means the number of single-character edits (adding a character, removing one, replacing one, or swapping two adjacent ones) needed to turn the given word into the correctly spelled word. Peter Norvig mentions in his post that edit distance 1 is sufficient in 80-95% of cases.</div>
<br />
<div style="font-family: Verdana,sans-serif; text-align: justify;">
The strategy for finding words at edit distance 1 is very simple. Four different kinds of mistakes can lead to a misspelled word at edit distance 1. These are:</div>
<ol style="font-family: Verdana,sans-serif; text-align: left;">
<li style="text-align: justify;"><b>Deletion:</b> You missed a character while typing the word. For example: "speling" instead of "spelling".</li>
<li style="text-align: justify;"><b>Transposition:</b> You exchanged the positions of two adjacent characters in the word. For example: "teh" instead of "the".</li>
<li style="text-align: justify;"><b>Replace:</b> You replaced a letter in the word with some other letter (possibly you pressed the wrong key on the keyboard). For example: "produkt" instead of "product".</li>
<li style="text-align: justify;"><b>Insertions:</b> You entered one additional letter in the spelling of the word. For example: "filles" when you mean "files", or "dapple" when you mean "apple".</li>
</ol>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
I will take a simple example and show all its possible permutations at edit distance 1. Let's say we misspelled "the" as "teh"; then the following are the different possible permutations:</div>
<br />
<pre class="prettyprint"><code>deletes = ['eh', 'th', 'te']
transpose = ['eth', 'the']
# the replaces and inserts lists are abbreviated, but you get the idea
replaces = ['aeh', 'beh', 'ceh', 'deh', 'eeh', 'feh', ..., 'zeh',
'tah', 'tbh', 'tch', 'tdh', 'teh', 'tfh', ..., 'tzh',
'tea', 'teb', 'tec', 'ted', 'tee', 'tef', ..., 'tez']
inserts = ['ateh', 'bteh', 'cteh', 'dteh', 'eteh', 'fteh', ..., 'zteh',
'taeh', 'tbeh', 'tceh', 'tdeh', 'teeh', 'tfeh', ..., 'tzeh',
'teah', 'tebh', 'tech', 'tedh', 'teeh', 'tefh', ..., 'tezh',
'teha', 'tehb', 'tehc', 'tehd', 'tehe', 'tehf', ..., 'tehz']
</code>
</pre>
</div>
<br />
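<div style="font-family: Verdana,sans-serif; text-align: justify;">
To get a feel for how many strings this produces, here is a small sketch that adds up the four categories for a word of length n: n deletes, n - 1 transposes, 26 * n replaces and 26 * (n + 1) inserts. The helper name <code>candidate_count</code> is made up for this illustration and is not part of apropos. The total includes duplicates, such as 'teeh', which shows up twice in the inserts above.</div>

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of strings at edit distance 1 from a word of length n
 * (duplicates included), assuming a lowercase a-z alphabet.
 * Illustrative helper only, not part of apropos.
 */
static size_t
candidate_count(size_t n)
{
    assert(n >= 1);    /* the n - 1 term below needs a non-empty word */

    size_t deletes = n;               /* drop one of the n characters */
    size_t transposes = n - 1;        /* swap one of n - 1 adjacent pairs */
    size_t replaces = 26 * n;         /* 26 letters at each of n positions */
    size_t inserts = 26 * (n + 1);    /* 26 letters at each of n + 1 gaps */

    return deletes + transposes + replaces + inserts;
}
```

<div style="font-family: Verdana,sans-serif; text-align: justify;">
For "teh" (n = 3) this gives 3 + 2 + 78 + 104 = 187 strings to look up in the dictionary.</div>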
<div style="font-family: Verdana,sans-serif;">
Once we have generated all these possible permutations of the word at edit distance 1, we check in our dictionary which of them are real, valid words. More than one of these permutations may be a valid word in the dictionary, in which case we pick the word which occurs most frequently in the sample corpus used for building the dictionary (this is the training model used for this spell corrector).<br />
<br />
I suppose that explains what we need to do. Now time for some code:</div>
<i style="font-family: Verdana,sans-serif;"><span style="font-size: x-small;"><b>NOTE</b>: The following is a C implementation of Peter Norvig's spell corrector. I wrote it from scratch and it is part of the apropos_replacement project, licensed under the two-clause BSD license.</span></i></div>
<pre class="prettyprint"><code>
/*-
* Copyright (c) 2011 Abhinav Upadhyay <er.abhinav.upadhyay@gmail.com>
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
* ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
* FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
* COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
* INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
* AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*/
#include <assert.h>
#include <err.h>
#include <stdlib.h>
#include <string.h>
#include <sqlite3.h>
#include <util.h>    /* emalloc(3), erealloc(3), easprintf(3) on NetBSD */

/* concat(), lower() and BUFLEN are defined elsewhere in the project. */

/*
 * The two halves of a word split at some position. The listing uses this
 * type without showing it; this definition is assumed from the usage below.
 */
typedef struct set {
    char *a;
    char *b;
} set;

/*
 * edits1--
 *  Generates all the strings at edit distance 1 from word. The caller owns
 *  the returned array and every string in it.
 */
static char **
edits1(char *word)
{
    int i;
    int len_a;
    int len_b;
    int counter = 0;
    char alphabet;
    int n = strlen(word);
    set splits[n + 1];

    /*
     * Calculate the number of possible permutations and allocate memory.
     * The count assumes a non-empty word (n >= 1).
     */
    size_t size = n + n - 1 + 26 * n + 26 * (n + 1);
    char **candidates = emalloc(size * sizeof(char *));

    /* Start by generating a split up of the characters in the word */
    for (i = 0; i < n + 1; i++) {
        splits[i].a = (char *) emalloc(i + 1);
        splits[i].b = (char *) emalloc(n - i + 1);
        memcpy(splits[i].a, word, i);
        memcpy(splits[i].b, word + i, n - i + 1);
        splits[i].a[i] = 0;
    }

    /* Now generate all the permutations at maximum edit distance of 1.
     * counter keeps track of the current index position in the array candidates
     * where the next permutation needs to be stored.
     */
    for (i = 0; i < n + 1; i++) {
        len_a = strlen(splits[i].a);
        len_b = strlen(splits[i].b);
        assert(len_a + len_b == n);

        /* Deletes: drop the first character of b */
        if (i < n) {
            candidates[counter] = emalloc(n);
            memcpy(candidates[counter], splits[i].a, len_a);
            if (len_b - 1 > 0)
                memcpy(candidates[counter] + len_a,
                    splits[i].b + 1, len_b - 1);
            candidates[counter][n - 1] = 0;
            counter++;
        }

        /* Transposes: swap the first two characters of b */
        if (i < n - 1) {
            candidates[counter] = emalloc(n + 1);
            memcpy(candidates[counter], splits[i].a, len_a);
            if (len_b >= 1) {
                memcpy(candidates[counter] + len_a, splits[i].b + 1, 1);
                memcpy(candidates[counter] + len_a + 1, splits[i].b, 1);
            }
            if (len_b >= 2)
                memcpy(candidates[counter] + len_a + 2,
                    splits[i].b + 2, len_b - 2);
            candidates[counter][n] = 0;
            counter++;
        }

        /* For replaces and inserts, run a loop from 'a' to 'z' */
        for (alphabet = 'a'; alphabet <= 'z'; alphabet++) {
            /* Replaces: overwrite the first character of b */
            if (i < n) {
                candidates[counter] = emalloc(n + 1);
                memcpy(candidates[counter], splits[i].a, len_a);
                memcpy(candidates[counter] + len_a, &alphabet, 1);
                if (len_b - 1 >= 1)
                    memcpy(candidates[counter] + len_a + 1,
                        splits[i].b + 1, len_b - 1);
                candidates[counter][n] = 0;
                counter++;
            }

            /* Inserts: put the new character between a and b */
            candidates[counter] = emalloc(n + 2);
            memcpy(candidates[counter], splits[i].a, len_a);
            memcpy(candidates[counter] + len_a, &alphabet, 1);
            if (len_b >= 1)
                memcpy(candidates[counter] + len_a + 1, splits[i].b, len_b);
            candidates[counter][n + 1] = 0;
            counter++;
        }
    }

    /* The splits were only scratch space, so release them. */
    for (i = 0; i < n + 1; i++) {
        free(splits[i].a);
        free(splits[i].b);
    }
    return candidates;
}
/*
 * known_word--
 *  Pass an array of strings to this function and it will return the word with
 *  maximum frequency in the dictionary. If no word in the array list is found
 *  in the dictionary, it returns NULL
 *  NOTE: The dictionary in our case is a table in the db with two fields:
 *  term, occurrences
 */
static char *
known_word(sqlite3 *db, char **list, int n)
{
    int i, rc;
    char *sqlstr;
    char *termlist = NULL;
    char *correct = NULL;
    sqlite3_stmt *stmt;

    /* Build termlist: a comma separated list of all the words in the list for
     * use in the SQL query later. The words are spliced into the SQL text
     * verbatim, so this is only safe while they contain no quote characters.
     */
    int total_len = BUFLEN * 20;    /* total bytes allocated to termlist */
    termlist = emalloc(total_len);
    int offset = 0;    /* Next byte to write at in termlist */
    termlist[0] = '(';
    offset++;

    for (i = 0; i < n; i++) {
        int d = strlen(list[i]);
        /*
         * Each term needs up to d + 3 bytes (two quotes and a comma). Keep
         * 2 more in reserve for the closing ")" and the terminating NUL.
         * Double the recorded size first, then allocate exactly that much,
         * so that total_len never exceeds the real allocation.
         */
        while (total_len - offset < d + 5) {
            total_len *= 2;
            termlist = erealloc(termlist, total_len);
        }
        termlist[offset++] = '\'';
        memcpy(termlist + offset, list[i], d);
        offset += d;
        termlist[offset++] = '\'';
        if (i != n - 1)
            termlist[offset++] = ',';
    }
    /* Room for these two bytes (")" and NUL) was reserved above. */
    memcpy(termlist + offset, ")", 2);

    easprintf(&sqlstr, "SELECT term FROM metadb.dict WHERE "
        "occurrences = (SELECT MAX(occurrences) from metadb.dict "
        "WHERE term IN %s) AND term IN %s", termlist, termlist);
    rc = sqlite3_prepare_v2(db, sqlstr, -1, &stmt, NULL);
    if (rc != SQLITE_OK) {
        warnx("%s", sqlite3_errmsg(db));
        /* Don't leak the query strings on the error path. */
        free(sqlstr);
        free(termlist);
        return NULL;
    }

    if (sqlite3_step(stmt) == SQLITE_ROW)
        correct = strdup((char *) sqlite3_column_text(stmt, 0));

    sqlite3_finalize(stmt);
    free(sqlstr);
    free(termlist);
    return (correct);
}
/*
 * free_list--
 *  Frees the n strings in list and then the array itself.
 */
static void
free_list(char **list, int n)
{
    int i = 0;
    if (list == NULL)
        return;

    while (i < n) {
        free(list[i]);
        i++;
    }
    free(list);
}
/*
 * spell--
 *  The API exposed to the user. Returns the most closely matched word from the
 *  dictionary. It will first search for all possible words at distance 1, if no
 *  matches are found, it goes further and tries to look for words at edit
 *  distance 2 as well. If no matches are found at all, it returns NULL.
 */
char *
spell(sqlite3 *db, char *word)
{
    int i;
    char *correct;
    char **candidates;
    int count2 = 0;
    char **cand2 = NULL;
    int n;
    int count;

    lower(word);

    /* If this word already exists in the dictionary then no need to go further */
    correct = known_word(db, &word, 1);

    if (!correct) {
        n = strlen(word);
        /* edits1() assumes a non-empty word */
        if (n == 0)
            return NULL;
        count = n + n - 1 + 26 * n + 26 * (n + 1);
        candidates = edits1(word);
        correct = known_word(db, candidates, count);
        /* No matches found ? Let's go further and find matches at edit distance 2.
         * To make the search fast we use a heuristic. Take one word at a time from
         * candidates, generate its permutations and look if a match is found.
         * If a match is found, exit the loop. Works reasonably fast but accuracy
         * is not quite there in some cases.
         */
        if (correct == NULL) {
            for (i = 0; i < count; i++) {
                n = strlen(candidates[i]);
                /* Deleting from a one-letter word leaves "", skip it */
                if (n == 0)
                    continue;
                count2 = n + n - 1 + 26 * n + 26 * (n + 1);
                cand2 = edits1(candidates[i]);
                if ((correct = known_word(db, cand2, count2)))
                    break;
                else {
                    free_list(cand2, count2);
                    cand2 = NULL;
                }
            }
        }
        free_list(candidates, count);
        free_list(cand2, count2);
    }
    return correct;
}</code></pre>
</div>
</div>
<br />
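<div style="font-family: Verdana,sans-serif; text-align: justify;">
The selection step that known_word() delegates to SQLite can also be sketched without a database. The snippet below is only an illustration under assumptions: the <code>dict_entry</code> struct and the <code>most_frequent_known</code> function are hypothetical stand-ins for the (term, occurrences) table, not code from the project. It shows the rule described earlier: among the candidates that are real words, pick the one with the highest occurrence count.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the (term, occurrences) table. */
struct dict_entry {
    const char *term;
    int occurrences;
};

/*
 * Return the dictionary term with the highest occurrence count among the
 * candidates, or NULL if none of the candidates is in the dictionary.
 */
static const char *
most_frequent_known(const struct dict_entry *dict, size_t ndict,
    const char *const *cands, size_t ncands)
{
    const char *best = NULL;
    int best_count = -1;

    assert(dict != NULL && cands != NULL);
    for (size_t i = 0; i < ncands; i++) {
        for (size_t j = 0; j < ndict; j++) {
            if (strcmp(cands[i], dict[j].term) == 0 &&
                dict[j].occurrences > best_count) {
                best = dict[j].term;
                best_count = dict[j].occurrences;
            }
        }
    }
    return best;
}
```

<div style="font-family: Verdana,sans-serif; text-align: justify;">
The nested loop is fine for a sketch, but with hundreds of candidates per word a real dictionary wants an indexed lookup, which is what the SQL query above gets from the database.</div>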
<span style="font-size: large;"><b>Demo:</b></span><br />
<div style="font-family: Verdana,sans-serif;">
Following are some sample runs of apropos:</div>
</div>
<pre class="prettyprint"><code>
$ ./apropos "funckiton for coping stings"
Did you mean "function for copying strings" ?
$ ./apropos "generat termcap databse"
Did you mean "generate termcap database" ?
$ ./apropos idcmp
Did you mean "icmp" ?
$ ./apropos "confguire kernal"
Did you mean "configure kernel" ?
$ ./apropos "packate fillter"
Did you mean "package filter" ?
$ ./apropos reeltek
Did you mean "realtek" ?
</code>
</pre>
</div>
<div style="font-family: Verdana,sans-serif;">
<b>Following are some screenshots of apropos_cgi (a CGI version of apropos for browsers):</b></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiob93SA2ePlTGq0OAv8YoRNSdvSQb4R27LbUnGGNI4KG-Uy2HQ5TruuJuXPZWz1XyLtsXzHKWMrYsosSjQhQwceaOxxqQSbGuJzBxk-ug_XikD0PB3wFv3gryFX1fQOLnfKDcPrlpskY8/s1600/foopen.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiob93SA2ePlTGq0OAv8YoRNSdvSQb4R27LbUnGGNI4KG-Uy2HQ5TruuJuXPZWz1XyLtsXzHKWMrYsosSjQhQwceaOxxqQSbGuJzBxk-ug_XikD0PB3wFv3gryFX1fQOLnfKDcPrlpskY8/s640/foopen.png" width="640" /></a></div>
<br />
<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3WNXXWfnQAJI5CsH4cxFDQoTJEIMkG1NO31fVDmDtoGhYg9vSNzf8Mohx-OByed4XnUfX4xJl4gNExfY0k8nH1NbjwFaVmdM34qMZOpPJN5sDy8qz5SpYIyJ65y8BZCaau5UT8wQuzxY/s1600/databoose.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="356" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3WNXXWfnQAJI5CsH4cxFDQoTJEIMkG1NO31fVDmDtoGhYg9vSNzf8Mohx-OByed4XnUfX4xJl4gNExfY0k8nH1NbjwFaVmdM34qMZOpPJN5sDy8qz5SpYIyJ65y8BZCaau5UT8wQuzxY/s640/databoose.png" width="640" /></a></div>
<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_I0bIVWDpzEf7bVkJndZE1a2kRGd5yIPlNA3kofKt-RIJoeZZG6f_ih2Q3Gj_VYecMvqohsKnbCfHend3S-3K8-4xcx2GzBwi99JQNES2fKrLuU6ph63cXt3ecjoDbcrb_pGoOaZQdW4/s1600/dns.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="356" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_I0bIVWDpzEf7bVkJndZE1a2kRGd5yIPlNA3kofKt-RIJoeZZG6f_ih2Q3Gj_VYecMvqohsKnbCfHend3S-3K8-4xcx2GzBwi99JQNES2fKrLuU6ph63cXt3ecjoDbcrb_pGoOaZQdW4/s640/dns.png" width="640" /></a></div>
<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUCFoihdJRlq4haUyt8oH1yjm3sQIzgRx669q_jZHrWwiqpKPicCfmjLQ8A0YYgQCshNyNzF-sZ8kK2e4HdlVb0d5ZP1dvTLhwFrF-W3oURgu5BFCunKUEJhVFgM9XoWL0LCTIPhtFlWw/s1600/reeltak.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="355" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUCFoihdJRlq4haUyt8oH1yjm3sQIzgRx669q_jZHrWwiqpKPicCfmjLQ8A0YYgQCshNyNzF-sZ8kK2e4HdlVb0d5ZP1dvTLhwFrF-W3oURgu5BFCunKUEJhVFgM9XoWL0LCTIPhtFlWw/s640/reeltak.png" width="640" /></a></div>
<br />
<br />
<div style="text-align: justify;">
<span style="font-size: large;"><b>Further Scope:</b></span> <span style="font-family: Verdana, sans-serif;">There are a few technical glitches in integrating this spell corrector with apropos, so those need to be sorted out. The suggestions are not always as expected, so the model for the spell corrector probably needs to be fine-tuned (as Peter Norvig discusses at the end of his article). And while writing this post, it occurred to me that this implementation could make a fine small-scale backend for an auto-completion feature in a web application (for example the apropos CGI above). ;-)</span></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
All this code is in the <a href="https://github.com/abhinav-upadhyay/apropos_replacement/tree/demo-spell">demo-spell</a> and <a href="https://github.com/abhinav-upadhyay/apropos_replacement/tree/exp-spell">exp-spell</a> branches of the project on GitHub. </div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
I am not sure if anyone would read this far, but thanks anyway for reading and taking interest. :-)</div>
</div>Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com1tag:blogger.com,1999:blog-9185564337892058358.post-74084260165788589652011-10-03T11:32:00.000-07:002011-10-05T12:44:40.494-07:00Improvements to makemandb<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="font-family: Verdana,sans-serif; text-align: justify;">
Over a month has passed since GSoC finished, and I have made some improvements and introduced new (experimental) features in apropos. I wanted to write about a few of the things I did in the last month.</div>
<div style="font-family: Verdana,sans-serif;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<span style="font-size: large;"><b>Indexing Additional metadata For Faster Update Operations:</b></span> Previously makemandb maintained md5 hashes of all the pages indexed. On each run, makemandb would read all the man pages, generate their md5 hashes and compare those with the hashes it already had in its index. It would then parse and store the pages whose md5 hash it did not find in the database, since those are the new or modified pages that need (re)indexing.</div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
Joerg pointed out that this wasn't a very efficient approach: it required reading every man page unnecessarily. He suggested indexing more metadata about the man page files, such as the mtime, device id and inode number. So rather than reading the pages and generating their md5, makemandb would do a stat(2) on them, read their {device id, inode, mtime} and check whether a matching triplet exists in the database to decide whether the page needs to be indexed. This is a more efficient approach when you are updating the index after installing some new man pages or updating a few of the existing ones. When you are building the index from scratch, though, doing a stat(2) for every page is just overhead, since every page has to be parsed anyway.</div>
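<div style="font-family: Verdana,sans-serif; text-align: justify;">
A minimal sketch of that check, assuming the stored triplet has already been fetched from the database: the <code>file_id</code> struct and both function names are hypothetical and are not the ones used in makemandb; only stat(2) and its st_dev, st_ino and st_mtime fields are real.</div>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical record of what the index remembers about one man page file. */
struct file_id {
    dev_t device;
    ino_t inode;
    time_t mtime;
};

/* Fill out from stat(2); returns false if the file cannot be examined. */
static bool
read_file_id(const char *path, struct file_id *out)
{
    struct stat sb;

    assert(path != NULL && out != NULL);
    if (stat(path, &sb) != 0)
        return false;
    out->device = sb.st_dev;
    out->inode = sb.st_ino;
    out->mtime = sb.st_mtime;
    return true;
}

/*
 * A page needs (re)indexing when the index has no record for it (stored is
 * NULL) or when any member of the stored triplet differs from the current one.
 */
static bool
needs_indexing(const struct file_id *stored, const struct file_id *current)
{
    if (stored == NULL)
        return true;
    return stored->device != current->device ||
        stored->inode != current->inode ||
        stored->mtime != current->mtime;
}
```

<div style="font-family: Verdana,sans-serif; text-align: justify;">
The point of the design is that an unchanged page costs one stat(2) call and one lookup instead of a full read plus a hash computation.</div>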
<br />
<br />
<br />
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<span style="font-size: large;"><b>Faster makemandb With Clever Memory Management:</b></span> Due to the above-mentioned changes in makemandb, its runtime had more than doubled. Earlier makemandb could build an index for 8000+ pages in 30-40 seconds, but now it was taking 130-150 seconds to do the same job. The changes which made makemandb slow were necessary and could not be undone, so I had to identify other areas where it could do better.</div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
As it turns out, makemandb was managing memory very poorly. One operation it performs very frequently is concatenating two strings, one containing previously parsed data from the man page and the other containing newly parsed data. This kind of string manipulation is always tedious in C. The most straightforward way is to call realloc(3) to allocate sufficient space to hold the contents of the new string and then copy the new string to the end of the old one. I had a function concat() which did just that. An average-length man page could need well over 100 calls to concat(), which for 8000+ pages adds up to a very large number of calls to malloc/realloc, and as the string containing the already parsed data grows, the realloc calls get even more expensive. So clearly this was the bottleneck which needed to be fixed.</div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<b>Solution:</b> The solution was very simple. Instead of doing memory allocations every time a new string needs to be concatenated, pre-allocate a large chunk of memory and keep writing to it until you fall short of space, in which case, just reallocate another large chunk and proceed as usual. This would reduce the calls to malloc from 100+ to around 10+ for a single page.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
</div>
<pre class="prettyprint"><code>
/*
 * A data structure for holding section specific data.
 */
typedef struct secbuff {
    char *data;
    int buflen;    /* Total length of the allocated buffer */
    int offset;    /* Position of next byte to write at */
} secbuff;

static void
append(secbuff *sbuff, const char *src, int srclen)
{
    short flag = 0;
    assert(src != NULL);
    assert(sbuff->buflen > 0);
    if (srclen == -1)
        srclen = strlen(src);

    if (sbuff->data == NULL) {
        sbuff->data = (char *) emalloc(sbuff->buflen);
        sbuff->offset = 0;
    }

    /*
     * Grow by doubling until src and the separating space fit. Double
     * buflen first and allocate exactly that much, so that buflen never
     * exceeds the real allocation.
     */
    while ((srclen + 2) >= (sbuff->buflen - sbuff->offset)) {
        sbuff->buflen *= 2;
        sbuff->data = (char *) erealloc(sbuff->data, sbuff->buflen);
        flag = 1;
    }

    /* Append a space at the end of the buffer */
    if (sbuff->offset || flag) {
        memcpy(sbuff->data + sbuff->offset, " ", 1);
        sbuff->offset++;
    }

    /* Now, copy src at the end of the buffer */
    memcpy(sbuff->data + sbuff->offset, src, srclen);
    sbuff->offset += srclen;
    return;
}
</code></pre>
</div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
The secbuff data structure keeps track of the offset in the data buffer where the next character needs to be written. In this way, I can allocate a sufficiently large chunk of memory for a buffer and simply use memcpy to write the new data at its end.
This approach brings a large speed improvement to makemandb. The runtime has dropped from 130+ seconds to around 45 seconds.</div>
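<div style="font-family: Verdana,sans-serif; text-align: justify;">
Why so few reallocations are needed is easy to see with a little arithmetic. The sketch below is only an illustration: the helper <code>doublings_needed</code> is hypothetical and not part of makemandb. It counts how many times a buffer that doubles on every growth has to be reallocated before it can hold a given amount of data.</div>

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of times a buffer starting at initial bytes must double before it
 * can hold total bytes. Illustrative helper only; it does not guard against
 * size_t overflow for absurdly large totals.
 */
static unsigned
doublings_needed(size_t initial, size_t total)
{
    unsigned grows = 0;
    size_t cap = initial;

    assert(initial > 0);
    while (cap < total) {
        cap *= 2;
        grows++;
    }
    return grows;
}
```

<div style="font-family: Verdana,sans-serif; text-align: justify;">
For example, a buffer that starts at 1 KB and ends up holding 64 KB of text is reallocated only 6 times, no matter how many appends it took to get there.</div>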
<br /></div>
Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-2313647378745575382011-08-23T09:20:00.000-07:002011-11-13T00:00:08.511-08:00Final Report: NetBSD GSoC-2011 Apropos Replacement<div dir="ltr" style="text-align: left;" trbidi="on">
<div>
<div style="font-family: Verdana,sans-serif;">
This is the final status report for this project for this summer. I will try to summarise what the project was all about and what goals have been achieved.</div>
<div style="font-family: Verdana,sans-serif;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<span style="font-size: large;"><b>Objective Of The Project</b></span>: We are all well aware of the importance of quality documentation, and Unix-like operating systems ship with documentation in the form of manual pages (man pages for short). They have been there since the early days to aid system administrators as well as programmers in configuring and modifying the system. A very basic search utility was provided in the form of apropos(1). I am not aware of the history of apropos(1), but my guess is that it was kept simple because of the hardware limitations of the day. There is usually a plain text file (typically whatis.db) which indexes the NAME section of the man pages, and apropos(1) simply searches this file for the keywords specified by the user on the command line.</div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
Today we have much more powerful machines, and research in information retrieval has progressed a lot. So the objective of this project was to develop a replacement for apropos(1) that is a complete search tool. The aim was to index the complete content of the man pages and develop a tool to search that index.</div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
Full text search of the man pages was the main task, but it also brought a few other small benefits to NetBSD. I will discuss them as I go along.</div>
<div style="font-family: Verdana,sans-serif;">
<br /></div>
<div style="font-family: Verdana,sans-serif;">
<span style="font-size: large;"><b>Implementation Strategy:</b></span></div>
<div style="font-family: Verdana,sans-serif;">
<br /></div>
<div style="font-family: Verdana,sans-serif;">
The implementation had largely two main stages:</div>
<ol style="font-family: Verdana,sans-serif; text-align: justify;">
<li><b>Parsing and Indexing Man Pages</b>: The starting point of the project was to develop a utility which would parse all the man pages and build a full text search index. For parsing the man pages I used the libmandoc parser from the <a href="http://mdocml.bsd.lv/">mdocml</a> project, which is a really excellent and innovative project. For the full text search I used the <a href="http://sqlite.org/fts3.html">FTS engine</a> of <a href="http://sqlite3.org/">SQLite</a>. I used these tools to develop makemandb, which traverses the set of directories containing man pages, parses them, extracts the relevant data and stores it in an FTS database.</li>
<li><b>Search</b>: Once the FTS index was there, the next thing to do was to develop a tool to search the database and a mechanism to rank the search results. A large part of the project involved experimenting to find out which ranking techniques worked well and which did not. This stage led to the development of a new version of apropos.</li>
</ol>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
Deliverables: The project resulted in the following deliverables:</div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<br /></div>
<ul style="text-align: left;">
<li style="font-family: Verdana,sans-serif; text-align: justify;"> makemandb: It is a command line utility to parse the man pages installed on the system and build a full text search database as described above. </li>
<li style="font-family: Verdana,sans-serif; text-align: justify;">An option ('p') to man(1) to print the list of directories containing the man page sources.</li>
<li style="font-family: Verdana,sans-serif; text-align: justify;">apropos: This is the 2nd command line utility to search the FTS database. It sanitizes the user query, removes any stopwords from it, executes the search, and ranks the results in decreasing order of their weights which are computed on the basis of a term-frequency and inverse-document-frequency based algorithm.</li>
<li style="font-family: Verdana,sans-serif; text-align: justify;">An API: I developed a very small API (consisting of 4 functions or so) to allow building custom interfaces. For example in future it will be easy to build a CGI application on top of this code, or maybe a desktop GUI client.</li>
<li style="font-family: Verdana,sans-serif; text-align: justify;">Documentation: I have also developed detailed documentation in the form of manual pages. </li>
<li style="font-family: Verdana,sans-serif; text-align: justify;">There is also a patch for man(1) to modify the way symlinks/hardlinks are handled. This patch would eliminate the need of maintaining symlinks/hardlinks of man pages on the file system. A lot of man page files are simply links to other pages and this has to be specified explicitly in the makefiles of the packages via MKLINKS. This patch would take away this pain and also wipe off all the symlinks/hardlinks from the file system.</li>
</ul>
<div style="font-family: Verdana,sans-serif;">
<br /></div>
<div style="font-family: Verdana,sans-serif;">
<span style="font-size: large;"><b>Challenges on the way:</b></span> This was a relatively easy project to do, but every project has its challenges.</div>
<ul style="font-family: Verdana,sans-serif; text-align: left;">
<li><b>Parsing man pages:</b> I had never looked at how man pages are written before this, so I was overwhelmed. Though libmandoc made my task very easy, my understanding of the syntax kept growing as I progressed in the project. Parsing the <a href="http://netbsd.gw.com/cgi-bin/man-cgi?man+7+NetBSD-current">man(7)</a> based pages was particularly challenging, while parsing <a href="http://netbsd.gw.com/cgi-bin/man-cgi?mdoc+7+NetBSD-current">mdoc(7)</a> pages was relatively easy.</li>
<li><b>Ranking Algorithm</b>: Another challenge was to come up with a suitable ranking algorithm which brings the relevant results to the top. I did some study for this and experimented with a number of different algorithms, and finally settled on a mixture of different ranking schemes ;-).</li>
<li><b>Testing</b>: On the way towards completion I had to make a number of changes in the code related to parsing. Initially I started with parsing only the NAME and DESCRIPTION sections, then expanded to parsing all the sections, and made a lot of similar changes. Each time I made a change, I had to thoroughly review the results to make sure nothing had gone wrong because of it. I had to make sure both mdoc(7) and man(7) pages were getting parsed properly and the database had sane data. I agree that I should have written unit tests instead; maybe this is something worth doing in the coming days.</li>
</ul>
</div>
<div style="font-family: Verdana,sans-serif;">
<span style="font-size: large;"><b>Results:</b></span> Attaching the output of some sample runs.</div>
<br />
#apropos "add a new user"<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjb3uC1SuaDHAG3rYa7ZevlXau7hf4WIdlalyD2WYn-yY30ySvhBibcWNqut136-FuPnE7CwkoGkx0bCERibcDHzhGgkrl91Nl0WD0UFZeu0zFiBk-rrNPE5QsjQOQ0pu7MOifQ1SJZtUA/s1600/add-user.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjb3uC1SuaDHAG3rYa7ZevlXau7hf4WIdlalyD2WYn-yY30ySvhBibcWNqut136-FuPnE7CwkoGkx0bCERibcDHzhGgkrl91Nl0WD0UFZeu0zFiBk-rrNPE5QsjQOQ0pu7MOifQ1SJZtUA/s640/add-user.png" width="640" /></a></div>
<br />
#apropos "generate password hash"<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhN_UjgN6ALX47ZmLGpw3hsvgPwU_zoL8bsnSWhBAhzYf_G0Ck20X2tJBlVzp_si-awHuVWe844zuToauMCeSceSE12J6FRurf9nTbzC9lSevTQi4Ko0nTTBBKjd_aqSOSSKPHP_GcCbv4/s1600/password-hash.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhN_UjgN6ALX47ZmLGpw3hsvgPwU_zoL8bsnSWhBAhzYf_G0Ck20X2tJBlVzp_si-awHuVWe844zuToauMCeSceSE12J6FRurf9nTbzC9lSevTQi4Ko0nTTBBKjd_aqSOSSKPHP_GcCbv4/s640/password-hash.png" width="640" /></a></div>
<br />
<br />
#apropos "convert signal number to string"<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUWzawe-xXuhR1Z4g67ShcN-cK6zFquuvrWTYGRCtH8k7_Ne1zLR3nndaynMdzSUkSJ9mqqFs-DEMi5cK6N5RW2rqrcqh4dLLw0GrBZnQyC6v1-vO0dJ06dfDgPDGJrqhXi9LY2ZxMjko/s1600/psignal.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUWzawe-xXuhR1Z4g67ShcN-cK6zFquuvrWTYGRCtH8k7_Ne1zLR3nndaynMdzSUkSJ9mqqFs-DEMi5cK6N5RW2rqrcqh4dLLw0GrBZnQyC6v1-vO0dJ06dfDgPDGJrqhXi9LY2ZxMjko/s640/psignal.png" width="640" /></a></div>
<br />
<br />
<br />
<br />
<br />
#apropos -3 "compute log"<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAz3QKsn5jIy_kVF5uJDopGBSM7txeJqCZgX2du7O_50bbBkDgfHjlDoh8CNUhGVFrwPbzhaDAC8vzsX8ulDfQfCyGs1EzSsI9DYn1g3ByuN-20MQMEBOivSaBeB_rzIcU2jWjIrYWFnw/s1600/log.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAz3QKsn5jIy_kVF5uJDopGBSM7txeJqCZgX2du7O_50bbBkDgfHjlDoh8CNUhGVFrwPbzhaDAC8vzsX8ulDfQfCyGs1EzSsI9DYn1g3ByuN-20MQMEBOivSaBeB_rzIcU2jWjIrYWFnw/s640/log.png" width="640" /></a></div>
<br />
<br />
<br />
<br />
#apropos "realtek"<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYIqvA9rgqPc4kXY7_V59KSry6tUREwmQnBSugFb71bGr10f_4wN5nZWiRPZUdORa6yDOpC5s6teK1IwGT4oa_fwF1OoNenD20v-tGnm_xPNuJ2mzlx_LBfDzpLFfn_ouY3me-_21zBIk/s1600/realtek.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYIqvA9rgqPc4kXY7_V59KSry6tUREwmQnBSugFb71bGr10f_4wN5nZWiRPZUdORa6yDOpC5s6teK1IwGT4oa_fwF1OoNenD20v-tGnm_xPNuJ2mzlx_LBfDzpLFfn_ouY3me-_21zBIk/s640/realtek.png" width="640" /></a></div>
<br />
<br />
<br />
<br />
<div style="font-family: Verdana,sans-serif;">
<span style="font-size: large;"><b>Acknowledgements: </b></span></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
I owe a big chunk of the success to my mentor Jörg Sonnenberger, who was always there to answer my questions, offer advice and review the code. I have learnt a great deal from him and I am sure I have improved as a programmer. The best thing about working with him was that he never simply disclosed the solution; instead he gently guided me towards it, so I never lost a learning opportunity :-)</div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
David Young also offered valuable guidance during the project. He provided some clever insights and tips to improve the search and ranking of the results. Decomposing the database into more columns, one for each section of a man page, was his idea. </div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
Thanks as well to Kristaps Dzonsons, who is responsible for the mdocml project. He also reviewed the code related to parsing of the pages and pointed out bugs in it. I implemented makemandb based on his utility "mandocdb", so that was also a huge help. </div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
<br /></div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
Special thanks go to Thomas Klausner for reviewing the man pages I wrote and also providing patches for the mistakes I had made in them.</div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
</div>
<div style="font-family: Verdana,sans-serif; text-align: justify;">
I must also thank Julio Merino, Jan Schaumann, Jukka Ruohonen and S.P.Zeidler for the interest they showed in the project and the help they offered throughout :-)<br />
<br />
And thanks to lots of other people in the community as well whose names I am forgetting. While I am at it, I would like to thank my friends and family as well for putting up with me when I slept during the day and worked at night.</div>
<h2 style="font-family: Verdana,sans-serif;">
</h2>
<h2 style="color: black; font-family: Verdana,sans-serif;">
<span style="font-size: large;">What Next? </span></h2>
<div style="font-family: Verdana,sans-serif;">
Well, no guesses here: I enjoyed my experience and I will try to grab another project and continue working in the NetBSD community. Systems programming has always attracted me, and although I don't have much practical knowledge in this field, I don't mind learning ;-)</div>
<span id="goog_1853976185"></span><span id="goog_1853976186"></span><br />
<div>
<ul style="text-align: left;"></ul>
</div>
</div>Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com3tag:blogger.com,1999:blog-9185564337892058358.post-40439091724271194022011-07-31T14:13:00.000-07:002011-11-13T00:00:31.112-08:00NetBSD GSoC: Project Update 5<div dir="ltr" style="text-align: left;" trbidi="on"><div style="text-align: justify;">First of all thanks to Joerg, David Young and all other people involved with GSoC as I cleared Midterm Evaluations. 20 days passed since I last posted an update of the project and I did not even realize it. I apologize if I seem to be inactive, but in my defense, I would like to just quote my <a href="https://github.com/abhinav-upadhyay/apropos_replacement/commits/master">Commit Log</a> ;-). I have been actively pushing changes, fixing issues, adding new features all this while. As the size of the project is growing it is taking more and more time to make new changes, test them properly and fix any loose ends. </div><br />
So a brief overview of the things I did in the last 3 weeks:<br />
<br />
<div style="text-align: justify;"><span style="font-size: large;"><b><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/33">[New Feature] Search Within Specific Sections</a>:</b></span> This feature has been on my TODO list for a long time, but something more important kept coming up. I wanted to get it done before posting this update, so here it is:</div><div style="text-align: justify;">I have added options to restrict the search to one or more specified sections.</div>Commit: <a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/966c7ba552b94780b574c98ceee966a3e7846a26">966c7ba</a> <br />
You can do a search like this:<br />
<pre class="prettyprint">$apropos -1 "copy files"
#It will search in section 1 only.
$apropos -18 "adding new user"
#This will search in section 1 and 8 only.
#I hope you get the idea :)
</pre><br />
Some sample runs: <a href="http://paste2.org/p/1554491">http://paste2.org/p/1554491</a><br />
<a href="http://paste2.org/p/1554510">http://paste2.org/p/1554510</a><br />
<a href="http://paste2.org/p/1554509">http://paste2.org/p/1554509</a><br />
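<div style="text-align: justify;">To make the option handling above concrete, here is a minimal sketch of how a flag such as -1 or -18 could be turned into a set of sections. This is not the code from the commit: the function names parse_section_flags and section_selected and the bitmask representation are assumptions made only for illustration.</div>

```c
#include <stddef.h>

/*
 * Hypothetical helper: turn a flag string such as "-18" into a bitmask
 * in which bit N is set when section N (1-9) was requested.
 * Returns 0 if the argument is not a pure section flag.
 */
unsigned int
parse_section_flags(const char *arg)
{
    unsigned int mask = 0;
    size_t i;

    if (arg == NULL || arg[0] != '-' || arg[1] == '\0')
        return 0;

    for (i = 1; arg[i] != '\0'; i++) {
        if (arg[i] < '1' || arg[i] > '9')
            return 0;   /* something other than a section digit */
        mask |= 1u << (arg[i] - '0');
    }
    return mask;
}

/* Returns non-zero when the given section was selected in the mask. */
int
section_selected(unsigned int mask, int section)
{
    return section >= 1 && section <= 9 && (mask & (1u << section)) != 0;
}
```

<div style="text-align: justify;">With this sketch, "-18" selects sections 1 and 8, and a row from any other section would simply be skipped when building the result list.</div>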
<div><br />
</div><div style="text-align: justify;"><b><span style="font-size: large;"><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/23">Indexing Performance Improvement</a>:</span></b> Joerg suggested a clever way to bring down the time taken by makemandb to index the pages. Instead of running a separate transaction for each page, it is better to index all the pages inside a single transaction, which decreases the IO overhead substantially. I made the change, and the indexing time came down from 3 minutes to around 30 seconds.</div><div style="text-align: justify;">Commit: <a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/926746692cf0ff2f84ec4d5dbe21fe55948b8a4f">926746</a></div><br />
<div style="text-align: justify;"><span style="font-size: large;"><b><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/25">Parse And Index man(7) Pages</a>:</b></span> Until now we were indexing only the mdoc(7) pages, and all the man(7) based pages were being ignored. Now that the project was working reasonably well with mdoc(7) pages, it was time to scale up. Parsing man(7) pages was a bit more difficult than parsing mdoc(7) pages. It took some 2-3 days to implement this code and another 2-3 days to fix various bugs and test whether it worked with the 7000+ man pages I have.</div>Commit: <a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/2014855e6a1743ee1bc63b1d8d473b7b20797883">2014855</a><br />
<div style="text-align: justify;"><br />
<span style="font-size: large;"><b><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/26">Too Large DB Size (Regression)</a>:</b></span> Parsing both man(7) and mdoc(7) meant that I was indexing a whole lot of man pages (7613 to be exact). This increase in the number of indexed pages also magnified some problems which were not really visible before. One major problem was the size of the DB: it had grown to almost 99M.<br />
<b>Root Cause:</b> We were also storing all the unique terms in the corpus and their term-weights in a separate table, which was almost doubling the space requirements.<br />
<b>Solution:</b> As a quick solution I decided to remove the code related to pre-computation of the term-weights and drop this table. This brought the DB size down to around 60M, and with a few optimizations it has come down to the range of 30-40M.<br />
<b>Drawbacks:</b> The pre-computation of weights had its advantages. I was using it to implement some advanced ranking algorithms and I had plans to improve the ranking further on the basis of this work, but I had to let it go.<br />
<b>Alternatives:</b> The extra space was only helping to get more accurate results; it was a trade-off between space and search quality. One alternative is to let the user choose between the two versions.</div>Commit: <a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/7928fc52f7087d79fe66f4a087f2669295276a5e">7928fc5</a> <br />
<br />
<br />
<div style="text-align: justify;"><span style="font-size: large;"><b><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/27">Added Compression Option To The DB</a>:</b></span> To bring down the DB size further, I implemented code for compressing and decompressing data using zlib(3). It was also an exercise to make the zlib interface work with Sqlite. </div><div style="text-align: justify;">As a result of this the DB size came down to 43 M. </div><div style="text-align: justify;">Commit: <a href="http://www.blogger.com/goog_954596522">d878815</a></div><br />
<br />
<div style="text-align: justify;"><span style="font-size: large;"><b><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/28">Stopword Tokenizer</a>:</b></span> Implementing a custom tokenizer to filter out stopwords was already on my TODO list, but with the increased DB size it became the priority. I patched the porter tokenizer from the Sqlite source to filter out stopwords. </div><div style="text-align: justify;">The tokenizer seemed to be working fine, and it also helped in bringing the DB size down. With the stopword tokenizer the size came to around 31M. </div><div style="text-align: justify;">Due to a small bug I have disabled the use of this tokenizer for now.</div>Commit: <a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/76b45695f2962921f70a946bef04129a670ec04d">76b4569</a><br />
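<div style="text-align: justify;">The idea behind stopword filtering can be sketched in plain C as below. This is only an illustration and not the patched porter tokenizer: it does not use the Sqlite tokenizer interface, and the stopword list and the names is_stopword and filter_stopwords are assumptions.</div>

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stopword list; the list used by the project may differ. */
static const char *const stopwords[] = {
    "a", "an", "and", "in", "of", "the", "to"
};

/* Returns 1 if the (already lowercased) token is a stopword. */
int
is_stopword(const char *token)
{
    size_t i;

    for (i = 0; i < sizeof(stopwords) / sizeof(stopwords[0]); i++) {
        if (strcmp(token, stopwords[i]) == 0)
            return 1;
    }
    return 0;
}

/*
 * Copies the non-stopword tokens of `text` into `out`, lowercased and
 * separated by single spaces. Tokens longer than 63 characters are
 * truncated. Returns the number of tokens kept.
 */
size_t
filter_stopwords(const char *text, char *out, size_t outsz)
{
    char token[64];
    size_t kept = 0, used = 0, len = 0;
    const char *p = text;

    if (outsz == 0)
        return 0;
    out[0] = '\0';

    for (;; p++) {
        if (*p != '\0' && isalnum((unsigned char)*p)) {
            if (len < sizeof(token) - 1)
                token[len++] = (char)tolower((unsigned char)*p);
            continue;
        }
        if (len > 0) {
            token[len] = '\0';
            /* room for a separating space, the token and the NUL */
            if (!is_stopword(token) && used + len + 2 <= outsz) {
                if (used > 0)
                    out[used++] = ' ';
                memcpy(out + used, token, len + 1);
                used += len;
                kept++;
            }
            len = 0;
        }
        if (*p == '\0')
            break;
    }
    return kept;
}
```

<div style="text-align: justify;">Fewer tokens reach the index this way, which is why dropping stopwords also shrinks the database.</div>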
<br />
<div style="text-align: justify;"><span style="font-size: large;"><b><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/29">Parsing Additional sections & Storing Them In Individual Columns</a></b></span>: This was a required change. With such a large number of pages (7613) in the DB and all of the content in a single column, there was a lot of noise and the search results were off the mark by a great margin. David Young had also suggested this previously, so that prominent sections like "DIAGNOSTICS" could be given more weight than others like "ERRORS". </div><div style="text-align: justify;">It was a big task. I first decomposed the mdoc(7) pages, then the man(7) pages, and then sat down to fix apropos to take into account the new columns in the DB and fix the ranking function. </div><div style="text-align: justify;">Commit: <a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/bd639b8a3033d8461b533f4f662a8f17348d87e5">decomposing mdoc(7) </a> , <a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/ac8ff64ee548d79df0b1ec6f5999dbff08f05cd4">decomposing man(7)</a> , <a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/cba20d4da3beb1e50d798321b8a24efa67e3c47e">changes in ranking function</a></div><br />
I would say the time taken to implement this was worth it, because it has made the code cleaner. In future, if another section needs to be parsed, it will only require adding a case to the switch statement and a couple of extra lines of code. <br />
<br />
<code><pre class="prettyprint">static void
mdoc_parse_section(enum mdoc_sec sec, const char *string)
{
    switch (sec) {
    case SEC_LIBRARY:
        concat(&lib, string);
        break;
    case SEC_SYNOPSIS:
        concat(&synopsis, string);
        break;
    case SEC_RETURN_VALUES:
        concat(&return_vals, string);
        break;
    case SEC_ENVIRONMENT:
        concat(&env, string);
        break;
    case SEC_FILES:
        concat(&files, string);
        break;
    case SEC_EXIT_STATUS:
        concat(&exit_status, string);
        break;
    case SEC_DIAGNOSTICS:
        concat(&diagnostics, string);
        break;
    case SEC_ERRORS:
        concat(&errors, string);
        break;
    case SEC_NAME:
        /* nothing is appended for the NAME section here */
        break;
    default:
        /* every other section is folded into the description */
        concat(&desc, string);
        break;
    }
}
</pre></code><br />
<br />
It also makes it easy to fine-tune the ranking function and play with it. If you want to experiment with search, you can modify the column weights and rebuild to see the effects. The column weights are stored as an array of doubles in the rank_func function.<br />
<br />
<pre class="prettyprint">double col_weights[] = {
    2.00,   // NAME
    2.00,   // Name-description
    0.55,   // DESCRIPTION
    0.25,   // LIBRARY
    0.10,   // SYNOPSIS
    0.001,  // RETURN VALUES
    0.20,   // ENVIRONMENT
    0.01,   // FILES
    0.001,  // EXIT STATUS
    2.00,   // DIAGNOSTICS
    0.05    // ERRORS
};
</pre><br />
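<div style="text-align: justify;">As a rough sketch of how such weights can influence a score, the helper below takes the number of query-term matches found in each column of one row and returns their weighted sum. The real rank_func may well combine the columns differently; the name weighted_score, the NCOLS constant and the per-column hit counts are assumptions for illustration only.</div>

```c
#include <stddef.h>

#define NCOLS 11

/* Same order and values as the col_weights array above. */
static const double col_weights[NCOLS] = {
    2.00, 2.00, 0.55, 0.25, 0.10, 0.001, 0.20, 0.01, 0.001, 2.00, 0.05
};

/*
 * Hypothetical scoring helper: hits[i] is the number of times the query
 * terms matched in column i of one row. The score is the weighted sum.
 */
double
weighted_score(const unsigned int hits[NCOLS])
{
    double score = 0.0;
    size_t i;

    for (i = 0; i < NCOLS; i++)
        score += col_weights[i] * (double)hits[i];
    return score;
}
```

<div style="text-align: justify;">With these weights a single match in NAME (2.00) outranks five matches in FILES (5 x 0.01 = 0.05), which is the kind of effect you can tune by editing the array.</div>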
<br />
<br />
<span style="font-size: large;"><b>[Feature Proposal]: Show additional data with search results-</b></span> Storing the different sections in separate columns has another advantage: the ability to fetch and show more specific content with search results. I have already done something like this: if you look at the search results now, you will also see the one line description of each result (the .Nd macro).<br />
<div style="text-align: justify;">Similarly, it is possible to show the library, exit values and return values where available. But I was wondering whether this is a useful feature. Any views?</div><div style="text-align: justify;"><br />
Besides this, there are a lot of other things to be done that I had mentioned in my proposal, like a CGI based interface and using the database for managing the man page aliases. These are now on top of my TODO list, and if no big issues come up, I would like to pick them up.</div><br />
</div>Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-39390257358774763462011-07-08T09:18:00.000-07:002011-11-13T00:00:31.103-08:00NetBSD GSoC: Midterm Project Update<div dir="ltr" style="text-align: left;" trbidi="on"><div style="text-align: justify;">Another update of the project after a gap of 2 weeks. I did not have much to write about last week, and this week we have reached the stage of midterm evaluations. I will try to explain what changes I made in the last two weeks, the overall status of the project, and what else I am planning to implement.</div><br />
<div style="text-align: justify;"><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/14"><span style="font-size: large;"><b style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">Printing the section number along with search results</b></span></a>: Last time around when I posted a few sample runs of the project, there were no section numbers along with the search results, but now we have them (thanks to Kristaps for suggesting the right way to extract meta data).</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;"><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/17"><span style="font-size: large;"><b style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">Improve the ranking algorithm (Implement tf-idf):</b></span></a> tf-idf based term weighting schemes are very common in information retrieval systems. Until now, I was ranking the results only on the basis of the term frequency (tf). I improved this by including another factor for ranking, i.e., the inverse document frequency (idf).</div><br />
<ul><li><i><b>Term Frequency</b></i>: Usually defined as the number of times a given term appears in a particular document.</li>
<li><i><b>Inverse Document Frequency</b></i>: Derived from the number of documents in which a given term appears (at least once). The fewer documents a term appears in, the higher its IDF.</li>
</ul><br />
<div style="text-align: justify;"><i>Term frequency</i> is a <u><i>local factor</i></u>, which is concerned only with the number of occurrences of the search terms in one particular document at a time.</div><div style="text-align: justify;"><i>Inverse Document Frequency</i>, on the other hand, is a <u><i>global factor</i></u>, in the sense that it indicates the discriminating power of a term. If a term appears in only a small set of documents, then that term separates that set of documents from the rest. So a ranking obtained by combining these two factors brings up more relevant documents.</div><br />
So the weight of a term t in document d is calculated by the following formula:<br />
<br />
<blockquote>weight = tf * idf</blockquote><br />
<blockquote>Where tf = Term frequency of term t in document d<br />
idf = log (N / Nt)<br />
<br />
Where N = Total number of documents in the corpus<br />
Nt = Number of documents in which term t occurs (at least once).</blockquote><blockquote>So a term which appears in only one document will have<br />
IDF = log(N)<br />
while a term which appears in all the documents will have</blockquote><blockquote>IDF = log(1) = 0.</blockquote><br />
For example, a term like "the" will have a high term frequency in any document, but at the same time it will have a very low inverse document frequency (close to 0), which will nullify its effect on the ranking of search results.<br />
<br />
<br />
<div style="text-align: justify;"><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/18"><span style="font-size: large;"><b style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">Pre-compute the term-weights</b></span></a>: While the tf-idf based term-weighting scheme improved the quality of search, it degraded its performance. I could see apropos taking time to output the results. The reason was that all the calculations of the term-weights were being done on the fly when running apropos. An obvious solution was to pre-compute the term-weights while creating the index and store them in the database. Then a search only needs to look up the weights in the database rather than look up the terms and calculate as well! </div><br />
<div style="text-align: justify;">I implemented the code for pre-computing term weights in makemandb, but to my surprise, these changes made makemandb painfully slow. Earlier makemandb could index the man pages in under 2 minutes, but now it was taking close to 3 hours to pre-compute the weight of each unique term in the corpus. In addition to that there were some<i> <a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/19">bugs</a></i> which were causing large deviations in the term-weights. I decided to get the calculations right first. That took me 3 days, because after each change in the code I had to re-run makemandb to do the indexing and check the results. Finally I got it right, and then after some discussions with Joerg, the performance issue was also fixed. Basically the solution was to bring most of the processing inside Sqlite. Now makemandb does the indexing and the pre-computation of weights, all in under 3 minutes on my machine :-)</div><br />
<div style="text-align: justify;"><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/21"><span style="font-size: large;"><b style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">Further Improve the Ranking Algorithm</b></span></a>: In my free time I am doing some study on Information Retrieval. During my studies I came across a very interesting research paper by <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.9086&rep=rep1&type=pdf"><i><b>Salton and Buckley from 1988</b></i></a>, in which they discussed different term weighting schemes and their results. According to their study, the following formula for calculating term weights is most effective:</div><br />
<br />
The weight of a given term in a particular document is calculated as: <br />
<pre><code> </code></pre><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqeAmIzn64RrritqXGe07rrXiCM1T6JcExPiqaattWkYfQNM0W_ITyJWrKP62OVeWUxU7fiiw_GDuKW6WDtn-OBUH_hbUpyNzslrxwIy1TO2XdJEjBocBzHgBqrwj_w-7u_WX4dirPPEk/s1600/Selection_005.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqeAmIzn64RrritqXGe07rrXiCM1T6JcExPiqaattWkYfQNM0W_ITyJWrKP62OVeWUxU7fiiw_GDuKW6WDtn-OBUH_hbUpyNzslrxwIy1TO2XdJEjBocBzHgBqrwj_w-7u_WX4dirPPEk/s1600/Selection_005.png" /></a></div><div style="text-align: justify;"><span style="font-family: inherit;">I implemented this in a slightly simpler form. I left out the powers in the denominator (the square and the square root) to avoid unnecessary overhead, as these calculations are done on the fly by apropos. The results have been pretty good.</span></div><br />
<span style="font-family: inherit;"><b>Sample Results: </b><a href="http://pastebin.com/PjdNY68m">http://pastebin.com/PjdNY68m</a><br />
</span><br />
<br />
<span style="font-family: inherit;"> </span><br />
<div style="text-align: justify;"><span style="font-family: inherit;"><b>Note</b>:<i> <span style="font-family: Times,"Times New Roman",serif;">The above mentioned change is in the search branch only at the moment. I did not merge this in master so that, if you guys want to compare the differences before and after the above change, you can easily checkout the master and search branches and see for yourself :-)</span></i></span></div><div style="text-align: justify;"><span style="font-family: inherit;"> </span></div><br />
<div style="text-align: justify;"><span style="font-size: large;"><b style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;"><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/15">A keyword Recognizer</a>:</b></span> I have been thinking of implementing this feature for a while. The idea is to scan the query for keywords which indicate that the user is probably looking for results from a particular section. For example, "functions to copy string" indicates that the user is looking for standard library functions from section 3.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">After some discussions with David, we came to the conclusion that a better way to implement a feature like this would be to do something like Google does. Google allows you to search within a specific website using a syntax like:</div><blockquote>[book on IR site: amazon.com]. </blockquote><div style="text-align: justify;">David suggested using a similar interface, where the user could specify a section using a colon. So for example: </div><blockquote>apropos "kernel: function to allocate memory" </blockquote><div style="text-align: justify;">will search only within section 9.</div><div style="text-align: justify;">I started some work on this feature but it didn't work out properly, so it is on hold at the moment. I hope to resume work on it soon, but at the same time I would like to know whether this feature is worth it.</div><br />
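<div style="text-align: justify;">A minimal sketch of the colon syntax could look like the following. It is not the unfinished implementation mentioned above: the function name parse_section_prefix and the keyword table are hypothetical, and only the "kernel" to section 9 mapping is taken from the example in this post.</div>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical keyword table; only "kernel" -> 9 comes from the example above. */
static const struct {
    const char *keyword;
    int section;
} section_keywords[] = {
    { "kernel", 9 },
};

/*
 * If the query starts with "<keyword>:", returns the matching section and
 * points *rest at the remainder of the query (leading spaces skipped).
 * Otherwise returns 0 and points *rest at the whole query.
 */
int
parse_section_prefix(const char *query, const char **rest)
{
    const char *colon = strchr(query, ':');
    size_t i, klen;

    *rest = query;
    if (colon == NULL)
        return 0;

    klen = (size_t)(colon - query);
    for (i = 0; i < sizeof(section_keywords) / sizeof(section_keywords[0]); i++) {
        if (strlen(section_keywords[i].keyword) == klen &&
            strncmp(query, section_keywords[i].keyword, klen) == 0) {
            *rest = colon + 1;
            while (**rest == ' ')
                (*rest)++;
            return section_keywords[i].section;
        }
    }
    return 0;
}
```

<div style="text-align: justify;">An unknown prefix is left in the query untouched, so a search for something like "foo: bar" still behaves like a normal full text search.</div>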
<br />
<div style="text-align: justify;"><span style="font-size: large;"><b style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">Where Do We Stand At Midterm Evaluation</b></span>: As I promised in my <a href="http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/abhinav_upadhyay/1">proposal</a>, I have accomplished most of the requirements just in time for the midterm evaluation (maybe I need to write some documentation). Now is the time for some community feedback :-). I would love to hear about</div><br />
<ul><li style="text-align: justify;"> How good or bad the search results are ? If for some query you feel that right results are not coming up, please write to me about that query and what results you expected to see at the top. </li>
<li style="text-align: justify;">If you want to see any improvements or new features, tell me about them.</li>
</ul><br />
<div style="text-align: justify;"><span style="font-size: large;"><b style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">What New Features Are Next?</b></span> Apart from the keyword recognizer, there are a couple of other features that I have in mind. Whether I will implement them is a different matter, as I first need to make sure that it is feasible.</div><br />
<div style="text-align: justify;"><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/16"><b><span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">A Link Analysis Algorithm For Ranking</span></b></a>: Search engines these days do two types of ranking.</div><div style="text-align: justify;"><br />
</div><ol><li style="text-align: justify;"><b>Content based ranking</b>: It is concerned with finding relevant documents by matching the content. For example the tf-idf based term-weighting scheme is one way of doing content based ranking.</li>
<li style="text-align: justify;"><b>Popularity Based Ranking:</b> It tries to rank the documents based on their popularity. This popularity is calculated on the basis of a link analysis algorithm. For example <a href="http://en.wikipedia.org/wiki/PageRank">Google's PageRank</a> or <a href="http://en.wikipedia.org/wiki/HITS_algorithm">Jon Kleinberg's HITS </a>algorithm. </li>
</ol><br />
<div style="text-align: justify;">I am studying the PageRank algorithm and I am tempted to implement it, but I am held back by the fact that Stanford University has a patent on the PageRank process, so I am in a dilemma about whether I should implement it or not.</div><br />
<div style="text-align: justify;"><b style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">A Spell Checker:</b> It is very common for users to make a typo while performing a search, which might lead to no results at all or in some cases wrong results. I am thinking of adding a spell checker which, in case no results are found, would suggest some related search terms to the user (assuming that perhaps they made a typo).</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">I am held back on this because personally I have never looked at what techniques are involved in spell checkers but I have heard that it is computationally very expensive.</div><div style="text-align: justify;"><br />
</div><br />
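<div style="text-align: justify;">One common building block for this kind of suggestion is the edit distance between the mistyped word and the candidate terms. The sketch below shows the classic Levenshtein distance; it is not something the project implements at this point, and the function name edit_distance and the MAXWORD limit are assumptions.</div>

```c
#include <stddef.h>
#include <string.h>

#define MAXWORD 64

/*
 * Levenshtein distance between two words (insertions, deletions and
 * substitutions each cost 1). Words of MAXWORD or more characters are
 * rejected with a return value of (size_t)-1.
 */
size_t
edit_distance(const char *a, const char *b)
{
    size_t prev[MAXWORD], cur[MAXWORD];
    size_t la = strlen(a), lb = strlen(b);
    size_t i, j;

    if (la >= MAXWORD || lb >= MAXWORD)
        return (size_t)-1;

    /* distance from the empty prefix of a to each prefix of b */
    for (j = 0; j <= lb; j++)
        prev[j] = j;

    for (i = 1; i <= la; i++) {
        cur[0] = i;
        for (j = 1; j <= lb; j++) {
            size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            size_t del = prev[j] + 1;
            size_t ins = cur[j - 1] + 1;
            size_t best = sub < del ? sub : del;

            cur[j] = best < ins ? best : ins;
        }
        memcpy(prev, cur, (lb + 1) * sizeof(prev[0]));
    }
    return prev[lb];
}
```

<div style="text-align: justify;">A single comparison is cheap for short words; the cost comes from comparing a mistyped query word against every term in the index, so a real implementation would need some way to narrow down the candidates first.</div>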
<div style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;"><b><span style="font-size: large;">Testing out apropos:</span></b></div><br />
</div><div><pre class="prettyprint" style="font-family: Georgia,"Times New Roman",serif;">#Clone the repository:
$git clone git://github.com/abhinav-upadhyay/apropos_replacement.git
#Run make
$make
#Run makemandb
$./makemandb
#Run apropos
$./apropos "list directories"
</pre><br />
<div style="text-align: justify;">By default you will be on the master branch. The search branch has an improved ranking algorithm, so you might want to check it out and compare the results before and after the algorithm improvement:</div><br />
<pre class="prettyprint" style="font-family: Georgia,"Times New Roman",serif;">$git checkout -b search origin/search</pre><br />
and run make again to build it.<br />
<br />
<div style="text-align: justify;"><b>Prerequisites</b>:</div><ol style="text-align: left;"><li style="text-align: justify;">You will need the -Current version of man(1) from CVS. Joerg committed my patch for adding the <a href="http://abhinav-upadhyay.blogspot.com/2011/06/netbsd-gsoc-weekly-report-2.html">-p option to man(1)</a> which is being used by makemandb. </li>
<li style="text-align: justify;">You will also want to have the -current version of the man pages in /usr/share/man (at least).</li>
<li style="text-align: justify;">libmandoc. I am using the version of libmandoc available with -current (which at the moment is 1.11.1). You can build it by running make && make install in /usr/src/external/bsd/mdocml </li>
</ol>Feedback is welcome :-)<br />
<ol style="text-align: left;"></ol></div><div></div>Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com1tag:blogger.com,1999:blog-9185564337892058358.post-92220722714975677942011-06-22T13:19:00.000-07:002011-06-22T14:56:07.222-07:00NetBSD GSoc Weekly report 3<div dir="ltr" style="text-align: left;" trbidi="on"><div style="text-align: justify;">This week I got some more work done. I did a barebones implementation of apropos(1) as well as fixed some nasty and some not so nasty issues in makemandb.</div><br />
<span style="font-size: large;"><b>Issues Fixed:</b></span><br />
<br />
<ul style="text-align: left;"><li style="text-align: justify;"><b><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/4">Handling .Nm macros</a></b>: As I said in the last post, .Nm macros are a special case. They are given an argument only once, at the beginning of the man page, and wherever else .Nm occurs the parser replaces it with its previously specified argument value. I just had to add an extra check to see if we have encountered a .Nm macro and substitute its value. Here is the commit which fixed it: <span class="Apple-style-span" style="color: #333333; font-family: Monaco,'Courier New','DejaVu Sans Mono','Bitstream Vera Sans Mono',monospace; font-size: 12px; line-height: 16px;"><a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/bbdab19ac263df09af82490426147dac68764b9b">bbdab19</a><span id="goog_832096988"></span><span id="goog_832096989"></span></span></li>
<li style="text-align: justify;"><b><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/9">ENOMEM in traversedir()</a>:</b> This was a nasty memory leak in makemandb which cost me a couple of nights' sleep. I managed to track down the offending piece of code, with Joerg's help of course :-) Here was the problem: code similar to the snippet quoted below was running in a loop:</li>
</ul><pre class="prettyprint" style="font-family: Georgia,"Times New Roman",serif;"><div>char *desc;</div><div>if (desc == NULL)</div><div> desc = strdup(n->string);</div><div>else</div><div> asprintf(&desc, "%s %s", desc, n->string);</div></pre><div><br />
</div><div style="text-align: justify;"><blockquote>So the above <i style="font-family: Georgia,"Times New Roman",serif;">asprintf</i> call was leaking the old <i style="font-family: Georgia,"Times New Roman",serif;">desc</i> buffer at each step of the loop: asprintf allocates a fresh string and overwrites the pointer, so the previous allocation is never freed. This was causing makemandb to consume up to 2.6 GB of memory (3 GB being my total physical memory). After fixing this bug, makemandb consumes around 5 to 6 MB of memory :-)<br />
This is the commit which fixed it: <a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/cd53b9b5613fd377e1a147736fa96cad3d3879df"><b>cd53b9b</b></a></blockquote></div><div><br />
</div><ul style="text-align: justify;"><li><b><a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues/12">Avoid Hardlinks</a></b>: After running a few queries against the database, I noticed that some of the man pages were indexed multiple times. For example, csh had 26 duplicate entries. Joerg told me that this is due to hardlinks: a lot of man page files are nothing but hardlinks to other man pages. To handle this I added a function check_md5 to makemandb. Before we start to parse a new file, we first calculate its md5 hash and check in the database whether it is already indexed (I added a new column for storing the hash as well). Here is the commit: <a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/14b024f8b7ed67f6b9926d168f457e8ae0d10e21"><span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"><b><span class="Apple-style-span" style="color: #333333; line-height: 18px;"><span style="line-height: 1.4em; margin: 0px; padding: 0px;">14b024f</span></span></b></span></a></li>
</ul><div style="font-family: inherit; text-align: justify;"><br />
<span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-size: 12px; line-height: 18px;"><span style="font-size: large;"><b><span style="font-size: large;">Implementation of apropos.c</span>:</b><span style="font-family: inherit;"> </span></span><span style="font-family: inherit;">Besides fixing some issues, I was also able to write a barebones version of apropos(1). The initial version is pretty basic. It takes the user query as a command line argument and simply runs it against the database using the FTS engine of Sqlite. The results were not very good: as Sqlite's FTS documentation itself says, it performs a boolean search, so it is up to us to do the mathematics for finding the more relevant documents and ranking them higher in the results. The <a href="https://github.com/abhinav-upadhyay/apropos_replacement/blob/master/apropos.c">master branch</a> on Github still has this basic version of apropos.</span></span></span></div><div style="font-family: inherit; text-align: justify;"><span class="Apple-style-span" style="color: #333333; font-size: small;"><span class="Apple-style-span" style="line-height: 18px;"><br />
</span></span></div><div style="font-family: inherit; text-align: justify;"><span class="Apple-style-span" style="color: #333333; font-size: small;"><span class="Apple-style-span" style="line-height: 18px;">I have started a new experimental branch <a href="https://github.com/abhinav-upadhyay/apropos_replacement/branches/search">search</a> on Github, where I will experiment with search related code. After some reviews and feedback, I will cherry-pick the commits which look good.</span></span></div><div style="font-family: inherit; text-align: justify;"><span class="Apple-style-span" style="color: #333333; font-size: small;"><span class="Apple-style-span" style="line-height: 18px;"><br />
</span></span></div><div style="font-family: inherit; text-align: justify;"><span class="Apple-style-span" style="color: #333333; font-size: small;"><span class="Apple-style-span" style="line-height: 18px;">Currently the search branch has the following two features:</span></span></div><div style="font-family: inherit; text-align: justify;"><span class="Apple-style-span" style="color: #333333; font-size: small;"><span class="Apple-style-span" style="line-height: 18px;"><br />
</span></span></div><div style="font-family: inherit; text-align: justify;"><span class="Apple-style-span" style="color: #333333; font-size: small;"><span class="Apple-style-span" style="line-height: 18px;"><span style="font-size: large;"><b>Stopword Filter:</b></span><span style="font-family: inherit;"> I noticed that </span>Sqlite does not filter the user query for any stopwords, and tries to match the stopwords as well while performing the search. I have implemented a stopword filter for this. </span></span></div><div style="font-family: inherit; text-align: justify;"><span class="Apple-style-span" style="color: #333333; font-size: small;"><span class="Apple-style-span" style="line-height: 18px;">It works something like this: we store all the stopwords in a hash table. We scan the user query word by word in a loop, and at each iteration we look the word up in the hash table to see whether it is a stopword. If it is a stopword, we omit it from the query. Here is the commit: </span></span><span class="Apple-style-span" style="color: #333333; font-size: small; line-height: 16px;"><a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/ec2554638c26984794130d607947510ab55bd187" style="color: #4183c4; line-height: 1.4em; margin: 0px; outline-style: none; padding: 0px; text-decoration: none;">ec25546</a> </span></div><div style="font-family: inherit; text-align: justify;"><span class="Apple-style-span" style="color: #333333; font-size: small; line-height: 16px;"><br />
</span></div><div style="font-family: inherit; text-align: justify;"><span class="Apple-style-span" style="color: #333333; font-size: small; line-height: 16px;"><span style="font-size: large;"><b>A Ranking Function:</b></span> As I said above, the plain Sqlite search wasn't much of a help. So we need to write a ranking function which tells Sqlite which search results are important, so that they show up higher in the output. Sqlite's FTS documentation provides a sample ranking function which is very simple but effective. I didn't try to fully understand it (I just wanted to see the effect of a ranking function on search results), but to me it seems to be based on finding the term frequency of each search phrase in each column of the database and multiplying it by a static weight assigned to that column. This procedure is repeated for each term in the query to find the weight of each column, and the overall rank of the page is obtained by summing up the weights of the individual columns thus calculated.</span></div><div style="font-family: inherit; text-align: justify;"><span class="Apple-style-span" style="color: #333333; font-size: small;"><span class="Apple-style-span" style="line-height: 16px;"><br />
</span></span></div><div style="font-family: inherit; text-align: justify;"><span class="Apple-style-span" style="color: #333333; font-size: small; line-height: 16px;">Commit for this: </span><a href="https://github.com/abhinav-upadhyay/apropos_replacement/commit/001a679fe9a4b4c04a8d60f636d3b06a22b7a968"><span class="Apple-style-span" style="color: grey; font-size: small; line-height: 17px;">001a679fe9a4b4c04a8d</span></a></div><div style="font-family: inherit;"><br />
</div><div style="font-family: inherit; text-align: justify;"><b>Some Sample Runs</b>: I ran some sample queries to check out how this ranking function performs. The results are much improved as compared to without any kind of ranking, but there is still much scope for improvement. Following is a sample run output. If you would like to see a few others, I pasted the output of some queries on pastebin: <a href="http://pastebin.com/qhQBRNd5">http://pastebin.com/qhQBRNd5</a></div><pre class="prettyprint" style="font-family: Georgia,"Times New Roman",serif;"><span style="font-size: small;">$ ./apropos "copy string"
memccpy
The memccpy function copies bytes from string src to string dst . If the character c...
strndup
...copies at most len characters from the string str always NUL terminating the copied
string...
bcopy
...copies len bytes from string src to string dst . Unlike bcopy 3 the two strings...
strlcat
size-bounded string copying and concatenation
bcopy
...bcopy function copies len bytes from string src to string dst . The two strings may...
memcpy
The memcpy function copies len bytes from string src to string dst . The arguments must...
memmove
...memmove function copies len bytes from string src to string dst . The two strings may...
memmove
...memmove function copies len bytes from string src to string dst . The two strings may...
memcpy
The memcpy function copies len bytes from string src to string dst . The arguments must...
strncpy
...copy the string src to dst (including the terminating \e0 character). The strncpy function
copies...</span></pre><br />
<div style="text-align: justify;">You might notice that a few results are repeated here. I believe this is a bug in apropos(1): some man pages have a number of different versions depending on the machine architecture, and I think the duplication in the results is because of that. I need to fix it :-)</div><span style="font-size: small;"><br />
</span><br />
<div style="font-family: inherit; text-align: justify;"><span style="font-size: large;"><b>How to Test</b></span>: If you are interested in checking out the functionality of the project, you are welcome, I would appreciate it even more if you report back any issues you notice or if you have some feedback on how the search results can be improved.</div><div style="font-family: inherit;"><br />
</div><div><pre class="prettyprint" style="font-family: Georgia,"Times New Roman",serif;"># Clone the repository and enter it:
$ git clone git://github.com/abhinav-upadhyay/apropos_replacement.git
$ cd apropos_replacement
# Run make
$ make
# Run makemandb
$ ./makemandb
# Run apropos
$ ./apropos "list directories"
</pre><br />
<div style="text-align: justify;">By default you will be on the master branch, which currently does not have the stopword filter and ranking function features. So you might want to check out the search branch. To do that, run:</div><br />
<pre class="prettyprint" style="font-family: Georgia,"Times New Roman",serif;">$ git checkout -b search origin/search</pre><br />
and run make again to build it.<br />
<br />
<div style="text-align: justify;"><b>Prerequisites</b>:</div><ol style="text-align: left;"><li style="text-align: justify;">You will need the -Current version of man(1) from CVS. Joerg committed my patch for adding the <a href="http://abhinav-upadhyay.blogspot.com/2011/06/netbsd-gsoc-weekly-report-2.html">-p option to man(1)</a> which is being used by makemandb. </li>
<li style="text-align: justify;">You will also want to have the -current version of the man pages in /usr/share/man (at least).</li>
<li style="text-align: justify;">libmandoc. I am using the version of libmandoc available with -current (which at the moment is 1.11.1). You can build it by running make && make install in /usr/src/external/bsd/mdocml </li>
</ol></div><div><br />
I believe a lot of work and research is now required to make the search better. Any feedback and suggestions will be highly welcome :-)</div></div>Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com2tag:blogger.com,1999:blog-9185564337892058358.post-9917039411838959782011-06-20T00:03:00.000-07:002011-06-20T00:03:02.441-07:00A memorable week!!<div dir="ltr" style="text-align: left;" trbidi="on">Last week was perhaps one of the most wonderful weeks of my life. I never thought that all the things that happened to me this last week would ever happen to me (although I sure dreamt of them).<br />
<br />
First of all, Tomboy 1.7 was released earlier this week, this release had one of my patches, and the Tomboy developers mentioned my name in the <a href="http://git.gnome.org/browse/tomboy/plain/NEWS?id=1.7.0">ChangeLog</a> [1]. <br />
<br />
The next day, I updated my Tomboy from 1.6 to 1.7.1 and, to my great surprise, I saw my name in the list of contributors to Tomboy. It's really a great feeling to see your name in the list of contributors of a piece of software that you yourself love and use.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijw_iB3DFBu2kBYj3Ad76f7jd8_h8o8N1SPEUrwHTZjHrofmykzN17s-7vcFN9KT0herA9zC5S960KsSrLXikJ2ly99I3cbZMb-ceVZcH9gTJQMRJ-IIKNKkiRNyDSBWcrbPr32FU8IqI/s1600/Selection_004.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijw_iB3DFBu2kBYj3Ad76f7jd8_h8o8N1SPEUrwHTZjHrofmykzN17s-7vcFN9KT0herA9zC5S960KsSrLXikJ2ly99I3cbZMb-ceVZcH9gTJQMRJ-IIKNKkiRNyDSBWcrbPr32FU8IqI/s1600/Selection_004.png" /></a></div><br />
<br />
But the week didn't end there; one more surprise was waiting for me. Famous Ubuntu developer <a href="http://daniel.holba.ch/blog/">Daniel Holbach</a> contacted me for an interview. He told me that he is starting a series of weekly interviews with new contributors to Ubuntu Development, and he would like to start with me. I couldn't have asked for anything more :-) The interview was published on <a href="http://www.omgubuntu.co.uk/2011/06/ubuntu-11-10-development-update/">OMG! Ubuntu</a> [2] and <a href="http://ubuntu-news.org/2011/06/16/ubuntu-11-10-development-update/">ubuntu-news</a> [3].<br />
<br />
For my little contributions to Ubuntu and Tomboy, this is more than I ever dreamt of getting in return. I am grateful to the Tomboy devs, Daniel Holbach and everyone else around me who encouraged me throughout. I hope many other people get encouraged by this; it's very easy to start contributing and you learn as you go.<br />
<br />
<br />
Footnote:<br />
[1]: <a href="http://git.gnome.org/browse/tomboy/plain/NEWS?id=1.7.0">Tomboy 1.7 release note and changelog</a><br />
[2]: <a href="http://www.omgubuntu.co.uk/2011/06/ubuntu-11-10-development-update/">The interview on OMG! Ubuntu</a><br />
[3]: <a href="http://ubuntu-news.org/2011/06/16/ubuntu-11-10-development-update/">The interview on ubuntu-news.org</a></div>Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com0tag:blogger.com,1999:blog-9185564337892058358.post-41110746882832996752011-06-14T10:56:00.000-07:002011-11-13T00:00:59.749-08:00NetBSD GSoC Weekly Report 2<div dir="ltr" style="text-align: left;" trbidi="on"><div dir="ltr" style="text-align: left;" trbidi="on">This was a relatively productive week compared to the first week. A significant portion of the work towards our first milestone (to have a working prototype) got done.<br />
<br />
<span style="font-size: large;"><b>What did I do this week:</b></span><br />
<br />
<b>Project Repository:</b> The first thing I did was to create the project repository on Github. Here is the link: <a href="https://github.com/abhinav-upadhyay/apropos_replacement">https://github.com/abhinav-upadhyay/apropos_replacement</a><br />
<br />
<b>makemandb: </b>This is one of the crucial components of the project. makemandb is supposed to parse each and every man page installed on the user's system and store them in an Sqlite database. I say it is crucial because, until our search results get close to perfect, we will be making a lot of changes to the database schema, to how man pages are parsed, and to what information is extracted from them. It is necessary to get this part right, because a good search experience comes only when we have done the indexing correctly.<br />
<br />
makemandb first calls 'man -p' and recursively traverses the list of directories to get the complete path of the man pages and then passes them on to the libmandoc functions. <br />
<br />
<br />
The parsing related code of makemandb is largely inspired by mandoc-db. I had no clue how to use libmandoc, let alone how to parse specific portions of the man pages with it, so it was a huge help for me. Thanks to Kristaps :)<br />
<br />
makemandb will create a new Sqlite database named 'apropos.db' (even if there was already an existing database). It will create a new virtual table in the database before starting to insert data, the present schema of the virtual table is something like this:<br />
<br />
<br />
<b>Table name:</b> mandb<br />
<br />
<br />
<table><tbody>
<tr> <th>Column Name </th> <th>Description </th> </tr>
<tr> <td>name </td> <td>For storing the Name of the man page </td> </tr>
<tr> <td>name_desc </td> <td>For storing the one line description of the man page from the NAME section </td> </tr>
<tr> <td>desc</td> <td>For storing the complete DESCRIPTION section<br />
</td> </tr>
</tbody></table></div><br />
<span style="font-size: large;"><b>Present Issues:</b></span><br />
<br />
<ol style="text-align: left;"><li><b>Handling .Nm macros:</b> .Nm macros seem to be special in the syntax of man pages. From what I have seen, the argument for the .Nm macro is specified only at the beginning of the man page (usually in the NAME section). If .Nm is used anywhere in the rest of the man page, the parser replaces it with the value specified at the top. At the moment we are unable to handle this, so wherever the .Nm macro is used again, it is simply ignored. </li>
<li><b>Unable to parse Escape Sequences:</b> Man pages are filled with a number of escape sequences. Presently our code does not do anything special to handle the escape sequences, so they are stored as they are. The current version of mdocml has a new function mandoc_escape(), which I think should help rectify this. I hope to see the latest version of mdocml in -current so that I can use it.</li>
<li><b>Unable to parse automatically generated man pages:</b> Some of the man pages are generated automatically, so their syntax is very different from that of normal man pages. As a result we are unable to parse such man pages at the moment. </li>
</ol>There are a few more issues which I have listed on Github. (<a href="https://github.com/abhinav-upadhyay/apropos_replacement/issues">https://github.com/abhinav-upadhyay/apropos_replacement/issues</a>).<br />
<br />
So at the present moment, you can clone the repository, run make to compile the source and run './makemandb'. If all goes well, a new sqlite database (apropos.db) will be created in the present directory. You can run some select queries against it to test.<br />
<br />
Feedback will be highly appreciated :)</div>Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com2tag:blogger.com,1999:blog-9185564337892058358.post-20461758970644583742011-06-08T08:55:00.000-07:002011-06-08T09:21:36.291-07:00NetBSD GSoC Weekly Report 1<div dir="ltr" style="text-align: left;" trbidi="on"><div style="text-align: justify;">The coding period of GSoC started on the 23rd of May and we are in the 3rd week since then, but this is my first report because during the first week I was bogged down with my semester exams, and I only picked up the work on the 1st of June. Just last night (8th June) I finally completed my first task. </div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;"><b>The Task:</b> In one of my <a href="http://abhinav-upadhyay.blogspot.com/2011/05/problem-no-1-building-manpath.html">previous posts</a> I described the first task. The problem was to get the list of directories which contain the man page sources. We will need this information later when creating a database index for searching the man pages. </div><div style="text-align: justify;"></div><div style="text-align: justify;">The information we were seeking is in the file /etc/man.conf. The following two entries in man.conf are important for us:</div><br />
<blockquote style="font-family: Georgia,"Times New Roman",serif;">_default /usr/{share,X11R6,X11R7,pkg,local}/man/</blockquote><br />
and<br />
<br />
<blockquote><div style="font-family: Georgia,"Times New Roman",serif;">_subdir cat1 man1 cat2 man2 cat4 man4... </div></blockquote><br />
<br />
From this we need to build the path of directories containing man pages like this<br />
<br />
<blockquote style="font-family: Georgia,"Times New Roman",serif;">/usr/share/man/man1<br />
/usr/share/man/man8<br />
/usr/pkg/man/man4<br />
...</blockquote><br />
<div style="text-align: justify;">I wrote a patch for man(1) to add a new option -p which will print this list of directories on the terminal in new line separated format. It took me a whole week to do this relatively simple task mostly because of my stupid mistakes.</div><br />
<div style="text-align: justify;">My initial patch took a brute-force approach to the problem. It worked, but it was too complicated for anyone's liking. </div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">It looked something like this: </div><div style="font-family: "Trebuchet MS",sans-serif;"><span style="font-size: x-small;"><br />
</span></div><pre class="prettyprint" style="font-family: "Trebuchet MS",sans-serif;"><span style="font-size: x-small;">+/**
+* Tests if the directory at dirpath exists or not
+*/
+static int
+testdir(const char *dirpath)
+{
+ DIR *dp;
+ if ((dp = opendir(dirpath)) == NULL)
</span><div class="im"><span style="font-size: x-small;">+ return 0;
+ closedir(dp);
+ return 1;
+}
+
</span></div><div class="im"><span style="font-size: x-small;">+/**
+* Builds a list of directories containing man pages
+*/
+void
</span></div><span style="font-size: x-small;">+printmanpath(struct manstate *m)
+{
+ ENTRY *esubd, *epath;
+ char *manpath; /*it will store the actual manpath as it is built */
</span><div class="im"><span style="font-size: x-small;">+ char *manpath_tokens[3]; /* stores /usr/, {share, pkg, ...}, /man/ */
</span></div><span style="font-size: x-small;">+ char *defaultpath = NULL; /* stores the _default tag value obtained
</span><div class="im"><span style="font-size: x-small;">from man.conf */
+ char *str, *buf; /* for storing temporary values */
+ int i;
+
+ TAG *path = m->defaultpath;
+ TAG *subdirs = m->subdirs;
+ if (path == NULL ) {
+ manpath = NULL;
+ return ;
+ }
+
+ /** routine code to get the default man path from the TAG.
+ * path is of the form /usr/{share,X11R7,X11R6,pkg,</span><wbr></wbr><span style="font-size: x-small;">local}/man/ (see
/etc/man.conf)
+ * We will first tokenize it into 3 parts
+ * 1. /usr/
+ * 2. share,X11R7,X11R6, pkg, local
+ * 3. /man/
+ * and store them in the array manpath_tokens.
+ */
+ TAILQ_FOREACH(epath, &path->entrylist, q) {
</span></div><span style="font-size: x-small;">+ defaultpath = strdup(epath->s);
</span><div class="im"><span style="font-size: x-small;">+ for (str = strtok(defaultpath, (const char *)"{}"), i = 0; str; str
= strtok(NULL, (const char *)"{}"), i++) {
+ manpath_tokens[i] = str;
+ }
</span></div><div class="im"><span style="font-size: x-small;">+ free(str);
+ }
+ /**
+ * 1. Tokenize manpath_tokens[1] (share, X11R7, X11R6,...)
+ * 2. Traverse the tail queue subdirs and get the list of subdirs i.e.:
+ * man1, man2, man3, ... man9, etc. (see /etc/man.conf)
+ * 3. Finally build the complete path of the directory by concatenating the
+ * different parts
+ */
+ for (str = strtok(manpath_tokens[1], ","); str; str = strtok(NULL, ",")) {
+ TAILQ_FOREACH(esubd, &subdirs->entrylist, q) {
+ // we need only path of the actual man pages and not the cat ones
</span></div><span style="font-size: x-small;">+ if (strncmp(esubd->s, "man", 3))
</span><div class="im"><span style="font-size: x-small;">+ continue;
+
+ asprintf(&buf, "%s%s%s%s/", manpath_tokens[0], str,
manpath_tokens[2], esubd->s);
</span></div><div class="im"><span style="font-size: x-small;">+
+ // we should not add non-existing directories to the man path
+ if (!testdir(buf))
+ continue;
+
+ if (manpath == NULL)
+ asprintf(&manpath, "%s", buf);
+ else
</span></div><span style="font-size: x-small;">+ printf("%s\n", buf);
+ free(buf);
+ }
+ }
+
+ free(str);
+ free(defaultpath);
+} </span></pre><span style="font-family: Georgia,"Times New Roman",serif;"> </span> <br />
<div style="font-family: Georgia,"Times New Roman",serif;"><br />
</div><div style="font-family: Georgia,"Times New Roman",serif; text-align: justify;">My mentors David and Joerg are showing a lot of patience with me. They went through the different versions of the patch and gave useful reviews. Joerg suggested a more intuitive and efficient algorithm to build this path in a recursive fashion. In the end I discovered glob(3), which provides csh-style brace expansion, and I settled on using it, as it was the easiest option and the least likely to go wrong.</div><div style="text-align: justify;"><br />
</div><br />
The final version of patch looked something like this:<br />
<br />
<pre class="prettyprint" style="font-family: "Trebuchet MS",sans-serif;"><span style="font-size: x-small;">+
+/*
+ * printmanpath --
+ * Prints a list of directories containing man pages.
+ */
+static void
+printmanpath(struct manstate *m)
+{
+ ENTRY *esubd;
+ char *defaultpath = NULL; /* _default tag value from man.conf. */
+ char *buf; /* for storing temporary values */
+ char **ap;
+ glob_t pg;
+ struct stat sb;
+ TAG *path = m->defaultpath;
+ TAG *subdirs = m->subdirs;
+
+ /* the tail queue is empty if no _default tag is defined in man.conf */
+ if (TAILQ_EMPTY(&path->entrylist))
+ errx(EXIT_FAILURE, "Empty manpath");
+
+ defaultpath = TAILQ_LAST(&path->entrylist, tqh)->s;
+
+ if (glob(defaultpath, GLOB_BRACE | GLOB_NOSORT, NULL, &pg) != 0)
+ err(EXIT_FAILURE, "glob failed");
+
+ TAILQ_FOREACH(esubd, &subdirs->entrylist, q) {
+ /* Drop cat page directory, only sources are relevant. */
+ if (strncmp(esubd->s, "man", 3))
+ continue;
+
+ for (ap = pg.gl_pathv; *ap != NULL; ++ap) {
+ if (asprintf(&buf, "%s%s", *ap, esubd->s) == -1)
+ err(EXIT_FAILURE, "memory allocation error");
+ /* Skip non-directories. */
+ if (stat(buf, &sb) == 0 && S_ISDIR(sb.st_mode))
+ printf("%s\n", buf);
+
+ free(buf);
+ }
+ }
+ globfree(&pg);
+} </span></pre><br />
<div style="text-align: justify;"><b>What did I learn in the process</b>: It was a small and relatively simple task, although I believe brace expansion is not a trivial thing to do manually. Overall, it was a great learning experience for me. I learnt about the queue(3) interfaces, glob(3), and a host of string utilities from the POSIX and ISO C standards that I had been unaware of. </div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">And the most important lesson was about memory management. I tried to take care of freeing memory at most points, but I came to learn about many corner cases which I didn't know about and which can lead to memory leaks. I hope I did learn a lesson here :) </div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">Hopefully this will be my first patch for NetBSD. Joerg promised to commit it soon to the repository.</div></div>Abhinav Upadhyayhttp://www.blogger.com/profile/10269563448156267741noreply@blogger.com2