My tech musings: Understanding Base64 Encoding

Introduction

Base64 is ubiquitously used in web development as well as lower level network programming. It is a scheme to encode arbitrary data, whether it's binary or plain text. In the early days of the Internet, sending binary data over the wire was complicated because the devices and softwares interpreted byte values outside of the printable ASCII range in their own ways. For example modems interpreted code 6 as an acknowledgement. This was problematic for transmitting binary data, such as compressed images or executables and therefore Base64 was designed as a mechanism to encode data into a subset of printable ASCII character set.

In modern day web development Base64 is also used as a way to encode the Authroization header to encode the username and password, when making an HTTP request call. Even though base64 mangles the input but by no means it is secure. If such a request is being transmitted over an insecure network, an attacker can easily decode and capture the user name and password. Many people confuse Base64 to be an encryption scheme but by no means it is encryption or a hashing scheme, it is a simple encoding scheme.

The name of the scheme is Base64 because of the way it works. In simple terms, the encoder scans the input 6 bits at a time and maps each unique 6 bit pattern to one of the 64 ASCII symbols (A-Z, a-z, 0-9, +, /). Since it works with 6 bits at a time, there are 2^6 = 64 unique mappings possible, and hence the name is Base64.

Base64 Encoding Implementation

As noted above, the encoding scheme works by looking at 6 bits of the input at a time and encoding that as one byte (or 8 bits) in the encoded output. Since one byte consists of 8 bits and we are working with 6 bits of input at a time, we need at least 24 bits (least common multiple of 6 and 8) of input in order to encode. One possibility would be to encode the input with 0s to make it a multiple of 24 bits but the convention followed has been to append two '=' characters in the output if the input length is 2 bytes short of a multiple of 3, or to append single '=' character if input length is 1 byte short of a multiple of 3.

Since we are encoding 6 bits of input into 8 bits of output, the output to input size ratio is 8 / 6 = 4/ 3, i.e. the output is 4/3 times larger than the input.

With these details out of the way, we can start looking at the actual implementation of the encoding. Even though the encoding scheme sounds simple enough but because of the low level bit manipulation it makes for a challenging implementation. Let's dig in.

#include <err.h>
#include <stdlib.h>

static const char *base64_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijkl" \
    "mnopqrstuvwxyz0123456789+/";

We have created a static array of all the base64 encoding characters so that we can easily map the input bits to one of these characters at encoding time.

char *
base64_encode(const char *input, size_t len)
{
    // Allocate enough space to hold the base64 encoding
    // for each 6 bits of input the encoded output is 8 bits,
    // so the output is 4/3 times the input.
    // Additionally, since we encode 6 bits of input at a time,
    // if the input is not a multiple of 24 bits then we need
    // to add '=' to the end of the output to make it a multiple
    //  of 24 bits, so the output length is 4/3 times the input
    // length plus 2 for the '=' and one byte for the null terminator
    size_t output_len = (len * 4 / 3 + 1) + 3; 
    char *out = malloc(output_len);
    if (out == NULL)
        err(EXIT_FAILURE, "malloc failed");


    size_t idx = 0;

    // we scan 6 bits of input at a time and map it to one
    // of the bytes from base64_chars
    do {
        out[idx++] = b64_chars[(input[0] & 0xFC) >> 2];
        if (len == 1) {
            out[idx++] = b64_chars[(input[0] & 0x03) << 4];
            out[idx++] = '=';
            out[idx++] = '=';
            break;
        }

        out[idx++] = b64_chars[(input[0] & 0x03) << 4 | (input[1] & 0xF0) >> 4];
        if (len == 2) {
            out[idx++] = b64_chars[(input[1] & 0x0F) << 2];
            out[idx++] = '=';
            break;
        }

        out[idx++] = b64_chars[(input[1] & 0x0F) << 2 | (input[2] & 0xC0) >> 6];
        out[idx++] = b64_chars[input[2] & 0x3F];
        input += 3;

    } while (len -= 3);
    out[idx] = 0;
    return out;
}

If you are not used to working with bitmasks then this code may look a bit intimidating at the beginning but it's not that complicated. Within the do while loop, the first thing we are doing is to take the first 6 bits off the first byte of the input and look up its corresponding mapping in the b64_chars. The way we are selecting the 6 bits of the first byte is interesting. We are using the bitmask 0xFC which is hexadecimal for the binary string 11111100. When we do a bitwise AND of this string with the first byte of the input, the last two bits of the input are set to 0 and we are left with only the first 6 bits. Then we are right shifting these bits by 2 so that we get a valid 6 bit number for which we can find a mapping.

Next, if the length of the input is just 1 byte, then we need to encode the remaining 2 bits of the first byte and append 2 '=' characters to the output. Now we are using the bitmask 0x03 for selecting the last 2 bits because it corresponds to the bit pattern 00000011. We shift these 2 bits to the left by 4 so that they become top 2 bits of the 6 bits (rest of the 4 bits will be 0 because the input finished).

On the other hand if input had another byte then we need to take the last 2 bits of the first byte and first 4 bits of the 2nd byte. For which we are doing this gymnastic

out[idx++] = b64_chars[(input[0] & 0x03) << 4 | (input[1] & 0xF0) >> 4];

The logic for selecting last 2 bits of first byte is same as above. The bitmask for selecting first 4 bits of 2nd byte is 0xF0 which is 11110000 in binary, we shift it to the right by 4 and do a bitwise OR with the other bits selected from the first byte. Combined these will give us the next 6 bits to encode.

Now, if the input had only 2 bytes, then we need to encode the remaining 4 bits of the 2nd byte and append an '=' character to the output. It's the same story, we select the last 4 bits of the 2nd byte and left shift them by 2 bits so that they form the top 4 bits of the 6 bits (last 2 bits will be 0 since the input is finished).

However, if the input did not finish then we take the last 4 bits of the 2nd byte and first 2 bits of the next byte to encode. Afterwards we are only left with the last 6 bits of the 3rd byte for which we can do the encoding pretty easily without any gymnastics required.

We keep doing this till we finished scanning the input bytes.

Base64 Decoding

The decoding process is just the reverse of what we did above. We scan each byte of the encoded input and map it to 6 bits of output. Since when generating the encoding we know what numeric representation we gave to each individual character of the base64 encoding, we can generate a table like this:

static const int unbase64 [] = {
  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 62, -1, -1, -1, 63, 52,
  53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -1, -1, 0, -1, -1, -1,
  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
  16, 17, 18, 19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, -1, -1,
  26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
  42, 43, 44, 45, 46, 47, 48, 49, 50, 51, -1, -1, -1, -1, -1
};

This is essentially the ASCII table, where the valid base64 character entries have been given their numeric sequence as per the encoding, e.g. 'A' is 0, 'B' is 1 and so on. The non base64 characters are assigned -1.

Now let's look at the decoding implementation

char *
base64_decode(const char *input, size_t len)
{
    if ((len & 0x03)) {
        warnx("input length expected to be a multiple of 4 bytes");
        return NULL;
    }
    size_t output_len = (len * 3) / 4 + 2;
    char *out = malloc(output_len);
    if (out == NULL)
        err(EXIT_FAILURE, "malloc failed");
    size_t offset = 0;
    
    do {
        for (int i = 0; i <= 3; i++) {
            if (input[i] > 127 || unbase64[input[i]]  == -1) {
                warnx("invalid base64 character %c:", input[i]);
                free(out);
                return NULL;
            }
        }

        out[offset++] = (unbase64[input[0]] << 2) | 
            ((unbase64[input[1]] & 0x30) >> 4);
        if (input[2] != '=') {
            out[offset++] = ((unbase64[input[1]] & 0x0F) << 4) | 
                ((unbase64[input[2]] & 0x3C) >> 2);
        }

        if (input[3] != '=') {
            out[offset++] = ((unbase64[input[2]] & 0x03) << 6) | 
                (unbase64[input[3]]);
        }
        input += 4;
    } while (len -= 4);
    return out;
}

The first thing we are doing is validating that this is valid base64 encoded string or not. For example we expect the input length to be a multiple of 4 bytes for it to be a valid base64 encoding. Similarly we are doing checks for valid base64 characters before we actually start decoding.

The decoding is pretty simple. We are going to take bytes of the encoded input and look up the corresponding decoded value from the table. Since all the numbers in the decoded space are in the range 0 to 63, they can be represented by 6 bits and hence in their byte representation the first 2 bits will always be 0. We need to form the bytes of the decoded output by stitching together these 6 bits at a time.

So, we take the first character of the input and find its corresponding decoded value from the table. We left shift the decoded value by 2 because the top 2 bits would have been unset as mentioned previously. Now that we have first 6 bits of the decoded output, we need another 2 bits to make it a valid decoded byte. For that we take the 2nd byte of the encoded input and lookup in the table. Since this will also be a 6 bit number we shift it to the right by 4 so that its top 2 bits move to the end and combined with the first 6 bits we got by decoding the first byte, we now have 8 bits of decoded output which can be stored in the output.

If the next byte is an '=' that means the input finished and we are done, otherwise we need to take the rest of the 4 bits we got from decoding the 2nd byte and take first 4 bits by decoding the 3rd byte and stitch them together.

(unbase64[input[1]] & 0x0F) << 4) -> this takes the last 4 bits of the decoded 2nd byte and moves them to the top.

((unbase64[input[2]] & 0x3C) >> 2) -> this decodes the 3rd byte and selects its 4 bits between positions 3 to 6 (inclusive). We are selecting the 4 bits starting at position 3 and not 1 because the first 2 bits will be unset as we are mapping to a 6 bit space. We then right shift these bits by 2 so that they form the last 4 bits. When this is OR'd with the previous 4 bits we got from the 2nd byte, we have another 8 bits of decoded output.

If the next input byte is not an '=' charcter that means we have another byte to decode. We take the remaining 2 last bits of the decoded 3rd byte, and combine them with first 6 bits of the decoded 4th byte. We don't have do any bit shifting for the 4th byte because we need the complete 6 bits at this time in order to make a complete byte of output.

We keep doing this until we reach the end of input.

Testing

We can also write a small test to verify that the encoding and decoding works as expected. I am going to take a test string and find out its base64 encoding from a well known tool, such as the Python REPL which comes with the base64 module and assert that our encoder generates the same value. Then when we try to decode the encoded string we get back the original string.

int
main(int argc, char **argv)
{
    char *input = "apple";
    char *expected_output = "YXBwbGU=";
    char *output = base64_encode(input, 5);
    printf("output: %s\n, expected output: %s\n", output, expected_output);
    printf("decoded: %s\n", base64_decode(output, strlen(output)));
    free(output);
    return 0;
}

Conclusion

Even though the Base64 encoding is simple and elegant yet its implementation is a bit handful to do unless you are used to doing bit masking and manipulations. This is a fun exercise to get used to the bit twiddling techniques.

My tech musings

Sunday, August 21, 2022

Understanding Base64 Encoding

Introduction

Base64 Encoding Implementation

Base64 Decoding

Testing

Conclusion

No comments:

Post a Comment