Sunday, January 2, 2022

To Compute or to Cache?

Modern CPUs have become so complicated that it's hard to have good intuition about their performance characteristics. One common habit is to cache the result of a frequently performed computation, on the intuition that computation is expensive and caching the result will save time. But that isn't always true. Let's see why.

When writing high-level code and trying to analyze its performance, it helps to have a mental model of how it might be compiled to assembly or machine instructions and what the latency of those instructions might be. There is a very helpful infographic I came across a while back which lists some of the most common CPU operations and their latency in CPU cycles; to put things in perspective, it also shows how far light travels in the time each operation takes.

[Infographic: latencies of common CPU operations, shown in CPU cycles and as the distance light travels in that time]

There are a few interesting things here. Simple register-to-register ops, such as adding two numbers, are very fast. That means if the data is already sitting in registers, it can be operated on almost immediately.

We can also see that floating-point addition and multiplication across the different numeric types are very fast, roughly 1-3 and 1-7 cycles respectively. At the same time, a read from L1 cache is slightly slower than those ops (3-4 cycles). This suggests that it can be faster to multiply two numbers than to fetch the precomputed product from the cache: storing the result not only takes up precious cache space, it can actually be slower. This is revealing because most of us assume the opposite and write code that stores the computed result, thinking it saves CPU cycles.
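To make that concrete: if a and b are already sitting in registers, computing a * b should cost on the order of 1-3 cycles, while loading a previously stored product costs 3-4 cycles even when it hits L1, and considerably more if it has been evicted to L2, L3 or main memory.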

At the same time, division is a very expensive operation (a relatively well-known fact), so we should avoid it where possible, or cache the result if we need it often.
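For example, when the same divisor is used over and over, a common trick is to divide once and cache the reciprocal, then multiply inside the loop. A minimal sketch (the function and variable names here are just illustrative, and note that multiplying by a reciprocal can change the last bits of a floating-point result):

#include <stddef.h>

/* Scale every element by 1/divisor: one division up front,
   then only cheap multiplications inside the loop. */
static void
scale_by(double *values, size_t n, double divisor)
{
    double inv = 1.0 / divisor;   /* the single expensive division */
    for (size_t i = 0; i < n; i++)
        values[i] *= inv;         /* multiplications are much cheaper */
}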

Let's try to measure how big the difference is in practice. I'm going to use C because the compiler generates code very close to the hardware, as opposed to higher-level languages where things like the language runtime can make it harder to understand what's happening.

#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <math.h>

#define MEASURE_COUNT 200000000

/* Arithmetic mean of an array of samples. */
static double
mean(double *values, uint32_t size)
{
    double sum = 0.0;
    for (uint32_t i = 0; i < size; i++)
        sum += values[i];
    return sum / size;
}

/* Population standard deviation of an array of samples. */
static double
std(double *values, uint32_t size, double mean)
{
    double sum = 0.0;
    for (uint32_t i = 0; i < size; i++)
        sum += pow(values[i] - mean, 2);
    return sqrt(sum / size);
}

/* Sum of an array of samples. */
static double
sum(double *values, uint32_t size)
{
    double sum = 0.0;
    for (uint32_t i = 0; i < size; i++)
        sum += values[i];
    return sum;
}

int
main(int argc, char **argv)
{
    int a = 100;
    int b = 200;
    /* Precomputed product, kept behind a pointer so it has to be
       loaded from memory (and hence the cache) on every access. */
    int *prod = malloc(sizeof(int));
    *prod = a * b;
    double *times1 = malloc(sizeof(double) * MEASURE_COUNT);
    double *times2 = malloc(sizeof(double) * MEASURE_COUNT);
    struct timespec begin, end;
    double time_taken;
    double mean_time, std_time, sum_time;

    /* Loop 1: recompute the product on every iteration. */
    for (int i = 0; i < MEASURE_COUNT; i++) {
        clock_gettime(CLOCK_MONOTONIC_RAW, &begin);
        printf("%d\n", a * b);
        clock_gettime(CLOCK_MONOTONIC_RAW, &end);
        time_taken = (end.tv_nsec - begin.tv_nsec) / 1000000000.0 +
                     (end.tv_sec - begin.tv_sec);
        times1[i] = time_taken;
    }
    /* Loop 2: load the precomputed product on every iteration. */
    for (int i = 0; i < MEASURE_COUNT; i++) {
        clock_gettime(CLOCK_MONOTONIC_RAW, &begin);
        printf("%d\n", *prod);
        clock_gettime(CLOCK_MONOTONIC_RAW, &end);
        time_taken = (end.tv_nsec - begin.tv_nsec) / 1000000000.0 +
                     (end.tv_sec - begin.tv_sec);
        times2[i] = time_taken;
    }

    mean_time = mean(times1, MEASURE_COUNT);
    sum_time = sum(times1, MEASURE_COUNT);
    std_time = std(times1, MEASURE_COUNT, mean_time);
    fprintf(stderr, "total time taken for product: %f, avg time: %f +- %f\n",
            sum_time, mean_time, 2 * std_time);

    mean_time = mean(times2, MEASURE_COUNT);
    sum_time = sum(times2, MEASURE_COUNT);
    std_time = std(times2, MEASURE_COUNT, mean_time);
    fprintf(stderr, "total time taken for print_number: %f, avg time: %f +- %f\n",
            sum_time, mean_time, 2 * std_time);
    return 0;
}

Here we run two loops to measure the latency of the two operations. The first loop computes the product of the two integers a and b and prints it to stdout.

The second loop dereferences a pointer to an integer holding the precomputed product of a and b and prints it to stdout. We use a pointer rather than a plain integer variable in order to bring the L1 cache into play: in the generated code the CPU has to load the value from the memory address the pointer refers to into a register before printing it. After the first iteration the value will sit in the cache, so this loop effectively measures the latency of reading from L1.
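If you want to sanity-check that mental model, you can ask clang to emit the assembly it generates (I'm assuming an x86-64 machine here; the instruction names will differ on other architectures):

➜ clang -S -O0 latency.c

In the resulting latency.s, the first loop should contain an imul instruction feeding the printf call, while the second loop should contain only mov loads.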

We can compile and run this code as follows:

➜ clang -o latency latency.c -lm

➜ ./latency 1>/dev/null

total time taken for product: 11.637324, avg time: 0.000000 +- 0.000000

total time taken for print_number: 12.010727, avg time: 0.000000 +- 0.000000
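Dividing the totals by MEASURE_COUNT gives roughly 58 ns per iteration for the recomputed product and roughly 60 ns for the cached one, which is why the per-iteration averages print as 0.000000 with %f. Much of that time is presumably the printf and clock_gettime calls themselves, so the meaningful signal is the gap between the two totals.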


We can see that the difference between the two versions is not drastic, but over many iterations simply multiplying the numbers was slightly faster than keeping the result cached. On its own, the difference hardly seems significant enough to bother with such micro-optimizations. That said, recomputing can save precious space in the L1 cache. The cache is small and scarce, and keeping unnecessary data in it can cause cache misses that eventually cost more CPU time. So even if the cost of the multiplication is on average the same as, or only slightly better than, fetching the result from L1, it may be better to just always compute the product and save the cache space.
