Modern CPUs have become so complicated that it's hard to have an intuition about their performance characteristics. One of the common things that we do is to cache a commonly done computation because intuitively it makes sense that computation is expensive and if we cache it, that will save us time. But that may not always be true. Let's see why.
When writing high level code and trying to analyze its performance, it helps to have a mental model of how it might be compiled to assembly or machine instructions and what the latency of those instructions might be. There is very helpful infographic that I came across a while back which lists down some of the most common CPU operations and their latency in number of CPU cycles and to put things in perspective it also shows the distance travelled by light by the time that operations finishes.
There are few interesting things here. Simple register register ops are very fast, such as adding two numbers. That means if the data is sitting in registers directly, it can be operated upon in no time.
We can also notice that floating point addition and multiplication of different numeric types is very fast, between 1-3 and 1-7 cycles respectively. At the same time we can see that a read from L1 cache is slightly slower than those ops (3-4 cycles). This tells us that it is faster to multiply two numbers than have the result of the multiplication sitting in the cache. Not only would it waste the precious cache space, it would be slower. This is revealing because most of us would think the opposite is true and write code where we store the computed result, thinking it saves CPU cycles.
But at the same time we should note that division is a very expensive operation (it's a relatively well known fact) and we should avoid it where possible or cache the result if doing it too often.
Let's try to test measure how vast is the difference in the performance in practice. I'm going to use C because the compiler generates code very close to the hardware model as oppossed to other higher level languages where things like the runtime of the language can make it hard to understand what's happening.
Here we are running two loops to measure the latency of the two operations. The first loop computes the product of two integers a and b and prints it on stdout.
The second loop dereferences the pointer to an integer which contains the precomputed product of a and b and prints it on stdout. Here we are using a pointer to an integer instead of the integer directly in order to bring the L1 cache into play. The generated assembly code would be such that the CPU would have to read the value from the memory address the pointer points to and put it in a register before printing it. The CPU will put the value in the cache after the first run and so the loop is effectively going to measure the latency of reading from L1 cache.
We can compile and run this code as follows:
➜ clang -o latency latency.c -lm
➜ ./latency 1>/dev/null
total time taken for product: 11.637324, avg time: 0.000000 +- 0.000000
total time taken for print_number: 12.010727, avg time: 0.000000 +- 0.000000
No comments:
Post a Comment