Sunday, August 8, 2021

Quick look at the new vector API in Java 16

One of the shiny new features of Java 16 is the new vector API, which allows vectorized execution of numerical operations on arrays by having the JIT compiler translate the Java code into SIMD instructions. Compiling high-level language code directly to SIMD instructions is a hard problem, and a compiler rarely does it as well as someone writing the assembly by hand. There are libraries that make this easy by providing optimized implementations of these numerical computations, such as the various BLAS (Basic Linear Algebra Subprograms) libraries, usually written in Fortran and built on top of lower-level math routines that eventually deal with the SIMD intrinsics. Most languages have wrappers around the BLAS libraries; Python, R and Octave, for example, all work this way.

I've been ignorant of the options available for writing high-performance numerical computing code in Java because I never had to care. But a few years back, when it was announced that a future release of Java would have an API for compiling Java code directly to vectorized instructions, it was very exciting, and the API is finally out in Java 16.

The vector API in Java is nice, but it is not that straightforward to use: it leaks some details of the underlying hardware, and the programmer has to be aware of them and deal with them. For example, you need to know the size of the SIMD registers on the CPU, and you have to provide the stride by which to advance the vectorized loop. It also differs in some of the terminology widely used in other languages and libraries; e.g. what Python calls the shape of an array seems to be referred to as lanes here (I think that's what it means, but I could be misunderstanding).
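To make the terminology concrete, here is a small sketch (not part of the benchmark below) that queries the preferred species for the current CPU. In the API's own terms, a VectorSpecies pairs an element type with a VectorShape, which is the vector register width in bits, and the lane count is how many elements of that type fit into one such register. Note that the API ships as the incubator module jdk.incubator.vector, so compiling and running this code needs --add-modules jdk.incubator.vector. The numbers printed will depend on your hardware.

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesInfo {
    public static void main(String[] args) {
        // The "preferred" species picks the widest vector shape the CPU supports.
        VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;

        // Shape: the vector register size in bits (e.g. 256 on AVX2 hardware).
        System.out.println("shape = " + species.vectorShape());
        // Lanes: how many floats fit into one such register (e.g. 8 for 256 bits).
        System.out.println("lanes = " + species.length());
        System.out.println("bits  = " + species.vectorBitSize());
    }
}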

Anyway, I just wanted to do a quick comparison of a simple numerical computation between vectorized API and the old school non-vectorized version and see the performance benefits. I'm going to create an array of 100,000 random floats and then compute its mean. The mean computation is repeated several times in a loop. It appears the performance benefit of vectorized API kicks only when repeating the operation many number of times, it could be because it takes a while for the JIT compiler to identify and compile the loop.


Here is the mean computation using the new vector API:


import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

private static final VectorSpecies<Float> SPECIES =
        FloatVector.SPECIES_PREFERRED;

public static float vectorizedMean(float[] values) {
    float sum = 0.0f;
    int i = 0;
    // Process the array one full vector at a time; loopBound(n) is the
    // largest multiple of the lane count that is <= n.
    for (; i < SPECIES.loopBound(values.length); i += SPECIES.length()) {
        FloatVector floatVector = FloatVector.fromArray(SPECIES, values, i);
        sum += floatVector.reduceLanes(VectorOperators.ADD);
    }
    // Sum up any remaining tail elements that don't fill a whole vector.
    for (; i < values.length; i++) {
        sum += values[i];
    }
    return sum / values.length;
}
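
As an aside on the design: calling reduceLanes inside the loop performs a cross-lane reduction on every iteration. A variant that keeps the partial sums in a vector accumulator and reduces only once at the end should avoid that overhead; the following is a rough sketch (untested, and not the version benchmarked below):

public static float vectorizedMeanAccumulate(float[] values) {
    // Accumulate element-wise partial sums in a vector register.
    FloatVector acc = FloatVector.zero(SPECIES);
    int i = 0;
    for (; i < SPECIES.loopBound(values.length); i += SPECIES.length()) {
        acc = acc.add(FloatVector.fromArray(SPECIES, values, i));
    }
    // One cross-lane reduction at the end instead of one per iteration.
    float sum = acc.reduceLanes(VectorOperators.ADD);
    for (; i < values.length; i++) {
        sum += values[i];
    }
    return sum / values.length;
}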

This is the non-vectorized version:

private static float mean(float[] arr) {
    float result = 0.0f;
    for (float v : arr) {
        result += v;
    }
    return result / arr.length;
}

Here, we execute both methods a bunch of times and collect the timings:

public static void main(String[] args) throws IOException {
    float[] values = generateArray(100000);
    List<Long> vecMeanTimes = new ArrayList<>();
    List<Long> nonvecMeanTimes = new ArrayList<>();

    // Each recorded value is the cumulative elapsed time since the loop started.
    long start = System.currentTimeMillis();
    for (int i = 0; i < MAX_RUN_TIMES; i++) {
        vectorizedMean(values);
        final long vectorizedTimeTaken = System.currentTimeMillis() - start;
        vecMeanTimes.add(vectorizedTimeTaken);
    }

    start = System.currentTimeMillis();
    for (int i = 0; i < MAX_RUN_TIMES; i++) {
        float v = mean(values);
        final long timeTaken = System.currentTimeMillis() - start;
        nonvecMeanTimes.add(timeTaken);
    }

    writeTimesToFile(vecMeanTimes, "vecMeanTimes.csv");
    writeTimesToFile(nonvecMeanTimes, "nonvecMeanTimes.csv");
}


MAX_RUN_TIMES was set to 100,000. 
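
For completeness, the generateArray and writeTimesToFile helpers could look roughly like the sketch below (assumed implementations, not copied from the code above); all that matters for the comparison is that the array is filled with random floats and each run's timing ends up as a line in a CSV file.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Random;

// Sketch of the helpers, assumed for illustration.
private static float[] generateArray(int size) {
    Random random = new Random();
    float[] values = new float[size];
    for (int i = 0; i < size; i++) {
        values[i] = random.nextFloat();
    }
    return values;
}

private static void writeTimesToFile(List<Long> times, String fileName) throws IOException {
    // One timing per line, so the CSV can be plotted directly.
    try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(fileName))) {
        for (long t : times) {
            writer.write(Long.toString(t));
            writer.newLine();
        }
    }
}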

The following plot shows how the two compare:

[Plot: cumulative timings per iteration for the vectorized vs. non-vectorized mean over 100,000 runs]

As I said, the vectorized code is slower than the non-vectorized code for a while before it kicks in and gets significantly faster. Let's zoom in to find the threshold at which it becomes faster.

[Plot: zoomed-in view of the early iterations, around where the two curves cross]

It appears it takes about 500 iterations before the JIT compiler realizes it needs to optimize the code, but it's just a guess. 

In any case, this looks great and exciting, and a lot can be built on top of it.
