AVX512/VBMI2: A Programmer’s Perspective
Engineering

Linus Torvalds had some interesting things to say about AVX512: “I hope AVX512 dies a painful death… I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do…”. Having spent the last few months migrating a portion of SingleStore’s library of SIMD kernels from AVX2 to AVX512/VBMI, I disagree. For one thing, AVX512 is not designed solely for floating-point workloads. SingleStore’s code fundamentally deals with whizzing bytes around, and AVX512 is more than up to the task.

In case you haven’t heard of SingleStore, it’s an extremely high-performance distributed SQL database management system and cloud database service that can handle all kinds of workloads, from transactional to analytical. It really shines for real-time analytics (summary aggregate queries on large volumes of rapidly changing data). We get our speed through compilation and, on our columnstore access method, vectorization. Squeezing the last bit of performance out of our vectorized execution is where SIMD comes in. AVX2 has given us several-times speedups, and I investigated whether we could double that again with AVX512/VBMI.

As an Intel partner with early access, I tested the performance of Ice Lake, which was launched today. The results were good: on individual kernels, I could often approach or achieve a 2x speedup over the AVX2 implementation, simply by virtue of the doubled register size. While previous generations of CPUs supporting AVX512 had downclocking issues, the Ice Lake chips seemed to have negligible drops in clock speed, even when running an AVX512 workload on all cores.

Below is a chart showing the performance of the three versions of ByteUnpacking, a kernel that takes an array of values of byte width X and extends each value to byte width Y. This is denoted as ByteUnpack_X_Y in the chart. SingleStore uses ByteUnpacking extensively as data is read from disk and decoded.
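To make the ByteUnpack_X_Y naming concrete, here is a minimal scalar sketch in C of what such a kernel computes. It is not SingleStore's implementation: the function name byte_unpack, its parameters, and the choice of unsigned little-endian values that are zero-extended are all assumptions made for illustration. A SIMD version would produce the same output, just many values per instruction.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar reference for ByteUnpack_X_Y.
 * Assumptions (not from the original post): values are unsigned,
 * stored little-endian, and widened by zero-extension.
 * Requires in_width <= out_width; dst must hold count * out_width bytes. */
static void byte_unpack(const uint8_t *src, size_t count,
                        size_t in_width, size_t out_width,
                        uint8_t *dst)
{
    for (size_t i = 0; i < count; i++) {
        uint8_t *out = dst + i * out_width;

        /* Copy the X payload bytes of value i. */
        memcpy(out, src + i * in_width, in_width);

        /* Zero the remaining Y - X high-order bytes. */
        memset(out + in_width, 0, out_width - in_width);
    }
}
```

For example, under these assumptions ByteUnpack_3_4 applied to the six bytes 01 02 03 04 05 06 (two 3-byte values) yields the eight bytes 01 02 03 00 04 05 06 00.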