
Fast C++ by using SIMD Types with Generic Lambdas and Filters

10:30 - 11:30, Thursday 15th September 2022 (MDT), Summit 8 & 9 / Online C
Level: Intermediate / Advanced / Expert
Track: Scientific Computing

We show that very high-performance code can be generated when generic lambdas are instantiated with vectorized numeric types (SIMD wrappers). Trivial lambdas can be used to create STL-like functionality with performance as high as 10-20 GFLOPS, which is competitive with the standard libraries and some commercial libraries. A small framework that supports memory management, alignment, and padding of vectors underpins this style of programming: it applies users' lambdas and filters over vector-like objects.

The performance of trivial implementations of some STL-like functions (memcopy, inner_product, sum of squared values, max_element, and accumulate with error correction) is investigated. In some cases the STL generates scalar instructions and performs quite poorly compared with the lambda-generated code, which is branchless and makes full use of the SIMD instructions.

Branching can be a problem for vectorized code. We explore cases where the branches have light, medium, and heavy calculation loads and different frequencies of being traversed. With frequent branching, we find that select operations are useful for handling conditional constants. This is also true for conditionals with lightweight and middleweight expressions in their branches. Our compound operation, transformSelect, which computes both branches and conditionally blends the in-register results, appears to be quite a useful tool.

When heavy branches are traversed infrequently, we can filter to a value-based contiguous view and then perform a transform efficiently using vectorized instructions (filterTransform). In the limiting case where all branches have very heavy compute compared with the cost of filtering, it is best to filter the cases into separate contiguous regions of memory and then apply the vectorized algorithms, so that all the register lanes can be used in the calculation.

We illustrate this with the example of writing a vectorized version of the inverse cumulative normal distribution function. We use the VC++, Intel 2022, and Clang compilers and compare the performance of different implementation approaches, on Silver and Gold/W Xeons, against the same function in Intel's Short Vector Math Library (SVML).

Andrew Drakeford

A physics PhD who started developing C++ applications in the early '90s at British Telecom labs. For the last two decades he has worked in finance, developing calculation libraries and trading systems in C++.