GPU Performance Portability Using Standard C++ with SYCL
The proliferation of accelerators, in particular GPUs, over the past decade is changing the way software is developed. Many developers who have been working on CPU-based machines are now considering how to improve application performance by offloading execution to many-core processors. Emerging disciplines, including AI, deep neural networks and machine learning, have shown that GPUs can increase performance many times over compared to CPU-only architectures. New hardware features such as "tensor cores" are also starting to emerge to address specific problems, including mixed-precision computing. The new challenge for developers is figuring out how to develop for heterogeneous architectures that include GPUs made by different companies. Currently the most common way to develop software for GPUs is the CUDA programming model, but this has pitfalls. CUDA uses non-standard C++ syntax and semantics, is a proprietary interface, and can only be used to target Nvidia GPUs. Alternatives include HIP, another vendor-specific programming interface, which primarily targets AMD GPUs.
This presentation will demonstrate how standard C++ code with SYCL can be used to achieve performance portability on processors from multiple vendors, including Nvidia GPUs, AMD GPUs and Intel GPUs. The SYCL programming interface is a royalty-free, industry-defined open standard designed to enable the latest features of accelerators. Using an open-source project, we'll show how standard C++ syntax and semantics are used to define the SYCL kernel and memory management code required to offload parallel execution to a range of GPUs. We'll then explain how to compile this C++ code using a SYCL compiler so that it can run on Nvidia, AMD and Intel GPUs, and compare its execution performance with that of the same code written using the proprietary CUDA and HIP environments. Lastly, we'll share our tips for achieving the best performance on different processor architectures, including dealing with varying memory resources, using the most appropriate memory access patterns, using hardware-specific features such as "tensor cores" and ensuring high utilization of the processor cores.
Hugh Delaney
Hugh is a software engineer at Codeplay, where he works on the DPC++ compiler. Hugh’s academic background is in mathematics and HPC with a focus on numerical algorithms and linear algebra. Hugh has been teaching mathematics and computing in some manner for all of his adult life.
Rod Burns
Rod Burns has been helping developers to build complex software for well over a decade, with experience in organizing training, tutorials and workshops. At Codeplay, Rod leads the effort to support and educate developers using SYCL. Rod helped to create “SYCL Academy,” an open-source set of materials for teaching SYCL that has already been adopted by some of the top universities in the world and has been used at multiple conferences to teach SYCL.