Accelerators continue to grow in importance and usage. What does this mean for developers?
Computers are a means to an end. They allow us to have faster solutions to complex problems, provide the ability to store and retrieve information across the globe, provide the backbone for remarkable technologies like robotics and self-driving cars (sort of) and AI, and hopefully uplift the lives of everyone on the planet. As the problems to solve have become more complex, computer architecture, programming languages, and programming models have continued to evolve. This has led to the growth of hardware accelerators and domain-specific programming models.
Professor David Patterson from UC Berkeley (the author of all the computer architecture books I had in college) has talked extensively about domain-specific architectures and accelerators. Today, when we talk about a hardware accelerator, we are often talking about a GPU. However, a variety of other accelerators have arisen to solve specific problems. Deep learning and AI accelerators, for example, use hardware designed specifically to perform large-scale matrix operations, the heart of deep learning workloads.
In addition, there are hardware acceleration technologies built into traditional CPUs like Intel® Advanced Vector Extensions (AVX) and Intel® Advanced Matrix Extensions (AMX).
With the rise of new accelerators, there is always the challenge of how to program for that accelerator. Most accelerators currently available are based on parallel execution and hence some form of parallel programming. This is the first in a series of articles where we will talk about parallel programming and various ways to enable it on modern accelerators.
Parallel programming is how we write code to express the parallelism in an algorithm so that it can run on an accelerator, or even just on multiple CPUs. But what is parallelism? Why does it matter?
Parallelism is when different parts of a program can run at the same time. Typically we break this down into two categories: task parallelism and data parallelism.
Task parallelism is when multiple functions can be independently performed at the same time. An example of this would be preparing for a party. One person may get balloons, another person may get ice cream and a third person may get the cake. While the party won’t be complete without all three things, three different people can do what they need to do independently of everyone else, as long as they are done before the party and meet in the right place.
If we map this to a computer program, each person is analogous to some compute hardware (CPU/GPU/FPGA), and picking up the various items are the tasks to be run.
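To make the party analogy a little more tangible, here is a minimal sketch of task parallelism using only the C++ standard library. The function names (`get_balloons`, `get_ice_cream`, `get_cake`, `prepare_party`) are hypothetical and exist only for this illustration; each errand runs as its own asynchronous task, and the results are collected once all three are done.

```cpp
#include <future>
#include <string>
#include <vector>

// Hypothetical errands for the party analogy; each one is independent.
std::string get_balloons() { return "balloons"; }
std::string get_ice_cream() { return "ice cream"; }
std::string get_cake() { return "cake"; }

// Launch the three errands as separate tasks, then wait for all of them.
std::vector<std::string> prepare_party() {
    // std::launch::async requests that each task run on its own thread.
    auto balloons = std::async(std::launch::async, get_balloons);
    auto ice_cream = std::async(std::launch::async, get_ice_cream);
    auto cake = std::async(std::launch::async, get_cake);

    // get() blocks until the task has finished: everyone meets before the party.
    return {balloons.get(), ice_cream.get(), cake.get()};
}
```

The three tasks are different functions, which is what makes this task parallelism rather than data parallelism.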
Data parallelism is when the same function can be performed independently on several pieces of data of the same kind. Imagine the person in our example above goes to get ice cream, but four other people also want ice cream. Potentially, all five people can get ice cream from the freezer at the same time.
In this example, we again have the analogy of people to compute hardware, each one tasked with getting ice cream. The ice cream is our (delicious) analogy for data.
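As a rough sketch of data parallelism, the code below applies one function to every element of a vector, splitting the elements across several threads. The names `scoop` and `scoop_all` are made up for this illustration, and doubling an integer is just a placeholder for real work.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-item work: the same function is applied to every element.
int scoop(int container) { return container * 2; }

// Split the data into contiguous chunks and process each chunk on its own thread.
std::vector<int> scoop_all(const std::vector<int>& containers, unsigned num_workers) {
    std::vector<int> result(containers.size());
    if (num_workers == 0) num_workers = 1;
    const std::size_t chunk = (containers.size() + num_workers - 1) / num_workers;

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        const std::size_t begin = std::min(containers.size(), w * chunk);
        const std::size_t end = std::min(containers.size(), begin + chunk);
        if (begin == end) break;  // more workers than data: nothing left to do
        workers.emplace_back([&, begin, end] {
            // Each thread writes to a disjoint range, so no locking is needed.
            for (std::size_t i = begin; i < end; ++i) result[i] = scoop(containers[i]);
        });
    }
    for (auto& t : workers) t.join();
    return result;
}
```

Every thread runs the same function; only the slice of data it works on differs.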
These examples are simplistic but hopefully instructive. There are more subtle types of parallelism, especially in the area of computer architecture, but these simple analogies may help you understand how to think about parallelism when doing parallel programming.
Some important things to consider when it comes to parallelism are the available resources and the available parallelism in a given solution space.
- We may be limited by the number of resources to complete a task. If I only have three people, I can only do three tasks at a time.
- We may be limited by the available tasks. If I only need balloons and cake for my party, a third person to perform a task doesn’t help me.
- We may be limited by the available data. If I have five people wanting to get ice cream, but only three containers of ice cream left, two people will have nothing to do.
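The three limits above boil down to simple arithmetic: the usable parallelism is the smaller of the number of workers and the number of independent work items. A minimal sketch, with hypothetical helper names chosen for this illustration:

```cpp
#include <algorithm>
#include <cstddef>

// Upper bound on how many things can happen at once: the smaller of
// the number of workers and the number of independent work items.
std::size_t usable_parallelism(std::size_t workers, std::size_t work_items) {
    return std::min(workers, work_items);
}

// Workers left with nothing to do.
std::size_t idle_workers(std::size_t workers, std::size_t work_items) {
    return workers - usable_parallelism(workers, work_items);
}
```

With five people and three containers of ice cream, `usable_parallelism(5, 3)` is 3 and `idle_workers(5, 3)` is 2, matching the last bullet.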
Another separate issue to consider is contention. Resources are generally limited and often our attempts to parallelize a problem run into issues with resource access. Let’s imagine we go to the store to get our ice cream and there are 100 containers of ice cream in the freezer, but only three people can stand in front of the freezer at once. That means that even if 100 people are there to get ice cream, getting ice cream still has at most a parallelism of three because of the limited access to the freezer.
This analogy applies to various pieces of computer hardware. For example, think of freezer access as a memory bus. If the bus is not wide enough to let data flow to the accelerator fast enough, your accelerator is stuck waiting for the data it needs to do its work.
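The freezer example can be sketched in code as a shared resource with a fixed number of slots. The `Freezer` class and `run_shoppers` function below are hypothetical names for this illustration; a mutex and condition variable admit at most `slots` threads at a time, and the class records the peak number of threads that were actually at the freezer together.

```cpp
#include <algorithm>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// A freezer that only `slots` shoppers can stand in front of at once.
class Freezer {
public:
    explicit Freezer(int slots) : free_slots_(slots) {}

    void take_container() {
        std::unique_lock<std::mutex> lock(m_);
        // Wait in line until a spot in front of the freezer opens up.
        cv_.wait(lock, [this] { return free_slots_ > 0; });
        --free_slots_;
        ++in_front_;
        peak_ = std::max(peak_, in_front_);
        lock.unlock();

        std::this_thread::yield();  // stand-in for reaching into the freezer

        lock.lock();
        --in_front_;
        ++free_slots_;
        ++taken_;
        lock.unlock();
        cv_.notify_one();  // let the next shopper step up
    }

    int peak() const { return peak_; }
    int taken() const { return taken_; }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int free_slots_;
    int in_front_ = 0;
    int peak_ = 0;
    int taken_ = 0;
};

struct FreezerStats { int peak; int taken; };

// Send `shoppers` threads to a freezer with `slots` spots and report what happened.
FreezerStats run_shoppers(int shoppers, int slots) {
    Freezer freezer(slots);
    std::vector<std::thread> threads;
    for (int i = 0; i < shoppers; ++i)
        threads.emplace_back([&freezer] { freezer.take_container(); });
    for (auto& t : threads) t.join();
    return {freezer.peak(), freezer.taken()};
}
```

Calling `run_shoppers(100, 3)` serves all 100 shoppers, but the recorded peak never exceeds 3, no matter how many threads are launched.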
To make this slightly more concrete, let’s show some code for the data parallel shopping problem above. We start by creating a shopper class. The job of this class is to perform the work of shopping. In this case, I’m using a naive matrix multiply to represent the cost of shopping.
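A shopper class along these lines might look like the sketch below. This is an illustration under my own assumptions, not the original listing: the class name `Shopper`, the `shop` method, the default matrix size, and the fill values are all hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical shopper: "shopping" costs one naive O(n^3) matrix multiply.
class Shopper {
public:
    explicit Shopper(std::size_t n = 64) : n_(n) {}

    // Multiply two n x n matrices the slow way and return the trace of
    // the product, so the work has an observable result.
    double shop() const {
        std::vector<double> a(n_ * n_, 1.0), b(n_ * n_, 2.0), c(n_ * n_, 0.0);
        for (std::size_t i = 0; i < n_; ++i) {
            for (std::size_t j = 0; j < n_; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < n_; ++k)
                    sum += a[i * n_ + k] * b[k * n_ + j];
                c[i * n_ + j] = sum;
            }
        }
        double trace = 0.0;
        for (std::size_t i = 0; i < n_; ++i) trace += c[i * n_ + i];
        return trace;
    }

private:
    std::size_t n_;
};
```

Because each shopper owns its matrices and shares nothing, many shoppers can work at the same time without coordination.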
Parallelism with OpenMP®
Now that I have defined the act of shopping, let’s look at how to make this code run in parallel using OpenMP, a portable solution that allows programmers to add parallelism to their programs. Pragma directives are added to existing code to tell the compiler how to run parts of the code in parallel.
This simple code consists of four simple pieces:
- Line 18: Create
- Line 21: Call the
- Lines 10–12: Ask each shopper to go shopping
- Line 9: A pragma that, when OpenMP is enabled, tells the compiler to run iterations of this loop in parallel
The rest of it is just timers so we can see the benefits of running the code in parallel. The nice thing about the OpenMP code is that you can compile the same source with or without OpenMP enabled. Without it, the pragma is ignored and the result is a serial program.
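As a rough illustration of the directive approach, here is a minimal sketch of a shopping loop annotated with an OpenMP pragma. It is not the listing the line numbers above refer to; `go_shopping` and `shop_all` are hypothetical names, and the timers are omitted to keep it short.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one shopper's work: a naive matrix multiply
// that returns the trace of the product.
double go_shopping(std::size_t n) {
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    double trace = 0.0;
    for (std::size_t i = 0; i < n; ++i) trace += c[i * n + i];
    return trace;
}

// Ask every shopper to go shopping and add up their results.
double shop_all(int num_shoppers, std::size_t n) {
    std::vector<double> results(num_shoppers, 0.0);
    // With OpenMP enabled, iterations are divided among threads.
    // Without it, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int i = 0; i < num_shoppers; ++i) {
        results[i] = go_shopping(n);  // each iteration writes its own slot
    }
    double total = 0.0;
    for (double r : results) total += r;
    return total;
}
```

Each iteration touches only its own element of `results`, so the loop is safe to parallelize, and the returned total is the same whether the pragma is honored or ignored.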
For this article, I’m using an HP Envy laptop, which has an Intel® Core™ i7-12700H processor with 32GB of memory and an integrated Intel® Iris® Xe GPU. The system is running Ubuntu 22.04, a 6.0 kernel, and the Intel® oneAPI DPC++/C++ Compiler. First, I enabled the Intel compiler environment with this command:
The following commands compile the code:
> icx -lstdc++ grocery-omp.cpp -o serial-test
> icx -lstdc++ -fiopenmp grocery-omp.cpp -o omp-test
The last command includes the -fiopenmp flag to tell the compiler to enable parallelism using OpenMP. Running the executables, I get this output on my system:
Elapsed time in milliseconds: 27361 ms
Elapsed time in milliseconds: 4002 ms
The OpenMP run utilizes several of my CPU cores to do the work simultaneously, which results in a speedup of roughly 6.8x (27361 ms versus 4002 ms).
Parallelism with oneAPI/SYCL
OpenMP is a fantastic way to enable parallelism through a directive approach. C++ with SYCL, part of the oneAPI specification, allows us to express parallelism using an explicit approach.
The SYCL code is larger than the previous code base for this example. However, we have the same core code:
- Line 31: Create
- Line 46: Call the
- Line 25: Ask each shopper to go shopping
Unlike the OpenMP example where we use directives, SYCL allows users to explicitly define the parallel behavior of their program via code and C++ constructs. This provides runtime flexibility that the directive-based example above does not have.
Depending on your use case, one paradigm may make more sense than another. In this case, we have a way to take a single binary and run it in multiple ways using Intel’s oneAPI SYCL runtime and the SYCL_DEVICE_FILTER environment variable:
> SYCL_DEVICE_FILTER=host:host:0 ./sycl-test
Running on device: SYCL host device
Elapsed time in milliseconds: 27201 ms
> SYCL_DEVICE_FILTER=opencl:cpu:1 ./sycl-test
Running on device: 12th Gen Intel(R) Core(TM) i7-12700H
Elapsed time in milliseconds: 4197 ms
> SYCL_DEVICE_FILTER=opencl:gpu:3 ./sycl-test
Running on device: Intel(R) Graphics [0x46a6]
Elapsed time in milliseconds: 3988 ms
The values for SYCL_DEVICE_FILTER come from running the sycl-ls command, which ships with the Intel® oneAPI Base Toolkit.
The cool thing about the oneAPI C++ with SYCL implementation is that a single binary can direct the work to different devices.
Understanding parallelism and writing parallel code is a complicated problem. This primer describes a simple example of how parallelism can be enabled using OpenMP and C++ with SYCL. There are lots of other considerations when creating a parallel program, such as how multiple compute resources share memory and variables and how to balance work properly across multiple resources.
I will address those in future articles, but for now, I encourage you to try grabbing the simple example above and see how it works on your systems.
Thanks for reading!
Want to Connect? If you want to see what random tech news I’m reading, you can follow me on Twitter.
Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.