Programming Intel Arc Xe Matrix Extensions (XMX) with oneAPI


Using the joint_matrix API to drive the latest Intel acceleration technologies


Introduction

With the release of the Intel® Arc™ Graphics desktop GPU, developers have some new hardware accelerator options to play with. As an Intel software architect and performance enthusiast, the first thing I always think of is how I can use new hardware to solve my problem faster. With Arc, the first thing I wanted to try was the Intel Xᵉ Matrix Extensions (XMX) hardware and its dedicated matrix engines.

Why does this matter?

Tensor operations are at the heart of deep learning workloads. One of the fundamental acceleration capabilities of Intel XMX is dedicated hardware to perform matrix operations, which higher-level tensor operations decompose into. For most AI end users, TensorFlow and PyTorch will be the software layer through which we use this hardware. However, another class of users and developers, myself included, looks at this and thinks: how can I program this new hardware directly and use it for some other purpose?

oneAPI joint_matrix

As with most hardware, there are a few ways to program for XMX accelerators. You can write GPU assembly or use GPU intrinsics. For you brave folks, I would refer you to the oneAPI GPU optimization guide as a starting point. I wanted to try something simpler. Fortunately for us, there is an experimental oneAPI extension, joint_matrix, that allows us to program the hardware using a higher-level API.

In addition to supporting Intel hardware, the joint_matrix C++ API allows us to perform matrix operations on a variety of hardware. From the joint_matrix introduction:

This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance these different hardware can offer.

The Fun Part: Testing the Intel Arc A750

I have a brand new Intel Arc A750 card I just put into my personal Intel Alder Lake Core i9-12900KF Alienware R13 system. I also happen to be using Windows, so the instructions below may be slightly different if you are using Linux or WSL.


Breaking down a joint_matrix Matrix Multiply Example

My goal is just to exercise the hardware with some simple matrix operations. I am starting with an example from the Intel llvm-test-suite that runs a hardware-accelerated matrix multiply using bfloat16 and checks that the output matches a simple CPU matrix multiply.

I modified the test a little bit to output which accelerator hardware the matrix multiply is running on. Here’s the snippet of code used to perform the joint_matrix-enabled matrix multiply:

Here are some high-level things to note in the code:

  • Lines 1–11: The big_matrix class allows us to represent an arbitrary-sized matrix.
  • Lines 23–27: The device selector shows us which accelerator the algorithm is running on (see the sketch after this list).
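
For context, here is a minimal sketch of those two pieces. The big_matrix wrapper is paraphrased from the test, and the device query uses standard SYCL 2020 calls; the names and the main function here are illustrative rather than the test's exact code.

#include <sycl/sycl.hpp>
#include <iostream>

// Thin wrapper around a raw pointer so an arbitrary-sized matrix can be
// passed around with its dimensions carried in the type.
template <typename T, size_t Rows, size_t Cols>
struct big_matrix {
  T *mat;
  explicit big_matrix(T *data) : mat(data) {}
  T *get_data() { return mat; }
};

int main() {
  // Ask the SYCL runtime which device the default queue picked,
  // e.g. "Intel(R) Arc(TM) A750 Graphics".
  sycl::queue q;
  std::cout << "Running on device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  return 0;
}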

Since a large matrix will not always fit into the hardware, the matrix multiply operation is performed by decomposing the two matrices to be multiplied into smaller sub-blocks and then accumulating the results of multiplying these sub-blocks into the proper parts of the output matrix. The operations are the same as when we do a naive matrix multiply, just in a slightly different order, since we don’t iterate over the entire column space before moving to the next row.
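
To make that ordering concrete, here is a plain CPU sketch of the same tiled accumulation. The tile sizes TM, TN, and TK are illustrative, the matrix dimensions are assumed to divide evenly by them, and C is assumed to start zeroed.

// Tiled matrix multiply: C (M x N) += A (M x K) * B (K x N).
// Each TM x TN tile of C accumulates partial products from TM x TK tiles of A
// and TK x TN tiles of B, one K-slice at a time.
constexpr int TM = 8, TN = 16, TK = 16;

void tiled_matmul(const float *A, const float *B, float *C, int M, int N, int K) {
  for (int i0 = 0; i0 < M; i0 += TM)
    for (int j0 = 0; j0 < N; j0 += TN)
      for (int k0 = 0; k0 < K; k0 += TK)        // walk the shared K dimension in slices
        for (int i = i0; i < i0 + TM; ++i)
          for (int j = j0; j < j0 + TN; ++j)
            for (int k = k0; k < k0 + TK; ++k)
              C[i * N + j] += A[i * K + k] * B[k * N + j];
}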

The core breakdown of how the matrix multiply is happening above is as follows:

  • Line 36: the parallel_for is splitting the work based on the 2-dimensional nd_range — this is how we are walking over the matrix space.
  • Lines 49–56: sub_a, sub_b, and sub_c are initialized. Since the hardware cannot hold the whole matrix in memory, the algorithm loads parts of the matrix into the accelerator’s registers using the joint_matrix API. sub_a and sub_b are the fragments of the matrices being multiplied, and sub_c is our target output matrix.
  • Line 58: the joint_matrix_fill API initializes the register directly to a value rather than loading values from memory. In this case, I initialize the register to 0.
  • Lines 64–71: load the parts of the matrices to multiply and accumulate into our output matrix.
  • Line 72: perform a matrix multiply and add using sub_a and sub_b as inputs and sub_c as the target, using the XMX accelerator.
  • Lines 74–77: store the in-progress output values for this part of the matrix computation back to memory (a sketch of this kernel structure follows below).
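
Putting those pieces together, the kernel body has roughly the following shape. This is only a condensed sketch, not the exact test code: the template parameters and load/store signatures of this experimental API have shifted between compiler releases, the tile sizes TM, TN, TK and the index arithmetic are simplified, and the real test keeps matrix B in a packed (VNNI) layout that is glossed over here. Refer to the complete file below and the API documentation for the exact form.

using bfloat16 = sycl::ext::oneapi::experimental::bfloat16;
using namespace sycl::ext::oneapi::experimental::matrix;

q.submit([&](sycl::handler &cgh) {
  auto accA = bufA.get_access<sycl::access::mode::read>(cgh);
  auto accB = bufB.get_access<sycl::access::mode::read>(cgh);
  auto accC = bufC.get_access<sycl::access::mode::read_write>(cgh);

  // One sub-group of SG_SZ work-items owns each TM x TN tile of the output.
  cgh.parallel_for(
      sycl::nd_range<2>({M / TM, (N / TN) * SG_SZ}, {1, SG_SZ}),
      [=](sycl::nd_item<2> item) [[intel::reqd_sub_group_size(SG_SZ)]] {
        sycl::sub_group sg = item.get_sub_group();
        const size_t tile_row = item.get_group(0) * TM;  // row offset of this tile in C
        const size_t tile_col = item.get_group(1) * TN;  // column offset of this tile in C

        // Per-sub-group register fragments of A, B, and the accumulator C.
        joint_matrix<bfloat16, TM, TK> sub_a(sg);
        joint_matrix<bfloat16, TK, TN> sub_b(sg);
        joint_matrix<float, TM, TN> sub_c(sg);

        // Initialize the accumulator registers to 0 without loading from memory.
        joint_matrix_fill(sg, sub_c, 0.0f);

        // Walk the shared K dimension one TK-wide slice at a time.
        for (size_t k = 0; k < K; k += TK) {
          joint_matrix_load(sg, sub_a,
                            accA.get_pointer() + tile_row * K + k, K,
                            matrix_layout::row_major);
          joint_matrix_load(sg, sub_b,
                            accB.get_pointer() + k * N + tile_col, N,
                            matrix_layout::row_major);
          // One XMX multiply-add: sub_c += sub_a * sub_b.
          sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
        }

        // Write this tile of the output matrix back to memory.
        joint_matrix_store(sg, sub_c,
                           accC.get_pointer() + tile_row * N + tile_col, N,
                           matrix_layout::row_major);
      });
});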

For reference, here is my complete joint_matrix_bfloat16_modified.cpp

Testing the Accelerator Matrix Multiply

Since this uses an experimental extension to oneAPI, it requires the Intel® oneAPI DPC++/C++ Compiler. I am using the latest version from the Intel® oneAPI Base Toolkit 2022.3 release.

This functionality has been enabled since the 2022.1 release, but some of the namespaces were updated along the way. For example, the bfloat16 type moved from the first namespace below to the second:

sycl::ext::intel::experimental::bfloat16
sycl::ext::oneapi::experimental::bfloat16

To compile this example, run the following command after installing the Intel® oneAPI Base Toolkit and following the environment configuration steps:

> icx /EHsc /fsycl joint_matrix_bfloat16_modified.cpp
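
On Linux, the equivalent invocation should be along these lines (the /EHsc flag is Windows-specific):

$ icpx -fsycl joint_matrix_bfloat16_modified.cpp -o joint_matrix_bfloat16_modified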

After compiling the code, we simply run the executable:

> joint_matrix_bfloat16_modified.exe
Running on device: Intel(R) Arc(TM) A750 Graphics
Elapsed time in milliseconds (accelerated): 142 ms
Elapsed time in milliseconds (reference): 2118 ms
passed

We can see that the matrix multiply was performed in 142 milliseconds using the Intel Arc GPU, and the non-accelerated version ran in 2118 milliseconds on the CPU. Note that if you try to run the accelerated version on hardware that does not support matrix operations, the current behavior as defined by the API is to report a failure due to a lack of supported matrix accelerator hardware. This prevents a user from unknowingly using a slower, fallback matrix implementation instead of the hardware-accelerated version.

Enabling Multiple Accelerators

The joint_matrix API is not just designed for portability in the abstract. The latest Intel DPC++ compiler has hardware support for Intel XMX, Intel® Advanced Matrix Extensions (AMX) and NVIDIA Tensor Cores. For those unfamiliar, AMX is the new x86 instruction set for matrix multiplication acceleration that utilizes hardware built into the upcoming 4th Gen Intel Xeon Scalable processors. If you are interested in Tensor Cores, there is an example you can compile and run here. It does require the open source Intel DPC++ Compiler built with NVIDIA backend support, as well as installing NVIDIA CUDA, so I will save that for a separate post.
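
For reference, targeting the CUDA backend with that open source compiler uses a flag along the lines of the following; the source file name here is just a placeholder, and this assumes a DPC++ build with the NVIDIA plugin plus a local CUDA installation:

$ clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda tensor_cores_example.cpp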

For more details on what other functionality the joint_matrix API enables, see the API documentation:

llvm/sycl_ext_oneapi_matrix.asciidoc at sycl · intel/llvm

Conclusion

New hardware capabilities mean new programming abstractions. These abstractions may exist at many levels. While joint_matrix is a way to directly program the hardware, in future posts I will talk about how the Intel implementation of oneAPI libraries and popular AI/ML frameworks such as TensorFlow and PyTorch will take advantage of matrix accelerators like XMX and AMX.

If you have made it this far, you probably want fine-grained control of the hardware just like I do, which is why an API like joint_matrix is exciting. The joint_matrix API is available today and ready to help you take advantage of the new matrix hardware out there. I encourage you to download the toolchain, give the API a try, and provide feedback to help shape this exciting cross-vendor matrix API.

If you want to see what random tech news I’m reading, you can follow me on Twitter.

Tony is a Software Architect and Technical Evangelist at Intel. He has worked on several software developer tools and most recently led the software engineering team that built the data center platform which enabled Habana’s scalable MLPerf solution.

Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.


Programming Intel Arc Xe Matrix Extensions (XMX) with oneAPI was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


