Introduce the DPC++ and LevelZero device driver #486

Open
wants to merge 3 commits into
base: master

@therault therault commented Feb 7, 2023

Introduce the DPC++ and LevelZero device driver and enable this device in DTD and PTG.

This PR supersedes PR #483, because the changes brought by the HIP integration made porting that branch on top of PaRSEC master complicated. All commits of the level_zero branch are squashed into one commit to simplify this port, and many changes are added to that commit to share code among the HIP, CUDA, and Level Zero code generation paths.

As the target BODY is in DPC++, there are two devices: the dpcpp device (natural for PTG, as this is the target BODY language), and the level_zero device (natural for the device, as all operations go through the Level Zero interface). I made these two devices synonyms, which introduces some duplication in default_stage_in/out and kernel_submit.
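To make the "synonym" idea concrete, here is a minimal sketch of one way two device names can designate the same device type. Everything in it is an assumption for illustration: the `sketch_` enum, its values, the keyword strings, and the lookup function are hypothetical and are not PaRSEC's actual definitions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical device-type flags; names and values are illustrative only
 * and are not PaRSEC's actual definitions. */
enum sketch_dev_type {
    SKETCH_DEV_NONE       = 0,
    SKETCH_DEV_CUDA       = 1 << 1,
    SKETCH_DEV_HIP        = 1 << 2,
    SKETCH_DEV_LEVEL_ZERO = 1 << 3,
    /* DPC++ names the BODY language, Level Zero names the driver interface:
     * both designate the same device, so one is an alias of the other. */
    SKETCH_DEV_DPCPP      = SKETCH_DEV_LEVEL_ZERO
};

/* Map a (hypothetical) BODY type keyword to a device type.
 * Returns SKETCH_DEV_NONE when the keyword is not recognized. */
int sketch_dev_from_keyword(const char *keyword)
{
    assert(keyword != NULL);
    if (strcmp(keyword, "CUDA") == 0)       return SKETCH_DEV_CUDA;
    if (strcmp(keyword, "HIP") == 0)        return SKETCH_DEV_HIP;
    if (strcmp(keyword, "LEVEL_ZERO") == 0) return SKETCH_DEV_LEVEL_ZERO;
    if (strcmp(keyword, "DPCPP") == 0)      return SKETCH_DEV_DPCPP;
    return SKETCH_DEV_NONE;
}
```

With this kind of aliasing, code that tests the device type treats both names identically. One side effect in C is that a `switch` cannot list both enumerators as separate `case` labels, since they have the same value.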

This branch is based on common_gpu and should be merged only after common_gpu.

Add a new level_zero device (WIP)

  • copy device_cuda in device_level_zero and rename things
  • module_init and module_fini for level_zero

Need to factorize a little bit more.

Factorizing (need to do it in base)

Port above new common

Add DPC++ to the loop...

  • Add multiple CMake logic files and commands
  • jdf2c.c now generates dpcpp output files when needed
  • make DEV_DPCPP be an alias to DEV_LEVEL_ZERO
  • Command Lists for I/O (streams of id 0 and 1) are still immediate
  • Command Lists for computations (streams of id >= 2) are now regular lists connected to a queue; that queue exists both as a Level Zero compute queue and as a DPC++ queue
  • Missing compilation logic to compile generated dpc++ code and link it with the target binary
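The stream-to-command-list rule stated in the list above (streams 0 and 1 immediate, streams 2 and up regular) can be sketched as a small selection function. The type and function names are hypothetical. In Level Zero, the two kinds would be created with `zeCommandListCreateImmediate` and `zeCommandListCreate` respectively, but the sketch does not call the Level Zero API. It reflects the state of this commit only; a later entry in this log reports that mixing the two kinds turned out to be unreliable.

```c
#include <assert.h>

/* Hypothetical tag for the two kinds of command list discussed above. */
enum sketch_cl_kind {
    SKETCH_CL_IMMEDIATE,  /* commands are submitted as they are appended */
    SKETCH_CL_REGULAR     /* list must be closed and executed on a queue */
};

/* Streams 0 and 1 carry I/O (stage-in / stage-out); streams >= 2 carry
 * computations and need a queue that can also be exposed to DPC++. */
enum sketch_cl_kind sketch_cl_kind_for_stream(int stream_id)
{
    assert(stream_id >= 0);
    return (stream_id < 2) ? SKETCH_CL_IMMEDIATE : SKETCH_CL_REGULAR;
}
```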

Risk: it is unclear whether the user can still push commands or events into a command list after it is closed, yet the list must be closed to force its commands to be pushed to the queue. I might need to create a new command list after each close, and attach the closed command list to the event for garbage collection.
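The workaround considered above (a fresh command list after each close, with the closed list attached to an event for later reclamation) can be sketched as a toy ownership model. All types and functions here are hypothetical stand-ins written in plain C; they only model who owns the list at each step and do not use the Level Zero API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-ins for a command list, an event, and a stream. */
typedef struct sketch_cmdlist { bool closed; int ncommands; } sketch_cmdlist_t;
typedef struct sketch_event   { sketch_cmdlist_t *owned_list; bool completed; } sketch_event_t;
typedef struct sketch_stream  { sketch_cmdlist_t *current; } sketch_stream_t;

/* Append one command, lazily creating an open list if the stream has none.
 * Returns 0 on success, -1 if allocation fails. */
int sketch_stream_append(sketch_stream_t *s)
{
    assert(s != NULL);
    if (s->current == NULL) {
        s->current = calloc(1, sizeof(*s->current));
        if (s->current == NULL) return -1;
    }
    assert(!s->current->closed);   /* never append to a closed list */
    s->current->ncommands++;
    return 0;
}

/* Close the current list and hand its ownership to the event that tracks
 * its completion; the stream is left without a list, so the next append
 * starts a fresh one. */
void sketch_stream_flush(sketch_stream_t *s, sketch_event_t *ev)
{
    assert(s != NULL && ev != NULL);
    if (s->current != NULL) s->current->closed = true;
    ev->owned_list = s->current;
    ev->completed = false;
    s->current = NULL;
}

/* Called once the event is known to have completed: reclaim the list. */
void sketch_event_complete(sketch_event_t *ev)
{
    assert(ev != NULL);
    free(ev->owned_list);
    ev->owned_list = NULL;
    ev->completed = true;
}
```

The design choice in this sketch is that a closed list is never touched again by the stream; only the event that tracks it may free it, which sidesteps the question of whether appending after close is allowed.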

Adapt findlevel-zero.cmake to support systems where pkg-config is broken

Re-enable Level Zero test; update to latest level zero / oneAPI API

Update wrapper to allow testing both CUDA and Level Zero with new Level Zero update

use_cuda / use_cuda_index have been renamed to follow proper naming scheme; do the same for level_zero

Try to automate DPCPP generated code compilation; fix ordinal of memory allocation request in wrapper.

Command Lists must be submitted explicitly to the Command Queue when they are not created as immediate lists (and they cannot be immediate if we want to get their Command Queue, which the DPC++ interface requires)

Typo and multiple CMake fixes to make CMake link with DPCPP generated files

Add a standalone test for Level Zero capability and integration with DPC++ kernels

Rebase the entire Level Zero driver based on the subsystem test

The buffer interface is not required: we can use the USM oneMKL interface, which seems to work. Performance still needs to be checked.

Apparently we cannot mix immediate and non-immediate command lists, or at least doing so makes the passing of command queues unreliable

There is an exception in data.c in how we handle GPU copies; it must be ported to Level Zero too.

The Level Zero runtime has an atexit procedure that deletes command queues, and this seems to conflict with our own actions to delete the command queues...

Porting of the DTD GEMM test to Level Zero

NULL is not a valid MPI datatype when compiling with a clone of MPICH. The value doesn't matter in this case, just cast

Manage LEVEL_ZERO devices in DTD

Accept LEVEL_ZERO devices in the PTG generated code

Some fixes in device level_zero

Temp fix for termination detection -- tag size must be made portable. TODO!

Support LEVEL_ZERO devices in the DSL tests

Fix the subsystem test. Need to backport fixes in the MCA device

Fully functional sketch for level zero

Use level-zero fences to synchronize command lists and command queues, because command lists (or work) submitted to the command queues by SYCL (typically oneMKL) can complete in parallel with events belonging to other command lists.

Define the set of globals in DPC++ code after the includes, to avoid polluting their namespace; clean up some unused variables

Install LevelZero driver files; set up the environment in PaRSECConfig.cmake to find the same LevelZero library as at compile time

@therault therault requested a review from a team as a code owner February 7, 2023 19:46

therault commented Feb 7, 2023

@abouteiller can you test this PR on a HIP machine? Is the HIP port still working? I'm testing on CUDA and LevelZero machines.

@therault therault mentioned this pull request Mar 24, 2023
therault added a commit to therault/parsec that referenced this pull request Aug 30, 2023
therault added a commit to therault/parsec that referenced this pull request Oct 23, 2023

bosilca commented Nov 4, 2023

Now that #570 has been merged, is this PR still necessary?


therault commented Nov 6, 2023

> Now that #570 has been merged, is this PR still necessary?

Not quite: I still need to import the DPC++ part of the PTG support.
