iframe-proxy

otbrown · 2026-04-24T17:16:49Z

Creating a facility for users to runtime set threads per block for tuning the GPU implementation. NOTE: only applies to kernels that are not handled by Thrust, which does its own thing. Resolves #735.

I considered and rejected the idea of creating a symmetric interface for the CPU for users who don't know OMP_NUM_THREADS or omp_set_num_threads() exist, but that's much riskier as the point of truth is external (in the OpenMP runtime).

TODO:

~~Should gpu_getNumThreadsPerBlock return a qindex? Probably~~.
~~Create a new home for user facing API, as environment doesn't really make sense.~~ Kicking this back to next release.
Add a compile time default value -- that way expert maintainers can compile a tuned default into a library which is used on a system.
Query seemingly unused branch at

QuEST/quest/src/gpu/gpu_subroutines.cpp

Line 453 in b7d4a29

if constexpr (NumTargs != -1) {
Add TPB to QuEST GPU environment reporting.
Add tests for new interface.
@JPRichings To check if this is really worthwhile, but please wait a week to tell me if it isn't.
- It is!

…. Yes, it's confusing. Yes, the OpenMP ARB know.

…they would be easy.

otbrown · 2026-04-24T17:20:30Z

Rudimentary testing done with:

#include <cstdio>
#include "quest.h"

int main (void)
{
  const int NQUBITS = 24;
  const int TPB = 32;


  initQuESTEnv();
  reportQuESTEnv();

  std::printf("Initial number of threads per block: %d\n", getQuESTGpuThreadsPerBlock());

  setQuESTGpuThreadsPerBlock(TPB);
  std::printf("New number of threads per block: %d\n", getQuESTGpuThreadsPerBlock());

  Qureg qureg = createForcedQureg(NQUBITS);

  std::printf("Initialising Qureg.\n");
  initPlusState(qureg);
  reportQureg(qureg);

  std::printf("Applying Quantum Fourier Transform.\n");
  applyFullQuantumFourierTransform(qureg, false);
  reportQureg(qureg);

  destroyQureg(qureg);
  finalizeQuESTEnv();

  return 0;
}

JPRichings · 2026-04-24T17:31:22Z

Why would gpu_getNumThreadsPerBlock be a qindex this is not a quantum quantity. uint should be fine ( I am sure there is a recommendation from the cuda api we can match.

TysonRayJones · 2026-04-25T19:26:54Z

Is there an advantage to users having to set this as a runtime hyperparameter? My (mostly undeveloped) belief is we can use occupancy tools (alluded to here) to automate this. I definitely shy from giving users a greater onus to optimise for their settings (like other prolific softwares), which the v4 overhaul was supposed to avoid (via e.g. the autodeployer).

Note too that the kernels so far are very primitive - each thread handles the updating of the minimum possible number of amplitudes (often just one!). I quite like that because it's very readable and simple (great for an open-source scientific project) but is an obvious site for optimisation.

Why would gpu_getNumThreadsPerBlock be a qindex this is not a quantum quantity. uint should be fine ( I am sure there is a recommendation from the cuda api we can match.

It's true that it will never be anywhere as big as the quantities qindex is expected to store (like the number of basis states), but I have already used it in places where I thought an int might be insufficient. Inoffensive either way as uint or qindex imo

JPRichings · 2026-04-25T20:56:31Z

Hi Tyson,

I just noticed the fixed value to 128 and have a feeling that it was large.

I just wanted a handle so I could write a benchmark so we can easily automate performance tuning ourselves.

I have not played with the occupancy tools but I should take a proper look as this might solve this automatically.

My other concern is that there are differences between Nvidia and AMD on optimal sizes due to hardware differences so we might not be able to reply on the occupancy tuning in all cases unless this becomes available on all platforms.

TysonRayJones · 2026-04-25T21:18:22Z

I just noticed the fixed value to 128 and have a feeling that it was large.

I guess it's very GPU specific! I think 128 was motivated by thinking of CC=3, which has a max active blocks per SM of 16, and a max active threads per SM of 2048. So using 128 threads per block perfectly maximally occupies the SMs (when there are enough amplitudes to admit more than 16 blocks per SM, of course!)

For illustration, the next smallest size is 96 (it must be a multiple of 32, else threads within a warp will be idle), which yields a number of active threads of 16 * 96 = 1536, which wastes 2048 - 1536 = 512 threads per SM!

Of course, newer GPUs support more active blocks per SM (even when the max active threads per SM is unchanged). E.g. CC=8 supports up to 32 active blocks per SM, so we could shrink to 64 threads per block while achieving the same occupancy - but I don't have a great intuition for the effect when we're memory-bandwidth bound.

Certainly seems prudent to consult a CUDA runtime API, if that doesn't hurt our AMD compatibility!

otbrown · 2026-04-27T11:14:08Z

Apologies, probably won't get to look at this again this week, but very happy to set this value programmatically if it can be done!

As it's architecture dependent, we definitely do need a way to adjust it, and ideally both at runtime and compile time. At compile time, so kindly HPC support teams can compile and maintain a tuned version, and at runtime, so they can scan through values without having to recompile in between. I'll have a chat with James abour approaches later this week!

I 100% agree that we don't really want unknowing users messing around with this. I think something like an architecture.h or perftune.h or similar might be the best solution. A set of functionality that we explicitly document is for users who know what they are doing to tune the performance of the library for a specific architecture. It might be this is the only value in there for the time being, but for slingshot-11 reasons we need to add a parameter capping the total in-flight data and this would be a good spot for that too.

TysonRayJones · 2026-04-29T21:54:01Z

Fair enough - you've convinced me! Being able to runtime adjust is of course extremely helping during development of a user-friendlier adaptive system anyhow.

I like the sound of perftune.h - it could also go into debug.h in the interim to there being more performance-tuning specific functionality.

otbrown · 2026-05-03T16:40:49Z

Should validate TPB is multiple of 32!

otbrown · 2026-05-19T17:59:19Z

Rather than attempt to post a thumbs up on each comment while on train WiFi, I'll just comment thanks @TysonRayJones here! I'm hoping to give this branch some proper attention on Thursday/Friday.

TysonRayJones · 2026-05-21T01:42:22Z

No prob! They're almost all nits I can sweep through rapidly. You can flag any you disagree with or for which there's nuance, and I can address the remainder next time I'm on QuEST. (No need to co-author the squash with me for my minor and mostly stylistic changes!)

- added comm_isActive to indicate whether QuEST is using MPI (which is distinct from whether MPI itself is initialised) - renamed comm_isInit to comm_isMpiInit, since it queries MPI directly/globally, and when true, does not indicate whether QuEST is actually using MPI - record isMpiUserOwned within comm_config.cpp, since failed-validation must not kill user-owned MPI, and it must know user-ownership before QuESTEnv succeeds/records it (because validation can fail DURING QuESTEnv initialisation) - explicitly divided (through doc) comm_config.cpp into things which query MPI globally, and thinks which query only QuEST's MPI env/communicator -

as found by Codex! All hail our new overlords

TysonRayJones · 2026-05-31T21:47:56Z

…-quest' into gpu-tpb-cleanup

(existing checks were internal errors, and incomplete)

TysonRayJones

(I'll address these in #767)

woops!

Woof that's a lot of boilerplate - but at least we have the safest environment variables in the business! 😅

since it should be added in a separate PR with the other intendedly programmatically-accessible fields. I know in my heart of hearts that if I left isHipCompiled attached, the other fields would never follow hehe

which is now the lowest-priority default, overridden at executable launch via the environment variable, in-turn overridden at runtime using the setter

given it can now be a result of the cmake var, the env var, or the runtime setter argument

Cleaned up numTPB handling

which is achieved by simply removing the overzealous internal error handling. Let users have fun!

for brevity! pixels r precious

* Fix applyMultiStateControlledSqrtSwap argument list (#738) Remove numControls argument from applyMultiStateControlledSqrtSwap overloaded definition taking std::vector<int> (cherry picked from commit 9c20792) Co-authored-by: D-Exposito <dexposito@cesga.es> * Trotterisation test update (#728) * tests/unit/trotterisation.cpp: updated to use REQUIRE_AGREE and cached statevecs and densmats, and both permutePaulis options * tests/utils/compare.hpp/cpp: added setters for test epsilon * tests/unit/trotterisation.cpp: adjusted test epsilon for quad precision imaginary time evolution tests * tests/unit/trotterisation.cpp: moved unitary time evo test to REQUIRE_AGREE * tests/utils/cache.hpp/cpp: added additional utilities for creating and destroying temp caches (which I guess makes them not caches?) with a set number of qubits * tests/unit/trotterisation.cpp: updated unitary time evo test to test across deployments * tests/unit/trotterisation.cpp: reduced number of qubits and increased number of steps to admit the possibility of testing density matrices too * tests/unit/trotterisation.cpp: added density matrix tests * reduce test precision to lazily pass CPU clang quad-precision * skip Trotter tests in paid CI * changing varname convention * renaming cache funcs --------- Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com> * added Daniel Patino to authorlist * CMake warn when non-release build (#742) --------- Co-authored-by: Oliver Thomson Brown <otbrown@users.noreply.github.com> * Stop Trotter funcs mutating PauliStrSum (#740) Formerly, the Trotter functions (such as applyTrotterizedPauliStrSumGadget()), when passed permutePaulis=true, would randomly permutate the order of the passed PauliStrSum, mutating it and affecting the outputs of subsequent functions like reportPauliStrSum(). The function also contained superfluous memory allocs/copies equal in size to the PauliStrSum. Now, the PauliStrSum is never mutated, and an internally allocated ordering list keeps track of the randomised permutation. We also updated the doc, renamed permutePaulis to permuteTerms, and improved validation. Note that 'permuteTerms' had not yet reached main/release, so these changes do not need to be documented in the v4.3 release notes. * Created custom backend complex types (#729) Created cpu_qcomp and gpu_qcomp (from a shared base_qcomp) to avoid std::complex arithmetic operators in hot loops which caused performance issues. Removed all prior compiler flags and related scaffolding attempting to mitigate the performance issue. Also gave MSVC build the params `/Zc:preprocessor -Xcompiler=/Zc:preprocessor /bigobj` as needed for compilation of the unit tests on my windows machines. * Replace vector<int> with SmallList (a stack array) (#743) This is to circumvent the std::vector performance overheads visible in few-qubit simulation (responsible for a performance regression from v3; see #720), and also so that qubit lists can be passed directly to CUDA kernels without conversion (as explored in #739). * Added few-qubit optimisations (#750) Optimisations include: - Adopted SmallView (const SmallList&) to avoid superfluous SmallList copies - Made internally created matrices static - Change accelerator dynamic function vectors to static arrays - Exit all validators early when validation is disabled Additional cleanup includes: - Tidied accelerator macros (replaced param-specific macros like "numCtrls" and "numTargs" with "param") - Fill ctrlStates vectors with default before localiser - Renamed getBitsFromInteger to setToBitsOfInteger - Adopted const in bitwise.hpp to better express intent Note that the naming of SmallList and SmallView will be subsequently changed to List64 and ConstList64 * Renamed debug API functions to contain "QuEST" (#752) * Renamed environment variables to begin with"QUEST" (#755) * Renamed CMake vars and preprocessors (#756) such that they all begin with QUEST, but some have additional changes * Renamed Small(List|View) to (Const)List64 (#757) * Defer Catch2 test discovery so that we can compile MPI tests on systems which cannot actually run with MPI, because they are missing an MPI or UCX library file, as is witnessed in the CI (when compiling with MPICH). It's generally irksome too to trigger an execution of the test binary (which itself initialises QuEST) during build when on a HPC platform with distinct submit and compute nodes * Enable user to take ownership of MPI (#722) * Added ENABLE_SUBCOMM build option * Moved from MPI_COMM_WORLD to mpiQuestComm * Decided passing *MPI_Comm was probably overly cautious, and updated function name to comm_getMpiComm * environment.cpp: added methods to reset rank and numNodes, and reporting for subcomm compiled * comm_config.hpp/cpp: added comm_setMpiComm * CMakeLists.txt: PUBLIC MPI::MPI_CXX turned out to be unhelpful, even for SubComm, because of course it enforces CXX * Added new custom QuESTEnv initialiser which allow user to positively declare that they take ownership of MPI * validation.cpp: updated comm_end call * comm_config.hpp: added config.h include so COMPILE_MPI is actually defined * subcommunicator.h/cpp: implemented QuESTEnv initialiser with custom MPI_Comm * CMake: added subcommunicator.cpp * comm_config.hpp: added missing config.h include... * comm_config.cpp: explicitly initialise mpiCommQuest to MPI_COMM_NULL, updated setComm for init only workflow * quest.h: added subcommunicator header * CMake: added MPI to application binaries when SUBCOMM is enabled * comm_routines.cpp: post Irecv before Isend which probably won't do anything but it makes MPI library implementers less nervous * tests: added new env test for initCustomMpiQuESTEnv * Added error throws to comm_config to cover new scenarios of badness with user owned MPI * subcommunicator.cpp: updated var names to match QuEST style * tests/unit/initialisations.cpp: slightly modified setQuregAmps test to avoid unexpected test failure due to range checking when compild in Debug configuration * Updated validation in comm_setMpiComm Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es> * userOwnsMpi int->bool * comm_config.cpp: corrected call to MPI_Comm_free * subcommunicator.cpp: userOwnsMpi int->bool * subcommunicator.cpp: added comm_isInit guard around comm_setMpiComm * environment.cpp: USER_OWNS_MPI -> userOwnsMpi * comm_init: fixed case where useDistrib = 0 and userOwnsMpi = true * comm_init: moved (recently) misplaced MPI_Init * AUTHORS.txt: added iarejula-bsc * Added placeholder docstrings to new initialisers * docs/cmake.md: added ENABLE_SUBCOMM to list of QuEST CMake vars * Newly added COMPILE_MPI -> QUEST_COMPILE_MPI * ENABLE_SUBCOMM -> QUEST_ENABLE_SUBCOMM * CMake: corrected OpenMP and subcommunicator pre-processor definitions --------- Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es> * Add flush and sync around prints (#763) to reduce the likelihood of users printing from non-root nodes interrupting QuEST root output. This is not bullet-proof; we sync the active communicator rather than MPI_COMM_WORLD so the user-controlled non-participating processes may still be printing. Furthermore, even if all processes participate, some may have outstanding non-root prints that are not aggregated to the user screen by the time MPI_Barrier finishes. But these syncs greatly reduce the change of corruption, and are effectively free! * Add GPU-aware MPICH detection This enables CRAY MPICH platforms to leverage GPU-awareness, greatly accelerating distributed GPU simulation Co-authored-by: JPRichings <james.richings@ed.ac.uk> * Cleanup custom MPI flow (#762) Important changes: - permit user initialisation of MPI when QuEST is not distributed - changed QuESTEnv fields bool from int (e.g. isMultithreaded) - add user-input validation for custom MPI calls - disambiguated comm_config.cpp concepts of "MPI is initialised" (comm_isMpiInit) from "QuEST communication is active" (comm_isActive) - refactored comm_config.cpp flow, especially related to pre-quest-init flow (during validation) - added Oliver's custom-MPI examples (from #712) - moved new API functions to experimental.h - tweaked reportQuESTEnv output grouping * Added user-control of GPU num threads per block (#736) Added: - QUEST_DEFAULT_NUM_GPU_THREADS_PER_BLOCK CMake option - QUEST_DEFAULT_NUM_GPU_THREADS_PER_BLOCK environment variable - setQuESTNumGpuThreadsPerBlock() API function - getQuESTNumGpuThreadsPerBlock() API function - set_num_gpu_threads examples in examples/extended --------- Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com> * Fix compiler warnings (#770) Beware this included removing the superfluous `numControls` argument from the C++only `std::vector` overload of `applyMultiStateControlledCompMatr2`, which is technically a teeny tiny API break ¯\_(ツ)_/¯ * tests/unit/debug.cpp: updated setQuESTSeeds validation tests to include new validation (#771) Updated number of seeds test to use a valid pointer and added a separate NULL pointer test. * Fix Windows CI test_free.yml: added Release config to ctest commands (#773) --------- Co-authored-by: D-Exposito <dexposito@cesga.es> Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com> Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es> Co-authored-by: JPRichings <james.richings@ed.ac.uk>

otbrown added 2 commits April 24, 2026 14:00

cpu_config.cpp: replaced omp_get_num_threads with omp_get_max_threads…

c150e0b

…. Yes, it's confusing. Yes, the OpenMP ARB know.

We do these things not because they are easy, but because we thought …

9b8ddd1

…they would be easy.

otbrown self-assigned this Apr 24, 2026

TysonRayJones reviewed Apr 25, 2026

View reviewed changes

Comment thread quest/src/cpu/cpu_config.cpp

TysonRayJones reviewed Apr 25, 2026

View reviewed changes

Comment thread quest/src/gpu/gpu_subroutines.cpp Outdated

TysonRayJones reviewed May 3, 2026

View reviewed changes

Comment thread quest/src/api/environment.cpp Outdated

otbrown added 2 commits May 4, 2026 14:06

updated var names to match QuEST style

680fdda

Merge branch 'devel' into set_gpu_tpb

af1e5cb

TysonRayJones mentioned this pull request May 11, 2026

Profile and optimise-away small GPU allocations #749

Closed

Merge remote-tracking branch 'origin/devel' into set_gpu_tpb

aaf5681