Added sb instruction support for ARMv9 architecture#2843
Added sb instruction support for ARMv9 architecture#2843salvatoredipietro wants to merge 3 commits intojemalloc:devfrom
Conversation
There was a problem hiding this comment.
This will have to be guarded:
https://app.travis-ci.com/github/jemalloc/jemalloc/jobs/632564580#L784
|
|
||
| /* Define HWCAP_SB if not already defined in system headers */ | ||
| #ifndef HWCAP_SB | ||
| #define HWCAP_SB (1UL << 56) /* Speculation Barrier */ |
There was a problem hiding this comment.
Need 1ULL for 32-bit builds.
https://app.travis-ci.com/github/jemalloc/jemalloc/jobs/632564590#L757
| C_SRCS += $(srcroot)src/zone.c | ||
| endif | ||
| ifneq (,$(filter arm%,$(HOST_CPU))) | ||
| C_SRCS += $(srcroot)src/spin_delay_arm.c |
| /* Use SB instruction if available, otherwise ISB */ | ||
| static inline void | ||
| spin_delay_arm(void) { | ||
| static int use_spin_delay_sb = -1; |
There was a problem hiding this comment.
Seems like every TU will have its own copy of this global variable, since it's static inside a static function in a header.
If there is a way to init it to 0 without any extra cost, then that can save some space in the image by living in .bss instead of in .data.
| if (__builtin_expect(use_spin_delay_sb, -1) == 1) { | ||
| /* SB instruction encoding */ | ||
| asm volatile(".inst 0xd50330ff \n"); | ||
| } else if (use_spin_delay_sb == 0) { | ||
| /* ISB instruction */ | ||
| asm volatile("isb"); | ||
| } else { | ||
| /* Initialize variable and check if SB is supported */ | ||
| if (getauxval(AT_HWCAP) & HWCAP_SB) { | ||
| use_spin_delay_sb = 1; | ||
| } else { | ||
| use_spin_delay_sb = 0; | ||
| } | ||
| } |
There was a problem hiding this comment.
This looks like concurrent accesses to a variable, forming a data-race.
| if (getauxval(AT_HWCAP) & HWCAP_SB) { | ||
| use_spin_delay_sb = 1; | ||
| } else { | ||
| use_spin_delay_sb = 0; | ||
| } |
There was a problem hiding this comment.
This seems like the type of thing that could happen in a __attribute__((constructor)) so that there never needs to be the fallback path in the function.
effa5c3 to
74512c4
Compare
| #endif // HWCAP_SB | ||
|
|
||
| /* Global variable to track SB support, initialized by constructor */ | ||
| static _Atomic int arm_has_sb_instruction = ATOMIC_VAR_INIT(0); |
There was a problem hiding this comment.
This is static, so it exists separately in each TU. This should be defined in only one TU, such as some spin_delay_arm.c. It cannot be static in the source file, and it would have to be extern in the header.
I'm not sure if it also has to be marked as hidden, in order to avoid being discoverable to other .as and .sos. Maybe there are other examples of similar variables.
Because this is only updated in a constructor function, and all reads from this happen-after the constructor function, this no longer needs to be atomic. If for some reason it is atomic anyway, then the stores and loads can all be memory_order_relaxed since they do not order any other memory locations beyond the atomic itself.
| __attribute__((constructor)) | ||
| static void | ||
| detect_arm_sb_support(void) { |
There was a problem hiding this comment.
Likewise, this is a __attribute__((constructor)) static, which exists separately in each TU. This should be defined in only one TU, again such as some spin_delay_arm.c.
There may be a non-static declaration of the function here. Likewise, I'm not sure if this function also has to be marked as hidden, or if somehow all symbols are automatically marked as hidden in some special linker script or build step.
There was a problem hiding this comment.
Alternatively, can also just use normal global variable initialization.
const int arm_has_sb_instruction = getauxval(AT_HWCAP) & HWCAP_SB);| /* Constructor function declaration - implementation in spin_delay_arm.c */ | ||
| __attribute__((constructor)) | ||
| void detect_arm_sb_support(void); |
There was a problem hiding this comment.
This constructor function declaration should not appear in the header anymore. It should only appear in, and only be marked as a constructor in, one TU.
| _Atomic int arm_has_sb_instruction = ATOMIC_VAR_INIT(0); | ||
|
|
||
| /* Constructor function to detect hardware capabilities at program startup */ | ||
| __attribute__((constructor)) |
There was a problem hiding this comment.
__attribute__ syntax is not supported by all compilers. It would have to be guarded for the compilers which support it - which likely includes all of the compilers which run on linux.
Alternatively, the variable can just be initialized directly and then no function is necessary.
#if defined(__linux__) && (defined(__aarch64__) || defined(__arm64__))
int arm_has_sb_instruction = getauxval(AT_HWCAP) & HWCAP_SB;
#endif
There was a problem hiding this comment.
If I try to do this, the compiler will fail with:
src/spin_delay_arm.c:6:30: error: initializer element is not constant
6 | int arm_has_sb_instruction = getauxval(AT_HWCAP) & HWCAP_SB;
| ^~~~~~~~~
make: *** [Makefile:502: src/spin_delay_arm.jet.sym.o] Error 1
| #include <stdatomic.h> | ||
|
|
||
| /* Global variable to track SB support, declared as extern to be defined in one TU */ | ||
| extern _Atomic int arm_has_sb_instruction; |
There was a problem hiding this comment.
Marking this as atomic may regress the caller. This does not need to be atomic, though, so it should be changed to a normal int.
If the caller uses default loads on an atomic variable, these loads have memory-order-seq-cst. If the caller uses explicit atomic loads with memory-order-relaxed, it is still a regression because many compilers will not hoist the load out of a loop.

We would like to add the support for ARMv9 architecture to use
sbinstruction instead ofisb.Based on our micro benchmark (patch attached),
sbis ~30% faster thanisb(ratio=1.710:1) and the change do not seems to introduce any regression on the spin_cpu_spinwait() function (isb_spin=8740725us vs standard_spin=8739722us).benchmark.patch