Added sb instruction support for ARMv9 architecture by salvatoredipietro · Pull Request #2843 · jemalloc/jemalloc · GitHub
Skip to content

Added sb instruction support for ARMv9 architecture#2843

Open
salvatoredipietro wants to merge 3 commits intojemalloc:devfrom
salvatoredipietro:dev
Open

Added sb instruction support for ARMv9 architecture#2843
salvatoredipietro wants to merge 3 commits intojemalloc:devfrom
salvatoredipietro:dev

Conversation

@salvatoredipietro
Copy link
Copy Markdown

We would like to add the support for ARMv9 architecture to use sb instruction instead of isb.
Based on our micro benchmark (patch attached), sb is ~30% faster than isb (ratio=1.710:1) and the change do not seems to introduce any regression on the spin_cpu_spinwait() function (isb_spin=8740725us vs standard_spin=8739722us).

# Jemalloc build
$ make clean all ; ./autogen.sh && ./configure && make -j4

# Test on m8g.2xlarge with patch
$ make tests_stress && ./test/stress/arm_spin_bench
Running on ARM64 architecture
SB instruction is supported
1000000 iterations, isb_spin=8740725us (8740.725 ns/iter), standard_spin=5110839us (5110.839 ns/iter), time consumption ratio=1.710:1

# Test on m8g.2xlarge without patch 
1000000 iterations, isb_spin=8739722us (8739.722 ns/iter), standard_spin=8739722us (8739.722 ns/iter), time consumption ratio=1.000:1

benchmark.patch

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


/* Define HWCAP_SB if not already defined in system headers */
#ifndef HWCAP_SB
#define HWCAP_SB (1UL << 56) /* Speculation Barrier */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread Makefile.in Outdated
C_SRCS += $(srcroot)src/zone.c
endif
ifneq (,$(filter arm%,$(HOST_CPU)))
C_SRCS += $(srcroot)src/spin_delay_arm.c
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem needed?

/* Use SB instruction if available, otherwise ISB */
static inline void
spin_delay_arm(void) {
static int use_spin_delay_sb = -1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like every TU will have its own copy of this global variable, since it's static inside a static function in a header.

If there is a way to init it to 0 without any extra cost, then that can save some space in the image by living in .bss instead of in .data.

Comment on lines +15 to +20
if (__builtin_expect(use_spin_delay_sb, -1) == 1) {
/* SB instruction encoding */
asm volatile(".inst 0xd50330ff \n");
} else if (use_spin_delay_sb == 0) {
/* ISB instruction */
asm volatile("isb");
} else {
/* Initialize variable and check if SB is supported */
if (getauxval(AT_HWCAP) & HWCAP_SB) {
use_spin_delay_sb = 1;
} else {
use_spin_delay_sb = 0;
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like concurrent accesses to a variable, forming a data-race.

Comment on lines +23 to +37
if (getauxval(AT_HWCAP) & HWCAP_SB) {
use_spin_delay_sb = 1;
} else {
use_spin_delay_sb = 0;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the type of thing that could happen in a __attribute__((constructor)) so that there never needs to be the fallback path in the function.

@salvatoredipietro salvatoredipietro force-pushed the dev branch 2 times, most recently from effa5c3 to 74512c4 Compare May 8, 2025 22:32
#endif // HWCAP_SB

/* Global variable to track SB support, initialized by constructor */
static _Atomic int arm_has_sb_instruction = ATOMIC_VAR_INIT(0);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is static, so it exists separately in each TU. This should be defined in only one TU, such as some spin_delay_arm.c. It cannot be static in the source file, and it would have to be extern in the header.

I'm not sure if it also has to be marked as hidden, in order to avoid being discoverable to other .as and .sos. Maybe there are other examples of similar variables.

Because this is only updated in a constructor function, and all reads from this happen-after the constructor function, this no longer needs to be atomic. If for some reason it is atomic anyway, then the stores and loads can all be memory_order_relaxed since they do not order any other memory locations beyond the atomic itself.

Comment on lines +14 to +16
__attribute__((constructor))
static void
detect_arm_sb_support(void) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, this is a __attribute__((constructor)) static, which exists separately in each TU. This should be defined in only one TU, again such as some spin_delay_arm.c.

There may be a non-static declaration of the function here. Likewise, I'm not sure if this function also has to be marked as hidden, or if somehow all symbols are automatically marked as hidden in some special linker script or build step.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, can also just use normal global variable initialization.

const int arm_has_sb_instruction = getauxval(AT_HWCAP) & HWCAP_SB);

Comment on lines +7 to +9
/* Constructor function declaration - implementation in spin_delay_arm.c */
__attribute__((constructor))
void detect_arm_sb_support(void);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor function declaration should not appear in the header anymore. It should only appear in, and only be marked as a constructor in, one TU.

Comment thread src/spin_delay_arm.c
_Atomic int arm_has_sb_instruction = ATOMIC_VAR_INIT(0);

/* Constructor function to detect hardware capabilities at program startup */
__attribute__((constructor))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__attribute__ syntax is not supported by all compilers. It would have to be guarded for the compilers which support it - which likely includes all of the compilers which run on linux.

Alternatively, the variable can just be initialized directly and then no function is necessary.

#if defined(__linux__) && (defined(__aarch64__) || defined(__arm64__))
int arm_has_sb_instruction = getauxval(AT_HWCAP) & HWCAP_SB;
#endif

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I try to do this, the compiler will fail with:

src/spin_delay_arm.c:6:30: error: initializer element is not constant
    6 | int arm_has_sb_instruction = getauxval(AT_HWCAP) & HWCAP_SB;
      |                              ^~~~~~~~~
make: *** [Makefile:502: src/spin_delay_arm.jet.sym.o] Error 1

#include <stdatomic.h>

/* Global variable to track SB support, declared as extern to be defined in one TU */
extern _Atomic int arm_has_sb_instruction;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking this as atomic may regress the caller. This does not need to be atomic, though, so it should be changed to a normal int.

If the caller uses default loads on an atomic variable, these loads have memory-order-seq-cst. If the caller uses explicit atomic loads with memory-order-relaxed, it is still a regression because many compilers will not hoist the load out of a loop.

@salvatoredipietro
Copy link
Copy Markdown
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants