Vectorization non-options
Vectorized replace algorithm would write to its source.
To preserve the existing elements with only replacing some in a single vector register, one of these approaches could be used:
- Masked stores
- Element-wise stores
- Storing the old elements along with replaced elements
Before AVX512, masked stores are only for 32 and 64 sized elements and only for AVX2: mm_maskstore_epi32, _mm256_maskstore_epi32, _mm_maskstore_epi64, _mm256_maskstore_epi64. The SSE2 _mm_maskmoveu_si128 is not suitable due to a non-temporal memory hint. So, only AVX2 and only large element size is not a full set.
Element-wise stores will defeat the purpose of vectorization, at least the most of it, since the scalar replacement is fast, and need to do several elements at once to outperform it.
Redundant writes caveat
So having to resort to redundant writes. That is computing a vector with preserved and replaced elements, and writing it at once.
Redundant writes are not transparent, in a sense that one can pass read-only array with no elements to be replaced, and expect no crashes. With sophisticated VirtualAlloc/VirtualProtect use, only a part of array can be read only. This makes arbitrary redundant writes non-transparent.
There's a sophisticated approach possible to make the redundant element writes transparent. We can align the pointer at the beginning, and then store only when we have something to replace. So we still have redundant writes, but within the same page we would normally write, so these are transparent.
Sure the sophisticated approach adds complexity to maintain.
✨ Compiler magic! ✨
If instead of this:
|
for (; _UFirst != _ULast; ++_UFirst) {
|
|
if (*_UFirst == _Oldval) {
|
|
*_UFirst = _Newval;
|
|
}
|
|
}
|
We would use *_UFirst = *_UFirst == _Oldval ?_Newval : *_UFirst, to enable these redundant writes, the compiler would auto-vectorize the algorithm almost perfectly. I created DevCom-10606365 to improve codegen, but the improvement effect is small.
So we can just specialize the algorithm with if constexpr for proper types, if we accept these redundant stores.
Are redundant stores indeed harmful?
I dunno.
From the standard perspective, I'm not a language lawyer.
From the beyond standard perspective, assuming standard permits redundant stores, I think these are fine: there's no realistic use case that would rely on read-only array modifying attempt being successful if no modification happens.
From performance perspective, they might be potentially slightly harmful, but the benefit from vectorization clearly outweighs this potential harm
So the question is
Can vectorized std::replace have non-transparent redundant writes?
Vectorization non-options
Vectorized
replacealgorithm would write to its source.To preserve the existing elements with only replacing some in a single vector register, one of these approaches could be used:
Before AVX512, masked stores are only for 32 and 64 sized elements and only for AVX2:
mm_maskstore_epi32,_mm256_maskstore_epi32,_mm_maskstore_epi64,_mm256_maskstore_epi64. The SSE2_mm_maskmoveu_si128is not suitable due to a non-temporal memory hint. So, only AVX2 and only large element size is not a full set.Element-wise stores will defeat the purpose of vectorization, at least the most of it, since the scalar replacement is fast, and need to do several elements at once to outperform it.
Redundant writes caveat
So having to resort to redundant writes. That is computing a vector with preserved and replaced elements, and writing it at once.
Redundant writes are not transparent, in a sense that one can pass read-only array with no elements to be replaced, and expect no crashes. With sophisticated
VirtualAlloc/VirtualProtectuse, only a part of array can be read only. This makes arbitrary redundant writes non-transparent.There's a sophisticated approach possible to make the redundant element writes transparent. We can align the pointer at the beginning, and then store only when we have something to replace. So we still have redundant writes, but within the same page we would normally write, so these are transparent.
Sure the sophisticated approach adds complexity to maintain.
✨ Compiler magic! ✨
If instead of this:
STL/stl/inc/algorithm
Lines 3707 to 3711 in 8b081e2
We would use
*_UFirst = *_UFirst == _Oldval ?_Newval : *_UFirst, to enable these redundant writes, the compiler would auto-vectorize the algorithm almost perfectly. I created DevCom-10606365 to improve codegen, but the improvement effect is small.So we can just specialize the algorithm with
if constexprfor proper types, if we accept these redundant stores.Are redundant stores indeed harmful?
I dunno.
From the standard perspective, I'm not a language lawyer.
From the beyond standard perspective, assuming standard permits redundant stores, I think these are fine: there's no realistic use case that would rely on read-only array modifying attempt being successful if no modification happens.
From performance perspective, they might be potentially slightly harmful, but the benefit from vectorization clearly outweighs this potential harm
So the question is
Can vectorized
std::replacehave non-transparent redundant writes?