`std::replace` vectorization and redundant writes · Issue #4433 · microsoft/STL · GitHub
Skip to content

std::replace vectorization and redundant writes #4433

Description

@AlexGuteniev

Vectorization non-options

Vectorized replace algorithm would write to its source.

To preserve the existing elements with only replacing some in a single vector register, one of these approaches could be used:

  1. Masked stores
  2. Element-wise stores
  3. Storing the old elements along with replaced elements

Before AVX512, masked stores are only for 32 and 64 sized elements and only for AVX2: mm_maskstore_epi32, _mm256_maskstore_epi32, _mm_maskstore_epi64, _mm256_maskstore_epi64. The SSE2 _mm_maskmoveu_si128 is not suitable due to a non-temporal memory hint. So, only AVX2 and only large element size is not a full set.

Element-wise stores will defeat the purpose of vectorization, at least the most of it, since the scalar replacement is fast, and need to do several elements at once to outperform it.

Redundant writes caveat

So having to resort to redundant writes. That is computing a vector with preserved and replaced elements, and writing it at once.

Redundant writes are not transparent, in a sense that one can pass read-only array with no elements to be replaced, and expect no crashes. With sophisticated VirtualAlloc/VirtualProtect use, only a part of array can be read only. This makes arbitrary redundant writes non-transparent.

There's a sophisticated approach possible to make the redundant element writes transparent. We can align the pointer at the beginning, and then store only when we have something to replace. So we still have redundant writes, but within the same page we would normally write, so these are transparent.

Sure the sophisticated approach adds complexity to maintain.

✨ Compiler magic! ✨

If instead of this:

STL/stl/inc/algorithm

Lines 3707 to 3711 in 8b081e2

We would use *_UFirst = *_UFirst == _Oldval ?_Newval : *_UFirst, to enable these redundant writes, the compiler would auto-vectorize the algorithm almost perfectly. I created DevCom-10606365 to improve codegen, but the improvement effect is small.

So we can just specialize the algorithm with if constexpr for proper types, if we accept these redundant stores.

Are redundant stores indeed harmful?

I dunno.

From the standard perspective, I'm not a language lawyer.

From the beyond standard perspective, assuming standard permits redundant stores, I think these are fine: there's no realistic use case that would rely on read-only array modifying attempt being successful if no modification happens.

From performance perspective, they might be potentially slightly harmful, but the benefit from vectorization clearly outweighs this potential harm

So the question is

Can vectorized std::replace have non-transparent redundant writes?

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionFurther information is requestedresolvedSuccessfully resolved without a commit

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions