Zip: Handle preferred memory layout of inhomogeneous inputs better #809
They were previously pub, just because we didn't have the pub(crate) feature yet.
A big thank you for this PR! I tried doing it several months ago and abandoned it, so I'm quite happy to have nerd sniped you ;) I ran the benchmarks and, as you wrote, they are noisy, so it's hard to spot the tiny changes. I ran them 5-6 times each and picked the best results. That said, it looks like the 'cc' cases are a little slower than they were (almost nothing, and probably within the noise), but the 'ff' cases are now equal, and this is great news for us because we are stuck with loading and writing images in Fortran order. Just a note, there are now 2 benchmarks named

Thanks for running the benchmarks! Reducing the overhead of Zip would be interesting. I'll try to deduplicate the benchmarks and maybe remove some of the mixed benchmarks here; I don't want to run them anyway, even if they are useful for comparison and perspective. Since you ran all the benchmarks, I'll share my tip of only compiling and running what you need, which we use to cope with building Rust:

My development machine has changed, and the new one is very flaky at benchmarks. I can maybe see the point of criterion now; I used to have a setup that made for stable and reproducible benchmarks.
Using split tests the performance of Zip under parallelization, so that we can see whether there are benefits to splitting arrays better.
Using the index shows more directly the overhead of indexed zip
Support unrolling over either the c- or the f-layout preferred axis in the Zip inner loop (the fallback when inputs are not all contiguous and of the same layout). Keep a tendency score when building the Zip, so that we know whether the inputs tend towards c- or f-layout. This improves performance on the just-added zip_indexed_ff benchmark, so that it seems to match its (already fast) cc counterpart.
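The tendency score described in this commit message can be pictured with a small std-only sketch. This is an illustration, not the code from this PR: the names `Pref`, `UnrollAxis`, `tendency` and `unroll_axis` are hypothetical, and the scoring (+1 for a c-leaning input, -1 for an f-leaning one, 0 for a neutral one such as the index producer) is an assumed simplification.

```rust
/// Hypothetical summary of how one Zip input is laid out.
/// This is NOT ndarray's `Layout` type, just a stand-in for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Pref {
    /// Input leans towards c- (row-major) order.
    C,
    /// Input leans towards f- (column-major) order.
    F,
    /// Input has no bias, e.g. an index producer.
    Neutral,
}

/// Which axis the innermost (unrolled) loop should run over.
#[derive(Debug, PartialEq)]
pub enum UnrollAxis {
    Last,
    First,
}

/// Assumed scoring: c-leaning counts +1, f-leaning -1, neutral 0.
fn score(p: Pref) -> i32 {
    match p {
        Pref::C => 1,
        Pref::F => -1,
        Pref::Neutral => 0,
    }
}

/// Sum the scores of all inputs added to the (hypothetical) Zip.
pub fn tendency(inputs: &[Pref]) -> i32 {
    inputs.iter().map(|&p| score(p)).sum()
}

/// Pick the unroll axis from the sign of the tendency.
/// Ties fall back to the last axis (c-order) in this sketch.
pub fn unroll_axis(inputs: &[Pref]) -> UnrollAxis {
    if tendency(inputs) >= 0 {
        UnrollAxis::Last
    } else {
        UnrollAxis::First
    }
}
```

With an index producer plus two f-leaning arrays, `unroll_axis(&[Pref::Neutral, Pref::F, Pref::F])` picks `UnrollAxis::First` in this sketch, which mirrors the zip_indexed_ff situation: the neutral input does not vote, so the other inputs decide.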
cd21da6 to 47b3654

For example, when we use `Zip::from(a).and(b);` the Zip will examine the inputs and try to determine if they are all contiguous (and in the same way); it can now also determine what tendency the inputs have, to further guide which axis should be used for the innermost loop, even if not all the inputs are contiguous.

This helps, for example, with indexed Zip on f-order producers. The index producer has no bias in either direction, so all the other inputs will determine the layout preference.
The improved layout preference also affects parallelism, because in some cases we can better choose which axis to split along to preserve locality better.
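One way to picture the effect on splitting is the following std-only sketch. It assumes a rule of the form "split along the axis that is outermost in the preferred order, skipping axes of length 1"; the function name `split_axis` and the exact tie-breaking are hypothetical and not taken from this PR.

```rust
/// Pick the axis to split along for parallel work (hypothetical helper).
///
/// Assumption: for c-leaning inputs the first axis has the largest stride,
/// so splitting there leaves each half as one compact block of memory;
/// for f-leaning inputs the same holds for the last axis.
/// Axes of length 1 are skipped because they cannot be split.
pub fn split_axis(shape: &[usize], prefer_f: bool) -> usize {
    if prefer_f {
        // Search from the back for the last axis that is longer than 1.
        shape
            .iter()
            .rposition(|&len| len > 1)
            .unwrap_or(shape.len().saturating_sub(1))
    } else {
        // Search from the front for the first axis that is longer than 1.
        shape.iter().position(|&len| len > 1).unwrap_or(0)
    }
}
```

For a shape of `[1, 8, 4]` with c-leaning inputs this returns axis 1, and for `[8, 4, 1]` with f-leaning inputs it also returns axis 1, so in both cases each half keeps its inner axes intact.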
The `Layout` type was improved to make this possible. It now has flags for C/F-contig and for C/F-preference. The new layout bits are visible in the array debug output.

Fixes #749
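As a rough illustration of the four kinds of flags, here is a std-only sketch that derives them from a shape and element strides. The constant names, bit values and the preference rule (unit stride on the last axis means c-preferred, unit stride on the first axis means f-preferred) are assumptions made for this example and are not necessarily what `Layout` does internally.

```rust
// Hypothetical bit assignments, chosen only for this sketch.
pub const C_CONTIG: u32 = 0b0001;
pub const F_CONTIG: u32 = 0b0010;
pub const C_PREF: u32 = 0b0100;
pub const F_PREF: u32 = 0b1000;

/// Element strides of a contiguous row-major (c-order) array.
fn c_strides(shape: &[usize]) -> Vec<isize> {
    let mut strides = vec![0isize; shape.len()];
    let mut acc = 1isize;
    for i in (0..shape.len()).rev() {
        strides[i] = acc;
        acc *= shape[i] as isize;
    }
    strides
}

/// Element strides of a contiguous column-major (f-order) array.
fn f_strides(shape: &[usize]) -> Vec<isize> {
    let mut strides = vec![0isize; shape.len()];
    let mut acc = 1isize;
    for i in 0..shape.len() {
        strides[i] = acc;
        acc *= shape[i] as isize;
    }
    strides
}

/// Compute the layout bits for one input (simplified: no special
/// handling of empty arrays or axes of length 1).
pub fn layout_bits(shape: &[usize], strides: &[isize]) -> u32 {
    let mut bits = 0;
    if strides == c_strides(shape).as_slice() {
        bits |= C_CONTIG;
    }
    if strides == f_strides(shape).as_slice() {
        bits |= F_CONTIG;
    }
    // Assumed preference rule: a unit stride marks the fast axis.
    if strides.last() == Some(&1) {
        bits |= C_PREF;
    }
    if strides.first() == Some(&1) {
        bits |= F_PREF;
    }
    bits
}
```

In this sketch a 3x4 view with strides `[8, 1]` (for instance, a slice of a wider row-major array) is not contiguous but still reports `C_PREF`, which is the kind of extra information the preference flags add over the contiguity flags alone.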