{{ message }}
Reduce overhead of sampling profiler by having only one thread do it#6433
Merged
Conversation
steven-johnson
approved these changes
Nov 23, 2021
steven-johnson
left a comment
Contributor
There was a problem hiding this comment.
LGTM, but the explanatory description in this PR should really be added in code comments somewhere.
Contributor
There was a problem hiding this comment.
ubernit: this is an unusual formatting for our code; we almost always put member var declarations one-per-line.
Member
Author
|
I'm a bit nervous about how hard it is to test this. I'd appreciate it if a someone could run an important production pipeline at Google before and after and tell me if the profile looks reasonable. |
Member
Author
|
Actually a systematic test of the app suite is probably enough. I'll do that and report back. |
Member
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

The current built-in profiler has a lot of overhead for fine-grained compute_at schedules. E.g. for the bgu app it inflates runtime by about 50% to turn on the profiler. This is happening because all threads are writing their current state to the same cache line, causing a lot of cross-cache traffic. Each one of these writes is effectively a cache miss.
This PR changes it so that whenever we have lots of threads all doing the same thing (so in a leaf parallel loop body and not inside a fork node), one of them gets elected to write to that status field. The election is done by racing to grab a pipeline-scope token using an atomic op. The winner does the reporting. This speeds things up in two ways: First, the threads that don't write don't incur the cache misses. Second, the thread that does write can keep that line in cache, with the sampler thread just snooping on the bus traffic when it wants to read instead of invalidating the cache line (assuming I remember how MESI works properly).