- Ran on an M1 Max. That is the only machine I tested on, so I'm not sure how well it runs on other M-series machines.
- I CANNOT hard-code a power limit or clock frequency, so results differ a lot from run to run, but they all land within 3% of the actual MPS implementation's performance.
- Xcode: use Xcode 14.2. Newer Xcode/Metal SDK versions may not expose the Metal async copy header/APIs (e.g. `__metal_simdgroup_async_copy_2d`) used by the async-copy kernel(s), in which case kernel 9 won't compile.
- `src/runner.cpp` was partially vibe-coded for a ton of setup code, but it looks good to me.
- Got rejected from Apple :-( but this was really fun to do, just to learn a ton about their architecture while prepping for the interview.

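Since the power limit and frequency can't be pinned, one way to get a more stable comparison against MPS is to time many runs and compare medians. Here is a rough sketch of that idea, not the code in `src/runner.cpp`. All of the names (`median`, `median_seconds`, `gflops`, `relative_gap`) are hypothetical helpers. It assumes the callable passed in blocks until the GPU work has actually finished, and that throughput is counted as the usual 2·M·N·K FLOPs for a GEMM.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a non-empty set of samples. Less sensitive than the mean to
// the occasional slow run caused by thermal or clock changes.
double median(std::vector<double> xs) {
    std::sort(xs.begin(), xs.end());
    const std::size_t n = xs.size();
    return (n % 2 == 1) ? xs[n / 2] : 0.5 * (xs[n / 2 - 1] + xs[n / 2]);
}

// Calls `fn` `warmup` times untimed, then `runs` times timed, and returns
// the median wall-clock time in seconds.
// Assumption: `fn` blocks until the work it launches has completed.
template <typename Fn>
double median_seconds(Fn&& fn, int warmup, int runs) {
    for (int i = 0; i < warmup; ++i) fn();
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    return median(samples);
}

// Throughput of an M x N x K matrix multiply, counted as 2*M*N*K FLOPs.
double gflops(double m, double n, double k, double seconds) {
    return 2.0 * m * n * k / seconds * 1e-9;
}

// Relative gap between a kernel and a reference (e.g. MPS), as a fraction.
double relative_gap(double candidate, double reference) {
    return std::fabs(candidate - reference) / reference;
}
```

With something like this, "within 3% of MPS" would correspond to `relative_gap(my_gflops, mps_gflops) <= 0.03`, where both throughputs come from `median_seconds` over the same problem size.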
Thanks to siboehm/SGEMM_CUDA for a ton of boilerplate code and the overall benchmarking setup inspiration. Also thanks to dougallj/applegpu issue #28 (George Hotz and Philip Turner) for the async device→threadgroup copy breadcrumbs.

