GitHub - fleurxt/SGEMM_METAL: Kinda fast metal SGEMM · GitHub

SGEMM_METAL

Benchmark results

Notes

  • Ran on an M1 Max. That is the only machine this was tested on, so I'm not sure how well it runs on other M-series machines.
  • The power limit and clock frequency CANNOT be hard-coded, so results differ from run to run, but all of them land within 3% of the performance of the actual MPS implementation.
  • Xcode: Use Xcode 14.2. Newer Xcode/Metal SDK versions may not expose the Metal async copy header/APIs (e.g. __metal_simdgroup_async_copy_2d) used by the async-copy kernel(s); without them, kernel 9 won't compile.
  • A ton of the setup code in src/runner.cpp was vibe-coded, but it looks good to me.
  • Got rejected from Apple :-( but this was really fun to do, and I learned a ton about their architecture while prepping for the interview.
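
For anyone who wants to sanity-check the "within 3% of MPS" note on their own machine, here is a minimal sketch of the arithmetic. It is not taken from src/runner.cpp: the function names are hypothetical, and taking the median over several runs is just one assumed way to damp run-to-run noise when the clocks can't be pinned.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// An M x N x K SGEMM performs 2*M*N*K floating-point operations
// (one multiply and one add per inner-product term).
double sgemm_gflops(long m, long n, long k, double seconds) {
    return 2.0 * m * n * k / seconds / 1e9;
}

// Median of the measured run times (assumes at least one sample).
// Hypothetical helper: a median is less sensitive to a few throttled runs than a mean.
double median_seconds(std::vector<double> times) {
    std::sort(times.begin(), times.end());
    const std::size_t n = times.size();
    return (n % 2 == 1) ? times[n / 2]
                        : 0.5 * (times[n / 2 - 1] + times[n / 2]);
}

// Relative gap between a kernel and a baseline (e.g. MPS), as a fraction.
// A value of 0.03 corresponds to a 3% difference.
double relative_gap(double kernel_gflops, double baseline_gflops) {
    return std::fabs(kernel_gflops - baseline_gflops) / baseline_gflops;
}
```

Feed the same matrix size and the median time of each implementation into sgemm_gflops, then compare the two numbers with relative_gap.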

Acknowledgements

Thanks to siboehm/SGEMM_CUDA for a ton of boilerplate code and the overall benchmarking setup inspiration. Also thanks to dougallj/applegpu issue #28 (George Hotz and Philip Turner) for the async device→threadgroup copy breadcrumbs.
