Several benchmarks have emerged to measure how well LLMs handle AI-assisted software engineering. Android developers face specific challenges that existing benchmarks don't cover, so we created one whose north star is high-quality Android development.
Latest results as of June 9, 2026: This refresh adds Gemini 3.5 Flash and moves older models (GPT 5.3 Codex, GPT 5.2 Codex, Claude Opus 4.6, Claude Opus 4.5, Gemini 3 Pro Preview, Gemini 2.5 Pro, and Gemini 2.5 Flash) to our new Archive page, where you can find previous versions of the leaderboard.

Learn more about Android Bench

Learn more about how we created a set of common Android developer tasks.
Many of the tasks are based on our definition of high-quality Android development, which is detailed in our developer documentation.
See the full repo so you can replicate the tests yourself.