My blog post: A Tiny Compiler for Data-Parallel Kernels
This is a tiny experimental compiler (under 180 lines of code) that lowers simple kernel loops into explicit data-parallel code. It takes a small hand-written AST and prints a lowered form with lanes, masks, masked loads, masked stores, and gathers.
More info in the blog post above!
Run it with:
python compiler.py
EXAMPLE: Scale audio volume
SOURCE
kernel scale_audio(samples, out, n, volume):
    for i in range(n):
        # Each sample can be adjusted without looking at its neighbors.
        # That independence lets several samples run side by side.
        out[i] = samples[i] * volume
LOWERED
kernel scale_audio(samples, out, n, volume):
    vector_for base in range(0, n, LANES):
        let i = (base + lane_id)
        let active = (i < n)
        masked_store(out, i, (masked_load(samples, i, active) * volume), active)
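To make the lowered form concrete, here is a minimal pure-Python sketch that emulates it one vector at a time. It is not taken from compiler.py: the value LANES = 4, the list-per-lane representation, and the bodies of masked_load and masked_store are assumptions chosen for illustration. The point it shows is that the mask keeps the last, partially filled vector from reading or writing past n.

```python
LANES = 4  # assumed vector width; the README does not state one


def masked_load(array, idx, active):
    """Read array[idx[l]] for each active lane; inactive lanes get 0 and never touch memory."""
    return [array[i] if on else 0 for i, on in zip(idx, active)]


def masked_store(array, idx, values, active):
    """Write values[l] to array[idx[l]] for active lanes only."""
    for i, value, on in zip(idx, values, active):
        if on:
            array[i] = value


def scale_audio(samples, out, n, volume):
    for base in range(0, n, LANES):                       # vector_for base in range(0, n, LANES)
        i = [base + lane_id for lane_id in range(LANES)]  # let i = (base + lane_id)
        active = [x < n for x in i]                       # let active = (i < n)
        loaded = masked_load(samples, i, active)
        masked_store(out, i, [s * volume for s in loaded], active)


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # n = 6 is deliberately not a multiple of LANES
out = [0.0] * len(samples)
scale_audio(samples, out, len(samples), 0.5)
print(out)  # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
```

In the second iteration the lane indices are 4, 5, 6, 7, so only the first two lanes are active and indices 6 and 7 are never dereferenced.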
EXAMPLE: Move particles
SOURCE
kernel move_particles(position, velocity, out, n, dt):
    for i in range(n):
        # Particle i moves using its own position and velocity.
        # It does not depend on particle i - 1 or particle i + 1.
        out[i] = position[i] + velocity[i] * dt
LOWERED
kernel move_particles(position, velocity, out, n, dt):
    vector_for base in range(0, n, LANES):
        let i = (base + lane_id)
        let active = (i < n)
        masked_store(out, i, (masked_load(position, i, active) + (masked_load(velocity, i, active) * dt)), active)
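The nested expression in the lowered store above could be produced by a small recursive rewrite over the AST. The sketch below is one way that might look, not the actual code in compiler.py: the node classes Var, Index, and BinOp, and the functions lower_expr and lower_store, are hypothetical names invented for this example. The rule it assumes is that a read indexed by the loop variable becomes a masked_load, and a read indexed by anything else becomes a gather.

```python
from dataclasses import dataclass


# Hypothetical AST nodes; compiler.py may model expressions differently.
@dataclass
class Var:
    name: str


@dataclass
class Index:  # array[index]
    array: str
    index: object


@dataclass
class BinOp:  # left op right
    op: str
    left: object
    right: object


def lower_expr(expr, loop_var="i"):
    """Return the lowered text for an expression inside the vectorized loop."""
    if isinstance(expr, Var):
        return expr.name
    if isinstance(expr, Index):
        if isinstance(expr.index, Var) and expr.index.name == loop_var:
            # Lanes read consecutive elements: a masked contiguous load.
            return f"masked_load({expr.array}, {loop_var}, active)"
        # Each lane may read an unrelated element: a gather.
        return f"gather({expr.array}, {lower_expr(expr.index, loop_var)}, active)"
    if isinstance(expr, BinOp):
        left = lower_expr(expr.left, loop_var)
        right = lower_expr(expr.right, loop_var)
        return f"({left} {expr.op} {right})"
    raise TypeError(f"unsupported node: {expr!r}")


def lower_store(target, value, loop_var="i"):
    """Lower `target[loop_var] = value` to a masked store."""
    return f"masked_store({target}, {loop_var}, {lower_expr(value, loop_var)}, active)"


# out[i] = position[i] + velocity[i] * dt
rhs = BinOp("+", Index("position", Var("i")),
            BinOp("*", Index("velocity", Var("i")), Var("dt")))
lowered = lower_store("out", rhs)
print(lowered)
```

Every binary operation is parenthesized unconditionally, which keeps the printer simple and matches the fully parenthesized style of the lowered output shown in this README.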
EXAMPLE: Color by number
SOURCE
kernel color_by_number(color_number, colors, out, n):
    for i in range(n):
        # Each pixel stores a small color number, like a color-by-number page.
        number = color_number[i]
        # Neighboring pixels can name completely different colors.
        # The lowered code must let each lane read its own color entry.
        out[i] = colors[number]
LOWERED
kernel color_by_number(color_number, colors, out, n):
    vector_for base in range(0, n, LANES):
        let i = (base + lane_id)
        let active = (i < n)
        let number = masked_load(color_number, i, active)
        masked_store(out, i, gather(colors, number, active), active)
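The difference between a masked load and a gather is easiest to see by emulating both. In this minimal pure-Python sketch, the helper bodies and LANES = 4 are assumptions made for illustration, not the behavior of any real backend. A masked load reads at the lane indices themselves, while a gather reads at indices that were computed per lane, so adjacent lanes can land on unrelated table entries.

```python
LANES = 4  # assumed vector width; the README does not state one


def masked_load(array, idx, active):
    """Contiguous-style read: lane l reads array[idx[l]] if active, else yields 0."""
    return [array[i] if on else 0 for i, on in zip(idx, active)]


def gather(table, indices, active):
    """Indexed read: lane l reads table[indices[l]] if active, else yields None."""
    return [table[k] if on else None for k, on in zip(indices, active)]


def masked_store(array, idx, values, active):
    """Lane l writes values[l] to array[idx[l]] only if active."""
    for i, value, on in zip(idx, values, active):
        if on:
            array[i] = value


def color_by_number(color_number, colors, out, n):
    for base in range(0, n, LANES):
        i = [base + lane_id for lane_id in range(LANES)]
        active = [x < n for x in i]
        number = masked_load(color_number, i, active)  # one color number per lane
        masked_store(out, i, gather(colors, number, active), active)


colors = ["red", "green", "blue"]
color_number = [2, 0, 0, 1, 2]  # 5 pixels, so the second vector has one active lane
out = [None] * len(color_number)
color_by_number(color_number, colors, out, len(color_number))
print(out)  # ['blue', 'red', 'red', 'green', 'blue']
```

Because gather checks the mask before indexing, the placeholder color numbers produced for inactive lanes are never used to read from colors.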
