Compute Shaders#

WebGPU allows running compute shaders on the GPU for general-purpose computation (currently single precision only). This is useful for parallel algorithms that don’t need to render anything.

Vector addition#

A minimal example: add two float arrays on the GPU and read the result back.

[1]:

import numpy as np
import webgpu.jupyter  # sets up browser connection for GPU access
from webgpu.utils import (
    get_device, buffer_from_array, uniform_from_array,
    BufferBinding, UniformBinding, BufferUsage,
    run_compute_shader, read_buffer,
)

device = get_device()

a = np.array([1, 2, 3], dtype=np.float32)
b = np.array([4, 5, 6], dtype=np.float32)

N = a.size
mem_size = a.size * a.itemsize

a_gpu = buffer_from_array(a)
b_gpu = buffer_from_array(b)
res_gpu = device.createBuffer(mem_size, BufferUsage.STORAGE | BufferUsage.COPY_SRC)
uniform_N = uniform_from_array(np.array([N], dtype=np.uint32))

bindings = [
    BufferBinding(101, a_gpu),
    BufferBinding(102, b_gpu),
    BufferBinding(103, res_gpu, read_only=False),
    UniformBinding(104, uniform_N),
]

shader_code = """
@group(0) @binding(101) var<storage> vec_a : array<f32>;
@group(0) @binding(102) var<storage> vec_b : array<f32>;
@group(0) @binding(103) var<storage, read_write> vec_res : array<f32>;
@group(0) @binding(104) var<uniform> N : u32;

@compute @workgroup_size(256, 1, 1)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let tid = gid.x;
    if (tid < N) {
        vec_res[tid] = vec_a[tid] + vec_b[tid];
    }
}
"""

run_compute_shader(shader_code, bindings, n_workgroups=((N + 255) // 256, 1, 1))
result = read_buffer(res_gpu, np.float32)
print(f"a = {a}")
print(f"b = {b}")
print(f"a + b = {result}")

a = [1. 2. 3.]
b = [4. 5. 6.]
a + b = [5. 7. 9.]

Key concepts#

Workgroups and threads#

@workgroup_size(256, 1, 1) defines 256 threads per workgroup.
n_workgroups sets how many workgroups to dispatch — here ceil(N / 256).
Each thread gets a unique global_invocation_id (gid.x = workgroup index × 256 + local index).
The if (tid < N) guard prevents out-of-bounds access when N isn’t a multiple of 256.

Buffer types#

Function	Usage	Description
`buffer_from_array(arr)`	`STORAGE \\| COPY_SRC`	Read-only storage buffer initialized from a numpy array
`device.createBuffer(size, usage)`	manual	Raw buffer — use `STORAGE \\| COPY_SRC` for read-write output
`uniform_from_array(arr)`	`UNIFORM`	Small constant data (e.g. problem size N)

Bindings#

BufferBinding(slot, buf) — read-only var<storage> by default.
BufferBinding(slot, buf, read_only=False) — read-write var<storage, read_write>.
UniformBinding(slot, buf) — var<uniform>.

Reading results#

read_buffer(gpu_buf, dtype) copies data back from GPU to CPU as a numpy array. The buffer must have COPY_SRC usage for this to work.