Compute Shader

For certain types of calculations, running them in compute shaders on the GPU can be thousands of times faster than running them on the CPU alone.

In this tutorial, we will simulate a star field using an ‘N-Body simulation’. Each star is affected by the gravity of every other star. For 1,000 stars, this means we have 1,000 x 1,000 = 1,000,000 calculations to perform for each frame. The video has 65,000 stars, requiring 4.2 billion gravity force calculations per frame. On high-end hardware it can still run at 60 fps!
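
The growth in work is easy to check with a little arithmetic. The sketch below uses a hypothetical helper, `pairwise_calculations`, which is not part of the tutorial's code. It assumes every star is evaluated against every star, itself included, as the counts above do.

```python
def pairwise_calculations(num_stars: int) -> int:
    """Gravity evaluations per frame when every star is checked
    against every star (itself included)."""
    return num_stars * num_stars


for count in (1_000, 65_000):
    total = pairwise_calculations(count)
    print(f"{count:,} stars -> {total:,} calculations per frame")
```

Because the work grows with the square of the star count, doubling the number of stars roughly quadruples the work per frame.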

How does this work? There are three major parts to this program:

The Python code, which allocates buffers & glues everything together
The visualization shaders, which let us see the data in the buffers
The compute shader, which moves everything

Buffers

We need a place to store the data we’ll visualize. To do so, we’ll create two Shader Storage Buffer Objects (SSBOs) of floating point numbers from within our Python code. One will hold the previous frame’s star positions, and the other will be used to store the next frame’s positions as they are calculated.

Each buffer must be able to store the following for each star:

The x, y, and radius of each star
The velocity of the star, which will be unused by the visualization
The floating point RGBA color of the star
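
As a rough illustration of this layout, the sketch below packs two stars into a flat array of 32-bit floats and reads one back. The constant and function names (`FLOATS_PER_STAR`, `star_slice`, `buffer_size_bytes`) are made up for this example. It assumes each of the three groups is padded to four floats, giving 12 floats per star.

```python
from array import array

# Layout of one star: three groups of four 32-bit floats.
# These names are for illustration only.
POSITION_OFFSET = 0   # x, y, z (padding), radius
VELOCITY_OFFSET = 4   # vx, vy, vz (padding), vw (padding)
COLOR_OFFSET = 8      # r, g, b, a
FLOATS_PER_STAR = 12


def star_slice(data: array, index: int) -> array:
    """Return the 12 floats belonging to star number `index`."""
    start = index * FLOATS_PER_STAR
    return data[start:start + FLOATS_PER_STAR]


def buffer_size_bytes(num_stars: int) -> int:
    """Bytes needed to hold `num_stars` stars as 32-bit floats."""
    return num_stars * FLOATS_PER_STAR * array('f').itemsize


# Two stars packed back to back
data = array('f', [
    100.0, 200.0, 0.0, 6.0,   0.0, 0.0, 0.0, 0.0,   1.0, 1.0, 1.0, 1.0,
    300.0, 400.0, 0.0, 6.0,   0.0, 0.0, 0.0, 0.0,   0.5, 0.75, 1.0, 1.0,
])

second = star_slice(data, 1)
print("x, y:", second[POSITION_OFFSET], second[POSITION_OFFSET + 1])
print("rgba:", list(second[COLOR_OFFSET:COLOR_OFFSET + 4]))
print("bytes for 4000 stars:", buffer_size_bytes(4000))
```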

Generating Aligned Data

To avoid issues with GPU memory alignment quirks, we’ll use the function below to generate well-aligned data ready to load into the SSBO. The docstrings & comments explain why in greater detail:

Generating Well-Aligned Data to Load onto the GPU

def gen_initial_data(
        screen_size: Tuple[int, int],
        num_stars: int = NUM_STARS,
        use_color: bool = False
) -> array:
    """
    Generate an :py:class:`~array.array` of randomly positioned star data.

    Some of this data is wasted as padding because:

    1. GPUs expect SSBO data to be aligned to multiples of 4
    2. GLSL's vec3 is actually a vec4 with compiler-side restrictions,
       so we have to use 4-length vectors anyway.

    Args:
        screen_size: A (width, height) of the area to generate stars in
        num_stars: How many stars to generate
        use_color: Whether to generate white or randomized pastel stars
    Returns:
     An array of star position data
    """
    width, height = screen_size
    color_channel_min = 0.5 if use_color else 1.0

    def _data_generator() -> Generator[float, None, None]:
        """Inner generator function used to illustrate memory layout"""

        for i in range(num_stars):
            # Position/radius
            yield random.randrange(0, width)
            yield random.randrange(0, height)
            yield 0.0  # z (padding, unused by shaders)
            yield 6.0

            # Velocity (unused by visualization shaders)
            yield 0.0
            yield 0.0
            yield 0.0  # vz (padding, unused by shaders)
            yield 0.0  # vw (padding, unused by shaders)

            # Color
            yield random.uniform(color_channel_min, 1.0)  # r
            yield random.uniform(color_channel_min, 1.0)  # g
            yield random.uniform(color_channel_min, 1.0)  # b
            yield 1.0  # a

    # Use the generator function to fill an array in RAM
    return array('f', _data_generator())

Allocating the Buffers

Allocating the Buffers & Loading the Data onto the GPU

        self.center_window()

        # --- Create buffers

        # Create pairs of buffers for the compute & visualization shaders.
        # We will swap which buffer instance is the initial value and
        # which is used as the current value to write to.

        # ssbo = shader storage buffer object
        initial_data = gen_initial_data(self.get_size(), use_color=USE_COLORED_STARS)
        self.ssbo_previous = self.ctx.buffer(data=initial_data)
        self.ssbo_current = self.ctx.buffer(data=initial_data)

        # vao = vertex array object
        # Format string describing how to interpret the SSBO buffer data.
        # 4f = position and size -> x, y, z, radius
        # 4x4 = Four floats used for calculating velocity. Not needed for visualization.
        # 4f = color -> rgba
        buffer_format = "4f 4x4 4f"

        # Attribute variable names for the vertex shader
        attributes = ["in_vertex", "in_color"]

        self.vao_previous = self.ctx.geometry(
            [BufferDescription(self.ssbo_previous, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )
        self.vao_current = self.ctx.geometry(
            [BufferDescription(self.ssbo_current, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )

Visualization Shaders

Now that we have the data, we need to be able to visualize it. We’ll do it by applying vertex, geometry, and fragment shaders to convert the data in the SSBO into pixels. Each star’s 12 floats in the array pass through these three stages in that order.

Vertex Shader

In this tutorial, the vertex shader runs once for each star’s 12-float stretch of raw, padded data in self.ssbo_current. Each execution outputs clean, typed data to an instance of the geometry shader.

Data is read in as follows:

The x, y, and radius of each star are accessed via in_vertex
The floating point RGBA color of the star, via in_color
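
To make this mapping concrete, the sketch below mimics on the CPU what the "4f 4x4 4f" buffer format does for one star. It uses Python's `struct` module rather than anything from Arcade. The struct format `"=4f16x4f"` is an equivalent chosen for illustration, where `16x` skips the 16 bytes that hold the velocity.

```python
import struct
from array import array

# Four floats, 16 skipped bytes, four floats, in native byte order
STAR_FORMAT = "=4f16x4f"

one_star = array('f', [
    120.0, 340.0, 0.0, 6.0,   # position: x, y, z (padding), radius
    1.5, -2.5, 0.0, 0.0,      # velocity: skipped by the format
    0.5, 0.75, 1.0, 1.0,      # color: r, g, b, a
])

fields = struct.unpack(STAR_FORMAT, one_star.tobytes())
in_vertex, in_color = fields[:4], fields[4:]

print("bytes per star:", struct.calcsize(STAR_FORMAT))
print("in_vertex:", in_vertex)
print("in_color:", in_color)
```

The velocity values never reach `in_vertex` or `in_color`, which is why the visualization shaders can ignore them.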

shaders/vertex_shader.glsl

#version 330

in vec4 in_vertex;
in vec4 in_color;

out vec2 vertex_pos;
out float vertex_radius;
out vec4 vertex_color;

void main()
{
    vertex_pos = in_vertex.xy;
    vertex_radius = in_vertex.w;
    vertex_color = in_color;
}

The variables below are then passed as inputs to the geometry shader:

vertex_pos
vertex_radius
vertex_color

Geometry Shader

The geometry shader converts a single point into a quad, in this case a square, which the GPU can render. It does this by emitting four vertices arranged around the input point.

shaders/geometry_shader.glsl

#version 330

layout (points) in;
layout (triangle_strip, max_vertices = 4) out;

// Use Arcade's global projection UBO
uniform Projection {
    uniform mat4 matrix;
} proj;


// The outputs from the vertex shader are used as inputs
in vec2 vertex_pos[];
in float vertex_radius[];
in vec4 vertex_color[];

// These are used with EmitVertex to generate four points of
// a quad centered around vertex_pos[0].
out vec2 g_uv;
out vec3 g_color;

void main() {
    vec2 center = vertex_pos[0];
    vec2 hsize = vec2(vertex_radius[0]);

    g_color = vertex_color[0].rgb;

    gl_Position = proj.matrix * vec4(vec2(-hsize.x, hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(0, 1);
    EmitVertex();

    gl_Position = proj.matrix * vec4(vec2(-hsize.x, -hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(0, 0);
    EmitVertex();

    gl_Position = proj.matrix * vec4(vec2(hsize.x, hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(1, 1);
    EmitVertex();

    gl_Position = proj.matrix * vec4(vec2(hsize.x, -hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(1, 0);
    EmitVertex();

    // End geometry emission
    EndPrimitive();
}
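
As a CPU-side illustration of the shader above, the Python sketch below computes the same four corners and UV coordinates for one star, in the same order the shader emits them. `quad_corners` is a hypothetical helper. It leaves out the projection matrix, so the positions stay in the same coordinates as the star data.

```python
from typing import List, Tuple

Vec2 = Tuple[float, float]


def quad_corners(center: Vec2, radius: float) -> List[Tuple[Vec2, Vec2]]:
    """Return (position, uv) pairs in triangle-strip order:
    top-left, bottom-left, top-right, bottom-right."""
    cx, cy = center
    return [
        ((cx - radius, cy + radius), (0.0, 1.0)),
        ((cx - radius, cy - radius), (0.0, 0.0)),
        ((cx + radius, cy + radius), (1.0, 1.0)),
        ((cx + radius, cy - radius), (1.0, 0.0)),
    ]


for position, uv in quad_corners((10.0, 20.0), 6.0):
    print(position, uv)
```

Since the radius is used as the half-size, a star with a radius of 6 becomes a square 12 units on each side.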

Fragment Shader

A fragment shader runs for each pixel in a quad. It converts a UV coordinate within the quad to a float RGBA value. In this tutorial’s case, the shader produces the soft glowing circle on the surface of each star’s quad.

shaders/fragment_shader.glsl

#version 330

in vec2 g_uv;
in vec3 g_color;

out vec4 out_color;

void main()
{
    float l = length(vec2(0.5, 0.5) - g_uv.xy);
    if ( l > 0.5)
    {
        discard;
    }
    float alpha;
    if (l == 0.0)
        alpha = 1.0;
    else
        alpha = min(1.0, .60-l * 2);

    vec3 c = g_color.rgb;
    // c.xy += v_uv.xy * 0.05;
    // c.xy += v_pos.xy * 0.75;
    out_color = vec4(c, alpha);
}
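
To get a feel for the falloff, the sketch below repeats the shader's arithmetic in Python for a single UV coordinate. `star_alpha` is a hypothetical helper that returns `None` where the shader would discard the fragment. It copies the expression as written, so values past a distance of 0.3 from the center come out below zero.

```python
import math
from typing import Optional


def star_alpha(u: float, v: float) -> Optional[float]:
    """Alpha for a UV coordinate, or None where the fragment is discarded."""
    distance = math.hypot(0.5 - u, 0.5 - v)
    if distance > 0.5:
        return None  # outside the circle: the shader discards
    if distance == 0.0:
        return 1.0
    return min(1.0, 0.60 - distance * 2)


for uv in [(0.5, 0.5), (0.5, 0.6), (0.5, 0.75), (1.0, 1.0)]:
    print(uv, star_alpha(*uv))
```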

Compute Shader

Now that we have a way to display data, we should update it.

We created pairs of buffers earlier. We will use one SSBO as an input buffer holding the previous frame’s data, and another as our output buffer to write results to.

We then swap our buffers each frame after drawing, using the output as the input of the next frame, and repeat the process until the program stops running.
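
The swapping pattern can be sketched in plain Python, with lists standing in for the two SSBOs. Everything here is illustrative: `simulate` and `sum_of_all` are hypothetical names, and the update rule is a toy one rather than gravity.

```python
from typing import Callable, List


def simulate(
        initial: List[float],
        num_frames: int,
        update: Callable[[List[float], int], float]
) -> List[float]:
    """Run `num_frames` updates using two buffers that swap roles."""
    previous = list(initial)
    current = list(initial)

    for _ in range(num_frames):
        # Read only from `previous`, write only to `current`
        for i in range(len(previous)):
            current[i] = update(previous, i)

        # Swap: this frame's output becomes the next frame's input
        previous, current = current, previous

    return previous


def sum_of_all(values: List[float], index: int) -> float:
    """Toy update rule: every element depends on every other element."""
    return sum(values)


print(simulate([1.0, 2.0], 1, sum_of_all))
```

Keeping two buffers means every element sees the same, unmodified previous frame. Updating a single list in place would let later elements read values that had already been changed during the same frame.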

shaders/compute_shader.glsl

#version 430

// Set up our compute groups.
// The COMPUTE_SIZE_X and COMPUTE_SIZE_Y values will be replaced
// by the Python code with actual values. This does not happen
// automatically; the Python code must do the replacement itself.
layout(local_size_x=COMPUTE_SIZE_X, local_size_y=COMPUTE_SIZE_Y) in;

// Input uniforms would go here if you need them.
// The examples below match the ones commented out in main.py
//uniform vec2 screen_size;
//uniform float frame_time;

// Structure of the star data
struct Star
{
    vec4 pos;
    vec4 vel;
    vec4 color;
};

// Input buffer
layout(std430, binding=0) buffer stars_in
{
    Star stars[];
} In;

// Output buffer
layout(std430, binding=1) buffer stars_out
{
    Star stars[];
} Out;

void main()
{
    int curStarIndex = int(gl_GlobalInvocationID.x);

    Star in_star = In.stars[curStarIndex];

    vec4 p = in_star.pos.xyzw;
    vec4 v = in_star.vel.xyzw;

    // Move the star according to its current velocity
    p.xy += v.xy;

    // Calculate the new force based on all the other bodies
    for (int i=0; i < In.stars.length(); i++) {
        // If enabled, this will keep the star from calculating gravity on itself.
        // However, it does slow down the calculations to do this check.
        //  if (i == curStarIndex)
        //      continue;

        // Calculate distance squared
        float dist = distance(In.stars[i].pos.xyzw.xy, p.xy);
        float distanceSquared = dist * dist;

        // If distance is too small, extremely high forces can result and
        // fling the star into escape velocity and forever off the screen.
        // Capping the force term at minDistance below prevents this.
        float minDistance = 0.02;
        float gravityStrength = 0.3;
        float simulationSpeed = 0.002;
        float force = min(minDistance, gravityStrength / distanceSquared) * -simulationSpeed;

        vec2 diff = p.xy - In.stars[i].pos.xyzw.xy;
        // We should normalize this I think, but it doesn't work.
        //  diff = normalize(diff);
        vec2 delta_v = diff * force;
        v.xy += delta_v;
    }


    Star out_star;
    out_star.pos.xyzw = p.xyzw;
    out_star.vel.xyzw = v.xyzw;

    vec4 c = in_star.color.xyzw;
    out_star.color.xyzw = c.xyzw;

    Out.stars[curStarIndex] = out_star;
}
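
For comparison, here is a slow CPU sketch of the same update in plain Python. It is meant only as a reference for reading the shader, not as the tutorial's implementation. The constants are copied from the shader, while `step` and `FORCE_CAP` are names chosen for this example. One assumption differs from the shader: pairs at zero distance are skipped to avoid dividing by zero.

```python
from typing import List, Tuple

# Constants copied from the shader. The shader's minDistance acts as an
# upper limit on the force term, so it is called FORCE_CAP here.
FORCE_CAP = 0.02
GRAVITY_STRENGTH = 0.3
SIMULATION_SPEED = 0.002

Star = Tuple[float, float, float, float]  # x, y, vx, vy


def step(stars: List[Star]) -> List[Star]:
    """Compute the next frame from the previous one without modifying it."""
    next_stars = []
    for x, y, vx, vy in stars:
        # Move the star according to its current velocity
        px, py = x + vx, y + vy

        # Accumulate the pull of every star in the previous frame
        for other_x, other_y, _, _ in stars:
            dx, dy = px - other_x, py - other_y
            distance_squared = dx * dx + dy * dy
            if distance_squared == 0.0:
                # Assumption: skip instead of dividing by zero. The offset
                # is (0, 0) here, so a finite force would add nothing anyway.
                continue
            force = min(FORCE_CAP, GRAVITY_STRENGTH / distance_squared) * -SIMULATION_SPEED
            vx += dx * force
            vy += dy * force

        next_stars.append((px, py, vx, vy))
    return next_stars


two_stars = [(0.0, 0.0, 0.0, 0.0), (10.0, 0.0, 0.0, 0.0)]
for star in step(two_stars):
    print(star)
```

After one step, the two stars have not moved yet, but each has gained a small velocity pointing toward the other.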

The Finished Python Program

The code includes thorough docstrings and annotations explaining how it works.

main.py

"""
N-Body Gravity with Compute Shaders & Buffers
"""
import random
from array import array
from pathlib import Path
from typing import Generator, Tuple

import arcade
from arcade.gl import BufferDescription

# Window dimensions in pixels
WINDOW_WIDTH = 800
WINDOW_HEIGHT = 600

# Size of performance graphs in pixels
GRAPH_WIDTH = 200
GRAPH_HEIGHT = 120
GRAPH_MARGIN = 5

NUM_STARS: int = 4000
USE_COLORED_STARS: bool = True


def gen_initial_data(
        screen_size: Tuple[int, int],
        num_stars: int = NUM_STARS,
        use_color: bool = False
) -> array:
    """
    Generate an :py:class:`~array.array` of randomly positioned star data.

    Some of this data is wasted as padding because:

    1. GPUs expect SSBO data to be aligned to multiples of 4
    2. GLSL's vec3 is actually a vec4 with compiler-side restrictions,
       so we have to use 4-length vectors anyway.

    Args:
        screen_size: A (width, height) of the area to generate stars in
        num_stars: How many stars to generate
        use_color: Whether to generate white or randomized pastel stars
    Returns:
     An array of star position data
    """
    width, height = screen_size
    color_channel_min = 0.5 if use_color else 1.0

    def _data_generator() -> Generator[float, None, None]:
        """Inner generator function used to illustrate memory layout"""

        for i in range(num_stars):
            # Position/radius
            yield random.randrange(0, width)
            yield random.randrange(0, height)
            yield 0.0  # z (padding, unused by shaders)
            yield 6.0

            # Velocity (unused by visualization shaders)
            yield 0.0
            yield 0.0
            yield 0.0  # vz (padding, unused by shaders)
            yield 0.0  # vw (padding, unused by shaders)

            # Color
            yield random.uniform(color_channel_min, 1.0)  # r
            yield random.uniform(color_channel_min, 1.0)  # g
            yield random.uniform(color_channel_min, 1.0)  # b
            yield 1.0  # a

    # Use the generator function to fill an array in RAM
    return array('f', _data_generator())


class NBodyGravityWindow(arcade.Window):

    def __init__(self):
        # Ask for OpenGL context supporting version 4.3 or greater when
        # calling the parent initializer to make sure we have compute shader
        # support.
        super().__init__(
            WINDOW_WIDTH, WINDOW_HEIGHT,
            "N-Body Gravity with Compute Shaders & Buffers",
            gl_version=(4, 3),
            resizable=False
        )
        # Attempt to put the window in the center of the screen.
        self.center_window()

        # --- Create buffers

        # Create pairs of buffers for the compute & visualization shaders.
        # We will swap which buffer instance is the initial value and
        # which is used as the current value to write to.

        # ssbo = shader storage buffer object
        initial_data = gen_initial_data(self.get_size(), use_color=USE_COLORED_STARS)
        self.ssbo_previous = self.ctx.buffer(data=initial_data)
        self.ssbo_current = self.ctx.buffer(data=initial_data)

        # vao = vertex array object
        # Format string describing how to interpret the SSBO buffer data.
        # 4f = position and size -> x, y, z, radius
        # 4x4 = Four floats used for calculating velocity. Not needed for visualization.
        # 4f = color -> rgba
        buffer_format = "4f 4x4 4f"

        # Attribute variable names for the vertex shader
        attributes = ["in_vertex", "in_color"]

        self.vao_previous = self.ctx.geometry(
            [BufferDescription(self.ssbo_previous, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )
        self.vao_current = self.ctx.geometry(
            [BufferDescription(self.ssbo_current, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )

        # --- Create the visualization shaders

        vertex_shader_source = Path("shaders/vertex_shader.glsl").read_text()
        fragment_shader_source = Path("shaders/fragment_shader.glsl").read_text()
        geometry_shader_source = Path("shaders/geometry_shader.glsl").read_text()

        # Create the complete shader program which will draw the stars
        self.program = self.ctx.program(
            vertex_shader=vertex_shader_source,
            geometry_shader=geometry_shader_source,
            fragment_shader=fragment_shader_source,
        )

        # --- Create our compute shader

        # Load in the raw source code safely & auto-close the file
        compute_shader_source = Path("shaders/compute_shader.glsl").read_text()

        # Compute shaders use groups to parallelize execution.
        # You don't need to understand how this works yet, but the
        # values below should serve as reasonable defaults. Later, we'll
        # preprocess the shader source by replacing the templating token
        # with its corresponding value.
        self.group_x = 256
        self.group_y = 1

        self.compute_shader_defines = {
            "COMPUTE_SIZE_X": self.group_x,
            "COMPUTE_SIZE_Y": self.group_y
        }

        # Preprocess the source by replacing each define with its value as a string
        for templating_token, value in self.compute_shader_defines.items():
            compute_shader_source = compute_shader_source.replace(templating_token, str(value))

        self.compute_shader = self.ctx.compute_shader(source=compute_shader_source)

        # --- Create the FPS graph

        # Enable timings for the performance graph
        arcade.enable_timings()

        # Create a sprite list to put the performance graph into
        self.perf_graph_list = arcade.SpriteList()

        # Create the FPS performance graph
        graph = arcade.PerfGraph(GRAPH_WIDTH, GRAPH_HEIGHT, graph_data="FPS")
        graph.position = GRAPH_WIDTH / 2, self.height - GRAPH_HEIGHT / 2
        self.perf_graph_list.append(graph)

    def on_draw(self):
        # Clear the screen
        self.clear()
        # Enable blending so our alpha channel works
        self.ctx.enable(self.ctx.BLEND)

        # Bind buffers
        self.ssbo_previous.bind_to_storage_buffer(binding=0)
        self.ssbo_current.bind_to_storage_buffer(binding=1)

        # If you wanted, you could set input variables for compute shader
        # as in the lines commented out below. You would have to add or
        # uncomment corresponding lines in compute_shader.glsl
        # self.compute_shader["screen_size"] = self.get_size()
        # self.compute_shader["frame_time"] = self.frame_time

        # Run compute shader to calculate new positions for this frame
        self.compute_shader.run(group_x=self.group_x, group_y=self.group_y)

        # Draw the current star positions
        self.vao_current.render(self.program)

        # Swap the buffer pairs.
        # The buffers for the current state become the initial state,
        # and the data of this frame's initial state will be overwritten.
        self.ssbo_previous, self.ssbo_current = self.ssbo_current, self.ssbo_previous
        self.vao_previous, self.vao_current = self.vao_current, self.vao_previous

        # Draw the graphs
        self.perf_graph_list.draw()



if __name__ == "__main__":
    app = NBodyGravityWindow()
    arcade.run()
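
The preprocessing step in main.py can be tried on its own. The sketch below applies the same string replacement to the shader's layout line, then estimates how many shader invocations one dispatch produces. `apply_defines` and `total_invocations` are hypothetical helpers. The estimate assumes that `run()` launches group_x * group_y work groups, each containing local_size_x * local_size_y invocations, as OpenGL compute dispatch does.

```python
from typing import Dict, Tuple


def apply_defines(source: str, defines: Dict[str, int]) -> str:
    """Replace each templating token in the source with its value."""
    for token, value in defines.items():
        source = source.replace(token, str(value))
    return source


def total_invocations(groups: Tuple[int, int], local_size: Tuple[int, int]) -> int:
    """Work groups multiplied by invocations per work group."""
    return groups[0] * groups[1] * local_size[0] * local_size[1]


layout_line = "layout(local_size_x=COMPUTE_SIZE_X, local_size_y=COMPUTE_SIZE_Y) in;"
defines = {"COMPUTE_SIZE_X": 256, "COMPUTE_SIZE_Y": 1}

print(apply_defines(layout_line, defines))
print(total_invocations(groups=(256, 1), local_size=(256, 1)))
```

With the values used in main.py, this works out to 65,536 invocations per dispatch.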

An expanded version of this tutorial with support for 3D is available at: https://github.com/pvcraven/n-body