ECS with Python and Tkinter, Part 2


[Dynamic LinkStatic Link]
Posted: 2020-12-11
Last Edited: 2020-12-11

Basics of ECS

ECS stands for Entity Component System, and is a way of designing programs that leads to very clean code, and easy composition. If you have component A, with its relevant system (controlling said components behaviour), every entity with component A will exhibit its associated behavior (as it will be processed by system A). Say that system A causes an entity to oscillate up and down, and that you have another entity which corresponds to a rock in your scene. By simply adding component A to this other entity, it will (hopefully) start being processed by system A and will start moving up and down. This composition makes it really enjoyable to work with ECS.

Entities

Entities are the things in a game, simply unique identifier, with no data by themselves. They represent discrete objects in the game (i.e. the player, rocks in the scene, a hidden manager object). Some ECS architectures store tags and other lightweight metadata on the Entity, which is fine, but i chose not to for purity and simplicity reasons.

typedef unsigned long entity_t; // as simple as a uint64

typedef struct Entity_t {
    unsigned long id;
    unsigned long flags;
    unsigned long tags;
    // ...
} Entity; // or something more complex

Systems

Systems implement the behaviour seen in a game. From rendering and physics, to input handling and enemy AI, everything the player experiences is the result of a system. The only difference between a system and a standalone script, is that where a script acts on a single object, the system acts on an iterator of objects (be that an array or a vector). This means that the instruction cache of the CPU is kept hot while processing objects, as the CPU doesn't have to go from processing a Collider to rendering a Mesh to then deserialising some data from the network.

Whilst systems at their simplest process a single component type, in real situations it is often necessary to reference other components on the same entity. For example, when running physics, it is important to know an entities current position (e.g. stored in a Transform component) and its velocity (e.g. stored in a Velocity component).

One solution is to move velocity data into the Transform. This means that it is no longer possible to have an entity with no velocity, only one with zero velocity. This makes the Transform component larger, harming performance (as the cacheline can hold less Transform components; explained in more detail in the 'Components' section). There is a better way.

Another solution is to have sytems work not on individual components, but on unions of components instead. This is what the archetypal ECS architecture is about. By doing this, data can be kept in smaller, more cache-friendly chunks.

class RenderSystem:
    def actions():
        return {
            self.process_active: (Transform2D, ScreenElement),
            self.process_stale: (StaleTag,),
        }

This is the approach i took, where each System has an actions() method which returns delegates which process entities, and the archetypes those delegates operate on. This allows a single system to contain many processing methods, and thus keeps related state in one place instead of strew around many different systems. The above example shows how 1 system can handle multiple archetypes.

Components

Finally, components. Components hold all the data for the game. Every entity that needs to have its mesh rendered, for example, will have a MeshComponent. Each component stores only the data that it needs, and no more. This separation means that no memory is wasted for entities that dont need a certain component.

Another good reason for storing data in components is that when you have a huge number, cache locality and the efficient filling of cache becomes important. Storing all possible state in 1 struct, and iterating over an array of said struct just to do 1 operation on it (say, update physics) means that the cache is constantly cold for the next piece of data to operate on.

A little detour to hardware land!

When a CPU wants to fetch some data from RAM, it will load a certain slice (say 64 bytes) into cache first (this is the idea of a cacheline). By loading more data than necessary from RAM, the CPU makes the assumption that it will need to fetch data nearby soon, and so it preloads the data into cache (which is much much faster than RAM; see pg 22 of above pdf). The relative timings of cache vs RAM taken from an intel pdf are as follows:

Data Source                                             Latency
L3 Cache hit, line unshared between cores               ~40 cycles
L3 Cache hit, line shared readonly between cores        ~65 cycles
L3 Cache hit, line shared readwrite between cores       ~75 cycles
L3 Cache hit, from other core                           ~100-300 cycles
Local DRAM                                              ~60ns
Remote DRAM                                             ~100ns

For a CPU clocked at 2GHz (slower than modern CPUs), each cycle takes half a nanosecond. This means that an unshared L3 hit takes 20ns, whereas going to RAM takes triple that. Lower caches (L1 and L2) can be on the order of 5-2 ns for a hit, meaning that going to RAM takes between 12 and 30 times longer. Thus, the CPU can save a lot of cycles by loading chunks instead of fetching data byte by byte if nearby data is also used (and there will be a cache hit when said new data is accessed).

Getting back on track

Getting back to why components are a thing, with an example:

// clean separation into multiple discrete components
typedef struct Transform_t {
    double px, py, pz;
} Transform;

typedef struct Velocity_t {
    double vx, vy, vz;
} Velocity;

typedef struct Mesh_t {
    void *mesh_data;
} Mesh;

// chucking it all into one mixer and pressing "blend"
typedef struct State_t {
    Transform transform;
    Velocity velocity;
    Mesh mesh;
} State;

// ...

void main(void) {
    // this approach will have worse cacheline efficiency
    State[] game_objects = { ... };
    
    for (int i = 0; i < sizeof(game_objects); i++) {
        calculate_physics(game_objects[i]);
        render_object(game_objects[i]);
    }

    // this approach will have better cacheline efficiency
    Entity[] entities = { ... };
    Transform[] transforms = { ... };
    Velocity[] velocities = { ... };
    Mesh[] meshes = { ... };

    for (int i = 0; i < sizeof(entities); i++) {
        process_physics(transforms[i], velocities[i]);
    }

    for (int i = 0; i < sizeof(entities); i++) {
        render(transforms[i], meshes[i]);
    }
}

Why is the second approach better? Don't we iterate twice compared to once? The reason that the second approach is preferrable is that it packs the cacheline much more efficiently.

Below is a visualisation of this difference, with x, y, and z standing in for Transform, Velocity, and Mesh. The dashes show a simplified cacheline.

Mixed State:
| x0 y0 z0 x1 | y1 z1 x2 y2 | z2 x3 y3 z3 | x4 y4 z4    |
| ----------- | ----------- | ----------- | ----------- |

Components:
| x0 y0 x1 y1 | x2 y2 x3 y3 | x4 y4       |
| ----------- | ----------- | ----------- |

| x0 z0 x1 z1 | x2 z2 x3 z3 | x4 z4       |
| ----------- | ----------- | ----------- |

Not only does processing the mixed state do worse on the instruction cache, it also fills the regular data cache worse. 1 struct can fit at a time, and to get the next struct to process another fetch must be made. This contrasts with the component approach, where each cacheline can fit 2 entities (Transform + Other) at once. When the number of processed entities gets large, the worse cache performance of the mixed state approach becomes more and more evident.

The fact that you have to iterate twice doesn't really matter. You end up doing the same amount of work (you still do render() and physics() for each entity). The difference is that each fetch in the 1st approach has the opcache be cold. The 2nd approach means that the opcache is cold for the first entity, and hot thereafter. This means that the 2nd approach can outperform the 1st.