Writing an ECS with Python and Tkinter
The Specification
Recently at Uni we were given a piece of coursework that had us write a game
in Python. Normally, this is no big deal: simply import pygame, write a
while not quit: ...
loop, and be done with it. The catch was
that we weren't allowed to use any non-standard features (i.e. no pip). This
meant that pygame, Pillow, and many other useful libraries were unavailable,
and the only remaining windowing option was Tkinter. Now, I like to do things
from scratch as much as the next person, but Tkinter would not have been my
first suggestion for a (viable) game development framework. But the coursework
specification had had its say, and thus Tkinter it was to be.
Tkinter as a Game Engine
For fun and profit I decided to write an ECS game engine in Python. Partly because I had never written an ECS engine before, and wanted to learn about the upsides and downsides of the ECS paradigm. Partly because I like to make my life harder than it has to be. Mainly because Python is inherently slow, and I wanted to make the game engine as fast as possible to amortise the cost of the language. Thus I started reading up on ECS and Tkinter. A great resource for Tk is effbot.org, which I used extensively while researching for the engine. It seems to be down as of 08-12-2020, which is a shame, but it is still accessible on the Wayback Machine.
Whilst I may come off as negative towards Tkinter, after working with it for
a bit I do think it is quite good. Its performance surprised me, as it was easily
able to draw thousands of canvas entities and move them around without issue at
60 frames per second. It was also really easy to work with: a single
Tk()
call initialises the entire library, and then a single
Canvas()
and canvas.pack()
call makes the game
screen show up.
Overall I enjoyed the project, and below is further explanation of the ECS engine itself. For now, the code for the game engine is available on my git instance.
Basics of an ECS
ECS stands for Entity-Component-System, and is a way of designing programs that allegedly leads to very clean code and easy composition of behaviours. If you have component A, with its relevant system (controlling said component's behaviour), every entity with component A will exhibit the associated behaviour (as it will be processed by system A). Say that system A causes an entity to oscillate up and down, and that you have another entity which corresponds to a rock in your scene. By simply adding component A to this other entity, it will (hopefully) start being processed by system A, and will start moving up and down.
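That composition idea can be sketched in a few lines of Python. None of the names below come from the engine itself; they are purely illustrative:

```python
# Minimal sketch of component composition: entities are plain ids,
# components live in per-type dicts, and a system processes every
# entity that currently has its component.
import math

class Position:
    def __init__(self, x=0.0, y=0.0):
        self.x, self.y = x, y

class Oscillate:
    def __init__(self, amplitude=1.0):
        self.amplitude = amplitude
        self.t = 0.0

positions = {}    # entity id -> Position
oscillators = {}  # entity id -> Oscillate

def oscillate_system(dt):
    # every entity with an Oscillate component bobs up and down
    for eid, osc in oscillators.items():
        osc.t += dt
        positions[eid].y = osc.amplitude * math.sin(osc.t)

rock = 1
positions[rock] = Position()
oscillate_system(0.1)          # the rock is static: no Oscillate yet
oscillators[rock] = Oscillate()
oscillate_system(0.1)          # now it moves
```

Attaching the component is the only change needed for the rock to pick up the behaviour; the system itself never has to know the rock exists.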
Entities
Entities are the "things" in a game: simply unique identifiers, with no data by themselves. They represent discrete objects in the game (e.g. the player, rocks in the scene, a hidden manager object). Some ECS architectures store tags and other lightweight metadata on the Entity, which is fine, but I chose not to for "purity" and simplicity reasons.
typedef unsigned long entity_t; // as simple as a uint64

typedef struct Entity {
    unsigned long id;
    unsigned long flags;
    unsigned long tags;
    // ...
} Entity_t; // or something more complex
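In Python, the id-only flavour can be as small as a counter. A sketch (not the engine's actual code):

```python
from itertools import count

# Id-only entities: an entity is just the next integer from a
# monotonically increasing counter, guaranteeing uniqueness.
_next_entity = count()

def create_entity() -> int:
    return next(_next_entity)

player = create_entity()
rock = create_entity()
```

All the interesting data then lives elsewhere, keyed by these ids.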
Components
Components hold all the data for the game. Every entity that needs to have its mesh rendered, for example, will have a MeshComponent. Each component stores only the data that it needs, and no more. This separation means that no memory is wasted for entities that don't need a certain component.
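In Python this separation might look like small dataclasses, with one store per component type. The names here are illustrative, not the engine's:

```python
from dataclasses import dataclass

@dataclass
class Transform2D:
    x: float = 0.0
    y: float = 0.0

@dataclass
class Velocity2D:
    vx: float = 0.0
    vy: float = 0.0

# one store per component type: entity id -> component instance
transforms: dict[int, Transform2D] = {}
velocities: dict[int, Velocity2D] = {}

player, rock = 0, 1
transforms[player] = Transform2D(0.0, 0.0)
velocities[player] = Velocity2D(1.0, 0.0)
transforms[rock] = Transform2D(5.0, 5.0)
# the rock never moves, so it simply has no Velocity2D and pays
# no memory for one
```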
Another good reason for storing data in components is that when you have a huge number of entities, cache locality and the efficient filling of cache become important. Storing all possible state in one struct, and iterating over an array of said struct just to do one operation on it (say, update physics), means that the cache is constantly cold for the next piece of data to operate on.
A Little Detour to Hardware Land!
When a CPU wants to fetch some data from RAM, it will load a certain slice (say 64 bytes) into cache first (this is the idea of a cacheline). By loading more data than necessary from RAM, the CPU makes the assumption that it will need to fetch nearby data soon, and so it preloads that data into cache, which is much, much faster than RAM. The relative timings of cache vs RAM, taken from an Intel pdf (pg 22), are as follows:
Data Source                                          Latency
L3 Cache hit, line unshared between cores            ~40 cycles
L3 Cache hit, line shared readonly between cores     ~65 cycles
L3 Cache hit, line shared readwrite between cores    ~75 cycles
L3 Cache hit, from other core                        ~100-300 cycles
Local DRAM                                           ~60 ns
Remote DRAM                                          ~100 ns
For a CPU clocked at 2GHz (slower than modern CPUs), each cycle takes half a nanosecond. This means that an unshared L3 hit takes 20ns, whereas going to RAM takes triple that. Lower caches (L1 and L2) can be on the order of 2-5ns for a hit, meaning that going to RAM takes between 12 and 30 times longer. Thus, if nearby data is used soon after (so that accessing it hits the cache), the CPU can save a lot of cycles by loading chunks instead of fetching data piece by piece.
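The arithmetic above is easy to check, assuming the 2GHz clock from the text:

```python
# Cycle-to-time conversion for the latencies quoted above,
# assuming a 2 GHz clock (0.5 ns per cycle).
GHZ = 2.0
ns_per_cycle = 1.0 / GHZ            # 0.5 ns per cycle

l3_unshared_ns = 40 * ns_per_cycle  # unshared L3 hit: 20 ns
local_dram_ns = 60.0                # from the table above

dram_vs_l3 = local_dram_ns / l3_unshared_ns   # "triple that"
dram_vs_l1l2 = (local_dram_ns / 5.0,          # best case: 12x
                local_dram_ns / 2.0)          # worst case: 30x
```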
Getting Back on Track
Getting back to why components are a thing, with an example:
// clean separation into multiple discrete components
typedef struct Transform {
    double px, py, pz;
} Transform_t;

typedef struct Velocity {
    double vx, vy, vz;
} Velocity_t;

typedef struct Mesh {
    void *mesh_data;
} Mesh_t;

// chucking it all into one mixer and pressing "blend"
typedef struct State {
    Transform_t transform;
    Velocity_t velocity;
    Mesh_t mesh;
} State_t;

// ...

int main(void) {
    // this approach will have worse cacheline efficiency
    State_t game_objects[] = { /* ... */ };
    for (size_t i = 0; i < ARRLEN(game_objects); i++) {
        calculate_physics(&game_objects[i]);
        render_object(&game_objects[i]);
    }

    // this approach will have better cacheline efficiency
    entity_t entities[] = { /* ... */ };
    Transform_t transforms[] = { /* ... */ };
    Velocity_t velocities[] = { /* ... */ };
    Mesh_t meshes[] = { /* ... */ };
    for (size_t i = 0; i < ARRLEN(entities); i++) {
        process_physics(&transforms[i], &velocities[i]);
    }
    for (size_t i = 0; i < ARRLEN(entities); i++) {
        render(&transforms[i], &meshes[i]);
    }
}
Why is the second approach better? Don't we iterate twice compared to once? The reason the second approach is preferable is that it packs the cacheline much more efficiently, leading to a higher percentage of cache hits, and thus time saved.
Below is a visualisation of this difference, with `x`, `y`, and `z` standing in for Transform, Velocity, and Mesh. The dashes show a simplified cacheline.
Mixed State:
| x0 y0 z0 x1 | y1 z1 x2 y2 | z2 x3 y3 z3 | x4 y4 z4 |
| ----------- | ----------- | ----------- | ----------- |
Components:
| x0 y0 x1 y1 | x2 y2 x3 y3 | x4 y4 |
| ----------- | ----------- | ----------- |
| x0 z0 x1 z1 | x2 z2 x3 z3 | x4 z4 |
| ----------- | ----------- | ----------- |
Not only does processing the mixed state do worse on the instruction cache, it also fills the regular data cache worse. Only about one State struct fits in a cacheline at a time, and to get the next struct to process, another fetch must be made. This contrasts with the component approach, where each cacheline can fit two entities' worth of data (Transform + Velocity, or Transform + Mesh) at once. When the number of processed entities gets large, the worse cache performance of the mixed state approach becomes more and more evident.
The fact that you have to iterate twice doesn't really matter: you end up doing the same amount of work (you still call render() and physics() for each entity). The difference is that in the first approach the instruction cache is cold on almost every iteration, as the physics code and the rendering code keep evicting each other. In the second approach, the instruction cache is cold for the first entity and hot thereafter. This is why the second approach can outperform the first.
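The two layouts can be sketched in Python as well, though CPython's object indirection largely hides the cache effect the diagram illustrates, so treat this as a comparison of shapes rather than a benchmark:

```python
# "Mixed state" (array of structs): one object per entity
# holding everything, mesh included.
class State:
    def __init__(self, x, vx):
        self.x, self.vx = x, vx
        self.mesh = None  # dragged through the physics loop unused

aos = [State(float(i), 1.0) for i in range(1000)]
for s in aos:
    s.x += s.vx  # physics walks over every full State

# "Components" (struct of arrays): one parallel list per field.
xs = [float(i) for i in range(1000)]
vxs = [1.0] * 1000
xs = [x + vx for x, vx in zip(xs, vxs)]  # physics touches only x and vx

# both layouts compute the same positions
assert xs == [s.x for s in aos]
```

In a language with flat value types (like the C above), the second form is also what lets the prefetcher stream exactly the bytes the loop needs.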
Systems
Systems implement the behaviour seen in a game. From rendering and physics, to input handling and enemy AI, everything the player experiences is the result of a "system". The only difference between a system and a standalone "script", is that where a script acts on a single object, the system acts on an iterator of objects (be that an array or a vector). This means that the instruction cache of the CPU is kept hot while processing objects, as the CPU doesn't have to go from processing a Collider to rendering a Mesh to then deserialising some data from the network.
Whilst systems at their simplest process a single component type, in real situations it is often necessary to reference other components on the same entity. For example, when running physics, it is important to know an entity's current position (e.g. stored in a Transform component) and its velocity (e.g. stored in a Velocity component), as well as its current collider (yet another component).
One solution is to move the velocity data into the Transform. This means that it is no longer possible to have an entity with no velocity, only one with zero velocity. It also makes the Transform component larger, harming performance, and eventually leads to a single "mega-component" holding all possible data an entity could have.
Another solution is to have systems work not on individual components, but on unions of components instead. This is what the "archetypal" ECS architecture is about. By doing this, data can be kept in smaller, more cache-friendly chunks, as discussed earlier.
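With per-type stores, a union-of-components query boils down to a set intersection over the stores' keys. A sketch, with a hypothetical store layout (dict from entity id to component):

```python
# Archetype query sketch: entities matching an archetype are those
# present in every one of its component stores.
def query(*stores):
    if not stores:
        return
    matching = set(stores[0]).intersection(*stores[1:])
    for eid in matching:
        # yield the entity id alongside its components, one per store
        yield (eid, *(store[eid] for store in stores))

transforms = {1: "T1", 2: "T2", 3: "T3"}
velocities = {2: "V2", 3: "V3"}

# only entities 2 and 3 have both components
matched = sorted(query(transforms, velocities))
```

A system iterating over query(transforms, velocities) then sees exactly the entities it can meaningfully process, and nothing else.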
class RenderSystem:
    def actions(self):
        return {
            self.process_active: (Transform2D, ScreenElement),
            self.process_stale: (StaleTag,),
        }
This is the approach I took, where each System has an actions()
method which returns delegates that process entities, and the archetypes those
delegates operate on. This allows a single system to contain many processing
methods, and thus keeps related state in one place instead of strewn around many
different systems. The above example shows how one system can handle multiple archetypes.
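A scheduler consuming actions() might dispatch along these lines. The store layout and method signature are assumptions for the sketch, not the engine's real API:

```python
# Sketch of a scheduler driving systems declared via actions().
# Component stores are dicts (entity id -> component instance),
# keyed by component class.
class Transform2D: pass
class ScreenElement: pass

stores = {
    Transform2D: {1: Transform2D(), 2: Transform2D()},
    ScreenElement: {2: ScreenElement()},
}

class RenderSystem:
    def __init__(self):
        self.drawn = []

    def actions(self):
        return {self.process_active: (Transform2D, ScreenElement)}

    def process_active(self, entities):
        self.drawn.extend(entities)  # stand-in for real canvas work

def run_systems(systems):
    for system in systems:
        for method, archetype in system.actions().items():
            # entities present in every store of the archetype
            matching = set(stores[archetype[0]])
            for ctype in archetype[1:]:
                matching &= set(stores[ctype])
            method(sorted(matching))

renderer = RenderSystem()
run_systems([renderer])
```

Each frame, the scheduler walks every system's declared archetypes, gathers the matching entities, and hands them to the corresponding delegate in one batch, which is what keeps the per-system code hot.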