Micro optimizations – Episode 1

Since I joined the team working on Dreadnought, I have been working almost exclusively on performance optimizations. For one target client platform, our game is CPU-bound, specifically the game thread. To give you some background, we use Unreal Engine 4 which has 1 game thread, 1 render thread, plus some helper task graph threads (based on your settings and core-count).

Usually it takes days – and perhaps even weeks – to fix one of the worst offenders, but sometimes the problems / solutions are simple, and at the same time have noticeable effects.

Such micro-optimizations are always welcomed and refreshing. The BEST optimizations are fixing already existing “optimizations”, which are problematic and sometimes might be doing more harm than good.

I will be adding multiple, bite-sized blog entries on those micro-optimizations. Today’s example is:

The Unaware SceneComponent

The Problem

When looking where is our frame time spent, it turned out that one of the worst offenders in the game is the calls to UpdateComponentToWorld, due to the massive number of moving actor components in the game. The problem here is that the calls to this function happen from multiple places in different points in time during the frame, which obviously already sub-optimal. We do not need to actually update all moving components, so an old optimization was added to skip the function execution – read: early return – on either server or client, via two bools (bIgnoreUpdateComponentServer and bIgnoreUpdateComponentClient), easy way for some components to ignore those expensive, unneeded updates on client / server based on their requirements. Right?

Not always! Turns out that 10% of the cost of UpdateComponentToWorld was spent doing these checks, totaling to 1 ms per frame on average. How could this happen?! The original code looked like the following:

USceneComponent::UpdateComponentToWorld()
{
  if (bIgnoreUpdateComponentServer && GetWorld() &&
    GetWorld()->GetNetMode() == ENetMode::NM_DedicatedServer)
  {
    return;
  }

  if (bIgnoreUpdateComponentClient&& GetWorld() &&
    (GetWorld()->GetNetMode() == ENetMode::NM_Client
    || GetWorld()->GetNetMode() == ENetMode::NM_Standalone))
  {
    return;
  }
}

This is how my eyes felt like ccc (yes I am inspired by Mike Acton here, and in many other places as well).

What is wrong with this code? The main thing is that it ignores basic and simple facts:

  1. The component’s owning world, and its net mode, NEVER change during a component life time, these are set when a new map is loaded, BEFORE the components are even created, and never change until a new map is loaded (which implies destroying and re-creating all components).
  2. We know which platform are we compiling for (i.e.: server vs. client), as there exists compile defines in UE4 already to tell you just that.

But it helped save some performance, didn’t it? Sure, but doing unneeded work is just flat out wrong, multiply by the number of components in the game (several thousands on average) and we have a noticeable problem. How come it is a problem, isn’t it saving more expensive work down the road? Below is the disassembly for just the two checks in the original code:

// Server check
cmp byte ptr [rbx+1CAh],0
je 000000001AC63CE2h
mov rax,qword ptr [rbx]
mov rdi,rbx
call qword ptr [rax+110h]
test rax,rax
je 000000001AC61CE2h
mov rax,qword ptr [rbx]
mov rdi,rbx
call qword ptr [rax+110h]
mov rdi,rax
call UWorld::GetNetMode() (000000001B302B50h)
cmp eax,1
je 000000001AC1721Bh

// Client check
cmp byte ptr [rbx+1CBh],0
je 000000001AC66D35h
mov rax,qword ptr [rbx]
mov rdi,rbx
call qword ptr [rax+110h]
test rax,rax
je 000000001AC61D34h
mov rax,qword ptr [rbx]
mov rdi,rbx
call qword ptr [rax+110h]
mov rdi,rax
call UWorld::GetNetMode() (000000001B302B40h)
cmp eax,3
je 000000001AC6731Bh
mov rax,qword ptr [rbx]
mov rdi,rbx
call qword ptr [rax+110h]
mov rdi,rax
call UWorld::GetNetMode() (000000001B305B50h)
test eax,eax
je 000000001AC6725Bh

Lots of instructions, multiple un-inlined calls, which contain even more instructions, those in turn contain branches, they also chase pointers down the road – causing cache misses – etc. etc…

One possible solution

One simple solution, which was what I ended up doing due to its simplicity and the time constraints then, is to do the following: first, I added this code to the scene component’s BeginPlay:

USceneComponent::BeginPlay()
{
  if (bIgnoreUpdateComponentServer)
  {
    bIgnoreUpdateComponentServer = GetWorld() &&
    GetWorld()->GetNetMode() == ENetMode::NM_DedicatedServer;
  }

  if (bIgnoreUpdateComponentClient)
  {
    bIgnoreUpdateComponentClient = GetWorld() &&
    (GetWorld()->GetNetMode() == ENetMode::NM_Client
    || GetWorld()->GetNetMode() == ENetMode::NM_Standalone);
  }
}

Then changed the checking code in UpdateComponentToWorld to the following:

USceneComponent::UpdateComponentToWorld()
{
#if UE_SERVER
  if (bIgnoreUpdateComponentServer)
  {
    return;
  }
#else
  if (bIgnoreUpdateComponentClient)
  {
    return;
  }
#endif // UE_SERVER
}

These are the generated instructions for the client platform, compared to what we had before:

// Client check
cmp byte ptr [r12+1CBh],0
jne 00000000510831FAh

Simple, 1ms gain, minimal engine code changes. Not bad.

More thoughts

I deliberately named the solution above “a possible” and not “the” solution, because there are definitely other solutions, many of them would perform even better. But in regards to the problem constraints (possible gains, time and complexity costs, run-time requirements) this solution was good enough for the problem, for now.

One possible – and way better – solution would be to use a list of all components to update, and just those, instead of checking for each one of them whether is needs an update. But since calls to UpdateComponentToWorld are coming from all over the place, it would have needed much more changes than the simple solution above.

Additionally, the biggest problem with this setup is that there is no utilization of multiple execution cores at all, unfortunately this is how this system is designed in UE4 currently, again changing that is a much more complex change.

That’s it for today, until next time!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s