Direct3D11 multithreading micro-benchmark in C# with SharpDX

This blog post was originally posted on my previous blog, code4k.
Multi-threading is an important feature added to Direct3D11 two years ago, and it has been increasingly used in recent game engines in order to achieve better performance on PC. You can have a look at "DirectX 11 Rendering in Battlefield 3" from Johan Andersson/DICE, which gives great insight into how it was effectively used in practice in their game engine. Usage of the Direct3D11 multithreading API is pretty straightforward, and while we are also using it successfully at work in our R&D 3D engine, I had never taken the time to sit down with this feature and check how to get the best out of it.

I recently came across a question on the gamedev forum about "[DX11] Command Lists on a Single Threaded Renderer": If command lists are an efficient way to store replayable drawing commands, would it be efficient to use them even in a single threaded scenario where lots of drawing commands are repeatable?

In order to verify this, among other things, I wrote a simple micro-benchmark using C#/SharpDX. While the results are somewhat expected, there are a couple of gotchas that deserve a more in-depth look...


Direct3D11 Multi-threading : The basics


I assume that general multi-threading concepts and advantages are already understood, so that we can focus on the Direct3D11 multi-threading API.

There is already a nice "Introduction to Multithreading in Direct3D11" on MSDN that is worth reading if you are already a little bit familiar with the Direct3D11 API.

In Direct3D10, we had only one class, ID3D10Device, to perform object/resource creation and draw calls. The API was not thread safe, but it was possible to emulate some kind of deferred rendering by using mutexes and simplified command buffers to access the device safely.

In Direct3D11, preparation of the draw calls is now parallelizable, while object/resource creation is thread safe. The API is now split between:
  • ID3D11Device, which is responsible for creating objects/resources/shaders and device contexts.
  • ID3D11DeviceContext, which holds all commands to set up the shader pipeline and perform all draw calls (including constant buffer updates, setup of shader resource views, samplers, blend state, etc.)

When a Direct3D11 device is created, it provides a default ID3D11DeviceContext called an immediate context that is effectively used for immediate rendering. There is only one immediate context available per device.

In order to use deferred rendering, we need to create additional ID3D11DeviceContext instances called deferred contexts: one context for each thread responsible for preparing a set of draw calls.

The sequence of multithreaded draw calls is then executed as follows:
Each secondary thread is responsible for preparing draw calls into a set of ID3D11CommandList objects that will then be executed by the immediate context (in order to push them to the driver).

The simplified version of the code to write is fairly easy:

// Thread-1: record draw calls on this thread's own deferred context
context[threadId1].InputAssembler.InputLayout = layout1;
context[threadId1].InputAssembler.PrimitiveTopology = PrimitiveTopology.TriangleList;
context[threadId1].InputAssembler.SetVertexBuffers(0, new VertexBufferBinding(vertices1, Utilities.SizeOf<Vector4>() * 2, 0));
[...]
context[threadId1].Draw(...);
// Close the recording and get a command list (false: do not restore the deferred context state)
commandLists[threadId1] = context[threadId1].FinishCommandList(false);
[...]
// Thread-n: same pattern, with its own deferred context and resources
context[threadIdn].InputAssembler.InputLayout = layoutn;
context[threadIdn].InputAssembler.PrimitiveTopology = PrimitiveTopology.TriangleList;
context[threadIdn].InputAssembler.SetVertexBuffers(0, new VertexBufferBinding(verticesn, Utilities.SizeOf<Vector4>() * 2, 0));
[...]
context[threadIdn].Draw(...);
commandLists[threadIdn] = context[threadIdn].FinishCommandList(false);

// Rendering Thread: replay every recorded command list, in order
for (int i = 0; i < threadCount; i++)
{
    var commandList = commandLists[i];
    // Execute the deferred command list on the immediate context
    immediateContext.ExecuteCommandList(commandList, false);
    // A command list is a native resource: release it once it has been executed
    commandList.Dispose();
}
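The snippet above leaves out how the worker threads are started and joined. As a rough illustration of that control flow only, here is a minimal sketch that runs without Direct3D: a plain List<string> is a hypothetical stand-in for a command list, and the names DeferredSketch, Record and RenderFrame are made up for this example (they are not part of SharpDX or MultiCube).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class DeferredSketch
{
    // Stand-in for recording on a deferred context: returns the "command list"
    // (here just an ordered list of strings) prepared by one worker thread.
    public static List<string> Record(int threadId, int drawCount)
    {
        var commands = new List<string>();
        for (int i = 0; i < drawCount; i++)
        {
            commands.Add($"thread {threadId}: draw {i}");
        }
        return commands;
    }

    // Records one list per worker in parallel, then "executes" them
    // sequentially on the calling (rendering) thread, in thread order.
    public static List<string> RenderFrame(int threadCount, int drawsPerThread)
    {
        var commandLists = new List<string>[threadCount];
        var tasks = new Task[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            int threadId = t; // copy so that each lambda captures its own value
            tasks[t] = Task.Run(() =>
            {
                commandLists[threadId] = Record(threadId, drawsPerThread);
            });
        }

        // All recording must be finished before the lists are replayed.
        Task.WaitAll(tasks);

        var submitted = new List<string>();
        for (int t = 0; t < threadCount; t++)
        {
            submitted.AddRange(commandLists[t]);
        }
        return submitted;
    }

    public static void Main()
    {
        var frame = RenderFrame(4, 3);
        Console.WriteLine(frame.Count); // 4 threads x 3 draws = 12
    }
}
```

Each worker writes only to its own slot of the array, so no lock is needed; the only synchronization point is the wait before the rendering thread replays the lists.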

The API provides several key advantages:
  • We can easily switch the code between immediate context and deferred context. Thus using the multi-threading part of the Direct3D11 API doesn't hurt our code.
  • The API is supported on downlevel hardware (from Direct3D11 down to Direct3D9)
  • The underlying driver can take advantage of the FinishCommandList call to build some native layout that will help the deferred ExecuteCommandList command run faster.
As for this native driver support, it can be checked with CheckFeatureSupport (or directly in SharpDX with CheckThreadingSupport), but it seems that almost only NVIDIA supports this feature natively (and only quite recently, around this year). Neither my previous ATI 6850 nor my current 6900M supports it. Is this bad? We will see that the default Direct3D11 runtime performs just fine without it, but doesn't provide any extra boost.

We will also see that there is an interesting issue with the use of Map/Unmap versus UpdateSubresource to update constant buffers: depending on which one is used, a multithreading scenario could hurt performance.

MultiCube, a Direct3D11 Multi-threading micro-benchmark


In order to stress-test multi-threading using Direct3D11, I have developed a simple application called MultiCube (available as part of SharpDX samples: See Program.cs)


This application performs the following benchmark: it renders n x n cubes on the screen, each with its own rotation matrix. You can modify the number of cubes from 1 (1x1) to 65536 (256x256). The title bar includes some benchmark measurements (FPS / time per frame), and you can change the behavior of the application with the following keys:
  • F1: Switch between Immediate Test (no threading), Deferred Test (Threading), and Frozen-Deferred Test (execute a pre-prepared CommandList on the ImmediateContext)
  • F2: Switch between Map/Unmap mode and UpdateSubresource mode to update constant buffers.
  • F3: Burn the CPU on/off. This is where multithreading makes the difference, and we are going to analyse the results a little bit more. When this option is on, it simulates lots of CPU calculation on the deferred threads. When it is off, the threads just batch the draw calls (which are simple, they are just cubes!)
  • Left-Right arrows: Decrease/Increase the number of cubes to display (default 64x64)
  • Down-Up arrows: Decrease/Increase the number of threads used (only for Deferred Test mode)
When the deferred mode is selected, each thread renders a set of rows in a batch. If you have, for example, 100x100 cubes to render and 5 threads, each thread will draw 20x100 cubes.
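The row split described above can be sketched as follows. This is a minimal illustration, not the actual MultiCube code: the RowPartition class is hypothetical, and the handling of row counts that do not divide evenly (the first threads take one extra row) is an assumption, since the post only describes the even case.

```csharp
using System;

public static class RowPartition
{
    // Returns the first row and the number of rows assigned to one thread.
    // Assumption: when totalRows is not a multiple of threadCount, the first
    // (totalRows % threadCount) threads each take one extra row.
    public static (int FirstRow, int RowCount) Range(int totalRows, int threadCount, int threadIndex)
    {
        int baseRows = totalRows / threadCount;
        int extra = totalRows % threadCount;
        int rowCount = baseRows + (threadIndex < extra ? 1 : 0);
        int firstRow = threadIndex * baseRows + Math.Min(threadIndex, extra);
        return (firstRow, rowCount);
    }

    public static void Main()
    {
        const int size = 100;   // 100x100 cubes
        const int threads = 5;
        for (int t = 0; t < threads; t++)
        {
            var (first, count) = Range(size, threads, t);
            Console.WriteLine($"thread {t}: rows {first}..{first + count - 1} ({count * size} cubes)");
        }
    }
}
```

With 100 rows and 5 threads, every thread gets 20 rows, that is 20x100 = 2000 cubes, which matches the example in the text.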

If your graphics driver doesn't natively support multithreading, you will see a "*" just after the Deferred mode name.

You can download the application here. It is a single exe that doesn't need any kind of install (apart from the DirectX June 2010 runtime). Also, being able to pack this application into a single exe is a unique feature of SharpDX: static linking of a .NET exe with the SharpDX DLLs.


Results


I ran 2 types of tests:
  1. Draw 65536 cubes with the Burn-Cpu option ON and OFF, comparing Immediate and Deferred rendering (ranging from 1 thread to 6 threads).
  2. Draw 1024 cubes switching between Map/Unmap and UpdateSubresource, comparing the results between Immediate and Deferred rendering.
Two machines with the same Intel i7-2600K processor and 8 GB of RAM were used, one with an NVIDIA GTX 570 and the other with an ATI 6900M graphics card.


65536 draw calls - BurnCpu: On

                              Threads
GPU             Type          1      2      3      4      6
NVIDIA GTX 570  Deferred      232ms  130ms  98ms   92ms   82ms
NVIDIA GTX 570  Immediate     220ms  220ms  220ms  220ms  220ms
ATI 6900M       Deferred      231ms  131ms  98ms   93ms   84ms
ATI 6900M       Immediate     228ms  228ms  228ms  228ms  228ms

Fig2. 65536 draw calls with CPU intensive threads, comparison between Immediate and Deferred rendering


65536 draw calls - BurnCpu: Off

                              Threads
GPU             Type          1      2      3      4      6
NVIDIA GTX 570  Deferred      31ms   24ms   21ms   20ms   20ms
NVIDIA GTX 570  Immediate     19ms   19ms   19ms   19ms   19ms
ATI 6900M       Deferred      32ms   28ms   28ms   28ms   28ms
ATI 6900M       Immediate     28ms   28ms   28ms   28ms   28ms

Fig2. 65536 draw calls with CPU-light threads, comparison between Immediate and Deferred rendering

And finally the Map/Unmap and UpdateSubresource test:

1024 draw calls - Map/Unmap versus UpdateSubresource

GPU             Type          Map     Update
NVIDIA GTX 570  Immediate     0.6ms   1.1ms
NVIDIA GTX 570  Deferred      0.92ms  7.32ms
ATI 6900M       Immediate     0.6ms   0.6ms
ATI 6900M       Deferred      0.6ms   0.6ms


Analysis


If we examine the results a little more carefully, there are a couple of interesting things to highlight:

  • Using multithreading and deferred context rendering is only relevant when the CPU is actually kept busy on each thread (that sounds obvious, but at least it is clear!). When the threads do little CPU work, Immediate rendering is in fact faster!
  • Multithreaded rendering in a CPU-intensive application can be several times faster than a single-threaded one: in the tests above, about 2.7x faster with 6 threads (220 ms down to 82 ms on the GTX 570), provided that we have enough CPU cores to dispatch the rendering jobs.
  • Native driver support for Direct3D11 multithreading doesn't seem to change much: the NVIDIA card, which supports it, shows no huge difference from the AMD card, which doesn't.
  • UpdateSubresource on an NVIDIA card is about 8x slower than Map/Unmap in a multithreading situation (7.32 ms versus 0.92 ms) and hurts the performance of the application a lot: use Map/Unmap instead!
Of course, as usual, this is a synthetic micro-benchmark that should be taken with caution and cannot reflect every use case, so you need to run your own benchmark if you have to decide whether to use multithreaded rendering!
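The speedup figures can be recomputed directly from the frame times in the tables above. Here is a small sketch using the GTX 570 numbers with BurnCpu on; the Speedup class and its method names are made up for this example, and "efficiency" is simply the speedup divided by the number of threads.

```csharp
using System;
using System.Globalization;

public static class Speedup
{
    // How many times faster the measured frame time is than the baseline.
    public static double Factor(double baselineMs, double measuredMs)
    {
        return baselineMs / measuredMs;
    }

    // Fraction of the ideal linear speedup actually reached with `threads` threads.
    public static double Efficiency(double baselineMs, double measuredMs, int threads)
    {
        return Factor(baselineMs, measuredMs) / threads;
    }

    public static void Main()
    {
        // NVIDIA GTX 570, 65536 draw calls, BurnCpu: On (values from the table).
        const double immediateMs = 220;
        int[] threads = { 1, 2, 3, 4, 6 };
        double[] deferredMs = { 232, 130, 98, 92, 82 };

        for (int i = 0; i < threads.Length; i++)
        {
            double factor = Factor(immediateMs, deferredMs[i]);
            double efficiency = Efficiency(immediateMs, deferredMs[i], threads[i]);
            Console.WriteLine(string.Format(CultureInfo.InvariantCulture,
                "{0} thread(s): {1:F2}x speedup, {2:P0} efficiency",
                threads[i], factor, efficiency));
        }
    }
}
```

With a single deferred thread the factor is below 1 (232 ms versus 220 ms), which is the overhead of the deferred path itself, and the gain per added thread shrinks as the thread count grows.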

Finally, to answer the original gamedev question, I provided a "Frozen Deferred" test in MultiCube to check whether replaying a pre-prepared CommandList is actually faster than issuing the same commands directly on the immediate context. It currently doesn't seem to make any difference (but to be sure of this, I would have to run this benchmark on several different machine/CPU/graphics card/driver configurations).
