Written by Aidan Hobson-Sayers

How Do You Test A 10,000 Player Deathmatch Game?

It's one thing to build and test a multiplayer game; it's an order of magnitude more challenging to build and test a game capable of hosting 10,000 players in a single battle. Testing a game of this magnitude is precisely what we've had to do, and it's taken some creative thinking.

Why Load Test in the First Place?

Testing is common sense for any developer, but when you're trying to achieve something that has never been done before, it is even more critical that the system is battle-tested.

Aether Engine is a distributed simulation engine that borrows many techniques from high-performance computing, so it's crucial for us to learn how the system behaves and performs under significant load. Specifically, we're looking for the following types of information:

  • Peak user load before the simulation begins to slow down;
  • The maximum number of users the simulation can withstand before it becomes unusable;
  • The impact that performance has on gameplay and vice versa;
  • Emergent behaviour in the load-balancing logic;
  • Performance improvement strategies on the server, across the network and in the client.
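The first two items in the list above can be read straight off a set of load measurements. The following is a minimal sketch, not our actual tooling: the `LoadSample` type, both tick-time budgets and every number in `samples` are made up purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LoadSample:
    players: int    # concurrent players during the measurement
    tick_ms: float  # mean simulation tick duration, in milliseconds


def highest_load_within(samples: Iterable[LoadSample], budget_ms: float) -> Optional[int]:
    """Return the largest player count for which it, and every smaller
    measured load, stayed within budget_ms. None if even the smallest failed."""
    best = None
    for sample in sorted(samples, key=lambda s: s.players):
        if sample.tick_ms > budget_ms:
            break
        best = sample.players
    return best


# Illustrative numbers only, NOT real Aether Engine measurements.
samples = [
    LoadSample(2_000, 12.0),
    LoadSample(4_000, 15.5),
    LoadSample(6_000, 19.0),
    LoadSample(8_000, 34.0),
    LoadSample(10_000, 70.0),
]

SLOWDOWN_BUDGET_MS = 33.3  # hypothetical: tick time above this counts as "slowing down"
UNUSABLE_BUDGET_MS = 100.0  # hypothetical: tick time above this counts as "unusable"

peak_before_slowdown = highest_load_within(samples, SLOWDOWN_BUDGET_MS)
max_before_unusable = highest_load_within(samples, UNUSABLE_BUDGET_MS)
print(peak_before_slowdown, max_before_unusable)
```

Using two separate budgets keeps "starting to slow down" and "unusable" as distinct thresholds, which matches how the two questions are posed.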

Developing tests for the above is straightforward for the majority of transaction-based applications, but running them against a single persistent world where 10,000 players can interact with each other poses some unique environmental considerations that must be factored in.

The Cloud Doesn't Necessarily Represent Real Life

As you may have seen this week, we have partnered with Microsoft and are using the performance and scale of their Azure Cloud to provide the backend compute for our 10,000 player deathmatch.

We're extremely excited to have the backing of such a battle-tested ecosystem, but it's easy to get a false sense of confidence in the player experience if we don't carefully consider what the infrastructure looks like between the stress test and the players.

Taking A Leaf Out Of Netflix's Handbook and Introducing Some Chaos

Netflix is probably the best-known advocate of wanting to know bad news early, having developed Chaos Monkey to deliberately disrupt its live systems and see how they respond.

While we didn't use Netflix's suite of tools, we are advocates of the bad-news-early mindset. Taking this approach, we considered multiple options for achieving both player scale and chaos. Options included:

  • Player bots running on dedicated Azure servers;
  • Player bots running on different Azure regions;
  • Player bots running on local machines.

Then we asked ourselves: what if we created player bots as AWS Lambda functions and spread them across multiple regions?

It would give us a reasonable degree of network uncertainty (traffic crossing the public internet rather than staying inside a datacenter) and geographic diversity, and it would remove the safety and confidence that we have in our own Azure setup.

We decided that this would be the best option for us, as we could keep scaling up the number of bots and put Aether Engine through its paces (we ran a test with over 50,000 players, but that's a story for another day).

The Player Bot Setup

We decided to use 7 AWS Lambda regions (3 x EU, 2 x US and 2 x Asia), as this would give us the geographical and bandwidth diversity we've observed in our player sign-ups.

From here we invoked 3 Player Bot Lambdas per region, each capable of creating 500 players (7 regions x 3 Lambda invocations per region x 500 players per Lambda invocation = 10,500 player bots).
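The arithmetic above can be laid out as an invocation plan. This is only a sketch under stated assumptions: the region names are placeholders rather than the regions we actually used, and `Invocation` and `build_plan` are hypothetical helpers. The plan only describes what would be launched; it makes no cloud API calls.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Placeholder region labels (3 x EU, 2 x US, 2 x Asia); not real region names.
REGIONS = ["eu-a", "eu-b", "eu-c", "us-a", "us-b", "asia-a", "asia-b"]
INVOCATIONS_PER_REGION = 3
PLAYERS_PER_INVOCATION = 500


@dataclass(frozen=True)
class Invocation:
    region: str           # region the function would be invoked in
    index: int            # invocation number within that region
    first_player_id: int  # first bot ID assigned to this invocation
    player_count: int     # number of bots this invocation creates


def build_plan(regions: Sequence[str], per_region: int, players_each: int) -> List[Invocation]:
    """Give every invocation a contiguous, non-overlapping block of bot IDs."""
    plan = []
    next_id = 0
    for region in regions:
        for index in range(per_region):
            plan.append(Invocation(region, index, next_id, players_each))
            next_id += players_each
    return plan


plan = build_plan(REGIONS, INVOCATIONS_PER_REGION, PLAYERS_PER_INVOCATION)
total_bots = sum(inv.player_count for inv in plan)
print(len(plan), total_bots)  # 21 10500
```

Handing each invocation its own ID range is one simple way to keep bot identities unique without any coordination between regions.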

At this point, we'd created a relatively simplistic model of a player bot and found a magic number to get us over 10,000 players, with some randomness in both geographic deployment and network conditions.

Next, we had to work out the Azure setup needed to deliver an excellent experience to 10,000 players.

The Cloud Setup

When you're doing significant computation, you need big boxes, so we opted to use the Standard F64s_v2 VM size (64 vCPUs, 128 GB memory) on Azure.

We provisioned the boxes as follows:

  • 7 x Multiplexer boxes - These boxes take the single state of the simulation and send it to many clients while implementing net relevancy. That means reducing bandwidth by not sending parts of the simulation that the client doesn't care about or can't see, and by sending less frequent updates for things that are very far away.
  • 2 x Simulation boxes - These boxes distribute the spatial simulation by applying Aether Engine's distributed octree to merge and reclaim CPU cores depending on the computational complexity of the simulation.
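To make the net relevancy idea concrete, here is a minimal distance-based sketch. It is not Aether Engine's implementation: the `Entity` type, the two radii and the update interval are all assumptions chosen for illustration, and a real system would likely use more than plain distance.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Entity:
    entity_id: int
    position: Tuple[float, float, float]


NEAR_RADIUS = 500.0      # hypothetical: within this range, send every tick
FAR_RADIUS = 2_000.0     # hypothetical: beyond this range, never send
FAR_UPDATE_INTERVAL = 4  # hypothetical: far entities are sent every 4th tick


def relevant_entity_ids(
    viewer_position: Tuple[float, float, float],
    entities: Iterable[Entity],
    tick: int,
) -> List[int]:
    """Return the IDs of entities whose state should go to this client on this tick."""
    to_send = []
    for entity in entities:
        distance = math.dist(viewer_position, entity.position)
        if distance <= NEAR_RADIUS:
            to_send.append(entity.entity_id)  # close by: always relevant
        elif distance <= FAR_RADIUS and tick % FAR_UPDATE_INTERVAL == 0:
            to_send.append(entity.entity_id)  # far away: reduced update rate
        # beyond FAR_RADIUS: not relevant, send nothing
    return to_send


world = [
    Entity(1, (100.0, 0.0, 0.0)),    # near
    Entity(2, (1_500.0, 0.0, 0.0)),  # far
    Entity(3, (5_000.0, 0.0, 0.0)),  # out of range
]
viewer = (0.0, 0.0, 0.0)
print(relevant_entity_ids(viewer, world, tick=0))  # [1, 2]
print(relevant_entity_ids(viewer, world, tick=1))  # [1]
```

The per-client filter is what lets one simulation state fan out to many clients without every client receiving every update.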

The First Test

As we explained earlier in this post, our Player Bot behaviour was fairly simplistic in the early days. To begin with, we were purely interested in concurrent connection stability. In the video below you can see one of our engineers flying around the spatial simulation alongside these 10,000 player bots maintaining active connections from multi-region Lambdas. You can see there's some work needed on the torpedoes!
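A connection-stability check of this kind can be sketched at toy scale. The example below is an assumption-laden stand-in, not our bot code: it runs a local echo server in place of the game backend, uses a made-up line-based heartbeat, and opens only 50 connections on one machine.

```python
import asyncio

BOT_COUNT = 50  # toy scale; the real test used 10,000 bots across regions
HEARTBEATS = 3  # heartbeats each bot exchanges before disconnecting


async def handle_client(reader, writer):
    """Stand-in server: echo each heartbeat line back until the client disconnects."""
    try:
        while data := await reader.readline():
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def run_bot(host, port, bot_id, heartbeats):
    """Open one connection, exchange heartbeats, report whether it stayed healthy."""
    try:
        reader, writer = await asyncio.open_connection(host, port)
        for n in range(heartbeats):
            message = f"bot {bot_id} beat {n}\n".encode()
            writer.write(message)
            await writer.drain()
            reply = await asyncio.wait_for(reader.readline(), timeout=2.0)
            if reply != message:
                return False  # dropped or corrupted heartbeat
        writer.close()
        await writer.wait_closed()
        return True
    except (OSError, asyncio.TimeoutError):
        return False


async def run_test(bot_count, heartbeats):
    """Start the local server, run all bots concurrently, count the healthy ones."""
    server = await asyncio.start_server(handle_client, "127.0.0.1", 0)
    host, port = server.sockets[0].getsockname()[:2]
    async with server:
        results = await asyncio.gather(
            *(run_bot(host, port, i, heartbeats) for i in range(bot_count))
        )
    return sum(results)


stable = asyncio.run(run_test(BOT_COUNT, HEARTBEATS))
print(f"{stable}/{BOT_COUNT} bots kept a healthy connection")
```

Counting healthy connections rather than failing on the first error gives a stability figure that can be compared from one run to the next.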




Since these tests, we've done considerable optimisation work and added a lot more realism to Player Bot behaviours.

In the video below you can see the live simulation from Aether Engine as the multi-region Player Bots display more lifelike activity, including a hub of battle activity in the centre.


[Video: Realistic Agent Behaviour]


What's Next

There are obvious limits to Player Bots, so we're working our way through live playtests with real players. Our first 1,000 player test was last weekend, and we're looking to make the next one bigger. It's coming soon!

We're looking forward to seeing how Aether Engine handles things, and what we can learn to improve the system further.

If you've not already registered for our 10,000 player deathmatch - Aether Wars - we'd love for you to be a part of it and help us to stretch Aether Engine to its full potential.

Sign up to play here.