Building massively multiplayer online games has a problem: the internet. Specifically, the internet connection closest to the user, the “last mile”. These links have much lower bandwidth than the links nearer the hubs of the internet, the datacenters. This is a problem because it means you can’t send 1MB of data over the last mile nearly as quickly as you can between two datacenters. The other half of the problem is that latency becomes more unpredictable the further a link is from the datacenter-to-datacenter backbone.
Why is this a problem for games?
Games have a world, simulated on a server somewhere in a datacenter, that the players need to see and interact with. Game developers are pushing the boundaries, making game worlds that are more and more complex and immersive. Games built with Aether Engine are even larger scale and higher fidelity than any seen before. Getting that (potentially huge) world data to the players is hard. At Hadean, we’ve found the state that changes each tick or update can be gigabytes of data for some games. In our 10k player test at GDC this year, we had 23 MBps of state.
There are two sides to this problem. The first is the players’ limited bandwidth, together with the expense of sending data out of datacenters to players over the internet. The second is the cost of bandwidth between datacenters.
Although the links inside and between datacenters might be able to handle this amount of data (and even if they can, your cloud bills can get large!), the last mile link to the player definitely can’t. This has been a limiting factor on the scale of games. Around 2001, an advanced game was able to push 1500 archers’ worth of world state over a 28.8kbps link. Since then, the techniques have not changed much: compression, interpolation, sharding, and limiting the gameplay (world size, world complexity, visibility radius).
One technique that has recently become viable is something we call “Interest Management” at Hadean (we’ve also seen “Net Relevancy” and “LODing” in use at game studios). The key idea is that every player is likely to have a unique viewpoint in the world, so the state that is visible or relevant to them will likely differ from the state that is relevant to other players. This becomes even more true as worlds become larger and more complex.
To take advantage of this, we can do per-player computation to prioritise the most relevant data to send to each player. An important example is distance-based net relevancy: the importance of an object falls off with the square of its distance from the player. This means things that are twice as far away may be sent a quarter as often. In practice, at Hadean, we found this gave a 97.5% reduction in bandwidth, with a constant gameplay experience or at least minimal noticeable degradation. The exact benefit depends on the game, and the exact implementation may differ: a 1/x^2 priority function may be less suitable for land-based games. This is why we decided to make our net relevancy implementation easy for users to customise. A core belief at Hadean is that we should provide great defaults while remaining completely customisable.
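As an illustration of the distance-based idea above, here is a minimal Python sketch. It is not Hadean's implementation: the names `Entity`, `inverse_square_priority` and `select_updates`, the clamping distance, and the per-tick "budget" of entities are all assumptions made for the example. Each tick, every entity accumulates priority according to its distance from the player, and only the highest-ranked entities are sent, so distant entities are updated less often without ever being starved.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Entity:
    """Hypothetical world object with a 2D position."""
    entity_id: int
    x: float
    y: float


def inverse_square_priority(distance: float, min_distance: float = 1.0) -> float:
    """Assumed default: priority falls off as 1/d^2.

    The distance is clamped so entities right next to the player
    do not receive an unbounded priority.
    """
    d = max(distance, min_distance)
    return 1.0 / (d * d)


def select_updates(
    player_pos: Tuple[float, float],
    entities: List[Entity],
    accumulated: Dict[int, float],
    budget: int,
    priority_fn: Callable[[float], float] = inverse_square_priority,
) -> List[int]:
    """Pick which entities to send to one player this tick.

    `accumulated` holds the priority each entity has built up since it
    was last sent; it is updated in place. `budget` is the (assumed)
    number of entity updates that fit in this player's bandwidth.
    """
    px, py = player_pos
    for e in entities:
        distance = math.hypot(e.x - px, e.y - py)
        accumulated[e.entity_id] = (
            accumulated.get(e.entity_id, 0.0) + priority_fn(distance)
        )

    # Highest accumulated priority first; ties keep their input order.
    ranked = sorted(entities, key=lambda e: accumulated[e.entity_id], reverse=True)
    chosen = ranked[:budget]

    # Entities that were sent start accumulating again from zero.
    for e in chosen:
        accumulated[e.entity_id] = 0.0
    return [e.entity_id for e in chosen]
```

Passing a different `priority_fn` is how this sketch models customisation: a land-based game could, for example, supply a function that falls off more slowly with distance or that takes line of sight into account.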
There are other approaches to solving this problem of distributing game state to players. Examples are Google’s Stadia and Microsoft’s xCloud, which do all rendering in the cloud and stream video to the user. Another approach is to “chop” the rendering pipeline closer to the source, streaming triangles to the player’s hardware and using that hardware to render those triangles to the screen.
The different choices have their tradeoffs (our interest management implements great defaults with complete customisability to permit a deliberate choice of the tradeoffs), but they only solve one half of the problem of massive state distribution: the datacenter to player bandwidth. Datacenter to datacenter bandwidth remains unsolved. Fortunately, HadeanOS, Aether Engine and our Replication Layer complement these approaches, providing the other half of the puzzle and allowing enormous scale single-shard worldwide games to be simulated and rendered efficiently.
Azure has been an incredible partner in the development of our networking infrastructure. Their high bandwidth, low latency inter-datacenter connections have been critical. Azure also have datacenters in more global regions than any other provider, with backbone links between them. The edge datacenters are located much closer to players than the central simulation datacenter. These geographically spread datacenters helped us significantly lower our latency to players, and provided ample compute resources for the Net Relevancy computation.
All of our players are playing in a single, seamless world, so it is critical to get the lowest possible latency. On the advice of one of Microsoft’s Azure Global black belts, we also route players’ incoming traffic (to the simulation) directly to the edge and then to the central datacenter over the Azure backbone. This means that the players’ traffic comes onto Azure’s network as early as possible. This makes the player to simulation latency both lower and much “smoother” (less jitter, more constant) than over the non-backbone internet.
To solve the second half of the problem we have to think about the graph of computation going on. This net relevancy calculation doesn’t need to happen in the central datacenter; it can just as well happen in the edge datacenters (possibly even better, because they will have more recent information about the players). We can do the computation on the edge servers close to the player, but we still need to get the X MBps of state to the edge servers and datacenters. To do this we need a low latency, bandwidth-efficient broadcast mechanism.
The trick here is to recognise that all (or most) of the input to the net relevancy computation is the same across most instances: “the whole world state”. This means we only have to send one copy of the state to each edge datacenter, and even inside the datacenter, we can “fan out” and broadcast from one receiving server to many processing servers. At yet another level, on each server we can broadcast from one receiving core to 64 processing cores. This topology looks like a tree. The root is the central simulation datacenter. The next level of nodes is the edge datacenters, one for each geographic region. The level after that is the servers inside the edge datacenters, and the last level is the cores inside each server.
This broadcast tree topology gives us an enormous amount of computation at the edge, more than enough to do the net relevancy computation for thousands of players, and it minimises the bandwidth needed at a minimal latency cost.
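To make the bandwidth saving concrete, the following Python sketch compares how much state has to leave the central datacenter with and without the broadcast tree. It is only back-of-the-envelope arithmetic under stated assumptions: the `BroadcastTree` class and its methods are hypothetical, the number of edge datacenters and servers is invented for the example, and the state rate simply reuses the 23 MBps figure mentioned earlier. The 64 cores per server comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BroadcastTree:
    """Hypothetical shape of the fan-out tree described in the text."""
    edge_datacenters: int        # one per geographic region
    servers_per_datacenter: int  # processing servers in each edge datacenter
    cores_per_server: int        # processing cores on each server

    @property
    def leaf_cores(self) -> int:
        """Total number of cores running net relevancy at the edge."""
        return (
            self.edge_datacenters
            * self.servers_per_datacenter
            * self.cores_per_server
        )

    def central_egress_tree(self, state_rate: float) -> float:
        """With the tree, the root sends one copy per edge datacenter."""
        return state_rate * self.edge_datacenters

    def central_egress_naive(self, state_rate: float) -> float:
        """Without the tree, the root would send one copy per leaf core."""
        return state_rate * self.leaf_cores


# Assumed example: 4 regions, 10 servers each, 64 cores per server.
tree = BroadcastTree(edge_datacenters=4, servers_per_datacenter=10, cores_per_server=64)
state_rate_mbps = 23.0  # MBps of world state, reusing the figure from the text

tree_rate = tree.central_egress_tree(state_rate_mbps)
naive_rate = tree.central_egress_naive(state_rate_mbps)
print(f"leaf cores: {tree.leaf_cores}")
print(f"central egress with tree: {tree_rate} MBps")
print(f"central egress without tree: {naive_rate} MBps")
```

In this sketch the expensive inter-datacenter traffic grows only with the number of regions, while the remaining copies are made over cheaper links inside each datacenter and inside each server.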
Our foundational technology, HadeanOS, makes this kind of global computation network easy to build, deploy and maintain. This is why we built Aether Engine on top of it, allowing games companies to take advantage of this power.
Stay tuned for how we continue to stress our technology whilst exploring what deeper and more complex gameplay experiences feel like on a huge scale. To be part of this journey, sign up to participate in EVE Aether Wars: Phase 2 on 18th August at 13:00 EDT / 17:00 UTC.
Header image ©️ Derek Owens