As a distributed systems company, Hadean’s technology (HadeanOS, Aether Engine) needs to be tested at enormous scale. This can require multiple cloud providers for testing scalability, analysing performance and reliability, and investigating system regression for any new feature.
During our 10,000-ship EVE: Aether Wars tech demo, we noticed that running all our tests (a dozen per day) on cloud platforms can get very expensive, in both time and money. To test our large-scale workloads, most of our environments used big, expensive resources, and we often ended up with several environments running at once, which added up to a substantial cost.
So how can we work around this and offer a lightweight way to explore multiple scenarios without deploying the test workload in an expensive cloud environment?
One important entry point to understand distributed systems is the network traffic. A major part of our testing scenarios focused on bandwidth analysis and traffic management.
The idea behind our network simulator is to give developers the ability to analyse these communication aspects without running a real test workload at scale in expensive environments.
The network simulator can be used either for in-depth analysis of performance issues in our current system or for comparing new solutions before committing to major optimisations and feature implementations.
The Hadean Network Simulator (HNS) is a platform allowing a high-level representation of communication between the main Hadean components. It considers numerous aspects of our system.
There are different levels of abstraction that can be adopted when creating a simulator. Based on our current usage of the HNS, we have decided to build a high-level representation of our communication system including the following components (Figure 1):
Each of these components is represented as a black box, sending and receiving configurable traffic through configurable channels.
Figure 1: An overview of the Hadean Network Simulator topology
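To make the black-box idea concrete, here is a minimal sketch in plain Python rather than NS-3. The class names (BlackBox, Channel), their fields and the numbers are all hypothetical and chosen for illustration; they are not the simulator's actual data model. The sketch only shows how a component reduced to "the traffic it emits" can be paired with a channel reduced to "bandwidth plus latency".

```python
from dataclasses import dataclass


@dataclass
class Channel:
    """Hypothetical link model: fixed capacity and one-way delay."""
    bandwidth_bps: float  # link capacity in bits per second
    latency_s: float      # one-way propagation delay in seconds

    def transfer_time(self, payload_bytes: int) -> float:
        """Seconds to deliver one message: serialisation time plus latency."""
        return payload_bytes * 8 / self.bandwidth_bps + self.latency_s


@dataclass
class BlackBox:
    """Hypothetical component model: only its outgoing traffic is described."""
    name: str
    message_bytes: int     # size of each message sent
    messages_per_s: float  # send rate

    def offered_load_bps(self) -> float:
        """Average traffic this component pushes into its channel."""
        return self.message_bytes * 8 * self.messages_per_s


def utilisation(source: BlackBox, channel: Channel) -> float:
    """Fraction of the channel capacity consumed by the source."""
    return source.offered_load_bps() / channel.bandwidth_bps


# Illustrative values only: a component sending 1,250-byte messages
# 100 times per second over a 10 Mbit/s link with 2 ms latency.
worker = BlackBox(name="worker", message_bytes=1_250, messages_per_s=100)
link = Channel(bandwidth_bps=10_000_000, latency_s=0.002)

print(f"load: {worker.offered_load_bps():.0f} bit/s")
print(f"utilisation: {utilisation(worker, link):.1%}")
print(f"per-message delay: {link.transfer_time(worker.message_bytes) * 1000:.1f} ms")
```

Keeping traffic and channel parameters as plain configuration is what makes it cheap to swap scenarios in and out; a packet-level simulator adds queueing, congestion control and loss on top of this kind of description.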
In the current version of the simulator we are focusing on the communication aspects between the previously defined components. Let’s look at how our simulator covers internal and external communications between its components as black boxes (Figure 2):
Figure 2: Communication modelling within the Hadean Network Simulator
The Hadean Network Simulator is built on top of the NS-3 network simulation platform. This choice was based on a previous study we’d performed on the state of the art, where NS-2 and NS-3 were serious candidates.
Besides being very popular open-source simulators, they are both applicable to a broad set of use cases and come with a good set of contributed modules. However, NS-3 is actively maintained, with significant documentation and tutorials, while NS-2 has been only lightly maintained, especially over the last few years.
NS-3 also has the advantage of greater modularity than NS-2, as several external animators, data analysis and visualisation tools can be used with it. In line with our requirements, NS-3 supports the pcap format, which can be used with different visualisation tools. The simulator components are implemented using NS-3 nodes and TCP channels.
The performance of any distributed system depends on the efficiency of its underlying network communications. As explained previously, client events pass through multiple layers before getting back to clients as a global state update.
Analysing communications at each level gives us insight into potential performance optimisations. Network-related limitations between data centers, and the amount of data conveyed between them, are among the aspects our analysis can draw on.
The following examples give the early use cases of the Hadean Network Simulator.
When scaling up our system, one key point is being able to forecast the resource requirements given numerous parameters. We are using the network simulator to check our scaling model based on system insights (e.g. traffic load) and cloud environment inputs (e.g. hardware bandwidth). It also considers some performance parameters, such as per-client outgoing bandwidth, in order to guarantee a good user experience.
The simulator gives us information about the network performance under these specific parameters, which allows us to adjust the system resources accordingly. It’s also easy to try different combinations of parameters and compare the associated performance when using our simulated model.
Figure 3: Scaling model
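As a rough illustration of the kind of question a scaling model answers, the sketch below estimates how many client-facing nodes are needed for a given client count, per-client outgoing bandwidth and per-node hardware bandwidth. The function name, the headroom factor and every number are assumptions made for this example; this is back-of-the-envelope arithmetic, not the model used in the simulator.

```python
import math


def nodes_required(clients: int,
                   per_client_bps: float,
                   node_bandwidth_bps: float,
                   headroom: float = 0.8) -> int:
    """Estimate the node count needed to serve all clients.

    headroom is an assumed fraction of the hardware bandwidth that we
    allow ourselves to use, leaving the rest for bursts and overhead.
    """
    usable_bps = node_bandwidth_bps * headroom
    clients_per_node = int(usable_bps // per_client_bps)
    if clients_per_node < 1:
        raise ValueError("a single client exceeds the usable node bandwidth")
    return math.ceil(clients / clients_per_node)


# Illustrative inputs only: 10,000 clients at 500 kbit/s each,
# served by nodes with 1 Gbit/s of outgoing bandwidth.
print(nodes_required(10_000, 500_000, 1_000_000_000))

# Sweeping one parameter shows how sensitive the estimate is to it.
for per_client in (250_000, 500_000, 1_000_000):
    n = nodes_required(10_000, per_client, 1_000_000_000)
    print(f"{per_client / 1000:.0f} kbit/s per client -> {n} nodes")
```

A closed-form estimate like this ignores protocol overhead, congestion and uneven load, which is precisely where running the same parameters through a simulated network adds value.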
Every process in the core simulation broadcasts regular updates to the replication layer. The corresponding data throughput increases as the workload scales up.
In order to manage data transfer between the two components efficiently, we need to compare different communication patterns (see Figure 4).
Figure 4: Core to application layer communication patterns
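To show what comparing patterns can look like before anything is implemented, the sketch below contrasts two hypothetical options: every core process sending its update directly to every replication node, and every core process sending once to a relay that forwards the merged batch. These two patterns, the function names and the numbers are assumptions for illustration and are not necessarily the patterns shown in Figure 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternCost:
    connections: int           # number of point-to-point links to maintain
    busiest_sender_bytes: int  # bytes per tick sent by the most loaded node


def direct_fanout(n_core: int, n_repl: int, update_bytes: int) -> PatternCost:
    """Each core process sends its update to every replication node."""
    return PatternCost(
        connections=n_core * n_repl,
        busiest_sender_bytes=n_repl * update_bytes,
    )


def via_relay(n_core: int, n_repl: int, update_bytes: int) -> PatternCost:
    """Core processes send once to a relay, which forwards the merged
    batch of all updates to every replication node."""
    return PatternCost(
        connections=n_core + n_repl,
        busiest_sender_bytes=n_core * update_bytes * n_repl,
    )


# Illustrative workload: 64 core processes, 4 replication nodes,
# 2,000-byte updates per tick.
for name, pattern in (("direct", direct_fanout), ("relay", via_relay)):
    cost = pattern(64, 4, 2_000)
    print(f"{name}: {cost.connections} connections, "
          f"{cost.busiest_sender_bytes} bytes/tick on the busiest sender")
```

Even this crude count exposes the trade-off: the relay cuts the number of connections sharply but concentrates the outgoing traffic on a single node. The simulator lets that trade-off be explored with realistic bandwidth and latency rather than simple counts.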
In practice, it is rarely feasible to implement every candidate solution just to compare them. The network simulator gives us an easier and cheaper way to do it, so we use it to evaluate and investigate new solutions and to make decisions about future implementations.
In this post, we have given a high-level overview of the current design of the Hadean Network Simulator by describing the main simulated components and some relevant use cases. We are still working on improving our simulation platform so that it can be integrated at different stages of the product lifecycle (e.g. POC, CI).
To that end, our current work is focusing on improving the simulator's accuracy to get deeper insights into the different network aspects of our system. When we next revisit the Hadean Network Simulator in our blog, it will be to explore the progress we’ve made.