The Hadean Network Simulator

Safae Dahmani
Jun 26, 2019 1:16:33 PM

Why do we need a network simulation environment?

As a distributed systems company, Hadean’s technology (HadeanOS, Aether Engine) needs to be tested at enormous scale. This can require multiple cloud providers for testing scalability, analysing performance and reliability, and investigating system regression for any new feature.

During our 10,000-ship EVE: Aether Wars tech demo, we found that running all our tests (a dozen per day) on cloud platforms gets very expensive, both in time and money. Testing our large-scale workloads required big, costly resources, and we often ended up with several environments running at once, which added substantially to the bill.

So how can we work around this and offer a lightweight way to explore multiple scenarios without deploying the test workload in an expensive cloud environment?

One important entry point for understanding distributed systems is network traffic. A major part of our testing scenarios focuses on bandwidth analysis and traffic management.

The idea behind our network simulator is to give developers the ability to analyse these communication aspects without running a real test workload at scale in expensive environments.

The network simulator can be used either to analyse performance issues in our current system in depth or to compare new solutions before committing to major optimisations and feature implementation.

What's the Hadean Network Simulator?

The Hadean Network Simulator (HNS) is a platform that provides a high-level representation of communication between the main Hadean components. It models two aspects of our system: its topology and its communications.

Topology

A simulator can be built at different levels of abstraction. Based on our current usage of the HNS, we decided to build a high-level representation of our communication system that includes the following components (Figure 1):

  • Core simulation: Consists of the main computational processes. These generate data traffic both internally (inter-process) and to and from the replication layer (incoming and outgoing).
  • Replication layer: Responsible for sending updates from the core simulation to the clients. In real life, this layer is geographically distributed across different data centers.
  • Client connections: As part of the validation process, as well as scalability testing, we need to include the client side in our representation of the system. Client connections are linked directly to the replication layer based on their geographical distribution.

Each of these components is represented as a black box, sending and receiving configurable traffic through configurable channels.

Figure 1: An overview of the Hadean Network Simulator topology
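
To make the black-box idea concrete, here is a minimal Python sketch of such a topology. It is not the HNS implementation (which is built on NS-3); the class names, fields and traffic figures are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """A configurable link between two components (fields are illustrative)."""
    bandwidth_mbps: float
    delay_ms: float


@dataclass
class Component:
    """A black box that only exposes the traffic it sends."""
    name: str
    send_rate_mbps: float  # configurable traffic generated by this box


@dataclass
class Topology:
    components: dict = field(default_factory=dict)
    links: dict = field(default_factory=dict)  # (src, dst) -> Channel

    def add(self, component):
        self.components[component.name] = component

    def connect(self, src, dst, channel):
        self.links[(src, dst)] = channel

    def utilisation(self, src, dst):
        """Fraction of the channel's bandwidth used by the sender's traffic."""
        channel = self.links[(src, dst)]
        return self.components[src].send_rate_mbps / channel.bandwidth_mbps


# Hypothetical numbers, not measurements from a real deployment.
topo = Topology()
topo.add(Component("core", send_rate_mbps=400.0))
topo.add(Component("replication", send_rate_mbps=800.0))
topo.add(Component("clients", send_rate_mbps=5.0))
topo.connect("core", "replication", Channel(bandwidth_mbps=1000.0, delay_ms=1.0))
topo.connect("replication", "clients", Channel(bandwidth_mbps=1000.0, delay_ms=40.0))

for (src, dst) in topo.links:
    print(f"{src} -> {dst}: {topo.utilisation(src, dst):.0%} of channel bandwidth")
```

Because every box and channel is just a set of parameters, trying a different scenario means changing a few numbers rather than redeploying a workload.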

Communications

The current version of the simulator focuses on the communication between the components defined above. Let’s look at how it covers internal and external communications between these black boxes (Figure 2):

  • Intra-core communication: As we are modelling a distributed system, internal communications are a major part of the network simulator design. Intra-core communication covers how the Hadean processes communicate, including the communication protocol, inter-process throughput and resource distribution.
  • Core to replication layer communication: Communications between the core simulation and the clients are all routed through the replication layer, so this link could be a real bottleneck in our system. Simulation performance, network bandwidth, and the geographical distribution of the replication layer all affect it.
  • Replication layer to client communication: All updates from the core simulation pass through the replication layer to reach the clients. Modelling this connection depends on the distance between the data centers and the client endpoints, in addition to the underlying network bandwidth.

Figure 2: Communication modelling within the Hadean Network Simulator
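
As a rough illustration of why distance and bandwidth both matter on the replication layer to client link, the sketch below estimates the one-way delay of a single update as propagation delay plus transmission delay. This is a textbook approximation, not the model used inside the HNS; the function name, the fibre speed constant and the example values are assumptions.

```python
SPEED_IN_FIBRE_KM_PER_S = 200_000.0  # approximation: about two thirds of c


def one_way_delay_ms(distance_km, payload_bytes, bandwidth_mbps):
    """Propagation delay plus transmission delay for one update, in milliseconds."""
    propagation_s = distance_km / SPEED_IN_FIBRE_KM_PER_S
    transmission_s = (payload_bytes * 8) / (bandwidth_mbps * 1_000_000)
    return (propagation_s + transmission_s) * 1000.0


# Illustrative values only: a nearby client and a distant one receiving the same update.
near = one_way_delay_ms(distance_km=100, payload_bytes=1250, bandwidth_mbps=100)
far = one_way_delay_ms(distance_km=1000, payload_bytes=1250, bandwidth_mbps=100)
print(f"near: {near:.2f} ms, far: {far:.2f} ms")
```

For small updates the propagation term dominates, which is one reason a geographically distributed replication layer helps.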

How did we implement our Network Simulator?

The Hadean Network Simulator is built on top of the NS-3 network simulation platform. We chose it after surveying the state of the art, where NS-2 and NS-3 were the serious candidates.

Besides being very popular open-source simulators, both are applicable to a broad set of use cases and come with a good set of contributed modules. However, NS-3 is actively maintained with significant documentation and tutorials, while NS-2 has been only lightly maintained, especially over the last few years.

NS-3 also has the advantage of higher modularity than NS-2, as several external animators, data analysis tools and visualisation tools can be used with it. One of our requirements was pcap support: NS-3 can write traces in the pcap format, which can then be used with different visualisation tools. The simulator components are implemented using NS-3 nodes and TCP channels.

Hadean Network Simulator use cases

The performance of any distributed system depends on the efficiency of its underlying network communications. As explained previously, client events pass through multiple layers before getting back to clients as a global state update.

Analysing communications at each level gives us insight into potential performance optimisations. The network limitations between data centers and the amount of data conveyed between them are both inputs to that analysis.

The following examples are early use cases of the Hadean Network Simulator.

Scaling model

When scaling up our system, one key point is being able to forecast the resource requirements given numerous parameters. We are using the network simulator to check our scaling model based on system insights (e.g. traffic load) and cloud environment inputs (e.g. hardware bandwidth). The model also considers performance parameters such as per-client outgoing bandwidth, in order to guarantee a good user experience.

The simulator gives us information about the network performance under these specific parameters, which allows us to adjust the system resources accordingly. With the simulated model, it’s also easy to try different combinations of parameters and compare the resulting performance.

Figure 3: Scaling model
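
The kind of question a scaling model answers can be sketched with a back-of-the-envelope calculation: given a client count, a per-client outgoing bandwidth and the egress bandwidth of one replication node, how many nodes are needed? The function below is a hypothetical simplification, not Hadean's scaling model; the `headroom` parameter and all the numbers are assumptions for illustration.

```python
import math


def replication_nodes_needed(clients, per_client_mbps, node_egress_mbps, headroom=0.8):
    """Nodes required so that no node exceeds `headroom` of its egress bandwidth."""
    usable_mbps = node_egress_mbps * headroom
    total_mbps = clients * per_client_mbps
    return math.ceil(total_mbps / usable_mbps)


# Illustrative sweep over client counts; the inputs are made up, not measured.
for clients in (1_000, 10_000, 100_000):
    nodes = replication_nodes_needed(clients, per_client_mbps=0.5, node_egress_mbps=1000.0)
    print(f"{clients} clients -> {nodes} replication node(s)")
```

A simulator earns its keep where this arithmetic stops being trustworthy, for example when links are shared or traffic is bursty.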

Centralised versus decentralised communication patterns

Every process in the core simulation broadcasts regular updates to the replication layer. The corresponding data throughput increases as the workload scales up.

To manage data transfer between the two components efficiently, we need to compare different communication patterns (see Figure 4).

Figure 4: Core to replication layer communication patterns

Implementing every candidate solution just to compare them is rarely practical. The network simulator offers an easier and less costly way to do it, so we use it to evaluate and investigate new solutions and to inform decisions about future implementations.
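
To show the kind of trade-off such a comparison exposes, here is a simplified sketch. It assumes one particular reading of the two patterns, which may differ from the ones in Figure 4: in the decentralised case every core process sends its update directly to every replication node, while in the centralised case processes send to a single aggregator that forwards the combined update to every replication node. The function names and numbers are hypothetical.

```python
def decentralised(processes, replication_nodes, update_mbps):
    """Every process sends its own update directly to every replication node."""
    return {
        "connections": processes * replication_nodes,
        "busiest_sender_mbps": replication_nodes * update_mbps,
    }


def centralised(processes, replication_nodes, update_mbps):
    """Processes send to one aggregator, which forwards everything to each node."""
    aggregate_mbps = processes * update_mbps
    return {
        "connections": processes + replication_nodes,
        "busiest_sender_mbps": replication_nodes * aggregate_mbps,
    }


# Illustrative workload: 100 core processes, 4 replication nodes, 2 Mbps per update stream.
for pattern in (decentralised, centralised):
    result = pattern(processes=100, replication_nodes=4, update_mbps=2.0)
    print(pattern.__name__, result)
```

Under these assumptions the centralised pattern needs far fewer connections but concentrates all outgoing traffic on one sender, which is exactly the sort of tension that is cheaper to explore in simulation than in a full deployment.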

Future work and perspectives

In this post, we have given a high-level overview of the current design of the Hadean Network Simulator by describing the main simulated components and some relevant use cases. We are still working on improving our simulation platform so that it can be integrated at different stages of the product lifecycle (e.g. POC, CI).

To that end, our current work focuses on improving the simulator's accuracy to get deeper insights into the different network aspects of our system. When we next revisit the Hadean Network Simulator on our blog, it will be to explore the progress we’ve made.
