Thoughts on testing in a distributed system

Finn Nguyen
4 min read · Dec 3, 2021

Disclaimer: I’m writing from my own experience. This is not “best practice”; it is what I think of as the right way to approach testing in a distributed system.

A distributed system is a system spread across multiple servers on a network that together complete tasks more efficiently than a single server could. Depending on how we allocate resources, a server may host one or many nodes, where a node is a program. A single node processes a unit of work: we can think of it as one program doing one job, or several smaller jobs that contribute to the final result.

Since each node does one job, we want it to do that job well. That is why we need to test it well, too.

When we talk about distributed systems, they usually come with an event-driven architecture. An event is a set of data whose shape we know exactly, so we can easily define the input and output to test. We have stateful nodes (services), which connect to the database, and stateless nodes (workers), which receive event messages from the message bus to process. Workers call services to persist the data.
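To make those roles concrete, here is a minimal TypeScript sketch; the event type, service interface, and handler are hypothetical, just to illustrate the shape:

```typescript
// A hypothetical event with a known, fixed shape.
interface OrderCreatedEvent {
  orderId: string;
  amount: number;
}

// The stateful service that persists data; the worker only sees this interface.
interface OrderService {
  saveOrder(orderId: string, amount: number): Promise<void>;
}

// A stateless worker: receives an event from the message bus,
// applies its one piece of logic, and delegates persistence to the service.
async function handleOrderCreated(
  event: OrderCreatedEvent,
  service: OrderService
): Promise<void> {
  if (event.amount <= 0) {
    throw new Error(`Invalid amount for order ${event.orderId}`);
  }
  await service.saveOrder(event.orderId, event.amount);
}
```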

But things soon become a mess when the business grows.

Because the first version of a node is simple, we can move fast and “break things.” We add more features within tightly allocated time, which leads to substantial technical debt. There is a loop of:

- We need more features and may need to refactor the program to support it.
- But we don’t have much time. Can we just write code and test it manually?
- It is okay. Make sure you tested it well.

Yeah, trust me, it doesn’t go well. The program will always lack unit tests.

To cover that gap, we thought of black-box testing. No matter what is inside the program, this is an event-driven system: I give it input A and expect output B.

Then we have to simulate the node running as it would on an actual production server. We start a local database, start all the related services via a config file, remap mount paths, and prepare the data. It is a little complex, but it works exactly as we expect. We just need to provide the input and the expected output.
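A black-box test in this style might look like the following sketch (assuming Jest; `publishEvent` and `waitForResult` are hypothetical helpers that talk to the simulated environment):

```typescript
// Black-box style: feed input A into the running node and assert output B.
// These helpers are hypothetical; they would wrap the local message bus,
// database, and services started for the simulation.
import { publishEvent, waitForResult } from "./testHarness";

describe("worker black-box test", () => {
  it("produces output B for input A", async () => {
    await publishEvent("orders", { orderId: "A-1", amount: 42 });
    const result = await waitForResult("A-1");
    expect(result.status).toBe("persisted");
  });
});
```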

Perfect!

Now it is an integration test, not a unit test anymore. And then we add more features. Maintaining the previous tests is a huge pain, because mimicking production requires tricks here and there: managing the database state before and after each test, and managing the state of the other services the test calls. Time is tight and managers are pushing, so let’s just write code for the new feature, test it manually, and run the old tests to make sure nothing breaks. And now we are back in the loop above.

Now, how should we make it better?

- We need to stop simulating the production environment. Mock the services and the database instead (see the sketch after this list). Testing the service itself is out of scope here; let’s assume the service always behaves correctly, and if it doesn’t, that is the service’s fault.
- The worker should be small, doing one job and doing it well. Tests in the worker should cover its logic only.
- Do integration tests on the service (and unit tests, for sure).
- Think about adding code coverage, like Coverlet for C# or Istanbul for Node.js. Negotiate a time budget with management to agree on a coverage number, typically 85%.
- Use a robust testing framework.
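For example, here is a minimal sketch of a worker unit test with a mocked service, assuming Jest and reusing the hypothetical `handleOrderCreated` from the earlier sketch:

```typescript
// Unit test for the worker's logic only: the service is a mock, so there is
// no database and no simulated environment.
// (Assumes handleOrderCreated and OrderService from the earlier sketch.)
describe("handleOrderCreated", () => {
  it("persists a valid order through the service", async () => {
    const service = { saveOrder: jest.fn().mockResolvedValue(undefined) };
    await handleOrderCreated({ orderId: "A-1", amount: 42 }, service);
    expect(service.saveOrder).toHaveBeenCalledWith("A-1", 42);
  });

  it("rejects a non-positive amount without calling the service", async () => {
    const service = { saveOrder: jest.fn() };
    await expect(
      handleOrderCreated({ orderId: "A-2", amount: 0 }, service)
    ).rejects.toThrow();
    expect(service.saveOrder).not.toHaveBeenCalled();
  });
});
```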

Or we can change the architecture.

We can change it to avoid the tight coupling between services, workers, and the database. The more loosely coupled the system is, the easier it is to test.

I once worked with Clean Architecture and found it much easier to apply tests at all levels. A full discussion is out of scope for this post, but I will mention a few key points on why it is better.

- In the domain (business) layer, we have tasks whose input and output we know precisely, so we can easily add unit tests. It is just the pure programming language, delegating infrastructure work to the outer layers.
- We don’t have many services, so we don’t have to mock many service calls like in a microservices architecture. We just mock third-party APIs and set up a local database to run the integration tests. Sure, we still need to prepare data and clean it up before and after, but it is much easier.
- It is separation of concerns. The business layer cares about business logic. The presentation layer (API server) cares about authentication, permissions, logging, routing, and input/output mapping. Once each piece is only concerned with small things, it becomes easier to test.
- Environment-based behavior. For example, in the old way we simulated production, so the logging module sent logs to the Elasticsearch server, and we could only avoid that by mocking the server. In Clean Architecture we apply dependency injection, so we can configure the logger to just write to the console in the dev or test environment (see the sketch below).
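A minimal sketch of that idea, assuming a hypothetical `Logger` port wired up at the composition root:

```typescript
// A hypothetical logger port: the business layer depends only on this interface.
interface Logger {
  log(message: string): void;
}

// Choose the implementation at the composition root, based on the environment.
function createLogger(env: string): Logger {
  if (env === "production") {
    // In production this would be an Elasticsearch-backed adapter;
    // stubbed here to keep the sketch self-contained.
    return { log: () => { /* ship to Elasticsearch */ } };
  }
  // In dev and test, just write to the console; nothing to mock.
  return { log: (message) => console.log(message) };
}
```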

Another example is the database. In tests we may want to store data only in memory, with no need to connect to a real database. We can do that too:
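Here is a minimal sketch, assuming a hypothetical repository port behind which the real database normally sits:

```typescript
// A hypothetical repository port and an in-memory implementation for tests.
interface OrderRepository {
  save(orderId: string, amount: number): Promise<void>;
  find(orderId: string): Promise<number | undefined>;
}

// Injected in tests instead of the real database-backed implementation.
class InMemoryOrderRepository implements OrderRepository {
  private orders = new Map<string, number>();

  async save(orderId: string, amount: number): Promise<void> {
    this.orders.set(orderId, amount);
  }

  async find(orderId: string): Promise<number | undefined> {
    return this.orders.get(orderId);
  }
}
```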

Conclusion

Testing feels less critical in the early stage of a product, when we usually care more about fast development than robust development. But when the product matures, a bad choice of architecture can lead to a buggy system that is hard to maintain and multiplies the cost.

Let’s think of the feature as a spear and the test as a shield. It needs to be tailored to the team structure and be easy to maintain or replace.
