Thoughts on testing in a distributed system
Testing is a crucial part of software development.
Disclaimer: I’m speaking from my own experience. This is not “best practice”; it is what I consider the right way to do testing in a distributed system.
A distributed system is a system spread across multiple servers on a network, which together complete tasks more efficiently than a single server could. Depending on how we allocate resources, a server may host one or many nodes, i.e. programs. A single node processes a unit of work; we can think of it as a single program doing one job, or several smaller jobs that contribute to the final result.
Since it does one job, we want it to do that job well, which is why we need to test it well too.
When we talk about distributed systems, they usually come with event-driven architecture. An event is a set of data whose shape we know exactly, so we can easily define the input and output to test. We have stateful nodes (services), which connect to the database, and stateless nodes (workers), which receive event messages from the message bus to process. Workers call services to persist the data.
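As a minimal sketch of that shape (all names here are hypothetical), an event is just data whose structure we know up front, and a worker is a small function that processes it and calls a service to persist the result:

```typescript
// A hypothetical event: a set of data whose shape we know exactly.
interface OrderCreated {
  orderId: string;
  amount: number;
}

// The stateful side: a service interface that persists data
// (backed by a database in production).
interface OrderService {
  save(orderId: string, total: number): void;
}

// The stateless worker: one unit of work per event message.
// It applies a (hypothetical) business rule and delegates persistence.
function handleOrderCreated(event: OrderCreated, service: OrderService): number {
  const total = event.amount * 1.1; // e.g. add a 10% fee — made-up rule for illustration
  service.save(event.orderId, total);
  return total;
}
```

Because the event’s shape is fixed, the worker’s input and output are easy to pin down in a test.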
But things soon become a mess when the business grows.
Because the first version of a node is simple, we can move fast and “break things”: we add more features within the tight time allocated, which later leads to huge technical debt. There is a loop:
- We need more features; maybe we need to refactor the program to support them.
- But we don’t have much time, can we just write the code and test it manually?
- It is fine, just make sure you test it well.
Yeah, trust me, it doesn’t go well: the program always ends up lacking unit tests.
To cover for that, we thought of black-box tests: no matter what is inside the program, this is an event-driven system, so I give it input A and expect output B.
Then we have to simulate the node running as it would on a real production server: we start a local database, start all the related services via a config file, remap mounting paths, prepare the data… It is a little complex, but it works exactly as we expect; we just need to provide the input and the expected output.
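That black-box idea can be sketched like this: the test treats the whole node as an opaque function from an input event to an output event (the node body below is a hypothetical stand-in for the real deployed worker):

```typescript
// A generic event envelope (hypothetical shape).
type BusEvent = { type: string; payload: Record<string, unknown> };

// Hypothetical node entry point. In the real setup this would be the worker
// wired to a local database and all its related services; the test never
// looks inside it.
function runNode(input: BusEvent): BusEvent {
  // ...internals are a black box to the test...
  return { type: "B", payload: { from: input.type } };
}

// The test only knows one contract: input A should yield output B.
function blackBoxTest(): boolean {
  const output = runNode({ type: "A", payload: {} });
  return output.type === "B";
}
```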
Now it is an integration test, not a unit test anymore. Then we add more features, and maintaining the previous tests becomes a huge pain, because mimicking production requires tricks here and there: managing the database state before and after each test, managing the state of the other services the test calls. Time is tight, managers are pushing, so let’s just write code for the new feature and test it manually, and run the old tests to make sure nothing breaks. Now we are back in the loop above.
End of the story.
Now, how should we make it better?
- We need to stop simulating the production environment. We should mock the services and the database. Assume the service is always right; if it isn’t, that is the service’s fault. We shouldn’t cover service testing here because it is out of scope.
- The worker should be small: do one job and do it well. The tests in the worker should cover its logic only.
- Do integration tests on the service (and unit tests, for sure).
- Think about adding code coverage, like Coverlet for C# or Istanbul for Node.js. Work out a time budget with management and set a coverage target, typically 85%.
- Use a robust testing library or framework.
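The first two points can be sketched like this (names are hypothetical): the worker’s logic is tested against a mock service that only records calls, so no real service or database is involved:

```typescript
// The service contract the worker depends on (hypothetical).
interface PriceService {
  persist(sku: string, price: number): void;
}

// The worker's one job: apply a discount and hand the result to the service.
function applyDiscount(
  sku: string,
  price: number,
  percent: number,
  service: PriceService
): number {
  const discounted = price - (price * percent) / 100;
  service.persist(sku, discounted);
  return discounted;
}

// A mock that records calls instead of touching a real database.
function makeMockService() {
  const calls: Array<{ sku: string; price: number }> = [];
  const service: PriceService = {
    persist: (sku, price) => calls.push({ sku, price }),
  };
  return { service, calls };
}
```

The test then asserts on the recorded calls and the return value only; whether the real service stores the data correctly belongs to the service’s own test suite.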
Or we can change the architecture to avoid dealing with the tight coupling between services, workers, and the database. The more loosely coupled the system is, the easier it is to test.
I once worked with Clean Architecture, and I found it much easier to apply tests at all levels. A full discussion is out of scope for this post, but I will mention a few key points on why it is better.
- In the domain layer (business layer), we have tasks (use cases) whose inputs and outputs we know exactly, so we can easily add unit tests. It is just the pure programming language, delegating the infrastructure work to the outer layers.
- We don’t have as many services, so we don’t have to mock a lot of service calls like in a microservices architecture. We just mock the third-party APIs and set up a local database to run the integration tests. Sure, we still need to prepare the data and clean it up before and after, but it is much easier.
- It is separation of concerns. The business layer cares about business logic. The presentation layer (the API server) cares about authentication, permissions, logging, routing, and mapping the input and output. Once each layer is concerned only with small things, it is easier to test.
- Environment-based behaviour. For example, in the old way we simulated production, so the logging module would send logs to the Elasticsearch server, and we could only avoid that by mocking the server. But in Clean Architecture we apply dependency injection, so we can configure the app to just send logs to the console in the dev or test environment. Another example is the database: we may want tests to store data only in memory, with no need to connect to a real database, and we can do that too.
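A rough sketch of that dependency-injection idea (types and names are hypothetical): the business code depends only on a logger interface, and the dev/test environment injects an in-memory implementation instead of an Elasticsearch client:

```typescript
// The port the business code depends on.
interface Logger {
  log(message: string): void;
}

// Dev/test adapter: keeps logs in memory (a console adapter would look similar).
// The production adapter would send to Elasticsearch instead.
class InMemoryLogger implements Logger {
  entries: string[] = [];
  log(message: string): void {
    this.entries.push(message);
  }
}

// Business code receives the logger via constructor injection,
// so it never knows which environment it is running in.
class CheckoutUseCase {
  constructor(private logger: Logger) {}

  run(orderId: string): string {
    this.logger.log(`checkout ${orderId}`);
    return `done:${orderId}`;
  }
}
```

Swapping the adapter per environment happens in the composition root (the app’s wiring code), so the use case and its tests stay identical everywhere.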
Testing is less important in the early stage of a product, when we usually care more about fast development than robust development. But when the product matures, a bad architectural choice may lead to a buggy system that is hard to maintain and multiplies the cost.
Think of the feature as a spear and the test as a shield: it needs to be tailored to the team’s structure, and it should be easy to maintain or replace.