Resilience | microservices

Resilience in a microservice architecture means protecting the system against failures in other microservices. That is easily said, but how can we really protect our systems against partial failures? To get into the topic, this post shows some examples and some possible solutions.

Let's start by looking at the remote procedure call, probably one of the most problematic request styles when it comes to resilience.

This picture shows a caller (which can be a gateway, a GUI, or another microservice) calling a microservice. The load balancer picks an instance in a round-robin fashion.
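
To make the round-robin idea concrete, here is a minimal sketch in Python. The class name RoundRobinBalancer and the instance names are made up for illustration; a real load balancer also tracks instance health, which is left out here.

```python
import itertools


class RoundRobinBalancer:
    """Hands out instances in a fixed, rotating order (hypothetical helper)."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one instance is required")
        # cycle() repeats the list forever: 1, 2, 3, 1, 2, 3, ...
        self._cycle = itertools.cycle(list(instances))

    def next_instance(self):
        return next(self._cycle)


balancer = RoundRobinBalancer(["instance-1", "instance-2", "instance-3"])
picks = [balancer.next_instance() for _ in range(4)]
print(picks)  # the fourth call wraps around to instance-1
```

Note that plain round robin knows nothing about failures: a broken instance keeps getting its share of the calls.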

What are strategies if the caller gets a timeout or an internal server error?

Strategy 1

You retry on the caller's side. With luck, you are routed to a healthy service instance. But what happened to your first, failed call in the case of a CUD (create, update, delete) operation? Was data partially written? If you land again on the instance that failed, the problem is still there. And how many times do you retry?
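
A retry on the caller's side could look roughly like the sketch below. It assumes a bounded number of attempts with exponential backoff, and it sends the same idempotency key on every attempt so that the service could detect a repeated CUD request. The names call_with_retry, TransientError, and flaky_create are hypothetical, and the duplicate detection on the server side is not shown.

```python
import time
import uuid


class TransientError(Exception):
    """Stands in for a timeout or an internal server error."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    # One key for all attempts, so the server can recognise a repeated request.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(idempotency_key)
        except TransientError:
            if attempt == max_attempts:
                raise  # give up: the retry budget is used up
            # Wait 0.1s, 0.2s, 0.4s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated remote call: fails twice, then succeeds.
calls = []


def flaky_create(key):
    calls.append(key)
    if len(calls) < 3:
        raise TransientError("simulated timeout")
    return "created"


result = call_with_retry(flaky_create, sleep=lambda seconds: None)
print(result, len(calls))
```

The upper bound on attempts is one answer to the "how many retries" question. The idempotency key only helps if the service stores it and ignores duplicates.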

Strategy 2

All microservice instances run in an active-passive mode. If no heartbeat is received from instance 1, instance 2 switches to active mode. This complicates the system, and you still need to find a way to load balance the instances.
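
The heartbeat check on the passive side can be sketched like this. StandbyInstance and its methods are invented names, and the clock is injected so the example runs without real waiting. The sketch deliberately ignores the hard parts, such as both instances believing they are active at the same time.

```python
import time


class StandbyInstance:
    """Passive instance that takes over when heartbeats stop (hypothetical)."""

    def __init__(self, heartbeat_timeout, clock=time.monotonic):
        self._timeout = heartbeat_timeout
        self._clock = clock
        self._last_heartbeat = clock()
        self.active = False

    def on_heartbeat(self):
        # Called whenever the active instance reports that it is alive.
        self._last_heartbeat = self._clock()

    def check(self):
        # Switch to active mode once the last heartbeat is too old.
        silence = self._clock() - self._last_heartbeat
        if not self.active and silence > self._timeout:
            self.active = True
        return self.active


# Fake clock so the example is deterministic.
now = [0.0]
standby = StandbyInstance(heartbeat_timeout=3.0, clock=lambda: now[0])

now[0] = 2.0
standby.on_heartbeat()          # instance 1 is alive at t=2
now[0] = 4.0
still_passive = not standby.check()   # only 2 seconds of silence
now[0] = 6.0
took_over = standby.check()           # 4 seconds of silence: take over
print(still_passive, took_over)
```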

Strategy 3

If the operation is a CUD operation, you could switch to asynchronous communication and rely on the delivery guarantees of your broker. If the caller expects an id back from the create operation, you can model it this way: RequestDto.FutureId = 17 (typically this is a UUID/GUID). In this case the caller sets the id under which the future object will exist. Another benefit is that the caller becomes faster, as it does not wait for a response.
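
As a rough illustration of the caller-chosen id, the sketch below uses an in-memory queue as a stand-in for a real broker. RequestDto and its future_id field mirror the RequestDto.FutureId idea from the text; create_async, consume_one, and store are hypothetical names.

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass
class RequestDto:
    name: str
    # The caller picks the id under which the future object will exist.
    future_id: str = field(default_factory=lambda: str(uuid.uuid4()))


broker = queue.Queue()  # stand-in for a real message broker
store = {}              # stand-in for the service's database


def create_async(name):
    request = RequestDto(name=name)
    broker.put(request)       # fire and forget: no waiting for a response
    return request.future_id  # the id is known before the object exists


def consume_one():
    # Runs later, inside the microservice.
    request = broker.get_nowait()
    store[request.future_id] = request.name


order_id = create_async("new order")
consume_one()
print(store[order_id])
```

Because the id comes from the caller, a message that is delivered twice can be recognised by its id instead of creating a second object.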

You will probably agree with me that strategy 3 is the simplest solution. You make your system more resilient by using asynchronous calls. If the asynchronous call is processed later and an exception occurs, the payload is moved to a dead-letter queue and someone in the organization has to check it. Whether that is acceptable really depends on your use case.
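
The dead-letter handling can be sketched as follows, again with in-memory queues instead of a real broker. The limit of three deliveries, the handle function, and the payload fields are assumptions made for this example; real brokers usually provide this behavior through configuration.

```python
import queue

main_queue = queue.Queue()
dead_letter_queue = queue.Queue()
processed = []


def handle(payload):
    # Hypothetical business rule: every payload needs a customer.
    if "customer" not in payload:
        raise ValueError("customer is missing")
    processed.append(payload)


def drain(max_deliveries=3):
    while not main_queue.empty():
        payload, deliveries = main_queue.get()
        try:
            handle(payload)
        except Exception as error:
            if deliveries + 1 >= max_deliveries:
                # Park the payload with the reason, for a human to inspect.
                dead_letter_queue.put((payload, repr(error)))
            else:
                # Put it back and count the failed delivery.
                main_queue.put((payload, deliveries + 1))


main_queue.put(({"customer": "acme"}, 0))
main_queue.put(({"amount": 5}, 0))  # will fail on every delivery
drain()
print(len(processed), dead_letter_queue.qsize())
```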