Fault tolerance - Queuing

In a distributed architecture, you should always favor asynchronous process when it is possible for several reasons:

  • better load absorption
  • consumption flexibility (stop the consumer, increase the number of consumers)
  • fault tolerance

Here we gonna talk about the last reason.

Sync vs Async

When a failure occurred in a synchronous process, this one cannot be recovered:

Synchronous process

In this example, the order service send analytic data but the analytic service is down. The order process stops even if this step is not mandatory as these two services are strongly coupled.

The same process in asynchronous way:

Asynchronous process

Here the order service pushes its request in a queue and doesn't care whether the analytic service is up or not. Request can be replayed with a consumer queue once the target service is up.

Dead letter queue (DLQ)

Up to now, we are in the scenario where the failure is coming from the system unavailability, but what if the message itself is wrong. We can be in the situation when one message blocks other ones:

Messages blocked

That's why we need another queue called dead letter queue:

After a defined number of retries, the message is sent to DLQ, thus the next one can be delivered. DLQ is just a retention queue where messages are not consumed anymore.

Warning: this kind of mechanism can be painful to handle if you have a lot of rejects. In most of queuing systems, you don't have the possibility to query a specific message among all ones. Messages are consumed in the order of push (FIFO). So if the message stays too long in queue (eg. invalid payload), you should consider to put it in retention table.


Moreover, it's strongly recommended to set a Time to live on each message to avoid queue overload. Thus, once the time to live is expired, the message is send to the retention table.

Retention table allows to handle a specific message contrary to queuing system. Document oriented database are generally well adapted for this case as they allow flexible data structure.

Idempotency

When an outage occurred during the message delivering (between the queuing system and the target service), the message will return in the queue. In this situation, you don't know if the message has been delivered or not and you must redeliver this one.
It's very convenient when the target system is idempotent. Idempotency means that we can repeat a operation several times without side effect (eg. create several times the same order). Of course it involves that the system is able to identify which request has been processed or not.
Thus, when a error occurred, we can resend a request without fear of side effect.

There are several ways for a system to be idempotent and it isn't the purpose of this article to cover this topic but generally you have to store the unique message identifier.
For queuing systems which implement the AMQ protocol, you should consider the Message ID property.
The id can be stored in a simple in-memory cache for short living process (eg. Redis) or in a dedicated table if you want to replay messages in a large time frame.

Conclusion

When you are designing your process, always ask to yourself if it could be done in an asynchronous way. Besides the fact that services are loosely coupled, queuing makes process less vulnerable to failure as errors can be recovered easily thanks to retry mechanisms. But you should be cautious about a large number of rejects as it can be tedious to handle in a queuing system.