10 points by yakirbitan 2 days ago | 7 comments
The problem arises when the database is down for a period of time. If the worker has already pulled a message from the queue, that message is lost unless the failure is handled properly.
Proposed Solution:
1. Use ACK Mechanism Properly
• If the database is available, insert the data and send an ACK to remove the message from the queue.
• If the database is down, send a NACK with requeue so that RabbitMQ requeues the message for another attempt.
2. Prevent Infinite Retry Loops
• If we simply send a NACK, the message returns to the queue immediately, leading to a tight retry loop that overloads the system.
• To solve this, implement a progressive retry delay: the worker sleeps before retrying the same task, e.g. 1s → 3s → 5s → 30s → 60s (capped at a max delay).
3. Limit Retry Attempts
• Introduce a retry counter (e.g., 5 attempts).
• After 5 failures, move the message to a Dead Letter Queue (DLQ) instead of retrying indefinitely.
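A minimal Python sketch of the three steps above, just to make the flow concrete. It is not tied to a specific client: `channel` is any object with `basic_ack` and `basic_nack` methods (the names follow pika's channel API), while `insert_into_db`, `send_to_dlq` and the `attempt` argument are hypothetical placeholders. How the attempt number is tracked across redeliveries is left to the caller.

```python
import time

# Delay schedule from the post, in seconds; the last value acts as the cap.
RETRY_DELAYS = [1, 3, 5, 30, 60]
MAX_ATTEMPTS = 5  # retry limit from step 3


def delay_for(attempt):
    """Delay before requeueing after failed attempt number `attempt` (1-based)."""
    index = min(attempt, len(RETRY_DELAYS)) - 1
    return RETRY_DELAYS[index]


def handle_message(channel, delivery_tag, body, attempt,
                   insert_into_db, send_to_dlq, sleep=time.sleep):
    """Process one delivery; returns "acked", "requeued" or "dead-lettered"."""
    try:
        insert_into_db(body)  # hypothetical DB write
    except ConnectionError:
        if attempt >= MAX_ATTEMPTS:
            # Out of retries: park the message, then remove it from the queue.
            send_to_dlq(body)
            channel.basic_ack(delivery_tag=delivery_tag)
            return "dead-lettered"
        # Wait before requeueing so the retry loop is not tight.
        sleep(delay_for(attempt))
        channel.basic_nack(delivery_tag=delivery_tag, requeue=True)
        return "requeued"
    channel.basic_ack(delivery_tag=delivery_tag)
    return "acked"
```

Note that sleeping inside the handler blocks that worker for the whole delay, which is part of the trade-off being asked about.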
Alternative Approach:
Instead of relying on RabbitMQ's NACK-and-requeue retry cycle, an alternative would be to keep the message in memory and attempt up to 5 internal retries within the worker itself:
1. The worker tries to insert the data into the DB.
2. If the DB is down, it retries up to 5 times, with a sleep interval between attempts.
3. If all retries fail, move the message to the DLQ.
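The in-worker variant could look roughly like this in Python. Again a sketch under assumptions: `insert_into_db` and `send_to_dlq` are hypothetical callables, and the delays between the 5 attempts are borrowed from the schedule in the proposed solution.

```python
import time

MAX_ATTEMPTS = 5


def process_with_internal_retries(body, insert_into_db, send_to_dlq,
                                  delays=(1, 3, 5, 30), sleep=time.sleep):
    """Try the insert up to MAX_ATTEMPTS times.

    Returns True on success, False if the message was sent to the DLQ.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            insert_into_db(body)  # hypothetical DB write
            return True
        except ConnectionError:
            # Sleep between attempts, but not after the last one.
            if attempt < MAX_ATTEMPTS:
                sleep(delays[min(attempt, len(delays)) - 1])
    send_to_dlq(body)
    return False
```

If the worker consumes with manual acknowledgements and only ACKs after this function returns, the broker redelivers the message when the worker dies mid-loop, because it was never acknowledged.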
Questions:
• Which approach is preferable? Should I rely on RabbitMQ to handle retries, or should I manage them within the worker itself?
• Are there better practices for handling failures in a high-scale distributed system with RabbitMQ and a database backend?
ciaovietnam 2 days ago
Firstly, every job gets stored in an activity table with retry and result columns.
If the job should be run instantly without queueing, no queue entry is needed: the retry column is 0 and the result column is filled in.
If the job fails for whatever reason, a queue entry containing the activity ID is added so the queue worker can process it later on. When the worker processes the job, the retry column is incremented regardless of the outcome. If the worker succeeds, the activity is updated with the result.
If the worker fails to process it and the retry number is less than 5, another queue entry is added (as the current one has already been removed from the queue). When that queue entry gets processed depends on the retry number. The delays can be 1m, 5m, 25m, 2.1h, 10.4h, 2.2 days, using the following formula:
$interval = 300 * pow(5, ($retry - 1));
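For reference, here is the same formula worked out in Python for retry numbers 0 through 5. This is my transcription of the PHP line, not the original code. It uses integer division so the results stay whole seconds, and `retry_interval` is just an illustrative name.

```python
def retry_interval(retry):
    """Seconds to wait for a given retry number.

    Equivalent to 300 * 5 ** (retry - 1), written with integer
    arithmetic so that retry = 0 gives exactly 60.
    """
    return 300 * 5 ** retry // 5


intervals = [retry_interval(r) for r in range(6)]
# intervals == [60, 300, 1500, 7500, 37500, 187500]
# i.e. 1 min, 5 min, 25 min, ~2.1 h, ~10.4 h, ~2.2 days
```

So the 1m delay in the list corresponds to retry = 0, and each later retry multiplies the wait by 5.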
This approach also helps in case of queue failure: you can rebuild the queue from the activity table by selecting entries that have no result (or status) and a retry number less than 5.
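A rough Python sketch of that rebuild step, using SQLite only to keep it self-contained. The table and column names (`activity`, `id`, `retry`, `result`) and the `enqueue` callback are assumptions for illustration. The real schema and queue client will differ.

```python
import sqlite3

MAX_RETRIES = 5


def pending_activity_ids(conn):
    """IDs of activities with no result yet and retries left."""
    rows = conn.execute(
        "SELECT id FROM activity "
        "WHERE result IS NULL AND retry < ? ORDER BY id",
        (MAX_RETRIES,),
    )
    return [row[0] for row in rows]


def rebuild_queue(conn, enqueue):
    """Re-add one queue entry per pending activity; returns how many were added."""
    ids = pending_activity_ids(conn)
    for activity_id in ids:
        enqueue(activity_id)  # hypothetical: publish the activity ID to the queue
    return len(ids)
```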
To be honest, I don't work with queues regularly, but I had to implement one anyway. I'm sharing this approach so we can all improve it.
4ndrewl 2 days ago
Don't forget a mechanism to redrive messages back from the DLQ, and consider whether ordering is important (it might be if you're using a FIFO queue, but I'm unsure whether Rabbit supports that).
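A redrive could be as simple as the sketch below. `channel` is duck-typed after pika's `basic_get`, `basic_publish` and `basic_ack`, and the queue names are whatever you pass in. This is an assumption-laden illustration, not a hardened tool, and it makes no attempt to preserve ordering.

```python
def redrive(channel, dlq_name, target_queue, limit=100):
    """Move up to `limit` messages from the DLQ back to the work queue.

    Returns the number of messages moved.
    """
    moved = 0
    while moved < limit:
        method, properties, body = channel.basic_get(queue=dlq_name, auto_ack=False)
        if method is None:  # DLQ is empty
            break
        # Republish via the default exchange, routed by queue name.
        channel.basic_publish(exchange="", routing_key=target_queue,
                              body=body, properties=properties)
        # Ack the DLQ copy only after republishing, so a crash in between
        # leaves a duplicate rather than a lost message.
        channel.basic_ack(delivery_tag=method.delivery_tag)
        moved += 1
    return moved
```

The `limit` keeps a single run bounded, which helps if the underlying failure is still present and messages would just bounce back to the DLQ.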
kbouck 2 days ago
Proposed Solution 2 without ACKs would be vulnerable to message loss if a worker were to crash before successful message delivery.
gregw2 2 days ago