241 points by yusufaytas 3 days ago | 94 comments
jojolatulipe 3 days ago
wslh 2 days ago
refset 2 days ago
robertlagrant 3 days ago
calmoo 3 days ago
Icathian 3 days ago
eknkc 3 days ago
Felt pretty safe about it so far but I just realised I never check if the db connection is still ok. If this is a db related job and I need to touch the db, fine. Some query will fail on the connection and my job will fail anyway. Otherwise I might have already lost the lock and not aware of it.
Without fencing tokens, atomic ops and such, I guess one needs a two stage commit on everything for absolute correctness?
Quekid5 3 days ago
AFAIK the only correct way to do what you probably thought you were doing is "EXCLUSIVE" or "ACCESS EXCLUSIVE"... or two-phase commit or idempotency for the operations you're doing.
[0] https://www.postgresql.org/docs/current/explicit-locking.htm...
skrause 2 days ago
Are you sure that you're talking about the same locks? What are the pitfalls exactly?
candiddevmike 3 days ago
Quekid5 3 days ago
joatmon-snoo 3 days ago
Postgres allows obtaining advisory locks at either the session _or_ transaction level. If it's session-level, then you have, ergo, a connection-level lock.
https://www.postgresql.org/docs/current/explicit-locking.htm...
skrause 2 days ago
m11a 2 days ago
eknkc 2 days ago
And if you add something like pgBouncer or whatever, this should still work but a session lock would fuck things up.
antirez 3 days ago
Btw, things to note in random order:
1. Check my comment under this blog post. The author had missed a fundamental point in how the algorithm works. Then he based the refusal of the algorithm on the remaining weaker points.
2. It is not true that you can't wait an approximately correct amount of time, with modern computers an APIs. GC pauses are bound and monotonic clocks work. These are acceptable assumptions.
3. To critique the auto release mechanism in-se, because you don't want to expose yourself to the fact that there is a potential race, is one thing. To critique the algorithm in front of its goals and its system model is another thing.
4. Over the years Redlock was used in a huge amount of use cases with success, because if you pick a timeout which is much larger than: A) the time to complete the task. B) the random pauses you can have in normal operating systems. Race conditions are very hard to trigger, and the other failures in the article were, AFAIK, never been observed. Of course if you have a super small timeout to auto release the lock, and the task may easily take this amount of time, you just committed a deisgn error, but that's not about Redlock.
computerfan494 3 days ago
Would you use RedLock in a situation where the timeout is fairly short (1-2 seconds maybe), the work done usually takes ~90% of that timeout, and the work you do while holding a RedLock lock MUST NOT be done concurrently with another lock holder?
I think the correct answer here is always "No" because the risk of the lease sometimes expiring before the client has finished its work is very high. You must alter your work to be idempotent because RedLock cannot guarantee mutual exclusion under all circumstances. Optimistic locking is a good way to implement this type of thing while the work done is idempotent.
kgeist 3 days ago
We had corrupted data bacause of this.
antirez 3 days ago
Btw, things to note in random order:
1. Check my comment under this blog post. The author had missed a fundamental point in how the algorithm works. Then he based the refusal of the algorithm on the remaining weaker points.
2. It is not true that you can't wait an approximately correct amount of time, with modern computers an APIs. GC pauses are bound and monotonic clocks work. These are acceptable assumptions.
3. To critique the auto release mechanism in-se, because you don't want to expose yourself to the fact that there is a potential race, is one thing. To critique the algorithm in front of its goals and its system model is another thing.
4. Over the years Redlock was used in a huge amount of use cases with success, because if you pick a timeout which is much larger than: A) the time to complete the task. B) the random pauses you can have in normal operating systems. Race conditions are very hard to trigger, and the other failures in the article were, AFAIK, never been observed. Of course if you have a super small timeout to auto release the lock, and the task may easily take this amount of time, you just committed a deisgn error, but that's not about Redlock.
computerfan494 3 days ago
The critical point that users must understand is that it is impossible to guarantee that the RedLock client never holds its lease longer than the timeout. Compounding this problem is that the longer you make your timeout to minimize the likelihood of this from accidentally happening, the less responsive your system becomes during genuine client misbehaviour.
antirez 3 days ago
1. E-commerce system where there are a limited amount of items of the same kind, you don't want to oversell.
2. Hotel booking system where we don't want to reserve the same dates/rooms multiple times.
3. Online medical appointments system.
In all those systems, to re-open the item/date/... after some time it's ok, even after one day. And if the lock hold time is not too big, but a very strict compromise (it's also a reasonable choice in the spectrum), and it could happen that during edge case failures three items are sold and there are two, orders can be cancelled.
So yes, there is a tension between timeout, race condition, recovery time, but in many systems using something like RedLock the development and end-user experience can be both improved with a high rate of success, and the random unhappy event can be handled. Now the algorithm is very old, still used by many implementations, and as we are talking problems are solved in a straightforward way with very good performances. Of course, the developers of the solution should be aware that there are tradeoffs between certain values: but when are distributed systems easy?
P.S. why 10 years of strong usage count, in the face of a blog post telling that you can't trust a system like that? Because even if DS issues emerge randomly and sporadically, in the long run systems that create real-world issues, if they reach mass usage, are known. A big enough user base is a continuous integration test big enough to detect when a solution has real world serious issues. So of course RedLock users picking short timeouts with tasks that take a very hard to predict amount of time, will indeed incur into knonw issues. But the other systemic failure modes described in the blog post are never mentioned by users AFAIK.
computerfan494 3 days ago
If you want to say "RedLock is correct a very high percentage of the time when lease timeouts are tuned for the workload", I would agree with you actually. I even possibly agree with the statements "most systems can tolerate unlikely correctness failures due to RedLock lease violations. Manual intervention is fine in those cases. RedLock may allow fast iteration times and is worth this cost". I just think it's important to be crystal clear on the guarantees RedLock provides.
I first read Martin's blog post and your response years ago when I worked at a company that was using RedLock despite it not being an appropriate tool. We had an outage caused by overlapping leases because the original implementor of the system didn't understand what Martin has pointed out from the RedLock documentation alone.
I've been a happy Redis user and fan of your work outside of this poor experience with RedLock, by the way. I greatly appreciate the hard work that has gone into making it a fantastic database.
anonzzzies 3 days ago
cosmicradiance 3 days ago