244 points by yusufaytas 9 months ago | 95 comments
jojolatulipe 9 months ago
wslh 8 months ago
refset 8 months ago
amievil 8 months ago
robertlagrant 9 months ago
calmoo 9 months ago
Icathian 9 months ago
eknkc 9 months ago
Felt pretty safe about it so far, but I just realised I never check if the db connection is still ok. If this is a db-related job and I need to touch the db, fine: some query will fail on the connection and my job will fail anyway. Otherwise I might have already lost the lock and not be aware of it.
Without fencing tokens, atomic ops and such, I guess one needs two-phase commit on everything for absolute correctness?
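A minimal sketch of that connection-health concern, assuming a Postgres session-level advisory lock held over psycopg2 (the function name and usage are mine, not from the thread): probe the lock-holding connection before doing a side effect. As the comment above implies, this only narrows the window; without fencing tokens the connection can still die right after the probe.

    import psycopg2

    def still_holds_session_lock(conn):
        # If any statement on the lock-holding session succeeds, the session
        # (and its session-level advisory lock) was still alive at that moment.
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
            return True
        except psycopg2.OperationalError:
            return False  # connection gone => the server has released the advisory lock

    # Usage (racy by nature -- the check and the side effect are not atomic):
    # if still_holds_session_lock(conn):
    #     do_side_effect()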
Quekid5 9 months ago
AFAIK the only correct way to do what you probably thought you were doing is "EXCLUSIVE" or "ACCESS EXCLUSIVE"... or two-phase commit or idempotency for the operations you're doing.
[0] https://www.postgresql.org/docs/current/explicit-locking.htm...
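For context, a rough sketch of what a table-level lock looks like from application code, assuming Python/psycopg2 (the "jobs" table and the DSN are invented). The LOCK statement must run inside a transaction and is held until commit or rollback:

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # placeholder DSN
    try:
        with conn:  # psycopg2 commits on clean exit, rolls back on exception
            with conn.cursor() as cur:
                # Blocks every other access to the table until the transaction ends.
                cur.execute("LOCK TABLE jobs IN ACCESS EXCLUSIVE MODE")
                cur.execute("UPDATE jobs SET state = 'running' WHERE id = %s", (42,))
    finally:
        conn.close()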
skrause 8 months ago
Are you sure that you're talking about the same locks? What are the pitfalls exactly?
candiddevmike 9 months ago
Quekid5 9 months ago
joatmon-snoo 9 months ago
Postgres allows obtaining advisory locks at either the session _or_ transaction level. If it's session-level, then you have, ergo, a connection-level lock.
https://www.postgresql.org/docs/current/explicit-locking.htm...
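To make the distinction concrete, a small sketch assuming psycopg2 and an arbitrary lock key (1234): pg_advisory_lock/pg_advisory_unlock take the session-level lock, pg_advisory_xact_lock the transaction-level one.

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # placeholder DSN
    conn.autocommit = True                 # so we control transactions explicitly below
    cur = conn.cursor()

    # Session-level: bound to this connection, survives commits, must be
    # released explicitly (or is released when the connection closes).
    cur.execute("SELECT pg_advisory_lock(1234)")
    # ... critical section ...
    cur.execute("SELECT pg_advisory_unlock(1234)")

    # Transaction-level: released automatically at COMMIT or ROLLBACK.
    cur.execute("BEGIN")
    cur.execute("SELECT pg_advisory_xact_lock(1234)")
    # ... critical section ...
    cur.execute("COMMIT")

    conn.close()

This is also why a transaction-pooling proxy such as pgBouncer (mentioned below) breaks the session-level variant: successive statements from one client may land on different server sessions.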
skrause 8 months ago
m11a 8 months ago
eknkc 8 months ago
And if you add something like pgBouncer or whatever, this should still work but a session lock would fuck things up.
antirez 9 months ago
Btw, things to note in random order:
1. Check my comment under this blog post. The author missed a fundamental point about how the algorithm works, and then based his rejection of the algorithm on the remaining, weaker points.
2. It is not true that you can't wait an approximately correct amount of time with modern computers and APIs: GC pauses are bounded and monotonic clocks work. These are acceptable assumptions.
3. To critique the auto-release mechanism in itself, because you don't want to expose yourself to a potential race, is one thing. To critique the algorithm against its stated goals and system model is another thing.
4. Over the years Redlock has been used successfully in a huge number of use cases, because if you pick a timeout which is much larger than A) the time to complete the task and B) the random pauses you can have in normal operating systems, race conditions are very hard to trigger, and the other failures in the article have, AFAIK, never been observed. Of course, if you have a super small auto-release timeout and the task may easily take that amount of time, you have committed a design error, but that's not about Redlock.
computerfan494 9 months ago
Would you use RedLock in a situation where the timeout is fairly short (1-2 seconds maybe), the work done usually takes ~90% of that timeout, and the work you do while holding a RedLock lock MUST NOT be done concurrently with another lock holder?
I think the correct answer here is always "No", because the risk of the lease sometimes expiring before the client has finished its work is very high. You must make your work idempotent, because RedLock cannot guarantee mutual exclusion under all circumstances. Optimistic locking is a good way to implement this kind of thing when the work is idempotent.
kgeist 9 months ago
We had corrupted data because of this.
computerfan494 9 months ago
The critical point that users must understand is that it is impossible to guarantee that the RedLock client never holds its lease longer than the timeout. Compounding this problem, the longer you make your timeout to minimize the likelihood of this happening accidentally, the less responsive your system becomes during genuine client misbehaviour.
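For readers following along, this is the mechanism under debate: the single-instance locking pattern from the Redis documentation (Redlock runs it against several independent instances). The key auto-expires after the TTL whether or not the holder has finished, and the release script only deletes the key if the random token still matches. A sketch assuming redis-py; key name and TTL are illustrative.

    import uuid
    import redis

    r = redis.Redis()          # assumed local instance
    LOCK_KEY = "lock:resource"
    TTL_MS = 30_000

    token = uuid.uuid4().hex   # lets us release only a lock we still own

    # SET key token NX PX ttl -- acquire only if the key doesn't exist yet.
    acquired = r.set(LOCK_KEY, token, nx=True, px=TTL_MS)
    if acquired:
        try:
            pass  # ... do the work; if it outlives TTL_MS, the lock is silently gone ...
        finally:
            # Check-and-delete atomically, so we never remove someone else's lock.
            RELEASE = """
            if redis.call("get", KEYS[1]) == ARGV[1] then
                return redis.call("del", KEYS[1])
            else
                return 0
            end
            """
            r.eval(RELEASE, 1, LOCK_KEY, token)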
antirez 9 months ago
1. E-commerce system where there is a limited number of items of the same kind and you don't want to oversell.
2. Hotel booking system where we don't want to reserve the same dates/rooms multiple times.
3. Online medical appointments system.
In all those systems, it's ok for the item/date/... to be re-opened after some time, even after one day. And if the lock hold time is instead kept small, as a stricter compromise (also a reasonable choice in the spectrum), it could happen that during edge-case failures three items are sold when only two exist; those orders can be cancelled.
So yes, there is a tension between timeout, race conditions and recovery time, but in many systems using something like RedLock both the development and the end-user experience can be improved, with a high rate of success, and the rare unhappy event can be handled. The algorithm is now very old and still used by many implementations, and as we speak it solves problems in a straightforward way with very good performance. Of course, the developers using the solution should be aware that there are tradeoffs between certain values: but when are distributed systems easy?
P.S. Why do 10 years of heavy usage count, in the face of a blog post saying you can't trust a system like that? Because even if distributed-systems issues emerge randomly and sporadically, in the long run systems that create real-world issues become known, if they reach mass usage. A big enough user base is a continuous integration test big enough to detect when a solution has serious real-world issues. So of course RedLock users who pick short timeouts for tasks that take a very hard-to-predict amount of time will indeed run into known issues. But the other systemic failure modes described in the blog post are, AFAIK, never mentioned by users.
computerfan494 9 months ago
If you want to say "RedLock is correct a very high percentage of the time when lease timeouts are tuned for the workload", I would agree with you actually. I even possibly agree with the statements "most systems can tolerate unlikely correctness failures due to RedLock lease violations. Manual intervention is fine in those cases. RedLock may allow fast iteration times and is worth this cost". I just think it's important to be crystal clear on the guarantees RedLock provides.
I first read Martin's blog post and your response years ago when I worked at a company that was using RedLock despite it not being an appropriate tool. We had an outage caused by overlapping leases because the original implementor of the system didn't understand what Martin has pointed out from the RedLock documentation alone.
I've been a happy Redis user and fan of your work outside of this poor experience with RedLock, by the way. I greatly appreciate the hard work that has gone into making it a fantastic database.
anonzzzies 9 months ago
cosmicradiance 9 months ago
egcodes 9 months ago
dataflow 9 months ago
Hold on, this sounds absurd to me:
First, if your client crashes, then you don't need a timed lease on the lock to detect this in the first place. The lock would get released by the OS or supervisor, whether there are any timeouts or not. If both of those crash too, then the connection would eventually break, and the network system should then detect that (via network resets or timeouts, lack of heartbeats, etc.) and then invalidate all your connections before releasing any locks.
Second, if the problem becomes that your client is buggy and thus holds the lock too long without crashing, then shouldn't some kind of supervisor detect that and then kill the client (e.g., by the OS terminating the process) before releasing the lock for everybody else?
Third, if you are going to have locks with timeouts to deal with corner cases you can't handle like the above, shouldn't they notify the actual program somehow (e.g., by throwing an exception, raising a signal, terminating it, etc.) instead of letting it happily continue execution? And shouldn't those cases wait for some kind of verification that the program was notified before releasing the lock?
The whole notion that timeouts should somehow permit the program execution to continue ordinary control flow sounds like the root cause of the problem, and nobody is even batting an eye at it? Is there an obvious reason why this makes sense? I feel I must be missing something here... what am I missing?
winwang 9 months ago
dataflow 9 months ago
winwang 9 months ago
dataflow 9 months ago
To your question, could you clarify what exactly you mean by the rack "going down"? This encompasses a lot of different scenarios, I'm not sure which one you're asking about. The obvious interpretation would break all the connections the program has to the outside world, thus preventing the problem by construction.
winwang 9 months ago
dataflow 9 months ago
winwang 9 months ago
dataflow 9 months ago
I get that -- and honestly, I'm not expecting a treatise on distributed consensus here. But what took me aback was that the blog post didn't even attempt to mention anything about the fact that the premise (at first glance) looks glaringly broken. If he'd even said 1 single sentence like "it's {difficult/infeasible/impossible} to design a client that will never continue execution past a timeout", it'd have been fine, and I would've happily moved along. But the way it is written right now, it reads a little bit like: "we design a ticking time bomb that we can't turn off; how can we make sure we don't forget to reset the timer every time?"... without bothering to say anything about why we should be digging ourselves into such a hole in the first place.
winwang 9 months ago
dataflow 9 months ago
Also for what it's worth, I can guess what some of the answers might be. For example, it's possible you'd need very precise timing facilities that aren't always available, in order to be able to guarantee high throughput with correctness (like Google Spanner's). Or it might be that doing so requires a trade-off between availability and partition-tolerance that in some applications isn't justified. But I'm curious what the answer actually is, rather than just (semi-)random guesses as to what it could be.
wbl 9 months ago
neonbrain 9 months ago
neonbrain 9 months ago
dataflow 9 months ago
neonbrain 9 months ago
dataflow 9 months ago
Timeouts were a red herring in my comment. My problem wasn't with the mere existence of timeouts in corner cases, it was the fact that the worker is assumed to keep working merrily on, despite the timeouts. That's what I don't understand the justification for. If the worker is dead, then it's a non-issue, and the lease can be broken. If the system is alive, the host can discover (via RST, heartbeats, or other timeouts) that the storage system is unreachable, and thus prevent the program from continuing execution -- and at that point the storage service can still break the lease (via a timeout), but it would actually come with a timing-based guarantee that the program will no longer continue execution.
hoppp 9 months ago
It's using FoundationDB, a distributed db. The Deno instances running on local devices all connect to the same Deno KV to acquire the lock.
But with Postgres a SELECT FOR UPDATE also works, though that database is not distributed.
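A rough sketch of the SELECT FOR UPDATE pattern mentioned above, assuming psycopg2 (the "jobs" table and columns are invented for illustration):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # placeholder DSN
    with conn:  # one transaction; the row lock is held until commit/rollback
        with conn.cursor() as cur:
            # Concurrent transactions selecting the same row FOR UPDATE block here.
            cur.execute("SELECT state FROM jobs WHERE id = %s FOR UPDATE", (42,))
            row = cur.fetchone()
            if row and row[0] == "pending":
                cur.execute("UPDATE jobs SET state = 'running' WHERE id = %s", (42,))
    conn.close()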
jroseattle 9 months ago
Our use case: handing out a ticket (something with an identifier) from a finite set of tickets from a campaign. It's something akin to Ticketmaster allocating seats in a venue for a concert. Our operation was as you might expect: provide a ticket to a request if one is available, assign some metadata from the request to the allocated ticket, and remove it from consideration for future client requests.
We had failed campaigns in the past (over-allocation, under-allocation, duplicate allocation, etc.) so our concern was accuracy. Clients would connect and request a ticket; we wanted to exclusively distribute only the set of tickets available from the pool. If the number of client requests exceeded the number of tickets, the system should protect for that.
We tried Redis, including the naive implementation of getting the lock, checking the lock, doing our thing, releasing the lock. It was ok, but administrative overhead was a lot for us at the time. I'm glad we didn't go that route, though.
We ultimately settled on...Postgres. Our "distributed lock" was just a composite UPDATE statement using some Postgres-specific features. We effectively turned requests into a SET operation, where the database would return either a record that indicated the request was successful, or something that indicated it failed. ACID transactions for the win!
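The comment doesn't show the actual statement, but a single atomic UPDATE of roughly this shape (schema invented for illustration) captures the idea: claim one free ticket and learn from the returned row whether you got one. FOR UPDATE SKIP LOCKED is one of the Postgres-specific features that helps under contention; sketch assumes psycopg2.

    import psycopg2

    CLAIM_SQL = """
        UPDATE tickets
           SET claimed_by = %(user)s, claimed_at = now()
         WHERE id = (SELECT id
                       FROM tickets
                      WHERE campaign_id = %(campaign)s
                        AND claimed_by IS NULL
                      ORDER BY id
                      LIMIT 1
                        FOR UPDATE SKIP LOCKED)
        RETURNING id
    """

    def claim_ticket(conn, campaign_id, user_id):
        with conn:  # one transaction per claim; ACID does the heavy lifting
            with conn.cursor() as cur:
                cur.execute(CLAIM_SQL, {"campaign": campaign_id, "user": user_id})
                row = cur.fetchone()
                return row[0] if row else None  # None => nothing left to allocate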
With accuracy solved, we next looked at scale/performance. We didn't need to support millions of requests/sec, but we did have some spikiness thresholds. We were able to optimize read/write db instances within our cluster, and strategically load larger/higher-demand campaigns to allocated systems. We continued to improve on optimization over two years, but not once did we ever have a campaign with ticket distribution failures.
Note: I am not an expert of any kind in distributed-lock technology. I'm just someone who did their homework, focused on the problem to be solved, and found a solution after trying a few things.
nh2 9 months ago
Your UPDATE transaction lasts just a few microseconds, so you can just centralise the problem and that's good because it's simpler, faster and safer.
But this is not a _distributed_ problem, as the article explains:
> remember that a lock in a distributed system is not like a mutex in a multi-threaded application. It’s a more complicated beast, due to the problem that different nodes and the network can all fail independently in various ways
You need distributed locking if the transactions can take seconds or hours, and the machines involved can fail while they hold the lock.
fny 9 months ago
nh2 9 months ago
"the machines involved can fail" must also include the postgres machines.
To get that, you need to coordinate multiple postgres servers, e.g. using ... distributed locking. Postgres does not provide that out of the box -- neither multi-master setups, nor master-standby synchronous replication with automatic failover. Wrapper software that provides that, such as Stolon and Patroni, use distributed KV stores / lock managers such as etcd and Consul to provide it.
jroseattle 9 months ago
50000?
> You need distributed locking if the transactions can take seconds or hours, and the machines involved can fail while they hold the lock.
From my experience, locks are needed to ensure synchronized access to resources. Distributed locks are a form of that isolation being held across computing processes, as opposed to the mutex example provided.
And while our implementation definitively did not use a distributed lock, we could still see those machines fail.
I fail to understand why a distributed lock is needed for anything due to its duration.
throwawaythekey 8 months ago
If you require high throughput and have a high duration then partitioning/distribution are the normal solution.
stickfigure 9 months ago
In your case, the constraint is "don't sell more than N tickets". For most realistic traffic volumes for that kind of problem, you can solve it with traditional rdbms transactional behavior and let it manage whatever locking it uses internally.
I wish developers were a lot slower to reach for "I'll build distributed locks". There's almost always a better answer, but it's specific to each application.
jroseattle 9 months ago
Maybe we were lucky in our implementation, but a key factor for our decision was understanding how to manage the systems in our environment. We would have skilled up with Redis, but we felt our Postgres solution would be a good first step. We just haven't had a need to go to a second step yet.
nasretdinov 9 months ago
tonyarkles 9 months ago
Whenever anyone would come and ask for help with a planned distributed system, the first question I would always ask is: does this system actually need to be distributed?! In my 15 years of consulting I think the answer was only actually "yes" 2 or 3 times. Much more often I was helping them solve the performance problems in their single-server system; without doing that they would usually just have ended up with a slow, complex distributed system.
Edit: lol this paper was not popular in the Distributed Systems Group at my school: https://www.usenix.org/system/files/conference/hotos15/hotos...
“You can have a second computer once you’ve shown you know how to use the first one.”
Agingcoder 9 months ago
etcd 9 months ago
wwarner 9 months ago
hansvm 9 months ago
That's a bit strong. Like most of engineering, it depends. Postgres is a good solution if you only have maybe 100k QPS, the locks are logically (if not necessarily fully physically) partially independent, and they aren't held for long. Break any of those constraints, or add anything weird (inefficient postgres clients, high DB load, ...), and you start having to explore either removing those seeming constraints or using other solutions.
wwarner 9 months ago
zbobet2012 9 months ago
OnlyMortal 9 months ago
It’s based on Postgres but performance was not good enough.
We’re now moving to RDMA.
apwell23 9 months ago
galeaspablo 9 months ago
Or they care but don’t bother checking whether what they’re doing is correct.
For example, in my field, where microservices/actors/processes pass messages between each other over a network, I dare say >95% of implementations I see have edge cases where messages might be lost or processed out of order.
But there isn’t an alignment of incentives that fixes this problem. Ie the payment structures for executives and engineers aren’t aligned with the best outcome for customers and shareholders.
noprocrasted 9 months ago
"Microservices" itself is often a symptom of this problem.
Everyone and their dog wants to introduce a network boundary between function calls for no good reason, just so they can subsequently have endless busywork writing HTTP (or gRPC if you're lucky) servers, clients & JSON (de?)serializers for said function calls, trying to reimplement things like distributed transactions across said network boundary, and dealing with the inevitable "spooky action at a distance" that this will yield.
sethammons 9 months ago
The monoliths I have worked in, by contrast, have had issues coordinating changes within the codebase: code crosses boundaries it should not, and datastores get shared and coupled to (what should be) different domains, leading to slow, inefficient code and ossified options for product changes.
shepherdjerred 9 months ago
sethammons 9 months ago
sethammons 9 months ago
Getting everyone onboard is hard, and that is why good leadership is needed. When customers start to churn because bugs pop up and new features are slow or non-existent, the case for making quality part of the process becomes very easy to make. Mature leaders get ahead of that as early as possible.
galeaspablo 9 months ago
Leaders tend to be impatient and think of this quarter's OKRs as opposed to the business' long-term financial health. In other words, the leaders of leaders use standard MBA-prescribed incentive structures.
secondcoming 9 months ago
Eek. This sort of thing can end up with innocent people in jail, or dead.
[0] https://en.wikipedia.org/wiki/British_Post_Office_scandal
noprocrasted 9 months ago
So I'm not particularly sure this is a good example - if anything, it sets the opposite incentives, that even jailing people or driving them to suicide won't actually have any consequences for you.
mrkeen 9 months ago
But I don't see any way to convince yesterday's managers to give us time to build it right.
jmull 9 months ago
* If you have something like what the article calls a fencing token, you don't need any locks.
* The token doesn't need to be monotonically increasing, just a passive unique value that both the client and storage have.
Let's call it a version token. It could be monotonically increasing, but a generated UUID, which is typically easier, would work too. (Technically, it could even be a hash of all the data in the store, though that's probably not practical.) The logic becomes:
(1) client retrieves the current version token from storage, along with any data it may want to modify. There's no external lock, though the storage needs to retrieve the data and version token atomically, ensuring the token is specifically for the version of the data retrieved.
(2) client sends the version token back along with any changes.
(3) Storage accepts the changes if the current token matches the one passed with the changes and creates a new version token (atomically, but still no external locks).
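A toy sketch of those three steps, with an in-memory stand-in for the storage; a real store would offer the same atomic read and conditional write (e.g. a compare-and-set or a conditional UPDATE), which is an assumption here rather than something the comment prescribes.

    import threading
    import uuid

    class VersionedStore:
        def __init__(self, data):
            self._mutex = threading.Lock()  # internal, non-distributed synchronization
            self._data = data
            self._token = uuid.uuid4().hex

        def read(self):
            # Step 1: return the data and its version token atomically.
            with self._mutex:
                return dict(self._data), self._token

        def write(self, new_data, token):
            # Steps 2-3: accept the change only if the token still matches,
            # then issue a fresh token for the new version.
            with self._mutex:
                if token != self._token:
                    return False  # someone wrote in between; caller re-reads and retries
                self._data = dict(new_data)
                self._token = uuid.uuid4().hex
                return True

    store = VersionedStore({"count": 0})
    data, token = store.read()
    data["count"] += 1
    while not store.write(data, token):
        data, token = store.read()  # conflict: re-read and retry
        data["count"] += 1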
Now, you can introduce locks for other reasons (hopefully good ones... they seem to be misused a lot). Just pointing out they are/should be independent of storage integrity in a distributed system.
(I don't even like the term lock, because they are temporary/unguaranteed. Lease or reservation might be a term that better conveys the meaning.)
cnlwsu 9 months ago
zeroxfe 9 months ago
You're misinterpreting the problem described, and proposing a solution for a different problem.
karmakaze 9 months ago
jameshart 9 months ago
zeroxfe 9 months ago
karmakaze 9 months ago
I'm not even sure how it could be used for exclusive update to a resource elsewhere--all clients will think they 'have' the lock and change the resource, then find out they didn't when they update the lock. Or if they bump the lock first, another client could immediately 'have' the lock too.
wh0knows 9 months ago
> Efficiency: Taking a lock saves you from unnecessarily doing the same work twice (e.g. some expensive computation). If the lock fails and two nodes end up doing the same piece of work, the result is a minor increase in cost (you end up paying 5 cents more to AWS than you otherwise would have) or a minor inconvenience (e.g. a user ends up getting the same email notification twice).
I think multiple nodes doing the same work is actually much worse than what’s listed, as it would inhibit you from having any kind of scalable distributed processing.
karmakaze 9 months ago
jmull 9 months ago
To be clear, my point is don't use distributed locking for correctness. There are much better options.
Now, the atomicity I mention implies some kind of internal synchronization mechanism for multiple requests, which could be based on locks, but those would be real, non-distributed ones.
jmull 9 months ago
Efficiency is one, as you say.
The other main one that comes to mind is to implement other "business rules" (hate that term, but that's what people use): in an online shopping app, say, the stock to fulfill an order might be reserved for a time when the user starts the checkout process.
bootsmann 9 months ago
E.g. your storage system has two nodes and there are two read-modify-write processes running. Process 1 acquires the token "abc" and process 2 also acquires the token "abc". Now process 1 commits, the token is changed to "cde", and the change is streamed to node 2. Due to network delay, the update to node 2 is delayed. Meanwhile process 2 commits to node 2 with token "abc". Node 2 accepts the change because it has not yet received the message from node 1, and your system is now in an inconsistent state.
Note that this cannot happen in a scenario where we have monotonically increasing fencing tokens because that requirement forces the nodes to agree on a total order of operations before they can supply the fencing token.
computerfan494 9 months ago
bootsmann 9 months ago
computerfan494 9 months ago
bootsmann 9 months ago
jmull 9 months ago
So node2 doesn't get to accept changes. It can only send changes to storage, which may or may not be accepted by it.
bootsmann 9 months ago
eru 9 months ago
(Honestly, they should rename `--force-with-lease` to just `--force`, and rename the old `--force` behaviour to `--force-with-extreme-prejudice` or something like that. Basically make the new behaviour the default `--force` behaviour.)
pyrolistical 9 months ago
eru 8 months ago