Hacker Remix

How to do distributed locking (2016)

241 points by yusufaytas 3 days ago | 94 comments

jojolatulipe 3 days ago

At work we use Temporal and ended up using a dedicated workflow and signals to do distributed locking. Working well so far and the implementation is rather simple, relying on Temporal’s facilities to do the distributed parts of the lock.

wslh 2 days ago

I just discovered Temporal, and I have to say thank you! From what I've seen so far, it seems like the holy grail for workflows, offering very clear high-level task management over complex infrastructure. Is Temporal unique in this space, or are there other alternatives of similar caliber? Given that it was spun off from Uber and is used by top vendors, it sounds like it’s been thoroughly battle-tested.

refset 2 days ago

DBOS [0] is outwardly similar although it's much younger. IIUC, internally DBOS is able to be more efficient and support lower latencies than Temporal because of the way it can push work down into Postgres stored procedures.

[0] https://www.dbos.dev/

robertlagrant 3 days ago

I'm keen to use Temporal, but I've heard it can be flaky. In your experience has it worked well?

calmoo 3 days ago

Rock solid in my experience and kind of a game changer. I’m surprised it’s not more widespread in large orgs.

Icathian 3 days ago

We use it a ton at my shop for internal things like release rollouts. Fairly big tech company, and same experience. It's an excellent product.

eknkc 3 days ago

I tend to use PostgreSQL for distributed locking. As in, even if the job is not DB related, I start a transaction and obtain an advisory lock, which stays held until the transaction ends, either because the app ends it or due to a crash or something.

Felt pretty safe about it so far but I just realised I never check if the db connection is still ok. If this is a db related job and I need to touch the db, fine. Some query will fail on the connection and my job will fail anyway. Otherwise I might have already lost the lock and not be aware of it.

Without fencing tokens, atomic ops and such, I guess one needs a two stage commit on everything for absolute correctness?
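
For readers unfamiliar with the fencing tokens mentioned above, here is a minimal in-memory sketch of the idea. The `LockService` and `FencedStore` classes are hypothetical stand-ins invented for illustration, not any real library's API: the lock service hands out a strictly increasing token with each grant, and the storage side refuses writes that carry a token older than one it has already seen.

```python
class LockService:
    """Toy lock service (hypothetical): one strictly increasing token per grant."""

    def __init__(self):
        self._next = 0

    def acquire(self):
        self._next += 1
        return self._next


class FencedStore:
    """Toy storage service (hypothetical) that rejects stale fencing tokens."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        # Reject any writer whose token is older than one already seen.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True


locks = LockService()
store = FencedStore()

old = locks.acquire()  # client A gets token 1, then stalls or loses its connection
new = locks.acquire()  # the lock is handed over, client B gets token 2

accepted_new = store.write(new, "from B")  # accepted
accepted_old = store.write(old, "from A")  # rejected, because 1 < 2
```

The point of the sketch is that the check lives in the resource being protected, so a client that has lost its lock without noticing cannot overwrite newer work.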

Quekid5 3 days ago

Advisory locks have many pitfalls, see [0].

AFAIK the only correct way to do what you probably thought you were doing is "EXCLUSIVE" or "ACCESS EXCLUSIVE"... or two-phase commit or idempotency for the operations you're doing.

[0] https://www.postgresql.org/docs/current/explicit-locking.htm...

skrause 2 days ago

You link to table level locks which are different from advisory locks: https://www.postgresql.org/docs/current/explicit-locking.htm...

Are you sure that you're talking about the same locks? What are the pitfalls exactly?

candiddevmike 3 days ago

One gotcha maybe with locks is they are connection specific AFAIK, and in most libraries you're using a pool typically. So you need to have a specific connection for locks, and ensure you're using that connection when doing periodic lock tests.

Quekid5 3 days ago

Why would locks be connection-specific? ... considering that only one operation can be in flight at a time on a single connection. (Usually, at least.)

joatmon-snoo 3 days ago

Different DBs implement locks differently.

Postgres allows obtaining advisory locks at either the session _or_ transaction level. If it's session-level, then you have, ergo, a connection-level lock.

https://www.postgresql.org/docs/current/explicit-locking.htm...

skrause 2 days ago

PostgreSQL has pg_advisory_xact_lock which releases the lock automatically when the transaction is over.
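
A rough sketch of how the transaction-level variant might be wrapped from application code. The helper name `xact_advisory_lock` is made up for illustration, and `conn` is assumed to be a DB-API connection to PostgreSQL that is not in autocommit mode (psycopg2-style `%s` parameters are assumed) and is dedicated to this job rather than borrowed per query from a pool.

```python
from contextlib import contextmanager


@contextmanager
def xact_advisory_lock(conn, key):
    """Hold a transaction-level advisory lock for the duration of the block.

    Assumes `conn` is a non-autocommit DB-API connection to PostgreSQL.
    """
    cur = conn.cursor()
    try:
        # Blocks until the lock is granted; PostgreSQL releases it
        # automatically when the transaction commits or rolls back.
        cur.execute("SELECT pg_advisory_xact_lock(%s)", (key,))
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Usage would look like `with xact_advisory_lock(conn, 42): run_job()`, where `run_job` is whatever work needs the lock. Nothing here detects a connection that died silently in the middle of the job, which is the gap raised earlier in the thread.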

m11a 2 days ago

But then you’d be holding a DB connection for the entire duration of your task (which may include HTTP calls, etc). You might even do asynchronous work in parallel, which doesn’t quite work with txn locks. So the session based locks seem a bit better imo.

eknkc 2 days ago

I personally do these in .NET, I obtain a connection dedicated to that operation, start a transaction, obtain lock and go crazy. Upon completion of the async workflow, the transaction closes and lock releases. I know I'm holding up a connection and putting some pressure on postgres by keeping a transaction open but session management might be harder as the underlying connection provider uses pooling and it is easier to use transactions rather than sessions here.

And if you add something like pgBouncer or whatever, this should still work but a session lock would fuck things up.

antirez 3 days ago

I suggest reading the comment I left back then in this blog post comments section, and the reply I wrote in my blog.

Btw, things to note in random order:

1. Check my comment under this blog post. The author had missed a fundamental point in how the algorithm works. Then he based the refusal of the algorithm on the remaining weaker points.

2. It is not true that you can't wait an approximately correct amount of time, with modern computers and APIs. GC pauses are bounded and monotonic clocks work. These are acceptable assumptions.

3. To critique the auto release mechanism per se, because you don't want to expose yourself to the fact that there is a potential race, is one thing. To critique the algorithm with respect to its goals and its system model is another thing.

4. Over the years Redlock was used in a huge number of use cases with success, because if you pick a timeout which is much larger than A) the time to complete the task and B) the random pauses you can have in normal operating systems, race conditions are very hard to trigger, and the other failures in the article have, AFAIK, never been observed. Of course if you have a super small timeout to auto release the lock, and the task may easily take this amount of time, you just committed a design error, but that's not about Redlock.
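
Points 2 and 4 can be illustrated with a small client-side lease check. This is only a sketch under stated assumptions, not Redlock itself: the `Lease` class, the `step_cost` and `margin` parameters, and the numbers are all invented for illustration. It tracks expiry with a monotonic clock and proceeds only when the next step plus an allowance for pauses still fits inside the lease.

```python
import time


class Lease:
    """Client-side view of a lock that auto-releases after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        # A monotonic clock is unaffected by wall-clock adjustments.
        self._clock = clock
        self._expires_at = clock() + ttl

    def remaining(self):
        return self._expires_at - self._clock()

    def safe_to_proceed(self, step_cost, margin):
        # Continue only if the next step plus a pause allowance fits in the lease.
        return self.remaining() > step_cost + margin


# A fake clock makes the behaviour easy to follow without sleeping.
now = [0.0]
lease = Lease(ttl=30.0, clock=lambda: now[0])

ok_early = lease.safe_to_proceed(step_cost=2.0, margin=5.0)  # 30s left, needs 7s

now[0] = 25.0
ok_late = lease.safe_to_proceed(step_cost=2.0, margin=5.0)   # 5s left, needs 7s
```

A check like this narrows the window but cannot close it: a pause that hits after the check and lasts longer than `margin` still lets the work overrun the lease.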

computerfan494 3 days ago

To be honest I've long been puzzled by your response blog post. Maybe the following question can help achieve common ground:

Would you use RedLock in a situation where the timeout is fairly short (1-2 seconds maybe), the work done usually takes ~90% of that timeout, and the work you do while holding a RedLock lock MUST NOT be done concurrently with another lock holder?

I think the correct answer here is always "No" because the risk of the lease sometimes expiring before the client has finished its work is very high. You must alter your work to be idempotent because RedLock cannot guarantee mutual exclusion under all circumstances. Optimistic locking is a good way to implement this type of thing while the work done is idempotent.
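
As a sketch of the optimistic locking mentioned here, the in-memory `VersionedRecord` below is a hypothetical stand-in for a row with a version column. A write is applied only if the version is unchanged since it was read, so two workers that both believe they hold the lock cannot both succeed.

```python
class VersionedRecord:
    """In-memory stand-in (hypothetical) for a row with a version column."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def compare_and_set(self, expected_version, new_value):
        # Apply the write only if nobody else has written since we read.
        if self.version != expected_version:
            return False
        self.value = new_value
        self.version += 1
        return True


record = VersionedRecord(10)

# Two workers read the same version, as could happen if a lease expired early.
value_a, version_a = record.read()
value_b, version_b = record.read()

first = record.compare_and_set(version_a, value_a - 1)   # succeeds
second = record.compare_and_set(version_b, value_b - 1)  # rejected: stale version
```

In a real database the same check is typically a conditional update that matches on the version column. The rejected worker would then re-read and retry or give up.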

kgeist 3 days ago

>because the risk of the lease sometimes expiring before the client has finished its work is very high

We had corrupted data because of this.

antirez 3 days ago

The timeout must be much larger than the time required to do the work. The point is that distributed locks without a release mechanism are in practical terms very problematic.

computerfan494 3 days ago

Locking without a timeout is indeed in the majority of use-cases a non-starter, we are agreed there.

The critical point that users must understand is that it is impossible to guarantee that the RedLock client never holds its lease longer than the timeout. Compounding this problem is that the longer you make your timeout to minimize the likelihood of this from accidentally happening, the less responsive your system becomes during genuine client misbehaviour.

antirez 3 days ago

In most real world scenarios, the tradeoffs are a bit softer than what people in the formal world dictate (and in doing so they forced certain systems to become suboptimal in everything except during failures, kicking them out of business...). A few examples:

1. E-commerce system where there are a limited amount of items of the same kind, you don't want to oversell.

2. Hotel booking system where we don't want to reserve the same dates/rooms multiple times.

3. Online medical appointments system.

In all those systems, re-opening the item/date/... after some time is OK, even after one day. And if the lock hold time is not that long but a much stricter compromise (also a reasonable choice in the spectrum), and it happens that during edge case failures three items are sold when there are only two, orders can be cancelled.

So yes, there is a tension between timeout, race condition, and recovery time, but in many systems using something like RedLock the development and end-user experience can both be improved with a high rate of success, and the random unhappy event can be handled. Now the algorithm is very old, still used by many implementations, and as we speak problems are being solved in a straightforward way with very good performance. Of course, the developers of the solution should be aware that there are tradeoffs between certain values: but when are distributed systems easy?

P.S. why do 10 years of strong usage count, in the face of a blog post saying that you can't trust a system like that? Because even if DS issues emerge randomly and sporadically, in the long run systems that create real-world issues, if they reach mass usage, are known. A big enough user base is a continuous integration test big enough to detect when a solution has serious real-world issues. So of course RedLock users picking short timeouts with tasks that take a very hard to predict amount of time will indeed run into known issues. But the other systemic failure modes described in the blog post are never mentioned by users AFAIK.

computerfan494 3 days ago

I feel like you're dancing around admitting the core issue that Martin points out - RedLock is not suitable for systems where correctness is paramount. It can get close, but it is not robust in all cases.

If you want to say "RedLock is correct a very high percentage of the time when lease timeouts are tuned for the workload", I would agree with you actually. I even possibly agree with the statements "most systems can tolerate unlikely correctness failures due to RedLock lease violations. Manual intervention is fine in those cases. RedLock may allow fast iteration times and is worth this cost". I just think it's important to be crystal clear on the guarantees RedLock provides.

I first read Martin's blog post and your response years ago when I worked at a company that was using RedLock despite it not being an appropriate tool. We had an outage caused by overlapping leases because the original implementor of the system didn't understand what Martin has pointed out from the RedLock documentation alone.

I've been a happy Redis user and fan of your work outside of this poor experience with RedLock, by the way. I greatly appreciate the hard work that has gone into making it a fantastic database.

bluepizza 3 days ago

Could you provide links?

saikatsg 3 days ago

anonzzzies 3 days ago

I am updating my low level and algo knowledge; what are good books about this (I have the one written by the author)? I am looking to build something for fun, but everything is either a toy or very complicated.

cosmicradiance 3 days ago

System Design Interview I and II - Alex Xu. Take one of the topics and do it practically.