244 points by yusufaytas 9 months ago | 95 comments
jojolatulipe 9 months ago
wslh 8 months ago
refset 8 months ago
amievil 8 months ago
robertlagrant 9 months ago
calmoo 9 months ago
Icathian 9 months ago
eknkc 9 months ago
Felt pretty safe about it so far, but I just realised I never check if the db connection is still ok. If this is a db-related job and I need to touch the db, fine: some query will fail on the connection and my job will fail anyway. Otherwise I might have already lost the lock and not be aware of it.
Without fencing tokens, atomic ops and such, I guess one needs two-phase commit on everything for absolute correctness?
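A minimal sketch of that connection-health concern, assuming a Postgres session-level advisory lock held over psycopg2 (the function name and usage are mine, not from the thread): probe the lock-holding connection before doing a side effect. As the comment above implies, this only narrows the window; without fencing tokens the connection can still die right after the probe.

    import psycopg2

    def still_holds_session_lock(conn):
        # If any statement on the lock-holding session succeeds, the session
        # (and its session-level advisory lock) was still alive at that moment.
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
            return True
        except psycopg2.OperationalError:
            return False  # connection gone => the server has released the advisory lock

    # Usage (racy by nature -- the check and the side effect are not atomic):
    # if still_holds_session_lock(conn):
    #     do_side_effect()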
Quekid5 9 months ago
AFAIK the only correct way to do what you probably thought you were doing is "EXCLUSIVE" or "ACCESS EXCLUSIVE"... or two-phase commit or idempotency for the operations you're doing.
[0] https://www.postgresql.org/docs/current/explicit-locking.htm...
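For context, a rough sketch of what a table-level lock looks like from application code, assuming Python/psycopg2 (the "jobs" table and the DSN are invented). The LOCK statement must run inside a transaction and is held until commit or rollback:

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # placeholder DSN
    try:
        with conn:  # psycopg2 commits on clean exit, rolls back on exception
            with conn.cursor() as cur:
                # Blocks every other access to the table until the transaction ends.
                cur.execute("LOCK TABLE jobs IN ACCESS EXCLUSIVE MODE")
                cur.execute("UPDATE jobs SET state = 'running' WHERE id = %s", (42,))
    finally:
        conn.close()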
skrause 8 months ago
Are you sure that you're talking about the same locks? What are the pitfalls exactly?
candiddevmike 9 months ago
Quekid5 9 months ago
joatmon-snoo 9 months ago
Postgres allows obtaining advisory locks at either the session _or_ transaction level. If it's session-level, then you have, ergo, a connection-level lock.
https://www.postgresql.org/docs/current/explicit-locking.htm...
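To make the distinction concrete, a small sketch assuming psycopg2 and an arbitrary lock key (1234): pg_advisory_lock/pg_advisory_unlock take the session-level lock, pg_advisory_xact_lock the transaction-level one.

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # placeholder DSN
    conn.autocommit = True                 # so we control transactions explicitly below
    cur = conn.cursor()

    # Session-level: bound to this connection, survives commits, must be
    # released explicitly (or is released when the connection closes).
    cur.execute("SELECT pg_advisory_lock(1234)")
    # ... critical section ...
    cur.execute("SELECT pg_advisory_unlock(1234)")

    # Transaction-level: released automatically at COMMIT or ROLLBACK.
    cur.execute("BEGIN")
    cur.execute("SELECT pg_advisory_xact_lock(1234)")
    # ... critical section ...
    cur.execute("COMMIT")

    conn.close()

This is also why a transaction-pooling proxy such as pgBouncer (mentioned below) breaks the session-level variant: successive statements from one client may land on different server sessions.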
skrause 8 months ago
m11a 8 months ago
eknkc 8 months ago
And if you add something like pgBouncer or whatever, this should still work but a session lock would fuck things up.
antirez 9 months ago
Btw, things to note in random order:
1. Check my comment under this blog post. The author missed a fundamental point about how the algorithm works, and then based his rejection of the algorithm on the remaining, weaker points.
2. It is not true that you can't wait an approximately correct amount of time with modern computers and APIs: GC pauses are bounded and monotonic clocks work. These are acceptable assumptions.
3. To critique the auto-release mechanism in itself, because you don't want to expose yourself to a potential race, is one thing. To critique the algorithm against its stated goals and system model is another thing.
4. Over the years Redlock has been used successfully in a huge number of use cases, because if you pick a timeout which is much larger than A) the time to complete the task and B) the random pauses you can have in normal operating systems, race conditions are very hard to trigger, and the other failures in the article have, AFAIK, never been observed. Of course, if you have a super small auto-release timeout and the task may easily take that amount of time, you have committed a design error, but that's not about Redlock.
computerfan494 9 months ago
Would you use RedLock in a situation where the timeout is fairly short (1-2 seconds maybe), the work done usually takes ~90% of that timeout, and the work you do while holding a RedLock lock MUST NOT be done concurrently with another lock holder?
I think the correct answer here is always "No", because the risk of the lease sometimes expiring before the client has finished its work is very high. You must make your work idempotent, because RedLock cannot guarantee mutual exclusion under all circumstances. Optimistic locking is a good way to implement this kind of thing when the work is idempotent.
kgeist 9 months ago
We had corrupted data because of this.
computerfan494 9 months ago
The critical point that users must understand is that it is impossible to guarantee that the RedLock client never holds its lease longer than the timeout. Compounding this problem, the longer you make your timeout to minimize the likelihood of this happening accidentally, the less responsive your system becomes during genuine client misbehaviour.
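For readers following along, this is the mechanism under debate: the single-instance locking pattern from the Redis documentation (Redlock runs it against several independent instances). The key auto-expires after the TTL whether or not the holder has finished, and the release script only deletes the key if the random token still matches. A sketch assuming redis-py; key name and TTL are illustrative.

    import uuid
    import redis

    r = redis.Redis()          # assumed local instance
    LOCK_KEY = "lock:resource"
    TTL_MS = 30_000

    token = uuid.uuid4().hex   # lets us release only a lock we still own

    # SET key token NX PX ttl -- acquire only if the key doesn't exist yet.
    acquired = r.set(LOCK_KEY, token, nx=True, px=TTL_MS)
    if acquired:
        try:
            pass  # ... do the work; if it outlives TTL_MS, the lock is silently gone ...
        finally:
            # Check-and-delete atomically, so we never remove someone else's lock.
            RELEASE = """
            if redis.call("get", KEYS[1]) == ARGV[1] then
                return redis.call("del", KEYS[1])
            else
                return 0
            end
            """
            r.eval(RELEASE, 1, LOCK_KEY, token)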
antirez 9 months ago
1. E-commerce system where there is a limited number of items of the same kind and you don't want to oversell.
2. Hotel booking system where we don't want to reserve the same dates/rooms multiple times.
3. Online medical appointments system.
In all those systems, it's ok for the item/date/... to be re-opened after some time, even after one day. And if the lock hold time is instead kept small, as a stricter compromise (also a reasonable choice in the spectrum), it could happen that during edge-case failures three items are sold when only two exist; those orders can be cancelled.
So yes, there is a tension between timeout, race conditions and recovery time, but in many systems using something like RedLock both the development and the end-user experience can be improved, with a high rate of success, and the rare unhappy event can be handled. The algorithm is now very old and still used by many implementations, and as we speak it solves problems in a straightforward way with very good performance. Of course, the developers using the solution should be aware that there are tradeoffs between certain values: but when are distributed systems easy?
P.S. Why do 10 years of heavy usage count, in the face of a blog post saying you can't trust a system like that? Because even if distributed-systems issues emerge randomly and sporadically, in the long run systems that create real-world issues become known, if they reach mass usage. A big enough user base is a continuous integration test big enough to detect when a solution has serious real-world issues. So of course RedLock users who pick short timeouts for tasks that take a very hard-to-predict amount of time will indeed run into known issues. But the other systemic failure modes described in the blog post are, AFAIK, never mentioned by users.
computerfan494 9 months ago
If you want to say "RedLock is correct a very high percentage of the time when lease timeouts are tuned for the workload", I would agree with you actually. I even possibly agree with the statements "most systems can tolerate unlikely correctness failures due to RedLock lease violations. Manual intervention is fine in those cases. RedLock may allow fast iteration times and is worth this cost". I just think it's important to be crystal clear on the guarantees RedLock provides.
I first read Martin's blog post and your response years ago when I worked at a company that was using RedLock despite it not being an appropriate tool. We had an outage caused by overlapping leases because the original implementor of the system didn't understand what Martin has pointed out from the RedLock documentation alone.
I've been a happy Redis user and fan of your work outside of this poor experience with RedLock, by the way. I greatly appreciate the hard work that has gone into making it a fantastic database.
anonzzzies 9 months ago
cosmicradiance 9 months ago
egcodes 9 months ago
dataflow 9 months ago
Hold on, this sounds absurd to me:
First, if your client crashes, then you don't need a timed lease on the lock to detect this in the first place. The lock would get released by the OS or supervisor, whether there are any timeouts or not. If both of those crash too, then the connection would eventually break, and the network system should then detect that (via network resets or timeouts, lack of heartbeats, etc.) and then invalidate all your connections before releasing any locks.
Second, if the problem becomes that your client is buggy and thus holds the lock too long without crashing, then shouldn't some kind of supervisor detect that and then kill the client (e.g., by the OS terminating the process) before releasing the lock for everybody else?
Third, if you are going to have locks with timeouts to deal with corner cases you can't handle like the above, shouldn't they notify the actual program somehow (e.g., by throwing an exception, raising a signal, terminating it, etc.) instead of letting it happily continue execution? And shouldn't those cases wait for some kind of verification that the program was notified before releasing the lock?
The whole notion that timeouts should somehow permit the program execution to continue ordinary control flow sounds like the root cause of the problem, and nobody is even batting an eye at it? Is there an obvious reason why this makes sense? I feel I must be missing something here... what am I missing?
winwang 9 months ago
dataflow 9 months ago
winwang 9 months ago
dataflow 9 months ago
To your question, could you clarify what exactly you mean by the rack "going down"? This encompasses a lot of different scenarios, I'm not sure which one you're asking about. The obvious interpretation would break all the connections the program has to the outside world, thus preventing the problem by construction.
winwang 9 months ago
dataflow 9 months ago
winwang 9 months ago
dataflow 9 months ago
I get that -- and honestly, I'm not expecting a treatise on distributed consensus here. But what took me aback was that the blog post didn't even attempt to mention anything about the fact that the premise (at first glance) looks glaringly broken. If he'd even said 1 single sentence like "it's {difficult/infeasible/impossible} to design a client that will never continue execution past a timeout", it'd have been fine, and I would've happily moved along. But the way it is written right now, it reads a little bit like: "we design a ticking time bomb that we can't turn off; how can we make sure we don't forget to reset the timer every time?"... without bothering to say anything about why we should be digging ourselves into such a hole in the first place.
winwang 9 months ago
dataflow 9 months ago
Also for what it's worth, I can guess what some of the answers might be. For example, it's possible you'd need very precise timing facilities that aren't always available, in order to be able to guarantee high throughput with correctness (like Google Spanner's). Or it might be that doing so requires a trade-off between availability and partition-tolerance that in some applications isn't justified. But I'm curious what the answer actually is, rather than just (semi-)random guesses as to what it could be.
wbl 9 months ago
neonbrain 9 months ago
neonbrain 9 months ago
dataflow 9 months ago
neonbrain 9 months ago
dataflow 9 months ago
Timeouts were a red herring in my comment. My problem wasn't with the mere existence of timeouts in corner cases, it was the fact that the worker is assumed to keep working merrily on, despite the timeouts. That's what I don't understand the justification for. If the worker is dead, then it's a non-issue, and the lease can be broken. If the system is alive, the host can discover (via RST, heartbeats, or other timeouts) that the storage system is unreachable, and thus prevent the program from continuing execution -- and at that point the storage service can still break the lease (via a timeout), but it would actually come with a timing-based guarantee that the program will no longer continue execution.
hoppp 9 months ago
It's using FoundationDB, a distributed db. The Deno instances running on local devices all connect to the same Deno KV to acquire the lock.
But with Postgres a SELECT FOR UPDATE also works, though that database is not distributed.
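A rough sketch of the SELECT FOR UPDATE pattern mentioned above, assuming psycopg2 (the "jobs" table and columns are invented for illustration):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # placeholder DSN
    with conn:  # one transaction; the row lock is held until commit/rollback
        with conn.cursor() as cur:
            # Concurrent transactions selecting the same row FOR UPDATE block here.
            cur.execute("SELECT state FROM jobs WHERE id = %s FOR UPDATE", (42,))
            row = cur.fetchone()
            if row and row[0] == "pending":
                cur.execute("UPDATE jobs SET state = 'running' WHERE id = %s", (42,))
    conn.close()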
jroseattle 9 months ago
Our use case: handing out a ticket (something with an identifier) from a finite set of tickets from a campaign. It's something akin to Ticketmaster allocating seats in a venue for a concert. Our operation was as you might expect: provide a ticket to a request if one is available, assign some metadata from the request to the allocated ticket, and remove it from consideration for future client requests.
We had failed campaigns in the past (over-allocation, under-allocation, duplicate allocation, etc.) so our concern was accuracy. Clients would connect and request a ticket; we wanted to exclusively distribute only the set of tickets available from the pool. If the number of client requests exceeded the number of tickets, the system should protect for that.
We tried Redis, including the naive implementation of getting the lock, checking the lock, doing our thing, releasing the lock. It was ok, but administrative overhead was a lot for us at the time. I'm glad we didn't go that route, though.
We ultimately settled on...Postgres. Our "distributed lock" was just a composite UPDATE statement using some Postgres-specific features. We effectively turned requests into a SET operation, where the database would return either a record that indicated the request was successful, or something that indicated it failed. ACID transactions for the win!
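The comment doesn't show the actual statement, but a single atomic UPDATE of roughly this shape (schema invented for illustration) captures the idea: claim one free ticket and learn from the returned row whether you got one. FOR UPDATE SKIP LOCKED is one of the Postgres-specific features that helps under contention; sketch assumes psycopg2.

    import psycopg2

    CLAIM_SQL = """
        UPDATE tickets
           SET claimed_by = %(user)s, claimed_at = now()
         WHERE id = (SELECT id
                       FROM tickets
                      WHERE campaign_id = %(campaign)s
                        AND claimed_by IS NULL
                      ORDER BY id
                      LIMIT 1
                        FOR UPDATE SKIP LOCKED)
        RETURNING id
    """

    def claim_ticket(conn, campaign_id, user_id):
        with conn:  # one transaction per claim; ACID does the heavy lifting
            with conn.cursor() as cur:
                cur.execute(CLAIM_SQL, {"campaign": campaign_id, "user": user_id})
                row = cur.fetchone()
                return row[0] if row else None  # None => nothing left to allocate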
With accuracy solved, we next looked at scale/performance. We didn't need to support millions of requests/sec, but we did have some spikiness thresholds. We were able to optimize read/write db instances within our cluster, and strategically load larger/higher-demand campaigns to allocated systems. We continued to improve on optimization over two years, but not once did we ever have a campaign with ticket distribution failures.
Note: I am not an expert of any kind in distributed-lock technology. I'm just someone who did their homework, focused on the problem to be solved, and found a solution after trying a few things.
nh2 9 months ago
Your UPDATE transaction lasts just a few microseconds, so you can just centralise the problem and that's good because it's simpler, faster and safer.
But this is not a _distributed_ problem, as the article explains:
> remember that a lock in a distributed system is not like a mutex in a multi-threaded application. It’s a more complicated beast, due to the problem that different nodes and the network can all fail independently in various ways
You need distributed locking if the transactions can take seconds or hours, and the machines involved can fail while they hold the lock.
fny 9 months ago
nh2 9 months ago
"the machines involved can fail" must also include the postgres machines.
To get that, you need to coordinate multiple postgres servers, e.g. using ... distributed locking. Postgres does not provide that out of the box -- neither multi-master setups, nor master-standby synchronous replication with automatic failover. Wrapper software that provides that, such as Stolon and Patroni, use distributed KV stores / lock managers such as etcd and Consul to provide it.
jroseattle 9 months ago
50000?
> You need distributed locking if the transactions can take seconds or hours, and the machines involved can fail while they hold the lock.
From my experience, locks are needed to ensure synchronized access to resources. Distributed locks are a form of that isolation being held across computing processes, as opposed to the mutex example provided.
And while our implementation definitively did not use a distributed lock, we could still see those machines fail.
I fail to understand why a distributed lock is needed for anything due to its duration.
throwawaythekey 8 months ago
If you require high throughput and have a high duration then partitioning/distribution are the normal solution.
stickfigure 9 months ago
In your case, the constraint is "don't sell more than N tickets". For most realistic traffic volumes for that kind of problem, you can solve it with traditional rdbms transactional behavior and let it manage whatever locking it uses internally.
I wish developers were a lot slower to reach for "I'll build distributed locks". There's almost always a better answer, but it's specific to each application.
jroseattle 9 months ago
Maybe we were lucky in our implementation, but a key factor for our decision was understanding how to manage the systems in our environment. We would have skilled up with Redis, but we felt our Postgres solution would be a good first step. We just haven't had a need to go to a second step yet.
nasretdinov 9 months ago
tonyarkles 9 months ago
Whenever anyone would come and ask for help with a planned distributed system, the first question I would always ask is: does this system actually need to be distributed?! In my 15 years of consulting I think the answer was only actually "yes" 2 or 3 times. Much more often I was helping them solve the performance problems in their single-server system; without doing that they would usually just have ended up with a slow, complex distributed system.
Edit: lol this paper was not popular in the Distributed Systems Group at my school: https://www.usenix.org/system/files/conference/hotos15/hotos...
“You can have a second computer once you’ve shown you know how to use the first one.”
Agingcoder 9 months ago
etcd 9 months ago
wwarner 9 months ago
hansvm 9 months ago
That's a bit strong. Like most of engineering, it depends. Postgres is a good solution if you only have maybe 100k QPS, the locks are logically (if not necessarily fully physically) partially independent, and they aren't held for long. Break any of those constraints, or add anything weird (inefficient postgres clients, high DB load, ...), and you start having to explore either removing those seeming constraints or using other solutions.
wwarner 9 months ago
zbobet2012 9 months ago
OnlyMortal 9 months ago
It’s based on Postgres but performance was not good enough.
We’re now moving to RDMA.
apwell23 9 months ago
galeaspablo 9 months ago
Or they care but don’t bother checking whether what they’re doing is correct.
For example, in my field, where microservices/actors/processes pass messages between each other over a network, I dare say >95% of implementations I see have edge cases where messages might be lost or processed out of order.
But there isn’t an alignment of incentives that fixes this problem. Ie the payment structures for executives and engineers aren’t aligned with the best outcome for customers and shareholders.
noprocrasted 9 months ago
"Microservices" itself is often a symptom of this problem.
Everyone and their dog wants to introduce a network boundary between function calls for no good reason, just so they can subsequently have endless busywork writing HTTP (or gRPC if you're lucky) servers, clients & JSON (de?)serializers for said function calls, trying to reimplement things like distributed transactions across said network boundary, and dealing with the inevitable "spooky action at a distance" that this will yield.
sethammons 9 months ago
The monoliths I have worked in, by contrast, have had issues coordinating changes within the codebase: code crosses boundaries it should not, and datastores get shared and coupled to (what should be) different domains, leading to slow, inefficient code and ossified options for product changes.
shepherdjerred 9 months ago
sethammons 9 months ago
sethammons 9 months ago
Getting everyone onboard is hard, and that is why good leadership is needed. When customers start to churn because bugs pop up and new features are slow or non-existent, the case for making quality part of the process becomes very easy to make. Mature leaders get ahead of that as early as possible.
galeaspablo 9 months ago
Leaders tend to be impatient and think of this quarter's OKRs as opposed to the business' long-term financial health. In other words, the leaders of leaders use standard MBA-prescribed incentive structures.
secondcoming 9 months ago
Eek. This sort of thing can end up with innocent people in jail, or dead.
[0] https://en.wikipedia.org/wiki/British_Post_Office_scandal
noprocrasted 9 months ago
So I'm not particularly sure this is a good example - if anything, it sets the opposite incentives, that even jailing people or driving them to suicide won't actually have any consequences for you.
mrkeen 9 months ago
But I don't see any way to convince yesterday's managers to give us time to build it right.
jmull 9 months ago
* If you have something like what the article calls a fencing token, you don't need any locks.
* The token doesn't need to be monotonically increasing, just a passive unique value that both the client and storage have.
Let's call it a version token. It could be monotonically increasing, but a generated UUID, which is typically easier, would work too. (Technically, it could even be a hash of all the data in the store, though that's probably not practical.) The logic becomes:
(1) client retrieves the current version token from storage, along with any data it may want to modify. There's no external lock, though the storage needs to retrieve the data and version token atomically, ensuring the token is specifically for the version of the data retrieved.
(2) client sends the version token back along with any changes.
(3) Storage accepts the changes if the current token matches the one passed with the changes and creates a new version token (atomically, but still no external locks).
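A toy sketch of those three steps, with an in-memory stand-in for the storage; a real store would offer the same atomic read and conditional write (e.g. a compare-and-set or a conditional UPDATE), which is an assumption here rather than something the comment prescribes.

    import threading
    import uuid

    class VersionedStore:
        def __init__(self, data):
            self._mutex = threading.Lock()  # internal, non-distributed synchronization
            self._data = data
            self._token = uuid.uuid4().hex

        def read(self):
            # Step 1: return the data and its version token atomically.
            with self._mutex:
                return dict(self._data), self._token

        def write(self, new_data, token):
            # Steps 2-3: accept the change only if the token still matches,
            # then issue a fresh token for the new version.
            with self._mutex:
                if token != self._token:
                    return False  # someone wrote in between; caller re-reads and retries
                self._data = dict(new_data)
                self._token = uuid.uuid4().hex
                return True

    store = VersionedStore({"count": 0})
    data, token = store.read()
    data["count"] += 1
    while not store.write(data, token):
        data, token = store.read()  # conflict: re-read and retry
        data["count"] += 1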
Now, you can introduce locks for other reasons (hopefully good ones... they seem to be misused a lot). Just pointing out they are/should be independent of storage integrity in a distributed system.
(I don't even like the term lock, because they are temporary/unguaranteed. Lease or reservation might be a term that better conveys the meaning.)
cnlwsu 9 months ago
zeroxfe 9 months ago
You're misinterpreting the problem described, and proposing a solution for a different problem.
karmakaze 9 months ago
jameshart 9 months ago
zeroxfe 9 months ago
karmakaze 9 months ago
I'm not even sure how it could be used for exclusive update to a resource elsewhere--all clients will think they 'have' the lock and change the resource, then find out they didn't when they update the lock. Or if they bump the lock first, another client could immediately 'have' the lock too.
wh0knows 9 months ago
> Efficiency: Taking a lock saves you from unnecessarily doing the same work twice (e.g. some expensive computation). If the lock fails and two nodes end up doing the same piece of work, the result is a minor increase in cost (you end up paying 5 cents more to AWS than you otherwise would have) or a minor inconvenience (e.g. a user ends up getting the same email notification twice).
I think multiple nodes doing the same work is actually much worse than what’s listed, as it would inhibit you from having any kind of scalable distributed processing.
karmakaze 9 months ago
jmull 9 months ago
To be clear, my point is don't use distributed locking for correctness. There are much better options.
Now, the atomicity I mention implies some kind of internal synchronization mechanism for multiple requests, which could be based on locks, but those would be real, non-distributed ones.
jmull 9 months ago
Efficiency is one, as you say.
The other main one that comes to mind is to implement other "business rules" (hate that term, but that's what people use): in an online shopping app, say, the stock to fulfill an order might be reserved for a time when the user starts the checkout process.
bootsmann 9 months ago
E.g. your storage system has two nodes and there are two read-modify-write processes running. Process 1 acquires the token "abc" and process 2 also acquires the token "abc". Now process 1 commits, the token is changed to "cde", and the change is streamed to node 2. Due to network delay, the update to node 2 is delayed. Meanwhile process 2 commits to node 2 with token "abc". Node 2 accepts the change because it has not yet received the message from node 1, and your system is now in an inconsistent state.
Note that this cannot happen in a scenario where we have monotonically increasing fencing tokens because that requirement forces the nodes to agree on a total order of operations before they can supply the fencing token.
computerfan494 9 months ago
bootsmann 9 months ago
computerfan494 9 months ago
bootsmann 9 months ago
jmull 9 months ago
So node2 doesn't get to accept changes. It can only send changes to storage, which may or may not be accepted by it.
bootsmann 9 months ago
eru 9 months ago
(Honestly, they should rename `--force-with-lease` to just `--force`, and rename the old `--force` behaviour to `--force-with-extreme-prejudice` or something like that. Basically make the new behaviour the default `--force` behaviour.)
pyrolistical 9 months ago
eru 8 months ago