327 points by GarethX 2 months ago | 204 comments
codeulike 2 months ago
And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download what's marked as changed' and then get stuck in the quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates, etc.).
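A minimal sketch of that first edge case, with a hypothetical schema (the `updated_at` field and `naive_pull` function are made up for illustration): timestamp-based "download what's changed" silently misses deletions, because a deleted row has no row left to carry a timestamp.

```python
# Hypothetical server-side table: rows carry an updated_at timestamp.
server = {
    "a": {"value": 1, "updated_at": 100},
    "b": {"value": 2, "updated_at": 150},
}

def naive_pull(local, last_sync):
    # Naive sync: copy down every row changed since the last sync.
    for key, row in server.items():
        if row["updated_at"] > last_sync:
            local[key] = row["value"]
    return local

local = naive_pull({}, last_sync=0)        # initial full sync
del server["b"]                            # someone deletes 'b' upstream
local = naive_pull(local, last_sync=150)   # incremental sync...
# ...and 'b' is still present locally: the deletion never propagated.
assert local == {"a": 1, "b": 2}
```

Fixing this requires tombstones or full-state comparison, which is exactly where the fractal complexity starts.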
The one piece of discussion or attempt at a systematic approach I've seen to 'synchronisation' recently is to do with Conflict-free Replicated Data Types https://crdt.tech which is essentially restricting your data and the rules for dealing with conflicts to situations that are known to be resolvable and then packaging it all up into an object.
klabb3 2 months ago
I will go against the grain and say CRDTs have been a distraction, and the overfocus on them has been delaying real progress. They are immature and highly complex, and thus hard to debug and understand, and have extremely limited cross-language support in practice, let alone any indexing or storage engine support.
Yes, they are fascinating and yes they solve real problems, but they are absolute overkill for most problems (except collab editing), at least currently. Why? Because they are all about conflict resolution. You can get very far without addressing this problem: for instance a cache, like you mentioned, has no need for conflict resolution. The main data store owns the data, and the cache follows. If you can have single ownership (single writer), or last write wins, or similar, you can drop a massive pile of complexity on the floor and not worry about it. (In the rare cases where it's necessary, like Google Docs or Figma, I would be very surprised if they use off-the-shelf CRDT libs – I would bet they have extremely bespoke, domain-specific data structures that are inspired by CRDTs.)
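To make the "last write wins" point concrete, here is a minimal LWW register sketch (the class and field names are invented for illustration): each write carries a (logical time, writer id) stamp, and the highest stamp wins on merge, so there is no conflict resolution logic at all.

```python
class LWWRegister:
    """Last-write-wins register: the highest (time, writer) stamp wins."""

    def __init__(self, writer_id):
        self.writer_id = writer_id
        self.value = None
        self.stamp = (0, writer_id)  # writer_id breaks timestamp ties

    def write(self, value, logical_time):
        self.value = value
        self.stamp = (logical_time, self.writer_id)

    def merge(self, other):
        # Keep whichever replica's write has the higher stamp.
        if other.stamp > self.stamp:
            self.value = other.value
            self.stamp = other.stamp

a = LWWRegister("a")
b = LWWRegister("b")
a.write("hello", 1)
b.write("world", 2)
a.merge(b)
assert a.value == "world"  # b wrote later, so b's value wins everywhere
```

Merging is commutative and idempotent here, which is the whole trick: you get convergence without tracking operation history.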
Instead, what I believe we need is end-to-end bidirectional stream based data communication, simple patch/replace data structures to efficiently notify of updates, and standard algorithms and protocols for processing it all. Basically adding async reactivity on the read path of existing data engines like SQL databases. I believe even this is a massive undertaking, but feasible, and delivers lasting tangible value.
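One hypothetical shape for such a patch/replace stream (the message format here is an assumption, not an existing protocol): the server pushes either a full "replace" snapshot or an incremental "patch" of changed rows, and clients fold updates into a local replica in order.

```python
def apply_update(replica, update):
    """Fold one stream message into a local replica (dict of rows)."""
    if update["kind"] == "replace":       # full snapshot: start over
        return dict(update["rows"])
    if update["kind"] == "patch":         # incremental delta
        for key, row in update["rows"].items():
            if row is None:
                replica.pop(key, None)    # None marks a deletion
            else:
                replica[key] = row
        return replica
    raise ValueError(f"unknown update kind: {update['kind']}")

replica = {}
replica = apply_update(replica, {"kind": "replace",
                                 "rows": {"1": "alice", "2": "bob"}})
replica = apply_update(replica, {"kind": "patch",
                                 "rows": {"2": None, "3": "carol"}})
assert replica == {"1": "alice", "3": "carol"}
```

Note that deletions are explicit in the patch, which sidesteps the tombstone problem that timestamp-polling approaches run into.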
mweidner 2 months ago
It is still tempting to turn to CRDTs to solve the next problem: how to apply server-side changes to a client when the client has its own pending local operations. But this can be solved in a fully general way using server reconciliation, which doesn't restrict your operations or data structures like a CRDT does. I wrote about it here: https://mattweidner.com/2024/06/04/server-architectures.html...
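A toy sketch of the idea (not the linked post's actual code; `client_view` and the op shape are invented): the client's view is always recomputed as authoritative server state plus its still-pending local ops, so no merge machinery is needed on the client.

```python
def client_view(server_state, pending_ops):
    """Rebuild the client's view: server state + unacknowledged local ops."""
    view = dict(server_state)      # start from the authoritative state
    for key, value in pending_ops: # replay pending ops on top, in order
        view[key] = value
    return view

pending = [("title", "draft v2")]
view = client_view({"title": "draft v1", "body": "hello"}, pending)
assert view == {"title": "draft v2", "body": "hello"}

# Once the server acknowledges the op, it leaves the pending list and
# the same function yields the same view from server state alone.
view = client_view({"title": "draft v2", "body": "hello"}, [])
assert view == {"title": "draft v2", "body": "hello"}
```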
klabb3 2 months ago
> how to apply server-side changes to a client when the client has its own pending local operations
I liked the option of restore and replay on top of the updated server state. I'm wondering when this causes perf issues. For one, local changes should propagate fast after e.g. a network partition, even if the person has queued up a lot of them (say during a flight).
Anyway, my thinking is that you can avoid many consensus problems by just partitioning data ownership. The like example is interesting in this way. A like count is an aggregate based on multiple data owners, and everyone else just passively follows with read replication. So thinking in terms of shared write access is the wrong problem description, imo, when in reality ”liked posts” is data exclusively owned by all the different nodes doing the liking (subject to a limit of one like per post). A server aggregate could exist but is owned by the server, so no shared write access is needed.
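A sketch of that ownership idea (class and function names are made up for illustration): each node exclusively owns the set of posts it has liked, and the "like count" is a derived read-side aggregate, so no shared write access is ever needed.

```python
class Node:
    """Each node exclusively owns its own set of liked posts."""

    def __init__(self):
        self.liked = set()

    def like(self, post_id):
        # A set naturally enforces "at most one like per post".
        self.liked.add(post_id)

def like_count(post_id, nodes):
    # Read replication: anyone can compute the aggregate from the
    # replicated per-owner state; nobody writes to shared state.
    return sum(post_id in n.liked for n in nodes)

alice, bob = Node(), Node()
alice.like("post-1")
alice.like("post-1")   # double like is a no-op, per the one-like limit
bob.like("post-1")
assert like_count("post-1", [alice, bob]) == 2
```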
Similarly, say you have a messaging service. Each participant owns their own messages and others follow. No conflicts are needed. However, you can still break the protocol (say liking twice). Those can be considered malformed and eg ignored. In some cases, you can copy someone else’s data and make it your own: for instance to protect against impersonations: say that you can change your own nickname, and others follow. This can be exploited to impersonate but you can keep a local copy of the last seen nickname and then display a ”changed name” warning.
Anyway, I’m just a layman who wants things to be simple. It feels like CRDTs have been the ultimate nerd-snipe, and when I did my own evaluations I was disappointed with how heavyweight and opaque they were a few years ago (and probably still).
ochiba 2 months ago
I agree with this. CRDTs are cool tech, but I think in practice most folks would be surprised by the high percentage of use cases that can be solved with much simpler conflict resolution mechanisms (perhaps combined with server reconciliation, as Matt mentioned). I also agree that collaborative document editing is a niche where CRDTs are indeed very useful.
halfcat 2 months ago
I suspect the generalized solution is much harder to achieve, and looks more like batch-based reconciliation of full snapshots than streaming or event-driven.
The challenge is if you aim to sync data sources where the parties managing each data source are not incentivized to provide robust sync. Consider Dropbox or similar, where a single party manages the data set, and all software (server and clients), or ecosystems like Salesforce and Mulesoft which have this as a stated business goal, or ecosystems like blockchains where independent parties are still highly incentivized to coordinate and have technically robust mechanisms to accomplish it like Merkle trees and similar. You can achieve sync in those scenarios because independent parties are incentivized to coordinate (or there is only one party).
But if you have two or more independent systems, all of which provide some kind of API or import/export mechanisms, you can never guarantee those systems will stay in sync using a streaming or event-driven approach. Worse, those systems will inevitably drift out of sync, or worse still, will propagate incorrect data across multiple systems, which can then only be reconciled by batch-like point-in-time snapshots, which raises the question of why use streaming if you ultimately need batch to make it work reliably.
Put another way, people say batch is a special case of streaming, so just use streaming. But you could also say streaming is a fragile form of sync, so just use sync. But sync is a special case of batch, so just use batch.
9rx 2 months ago
Or CRDTs at all. Google Docs is based on operational transforms and Figma on what they call multiplayer technology.
josephg 2 months ago
Also, http caching is sort of a special case of sync - where the cache (say, nginx) is trying to keep a synchronised copy of a resource from the backend web server. But because there’s no way for the web server to notify nginx that the resource has changed, you get both stale reads and unnecessary polling. Doing fan-out would be way more efficient than a keep alive header if we had a way to do it!
CRDTs are cool tech. (I would know - I’ve been playing with them for years). But I think it’s worth dividing data interfaces into two types: owned data and shared data. Owned data has a single owner (eg the database, the kernel, the web server) and other devices live downstream of that owner. Shared data sources have more complex systems - eg everyone in the network has a copy of the data and can make changes, then it’s all eventually consistent. Or raft / paxos. Think git, or a distributed database. And they can be combined - eg, the app server is downstream of a distributed database. GitHub Actions is downstream of a git repo.
I’ve been meaning to write a blog post about this for years. Once you realise how ubiquitous this problem is, you see it absolutely everywhere.
miki123211 2 months ago
In most cases, the easiest approach there is just "slap a blockchain on it", as a good and modern (think Ethereum, not Bitcoin) blockchain essentially "abstracts away" the decentralization and mostly acts like a centralized computer to higher layers.
That is certainly not the only viable approach, and I wish we looked at others more. For example, a decentralized DNS-like system, without an attached cryptocurrency, but with global consensus on what a given name points to, would be extremely useful. I'm not convinced that such a thing is possible; you need some way of preventing one bad actor from grabbing all the names, and monetary compensation seems like the easiest one. But we should be looking in this direction a lot more.
josephg 2 months ago
In my mind, this is just the second category again. It’s just a shared data system, except with data validation & Byzantine fault tolerance requirements.
It’s a surprisingly common and thorny problem. For example, I could change my local git client to generate invalid / wrong hashes for my commits. When I push my changes, other peers should - in some way - reject them. PVH (of Ink&Switch) has a rule when thinking about systems like this. He says you’re free to deface your own copy of the US constitution. But I don’t have to pull your changes.
Access control makes the BFT problem much worse. The classic problem is that if two admins concurrently remove each other, it’s not clear what happens. In a crdt (or git), peers are free to backdate their changes to any arbitrary point in the past. If you try and implement user roles on top of a crdt, it’s a nightmare. I think CRDTs are just the wrong tool for thinking about access control.
jkaptur 2 months ago
One thing I think is missing in the discussion about shared data (and maybe you can correct me) is that there are two ways of looking at the problem:
* The "math/engineering" way, where once state is identical, you are done!
* The "product manager" way, where you have reasonable-sounding requests like "I was typing in the middle of a paragraph, then someone deleted that paragraph, and my text was gone! It should be its own new paragraph in the same place."
Literally having identical state (or even identical state that adheres to a schema) is hard enough, but I'm not aware of techniques to ensure 1) identical state 2) adhering to a schema 3) that anyone on the team can easily modify in response to "PM-like" demands without being a sync expert.
ochiba 2 months ago
I've spent 16 years working on a sync engine and have worked with hundreds of enterprises on sync use cases during this time. I've seen countless cases of developers underestimating the complexity of sync. In most cases it happens exactly as you said: start with a naive approach and then the fractal complexity spiral starts. Even if the team is able to do the initial implementation, maintaining it usually turns into a burden that they eventually find too big to bear.
danielvaughn 2 months ago
That said, there’s work that has been done towards fixing some of those issues.
Evan Wallace (I think he’s the CTO of Figma) has written about a few solutions he tried for Figma’s collaborative features. And then Martin Kleppmann has a paper proposing a solution:
tmpfs 2 months ago
I've been working on sync for the latter use case for a while and CRDTs would definitely be overkill.
jdvh 2 months ago
When clients disagree about the order of events and a conflict results, clients can be required to roll back (apply the inverse of each change) to the last point in time where all clients were in agreement about the world state. Then, all clients re-apply all changes in the new now-agreed-upon order. Now all changes have been applied, there is agreement about the world state, and the process starts anew.
This way multiple clients can work offline for extended periods of time and then reconcile with other clients.
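A minimal sketch of the rollback-and-replay scheme described above (the change representation is an assumption): each change carries its inverse, so a client can unwind its local history to the last agreed state and then re-apply everything in the newly agreed global order.

```python
def reconcile(state, applied, agreed_order):
    """Roll back locally applied changes, then replay in agreed order."""
    for change in reversed(applied):      # undo local history, newest first
        state = change["inverse"](state)
    for change in agreed_order:           # replay in the agreed order
        state = change["apply"](state)
    return state

# Two invertible changes over an integer state.
inc = {"apply": lambda s: s + 1, "inverse": lambda s: s - 1}
dbl = {"apply": lambda s: s * 2, "inverse": lambda s: s // 2}

# This client applied [inc, dbl] locally (0 -> 1 -> 2), but the agreed
# global order turns out to be [dbl, inc] (0 -> 0 -> 1).
state = reconcile(2, [inc, dbl], [dbl, inc])
assert state == 1
```

Note the two orderings give different results (2 vs 1), which is exactly why all clients must converge on one agreed order before replaying.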
mackopes 2 months ago
Abstracting all of this complexity away in one general tool/library and pretending that it will always work is snake oil. There are no shortcuts to building truly high quality product at a large scale.
wim 2 months ago
I definitely believe sync engines are the future as they make it so much easier to enable things like no-spinners browsing your data, optimistic rendering, offline use, real-time collaboration and so on.
I'm also not entirely convinced yet though that it's possible to get away with something that's not custom-built, or at least large parts of it. There were so many micro decisions and trade-offs going into the engine: what is the granularity of updates (characters, rows?) that we need and how does that affect the performance. Do we need a central server for things like permissions and real-time collaboration? If so do we want just deltas or also state snapshots for speedup. How much versioning do we need, what are implications of that? Is there end-to-end-encryption, how does that affect what the server can do. What kind of data structure is being synced, a simple list/map, or a graph with potential cycles? What kind of conflict resolution business logic do we need, where does that live?
It would be cool to have something general purpose so you don’t need to build any of this, but I wonder how much time it will save in practice. Maybe the answer really is to have all kinds of different sync engines to pick from and then you can decide whether it's worth the trade-off not having everything custom-built.
mentalgear 2 months ago
btw: excellent questions/insights - roughly the same ones I came across in my own lo-fi ventures.
Would be great if someone could assemble all these questions in a "walkthrough" step-by-step interface and in the end, the user gets a list of the best matching engines.
Edit: Mh ... maybe something small enough to vibe code ... if someone is interested to help let me know!
jdvh 2 months ago
1) in a decentralized system who is responsible for backups? What happens when you restore from a backup?
2) in a decentralized system who sends push notifications and syncs with mobile devices?
I think that in an age of $5/mo cloud vms and free SSL having a single coordination server has all the advantages and none of the downsides.
tonsky 2 months ago
- Sync engines might only solve small and medium scale, but that would be a huge win even without large scale
thr0w 2 months ago
Remember Meteor?
xg15 2 months ago
> It’s also ill-advised to try to solve data sync while also working on a product. These problems require patience, thoroughness, and extensive testing. They can’t be rushed. And you already have a problem on your hands you don’t know how to solve: your product. Try solving both, fail at both.
Also, you might not have that "large scale" yet.
(I get that you could also make the opposite case, that the individual requirements for your product are so special that you cannot factor out any common behavior. I'd see that as a hypothesis to be tested.)
tbrownaw 2 months ago
The first rule of network transparency is: the network is not transparent.
> Or: I’ve yet to see a code base that has maintained a separate in-memory index for data they are querying
Is boost::multi_index_container no longer a thing?
Also there's SQLite with the :memory: database.
And this ancient 4gl we use at work has in-memory tables (as in database tables, with typed columns and any number of unique or not indexes) as a basic language feature.
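For the SQLite option mentioned above, a small sketch of replacing hand-rolled dicts and lists with an indexed in-memory table (table and column names are made up for illustration):

```python
import sqlite3

# An in-memory database: same SQL, indexes and typed columns as on disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, x REAL, y REAL)")
db.execute("CREATE INDEX idx_pos ON entities (x, y)")  # queryable index
db.executemany(
    "INSERT INTO entities VALUES (?, ?, ?)",
    [(1, 0.0, 0.0), (2, 5.0, 5.0), (3, 50.0, 50.0)],
)

# A range query that would otherwise need a hand-rolled spatial index.
near = db.execute(
    "SELECT id FROM entities"
    " WHERE x BETWEEN 0 AND 10 AND y BETWEEN 0 AND 10"
    " ORDER BY id"
).fetchall()
assert [r[0] for r in near] == [1, 2]
```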
anonyfox 2 months ago
Helps a lot with high read situations and takes considerable load off the database with probably 1 hour of coding effort if you know what you're doing.
TeMPOraL 2 months ago
Depends on the shop. I haven't seen one in production so far, but I don't doubt some people use it.
> Also there's SQLite with the :memory: database.
Ah, now that's cheating. I know, because I did that too. I did that because of the realization that half the members I'm stuffing into classes to store my game state are effectively poor man's hand-rolled tables, indices and spatial indices, so why not just use a proper database for this?
> And this ancient 4gl we use at work has in-memory tables (as in database tables, with typed columns and any number of unique or not indexes) as a basic language feature.
Which one is this? I've argued in the past that this is a basic feature missing from 4GL languages, and a lot of work in every project is wasted on hand-rolling in-memory databases left and right, without realizing it. It would seem I've missed a language that recognized this fact?
(But then, so did most of the industry.)
tbrownaw 2 months ago
https://en.wikipedia.org/wiki/OpenEdge_Advanced_Business_Lan...
Dates back to 1981, called "Progress 4GL" until 2006.
https://docs.progress.com/bundle/abl-reference/page/DEFINE-T...
aiono 2 months ago
That's precisely why the current request model is painful.
ximm 2 months ago
If that was true, we would ultimately end up with a single layer. Instead I would say that major shifts happen when we move the boundaries between layers.
The author here proposes to replace servers by synced client-side data stores.
That is certainly a good idea for some applications, but it also comes with drawbacks. For example, it would be easier to avoid stale data, but it would be harder to enforce permissions.
worthless-trash 2 months ago
There was still a server, it's just not YOUR server. In this case, there will still be servers, just maybe not something that you need to manage state on.
This misnaming creates endless conflict when trying to communicate this with hyper excited management who want to get on the latest trend.
Can't wait to be in the meeting and hearing: "We don't need servers when we migrate to client-side data stores".
Diederich 2 months ago
Over time, the meaning of the word 'Xerox' changed. More specifically, it gained a new meaning. For a long time, Xerox only referred to a company named in 1961. Some time in the late 60s, it started to be used as a verb, and as I was growing up in the 70s and 80s, the word 'Xerox' was overwhelmingly used in its verb form.
Our society decided as a whole that it was ok for the noun Xerox to be used as a verb. That's a normal and natural part of language development.
As others have noted, management doesn't care whether the serverless thing you want to use is running on servers or not. They care that they don't have to maintain servers themselves. CapEx vs OpEx and all that.
I agree that there could be some small hazard with the idea that, if I run my important thing in a 'serverless' fashion, then I don't have to associate all of the problems/challenges/concerns I have with 'servers' to my important thing.
It's an abstraction, and all abstractions are leaky.
If we're lucky, this abstraction will, on average, leak very little.
philsnow 2 months ago
https://www.youtube.com/watch?v=PZbqAMEwtOE#t=5m58s I don't think this dramatization (of a court proceeding from 2010) is related to Xerox's struggle with losing their trademark, but said dramatization is brilliant nonetheless