Hacker Remix

Sync Engines Are the Future

327 points by GarethX 2 months ago | 204 comments

codeulike 2 months ago

I've been thinking about this a lot - nearly every problem these days is a synchronisation problem. You're regularly downloading something from an API? That's a sync. You've got a distributed database? Sync problem. Cache invalidation? Basically a sync problem. You want online and offline functionality? Sync problem. Collaborative editing? Sync problem.

And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download what's marked as changed' and then get stuck in a quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates, etc.).

The one piece of discussion or attempt at a systematic approach to 'synchronisation' I've seen recently is Conflict-free Replicated Data Types https://crdt.tech - which is essentially restricting your data and the rules for dealing with conflicts to situations that are known to be resolvable, and then packaging it all up into an object.
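A minimal sketch of that restriction in TypeScript: the grow-only counter, one of the simplest CRDTs. The data (per-replica counts) and the merge rule (element-wise max) are chosen so that merging is commutative, associative, and idempotent, which is exactly what makes every conflict resolvable. (Illustrative only; real CRDT libraries add deltas, causal metadata, and much more.)

```typescript
// A grow-only counter (G-Counter). Each replica increments only its
// own slot; merge takes the element-wise max. Because max is
// commutative, associative and idempotent, replicas converge no
// matter the order in which they exchange states.
type GCounter = Map<string, number>; // replicaId -> count

function increment(c: GCounter, replicaId: string): void {
  c.set(replicaId, (c.get(replicaId) ?? 0) + 1);
}

function merge(a: GCounter, b: GCounter): GCounter {
  const out = new Map(a);
  for (const [id, n] of b) {
    out.set(id, Math.max(out.get(id) ?? 0, n));
  }
  return out;
}

function value(c: GCounter): number {
  let total = 0;
  for (const n of c.values()) total += n;
  return total;
}
```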

klabb3 2 months ago

> The one piece of discussion or attempt at a systematic approach to 'synchronisation' I've seen recently is Conflict-free Replicated Data Types https://crdt.tech

I will go against the grain and say CRDTs have been a distraction, and the overfocus on them has been delaying real progress. They are immature and highly complex, and thus hard to debug and understand, and have extremely limited cross-language support in practice - let alone any indexing or storage engine support.

Yes, they are fascinating, and yes, they solve real problems, but they are absolute overkill for your problems (except collab editing), at least currently. Why? Because they are all about conflict resolution. You can get very far without addressing this problem: for instance a cache, like you mentioned, has no need for conflict resolution. The main data store owns the data, and the cache follows. If you can have single ownership (single writer), or last write wins, or similar, you can drop a massive pile of complexity on the floor and not worry about it. (In the rare cases where it's necessary, like Google Docs or Figma, I would be very surprised if they use off-the-shelf CRDT libs - I would bet they have extremely bespoke, domain-specific data structures that are inspired by CRDTs.)

Instead, what I believe we need is end-to-end bidirectional stream-based data communication, simple patch/replace data structures to efficiently notify of updates, and standard algorithms and protocols for processing it all. Basically, adding async reactivity to the read path of existing data engines like SQL databases. I believe even this is a massive undertaking, but a feasible one that delivers lasting, tangible value.
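One hypothetical shape for that read path, sketched in TypeScript (the message types and names here are invented for illustration, not any existing protocol): the server streams an initial snapshot and then minimal patch/replace/delete updates, which the client folds into a local replica.

```typescript
// Invented wire format for a reactive read path: snapshot first,
// then incremental updates.
type SyncMessage<T extends { id: string }> =
  | { kind: "snapshot"; rows: T[] }                    // initial full state
  | { kind: "patch"; id: string; changes: Partial<T> } // partial update
  | { kind: "replace"; id: string; row: T }            // whole-row replace
  | { kind: "delete"; id: string };

// The client keeps its replica current by folding messages in.
function apply<T extends { id: string }>(
  store: Map<string, T>,
  msg: SyncMessage<T>,
): void {
  switch (msg.kind) {
    case "snapshot":
      store.clear();
      for (const row of msg.rows) store.set(row.id, row);
      break;
    case "patch": {
      const row = store.get(msg.id);
      if (row) store.set(msg.id, { ...row, ...msg.changes });
      break;
    }
    case "replace":
      store.set(msg.id, msg.row);
      break;
    case "delete":
      store.delete(msg.id);
      break;
  }
}
```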

mweidner 2 months ago

Indeed, the simple approach of "send your operations to the server and it will apply them in the order it receives them" gives you good-enough conflict resolution in many cases.

It is still tempting to turn to CRDTs to solve the next problem: how to apply server-side changes to a client when the client has its own pending local operations. But this can be solved in a fully general way using server reconciliation, which doesn't restrict your operations or data structures like a CRDT does. I wrote about it here: https://mattweidner.com/2024/06/04/server-architectures.html...
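A rough sketch of that idea on the client side (the Op/State shapes here are hypothetical; the linked post covers the real design space): keep the last confirmed server state plus a queue of pending local ops, and recompute the optimistic view by replaying the pending ops on top of whatever the server sends.

```typescript
// Server reconciliation, client side: the view is always
// serverState + pending local ops, replayed in order.
type State = Record<string, unknown>;
type Op = { id: string; apply: (s: State) => State };

class ClientDoc {
  private serverState: State = {};
  private pending: Op[] = [];
  view: State = {};

  // Apply locally right away (optimistic) and send to the server.
  applyLocal(op: Op): void {
    this.pending.push(op);
    this.view = op.apply(this.view);
  }

  // The server sends its authoritative state plus the ids of ops it
  // has already applied; drop those and replay the rest on top.
  onServerUpdate(state: State, ackedOpIds: Set<string>): void {
    this.serverState = state;
    this.pending = this.pending.filter((op) => !ackedOpIds.has(op.id));
    this.view = this.pending.reduce((s, op) => op.apply(s), this.serverState);
  }
}
```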

klabb3 2 months ago

Just got to reading this.

> how to apply server-side changes to a client when the client has its own pending local operations

I liked the option of restore and replay on top of the updated server state. I'm wondering when this causes perf issues? For one, local changes should propagate fast after e.g. a network partition, even if the person has queued up a lot of them (say, during a flight).

Anyway, my thinking is that you can avoid many consensus problems by just partitioning data ownership. The like example is interesting in this way: a like count is an aggregate based on multiple data owners, and everyone else just passively follows with read replication. So thinking in terms of shared write access is the wrong problem description, imo, when in reality "liked posts" is data exclusively owned by each of the different nodes doing the liking (subject to a limit of one like per post). A server aggregate could exist, but it is owned by the server, so no shared write access is needed.

Similarly, say you have a messaging service. Each participant owns their own messages and others follow. No conflicts are needed. However, you can still break the protocol (say, liking twice). Those cases can be considered malformed and e.g. ignored. In some cases, you can copy someone else's data and make it your own, for instance to protect against impersonation: say that you can change your own nickname, and others follow. This can be exploited to impersonate, but you can keep a local copy of the last seen nickname and then display a "changed name" warning.
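A toy sketch of that ownership split (all names invented for illustration): each user's client is the sole writer of its own like-set, other peers read-replicate it, and the count is a derived aggregate owned by the server, so no shared write access exists anywhere.

```typescript
// Single-writer ownership for likes: each user owns their like-set.
type UserId = string;
type PostId = string;

// Owned by one user; replicated read-only to everyone else.
// Using a Set also enforces "at most one like per post".
type LikeSet = Set<PostId>;

// Owned by the server: a derived aggregate that clients never write.
function likeCount(replicas: Map<UserId, LikeSet>, post: PostId): number {
  let count = 0;
  for (const likes of replicas.values()) {
    if (likes.has(post)) count++;
  }
  return count;
}
```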

Anyway, I'm just a layman who wants things to be simple. It feels like CRDTs have been the ultimate nerd-snipe, and when I did my own evaluations I was disappointed with how heavyweight and opaque they were a few years ago (and probably still are).

ochiba 2 months ago

> Yes, they are fascinating, and yes, they solve real problems, but they are absolute overkill for your problems (except collab editing), at least currently. Why? Because they are all about conflict resolution. You can get very far without addressing this problem: for instance a cache, like you mentioned, has no need for conflict resolution. The main data store owns the data, and the cache follows. If you can have single ownership (single writer), or last write wins, or similar, you can drop a massive pile of complexity on the floor and not worry about it. (In the rare cases where it's necessary, like Google Docs or Figma, I would be very surprised if they use off-the-shelf CRDT libs - I would bet they have extremely bespoke, domain-specific data structures that are inspired by CRDTs.)

I agree with this. CRDTs are cool tech, but I think in practice most folks would be surprised by the high percentage of use cases that can be solved with much simpler conflict resolution mechanisms (perhaps combined with server reconciliation, as Matt mentioned). I also agree that collaborative document editing is a niche where CRDTs are indeed very useful.

satvikpendem 2 months ago

You might not need a CRDT [0]. But also, CRDTs are the future [1].

[0] https://news.ycombinator.com/item?id=33865672

[1] https://news.ycombinator.com/item?id=24617542

halfcat 2 months ago

> what I believe we need is end-to-end bidirectional stream-based data communication

I suspect the generalized solution is much harder to achieve, and looks more like batch-based reconciliation of full snapshots than streaming or event-driven.

The challenge comes when you aim to sync data sources whose managing parties are not incentivized to provide robust sync. Contrast that with Dropbox or similar, where a single party manages the data set and all software (server and clients); with ecosystems like Salesforce and Mulesoft, which have this as a stated business goal; or with ecosystems like blockchains, where independent parties are still highly incentivized to coordinate and have technically robust mechanisms to accomplish it, like Merkle trees. You can achieve sync in those scenarios because the independent parties are incentivized to coordinate (or there is only one party).

But if you have two or more independent systems, all of which provide some kind of API or import/export mechanism, you can never guarantee those systems will stay in sync using a streaming or event-driven approach. Worse, those systems will inevitably drift out of sync, or worse still, will propagate incorrect data across multiple systems, which can then only be reconciled by batch-like point-in-time snapshots. That raises the question of why you'd use streaming if you ultimately need batch to make it work reliably.

Put another way, people say batch is a special case of streaming, so just use streaming. But you could also say streaming is a fragile form of sync, so just use sync. But sync is a special case of batch, so just use batch.

9rx 2 months ago

> In the rare cases where it's necessary, like Google Docs or Figma, I would be very surprised if they use off-the-shelf CRDT libs

Or CRDTs at all. Google Docs is based on operational transforms and Figma on what they call multiplayer technology.

josephg 2 months ago

I agree! Lots more things are sync problems. Also: the state of my source files -> my compiler (in watch mode), and about 20 different APIs in the kernel - from keyboard state to filesystem watching to process monitoring to connected USB devices.

Also, HTTP caching is sort of a special case of sync - the cache (say, nginx) is trying to keep a synchronised copy of a resource from the backend web server. But because there's no way for the web server to notify nginx that the resource has changed, you get both stale reads and unnecessary polling. Doing fan-out would be way more efficient than a keep-alive header, if we had a way to do it!

CRDTs are cool tech. (I would know - I've been playing with them for years.) But I think it's worth dividing data interfaces into two types: owned data and shared data. Owned data has a single owner (e.g. the database, the kernel, the web server) and other devices live downstream of that owner. Shared data sources are more complex systems - e.g. everyone in the network has a copy of the data and can make changes, and then it's all eventually consistent. Or raft/paxos. Think git, or a distributed database. And they can be combined - e.g., the app server is downstream of a distributed database. GitHub Actions is downstream of a git repo.

I’ve been meaning to write a blog post about this for years. Once you realise how ubiquitous this problem is, you see it absolutely everywhere.

miki123211 2 months ago

And then there's the third super-special category of shared data with no central server, and where only certain users should be allowed to perform certain operations. This comes up most often in p2p networks, censorship resistance etc.

In most cases, the easiest approach there is just "slap a blockchain on it", as a good and modern (think Ethereum, not Bitcoin) blockchain essentially "abstracts away" the decentralization and mostly acts like a centralized computer to higher layers.

That is certainly not the only viable approach, and I wish we looked at others more. For example, a decentralized DNS-like system, without an attached cryptocurrency but with global consensus on what a given name points to, would be extremely useful. I'm not convinced that such a thing is possible; you need some way of preventing one bad actor from grabbing all the names, and monetary compensation seems like the easiest one. But we should be looking in this direction a lot more.

josephg 2 months ago

> And then there's the third super-special category of shared data with no central server, and where only certain users should be allowed to perform certain operations. This comes up most often in p2p networks, censorship resistance etc.

In my mind, this is just the second category again. It’s just a shared data system, except with data validation & Byzantine fault tolerance requirements.

It’s a surprisingly common and thorny problem. For example, I could change my local git client to generate invalid / wrong hashes for my commits. When I push my changes, other peers should - in some way - reject them. PVH (of Ink&Switch) has a rule when thinking about systems like this. He says you’re free to deface your own copy of the US constitution. But I don’t have to pull your changes.

Access control makes the BFT problem much worse. The classic problem is that if two admins concurrently remove each other, it's not clear what happens. In a CRDT (or git), peers are free to backdate their changes to any arbitrary point in the past. If you try to implement user roles on top of a CRDT, it's a nightmare. I think CRDTs are just the wrong tool for thinking about access control.

jkaptur 2 months ago

I can't wait to read that blog post. I know you're an expert in this and respect your views.

One thing I think is missing in the discussion about shared data (and maybe you can correct me) is that there are two ways of looking at the problem:

* The "math/engineering" way, where once state is identical you are done!

* The "product manager" way, where you have reasonable-sounding requests like "I was typing in the middle of a paragraph, then someone deleted that paragraph, and my text was gone! It should be its own new paragraph in the same place."

Literally having identical state (or even identical state that adheres to a schema) is hard enough, but I'm not aware of techniques that ensure 1) identical state, 2) adherence to a schema, and 3) behavior that anyone on the team can easily modify in response to "PM-like" demands without being a sync expert.

ochiba 2 months ago

> And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download what's marked as changed' and then get stuck in a quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates, etc.).

I've spent 16 years working on a sync engine and have worked with hundreds of enterprises on sync use cases during this time. I've seen countless cases of developers underestimating the complexity of sync. In most cases it happens exactly as you said: start with a naive approach and then the fractal complexity spiral starts. Even if the team is able to do the initial implementation, maintaining it usually turns into a burden that they eventually find too big to bear.

danielvaughn 2 months ago

CRDTs work well for linear data structures, but there are known issues with hierarchical ones. For instance, if you have a tree, two clients could concurrently send moves that together cause a node to become an ancestor of itself (a cycle).

That said, there’s work that has been done towards fixing some of those issues.

Evan Wallace (I think he’s the CTO of Figma) has written about a few solutions he tried for Figma’s collaborative features. And then Martin Kleppmann has a paper proposing a solution:

https://martin.kleppmann.com/papers/move-op.pdf
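Very roughly, the move-op approach puts all moves in a total order and skips any move that would make a node an ancestor of itself; a sketch of that safety check (simplified: the paper also handles undoing and redoing moves when late operations arrive):

```typescript
// Skip tree moves that would create a cycle.
type NodeId = string;
type ParentMap = Map<NodeId, NodeId>; // child -> parent

// Walk up from `node`; is `anc` one of its ancestors?
function isAncestor(p: ParentMap, anc: NodeId, node: NodeId): boolean {
  for (let cur = p.get(node); cur !== undefined; cur = p.get(cur)) {
    if (cur === anc) return true;
  }
  return false;
}

function applyMove(p: ParentMap, node: NodeId, newParent: NodeId): void {
  // Moving a node under itself or under its own descendant would
  // create a cycle, so treat such a move as a no-op.
  if (node === newParent || isAncestor(p, node, newParent)) return;
  p.set(node, newParent);
}
```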

rapnie 2 months ago

Martin Kleppmann, in one of his recent talks about the future of local-first, mentions the need for a generic sync service for the 'local-first end-game' [0], as he calls it. Standardization is needed. Right now everyone and their mother is doing sync differently, building production platforms around their own protocols and mechanisms.

[0] https://www.youtube.com/watch?v=NMq0vncHJvU&t=1016s

tmpfs 2 months ago

The problem is that the requirements can be vastly different. A collaborative editor is very different from, say, syncing encrypted blobs. Perhaps there is a one-size-fits-all solution, but I doubt it.

I've been working on sync for the latter use case for a while and CRDTs would definitely be overkill.

layer8 2 months ago

Automatic conflict resolution will always be limited. For example, who seriously believes that we'll ever be able to fully automate the handling of merge conflicts in version control? (Even if we recorded every single edit operation at the syntax-tree level.) And in regular documents the situation is worse, because you don't have formal parsers and type checkers and unit tests for them. Even for schematized structured data, there are similar issues at the semantic level that a mere "it conforms to the schema" doesn't solve.

lifty 2 months ago

Indeed. So conflict resolution that takes input from the user needs to be part of the protocol. Just like in Git.

cyanydeez 2 months ago

or from the LLM. It can't be super complicated to describe a logic schema to do a merge like that. It's just business logic vs. universal logic.

jdvh 2 months ago

As long as all clients agree on the order of CRDT operations, cycles are no problem: it's just an invalid transaction that can be dropped. Invalid or contradictory updates can always happen (regardless of sync mechanism), and the resolution is a UX issue. In some cases you might want to inform the user, in others the user can choose how to resolve the conflict, and in others quiet failure is fine.

jakelazaroff 2 months ago

Unfortunately, a hard constraint of (state-based) CRDTs is that merging causally concurrent changes must be commutative. I.e., it is possible that clients will not be able to agree on the order of CRDT operations, and they must arrive at the same state after applying them in any order.
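Concretely, commutativity here means the merge function itself must not care about arrival order. A last-writer-wins register is about the smallest example (a sketch, using a (timestamp, replica) pair as the deterministic tiebreak):

```typescript
// LWW register: merge keeps the write with the larger
// (timestamp, replica) pair, so mergeLWW(a, b) === mergeLWW(b, a).
type LWW<T> = { value: T; ts: number; replica: string };

function mergeLWW<T>(a: LWW<T>, b: LWW<T>): LWW<T> {
  if (a.ts !== b.ts) return a.ts > b.ts ? a : b;
  return a.replica > b.replica ? a : b; // deterministic tiebreak
}
```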

jdvh 2 months ago

I don't think that's required, unless you definitionally believe otherwise.

When clients disagree about the order of events and a conflict results, clients can be required to roll back (apply the inverse of each change) to the last point in time where all clients were in agreement about the world state. Then all clients re-apply all changes in the new, now-agreed-upon order. Now all changes have been applied, there is agreement about the world state, and the process starts anew.

This way multiple clients can work offline for extended periods of time and then reconcile with other clients.

satvikpendem 2 months ago

Eg-walker seems similar to what you're proposing [0]. A more in-depth video by the creator [1].

[0] https://loro.dev/docs/advanced/event_graph_walker

[1] https://www.youtube.com/watch?v=rjbEG7COj7o

dboreham 2 months ago

That's not how the CRDT concept works.

jdvh 2 months ago

You're free to argue that this isn't "pure" CRDT, but the CRDT algorithm still runs normally, just a bit later than it otherwise would.

mackopes 2 months ago

I'm not convinced that there is one generalised solution to sync engines. To make them truly performant at large scale, engineers need a deep understanding of the underlying technology - query performance, the database, networking - and need to build a custom sync engine around their product and their data.

Abstracting all of this complexity away in one general tool/library and pretending that it will always work is snake oil. There are no shortcuts to building truly high quality product at a large scale.

wim 2 months ago

We've built a sync engine from scratch. Our app is a multiplayer "IDE" but for tasks/notes [1], so it's important to have a fast local-first/offline experience like other editors, and to have changes sync in the background.

I definitely believe sync engines are the future as they make it so much easier to enable things like no-spinners browsing your data, optimistic rendering, offline use, real-time collaboration and so on.

I'm also not entirely convinced yet, though, that it's possible to get away with something that's not custom-built, or at least large parts of it. There were so many micro-decisions and trade-offs going into the engine: What is the granularity of updates (characters, rows?) that we need, and how does that affect performance? Do we need a central server for things like permissions and real-time collaboration? If so, do we want just deltas, or also state snapshots for speedup? How much versioning do we need, and what are the implications of that? Is there end-to-end encryption, and how does that affect what the server can do? What kind of data structure is being synced - a simple list/map, or a graph with potential cycles? What kind of conflict resolution business logic do we need, and where does that live?

It would be cool to have something general purpose so you don’t need to build any of this, but I wonder how much time it will save in practice. Maybe the answer really is to have all kinds of different sync engines to pick from and then you can decide whether it's worth the trade-off not having everything custom-built.

[1] https://thymer.com

mentalgear 2 months ago

Optimally, a sync engine would be configurable with the best settings for the project (e.g. central server or completely decentralised). It'd be great if one engine were that performant/configurable, but having a lot of sync engines to choose from for your project is the best alternative.

btw: excellent questions/insights - about the same ones I came across in my own lo-fi ventures.

Would be great if someone could assemble all these questions in a "walkthrough" step-by-step interface and in the end, the user gets a list of the best matching engines.

Edit: Mh ... maybe something small enough to vibe code ... if someone is interested in helping, let me know!

jdvh 2 months ago

Completely decentralized is cool, but I think there are two key problems with it.

1) in a decentralized system who is responsible for backups? What happens when you restore from a backup?

2) in a decentralized system who sends push notifications and syncs with mobile devices?

I think that in an age of $5/mo cloud VMs and free SSL, having a single coordination server has all the advantages and none of the downsides.

tonsky 2 months ago

- You can have many sync engines

- Sync engines might only solve small and medium scale; that would be a huge win even without large scale

thr0w 2 months ago

> Abstracting all of this complexity away in one general tool/library and pretending that it will always work is snake oil.

Remember Meteor?

xg15 2 months ago

That might be true, but you might not have those engineers or they might be busy with higher-priority tasks:

> It’s also ill-advised to try to solve data sync while also working on a product. These problems require patience, thoroughness, and extensive testing. They can’t be rushed. And you already have a problem on your hands you don’t know how to solve: your product. Try solving both, fail at both.

Also, you might not have that "large scale" yet.

(I get that you could also make the opposite case, that the individual requirements for your product are so special that you cannot factor out any common behavior. I'd see that as a hypothesis to be tested.)

tbrownaw 2 months ago

> decoupled from the horrors of an unreliable network

The first rule of network transparency is: the network is not transparent.

> Or: I’ve yet to see a code base that has maintained a separate in-memory index for data they are querying

Is boost::multi_index_container no longer a thing?

Also there's SQLite with the :memory: database.

And this ancient 4gl we use at work has in-memory tables (as in database tables, with typed columns and any number of unique or not indexes) as a basic language feature.

anonyfox 2 months ago

In Elixir/Erlang that's quite common, I think; at least I do this when performance matters. Put the specific subset of commonly used data into an ETS table (= in-memory cache, allowing concurrent reads) and have a GenServer (which owns that table) listen to certain database change events to update the data in the table as needed.

Helps a lot with high read situations and takes considerable load off the database with probably 1 hour of coding effort if you know what you're doing.
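The same pattern sketched outside the BEAM, in TypeScript (the change-event shape is a stand-in for whatever change feed the database exposes, e.g. LISTEN/NOTIFY in Postgres):

```typescript
// An in-memory read cache kept current by database change events:
// one writer (the change-feed subscriber), many concurrent readers.
type Row = { id: string } & Record<string, unknown>;
type ChangeEvent = { op: "upsert" | "delete"; row: Row };

class ReadCache {
  private table = new Map<string, Row>();

  // Hot path: served from memory, no database round-trip.
  get(id: string): Row | undefined {
    return this.table.get(id);
  }

  // Single writer: subscribe this once to the database's change feed.
  handleChange(event: ChangeEvent): void {
    if (event.op === "upsert") this.table.set(event.row.id, event.row);
    else this.table.delete(event.row.id);
  }
}
```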

TeMPOraL 2 months ago

> Is boost::multi_index_container no longer a thing?

Depends on the shop. I haven't seen one in production so far, but I don't doubt some people use it.

> Also there's SQLite with the :memory: database.

Ah, now that's cheating. I know, because I did that too. I did it after realizing that half the members I was stuffing into classes to store my game state were effectively a poor man's hand-rolled tables, indices and spatial indices, so why not just use a proper database for this?
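For instance (a sketch assuming Node with the better-sqlite3 package), game state can live in a real table with real indexes instead of hand-rolled maps:

```typescript
import Database from "better-sqlite3";

// An in-memory database: real tables and indexes, no file on disk.
const db = new Database(":memory:");
db.exec(`
  CREATE TABLE entities (
    id   INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,
    x    REAL NOT NULL,
    y    REAL NOT NULL
  );
  CREATE INDEX entities_pos ON entities (x, y); -- poor man's spatial index
`);

const insert = db.prepare("INSERT INTO entities (kind, x, y) VALUES (?, ?, ?)");
insert.run("player", 0, 0);
insert.run("monster", 10, 5);

// A range query that uses the index instead of a hand-rolled scan.
const nearby = db
  .prepare("SELECT * FROM entities WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ?")
  .all(-5, 15, -5, 10);
console.log(nearby);
```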

> And this ancient 4gl we use at work has in-memory tables (as in database tables, with typed columns and any number of unique or not indexes) as a basic language feature.

Which one is this? I've argued in the past that this is a basic feature missing from 4GL languages, and a lot of work in every project is wasted on hand-rolling in-memory databases left and right, without realizing it. It would seem I've missed a language that recognized this fact?

(But then, so did most of the industry.)

tbrownaw 2 months ago

> Which one is this? I've argued in the past that this is a basic feature missing from 4GL languages, and a lot of work in every project is wasted on hand-rolling in-memory databases left and right, without realizing it. It would seem I've missed a language that recognized this fact?

https://en.wikipedia.org/wiki/OpenEdge_Advanced_Business_Lan...

Dates back to 1981, called "Progress 4GL" until 2006.

https://docs.progress.com/bundle/abl-reference/page/DEFINE-T...

phyrex 2 months ago

ABAP, the SAP language, has that, if I remember correctly.

aiono 2 months ago

> The first rule of network transparency is: the network is not transparent.

That's precisely why the current request model is painful.

ximm 2 months ago

> have a theory that every major technology shift happened when one part of the stack collapsed with another.

If that were true, we would ultimately end up with a single layer. Instead, I would say that major shifts happen when we move the boundaries between layers.

The author here proposes to replace servers by synced client-side data stores.

That is certainly a good idea for some applications, but it also comes with drawbacks. For example, it would be easier to avoid stale data, but it would be harder to enforce permissions.

worthless-trash 2 months ago

I feel like this is the "serverless" discussion all over again.

There was still a server, it's just not YOUR server. In this case there will still be servers, just maybe not something that you need to manage state on.

This misnaming creates endless conflict when trying to communicate this with hyper excited management who want to get on the latest trend.

Can't wait to be in a meeting and hear: "We don't need servers when we migrate to client-side data stores".

TeMPOraL 2 months ago

I think the management isn't hyper-excited about naming - in fact, they couldn't care less what the name means (it's just a buzzword). What they're excited about is what the thing does - which is turn more capex into opex. With "cloud", we can subscribe to servers instead of owning them. With "serverless", we can subscribe directly to what servers do, without managing the servers themselves. Etc.

Diederich 2 months ago

Recently, something quite rare happened. I needed to Xerox some paper documents. Well, such actions are rare today, but years ago, it was quite common to Xerox things.

Over time, the meaning of the word 'Xerox' changed. More specifically, it gained a new meaning. For a long time, Xerox referred only to a company, so named in 1961. Some time in the late 60s, it started to be used as a verb, and as I was growing up in the 70s and 80s, the word 'Xerox' was overwhelmingly used in its verb form.

Our society decided as a whole that it was OK for the noun Xerox to be used as a verb. That's a normal and natural part of language development.

As others have noted, management doesn't care whether the serverless thing you want to use is running on servers or not. They care that they don't have to maintain servers themselves. CapEx vs OpEx and all that.

I agree that there could be some small hazard in the idea that, if I run my important thing in a 'serverless' fashion, then I don't have to associate any of the problems/challenges/concerns I have with 'servers' with my important thing.

It's an abstraction, and all abstractions are leaky.

If we're lucky, this abstraction will, on average, leak very little.

philsnow 2 months ago

> Over time, the meaning of the word 'Xerox' changed. More specifically, it gained a new meaning. For a long time, Xerox referred only to a company, so named in 1961. Some time in the late 60s, it started to be used as a verb, and as I was growing up in the 70s and 80s, the word 'Xerox' was overwhelmingly used in its verb form.

https://www.youtube.com/watch?v=PZbqAMEwtOE#t=5m58s I don't think this dramatization (of court proceedings from 2010) is related to Xerox's plight of losing their trademark, but the dramatization is brilliant nonetheless.
