Hacker Remix

Drasi: Microsoft's open source data processing platform for event-driven systems

327 points by benocodes 3 days ago | 67 comments

CharlieDigital 3 days ago

Very interesting choice of using Cypher[0]

In 2014, we built a similar kind of event-driven system, specifically for document distribution: a document can be distributed to a target set of entities, and if a new entity is added, we need to resolve which distributions match. We also ended up using Cypher via Neo4j because of the complex taxonomical structure of how we mapped entities.

It is a super underrated query language and while most of the queries could also be translated to relational SQL, Cypher's linear construction using WITH clauses is far, far easier to reason about, IMO.

EDIT: feel like the devs went overboard with the mix of languages. Shoehorned in C# Blazor? Using JS and Jest for e2e testing?

[0] https://drasi.io/reference/query-language/

leeoniya 3 days ago

> while most of the queries could also be translated to relational SQL, Cypher's linear construction using WITH clauses is far, far easier to reason about, IMO.

https://prql-lang.org/

CharlieDigital 3 days ago

Didn't look too deeply, but one of the keys with Cypher (at least in the context of graph databases) is that it has a nice way of representing `JOIN` operations as graph traversals.

    MATCH (p:Person)-[r]-(c:Company) RETURN p.Name, c.Name

Where `r` can represent any relationship (AKA `JOIN`) between the two collections `Person` and `Company` such as `WORKS_AT`, `EMPLOYED_BY`, `CONTRACTOR_FOR`, etc.

So I'd say that linear queries are one of the things I like about Cypher, but the clean abstraction of complex `JOIN` operations is another huge one.
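
To make the traversal idea concrete, here is a rough Python sketch of what that `MATCH` pattern does conceptually. It is not how Neo4j executes queries. The sample nodes, the `Edge` type, the `EMPLOYS` relationship, and `match_person_company` are all made up for illustration.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    src: str  # id of the node the relationship starts at
    rel: str  # relationship type, e.g. WORKS_AT
    dst: str  # id of the node the relationship points to


# Hypothetical sample data: node id -> (label, Name)
nodes = {
    "p1": ("Person", "Ada"),
    "p2": ("Person", "Grace"),
    "c1": ("Company", "Initech"),
}

# Relationships of different types and different stored directions.
edges = [
    Edge("p1", "WORKS_AT", "c1"),
    Edge("c1", "EMPLOYS", "p2"),  # hypothetical type, stored Company -> Person
]


def match_person_company(nodes, edges):
    """Rough analogue of: MATCH (p:Person)-[r]-(c:Company) RETURN p.Name, c.Name"""
    rows = []
    for e in edges:
        # -[r]- has no arrow, so try each edge in both orientations.
        # The relationship type is never checked, so any type matches.
        for a, b in ((e.src, e.dst), (e.dst, e.src)):
            if nodes[a][0] == "Person" and nodes[b][0] == "Company":
                rows.append((nodes[a][1], nodes[b][1]))
    return rows


result = match_person_company(nodes, edges)
print(result)
```

Both people come back even though the two relationships have different types and are stored in opposite directions, which is the part that would take several explicit `JOIN`s in SQL.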

inkyoto 2 days ago

> […] Where `r` can represent any relationship […]

… and «-[r]-» can represent any relationship direction, which obviates the need for constructing separate queries for inverse traversing relationships. Kinda like running a compiler forward and backward.

UltraSane 3 days ago

The neat thing about Neo4j is that the [r] isn't a join, it is an actual relationship stored on disk.

refset 2 days ago

Like a many-to-many join table?

CharlieDigital 20 hours ago

Like a many-to-many set of join tables because the `[r]` can represent any relationship between any two collections.
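
For comparison, a minimal sketch of the relational version using Python's built-in `sqlite3`, assuming one join table per relationship type. The schema, table names, and sample rows are hypothetical and only meant to show the shape of the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
-- One join table per relationship type (hypothetical schema).
CREATE TABLE works_at       (person_id INTEGER, company_id INTEGER);
CREATE TABLE contractor_for (person_id INTEGER, company_id INTEGER);
INSERT INTO person  VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO company VALUES (1, 'Initech');
INSERT INTO works_at       VALUES (1, 1);
INSERT INTO contractor_for VALUES (2, 1);
""")

# Emulating (p:Person)-[r]-(c:Company): every join table has to be
# listed explicitly and unioned before joining to the entity tables.
rows = conn.execute("""
SELECT p.name, c.name, r.rel
FROM (
    SELECT person_id, company_id, 'WORKS_AT' AS rel FROM works_at
    UNION ALL
    SELECT person_id, company_id, 'CONTRACTOR_FOR' AS rel FROM contractor_for
) AS r
JOIN person  p ON p.id = r.person_id
JOIN company c ON c.id = r.company_id
ORDER BY p.name
""").fetchall()

for row in rows:
    print(row)
```

Adding a new relationship type means adding another table and another `UNION ALL` branch here, whereas the Cypher pattern stays the same.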

UltraSane 2 days ago

[dead]

robertlagrant 3 days ago

We made a health backend partly using Cypher and the only thing I found was the simple queries looked amazing, but as soon as you need to join non-linearly it started looking a lot like SQL again. And when you're using an ORM it stops mattering. And when you need migrations it gets painful!

CharlieDigital 3 days ago

> but as soon as you need to join non-linearly

At least in our use case, even with some very gnarly 20+ line Cypher queries, it never got to the point where it felt like SQL and certainly, those same queries would be even gnarlier as nested sub-selects, CTEs, or recursive selects, IMO.

Perhaps a characteristic of our model (a taxonomy of Region, Country, Sponsor, Program, Trial, Site, Staff for global clinical trials and documents required by Region/Country/Program/Trial).

UltraSane 3 days ago

Cypher works really well with a well defined taxonomy.

UltraSane 3 days ago

"you need to join non-linearly "

What does this mean?

FromOmelas 2 days ago

presumably it has a semantic model of sorts, defining intrinsic relationships between entities (parent-child, composed-of, sibling-of, and so on)

A bit similar to how certain joins in SQL can be very straightforward with the "USING" clause, or when it can rely on extra information such as analytic views to derive materialized views (vendor specific)

JanSt 3 days ago

I too have great memories of cypher. Such an elegant way to write queries.

CharlieDigital 3 days ago

If you haven't been following it, I recently found out that it is now supported in a limited capacity by Google Spanner[0]. The openCypher initiative started a few years back and it looks like it's evolved into the (unfortunate moniker) GQL[1].

So it may be the case that we'll see more Cypher out in the wild.

[0] https://cloud.google.com/spanner/docs/graph/opencypher-refer...

[1] https://neo4j.com/blog/cypher-gql-world/

dxxvi 2 days ago

Is this the same as what can be done with Apache Kafka Connect (to get data from another source into a Kafka cluster) plus Kafka (including Kafka Streams)? This image (https://github.com/drasi-project/community/raw/main/images/d...) is like Kafka Streams with a single topic. This image (https://github.com/drasi-project/community/raw/main/images/c...) is like joining 2 streams in Kafka Streams.

ultrafez 2 days ago

It also seems reminiscent of KSQL - consuming multiple input topics, and producing output to a topic defined using a query written in a SQL-like language that defines how the inputs are combined and filtered.
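
As a toy illustration of that pattern (not KSQL's or Drasi's actual API), here is a plain-Python sketch that consumes events from two input "topics", joins them on a key, filters the joined rows, and appends the survivors to an output list. The topic names, record fields, and `join_streams` function are assumptions made up for the example.

```python
from collections import defaultdict


def join_streams(events, predicate):
    """Incrementally join 'orders' to 'customers' on customer id, then filter."""
    customers = {}               # latest customer record by id
    pending = defaultdict(list)  # orders that arrived before their customer
    out = []                     # stands in for the output topic

    def emit(order, customer):
        row = {**order, "customer": customer["name"]}
        if predicate(row):
            out.append(row)

    for topic, record in events:
        if topic == "customers":
            customers[record["id"]] = record
            # Flush any orders that were waiting on this customer.
            for order in pending.pop(record["id"], []):
                emit(order, record)
        elif topic == "orders":
            customer = customers.get(record["customer_id"])
            if customer is None:
                pending[record["customer_id"]].append(record)
            else:
                emit(record, customer)
    return out


# Hypothetical interleaved input from the two topics.
events = [
    ("orders", {"order_id": 1, "customer_id": 7, "total": 120}),
    ("customers", {"id": 7, "name": "Ada"}),
    ("orders", {"order_id": 2, "customer_id": 7, "total": 15}),
]

output = join_streams(events, predicate=lambda row: row["total"] >= 100)
print(output)
```

A real stream processor also has to handle windowing, updates, deletes, and persistence of the join state, none of which this sketch attempts.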

otterley 3 days ago

Looks very Azure-centric. Both installation guides (https://drasi.io/how-to-guides/install-sample-applications/b... and https://drasi.io/how-to-guides/install-sample-applications/c...) require Azure to work.

And then there's this:

> Installing Drasi in an EKS cluster can be significantly more complex than a standard installation on other platforms. Instead of downloading a CLI binary using the provided installation scripts, this approach requires modifying the source code of the Drasi CLI and building a local version of the CLI.

Is this an actual requirement or just the current easy path?

stackskipton 3 days ago

Azure SRE here, it doesn't appear to have any Azure dependencies. CLI rebuild seems to be that "drasi init" assumes Azure Kubernetes Service built in StorageClasses for Kubernetes PVC for Redis and Mongo and thus fails when running against EKS. I assume same thing would be required on GKE. Yes, it should be more modular but MVP.

As for other stuff, it's using Gremlin Query Language or Postgres, which are both open. In fact, it's going out of its way not to use Azure authentication, as loading a connection string as a Kubernetes secret is 100% AGAINST Azure Kubernetes Best Practice. Best Practice would be Workload Identity.

bob1029 2 days ago

> CLI rebuild seems to be that "drasi init" assumes Azure Kubernetes Service built in StorageClasses for Kubernetes PVC for Redis and Mongo and thus fails when running against EKS. I assume same thing would be required on GKE. Yes, it should be more modular but MVP.

None of these words are in the Bible.

ryanwjwaite 17 hours ago

You're right, it should work better on AWS, GCP, and other clouds. We'll get to that in future builds of Drasi. We've submitted to CNCF and, just like with Radius and Dapr, we'll make sure it works well on more than just Azure.

devjab 2 days ago

Every bit of Microsoft open source is created at least partly as a sales strategy for Azure. They usually start within the Azure infrastructure because, well, why wouldn’t they? Then eventually they tend to make it to where you can use them outside of Azure but they never quite leave the part where they are “better” if you’re an Azure customer.

Time will tell if Drasi is going to go the path where it becomes more easily useable outside of Azure (and in this case AWS) or it’ll go more of a Bicep route.

agentofreality 1 day ago

As mentioned above, Kubernetes is (intended to be) our only platform dependency right now. Drasi is not yet ready for production use and can be explored using k3s, kind, AKS, and EKS, which we felt provided sufficient initial options for people to choose from.

In the coming weeks we will get more of our Sources and Reactions documented as well as docs on how to create custom Sources and Reactions. In the short term, if people have Sources and Reactions they want so they can integrate with a wider range of up and downstream systems, we would love to help support their efforts in developing these.

The Drasi Team is most active over on our Discord channel (https://aka.ms/drasidiscord), where we are happy to answer questions and help people get started using Drasi.

agentofreality 2 days ago

I am the Drasi engineering lead and can assure you that any Azure-centricity is purely one of historical convenience and a lag in getting more of our non-Azure-centric doc, samples, and components published.

The main current dependency is having a K8s cluster.

You can run Drasi for dev/test on k3s (https://drasi.io/how-to-guides/installation/install-on-k3s/) or kind (https://drasi.io/how-to-guides/installation/install-on-kind/). Docker Desktop also works but is undocumented.

Cloud based options include AKS (we will release the instructions soon) and EKS as mentioned. When we tested on EKS, we hit some storage class issues and decided to publish this with some work-arounds instead of holding back until we do a proper fix, which we will prioritize if there is demand.

On prem K8s should also work, but we haven't put resources into testing those scenarios. We would love to engage with anybody that would be willing to try this out.

Also, in the future we are thinking about other delivery platforms, not just K8S. You will see in the code that our dependency on k8s is abstracted.

If you have any questions, the Drasi Team is most active over on our Discord channel (https://aka.ms/drasidiscord) and we would love to answer your questions and help you get started using Drasi.

gtani 5 minutes ago

Is there any lineage between this project and the ReactiveX family, now developed at endjin?

dtquad 3 days ago

That is usual for new Microsoft open source projects. It takes 1-2 months for the Azure dependencies to go away.

3abiton 2 days ago

I'm curious about the other examples? I get it though, as many of these projects are built fulfilling a specific need within MS infrastructure.

gigatexal 3 days ago

Oh this very much reminds me of [feldera](https://feldera.com) — they do incremental loads and computations using some novel approaches (most of which i am too dumb to follow). Really nice folks too.

woozyolliew 3 days ago

Or the related Materialize stuff https://materialize.com/

hobofan 3 days ago

I took a brief look into Drasi and it looks like it doesn't do any of the differential/timely dataflow stuff (like Materialize does), or any other sophisticated incremental view maintenance methods that are rooted in Microsoft Research.