Hacker Remix

Show HN: Hatchet v1 – A task orchestration platform built on Postgres

215 points by abelanger 1 day ago | 69 comments

Hey HN - this is Alexander from Hatchet. We’re building an open-source platform for managing background tasks, using Postgres as the underlying database.

Just over a year ago, we launched Hatchet as a distributed task queue built on top of Postgres with a 100% MIT license (https://news.ycombinator.com/item?id=39643136). The feedback and response we got from the HN community was overwhelming. In the first month after launching, we processed about 20k tasks on the platform — today, we’re processing over 20k tasks per minute (>1 billion per month).

Scaling up this quickly was difficult — every task in Hatchet corresponds to at minimum 5 Postgres transactions and we would see bursts on Hatchet Cloud instances to over 5k tasks/second, which corresponds to roughly 25k transactions/second. As it turns out, a simple Postgres queue utilizing FOR UPDATE SKIP LOCKED doesn’t cut it at this scale. After provisioning the largest instance type that CloudSQL offers, we even discussed potentially moving some load off of Postgres in favor of something trendy like Clickhouse + Kafka.

But we doubled down on Postgres, and spent about 6 months learning how to operate Postgres databases at scale and reading the Postgres manual and several other resources [0] during commutes and at night. We stuck with Postgres for two reasons:

1. We wanted to make Hatchet as portable and easy to administer as possible, and felt that implementing our own storage engine specifically on Hatchet Cloud would be disingenuous at best, and in the worst case, would take our focus away from the open source community.

2. More importantly, Postgres is general-purpose, which is what makes it both great and hard to scale for some types of workloads. This is also what allows us to offer a general-purpose orchestration platform: we heavily utilize Postgres features like transactions, SKIP LOCKED, recursive queries, triggers, COPY FROM, and much more.

Which brings us to today. We’re announcing a full rewrite of the Hatchet engine — still built on Postgres — together with our task orchestration layer which is built on top of our underlying queue. To be more specific, we’re launching:

1. DAG-based workflows that support a much wider array of conditions, including sleep conditions, event-based triggering, and conditional execution based on parent output data [1].

2. Durable execution: this refers to a function’s ability to recover from failure by caching intermediate results and automatically replaying them on a retry. We call a function with this ability a durable task. We also support durable sleep and durable events, which you can read more about here [2].

3. Queue features such as key-based concurrency queues (for implementing fair queueing), rate limiting, sticky assignment, and worker affinity.

4. Improved performance across every dimension we’ve tested, which we attribute to six improvements to the Hatchet architecture: range-based partitioning of time series tables, hash-based partitioning of task events (for updating task statuses), separating our monitoring tables from our queue, buffered reads and writes, switching all high-volume tables to use identity columns, and aggressive use of Postgres triggers.
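
As a rough illustration of the durable execution idea in item 2 (caching intermediate results and replaying them on a retry), here is a minimal Python sketch. It is not Hatchet's implementation or SDK: `DurableContext`, `run_with_retries`, and the example steps are hypothetical names, and a plain dict stands in for durable storage.

```python
from typing import Any, Callable, Dict, Tuple


class DurableContext:
    """Hypothetical helper: caches each named step's result for later replay."""

    def __init__(self, store: Dict[str, Any]) -> None:
        # `store` stands in for durable storage such as a database table.
        self.store = store

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # If an earlier attempt already completed this step, replay its result.
        if name in self.store:
            return self.store[name]
        result = fn()
        self.store[name] = result
        return result


def run_with_retries(task: Callable[[DurableContext], Any],
                     store: Dict[str, Any],
                     max_attempts: int = 3) -> Any:
    # Re-run the whole task on failure; completed steps are served from `store`.
    for attempt in range(1, max_attempts + 1):
        try:
            return task(DurableContext(store))
        except Exception:
            if attempt == max_attempts:
                raise


calls = {"charge": 0, "email": 0}


def charge_card() -> str:
    calls["charge"] += 1
    return "charge-123"


def send_email() -> str:
    calls["email"] += 1
    if calls["email"] == 1:
        raise RuntimeError("transient failure")  # fails on the first attempt only
    return "sent"


def order_task(ctx: DurableContext) -> Tuple[str, str]:
    charge_id = ctx.step("charge", charge_card)
    status = ctx.step("email", send_email)
    return charge_id, status


store: Dict[str, Any] = {}
result = run_with_retries(order_task, store)
```

In this sketch the second attempt skips `charge_card` because its result is already cached, so the card is charged once even though the task body runs twice.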

We've also removed RabbitMQ as a required dependency for self-hosting.

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.

[0] https://www.postgresql.org/docs/

[1] https://docs.hatchet.run/home/conditional-workflows

[2] https://docs.hatchet.run/home/durable-execution

bosky101 20 minutes ago

Here is my feedback after spending 15 mins on your docs.

You may want to replace Hello world examples with real world scenarios.

For workflows that involve multi-step tasks (DAGs, in your terminology), the code simply isn't intuitive.

You now have to get into Hatchet's mindset, patterns, and terminology. E.g., the random number example is riddled with too many. How many of the logos on your homepage did you have to write code for? Be honest.

Knowing js should be 90% enough.

   // send("hi", user => user.signed_up_today)
   //  .waitFor("7d")
   //  .send("upgrade", user => !user.upgraded)
Just made this up, but something like this is more readable. The whole point of being smart is for your team at Hatchet to absorb the difficulty for the benefit of an easy interface that looks simple and magic. Your 5-line examples have types to learn, functions to learn, arguments to know: 5-10 kinds of things to learn. It showed little effort to make it easy for customers.

An engineering post on what's under the hood makes sense. But customers really don't care about your cloud infra flexes in a post introducing your company pitching the product. It's just koolaid.

Same with a complete rewrite so early. I'm glad you are open to change. But with the workflow market today having so many options, I don't believe this is the last rewrite or pivot to come.

The DAGs themselves aren't very readable. You are better off switching to something like React Flow that lets you no-code edit as well.

Focus on automation journeys that are common. Like cookbooks. And allow folks to just import them or change some configurations, like drip marketing.

How does a workflow replace a SaaS they are paying $99 for? That's powerful.

Good luck, and sorry for coming off as rude.

bn-l 2 minutes ago

Are you saying the JavaScript api shouldn’t use types?

drdaeman 3 hours ago

Looks nice at first glance, congrats on the launch! May I ask a few questions, please?

- Does it support durable tasks that should essentially run forever and produce an endless "stream" of events, self-healing in case of intermittent failures? Or would those be a better fit for some different kind of orchestrator?

- Where and how are task inputs and outputs stored? Are there any conveniences to make passing "weird" (that is, not some simple and reasonably-small JSON-encoded objects) things around easier (like Dagster's I/O managers) or is it all out of scope for Hatchet?

- Assuming that I can get ballpark estimates for the desirable number of tasks, their average input and output sizes, and my PostgreSQL instance's size and I/O metrics, can I somehow make a reasonable guesstimate on how many tasks per second the whole system can put through safely?

I'm currently in search of the Holy Grail (haha), evaluating all sorts of tools (Temporal, Dagster, Prefect, Faust, now looking at Hatchet) to find something that I would like the most. My project is a synchronization+processing system that has a bunch of dynamically-defined workflows that continuously work with external services (stores), look for updates (determine new, updated, or deleted products) and spawn product-level workflows to process those updates (standardize store-specific data into a unified shape, match against the canonical product catalog, etc.). Surely, this kind of pipeline can be built on nearly anything - I'm just trying to get a sense of how each of those systems feels to work with, what it's actually good at, what the gotchas and limitations are, and which tool would allow me to have the least amount of boilerplate.

Thanks!

followben 15 hours ago

How does this compare to other pg-backed python job runners like Procrastinate [0] or Chancy [1]?

[0] https://github.com/procrastinate-org/procrastinate/

[1] https://github.com/TkTech/chancy

gabrielruttner 11 hours ago

Gabe here, one of the Hatchet founders. I'm not very familiar with these runners, so someone please correct me if I missed something.

These look like great projects to get something running quickly, but likely will experience many of the challenges Alexander mentioned under load. They look quite similar to our initial implementation using FOR UPDATE and maintaining direct connections from workers to PostgreSQL instead of a central orchestrator (a separate issue that deserves its own post).

One of the reasons for this decision is to performantly support more complex scheduling requirements and durable execution patterns -- things like dynamic concurrency [0] or rate limits [1], which can be quite tricky to implement on a worker-pull model where there will likely be contention on these orchestration tables.

They also appear to be pure queues to run individual tasks in Python only. We've been working hard on our py, ts, and go SDKs.

I'm excited to see how these projects approach these problems over time!

[0] https://docs.hatchet.run/home/concurrency

[1] https://docs.hatchet.run/home/rate-limits
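
For readers unfamiliar with the key-based fair queueing mentioned in the launch post, here is a minimal in-memory sketch of the idea: a central component keeps one sub-queue per key and hands out tasks round-robin across keys, so one busy key cannot starve the others. `FairQueue` and the key names are hypothetical, and this says nothing about how Hatchet implements it in Postgres.

```python
from collections import OrderedDict, deque
from typing import Deque, Optional


class FairQueue:
    """Hypothetical round-robin queue: one FIFO sub-queue per key."""

    def __init__(self) -> None:
        # Insertion order of keys doubles as the round-robin order.
        self._queues: "OrderedDict[str, Deque[str]]" = OrderedDict()

    def push(self, key: str, task: str) -> None:
        self._queues.setdefault(key, deque()).append(task)

    def pop(self) -> Optional[str]:
        if not self._queues:
            return None
        # Take the next task from the key at the front of the rotation.
        key, queue = next(iter(self._queues.items()))
        task = queue.popleft()
        # Move the key to the back if it still has work; otherwise drop it.
        del self._queues[key]
        if queue:
            self._queues[key] = queue
        return task


fq = FairQueue()
for task in ("a1", "a2", "a3"):
    fq.push("tenant-a", task)
fq.push("tenant-b", "b1")
fq.push("tenant-c", "c1")

order = [fq.pop() for _ in range(5)]
```

Even though `tenant-a` enqueued three tasks first, the other tenants' tasks are served after only one of them.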

TkTech 5 hours ago

Chancy dev here.

I've intentionally chosen simplicity over performance when the choice is there. Chancy still happily handles millions of jobs and workflows a day with dynamic concurrency and global rate limits, even in low-resource environments. But it would never scale horizontally to the same level you could achieve with RabbitMQ, and it's not meant for massive multi-tenant cloud hosting. It's just not the project's goal.

Chancy's aim is to be the low dependency, low infrastructure option that's "good enough" for the vast majority of projects. It has 1 required package dependency (the postgres driver) and 1 required infrastructure dependency (postgres) while bundling everything inside a single ASGI-embeddable process (no need for separate processes like flower or beat). It's used in many of my self-hosted projects, and in a couple of commercial projects to add ETL workflows, rate limiting, and observability to projects that were previously on Celery. Going from Celery to Chancy is typically just replacing your `delay()/apply_async()` with `push()` and swapping `@shared_task()` with `@job()`.

If you have hundreds of employees and need to run hundreds of millions of jobs a day, it's never going to be the right choice - go with something like Hatchet. Chancy's for teams of one to dozens that need a simple option while still getting things like global rate limits and workflows.

wcrossbow 14 hours ago

Another good one is pgqueuer https://github.com/janbjorge/pgqueuer

INTPenis 15 hours ago

Celery also has a Postgres backend, but maybe it's not as well integrated.

igor47 14 hours ago

It's just a results backend, you still have to run rabbitmq or redis as a broker

stephen 8 hours ago

Do queue operations (enqueue a job & mark this job as complete) happen in the same transaction as my business logic?

Imo that's the killer feature of database-based queues, because it dramatically simplifies reasoning about retries, i.e. "did my endpoint logic commit _and_ my background operation enqueue both atomically commit, or atomically fail"?

Same thing for performing jobs, if my worker's business logic commits, but the job later retries (b/c marking the job as committed is a separate transaction), then oof, that's annoying.

And I might as well be using SQS at that point.
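
The atomic "commit both or fail both" property described above can be sketched as follows. SQLite (from the Python standard library) stands in for Postgres here, the `users` and `jobs` tables and the `sign_up` function are hypothetical, and the sketch makes no claim about whether Hatchet exposes this pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT NOT NULL, payload TEXT NOT NULL)")
conn.commit()


def sign_up(conn: sqlite3.Connection, email: str) -> None:
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the job row and the user row are written atomically.
    with conn:
        # The job is enqueued first on purpose, so the rollback is visible below.
        conn.execute(
            "INSERT INTO jobs (kind, payload) VALUES (?, ?)",
            ("welcome_email", email),
        )
        conn.execute("INSERT INTO users (email) VALUES (?)", (email,))


sign_up(conn, "a@example.com")

# A duplicate sign-up violates the UNIQUE constraint; the already-inserted
# job row is rolled back together with the failed user insert.
try:
    sign_up(conn, "a@example.com")
except sqlite3.IntegrityError:
    pass

user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
job_count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
```

After the failed second call there is still exactly one user and one job, which is the guarantee a queue living outside the database cannot give without extra machinery.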

williamdclt 2 hours ago

My understanding is that hatchet isn’t just a queue, it’s a workflow orchestrator: you can use it as a queue but it’s kind of like using a computer as a calculator: it works but indeed it’d likely be simpler to use a calculator.

On your point of using transactions for idempotency: you’re right that it’s a great advantage of a db-based queue, but I’d be wary about taking it as a holy grail for a few reasons:

- it locks you into using a db-based queue. If for any reason you don’t want to anymore (eg you’re reaching scalability issues) it’ll be very difficult to switch to another queue system as you’re relying on transactions for idempotency.

- you only get transactional idempotency for db operations. Any other side effect won’t be automatically idempotent: external API calls, sending messages to other queues, writing files…

- if you decide to move some of your domain to another service, you lose transactional idempotency (it’s now two databases)

- relying on transactionality means you’re not resilient to having duplicate tasks in the queue (duplicate publishing). That can easily happen: bug of the publisher, two users triggering an action concurrently… it’s quite often a very normal thing to trigger the same action multiple times

So I’d avoid having my tasks rely on transactionality for idempotency, your system is much more resilient if you don’t
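
A minimal sketch of what a task that does not lean on queue transactionality might look like: the handler records an idempotency key and ignores duplicates. The names (`credit_account`, `processed`, `balances`) are hypothetical and the state is in memory; a real system would need the check and the side effect to be recorded durably and atomically, or the side effect to be naturally idempotent.

```python
from typing import Dict, Set

processed: Set[str] = set()            # stand-in for a durable record of handled keys
balances: Dict[str, int] = {"alice": 0}


def credit_account(idempotency_key: str, account: str, amount: int) -> bool:
    """Apply a credit at most once per key; return True if it was applied."""
    if idempotency_key in processed:
        return False  # duplicate delivery or duplicate publish: do nothing
    balances[account] += amount
    processed.add(idempotency_key)
    return True


first = credit_account("order-42", "alice", 10)
duplicate = credit_account("order-42", "alice", 10)
```

Delivering the same task twice leaves the balance at 10, regardless of which queue delivered it.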

lyu07282 2 hours ago

Just no, your tasks should be idempotent. Distributed transactions are stupid.

williamdclt 2 hours ago

They’re not talking about distributed transactions: it’s not about a task being published and consumed atomically, it’s about it being consumed and executed atomically.

lyu07282 2 hours ago

The workers aren't talking to Postgres directly, that's why you would need distributed transactions.