Show HN: Decentralized robots (and things) orchestration system

51 points by hannesfur 4 days ago | 24 comments

Hi HN, we've built an open-source operating system extension for orchestrating robot swarms in a fully decentralized way.

This first beta version allows you to create fully decentralized robot swarms. The system will set up a wireless mesh network and run a p2p networking stack on top of it, such that nodes can interact with each other through various abstractions using our SDKs (Rust, Python, TypeScript) or a CLI.

We hope this is a step toward better inter-robot communication (and a fun project if you have a few Raspberry Pis lying around).

Our mesh network is created by B.A.T.M.A.N.-adv, and we've combined this with optimized decentralized algorithms. Since we've abstracted away much of the complexity, it becomes very easy for a user to write decentralized applications involving several peers. Our system currently offers several orchestration primitives (Key-Value Store, Pub-Sub, Discovery, Request-Response, Mesh Inspection, Debug Services, etc.).
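As a rough sketch of what this looks like from the Python SDK (the module and method names below are hypothetical placeholders, not our actual API), publishing to a topic and reading from the key-value store could look roughly like this:

```python
# Hypothetical sketch only: the module and API names (p2p_sdk, connect,
# pubsub, kv) are invented placeholders, not the real SDK surface.
import asyncio

from p2p_sdk import connect  # hypothetical entry point to the local daemon


async def main():
    node = await connect()  # the SDK talks to the local daemon over gRPC

    # Announce this robot's position to every peer in the mesh.
    await node.pubsub.publish("swarm/positions", b"12.4,7.9")

    # Read a value another robot stored in the shared key-value store.
    waypoint = await node.kv.get("mission/next-waypoint")
    print("next waypoint:", waypoint)


asyncio.run(main())
```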

Internally, everything except the SDKs is written in Rust, building on top of libp2p. We use gRPC to communicate between the SDKs and the CLI, so libraries for other languages are possible, and we welcome contributions (or feedback).

The C++ SDK and a ROS package that should feel natural to roboticists are in the works. Soon we also want to support collaborative SLAM and a distributed task queue.

We’d love to hear your thoughts! :)

accurrent 14 minutes ago

This looks super cool. I've been working in a similar space for some time. I work with open-rmf, which is semi-decentralized and provides tools for task and traffic deconfliction (we don't handle the network layer at all). Excited to see more similar software coming up.

wngr 5 hours ago

Great idea combining batman with libp2p! You guys have your hearts in the right place :-).

Currently, your project seems to be an opinionated wrapper on top of libp2p. For this to become a proper distributed toolkit, you lack an abstraction for apps to collaborate over shared state (incl. convergence after partition). Come up with a good abstraction for that, and make it work p2p (e.g. delta-state-based CRDTs, or op-based CRDTs on top of a replicated log; event sourcing, ...). Tangentially related, a consensus abstraction might also be handy for some applications.
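To illustrate what I mean by shared state that converges after a partition, here is a minimal sketch of a state-based CRDT (a grow-only counter); the important property is that merge is commutative, associative, and idempotent, so replicas can sync in any order:

```python
# Minimal state-based CRDT sketch: a grow-only counter (G-Counter).
# Each replica only increments its own slot; merging takes the pointwise
# maximum, so concurrent updates made during a partition converge.
from dataclasses import dataclass, field


@dataclass
class GCounter:
    node_id: str
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Pointwise max is commutative, associative, and idempotent,
        # which is what makes convergence after a partition work.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two robots count completed tasks while partitioned, then sync both ways.
a, b = GCounter("robot-a"), GCounter("robot-b")
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```

Real shared state for robots will obviously need richer types than a counter (sets, maps, logs), but the merge discipline is the same.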

Also check out [iroh](https://github.com/n0-computer/iroh) as a potentially awesome replacement for libp2p, as well as [Actyx](https://github.com/Actyx/Actyx) as inspiration from a similar (sadly failed) project using rust-libp2p.

Oh, and you might want to give your docs a grammar review.

Kudos for showing!

hannesfur 4 hours ago

You are right. At the moment, we are an opinionated wrapper, but we take a different approach to discovery than other libp2p-based networks with our custom batman-adv-based neighbor discovery.

Abstractions for collaboration are currently in the works, and we hope to release them soon. The work on consensus has already started. Your suggestions all seem very interesting, and we'll definitely consider them. We are also currently talking to potential users to build handy and approachable abstractions for them.

I saw that [freenet](https://docs.freenet.org/components/contracts.html) went with CRDTs, but I think they made it too complicated. We were thinking about a graph (or wide-column) store with an engine similar to Cassandra and a frontend like (or ideally just) SurrealDB.

I remember that iroh moved away from libp2p when they dropped IPFS compatibility and moved to a self-built stack: https://www.iroh.computer/blog/a-new-direction-for-iroh When we got started, iroh's capabilities didn't really fit the bill, but it seems like it's time to reevaluate that. As a former contributor to rust-libp2p, I never quite got the frustration with libp2p that many people (iroh included) have, especially since many of the described problems seemed fixable. I would have preferred that they fix those instead, so that libp2p remained the shared base people build these things on.

I remember Actyx being a rust-libp2p user, but I wasn't aware that they failed. Do you have more info? How and why? It would be great if we could learn from them.

Grammar will be reviewed ;) thank you!

Animats 3 hours ago

Read the architecture document here.[1]

The usual problems with these things are discovery and security. Discovery is done via local WiFi broadcast. Not clear how security is done. How do you allow ad-hoc networking yet disallow hostile actors from connecting?

[1] https://docs.p2p.industries/concepts/architecture/

hannesfur 3 hours ago

We do discovery via the mesh, yes, but instead of broadcasting (like mDNS), we query batman-adv for the currently visible neighbors. If a new neighbor is discovered, we contact them directly (via WiFi) to exchange addresses in the P2P network and then dial them. From that, we populate the local Kademlia routing table with the contents of the neighbor's table.
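In spirit, the discovery loop looks something like the following sketch (the real implementation is in Rust and does not shell out to batctl; the parsing here is deliberately loose since the `batctl n` output format differs between versions):

```python
# Rough sketch of the discovery idea: periodically diff the set of visible
# batman-adv neighbors and react to newly appeared ones. Parsing is a loose
# MAC-address regex because the batctl output format varies between versions.
import re
import subprocess
import time

MAC_RE = re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", re.IGNORECASE)


def visible_neighbors() -> set[str]:
    # "batctl n" prints the neighbor table of the local batman-adv interface.
    out = subprocess.run(["batctl", "n"], capture_output=True, text=True).stdout
    return set(MAC_RE.findall(out))


def on_new_neighbor(mac: str) -> None:
    # Placeholder: contact the neighbor directly, exchange P2P addresses,
    # dial it, and fold its entries into the local Kademlia routing table.
    print(f"new neighbor {mac}: exchanging addresses and dialing")


known: set[str] = set()
while True:
    current = visible_neighbors()
    for mac in sorted(current - known):
        on_new_neighbor(mac)
    known = current
    time.sleep(2)
```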

Security is still a big issue. In the current state, there is no security other than application-layer encryption (QUIC & TLS v1.3). That is fine for pilot projects, but it should not be used for anything sensitive. Maybe we should point this out more clearly in the docs. However, some Wi-Fi chips (not the ones on the Raspberry Pi, sadly) also allow setting a password in ad-hoc (IBSS) mode, and 802.11s has native support for encryption. In general, the problem here is a lack of adoption of these standards by the Wi-Fi chip manufacturers and, in the case of Broadcom (the chip on the Raspberry Pi), a lack of support in the Linux kernel driver.

We are planning to implement authentication and encryption in the upcoming release, but this might be a paid feature.

Typically, enterprise networks are encrypted via 802.1X (since a leak of the key wouldn't compromise the whole network), and we might be able to build a decentralised RADIUS server, but I'm not very fond of that idea.

Ideally, the damage one can do by joining the network unauthorized should be very limited anyway, and authentication and encryption should happen on Layer 5.

Would love feedback / inspiration / suggestions

jazzyjackson 47 minutes ago

Might consider good old X.509 certificates and mTLS authentication. You can query and find peers but don't exchange any data with them unless they can present a certificate signed by whatever issuer you trust. Agree it's probably an enterprise upsell because the openssl tooling is a PITA if you've never done it before, but somebody pointed me to KeyStore Explorer [0] and I'm going to give that a try to be my own certificate authority.

I wish it could be a more mainstream, hobbyist auth solution though; it's completely free, open, self-sovereign, etc., and makes strong security guarantees, there's just a steep learning curve to grok what's happening. I think it would be a big achievement if somebody slapped a friendly API / wizard over configuring a CA and creating certs to install on each of your robots / IoT sensors / what have you. Corsha [1] is one provider in this space, and Yubico is contributing too [2], allowing you to sign cert requests with your YubiKey.

[0] https://keystore-explorer.org/features.html

[1] https://corsha.com/

[2] https://www.yubico.com/resources/glossary/what-is-certificat...
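For a feel of how small the enforcement side is, here's a minimal mTLS sketch using Python's standard library (the cert/key file names are placeholders for certs you'd issue with your own CA): a peer simply refuses the handshake unless the client presents a certificate signed by that CA.

```python
# Minimal mTLS sketch: only clients holding a certificate signed by our own
# CA can complete the handshake. File names below are placeholders for
# certs you would issue yourself (openssl, KeyStore Explorer, ...).
import socket
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile="robot.pem", keyfile="robot.key")  # this node's identity
ctx.load_verify_locations(cafile="swarm-ca.pem")  # trust only our own CA
ctx.verify_mode = ssl.CERT_REQUIRED               # client cert is mandatory

with socket.create_server(("0.0.0.0", 9000)) as srv:
    with ctx.wrap_socket(srv, server_side=True) as tls_srv:
        # The TLS handshake (and thus client-cert verification) happens on
        # accept; unauthenticated peers are rejected before any data flows.
        conn, addr = tls_srv.accept()
        print("authenticated peer:", conn.getpeercert().get("subject"))
        conn.sendall(b"welcome to the swarm\n")
```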

NotAnOtter 7 hours ago

Very fun. Is this primarily a passion project or are you hoping to get corporate sponsorship & adoption?

Can you provide some insight as to why this would be preferred over an orchestration server? In this context, would a 'mothership'/Wheel-and-spoke drone responsible for controlling the rest of the hive be considered an orchestration server?

This isn't my area of expertise but I think "Hive mind drones" tickles every engineer.

lmeierhoefer 6 hours ago

> Is this primarily a passion project or are you hoping to get corporate sponsorship & adoption?

We are in the current YC W25 batch and our vision is to build a developer framework for autonomous robotics systems from the system we already have.

> Can you provide some insight as to why this would be preferred over an orchestration server?

It heavily depends on your application; there are applications where it makes sense and others where it doesn't. The main advantages are that you don't need an internet connection, the system is more resilient against network outages, and, most importantly, the resources on the robots, which are idle otherwise, are used. I think for hobbyists the main upside is that it's quick to set up: you only have to turn on the machines and it should work, without having to care about networking or setting up a cloud connection.

> Would a 'mothership'/Wheel-and-spoke drone responsible for controlling the rest of the hive be considered an orchestration server?

If the mothership is static, in the sense that it doesn't change over time, we would consider it an orchestration server. Our core services don't need that, and we envision that most of the decentralized algorithms running on our system also don't rely on such a central point of failure. However, there are some applications where it makes sense to have a "temporary mothership". We are just currently working on a "group" abstraction, which continuously runs a leader election to determine a "mothership" among the group (which is fault-tolerant, however, as the leader can fail anytime and the system will instantly determine another one).
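As a toy sketch of the idea (not our actual election algorithm), a group can converge on a leader with nothing more than heartbeats and a deterministic rule such as "lowest live peer ID wins":

```python
# Toy leader-election sketch: track heartbeats from peers and deterministically
# pick the lowest-ID live peer as the temporary "mothership". When the leader's
# heartbeats stop, the next call to leader() returns a new one.
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before a peer counts as dead


class Group:
    def __init__(self, my_id: str):
        self.my_id = my_id
        self.last_seen: dict[str, float] = {}

    def on_heartbeat(self, peer_id: str) -> None:
        self.last_seen[peer_id] = time.monotonic()

    def live_peers(self) -> set[str]:
        now = time.monotonic()
        self.last_seen[self.my_id] = now  # we are always live to ourselves
        return {p for p, t in self.last_seen.items() if now - t < HEARTBEAT_TIMEOUT}

    def leader(self) -> str:
        # Every node applies the same rule, so nodes that share the same view
        # of who is alive also agree on who the leader is.
        return min(self.live_peers())


g = Group("drone-07")
g.on_heartbeat("drone-03")
g.on_heartbeat("drone-12")
print(g.leader())  # "drone-03" until its heartbeats stop arriving
```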

NotAnOtter 6 hours ago

> The main advantages are that you don’t need an internet connection

To that end, I'm not clear on the benefit of this model. To solve that problem I would just take a centralized framework and stick it inside an oversized drone/vehicle capable of carrying the added weight (in CPU, battery, etc.). There are several centralized models that don't require an external data connection.

> the resources on the robots, which are idle otherwise, are used

But what's the benefit of this? I don't see a use case where the swarm needs to perform lots of calculations beyond the ones required for its own navigation and communication with others. I suppose I could imagine a chain of these 'idle' drones acting as a communication relay between two separate, active hives. But the benefit there seems marginal.

> our system also don't rely on such a central point of failure

This seems like the primary upside, and it's a big one. I'm imagining a disaster or military situation where natural or human forces could be trying to disable the hive. Now instead of knocking out a single mothership ATV, each and every drone needs to be removed to fully disable it. Big advantage.

> We are just currently working on a “group” abstraction

Makes sense to me. That's the 'value add', might as well really spec that out

> leader election to determine a “mothership” among the group

This seems perfectly reasonable to me and doesn't remove the advantages of the disconnected "hive". But I do find it funny that the solution to decentralization seems to be simply having the centralization move around easily / flexibly. It's not a hive of peers, it's a hive of temporary kings.

lmeierhoefer 5 hours ago

Thanks for the feedback!

> I would just take a centralized framework and stick it inside an oversized drone/vehicle capable of carrying the added weight

Makes sense. I think there are scenarios where such “base stations” are a priori available and “shielded,” so in this case, it might make more sense to just go with a centralized system. This could also be built on top of our system, though.

> But what’s the benefit of this?

I agree that, in many cases, the cost savings might be marginal. However, say you have a cluster of drones equipped with computing hardware capable enough to run all the algorithms themselves: why spin up a cloud instance to run a centralized version of that algorithm? It is more of an engineering-ideological point, though ;)

> But I do find it funny that the solution to decentralization seems to be simply having the centralization move around easily / flexibly. It’s not a hive of peers, it’s a hive of temporary kings.

Most of our applications will not need this group leader. For example, the pubsub system does not work by aggregating and dispatching the messages at a central point (like MQTT) but employs a gossip mechanism (https://docs.libp2p.io/concepts/pubsub/overview/).

What I meant is that, in some situations, it might be more efficient (and easier to reason about) to elect a leader. For example, say you have an algorithm that needs to do a matching between neighboring nodes, i.e., each node has some data point, and the algorithm wants to compute a pairwise similarity metric and share all computed metrics back to all nodes. You could do some kind of "ring-structure" algorithm, where you have an ordering among the nodes, and each node receives data points from its predecessor, computes its own similarity against the incoming data point, and forwards the received data point to its successor. If one node fails, its neighbors in the ring simply skip to the next node. This would be truly decentralized, and there is no single point of failure. However, in most cases, this approach will have a higher computation latency than just electing a temporary leader (by letting the leader compute the matchings and send them back to everyone). So someone who cares about efficiency (rather than resiliency) will probably want such a leader mechanism.
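A toy simulation of just the data flow of that ring pass (failure handling and shipping the metrics back to everyone are omitted):

```python
# Toy simulation of the ring approach: every node's data point travels around
# the ring, and each node computes a similarity against every point that
# passes through it, so after n-1 hops all pairwise metrics exist without a
# central coordinator. Failure handling and result distribution are omitted.
def similarity(a: float, b: float) -> float:
    return -abs(a - b)  # placeholder metric; any pairwise function works


def ring_all_pairs(data: dict[str, float]) -> dict[tuple[str, str], float]:
    nodes = sorted(data)  # the agreed-upon ring ordering
    n = len(nodes)
    metrics: dict[tuple[str, str], float] = {}
    # Each node starts out holding (the origin of) its own data point.
    holding = {node: (node, data[node]) for node in nodes}
    for _ in range(n - 1):  # after n-1 hops, every point has visited every node
        incoming = {}
        for i, node in enumerate(nodes):
            successor = nodes[(i + 1) % n]
            incoming[successor] = holding[node]  # forward held point downstream
        for node, (origin, value) in incoming.items():
            metrics[(node, origin)] = similarity(data[node], value)
        holding = incoming
    return metrics


print(ring_all_pairs({"drone-a": 1.0, "drone-b": 4.0, "drone-c": 2.5}))
```

The leader variant collapses those n-1 rounds into one gather and one broadcast, which is why it usually wins on latency.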