210 points by transpute 1 day ago | 68 comments
tuukkah 1 day ago
> Pithy statements such as “only use exceptions for exceptional cases” or “Let it crash”, while catchy, do not do justice to the complexities that programmers need to deal with when deciding how to define, propagate, handle and reason about errors.
I think the author is on to something here w.r.t. how current programming languages and conventions gloss over what good code might need.
gavinhoward 24 hours ago
That makes me sad because that is what my systems programming language uses.
For context, conditions and restarts come from the Lisp world. They do not unwind the stack. Instead, they run an error handler on error, and if it handles the error, you have the option of restarting the operation that failed.
I have implemented them in C; they are great for systems programming.
gavinhoward 22 hours ago
The difference is that callers (even indirect callers) can register handlers to handle those errors without unwinding the stack.
Maybe the error handler makes the operation return a default value. Maybe it deallocates a cache to free up memory in a memory-constrained situation. Maybe it simply ignores EOF or a FILE DOES NOT EXIST error where those do not matter.
I also have a special error handler for parsing. If the error is a parse error, it prints the parse error in standard format, with source file, line, etc. Otherwise, it passes the error on to the next error handler in the chain.
This helps me modularize my code by keeping the parse error concerns separate from error handling for everything else.
typesanitizer 15 hours ago
After feedback on Lobste.rs, I plan on adding a section on conditions and restarts, hopefully sometime later today if time permits. :)
I'd be happy to add some information about Yao as well, alongside Common Lisp. It would be helpful for me to have some more details about Yao before writing about it, so I have some follow-up questions below. Please feel free to link me to existing writing; I may be misremembering details, but I don't think the answers to these have been covered anywhere.
I looked at the docs on the master branch as of mid-Jan 2025 (could you confirm whether these are up to date?), particularly design.md, and I noticed these points:
> That means that Yao will have unsafe code, like Rust's unsafe. However, unlike Rust, Yao's way of doing unsafe code will be harder to use,
So Yao has a delineation between safe and unsafe code, correct? Does "safe" in Yao have the same (or stronger) set of guarantees as Rust (i.e. no memory unsafety if all necessary invariants are upheld by unsafe code, and boundary of safe-unsafe code)?
> Yao's memory management will be like C++'s RAII [..]
Does Yao guarantee memory safety in the absence of unsafe blocks, and does the condition+restart system in Yao fall under the safe subset of the language? If so, I'm curious how lifetimes/ownership/regions are represented at runtime (if they are), and how they interact with restarts. Specifically:
1. Are the types of restart functions passed down to functions that are called?
2. If the type information is not passed, and conditions+restarts are part of the safe subset of Yao, then how is resumption logic checked for type safety and lifetime safety? Can resumptions cause run-time type errors and/or memory unsafety, e.g. by escaping a value beyond its intended lifetime?
---
For reading the docs, I used this archive.org link (I saw you're aware of the Gitea instance being down in another comment): https://web.archive.org/web/20250114231213/https://git.yzena...
gavinhoward 10 hours ago
Sorry, I was AFK for several hours.
Thank you for the offer, but I don't think Yao should be featured yet.
Edit: I guess I'll answer your questions anyway.
> So Yao has a delineation between safe and unsafe code, correct?
Correct. However, Yao's "unsafe" will actually be separate files, written directly in an LLVM-like assembly (Yvm in the repo). That's how it will be harder to use.
> Does "safe" in Yao have the same (or stronger) set of guarantees as Rust (i.e. no memory safety if all necessary invariants are upheld by unsafe code, and boundary of safe-unsafe code)?
Stronger.
First, Yvm (Yao's assembly/unsafe) is made specifically for structured languages, and it will still do bounds checks by default. Yes, there will be ways of not doing bounds checks, of course, but even in "unsafe," bounds checks will exist.
Second, Yao and Yvm are both explicitly designed for better formal verification. [1] This includes user-defined formal properties.
> Does Yao guarantee memory safety in the absence of unsafe blocks?
Yes.
> does the condition+restart system in Yao fall under the safe subset of the language?
It's still being implemented (which is why Yao should not be featured), but it will be part of the safe subset. That is a guarantee; I will completely redesign Yao if I cannot fit conditions and restarts into the safe subset. But I am confident that they will work as-is, because I implemented a prototype in C that is safer than C itself.
> 1. Are the types of restart functions passed down to functions that are called?
Not directly. My C code uses, and Yao will use, what I call "context stacks," an idea that comes from Jonathan Blow's Jai.
These are more useful than just restart functions, but there is explicitly one context stack for restart functions, per thread. Registering a restart function means pushing it onto the context stack.
Then, when an error happens, the context stack is walked backwards until a restart function handles the error. If no function handles it, the context stack for the parent thread is walked (starting at the point where the child thread was created), and so on until some function handles it.
I push a default restart function at the root, so errors will always be handled.
> 2. If the type information is not passed, and conditions+restarts are part of the safe subset of Yao, then how is resumption logic checked for type safety and lifetime safety? Can resumptions cause run-time type errors and/or memory unsafety, e.g. by escaping a value beyond its intended lifetime?
This is one of the craziest parts of Yao: it will have the capability to be generic over types at runtime. In addition, Yao has something like Go interfaces or Rust traits. What it has is more powerful, though.
The end result is that errors will actually be interfaces, and everything will be type checked at compile time while remaining memory safe.
I hope this answers your questions.
[1]: https://gavinhoward.com/2024/05/what-rust-got-wrong-on-forma...
paddy_m 18 hours ago
I don't see how they could be used in most cases where you want a program to run without a programmer intervening. Could you list some more use cases?
gavinhoward 17 hours ago
A classic case would be restarting memory allocation after failure and after the error handler freed cache memory.
Another case would be retrying to open a file when the process has too many files open already. The error handler may close a few that are not critical, then have the restart fire.
Another case is sending something over TCP. Perhaps you try to send something, and it gives you an error. Unbeknownst to you, the message was already sent, but you wait a second, or until other connections quiet down, then restart and try again, and it succeeds. The other end gets a duplicate, but no matter; it's TCP.
Another case is DNS. Say you need to get the IP address for some hostname, and you connect to your first default DNS server. However, it happens to be run by your local incompetent sysadmins, and it happens to be down. Your error handler may choose a different, maybe public, DNS server, like Cloudflare or Google, and then restart.
If you think, 'Oh, well, I could program those in without restarts,' you are correct, but doing so couples concerns.
Take the DNS example: if you put that extra error handling logic in the code that actually tries resolving things, then how do you change error handling when you need to?
Let's make the example even more detailed: perhaps you have a fleet of servers, a whole data center. Most of those servers could use a public DNS if they needed to, but perhaps your head node must NEVER use a public DNS for security reasons. The typical way to implement that would mean having an `if` statement for acting differently based on whatever condition would indicate head node or not. That is coupling the DNS resolution with error handling.
But if you have conditions and restarts, then you simply register a different DNS error handler at startup based on whether it's the head node or not. Or the error handler itself could contain the `if` statement. Either way decouples DNS resolution from the error handling.
I hope all of that helps.
demurgos 1 day ago
Regarding context handling, one issue I struggle with is where the context should be attached: by the caller or the callee? In particular, with intermediate libraries, I would like to avoid needless work attaching the context if the caller already handles it (some sort of opt-out).
Usually I just attach the context in the callee, as it's the simplest, but I'd like some systematic/reusable approach that would control this. It would be nice if it fit in the type system too. I use a bunch of languages (Rust, TypeScript, Kotlin, Python) and it feels like a larger design issue.
To give a more concrete example: in Node, `fs.readFileSync(path)` includes the path in the error when it fails: the callee attaches the context to the error. In Rust, `std::fs::read(path)` does not attach the path to the error: the caller is responsible for the context. I would like some lightweight way to control whether the file read should include the path. The ancestor-scanning example from the article is a case where caller context is good IMO, but usually callee context is better for unrecoverable errors. Since it's contextual, having control would be nice.
jandrewrogers 23 hours ago
Expanding on your caller/callee example, sometimes both are far removed from the path, e.g. they only have a file descriptor. There are plenty of ways to resolve that file descriptor into a path, but one that consistently has zero overhead, except when it is required to contextualize an error, is not trivial. Being able to resolve outside context implies the management and transmission of context in cases where it is not needed.
I am also coming around to the idea that the error handling context needs to be at least partly co-designed with the logging APIs and internals to have a little more flexibility around when and where things happen. In most software neither of these is really aware of the other.
Galanwe 1 day ago
So in a laps of 15 years (2010-1025), they hand picked 20 bugs from 5 open source filesystem projects (198 total), and extrapolated this result.
That is not science.
typesanitizer 1 day ago
> So in a laps of 15 years (2010-1025),
The paper was published in 2014, so the period is 2010-2014, not 2010-2025.
> they hand picked 20 bugs from 5 open source filesystem projects (198 total)
The bugs were randomly chosen; "hand picked" would imply that the authors investigated the contents of the bug reports deeply before deciding whether to include them (which would certainly fall under "bad science"). The paper states the following:
> We studied 198 randomly sampled, real world failures reported on five popular distributed data-analytic and storage systems, including HDFS, a distributed file system [27]; Hadoop MapReduce, a distributed data-analytic framework [28]; HBase and Cassandra, two NoSQL distributed databases [2, 3]; and Redis, an in-memory key-value store supporting master/slave replication [54]
So only 1 out of 5 projects is a file system.
> and extrapolated this result. That is not science.
The authors also provide a measure of statistical confidence in the 'Limitations section'.
> (3) Size of our sample set. Modern statistics suggests that a random sample set of size 30 or more is large enough to represent the entire population [57]. More rigorously, under standard assumptions, the Central Limit Theorem predicts a 6.9% margin of error at the 95% confidence level for our 198 random samples. Obviously, one can study more samples to further reduce the margin of error
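For what it's worth, the 6.9% figure checks out as the standard worst-case margin of error for a proportion (p = 0.5, z = 1.96 at the 95% confidence level):

ME = z * sqrt(p(1-p)/n) = 1.96 * sqrt(0.25/198) ≈ 1.96 * 0.0355 ≈ 0.0696 ≈ 6.9%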
Do you believe that this is insufficient or that the reasoning in this section is wrong?
Galanwe 1 day ago
Oh indeed, my bad.
> The bugs were randomly chosen; "hand picked" would imply that the authors investigated the contents of the bug reports deeply before deciding whether to include them
No, no, they were not randomly chosen; they even have a whole paragraph explaining how they randomly picked _from a pool of manually selected bugs_. Their selection criteria ranged from "being serious" to "having a lot of comments", "they can understand the patch", or "the patch is not from the reporter".
> The authors also provide a measure of statistical confidence in the 'Limitations section'.
This is a measure of confidence of their random sampling being representative of the hand picked bugs...
> under standard assumptions, the Central Limit Theorem predicts a 6.9% margin of error at the 95% confidence level
I would love to see them prove the normality assumption of bug root cause distribution.
Also, the whole categorization they do seems purely qualitative.
This paper seems to lack rigor to me.
gtirloni 1 day ago
I have been guilty of this kind of comment myself, and I have come to realize they do nothing to further my point. I would suggest reading the guidelines; they address this well: https://news.ycombinator.com/newsguidelines.html
Galanwe 1 day ago
Thanks for the considerate feedback.
epr 1 day ago
So, analyzing databases, we picked Java, Java, Java, Java, and one in C. This does not seem very random. I suppose this may provide insight into failure modes in Java codebases in particular, but I'm not sure I'd be in a hurry to generalize.
perching_aix 1 day ago
Instead of going for rhetorical KOs, how about remaining constructive? You're well aware that this is what they meant, you could have just written it plain out if you disagree.