
Hacker Remix

Test-driven development with an LLM for fun and profit

207 points by crazylogger 2 days ago | 84 comments

xianshou 1 day ago

One trend I've noticed, framed as a logical deduction:

1. Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.

2. Coding agents do massively better when they have a test-driven reward signal.

3. If a problem can be framed in a way that a coding agent can solve, that speeds up development at least 10x from the base case of human + assistant.

4. From (1)-(3), if you can get all the necessary context into 50k tokens and measure progress via tests, you can speed up development by 10x.

5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.

Sure enough, I see HN projects evolving in that direction.

swatcoder 1 day ago

> 3. If a problem can be framed in a way that a coding agent can solve...

This reminds me of the South Park underwear gnomes. You picked a tool and set an expectation, then just kind of hand-waved over the hard part in the middle, as though framing problems "in a way coding agents can solve" is itself a well-understood or bounded problem.

Does it sometimes take 50x effort to understand a problem and the agent well enough to get that done? Are there classes of problems where it can't be done? Are either of those concerns something you can recognize before they impact you? At commercial quality, is it an accessible skill for inexperienced people, or do you need a mastery of coding, the problem domain, or the coding agent to be able to rely on it? Can teams recruit people who can reliably achieve any of this? How expensive is that talent? Etc.

hitchstory 1 day ago

>as though framing problems "in a way coding agents can solve" is itself a well-understood or bounded problem.

It's not, but if you can (A) make it cheap to try out different types of framings - not all of them have to work - and (B) automate everything else, then the labor intensity of programming decreases drastically.

>At commercial quality, is it an accessible skill for inexperienced people

I'd expect the opposite: it would be an extremely inaccessible skill, requiring deep expertise and commanding high pay. But if 2 people can deliver as much as 15 people at a higher quality and they're paid triple, it's still way cheaper overall.

I would still expect somebody following this development pattern to routinely discover a problem the LLM can't deal with and have to dive under the hood to fix it - digging down below multiple levels of abstraction. This would be Hard with a capital H.

myko 2 hours ago

> as though framing problems "in a way coding agents can solve" is itself a well-understood or bounded problem

It is eminently solvable! All that is necessary is to use a subset of language that is easier for the machine to understand and use it in a very defined way; we could call this a "coding language" or something similar. We could even build tools to ensure we write it correctly (to avoid confusing the machine). Perhaps we could define our own algorithms using this "language" to help them along!

emptiestplace 1 day ago

We've had failed projects since long before LLMs. I think there is a tendency for people to gloss over this (point 3) regardless, but working with an LLM it tends to become obvious much more quickly, without investing tens or hundreds of person-hours. I know it's not perfect, but I find that a lot of the things people complain about would've been a problem either way - especially when people think they are going to go from "hello world" to SaaS billionaire in an hour.

I think mastery of the problem domain is still important, and until we have effectively infinite context windows (that work perfectly), you will need to understand how and when to refactor to maximize quality and relevance of data in context.

dingnuts 1 day ago

Well, according to xianshou's profile they work in finance, so it makes sense to me that they would gloss over the hard part of programming when describing how AI is going to improve it.

ziddoap 1 day ago

Working in one domain does not preclude knowledge of others. I work in cybersec but spent my first working decade in construction estimation for institutional builds. I can talk confidently about firewalls or the hospital you want to build.

No need to make assumptions based on a one-line hacker news profile.

Arcuru 1 day ago

> 5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.

Not necessarily. You can get the same benefits you described in (1)-(3) by using clearly defined modules in your codebase; they don't need to be separate microservices.

lolinder 15 hours ago

I wonder if we'll see a return of the kind of interface file present in C++, OCaml, and Ada. These files, well commented, are naturally the context window to use as the reference for a module.

Even if languages don't grow them back as a first-class feature, some format that is auto-generated from the code and doesn't include the function bodies is really what is needed here.

senkora 14 hours ago

Python (which I mention because it is the preferred language of LLM output) has grown stub files that would work for this:

https://peps.python.org/pep-0484/#stub-files

I guess that this use case would be an argument to include docstrings in your Python stub files, which I hadn't considered before.
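
For illustration, a minimal hypothetical stub (parsers.pyi is a made-up module name): signatures plus docstrings, no function bodies, which is roughly the compact per-module context being described here.

    # parsers.pyi -- hypothetical stub file: signatures and docstrings only, no bodies
    from ipaddress import IPv4Network, IPv6Network

    def parse_cidrs(text: str) -> list[IPv4Network | IPv6Network]:
        """Extract every IPv4/IPv6 address or CIDR found in free-form text."""
        ...

    class CidrIndex:
        """Read-only index of parsed networks, keyed by address family."""

        def lookup(self, address: str) -> IPv4Network | IPv6Network | None:
            """Return the narrowest known network containing the address, if any."""
            ...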

sdesol 1 day ago

Agreed. If the microservice does not provide any value from being isolated, it is just a function call with extra steps.

__MatrixMan__ 1 day ago

I think the argument is that the extra value provided is a small enough context window for working with an LLM. Although I'd suggest making it a library if one can manage that; it gives you the desired context reduction, bounded by interfaces, without taking on the complexities of adding another microservice.

I imagine throwing a test at an LLM and saying:

> hold the component under test constant (as well as the test itself), and walk the versions of the library until you can tell me where they're compatible and where they break.

If you tried to do that with a git bisect and everything in the same codebase, you'd end up varying all three (test, component, library), which is worse science than holding two constant and varying the third.
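
A rough sketch of that workflow, assuming a pip-installable library and a pytest test held constant (the library name, version list, and test path are placeholders):

    # walk_versions.py -- hypothetical: hold the component under test and the test
    # itself constant, vary only the library version, and record where it breaks.
    import subprocess

    VERSIONS = ["1.0.0", "1.1.0", "1.2.0", "2.0.0"]  # assumed published versions

    for version in VERSIONS:
        subprocess.run(["pip", "install", f"somelib=={version}"], check=True)
        result = subprocess.run(["pytest", "tests/test_component.py", "-q"])
        status = "compatible" if result.returncode == 0 else "breaks"
        print(f"somelib {version}: {status}")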

sdesol 1 day ago

> I think the argument is that the extra value provided is a small enough context window for working with an LLM.

I'm not sure moving something that could work as a function into a microservice would save much context. If anything, I think you are adding more context, since you would need to talk about the endpoint and have it route to the function that does what you need. And when it is all over, you still need to describe what the input and output are.

__MatrixMan__ 14 hours ago

Oh certainly. I was arguing that if you need more isolation than a function gives you, don't jump to the conclusion that you need a service. Consider a library as a middle ground.

ben_w 4 hours ago

Indeed; I think there's a strong possibility that there are certain architectural choices where LLMs can do very well, and others where they would struggle.

There are with humans too, but it's inconsistent; personally I really dislike VIPER, yet I've never felt the pain others insist comes with putting too much in a ViewController.

theptip 14 hours ago

Yeah, I think monorepos will be better for LLMs. Easier to refactor module boundaries as context grows or requirements change.

But practices like stronger module boundaries, module docs, acceptance tests on internal dev-facing module APIs, etc. are all things that will be much more valuable for LLM consumption. (And might make things more pleasant for humans too!)

steeeeeve 1 day ago

So having clear requirements, a focused purpose for software, and a clear boundary of software responsibility makes for a software development task that can be accomplished?

If only people had figured out at some point that the same thing applies when communicating to human software engineers.

PoppinFreshDo 1 day ago

If human software engineers refused to work unless those conditions were met, what a wonderful world it would be.

intelVISA 21 hours ago

They do implicitly: you can only be accidentally productive without those preconditions.

PoppinFreshDo 5 hours ago

[dead]

sdesol 1 day ago

> you can speed up development by 10x.

If you know what you are doing, then yes. If you are a domain expert and can articulate your thoughts clearly in a prompt, you will most likely see a boost—perhaps two to three times—but ten times is unlikely. And if you don't fully understand the problem, you may experience a negative effect.

throwup238 1 day ago

I think it also depends on how much yak-shaving is involved in the domain, regardless of expertise. Whether that’s something simple like remembering the right bash incantation or something more complex like learning enough Terraform and providers to be able to spin up cloud infrastructure.

Some projects just have a lot of stuff to do around the edges and LLMs excel at that.

smusamashah 1 day ago

On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?

The code is at least testable and verifiable. For everything else I am left wondering if it's the truth or a hallucination. That adds the mental burden I was trying to avoid by using an LLM in the first place.

joshstrange 1 day ago

Absolutely. With LLMs you almost always need to verify the results. LLMs (for me) shine by pointing me in the right direction, getting a "first draft", or for things like code where I can test it.

nyrikki 1 day ago

It is really the only safe way to use it IMHO.

Even in the most simple forms of automation, humans suffer from automation bias and complacency, and one of the better ways to avoid those issues is to instill a fundamental mistrust of those systems.

IMHO it is important to look at other fields and their human-factors studies to understand this.

As an example, ABS was originally sold as a technology that would help you "stop faster", which it may do in some situations, and it is of course mandatory in the US. But they had to shift how they "sell" it to ensure that people don't rely on it.

https://www.fmcsa.dot.gov/sites/fmcsa.dot.gov/files/docs/200...

    2.18 – Antilock Braking Systems (ABS)

    ABS is a computerized system that keeps your wheels from locking up during hard brake applications.
    ABS is an addition to your normal brakes. It does not decrease or increase your normal braking capability. ABS only activates when wheels are about to lock up.
    ABS does not necessarily shorten your stopping distance, but it does help you keep the vehicle under control during hard braking.

Transformers will inevitably produce code that doesn't work; it doesn't matter whether that is due to what people call hallucinations, Rice's theorem, etc.

Maintaining that mistrust is the mark of someone who understands and can leverage the technology. It is just yet another context-specific trade-off analysis that we will need to make.

I think forcing people into the quasi-TDD thinking model, where they focus on what needs to be done first instead of jumping into the implementation details, will probably be a positive thing for the industry, no matter where on the spectrum LLM coding assistants arrive.

That is one of the hardest things to teach when trying to introduce TDD: starting from something far closer to an ADT than to implementation-specific unit tests is very different, but very useful.

I am hopeful that the required tacit experience will help people get past the issues with formal frameworks, which run into many barriers that block teaching that one skill.

As LLMs' failure mode is Always Confident, Often Competent, and Inevitably Wrong, it is super critical to always remember that the third one is likely and that you are the expert.

Marceltan 1 day ago

Agree. My biggest pain point with LLM code review tools is that they sometimes add 40 comments for a PR changing 100 lines of code. It gets noisy and hard to decipher what really matters.

Along the lines of verifiability, my take is that running a comprehensive suite of tests in CI/CD is going to be table stakes soon given that LLMs are only going to be contributing more and more code.

sdesol 1 day ago

> On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?

I'm working on a LLM chat app that is built around mistrust. The basic idea is that it is unlikely a supermajority of quality LLMs can get it wrong.

This isn't foolproof, but it does provide some level of confidence in the answer.
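
A minimal sketch of the consensus idea (not the actual gitsense implementation; the model list and the ask() helper are placeholders you would wire to your own providers):

    # Hypothetical consensus check: ask several models the same question,
    # normalize the answers, and flag anything short of full agreement.
    from collections import Counter

    MODELS = ["model-a", "model-b", "model-c"]  # placeholder model names

    def ask(model: str, prompt: str) -> str:
        raise NotImplementedError("wire this to your provider of choice")

    def consensus(prompt: str) -> tuple[str, float]:
        answers = [ask(m, prompt).strip().lower() for m in MODELS]
        top, count = Counter(answers).most_common(1)[0]
        return top, count / len(answers)  # most common answer plus agreement ratio

    # e.g. consensus("Did Homer Simpson ever go to Mars? Answer yes or no.")
    # Anything below full agreement gets escalated to a human or a web search.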

Here is a quick example in which I analyze results from multiple LLMs that answered, "When did Homer Simpson go to Mars?"

https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...

If you look at the yes/no table, all except GPT-4o and GPT-4o mini said no. After asking GPT-4o which answer was correct, it provided "evidence" from an episode, so I asked for more information about that episode. Based on what it said, it looks like the mission to Mars was a hoax, and when I challenged GPT-4o on this, it agreed and said Homer never went to Mars, as the others had said.

I then asked Sonnet 3.5 about the episode and it said GPT-4o misinterpreted the plot.

https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...

At this point, I am confident (but not 100% sure) Homer never went to Mars and if I really needed to know, I'll need to search the web.

ehnto 1 day ago

It's the backwards reasoning that really frustrates me when using LLMs. You ask a question, it says "sure, do these things", they don't work out, and when you ask the LLM why not, it replies, yes, that thing I told you to do wouldn't work, and for these clear reasons.

It would be nice to start at the end of that chain of reasoning instead of the other side.

Another regular example is when it "invents" functions or classes that don't exist; when pressed about them, it will reply that of course that won't work, that function doesn't exist.

"Okay, great, so don't tell me it exists with such certainty" is what I would tell a human who fed me imagination as fact all the time. But of course an LLM is not reasoning in the same sense, so this reverse chain of thought is the outcome.

I am finding LLMs far more useful for soft skill topics than engineering type work, simply because of how often it leads me down a path that is eventually a dead end, because of some small detail that was wrong at the very beginning.

sdesol 1 day ago

> I am finding LLMs far more useful for soft skill topics than engineering type work, simply because of how often it leads me down a path that is eventually a dead end, because of some small detail that was wrong at the very beginning.

Yeah, I felt the same way in the beginning, which is why I ended up writing my own chat app. What I've found while developing my spelling and grammar checker is that it is very unlikely for multiple LLMs to mess up at the same time. I know they will mess up, but I'm also pretty sure they won't all do so at the same time.

So far, I've been able to successfully create working features that actually saved me time by pitting LLMs against their own responses and each other's. My process right now is: I'll ask 6+ models to implement something, and then I'll ask models to evaluate everyone's responses. More often than not, a model will find a fault or make a suggestion that can be used to improve the prompt or code. Depending on my confidence level, I might repeat this a couple of times (a rough sketch of one such round is below).

The issue right now is tracking this "chain of questioning", which is why I am writing my own chat app. I need an easy way to backtrack and fork from different points in the chain. I think once we get a better understanding of what LLMs can and can't do as a group, we should be able to produce working solutions more easily.
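
A rough sketch of one such cross-review round, reusing the same placeholder ask() idea as in the earlier sketch (again, not the actual gitsense code):

    # Hypothetical cross-review round: every model implements the task, then every
    # model critiques the whole set of candidate implementations.
    def cross_review(task: str, models: list[str], ask) -> dict:
        candidates = {m: ask(m, "Implement this:\n" + task) for m in models}
        bundle = "\n\n".join(f"[{m}]\n{code}" for m, code in candidates.items())
        critiques = {m: ask(m, "Review these candidate implementations and point "
                               "out bugs or possible improvements:\n" + bundle)
                     for m in models}
        # Critiques feed back into the next prompt or code revision; repeat as needed.
        return {"candidates": candidates, "critiques": critiques}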

willy_k 1 day ago

I believe that this is what chain-of-thought models attempt to address.

horsawlarway 1 day ago

Isn't this essentially making the point of the post above you?

For comparison - if I just do a web search for "Did Homer Simpson go to Mars" I get immediately linked to the Wikipedia page for that exact episode (https://en.wikipedia.org/wiki/The_Marge-ian_Chronicles), and the plot summary is less to read than your LLM output. It clearly summarizes that Marge & Lisa (note - NOT Homer) almost went to Mars, but did not go. Further, the summary correctly includes the outro, which does show Marge and Lisa on Mars in the year 2051.

Basically - for factual content, the LLM output was a garbage game of telephone.

sdesol 1 day ago

> Isn't this essentially making the point of the post above you?

Yes. This is why I wrote the chat app: because I mistrust LLMs, but I do find them extremely useful when you approach them with the right mindset. If answering "Did Homer Simpson go to Mars?" correctly is critical, then you can choose to require 100% consensus; otherwise you will need a fallback plan.

When I asked all the LLMs about the Wikipedia article, they all correctly answered "No" and talked about Marge and Lisa in the future without Homer.

manmal 1 day ago

Relatedly, when I ask LLMs what happens in a TV episode, or in a series in general, I usually get very low-quality and mostly flat-out wrong answers. That baffles me, as I thought there were multiple well-structured synopses for any TV series in the training data.

simonw 1 day ago

Here's the Go app described in the post: https://github.com/yfzhou0904/tdd-with-llm-go

Example usage from that README (and the blog post):

  % go run main.go \
  --spec 'develop a function to take in a large text, recognize and parse any and all ipv4 and ipv6 addresses and CIDRs contained within it (these may be surrounded by random words or symbols like commas), then return them as a list' \
  --sig 'func ParseCidrs(input string) ([]*net.IPNet, error)'
The all-important prompts it uses are in https://github.com/yfzhou0904/tdd-with-llm-go/blob/main/prom...

voiceofunreason 1 day ago

I have yet to see an LLM + TDD essay where the author demonstrates any mastery of Test-Driven Development.

Is the label "TDD" being hijacked for something new? Did that already happen? Are LLMs now responsible for defining TDD?