Show HN: Magnitude – open-source, AI-native test framework for web apps

163 points by anerli 21 hours ago | 40 comments

Hey HN, Anders and Tom here - we’ve been building an end-to-end testing framework powered by visual LLM agents to replace traditional web testing.

We know there's a lot of noise about different browser agents. If you've tried any of them, you know they're slow, expensive, and inconsistent. That's why we built an agent specifically for running test cases and optimized it just for that:

- Pure vision instead of the error-prone "set-of-marks" system (the colorful boxes you see in browser-use, for example)

- Use a tiny VLM (Moondream) instead of OpenAI/Anthropic computer use for dramatically faster and cheaper execution

- Use two agents: one for planning and adapting test cases and one for executing them quickly and consistently.

The idea is the planner builds up a general plan which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, it can kick back out to the planner agent and re-adjust the test.
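
Roughly, the loop looks like the sketch below (illustrative names only, not our actual API): the planner produces a plan of steps, the executor replays them, and a failed step hands control back to the planner.

```typescript
// Minimal sketch of the plan/execute loop (illustrative, not Magnitude's real API).

type WebAction =
  | { variant: 'click'; target: string }
  | { variant: 'type'; target: string; content: string };

type TestPlan = { steps: WebAction[] };

// Hypothetical agent calls standing in for the planner LLM and executor agent.
declare function plannerBuildPlan(testCase: string): Promise<TestPlan>;
declare function plannerAdjust(testCase: string, plan: TestPlan, failedStep: number): Promise<TestPlan>;
declare function executorRun(step: WebAction): Promise<boolean>;

async function runTest(testCase: string, cache: Map<string, TestPlan>): Promise<void> {
  let plan = cache.get(testCase) ?? (await plannerBuildPlan(testCase));
  cache.set(testCase, plan);

  for (let attempt = 0; attempt < 3; attempt++) {
    const failedStep = await replay(plan);
    if (failedStep === -1) return; // all steps executed cleanly off the cached plan

    // Something went wrong: swap the planner back in to adjust the plan, then retry.
    plan = await plannerAdjust(testCase, plan, failedStep);
    cache.set(testCase, plan);
  }
  throw new Error(`"${testCase}" still failing after re-planning`);
}

async function replay(plan: TestPlan): Promise<number> {
  for (const [i, step] of plan.steps.entries()) {
    if (!(await executorRun(step))) return i; // index of the first failed step
  }
  return -1;
}
```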

It’s completely open source. Would love to have more people try it out and tell us how we can make it great.

Repo: https://github.com/magnitudedev/magnitude

NitpickLawyer 20 hours ago

> The idea is the planner builds up a general plan which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, it can kick back out to the planner agent and re-adjust the test.

I've been recently thinking about testing/qa w/ VLMs + LLMs, one area that I haven't seen explored (but should 100% be feasible) is to have the first run be LLM + VLM, and then have the LLM(s?) write repeatable "cheap" tests w/ traditional libraries (playwright, puppeteer, etc). On every run you do the "cheap" traditional checks, if any fail go with the LLM + VLM again and see what broke, only fail the test if both fail. Makes sense?
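
Something like this sketch, where the Playwright calls are real but the agent hook is a hypothetical placeholder:

```typescript
// Sketch of the hybrid idea: run the cheap generated Playwright step first and
// only escalate to the LLM+VLM agent when it fails. (runStepWithVisualAgent is
// a hypothetical placeholder, not an existing library.)

import { test, type Page } from '@playwright/test';

declare function runStepWithVisualAgent(page: Page, step: string): Promise<boolean>;

test('user can log in', async ({ page }) => {
  await page.goto('https://example.com/login');

  // "Cheap" traditional check generated on the first LLM+VLM run.
  const cheapPassed = await page
    .getByRole('button', { name: 'Log in' })
    .click()
    .then(() => true)
    .catch(() => false);

  if (!cheapPassed) {
    // Cheap selector broke; let the vision agent see what actually changed.
    const agentPassed = await runStepWithVisualAgent(page, 'click the log in button');
    if (!agentPassed) throw new Error('Step failed under both strategies');
    // In the proposed workflow the LLM would also regenerate the cheap check here.
  }
});
```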

anerli 20 hours ago

So this is a path that we definitely considered. However, we think it's a half-measure to generate actual Playwright code and just run that, because if you do that you still have a brittle test at the end of the day, and once it breaks you would need to pull in an LLM to try and adapt it anyway.

Instead of caching actual code, we cache a "plan" of specific web actions that are still described in natural language.

For example, a cached "typing" action might look like: { variant: 'type'; target: string; content: string; }

The target is a natural language description. The content is what to type. Moondream's job is simply to find the target, and then we click into that target and type the content. This means it can be fully vision-based and not rely on the DOM at all, while still being very consistent. Moondream is also trivially cheap to run since it's only a 2B model. If it can't find the target, or its confidence changes significantly (measured via token probabilities), that's an indication the action/plan requires adjustment, and we can dynamically swap in the planner LLM to decide how to adjust the test from there.
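
As a rough sketch of replaying that cached action (the Moondream call is a hypothetical stand-in for the pointing request we make; the Playwright calls are real):

```typescript
// Rough sketch of replaying a cached "type" action with pure vision.

import { type Page } from '@playwright/test';

type TypeAction = { variant: 'type'; target: string; content: string };

// Hypothetical stand-in: ask Moondream to locate `target` in a screenshot and
// return pixel coordinates plus a confidence derived from token probabilities.
declare function locateWithMoondream(
  screenshot: Buffer,
  target: string
): Promise<{ x: number; y: number; confidence: number } | null>;

async function executeType(page: Page, action: TypeAction): Promise<boolean> {
  const screenshot = await page.screenshot();
  const hit = await locateWithMoondream(screenshot, action.target);

  // Missing target or a big confidence drop means the plan needs the planner again.
  if (!hit || hit.confidence < 0.5) return false;

  await page.mouse.click(hit.x, hit.y);      // click into the located target
  await page.keyboard.type(action.content);  // then type the cached content
  return true;
}
```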

ekzy 18 hours ago

Did you consider also caching the coordinates returned by Moondream? I understand that it is cheap, but it could be useful to detect if an element has changed position, as that may indicate a regression.

anerli 16 hours ago

So the problem is that if we cache the coordinates and click blindly at the saved positions, there's no way to tell if the interface has changed or if we're actually clicking the wrong things (unless we try to do something hacky like listen for events on the DOM). Detecting whether elements have changed position would definitely be feasible, though, if re-running a test with Moondream: we could compare against the coordinates of the last run.
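
Something like this tiny sketch (illustrative only) is probably all it would take:

```typescript
// Illustrative sketch: flag steps whose resolved location drifted between runs.

type Point = { x: number; y: number };

function movedSignificantly(previous: Point, current: Point, tolerancePx = 25): boolean {
  return Math.hypot(current.x - previous.x, current.y - previous.y) > tolerancePx;
}

// e.g. after re-locating a target with Moondream on a new run:
// if (movedSignificantly(cachedPoint, hit, 25)) warnPossibleLayoutRegression(step);
```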

chrisweekly 8 hours ago

sounds a lot like snapshot testing

tomatohs 13 hours ago

This is exactly our workflow, though we defined our own YAML spec [1] for reasons mentioned in previous comments.

We have multiple fallbacks to prevent flakes: the "cheap" command, a description of the intended step, and the original prompt.

If any step fails, we fall back to the next source.

1. https://docs.testdriver.ai/reference/test-steps
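
Conceptually the cascade is just this (illustrative sketch, not our actual implementation):

```typescript
// Illustrative sketch of the fallback order: cheap command first, then the
// step description, then the original prompt; fail only if all three fail.

type StepSource = 'cheapCommand' | 'stepDescription' | 'originalPrompt';

// Hypothetical runner for a single step using one source of truth.
declare function runSource(source: StepSource, step: unknown): Promise<boolean>;

async function runStepWithFallbacks(step: unknown): Promise<StepSource> {
  for (const source of ['cheapCommand', 'stepDescription', 'originalPrompt'] as const) {
    if (await runSource(source, step)) return source; // first source that works wins
  }
  throw new Error('Step failed across all fallback sources');
}
```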

chrisweekly 8 hours ago

This looks pretty cool, at least at first glance. I think "traditional web testing" means different things to different people. Last year, the Netflix engineering team published "SafeTest"[1] an interesting hybrid / superset of unit and e2e testing. Have you guys (Magnitude devs) considered incorporating any of their ideas?

1. https://netflixtechblog.com/introducing-safetest-a-novel-app...

anerli 6 hours ago

Looks cool, thanks for sharing! The idea of having a hybrid framework for component unit testing + end-to-end testing is neat. Will definitely consider how this might be applicable to Magnitude.

o1o1o1 5 hours ago

Thanks for sharing, this looks interesting.

However, I do not see a big advantage over Cypress tests.

The article mentions shortcomings of Cypress (and Playwright):

> They start a dev server with bootstrapping code to load the component and/or setup code you want, which limits their ability to handle complex enterprise applications that might have OAuth or a complex build pipeline.

The simple solution is to containerise the whole application (including whatever OAuth provider is used), which then allows you to simply launch the whole thing and run the tests against it. Most apps (especially in enterprise) should already be containerised anyway, so most of the time we can just go ahead and run any tests against them.

How is SafeTest better than that when my goal is to test my application in a real world scenario?

retreatguru 3 hours ago

Any advice about using AI to write test cases? For example, recording a video while using an app and converting that to test cases. Seems like it should work.