Hacker Remix

Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

997 points by weirdcat 13 hours ago | 536 comments

anotherpaulg 7 hours ago

The new Sonnet tops aider's code editing leaderboard at 84.2%. Using aider's "architect" mode it sets the SOTA at 85.7% (with DeepSeek as the "editor" model).

  84% Claude 3.5 Sonnet 10/22
  80% o1-preview
  77% Claude 3.5 Sonnet 06/20
  72% DeepSeek V2.5
  72% GPT-4o 08/06
  71% o1-mini
  68% Claude 3 Opus
It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!

  92% Sonnet 10/22
  75% o1-preview
  72% Opus
  64% Sonnet 06/20
  49% GPT-4o 08/06
  45% o1-mini
https://aider.chat/docs/leaderboards/

artemisart 6 hours ago

Thanks! I was waiting for your benchmarks. Do you plan to test haiku 3.5 too? It would be nice to show API prices needed to run the whole benchmark too to have a better idea of how many internal tokens o1 models consume.

usaar333 2 hours ago

FWIW, the refactor benchmark is quite mechanical - it just stresses reliability of LLMs over long context windows:

Questions are variants of:

Refactor the _set_csrf_cookie method in the CsrfViewMiddleware class to be a stand alone, top level function. Name the new function _set_csrf_cookie, exactly the same name as the existing method. Update any existing self._set_csrf_cookie calls to work with the new _set_csrf_cookie function.

miki123211 2 hours ago

When using these models via the official Anthropic API, do I have to do anything to "opt in" to the new Sonnet, or am I switched over automatically?

simonw 2 hours ago

That depends on the model ID you are using.

If you use "claude-3-5-sonnet-latest" you'll be upgraded to "claude-3-5-sonnet-20241022" already - I tested that this morning.

If you're on "claude-3-5-sonnet-20240620" you'll need to change that ID to either the -latest one or the -20241022 one.

ianeigorndua 6 hours ago

Are these synthetic or real-world benchmarks?

Answering myself: ”Aider’s code editing benchmark asks the LLM to edit python source files to complete 133 small coding exercises from Exercism”

Not gonna start looking for a job any time soon

zeroonetwothree 6 hours ago

Example I chose at random:

> Convert a hexadecimal number, represented as a string (e.g. "10af8c"), to its decimal equivalent using first principles (i.e. no, you may not use built-in or external libraries to accomplish the conversion).

So it's fairly synthetic. It's also the sort of thing LLMs should be great at since I'm sure there's tons of data on this sort of thing online.
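For scale, the quoted exercise is a few lines of first-principles Python (a sketch of one obvious solution, not the benchmark's reference answer):

```python
def hex_to_decimal(s):
    """Convert a hex string to an int from first principles (no int(s, 16))."""
    digits = "0123456789abcdef"
    value = 0
    for ch in s.lower():
        if ch not in digits:
            raise ValueError(f"invalid hex digit: {ch!r}")
        # Each digit shifts the accumulated value one base-16 place left.
        value = value * 16 + digits.index(ch)
    return value

print(hex_to_decimal("10af8c"))  # 1093516
```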

stavros 4 hours ago

I use Claude for coding and it's fantastic. I definitely have outsourced a lot of my coding to it.

LASR 8 hours ago

This is actually a huge deal.

As someone building AI SaaS products, I used to have the position that directly integrating with APIs is going to get us most of the way there in terms of complete AI automation.

I wanted to take a stab at this problem and started researching some everyday businesses and how they use software.

My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.

My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.

I started to realize that pretty much most of the real world runs on software that directly interfaces with people, without clearly defined public APIs you can integrate into. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.

Boy was I wrong.

I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super hairy, and hence very worthwhile to automate.

aduffy 8 hours ago

This has existed for a long time, it's called "RPA" or Robotic Process Automation. The biggest incumbent in this space is UiPath, but there are a host of startups and large companies alike that are tackling it.

Most of the things that RPA is used for can be easily scripted, e.g. download a form from one website, open up Adobe. There are a lot of startups that are trying to build agentic versions of RPA, I'm glad to see Anthropic is investing in it now too.

CSMastermind 7 hours ago

RPA has been a huge pain to work with.

It's almost always a framework around existing tools like Selenium that you constantly have to fight against to get good results from. I was always left with the feeling that I could build something better myself just handrolling the scripts rather than using their frameworks.

Getting Claude integrated into the space is going to be a game changer.

xxpor 5 hours ago

I can see it now, Claude generating expect scripts. 1994 and 2024 will be fully joined.

falcor84 4 hours ago

The big thing I expect at the next level is in using Claude to first generate UI-based automation based on an end user's instructions, then automatically defining a suite of end-to-end tests, confirming with the user "is this how it should work?", and then finally using this suite to reimplement the flow from first principles.

I know we're still a bit far from there, but I don't see a particular hurdle that strikes me as requiring novel research.

SoftTalker 2 hours ago

But does it do any better at soliciting the surprise requirements from the user - the user who, after confirming that everything works, reports a production bug two months later because the software isn't correctly performing the different requirements, which you never knew about, that apply on the first Tuesday of each quarter?

monkeydust 8 hours ago

Exactly. I have been wondering for a while how GenAI might upend RPA providers guess this might be the answer.

arach 2 hours ago

I've been wondering the same and started exploring building a startup around this idea. My analysis led me to the conclusion that if AI gets even just 2 orders of magnitude better over the next two years, this will be "easy" and considered table stakes. Like connecting to the internet, syncing with cloud or using printer drivers

I don't think there will be a very big place for standalone next-gen RPA pure plays. It makes sense that companies trying to deliver value would implement capabilities like this themselves. Over time, I expect some conventions/specs to emerge. Either Apple/Google or Anthropic/OpenAI are likely to come up with an implementation that everyone aligns on.

In other words, I agree

tkellogg 7 hours ago

Honestly, this is going to be huge for healthcare. There's an incredible amount of waste due to incumbent tech making interoperability difficult.

voidmain0001 7 hours ago

Hopefully.

I’ve implemented quite a few RPA apps and the struggle is the request/response turn around time for realtime transactions. For batch data extract or input, RPA is great since there’s no expectation of process duration. However, when a client requests data in realtime that can only be retrieved from an app using RPA, the response time is abysmal. Just picture it - Start the app, log into the app if it requires authentication (hope that the authentication's MFA is email based rather than token based, and then access the mailbox using an in-place configuration with MS Graph/Google Workspace/etc), navigate to the app’s view that has the data or worse, bring up a search interface since the exact data isn’t known and try and find the requested data. So brittle...

miki123211 2 hours ago

Healthcare has the extra complication of HIPAA / equivalent local laws, and institutions being extremely unwilling to process patient data on devices they don't directly control.

I don't think this is going to work in that industry until local models get good enough to do it, and small enough to be affordable to hospitals.

HeatrayEnjoyer 58 minutes ago

Hospitals use O365, there are HIPAA-compliant editions of any prominent cloud service.

SoftTalker 2 hours ago

That industry only thinks it controls its devices. Crowdstrike showed there are many bridges over that moat.

SoftTalker 2 hours ago

> There's an incredible amount of waste due to incumbent tech making interoperability difficult.

So the solution to that is to add another layer of complex AI tech on top of it?

simonw 1 hour ago

Well nothing else we've tried has worked.

girvo 7 hours ago

We’ll see. Having worked in this space in the past, the technical challenges can be overcome today with no new technology: it's a business, sales, and regulation challenge more than a tech one.

claytongulick 5 hours ago

Sometimes.

In my case I have a bunch of nurses that waste a huge amount of time dealing with clerical work and tech hoops, rather than operating at the top of their license.

Traditional RPAs are tough when you're dealing with VPNs, 2fa, remote desktop (in multiple ways), a variety of EHRs and scraping clinical documentation from poorly structured clinical notes or PDFs.

This technology looks like it could be a game changer for our organization.

mewpmewp2 5 hours ago

True, 2FA and all these little details that exist now have made this automation quite insanely complicated. It is of course necessary that we have 2FA etc, but there is huge potential in solving this I believe.

falcor84 4 hours ago

From a security standpoint, what's considered the "proper" way of assigning a bot access based on a person's 2FA? Would that be some sort of limited scope expiring token like GitHub's fine-grained personal access tokens?

mewpmewp2 3 hours ago

I don't know, I feel like it has to be some sort of near field identity proof. E.g. as long as you are wearing a piece of equipment to a physical computer near you can run all those automations for you, or similar. I haven't fully thought what the best solution could be or whether someone is already working on it, but I feel like there has to be something like that, which would allow you better UX in terms of access, but security at the same time.

So maybe like an automated YubiKey that you can opt in to a nearby computer to have all the access. Especially if working from home, you could set it to a state where, if you are within a 15m radius of your laptop, it is able to sign all access.

Because right now, considering the number of tools I use and with single sign-on, VPN, Okta, etc., and how slow they all seem to be, constantly logging in everywhere is an extremely frustrating process, and it almost makes me procrastinate my work, because I can't be bothered. Everything about those weird little things is an absolutely terrible experience, including things like cookie banners as well.

And it is ridiculous, because I'm working from home, but frustratingly high amount of time is spent on this bs.

A bluetooth wearable or similar to prove that I'm nearby essentially, to me that seems like it could alleviate a lot of safety concerns, while providing amazing dev/ux.

falcor84 2 hours ago

That's a really cool idea.

The main attack vector would then probably be some man-in-the-middle intercepting the signal from your wearable, which leads me to wonder whether you could protect yourself by having the responses valid for only an extremely short duration, e.g. ~1ms, such that there's no way for an attacker to do anything with the token unless they gain control over compute inside your house.
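The short-lived-token idea can be sketched as a time-bucketed HMAC over a secret shared with the wearable. Everything here is an assumption for illustration (the function names, the 500 ms window, accepting one previous bucket for clock skew); it is not an existing protocol:

```python
import hashlib
import hmac
import time

def make_token(secret: bytes, now_ms: int, window_ms: int = 500) -> str:
    # Sign the current time bucket; the token changes every window_ms.
    bucket = now_ms // window_ms
    return hmac.new(secret, str(bucket).encode(), hashlib.sha256).hexdigest()

def verify(secret: bytes, token: str, now_ms: int, window_ms: int = 500) -> bool:
    # Accept the current bucket and the immediately previous one, to
    # tolerate small clock skew and transmission delay.
    for bucket in (now_ms // window_ms, now_ms // window_ms - 1):
        expected = hmac.new(secret, str(bucket).encode(), hashlib.sha256).hexdigest()
        if hmac.compare_digest(expected, token):
            return True
    return False

secret = b"shared-with-wearable"
now = int(time.time() * 1000)
t = make_token(secret, now)
print(verify(secret, t, now))           # True: within the validity window
print(verify(secret, t, now + 10_000))  # False: token long expired
```

A replayed token is useless once its bucket passes, which is the property being discussed; proximity proof (that the wearable is actually nearby) would still need a separate mechanism.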

iwontberude 6 hours ago

UiPath hasn't figured out how to make a profitable business since 2005, and we are nearing the end of this hype cycle. I am not so sure this will lead anywhere. I am a former investor in UiPath.

voidmain0001 1 hour ago

It didn’t help that UiPath forced a subscription model and “cloud orchestrator” on all users, many of whom needed neither. They got greedy. We ditched it.

TeMPOraL 8 hours ago

> Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.

FWIW, looking at it from end-user perspective, it ain't much different than the Windows apps. APIs are not interoperability - they tend to be tightly-controlled channels, access gated by the vendor and provided through contracts.

In a way, it's easier to make an API to a legacy native desktop app than it is to a typical SaaS[0] - the native app gets updated infrequently, and isn't running in an obstinate sandbox. The older the app, the better - it's more likely to rely on OS APIs and practices, designed with collaboration and accessibility in mind. E.g. in Windows land, in many cases you don't need OCR and mouse emulation - you just need to enumerate the window handles, walk the tree structure looking for text or IDs you care about, and send targeted messages to those components.
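The handle-walking approach can be sketched with Python's `ctypes` (Windows-only; `EnumWindows` and `GetWindowTextW` are real user32 APIs, but the helper functions here are hypothetical and this only covers enumeration and reading titles, not sending messages to components):

```python
import ctypes
import sys

def matches(title: str, needle: str) -> bool:
    # Case-insensitive substring match on a window title.
    return needle.lower() in title.lower()

def find_windows(needle: str):
    """Return (hwnd, title) pairs for top-level windows whose title
    contains `needle`. Requires the Win32 user32 API."""
    if sys.platform != "win32":
        raise OSError("requires the Win32 user32 API")
    user32 = ctypes.windll.user32
    results = []

    # EnumWindows invokes the callback once per top-level window handle.
    @ctypes.WINFUNCTYPE(ctypes.c_int, ctypes.c_void_p, ctypes.c_void_p)
    def callback(hwnd, _lparam):
        length = user32.GetWindowTextLengthW(hwnd)
        buf = ctypes.create_unicode_buffer(length + 1)
        user32.GetWindowTextW(hwnd, buf, length + 1)
        if matches(buf.value, needle):
            results.append((hwnd, buf.value))
        return 1  # nonzero: keep enumerating

    user32.EnumWindows(callback, 0)
    return results
```

From a matched handle you can descend into child controls and send targeted messages (e.g. `WM_GETTEXT`) rather than resorting to OCR and mouse emulation.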

Unfortunately, desktop apps are headed the same direction web apps are (increasingly often, they are web apps in disguise), so I agree that AI-level RPA is a huge deal.

--

[0] - This is changing a bit in that frameworks seem to be getting complex enough that SaaS vendors often have no clue as to what kind of access they're leaving open to people who know how to press F12 in their browsers and how to call cURL. I'm not talking bespoke APIs backend team wrote, but standard ones built into middleware, that fell beyond dev team's "abstraction horizon". GraphQL is a notable example.

pants2 8 hours ago

Basically, if it means companies can introduce automation without changing anything about the tooling/workflows/programs they already use, it's going to be MASSIVE. Just an install and a prompt and you've automated a lengthy manual process - awesome.

bambax 7 hours ago

Companies are going to install an AI inside their own proprietary systems full of proprietary and confidential data and PII about their customers and prospects and whatnot, and let it run around and click on random buttons and submit random forms?

Really??!? What could possibly go wrong.

I'm currently trying to do a large OCR project using the Google Vision API, and then Gemini 1.5 Pro 002 to parse and reconstruct the results (taking advantage, one hopes, of its big context window). As I'm not familiar with the Google Vision API, I asked Gemini to guide me in setting it up.

Gemini is the latest Google model; Vision, as the name implies, is also from Google. Yet Gemini makes several egregious mistakes about Vision, gets names of fields or options wrong, etc.

Gemini 1.5 "Pro" also suggests that concatenating two json strings produces a valid json string; when told that's unlikely, it's very sorry and makes lots of apologies, but still it made the mistake in the first place.

LLMs can be useful when used with caution; letting one loose in an enterprise environment doesn't feel safe, or sane.
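The JSON point is easy to demonstrate: a JSON parser stops at the end of the first value and rejects trailing data, so concatenating two JSON documents does not yield a valid one. Merging the parsed objects and re-serializing does:

```python
import json

a = json.dumps({"a": 1})  # '{"a": 1}'
b = json.dumps({"b": 2})  # '{"b": 2}'

# Concatenating the raw strings is NOT valid JSON: the parser
# rejects the second object as "extra data" after the first.
try:
    json.loads(a + b)
    print("parsed?!")
except json.JSONDecodeError as e:
    print("concatenation is not valid JSON:", e.msg)

# Parsing both, merging the dicts, and re-serializing is valid.
merged = {**json.loads(a), **json.loads(b)}
print(json.dumps(merged))  # {"a": 1, "b": 2}
```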

LASR 8 hours ago

That's exactly it.

I've been peddling my vision of "AI automation" for the last several months to acquaintances of mine in various professional fields. In some cases, even building up prototypes and real-user testing. Invariably, none have really stuck.

This is not a technical problem that requires a technical solution. The problem is that it requires human behavior change.

In the context of AI automation, the promise is huge gains, but when you try to convince users/buyers, there is nothing wrong with their current solutions, i.e. there is no problem to solve. So essentially: "why are you bothering me with this AI nonsense?"

Honestly, human behavior change might be the only real blocker to a world where AI automates most of the boring busy work currently done by people.

This approach essentially sidesteps the need to effect a behavior change, at least in the short term, while AI proves and solidifies its value in the real world.

sdwr 6 hours ago

There's a huge huge gap between "coaxing what you want out of it" and "trusting it to perform flawlessly". Everybody on the planet would use #2, but #1 is just for enthusiasts.

AI is squarely #1. You can't trust it with your credit card to order groceries, or to budget and plan and book your vacation. People aren't picking up on AI because it isn't good enough yet to trust - you still have the burden of responsibility for the task.

dimitri-vs 4 hours ago

Siri, Alexa and the Amazon Dash illustrate this well. I remember everyone's excitement about and massive investment in these, and we all know how that turned out. I'm not sure how many times we'll need to relearn that unless an automation works >99% of the time AND fails predictably, people won't use it for anything meaningful.

Aeolun 4 hours ago

There’s nothing to gain for anyone there. Workers will lose their jobs, and managers will lose their reports.

ldjkfkdsjnv 8 hours ago

Yeah this will be a true paradigm shift

aledalgrande 3 hours ago

Talking about ancient Windows software... Windows used to have an API for automation in the 2000s (I don't know if it still does). I wrote this MS Access script that ran and moved the cursor at exactly the pixel coordinates where buttons and fields were positioned in a GUI that we wanted to extract data from, in one of my first jobs. My boss used to do this manually. After a week he had millions of records ready to query in Access. You can imagine how excited he was. Was a fun little project and pretty hilarious to see the cursor moving fast AF around the screen like it was possessed. PS: you could screw up the script run pretty easily by bumping into the mouse of that pc.

voidmain0001 1 hour ago

Still present. VB and VBScript would do this by sending mouse moves to window handles discovered using Spy++. You can do it with C# or AutoIt these days.

marsh_mellow 12 hours ago

karpatic 10 hours ago

This needs to be brought up. Was looking for the demo and ended up on the contact form

frankdenbow 6 hours ago

Thanks for these. Wonder how many people will use this at work to pretend that they are doing work while they listen to a podcast.

nwnwhwje 1 hour ago

This is cover for the people whose screens are recorded. Run this on the monitored laptop to make you look busy, then do the actual work on laptop 2, some of which might actually require thinking, so no mouse movements.

HarHarVeryFunny 5 hours ago

The "computer use" ability is extremely impressive!

This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as its access to your computer, to achieve that goal.

Take a look at their demo of using this for coding.

https://www.youtube.com/watch?v=vH2f7cjXjKI

This seems to be an OpenAI o1 killer - it may be using an agent to do the reasoning (it's still not clear exactly what is under the hood), as opposed to o1, which is supposedly a single model (but still basically a loop around an LLM). Either way, the reasoning it is able to achieve in pursuit of a real-world goal is very impressive. It'd be mind-boggling if we hadn't had the last few years to get used to this escalation of capabilities.

It's also interesting to consider this from POV of Anthropic's focus on AI safety. On their web site they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc, but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking ?