997 points by weirdcat 13 hours ago | 536 comments
anotherpaulg 7 hours ago
84% Claude 3.5 Sonnet 10/22
80% o1-preview
77% Claude 3.5 Sonnet 06/20
72% DeepSeek V2.5
72% GPT-4o 08/06
71% o1-mini
68% Claude 3 Opus
It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!
92% Sonnet 10/22
75% o1-preview
72% Opus
64% Sonnet 06/20
49% GPT-4o 08/06
45% o1-mini
https://aider.chat/docs/leaderboards/
artemisart 6 hours ago
usaar333 2 hours ago
Questions are variants of:
Refactor the _set_csrf_cookie method in the CsrfViewMiddleware class to be a stand alone, top level function. Name the new function _set_csrf_cookie, exactly the same name as the existing method. Update any existing self._set_csrf_cookie calls to work with the new _set_csrf_cookie function.
miki123211 2 hours ago
simonw 2 hours ago
If you use "claude-3-5-sonnet-latest" you'll be upgraded to "claude-3-5-sonnet-20241022" already - I tested that this morning.
If you're on "claude-3-5-sonnet-20240620" you'll need to change that ID to either the -latest one or the -20241022 one.
ianeigorndua 6 hours ago
Answering myself: “Aider’s code editing benchmark asks the LLM to edit python source files to complete 133 small coding exercises from Exercism”
Not gonna start looking for a job any time soon
zeroonetwothree 6 hours ago
> Convert a hexadecimal number, represented as a string (e.g. "10af8c"), to its decimal equivalent using first principles (i.e. no, you may not use built-in or external libraries to accomplish the conversion).
So it's fairly synthetic. It's also the sort of thing LLMs should be great at since I'm sure there's tons of data on this sort of thing online.
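For reference, a straightforward solution to the quoted exercise (one possible "first principles" approach, avoiding `int(s, 16)`):

```python
def hex_to_decimal(s):
    """Convert a hex string like "10af8c" to an int without int(s, 16)."""
    digits = "0123456789abcdef"
    value = 0
    for ch in s.lower():
        if ch not in digits:
            raise ValueError(f"invalid hex digit: {ch}")
        # Shift accumulated value one hex place and add the digit's value.
        value = value * 16 + digits.index(ch)
    return value
```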
stavros 4 hours ago
LASR 8 hours ago
As someone building AI SaaS products, I used to have the position that directly integrating with APIs is going to get us most of the way there in terms of complete AI automation.
I wanted to take a stab at this problem and started researching some daily businesses and how they use software.
My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.
My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.
I started to realize that pretty much most of the real world runs on software that directly interfaces with people, without clearly defined public APIs you can integrate into. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.
Boy was I wrong.
I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super-hairy, and hence very worthwhile to automate.
aduffy 8 hours ago
Most of the things that RPA is used for can be easily scripted, e.g. download a form from one website, open up Adobe. There are a lot of startups that are trying to build agentic versions of RPA, I'm glad to see Anthropic is investing in it now too.
CSMastermind 7 hours ago
It's almost always a framework around existing tools like Selenium that you constantly have to fight against to get good results from. I was always left with the feeling that I could build something better myself just handrolling the scripts rather than using their frameworks.
Getting Claude integrated into the space is going to be a game changer.
xxpor 5 hours ago
falcor84 4 hours ago
I know we're still a bit far from there, but I don't see a particular hurdle that strikes me as requiring novel research.
SoftTalker 2 hours ago
monkeydust 8 hours ago
arach 2 hours ago
I don't think there will be a very big place for standalone next gen RPA pure plays. It makes sense that companies that are trying to deliver value would implement capabilities like this. Over time, I expect some conventions/specs will emerge. Either Apple/Google or Anthropic/OpenAI are likely to come up with an implementation that everyone aligns on
In other words, I agree
tkellogg 7 hours ago
voidmain0001 7 hours ago
I’ve implemented quite a few RPA apps and the struggle is the request/response turn around time for realtime transactions. For batch data extract or input, RPA is great since there’s no expectation of process duration. However, when a client requests data in realtime that can only be retrieved from an app using RPA, the response time is abysmal. Just picture it - Start the app, log into the app if it requires authentication (hope that the authentication's MFA is email based rather than token based, and then access the mailbox using an in-place configuration with MS Graph/Google Workspace/etc), navigate to the app’s view that has the data or worse, bring up a search interface since the exact data isn’t known and try and find the requested data. So brittle...
miki123211 2 hours ago
I don't think this is going to work in that industry until local models get good enough to do it, and small enough to be affordable to hospitals.
HeatrayEnjoyer 58 minutes ago
SoftTalker 2 hours ago
SoftTalker 2 hours ago
So the solution to that is to add another layer of complex AI tech on top of it?
simonw 1 hour ago
girvo 7 hours ago
claytongulick 5 hours ago
In my case I have a bunch of nurses that waste a huge amount of time dealing with clerical work and tech hoops, rather than operating at the top of their license.
Traditional RPAs are tough when you're dealing with VPNs, 2fa, remote desktop (in multiple ways), a variety of EHRs and scraping clinical documentation from poorly structured clinical notes or PDFs.
This technology looks like it could be a game changer for our organization.
mewpmewp2 5 hours ago
falcor84 4 hours ago
mewpmewp2 3 hours ago
So maybe like an automated YubiKey that you can opt in to a nearby computer to have all the access. Especially if working from home, you can set it at a state where if you are in a 15m radius of your laptop it is able to sign all access.
Because right now, considering the amount of tools and everything I use, with single sign on, VPN, Okta, etc, and how slow they seem to be, it's an extremely frustrating process constantly logging in everywhere, and it's almost like it makes me procrastinate my work, because I can't be bothered. Everything about those weird little things is an absolutely terrible experience, including things like cookie banners as well.
And it is ridiculous, because I'm working from home, but frustratingly high amount of time is spent on this bs.
A bluetooth wearable or similar to prove that I'm nearby essentially, to me that seems like it could alleviate a lot of safety concerns, while providing amazing dev/ux.
falcor84 2 hours ago
The main attack vector would then probably be some man-in-the-middle intercepting the signal from your wearable, which leads me to wonder whether you could protect yourself by having the responses valid for only an extremely short duration, e.g. ~1ms, such that there's no way for an attacker to do anything with the token unless they gain control over compute inside your house.
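A rough sketch of that short-validity idea (all names and the millisecond window here are illustrative assumptions, not a vetted protocol): the wearable HMACs the current timestamp with a shared secret, and the verifier rejects anything outside a tiny freshness window, so a replayed token is useless almost immediately:

```python
import hashlib
import hmac

# Hypothetical shared secret provisioned between wearable and laptop.
SECRET = b"shared-device-secret"

def make_token(now_ms):
    """Wearable side: sign the current time in milliseconds."""
    sig = hmac.new(SECRET, str(now_ms).encode(), hashlib.sha256).hexdigest()
    return now_ms, sig

def verify(token, now_ms, max_age_ms=1):
    """Laptop side: accept only fresh, correctly signed tokens."""
    ts, sig = token
    expected = hmac.new(SECRET, str(ts).encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison plus an extremely short validity window.
    return hmac.compare_digest(sig, expected) and 0 <= now_ms - ts <= max_age_ms
```

Real proximity auth (e.g. FIDO2 over BLE) involves challenge-response rather than bare timestamps, so treat this purely as an illustration of the "valid for ~1ms" intuition.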
iwontberude 6 hours ago
voidmain0001 1 hour ago
TeMPOraL 8 hours ago
FWIW, looking at it from end-user perspective, it ain't much different than the Windows apps. APIs are not interoperability - they tend to be tightly-controlled channels, access gated by the vendor and provided through contracts.
In a way, it's easier to make an API to a legacy native desktop app than it is to a typical SaaS[0] - the native app gets updated infrequently, and isn't running in an obstinate sandbox. The older the app, the better - it's more likely to rely on OS APIs and practices, designed with collaboration and accessibility in mind. E.g. in Windows land, in many cases you don't need OCR and mouse emulation - you just need to enumerate the window handles, walk the tree structure looking for text or IDs you care about, and send targeted messages to those components.
Unfortunately, desktop apps are headed the same direction web apps are (increasingly often, they are web apps in disguise), so I agree that AI-level RPA is a huge deal.
--
[0] - This is changing a bit in that frameworks seem to be getting complex enough that SaaS vendors often have no clue as to what kind of access they're leaving open to people who know how to press F12 in their browsers and how to call cURL. I'm not talking bespoke APIs backend team wrote, but standard ones built into middleware, that fell beyond dev team's "abstraction horizon". GraphQL is a notable example.
pants2 8 hours ago
bambax 7 hours ago
Really??!? What could possibly go wrong.
I'm currently trying to do a large OCR project using the Google Vision API, and then Gemini 1.5 Pro 002 to parse and reconstruct the results (taking advantage, one hopes, of its big context window). As I'm not familiar with the Google Vision API I asked Gemini to guide me in setting it up.
Gemini is the latest Google model; Vision, as the name implies, is also from Google. Yet Gemini makes several egregious mistakes about Vision, gets names of fields or options wrong, etc.
Gemini 1.5 "Pro" also suggests that concatenating two json strings produces a valid json string; when told that's unlikely, it's very sorry and makes lots of apologies, but still it made the mistake in the first place.
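The claim is trivially checkable, which makes the mistake all the more striking:

```python
import json

# Concatenating two valid JSON documents does not yield valid JSON:
# the result is two top-level values back to back, not one document.
a = json.dumps({"x": 1})
b = json.dumps({"y": 2})

try:
    json.loads(a + b)
    concatenation_is_valid = True
except json.JSONDecodeError:
    concatenation_is_valid = False
# concatenation_is_valid is False for '{"x": 1}{"y": 2}'
```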
LLMs can be useful when used with caution; letting one loose in an enterprise environment doesn't feel safe, or sane.
LASR 8 hours ago
I've been peddling my vision of "AI automation" for the last several months to acquaintances of mine in various professional fields. In some cases, even building up prototypes and real-user testing. Invariably, none have really stuck.
This is not a technical problem that requires a technical solution. The problem is that it requires human behavior change.
In the context of AI automation, the promise is huge gains, but when you try to convince users / buyers, there is nothing wrong with their current solutions. Ie: There is no problem to solve. So essentially "why are you bothering me with this AI nonsense?"
Honestly, human behavior change might be the only real blocker to a world where AI automates most of the boring busy work currently done by people.
This approach essentially sidesteps the need to effect a behavior change, at least in the short-term while AI can prove and solidify its value in the real world.
sdwr 6 hours ago
AI is squarely #1. You can't trust it with your credit card to order groceries, or to budget and plan and book your vacation. People aren't picking up on AI because it isn't good enough yet to trust - you still have the burden of responsibility for the task.
dimitri-vs 4 hours ago
Aeolun 4 hours ago
ldjkfkdsjnv 8 hours ago
aledalgrande 3 hours ago
voidmain0001 1 hour ago
marsh_mellow 12 hours ago
Computer use API documentation: https://docs.anthropic.com/en/docs/build-with-claude/compute...
Computer Use Demo: https://github.com/anthropics/anthropic-quickstarts/tree/mai...
karpatic 10 hours ago
frankdenbow 6 hours ago
nwnwhwje 1 hour ago
HarHarVeryFunny 5 hours ago
This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as its access to your computer, to achieve that goal.
Take a look at their demo of using this for coding.
https://www.youtube.com/watch?v=vH2f7cjXjKI
This seems to be an OpenAI GPT-o1 killer - it may be using an agent to do reasoning (still not clear exactly what is under the hood) as opposed to GPT-o1 supposedly being a model (but still basically a loop around an LLM), but the reasoning it is able to achieve in pursuit of a real world goal is very impressive. It'd be mind boggling if we hadn't had the last few years to get used to this escalation of capabilities.
It's also interesting to consider this from POV of Anthropic's focus on AI safety. On their web site they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc, but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking ?