llamasushi 43 minutes ago

The burying of the lede here is insane. $5/$25 per MTok is a 3x price drop from Opus 4. At that price point, Opus stops being "the model you use for important things" and becomes actually viable for production workloads.

Also notable: they're claiming SOTA prompt injection resistance. The industry has largely given up on solving this problem through training alone, so if the numbers in the system card hold up under adversarial testing, that's legitimately significant for anyone deploying agents with tool access.

The "most aligned model" framing is doing a lot of heavy lifting though. Would love to see third-party red team results.

  • losvedir 4 minutes ago

    I almost scrolled past the "Safety" section, because in the past it always seemed sort of silly sci-fi scaremongering (IMO) or things that I would classify as "sharp tool dangerous in the wrong hands". But I'm glad I stopped, because it actually talked about real, practical issues like the prompt injections that you mention. I wonder if the industry term "safety" is pivoting to refer to other things now.

  • tekacs 28 minutes ago

    This is also super relevant for everyone who had ditched Claude Code due to limits:

    > For Claude and Claude Code users with access to Opus 4.5, we’ve removed Opus-specific caps. For Max and Team Premium users, we’ve increased overall usage limits, meaning you’ll have roughly the same number of Opus tokens as you previously had with Sonnet. We’re updating usage limits to make sure you’re able to use Opus 4.5 for daily work.

  • Scene_Cast2 17 minutes ago

    Still way pricier (>2x) than Gemini 3 and Grok 4. I've noticed that the latter two also perform better than Opus 4, so I've stopped using Opus.

  • wolttam 38 minutes ago

    It's 1/3 the old price ($15/$75)

    • brookst 22 minutes ago

      Not sure if that’s a joke about LLM math performance, but pedantry requires me to point out 15 / 75 = 1/5

      • l1n 20 minutes ago

        $15/megatoken in, $75/megatoken out

        • brookst 15 minutes ago

          Sigh, ok, I’m the defective one here.

      • conradkay 19 minutes ago

        they mean it used to be $15/m input and $75/m output tokens

827a an hour ago

I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5. I've also had some problems that only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.

I think Anthropic is making the right decisions with their models. Given that software engineering is probably one of the very few domains of AI usage that is driving real, serious revenue: I have far better feelings about Anthropic going into 2026 than any other foundation model. Excited to put Opus 4.5 through its paces.

  • mritchie712 32 minutes ago

    > only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.

    I think part of it is this[0] and I expect it will become more of a problem.

    Claude models have built-in tools (e.g. `str_replace_editor`) which they've been trained to use. These tools don't exist in Cursor, but Claude really wants to use them.

    0 - https://x.com/thisritchie/status/1944038132665454841?s=20
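
    For reference, this is roughly the shape of that built-in tool on the API side. A minimal sketch with the Python SDK; the exact tool type/version string and model id vary by model generation, so treat those values as illustrative:

      # Sketch: requesting Anthropic's built-in text editor tool via the Messages API.
      # Tool type string and model id are illustrative; they differ across model generations.
      import anthropic

      client = anthropic.Anthropic()
      response = client.messages.create(
          model="claude-sonnet-4-5",
          max_tokens=1024,
          tools=[{"type": "text_editor_20250124", "name": "str_replace_editor"}],
          messages=[{"role": "user", "content": "Fix the syntax error in primes.py"}],
      )
      # The reply contains tool_use blocks such as:
      #   {"name": "str_replace_editor",
      #    "input": {"command": "str_replace", "path": "primes.py", ...}}
      # Claude Code executes these natively; Cursor exposes its own edit tools instead,
      # which is the mismatch described above.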

    • HugoDias 20 minutes ago

      TIL! I'll finally give Claude Code a try. I've been using Cursor since it launched and never tried anything else. The terminal UI didn't appeal to me, but knowing it has better performance, I'll check it out.

      Cursor has been a terrible experience lately, regardless of the model. Sometimes for the same task, I need to try with Sonnet 4.5, ChatGPT 5.1 Codex, Gemini Pro 3... and most times, none managed to do the work, and I end up doing it myself.

      At least I’m coding more again, lol

  • vunderba 40 minutes ago

    My workflow was usually to use Gemini 2.5 Pro (now 3.0) for high-level architecture and design. Then I would take the finished "spec" and have Sonnet 4.5 perform the actual implementation.

    • nevir 29 minutes ago

      Same here. Gemini really excels at all the "softer" parts of the development process (which, TBH, feels like most of the work). And Claude kicks ass at the actual code authoring.

      It's a really nice workflow.

    • SkyPuncher 14 minutes ago

      This is how I do it. Though I've been using Composer as my main driver more and more.

      * Composer - line-by-line changes
      * Sonnet 4.5 - task planning and small-to-medium feature architecture; pass it off to Composer for code
      * Gemini Pro - large and XL architecture work; pass it off to Sonnet to break down into tasks

    • config_yml 33 minutes ago

      I use plan mode in claude code, then use gpt-5 in codex to review the plan and identify gaps and feed it back to claude. Results are amazing.

    • jeswin 12 minutes ago

      Same here. But with GPT 5.1 instead of Gemini.

    • vessenes 38 minutes ago

      I like this plan, too - Gemini's recent series has long seemed to have the best large-context awareness vs. competing frontier models. Anecdotally, although much slower, I think GPT-5's architecture plans are slightly better.

    • UltraSane 27 minutes ago

      I've done this and it seems to work well. I ask Gemini to generate a prompt for Claude Code to accomplish X

  • emodendroket 11 minutes ago

    Yeah I think Sonnet is still the best in my experience but the limits are so stingy I find it hard to recommend for personal use.

  • lvl155 35 minutes ago

    I really don’t understand the hype around Gemini. Opus/Sonnet/GPT are much better for agentic workflows. Seems people get hyped for the first few days. It also has a lot to do with Claude Code and Codex.

    • egeozcan 27 minutes ago

      I'm completely the opposite. I find Gemini (even 2.5 Pro) much, much better than anything else. But I hate agentic flows; I upload the full context to it in AI Studio and then it shines. Anything agentic cannot even come close.

    • jdgoesmarching 23 minutes ago

      Personally my hype is for the price, especially for Flash. Before Sonnet 4.5 was competitive with Gemini 2.5 Pro, the latter was a much better value than Opus 4.1.

  • chinathrow 38 minutes ago

    I gave Sonnet 4.5 a base64-encoded PHP serialize() JSON of an object dump and told him to extract the URL within.

    It gave me the YouTube URL to Rick Astley.

    • arghwhat 35 minutes ago

      If you're asking an LLM to compute something "off the top of its head", you're using it wrong. Ask it to write the code to perform the computation and it'll do better.

      Same with asking a person to solve something in their head vs. giving them an editor and a random python interpreter, or whatever it is normal people use to solve problems.

    • hu3 27 minutes ago

      > I gave Sonnet 4.5 a base64-encoded PHP serialize() JSON of an object dump and told him to extract the URL within.

      This is what I imagine the LLM usage of people who tell me AI isn't helpful.

      It's like telling me airplanes aren't useful because you can't use them in McDonald's drive-through.

    • mikestorrent 36 minutes ago

      You should probably tell AI to write you programs to do tasks that programs are better at than minds.

    • stavros 34 minutes ago

      Don't use LLMs for a task a human can't do, they won't do it well.

      • wmf 16 minutes ago

        A human could easily come up with a base64 -d | jq oneliner.
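
        Something like this, sketched under the assumption that the blob lives in a file called dump.txt and that the decoded payload actually contains JSON (raw PHP serialize() output isn't JSON, hence the grep fallback):

          # pull URL-looking strings out of the decoded JSON
          base64 -d dump.txt | jq -r '.. | strings | select(test("^https?://"))'

          # format-agnostic fallback that also works on serialize() output
          base64 -d dump.txt | grep -oE 'https?://[^"; ]+'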

        • stavros a minute ago

          So can the LLM, but that wasn't the task.

    • gregable 23 minutes ago

      it. Not him.

      • chinathrow 4 minutes ago

        It's Claude. Where I live, that is a male name.

  • visioninmyblood an hour ago

    The model is great; it is able to code up some interesting visual tasks (I guess they have pretty strong tool-calling capabilities), like orchestrating prompt -> image generation -> segmentation -> 3D reconstruction. Check out the results here: https://chat.vlm.run/c/3fcd6b33-266f-4796-9d10-cfc152e945b7. Note the model was only used to orchestrate the pipeline; the tasks are done by other models in an agentic framework. They must have improved the tool-calling framework with all the MCP usage. Gemini 3 was able to orchestrate the same, but Claude 4.5 is much faster.

  • rishabhaiover an hour ago

    I suspect Cursor is not the right platform to write code on. IMO, humans are lazy and would never code on Cursor. They default to code generation via prompt which is sub-optimal.

    • viraptor an hour ago

      > They default to writing code via prompt generation which is sub-optimal.

      What do you mean?

      • rishabhaiover 40 minutes ago

        If you're given a finite context window, what are the most efficient tokens to present for a programming task: sloppy prompts, or actual code (using it with autocomplete)?

        • viraptor 26 minutes ago

          I'm not sure you get how Cursor works. You add both instructions and code to your prompt. And it does provide its own autocomplete model as well. And... lots of people use that. (It's the largest platform today as far as I can tell)

          • rishabhaiover 13 minutes ago

            I wish I didn't know how Cursor works. It's a great product for 90% of programmers out there no doubt.

  • Squarex an hour ago

    I have heard that Gemini 3 is not that great in Cursor, but excellent in Antigravity. I don't have the time to personally verify all that, though.

    • config_yml 31 minutes ago

      I've had no success using Antigravity, which is a shame because the ideas are promising, but the execution so far is underwhelming. Haven't gotten past an initial planning doc, which is usually aborted due to model provider overload or rate limiting.

    • itsdrewmiller 39 minutes ago

      My first couple of attempts at Antigravity / Gemini were pretty bad - the model kept aborting and it was relatively helpless at tools compared to Claude (although I have a lot more experience tuning Claude, to be fair). Seems like there are some good ideas in Antigravity, but it's more like an alpha than a product.

    • incoming1211 an hour ago

      I think Gemini 3 is hot garbage in everything. It's great on a greenfield project when trying to one-shot something, but if you're working on a long-term project it just sucks.

  • jjcm 28 minutes ago

    Tangential observation - I've noticed Gemini 3 Pro's train of thought feels very unique. It has kind of an emotive personality to it, where it's surprised or excited by what it finds. It feels like a senior developer looking through legacy code and being like, "wtf is this??".

    I'm curious if this was a deliberate effort on their part, and if they found in testing it provided better output. It's still behind other models clearly, but nonetheless it's fascinating.

  • screye 13 minutes ago

    Gemini being terrible in Cursor is a well known problem.

    Unfortunately, for all its engineers, Google seems the most incompetent at product work.

  • UltraSane 28 minutes ago

    I've had Gemini 3 Pro solve issues that Claude Code failed to solve after 10 tries. It even insulted some code that Sonnet 4.5 generated.

  • rustystump 38 minutes ago

    Gemini 3 was awful when I gave it a spin. It was worse than Cursor's Composer model.

    Claude is still a go-to, but I have found that Composer was “good enough” in practice.

  • behnamoh an hour ago

    I've tried Gemini in Google AI Studio as well and was very disappointed by the superficial responses it provided. It seems to be at the level of GPT-5-low or even lower.

    On the other hand, it's a truly multimodal model, whereas Claude remains specifically targeted at coding tasks and is therefore only a text model.

  • poszlem 44 minutes ago

    I’ve trashed Gemini non-stop (seriously, check my history on this site), but 3 Pro is the one that finally made me switch from OpenAI. It’s still hot garbage at coding next to Claude, but for general stuff, it’s legit fantastic.

  • enraged_camel 42 minutes ago

    My testing of Gemini 3 Pro in Cursor yielded mixed results. Sometimes it's phenomenal. At other times I either get the "provider overloaded" message (after like 5 mins or whatever the timeout is), or the model's internal monologue starts spilling out to the chat window, which becomes really messy and unreadable. It'll do things like:

    >> I'll execute.

    >> I'll execute.

    >> Wait, what if...?

    >> I'll execute.

    Suffice it to say I've switched back to Sonnet as my daily driver. Excited to give Opus a try.

unsupp0rted 37 minutes ago

This is gonna be game-changing for the next 2-4 weeks before they nerf the model.

Then for the next 2-3 months people complaining about the degradation will be labeled “skill issue”.

Then a sacrificial Anthropic engineer will “discover” a couple of obscure bugs that “in some cases” might have led to less than optimal performance. Still largely a user skill issue though.

Then a couple months later they’ll release Opus 4.7 and go through the cycle again.

My allegiance to these companies is now measured in nerf cycles.

I’m a nerf cycle customer.

  • film42 11 minutes ago

    This is why I migrated my apps that need an LLM to Gemini. No model degradation so far all through the v2.5 model generation. What is Anthropic doing? Swapping for a quantized version of the model?

  • blurbleblurble 5 minutes ago

    I hope this comment makes it to the top.

    Gpt-5.1-* are all fully nerfed for me. Maybe they're giving others the real juice but they're not giving it to me.

    My sense is that they give me the real deal until I fork over $200, then they proceed to make me sit there for hours babysitting an LLM that does nothing useful and takes 20 minutes at a time for each nothing increment.

hebejebelus 14 minutes ago

On my Max plan, Opus 4.5 is now the default model! Until now I used Sonnet 4.5 exclusively and never used Opus, even for planning - I'm shocked that this is so cheap (for them) that it can be the default now. I'm curious what this will mean for the daily/weekly limits.

A short run at a small toy app makes me feel like Opus 4.5 is a bit slower than Sonnet 4.5 was, but that could also just be the day-one load it's presumably under. I don't think Sonnet was holding me back much, but it's far too early to tell.

bnchrch an hour ago

Seeing these benchmarks makes me so happy.

Not because I love Anthropic (I do like them) but because it's staving off me having to change my Coding Agent.

This world is changing fast, and both keeping up with State of the Art and/or the feeling of FOMO is exhausting.

I've been holding onto Claude Code for the last little while since I've built up a robust set of habits, slash commands, and sub-agents that help me squeeze as much out of the platform as possible.

But with the last few releases of Gemini and Codex I've been getting closer and closer to throwing it all out to start fresh in a new ecosystem.

Thankfully Anthropic has come out swinging today, and my own SOPs can remain intact a little while longer.

  • bavell 26 minutes ago

    Same boat and same thoughts here! Hope it holds its own against the competition, I've become a bit of a fan of Anthropic and their focus on devs.

  • tordrt 31 minutes ago

    I tried Codex due to the same reasoning you list. The grass is not greener on the other side... I usually only opt for Codex when my Claude Code rate limit hits.

  • edf13 11 minutes ago

    I threw a few hours at Codex the other day and was incredibly disappointed with the outcome…

    I'm a heavy Claude Code user and similar workloads just didn't work out well for me on Codex.

    One of the areas I think is going to make a big difference to any model soon is speed. We can build error-correcting systems into the tools, but the base models need more speed (and, with that, obviously lower costs).

  • wahnfrieden 24 minutes ago

    You need much less of a robust set of habits, commands, and sub-agent complexity with Codex. It's not only that it lacks some of these features; it also doesn't need them as much.

futureshock 12 minutes ago

A really great way to get an idea of the relative cost and performance of these models at their various thinking budgets is to look at the ARC-AGI-2 leaderboard. Opus 4.5 stacks up very well here when you compare to Gemini 3's score and cost. Gemini 3 Deep Think is still the current leader, but at more than 30x the cost.

The cost curve of achieving these scores is coming down rapidly. In Dec 2024, when OpenAI announced beating human performance on ARC-AGI-1, they spent more than $3k per task. Now you can get the same performance for pennies to dollars per task, approximately an 80x reduction in 11 months.

https://arcprize.org/leaderboard

https://arcprize.org/blog/oai-o3-pub-breakthrough

jasonthorsness 28 minutes ago

I used Gemini instead of my usual Claude for a non-trivial front-end project [1] and it really just hit it out of the park especially after the update last week, no trouble just directly emitting around 95% of the application. Now Claude is back! The pace of releases and competition seems to be heating up more lately, and there is absolutely no switching cost. It's going to be interesting to see if and how the frontier model vendors create a moat or if the coding CLIs/models will forever remain a commodity.

[1] https://github.com/jasonthorsness/tree-dangler

  • hu3 15 minutes ago

    Gemini is indeed great for frontend HTML + CSS and even some light DOM manipulation in JS.

    I have been using Gemini 2.5 and now 3 for frontend mockups.

    When I'm happy with the result, after some prompt massage, I feed it to Sonnet 4.5 to build full stack code using the framework of the application.

jumploops an hour ago

> Pricing is now $5/$25 per million tokens

So it’s 1/3 the price of Opus 4.1…

> [..] matches Sonnet 4.5’s best score on SWE-bench Verified, but uses 76% fewer output tokens

…and potentially uses a lot fewer tokens?

Excited to stress test this in Claude Code, looks like a great model on paper!

  • alach11 an hour ago

    This is the biggest news of the announcement. Prior Opus models were strong, but the cost was a big limiter of usage. This price point still makes it a "premium" option, but isn't prohibitive.

    Also increasingly it's becoming important to look at token usage rather than just token cost. They say Opus 4.5 (with high reasoning) used 50% fewer tokens than Sonnet 4.5. So you get a higher score on SWE-bench verified, you pay more per token, but you use fewer tokens and overall pay less!
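
    Back-of-envelope with illustrative numbers, taking Sonnet 4.5's list price of $3/$15 per MTok for comparison:

      Sonnet 4.5: 1.0M output tokens x $15/MTok = $15.00
      Opus 4.5:   0.5M output tokens x $25/MTok = $12.50

    So even at the higher per-token rate, the Opus run comes out cheaper on the same task.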

  • jmkni an hour ago

    > Pricing is now $5/$25 per million tokens

    For anyone else confused, it's input/output tokens

    $5 for 1 million tokens in, $25 for 1 million tokens out

    • mvdtnz 15 minutes ago

      What prevents these jokers from making their outputs ludicrously verbose to squeeze more out of you, given they charge 5x more for the end that they control? Already model outputs are overly verbose, and I can see this getting worse as they try to squeeze some margin. Especially given that many of the tools conveniently hide most of the output.

stavros an hour ago

Did anyone else notice Sonnet 4.5 being much dumber recently? I tried it today and it was really struggling with some very simple CSS on a 100-line self-contained HTML page. This never used to happen before, and now I'm wondering if this release has something to do with it.

On-topic, I love the fact that Opus is now three times cheaper. I hope it's available in Claude Code with the Pro subscription.

EDIT: Apparently it's not available in Claude Code with the Pro subscription, but you can add funds to your Claude wallet and use Opus with pay-as-you-go. This will make it really nice to use Opus for planning and Sonnet for implementation with the Pro subscription.

However, I noticed that the previously available option of "use Opus for planning and Sonnet for implementation" is no longer there in Claude Code with this setup. Hopefully they'll implement it soon, as that would be the best of both worlds.

EDIT 2: Apparently you can use `/model opusplan` to get Opus in planning mode. However, it says "Uses your extra balance", and it's not clear whether it means it uses the balance just in planning mode, or also in execution mode. I don't want it to use my balance when I've got a subscription; I'll have to try it and see.

  • vunderba 36 minutes ago

    Anecdotally, I kind of compare the quality of Sonnet 4.5 to that of a chess engine: it performs better when given more time to search deeper into the tree of possible moves (more plies). So when Anthropic is under peak load, I think some degradation is to be expected. I just wish Claude Code had a "Signal Peak" so that I could schedule more challenging tasks for a time when it's not under high demand.

  • pton_xd 4 minutes ago

    I hate jumping in on these model-nerf conspiracy bandwagons but I have to admit, the amount of "wait, that's wrong!" interjections from Sonnet 4.5 recently has been noticeable. And then it still won't get the answer right after backtracking. Very weird.

  • kjgkjhfkjf an hour ago

    My guess is that Claude's "bad days" are due to the service becoming overloaded and failing over to use cheaper models.

  • beydogan 16 minutes ago

    100% dumber, especially since last 3-4 days. I have two guesses:

    - They make it dumber close to a new release to hype the new model

    - They gave $1000 of Claude Code Web credits to a lot of people, which increased the load a lot, so they had to serve a quantized version to handle it.

    I love Claude models, but I hate this lack of transparency and instability.

  • bryanlarsen an hour ago

    On Friday my Claude was particularly stupid. It's sometimes stupid, but I've never seen it be that consistently stupid. Just assumed it was a fluke, but maybe something was changing.

saaaaaam 8 minutes ago

Anecdotally, I’ve been using opus 4.5 today via the chat interface to review several large and complex interdependent documents, fillet bits out of them and build a report. It’s very very good at this, and much better than opus 4.1. I actually didn’t realise that I was using opus 4.5 until I saw this thread.

GenerWork 17 minutes ago

I wonder what this means for UX designers like myself who would love to take a screen from Figma and turn it into code with just a single call to the MCP. I've found that Gemini 3 in Figma Make works very well at one-shotting a page when it actually works (there's a lot of issues with it actually working, sadly), so hopefully Opus 4.5 is even better.

keeeba 43 minutes ago

Oh boy, if the benchmarks are this good and Opus feels like it usually does then this is insane.

I’ve always found Opus significantly better than the benchmarks suggested.

LFG

whitepoplar 9 minutes ago

Does the reduced price mean increased usage limits on Claude Code (with a Max subscription)?

elvin_d an hour ago

Great to see the price reduction. Opus was historically priced at $15/$75; this one delivers at $5/$25, which is close to Gemini 3 Pro. I hope Anthropic can afford to increase limits for the new Opus.

alvis 33 minutes ago

“For Max and Team Premium users, we’ve increased overall usage limits, meaning you’ll have roughly the same number of Opus tokens as you previously had with Sonnet.” — seems like anthropic has finally listened!

jedberg 17 minutes ago

Up until today, the general advice was use Opus for deep research, use Haiku for everything else. Given the reduction in cost here, does that rule of thumb no longer apply?

andai 38 minutes ago

Why do they always cut off 70% of the y-axis? Sure it exaggerates the differences, but... it exaggerates the differences.

And they left Haiku out of most of the comparisons! That's the most interesting model for me. Because for some tasks it's fine. And it's still not clear to me which ones those are.

Because in my experience, Haiku sits at this weird middle point where, if you have a well defined task, you can use a smaller/faster/cheaper model than Haiku, and if you don't, then you need to reach for a bigger/slower/costlier model than Haiku.

aliljet 43 minutes ago

The real question I have after seeing the usage rug being pulled is what this costs and how usable this ACTUALLY is with a Claude Max 20x subscription. In practice, Opus is basically unusable by anyone paying enterprise-prices. And the modification of "usage" quotas has made the platform fundamentally unstable, and honestly, it left me personally feeling like I was cheated by Anthropic...

viraptor an hour ago

Has there been any announcement of a new programming benchmark? SWE looks like it's close to saturation already. At this point for SWE it may be more interesting to start looking at which types of issues consistently fail/work between model families.

chaosprint 38 minutes ago

SWE's results were actually very close, but they used a poor marketing visualization. I know this isn't a research paper, but for Anthropic, I expect more.

system2 3 minutes ago

How nice, my Claude Code will reach the weekly limit in one query now instead of two. It is alright, all I spend is $200 a month.

alvis an hour ago

What surprises me is that Opus 4.5 lost on all the reasoning scores to Gemini and GPT. I thought that's the area where the model would shine the most.

cyrusradfar 20 minutes ago

I'm curious if others are finding that there's a comfort in staying within the Claude ecosystem because when it makes a mistake, we get used to spotting the pattern. I'm finding that when I try new models, their "stupid" moments are more surprising and infuriating.

Given this tech is new, the experience of how we relate to their mistakes is something I think a bit about.

Am I alone here, or are others finding themselves more forgiving of "their preferred" model provider?

rishabhaiover an hour ago

Is this available on claude-code?

  • elvin_d an hour ago

    Yes, the first run was nice - feels faster than 4.1 and did what Sonnet 4.5 struggled to execute properly.

  • greenavocado an hour ago

    What are you thinking of trying to use it for? It is generally a huge waste of money to unleash Opus on high content tasks ime

    • rishabhaiover an hour ago

      I use claude-code extensively to plan and study for my college using the Socratic learning mode. It's a great way for me to learn. I wanted to test the new model's capabilities on that front.

    • flutas an hour ago

      My workflow has always been opus for planning, sonnet for actual work.

  • rishabhaiover an hour ago

    damn, I need a MAX sub for this.

    • stavros an hour ago

      You don't; you can add $5 or whatever to your Claude wallet with the Pro subscription and use those for Opus.

      • rishabhaiover 17 minutes ago

        I ain’t paying a penny more than the $20 I already do. I got cracks in my boots, brother.

GodelNumbering an hour ago

The fact that the post singled out SWE-bench at the top makes the opposite impression that they probably intended.

  • grantpitt an hour ago

    do say more

    • GodelNumbering 42 minutes ago

      Makes it sound like a one-trick pony

      • jascha_eng 26 minutes ago

        Anthropic is leaning into agentic coding, and heavily so. It makes sense to use SWE-bench Verified as their main benchmark. It is also the one benchmark where Google did not get the top spot last week. Claude remains king; that's all that matters here.

      • grantpitt 34 minutes ago

        well, it's a big trick

0x79de 30 minutes ago

this is quite good

zb3 38 minutes ago

The first chart is straight from "how to lie in charts"...