SAN FRANCISCO, April 2 — OpenAI’s GPT-5 release in February reclaimed the top spot on most published benchmarks, and the company’s launch event was not subtle about which competitor it was pulling the crown back from. Anthropic, for its part, shipped Claude Opus 4.7 in March with a more measured release: better long-context retention, lower hallucination rates, and a set of capability improvements that did not produce a marketing-friendly benchmark headline.

Six weeks of side-by-side daily use suggests the headline benchmarks are, again, telling a story that does not match the working experience of the people actually using these tools.

What the benchmarks say

GPT-5 leads on roughly three-quarters of the published benchmarks in the Stanford HAI 2026 report. The leads are largest on math and scientific reasoning (MATH, GPQA), code generation (HumanEval, MBPP at the larger problem sizes), and the structured-reasoning suites that have become the category’s de facto headline metrics. Where GPT-5 does not lead (long-context retrieval, summarization fidelity, multi-turn instruction-following on ambiguous prompts), Claude Opus 4.7 leads by margins that are smaller in absolute terms but consistent across the relevant benchmark families.

If you stop at the headline numbers, the conclusion is that GPT-5 is the better model. But the headline numbers are, increasingly, not where the interesting differentiation between frontier models lives.

What six weeks of daily use says

Consumer Tech Wire’s editorial and research teams have been running both models in parallel since GPT-5 launched, on the kind of working tasks that consume the bulk of professional knowledge-worker time: writing, code review, structured reasoning over long documents, multi-step tool use, and the iterative back-and-forth that characterizes serious analytical work.

On coding, GPT-5 is the faster model and produces output that is, on first pass, slightly more polished. It is also the model more likely to produce confidently wrong API calls when working in less-mainstream library territory, and the model more likely to need a second prompt to correct subtle errors that Opus 4.7 catches on the first pass. For greenfield coding in well-trodden language ecosystems, GPT-5 is the better default. For code review, refactoring, and work outside the mainstream libraries, Opus 4.7 is. The difference is not large, but it is consistent.

On long-form writing, the gap reverses and widens. Opus 4.7 produces prose with markedly better paragraph-level coherence, more faithful adherence to specified voice and register, and substantially fewer of the small hallucinations (invented citations, slightly wrong dates, plausibly but incorrectly attributed quotes) that have characterized GPT-class models since GPT-4. In our experience, anyone who has used both models on serious long-form work arrives at the same observation: GPT-5 writes faster, and Opus 4.7 writes better.

On structured reasoning over long documents, Opus 4.7 leads by a margin that the headline benchmarks substantially understate. The 1M-token context window matters less than the model’s ability to retain coherent reasoning chains across that window, and on the document-analysis tasks our team uses every day, Opus 4.7 produces measurably more reliable output. GPT-5 is competitive at shorter context lengths and falls off faster as documents lengthen.

On tool use and agentic workflows, the comparison is closer and depends heavily on which tooling stack is being used. GPT-5 is better integrated with OpenAI’s native tooling. Opus 4.7 is better integrated with the broader open-protocol tooling ecosystem and with Anthropic’s MCP framework, which has become the de facto standard for non-OpenAI tool integration. For users working primarily within OpenAI’s stack, GPT-5 is the right choice. For users working across heterogeneous tool environments, Opus 4.7 is.

The pricing context

Pricing is not symmetric. GPT-5 at OpenAI’s headline tier is more expensive than Opus 4.7 on per-token API pricing, though the consumer subscriptions sit at deliberate parity: ChatGPT Plus and Claude Pro are both $20 per month. The relevant comparison for most consumer users is therefore the subscription tier, where the two products are priced identically and the choice is genuinely about capability rather than price.

For API and enterprise users, the math is more complicated and depends heavily on workload. High-volume coding workloads favor GPT-5 on speed-per-dollar. High-volume document-analysis workloads favor Opus 4.7 on accuracy-per-dollar. The right answer for most enterprise teams in 2026 is to maintain access to both and route by workload, which is what the more sophisticated AI-tooling teams have been doing since 2024.
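To make the route-by-workload idea concrete, here is a minimal sketch in Python. The model labels, task categories, thresholds, and the pick_model helper are illustrative assumptions on our part, not either vendor’s actual API, pricing, or routing logic.

# Illustrative sketch of routing by workload. Model names, task kinds,
# and thresholds are hypothetical placeholders, not real vendor APIs.
from dataclasses import dataclass

@dataclass
class Task:
    kind: str           # e.g. "code_gen", "code_review", "doc_analysis", "writing"
    context_tokens: int  # rough size of the input context

def pick_model(task: Task) -> str:
    # High-volume, mainstream code generation: favor speed-per-dollar.
    if task.kind == "code_gen" and task.context_tokens < 50_000:
        return "gpt-5"
    # Code review, document analysis, and writing: favor accuracy and
    # fewer hallucinations over raw throughput.
    if task.kind in {"code_review", "doc_analysis", "writing"}:
        return "opus-4.7"
    # Very long contexts default to the model that degrades more slowly.
    if task.context_tokens > 200_000:
        return "opus-4.7"
    return "gpt-5"

if __name__ == "__main__":
    for task in [Task("code_gen", 4_000),
                 Task("doc_analysis", 300_000),
                 Task("writing", 12_000)]:
        print(task.kind, "->", pick_model(task))

The point is not the specific thresholds, which any team would tune to its own workloads, but the shape of the decision: route on task type and context length rather than on a single “best model” verdict.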

The Consumer Tech Wire view

The headline-benchmark story — GPT-5 retook the lead — is technically correct and substantively misleading. The frontier-model gap between OpenAI and Anthropic in 2026 is no longer a meaningful gap on any axis the typical user can easily observe. It is a difference of strengths, and the right model depends almost entirely on the working profile of the individual user.

For coding-heavy users in mainstream ecosystems: GPT-5. For long-form writing, document analysis, and any task where hallucination cost is high: Opus 4.7. For users who genuinely want a single chatbot for daily knowledge work and are not optimizing for any specific axis: Opus 4.7, on the basis of fewer surprises in routine use. For users who want maximum throughput on tasks where speed matters more than nuance: GPT-5.

The data suggests that the binary “which is better” question is the wrong frame for 2026. We will run the comparison again at the next major release from either lab, and we expect the working-experience picture to keep diverging from the published-benchmark one.


Tomas Whitfield-Asari reported from San Francisco. This analysis reflects the views of its named author and Consumer Tech Wire’s editorial board.