AI ‘Lawyers’ Score Only 30% – But That’s Not the Number You Should Worry About

February 6, 2026
5 min read

Anthropic’s latest model, Opus 4.6, just jumped to the top of a legal-task benchmark – and it still fails roughly seven tasks out of ten on the first try. On paper, that sounds reassuring for human lawyers. In practice, the trajectory is the real story: from under 20% to nearly 30% in a few months, with multi‑attempt performance around 45%. According to TechCrunch’s coverage of Mercor’s agent benchmark, AI systems are learning to behave less like chatbots and more like junior associates. This piece looks at who should be nervous, how regulation will respond, and what “30%” really means for the future of legal work.


The news in brief

According to TechCrunch, startup Mercor has been running a benchmark that tests AI agents on professional, multi‑step tasks, including legal and corporate analysis. When the benchmark was first publicized, models from every major lab scored below 25%, suggesting that even advanced systems struggled with complex, procedure‑heavy work.

In early February, Anthropic released a new model, Opus 4.6, with expanded “agentic” capabilities such as coordinated “agent swarms.” On Mercor’s benchmark, this model reportedly reached just under 30% accuracy in one‑shot runs, and around 45% when allowed several attempts at a task. Mercor’s CEO described the jump from 18.4% to 29.8% in a few months as unusually fast progress.

TechCrunch emphasizes that these numbers are still far from human‑level performance, so AI is not about to replace lawyers this quarter. But the rapid improvement undermines last month’s comfortable narrative that professional services are largely insulated from near‑term automation.


Why this matters

The obvious headline is that AI “lawyers” are still bad at their jobs. A system that gets 70% of real client matters wrong would be catastrophic. But benchmarks like Mercor’s aren’t meant to be deployed as‑is; they measure slope, not destination. The key message is not 30%, but how quickly we got there once models were given more agent‑like tools.

The winners in the short term are clear:

  • Big law firms and legal ops teams that adopt AI as an internal co‑pilot gain leverage. A tool that correctly handles roughly 30% of routine drafting and review on the first pass (and closer to 45% with retries) can meaningfully reduce hours billed for low‑margin tasks.
  • Model providers and legaltech startups get a compelling story for investors: this is no longer pure research, but measurable progress on revenue‑generating work.

The likely losers are also obvious:

  • Junior lawyers and paralegals whose job is mainly document review, templated drafting, and research. Even a flawed agent that can crank out a first draft in minutes makes a dent in traditional apprenticeship models.
  • Small firms that ignore AI. When large practices standardise AI‑supported workflows, price expectations for basic services will shift across the market.

The more subtle risk is overconfidence. A 30–45% benchmark score can feel “almost there” and tempt firms to push AI into client‑facing roles before safety, supervision, and liability frameworks are ready. The danger is not that AI will replace lawyers tomorrow, but that poorly governed AI will act like a lawyer today – and someone else will pay for its mistakes.


The bigger picture

Mercor’s results drop into a wider pattern: foundation models are steadily morphing from autocomplete engines into goal‑seeking agents. OpenAI’s “reasoning” models, Google’s increasingly agent‑focused Gemini stack, and Anthropic’s new “agent swarms” are all variations on the same theme – chaining steps, using tools, and persisting state to tackle long, structured tasks.

We have seen similar jumps before. In 2023–24, models suddenly went from struggling with bar‑exam‑style questions to comfortably passing professional tests. That didn’t turn GPT‑4 into a competent trial lawyer, but it did signal that threshold‑crossing events can happen very quickly once architectures and training regimes change.

Legal work is especially exposed because it’s both symbolic and textual. Contracts, statutes, case law and corporate policies are all language. Once models can reliably follow procedures, invoke tools (e.g., databases, citation checkers), and coordinate sub‑agents, large chunks of legal workflows become programmable.
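
To make “programmable” concrete, here is a deliberately minimal sketch of what one such agent step could look like: a procedure with stubbed tools. Everything in it (the tool names, the `review_contract` procedure, the checks) is a hypothetical illustration invented for this article, not Mercor’s benchmark, Anthropic’s API, or any real legaltech product.

```python
# Illustrative sketch only: a toy "legal agent" step that follows an
# explicit procedure and invokes tools. All names and logic here are
# hypothetical, invented for this article.

def search_clauses(contract: str, keyword: str) -> list[str]:
    """Stub tool: return sentences that mention a keyword."""
    return [c.strip() for c in contract.split(".") if keyword.lower() in c.lower()]

def check_citation(reference: str) -> bool:
    """Stub tool: pretend to validate that a reference names a numbered provision."""
    return any(ch.isdigit() for ch in reference)

def review_contract(contract: str) -> dict:
    """A review procedure encoded as explicit, auditable steps."""
    findings = {}
    # Step 1: locate clauses relevant to liability.
    findings["liability_clauses"] = search_clauses(contract, "liability")
    # Step 2: collect candidate citations, then verify each one with a tool.
    citations = [w for w in contract.split() if w.startswith(("Art.", "§"))]
    findings["invalid_citations"] = [c for c in citations if not check_citation(c)]
    # Step 3: escalation to a human is itself a step in the procedure.
    findings["needs_human_review"] = bool(findings["invalid_citations"])
    return findings

if __name__ == "__main__":
    sample = "Liability is capped per Art.74. Termination follows §9."
    print(review_contract(sample))
```

The toy logic is beside the point; the shape is what matters. Each step is explicit, each tool call is loggable, and the hand‑off to a human is written into the procedure itself, precisely the properties auditors and regulators will ask agentic legal systems to demonstrate.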

At the same time, the comparison with past waves of legal tech is instructive. E‑discovery and digital research didn’t eliminate lawyers; they compressed entire departments of paper‑pushers into smaller, more technical teams. The firms that benefited were those that re‑organised around the technology rather than treating it as a bolt‑on.

Anthropic’s benchmark win is therefore less about model bragging rights and more about signalling: the next competitive frontier isn’t raw IQ, it’s orchestration. Whoever builds the most reliable, auditable, and regulator‑friendly legal agents will shape how the profession evolves.


The European and regional angle

For European practitioners, the headline is not “AI agents can be lawyers,” but “Brussels will want a word.” Under the EU AI Act, systems used in legal decision‑making and rights‑affecting procedures are squarely in high‑risk territory. Any serious deployment of agentic models in law firms, courts, or public administration will trigger stringent requirements for documentation, human oversight, and robustness testing.

Combine this with GDPR and you get a tough compliance puzzle: client data is often deeply sensitive, cross‑border, and long‑lived. Sending that data to cloud‑hosted US models – even with good intentions – reopens old debates about data transfers and confidentiality. European corporate counsel will not green‑light large‑scale AI assistance without clear answers on where prompts and outputs are stored, who can access them, and how long they persist.

At the same time, there is a genuine opportunity for EU‑born legaltech. From Berlin to Ljubljana and Zagreb, startups are already building document‑automation and contract‑lifecycle tools; bolting trustworthy agents on top of that stack is an obvious next step. European players that bake compliance and explainability into their products from day one could become preferred partners for highly regulated sectors where US platforms are perceived as too cavalier with data.

For courts and public agencies, the lesson is simple: AI agents are coming to the legal domain whether institutions are ready or not. Waiting for “mature” technology is no longer an option; piloting, sandboxing and standard‑setting need to start now.


Looking ahead

If capabilities can jump from roughly 18% to almost 30% in a few months, it is reasonable to expect another doubling on similar benchmarks over the next 12–24 months – especially as agents gain better tools, memory, and domain‑specific training. That doesn’t guarantee real‑world reliability, but it does suggest that by the end of the decade, “AI paralegals” handling narrow, supervised tasks will be routine.

Expect three developments:

  1. Hybrid workflows become default. Contracts, due‑diligence reports, and internal memos will be drafted by AI and heavily edited by humans, not the other way around.
  2. Regulators and bar associations get specific. Vague ethical guidelines will harden into concrete rules about disclosure, supervision, and record‑keeping when AI is used on client matters.
  3. Insurance and procurement start to bite. Malpractice insurers and corporate clients will increasingly ask not if you use AI, but how – and will price risk accordingly.

What remains unclear is market structure. Will a handful of US foundation‑model providers dominate legal AI, with European firms acting as thin wrappers? Or will regional players leverage local language expertise, on‑prem deployments, and compliance to carve out defensible niches? The answer will depend as much on regulation and procurement choices as on raw model performance.


The bottom line

Anthropic’s performance on Mercor’s benchmark doesn’t prove that AI can replace lawyers, but it does show that comfortable assumptions about professional immunity from automation are expiring quickly. The legal sector now faces a choice: proactively redesign work around AI agents with strong safeguards, or wait for change to be imposed by clients, competitors and regulators. The real question for lawyers is no longer “Can AI do my job?” but “Which parts of my job do I want to own when AI becomes a competent colleague?”
