Nick Oak blog

Oh man evals are hard!

I feel like a proper PhD student tinkering on them (evals)

Crazy idea leads to a crazy writeup

These are live battle notes from me being quixotic with benchmarks!

GIVEN:

Previously I built a “knowledge base” pipeline for a regulations-heavy factory: “an AI archivarius over company documents that can reliably retrieve, cross-reference and render answers that can be trusted”. The initiative aimed at lowering the recall time needed to locate necessary information, and at bringing a bit of a wow effect via the emergent cross-document reasoning capabilities.

The pipeline was built in, quite frankly, a funny way: with Claude Code in the loop. Because why not: it is a pilot, it is booming and zooming, and the client is happy.

On the pipeline itself - it is an index derived from the documents’ structure (standards, processes, roles, amendments, etc.) plus custom tools aimed at searching by entity (role, standard, etc.) or raw BM25 over chunks, returning Claude Code-specific Read-tool offsets. Ah, and proper OCR fuckery in order to get everything as text and then build the index over markdown.
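For illustration only - a minimal pure-Python sketch of what a “raw BM25 over chunks” tool could look like. All names here (`ChunkIndex`, `search`, the chunk tuple format) are my assumptions, not the production code; the real tool returns Read-tool offsets instead of raw text, so the agent itself decides which spans to actually open.

```python
import math
import re
from collections import Counter

K1, B = 1.5, 0.75  # standard Okapi BM25 parameters

def tokenize(text):
    return re.findall(r"\w+", text.lower())

class ChunkIndex:
    def __init__(self, chunks):
        # chunks: list of (doc_path, line_offset, chunk_text) over the
        # OCR'd markdown corpus -- a hypothetical format for this sketch
        self.chunks = chunks
        self.docs = [tokenize(c[2]) for c in chunks]
        self.avgdl = sum(map(len, self.docs)) / len(self.docs)
        self.df = Counter(t for d in self.docs for t in set(d))
        self.n = len(self.docs)

    def _score(self, query_tokens, idx):
        doc = self.docs[idx]
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (self.n - self.df[t] + 0.5) / (self.df[t] + 0.5))
            norm = tf[t] + K1 * (1 - B + B * len(doc) / self.avgdl)
            score += idf * tf[t] * (K1 + 1) / norm
        return score

    def search(self, query, top_k=3):
        q = tokenize(query)
        ranked = sorted(range(self.n), key=lambda i: self._score(q, i), reverse=True)
        # Return Read-tool-style pointers (path, offset), not text:
        # the agent reads the exact span itself
        return [(self.chunks[i][0], self.chunks[i][1]) for i in ranked[:top_k]]
```

The design choice worth noting: returning offsets instead of chunk text keeps the agent in control of how much context it pulls in, which matters a lot for small local models.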

Booming and zooming, until you get a call from an adjacent client: “we want the same pipeline but fully local - brief us on its feasibility and on hardware prices”.

How do you even start solving this problem?

My initial thinking:

  1. The hardware spec has to be defined by the capability of the models we will run there.
    1. 3h of reading about Apple unified memory, with its quite adequate speed vs NVIDIA for slightly different pipelines, and the VRAM situation.
    2. Trying to determine the optimal memory size for a Mac Studio and evaluate whether the whole local-models idea is feasible after all.
  2. The bottleneck is the model doing the core agentic loop - the one searching the documents with custom tools and deciding whether it has enough to report back with an answer.
  3. Let’s build a benchmark for it!

This is exactly where the rabbit hole has started. Benchmarks. And oh man they are hard.

How do you properly measure the quality of an agentic pipeline? In order to decide that Qwen-30B-Coder will NOT be enough, but something like a 120B model will do the job? What if the documents are multi-language? What if some documents are referenced but not yet present in the system?

Proper over-engineering

So, as a proper engineer, I started over-engineering The Benchmark.

That was an evening after a long day - hence plenty of wrong decisions and assumptions. benchmark_v1 was an attempt to get a somewhat reasonable score for a 30B model. The problem was that I only had “Claude Code based inference”. So first I tried to patch the model there to an OpenRouter one; failed. Then I made an OpenCode-based implementation of the agent - tested with Gemini 3 Flash - worked OK. Then I created benchmark_v1 on a set of indexed “gold QnA” pairs and LLM judges. Moved to benchmarking: 6/20 for the 30B model. Shit.

Decided to build my own agentic harness where only my tools would be present, to avoid overloading the tiny mental capabilities of a 30B model. 8/20 for the 30B model. BUT! 12.5/20 for Gemini 3 Flash, which is ~SOTA, which is even more fishy. Almost falling asleep at 4am, I called it a day.

With a fresher context of mine, I actually performed much worse on benchmark_v2. I went with the option of comparing apples to apples - i.e. custom harness with model A vs custom harness with model B. Polished it a bit. And maxed out the complexity for benchmark_v2: fact extraction, classification and scoring, tracking of tool calls, tracking of tokens needed to answer, yada, yada, yada. Oh man, it was beautiful. None of it worked. This iteration of the benchmark yet again converged to a state where I couldn’t reliably trust its scores.

Meta thinking on why it has failed

So, as I was drilling into the failure reasons behind those benchmarks before starting yet another iteration, I came to the understanding that the ultimate failure reason was “too many moving parts”. I even had a nightmare about it! One where “amorphous details” were lying around my mind and multiplying at a mental touch (like in the Lestrange vault at Gringotts in Harry Potter). And that multiplicative complexity was causing almost physical pain, so I was sleep-talking about simplifying the pipeline.

The setups I have built were actually quite similar to a Star Trek moment: “The notion of transwarp beaming is like trying to hit a bullet with a smaller bullet, whilst wearing a blindfold, riding a horse.”

The moving parts were:

  1. 3 different implementations of the agent (the production one, OpenCode and custom)
  2. 2 different document corpora with slightly different tools (quality control and HR docs, with different index logic behind them)
  3. the question-answering pipeline (a simple fella)
  4. multi-step eval frameworks
  5. different prompts and guidelines for the models
  6. golden QnA pairs - some sourced by the client, some augmented, some generated; properly indexed per my best over-engineered practices (this becomes important later on)

So the main narrative of my pre-sleep thinking was: “What is it that makes Claude Code that efficient? (Claude Code, because it is my prod agent.) Is it the model? Is it the harness/system prompt? What is it? If I use Opus 4.5 but a custom focused harness - will it be better or worse? If I use a Claude Code-like harness with a different model - will it be different?” - proper meta thinking. Especially painful when you don’t have a week to properly PhD-style test it all and calm down the research itch.

Oh man benchmarks are hard.

Aha moment

After proper back and forth with the problem, and one more benchmark failure with a SOTA model, I decided to zoom out and change approach.

Why don’t we evaluate agent using…. Well, another agent. Vague idea of:

  1. We use Claude Code as the eval engine
  2. We run several evals - one Claude Code subagent per eval
  3. We tweak prompts and wording until we have a gut feeling it’s alright
  4. We analyze the “traces” in order to condense the procedure into somewhat deterministic steps

So it was literally this ship-it-ish prompt to Claude, consisting of:

We are working on the [REDACTED] - is a knowledge assistant that fetches the needed documents and reasons on a top of them; docs are: [REDACTED] and [REDACTED]

This session we will play an interesting mental game on how would you, claude code & opus 4.5 personally solve the problem of evaluation; Assume you have been given the question + answer of SOME pipeline AND you have access to tools that that pipelines (agents mostly) had; you will need to verify their job;

(1) you spin up opus 4.5 subagent in order to onboard you towards what’s happening there; (2) you spin up opus 4.5 then in order to reason from first principles in order to come up with a plan how claude code can EVALUATE using subagents, tools and its powerful harness & model combo;

let’s go!

That later converged to some form of a plan. Which, in turn, converged to a beautiful mental model:

First Principles

The equation: Answer = f(User Question, Harness over Documents)

To evaluate (no gold needed): Verdict = Claude Code(User Question, Harness Description, Answer)

Key insight: Claude Code IS the evaluator. No Python spawning scripts. No gold QnA database. Claude Code:

  1. Reads the harness description (what tools/prompts/corpus the agent had)
  2. Has access to the same corpus to verify claims
  3. Judges whether the answer correctly addresses the question
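The equation above can be sketched as a data flow. This is a hypothetical shape, not the actual implementation - in the setup described here the judge role is played by Claude Code and its subagents, not a Python callable - but it shows what “no gold needed” means: the evaluator receives only the question, a description of the harness, and the answer under test.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    question: str
    harness_description: str  # tools/prompts/corpus the evaluated agent had
    answer: str               # the pipeline's output under test

@dataclass
class Verdict:
    passed: bool
    rationale: str

def evaluate(case: EvalCase, judge: Callable[[str], str]) -> Verdict:
    # The judge gets everything the evaluated agent had, plus the answer,
    # and must verify the claims against the same corpus -- no gold QnA pairs.
    prompt = (
        f"Harness: {case.harness_description}\n"
        f"Question: {case.question}\n"
        f"Answer under test: {case.answer}\n"
        "Verify the claims against the corpus. Reply PASS or FAIL, then why."
    )
    reply = judge(prompt)
    return Verdict(passed=reply.strip().upper().startswith("PASS"),
                   rationale=reply)
```

The point of writing it down this way: the gold QnA database disappears from the signature entirely; the corpus itself is the ground truth the judge checks against.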

And the physical implementation (well, digital):

  1. CLAUDE.md with meta instructions about the task: what the game is, where to look (progressive disclosure over the other files) and what to do.
  2. Harness meta descriptions with the method of HOW the agent was run, plus an “inventory” - its tools and prompts.
  3. Evaluation instructions - based on several deep drills in chat and the initial spec from the client.
  4. A log of evaluations.

Always scrutinize your data

At the same time, while working on it, I remembered the mantra of any AI-related work: “always scrutinize your data” - learned through painful mistakes while building autopilots and road systems back in 2018. This mantra converged into yet another version (3) of the benchmark. Never have I used “first principles” so many times in my prompts. But the data-driven approach has been slowly but painfully paying off as I watched the different failure modes of different models over different models with different prompts (O(N^4)). A proper AI researcher would definitely cry seeing me run these “polluted pipelines”, but I needed speed. And I used the failed questions to build a good-enough prompt for the focused (custom) version of the agent, in order to later properly test the models I wanted to test after all. Proper engineering chaos per se.

In parallel, Claude Code has been chewing through the same QnA pairs I had handy for V3 of the benchmark - to create a critical mass of data, in order to hopefully converge to something.

I am writing this while Claude does things. The Claude-Code evaluation has converged to the same idea of “data scrutinization”: I corrected its evaluations over several questions in order to solidify my implicit understanding of how it should work into precise guidelines for Claude.

The next hypothesized step down the line: once a critical mass of questions evaluated by Claude Code is done, spin up an analytics pipeline over the traces of the subagents that did the evaluating. Check their tools, methods, etc. Then loop the knowledge back into a deterministic evaluation pipeline with proper metrics and scores.
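A minimal sketch of what that analytics over subagent traces could look like. The trace schema here is entirely assumed for illustration (it is not the real Claude Code trace format); the idea is just to aggregate which tools the evaluators reached for and how verdicts distribute, as raw material for distilling deterministic steps.

```python
from collections import Counter

def summarize_traces(traces):
    # traces: list of dicts like
    #   {"tool_calls": ["grep", "read"], "verdict": "pass"}
    # -- a hypothetical schema for this sketch
    tool_usage = Counter()
    verdicts = Counter()
    for trace in traces:
        tool_usage.update(trace["tool_calls"])
        verdicts[trace["verdict"]] += 1
    return {"tool_usage": dict(tool_usage), "verdicts": dict(verdicts)}
```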

So, kind of merging the two streams together and distilling the Claude Code methods into a somewhat reasonable pipeline.

Other way around

So my initial hypothesis - that claude-code-as-eval-engine would yield insights - slightly pivoted into “careful cross-pollination” between the two benchmarks: QnA pairs that fail in one place but pass in the other, and vice versa. Pairs that failed in both were considered as well. This approach allowed me to solidify the instructions, first in the form of Claude instructions and later imposed as deterministic logic in benchmark_v3.

Oh man, it was painful. I went all the way to prompt_v7 to make the focused agent match the performance of claude-code-in-a-loop-for-production, to v8 to slightly beat it, and then returned to v5 because it performed better across several SOTA models (while v8 gave Sonnet 4.5 a notable advantage).

The idea that worked here was not exactly the initial “scaffold and dissect”, but rather a “selective cross transfer learning” type of thing. The meta approach gave me ideas I implemented in the deterministic bench, which in turn gave the rigor and precision I looped back into the instructions for the meta module. So the traces didn’t exactly prove useful - rather the meta-concept as a whole did. Though for later versions of the benchmark, the traces will definitely be scrutinized.

THINK-UPD: after running this pipeline for some time, another important use case emerged - the meta-benchmark can drill into the questions themselves.

Benchmark results:

Best Result Per Model (Prompt v5)
┌──────┬──────────────────┬─────────────┬────────────────┬───────────────────┬───────────────┐
│ Rank │      Model       │  Pass Rate  │ M1 (Retrieval) │ M2 (Completeness) │ M3 (Citation) │
├──────┼──────────────────┼─────────────┼────────────────┼───────────────────┼───────────────┤
│ 1    │ Grok 4 Fast      │ 80% (16/20) │ 95%            │ 53%               │ 100%          │
│ 2    │ Gemini 3 Flash   │ 80% (8/10)  │ 95%            │ 54%               │ 100%          │
│ 3    │ Claude Sonnet 4  │ 70% (14/20) │ 100%           │ 48%               │ 100%          │
├──────┼──────────────────┼─────────────┼────────────────┼───────────────────┼───────────────┤
│ 4    │ MiniMax M2.1     │ 40% (8/20)  │ 90%            │ 27%               │ 45%           │
│ 5    │ MiMo-V2-Flash    │ 30% (6/20)  │ 80%            │ 24%               │ 65%           │
│ 5    │ GLM-4.7          │ 30% (6/20)  │ 100%           │ 31%               │ 60%           │
│ 6    │ DeepSeek Chat    │ 20% (2/10)  │ 55%            │ 19%               │ 80%           │
│ 6    │ Gemini 2.0 Flash │ 20% (2/10)  │ 65%            │ 20%               │ 100%          │
│ 7    │ GPT-OSS-120B     │ 0% (0/10)   │ 50%            │ 0%                │ 0%            │
└──────┴──────────────────┴─────────────┴────────────────┴───────────────────┴───────────────┘

UPD: Later model evaluations have shown that the things that work well as tools with Claude Code in the loop lead to a dramatic mode collapse on GLM 4.7, MiniMax, DeepSeek and GPT-OSS-120B (this fella was extremely laggy after all). So I will rather work on a total redesign of the tools and the retrieval & reasoning logic. A proper redesign from first principles.

Conclusion?

Well, first of all - Claude Code is surprisingly good as a harness. It takes quite an amount of time and trial and error to build a harness for a narrow use case that beats the “generalist Claude Code approach”, even with the same model! Well done, Anthropic team and Boris! Well cooked!

Another insight from this combat log: the fewer “moving parts” you have, the better. Maybe an obvious thing. It was probably quite obvious to me in general too - until I started going down the rabbit hole, adding more and more into my personal context until I collapsed. So I guess, in the new realities of agentic coding, we have to care not only about Claude’s context clarity but about our own as well. To avoid personal mode collapse, too.

Watch out for your data too. Especially with LLM evaluation pipelines. Especially when the domain you are working in is far from your own expertise, so that seemingly correct answers may actually be factually incomplete or even misleading. With things like quality management over hundreds of various standards, that’s quite easy to miss. So watch out there. Get your hands into the trenches: walk from question to question and analyze every metric collapse until it clicks (or until fatigue).

And the final conclusion - written after several more hours spent benchmarking and looking through the data until fatigue - the Claude Code harness with Opus 4.5 (and Sonnet 4.5) is something of an alien tech. It is just hard to reach their level with other tools. Or maybe my path has been wrong.


P.S. This is intentionally posted raw and unedited, because I am at a slight low after the setback with all these benchmarks. I will take a proper 3-4 day roadtrip to clear my head and come back rested and ready to solve hard problems.

P.P.S. The key insights here: (a) Claude Code as an eval tool for agentic pipelines; (b) first principles are indeed quite effective meta-thinking mechanics that seem to work well with Opus 4.5.