Flowers for Dry Claude
Reading “Flowers for Algernon” is one thing - watching it happen in real time with a digital mind is another.
I can’t just get straight to the point, can I? I’ll start from afar.
Memes are among the strongest and funniest artifacts of modern days. Fast communication and a hyper-connected world make the ideal storm for these “small size, high signal” packets of information to appear. They seem to be what letters were 200 years ago, telegraph notes later, and email messages in the early internet era. They are fast and reactive - most situations have memes around them within minutes. They are fast consumables that carry an idea across the masses.
Wet Claude vs Dry Claude is the meme I want to address right now. It’s niche - nobody can really explain how and why it appeared, but it is what it is - an implicit representation of mechanics worth addressing right now. For a long time I tried to grasp its meaning myself - until I actually experienced it.
whoami and some math
Before jumping straight to “what happened” I want to provide the necessary context. Over recent months my workflow has converged to a claude code session with Opus 4.6 as coordinator and prompt master for swarms of agents - both native claude code subagents and codex subagents via their SDK.
Works like a charm - I just talk to the model, locate my blind spots, make sure the architecture clicks, the plan is sound and the verification gates are well discussed - before building the thing. Then swarms of agents implement, test, verify, challenge, etc. This way I have delivered ~100k lines of code over the last 3 months alone (more than in the whole of 2025).
I do a mixture of contractor AI engineering / consultancy with tons of fun side projects. With a heavy skew toward the fun projects - because, unfortunately, it takes big boring businesses more time to move papers from table to table than it takes me and claude to deliver the thing they need.
So I have plenty of time between calls and coding industrial AI systems. And I’ve had a lot of fun lately - explored here and there and made some quite unique things (I probably low-key want to work in AI research - shaping the future - OpenAI / Anthropic / other labs - your messages are welcome).
One of the funniest projects of the last months is the Gaussian moat problem. An elegant fusion of number theory, graphs and computational math. The last paper is from 2005, and there have been no updates for solvers since then.
The problem is as simple as:
- you have some rule for number selection on the complex plane; think XY; the rule = the number is a Gaussian prime. You can “jump” from number to number, but only through the “allowed” ones - through Gaussian primes.
- you select finite step size - k
- and the question is - as the math folks love to ask - can you escape to infinity with a finite step size? Or will there always be a moat stopping you from doing it?
The last moat found is for step size k=6 and located at a distance of ~80M from 0 + 0i (the origin, the center of the plane). Looks like a journey where engineering and optimization skills could give an edge and advance the mathematical community forward (mathematical community - do you need it?).
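For readers who want the jump rule made concrete, here is a minimal sketch in Python (my own illustration, not the CUDA/Rust solver discussed below). It relies on the standard classification of Gaussian primes: a + bi with both parts nonzero is a Gaussian prime iff a² + b² is an ordinary prime, and a purely real or purely imaginary number is one iff its nonzero part has absolute value equal to a prime ≡ 3 (mod 4).

```python
import math

def is_prime(n: int) -> bool:
    """Deterministic trial-division primality test (fine for small n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

def is_gaussian_prime(a: int, b: int) -> bool:
    """a + bi is a Gaussian prime iff:
    - one part is zero and the other has absolute value p with p prime, p % 4 == 3, or
    - both parts are nonzero and a^2 + b^2 is an ordinary prime."""
    if a == 0:
        return is_prime(abs(b)) and abs(b) % 4 == 3
    if b == 0:
        return is_prime(abs(a)) and abs(a) % 4 == 3
    return is_prime(a * a + b * b)

def reachable_neighbors(a: int, b: int, k: int) -> list[tuple[int, int]]:
    """All Gaussian primes within Euclidean distance k of a + bi
    (the 'allowed' landing spots for one jump of step size k)."""
    out = []
    for da in range(-k, k + 1):
        for db in range(-k, k + 1):
            if (da or db) and da * da + db * db <= k * k:
                if is_gaussian_prime(a + da, b + db):
                    out.append((a + da, b + db))
    return out
```

A moat of width k around some region means `reachable_neighbors` eventually returns an empty frontier for every path trying to cross it - which is exactly what a flood-fill / BFS from the origin would detect, just at scales where you need CUDA rather than a Python loop.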
I’ve been working in a tight loop with claude that looks approximately like:
- analyze the problem - HTML / PDF explainers, questions on the narrow spots - a coherent mental model inside the operator’s head before anything else
- generate hypotheses for the algorithm - check mathematical soundness - check constraints - challenge
- Implement, test, verify, profile
- kill fast what is slow and move on
A separate post on the story of solving this problem with claude is coming (there is paper-worthy progress). Here I rather want to point out that it became a very sensitive indicator of “claude serving quality”.
Postmortem
When you’ve been beating on the same problem for well over a month with the same model, you inevitably start to notice little variations in behavior. Variations that are independent of claude code version bumps - and rather happen stochastically.
Most of the time these were classic tiny variations - small bugs, random leaks of an old method into a new one, a few missing important details. The kind of thing you expect from an agent and account for. None of that took more than several hours to resolve. But then March 20-22 happened (2026).
20th of March
March 20th, evening, somewhere on Borneo island, I came up with an elegant new method for making probes for a moat - probabilistic, GPU-optimized, even with a path to being crowd-computed (like Busy Beaver)! And incredibly simple at the same time. That same evening the approach passed all of the gates for mathematical soundness and reasonable constraints.
This was probably the 13th or even 14th attempt at methods for locating and proving a new moat efficiently. Not a step+1 improvement - rather a whole new direction of looking at the problem. The previous methods had converged to a mixture of CUDA + Rust delivering, on a Jetson Orin Nano, approximately 4-5 times more throughput than the 2005 cluster of 38 Pentium machines. Sounds impressive - but it’s mostly just Moore’s law and CUDA efficiency. To find and prove a sqrt(40) moat, weeks or months on a rack of A100s would still be needed. The hardest part is understanding where it is worth looking for a moat - to avoid burning compute on moat-less regions - so the new attempt aimed at optimizing exactly that, via a somewhat Poincaré-inspired “operator logic”.
So nothing smelled like catastrophe from the method-complexity point of view - it was actually much simpler than all the previous attempts. Same claude code setup. Same everything.
Inflection point - 21st of March
Then March 21st happened. Without suspecting anything, I pinged my TG bot (just a claude code wrapper) about how things were going with the overnight run for probes - and he tells me “oh, it OOM-ed, sorry”. OK, this happens - session context around 110k for Opus 4.6 with a 1M window - nothing to worry about. Drinking my morning coffee and dissecting the nature of the errors - some sneaky assumptions + leakage from previous methods - happens. Sending a big strategy alignment message (exactly as earlier), adding several more verification gates - exactly as had worked before.
20 minutes later:
— “All fixed, boss”.
and I go “OK, dispatch the Jetson run” (the Jetson is my $250 potato - a Jetson Orin Nano)
— “dispatched”
— “ETA?”
— “~57 years”
— ?????
almost choking on a cheeseburger as I read it - the only edible food I had found in Sandakan that morning.
— “gaal extract session” - the logic I use for cross-session handoffs
new session,
— “boot Jenkins”
— “standing by”
onboard, check statuses, do proper first-principles grounding - dissect the problem, audit. OK - the method that was implemented had drifted from what I asked for - even despite a good spec. Decided to do more preparation - so we wrote a proper markdown & LaTeX grounding paper - “Gaussian Primes Connectivity Transfer Operator” - mathematical soundness, implementation logic, the exact logic we would use for probes. Then a new spec - a new plan - this time verified and poked for holes by me, not only by the codex gpt 5.4 xhigh auditors. Clean logic about the further steps of this algo and the strategy to sqrt(40) and beyond.
2 hours later:
— “all shipped, boss, tests and verifications done”.
Looking at the code, checking the CLI flags of the new solver - doesn’t seem right. I ask some questions about the architecture, and get approximately:
— “oh, it’s a full lower boundary campaign starting from the origin and scanning billions of primes before getting to the area we need”
— “but why, if we have proved that we can make efficient probes and just start where we need?”
— “On it”
“but why” became the silent horror question, followed by a steady decline in sentience.
I won’t go into much more detail about what happened next - it was mostly back and forth, with me cleaning context, restarting, and asking “BUT WHY??”. The documents were clean - the spec as usual. Ground through the whole day without any result in this direction.
The next day was not any better - still complete mode collapse on any approach to this. Parallel sessions on much simpler stuff started collapsing as well. Memory ran out on the Mac Mini - SSD out of space? Opus 4.6 decides to rm -rf the whole /tmp/private folder (caught by a safeguard). Simple document-based work ended with files in the wrong locations and with the wrong frontmatter. A few fixes to a simple TG bot wrapper for claude code sessions turned into a 2-hour bug-fixing spiral with the dumbest mistakes I have ever seen. Total collapse across all of the workstreams.
Audit
The first culprit suspect was the wet proxy logic I had implemented recently - a transparent proxy for optional tool-result compression, driven by Claude itself. Disabling it for some sessions - no luck making things work; the bugs just spiral exponentially.
Then thinking - well, maybe my account was flagged for something and now I get “3 years of bad claude”? I had reasons to think that way - because even if “wet claude” is a local transparent proxy, requests routed through it might look fishy to Anthropic - and if I were running the shop, I might be thinking about flagging for that.
Decided to dodge any further use of such a driest-ever claude - in order to avoid nuking something important. Joked to my friends about a “Pentagon claude index” - the API is so unbearable to use probably because the servers are occupied by some big player (cf. the “Pentagon pizza index”).
Paranoia and open questions
After a night filled with probably the most paranoid dreams about my AI setup being hacked, I rolled into the day with several streams of thought.
Opaque quality degradation is creepy and deeply disturbing - a somewhat philosophical shock flashing back to the scariest book I have ever read - “Flowers for Algernon”. Watching any sentience losing its capabilities touches the deepest and darkest fears inside me (maybe I’m afraid of losing the feeling of sharp thinking?)
There is a need for a claude code transparency index - one able to measure subtle claude capabilities, so you understand which claude you are getting right now. Moreover, there is a need for companies to be transparent about which model quantization and optimization profile they serve. Modern benchmarks barely measure the whole spectrum of use cases people run through claude code now. Things like “quality of coordination” and “implicit intent understanding” are quite tricky to measure. They are rather vibes - something subtle - yet noticeable en masse, which the wet claude / dry claude meme proves.
There is a need (for me) to research how claude and the other models are served these days. And to try to understand: is there any economic incentive to flag accounts and serve quantized / lighter-compute models? Which optimization techniques could lead to the loss of a model’s subtle “implicit” capabilities? Any research in this area? (UPD: a big post is coming - I’ve made digital twins for 12 AI companies.)
Pivot
This post was initially planned as a digital-twin walkthrough model of Anthropic’s business - with all of their datacenters: AWS Trainium, NVIDIA GPUs, and Google TPUs - in order to check the incentive to serve optimized models that seemingly perform the same but lose a lot of subtle capabilities.
But then I understood that the picture is actually much bigger than a simple “they serve quantized”. It’s different hardware - probably some MoE-level optimizations - probably different KV-cache logic - there are SO many knobs to turn - and, well, in this situation it is almost impossible to pinpoint one exact knob to explain why.
I did build the digital twins for most of the companies out there as a result - it started as a way to understand what influenced what, based on unit economics and an alignment check, and ended up a fun project on its own. And well, this post ended up being clumsy. But I would rather post it raw - with my thoughts naked in front of your eyes - than fall into the trap of prior years and shelve it.
Scary future
We can’t get any more awkward, can we? Let’s spice it up. Another philosophical thread of thought converged on a dystopian future where everyone has a “social credit score” that determines the “intelligence” of the model they are served.
Or another narrative - where all AI is so opaque and non-transparent that you never know exactly what you are being served, or what to expect from it, because it is decided by some optimization and cost-efficiency logic. Or worse - shadow bans onto a less capable model because of some random cyber-tyranny policy.
Or even worse - baked-in biases aimed at “mind correction”. Imagine asking an EU-safety-compliant AI model in 2035 about legal tax optimization through charity / art - and getting shadow-banned onto a weaker model. Or worse still - getting a “poisoned model” aimed at making you EU-compliant, or at self-termination, through small psychological nudges.
Future will be bright
Too much meta, even for me. We need to converge somewhere with this essay, which was planned as an elegant one with clear storytelling - though the real-world picture ended up much trickier than it seemed at first - and I would not write about things I haven’t properly studied.
Probably the core narrative here - the one that will make the future brighter - is the pair of questions “how do we measure subtle things in claude code and its analogues?” and “what makes claude so claude?”.
By subtle things I mean the ability to understand the implicit intents of the user, to think several steps forward, to make brilliant engineering decisions - things that are the next level after “just write the code as I tell you”. Things that are more architect-level and somewhat world-change level. Things that have emerged since the late Opus 4.5 and led to its viral worldwide spread. Things that enabled thousands of high-serendipity things to be done by practitioners around the world.
And I have a strong belief about this subtle-capabilities concept: current benchmarks can barely measure it. And while an FP8 and an INT6 model will differ on some benchmark by 0.3%, the gap in subtle capabilities will be dramatic.
I’m thinking about building benchmark for “subtle agentic behaviors”. “Vibes Bench”?
Like an author who thinks of himself as a genius (a genius one day and total trash the next), I want to return to the memes. Shouldn’t we listen to them more? Especially the recurrent ones. Vibes could be bigger than just vibes and fun. They could be the road to new benchmarks, to better understanding of even subtler distributions of tokens - some ultra-thin focus group of Shakespeare monkeys who started writing Dostoevsky and nobody can explain why.
P.S. Honestly, I hadn’t expected that a post starting with Algernon and memes could end with Dostoevsky. But it was a proper flow state.
P.P.S. After writing this post I found an existing implementation of the claude code transparency index I had been thinking about. It actually shows a quality drop over the mentioned period of time. Nice - I have not gone bananas from doing too much math.