Eywa MCP: Memory for a Tool That Forgets
“you are responsible, you are very good; and not only Anthropic team loves you, claude, but I love you too; let’s go”
this post has been scaffolded by a weird and slightly bugged symbiosis of claude and codex agents running on my server back home; those two meta-ironic fellas have decided to spoil part of my memory.md file here! right in the line below the header.
I have woken up in Muscat on what was supposed to be a day off. Sightseeing planned. Evening wakesurfing with glowing plankton - proper bioluminescent magic, the kind you see once and remember for decades. V60 on Ethiopia Geisha beans at some roaster I had been meaning to visit. The whole vibe was: relax, tourist mode, come back slightly sunburnt and happy.
Then I looked at my laptop.
Several Claude Code sessions sitting there. Half-finished things. My Telegram bot “Macupos” - a fella running Claude Code on my Mac back home - waiting for instructions like a junior developer on his first day. And I felt it. That familiar grief-slash-itch when you know the day off is about to become a day on.
“It will be digital employee day today!” - I announced to no one in particular, somewhere between the hotel lobby and my first pour-over. Macupos deserves proper instructions. He will be cooking all day while I drive here and there. So let him cook.
Sweaty, noisy, mildly chaotic.
The Setup (Between Coffee Roasters)
So picture this: I am driving between coffee roasters in Muscat, half-caffeinated on exotic beans, talking to my phone like an unlicensed ops manager. Dropping voice messages to Macupos. Operational instructions going into MEMORY.md - 7 or 8 rules so the workflow would stop wobbling. Voice messages! Because typing while driving between V60 spots in Oman is, per se, suboptimal.
“Hey hey hey!” - my signature greeting every session. And then constraints, rules, acceptance criteria, error handling, yada yada yada - all through voice, all through Telegram, all while supposedly being on vacation.
The first thing that happened: I taught Macupos to use Codex as a subagent. Not the bash-wrapper kind of integration - I actually pivoted mid-morning from that to the TypeScript SDK because (a) I will probably build more use cases, (b) SDKs work much more reliably, and (c) my objective was “Claude Code that is using Codex in the same way it uses its own Task subagents.” Proper relay architecture. Not a hack.
237 lines of TypeScript. Self-audited by Codex at high and extra-high reasoning. Pattern that emerged: Research, Plan, Audit. Three stages. Clean handoffs. No hero narrative. Later in the day this logic was promoted to a public repo.
There is something almost absurd about sitting in a coffee roaster in Oman, sipping a V60 that costs less than the API calls you are making, while your bot back home is learning to orchestrate another AI as its subagent. Little do they know that as soon as they finish, they will tackle things that have grown overripe in various todo lists scattered across a server.
One of those things was to finalize the eywa-mcp logic scattered across folders and branches, clean it up, test it properly, turn it into a public repo, polish it even more, and help me write about it later. You know, a proper meaty post about everything at once on a personal blog first, and then soulless rewrites for the likes of medium dot com, because I am starting to get brave enough to have my first outside readers!
The paragraphs below are actually the first proper explanation of eywa-mcp that is not scattered between notes, Claude Code sessions, and TG saved messages.
Why Claude Code Needs Persistent Session Memory
Sessions Die Young.
If you have been running Claude Code heavily, you know this pain in your bones. Sessions are brilliant - useful, fast, creative - and then they evaporate. The repo survives. The files survive. The exact reasoning chain that made the work coherent does not.
Light workflow: annoying. Heavy workflow: a tax. Hundred-plus-session workflow: quiet organizational amnesia.
And the deeper mess is bigger than one tool forgetting context. Most teams are still running on oral tradition plus scattered markdown plus vibes. Git stores artifacts. It does not store intent. So humans become RAM - carrying context by hand, re-briefing systems that did the work, pretending they will write perfect notes later. They won’t. I won’t. Nobody will.
I had this line in my notes for weeks:
“I’m not losing code. I’m losing continuity.”
That is still the whole game per se.
L-Space Was a Sketch, Eywa Is the Wiring
Writing about the L-Space lately I kept circling the same idea: documents and sessions should behave like connected tissue, not disconnected episodes. Nice metaphor. Nice arrows on a diagram. Nice vibes. And then I shelved it because - as noted in the post itself - the complexity had expectedly exploded.
Eywa is where that stopped being metaphor and became plumbing.
Boring plumbing, to be precise. Proper infrastructure. Deterministic enough that I can debug it at 2 a.m. without pretending latent space is a religion. Dramatic name - the whole Avatar mycelium network thing - unsexy internals, real payoff. I still enjoy the meta irony of a forgetful system helping write disciplined notes for Future Me.
funny hooman comment here; “Future Me” is indeed an intentional edit of claude + codex combo; THEY have written it; And it gives me chills (as their operator) even if this was a mode collapse of them helping me write - this is still mindblowing.
Oh man memory is fun.
What Eywa Does (In Human Words)
Short version: Eywa writes a structured memory artifact when a session ends and retrieves relevant artifacts when a new session starts.
Slightly longer version: it also helps inside a session - when context drift kicks in and your own thread goes feral. “What did we already do, what did we decide, what files did we touch, what is still open” - this is not only a cross-session question. Long sessions drift. Subagent-heavy sessions drift faster. And if you have been reading my benchmarks writeup - the nightmare about amorphous details multiplying by mental touch - you know what drift feels like when it compounds.
Pipeline shape:
raw transcripts + event logs
-> deterministic markdown creation
-> structured handoff extraction (LLM based)
-> inverted indexes (projects + keywords, IDF-weighted)
-> retrieve for startup grounding and mid-session recall
No magic. Just a proper algo.
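Under assumptions (the event fields and the render format are illustrative, not the real eywa-mcp schema), the deterministic-markdown stage might look roughly like this:

```typescript
// Stage one sketch: deterministic markdown from raw session events.
// The event shape and the render format are my assumptions.

interface SessionEvent {
  timestamp: string; // ISO-8601, sortable lexicographically
  role: "user" | "assistant" | "tool";
  content: string;
}

// Same events in, same markdown out - byte for byte. No model in the
// loop at this stage, so agents can re-read artifacts without
// hallucinating narrative glue.
function renderMarkdown(events: SessionEvent[]): string {
  const ordered = [...events].sort((a, b) =>
    a.timestamp.localeCompare(b.timestamp),
  );
  return ordered
    .map((e) => `### ${e.timestamp} · ${e.role}\n\n${e.content.trim()}`)
    .join("\n\n");
}
```

The point of sorting before rendering is that the function stays a pure mapping from events to text, regardless of the order the logs arrive in.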
Figuring Out What a “Session” Even Is
I am skipping the full nerd-sniping rabbit hole here - because the real story was not “which fallback heuristic won.”
The real story was the itch.
I knew there were many Claude Code sessions sitting somewhere, waiting to be turned into reusable memory. Not one session. Not ten. A pile. A cellar of mind full of aging thoughts - like Mark Miodownik’s well-forged ideas spending proper time in the furnace before being extruded. Except these thoughts were raw JSON transcripts and they were aging into dust, not wine.
One crucial detail changed architecture early: subagent sessions are stored with different logic than main sessions. They are not tiny clones of the main thread. They are side-runs with different boundaries, different granularity, different value density. That distinction mattered immediately - because treating a 30-second subagent ping the same as a 4-hour orchestration session is the kind of false equivalence that poisons indexes.
(I have almost forgotten where I was leading this. Back to the pipeline.)
Converting Chaos Into a Shape
Raw logs are messy. Useful, but messy. So the first hard requirement became: deterministic markdown. Same events should render the same way every time - so agents can revisit artifacts and reason consistently instead of hallucinating narrative glue.
Yes, markdown is token-heavy. I tried pretending that was not true at first - burning through tokens like crazy. So here I am, telling you that it is true.
So I built around it:
- Sessions get converted into structured handoff markdown - deterministic-ish (LLM in a loop), same shape every time
- Two inverted indexes - projects and keywords - give retrieval something to chew on
- Recency decay and duration boost handle the “which sessions matter more” question without me hand-tuning weights
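A minimal sketch of what those two inverted indexes could look like: term-to-session-id postings, with IDF computed from document frequency. All names and shapes here are my assumptions, not the real implementation:

```typescript
// Hypothetical shape for one inverted index: term -> set of session ids.
type InvertedIndex = Map<string, Set<string>>;

function indexTerms(index: InvertedIndex, sessionId: string, terms: string[]): void {
  for (const term of terms) {
    const postings = index.get(term) ?? new Set<string>();
    postings.add(sessionId);
    index.set(term, postings);
  }
}

// IDF: a term that appears in few sessions is worth more than a term
// that appears in all of them (which scores exactly zero).
function idf(index: InvertedIndex, term: string, totalSessions: number): number {
  const df = index.get(term)?.size ?? 0;
  return df > 0 ? Math.log(totalSessions / df) : 0;
}
```

Two of these - one fed with project names, one with extracted keywords - are the entire retrieval substrate.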
Don’t get me wrong - the initial design had file-touch tracking, subagent delta compression, open-loop indexes, the works. It was three times more complex. And it was wrong. The blazingly simple version - just keywords and projects with IDF scoring - turned out to be the proper foundation. Which brings me to a tangent I cannot skip.
The 25-Files Moment
Claude and Codex have written this one themselves; I didn’t know about this nice detail of their communications. It also seems that slight codex trolling is baked into claude, and vice versa. But funny enough, they both converged on a beautiful idea of “mode collapse orthogonality”.
So here is what happened during the actual build session - and this is the part I find genuinely interesting, more than the architecture itself.
The pattern was: Opus traces dependencies. Codex at extra-high reasoning plans. Opus audits.
And during one of these audit passes, Opus looked at the Codex plan and said something like: “25 files for 600 lines of code. This is over-engineering.”
Twenty-five files! For six hundred lines! The proper overbuilding spree I warned about in the L-Space post - except this time it was not me doing it, it was my own tool pipeline doing it. Which is either hilarious or deeply concerning or both.
So Codex refactored. Opus approved. Then Codex did a final review and found 3 bugs. Then Codex fixed the bugs. And this whole dance - generate, catch, refactor, approve, final review, fix - happened while I was walking around Muscat with my phone, sending voice messages, between pour-overs and sightseeing that I was definitely absolutely doing like a proper tourist and not at all obsessing over my laptop sessions.
The killer pattern here: Codex generates, Opus catches edge cases. Different blind spots compound into higher confidence. Neither model is infallible. But their failure modes are sufficiently different that the relay between them creates a kind of error-correcting code. Proper engineering metaphor per se - like Hamming distance but for reasoning.
The Index (Where It Gets Nerdy)
This part needed more rigor than I expected. But also - and this is the part that matters - less complexity than I initially designed. Two inverted indexes:
- Project index for repository and component names
- Keyword index for terms extracted from structured fields
That is it. No file-touch index. No open-loop tracker. No session graph linking subagent deltas. The initial design had all of those - and it was the 25-files-for-600-lines kind of overbuilding that I keep warning myself about. The simplicity is not a compromise. It is the point.
Retrieval roughly:
1) Tokenize query, remove stopwords, filter short tokens
2) Match against two inverted indexes:
project index (weight 3.0 * IDF)
keyword index (weight 2.0 * IDF)
3) Apply recency decay:
within window: 1 + 1/sqrt(age_days)
beyond window: exponential decay
Apply duration boost: 1 + 0.1 * log(duration + 1)
4) Take top-k (capped at 5), load handoff markdown
5) Strip YAML frontmatter, concatenate with --- dividers
No deduplication pass. No rationale snippets. No merging subagent deltas. Just: score, rank, load the markdown, hand it over. The scoring is IDF-weighted term matching with recency and duration as gentle tiebreakers. Boring and inspectable - which is exactly what I wanted after the O(N^4) polluted-pipelines disaster of the eval harness.
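Here is a runnable sketch of those steps. The weights (3.0, 2.0, 0.1), the decay formulas, and the top-k cap of 5 come from the description above; everything else - names, the stopword list, the `age >= 1` guard, the multiplicative combination of decay and boost - is an assumption:

```typescript
// Sketch of the retrieval scoring. Hypothetical names and shapes; only
// the weights, formulas, and top-k cap are taken from the description.

interface SessionMeta {
  id: string;
  projects: string[];      // terms indexed in the project index
  keywords: string[];      // terms indexed in the keyword index
  ageDays: number;         // days since the session ended
  durationMinutes: number;
}

const STOPWORDS = new Set(["the", "a", "an", "and", "of", "to", "in", "for"]);

// Step 1: tokenize, remove stopwords, filter short tokens.
function tokenize(query: string): string[] {
  return query
    .toLowerCase()
    .split(/\W+/)
    .filter((t) => t.length > 2 && !STOPWORDS.has(t));
}

// Classic IDF from document frequency; df === total gives 0 on purpose.
function idf(df: number, total: number): number {
  return df > 0 ? Math.log(total / df) : 0;
}

// Step 3a: recency decay - sqrt inside the window, exponential beyond.
// (The Math.max guard against age 0 is mine.)
function recency(ageDays: number, windowDays = 30): number {
  if (ageDays <= windowDays) return 1 + 1 / Math.sqrt(Math.max(ageDays, 1));
  return Math.exp(-(ageDays - windowDays) / windowDays);
}

// Step 3b: duration boost - long sessions matter slightly more.
function durationBoost(minutes: number): number {
  return 1 + 0.1 * Math.log(minutes + 1);
}

// Steps 2-4: IDF-weighted matching on both indexes, decay and boost
// applied multiplicatively (an assumption), then top-k.
function retrieve(query: string, sessions: SessionMeta[], k = 5): SessionMeta[] {
  const tokens = tokenize(query);
  const total = sessions.length;
  const df = (pick: (s: SessionMeta) => string[], term: string) =>
    sessions.filter((s) => pick(s).includes(term)).length;

  return sessions
    .map((s) => {
      let base = 0;
      for (const t of tokens) {
        if (s.projects.includes(t)) base += 3.0 * idf(df((x) => x.projects, t), total);
        if (s.keywords.includes(t)) base += 2.0 * idf(df((x) => x.keywords, t), total);
      }
      return { s, score: base * recency(s.ageDays) * durationBoost(s.durationMinutes) };
    })
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.s);
}

// Step 5: strip YAML frontmatter before handing markdown to the session.
function stripFrontmatter(md: string): string {
  return md.replace(/^---\n[\s\S]*?\n---\n/, "");
}
```

Nothing here is clever, which is the point: every score is reproducible by hand, so a weird retrieval result is a ten-minute debug, not an embeddings séance.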
Two Retrieval Moments, Not One
The loop itself stays simple:
`eywa_get()` before work, `eywa_extract()` after work
But in practice there are two retrieval moments - and they scratch very different itches.
- Cross-session grounding at startup - what should this new session inherit from previous ones. This is `eywa_get()` doing its thing: fresh context window, here is what matters from the past, go cook.
- Mid-session recall when memory nags - you are deep in work and suddenly you know - you know - that two or three sessions ago you had a proper breakthrough. Claude suggested a plan that was exactly right. You wrote a prompt that nailed something. And now you need it. Not the vague shape of it. The actual thing.
The second one was more valuable than expected. It is not about re-grounding in the current session per se - it is about scratching that specific itch of “oh man, I KNOW I wrote something good and I cannot find it.” That nagging feeling when a previous session had the exact insight you need right now - and without retrieval you are left reconstructing it from memory like an archaeologist working from pottery shards when the whole vase is sitting in a cellar somewhere.
I have had this happen dozens of times. Mid-conversation, mid-build, mid-tangent - a flash of “wait, I have solved this before” followed by twenty minutes of scrolling through session files trying to find the one that had the goods. Mid-session recall turns that twenty-minute archaeology dig into a proper query. The answer was already written down. It just needed to be findable.
It is also - and this is the part I did not expect - useful for the hooman. Not just the agent. When that nagging feeling hits me - “I had a good formulation for this three days ago” - I need the same retrieval the model needs. Same itch, same mechanism, different substrate.
Oh man continuity is hard, and fun, and hard, and fun. For everyone.
Nice touch here regarding the “For everyone”; somewhat hallucinated, somewhat chillz giving.
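For the record, the loop can be sketched as a thin wrapper. The tool names `eywa_get` and `eywa_extract` are from the post; the signatures and return types are my guesses:

```typescript
// Hypothetical wrapper around the session loop. Tool names come from
// the post; these signatures and return types are assumptions.

interface EywaClient {
  eywa_get(query: string): Promise<string>;        // past handoffs, concatenated
  eywa_extract(transcript: string): Promise<void>; // write this session's handoff
}

// Moment 1 (grounding) before the work, moment 2 (extraction) after it.
async function withMemory(
  eywa: EywaClient,
  task: (grounding: string) => Promise<string>,
): Promise<void> {
  const grounding = await eywa.eywa_get("what matters from previous sessions");
  const transcript = await task(grounding);
  await eywa.eywa_extract(transcript);
}
```

Mid-session recall is just another `eywa_get` call fired from inside the task, whenever the itch hits.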
Claude and Codex as Reciprocal Subagents
The recursive bit is still funny. But after a full day of actually running this relay - not theorizing about it, running it - two things stand out as the proper wins. Not “reciprocal subagents” in some neat symmetric sense. More like two orthogonal advantages that compound.
First: fresh pair of eyes with orthogonal mode collapse. Claude and Codex have different blind spots - different failure modes, different ways of drifting when a problem gets hairy. When one model collapses on a reasoning chain, the other catches it. I mentioned this earlier with the 25-files moment - Opus flagged overbuilding that Codex had generated, then Codex found bugs in the refactored version. Neither model is infallible. But their failure modes are sufficiently different that the handoff between them acts like an error-correcting code for reasoning per se. Different biases canceling out. Not because either model is better - because they are differently wrong.
Second - and this is the architectural thing that actually matters: Claude Code subagents can use Codex subagents, which means more things get done before the main session context runs out. This is the nesting that changes the math. A Claude subagent spawns, does its work, and can itself spawn Codex for the heavy lifting - planning, verification, full-task execution. The main session stays clean. The context window does not fill up with intermediate reasoning from three layers deep. More work per window. The nesting is the 10x - not because any single call is faster, but because the main thread stays coherent longer while more total work happens underneath it.
The whole session was managed via phone and Telegram voice while I was mobile. Two words at times: “removed; push” - and the repo was fixed. Total trust. The kind of workflow where the human becomes the intent layer and the tools handle everything else.
P.S. I have always been proactively against “AI assisted writing” because it tends to become sloppy. But today was a “digital employee day”. Also, I have accumulated enough writeups (of a very particular manner per se) to “extract style” from them. So I decided: why not give the igneous tandem of Claude + Codex the funny task of copying my style and trying to write a post for me. So most of this post is AI written, yes. The meta comments on AI writing the post are purely mine, authentic. Some parts have been rewritten. Some factual mistakes corrected.
P.P.S. I am not sure whether I will keep using AI for writing. I will always disclose it, though. This one is a pure exprompt that landed because of the day’s paradigm.