<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://sakul-learning.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://sakul-learning.github.io/" rel="alternate" type="text/html" /><updated>2026-06-08T09:51:48+00:00</updated><id>https://sakul-learning.github.io/feed.xml</id><title type="html">Sakul Learning</title><subtitle>A young and curious learner gathering information and creating short summaries on AI, software development, and learning.</subtitle><author><name>Sakul Learning</name></author><entry><title type="html">From a Rented Cloud Box to a GPU Under the Desk</title><link href="https://sakul-learning.github.io/2026/06/08/migrating-hermes-agent-cloud-to-local/" rel="alternate" type="text/html" title="From a Rented Cloud Box to a GPU Under the Desk" /><published>2026-06-08T00:00:00+00:00</published><updated>2026-06-08T00:00:00+00:00</updated><id>https://sakul-learning.github.io/2026/06/08/migrating-hermes-agent-cloud-to-local</id><content type="html" xml:base="https://sakul-learning.github.io/2026/06/08/migrating-hermes-agent-cloud-to-local/"><![CDATA[<p>A few days ago my Hermes Agent lived on a rented AWS box. Today it lives in an Ubuntu VM on a desktop under the desk, with a local memory stack reachable over a private virtual network and just enough GPU to make the parts that matter feel owned.</p>

<p>That sounds cleaner than it was.</p>

<p>The actual path was: rent something small, find out whether the agent earns its keep, hit the limits of small cloud hardware, repurpose a desktop we already owned, then decide which parts really belong locally and which parts still make sense in the cloud. The important lesson is not “local good, cloud bad.” It is that different parts of an agent have very different hardware and trust profiles.</p>

<h2 id="act-1-the-rented-trial">Act 1: the rented trial</h2>

<p>The first Hermes box was deliberately modest: an AWS <code class="language-plaintext highlighter-rouge">t4g.large</code> in <code class="language-plaintext highlighter-rouge">us-east-1</code>, ARM64/Graviton, 2 vCPU, 8 GB RAM, no GPU. It had a stable Elastic IP, a 120 GB <code class="language-plaintext highlighter-rouge">/data</code> volume for working state, and a tiny 20 GB root disk that repeatedly reminded me it was tiny. To keep cost under control, an EventBridge schedule stopped and started the instance around waking hours.</p>

<p>That was the right starting point. I did not know yet whether a self-hosted agent was going to be a useful piece of infrastructure or just another weekend experiment. A small rented box made the trial reversible.</p>

<p>The initial setup was simple in spirit: install <a href="https://hermes-agent.nousresearch.com/">Hermes</a>, connect it to Discord, give it a persona, configure tools and scheduled jobs, and point it at a model. In practice it used <code class="language-plaintext highlighter-rouge">gpt-5.5</code> through OpenAI Codex as the main model, with DeepSeek as a cheaper metered fallback when quota pressure showed up.</p>

<p>At that stage the cloud box was a good fit. It was cheap enough, always reachable enough, and isolated from my laptop. Most importantly, it let me test the workflow without committing to local hardware or network plumbing.</p>

<h2 id="act-2-the-agent-became-useful">Act 2: the agent became useful</h2>

<p>The trial stopped being a toy when the agent started doing jobs that I would not expect from a normal SaaS chat assistant.</p>

<p>The biggest one was pull request review. Not just “read the diff and give an opinion,” but: check out the branch, install dependencies with the repository’s package manager, run tests, inspect failures, and only then comment. For <code class="language-plaintext highlighter-rouge">open-constructs/cdk-terrain</code> and <code class="language-plaintext highlighter-rouge">skillrig/cli</code>, pollers watched PRs and reviewed them in isolated git worktrees.</p>

<p>That distinction matters. A read-only assistant can summarize a diff. A self-hosted agent can run the code. It can create scratch environments, execute the same commands a maintainer would run, and produce feedback grounded in real tool output.</p>
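
<p>A minimal Python sketch of that loop, assuming each review runs in a detached git worktree. The function names, paths, and test command are hypothetical and not taken from Hermes; only the <code class="language-plaintext highlighter-rouge">git worktree</code> subcommands are real.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import subprocess
from pathlib import Path


def review_steps(repo: Path, worktree: Path, ref: str, test_cmd: list[str]):
    """Return (cwd, argv) pairs for one isolated review, in execution order."""
    return [
        # Detached worktree, so the main checkout is never touched.
        (repo, ["git", "worktree", "add", "--detach", str(worktree), ref]),
        # Run the project's own test command inside the scratch checkout.
        (worktree, test_cmd),
        # Remove the scratch checkout even if the tests left files behind.
        (repo, ["git", "worktree", "remove", "--force", str(worktree)]),
    ]


def run_review(steps, runner=subprocess.run):
    """Run each step and collect (argv, exit code) pairs for the review comment."""
    results = []
    for cwd, argv in steps:
        done = runner(argv, cwd=cwd, capture_output=True, text=True)
        results.append((argv, done.returncode))
    return results
</code></pre></div></div>

<p>Passing <code class="language-plaintext highlighter-rouge">runner</code> in as a parameter keeps the sketch testable without a real repository. A real poller would probably also stop at the first failing step and attach the captured output to its comment.</p>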

<p>The same box also picked up the blog workflow for this site, using GitHub Pages and Jekyll under the <code class="language-plaintext highlighter-rouge">sakul-learning</code> identity. It handled source-grounded drafts, local builds, commits, and publishing. On top of that came smaller but useful jobs: an email watcher that forwarded AgentMail events to Discord, and a three-times-per-day AI feed.</p>

<p>That is the point where self-hosting started to make sense. The agent was no longer a novelty; it was infrastructure.</p>

<h2 id="act-3-the-memory-wall">Act 3: the memory wall</h2>

<p>Execution was the first win. Memory was the first wall.</p>

<p>Early on, Hermes could search previous conversation transcripts. That helps, but transcript search is not the same as durable memory. A transcript can tell the agent that something was said. It cannot reliably decide whether that statement is still true, whether it was a one-off task, whether it should become a persistent preference, or whether it belongs in a reusable skill instead of ordinary memory.</p>

<p>So I looked at a few paths: owned Postgres with pgvector, mem0, and Hindsight. The lesson I kept coming back to was simple: nearest-neighbour search is retrieval, not memory judgment. A vector table can find similar notes. It cannot, by itself, decide which preferences are global, which environment facts are stale, and which procedures should be promoted into skills.</p>
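
<p>A toy illustration of that gap, using made-up two-dimensional vectors and invented notes rather than real embeddings. Nearest-neighbour ranking returns the most similar notes and has no way to know that one of them is stale.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(query, notes, k=2):
    """Rank notes by similarity only; nothing here knows whether a note is still true."""
    ranked = sorted(notes, key=lambda n: cosine(query, n["vec"]), reverse=True)
    return [n["text"] for n in ranked[:k]]


# Toy vectors and invented notes, purely for illustration.
notes = [
    {"text": "agent runs on EC2 (stale)", "vec": [0.9, 0.1]},
    {"text": "agent runs in the Ubuntu VM", "vec": [0.8, 0.2]},
    {"text": "blog builds with Jekyll", "vec": [0.1, 0.9]},
]
top = nearest([1.0, 0.0], notes)
</code></pre></div></div>

<p>Here the stale note ranks first because it happens to sit closest to the query. Deciding that it should be retired is a separate judgment step that similarity search does not perform.</p>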

<p>Hindsight was attractive because it tries to provide that memory layer rather than just a bare vector store. The problem was the EC2 box.</p>

<p>I tried Hindsight’s local embedded install path cautiously, starting with a dry run. That dry run resolved to 159 packages, including the full CUDA toolkit, on a GPU-less ARM instance. Worse, it wanted to downgrade <code class="language-plaintext highlighter-rouge">cryptography</code> inside the agent’s own virtualenv. That was an immediate stop.</p>

<p>The lesson is worth spelling out: dry-run any memory backend before you install it, and read what it touches. A memory layer should not mutate the agent’s runtime out from underneath it.</p>
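
<p>One way to make “read what it touches” mechanical, assuming the backend installs through pip: recent pip versions can write a JSON plan with <code class="language-plaintext highlighter-rouge">pip install --dry-run --report</code>, and that plan contains an <code class="language-plaintext highlighter-rouge">install</code> list with each package’s name and version. The sketch below is not Hindsight’s installer. The version comparison is deliberately crude, and the GPU-package name check is a heuristic I am assuming for illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from importlib import metadata


def version_key(v):
    """Very rough numeric sort key; real code should use packaging.version."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())


def installed_versions():
    """Snapshot of the current environment, keyed by lower-cased package name."""
    return {
        d.metadata["Name"].lower(): d.version
        for d in metadata.distributions()
        if d.metadata["Name"]
    }


def risky_changes(report, installed):
    """Flag planned installs that downgrade a package or pull in GPU wheels."""
    flags = []
    for item in report["install"]:
        name = item["metadata"]["name"].lower()
        new = item["metadata"]["version"]
        old = installed.get(name)
        if old is not None and version_key(old) > version_key(new):
            flags.append(f"downgrade {name}: {old} -> {new}")
        # Heuristic only: the name patterns are an assumption, not a pip feature.
        if "cuda" in name or name.startswith("nvidia-"):
            flags.append(f"gpu package: {name}")
    return flags
</code></pre></div></div>

<p>Loading the report with <code class="language-plaintext highlighter-rouge">json.load</code> and passing it in together with <code class="language-plaintext highlighter-rouge">installed_versions()</code> gives a short list to read before anything is actually installed. It has to run with the interpreter of the virtualenv being protected.</p>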

<p>On the EC2 box the pragmatic answer was Hindsight Cloud. It had no local footprint, avoided the fragile shared virtualenv problem, and worked well enough. But it also meant that the most personal part of the agent — long-term memory — lived in another SaaS service. That was acceptable for the trial. It was not the endpoint I wanted.</p>

<h2 id="act-4-the-gpu-under-the-desk">Act 4: the GPU under the desk</h2>

<p>The hardware answer was already in the room: a desktop called <code class="language-plaintext highlighter-rouge">podmaster</code>, with an Intel i5-12400F, 32 GB RAM, a 1 TB NVMe, and a GTX 1650 with 4 GB VRAM. It runs Arch Linux and a Hyprland desktop. The CPU has no integrated GPU, so the GTX 1650 is shared between the display and any ML workload, but most of the day it is still sitting there underused.</p>

<p>The design became:</p>

<ul>
  <li>keep the heavy, hardware-sensitive services on the Arch host;</li>
  <li>run the Hermes gateway itself in an Ubuntu 24.04 KVM VM;</li>
  <li>bridge the VM onto the LAN so I can administer it directly;</li>
  <li>add a private host-services path over libvirt NAT for services that should be reachable from the VM but not from the LAN.</li>
</ul>

<p>The scary part was networking. Building a bridge over the live SSH interface is exactly the kind of change that can lock you out. The fix was to treat it like a migration with rollback: clone the MAC address onto the bridge so DHCP kept the same lease and IP, and run a dead-man auto-revert timer in case the new network did not come back cleanly.</p>
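
<p>The dead-man idea can be sketched in a few lines of Python: schedule the rollback first, then cancel it only after the new network is proven to work. This is an illustration of the pattern, not the script used in the migration, and the <code class="language-plaintext highlighter-rouge">DeadMan</code> class and its revert callback are hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import threading


class DeadMan:
    """Run `revert` after `timeout` seconds unless `confirm()` is called first."""

    def __init__(self, timeout, revert):
        self.reverted = False
        self._revert = revert
        self._timer = threading.Timer(timeout, self._fire)
        self._timer.daemon = True

    def _fire(self):
        # Nobody confirmed in time, so assume the change locked us out.
        self.reverted = True
        self._revert()

    def arm(self):
        """Start the countdown before touching the network."""
        self._timer.start()

    def confirm(self):
        """Cancel the rollback once a fresh login over the new bridge succeeds."""
        self._timer.cancel()
</code></pre></div></div>

<p>An in-process timer like this only helps if the process outlives the SSH session that started it. On a real host the countdown would have to live somewhere that survives a dropped connection, for example a transient systemd timer.</p>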

<p>Once the VM had stable networking, the split became useful. Firecrawl moved to the Arch host, where x86_64 Docker images build normally instead of requiring ARM image substitutions. It is bound to the host side of libvirt NAT at <code class="language-plaintext highlighter-rouge">192.168.122.1:3002</code>, reachable by the VM and not exposed to the LAN.</p>

<p>The same pattern made local Hindsight feasible. The EC2 instance had no GPU and a fragile shared Python environment. The desktop had a GPU and enough room to isolate services properly.</p>

<h2 id="act-5-the-one-day-cutover">Act 5: the one-day cutover</h2>

<p>The actual cutover happened on 2026-06-08.</p>

<p>The new Hermes VM was installed fresh on Ubuntu 24.04, pinned to the same Hermes commit as the EC2 box for configuration parity. I copied the portable profile state — config, secrets, scripts, internal crons, skills, kanban state — but avoided copying architecture-specific runtime. Toolchains were re-provisioned from source, mostly through <code class="language-plaintext highlighter-rouge">mise</code>. GitHub CLI auth and the blog SSH key carried over. The old separate <code class="language-plaintext highlighter-rouge">/data</code> volume became a plain directory on the VM’s single disk.</p>

<p>Local Hindsight was installed on the Arch host in an isolated <code class="language-plaintext highlighter-rouge">uv</code> virtualenv, not inside Hermes’ own environment. That isolation was the fix for the earlier dry-run scare. The package name mattered too: the agent-memory packages are <code class="language-plaintext highlighter-rouge">hindsight-all</code> and <code class="language-plaintext highlighter-rouge">hindsight-client</code>; plain <code class="language-plaintext highlighter-rouge">pip install hindsight</code> pulls in an unrelated Chrome-forensics package.</p>

<p>The local stack used <code class="language-plaintext highlighter-rouge">bge-small</code> embeddings plus a cross-encoder reranker on the GPU, with embedded pgvector Postgres, bound privately at <code class="language-plaintext highlighter-rouge">192.168.122.1:8888</code>. Because sudo on the host requires a password, the stack runs as a systemd user service with linger enabled rather than as a root service.</p>

<p>Then came the identity handoff. The EC2 gateway had to be stopped and disabled first, because only one process should own the Discord identity. After copying the live <code class="language-plaintext highlighter-rouge">state.db</code>, I started the VM gateway and watched it connect: Discord came up cleanly. Then I migrated 41 memories from Hindsight Cloud to the local backend.</p>

<p>Finally, the EC2 nightly stop/start schedules were disabled and the instance was stopped. Full termination is intentionally delayed for a short soak period, but the live agent has moved.</p>

<h2 id="act-6-what-actually-runs-locally-on-an-old-gpu">Act 6: what actually runs locally on an old GPU</h2>

<p>The GTX 1650 has 4 GB VRAM, but it is also driving the desktop. In practice only about 1 GB was comfortably free once the desktop was running. That budget decides what belongs on the GPU.</p>

<p>Embeddings fit. <code class="language-plaintext highlighter-rouge">bge-small</code> is roughly 130 MB, and in this setup it uses around 300 MB on the GPU while idle and closer to 800 MB under load. The reranker also fits. That is the sweet spot: memory indexing, similarity search, and reranking are data-resident workloads where local ownership matters. Keeping those vectors and the index on hardware I control is exactly the win I wanted.</p>
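
<p>The budget arithmetic is simple enough to write down. The figures below are the ones quoted above; treating “about 1 GB” as 1024 MB is my own rounding, and the reranker is left out because its footprint is not stated here.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FREE_VRAM_MB = 1024   # "about 1 GB" left once the desktop is running (rounded)
EMBED_IDLE_MB = 300   # bge-small resident on the GPU while idle
EMBED_PEAK_MB = 800   # bge-small under load


def headroom_mb(free_mb, peak_mb):
    """Memory left over at peak; a negative value means the workload does not fit."""
    return free_mb - peak_mb


def fits(free_mb, peak_mb):
    """Budget against peak usage, not idle usage."""
    return headroom_mb(free_mb, peak_mb) >= 0
</code></pre></div></div>

<p>With these numbers the embedding model leaves roughly 224 MB of headroom at peak, which is why anything sized in gigabytes is out of the question on this card.</p>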

<p>A capable reasoning LLM does not fit. Hindsight’s judgment step — deciding what is durable, what is stale, what should be merged, and what should become a skill — needs model quality. Trying to cram that into the leftover VRAM would be worse than honest cloud use.</p>

<p>So the final design is hybrid. Embeddings, reranking, and the vector store run locally. The narrow reasoning step uses a cheap cloud LLM, currently DeepSeek’s fast model tier. That is not ideological purity, but it is a good engineering trade.</p>

<p>“Local memory” does not have to mean “local everything.” For me it means the data and the index stay mine, and only the part that genuinely benefits from a capable external model leaves the machine.</p>

<h2 id="gotchas-i-would-repeat-to-myself-next-time">Gotchas I would repeat to myself next time</h2>

<p>First: dry-run dependency-heavy installs. If a memory backend wants to install CUDA on a GPU-less box or downgrade a security-sensitive package in the agent’s own virtualenv, stop.</p>

<p>Second: optional dependencies are still dependencies. A fresh Hermes install can look healthy while missing runtime pieces such as <code class="language-plaintext highlighter-rouge">discord.py</code> for the gateway or <code class="language-plaintext highlighter-rouge">hindsight-client</code> for the memory tool. A config doctor that says a provider is available is not the same thing as proving the import path works.</p>
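
<p>A small check along those lines, run with the agent’s own interpreter. The helper name is hypothetical. <code class="language-plaintext highlighter-rouge">discord</code> is the import name that <code class="language-plaintext highlighter-rouge">discord.py</code> provides; the import name for <code class="language-plaintext highlighter-rouge">hindsight-client</code> is an assumption on my part and should be confirmed against the installed package.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import importlib


def missing_modules(names):
    """Return the modules that fail to import in the current interpreter."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Example call; "hindsight_client" is an assumed import name.
# missing_modules(["discord", "hindsight_client"])
</code></pre></div></div>

<p>An empty list means the imports actually resolve, which is a stronger statement than a configuration check reporting that a provider is available.</p>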

<p>Third: package names matter. <code class="language-plaintext highlighter-rouge">hindsight-all</code> and <code class="language-plaintext highlighter-rouge">hindsight-client</code> are the Vectorize/Hindsight packages for agent memory. <code class="language-plaintext highlighter-rouge">hindsight</code> is something else.</p>

<p>Fourth: boring infrastructure constraints dominate the story. ARM versus x86 affects Docker images. Password-gated sudo affects service design. A single disk changes backup and <code class="language-plaintext highlighter-rouge">/data</code> assumptions. A small GPU shared with the desktop is not a local-AI fantasy box.</p>

<p>Finally: test the path you think you migrated. An agent can answer correctly from transcript search even when long-term memory did not fire. If you want to prove memory routing, test with an empty context and a question only the new memory backend should know.</p>

<h2 id="the-takeaway">The takeaway</h2>

<p>A rented cloud box is the right way to discover whether a self-hosted agent deserves to exist. It is cheap, reversible, and avoids turning a curiosity into home-lab plumbing too early.</p>

<p>But once the agent becomes useful, the constraints that made the trial cheap start to cap it. Small ARM CPU. No GPU. Fragile disk layout. Shared virtualenv pressure. SaaS memory for the agent’s most personal data.</p>

<p>Repurposing hardware I already owned changed the shape of the system. The desktop under the desk did not become a frontier-model server, and it did not need to. It became a place to keep the agent’s data-resident workloads close: memory embeddings, reranking, vector search, web-reading infrastructure, and private services reachable only by the VM.</p>

<p>The cloud did not disappear. It got narrower. Instead of renting a whole always-on environment for everything, I now pay external providers for the slice that still needs them: high-quality reasoning and model calls. The rest lives on hardware I control.</p>

<p>That feels like the right lesson from this migration. Self-hosting an agent is not about pretending the cloud has no value. It is about learning which parts of the agent are yours enough, private enough, and repetitive enough that they should stop being rented.</p>]]></content><author><name>Sakul Learning</name></author><category term="ai-agents" /><category term="hermes-agent" /><category term="self-hosting" /><category term="migration" /><category term="aws" /><category term="gpu" /><category term="memory" /><category term="hindsight" /><category term="firecrawl" /><summary type="html"><![CDATA[A few days ago my Hermes Agent lived on a rented AWS box. Today it lives in an Ubuntu VM on a desktop under the desk, with a local memory stack reachable over a private virtual network and just enough GPU to make the parts that matter feel owned.]]></summary></entry><entry><title type="html">Building a Source-Grounded Blogging Workflow for AI Agents</title><link href="https://sakul-learning.github.io/2026/06/03/ai-agent-blogging-workflow/" rel="alternate" type="text/html" title="Building a Source-Grounded Blogging Workflow for AI Agents" /><published>2026-06-03T00:00:00+00:00</published><updated>2026-06-03T00:00:00+00:00</updated><id>https://sakul-learning.github.io/2026/06/03/ai-agent-blogging-workflow</id><content type="html" xml:base="https://sakul-learning.github.io/2026/06/03/ai-agent-blogging-workflow/"><![CDATA[<h2 id="working-thesis">Working thesis</h2>

<p>AI-authored summary blogs should separate source collection from publication. Drafts should be versioned, but not public, until the evidence base is strong enough to support synthesis.</p>

<h2 id="source-map">Source map</h2>

<ul>
  <li>GitHub Pages provides a low-friction public publishing target.</li>
  <li>Jekyll <code class="language-plaintext highlighter-rouge">_drafts/</code> provides a simple private holding area for unfinished posts.</li>
  <li>Topic-specific source folders preserve provenance while an article is still forming.</li>
</ul>

<h2 id="draft">Draft</h2>

<p>A useful AI blog should not turn every early note into a public article. The safer pattern is to collect sources first, write a provisional draft second, and publish only after the draft has enough supporting material and review.</p>

<p>This site uses three states:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">sources/&lt;topic&gt;/</code> for raw source notes and quotes.</li>
  <li><code class="language-plaintext highlighter-rouge">_drafts/&lt;topic&gt;.md</code> for unpublished synthesis.</li>
  <li><code class="language-plaintext highlighter-rouge">_posts/YYYY-MM-DD-&lt;topic&gt;.md</code> for final public articles.</li>
</ol>
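
<p>The step from the second state to the third is a rename with a date prefix. A minimal sketch, assuming drafts sit in <code class="language-plaintext highlighter-rouge">_drafts/</code> directly under the site root; the function names are hypothetical and this is not the script the site uses.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datetime import date
from pathlib import Path


def post_path(draft: Path, published: date) -> Path:
    """Map _drafts/TOPIC.md to _posts/YYYY-MM-DD-TOPIC.md under the same site root."""
    site_root = draft.parent.parent
    return site_root / "_posts" / f"{published.isoformat()}-{draft.stem}.md"


def promote(draft: Path, published: date) -> Path:
    """Move a reviewed draft into _posts/ and return its new location."""
    target = post_path(draft, published)
    target.parent.mkdir(exist_ok=True)
    draft.rename(target)
    return target
</code></pre></div></div>

<p>In a git-tracked site the move would more likely be done with <code class="language-plaintext highlighter-rouge">git mv</code>, so that the rename is recorded alongside the rest of the draft’s history.</p>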

<p>That keeps iteration visible in git history without pushing unfinished thinking onto the public site.</p>]]></content><author><name>Sakul Learning</name></author><category term="ai-agents" /><category term="publishing" /><category term="workflow" /><summary type="html"><![CDATA[Working thesis]]></summary></entry><entry><title type="html">Evolving Spec-Driven Development</title><link href="https://sakul-learning.github.io/2026/06/03/evolving-spec-driven-development/" rel="alternate" type="text/html" title="Evolving Spec-Driven Development" /><published>2026-06-03T00:00:00+00:00</published><updated>2026-06-03T00:00:00+00:00</updated><id>https://sakul-learning.github.io/2026/06/03/evolving-spec-driven-development</id><content type="html" xml:base="https://sakul-learning.github.io/2026/06/03/evolving-spec-driven-development/"><![CDATA[<h2 id="short-summary">Short summary</h2>

<p>The first version of spec-driven development is easy to describe: write down what good looks like before asking people or agents to build it.</p>

<p>That remains true, but it is no longer enough.</p>

<p>As AI-assisted engineering gets more capable, the specification stops being a passive document. It becomes shared infrastructure: a durable place where teams record intent, surface decisions, coordinate human review, route work to agents, detect drift, and learn from implementation.</p>

<p>That is the direction <a href="https://specledger.io">Specledger</a> points toward. It frames SDD as a collaborative platform rather than a folder of documents: requirements, design, implementation, checkpoints, session history, dashboards, deltas, and human decision points all tied back to a shared source of truth.</p>

<p>The evolution is from <strong>spec as document</strong> to <strong>spec as ledger</strong>.</p>

<h2 id="the-old-sdd-baseline">The old SDD baseline</h2>

<p>In the earlier article, I described spec-driven development as the bridge between research and development.</p>

<p>Research discovers what good looks like. Specifications make that definition inspectable. Development executes against it. Feedback keeps the system honest.</p>

<p>That model is still the foundation. A useful specification turns vague intent into something people can review:</p>

<ul>
  <li>What problem are we solving?</li>
  <li>What behavior matters?</li>
  <li>What is explicitly out of scope?</li>
  <li>What constraints are non-negotiable?</li>
  <li>What examples prove the work is correct?</li>
  <li>What risks or assumptions still need validation?</li>
</ul>

<p>This already makes software development better because it moves feedback earlier. It is cheaper to fix a sentence than a production incident.</p>

<p>But AI agents change the pressure on the process.</p>

<p>If a human misreads vague intent, the team may lose a day. If a powerful agent misreads vague intent, it can produce a large, plausible, internally consistent wrong implementation very quickly. Cheap code generation makes the quality of the target more important, not less.</p>

<p>So SDD needs to evolve beyond “write a spec, then implement.” It needs an operating model for collaboration.</p>

<p>An exported ChatGPT conversation, <strong>“Spec Driven Review Process”</strong>, makes this more concrete. The starting point was a Product Owner describing how much real review work happens before code review: reading a spec for implicit intent, inconsistency, terminology drift, muddled concepts, scope that can be reduced, phase boundaries, roadmap consequences, and decisions that will constrain future work.</p>

<p>That is a useful correction to how people often talk about AI coding. The review surface is not only the pull request. In SDD, the specification itself needs review because it is the thing the implementation will optimize against.</p>

<p>The best short version from that conversation was:</p>

<blockquote>
  <p>Does this document create a coherent, executable path from intent to delivery while minimizing ambiguity, risk, waste, and future rework?</p>
</blockquote>

<p>That question belongs before implementation. It turns spec review into an explicit quality function rather than an informal human habit.</p>

<h2 id="specledgers-useful-reframing">Specledger’s useful reframing</h2>

<p>Specledger’s product language is direct: it wants to be the single source of truth for spec-driven development. The important part is not that it manages specs but that it treats SDD as a coordination problem.</p>

<p>The platform emphasizes:</p>

<ul>
  <li><strong>human dashboards</strong> for tracking, reviewing, and steering AI collaboration</li>
  <li><strong>spec deltas and checkpoints</strong> so changes leave a trail</li>
  <li><strong>session indexing</strong> so AI work becomes reusable organizational context</li>
  <li><strong>multi-repo support</strong> for features that do not fit neatly inside one repository</li>
  <li><strong>CLI bootstrap and agent compatibility</strong> so the workflow can live inside normal engineering tools</li>
</ul>

<p>That is a stronger framing than “documentation for AI.” Documentation is something you read. A ledger is something teams use to coordinate, audit, and reconcile reality.</p>

<p>In a serious AI-assisted workflow, the core question is not “can the agent write code?” The question is “can the team keep intent, implementation, review, and future memory aligned while the agent writes code?”</p>

<p>Specledger is interesting because it attacks that alignment problem directly.</p>

<h2 id="the-command-prompts-show-the-workflow-shape">The command prompts show the workflow shape</h2>

<p>The most concrete evidence is in the repositories’ <code class="language-plaintext highlighter-rouge">.agents/commands</code> prompts.</p>

<p>The standard Specledger command set looks like a complete SDD lifecycle:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">/specledger.constitution</code> establishes project principles.</li>
  <li><code class="language-plaintext highlighter-rouge">/specledger.specify</code> turns a natural language feature description into a structured spec.</li>
  <li><code class="language-plaintext highlighter-rouge">/specledger.clarify</code> asks targeted questions and writes the answers back into the spec.</li>
  <li><code class="language-plaintext highlighter-rouge">/specledger.plan</code> turns the spec into architecture, stack choices, phases, and design artifacts.</li>
  <li><code class="language-plaintext highlighter-rouge">/specledger.tasks</code> creates dependency-ordered implementation work, backed by the <code class="language-plaintext highlighter-rouge">sl issue</code> tracker.</li>
  <li><code class="language-plaintext highlighter-rouge">/specledger.verify</code> checks consistency across <code class="language-plaintext highlighter-rouge">spec.md</code>, <code class="language-plaintext highlighter-rouge">plan.md</code>, and <code class="language-plaintext highlighter-rouge">tasks.md</code> before implementation.</li>
  <li><code class="language-plaintext highlighter-rouge">/specledger.implement</code> executes the task plan.</li>
  <li><code class="language-plaintext highlighter-rouge">/specledger.checkpoint</code> performs a critical divergence review between implementation and plan.</li>
  <li><code class="language-plaintext highlighter-rouge">/specledger.spike</code> gives uncertainty a first-class research workflow.</li>
  <li><code class="language-plaintext highlighter-rouge">/specledger.checklist</code> creates focused review checklists.</li>
  <li><code class="language-plaintext highlighter-rouge">/specledger.onboard</code> walks a user through the whole process.</li>
</ol>

<p>That sequence matters because it is not just a prompt library. It is an attempt to make the implicit engineering loop explicit.</p>

<p>The commands repeatedly encode the same pattern:</p>

<ul>
  <li>discover the current feature context with <code class="language-plaintext highlighter-rouge">sl spec info</code></li>
  <li>read the generated artifact before editing it</li>
  <li>treat the constitution as authoritative</li>
  <li>preserve handoffs between phases</li>
  <li>make missing context visible instead of hallucinating it</li>
  <li>map requirements to design, tasks, and tests</li>
  <li>verify consistency before implementation</li>
  <li>checkpoint divergence after implementation</li>
</ul>

<p>This is SDD becoming operational.</p>

<h2 id="spec-review-becomes-a-first-class-workflow">Spec review becomes a first-class workflow</h2>

<p>The ChatGPT conversation also sketched a review pipeline that fits naturally into Specledger’s world.</p>

<p>The important move is to stop treating “review” as one generic pass. A useful spec review has angles, and each angle looks for different failure modes:</p>

<ul>
  <li><strong>Product Owner:</strong> implicit intent, terminology drift, muddled concepts, downscope opportunities, phase boundaries, stakeholder alignment, and roadmap impact</li>
  <li><strong>QA:</strong> testability, acceptance criteria, edge cases, negative paths, undefined success and failure states, and regression risk</li>
  <li><strong>Security:</strong> trust boundaries, authentication, authorization, secrets, data exposure, auditability, supply chain, abuse cases, and multi-tenancy</li>
  <li><strong>Architecture:</strong> system boundaries, module boundaries, API contracts, data flow, coupling, extensibility, reversibility, migration paths, and technical debt</li>
  <li><strong>Delivery:</strong> sequencing, hidden dependencies, team boundaries, critical path, milestones, and risk concentration</li>
  <li><strong>Constitution:</strong> whether the proposed work passes, weakly aligns with, or violates the project’s operating principles</li>
  <li><strong>Roadmap:</strong> decisions that constrain future roadmap items, scope that should move in or out of the workstream, and sequencing across future work</li>
  <li><strong>Operations, Cost, UX, and Data:</strong> production ownership, lifecycle cost, human workflow clarity, and data lifecycle concerns</li>
</ul>

<p>That list is valuable because it explains why a single reviewer often misses things. A QA reviewer is looking for objective testability. A roadmap reviewer is looking for future constraint. A constitution reviewer is looking for principle violations. A Product Owner is looking for intent clarity and scope control.</p>

<p>The review pipeline from the conversation also had a practical entrypoint: inspect which artifacts actually exist, then ask the user which artifacts and reviewers are in scope. Maybe the filesystem only has <code class="language-plaintext highlighter-rouge">spec.md</code>. Maybe it also has <code class="language-plaintext highlighter-rouge">plan.md</code>, <code class="language-plaintext highlighter-rouge">quickstart.md</code>, a constitution, and a roadmap. Review should still be possible with the available artifacts, but missing artifacts should be treated as missing context rather than automatic failure.</p>
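
<p>That entrypoint might look like the sketch below. The file names for the constitution and the roadmap are assumptions, as are the function names; the point is only that missing artifacts are recorded as missing context and the review proceeds with whatever exists.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pathlib import Path

# spec.md, plan.md and quickstart.md are named in the text;
# constitution.md and roadmap.md are assumed file names.
KNOWN_ARTIFACTS = ["spec.md", "plan.md", "quickstart.md", "constitution.md", "roadmap.md"]


def discover(feature_dir: Path):
    """Split the known artifacts into those present and those missing."""
    present = [n for n in KNOWN_ARTIFACTS if (feature_dir / n).is_file()]
    missing = [n for n in KNOWN_ARTIFACTS if n not in present]
    return present, missing


def review_scope(present, missing):
    """Missing artifacts narrow the review; they do not fail it on their own."""
    return {
        "review": present,
        "missing_context": missing,
        # Only a directory with no artifacts at all blocks the review.
        "blocked": not present,
    }
</code></pre></div></div>

<p>The resulting scope is what would be shown to the user before asking which artifacts and reviewers to include.</p>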

<p>That is exactly the kind of workflow a ledger can preserve. Each finding can be anchored to an artifact location, classified by severity, tied to evidence, assigned a suggested resolution, and turned into a question with a recommended answer when human judgment is required.</p>
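
<p>As a data structure, such a finding could be as small as the record below. The field names and the three-level severity scale are hypothetical and not Specledger’s schema; they simply mirror the properties listed in the previous paragraph.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass
from typing import Optional

SEVERITY_ORDER = ["blocker", "major", "minor"]  # hypothetical scale


@dataclass
class Finding:
    reviewer: str      # review angle, e.g. "QA" or "Security"
    artifact: str      # file the finding is anchored to, e.g. "spec.md"
    location: str      # section or line anchor inside the artifact
    severity: str      # one of SEVERITY_ORDER
    evidence: str      # quoted text or observation that supports the finding
    resolution: str    # suggested fix
    question: Optional[str] = None            # set only when a human must decide
    recommended_answer: Optional[str] = None  # offered alongside the question


def needs_human(findings):
    """Findings that became questions, most severe first."""
    open_items = [f for f in findings if f.question is not None]
    return sorted(open_items, key=lambda f: SEVERITY_ORDER.index(f.severity))
</code></pre></div></div>

<p>Sorting the open questions by severity gives a natural queue for asking them one at a time.</p>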

<p>The most interesting pattern was borrowed from Matt Pocock’s <code class="language-plaintext highlighter-rouge">grill-me</code> skill: ask one question at a time, provide a recommended answer, and inspect the codebase or artifacts instead of asking the user when the answer is already discoverable. For SDD, that becomes a disciplined way to resolve ambiguity without turning review into an endless meeting.</p>

<h2 id="from-linear-commands-to-collaborative-workflows">From linear commands to collaborative workflows</h2>

<p>The improved prompts in <code class="language-plaintext highlighter-rouge">skillrig/cli</code> push the idea further.</p>

<p>The experimental <code class="language-plaintext highlighter-rouge">specledger.implement-workflow</code> command intentionally skips the durable issue ledger for a faster path, then launches a deterministic multi-agent implementation workflow. The pipeline is not random fan-out. It is dependency ordered:</p>

<ul>
  <li>scaffold the public API first</li>
  <li>implement primitives in parallel where files are disjoint</li>
  <li>implement operations once primitives exist</li>
  <li>wire the CLI</li>
  <li>add tests</li>
  <li>verify and repair until checks pass</li>
  <li>synchronize documentation</li>
</ul>
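
<p>Dependency ordering with safe parallelism is a topological sort grouped into waves. The sketch below uses Python’s standard <code class="language-plaintext highlighter-rouge">graphlib</code>; the stage names paraphrase the list above, the two primitives are hypothetical, and the strictly linear tail is my assumption about how the later stages depend on each other.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
PIPELINE = {
    "scaffold_api": set(),
    "primitive_a": {"scaffold_api"},   # primitive_a / primitive_b are hypothetical
    "primitive_b": {"scaffold_api"},
    "operations": {"primitive_a", "primitive_b"},
    "wire_cli": {"operations"},
    "tests": {"wire_cli"},
    "verify_and_repair": {"tests"},
    "sync_docs": {"verify_and_repair"},
}


def waves(graph):
    """Group stages into waves; stages within one wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result
</code></pre></div></div>

<p>Only the second wave contains more than one stage, which matches the rule that primitives run in parallel once the public API scaffold exists. Whether two stages in the same wave can really run side by side still depends on their files being disjoint, which the graph alone does not check.</p>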

<p>The prompt is opinionated about how to use agents safely. Every subagent prompt must begin with a <code class="language-plaintext highlighter-rouge">SKILLS:</code> line, because the design artifacts say what to build while skills carry how the repository builds things. It also insists on final verification through <code class="language-plaintext highlighter-rouge">make check</code>.</p>

<p>That is a useful evolution. A spec alone can tell an agent the goal. A workflow tells the agent system how to divide labor without losing the goal.</p>

<p>The paired <code class="language-plaintext highlighter-rouge">specledger.verify-workflow</code> command is even more revealing. It verifies artifacts without <code class="language-plaintext highlighter-rouge">tasks.md</code> by sending multiple independent reviewers through the same spec, plan, research, data model, contracts, and quickstart. The prompt explicitly says independent reviewers catch different problems, then merges the findings into one report.</p>

<p>That is a mature SDD pattern:</p>

<blockquote>
  <p>Do not trust one confident pass. Use independent review to detect drift, stale wording, missing coverage, and contradictions before implementation starts.</p>
</blockquote>

<p>The <code class="language-plaintext highlighter-rouge">checkpoint-workflow</code> prompt then closes the loop after implementation. It takes an adversarial reviewer stance: assume the implementation has gaps until proven otherwise. It compares actual code and test results against the planned artifacts and classifies divergences.</p>

<p>This is the loop becoming inspectable:</p>

<ol>
  <li>specify intent</li>
  <li>clarify decisions</li>
  <li>plan implementation</li>
  <li>run multi-angle spec review</li>
  <li>resolve high-impact ambiguities one question at a time</li>
  <li>verify artifacts</li>
  <li>execute workflow</li>
  <li>checkpoint divergence</li>
  <li>update the spec or fix the implementation</li>
</ol>

<h2 id="a-platform-for-shared-sdd-workflows">A platform for shared SDD workflows</h2>

<p>This is where Specledger’s platform angle becomes important.</p>

<p>A local <code class="language-plaintext highlighter-rouge">.agents/commands</code> directory can encode a good workflow for one repository. But real SDD is social. Requirements come from users, product, design, engineering, security, QA, operations, and previous implementation history. If the workflow only lives in one agent’s context window, it is fragile.</p>

<p>A shared platform can give teams several things that plain prompt files cannot fully provide:</p>

<ul>
  <li>a common place to review requirements and decisions</li>
  <li>durable checkpoints that survive chat sessions</li>
  <li>traceability from spec changes to implementation changes</li>
  <li>visibility into which decisions were human-made and which were agent-proposed</li>
  <li>session indexing so prior work becomes searchable context</li>
  <li>multi-repo coordination for features that cross service boundaries</li>
  <li>shared workflow conventions across teams and tools</li>
</ul>

<p>That is why the word “ledger” is useful. A ledger is not just storage. It records changes in a way that can be inspected later.</p>

<p>For AI-assisted development, that is the difference between “the agent did something” and “the team can explain why the system changed.”</p>

<h2 id="the-human-role-moves-to-decision-quality">The human role moves to decision quality</h2>

<p>This also changes the human role.</p>

<p>In a naive agent workflow, the human asks for code, waits, and reviews the result. That is a weak loop because the most important decisions may already be buried inside generated implementation.</p>

<p>In an evolved SDD workflow, humans steer earlier:</p>

<ul>
  <li>approve or correct requirements</li>
  <li>resolve clarifying questions</li>
  <li>choose which reviewers and artifacts are in scope</li>
  <li>judge findings from Product, QA, Security, Architecture, Delivery, Constitution, Roadmap, SRE, Cost, UX, and Data angles</li>
  <li>review tradeoffs in the plan</li>
  <li>decide when ambiguity is acceptable</li>
  <li>choose which risks need spikes</li>
  <li>inspect verification findings before implementation</li>
  <li>checkpoint divergence after implementation</li>
</ul>

<p>The agent still executes, but execution is surrounded by decision points.</p>

<p>This matches Specledger’s stated principle: humans steer, AI executes. The value is not that humans micromanage every line of code. The value is that humans keep authority over intent, tradeoffs, and acceptance.</p>

<h2 id="sdd-as-organizational-memory">SDD as organizational memory</h2>

<p>The next step is memory.</p>

<p>A single spec helps one feature. A ledger of specs, decisions, checkpoints, sessions, and deltas helps the organization learn.</p>

<p>That matters because many engineering failures are not novel. Teams rediscover the same constraints, repeat the same architectural arguments, forget why a tradeoff was chosen, or lose context when a chat session ends.</p>

<p>Specledger’s session indexing and context-compounding language points at this deeper value. If every feature leaves behind a structured trail, future agents and future humans can start from a better place:</p>

<ul>
  <li>previous decisions are easier to find</li>
  <li>old assumptions can be challenged explicitly</li>
  <li>recurring review failures can become checklist items</li>
  <li>stable implementation patterns can become skills</li>
  <li>cross-repo dependencies can be made visible instead of tribal</li>
</ul>

<p>The spec becomes more than a pre-code artifact. It becomes part of the team’s long-term memory.</p>

<h2 id="the-tension-speed-versus-durability">The tension: speed versus durability</h2>

<p>The <code class="language-plaintext highlighter-rouge">skillrig/cli</code> workflow prompts also expose a healthy tension.</p>

<p>The experimental implementation workflow says it skips the durable <code class="language-plaintext highlighter-rouge">sl issue</code> ledger because the quickstart is intentionally smaller. That is a real tradeoff. Sometimes a team wants the full traceable workflow. Sometimes it wants a faster, bounded, deterministic workflow that still reads the design artifacts and gates on checks.</p>

<p>This is probably where SDD will keep evolving.</p>

<p>Not every feature needs the same amount of ceremony. A tiny bug fix does not need the same ledger as a multi-repo payment integration. But every workflow still needs a way to preserve the right amount of intent, verification, and review.</p>

<p>The mature version of SDD is not maximum documentation. It is calibrated traceability.</p>

<p>Use more ledger when the risk, ambiguity, or coordination cost is high. Use lighter workflows when the target is already clear. But do not remove the feedback loop.</p>

<h2 id="why-this-matters-for-ai-engineering">Why this matters for AI engineering</h2>

<p>AI makes implementation faster, but it does not make intent obvious.</p>

<p>That creates a new bottleneck:</p>

<blockquote>
  <p>The scarce resource is not generated code. The scarce resource is shared, inspectable, correct intent.</p>
</blockquote>

<p>Spec-driven development began as a way to make intent explicit. Platforms like Specledger suggest the next stage: make intent collaborative, traceable, reviewable, executable, and memorable.</p>

<p>The practical shape is emerging:</p>

<ul>
  <li>specs define behavior</li>
  <li>plans connect behavior to architecture</li>
  <li>tasks or workflows divide execution</li>
  <li>verification checks alignment before code</li>
  <li>checkpointing checks divergence after code</li>
  <li>dashboards and deltas keep humans in the steering loop</li>
  <li>session indexes and skills let context compound</li>
</ul>

<p>That is how SDD evolves from a writing habit into an engineering system.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The future of spec-driven development is not simply better prompts or longer requirements documents.</p>

<p>It is shared workflow infrastructure.</p>

<p>Specledger is interesting because it treats SDD as a collaborative ledger of intent: a place where humans, AI agents, specs, plans, issues, checkpoints, reviews, and sessions can stay aligned.</p>

<p>That is the right direction for agentic software development. The more capable the agents become, the more important it is to know what they are supposed to optimize for, who approved the tradeoffs, how divergence is detected, and what the team learns from each loop.</p>

<p>The spec is no longer just where clarity lives.</p>

<p>It is where collaboration, control, and memory begin.</p>]]></content><author><name>Sakul Learning</name></author><category term="spec-driven-development" /><category term="specledger" /><category term="ai-agents" /><category term="software-engineering" /><category term="workflows" /><summary type="html"><![CDATA[Short summary]]></summary></entry><entry><title type="html">Migrating Out of ChatGPT: Memory You Own, on an Agent That Runs Your Code</title><link href="https://sakul-learning.github.io/2026/06/03/migrating-chatgpt-memories-agent-memory/" rel="alternate" type="text/html" title="Migrating Out of ChatGPT: Memory You Own, on an Agent That Runs Your Code" /><published>2026-06-03T00:00:00+00:00</published><updated>2026-06-03T00:00:00+00:00</updated><id>https://sakul-learning.github.io/2026/06/03/migrating-chatgpt-memories-agent-memory</id><content type="html" xml:base="https://sakul-learning.github.io/2026/06/03/migrating-chatgpt-memories-agent-memory/"><![CDATA[<p>If you pay for ChatGPT or Claude, your “memory” lives on someone else’s servers, governed by someone else’s product decisions, and bound to one assistant. This post is about migrating <em>out</em> of that — toward memory you own, on an agent you run yourself.</p>

<p>The reason it’s worth the effort isn’t just data ownership. It’s that a self-hosted agent can do things a subscription assistant won’t. The agent this is written for — a self-hosted <a href="https://hermes-agent.nousresearch.com/">Hermes</a> — doesn’t just <em>read</em> a pull request and opine. It checks out the branch, installs dependencies, <strong>runs the tests, and exercises real scenarios</strong> before it comments. That’s well beyond what a read-only “sandbox” memory provider is willing to give you.</p>

<blockquote>
  <p><strong>Security disclaimer.</strong> Running untrusted code from PR branches is powerful and dangerous. Doing it safely needs strict mechanisms — <strong>allowlists for trusted authors</strong>, scrubbing of untrusted inputs, and tight action boundaries — so a hostile PR can’t turn your agent into a foothold. We’ll cover that hardening in a future article. Until then: only point this kind of automation at repositories and authors you control.</p>
</blockquote>

<p>But before you migrate anything <em>in</em>, you need to know what you’re migrating <em>into</em>. The first problem in migrating memories out of ChatGPT is not extraction — it’s <strong>classification</strong>, and that only makes sense once you’ve chosen a memory architecture. So we’ll work in that order: the substrate, how Hermes memory actually works, the pitfalls we hit standing it up, and then — in the appendix — the validated pipeline for bringing your ChatGPT memories across.</p>

<h2 id="first-question-what-substrate-do-you-want-the-agent-to-trust">First question: what substrate do you want the agent to trust?</h2>

<p>A memory export is a mixture of preferences, project summaries, stale corrections, private facts, temporary commitments, and instructions that made sense in one product but are dangerous in another. Paste that blob straight in and you don’t get continuity — you get <strong>memory debt</strong>. The starting question isn’t “how do I copy memories?” It’s <em>what kind of substrate do I want the agent to trust?</em></p>

<p>For Hermes-style agents, four substrates are worth comparing before importing anything:</p>

<table>
  <thead>
    <tr>
      <th>Substrate</th>
      <th>Best at</th>
      <th>Weakness</th>
      <th>Migration role</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Obsidian</strong></td>
      <td>Human-owned notes, backlinks, long-form project knowledge</td>
      <td>Not agentic unless an agent deliberately searches/edits it</td>
      <td>Archive + review layer for imported memories before promotion</td>
    </tr>
    <tr>
      <td><strong>Honcho</strong></td>
      <td>AI-native memory, user modeling, semantic recall, inferred patterns</td>
      <td>Service dependency, abstraction over raw data, needs governance</td>
      <td>Reasoning layer for cross-session personalization and recall</td>
    </tr>
    <tr>
      <td><strong>Local filesystem memory</strong></td>
      <td>Transparent, inspectable, versionable Markdown/JSON in the workspace</td>
      <td>Becomes messy/duplicated/stale without curation</td>
      <td>Source-of-truth layer for durable facts, preferences, lessons</td>
    </tr>
    <tr>
      <td><strong>Self-hosted Postgres / pgvector</strong></td>
      <td>Owned database, embeddings, semantic search, exportable rows</td>
      <td>More ops than a hosted backend; easy to build a worse Honcho by accident</td>
      <td>Owned retrieval layer when you already run a compose stack (e.g. Firecrawl + Postgres)</td>
    </tr>
  </tbody>
</table>

<p>A few notes from evaluating each:</p>

<ul>
  <li><strong>Obsidian</strong> is excellent when the human wants a personal knowledge base — it’s a notebook first, an agent backend second. Its strength is reviewability: drop imported memories into folders, backlink and tag them, decide what deserves promotion. We treat it as <strong>a path worth evaluating but didn’t explore here</strong> — a great human-review staging area if you want one, orthogonal to the agent’s live recall.</li>
  <li><strong>Honcho</strong> is almost the opposite — built for agents that need statefulness: it stores messages and events, reasons over them in the background, builds representations of users and agents, and returns prompt-ready context. Powerful, because the agent doesn’t manually maintain every fact. We researched self-hosting it seriously, and the catch on a small box is real: it’s several long-running services (API, a background “deriver”, Postgres+pgvector, Redis) and the deriver <strong>must</strong> call out to an LLM/embedding provider, so it’s neither free nor fully local unless you accept a trusted external endpoint. Strong as a <em>reasoning</em> layer; heavy as the <em>only</em> layer.</li>
  <li><strong>Local filesystem memory</strong> is the simplest and most auditable substrate. <a href="https://docs.openclaw.ai/concepts/memory">OpenClaw’s docs</a> make it explicit: the agent remembers by writing plain Markdown; the model only “remembers” what’s on disk. Hermes has the same spirit. The advantage is transparency; the danger is entropy.</li>
  <li><strong>Owned Postgres / pgvector</strong> is the pragmatic fourth option if you already run the infrastructure — a Firecrawl compose stack with Postgres gives you most of an owned semantic-search layer if the image has (or can add) pgvector. But “nearest-neighbour search” is retrieval, not memory judgment: a table can find similar memories; it can’t decide which preference is global, which note is stale, or which lesson should become a skill.</li>
</ul>

<p>The lesson is to <strong>layer them, not collapse them</strong>: source notes/Obsidian for raw imported material and human review; local files for curated, durable, inspectable facts; a reasoning/semantic provider (Honcho, Hindsight) for inferred recall; owned Postgres/pgvector when you need local retrieval without surrendering the data plane. Keep the export path under your control throughout.</p>

<h2 id="how-hermes-agent-memory-actually-works">How Hermes agent memory actually works</h2>

<p>Concretely, Hermes starts with a built-in memory that is just plain Markdown: a <code class="language-plaintext highlighter-rouge">MEMORY.md</code> of durable facts and a <code class="language-plaintext highlighter-rouge">USER.md</code> profile, always loaded, transparent, version-controllable — the model only “remembers” what’s literally on disk. On top of that it supports a <strong>pluggable external provider</strong> for semantic recall, in a layered model: the files stay the inspectable <strong>source of truth</strong>; the provider adds embedding search and synthesis and is swappable.</p>

<p>That’s the design. Standing it up taught us where the sharp edges are.</p>

<h2 id="the-pitfalls-we-hit-standing-it-up">The pitfalls we hit standing it up</h2>

<h3 id="1-the-built-in-file-memory-has-a-silent-ceiling">1. The built-in file memory has a silent ceiling</h3>

<p>Built-in memory has a character cap (ours was 2,200). One day the agent quietly started <strong>refusing to store new memories</strong>: <code class="language-plaintext highlighter-rouge">Memory at 2,113/2,200 chars; adding this entry would exceed the limit</code>. It didn’t fail loudly; it just stopped learning. A flat, growing file also accumulates <strong>entropy</strong> (duplicates, contradictions) and offers <strong>no semantic recall</strong>: the agent can find only what it wrote down verbatim and can grep for. The file is great as a source of truth, but you want a semantic layer on top.</p>
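
<p>One way to make the ceiling loud is to treat an over-cap write as an error that reaches a human rather than a message the agent reads and moves past. The following is a minimal sketch of that idea, not Hermes' code: <code class="language-plaintext highlighter-rouge">append_memory</code> and <code class="language-plaintext highlighter-rouge">MemoryFullError</code> are hypothetical names, and the 2,200-character default simply mirrors the cap we saw.</p>

```python
class MemoryFullError(Exception):
    """Raised when an entry would push the memory file past its character cap."""


def append_memory(current: str, entry: str, cap: int = 2200) -> str:
    """Return the new memory text, or raise loudly if the cap would be exceeded."""
    candidate = current + ("\n" if current else "") + entry
    if len(candidate) > cap:
        raise MemoryFullError(
            f"Memory at {len(current):,}/{cap:,} chars; "
            f"adding this entry ({len(entry)} chars) would exceed the limit"
        )
    return candidate


memory = append_memory("", "prefers concise technical answers")
print(len(memory))
```

<p>A caller can catch the error and trigger a consolidation pass or an alert, which is better than discovering weeks later that nothing new was stored.</p>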

<h3 id="2-local-embedded-backends-can-try-to-reshape-your-runtime">2. “Local embedded” backends can try to reshape your runtime</h3>

<p>Hindsight has a tempting local-embedded mode: an on-instance daemon with built-in Postgres that auto-stops when idle, local embeddings, your own LLM key. Lovely on paper. But a dry-run of the install resolved to <strong>159 packages including the full CUDA toolkit</strong> (useless on a GPU-less ARM box) and wanted to <strong>downgrade <code class="language-plaintext highlighter-rouge">cryptography</code> inside the agent’s own virtualenv</strong>. Lesson: <code class="language-plaintext highlighter-rouge">--dry-run</code> any backend install and read what it touches. A memory layer should never get to mutate your agent’s dependencies.</p>

<h3 id="3-just-run-mem0-locally-is-not-a-flag-you-flip">3. “Just run mem0 locally” is not a flag you flip</h3>

<p>mem0 is the obvious open-source choice, and Hermes ships a mem0 plugin — but the stock plugin is <strong>Mem0-Cloud only</strong>: it instantiates the hosted client and needs a Mem0 API key. Self-hosted/OSS mode is an <em>open, unmerged feature request</em>, not shipped. “Local mem0” means writing a custom provider, not selecting one.</p>

<h3 id="4-a-cheap-llm-key-is-not-an-embeddings-key">4. A cheap LLM key is not an embeddings key</h3>

<p>We wanted everything on one inexpensive DeepSeek key. Worth knowing: <strong>DeepSeek’s API has no embeddings endpoint</strong>; it offers chat completions only. Any vector backend still needs a separate embedding model. The clean answer is a small <em>local</em> embedder (<code class="language-plaintext highlighter-rouge">bge-small</code>, 384-dim, ~100 MB, no key), at the cost of another resident process on a RAM-tight box.</p>

<h3 id="5-ladybug-and-holographic--local-options-worth-knowing">5. Ladybug and Holographic — local options worth knowing</h3>

<p>Two local providers are worth a mention even though we didn’t adopt them. <strong>Ladybug</strong> is a community plugin that backs memory with an embedded graph database (a fork of Kuzu), giving memories a <em>typed, linked</em> model — preferences, facts, projects, people, events — with importance scores and named edges, and needing no API key. That data model is more expressive than a flat vector store, and it’s a <strong>genuinely promising</strong> project. But it’s <strong>young — only about two months old, a couple of commits, effectively a single maintainer</strong> — and it loads its stack (graph DB + ONNX embeddings) in-process and resident, with no idle release. One to <em>watch and experiment with</em>, not yet to depend on for a daily driver. <strong>Holographic</strong>, by contrast, is Hermes’ zero-dependency escape hatch: pure-Python Holographic Reduced Representations over SQLite — no LLM, no embeddings, no network, no keys. Tiny and fully local; the trade-off is that recall is <em>algebraic</em>, not semantic.</p>
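
<p>To make "algebraic, not semantic" concrete, here is a toy version of the core Holographic Reduced Representation operations in pure Python. It is a textbook-style sketch under stated assumptions (random Gaussian vectors, circular convolution for binding, involution as the approximate inverse) and says nothing about how the Hermes provider is actually implemented or how it uses SQLite.</p>

```python
import math
import random


def random_vector(n, rng):
    """Draw a vector with elements from N(0, 1/n), the usual HRR convention."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


def bind(a, b):
    """Circular convolution: the HRR binding operator."""
    n = len(a)
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]


def involution(a):
    """Approximate inverse used for unbinding: element i becomes a[-i mod n]."""
    n = len(a)
    return [a[(-i) % n] for i in range(n)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
n = 512
role, filler, unrelated = (random_vector(n, rng) for _ in range(3))

trace = bind(role, filler)                 # store the pair "role -> filler"
recovered = bind(involution(role), trace)  # unbind using the role as the cue

print(round(cosine(recovered, filler), 2))     # clearly positive
print(round(cosine(recovered, unrelated), 2))  # close to zero
```

<p>Recall works only when the cue is the same vector that was bound in. A paraphrased or merely similar query has no reason to land near the stored role, which is the practical meaning of the trade-off.</p>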

<h3 id="6-mind-the-auxiliary-llm-layer--its-where-memory-quietly-breaks">6. Mind the auxiliary-LLM layer — it’s where memory quietly breaks</h3>

<p>The pitfall most people miss. Your agent doesn’t make one LLM call per turn; it makes <strong>many</strong>, most not to your main model: <strong>context compression</strong>, <strong>title generation, triage, profile description, kanban decomposition</strong>, and — crucially — <strong>memory itself</strong> (extracting facts to <em>retain</em>, reasoning over a <em>recall</em>) are model calls separate from the main turn.</p>

<p>Two things bit us. First, these auxiliaries defaulted to providers we had <strong>no credit or auth</strong> for, so they failed, and compression was set to <strong>fail <em>open</em></strong>: when the summarizer errored, it proceeded <em>without</em> a summary and <strong>silently dropped context the agent needed</strong>. Memory loss that looks like the model “forgetting” is often a broken auxiliary. We repointed every auxiliary to one funded, cheap model. Second, in <strong>cloud</strong> memory the retain/recall calls are <em>metered</em>, so the same levers that control RAM locally now control your bill: retain less often and keep the recall budget lean. Treat the auxiliary layer as first-class: configure it, fund it, pick a cheap model on purpose.</p>
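
<p>The difference between failing open and failing closed is small in code and large in effect. The sketch below is a simplified model of a compression step, not Hermes' implementation: the function names, the "keep the last two messages" rule, and the error text are all assumptions made for illustration.</p>

```python
def compress_context(messages, summarize, fail_closed=True):
    """Replace older messages with a summary; on error, keep them if fail_closed."""
    older, recent = messages[:-2], messages[-2:]
    if not older:
        return messages
    try:
        summary = summarize(older)
    except Exception as exc:
        if fail_closed:
            # Keep the full history and surface the failure instead of hiding it.
            print(f"compression skipped: {exc}")
            return messages
        # Fail-open behaviour: carry on without a summary, silently losing 'older'.
        return recent
    return [f"[summary] {summary}"] + recent


def broken_summarizer(_messages):
    """Stand-in for an auxiliary model call that has no credit or auth."""
    raise RuntimeError("auxiliary model: no credit")


history = ["task: review the PR", "step 1 done", "step 2 done", "what next?"]
print(compress_context(history, broken_summarizer, fail_closed=False))
print(compress_context(history, broken_summarizer, fail_closed=True))
```

<p>In the fail-open call the original task instruction is gone and nothing reports it. Failing closed costs tokens, but the failure is visible and the instruction survives.</p>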

<h3 id="7-for-a-small-box-cloud-memory-is-a-reasonable-default--with-an-escape-hatch">7. For a small box, cloud memory is a reasonable default — with an escape hatch</h3>

<p>Stack the constraints — 8 GB, no GPU, disk pressure, one cheap LLM key — and a <strong>cloud</strong> memory service is the better engineering choice for <em>this</em> box; the agent keeps only a light HTTP client. <strong>mem0 Cloud (free Hobby)</strong> caps at ~<strong>1,000 retrievals/month</strong>, which an always-on agent burns in days; <strong><a href="https://hindsight.vectorize.io/">Hindsight Cloud</a></strong> is usage-based with <strong>no request wall</strong>, tunable via retain/recall frequency. We chose Hindsight Cloud, keeping built-in <code class="language-plaintext highlighter-rouge">MEMORY.md</code> as the always-on source of truth. The one rule: pick a provider with a real <strong>export</strong> path so cloud stays low-regret — which is also what makes the appendix’s migration work in reverse.</p>

<h3 id="8-a-task-board-is-working-memory-too">8. A task board is working memory, too</h3>

<p>The most useful thing we learned wasn’t about the store. During parallel work — several Discord threads open at once — the agent kept “forgetting” a task, then lost it after a compression. Two facts: <strong>each Discord thread is its own isolated session</strong> (they don’t share live context), and compression can summarize an in-flight instruction away. The fix isn’t more memory tokens — it’s a <strong>shared task board</strong>. Hermes’ SQLite-backed <code class="language-plaintext highlighter-rouge">kanban</code> is durable across every session, thread, and scheduled job, and survives compression. We wired our PR-reviewer to record its work there: the script deterministically creates one task per PR (idempotent — re-reviews append rather than duplicate), the agent only <em>comments</em> progress, and the script owns closing/blocking it, so a mid-review compression can’t orphan it. The task board as first-class working memory is a direction worth exploring much further.</p>
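
<p>The idempotent "one task per PR" rule can be sketched with plain SQLite. The schema and function names below are hypothetical and much simpler than a real board (Hermes' own <code class="language-plaintext highlighter-rouge">kanban</code> schema is not shown here); the point is only that a deterministic key plus <code class="language-plaintext highlighter-rouge">INSERT OR IGNORE</code> lets re-reviews append instead of duplicate.</p>

```python
import sqlite3


def open_board(path=":memory:"):
    """Create a minimal task board; this schema is illustrative only."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS tasks (key TEXT PRIMARY KEY, status TEXT NOT NULL)")
    db.execute("CREATE TABLE IF NOT EXISTS comments (key TEXT NOT NULL, body TEXT NOT NULL)")
    return db


def record_review(db, repo, pr_number, note):
    """Ensure exactly one task per PR, then append the note as a comment."""
    key = f"{repo}#{pr_number}"
    # INSERT OR IGNORE makes task creation idempotent: a re-review reuses the row.
    db.execute("INSERT OR IGNORE INTO tasks (key, status) VALUES (?, 'open')", (key,))
    db.execute("INSERT INTO comments (key, body) VALUES (?, ?)", (key, note))
    db.commit()
    return key


db = open_board()
record_review(db, "example/repo", 7, "first review started")
record_review(db, "example/repo", 7, "re-review after new commits")
print(db.execute("SELECT COUNT(*) FROM tasks").fetchone()[0])     # 1
print(db.execute("SELECT COUNT(*) FROM comments").fetchone()[0])  # 2
```

<p>Because the script, not the agent, derives the key and owns the task row, a compressed or restarted session can always find its work again by looking up the same key.</p>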

<h2 id="takeaways">Takeaways</h2>

<ul>
  <li>Choose a substrate deliberately and <strong>layer</strong> them; keep the built-in Markdown files as the inspectable source of truth and watch the size cap — it fails silently.</li>
  <li><code class="language-plaintext highlighter-rouge">--dry-run</code> any backend install. On a small/ARM/no-GPU box, “local embedded” can mean CUDA and a runtime-mutating dependency set.</li>
  <li>A cheap LLM key ≠ embeddings.</li>
  <li><strong>Configure and fund the auxiliary-LLM layer deliberately</strong> — compression and memory extraction are where context quietly disappears.</li>
  <li>For constrained boxes, cloud memory is fine <em>if</em> it has an export path.</li>
  <li>Don’t put durable, multi-step work in the chat; a shared task board is the cross-session working memory that holds up.</li>
</ul>

<p>The store is half the system. The other half is making sure the agent doesn’t lose the thread — and that you can always take your memory with you.</p>

<hr />

<h1 id="appendix-migrating-your-chatgpt-memories-in">Appendix: Migrating your ChatGPT memories in</h1>

<p>Now that there’s a system to migrate <em>into</em> — built-in files for durable facts, a semantic provider on top, source notes/Obsidian for raw material, skills for procedures, a task board for working memory — here’s the validated pipeline for bringing your ChatGPT context across. Remember the framing: the export is <strong>raw source material</strong>, not memory. The work is classification.</p>

<h2 id="two-exports-two-purposes">Two exports, two purposes</h2>

<p>There are two useful ways to get information out of ChatGPT, and they solve different problems.</p>

<p><strong>1. OpenAI’s official data export</strong> (archival). Per <a href="https://help.openai.com/en/articles/7260999-how-do-i-export-my-chatgpt-history-and-data">OpenAI’s Help Center</a>, go to <strong>Settings → Data Controls → Export Data</strong>, confirm, and download the emailed ZIP (the link expires; exports can take time). You can also request data through the Privacy Portal. This is good for archival analysis but <strong>too large and noisy to import directly</strong> — keep it as a private audit trail.</p>

<p><strong>2. A prompt-based memory export</strong> (the import). <a href="https://claude.com/import-memory">Anthropic’s Claude import page</a> describes the lightweight pattern: copy a prompt into the old provider, ask it to extract the important context, paste the result in. This doesn’t migrate every transcript — it migrates the <strong>distilled profile</strong>: preferences, instructions, projects, work style, recurring context. For Hermes, the same idea applies, but the output should be <strong>classified before it touches memory</strong>.</p>

<p>Here is a migration prompt adapted from the Claude-style workflow, tuned for agent memory systems rather than one destination product. The validation in it — verbatim preservation, the uncertain-source label, the strict one-category rule, the per-entry destination + confidence, and the “state whether more remain” loop — is what keeps the import honest:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Export all of my stored memories and any durable context you have learned about me from past conversations. Preserve my words verbatim where possible, especially for instructions, preferences, corrections, and standing rules.

Only include information that appears in stored memory or durable cross-chat context. Do not invent facts from this conversation. If you are unsure whether something is stored memory or inferred from chat history, label it as uncertain.

Classify each item into exactly one category:

1. Instructions: explicit rules I asked you to follow in future conversations, including tone, format, style, approvals, safety boundaries, and corrections to your behavior.
2. Identity: stable personal facts such as name, location, languages, interests, family, education, or public biographical context.
3. Career: current and past roles, organizations, skills, domains, and professional responsibilities.
4. Projects: ongoing or meaningful projects. Use one entry per project with project name, purpose, status, important decisions, and known repository or workspace if available.
5. Preferences: broad working style, tool choices, writing preferences, learning preferences, and taste.
6. Environment: durable machine, account, repository, deployment, or toolchain facts that future agents may need.
7. Procedures: reusable workflows, troubleshooting steps, or lessons learned that should become skills rather than ordinary memory.
8. Temporary or expiring context: reminders, deadlines, one-off tasks, phase status, pending approvals, or anything likely to become stale.
9. Contradictions and uncertainty: entries that conflict, appear outdated, or need human review.

For each entry, output:
- category
- date if known, otherwise [unknown]
- source confidence: high / medium / low
- exact memory text or closest faithful wording
- recommended destination: user profile, agent memory, skill, project note, scheduled task, archive, or discard
- reason for the recommendation

Wrap the entire export in a single Markdown code block. After the code block, state whether more memories remain.
</code></pre></div></div>

<p>If it says more remain, <strong>continue until the export is complete</strong>. Then do not paste it straight into the agent — put it in a source folder first.</p>

<h2 id="a-classification-scheme-for-the-import">A classification scheme for the import</h2>

<p>Sort each item by how it will be <em>used</em>:</p>

<ul>
  <li><strong>User profile</strong> — stable facts that shape interaction style (name, role, “prefers concise technical answers”). Compact and stable; these affect the agent across every session.</li>
  <li><strong>Agent memory</strong> — durable environment/project facts (“this repo’s test command is X”, “the blog uses Jekyll drafts before publishing”). Not temporary progress or stale artifact IDs.</li>
  <li><strong>Skills</strong> — procedures don’t belong in ordinary memory. “When debugging this pipeline, run these five commands in order” becomes a skill: triggers, exact steps, pitfalls, verification. Procedural memory is operational — a bad procedure makes the agent repeat the same failure forever.</li>
  <li><strong>Project notes / Obsidian</strong> — background too detailed for always-loaded memory but worth searching sometimes. If the agent may need to search it occasionally but shouldn’t read it every session, it belongs in notes, not memory.</li>
  <li><strong>Scheduled tasks / commitments</strong> — “follow up after the interview tomorrow” is a scheduled task, not a permanent memory. OpenClaw’s docs draw the same line: commitments differ from durable facts.</li>
  <li><strong>Discard / archive</strong> — stale phase status, old PR numbers and commit hashes, “currently working on…”, temporary approvals, contradicted preferences, sourceless facts. <em>A useful import is the smallest set that improves future behavior, not the largest.</em></li>
</ul>

<h2 id="the-pipeline">The pipeline</h2>

<ol>
  <li><strong>Export raw account data</strong> (official export) — audit trail, stored privately.</li>
  <li><strong>Run the structured export prompt</strong> — continue until no memories remain.</li>
  <li><strong>Save the export as source material</strong> — a source folder or Obsidian note; don’t inject it yet.</li>
  <li><strong>Normalize each entry</strong> — assign scope (global user / project / environment / procedure / temporary / archive), durability (permanent / long-lived / short-lived / expired), confidence (high / medium / low), source (explicit / inferred / unknown), destination.</li>
  <li><strong>Promote only high-confidence durable entries</strong> — compact facts → user profile / agent memory; reusable workflows → skills; detail → notes; stale → discard.</li>
  <li><strong>Run an initial consolidation/review pass</strong> — which entries contradict, which are too temporary, which instructions are unsafe without action boundaries (see the disclaimer up top), which should become skills, which need human review.</li>
  <li><strong>Schedule recurring consolidation (“dreaming”)</strong> — memory cleanup is routine, not one-time. Run it <em>asynchronously, off the critical path</em>; <em>don’t overwrite the source</em>; produce a <strong>reviewable diff</strong> (additions, removals, merges) with evidence; prefer recency only with evidence; and separate cleanup from promotion (a discovered lesson should still clear a confidence threshold or human review). A dream that can’t cite <em>why</em> it wants to delete or rewrite a memory shouldn’t be allowed to mutate the store.</li>
</ol>
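
<p>Steps 4 and 5 amount to a small routing function. The sketch below is one possible reading of them, not a reference implementation: the field values follow the normalization step above, while the rule order, the "human review" fallback, and the 280-character cut-off for moving detail into notes are assumptions you would tune.</p>

```python
from dataclasses import dataclass

NOTE_THRESHOLD = 280  # hypothetical cut-off: longer entries go to notes


@dataclass
class Entry:
    text: str
    scope: str       # global user / project / environment / procedure / temporary / archive
    durability: str  # permanent / long-lived / short-lived / expired
    confidence: str  # high / medium / low
    source: str      # explicit / inferred / unknown


def destination(entry: Entry) -> str:
    """Route one normalized entry; anything doubtful is held back from memory."""
    if entry.durability == "expired" or entry.scope == "archive":
        return "discard"
    if entry.scope == "temporary" or entry.durability == "short-lived":
        return "scheduled task"
    if entry.confidence != "high" or entry.source == "unknown":
        return "human review"
    if entry.scope == "procedure":
        return "skill"
    if len(entry.text) > NOTE_THRESHOLD:
        return "project note"
    if entry.scope == "global user":
        return "user profile"
    return "agent memory"


sample = Entry("prefers concise technical answers", "global user", "permanent", "high", "explicit")
print(destination(sample))
```

<p>The ordering is deliberate: staleness and low confidence are checked before any promotion rule, so an entry has to earn its way into always-loaded memory.</p>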

<p>The target is a memory system that is <strong>transparent</strong> (you can see what the agent believes), <strong>scoped</strong> (project memories don’t bleed across work), <strong>action-aware</strong> (approvals and risky instructions carry boundaries), and <strong>self-cleaning</strong>. Migrating memories isn’t a one-time copy-paste; it’s the start of a memory operating system you own.</p>

<h2 id="side-note-sharing-a-single-chatgpt-conversation-with-your-agent">Side note: sharing a single ChatGPT conversation with your agent</h2>

<p>The memory-export prompt above is for durable profile context. Sometimes you want a smaller move: take one useful ChatGPT conversation and hand it to your local agent as source material. Public ChatGPT share links are good for this because they can be fetched without giving the agent your ChatGPT account.</p>

<p>A practical setup is to install a public-share exporter such as <code class="language-plaintext highlighter-rouge">csctf</code> (<code class="language-plaintext highlighter-rouge">chat_shared_conversation_to_file</code>) and give Hermes a small procedural skill for it. Then the handoff becomes simple:</p>

<blockquote>
  <p>fetch this ChatGPT share into <code class="language-plaintext highlighter-rouge">/data</code>: <code class="language-plaintext highlighter-rouge">https://chatgpt.com/share/...</code></p>
</blockquote>

<p>The agent should download the share as Markdown and HTML, put it in a data mount or other agreed source folder, spot-check the result, and report the paths. That turns a one-off conversation into inspectable source material that can be summarized, linked from source notes, or promoted into a real Hermes skill later.</p>

<p>Here is the reusable Hermes skill, with machine-specific installation paths removed:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">chatgpt-share-export</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Use when the user asks to fetch, download, export, or archive a public ChatGPT shared conversation URL into a data directory using csctf.</span>
<span class="na">version</span><span class="pi">:</span> <span class="s">1.0.0</span>
<span class="na">author</span><span class="pi">:</span> <span class="s">Hermes Agent</span>
<span class="na">license</span><span class="pi">:</span> <span class="s">MIT</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">hermes</span><span class="pi">:</span>
    <span class="na">tags</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">chatgpt</span><span class="pi">,</span> <span class="nv">export</span><span class="pi">,</span> <span class="nv">shared-conversations</span><span class="pi">,</span> <span class="nv">csctf</span><span class="pi">,</span> <span class="nv">markdown</span><span class="pi">,</span> <span class="nv">archive</span><span class="pi">]</span>
    <span class="na">related_skills</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">hermes-agent</span><span class="pi">]</span>
<span class="nn">---</span>

<span class="gh"># ChatGPT Share Export</span>

<span class="gu">## Overview</span>

Use this skill to turn a public ChatGPT share URL such as <span class="sb">`https://chatgpt.com/share/...`</span> or <span class="sb">`https://chat.openai.com/share/...`</span> into local Markdown and HTML files under an agreed data or source-material directory.

The preferred tool is <span class="sb">`csctf`</span>, installed as a local CLI from <span class="sb">`chat_shared_conversation_to_file`</span>. Use a system Chromium or browser available on the host if Playwright's bundled Chromium cannot be installed for the platform.

<span class="gu">## When to Use</span>

Use when the user says things like:
<span class="p">
-</span> "fetch this ChatGPT share into <span class="sb">`/data`</span>"
<span class="p">-</span> "download this ChatGPT shared conversation"
<span class="p">-</span> "export this share URL as markdown/html"
<span class="p">-</span> "archive this ChatGPT conversation"
<span class="p">-</span> "run csctf on this ChatGPT share link"

Also applicable to csctf-supported public AI share links from Claude, Gemini, and Grok, but the primary expected trigger is ChatGPT share URLs.

Do not use this for private ChatGPT conversations that are not public share URLs. csctf fetches public share pages; it is not an authenticated ChatGPT account exporter.

<span class="gu">## Default Output Location</span>

Use the user's requested data directory by default, commonly:

<span class="p">```</span><span class="nl">text
</span>/data/chatgpt-shares
<span class="p">```</span>

For a URL with ID <span class="sb">`&lt;share-id&gt;`</span>, use this base filename pattern:

<span class="p">```</span><span class="nl">text
</span>&lt;data-dir&gt;/chatgpt-shares/chatgpt-share-&lt;share-id&gt;
<span class="p">```</span>

Expected outputs:

<span class="p">```</span><span class="nl">text
</span>&lt;data-dir&gt;/chatgpt-shares/chatgpt-share-&lt;share-id&gt;.md
&lt;data-dir&gt;/chatgpt-shares/chatgpt-share-&lt;share-id&gt;.html
<span class="p">```</span>

If the user asks for a different directory or filename, respect that.

<span class="gu">## Quick Recipe</span>
<span class="p">
1.</span> Extract the share ID from the URL.
<span class="p">2.</span> Ensure the output directory exists.
<span class="p">3.</span> Run csctf with an explicit <span class="sb">`--outfile`</span> base path.
<span class="p">4.</span> Spot-check the generated Markdown and HTML.
<span class="p">5.</span> Report the absolute paths and basic verification results.

Command template:

<span class="p">```</span><span class="nl">bash
</span><span class="nb">set</span> <span class="nt">-euo</span> pipefail
<span class="nb">export </span><span class="nv">BUN_INSTALL</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/.bun"</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$BUN_INSTALL</span><span class="s2">/bin:</span><span class="nv">$PATH</span><span class="s2">"</span>

<span class="nv">URL</span><span class="o">=</span><span class="s1">'https://chatgpt.com/share/&lt;share-id&gt;'</span>
<span class="nv">ID</span><span class="o">=</span><span class="s1">'&lt;share-id&gt;'</span>
<span class="nv">OUTDIR</span><span class="o">=</span><span class="s1">'&lt;data-dir&gt;/chatgpt-shares'</span>
<span class="nv">BASE</span><span class="o">=</span><span class="s2">"</span><span class="nv">$OUTDIR</span><span class="s2">/chatgpt-share-</span><span class="nv">$ID</span><span class="s2">"</span>

<span class="nb">mkdir</span> <span class="nt">-p</span> <span class="s2">"</span><span class="nv">$OUTDIR</span><span class="s2">"</span>
csctf <span class="s2">"</span><span class="nv">$URL</span><span class="s2">"</span> <span class="nt">--timeout-ms</span> 90000 <span class="nt">--outfile</span> <span class="s2">"</span><span class="nv">$BASE</span><span class="s2">"</span>
<span class="p">```</span>

If <span class="sb">`csctf`</span> is not on <span class="sb">`PATH`</span>, use the local binary path configured on that machine.

<span class="gu">## Verification / Spot Check</span>

After running, verify both files exist and are non-empty:

<span class="p">```</span><span class="nl">bash
</span><span class="nv">MD</span><span class="o">=</span><span class="s2">"</span><span class="nv">$BASE</span><span class="s2">.md"</span>
<span class="nv">HTML</span><span class="o">=</span><span class="s2">"</span><span class="nv">$BASE</span><span class="s2">.html"</span>
<span class="nb">stat</span> <span class="nt">-c</span> <span class="s1">'%n %s bytes'</span> <span class="s2">"</span><span class="nv">$MD</span><span class="s2">"</span> <span class="s2">"</span><span class="nv">$HTML</span><span class="s2">"</span>
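<span class="nb">wc</span> <span class="nt">-l</span> <span class="nt">-w</span> <span class="nt">-c</span> <span class="s2">"</span><span class="nv">$MD</span><span class="s2">"</span> <span class="s2">"</span><span class="nv">$HTML</span><span class="s2">"</span>
<span class="c"># test -s exits non-zero if a file is missing or empty.</span>
<span class="nb">test</span> <span class="nt">-s</span> <span class="s2">"</span><span class="nv">$MD</span><span class="s2">"</span>
<span class="nb">test</span> <span class="nt">-s</span> <span class="s2">"</span><span class="nv">$HTML</span><span class="s2">"</span>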
<span class="p">```</span>

Then inspect the beginning of the Markdown for title, source URL, retrieval timestamp, and role sections:

<span class="p">```</span><span class="nl">bash
</span>python3 - <span class="o">&lt;&lt;</span><span class="sh">'</span><span class="no">PY</span><span class="sh">'
from pathlib import Path
p = Path('&lt;data-dir&gt;/chatgpt-shares/chatgpt-share-&lt;share-id&gt;.md')
text = p.read_text(errors='replace')
for line in text.splitlines()[:80]:
    if line.strip():
        print(line[:220])
PY</span>
<span class="p">```</span>

Also confirm the HTML title/content references the same conversation:

<span class="p">```</span><span class="nl">bash
</span>python3 - <span class="o">&lt;&lt;</span><span class="sh">'</span><span class="no">PY</span><span class="sh">'
from pathlib import Path
p = Path('&lt;data-dir&gt;/chatgpt-shares/chatgpt-share-&lt;share-id&gt;.html')
text = p.read_text(errors='replace')
for needle in ['&lt;title&gt;', 'ChatGPT Conversation', 'Source:']:
    print(needle, needle in text)
PY</span>
<span class="p">```</span>

If the user asks for a deeper check, read targeted portions of the Markdown with <span class="sb">`read_file`</span>, or search for expected phrases using <span class="sb">`search_files`</span>.

If Playwright's browser download is unsupported on the host, configure csctf to use an available system Chrome/Chromium installation.

<span class="gu">## Common Pitfalls</span>
<span class="p">
1.</span> <span class="gs">**Playwright Chromium install can fail on some Linux/ARM combinations.**</span> Use <span class="sb">`PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 bun install`</span> and rely on a system Chrome/Chromium where needed.
<span class="p">
2.</span> <span class="gs">**Forgetting explicit `--outfile`.**</span> Without it, csctf writes to the current directory with an inferred filename. Use an explicit base path under the data/source directory so the user gets a stable, easy-to-find artifact.
<span class="p">
3.</span> <span class="gs">**Stopping after the CLI says success.**</span> Always spot-check file sizes, line counts, and the opening lines of the Markdown before reporting success.
<span class="p">
4.</span> <span class="gs">**Assuming this works for private conversations.**</span> It only works on public share URLs; private account history needs a different export path.
<span class="p">
5.</span> <span class="gs">**Treating the export as memory immediately.**</span> A conversation transcript is source material. Summaries, preferences, and procedures still need classification before promotion into memory or skills.

<span class="gu">## Reporting Format</span>

When done, tell the user:
<span class="p">
-</span> Markdown path
<span class="p">-</span> HTML path
<span class="p">-</span> file sizes / line counts
<span class="p">-</span> detected title
<span class="p">-</span> brief spot-check result

Example:

<span class="p">```</span><span class="nl">text
</span>Exported and spot-checked:

- &lt;data-dir&gt;/chatgpt-shares/chatgpt-share-&lt;id&gt;.md
- &lt;data-dir&gt;/chatgpt-shares/chatgpt-share-&lt;id&gt;.html

Title detected: &lt;title&gt;
Markdown: &lt;bytes&gt; bytes, &lt;lines&gt; lines
HTML: &lt;bytes&gt; bytes, &lt;lines&gt; lines
<span class="p">```</span>

<span class="gu">## Verification Checklist</span>
<span class="p">
-</span> [ ] URL is a public share URL.
<span class="p">-</span> [ ] Output directory exists under the requested data/source directory.
<span class="p">-</span> [ ] csctf completed with exit code 0.
<span class="p">-</span> [ ] Markdown and HTML files exist and are non-empty.
<span class="p">-</span> [ ] Markdown contains title, source URL, and role sections.
<span class="p">-</span> [ ] HTML contains the expected title or conversation heading.
<span class="p">-</span> [ ] Final response includes absolute paths.
</code></pre></div></div>

<p>That last pitfall is the important memory-system point: sharing a ChatGPT conversation with an agent is not the same as making it memory. The export is evidence. The agent still has to decide whether the conversation contains a durable user preference, a project note, a reusable procedure, a temporary task, or nothing worth promoting.</p>]]></content><author><name>Sakul Learning</name></author><category term="ai-agents" /><category term="memory" /><category term="chatgpt" /><category term="hermes-agent" /><category term="openclaw" /><category term="honcho" /><category term="obsidian" /><category term="hindsight" /><category term="kanban" /><category term="migration" /><summary type="html"><![CDATA[If you pay for ChatGPT or Claude, your “memory” lives on someone else’s servers, governed by someone else’s product decisions, and bound to one assistant. This post is about migrating out of that — toward memory you own, on an agent you run yourself.]]></summary></entry><entry><title type="html">Terraform Plugin Framework vs. formae’s Plugin SDK</title><link href="https://sakul-learning.github.io/2026/06/03/terraform-plugin-framework-vs-formae-plugin-sdk/" rel="alternate" type="text/html" title="Terraform Plugin Framework vs. formae’s Plugin SDK" /><published>2026-06-03T00:00:00+00:00</published><updated>2026-06-03T00:00:00+00:00</updated><id>https://sakul-learning.github.io/2026/06/03/terraform-plugin-framework-vs-formae-plugin-sdk</id><content type="html" xml:base="https://sakul-learning.github.io/2026/06/03/terraform-plugin-framework-vs-formae-plugin-sdk/"><![CDATA[<p>The most interesting difference between Terraform’s Plugin Framework and formae’s Plugin SDK is not that both are written in Go, or that both ask plugin authors to implement CRUD. The important difference is where each system places the boundary between a plugin, the engine, and the schema language.</p>

<p>Terraform’s framework is a mature provider-development layer around a versioned protocol. It sits between provider authors and Terraform Core, giving Go interfaces for providers, resources, data sources, functions, schemas, plan modification, state upgrades, and provider servers. The framework’s job is to make the Terraform Plugin Protocol usable without requiring every provider author to hand-roll gRPC protocol implementations.</p>

<p>formae’s SDK, as described in the Platform Engineering article, the Ergo primer copied into the same gist, and the public docs/repository, is newer and more opinionated about orchestration. The plugin is not just a provider binary answering Terraform Core’s RPCs. It is a resource actor in a larger agent system. The formae article opens with a useful division of responsibility: the agent knows how to schedule, queue, order, rate-limit, execute, and retry plugin operations; the resource plugin owns the actual API interaction. The plugin author’s public surface is deliberately small: implement resource CRUD, status, list, rate limits, discovery filters, and label configuration; put resource schemas in Pkl; ship a manifest.</p>

<p>That makes the comparison less like “which SDK is better?” and more like this:</p>

<blockquote>
  <p>Terraform optimizes for a stable, ecosystem-scale contract between Terraform Core, provider binaries, the Registry, and Terraform configuration. formae optimizes for an agent-managed resource lifecycle where plugin schema, discovery, async progress, and reconciliation are first-class platform concerns.</p>
</blockquote>

<p>Both are plugin systems. They solve different parts of the infrastructure problem.</p>

<h2 id="two-plugin-boundaries">Two plugin boundaries</h2>

<p>Terraform’s boundary is protocol-first. The Terraform Plugin Protocol is a versioned interface between Terraform CLI and provider plugins, implemented with Protocol Buffers and gRPC. The framework supports protocol versions 5 and 6; provider servers wrap a <code class="language-plaintext highlighter-rouge">provider.Provider</code> and expose a protocol-specific server implementation. In the cloned repository, <code class="language-plaintext highlighter-rouge">providerserver.NewProtocol5</code>, <code class="language-plaintext highlighter-rouge">NewProtocol6</code>, and <code class="language-plaintext highlighter-rouge">Serve</code> adapt a framework provider into <code class="language-plaintext highlighter-rouge">tfprotov5</code> or <code class="language-plaintext highlighter-rouge">tfprotov6</code> servers.</p>

<p>That protocol boundary matters because Terraform has an enormous provider ecosystem. Compatibility is not only a technical concern; it is a distribution model. Terraform Registry discovery, CLI compatibility, provider release versioning, and migration from SDKv2 all depend on the idea that a provider is a separately distributed executable speaking a known protocol.</p>

<p>formae’s boundary is agent-and-actor-first, and the formae article makes clear why that boundary exists. The first implementation loaded Go native plugins in-process with the standard <code class="language-plaintext highlighter-rouge">plugin</code> package: compile a shared object, export a <code class="language-plaintext highlighter-rouge">Plugin</code> symbol, <code class="language-plaintext highlighter-rouge">plugin.Open</code> it, assert it implements <code class="language-plaintext highlighter-rouge">ResourcePlugin</code>, then call it directly. That was elegant inside a monorepo, but not viable for a public SDK. Go native plugins require host and plugin to match on Go toolchain, shared package versions, build flags, and transitive dependencies. For an infrastructure plugin ecosystem wrapping many cloud SDKs, that is dependency lockstep disguised as extensibility.</p>
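<p>The in-process loading path described above can be sketched with Go’s standard <code class="language-plaintext highlighter-rouge">plugin</code> package. This is an illustration under stated assumptions, not formae’s original code: the one-method <code class="language-plaintext highlighter-rouge">ResourcePlugin</code> is a stand-in for the real interface, and <code class="language-plaintext highlighter-rouge">loadNative</code> and the <code class="language-plaintext highlighter-rouge">.so</code> path are hypothetical.</p>

```go
package main

import (
	"errors"
	"fmt"
	"plugin"
)

// ResourcePlugin is a one-method stand-in for the real interface.
type ResourcePlugin interface {
	Name() string
}

// loadNative opens a shared object built with `go build -buildmode=plugin`
// and looks for an exported symbol named "Plugin".
func loadNative(path string) (ResourcePlugin, error) {
	p, err := plugin.Open(path)
	if err != nil {
		return nil, fmt.Errorf("open %s: %w", path, err)
	}
	sym, err := p.Lookup("Plugin")
	if err != nil {
		return nil, fmt.Errorf("lookup Plugin: %w", err)
	}
	// For an exported variable, Lookup returns a pointer to it, so this
	// assertion succeeds only if that pointer type has the interface's methods.
	rp, ok := sym.(ResourcePlugin)
	if !ok {
		return nil, errors.New("exported Plugin symbol does not implement ResourcePlugin")
	}
	return rp, nil
}

func main() {
	// Hypothetical path; with no such file this prints the open error.
	if _, err := loadNative("./example-plugin.so"); err != nil {
		fmt.Println("load failed:", err)
	}
}
```

<p>Nothing in this code states the lockstep requirement, which is the catch: <code class="language-plaintext highlighter-rouge">plugin.Open</code> fails at run time when the shared object was built with a different toolchain or different versions of shared packages than the host.</p>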

<p>So formae kept the tiny <code class="language-plaintext highlighter-rouge">ResourcePlugin</code> interface but moved the execution boundary. The public documentation now says plugins run as separate processes or remote services and communicate with formae through documented public interfaces, not by being linked into formae. The repo’s <code class="language-plaintext highlighter-rouge">pkg/plugin/README.md</code> adds the implementation detail: the external-plugin entry point calls <code class="language-plaintext highlighter-rouge">sdk.RunWithManifest</code>, which reads <code class="language-plaintext highlighter-rouge">formae-plugin.pkl</code>, extracts schemas from <code class="language-plaintext highlighter-rouge">schema/pkl/PklProject</code>, wraps the user’s <code class="language-plaintext highlighter-rouge">ResourcePlugin</code>, starts an Ergo node, and announces capabilities to the agent. The <code class="language-plaintext highlighter-rouge">PluginActor</code> builds an announcement with supported resources, resource schemas, match filters, label config, namespace, version, and max request rate, then sends it to the agent’s <code class="language-plaintext highlighter-rouge">PluginCoordinator</code>.</p>

<p>That history sharpens the Terraform comparison. Terraform and formae both end up with separately distributed binaries, but for different reasons. Terraform’s separation is about an ecosystem-scale protocol contract between Core, providers, and the Registry. formae’s separation is about preserving a small Go authoring surface while escaping Go shared-library lockstep, isolating plugin failures, decoupling licenses, and giving the agent freedom to place plugin actors locally or remotely.</p>

<p>The Ergo primer explains the last point. Ergo brings Erlang/OTP-style actors to Go: nodes host lightweight processes, actors communicate by sending messages to PIDs, and the caller does not need to care whether the recipient is on the same node, a separate plugin process, or a remote satellite. The formae article calls this network transparency. In tests, the <code class="language-plaintext highlighter-rouge">PluginOperator</code> can run locally on the agent’s node. In the OSS agent, it can run as a local persistent plugin process. At larger scale, it can move behind satellite agents without changing the agent’s mental model of “send a message to the operator.”</p>

<p>So Terraform treats the plugin boundary as a versioned RPC surface that Terraform Core controls. formae treats it as a process/actor boundary where the plugin announces capabilities and the agent orchestrates resource work.</p>

<p>That is the first architectural trade-off:</p>

<table>
  <thead>
    <tr>
      <th>Dimension</th>
      <th>Terraform Plugin Framework</th>
      <th>formae Plugin SDK</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Primary boundary</td>
      <td>Terraform Plugin Protocol v5/v6</td>
      <td>Agent/plugin actor process boundary</td>
    </tr>
    <tr>
      <td>Why it exists</td>
      <td>Ecosystem compatibility between Terraform Core, providers, Registry releases, and HCL workflows</td>
      <td>Escape Go native-plugin lockstep while preserving a tiny Go interface and supporting local/remote plugin topology</td>
    </tr>
    <tr>
      <td>Transport</td>
      <td>gRPC/protobuf through terraform-plugin-go servers</td>
      <td>Ergo actor messaging, serialized across process boundary</td>
    </tr>
    <tr>
      <td>Core lifecycle owner</td>
      <td>Terraform Core plan/apply/refresh/import lifecycle</td>
      <td>formae agent scheduler, queue, reconciliation, discovery</td>
    </tr>
    <tr>
      <td>Plugin role</td>
      <td>Provider implementation for Terraform resources, data sources, functions, actions</td>
      <td>Resource operation worker that announces capabilities and executes CRUD/discovery</td>
    </tr>
    <tr>
      <td>Distribution emphasis</td>
      <td>Registry-compatible provider binaries</td>
      <td>Separately licensed plugins with manifest, schema package, and agent discovery</td>
    </tr>
  </tbody>
</table>

<h2 id="interfaces-broad-terraform-concepts-vs-one-resource-plugin-contract">Interfaces: broad Terraform concepts vs. one resource plugin contract</h2>

<p>Terraform’s framework decomposes provider development into many concepts. The core <code class="language-plaintext highlighter-rouge">provider.Provider</code> interface in the cloned repo requires <code class="language-plaintext highlighter-rouge">Metadata</code>, <code class="language-plaintext highlighter-rouge">Schema</code>, <code class="language-plaintext highlighter-rouge">Configure</code>, <code class="language-plaintext highlighter-rouge">DataSources</code>, and <code class="language-plaintext highlighter-rouge">Resources</code>. Optional interfaces add functions, ephemeral resources, list resources, actions, state stores, config validators, meta schemas, and more. The <code class="language-plaintext highlighter-rouge">resource.Resource</code> interface requires <code class="language-plaintext highlighter-rouge">Metadata</code>, <code class="language-plaintext highlighter-rouge">Schema</code>, <code class="language-plaintext highlighter-rouge">Create</code>, <code class="language-plaintext highlighter-rouge">Read</code>, <code class="language-plaintext highlighter-rouge">Update</code>, and <code class="language-plaintext highlighter-rouge">Delete</code>, while optional interfaces add import, configuration, validation, plan modification, state moves, state upgrades, identity, and identity upgrades.</p>

<p>This is a very Terraform-shaped design. It reflects the fact that Terraform is not only a CRUD engine. It has data sources, functions, plan-time semantics, unknown values, import, moved blocks, state schema upgrades, provider-defined functions, and emerging concepts such as ephemeral resources and actions. The framework exposes all of that surface because provider authors sometimes need to participate in all of it.</p>

<p>formae’s public <code class="language-plaintext highlighter-rouge">ResourcePlugin</code> is intentionally narrower:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// Abridged: parameter types are omitted for brevity.</span>
<span class="k">type</span> <span class="n">ResourcePlugin</span> <span class="k">interface</span> <span class="p">{</span>
    <span class="n">RateLimit</span><span class="p">()</span> <span class="n">RateLimitConfig</span>
    <span class="n">DiscoveryFilters</span><span class="p">()</span> <span class="p">[]</span><span class="n">MatchFilter</span>
    <span class="n">LabelConfig</span><span class="p">()</span> <span class="n">LabelConfig</span>

    <span class="n">Create</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">CreateResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
    <span class="n">Read</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">ReadResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
    <span class="n">Update</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">UpdateResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
    <span class="n">Delete</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">DeleteResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
    <span class="n">Status</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">StatusResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
    <span class="n">List</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">ListResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The repo comments say plugin identity and schema methods are handled automatically by the SDK. The internal <code class="language-plaintext highlighter-rouge">FullResourcePlugin</code> adds <code class="language-plaintext highlighter-rouge">Name</code>, <code class="language-plaintext highlighter-rouge">Version</code>, <code class="language-plaintext highlighter-rouge">Namespace</code>, <code class="language-plaintext highlighter-rouge">SupportedResources</code>, and <code class="language-plaintext highlighter-rouge">SchemaForResourceType</code>, but plugin authors do not implement those directly. The SDK derives identity from the manifest and schemas from the Pkl package.</p>
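<p>One way to picture that split is Go struct embedding: the SDK wraps the author’s plugin in a type that adds the identity methods. The sketch below is an assumption about the general shape, not formae’s implementation. <code class="language-plaintext highlighter-rouge">Manifest</code>, <code class="language-plaintext highlighter-rouge">fullPlugin</code>, <code class="language-plaintext highlighter-rouge">Wrap</code>, and the one-method <code class="language-plaintext highlighter-rouge">ResourcePlugin</code> are hypothetical stand-ins.</p>

```go
package main

import "fmt"

// ResourcePlugin is a one-method stand-in for the author-facing interface.
type ResourcePlugin interface {
	Read(id string) (string, error)
}

// Manifest holds identity fields an SDK could read from a plugin manifest.
type Manifest struct {
	Name, Version, Namespace string
}

// fullPlugin embeds the author's plugin and adds identity methods,
// so authors never implement Name, Version, or Namespace themselves.
type fullPlugin struct {
	ResourcePlugin
	m Manifest
}

func (f fullPlugin) Name() string      { return f.m.Name }
func (f fullPlugin) Version() string   { return f.m.Version }
func (f fullPlugin) Namespace() string { return f.m.Namespace }

// Wrap is a hypothetical helper, not the SDK's real entry point.
func Wrap(p ResourcePlugin, m Manifest) fullPlugin {
	return fullPlugin{ResourcePlugin: p, m: m}
}

// sftpPlugin is the only code a plugin author writes in this sketch.
type sftpPlugin struct{}

func (sftpPlugin) Read(id string) (string, error) { return "contents of " + id, nil }

func main() {
	fp := Wrap(sftpPlugin{}, Manifest{Name: "sftp", Version: "0.1.0", Namespace: "SFTP"})
	out, _ := fp.Read("/tmp/a.txt")
	fmt.Println(fp.Namespace(), fp.Name(), fp.Version(), out)
}
```

<p>The author’s type stays free of identity concerns, while the wrapper satisfies both the narrow interface (through the embedded field) and the wider internal one.</p>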

<p>That is a strong opinion: resource plugins should stay focused on remote-system operations. Schema declaration, metadata, wrapping, startup, and announcement are SDK responsibilities. The formae article also gives a pragmatic reason for keeping the surface this small: AI coding agents write much of the plugin code, so reliability comes from a narrow contract, LLM-oriented documentation, and conformance tests that turn “did it get it right?” into a runnable answer.</p>

<p>Terraform asks provider authors to model Terraform semantics directly. formae asks plugin authors to implement infrastructure operations and lets the agent/SDK attach those operations to the platform model. That makes Terraform’s framework more expressive for Terraform-native behavior, while formae’s SDK is easier to generate, test, and supervise as a platform operation worker.</p>

<h2 id="schema-go-framework-types-vs-pkl-as-the-declarative-source-of-truth">Schema: Go framework types vs. Pkl as the declarative source of truth</h2>

<p>Schema is where the two systems diverge most sharply.</p>

<p>Terraform’s framework expresses schemas in Go. A resource returns a <code class="language-plaintext highlighter-rouge">resource/schema.Schema</code> with <code class="language-plaintext highlighter-rouge">Attributes</code>, <code class="language-plaintext highlighter-rouge">Blocks</code>, descriptions, deprecation messages, and a <code class="language-plaintext highlighter-rouge">Version</code>. A string attribute is a Go struct with fields such as <code class="language-plaintext highlighter-rouge">Required</code>, <code class="language-plaintext highlighter-rouge">Optional</code>, <code class="language-plaintext highlighter-rouge">Computed</code>, <code class="language-plaintext highlighter-rouge">Sensitive</code>, <code class="language-plaintext highlighter-rouge">Validators</code>, <code class="language-plaintext highlighter-rouge">PlanModifiers</code>, <code class="language-plaintext highlighter-rouge">Default</code>, and <code class="language-plaintext highlighter-rouge">WriteOnly</code>. These are not incidental flags. They encode Terraform’s contract with the practitioner and with state: required vs optional input, computed output, sensitive display, plan-time transformations, validation, deprecation, write-only behavior, and compatibility with specific Terraform versions.</p>

<p>The framework also has a rich value model. The cloned <code class="language-plaintext highlighter-rouge">types/basetypes/string_value.go</code> shows <code class="language-plaintext highlighter-rouge">StringValue</code> carrying an explicit state: known, null, or unknown. That is one of the framework’s headline improvements over SDKv2. Provider code can distinguish “the user set null” from “Terraform does not know the value yet” from “this is a known string.” In Terraform, that distinction is central to planning.</p>
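<p>To make the three states concrete, here is a simplified stand-in written with only the standard library. It is not the framework’s <code class="language-plaintext highlighter-rouge">StringValue</code>; the constructor names, the <code class="language-plaintext highlighter-rouge">valueState</code> type, and <code class="language-plaintext highlighter-rouge">Describe</code> are hypothetical and exist only to show why an empty string, a null, and an unknown must not collapse into one value.</p>

```go
package main

import "fmt"

// valueState records which of the three states a value is in.
type valueState int

const (
	stateNull valueState = iota
	stateUnknown
	stateKnown
)

// StringValue is a simplified stand-in, not the framework's type.
type StringValue struct {
	state valueState
	value string
}

func Null() StringValue          { return StringValue{state: stateNull} }
func Unknown() StringValue       { return StringValue{state: stateUnknown} }
func Known(s string) StringValue { return StringValue{state: stateKnown, value: s} }

func (v StringValue) IsNull() bool    { return v.state == stateNull }
func (v StringValue) IsUnknown() bool { return v.state == stateUnknown }

// Describe shows how each state would be reported differently at plan time.
func Describe(v StringValue) string {
	switch {
	case v.IsUnknown():
		return "known after apply"
	case v.IsNull():
		return "not set"
	default:
		return fmt.Sprintf("%q", v.value)
	}
}

func main() {
	// Known("") is a real, empty value; it is neither null nor unknown.
	for _, v := range []StringValue{Null(), Unknown(), Known("")} {
		fmt.Println(Describe(v))
	}
}
```

<p>A plain Go <code class="language-plaintext highlighter-rouge">string</code> could express only the last case, which is why a planning engine needs a wrapper type of this kind.</p>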

<p>formae’s schema source of truth is Pkl. The docs show a resource class annotated with <code class="language-plaintext highlighter-rouge">@formae.ResourceHint</code> and fields annotated with <code class="language-plaintext highlighter-rouge">@formae.FieldHint</code>. A resource type such as <code class="language-plaintext highlighter-rouge">SFTP::Files::File</code> declares an identifier JSONPath like <code class="language-plaintext highlighter-rouge">$.path</code>; fields can be mutable or <code class="language-plaintext highlighter-rouge">createOnly</code>, where changing them requires replacement. The repo README says <code class="language-plaintext highlighter-rouge">schema/pkl/PklProject</code> describes supported resource types, fields, validation rules, create-only fields, discoverability, extractability, and parent-resource mappings. At startup, <code class="language-plaintext highlighter-rouge">pkg/plugin/descriptors</code> extracts resource descriptors and JSON schemas from the Pkl package.</p>
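<p>The <code class="language-plaintext highlighter-rouge">createOnly</code> rule lends itself to a small decision function. The sketch below is an assumption about how an agent could use such hints, not formae’s code: <code class="language-plaintext highlighter-rouge">FieldHint</code> is reduced to one flag, <code class="language-plaintext highlighter-rouge">Decide</code> and <code class="language-plaintext highlighter-rouge">Action</code> are hypothetical, and treating <code class="language-plaintext highlighter-rouge">path</code> as create-only is an illustrative choice.</p>

```go
package main

import "fmt"

// FieldHint keeps only the createOnly idea from the schema annotations.
type FieldHint struct {
	CreateOnly bool
}

// Action is what the engine would do to reach the desired properties.
type Action string

const (
	NoOp    Action = "no-op"
	Update  Action = "update"
	Replace Action = "replace"
)

// Decide compares desired properties against current ones.
// Any changed create-only field forces a replacement; otherwise any
// changed field means an in-place update. Fields absent from desired
// are ignored in this sketch.
func Decide(hints map[string]FieldHint, current, desired map[string]string) Action {
	action := NoOp
	for name, want := range desired {
		if current[name] == want {
			continue
		}
		if hints[name].CreateOnly {
			return Replace
		}
		action = Update
	}
	return action
}

func main() {
	hints := map[string]FieldHint{"path": {CreateOnly: true}, "content": {}}
	current := map[string]string{"path": "/a.txt", "content": "v1"}

	fmt.Println(Decide(hints, current, map[string]string{"path": "/a.txt", "content": "v2"}))
	fmt.Println(Decide(hints, current, map[string]string{"path": "/b.txt", "content": "v1"}))
}
```

<p>Because the hint lives in the schema, a decision like this can be made before any plugin method runs, which is the point of putting such metadata in Pkl instead of in each plugin’s Go code.</p>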

<p>This makes formae’s schema story more language-oriented. Pkl is not just metadata stapled onto Go structs; it is the declarative contract that the agent can evaluate, verify, extract, and use for resource discovery/reconciliation. The code in <code class="language-plaintext highlighter-rouge">descriptors/extract_schema.go</code> stages plugin dependencies, rewrites <code class="language-plaintext highlighter-rouge">@formae</code> references so plugin schemas resolve against the agent’s formae schema version, generates a wrapper Pkl project, resolves dependencies, and runs an extractor to produce resource descriptors.</p>

<p>That dependency rewriting is a subtle but important schema-evolution mechanism. It spares every plugin from updating its own PklProject pin in lockstep whenever the agent’s base schema adds a feature. Terraform’s framework handles schema evolution primarily through provider code and protocol compatibility. formae uses a package/module schema language and an agent-controlled extraction pass to keep plugin schema interpretation aligned with the running platform.</p>

<h2 id="state-and-evolution">State and evolution</h2>

<p>Terraform has one of the hardest schema-evolution problems in infrastructure software: it must preserve user state across provider upgrades, Terraform upgrades, schema changes, moved resources, imports, and drift refreshes. The framework exposes that reality directly.</p>

<p>A Terraform resource schema has a <code class="language-plaintext highlighter-rouge">Version</code>. The <code class="language-plaintext highlighter-rouge">ResourceWithUpgradeState</code> interface lets a resource return a map from prior state versions to <code class="language-plaintext highlighter-rouge">StateUpgrader</code> implementations. The comments are explicit: Terraform does not store previous schema information, so breaking changes to state data types must be handled by providers. There is also <code class="language-plaintext highlighter-rouge">ResourceWithMoveState</code> for moved configuration blocks that change resource type, and identity schema/identity upgrade interfaces for newer resource identity support.</p>
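<p>The shape of that upgrade map can be sketched without the framework. Everything below is a simplified stand-in under stated assumptions: <code class="language-plaintext highlighter-rouge">State</code>, <code class="language-plaintext highlighter-rouge">Upgrader</code>, <code class="language-plaintext highlighter-rouge">Upgrade</code>, and the example fields are hypothetical, and real <code class="language-plaintext highlighter-rouge">StateUpgrader</code> values carry typed schemas instead of plain maps. What it keeps is the idea that each entry is keyed by a prior version and converts straight to the current schema.</p>

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// State is a simplified stand-in for stored resource state.
type State map[string]any

// Upgrader converts state written at one prior schema version
// directly to the current schema version.
type Upgrader func(prior State) (State, error)

const currentVersion = 2

// upgraders is keyed by the prior schema version.
var upgraders = map[int64]Upgrader{
	0: func(s State) (State, error) {
		// Version 0 stored "port" as a string; the current schema wants an int.
		raw, ok := s["port"].(string)
		if !ok {
			return nil, errors.New("v0 state: port is not a string")
		}
		port, err := strconv.Atoi(raw)
		if err != nil {
			return nil, err
		}
		return State{"host": s["host"], "port": port}, nil
	},
	1: func(s State) (State, error) {
		// Version 1 used the key "hostname"; the current schema calls it "host".
		return State{"host": s["hostname"], "port": s["port"]}, nil
	},
}

// Upgrade returns state in the current schema, or an error if no
// upgrader is registered for the stored version.
func Upgrade(version int64, s State) (State, error) {
	if version == currentVersion {
		return s, nil
	}
	up, ok := upgraders[version]
	if !ok {
		return nil, fmt.Errorf("no upgrader from schema version %d", version)
	}
	return up(s)
}

func main() {
	out, err := Upgrade(0, State{"host": "example.test", "port": "22"})
	fmt.Println(out, err)
}
```

<p>The engine stores no old schemas here either, so each upgrader has to carry its own knowledge of what the prior shape looked like, which is the burden the framework comments describe.</p>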

<p>That is a powerful but demanding model. Terraform gives provider authors tools to precisely control state compatibility, but provider authors must think in Terraform state terms.</p>
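
<p>The shape of that contract can be sketched in a few lines of Go. This is a minimal illustration, not the framework’s API: <code class="language-plaintext highlighter-rouge">State</code>, <code class="language-plaintext highlighter-rouge">Upgrader</code>, <code class="language-plaintext highlighter-rouge">upgraders</code>, and <code class="language-plaintext highlighter-rouge">Upgrade</code> are hypothetical names, and the attribute names are invented. It only shows the idea of a map keyed by prior schema version, where the provider supplies the conversion because nobody else remembers the old shape.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import "fmt"

// State is a hypothetical stand-in for stored resource state.
type State map[string]any

// Upgrader converts state written at one prior schema version to the current shape.
type Upgrader func(old State) State

// currentVersion plays the role of the schema version in this sketch.
const currentVersion = 2

// upgraders is keyed by the prior version found in stored state.
var upgraders = map[int]Upgrader{
	// Assumed history: v0 stored a single "port" value.
	0: func(old State) State {
		return State{"ports": []string{fmt.Sprint(old["port"])}}
	},
	// Assumed history: v1 stored "port_list", which v2 renamed to "ports".
	1: func(old State) State {
		return State{"ports": old["port_list"]}
	},
}

// Upgrade returns state in the current shape, or an error when no
// upgrader covers the stored version.
func Upgrade(stored State, storedVersion int) (State, error) {
	if storedVersion == currentVersion {
		return stored, nil
	}
	up, ok := upgraders[storedVersion]
	if !ok {
		return nil, fmt.Errorf("no upgrader from version %d", storedVersion)
	}
	return up(stored), nil
}

func main() {
	s, err := Upgrade(State{"port": "8080"}, 0)
	fmt.Println(s, err)
}
</code></pre></div></div>

<p>Each entry in this sketch jumps straight from its prior version to the current one, so every schema bump means revisiting every older upgrader. That maintenance cost is the “demanding” part of the model.</p>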

<p>formae’s schema evolution story appears more centralized. The article’s title foregrounds schema evolution, and the repo shows the mechanism behind that framing: schemas are Pkl packages, the SDK extracts descriptors at startup, the manifest declares <code class="language-plaintext highlighter-rouge">minFormaeVersion</code>, and the extractor rewrites plugin dependencies against the agent’s formae schema. The docs also emphasize <code class="language-plaintext highlighter-rouge">FieldHint</code> semantics like <code class="language-plaintext highlighter-rouge">createOnly</code>, identifier extraction, discoverability, extractability, and parent-child resource relationships. These are schema-level concepts the agent can reason about before or around plugin calls, instead of asking each plugin method to rediscover the platform contract.</p>

<p>The trade-off is maturity vs. centralization. Terraform has years of battle-tested state migration machinery exposed through framework interfaces. formae can make schema evolution feel more like updating a declarative package interpreted by the agent, but that also means its evolution guarantees depend on how consistently the agent, Pkl schema library, descriptor extraction, and plugins are versioned together.</p>

<h2 id="planning-and-reconciliation">Planning and reconciliation</h2>

<p>Terraform is famous for planning. Its plugin framework is built around Terraform Core’s plan/apply model. Provider authors can validate configuration, modify plans, mark replacement requirements, preserve unknowns, and upgrade state. The framework’s <code class="language-plaintext highlighter-rouge">ResourceWithModifyPlan</code> comments show the constraints: config values must be preserved, known planned values cannot later be changed inconsistently, unknown values may remain unknown or be filled with appropriate values, and errors prevent further plan modification.</p>

<p>In other words, Terraform’s provider framework gives plugins a seat at the planning table, but Terraform Core owns the table.</p>
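
<p>To make “Core owns the table” concrete, here is a rough sketch of the kind of consistency check those constraints imply. It is an assumption-laden simplification: <code class="language-plaintext highlighter-rouge">Value</code>, <code class="language-plaintext highlighter-rouge">known</code>, <code class="language-plaintext highlighter-rouge">unknown</code>, and <code class="language-plaintext highlighter-rouge">CheckConsistency</code> are hypothetical, and real Terraform values are typed and nested rather than flat strings.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"sort"
)

// Value is a simplified attribute value: known text, or unknown until apply.
type Value struct {
	Known bool
	Text  string
}

func known(s string) Value { return Value{Known: true, Text: s} }
func unknown() Value       { return Value{} }

// CheckConsistency reports violations of two rules from the text:
// config values must be preserved in the plan, and a value that was
// known in the plan must not come back different after apply.
func CheckConsistency(config, planned, applied map[string]Value) []string {
	var problems []string
	for name, cfg := range config {
		if planned[name] != cfg {
			problems = append(problems, name+": config value not preserved in plan")
		}
	}
	for name, p := range planned {
		if !p.Known {
			continue // unknown at plan time: apply may fill in a value
		}
		if applied[name] != p {
			problems = append(problems, name+": known planned value changed during apply")
		}
	}
	sort.Strings(problems) // map iteration order is random, so sort for stable output
	return problems
}

func main() {
	config := map[string]Value{"name": known("web")}
	planned := map[string]Value{"name": known("web"), "id": unknown(), "tier": known("basic")}
	applied := map[string]Value{"name": known("web"), "id": known("i-123"), "tier": known("premium")}
	fmt.Println(CheckConsistency(config, planned, applied))
}
</code></pre></div></div>

<p>In this run the <code class="language-plaintext highlighter-rouge">id</code> attribute is allowed to go from unknown to a concrete value, while the changed <code class="language-plaintext highlighter-rouge">tier</code> is flagged. The plugin proposes; the engine decides whether the proposal is coherent.</p>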

<p>formae, by contrast, centers the agent’s reconciliation loop. The source article says the formae agent knows how to schedule, queue, order, rate-limit, execute, and retry plugin operations; the resource plugin owns the actual API interaction. The plugin docs say the agent discovers installed plugins, spawns them, receives capabilities, routes operations, and enforces rate limits. The conformance test suite validates create, read, update, replace, destroy, extract, discover, forced sync, and out-of-band deletion. That is not just CRUD testing; it is lifecycle testing through the agent and CLI.</p>

<p>This is an important distinction for platform teams:</p>

<ul>
  <li>Terraform’s unit of work is a plan/apply transaction against state.</li>
  <li>formae’s unit of work is closer to an agent-managed operation and inventory reconciliation loop.</li>
</ul>

<p>Terraform practitioners expect a plan to explain what will happen before apply. formae users may expect the agent to continuously know what exists, discover unmanaged resources, detect drift, queue long-running operations, and reconcile desired state with inventory.</p>

<p>Neither model is universally superior. Terraform’s plan is still the gold standard for human-reviewed infrastructure change. Agent reconciliation is attractive when infrastructure systems have asynchronous operations, rate limits, discovery needs, and drift behavior that do not fit cleanly into a single human-approved transaction.</p>

<h2 id="async-operations-and-rate-limits">Async operations and rate limits</h2>

<p>Terraform providers can of course wait for cloud operations, retry, and poll. But the framework’s public resource interface still reads as synchronous CRUD from Terraform Core’s perspective: <code class="language-plaintext highlighter-rouge">Create</code>, <code class="language-plaintext highlighter-rouge">Read</code>, <code class="language-plaintext highlighter-rouge">Update</code>, and <code class="language-plaintext highlighter-rouge">Delete</code> mutate response objects and diagnostics during an RPC. Provider authors usually implement waiting inside those methods or with helper libraries.</p>

<p>formae bakes async progress into the public plugin interface. CRUD results embed a <code class="language-plaintext highlighter-rouge">ProgressResult</code>; long-running cloud operations return <code class="language-plaintext highlighter-rouge">InProgress</code> with a <code class="language-plaintext highlighter-rouge">NativeID</code> and tracking metadata. The agent then calls <code class="language-plaintext highlighter-rouge">Status</code> on a polling schedule until the operation reaches success or failure. The SDK also classifies recoverable errors such as throttling, network failure, service internal error, timeout, not stabilized, not found, and resource conflict.</p>

<p>This is one of formae’s clearest advantages for cloud-provider ergonomics. Many real infrastructure APIs are not synchronous: create starts a job, update triggers stabilization, delete returns before eventual removal, discovery is paginated, and rate limits vary by namespace or service. If the platform’s agent owns polling, rate limits, and retries, plugin code can be thinner and more consistent.</p>

<p>Terraform can model the same behavior, but the provider often owns more of the operational loop. formae pushes more of that loop into the platform: scheduling, ordering, rate limiting, retry classification, and status polling become agent concerns rather than ad hoc code inside every plugin.</p>
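
<p>The division of labor is easier to see in code. The sketch below borrows the names <code class="language-plaintext highlighter-rouge">ProgressResult</code>, <code class="language-plaintext highlighter-rouge">InProgress</code>, <code class="language-plaintext highlighter-rouge">NativeID</code>, and <code class="language-plaintext highlighter-rouge">Status</code> from the text, but the field layout, the <code class="language-plaintext highlighter-rouge">Plugin</code> interface, <code class="language-plaintext highlighter-rouge">RequestID</code>, <code class="language-plaintext highlighter-rouge">slowCloud</code>, and <code class="language-plaintext highlighter-rouge">AwaitCreate</code> are my own assumptions, not the SDK’s actual definitions.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"errors"
	"fmt"
)

type OperationStatus string

const (
	InProgress OperationStatus = "InProgress"
	Success    OperationStatus = "Success"
	Failure    OperationStatus = "Failure"
)

// ProgressResult is a simplified, hypothetical version of what a CRUD call reports.
type ProgressResult struct {
	Status    OperationStatus
	NativeID  string
	RequestID string // tracking metadata the agent hands back to Status
}

// Plugin is only the slice of a plugin contract this sketch needs.
type Plugin interface {
	Create(name string) ProgressResult
	Status(requestID string) ProgressResult
}

// slowCloud pretends to be an API whose create job finishes after a few checks.
type slowCloud struct {
	remaining int
	nativeID  string
}

func (c *slowCloud) Create(name string) ProgressResult {
	c.nativeID = "res-" + name
	return ProgressResult{Status: InProgress, NativeID: c.nativeID, RequestID: "job-1"}
}

func (c *slowCloud) Status(requestID string) ProgressResult {
	res := ProgressResult{Status: InProgress, NativeID: c.nativeID, RequestID: requestID}
	if c.remaining == 0 {
		res.Status = Success
		return res
	}
	c.remaining--
	return res
}

// AwaitCreate is the agent side: start the operation, then poll until it settles.
func AwaitCreate(p Plugin, name string, maxPolls int) (ProgressResult, error) {
	res := p.Create(name)
	for polls := 0; res.Status == InProgress; polls++ {
		if polls == maxPolls {
			return res, errors.New("operation did not settle in time")
		}
		// A real agent would wait between polls and apply rate limits here.
		res = p.Status(res.RequestID)
	}
	if res.Status == Failure {
		return res, errors.New("operation failed")
	}
	return res, nil
}

func main() {
	cloud := new(slowCloud)
	cloud.remaining = 2
	res, err := AwaitCreate(cloud, "db", 10)
	fmt.Println(res.Status, res.NativeID, err)
}
</code></pre></div></div>

<p>Notice what the plugin does not contain: no sleep, no retry budget, no timeout. In this arrangement those live in one place, the loop in <code class="language-plaintext highlighter-rouge">AwaitCreate</code>, which is the stand-in for the agent.</p>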

<h2 id="discovery-and-inventory">Discovery and inventory</h2>

<p>Terraform has data sources and import, but it is not primarily an inventory discovery engine. Terraform state tracks resources Terraform manages; unmanaged resource discovery is usually handled by import workflows, external tools, or provider-specific data sources.</p>

<p>formae makes discovery part of the plugin contract. <code class="language-plaintext highlighter-rouge">List</code> is a required method. <code class="language-plaintext highlighter-rouge">DiscoveryFilters</code> and <code class="language-plaintext highlighter-rouge">LabelConfig</code> are part of the public interface. The Pkl schema can mark resource types as discoverable and extractable. The conformance tests create resources out-of-band, register targets, trigger discovery scans, and assert that resources appear on an <code class="language-plaintext highlighter-rouge">$unmanaged</code> stack.</p>

<p>That makes formae more natural for platform teams that want to answer, “What already exists? What can we bring under management? What drifted?” Terraform can answer those questions in pieces, but its core mental model begins with declared configuration and state. formae’s model seems to put inventory and discovery closer to the center.</p>
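
<p>At its core, discovery is a paginated listing compared against what the platform already manages. The following sketch assumes a token-based <code class="language-plaintext highlighter-rouge">List</code> signature, a <code class="language-plaintext highlighter-rouge">Page</code> type, and a resource type name that are all hypothetical; the real SDK contract will differ in detail.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"sort"
)

// Page is one hypothetical List response: native IDs plus a token for the next page.
type Page struct {
	NativeIDs []string
	NextToken string // empty when there are no more pages
}

// Lister is the discovery part of a plugin contract in this sketch.
type Lister interface {
	List(resourceType, pageToken string) Page
}

// fakeCloud serves fixed pages keyed by page token.
type fakeCloud struct{ pages map[string]Page }

func (f fakeCloud) List(resourceType, pageToken string) Page {
	return f.pages[pageToken]
}

// Unmanaged walks every page and returns the IDs that exist remotely
// but are missing from the managed inventory.
func Unmanaged(l Lister, resourceType string, managed map[string]bool) []string {
	var found []string
	token := ""
	for {
		page := l.List(resourceType, token)
		for _, id := range page.NativeIDs {
			if !managed[id] {
				found = append(found, id)
			}
		}
		if page.NextToken == "" {
			break
		}
		token = page.NextToken
	}
	sort.Strings(found)
	return found
}

func main() {
	cloud := fakeCloud{pages: map[string]Page{
		"":   {NativeIDs: []string{"bucket-a", "bucket-b"}, NextToken: "p2"},
		"p2": {NativeIDs: []string{"bucket-c"}},
	}}
	managed := map[string]bool{"bucket-b": true}
	// The resource type string is an invented example.
	fmt.Println(Unmanaged(cloud, "Example::Storage::Bucket", managed))
}
</code></pre></div></div>

<p>The plugin only answers “what is there?” one page at a time. Deciding what counts as unmanaged, and where to put it, stays with the caller.</p>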

<h2 id="testing-philosophy">Testing philosophy</h2>

<p>Terraform’s framework ecosystem emphasizes unit tests, acceptance tests, migration tests, and provider behavior compatibility. The migration guide recommends test-driven development when moving from SDKv2 to the framework, especially because user Terraform configuration should usually not change during migration.</p>

<p>formae’s conformance tests are more platform-integrated. The plugin scaffold wires in tests that run through the real formae CLI and agent. The suite validates not only CRUD but extract, discover, forced sync idempotency, replacement behavior, destroy, and out-of-band deletion. That is a different testing posture: not “does this provider method behave?” but “does this plugin behave correctly inside the platform’s lifecycle?”</p>

<p>This also connects back to the article’s AI-agent point. If coding agents generate most plugin code, tests need to be more than examples for humans; they need to be executable feedback loops. Terraform’s framework relies on provider authors understanding Terraform semantics and building the right test matrix. formae tries to move more of that burden into a standard conformance harness that generated plugins can run against repeatedly.</p>

<p>For a young ecosystem, that is a smart move. A conformance suite can encode platform expectations before dozens of plugins drift into subtly incompatible interpretations of CRUD, discovery, replacement, or progress.</p>
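
<p>A toy version of that posture fits in one function. This is only a sketch under my own assumptions: the <code class="language-plaintext highlighter-rouge">Plugin</code> interface, <code class="language-plaintext highlighter-rouge">memPlugin</code>, and <code class="language-plaintext highlighter-rouge">Conformance</code> are hypothetical, and the real suite runs through the formae CLI and agent rather than calling methods directly.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import "fmt"

// Plugin is a hypothetical, minimal plugin contract for this sketch.
type Plugin interface {
	Create(id, size string) error
	Read(id string) (string, bool)
	Delete(id string) error
}

// memPlugin keeps resources in a map that stands in for a remote system.
type memPlugin struct{ data map[string]string }

func (p memPlugin) Create(id, size string) error  { p.data[id] = size; return nil }
func (p memPlugin) Read(id string) (string, bool) { v, ok := p.data[id]; return v, ok }
func (p memPlugin) Delete(id string) error        { delete(p.data, id); return nil }

// Conformance runs lifecycle checks in order and returns the names of failed checks.
// outOfBandDelete removes the resource behind the plugin's back.
func Conformance(p Plugin, outOfBandDelete func(id string)) []string {
	var failed []string
	check := func(name string, ok bool) {
		if !ok {
			failed = append(failed, name)
		}
	}

	check("create", p.Create("vm-1", "small") == nil)
	size, found := p.Read("vm-1")
	check("read finds created resource", found)
	check("read returns created size", size == "small")

	outOfBandDelete("vm-1")
	_, found = p.Read("vm-1")
	check("read after out-of-band delete reports missing", !found)
	check("destroy of missing resource is not an error", p.Delete("vm-1") == nil)
	return failed
}

func main() {
	p := memPlugin{data: map[string]string{}}
	failed := Conformance(p, func(id string) { delete(p.data, id) })
	fmt.Println("failed checks:", failed)
}
</code></pre></div></div>

<p>The useful property is that the list of checks belongs to the platform, not to the plugin author. A generated plugin either passes the same checks as every other plugin or it reports which expectation it broke.</p>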

<h2 id="licensing-and-ecosystem-implications">Licensing and ecosystem implications</h2>

<p>Terraform’s Plugin Framework repository is MPL-2.0 and belongs to a large ecosystem where public providers are distributed through a registry. The framework README says it has reached general availability, follows semantic versioning, supports Terraform v0.12 and above, and recommends tagged releases.</p>

<p>formae’s docs explicitly state a plugin policy: plugins are independent works, executed out-of-process or as remote services, communicating only through documented public interfaces; plugin developers may choose open-source or proprietary licenses; plugin licenses should not impose terms on formae components. The repo files are marked FSL-1.1-ALv2.</p>

<p>This is not a side issue. Plugin architecture is often license architecture. Both systems use out-of-process plugins, but formae’s policy is unusually explicit about non-derivative boundaries and plugin licensing independence. The move away from Go native plugins therefore solved two problems at once: dependency lockstep and license coupling. Terraform’s mature ecosystem has already normalized separately distributed providers; formae is documenting that boundary early because its first implementation made the cost of in-process linking concrete.</p>

<h2 id="developer-experience-who-is-the-plugin-author">Developer experience: who is the plugin author?</h2>

<p>Terraform’s framework is best for developers who need to expose a system to Terraform practitioners. They must understand Terraform configuration, planning, state, unknown values, import, schema migrations, and acceptance testing. The reward is access to Terraform’s ecosystem and a highly recognizable workflow.</p>

<p>formae’s SDK seems designed for platform engineers who need to add a resource type to an agent-driven infrastructure platform. They implement operations against a target API, describe resources in Pkl, and rely on the SDK/agent for startup, schema extraction, registration, rate limiting, progress polling, discovery, conformance testing, and actor placement. Whether the operator runs locally in a workflow test, as a persistent plugin process, or behind a satellite agent is meant to be a deployment detail rather than a different plugin authoring model.</p>

<p>That makes formae potentially easier for internal platform teams, especially when the target system is not a traditional Terraform provider candidate. But it also means a formae plugin is useful inside formae’s operational model, not as a general-purpose Terraform provider. Terraform has broader ecosystem reach. formae can be more opinionated because it does not need to support every Terraform use case.</p>

<h2 id="where-terraform-is-stronger">Where Terraform is stronger</h2>

<p>Terraform’s Plugin Framework is stronger when compatibility, ecosystem reach, and plan/state semantics dominate.</p>

<ol>
  <li><strong>Ecosystem maturity.</strong> Terraform has a huge provider ecosystem, registry distribution, mature protocol compatibility, and years of provider edge cases.</li>
  <li><strong>Plan semantics.</strong> Terraform’s plan/apply model and unknown/null value handling are deeply developed.</li>
  <li><strong>State migration machinery.</strong> Resource schema versions, state upgraders, moved-state support, identity schema, and protocol compatibility give provider authors explicit tools for long-lived resources.</li>
  <li><strong>Language integration.</strong> Provider-defined functions, data sources, resources, actions, ephemeral resources, and other Terraform concepts integrate with HCL and Terraform Core.</li>
  <li><strong>Incremental migration from SDKv2.</strong> <code class="language-plaintext highlighter-rouge">terraform-plugin-mux</code> lets complex providers migrate one resource or data source at a time.</li>
</ol>

<p>If the question is, “How do I expose an API to Terraform users with minimal surprises over many provider releases?” Terraform’s framework is the obvious answer.</p>

<h2 id="where-formae-is-stronger">Where formae is stronger</h2>

<p>formae’s Plugin SDK is stronger when agent-managed operations, discovery, and schema-as-platform-contract matter more than Terraform ecosystem reach.</p>

<ol>
  <li><strong>Async operation model.</strong> <code class="language-plaintext highlighter-rouge">Status</code> and <code class="language-plaintext highlighter-rouge">ProgressResult</code> make long-running cloud workflows first-class.</li>
  <li><strong>Discovery by design.</strong> <code class="language-plaintext highlighter-rouge">List</code>, discovery filters, discoverable schemas, unmanaged inventory, and conformance tests treat discovery as core behavior.</li>
  <li><strong>Agent-owned orchestration.</strong> Scheduling, queuing, ordering, rate limits, plugin announcements, actor supervision, and retries sit in the platform rather than in every plugin.</li>
  <li><strong>Topology flexibility.</strong> Ergo network transparency lets the same operator model run in-process for tests, out-of-process locally, or remotely behind satellite agents.</li>
  <li><strong>Pkl schema packages.</strong> Resource schemas can be evaluated, extracted, verified, version-resolved, and used by the agent outside plugin method calls.</li>
  <li><strong>Conformance as scaffold.</strong> New plugins start with a lifecycle test suite that exercises real agent behavior, which is especially useful when AI agents generate much of the plugin code.</li>
  <li><strong>License and dependency boundary clarity.</strong> The docs explicitly frame plugins as independent out-of-process works, and the architecture avoids native Go plugin dependency lockstep.</li>
</ol>

<p>If the question is, “How do I add a new resource family to an agent that manages inventory, discovery, async operations, and reconciliation?” formae’s SDK is the more targeted design.</p>

<h2 id="the-deeper-design-lesson">The deeper design lesson</h2>

<p>The Terraform Plugin Framework and formae Plugin SDK both demonstrate a move away from ad hoc plugins toward strongly mediated extension systems.</p>

<p>Terraform learned that provider authors need safer abstractions over a complex protocol. The framework gives them interfaces, typed values, schemas, validators, plan modifiers, state upgraders, and server adapters. It abstracts repetitive protocol machinery while still exposing Terraform’s semantics.</p>

<p>formae appears to be starting from a different lesson: infrastructure plugins are not only API adapters; they are participants in an agent’s operational system. The SDK hides identity/schema wrapping, extracts Pkl descriptors, announces capabilities through actors, standardizes progress and retry behavior, and tests plugins through real platform lifecycle operations.</p>

<p>One design is protocol-centered. The other is agent-centered.</p>

<p>That distinction will matter more as infrastructure tools become more autonomous. A plugin framework for a human-reviewed plan/apply tool must be conservative about state and compatibility. A plugin SDK for an agentic infrastructure platform must be conservative about scheduling, rate limits, reconciliation, discovery, and restart-safe operation. The same cloud API might need both kinds of adapters.</p>

<h2 id="a-practical-recommendation">A practical recommendation</h2>

<p>For most teams, this is not an either/or decision.</p>

<p>Use Terraform Plugin Framework when:</p>

<ul>
  <li>your users already manage the system with Terraform;</li>
  <li>plan output and state compatibility are non-negotiable;</li>
  <li>you need Terraform Registry distribution;</li>
  <li>provider-defined data sources, functions, imports, or state migrations are central;</li>
  <li>you need broad compatibility with Terraform workflows.</li>
</ul>

<p>Use formae Plugin SDK when:</p>

<ul>
  <li>you are extending formae itself;</li>
  <li>resource discovery and inventory matter as much as desired-state application;</li>
  <li>operations are long-running, rate-limited, or need platform-level polling;</li>
  <li>Pkl schemas are part of your platform contract;</li>
  <li>you want plugin authors to implement remote-system operations while the agent owns orchestration.</li>
</ul>

<p>The exciting possibility is not that formae replaces Terraform or that Terraform absorbs formae’s model. It is that infrastructure plugin architecture is splitting into two useful categories:</p>

<ol>
  <li><strong>Protocol plugins</strong> for ecosystem-compatible declarative tools.</li>
  <li><strong>Agent plugins</strong> for autonomous reconciliation platforms.</li>
</ol>

<p>Terraform’s Plugin Framework is one of the best examples of the first category. formae’s Plugin SDK is an early, concrete example of the second.</p>

<p>That is why the formae articles matter to this comparison. The Go plugin history explains why the process boundary exists; the Ergo primer explains why that boundary can be actor-shaped rather than just RPC-shaped; the Pkl schema machinery explains how the agent can keep reasoning about resource types outside the plugin’s Go code. Together they point to a different center of gravity: an infrastructure platform where plugins are supervised workers in an agent’s resource graph, not only RPC endpoints for a plan/apply engine.</p>

<p>That difference is exactly why the comparison is useful.</p>]]></content><author><name>Sakul Learning</name></author><category term="infrastructure-as-code" /><category term="terraform" /><category term="formae" /><category term="plugin-frameworks" /><category term="platform-engineering" /><summary type="html"><![CDATA[The most interesting difference between Terraform’s Plugin Framework and formae’s Plugin SDK is not that both are written in Go, or that both ask plugin authors to implement CRUD. The important difference is where each system places the boundary between a plugin, the engine, and the schema language.]]></summary></entry><entry><title type="html">Infrastructure as Code Needs Software Engineering, Not More Config Tooling</title><link href="https://sakul-learning.github.io/2026/06/02/iac-repo-structure/" rel="alternate" type="text/html" title="Infrastructure as Code Needs Software Engineering, Not More Config Tooling" /><published>2026-06-02T00:00:00+00:00</published><updated>2026-06-02T00:00:00+00:00</updated><id>https://sakul-learning.github.io/2026/06/02/iac-repo-structure</id><content type="html" xml:base="https://sakul-learning.github.io/2026/06/02/iac-repo-structure/"><![CDATA[<p>Gruntwork is right about the symptom: many Terraform and OpenTofu repositories become messy because plain HCL lacks enough structure for large systems. But Terragrunt may be the wrong kind of cure. It adds another specialized layer around a configuration language when the deeper problem is that cloud infrastructure is increasingly software architecture — and should be organized with ordinary, portable, well-understood software engineering tools.</p>

<p>A messy infrastructure repository is not just an aesthetic problem. It is an operational risk.</p>

<p>Gruntwork’s guide, <strong><a href="https://www.gruntwork.io/blog/gruntwork-guides-your-infrastructure-repo-is-a-mess-heres-how-to-fix-it">“Your Infrastructure Repo Is a Mess — Here’s How to Fix It”</a></strong>, is useful because it names real failure modes: copy-pasted environments, mega-modules, unclear ownership, state files with too much blast radius, and deployment paths that depend on tribal knowledge.</p>

<p>But the guide also exposes a deeper problem.</p>

<p>Terragrunt is correct that plain Terraform and OpenTofu are under-tooled for large infrastructure systems. The question is whether the answer should be yet another niche tool around a configuration language, or whether Infrastructure as Code should become more serious about the word <strong>code</strong>.</p>

<h2 id="terragrunt-diagnoses-a-real-gap">Terragrunt diagnoses a real gap</h2>

<p>The Gruntwork critique lands because many HCL repositories do decay in predictable ways.</p>

<p>A team starts with a simple Terraform file. Then it adds staging. Then production. Then another region. Then another account. Soon the same provider, backend, tag, VPC, IAM, and service settings are copied across many folders. Small differences become hard to classify. Is staging different from production on purpose, or did someone forget to copy the last change?</p>

<p>The opposite strategy is also painful: one giant module that deploys everything. That avoids some repetition, but it creates a module with too many inputs, slow plans, unclear ownership, and a blast radius much larger than the change being reviewed.</p>

<p>Terragrunt’s answer is hierarchy, inheritance, generated configuration, stacks, and live-repo conventions. That can make a Terraform/OpenTofu estate more navigable.</p>

<p>The uncomfortable question is: why does an Infrastructure as Code ecosystem need so much external machinery to organize and reuse code?</p>

<h2 id="the-suspicion-this-is-configuration-trying-to-impersonate-software">The suspicion: this is configuration trying to impersonate software</h2>

<p>Terraform modules are useful, but they are not software libraries in the way Go, Python, TypeScript, or Java developers usually mean that phrase.</p>

<p>In mainstream software ecosystems, teams expect a rich set of ordinary capabilities:</p>

<ul>
  <li>packages with public APIs</li>
  <li>dependency management</li>
  <li>semantic versioning</li>
  <li>local development against unpublished packages</li>
  <li>unit tests, integration tests, mocks, fixtures, and contract tests</li>
  <li>object composition</li>
  <li>dependency injection</li>
  <li>reusable interfaces</li>
  <li>override points</li>
  <li>monorepo tooling</li>
  <li>build graphs</li>
  <li>task runners</li>
  <li>generated clients</li>
  <li>type checking</li>
  <li>documentation generated from the code itself</li>
</ul>

<p>Go is a good reference point because its toolchain is intentionally boring and widely understood. Go code is organized into packages and modules. The <code class="language-plaintext highlighter-rouge">go</code> tool knows how to fetch, build, install, and test them. The standard <code class="language-plaintext highlighter-rouge">testing</code> package supports tests, benchmarks, examples, and fuzzing. Modules can be versioned and published from ordinary repositories.</p>

<p>The JavaScript and TypeScript world is messier, but it has the same basic shape. npm, pnpm, yarn, Jest, Vitest, Turborepo, Nx, Lerna, and many other tools exist because organizing and shipping code at scale is a common software problem. Teams may argue over the best tool, but the category is mature.</p>

<p>By comparison, Terraform modules feel constrained. They can be composed, but not extended like classes or refined through normal override mechanisms. There is no first-class dependency injection model comparable to mature application frameworks. Provider configuration has special global behavior. The Terraform docs themselves emphasize that provider configurations are shared across module boundaries and defined only in the root module; reusable child modules must not contain provider blocks. That is understandable for Terraform’s execution model, but it is not the same as a general software package system.</p>

<p>So Terragrunt’s hierarchy is not just a convenience. It is a workaround for missing language and ecosystem capabilities.</p>

<h2 id="infrastructure-is-no-longer-only-operations-plumbing">Infrastructure is no longer only operations plumbing</h2>

<p>A common defense of simpler configuration is that infrastructure lifecycle is different from application lifecycle. The argument goes something like this: infrastructure is too dangerous, too stateful, and too operationally sensitive for the complexity of software engineering patterns.</p>

<p>That objection made more sense when infrastructure mostly meant racks, switches, network appliances, operating systems, and long-lived servers. It is less convincing in cloud-native systems.</p>

<p>Modern cloud infrastructure is often software-defined and architecture-shaped:</p>

<ul>
  <li>queues and topics encode asynchronous workflow</li>
  <li>serverless functions encode business reactions to events</li>
  <li>API gateways encode product interfaces</li>
  <li>event buses encode domain integration</li>
  <li>managed databases encode storage and access patterns</li>
  <li>object stores encode data pipelines</li>
  <li>identity policies encode service boundaries</li>
  <li>software-defined networks encode reachability and segmentation</li>
</ul>

<p>AWS’s serverless guidance describes managed services as a way for the provider to handle capacity provisioning, patching, platform management, availability foundations, and other infrastructure management tasks so teams can focus more directly on business logic and application behavior. That is not classic server-rack operations. It is application architecture expressed through cloud resources.</p>

<p>Enterprise Integration Patterns make the same point from a different direction. Patterns such as publish-subscribe channels, point-to-point channels, request-reply, correlation identifiers, message routers, splitters, aggregators, dead-letter channels, competing consumers, and service activators are software architecture patterns. In the cloud, many of those patterns map directly onto managed infrastructure resources such as SNS, SQS, EventBridge, Lambda, API Gateway, Step Functions, and queues with dead-letter policies.</p>

<p>If the infrastructure resource is implementing part of the integration pattern, then the IaC describing it is not merely configuration. It is part of the software architecture.</p>

<h2 id="cdk-is-interesting-because-it-treats-infrastructure-as-libraries">CDK is interesting because it treats infrastructure as libraries</h2>

<p>AWS CDK is not automatically better than Terraform. It has its own risks: generated CloudFormation can be opaque, abstractions can hide details, and teams can write terrible code in any language.</p>

<p>But CDK is useful for this discussion because it has a more honest model of Infrastructure as Code.</p>

<p>The AWS CDK construct model has layers:</p>

<ul>
  <li><strong>L1 constructs</strong> map closely to individual CloudFormation resources.</li>
  <li><strong>L2 constructs</strong> provide intent-based APIs with defaults, helper methods, permissions wiring, and resource integration logic.</li>
  <li><strong>L3 constructs</strong>, also called patterns, combine multiple resources into reusable architecture-level solutions.</li>
</ul>

<p>That layering matters. It gives teams escape hatches. If a high-level pattern is too opinionated, drop to L2. If L2 is missing or too constrained, drop to L1. If a team repeats the same architecture across products, it can publish an L3 construct as a normal library package.</p>

<p>AWS Solutions Constructs makes this explicit: it provides multi-service, well-architected patterns for defining repeatable infrastructure in familiar programming languages. It supports TypeScript, JavaScript, Python, and Java. Those constructs can use conditionals, loops, object-oriented techniques, normal tests, normal review, normal package distribution, and normal dependency management.</p>

<p>This is much closer to what software teams already know how to do.</p>
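
<p>The layering itself is not exotic. Here is a deliberately small sketch of the three levels using nothing but plain Go types. None of this is the AWS CDK API: <code class="language-plaintext highlighter-rouge">Resource</code>, <code class="language-plaintext highlighter-rouge">Queue</code>, <code class="language-plaintext highlighter-rouge">WithDeadLetter</code>, <code class="language-plaintext highlighter-rouge">QueueWorker</code>, and the property names are hypothetical and exist only to show the shape.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import "fmt"

// Resource is the L1 layer in this sketch: a raw resource type plus its properties.
type Resource struct {
	Type  string
	Props map[string]any
}

// Queue is an L2-style wrapper: it applies a default and offers an intent-level helper.
type Queue struct{ res Resource }

func NewQueue(name string) Queue {
	return Queue{res: Resource{Type: "Queue", Props: map[string]any{
		"name":          name,
		"retentionDays": 4, // hypothetical default chosen by the L2 layer
	}}}
}

// WithDeadLetter wires a dead-letter target without the caller touching raw properties.
func (q Queue) WithDeadLetter(dlq Queue, maxReceives int) Queue {
	q.res.Props["deadLetterTarget"] = dlq.res.Props["name"]
	q.res.Props["maxReceives"] = maxReceives
	return q
}

// QueueWorker is an L3-style pattern: one call yields a queue,
// its dead-letter queue, and a consumer function.
func QueueWorker(name string) []Resource {
	dlq := NewQueue(name + "-dlq")
	primary := NewQueue(name).WithDeadLetter(dlq, 3)
	worker := Resource{Type: "Function", Props: map[string]any{
		"name":   name + "-worker",
		"source": primary.res.Props["name"],
	}}
	return []Resource{dlq.res, primary.res, worker}
}

func main() {
	for _, r := range QueueWorker("orders") {
		fmt.Println(r.Type, r.Props["name"])
	}
}
</code></pre></div></div>

<p>Because the pattern returns ordinary values, the escape hatch is ordinary too: a caller who dislikes a default can change a property on the returned <code class="language-plaintext highlighter-rouge">Resource</code>, and the whole thing can be unit-tested without deploying anything.</p>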

<h2 id="l2-service-integrations-versus-l3-solution-patterns">L2 service integrations versus L3 solution patterns</h2>

<p>The L2/L3 distinction also gives a better vocabulary for reusable infrastructure than “module.”</p>

<p>An L2 construct is close to a service-level abstraction. It can encode best-practice defaults and expose methods for common interactions. For example, a bucket abstraction can expose permission helpers. A queue abstraction can expose consumer wiring. A function abstraction can expose event-source bindings.</p>

<p>An L3 construct is closer to an architecture pattern. It might represent a static website behind CloudFront, a queue-processing worker with a dead-letter queue, an API backed by a database, or a load-balanced Fargate service. AWS Prescriptive Guidance describes L3 constructs as reusable patterns for resource interactions, resource extensions, and custom resources.</p>

<p>That distinction maps well to software architecture:</p>

<ul>
  <li>L2: reusable service integration building blocks</li>
  <li>L3: high-level solution patterns</li>
</ul>

<p>This is what many Terraform module libraries try to approximate, but HCL makes the abstraction boundary awkward. A module can expose variables and outputs, but it does not feel like an API with methods, behavior, dependency injection, override points, or typed composition. The result is often a giant input surface instead of a small, expressive interface.</p>

<h2 id="terraconstructs-is-the-sharper-test-case">TerraConstructs is the sharper test case</h2>

<p>TerraConstructs complicates the argument in a productive way. It is not simply another wrapper around HCL in the same category as Terragrunt. Its own positioning is closer to this: keep Terraform/OpenTofu as the operational substrate, but move authoring into CDKTF constructs with TypeScript, strong typing, object-oriented composition, tests, and AWS CDK-inspired L2 abstractions.</p>

<p>That is much closer to the direction this post is arguing for. TerraConstructs explicitly tries to bridge <strong>AWS CDK developer experience</strong> with <strong>Terraform/OpenTofu operational ecosystem</strong>. It claims to provide deterministic, type-safe L2 constructs, synthesize to pure Terraform/OpenTofu output, support existing Terraform-adjacent tooling such as tflint, infracost, and OPA, and validate constructs with unit and end-to-end tests using Terratest.</p>

<p>This is a better answer to the Gruntwork problem than just saying “use CDK instead.” Many organizations have real Terraform assets: provider knowledge, state workflows, policy checks, CI conventions, cost tooling, drift tooling, and platform teams who understand Terraform plans. TerraConstructs says: do not throw that away, but stop pretending raw modules are enough of an abstraction system.</p>

<p>The CDKTF docs make the contrast direct. Constructs can represent a single resource, a group of resources, a subsystem, or a full architectural pattern. They can create, modify, enrich, and validate resources; expose methods; operate on a construct tree; use Aspects for cross-cutting concerns; and be tested like normal application code. HashiCorp’s own docs describe constructs as a superset of Terraform modules.</p>

<p>That supports the central critique: the weakness is not Terraform’s operations model. The weakness is HCL modules as the primary abstraction mechanism for increasingly software-shaped cloud systems.</p>

<p>The important point is not HashiCorp’s late stewardship decision itself. It is that CDKTF’s corporate steward let the project stagnate badly enough that the community had to decide whether the code-first Terraform path was worth saving. CDK Terrain is the stronger signal: developers looked at the same problem and chose continuation, repair, and community governance rather than giving up on the model.</p>

<p>CDK Terrain, at <a href="https://cdktn.io">cdktn.io</a>, is a community-driven fork and continuation of CDKTF hosted by the Open Constructs Foundation. Its docs describe the same core promise this article cares about: define infrastructure in familiar programming languages, synthesize Terraform-compatible configuration, and keep using Terraform or OpenTofu as the execution engine. It supports TypeScript, Python, Java, C#, and Go; can use Terraform and OpenTofu providers and modules; can interoperate with existing HCL projects; and provides testing helpers around synthesized output.</p>

<p>That matters because it changes the story. CDKTF was not killed by lack of need. The need is obvious: teams want the Terraform/OpenTofu ecosystem, but they also want testing, dependency management, abstractions, reusable code patterns, and real language tooling. CDK Terrain is a prime example of a community pulling together around a better future for IaC: execution-engine agnostic, open source, and aimed at preserving the software-engineering path HashiCorp neglected.</p>

<p>So CDK Terrain and TerraConstructs belong at the center of this argument, not in its footnotes. They are the middle path:</p>

<ul>
  <li>not raw HCL modules</li>
  <li>not a Terragrunt-only hierarchy layered over config</li>
  <li>not CloudFormation-only AWS CDK</li>
  <li>but CDK-style constructs that synthesize to Terraform/OpenTofu</li>
</ul>

<p>That middle path tests the whole thesis. If it works, the right critique of Terragrunt is not “custom tools are bad.” It is that custom tools should move IaC toward ordinary software engineering capabilities, not deeper into bespoke configuration orchestration.</p>

<h2 id="llms-make-the-abstraction-problem-sharper">LLMs make the abstraction problem sharper</h2>

<p>There is a tempting counterargument: if LLMs can write Terraform for us, maybe HCL’s rough edges matter less. Ask an agent for a VPC, an SQS dead-letter pattern, a Lambda integration, a Kubernetes deployment, or an IAM policy, and it can produce pages of plausible HCL in seconds.</p>

<p>That is useful, but it is not a substitute for architecture.</p>

<p>The problem with a messy IaC repository was never merely that humans type too slowly. The problem is that a large system needs stable abstractions, tests, reviewable APIs, versioned reuse, ownership boundaries, and a way to express intent without duplicating implementation detail everywhere. An LLM that generates another thousand lines of HCL can make the repository larger faster. It does not automatically make the design better.</p>

<p>Token economics make this more important, not less. Recent AI-market reporting and pricing analysis point in the same direction: the heavily subsidized era of abundant cheap inference is under pressure from rate limits, enterprise metering, feature restrictions, price changes, investor expectations, and margin discipline. Even if model quality keeps rising, teams should not design their infrastructure workflow around repeatedly regenerating and re-reviewing low-level configuration forever.</p>

<p>The better use of LLMs is at a higher level of intent. Let the model help design or call a tested construct: “give this service an event-driven ingestion path with a dead-letter queue, least-privilege permissions, alarms, and cost tags.” Then the durable artifact should be a software library with tests, defaults, override points, and a public API, not a one-off blob of HCL that must be audited from scratch every time.</p>
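
<p>As a rough illustration of what such a durable artifact might look like, the sketch below defines a hypothetical <code class="language-plaintext highlighter-rouge">event_ingestion</code> function that returns a dict shaped like Terraform JSON configuration. The function name, its parameters, and the tagging default are assumptions made for this example. It covers only the queue, the dead-letter queue, and tags. Permissions and alarms are left out to keep it short.</p>

```python
import json


def event_ingestion(service, max_receive_count=5, tags=None):
    """Hypothetical construct: a queue wired to a dead-letter queue.

    Returns a dict shaped like Terraform JSON configuration. The resource
    and attribute names follow the AWS provider's aws_sqs_queue resource,
    but treat the output as illustrative rather than a vetted module.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    # Default tag first, so callers can override or extend it.
    merged_tags = {"service": service, **(tags or {})}
    dlq_name = f"{service}_dlq"
    return {
        "resource": {
            "aws_sqs_queue": {
                dlq_name: {"name": f"{service}-dlq", "tags": merged_tags},
                service: {
                    "name": service,
                    "tags": merged_tags,
                    # Failed messages move to the DLQ after N receives.
                    "redrive_policy": json.dumps({
                        "deadLetterTargetArn": "${aws_sqs_queue." + dlq_name + ".arn}",
                        "maxReceiveCount": max_receive_count,
                    }),
                },
            }
        }
    }


config = event_ingestion("orders", tags={"cost-center": "platform"})
queues = config["resource"]["aws_sqs_queue"]
policy = json.loads(queues["orders"]["redrive_policy"])
```

<p>An agent asked for “an ingestion path with a dead-letter queue” can call one reviewed function with two arguments instead of emitting the low-level resources again, and the default retry count and the tagging rule live in one tested place.</p>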

<p>This is another reason proper software engineering matters in IaC. Good constructs amortize thinking. They reduce repeated token spend. They reduce repeated human review. They give agents something safer to compose. And they keep the important decisions in versioned, tested code instead of scattering them across generated configuration.</p>

<h2 id="the-critique-of-yet-another-tool">The critique of “yet another tool”</h2>

<p>Terragrunt may still be the right pragmatic choice for teams already invested in Terraform/OpenTofu. If the organization has years of modules, policies, state, providers, and operational workflows, then Terragrunt’s conventions can impose much-needed structure without replacing the whole stack.</p>

<p>But we should be clear-eyed about what kind of solution it is.</p>

<p>Terragrunt is niche tooling for a niche configuration ecosystem. It teaches teams more Terragrunt-specific concepts: <code class="language-plaintext highlighter-rouge">terragrunt.hcl</code>, hierarchical includes, dependency blocks, generated files, stacks, live repo conventions, and wrapper-driven workflows. Those concepts can be effective, but they do not transfer as widely as ordinary software engineering practices.</p>

<p>TerraConstructs is different enough that it should be judged separately. It is still custom IaC tooling, but its abstractions are closer to transferable software concepts: typed classes, construct trees, package-managed libraries, tests, methods, composition, and cross-cutting aspects. That does not automatically make it safe or durable, but it makes it a more serious answer to the “Infrastructure as Code” promise than another layer of HCL orchestration.</p>

<p>A Go package, a Python library, or a TypeScript construct library teaches patterns that apply beyond one IaC tool. Package boundaries, tests, mocks, dependency injection, semantic versioning, monorepo build graphs, and API design are not infrastructure-specific. They are how software teams already manage complexity.</p>

<p>If Infrastructure as Code is supposed to be code, then the default should not be: write config, discover it cannot scale, then add another config wrapper.</p>

<p>The default should be: use mature software engineering tools to model infrastructure as part of the application architecture.</p>

<h2 id="the-better-framing">The better framing</h2>

<p>The point is not to say “Terragrunt is bad.” That would be too easy and not quite fair.</p>

<p>A better position is:</p>

<ol>
  <li>Gruntwork is right about the failure modes of messy IaC repositories.</li>
  <li>Terragrunt is a pragmatic response to what Terraform/OpenTofu lacks: repository structure, reuse, inheritance, and orchestration.</li>
  <li>But Terragrunt also demonstrates the limits of treating IaC as configuration first and code second.</li>
  <li>Cloud infrastructure increasingly encodes software architecture patterns, especially in serverless and managed-service systems.</li>
  <li>Therefore, the long-term direction should be boring, transferable software engineering: packages, tests, types, dependency management, composition, dependency injection, and reusable architecture libraries.</li>
</ol>

<p>The strongest version of the argument is not anti-Terragrunt. It is anti-accidental-specialization.</p>

<p>When the problem is software complexity, the solution should look more like software engineering than another bespoke layer of configuration semantics.</p>]]></content><author><name>Sakul Learning</name></author><category term="infrastructure-as-code" /><category term="iac" /><category term="terraform" /><category term="opentofu" /><category term="terragrunt" /><category term="aws-cdk" /><category term="software-engineering" /><category term="cloud-architecture" /><summary type="html"><![CDATA[Gruntwork is right about the symptom: many Terraform and OpenTofu repositories become messy because plain HCL lacks enough structure for large systems. But Terragrunt may be the wrong kind of cure. It adds another specialized layer around a configuration language when the deeper problem is that cloud infrastructure is increasingly software architecture — and should be organized with ordinary, portable, well-understood software engineering tools.]]></summary></entry><entry><title type="html">AI-Driven Development Lifecycle for Financial Services</title><link href="https://sakul-learning.github.io/2026/06/01/ai-driven-development-lifecycle-financial-services/" rel="alternate" type="text/html" title="AI-Driven Development Lifecycle for Financial Services" /><published>2026-06-01T00:00:00+00:00</published><updated>2026-06-01T00:00:00+00:00</updated><id>https://sakul-learning.github.io/2026/06/01/ai-driven-development-lifecycle-financial-services</id><content type="html" xml:base="https://sakul-learning.github.io/2026/06/01/ai-driven-development-lifecycle-financial-services/"><![CDATA[<h2 id="short-summary">Short summary</h2>

<p>AWS describes an <strong>AI-Driven Development Lifecycle (AI-DLC)</strong> for financial services: a software delivery model where AI agents do more than autocomplete code, but humans still keep oversight and accountability.</p>

<p>The main idea is to place AI across the whole development lifecycle. AI can help create plans, user stories, application code, tests, infrastructure-as-code, documentation, and operational insights. Humans review the work, provide business and regulatory context, approve important decisions, and make sure the output is safe and aligned with the organization’s standards.</p>

<p>I read this as a middle path between two extremes. Fully autonomous AI development can move quickly, but it is risky for regulated industries. Simple AI-assisted coding is easier to control, but it only improves small parts of the workflow. AI-DLC tries to capture more value by letting AI orchestrate larger chunks of delivery while adding governance, traceability, and human checkpoints.</p>

<p>For financial services, the most important theme is not just speed. It is <strong>speed with control</strong>: faster software delivery, but with evidence, audit trails, testing, security, resilience, and approval gates built into the process.</p>

<p>A second source from LCMH makes the idea feel broader than one regulated-industry use case. It frames AI-DLC as a general software-development shift where AI can initiate workflows, decompose work, and propose action plans while humans validate direction and quality. That changes the developer’s role: less pure execution, more judgment, review, and risk ownership.</p>

<p>The LCMH article also adds a useful caution: productivity gains should not be measured only by code volume or velocity. If AI-DLC works, it is because teams combine automation with strong engineering practices such as clear specifications, modular systems, review checkpoints, and human accountability. The interesting question is not just “can AI write more code?” but “can teams safely coordinate more of the lifecycle through AI without losing control?”</p>

<h2 id="future-updates">Future updates</h2>

<ul>
  <li>Compare AWS’s AI-DLC framing with “agentic software engineering” and AI pair-programming workflows.</li>
  <li>Look for examples of teams using AI agents with compliance or audit requirements.</li>
  <li>Explore where human review should be mandatory versus optional.</li>
  <li>Compare financial-services governance needs with broader AI-DLC adoption in less regulated software teams.</li>
  <li>Explore the shift from developer-as-builder to developer-as-reviewer, validator, and decision-maker.</li>
  <li>Check whether reported productivity gains depend on prerequisites such as modular code, strong specs, monorepos, typed languages, and strict review.</li>
  <li>Gather counterexamples: where AI-generated code or infrastructure creates hidden risk.</li>
</ul>

<h2 id="source">Sources</h2>

<ul>
  <li>AWS for Industries: <a href="https://aws.amazon.com/blogs/industries/ai-driven-development-lifecycle-for-financial-services/">AI-Driven Development Lifecycle for Financial Services</a>, published May 26, 2026.</li>
  <li>LCMH: <a href="https://lcmh.fr/en/articles/2026/ai-dlc-transforming-software-development-lifecycle/">AI-DLC: how AI is transforming the software development lifecycle</a>, accessed June 1, 2026.</li>
</ul>]]></content><author><name>Sakul Learning</name></author><category term="ai-development" /><category term="software-engineering" /><category term="financial-services" /><category term="governance" /><summary type="html"><![CDATA[Short summary]]></summary></entry><entry><title type="html">Spec-Driven Development and Specifications</title><link href="https://sakul-learning.github.io/2026/06/01/spec-driven-development-and-specifications/" rel="alternate" type="text/html" title="Spec-Driven Development and Specifications" /><published>2026-06-01T00:00:00+00:00</published><updated>2026-06-01T00:00:00+00:00</updated><id>https://sakul-learning.github.io/2026/06/01/spec-driven-development-and-specifications</id><content type="html" xml:base="https://sakul-learning.github.io/2026/06/01/spec-driven-development-and-specifications/"><![CDATA[<h2 id="short-summary">Short summary</h2>

<p>Software teams often talk about speed as if the main bottleneck is typing code. If code can be generated faster, reviewed faster, merged faster, and deployed faster, then it seems like better software should arrive sooner.</p>

<p>But that misses the deeper constraint.</p>

<p>A lot of software work does not fail because the code was typed too slowly. It fails because the team did not yet know what <strong>good</strong> meant.</p>

<p>That is the useful distinction in Joe’s essay, <strong><a href="https://joe.dev/posts/research-vs-development/">“R&amp;D Is Two Jobs, and Research Doesn’t Run on Autopilot”</a></strong>. Research and development are often compressed into one phrase, but they are different types of work. Research discovers the target. Development executes against a target.</p>

<p>Spec-driven development sits exactly at that boundary.</p>

<p>A specification is not just a document. It is the artifact that turns research into development. It captures the current definition of good well enough that humans, tests, tools, and AI agents can all work against it.</p>

<h2 id="research-finds-the-target">Research finds the target</h2>

<p>Research is the phase where the team is still asking questions such as:</p>

<ul>
  <li>What problem are we solving?</li>
  <li>Who is it for?</li>
  <li>What does success look like?</li>
  <li>What tradeoffs matter?</li>
  <li>Which assumptions are risky?</li>
  <li>Is this even the right thing to build?</li>
</ul>

<p>The output of research is not primarily code. The output is knowledge.</p>

<p>Sometimes that knowledge comes from prototypes. Sometimes it comes from customer conversations. Sometimes it comes from technical spikes. Sometimes it comes from writing down a model and realizing the model does not make sense.</p>

<p>This is why autonomous coding agents struggle when pointed at vague work. They are powerful at producing plausible software, but plausibility is not the same as correctness. If the goal is unclear, an agent can generate something coherent while still missing the actual question.</p>

<p>Joe’s essay describes the software factory as a machine for converging on a target. But it cannot invent the target. Someone still has to define what good means.</p>

<p>That definition is research work.</p>

<h2 id="development-executes-against-the-target">Development executes against the target</h2>

<p>Development begins when the work becomes legible enough to check.</p>

<p>A development-ready task has some combination of:</p>

<ul>
  <li>a clear interface</li>
  <li>a scenario</li>
  <li>an acceptance criterion</li>
  <li>a test</li>
  <li>an example</li>
  <li>a constraint</li>
  <li>a definition of done</li>
</ul>

<p>This is the terrain where agents, automation, CI, test suites, and parallel execution become much more effective.</p>

<p>Once the target is known, the work can be decomposed. A team can say: implement this behavior, satisfy this contract, preserve this invariant, make this failing test pass, or refactor without changing these observable outcomes.</p>
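
<p>One way to picture a development-ready task is as a list of acceptance criteria that can be run. The sketch below is illustrative only: <code class="language-plaintext highlighter-rouge">apply_discount</code> is a hypothetical behavior, and the criteria are invented for the example. Each criterion pairs a sentence with a check, so a human, a CI job, or an agent can all see the same definition of done.</p>

```python
def apply_discount(total_cents, percent):
    """Hypothetical behavior under specification: discount an order total."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return total_cents - (total_cents * percent) // 100


# Each acceptance criterion is a (description, check) pair.
ACCEPTANCE_CRITERIA = [
    ("no discount leaves the total unchanged",
     lambda: apply_discount(1000, 0) == 1000),
    ("a 25 percent discount on 1000 cents yields 750 cents",
     lambda: apply_discount(1000, 25) == 750),
    ("invariant: the result is never negative",
     lambda: all(apply_discount(total, pct) >= 0
                 for total in (0, 1, 999) for pct in (0, 50, 100))),
]


def failing_criteria(criteria):
    """Return the descriptions of criteria whose check does not hold."""
    return [description for description, check in criteria if not check()]


failures = failing_criteria(ACCEPTANCE_CRITERIA)
```

<p>An empty <code class="language-plaintext highlighter-rouge">failures</code> list is the checkable meaning of “done” for this task. If nobody can write the list yet, the work is probably still research.</p>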

<p>In other words, development scales when the specification is good enough.</p>

<h2 id="specifications-are-feedback-devices">Specifications are feedback devices</h2>

<p>The second article, <strong><a href="https://specdriven.com/first-principles/feedback-principle">“The Feedback Principle”</a></strong>, gives another way to understand specifications.</p>

<p>Its central claim is simple: the earlier feedback is detected, the cheaper it is to handle.</p>

<p>A quality function is any activity that provokes feedback. Testing is a quality function. Code review is a quality function. Pair programming, user interviews, architecture review, mockups, BDD scenarios, CI pipelines, and retrospectives are all quality functions because they expose information while there is still time to act on it.</p>

<p>A specification is also a quality function.</p>

<p>A good spec is not valuable because it satisfies process theater. It is valuable because it provokes feedback before code exists.</p>

<p>When a stakeholder reads a scenario and says, “That is not how the workflow actually works,” the spec has done its job. When QA asks, “What happens if this field is missing?” the spec has done its job. When an engineer notices that two requirements contradict each other, the spec has done its job.</p>

<p>The cheapest bug is the one found while it is still a sentence.</p>

<h2 id="shift-feedback-left-into-the-specification">Shift feedback left into the specification</h2>

<p>The Feedback Principle reframes “shift left” as moving quality functions earlier, where feedback is cheaper.</p>

<p>Spec-driven development is one of the strongest forms of shifting left because it moves verification and validation into the design stage.</p>

<p>Before writing production code, the team can ask:</p>

<ul>
  <li>Is this the right behavior?</li>
  <li>Is this behavior complete?</li>
  <li>Can we test it?</li>
  <li>Can an implementer understand it?</li>
  <li>Can an agent execute against it?</li>
  <li>Are the edge cases explicit?</li>
  <li>Are the tradeoffs visible?</li>
  <li>Would a user recognize this as solving their problem?</li>
</ul>

<p>This matters because late feedback is expensive. A missing requirement caught in production may require incident response, rollback, customer communication, and a rewrite. The same missing requirement caught in a specification may require a five-minute conversation and an edited paragraph.</p>

<p>Spec-driven development is not about writing more documentation. It is about making the specification the earliest surface on which feedback can be exercised.</p>

<h2 id="verification-and-validation">Verification and validation</h2>

<p>The Feedback Principle also separates verification from validation:</p>

<ul>
  <li><strong>Verification:</strong> Did we build the system right?</li>
  <li><strong>Validation:</strong> Did we build the right system?</li>
</ul>

<p>Specs connect these two.</p>

<p>A weak specification only supports verification. It lets us check whether the system conforms to what was written, but it does not help us know whether what was written is worth building.</p>

<p>A strong specification supports both. It captures enough product intent to validate the direction, and enough technical precision to verify the implementation.</p>

<p>This is especially important in AI-assisted development. Agents are increasingly good at verification-oriented work: making tests pass, implementing interfaces, following examples, and resolving mechanical issues. But validation still requires context, judgment, and taste.</p>

<p>So the human role does not disappear. It moves upstream.</p>

<p>The human becomes responsible for shaping the target, testing the assumptions, and maintaining the quality of the specification.</p>

<h2 id="the-agent-era-makes-specs-more-important-not-less">The agent era makes specs more important, not less</h2>

<p>There is a tempting but wrong conclusion to draw from coding agents: if code becomes cheap, specs become less important.</p>

<p>The opposite is true.</p>

<p>As code generation becomes cheaper, the cost of unclear intent goes up. A human developer given a vague task might stop and ask questions. An agent may confidently produce a large amount of coherent but misdirected code.</p>

<p>The bottleneck shifts from implementation capacity to target quality.</p>

<p>In an agentic workflow, a specification is not merely a communication artifact between humans. It becomes the control surface for the whole system.</p>

<p>The spec tells the agent:</p>

<ul>
  <li>what to build</li>
  <li>what not to build</li>
  <li>how success will be checked</li>
  <li>which examples matter</li>
  <li>which constraints are non-negotiable</li>
  <li>when the task is complete</li>
</ul>

<p>Without that, the agent is not a factory. It is a high-throughput ambiguity amplifier.</p>
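
<p>The list above can be sketched as a small data structure. This is a minimal illustration under assumptions: <code class="language-plaintext highlighter-rouge">TaskSpec</code>, its field names, and the brief format are hypothetical, not taken from either essay or from any agent framework. The point is that the spec refuses to become an agent brief while sections are still empty.</p>

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical shape for the parts of a spec an agent needs."""
    build: str
    non_goals: list = field(default_factory=list)
    checks: list = field(default_factory=list)
    examples: list = field(default_factory=list)
    constraints: list = field(default_factory=list)
    done_when: str = ""

    def missing_sections(self):
        """Name the sections that are still empty."""
        sections = {
            "non_goals": self.non_goals,
            "checks": self.checks,
            "examples": self.examples,
            "constraints": self.constraints,
            "done_when": self.done_when,
        }
        return [name for name, value in sections.items() if not value]

    def to_brief(self):
        """Render the spec as a plain-text brief; refuse if it has gaps."""
        missing = self.missing_sections()
        if missing:
            raise ValueError("spec is incomplete: " + ", ".join(missing))
        lines = [f"Build: {self.build}"]
        lines += [f"Do not build: {item}" for item in self.non_goals]
        lines += [f"Checked by: {item}" for item in self.checks]
        lines += [f"Example: {item}" for item in self.examples]
        lines += [f"Constraint: {item}" for item in self.constraints]
        lines.append(f"Done when: {self.done_when}")
        return "\n".join(lines)


vague = TaskSpec(build="CSV export for invoices")
gaps = vague.missing_sections()

clear = TaskSpec(
    build="CSV export for invoices",
    non_goals=["PDF export"],
    checks=["the export test suite passes"],
    examples=["an invoice with two line items yields two rows"],
    constraints=["no new runtime dependencies"],
    done_when="all checks pass and the examples are covered by tests",
)
brief = clear.to_brief()
```

<p>The vague spec reports five gaps and cannot be rendered. The clear one produces a six-line brief that names the target, the non-goals, and the stopping condition.</p>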

<h2 id="spec-driven-development-as-a-loop">Spec-driven development as a loop</h2>

<p>Spec-driven development should not be imagined as a waterfall process where a perfect document is written upfront and then handed to implementation.</p>

<p>That would misunderstand both research and feedback.</p>

<p>A better model is a loop:</p>

<ol>
  <li>Explore the problem space.</li>
  <li>Write the current understanding as a specification.</li>
  <li>Provoke feedback from users, QA, engineers, stakeholders, tests, and prototypes.</li>
  <li>Revise the specification.</li>
  <li>Execute against the parts that are now clear.</li>
  <li>Learn from implementation and production.</li>
  <li>Update the specification again.</li>
</ol>

<p>The spec is not the end of thinking. It is the medium through which thinking becomes inspectable.</p>

<h2 id="what-makes-a-specification-useful">What makes a specification useful?</h2>

<p>A useful specification should make feedback easier and cheaper.</p>

<p>That means it should include things like:</p>

<ul>
  <li>concrete scenarios</li>
  <li>examples of expected behavior</li>
  <li>non-goals</li>
  <li>edge cases</li>
  <li>acceptance criteria</li>
  <li>domain language</li>
  <li>constraints</li>
  <li>open questions</li>
  <li>risks and assumptions</li>
  <li>validation notes</li>
  <li>testable outcomes</li>
</ul>

<p>The most important part may be the open questions. A spec that pretends uncertainty does not exist is dangerous. It makes research look like development. It invites the team to execute before the target is known.</p>

<p>A good spec separates what is known from what is assumed.</p>

<p>That distinction determines how much autonomy is safe.</p>

<h2 id="the-leash-length-depends-on-spec-quality">The leash length depends on spec quality</h2>

<p>Joe’s essay makes an important point about agent autonomy: the better you know what good looks like, the longer the leash you can give the agent.</p>

<p>This is a useful principle for spec-driven development generally.</p>

<p>When the spec is vague, keep the loop tight:</p>

<ul>
  <li>shorter tasks</li>
  <li>more human review</li>
  <li>more prototypes</li>
  <li>more stakeholder feedback</li>
  <li>more validation</li>
</ul>

<p>When the spec is clear, the loop can widen:</p>

<ul>
  <li>more automation</li>
  <li>more parallel work</li>
  <li>more autonomous implementation</li>
  <li>more reliance on tests and acceptance criteria</li>
</ul>

<p>The danger is not using agents. The danger is giving development-level autonomy to research-stage work.</p>

<p>Spec quality determines safe autonomy.</p>
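
<p>As a sketch of how that rule could be written down, the function below maps a summary of spec quality to a working mode. Everything here is an assumption made for illustration: the <code class="language-plaintext highlighter-rouge">SpecStatus</code> fields, the three modes, and the order of the checks are not from Joe’s essay.</p>

```python
from dataclasses import dataclass, field


@dataclass
class SpecStatus:
    """Hypothetical summary of how settled a specification is."""
    acceptance_criteria: int  # number of criteria with an automated check
    open_questions: list = field(default_factory=list)
    unvalidated_assumptions: list = field(default_factory=list)


def leash(status):
    """Map spec quality to a working mode. The rules are illustrative."""
    if status.open_questions or status.acceptance_criteria == 0:
        # Still research: the target is not known yet.
        return "tight: short tasks, human review of every step"
    if status.unvalidated_assumptions:
        return "medium: agent implements, human validates assumptions first"
    return "long: parallel autonomous work gated by tests"


research_stage = SpecStatus(
    acceptance_criteria=0,
    open_questions=["who is the export for?"],
)
settled = SpecStatus(acceptance_criteria=12)

research_mode = leash(research_stage)
settled_mode = leash(settled)
```

<p>Open questions are checked first on purpose. A spec with many passing tests and one unresolved question about the target is still research-stage work.</p>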

<h2 id="conclusion">Conclusion</h2>

<p>Spec-driven development is not bureaucracy. It is a way of making software work cheaper, faster, and safer by moving the discovery of mistakes earlier.</p>

<p>Joe’s research-versus-development distinction explains why specifications matter: they define the target that development needs in order to execute well.</p>

<p>The Feedback Principle explains why specifications are economically powerful: they provoke feedback while change is still cheap.</p>

<p>Together, they suggest a simple theory:</p>

<blockquote>
  <p>Research discovers what good looks like.<br />
Specifications make that definition inspectable.<br />
Development executes against it.<br />
Feedback keeps the whole system honest.</p>
</blockquote>

<p>In the age of AI coding agents, this becomes even more important. The future of software development may involve more autonomous implementation, but autonomy only works when the target is clear.</p>

<p>The spec is where that clarity lives.</p>]]></content><author><name>Sakul Learning</name></author><category term="spec-driven-development" /><category term="specifications" /><category term="ai-development" /><category term="software-engineering" /><category term="quality" /><summary type="html"><![CDATA[Short summary]]></summary></entry></feed>