← All articles

Building Prompt Workbench: What Happens When a Book and a Platform Grow Up Together

In my last post, I introduced the projects I'm actively building — the AI-native SaaS apps, the VS Codium extension suite, a DSL for acceptance criteria, and two books in progress. This post goes deep on the first of those: Prompt Workbench, the platform for managing, versioning, testing, and red-teaming the prompts and system instructions that drive LLM-powered features.

The Origin Story

Prompt Workbench didn't start as a product idea. It started as a research problem. I was writing PromptOS — a 23-document practitioner reference covering everything from core prompt patterns through adversarial security testing and production observability — and I kept running into the same issue every technical author runs into eventually: a pattern documented in prose is a claim, not a proof. I wanted to know whether the patterns I was writing about actually held up against real prompts and real failure modes, not just whether they sounded reasonable on the page.

So the platform and the book started growing up together. Every prompt pattern, anti-pattern, and debugging ritual I documented got built into Prompt Workbench as a working feature — a red-team command that adversarially probes a prompt, a failure-classification command that categorizes why a response went wrong, a benchmark-scoring command that turns "this feels better" into a number you can track over time. The book taught me what to build; building it taught me where the book was wrong. That loop is still running.

The Architecture: One App, Not a Distributed System

I've built several "Micro-Apps" composed of an Angular shell lazy-loading micro-frontends, NestJS microservices behind an API gateway, and supporting platform services — auth with zero-trust RBAC, rate limiting, and the rest. Prompt Workbench is not one of those apps — it doesn't need to be. It's simpler than that: one NestJS API, one Angular front end.

What keeps that single API from turning into a 3,000-line service class is CQRS — commands and queries flowing through @nestjs/cqrs. Every capability the platform exposes lives behind its own handler: RedTeamHandler, ClassifyFailureHandler, SuggestRepairHandler, ScoreBenchmarkHandler, GenerateReportHandler, GenerateTestBatteryHandler. A controller doesn't call a service method directly — it dispatches a command (RedTeamCommand, say) onto a bus, and whichever handler is registered for that command picks it up. In practice, this means I can add a seventh capability without touching the other six, and I can unit-test a handler in complete isolation from the HTTP layer that eventually calls it. For a solo-built platform that keeps accumulating capabilities as the book grows, that isolation is what keeps the codebase honest instead of turning into the kind of tangled service I've had to untangle in the past.

Microservices and Module Federation solve real problems — coordinating multiple teams, deploying pieces independently — but Prompt Workbench doesn't have those problems yet, and CQRS inside one app gets me the modularity benefit without the operational cost of a distributed system I don't need.

The AI Router Refactor

For most of Prompt Workbench's life, it talked to language models through three separate, hand-rolled adapters — one for Claude, one for Ollama, one for OpenAI — each with its own request shape, its own error handling, its own quirks. That worked until it didn't: every time I wanted to add a provider or change how a model got selected, I was touching three different files that didn't agree with each other on what "success" even looked like.

The fix was a five-phase refactor that replaced all three adapters with a single AiRuntimeAdapter, backed by a shared ai-runtime package with one router and one set of provider adapters underneath it: Anthropic, Cerebras, LM Studio, Ollama, and OpenAI. Any Prompt Workbench capability that needs a model call — red-teaming, classification, benchmark scoring — goes through the same router, the same error contract, the same reason codes.

The part I'd actually recommend other people copy isn't the multi-provider support — it's the failure behavior. The router resolves deterministically: user-selected model, then the app's default model, then a hardcoded global fallback, and if none of those resolve to an available provider, it fails fast with an explicit reason code — model_not_found, model_deprecated, provider_restricted, capability_mismatch — instead of silently substituting a different model than the one you asked for. I've been burned before by systems that "helpfully" degrade to a cheaper or different model without telling you, and then you spend an afternoon debugging output quality that dropped for no reason you can see in the logs. This router would rather tell you plainly that it can't do what you asked than quietly do something else.

Lesson learned, the honest version: the original plan for this refactor was to retire the Ollama and OpenAI adapters outright and route everything through Anthropic and Cerebras. Partway through, once the shared ai-runtime package existed and I could see what it actually looked like in practice, I changed course — converting those two adapters into proper provider adapters inside the shared router instead of deleting them. That's not a plan failure; it's what building the real thing teaches you that a design doc can't. The other lesson, from a closely related piece of work on the same shared router: a provider that returns "no result found" and a provider that's actually unavailable are not the same outcome, and treating them the same lets a downstream cache quietly poison itself with a false negative. Fixing that meant making "unavailable" a first-class failure that throws, not a value that flows through as if everything worked. Small distinction, real consequence if you skip it.

A Side Note on Job Search Studio

While I was building the AI-native SaaS work above, I also shipped Job Search Studio, a job-search automation platform, for a reason that has nothing to do with architecture. I watched a wave of AI-driven layoffs hit experienced developers — the people who knew their companies' systems, customers, and quirks better than any onboarding doc ever could — and I wanted to build something useful for the people on the receiving end of those decisions.

If you're an employer reading this: before you lay off the people holding your institutional knowledge in favor of a model that doesn't have it, price out what it costs to rebuild that knowledge from scratch, with worse documentation than you think you have. #TrainAndRetain your talent. It's usually cheaper than the alternative, and the alternative rarely looks as cheap in year two as it did on the slide that justified it.

What This Makes Possible

None of this is complicated technology by industry standards — CQRS and a provider router are both well-understood patterns. What it gets me, as a team of one, is the ability to keep adding capability without the codebase fighting back. I can add an eighth CQRS handler next month without touching the other seven. I can add a sixth model provider without rewriting how the other five work. That's the actual payoff of treating architecture as a discipline you practice on real, shipping software: not impressiveness, just the compounding ability to keep building without grinding to a halt under your own previous decisions.

Next in this series: a tour of the VS Codium extension suite — Prompt Studio, Infographic Studio, Compliance Studio, and Chat Panel — and the shared offline-first platform layer underneath all four of them.

← All articles