// The work
I direct AI coding agents — Claude Code, Copilot, Codex, others —
through architectural decisions, drive the build, and then
I test what comes back. I don't review the code line
by line; I run it, push the edges (drop a screenshot it hasn't seen,
ask the question the README skips, lower the threshold until it
breaks), and when something doesn't work we hunt down the cause
together. The keystrokes are increasingly not mine. The product
decisions, the validation, and the refusal to ship what hasn't been
proven end-to-end are.
The toolkit varies with the task — spec-kit,
GSD, BMAD, or my own UCAI
plugin to drive the build; Spectre when the spec
itself needs to execute (action-verification gates, deterministic
advance); Pragma to block AI-written tests when
they're gamed; benchmark tables when a decision is worth pinning. The format follows the project; the discipline
doesn't. Hard constraints, SOLID and
DRY applied gradually — sometimes the right answer
is a sharp principle, sometimes it's a grey-zone judgment that an
80%-correct shape now beats a 100%-correct shape later. Docs updated
as the code moves.
// The thesis
Cognition is context-bound. Humans and large language
models are two implementations of the same problem class. The
bottleneck for both is the spec — the completeness of the
context fed in. Most projects, agent systems, and human
collaborations fail not because the engine is dumb, but because the
input was incomplete.
The implication: as agents get more capable, spec authorship
becomes the leverage point. Execution, retrieval, and even reasoning
are commoditising. What stays scarce is the discipline to drive a
spec to asymptotic completeness, and to propagate it between
cognitive nodes — human or silicon — without lossy resampling. The
durable role isn't writing code. It's sensor (observation
outside the model), spec author (completeness as
discipline), adversarial reviewer (attack the spec before
it ships), and dispatcher (route the spec to the right
engine, re-route when one fails).
I treat my own collaboration with agents as an instance of that
frame. The system prompt, the rules I keep, the notes I take and
the hooks I write are all the same thing: a substrate the model
thinks through. Durable behavioural rules live as markdown in a
version-controlled directory, not as advice scattered across
prompts — and they're enforced in two layers. Rules with a clear
action signature ("block git commit if PII present,"
"refuse bash when a dedicated tool fits,"
"stop response if it ends in a check-in question") are
promoted to deterministic hooks that run before tool
calls or after responses; the model never gets a chance to
rationalise its way past them. Rules that depend on conversational
judgment — uncertainty signalling, brevity, deferring — stay as
markdown and are routed in by
bouncer, a small
FastAPI daemon that scores each user prompt against the rule
library and admits only the contextually relevant ones plus a
small always-on core. Same dual layering that ESLint (autofix vs warn)
and OPA (deny vs alert) have used for years on large rule sets —
borrowed, because it works.
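
A minimal sketch of the hook layer, assuming the hook receives the pending tool call as JSON on stdin and blocks by exiting with a non-zero status; the field names and PII patterns are illustrative, not the actual script.

```python
#!/usr/bin/env python3
"""Sketch of a PreToolUse hook: block `git commit` when the staged diff
looks like it contains PII. Field names and patterns are illustrative."""
import json
import re
import subprocess
import sys

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-shaped
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),       # card-number-shaped digit runs
]

call = json.load(sys.stdin)
command = call.get("tool_input", {}).get("command", "")

if call.get("tool_name") == "Bash" and "git commit" in command:
    staged = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True
    ).stdout
    for pattern in PII_PATTERNS:
        if pattern.search(staged):
            print("Blocked: staged diff matches a PII pattern.", file=sys.stderr)
            sys.exit(2)  # deterministic refusal; nothing to rationalise past

sys.exit(0)
```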
Where the four roles land in practice: sensor is the
always-connected operating environment in lab notes;
spec author is the discipline named in the
essay;
adversarial reviewer is
Pragma blocking
gamed AI-written tests; dispatcher is the local-LLM
routing lane and the engine-picking I do per task.
// Writing
Long-form essays where the operating discipline gets argued out, scope-conditioned, and stress-tested against its own falsifiers.
~/essays/context-as-cognitive-substrate
essay
~22 min read · AI engineering · framing concept · published 2026-05 · updated 2026-05-05.
Cognition is bounded by context. Humans and LLMs are two implementations of the same problem class; the bottleneck for both is the spec. The piece distinguishes cognitive, operational, and economic claims, names where the framework breaks (the metis boundary), states falsifiable predictions with binding falsifiers, and ends in a three-rule distillation that survives Tuesday morning.
// Lab notes
Things I run for myself, not shipped products. They earn their
place by carrying the same operating discipline as the published
work — and they're where most of the load testing of that
discipline actually happens.
Continuous-research pipeline always-on
// Python · forked from an academic-paper agent and rebuilt for tech/industry research · writes interlinked Obsidian wiki pages instead of LaTeX.
A 12-stage pipeline organised in four phases — Scout
(parse a topic into 5–8 sub-questions and search queries),
Gather (hit six source backends across web and
academic databases, dedupe, fetch, LLM-screen for relevance,
produce knowledge cards), Analyze (cluster
findings, build comparison tables, then re-read everything
through three deliberately distinct lenses: pragmatist
— what's usable today, contrarian — what are the risks
nobody is talking about, builder — what could be
assembled on top), and Synthesize (outline,
draft, run a quality + slop-detection pass, revise, write to
the vault).
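
A sketch of that phase structure, with the names taken from the description above and the bodies reduced to placeholders; the payload shape is an assumption, not the real pipeline.

```python
"""Sketch of the four-phase structure; names follow the description above,
bodies and the payload shape are placeholders, not the real pipeline."""
from typing import Any

Payload = dict[str, Any]

def scout(payload: Payload) -> Payload:
    # Parse the topic into 5-8 sub-questions and search queries.
    payload["sub_questions"] = []
    return payload

def gather(payload: Payload) -> Payload:
    # Hit the source backends, dedupe, fetch, LLM-screen, emit knowledge cards.
    payload["cards"] = []
    return payload

def analyze(payload: Payload) -> Payload:
    # Cluster findings, build comparison tables, re-read through the three lenses.
    payload["findings"] = {lens: [] for lens in ("pragmatist", "contrarian", "builder")}
    return payload

def synthesize(payload: Payload) -> Payload:
    # Outline, draft, quality + slop pass, revise, write wiki pages to the vault.
    payload["pages"] = []
    return payload

def run(topic: str) -> Payload:
    payload: Payload = {"topic": topic}
    for phase in (scout, gather, analyze, synthesize):
        payload = phase(payload)
    return payload
```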
It runs on a schedule against a topic queue. Output is wiki
pages, not reports — every entity gets its own page with
backlinks, so a year of research compounds into a navigable
graph instead of an inbox of PDFs. The same vault feeds the
rule router behind bouncer,
so research findings can become routable rules without leaving
their original context. A curated subset of the corpus ships
publicly under /notes/ — reports, weekly
syntheses, semantic-bridge threads, and a TimesFM-based
research-rate forecast.
Always-connected operating environment always-on
// two always-on machines + workstation + phone, mesh-VPN'd · no public ports · watchdogs feed a private chat bot · automatic recovery before anything pages.
The machines I work on are stitched together by an encrypted
mesh, not exposed to the public internet. From a phone I can
SSH into any of them, hit private web UIs (vault browser,
session browser, dashboards, local LLM frontend), and watch
services without opening a single port to the world. Same
access from a laptop on a coffee-shop network. The discipline
this enforces: no service ever needs a public hostname to be
useful, which removes a whole class of attack surface and
keeps the threat model honest.
Underneath that there's a layered watchdog stack — heartbeat
logger (one-minute ring buffer for post-mortems), connectivity
watchdog (link / gateway / internet / mesh), system health
(disk, critical services, thermal), index-freshness watchdog
(catches stale background jobs), filesystem-event watcher that
reloads the rule router the moment a rule file changes locally
or arrives via sync from another machine. Failures recover
automatically where they can; what can't be auto-fixed pages
me on a private chat bot with the diagnostic already attached.
Backups are timer-driven with jitter and persistence, so a
missed window fires on next boot rather than going silent.
- Boot notify — every reboot pings the bot, so I know unattended hardware came back up.
- Belt-and-braces reloads — primary path is the inotify watcher (sketched after this list); a 15-minute cron does an idempotent re-poke in case the watcher dies.
- Logs-as-context — every watchdog writes a self-truncating log, which is exactly the surface a future agent will read when something needs explaining.
- No-secrets-in-repos — credentials live in mode 0600 root-owned files on the box that needs them, never in git.
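
The primary reload path from the list above, sketched with the Python watchdog library; the rules directory and the daemon's /reload endpoint are assumptions, not the actual setup.

```python
"""Sketch of the inotify-style primary reload path; the rules directory and
the daemon's /reload endpoint are assumptions, not the actual setup."""
import time
from pathlib import Path

import requests
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

RULES_DIR = Path.home() / "rules"   # version-controlled markdown rule library

class RuleChanged(FileSystemEventHandler):
    def on_any_event(self, event):
        if event.src_path.endswith(".md"):
            # Idempotent by design: harmless if the 15-minute cron already re-poked.
            requests.post("http://127.0.0.1:8007/reload", timeout=5)

observer = Observer()
observer.schedule(RuleChanged(), str(RULES_DIR), recursive=True)
observer.start()
try:
    while True:
        time.sleep(60)
finally:
    observer.stop()
    observer.join()
```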
None of this is a product; it's the substrate the products
grow on. It's also the reason I trust agents to do
long-running autonomous work — the box is observable, the
failure modes are caught, and the recovery is automatic.
Local-LLM lane privacy + cost
// integrated GPU on a small home server · GGUF runtime · web frontend on the private mesh.
A local-inference lane sits next to the cloud-frontier lane.
Small dense models capped around 9B for fast
general-purpose work; larger models only when they're
Mixture-of-Experts with a small active
expert per token (so a 30B-class model with a 3–4B active
expert runs comfortably on integrated graphics, while a dense
30B would not). I don't try to match frontier-model quality
locally — that's the wrong target.
What it's actually for: privacy lane for
anything I'd rather not send to a third-party API
(vault-capture summarisation, draft passes over personal
notes, peer-review of plans I haven't published yet);
cost-zero secondary opinion when one model's
answer needs sanity-checking but a second cloud round-trip
isn't worth it; offline fallback for short
internet outages. The discipline: local inference is a
fallback and a privacy lane, not a replacement for
frontier models on hard reasoning. Routing the right task to
the right engine is part of the dispatcher role from
the essay;
this is that role made physical.
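
A sketch of that routing decision; the task tags and the lane names are assumptions for illustration, not the actual config.

```python
"""Sketch of the lane split described above; task tags and the decision rule
are assumptions for illustration, not the actual routing config."""
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    private: bool = False          # vault captures, unpublished plans, personal notes
    second_opinion: bool = False   # sanity-check an answer already in hand
    hard_reasoning: bool = False   # architecture, novel debugging, long-horizon work

def route(task: Task, online: bool = True) -> str:
    # Local lane: privacy, zero-cost second opinions, short offline windows.
    if task.private or task.second_opinion or not online:
        return "local"   # GGUF runtime: <=9B dense, or MoE with a 3-4B active expert
    # Everything else, and especially hard reasoning, goes to the frontier lane.
    return "frontier"

route(Task("summarise today's vault captures", private=True))         # -> "local"
route(Task("design the retry semantics", hard_reasoning=True))        # -> "frontier"
```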
// What I enforce
Each rule below is either checked by code (a hook that blocks the
action) or written as a single imperative trigger that the routing
daemon re-injects when the situation calls for it. Rules that
resist both forms don't get captured.
01
NEVER REINVENT. Search GitHub and the literature before building. Use as-is, build on top, or borrow parts. The cost of a 30-minute search beats the cost of maintaining a 200-line reinvention forever.
02
AUTOMATION MUST BE TRANSPARENT, NOT OPT-IN. Solutions that require me to remember to invoke them just relocate cognitive load. Hooks and watchdogs that fire on their own are acceptable; slash commands and "run X first" instructions are not.
03
EVERY RULE NEEDS AN ENFORCER. If a rule can be machine-checked (paths, secrets, tool selection), it's a hook. If not, it's a one-line trigger written in the imperative. A rule that resists both forms is a preference, not a rule.
04
VERIFY BEFORE CLAIMING DONE. Tests run, logs checked, behaviour diffed. Applies to me, applies harder to agents — they will confidently report success on work they didn't do unless asked to prove it.
05
GROUND PERSONAL AND TIME-SENSITIVE QUESTIONS IN SOURCE. Questions about systems I've built, my preferences, or state past the model's training cutoff get answered from the project's source-of-truth files first, not from training-data pattern matching.
06
BREVITY IS A CORRECTNESS PROPERTY. Padded LLM output isn't a style problem; it's a context budget burning at scale. Strip recaps, hedging, restatements. Lead with the answer; caveats only when they change it. Enforced by an output-validator hook that blocks responses matching known padding patterns.
07
MOMENTUM OVER META. When work concludes, the system stops. No check-in questions, no unsolicited next-steps. The agent's job is forward motion against the spec; meta-commentary is a tax on the next prompt.
Rules live as plain markdown files, organised by trigger context.
bouncer handles the
routing into the agent's per-turn context; a separate output-validator
hook blocks responses that match known violation patterns
(trailing check-in questions, recap sections, padded preambles).
The hot list of behavioural rules stops being prose I have to remind
the model about and becomes a structural part of the loop.
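
The output-validator side, sketched; it assumes the hook reads the response as JSON on stdin and blocks by exiting non-zero, and the patterns are examples rather than the real list.

```python
"""Sketch of the output-validator hook: block responses that end in a check-in
question, open with a padded preamble, or grow a recap section. The stdin field
and the exit-code interface are assumptions; patterns are examples."""
import json
import re
import sys

VIOLATIONS = [
    re.compile(r"(would you like me to|want me to|shall i|let me know if)[^.\n]*\?\s*$", re.I),
    re.compile(r"^(great|sure|certainly|absolutely)[,!]", re.I),   # padded preamble
    re.compile(r"^#+\s*(summary|recap)\b", re.I | re.M),           # recap sections
]

response = json.load(sys.stdin).get("last_message", "")
for pattern in VIOLATIONS:
    if pattern.search(response):
        print(f"Output validator: blocked on /{pattern.pattern}/", file=sys.stderr)
        sys.exit(2)
sys.exit(0)
```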
// Selected projects
Three shipped artifacts that carry the operating discipline as code.
bouncer is the rule-router daemon, UCAI is the build-driver plugin,
Spectre is the deterministic spec executor. All
11 projects live on the projects page.
bouncer
Python · MIT · FastAPI daemon · per-rule activation predicates (regex + semantic) · the residual layer under deterministic hooks.
The packaged version of the rule-routing daemon from
the thesis. Caps per-prompt injection by
trigger_keywords regex + per-rule embedding
prototype, so a 30+ rule library stays usable without diluting
attention. Rules with deterministic action signatures get
promoted out to PreToolUse / Stop hooks; bouncer is
the residual layer for rules that need conversational judgment.
A private companion tool reads agent transcripts and clusters
recurring corrections — that's the discovery loop that decides
which patterns become rules and which rules become hooks.
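
A sketch of the admission step; the scoring, threshold, and per-prompt cap are assumptions, and the prompt embedding is taken as a pre-computed vector.

```python
"""Sketch of the admission logic: regex trigger plus an embedding-prototype
score, capped per prompt. Threshold, cap, and field names are assumptions."""
import re
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    text: str                  # the one-line imperative injected into context
    trigger_keywords: str      # regex matched against the user prompt
    prototype: list[float]     # embedding centroid of prompts where the rule applied
    always_on: bool = False

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def admit(prompt: str, prompt_emb: list[float], rules: list[Rule],
          cap: int = 5, threshold: float = 0.35) -> list[Rule]:
    core = [r for r in rules if r.always_on]            # small always-on core
    scored = []
    for rule in rules:
        if rule.always_on:
            continue
        regex_hit = bool(re.search(rule.trigger_keywords, prompt, re.IGNORECASE))
        similarity = cosine(prompt_emb, rule.prototype)
        if regex_hit or similarity >= threshold:
            scored.append((similarity + regex_hit, rule))   # regex hit acts as a bonus
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return core + [rule for _, rule in scored[:cap]]        # capped contextual slice
```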
UCAI
JavaScript · MIT · Claude Code plugin, marketplace-installable · v2.2 with programmatic phase enforcement.
A Claude Code plugin that solves the same problems as community
frameworks (GSD, BMAD, Ralph, Agent OS) — but using Claude Code's
native architecture instead of fighting it. v2.2 adds programmatic
phase enforcement via a companion
contingency engine.
Spectre
JavaScript · MIT · Claude Code plugin · deterministic spec-driven implementation engine · backs the sdl-vision-engine plugin.
A spec-locked executor: distill a vague vision into a
First-Principles spec with action/verification steps, then run
the spec deterministically — each step's action gated on its
verification, retry once on fail, scratchpad advances. Halts
on any unrecoverable failure with full negative knowledge so
the next iteration starts from what didn't work, not
from scratch. The discipline this whole portfolio claims,
packaged as a plugin.
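
The gating loop, sketched in Python for brevity (the plugin itself is JavaScript); the step shape and field names are assumptions drawn from the description above.

```python
"""Sketch of the action/verification gating loop; step shape, retry count,
and scratchpad fields follow the description above, names are assumptions."""
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    action: Callable[[], None]    # do the thing
    verify: Callable[[], bool]    # prove the thing happened

@dataclass
class Scratchpad:
    completed: list[str] = field(default_factory=list)
    negative_knowledge: list[str] = field(default_factory=list)

def run_spec(steps: list[Step]) -> Scratchpad:
    pad = Scratchpad()
    for step in steps:
        for attempt in (1, 2):                    # retry once on fail
            step.action()
            if step.verify():
                pad.completed.append(step.name)   # scratchpad advances
                break
            pad.negative_knowledge.append(f"{step.name}: attempt {attempt} failed verification")
        else:
            # Unrecoverable: halt with everything that did not work attached.
            raise SystemExit(f"halted at {step.name}; negative knowledge: {pad.negative_knowledge}")
    return pad
```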
All 11 projects →
// Looking for
- AI product engineering roles where the orchestration story is the point.
- Teams that already trust AI agents in their pipeline and want someone who can direct them with measurement discipline.
- Remote or Kraków-based. Open to short-term contracts and full-time.