// The work
I direct AI coding agents — Claude Code, Copilot, Codex, others —
through architectural decisions, drive the build, and then
I test what comes back. I don't review the code line
by line; I run it, push the edges (drop a screenshot it hasn't seen,
ask the question the README skips, lower the threshold until it
breaks), and when something doesn't work we hunt down the cause
together. The keystrokes are increasingly not mine. The product
decisions, the validation, and the refusal to ship what hasn't been
proven end-to-end are.
The toolkit varies with the task — spec-kit,
GSD, BMAD, or my own UCAI
plugin to drive the build; Spectre when the spec
itself needs to execute (action-verification gates, deterministic
advance); Pragma to block AI-written tests when
they're gamed; benchmark tables when a decision is worth pinning. The format follows the project; the discipline
doesn't. Hard constraints, SOLID and
DRY applied gradually — sometimes the right answer
is a sharp principle, sometimes it's a grey-zone judgment that an
80%-correct shape now beats a 100%-correct shape later. Docs updated
as the code moves.
// The thesis
Cognition is context-bound. Humans and large language
models are two implementations of the same problem class. The
bottleneck for both is the spec — the completeness of the
context fed in. Most projects, agent systems, and human
collaborations fail not because the engine is dumb, but because the
input was incomplete.
The implication: as agents get more capable, spec authorship
becomes the leverage point. Execution, retrieval, and even reasoning
are commoditising. What stays scarce is the discipline to drive a
spec to asymptotic completeness, and to propagate it between
cognitive nodes — human or silicon — without lossy resampling. The
durable role isn't writing code. It's sensor (observation
outside the model), spec author (completeness as
discipline), adversarial reviewer (attack the spec before
it ships), and dispatcher (route the spec to the right
engine, re-route when one fails).
I treat my own collaboration with agents as an instance of that
frame. The system prompt, the rules I keep, the notes I take and
the hooks I write are all the same thing: a substrate the model
thinks through. Durable behavioural rules live as markdown in a
version-controlled directory, not as advice scattered across
prompts — and they're enforced in two layers. Rules with a clear
action signature ("block git commit if PII present,"
"refuse bash when a dedicated tool fits,"
"stop response if it ends in a check-in question") are
promoted to deterministic hooks that run before tool
calls or after responses; the model never gets a chance to
rationalise its way past them. Rules that depend on conversational
judgment — uncertainty signalling, brevity, deferring — stay as
markdown and are routed in by
bouncer, a small
FastAPI daemon that scores each user prompt against the rule
library and admits only the contextually relevant ones plus a
small always-on core. Same dual layering that ESLint (autofix vs warn)
and OPA (deny vs alert) have used for years on large rule sets —
borrowed, because it works.
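
A minimal sketch of the hook layer, assuming the hook receives the pending tool call as JSON on stdin and blocks by exiting with a non-zero status; the field names and PII patterns are illustrative, not the actual script.

```python
#!/usr/bin/env python3
"""Sketch of a PreToolUse hook: block `git commit` when the staged diff
looks like it contains PII. Field names and patterns are illustrative."""
import json
import re
import subprocess
import sys

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-shaped
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),       # card-number-shaped digit runs
]

call = json.load(sys.stdin)
command = call.get("tool_input", {}).get("command", "")

if call.get("tool_name") == "Bash" and "git commit" in command:
    staged = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True
    ).stdout
    for pattern in PII_PATTERNS:
        if pattern.search(staged):
            print("Blocked: staged diff matches a PII pattern.", file=sys.stderr)
            sys.exit(2)  # deterministic refusal; nothing to rationalise past

sys.exit(0)
```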
Where the four roles land in practice: sensor is the
always-connected operating environment in lab notes;
spec author is the discipline named in the
essay;
adversarial reviewer is
Pragma blocking
gamed AI-written tests; dispatcher is the local-LLM
routing lane and the engine-picking I do per task.
// Writing
Long-form essays where the operating discipline gets argued out, scope-conditioned, and stress-tested against its own falsifiers.
~/essays/context-as-cognitive-substrate
essay
~22 min read · AI engineering · framing concept · published 2026-05 · updated 2026-05-05.
Cognition is bounded by context. Humans and LLMs are two implementations of the same problem class; the bottleneck for both is the spec. The piece distinguishes cognitive, operational, and economic claims, names where the framework breaks (the metis boundary), states falsifiable predictions with binding falsifiers, and ends in a three-rule distillation that survives Tuesday morning.
// Lab notes
Things I run for myself, not shipped products. They earn their
place by carrying the same operating discipline as the published
work — and they're where most of the load testing of that
discipline actually happens.
Continuous-research pipeline always-on
// Python · forked from an academic-paper agent and rebuilt for tech/industry research · writes interlinked Obsidian wiki pages instead of LaTeX.
A 12-stage pipeline organised in four phases — Scout
(parse a topic into 5–8 sub-questions and search queries),
Gather (hit six source backends across web and
academic databases, dedupe, fetch, LLM-screen for relevance,
produce knowledge cards), Analyze (cluster
findings, build comparison tables, then re-read everything
through three deliberately distinct lenses: pragmatist
— what's usable today, contrarian — what are the risks
nobody is talking about, builder — what could be
assembled on top), and Synthesize (outline,
draft, run a quality + slop-detection pass, revise, write to
the vault).
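
A sketch of that phase structure, with the names taken from the description above and the bodies reduced to placeholders; the payload shape is an assumption, not the real pipeline.

```python
"""Sketch of the four-phase structure; names follow the description above,
bodies and the payload shape are placeholders, not the real pipeline."""
from typing import Any

Payload = dict[str, Any]

def scout(payload: Payload) -> Payload:
    # Parse the topic into 5-8 sub-questions and search queries.
    payload["sub_questions"] = []
    return payload

def gather(payload: Payload) -> Payload:
    # Hit the source backends, dedupe, fetch, LLM-screen, emit knowledge cards.
    payload["cards"] = []
    return payload

def analyze(payload: Payload) -> Payload:
    # Cluster findings, build comparison tables, re-read through the three lenses.
    payload["findings"] = {lens: [] for lens in ("pragmatist", "contrarian", "builder")}
    return payload

def synthesize(payload: Payload) -> Payload:
    # Outline, draft, quality + slop pass, revise, write wiki pages to the vault.
    payload["pages"] = []
    return payload

def run(topic: str) -> Payload:
    payload: Payload = {"topic": topic}
    for phase in (scout, gather, analyze, synthesize):
        payload = phase(payload)
    return payload
```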
It runs on a schedule against a topic queue. Output is wiki
pages, not reports — every entity gets its own page with
backlinks, so a year of research compounds into a navigable
graph instead of an inbox of PDFs. The same vault feeds the
rule router behind bouncer,
so research findings can become routable rules without leaving
their original context. A curated subset of the corpus ships
publicly under /notes/ — reports, weekly
syntheses, semantic-bridge threads, and a TimesFM-based
research-rate forecast.
Always-connected operating environment always-on
// two always-on machines + workstation + phone, mesh-VPN'd · no public ports · watchdogs feed a private chat bot · automatic recovery before anything pages.
The machines I work on are stitched together by an encrypted
mesh, not exposed to the public internet. From a phone I can
SSH into any of them, hit private web UIs (vault browser,
session browser, dashboards, local LLM frontend), and watch
services without opening a single port to the world. Same
access from a laptop on a coffee-shop network. The discipline
this enforces: no service ever needs a public hostname to be
useful, which removes a whole class of attack surface and
keeps the threat model honest.
Underneath that there's a layered watchdog stack — heartbeat
logger (one-minute ring buffer for post-mortems), connectivity
watchdog (link / gateway / internet / mesh), system health
(disk, critical services, thermal), index-freshness watchdog
(catches stale background jobs), filesystem-event watcher that
reloads the rule router the moment a rule file changes locally
or arrives via sync from another machine. Failures recover
automatically where they can; what can't be auto-fixed pages
me on a private chat bot with the diagnostic already attached.
Backups are timer-driven with jitter and persistence, so a
missed window fires on next boot rather than going silent.
- Boot notify — every reboot pings the bot, so I know unattended hardware came back up.
- Belt-and-braces reloads — primary path is the inotify watcher (sketched after this list); a 15-minute cron does an idempotent re-poke in case the watcher dies.
- Logs-as-context — every watchdog writes a self-truncating log, which is exactly the surface a future agent will read when something needs explaining.
- No-secrets-in-repos — credentials live in mode 0600 root-owned files on the box that needs them, never in git.
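
The primary reload path from the list above, sketched with the Python watchdog library; the rules directory and the daemon's /reload endpoint are assumptions, not the actual setup.

```python
"""Sketch of the inotify-style primary reload path; the rules directory and
the daemon's /reload endpoint are assumptions, not the actual setup."""
import time
from pathlib import Path

import requests
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

RULES_DIR = Path.home() / "rules"   # version-controlled markdown rule library

class RuleChanged(FileSystemEventHandler):
    def on_any_event(self, event):
        if event.src_path.endswith(".md"):
            # Idempotent by design: harmless if the 15-minute cron already re-poked.
            requests.post("http://127.0.0.1:8007/reload", timeout=5)

observer = Observer()
observer.schedule(RuleChanged(), str(RULES_DIR), recursive=True)
observer.start()
try:
    while True:
        time.sleep(60)
finally:
    observer.stop()
    observer.join()
```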
None of this is a product; it's the substrate the products
grow on. It's also the reason I trust agents to do
long-running autonomous work — the box is observable, the
failure modes are caught, and the recovery is automatic.
Local-LLM lane privacy + cost
// integrated GPU on a small home server · GGUF runtime · web frontend on the private mesh.
A local-inference lane sits next to the cloud-frontier lane.
Small dense models capped around 9B for fast
general-purpose work; larger models only when they're
Mixture-of-Experts with a small active
expert per token (so a 30B-class model with a 3–4B active
expert runs comfortably on integrated graphics, while a dense
30B would not). I don't try to match frontier-model quality
locally — that's the wrong target.
What it's actually for: privacy lane for
anything I'd rather not send to a third-party API
(vault-capture summarisation, draft passes over personal
notes, peer-review of plans I haven't published yet);
cost-zero secondary opinion when one model's
answer needs sanity-checking but a second cloud round-trip
isn't worth it; offline fallback for short
internet outages. The discipline: local inference is a
fallback and a privacy lane, not a replacement for
frontier models on hard reasoning. Routing the right task to
the right engine is part of the dispatcher role from
the essay;
this is that role made physical.
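
A sketch of that routing decision; the task tags and the lane names are assumptions for illustration, not the actual config.

```python
"""Sketch of the lane split described above; task tags and the decision rule
are assumptions for illustration, not the actual routing config."""
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    private: bool = False          # vault captures, unpublished plans, personal notes
    second_opinion: bool = False   # sanity-check an answer already in hand
    hard_reasoning: bool = False   # architecture, novel debugging, long-horizon work

def route(task: Task, online: bool = True) -> str:
    # Local lane: privacy, zero-cost second opinions, short offline windows.
    if task.private or task.second_opinion or not online:
        return "local"   # GGUF runtime: <=9B dense, or MoE with a 3-4B active expert
    # Everything else, and especially hard reasoning, goes to the frontier lane.
    return "frontier"

route(Task("summarise today's vault captures", private=True))         # -> "local"
route(Task("design the retry semantics", hard_reasoning=True))        # -> "frontier"
```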
// What I enforce
Each rule below is either checked by code (a hook that blocks the
action) or written as a single imperative trigger that the routing
daemon re-injects when the situation calls for it. Rules that
resist both forms don't get captured.
01
NEVER REINVENT. Search GitHub and the literature before building. Use as-is, build on top, or borrow parts. The cost of a 30-minute search beats the cost of maintaining a 200-line reinvention forever.
02
AUTOMATION MUST BE TRANSPARENT, NOT OPT-IN. Solutions that require me to remember to invoke them just relocate cognitive load. Hooks and watchdogs that fire on their own are acceptable; slash commands and "run X first" instructions are not.
03
EVERY RULE NEEDS AN ENFORCER. If a rule can be machine-checked (paths, secrets, tool selection), it's a hook. If not, it's a one-line trigger written in the imperative. A rule that resists both forms is a preference, not a rule.
04
VERIFY BEFORE CLAIMING DONE. Tests run, logs checked, behaviour diffed. Applies to me, applies harder to agents — they will confidently report success on work they didn't do unless asked to prove it.
05
GROUND PERSONAL AND TIME-SENSITIVE QUESTIONS IN SOURCE. Questions about systems I've built, my preferences, or state past the model's training cutoff get answered from the project's source-of-truth files first, not from training-data pattern matching.
06
BREVITY IS A CORRECTNESS PROPERTY. Padded LLM output isn't a style problem; it's a context budget burning at scale. Strip recaps, hedging, restatements. Lead with the answer; caveats only when they change it. Enforced by an output-validator hook that blocks responses matching known padding patterns.
07
MOMENTUM OVER META. When work concludes, the system stops. No check-in questions, no unsolicited next-steps. The agent's job is forward motion against the spec; meta-commentary is a tax on the next prompt.
Rules live as plain markdown files, organised by trigger context.
bouncer handles the
routing into the agent's per-turn context; a separate output-validator
hook blocks responses that match known violation patterns
(trailing check-in questions, recap sections, padded preambles).
The hot list of behavioural rules stops being prose I have to remind
the model about and becomes a structural part of the loop.
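
The output-validator side, sketched; it assumes the hook reads the response as JSON on stdin and blocks by exiting non-zero, and the patterns are examples rather than the real list.

```python
"""Sketch of the output-validator hook: block responses that end in a check-in
question, open with a padded preamble, or grow a recap section. The stdin field
and the exit-code interface are assumptions; patterns are examples."""
import json
import re
import sys

VIOLATIONS = [
    re.compile(r"(would you like me to|want me to|shall i|let me know if)[^.\n]*\?\s*$", re.I),
    re.compile(r"^(great|sure|certainly|absolutely)[,!]", re.I),   # padded preamble
    re.compile(r"^#+\s*(summary|recap)\b", re.I | re.M),           # recap sections
]

response = json.load(sys.stdin).get("last_message", "")
for pattern in VIOLATIONS:
    if pattern.search(response):
        print(f"Output validator: blocked on /{pattern.pattern}/", file=sys.stderr)
        sys.exit(2)
sys.exit(0)
```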
// Selected projects
Three shipped artifacts that carry the operating discipline as code.
bouncer is the rule-router daemon, UCAI is the build-driver plugin,
Spectre is the deterministic spec executor. All
11 projects live on the projects page.
bouncer
Python · MIT · FastAPI daemon · per-rule activation predicates (regex + semantic) · the residual layer under deterministic hooks.
The packaged version of the rule-routing daemon from
the thesis. Caps per-prompt injection by
trigger_keywords regex + per-rule embedding
prototype, so a 30+ rule library stays usable without diluting
attention. Rules with deterministic action signatures get
promoted out to PreToolUse / Stop hooks; bouncer is
the residual layer for rules that need conversational judgment.
A private companion tool reads agent transcripts and clusters
recurring corrections — that's the discovery loop that decides
which patterns become rules and which rules become hooks.
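
A sketch of the admission step; the scoring, threshold, and per-prompt cap are assumptions, and the prompt embedding is taken as a pre-computed vector.

```python
"""Sketch of the admission logic: regex trigger plus an embedding-prototype
score, capped per prompt. Threshold, cap, and field names are assumptions."""
import re
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    text: str                  # the one-line imperative injected into context
    trigger_keywords: str      # regex matched against the user prompt
    prototype: list[float]     # embedding centroid of prompts where the rule applied
    always_on: bool = False

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def admit(prompt: str, prompt_emb: list[float], rules: list[Rule],
          cap: int = 5, threshold: float = 0.35) -> list[Rule]:
    core = [r for r in rules if r.always_on]            # small always-on core
    scored = []
    for rule in rules:
        if rule.always_on:
            continue
        regex_hit = bool(re.search(rule.trigger_keywords, prompt, re.IGNORECASE))
        similarity = cosine(prompt_emb, rule.prototype)
        if regex_hit or similarity >= threshold:
            scored.append((similarity + regex_hit, rule))   # regex hit acts as a bonus
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return core + [rule for _, rule in scored[:cap]]        # capped contextual slice
```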
UCAI
JavaScript · MIT · Claude Code plugin, marketplace-installable · v2.2 with programmatic phase enforcement.
A Claude Code plugin that solves the same problems as community
frameworks (GSD, BMAD, Ralph, Agent OS) — but using Claude Code's
native architecture instead of fighting it. v2.2 adds programmatic
phase enforcement via a companion
contingency engine.
Spectre
JavaScript · MIT · Claude Code plugin · deterministic spec-driven implementation engine · backs the sdl-vision-engine plugin.
A spec-locked executor: distill a vague vision into a
First-Principles spec with action/verification steps, then run
the spec deterministically — each step's action gated on its
verification, retry once on fail, scratchpad advances. Halts
on any unrecoverable failure with full negative knowledge so
the next iteration starts from what didn't work, not
from scratch. The discipline this whole portfolio claims,
packaged as a plugin.
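
The gating loop, sketched in Python for brevity (the plugin itself is JavaScript); the step shape and field names are assumptions drawn from the description above.

```python
"""Sketch of the action/verification gating loop; step shape, retry count,
and scratchpad fields follow the description above, names are assumptions."""
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    action: Callable[[], None]    # do the thing
    verify: Callable[[], bool]    # prove the thing happened

@dataclass
class Scratchpad:
    completed: list[str] = field(default_factory=list)
    negative_knowledge: list[str] = field(default_factory=list)

def run_spec(steps: list[Step]) -> Scratchpad:
    pad = Scratchpad()
    for step in steps:
        for attempt in (1, 2):                    # retry once on fail
            step.action()
            if step.verify():
                pad.completed.append(step.name)   # scratchpad advances
                break
            pad.negative_knowledge.append(f"{step.name}: attempt {attempt} failed verification")
        else:
            # Unrecoverable: halt with everything that did not work attached.
            raise SystemExit(f"halted at {step.name}; negative knowledge: {pad.negative_knowledge}")
    return pad
```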
All 11 projects →
// Looking for
- AI product engineering roles where the orchestration story is the point.
- Teams that already trust AI agents in their pipeline and want someone who can direct them with measurement discipline.
- Remote or Kraków-based. Open to short-term contracts and full-time.