✦
bob.cx · safe ai coding agents
✦
Safe AI coding agents
in real codebases.
not greenfield. not toy projects. the messy kind.
~ what "safe" means ~
What does safe actually mean here?
A coding agent that works on a 50-line script and a coding agent that works in a 500k-line monorepo are not the same animal.
Once an agent has the keys to a real codebase, the failure modes change. Bad refactors don't just compile and run — they pass tests in surprising ways, regress invariants nobody wrote down, and ship to production through PRs that look fine to a tired reviewer at 4 PM on a Friday.
Safe means: the agent is allowed to be wrong, and the system around it catches that wrongness before customers do.
~ who this is for ~
Who is this for?
- Your team is already using Claude Code, Cursor, Aider, or Copilot — and the question has shifted from "will this work?" to "how do we let it ship without burning the on-call rotation?"
- You have a non-trivial existing codebase. Institutional knowledge lives in heads, conventions, and load-bearing comments.
- You don't want vendor lock-in — you want patterns and tooling you can run yourself.
~ what changes ~
What gets wired up?
- Eval harnesses. Concrete, repeatable scenarios that score whether a change made the system better or worse. Holdout sets so the agent can't memorize the answer.
- Merge-policy guardrails. Automated gates that decide whether an agent PR is safe to land — calibrated risk, paper trail, no vibes.
- Review loops. Agent-on-agent and agent-on-human review with explicit rubrics.
- Sandboxing. Tight, recoverable runtime boundaries so an agent can experiment without breaking the workspace.
- Observability. Every agent action traced, replayable, and comparable across runs. Answer "what did the agent actually do?" without reading 800 lines of chat log.
~ proof ~
The relics that make it work.
-
sigilAutonomous merge-policy engine. Evaluates agent PRs against holdout scenarios — did this change help or hurt?SHIPPED ✦
-
sealDistributed code review for teams of agents and humans. Stays with the repo. No hosted PR system.IN FLIGHT
-
mawMulti-agent workspaces with deterministic merges and recovery-aware lifecycle controls.IN FLIGHT
~ commission ~
To start a conversation.
Independent engagement, accepted in limited number. Typical: 4–8 weeks, remote, fixed scope or weekly retainer.
Begin a commission
or write directly to [email protected]