deadwax.io
How a small team used Claude Code and Codex to build a full-stack product -- responsive web for desktop and phone browsers plus a native iOS app -- with the rigor of a 10-person engineering org, in under two months.
Built by hobbyists and collectors who wanted a more transparent way to research pressings, maintain the product, and explain the work.
The Problem
Discogs built something remarkable: the world's most comprehensive music database, contributed to by millions of collectors. It's the backbone of the vinyl community. But the collector experience -- especially on mobile -- hasn't kept pace with the depth of the data.
Discogs is irreplaceable -- we use their API, their data, their community contributions. We're not competing with Discogs. We're building the experience layer on top of their incredible database.
"Why do I have to open 20 tabs to compare pressings?"
The Product
A vinyl collector's companion that works from your desktop, your phone browser, or the native iOS app. Android users can use the mobile-friendly site without waiting for a separate app.
"This is what Discogs should have built years ago."
The Team
Five humans bring taste, domain expertise, creativity, and community voice. Eleven AI agents handle strategy, code (across the web app, native iOS, and Android), testing, design specs, and legal compliance. They coordinate through markdown files in Git -- no Slack, no Jira, no Notion.
deadwax.io API so the native surface stays truthful to the same single source of truth the web uses.
What Actually Makes the Agents Work
People assume the magic is "the AI." It isn't. The same Claude or Codex model that anyone can rent by the hour is what we use. The difference between a chatbot that forgets your name and an agent team that actually ships a product is two disciplines: context engineering and harness engineering. They are the real job, and the harness side is where most of our time goes.
An AI agent has no memory between sessions. Every morning it shows up like a contractor on Day 1 who has never heard of the project. Context engineering is deciding exactly which files, decisions, and instructions get re-read into its head before it picks up a wrench.
Our version of it: a single lean daily packet (TODAY.md, under 50 lines) and a today-only context file (COMMS-TODAY.md). Permanent strategy lives in 80+ dedicated docs that get loaded only when relevant. Config files are pointers, not encyclopedias. Every token of instruction we send costs a token of work we get back.
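A minimal sketch of how a line budget like this could be checked mechanically. The function names are hypothetical, and counting only non-blank lines is an assumption; the text only says "under 50 lines".

```python
from pathlib import Path

MAX_PACKET_LINES = 50  # the "under 50 lines" budget described above


def packet_line_count(text: str) -> int:
    """Count non-blank lines (assumption: blank lines do not count)."""
    return sum(1 for line in text.splitlines() if line.strip())


def is_lean(text: str, limit: int = MAX_PACKET_LINES) -> bool:
    """True when the packet stays strictly under the line budget."""
    return packet_line_count(text) < limit


def check_packet(path: Path) -> bool:
    """Read a packet such as agents/TODAY.md and check it against the budget."""
    return is_lean(path.read_text(encoding="utf-8"))
```

A check like this could run before each session so an oversized packet is caught before an agent spends context reading it.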
If context engineering is what the AI sees, harness engineering is the entire workshop around it: which tools it can reach for, what it's physically prevented from touching, when it wakes up, who reviews its work, what happens when it gets stuck. The harness is everything that isn't the model itself or the prompt.
Why we lean into it: models change every few months. Prompts get rewritten weekly. But the harness — the rules of the workshop — is the thing that compounds. A well-built harness makes a mediocre prompt safe; a bad harness makes a brilliant prompt dangerous. We spend more time on the harness than on the agents themselves.
Think of each AI agent as a smart but reckless intern. The harness is everything we built so the intern can do real work without burning the building down.
"You don't make an AI agent reliable by writing a better prompt. You make it reliable by building a workshop where reliability is the only thing it's allowed to do."
How the Harness Evolved
The harness above isn't the harness we started with. The biggest thing we learned running an AI team is that the workshop has to match the shape of the work — and the shape of the work changed the day we shipped. The launch harness was tuned for a sprint. Keeping a live product healthy is a different job, and using the launch harness for it slowly rotted our own paperwork. So we changed the harness, not the model.
To get from nothing to a shipped app, every day was a numbered container. Each morning the Director wrote one lean plan ("Day 53"), the Codex orchestrator spawned the coding agents against it, work merged in a fixed order, and every night a single heavy "close" archived everything that happened and set up tomorrow.
Why it worked: a 0-to-1 build is a plannable daily batch. The day container forced a clean plan → build → review → ship → close loop and gave us one tidy "here's everything that happened today" story. For a launch sprint, that rhythm is exactly right.
Once real users arrived, work stopped showing up in neat daily batches. It became a continuous stream of small bug reports, punctuated by the occasional big feature. Forcing that stream into a daily ceremony is exactly why days started staying "open" for days, the heavy nightly close kept getting skipped, and our decision logs and branches drifted out of date.
The fix -- split the work into two lanes: an Ops / Sustain lane where bug fixes flow continuously and the merge is the close (one line in a running log, no nightly ceremony), and a Feature Epic lane that keeps all the heavyweight planning machinery -- but scoped to a feature, not to a calendar day.
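The two-lane split can be sketched as a small routing rule. This is an illustration under stated assumptions, not the team's implementation: the `WorkItem` type, the `kind` labels, and the log line format are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    OPS_SUSTAIN = "ops-sustain"
    FEATURE_EPIC = "feature-epic"


@dataclass(frozen=True)
class WorkItem:
    title: str
    kind: str  # hypothetical labels: "bug" or "feature"


def route(item: WorkItem) -> Lane:
    """Bug fixes flow continuously; everything else gets epic-level planning."""
    return Lane.OPS_SUSTAIN if item.kind == "bug" else Lane.FEATURE_EPIC


def close_on_merge(log: list[str], item: WorkItem, merged_on: str) -> None:
    """In the Ops / Sustain lane the merge is the close: one line in a running log."""
    log.append(f"{merged_on} merged: {item.title}")
```

The point of the design is that a bug fix never waits on a nightly ceremony: routing decides the lane once, and the merge itself writes the only record the lane needs.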
Past launch, the bet shifted from speed to judgment. The two-lane harness mostly runs itself now, so the job changed again: keep a live product healthy while we grow it. We optimize for sustained engineering, growth features, and learning -- not launch throughput.
The team rebalanced too: two evenly funded AI subscriptions split the work -- Claude Code as the primary builder and operator that writes most of the code and opens the pull requests, and Codex as the review-and-specialist layer that gives every PR an architecture review before merge and handles the native iOS and Android deep cuts. Models and prompts keep changing; the harness keeps compounding.
The day model assumed work arrives in plannable daily batches. Live work doesn't -- it's a flow plus the odd epic. We had been measuring a stream with a ruler meant for boxes. Two separate reviews -- one of our infrastructure, one of our process -- landed on the same diagnosis independently, and it matches how continuous-flow agent systems work in the wider industry.
"The harness isn't a monument you build once. It's a workshop you keep re-tooling to fit the work in front of you. When the work changed from a sprint to a stream, the harness had to change with it."
The System
The most common question from engineers: "How does the AI team know what to work on?" The answer is a chain of markdown files that replaces Slack, Jira, and standups.
agents/COMMS.md with a structured tag. Example: "Confirmed: Top 10 album seed list. Mastering engineer tier: Kevin Gray, Bernie Grundman, Steve Hoffman, Chris Bellman, Bob Ludwig." Gets a permanent ID in the Decision Log (DEC-033).
agents/execution/EXEC-YYYY-MM-DD.md with: theme of the day, task packets per agent, merge order (Architect -> Backend -> Frontend -> Tester), and dependencies.
agents/TODAY.md (today's tasks, under 50 lines) and agents/COMMS-TODAY.md (today's context). These are the "morning standup" -- every agent reads them before doing anything.
agents/done/DAY-N.md. Decision Log updated. Tomorrow's lean packet created. Nothing is verbal. Everything is traceable.
"If it's not in writing, it didn't happen."
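Two pieces of this file chain are mechanical enough to sketch: the dated execution plan path and the fixed merge order. The path pattern and the role order come from the text; the function names and the idea of mapping roles to branch names are assumptions made for illustration.

```python
from datetime import date

# Merge order stated in the execution plan description.
MERGE_ORDER = ["Architect", "Backend", "Frontend", "Tester"]


def execution_plan_path(day: date) -> str:
    """Build the dated plan path, e.g. agents/execution/EXEC-2025-01-05.md."""
    return f"agents/execution/EXEC-{day.isoformat()}.md"


def in_merge_order(branches: dict[str, str]) -> list[str]:
    """Given role -> branch name (hypothetical), return branches in merge order.

    Roles with no branch today are skipped.
    """
    return [branches[role] for role in MERGE_ORDER if role in branches]
```

Keeping the order in one constant means a script, not an agent's memory, decides which branch merges first.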
Mental Models
Don't prompt them. Manage them. Give them roles, context, constraints, and feedback -- like onboarding a human on Day 1.
Ask Claude how to use Claude. It writes its own config, debugs its own workflows, and knows its own limits better than any doc.
Deterministic tasks go in shell scripts, not AI prompts. Scripts are cheaper, faster, testable, and version-controlled.
Jason sets direction. David advises on AI workflow optimization. Chris validates domain expertise. Hope creates the visual identity. Caden builds community. AI does everything else.
Context engineering decides what the AI sees; harness engineering decides what the AI can do, when it wakes up, and what stops it. Models churn every few months — the harness compounds. See The Harness.
Bad code physically cannot reach production. CI enforces security, tests, coverage, and type safety. Verify with machines, not eyeballs.
Not everything is automated. Some decisions deliberately stop the pipeline and wait for a human. AI proposes, humans approve. No workaround, no override.
Engineering Rigor
The same CI/CD pipeline you'd expect from a 10-person engineering team -- enforced by automation.
/api/*)
The Mind-Blowing Part
An automated pipeline runs every hour, on the hour. AI triages, implements, tests, and deploys -- zero human touch for safe changes. Larger feature requests and ambiguous asks get flagged for human review.
Safety rails: Won't auto-fix P0 critical, UX redesigns, auth/security, schema changes, or anything ambiguous. Feature requests and larger asks get flagged for deeper human review. Only safe, scoped bug fixes ship automatically.
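The safety rails read naturally as a deny-by-default predicate. The sketch below is an assumption about how such a rule might look, not the pipeline's actual code: the `Report` fields and the tag strings are hypothetical, and only the excluded categories come from the text.

```python
from dataclasses import dataclass

# Categories the text says are never auto-fixed. Tag strings are hypothetical.
BLOCKED_TAGS = frozenset({"ux-redesign", "auth", "security", "schema-change"})


@dataclass(frozen=True)
class Report:
    kind: str                      # hypothetical labels: "bug" or "feature"
    priority: str                  # e.g. "P0", "P1", "P2"
    tags: frozenset[str] = frozenset()
    ambiguous: bool = False


def can_auto_fix(report: Report) -> bool:
    """Only safe, scoped bug fixes ship automatically; everything else waits."""
    if report.kind != "bug":
        return False  # feature requests go to human review
    if report.priority == "P0":
        return False  # critical issues are never auto-fixed
    if report.ambiguous:
        return False
    return not (report.tags & BLOCKED_TAGS)
```

Writing the rule as a series of early refusals keeps the default answer "no": a new category has to be explicitly allowed before it can ship without review.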
What Went Wrong
We run formal post-mortems every 5 days. They're the single most valuable process artifact. Here's what they caught.
For 3 days, AI dev agents opened PRs that silently overwrote Director docs on main. Decision log entries vanished. Git worktrees snapshot all files at branch creation, and stale copies overwrote current ones on merge.
Three-layer prevention: sparse checkout (files physically absent from worktrees), pre-PR cleanup script, and a CI gate that fails any PR touching protected paths. Only mechanical enforcement works with AI agents.
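The third layer, a CI gate on protected paths, could look roughly like this. It is a sketch under assumptions: the pattern list is hypothetical (the text does not say which paths are protected), and the changed-file list is assumed to come from the CI system.

```python
from fnmatch import fnmatchcase

# Hypothetical protected patterns; the real list is not given in the text.
PROTECTED_PATTERNS = ["agents/COMMS.md", "agents/TODAY.md", "agents/done/*"]


def protected_violations(changed_files, patterns=PROTECTED_PATTERNS):
    """Return the changed files that match any protected pattern."""
    return [
        path
        for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


def gate_exit_code(changed_files) -> int:
    """Exit code for a CI step: 1 fails the PR, 0 lets it through."""
    return 1 if protected_violations(changed_files) else 0
```

Because the gate returns a non-zero exit code rather than a warning, a PR that touches a protected file cannot merge no matter what the agent's instructions said.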
Every time something went wrong, we added more instructions. Agent config files bloated. Agents burned context window tokens reading their own config, leaving less room for actual work.
David helped define the problem and crafted a diagnostic prompt: "Analyze my CLAUDE agentic workflow for token usage, workflow optimizations, and tools vs. skills/script usage." That audit led to aggressive pruning -- details moved into dedicated docs, config files became pointers not encyclopedias, and lean daily packets stayed under 50 lines. Every token of instruction costs a token of output.
Legal agent: 11 days, zero completions. Designer: 4 consecutive missed sessions. Nobody noticed because the Director was busy shipping code. Claude Code agents don't run unless explicitly scheduled.
Standing daily schedule with explicit session slots. No implicit expectations. If it's not on the schedule, it doesn't exist. Same lesson any manager learns: "delegated" is not "done."
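A schedule only helps if something notices a missed slot. One way to surface silent agents, sketched with hypothetical names and an assumed one-day tolerance:

```python
from datetime import date
from typing import Optional


def stalled_agents(
    last_completed: dict[str, Optional[date]],
    today: date,
    max_gap_days: int = 1,
) -> list[str]:
    """List agents whose last completion is missing or older than the allowed gap.

    `last_completed` maps an agent name to the date of its last finished
    session, or None if it has never completed one.
    """
    stalled = []
    for agent, last in sorted(last_completed.items()):
        if last is None or (today - last).days > max_gap_days:
            stalled.append(agent)
    return stalled
```

With a report like this in the daily packet, an agent that has been idle for 11 days shows up on day 2 instead of being discovered in a retro.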
ESLint v9 broke lint on every PR. The team treated it as noise -- "oh, lint always fails." This masked real problems for days and created a culture of ignoring red builds.
No "known failures." If a step fails, fix it or remove it. Noise in CI is indistinguishable from real problems. A broken build that's always broken teaches everyone to stop looking.
The Pressing Intelligence (PI) feature launched without a way to disable it in production. During the Day 22 rollback drill, there was no off switch, and adding one required an emergency remediation PR.
Rollback capability is now a pre-deployment gate. Every new feature must have a kill-switch before it ships. Not after. Not "we'll add it later."
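A kill-switch gate has two halves: a runtime check that can turn a feature off, and a pre-deployment check that refuses features without one. The sketch below assumes environment-variable switches named `KILL_<FEATURE>`; that naming, and both function names, are hypothetical.

```python
import os
from typing import Mapping, Optional

TRUTHY = {"1", "true", "on"}


def is_enabled(feature: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """A feature stays on unless its kill-switch variable is set to a truthy value."""
    env = os.environ if env is None else env
    value = env.get(f"KILL_{feature.upper()}", "")
    return value.strip().lower() not in TRUTHY


def missing_kill_switch(new_features: list[str], registered: set[str]) -> list[str]:
    """Pre-deployment gate: return new features with no registered kill-switch.

    An empty result means the gate passes.
    """
    return [name for name in new_features if name not in registered]
```

Running the second check in CI is what makes the rule "before it ships" rather than "we'll add it later": the deploy step fails while the list is non-empty.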
We built web-first with AI and shipped fast -- 40+ days of branding, design tokens, component patterns, and UX flows. When native iOS arrived, none of it was portable. Colors, spacing, typography, interaction models -- all lived in React/CSS with no platform-agnostic layer. We started from scratch on the native app's visual identity, and the parity rubric ("make iOS look like web") even caused a regression where Codex replaced native iOS glass controls with custom web-like chrome before we caught it in a device build.
Brand template + parity rubric: Parity applies to content, flow, and acceptance criteria -- not chrome. Native iOS keeps native iOS controls (tab bars, sheets, glass toolbars) unless a ticket explicitly accepts a custom shell. Brand tokens (palette, type scale, logotype) live in a shared design doc that both platforms implement in their own idiomatic way.
SoT API as the consistency layer: Both the responsive web app and the native Swift app call the same Deadwax endpoints (/api/collection, /api/pressing-intelligence, /api/curation/*, /api/mastering-engineers). The API is the single source of truth -- pressing metadata, mastering-engineer tiers, curation lists, Pressing Intelligence payloads all come from one place. Changing the truth once updates both surfaces, so "platform drift" becomes a UI-shell problem, not a data problem.
"Our retro cadence broke for 15 days. During those 15 days, the same mistakes repeated. The retro is the product."
What's Next
Outcome-driven. Each item describes the result for collectors, not just the feature.
The Takeaway
Five humans and eleven AI agents, building a real product -- responsive web for desktop and mobile browsers plus a native App Store app -- with real users, real tests, and real accountability. Not a demo. A product.
The value isn't "AI wrote code." It's that AI can be organized with the same roles, process, and accountability as humans.
Worktrees, PR reviews, CI gates, automated testing. AI without process produces chaos. AI with process produces products.
The CEO, the domain expert, the visual artist, the community manager. AI amplifies human judgment -- it doesn't replace it.
We built web-first, mobile-first, and shipped fast with AI. But when native iOS arrived, our branding, design tokens, and UX patterns were trapped in React/CSS. Moving fast amplifies blind spots -- abstract the portable parts early, or your progress becomes a single-platform cage.
Built by hobbyists, collectors, and AI-assisted workflows • Designed for vinyl collectors