A technical white paper for engineers and anyone who wants to understand what Deadwax is and how a small human team orchestrated AI agents to build it.
Companion document: How We Built Deadwax, the visual overview.
The thesis: Deadwax is two products in one. On the surface it's an app for vinyl collectors. Underneath, it's a learning lab for AI-driven software development, a deliberately over-built system for testing how far a one-person team can get by managing AI agents instead of writing every line by hand. This paper is mostly about that second product.
If you take nothing else away, take these:
Deadwax is an app for vinyl record collectors. It connects to Discogs (a big music database) and makes it dramatically easier to browse your collection from a phone or desktop and find which pressing of a record sounds best, all in one screen instead of ten browser tabs.
Collectors can use the responsive site at deadwax.io on desktop, iPhone, iPad, or Android browsers. iPhone users can also install the native App Store app, and a native Android app (Kotlin + Jetpack Compose) is in active development ahead of Play Store submission.
The wild part? It was built by a tiny team: Jason managing AI agents, with help from an AI workflow advisor, a vinyl expert, a visual artist, and a social media manager. On the AI side there's a Director that assigns daily work, a Product Manager that defines what to build, platform developers that write code across web, iOS, and Android, a Tester that catches bugs, a Designer, a Legal advisor, and a Marketing lead. Jason's job is to be the boss. The AI does everything else.
And it's not just code. User-reported bugs get automatically triaged, fixed, tested, and deployed to production in under 90 minutes, while Jason sleeps.
It helps to be honest about what this project really is. Stated plainly, Deadwax is both:
For a hobby app aimed at vinyl collectors, Deadwax is wildly over-engineered. It runs AWS infrastructure, S3 storage, Lambda functions, automated testing, code coverage, Playwright screen recordings, flaky-test detection, CI/CD pipelines, GitHub issue and backlog management, weekly post-mortems, an agent documentation suite, product roadmaps, and white papers like this one. That is overkill for the size of the product, and that is the point. The app is the excuse; the real experiment is finding out how AI changes professional software development when you take it seriously.
The classic software lifecycle looks like this:
idea → write code → test → debug → open a pull request → review → QA → release
Every step a human touches is a place the work slows down, and that has always been true. For years the bottleneck was a person writing and debugging code in the inner loop, and then a person reviewing and merging it. AI has largely removed the first one. Writing code is now cheap and fast, so the constraint has shifted downstream to the part still gated by human judgment: code review. That is exactly where a lot of the industry's investment is going this year, building tools to make review faster and to automate more of it.
As each human step gets automated, the human role changes rather than disappears. You stop being the person who writes every line and become the manager of the system that does: keeping it running smoothly, reviewing what it produces, and making sure it is building the right things. Critical thinking does not go away, it moves up a level, and the people who learn to run these systems will have an enormous advantage. Deadwax is one person's attempt to practice exactly that.
Because review is becoming the new bottleneck, it is the next thing we are automating. We are piloting an open-source reviewer, Open Code Review, that wraps an LLM in deterministic plumbing: it forces coverage of the exact changed files, bundles them, matches rules, writes line-level comments, and emits machine-readable JSON for CI. It ships as a skill and as both a Claude Code and a Codex plugin. The rollout is deliberately staged: start report-only (a local pilot, then label-triggered CI artifacts, then a small number of high-confidence PR comments), always as a supplement to the human and Architect review gate, never a replacement and never an auto-merge. Reference: github.com/alibaba/open-code-review.
Deadwax sits on top of Discogs and provides a screen-friendly experience for vinyl collectors. The core feature is Pressing Intelligence, a panel that shows the best pressings of any album ranked by community ratings, audiophile label status, and mastering engineer. What takes 10-20 tabs on Discogs fits in one screen on Deadwax.
Three surfaces, one backend: a responsive web app at deadwax.io (React + Vite) for desktop and phone browsers, a native iOS app (Swift + SwiftUI, iPhone + iPad, iOS 17+) live on the App Store, and a native Android app (Kotlin + Jetpack Compose, Material 3) in active build-out ahead of Play Store submission. All three call the same Deadwax API (collection sync, Pressing Intelligence, curation lists, mastering engineer tiers), so the data is identical no matter where the collector is.
Positioning: Discogs has the data. Deadwax has the experience. The moat is UX quality.
This isn't a solo AI project. It's a human-and-AI collaboration:
| Who | Role | What They Do |
|---|---|---|
| Jason | CEO & Founder | Product vision, strategic decisions, agent orchestration |
| David | AI Workflow Advisor | Consultant on agentic AI workflows, token optimization, tools-vs-skills-vs-scripts framework |
| Chris | Board Member | Pressing expertise, audiophile label knowledge, seed list curation |
| Caden | Social Media | Instagram presence (@deadwax.io), community engagement |
| Hope | Visual Creative | Logo, wordmark, brand visual identity |
The human team handles what AI can't: taste, domain expertise, visual creativity, and community voice. The AI team handles execution:
| Agent | Platform | What They Do |
|---|---|---|
| Director | Claude Code | Daily task orchestration, merge order, blocker escalation |
| Product Manager | Claude Code | Requirements, backlog, acceptance criteria |
| Designer | Claude Code | Wireframes, design system, UX specs |
| Marketing | Claude Code | Positioning, GTM, community launch |
| Legal | Claude Code | Privacy, compliance, naming |
| Architect | Codex | Technical decisions, code review |
| Backend Dev | Codex | API, OAuth, data pipelines |
| Frontend Dev | Codex | UI, routing, mobile layout |
| Tester | Codex | CI/CD, Playwright E2E, coverage |
| Swift iOS Dev | Codex | Native SwiftUI app (iPhone + iPad) |
| Android Dev | Codex | Native Kotlin + Jetpack Compose app |
Growth-mode update: the table above is the launch-era org, where Codex did all the engineering. Post-launch, the split rebalanced into two evenly funded subscriptions. Claude Code became the primary builder across backend, frontend, and tests as well as the operator role, while Codex became the review-and-specialist layer: the architecture-review gate on every change, plus the native iOS and Android deep cuts and second-opinion rescue. The goal of the rebalance is review quality and judgment over raw output, and a working setup that mirrors the day job.
The system doesn't magically know what to build, and nothing about it is a black box. The agents are fed context, goals, and feedback, and they prioritize from that. Deadwax builds whatever the humans and the system put in front of it.
That input comes from three places, which together decide what gets built next.
From the humans. Ideas and priorities from Jason, Chris, Caden, and Hope, plus the mission statement and company values that fix what Deadwax is for in the first place. The humans own taste and direction, and this is where both enter the system.
From the product and its users. Real user feedback and GitHub issues, which the automated pipeline turns into tasks every hour, along with the vinyl-collector culture the product serves. This is the steady stream of "what people actually need" that keeps the roadmap honest.
From the system itself. The product roadmap and feature backlog, the agents' own recommendations, and the weekly post-mortems that feed lessons back in. The system proposes and remembers, so good ideas and old mistakes both resurface as work.
Deciding what to build is the part AI does not do for you, so it gets real product rigor. Everything traces back to a few fixed statements. The company mission is to help vinyl enthusiasts discover the best version of their favorite music. The product is, in its own words, "a vinyl app, full stop": mobile-first, design-led, and vinyl-only, on the bet that Discogs has the data and Deadwax has the experience. The culture is explicit too, with customers first, mobile-first, security as non-negotiable, great UX as the actual product, ship-to-learn over planning loops, and data integrity over speed. Those are the constants every decision is measured against.
Before building, we try to separate three questions that are easy to blur together. Customer discovery asks who the collector really is, and the work here came from synthesizing the vinyl and audiophile communities, reading one- and two-star reviews of the existing Discogs app, and interviewing real collectors. Problem discovery asks what actually hurts, and the sharpest finding was that deciding whether a specific pressing is worth owning takes four to eight steps and a fistful of browser tabs, often while standing in a record store on one bar of signal. Solution discovery asks what to build in response, and only here does the Pressing Intelligence panel appear, putting the original pressing, the top-ranked versions, the audiophile editions, and the mastering engineer in one screen. Keeping the three separate stops us from confidently solving a problem nobody has.
The backlog is scored with RICE: reach times impact times confidence, divided by effort. Reach is the share of active users a feature would touch, impact is a coarse one-to-three weight, confidence discounts our own guesses, and effort is a relative t-shirt size. It is deliberately blunt and assumption-based before launch, and the rule is to revisit the scores once real user data exists. The value is not the precise number, it is that a trivial donation page and a deep new ranking feature get compared on the same axis instead of by whoever argues loudest.
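As a minimal sketch of the RICE arithmetic described above: the field names, the mapping of t-shirt sizes to numbers, and the two sample items are illustrative assumptions, not real Deadwax scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacklogItem:
    name: str
    reach: float       # share of active users touched, 0.0 to 1.0
    impact: int        # coarse weight, 1 to 3
    confidence: float  # discount on our own guesses, 0.0 to 1.0
    effort: int        # t-shirt size as a number (assumed mapping: S=1, M=2, L=4)


def rice_score(item: BacklogItem) -> float:
    """Reach times impact times confidence, divided by effort."""
    if item.effort <= 0:
        raise ValueError("effort must be positive")
    return item.reach * item.impact * item.confidence / item.effort


# Illustrative numbers only.
backlog = [
    BacklogItem("donation page", reach=0.10, impact=1, confidence=0.8, effort=1),
    BacklogItem("new ranking feature", reach=0.60, impact=3, confidence=0.5, effort=4),
]

# Highest score first, so both items are compared on the same axis.
ranked = sorted(backlog, key=rice_score, reverse=True)
```

With these made-up inputs the large feature still outranks the trivial one despite its higher effort and lower confidence, which is the kind of comparison the scoring is meant to force.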
Users are mapped to be MECE, mutually exclusive and collectively exhaustive, so every collector lands in exactly one segment and no real behavior is left unnamed. Beta usage data sorted users into a handful of segments, such as large-catalog confidence seekers, daily pressing explorers, search-to-decision shoppers, and wantlist evaluators, each with its own pain and its own opportunity. The same discipline is applied to the wider ecosystem the product sits in: the forums, the runout-groove databases, the ratings sites, and the marketplace that serious collectors already stitch together by hand. Naming the segments and the ecosystem cleanly is what keeps the roadmap aligned to the mission instead of drifting toward whatever is easiest to build.
Once a priority is set, the work moves through a chain of markdown files committed to Git, a paper trail that replaces Slack and Jira:
A decision is posted to agents/COMMS.md (e.g., "Confirmed: Top 10 album seed list for Pressing Intelligence"). The Director reads it and writes TODAY.md (tasks) and COMMS-TODAY.md (context), which every agent reads first.

Every decision gets a permanent ID in the decision log. Nothing is verbal. Everything is traceable. The AI equivalent of "if it's not in writing, it didn't happen."
HOW AGENTS COLLABORATE (via markdown files in Git)
+----------+ +--------------+
| CEO |-- decision ------------->| COMMS.md |
| (Jason) | | (message |
+----------+ | board) |
+------+-------+
| reads
v
+--------------+
| DIRECTOR |
| (Claude) |
+------+-------+
| writes
+------------------+------------------+
v v v
+------------+ +------------+ +------------+
| TODAY.md | | COMMS- | | EXEC- |
| (tasks) | | TODAY.md | | YYYY-MM- |
| | | (context) | | DD.md |
+-----+------+ +-----+------+ +------------+
| |
v v
+-------------------------------------------------+
| ALL AGENTS READ THESE FIRST |
| |
| PM --- Designer --- Legal --- Marketing |
| (Claude Code agents) |
| |
| Architect --- Backend --- Frontend --- Android --- Tester |
| (Codex agents, in git worktrees) |
+---------------------+---------------------------+
| post updates
v
+--------------+
| COMMS.md |<-- cycle repeats
+--------------+
The most impressive part: user-reported bugs go from "submitted" to "fixed in production" in under 90 minutes with zero human intervention. A cron job runs every hour, on the hour. The PM agent triages (auto-fixable bugs get task packets; feature requests and larger asks get flagged for deeper human review), the Dev agent implements safe fixes, CI validates, and the fix deploys.
AUTOMATED FEEDBACK-TO-PRODUCTION PIPELINE
User finds bug on deadwax.io
|
v
+--------------+ GitHub API +------------------+
| Feedback | ------------------> | GitHub Issue |
| Widget | | label: [feedback]|
+--------------+ +--------+---------+
|
+--------------------------+
| Every hour, on the hour (macOS cron)
v
+--------------+
| PM Agent | Phase 1: TRIAGE
| (Codex) | - Classify priority & effort
| | - Safe to auto-fix? --> task packet
| | - Needs human? --> [needs-input] label
+------+-------+
| writes task packet
v
+--------------+
| Dev Agent | Phase 2: IMPLEMENT (max 2/run)
| (Codex) | - Read task packet
| | - Create branch + write fix
| | - Run lint + typecheck + tests
+------+-------+
| opens PR
v
+--------------+
| CI Pipeline | Security > Lint > Types >
| (Actions) | Vitest > Playwright E2E > Coverage
+------+-------+
| all green
v
+--------------+
| Auto-merge |--> Deploy --> Verify production
| + Deploy |
+--------------+
|
v
Bug is fixed in prod.
User hasn't checked back yet.
This continuous, issue-driven loop is the same shape as OpenAI's Codex Symphony pattern: an issue-tracker-as-control-plane model for autonomous coding work (ticket → isolated workspace → agent → proof-of-work PR → review gate → ship → closeout log). Deadwax arrived here independently, through its feedback pipeline, git worktrees, Architect review, ship-loop, and OPS-LOG.md. After evaluating Symphony, we made an explicit decision to borrow its vocabulary and control-plane ideas while staying GitHub-Issues-first, rather than onboarding the third-party reference stack (Linear plus an Elixir service). The call was to harden the system we already have instead of taking on a new dependency. Reference: github.com/openai/symphony.
Deadwax runs against a real product roadmap with four phases: foundation, MVP launch, growth, and scale / monetization. It is currently in the growth phase, where the core product works and is live and the focus is widening the audience and hardening operations.
The backlog deliberately mixes two kinds of work that the system tracks side by side. Product features are things like better deadwax scanning, deeper Pressing Intelligence, and collection tools. Engineering tasks are flaky-test detection, code-review tooling, cost control, beta-tester recruitment, Android testing, and go-to-market work.
A product this small still has to answer three questions every day: is anyone using it, is it healthy, and can we afford to run it. The answers live in an internal admin dashboard at a login-gated route, and it is one of the more useful things we built. For an engineering audience the interesting part is not the charts, it is that the whole thing is derived from infrastructure we already pay for rather than a separate analytics stack.
Alongside that cumulative total we track weekly net-new panel views as the trend line, so a flat week is visible instead of being hidden inside the ever-rising all-time number. In telemetry a view counts only when the panel renders usable pressing data, captured by a pressing_panel_viewed event (a legacy open event is folded in and de-duplicated per user, session, and release) and aggregated from the same structured logs the rest of the dashboard reads.
At its core this is a two-sided ecosystem, demand and supply, with two broader players that keep it honest. We never optimize one at another's expense: more panel views are only good if collectors get clearer decisions, the underlying data stays trustworthy, and the Discogs paths keep working.
| Player | Side | Role in the ecosystem |
|---|---|---|
| Collectors / buyers | Demand | Bring the actual problem: which pressing to own, buy, or keep. |
| Pressing-knowledge supply | Supply | Discogs metadata, community ratings, runouts, and curator judgment that make a panel worth reading. |
| Influencers / trusted voices | Broader | YouTubers, reviewers, forum leaders, and stores that route high-intent collectors in. |
| Discogs / vinyl commerce | Broader | System of record and the action layer for auth, collection, wantlist, and marketplace. |
We read standard engagement metrics next to it. MAU is the monthly active base, and the stickiness ratios matter more than the raw count: DAU over MAU is the share of monthly users active on a given day (above roughly 0.3 is a genuinely habitual product), and DAU over WAU shows how much of the weekly audience is daily.
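The stickiness ratios can be sketched from raw activity events as follows. The event shape (user, day), the 7-day and 30-day window lengths, and the sample data are assumptions for illustration; the paper does not specify how the dashboard defines its windows.

```python
from datetime import date, timedelta


def active_users(events, end, days):
    """Distinct users with at least one event in the `days`-day window ending on `end`."""
    start = end - timedelta(days=days - 1)
    return {user for user, day in events if start <= day <= end}


def stickiness(active_short: int, active_long: int) -> float:
    """Share of the longer-window audience that is active in the shorter window."""
    if active_long <= 0:
        return 0.0
    return active_short / active_long


# Illustrative events: (user, day of activity).
today = date(2025, 1, 30)
events = [
    ("ana", today),
    ("ana", today - timedelta(days=1)),
    ("ben", today),
    ("cat", today - timedelta(days=3)),
    ("dan", today - timedelta(days=20)),
    ("eve", today - timedelta(days=45)),  # outside the 30-day window
]

dau = len(active_users(events, today, 1))
wau = len(active_users(events, today, 7))
mau = len(active_users(events, today, 30))

dau_over_mau = stickiness(dau, mau)
dau_over_wau = stickiness(dau, wau)
```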
Every good North Star has blind spots, and this one has three.
So we let the North Star rise only while four guardrails stay healthy. They are the diagnostic layer: when the headline moves, these say whether it can be trusted.
| Guardrail | What it protects against | Target |
|---|---|---|
| Panel Activation Rate | Power-user views hiding weak demand | Alert below 40% |
| PI Availability Rate | Gaps in pressing-data coverage (supply) | At or above 80% |
| Collection / Discovery Load Success Rate | Ecosystem access failures, loads that never complete | At or above 95% |
| PI Panel P90 Load Time | Panels that technically render but feel broken | Under 2 seconds |
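One way to read the table is as a set of threshold checks that must all pass before a rise in the North Star is trusted. The sketch below encodes the four targets from the table; the metric keys and the rule that a missing metric counts as a breach are assumptions, not the dashboard's actual behavior.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Guardrail:
    name: str
    is_healthy: Callable[[float], bool]


# Thresholds taken from the table above; the dictionary keys are hypothetical.
GUARDRAILS = {
    "panel_activation_rate": Guardrail("Panel Activation Rate", lambda v: v >= 0.40),
    "pi_availability_rate": Guardrail("PI Availability Rate", lambda v: v >= 0.80),
    "load_success_rate": Guardrail(
        "Collection / Discovery Load Success Rate", lambda v: v >= 0.95
    ),
    "pi_panel_p90_seconds": Guardrail("PI Panel P90 Load Time", lambda v: v < 2.0),
}


def breached(metrics):
    """Return the names of guardrails that are unhealthy or have no data."""
    failures = []
    for key, rail in GUARDRAILS.items():
        value = metrics.get(key)
        # Assumption: a metric with no data is treated as a breach, not as healthy.
        if value is None or not rail.is_healthy(value):
            failures.append(rail.name)
    return failures


def north_star_is_trustworthy(metrics) -> bool:
    """The headline number is only trusted while every guardrail is healthy."""
    return not breached(metrics)
```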
Every client and server event is written to logs as structured JSON, which lets the dashboard ask questions after the fact instead of needing each metric defined in advance. It surfaces API and front-end error counts and rates, the top errors by type, and Lambda latency at the median and 90th percentile. Loading a collection is the riskiest moment in the app, so it gets first-class monitoring: success, failure, and incomplete rates per platform, load and indexing times, and a breakdown of why loads fail (API error, rate limit, timeout, incomplete index, page-load failure). We also track collection sizes, because a collector with five thousand records stresses the system very differently from one with fifty, and a regression usually shows up there first.
The dashboard also turns real usage into a cost-per-active-user-day, so "can we afford the next thousand users" is a number on a screen rather than a guess. The whole thing is built to be light to run: there is no analytics database, the metrics are computed from existing logs and tables, and the entire response is cached as a single daily snapshot so opening the dashboard does not trigger expensive live queries. The live queries that do run are bounded with strict timeouts and a concurrency cap.
That last point was learned the hard way. Early on, the dashboard and a related per-deploy job ran heavy queries far too often and produced a real spike in our own cloud usage, the irony being that the tool built to watch efficiency was briefly the thing driving cost up. The fix was the snapshot-and-schedule model above, and it is now a standing rule: cost-sensitive admin reads stay cached, bounded, and scheduled by default. The full story is in the lessons below.
Monorepo (npm workspaces): React+Vite frontend, TypeScript Lambda backend (Hono), shared packages. AWS infrastructure: Lambda + API Gateway + CloudFront + DynamoDB + S3. Discogs OAuth 1.0a with server-side-only tokens.
Git worktrees. Each Codex agent works in an isolated worktree, enabling parallel development on separate branches without conflicts. Merge order is enforced: Architect → Backend → Frontend → Tester.
CI pipeline (GitHub Actions). Every PR triggers: security scanning (ripgrep blocks DELETE/POST/PUT to Discogs, blocks token exposure), ESLint, TypeScript strict, Vitest unit tests, Playwright E2E with video, and code coverage delta posted as a PR comment.
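The security-scanning step can be illustrated with a line-based pattern check. This sketch uses Python's `re` module in place of ripgrep, and both patterns are hypothetical stand-ins, since the real rules are not shown in this paper.

```python
import re

# Hypothetical patterns; the actual ripgrep rules are not reproduced here.
WRITE_TO_DISCOGS = re.compile(
    r"""method\s*:\s*["'](DELETE|POST|PUT)["'].*api\.discogs\.com"""
    r"""|api\.discogs\.com.*method\s*:\s*["'](DELETE|POST|PUT)["']""",
    re.IGNORECASE,
)
TOKEN_EXPOSURE = re.compile(
    r"console\.log\(.*(oauth_token|token_secret)", re.IGNORECASE
)


def scan(source: str):
    """Return (line number, reason) for every line that trips a rule."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        if WRITE_TO_DISCOGS.search(line):
            findings.append((number, "write call to Discogs"))
        if TOKEN_EXPOSURE.search(line):
            findings.append((number, "possible token exposure"))
    return findings


# Illustrative input: one allowed read, one blocked write, one leaked token.
sample = """\
await fetch("https://api.discogs.com/releases/1", { method: "GET" });
await fetch("https://api.discogs.com/users/x/wants/1", { method: "DELETE" });
console.log("debug", oauth_token);
"""

findings = scan(sample)
```

A CI job built on this idea would fail the PR whenever `findings` is non-empty. A line-based regex is deliberately crude: it can miss calls split across lines, which is why the gateway-level refusal described elsewhere in this paper matters as a second layer.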
Protected files. AGENTS.md, CLAUDE.md, and all docs and agent files are automatically stripped from code PRs. Strategy docs commit directly to main.
Ship gate. Nothing is "done" until the ship-loop script squash-merges, watches the deploy, and verifies production endpoints. The Architect agent reviews every code PR.
Feedback pipeline. Runs via macOS cron (every hour, top of hour). Phase 1: PM triages GitHub issues, auto-fixable bugs get task packets, feature requests and ambiguous asks get flagged for human review. Phase 2: Dev implements (max 2/run), opens PR, CI auto-merges on green. From feedback to production in under 90 minutes.
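The two-phase routing can be sketched as a small function. The `Issue` fields are hypothetical, and in the real pipeline the classification is the PM agent's judgment rather than a stored flag. The only rule taken from the description is the cap of two implementations per run; carrying the overflow to the next hourly run is an assumption.

```python
from dataclasses import dataclass

MAX_FIXES_PER_RUN = 2  # "max 2/run" from the pipeline description


@dataclass(frozen=True)
class Issue:
    number: int
    kind: str              # "bug" or "feature" (hypothetical classification)
    safe_to_autofix: bool  # stands in for the PM agent's triage judgment


def triage(issues):
    """Split issues into task packets for this run, deferred fixes, and human review."""
    packets, deferred, needs_input = [], [], []
    for issue in issues:
        if issue.kind == "bug" and issue.safe_to_autofix:
            if len(packets) < MAX_FIXES_PER_RUN:
                packets.append(issue.number)
            else:
                # Assumption: overflow waits for the next hourly run.
                deferred.append(issue.number)
        else:
            # Feature requests and ambiguous asks are flagged for a human.
            needs_input.append(issue.number)
    return packets, deferred, needs_input
```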
Pressing Intelligence. Bayesian-weighted scoring: score = (v/(v+m))*R + (m/(v+m))*C where m=25, C=4.0. Pipeline fetches master release, ranks vinyl versions, flags audiophile labels (MoFi, AP, Classic, Impex), cross-references mastering engineers, checks user ownership, caches results.
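The scoring formula above can be written out directly. The paper does not define `v` and `R`; this sketch assumes the usual Bayesian-average reading, where `v` is the number of community ratings for a version and `R` is that version's average rating, with `m` as the prior weight and `C` as the prior mean. The sample inputs are illustrative.

```python
def bayesian_score(
    votes: int,
    avg_rating: float,
    prior_weight: float = 25,  # m in the formula
    prior_mean: float = 4.0,   # C in the formula
) -> float:
    """score = (v/(v+m))*R + (m/(v+m))*C"""
    if votes < 0:
        raise ValueError("votes must be non-negative")
    total = votes + prior_weight
    return (votes / total) * avg_rating + (prior_weight / total) * prior_mean


# Illustrative comparison: a perfect rating on 3 votes vs. 4.6 on 400 votes.
thinly_rated = bayesian_score(votes=3, avg_rating=5.0)
widely_rated = bayesian_score(votes=400, avg_rating=4.6)
unrated = bayesian_score(votes=0, avg_rating=0.0)
```

Under that reading, a version with only a handful of ratings is pulled toward the prior mean of 4.0, so a perfect score from three voters does not outrank a strong score from hundreds.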
This isn't AI-generated spaghetti code. It's 80+ docs providing context, 29 scripts automating mechanics, comprehensive CI with security scanning, git worktree isolation for parallel agents, and a decision log tracking every choice with rationale. The insight: AI can be organized into a team with the same rigor you'd expect from human engineers, and one human can orchestrate it.
Deadwax started the way it still runs: a serverless stack on AWS behind a mobile-first, responsive React web app. That order mattered. Designing for a 375-pixel screen in a record store first, and choosing infrastructure that scales to near-zero at rest, is what let one person stand up a real product and keep it light enough to leave running.
It is a classic three-tier system: clients on top, application logic in the middle, and managed storage underneath, with every client talking to the same single API.
The web app is a React single-page app built with Vite, shipped as static files in S3 and served worldwide through CloudFront, with the bucket kept private so the CDN is the only way in. Every API call goes through API Gateway to a single Lambda function that handles Discogs OAuth, the read-only data proxy, Pressing Intelligence, analytics events, and feedback. Session and cache state live in DynamoDB on demand, so they scale to zero when nobody is using the app, while the Pressing Intelligence pipeline keeps its structured evidence in PostgreSQL where it needs real relational queries. CloudWatch collects the structured logs and Lambda metrics the admin dashboard reads, secrets live in Parameter Store and never reach the browser, and EventBridge fans a large collection sync out across several Lambda invocations so importing a five-thousand-record collection does not time out a single request.
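The fan-out for a large collection sync amounts to splitting the paginated import into batches, one per Lambda invocation. The sketch below shows only the planning arithmetic; the page size of 100 and the five pages per invocation are assumed values, and the function name is hypothetical.

```python
import math


def plan_sync_batches(
    total_records: int,
    per_page: int = 100,            # assumed page size
    pages_per_invocation: int = 5,  # hypothetical batch size per Lambda invocation
):
    """Return inclusive (first_page, last_page) ranges, one per invocation."""
    if total_records <= 0:
        return []
    total_pages = math.ceil(total_records / per_page)
    batches = []
    for first in range(1, total_pages + 1, pages_per_invocation):
        last = min(first + pages_per_invocation - 1, total_pages)
        batches.append((first, last))
    return batches


# A five-thousand-record collection becomes ten small jobs instead of one long request.
batches = plan_sync_batches(5000)
```

Each tuple would become the payload of one event, so no single invocation has to walk the whole collection.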
The product runtime uses Google Gemini (the 2.5 Flash model) for one well-scoped job: watching YouTube pressing-shootout videos and turning them into structured claims about which pressing sounds best, by mastering engineer, country, and label. Those claims are cached in PostgreSQL and reused across users, so the expensive analysis happens once per video, not once per viewer. We call Gemini directly rather than through AWS Bedrock today, mostly for speed of iteration, though Bedrock is the obvious place to consolidate inference inside AWS if that tradeoff ever changes. Worth being clear about the split: the agent team that builds Deadwax runs on Claude and Codex, while the only model in the shipped product is Gemini, on this one pipeline.
The architecture is tuned to scale down as cleanly as it scales up. Everything is pay-per-request, so there are no idle servers and no reserved capacity to size. Caching is layered, with a short-lived cache in the browser, an in-process cache in the warm Lambda, and a persistent cache in DynamoDB, so a popular collection is not re-fetched from Discogs on every view. After launch we tightened the obvious dials, capping log retention, narrowing the CDN to the regions we actually serve, lengthening cache windows, and turning an automatic per-deploy backfill into a manual step. Nothing sits idle and nothing is reserved, so the footprint follows real demand, which is the entire point: infrastructure that gets out of the way until there is demand to serve.
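The layered caching can be sketched as a read-through lookup over an ordered list of caches. Here two in-memory TTL caches stand in for the warm-Lambda cache and the persistent DynamoDB cache (the browser layer is omitted). The class, the TTL values, and the key format are illustrative assumptions.

```python
import time


class TtlCache:
    """A tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._items[key] = (self._clock() + self._ttl, value)


def read_through(key, layers, fetch):
    """Check each layer in order; on a full miss, fetch once and fill every layer."""
    for index, layer in enumerate(layers):
        value = layer.get(key)
        if value is not None:
            for faster in layers[:index]:
                faster.set(key, value)  # backfill the faster layers
            return value
    value = fetch(key)
    for layer in layers:
        layer.set(key, value)
    return value
```

With a short TTL on the in-process layer and a long one on the persistent layer, a cold Lambda start costs one persistent-cache read instead of another trip to Discogs.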
The interesting engineering here is not the model and not the prompts. The model is a rented commodity that improves on its own schedule, and the prompts get rewritten weekly. The work that compounds, and the work that separates a hobby agent from a production team, lives in two disciplines above the model: context engineering, deciding exactly what the model sees on each turn, and harness engineering, designing the runtime around it. The rest of this section covers the harness as what it is, why we build it, and how we built it for Deadwax.
The harness is everything that is not the model and not the prompt: the agent loop, the tools an agent can call, the permission system, the sandbox, the hooks, the sub-agents, the scheduler, and the files agents pass between runs. If the model is the engine, the harness is the rest of the car. You do not make a car faster by revving the engine harder, you build a chassis, brakes, steering, and a dashboard around it. Context engineering is the fuel mixture: too little context and the agent hallucinates, too much and the signal drowns in noise while latency and cost climb.
Three reasons. First, the harness is the only place you can put a hard rule. "Never DELETE from the Discogs API" in a prompt is a suggestion; a commit hook plus a CI check plus a proxy that refuses the call at the gateway is a wall. Second, the harness outlasts the model: across Claude Opus and GPT generations the prompts changed, but the worktrees, the ship-loop, the schedule, and the CI gates did not. Third, the harness is testable where a prompt is not, because a script either strips a protected file or it does not, while a prompt only encourages the right behavior and can quietly regress on the next model. The test we apply to any fix is simple: if swapping the model tomorrow would force us to rewrite it, it is prompt work; if not, it is harness work, and harness work is the work worth doing.
Scripts first. Anything deterministic (creating a worktree, opening a PR, running tests, deploying, shipping) lives in a shell script, not a prompt. Scripts are cheaper, faster, have real tests, and never burn context. The standing rule is that you never write an AI skill for something a script can do.
Skills for judgment. Procedures that genuinely need a model, like verifying a change in a real browser, running a code review, or the morning kickoff, are packaged as skills that load only when triggered, so their cost is paid per task instead of sitting in every prompt.
Hooks for the things that must always happen. Shell commands attached to tool events fire deterministically no matter what an agent remembered to do, so "from now on, every time X happens, do Y" is always a hook, never a note in memory. Memory is advisory; hooks are mandatory.
Calling Codex from Claude. A Claude session can hand a task straight to Codex through the Claude Code to Codex plugin, used both to rescue a stuck run and, far more often, to do ordinary implementation work. Codex did most of the heavy lifting on the zero-to-one build that got the product live.
Daily kickoffs and scheduling. Claude Code agents do not wake themselves up; they act only when something invokes them, which is why a quiet agent is almost always a scheduling problem, not a model one. A cron fires the bug-fix feedback loop every hour, and a morning kickoff routine briefs the team and sets the day's work. The harness owns the clock; the agent owns what happens when it strikes.
How agents communicate. Agents talk through tagged markdown files committed to Git, not Slack, so every decision and handoff is a grep-able, versioned paper trail. We never rely on an agent remembering a past session; the harness re-injects whatever it needs to know.
Isolation and least privilege. Each coding agent works in its own git worktree with protected paths physically removed, so it cannot clobber strategy docs even by accident. Tokens never enter an agent's context and the Discogs proxy refuses writes at the gateway, so the destructive call is unreachable rather than merely discouraged. And the ship-loop, not the agent, decides what counts as shipped, by squash-merging, watching the deploy, and confirming the new build is actually serving in production.
The hardest harness lesson was not about safety rails. It was that a harness quietly encodes an assumption about how work shows up, and a wrong assumption rots your own records. The zero-to-one build genuinely arrived as daily batches, so we ran a single daily container with a heavyweight nightly close. After launch the work changed shape into a continuous stream of small bug fixes plus the occasional multi-day feature, and forcing that stream through a daily ceremony meant days stayed "open" for days and the heavy close got skipped, leaving branches and status docs drifting out of date.
So we split the work into two lanes. An Ops lane handles the continuous stream, where the definition of done is simply a merged PR plus one line in a running log; the merge is the close, with no daily ceremony. A Feature Epic lane keeps the heavier planning, but scoped to an epic rather than the calendar, and a promotion rule kicks anything large or risky out of the auto-fix pipeline and up to a human. We lost the tidy "everything that happened today" summary and took that trade on purpose, because a branch list and a decision log you can actually trust beat a daily narrative nobody keeps writing. The same shape, an issue queue feeding isolated agents that produce PRs, is what mature continuous-flow agent systems converge on.
The newest shift is to stop defining every role and step and instead hand the system a goal, such as "improve the deadwax scanning feature and make it better," and let a more capable model research, plan, build, test, and iterate on its own. When it works it collapses the whole loop into a single instruction. The catch is cost, since an open-ended autonomous run spends far more than a tightly scoped task, so it stays a deliberate choice rather than a default. The pattern holds even here: the model keeps getting more capable, but deciding when to spend that capability is still a human job.
This project runs formal post-mortems and retrospectives. Run retrospectives. We do them every 5 days. They're the single most valuable process artifact, more important than the code itself. Here's what we learned:
The worst incident. For 3 days, AI dev agents opened PRs that silently overwrote Director and business-agent changes to main. Decision log entries disappeared. Entire documentation sections vanished. We didn't notice for 3 days.
Root cause: Git worktrees snapshot all tracked files at branch creation. When the worktree branch merges, stale copies of docs overwrite the current versions, and git doesn't warn you because it's technically correct behavior.
Fix: Three-layer prevention: (1) sparse checkout so protected files are physically absent from worktrees, (2) pre-PR cleanup script strips protected files before PR creation, (3) CI gate fails any PR touching protected paths. Documentation alone was not enough. The rules existed in AGENTS.md; agents just didn't follow them. Mechanical enforcement is the only enforcement that works with AI agents.
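The third layer, a CI gate on protected paths, can be sketched as a check over the list of files a PR changes. The pattern list is an assumption based on the files named in this paper, and in CI the changed-file list would typically come from something like `git diff --name-only` against the base branch.

```python
from fnmatch import fnmatchcase

# Assumed patterns; the real protected-path list is not shown in this paper.
PROTECTED_PATTERNS = ["AGENTS.md", "CLAUDE.md", "docs/*", "agents/*"]


def protected_touches(changed_files):
    """Return the changed paths that match a protected pattern, sorted."""
    return sorted(
        path
        for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in PROTECTED_PATTERNS)
    )


def gate(changed_files) -> int:
    """Return a process exit code: 0 to pass the PR, 1 to fail it."""
    offenders = protected_touches(changed_files)
    for path in offenders:
        print(f"protected path touched by code PR: {path}")
    return 1 if offenders else 0
```

The point of the gate is that it does not depend on the agent having read or followed the rules: a PR that touches a protected file fails regardless of intent.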
The agents' config files (such as AGENTS.md and CLAUDE.md) started small and grew into sprawling instruction manuals. Every time something went wrong, we'd add more instructions. Eventually they were so long that agents burned context window tokens just reading their own config files, leaving less room for actual work.
Fix: David (AI Workflow Advisor) helped define the problem and crafted a diagnostic prompt you can steal. Paste it into your own Claude Code or Codex setup to audit your agent workflow the same way:
That audit led to aggressive pruning: details moved into dedicated docs, config files became pointers rather than encyclopedias, and lean daily packets stayed under 50 lines. The lesson: treat AI context like expensive real estate. Every token of instruction costs a token of output.
The Legal agent went 11 days with zero completed tasks. The Designer missed 4 consecutive sessions. Nobody noticed because the Director was busy shipping code.
Root cause: Claude Code agents (PM, Designer, Legal, Marketing) don't run unless explicitly scheduled. Unlike Codex agents which get spawned with a task, Claude Code agents just... sit there. If the Director doesn't schedule them, they don't exist.
Fix: Standing daily schedule with explicit session slots per agent. No implicit expectations. If it's not on the schedule, it won't happen.
ESLint v9 broke every PR's lint step. Instead of treating it as a blocker, the team treated it as background noise, "oh, lint always fails." This masked real problems for days.
Fix: CI failures are blockers, period. No "known failures" that get ignored. If a step fails, fix it or remove it. Noise in CI is indistinguishable from real problems.
The PM wrote acceptance criteria retroactively, after the developer had already implemented the feature. This meant the criteria described what was built, not what should have been built.
Fix: PM writes acceptance criteria at story creation with a 48-hour validation window. If AC isn't written before implementation, the task stays blocked.
The Pressing Intelligence feature shipped without a kill-switch. During a rollback drill on Day 22, there was no way to disable PI in production. Required an emergency remediation PR.
Fix: Rollback capability is now a pre-deployment gate. If you can't turn it off, you can't turn it on.
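The "if you can't turn it off, you can't turn it on" rule can be sketched as a guard around a feature's handler. This is a minimal illustration, not the Deadwax implementation: `FeatureFlags`, `guardFeature`, and the `pressing-intel` flag name are hypothetical, and a real flag store would more likely live in configuration than in an in-memory object.

```typescript
// Hypothetical flag store; a real one might be loaded from config or a database.
type FeatureFlags = Record<string, boolean>;

interface GuardedResult<T> {
  status: 200 | 503;
  body: T | { error: string };
}

// Wrap a handler so the feature can be switched off without a redeploy.
// A missing flag counts as "off", so a feature has to be enabled explicitly.
function guardFeature<T>(
  flags: FeatureFlags,
  name: string,
  handler: () => T,
): GuardedResult<T> {
  if (flags[name] !== true) {
    return { status: 503, body: { error: `${name} is disabled` } };
  }
  return { status: 200, body: handler() };
}
```

Treating an absent flag as disabled is a deliberate choice in this sketch: a feature that ships without its flag wired up fails closed instead of running with no off switch.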
We learned that not every integration deserves the heaviest tool. When a task can be handled cleanly with the GitHub CLI, that is usually the cheapest path in both time and context. If the CLI is not enough, go to the API. Only reach for richer tool layers when they add unique capability or context we cannot get another way.
Fix: Treat interface choice as context budgeting: CLI first, then API, then heavier tool layers when the extra surface area earns its keep.
We built web-first with AI and shipped 40+ days of branding, design tokens, component patterns, and tested UX flows. When native iOS entered the picture, none of it was portable. Colors, spacing, typography, and interaction models all lived in React/CSS/HTML with no platform-agnostic layer. We essentially started from scratch on the native app's visual identity and interaction patterns rather than carrying that investment forward.
What we should have caught: Design tokens should have been platform-agnostic from day one (e.g., Style Dictionary exporting to both CSS variables and Swift assets). Brand guidelines should have been formalized separately from implementation. Component behavior specs should exist independently of React. The "future native app" roadmap item needed architectural investment early, not just a line item.
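To make "platform-agnostic tokens" concrete, here is a hand-rolled sketch of one token source feeding two outputs. It is not Style Dictionary's API and not Deadwax's real palette: the token names, hex values, and the `DesignTokens` Swift enum are placeholders chosen for illustration.

```typescript
// Hypothetical token set; names and values are placeholders.
type Tokens = Record<string, string>;

const colorTokens: Tokens = {
  "color-surface": "#121212",
  "color-accent": "#E0A030",
};

// Emit CSS custom properties for the web app.
function toCssVariables(tokens: Tokens): string {
  const lines = Object.entries(tokens).map(
    ([name, value]) => `  --${name}: ${value};`,
  );
  return [":root {", ...lines, "}"].join("\n");
}

// Convert "color-surface" to "colorSurface" for Swift identifiers.
function toCamelCase(name: string): string {
  return name.replace(/-([a-z0-9])/g, (_match, ch: string) => ch.toUpperCase());
}

// Emit Swift string constants holding the same values.
// A native build would still need to map these strings to Color values.
function toSwiftConstants(tokens: Tokens): string {
  const lines = Object.entries(tokens).map(
    ([name, value]) => `  static let ${toCamelCase(name)} = "${value}"`,
  );
  return ["enum DesignTokens {", ...lines, "}"].join("\n");
}
```

The point of the shape is that the token map is the only place a value is defined; both platform files are generated output, so a brand change cannot land on one platform and miss the other.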
Second-order lesson, rubric drift (2026-04-21): The first attempt to correct this, a "make iOS look like web" parity rubric, caused its own regression. Codex Swift subagents tolerated custom web-like chrome (including a non-native bottom control shell) because the rubric didn't distinguish brand parity from chrome parity. We caught it in a physical-device build and restored native glass controls, but it cost a half-day of shipping velocity.
What we're doing now:
The tool we built to watch operating efficiency briefly became a driver against it. In the launch window we saw an unexpected jump in our cloud usage. The dashboard was running expensive log and metrics queries far too often, and a separate routine re-scanned an entire table on every backend deploy, throwing off millions of database read units in a few hours with no matching user activity. A second contributor was self-hosting a large launch video and serving it through the CDN.
Fix: the dashboard now serves a single cached snapshot that refreshes on a schedule rather than on every page load, the live queries it does run are capped in concurrency and time, the per-deploy backfill was made manual, and the video moved to an embedded player. The standing rule that came out of it: cost-sensitive admin and reporting reads stay cached, bounded, and scheduled by default. Observability has an operating cost too, and an unbounded query is a bill waiting to happen.
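The "cached, bounded, and scheduled" rule can be approximated with a snapshot that reloads only after it ages out. This sketch is an assumption-laden stand-in: `createSnapshotCache` is a hypothetical helper, the loader is synchronous for brevity, and it refreshes lazily on the first read after expiry, whereas the fix described above refreshes on a schedule.

```typescript
// Hypothetical snapshot cache: the expensive query runs at most once per maxAgeMs,
// no matter how many page loads ask for the data.
function createSnapshotCache<T>(
  load: () => T,                    // the expensive query (synchronous in this sketch)
  maxAgeMs: number,                 // how long a snapshot stays fresh
  now: () => number = Date.now,     // injectable clock, useful for tests
): () => T {
  let snapshot: { value: T; fetchedAt: number } | undefined;

  return function get(): T {
    if (snapshot === undefined || now() - snapshot.fetchedAt >= maxAgeMs) {
      snapshot = { value: load(), fetchedAt: now() };
    }
    return snapshot.value;
  };
}
```

With this in place, a dashboard page load calls the returned getter and pays for the underlying query only when the snapshot has expired.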
Our retro cadence broke down for 15 days because the Director kept prioritizing shipping over reflection. During those 15 days, the same patterns repeated: agents not scheduled, carry-forward tasks piling up, same root causes. The retro is the product. Without it, you're just making the same mistakes faster.
This section is for engineers who want to understand exactly how the system is built. It covers repository structure, infrastructure, CI/CD mechanics, the agent coordination protocol, the Pressing Intelligence pipeline, and the automation scripts that hold it all together.
Monorepo using npm workspaces. Node 20+. Three workspaces, 29 automation scripts, 65+ documentation files.
discogs-app/
├── apps/
│ ├── web/ # React 19 + Vite 6 SPA (Tailwind, Playwright E2E)
│ └── api/ # TypeScript Lambda backend (Hono router)
├── packages/
│ ├── shared/ # Shared types (VinylRecord, Pressing, Pagination)
│ └── config/ # Centralized ESLint, TypeScript, Prettier config
├── scripts/ # 29 automation scripts (worktrees, PRs, CI, bots)
├── agents/ # Agent orchestration files (roles, execution, comms)
├── docs/ # 65+ markdown docs (product, design, engineering, legal)
├── infra/ # Infrastructure as Code
├── data/ # Seed data and static datasets
├── logs/ # Runtime logs from CI/agents/bots
└── tests/ # Shared test metadata + generated artifacts
React 19 SPA built with Vite 6, styled with Tailwind CSS (150+ custom theme tokens), routed with React Router v7.
apps/web/src/
├── pages/ # Route-level components
│ ├── CollectionPage # Discogs collection browser with sorting/filtering
│ ├── WantlistPage # Wantlist browser with pressing intelligence
│ ├── DiscoverPage # Discovery + rarity insights
│ ├── AboutPage # About + education links
│ └── ... # Landing, Privacy, TipJar, Beta, Callback
├── components/
│ ├── RecordTableLayout # Main collection/wantlist table
│ ├── PressingPanel # Pressing discovery slide-out panel
│ ├── FeedbackWidget # In-app bug report → GitHub Issue
│ └── ...
└── lib/
├── pressing-intel.ts # PI data retrieval
├── pressing-data.ts # Bayesian scoring + ranking logic
├── mastering-engineer-signals.ts # Engineer name matching
├── record-pressing-signals.ts # Multi-factor pressing signals
├── persistent-record-page-cache.ts # IndexedDB caching
├── background-preload.ts # Preload next record while viewing current
├── request-cache.ts # HTTP response cache layer
└── record-search.ts # Discogs search + filtering
Data flow: Components → lib utilities → API endpoints → Discogs proxy. Three-layer caching: browser IndexedDB for record pages, HTTP response cache for API calls, and background preloading for next-record anticipation.
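One way to picture the first two cache layers is as a read-through chain: check the fastest layer, fall back to the next, and only then hit the origin. The sketch below uses plain `Map` objects as stand-ins; real IndexedDB and HTTP access are asynchronous, background preloading is not shown, and `layeredLookup` is a hypothetical name.

```typescript
// Read-through lookup across ordered cache layers (fastest first), then the origin.
function layeredLookup<V>(
  key: string,
  layers: Array<Map<string, V>>,
  origin: (key: string) => V,
): V {
  const missed: Array<Map<string, V>> = [];
  for (const layer of layers) {
    const hit = layer.get(key);
    if (hit !== undefined) {
      // Backfill the faster layers that missed so the next read is cheaper.
      for (const faster of missed) faster.set(key, hit);
      return hit;
    }
    missed.push(layer);
  }
  // Every layer missed: fetch from the origin and populate all layers.
  const value = origin(key);
  for (const layer of missed) layer.set(key, value);
  return value;
}
```

Preloading would fit this model as an early call to the same lookup for the record the collector is likely to open next.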
Single AWS Lambda function routing via Hono. Handles OAuth 1.0a, Discogs API proxy, Pressing Intelligence, analytics, and feedback submission.
apps/api/src/
├── handlers/ # Route handlers
│ ├── auth.ts # OAuth 1.0a flow (Discogs)
│ ├── auth-session.ts # JWT session tokens
│ ├── collection.ts # GET /api/collection/* (cached proxy)
│ ├── wantlist.ts # GET /api/wantlist/*
│ ├── wantlist-add.ts # POST /api/wantlist (allowed wantlist add)
│ ├── wantlist-remove.ts # DELETE /api/wantlist/:releaseId (allowed wantlist remove)
│ ├── pressing-intel.ts # GET /api/pressing-intel/*
│ ├── feedback.ts # POST /api/feedback → GitHub Issue (Octokit)
│ ├── events.ts # POST /api/events (structured analytics)
│ └── health.ts # GET /api/health
├── services/
│ ├── discogs-client.ts # HTTP client with caching + rate-limit backoff
│ ├── oauth.ts # OAuth 1.0a signature generation
│ ├── session-store.ts # DynamoDB session storage
│ ├── catalog-cache-store.ts # DynamoDB collection cache
│ ├── memory-cache.ts # In-process LRU cache
│ └── ssm.ts # AWS Secrets Manager client
├── middleware/
│ ├── discogs-rate-limit.ts # Discogs API backoff (respect 429s)
│ └── read-only-guard.ts # Block all mutations except explicit wantlist add/remove routes
└── pi/ # Pressing Intelligence pipeline (see below)
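The read-only guard listed above can be sketched as an allowlist check. This assumes the guard applies only to Discogs-bound routes (feedback and events would sit outside it) and that release IDs are numeric; `ALLOWED_MUTATIONS` and `isDiscogsRequestAllowed` are hypothetical names, and the paths simply mirror the handler list.

```typescript
// Methods that never mutate state are always allowed through.
const SAFE_METHODS = ["GET", "HEAD", "OPTIONS"];

// Hypothetical allowlist: the only mutations permitted to reach Discogs.
const ALLOWED_MUTATIONS: Array<{ method: string; pattern: RegExp }> = [
  { method: "POST", pattern: /^\/api\/wantlist$/ },         // wantlist add
  { method: "DELETE", pattern: /^\/api\/wantlist\/\d+$/ },  // wantlist remove
];

// Returns true when a request may pass; anything not listed is blocked.
function isDiscogsRequestAllowed(method: string, path: string): boolean {
  const verb = method.toUpperCase();
  if (SAFE_METHODS.includes(verb)) return true;
  return ALLOWED_MUTATIONS.some(
    (rule) => rule.method === verb && rule.pattern.test(path),
  );
}
```

Defaulting to "blocked" means a newly added write route does nothing until someone adds it to the allowlist on purpose.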
Security layers:
┌─────────────────────────────────────────────────────┐
│ CloudFront CDN │
│ (www.deadwax.io + deadwax.io) │
├───────────────────────┬─────────────────────────────┤
│ S3 (private origin) │ API Gateway + Lambda │
│ React SPA + static │ /api/* routes │
│ assets │ OAuth, proxy, PI, events │
├───────────────────────┴─────────────────────────────┤
│ DynamoDB (sessions, cache, config) │
│ PostgreSQL (Pressing Intelligence) │
│ CloudWatch (structured logs + alarms) │
│ Secrets Manager (OAuth keys, API tokens) │
└─────────────────────────────────────────────────────┘
S3 bucket is private, no public access. CloudFront uses Origin Access Identity. All API calls route through API Gateway → single Lambda. Everything is pay-per-request (Lambda + DynamoDB on-demand + S3/CloudFront), so there is no idle capacity to pay for.
Every PR triggers the full CI pipeline via GitHub Actions. Nothing merges without passing all gates.
PR opened or pushed to main/codex/feat/fix/infra branches
│
├── Security Scanning ──── ripgrep enforces Discogs write constraints,
│ blocks token/secret exposure in code
├── ESLint ──────────────── shared config (typescript-eslint v8 flat config)
├── TypeScript ──────────── strict mode, all workspaces
├── Vitest Unit Tests ───── frontend components + backend handlers
├── Playwright E2E ──────── Chromium headless, video recording
├── Code Coverage ───────── Istanbul + V8, delta posted as PR comment
└── Protected Files Gate ── fails if PR modifies AGENTS.md, CLAUDE.md, etc.
All green → merge to main → deploy workflow triggers:
│
├── Build frontend (Vite)
├── Upload to S3
├── Invalidate CloudFront
└── Verify production endpoints (curl health checks)
Deploy triggers: workflow_run event fires after CI succeeds on main. Concurrency group serializes deploys (no race conditions). OIDC role assumption for AWS credentials (no static keys in CI).
Ten AI agents coordinate through markdown files committed to Git. No Slack, no Jira, no Notion. Everything is versioned and traceable.
agents/
├── MASTER-EXECUTION.md # Single source of truth for all agent tasks
├── TODAY.md # Current-day runtime packet (<50 lines)
├── COMMS.md # Durable communication log (CEO ↔ agents)
├── COMMS-TODAY.md # Today's communication digest
├── STATUS.md # Build/deploy status tracker
├── roles/ # Role definitions (one per agent)
│ ├── director.md # Orchestration, dispatch, escalation
│ ├── product-manager.md # Backlog, stories, acceptance criteria
│ ├── architect.md # ADRs, system design, code review
│ ├── backend-dev.md # Lambda, services, PI pipeline
│ ├── frontend-dev.md # React, UI, E2E tests
│ ├── tester.md # Test strategy, QA, coverage
│ ├── designer-researcher.md # UX specs, design system
│ ├── marketing-manager.md # GTM, community, outreach
│ └── legal.md # Compliance, privacy, naming
├── execution/ # Daily execution logs (EXEC-2026-03-*.md)
├── decisions/ # Decision log, one entry per choice
├── incidents/ # Post-mortem records (INC-*.md)
├── retros/ # Sprint retrospectives (every 5 days)
└── auto-tasks/ # Bot-generated task packets (TASK-*.md)
Information flow: CEO posts decisions to COMMS.md → Director reads and creates TODAY.md + COMMS-TODAY.md → all agents read these first → agents execute and post updates back to COMMS.md → Director archives to execution/EXEC-*.md and creates next day's packet.
Merge order enforcement: Infra → Backend → Frontend → Tester. The Director specifies daily in MASTER-EXECUTION.md. This prevents merge conflicts from parallel agent work.
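The merge order above amounts to a stable sort over pending PRs. A minimal sketch, assuming each PR is tagged with a lane; the `PendingPr` shape and the tie-break on PR number are illustrative assumptions, not something the Director's files specify.

```typescript
// Lane order taken from the rule above: Infra → Backend → Frontend → Tester.
const MERGE_ORDER = ["infra", "backend", "frontend", "tester"] as const;
type Lane = (typeof MERGE_ORDER)[number];

// Hypothetical shape for a PR waiting to merge.
interface PendingPr {
  number: number;
  lane: Lane;
}

// Sort by lane first, then by PR number so older PRs in a lane merge first.
function sortByMergeOrder(prs: PendingPr[]): PendingPr[] {
  return [...prs].sort(
    (a, b) =>
      MERGE_ORDER.indexOf(a.lane) - MERGE_ORDER.indexOf(b.lane) ||
      a.number - b.number,
  );
}
```

The function copies its input before sorting, so the caller's list is left untouched.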
Not everything is automated. Some decisions deliberately stop the pipeline and require human sign-off before work continues. No workaround, no override.
Each Codex agent works in an isolated git worktree, a full copy of the repo on a separate branch. This enables truly parallel development without conflicts.
# Create a worktree for backend work
$ bash scripts/worktree-create.sh backend feat/oauth-proxy
→ /Users/.../discogs-app/.claude/worktrees/backend
# When done, create PR from worktree
$ bash scripts/pr-create.sh --title "feat: OAuth proxy"
→ Rebases on main → strips protected files → pushes → opens PR
# Clean up after merge
$ bash scripts/worktree-cleanup.sh backend --delete-branch
Protected files problem: Worktrees snapshot all files at branch creation. When merged, stale copies of docs overwrite current versions. Fix: three-layer prevention, sparse checkout (files physically absent), pre-PR cleanup script, and CI gate that fails any PR touching protected paths.
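The third layer, the CI gate, reduces to checking a PR's changed files against a protected list. This is a sketch under assumptions: AGENTS.md and CLAUDE.md are named in this paper, but treating everything under `agents/` as protected is a guess made for illustration, and `findProtectedChanges` is a hypothetical name.

```typescript
// Files named in this paper as protected.
const PROTECTED_FILES = ["AGENTS.md", "CLAUDE.md"];
// Assumption for illustration: the whole agents/ directory is protected too.
const PROTECTED_PREFIXES = ["agents/"];

// Return every changed path that a dev-agent PR is not allowed to touch.
function findProtectedChanges(changedFiles: string[]): string[] {
  return changedFiles.filter(
    (file) =>
      PROTECTED_FILES.includes(file) ||
      PROTECTED_PREFIXES.some((prefix) => file.startsWith(prefix)),
  );
}
```

In CI, the input list could come from something like `git diff --name-only origin/main...HEAD`, with the job failing whenever the returned array is non-empty.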
Core principle: never create an AI skill for a deterministic task. Scripts are cheaper, faster, testable, and don't burn tokens.
| Category | Script | What It Does |
|---|---|---|
| Worktrees | worktree-create.sh | Create isolated worktree + branch from origin/main |
| Worktrees | worktree-cleanup.sh | Remove worktrees, prune merged branches |
| PRs | pr-create.sh | Rebase → cleanup protected files → push → open PR |
| PRs | pr-review.sh | Fetch diff, metadata, check protected files, CI status |
| PRs | pre-pr-cleanup.sh | Reset protected files to main before PR |
| CI/CD | deploy-aws.sh | Build → S3 upload → CloudFront invalidation |
| CI/CD | coverage.sh | Run tests, generate coverage, diff against main |
| CI/CD | main-sync.sh | Fetch + rebase local main on origin |
| Bots | run-feedback-pipeline.sh | Full feedback loop: triage → fix → PR → merge → deploy (hourly) |
| Bots | run-triage-bot.sh | PM triage: classify issues, create task packets |
| Bots | run-fix-bot.sh | Dev fix: read task packets, implement, open PR |
| Testing | run-all.sh | Master test runner (vitest + playwright + node --test) |
| Testing | security-readonly-checks.sh | Enforce Discogs write constraints with wantlist add/remove as the only allowed write path |
| Testing | verify-runtime-endpoints.sh | POST curl tests to deployed API endpoints |
| PI | pi-report.mjs | Generate Pressing Intelligence status report |
| PI | bootstrap-qa-session.mjs | Set up QA session with seed albums |
| Codex | ship-loop.sh | Post-approval squash-merge → deploy → verify production |
Multi-source evidence pipeline that ranks vinyl pressings using Bayesian-weighted confidence scoring.
Scoring formula:
score = (v/(v+m)) * R + (m/(v+m)) * C
where v = number of votes for the pressing, R = its mean rating, m = 25 (minimum votes), C = 4.0 (prior mean)
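The formula is a weighted average that pulls thinly supported pressings toward the prior mean. A minimal sketch, reading v as the number of votes for a pressing and R as its mean rating; the function name and parameter names are illustrative, not taken from `pressing-data.ts`.

```typescript
// Bayesian-weighted score: blends a pressing's own mean rating with a prior mean.
// With few votes the prior dominates; with many votes the pressing's own mean does.
function bayesianScore(
  meanRating: number,  // R: mean rating for this pressing
  votes: number,       // v: number of votes behind that mean
  minVotes = 25,       // m: weight given to the prior
  priorMean = 4.0,     // C: prior mean
): number {
  if (votes < 0) throw new RangeError("votes must be non-negative");
  const total = votes + minVotes;
  return (votes / total) * meanRating + (minVotes / total) * priorMean;
}
```

For example, a pressing rated 5.0 on 25 votes scores 4.5, and the same rating on 75 votes scores 4.75, so a single glowing review cannot outrank a pressing with broad support.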
Pipeline:
Sources (YouTube, forums, Gemini AI analysis, editorial)
↓
Ingest (fetch content, hash for dedup, store in PostgreSQL)
↓
Extract claims (per-connector: youtube_transcript, youtube_gemini_video,
forum_ingest_shf, forum_ingest_reddit, editorial_review)
↓
Normalize (entity resolver: "Bernie Grundman" == "BG at Bernie Grundman Mastering")
↓
Score (confidence × relevance weights per claim type)
↓
Rank (materialized view: pressings sorted by weighted score)
↓
Serve API: GET /api/pressing-intel/{album_seed_id}
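The Normalize step can be illustrated with an alias table keyed on a cleaned-up form of the raw name. This is a simplified sketch: the alias entries come from the example in the pipeline above, while `normalizeKey`, `resolveEngineer`, and the exact cleanup rules are assumptions and not the pipeline's actual resolver.

```typescript
// Hypothetical alias table: normalized variant → canonical engineer name.
const ENGINEER_ALIASES = new Map<string, string>([
  ["bernie grundman", "Bernie Grundman"],
  ["bg at bernie grundman mastering", "Bernie Grundman"],
]);

// Lowercase, replace punctuation with spaces, and collapse whitespace
// so minor formatting differences map to the same key.
function normalizeKey(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[^a-z0-9 ]+/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Resolve a raw mention to its canonical name, or undefined if unknown.
function resolveEngineer(raw: string): string | undefined {
  return ENGINEER_ALIASES.get(normalizeKey(raw));
}
```

Returning `undefined` for unknown names, instead of guessing, lets unresolved mentions be routed to a review queue.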
Connectors:
- youtube_metadata: Album metadata from YouTube search
- youtube_transcript: Transcript text extraction
- youtube_gemini_video: Gemini 2.5-flash analyzes video content → structured claims
- editorial_review: Hand-curated expert assessments
- forum_ingest_shf: Steve Hoffman Forums scraper
- forum_ingest_reddit: Reddit r/vinyl community discussions

Storage: PostgreSQL for sources, claims, citations, and review queue. DynamoDB for connector registry and album seed config. CloudWatch for event logging.
On-demand research: when a collector opens an album that doesn't have Pressing Intelligence yet, Deadwax emits an event. A scheduled job picks that event up later and researches the album (Discogs data, audio forums, Steve Hoffman threads, YouTube shootout videos, collector discussions, pressing comparisons), then distills it into collector-facing guidance. Instead of you opening twenty tabs to figure out the best pressing of Pet Sounds or Wish You Were Here, Deadwax tries to hand you the summary.
The hardest feature in the product is letting a collector scan the deadwax itself, the matrix and stamper codes etched into the runout groove between the last track and the label. Get it working and you can identify a specific pressing from the record in your hands, no typing required.
The trouble is that the deadwax is genuinely hostile to a camera: tiny etched text, bad lighting, reflections off the vinyl, a curved reflective surface, handwritten matrix numbers, noisy label artwork, and plain old wear. So rather than guess, the build leans on real evidence: testers screen-record themselves scanning records, those recordings are saved and analyzed, and the scanning experience is tuned against them until it's intuitive and accurate enough to ship. It's the same philosophy as the rest of the system: don't argue about what works, instrument it and let the data decide.
Runs on a Mac mini via macOS cron. Three scheduled jobs:
| Script | Schedule | What It Does |
|---|---|---|
| run-feedback-pipeline.sh | Every hour, top of hour | Full loop: triage → fix → PR → CI → merge → deploy |
| run-triage-bot.sh | Every 4 hours | PM classifies GitHub issues → creates TASK-*.md packets |
| run-fix-bot.sh | Every 4 hours (offset +1h) | Dev reads task packets → implements fix → opens PR |
Safety rails: Won't auto-fix P0 critical issues, UX redesigns, auth/security changes, schema migrations, or anything ambiguous. Feature requests and larger asks get flagged with [needs-input] label for human review. Only safe, scoped bug fixes ship automatically.
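The safety rails can be read as a routing function that defaults to a human. In this sketch the `needs-input` label and the P0 exclusion come from the text above; the `IssueKind` categories, the P1 to P3 levels, and the `routeIssue` name are hypothetical simplifications of whatever the triage bot actually records.

```typescript
type Priority = "P0" | "P1" | "P2" | "P3";
type IssueKind =
  | "bug"
  | "feature"
  | "ux-redesign"
  | "auth-security"
  | "schema-migration"
  | "unclear";

// Hypothetical result of triage for one GitHub issue.
interface TriagedIssue {
  kind: IssueKind;
  priority: Priority;
}

type Route = "auto-fix" | "needs-input";

// Only scoped, non-critical bug fixes ship automatically; everything else goes to a human.
function routeIssue(issue: TriagedIssue): Route {
  if (issue.kind !== "bug") return "needs-input";        // features, redesigns, auth, schema, unclear
  if (issue.priority === "P0") return "needs-input";     // critical issues are never auto-fixed
  return "auto-fix";
}
```

Writing the rule as "allow one narrow case, escalate the rest" means a new issue category is escalated by default until someone decides it is safe.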
docs/
├── adr/ # Architecture Decision Records
├── design/ # Design system, brand guidelines, UX specs, wireframes
├── engineering/ # Codex playbook, execution overview, code review, API reference
├── infra/ # Architecture diagrams, caching, deploy runbook, cost analysis
├── product/ # PRD, vision, roadmap, backlog, user stories, PI rollout
├── legal/ # Privacy policy, compliance checklist, OAuth consent, licensing
├── marketing/ # GTM brief, landing copy, competitive analysis, outreach tracker
├── research/ # User research findings, beta candidates, interview guides
├── education/ # This document + DEADWAX-EXPLAINED.md
├── company/ # Culture, team
└── testing/ # Test plan + QA strategy
Every document is self-contained but cross-referenced. ADRs track architecture decisions with rationale. Product specs include acceptance criteria linked to design specs. The decision log gives every choice a permanent ID.
# Install dependencies
npm install
# Build all workspaces
npm run build
# Run all checks
npm run lint && npm run typecheck && npm run test
# Local frontend dev (Vite hot reload)
npm run dev --workspace @deadwax/web
# Environment
cp .env.example .env.local
# then fill in the values your environment needs
Want to learn more about how Deadwax is built? This paper is meant to teach, so questions are welcome. Email support@deadwax.io, or follow @deadwax.io on Instagram.
Most engineering write-ups skip the legal work, which is a mistake, because for a small consumer product it sits squarely on the critical path to launch. We brought a dedicated Legal agent in early, and it paid for itself almost immediately.
The first working name combined "Discogs" with our own, which turns out to violate Discogs' application-naming policy outright, so a rename was forced before we had really started. We chose Deadwax, the collector's term for the runout groove, because it lands instantly with the audience. The catch is that a good name is rarely unclaimed. A trademark clearance search across the relevant software and entertainment classes came back clear of blocking registrations, but there was prior use to navigate, including a similarly named app that surfaced in the App Store around the same time we did. We treated that seriously rather than ignoring it: running the clearance search up front, securing the matching domain, and planning an intent-to-use trademark filing ahead of a wider launch. The lesson is the cheap one to state and the expensive one to learn late: validate the name before you build a brand on it.
Shipping a native iOS app meant clearing Apple's review, which is its own body of rules to comply with, and two issues had to be worked through. Apple requires in-app account deletion rather than a web redirect, so we added a flow in Settings that permanently removes everything Deadwax stores for a user. Apple also questioned the album artwork in our screenshots, so we documented that the images are user-contributed, openly licensed photos served the same way Discogs' own apps serve them, with a synthetic-placeholder fallback ready in case. We also worked directly with Apple to release the app name to our account so the listing could go live. None of this is glamorous, but it is the difference between a finished build and an app a collector can actually download.
The privacy posture is deliberately conservative, which also keeps the legal surface small. Collectors sign in with Discogs OAuth, so the app never sees a Discogs password. Access tokens are held server-side only, encrypted at rest, and never written to logs. The app reads collection and wantlist data and performs only the writes a signed-in user explicitly asks for, never deleting anything from a Discogs account. Nothing is sold to third parties or handed to ad networks, and collection data is fetched fresh rather than warehoused. A published privacy policy and an OAuth consent flow make that contract visible to the user, which is exactly where a trust-dependent product wants it.
See also: How We Built Deadwax, the visual overview with team cards, pipeline diagrams, and roadmap.