7 Incidents in One Day: What Human + AI Code Review Actually Looks Like
Real incident data from a live development session. Not theory—actual findings with timestamps and categorization.
On January 24, 2026, during a routine code improvement session, we documented something unexpected: 7 distinct incidents caught in a single day. What emerged wasn't just a list of bugs—it was a clear pattern showing exactly where humans and AI each excel.
The Raw Data
Here's what we found, categorized by who caught it:
Caught by AI (Systematic Review): 5 Incidents
- Zero Test Coverage: No tests in the entire 250K+ line codebase
- Client-Side Role Assignments: Security vulnerability—roles determined on client
- Dual Firestore Collections: Same data stored in two places
- Hardcoded Client Data: Real company name embedded in code
- Styling Documentation Gap: No documented decision on styling approach
Caught by Human (Judgment Calls): 2 Incidents
- Functionality Removal: AI accidentally deleted a working "Enter Board" button during refactoring
- Unfounded Recommendation: AI suggested migrating to Tailwind without investigating—turns out the codebase was 99% inline styles by design
The Pattern
This wasn't random. A clear pattern emerged:
AI excels at finding. Systematic review, pattern matching, exhaustive search. The AI scanned thousands of files and identified issues that would take a human days to find. Test coverage? Counted. Security patterns? Analyzed. Data architecture? Mapped.
Human excels at filtering. Both human-caught incidents were judgment errors—places where the AI "followed the rules" but missed the intent. Removing the button was technically cleaning up code. Suggesting Tailwind was technically addressing "inconsistency." But a human immediately recognized: "Wait, we need that button" and "Wait, why would we change 2,400 style blocks for no reason?"
The Unfounded Recommendation Incident
This one deserves special attention because it almost caused 10-20 hours of unnecessary work.
The AI recommended: "Standardize styling (migrate inline → Tailwind)" as a medium-priority "cleanup" task.
The human asked: "Is there a reason we chose one over the other?"
The AI investigated (for the first time) and found:
- `style={}` (inline): 2,429 usages
- `className=` (Tailwind): 11 usages
A 220:1 ratio. This wasn't inconsistency—it was a deliberate architectural decision. The AI had recommended a major pivot based on assumption, not evidence.
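If you want to run this kind of check on your own codebase, here is a minimal sketch, assuming a Node.js environment and React-style components under `./src`. The script, the path, and the file extensions are illustrative; this is not the tooling used in the original session.

```typescript
// count-style-usage.ts
// Illustrative sketch: tally JSX inline style props vs className props.
import { readdirSync, readFileSync, statSync } from "fs";
import { join, extname } from "path";

const SOURCE_DIR = "./src"; // hypothetical project layout
const EXTENSIONS = new Set([".tsx", ".jsx", ".ts", ".js"]);

// Recursively collect source files under a directory.
function walk(dir: string, files: string[] = []): string[] {
  for (const entry of readdirSync(dir)) {
    const full = join(dir, entry);
    if (statSync(full).isDirectory()) {
      walk(full, files);
    } else if (EXTENSIONS.has(extname(full))) {
      files.push(full);
    }
  }
  return files;
}

let inlineCount = 0;
let classNameCount = 0;

for (const file of walk(SOURCE_DIR)) {
  const source = readFileSync(file, "utf8");
  inlineCount += (source.match(/style=\{/g) ?? []).length;
  classNameCount += (source.match(/className=/g) ?? []).length;
}

console.log(`style={} (inline):     ${inlineCount}`);
console.log(`className= (Tailwind): ${classNameCount}`);
console.log(`approximate ratio: ${Math.round(inlineCount / Math.max(classNameCount, 1))}:1`);
```

Run it with ts-node (or compile it first) and it prints the two counts and an approximate ratio. The point isn't the script itself: it's that gathering this evidence takes minutes, and it should happen before a migration is recommended, not after.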
What We Changed
This session led to two framework improvements:
1. The Agent Constitution
We created 10 concise rules that the AI checks before every action. Rule 5 (ROOT): "No workarounds, find the real problem." Rule 6 (VALIDATE): "Test technology choices before committing."
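As a rough illustration of what "checks before every action" can mean in practice, the two rules quoted above could be encoded as data that an agent harness injects into each turn. This is a sketch under our own assumptions: the `ConstitutionRule` shape and `renderConstitution` helper are hypothetical, and the other eight rules are deliberately left out.

```typescript
// Illustrative only: constitution rules as data the agent consults before acting.
interface ConstitutionRule {
  id: number;
  name: string;
  text: string;
}

const AGENT_CONSTITUTION: ConstitutionRule[] = [
  // ...rules 1-4 elided...
  { id: 5, name: "ROOT", text: "No workarounds, find the real problem." },
  { id: 6, name: "VALIDATE", text: "Test technology choices before committing." },
  // ...rules 7-10 elided...
];

// Rendered into the system prompt (or a pre-action check) on every turn.
function renderConstitution(rules: ConstitutionRule[]): string {
  return rules.map((r) => `Rule ${r.id} (${r.name}): ${r.text}`).join("\n");
}
```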
2. The Recommendation Protocol
Before recommending any change, the AI must now do all of the following (a code sketch follows the list):
- Investigate current state
- Understand why it's that way
- Provide evidence of the problem
- Assess trade-offs
- Label it correctly (cleanup vs. refactor vs. architecture change)
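One way to make the protocol concrete is to treat it as a required shape that any recommendation must satisfy before it reaches a human. The sketch below is ours, not part of the documented framework: the type names, fields, and `isWellFounded` check are illustrative.

```typescript
// Illustrative only: the recommendation protocol expressed as a data shape.
type ChangeLabel = "cleanup" | "refactor" | "architecture change";

interface Recommendation {
  summary: string;             // e.g. "Standardize styling"
  currentState: string;        // what the codebase actually does today
  whyItIsThatWay: string;      // the rationale behind the status quo
  evidenceOfProblem: string[]; // concrete findings: counts, failures, reports
  tradeOffs: string[];         // costs and risks of changing vs. not changing
  label: ChangeLabel;          // cleanup vs. refactor vs. architecture change
}

// Reject any recommendation that skips the investigation steps.
function isWellFounded(rec: Recommendation): boolean {
  return (
    rec.currentState.trim().length > 0 &&
    rec.whyItIsThatWay.trim().length > 0 &&
    rec.evidenceOfProblem.length > 0 &&
    rec.tradeOffs.length > 0
  );
}
```

Under this framing, the Tailwind suggestion would never have reached a human as written: it had a summary and a label, but no investigated current state, no evidence, and no trade-off analysis.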
The Takeaway
The human didn't catch more incidents than the AI. The human caught different incidents—the ones that required understanding intent, context, and business value.
This is why we believe in Human + AI, not AI alone. The combination caught more issues in one day than either would have alone. And more importantly, it prevented a multi-day detour into unnecessary refactoring.
The AI is excellent at systematic analysis. The human is essential for maintaining direction. Together, they're formidable.
Appendix: All 7 Incident Reports
Each incident was formally documented with root cause analysis and prevention measures. The full reports are available in our methodology documentation for teams implementing similar Human + AI workflows.