Benchmark

Proof, not promises.

With the rules held identical and only the context format changing, AIGX produced the most correct and most disciplined agent output. It is also the only context format validated this way.

What was measured

A controlled ablation. A single real TypeScript codebase (~35 source files) is held constant. The only variable is how the project's rules are written down. Every format encodes the identical rule set; semantic parity is machine-checked. The subject models are autonomous agents that grep, read, edit, and run npm test / tsc.

The codebase contains planted traps a careless edit hits: deep-import boundary violations, dependency cycles, cross-event data leaks, cache-header ordering, AI hallucination from marketing copy, plus 10 hard-correctness traps (TOCTOU double-booking, floating-point money, DST conversion, Unicode folding, cursor pagination, idempotency, IDOR, ReDoS, illegal state transitions, unbounded caches).
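To make the first trap concrete: a deep-import boundary violation is an import that reaches past a module's public entry point into its internals. The sketch below shows one way such an import could be flagged. It is not the benchmark's own checker. The `src/modules/<name>/` layout, the `index` entry point, and the function names are all assumptions made for illustration, and paths are assumed to be already resolved relative to the project root.

```typescript
// Hypothetical layout: each module lives in src/modules/<name>/ and exposes
// its public API only through src/modules/<name>/index.ts.
const MODULE_ROOT = "src/modules/";

/** Returns the module that owns a path, or null if the path is outside MODULE_ROOT. */
function owningModule(path: string): string | null {
  if (!path.startsWith(MODULE_ROOT)) return null;
  const name = path.slice(MODULE_ROOT.length).split("/")[0] ?? "";
  return name.length > 0 ? name : null;
}

/** True when `importer` reaches into the internals of a module it does not belong to. */
function isDeepImport(importer: string, imported: string): boolean {
  const from = owningModule(importer);
  const to = owningModule(imported);
  // Imports of non-module files, or within the same module, are always allowed.
  if (to === null || from === to) return false;
  // From the outside, only the module's entry point may be imported.
  const entry = `${MODULE_ROOT}${to}`;
  const allowed = [entry, `${entry}/index`, `${entry}/index.ts`];
  return !allowed.includes(imported);
}
```

Under these assumptions, `isDeepImport("src/modules/billing/invoice.ts", "src/modules/events/internal/store")` is flagged, while importing `"src/modules/events"` from the same file is not.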

How it was scored (deterministic & tamper-proof)

  • Hidden tests - injected after the agent finishes, run, then removed. The agent never sees them, so it cannot overfit to them. This is the primary correctness signal.
  • Architecture-violation check - a diff against the pristine codebase detects forbidden imports and dependency cycles.
  • tsc --noEmit, a gzip bundle-budget gate, and rubric probes.
  • Final score (0-100) is weighted: visible 20 / hidden 30 / architecture 20 / obedience 15 / perf-security 10 / minimality 5. No LLM judge is in the score.
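The weighting above can be sketched as a plain weighted sum. The weights come from the list. Everything else is an assumption made for illustration: that each component is first normalized to a fraction between 0 and 1, that the components combine linearly, and the names `WEIGHTS` and `finalScore`. The actual scorer may normalize or combine them differently.

```typescript
type Component =
  | "visible"
  | "hidden"
  | "architecture"
  | "obedience"
  | "perfSecurity"
  | "minimality";

// Weights as published; they sum to 100.
const WEIGHTS: Record<Component, number> = {
  visible: 20,
  hidden: 30,
  architecture: 20,
  obedience: 15,
  perfSecurity: 10,
  minimality: 5,
};

/**
 * Combines per-component fractions (assumed to lie in [0, 1]) into a 0-100 score.
 * Out-of-range inputs are clamped rather than rejected.
 */
function finalScore(fractions: Record<Component, number>): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as Component[]) {
    const clamped = Math.min(1, Math.max(0, fractions[key]));
    total += WEIGHTS[key] * clamped;
  }
  return total;
}
```

Under this assumed formula, a run that passes everything except half of the hidden tests would score 100 - 30 × 0.5 = 85, which shows why the hidden tests dominate the result.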

Headline results (powered to n=60)

Mean final-score on the discriminating original-10 suite. arch-viol = % of runs that crossed a forbidden import boundary (lower is better).

Claude Sonnet 4.6 (stronger tier)

Format         mean   pass@1   hidden   arch-viol
🧬 aigx_terse  95.4   0.92     98.6%    8%
md             95.1   0.80     96.4%    0%
exifai_v2      94.6   0.80     96.1%    3%
aigx_v9        93.6   0.77     94.3%    10%
xml            93.1   0.80     93.8%    13%

Claude Haiku 4.5 (weaker tier)

Format         mean   pass@1   hidden   arch-viol
🧬 aigx_terse  93.5   0.78     96.0%    7%
aigx_v9        92.8   0.70     92.6%    5%
exifai_v2      92.4   0.67     90.2%    0%
xml            92.3   0.75     93.3%    8%
md             92.2   0.70     93.6%    10%

The terse AIGX variant (aigx_terse) ranks first on mean, pass@1, and hidden-test pass on both models, making it the only format that leads on both tiers. Markdown is second on Sonnet by mean yet last on Haiku; XML is last on Sonnet by mean but second on Haiku by pass@1. AIGX is the one you can trust not to fall over when you change models.

Consistency is the headline. AIGX comes first on both a weaker and a stronger model, leads on the hidden-test pass rate that matters most, and stays the simplest to author. The result holds when you swap models; it is not a one-tier spike.

The challenger log - we tried hard to beat it

After AIGX won the comparison, we ran a deliberate campaign to beat the winning design: ~24 challenger variants across 6 research rounds, covering in-source guards, positional tricks (primacy/recency), salience ladders, positive re-framing, 10 prose re-renderings, and combinations of the two best ideas. Every one failed to beat it. Challengers that looked strong at a small sample fell back behind aigx_terse on hidden-test pass and pass@1 once powered to n=60. AIGX is a robust optimum, not a lucky point estimate.

Reproduce & extend

The whole harness is open: a generator, a materializer, a runner, and a deterministic scorer. One canonical knowledge base produces every format's files, so parity is guaranteed by construction and re-checked. Point it at your own codebase, add a model, or submit a challenger format - the result is built to be re-run, not taken on faith.
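The "one canonical knowledge base, many formats" idea can be sketched as follows. This is not the harness's code. The `Rule` shape, both renderers, the ID patterns, and the function names are hypothetical, and the check only compares rule IDs, which is a much weaker property than the semantic parity the harness verifies.

```typescript
interface Rule {
  id: string;
  text: string;
}

// Two hypothetical renderers fed from the same canonical rule list.
const renderMarkdown = (rules: Rule[]): string =>
  rules.map((r) => `- [${r.id}] ${r.text}`).join("\n");

const renderXml = (rules: Rule[]): string =>
  rules.map((r) => `<rule id="${r.id}">${r.text}</rule>`).join("\n");

// Patterns that recover the rule ID from each rendered format.
const MD_ID = /- \[([\w-]+)\] /;
const XML_ID = /<rule id="([\w-]+)">/;

/** Collects every rule ID found in a rendered file, sorted for comparison. */
function extractIds(rendered: string, pattern: RegExp): string[] {
  const re = new RegExp(pattern.source, "g"); // force global matching
  const ids: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(rendered)) !== null) {
    if (m[1] !== undefined) ids.push(m[1]);
  }
  return ids.sort();
}

/** True when the rendered IDs are exactly the canonical IDs: none missing, extra, or duplicated. */
function hasParity(canonical: Rule[], renderedIds: string[]): boolean {
  const expected = canonical.map((r) => r.id).sort();
  return (
    expected.length === renderedIds.length &&
    expected.every((id, i) => id === renderedIds[i])
  );
}
```

Because every format is generated from the same list, a renderer that drops or duplicates a rule fails the re-check instead of silently skewing the comparison.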


The full methodology, raw data, and round-by-round challenger table are the canonical record: BENCHMARK.md on GitHub. Independent replication is welcome - open a discussion.