
How ScryCheck Uses AI (and Where It Doesn’t)

April 15, 2026

If you’re skeptical of AI deck scoring, you’re probably right to be. We tested our scoring engine against Claude and GPT-4o on 252 Commander decks. The deterministic engine hit 93% bracket accuracy. The LLMs got 48% and 38%. Here’s where we actually use AI — and what that test showed us about why the split matters.

The black-box problem is real

Most player frustration with “AI analysis” comes down to one thing: opacity. If a tool can’t explain why your deck is an 8 instead of a 6, it doesn’t matter how confident the output sounds.

Commander players already know power level is messy. Adding a black box on top of an already subjective problem just makes trust worse.

What ScryCheck does not use AI for

ScryCheck does not use an LLM to make runtime scoring decisions on your deck.

When you submit a list, it runs through a deterministic, multi-stage scoring pipeline. Card parsing, vector scoring, combo detection, archetype/theme detection, composition checks, strategy ceiling, and bracket assignment are all handled by defined rules and thresholds. Same deck list in, same output out.

That determinism matters because it makes calibration possible. If the output changes, it’s because the scoring system changed — not because the model “felt different” today.
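To make "same deck list in, same output out" concrete, here is a minimal sketch of a deterministic pipeline. All stage names, rules, and thresholds below are hypothetical illustrations, not ScryCheck's actual scoring logic; the point is that every stage is a pure function of the deck list.

```python
from dataclasses import dataclass, field

@dataclass
class Analysis:
    deck: tuple              # normalized, order-independent card list
    tags: dict = field(default_factory=dict)
    score: float = 0.0
    bracket: int = 0

def parse(deck_text: str) -> Analysis:
    # Normalize and sort so input ordering can't change the result.
    cards = tuple(sorted(line.strip().lower()
                         for line in deck_text.splitlines() if line.strip()))
    return Analysis(deck=cards)

def vector_score(a: Analysis) -> Analysis:
    # Placeholder rule: the score is a fixed function of deck contents only.
    a.score = round(sum(len(c) for c in a.deck) % 100 / 10, 2)
    return a

def assign_bracket(a: Analysis) -> Analysis:
    # Fixed thresholds, not model output.
    a.bracket = 1 + min(4, int(a.score // 2.5))
    return a

def analyze(deck_text: str) -> Analysis:
    a = parse(deck_text)
    for stage in (vector_score, assign_bracket):  # real pipeline has more stages
        a = stage(a)
    return a
```

Because no stage consults a model, two runs on the same list can only diverge if the code itself changed, which is exactly what makes regression testing against a reference suite meaningful.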

Where we do use AI

AI is still part of ScryCheck. It’s just used in three specific places:

1) Card categorization at scale. We use LLM-generated category tags to help cover Commander’s massive card pool. That coverage supports the rating stack, which also includes hand-tuned overrides, land-type rules, and oracle-text heuristics. Without it, a card like Kenrith’s Transformation gets filed as an enchantment and nothing else. The LLM correctly tags it as single-target removal because of what it actually does in a game.

2) Independent cross-validation. We ran the full reference deck suite through Claude and GPT-4o as separate evaluators to sanity-check whether the engine’s accuracy claims were circular.

3) Optional AI insights (beta). If you opt in after seeing your results, we send your deterministic analysis to an LLM to generate strategy notes and swap suggestions. The score doesn’t change — the AI reads the engine output, not the other way around. It’s a separate layer, and it’s never on by default.
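The rating stack described in point 1 can be sketched as a precedence chain, where hand-tuned overrides beat rule layers, and LLM-generated tags only fill the gaps the rules miss. The layer contents and tags below are illustrative stand-ins, not ScryCheck's real data:

```python
# Layers in priority order: hand-tuned overrides, land-type rules,
# oracle-text heuristics, then bulk LLM-generated tags for coverage.
OVERRIDES = {"kenrith's transformation": {"single-target-removal"}}  # hand-tuned
LAND_RULES = {}                                                      # land-type rules

def oracle_heuristics(name: str) -> set:
    return set()  # stand-in for oracle-text pattern matching

LLM_TAGS = {  # broad, machine-generated coverage of the card pool
    "rhystic study": {"card-draw"},
    "kenrith's transformation": {"single-target-removal", "card-draw"},
}

def categorize(card: str) -> set:
    name = card.lower()
    # First non-empty layer wins; LLM tags never override a rule.
    for layer in (OVERRIDES.get(name), LAND_RULES.get(name),
                  oracle_heuristics(name) or None, LLM_TAGS.get(name)):
        if layer:
            return layer
    return {"uncategorized"}
```

The design choice here is that the LLM contributes breadth, not authority: any card a human or rule has touched is immune to model drift.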

How we measure whether it’s actually accurate

ScryCheck validates scoring against a 252-deck reference suite. Every scoring change is measured against that set before it ships.

  • Bracket exact match: 93%
  • Power level (PL) within ±0.5: 82%
  • Mean PL error: 0.24
  • Average bias: -0.1 (slight under-rating)

One honest caveat: 202 of the 252 reference decks use community-sourced power estimates, not hand-verified labels. The validation numbers are real, but the ground truth itself isn’t perfect — and we don’t pretend it is.

What AI cross-validation found

  • The deterministic engine outperformed both LLMs on bracket accuracy: 93% vs 48% (Claude) and 38% (GPT-4o).
  • LLMs systematically over-rate decks; in particular, they miss how casual genuine Bracket 1-2 decks can be.
  • Cross-validation surfaced 6 miscalibrated reference decks. Correcting those improved measured bracket accuracy from 90% to 93%.

The LLM over-rating pattern makes sense in hindsight: LLMs anchor on individual card power. They see Rhystic Study and assume the deck is strong. They don’t see that it’s in a shell with 30 taplands, no ramp, and a 5-mana commander who needs to attack to do anything. The deterministic engine weighs the whole picture.

AI wasn’t asked to replace the engine. It was used to challenge it. Where the evaluators disagreed, we investigated. Where both LLMs and the engine agreed and the ground truth didn’t, we corrected the benchmark.
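That triage process can be sketched as a simple decision rule. The outcome labels are hypothetical, but the logic mirrors the description above: unanimous evaluators against the label point at the benchmark, while two LLMs against the engine point at the engine.

```python
def triage(engine: int, claude: int, gpt: int, label: int) -> str:
    """Compare one deck's bracket across evaluators and its reference label."""
    if engine == claude == gpt != label:
        # All three evaluators agree with each other but not the ground truth:
        # re-verify the benchmark deck itself.
        return "suspect-benchmark"
    if engine != claude and engine != gpt:
        # Both independent LLMs disagree with the engine: investigate the engine.
        return "investigate-engine"
    return "ok"
```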

What this still can’t solve

Commander power level has no perfect objective truth. A deck near a bracket boundary will always feel off to somebody.

ScryCheck measures the deck as constructed. It can’t model pilot skill, local meta, or table politics. Those are real parts of game outcomes, and no card-list analysis should pretend otherwise.

Why this split is the advantage

The split we trust is simple: a deterministic system for the score, AI for coverage and stress-testing. That gives us reproducibility, a clear path to finding and fixing scoring mistakes, and a more honest way to improve over time.

If you want the technical details, everything here maps to the public docs in How it works and How accuracy is measured.

We’re not anti-AI. We’re anti-black-box scoring.

Want to see it on your own list? Every point is traceable — no black boxes, no vague numbers. Run your deck through and inspect exactly how the score was built.

Analyze your deck →