METHODOLOGY
Short version: EWA measures whether a player's minutes moved his team closer to winning after accounting for who else was on the floor. We test that by hiding future seasons, training on older games, and checking whether EWA beats simpler baselines. The friendlier overview is on /about.
EWA asks a simple question: when this player was on the floor, did his possessions move his team closer to winning after accounting for teammates and opponents?
For each season, we trained the prediction layer only on older games, hid the next season, and checked whether EWA beat simpler baselines.
Across four seasons, EWA picked winners better than a model that only knows team strength. Its probabilities were also better in every fold.
Vegas averaged 67.7% accuracy; EWA averaged 59.2%. That gap is expected because markets use injuries, line movement, and sharp action.
In plain English: EWA picked winners +3.40 percentage points better than team-only on average across four seasons, and the direction was positive in every fold. The stricter probability grades also improved: Brier by 3.58% (CI excludes zero in 4/4 folds), log-loss by 2.66% (4/4), and margin error by 1.59% (3/4). Market odds averaged 67.7% accuracy, so EWA is useful signal, not a Vegas substitute.
This is the audit version of the sentence above. The 2024-25 fold (n_train = 5,822, n_test = 401) is shown as a representative slice. Brier and log-loss are stricter ways to grade probabilities: lower is better. Margin RMSE grades the spread: lower is better. Accuracy is the simple "did it pick the winner?" number: higher is better. Bracketed numbers are 95% bootstrap CIs.
| Model | Brier | Accuracy | Margin RMSE |
|---|---|---|---|
| Naive (50/50) | 0.2500 [0.250, 0.250] | 50% (expected) | 15.75 [14.7, 16.7] |
| Home court only | 0.2456 [0.241, 0.251] | 56.9% [51.9, 61.6] | 15.58 [14.5, 16.5] |
| Team identity (no players) | 0.2451 [0.240, 0.251] | 58.1% [53.4, 62.8] | 15.56 [14.5, 16.6] |
| EWA (roster-aware) | 0.2365 [0.228, 0.244] | 59.4% [54.4, 64.3] | 15.31 [14.2, 16.3] |
| Market (Vegas, de-vigged) | 0.2011 [0.184, 0.218] | 67.3% [62.6, 72.1] | N/A |
Market is included as benchmark/context. The accuracy gap (~8 pp pooled across folds) reflects information EWA does not use — line movement, sharp action, real-time injuries. We don't try to close it on this page.
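For readers who want the grading rules pinned down, here is a minimal sketch of the four metrics in Python, assuming arrays of home-win probabilities, binary outcomes, and point margins. The function names are illustrative, not the harness's actual API.

```python
import numpy as np

def brier(p_home, home_won):
    """Mean squared error of the home-win probability. Lower is better."""
    p, y = np.asarray(p_home), np.asarray(home_won, dtype=float)
    return np.mean((p - y) ** 2)

def log_loss(p_home, home_won, eps=1e-12):
    """Negative mean log-likelihood of the outcomes. Lower is better."""
    p = np.clip(np.asarray(p_home), eps, 1 - eps)
    y = np.asarray(home_won, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(p_home, home_won):
    """Share of games where the favored side won. Higher is better."""
    p, y = np.asarray(p_home), np.asarray(home_won, dtype=bool)
    return np.mean((p > 0.5) == y)

def margin_rmse(pred_margin, actual_margin):
    """Root-mean-square error of the predicted margin. Lower is better."""
    d = np.asarray(pred_margin) - np.asarray(actual_margin)
    return np.sqrt(np.mean(d ** 2))
```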
Across the 4 folds, here's how often EWA's improvement over team-only is statistically distinguishable from zero. Significance is judged by paired-bootstrap 95% CIs computed within each individual fold (1,000 resamples, n ≈ 400-440 games per fold).
Each row is an independent chronological fold: train strictly on games from prior seasons, test on one season's odds-matched games. The pattern holds across all four cutoffs — Brier and log-loss CI-exclude zero in 4/4 folds, margin RMSE in 3/4. Same direction, same approximate magnitude, every time.
| Test season | n_train | n_test | EWA acc | Market acc | Δ Brier | Δ Log-loss | Δ RMSE |
|---|---|---|---|---|---|---|---|
| 2021-22 | 2,136 | 417 | 59.2% | 69.8% | +3.95% ✓ | +2.88% ✓ | +1.82% ✓ |
| 2022-23 | 3,366 | 404 | 60.9% | 64.4% | +3.11% ✓ | +2.34% ✓ | +1.23% ✗ |
| 2023-24 | 4,593 | 438 | 57.1% | 69.2% | +3.73% ✓ | +2.79% ✓ | +1.71% ✓ |
| 2024-25 | 5,822 | 401 | 59.4% | 67.3% | +3.51% ✓ | +2.63% ✓ | +1.61% ✓ |
✓ marks deltas whose 95% CI excludes zero within that fold. The five-model comparison above uses the most recent fold (2024-25); the other three cutoffs show the same shape. The roster-aware improvement is not a single-cutoff artifact.
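The significance machinery is a standard paired bootstrap. A minimal sketch, assuming per-game score arrays (e.g. squared errors for Brier) for two models evaluated on the same held-out games; the names and fixed seed are illustrative.

```python
import numpy as np

def paired_bootstrap_ci(per_game_a, per_game_b, n_boot=1000, alpha=0.05, seed=0):
    """95% CI for mean(A) - mean(B). Resampling the same game indices for
    both models keeps the test paired, so game difficulty cancels out."""
    a, b = np.asarray(per_game_a), np.asarray(per_game_b)
    rng = np.random.default_rng(seed)
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, len(a), size=len(a))  # resample with replacement
        deltas[i] = a[idx].mean() - b[idx].mean()
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return lo, hi  # a CI that excludes zero earns the ✓ above
```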
We use a roster-aware recent-usage aggregate, defaulting to each team's last 30 games. Sensitivity checks across 15 / 30 / 45 / 60 games show the EWA signal is strongest in recent windows and fades as older roster usage is included — consistent with roster drift over time. The default of 30 was set as a disciplined mid-window value, not because it dominates any single metric.
| Window N (games) | EWA Brier | Δ Brier | Δ Log-loss | Δ Margin RMSE |
|---|---|---|---|---|
| 15 | 0.2438 | +2.81% ✓ | +2.11% ✓ | +1.24% ✓ |
| 30 (default) | 0.2449 | +2.39% ✓ | +1.79% ✓ | +0.97% ✓ |
| 45 | 0.2473 | +1.44% ✗ | +1.08% ✗ | +0.62% ✓ |
| 60 | 0.2478 | +1.24% ✗ | +0.93% ✗ | +0.55% ✓ |
✓ marks deltas whose 95% CI excludes zero. The story is robust across recent windows: at N = 15 and N = 30, all three deltas (Brier, log-loss, margin RMSE) are statistically distinguishable from zero. At N = 45 and 60 the aggregate grows stale and only margin RMSE remains significant. We publish at the default window rather than the best-on-test window.
When EWA says a team has a 65% chance to win, do they actually win about 65% of the time? Each dot below is a probability bin from the held-out games — predicted on the x-axis, actual win rate on the y-axis. Perfect calibration is the dashed diagonal. Dot size shows games per bin.
Central bins are the populated ones in this fold (n = 93, 181, 128, 22). Calibration drifts a little at the high end on this 438-game test set — fewer games per bin means more sampling noise. We treat calibration as a property to monitor across runs, not a single number.
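Binning is the only computation behind that plot. A minimal sketch, with illustrative names:

```python
import numpy as np

def calibration_bins(p_home, home_won, n_bins=10):
    """Return (mean predicted, actual win rate, games) per probability bin.
    Empty bins are skipped rather than plotted as noise."""
    p = np.asarray(p_home)
    y = np.asarray(home_won, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if mask.any():
            rows.append((p[mask].mean(), y[mask].mean(), int(mask.sum())))
    return rows
```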
The simplest impact stat is raw plus-minus — point differential while a player is on the court. It looks honest and breaks immediately. In recent seasons, players like Payton Pritchard and Luke Kornet have posted higher raw on-court plus-minus than Stephen Curry, Giannis Antetokounmpo, and Luka Dončić. Not because they generate more impact — because they happen to share the floor with stars on winning teams.
Ridge regression with player-level controls is what fixes this. EWA splits credit in a way that controls for teammates and opponents, so a strong rotation player on a great team doesn't inherit his teammates' impact. That's the attribution layer. Shrinkage then ensures small-sample players don't ride a hot streak to the top of the rankings.
Nikola Jokić's rate over the last three seasons is +8.16 EWA / 100 possessions. Decomposed by role, 84% of that comes from assisting, not scoring or rebounding. His best two-man pairing, with Jamal Murray, is worth +1.4 wins added together: strong, but below what you'd expect from stacking their individual numbers. That's the kind of read no box score or single-number metric gives you.
A sequence model trained on play-by-play estimates win probability after every event. The change in win probability across each possession (WPA) is the unit of credit.
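A minimal sketch of that unit of credit, assuming the sequence model's outputs have already been sampled at possession boundaries; the array name is illustrative.

```python
import numpy as np

def possession_wpa(wp_at_boundaries):
    """wp_at_boundaries[i] is the home-win probability at the start of
    possession i, with a final entry for the end of the game (0.0 or 1.0).
    Each possession's credit is the change in win probability across it,
    so high-leverage possessions naturally earn larger absolute credit."""
    wp = np.asarray(wp_at_boundaries, dtype=float)
    return np.diff(wp)  # one signed WPA value per possession
```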
A regularized regression splits each possession’s WPA across the ten players on court while controlling for teammates, opponents, and home court. This is the regularized adjusted plus-minus tradition (Sill 2010), with role-aware interactions added on top.
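In matrix form this is the familiar RAPM setup. A minimal scikit-learn sketch that omits the role-aware interaction columns; the names are illustrative, and the alpha is one value from the sweep range mentioned in the reproducibility notes below.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_attribution(on_court, wpa, alpha=5000.0):
    """on_court is (possessions x players): +1 if the player is on the floor
    for the home team, -1 for the away team, 0 if off. wpa is the signed
    per-possession win-probability change toward the home team. The
    unpenalized intercept absorbs home-court advantage."""
    model = Ridge(alpha=alpha, fit_intercept=True)
    model.fit(np.asarray(on_court), np.asarray(wpa))
    return model.coef_, model.intercept_  # player effects, home-court term
```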
Players with few possessions get pulled toward the population mean by both a count-based shrinkage (count / (count + k)) and an Empirical Bayes step. This is what keeps a 100-possession rookie from showing up next to Jokić on the leaderboard.
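A minimal sketch of the count-based half of that shrinkage; the constant k is illustrative, not the production value, and the Empirical Bayes step is omitted.

```python
import numpy as np

def shrink_to_mean(raw, counts, k=2000.0):
    """weight = count / (count + k): with k = 2000, a 100-possession rookie
    keeps ~5% of his raw estimate while a 5,000-possession starter keeps
    ~71%. Everyone else is pulled toward the population mean."""
    raw = np.asarray(raw, dtype=float)
    counts = np.asarray(counts, dtype=float)
    w = counts / (counts + k)
    return w * raw + (1.0 - w) * raw.mean()
```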
For game prediction, per-team EWA aggregates use each team's most recent 30 train games — not a static average across the whole training period. This keeps the predictor honest about mid-season trades and roster turnover.
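A minimal sketch of that rolling aggregate with pandas; the column names are illustrative, not the pipeline's actual schema.

```python
import pandas as pd

def recent_team_ewa(train_games: pd.DataFrame, team: str, n_recent: int = 30) -> float:
    """Average a team's EWA aggregate over its most recent n_recent
    *training* games only, so trades and rotation changes show up quickly
    instead of being diluted by a season-long average."""
    g = train_games.loc[train_games["team"] == team].sort_values("date")
    return g.tail(n_recent)["team_ewa"].mean()
```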
“Estimated Wins Added” was used in the early 2000s by John Hollinger as a linear function of PER: (PER − 11) × Minutes / 67. That formulation is no longer maintained and was box-score derived, with no possession context, no role decomposition, and no shrinkage.
The version of EWA on Alleygorithm shares the goal — per-game wins added above the league baseline — but the methodology is fundamentally different. Ridge regression on possession-level win-probability change (WPA), with role-aware decomposition and empirical-Bayes shrinkage. Same destination, modern math. Treat the acronym the way the field treats “WAR”: a category, multiple flavors, judged on methodology and out-of-sample performance.
EWA isn't a new technique. It's an honest reassembly of established methods with a transparent validation harness on top.
- Regularized adjusted plus-minus via ridge regression. The base technique behind EWA's attribution layer.
- Possession-level win-probability swings as a credit signal. EWA inherits this framing rather than the raw point-differential one.
- Statistical / Box Plus-Minus. Where role and box-stat information enter as priors. EWA's role-aware interactions are in this tradition.
- EPM and DARKO, the two strongest public predictive metrics. EWA borrows their commitment to chronological holdout testing and roster-aware aggregation.
Reading these openly is the price of asking you to trust the rest. Every limitation below is on the roadmap and labeled in our internal validation reports.
The validation code is open and runnable. The numbers above came from scripts/validate_pregame_prediction.py with --recent-games-per-team 30 on a chronological holdout. The window-sensitivity sweep ran via scripts/sweep_recent_games_window.sh. The attribution math lives in unified_scores.py.
- Sweep ridge alpha (2,500 / 5,000 / 7,500 / 10,000) and bootstrap seeds across the 4 rolling-origin folds, demonstrating the result is not a single-hyperparameter or single-seed artifact.
- Replace per-team possession averages with per-player rolling minute estimates. Closes part of the gap to EPM/DARKO's richer minute models.
- Counterfactual calculator: "if Player X is out, the lineup loses N wins added." The most direct expression of EWA's player-level attribution and the natural foundation for a paid analytics tier.
- Daily-refreshed pregame projections that incorporate the day's active rosters and inactives. Today's harness uses recent training data; the live layer uses recent live data.
- Retrain the win-probability model with a strict cutoff before each test window so the WPA labels themselves are leakage-free. The current harness uses the production WP model and discloses that limitation; this closes it.
Plus/minus measures point differential while you're on court. EWA measures how much each possession changed win probability — weighting high-leverage moments more — and then splits credit fairly via ridge regression. Plus/minus conflates your impact with your teammates'.
EWA captures context. A star on a dominant team faces fewer high-leverage possessions because the game state is already stable. The public scores also apply shrinkage, so lower-volume players get pulled toward the middle.
Score artifacts refresh on a daily cadence; the underlying win-probability model is retrained on a slower review cycle. The footer shows the most recent promoted run currently being served.
The market appears as a fifth model in our validation table: we have multi-season de-vigged moneylines for 1,954 NBA games matched cleanly to game IDs. Across all 4 rolling-origin folds, market accuracy averages ~67.7%; roster-aware EWA averages ~59.2%. The ~8.5 pp gap is real and reflects information markets have that we don't (sharp action, line movement, real-time injuries). We report it as a benchmark, not a target.
Those are the four most recent NBA seasons where we have both play-by-play data and de-vigged pregame moneylines, and where each fold has a strictly older training set available. The pattern (Brier and log-loss CIs excluding zero, margin RMSE excluding zero in 3/4) holds across every fold tested.
Yes — that's what /predictions is. Every game we predict, you can see what the model said and (after the game) whether it called the winner. Across the four published rolling-origin folds, EWA accuracy averages 59.2%; the de-vigged Vegas market averages 67.7%. EWA beats team-only baselines but doesn't approach the market — Vegas has information we don't (sharp action, line movement, real-time injuries). The page tracks the model's live record so you can see exactly how it's doing.