Method Review 11 — Scoring-config backtest (2026-05)
Harness
scripts/backtest-scoring-configs.mjs. For each ScoringConfig (weights,
recency, k, filters):
- Re-compute recoScore for every BacktestResult row joined with Declaration + Company (n = 20 656; BUY = 15 171, SELL = 5 485).
- Apply filter (signal floor, mcap band, board-exclude, etc.).
- Group eligible signals by ISO week; pick top-10 by score per week; hold 90d at equal weight.
- Report: trades, winRate, mean / median r90, σ, annualized Sharpe (yearly buckets, rf = 3 %), cross-sectional Sharpe, max drawdown of the rolling NAV, positive-year count.
- Composite = 0.4·SR_ann_norm + 0.3·winRate + 0.2·(1 − DD/50) + 0.1·posY/totY.
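The grouping and selection step in the list above can be sketched as follows. This is a minimal illustration, not the harness code: the row shape (`{ id, pubDate, score }`) and the names `isoWeekKey` and `pickTopPerWeek` are hypothetical, and dates are assumed to be UTC.

```javascript
// Hypothetical row shape: { id: string, pubDate: "YYYY-MM-DD", score: number }.

/** ISO-8601 week key such as "2024-W01" for a Date, evaluated in UTC. */
function isoWeekKey(date) {
  // Work on a UTC copy so local time zones cannot shift the calendar day.
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  // An ISO week belongs to the year that contains its Thursday.
  const dayOfWeek = d.getUTCDay() || 7; // Monday = 1 ... Sunday = 7
  d.setUTCDate(d.getUTCDate() + 4 - dayOfWeek);
  const yearStart = new Date(Date.UTC(d.getUTCFullYear(), 0, 1));
  const week = Math.ceil(((d - yearStart) / 86400000 + 1) / 7);
  return `${d.getUTCFullYear()}-W${String(week).padStart(2, "0")}`;
}

/** Group eligible rows by ISO week and keep the topN highest scores per week. */
function pickTopPerWeek(rows, topN = 10) {
  const byWeek = new Map();
  for (const row of rows) {
    const key = isoWeekKey(new Date(row.pubDate));
    if (!byWeek.has(key)) byWeek.set(key, []);
    byWeek.get(key).push(row);
  }
  const picks = new Map();
  for (const [key, group] of byWeek) {
    // Copy before sorting so the caller's array order is left untouched.
    const ranked = [...group].sort((a, b) => b.score - a.score);
    picks.set(key, ranked.slice(0, topN));
  }
  return picks;
}
```

Keying on the ISO week-year rather than the calendar year keeps signals from late December and early January in the same bucket when they share a week.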
Side-specific stats use the inverted predicate for SELL (winning sell = r90 < 0). Bucket means are shrunk toward the side overall with adaptive-k (n<20 → 30, n<200 → 15, else 5), with a James-Stein variance kicker. The full-sample bucket stats are reused across the sim (not point-in-time) — common backtest simplification, noted as a leakage caveat.
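A minimal sketch of the adaptive-k shrinkage, assuming the usual pseudo-count form in which k acts as the number of "virtual" observations at the side-level mean. The thresholds come from the text above; the function names and the exact blending formula are assumptions, and the James-Stein variance kicker is deliberately left out because its definition is not given here.

```javascript
/** Shrinkage strength by bucket size, using the thresholds stated in the text. */
function adaptiveK(n) {
  if (n < 20) return 30;
  if (n < 200) return 15;
  return 5;
}

/**
 * Shrink a bucket mean toward the overall mean of its side (BUY or SELL).
 * Assumed form: weighted average with n real and k virtual observations.
 */
function shrinkBucketMean(bucketMean, n, sideMean) {
  const k = adaptiveK(n);
  return (n * bucketMean + k * sideMean) / (n + k);
}
```

Under this form a bucket with n = 10 and a raw mean of 10 % against a side mean of 2 % is pulled to about 4 %, while a bucket with n = 500 keeps almost all of its raw mean.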
Configs tested (17)
| Name | Weights (signal/wr/ret/rec/conv) | Filter / param |
|---|---|---|
| A1-current | 35/25/20/20/0 | score≥30, ex board |
| A2-return-heavy | 25/20/30/15/10 | base |
| A3-signal-heavy | 50/20/15/15/0 | base |
| A4-winrate-heavy | 25/35/15/15/10 | base |
| A5-recency-cold | 35/25/20/0/20 | base |
| A6-recency-fast | 35/25/15/25/0 | half-life 10 |
| A7-recency-slow | 35/25/15/25/0 | half-life 45 |
| A8-equal | 20/20/20/20/20 | base |
| A9-minimalist | 60/40/0/0/0 | base |
| A10-pure-quant | 0/40/40/20/0 | base |
| A11-fixedK-20 | 35/25/20/20/0 | k=20 fixed |
| A12-strict-score-60 | 35/25/20/20/0 | score≥60 |
| A13-broad-score-20 | 35/25/20/20/0 | score≥20 |
| A14-include-board | 35/25/20/20/0 | board kept |
| A15-sells-shorted | 35/25/20/20/0 | sells short |
| A16-signal-only | 100/0/0/0/0 | base |
| A17-conv-heavy | 30/20/15/15/20 | base |
Results
| Rank | Config | Composite | SR_ann | winRate | meanR90 | maxDD | posY |
|---|---|---|---|---|---|---|---|
| 1 | A12-strict-score-60 | 0.762 | 0.90 | 85.1 % | +26.48 % | 6.5 % | 4/5 |
| 2 | A15-sells-shorted | 0.593 | 1.33 | 60.5 % | +7.72 % | 61.0 % | 6/6 |
| 3 | A14-include-board | 0.483 | 0.68 | 53.2 % | +3.92 % | 82.6 % | 6/6 |
| 4 | A1-current | 0.454 | 0.53 | 50.1 % | +2.92 % | 91.6 % | 6/6 |
| 5 | A11-fixedK-20 | 0.454 | 0.52 | 50.2 % | +2.93 % | 91.7 % | 6/6 |
| 6 | A9-minimalist | 0.447 | 0.47 | 50.4 % | +3.07 % | 91.6 % | 6/6 |
| 7 | A6-recency-fast | 0.440 | 0.43 | 50.1 % | +2.85 % | 91.7 % | 6/6 |
| 8 | A7-recency-slow | 0.440 | 0.43 | 50.1 % | +2.85 % | 91.7 % | 6/6 |
| 9 | A3-signal-heavy | 0.438 | 0.43 | 49.3 % | +2.86 % | 91.4 % | 6/6 |
| 10 | A17-conv-heavy | 0.434 | 0.38 | 49.9 % | +2.83 % | 91.6 % | 6/6 |
| 11 | A2-return-heavy | 0.433 | 0.38 | 49.6 % | +2.71 % | 92.2 % | 6/6 |
| 12 | A8-equal | 0.429 | 0.35 | 49.6 % | +2.60 % | 92.4 % | 6/6 |
| 13 | A5-recency-cold | 0.427 | 0.33 | 49.8 % | +2.66 % | 91.5 % | 6/6 |
| 14 | A13-broad-score-20 | 0.426 | 0.32 | 50.0 % | +2.41 % | 91.8 % | 6/6 |
| 15 | A4-winrate-heavy | 0.424 | 0.32 | 49.5 % | +2.46 % | 93.0 % | 6/6 |
| 16 | A10-pure-quant | 0.376 | −0.05 | 50.1 % | +2.58 % | 92.4 % | 6/6 |
| 17 | A16-signal-only | 0.369 | −0.06 | 47.8 % | +2.52 % | 92.3 % | 6/6 |
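The Composite column follows the weighted formula given in the Harness section. The sketch below is an illustration under two assumptions that this review does not spell out: SR_ann_norm is passed in already normalized (its mapping from SR_ann is not defined here), and the drawdown term is floored at 0, since without a floor it would turn negative for the ~92 % drawdown rows. The function and field names are hypothetical.

```javascript
/**
 * Composite = 0.4·SR_ann_norm + 0.3·winRate + 0.2·(1 − DD/50) + 0.1·posY/totY.
 * winRate is a fraction (0..1); maxDDPct is in percent.
 */
function compositeScore({ srAnnNorm, winRate, maxDDPct, posYears, totYears }) {
  const clamp01 = (x) => Math.min(1, Math.max(0, x));
  // Assumption: the drawdown term is clamped to [0, 1]; the source only gives 1 − DD/50.
  const ddTerm = clamp01(1 - maxDDPct / 50);
  return 0.4 * srAnnNorm + 0.3 * winRate + 0.2 * ddTerm + 0.1 * (posYears / totYears);
}
```

With the floor in place, every config whose drawdown exceeds 50 % gets the same zero contribution from that term, so among those rows the composite is driven by SR_ann_norm, winRate and the positive-year share.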
Reading
The ranking compares apples to oranges. A12, A14, A15 change the candidate universe or trade direction (filter/scope); A1–A11, A13, A16–A17 are treated as weight-only, although A6/A7 also change the half-life, A11 fixes k, and A13 loosens the score floor from 30 to 20.
Within the weight-only group (A1–A11, A13, A16, A17), A1 is at the top and the deltas to its nearest competitors are tiny:
- A1 vs A11 (best weight-only competitor): ΔSR = +0.01, ΔwinRate = −0.1pt.
- A1 vs A9: ΔSR = +0.06, ΔwinRate = −0.3pt.
- A1 vs A2/A3/A4/A8: all worse than A1 by 0.1–0.2 SR.
- A10/A16 (no-signalScore and signalScore-only, respectively): both push SR_ann slightly negative (−0.05 and −0.06).
This says the current weighting (35/25/20/20/0) sits on a flat plateau — small perturbations do nothing. No weight rebalance beats A1 by the honesty threshold (ΔSR > 0.3 OR ΔwinRate > 5pt).
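The honesty threshold can be written as a small predicate. This is a sketch with hypothetical names (`clearsHonestyBar`, `srAnn`, `winRatePct`); the numbers in the usage lines are copied from the Results table.

```javascript
/**
 * True when the candidate beats the baseline by ΔSR > 0.3 OR ΔwinRate > 5pt.
 * winRatePct is expressed in percentage points.
 */
function clearsHonestyBar(baseline, candidate, { minDeltaSR = 0.3, minDeltaWinRatePt = 5 } = {}) {
  const deltaSR = candidate.srAnn - baseline.srAnn;
  const deltaWinRatePt = candidate.winRatePct - baseline.winRatePct;
  return deltaSR > minDeltaSR || deltaWinRatePt > minDeltaWinRatePt;
}

const A1 = { srAnn: 0.53, winRatePct: 50.1 };
const A11 = { srAnn: 0.52, winRatePct: 50.2 };
const A12 = { srAnn: 0.90, winRatePct: 85.1 };

console.log(clearsHonestyBar(A1, A11)); // false
console.log(clearsHonestyBar(A1, A12)); // true, but A12 is a filter change, not a re-weighting
```

The predicate only compares headline numbers; whether a config belongs to the weight-only group still has to be decided separately.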
A12, A14, A15 are filter/scope changes, not scoring changes. A12's
massive lift (85 % winRate, +26 % mean, 6.5 % DD) is the score floor moving
from 30 → 60 — the Sigma WINNING_STRATEGY already pushes that with
minScore: 40 + cluster + mid-cap + board-exclude + freshness, which our
harness does NOT model. So A12's edge is largely already captured by the
production filter stack. The 6.5 % DD on n=67 is also a small-sample
artifact (T=5 yearly buckets, one negative year).
A14 (include board) outperforms A1 by 0.15 SR / +3pt winRate. Surprising
but inside the noise floor, and crucially is the OPPOSITE of the grid-search
finding that drove the current excludeBoardRole: true. Likely
universe-specific: when we include the BUY-only 15 k universe (vs the
filtered ~200 strict-strategy subset), board roles aren't as toxic as on the
strict slice. Not worth flipping the production default on this evidence.
A15 (short SELLs) Sharpe 1.33 is interesting but completely orthogonal to BUY scoring weights — it's a new strategy, not a re-weighting of the existing one. Outside the scope of this review.
Robustness checks
- All weight-only configs hit 6/6 positive yearly buckets — universe is too large for any reasonable weighting to lose money on average.
- maxDD ~92 % on weight-only configs is a NAV-compounding artifact of the toy "every pick contributes 1/10 of a slice" model — not a real portfolio DD. Trust SR_ann and winRate instead.
- T = 6 yearly buckets is too thin for a deflated Sharpe: applying Bailey-López de Prado on 17 trials × T=6 would most likely leave every config's DSR (a probability, so bounded in [0, 1]) well short of any conventional confidence level. Don't oversell any of these as "proven".
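For reference, the maxDD figure in the checks above is a peak-to-trough measure on a compounded NAV series. The sketch below shows that generic computation only; it does not reproduce the harness's slice model, and the function names are hypothetical.

```javascript
/** Compound a sequence of period returns into a NAV path starting at `start`. */
function navFromReturns(returns, start = 1) {
  const nav = [start];
  for (const r of returns) nav.push(nav[nav.length - 1] * (1 + r));
  return nav;
}

/** Largest peak-to-trough decline of a NAV path, as a fraction of the peak. */
function maxDrawdown(nav) {
  let peak = nav[0];
  let worst = 0;
  for (const v of nav) {
    if (v > peak) peak = v;
    const dd = (peak - v) / peak;
    if (dd > worst) worst = dd;
  }
  return worst;
}

console.log(maxDrawdown([100, 120, 90, 110, 60])); // 0.5 (peak 120, trough 60)
```

Because losses compound multiplicatively along the path, a long run of overlapping positions can produce a drawdown far larger than any single trade's loss.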
Decision
Engine NOT updated. The current A1 weighting (35/25/20/20/0, adaptive k, half-life 21d) is statistically indistinguishable from every alternative in the weight-space. Per the harness's honesty rule (ΔSR > 0.3 OR Δwr > 5pt required), no weight-only config clears the bar; A12 and A15 exceed it numerically, but only through filter / scope changes.
The headline lifts (A12 / A15 / A14) come from filter / scope changes that either:
- duplicate the existing WINNING_STRATEGY filter (A12), or
- contradict prior grid-search evidence on a different universe (A14), or
- define a NEW strategy entirely (A15).
None warrants changing the scoring weights or RECENCY_HALFLIFE_DAYS /
adaptive-k logic in src/lib/recommendation-engine.ts.
Caveats
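- Full-sample bucket means (no point-in-time recomputation per signal) → mild look-ahead bias on bucket priors. Every config sees the same leaked priors, so the relative ranking is probably preserved, though configs that put more weight on prior-derived components (presumably wr/ret) lean on them more.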
- Recency axis is nearly degenerate in historical replay (every candidate is "fresh" at its own pubDate). Recency-related variants (A6/A7) move recency points uniformly; their tiny deltas are expected.
- Conviction proxy is crude (cluster bit only); the configs with non-zero conviction weight (A2/A4/A5/A8/A17) may understate what that weight would be worth with a richer proxy.
- T = 6 yearly buckets; CI half-widths on SR_ann are large (≈ ±1.2); ranking among weight-only configs is within noise.
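To make the last caveat concrete, here is a sketch of a Sharpe ratio computed from yearly bucket returns with rf = 3 %, together with the standard large-sample standard error for i.i.d. returns, sqrt((1 + SR²/2) / T). Both functions are assumptions about the general shape of the calculation: the harness's actual bucketing and interval method are not described in this review, so the interval quoted above need not match this approximation, which is only a rough guide at T = 6.

```javascript
/** Sharpe ratio of yearly returns over a flat risk-free rate (sample std, n − 1). */
function annualSharpe(yearlyReturns, rf = 0.03) {
  const n = yearlyReturns.length;
  if (n < 2) return NaN; // a standard deviation needs at least two buckets
  const excess = yearlyReturns.map((r) => r - rf);
  const mean = excess.reduce((s, x) => s + x, 0) / n;
  const variance = excess.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1);
  return mean / Math.sqrt(variance);
}

/** Large-sample standard error of a Sharpe estimate from T i.i.d. periods. */
function sharpeStdErr(sr, T) {
  return Math.sqrt((1 + 0.5 * sr * sr) / T);
}
```

Even before any small-sample correction, the standard error at T = 6 is of the same order as the largest SR_ann gaps between weight-only configs, which is why the ordering within that group should not be read as a ranking.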
Files
- Harness: scripts/backtest-scoring-configs.mjs
- Raw output: /tmp/backtest-configs-raw.json