Method Review 11 — Scoring-config backtest (2026-05)

Harness

scripts/backtest-scoring-configs.mjs. For each ScoringConfig (weights, recency, k, filters):

Re-compute recoScore for every BacktestResult row joined with Declaration + Company (n = 20 656; BUY = 15 171, SELL = 5 485).
Apply filter (signal floor, mcap band, board-exclude, etc.).
Group eligible signals by ISO week; pick top-10 by score per week; hold 90d at equal weight.
Report: trades, winRate, mean / median r90, σ, annualized Sharpe (yearly buckets, rf = 3 %), cross-sectional Sharpe, max drawdown of the rolling NAV, positive-year count.
Composite = 0.4·SR_ann_norm + 0.3·winRate + 0.2·(1 − DD/50) + 0.1·posY/totY.
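The weekly selection and the Composite formula above can be sketched as follows. This is a minimal illustration, not the harness's code: the function names and the `pubDate` / `recoScore` field names are hypothetical. The source does not define SR_ann_norm, so the normalisation `(SR_ann + 1) / 3` and the clamping of both the SR and drawdown terms to [0, 1] are assumptions inferred from the Results table, where they reproduce the Composite column to within about 0.001.

```javascript
// Hypothetical sketch: names are illustrative, not taken from the harness.

// ISO-8601 week key (e.g. "2025-W01") for a Date, computed in UTC.
function isoWeekKey(date) {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const day = d.getUTCDay() || 7;          // Monday = 1 ... Sunday = 7
  d.setUTCDate(d.getUTCDate() + 4 - day);  // shift to the Thursday of the same ISO week
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / 86400000 + 1) / 7);
  return `${d.getUTCFullYear()}-W${String(week).padStart(2, "0")}`;
}

// Group eligible signals by ISO week and keep the top `n` by recoScore per week.
function pickWeeklyTop(signals, n = 10) {
  const byWeek = new Map();
  for (const s of signals) {
    const key = isoWeekKey(s.pubDate);
    if (!byWeek.has(key)) byWeek.set(key, []);
    byWeek.get(key).push(s);
  }
  const picks = new Map();
  for (const [key, group] of byWeek) {
    // Copy before sorting so the caller's arrays are not mutated.
    picks.set(key, [...group].sort((a, b) => b.recoScore - a.recoScore).slice(0, n));
  }
  return picks;
}

const clamp01 = (x) => Math.min(1, Math.max(0, x));

// Composite = 0.4*SR_ann_norm + 0.3*winRate + 0.2*(1 - DD/50) + 0.1*posY/totY.
// ASSUMPTION: SR_ann_norm = clamp01((SR_ann + 1) / 3) and the DD term is clamped
// to [0, 1]; both are inferred from the Results table, not stated in the source.
function composite({ srAnn, winRate, maxDDPct, posYears, totYears }) {
  const srNorm = clamp01((srAnn + 1) / 3);
  const ddTerm = clamp01(1 - maxDDPct / 50);
  return 0.4 * srNorm + 0.3 * winRate + 0.2 * ddTerm + 0.1 * (posYears / totYears);
}
```

With the A1 row's displayed inputs (SR_ann 0.53, winRate 0.501, maxDD 91.6 %, 6/6 positive years), `composite` returns about 0.454. Note that under this reading any maxDD at or above 50 % contributes zero, so the drawdown term only differentiates A12 from the rest.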

Side-specific stats use the inverted predicate for SELL (winning sell = r90 < 0). Bucket means are shrunk toward the overall mean for the same side with adaptive-k (n<20 → 30, n<200 → 15, else 5), with a James-Stein variance kicker. The full-sample bucket stats are reused across the sim (not point-in-time); this is a common backtest simplification, noted as a leakage caveat.
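A minimal sketch of the side-aware win predicate and the adaptive-k shrinkage described above. The function names are hypothetical, and the shrinkage form (a pseudo-count weighted average of bucket mean and side mean) is an assumption: the source gives the k schedule but not the exact formula, and the James-Stein variance adjustment is not modelled here.

```javascript
// Hypothetical sketch: names and the exact shrinkage form are assumptions.

// A SELL wins when r90 < 0 (inverted predicate, per the text).
// ASSUMPTION: a BUY wins when r90 > 0, so r90 === 0 counts as a loss on both sides.
function isWin(side, r90) {
  return side === "SELL" ? r90 < 0 : r90 > 0;
}

// Adaptive shrinkage strength from the text: n<20 -> 30, n<200 -> 15, else 5.
function adaptiveK(n) {
  if (n < 20) return 30;
  if (n < 200) return 15;
  return 5;
}

// Shrink a bucket mean toward the side-wide mean with k pseudo-observations.
// Small buckets (large k relative to n) are pulled strongly toward the prior.
function shrinkTowardSide(bucketMean, n, sideMean, k = adaptiveK(n)) {
  return (n * bucketMean + k * sideMean) / (n + k);
}
```

For example, under this form a bucket with n = 10 and mean 0.12, on a side whose overall mean is 0.03, gets k = 30 and shrinks to (10·0.12 + 30·0.03) / 40 = 0.0525, while the same bucket mean at n = 500 (k = 5) stays close to 0.12.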

Configs tested (17)

Name	Weights (signal/wr/ret/rec/conv)	Filter / param
A1-current	35/25/20/20/0	score≥30, ex board
A2-return-heavy	25/20/30/15/10	base
A3-signal-heavy	50/20/15/15/0	base
A4-winrate-heavy	25/35/15/15/10	base
A5-recency-cold	35/25/20/0/20	base
A6-recency-fast	35/25/15/25/0	half-life 10
A7-recency-slow	35/25/15/25/0	half-life 45
A8-equal	20/20/20/20/20	base
A9-minimalist	60/40/0/0/0	base
A10-pure-quant	0/40/40/20/0	base
A11-fixedK-20	35/25/20/20/0	k=20 fixed
A12-strict-score-60	35/25/20/20/0	score≥60
A13-broad-score-20	35/25/20/20/0	score≥20
A14-include-board	35/25/20/20/0	board kept
A15-sells-shorted	35/25/20/20/0	sells short
A16-signal-only	100/0/0/0/0	base
A17-conv-heavy	30/20/15/15/20	base

Results

Rank	Config	Composite	SR_ann	winRate	meanR90	maxDD	posY
1	A12-strict-score-60	0.762	0.90	85.1 %	+26.48 %	6.5 %	4/5
2	A15-sells-shorted	0.593	1.33	60.5 %	+7.72 %	61.0 %	6/6
3	A14-include-board	0.483	0.68	53.2 %	+3.92 %	82.6 %	6/6
4	A1-current	0.454	0.53	50.1 %	+2.92 %	91.6 %	6/6
5	A11-fixedK-20	0.454	0.52	50.2 %	+2.93 %	91.7 %	6/6
6	A9-minimalist	0.447	0.47	50.4 %	+3.07 %	91.6 %	6/6
7	A6-recency-fast	0.440	0.43	50.1 %	+2.85 %	91.7 %	6/6
8	A7-recency-slow	0.440	0.43	50.1 %	+2.85 %	91.7 %	6/6
9	A3-signal-heavy	0.438	0.43	49.3 %	+2.86 %	91.4 %	6/6
10	A17-conv-heavy	0.434	0.38	49.9 %	+2.83 %	91.6 %	6/6
11	A2-return-heavy	0.433	0.38	49.6 %	+2.71 %	92.2 %	6/6
12	A8-equal	0.429	0.35	49.6 %	+2.60 %	92.4 %	6/6
13	A5-recency-cold	0.427	0.33	49.8 %	+2.66 %	91.5 %	6/6
14	A13-broad-score-20	0.426	0.32	50.0 %	+2.41 %	91.8 %	6/6
15	A4-winrate-heavy	0.424	0.32	49.5 %	+2.46 %	93.0 %	6/6
16	A10-pure-quant	0.376	−0.05	50.1 %	+2.58 %	92.4 %	6/6
17	A16-signal-only	0.369	−0.06	47.8 %	+2.52 %	92.3 %	6/6

Reading

The ranking compares apples to oranges. A12, A14, A15 change the candidate universe (filter); A1–A11, A13, A16–A17 change only the ranking weights on a fixed universe.

Within the pure-weights group (A1–A11, A13, A16, A17), A1 is at the top and the deltas to every alternative are tiny:

A1 vs A11 (best weight-only competitor): ΔSR = +0.01, ΔwinRate = −0.1pt.
A1 vs A9: ΔSR = +0.06, ΔwinRate = −0.3pt.
A1 vs A2/A3/A4/A8: all worse than A1 by 0.10–0.21 SR.
A10/A16 (no-signalScore and signalScore-only, respectively): both kill Sharpe (SR_ann −0.05 and −0.06).

This says the current weighting (35/25/20/20/0) sits on a flat plateau — small perturbations do nothing. No weight rebalance beats A1 by the honesty threshold (ΔSR > 0.3 OR ΔwinRate > 5pt).
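The honesty threshold can be written as a small predicate and applied to the weight-only rows of the Results table. This is a sketch under assumptions: the function and field names are hypothetical, and the deltas are taken as candidate minus A1 using the rounded values shown in the table.

```javascript
// Hypothetical sketch: names are illustrative; values are copied from the Results table.

// True when `candidate` beats `baseline` by dSR > 0.3 OR dWinRate > 5 percentage points.
function clearsHonestyBar(baseline, candidate, { minDeltaSR = 0.3, minDeltaWinPt = 5 } = {}) {
  const dSR = candidate.srAnn - baseline.srAnn;
  const dWin = candidate.winRatePct - baseline.winRatePct;
  return dSR > minDeltaSR || dWin > minDeltaWinPt;
}

const a1 = { name: "A1-current", srAnn: 0.53, winRatePct: 50.1 };

// Weight-only group as defined in the text (A2-A11, A13, A16, A17).
const weightOnly = [
  { name: "A11-fixedK-20", srAnn: 0.52, winRatePct: 50.2 },
  { name: "A9-minimalist", srAnn: 0.47, winRatePct: 50.4 },
  { name: "A6-recency-fast", srAnn: 0.43, winRatePct: 50.1 },
  { name: "A7-recency-slow", srAnn: 0.43, winRatePct: 50.1 },
  { name: "A3-signal-heavy", srAnn: 0.43, winRatePct: 49.3 },
  { name: "A17-conv-heavy", srAnn: 0.38, winRatePct: 49.9 },
  { name: "A2-return-heavy", srAnn: 0.38, winRatePct: 49.6 },
  { name: "A8-equal", srAnn: 0.35, winRatePct: 49.6 },
  { name: "A5-recency-cold", srAnn: 0.33, winRatePct: 49.8 },
  { name: "A13-broad-score-20", srAnn: 0.32, winRatePct: 50.0 },
  { name: "A4-winrate-heavy", srAnn: 0.32, winRatePct: 49.5 },
  { name: "A10-pure-quant", srAnn: -0.05, winRatePct: 50.1 },
  { name: "A16-signal-only", srAnn: -0.06, winRatePct: 47.8 },
];

// Names of weight-only configs that clear the bar against A1 (none do).
const winners = weightOnly.filter((c) => clearsHonestyBar(a1, c)).map((c) => c.name);
console.log(winners.length === 0 ? "no weight-only config clears the bar" : winners);
```

The same predicate returns true for the filter/scope configs A12 (ΔSR = +0.37) and A15 (ΔSR = +0.80), which is why those are treated separately rather than as re-weightings.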

A12, A14, A15 are filter/scope changes, not scoring changes. A12's massive lift (85 % winRate, +26 % mean, 6.5 % DD) comes from raising the score floor from 30 to 60. The Sigma WINNING_STRATEGY already pushes in that direction with minScore: 40 + cluster + mid-cap + board-exclude + freshness, which our harness does NOT model, so A12's edge is largely already captured by the production filter stack. The 6.5 % DD on n=67 is also a small-sample artifact (T=5 yearly buckets, one negative year).

A14 (include board) outperforms A1 by 0.15 SR / +3pt winRate. This is surprising but inside the noise floor and, crucially, it is the OPPOSITE of the grid-search finding that drove the current excludeBoardRole: true. The gap is likely universe-specific: on the BUY-only 15 k universe (vs the filtered ~200 strict-strategy subset), board roles aren't as toxic as on the strict slice. Not worth flipping the production default on this evidence.

A15 (short SELLs) Sharpe 1.33 is interesting but completely orthogonal to BUY scoring weights — it's a new strategy, not a re-weighting of the existing one. Outside the scope of this review.

Robustness checks

All weight-only configs hit 6/6 positive yearly buckets — universe is too large for any reasonable weighting to lose money on average.
maxDD ~92 % on weight-only configs is a NAV-compounding artifact of the toy "every pick contributes 1/10 of a slice" model — not a real portfolio DD. Trust SR_ann and winRate instead.
T = 6 yearly buckets is too thin for a deflated Sharpe: applying the Bailey-López de Prado correction for 17 trials at T = 6 would leave no config with a statistically significant Sharpe. Don't oversell any of these as "proven".
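For reference, the two statistics discussed in these checks can be computed roughly as below. This is a generic sketch, not the harness's implementation: the function names are hypothetical, the Sharpe uses the sample standard deviation (n − 1) across yearly buckets, and the drawdown assumes a strictly positive NAV series. The toy "1/10 of a slice" NAV model itself is not reproduced here.

```javascript
// Hypothetical sketch: generic definitions, not the harness's exact code.

// Annualized Sharpe from yearly-bucket returns (fractions, e.g. 0.08 = 8 %), rf = 3 %.
// ASSUMPTION: sample standard deviation (divide by n - 1).
function sharpeFromYearly(yearlyReturns, rf = 0.03) {
  const n = yearlyReturns.length;
  if (n < 2) return NaN; // one bucket has no dispersion to measure
  const mean = yearlyReturns.reduce((a, b) => a + b, 0) / n;
  const variance = yearlyReturns.reduce((a, r) => a + (r - mean) ** 2, 0) / (n - 1);
  return variance > 0 ? (mean - rf) / Math.sqrt(variance) : NaN;
}

// Maximum drawdown of a NAV series, as a fraction of the running peak.
// ASSUMPTION: every NAV value is positive.
function maxDrawdown(nav) {
  let peak = -Infinity;
  let worst = 0;
  for (const v of nav) {
    if (v > peak) peak = v;
    const dd = (peak - v) / peak;
    if (dd > worst) worst = dd;
  }
  return worst;
}
```

With only six yearly returns feeding `sharpeFromYearly`, one unusual year moves both the mean and the standard deviation noticeably, which is the practical meaning of "too thin" above.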

Decision

Engine NOT updated. The current A1 weighting (35/25/20/20/0, adaptive k, half-life 21d) is statistically indistinguishable from every alternative in the weight-space. Per the harness's honesty rule (ΔSR > 0.3 OR Δwr > 5pt required), no weight-only config clears the bar.

The headline lifts (A12 / A15 / A14) come from filter / scope changes that either:

duplicate the existing WINNING_STRATEGY filter (A12), or
contradict prior grid-search evidence on a different universe (A14), or
define a NEW strategy entirely (A15).

None warrants changing the scoring weights or RECENCY_HALFLIFE_DAYS / adaptive-k logic in src/lib/recommendation-engine.ts.

Caveats

Full-sample bucket means (no point-in-time recomputation per signal) → mild forward-look bias on bucket priors. Affects all configs equally so relative ranking is preserved.
Recency axis is nearly degenerate in historical replay (every candidate is "fresh" at its own pubDate). Recency-related variants (A6/A7) move recency points uniformly; their tiny deltas are expected.
Conviction proxy is crude (cluster bit only); the configs that put weight on conviction (A2/A4/A5/A8/A17) may understate its value relative to what a richer proxy would show.
T = 6 yearly buckets; CI half-widths on SR_ann are large (≈ ±1.2); ranking among weight-only configs is within noise.

Files

Harness: scripts/backtest-scoring-configs.mjs
Raw output: /tmp/backtest-configs-raw.json