112 · Scoring improvement research: regime overlay + factor re-weighting (2026-05-26)
Branch: research/scoring-v15-candidate
Status: research only · candidate wired behind a default-OFF flag · NOT merged · NO prod rescore.
Summary
Goal: lift risk-adjusted return of the shipped insider-signal scorer
(computeV13Score, formula version V14e) over a rigorously validated OOS
window, NET of the existing transaction-cost (0.6% round-trip) and winsor
(±50%) assumptions, and ship-gated per AGENTS.md.
Outcome: no candidate produces a defensible improvement on the current OOS
window. Every parameter re-weighting tested lowered Sharpe and Deflated
Sharpe versus the V14e baseline. The one mechanically-sound overlay (broad
market regime position-sizing) is inert on the 2025-01..2026-05 OOS window
because the EU broad-market trailing-90d trend never crossed the trigger band,
and the cohort's drawdowns turn out to be idiosyncratic rather than
broad-market driven. The regime overlay is wired behind a feature flag
(SCORING_V15_REGIME.enabled, default false) for forward A/B testing once a
genuine EU bear leg appears; production behaviour is unchanged.
Honest recommendation: needs more data (specifically: a bear regime in the OOS window). No improvement found. Keep V14e live. Do not flip the flag.
Method
- Offline bake: `scripts/_v13_bakeoff/bake-v15a-regime-alpha.ts` → `scripts/_v13_bakeoff/stats-v15a.json`.
- Universe: EU_strict (XPAR, XAMS, XWBO, XBRU, XHEL, XOSL, XSTO, XETR), BUY-only (production scores BUY signals). Same universe as audit 103.
- Cohort: full-history priced BacktestResult rows (`returnFromPub90d` not null), n=30,606 EU_strict BUY rows (train 20,552 / OOS 10,054).
- Picker: top-10 / month, T+90 hold, NET 0.6% round-trip cost, winsor ±50% per pick. Identical harness to the V13/V14 bakes (bake-v14_full.ts, bake-eu-only.ts).
- Walk-forward: OOS = pubDate >= 2025-01-01 (T=14 monthly buckets, ~16.5 months).
- Deflated Sharpe: Bailey-Lopez de Prado, with N_trials counted honestly across the whole V13/V14/V15 search program (N=29 including this run's variants).
- Regime series: iShares STOXX Europe 600 proxy `EXSA.DE` from `SectorIndexHistory`, PIT trailing-90d return evaluated at each trade's pub month start (look-ahead free; uses only closes strictly before month start).
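The picker arithmetic in the list above can be sketched as follows. This is a minimal illustration, not the bake script itself: the `pubMonth` and `score` field names, the equal weighting of picks, and the order of operations (winsorise first, then subtract cost) are assumptions; only `returnFromPub90d`, the top-10 cut, the 0.6% cost and the ±50% winsor come from this audit.

```typescript
// Hypothetical row shape; only returnFromPub90d is a name used in the audit.
interface PricedRow {
  pubMonth: string;          // e.g. "2025-01" (assumed bucketing key)
  score: number;             // scorer output used for ranking (assumed field)
  returnFromPub90d: number;  // gross T+90 return as a fraction, 0.12 = +12%
}

const ROUND_TRIP_COST = 0.006; // 0.6% round trip
const WINSOR_LIMIT = 0.5;      // clip each pick to +/-50%
const TOP_N = 10;              // picks per month

// Assumption: winsorise the gross return first, then subtract cost.
function netPickReturn(gross: number): number {
  const clipped = Math.max(-WINSOR_LIMIT, Math.min(WINSOR_LIMIT, gross));
  return clipped - ROUND_TRIP_COST;
}

// Equal-weight mean net return of the top-N scored rows in each month.
function monthlyBookReturns(rows: PricedRow[]): Map<string, number> {
  const byMonth = new Map<string, PricedRow[]>();
  for (const r of rows) {
    const bucket = byMonth.get(r.pubMonth) ?? [];
    bucket.push(r);
    byMonth.set(r.pubMonth, bucket);
  }
  const out = new Map<string, number>();
  byMonth.forEach((bucket, month) => {
    const picks = [...bucket].sort((a, b) => b.score - a.score).slice(0, TOP_N);
    const total = picks.reduce((s, p) => s + netPickReturn(p.returnFromPub90d), 0);
    out.set(month, total / picks.length);
  });
  return out;
}
```

With T=14 monthly buckets and ten picks each, a harness of this shape yields the 140 picks reported per config in the results table.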
Baseline reproduction
V14e_baseline in the bake reproduces computeV13Score exactly on EU_strict:
v13_1g base (senior-role bonus, cluster, pctMcap × 1.4, small-cap bonus,
log-amount, related-kind mult 1.20/1.05, wide-cluster ×1.30 at ≥5 participants,
recent-alpha autocorrelation × 0.025) + earnings-proximity additive + sector
momentum multiplier + V14e entity (+0.10) / family (+0.20) kind multipliers.
Hypotheses tested
| ID | Hypothesis | Mechanism |
|---|---|---|
| H2 | Broad-market regime position-sizing | scale month book exposure by EU STOXX600 trailing-90d band (1.10 / 1.00 / 0.95), cash for remainder. Does not change picks. |
| H3 | Insider track-record up-weight | recent-alpha autocorrelation weight 0.025 -> 0.040 (re-ranks within month) |
| H4 | Track-record up-weight (stress) | weight 0.060 |
| H5 | Cluster definition tweak | wide-cluster boost threshold ≥5 -> ≥4 participants |
| H6 | Post-hoc combo | H3 (0.040) + H2 regime sizing |
OOS results (EU_strict, 2025-01-01 to 2026-05, T=14, BUY only)
| Config | T | Picks | Sharpe | CI 95 | CAGR% | MaxDD% | Win% | DSR |
|---|---|---|---|---|---|---|---|---|
| V14e_baseline | 14 | 140 | 1.51 | [-0.28, 4.76] | 55.7 | -14.1 | 59.3 | 0.40 |
| H2_regime_size | 14 | 140 | 1.51 | [-0.28, 4.76] | 55.7 | -14.1 | 59.3 | 0.40 |
| H3_alpha_up_0.040 | 14 | 140 | 1.25 | [-0.50, 4.51] | 51.9 | -20.4 | 58.6 | 0.14 |
| H4_alpha_up_0.060 | 14 | 140 | 1.32 | [-0.44, 3.96] | 58.3 | -18.8 | 56.4 | 0.21 |
| H5_wide_clu4 | 14 | 140 | 1.44 | [-0.36, 5.09] | 53.1 | -17.9 | 58.6 | 0.32 |
| H6_combo_a040_regime | 14 | 140 | 1.25 | [-0.50, 4.51] | 51.9 | -20.4 | 58.6 | 0.14 |
N_trials for DSR deflation = 29.
Note: the baseline Sharpe of 1.51 here is higher than the audit-103 published 1.36 because the OOS window has grown by ~5 months since 2026-05-21 and this run is BUY-only on EU_strict (audit 103's headline mixed the picker math slightly differently). The relative comparison between baseline and challengers is the load-bearing result, not the absolute level.
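For readers unfamiliar with the DSR column, the sketch below shows how a Deflated Sharpe Ratio in the Bailey-Lopez de Prado form is typically computed. It is not the code in the bake script. The inputs it needs beyond T and N_trials (the variance of Sharpe across trials, and the skew and kurtosis of the return series) are not reported in this audit, so the values in the usage example are purely hypothetical and will not reproduce the table.

```typescript
// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation.
function normCdf(x: number): number {
  const z = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * z);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-z * z);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Inverse normal CDF by bisection on normCdf (slow but simple).
function normInv(p: number): number {
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 200; i++) {
    const mid = (lo + hi) / 2;
    if (normCdf(mid) < p) lo = mid; else hi = mid;
  }
  return (lo + hi) / 2;
}

interface DsrInputs {
  sharpe: number;         // per-period (non-annualised) Sharpe of the candidate
  periods: number;        // T, number of return observations
  skew: number;           // skewness of the return series
  kurtosis: number;       // non-excess kurtosis (normal = 3)
  trialSharpeVar: number; // variance of per-period Sharpe across all trials
  nTrials: number;        // N_trials, must be >= 2
}

// Expected maximum Sharpe under the null of no skill, given N trials.
function expectedMaxSharpe(trialSharpeVar: number, nTrials: number): number {
  const EULER_GAMMA = 0.5772156649015329;
  return Math.sqrt(trialSharpeVar) * (
    (1 - EULER_GAMMA) * normInv(1 - 1 / nTrials) +
    EULER_GAMMA * normInv(1 - 1 / (nTrials * Math.E))
  );
}

// Probability that the true Sharpe exceeds the expected-max benchmark.
function deflatedSharpe(i: DsrInputs): number {
  const sr0 = expectedMaxSharpe(i.trialSharpeVar, i.nTrials);
  const denom = Math.sqrt(
    1 - i.skew * i.sharpe + ((i.kurtosis - 1) / 4) * i.sharpe * i.sharpe
  );
  return normCdf(((i.sharpe - sr0) * Math.sqrt(i.periods - 1)) / denom);
}

// Hypothetical inputs only: T and nTrials match the audit, the rest are made up.
const exampleDsr = deflatedSharpe({
  sharpe: 0.44, periods: 14, skew: 0, kurtosis: 3, trialSharpeVar: 0.02, nTrials: 29,
});
console.log(exampleDsr.toFixed(2));
```

The point of the sketch is the direction of the adjustment: holding everything else fixed, a larger N_trials raises the benchmark Sharpe and lowers the DSR, which is why the trial count is taken across the whole V13/V14/V15 program rather than this run alone.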
Why the regime overlay is inert (the honest finding)
The PIT EU broad-market (STOXX600 proxy) trailing-90d trend at each OOS month start stayed in [-2.1%, +10.2%] across all 14 OOS months. It never reached the -5% "accumulate" band and only twice came near the +15% "fade" band without entering it (the peak was +10.2%). So the regime sizing factor was 1.00 in every month -> H2 is numerically identical to the baseline.
More importantly, the EU_strict BUY cohort's worst months (2025-01 mean -4.8%, 2025-08/09 mean ~-4%) do NOT line up with the broad-market trend (2025-01 broad 90d was only -2.1%; 2025-08/09 was ~-0.5%). The cohort drawdowns are idiosyncratic to the insider universe, not broad-market beta. A broad-market regime tilt therefore could not time them even with a more sensitive band. This is a genuine negative result, not a tuning failure.
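To make the look-ahead-free construction concrete, here is a minimal sketch of a PIT trailing-90d return and the neutral-band check. The `Close` shape, the helper names, the use of 90 calendar days counted back from the month start, and the exclusive band edges are all assumptions for illustration; the audit only states that closes strictly before month start are used and that the triggers sit at -5% and +15%.

```typescript
// Hypothetical daily close record; dates are ISO YYYY-MM-DD, sorted ascending.
interface Close {
  date: string;
  close: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Shift an ISO date back by a number of calendar days (UTC arithmetic).
function isoMinusDays(iso: string, days: number): string {
  return new Date(Date.parse(iso + "T00:00:00Z") - days * DAY_MS)
    .toISOString()
    .slice(0, 10);
}

// Last close dated on or before isoDate, relying on ascending order.
function lastCloseOnOrBefore(series: Close[], isoDate: string): Close | undefined {
  let found: Close | undefined;
  for (const c of series) {
    if (c.date <= isoDate) found = c; else break;
  }
  return found;
}

// Trailing return known at monthStart, using only closes strictly before it.
function pitTrailing90d(series: Close[], monthStart: string): number | null {
  const asOf = lastCloseOnOrBefore(series, isoMinusDays(monthStart, 1));
  const base = lastCloseOnOrBefore(series, isoMinusDays(monthStart, 90));
  if (!asOf || !base || base.close <= 0) return null;
  return asOf.close / base.close - 1;
}

// True when the trend triggers neither the accumulate nor the fade band.
function insideNeutralBand(r: number): boolean {
  return r > -0.05 && r < 0.15;
}
```

Both extremes reported above, -2.1% and +10.2%, fall inside the neutral band under this check, which is why the sizing factor stays at 1.00.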
Ship-gate evaluation (AGENTS.md)
| Criterion | Threshold | Best challenger | Verdict |
|---|---|---|---|
| OOS window | >= 14 months | 16.5 months (T=14) | meets |
| Sharpe >= live | >= 1.51 | H2 = 1.51 (tie, inert); all re-rankers < 1.51 | no real gain |
| DSR drop <= 0.30 pts | <= 0.30 | H2 drop 0.00 (inert); H3/H4/H6 drop 0.19-0.26 | H2 passes trivially |
| CI95 lower bound | >= -2.0 | -0.28 (H2) | meets |
| MaxDD tradeoff | lower is bonus | H2 -14.1 (tie); all others worse | no |
H2 technically clears the literal gate, but only because it is a no-op. It is not an improvement and must not be presented as one. All hypotheses that actually re-rank the picks (H3, H4, H5, H6) DEGRADE Sharpe and DSR. That is consistent with V14e's coefficients already being near-optimal on this data and with harder pushes on the levers fitting noise, although with T=14 the gaps sit well inside the Sharpe confidence intervals.
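The gate logic can be replayed mechanically against the OOS table. The sketch below is an illustration rather than the project's gate implementation: the `OosStats` shape and the `inert` test (summary stats identical to live) are assumptions, while the thresholds and figures are copied from the tables in this audit.

```typescript
interface OosStats {
  id: string;
  sharpe: number;
  ci95Lower: number;
  maxDdPct: number;
  dsr: number;
}

const LIVE: OosStats = { id: "V14e_baseline", sharpe: 1.51, ci95Lower: -0.28, maxDdPct: -14.1, dsr: 0.40 };

const CHALLENGERS: OosStats[] = [
  { id: "H2_regime_size", sharpe: 1.51, ci95Lower: -0.28, maxDdPct: -14.1, dsr: 0.40 },
  { id: "H3_alpha_up_0.040", sharpe: 1.25, ci95Lower: -0.50, maxDdPct: -20.4, dsr: 0.14 },
  { id: "H4_alpha_up_0.060", sharpe: 1.32, ci95Lower: -0.44, maxDdPct: -18.8, dsr: 0.21 },
  { id: "H5_wide_clu4", sharpe: 1.44, ci95Lower: -0.36, maxDdPct: -17.9, dsr: 0.32 },
  { id: "H6_combo_a040_regime", sharpe: 1.25, ci95Lower: -0.50, maxDdPct: -20.4, dsr: 0.14 },
];

interface GateVerdict {
  literalPass: boolean; // clears every numeric threshold
  inert: boolean;       // assumption: identical summary stats means a no-op
  realGain: boolean;    // passes AND actually differs from live
}

function evaluateGate(c: OosStats, live: OosStats, oosMonths: number): GateVerdict {
  const literalPass =
    oosMonths >= 14 &&
    c.sharpe >= live.sharpe &&
    live.dsr - c.dsr <= 0.30 &&
    c.ci95Lower >= -2.0;
  const inert =
    c.sharpe === live.sharpe && c.dsr === live.dsr && c.maxDdPct === live.maxDdPct;
  return { literalPass, inert, realGain: literalPass && !inert };
}

const winners = CHALLENGERS.filter((c) => evaluateGate(c, LIVE, 16.5).realGain);
console.log(winners.map((c) => c.id)); // prints []
```

Separating `literalPass` from `realGain` keeps a tie produced by a no-op from being reported as a pass.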
Caveats / overfitting risks
- Single OOS window with no bear regime: the central limitation. The regime overlay cannot be validated until the window contains an EU drawdown.
- Multiple-testing: N_trials=29 across the whole V13/V14/V15 program; DSR is reported with that count, and the challengers' DSR is already at or below baseline. We did not search a large grid here (6 configs) precisely to avoid manufacturing a spurious winner.
- The 1.51 baseline Sharpe is itself thin: its CI95 [-0.28, 4.76] includes zero. Any "improvement" inside that band is not statistically distinguishable from luck.
- EU_strict is a deliberate honest-tape restriction (audit 101/103), not a parameter searched here.
Implementation outline (flag, default OFF)
Wired in src/lib/signals.ts, all behaviour-preserving when the flag is off:
- `V13Inputs.regimeBroadReturn90d?: number | null`: optional PIT broad-market trailing-90d return input.
- `SCORING_V15_REGIME`: exported config object, `enabled: false` by default.
- `regimeSizingMultiplier(broadReturn90d)`: returns exactly `1.0` when the flag is off (or input null), else the banded 1.10 / 1.00 / 0.95 tilt.
- `computeV13Score()` ends with `s = s * regimeSizingMultiplier(i.regimeBroadReturn90d)`, a strict no-op while `enabled === false`, so every existing call site and the production `scoreDeclarationsV13` pass stay bit-identical. STRATEGY_PROOF is untouched.
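A minimal sketch of what such a flag-gated multiplier might look like is given below; it is not a copy of `src/lib/signals.ts`. The config fields other than `enabled`, the second `cfg` parameter (added here only so the flag can be toggled in isolation), and the mapping of 1.10 to the below -5% band and 0.95 to the above +15% band are assumptions inferred from the band names used in this audit.

```typescript
// Only `enabled` is a field named in the audit; the rest are hypothetical.
interface RegimeConfig {
  enabled: boolean;
  accumulateBelow: number; // trend below this -> accumulate band
  fadeAbove: number;       // trend above this -> fade band
  accumulateMult: number;
  neutralMult: number;
  fadeMult: number;
}

const SCORING_V15_REGIME: RegimeConfig = {
  enabled: false, // default OFF: production behaviour unchanged
  accumulateBelow: -0.05,
  fadeAbove: 0.15,
  accumulateMult: 1.10,
  neutralMult: 1.00,
  fadeMult: 0.95,
};

// Returns exactly 1.0 when the flag is off or the input is missing.
function regimeSizingMultiplier(
  broadReturn90d: number | null | undefined,
  cfg: RegimeConfig = SCORING_V15_REGIME, // assumption: injectable for testing
): number {
  if (!cfg.enabled || broadReturn90d == null || !isFinite(broadReturn90d)) {
    return 1.0;
  }
  if (broadReturn90d < cfg.accumulateBelow) return cfg.accumulateMult;
  if (broadReturn90d > cfg.fadeAbove) return cfg.fadeMult;
  return cfg.neutralMult;
}

// Flag off: multiplying a score by 1.0 leaves it bit-identical.
const score = 0.7312;
console.log(score * regimeSizingMultiplier(-0.08) === score); // prints true
```

Returning the literal `1.0` on the disabled path, rather than a configured neutral value, is what makes the no-op guarantee independent of the band settings.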
Next steps
- Do NOT flip `SCORING_V15_REGIME.enabled`. Re-evaluate only after the OOS window contains a real EU bear leg (broad 90d < -5% for >= 2 rebalance months), then A/B the flag against live.
- If a future improvement is sought, prioritise orthogonal data axes over re-weighting existing factors (which is now shown to be saturated): e.g. options-flow confirmation, 13F institutional co-movement, or a genuinely independent insider-quality prior mined on IS-only data (the V15 pattern-mining attempt, audit 97, was look-ahead contaminated).
- Keep V14e as the live production scorer.
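The re-evaluation trigger in the first bullet can be expressed as a small check. This is a sketch under assumptions: the function name is hypothetical, and the qualifying months are counted without requiring them to be consecutive, since the audit does not say either way.

```typescript
// monthlyBroad90d: PIT trailing-90d broad-market return at each OOS month start.
// Assumption: qualifying months need not be consecutive.
function oosWindowHasBearLeg(
  monthlyBroad90d: number[],
  threshold = -0.05,
  minMonths = 2,
): boolean {
  return monthlyBroad90d.filter((r) => r < threshold).length >= minMonths;
}

// The current OOS window bottoms out at -2.1%, so the trigger is not met.
console.log(oosWindowHasBearLeg([-0.021, 0.102, -0.005])); // prints false
```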
Files
- Bake: `scripts/_v13_bakeoff/bake-v15a-regime-alpha.ts`
- Results: `scripts/_v13_bakeoff/stats-v15a.json`
- Flag wiring: `src/lib/signals.ts` (`SCORING_V15_REGIME`, `regimeSizingMultiplier`, `V13Inputs.regimeBroadReturn90d`)
- This audit: `docs/method-review/112-scoring-improvement-2026-05-26.md`