112 · Scoring improvement research: regime overlay + factor re-weighting (2026-05-26)

Branch: research/scoring-v15-candidate
Status: research only · candidate wired behind a default-OFF flag · NOT merged · NO prod rescore.

Summary

Goal: lift risk-adjusted return of the shipped insider-signal scorer (computeV13Score, formula version V14e) over a rigorously validated OOS window, NET of the existing transaction-cost (0.6% round-trip) and winsor (±50%) assumptions, and ship-gated per AGENTS.md.

Outcome: no candidate produces a defensible improvement on the current OOS window. Every parameter re-weighting tested lowered Sharpe and Deflated Sharpe versus the V14e baseline. The one mechanically-sound overlay (broad market regime position-sizing) is inert on the 2025-01..2026-05 OOS window because the EU broad-market trailing-90d trend never crossed the trigger band, and the cohort's drawdowns turn out to be idiosyncratic rather than broad-market driven. The regime overlay is wired behind a feature flag (SCORING_V15_REGIME.enabled, default false) for forward A/B testing once a genuine EU bear leg appears; production behaviour is unchanged.

Honest recommendation: needs more data (specifically: a bear regime in the OOS window). No improvement found. Keep V14e live. Do not flip the flag.

Method

Offline bake: scripts/_v13_bakeoff/bake-v15a-regime-alpha.ts → scripts/_v13_bakeoff/stats-v15a.json.
Universe: EU_strict (XPAR, XAMS, XWBO, XBRU, XHEL, XOSL, XSTO, XETR), BUY-only (production scores BUY signals). Same universe as audit 103.
Cohort: full-history priced BacktestResult rows (returnFromPub90d not null), n=30,606 EU_strict BUY rows (train 20,552 / OOS 10,054).
Picker: top-10 / month, T+90 hold, NET 0.6% round-trip cost, winsor ±50% per pick. Identical harness to the V13/V14 bakes (bake-v14_full.ts, bake-eu-only.ts).
Walk-forward: OOS = pubDate >= 2025-01-01 (T=14 monthly buckets, ~16.5 months).
Deflated Sharpe: Bailey and López de Prado, with N_trials counted honestly across the whole V13/V14/V15 search program (N=29 including this run's variants).
Regime series: iShares STOXX Europe 600 proxy EXSA.DE from SectorIndexHistory, PIT trailing-90d return evaluated at each trade's pub month start (look-ahead free; uses only closes strictly before month start).
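
The picker step above can be sketched as follows. This is a minimal illustration, not the bake's actual code: the ScoredRow shape, the pubMonth bucketing key, equal weighting across the ten picks, and the winsorize-then-subtract-cost ordering are all assumptions. Only returnFromPub90d, the 0.6% cost, the ±50% winsor and top-10 come from this audit.

```typescript
// Hypothetical row shape; only returnFromPub90d is a name taken from this audit.
interface ScoredRow {
  pubMonth: string;         // assumed bucketing key, e.g. "2025-01"
  score: number;            // scorer output used for ranking within the month
  returnFromPub90d: number; // gross T+90 return as a fraction (0.12 = +12%)
}

const ROUND_TRIP_COST = 0.006; // 0.6% round-trip
const WINSOR_LIMIT = 0.5;      // winsor at ±50% per pick
const TOP_N = 10;              // top-10 per month

// Clip the gross return to the winsor band, then subtract the cost.
// (The order of the two operations is an assumption.)
function netPickReturn(gross: number): number {
  const clipped = Math.max(-WINSOR_LIMIT, Math.min(WINSOR_LIMIT, gross));
  return clipped - ROUND_TRIP_COST;
}

// Group rows by month, keep the TOP_N highest scores, and take the
// equal-weighted mean of their net returns (equal weighting is assumed).
function monthlyBookReturns(rows: ScoredRow[]): Map<string, number> {
  const byMonth = new Map<string, ScoredRow[]>();
  for (const r of rows) {
    const bucket = byMonth.get(r.pubMonth) ?? [];
    bucket.push(r);
    byMonth.set(r.pubMonth, bucket);
  }
  const out = new Map<string, number>();
  byMonth.forEach((bucket, month) => {
    const picks = [...bucket].sort((a, b) => b.score - a.score).slice(0, TOP_N);
    const total = picks.reduce((s, p) => s + netPickReturn(p.returnFromPub90d), 0);
    out.set(month, total / picks.length);
  });
  return out;
}
```

Because winsorizing happens per pick, a single +300% outlier contributes at most +49.4% net to its month.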

Baseline reproduction

V14e_baseline in the bake reproduces computeV13Score exactly on EU_strict: v13_1g base (senior-role bonus, cluster, pctMcap × 1.4, small-cap bonus, log-amount, related-kind mult 1.20/1.05, wide-cluster ×1.30 at ≥5 participants, recent-alpha autocorrelation × 0.025) + earnings-proximity additive + sector momentum multiplier + V14e entity (+0.10) / family (+0.20) kind multipliers.

Hypotheses tested

ID	Hypothesis	Mechanism
H2	Broad-market regime position-sizing	scale month book exposure by EU STOXX600 trailing-90d band (1.10 / 1.00 / 0.95), cash for remainder. Does not change picks.
H3	Insider track-record up-weight	recent-alpha autocorrelation weight 0.025 -> 0.040 (re-ranks within month)
H4	Track-record up-weight (stress)	weight 0.060
H5	Cluster definition tweak	wide-cluster boost threshold ≥5 -> ≥4 participants
H6	Post-hoc combo	H3 (0.040) + H2 regime sizing

OOS results (EU_strict, 2025-01-01 to 2026-05, T=14, BUY only)

Config	T	Picks	Sharpe	CI 95	CAGR%	MaxDD%	Win%	DSR
V14e_baseline	14	140	1.51	[-0.28, 4.76]	55.7	-14.1	59.3	0.40
H2_regime_size	14	140	1.51	[-0.28, 4.76]	55.7	-14.1	59.3	0.40
H3_alpha_up_0.040	14	140	1.25	[-0.50, 4.51]	51.9	-20.4	58.6	0.14
H4_alpha_up_0.060	14	140	1.32	[-0.44, 3.96]	58.3	-18.8	56.4	0.21
H5_wide_clu4	14	140	1.44	[-0.36, 5.09]	53.1	-17.9	58.6	0.32
H6_combo_a040_regime	14	140	1.25	[-0.50, 4.51]	51.9	-20.4	58.6	0.14

N_trials for DSR deflation = 29.
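
For reference, the Deflated Sharpe Ratio as usually stated by Bailey and López de Prado is given below. Whether the bake implements exactly this form is an assumption; the expression is included only to show where N_trials (N = 29) and T = 14 enter.

```latex
\mathrm{SR}_0 = \sqrt{\mathrm{Var}\!\left[\widehat{\mathrm{SR}}_n\right]}
  \left( (1-\gamma)\,\Phi^{-1}\!\left(1-\tfrac{1}{N}\right)
       + \gamma\,\Phi^{-1}\!\left(1-\tfrac{1}{N e}\right) \right)

\mathrm{DSR} = \Phi\!\left(
  \frac{\left(\widehat{\mathrm{SR}} - \mathrm{SR}_0\right)\sqrt{T-1}}
       {\sqrt{1 - \hat{\gamma}_3\,\widehat{\mathrm{SR}}
              + \frac{\hat{\gamma}_4 - 1}{4}\,\widehat{\mathrm{SR}}^{2}}}
\right)
```

Here \Phi is the standard normal CDF, \gamma \approx 0.5772 is the Euler-Mascheroni constant, \hat{\gamma}_3 and \hat{\gamma}_4 are the skewness and kurtosis of the per-period returns, \widehat{\mathrm{SR}} is the non-annualised Sharpe estimate, and the variance is taken across the Sharpe estimates of the N trials. A larger N raises the hurdle \mathrm{SR}_0 and therefore lowers DSR for the same observed Sharpe.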

Note: the baseline Sharpe of 1.51 here is higher than the audit-103 published 1.36 because the OOS window has grown by ~5 months since 2026-05-21 and this run is BUY-only on EU_strict (audit 103's headline mixed the picker math slightly differently). The relative comparison between baseline and challengers is the load-bearing result, not the absolute level.

Why the regime overlay is inert (the honest finding)

The PIT EU broad-market (STOXX600 proxy) trailing-90d trend at each OOS month start stayed in [-2.1%, +10.2%] across all 14 OOS months. It never reached the -5% "accumulate" band, and its +10.2% peak stayed below the +15% "fade" band. So the regime sizing factor was 1.00 for every month, which makes H2 numerically identical to the baseline.
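
The PIT trailing-90d trend referred to above could be computed along these lines. This is a sketch under assumptions: the DailyClose shape is hypothetical (the real SectorIndexHistory schema is not shown in this audit), and "90d" is read as 90 calendar days with the base taken as the latest close on or before the cutoff.

```typescript
// Hypothetical shape for one proxy close; dates are ISO "yyyy-mm-dd",
// and the array is assumed to be sorted ascending by date.
interface DailyClose {
  date: string;
  close: number;
}

const DAY_MS = 86_400_000;

// Trailing-90d return as of monthStart, using only closes strictly before
// monthStart so the value is look-ahead free. Returns null when there is
// not enough history.
function trailing90dReturn(closes: DailyClose[], monthStart: string): number | null {
  const prior = closes.filter((c) => c.date < monthStart);
  if (prior.length === 0) return null;
  const last = prior[prior.length - 1];

  // Cutoff date 90 calendar days before monthStart (calendar days are an assumption).
  const cutoff = new Date(Date.parse(monthStart + "T00:00:00Z") - 90 * DAY_MS)
    .toISOString()
    .slice(0, 10);

  // Base = latest close on or before the cutoff.
  let base: DailyClose | null = null;
  for (const c of prior) {
    if (c.date <= cutoff) base = c;
    else break;
  }
  if (base === null) return null;
  return last.close / base.close - 1;
}
```

ISO date strings compare correctly as plain strings, which is why the filter uses `<` directly rather than parsing every date.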

More importantly, the EU_strict BUY cohort's worst months (2025-01 mean -4.8%, 2025-08/09 mean ~-4%) do NOT line up with the broad-market trend (2025-01 broad 90d was only -2.1%; 2025-08/09 was ~-0.5%). The cohort drawdowns are idiosyncratic to the insider universe, not broad-market beta. A broad-market regime tilt therefore could not time them even with a more sensitive band. This is a genuine negative result, not a tuning failure.

Ship-gate evaluation (AGENTS.md)

Criterion	Threshold	Best challenger	Verdict
OOS window	>= 14 months	16.5 months (T=14)	meets
Sharpe >= live	>= 1.51	H2 = 1.51 (tie, inert); all re-rankers < 1.51	no real gain
DSR drop <= 0.30 pts	<= 0.30	H2 drop 0.00 (inert); H3/H4/H6 drop 0.19-0.26	H2 passes trivially
CI95 lower bound	>= -2.0	-0.28 (H2)	meets
MaxDD tradeoff	lower is bonus	H2 -14.1 (tie); all others worse	no

H2 technically clears the literal gate, but only because it is a no-op. It is not an improvement and must not be presented as one. All hypotheses that actually re-rank the picks (H3, H4, H5, H6) DEGRADE Sharpe and DSR. With CIs this wide the degradation is suggestive rather than conclusive, but it is consistent with V14e's coefficients already being near-optimal on this data and with harder pushes on these levers fitting noise.
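
The numeric gate criteria in the table above can be expressed as a small check. This is an illustrative sketch: the OosStats and GateResult shapes and the function name are hypothetical, and the thresholds are copied from the table rather than from AGENTS.md itself. The MaxDD row is a bonus, not a hard threshold, so it is left out.

```typescript
// Hypothetical summary of one config's OOS statistics.
interface OosStats {
  sharpe: number;
  dsr: number;
  ci95Lower: number;
  oosMonths: number;
}

interface GateResult {
  passes: boolean;
  failures: string[];
}

// Applies the four hard thresholds from the ship-gate table.
function evaluateShipGate(live: OosStats, challenger: OosStats): GateResult {
  const failures: string[] = [];
  if (challenger.oosMonths < 14) failures.push("OOS window < 14 months");
  if (challenger.sharpe < live.sharpe) failures.push("Sharpe below live");
  if (live.dsr - challenger.dsr > 0.3) failures.push("DSR drop > 0.30");
  if (challenger.ci95Lower < -2.0) failures.push("CI95 lower bound < -2.0");
  return { passes: failures.length === 0, failures };
}

// Values taken from the OOS results table.
const v14eBaseline: OosStats = { sharpe: 1.51, dsr: 0.4, ci95Lower: -0.28, oosMonths: 16.5 };
const h2RegimeSize: OosStats = { sharpe: 1.51, dsr: 0.4, ci95Lower: -0.28, oosMonths: 16.5 };
const h3AlphaUp: OosStats = { sharpe: 1.25, dsr: 0.14, ci95Lower: -0.5, oosMonths: 16.5 };

const h2Gate = evaluateShipGate(v14eBaseline, h2RegimeSize); // passes, but only as a tie
const h3Gate = evaluateShipGate(v14eBaseline, h3AlphaUp);    // fails on Sharpe
```

A mechanical check like this cannot tell a tie produced by an inert overlay from a real gain, which is why H2's pass has to be read alongside the identical picks and statistics.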

Caveats / overfitting risks

Single OOS window with no bear regime: the central limitation. The regime overlay cannot be validated until the window contains an EU drawdown.
Multiple-testing: N_trials=29 across the whole V13/V14/V15 program; DSR is reported with that count, and the challengers' DSR is already at or below baseline. We did not search a large grid here (6 configs) precisely to avoid manufacturing a spurious winner.
The 1.51 baseline Sharpe is itself thin: its CI95 [-0.28, 4.76] includes zero. Any "improvement" inside that band is not statistically distinguishable from luck.
EU_strict is a deliberate honest-tape restriction (audit 101/103), not a parameter searched here.

Implementation outline (flag, default OFF)

Wired in src/lib/signals.ts, all behaviour-preserving when the flag is off:

V13Inputs.regimeBroadReturn90d?: number | null — optional PIT broad-market trailing-90d return input.
SCORING_V15_REGIME — exported config object, enabled: false by default.
regimeSizingMultiplier(broadReturn90d) — returns exactly 1.0 when the flag is off (or input null), else the banded 1.10 / 1.00 / 0.95 tilt.
computeV13Score() ends with s = s * regimeSizingMultiplier(i.regimeBroadReturn90d), a strict no-op while enabled === false, so every existing call site and the production scoreDeclarationsV13 pass stay bit-identical. STRATEGY_PROOF is untouched.
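
A minimal sketch of how the flag-gated multiplier might look. The real code in src/lib/signals.ts is not reproduced here, so several details are assumptions: the config field names other than enabled are hypothetical, the input is taken to be a fraction (-0.05 = -5%), the -5% "accumulate" band is mapped to 1.10 and the +15% "fade" band to 0.95, and the band edges are treated as inclusive.

```typescript
// Config object; only `enabled` (default false) is confirmed by this audit.
// All other field names are hypothetical.
const SCORING_V15_REGIME = {
  enabled: false,
  accumulateAtOrBelow: -0.05, // -5% trailing-90d broad return
  fadeAtOrAbove: 0.15,        // +15% trailing-90d broad return
  accumulateMult: 1.1,
  neutralMult: 1.0,
  fadeMult: 0.95,
};

// Returns exactly 1.0 when the flag is off or the input is missing,
// otherwise the banded tilt.
function regimeSizingMultiplier(broadReturn90d: number | null | undefined): number {
  if (!SCORING_V15_REGIME.enabled || broadReturn90d == null) return 1.0;
  if (broadReturn90d <= SCORING_V15_REGIME.accumulateAtOrBelow) {
    return SCORING_V15_REGIME.accumulateMult;
  }
  if (broadReturn90d >= SCORING_V15_REGIME.fadeAtOrAbove) {
    return SCORING_V15_REGIME.fadeMult;
  }
  return SCORING_V15_REGIME.neutralMult;
}
```

Under these assumptions, both ends of the observed OOS range (-2.1% and +10.2%) fall in the neutral band even with the flag on, so the multiplier is 1.00 throughout.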

Next steps

Do NOT flip SCORING_V15_REGIME.enabled. Re-evaluate only after the OOS window contains a real EU bear leg (broad 90d < -5% for >= 2 rebalance months), then A/B the flag against live.
If a future improvement is sought, prioritise orthogonal data axes over re-weighting existing factors (which is now shown to be saturated): e.g. options-flow confirmation, 13F institutional co-movement, or a genuinely independent insider-quality prior mined on IS-only data (the V15 pattern-mining attempt, audit 97, was look-ahead contaminated).
Keep V14e as the live production scorer.
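
The re-evaluation trigger in the first step above (broad 90d < -5% for >= 2 rebalance months) could be checked with something like the sketch below. The function name and signature are hypothetical, inputs are fractions, and the months are counted in total rather than required to be consecutive, which is an assumption the audit does not settle.

```typescript
// True once at least `minMonths` rebalance months in the OOS window have a
// PIT trailing-90d broad-market return below `threshold`.
// Months need not be consecutive (assumption).
function bearLegObserved(
  monthly90dReturns: number[],
  threshold: number = -0.05,
  minMonths: number = 2,
): boolean {
  const hits = monthly90dReturns.filter((r) => r < threshold).length;
  return hits >= minMonths;
}
```

On the current window, whose monthly values all lie in [-2.1%, +10.2%], this returns false.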

Files

Bake: scripts/_v13_bakeoff/bake-v15a-regime-alpha.ts
Results: scripts/_v13_bakeoff/stats-v15a.json
Flag wiring: src/lib/signals.ts (SCORING_V15_REGIME, regimeSizingMultiplier, V13Inputs.regimeBroadReturn90d)
This audit: docs/method-review/112-scoring-improvement-2026-05-26.md