113 · Insider-Quality Skill Prior (Empirical-Bayes) · 2026-05-26
Decision: NO-GO. Do not ship. The signal exists at the trade level (it is real and strong), but folded into the V14e top-10 picker it degrades risk-adjusted performance. Ablation confirms the lift is not incremental over the mature V14e stack. Negative result, closed out. Production scoring untouched, no flag wired.
- Branch: `research/insider-skill-prior`
- Bake: `scripts/_v13_bakeoff/v15b-insider-skill.ts`
- Results JSON: `scripts/_v13_bakeoff/stats-v15b-insider-skill.json`
- Cohort: EU_strict (XPAR, XAMS, XWBO, XBRU, XHEL, XOSL, XSTO, XETR), priced BUYs only.
- OOS window: 2025-01-01 .. 2026-05 (T=14 monthly buckets), top-10/mo, T+90 hold, NET 0.6% round-trip cost, winsor ±50% per pick. Walk-forward (params tuned on TRAIN fold pubDate < 2025-01-01, evaluated on OOS only).
1. Motivation
The production picker src/lib/signals.ts::computeV13Score() (V14e) carries exactly one
insider-identity term: a crude insiderRecentAlpha90d (clip ±20 × 0.025, no shrinkage,
no sample-size weighting, raw expanding mean of prior 90d returns). The hypothesis: a
properly shrunk empirical-Bayes per-insider skill prior should extract more of the
persistent cross-sectional skill that the regulators' filings encode, and lift the picker.
Data facts (verified against prod DB, EU_strict priced BUYs):
- 30,606 priced BUYs; 15,859 (51.8%) carry ≥3 strictly-prior priced BUYs by the same insider; 459 insiders have ≥10 lifetime priced BUYs; 8,189 distinct insiders.
- Grand mean of `returnFromPub90d` is +67.6% raw but -0.17% winsorized (±50%): a single moonshot (per-trade std 3,713) dominates the raw mean. Without winsorization the prior is pure noise, so we winsorize all prior returns at ±50%.
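To make the moonshot effect concrete, here is a minimal sketch of the ±50% clamp. The sample values are invented for illustration, not taken from the prod DB, and the helper names are hypothetical rather than the bake script's.

```typescript
// Clamp a percentage return into [-cap, +cap] (the ±50% winsor used in this audit).
function winsorize(retPct: number, cap = 50): number {
  return Math.max(-cap, Math.min(cap, retPct));
}

function mean(xs: number[]): number {
  if (xs.length === 0) return NaN;
  let sum = 0;
  for (const x of xs) sum += x;
  return sum / xs.length;
}

function winsorizedMean(returnsPct: number[], cap = 50): number {
  return mean(returnsPct.map((r) => winsorize(r, cap)));
}

// Illustrative data only: four ordinary trades and one extreme outlier.
const sampleReturnsPct = [-12, 4, -3, 8, 250000];

const rawMean = mean(sampleReturnsPct);            // dominated by the outlier
const winsMean = winsorizedMean(sampleReturnsPct); // outlier capped at +50

console.log({ rawMean, winsMean });
```

One extreme trade sets the raw mean on its own, while the clamped mean stays in the range of the ordinary trades.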
2. Method (look-ahead-free)
For trade t by insider i, using ONLY priced BUYs with pubDate < t.pubDate:
- `n_i` = count of strictly-prior priced BUYs by i.
- `rbar_i` = winsorized (±50%) mean of those prior `returnFromPub90d`.
Empirical-Bayes shrinkage toward an expanding-window grand mean mu_grand
(winsorized mean of ALL priced BUYs strictly before t's rebalance-month start,
re-derived monthly so it is point-in-time, NOT full-sample):
skill_i = (n_i · rbar_i + k · mu_grand) / (n_i + k)
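The shrinkage formula can be sketched as follows. This is a minimal illustration, not the bake script: the `PricedBuy` shape and the `insiderKey` field are assumptions, and `muGrand` is passed in by the caller, who is assumed to have derived it point-in-time for the trade's rebalance month.

```typescript
// Hypothetical minimal record; only pubDate and returnFromPub90d are names from the audit.
interface PricedBuy {
  insiderKey: string;        // assumed join key for the insider
  pubDate: string;           // ISO "YYYY-MM-DD", so string comparison orders by date
  returnFromPub90d: number;  // percent
}

interface SkillPrior { n: number; rbar: number; skill: number }

function clip(x: number, lo: number, hi: number): number {
  return Math.max(lo, Math.min(hi, x));
}

// skill_i = (n_i * rbar_i + k * mu_grand) / (n_i + k), using strictly-prior trades only.
function shrunkSkill(
  all: PricedBuy[], t: PricedBuy, k: number, muGrand: number, cap = 50,
): SkillPrior {
  const prior = all.filter(
    (b) => b.insiderKey === t.insiderKey && b.pubDate < t.pubDate,
  );
  const n = prior.length;
  let sum = 0;
  for (const b of prior) sum += clip(b.returnFromPub90d, -cap, cap);
  const rbar = n > 0 ? sum / n : 0;
  // With n = 0 this collapses to muGrand; k > 0 keeps the denominator positive.
  const skill = (n * rbar + k * muGrand) / (n + k);
  return { n, rbar, skill };
}

// Illustrative history (invented values).
const history: PricedBuy[] = [
  { insiderKey: "A", pubDate: "2024-01-10", returnFromPub90d: 10 },
  { insiderKey: "A", pubDate: "2024-03-05", returnFromPub90d: 80 }, // clipped to 50
  { insiderKey: "A", pubDate: "2024-06-20", returnFromPub90d: -20 },
  { insiderKey: "B", pubDate: "2024-02-02", returnFromPub90d: 5 },
];
const trade: PricedBuy = { insiderKey: "A", pubDate: "2025-02-01", returnFromPub90d: 0 };

const priorForTrade = shrunkSkill(history, trade, 2.9, -0.78);
console.log(priorForTrade);
```

With three prior trades and k = 2.9, the grand mean carries a weight of 2.9 / 5.9 ≈ 0.49.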
k estimated via one-way ANOVA variance components on the TRAIN fold only
(pubDate < 2025-01-01):
MSW = 195.25 (pooled within-insider variance of winsorized returns)
MSB = 857.14 (between-insider mean square)
nbar = 9.84 (effective group size, random-effects estimator)
sigma2_between(true) = max(0, (MSB - MSW) / nbar) = 67.25
sigma2_within = MSW = 195.25
k = sigma2_within / sigma2_between = 2.90
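The variance-component arithmetic can be reproduced with a short sketch. It assumes the textbook unbalanced one-way random-effects estimator, with effective group size n0 = (N − Σn_i²/N) / (g − 1). Whether the bake script computes `nbar` exactly this way is an assumption, and the function names are hypothetical.

```typescript
interface MeanSquares { msw: number; msb: number; n0: number }

// One-way ANOVA mean squares over per-insider groups of winsorized returns.
function meanSquares(groups: number[][]): MeanSquares {
  const g = groups.length;
  let N = 0, total = 0, sumSqSizes = 0;
  for (const grp of groups) {
    if (grp.length === 0) throw new Error("empty group");
    N += grp.length;
    sumSqSizes += grp.length * grp.length;
    for (const x of grp) total += x;
  }
  if (g < 2 || N <= g) throw new Error("need >= 2 groups and N > g");
  const grand = total / N;
  let ssb = 0, ssw = 0;
  for (const grp of groups) {
    let s = 0;
    for (const x of grp) s += x;
    const m = s / grp.length;
    ssb += grp.length * (m - grand) * (m - grand); // between-group sum of squares
    for (const x of grp) ssw += (x - m) * (x - m); // within-group sum of squares
  }
  return {
    msw: ssw / (N - g),
    msb: ssb / (g - 1),
    n0: (N - sumSqSizes / N) / (g - 1), // effective group size (assumed estimator)
  };
}

// k = sigma2_within / sigma2_between, with the between component floored at 0.
function kFromMeanSquares(msw: number, msb: number, n0: number) {
  const sigma2Between = Math.max(0, (msb - msw) / n0);
  // No detectable between-insider variance means full shrinkage (k = Infinity).
  const k = sigma2Between > 0 ? msw / sigma2Between : Infinity;
  return { sigma2Between, k };
}

// Plugging in the rounded TRAIN-fold figures reported above.
const reported = kFromMeanSquares(195.25, 857.14, 9.84);
console.log(reported.k.toFixed(2)); // "2.90"

// Toy balanced example (invented data) exercising the full path.
const toy = meanSquares([[1, 2, 3], [4, 5, 6], [7, 8, 9]]);
const toyK = kFromMeanSquares(toy.msw, toy.msb, toy.n0).k;
```

From the rounded inputs the between-insider component comes out at about 67.3; the small gap to the reported 67.25 is consistent with rounding in MSB, MSW and nbar.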
k = 2.9 is honest: true between-insider variance (67) is small relative to within-insider
noise (195), so a 3-trade insider gets shrunk roughly half-way to the grand mean. Gate:
emit the term only when n_i ≥ 3 (else term = 0, no cold-start penalty). Fold-in as a
bounded, centered tilt: s += clip(skill_i − anchor, ±X) · w, with anchor = grand mean (train) = -0.78%, and X, w tuned on TRAIN ONLY over a small 3×3 grid
(X ∈ {10,20,30}, w ∈ {0.02,0.04,0.06}).
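A minimal sketch of the gated, centered tilt and the 3×3 TRAIN grid described above. `TiltParams` and `skillTilt` are hypothetical names; only the formula, the n_i ≥ 3 gate and the grid values come from this audit.

```typescript
interface TiltParams {
  X: number;      // clip bound on the centered skill, in percentage points
  w: number;      // weight applied to the clipped value
  anchor: number; // TRAIN grand mean (-0.78 in this audit)
  minN: number;   // emit the term only when n_i >= minN
}

// term = clip(skill_i - anchor, ±X) * w, or 0 below the sample-size gate.
function skillTilt(skill: number, n: number, p: TiltParams): number {
  if (n < p.minN) return 0; // no cold-start penalty
  const centered = skill - p.anchor;
  return Math.max(-p.X, Math.min(p.X, centered)) * p.w;
}

// The 9 TRAIN grid configurations.
const grid: TiltParams[] = [];
for (const X of [10, 20, 30]) {
  for (const w of [0.02, 0.04, 0.06]) {
    grid.push({ X, w, anchor: -0.78, minN: 3 });
  }
}

// The corner the optimizer selected (X=30, w=0.06).
const selected: TiltParams = { X: 30, w: 0.06, anchor: -0.78, minN: 3 };
console.log(skillTilt(50, 5, selected));   // clipped at +30, then scaled by 0.06
console.log(skillTilt(50, 2, selected));   // 0: below the n_i >= 3 gate
```

At the selected corner the term is bounded to roughly ±1.8 score points, the largest swing available anywhere on the grid.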
TRAIN tuning was already a red flag: the best train Sharpe across the entire grid was
0.096 (statistical noise), and the optimizer selected the most aggressive corner
(X=30, w=0.06), the classic signature of a grid with no real signal to fit.
3. Three-arm OOS results
| Arm | Description | Sharpe | DSR | CI95 | CAGR% | MaxDD% | Win% | T | Picks |
|---|---|---|---|---|---|---|---|---|---|
| A | V14e baseline (live formula) | 1.51 | 0.49 | [-0.28, 4.76] | 55.7 | -14.1 | 59.3 | 14 | 140 |
| B | V14e + skill prior (treatment) | 1.25 | 0.23 | [-0.47, 3.58] | 62.8 | -20.4 | 56.4 | 14 | 140 |
| C | ABLATION: V14e(recentAlpha=0) + skill prior | 1.42 | 0.40 | [-0.27, 4.03] | 73.9 | -20.4 | 58.6 | 14 | 140 |
N_trials = 12 (3 OOS arms + 9 TRAIN grid configs, honest accounting of the param search).
Note arm A scores 1.51 on the refreshed cohort (data now extends to pub 2026-02-19),
above the 1.36 recorded in audit 103 / STRATEGY_PROOF. That live reference figure is
unchanged; STRATEGY_PROOF was not touched. The relevant comparison here is intra-study
(A vs B vs C on identical data and pipeline), which controls for the cohort refresh.
4. Why it fails the gate
- Sharpe(B) 1.25 < live 1.36 and < same-cohort baseline A 1.51. FAIL.
- Drawdown deepened, -14.1% → -20.4%. No DD tradeoff to justify the Sharpe loss.
- DSR(B) 0.23: the drop from the same-cohort baseline A (0.49, Section 3) is 0.26 (≤ 0.30, technically OK), but that is moot given Sharpe and the ablation both fail.
- CI95Lo(B) -0.47 ≥ -2.0: OK in isolation, also moot.
- Ablation (decisive): C 1.42 < A 1.51. Even when we remove the crude legacy `insiderRecentAlpha90d` term entirely and replace it with the shrunk prior, the result is still below the same-cohort baseline A. The lift is NOT incremental: the prior does not even recover what the crude term was already contributing. REJECT per the ablation criterion in the task brief.
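The checks above can be restated as a small evaluator. This is a sketch reconstructed from the bullets, not the gate definition from the task brief: the threshold values (0.30 DSR drop, -2.0 CI floor, live Sharpe 1.36) are the ones quoted in this audit, but the comparison operators and the all-must-pass rule are assumptions. Arm statistics are read from the Section 3 table.

```typescript
interface ArmStats { sharpe: number; dsr: number; ci95Lo: number; maxDDPct: number }

interface GateResult { checks: Record<string, boolean>; pass: boolean }

function evaluateGate(
  baseline: ArmStats, treatment: ArmStats, ablation: ArmStats, liveSharpe: number,
): GateResult {
  const checks: Record<string, boolean> = {
    sharpeVsLive: treatment.sharpe >= liveSharpe,
    sharpeVsBaseline: treatment.sharpe >= baseline.sharpe,
    dsrDropWithin030: baseline.dsr - treatment.dsr <= 0.30,
    ci95LoAboveFloor: treatment.ci95Lo >= -2.0,
    ablationBeatsBaseline: ablation.sharpe > baseline.sharpe,
  };
  // Assumed rule: every check must hold for a GO.
  const pass = Object.keys(checks).every((name) => checks[name]);
  return { checks, pass };
}

// Values from the three-arm OOS table.
const armA: ArmStats = { sharpe: 1.51, dsr: 0.49, ci95Lo: -0.28, maxDDPct: -14.1 };
const armB: ArmStats = { sharpe: 1.25, dsr: 0.23, ci95Lo: -0.47, maxDDPct: -20.4 };
const armC: ArmStats = { sharpe: 1.42, dsr: 0.40, ci95Lo: -0.27, maxDDPct: -20.4 };

const gate = evaluateGate(armA, armB, armC, 1.36);
console.log(gate);
```

Under these assumptions the DSR and CI checks pass while both Sharpe checks and the ablation check fail, which matches the NO-GO decision.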
5. Root cause · the signal is real but unusable as a rank tilt here
This is the interesting part and the reason the result is a clean negative rather than a data bug.
At the trade level the PIT skill prior is strongly, monotonically predictive of OOS
forward returns (diagnostic on the 4,515 OOS trades with n_i ≥ 3):
| PIT skill bucket | OOS trades | mean fwd 90d (winsor) |
|---|---|---|
| skill < -5% | 788 | -7.45% |
| -5% .. 0% | 1,595 | -2.13% |
| 0% .. +5% | 1,424 | +2.81% |
| skill > +5% | 708 | +15.66% |
corr(skill, fwd90d) = 0.50 on the OOS set. High-skill insiders genuinely continue to
outperform out-of-sample: the prior captures real persistent skill.
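A sketch of how such a bucket diagnostic and correlation could be computed. The `OosTrade` shape is hypothetical, the treatment of the exact bucket boundaries (which side 0% and ±5% fall on) is an assumption, and the sample trades are invented.

```typescript
interface OosTrade { skill: number; fwd90d: number } // both in percent; fwd90d winsorized

const BUCKET_LABELS = ["skill < -5%", "-5% .. 0%", "0% .. +5%", "skill > +5%"];

function bucketIndex(skill: number): number {
  if (skill < -5) return 0;
  if (skill < 0) return 1;
  if (skill <= 5) return 2;
  return 3;
}

// Count and mean forward return per PIT skill bucket.
function bucketMeans(trades: OosTrade[]) {
  const sums = [0, 0, 0, 0];
  const counts = [0, 0, 0, 0];
  for (const t of trades) {
    const i = bucketIndex(t.skill);
    sums[i] += t.fwd90d;
    counts[i] += 1;
  }
  return BUCKET_LABELS.map((label, i) => ({
    label,
    count: counts[i],
    meanFwd: counts[i] > 0 ? sums[i] / counts[i] : NaN,
  }));
}

// Pearson correlation between two equal-length series.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) throw new Error("bad input");
  const n = xs.length;
  let mx = 0, my = 0;
  for (let i = 0; i < n; i++) { mx += xs[i]; my += ys[i]; }
  mx /= n; my /= n;
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx, dy = ys[i] - my;
    cov += dx * dy; vx += dx * dx; vy += dy * dy;
  }
  return cov / Math.sqrt(vx * vy);
}

// Invented sample, one trade per bucket.
const sampleTrades: OosTrade[] = [
  { skill: -10, fwd90d: -6 }, { skill: -2, fwd90d: -1 },
  { skill: 3, fwd90d: 2 }, { skill: 8, fwd90d: 9 },
];
const diag = bucketMeans(sampleTrades);
const corr = pearson(sampleTrades.map((t) => t.skill), sampleTrades.map((t) => t.fwd90d));
console.log(diag, corr);
```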
So why does it hurt the picker? Two compounding reasons:
- Sharpe ≠ return on a thin book. Adding the tilt shifts the 10 names picked each month toward high-prior-skill names. Those names have a higher forward mean (CAGR rises 55.7% → 62.8% in B, → 73.9% in C) but also higher dispersion, deepening drawdown and cutting the risk-adjusted ratio. The prior chases raw return at the cost of Sharpe.
- V14e already encodes most of this skill. A focused probe with a deliberately weak base (sector/earnings/proxy multipliers stripped, baseline Sharpe ~0.36) showed the skill prior helping monotonically (0.36 → 0.83) and a top-30→top-10 skill tiebreaker reaching ~1.09. The signal is additive only when the base is weak. On the mature full V14e stack (1.51) the prior is redundant and competes with the existing high-quality ranking rather than augmenting it. The crude `insiderRecentAlpha90d`, despite no shrinkage, is already contributing approximately the right dose; the "better" prior overshoots into the return-chasing regime.
6. Overfitting / DSR discussion
- Params (`X`, `w`) tuned on TRAIN only; OOS untouched until the final 3-arm run. N_trials=12 deflation applied to all reported DSRs.
- The TRAIN-fold best Sharpe of 0.096 across the whole grid is itself the cleanest anti-overfitting evidence: there was no in-sample edge to overfit to. The OOS degradation is therefore not a generalization gap; it is a genuine absence of incremental, risk-adjusted, picker-level edge.
- `k=2.9` is data-driven (ANOVA variance components, train fold), not hand-tuned. We did not search over `k`, keeping the trial count low to protect DSR.
7. Caveats / known limits
- EU_strict only (the shipped universe). A different universe or hold horizon could move the trade-level edge into a usable region, but that is out of scope for this gate.
- Top-10/mo is a deliberately concentrated book; the Sharpe penalty is partly a construction artifact of so few names. A broader book (top-30/50) plus skill as a tiebreaker is the only configuration that showed promise (Section 5) and is flagged as the single most promising next step, but it changes the picker construction, not just the score, so it needs its own gate.
- `insiderName` is the join key (falling back to `insiderId`); name collisions across companies would dilute, not inflate, the prior, so the trade-level corr=0.50 is if anything a conservative floor.
8. Implementation outline (NOT executed · for the record)
Had it passed, the wiring would have mirrored the existing SCORING_V15_REGIME flag
pattern (audit 112): a SCORING_V16_SKILL = { enabled: false, k, X, w, anchor } const,
a skillPriorTilt(skill, n) helper returning 0 when disabled or n < 3, and a single
s += skillPriorTilt(...) line in computeV13Score that is a strict no-op when the flag
is off (prod bit-identical). The PIT prior fields (skillPriorN, skillPriorShrunk)
would be backfilled into Declaration by an idempotent scripts/backfill-* worker using
the expanding-window monthly mu_grand. None of this was implemented because the gate
failed.
9. Next steps
- Picker construction, not score tilt. Re-test skill strictly as a tiebreaker inside a wider candidate set (V14e top-30/50 → top-10 by skill), under its own gate. This was the only configuration with a positive signal in probing.
- Skill-conditioned position sizing rather than rank tilt: size up the highest-skill picks within a fixed pick set, accepting higher CAGR with managed DD. This is a sizing problem (cf. the V15 regime sizing overlay), not a ranking problem.
- Leave the crude `insiderRecentAlpha90d` term in place. It is not optimal but it is not the bottleneck, and replacing it with the shrunk prior is a strict regression here.
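The first next step (wide V14e shortlist, then top-10 by skill) could look roughly like the sketch below. It is an untested-in-production illustration under stated assumptions: `Candidate`, `pickWithSkillRerank` and `neutralSkill` are hypothetical names, and treating insiders below the n_i ≥ 3 gate as having neutral skill is one possible choice, not something this audit decided.

```typescript
interface Candidate {
  id: string;
  v14eScore: number;
  skill: number | null; // null when the insider has fewer than 3 prior priced BUYs
}

// Stage 1: shortlist the top `wide` names by V14e score.
// Stage 2: keep the top `final` of that shortlist by shrunk skill,
//          falling back to V14e score when skill values tie.
function pickWithSkillRerank(
  cands: Candidate[], wide = 30, final = 10, neutralSkill = 0,
): Candidate[] {
  const shortlist = [...cands]
    .sort((a, b) => b.v14eScore - a.v14eScore)
    .slice(0, wide);
  const skillOf = (c: Candidate) => (c.skill === null ? neutralSkill : c.skill);
  return shortlist
    .sort((a, b) => skillOf(b) - skillOf(a) || b.v14eScore - a.v14eScore)
    .slice(0, final);
}

// Invented month with four candidates; shortlist 3, keep 2.
const month: Candidate[] = [
  { id: "a", v14eScore: 10, skill: 1 },
  { id: "b", v14eScore: 9, skill: 5 },
  { id: "c", v14eScore: 8, skill: null },
  { id: "d", v14eScore: 7, skill: 99 }, // high skill but outside the V14e shortlist
];
const picks = pickWithSkillRerank(month, 3, 2);
console.log(picks.map((c) => c.id)); // [ 'b', 'a' ]
```

The point of the two-stage design is that skill can only reorder names V14e already rates highly, so a high-skill name with a weak V14e score (like `d`) never enters the book.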
10. Files
- Bake: `scripts/_v13_bakeoff/v15b-insider-skill.ts`
- Results JSON: `scripts/_v13_bakeoff/stats-v15b-insider-skill.json`
- This audit: `docs/method-review/113-insider-skill-prior-2026-05-26.md`
- Production scoring (`src/lib/signals.ts`): UNCHANGED. No flag added. STRATEGY_PROOF untouched. No prod rescore run.