113 · Insider-Quality Skill Prior (Empirical-Bayes) · 2026-05-26

Decision: NO-GO. Do not ship. The signal exists at the trade level (it is real and strong), but folded into the V14e top-10 picker it degrades risk-adjusted performance. Ablation confirms the lift is not incremental over the mature V14e stack. Negative result, closed out. Production scoring untouched, no flag wired.

Branch: research/insider-skill-prior
Bake: scripts/_v13_bakeoff/v15b-insider-skill.ts
Results JSON: scripts/_v13_bakeoff/stats-v15b-insider-skill.json
Cohort: EU_strict (XPAR, XAMS, XWBO, XBRU, XHEL, XOSL, XSTO, XETR), priced BUYs only.
OOS window: 2025-01-01 .. 2026-05 (T=14 monthly pub-date buckets, 2025-01 through 2026-02, since priced data currently ends at pub 2026-02-19), top-10/mo, T+90 hold, NET 0.6% round-trip cost, winsor ±50% per pick. Walk-forward (params tuned on TRAIN fold pubDate < 2025-01-01, evaluated on OOS only).

1. Motivation

The production picker src/lib/signals.ts::computeV13Score() (V14e) carries exactly one insider-identity term: a crude insiderRecentAlpha90d (clip ±20 × 0.025, no shrinkage, no sample-size weighting, raw expanding mean of prior 90d returns). The hypothesis: a properly shrunk empirical-Bayes per-insider skill prior should extract more of the persistent cross-sectional skill encoded in the regulatory filings, and so lift the picker.

Data facts (verified against prod DB, EU_strict priced BUYs):

30,606 priced BUYs; 15,859 (51.8%) carry ≥3 strictly-prior priced BUYs by the same insider; 459 insiders have ≥10 lifetime priced BUYs; 8,189 distinct insiders.
Grand mean of returnFromPub90d is +67.6% raw but -0.17% winsorized (±50%) · a single moonshot (per-trade std 3,713) dominates the raw mean. Winsorization is mandatory or the prior is pure noise. We winsorize all prior returns at ±50%.
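
A minimal sketch of why the ±50% winsorization matters, using made-up toy returns (not prod data). The helper names winsorize, mean and winsorizedMean are hypothetical and are not taken from the bake script.

```typescript
// Clamp a single return (in percent) to [-cap, +cap].
function winsorize(x: number, cap: number = 50): number {
  return Math.max(-cap, Math.min(cap, x));
}

// Plain arithmetic mean; NaN for an empty input.
function mean(xs: number[]): number {
  if (xs.length === 0) return NaN;
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Mean of per-trade winsorized returns.
function winsorizedMean(xs: number[], cap: number = 50): number {
  return mean(xs.map((x) => winsorize(x, cap)));
}

// Toy data (illustrative only): nine ordinary trades and one moonshot.
const toyReturns: number[] = [-4, 3, -2, 1, -5, 2, -1, 0, -3, 100000];

const rawMean = mean(toyReturns);           // dominated by the single moonshot
const winMean = winsorizedMean(toyReturns); // moonshot clamped to +50

console.log({ rawMean, winMean });
```

On the toy data the raw mean is set almost entirely by one trade, while the winsorized mean stays on the scale of the other nine. This is the same effect as the +67.6% vs -0.17% gap above.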

2. Method (look-ahead-free)

For trade t by insider i, using ONLY priced BUYs with pubDate < t.pubDate:

n_i = count of strictly-prior priced BUYs by i.
rbar_i = winsorized (±50%) mean of those prior returnFromPub90d.

Empirical-Bayes shrinkage toward an expanding-window grand mean mu_grand (winsorized mean of ALL priced BUYs strictly before t's rebalance-month start, re-derived monthly so it is point-in-time, NOT full-sample):

skill_i = (n_i · rbar_i + k · mu_grand) / (n_i + k)
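
The point-in-time computation above can be sketched as follows. This is an illustration under stated assumptions, not the bake's actual code: the PricedBuy shape, the insiderKey field and the shrunkSkill name are hypothetical, pubDate is assumed to be an ISO "YYYY-MM-DD" string, and mu_grand is assumed to be supplied by the caller from the expanding monthly window.

```typescript
// Hypothetical row shape; only pubDate and returnFromPub90d are named in this audit.
interface PricedBuy {
  insiderKey: string;
  pubDate: string;          // ISO "YYYY-MM-DD", so string comparison orders by date
  returnFromPub90d: number; // percent
}

const clamp = (x: number, lo: number, hi: number): number =>
  Math.max(lo, Math.min(hi, x));

// Shrunk skill for one trade, using only the same insider's strictly-prior priced BUYs.
// Assumes k > 0. With n = 0 the estimate collapses to muGrand.
function shrunkSkill(
  trade: { insiderKey: string; pubDate: string },
  history: PricedBuy[],
  muGrand: number,
  k: number,
): { n: number; skill: number } {
  const prior = history.filter(
    (h) => h.insiderKey === trade.insiderKey && h.pubDate < trade.pubDate,
  );
  const n = prior.length;
  const sum = prior.reduce((s, h) => s + clamp(h.returnFromPub90d, -50, 50), 0);
  const rbar = n > 0 ? sum / n : 0; // winsorized mean of prior returns
  return { n, skill: (n * rbar + k * muGrand) / (n + k) };
}
```

The strict `<` on pubDate is what keeps the estimate look-ahead-free in this sketch: a trade never sees itself or any same-day or later trade.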

k estimated via one-way ANOVA variance components on the TRAIN fold only (pubDate < 2025-01-01):

MSW = 195.25   (pooled within-insider variance of winsorized returns)
MSB = 857.14   (between-insider mean square)
nbar = 9.84    (effective group size, random-effects estimator)
sigma2_between(true) = max(0, (MSB - MSW) / nbar) = 67.25
sigma2_within        = MSW                         = 195.25
k = sigma2_within / sigma2_between = 2.90
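
A sketch of the variance-components arithmetic above. It assumes the standard unbalanced one-way random-effects estimator, with nbar = (N − Σ n_i² / N) / (a − 1) for a groups and N total trades; the audit does not spell out the exact nbar formula, so treat that line as an assumption. Function names are hypothetical.

```typescript
interface VarianceComponents {
  msw: number;
  msb: number;
  nbar: number;
  sigma2Between: number;
  k: number;
}

// Method-of-moments step: from mean squares to the shrinkage constant k.
function fromMeanSquares(msw: number, msb: number, nbar: number): VarianceComponents {
  const sigma2Between = Math.max(0, (msb - msw) / nbar);
  // No detectable between-group variance means "shrink fully": k is infinite.
  const k = sigma2Between > 0 ? msw / sigma2Between : Infinity;
  return { msw, msb, nbar, sigma2Between, k };
}

// One-way ANOVA over per-insider groups of (already winsorized) returns.
// Requires at least 2 non-empty groups and more observations than groups.
function oneWayAnova(groups: number[][]): VarianceComponents {
  const a = groups.length;
  const N = groups.reduce((s, g) => s + g.length, 0);
  const grand = groups.reduce((s, g) => s + g.reduce((t, x) => t + x, 0), 0) / N;
  let ssw = 0;
  let ssb = 0;
  let sumN2 = 0;
  for (const g of groups) {
    const m = g.reduce((t, x) => t + x, 0) / g.length;
    for (const x of g) ssw += (x - m) ** 2;
    ssb += g.length * (m - grand) ** 2;
    sumN2 += g.length ** 2;
  }
  const msw = ssw / (N - a);
  const msb = ssb / (a - 1);
  const nbar = (N - sumN2 / N) / (a - 1);
  return fromMeanSquares(msw, msb, nbar);
}

// Plugging in the TRAIN-fold figures reported above.
const train = fromMeanSquares(195.25, 857.14, 9.84);
console.log(train.sigma2Between.toFixed(2), train.k.toFixed(2));
```

Feeding the reported MSW, MSB and nbar through fromMeanSquares reproduces k ≈ 2.90; sigma2_between comes out at about 67.27, which matches the 67.25 above up to rounding of the inputs.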

k = 2.9 is honest: true between-insider variance (67) is small relative to within-insider noise (195), so a 3-trade insider gets shrunk roughly half-way to the grand mean. Gate: emit the term only when n_i ≥ 3 (else term = 0, no cold-start penalty). Fold-in as a bounded, centered tilt: s += clip(skill_i − anchor, ±X) · w, with anchor = grand mean (train) = -0.78%, and X, w tuned on TRAIN ONLY over a small 3×3 grid (X ∈ {10,20,30}, w ∈ {0.02,0.04,0.06}).
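
Two pieces of arithmetic in the paragraph above, worked through as a small sketch: the weight the shrunk estimate places on an insider's own mean, n / (n + k), and the largest score contribution each grid cell can make, X · w. The names ownWeight and grid are hypothetical.

```typescript
// Weight on the insider's own winsorized mean; the remainder goes to mu_grand.
function ownWeight(n: number, k: number): number {
  return n / (n + k);
}

const K = 2.9;
const weights = [3, 10, 30].map((n) => ({ n, own: ownWeight(n, K) }));

// The 3x3 TRAIN grid, with the maximum absolute tilt each cell allows.
const grid: { X: number; w: number; maxAbsTilt: number }[] = [];
for (const X of [10, 20, 30]) {
  for (const w of [0.02, 0.04, 0.06]) {
    grid.push({ X, w, maxAbsTilt: X * w });
  }
}

console.log(weights);
console.log(grid);
```

At n = 3 the own-mean weight is 3 / 5.9 ≈ 0.51, which is the "roughly half-way" shrinkage quoted above. The tilt is bounded between 0.2 score points (X=10, w=0.02) and 1.8 (X=30, w=0.06).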

TRAIN tuning was already a red flag: the best train Sharpe across the entire grid was 0.096 (statistical noise), and the optimizer selected the most aggressive corner (X=30, w=0.06) · the classic signature of a grid with no real signal to fit.

3. Three-arm OOS results

Arm	Description	Sharpe	DSR	CI95	CAGR%	MaxDD%	Win%	T	Picks
A	V14e baseline (live formula)	1.51	0.49	[-0.28, 4.76]	55.7	-14.1	59.3	14	140
B	V14e + skill prior (treatment)	1.25	0.23	[-0.47, 3.58]	62.8	-20.4	56.4	14	140
C	ABLATION: V14e(recentAlpha=0) + skill prior	1.42	0.40	[-0.27, 4.03]	73.9	-20.4	58.6	14	140

N_trials = 12 (3 OOS arms + 9 TRAIN grid configs, honest accounting of the param search).

Note arm A scores 1.51 on the refreshed cohort (data now extends to pub 2026-02-19), above the 1.36 recorded in audit 103 / STRATEGY_PROOF. That live reference figure is unchanged; STRATEGY_PROOF was not touched. The relevant comparison here is intra-study (A vs B vs C on identical data and pipeline), which controls for the cohort refresh.

4. Why it fails the gate

Sharpe(B) 1.25 < live 1.36 and < same-cohort baseline A 1.51. FAIL.
Drawdown deepened, -14.1% → -20.4%. No DD tradeoff to justify the Sharpe loss.
DSR(B) 0.23: the drop from baseline 0.49 (arm A, Section 3 table) is 0.26 (≤ 0.30, technically OK), but that is moot given Sharpe and the ablation both fail.
CI95Lo(B) -0.47 ≥ -2.0: OK in isolation, also moot.
Ablation (decisive): C 1.42 < A 1.51. Even when we remove the crude legacy insiderRecentAlpha90d term entirely and replace it with the shrunk prior, the result is still below the same-cohort baseline A (it clears only the 1.36 live reference, which predates the cohort refresh). The lift is NOT incremental · the prior does not even recover what the crude term was already contributing. REJECT per the ablation criterion in the task brief.

5. Root cause · the signal is real but unusable as a rank tilt here

This is the interesting part and the reason the result is a clean negative rather than a data bug.

At the trade level the PIT skill prior is strongly, monotonically predictive of OOS forward returns (diagnostic on the 4,515 OOS trades with n_i ≥ 3):

PIT skill bucket	OOS trades	mean fwd 90d (winsor)
skill < -5%	788	-7.45%
-5% .. 0%	1,595	-2.13%
0% .. +5%	1,424	+2.81%
skill > +5%	708	+15.66%

corr(skill, fwd90d) = 0.50 on the OOS set. High-skill insiders genuinely continue to outperform out-of-sample · the prior captures real persistent skill.
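
A sketch of the diagnostic above: bucket OOS trades by PIT skill, average the winsorized forward return per bucket, and compute a Pearson correlation. The Obs shape and function names are hypothetical, and the treatment of bucket edges (exactly -5, 0 or +5) is an assumption, since the table does not specify it.

```typescript
interface Obs {
  skill: number; // PIT shrunk skill, percent
  fwd: number;   // forward 90d return, percent, already winsorized
}

// Edge handling is assumed: -5 falls in "-5% .. 0%", 0 and +5 fall in "0% .. +5%".
function bucketLabel(skill: number): string {
  if (skill < -5) return "skill < -5%";
  if (skill < 0) return "-5% .. 0%";
  if (skill <= 5) return "0% .. +5%";
  return "skill > +5%";
}

function bucketMeans(obs: Obs[]): Map<string, { count: number; meanFwd: number }> {
  const acc = new Map<string, { count: number; sum: number }>();
  for (const o of obs) {
    const key = bucketLabel(o.skill);
    const cur = acc.get(key) || { count: 0, sum: 0 };
    cur.count += 1;
    cur.sum += o.fwd;
    acc.set(key, cur);
  }
  const out = new Map<string, { count: number; meanFwd: number }>();
  acc.forEach((v, key) => out.set(key, { count: v.count, meanFwd: v.sum / v.count }));
  return out;
}

// Pearson correlation of two equal-length series.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - mx) * (ys[i] - my);
    sxx += (xs[i] - mx) ** 2;
    syy += (ys[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}
```

Run over the OOS trades with n_i ≥ 3, bucketMeans would produce one row per bucket as in the table, and pearson(skills, fwds) the single correlation figure.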

So why does it hurt the picker? Two compounding reasons:

Sharpe ≠ return on a thin book. Adding the tilt reshuffles which 10 names per month get picked toward high-prior-skill names. Those names have higher forward mean (CAGR rises 55.7% → 62.8% in B, → 73.9% in C) but also higher dispersion, deepening drawdown and cutting the risk-adjusted ratio. The prior chases raw return at the cost of Sharpe.
V14e already encodes most of this skill. A focused probe with a deliberately weak base (sector/earnings/proxy multipliers stripped, baseline Sharpe ~0.36) showed the skill prior helping monotonically (0.36 → 0.83) and a top-30→top-10 skill tiebreaker reaching ~1.09. The signal is additive only when the base is weak. On the mature full V14e stack (1.51) the prior is redundant and competes with the existing high-quality ranking rather than augmenting it. The crude insiderRecentAlpha90d, despite no shrinkage, is already contributing approximately the right dose; the "better" prior overshoots into the return-chasing regime.

6. Overfitting / DSR discussion

Params (X, w) tuned on TRAIN only; OOS untouched until the final 3-arm run. N_trials=12 deflation applied to all reported DSRs.
The TRAIN-fold best Sharpe of 0.096 across the whole grid is itself the cleanest anti-overfitting evidence: there was no in-sample edge to overfit to. The OOS degradation is therefore not a generalization gap · it is a genuine absence of incremental, risk-adjusted, picker-level edge.
k=2.9 is data-driven (ANOVA variance components, train fold), not hand-tuned. We did not search over k, keeping the trial count low to protect DSR.

7. Caveats / known limits

EU_strict only (the shipped universe). A different universe or hold horizon could move the trade-level edge into a usable region, but that is out of scope for this gate.
Top-10/mo is a deliberately concentrated book; the Sharpe penalty is partly a construction artifact of so few names. A broader book (top-30/50) plus skill as a tiebreaker is the only configuration that showed promise (Section 5) and is flagged as the single most promising next step · but it changes the picker construction, not just the score, so it needs its own gate.
insiderName is the join key (falling back to insiderId); name collisions across companies would dilute, not inflate, the prior · so the trade-level corr=0.50 is if anything a conservative floor.

8. Implementation outline (NOT executed · for the record)

Had it passed, the wiring would have mirrored the existing SCORING_V15_REGIME flag pattern (audit 112): a SCORING_V16_SKILL = { enabled: false, k, X, w, anchor } const, a skillPriorTilt(skill, n) helper returning 0 when disabled or n < 3, and a single s += skillPriorTilt(...) line in computeV13Score that is a strict no-op when the flag is off (prod bit-identical). The PIT prior fields (skillPriorN, skillPriorShrunk) would be backfilled into Declaration by an idempotent scripts/backfill-* worker using the expanding-window monthly mu_grand. None of this was implemented because the gate failed.
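
For the record, the flag-gated helper described above might have looked roughly like this. Nothing here exists in the codebase: the SkillPriorConfig interface and the scoreWithTilt stand-in for the one-line fold-in are hypothetical, and the parameter values are the TRAIN-selected ones reported in Section 2.

```typescript
interface SkillPriorConfig {
  enabled: boolean;
  k: number;      // shrinkage constant (used upstream when computing skill)
  X: number;      // clip bound on the centered skill, percent
  w: number;      // weight applied to the clipped, centered skill
  anchor: number; // train grand mean, percent
}

// Disabled by default, so production scoring is unaffected.
const SCORING_V16_SKILL: SkillPriorConfig = {
  enabled: false,
  k: 2.9,
  X: 30,
  w: 0.06,
  anchor: -0.78,
};

// Returns exactly 0 when the flag is off or the insider has fewer than 3 prior trades.
function skillPriorTilt(
  skill: number,
  n: number,
  cfg: SkillPriorConfig = SCORING_V16_SKILL,
): number {
  if (!cfg.enabled || n < 3) return 0;
  const centered = skill - cfg.anchor;
  return Math.max(-cfg.X, Math.min(cfg.X, centered)) * cfg.w;
}

// Hypothetical stand-in for the single `s += skillPriorTilt(...)` line.
function scoreWithTilt(
  baseScore: number,
  skill: number,
  n: number,
  cfg: SkillPriorConfig = SCORING_V16_SKILL,
): number {
  return baseScore + skillPriorTilt(skill, n, cfg);
}
```

With the flag off the helper returns exactly 0, so the score is unchanged for ordinary finite scores, which is the strict no-op property the outline requires.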

9. Next steps

Picker construction, not score tilt. Re-test skill strictly as a tiebreaker inside a wider candidate set (V14e top-30/50 → top-10 by skill), under its own gate. This was the only configuration with a positive signal in probing.
Skill-conditioned position sizing rather than rank tilt: size up the highest-skill picks within a fixed pick set, accepting higher CAGR with managed DD · a sizing problem (cf. the V15 regime sizing overlay), not a ranking problem.
Leave the crude insiderRecentAlpha90d term in place. It is not optimal but it is not the bottleneck, and replacing it with the shrunk prior is a strict regression here.
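
The first next step above (skill strictly as a tiebreaker inside a wider V14e candidate set) could be prototyped along these lines. This is a sketch only: the Candidate shape and function name are hypothetical, and ranking cold-start insiders (no usable prior) at a neutral skill value is an assumption that the follow-up gate would need to decide.

```typescript
interface Candidate {
  id: string;
  v14eScore: number;
  skill: number | null; // null when the insider has no usable prior (n < 3)
}

// Stage 1: shortlist the top `wide` names by V14e score.
// Stage 2: keep the top `final` of that shortlist by shrunk skill.
function pickWithSkillTiebreak(
  cands: Candidate[],
  wide: number,
  final: number,
  neutralSkill: number,
): Candidate[] {
  const shortlist = [...cands]
    .sort((a, b) => b.v14eScore - a.v14eScore)
    .slice(0, wide);
  const skillOf = (c: Candidate): number =>
    c.skill === null ? neutralSkill : c.skill;
  // Array.prototype.sort is stable (ES2019+), so equal-skill names keep V14e order.
  return shortlist.sort((a, b) => skillOf(b) - skillOf(a)).slice(0, final);
}
```

A name outside the V14e shortlist can never enter the book however high its skill. That is what distinguishes this construction from the rejected rank tilt, where the prior could pull in names the base score ranked poorly.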

10. Files

Bake: scripts/_v13_bakeoff/v15b-insider-skill.ts
Results JSON: scripts/_v13_bakeoff/stats-v15b-insider-skill.json
This audit: docs/method-review/113-insider-skill-prior-2026-05-26.md
Production scoring (src/lib/signals.ts): UNCHANGED. No flag added. STRATEGY_PROOF untouched. No prod rescore run.