114 · Picker-Construction Bake-off · 2026-05-26
Decision: NO-GO. Do not ship. No picker-construction variant robustly beats the V14e top-10 equal-weight baseline. The single contender (skill-prior as a tiebreaker on a wide candidate pool) is a continuous overfit dial that converges toward a pure skill-prior picker, the exact return-chasing failure mode already documented in audit 113. The robust optimum remains V14e + top-10 equal-weight. Production untouched, no flag wired.
- Branch: research/picker-construction
- Bake: scripts/_v13_bakeoff/v15c-picker-construction.ts
- Results JSON: scripts/_v13_bakeoff/stats-v15c-picker.json
- Cohort: EU_strict (XPAR, XAMS, XWBO, XBRU, XHEL, XOSL, XSTO, XETR), priced BUYs only.
- OOS window: 2025-01-01 .. 2026-02 (T=14 monthly buckets), T+90 hold, NET 0.6% round-trip cost, winsor ±50% per pick.

The V14e score is FROZEN for every variant; only the picker (how the monthly book of 10 names is built from the scores) varies.
1. Motivation
Two prior rigorous nulls (audits 112 + 113) showed that re-weighting V14e and adding an empirical-Bayes insider-skill factor both FAILED to beat the baseline. Audit 113 §9 explicitly flagged the picker (not the score) as the prime suspect and named three next steps: a wider candidate pool with a secondary-key tiebreaker, book-size tuning, and skill-conditioned sizing. This bake executes that program honestly under the ship-gate.
Baseline A (V14e top-10 equal-weight) reproduces audit 113 exactly on the refreshed cohort: Sharpe 1.51, CI95 [-0.28, 4.76], CAGR 55.7%, MaxDD -14.1%, Win 59.3%, T=14, 140 picks. This is the reference all variants must beat.
2. Variants (V14e score frozen for ALL)
- F1 TIEBREAKER · V14e top-N candidate pool (N in {15,20,30}), final 10 chosen by a secondary key in {skill prior (PIT empirical-Bayes), recency (newest pubDate first), pctMcap}. 9 configs. Tests the audit-113 §9 step-1 hypothesis directly.
- F2 BOOK SIZE · top-{8,12,15} per month equal-weight (10 = baseline). 3 configs.
- F3 CONCENTRATION CAPS · max-1 / max-2 picks per sector, max-1 per issuer, greedily from the V14e ranking, fill to 10. 3 configs.
- F4 CONVICTION SIZING · within the top-10 book, weight by score (linear shift, and softmax temperature 2.0) instead of equal-weight. 2 configs.
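
The F1 and F3 constructions above can be sketched as follows. This is a minimal illustration, not the code in the bake script: the `Candidate` shape, its field names, and both function names are hypothetical, and the only behaviour taken from this audit is "rank by the frozen V14e score, then either re-sort a top-N pool by a secondary key (F1) or walk the ranking greedily under a per-sector cap (F3)".

```typescript
// Hypothetical candidate shape; field names are illustrative, not the repo's types.
interface Candidate {
  ticker: string;
  sector: string;
  v14eScore: number;  // frozen primary score
  skillPrior: number; // secondary key, used only by the F1 tiebreaker
}

// Rank by the frozen V14e score, best first (copies the input, never mutates it).
function rankByScore(cands: Candidate[]): Candidate[] {
  return [...cands].sort((a, b) => b.v14eScore - a.v14eScore);
}

// F1: keep the top `poolSize` names by V14e, then choose the book by a secondary key.
function pickWithTiebreaker(
  cands: Candidate[],
  poolSize: number,
  bookSize: number,
  secondaryKey: (c: Candidate) => number,
): Candidate[] {
  const pool = rankByScore(cands).slice(0, poolSize);
  return [...pool]
    .sort((a, b) => secondaryKey(b) - secondaryKey(a))
    .slice(0, bookSize);
}

// F3: walk the V14e ranking greedily, skipping names whose sector is at the cap,
// until the book is full or the ranking is exhausted.
function pickWithSectorCap(
  cands: Candidate[],
  maxPerSector: number,
  bookSize: number,
): Candidate[] {
  const perSector = new Map<string, number>();
  const book: Candidate[] = [];
  for (const c of rankByScore(cands)) {
    if (book.length >= bookSize) break;
    const used = perSector.get(c.sector) ?? 0;
    if (used >= maxPerSector) continue;
    perSector.set(c.sector, used + 1);
    book.push(c);
  }
  return book;
}

// Toy data (invented for illustration only).
const demo: Candidate[] = [
  { ticker: "A", sector: "X", v14eScore: 9, skillPrior: 0.10 },
  { ticker: "B", sector: "X", v14eScore: 8, skillPrior: 0.90 },
  { ticker: "C", sector: "Y", v14eScore: 7, skillPrior: 0.50 },
  { ticker: "D", sector: "Y", v14eScore: 6, skillPrior: 0.80 },
  { ticker: "E", sector: "Z", v14eScore: 5, skillPrior: 0.99 },
];
const bySkill = (c: Candidate): number => c.skillPrior;
const narrowPool = pickWithTiebreaker(demo, 3, 2, bySkill).map((c) => c.ticker); // ["B", "C"]
const widePool = pickWithTiebreaker(demo, 5, 2, bySkill).map((c) => c.ticker);   // ["E", "B"]
const capped = pickWithSectorCap(demo, 1, 3).map((c) => c.ticker);               // ["A", "C", "E"]
```

On the toy data the wider pool lets the lowest-scored name (E) enter the book purely on its secondary key, which is the mechanism §5 examines for the skill tiebreaker.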
N_trials for DSR = 18 (1 baseline + 9 + 3 + 3 + 2). All tiebreaker / cap / sizing params are fixed a priori, NO train-fold parameter search, so 18 is the honest deflation count. A targeted robustness probe added an 8-point pool-size sweep on the F1-skill family; that brings the fully honest search count to 26 (see §5).
3. OOS results (2025-01-01 .. 2026-02, T=14)
DSR is the within-study deflated Sharpe across the N=18 trial set (Bailey-Lopez de Prado,
expected-max-of-N correction). book = average names held per month.
| Family | Variant | Sharpe | DSR | CI95 | CAGR% | MaxDD% | Win% | book |
|---|---|---|---|---|---|---|---|---|
| BASE | A · V14e top-10 equal-weight | 1.51 | 0.05 | [-0.28, 4.76] | 55.7 | -14.1 | 59.3 | 10.0 |
| F1 | top15 -> 10 by skill | 1.23 | -0.24 | [-0.61, 3.21] | 56.5 | -14.9 | 56.4 | 10.0 |
| F1 | top20 -> 10 by skill | 1.34 | -0.13 | [-0.34, 3.22] | 63.6 | -12.9 | 54.3 | 10.0 |
| F1 | top30 -> 10 by skill | 1.63 | 0.17 | [0.08, 3.30] | 99.6 | -15.5 | 60.0 | 10.0 |
| F1 | top15 -> 10 by recency | 1.16 | -0.31 | [-0.60, 3.64] | 44.3 | -20.4 | 57.9 | 10.0 |
| F1 | top20 -> 10 by recency | 0.82 | -0.65 | [-0.95, 3.22] | 35.5 | -32.4 | 57.1 | 10.0 |
| F1 | top30 -> 10 by recency | 0.40 | -1.07 | [-1.43, 2.45] | 12.3 | -31.5 | 50.7 | 10.0 |
| F1 | top15 -> 10 by pctMcap | 0.84 | -0.63 | [-0.99, 3.08] | 34.4 | -33.9 | 56.4 | 10.0 |
| F1 | top20 -> 10 by pctMcap | 0.41 | -1.06 | [-1.38, 2.80] | 12.2 | -40.9 | 54.3 | 10.0 |
| F1 | top30 -> 10 by pctMcap | 0.22 | -1.25 | [-1.63, 2.39] | 4.1 | -43.3 | 52.9 | 10.0 |
| F2 | top-8 equal-weight | 0.93 | -0.54 | [-0.85, 4.19] | 35.7 | -27.3 | 54.5 | 8.0 |
| F2 | top-12 equal-weight | 1.14 | -0.33 | [-0.64, 4.00] | 38.3 | -17.1 | 56.5 | 12.0 |
| F2 | top-15 equal-weight | 1.01 | -0.45 | [-0.77, 3.44] | 32.8 | -15.8 | 55.7 | 15.0 |
| F3 | max-1/sector, fill to 10 | -1.21 | -2.68 | [-3.08, 0.62] | -23.8 | -32.2 | 42.9 | 10.0 |
| F3 | max-2/sector, fill to 10 | 0.25 | -1.22 | [-1.43, 2.73] | 6.3 | -27.0 | 50.0 | 10.0 |
| F3 | max-1/issuer, fill to 10 | -0.49 | -1.96 | [-2.56, 1.48] | -16.2 | -36.2 | 47.1 | 10.0 |
| F4 | top-10 score-linear sizing | -0.27 | -1.73 | [-2.59, 1.64] | -12.0 | -36.2 | 59.3 | 10.0 |
| F4 | top-10 softmax(t=2) sizing | -0.52 | -1.99 | [-2.31, 1.63] | -26.2 | -43.2 | 59.3 | 10.0 |
Note: the within-study baseline DSR here (0.05) is lower than audit 113's 0.49 because the DSR deflation penalty scales with the std of the trial-set Sharpes, and this study's set is far more dispersed (it deliberately includes the catastrophic F3/F4 configs from -1.21 to +1.63). DSR is a property of the search, not of a config in isolation. The relevant honest comparison is intra-study: does any variant clearly dominate baseline A across the gate?
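
To make the dispersion effect concrete, here is a minimal sketch of an expected-max-of-N deflation. It is an assumption about the form of the correction, not the bake's code: it treats DSR as the observed Sharpe minus the expected maximum Sharpe of N skill-less trials, uses the usual Bailey-Lopez de Prado approximation for that maximum, and takes the sample (n-1) standard deviation of the trial Sharpes. All function names are hypothetical. The normal CDF uses the Abramowitz-Stegun 7.1.26 approximation and is inverted by bisection so the block needs no library.

```typescript
// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation.
function normCdf(x: number): number {
  const z = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * z);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-z * z);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Inverse normal CDF by bisection on [-10, 10]; precise enough for p in (0, 1).
function normInv(p: number): number {
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normCdf(mid) < p) lo = mid;
    else hi = mid;
  }
  return (lo + hi) / 2;
}

// Sample standard deviation (n - 1 denominator; an assumption, see lead-in).
function sampleStd(xs: number[]): number {
  const mean = xs.reduce((s, x) => s + x, 0) / xs.length;
  const ss = xs.reduce((s, x) => s + (x - mean) * (x - mean), 0);
  return Math.sqrt(ss / (xs.length - 1));
}

// Expected maximum Sharpe across nTrials trials with no true edge.
function expectedMaxSharpe(trialSharpes: number[], nTrials: number): number {
  const gamma = 0.5772156649; // Euler-Mascheroni constant
  const z1 = normInv(1 - 1 / nTrials);
  const z2 = normInv(1 - 1 / (nTrials * Math.E));
  return sampleStd(trialSharpes) * ((1 - gamma) * z1 + gamma * z2);
}

// Deflated value under the assumption stated in the lead-in.
function deflate(sharpe: number, trialSharpes: number[], nTrials: number): number {
  return sharpe - expectedMaxSharpe(trialSharpes, nTrials);
}

// The 18 point Sharpes from the table above, in row order.
const tableSharpes = [
  1.51, 1.23, 1.34, 1.63, 1.16, 0.82, 0.40, 0.84, 0.41,
  0.22, 0.93, 1.14, 1.01, -1.21, 0.25, -0.49, -0.27, -0.52,
];
const hurdle = expectedMaxSharpe(tableSharpes, tableSharpes.length);
const baselineDeflated = deflate(1.51, tableSharpes, tableSharpes.length);
```

Under these assumptions the hurdle on the 18 table Sharpes is roughly 1.47, so the baseline's 1.51 deflates to about 0.04, close to the 0.05 in the DSR column. Because the hurdle is the trial-set standard deviation times a factor that depends only on N, a more dispersed trial set raises it for every config at once.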
4. Family-by-family read
- F1 recency / pctMcap: clear FAILURES. Both destroy Sharpe (recency 0.40-1.16, pctMcap 0.22-0.84, against 1.51 for baseline) and blow out drawdown (-20% to -32% for recency, -34% to -43% for pctMcap), and both get worse as the pool widens. Selecting the final 10 by "newest filing" or "biggest %-of-mcap buy" inside the V14e pool actively discards the score's ranking value. Dead.
- F2 book size: no help. Every alternative book size (8, 12, 15) is BELOW baseline Sharpe. top-12 is the best alternative (1.14) and still loses 0.37 Sharpe. The book is not too concentrated; widening it dilutes the high-conviction names without a diversification payoff on this thin, correlated EU universe. The top-10 book is already near the variance-minimising point.
- F3 concentration caps: actively harmful. max-1/sector (-1.21) and max-1/issuer (-0.49) force the picker to skip its best names to satisfy the cap, dragging in lower-ranked names that drown the signal. max-2/sector (0.25) is less bad but still far below baseline. The V14e book is not over-concentrated in a way that hurts Sharpe; the caps trade away alpha for a diversification the data does not reward here.
- F4 conviction sizing: actively harmful. Both score-linear (-0.27) and softmax (-0.52) sizing tank the Sharpe while win-rate is unchanged (59.3%): over-weighting the highest-score names concentrates into a few positions whose OOS dispersion deepens drawdown to -36% / -43%. Equal-weight is the right sizing for a 10-name book.
- F1 skill tiebreaker: the only contender, but not robust. top30 -> 10 by skill posts Sharpe 1.63 (> 1.51) with a positive CI95 lower bound (+0.08) and CAGR 99.6%. This is the one configuration that beats baseline on point Sharpe. It is dissected in §5 and rejected.
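
The concentration effect behind the F4 result can be illustrated with a short sketch. The exact weighting formulas used by the bake are not given in this audit, so both schemes below are assumptions: "linear shift" is taken to mean weights proportional to `score - min(score) + floor`, and softmax is taken as `exp(score / temperature)` normalised to sum to 1. The function names, the `floor` parameter, and the toy scores are hypothetical. `effectiveNames` is the inverse Herfindahl index, a standard way to express how many equal-weight positions a book behaves like.

```typescript
// Equal weight across n names.
function equalWeights(n: number): number[] {
  return Array.from({ length: n }, () => 1 / n);
}

// Scale any non-negative raw weights so they sum to 1.
function normalise(raw: number[]): number[] {
  const total = raw.reduce((s, w) => s + w, 0);
  return raw.map((w) => w / total);
}

// ASSUMED linear-shift scheme: shift scores so the lowest gets `floor`, then normalise.
function linearShiftWeights(scores: number[], floor: number): number[] {
  const min = Math.min(...scores);
  return normalise(scores.map((s) => s - min + floor));
}

// ASSUMED softmax scheme; subtracting the max keeps exp() numerically stable.
function softmaxWeights(scores: number[], temperature: number): number[] {
  const max = Math.max(...scores);
  return normalise(scores.map((s) => Math.exp((s - max) / temperature)));
}

// Inverse Herfindahl index: 1 / sum(w^2). Equals n for an equal-weight book.
function effectiveNames(weights: number[]): number {
  return 1 / weights.reduce((s, w) => s + w * w, 0);
}

// Toy scores for a 10-name book (invented for illustration only).
const toyScores = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1];
const effEqual = effectiveNames(equalWeights(toyScores.length));
const effLinear = effectiveNames(linearShiftWeights(toyScores, 1));
const effSoftmax = effectiveNames(softmaxWeights(toyScores, 2.0));
```

On any non-flat score vector both schemes give fewer effective names than the equal-weight book, so the same 10 picks (and the same win rate) are carried with less diversification. How much fewer depends on the score scale and the assumed formulas, so the toy numbers should not be read as the bake's actual concentration.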
5. Why the F1 skill tiebreaker is rejected (the decisive analysis)
A pool-size sensitivity sweep on the F1-skill family (the only promising lead) shows the Sharpe rising near-monotonically with the candidate-pool size N:
| pool N | Sharpe | CI95 | MaxDD% |
|---|---|---|---|
| 12 | 1.29 | [-0.51, 3.68] | -14.8 |
| 15 | 1.23 | [-0.61, 3.21] | -14.9 |
| 18 | 1.44 | [-0.25, 3.37] | -9.8 |
| 20 | 1.34 | [-0.34, 3.22] | -12.9 |
| 25 | 1.72 | [0.31, 3.49] | -6.8 |
| 30 | 1.63 | [0.08, 3.30] | -15.5 |
| 35 | 2.24 | [0.96, 3.82] | -7.3 |
| 40 | 2.02 | [0.68, 3.53] | -6.8 |
This near-monotonic Sharpe-vs-N relationship is the tell. The path is noisy locally (1.72 at N=25 against 1.63 at N=30, for example), but the trend runs from 1.29 at N=12 to above 2.0 at N>=35. As the candidate pool widens, the V14e score is used only as a coarse prefilter and the skill prior does almost all of the final selection. At N=40 the picker is effectively a pure skill-prior picker with a V14e gate. That is not a "tiebreaker on V14e"; it is the pure skill-prior picker that audit 113 already proved chases raw return at the cost of risk-adjusted return. The pool size N is a researcher degree-of-freedom: where you set the dial determines whether you "beat" baseline. Picking N=30 (or N=35) post hoc to clear the gate is precisely the overfit the ship-gate exists to reject.
Honest DSR accounting confirms it. The pool-size sweep is itself an 8-config search on the F1-skill family; folding it into the trial count gives N=26 across the full honest picker search. Under N=26 deflation:
- baseline 1.51 -> DSR -0.17
- F1-skill pool=30 1.63 -> DSR -0.05 (below the 0.19 gate, and below the luck floor)
- F1-skill pool=35 2.24 -> DSR +0.56 (only "passes" because it sits at the extreme end of a homogeneous winning sub-family; this is the most overfit, least defensible config)
The pool=30 config's nominal within-N=18 DSR of 0.17 is already below the gate (>= 0.19). Under the fully honest N=26 search it is negative.
Sub-period stability removes any remaining doubt. Splitting the 14-month OOS into two 7-month halves (pool=30 skill tiebreaker vs baseline):
| | H1 (7mo) Sharpe / mean | H2 (7mo) Sharpe / mean |
|---|---|---|
| F1-skill pool=30 | 2.56 / +7.94% | 1.05 / +5.42% |
| baseline top-10 | 2.52 / +5.89% | 0.72 / +2.35% |

Both books deteriorate together in H2 (Sharpe falls from about 2.5 to 1.05 and 0.72 respectively), i.e. the skill tiebreaker does not change the regime exposure; it slightly lifts the mean while inheriting the same drawdown timing. The whole-period 1.63 is propped up by a strong H1 that the baseline enjoys too. The incremental, regime-robust edge is not there.
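
A sub-period check of this kind can be sketched as below. The Sharpe convention is an assumption (mean over sample standard deviation of monthly bucket returns, annualised by the square root of 12, no risk-free rate); the audit does not state how the bake annualises. Function names and the example series are hypothetical.

```typescript
// ASSUMED Sharpe convention: mean / sample std of per-bucket returns, times sqrt(periodsPerYear).
function sharpe(returns: number[], periodsPerYear: number = 12): number {
  const n = returns.length;
  if (n < 2) return NaN; // not enough points for a sample std
  const mean = returns.reduce((s, r) => s + r, 0) / n;
  const ss = returns.reduce((s, r) => s + (r - mean) * (r - mean), 0);
  const std = Math.sqrt(ss / (n - 1));
  return std === 0 ? NaN : (mean / std) * Math.sqrt(periodsPerYear);
}

// Split a chronological series into first and second half (7 + 7 for T=14).
function splitHalves(returns: number[]): [number[], number[]] {
  const mid = Math.floor(returns.length / 2);
  return [returns.slice(0, mid), returns.slice(mid)];
}

// Per-half Sharpe and mean for one book.
function subPeriodStats(returns: number[]): { sharpe: number; mean: number }[] {
  return splitHalves(returns).map((half) => ({
    sharpe: sharpe(half),
    mean: half.reduce((s, r) => s + r, 0) / half.length,
  }));
}

// Invented 14-bucket return series, for illustration only (not the bake's data).
const exampleReturns = [
  0.06, 0.09, 0.03, 0.08, 0.05, 0.07, 0.04,
  0.10, -0.06, 0.05, -0.03, 0.08, 0.01, 0.02,
];
const [h1, h2] = subPeriodStats(exampleReturns);
```

Running the same function on the variant's and the baseline's bucket returns, then comparing `h1` and `h2` side by side, shows whether an apparent whole-period edge holds in both halves or comes from one half that both books share.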
6. Overfitting / DSR discussion
- The 18 a-priori configs were defined before any OOS evaluation. No train-fold tuning was done on the picker params (in contrast to the score-tilt search in audit 113). The skill prior's EB shrinkage k=2.90 is reused as-is from audit 113 (TRAIN-fold ANOVA), not re-searched.
- DSR is reported within the honest N=18 set and stress-tested at N=26 (incl. the pool-size probe). The only config that survives N=26 deflation is the N=35 pool extreme, which fails the robustness test in §5. We therefore treat the entire F1-skill family as non-shippable.
- T=14 months is short. With a thin, internally-correlated EU universe, a wide-pool picker that concentrates into a handful of repeat high-skill insiders is exactly the configuration most exposed to small-sample luck. The monotonic-with-N Sharpe is the signature of that exposure, not of a stable edge.
7. Caveats / known limits
- EU_strict only (the shipped universe). A broader universe or a different hold horizon could move the skill-tiebreaker into a usable region, but both are out of scope for this gate (the brief fixes T+90, cost, and winsor for comparability).
- The skill prior is genuinely predictive at the trade level (audit 113: corr=0.50 OOS). This bake confirms the audit-113 root cause: a real trade-level signal does not translate into a risk-adjusted picker edge on a thin 10-name monthly book, whether folded in as a score tilt (113) or as a selection tiebreaker (here).
- F3/F4 were tested at fixed, sensible parameterisations; a wider cap/temperature grid was deliberately NOT run, to protect the trial count. Their uniformly large negative Sharpes make a finer search pointless.
8. Conclusion · V14e + top-10 equal-weight is the robust optimum
Across four picker-construction families and 18 a-priori configs (plus an 8-point pool-size robustness sweep), nothing robustly beats the V14e top-10 equal-weight baseline:
- F2 book-size, F3 caps, F4 sizing: all strictly worse, most catastrophically so.
- F1 recency / pctMcap tiebreakers: strictly worse.
- F1 skill tiebreaker: the only point-Sharpe beat, but it is a monotonic overfit dial that fails sub-period stability and the honest (N=26) DSR gate.
This is now the THIRD independent null on this universe (score re-weighting · audit 112; skill-prior score tilt · audit 113; picker construction · this audit). The convergent evidence is that V14e top-10 equal-weight on EU_strict is at or very near the robust risk-adjusted optimum for this universe and hold horizon. Further gains will require a different lever (universe expansion with its own honest-tape validation, or the hold-horizon / exit-rule axis), not a better way to slice the same V14e scores into the same 10-name monthly book.
No flag was wired (nothing passed). Production scoring (src/lib/signals.ts), the picker
in src/lib/recommendation-engine.ts, WINNING_STRATEGY, and STRATEGY_PROOF are all
UNCHANGED. No prod rescore run.
9. Files
- Bake: scripts/_v13_bakeoff/v15c-picker-construction.ts
- Results JSON: scripts/_v13_bakeoff/stats-v15c-picker.json
- This audit: docs/method-review/114-picker-construction-2026-05-26.md
- Production: UNCHANGED. No flag added. STRATEGY_PROOF untouched. No prod rescore.