114 · Picker-Construction Bake-off · 2026-05-26

Decision: NO-GO. Do not ship. No picker-construction variant robustly beats the V14e top-10 equal-weight baseline. The single contender (skill-prior as a tiebreaker on a wide candidate pool) is a continuous overfit dial that converges toward a pure skill-prior picker, the exact return-chasing failure mode already documented in audit 113. The robust optimum remains V14e + top-10 equal-weight. Production untouched, no flag wired.

Branch: research/picker-construction
Bake: scripts/_v13_bakeoff/v15c-picker-construction.ts
Results JSON: scripts/_v13_bakeoff/stats-v15c-picker.json
Cohort: EU_strict (XPAR, XAMS, XWBO, XBRU, XHEL, XOSL, XSTO, XETR), priced BUYs only.
OOS window: 2025-01-01 .. 2026-02 (T=14 monthly buckets), T+90 hold, NET 0.6% round-trip cost, winsor +-50% per pick. The V14e score is FROZEN for every variant; only the picker (how the monthly book of 10 names is built from the scores) varies.
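
The per-pick return treatment above can be sketched as follows. This is a minimal illustration, not the bake script: the names `netPickReturn` and `bookReturn` are hypothetical, and the order of operations (cost subtracted first, then winsorised) is an assumption. The audit states only the 0.6% round-trip cost and the +-50% winsor.

```typescript
// Hypothetical helpers; names and the cost-then-winsor order are assumptions.
const ROUND_TRIP_COST = 0.006; // 0.6% round-trip cost, as stated in the audit
const WINSOR_LIMIT = 0.5; // +-50% per pick, as stated in the audit

// Net return of one pick: subtract cost, then clamp to the winsor band.
function netPickReturn(grossReturn: number): number {
  const net = grossReturn - ROUND_TRIP_COST;
  return Math.min(WINSOR_LIMIT, Math.max(-WINSOR_LIMIT, net));
}

// Equal-weight monthly book return: simple average of net pick returns.
function bookReturn(grossReturns: number[]): number {
  if (grossReturns.length === 0) return 0;
  const total = grossReturns.reduce((sum, r) => sum + netPickReturn(r), 0);
  return total / grossReturns.length;
}
```

Under these assumptions a +120% gross pick contributes +50% to its monthly book, and a -90% pick contributes -50%.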

1. Motivation

Two prior rigorous nulls (audits 112 + 113) showed that re-weighting V14e and adding an empirical-Bayes insider-skill factor both FAILED to beat the baseline. Audit 113 §9 explicitly flagged the picker (not the score) as the prime suspect and named three next steps: a wider candidate pool with a secondary-key tiebreaker, book-size tuning, and skill-conditioned sizing. This bake executes that program honestly under the ship-gate.

Baseline A (V14e top-10 equal-weight) reproduces audit 113 exactly on the refreshed cohort: Sharpe 1.51, CI95 [-0.28, 4.76], CAGR 55.7%, MaxDD -14.1%, Win 59.3%, T=14, 140 picks. This is the reference all variants must beat.

2. Variants (V14e score frozen for ALL)

F1 TIEBREAKER · V14e top-N candidate pool (N in {15,20,30}), final 10 chosen by a secondary key in {skill prior (PIT empirical-Bayes), recency (newest pubDate first), pctMcap}. 9 configs. Tests the audit-113 §9 step-1 hypothesis directly.
F2 BOOK SIZE · top-{8,12,15} per month equal-weight (10 = baseline). 3 configs.
F3 CONCENTRATION CAPS · max-1 / max-2 picks per sector, max-1 per issuer, greedily from the V14e ranking, fill to 10. 3 configs.
F4 CONVICTION SIZING · within the top-10 book, weight by score (linear shift, and softmax temperature 2.0) instead of equal-weight. 2 configs.
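
The F1 and F3 constructions above can be sketched as two small pickers. This is an illustrative sketch, not the code in the bake script: the `Candidate` shape, the field names, and the function names are hypothetical, and `secondaryKey` stands in for whichever key (skill prior, recency, pctMcap) a config uses, with higher meaning preferred.

```typescript
// Hypothetical candidate shape; field names are illustrative only.
interface Candidate {
  ticker: string;
  issuer: string;
  sector: string;
  score: number; // frozen V14e score
  secondaryKey: number; // higher = preferred by the tiebreaker
}

const byScoreDesc = (a: Candidate, b: Candidate): number => b.score - a.score;

// F1: keep the top `poolSize` names by score, then choose the final
// `bookSize` names by secondary key (score breaks secondary-key ties).
function pickByTiebreaker(cands: Candidate[], poolSize: number, bookSize = 10): Candidate[] {
  const pool = [...cands].sort(byScoreDesc).slice(0, poolSize);
  return pool
    .sort((a, b) => b.secondaryKey - a.secondaryKey || b.score - a.score)
    .slice(0, bookSize);
}

// F3: walk the score ranking greedily, skipping any name whose group
// (sector or issuer) already holds `maxPerGroup` picks, until the book is full.
function pickWithCap(
  cands: Candidate[],
  groupOf: (c: Candidate) => string,
  maxPerGroup: number,
  bookSize = 10,
): Candidate[] {
  const counts = new Map<string, number>();
  const book: Candidate[] = [];
  for (const c of [...cands].sort(byScoreDesc)) {
    if (book.length >= bookSize) break;
    const group = groupOf(c);
    const held = counts.get(group) ?? 0;
    if (held >= maxPerGroup) continue; // cap reached: skip this name
    counts.set(group, held + 1);
    book.push(c);
  }
  return book;
}
```

In this sketch, setting `poolSize` equal to `bookSize` cannot change book membership, which recovers the baseline; as `poolSize` grows, the secondary key decides more of the final selection.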

N_trials for DSR = 18 (1 baseline + 9 + 3 + 3 + 2). All tiebreaker / cap / sizing params are fixed a priori, NO train-fold parameter search, so 18 is the honest deflation count. A targeted robustness probe added an 8-point pool-size sweep on the F1-skill family; that brings the fully honest search count to 26 (see §5).

3. OOS results (2025-01-01 .. 2026-02, T=14)

DSR is the within-study deflated Sharpe across the N=18 trial set (Bailey-Lopez de Prado, expected-max-of-N correction). book = average names held per month.

Family	Variant	Sharpe	DSR	CI95	CAGR%	MaxDD%	Win%	book
BASE	A · V14e top-10 equal-weight	1.51	0.05	[-0.28, 4.76]	55.7	-14.1	59.3	10.0
F1	top15 -> 10 by skill	1.23	-0.24	[-0.61, 3.21]	56.5	-14.9	56.4	10.0
F1	top20 -> 10 by skill	1.34	-0.13	[-0.34, 3.22]	63.6	-12.9	54.3	10.0
F1	top30 -> 10 by skill	1.63	0.17	[0.08, 3.30]	99.6	-15.5	60.0	10.0
F1	top15 -> 10 by recency	1.16	-0.31	[-0.60, 3.64]	44.3	-20.4	57.9	10.0
F1	top20 -> 10 by recency	0.82	-0.65	[-0.95, 3.22]	35.5	-32.4	57.1	10.0
F1	top30 -> 10 by recency	0.40	-1.07	[-1.43, 2.45]	12.3	-31.5	50.7	10.0
F1	top15 -> 10 by pctMcap	0.84	-0.63	[-0.99, 3.08]	34.4	-33.9	56.4	10.0
F1	top20 -> 10 by pctMcap	0.41	-1.06	[-1.38, 2.80]	12.2	-40.9	54.3	10.0
F1	top30 -> 10 by pctMcap	0.22	-1.25	[-1.63, 2.39]	4.1	-43.3	52.9	10.0
F2	top-8 equal-weight	0.93	-0.54	[-0.85, 4.19]	35.7	-27.3	54.5	8.0
F2	top-12 equal-weight	1.14	-0.33	[-0.64, 4.00]	38.3	-17.1	56.5	12.0
F2	top-15 equal-weight	1.01	-0.45	[-0.77, 3.44]	32.8	-15.8	55.7	15.0
F3	max-1/sector, fill to 10	-1.21	-2.68	[-3.08, 0.62]	-23.8	-32.2	42.9	10.0
F3	max-2/sector, fill to 10	0.25	-1.22	[-1.43, 2.73]	6.3	-27.0	50.0	10.0
F3	max-1/issuer, fill to 10	-0.49	-1.96	[-2.56, 1.48]	-16.2	-36.2	47.1	10.0
F4	top-10 score-linear sizing	-0.27	-1.73	[-2.59, 1.64]	-12.0	-36.2	59.3	10.0
F4	top-10 softmax(t=2) sizing	-0.52	-1.99	[-2.31, 1.63]	-26.2	-43.2	59.3	10.0

Note: the within-study baseline DSR here (0.05) is lower than audit 113's 0.49 because the DSR deflation penalty scales with the std of the trial-set Sharpes, and this study's set is far more dispersed: it deliberately includes the catastrophic F3/F4 configs, so the trial Sharpes span -1.21 to +1.63. DSR is a property of the search, not of a config in isolation. The relevant honest comparison is intra-study: does any variant clearly dominate baseline A across the gate?

4. Family-by-family read

F1 recency / pctMcap: clear FAILURES. All six configs fall below baseline Sharpe (0.22 to 1.16 vs 1.51) and deepen drawdown (-20.4% to -43.3% vs -14.1%), and both keys get worse as the pool widens. Selecting the final 10 by "newest filing" or "biggest %-of-mcap buy" inside the V14e pool actively discards the score's ranking value. Dead.
F2 book size: no help. Every alternative book size (8, 12, 15) is BELOW baseline Sharpe. top-12 is the best alternative (1.14) and still loses 0.37 Sharpe. The book is not too concentrated; widening it dilutes the high-conviction names without a diversification payoff on this thin, correlated EU universe. Among the sizes tested, the top-10 book has both the highest Sharpe and the shallowest drawdown.
F3 concentration caps: actively harmful. max-1/sector (-1.21) and max-1/issuer (-0.49) force the picker to skip its best names to satisfy the cap, dragging in lower-ranked names that drown the signal. max-2/sector (0.25) is less bad but still far below baseline. The V14e book is not over-concentrated in a way that hurts Sharpe; the caps trade away alpha for a diversification the data does not reward here.
F4 conviction sizing: actively harmful. Both score-linear (-0.27) and softmax (-0.52) sizing tank the Sharpe while win-rate is unchanged (59.3%): over-weighting the highest-score names concentrates into a few positions whose OOS dispersion deepens drawdown to -36% / -43%. Equal-weight is the right sizing for a 10-name book.
F1 skill tiebreaker: the only contender, but not robust. top30 -> 10 by skill posts Sharpe 1.63 (> 1.51) with a positive CI95 lower bound (+0.08) and CAGR 99.6%. This is the one configuration that beats baseline on point Sharpe. It is dissected in §5 and rejected.
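
To illustrate the F4 reading above, here is a minimal sketch of softmax sizing against equal-weight. The function names, the score scale, and the convention of dividing scores by the temperature are assumptions; the audit states only "softmax temperature 2.0" and does not give the formula used in the bake.

```typescript
// Hypothetical sizing helpers; score scale and temperature convention
// (scores divided by temperature) are assumptions.
function equalWeights(n: number): number[] {
  return Array.from({ length: n }, () => 1 / n);
}

function softmaxWeights(scores: number[], temperature: number): number[] {
  const maxScore = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map((s) => Math.exp((s - maxScore) / temperature));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Weighted book return: sum of weight_i * return_i over the same index.
function weightedReturn(weights: number[], returns: number[]): number {
  return weights.reduce((acc, w, i) => acc + w * returns[i], 0);
}
```

Win rate counts picks above zero regardless of weight, so sizing leaves it unchanged, while the book return and drawdown depend on where the weight sits. That matches the pattern in the table: identical 59.3% win rate, very different Sharpe.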

5. Why the F1 skill tiebreaker is rejected (the decisive analysis)

A pool-size sensitivity sweep on the F1-skill family (the only promising lead) shows the Sharpe rising near-monotonically with the candidate-pool size N:

pool N	Sharpe	CI95	MaxDD%
12	1.29	[-0.51, 3.68]	-14.8
15	1.23	[-0.61, 3.21]	-14.9
18	1.44	[-0.25, 3.37]	-9.8
20	1.34	[-0.34, 3.22]	-12.9
25	1.72	[0.31, 3.49]	-6.8
30	1.63	[0.08, 3.30]	-15.5
35	2.24	[0.96, 3.82]	-7.3
40	2.02	[0.68, 3.53]	-6.8

This near-monotonic Sharpe-vs-N relationship is the tell: adjacent steps zigzag, but the trend runs from 1.29 at N=12 to above 2.0 at N=35-40. As the candidate pool widens, the V14e score is used only as a coarse prefilter and the skill prior does almost all of the final selection. At N=40 the picker is effectively a pure skill-prior picker with a V14e gate. That is not a "tiebreaker on V14e"; it is the pure skill-prior picker that audit 113 already proved chases raw return at the cost of risk-adjusted return. The pool size N is a researcher degree-of-freedom: where you set the dial determines whether you "beat" baseline. Picking N=30 (or N=35) post hoc to clear the gate is precisely the overfit the ship-gate exists to reject.
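
One way to quantify the trend in the sweep table is a rank correlation between pool size and Sharpe, plus a count of adjacent reversals. The sketch below uses only the figures from the table; the helper names are hypothetical, and this check is an editorial illustration rather than part of the bake.

```typescript
// Pool-size sweep figures copied from the table above.
const poolN = [12, 15, 18, 20, 25, 30, 35, 40];
const sharpe = [1.29, 1.23, 1.44, 1.34, 1.72, 1.63, 2.24, 2.02];

// Ascending ranks 1..n (this data has no ties, so no tie handling).
function ranks(xs: number[]): number[] {
  const order = xs.map((x, i) => [x, i] as const).sort((a, b) => a[0] - b[0]);
  const out = new Array<number>(xs.length).fill(0);
  order.forEach(([, idx], r) => {
    out[idx] = r + 1;
  });
  return out;
}

// Spearman rho via the no-ties formula: 1 - 6*sum(d^2) / (n*(n^2 - 1)).
function spearman(x: number[], y: number[]): number {
  const rx = ranks(x);
  const ry = ranks(y);
  const n = x.length;
  const sumD2 = rx.reduce((acc, r, i) => acc + (r - ry[i]) ** 2, 0);
  return 1 - (6 * sumD2) / (n * (n * n - 1));
}

// Adjacent steps where Sharpe falls as the pool widens.
const reversals = sharpe.slice(1).filter((s, i) => s < sharpe[i]).length;

console.log(spearman(poolN, sharpe).toFixed(3), reversals); // 0.905 4
```

A rho of about 0.90 with four small adjacent reversals is a strong upward trend rather than a strictly monotonic one, which is still the dial-like behaviour described above.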

Honest DSR accounting confirms it. The pool-size sweep is itself an 8-config search on the F1-skill family; folding it into the trial count gives N=26 across the full honest picker search. Under N=26 deflation:

baseline 1.51 -> DSR -0.17
F1-skill pool=30 1.63 -> DSR -0.05 (below the 0.19 gate, and below the luck floor)
F1-skill pool=35 2.24 -> DSR +0.56 (only "passes" because it sits at the extreme end of a homogeneous winning sub-family; this is the most overfit, least defensible config)

The pool=30 config's nominal within-N=18 DSR of 0.17 is already below the gate (>= 0.19). Under the fully honest N=26 search it is negative.
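
The reported (Sharpe, DSR) pairs are consistent with DSR being computed here as Sharpe minus a single expected-max hurdle per trial count. That reading is inferred from the figures in this audit, not confirmed from the bake script. A quick arithmetic check on a few of the reported pairs:

```typescript
// (Sharpe, DSR) pairs copied from this audit.
const pairsN18: Array<[number, number]> = [
  [1.51, 0.05], // baseline A
  [1.63, 0.17], // F1 top30 -> 10 by skill
  [-1.21, -2.68], // F3 max-1/sector
];
const pairsN26: Array<[number, number]> = [
  [1.51, -0.17], // baseline A
  [1.63, -0.05], // F1-skill pool=30
  [2.24, 0.56], // F1-skill pool=35
];

// Implied hurdle = Sharpe - DSR, rounded to two decimals.
const impliedHurdle = ([s, d]: [number, number]): number =>
  Math.round((s - d) * 100) / 100;

console.log(pairsN18.map(impliedHurdle)); // [1.46, 1.46, 1.47]
console.log(pairsN26.map(impliedHurdle)); // [1.68, 1.68, 1.68]
```

On this reading, moving from 18 to 26 trials raises the hurdle by roughly 0.22 Sharpe, which is what pushes both the baseline and the pool=30 config below zero.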

Sub-period stability removes any remaining doubt. Splitting the 14-month OOS into two 7-month halves (pool=30 skill tiebreaker vs baseline):

	H1 (7mo) Sharpe / mean	H2 (7mo) Sharpe / mean
F1-skill pool=30	2.56 / +7.94%	1.05 / +5.42%
baseline top-10	2.52 / +5.89%	0.72 / +2.35%

Both books deteriorate sharply together in H2 (Sharpe 2.56 -> 1.05 for the skill tiebreaker, 2.52 -> 0.72 for baseline), i.e. the skill tiebreaker does not change the regime exposure; it lifts the mean in both halves while inheriting the same drawdown timing. The whole-period 1.63 is propped up by a strong H1 that the baseline enjoys too. The incremental, regime-robust edge is not there.
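
A split-half check of this kind can be sketched as below. The helper names are hypothetical, and the Sharpe convention (mean over sample standard deviation, with an optional annualisation factor) is an assumption, since the audit does not state the exact formula used for the half-period figures.

```typescript
// Hypothetical helpers; the audit's exact Sharpe convention is not stated.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sharpe as mean / sample standard deviation, optionally annualised.
function sharpeRatio(returns: number[], periodsPerYear = 1): number {
  if (returns.length < 2) return NaN;
  const m = mean(returns);
  const variance =
    returns.reduce((acc, r) => acc + (r - m) ** 2, 0) / (returns.length - 1);
  if (variance === 0) return NaN;
  return (m / Math.sqrt(variance)) * Math.sqrt(periodsPerYear);
}

// Split a chronological return series into two halves and summarise each.
function splitHalfStats(returns: number[], periodsPerYear = 1) {
  const mid = Math.floor(returns.length / 2);
  const halves = [returns.slice(0, mid), returns.slice(mid)];
  return halves.map((h) => ({
    months: h.length,
    mean: mean(h),
    sharpe: sharpeRatio(h, periodsPerYear),
  }));
}
```

Applied to a 14-bucket monthly series this yields two 7-month summaries per book, which can then be compared side by side as in the table above.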

6. Overfitting / DSR discussion

The 18 a-priori configs were defined before any OOS evaluation. No train-fold tuning was done on the picker params (in contrast to the score-tilt search in audit 113). The skill prior's EB shrinkage k=2.90 is reused as-is from audit 113 (TRAIN-fold ANOVA), not re-searched.
DSR is reported within the honest N=18 trial set and stress-tested at N=26 trials (incl. the pool-size probe). The only config that survives N=26 deflation is the pool=35 extreme of the F1-skill sweep, which fails the robustness test in §5. We therefore treat the entire F1-skill family as non-shippable.
T=14 months is short. With a thin, internally-correlated EU universe, a wide-pool picker that concentrates into a handful of repeat high-skill insiders is exactly the configuration most exposed to small-sample luck. The monotonic-with-N Sharpe is the signature of that exposure, not of a stable edge.

7. Caveats / known limits

EU_strict only (the shipped universe). A broader universe or a different hold horizon could move the skill-tiebreaker into a usable region, but both are out of scope for this gate (the brief fixes T+90, cost, and winsor for comparability).
The skill prior is genuinely predictive at the trade level (audit 113: corr=0.50 OOS). This bake confirms the audit-113 root cause: a real trade-level signal does not translate into a risk-adjusted picker edge on a thin 10-name monthly book, whether folded in as a score tilt (113) or as a selection tiebreaker (here).
F3/F4 were tested at fixed, sensible parameterisations; a wider cap/temperature grid was deliberately NOT run, to protect the trial count. Their uniformly large negative Sharpes make a finer search pointless.

8. Conclusion · V14e + top-10 equal-weight is the robust optimum

Across four picker-construction families and 18 a-priori configs (plus an 8-point pool-size robustness sweep), nothing robustly beats the V14e top-10 equal-weight baseline:

F2 book-size, F3 caps, F4 sizing: all strictly worse, most catastrophically so.
F1 recency / pctMcap tiebreakers: strictly worse.
F1 skill tiebreaker: the only point-Sharpe beat, but it is a monotonic overfit dial that fails sub-period stability and the honest (N=26) DSR gate.

This is now the THIRD independent null on this universe (score re-weighting · audit 112; skill-prior score tilt · audit 113; picker construction · this audit). The convergent evidence is that V14e top-10 equal-weight on EU_strict is at or very near the robust risk-adjusted optimum for this universe and hold horizon. Further gains will require a different lever (universe expansion with its own honest-tape validation, or the hold-horizon / exit-rule axis), not a better way to slice the same V14e scores into the same 10-name monthly book.

No flag was wired (nothing passed). Production scoring (src/lib/signals.ts), the picker in src/lib/recommendation-engine.ts, WINNING_STRATEGY, and STRATEGY_PROOF are all UNCHANGED. No prod rescore run.

9. Files

Bake: scripts/_v13_bakeoff/v15c-picker-construction.ts
Results JSON: scripts/_v13_bakeoff/stats-v15c-picker.json
This audit: docs/method-review/114-picker-construction-2026-05-26.md
Production: UNCHANGED. No flag added. STRATEGY_PROOF untouched. No prod rescore.