Understanding Multiple-Testing Inflation in Quant Research, Sigma Journal

InsidersTradesSigma

This article is about method rather than marketing. There is no live database extract attached to this brief, so where article-specific performance figures would normally appear, we state n/a rather than decorate the page with fiction. The underlying issue does not require invention. It requires discipline.

The problem is not overfitting alone; it is over-reporting

A common description of this problem is "overfitting". Accurate, but incomplete. Overfitting says the chosen model has adapted too closely to historical noise. Multiple-testing inflation says something slightly more embarrassing: after enough attempts, even entirely unremarkable data will produce apparently significant winners.

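If you run one test at a 5 percent significance threshold, a false positive is a 1-in-20 nuisance. If you run thousands of tests, false positives stop being a nuisance and become inventory. The expected count of lucky winners rises in proportion to the number of tests: at that threshold, 583,000 tests of true nulls would be expected to produce roughly 29,150 of them.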
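The arithmetic can be checked in a few lines. This is a minimal sketch that assumes every null hypothesis is true and, for the family-wise probability, that the tests are independent. Neither assumption holds exactly in a real strategy grid, and the function names are ours.

```python
import math


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of false rejections when every null is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float = 0.05) -> float:
    """P(at least one false rejection) for independent tests: 1 - (1 - alpha)^n.

    Written with log1p/expm1 so the result stays accurate for very large n.
    """
    return -math.expm1(n_tests * math.log1p(-alpha))


for n in (1, 20, 1_000, 583_000):
    print(n, expected_false_positives(n), prob_at_least_one(n))
```

The expected count grows linearly with the number of tests, while the chance of at least one false positive saturates near one long before the search reaches industrial scale.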

Why 583,000 trials is not a footnote

Suppose a researcher searches across:

event definitions,
filing filters,
insider-role subsets,
holding periods,
liquidity screens,
market-cap buckets,
rebalancing rules,
weighting schemes,
exclusion windows,
and transaction-size thresholds.

None of these choices is absurd. Most are defensible. Taken together, they create a combinatorial machine for producing "best" results.

The usual defence is that many combinations are correlated, so the effective number of independent tests is lower than 583,000. True. It is also irrelevant if used as an excuse to ignore the issue. Correlation among tests reduces the inflation, but does not remove it. A search over a highly structured parameter space still creates a substantial chance that the top-ranked result is a statistical souvenir.

The winner's curse, quant edition

The best-performing specification in a large search is biased upward by construction. It won the contest partly because it captured signal, if any exists, and partly because it got lucky. The larger the contest, the larger the expected luck component in the winner.
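A small simulation makes the luck component visible. The sketch below assumes strategies with zero true edge and independent Gaussian returns, which is a simplification; the volatility, sample length, and seed are arbitrary illustrative choices, and the Sharpe ratios are per-period rather than annualised.

```python
import math
import random


def sharpe(returns):
    """Per-period Sharpe ratio: sample mean over sample standard deviation."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var)


def null_sharpes(n_strategies, n_periods=252, seed=0):
    """Sharpe ratios of strategies whose true expected return is exactly zero."""
    rng = random.Random(seed)
    return [
        sharpe([rng.gauss(0.0, 0.01) for _ in range(n_periods)])
        for _ in range(n_strategies)
    ]


sharpes = null_sharpes(1000)
for k in (1, 10, 100, 1000):
    # The best of the first k candidates can only rise as k grows.
    print(f"best of first {k:>4} null strategies: {max(sharpes[:k]):+.3f}")
```

Every strategy here is worthless by construction, yet the best of the batch looks progressively better as the batch grows. That gap between the winner and zero is pure selection.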

This is the same logic as auction theory's winner's curse, except the object being overpaid for is a Sharpe ratio.

What a 583k search does to p-values, Sharpe ratios, and common sense

The statistical effect of broad search is not confined to classical hypothesis tests. It contaminates nearly every summary investors like to quote.

P-values become promotional material unless adjusted

The textbook family-wise error problem is straightforward. If you test many hypotheses, the probability of at least one false rejection increases. Bonferroni correction is the blunt instrument: divide the significance level by the number of tests. It is conservative, often too conservative when tests are dependent, but it has the virtue of not flattering the researcher.
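As a rough illustration of how blunt the instrument is, the sketch below applies Bonferroni to a search of 583,000 tests. The function names are ours, and the conversion to a z-score assumes a two-sided test with a standard normal test statistic.

```python
from statistics import NormalDist


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise rate at alpha."""
    return alpha / n_tests


def bonferroni_reject(p_values, alpha: float = 0.05):
    """Flag the hypotheses whose p-value clears the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


threshold = bonferroni_threshold(0.05, 583_000)
# Two-sided z-score a single strategy would need to clear that threshold.
z_required = NormalDist().inv_cdf(1.0 - threshold / 2.0)
print(f"per-test threshold: {threshold:.3e}, required |z|: {z_required:.2f}")
```

The required z-score is far above the familiar 1.96, which is the point: the hurdle moves with the size of the family.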

In strategy research, however, the issue is usually worse than "many p-values". Researchers often search over parameter grids, rank by backtest performance, and then report the p-value of the selected strategy as if it had been specified in advance. It was not. The p-value is therefore conditional on a hidden tournament.

Sharpe ratios are not immune

A high in-sample Sharpe ratio discovered after extensive search is not evidence in itself. Bailey and López de Prado formalised this in the Deflated Sharpe Ratio, which adjusts the observed Sharpe for non-normal returns, sample length, and the number of trials considered. The key idea is plain enough even without the formula: the hurdle for "impressive" rises with the breadth of the search.

If one strategy out of 583,000 variants shows a sparkling in-sample Sharpe, the relevant question is not "is this Sharpe above zero?" It is "is this Sharpe still unusual once we account for the fact that we went fishing with industrial equipment?"

Dependence matters, but not in the way optimists hope

In a parameter grid, neighbouring specifications are often similar. A 19-day holding period and a 20-day holding period are not independent universes. This dependence lowers the effective number of independent tests relative to the raw count.

Good. Use that fact properly.

Bad practice is to wave at dependence and therefore report no correction at all. Better practice is to estimate an effective number of trials, or use methods designed for correlated model searches, such as White's Reality Check or Hansen's Superior Predictive Ability framework. These methods ask whether the best model is genuinely superior once the search process is acknowledged.
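One way to put a number on "effective trials" is an eigenvalue-variance heuristic of the kind used in genetic association studies (the Cheverud and Nyholt approach). The sketch below is an assumption on our part rather than a method prescribed for strategy grids: it takes the correlation matrix of the candidate strategies' returns and shrinks the raw count according to how unequal the eigenvalues are. The equicorrelated matrix is a toy input.

```python
def effective_tests(corr):
    """Eigenvalue-variance estimate: M_eff = 1 + (M - 1) * (1 - Var(eig) / M).

    For a correlation matrix the eigenvalues average 1 and their squares sum
    to the sum of squared entries, so this reduces to M + 1 - sum(c^2) / M
    and needs no eigendecomposition.
    """
    m = len(corr)
    sum_sq = sum(c * c for row in corr for c in row)
    return m + 1 - sum_sq / m


def equicorrelated(m, rho):
    """Toy correlation matrix: 1 on the diagonal, rho everywhere else."""
    return [[1.0 if i == j else rho for j in range(m)] for i in range(m)]


for rho in (0.0, 0.5, 0.9, 1.0):
    print(rho, effective_tests(equicorrelated(100, rho)))
```

Uncorrelated strategies leave the count unchanged and perfectly correlated ones collapse it to one. Anything in between gives a reduced but non-trivial denominator, which should then be fed into the correction rather than used as a reason to skip it.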

The toolkit for deflating a giant parameter sweep

There is no single universal correction. Different tools answer different questions. The sensible approach is layered.

White, Hansen, and the family of reality checks

White's Reality Check

White's Reality Check tests whether the best-performing model in a set is significantly better than a benchmark after accounting for data snooping across the entire family. It uses bootstrap methods to approximate the distribution of the maximum performance statistic under the null.
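The mechanics can be sketched as follows. This is a simplified illustration, not a reference implementation: it takes each strategy's per-period performance in excess of the benchmark, resamples time indices with a stationary bootstrap (random block lengths, wrapping at the end of the sample), and compares the observed maximum with the bootstrap distribution of the recentred maximum. The mean block length, bootstrap count, and simulated inputs are arbitrary assumptions.

```python
import math
import random


def stationary_bootstrap_indices(n, mean_block, rng):
    """Resampled time indices built from blocks of random (geometric) length."""
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < 1.0 / mean_block:
            idx.append(rng.randrange(n))      # start a new block
        else:
            idx.append((idx[-1] + 1) % n)     # continue the current block
    return idx


def reality_check_pvalue(excess, n_boot=500, mean_block=10, seed=0):
    """Bootstrap p-value for the best strategy's mean excess performance.

    `excess[k][t]` is strategy k's performance minus the benchmark at time t.
    The same resampled indices are applied to every strategy so that the
    cross-sectional dependence between strategies is preserved.
    """
    rng = random.Random(seed)
    n = len(excess[0])
    means = [sum(d) / n for d in excess]
    observed = math.sqrt(n) * max(means)
    exceed = 0
    for _ in range(n_boot):
        idx = stationary_bootstrap_indices(n, mean_block, rng)
        boot_max = max(
            math.sqrt(n) * (sum(d[i] for i in idx) / n - m)
            for d, m in zip(excess, means)
        )
        if boot_max >= observed:
            exceed += 1
    return exceed / n_boot


# Ten simulated strategies with no true edge over the benchmark.
rng = random.Random(42)
null_family = [[rng.gauss(0.0, 0.01) for _ in range(200)] for _ in range(10)]
print("p-value for the best of 10 null strategies:",
      reality_check_pvalue(null_family, n_boot=200))
```

The important design choice is that the null distribution is built for the maximum across the whole family, not for the one strategy that happened to win.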

Its strengths are conceptual honesty and direct relevance to model search. Its weaknesses are limited power and sensitivity to the inclusion of many poor alternatives: if you include a zoo of hopeless models, the test becomes more conservative.

Hansen's Superior Predictive Ability test

Hansen's SPA test was designed to improve on some of those power issues. It studentises the performance statistics and limits the influence of very poor alternatives on the null distribution, so it can provide a more informative assessment of whether any candidate genuinely outperforms the benchmark.

For a large correlated search space, SPA is often more practical than pretending Bonferroni has solved your life.

Why these tests suit strategy research

These methods are useful because they align with the actual workflow. We do not usually care whether one pre-specified coefficient differs from zero in isolation. We care whether the best strategy found after broad search is better than the benchmark in a way that survives the search process.

That is a different question, and it deserves a different test.

Deflated Sharpe Ratio

The idea

The Deflated Sharpe Ratio, associated with Bailey and López de Prado, adjusts the observed Sharpe ratio for:

non-normality of returns,
finite sample effects,
and multiple trials.

It converts "this Sharpe looks nice" into "this Sharpe is still unusual given how many chances we gave ourselves to find something nice".
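For readers who want the formula in executable form, here is a minimal sketch of the calculation as we understand it from Bailey and López de Prado. The inputs are assumptions the researcher must supply: the Sharpe ratio in per-period (not annualised) units, the number of return observations, the number of trials (ideally an effective, independent count), the variance of Sharpe ratios across the trials, and the skewness and kurtosis of the selected strategy's returns. The example values at the bottom are invented purely for illustration.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sharpe_variance: float) -> float:
    """Approximate expected maximum Sharpe among n_trials (> 1) zero-skill trials."""
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe_ratio(sr, n_obs, n_trials, sharpe_variance, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the luck-adjusted hurdle.

    `kurt` is raw kurtosis (3.0 for a normal distribution), not excess kurtosis.
    """
    hurdle = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _NORMAL.cdf((sr - hurdle) * math.sqrt(n_obs - 1) / denom)


# Hypothetical inputs: same observed Sharpe, different search breadth.
for n_trials in (10, 1_000, 583_000):
    print(n_trials, deflated_sharpe_ratio(0.1, 1000, n_trials, 0.01))
```

Holding the observed Sharpe fixed, the deflated figure falls as the number of trials rises, because the hurdle it must clear is the Sharpe one would expect from the luckiest of that many zero-skill attempts.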

Why we like it

It speaks the language strategy researchers actually use. Sharpe ratios are familiar, flawed, and still unavoidable. Deflating them is more useful than worshipping them.

What it does not do

It does not rescue a weak process. If your search leaked information, ignored costs, or repeatedly revised the benchmark after seeing the results, no elegant ratio will save you. Statistics can penalise luck. They cannot reverse dishonesty.

A large insider-filings dataset invites parameterisation. That is not a vice. The vice is selective memory about what was tried.

Step 1. Define the economic hypotheses before the fine grid

We start with higher-level hypotheses, for example:

clustered open-market buys by senior executives may signal undervaluation,
sales may be less informative because of diversification and tax motives,
larger trades relative to prior holdings may carry more information,
shorter post-filing windows may capture faster information diffusion,
and microcaps may show stronger raw effects but worse implementability.

These are economic ideas. The parameter grid comes after, not before.

Step 2. Enumerate the full model family

Once the hypotheses are set, we generate the strategy family systematically. This includes all combinations of:

event filters,
role classifications,
size thresholds,
portfolio construction rules,
entry timing,
exit timing,
rebalancing cadence,
and cost assumptions.

The point is reproducibility. If a specification was available to be chosen, it belongs in the count.
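In practice the family can be generated from a single declared grid, so the trial count is a by-product of the code rather than a recollection. The sketch below uses a hypothetical grid: every parameter name and value is an illustrative placeholder, not the specification behind any reported result.

```python
import itertools
import math

# Hypothetical grid for illustration only.
GRID = {
    "event_filter": ["open_market_buy", "any_buy"],
    "role": ["ceo_cfo", "officers", "all_insiders"],
    "min_trade_usd": [50_000, 250_000, 1_000_000],
    "weighting": ["equal", "value"],
    "entry_lag_days": [0, 1, 2],
    "holding_days": [5, 20, 60, 120],
    "rebalance": ["weekly", "monthly"],
    "cost_bps": [5, 20],
}


def enumerate_family(grid):
    """Yield every specification in the grid as a dict of parameter values."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def family_size(grid) -> int:
    """Number of specifications available to be chosen: the denominator."""
    return math.prod(len(values) for values in grid.values())


print("specifications in family:", family_size(GRID))
```

Even this modest grid produces well over a thousand candidates. Adding a few more values per dimension is how a search reaches six figures without anyone feeling they did anything unusual.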

Step 3. Partition the data honestly

At minimum, we separate:

a discovery sample,
a validation sample,
and ideally a final holdout or forward period.

For time-series and event-driven data, this partition should respect chronology. Randomly shuffling observations can produce tidy cross-validation metrics and untidy real-world disappointment.
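A chronological partition is simple to express. The sketch below assumes each event is a dict with a hypothetical `filed` date field, and the cut-off dates are arbitrary. It ignores holding periods that straddle a boundary, which a production split would need to handle.

```python
from datetime import date


def chronological_split(events, discovery_end, validation_end):
    """Split events by filing date into discovery, validation, and holdout sets."""
    discovery = [e for e in events if e["filed"] <= discovery_end]
    validation = [e for e in events if discovery_end < e["filed"] <= validation_end]
    holdout = [e for e in events if e["filed"] > validation_end]
    return discovery, validation, holdout


# Illustrative events only.
events = [
    {"ticker": "AAA", "filed": date(2014, 3, 10)},
    {"ticker": "BBB", "filed": date(2017, 8, 2)},
    {"ticker": "CCC", "filed": date(2019, 11, 20)},
    {"ticker": "DDD", "filed": date(2022, 5, 5)},
]
discovery, validation, holdout = chronological_split(
    events, discovery_end=date(2016, 12, 31), validation_end=date(2020, 12, 31)
)
print(len(discovery), len(validation), len(holdout))
```

Because membership depends only on the date, no observation in the validation or holdout sets can precede one in the discovery set.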

Step 4. Rank candidates using pre-declared criteria

We choose ranking criteria before looking at the leaderboard. Typical examples are:

net Sharpe,
t-statistic of alpha,
drawdown-adjusted return,
turnover-adjusted information ratio.

The criterion matters because changing it after seeing the results creates another layer of hidden search. One can data-mine the objective function as efficiently as the parameters.

Step 5. Apply search-aware inference

At this stage, the best candidate is not yet a finding. It is a nominee. We then apply search-aware evaluation, using one or more of the methods described above.

Step 6. Publish the range, not just the champion

A credible article shows:

where the chosen specification sits in the distribution,
whether nearby specifications behave similarly,
how performance changes across subperiods,
and what happens after costs and delays.

If the winner is isolated and fragile, readers should see that plainly.
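Two of these checks are easy to compute once the full leaderboard is retained. The sketch below shows where a chosen specification ranks within the family and how its immediate neighbours on a single parameter axis performed. The scores are invented for illustration and the one-dimensional neighbourhood is a simplification of a real multi-parameter grid.

```python
def percentile_rank(value, population):
    """Fraction of the population that scored strictly below `value`."""
    return sum(1 for x in population if x < value) / len(population)


def neighbourhood_mean(scores_by_param, chosen, radius=1):
    """Average score of the specifications adjacent to `chosen` on one axis."""
    params = sorted(scores_by_param)
    i = params.index(chosen)
    lo, hi = max(0, i - radius), min(len(params), i + radius + 1)
    neighbours = [scores_by_param[p] for p in params[lo:hi] if p != chosen]
    return sum(neighbours) / len(neighbours)


# Hypothetical net Sharpe by holding period in days.
scores = {5: 0.1, 10: 0.2, 19: 0.3, 20: 1.5, 21: 0.2}
print("rank of the 20-day spec:", percentile_rank(scores[20], list(scores.values())))
print("mean of its neighbours:", neighbourhood_mean(scores, 20))
```

A winner that tops the distribution while its neighbours sit near the bottom is the isolated, fragile case that readers should be shown.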

What readers should demand from any article built on large-scale search

The burden is not only on researchers. Readers, allocators, and editors should ask better questions.

Show me the denominator

If someone presents a strategy with a handsome in-sample Sharpe, ask: out of how many trials was this selected? If the answer is vague, the result is too.

Show me the holdout

If there is no untouched sample, there is no clean evidence. There is only iterative persuasion.

Show me the neighbourhood

Does the strategy work across nearby parameter values, or only at one oddly specific setting? Robustness is not glamorous, but it is where genuine signal tends to live.

Show me costs and operational constraints

Insider-based strategies can involve publication lags, liquidity issues, and concentration in small names. Gross returns are often the easy part. Net returns are where fiction goes to be audited.
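The audit itself is basic arithmetic. The sketch below assumes a flat proportional cost per unit of turnover, which understates the impact costs typical of illiquid small names; the returns, turnover figures, and cost level are illustrative placeholders.

```python
def net_returns(gross_returns, turnover, cost_bps):
    """Subtract proportional trading costs from per-period gross returns.

    `turnover` is the fraction of the portfolio traded in each period and
    `cost_bps` is the assumed cost per unit traded, in basis points.
    """
    cost_rate = cost_bps / 10_000
    return [g - t * cost_rate for g, t in zip(gross_returns, turnover)]


gross = [0.010, 0.020, -0.005]
traded = [1.0, 0.5, 0.8]
for cost in (5, 20, 50):
    net = net_returns(gross, traded, cost)
    print(cost, [round(r, 4) for r in net])
```

Running the same backtest at several cost levels shows how much of the headline figure survives contact with the order book.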

Show me the failed siblings

A lone successful specification tells you less than the family portrait. If hundreds of close relatives failed, the survivor may be lucky rather than superior.

Why we are explicit about 583,000 trials

There is a temptation to think that disclosing a huge search count somehow weakens the article. It does the opposite. It tells the reader that the research process has not been airbrushed.

A broad search can be entirely legitimate when the underlying phenomenon is complex and the design space is large. Insider-transaction signals are exactly that sort of problem. But legitimacy requires accounting. If 583,000 combinations were explored, then 583,000 combinations belong in the interpretive frame.

The alternative is familiar. A polished chart appears. The article speaks confidently about "the strategy". The search process remains backstage. The reported significance assumes a world in which the chosen specification was ordained from the start. It was not. It won a tournament.

That does not make the strategy false. It makes unadjusted certainty false.

A note on humility, which is cheaper than drawdowns

There is a cultural problem here as much as a statistical one. Quant research often rewards the reveal, not the restraint. The incentive is to present the best result with just enough caveat to satisfy the legal department and not enough to disturb the sales team.

We prefer a duller standard. If a signal survives broad search, out-of-sample testing, and search-aware adjustment, it is more interesting, not less. If it does not, the research was still useful. It mapped where the edge is not.

That is not failure. It is inventory control for bad ideas.

The concrete next step is straightforward: for every future insider-signal article built on broad parameter search, publish the trial count, the selection protocol, and at least one search-aware statistic alongside the headline backtest. The open question is the harder one, and the more valuable one: what is the effective number of independent tests in a highly correlated insider-strategy grid, and how stable is that estimate across market regimes?
