Understanding the 30/35/25/10 Weighting of Composite Scores, Sigma Journal

InsidersTradesSigma

Why not equal weights?

Equal weighting is attractive because it looks honest. It is also a confession that you do not know which features matter more.

That can be acceptable at the prototype stage. It is less acceptable once you have enough history to test whether one family of variables consistently improves discrimination. Equal weights also create a subtle problem: they over-reward buckets that contain many correlated sub-features. A bucket with ten mildly related inputs can dominate a bucket with three stronger but cleaner inputs unless you explicitly control its contribution.

So we weight at the bucket level first, then normalise within buckets. This keeps the score interpretable and prevents feature-count inflation from masquerading as insight.
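A minimal sketch of that two-stage idea, assuming every sub-feature has already been scaled to the range 0 to 1. The function name, the feature names, and the choice of a plain average inside each bucket are illustrative assumptions, not a description of the production score. A missing bucket is scored as zero here purely for simplicity.

```python
from statistics import mean

# Bucket weights from the 30/35/25/10 split discussed in this article.
BUCKET_WEIGHTS = {
    "signal_quality": 0.30,
    "transaction_context": 0.35,
    "insider_quality": 0.25,
    "market_context": 0.10,
}

def composite_score(features):
    """features maps bucket name -> {sub_feature_name: value in [0, 1]}."""
    total = 0.0
    for bucket, weight in BUCKET_WEIGHTS.items():
        sub = list(features.get(bucket, {}).values())
        # Averaging inside the bucket: ten inputs count for no more than three.
        bucket_score = mean(sub) if sub else 0.0  # assumption: missing bucket = 0
        total += weight * bucket_score
    return total

# Hypothetical filing with made-up sub-feature values.
filing = {
    "signal_quality": {"open_market": 1.0, "discretionary": 1.0, "direct": 0.5},
    "transaction_context": {"size_vs_holdings": 0.8, "cluster": 0.4},
    "insider_quality": {"seniority": 0.6},
    "market_context": {"drawdown": 0.2},
}
score = composite_score(filing)
```

Because each bucket is collapsed to a single number before the weights are applied, adding more correlated inputs to a bucket changes its internal average but never its ceiling.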

Bucket one, signal quality at 30%

Signal quality asks a rude but necessary question: is this filing likely to reflect a discretionary information-bearing action, or is it administrative wallpaper?

This bucket includes things like:

open-market purchase versus sale
discretionary trade versus automatic or plan-based transaction
direct ownership versus indirect vehicle
one-off filing versus repetitive pattern
clean economic exposure versus derivatives or structured instruments
whether the filing appears complete, timely, and internally consistent

Why 30%, not 50%

Signal quality is foundational, but not all foundational variables deserve dominant weight. Its job is partly gating rather than forecasting. A poor-quality signal should be penalised heavily, yes. But once you have established that a filing is reasonably clean and discretionary, the marginal value of further quality distinctions falls.

That is why we stop at 30%. If you push this bucket too high, the model starts behaving like a filing-classification engine rather than an alpha-ranking system. It gets very good at telling you which transactions look proper, and less good at telling you which proper transactions matter most.

What tends to survive ablation

In most insider datasets, the strongest contributors inside this bucket are not exotic:

purchase versus sale
open-market versus non-open-market
discretionary versus automatic
transaction type exclusions, especially awards, tax-withholding disposals, and option exercises with immediate sale components

This is consistent with the literature. Open-market purchases have historically carried stronger information content than many categories of sales, in part because sales are often motivated by diversification, tax, liquidity, and estate planning, while purchases require fresh capital and usually a clearer positive view. Seyhun’s long body of work is still the unavoidable starting point here, even if one should resist treating old US findings as eternal law.

Why this bucket is not larger

The main reason is saturation. Once obvious low-information filings are filtered or penalised, the incremental gain from ever finer classifications diminishes. You can spend months perfecting a taxonomy of derivative transactions and still learn less than you would from a simple measure of whether the trade was unusually large relative to the insider’s prior stake.

That is the first principle behind the 30/35/25/10 split: cleanliness matters, but context carries more of the residual signal.

Bucket two, transaction context at 35%

This is the largest bucket because it answers the question investors actually care about: how meaningful is the trade in context?

A €50,000 purchase can be trivial for one insider and substantial for another. Ten thousand shares can be symbolic in a mega-cap and highly consequential in a small-cap. A sale after a ten-year run-up is not the same as a sale into distress. Context converts a filing from a legal event into an economic event.

What sits inside transaction context

Typical sub-features include:

trade size relative to prior holdings
trade size relative to annual compensation, where available
trade size relative to average daily volume or free float
clustering, multiple insiders buying within a short window
first purchase after a long period of inactivity
repeat buying versus one-off signalling
buying after drawdowns versus buying near highs
concentration, whether the insider is increasing or reducing exposure meaningfully

These are not cosmetic details. They are often where persistence lives.
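The relative-size measures in that list are simple ratios, and a short sketch shows how the same cash amount reads differently for two insiders. The function name, field names, and figures below are hypothetical. A ratio is left as None when its denominator is unavailable, which is one possible way to handle missing data rather than the only one.

```python
def relative_size(trade_value, prior_holding_value=None,
                  annual_cash_pay=None, avg_daily_value=None):
    """Return size ratios; a ratio is None when its denominator is missing or zero."""
    def ratio(denominator):
        return trade_value / denominator if denominator else None

    return {
        "vs_prior_holdings": ratio(prior_holding_value),
        "vs_annual_pay": ratio(annual_cash_pay),
        "vs_daily_volume": ratio(avg_daily_value),
    }

# The same 50,000 purchase for two made-up insiders.
director = relative_size(50_000, prior_holding_value=5_000_000,
                         annual_cash_pay=2_000_000)
executive = relative_size(50_000, prior_holding_value=100_000,
                          annual_cash_pay=200_000)
# director:  about 1% of holdings and 2.5% of pay
# executive: about 50% of holdings and 25% of pay
```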

Why this gets the largest weight

Because context is where insider intent becomes legible.

A chief financial officer buying a modest amount every quarter under a regular pattern is less interesting than the same officer making a first meaningful open-market purchase after eighteen months of silence. Likewise, a non-executive director buying a token amount may be less informative than a divisional executive making a purchase equal to a large fraction of annual cash pay.

An ablation study tends to show this bucket doing heavy lifting in two ways:

Cross-sectional discrimination. It helps separate merely positive filings from genuinely unusual ones.
Rank ordering among good signals. Once poor-quality filings are down-weighted, context often determines which names rise to the top decile.

This is why transaction context gets 35%, the largest single share.

The practical edge of clustering

One contextual feature deserves special mention: cluster buying. Academic work has repeatedly found that purchases by multiple insiders over a relatively short period contain more information than isolated trades. The intuition is plain enough. Independent insiders all have reasons not to buy. When several do so anyway, coincidence becomes a less satisfying explanation.

That does not mean every cluster is gold. Small boards can create accidental clusters, and blackout windows can mechanically compress activity. But as a contextual feature, clustering has been persistent enough to earn material weight.
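One simple way to express clustering is to count distinct insiders who bought within a trailing window. The sketch below assumes a 30-day window and a list of purchase records with hypothetical "insider" and "date" fields. The window length is an illustrative parameter, not a figure from the article.

```python
from datetime import date, timedelta

def cluster_size(purchases, as_of, window_days=30):
    """Count distinct insiders with a purchase in the window ending at as_of."""
    start = as_of - timedelta(days=window_days)
    return len({p["insider"] for p in purchases if start <= p["date"] <= as_of})

# Made-up purchases at a single issuer.
purchases = [
    {"insider": "A", "date": date(2024, 3, 1)},
    {"insider": "B", "date": date(2024, 3, 10)},
    {"insider": "A", "date": date(2024, 3, 12)},  # repeat buyer, counted once
    {"insider": "C", "date": date(2024, 1, 5)},   # outside a 30-day window
]
recent = cluster_size(purchases, as_of=date(2024, 3, 15))                  # 2
wider = cluster_size(purchases, as_of=date(2024, 3, 15), window_days=90)   # 3
```

Counting distinct people rather than filings keeps one enthusiastic repeat buyer from looking like a crowd. It does nothing about the small-board and blackout-window caveats above, which need separate handling.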

Why not 40% or more

Because context can become a backdoor for overfitting. Once you start adding many interaction terms, for example size times drawdown times volatility times sector stress, the bucket begins to absorb market-state noise and historical accidents. A score should reward unusual, economically meaningful behaviour, not every baroque combination that happened to work in one sample.

So 35% is a compromise. Large enough to matter. Small enough to keep the score anchored to the filing rather than the spreadsheet.

Bucket three, insider quality at 25%

Titles are useful, but they are not a substitute for thought.

Insider quality captures the idea that not all insiders possess the same information, influence, or decision rights. A founder-chief executive, a chief financial officer, a chair, and a newly appointed independent director do not stand in the same informational position. Nor do all executives have equal incentives to signal.

What we score here

This bucket typically includes:

role seniority: CEO, CFO, COO, chair, director, and so on
operational proximity to the business
tenure
historical trading behaviour
whether the insider has been informative in the past
founder status or large controlling stake
board independence versus executive responsibility

The trick is to avoid turning this into a worship index for famous job titles.

Why 25% is enough

Insider quality matters, but less than market folklore suggests. There are at least three reasons.

First, titles are coarse. A CFO at one firm may be deeply involved in capital allocation and guidance. At another, the role may be more procedural. A director may be passive, or may be a founder with superior information despite a plain label.

Second, selection effects are severe. Senior insiders often face stricter scrutiny, tighter blackout constraints, and larger reputational costs. That can reduce trade frequency and distort observed behaviour.

Third, context often dominates title. A mid-level but operationally close executive making a large discretionary purchase can be more informative than a ceremonial chair making a token trade.

So insider quality gets 25%. Large enough to capture genuine differences in informational position, but not so large that rank is predetermined by the org chart.

The historical trap

Many scoring systems quietly overweight this bucket because it is easy to explain. “The CEO bought” sounds decisive. It also fits on television.

But a model that leans too hard on seniority tends to become blunt. It underreacts to meaningful trades by less glamorous insiders and overreacts to symbolic trades by top officers. An ablation study usually shows that removing insider quality hurts performance, but not as much as removing transaction context. That is exactly the relationship implied by 35 versus 25.

Bucket four, market context at 10%

This is the smallest bucket on purpose.

Market context includes variables such as:

recent share-price drawdown or momentum
realised volatility
valuation percentile or simple multiple screens
sector-relative performance
liquidity regime
event proximity: earnings, guidance, capital raises, lock-up expiry

These variables can be useful. They can also be dangerous, because they are often the first place a model starts to confuse timing with information.

Why include it at all

Because insider trades do not occur in a vacuum. A purchase after a violent de-rating may carry a different expected profile from a purchase after a euphoric run. A sale into a liquidity crunch may deserve less negative interpretation than the same sale in calm conditions. Market context helps calibrate the likely meaning and tradability of the filing.

It also improves workflow. If two filings look equally strong on intrinsic grounds, market context may help decide which one deserves immediate attention.

Why only 10%

Because the filing should remain the protagonist.

Macro, technical, and valuation overlays often improve presentation more than prediction. They can sharpen entry timing, but they also inject regime dependence. A score that leans too heavily on market context risks becoming a generic factor model with an insider garnish.

In an ablation framework, this bucket often shows the smallest but still non-zero contribution. Remove it, and the score may become slightly less calibrated around stressed periods or less practical for execution. Keep it small, and you get the benefit without letting it drown the signal.

That is the logic of the final 10%.

The ablation study, what we test and what would justify the weights

We are not printing fresh warehouse results in this piece, so let us be explicit about the methodology rather than inventing percentages. The audit has three layers.

Step one, bucket-level standalone tests

Each bucket is first tested on its own. The question is not whether it can predict returns in isolation with heroic tuning. The question is whether it contains independent signal after basic controls.

For each bucket, we would typically examine:

rank correlation with the full composite
monotonicity of forward outcomes by score decile
stability across market regimes
breadth, how many filings the bucket can score meaningfully
susceptibility to missing data and jurisdictional inconsistencies

A strong bucket should show some standalone discrimination and broad applicability. A weak bucket may still survive if it materially improves the full model through interaction effects, but it should not command large weight.
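The decile monotonicity check can be sketched in a few lines: sort filings by bucket score, split them into ten equal groups, and see whether the average forward outcome rises from group to group. The function names are hypothetical, and the equal-count binning is one reasonable choice among several.

```python
def decile_means(scores, outcomes, n_bins=10):
    """Average forward outcome per score bin, from lowest scores to highest."""
    pairs = sorted(zip(scores, outcomes), key=lambda p: p[0])
    n = len(pairs)
    means = []
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        chunk = [outcome for _, outcome in pairs[lo:hi]]
        if chunk:  # skip empty bins when there are fewer filings than bins
            means.append(sum(chunk) / len(chunk))
    return means

def is_monotonic_increasing(values):
    """True when each bin's mean is at least as large as the previous one."""
    return all(a <= b for a, b in zip(values, values[1:]))
```

Strict monotonicity is a demanding test on noisy data. In practice one might accept a small number of inversions, or compare only the top and bottom bins.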

Step two, leave-one-bucket-out ablation

This is the core audit.

Build the full score with all four buckets. Then remove one bucket at a time and re-estimate the evaluation metrics:

top-decile versus bottom-decile spread
information coefficient or rank correlation
hit rate by horizon
turnover and implementation burden
score stability, how much rankings change when the bucket is absent

If removing transaction context causes the sharpest degradation, it deserves the largest weight. If removing market context barely matters, it deserves a small sleeve. If removing signal quality causes a large increase in false positives, that justifies a substantial but not dominant weight.
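A sketch of the leave-one-bucket-out loop, using rank correlation between score and forward return as the single metric. It assumes each filing has already been reduced to one score per bucket. Rescaling the remaining weights to sum to one after a bucket is dropped is an assumption made here for comparability. All names are hypothetical, and the Spearman helper is written out by hand only to keep the example free of dependencies.

```python
BASE = {"signal_quality": 0.30, "transaction_context": 0.35,
        "insider_quality": 0.25, "market_context": 0.10}

def drop_bucket(weights, bucket):
    """Remove one bucket and rescale the remaining weights to sum to 1."""
    kept = {k: v for k, v in weights.items() if k != bucket}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}

def score(bucket_scores, weights):
    return sum(w * bucket_scores[b] for b, w in weights.items())

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x, y):
    """Spearman rank correlation; undefined if either input is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def ablation(filings, forward_returns, weights=BASE):
    """Drop in rank correlation when each bucket is removed in turn."""
    full = spearman([score(f, weights) for f in filings], forward_returns)
    result = {}
    for bucket in weights:
        reduced = drop_bucket(weights, bucket)
        ic = spearman([score(f, reduced) for f in filings], forward_returns)
        result[bucket] = full - ic  # positive: removing the bucket hurt
    return result
```

The same loop extends to the other metrics in the list by swapping the spearman call for a decile spread, a hit rate, or a turnover measure.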

Step three, perturbation around the baseline weights

Once the relative ordering of bucket importance is established, test nearby weight sets:

25/35/30/10
30/30/30/10
35/30/25/10
30/40/20/10
30/35/20/15

The objective is not to find the single best in-sample vector. It is to identify a plateau of good performance. If 30/35/25/10 sits in a stable region, that is preferable to a brittle optimum like 27/38/23/12 that wins by a whisker on one sample and loses its nerve out of sample.
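One way to look for that plateau is to measure how much of the baseline top-k list survives under each nearby weight set. The sketch below assumes the four numbers follow the article's bucket order (signal quality, transaction context, insider quality, market context) and that each filing carries a tuple of four bucket scores. The field names and the choice of top-k overlap as the stability measure are illustrative.

```python
BASELINE = (30, 35, 25, 10)
NEARBY = [
    (25, 35, 30, 10),
    (30, 30, 30, 10),
    (35, 30, 25, 10),
    (30, 40, 20, 10),
    (30, 35, 20, 15),
]

def top_k(filings, weights, k):
    """Ids of the k highest-scoring filings under the given weights."""
    def total(f):
        return sum(w * s for w, s in zip(weights, f["buckets"]))
    ordered = sorted(filings, key=total, reverse=True)
    return {f["id"] for f in ordered[:k]}

def top_k_overlap(filings, weights, k=10):
    """Share of the baseline top-k that is still in the top-k under weights."""
    k = min(k, len(filings))
    base = top_k(filings, BASELINE, k)
    return len(base & top_k(filings, weights, k)) / k

def stability_report(filings, k=10):
    """Overlap with the baseline top-k for every nearby weight set."""
    return {w: top_k_overlap(filings, w, k) for w in NEARBY}
```

If every nearby set keeps most of the baseline names, the baseline sits on a plateau. If one small shift replaces half the list, the weights are doing more work than they should.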

What “good” looks like

A credible weight choice should satisfy four conditions:

Directional consistency. The same buckets matter across subperiods.
Ranking robustness. Small changes in weights do not scramble the top names.
Economic plausibility. The score still reflects how insiders actually trade.
Data resilience. Missing or noisy fields do not collapse performance.

This is less glamorous than machine-learning folklore, but much more useful.

Market	Regulator	Rule	Deadline	Notes
FR	AMF	MAR Art 19	T+3	Persons discharging managerial responsibilities and closely associated persons must notify transactions within three business days, subject to MAR framework.
EU	ESMA	MAR Art 19	T+3	ESMA Q&A and technical guidance shape interpretation across member states, but implementation details can still vary.
US	SEC	Exchange Act Section 16, Form 4	T+2	Reporting universe and transaction coding differ materially from EU practice, especially around beneficial ownership and plan transactions.

Disclosure regimes influence filing timeliness, coding consistency, and therefore the reliability of score components.

Why this matters for the weights

Signal quality and insider quality are the most exposed to jurisdictional classification differences. Role labels, beneficial ownership structures, and transaction codes do not line up perfectly across markets. Transaction context is somewhat more portable, because relative size and clustering can often be reconstructed even when labels are messy. Market context is portable in theory, but can import local microstructure quirks.

This is another reason not to overweight insider quality or market context. In cross-market datasets, they can be useful but fragile.

The regulator’s view versus the investor’s view

Regulators care about prompt disclosure and market integrity. Investors care about information content. Those are related, but not identical objectives.

A filing that satisfies disclosure rules may still be economically uninteresting. Conversely, a small but unusual discretionary trade can be highly interesting even if it looks routine to a compliance team. The score therefore needs to respect the legal structure without becoming captive to it.

What would make us change 30/35/25/10

Weights should be stable, not frozen.

There are several conditions under which we would revisit the split.

If transaction coding improves materially

If data vendors or internal parsers become much better at distinguishing discretionary from non-discretionary trades across jurisdictions, signal quality might deserve a somewhat larger share. Better labels increase the value of that bucket.

If insider history becomes richer

If we build longer, cleaner longitudinal histories for individual insiders across issuers and roles, insider quality could improve. A title is weak. A title plus a documented history of informative trading is much stronger.

If market context starts dominating only in specific regimes

If market overlays prove useful only during stress periods, we might keep the 10% baseline but introduce regime-conditional scaling rather than permanently increasing the bucket. That would preserve discipline while acknowledging state dependence.
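Regime-conditional scaling could be as simple as multiplying the market-context weight during stress and renormalising. The sketch below is an assumption about how that might look. The function name and the scale factor of 2.0 are hypothetical, and the definition of a stressed regime is left to the caller.

```python
BASE_WEIGHTS = {"signal_quality": 0.30, "transaction_context": 0.35,
                "insider_quality": 0.25, "market_context": 0.10}

def regime_weights(base, stressed, market_scale=2.0):
    """Scale the market-context weight in stressed regimes, then renormalise."""
    weights = dict(base)
    if stressed:
        weights["market_context"] *= market_scale
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

calm = regime_weights(BASE_WEIGHTS, stressed=False)    # unchanged 30/35/25/10
stress = regime_weights(BASE_WEIGHTS, stressed=True)   # market context about 18%
```

The baseline split stays the default, and the extra weight on market context is borrowed proportionally from the other three buckets only while the stress flag is on.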

If ablation shows a flatter frontier

Suppose several nearby weight sets perform similarly and rankings remain stable. In that case, we might choose the simpler or more interpretable split, even at the cost of a tiny statistical concession. Models are for use, not admiration.

The boring virtue of a good score

A good composite score should feel slightly disappointing to anyone hoping for alchemy.

It should not require twenty hidden interactions. It should not collapse when one field is missing. It should not produce a completely new top ten because a stock moved 3% yesterday. It should mostly reward filings that are discretionary, economically meaningful, executed by relevant insiders, and not obviously explained away by routine mechanics.

That is what the 30/35/25/10 split is trying to do.

The score is not a proclamation that one bucket is “true” and another is “secondary”. It is a statement about marginal contribution under uncertainty. Transaction context earns the lead because it most often turns a filing into a signal. Signal quality comes next because garbage deserves a firm escort to the exit. Insider quality matters, but less than title-driven storytelling implies. Market context helps, but should know its place.

The next concrete step is straightforward: publish the live ablation table from the 162,000-filing warehouse, with subperiod splits and rank-stability metrics, then see whether 30/35/25/10 still sits on the robust plateau or whether one bucket has quietly been living on old reputation.