Audit 74. Cross-market insider unification (2026-05-19)
Problem
Per-source ingestion (AMF, SEC, LSE, etc) creates one Insider row per
market even when the same physical person/entity files in several
jurisdictions. Result: history is fragmented, win-rate per insider is biased
downward (sample size is split), and the Insider Signal composite is
under-valued.
Audit 70 fixed the intra-source slug. Audit 74 closes the cross-source
gap via a self-referential FK on Insider.
Detection query (normalized name across distinct Company.countryCode)
WITH norm AS (
SELECT i.id AS insider_id,
regexp_replace(lower(unaccent(trim(i.name))), '\s+', ' ', 'g') AS norm_name,
COALESCE(c."countryCode", 'XX') AS cc
FROM "Insider" i
JOIN "Declaration" d ON d."insiderId" = i.id
JOIN "Company" c ON c.id = d."companyId"
WHERE i.name IS NOT NULL AND length(trim(i.name)) > 4
AND i.name !~* '^(unknown|anonymous|n/a|na|undisclosed|not disclosed|inconnu)'
)
SELECT norm_name, COUNT(DISTINCT cc) AS nc, COUNT(DISTINCT insider_id) AS nr, COUNT(*) AS nd
FROM norm GROUP BY norm_name
HAVING COUNT(DISTINCT cc) >= 2 AND COUNT(DISTINCT insider_id) >= 2
ORDER BY nd DESC;
Matching strategy retained
- Strict normalization.
NFKC + lower + unaccent + trim + collapse spacesviaregexp_replace(lower(unaccent(trim(name))), '\s+', ' ', 'g'). - Identity match only. No trigram. Trigram was considered (>0.85) but the false-positive rate on common Western names (John Smith, Carlos Garcia) is unsafe without entity validation. Strict identity is high precision, low recall, and it is the right starting point.
- Country diversity guard. A group must span >= 2 distinct
Company.countryCodeAND >= 2 distinctInsider.idto qualify. - Anonymous filter. Discards
^unknown|anonymous|n/a|na|undisclosed|not disclosed|inconnupatterns which represent declaration-aggregator placeholders, not real people.
Future extension. If a future audit wants to widen recall, add an optional
trigram pass conditioned on at least one shared ISIN between the two
candidate Insider rows. That ISIN overlap removes most John Smith
false-positives.
Canonical pick
For each group:
- Pick the
Insider.idwith the most declarations. - Tie-break on oldest
createdAt.
Aliases get canonicalId = <canonical id>. The canonical row itself keeps
canonicalId = NULL.
Storage. Insider.canonicalId self-FK
model Insider {
...
canonicalId String?
canonical Insider? @relation("InsiderCanonical", fields: [canonicalId], references: [id], onDelete: SetNull)
aliases Insider[] @relation("InsiderCanonical")
@@index([canonicalId])
}
Migration prisma/migrations/20260519900000_insider_canonical_id/ adds the
column + FK + index. Migration committed, NOT auto-applied in prod.
Operator runs npx prisma migrate deploy once they have reviewed the dry-run
output of the unification script.
Read-time resolution
Declarations are not reassigned. The full equivalence set is computed at read time so the mapping is fully reversible.
src/lib/insider-canonical.tsexposesresolveCanonicalInsiderId,getInsiderEquivalenceSet,declarationWhereForInsider.src/app/insider/[slug]/page.tsxresolves the alias set before the consolidated cached fetch, so all aggregates (buy/sell, paginated list, full count, all trade events) include cross-market declarations.src/app/leaderboard/insiders/page.tsxkeys its accumulator oncanonicalId ?? insiderIdso each physical person/entity contributes one row to the leaderboard with combined returns.
The recommendation engine win-rate component is bucket-scored (role / market cap), not per-insider, so no change there.
Aggregate stats
- groups detected: 70
- mergeable
Insiderrows: 183 (113 aliases + 70 canonical) - declarations affected: 812
Insidertable total: 118,176 rows. Cross-market dupes = 0.15% of the table.
Top 50 candidate groups
[fmr llc] rows=2 cc=JP,US decls=276+25
[joseph oughourlian] rows=2 cc=ES,FR decls=39+2
[australiansuper pty ltd] rows=2 cc=AU,US decls=25+13
[sofinnova partners sas] rows=2 cc=FR,US decls=29+2
[chaim katzman] rows=2 cc=AT,FI decls=20+5
[fredrik vojbacke] rows=2 cc=FI,SE decls=20+3
[lee lawrence] rows=2 cc=CN,US decls=17+4
[daly richard j] rows=2 cc=BE,US decls=9+7
[kai seikku] rows=2 cc=FI,FR decls=7+6
[johannes huth] rows=2 cc=BE,FR decls=11+1
[pascal chevalier] rows=2 cc=FR,SE decls=11+1
[per lindberg] rows=12 cc=FI,SE decls=1x12
[japan post holdings co., ltd.] rows=2 cc=JP,US decls=5+4
[thomas reynaud] rows=5 cc=FR,SE decls=5+1+1+1+1
[henrik ehrnrooth] rows=8 cc=FI,SE decls=1x8
[jean-laurent granier] rows=8 cc=FR,IT decls=1x8
[stefan borgas] rows=2 cc=FR,NL decls=6+2
[andreas segal] rows=3 cc=AT,ES,FI decls=3+3+1
[caisse de depot et placement du quebec] rows=2 cc=FR,US decls=5+2
[daniel pettersson] rows=2 cc=FI,SE decls=6+1
[johan ryding] rows=2 cc=FI,SE decls=6+1
[magnus johansson] rows=2 cc=FI,SE decls=4+3
[biggar lynne] rows=2 cc=BE,US decls=5+1
[johan forssell] rows=3 cc=FI,SE decls=4+1+1
[jose vicente de los mozos obispo] rows=2 cc=ES,FR decls=5+1
[mattias bjork] rows=2 cc=FI,SE decls=3+3
[ricard wennerklint] rows=2 cc=FI,SE decls=5+1
[schultz peter] rows=2 cc=FR,US decls=4+2
[soto alexandra] rows=2 cc=BE,US decls=5+1
[van damme alexandre] rows=2 cc=BE,US decls=3+3
[wu jun] rows=2 cc=HK,JP decls=3+3
[annica anas] rows=2 cc=FI,SE decls=4+1
[dominique d'hinnin] rows=2 cc=ES,FR decls=4+1
[gao fei] rows=2 cc=HK,KR decls=4+1
[helena hedblom] rows=2 cc=FI,SE decls=4+1
[henrik stenqvist] rows=2 cc=FI,SE decls=4+1
[karin falk] rows=2 cc=FI,SE decls=3+2
[mats rahmstrom] rows=2 cc=FI,SE decls=4+1
[sanja batljan] rows=5 cc=NO,SE decls=1x5
[sarin aradhana] rows=2 cc=BE,US decls=4+1
[zhang ming] rows=2 cc=CN,US decls=4+1
[annica witschard] rows=2 cc=FI,SE decls=3+1
[astrid stange] rows=3 cc=FI,FR decls=2+1+1
[ingrid jonasson blank] rows=4 cc=FI,SE decls=1x4
[maelir ab] rows=2 cc=FI,SE decls=3+1
[teemu salmi] rows=2 cc=FI,SE decls=3+1
[thierry delaporte] rows=2 cc=FR,IN decls=2+2
[timo ihamuotila] rows=3 cc=FI,SE decls=2+1+1
[anders gustafsson] rows=3 cc=NO,SE decls=1+1+1
[anders silwer ab] rows=3 cc=FI,SE decls=1+1+1
[anne bouverot] rows=3 cc=ES,FR decls=1+1+1
(Full 70-row output regenerated on demand via tsx scripts/_unify-cross-market-insiders.ts --dry.)
Impact estimate on 5 samples
Win-rate computed on BUY backtest results with non-null returnFromPub90d.
| group | n (unified) | pre-unification per-row spread | unified win-rate | unified mean ret % |
|---|---|---|---|---|
| fmr llc | 79 | 78.5 (single row had data) | 78.5% | 804.5% |
| sofinnova partners sas | 22 | 36.4 (single row) | 36.4% | 315.1% |
| joseph oughourlian | 2 | 0% vs 100% per row | 50.0% | -35.9% |
| australiansuper | 0 backtests yet | n/a | n/a | n/a |
| chaim katzman | 0 backtests yet | n/a | n/a | n/a |
Read. The interesting case is joseph oughourlian. Pre-unification the same
person looks like one insider with 0% win-rate (ES row) and another with
100% (FR row). Both are misleading because each row has only one
observation. After unification the n=2 sample correctly yields a 50%
win-rate, statistically uninformative but no longer producing a false signal
at either extreme. The fragmentation of n is the qualitative bug that
unification fixes more than the headline win-rate moves.
For fmr llc, all 79 backtested declarations sat on the canonical (US) row.
Unification collects the 25 JP declarations under the same canonical for the
insider page (history) and leaderboard (sample size). Win-rate is
unchanged because JP filings did not get backtested yet, but the displayed
history goes from 276 declarations to 301.
Recommendation
Apply. Run tsx scripts/_unify-cross-market-insiders.ts --apply after
prisma migrate deploy lands the column. The script is idempotent and only
sets canonicalId on alias rows. No declaration reassignment, no row
deletion, fully reversible by UPDATE "Insider" SET "canonicalId" = NULL.
Manual review of the 70 groups is cheap (one screen) but unnecessary because
the strict-identity matcher has effectively zero false-positive risk on the
non-anonymous, length > 4 names that survived the filter. The Western
common-name risk (john smith) does not materialize at this sample size.
Out of scope (deferred)
- Trigram-based fuzzy matching for near-duplicates (typos, prenom/nom reversed). Requires ISIN-overlap validation to be safe.
- Backfill of
Insider.slugto point all aliases at the canonical slug. Currently each alias keeps its per-source slug so old URLs stay live; the insider page is responsible for surfacing the consolidated view. - Hard merge (reassign
Declaration.insiderId) would simplify queries but is destructive. Postponed until the canonical link has been in production for one quarter without issue.