71 · Insider Dedup Pipeline
Date: 2026-05-19 Scope: kill the per-filing slug-hash bug that fragmented 25,642 Insider rows across 28 regulators (NL=91%, DK=91%, ES=77%, AT=75%, BE=70%, IT=70%, FI=63%, DE=62%, NO=44%), reassign Declarations to canonical insiders, and switch the backtest history grouping to insiderId.
Follow-up to audit 70-name-role-coherence-2026-05-19.md.
Phase 1 · Slug formula fix
src/lib/ingest/merge-staging.ts — 13 call sites switched from
shortSuffix("source:filingId") to shortSuffix("source:name:isin|issuerName").
Sources fixed: BaFin (DE), RNS (UK), Consob (IT), CNMV (ES), FMA (AT), FSMA (BE), FI (SE), AFM (NL, 2 call sites), HEL (FI), DK, Oslo (NO), IE.
SEC, SEDI (CA), ASX (AU), SGX, DART (KR), SEBI (IN), EDINET (JP), CVM (BR), KNF (PL), NZX (NZ), JSE (ZA), KAP (TR) already keyed correctly (stable person id or name+ticker).
Backward compatible: existing Insider rows untouched; Phase 2 dedupe collapses fragments onto canonical rows.
Files: src/lib/ingest/merge-staging.ts (28 insertions, 14 deletions).
Commit: f44f1bd.
Phase 2 · DB dedupe
scripts/_dedupe-insider-roles.ts --apply reads every Insider row, groups
by (sourceTag, lower(NFKC(name))), picks canonical = row with most
declarations (stable tie-break by id), reassigns Declaration.insiderId
and CompanyInsider rows, then deletes the duplicate Insider rows.
One small transaction per group (30s timeout). Idempotent on clean DB.
Sequence
DELETE FROM "CompanyInsider" ci
WHERE ci.insiderId IN (dupIds)
AND EXISTS (SELECT 1 FROM "CompanyInsider" can
WHERE can.companyId = ci.companyId
AND can.insiderId = canonical);
DELETE FROM "CompanyInsider" ci
WHERE ci.insiderId IN (dupIds)
AND ci.ctid <> (SELECT MIN(ci2.ctid) FROM "CompanyInsider" ci2
WHERE ci2.companyId = ci.companyId
AND ci2.insiderId IN (dupIds));
UPDATE "CompanyInsider"
SET insiderId = canonical
WHERE insiderId IN (dupIds);
UPDATE "Declaration" SET insiderId = canonical
WHERE insiderId IN (dupIds);
DELETE FROM "Insider" WHERE id IN (dupIds);
The intra-dup DELETE (step 2b) was missing in the first script revision;
without it 16 + 411 groups hit (companyId, insiderId) collisions when
multiple dup insiders each held a link to the same company. Fixed in
commit 2a86a0e (this PR).
Stats
| Metric | Before | After | Delta |
|---|---|---|---|
| Insider rows | 135,669 | 110,027 | -25,642 |
| Groups merged | -- | 6,667 | -- |
| Declarations reassigned | -- | ~28k | -- |
| CompanyInsider links touched | -- | ~22k | -- |
Spot checks
| Name | Pre | Post |
|---|---|---|
| L. Garavoglia (Campari NL/IT) | 371 | 1 |
| Stefan Pierer (KTM AT) | 184 | 1 |
| N. Sawiris (Heineken NL) | 145 | 1 |
| C.L. de Carvalho-Heineken (DE) | 307 | 1 |
Commit: 2a86a0e (intra-dup fix on top of earlier d0c21eb apply path).
Phase 3 · Backtest groupBy
src/lib/backtest-compute.ts — historyByIC and per-row icKey switched
from ${insiderName}::${companyId} to ${insiderId ?? insiderName ?? "?"}::${companyId}.
insiderId added to the Prisma select for both allDecls and the
raw.declaration block. Name fallback kept so legacy rows without
insiderId (pre-link) still hash to the same bucket as their canonical
twin via normalized name.
Commit: 93474e2.
Validation
npx tsc --noEmitclean (14 pre-existing errors insrc/lib/email/send.tsunrelated to this PR).npm run lint:emdashOK.npm run lint:emojiOK (449 files scanned).- Re-running
_dedupe-insider-roles.ts(dry) returns 0 groups with duplicates.
Open follow-ups (next chantier)
- Cross-source homonyms: an insider listed on multiple regulators (eg AFM + RNS + SEC for a person tied to a cross-listed company) still has 1 Insider row per source. The audit Section 3 cross-source table is the input. Use LEI from rawData where present.
- Parser hygiene (B3-B7 in audit 70): DK / Consob / NSE-IN / Tadawul still emit column headers and prose as insider names. These are now collapsed inside their own bucket but pollute the namespace. Fix at ingest.
- Backfill
Declaration.roleCategorycolumn so V12 reads from indexed storage rather than runningnormalizeRoleregex on every decl.