70 · Names + Roles + Lang Coherence Audit
Date: 2026-05-19

Scope: cross-source data integrity for 595k Declarations / 135k Insiders ingested from 28+ regulators (AMF, SEC, BaFin, Consob, CNMV, AFM, FT-DK, FI-SE, Oslo, FMA-AT, FSMA-BE, RNS-LSE, ASX, EDINET, DART-KR, SSE/SZSE, NSE/BSE-IN, CVM-BR, etc.).
TL;DR
1. Role mapper coverage was 65.3% before this PR and is now 84.84% on the full DB (536k roled declarations). Added Portuguese (CVM-BR, +73k decls now mapped), Chinese (董事/高管/监事/etc., +8k), Korean (대표이사/사외이사/etc., +6k), Japanese (代表取締役 etc.), Swedish/Norwegian/Danish (verkställande direktör, styrelseledamot, +4k), plus generic English fallbacks (Officer, PDMR, Various Senior Officers, Vice Chair, Controller, Treasurer, Secretary), feminine Spanish (CONSEJERA), and Italian Consob long forms.
2. Insider rows are catastrophically fragmented inside non-SEC sources. The slug for AFM, BaFin, AT-FMA, CNMV, Consob, FT-DK, FI-SE, FSMA-BE, NSE-IN, Oslo, etc. is built as `slugify(name) + "-<src>-" + shortSuffix("<id>:<filingId>")`. Because `<filingId>` is unique per filing, the same person creates a NEW Insider row per filing. Example: `l-garavoglia` (Davide Campari board) = 371 distinct Insider rows. Same for `c.l. de carvalho-heineken` (307), `stefan-pierer` (184), `n. sawiris` (145).

   | Source | Insider rows | Distinct names | Fragmentation |
   |---|---|---|---|
   | sec | 52,182 | 52,128 | 0.1% |
   | jp | 1,979 | 1,969 | 0.5% |
   | in | 5,462 | 5,055 | 7.5% |
   | kr | 6,863 | 5,718 | 16.7% |
   | uk | 180 | 143 | 20.6% |
   | au | 706 | 464 | 34.3% |
   | no | 604 | 338 | 44.0% |
   | de | 1,747 | 669 | 61.7% |
   | fi | 6,279 | 2,320 | 63.1% |
   | it | 2,552 | 773 | 69.7% |
   | be | 881 | 266 | 69.8% |
   | at | 2,833 | 698 | 75.4% |
   | es | 5,333 | 1,203 | 77.4% |
   | dk | 242 | 21 | 91.3% |
   | nl | 5,678 | 491 | 91.4% |

   Excess Insider rows = 25,642 (~19% of total). Distinct (source, name) groups with duplicates: 6,667.
3. Backtest historical win rate is keyed on the raw `insiderName::companyId` (backtest-compute.ts:327), not `insiderId::companyId`. This partially neutralises item 2 above (B1 in section 4) within a single source (same name → same history bucket), but breaks across:
   - spelling variants (`Doe John` vs `John Doe`)
   - cross-source same person (Musk SEC + RNS would still be split because companyId differs anyway; safe for now since cross-listing detection is a separate concern)
1 · Role-mapper coverage (full DB)
15,193 distinct insiderFunction strings → 536,433 roled declarations.
| Bucket | Decls | % |
|---|---|---|
| Directeur | 269,586 | 50.26% |
| PDG/DG | 82,858 | 15.45% |
| CA/Board | 74,105 | 13.81% |
| CFO/DAF | 28,549 | 5.32% |
| Autre | 81,335 | 15.16% |
Residual "Autre" is dominated by genuinely non-executive labels:
- Controlador ou Vinculado · 39,410 (CVM-BR: controlling shareholder, NOT exec)
- Large shareholder · 25,953
- See Remarks (various spellings) · ~6.4k (garbage; the SEC raw field for unstructured roles)
- 核心技术人员 · 981 (CN: core technical staff, not exec)
- Substantial Holder · 711 (ASX: large holder)
- Closely associated person · 671 (PDMR family member)
- truncated A.P. Møller copy-paste · 617 (ingestion bug in DK adapter, see B6 in section 4)
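
The headline coverage figure can be re-derived from the bucket table: coverage is the share of roled declarations that fall outside `Autre`. The sketch below is only an illustration of that arithmetic; the `BucketCount` type and `roleCoverage` function are hypothetical names and are not taken from `scripts/_audit-role-coverage.ts`.

```typescript
// Hypothetical types for illustration; the real audit script may differ.
type Bucket = "PDG/DG" | "CFO/DAF" | "Directeur" | "CA/Board" | "Autre";

interface BucketCount {
  bucket: Bucket;
  decls: number;
}

// Coverage = declarations mapped to any bucket other than "Autre" / all roled declarations.
function roleCoverage(counts: BucketCount[]): number {
  const total = counts.reduce((sum, c) => sum + c.decls, 0);
  if (total === 0) return 0;
  const mapped = counts
    .filter((c) => c.bucket !== "Autre")
    .reduce((sum, c) => sum + c.decls, 0);
  return mapped / total;
}

// Figures copied from the bucket table above.
const counts: BucketCount[] = [
  { bucket: "Directeur", decls: 269_586 },
  { bucket: "PDG/DG", decls: 82_858 },
  { bucket: "CA/Board", decls: 74_105 },
  { bucket: "CFO/DAF", decls: 28_549 },
  { bucket: "Autre", decls: 81_335 },
];

console.log((roleCoverage(counts) * 100).toFixed(2) + "%"); // prints "84.84%"
```

The five bucket counts sum to 536,433, which matches the roled-declaration total quoted at the top of this section.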
2 · Top 50 raw role variants
Director | 125627
Controlador ou Vinculado | 39410
Conselho de Administração ou Vinculado | 35971
Diretor ou Vinculado | 33481
Large shareholder | 25953
Chief Financial Officer | 11155
Chief Executive Officer | 10665
President and CEO | 7028
Director or Supervisory Board | 6392
See Remarks | 5679
Conselho Fiscal ou Vinculado | 4519
董事 | 4331
Chief Operating Officer | 4013
President & CEO | 4007
Executive Vice President | 3341
Officer | 3227
Chief Accounting Officer | 3037
Administrateur | 2779
President | 2677
CEO | 2608
PDMR | 2348
Other senior manager | 2163
Member of the Board/Deputy member | 2136
Chief Technology Officer | 2033
董事,高管 | 1976
Senior Vice President | 1931
Styrelseledamot | 1852
Executive Chairman | 1836
Chief Legal Officer | 1612
Chairman and CEO | 1584
General Counsel | 1485
CFO | 1416
Verkställande direktör (VD) | 1388
상무 | 1325
Chief Medical Officer | 1314
Président | 1290
Annan medlem i bolagets administrations-... | 1190
Chief Commercial Officer | 1115
President and COO | 1074
高级管理人员 | 1070
监事 | 1063
高管 | 1047
Órgão Estatutário ou Vinculado | 1036
EVP & CFO | 1016
Vice President | 1015
Chief Revenue Officer | 988
核心技术人员 | 981
Executive member of the board ... senior management | 965
CEO and President | 951
Chairman & CEO | 946
3 · Top 30 dedupe candidates (within-source)
(from npx tsx scripts/_dedupe-insider-roles.ts)
src | name | rows | total_decls
nl | l. garavoglia | 371 | 371
de | c.l. de carvalho-heineken | 307 | 307
es | ferev uno strategic plans, s.l. | 232 | 232
es | candelas arranz pumar | 227 | 227
es | criteria caixa, s.a.u. | 196 | 196
at | stefan pierer | 184 | 184
kr | 국민연금공단 | 165 | 238
nl | n. sawiris | 145 | 145
nl | r. sackers | 124 | 124
nl | m. colpan | 120 | 120
nl | d.e. knibbe | 119 | 119
dk | member/executive/associated person | 114 | 114
nl | t.j.m. van hauwermeiren | 113 | 113
au | unknown substantial holder | 106 | 139
sa | sof-11 klimt cai s.à r.l. | 102 | 102
it | - internal dealing transaction | 90 | 90
es | carlos march delgado | 83 | 83
nl | m.b. swart | 79 | 79
de | p.p.f. de vries | 78 | 78
sa | metambiente, s.a. | 78 | 78
it | / internal dealing notice | 75 | 75
in | there was a change in the interests | 74 | 105
nl | j.p. elkann | 73 | 73
nl | f.j.m. schneider-maunoury | 70 | 70
at | pierer industrie ag | 70 | 70
other | roberto garcia-martinez | 68 | 68
nl | j-m. chery | 67 | 67
dk | b) lei 529900rgxd3cbr3bru63 | 62 | 62
sa | hacia s.a. (es slug) | 62 | 62
nl | h. schimmelbusch | 62 | 62
Side findings:
- `dk | member/executive/associated person` (114 rows) · DK parser is emitting the literal column header as insider name.
- `it | - internal dealing transaction`, `it | / internal dealing notice` · Consob parser is emitting subject lines as insider name.
- `in | there was a change in the interests of the` · NSE parser is emitting truncated press-release prose.
- `au | unknown substantial holder` · ASX fallback when name parse fails.
- `sa | sof-11 klimt cai s.à r.l.` slugged with `at-` then `sa-`: shows a SAUDI source tag colliding with another (likely the tadawul parser writing under the wrong country tag). Worth investigating.
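
For reference, the fragmentation metric used in this audit is (Insider rows − distinct names) / Insider rows, computed per source. The following is a minimal sketch of that computation, assuming rows carry a `source` tag and a `name`; the `InsiderRow` and `SourceStats` shapes and the function name are hypothetical, and this is not the code of `scripts/_dedupe-insider-roles.ts`.

```typescript
// Hypothetical row shape; the real Insider model has more fields.
interface InsiderRow {
  source: string;
  name: string;
}

interface SourceStats {
  rows: number;
  distinctNames: number;
  excess: number; // rows that would disappear if each name had exactly one row
  fragmentation: number; // excess / rows
}

function fragmentationBySource(rows: InsiderRow[]): Map<string, SourceStats> {
  const namesBySource = new Map<string, Set<string>>();
  const rowsBySource = new Map<string, number>();

  for (const row of rows) {
    // Compare names case-insensitively after NFKC normalisation and trimming.
    const key = row.name.normalize("NFKC").trim().toLowerCase();
    if (!namesBySource.has(row.source)) {
      namesBySource.set(row.source, new Set<string>());
    }
    namesBySource.get(row.source)!.add(key);
    rowsBySource.set(row.source, (rowsBySource.get(row.source) ?? 0) + 1);
  }

  const stats = new Map<string, SourceStats>();
  rowsBySource.forEach((total, source) => {
    const distinctNames = namesBySource.get(source)!.size;
    const excess = total - distinctNames;
    stats.set(source, { rows: total, distinctNames, excess, fragmentation: excess / total });
  });
  return stats;
}

// Tiny synthetic sample: three rows for one person, one row for another.
const sampleRows: InsiderRow[] = [
  { source: "nl", name: "L. Garavoglia" },
  { source: "nl", name: "l. garavoglia" },
  { source: "nl", name: " L. Garavoglia " },
  { source: "nl", name: "N. Sawiris" },
];

console.log(fragmentationBySource(sampleRows).get("nl"));
```

On the synthetic sample this yields 4 rows, 2 distinct names, 2 excess rows and a fragmentation of 0.5. Applied to the real `nl` figures (5,678 rows, 491 distinct names) the same formula gives the 91.4% reported in the TL;DR.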
4 · Bugs identified
| # | Severity | Where | Bug |
|---|---|---|---|
| B1 | HIGH | merge-staging.ts:1219, 1242, 2535, 2558, 2174, 2402, 2663, 2898, 3141, 3283, 1853, 1958, 2064 | Insider slug includes per-filing hash, fragmenting same person across N rows. NL=91%, DK=91%, ES=77%, AT=75%, BE=70%, IT=70%, DE=62%, FI=63%, NO=44% duplication. |
| B2 | MED | backtest-compute.ts:327 | History grouping uses insiderName::companyId (raw). Robust to B1 but vulnerable to spelling variants and inter-source. |
| B3 | LOW | merge-staging.ts DK adapter | "member/executive/associated person" written as insider name 114x (column header leaking through). |
| B4 | LOW | Consob adapter | Subject lines ("- internal dealing transaction", "/ internal dealing notice") used as insider name. |
| B5 | LOW | NSE-IN adapter | Truncated press-release prose as insider name ("there was a change in the interests of the"). |
| B6 | LOW | DK or A.P. Møller-specific parser | Truncated CEO disclosure paragraph stored as role ("gerial responsibilities in the issuer, because A.P. Møller Holding's CEO, Robert"). |
| B7 | LOW | SA / AT tag mixup | sof-11 klimt cai s.à r.l. slugged -at- then queried as sa. Possible tadawul ingestion writing wrong source tag. |
5 · Fixes shipped in this PR
- `src/lib/role-utils.ts`: extended `normalizeRole()` with:
  - Portuguese (CVM-BR): Diretor / Conselho de Administração / Conselho Fiscal / Órgão Estatutário / Controlador
  - Chinese: 董事, 监事, 高级管理人员, 高管, 财务总监, 总经理, 副总, 董秘, 核心技术人员
  - Korean: 대표이사, 사장, 회장, 사외이사, 감사, 이사, 부사장, 전무, 상무
  - Japanese: 代表取締役, 取締役, 監査役, 執行役, 会長
  - Swedish/Norwegian/Danish: verkställande direktör (VD), styrelseledamot, styrelseordförande, ekonomidirektör, ledande befattningshavare, arbetstagarrepresentant
  - Generic English: Officer, PDMR, Senior executive, Various Senior Officers, Executive Chair, Vice Chair, Controller, Treasurer, Secretary, CAO
  - Italian Consob long forms: "Persona che esercita funzioni di amministrazione / direzione"
  - Spanish feminine: CONSEJERA, Alto Directivo / Directivo
  - Investor catch-all: 10% Owner, Principal Stockholder, stockholder, affiliate
- `scripts/_audit-role-coverage.ts`: reusable coverage tester (top80/top500/full).
- `scripts/_dedupe-insider-roles.ts`: dry-only audit. Lists 6,667 within-source dedupe groups + 25,642 excess rows. `--apply` deliberately not implemented (needs manual review pipeline).
6 · Recommendations (next chantier)
R1 · Stop per-filing slug hashing (HIGH priority)
Change merge-staging.ts slug formula for all non-SEC sources to:
slugify(normalizeName(name)) + "-<src>-" + shortSuffix(`${exchangePrefix}:${normalizeName(name).toLowerCase()}:${companyId}`)
Keying on (name, companyId) produces 1 row per person-per-company, not per filing. SEC already does this via CIK.
Risk: needs a migration to merge existing fragments. This would be provided via `scripts/_dedupe-insider-roles.ts --apply`, which is not implemented yet (see section 5).
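
A minimal sketch of the proposed formula follows. The bodies of `normalizeName`, `slugify` and `shortSuffix` are stand-ins written for this example (the real helpers are not reproduced in this report, and the FNV-1a hash is an assumption, not necessarily what `shortSuffix` does); `stableInsiderSlug` and the sample identifiers are hypothetical as well.

```typescript
// Stand-in: NFKC-normalise, collapse whitespace, trim.
function normalizeName(name: string): string {
  return name.normalize("NFKC").replace(/\s+/g, " ").trim();
}

// Stand-in: lowercase, strip diacritics, replace non-alphanumeric runs with "-".
function slugify(text: string): string {
  return text
    .toLowerCase()
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "")
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Stand-in: 32-bit FNV-1a hash rendered as 8 hex characters.
function shortSuffix(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// The suffix is keyed on (exchangePrefix, name, companyId); no filing id is involved.
function stableInsiderSlug(
  name: string,
  src: string,
  exchangePrefix: string,
  companyId: string,
): string {
  const clean = normalizeName(name);
  const key = `${exchangePrefix}:${clean.toLowerCase()}:${companyId}`;
  return `${slugify(clean)}-${src}-${shortSuffix(key)}`;
}

// Two filings by the same person at the same company now share one slug.
const slugFiling1 = stableInsiderSlug("L. Garavoglia", "nl", "afm", "company-1");
const slugFiling2 = stableInsiderSlug(" L.  Garavoglia ", "nl", "afm", "company-1");
console.log(slugFiling1 === slugFiling2); // true
```

Because the filing id no longer feeds the suffix, repeated filings collapse onto one Insider row, while the same name at a different `companyId` still gets its own row.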
R2 · Fix history grouping in backtest-compute.ts
Replace ${d.insiderName}::${d.companyId} with ${d.insiderId}::${d.companyId} once R1 is applied.
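
The change amounts to swapping the bucket key. The sketch below uses a hypothetical `Decl` shape and `groupHistory` helper (not the actual code at backtest-compute.ts:327), and it assumes the two spelling variants have already been merged onto a single Insider row, which R1 alone does not guarantee.

```typescript
// Hypothetical declaration shape, reduced to the fields relevant here.
interface Decl {
  insiderId: string;
  insiderName: string;
  companyId: string;
  date: string;
}

// Proposed key: stable id instead of the raw name string.
const historyKey = (d: Decl): string => `${d.insiderId}::${d.companyId}`;

function groupHistory(decls: Decl[]): Map<string, Decl[]> {
  const buckets = new Map<string, Decl[]>();
  for (const d of decls) {
    const key = historyKey(d);
    const bucket = buckets.get(key);
    if (bucket) {
      bucket.push(d);
    } else {
      buckets.set(key, [d]);
    }
  }
  return buckets;
}

// "John Doe" and "Doe John" share an insiderId, so they share a history bucket.
const sampleDecls: Decl[] = [
  { insiderId: "ins_1", insiderName: "John Doe", companyId: "co_9", date: "2025-01-10" },
  { insiderId: "ins_1", insiderName: "Doe John", companyId: "co_9", date: "2025-03-02" },
  { insiderId: "ins_2", insiderName: "Jane Roe", companyId: "co_9", date: "2025-03-05" },
];

const grouped = groupHistory(sampleDecls);
console.log(grouped.size); // 2
```

With the raw-name key the same sample would produce three buckets, one per spelling.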
R3 · Cross-source same-name detection (LOWER priority)
Cross-source merge is risky (homonyms). Recommend manual review or LEI-based join (some sources expose LEI in rawData). Already partially detected via scripts/_dedupe-insider-roles.ts cross-source section.
R4 · Parser hygiene
B3-B7 require fixes in individual ingest adapters (dk.ts, consob.ts, nse-in.ts, tadawul-sa.ts). Validate that insider name doesn't equal column header or subject line, length sanity check (≤120 chars), no leading punctuation.
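
The three checks named above could be shared across adapters as one guard. This is a sketch under assumptions: `isPlausibleInsiderName` and `KNOWN_NON_NAMES` are hypothetical names, the deny-list only contains strings observed in section 3, and the punctuation test covers ASCII punctuation only.

```typescript
// Header and subject-line strings seen as insider names in this audit (lowercased).
const KNOWN_NON_NAMES = new Set<string>([
  "member/executive/associated person",
  "- internal dealing transaction",
  "/ internal dealing notice",
]);

const MAX_NAME_LENGTH = 120;

// Matches a leading ASCII punctuation character such as "-", "/", "(" or "*".
const LEADING_ASCII_PUNCTUATION = /^[!-\/:-@\[-`{-~]/;

function isPlausibleInsiderName(raw: string): boolean {
  const name = raw.normalize("NFKC").trim();
  if (name.length === 0 || name.length > MAX_NAME_LENGTH) return false;
  if (KNOWN_NON_NAMES.has(name.toLowerCase())) return false;
  if (LEADING_ASCII_PUNCTUATION.test(name)) return false;
  return true;
}

console.log(isPlausibleInsiderName("Stefan Pierer")); // true
console.log(isPlausibleInsiderName("/ internal dealing notice")); // false
console.log(isPlausibleInsiderName("Member/Executive/Associated person")); // false
```

Prose fragments such as the NSE-IN case in B5 pass all three checks, so that adapter would still need its own fix.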
R5 · Replace cron-less role normalisation with stored bucket
Add a Declaration.roleCategory column (canonical PDG/DG | CFO/DAF | Directeur | CA/Board | Autre), populated at ingest via normalizeRole(insiderFunction). Scoring then reads the column instead of running JS regex on every decl, and the column can be indexed.
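
One possible shape for the ingest step is sketched below. The normaliser is injected as a parameter so the example does not depend on the real `normalizeRole()`; `toyNormalizeRole` is a placeholder with two obvious patterns only, the type names are hypothetical, and leaving `roleCategory` null when `insiderFunction` is missing is an assumption.

```typescript
type RoleCategory = "PDG/DG" | "CFO/DAF" | "Directeur" | "CA/Board" | "Autre";

// Hypothetical declaration shapes, reduced to the relevant fields.
interface RawDeclaration {
  id: string;
  insiderFunction: string | null;
}

interface StoredDeclaration extends RawDeclaration {
  roleCategory: RoleCategory | null; // null = no role string on the filing (assumption)
}

type RoleNormalizer = (insiderFunction: string) => RoleCategory;

// Compute the bucket once at ingest so scoring can read it from the row.
function withRoleCategory(
  decl: RawDeclaration,
  normalizeRole: RoleNormalizer,
): StoredDeclaration {
  const roleCategory = decl.insiderFunction ? normalizeRole(decl.insiderFunction) : null;
  return { ...decl, roleCategory };
}

// Placeholder normaliser for the demo only; NOT the project's normalizeRole().
const toyNormalizeRole: RoleNormalizer = (fn) => {
  const s = fn.toLowerCase();
  if (/chief executive officer|\bceo\b/.test(s)) return "PDG/DG";
  if (/chief financial officer|\bcfo\b/.test(s)) return "CFO/DAF";
  return "Autre";
};

const stored = [
  { id: "d1", insiderFunction: "Chief Financial Officer" },
  { id: "d2", insiderFunction: "Controlador ou Vinculado" },
  { id: "d3", insiderFunction: null },
].map((d) => withRoleCategory(d, toyNormalizeRole));

console.log(stored.map((d) => d.roleCategory)); // [ 'CFO/DAF', 'Autre', null ]
```

Storing the bucket also means a later change to the mapping rules requires a backfill of existing rows, which should be weighed against the per-declaration regex cost.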
7 · Stats
- Distinct `insiderFunction` variants in DB: 15,193
- Roled declarations: 536,433 (out of 595,076 total = 90.1% coverage)
- Pre-fix role mapping: 65.3% non-`Autre`
- Post-fix role mapping: 84.84% non-`Autre` (+19.5 pp)
- Insider rows: 135,669
- Within-source dedupe groups: 6,667
- Excess Insider rows from B1: 25,642 (~19% of all)
- NFKC-clean names: 100% (all names go through `normalizeName()` at ingest in `merge-staging.ts`).
8 · 3 top findings (executive)
- Slug formula bug fragments every EU regulator's Insider table. AFM = 91% duplicate rows, DK = 91%, ES = 77%. Root cause: slug suffix is per-filing, not per-person. Win-rate, historical patterns, and `Insider.tradingPattern` ML labels are computed per-row → fragmented across these dupes.
- Role mapper had 35% miss-rate on non-French/English sources (Portuguese, Chinese, Korean, Swedish dominant). Now patched. The V12 `roleSignalMultiplier` was effectively returning 0.70 ("Autre") for ~30% of CVM-BR / DART-KR / SSE-CN / FI-SE declarations, biasing CEO/CFO multiplier downward for entire emerging-market sleeves.
- Backtest history grouping uses raw `insiderName` string, not `insiderId`. Currently masked by B1 (fragmented insider IDs would be worse than raw names), but blocks any future cross-source aggregation.