41 — History-depth audit (per source)
Date: 2026-05-17
Goal: for each of 21 ingestion sources, check whether we are hitting the right endpoint and reaching the maximum possible historical depth (target 5 years).
Methodology: live `MIN(filingDate)` / `MAX(filingDate)` / `COUNT(*)` against each staging table (`Declaration` for AMF, `*Filing` for the others).
Source-side max history is estimated from public docs / observed behaviour.
Gap = max(0, min(5y, source_cap) − current_depth): a source already deeper than its target (e.g. AMF at 6.2y) has a gap of 0.
Run via `npx tsx scripts/audit-history-depth.ts`.
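
The gap rule above can be sketched in a few lines. This is an illustrative sketch, not the code in `scripts/audit-history-depth.ts` or the `computeHistoryDepth(def)` helper, neither of which is shown here. The names `DepthInput`, `DepthResult`, `historyGap` and `round1` are hypothetical. It also assumes that current depth is measured from the earliest to the latest `filingDate`, and that an uncapped source is represented by `null`.

```typescript
// Hypothetical sketch of the gap rule; names and shapes are assumptions.
const MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000;
const TARGET_YEARS = 5;

interface DepthInput {
  minFilingDate: string | null; // ISO date from MIN(filingDate); null if the table is empty
  maxFilingDate: string | null; // ISO date from MAX(filingDate)
  sourceCapYears: number | null; // estimated source-side cap; null if no practical cap
}

interface DepthResult {
  currentYears: number;
  targetYears: number;
  gapYears: number;
}

// Round to one decimal place, matching the precision of the snapshot table.
function round1(x: number): number {
  return Math.round(x * 10) / 10;
}

function historyGap(input: DepthInput, targetYears: number = TARGET_YEARS): DepthResult {
  // The target never exceeds what the source can actually serve.
  const target =
    input.sourceCapYears === null ? targetYears : Math.min(targetYears, input.sourceCapYears);

  // Assumption: depth is the span between the earliest and latest filing dates.
  let current = 0;
  if (input.minFilingDate && input.maxFilingDate) {
    current = (Date.parse(input.maxFilingDate) - Date.parse(input.minFilingDate)) / MS_PER_YEAR;
  }

  // Clamp at zero so sources deeper than their target report no gap.
  const gap = Math.max(0, target - current);
  return { currentYears: round1(current), targetYears: round1(target), gapYears: round1(gap) };
}

// BR (CVM) row from the snapshot, treated as uncapped within the 5y target.
console.log(historyGap({ minFilingDate: "2025-02-01", maxFilingDate: "2026-05-15", sourceCapYears: null }));
// { currentYears: 1.3, targetYears: 5, gapYears: 3.7 }
```

An empty table (as for DART or SEBI) yields a current depth of 0 and a gap equal to the full target.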
Snapshot (2026-05-17)
| Source | Rows | Min filingDate | Max filingDate | Current depth | Target depth | Gap | Source cap | Action |
|---|---|---|---|---|---|---|---|---|
| AMF | 25,733 | 2020-03-17 | 2026-05-14 | 6.2y | 5.0y | 0.0y | BDIF archive ~2015+ | OK |
| SEC | 156,888 | 2021-05-03 | 2026-05-15 | 5.0y | 5.0y | 0.0y | EDGAR full ~2003+ | OK (could go 20y) |
| BAFIN | 295 | 2025-05-02 | 2026-05-15 | 1.0y | 1.0y | 0.0y | Rolling 12 months | Capped — OK |
| SIX | 226 | 2026-04-15 | 2026-05-15 | 0.1y | 0.1y | 0.0y | RSS 30 d only | Capped — OK |
| RNS | 192 | 2026-05-12 | 2026-05-16 | 0.0y | 4.0y | 4.0y | Investegate (slow) | Backfill 49 mo |
| SEDI | 68 | 2026-04-30 | 2026-05-16 | 0.0y | 0.1y | 0.0y | ceo.ca 30 d rolling | Capped — OK |
| CONSOB | 360 | 2026-04-02 | 2026-05-15 | 0.1y | 5.0y | 4.9y | eMarketStorage ~5y | Backfill 59 mo |
| CNMV | 101 | 2026-04-16 | 2026-05-15 | 0.1y | 5.0y | 4.9y | Hechos Relevantes ~5y | Backfill 60 mo |
| AFM | 6,392 | 2006-11-01 | 2026-05-13 | 19.5y | 5.0y | 0.0y | Bulk XML full | OK (gold standard) |
| FSMA | 906 | 2020-03-31 | 2026-05-15 | 6.1y | 2.0y | 0.0y | Rolling 50/issuer | OK |
| OSLO | 601 | 2025-12-23 | 2026-05-15 | 0.4y | 5.0y | 4.6y | NewsWeb ~5y | Backfill 56 mo |
| FI | 114 | 2026-05-13 | 2026-05-16 | 0.0y | 5.0y | 5.0y | Nasdaq HEL ~5y | Backfill 61 mo |
| DK | 182 | 2026-02-25 | 2026-05-15 | 0.2y | 5.0y | 4.8y | Finanstilsynet ~5y | Backfill 58 mo |
| ASX | 202 | 2026-05-13 | 2026-05-15 | 0.0y | 0.2y | 0.2y | Markit ~3 months | Capped — OK |
| FMA | 50 | 2026-03-31 | 2026-05-15 | 0.1y | 5.0y | 4.9y | OeKB OAM ~5y | Backfill 59 mo |
| IE | 2 | 2026-05-12 | 2026-05-16 | 0.0y | 5.0y | 5.0y | Euronext Dublin ~5y | Backfill 61 mo |
| SGX | 1 | 2026-05-17 | 2026-05-17 | 0.0y | 5.0y | 5.0y | SGX annc ~5y | Backfill 61 mo |
| BR (CVM) | 29,307 | 2025-02-01 | 2026-05-15 | 1.3y | 5.0y | 3.7y | CVM VLMO 2021+ | Backfill 45 mo |
| DART | 0 | — | — | 0.0y | 5.0y | 5.0y | OpenDART ~5y | First ingest needed |
| SEBI | 0 | — | — | 0.0y | 3.0y | 3.0y | Trendlyne ~3y | First ingest needed |
Healthy: AMF, SEC, AFM, BaFin, SIX, SEDI, ASX, FSMA (caps respected). Gaps: 11 sources have meaningful backfill potential.
Backfill ROI ranking
Sorted by gap × low effort × expected data density.
- CVM Brazil (BR): gap 3.7y, weekly ZIPs publicly available since 2021. Already pulling ~29k rows for 1.3y. Backfilling the 2021/2022/2023/2024 ZIPs would roughly quadruple the dataset and unlock LATAM coverage. Highest ROI.
- CONSOB Italy: gap 4.9y, eMarketStorage paginates back ~5y. 59 mo recoverable from the same endpoint we already hit. Mature European market with strong PDMR discipline. High ROI.
- Oslo Børs (OSLO): gap 4.6y, NewsWeb supports date-range queries. Norwegian PDMRs trade actively (oil/seafood/shipping); a 56 mo recovery would unlock a high-alpha geography. High ROI.
Runners-up (4–7): CNMV Spain (4.9y gap, same scrape pattern), FMA Austria (OeKB OAM supports historical pulls), DK Finanstilsynet (5y register available), FI Nasdaq Helsinki (Nasdaq News API supports date-range queries).
Low ROI: RNS UK (the Investegate scrape is slow and brittle: a 49 mo backfill means ≈ 30k pages and a risky ~8h scrape). IE Dublin (low-volume issuers). SGX (just turned on, needs settling time).
DART (KR) and SEBI (IN) are not gaps but first ingests — separate workstream.
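
The ranking criterion used in this section (gap × low effort × expected data density) can be expressed as a simple score. This is a minimal sketch under stated assumptions: `BackfillCandidate`, `roiScore` and `rankByRoi` are hypothetical names. The `ease` and `density` values are placeholder 0..1 scores chosen for illustration and were not measured in this audit. Only the `gapYears` values come from the snapshot table.

```typescript
// Sketch only: `ease` and `density` are hypothetical 0..1 scores, not audit results.
interface BackfillCandidate {
  source: string;
  gapYears: number; // "Gap" column of the snapshot
  ease: number; // 1 = trivial backfill, 0 = very costly
  density: number; // 1 = many filings expected per year of history
}

// Product of the three factors; a higher score means a better backfill candidate.
function roiScore(c: BackfillCandidate): number {
  return c.gapYears * c.ease * c.density;
}

function rankByRoi(candidates: BackfillCandidate[]): BackfillCandidate[] {
  // Copy before sorting so the caller's array is left untouched.
  return [...candidates].sort((a, b) => roiScore(b) - roiScore(a));
}

// Gaps are taken from the snapshot; the ease/density numbers are placeholders.
const ranked = rankByRoi([
  { source: "RNS", gapYears: 4.0, ease: 0.2, density: 0.7 },
  { source: "BR", gapYears: 3.7, ease: 0.9, density: 0.9 },
  { source: "CONSOB", gapYears: 4.9, ease: 0.8, density: 0.6 },
]);
console.log(ranked.map((c) => c.source));
// [ 'BR', 'CONSOB', 'RNS' ]
```

A multiplicative score means a large gap alone does not win: a source with a big gap but a costly scrape (low `ease`) drops down the list.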
Files modified
- `src/lib/sources-registry.ts`: `SOURCE_HISTORY_CAP_DAYS` table, `TARGET_HISTORY_YEARS`, `HistoryDepth` type, `computeHistoryDepth(def)` helper, `historyDepth` field added to `SourceHealthSnapshot`.
- `src/app/admin/sources/page.tsx`: `HistoryDepthBar` widget; new "History depth" row per source card with green/gold/grey colour + progress bar + absolute date range + source-cap caption.
- `src/app/status/page.tsx`: new "Historique / History" column on public table with abbreviated `Xy / Yy` ratio and progress bar.
- `scripts/audit-history-depth.ts`: re-runnable audit producing the table above + JSON dump.
Next actions
- Schedule BR CVM 2021-2024 ZIP backfill (one-shot, ~4h ingest).
- Schedule CONSOB walk-back to 2021-01 (paginate eMarketStorage).
- Schedule Oslo NewsWeb 2021-01 → 2025-12 backfill.
- Decide whether to stand up DART and SEBI workers (separate scope).
- Revisit RNS scraper architecture before attempting backfill (rate limits, archive depth verification).