50 · Historical archive policy (DB-as-warehouse)
Status: active policy, 2026-05-18
Owner: Simon Azoulay
Related: 48-sources-intelligence-audit.md, src/lib/sources-registry.ts, scripts/_assert-no-deletes.mjs
TL;DR
Our DB IS the long-term archive. Daily upserts on rolling-window sources let us accumulate history beyond what each upstream itself exposes. We never delete from
Declaration.
A source can have a structural 12-month cap upstream (BaFin), a 30-day RSS window (SIX, SEDI), or a broken pagination (RNS Investegate). That cap applies only to a single fetch. Because each daily ingest upserts by Declaration.amfId @unique, every row we observe today is preserved forever, even after it falls out of the upstream rolling window tomorrow. After 5 years of daily ingestion on a 12mo-rolling source, we hold 5 years of history. That is the design.
Schema invariants we rely on
Declaration.amfId String @unique— primary natural key for dedup-on-upsert. Confirmed inprisma/schema.prisma:141.Declaration.createdAt DateTime @default(now())— represents observation time (when we first saw the row). Confirmed atprisma/schema.prisma:149.Declaration.pubDate DateTime— represents source publication time (regulator's filing timestamp). Confirmed atprisma/schema.prisma:145.- No cron, no scheduled job, no maintenance script calls
prisma.declaration.delete*.- Grep result 2026-05-18: only one occurrence in app code is
scripts/cleanup-cvm-br-anon.ts:60, a manual one-shot purge of CVM-BR anonymised rows after the source published full-name re-disclosures. That script is not on any schedule and was run once. - The Prisma client
.d.tsreferences do not count.
- Grep result 2026-05-18: only one occurrence in app code is
What this gives us
For each source, the effective history we expose to users is:
effective_history = max(upstream_cap_today, observation_window_since_first_ingest)
| Source | Upstream cap | Observation window if running daily | Notes |
|---|---|---|---|
| BaFin | 12 months | grows unbounded | structural cap, only fix is paid feed |
| SIX RSS | 30 days | grows unbounded | Sheldon JSON backfill reaches 2023-01 |
| SEDI | 30 days (ceo.ca) | grows unbounded | TSX direct blocked, paid Refinitiv lifts cap |
| RNS | ~today | grows unbounded | Investegate pagination broken |
| ASX | 90 days | grows unbounded | Markit feed limit |
| NZX | 24 months | grows unbounded | historical 403 wall |
| JSE | 24 months | grows unbounded | Sharenet rolling mirror |
| SEBI | ~36 months | grows unbounded | Trendlyne scrape |
For sources without a rolling cap (SEC EDGAR, AFM, OeKB, CVM VLMO, OpenDART, KAP) the upstream archive itself is the source of truth and we only need to ingest once per filing.
Backfills targeting paid sources are future work
Paid alternatives are surfaced in src/lib/sources-registry.ts via the paidFeed field, and rendered on /admin/sources and /status. Today's policy: we do not pay. When budget exists, we backfill in bulk from the chosen vendor's archive endpoint, then resume daily incremental from the free tier. The unique-key invariant means a bulk backfill is idempotent — it cannot duplicate or destroy existing rows.
Today's policy (canonical)
- Never delete from
Declaration. If you must purge bad data (e.g. test rows, malformed scrapes), do it via a named one-shot script underscripts/cleanup-<reason>.ts, with a comment explaining why, after manual review. Do not put deletes on any cron. createdAtis observation time,pubDateis source publication time. Do not conflate them in UI or recommendations.- Staging tables (BafinFiling, RnsFiling, etc) follow the same rule but are lower-stakes (re-derivable from upstream + Declaration). Still, no cron should
deleteManyon them. - Upserts must key on the natural ID (
amfIdfor Declaration, source-specific ID for staging). Never key on synthetic auto-increment.
Runtime safeguard
scripts/_assert-no-deletes.mjs records Declaration.count once per day in Setting["archive.declaration.history.v1"], keeps 30 daily samples, and asserts the count is monotonically non-decreasing. Exits non-zero on regression. To wire to nightly cron, add to vercel.json schedule or invoke from an existing nightly cron route.
Manual run:
node scripts/_assert-no-deletes.mjs
Sample output:
{"ok":true,"today":"2026-05-18","count":18432,"window":1,"growth30d":0,"earliest":{"day":"2026-05-18","count":18432},"latest":{"day":"2026-05-18","count":18432}}
Regression output (non-zero exit):
{"ok":false,"kind":"DECLARATION_COUNT_REGRESSION","today":"2026-05-18","count":17900,"window":12,"regression":{"day":"2026-05-18","count":17900,"max":18432}}
Future, optional
- Sunday
VACUUM ANALYZEcron on Postgres for index bloat. Not implemented yet, low priority while row count remains under 10M. IngestionSnapshottable recording raw payload hash + first-seen ts. Considered and dropped: storage cost is high, andDeclaration.createdAtalready gives us first-seen at the row grain. Revisit only if we hit a "what did the upstream payload look like exactly on day X" debugging need.
Don't break this policy without writing a successor doc.