The four metrics that matter first
There are many metrics you can track. Four deserve front-row seats.
Freshness lag
How long elapses between source publication and successful ingestion into your canonical store? Measure this per regulator, per filing type, and ideally by percentile rather than only as an average. A mean lag can look healthy while a tail of delayed filings quietly ruins timeliness.
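A minimal sketch of percentile-based lag reporting, assuming each ingestion record carries `published_at` and `ingested_at` timestamps plus `regulator` and `filing_type` labels. Those field names and the sample data are illustrative, not taken from any particular pipeline.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import quantiles


def lag_percentiles(records):
    """Return p50, p95 and mean lag in seconds per (regulator, filing_type)."""
    groups = defaultdict(list)
    for r in records:
        lag = (r["ingested_at"] - r["published_at"]).total_seconds()
        groups[(r["regulator"], r["filing_type"])].append(lag)

    report = {}
    for key, lags in groups.items():
        if len(lags) == 1:
            # quantiles() needs at least two points on older Python versions.
            p50 = p95 = lags[0]
        else:
            cuts = quantiles(lags, n=100, method="inclusive")  # 99 cut points
            p50, p95 = cuts[49], cuts[94]
        report[key] = {"p50": p50, "p95": p95, "mean": sum(lags) / len(lags)}
    return report


# Illustrative data: 18 filings ingested after one minute, 2 after a full day.
published = datetime(2024, 1, 2, 9, 0)
records = [
    {
        "regulator": "SEC",
        "filing_type": "Form 4",
        "published_at": published,
        "ingested_at": published + timedelta(seconds=60 if i < 18 else 86_400),
    }
    for i in range(20)
]
stats = lag_percentiles(records)[("SEC", "Form 4")]
```

In this made-up sample the median stays at one minute while the 95th percentile sits at a full day, which is the kind of tail a single headline number tends to hide.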
Parse success rate
Of discovered candidate documents, what fraction parse successfully into required fields? Track by source and parser version. A sudden drop usually means the source changed format, not that corporate insiders collectively forgot how forms work.
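As a rough illustration, the rate can be computed from a stream of parse outcomes keyed by source and parser version. The tuple layout and the version strings below are assumptions made for the example.

```python
from collections import Counter


def parse_success_rates(outcomes):
    """outcomes: iterable of (source, parser_version, succeeded) tuples."""
    total, succeeded = Counter(), Counter()
    for source, version, ok in outcomes:
        total[(source, version)] += 1
        succeeded[(source, version)] += int(ok)
    return {key: succeeded[key] / total[key] for key in total}


# Hypothetical outcomes for two sources.
outcomes = [
    ("SEC", "1.4.0", True), ("SEC", "1.4.0", True),
    ("SEC", "1.4.0", True), ("SEC", "1.4.0", False),
    ("AMF", "2.0.1", True), ("AMF", "2.0.1", False),
]
rates = parse_success_rates(outcomes)
```

Keeping the parser version in the key makes it possible to tell a format change at the source apart from a regression introduced by a new parser release.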
Duplicate candidate rate
How many fetched artefacts resolve to already-known document or event identities? This metric helps tune lookback windows and dedupe rules. Too low can mean you are not rescanning enough. Too high can mean your scheduler is noisy or your source index is unstable.
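One way to sketch this, assuming document identity is approximated by a SHA-256 checksum of the raw bytes. A real pipeline would likely also resolve event-level identities, which this example does not attempt.

```python
import hashlib


def duplicate_rate(artefacts, known_checksums):
    """Fraction of fetched artefacts whose checksum was already known.

    Repeats inside the same batch also count as duplicates.
    """
    if not artefacts:
        return 0.0
    seen = set(known_checksums)
    duplicates = 0
    for raw in artefacts:
        digest = hashlib.sha256(raw).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(artefacts)


# Hypothetical batch: one artefact already stored, one repeated within the batch.
known = {hashlib.sha256(b"filing-A").hexdigest()}
batch = [b"filing-A", b"filing-B", b"filing-B", b"filing-C"]
rate = duplicate_rate(batch, known)
```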
Source error budget
How much failure is acceptable before a source is considered degraded? This includes transport errors, empty responses where content is expected, and structural changes. Error budgets are useful because they force escalation before users become your monitoring system.
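A small sketch of an error budget check for one source over one evaluation window. The failure categories mirror the ones named above; the 5% allowance and the counts are placeholder values, not recommendations.

```python
def budget_status(attempts, failures_by_kind, allowed_failure_ratio):
    """Compare observed failures for a source against its allowed ratio."""
    failures = sum(failures_by_kind.values())
    observed = failures / attempts if attempts else 0.0
    return {
        "observed_failure_ratio": observed,
        # Values above 1.0 mean the budget is exhausted.
        "budget_consumed": observed / allowed_failure_ratio,
        "degraded": observed > allowed_failure_ratio,
    }


# Hypothetical window for one source.
status = budget_status(
    attempts=200,
    failures_by_kind={"transport": 4, "unexpected_empty": 3, "structural_change": 5},
    allowed_failure_ratio=0.05,
)
```

Escalation can then key off `degraded`, or off `budget_consumed` crossing an earlier warning level.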
Logging should preserve forensic value
Application logs are often written for developers in the middle of an incident. Filing logs should also be useful six months later when someone questions a historical record.
That means structured logs with:
- source and adapter version
- scheduler run ID
- request URL or endpoint key
- response status and latency
- raw artefact checksum
- parse outcome
- canonical record IDs emitted
- dedupe decision reason
This is not overkill. It is how you reconstruct what happened when a source changed behaviour on a holiday Monday and your parser politely pretended not to notice.
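The field list above could be emitted as one JSON object per fetch, roughly as follows. The key names, the endpoint key format, and the sample values are assumptions chosen for illustration.

```python
import hashlib
import json


def filing_log_line(*, source, adapter_version, run_id, endpoint_key,
                    status, latency_ms, raw_artefact, parse_outcome,
                    record_ids, dedupe_reason):
    """Serialise one fetch-and-parse event as a single JSON log line."""
    record = {
        "source": source,
        "adapter_version": adapter_version,
        "scheduler_run_id": run_id,
        "endpoint_key": endpoint_key,
        "response_status": status,
        "latency_ms": latency_ms,
        "raw_artefact_sha256": hashlib.sha256(raw_artefact).hexdigest(),
        "parse_outcome": parse_outcome,
        "canonical_record_ids": record_ids,
        "dedupe_reason": dedupe_reason,
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical event.
line = filing_log_line(
    source="AMF",
    adapter_version="2.0.1",
    run_id="run-0001",
    endpoint_key="amf:declarations",
    status=200,
    latency_ms=412,
    raw_artefact=b"%PDF-1.7 example bytes",
    parse_outcome="ok",
    record_ids=["txn-000123"],
    dedupe_reason="new_document",
)
```

Logging the checksum rather than the artefact itself keeps lines small while still tying each record back to the exact bytes that were parsed.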
Alerting should be source-aware
A single global “pipeline failed” alert is nearly useless. It tells you something is wrong, somewhere, eventually. Better alerts are source-aware and threshold-based:
- AMF freshness lag above threshold
- SEC parse success below threshold
- one source returning zero documents during a normally active window
- duplicate rate spike after scheduler change
- amendment detection rate unusually low or high
The point is not to generate more alerts. It is to generate fewer, better ones. Operations teams do not need more noise. They need evidence.
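Source-aware, threshold-based alerting can be sketched as a list of rules evaluated against per-source metrics. The metric names and threshold values here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str
    metric: str
    direction: str   # "above" or "below"
    threshold: float


def evaluate(rules, metrics):
    """Return one message per breached rule; metrics is keyed by (source, metric)."""
    alerts = []
    for rule in rules:
        value = metrics.get((rule.source, rule.metric))
        if value is None:
            continue  # no data for this rule in the current window
        if rule.direction == "above":
            breached = value > rule.threshold
        else:
            breached = value < rule.threshold
        if breached:
            alerts.append(
                f"{rule.source} {rule.metric} {rule.direction} threshold: "
                f"{value} vs {rule.threshold}"
            )
    return alerts


# Hypothetical rules and current readings.
rules = [
    Rule("AMF", "freshness_lag_p95_s", "above", 3600),
    Rule("SEC", "parse_success_rate", "below", 0.98),
]
metrics = {
    ("AMF", "freshness_lag_p95_s"): 5400,
    ("SEC", "parse_success_rate"): 0.99,
}
alerts = evaluate(rules, metrics)
```

Each message names the source, the metric, and the observed value, so whoever receives it starts with evidence rather than a search.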
Parsing is where theory meets a PDF with opinions
The architecture discussion can sound clean until parsing enters the room. Then the room catches fire.
Structured sources are a gift, not a guarantee
The SEC’s electronic filing infrastructure is comparatively structured. That helps. It does not eliminate the need for validation, normalization, and exception handling. Structured filings still contain inconsistent names, edge-case transaction codes, and amended forms that require care.
European sources under the Market Abuse Regulation framework can be more fragmented operationally because publication happens through national competent authorities and venue-specific channels. ESMA can harmonize interpretation to a degree, but it does not make every publication endpoint behave like a disciplined API.
PDFs and HTML require defensive extraction
For semi-structured or unstructured sources, defensive parsing matters:
- preserve the original artefact
- extract text and layout separately when possible
- use field-level confidence scores
- validate against expected ranges and enumerations
- quarantine low-confidence parses for review or delayed publication
A parser should be allowed to say “I do not know”. Systems that force every document into a complete schema tend to manufacture certainty. That is excellent for throughput and poor for truth.
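A minimal sketch of field-level confidence with quarantine, assuming the extractor reports a value and a confidence score for each field. The 0.8 cut-off, the allowed transaction types, and the range checks are placeholders.

```python
from dataclasses import dataclass
from typing import Any

MIN_CONFIDENCE = 0.8                 # assumed cut-off
TRANSACTION_TYPES = {"BUY", "SELL"}  # assumed enumeration


@dataclass
class Extracted:
    value: Any          # None means the parser could not find the field
    confidence: float   # 0.0 to 1.0


def triage(fields):
    """Return ("publish" | "quarantine", reasons) for one parsed document."""
    reasons = []
    for name, f in fields.items():
        if f.value is None:
            reasons.append(f"{name}: not found")
        elif f.confidence < MIN_CONFIDENCE:
            reasons.append(f"{name}: confidence {f.confidence:.2f} below {MIN_CONFIDENCE}")

    tx = fields.get("transaction_type")
    if tx is not None and tx.value is not None and tx.value not in TRANSACTION_TYPES:
        reasons.append(f"transaction_type: unexpected value {tx.value!r}")

    for name in ("price", "quantity"):
        f = fields.get(name)
        if f is not None and f.value is not None and f.value <= 0:
            reasons.append(f"{name}: {f.value} outside expected range")

    return ("quarantine" if reasons else "publish"), reasons


# Hypothetical parse with one shaky field.
doc = {
    "transaction_type": Extracted("SELL", 0.97),
    "price": Extracted(12.5, 0.55),
    "quantity": Extracted(1000, 0.93),
}
decision, reasons = triage(doc)
```

The `None` value is how the parser says "I do not know"; the reasons list records why a document was held back.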
Canonical schemas need room for source quirks
A common schema for insider transactions usually includes issuer, insider, role, transaction date, notification date, instrument, transaction type, price, quantity, currency, and source metadata. That is the easy part.
The hard part is preserving source-specific nuance without polluting the canonical model beyond use. You need extension fields or source payload retention so that unusual attributes are not discarded. Otherwise, every normalization decision becomes irreversible, and irreversible decisions made under parser uncertainty are how future arguments begin.
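One possible shape for such a record, with the common fields typed and an open `extensions` mapping for source-specific attributes. The field names, the extension key, and the sample values are illustrative only.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import Any, Dict, Optional


@dataclass
class InsiderTransaction:
    issuer: str
    insider: str
    role: str
    transaction_date: date
    notification_date: date
    instrument: str
    transaction_type: str
    price: Decimal
    quantity: Decimal
    currency: str
    source: str
    source_document_id: str
    # Source-specific attributes that have no canonical home.
    extensions: Dict[str, Any] = field(default_factory=dict)
    # Pointer back to the retained raw payload, if one was stored.
    raw_artefact_sha256: Optional[str] = None


# Hypothetical record; every value is made up.
txn = InsiderTransaction(
    issuer="Example Corp",
    insider="Jane Doe",
    role="Director",
    transaction_date=date(2024, 1, 2),
    notification_date=date(2024, 1, 4),
    instrument="Ordinary shares",
    transaction_type="BUY",
    price=Decimal("12.50"),
    quantity=Decimal("1000"),
    currency="USD",
    source="SEC",
    source_document_id="doc-0001",
    extensions={"source_transaction_code": "P"},
)
```

Keeping the original source code in `extensions` means the mapping to the canonical `transaction_type` can be revisited later without re-fetching the filing.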
Governance, lineage, and replay are what make the data investable
Research-grade data is not merely data that exists. It is data whose provenance can be explained and whose transformations can be replayed.