Building a Trustworthy Filing Pipeline for Insider Transactions, Sigma Journal


The system is not one pipeline, it is 17 uneasy truces

A regulator feed sounds singular. In practice, each source is its own treaty negotiation with reality.

Some regulators publish structured feeds. Some publish HTML lists that change shape when a webmaster has an energetic afternoon. Some provide PDFs that are technically text, in the same way a brick is technically furniture. Some timestamp in local market time, some in UTC, some with no explicit timezone at all. Some expose amendments clearly, some bury them in document labels, some appear to believe versioning is a moral weakness.

For an insider-transactions platform, these differences are not cosmetic. They determine how often you poll, what constitutes a new record, how you detect revisions, and how you reconcile a filing seen via two different channels.

Why cron still matters

There is a temptation to sneer at cron jobs, as if anything not described as “event-driven” must be powered by candlelight. That is unfair. Polling remains the practical default for many regulatory sources because there is no webhook, no reliable push feed, and no guarantee that “last updated” means what it says.

Cron, or a scheduler playing the same role, gives you three useful properties.

First, predictability. You know when a source will be checked and can reason about freshness.

Second, replayability. You can rerun a failed window, or backfill a date range after a parser fix.

Third, source-specific control. A regulator that updates once daily should not be treated like one that posts continuously through market hours.

The catch is that cron by itself is not architecture. It is merely the part that wakes the rest of the machinery.

The adapter pattern is not optional

A common mistake is to force all regulators through one generic fetch-parse-normalize function too early. That produces a codebase with the aesthetic of simplicity and the operational profile of a trapdoor.

A better pattern is per-regulator adapters with a shared contract:

discover new candidate documents
fetch raw artefacts
extract source metadata
parse into a canonical schema
emit lineage and quality signals

This lets you standardize outputs without pretending inputs are standard. It also makes source-specific fixes possible without collateral damage elsewhere. The AMF does not care how the SEC labels XML elements, and your code should show the same healthy indifference.
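As a rough illustration of that shared contract, here is a minimal Python sketch. Every name in it (RegulatorAdapter, RawArtefact, CanonicalFiling, DemoAdapter, run_adapter) is hypothetical, and the demo adapter reads from memory rather than from any real regulator. Source metadata and lineage are folded into the parse result to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class RawArtefact:
    source: str        # regulator code, e.g. "AMF"
    url: str           # where the artefact was fetched from
    content: bytes     # untouched bytes as received


@dataclass
class CanonicalFiling:
    source: str
    issuer: str
    insider: str
    lineage: dict = field(default_factory=dict)  # lineage and quality signals


class RegulatorAdapter(Protocol):
    """Shared contract: every adapter exposes the same steps."""
    source: str

    def discover(self) -> Iterable[str]: ...
    def fetch(self, url: str) -> RawArtefact: ...
    def parse(self, artefact: RawArtefact) -> CanonicalFiling: ...


class DemoAdapter:
    """In-memory stand-in for a real source; content is 'issuer|insider'."""
    source = "DEMO"

    def __init__(self, pages: dict):
        self.pages = pages

    def discover(self):
        return sorted(self.pages)

    def fetch(self, url):
        return RawArtefact(self.source, url, self.pages[url])

    def parse(self, artefact):
        issuer, insider = artefact.content.decode("utf-8").split("|")
        lineage = {"url": artefact.url, "parser": "demo-v1"}
        return CanonicalFiling(self.source, issuer, insider, lineage)


def run_adapter(adapter: RegulatorAdapter) -> List[CanonicalFiling]:
    # Orchestration code only ever talks to the contract, never to a source.
    return [adapter.parse(adapter.fetch(url)) for url in adapter.discover()]
```

The orchestration function knows nothing about any particular source, so a fix to one adapter cannot leak into another.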

Market	Regulator	Rule	Notification deadline	Notes
FR	AMF	MAR Art. 19	T+3 business days	Persons discharging managerial responsibilities and closely associated persons must notify transactions promptly and no later than three business days after the transaction date.
EU	ESMA	MAR framework	T+3 business days	ESMA provides Q&A and supervisory interpretation across EU venues, but publication mechanics remain national.
US	SEC	Section 16, Form 4	T+2 business days	Structured electronic filing is comparatively friendly, which is not the same as pleasant.

Disclosure rules may be broadly comparable, but operational publication and retrieval patterns differ materially by regulator.

Polling cadence should follow publication behaviour

An effective scheduler uses different cadences for different sources:

high-frequency polling during known publication windows
lower-frequency polling overnight or on weekends
adaptive backoff when a source is stale or unavailable
occasional deep scans to catch late-posted or re-indexed filings

This is less glamorous than machine learning and more useful than most of it.

For example, a regulator that typically posts disclosures in batches at fixed times benefits from targeted polling around those windows, plus a later reconciliation scan. A source that updates continuously may need steady polling and stronger duplicate suppression. A source with fragile infrastructure may need gentler concurrency and more patience than your average engineer can naturally provide.
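One way to express source-specific cadence is a small function that picks the next delay from the publication windows and the recent failure count. This is a sketch under assumptions: the function name, the default intervals, and the idea that `now` is already in the source's local market time are all illustrative choices, not anything a regulator prescribes.

```python
from datetime import datetime, time, timedelta


def next_poll_delay(now, windows, consecutive_failures=0,
                    busy=timedelta(minutes=2), idle=timedelta(minutes=30),
                    cap=timedelta(hours=2)):
    """Return how long to wait before polling this source again.

    now      -- naive datetime, assumed to be in the source's local time
    windows  -- list of (start, end) datetime.time pairs for weekday windows
    """
    is_weekday = now.weekday() < 5
    in_window = is_weekday and any(
        start <= now.time() < end for start, end in windows
    )
    # Poll quickly inside known publication windows, slowly otherwise.
    delay = busy if in_window else idle
    # Adaptive backoff: double the delay for each consecutive failure.
    delay = delay * (2 ** min(consecutive_failures, 10))
    return min(delay, cap)
```

A later reconciliation scan would simply be another scheduled job with a wider lookback, not a special case inside this function.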

Retries need taxonomy, not enthusiasm

“Retry on failure” is one of those phrases that sounds complete and is not.

Failures differ. A timeout, a 429 rate-limit response, a transient 502, a malformed document, and a parser exception should not all trigger the same behaviour. Good retry design starts with classification:

transport errors: retry with exponential backoff and jitter
server overload or rate limiting: retry more slowly, reduce concurrency, respect headers if present
permanent not found: do not hammer the source because one URL was wrong
content parse failures: store raw artefact, alert if threshold breached, retry only if source documents are known to mutate
schema mismatches: route to parser-maintenance queue, because repeating the same mistake faster is not resilience

Jitter matters because synchronized retries can turn one upstream wobble into a self-inflicted denial of service. This is standard distributed-systems hygiene, but it becomes especially important when many jobs are keyed to the same top-of-hour schedule. Regulators did not ask to be load-tested by your impatience.
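The taxonomy above can be sketched as a classifier plus a delay function. The class names, the status-code groupings, and the default base and cap values are assumptions made for illustration; the jitter used here is the "full jitter" variant, where the delay is drawn uniformly between zero and the exponential ceiling.

```python
import random
from enum import Enum


class ParseError(Exception):
    """Document fetched, but required fields could not be extracted."""


class SchemaMismatch(Exception):
    """Document structure no longer matches what the parser expects."""


class FailureClass(Enum):
    TRANSPORT = "transport"
    OVERLOAD = "overload"
    NOT_FOUND = "not_found"
    PARSE = "parse"
    SCHEMA = "schema"


def classify(status=None, exc=None):
    """Map an HTTP status or exception to a failure class."""
    if isinstance(exc, SchemaMismatch):
        return FailureClass.SCHEMA
    if isinstance(exc, ParseError):
        return FailureClass.PARSE
    if status in (429, 503):
        return FailureClass.OVERLOAD
    if status in (404, 410):
        return FailureClass.NOT_FOUND
    # Assumption: timeouts, resets and other 5xx responses are transient.
    return FailureClass.TRANSPORT


def retry_delay(failure, attempt, retry_after=None, source_mutates=False,
                base=1.0, cap=300.0, rng=random):
    """Seconds to wait before retrying, or None if retrying is pointless."""
    if failure in (FailureClass.NOT_FOUND, FailureClass.SCHEMA):
        return None  # route elsewhere instead of hammering the source
    if failure is FailureClass.PARSE and not source_mutates:
        return None  # same bytes, same parser, same failure
    ceiling = min(cap, base * (2 ** attempt))
    if failure is FailureClass.OVERLOAD:
        # Back off harder and never undercut the server's Retry-After hint.
        ceiling = min(cap, ceiling * 4)
        return max(retry_after or 0.0, rng.uniform(0, ceiling))
    return rng.uniform(0, ceiling)
```

Returning None rather than raising keeps the "do not retry" decision explicit, so the caller can route the document to a quarantine or parser-maintenance queue.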

Watermarks are useful, until they lie

Most schedulers maintain a watermark: a last-seen timestamp, a last filing ID, or a last page token. This is necessary and dangerous.

Necessary, because you need an efficient way to discover new records.

Dangerous, because source ordering is often unstable. Documents may appear late. Timestamps may be revised. Search indexes may lag the publication page. A filing can be visible in one endpoint before another. If your watermark logic assumes monotonicity where none exists, you will quietly skip records.

The usual remedy is a sliding lookback window. Instead of asking “what is newer than my last timestamp”, ask “what has appeared in the last X hours or days”, then deduplicate downstream. This costs extra fetches but buys safety. In filing pipelines, safety is generally cheaper than forensic embarrassment.
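A sliding lookback can be sketched in a few lines. The function name, the 48-hour default, and the in-memory set of seen IDs are assumptions for illustration; in a real system the seen set would live in a durable store.

```python
from datetime import datetime, timedelta, timezone


def discover_new(listing, seen_ids, now, lookback=timedelta(hours=48)):
    """Return filing IDs inside the lookback window that are not yet known.

    listing  -- iterable of (filing_id, published_at) pairs from the source
    seen_ids -- mutable set of IDs already ingested (updated in place)
    """
    cutoff = now - lookback
    fresh = []
    for filing_id, published_at in listing:
        # No watermark comparison: anything recent and unseen qualifies,
        # even if its timestamp is older than the newest filing ingested.
        if published_at >= cutoff and filing_id not in seen_ids:
            seen_ids.add(filing_id)
            fresh.append(filing_id)
    return fresh
```

A filing that appears late, stamped earlier than the newest record already ingested, is still picked up as long as it falls inside the window. A strict "newer than watermark" query would have skipped it.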

Idempotency is the difference between a dataset and a rumour

If a scheduler reruns a job, a worker crashes halfway through, or a source republishes the same filing with minor metadata changes, the system should remain calm. Idempotency is how.

In plain terms, the same filing processed twice should not produce two economic events.

Raw capture first, normalization second

A robust pattern is to separate the pipeline into layers:

raw acquisition: immutable storage of fetched artefacts and metadata
extraction: source-specific parsing into structured fields
canonical normalization: mapping into a common schema
entity resolution: linking issuer, insider, instrument, and venue
publication: making the cleaned filing available to downstream research and products

This layered design gives you replayability. If a parser improves, you can reprocess raw captures without re-hitting the regulator. If a dedupe rule changes, you can rerun normalization without pretending history never happened.

It also creates auditability. You can answer the uncomfortable but necessary question: why does this row exist?

Choosing a filing identity

Idempotency depends on a stable key. That is easy when the source offers a unique accession number or document identifier. It is less easy when the source offers only a title, date, and a URL that changes if someone reorganizes the website.

In practice, filing identity often combines several elements:

source regulator
source-native filing ID, if available
publication timestamp
issuer identifier
insider name or role
transaction date
document hash, for the raw artefact

The exact hierarchy matters. If you rely only on document hashes, a trivial formatting change can create a false “new” filing. If you rely only on timestamps and names, two legitimate filings can collapse into one. The sensible approach is a deterministic identity strategy with confidence tiers, plus explicit handling for amendments and corrections.
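A deterministic identity strategy with confidence tiers might look like the sketch below. The tier names, the field choices, and the FilingFacts type are hypothetical; the point is only that the same inputs always produce the same key, and that the key carries a label saying how much to trust it.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FilingFacts:
    regulator: str
    native_id: Optional[str]         # accession number or similar, if any
    issuer_id: Optional[str]
    insider: Optional[str]
    transaction_date: Optional[str]  # ISO date string
    raw_bytes: bytes                 # the fetched artefact


def filing_identity(f: FilingFacts) -> Tuple[str, str]:
    """Return (tier, key). Tiers run from most to least trustworthy."""
    if f.native_id:
        return "native", f"{f.regulator}:{f.native_id}"
    if f.issuer_id and f.insider and f.transaction_date:
        # Normalize case and whitespace so cosmetic changes do not fork the key.
        name = " ".join(f.insider.lower().split())
        parts = "|".join([f.regulator, f.issuer_id, name, f.transaction_date])
        digest = hashlib.sha256(parts.encode("utf-8")).hexdigest()[:16]
        return "composite", f"{f.regulator}:c:{digest}"
    # Last resort: the raw artefact hash, which is fragile to reformatting.
    digest = hashlib.sha256(f.raw_bytes).hexdigest()[:16]
    return "hash", f"{f.regulator}:h:{digest}"
```

Downstream code can then treat a "composite" or "hash" match as a candidate for review rather than as proof of identity, which is where amendment handling comes in.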

Amendments are not duplicates

This is where many pipelines become accidentally fictional. A corrected filing may share most fields with the original and still represent a material update. Treating every near-match as a duplicate can erase the amendment trail. Treating every amendment as a wholly new event can double-count activity.

The right model usually stores:

a document-level record: each published artefact
an event-level record: the economic transaction as currently understood
a version or supersession link: connecting amendments to originals where possible

This lets researchers choose. Some want the first public signal. Some want the latest corrected state. Some want both. Your system should not force a philosophical answer through a primary key constraint.
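The three-part model can be sketched with an in-memory store. DocumentRecord, EventStore, and their fields are hypothetical names, and the sketch assumes documents arrive in publication order; it also shows the idempotent behaviour described earlier, since replaying a document changes nothing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentRecord:
    doc_id: str                        # identity of the published artefact
    event_id: str                      # economic transaction it describes
    quantity: int
    supersedes: Optional[str] = None   # doc_id of the filing it amends


class EventStore:
    def __init__(self):
        self.documents = {}   # doc_id -> DocumentRecord
        self.chains = {}      # event_id -> [doc_id, ...] in arrival order

    def add(self, doc: DocumentRecord) -> bool:
        """Insert a document once; a replayed document is a no-op."""
        if doc.doc_id in self.documents:
            return False
        self.documents[doc.doc_id] = doc
        self.chains.setdefault(doc.event_id, []).append(doc.doc_id)
        return True

    def first_signal(self, event_id: str) -> DocumentRecord:
        return self.documents[self.chains[event_id][0]]

    def latest_state(self, event_id: str) -> DocumentRecord:
        return self.documents[self.chains[event_id][-1]]
```

An amendment adds a document without adding an event, so activity is not double-counted and the original remains queryable.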

The four metrics that matter first

There are many metrics you can track. Four deserve front-row seats.

Freshness lag

How long between source publication and successful ingestion into your canonical store? This should be measured per regulator, per filing type, and ideally by percentile, not just average. A mean lag can look healthy while a tail of delayed filings quietly ruins timeliness.
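To see why percentiles matter, consider a small sketch using the nearest-rank method. The function names and the sample lags are invented for illustration, and each source is assumed to have at least one observation.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[rank - 1]


def freshness_summary(lags_by_source):
    """Summarize ingestion lag (seconds) per source."""
    return {
        source: {
            "mean": sum(lags) / len(lags),
            "p50": percentile(lags, 50),
            "p95": percentile(lags, 95),
        }
        for source, lags in lags_by_source.items()
    }


# Eighteen filings ingested in a minute, two delayed by two hours.
summary = freshness_summary({"DEMO": [60] * 18 + [7200] * 2})
```

Here the median is 60 seconds and the mean is 774 seconds, which might pass a casual glance, while the p95 of 7,200 seconds shows the delayed tail directly.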

Parse success rate

Of discovered candidate documents, what fraction parse successfully into required fields? Track by source and parser version. A sudden drop usually means the source changed format, not that corporate insiders collectively forgot how forms work.

Duplicate candidate rate

How many fetched artefacts resolve to already-known document or event identities? This metric helps tune lookback windows and dedupe rules. A rate that is too low can mean you are not rescanning enough. A rate that is too high can mean your scheduler is noisy or your source index is unstable.

Source error budget

How much failure is acceptable before a source is considered degraded? This includes transport errors, empty responses when non-empty is expected, and structural changes. Error budgets are useful because they force escalation before users become your monitoring system.

Logging should preserve forensic value

Application logs are often written for developers in the middle of an incident. Filing logs should also be useful six months later when someone questions a historical record.

That means structured logs with:

source and adapter version
scheduler run ID
request URL or endpoint key
response status and latency
raw artefact checksum
parse outcome
canonical record IDs emitted
dedupe decision reason

This is not overkill. It is how you reconstruct what happened when a source changed behaviour on a holiday Monday and your parser politely pretended not to notice.
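A structured log line carrying those fields could be produced as follows. The function name and key names are assumptions; the only real requirement is that each line is machine-readable and self-describing.

```python
import hashlib
import json


def filing_log_line(*, source, adapter_version, run_id, endpoint, status,
                    latency_ms, raw, parse_outcome, record_ids, dedupe_reason):
    """Serialize one fetch-and-parse attempt as a single JSON line."""
    record = {
        "source": source,
        "adapter_version": adapter_version,
        "run_id": run_id,
        "endpoint": endpoint,
        "status": status,
        "latency_ms": latency_ms,
        # Checksum of the bytes actually received, not of a re-fetch.
        "raw_sha256": hashlib.sha256(raw).hexdigest(),
        "parse_outcome": parse_outcome,
        "record_ids": list(record_ids),
        "dedupe_reason": dedupe_reason,
    }
    # Sorted keys keep lines stable and easy to diff between runs.
    return json.dumps(record, sort_keys=True)
```

Six months later, a query on run ID and checksum is usually enough to establish what the system saw and what it decided.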

Alerting should be source-aware

A single global “pipeline failed” alert is nearly useless. It tells you something is wrong, somewhere, eventually. Better alerts are source-aware and threshold-based:

AMF freshness lag above threshold
SEC parse success below threshold
one source returning zero documents during a normally active window
duplicate rate spike after scheduler change
amendment detection unusually low or high

The point is not to generate more alerts. It is to generate fewer, better ones. Operations teams do not need more noise. They need evidence.
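Source-aware thresholds can be kept as plain data and evaluated by one small function. The metric names and threshold values below are invented for illustration, and a missing metric is treated as an alert in its own right on the assumption that silence from a source is rarely good news.

```python
def evaluate_alerts(metrics, thresholds):
    """Compare per-source metrics to per-source rules.

    thresholds -- {source: {metric_name: ("max" | "min", limit)}}
    metrics    -- {source: {metric_name: observed_value}}
    """
    alerts = []
    for source, rules in sorted(thresholds.items()):
        observed = metrics.get(source, {})
        for name, (kind, limit) in sorted(rules.items()):
            value = observed.get(name)
            if value is None:
                alerts.append(f"{source}: {name} missing")
                continue
            breached = value > limit if kind == "max" else value < limit
            if breached:
                alerts.append(f"{source}: {name}={value} breaches {kind} {limit}")
    return alerts
```

Each message names the source and the metric, which is the difference between evidence and noise.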

Parsing is where theory meets a PDF with opinions

The architecture discussion can sound clean until parsing enters the room. Then the room catches fire.

Structured sources are a gift, not a guarantee

The SEC’s electronic filing infrastructure is comparatively structured. That helps. It does not eliminate the need for validation, normalization, and exception handling. Structured filings still contain inconsistent names, edge-case transaction codes, and amended forms that require care.

European sources under the Market Abuse Regulation framework can be more fragmented operationally because publication happens through national competent authorities and venue-specific channels. ESMA can harmonize interpretation to a degree, but it does not make every publication endpoint behave like a disciplined API.

PDFs and HTML require defensive extraction

For semi-structured or unstructured sources, defensive parsing matters:

preserve the original artefact
extract text and layout separately when possible
use field-level confidence scores
validate against expected ranges and enumerations
quarantine low-confidence parses for review or delayed publication

A parser should be allowed to say “I do not know”. Systems that force every document into a complete schema tend to manufacture certainty. That is excellent for throughput and poor for truth.
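Letting a parser say "I do not know" can be as simple as attaching a confidence to every field and refusing to publish when a required field is missing or shaky. The required-field list, the two-value enumeration, and the 0.8 threshold are assumptions for the sake of the sketch.

```python
REQUIRED = ("issuer", "insider", "transaction_type", "quantity", "price")
ALLOWED_TYPES = {"BUY", "SELL"}  # hypothetical enumeration


def triage(fields, min_confidence=0.8):
    """Decide whether a parsed document is publishable.

    fields -- {name: (value, confidence)} as produced by an extractor
    Returns ("publish" | "quarantine", list_of_problems).
    """
    problems = []
    for name in REQUIRED:
        value, confidence = fields.get(name, (None, 0.0))
        if value is None:
            problems.append(f"{name}: missing")
        elif confidence < min_confidence:
            problems.append(f"{name}: low confidence ({confidence:.2f})")
    # Validate against expected enumerations and ranges.
    tx_type = fields.get("transaction_type", (None, 0.0))[0]
    if tx_type is not None and tx_type not in ALLOWED_TYPES:
        problems.append("transaction_type: not in enumeration")
    quantity = fields.get("quantity", (None, 0.0))[0]
    if quantity is not None and quantity <= 0:
        problems.append("quantity: out of range")
    return ("quarantine" if problems else "publish"), problems
```

The list of problems travels with the quarantined document, so a reviewer sees why it was held back rather than just that it was.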

Canonical schemas need room for source quirks

A common schema for insider transactions usually includes issuer, insider, role, transaction date, notification date, instrument, transaction type, price, quantity, currency, and source metadata. That is the easy part.

The hard part is preserving source-specific nuance without polluting the canonical model beyond use. You need extension fields or source payload retention so that unusual attributes are not discarded. Otherwise, every normalization decision becomes irreversible, and irreversible decisions made under parser uncertainty are how future arguments begin.
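One way to keep source quirks without bloating the canonical model is an extensions bucket alongside a reference to the raw artefact. The class, the field map, and the French-style payload keys used in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CanonicalTransaction:
    source: str
    raw_ref: str                       # pointer to the stored raw artefact
    issuer: Optional[str] = None
    insider: Optional[str] = None
    role: Optional[str] = None
    transaction_date: Optional[str] = None
    notification_date: Optional[str] = None
    instrument: Optional[str] = None
    transaction_type: Optional[str] = None
    price: Optional[float] = None
    quantity: Optional[float] = None
    currency: Optional[str] = None
    extensions: dict = field(default_factory=dict)  # unmapped source fields


def normalize(source, raw_ref, payload, field_map):
    """Map known fields; keep everything else instead of discarding it.

    field_map -- {canonical_name: source_field_name}
    """
    known = {canon: payload.get(src) for canon, src in field_map.items()}
    mapped = set(field_map.values())
    extras = {k: v for k, v in payload.items() if k not in mapped}
    return CanonicalTransaction(source=source, raw_ref=raw_ref,
                                extensions=extras, **known)
```

With a payload such as {"emetteur": "ACME", "commentaire": "..."} and a map of {"issuer": "emetteur"}, the comment lands in extensions rather than vanishing, so a later mapping decision can still be revisited.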

Governance, lineage, and replay are what make the data investable

Research-grade data is not merely data that exists. It is data whose provenance can be explained and whose transformations can be replayed.

Immutable raw storage is the insurance policy

Every fetched artefact should be stored immutably with checksum, fetch timestamp, source endpoint, and relevant headers or metadata. Storage is cheap. Regret is expensive.

Immutable raw storage supports:

parser improvements
historical audits
dispute resolution
recovery after downstream bugs
independent verification of source changes

If a regulator removes or alters a historical document, your raw archive may be the only stable record of what your system actually saw at the time.
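A content-addressed layout is one common way to approximate immutable raw storage on an ordinary filesystem. The sketch below is an assumption-laden illustration: the directory layout and function name are invented, and immutability here is only a convention (never overwrite an existing blob). Real enforcement would need storage-level controls.

```python
import hashlib
import json
from pathlib import Path


def store_raw(root, source, endpoint, content, fetched_at, headers=None):
    """Write an artefact under its own checksum and record the fetch.

    Returns the SHA-256 hex digest, which doubles as the artefact's address.
    """
    checksum = hashlib.sha256(content).hexdigest()
    blob = Path(root) / source / checksum[:2] / checksum
    blob.parent.mkdir(parents=True, exist_ok=True)
    if not blob.exists():
        # Identical bytes always map to the same path, so never overwrite.
        blob.write_bytes(content)
    meta = {
        "checksum": checksum,
        "endpoint": endpoint,
        "fetched_at": fetched_at,
        "headers": headers or {},
    }
    # Append-only sidecar: one JSON line per fetch of these exact bytes.
    sidecar = blob.with_name(blob.name + ".fetches.jsonl")
    with open(sidecar, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(meta, sort_keys=True) + "\n")
    return checksum
```

Fetching the same bytes twice stores one blob and two fetch records, which is exactly the evidence needed when a source later changes a document.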

Reprocessing should be routine, not exceptional

Pipelines mature. Parsers improve. Entity mappings get corrected. New amendment logic is introduced. If reprocessing is painful, it will be avoided. If it is avoided, known errors become institutional memory.

A healthy architecture treats replay as normal:

rerun extraction for a source and date range
rebuild canonical records from raw captures
compare old and new outputs
publish diffs with lineage

This is one reason queue-based, stage-separated systems age better than monoliths. They let you replay one layer without pretending the whole world needs to restart.
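The "compare old and new outputs" step is worth making mechanical. A minimal sketch, assuming canonical records are held as dictionaries keyed by record ID (the function name is hypothetical):

```python
def diff_outputs(old, new):
    """Compare two replay outputs keyed by canonical record ID."""
    old_ids, new_ids = set(old), set(new)
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        "changed": sorted(k for k in old_ids & new_ids if old[k] != new[k]),
    }
```

A replay that produces an empty diff is reassuring. One that produces a large diff is a release decision, not a background task.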

Data contracts beat tribal knowledge

When 17 regulators are involved, undocumented assumptions multiply. A data contract for each adapter should specify:

discovery method
expected publication lag
identity strategy
amendment handling
required and optional fields
known edge cases
alert thresholds
test fixtures

Without this, the system resides in the heads of a few engineers and one person who “knows the French source”. That person will eventually go on holiday, which is inconsiderate but common.
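A data contract can live in code as well as in a wiki. The sketch below is hypothetical in every name; it only shows that the checklist above can be made checkable, so an adapter with undocumented sections is visible before it ships.

```python
from dataclasses import dataclass, field

# Sections that must be filled in; optional fields and edge cases may be empty.
REQUIRED_SECTIONS = (
    "discovery_method", "expected_publication_lag", "identity_strategy",
    "amendment_handling", "required_fields", "alert_thresholds",
    "test_fixtures",
)


@dataclass
class AdapterContract:
    source: str
    discovery_method: str = ""
    expected_publication_lag: str = ""
    identity_strategy: str = ""
    amendment_handling: str = ""
    required_fields: list = field(default_factory=list)
    optional_fields: list = field(default_factory=list)
    known_edge_cases: list = field(default_factory=list)
    alert_thresholds: dict = field(default_factory=dict)
    test_fixtures: list = field(default_factory=list)

    def gaps(self):
        """Names of required sections that are still empty."""
        return [name for name in REQUIRED_SECTIONS if not getattr(self, name)]
```

A continuous-integration check that fails when gaps() is non-empty is one way to stop the contract drifting back into someone's head.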

What a sensible architecture looks like in practice

There is no single correct stack, but there is a recognisable shape to systems that survive.

A reference design

At a high level:

scheduler layer triggers per-regulator discovery jobs on source-specific cadences
discovery workers enumerate candidate filings or pages
fetch workers retrieve raw artefacts and write immutable storage
parse workers run source-specific extraction with versioned parsers
normalization workers map into canonical schema and apply idempotent upserts
entity resolution services standardize issuers, insiders, and instruments
quality and observability layer computes freshness, success, duplicate, and anomaly metrics
publication layer exposes document-level and event-level datasets

The scheduler can be cron, a workflow orchestrator, or a queue-driven timer service. The naming matters less than the guarantees. Can jobs be retried safely? Can they be replayed deterministically? Can you see where a filing is stuck? If not, the architecture is decorative.

Testing should mirror source reality

Unit tests are necessary and insufficient. The useful test suite also includes:

fixture-based parser tests using real historical documents
contract tests for source discovery assumptions
regression tests for known source format changes
replay tests over historical windows
chaos-style tests for timeouts, partial fetches, and duplicate deliveries

A parser that passes synthetic tests and fails on the regulator’s annual redesign is not robust. It is optimistic.

The boring metrics to publish internally

If you run such a system, a weekly internal review should include:

median and p95 freshness by source
parse success by parser version
count of quarantined documents
duplicate suppression ratio
amendment linkage rate
unresolved entity count
top source-specific incidents and time to recovery

None of this is glamorous. That is exactly why it is useful.

The regulatory layer still shapes the engineering layer

It is tempting to treat all this as pure infrastructure. It is not. Disclosure rules shape publication urgency, amendment frequency, and user expectations.

For example, under Art. 19 of the EU Market Abuse Regulation, persons discharging managerial responsibilities and closely associated persons must notify transactions promptly and no later than three business days after the transaction date. In the US, Section 16 Form 4 reports are generally due by the end of the second business day after the transaction. Those legal clocks influence what “fresh” means operationally, and what users consider a miss rather than a delay.

That does not mean engineering can infer legal compliance from publication timestamps alone. It means the architecture should be aware that timeliness is part of the product’s credibility.
