A quant stack is only as good as the plumbing, and most alpha dies somewhere between Authorization and page 37.
There are two ways to write a quant tutorial. One is to promise a market-beating model in 40 lines. The other is to show the parts that break first. We will choose the second option, partly because it is honest, and partly because broken pagination is less glamorous but more common.
This piece walks through a practical Python workflow for building a first insider-filings model with a REST API. The emphasis is on four things: endpoints, authentication, pagination, and a simple alpha example.
Because no live article-specific data extract was provided, I will not pretend otherwise. Where a number is unavailable, it is marked n/a. The only hard internal figure we do have is the scale of the filings archive, 162,000 filings. That is enough to make one point immediately: once the dataset is large enough, hand-wavy scripts stop being charming.
You know basic Python, requests, and pandas. You do not need a full factor-research platform. A laptop, a token, and a mild tolerance for JSON are sufficient.
We will keep the API examples generic, because the article brief asks for endpoints, auth, pagination, and a simple alpha example, not a reverse-engineered shrine to any one implementation detail. Replace the placeholder paths and fields with your actual API schema.
For a research client, keep three layers separate:
If you collapse those into one notebook cell, you will get results quickly. You will also get a future bug report written by your past self.
A REST API for filings usually exposes some combination of these resources:
/filings
/issuers
/insiders
/transactions
/symbols or /instruments
The exact naming varies. The pattern does not.
Most APIs in this category use bearer-token authentication or an API key header. A clean pattern is to wrap a requests.Session so every request shares headers, timeout rules, and retry logic.
import time
from typing import Dict, Iterable, List, Optional

import requests
import pandas as pd


class InsiderAPI:
    def __init__(self, base_url: str, api_key: str, timeout: int = 30):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
            "User-Agent": "sigma-journal-python-tutorial/1.0"
        })
        self.timeout = timeout

    def _get(self, path: str, params: Optional[Dict] = None) -> Dict:
        url = f"{self.base_url}{path}"
        resp = self.session.get(url, params=params or {}, timeout=self.timeout)
        resp.raise_for_status()
        return resp.json()
That is enough to get started. It is not enough for production. In practice you also want:
retries with backoff on 429 and transient 5xx responses.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class InsiderAPI:
    def __init__(self, base_url: str, api_key: str, timeout: int = 30):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
            "User-Agent": "sigma-journal-python-tutorial/1.0"
        })
        # Retry idempotent GETs on rate limits and transient server errors.
        # allowed_methods needs urllib3 >= 1.26.
        retry = Retry(
            total=5,
            backoff_factor=0.5,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET"]
        )
        adapter = HTTPAdapter(max_retries=retry)
        self.session.mount("https://", adapter)
        self.session.mount("http://", adapter)
        self.timeout = timeout

    def _get(self, path: str, params=None):
        url = f"{self.base_url}{path}"
        r = self.session.get(url, params=params or {}, timeout=self.timeout)
        r.raise_for_status()
        return r.json()
Dry, boring, useful. A good start.
Filings APIs often support filters such as:
from_date
to_date
country
issuer_id
insider_id
transaction_type
page
page_size
cursor
If your API offers both offset pagination and cursor pagination, prefer cursor pagination for large pulls. Offset pagination is simple. It is also a fine way to discover race conditions when new records arrive mid-download.
Pagination is where many research datasets become quietly wrong. Missing one page in a 162,000-filing archive will not trigger a dramatic exception. It will simply make your backtest a little more fictional.
A common response shape looks something like this:
{
    "results": [...],
    "page": 3,
    "page_size": 100,
    "total_pages": 57,
    "total_count": 5678
}
A straightforward loop:
def fetch_all_filings_offset(api: InsiderAPI, start_date: str, end_date: str) -> pd.DataFrame:
    page = 1
    rows = []
    while True:
        payload = api._get("/filings", params={
            "from_date": start_date,
            "to_date": end_date,
            "page": page,
            "page_size": 500
        })
        batch = payload.get("results", [])
        if not batch:
            break
        rows.extend(batch)
        total_pages = payload.get("total_pages")
        if total_pages is not None and page >= total_pages:
            break
        page += 1
    return pd.DataFrame(rows)
This works, with one caveat. If new filings are added while you are paging through the result set, later pages can shift. The fix is simple in principle and often neglected in practice: sort by a stable key, and where possible query a fixed historical window.
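Since a missed or shifted page fails silently, it may be worth checking the pull against the total_count field shown in the response shape above. The sketch below assumes each record carries a unique id field; check_offset_pull is a hypothetical helper name, not part of any API.

```python
from typing import Dict, List


def check_offset_pull(rows: List[Dict], total_count: int, id_field: str = "id") -> None:
    """Raise if an offset-paginated pull looks incomplete or shifted.

    Assumes every record has a unique `id_field` (an assumption about the schema).
    """
    ids = [row.get(id_field) for row in rows]
    unique_ids = set(ids)
    # Duplicate ids suggest the result set shifted while we were paging.
    if len(unique_ids) != len(ids):
        raise ValueError(f"{len(ids) - len(unique_ids)} duplicate ids in pull")
    # A shortfall against the server-reported count suggests a missed page.
    if len(ids) != total_count:
        raise ValueError(f"expected {total_count} rows, got {len(ids)}")
```

Calling it once after the loop, with the total_count from the first page, turns a quiet gap into a loud one.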
A more robust response shape:
{
    "results": [...],
    "next_cursor": "eyJpZCI6..."
}
The loop becomes:
def fetch_all_filings_cursor(api: InsiderAPI, start_date: str, end_date: str) -> pd.DataFrame:
    cursor = None
    rows = []
    while True:
        params = {
            "from_date": start_date,
            "to_date": end_date,
            "page_size": 500
        }
        if cursor:
            params["cursor"] = cursor
        payload = api._get("/filings", params=params)
        batch = payload.get("results", [])
        rows.extend(batch)
        cursor = payload.get("next_cursor")
        if not cursor:
            break
    return pd.DataFrame(rows)
Cursor pagination is usually the right answer for bulk extraction. If your API supports it, use it.
For daily research, do not re-pull the full archive every morning unless you enjoy wasting bandwidth and introducing failure points. Keep a local store and fetch incrementally:
query filed_at > last_seen_timestamp, less a short overlap window, so that late-arriving filings are fetched again.
That overlap window matters. Regulators and issuers are not always as punctual as your cron job.
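One way to sketch the incremental pull with an overlap window, assuming the local store is a pandas DataFrame keyed by an id column. The three-day overlap is an arbitrary illustration and both function names are hypothetical.

```python
import pandas as pd


def incremental_window_start(last_seen: pd.Timestamp, overlap_days: int = 3) -> pd.Timestamp:
    """Start the next pull slightly before the last filing we saw."""
    return last_seen - pd.Timedelta(days=overlap_days)


def upsert_filings(store: pd.DataFrame, fresh: pd.DataFrame, key: str = "id") -> pd.DataFrame:
    """Merge a fresh pull into the local store without duplicating rows."""
    combined = pd.concat([store, fresh], ignore_index=True)
    # keep="last" lets a re-fetched (possibly corrected) row replace the stored one.
    return combined.drop_duplicates(subset=[key], keep="last").reset_index(drop=True)
```

The overlap re-downloads a few days of rows on every run; the upsert is what makes that harmless.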
A raw filing payload is not yet a signal. It is an administrative object with nested fields, optional values, and enough edge cases to keep a parser employed.
For a first model, you usually want:
filing_id
issuer_id
issuer_name
ticker
insider_id
insider_name
insider_role
transaction_date
filing_date
transaction_type
shares
price
currency
value
ownership_nature if available, direct or indirect
security_type
If the API does not provide value, compute it as shares * price where sensible, and mark missing components explicitly.
def normalize_filings(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame({
        "filing_id": df.get("id"),
        "issuer_id": df.get("issuer_id"),
        "issuer_name": df.get("issuer_name"),
        "ticker": df.get("ticker"),
        "insider_id": df.get("insider_id"),
        "insider_name": df.get("insider_name"),
        "insider_role": df.get("insider_role"),
        "transaction_date": pd.to_datetime(df.get("transaction_date"), errors="coerce"),
        "filing_date": pd.to_datetime(df.get("filing_date"), errors="coerce"),
        "transaction_type": df.get("transaction_type"),
        "shares": pd.to_numeric(df.get("shares"), errors="coerce"),
        "price": pd.to_numeric(df.get("price"), errors="coerce"),
        "currency": df.get("currency"),
        "security_type": df.get("security_type"),
    })
    out["value"] = out["shares"] * out["price"]
    out["days_to_file"] = (out["filing_date"] - out["transaction_date"]).dt.days
    return out
This is intentionally plain. The point is to establish typed columns and basic diagnostics, not to win an award for decorative abstraction.
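If the feed sometimes supplies its own value field, the "compute it where sensible, and mark missing components" rule might look like the sketch below. The value_source column and the add_value name are illustrative assumptions, not part of any schema.

```python
import pandas as pd


def add_value(df: pd.DataFrame) -> pd.DataFrame:
    """Prefer the API-reported value, fall back to shares * price, and record which was used."""
    out = df.copy()
    shares = pd.to_numeric(out["shares"], errors="coerce")
    price = pd.to_numeric(out["price"], errors="coerce")
    computed = shares * price  # NaN whenever either component is missing
    if "value" in out.columns:
        reported = pd.to_numeric(out["value"], errors="coerce")
    else:
        reported = pd.Series(float("nan"), index=out.index)
    out["value"] = reported.fillna(computed)
    # Provenance flag, so downstream code can exclude or audit computed values.
    out["value_source"] = "missing"
    out.loc[computed.notna(), "value_source"] = "computed"
    out.loc[reported.notna(), "value_source"] = "reported"
    return out
```

Rows flagged "missing" keep a NaN value rather than a silent zero, which is the explicit marking the text asks for.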
A first insider strategy usually improves more from sensible exclusions than from sophisticated weighting. Consider filtering out:
The exact rules depend on jurisdiction and feed design. The principle is universal: if you cannot explain why a transaction belongs in the signal, it probably should not.
One filing can contain several transaction lines. If an executive buys in three tranches on one day, treating them as three independent signals may overstate breadth. A common fix is to aggregate to an event level:
That gives you a cleaner unit for ranking.
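A minimal sketch of that aggregation, assuming an event is one insider trading one issuer in one direction on one day. The key is an assumption; adjust it to whatever your feed treats as a single decision.

```python
import pandas as pd

# Assumed event key: same issuer, same insider, same day, same direction.
EVENT_KEY = ["issuer_id", "insider_id", "transaction_date", "transaction_type"]


def aggregate_events(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse transaction lines into one row per event."""
    events = (
        df.groupby(EVENT_KEY, dropna=False)
        .agg(
            shares=("shares", "sum"),
            value=("value", "sum"),
            n_lines=("shares", "size"),  # how many tranches were merged
        )
        .reset_index()
    )
    # Volume-weighted average price across the tranches (inf/NaN if shares sum to zero).
    events["avg_price"] = events["value"] / events["shares"]
    return events
```

Three tranches on one day then count as one buy with a blended price, not three votes of confidence.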
Now for the part people skip to. Fair enough.
The literature on insider trading has long documented that some insider purchases, especially open-market buys by senior officers and directors, contain information about future returns. The effect is not uniform, not free, and not immune to crowding. But it is a reasonable starting point for a simple model.
A first prototype can be:
Rank recent issuer-level insider buy activity by a conviction score, then select the top names for further testing.
That is not a finished strategy. It is a signal candidate.
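For each issuer over the last 30 calendar days, compute the four inputs the score below uses: total_buy_value, distinct_buyers, senior_buyer_flag, and days_since_last_buy.
Then define a score such as:
score =
    log1p(total_buy_value)
    + 0.75 * distinct_buyers
    + 1.0 * senior_buyer_flag
    - 0.05 * days_since_last_buy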
You can argue with the coefficients. You should. The point is to start with something legible.
For a first pass, focus on buys. Sells are noisier because people sell for many reasons: taxes, diversification, school fees, yachts. Buys require cash moving in the difficult direction. Markets tend to notice.
import numpy as np
import pandas as pd


def build_buy_signal(df: pd.DataFrame, as_of_date: str) -> pd.DataFrame:
    as_of = pd.Timestamp(as_of_date)
    x = df.copy()
    x = x[
        (x["transaction_type"].str.lower().isin(["buy", "purchase", "acquisition"])) &
        (x["transaction_date"].notna()) &
        (x["transaction_date"] <= as_of) &
        (x["transaction_date"] >= as_of - pd.Timedelta(days=30)) &
        # Only use filings that were observable by as_of, to avoid look-ahead.
        (x["filing_date"] <= as_of)
    ].copy()  # copy so the column assignment below does not hit a view
    x["is_senior"] = x["insider_role"].fillna("").str.lower().str.contains(
        "ceo|chief|cfo|director|chair"
    )
    grouped = (
        x.groupby(["issuer_id", "issuer_name", "ticker"], dropna=False)
        .agg(
            total_buy_value=("value", "sum"),
            distinct_buyers=("insider_id", "nunique"),
            senior_buyer_flag=("is_senior", "max"),
            last_buy_date=("transaction_date", "max")
        )
        .reset_index()
    )
    grouped["days_since_last_buy"] = (as_of - grouped["last_buy_date"]).dt.days
    grouped["score"] = (
        np.log1p(grouped["total_buy_value"].fillna(0))
        + 0.75 * grouped["distinct_buyers"].fillna(0)
        + 1.0 * grouped["senior_buyer_flag"].astype(int)
        - 0.05 * grouped["days_since_last_buy"].fillna(30)
    )
    return grouped.sort_values("score", ascending=False)
This produces a ranked table. It does not yet answer whether the signal works. It does answer whether your pipeline can generate a daily candidate list without improvisation.
That is enough to create a repeatable research loop.
Insider data has a special talent for looking cleaner than it is. A filing is a formal document, therefore one feels reassured. That reassurance should be rationed.
This is the classic error. If you rank securities using transaction_date but could only have observed the filing on filing_date or publication timestamp, your backtest is cheating. Quite a lot of published nonsense begins this way.
The correct date for signal availability is the earliest timestamp at which your system could have seen the filing. Depending on your feed, that may be:
created_at or published_at.
If all you have is a filing date with no intraday time, assume the signal becomes tradable no earlier than the next session, unless your market convention and feed timing support something more precise.
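The date-only rule can be sketched as follows. This version uses pandas business days as a stand-in for trading sessions, which is an assumption: it skips weekends but knows nothing about exchange holidays, so a real implementation would swap in a market calendar. The function name is hypothetical.

```python
import pandas as pd


def signal_available_on(df: pd.DataFrame) -> pd.Series:
    """Earliest tradable date when only a filing date (no intraday time) is known.

    Approximates "next session" as the next weekday; exchange holidays are ignored.
    """
    filing_day = pd.to_datetime(df["filing_date"], errors="coerce").dt.normalize()
    return filing_day + pd.offsets.BDay(1)
```

A Friday filing becomes tradable on Monday under this rule, and the backtest should rank on this column rather than on transaction_date.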
A filing can be corrected, refiled, or represented in multiple rows. If your API exposes amendment flags or versioning, use them. If not, deduplicate on a conservative key such as issuer, insider, transaction date, shares, price, and direction, while keeping the raw records archived.
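When no amendment flag exists, the conservative key described above could be applied like this. The column names follow the normalized table used in this piece; keeping the most recently filed row is a design choice, not a rule.

```python
import pandas as pd

# Conservative key from the text: issuer, insider, transaction date, shares, price, direction.
DEDUP_KEY = ["issuer_id", "insider_id", "transaction_date", "shares", "price", "transaction_type"]


def deduplicate_filings(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per key, keeping the latest filing; the input frame is left untouched."""
    ordered = df.sort_values("filing_date", kind="stable")
    return ordered.drop_duplicates(subset=DEDUP_KEY, keep="last").reset_index(drop=True)
```

Because the function returns a new frame, the raw records stay available for audit.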
An insider role field that says Officer in one case and Chief Financial Officer in another is not a taxonomy, it is a suggestion. Build a mapping layer. Keep it versioned. Expect to revise it.
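A mapping layer can start as an ordered list of patterns with a version tag. Everything in this sketch is illustrative: the labels, the patterns, and the version string are assumptions to be replaced by your own taxonomy.

```python
import re

import pandas as pd

ROLE_MAP_VERSION = "v1"  # hypothetical tag; bump it whenever the patterns change

# Ordered: the first matching pattern wins, so specific titles come before generic ones.
ROLE_PATTERNS = [
    ("CEO", r"\bceo\b|chief executive"),
    ("CFO", r"\bcfo\b|chief financial"),
    ("CHAIR", r"\bchair"),
    ("DIRECTOR", r"\bdirector\b"),
    ("OFFICER", r"\bofficer\b"),
]


def map_role(raw_role) -> str:
    """Map a free-text role to a coarse label; unknown or missing roles become OTHER."""
    text = str(raw_role).lower() if pd.notna(raw_role) else ""
    for label, pattern in ROLE_PATTERNS:
        if re.search(pattern, text):
            return label
    return "OTHER"
```

Storing ROLE_MAP_VERSION alongside each mapped row makes later revisions traceable.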
If your universe spans multiple markets, notional values in local currency are not directly comparable. Convert to a base currency using a consistent FX source if your score uses value. If you do not have reliable FX mapping yet, rank within market first. Better a modestly scoped model than a global one held together by wishful arithmetic.
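Ranking within market first can be as small as a grouped percentile rank. The sketch assumes the signal table carries a market identifier, here a hypothetical country column, alongside the score.

```python
import pandas as pd


def rank_within_market(signal: pd.DataFrame, market_col: str = "country") -> pd.DataFrame:
    """Add a percentile rank of score computed separately inside each market."""
    out = signal.copy()
    # pct=True scales ranks to (0, 1], so the top name in every market scores 1.0.
    out["score_pct_in_market"] = out.groupby(market_col)["score"].rank(pct=True)
    return out
```

The percentile is comparable across markets even when the underlying notional values are not.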
Below is a compact script that ties the pieces together. It assumes a /filings endpoint and a bearer token.
import pandas as pd
import numpy as np
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class InsiderAPI:
    def __init__(self, base_url: str, api_key: str, timeout: int = 30):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
            "User-Agent": "sigma-journal-python-tutorial/1.0"
        })
        # Retry idempotent GETs on rate limits and transient server errors.
        retry = Retry(
            total=5,
            backoff_factor=0.5,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET"]
        )
        adapter = HTTPAdapter(max_retries=retry)
        self.session.mount("https://", adapter)
        self.timeout = timeout

    def get(self, path: str, params=None):
        r = self.session.get(f"{self.base_url}{path}", params=params or {}, timeout=self.timeout)
        r.raise_for_status()
        return r.json()

    def fetch_filings(self, start_date: str, end_date: str) -> pd.DataFrame:
        # Cursor pagination: keep requesting until the API stops returning a cursor.
        cursor = None
        rows = []
        while True:
            params = {
                "from_date": start_date,
                "to_date": end_date,
                "page_size": 500
            }
            if cursor:
                params["cursor"] = cursor
            payload = self.get("/filings", params=params)
            batch = payload.get("results", [])
            rows.extend(batch)
            cursor = payload.get("next_cursor")
            if not cursor:
                break
        return pd.DataFrame(rows)


def normalize_filings(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame({
        "filing_id": df.get("id"),
        "issuer_id": df.get("issuer_id"),
        "issuer_name": df.get("issuer_name"),
        "ticker": df.get("ticker"),
        "insider_id": df.get("insider_id"),
        "insider_name": df.get("insider_name"),
        "insider_role": df.get("insider_role"),
        "transaction_date": pd.to_datetime(df.get("transaction_date"), errors="coerce"),
        "filing_date": pd.to_datetime(df.get("filing_date"), errors="coerce"),
        "transaction_type": df.get("transaction_type"),
        "shares": pd.to_numeric(df.get("shares"), errors="coerce"),
        "price": pd.to_numeric(df.get("price"), errors="coerce"),
        "currency": df.get("currency"),
        "security_type": df.get("security_type"),
    })
    out["value"] = out["shares"] * out["price"]
    return out


def build_signal(df: pd.DataFrame, as_of_date: str) -> pd.DataFrame:
    as_of = pd.Timestamp(as_of_date)
    x = df.copy()
    x["transaction_type"] = x["transaction_type"].fillna("").str.lower()
    x["insider_role"] = x["insider_role"].fillna("")
    x = x[
        x["transaction_type"].isin(["buy", "purchase", "acquisition"]) &
        x["transaction_date"].between(as_of - pd.Timedelta(days=30), as_of) &
        # Only use filings that were observable by as_of, to avoid look-ahead.
        (x["filing_date"] <= as_of)
    ].copy()  # copy so the column assignment below does not hit a view
    x["is_senior"] = x["insider_role"].str.lower().str.contains("ceo|chief|cfo|director|chair")
    x = x.dropna(subset=["issuer_id", "transaction_date"])
    signal = (
        x.groupby(["issuer_id", "issuer_name", "ticker"], dropna=False)
        .agg(
            total_buy_value=("value", "sum"),
            distinct_buyers=("insider_id", "nunique"),
            senior_buyer_flag=("is_senior", "max"),
            last_buy_date=("transaction_date", "max")
        )
        .reset_index()
    )
    signal["days_since_last_buy"] = (as_of - signal["last_buy_date"]).dt.days
    signal["score"] = (
        np.log1p(signal["total_buy_value"].fillna(0))
        + 0.75 * signal["distinct_buyers"].fillna(0)
        + 1.0 * signal["senior_buyer_flag"].astype(int)
        - 0.05 * signal["days_since_last_buy"].fillna(30)
    )
    return signal.sort_values("score", ascending=False)


if __name__ == "__main__":
    api = InsiderAPI(base_url="https://api.example.com", api_key="YOUR_API_KEY")
    raw = api.fetch_filings(start_date="2026-01-01", end_date="2026-03-31")
    filings = normalize_filings(raw)
    signal = build_signal(filings, as_of_date="2026-03-31")
    signal.to_csv("daily_insider_signal.csv", index=False)
    print(signal.head(20))
This is enough to produce a ranked file. It is also enough to reveal where your schema differs from your assumptions, which is the real educational value of any first script.
The next upgrades are predictable:
At that point you have moved from tutorial to tooling.
A simple score is fine. An untested score is decoration.
Before any backtest, run these checks:
Are filing counts stable over time, or did your pull miss a week?
What share of records have missing ticker, price, shares, or transaction_date?
How many duplicate filing IDs or event keys exist?
What is the distribution of filing_date - transaction_date?
If any of these produce surprises, postpone your alpha ambitions and fix the data.
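The four checks above fit in one small report function. This is a sketch over the normalized column names used earlier; what counts as a surprise is left to the reader, and the function name is hypothetical.

```python
import pandas as pd


def data_quality_report(df: pd.DataFrame) -> dict:
    """Summarize gaps, missing fields, duplicates, and filing lags in a normalized filings table."""
    filing_date = pd.to_datetime(df["filing_date"], errors="coerce")
    transaction_date = pd.to_datetime(df["transaction_date"], errors="coerce")

    # Weekly filing counts, with zero-filled weeks so gaps are visible.
    weekly = filing_date.dropna().dt.to_period("W").value_counts().sort_index()
    if not weekly.empty:
        full_range = pd.period_range(weekly.index.min(), weekly.index.max(), freq="W")
        weekly = weekly.reindex(full_range, fill_value=0)

    lag_days = (filing_date - transaction_date).dt.days
    key_fields = ["ticker", "price", "shares", "transaction_date"]
    return {
        "empty_weeks": int((weekly == 0).sum()),
        "missing_share": df[key_fields].isna().mean().to_dict(),
        "duplicate_filing_ids": int(df["filing_id"].duplicated().sum()),
        "lag_days_quantiles": lag_days.quantile([0.5, 0.95]).to_dict(),
        # A filing dated before its transaction usually signals a parsing or data error.
        "negative_lags": int((lag_days < 0).sum()),
    }
```

Running it on every pull and logging the result gives a cheap record of when the data changed shape.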
Even a simple event study is informative:
Because the article brief did not include return data or a live backtest extract, I will not fabricate Sharpe ratios for your entertainment. The correct figure here is n/a.
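The mechanics of a simple event study can still be sketched without any results attached. The code below assumes two inputs that are not part of this article's data: a wide price table (one row per session, one column per ticker) and an events table with ticker and available_date columns. Entering at the close of the first session on or after availability is itself an assumption.

```python
import numpy as np
import pandas as pd


def forward_returns(prices: pd.DataFrame, events: pd.DataFrame, horizon: int = 20) -> pd.Series:
    """Forward return over `horizon` sessions from the first session on or after availability.

    prices: sorted DatetimeIndex of sessions, one column per ticker (assumed layout).
    events: columns `ticker` and `available_date` (assumed layout).
    """
    results = []
    for ticker, available in zip(events["ticker"], events["available_date"]):
        if ticker not in prices.columns or pd.isna(available):
            results.append(np.nan)
            continue
        entry = prices.index.searchsorted(pd.Timestamp(available), side="left")
        exit_pos = entry + horizon
        if exit_pos >= len(prices.index):
            results.append(np.nan)  # not enough history after the event
            continue
        series = prices[ticker]
        results.append(series.iloc[exit_pos] / series.iloc[entry] - 1.0)
    return pd.Series(results, index=events.index, name=f"fwd_ret_{horizon}d")
```

Averaging these by score bucket, net of a benchmark, is the usual next step; the numbers it produces are yours to compute, not ours to invent.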
Once the basic buy signal runs, the next sensible extensions are:
Differentiate between CEO, CFO, chair, and non-executive director.
Reward multiple insiders buying the same issuer within a short window.
Scale transaction value by insider compensation proxy, market cap, or average daily volume if available.
Different jurisdictions encode insider roles, deadlines, and transaction categories differently. Harmonization is work, not a footnote.
A brilliant signal in names you cannot trade is a hobby.
| Market | Regulator | Rule | Deadline | Notes |
|---|---|---|---|---|
| FR | AMF | MAR Art 19 | T+3 | Persons discharging managerial responsibilities and closely associated persons must notify transactions promptly, no later than three business days. |
| EU | ESMA and national competent authorities | MAR Art 19 | T+3 | Core framework is harmonised under MAR, but dissemination and formatting differ by venue and authority. |
| US | SEC | Exchange Act Section 16, Form 4 | T+2 | Most reportable insider transactions by officers, directors, and 10 percent owners are due within two business days. |
If you are consuming a REST API rather than scraping regulator websites directly, your research quality depends on two layers of timeliness:
That means your model should record both the regulatory event date and the API-observed availability timestamp where possible. If your endpoint exposes created_at, updated_at, or published_at, keep them all. Storage is cheap. False precision in backtests is expensive.
A good API tutorial should leave you with something operational, not merely inspired. By the end of this workflow, your target deliverable is simple: a script or notebook that can authenticate, fetch filings incrementally, normalize them into a stable table, compute a transparent buy-conviction score, and export a daily signal file.
That is not the final model. It is the first honest one.
The concrete next step is to wire this signal into a small event-study backtest using filing availability timestamps rather than transaction dates, then test whether cluster buying by senior insiders adds incremental predictive power after basic liquidity filters. If it does, you have a research programme. If it does not, at least your pagination will still be correct, which is more than can be said for many grander efforts.
Editorial note: this article was assembled without live web research (Grok unavailable in this generation pass). Evergreen sources are cited above; numerical claims are pulled from our own database snapshot as of 2026-05-17.
Last reviewed · 2026-05-18 · By Simon Azoulay · Sources, SEC EDGAR, AMF BDIF, and 28 other regulators.