Building an Insider Filings Model with Python and REST API, Sigma Journal

InsidersTradesSigma

Authentication and endpoint design

A REST API for filings usually exposes some combination of these resources:

/filings
/issuers
/insiders
/transactions
/symbols or /instruments

The exact naming varies. The pattern does not.

API key auth in Python

Most APIs in this category use bearer-token authentication or an API key header. A clean pattern is to wrap a requests.Session so every request shares headers, timeout rules, and retry logic.

Example client skeleton

from typing import Dict, Optional

import pandas as pd
import requests

class InsiderAPI:
    def __init__(self, base_url: str, api_key: str, timeout: int = 30):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
            "User-Agent": "sigma-journal-python-tutorial/1.0"
        })
        self.timeout = timeout

    def _get(self, path: str, params: Optional[Dict] = None) -> Dict:
        url = f"{self.base_url}{path}"
        resp = self.session.get(url, params=params or {}, timeout=self.timeout)
        resp.raise_for_status()
        return resp.json()

That is enough to get started. It is not enough for production. In practice you also want:

retries for 429 and transient 5xx,
logging of path, params, status, and latency,
optional backoff,
a consistent way to save raw payloads.

A less fragile request method

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class InsiderAPI:
    def __init__(self, base_url: str, api_key: str, timeout: int = 30):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
            "User-Agent": "sigma-journal-python-tutorial/1.0"
        })
        retry = Retry(
            total=5,
            backoff_factor=0.5,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET"]
        )
        adapter = HTTPAdapter(max_retries=retry)
        self.session.mount("https://", adapter)
        self.session.mount("http://", adapter)
        self.timeout = timeout

    def _get(self, path: str, params=None):
        url = f"{self.base_url}{path}"
        r = self.session.get(url, params=params or {}, timeout=self.timeout)
        r.raise_for_status()
        return r.json()

Dry, boring, useful. A good start.
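The retry adapter covers the first item on the production list. Logging and raw-payload storage can be sketched separately. The helper names below (`save_raw_payload`, `timed_get`), the hashed file naming, and the directory layout are illustrative assumptions, not part of any particular API. The wrapper takes any callable with the same shape as `_get`, so it logs path, params, and latency; logging the HTTP status would need access to the response object itself.

```python
import hashlib
import json
import logging
import time
from pathlib import Path
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("insider_api")  # hypothetical logger name

def save_raw_payload(payload: Dict[str, Any], directory: str, path: str,
                     params: Optional[Dict[str, Any]] = None) -> Path:
    """Write one parsed response to disk under a name derived from the request."""
    key = json.dumps({"path": path, "params": params or {}}, sort_keys=True)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{digest}.json"
    out_path.write_text(json.dumps(payload, sort_keys=True), encoding="utf-8")
    return out_path

def timed_get(fetch: Callable[[str, Optional[Dict]], Dict], path: str,
              params: Optional[Dict] = None, raw_dir: Optional[str] = None) -> Dict:
    """Call fetch(path, params), log the latency, and optionally archive the payload."""
    started = time.perf_counter()
    payload = fetch(path, params)
    elapsed_ms = (time.perf_counter() - started) * 1000
    logger.info("GET %s params=%s latency_ms=%.1f", path, params, elapsed_ms)
    if raw_dir is not None:
        save_raw_payload(payload, raw_dir, path, params)
    return payload
```

With the client above, a call might look like `timed_get(api._get, "/filings", {"page_size": 500}, raw_dir="raw")`. Because the file name is a hash of path and params, repeating an identical request overwrites the earlier file; add a timestamp to the name if you want every pull kept.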

Common endpoint parameters

Filings APIs often support filters such as:

from_date
to_date
country
issuer_id
insider_id
transaction_type
page
page_size
cursor

If your API offers both offset pagination and cursor pagination, prefer cursor pagination for large pulls. Offset pagination is simple. It is also a fine way to discover race conditions when new records arrive mid-download.

Pagination without losing records

Pagination is where many research datasets become quietly wrong. Missing one page in a 162,000-filing archive will not trigger a dramatic exception. It will simply make your backtest a little more fictional.

Offset pagination

A common response shape looks something like this:

{
  "results": [...],
  "page": 3,
  "page_size": 100,
  "total_pages": 57,
  "total_count": 5678
}

A straightforward loop:

def fetch_all_filings_offset(api: InsiderAPI, start_date: str, end_date: str) -> pd.DataFrame:
    page = 1
    rows = []

    while True:
        payload = api._get("/filings", params={
            "from_date": start_date,
            "to_date": end_date,
            "page": page,
            "page_size": 500
        })
        batch = payload.get("results", [])
        if not batch:
            break

        rows.extend(batch)

        total_pages = payload.get("total_pages")
        if total_pages is not None and page >= total_pages:
            break

        page += 1

    return pd.DataFrame(rows)

This works, with one caveat. If new filings are added while you are paging through the result set, later pages can shift. The fix is simple in principle and often neglected in practice: sort by a stable key, and where possible query a fixed historical window.
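The page-shift problem is easy to reproduce without a network. The sketch below uses an in-memory list as a stand-in for the server; `get_page` and the integer filing IDs are invented for illustration.

```python
from typing import List

def get_page(records: List[int], page: int, page_size: int) -> List[int]:
    """Return one 1-indexed page from a list standing in for the server's result set."""
    start = (page - 1) * page_size
    return records[start:start + page_size]

# Newest-first ordering: a higher ID means a newer filing.
newest_first = [5, 4, 3, 2, 1]
page_1 = get_page(newest_first, page=1, page_size=2)      # [5, 4]
newest_first.insert(0, 6)                                  # a new filing arrives mid-download
page_2 = get_page(newest_first, page=2, page_size=2)      # [4, 3]: ID 4 is served twice

# Stable ascending ordering: new filings land at the end, so earlier pages stay put.
oldest_first = [1, 2, 3, 4, 5]
page_1_asc = get_page(oldest_first, page=1, page_size=2)  # [1, 2]
oldest_first.append(6)
page_2_asc = get_page(oldest_first, page=2, page_size=2)  # [3, 4]
```

An insertion produces a duplicate, which deduplication on filing ID can absorb. A deletion or re-sort shifts records the other way and silently skips one, which nothing downstream will catch.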

Cursor pagination

A more robust response shape:

{
  "results": [...],
  "next_cursor": "eyJpZCI6..."
}

The loop becomes:

def fetch_all_filings_cursor(api: InsiderAPI, start_date: str, end_date: str) -> pd.DataFrame:
    cursor = None
    rows = []

    while True:
        params = {
            "from_date": start_date,
            "to_date": end_date,
            "page_size": 500
        }
        if cursor:
            params["cursor"] = cursor

        payload = api._get("/filings", params=params)
        batch = payload.get("results", [])
        rows.extend(batch)

        cursor = payload.get("next_cursor")
        if not cursor:
            break

    return pd.DataFrame(rows)

Cursor pagination is usually the right answer for bulk extraction. If your API supports it, use it.

Incremental syncs beat heroic re-downloads

For daily research, do not re-pull the full archive every morning unless you enjoy wasting bandwidth and introducing failure points. Keep a local store and fetch incrementally:

pull filings with filed_at > last_seen_timestamp,
upsert by unique filing ID,
re-query a small overlap window, for example the last 2 to 7 days, to catch corrections or late updates.

That overlap window matters. Regulators and issuers are not always as punctual as your cron job.
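A minimal sketch of the overlap-and-upsert step, assuming the local store is a pandas DataFrame keyed by a `filing_id` column. The function names and the three-day default are illustrative choices.

```python
import pandas as pd

def sync_start(last_seen: pd.Timestamp, overlap_days: int = 3) -> pd.Timestamp:
    """Start of the next pull: the last seen timestamp minus the overlap window."""
    return last_seen - pd.Timedelta(days=overlap_days)

def upsert_filings(store: pd.DataFrame, fresh: pd.DataFrame,
                   key: str = "filing_id") -> pd.DataFrame:
    """Append fresh rows and keep the most recently fetched version of each filing."""
    combined = pd.concat([store, fresh], ignore_index=True)
    return combined.drop_duplicates(subset=[key], keep="last").reset_index(drop=True)

store = pd.DataFrame({"filing_id": ["a", "b"], "shares": [100, 200]})
fresh = pd.DataFrame({"filing_id": ["b", "c"], "shares": [250, 300]})
merged = upsert_filings(store, fresh)  # "b" now carries the corrected 250 shares
```

Because `fresh` is concatenated after `store`, `keep="last"` lets a re-fetched filing overwrite the stored copy, which is what picks up corrections inside the overlap window.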

From JSON to a research table

A raw filing payload is not yet a signal. It is an administrative object with nested fields, optional values, and enough edge cases to keep a parser employed.

Fields worth extracting first

For a first model, you usually want:

filing_id
issuer_id
issuer_name
ticker
insider_id
insider_name
insider_role
transaction_date
filing_date
transaction_type
shares
price
currency
value
ownership_nature (direct or indirect), if available
security_type

If the API does not provide value, compute it as shares * price where sensible, and mark missing components explicitly.
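One way to make the missing components explicit is a companion column recording where each value came from. This is a sketch under assumptions: the `value_source` column and its three labels are invented here, and "where sensible" is read as "price is positive".

```python
import numpy as np
import pandas as pd

def fill_value(df: pd.DataFrame) -> pd.DataFrame:
    """Prefer the API's value; otherwise use shares * price; record which was used."""
    out = df.copy()
    shares = pd.to_numeric(out["shares"], errors="coerce")
    price = pd.to_numeric(out["price"], errors="coerce")
    if "value" in out.columns:
        reported = pd.to_numeric(out["value"], errors="coerce")
    else:
        reported = pd.Series(np.nan, index=out.index)
    # Only compute a notional where the price is positive; otherwise leave it missing.
    computed = (shares * price).where(price > 0)
    out["value"] = reported.fillna(computed)
    out["value_source"] = np.where(
        reported.notna(), "reported",
        np.where(computed.notna(), "computed", "missing"),
    )
    return out
```

Rows labelled `missing` can then be counted, excluded, or investigated instead of being summed into totals as if nothing were wrong.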

Normalization example

def normalize_filings(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame({
        "filing_id": df.get("id"),
        "issuer_id": df.get("issuer_id"),
        "issuer_name": df.get("issuer_name"),
        "ticker": df.get("ticker"),
        "insider_id": df.get("insider_id"),
        "insider_name": df.get("insider_name"),
        "insider_role": df.get("insider_role"),
        "transaction_date": pd.to_datetime(df.get("transaction_date"), errors="coerce"),
        "filing_date": pd.to_datetime(df.get("filing_date"), errors="coerce"),
        "transaction_type": df.get("transaction_type"),
        "shares": pd.to_numeric(df.get("shares"), errors="coerce"),
        "price": pd.to_numeric(df.get("price"), errors="coerce"),
        "currency": df.get("currency"),
        "security_type": df.get("security_type"),
    })

    out["value"] = out["shares"] * out["price"]
    out["days_to_file"] = (out["filing_date"] - out["transaction_date"]).dt.days

    return out

This is intentionally plain. The point is to establish typed columns and basic diagnostics, not to win an award for decorative abstraction.

The cleaning rules that matter most

A first insider strategy usually improves more from sensible exclusions than from sophisticated weighting. Consider filtering out:

non-common-share instruments if your thesis is about equity conviction,
tiny transactions below a minimum notional threshold,
automatic or plan-based transactions where flagged,
amended filings if they duplicate prior records and your schema does not already consolidate them,
transactions with missing dates or impossible prices.

The exact rules depend on jurisdiction and feed design. The principle is universal: if you cannot explain why a transaction belongs in the signal, it probably should not.
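The exclusions above can be written as one boolean mask over the normalized table. Everything specific here is an assumption to adapt to your feed: the 10,000 minimum notional, the `common` and `ordinary` security-type labels, and the optional `is_plan_trade` flag column.

```python
import pandas as pd

def apply_exclusions(df: pd.DataFrame, min_value: float = 10_000.0,
                     common_types=("common", "ordinary")) -> pd.DataFrame:
    """Keep common-share transactions with usable dates, prices, and size."""
    keep = (
        df["security_type"].fillna("").str.lower().isin(common_types)
        & (df["value"] >= min_value)          # drops tiny and missing notionals
        & df["transaction_date"].notna()
        & df["filing_date"].notna()
        & (df["price"] > 0)                   # drops zero, negative, and missing prices
    )
    # Hypothetical flag for automatic or plan-based trades; applied only if present.
    if "is_plan_trade" in df.columns:
        keep &= ~df["is_plan_trade"].eq(True)
    return df[keep].copy()
```

Amended-filing handling is left out of the mask on purpose, since it needs a comparison across rows.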

Grouping multiple line items into one event

One filing can contain several transaction lines. If an executive buys in three tranches on one day, treating them as three independent signals may overstate breadth. A common fix is to aggregate to an event level:

group by issuer, insider, transaction date, and transaction direction,
sum shares and value,
keep the earliest filing timestamp visible to the strategy.

That gives you a cleaner unit for ranking.
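A sketch of that aggregation in pandas, assuming the normalized column names used earlier and treating `transaction_type` as the direction field. The output names `first_filing_date` and `line_items` are illustrative.

```python
import pandas as pd

def aggregate_events(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse transaction lines into one row per issuer, insider, day, and direction."""
    keys = ["issuer_id", "insider_id", "transaction_date", "transaction_type"]
    return (
        df.groupby(keys, as_index=False)
          .agg(
              shares=("shares", "sum"),
              value=("value", "sum"),
              first_filing_date=("filing_date", "min"),  # earliest date the event was visible
              line_items=("filing_id", "count"),
          )
    )
```

Note that `groupby` drops rows with a missing key by default, so lines without an issuer, insider, or transaction date disappear here unless they are handled upstream.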

A simple insider alpha prototype

Now for the part people skip to. Fair enough.

The literature on insider trading has long documented that some insider purchases, especially open-market buys by senior officers and directors, contain information about future returns. The effect is not uniform, not free, and not immune to crowding. But it is a reasonable starting point for a simple model.

The simplest workable thesis

A first prototype can be:

Rank recent issuer-level insider buy activity by a conviction score, then select the top names for further testing.

That is not a finished strategy. It is a signal candidate.

A practical conviction score

For each issuer over the last 30 calendar days, compute:

total buy value,
number of distinct insiders buying,
whether at least one buyer is a director or C-level executive,
recency of the latest buy.

Then define a score such as:

score =
    log1p(total_buy_value)
    + 0.75 * distinct_buyers
    + 1.0 * senior_buyer_flag
    - 0.05 * days_since_last_buy

You can argue with the coefficients. You should. The point is to start with something legible.
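A worked example makes the relative weights easier to argue about. The function below is a direct transcription of the formula above; the input figures are made up for illustration.

```python
import math

def conviction_score(total_buy_value: float, distinct_buyers: int,
                     senior_buyer_flag: bool, days_since_last_buy: int) -> float:
    """Score one issuer using the coefficients from the formula above."""
    return (
        math.log1p(total_buy_value)
        + 0.75 * distinct_buyers
        + 1.0 * int(senior_buyer_flag)
        - 0.05 * days_since_last_buy
    )

# 250,000 bought by two insiders, one of them senior, latest buy four days ago.
base = conviction_score(250_000, 2, True, 4)     # roughly 12.43 + 1.5 + 1.0 - 0.2 = 14.73
doubled = conviction_score(500_000, 2, True, 4)  # same issuer with twice the notional
uplift = doubled - base                          # about 0.69, i.e. ln(2)
```

Doubling the notional adds about 0.69 to the score, slightly less than one extra distinct buyer at 0.75. Whether breadth should outrank size to that degree is exactly the kind of assumption worth testing.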

Why buys, not sells

For a first pass, focus on buys. Sells are noisier because people sell for many reasons: taxes, diversification, school fees, yachts. Buys require cash moving in the difficult direction. Markets tend to notice.

Example signal code

import numpy as np

def build_buy_signal(df: pd.DataFrame, as_of_date: str) -> pd.DataFrame:
    as_of = pd.Timestamp(as_of_date)

    # Filter first, then copy, so the column assignment below does not write to a view of df.
    x = df[
        (df["transaction_type"].str.lower().isin(["buy", "purchase", "acquisition"])) &
        (df["transaction_date"].notna()) &
        (df["transaction_date"] <= as_of) &
        (df["transaction_date"] >= as_of - pd.Timedelta(days=30))
    ].copy()

    x["is_senior"] = x["insider_role"].fillna("").str.lower().str.contains(
        "ceo|chief|cfo|director|chair"
    )

    grouped = (
        x.groupby(["issuer_id", "issuer_name", "ticker"], dropna=False)
         .agg(
             total_buy_value=("value", "sum"),
             distinct_buyers=("insider_id", "nunique"),
             senior_buyer_flag=("is_senior", "max"),
             last_buy_date=("transaction_date", "max")
         )
         .reset_index()
    )

    grouped["days_since_last_buy"] = (as_of - grouped["last_buy_date"]).dt.days
    grouped["score"] = (
        np.log1p(grouped["total_buy_value"].fillna(0))
        + 0.75 * grouped["distinct_buyers"].fillna(0)
        + 1.0 * grouped["senior_buyer_flag"].astype(int)
        - 0.05 * grouped["days_since_last_buy"].fillna(30)
    )

    return grouped.sort_values("score", ascending=False)

This produces a ranked table. It does not yet answer whether the signal works. It does answer whether your pipeline can generate a daily candidate list without improvisation.

A sample daily workflow

Pull new filings from the API.
Normalize and append to local storage.
Rebuild the latest rolling 30-day event set.
Compute issuer-level scores.
Export top-ranked names to CSV, database, or downstream backtest engine.

That is enough to create a repeatable research loop.

The avoidable mistakes in insider-data modelling

Insider data has a special talent for looking cleaner than it is. A filing is a formal document, therefore one feels reassured. That reassurance should be rationed.

Look-ahead bias via the wrong date

This is the classic error. If you rank securities using transaction_date but could only have observed the filing on filing_date or publication timestamp, your backtest is cheating. Quite a lot of published nonsense begins this way.

The correct date for signal availability is the earliest timestamp at which your system could have seen the filing. Depending on your feed, that may be:

regulator publication time,
vendor ingestion time,
your own API's created_at or published_at.

If all you have is a filing date with no intraday time, assume the signal becomes tradable no earlier than the next session, unless your market convention and feed timing support something more precise.
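For the date-only case, the next-session rule can be sketched with a business-day offset. This assumes the normalized `filing_date` column from earlier; the `available_from` column and both function names are illustrative. `pd.offsets.BDay` knows weekends but not exchange holidays, so a real session calendar would be needed for anything beyond a prototype.

```python
import pandas as pd

def add_available_from(df: pd.DataFrame) -> pd.DataFrame:
    """Mark each filing as tradable from the business day after its filing date."""
    out = df.copy()
    out["available_from"] = out["filing_date"].dt.normalize() + pd.offsets.BDay(1)
    return out

def visible_as_of(df: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Return only the filings the strategy could have acted on by as_of."""
    return df[df["available_from"] <= pd.Timestamp(as_of)]

filings = pd.DataFrame({
    "filing_id": ["f1", "f2"],
    "filing_date": pd.to_datetime(["2026-03-27", "2026-03-30"]),  # a Friday and a Monday
})
dated = add_available_from(filings)
usable = visible_as_of(dated, "2026-03-30")  # only f1: the Friday filing, usable Monday
```

In a backtest, applying `visible_as_of` before any ranking step keeps filings out of the signal until they could have been seen.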

Duplicate and amended filings

A filing can be corrected, refiled, or represented in multiple rows. If your API exposes amendment flags or versioning, use them. If not, deduplicate on a conservative key such as issuer, insider, transaction date, shares, price, and direction, while keeping the raw records archived.
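A sketch of the fallback deduplication, assuming the normalized column names used earlier and that the earliest-filed copy is the one to keep. The function returns a new frame and leaves its input untouched, so the raw records stay available.

```python
import pandas as pd

DEDUPE_KEY = ["issuer_id", "insider_id", "transaction_date",
              "shares", "price", "transaction_type"]

def dedupe_filings(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the earliest-filed row for each identical transaction."""
    ordered = df.sort_values("filing_date", kind="stable")
    return ordered.drop_duplicates(subset=DEDUPE_KEY, keep="first").sort_index()
```

The key is conservative in the sense that two rows must agree on every field to count as duplicates. A correction that changes the share count or price will survive as a separate row and needs the amendment flag, or manual rules, to resolve.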

Role classification is messy

An insider role field that says Officer in one case and Chief Financial Officer in another is not a taxonomy, it is a suggestion. Build a mapping layer. Keep it versioned. Expect to revise it.
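A mapping layer can start as an ordered list of patterns with a version string attached. The categories, patterns, and version label below are illustrative assumptions, not a standard taxonomy; the first matching rule wins, so order encodes precedence.

```python
import re
from typing import Optional

ROLE_MAP_VERSION = "v1"  # hypothetical label; bump it whenever the rules change

# Ordered rules: the first pattern that matches decides the category.
ROLE_RULES = [
    (re.compile(r"\b(ceo|chief executive)\b"), "ceo"),
    (re.compile(r"\b(cfo|chief financial)\b"), "cfo"),
    (re.compile(r"\bchair(man|woman|person)?\b"), "chair"),
    (re.compile(r"\bchief\b"), "other_c_level"),
    (re.compile(r"\bdirector\b"), "director"),
    (re.compile(r"\bofficer\b"), "officer"),
]

def classify_role(raw: Optional[str]) -> str:
    """Map a free-text role to a coarse category, or 'unclassified' if nothing matches."""
    if not isinstance(raw, str):
        return "unclassified"  # covers None and NaN values coming from pandas
    text = raw.strip().lower()
    for pattern, label in ROLE_RULES:
        if pattern.search(text):
            return label
    return "unclassified"
```

Applied with `df["insider_role"].map(classify_role)`, this keeps the raw string intact while the score works from the category. Storing `ROLE_MAP_VERSION` alongside each run makes it possible to tell which mapping produced which backtest.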

Currency and value comparability

If your universe spans multiple markets, notional values in local currency are not directly comparable. Convert to a base currency using a consistent FX source if your score uses value. If you do not have reliable FX mapping yet, rank within market first. Better a modestly scoped model than a global one held together by wishful arithmetic.
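Ranking within market needs no FX data at all. The sketch below uses the `currency` column as a stand-in for market, which is an assumption; a proper market or country field is preferable where the feed provides one. The `value_pct_rank` column name is illustrative.

```python
import pandas as pd

def rank_within_market(df: pd.DataFrame, market_col: str = "currency") -> pd.DataFrame:
    """Add a percentile rank of value computed separately inside each market."""
    out = df.copy()
    out["value_pct_rank"] = out.groupby(market_col)["value"].rank(pct=True)
    return out

trades = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "currency": ["EUR", "EUR", "EUR", "USD", "USD"],
    "value": [100.0, 300.0, 200.0, 1000.0, 5000.0],
})
ranked = rank_within_market(trades)
```

A 300 EUR purchase and a 5,000 USD purchase both come out at the top of their own market, which is the comparison the score can support before any currency conversion exists.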

A compact end-to-end Python example

Below is a compact script that ties the pieces together. It assumes a /filings endpoint and a bearer token.

import pandas as pd
import numpy as np
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class InsiderAPI:
    def __init__(self, base_url: str, api_key: str, timeout: int = 30):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
            "User-Agent": "sigma-journal-python-tutorial/1.0"
        })
        retry = Retry(
            total=5,
            backoff_factor=0.5,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET"]
        )
        adapter = HTTPAdapter(max_retries=retry)
        self.session.mount("https://", adapter)
        self.timeout = timeout

    def get(self, path: str, params=None):
        r = self.session.get(f"{self.base_url}{path}", params=params or {}, timeout=self.timeout)
        r.raise_for_status()
        return r.json()

    def fetch_filings(self, start_date: str, end_date: str) -> pd.DataFrame:
        cursor = None
        rows = []

        while True:
            params = {
                "from_date": start_date,
                "to_date": end_date,
                "page_size": 500
            }
            if cursor:
                params["cursor"] = cursor

            payload = self.get("/filings", params=params)
            batch = payload.get("results", [])
            rows.extend(batch)

            cursor = payload.get("next_cursor")
            if not cursor:
                break

        return pd.DataFrame(rows)

def normalize_filings(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame({
        "filing_id": df.get("id"),
        "issuer_id": df.get("issuer_id"),
        "issuer_name": df.get("issuer_name"),
        "ticker": df.get("ticker"),
        "insider_id": df.get("insider_id"),
        "insider_name": df.get("insider_name"),
        "insider_role": df.get("insider_role"),
        "transaction_date": pd.to_datetime(df.get("transaction_date"), errors="coerce"),
        "filing_date": pd.to_datetime(df.get("filing_date"), errors="coerce"),
        "transaction_type": df.get("transaction_type"),
        "shares": pd.to_numeric(df.get("shares"), errors="coerce"),
        "price": pd.to_numeric(df.get("price"), errors="coerce"),
        "currency": df.get("currency"),
        "security_type": df.get("security_type"),
    })
    out["value"] = out["shares"] * out["price"]
    return out

def build_signal(df: pd.DataFrame, as_of_date: str) -> pd.DataFrame:
    as_of = pd.Timestamp(as_of_date)

    x = df.copy()
    x["transaction_type"] = x["transaction_type"].fillna("").str.lower()
    x["insider_role"] = x["insider_role"].fillna("")

    # .copy() makes the filtered frame independent, so adding is_senior below is safe.
    x = x[
        x["transaction_type"].isin(["buy", "purchase", "acquisition"]) &
        x["transaction_date"].between(as_of - pd.Timedelta(days=30), as_of)
    ].copy()

    x["is_senior"] = x["insider_role"].str.lower().str.contains("ceo|chief|cfo|director|chair")
    x = x.dropna(subset=["issuer_id", "transaction_date"])

    signal = (
        x.groupby(["issuer_id", "issuer_name", "ticker"], dropna=False)
         .agg(
             total_buy_value=("value", "sum"),
             distinct_buyers=("insider_id", "nunique"),
             senior_buyer_flag=("is_senior", "max"),
             last_buy_date=("transaction_date", "max")
         )
         .reset_index()
    )

    signal["days_since_last_buy"] = (as_of - signal["last_buy_date"]).dt.days
    signal["score"] = (
        np.log1p(signal["total_buy_value"].fillna(0))
        + 0.75 * signal["distinct_buyers"].fillna(0)
        + 1.0 * signal["senior_buyer_flag"].astype(int)
        - 0.05 * signal["days_since_last_buy"].fillna(30)
    )

    return signal.sort_values("score", ascending=False)

if __name__ == "__main__":
    api = InsiderAPI(base_url="https://api.example.com", api_key="YOUR_API_KEY")
    raw = api.fetch_filings(start_date="2026-01-01", end_date="2026-03-31")
    filings = normalize_filings(raw)
    signal = build_signal(filings, as_of_date="2026-03-31")
    signal.to_csv("daily_insider_signal.csv", index=False)
    print(signal.head(20))

This is enough to produce a ranked file. It is also enough to reveal where your schema differs from your assumptions, which is the real educational value of any first script.

Turning the script into a research asset

The next upgrades are predictable:

save raw API responses to disk or object storage,
persist normalized tables in DuckDB, Postgres, or Parquet,
add unit tests for date parsing and role mapping,
parameterize the lookback window and minimum transaction size,
add a benchmarked backtest layer.

At that point you have moved from tutorial to tooling.

What to test before you trust the signal

A simple score is fine. An untested score is decoration.

Market	Regulator	Rule	Deadline (business days after the transaction)	Notes
FR	AMF	MAR Art 19	T+3	Persons discharging managerial responsibilities and closely associated persons must notify transactions promptly, no later than three business days.
EU	ESMA and national competent authorities	MAR Art 19	T+3	Core framework is harmonised under MAR, but dissemination and formatting differ by venue and authority.
US	SEC	Exchange Act Section 16, Form 4	T+2	Most reportable insider transactions by officers, directors, and 10 percent owners are due within two business days.

Reporting deadlines shape signal availability. The legal rule is not the same thing as your feed timestamp, but it is where the clock starts.
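The deadlines in the table can double as a data-quality check: count business days between transaction and filing, and flag rows that exceed the rule for their market. This is a sketch under assumptions. The `market` column is hypothetical, the day counts are taken from the table above, and `np.busday_count` uses a Monday-to-Friday week with no holiday calendar unless one is supplied.

```python
import numpy as np
import pandas as pd

# Business-day deadlines from the table above.
DEADLINE_BUSINESS_DAYS = {"FR": 3, "EU": 3, "US": 2}

def flag_late_filings(df: pd.DataFrame, market_col: str = "market") -> pd.DataFrame:
    """Add business-day filing lag and a late flag relative to the market's deadline."""
    out = df.dropna(subset=["transaction_date", "filing_date"]).copy()
    start = out["transaction_date"].to_numpy().astype("datetime64[D]")
    end = out["filing_date"].to_numpy().astype("datetime64[D]")
    out["business_days_to_file"] = np.busday_count(start, end)
    out["deadline"] = out[market_col].map(DEADLINE_BUSINESS_DAYS)
    # Markets missing from the mapping get a NaN deadline and are never flagged.
    out["filed_late"] = out["business_days_to_file"] > out["deadline"]
    return out
```

A high share of late rows is more often a symptom of a date-parsing or timezone problem in the pipeline than of widespread non-compliance, so it is worth checking the data before drawing conclusions about the filers.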
