
How to Scrape SEC EDGAR Filings Without Building a Parser

Scrape SEC EDGAR filings the right way: fair-access compliance, 8-K trigger classification, Form 4 insider patterns, 13F deltas. Python code included.

Every quant team, fintech RAG project, and corp-dev workflow eventually runs into the same wall: they need to scrape SEC EDGAR filings, they figure it'll be a weekend project, and three weeks later they have a Python script that successfully downloads raw XBRL and absolutely nothing useful sitting on top of it. The data is "free." The intelligence layer is the entire job.

I have built this stack twice for different employers and once more for my own Apify actor portfolio. The reusable lesson: downloading EDGAR is a 10% problem; classifying 8-K material events, detecting Form 4 cluster buys, computing 13F quarter-over-quarter position deltas, and pulling earnings-call transcripts out of 8-K Exhibit 99.x are the other 90%. This post walks through the legal and technical realities, what every existing solution gets wrong, and the actual code that produces usable intelligence.

Why the existing options fail

EDGAR is famous for being a buffet that nobody can eat. Here is the honest field guide.

Raw data.sec.gov calls. Free, official, well-documented at sec.gov/edgar/sec-api-documentation. You get the filing index, the submission JSON per CIK, and the raw filing text. What you don't get: any classification, any event extraction, any insider pattern detection, any cross-filing diffing. You're parsing 8-K item numbers, regexing XBRL, building your own NLP. Three weeks minimum to a usable v1.

sec-api.io. Commercial wrapper. Starts at $99/mo; the good filters (8-K item classification, insider Form 4 search) are gated behind the $299-$599 tiers. Quality is fine, lock-in is real, and the API surface leans toward financial data feeds rather than engineering primitives.

EDGAR Online / Intelligize / AlphaSense. Enterprise-tier products with sales calls, NDAs, and five-figure starting prices. Built for fund managers and law firms that need workflow tools, not scrapers.

Python libraries (sec-edgar-downloader, edgartools, python-sec-edgar). Open-source helpers around the raw API. They speed up the download step. None of them ship trigger classification, cluster-buy detection, or 13F deltas. You still write the analysis layer.

ChatGPT / Claude with web browsing. Works for one filing at a time, fails at scale, hallucinates rule numbers, and burns tokens you don't need to burn.

The gap: a free public-data pipeline with the analysis layer baked in.

The legal and rate-limit reality

EDGAR is one of the few datasets where the licensing question is fully settled - everything is mandated-public by federal securities law. The friction is the SEC's fair-access policy, which says:

You must declare a real User-Agent header with a real contact (e.g. "Your Company contact@yourdomain.com"). Anonymous and generic UAs get blocked.
10 requests per second, hard cap. Burst higher and your IP gets a temporary 403.
No simulating a browser - hit the JSON endpoints, not the HTML pages.

Any production scraper has to enforce the first two in code. The actor I'll show below uses an 8-req/sec semaphore to stay under the limit with a comfortable margin and requires the User-Agent as an input field.
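If you build the fetch layer yourself, the two enforceable rules can be sketched roughly as follows. This is a minimal illustration, not the actor's internals: it uses a simple interval throttle rather than a semaphore, and the names sec_headers, submissions_url, and Throttle are hypothetical helpers invented for this example. The email check is a loose sanity test, not a validation the SEC performs.

```python
import asyncio
import re
import time

SEC_MAX_RPS = 10   # cap stated in the fair-access policy
TARGET_RPS = 8     # margin below the cap (assumption: same margin as described above)

_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def sec_headers(user_agent: str) -> dict:
    """Build request headers, rejecting User-Agent strings with no contact email."""
    if not _EMAIL.search(user_agent):
        raise ValueError("User-Agent must include a contact email")
    return {"User-Agent": user_agent}

def submissions_url(cik) -> str:
    """Per-CIK submissions JSON; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"

class Throttle:
    """Spaces request starts at least 1/rate seconds apart (hypothetical helper)."""

    def __init__(self, rate: float = TARGET_RPS):
        self._interval = 1.0 / rate
        self._next_slot = 0.0

    async def wait(self) -> None:
        # No await between reading and updating _next_slot, so concurrent
        # tasks on one event loop cannot claim the same slot.
        now = time.monotonic()
        delay = max(0.0, self._next_slot - now)
        self._next_slot = max(now, self._next_slot) + self._interval
        if delay:
            await asyncio.sleep(delay)
```

Call `await throttle.wait()` before every request and pass `sec_headers(...)` to whatever HTTP client you use. Sharing one Throttle instance across all tasks is what keeps the aggregate rate under the cap.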

What "classify a filing" actually means

The 12-15 form types that drive real analyst work are well-defined. Here is what each one needs in practice:

Form	What it is	What you have to extract
8-K	Material event report	Item number -> event category (M&A, exec change, going concern, restatement, debt issuance, customer loss, bankruptcy, auditor change) with text evidence
10-K / 10-Q	Annual / quarterly report	Risk factors section -> going-concern detector, MD&A diffs, segment changes
13F-HR	Institutional holdings	Quarter-over-quarter position delta per holding: new / increased / decreased / exited
Form 4	Insider trades	Cluster-buy detection (multiple insiders within rolling window), unusual-size flag, C-suite-only flag, 10b5-1 plan termination
Form D	Private placement	Total offering, amount sold, investor count, related persons
S-1	IPO registration	Underwriters, use of proceeds, risk factors
13D / 13G	Beneficial ownership	Activist vs passive distinction, % stake changes

If you build it yourself, each of these is a regex + state-machine + edge-case spreadsheet. The 8-K classifier alone has ~30 item codes and dozens of common phrasings per category.
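To make the first step of that concrete, here is a hedged sketch of the item-number pass for 8-Ks. The item-to-category map is deliberately partial, the category labels are illustrative rather than the actor's schema, and classify_8k is a hypothetical name. A real classifier would add the phrase-level confirmation pass on top of this.

```python
import re

# Partial map of 8-K item numbers to event categories (labels are illustrative).
ITEM_CATEGORIES = {
    "1.01": "material_agreement",
    "1.03": "bankruptcy",
    "2.01": "mergers_acquisitions",
    "2.03": "debt_issuance",
    "4.01": "auditor_change",
    "4.02": "restatement",
    "5.02": "executive_change",
    "7.01": "reg_fd",
}

ITEM_RE = re.compile(r"\bItem\s+(\d\.\d{2})\b", re.IGNORECASE)

def classify_8k(text: str) -> list:
    """Return one record per recognised Item heading, with nearby text as evidence."""
    results = []
    seen = set()
    for match in ITEM_RE.finditer(text):
        code = match.group(1)
        if code in seen or code not in ITEM_CATEGORIES:
            continue  # skip repeats and items outside the partial map
        seen.add(code)
        # Collapse whitespace in the 200 characters following the heading.
        evidence = " ".join(text[match.start():match.start() + 200].split())
        results.append({
            "item": code,
            "event_type": ITEM_CATEGORIES[code],
            "evidence_text": evidence,
        })
    return results
```

The item number alone only narrows the category. For example, Item 5.02 covers both departures and appointments, which is why the evidence text is kept for a second pass.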

The walkthrough: build vs buy

Build path (60-200 engineering hours)

+-------------+    +-----------+    +--------------+    +-------------+
| User-Agent  | -> | Fetcher   | -> | Form parser  | -> | Classifier  |
| compliance  |    | (8 r/s    |    | (8-K items,  |    | + evidence  |
|             |    | semaphore)|    |  10-K XBRL)  |    | extraction  |
+-------------+    +-----------+    +--------------+    +-------------+
                                                              |
                                                              v
                                                      +---------------+
                                                      | Per-form      |
                                                      | analytics:    |
                                                      | - 13F deltas  |
                                                      | - F4 clusters |
                                                      | - GC detector |
                                                      +---------------+

Stack: httpx async client, lxml for XBRL/HTML, pydantic for schema. The painful parts are (a) 13F quarter-over-quarter joins (you have to keep state), (b) 8-K Exhibit 99 transcript extraction (every issuer formats differently), and (c) Form 4 cluster windows (you need a per-issuer rolling join).
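For painful part (a), the core of the 13F join is small once both quarters are in memory. The sketch below assumes you have already reduced each filing to a CUSIP -> share-count mapping; diff_13f and the field names are hypothetical, and the CUSIPs in the usage note are made-up placeholders.

```python
def diff_13f(prev: dict, curr: dict) -> list:
    """Compare two quarter-end snapshots, each mapping CUSIP -> shares held."""
    rows = []
    for cusip in sorted(prev.keys() | curr.keys()):
        before = prev.get(cusip, 0)
        after = curr.get(cusip, 0)
        if before == 0 and after > 0:
            change = "new"
        elif before > 0 and after == 0:
            change = "exited"
        elif after > before:
            change = "increased"
        elif after < before:
            change = "decreased"
        else:
            change = "unchanged"
        rows.append({
            "cusip": cusip,
            "change_type": change,
            "shares_delta": after - before,
        })
    return rows
```

Calling `diff_13f({"AAA000001": 100, "BBB000002": 50}, {"AAA000001": 150})` marks the first holding as increased and the second as exited. The stateful part is everything around this function: storing each quarter's snapshot so the prior one is available when the next filing lands.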

Buy path: prebuilt EDGAR actor

I wrote one of these as part of my Apify portfolio: seibs.co/sec-edgar-intel. It hits the free EDGAR endpoints, handles fair-access compliance, and ships the analysis layer. There are 30+ other EDGAR actors on Apify - most are raw-download utilities, mine adds the classification pass and per-form analytics. Compare against the alternatives before committing.

from apify_client import ApifyClient

# Token from https://console.apify.com/account/integrations
client = ApifyClient("YOUR_APIFY_TOKEN")

# Mode 1: 8-K material event scan across a watchlist
run = client.actor("seibs.co/sec-edgar-intel").call(run_input={
    "mode": "company_filings_8k",
    "tickers": ["NVDA", "AMD", "INTC", "MU", "AVGO"],
    "lookback_days": 30,
    "user_agent": "Acme Research contact@acme.com",  # REQUIRED by SEC
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item.get("event_type") not in ("mergers_acquisitions", "executive_change",
                                       "going_concern", "restatement"):
        continue
    print(f"[{item['ticker']}] {item['filing_date']} {item['event_type']}")
    print(f"  confidence: {item['confidence']}")
    print(f"  evidence:   {item['evidence_text'][:200]}...")
    print(f"  url:        {item['filing_url']}")

Sample output:

[NVDA] 2026-05-10 executive_change
  confidence: 0.93
  evidence:   On May 9, 2026, the Company announced that John Smith,
              Chief Financial Officer, will retire effective June 30, 2026...
  url:        https://www.sec.gov/Archives/edgar/data/1045810/...

Mode 2: Form 4 cluster-buy detection

run = client.actor("seibs.co/sec-edgar-intel").call(run_input={
    "mode": "form4_insider",
    "tickers": ["SMCI"],
    "lookback_days": 60,
    "cluster_window_days": 7,
    "user_agent": "Acme Research contact@acme.com",
})

for trade in client.dataset(run["defaultDatasetId"]).iterate_items():
    flags = trade.get("flags", {})
    if flags.get("cluster_buy") or flags.get("unusual_size"):
        print(f"{trade['filing_date']} {trade['insider_name']} ({trade['insider_title']})")
        print(f"  shares: {trade['shares']:,} @ ${trade['price']}")
        print(f"  flags:  {[k for k,v in flags.items() if v]}")

Mode 3: 13F quarter-over-quarter position deltas

run = client.actor("seibs.co/sec-edgar-intel").call(run_input={
    "mode": "13f_position_changes",
    "manager_ciks": ["0001067983"],   # Berkshire
    "vs_quarter": "previous",
    "user_agent": "Acme Research contact@acme.com",
})

for holding in client.dataset(run["defaultDatasetId"]).iterate_items():
    if holding["change_type"] in ("new", "exited"):
        print(f"{holding['change_type'].upper():8s} {holding['ticker']:6s} "
              f"{holding['shares_delta']:>+12,} sh "
              f"(${holding['value_delta_usd']:>+15,.0f})")

What you can do with this data

Use case	Mode + filter
Daily catalyst alerts	`mode=company_filings_8k`, schedule daily, webhook on `event_type in [...]`
Insider conviction screen	`mode=form4_insider`, filter `flags.cluster_buy AND flags.c_suite_only`
Smart-money tracking	`mode=13f_position_changes`, manager CIK list, aggregate consensus
Pre-IPO venture intel	`mode=form_d`, `industry="software"`, `min_amount_usd=10_000_000`
Going-concern screener	`mode=risk_factors`, filter `going_concern_detected=true`
Earnings tone analysis	`mode=earnings_transcripts`, pipe `transcript_turns` into LLM
Class-action prediction	`mode=company_filings_8k`, watch for `restatement` + `auditor_change` clustering
Securities litigation prep	`mode=full_text_search`, keywords + date window

The 8-K event classifier is the highest-signal mode for most users. Set up a daily cron, webhook the alerts to a Slack channel or push to your CRM, and your analysts stop reading every 8-K to find the three that matter.

MCP / AI agent integration

If you're wiring an LLM agent (Claude, GPT, LangChain, LlamaIndex), there's a sibling actor seibs.co/mcp-sec-edgar-intel that exposes the same backend as six Model Context Protocol tools (get_company_filings, get_8k_triggers, get_form4_insider_activity, get_13f_positions_change, get_recent_form_d, get_earnings_transcript). Your agent discovers the tools and calls them directly without you wiring HTTP plumbing.

Honest limitations

A handful of things any EDGAR scraper - including mine - cannot do well.

Latency floor is ~60 seconds. EDGAR publishes filings within seconds of acceptance, but the actor polls on a cadence. If you need millisecond-grade catalyst trading, this is not your data source. Bloomberg and Refinitiv exist for that and charge accordingly.

13F is 45-day-stale by design. 13F-HRs are filed within 45 days of quarter-end. The "smart money tracker" use case is by definition lagged ~45-135 days behind the actual trade. This is a regulatory artifact, not a tool limitation.

8-K classifier is heuristic, not perfect. Confidence scores in the 0.8-1.0 range are reliable; 0.5-0.8 needs a human read; below 0.5 is informational only. Expect ~92% precision on the high-confidence bucket and ~75% recall across all true events (some 8-Ks are too vague to classify mechanically).
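In a downstream pipeline, those buckets translate into a simple routing rule. A minimal sketch, assuming the boundary value 0.8 goes to the reliable bucket and 0.5 to human review (the ranges above overlap at the edges, so that choice is mine); triage and the bucket names are hypothetical.

```python
def triage(confidence: float) -> str:
    """Route a classified 8-K by its confidence score."""
    if confidence >= 0.8:
        return "auto"            # reliable enough to alert on directly
    if confidence >= 0.5:
        return "human_review"    # needs an analyst to read the filing
    return "informational"       # log it, do not alert
```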

No XBRL financials parsing. This actor extracts events and classifications, not balance-sheet line items. For XBRL-derived fundamentals (revenue, EPS, segments), use the SEC's own Financial Statement Data Sets or a fundamentals API.

Form 4 cluster detection requires 2+ insiders in window. A single insider buying $50M of stock will trigger unusual_size=true but not cluster_buy=true. Some funds care more about the lone-wolf signal - filter accordingly.
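The difference between the two flags is easier to see in code. This is a rough sketch of both checks, not the actor's implementation. It assumes purchases have already been filtered out of the Form 4 data, that "within a 7-day window" means trade dates fewer than 7 days apart, and that unusual size means more than 2 sample standard deviations above the insider's own mean. The function names and tuple layout are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

def cluster_buys(trades, window_days=7, min_insiders=2):
    """trades: iterable of (issuer, insider, trade_date) tuples, purchases only.
    Returns issuers where min_insiders distinct insiders bought within one window."""
    by_issuer = defaultdict(list)
    for issuer, insider, day in trades:
        by_issuer[issuer].append((day, insider))
    flagged = set()
    for issuer, rows in by_issuer.items():
        rows.sort()
        start = 0
        for end in range(len(rows)):
            # Slide the window start forward until it spans < window_days.
            while (rows[end][0] - rows[start][0]).days >= window_days:
                start += 1
            if len({name for _, name in rows[start:end + 1]}) >= min_insiders:
                flagged.add(issuer)
                break
    return flagged

def unusual_size(history, trade_value, z=2.0):
    """True when trade_value exceeds the insider's historical mean by > z std devs."""
    if len(history) < 2:
        return False  # not enough history to estimate a spread
    sigma = stdev(history)
    return sigma > 0 and trade_value > mean(history) + z * sigma
```

A lone insider making one outsized purchase passes unusual_size but can never satisfy cluster_buys, which counts distinct insiders rather than trades.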

Fair-access compliance is your responsibility. The actor enforces 8 req/sec and requires a User-Agent input, but it cannot verify that the value is genuine: if you fill the field with a generic UA, the SEC can still block you. Use a real contact email tied to a domain you own.

FAQ

Q: Is scraping SEC EDGAR legal? A: Yes. EDGAR is the SEC's public filing system. Every record is mandated-public by federal securities law. The only rule is the SEC's fair-access policy (real User-Agent header, 10 requests/second cap).

Q: Do I need an API key for SEC EDGAR? A: No. The official data.sec.gov endpoints are keyless. You only need to set a User-Agent header with a real contact email per the SEC's fair-access policy.

Q: How do I detect 8-K material events programmatically? A: Parse the 8-K Item number (Item 1.01 = material agreement, Item 2.01 = acquisition, Item 5.02 = exec change, Item 4.02 = restatement, Item 7.01 = Reg FD, etc.), then run a phrase-level classifier on the body for confirmation. The full Item code table is at sec.gov/forms.

Q: What's the difference between Form 4 cluster-buy and unusual-size flags? A: unusual_size=true fires when a single insider trades >2 standard deviations above their personal historical average. cluster_buy=true fires when 2+ insiders at the same issuer buy within a rolling N-day window (default 7 days). Cluster buys historically carry the higher predictive signal for forward returns.

Q: How do I track changes in 13F filings quarter over quarter? A: Diff the latest 13F-HR against the prior quarter's 13F-HR by CUSIP. Classify each holding as new / increased / decreased / exited / unchanged based on the share count delta. The actor does this automatically; if you build it yourself, persist prior-quarter snapshots keyed on (manager_cik, cusip, quarter_end).

Q: Can I get earnings call transcripts from EDGAR? A: Yes, but indirectly. Companies that file the transcript as 8-K Exhibit 99.1 or 99.2 have it on EDGAR. Many don't - they only post to investor relations. The actor's earnings_transcripts mode pulls the ones that are filed.

Q: What's the cheapest way to monitor a watchlist of 100 tickers daily? A: Schedule the actor's company_filings_8k mode daily across all 100 tickers, dedupe in your downstream by accession_number, and only act on event_type matches. Cost is roughly $0.005 per filing emitted + $0.010 per high-signal classification - typically $1-3 per daily run for a 100-ticker watchlist.
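A hedged sketch of that downstream step, assuming each dataset item carries an accession_number field and that the two per-event prices quoted above are additive. The function names and the accession numbers in any test data are hypothetical.

```python
FILING_FEE_USD = 0.005        # per filing emitted (figure quoted above)
HIGH_SIGNAL_FEE_USD = 0.010   # per high-signal classification (figure quoted above)

def dedupe_by_accession(items):
    """Keep the first record seen for each accession_number, preserving order."""
    seen = set()
    unique = []
    for item in items:
        key = item["accession_number"]
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

def estimate_run_cost(filings_emitted, high_signal_count):
    """Rough cost of one run in USD, assuming the two fees simply add."""
    total = filings_emitted * FILING_FEE_USD + high_signal_count * HIGH_SIGNAL_FEE_USD
    return round(total, 2)
```

Under these assumptions, a run that emits 200 filings with 40 high-signal classifications costs 200 x 0.005 + 40 x 0.010 = $1.40.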

Q: Does this work for foreign filings (20-F, 6-K)? A: Yes. Foreign private issuers file 20-F (annual) and 6-K (interim) on EDGAR. The actor pulls them; the 8-K classifier doesn't apply (6-K has no item taxonomy) but full-text search and risk-factor extraction do.

Q: How fresh is the data vs Bloomberg / FactSet? A: Same source, ~30-90 second polling latency vs Bloomberg's near-real-time feed. For analyst workflows, equivalent. For HFT, no.

Q: Can I get historical filings from before EDGAR full-text search started in 2001? A: Pre-2001 filings are on EDGAR but the full-text index doesn't cover them - you have to fetch them individually by accession number. The actor handles fetching; full-text search is 2001+.

Try it free

Run sec-edgar-intel on Apify - free plan covers ~600 filings per month. For agent workflows, mcp-sec-edgar-intel exposes the same backend as MCP tools.

Related actors in the portfolio:

us-gov-contracts-intel - federal contract awards (USAspending + SAM.gov) for the same companies you're tracking on EDGAR.
court-records-intel - pull PACER federal court cases to surface securities litigation, going-concern triggers, and IP disputes.
uspto-patent-intel - patent and trademark intelligence on the same issuers (IP-driven acquisitions show up here first).

About the author

I'm a solo MSP operator who builds B2B web-scraping actors at apify.com/seibs.co when I'm not running incident calls. The portfolio has 30+ live actors covering lead generation, intent data, SEC/USPTO/court records, and AI agent wrappers - all pay-per-event so you only pay for what's emitted. Find me at seibs.co.
