How to Scrape SEC EDGAR Filings Without Building a Parser
Scrape SEC EDGAR filings the right way: fair-access compliance, 8-K trigger classification, Form 4 insider patterns, 13F deltas. Python code included.
Every quant team, fintech RAG project, and corp-dev workflow eventually runs into the same wall: they need to scrape SEC EDGAR filings, they figure it'll be a weekend project, and three weeks later they have a Python script that successfully downloads raw XBRL and absolutely nothing useful sitting on top of it. The data is "free." The intelligence layer is the entire job.
I have built this stack twice for different employers and once more for my own Apify actor portfolio. The reusable lesson: downloading EDGAR is a 10% problem; classifying 8-K material events, detecting Form 4 cluster buys, computing 13F quarter-over-quarter position deltas, and pulling earnings-call transcripts out of 8-K Exhibit 99.x is the other 90%. This post walks through the legal and technical realities, what every existing solution gets wrong, and the actual code that produces usable intelligence.
EDGAR is famous for being a buffet that nobody can eat. Here is the honest field guide.
Raw data.sec.gov calls. Free, official, well-documented at sec.gov/edgar/sec-api-documentation. You get the filing index, the submission JSON per CIK, and the raw filing text. What you don't get: any classification, any event extraction, any insider pattern detection, any cross-filing diffing. You're parsing 8-K item numbers, regexing XBRL, building your own NLP. Three weeks minimum to a usable v1.
sec-api.io. Commercial wrapper. Starts at $99/mo; the good filters (8-K item classification, insider Form 4 search) are gated behind the $299-$599 tiers. Quality is fine, lock-in is real, and the API surface bends toward financial data feeds, not engineering primitives.
EDGAR Online / Intelligize / AlphaSense. Enterprise-tier products with sales calls, NDAs, and five-figure starting prices. Built for fund managers and law firms that need workflow tools, not scrapers.
Python libraries (sec-edgar-downloader, edgartools, python-sec-edgar). Open-source helpers around the raw API. They speed up the download step. None of them ship trigger classification, cluster-buy detection, or 13F deltas. You still write the analysis layer.
ChatGPT / Claude with web browsing. Works for one filing at a time, fails at scale, hallucinates rule numbers, and burns tokens you don't need to burn.
The gap: a free public-data pipeline with the analysis layer baked in.
EDGAR is one of the few datasets where the licensing question is fully settled: everything is mandated-public by federal securities law. The friction is the SEC's fair-access policy, which sets two rules:
- No more than 10 requests per second.
- A User-Agent header with a real contact (e.g. "Your Company contact@yourdomain.com"). Anonymous and generic UAs get blocked.
Any production scraper has to enforce both. The actor I'll show below uses an 8-req/sec semaphore to stay under the limit with a comfortable margin and requires the User-Agent as an input field.
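Here is a minimal sketch of enforcing both rules (the 10 requests/second ceiling and a contact-bearing User-Agent) before any request goes out. It is not the actor's internal code: `build_headers`, `RateLimiter`, and the email regex are hypothetical names and choices of mine, and it paces calls synchronously rather than with an async semaphore.

```python
import re
import time

SEC_MAX_RPS = 10  # fair-access ceiling: 10 requests per second
CONTACT_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # assumption: "has an email" check


def build_headers(user_agent: str) -> dict:
    """Reject User-Agent strings that carry no contact email."""
    if not CONTACT_RE.search(user_agent):
        raise ValueError("User-Agent must include a contact email")
    return {"User-Agent": user_agent}


class RateLimiter:
    """Spaces calls at least 1/rate seconds apart (hypothetical helper)."""

    def __init__(self, rate: float = 8.0, clock=time.monotonic, sleep=time.sleep):
        if not 0 < rate <= SEC_MAX_RPS:
            raise ValueError("rate must be in (0, 10]")
        self.interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_slot = None  # earliest time the next call may proceed

    def wait(self) -> float:
        """Block until the next free slot; return the seconds slept."""
        now = self._clock()
        if self._next_slot is None or now >= self._next_slot:
            self._next_slot = now + self.interval
            return 0.0
        delay = self._next_slot - now
        self._sleep(delay)
        self._next_slot += self.interval
        return delay
```

Call `limiter.wait()` before each request and pass the result of `build_headers(...)` to whatever HTTP client you use. The injectable `clock` and `sleep` exist only so the pacing can be unit-tested without real delays.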
The 12-15 form types that drive real analyst work are well-defined. Here is what each one needs in practice:
| Form | What it is | What you have to extract |
|---|---|---|
| 8-K | Material event report | Item number -> event category (M&A, exec change, going concern, restatement, debt issuance, customer loss, bankruptcy, auditor change) with text evidence |
| 10-K / 10-Q | Annual / quarterly report | Risk factors section -> going-concern detector, MD&A diffs, segment changes |
| 13F-HR | Institutional holdings | Quarter-over-quarter position delta per holding: new / increased / decreased / exited |
| Form 4 | Insider trades | Cluster-buy detection (multiple insiders within rolling window), unusual-size flag, C-suite-only flag, 10b5-1 plan termination |
| Form D | Private placement | Total offering, amount sold, investor count, related persons |
| S-1 | IPO registration | Underwriters, use of proceeds, risk factors |
| 13D / 13G | Beneficial ownership | Activist vs passive distinction, % stake changes |
If you build it yourself, each of these is a regex + state-machine + edge-case spreadsheet. The 8-K classifier alone has ~30 item codes and dozens of common phrasings per category.
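To make the 8-K row concrete, here is a rough sketch of the first pass: pull item numbers out of the filing text and map them to an event category with a snippet of evidence. The item numbers are standard 8-K items, but the mapping is deliberately partial, the category labels for items 1.01, 1.03, and 7.01 are my own placeholders, and a real classifier would still need the phrase-level pass on the body.

```python
import re

# Partial, illustrative mapping; a production table covers every 8-K item.
ITEM_CATEGORIES = {
    "1.01": "material_agreement",    # hypothetical label
    "1.03": "bankruptcy",            # hypothetical label
    "2.01": "mergers_acquisitions",
    "4.01": "auditor_change",
    "4.02": "restatement",
    "5.02": "executive_change",
    "7.01": "reg_fd",                # hypothetical label
}

ITEM_RE = re.compile(r"\bItem\s+(\d\.\d{2})\b", re.IGNORECASE)


def classify_8k(text: str, evidence_chars: int = 200) -> list:
    """Return one record per recognised item, in order of first appearance."""
    results = []
    seen = set()
    for match in ITEM_RE.finditer(text):
        item = match.group(1)
        if item in seen or item not in ITEM_CATEGORIES:
            continue
        seen.add(item)
        # Collapse whitespace so the evidence snippet reads as one line.
        snippet = text[match.start():match.start() + evidence_chars]
        results.append({
            "item": item,
            "event_type": ITEM_CATEGORIES[item],
            "evidence_text": " ".join(snippet.split()),
        })
    return results


sample = (
    "Item 5.02 Departure of Directors or Certain Officers. "
    "The Chief Financial Officer will retire. "
    "Item 9.01 Financial Statements and Exhibits."
)
for hit in classify_8k(sample):
    print(hit["item"], hit["event_type"])
```

Item 9.01 is skipped here only because it is not in the partial table; unmapped items are where the edge-case spreadsheet starts.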
+-------------+    +-----------+    +--------------+    +-------------+
| User-Agent  | -> | Fetcher   | -> | Form parser  | -> | Classifier  |
| compliance  |    | (8 r/s    |    | (8-K items,  |    | + evidence  |
|             |    | semaphore)|    | 10-K XBRL)   |    | extraction  |
+-------------+    +-----------+    +--------------+    +-------------+
                                                               |
                                                               v
                                                       +---------------+
                                                       | Per-form      |
                                                       | analytics:    |
                                                       | - 13F deltas  |
                                                       | - F4 clusters |
                                                       | - GC detector |
                                                       +---------------+
Stack: httpx async client, lxml for XBRL/HTML, pydantic for schema. The painful parts are (a) 13F quarter-over-quarter joins (you have to keep state), (b) 8-K Exhibit 99 transcript extraction (every issuer formats differently), and (c) Form 4 cluster windows (you need a per-issuer rolling join).
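As an illustration of pain point (c), the per-issuer rolling join can be sketched with a sliding window over date-sorted buys. This is one possible approach, not the actor's implementation: `find_cluster_buys`, the record fields (`issuer`, `insider`, `trade_date`, `is_buy`), and the sample trades are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def find_cluster_buys(trades, window_days=7, min_insiders=2):
    """Flag windows where several distinct insiders bought the same issuer."""
    by_issuer = defaultdict(list)
    for trade in trades:
        if trade["is_buy"]:
            by_issuer[trade["issuer"]].append(trade)

    clusters = []
    for issuer, buys in by_issuer.items():
        buys.sort(key=lambda t: t["trade_date"])
        left = 0
        for right, trade in enumerate(buys):
            # Drop trades that fall outside the window ending at this trade.
            while (trade["trade_date"] - buys[left]["trade_date"]).days > window_days:
                left += 1
            insiders = {b["insider"] for b in buys[left:right + 1]}
            if len(insiders) >= min_insiders:
                clusters.append({
                    "issuer": issuer,
                    "window_end": trade["trade_date"],
                    "insiders": sorted(insiders),
                })
    return clusters


# Made-up trades for illustration only.
trades = [
    {"issuer": "ACME", "insider": "alice", "trade_date": date(2026, 3, 2), "is_buy": True},
    {"issuer": "ACME", "insider": "bob", "trade_date": date(2026, 3, 5), "is_buy": True},
    {"issuer": "ACME", "insider": "carol", "trade_date": date(2026, 3, 4), "is_buy": False},
    {"issuer": "ACME", "insider": "alice", "trade_date": date(2026, 3, 20), "is_buy": True},
    {"issuer": "BETA", "insider": "dan", "trade_date": date(2026, 3, 2), "is_buy": True},
]
print(find_cluster_buys(trades))
```

Counting distinct insiders rather than trades matters: one executive filing three buys in a week is not a cluster under this definition.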
I wrote one of these as part of my Apify portfolio: seibs.co/sec-edgar-intel. It hits the free EDGAR endpoints, handles fair-access compliance, and ships the analysis layer. There are 30+ other EDGAR actors on Apify. Most are raw-download utilities; mine adds the classification pass and per-form analytics. Compare against the alternatives before committing.
from apify_client import ApifyClient

# Token from https://console.apify.com/account/integrations
client = ApifyClient("YOUR_APIFY_TOKEN")

# Mode 1: 8-K material event scan across a watchlist
run = client.actor("seibs.co/sec-edgar-intel").call(run_input={
    "mode": "company_filings_8k",
    "tickers": ["NVDA", "AMD", "INTC", "MU", "AVGO"],
    "lookback_days": 30,
    "user_agent": "Acme Research contact@acme.com",  # REQUIRED by SEC
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item.get("event_type") not in ("mergers_acquisitions", "executive_change",
                                      "going_concern", "restatement"):
        continue
    print(f"[{item['ticker']}] {item['filing_date']} {item['event_type']}")
    print(f" confidence: {item['confidence']}")
    print(f" evidence: {item['evidence_text'][:200]}...")
    print(f" url: {item['filing_url']}")
Sample output:
[NVDA] 2026-05-10 executive_change
confidence: 0.93
evidence: On May 9, 2026, the Company announced that John Smith,
Chief Financial Officer, will retire effective June 30, 2026...
url: https://www.sec.gov/Archives/edgar/data/1045810/...
# Mode 2: Form 4 insider activity with cluster-buy detection
# (reuses the `client` created in the first snippet)
run = client.actor("seibs.co/sec-edgar-intel").call(run_input={
    "mode": "form4_insider",
    "tickers": ["SMCI"],
    "lookback_days": 60,
    "cluster_window_days": 7,
    "user_agent": "Acme Research contact@acme.com",
})

for trade in client.dataset(run["defaultDatasetId"]).iterate_items():
    flags = trade.get("flags", {})
    if flags.get("cluster_buy") or flags.get("unusual_size"):
        print(f"{trade['filing_date']} {trade['insider_name']} ({trade['insider_title']})")
        print(f" shares: {trade['shares']:,} @ ${trade['price']}")
        print(f" flags: {[k for k, v in flags.items() if v]}")
# Mode 3: 13F quarter-over-quarter position changes
# (reuses the `client` created in the first snippet)
run = client.actor("seibs.co/sec-edgar-intel").call(run_input={
    "mode": "13f_position_changes",
    "manager_ciks": ["0001067983"],  # Berkshire
    "vs_quarter": "previous",
    "user_agent": "Acme Research contact@acme.com",
})

for holding in client.dataset(run["defaultDatasetId"]).iterate_items():
    if holding["change_type"] in ("new", "exited"):
        print(f"{holding['change_type'].upper():8s} {holding['ticker']:6s} "
              f"{holding['shares_delta']:>+12,} sh "
              f"(${holding['value_delta_usd']:>+15,.0f})")
| Use case | Mode + filter |
|---|---|
| Daily catalyst alerts | mode=company_filings_8k, schedule daily, webhook on event_type in [...] |
| Insider conviction screen | mode=form4_insider, filter flags.cluster_buy AND flags.c_suite_only |
| Smart-money tracking | mode=13f_position_changes, manager CIK list, aggregate consensus |
| Pre-IPO venture intel | mode=form_d, industry="software", min_amount_usd=10_000_000 |
| Going-concern screener | mode=risk_factors, filter going_concern_detected=true |
| Earnings tone analysis | mode=earnings_transcripts, pipe transcript_turns into LLM |
| Class-action prediction | mode=company_filings_8k, watch for restatement + auditor_change clustering |
| Securities litigation prep | mode=full_text_search, keywords + date window |
The 8-K event classifier is the highest-signal mode for most users. Set up a daily cron, webhook the alerts to a Slack channel or push to your CRM, and your analysts stop reading every 8-K to find the three that matter.
If you're wiring an LLM agent (Claude, GPT, LangChain, LlamaIndex), there's a sibling actor seibs.co/mcp-sec-edgar-intel that exposes the same backend as six Model Context Protocol tools (get_company_filings, get_8k_triggers, get_form4_insider_activity, get_13f_positions_change, get_recent_form_d, get_earnings_transcript). Your agent discovers the tools and calls them directly without you wiring HTTP plumbing.
A handful of things any EDGAR scraper - including mine - cannot do well.
Latency floor is ~60 seconds. EDGAR publishes filings within seconds of acceptance, but the actor polls on a cadence. If you need millisecond-grade catalyst trading, this is not your data source. Bloomberg and Refinitiv exist for that and charge accordingly.
13F is 45-day-stale by design. 13F-HRs are filed within 45 days of quarter-end. The "smart money tracker" use case is by definition lagged ~45-135 days behind the actual trade. This is a regulatory artifact, not a tool limitation.
8-K classifier is heuristic, not perfect. Confidence scores in the 0.8-1.0 range are reliable; 0.5-0.8 needs a human read; below 0.5 is informational only. Expect ~92% precision on the high-confidence bucket and ~75% recall across all true events (some 8-Ks are too vague to classify mechanically).
No XBRL financials parsing. This actor extracts events and classifications, not balance-sheet line items. For XBRL-derived fundamentals (revenue, EPS, segments), use the SEC's own Financial Statement Data Sets or a fundamentals API.
Form 4 cluster detection requires 2+ insiders in window. A single insider buying $50M of stock will trigger unusual_size=true but not cluster_buy=true. Some funds care more about the lone-wolf signal - filter accordingly.
Fair-access compliance is your responsibility. The actor enforces 8 req/sec and requires a real User-Agent, but if you ignore the input field and pass a generic UA, the SEC can still block you. Use a real contact email tied to a domain you own.
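The confidence buckets described for the 8-K classifier translate directly into a routing step in your downstream. A small sketch, assuming `confidence` arrives as a float between 0 and 1; the function name and the three route labels are mine, not fields the actor emits.

```python
def triage(confidence: float) -> str:
    """Map a classifier confidence score to a handling route."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= 0.8:
        return "auto_alert"       # reliable bucket: alert without review
    if confidence >= 0.5:
        return "human_review"     # needs an analyst to read the filing
    return "informational"        # log it, do not act on it


for score in (0.93, 0.62, 0.31):
    print(score, triage(score))
```

Where exactly the 0.8 boundary lands (inclusive or exclusive) is a judgment call; I treat it as inclusive here.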
Q: Is scraping SEC EDGAR legal?
A: Yes. EDGAR is the SEC's public filing system. Every record is mandated-public by federal securities law. The only rule is the SEC's fair-access policy (real User-Agent header, 10 requests/second cap).
Q: Do I need an API key for SEC EDGAR?
A: No. The official data.sec.gov endpoints are keyless. You only need to set a User-Agent header with a real contact email per the SEC's fair-access policy.
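For reference, the keyless flow looks roughly like this. The `https://data.sec.gov/submissions/CIK##########.json` URL pattern and the parallel-array layout under `filings.recent` follow the SEC's API documentation; the helper names and the sample payload (including its accession numbers) are made up for illustration, and the actual HTTP call is left out.

```python
def submissions_url(cik) -> str:
    """Build the submissions endpoint URL; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def recent_filings(submissions: dict, form_type: str = "8-K") -> list:
    """Filter the parallel arrays under filings.recent down to one form type."""
    recent = submissions["filings"]["recent"]
    rows = zip(recent["accessionNumber"], recent["form"], recent["filingDate"])
    return [
        {"accession_number": acc, "form": form, "filing_date": filed}
        for acc, form, filed in rows
        if form == form_type
    ]


# Made-up payload in the documented shape; accession numbers are placeholders.
payload = {
    "filings": {
        "recent": {
            "accessionNumber": ["0000000000-26-000003", "0000000000-26-000002"],
            "form": ["8-K", "10-Q"],
            "filingDate": ["2026-05-10", "2026-05-01"],
        }
    }
}

print(submissions_url("1045810"))
print(recent_filings(payload))
```

Fetch the URL with any HTTP client, as long as the request carries your contact User-Agent and stays under the rate cap.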
Q: How do I detect 8-K material events programmatically?
A: Parse the 8-K Item number (Item 1.01 = material agreement, Item 2.01 = acquisition, Item 5.02 = exec change, Item 4.02 = restatement, Item 7.01 = Reg FD, etc.), then run a phrase-level classifier on the body for confirmation. The full Item code table is at sec.gov/forms.
Q: What's the difference between Form 4 cluster-buy and unusual-size flags?
A: unusual_size=true fires when a single insider trades >2 standard deviations above their personal historical average. cluster_buy=true fires when 2+ insiders at the same issuer buy within a rolling N-day window (default 7 days). Cluster buys historically carry the higher predictive signal for forward returns.
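The unusual-size rule is a plain z-score against the insider's own history. A minimal sketch, assuming trade sizes are expressed in dollars and using the sample standard deviation; the function name, the minimum-history rule, and the zero-variance fallback are my assumptions, not documented actor behavior.

```python
from statistics import mean, stdev


def is_unusual_size(history, trade_value, threshold=2.0):
    """True when trade_value sits more than `threshold` std devs above the mean."""
    if len(history) < 2:
        return False  # assumption: too little history to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Assumption: with identical past trades, any larger trade counts.
        return trade_value > mu
    return (trade_value - mu) / sigma > threshold


past_trades = [100_000, 120_000, 90_000, 110_000]  # made-up dollar amounts
print(is_unusual_size(past_trades, 500_000))
print(is_unusual_size(past_trades, 115_000))
```

Insiders with only one or two prior trades are common, so how you handle thin history will drive the flag rate more than the threshold does.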
Q: How do I track changes in 13F filings quarter over quarter?
A: Diff the latest 13F-HR against the prior quarter's 13F-HR by CUSIP. Classify each holding as new / increased / decreased / exited / unchanged based on the share count delta. The actor does this automatically; if you build it yourself, persist prior-quarter snapshots keyed on (manager_cik, cusip, quarter_end).
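If you do build it yourself, the diff is a keyed comparison of two snapshots. A sketch under the assumption that each quarter has already been reduced to a CUSIP-to-shares mapping; `diff_13f` is a hypothetical name and the CUSIPs below are placeholders, not real securities.

```python
def diff_13f(prev: dict, curr: dict) -> list:
    """Classify each holding by comparing share counts across two quarters."""
    changes = []
    for cusip in sorted(set(prev) | set(curr)):
        before = prev.get(cusip, 0)
        after = curr.get(cusip, 0)
        if before == 0 and after > 0:
            change_type = "new"
        elif before > 0 and after == 0:
            change_type = "exited"
        elif after > before:
            change_type = "increased"
        elif after < before:
            change_type = "decreased"
        else:
            change_type = "unchanged"
        changes.append({
            "cusip": cusip,
            "change_type": change_type,
            "shares_delta": after - before,
        })
    return changes


# Placeholder snapshots for one manager, keyed by (fake) CUSIP.
prior_quarter = {"AAA111111": 1_000, "BBB222222": 500, "CCC333333": 200}
latest_quarter = {"AAA111111": 1_500, "CCC333333": 200, "DDD444444": 300}

for row in diff_13f(prior_quarter, latest_quarter):
    print(row["cusip"], row["change_type"], f"{row['shares_delta']:+,}")
```

In a real pipeline the two dicts would be loaded from the persisted snapshots keyed on (manager_cik, cusip, quarter_end).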
Q: Can I get earnings call transcripts from EDGAR?
A: Yes, but indirectly. Companies that file the transcript as 8-K Exhibit 99.1 or 99.2 have it on EDGAR. Many don't - they only post to investor relations. The actor's earnings_transcripts mode pulls the ones that are filed.
Q: What's the cheapest way to monitor a watchlist of 100 tickers daily?
A: Schedule the actor's company_filings_8k mode daily across all 100 tickers, dedupe in your downstream by accession_number, and only act on event_type matches. Cost is roughly $0.005 per filing emitted + $0.010 per high-signal classification - typically $1-3 per daily run for a 100-ticker watchlist.
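The dedupe step and the cost arithmetic can be sketched as follows. The per-event prices come from the answer above; the function names and the example volumes (200 filings, 50 high-signal hits) are assumptions for illustration, not measured figures.

```python
PRICE_PER_FILING = 0.005       # USD per filing emitted
PRICE_PER_HIGH_SIGNAL = 0.010  # USD per high-signal classification


def dedupe_by_accession(items, seen=None):
    """Keep the first item per accession_number; `seen` can persist across runs."""
    seen = set() if seen is None else seen
    fresh = []
    for item in items:
        accession = item["accession_number"]
        if accession not in seen:
            seen.add(accession)
            fresh.append(item)
    return fresh


def estimate_run_cost(filings_emitted: int, high_signal_hits: int) -> float:
    """Estimated USD cost of one run under the two per-event prices."""
    total = filings_emitted * PRICE_PER_FILING + high_signal_hits * PRICE_PER_HIGH_SIGNAL
    return round(total, 3)


already_seen = set()
batch = [
    {"accession_number": "A-1", "event_type": "restatement"},
    {"accession_number": "A-1", "event_type": "restatement"},
    {"accession_number": "A-2", "event_type": "executive_change"},
]
print(len(dedupe_by_accession(batch, already_seen)))
print(estimate_run_cost(200, 50))
```

Passing the same `seen` set (or a persisted equivalent) into every daily run is what keeps a filing from alerting twice when lookback windows overlap.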
Q: Does this work for foreign filings (20-F, 6-K)?
A: Yes. Foreign private issuers file 20-F (annual) and 6-K (interim) on EDGAR. The actor pulls them; the 8-K classifier doesn't apply (6-K has no item taxonomy) but full-text search and risk-factor extraction do.
Q: How fresh is the data vs Bloomberg / FactSet?
A: Same source, ~30-90 second polling latency vs Bloomberg's near-real-time feed. For analyst workflows, equivalent. For HFT, no.
Q: Can I get historical filings from before EDGAR full-text search started in 2001?
A: Pre-2001 filings are on EDGAR but the full-text index doesn't cover them - you have to fetch them individually by accession number. The actor handles fetching; full-text search is 2001+.
Run sec-edgar-intel on Apify - free plan covers ~600 filings per month. For agent workflows, mcp-sec-edgar-intel exposes the same backend as MCP tools.
Related actors in the portfolio:
- us-gov-contracts-intel - federal contract awards (USAspending + SAM.gov) for the same companies you're tracking on EDGAR.
- court-records-intel - pull PACER federal court cases to surface securities litigation, going-concern triggers, and IP disputes.
- uspto-patent-intel - patent and trademark intelligence on the same issuers (IP-driven acquisitions show up here first).
I'm a solo MSP operator who builds B2B web-scraping actors at apify.com/seibs.co when I'm not running incident calls. The portfolio has 30+ live actors covering lead generation, intent data, SEC/USPTO/court records, and AI agent wrappers - all pay-per-event so you only pay for what's emitted. Find me at seibs.co.