The free, open-source web data platform.
Turn any website into clean, structured data: Markdown, JSON, CSV, or AI-ready output. Self-host in one command. No rate limits. No credit cards. No cloud lock-in.
Live Demo · Docs · Quickstart
Imagine you want the data from a website (maybe a product's price, a competitor's blog posts, or an entire documentation site), but it's locked in HTML, JavaScript, and CSS that's hard to work with programmatically. Netleaf solves this.
You give Netleaf a URL. It opens the page in a real browser (handling JavaScript-heavy sites that simple scrapers miss), extracts the content, and hands it back to you as clean Markdown text, structured JSON, a CSV file, or whatever format you need.
No coding required for basic use. Just run one Docker command and hit the API.
Website URL → Netleaf → Clean data (Markdown / JSON / CSV / ZIP)
Built for developers and researchers who:
- Are tired of Firecrawl's free tier running out mid-project
- Want to extract structured data using their own AI keys, or at zero AI cost via Ollama
- Need to crawl entire websites automatically, not just individual pages
- Want to know exactly what changed on a site between two crawls (change detection)
- Prefer owning their data: nothing leaves your machine
Prerequisites: Docker Desktop installed. That's it.
Docker is a tool that packages software into isolated containers so you can run complex apps (databases, servers, queues) with a single command, with no manual installation of each component.
git clone https://github.com/Ramcode64/netleaf
cd netleaf
cp .env.example .env
docker compose up

That's it. In about 30 seconds, these services are running:

| Service | URL | What it is |
|---|---|---|
| API | http://localhost:3000 | The REST API: send requests here |
| Dashboard | http://localhost:3001 | Web UI: manage keys, view crawl history |

No signup needed. The default mode (LOCAL_MODE=true) skips all authentication, so you can start making requests immediately.
# Try it right now: scrape any page to Markdown
curl -X POST http://localhost:3000/v1/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown"]}'

| Feature | What it does |
|---|---|
| Scrape | Fetch any single page as Markdown, HTML, or plain text |
| Crawl | Follow all links on a site automatically, up to N pages |
| Map | Discover every URL on a site in seconds (reads sitemap/robots.txt) |
| Extract | Use an AI model to pull structured data fields from any page |
| Search | Run a web search and optionally scrape the top results |
| Schedule | Run any crawl on a repeating cron timer (daily, hourly, etc.) |
| Diff | Compare two crawl runs and see exactly which pages were added, removed, or changed |
| Export | Download crawl results as JSON, CSV, XML, or a ZIP of Markdown files |
Firecrawl is the closest spiritual predecessor. Netleaf is what Firecrawl should be for people who self-host.
| | Firecrawl | Netleaf |
|---|---|---|
| Self-hosted | Yes, but complex setup (S3, multiple configs) | Yes, with a single docker compose up |
| Free tier | 500 credits/month on their cloud | Unlimited on your own hardware |
| AI extraction | Locked to their internal stack | Your choice: Claude, OpenAI, or Ollama |
| 100% offline | No | Yes: run Ollama locally, zero API calls ever |
| Scheduled crawls | No | Yes: cron-based, managed via UI |
| Change detection | No | Yes: diff any two crawl snapshots |
| Export formats | JSON only | JSON, CSV, XML, Markdown ZIP |
| No-auth local mode | No, always requires auth | Yes: LOCAL_MODE=true, no key needed |
| License | AGPL (cloud is proprietary) | MIT |
| Price at scale | $16-$333/month | $0 forever on self-host |
Apify is a cloud-only scraping marketplace: powerful, but you're renting compute on their servers and running scripts written by third parties.

| | Apify | Netleaf |
|---|---|---|
| Self-hostable | No | Yes |
| Your data stays on your machine | No, stored on Apify cloud | Yes |
| Free tier | $5 platform credit/month | Unlimited on your hardware |
| Structured AI extraction | Cobble it together yourself | Built-in, multi-provider |
| Change detection | No | Yes |
| Cost at scale | $49-$499/month | $0 |
Diffbot is an enterprise AI web extraction product: impressive technology, priced for enterprise budgets.

| | Diffbot | Netleaf |
|---|---|---|
| Pricing | $299-$999/month | $0 |
| Self-hostable | No | Yes |
| Custom extraction schemas | Yes | Yes (via JSON Schema + AI) |
| Open source | No | MIT |
What is a REST API? It's a way to talk to a server using simple HTTP requests, the same protocol your browser uses. You send a request to a URL with some data, and get a response back. Tools like curl (shown below), Postman, or any programming language can make these requests.
All endpoints return { "success": true, "data": ... }.
Authentication: In local mode (default), no auth header needed. In multi-user mode: add Authorization: Bearer nl_your_api_key to every request.
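As a rough illustration of the envelope and the auth rule above, a client might model them as follows. This is a sketch, not Netleaf's own client code: the names `Envelope`, `buildHeaders`, and `unwrap` are hypothetical, and the failure shape with a string `error` field is an assumption based on the `{success:false,error}` form mentioned in the security table.

```typescript
// Hypothetical client-side model of the response envelope.
// The failure branch (string `error`) is an assumption, not a documented contract.
type Envelope<T> =
  | { success: true; data: T }
  | { success: false; error: string };

// Build request headers. In local mode no key is passed; in multi-user mode
// the key is sent as a Bearer token on every request.
function buildHeaders(apiKey?: string): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (apiKey) {
    headers["Authorization"] = `Bearer ${apiKey}`;
  }
  return headers;
}

// Return the payload of a successful envelope, or throw on failure.
function unwrap<T>(envelope: Envelope<T>): T {
  if (envelope.success === true) {
    return envelope.data;
  }
  throw new Error(`Netleaf request failed: ${envelope.error}`);
}
```

Calling `buildHeaders()` with no argument matches local mode; passing `"nl_your_api_key"` matches multi-user mode.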
Opens the URL in a real headless browser (Chromium via Playwright), waits for JavaScript to load, then extracts the content.
What is a headless browser? A browser with no visible window. It loads pages exactly like Chrome or Firefox would (running JavaScript, rendering CSS, handling redirects) but in the background. This is how Netleaf handles modern JS-heavy sites that simple fetch() calls miss.
curl -X POST http://localhost:3000/v1/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://news.ycombinator.com",
"formats": ["markdown", "html", "links"],
"waitForSelector": "main"
}'

Response:
{
"success": true,
"data": {
"url": "https://news.ycombinator.com",
"markdown": "# Hacker News\n\n1. Some article title...",
"html": "<html>...",
"metadata": { "title": "Hacker News", "statusCode": 200 }
}
}

| Option | Type | Description |
|---|---|---|
| formats | string[] | "markdown", "html", "text", or "links" (same-host links) |
| waitForSelector | string | CSS selector to wait for before extracting (e.g. "main"). If not found within 5s, a non-fatal warnings entry is returned and partial content is still delivered. |
| timeout | number | Navigation timeout in ms (1000-60000, default 30000) |
Markdown link/image hrefs are absolutized against the page URL, so the output is portable. A warnings array is included only when something non-fatal happened (e.g. a missing waitForSelector).
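To make "absolutized against the page URL" concrete, here is a minimal sketch of the idea using the standard `URL` constructor. It is not Netleaf's implementation: the function name `absolutizeMarkdownLinks` is hypothetical, and the regular expression only handles the common inline forms `[text](href)` and `![alt](src)`.

```typescript
// Rewrite relative Markdown link and image targets to absolute URLs.
// Assumption: only inline links/images are handled; reference-style links are ignored.
function absolutizeMarkdownLinks(markdown: string, pageUrl: string): string {
  return markdown.replace(
    /(!?\[[^\]]*\]\()([^)\s]+)/g,
    (match: string, prefix: string, href: string): string => {
      try {
        // new URL(relative, base) resolves the target the way a browser would.
        return prefix + new URL(href, pageUrl).href;
      } catch {
        return match; // leave unparseable targets untouched
      }
    }
  );
}

const example = absolutizeMarkdownLinks(
  "[Docs](/docs) ![logo](img/a.png)",
  "https://example.com/blog/post"
);
// example === "[Docs](https://example.com/docs) ![logo](https://example.com/blog/img/a.png)"
```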
Starts an automatic crawl from a starting URL. Netleaf follows every internal link it finds, up to your maxPages limit. The crawl runs asynchronously in the background: you get a jobId immediately and poll for results.
What is an async (background) job? Instead of making you wait while it crawls 1000 pages (which could take minutes), Netleaf starts the job and gives you an ID immediately. You check back whenever you want to see progress or grab results.
What is BFS (Breadth-First Search)? The crawl strategy. It processes pages level by level: first the homepage, then all pages linked from the homepage, then all pages linked from those, and so on. Pages closest to the start URL are crawled first, so if the maxPages limit cuts the crawl short, you still get the pages that are usually the most important.
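The level-by-level order can be sketched over an in-memory link graph. This is an illustration of breadth-first traversal under stated assumptions, not Netleaf's crawler: `bfsOrder` is a hypothetical name, and the `links` map stands in for fetching a page and parsing its links.

```typescript
// Breadth-first crawl order over an in-memory link graph.
// `links` maps each URL to the URLs it links to (a stand-in for real fetching).
function bfsOrder(
  start: string,
  links: Record<string, string[]>,
  maxPages: number
): string[] {
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  const order: string[] = [];

  while (queue.length > 0 && order.length < maxPages) {
    const url = queue.shift() as string; // oldest entry first = breadth-first
    order.push(url);
    for (const next of links[url] ?? []) {
      if (!visited.has(next)) {
        visited.add(next); // mark on enqueue so a page is never queued twice
        queue.push(next);
      }
    }
  }
  return order;
}

const site: Record<string, string[]> = {
  "/": ["/a", "/b"],
  "/a": ["/c", "/"],
  "/b": ["/c"],
  "/c": [],
};
// bfsOrder("/", site, 10) visits "/", then "/a" and "/b", then "/c".
// bfsOrder("/", site, 2) stops after "/" and "/a".
```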
# 1. Start the crawl
curl -X POST http://localhost:3000/v1/crawl \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.example.com",
"maxPages": 100,
"formats": ["markdown"],
"webhookUrl": "https://your-app.com/webhook"
}'
# → { "success": true, "data": { "jobId": "abc-123" } }
# 2. Check progress: use the lightweight /status endpoint while polling
curl http://localhost:3000/v1/crawl/abc-123/status
# → { "status": "running", "totalScraped": 34, "totalFound": 89, "webhookSent": false }
# 3. Fetch full results (paginated: ?offset=&limit=, max 500/page)
curl http://localhost:3000/v1/crawl/abc-123
# 4. Export when done
curl "http://localhost:3000/v1/crawl/abc-123/export?format=csv" -o results.csv

- GET /v1/crawl/:id/status: lightweight polling (no page join); includes webhookSent delivery status when a webhook is attached.
- GET /v1/crawl/:id: full results, paginated via ?offset=&limit=.
- POST /v1/crawl/:id/webhook: attach a webhook to a running job (409 if already finished).
- Export formats: json · csv · xml · zip (one .md file per page).
- SSRF-blocked start URLs are rejected immediately with 422 (not accepted then failed).
Fast URL discovery without launching a browser. Checks robots.txt → sitemap → homepage links. Returns up to 1000 URLs in under 2 seconds.
What is robots.txt? A file websites publish at /robots.txt listing their sitemap locations and crawling rules.
What is a sitemap? An XML file listing the URLs on a site; search engines like Google use it to discover pages. Netleaf reads both to find URLs instantly without having to crawl the entire site.
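The first step of that chain, finding sitemap locations in robots.txt, could look roughly like this. It is a sketch rather than Netleaf's map service: `sitemapUrlsFromRobots` is a hypothetical name, and it only reads `Sitemap:` directives from text you have already downloaded.

```typescript
// Extract sitemap URLs from the text of a robots.txt file.
// The "Sitemap:" directive is matched case-insensitively and may appear anywhere.
function sitemapUrlsFromRobots(robotsTxt: string): string[] {
  const urls: string[] = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop trailing comments
    const match = /^sitemap:\s*(\S+)/i.exec(line);
    if (match && match[1]) {
      urls.push(match[1]);
    }
  }
  return urls;
}

const robots = [
  "User-agent: *",
  "Disallow: /admin",
  "Sitemap: https://example.com/sitemap.xml",
  "sitemap: https://example.com/blog-sitemap.xml # blog posts",
].join("\n");
// sitemapUrlsFromRobots(robots) returns both sitemap URLs, in file order.
```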
curl -X POST http://localhost:3000/v1/map \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "limit": 500}'

{
"success": true,
"data": {
"source": "sitemap",
"links": ["https://example.com/about", "https://example.com/blog/..."],
"total": 147
}
}

| Option | Type | Description |
|---|---|---|
| limit | number | Max URLs to return (default 100, max 1000) |
| includeSubdomains | boolean | Include links to subdomains of the target host |
| includeExternal | boolean | Include off-domain links (capped at 50) |
When a site has no sitemap and the homepage exposes no same-host links, the response includes a note explaining the empty result (rather than looking broken).
Scrapes a page, then asks an AI model to extract exactly the fields you define in a schema you provide. Works with Claude, OpenAI, or completely offline with Ollama.
What is a JSON Schema? A description of the shape of data you want back. You define which fields to extract and their types (string, number, boolean). The AI reads the page and fills in those fields.
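To show what "fits the schema" means for a flat object of primitive fields, here is a small checker. It is a simplified sketch, far short of full JSON Schema validation, and says nothing about how Netleaf validates model output: `FlatSchema` and `checkAgainstSchema` are hypothetical names.

```typescript
type PrimitiveType = "string" | "number" | "boolean";

// A deliberately small subset of JSON Schema: one flat object, primitive fields only.
interface FlatSchema {
  type: "object";
  properties: Record<string, { type: PrimitiveType }>;
  required?: string[];
}

// Return a list of problems; an empty list means the data fits the schema.
function checkAgainstSchema(
  data: Record<string, unknown>,
  schema: FlatSchema
): string[] {
  const problems: string[] = [];
  for (const field of schema.required ?? []) {
    if (!(field in data)) {
      problems.push(`missing required field: ${field}`);
    }
  }
  for (const field of Object.keys(schema.properties)) {
    const spec = schema.properties[field];
    if (spec && field in data && typeof data[field] !== spec.type) {
      problems.push(`field ${field} should be ${spec.type}`);
    }
  }
  return problems;
}

const productSchema: FlatSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    price: { type: "number" },
    inStock: { type: "boolean" },
  },
  required: ["name", "price"],
};
// A price returned as the string "79.99" would be reported as a type problem.
```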
What is Ollama? A tool that lets you run AI language models (like Llama, Mistral) entirely on your own machine with no internet connection and zero cost. No API keys, no monthly bills.
curl -X POST http://localhost:3000/v1/extract \
-H "Content-Type: application/json" \
-d '{
"url": "https://shop.example.com/product/123",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" },
"inStock": { "type": "boolean" }
},
"required": ["name", "price"]
},
"provider": "ollama"
}'

{
"success": true,
"data": { "name": "Wireless Headphones", "price": 79.99, "inStock": true }
}

| Provider | Setup | Cost |
|---|---|---|
| claude | Set ANTHROPIC_API_KEY | ~$0.001 per page |
| openai | Set OPENAI_API_KEY | ~$0.001 per page |
| ollama | Install Ollama, pull a model | $0 forever, fully offline |
Search the web via Brave Search, then optionally scrape the full content of each result.
Why Brave Search? It has a free API tier (2000 requests/month) and returns results from its own index, independent of Google. Get your free key at search.brave.com.
curl -X POST http://localhost:3000/v1/search \
-H "Content-Type: application/json" \
-d '{
"query": "best open source web scraping tools 2025",
"maxResults": 5,
"scrape": true
}'

Returns title, description, URL, and optionally the full Markdown of each result page. Requires BRAVE_API_KEY.
Create a crawl job that runs automatically on any schedule: daily, hourly, every Monday at 9am, whatever you need.
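Schedules are written as cron expressions, and the examples in this README use the standard five-field form (minute, hour, day of month, month, day of week). The sketch below only splits an expression into named fields; `parseCron` is a hypothetical helper, and whether Netleaf also accepts other cron dialects is not stated here.

```typescript
interface CronFields {
  minute: string;
  hour: string;
  dayOfMonth: string;
  month: string;
  dayOfWeek: string;
}

// Split a five-field cron expression into named fields.
// Only the field count is checked; value ranges are not validated.
function parseCron(expression: string): CronFields {
  const parts = expression.trim().split(/\s+/);
  if (parts.length !== 5) {
    throw new Error(`expected 5 cron fields, got ${parts.length}`);
  }
  const [minute, hour, dayOfMonth, month, dayOfWeek] =
    parts as [string, string, string, string, string];
  return { minute, hour, dayOfMonth, month, dayOfWeek };
}

const daily = parseCron("0 8 * * *");        // minute "0", hour "8"
const sixHourly = parseCron("0 */6 * * *");  // minute "0", hour "*/6"
```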
What is a cron expression? A compact way to write a schedule. "0 8 * * *" means "at 8:00 AM every day". "0 */6 * * *" means "every 6 hours". You can use crontab.guru to build these visually.
# Create a schedule: crawl a competitor site daily at 8am
curl -X POST http://localhost:3000/v1/schedule \
-H "Content-Type: application/json" \
-d '{
"name": "Daily competitor check",
"cronExpression": "0 8 * * *",
"url": "https://competitor.com",
"maxPages": 50,
"webhookUrl": "https://your-app.com/on-crawl-complete"
}'
# List all your schedules
curl http://localhost:3000/v1/schedule
# Pause a schedule (without deleting it)
curl -X PATCH http://localhost:3000/v1/schedule/<id> \
-H "Content-Type: application/json" \
-d '{"isActive": false}'

On every crawl, Netleaf stores a fingerprint (SHA-256 hash) of each page's content. The diff endpoint compares any two crawl runs and tells you exactly what was added, removed, or changed.
What is SHA-256 hashing? A mathematical function that takes any text and produces a fixed-length fingerprint that is unique for all practical purposes. If even one character changes, the fingerprint changes completely. This lets Netleaf detect content changes without storing the full page content twice.
curl "http://localhost:3000/v1/diff?jobIdA=<uuid-1>&jobIdB=<uuid-2>"

{
"success": true,
"data": {
"added": ["https://example.com/new-page"],
"removed": ["https://example.com/old-page"],
"changed": ["https://example.com/pricing"],
"unchanged": 94
}
}

Use this to monitor competitor pricing, track documentation changes, or build alerts when content updates.
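The comparison itself is simple once each crawl is reduced to a map from page URL to content fingerprint. The following sketch shows that logic under stated assumptions: `diffSnapshots` and `CrawlDiff` are hypothetical names, the hash values are shortened placeholders rather than real SHA-256 digests, and Netleaf's actual storage layout is not described here.

```typescript
interface CrawlDiff {
  added: string[];
  removed: string[];
  changed: string[];
  unchanged: number;
}

// Compare two snapshots, each mapping page URL -> content hash.
// `a` is the older crawl, `b` the newer one.
function diffSnapshots(
  a: Record<string, string>,
  b: Record<string, string>
): CrawlDiff {
  const result: CrawlDiff = { added: [], removed: [], changed: [], unchanged: 0 };
  for (const url of Object.keys(b)) {
    if (!(url in a)) {
      result.added.push(url);        // only in the newer crawl
    } else if (a[url] !== b[url]) {
      result.changed.push(url);      // same URL, different fingerprint
    } else {
      result.unchanged += 1;
    }
  }
  for (const url of Object.keys(a)) {
    if (!(url in b)) {
      result.removed.push(url);      // only in the older crawl
    }
  }
  return result;
}

// Placeholder fingerprints for illustration only.
const older = { "/pricing": "aa11", "/old-page": "bb22", "/about": "cc33" };
const newer = { "/pricing": "dd44", "/new-page": "ee55", "/about": "cc33" };
const crawlDiff = diffSnapshots(older, newer);
```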
No API key. No cloud. No cost. Extract structured data entirely on your own hardware.
# 1. Install Ollama (Mac/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# 2. Pull a model (llama3.2 is fast and small, ~2GB download)
ollama pull llama3.2
# 3. Tell Netleaf which model to use, then start it. Ollama runs on your host;
# the Docker container reaches it via host.docker.internal.
OLLAMA_URL=http://host.docker.internal:11434 OLLAMA_MODEL=llama3.2 docker compose up

Then use "provider": "ollama" in /v1/extract. The entire scrape → AI extraction loop never leaves your machine.
Reasoning models (qwen3, deepseek-r1, etc.) are fully supported: Netleaf sends think: false and falls back to the thinking field so structured output is captured reliably. Set OLLAMA_MODEL to whatever you've pulled; it defaults to llama3.1.
Tech stack explained for newcomers:
- Fastify: a fast Node.js web server framework (like Express, but faster)
- Playwright: Microsoft's library for controlling a real Chromium browser from code
- BullMQ: a job queue. When you start a crawl, jobs are added to a queue and processed in the background. Queued jobs survive server restarts.
- Redis: an in-memory data store used by BullMQ to hold the job queue
- Drizzle ORM: a TypeScript library for querying PostgreSQL safely (queries are parameterized rather than built from raw SQL strings, which guards against SQL injection)
- PostgreSQL: the main database storing users, API keys, crawl results
- Next.js 16: the React framework powering the web dashboard (App Router = file-based routing)
- Tailwind CSS: a utility-first CSS framework (style by adding class names)
- Auth.js v5: handles login sessions, JWTs, and OAuth (Google) for the dashboard
- Vitest: a fast JavaScript test runner (105 tests across all services)
netleaf/
├── apps/
│   ├── api/                  # Fastify REST API (TypeScript)
│   │   └── src/
│   │       ├── scraper/      # Playwright browser pool, headless Chromium
│   │       ├── crawler/      # BFS engine + link parser (cheerio)
│   │       ├── queue/        # BullMQ + Redis async job queue
│   │       ├── db/           # Drizzle ORM + PostgreSQL (schema, migrations)
│   │       ├── security/     # SSRF egress guard, input validators
│   │       ├── services/     # map · extract · search · diff · scheduler · webhook
│   │       └── api/routes/   # scrape · crawl · map · extract · search · schedule · keys
│   │
│   └── web/                  # Next.js 16 dashboard (App Router)
│       └── src/
│           ├── app/          # landing, auth, dashboard, docs, API routes
│           ├── components/   # landing sections, dashboard widgets, docs Try-It UI
│           └── lib/          # auth (Auth.js v5), db (Drizzle), server actions
│
└── packages/
    └── shared-types/         # TypeScript types shared between API and web
Netleaf is designed to be safe to expose as a public service, not just a personal tool.
| Protection | What it prevents |
|---|---|
| SSRF guard | Attackers using Netleaf to scrape your internal network (e.g. 192.168.x.x, AWS metadata at 169.254.169.254). All redirect chains are validated hop-by-hop. |
| Scheme allowlist | file://, javascript:, ftp://, data: URLs are rejected before any fetch |
| Schema size limits | /v1/extract schemas capped at 50KB, depth ≤ 20, $ref rejected; prevents memory exhaustion |
| Rate limiting | Per-token rate limiting runs before auth. Distributed across instances via Upstash when configured (UPSTASH_REDIS_REST_URL), in-memory fallback otherwise |
| Consistent error envelope | Global handlers map DB/Redis failures → 503 (no internal hostnames leaked), malformed bodies → 400, and unknown routes → a {success:false,error} 404. Validation errors include the field path |
| CSV injection prevention | Cells starting with = + - @ \t are prefixed with ', which prevents formula injection when opened in Excel |
| No account enumeration | Registration uses constraint-violation catch (not check-then-insert), so attackers can't probe whether an email is registered |
| Constant-time login | Dummy bcrypt compare on unknown emails equalizes response time (no timing oracle) |
| 105 tests | Dedicated SSRF test suite covering 20 attack vectors |
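The CSV injection row above describes a simple rule that can be sketched in a few lines. This is an illustration of the technique, not Netleaf's export code: `escapeCsvCell` is a hypothetical name, and the quoting step follows ordinary CSV conventions rather than anything stated in this README.

```typescript
// Neutralize spreadsheet formulas, then apply standard CSV quoting.
function escapeCsvCell(value: string): string {
  // Cells starting with = + - @ or a tab are prefixed with a single quote,
  // so spreadsheet apps treat them as text instead of evaluating them.
  const guarded = /^[=+\-@\t]/.test(value) ? "'" + value : value;

  // Quote the cell if it contains a comma, a double quote, or a line break.
  if (/[",\r\n]/.test(guarded)) {
    return '"' + guarded.replace(/"/g, '""') + '"';
  }
  return guarded;
}

const formulaCell = escapeCsvCell("=SUM(A1:A2)"); // "'=SUM(A1:A2)"
const plainCell = escapeCsvCell("plain text");    // "plain text"
```

One trade-off of this rule is that legitimate values such as negative numbers ("-5") also receive the prefix.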
DNS rebinding (H-4): the headless-browser scrape/crawl path manages its own DNS and can't be IP-pinned in code. For untrusted multi-tenant deployments, pair Netleaf with a network-level egress firewall blocking outbound traffic to private/link-local/metadata ranges, then set EGRESS_FIREWALL_DECLARED=true to silence the startup warning. The plain-fetch paths (map/sitemap) revalidate every redirect hop.
What is SSRF? Server-Side Request Forgery. An attack where a malicious user tricks your server into making HTTP requests to internal services (your database, cloud metadata APIs, internal admin panels) that should never be publicly reachable. Netleaf's egress guard blocks this.
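One building block of such a guard is deciding whether an IPv4 address belongs to a range that should never be reachable from a public scraper. The sketch below covers only that check, with hypothetical names `parseIpv4` and `isBlockedIpv4`; a real egress guard, including Netleaf's, also has to deal with DNS resolution, IPv6, and redirects, none of which is shown here.

```typescript
// Parse a dotted-quad IPv4 address into four octets, or return null if malformed.
function parseIpv4(ip: string): number[] | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  const octets: number[] = [];
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const n = Number(part);
    if (n > 255) return null;
    octets.push(n);
  }
  return octets;
}

// True when the address is in a loopback, private, or link-local range.
function isBlockedIpv4(ip: string): boolean {
  const octets = parseIpv4(ip);
  if (octets === null) return true; // fail closed on anything unparseable
  const [a, b] = octets as [number, number, number, number];
  return (
    a === 0 ||                            // "this network"
    a === 10 ||                           // 10.0.0.0/8 private
    a === 127 ||                          // loopback
    (a === 169 && b === 254) ||           // link-local, incl. 169.254.169.254 metadata
    (a === 172 && b >= 16 && b <= 31) ||  // 172.16.0.0/12 private
    (a === 192 && b === 168)              // 192.168.0.0/16 private
  );
}
```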
| Variable | Default | Required | Description |
|---|---|---|---|
| LOCAL_MODE | true | - | Skip all auth; ideal for personal use on your own machine |
| DATABASE_URL | - | Yes (non-local) | PostgreSQL connection string, e.g. postgresql://user:pass@host/db |
| REDIS_URL | redis://redis:6379 | Yes | Redis for the job queue (Docker provides this automatically) |
| PORT | 3000 | - | API port |
| ANTHROPIC_API_KEY | - | No | Enable Claude as an extraction provider |
| OPENAI_API_KEY | - | No | Enable OpenAI as an extraction provider |
| OLLAMA_URL | http://localhost:11434 | No | Enable Ollama for free local AI extraction |
| OLLAMA_MODEL | llama3.1 | No | Which pulled Ollama model /v1/extract uses (e.g. qwen3.5:4b) |
| BRAVE_API_KEY | - | No | Enable /v1/search (2000 free req/month) |
| WEBHOOK_SECRET | - | No | If set, outgoing webhooks include an X-Netleaf-Signature HMAC for receiver verification |
| ALLOW_PRIVATE_IPS | false | - | Set true only for trusted local dev; disables SSRF protection |
| EGRESS_FIREWALL_DECLARED | false | - | Set true once a network-level egress firewall is in place (silences the H-4 startup warning) |
| MAX_CONTENT_CHARS | 5000000 | - | Max characters stored per scraped page (~5MB) |
| Variable | Required | Description |
|---|---|---|
| DATABASE_URL | Yes | Same PostgreSQL instance as the API |
| AUTH_SECRET | Yes | Random secret for session encryption. Generate: openssl rand -base64 32 |
| AUTH_URL | Yes | Full URL of your web deployment, e.g. https://netleaf.vercel.app |
| NEXT_PUBLIC_API_URL | Yes | URL of the API, e.g. http://localhost:3000 |
| DISABLE_REGISTRATION | No | Set "true" to block new signups on public deployments |
| AUTH_GOOGLE_ID | No | Google OAuth client ID (optional, enables Google login) |
| AUTH_GOOGLE_SECRET | No | Google OAuth client secret |
| UPSTASH_REDIS_REST_URL | No | Enable distributed (cross-instance) rate limiting on serverless |
| UPSTASH_REDIS_REST_TOKEN | No | Token paired with the Upstash REST URL above |
# Prerequisites: Node.js 20+, PostgreSQL 15+, Redis 7+
npm ci
cp .env.example .env
# Edit .env: set DATABASE_URL and REDIS_URL
# Run database migrations (creates all tables)
npm run db:migrate --workspace=apps/api
# Start API on port 3000
npm run dev --workspace=apps/api
# Start web dashboard on port 3001
npm run dev --workspace=apps/web

# Run all 105 tests
npm test --workspace=apps/api
# Type-check both apps
npm run typecheck --workspace=apps/api
npm run typecheck --workspace=apps/web

Tests cover: scraper extraction, link parser, map service, search service, webhook service, diff service, SSRF guard (20 attack vectors), API routes, and Redis queue.
- Fork and clone the repo
- npm ci
- cp .env.example .env
- npm test --workspace=apps/api (confirm all green before you start)
- Make your changes with tests
- Open a PR against main

Guidelines:
- File naming: lowercase-kebab.ts
- Tests required for all new endpoints
- Security-sensitive changes must include tests in src/security/
Netleaf is solid for self-hosting today (docker compose up; all features verified end-to-end, including LLM extraction). Before exposing it as a public, multi-tenant service, work through this checklist:
- Set LOCAL_MODE=false and provision API keys (the startup guard refuses LOCAL_MODE=true + NODE_ENV=production).
- Generate strong secrets: AUTH_SECRET, POSTGRES_PASSWORD (openssl rand -base64 32).
- Set AUTH_URL to your real web origin and NEXT_PUBLIC_API_URL to your real API origin.
- Network egress firewall (H-4): block outbound to private/link-local/metadata ranges, then set EGRESS_FIREWALL_DECLARED=true. Required for untrusted multi-tenant.
- Distributed rate limiting: set UPSTASH_REDIS_REST_URL + UPSTASH_REDIS_REST_TOKEN (otherwise limits are per-instance).
- Webhook signing: set WEBHOOK_SECRET so receivers can verify payloads.
- Serve over HTTPS (HSTS is already sent) and deploy the API somewhere reachable by the dashboard.
For single-tenant / internal use, only the secrets and HTTPS items apply.
MIT: use it, modify it, sell it, self-host it commercially. No strings attached.
Copyright © 2026 Aditya Salgare