agentfetch

Open-source web retrieval & research agent built for AI agents.

agentfetch is a free, local alternative to Firecrawl, Exa, Parallel.ai, and Tavily. It fetches any webpage, crawls any site, searches the web, and researches any topic — returning clean markdown and structured reports that AI agents can consume directly.

Works with LangChain, LlamaIndex, CrewAI, AutoGen, Claude MCP, OpenAI function calling, Gemini, Groq, and plain REST. No vendor lock-in, no API keys required.

Install

Standard

pip install git+https://github.com/SID1ART/agentfetch.git

Cloud notebooks (Colab, Jupyter, Kaggle)

pip install https://github.com/SID1ART/agentfetch/archive/main.zip

With extra integrations

pip install "agentfetch[langchain,llamaindex,crewai] @ git+https://github.com/SID1ART/agentfetch.git"
pip install "agentfetch[search] @ git+https://github.com/SID1ART/agentfetch.git"   # adds Google search engine

No PyPI account, no API tokens, no sign-up needed. GitHub is the source.

What makes it different

Research Agent — Tavily-style deep research: auto-decomposes questions into sub-queries, searches multiple engines, gathers full content, and synthesizes a structured report with citations via Ollama or Claude. Supports iterative follow-up (depth="deep"), structured output schemas, and four citation formats (numbered, MLA, APA, Chicago).
Smart Mode Router — detects JavaScript-heavy SPAs (Next.js, Nuxt, React) and falls back to Playwright headless browser automatically. Static pages use direct HTTP.
5-layer extraction pipeline — trafilatura → newspaper3k → readability-lxml → BeautifulSoup → plain text. Best-effort extraction from any HTML.
Never raises exceptions — always returns structured FetchResult with confidence scores, error fields, and injection detection. Agents can trust the output.
Post-extraction quality check — detects SPA shell text, low prose ratio, missing sentence structure, and downgrades confidence so agents can retry with browser engine.
In-memory LRU cache — deduplicates repeated URL fetches within a session. Configurable size/TTL via env vars. No Redis required.
Information saturation crawling — no arbitrary depth limits. CrawlStopper detects vocabulary saturation and content redundancy, stopping when enough data is gathered.
Persistent crawl store — SQLite-backed job persistence when Redis is not configured. Results survive process restarts.
Prompt injection firewall — 13 patterns detected and redacted to [REDACTED BY AGENTFETCH].
Cloudflare bypass — optional curl_cffi integration with 12 TLS fingerprint profiles (Chrome 99–124, Safari 15/17) and auto-rotation.
Browser stealth — optional playwright-stealth integration for advanced anti-detection (WebGL vendor, canvas fingerprint, navigator.webdriver removal). Enabled by default.
Robots.txt compliance — optional async parser with caching, crawl-delay, and sitemap discovery.
Proxy rotation — round-robin or random proxy pools with automatic failure tracking.
Local LLM extraction — optional Ollama integration for structured data extraction without API costs.
Redis-backed job queue — horizontal scaling for crawl operations with background workers.

Tools

Tool	Description
`agent_scrape`	Fetch any URL; auto-detects browser need. Supports ScrapeConfig (wait_for selectors, tag filtering, citation markers, proxies, JA3 profile).
`agent_crawl`	Recursive crawl with information saturation stopping, robots.txt compliance, deduplication.
`agent_search`	Web search via SearXNG, DuckDuckGo, Google, or Bing with optional result scraping.
`agent_extract`	Structured data extraction by JSON schema via Ollama, Anthropic Claude, or CSS fallback.
`agent_map`	Discover all URLs on a website via sitemap.xml and BFS crawling.
`agent_status`	Poll crawl job progress (in-memory or Redis).
`agent_research`	Research a topic: decomposes into sub-queries, searches multi-engine, gathers content, synthesizes a structured report with citations via LLM. Supports deep iterative follow-up and structured output schemas.

Library API

Function	Description
`smart_fetch(url, config=)`	Fetch a single URL; auto-detects browser need. Returns `FetchResult`.
`batch_fetch(urls, concurrency=)`	Fetch multiple URLs concurrently. Returns `list[FetchResult]`.
`search_fetch(query, sources=, max_results=)`	Search and optionally scrape results. Returns `SearchResult`.
`parallel_search(query, sources=, max_results=)`	Search engine results without scraping. Returns `tuple[list[EngineResult], list[str], dict[str, str]]`.
`smart_research(prompt, config=)`	Research a topic: decomposes query, gathers sources, synthesizes report with citations via LLM. Returns `ResearchResult`.

Quickstart

LangChain

from agentfetch.integrations.langchain.tools import AgentFetchTools
tools = AgentFetchTools
# Use with any LangChain agent

MCP (Claude Desktop, Cursor, etc.)

pip install git+https://github.com/SID1ART/agentfetch.git
agentfetch-mcp
# configure in Claude Desktop or any MCP host

REST API

pip install git+https://github.com/SID1ART/agentfetch.git
agentfetch serve

# Scrape
curl -X POST http://localhost:8080/agent_scrape \
  -d '{"url": "https://example.com"}'

# Research (async — returns job ID, poll for result)
curl -X POST http://localhost:8080/agent_research \
  -d '{"prompt": "Latest AI developments", "max_sources": 10}'
# → {"request_id":"abc123","status":"pending",...}

# Poll research result
curl http://localhost:8080/agent_research/abc123
# → {"request_id":"abc123","status":"complete","answer":"# Report...","sources":[...]}

# Research with streaming (SSE)
curl -X POST http://localhost:8080/agent_research/stream \
  -d '{"prompt": "Compare OpenAI and Anthropic pricing", "citation_format": "apa"}'
# → event: progress, event: result (full report)

Python library

import asyncio
from agentfetch import smart_fetch, search_fetch
from agentfetch.core.schema import ScrapeConfig

# Fetch a single URL
result = asyncio.run(smart_fetch(
    "https://en.wikipedia.org/wiki/Obsession_(2025_film)",
    config=ScrapeConfig(
        wait_for=".main-content",
        exclude_tags=["nav", "footer"],
        citation_links=True,
    )
))
print(result.content)  # clean markdown
print(result.citations)  # [1], [2] URLs

# Search with multiple engines
sr = asyncio.run(search_fetch(
    "latest AI news",
    sources=["duckduckgo", "google", "bing"],
    max_results=5,
))
print(sr.results)      # list[FetchResult]
print(sr.errors)       # per-engine errors, e.g. {"google": "rate limited (429)"}
print(sr.sources_used) # engines that returned results

# Research a topic (uses Ollama or Claude for query decomposition and synthesis)
from agentfetch import smart_research, ResearchConfig

report = asyncio.run(smart_research(
    "Compare pricing of OpenAI, Anthropic, and Google AI APIs",
    config=ResearchConfig(
        max_sources=15,
        citation_format="apa",
        depth="deep",
    )
))
print(report.answer)        # Comprehensive report with [Author, Year] citations
print(report.sources)       # list of ResearchSource with full content
print(report.response_time) # e.g. 12.34s

All integrations

Framework	Install	Tools available
LangChain	`pip install "agentfetch[langchain] @ git+https://github.com/SID1ART/agentfetch.git"`	`scrape`, `search`, `crawl`, `map`, `status`, `research`
LlamaIndex	`pip install "agentfetch[llamaindex] @ git+https://github.com/SID1ART/agentfetch.git"`	`scrape`, `search`, `crawl`, `map`, `status`, `research`
CrewAI	`pip install "agentfetch[crewai] @ git+https://github.com/SID1ART/agentfetch.git"`	`scrape`, `search`, `crawl`, `map`, `status`, `research`
AutoGen	`pip install git+https://github.com/SID1ART/agentfetch.git`	`scrape`, `search`, `crawl`, `map`, `status`, `research`
OpenAI / Gemini / Groq	`pip install git+https://github.com/SID1ART/agentfetch.git`	`scrape`, `search`, `crawl`, `map`, `status`, `research`
Claude MCP	`pip install git+https://github.com/SID1ART/agentfetch.git`	`scrape`, `search`, `crawl`, `map`, `status`, `research`
Ollama	`pip install git+https://github.com/SID1ART/agentfetch.git`	`extract`
REST	`pip install git+https://github.com/SID1ART/agentfetch.git`	All endpoints + `/agent_research` + streaming

Schema reference

`ScrapeConfig`

Field	Type	Default	Description
`wait_for`	`str`	`None`	CSS selector to wait for before extracting
`include_tags`	`list[str]`	`None`	Only extract these HTML tags
`exclude_tags`	`list[str]`	`None`	Skip these HTML tags during extraction
`viewport`	`dict`	`None`	Browser viewport `{width, height}`
`js_wait_ms`	`int`	`0`	Extra JS wait time in milliseconds
`scrape_links`	`bool`	`True`	Extract links from page
`max_content_length`	`int`	`50000`	Truncate content beyond this length
`citation_links`	`bool`	`False`	Track citation markers `[1]`, `[2]`
`proxy`	`str`	`None`	Proxy URL for this request
`cookies`	`list[dict]`	`None`	Cookies to include in browser session
`headers`	`dict[str,str]`	`None`	Custom HTTP headers
`ja3`	`str`	`None`	JA3 TLS profile for `curl_cffi` bypass (e.g. `"chrome124"`)
`stealth`	`bool`	`True`	Enable browser stealth evasions (playwright-stealth if available)
`actions`	`list[Action]`	`[]`	Action chain to execute before extraction (click, scroll, type, wait, press, select, screenshot, hover, custom_js)
`screenshot`	`bool`	`False`	Capture a full-page final screenshot (PNG, base64-encoded in `screenshot_data`)

`FetchResult`

Field	Type	Description
`url`	`str`	Requested URL
`content`	`str`	Extracted markdown content
`title`	`str`	Page title
`confidence`	`float`	Extraction quality (0.0–1.0)
`content_type`	`str`	Detected type (article, blog, product, etc.)
`word_count`	`int`	Word count of extracted content
`render_mode`	`str`	Renderer used: `static`, `browser`, or `bypass`
`latency_ms`	`int`	Total request time in milliseconds
`cached`	`bool`	Whether result came from cache
`injection_detected`	`bool`	Prompt injection was found and redacted
`links`	`list[str]`	Links extracted from the page
`error`	`str`	Error message if the fetch failed
`duplicate_of`	`str`	URL this content was deduplicated against
`retries`	`int`	Number of retries performed
`citations`	`list[str]`	Citation URLs when `citation_links=True`
`robots_allowed`	`bool`	Whether robots.txt permitted the fetch
`proxy_used`	`str`	Proxy used for this request
`normalized_url`	`str`	Normalized version of the requested URL
`screenshot_data`	`str`	Base64-encoded final PNG screenshot (when `screenshot=True` in ScrapeConfig)
`screenshots`	`list[str]`	Base64-encoded PNG screenshots from mid-flow screenshot actions (when `store_output=True`)

`Action`

Field	Type	Default	Description
Field	Type	Default	Description
-------	------	---------	-------------
`type`	`str`	—	Action type: `click`, `scroll`, `type`, `wait`, `press`, `select`, `screenshot`, `hover`, or `custom_js`
`selector`	`str`	`None`	CSS selector for `click`, `type`, `scroll`, `press`, `select`, `hover` actions
`value`	`str`	`None`	Value: text for `type`, key for `press`, ms for `wait`, pixels for `scroll`, option value for `select`, JS code for `custom_js`
`timeout`	`int`	`5000`	Timeout in ms for selector waits
`store_output`	`bool`	`False`	When `True`, stores `screenshot` action output in `FetchResult.screenshots` list

Action details:

click — clicks a CSS selector, waits for networkidle afterward
scroll — scrolls to selector, "bottom", "top", or by N pixels
type — fills an input field with value
wait — waits N milliseconds (value)
press — presses a key (value, default "Enter"), waits for networkidle
select — selects a <select> option by value
screenshot — captures a full-page PNG; stored in screenshots[] if store_output=True
hover — hovers over a CSS selector
custom_js — runs arbitrary JavaScript from value on the page, waits for networkidle

Examples:

# Hover to reveal dropdown, then click
actions = [
    Action(type="hover", selector="#nav-menu"),
    Action(type="wait", value="500"),
    Action(type="click", selector="#nav-menu .dropdown-item"),
]

# Run custom JS and capture mid-flow screenshot
actions = [
    Action(type="custom_js", value="document.querySelector('.paywall')?.remove()"),
    Action(type="screenshot", store_output=True),
]

# Full-page final screenshot
config = ScrapeConfig(screenshot=True, actions=[...])

`SearchConfig`

Field	Type	Default	Description
`max_results`	`int`	`5`	Max results per engine
`sources`	`list[str]`	`None`	Engines: `duckduckgo`, `google`, `bing`, `searxng`, `brave`, `serpapi`
`scrape_results`	`bool`	`True`	Fetch full content of each result
`searxng_url`	`str`	`""`	Self-hosted SearXNG instance URL
`topic`	`str`	`"general"`	Search topic: `general`, `news`, or `finance`
`time_range`	`str`	`None`	Time filter: `day`, `week`, `month`, or `year`
`country`	`str`	`None`	Boost results from a country (e.g. `"united states"`, `"united kingdom"`)
`include_answer`	`bool`	`False`	Include an LLM-generated answer via Ollama or Anthropic

`SearchResult`

Field	Type	Description
`query`	`str`	Original search query
`results`	`list[FetchResult]`	Search results with extracted content
`source`	`str`	Concatenated engine names used
`sources_used`	`list[str]`	Engines that returned results
`suggestions`	`list[str]`	Search suggestions (if available)
`total_results`	`int`	Total deduplicated result count
`errors`	`dict[str,str]`	Per-engine error messages (e.g. `{"google": "rate limited (429)"}`)
`answer`	`str`	LLM-generated answer when `include_answer=True` in SearchConfig

`MapConfig`

Field	Type	Default	Description
`max_depth`	`int`	`2`	Maximum crawl depth for link discovery
`max_pages`	`int`	`100`	Maximum URLs to discover
`include_patterns`	`list[str]`	`None`	Regex patterns to include only matching paths
`exclude_patterns`	`list[str]`	`None`	Regex patterns to exclude matching paths
`include_domains`	`list[str]`	`None`	Only include URLs from these domains
`exclude_domains`	`list[str]`	`None`	Exclude URLs from these domains
`respect_robots`	`bool`	`True`	Respect robots.txt during crawl discovery

`MapResult`

Field	Type	Description
`base_url`	`str`	The root URL that was mapped
`links`	`list[str]`	Discovered URLs
`total`	`int`	Total number of discovered URLs
`sources`	`list[str]`	Discovery methods used (`sitemap`, `crawl`)

`ResearchConfig`

Field	Type	Default	Description
`prompt`	`str`	—	The research question or topic
`model`	`str`	`"auto"`	Model tier: `"mini"` (fast/cheap), `"pro"` (comprehensive), `"auto"`
`max_sources`	`int`	`20`	Maximum sources to gather
`output_schema`	`dict`	`None`	JSON Schema for structured output in the report
`citation_format`	`str`	`"numbered"`	Citation style: `"numbered"`, `"mla"`, `"apa"`, `"chicago"`
`include_domains`	`list[str]`	`None`	Prioritize results from these domains
`exclude_domains`	`list[str]`	`None`	Exclude results from these domains
`depth`	`str`	`"standard"`	Research depth: `"quick"`, `"standard"`, `"deep"` (deep enables iterative follow-up queries)
`max_iterations`	`int`	`4`	Max follow-up iterations when `depth="deep"`

`ResearchSource`

Field	Type	Description
`url`	`str`	Source URL
`title`	`str`	Page title
`content`	`str`	Extracted text content
`relevance_score`	`float`	Relevance to the research question (0.0–1.0)
`citation`	`str`	Pre-formatted citation string (e.g. `[1]`, `(Author, 2024)`)

`ResearchResult`

Field	Type	Description
`request_id`	`str`	Unique job ID
`query`	`str`	Original research question
`answer`	`str`	Structured markdown report with citations
`sources`	`list[ResearchSource]`	Gathered sources with title, URL, content, and formatted citation
`structured_output`	`dict`	JSON matching `output_schema` if provided
`model_used`	`str`	LLM provider used for synthesis (`ollama`, `anthropic`)
`total_sources`	`int`	Number of sources gathered
`response_time`	`float`	Total research time in seconds
`status`	`str`	`"pending"`, `"running"`, `"complete"`, or `"failed"`
`error`	`str`	Error message if the research failed

Configuration

Environment variables

Variable	Default	Description
`REDIS_URL`	—	Redis connection for caching + job queue
`SEARXNG_URL`	—	SearXNG instance for search (falls back to DuckDuckGo + Google + Bing)
`BRAVE_SEARCH_API_KEY`	—	Brave Search API key (enables `brave` engine, preferred over DuckDuckGo)
`SERPAPI_KEY`	—	SerpAPI key (enables `serpapi` engine, preferred over Google scraping)
`GOOGLE_API_KEY`	—	Google Custom Search API key (used by `google` engine when both key and CX are set)
`GOOGLE_CX`	—	Google Custom Search CX (required with `GOOGLE_API_KEY`)
`ANTHROPIC_API_KEY`	—	For Claude-powered `agent_extract` and research agent synthesis
`ANTHROPIC_MODEL`	`claude-3-haiku-20240307`	Claude model name for extraction and research
`OLLAMA_URL`	—	Ollama endpoint for local LLM extraction and research agent
`OLLAMA_MODEL`	`llama3.2`	Ollama model name for extraction and research
`AGENTFETCH_CACHE_TTL`	`300`	In-memory LRU cache TTL (seconds)
`AGENTFETCH_CACHE_SIZE`	`100`	Max entries in in-memory LRU cache
`AGENTFETCH_STATIC_TIMEOUT`	`15`	HTTP fetch timeout (seconds)
`AGENTFETCH_BROWSER_TIMEOUT`	`30`	Playwright browser timeout (seconds)
`AGENTFETCH_MAX_RETRIES`	`2`	Max retries for failed requests
`AGENTFETCH_DOMAIN_DELAY`	`0.5`	Delay between requests to same domain
`AGENTFETCH_ROBOTS_CHECK`	`false`	Enable robots.txt compliance
`AGENTFETCH_PROXY_LIST`	—	Comma-separated proxy URLs or JSON array
`AGENTFETCH_PROXY_STRATEGY`	`round-robin`	`round-robin` or `random`
`AGENTFETCH_COOKIES_FILE`	—	Path to cookies file (Netscape or JSON)
`AGENTFETCH_PORT`	`8080`	API server port
`AGENTFETCH_JA3_PROFILE`	—	JA3 TLS profile override for `curl_cffi`
`AGENTFETCH_STEALTH`	`true`	Enable browser stealth evasions in Playwright
`AGENTFETCH_STEALTH_BASIC_FALLBACK`	`true`	Fall back to non-stealth browser if stealth fails
`AGENTFETCH_CRAWL_DB`	`agentfetch_crawl.db`	SQLite path for crawl job persistence
`AGENTFETCH_MIN_PROSE_RATIO`	`0.4`	Minimum alpha-char ratio for quality check
`AGENTFETCH_MIN_WORDS`	`10`	Minimum word count for quality check

Self-host

docker-compose up -d
# Starts API (port 8080), MCP SSE (port 8081), Redis
# Optional crawl worker:
docker compose --profile worker up -d

Architecture

                         ┌─────────────┐
                         │   Smart     │
                         │   URL       │
                         │   Router    │
                         └──────┬──────┘
                                │
              ┌─────────────────┼──────────────────┐
              │                 │                   │
              ▼                 ▼                   ▼
      ┌────────────┐   ┌──────────────┐   ┌────────────────┐
      │  Static    │   │  Cloudflare  │   │   Playwright   │
      │  HTTP      │   │  bypass      │   │   Headless     │
      │  (httpx)   │   │  (curl_cffi) │   │   Browser      │
      └─────┬──────┘   └──────┬───────┘   └───────┬────────┘
            │                 │                    │
            └─────────────────┼────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │  Extraction     │
                    │  Pipeline       │
                    │  trafilatura →  │
                    │  newspaper3k →  │
                    │  readability →  │
                    │  BS4 → plain    │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  Sanitizer      │
                    │  (13 injection  │
                    │   patterns)     │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  Post-process   │
                    │  • Citations    │
                    │  • Dedup check  │
                    │  • Max length   │
                    │  • Markdown     │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │   FetchResult   │
                    │   Pydantic      │
                    │   response      │
                    └─────────────────┘

Tests

pip install -e ".[all]"
pytest tests/ -v
# 138 tests passing

License

MIT — free for any use, including commercial.

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
agentfetch		agentfetch
examples		examples
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
README.md		README.md
SPECIFICATION.md		SPECIFICATION.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
test_comprehensive.py		test_comprehensive.py
test_hn_all_features.py		test_hn_all_features.py
test_hn_deep.py		test_hn_deep.py
test_hn_scrape.py		test_hn_scrape.py

Folders and files

Latest commit

History

Repository files navigation

agentfetch

Install

Standard

Cloud notebooks (Colab, Jupyter, Kaggle)

With extra integrations

What makes it different

Tools

Library API

Quickstart

LangChain

MCP (Claude Desktop, Cursor, etc.)

REST API

Python library

All integrations

Schema reference

ScrapeConfig

FetchResult

Action

SearchConfig

SearchResult

MapConfig

MapResult

ResearchConfig

ResearchSource

ResearchResult

Configuration

Environment variables

Self-host

Architecture

Tests

License

About

Topics

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`ScrapeConfig`

`FetchResult`

`Action`

`SearchConfig`

`SearchResult`

`MapConfig`

`MapResult`

`ResearchConfig`

`ResearchSource`

`ResearchResult`

Packages