feat: Add PydanticAiCrawler with AI-powered HTML extraction by Mantisus · Pull Request #1964 · apify/crawlee-python

Mantisus · 2026-06-14T17:44:22Z

Description

Adds PydanticAiCrawler - a new HTTP crawler that parses pages with parsel and uses pydantic-ai as the layer for LLM interaction.
PydanticAiHtmlDistiller is a protocol for distillers that clean HTML and convert it to a compact format (e.g., cleaned HTML, Markdown) for an LLM.
- PydanticAiCleanHtmlDistiller removes comments, noisy attributes, and scripts, returning a compact HTML version.
- PydanticAiSkeletonDistiller extends PydanticAiCleanHtmlDistiller by truncating text and collapsing repeated siblings.
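As a rough illustration of what distillation means here, the following is a minimal stdlib-only sketch, not the PR's implementation (which builds on parsel and an lxml cleaner). The names `CleanHtmlSketch`, `distill`, `KEPT_ATTRS`, and `max_text` are assumptions made up for this example. It drops comments, scripts, styles, and attributes outside a small allow-list, and can optionally truncate text the way a skeleton-style distiller might; collapsing repeated siblings is left out.

```python
from __future__ import annotations

from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list; the real distillers may keep a different set.
KEPT_ATTRS = {"id", "class", "href"}
SKIPPED_TAGS = {"script", "style"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class CleanHtmlSketch(HTMLParser):
    """Re-emits HTML without comments, scripts/styles, or noisy attributes."""

    def __init__(self, max_text: int | None = None) -> None:
        super().__init__(convert_charrefs=True)
        self.max_text = max_text
        self.parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value)}"'
            for name, value in attrs
            if name in KEPT_ATTRS and value
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or self._skip_depth:
            return
        if self.max_text is not None and len(text) > self.max_text:
            text = text[: self.max_text] + "..."
        self.parts.append(escape(text, quote=False))

    # handle_comment is not overridden, so comments are silently dropped.


def distill(html: str, max_text: int | None = None) -> str:
    parser = CleanHtmlSketch(max_text)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling `distill(page_html)` gives the cleaned variant, while `distill(page_html, max_text=40)` approximates the skeleton idea of keeping structure but shortening text.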
PydanticAiHtmlExtractor is a protocol for extractors that turn a page into structured data using a distiller and an LLM.
- PydanticAiDirectExtractor sends the distilled page to an LLM together with a Pydantic schema describing the target data and returns the validated result.
- PydanticAiSelectorExtractor asks the LLM for CSS selectors once and caches them in a KeyValueStore, so later pages are extracted without an LLM call.
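The selector-caching idea can be sketched without any LLM or Crawlee dependency. In this hypothetical example, `SelectorPlanCache`, `fake_llm`, and the plan format (field name mapped to a CSS selector) are assumptions for illustration only, and a plain dict stands in for the KeyValueStore. It shows the one property that matters: the model is asked once per key, even when pages arrive concurrently.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Maps a field name to a CSS selector, e.g. {"title": "h1"}.
SelectorPlan = dict[str, str]


class SelectorPlanCache:
    """Asks for a selector plan once per key and reuses it afterwards."""

    def __init__(self, ask_llm: Callable[[str], Awaitable[SelectorPlan]]) -> None:
        self._ask_llm = ask_llm  # hypothetical stand-in for the LLM call
        self._plans: dict[str, SelectorPlan] = {}  # stand-in for a KeyValueStore
        self._lock = asyncio.Lock()
        self.llm_calls = 0

    async def get_plan(self, key: str, distilled_html: str) -> SelectorPlan:
        # The lock prevents duplicate LLM calls when pages arrive concurrently.
        async with self._lock:
            if key not in self._plans:
                self.llm_calls += 1
                self._plans[key] = await self._ask_llm(distilled_html)
            return self._plans[key]


async def fake_llm(distilled_html: str) -> SelectorPlan:
    # A real implementation would send the page skeleton to a model here.
    return {"title": "h1", "price": "span.price"}


async def main() -> tuple[SelectorPlan, SelectorPlan, int]:
    cache = SelectorPlanCache(fake_llm)
    first, second = await asyncio.gather(
        cache.get_plan("product-page", "<h1>A</h1>"),
        cache.get_plan("product-page", "<h1>B</h1>"),
    )
    return first, second, cache.llm_calls


if __name__ == "__main__":
    print(asyncio.run(main()))
```

A single coarse lock keeps the sketch short; a per-key lock would avoid serializing unrelated page types.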

Issues

Closes: Add support for AI/LLM-based HTML parsing (selectors) #1593

Testing

Added new unit tests for PydanticAiCrawler, PydanticAiCleanHtmlDistiller, PydanticAiSkeletonDistiller, PydanticAiDirectExtractor, and PydanticAiSelectorExtractor.

Copilot

Pull request overview

This PR introduces an experimental AiCrawler (HTTP-based, Parsel-backed) plus a small AI-extraction subsystem that can either (a) directly extract structured data via an LLM or (b) learn & cache CSS selectors via an LLM and reuse them on later pages to avoid repeated model calls. This adds a native AI/LLM extraction path for HTTP crawlers (issue #1593) and integrates it into the public crawlee.crawlers API and docs.

Changes:

- Add AiCrawler + AiCrawlingContext (context.extract(...)) and new AI distiller/extractor abstractions (AiCleanHtmlDistiller, AiSkeletonDistiller, AiDirectExtractor, AiSelectorExtractor, AiUsageStats).
- Add new optional dependency extra ai (parsel + lxml clean + pydantic-ai-slim[openai]) and include it in the all extra.
- Add extensive unit tests and new documentation/guide + runnable code examples for common AI crawler setups.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| uv.lock | Adds locked deps for the new `ai` optional dependency group (and includes in `all`). |
| pyproject.toml | Defines `ai` extra and adds it to `all`. |
| src/crawlee/crawlers/__init__.py | Re-exports AI crawler/distiller/extractor APIs behind optional imports. |
| src/crawlee/crawlers/_ai/__init__.py | AI module public surface with optional-import handling. |
| src/crawlee/crawlers/_ai/_ai_crawler.py | Implements `AiCrawler` wiring + experimental warning + context pipeline integration. |
| src/crawlee/crawlers/_ai/_ai_crawling_context.py | Adds `AiCrawlingContext` with `extract` helper and shared usage stats. |
| src/crawlee/crawlers/_ai/_base_distiller.py | Base distiller + JSON-script protect/unprotect helpers. |
| src/crawlee/crawlers/_ai/_base_extractor.py | Base extractor: model resolution, instruction composition, usage accumulation, scope helpers. |
| src/crawlee/crawlers/_ai/_clean_html_distiller.py | Clean/distill HTML for direct LLM extraction (size caps, attr filtering, JSON handling). |
| src/crawlee/crawlers/_ai/_direct_extractor.py | Direct extraction strategy using pydantic-ai output validation + usage tracking. |
| src/crawlee/crawlers/_ai/_prompts.py | Shared prompt instructions/notes and truncation marker constants. |
| src/crawlee/crawlers/_ai/_selector_extractor.py | Selector-learning extractor with caching, persistence, validation, retries, and fallback support. |
| src/crawlee/crawlers/_ai/_skeleton_distiller.py | Skeleton distiller for selector generation (text truncation + sibling collapsing + max-size tightening). |
| src/crawlee/crawlers/_ai/_types.py | Protocols for distillers/extractors and `AiUsageStats`. |
| src/crawlee/crawlers/_ai/_utils.py | Utility to build a default `lxml_html_clean.Cleaner` for distillers. |
| tests/unit/crawlers/_ai/test_ai_crawler.py | Unit tests for `AiCrawler` behavior and context/extractor forwarding. |
| tests/unit/crawlers/_ai/test_clean_html_distiller.py | Unit tests for `AiCleanHtmlDistiller` reduction, truncation, and size enforcement. |
| tests/unit/crawlers/_ai/test_direct_extractor.py | Unit tests for `AiDirectExtractor` prompt composition, scoping, retries, and usage. |
| tests/unit/crawlers/_ai/test_selector_extractor.py | Unit tests for selector caching, concurrency, invalid plans/data retries, fallback, persistence. |
| tests/unit/crawlers/_ai/test_skeleton_distiller.py | Unit tests for skeleton truncation, sibling collapsing, and oversize handling. |
| docs/guides/architecture_overview.mdx | Updates architecture diagrams/text to include `AiCrawler` + `AiCrawlingContext`. |
| docs/guides/ai_crawler.mdx | New user guide for installing and using `AiCrawler`, extractors, distillers, usage limits. |
| docs/guides/code_examples/ai_crawler/basic_example.py | Example: basic `AiCrawler` usage. |
| docs/guides/code_examples/ai_crawler/additional_instructions_example.py | Example: per-call `additional_instructions`. |
| docs/guides/code_examples/ai_crawler/custom_distiller_example.py | Example: custom Markdown distiller. |
| docs/guides/code_examples/ai_crawler/selector_extractor_example.py | Example: selector extractor + direct fallback. |
| docs/guides/code_examples/ai_crawler/usage_limit_example.py | Example: per-run usage limits + cumulative token budget stopping. |

B4nan · 2026-06-23T07:27:56Z

I kinda hate the name AiCrawler 🙃

vdusek · 2026-06-23T10:07:23Z

> I kinda hate the name AiCrawler 🙃

Why? Is it too generic / "buzzword-y"?

If so, maybe we could consider something like PydanticAiCrawler + PydanticAiCrawlingContext. Since this is built on top of Pydantic and PydanticAI. Which might make sense, because it also highlights that selectors can be defined using Pydantic models (similar to ParselCrawler, BeautifulSoupCrawler, ...).

B4nan · 2026-06-23T10:16:26Z

Yes, exactly, I would rather name it this way so it's clear what it does. AiCrawler sounds like a product name, not a library feature to me.

Mantisus · 2026-06-23T11:08:21Z

> I kinda hate the name AiCrawler 🙃

Yeah, me too 😄. I chose it just as a "hyped" marketing name

> If so, maybe we could consider something like PydanticAiCrawler + PydanticAiCrawlingContext.

I agree.

vdusek

Looks good! A few comments...

Pijukatel · 2026-06-24T08:18:47Z

Hi, I started reviewing by using the Crawler. It worked fine on the doc examples, so I tried something slightly harder, and I got stuck on something trivial, but hard to debug.

The default model settings ran out of tokens, but PydanticAI does not expose this at all. It hides it behind a model validation error, which is only the consequence of running out of tokens. I had to go deep inside in debug mode to even figure this out.

Example settings that should produce this error even on example code:

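    from pydantic_ai.models.anthropic import AnthropicModel
    from pydantic_ai.providers.anthropic import AnthropicProvider
    from pydantic_ai.settings import ModelSettings

    model = AnthropicModel(
        'claude-opus-4-8',
        provider=AnthropicProvider(api_key='...'),
        # An obviously small output budget, chosen to force truncation.
        settings=ModelSettings(max_tokens=100),
    )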

Having this in the example Crawler, it logs the following error:

[AiCrawler] WARN Retrying request to https://crawlee.dev/ due to: Exceeded maximum output retries (1).
  File "/home/pijukatel/repos/crawlee-python/ai/.venv/lib/python3.14/site-packages/pydantic_ai/_output.py", line 947, in validate
    return self.validator.validate_python(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        data or {}, allow_partial=pyd_allow_partial, context=validation_context,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )

This error does not have the root cause anywhere in the exception chain, so it is very misleading. I see this is more of a PydanticAI problem, but still, I think we should solve this as this kind of error will probably be common, and users will have no clue why it happened based on the useless Pydantic validation error.

Here is the debug that uncovers where the root cause is hiding, but PydanticAI never exposes this:

Maybe there could be a debug section in the guide that focuses on model responses like that. Also adding some handling around PydanticAI to surface these errors would be nice.

Mantisus · 2026-06-24T14:56:58Z

> This error does not have the root cause anywhere in the exception chain, so it is very misleading. I see this is more of a PydanticAI problem, but still, I think we should solve this as this kind of error will probably be common, and users will have no clue why it happened based on the useless Pydantic validation error.
> Maybe there could be a debug section in the guide that focuses on model responses like that. Also adding some handling around PydanticAI to surface these errors would be nice.

Great idea. I'll add a section on debugging.

If you find this helpful for using the package, use capture_run_messages in situations like this one

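from pydantic_ai import capture_run_messages

# Inside a request handler, where `context` is the crawling context.
with capture_run_messages() as messages:
    try:
        article = await context.extract(Article)
    finally:
        # Runs whether or not extraction failed; any exception still propagates.
        print('Messages from the model run:')
        for message in messages:
            print(message)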

This allows you to view interactions with the model.

Pijukatel · 2026-06-25T08:21:04Z

> If you find this helpful for using the package, use capture_run_messages in situations like this one

Why not enhance some errors out of the box? Maybe something like:

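from pydantic_ai.exceptions import UnexpectedModelBehavior


class ExtractorModelError(UnexpectedModelBehavior):
    """UnexpectedModelBehavior that also reports the model's finish reasons."""

    def __init__(self, *args, finish_reasons, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.finish_reasons = finish_reasons

    def __str__(self) -> str:
        return f"{super().__str__()}, Model finish reasons: {self.finish_reasons}"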

Wrapping this either in extractors or in extract that is defined in AiCrawler

        with capture_run_messages() as model_messages:
            try:
                ...  # depends on where we plug it
            except UnexpectedModelBehavior as e:
                raise ExtractorModelError(
                    message=e.message,
                    # Collect the finish reason of every model response in the run.
                    finish_reasons={
                        model_message.finish_reason
                        for model_message in model_messages
                        if isinstance(model_message, ModelResponse)
                    },
                ) from e

Then the error would contain an actually useful hint:

[AiCrawler] WARN Retrying request to https://crawlee.dev/ due to: Exceeded maximum output retries (1), Model finish reasons: {'length'}. File "... )
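To check how such a wrapper would render without calling a real model, here is a stdlib-only sketch of the same pattern. `StubModelBehaviorError` and `StubModelResponse` are hypothetical stand-ins for pydantic-ai's `UnexpectedModelBehavior` and `ModelResponse`, and `extract_with_hint` is an invented helper name; the sketch only demonstrates the exception chaining and message formatting, not the library integration.

```python
from __future__ import annotations

from dataclasses import dataclass


class StubModelBehaviorError(Exception):
    """Hypothetical stand-in for pydantic-ai's UnexpectedModelBehavior."""


@dataclass
class StubModelResponse:
    """Hypothetical stand-in for a model response with a finish reason."""

    finish_reason: str | None


class ExtractorModelError(StubModelBehaviorError):
    def __init__(self, message: str, *, finish_reasons: set[str]) -> None:
        super().__init__(message)
        self.finish_reasons = finish_reasons

    def __str__(self) -> str:
        return f"{super().__str__()}, Model finish reasons: {self.finish_reasons}"


def extract_with_hint(run, messages: list) -> object:
    """Calls `run()` and re-raises model errors with the collected finish reasons."""
    try:
        return run()
    except StubModelBehaviorError as e:
        reasons = {
            m.finish_reason
            for m in messages
            if isinstance(m, StubModelResponse) and m.finish_reason
        }
        # `from e` keeps the original error reachable via __cause__.
        raise ExtractorModelError(str(e), finish_reasons=reasons) from e


def failing_run():
    raise StubModelBehaviorError("Exceeded maximum output retries (1)")


messages = [StubModelResponse(finish_reason="length")]
try:
    extract_with_hint(failing_run, messages)
except ExtractorModelError as error:
    rendered = str(error)
    cause = error.__cause__
```

Here `rendered` carries the finish-reason hint in the same shape as the log line above, and `cause` still points at the original error, so nothing is lost from the exception chain.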

Mantisus · 2026-06-25T14:01:30Z

> Why not enhance some errors out of the box? Maybe something like

I'd prefer to avoid that. UnexpectedModelBehavior is also raised when the model generates data that fails validation and the retries are already exhausted. There are likely other situations as well that we haven't encountered yet.

pydantic_ai.exceptions.UnexpectedModelBehavior: Exceeded maximum output retries (1)

I also believe that, depending on the model provider, the information provided by ModelResponse may vary and further confuse the user.

Pijukatel

Nice. Just some minor suggestions.

vdusek

A few more comments

vdusek · 2026-06-28T10:25:59Z

Hi @szaganek, could we ask you for a final doc style review? Thank you.

janbuchar

I gave this just a brief read, but I like it. LGTM!

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

Co-authored-by: Edyta <142720610+szaganek@users.noreply.github.com>

vdusek

LGTM, thanks 🚀

Mantisus self-assigned this Jun 15, 2026

Mantisus marked this pull request as ready for review June 17, 2026 19:27

Mantisus requested review from Pijukatel, janbuchar and vdusek June 17, 2026 19:27

vdusek requested a review from Copilot June 23, 2026 06:40

Copilot started reviewing on behalf of vdusek June 23, 2026 06:40

Copilot AI reviewed Jun 23, 2026

Comment thread src/crawlee/crawlers/_ai/_selector_extractor.py Outdated

Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py Outdated

Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py Outdated

vdusek reviewed Jun 23, 2026

Mantisus changed the title ~~feat: Add AiCrawler with AI-powered HTML extraction~~ feat: Add PydanticAiCrawler with AI-powered HTML extraction Jun 23, 2026

Pijukatel approved these changes Jun 26, 2026

Comment thread src/crawlee/crawlers/_pydantic_ai/_clean_html_distiller.py

Comment thread src/crawlee/crawlers/_pydantic_ai/_pydantic_ai_crawler.py

Comment thread src/crawlee/crawlers/_pydantic_ai/_utils.py

vdusek reviewed Jun 27, 2026

vdusek requested a review from szaganek June 28, 2026 10:25

szaganek approved these changes Jun 29, 2026

janbuchar approved these changes Jun 29, 2026

Mantisus added 6 commits June 30, 2026 14:36

Add AiCrawler with AI-powered HTML extraction

cdc3b84

add tests

c709f2e

add docs

7a4f17d

rename

6569d3a

fix

baf1c61

update docs

d242c9d

Mantisus and others added 7 commits June 30, 2026 14:36

add debugging doc section

21cc4a1

polish

df14103

Apply suggestions from code review

0cadc32

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

bump pydantic-ai

aaa9a5e

Apply suggestions from code review

0a5a466

Co-authored-by: Edyta <142720610+szaganek@users.noreply.github.com>

update lock file

1efd9bb

fixes

08926ed

Mantisus force-pushed the llm-html-crawler branch from 064d575 to 08926ed Compare June 30, 2026 14:37

Mantisus added 2 commits June 30, 2026 14:53

update selector prrompt

6fcbf0d

new tests

65a5181

Mantisus requested a review from vdusek June 30, 2026 20:56

vdusek approved these changes Jul 1, 2026

vdusek changed the title ~~feat: Add PydanticAiCrawler with AI-powered HTML extraction~~ feat: Add PydanticAiCrawler with AI-powered HTML extraction Jul 1, 2026

vdusek merged commit ccfff76 into apify:master Jul 1, 2026
35 checks passed
