feat: Add PydanticAiCrawler with AI-powered HTML extraction#1964

Merged
vdusek merged 15 commits into apify:master from Mantisus:llm-html-crawler
Jul 1, 2026

Conversation

@Mantisus Mantisus commented Jun 14, 2026

Collaborator

Description

  • Adds PydanticAiCrawler - a new HTTP crawler that parses pages with parsel and uses pydantic-ai as the layer for LLM interaction.
  • PydanticAiHtmlDistiller is a protocol for distillers that clean HTML and convert it to a compact format (e.g., cleaned HTML, Markdown) for an LLM.
    • PydanticAiCleanHtmlDistiller removes comments, noisy attributes, and scripts, returning a compact HTML version.
    • PydanticAiSkeletonDistiller extends PydanticAiCleanHtmlDistiller by truncating text and collapsing repeated siblings.
  • PydanticAiHtmlExtractor is a protocol for extractors that turn a page into structured data using a distiller and an LLM.
    • PydanticAiDirectExtractor sends the distilled page to an LLM together with a Pydantic schema describing the target data and returns the validated result.
    • PydanticAiSelectorExtractor asks the LLM for CSS selectors once and caches them in a KeyValueStore, so later pages are extracted without an LLM call.
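For readers unfamiliar with the distiller idea, here is a minimal, stdlib-only sketch of what "clean HTML" distillation could look like. It is not the PR's implementation: `HtmlDistiller`, `CleanHtmlSketch`, the `distill` method name, and the attribute allow-list are all hypothetical names chosen for illustration, and the real distillers are described above as being built on parsel and an lxml-based cleaner.

```python
from html import escape
from html.parser import HTMLParser
from typing import Protocol


class HtmlDistiller(Protocol):
    """Hypothetical stand-in for the distiller protocol; the method name is an assumption."""

    def distill(self, html: str) -> str: ...


class CleanHtmlSketch(HTMLParser):
    """Drops comments, script/style content, and every attribute outside a small allow-list."""

    SKIPPED_TAGS = {'script', 'style'}
    KEPT_ATTRS = {'id', 'class', 'href', 'src'}  # assumed allow-list, not the real one

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = ''.join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in self.KEPT_ATTRS and value is not None
        )
        self._parts.append(f'<{tag}{kept}>')

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth:
            self._parts.append(f'</{tag}>')

    def handle_data(self, data):
        # Text inside skipped tags is ignored; surrounding whitespace is trimmed.
        if not self._skip_depth and data.strip():
            self._parts.append(escape(data.strip(), quote=False))

    # HTMLParser.handle_comment is a no-op by default, so comments are dropped.

    def distill(self, html: str) -> str:
        self.reset()
        self._parts = []
        self._skip_depth = 0
        self.feed(html)
        self.close()
        return ''.join(self._parts)


distiller: HtmlDistiller = CleanHtmlSketch()
page = (
    '<div id="main" style="color:red" onclick="x()"><!-- note -->'
    '<script>var a = 1;</script><p class="t">Hello</p></div>'
)
compact = distiller.distill(page)
```

The point of the sketch is only the shape of the contract: HTML in, a smaller string out, with the choice of what counts as noise left to the distiller.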

Issues

Testing

  • Added new unit tests for PydanticAiCrawler, PydanticAiCleanHtmlDistiller, PydanticAiSkeletonDistiller, PydanticAiDirectExtractor, and PydanticAiSelectorExtractor.

@Mantisus Mantisus self-assigned this Jun 15, 2026
@Mantisus Mantisus marked this pull request as ready for review June 17, 2026 19:27
@vdusek vdusek requested a review from Copilot June 23, 2026 06:40

Copilot AI left a comment

Contributor

Pull request overview

This PR introduces an experimental AiCrawler (HTTP-based, Parsel-backed) plus a small AI-extraction subsystem that can either (a) directly extract structured data via an LLM or (b) learn & cache CSS selectors via an LLM and reuse them on later pages to avoid repeated model calls. This adds a native AI/LLM extraction path for HTTP crawlers (issue #1593) and integrates it into the public crawlee.crawlers API and docs.

Changes:

  • Add AiCrawler + AiCrawlingContext (context.extract(...)) and new AI distiller/extractor abstractions (AiCleanHtmlDistiller, AiSkeletonDistiller, AiDirectExtractor, AiSelectorExtractor, AiUsageStats).
  • Add new optional dependency extra ai (parsel + lxml clean + pydantic-ai-slim[openai]) and include it in the all extra.
  • Add extensive unit tests and new documentation/guide + runnable code examples for common AI crawler setups.
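The "learn and cache selectors" strategy mentioned in the overview can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: `CachedSelectorSketch`, `fake_llm`, and the cache key are hypothetical, a plain dict takes the place of a KeyValueStore, and ElementTree paths take the place of CSS selectors so that the example needs only the standard library.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional

# Maps a field name to an ElementTree path; stands in for LLM-generated CSS selectors.
SelectorPlan = Dict[str, str]


class CachedSelectorSketch:
    """Asks the 'LLM' for a selector plan once per cache key and reuses it afterwards."""

    def __init__(self, ask_llm: Callable[[str, List[str]], SelectorPlan]) -> None:
        self._ask_llm = ask_llm
        self._cache: Dict[str, SelectorPlan] = {}  # plain dict in place of a KeyValueStore
        self.llm_calls = 0

    def extract(self, cache_key: str, page: str, fields: List[str]) -> Dict[str, Optional[str]]:
        plan = self._cache.get(cache_key)
        if plan is None:
            # Cache miss: this is the only place where a model call would happen.
            plan = self._ask_llm(page, fields)
            self.llm_calls += 1
            self._cache[cache_key] = plan
        root = ET.fromstring(page)
        return {field: root.findtext(plan[field]) for field in fields}


def fake_llm(page: str, fields: List[str]) -> SelectorPlan:
    """Stub returning fixed paths; a real model call would go here."""
    return {'title': './/h1', 'price': ".//span[@class='price']"}


extractor = CachedSelectorSketch(fake_llm)
first = extractor.extract(
    'product', "<html><body><h1>Lamp</h1><span class='price'>10</span></body></html>", ['title', 'price']
)
second = extractor.extract(
    'product', "<html><body><h1>Desk</h1><span class='price'>99</span></body></html>", ['title', 'price']
)
```

The second page is extracted without another call to the stub, which is the cost saving the overview describes. The sketch leaves out the validation, retries, and fallback that the file summary attributes to the real selector extractor.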

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
uv.lock Adds locked deps for the new ai optional dependency group (and includes in all).
pyproject.toml Defines ai extra and adds it to all.
src/crawlee/crawlers/__init__.py Re-exports AI crawler/distiller/extractor APIs behind optional imports.
src/crawlee/crawlers/_ai/__init__.py AI module public surface with optional-import handling.
src/crawlee/crawlers/_ai/_ai_crawler.py Implements AiCrawler wiring + experimental warning + context pipeline integration.
src/crawlee/crawlers/_ai/_ai_crawling_context.py Adds AiCrawlingContext with extract helper and shared usage stats.
src/crawlee/crawlers/_ai/_base_distiller.py Base distiller + JSON-script protect/unprotect helpers.
src/crawlee/crawlers/_ai/_base_extractor.py Base extractor: model resolution, instruction composition, usage accumulation, scope helpers.
src/crawlee/crawlers/_ai/_clean_html_distiller.py Clean/distill HTML for direct LLM extraction (size caps, attr filtering, JSON handling).
src/crawlee/crawlers/_ai/_direct_extractor.py Direct extraction strategy using pydantic-ai output validation + usage tracking.
src/crawlee/crawlers/_ai/_prompts.py Shared prompt instructions/notes and truncation marker constants.
src/crawlee/crawlers/_ai/_selector_extractor.py Selector-learning extractor with caching, persistence, validation, retries, and fallback support.
src/crawlee/crawlers/_ai/_skeleton_distiller.py Skeleton distiller for selector generation (text truncation + sibling collapsing + max-size tightening).
src/crawlee/crawlers/_ai/_types.py Protocols for distillers/extractors and AiUsageStats.
src/crawlee/crawlers/_ai/_utils.py Utility to build a default lxml_html_clean.Cleaner for distillers.
tests/unit/crawlers/_ai/test_ai_crawler.py Unit tests for AiCrawler behavior and context/extractor forwarding.
tests/unit/crawlers/_ai/test_clean_html_distiller.py Unit tests for AiCleanHtmlDistiller reduction, truncation, and size enforcement.
tests/unit/crawlers/_ai/test_direct_extractor.py Unit tests for AiDirectExtractor prompt composition, scoping, retries, and usage.
tests/unit/crawlers/_ai/test_selector_extractor.py Unit tests for selector caching, concurrency, invalid plans/data retries, fallback, persistence.
tests/unit/crawlers/_ai/test_skeleton_distiller.py Unit tests for skeleton truncation, sibling collapsing, and oversize handling.
docs/guides/architecture_overview.mdx Updates architecture diagrams/text to include AiCrawler + AiCrawlingContext.
docs/guides/ai_crawler.mdx New user guide for installing and using AiCrawler, extractors, distillers, usage limits.
docs/guides/code_examples/ai_crawler/basic_example.py Example: basic AiCrawler usage.
docs/guides/code_examples/ai_crawler/additional_instructions_example.py Example: per-call additional_instructions.
docs/guides/code_examples/ai_crawler/custom_distiller_example.py Example: custom Markdown distiller.
docs/guides/code_examples/ai_crawler/selector_extractor_example.py Example: selector extractor + direct fallback.
docs/guides/code_examples/ai_crawler/usage_limit_example.py Example: per-run usage limits + cumulative token budget stopping.

Comment thread src/crawlee/crawlers/_ai/_selector_extractor.py Outdated
Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py Outdated
Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py Outdated
@B4nan

B4nan commented Jun 23, 2026

Member

I kinda hate the name AiCrawler 🙃

@vdusek

vdusek commented Jun 23, 2026

Collaborator

I kinda hate the name AiCrawler 🙃

Why? Is it too generic / "buzzword-y"?

If so, maybe we could consider something like PydanticAiCrawler + PydanticAiCrawlingContext. Since this is built on top of Pydantic and PydanticAI. Which might make sense, because it also highlights that selectors can be defined using Pydantic models (similar to ParselCrawler, BeautifulSoupCrawler, ...).

@B4nan

B4nan commented Jun 23, 2026

Member

Yes, exactly, I would rather name it this way so it's clear what it does. AiCrawler sounds like a product name, not a library feature to me.

@Mantisus

Collaborator Author

I kinda hate the name AiCrawler 🙃

Yeah, me too 😄. I chose it just as a "hyped" marketing name

If so, maybe we could consider something like PydanticAiCrawler + PydanticAiCrawlingContext.

I agree.

@vdusek vdusek left a comment

Collaborator

Looks good! A few comments...

Comment thread src/crawlee/crawlers/_pydantic_ai/_skeleton_distiller.py
Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py
Comment thread docs/guides/ai_crawler.mdx Outdated
Comment thread docs/guides/code_examples/ai_crawler/selector_extractor_example.py Outdated
Comment thread docs/guides/ai_crawler.mdx Outdated
@Mantisus Mantisus changed the title from "feat: Add AiCrawler with AI-powered HTML extraction" to "feat: Add PydanticAiCrawler with AI-powered HTML extraction" Jun 23, 2026
@Pijukatel

Collaborator

Hi, I started reviewing by using the Crawler. It worked fine on the doc examples, so I tried something slightly harder, and I got stuck on something trivial, but hard to debug.

The default model settings ran out of tokens, but PydanticAI does not expose this at all. It hides it behind a model validation error, which is only the consequence of running out of tokens. I had to go deep inside in debug mode to even figure this out.

Example settings that should produce this error even on example code:

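    from pydantic_ai.models.anthropic import AnthropicModel
    from pydantic_ai.providers.anthropic import AnthropicProvider
    from pydantic_ai.settings import ModelSettings

    model = AnthropicModel(
        'claude-opus-4-8',
        provider=AnthropicProvider(api_key='...'),
        settings=ModelSettings(max_tokens=100),  # Some obviously small number
    )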

Having this in the example Crawler, it logs the following error:

    [AiCrawler] WARN Retrying request to https://crawlee.dev/ due to: Exceeded maximum output retries (1).
      File "/home/pijukatel/repos/crawlee-python/ai/.venv/lib/python3.14/site-packages/pydantic_ai/_output.py", line 947, in validate
        return self.validator.validate_python(
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
            data or {}, allow_partial=pyd_allow_partial, context=validation_context
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        )

This error does not have the root cause anywhere in the exception chain, so it is very misleading. I see this is more of a PydanticAI problem, but still, I think we should solve this as this kind of error will probably be common, and users will have no clue why it happened based on the useless Pydantic validation error.

Here is the debug that uncovers where the root cause is hiding, but PydanticAI never exposes this:
[image: debugger screenshot]

Maybe there could be a debug section in the guide that focuses on model responses like that. Also adding some handling around PydanticAI to surface these errors would be nice.
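To make the failure mode above concrete, here is a minimal stand-alone sketch of how an output cap can surface as a validation-style error. It does not use pydantic-ai: `FakeResponse`, `fake_model`, and `parse` are hypothetical stand-ins, the cap is counted in characters rather than tokens, and the `'stop'` / `'length'` finish-reason values are assumed for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in for a model response; not a pydantic-ai class."""

    text: str
    finish_reason: str  # 'stop' when complete, 'length' when the cap cut the output off


def fake_model(full_output: str, max_chars: int) -> FakeResponse:
    """Simulates an output cap by cutting the text at max_chars characters."""
    if len(full_output) > max_chars:
        return FakeResponse(full_output[:max_chars], 'length')
    return FakeResponse(full_output, 'stop')


def parse(response: FakeResponse) -> dict:
    try:
        return json.loads(response.text)
    except json.JSONDecodeError as exc:
        # Reporting the finish reason next to the parse failure points at the real cause.
        raise ValueError(f'Invalid model output (finish_reason={response.finish_reason!r})') from exc


full = json.dumps({'title': 'Crawlee', 'summary': 'A web scraping and browser automation library.'})

ok = parse(fake_model(full, max_chars=1000))

try:
    parse(fake_model(full, max_chars=20))
except ValueError as exc:
    error_text = str(exc)
```

Without the finish reason in the message, the second call would only report malformed output, which is the misleading symptom described in this comment.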

@Mantisus

Collaborator Author

This error does not have the root cause anywhere in the exception chain, so it is very misleading. I see this is more of a PydanticAI problem, but still, I think we should solve this as this kind of error will probably be common, and users will have no clue why it happened based on the useless Pydantic validation error.
Maybe there could be a debug section in the guide that focuses on model responses like that. Also adding some handling around PydanticAI to surface these errors would be nice.

Great idea. I'll add a section on debugging.

If you find this helpful for using the package, use capture_run_messages in situations like this one

from pydantic_ai import capture_run_messages

with capture_run_messages() as messages:
    try:
        article = await context.extract(Article)
    finally:
        # Runs on success and on failure, so the exchange is visible either way.
        print('Messages from the model run:')
        for message in messages:
            print(message)

This allows you to view interactions with the model.

@Pijukatel

Pijukatel commented Jun 25, 2026

Collaborator

If you find this helpful for using the package, use capture_run_messages in situations like this one

Why not enhance some errors out of the box? Maybe something like:

from pydantic_ai.exceptions import UnexpectedModelBehavior

class ExtractorModelError(UnexpectedModelBehavior):
    def __init__(self, *args, finish_reasons, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        # Finish reasons collected from the model responses of the failed run.
        self.finish_reasons = finish_reasons

    def __str__(self) -> str:
        return f"{super().__str__()}, Model finish reasons: {self.finish_reasons}"

This could wrap the model call either in the extractors or in the extract method defined in AiCrawler:

        with capture_run_messages() as model_messages:
            try:
                ...  # depends on where we plug it
            except UnexpectedModelBehavior as e:
                raise ExtractorModelError(
                    message=e.message,
                    finish_reasons={
                        model_message.finish_reason
                        for model_message in model_messages
                        if isinstance(model_message, ModelResponse)
                    },
                ) from e

Then the error would contain an actually useful hint:

[AiCrawler] WARN Retrying request to https://crawlee.dev/ due to: Exceeded maximum output retries (1), Model finish reasons: {'length'}. File "... )
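A runnable variant of the proposal, using stand-ins so that it works without pydantic-ai installed, might look like the sketch below. `UnexpectedModelBehaviorStub`, `ExtractorModelErrorSketch`, and `run_extraction` are hypothetical names; the stub only mimics the `message` attribute used in the snippet above and makes no claim about the rest of the real exception class.

```python
from typing import List, Set


class UnexpectedModelBehaviorStub(Exception):
    """Stand-in for pydantic_ai.exceptions.UnexpectedModelBehavior."""

    def __init__(self, message: str) -> None:
        super().__init__(message)
        self.message = message


class ExtractorModelErrorSketch(UnexpectedModelBehaviorStub):
    """Adds the finish reasons seen during the run to the error text."""

    def __init__(self, message: str, finish_reasons: Set[str]) -> None:
        super().__init__(message)
        self.finish_reasons = finish_reasons

    def __str__(self) -> str:
        # Sorted so the rendered message is stable regardless of set ordering.
        return f'{self.message}, Model finish reasons: {sorted(self.finish_reasons)}'


def run_extraction(finish_reasons_seen: List[str]) -> None:
    """Simulates a failed model run and re-raises it with the extra hint."""
    try:
        raise UnexpectedModelBehaviorStub('Exceeded maximum output retries (1)')
    except UnexpectedModelBehaviorStub as exc:
        raise ExtractorModelErrorSketch(exc.message, set(finish_reasons_seen)) from exc


try:
    run_extraction(['length', 'length'])
except ExtractorModelErrorSketch as exc:
    wrapped = exc
```

Because the wrapper subclasses the original exception type and is raised with `from`, existing `except` clauses for the base type keep working and the original error stays reachable through `__cause__`.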

@Mantisus

Mantisus commented Jun 25, 2026

Collaborator Author

Why not enhance some errors out of the box? Maybe something like

I'd prefer to avoid that. Error UnexpectedModelBehavior will also occur if the model generates data that fails validation and the number of retries has already been exhausted. There are likely other situations as well that we haven't encountered yet.

pydantic_ai.exceptions.UnexpectedModelBehavior: Exceeded maximum output retries (1)

I also believe that, depending on the model provider, the information provided by ModelResponse may vary and further confuse the user.

@Pijukatel Pijukatel left a comment

Collaborator

Nice. Just some minor suggestions.

Comment thread src/crawlee/crawlers/_pydantic_ai/_clean_html_distiller.py
Comment thread src/crawlee/crawlers/_pydantic_ai/_pydantic_ai_crawler.py
Comment thread src/crawlee/crawlers/_pydantic_ai/_utils.py

@vdusek vdusek left a comment

Collaborator

A few more comments

Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py Outdated
Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py Outdated
Comment thread src/crawlee/crawlers/_pydantic_ai/_pydantic_ai_crawler.py
Comment thread src/crawlee/crawlers/_pydantic_ai/_clean_html_distiller.py
Comment thread src/crawlee/crawlers/_pydantic_ai/__init__.py
Comment thread src/crawlee/crawlers/_pydantic_ai/_base_extractor.py
Comment thread src/crawlee/crawlers/_pydantic_ai/_pydantic_ai_crawler.py
Comment thread docs/guides/pydantic_ai_crawler.mdx Outdated
Comment thread docs/guides/pydantic_ai_crawler.mdx Outdated
Comment thread docs/guides/pydantic_ai_crawler.mdx Outdated
@vdusek vdusek requested a review from szaganek June 28, 2026 10:25
@vdusek

vdusek commented Jun 28, 2026

Collaborator

Hi @szaganek, could we ask you for a final doc style review? Thank you.

Comment thread docs/guides/pydantic_ai_crawler.mdx Outdated
Comment thread docs/guides/pydantic_ai_crawler.mdx Outdated
Comment thread docs/guides/pydantic_ai_crawler.mdx Outdated
Comment thread docs/guides/pydantic_ai_crawler.mdx Outdated
Comment thread docs/guides/pydantic_ai_crawler.mdx Outdated

@janbuchar janbuchar left a comment

Collaborator

I gave this just a brief read, but I like it. LGTM!

Mantisus and others added 7 commits June 30, 2026 14:36
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Co-authored-by: Edyta <142720610+szaganek@users.noreply.github.com>
@Mantisus Mantisus requested a review from vdusek June 30, 2026 20:56

@vdusek vdusek left a comment

Collaborator

LGTM, thanks 🚀

@vdusek vdusek changed the title feat: Add PydanticAiCrawler with AI-powered HTML extraction feat: Add PydanticAiCrawler with AI-powered HTML extraction Jul 1, 2026
@vdusek vdusek merged commit ccfff76 into apify:master Jul 1, 2026
35 checks passed

Development

Successfully merging this pull request may close these issues.

Add support for AI/LLM-based HTML parsing (selectors)

8 participants