_apply_prompt_caching exceeds the 4 cache_control breakpoint limit in multi-turn conversations · Issue #2448 · bytedance/deer-flow · GitHub

_apply_prompt_caching exceeds the 4 cache_control breakpoint limit in multi-turn conversations #2448

@newhwa

Bug Description

ClaudeChatModel._apply_prompt_caching() injects cache_control: {"type": "ephemeral"} into every text block in the system prompt, every content block in the last N messages (prompt_cache_size, default 3), and the last tool definition. In multi-turn conversations with structured content blocks, this easily exceeds 4 total breakpoints — the hard limit enforced by both the Anthropic API and AWS Bedrock.

This causes a 400 Bad Request:

A maximum of 4 blocks with cache_control may be provided. Found 5.

When streaming, the 400 produces zero SSE chunks, which surfaces as:

LLM request failed: No generations found in stream.

Root Cause

In claude_provider.py, _apply_prompt_caching() adds cache_control to:

  1. Every text block in system (could be 1–2+ blocks)
  2. Every content block in the last prompt_cache_size messages (each message can have multiple blocks — text, tool_use, tool_result, etc.)
  3. The last tool definition

Example in a 5-message conversation with 1 system block, 3 recent messages (2 blocks each), and 1 tool:

  • system: 1 breakpoint
  • messages: 6 breakpoints (3 messages × 2 blocks)
  • tools: 1 breakpoint
  • Total: 8 — well over the limit of 4
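
The arithmetic above can be reproduced in isolation. The following is a reconstruction of the pattern described, not the literal claude_provider.py source; `prompt_cache_size` is supplied via a stub object standing in for the model:

```python
from types import SimpleNamespace

def apply_caching_unbounded(model, payload: dict) -> None:
    """Reconstruction of the buggy behavior: mark every eligible block."""
    cc = {"type": "ephemeral"}
    for block in payload.get("system") or []:      # (1) every system text block
        if isinstance(block, dict) and block.get("type") == "text":
            block["cache_control"] = cc
    for msg in payload["messages"][-model.prompt_cache_size:]:
        for block in msg.get("content", []):       # (2) every block in last N messages
            if isinstance(block, dict):
                block["cache_control"] = cc
    tools = payload.get("tools", [])
    if tools:
        tools[-1]["cache_control"] = cc            # (3) the last tool definition

# The scenario from the example: 1 system block, 3 messages x 2 blocks, 1 tool.
model = SimpleNamespace(prompt_cache_size=3)
payload = {
    "system": [{"type": "text", "text": "sys"}],
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": "a"}, {"type": "text", "text": "b"}]}
        for _ in range(3)
    ],
    "tools": [{"name": "search"}],
}
apply_caching_unbounded(model, payload)

marked = sum(
    "cache_control" in b
    for b in payload["system"]
    + [blk for m in payload["messages"] for blk in m["content"]]
    + payload["tools"]
)
print(marked)  # 8 — double the API's limit of 4
```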

Reproduction

  1. Configure DeerFlow with ClaudeChatModel and enable_prompt_caching: true
  2. Start a multi-turn conversation (3+ turns with tool usage)
  3. The 2nd or 3rd LLM call fails with No generations found in stream
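
A minimal configuration for step 1 might look like the fragment below. Only `enable_prompt_caching` and `prompt_cache_size` come from this report; the surrounding key names are illustrative and should be checked against DeerFlow's conf.yaml schema:

```yaml
# Illustrative fragment — verify key names against your conf.yaml.
BASIC_MODEL:
  model: claude-sonnet-4-20250514
  enable_prompt_caching: true
  prompt_cache_size: 3   # default; number of recent messages marked for caching
```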

Suggested Fix

Instead of marking every block, use a budget of 4 breakpoints and place them strategically. The most effective placement is on the last eligible blocks, since later breakpoints cover more prefix content and yield better cache hit rates:

def _apply_prompt_caching(self, payload: dict) -> None:
    """Apply ephemeral cache_control to at most 4 strategic positions."""
    MAX_BREAKPOINTS = 4
    candidates = []  # blocks eligible to receive cache_control

    # Collect candidates in the order the API assembles the cached prefix:
    # tools, then system, then messages. Later candidates therefore cover a
    # longer prefix, so they make the most valuable breakpoints.
    tools = payload.get("tools", [])
    if tools and isinstance(tools[-1], dict):
        candidates.append(tools[-1])

    system = payload.get("system")
    if isinstance(system, list):
        for block in system:
            if isinstance(block, dict) and block.get("type") == "text":
                candidates.append(block)

    messages = payload.get("messages", [])
    cache_start = max(0, len(messages) - self.prompt_cache_size)
    for msg in messages[cache_start:]:
        if not isinstance(msg, dict):
            continue
        content = msg.get("content")
        if isinstance(content, list):
            for block in content:
                # Thinking blocks may not carry cache_control.
                if isinstance(block, dict) and block.get("type") not in (
                    "thinking",
                    "redacted_thinking",
                ):
                    candidates.append(block)
        elif isinstance(content, str) and content:
            # Convert to list form so the block can carry cache_control.
            msg["content"] = [{"type": "text", "text": content}]
            candidates.append(msg["content"][0])

    # Mark only the LAST N candidates, keeping the total within the API limit.
    for block in candidates[-MAX_BREAKPOINTS:]:
        block["cache_control"] = {"type": "ephemeral"}
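
To sanity-check the budget, the fix can be exercised on the worst-case payload from the root-cause example. The function body is condensed and repeated here so the snippet runs standalone, with `prompt_cache_size` supplied via a stub:

```python
from types import SimpleNamespace

MAX_BREAKPOINTS = 4

def apply_prompt_caching(model, payload: dict) -> None:
    """Condensed version of the budgeted fix, for standalone testing."""
    candidates = []
    tools = payload.get("tools", [])
    if tools and isinstance(tools[-1], dict):
        candidates.append(tools[-1])
    for block in payload.get("system") or []:
        if isinstance(block, dict) and block.get("type") == "text":
            candidates.append(block)
    for msg in payload["messages"][-model.prompt_cache_size:]:
        content = msg.get("content")
        if isinstance(content, list):
            candidates.extend(b for b in content if isinstance(b, dict))
        elif isinstance(content, str) and content:
            msg["content"] = [{"type": "text", "text": content}]
            candidates.append(msg["content"][0])
    for block in candidates[-MAX_BREAKPOINTS:]:
        block["cache_control"] = {"type": "ephemeral"}

# Worst case from the issue: 1 system block, 3 messages x 2 blocks, 1 tool.
model = SimpleNamespace(prompt_cache_size=3)
payload = {
    "system": [{"type": "text", "text": "You are helpful."}],
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": f"turn {i}"},
                     {"type": "text", "text": "extra"}]}
        for i in range(3)
    ],
    "tools": [{"name": "search", "input_schema": {"type": "object"}}],
}
apply_prompt_caching(model, payload)

marked = sum(
    "cache_control" in b
    for b in payload["system"]
    + [blk for m in payload["messages"] for blk in m["content"]]
    + payload["tools"]
)
print(marked)  # 4 — the unbounded behavior would have marked 8
```

With more than 4 candidates, the breakpoints land on the final message blocks, which implicitly cache the tools and system prefix as well.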

Environment

  • DeerFlow: latest main branch
  • langchain-anthropic: 1.3.4
  • anthropic SDK: 0.84.0
  • Backend: AWS Bedrock via proxy (also reproducible with direct Anthropic API)
