_apply_prompt_caching exceeds the 4 cache_control breakpoint limit in multi-turn conversations
Bug Description
ClaudeChatModel._apply_prompt_caching() injects cache_control: {"type": "ephemeral"} into every text block in the system prompt, every content block in the last N messages (prompt_cache_size, default 3), and the last tool definition. In multi-turn conversations with structured content blocks, this easily exceeds 4 total breakpoints — the hard limit enforced by both the Anthropic API and AWS Bedrock.
This causes a 400 Bad Request:
A maximum of 4 blocks with cache_control may be provided. Found 5.
When streaming, the 400 produces zero SSE chunks, which surfaces as:
LLM request failed: No generations found in stream.
Root Cause
In claude_provider.py, _apply_prompt_caching() adds cache_control to:
- Every text block in system (could be 1–2+ blocks)
- Every content block in the last prompt_cache_size messages (each message can have multiple blocks: text, tool_use, tool_result, etc.)
- The last tool definition
Example in a 5-message conversation with 1 system block, 3 recent messages (2 blocks each), and 1 tool:
- system: 1 breakpoint
- messages: 6 breakpoints (3 messages × 2 blocks)
- tools: 1 breakpoint
- Total: 8 — well over the limit of 4
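This count can be reproduced with a minimal standalone sketch of the current marking pattern (the function name and payload shape below are illustrative, not the actual DeerFlow code):

```python
def naive_apply_prompt_caching(payload: dict, prompt_cache_size: int = 3) -> int:
    """Mark every eligible block, as the current code does; return the count."""
    marked = 0
    # Every text block in system
    for block in payload.get("system", []):
        if block.get("type") == "text":
            block["cache_control"] = {"type": "ephemeral"}
            marked += 1
    # Every content block in the last prompt_cache_size messages
    for msg in payload["messages"][-prompt_cache_size:]:
        for block in msg["content"]:
            block["cache_control"] = {"type": "ephemeral"}
            marked += 1
    # The last tool definition
    tools = payload.get("tools", [])
    if tools:
        tools[-1]["cache_control"] = {"type": "ephemeral"}
        marked += 1
    return marked

payload = {
    "system": [{"type": "text", "text": "You are a helpful agent."}],
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": f"turn {i}"},
                     {"type": "text", "text": "extra block"}]}
        for i in range(5)
    ],
    "tools": [{"name": "search", "input_schema": {"type": "object"}}],
}
marked = naive_apply_prompt_caching(payload)
print(marked)  # 1 + 3*2 + 1 = 8, double the API's limit of 4
```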
Reproduction
- Configure DeerFlow with ClaudeChatModel and enable_prompt_caching: true
- Start a multi-turn conversation (3+ turns with tool usage)
- The 2nd or 3rd LLM call fails with No generations found in stream
Suggested Fix
Instead of marking every block, use a budget of 4 breakpoints and place them strategically. The most effective placement is on the last eligible blocks, since later breakpoints cover more prefix content and yield better cache hit rates:
def _apply_prompt_caching(self, payload: dict) -> None:
    """Apply ephemeral cache_control to up to 4 strategic positions."""
    MAX_BREAKPOINTS = 4
    candidates = []  # list of dicts that could receive cache_control

    # Collect candidate blocks in document order
    system = payload.get("system")
    if isinstance(system, list):
        for block in system:
            if isinstance(block, dict) and block.get("type") == "text":
                candidates.append(block)

    messages = payload.get("messages", [])
    cache_start = max(0, len(messages) - self.prompt_cache_size)
    for i in range(cache_start, len(messages)):
        msg = messages[i]
        if not isinstance(msg, dict):
            continue
        content = msg.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict):
                    candidates.append(block)
        elif isinstance(content, str) and content:
            # Convert to list format for cache_control support
            msg["content"] = [{"type": "text", "text": content}]
            candidates.append(msg["content"][0])

    tools = payload.get("tools", [])
    if tools and isinstance(tools[-1], dict):
        candidates.append(tools[-1])

    # Apply cache_control to the LAST N candidates only
    for block in candidates[-MAX_BREAKPOINTS:]:
        block["cache_control"] = {"type": "ephemeral"}
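A standalone sanity check for the fix (the function is repeated here so the snippet runs on its own; ClaudeChatModel is stubbed with SimpleNamespace and the payload is illustrative). With the same 8-candidate shape from the example above, the total stays at the budget:

```python
from types import SimpleNamespace

def _apply_prompt_caching(self, payload: dict) -> None:
    """Apply ephemeral cache_control to up to 4 strategic positions."""
    MAX_BREAKPOINTS = 4
    candidates = []
    system = payload.get("system")
    if isinstance(system, list):
        for block in system:
            if isinstance(block, dict) and block.get("type") == "text":
                candidates.append(block)
    messages = payload.get("messages", [])
    cache_start = max(0, len(messages) - self.prompt_cache_size)
    for i in range(cache_start, len(messages)):
        msg = messages[i]
        if not isinstance(msg, dict):
            continue
        content = msg.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict):
                    candidates.append(block)
        elif isinstance(content, str) and content:
            msg["content"] = [{"type": "text", "text": content}]
            candidates.append(msg["content"][0])
    tools = payload.get("tools", [])
    if tools and isinstance(tools[-1], dict):
        candidates.append(tools[-1])
    for block in candidates[-MAX_BREAKPOINTS:]:
        block["cache_control"] = {"type": "ephemeral"}

# Minimal stub in place of a ClaudeChatModel instance
model = SimpleNamespace(prompt_cache_size=3)
payload = {
    "system": [{"type": "text", "text": "You are a helpful agent."}],
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": f"turn {i}"},
                     {"type": "text", "text": "extra"}]}
        for i in range(5)
    ],
    "tools": [{"name": "search", "input_schema": {"type": "object"}}],
}
_apply_prompt_caching(model, payload)  # call the method as a plain function

all_blocks = (payload["system"]
              + [b for m in payload["messages"] for b in m["content"]]
              + payload["tools"])
breakpoints = sum(1 for b in all_blocks if "cache_control" in b)
print(breakpoints)  # 4 — within the API limit
```

Note that the last candidates (the tool definition and the most recent message blocks) receive the breakpoints, which is also where the longest cacheable prefixes end.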
Environment
- DeerFlow: latest main branch
- langchain-anthropic: 1.3.4
- anthropic SDK: 0.84.0
- Backend: AWS Bedrock via proxy (also reproducible with direct Anthropic API)