_apply_prompt_caching exceeds the 4 cache_control breakpoint limit in multi-turn conversations · Issue #2448 · bytedance/deer-flow · GitHub

_apply_prompt_caching exceeds the 4 cache_control breakpoint limit in multi-turn conversations #2448

@newhwa

Bug Description

ClaudeChatModel._apply_prompt_caching() injects cache_control: {"type": "ephemeral"} into every text block in the system prompt, every content block in the last N messages (prompt_cache_size, default 3), and the last tool definition. In multi-turn conversations with structured content blocks, this easily exceeds 4 total breakpoints — the hard limit enforced by both the Anthropic API and AWS Bedrock.

This causes a 400 Bad Request:

A maximum of 4 blocks with cache_control may be provided. Found 5.

When streaming, the 400 produces zero SSE chunks, which surfaces as:

LLM request failed: No generations found in stream.

Root Cause

In claude_provider.py, _apply_prompt_caching() adds cache_control to:

  1. Every text block in system (could be 1–2+ blocks)
  2. Every content block in the last prompt_cache_size messages (each message can have multiple blocks — text, tool_use, tool_result, etc.)
  3. The last tool definition

Example in a 5-message conversation with 1 system block, 3 recent messages (2 blocks each), and 1 tool:

  • system: 1 breakpoint
  • messages: 6 breakpoints (3 messages × 2 blocks)
  • tools: 1 breakpoint
  • Total: 8 — well over the limit of 4
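
The arithmetic above can be reproduced in isolation. The following is a reconstruction of the pattern described, not the literal claude_provider.py source; `prompt_cache_size` is supplied via a stub object standing in for the model:

```python
from types import SimpleNamespace

def apply_caching_unbounded(model, payload: dict) -> None:
    """Reconstruction of the buggy behavior: mark every eligible block."""
    cc = {"type": "ephemeral"}
    for block in payload.get("system") or []:      # (1) every system text block
        if isinstance(block, dict) and block.get("type") == "text":
            block["cache_control"] = cc
    for msg in payload["messages"][-model.prompt_cache_size:]:
        for block in msg.get("content", []):       # (2) every block in last N messages
            if isinstance(block, dict):
                block["cache_control"] = cc
    tools = payload.get("tools", [])
    if tools:
        tools[-1]["cache_control"] = cc            # (3) the last tool definition

# The scenario from the example: 1 system block, 3 messages x 2 blocks, 1 tool.
model = SimpleNamespace(prompt_cache_size=3)
payload = {
    "system": [{"type": "text", "text": "sys"}],
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": "a"}, {"type": "text", "text": "b"}]}
        for _ in range(3)
    ],
    "tools": [{"name": "search"}],
}
apply_caching_unbounded(model, payload)

marked = sum(
    "cache_control" in b
    for b in payload["system"]
    + [blk for m in payload["messages"] for blk in m["content"]]
    + payload["tools"]
)
print(marked)  # 8 — double the API's limit of 4
```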

Reproduction

  1. Configure DeerFlow with ClaudeChatModel and enable_prompt_caching: true
  2. Start a multi-turn conversation (3+ turns with tool usage)
  3. The 2nd or 3rd LLM call fails with No generations found in stream
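
A minimal configuration for step 1 might look like the fragment below. Only `enable_prompt_caching` and `prompt_cache_size` come from this report; the surrounding key names are illustrative and should be checked against DeerFlow's conf.yaml schema:

```yaml
# Illustrative fragment — verify key names against your conf.yaml.
BASIC_MODEL:
  model: claude-sonnet-4-20250514
  enable_prompt_caching: true
  prompt_cache_size: 3   # default; number of recent messages marked for caching
```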

Suggested Fix

Instead of marking every block, use a budget of 4 breakpoints and place them strategically. The most effective placement is on the last eligible blocks, since later breakpoints cover more prefix content and yield better cache hit rates:

def _apply_prompt_caching(self, payload: dict) -> None:
    """Apply ephemeral cache_control to at most 4 strategic positions."""
    MAX_BREAKPOINTS = 4
    candidates = []  # blocks eligible to receive cache_control

    # Collect candidates in the order the API assembles the cached prefix:
    # tools, then system, then messages. Later candidates therefore cover a
    # longer prefix, so they make the most valuable breakpoints.
    tools = payload.get("tools", [])
    if tools and isinstance(tools[-1], dict):
        candidates.append(tools[-1])

    system = payload.get("system")
    if isinstance(system, list):
        for block in system:
            if isinstance(block, dict) and block.get("type") == "text":
                candidates.append(block)

    messages = payload.get("messages", [])
    cache_start = max(0, len(messages) - self.prompt_cache_size)
    for msg in messages[cache_start:]:
        if not isinstance(msg, dict):
            continue
        content = msg.get("content")
        if isinstance(content, list):
            for block in content:
                # Thinking blocks may not carry cache_control.
                if isinstance(block, dict) and block.get("type") not in (
                    "thinking",
                    "redacted_thinking",
                ):
                    candidates.append(block)
        elif isinstance(content, str) and content:
            # Convert to list form so the block can carry cache_control.
            msg["content"] = [{"type": "text", "text": content}]
            candidates.append(msg["content"][0])

    # Mark only the LAST N candidates, keeping the total within the API limit.
    for block in candidates[-MAX_BREAKPOINTS:]:
        block["cache_control"] = {"type": "ephemeral"}
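
To sanity-check the budget, the fix can be exercised on the worst-case payload from the root-cause example. The function body is condensed and repeated here so the snippet runs standalone, with `prompt_cache_size` supplied via a stub:

```python
from types import SimpleNamespace

MAX_BREAKPOINTS = 4

def apply_prompt_caching(model, payload: dict) -> None:
    """Condensed version of the budgeted fix, for standalone testing."""
    candidates = []
    tools = payload.get("tools", [])
    if tools and isinstance(tools[-1], dict):
        candidates.append(tools[-1])
    for block in payload.get("system") or []:
        if isinstance(block, dict) and block.get("type") == "text":
            candidates.append(block)
    for msg in payload["messages"][-model.prompt_cache_size:]:
        content = msg.get("content")
        if isinstance(content, list):
            candidates.extend(b for b in content if isinstance(b, dict))
        elif isinstance(content, str) and content:
            msg["content"] = [{"type": "text", "text": content}]
            candidates.append(msg["content"][0])
    for block in candidates[-MAX_BREAKPOINTS:]:
        block["cache_control"] = {"type": "ephemeral"}

# Worst case from the issue: 1 system block, 3 messages x 2 blocks, 1 tool.
model = SimpleNamespace(prompt_cache_size=3)
payload = {
    "system": [{"type": "text", "text": "You are helpful."}],
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": f"turn {i}"},
                     {"type": "text", "text": "extra"}]}
        for i in range(3)
    ],
    "tools": [{"name": "search", "input_schema": {"type": "object"}}],
}
apply_prompt_caching(model, payload)

marked = sum(
    "cache_control" in b
    for b in payload["system"]
    + [blk for m in payload["messages"] for blk in m["content"]]
    + payload["tools"]
)
print(marked)  # 4 — the unbounded behavior would have marked 8
```

With more than 4 candidates, the breakpoints land on the final message blocks, which implicitly cache the tools and system prefix as well.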

Environment

  • DeerFlow: latest main branch
  • langchain-anthropic: 1.3.4
  • anthropic SDK: 0.84.0
  • Backend: AWS Bedrock via proxy (also reproducible with direct Anthropic API)
