fix: split <think>...</think> into reasoning_content for non-streaming responses by octo-patch · Pull Request #5137 · unslothai/unsloth · GitHub

fix: split <think>...</think> into reasoning_content for non-streaming responses#5137

Open
octo-patch wants to merge 7 commits into unslothai:main from
octo-patch:fix/issue-5044-split-think-from-content-to-reasoning-content

Conversation

@octo-patch
Contributor

Fixes #5044

Problem

When Unsloth Studio serves a reasoning model (e.g. Gemma 4, Qwen3-think, DeepSeek-R1) through the OpenAI-compatible /v1/chat/completions endpoint in non-streaming mode, the <think>…</think> block is merged directly into content. This breaks NeMo DataDesigner's llm-structured parser, which expects the thinking text in a separate reasoning_content field and clean fenced JSON in content.
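For illustration, an abridged non-streaming response message before and after the fix (field values here are invented, and the JSON code fences around the structured payload are elided):

```json
{
  "before": { "role": "assistant",
              "content": "<think>reasoning here</think>\n{ \"x\": 1 }" },
  "after":  { "role": "assistant",
              "content": "{ \"x\": 1 }",
              "reasoning_content": "reasoning here" }
}
```

In the "before" shape, a structured-output parser sees the thinking text prepended to the JSON it is trying to parse; in the "after" shape, `content` is clean and the thinking text is available separately.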

Solution

For both non-streaming paths (GGUF via gguf_generate() and the standard transformers path via generate()):

  1. After collecting full_text, detect and extract any <think>…</think> block via a regex.
  2. Strip it from content and return it as reasoning_content on CompletionMessage.
  3. CompletionMessage gains a nullable reasoning_content: Optional[str] field, following the DeepSeek / llama-server --reasoning-format convention.

The streaming Chat UI path is unchanged — it keeps merging reasoning_content into <think> tags for the collapsible reasoning panel, so no regression for existing users.
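The extraction step described above can be sketched as follows (a simplified standalone version; the actual function and variable names in studio/backend/routes/inference.py may differ):

```python
import re
from typing import Optional, Tuple

# Matches a <think>...</think> block plus any trailing whitespace.
THINK_RE = re.compile(r"<think>(.*?)</think>\s*", flags=re.DOTALL)

def split_reasoning(full_text: str) -> Tuple[str, Optional[str]]:
    """Strip every <think>...</think> block from the generated text and
    return (content, reasoning_content). reasoning_content is None when
    the model produced no thinking text."""
    parts = []
    content = THINK_RE.sub(
        # The callback records each captured block and replaces it with "".
        lambda m: parts.append(m.group(1).strip()) or "",
        full_text,
    ).strip()
    reasoning = "\n\n".join(p for p in parts if p) or None
    return content, reasoning
```

The returned `reasoning` value would populate the nullable `reasoning_content` field on `CompletionMessage`, while `content` keeps only the answer text.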

Testing

  • CompletionMessage model change: verified with a unit test against the Pydantic schema.
  • Non-streaming path: manually verified that a Gemma-4-GGUF response with <think> content returns reasoning_content populated and content containing only the answer text.
  • NeMo DataDesigner with extract_reasoning_content=true now successfully parses structured JSON records.

octo-patch and others added 4 commits April 17, 2026 12:41
…rt (fixes unslothai#5070)

When converting a Gemma 3 fine-tune to GGUF via save_pretrained_gguf,
tokens like <start_of_turn> (id=105) and <end_of_turn> (id=106) are
already present in the sentencepiece model but are typed as NORMAL (1)
instead of CONTROL (3). llama.cpp only recognises CONTROL tokens when
parse_special=True is active, so these tokens get BPE-split during
chat inference and the model produces garbage output.

fix_sentencepiece_gguf now reads tokenizer.json's added_tokens list and,
for any token with "special": true whose ID falls within the existing
sentencepiece vocabulary, updates its type from NORMAL to CONTROL before
writing the patched tokenizer.model to disk. The same CONTROL type is
also applied when new tokens are appended for the out-of-range case, so
both code paths are consistent.
…g responses (fixes unslothai#5044)

Non-streaming GGUF and transformers paths now detect <think>...</think>
blocks in the generated text, strip them from `content`, and expose them
as a separate `reasoning_content` field on CompletionMessage.

This aligns with the DeepSeek / llama-server --reasoning-format convention
so OpenAI-compatible consumers (e.g. NeMo DataDesigner with
extract_reasoning_content=true) can capture the thinking text without the
<think> block polluting the fenced JSON in `content`.

The streaming Chat UI path is unchanged, so no regression for existing
users who rely on the collapsible reasoning panel.
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds a reasoning_content field to the CompletionMessage model and implements logic to extract and strip <think>…</think> blocks from inference responses. It also enhances the GGUF tokenizer utility to correctly identify special tokens from tokenizer.json and mark them as CONTROL types for llama.cpp compatibility. The review feedback recommends using a single regex substitution with a callback to efficiently extract multiple reasoning blocks and ensure the main content is fully cleaned.

Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +1695 to +1706

medium

The current regex extraction only captures the first <think> block. If a model produces multiple reasoning blocks (e.g., during a long chain of thought), subsequent blocks will remain in the content field, which can break structured parsers like NeMo DataDesigner. Using a single sub call with a callback ensures all blocks are extracted and stripped in a single pass, improving efficiency by avoiding redundant iterations.

                reasoning_parts = []
                full_text = _re.sub(r"<think>(.*?)</think>\s*", lambda m: reasoning_parts.append(m.group(1).strip()) or "", full_text, flags=_re.DOTALL).strip()
                reasoning_text = "\n\n".join(p for p in reasoning_parts if p) or None
References
  1. To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.
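For illustration, the suggested single-pass extraction behaves like this on input containing multiple reasoning blocks (a hypothetical standalone version of the suggestion above):

```python
import re

def extract_all_reasoning(full_text: str):
    """Collect every <think>...</think> block in one re.sub pass and
    return (cleaned_text, joined_reasoning_or_None)."""
    reasoning_parts = []
    cleaned = re.sub(
        r"<think>(.*?)</think>\s*",
        # Each match is recorded, then replaced with the empty string.
        lambda m: reasoning_parts.append(m.group(1).strip()) or "",
        full_text,
        flags=re.DOTALL,
    ).strip()
    return cleaned, "\n\n".join(p for p in reasoning_parts if p) or None
```

Unlike a single re.search, this leaves no <think> tags behind even when the model emits several thinking segments.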

Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +1892 to +1903

            reasoning_text_ns: Optional[str] = None
            if "<think>" in full_text:
                _think_match_ns = _re.search(
                    r"<think>(.*?)</think>\s*",
                    full_text,
                    flags = _re.DOTALL,
                )
                if _think_match_ns:
                    reasoning_text_ns = _think_match_ns.group(1).strip() or None
                    full_text = (
                        full_text[: _think_match_ns.start()]

medium

Similar to the GGUF path, this logic only handles the first <think> block. It is safer to extract all reasoning blocks and join them to ensure the content field is clean of any thinking tags. Using a single sub call with a callback avoids redundant iterations over the text string.

            reasoning_parts_ns = []
            full_text = _re.sub(r"<think>(.*?)</think>\s*", lambda m: reasoning_parts_ns.append(m.group(1).strip()) or "", full_text, flags=_re.DOTALL).strip()
            reasoning_text_ns = "\n\n".join(p for p in reasoning_parts_ns if p) or None
References
  1. To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.

octo-patch and others added 2 commits April 23, 2026 13:18
Per gemini-code-assist review: both the GGUF and non-GGUF paths only
captured the first <think>...</think> match, leaving subsequent reasoning
blocks embedded in the content field. Switch to a single re.sub call with
a callback so all blocks are extracted into reasoning_text and stripped
from full_text in one pass — symmetric for both code paths.
@octo-patch
Contributor Author

Thanks for the review! Pushed 9b4a3de — switched both the GGUF and non-GGUF paths from a single re.search (which captured only the first <think> block) to one re.sub call with a callback that collects every match. The reasoning blocks are joined with \n\n and the cleaned text is stripped in a single pass, symmetric across both code paths.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc1f8cbb57


Comment on lines +1714 to +1717

P2 Badge Omit null reasoning_content in non-thinking responses

This now always sets reasoning_content on the message object, and both non-streaming paths serialize with response.model_dump() (without exclude_none=True), so requests with no <think> block still return "reasoning_content": null. That changes the response shape for every non-streaming completion, which can break integrations that validate against a strict OpenAI chat schema (unknown properties rejected); only emitting this field when populated avoids that compatibility regression.
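The compatibility concern amounts to dropping unset/None fields at serialization time, as Pydantic's model_dump(exclude_none=True) does. A plain-dict stand-in for that behavior (the real response object is a Pydantic model, so names here are illustrative):

```python
def dump_message(message: dict, exclude_none: bool = True) -> dict:
    """Mimic Pydantic's model_dump(exclude_none=True): drop keys whose
    value is None, so "reasoning_content": null never reaches clients
    that validate against a strict OpenAI chat schema."""
    if not exclude_none:
        return dict(message)
    return {k: v for k, v in message.items() if v is not None}
```

With this in place, responses without a <think> block keep the exact pre-PR shape, and reasoning_content appears only when populated.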




Development

Successfully merging this pull request may close these issues.

Studio: Gemma 4 (and other reasoning models) break Data Recipe llm-structured parsing — <think>…</think> leaks into content

2 participants