fix: split <think>...</think> into reasoning_content for non-streaming responses (#5137)
Conversation
…rt (fixes unslothai#5070) When converting a Gemma 3 fine-tune to GGUF via save_pretrained_gguf, tokens like <start_of_turn> (id=105) and <end_of_turn> (id=106) are already present in the sentencepiece model but are typed as NORMAL (1) instead of CONTROL (3). llama.cpp only recognises CONTROL tokens when parse_special=True is active, so these tokens get BPE-split during chat inference and the model produces garbage output. fix_sentencepiece_gguf now reads tokenizer.json's added_tokens list and, for any token with "special": true whose ID falls within the existing sentencepiece vocabulary, updates its type from NORMAL to CONTROL before writing the patched tokenizer.model to disk. The same CONTROL type is also applied when new tokens are appended for the out-of-range case, so both code paths are consistent.
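The retyping logic described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual `fix_sentencepiece_gguf` implementation: the function name is hypothetical, and the sentencepiece vocabulary is modeled as a plain list of `[piece, type]` pairs rather than the real protobuf, with the type codes (NORMAL = 1, CONTROL = 3) taken from the description.

```python
import json

NORMAL, CONTROL = 1, 3  # sentencepiece token type codes per the fix description

def mark_special_tokens_as_control(tokenizer_json_path, pieces):
    """Flip NORMAL -> CONTROL for tokens that tokenizer.json marks as special.

    `pieces` is a mutable list of [piece_text, type_code] entries standing in
    for the sentencepiece vocabulary. Specials whose id falls inside the
    existing vocab are retyped in place; out-of-range specials are appended
    as CONTROL so both code paths stay consistent.
    """
    with open(tokenizer_json_path) as f:
        added_tokens = json.load(f)["added_tokens"]
    for tok in added_tokens:
        if not tok.get("special"):
            continue
        tid = tok["id"]
        if tid < len(pieces):
            # token already present in the sentencepiece model: fix its type
            pieces[tid][1] = CONTROL
        else:
            # new token beyond the existing vocab: append with CONTROL type
            pieces.append([tok["content"], CONTROL])
    return pieces
```

With the type corrected, llama.cpp's `parse_special=True` path can match tokens like `<start_of_turn>` atomically instead of BPE-splitting them.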
…g responses (fixes unslothai#5044) Non-streaming GGUF and transformers paths now detect <think>...</think> blocks in the generated text, strip them from `content`, and expose them as a separate `reasoning_content` field on CompletionMessage. This aligns with the DeepSeek / llama-server --reasoning-format convention so OpenAI-compatible consumers (e.g. NeMo DataDesigner with extract_reasoning_content=true) can capture the thinking text without the <think> block polluting the fenced JSON in `content`. The streaming Chat UI path is unchanged, so no regression for existing users who rely on the collapsible reasoning panel.
Code Review
This pull request adds a reasoning_content field to the CompletionMessage model and implements logic to extract and strip `<think>...</think>` blocks from inference responses. It also enhances the GGUF tokenizer utility to correctly identify special tokens from tokenizer.json and mark them as CONTROL types for llama.cpp compatibility. The review feedback recommends using a single regex substitution with a callback to efficiently extract multiple reasoning blocks and ensure the main content is fully cleaned.
The current regex extraction only captures the first `<think>...</think>` block. If a model produces multiple reasoning blocks (e.g., during a long chain of thought), subsequent blocks will remain in the content field, which can break structured parsers like NeMo DataDesigner. Using a single sub call with a callback ensures all blocks are extracted and stripped in a single pass, improving efficiency by avoiding redundant iterations.
```python
reasoning_parts = []
full_text = _re.sub(
    r"<think>(.*?)</think>\s*",
    lambda m: reasoning_parts.append(m.group(1).strip()) or "",
    full_text,
    flags=_re.DOTALL,
).strip()
reasoning_text = "\n\n".join(p for p in reasoning_parts if p) or None
```

References
- To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.
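The suggested single-pass extraction can be exercised standalone. The sketch below wraps it in a helper function for illustration (the helper name and the sample text are made up, not from the PR):

```python
import re

def extract_reasoning(full_text):
    """Collect every <think>...</think> block and strip them from the text."""
    reasoning_parts = []
    cleaned = re.sub(
        r"<think>(.*?)</think>\s*",
        # append returns None, so `or ""` makes the replacement an empty string
        lambda m: reasoning_parts.append(m.group(1).strip()) or "",
        full_text,
        flags=re.DOTALL,
    ).strip()
    reasoning = "\n\n".join(p for p in reasoning_parts if p) or None
    return cleaned, reasoning

text = '<think>step one</think>partial<think>step two</think>{"answer": 42}'
content, reasoning = extract_reasoning(text)
# content  -> 'partial{"answer": 42}'
# reasoning -> 'step one\n\nstep two'
```

Because `re.sub` visits every non-overlapping match, both `<think>` blocks are captured and stripped in one pass, unlike a single `re.search`.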
```python
reasoning_text_ns: Optional[str] = None
if "<think>" in full_text:
    _think_match_ns = _re.search(
        r"<think>(.*?)</think>\s*",
        full_text,
        flags = _re.DOTALL,
    )
    if _think_match_ns:
        reasoning_text_ns = _think_match_ns.group(1).strip() or None
        full_text = (
            full_text[: _think_match_ns.start()]
```
Similar to the GGUF path, this logic only handles the first `<think>...</think>` block. It is safer to extract all reasoning blocks and join them to ensure the content field is clean of any thinking tags. Using a single sub call with a callback avoids redundant iterations over the text string.
```python
reasoning_parts_ns = []
full_text = _re.sub(
    r"<think>(.*?)</think>\s*",
    lambda m: reasoning_parts_ns.append(m.group(1).strip()) or "",
    full_text,
    flags=_re.DOTALL,
).strip()
reasoning_text_ns = "\n\n".join(p for p in reasoning_parts_ns if p) or None
```

References
- To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.
Per gemini-code-assist review: both the GGUF and non-GGUF paths only captured the first <think>...</think> match, leaving subsequent reasoning blocks embedded in the content field. Switch to a single re.sub call with a callback so all blocks are extracted into reasoning_text and stripped from full_text in one pass — symmetric for both code paths.
Thanks for the review! Pushed 9b4a3de — switched both the GGUF and non-GGUF paths from a single `re.search` to a single `re.sub` call with a callback.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc1f8cbb57
Omit null reasoning_content in non-thinking responses
This now always sets reasoning_content on the message object, and both non-streaming paths serialize with response.model_dump() (without exclude_none=True), so requests with no <think> block still return "reasoning_content": null. That changes the response shape for every non-streaming completion, which can break integrations that validate against a strict OpenAI chat schema (unknown properties rejected); only emitting this field when populated avoids that compatibility regression.
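One way to implement the suggestion, sketched with a plain dict rather than the actual Pydantic `CompletionMessage` (the helper name is illustrative): attach `reasoning_content` only when it is populated, so non-thinking responses keep the strict OpenAI message shape.

```python
def build_message(content, reasoning_content=None):
    """Assemble an assistant message, omitting reasoning_content when absent."""
    msg = {"role": "assistant", "content": content}
    if reasoning_content:  # only emit the field when there is reasoning text
        msg["reasoning_content"] = reasoning_content
    return msg
```

Alternatively, if the field stays on the Pydantic model, serializing with `response.model_dump(exclude_none=True)` would drop the `null` value, per the review.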

Fixes #5044
Problem
When Unsloth Studio serves a reasoning model (e.g. Gemma 3, Qwen3-think, DeepSeek-R1) through the OpenAI-compatible `/v1/chat/completions` endpoint in non-streaming mode, the `<think>…</think>` block is merged directly into `content`. This breaks NeMo DataDesigner's `llm-structured` parser, which expects the thinking text in a separate `reasoning_content` field and clean fenced JSON in `content`.

Solution
For both non-streaming paths (GGUF via `gguf_generate()` and the standard transformers path via `generate()`):
- In `full_text`, detect and extract any `<think>…</think>` block via a regex.
- Strip the block from `content` and return it as `reasoning_content` on `CompletionMessage`.
- `CompletionMessage` gains a nullable `reasoning_content: Optional[str]` field, following the DeepSeek / llama-server `--reasoning-format` convention.

The streaming Chat UI path is unchanged: it keeps merging `reasoning_content` into `<think>` tags for the collapsible reasoning panel, so no regression for existing users.

Testing
- `CompletionMessage` model change: verified with a unit test against the Pydantic schema.
- Inference with `<think>` content returns `reasoning_content` populated and `content` containing only the answer text.
- NeMo DataDesigner with `extract_reasoning_content=true` now successfully parses structured JSON records.
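Under this change, a non-streaming response for a thinking model would look roughly like the following (all values illustrative, abbreviated to the relevant fields):

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "{\"answer\": 42}",
        "reasoning_content": "First, work through the problem step by step..."
      }
    }
  ]
}
```

For a non-thinking response, `reasoning_content` would be absent entirely rather than `null`, per the review feedback above.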
CompletionMessagemodel change: verified with a unit test against the Pydantic schema.<think>content returnsreasoning_contentpopulated andcontentcontaining only the answer text.extract_reasoning_content=truenow successfully parses structured JSON records.