fix: split <think>...</think> into reasoning_content for non-streaming responses (#5137)
Conversation
…rt (fixes unslothai#5070) When converting a Gemma 3 fine-tune to GGUF via save_pretrained_gguf, tokens like <start_of_turn> (id=105) and <end_of_turn> (id=106) are already present in the sentencepiece model but are typed as NORMAL (1) instead of CONTROL (3). llama.cpp only recognises CONTROL tokens when parse_special=True is active, so these tokens get BPE-split during chat inference and the model produces garbage output. fix_sentencepiece_gguf now reads tokenizer.json's added_tokens list and, for any token with "special": true whose ID falls within the existing sentencepiece vocabulary, updates its type from NORMAL to CONTROL before writing the patched tokenizer.model to disk. The same CONTROL type is also applied when new tokens are appended for the out-of-range case, so both code paths are consistent.
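The retyping logic described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual `fix_sentencepiece_gguf` implementation: the function name is hypothetical, and the sentencepiece vocabulary is modeled as a plain list of `[piece, type]` pairs rather than the real protobuf, with the type codes (NORMAL = 1, CONTROL = 3) taken from the description.

```python
import json

NORMAL, CONTROL = 1, 3  # sentencepiece token type codes per the fix description

def mark_special_tokens_as_control(tokenizer_json_path, pieces):
    """Flip NORMAL -> CONTROL for tokens that tokenizer.json marks as special.

    `pieces` is a mutable list of [piece_text, type_code] entries standing in
    for the sentencepiece vocabulary. Specials whose id falls inside the
    existing vocab are retyped in place; out-of-range specials are appended
    as CONTROL so both code paths stay consistent.
    """
    with open(tokenizer_json_path) as f:
        added_tokens = json.load(f)["added_tokens"]
    for tok in added_tokens:
        if not tok.get("special"):
            continue
        tid = tok["id"]
        if tid < len(pieces):
            # token already present in the sentencepiece model: fix its type
            pieces[tid][1] = CONTROL
        else:
            # new token beyond the existing vocab: append with CONTROL type
            pieces.append([tok["content"], CONTROL])
    return pieces
```

With the type corrected, llama.cpp's `parse_special=True` path can match tokens like `<start_of_turn>` atomically instead of BPE-splitting them.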
…g responses (fixes unslothai#5044) Non-streaming GGUF and transformers paths now detect <think>...</think> blocks in the generated text, strip them from `content`, and expose them as a separate `reasoning_content` field on CompletionMessage. This aligns with the DeepSeek / llama-server --reasoning-format convention so OpenAI-compatible consumers (e.g. NeMo DataDesigner with extract_reasoning_content=true) can capture the thinking text without the <think> block polluting the fenced JSON in `content`. The streaming Chat UI path is unchanged, so no regression for existing users who rely on the collapsible reasoning panel.
Code Review
This pull request adds a reasoning_content field to the CompletionMessage model and implements logic to extract and strip `<think>...</think>` blocks from inference responses. It also enhances the GGUF tokenizer utility to correctly identify special tokens from tokenizer.json and mark them as CONTROL types for llama.cpp compatibility. The review feedback recommends using a single regex substitution with a callback to efficiently extract multiple reasoning blocks and ensure the main content is fully cleaned.
The current regex extraction only captures the first `<think>...</think>` block. If a model produces multiple reasoning blocks (e.g., during a long chain of thought), subsequent blocks will remain in the content field, which can break structured parsers like NeMo DataDesigner. Using a single sub call with a callback ensures all blocks are extracted and stripped in a single pass, improving efficiency by avoiding redundant iterations.
```python
reasoning_parts = []
full_text = _re.sub(
    r"<think>(.*?)</think>\s*",
    lambda m: reasoning_parts.append(m.group(1).strip()) or "",
    full_text,
    flags=_re.DOTALL,
).strip()
reasoning_text = "\n\n".join(p for p in reasoning_parts if p) or None
```

References
- To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.
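The suggested single-pass extraction can be exercised standalone. The sketch below wraps it in a helper function for illustration (the helper name and the sample text are made up, not from the PR):

```python
import re

def extract_reasoning(full_text):
    """Collect every <think>...</think> block and strip them from the text."""
    reasoning_parts = []
    cleaned = re.sub(
        r"<think>(.*?)</think>\s*",
        # append returns None, so `or ""` makes the replacement an empty string
        lambda m: reasoning_parts.append(m.group(1).strip()) or "",
        full_text,
        flags=re.DOTALL,
    ).strip()
    reasoning = "\n\n".join(p for p in reasoning_parts if p) or None
    return cleaned, reasoning

text = '<think>step one</think>partial<think>step two</think>{"answer": 42}'
content, reasoning = extract_reasoning(text)
# content  -> 'partial{"answer": 42}'
# reasoning -> 'step one\n\nstep two'
```

Because `re.sub` visits every non-overlapping match, both `<think>` blocks are captured and stripped in one pass, unlike a single `re.search`.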
```python
reasoning_text_ns: Optional[str] = None
if "<think>" in full_text:
    _think_match_ns = _re.search(
        r"<think>(.*?)</think>\s*",
        full_text,
        flags = _re.DOTALL,
    )
    if _think_match_ns:
        reasoning_text_ns = _think_match_ns.group(1).strip() or None
        full_text = (
            full_text[: _think_match_ns.start()]
```
Similar to the GGUF path, this logic only handles the first `<think>...</think>` block. It is safer to extract all reasoning blocks and join them to ensure the content field is clean of any thinking tags. Using a single sub call with a callback avoids redundant iterations over the text string.
```python
reasoning_parts_ns = []
full_text = _re.sub(
    r"<think>(.*?)</think>\s*",
    lambda m: reasoning_parts_ns.append(m.group(1).strip()) or "",
    full_text,
    flags=_re.DOTALL,
).strip()
reasoning_text_ns = "\n\n".join(p for p in reasoning_parts_ns if p) or None
```

References
- To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.
Per gemini-code-assist review: both the GGUF and non-GGUF paths only captured the first <think>...</think> match, leaving subsequent reasoning blocks embedded in the content field. Switch to a single re.sub call with a callback so all blocks are extracted into reasoning_text and stripped from full_text in one pass — symmetric for both code paths.
Thanks for the review! Pushed 9b4a3de — switched both the GGUF and non-GGUF paths from a single `re.search` to a single `re.sub` call with a callback.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc1f8cbb57
Omit null reasoning_content in non-thinking responses
This now always sets reasoning_content on the message object, and both non-streaming paths serialize with response.model_dump() (without exclude_none=True), so requests with no <think> block still return "reasoning_content": null. That changes the response shape for every non-streaming completion, which can break integrations that validate against a strict OpenAI chat schema (unknown properties rejected); only emitting this field when populated avoids that compatibility regression.
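One way to implement the suggestion, sketched with a plain dict rather than the actual Pydantic `CompletionMessage` (the helper name is illustrative): attach `reasoning_content` only when it is populated, so non-thinking responses keep the strict OpenAI message shape.

```python
def build_message(content, reasoning_content=None):
    """Assemble an assistant message, omitting reasoning_content when absent."""
    msg = {"role": "assistant", "content": content}
    if reasoning_content:  # only emit the field when there is reasoning text
        msg["reasoning_content"] = reasoning_content
    return msg
```

Alternatively, if the field stays on the Pydantic model, serializing with `response.model_dump(exclude_none=True)` would drop the `null` value, per the review.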

Fixes #5044
Problem
When Unsloth Studio serves a reasoning model (e.g. Gemma 3, Qwen3-think, DeepSeek-R1) through the OpenAI-compatible `/v1/chat/completions` endpoint in non-streaming mode, the `<think>…</think>` block is merged directly into `content`. This breaks NeMo DataDesigner's `llm-structured` parser, which expects the thinking text in a separate `reasoning_content` field and clean fenced JSON in `content`.

Solution
For both non-streaming paths (GGUF via `gguf_generate()` and the standard transformers path via `generate()`):
- In `full_text`, detect and extract any `<think>…</think>` block via a regex.
- Strip the block from `content` and return it as `reasoning_content` on `CompletionMessage`.
- `CompletionMessage` gains a nullable `reasoning_content: Optional[str]` field, following the DeepSeek / llama-server `--reasoning-format` convention.

The streaming Chat UI path is unchanged: it keeps merging `reasoning_content` into `<think>` tags for the collapsible reasoning panel, so no regression for existing users.

Testing
- `CompletionMessage` model change: verified with a unit test against the Pydantic schema.
- Inference with `<think>` content returns `reasoning_content` populated and `content` containing only the answer text.
- NeMo DataDesigner with `extract_reasoning_content=true` now successfully parses structured JSON records.
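Under this change, a non-streaming response for a thinking model would look roughly like the following (all values illustrative, abbreviated to the relevant fields):

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "{\"answer\": 42}",
        "reasoning_content": "First, work through the problem step by step..."
      }
    }
  ]
}
```

For a non-thinking response, `reasoning_content` would be absent entirely rather than `null`, per the review feedback above.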
CompletionMessagemodel change: verified with a unit test against the Pydantic schema.<think>content returnsreasoning_contentpopulated andcontentcontaining only the answer text.extract_reasoning_content=truenow successfully parses structured JSON records.