Add persistent wiki-RAG pipeline with vendored Graphify, lint-driven enrichment, and debug/maintenance APIs by zohairshafi · Pull Request #5103 · unslothai/unsloth · GitHub

Add persistent wiki-RAG pipeline with vendored Graphify, lint-driven enrichment, and debug/maintenance APIs#5103

Open
zohairshafi wants to merge 24 commits into unslothai:main from zohairshafi:feat/wiki-rag-graphify

Conversation

@zohairshafi

This PR implements a persistent wiki-RAG system for Unsloth Studio, inspired by Andrej Karpathy's design notes.

It uses Graphify v3 tooling and includes the full integration needed for end-to-end operation inside this repository.

This change gives Unsloth users a practical way to evaluate their models inside a real wiki-RAG workflow, not just one-off prompts. It enables side-by-side benchmarking of model behavior in persistent knowledge pipelines, including how well each model helps build and maintain an evolving wiki.

What this PR includes

  1. Vendors Graphify into the repo so the Studio wiki pipeline can compile and run without external submodule setup.
  2. Adds the core wiki stack for ingestion, indexing, retrieval, and maintenance workflows.
  3. Adds robust watcher lifecycle handling and ingestion flow hardening.
  4. Adds lint-driven enrichment with optional web gap filling for missing concepts.
  5. Adds RAG debug and wiki maintenance endpoints for inspectability and operational control.
  6. Adds regression coverage for the wiki/RAG pipeline and related tooling behavior.
  7. Updates project documentation with implementation and operations details in updates.md.

Why this change
This establishes a persistent, maintainable knowledge layer between raw sources and chat-time retrieval, improving retrieval quality, observability, and long-term wiki health.

Validation

  1. New/updated backend wiki-RAG tests were added and pass locally.
  2. The feature set is documented in updates.md, including behavior, endpoints, and configuration knobs.

Notes
This is a large change set because Graphify is intentionally vendored for runtime completeness.
Operational details, configuration options, and endpoint behavior are documented in updates.md.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces 'graphify,' a comprehensive tool for generating knowledge graphs from code, documentation, and research papers to facilitate codebase understanding. It includes AST extraction for numerous languages, community detection, and various export formats like interactive HTML and Obsidian vaults, while also integrating a wiki-based RAG system into the inference backend. Feedback focused on critical security and performance improvements, including restricting allowed URL schemes to prevent SSRF, correcting text sanitization regexes, ensuring path consistency through environment variables, adopting timezone-aware timestamps, and optimizing file processing and graph analysis for large datasets.

Contributor


critical (security)

The _ALLOWED_SCHEMES set includes "file", which contradicts the docstring for validate_url and the security model described in SECURITY.md. Allowing the file scheme in a tool that fetches arbitrary URLs can lead to Server-Side Request Forgery (SSRF) or local file disclosure if a user-provided URL or a malicious redirect points to a local file. This should be restricted to http and https only.

Suggested change
_ALLOWED_SCHEMES = {"http", "https", "file"}
_ALLOWED_SCHEMES = {"http", "https"}
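A standalone sketch of the restricted check (the `validate_url` helper body here is illustrative; only the `_ALLOWED_SCHEMES` set appears in the diff above):

```python
from urllib.parse import urlparse

# Web schemes only: allowing "file" would let a crafted URL such as
# file:///etc/passwd pull local files into generated artifacts (SSRF-style).
_ALLOWED_SCHEMES = {"http", "https"}

def validate_url(url: str) -> str:
    scheme = urlparse(url).scheme.lower()
    if scheme not in _ALLOWED_SCHEMES:
        raise ValueError(f"URL scheme {scheme!r} is not allowed")
    return url

print(validate_url("https://example.com/docs"))   # accepted
try:
    validate_url("file:///etc/passwd")
except ValueError as exc:
    print("rejected:", exc)
```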

"---",
f'type: "{query_type}"',
f'date: "{now.isoformat()}"',
f'question: "{re.sub(chr(10) + chr(13), " ", question).replace(chr(34), chr(39))}"',
Contributor


medium

The regex chr(10) + chr(13) matches the literal sequence \n\r. This will not match standard Unix (\n) or Windows (\r\n) line endings. To correctly sanitize the question for YAML frontmatter and prevent it from breaking the format, you should replace all carriage returns and newlines with spaces.

Suggested change
f'question: "{re.sub(chr(10) + chr(13), " ", question).replace(chr(34), chr(39))}"',
f'question: "{re.sub(r"[\r\n]+", " ", question).replace(chr(34), chr(39))}"',
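To illustrate the difference with a standalone snippet (not the PR's code): the original pattern only matches the exact two-character sequence LF-then-CR, while the suggested character class collapses any run of newlines or carriage returns.

```python
import re

question = 'What is\nRAG?\r\nAnd "why"?'

# Original pattern: matches only the literal sequence "\n\r", which never
# occurs in Unix ("\n") or Windows ("\r\n") line endings, so nothing changes.
broken = re.sub("\n\r", " ", question)
assert "\n" in broken  # newlines survive and would break YAML frontmatter

# Suggested pattern: any run of CR/LF characters becomes a single space,
# and double quotes become single quotes so the YAML value stays parseable.
fixed = re.sub(r"[\r\n]+", " ", question).replace('"', "'")
print(fixed)  # What is RAG? And 'why'?
```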

from core.wiki.manager import WikiManager
from core.wiki.ingestor import WikiIngestor
from pathlib import Path
self.vault_root = Path("/tmp/unsloth_wiki")
Contributor


medium

The vault_root is hardcoded to /tmp/unsloth_wiki, which is not persistent. Additionally, studio/backend/core/inference/tools.py uses an environment variable UNSLOTH_WIKI_VAULT for the same purpose. This should be updated to respect the environment variable for consistency and persistence.

Suggested change
self.vault_root = Path("/tmp/unsloth_wiki")
self.vault_root = Path(os.getenv("UNSLOTH_WIKI_VAULT", "/tmp/unsloth_wiki"))

Comment on lines +2187 to +2188
import datetime
timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
Contributor


medium

datetime.datetime.now() creates a naive datetime object using the local system time. It is a best practice to use timezone-aware datetimes (e.g., UTC) to ensure consistency across different server environments and avoid ambiguity during daylight saving time transitions.

import datetime
timestamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
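A small standalone demonstration of the naive vs. timezone-aware distinction the reviewer describes:

```python
import datetime

# Naive: interpreted in the server's local timezone, and ambiguous around
# daylight saving time transitions.
naive = datetime.datetime.now()
assert naive.tzinfo is None

# Aware: pinned to UTC, unambiguous and comparable across environments.
aware = datetime.datetime.now(datetime.timezone.utc)
timestamp = aware.strftime("%Y-%m-%d %H:%M:%S")
print(timestamp, aware.tzinfo)
```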

Comment thread graphify/graphify/detect.py Outdated
try:
if path.suffix.lower() == ".pdf":
return len(extract_pdf_text(path).split())
return len(path.read_text(errors="ignore").split())
Contributor


medium

path.read_text() reads the entire file into memory at once. If the corpus contains very large files (e.g., large documentation or source files), this could lead to excessive memory usage or a MemoryError. It is safer to read the file line by line or in chunks when counting words.

with path.open("r", errors="ignore") as f:
    return sum(len(line.split()) for line in f)
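A self-contained version of the streaming count (the `count_words` wrapper is illustrative, not the graphify function):

```python
from pathlib import Path
import tempfile

def count_words(path: Path) -> int:
    # Stream line by line rather than read_text(), so memory stays bounded
    # even when the corpus contains very large files.
    with path.open("r", errors="ignore") as f:
        return sum(len(line.split()) for line in f)

sample = Path(tempfile.mkdtemp()) / "doc.txt"
sample.write_text("alpha beta\ngamma delta epsilon\n")
print(count_words(sample))  # 5
```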

Comment on lines +354 to +355
betweenness = nx.betweenness_centrality(G)
# Top bridge nodes that are NOT file-level hubs
Contributor


medium

nx.betweenness_centrality(G) is a computationally expensive operation ($O(VE)$ for unweighted graphs). While acceptable for small graphs, it can become a significant performance bottleneck as the knowledge graph grows. Consider using faster centrality measures (like degree centrality) or limiting this analysis to a subgraph if performance becomes an issue.
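To make the cost difference concrete, a tiny plain-Python sketch of the cheaper alternative (the graph is made up; in the PR this would be `nx.degree_centrality(G)`):

```python
from collections import Counter

# Edges of a tiny illustrative graph: node 2 connects two clusters.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5)]

# Degree centrality is degree / (n - 1): one pass over the edge list,
# O(V + E), versus betweenness centrality's O(VE) shortest-path work,
# which grows quickly as the knowledge graph does.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

n = len({node for edge in edges for node in edge})
centrality = {node: d / (n - 1) for node, d in degree.items()}
hub = max(centrality, key=centrality.get)
print(hub, round(centrality[hub], 2))  # 2 0.6
```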


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 583d65717d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +987 to +990
else:
template_messages = messages

# --- RAG Injection ---

P1: Copy messages before injecting RAG system context

When system_prompt is empty, template_messages aliases the original messages list, so the later insert(0, context_message) mutates the caller’s conversation in place. In this flow, _save_chat_history_to_wiki(messages) then persists the injected RAG block as if it were user conversation, which can recursively pollute future retrieval context and inflate prompts across turns. Build template_messages as a copy before insertion to keep prompt-only context out of persisted chat history.
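The aliasing behavior is easy to reproduce standalone (names mirror the description above, but the snippet is self-contained):

```python
# Buggy: template_messages is the same list object as messages, so the RAG
# context injected for prompting also lands in the persisted history.
messages = [{"role": "user", "content": "hi"}]
template_messages = messages
template_messages.insert(0, {"role": "system", "content": "RAG context"})
assert messages[0]["content"] == "RAG context"  # caller's conversation mutated

# Fixed: copy before injecting; the original conversation stays clean.
messages = [{"role": "user", "content": "hi"}]
template_messages = list(messages)
template_messages.insert(0, {"role": "system", "content": "RAG context"})
assert messages[0]["content"] == "hi"
print(len(messages), len(template_messages))  # 1 2
```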


from core.wiki.manager import WikiManager
from core.wiki.ingestor import WikiIngestor
from pathlib import Path
self.vault_root = Path("/tmp/unsloth_wiki")

P2: Use configured wiki vault path in backend worker

This hardcodes the backend worker to /tmp/unsloth_wiki while the route layer and watcher use UNSLOTH_WIKI_VAULT. If operators set a custom vault path, non-GGUF inference reads/writes a different wiki than /wiki/* endpoints and startup watcher maintenance, causing split state and missing RAG context on one path. The worker should resolve the same env-configured vault root as the rest of the wiki stack.


Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +708 to +709
_, ingestor = _get_route_wiki_components()
ingestor.ingest_file(file_path, contributor="Unsloth Studio")

P2: Avoid double-ingesting chat history files

After writing a chat-history batch to raw/, this code ingests the file immediately, but main.py enables WikiIngestionWatcher by default on the same raw/ directory. In that default setup, each flushed chat-history file is ingested twice (once here, once by the watcher), causing redundant index rebuilds/log churn and unnecessary maintenance work. Choose a single ingestion path (watcher or direct ingest) or mark these files as already handled.



@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8fe16c3de


def get_current_model(self) -> Optional[str]:
"""Get currently active model name"""
return self.active_model_name
def _get_rag_context(self, query: str) -> str:

P2: Restore get_current_model on InferenceBackend

check_vision_model_compatibility() still calls self.get_current_model(), but this commit removed that helper when this section was replaced with wiki/RAG methods. Any caller that hits the vision compatibility helper now gets AttributeError instead of a boolean compatibility result, which breaks the vision-guard flow at runtime.


Comment on lines +2216 to +2217
filename = f"chat_history_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.md"
file_path = self.vault_root / "raw" / filename

P2: Use collision-proof filenames for chat history flushes

This flush path uses a filename with only second-level precision and then writes with write_text, so two flushes in the same second target the same file and the later flush overwrites earlier buffered snapshots. That is reachable under concurrent traffic or when UNSLOTH_WIKI_CHAT_HISTORY_FLUSH_SECONDS=0, causing silent history loss and weaker downstream RAG context.
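One possible naming scheme (a sketch, not the PR's code): append sub-second precision and a short random suffix so two flushes in the same second cannot collide.

```python
import datetime
import uuid

def chat_history_filename() -> str:
    # Microsecond timestamp plus a random suffix: concurrent flushes, or
    # flushes with UNSLOTH_WIKI_CHAT_HISTORY_FLUSH_SECONDS=0, get distinct
    # names instead of silently overwriting an earlier snapshot.
    now = datetime.datetime.now(datetime.timezone.utc)
    stamp = now.strftime("%Y%m%d_%H%M%S_%f")
    return f"chat_history_{stamp}_{uuid.uuid4().hex[:8]}.md"

a, b = chat_history_filename(), chat_history_filename()
print(a)
assert a != b  # no overwrite even within the same second
```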


Comment on lines +9 to +10
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

P2: Handle missing watchdog dependency for default watcher mode

The watcher module imports watchdog at module scope, so environments without that extra cannot import core.wiki.watcher at all. Because watcher startup is enabled by default, this leads to startup falling back to the exception path and silently disabling automatic raw-file ingestion/analysis unless watchdog is preinstalled (it is not listed in the backend default requirements file).
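A common pattern for softening the optional dependency (a sketch; the function and flag names are illustrative): guard the import so the module stays importable, and log a clear reason when the watcher cannot start.

```python
import logging

logger = logging.getLogger(__name__)

# Guard the optional dependency at import time so importing this module
# never fails outright; startup code can then check the flag explicitly.
try:
    from watchdog.observers import Observer
    WATCHDOG_AVAILABLE = True
except ImportError:
    Observer = None
    WATCHDOG_AVAILABLE = False

def start_watcher():
    if not WATCHDOG_AVAILABLE:
        logger.warning("watchdog not installed; automatic wiki ingestion disabled")
        return None
    return Observer()  # caller schedules handlers and starts it

print(WATCHDOG_AVAILABLE, start_watcher())
```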



@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 877bb00ba2


Comment on lines +1006 to +1007
# Save history for future RAG
self._save_chat_history_to_wiki(messages)

P1: Skip RAG/history hooks for internal wiki LLM calls

generate_chat_response now always runs _get_rag_context(...) and _save_chat_history_to_wiki(messages), but this method is also used by _route_wiki_llm_stub for wiki ingestion/enrichment/retry flows (not just user chat). In that path, internal maintenance prompts are persisted as chat history and re-ingested into the wiki, which pollutes retrieval with non-conversational prompt scaffolding and can recursively bloat the RAG corpus over time. Add an opt-out flag (or a dedicated internal generation path) so only user-facing chat requests write history / inject RAG.


Comment on lines +2192 to +2194
role = msg.get("role", "unknown").capitalize()
content = msg.get("content", "").strip()
if content:

P2: Normalize message content types before buffering chat history

This assumes every message content is a string and calls .strip(). The legacy /generate/stream route forwards GenerateRequest.messages as raw OpenAI-format dicts, so content can be non-string (e.g., multimodal arrays/tool payloads); in those cases this raises AttributeError, the outer except swallows it, and the entire snapshot is dropped. Coerce/serialize non-string content before trimming so history persistence remains reliable across valid message shapes.
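A sketch of the coercion step (the `normalize_content` helper is illustrative, not code from the PR): flatten OpenAI-style multimodal content lists to text before stripping, so non-string payloads don't raise.

```python
def normalize_content(content) -> str:
    # OpenAI-format messages may carry a list of parts (text, image refs,
    # tool payloads) instead of a plain string; pull out the text pieces.
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        parts = [p.get("text", "") for p in content if isinstance(p, dict)]
        return " ".join(part for part in parts if part).strip()
    return str(content).strip() if content is not None else ""

print(normalize_content("  hello  "))  # hello
print(normalize_content([{"type": "text", "text": "hi"},
                         {"type": "image_url", "image_url": {"url": "..."}}]))  # hi
print(repr(normalize_content(None)))  # ''
```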


Add an explicit generation flag to bypass RAG/history hooks for internal wiki maintenance prompts, and harden chat history buffering by normalizing non-string message content.

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9bfe81a205


Comment thread studio/backend/routes/inference.py Outdated
max_new_tokens = _WIKI_LLM_MAX_TOKENS,
repetition_penalty = 1.0,
cancel_event = None,
enable_wiki_rag_history = False,

P1: Remove unsupported wiki-history flag from orchestrator call

_route_wiki_llm_stub passes enable_wiki_rag_history=False to backend.generate_chat_response, but this route gets its backend from get_inference_backend() (the orchestrator), whose generate_chat_response does not accept that keyword. The resulting TypeError is swallowed by the local except, so transformer-backed wiki calls silently fall back to returning the prompt text instead of model output, degrading /wiki/query, enrichment, and fallback-retry behavior whenever GGUF is not active.


Comment thread studio/backend/core/wiki/watcher.py Outdated
Comment on lines +165 to +169
self._processed_mtime_ns[resolved] = mtime_ns
self._processed_hash[resolved] = file_hash

logger.info(f"New file detected in wiki raw directory: {file_path}")
title = self.ingestor.ingest_file(file_path, contributor = self.contributor)

P2: Record watcher hashes only after successful ingest

The watcher marks a file as processed (_processed_mtime_ns/_processed_hash) before ingest_file succeeds. If ingestion fails transiently (e.g., file still being finalized), later events with unchanged content are skipped as duplicates, so that raw file may never be ingested unless it is edited again. This breaks reliable at-least-once ingestion for newly dropped files.
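The ordering fix can be sketched standalone (names are illustrative): record the hash only after `ingest_file` succeeds, so a transient failure leaves the file eligible for retry.

```python
processed_hash = {}

def handle_event(path, file_hash, ingest_file):
    if processed_hash.get(path) == file_hash:
        return "skipped"              # genuine duplicate event
    title = ingest_file(path)         # may raise on transient failure
    processed_hash[path] = file_hash  # mark processed only on success
    return title

# First attempt fails transiently; the retry with identical content succeeds.
attempts = []
def flaky_ingest(path):
    attempts.append(path)
    if len(attempts) == 1:
        raise OSError("file still being written")
    return "Page Title"

try:
    handle_event("raw/a.md", "h1", flaky_ingest)
except OSError:
    pass
print(handle_event("raw/a.md", "h1", flaky_ingest))  # Page Title, not "skipped"
```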


@rolandtannous
Collaborator

@zohairshafi can you please address the comments made by the automated reviewers (the valid ones)? Can you also share screenshots of how this functionality looks? Also, why Graphify?

zohairshafi and others added 2 commits April 20, 2026 15:45
Add dry-run/apply duplicate merge maintenance for entity/concept pages, including archival and wikilink rewrites. Also apply source-exclusion earlier in retrieval/rerank and compact index planning when source pages are disabled.

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 740bdd76a8


Comment on lines +187 to +190
return (
lowered in self._SKIPPED_LOCAL_FILENAMES
or name.startswith("._")
or name.startswith(".")

P1: Ignore hidden subdirectories when scanning raw ingest candidates

ingest_pending_raw_files() recursively scans raw/ and relies on should_skip_local_file(), but this predicate only checks the basename and does not exclude files under hidden directories (for example raw/.archive/...). After /wiki/archive/stale moves raw files into .archive, those files become eligible for re-ingestion on the next pending-ingest run, which can silently resurrect archived content and churn duplicate wiki pages.
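A sketch of a path-aware predicate (standalone; the function name and root layout are illustrative): check every component relative to `raw/`, not just the basename, so anything under `.archive/` stays excluded.

```python
from pathlib import Path

def should_skip(path: Path, root: Path) -> bool:
    # Inspect every component of the path relative to raw/, so files under
    # hidden directories like raw/.archive/... are skipped, not just files
    # whose own basename starts with "." or "._".
    rel = path.relative_to(root)
    return any(part.startswith(".") for part in rel.parts)

root = Path("raw")
print(should_skip(Path("raw/.archive/old.md"), root))   # True
print(should_skip(Path("raw/repoA/README.md"), root))   # False
```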


Comment on lines +597 to +598
if not ranked:
ranked = self._rank_pages(question)

P2: Honor analysis-page exclusion during empty ranking fallback

When include_analysis_pages_in_query is disabled, query() first filters out analysis/* pages, but if that leaves no candidates it immediately reruns _rank_pages(question) without reapplying the filter. In that case analysis pages are reintroduced despite the config flag, so deployments that disable analysis-page retrieval can still get self-referential analysis context.



@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 152d0f4b5f


cleaned = content.strip()
if not cleaned:
raise ValueError(f"Ingestion produced empty content for {file_path}")
return file_path.stem, cleaned

P1: Use collision-resistant titles for local wiki ingest

Returning only file_path.stem here makes distinct files with the same basename (for example raw/repoA/README.md and raw/repoB/README.md) share one source_title; downstream LLMWikiEngine.ingest_source slugs that title to the same sources/<slug>.md path, so later ingests silently overwrite earlier content. This causes real data loss and missing context whenever users ingest directory trees that contain common filenames.
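One way to make titles collision-resistant (a sketch; the `source_title` helper and separator are illustrative) is to derive the title from the path relative to the raw root rather than the bare stem:

```python
from pathlib import Path

def source_title(file_path: Path, raw_root: Path) -> str:
    # Include the path relative to raw/ (not just the stem), so
    # raw/repoA/README.md and raw/repoB/README.md get distinct titles and
    # therefore distinct sources/<slug>.md pages instead of overwriting.
    rel = file_path.relative_to(raw_root).with_suffix("")
    return " - ".join(rel.parts)

root = Path("raw")
print(source_title(Path("raw/repoA/README.md"), root))  # repoA - README
print(source_title(Path("raw/repoB/README.md"), root))  # repoB - README
```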


Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +526 to +528
"remember",
"token",
"previous message",

P2: Detect history intent with explicit phrases only

Including the bare keyword "token" in history-intent detection makes many non-history queries (for example tokenization, API token usage, token limits) flip into the history branch. In that branch, retrieval is biased toward or replaced by chat-history-* pages, so prompt context can become unrelated to the user’s actual topic and degrade answer quality. History mode should require explicit conversation-memory phrases instead of this broad match.


@zohairshafi
Author

@rolandtannous The issues should now be addressed in the latest commit.

As for Graphify: I'm using it mainly as a pre-existing architecture, so I don't have to reimplement the wiki page merge/delete lifecycle.
What was reused:

  • Deduping and canonicalization thinking from graphify build/detect flows.
  • Manifest and incremental update strategy so maintenance can be applied safely and predictably.
  • Hash/cache mindset for stable change detection and avoiding unnecessary recomputation.

This saved a lot of implementation/design time (roughly 1-2 days), mainly by avoiding re-inventing candidate selection, incremental rebuild logic, and maintenance safety checks. Now that everything is running, though, I could try to implement these features natively if you want the repo to be fully self-contained.

So far I've been using this branch to test out a personal research assistant of sorts, to see how well the wiki can analyse and link together research papers across domains, and how it updates the wiki as I add new papers that extend or invalidate older ones. The screenshots below show this:

  1. Wiki Folder Structure (screenshot: folder_structure)
  2. Index File Example (screenshot: index_file)
  3. Entities Folder Example (screenshot: entities_folder)
  4. Concepts Folder Example (screenshot: concepts_folder)
  5. Analysis File Example (screenshot: analysis_example)
  6. Use Case Example: I asked it about a particular paper of mine; it picks up two related papers and tells me about them. (screenshots: use_case_example, use_case_example_two)
  7. Maintenance and Enrichment Example (screenshot: enrichment_example)


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7d9f627c6


Comment thread studio/backend/routes/inference.py Outdated
Comment thread studio/backend/core/wiki/ingestor.py Outdated

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5b4f360ff0


Comment thread studio/backend/routes/inference.py Outdated
Comment thread studio/backend/core/wiki/watcher.py Outdated
- use full source identity for stale archive grouping

- watch raw tree recursively in WikiIngestionWatcher

- add regressions for stale grouping and recursive scheduling

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c05518fd16


Comment thread studio/backend/core/inference/inference.py Outdated
Comment on lines +2290 to +2293
file_path.write_text(content, encoding = "utf-8")
self._pending_chat_history_blocks.clear()
self._chat_history_buffer_started_at = None
logger.info(f"Saved buffered chat history to {file_path}")

P2: Ingest flushed chat batches when watcher is unavailable

This flush path writes buffered chat history to raw/ and immediately clears the in-memory buffer, but it never invokes wiki_ingestor to index the file. In transformer-backed mode, if UNSLOTH_WIKI_WATCHER=false (or watcher startup fails), these snapshots remain un-ingested, so _get_rag_context cannot retrieve recent history even though persistence appears successful. Add a direct-ingest fallback for this path when automatic watching is not active.


zohairshafi and others added 4 commits April 21, 2026 10:22
- default merge-maintenance max_merges to 512 via env

- add UNSLOTH_WIKI_MERGE_MAINTENANCE_MAX_MERGES wiring in model/manager/engine

- default GGUF prefill read timeout to 5 hours (env override retained)

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5e0acb77f2


Comment thread studio/backend/core/wiki/ingestor.py Outdated
@zohairshafi zohairshafi requested a review from Manan17 as a code owner April 22, 2026 17:07

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 528804d275


Comment on lines +254 to +256
_RAG_MAX_PAGES = int(os.getenv("UNSLOTH_WIKI_RAG_MAX_PAGES", "8"))
_RAG_MAX_CHARS_PER_PAGE = int(os.getenv("UNSLOTH_WIKI_RAG_MAX_CHARS_PER_PAGE", "1800"))
_RAG_MAX_TOTAL_CHARS = int(os.getenv("UNSLOTH_WIKI_RAG_MAX_TOTAL_CHARS", "12000"))

P2: Parse wiki env integers defensively

These module-level assignments call int(os.getenv(...)) directly, so any non-numeric value (including empty-string env overrides) raises ValueError during import and prevents the inference routes from loading at startup. Since these are operational knobs, a single misconfigured deployment variable can take the whole API down before request handling begins; use the same guarded parsing pattern already used elsewhere in this codebase.
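A guarded-parsing sketch (standalone; the helper name is illustrative, though the env var comes from the diff above): fall back to the default on unset, empty, or non-numeric values instead of raising at import time.

```python
import os

def env_int(name: str, default: int) -> int:
    # A missing, empty, or non-numeric override returns the default rather
    # than raising ValueError during module import and blocking route loading.
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default

os.environ["UNSLOTH_WIKI_RAG_MAX_PAGES"] = ""  # a misconfigured override
print(env_int("UNSLOTH_WIKI_RAG_MAX_PAGES", 8))  # 8, instead of a startup crash
```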



P1: Restrict URL validation to http/https schemes

validate_url() is documented and messaged as allowing only HTTP(S), but _ALLOWED_SCHEMES currently includes "file", which means file://... inputs are accepted and then read by safe_fetch(). In any flow that calls graphify ingest with user-provided URLs, this allows local file exfiltration (for example file:///etc/passwd) into generated artifacts.

