Add persistent wiki-RAG pipeline with vendored Graphify, lint-driven enrichment, and debug/maintenance APIs#5103
Conversation
Code Review
This pull request introduces 'graphify,' a comprehensive tool for generating knowledge graphs from code, documentation, and research papers to facilitate codebase understanding. It includes AST extraction for numerous languages, community detection, and various export formats like interactive HTML and Obsidian vaults, while also integrating a wiki-based RAG system into the inference backend. Feedback focused on critical security and performance improvements, including restricting allowed URL schemes to prevent SSRF, correcting text sanitization regexes, ensuring path consistency through environment variables, adopting timezone-aware timestamps, and optimizing file processing and graph analysis for large datasets.
The _ALLOWED_SCHEMES set includes "file", which contradicts the docstring for validate_url and the security model described in SECURITY.md. Allowing the file scheme in a tool that fetches arbitrary URLs can lead to Server-Side Request Forgery (SSRF) or local file disclosure if a user-provided URL or a malicious redirect points to a local file. This should be restricted to http and https only.
```diff
- _ALLOWED_SCHEMES = {"http", "https", "file"}
+ _ALLOWED_SCHEMES = {"http", "https"}
```
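A minimal sketch of the restricted validator, assuming `validate_url` signals rejection with `ValueError` (the real function's error handling may differ):

```python
from urllib.parse import urlparse

_ALLOWED_SCHEMES = {"http", "https"}

def validate_url(url: str) -> str:
    """Reject any URL whose scheme is not plain HTTP(S), e.g. file:// or ftp://."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in _ALLOWED_SCHEMES:
        raise ValueError(f"Disallowed URL scheme: {scheme!r}")
    return url
```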
| "---", | ||
| f'type: "{query_type}"', | ||
| f'date: "{now.isoformat()}"', | ||
| f'question: "{re.sub(chr(10) + chr(13), " ", question).replace(chr(34), chr(39))}"', |
The regex chr(10) + chr(13) matches the literal sequence \n\r. This will not match standard Unix (\n) or Windows (\r\n) line endings. To correctly sanitize the question for YAML frontmatter and prevent it from breaking the format, you should replace all carriage returns and newlines with spaces.
```diff
- f'question: "{re.sub(chr(10) + chr(13), " ", question).replace(chr(34), chr(39))}"',
+ f'question: "{re.sub(r"[\r\n]+", " ", question).replace(chr(34), chr(39))}"',
```
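The difference is easy to demonstrate: `[\r\n]+` collapses any run of line breaks, while the literal `\n\r` pattern matches almost nothing in practice:

```python
import re

question = "line one\nline two\r\nline three"

# The original pattern only matches a literal "\n\r" sequence,
# which occurs in neither Unix ("\n") nor Windows ("\r\n") text.
broken = re.sub("\n\r", " ", question)

# The fixed pattern collapses any run of CR/LF characters into one space.
fixed = re.sub(r"[\r\n]+", " ", question)
```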
```python
from core.wiki.manager import WikiManager
from core.wiki.ingestor import WikiIngestor
from pathlib import Path

self.vault_root = Path("/tmp/unsloth_wiki")
```
The vault_root is hardcoded to /tmp/unsloth_wiki, which is not persistent. Additionally, studio/backend/core/inference/tools.py uses an environment variable UNSLOTH_WIKI_VAULT for the same purpose. This should be updated to respect the environment variable for consistency and persistence.
```diff
- self.vault_root = Path("/tmp/unsloth_wiki")
+ self.vault_root = Path(os.getenv("UNSLOTH_WIKI_VAULT", "/tmp/unsloth_wiki"))
```
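One way to keep the worker, the routes, and the watcher on the same vault is a single resolver helper (a sketch; the env var name comes from the existing code, the helper itself is hypothetical):

```python
import os
from pathlib import Path

def resolve_wiki_vault() -> Path:
    """Single source of truth for the vault root across all wiki components."""
    return Path(os.getenv("UNSLOTH_WIKI_VAULT", "/tmp/unsloth_wiki"))
```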
```python
import datetime
timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
```
datetime.datetime.now() creates a naive datetime object using the local system time. It is a best practice to use timezone-aware datetimes (e.g., UTC) to ensure consistency across different server environments and avoid ambiguity during daylight saving time transitions.
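For example, an aware UTC timestamp formats identically but carries an explicit offset, so it never shifts across servers or DST transitions:

```python
import datetime

now = datetime.datetime.now(datetime.timezone.utc)
assert now.tzinfo is not None  # aware, not naive

timestamp = now.strftime("%Y-%m-%d %H:%M:%S")
```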
```python
import datetime
timestamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
```

```python
try:
    if path.suffix.lower() == ".pdf":
        return len(extract_pdf_text(path).split())
    return len(path.read_text(errors="ignore").split())
```
path.read_text() reads the entire file into memory at once. If the corpus contains very large files (e.g., large documentation or source files), this could lead to excessive memory usage or a MemoryError. It is safer to read the file line by line or in chunks when counting words.
```python
with path.open("r", errors="ignore") as f:
    return sum(len(line.split()) for line in f)
```

```python
betweenness = nx.betweenness_centrality(G)
# Top bridge nodes that are NOT file-level hubs
```
nx.betweenness_centrality(G) is a computationally expensive operation (O(V·E) on unweighted graphs with Brandes' algorithm) and can become a bottleneck on large dependency graphs. Consider approximating it by sampling pivot nodes via the `k` parameter, or gating the exact computation behind a graph-size threshold.
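As a sketch (assuming networkx is available), the `k` parameter samples pivot sources so the cost scales with the sample size rather than the node count:

```python
import networkx as nx

G = nx.karate_club_graph()

# Exact betweenness runs Brandes' algorithm from every node: O(V*E).
exact = nx.betweenness_centrality(G)

# Approximate by sampling k pivot sources; cost scales with k, not V.
approx = nx.betweenness_centrality(G, k=10, seed=42)
```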
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 583d65717d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
else:
    template_messages = messages

# --- RAG Injection ---
```
Copy messages before injecting RAG system context
When system_prompt is empty, template_messages aliases the original messages list, so the later insert(0, context_message) mutates the caller’s conversation in place. In this flow, _save_chat_history_to_wiki(messages) then persists the injected RAG block as if it were user conversation, which can recursively pollute future retrieval context and inflate prompts across turns. Build template_messages as a copy before insertion to keep prompt-only context out of persisted chat history.
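A minimal sketch of the fix (helper name and `context_message` shape are illustrative): copy before inserting so the prompt-only context never reaches the caller's list.

```python
def build_template_messages(messages, context_message, system_prompt=""):
    """Inject RAG context without mutating the caller's conversation."""
    # Copy first: insert(0, ...) on the alias would leak into saved history.
    template_messages = list(messages)
    if not system_prompt:
        template_messages.insert(0, context_message)
    return template_messages
```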
```python
from core.wiki.manager import WikiManager
from core.wiki.ingestor import WikiIngestor
from pathlib import Path

self.vault_root = Path("/tmp/unsloth_wiki")
```
Use configured wiki vault path in backend worker
This hardcodes the backend worker to /tmp/unsloth_wiki while the route layer and watcher use UNSLOTH_WIKI_VAULT. If operators set a custom vault path, non-GGUF inference reads/writes a different wiki than /wiki/* endpoints and startup watcher maintenance, causing split state and missing RAG context on one path. The worker should resolve the same env-configured vault root as the rest of the wiki stack.
```python
_, ingestor = _get_route_wiki_components()
ingestor.ingest_file(file_path, contributor="Unsloth Studio")
```
Avoid double-ingesting chat history files
After writing a chat-history batch to raw/, this code ingests the file immediately, but main.py enables WikiIngestionWatcher by default on the same raw/ directory. In that default setup, each flushed chat-history file is ingested twice (once here, once by the watcher), causing redundant index rebuilds/log churn and unnecessary maintenance work. Choose a single ingestion path (watcher or direct ingest) or mark these files as already handled.
for more information, see https://pre-commit.ci
💡 Codex Review
Reviewed commit: b8fe16c3de
```python
def get_current_model(self) -> Optional[str]:
    """Get currently active model name"""
    return self.active_model_name

def _get_rag_context(self, query: str) -> str:
```
Restore get_current_model on InferenceBackend
check_vision_model_compatibility() still calls self.get_current_model(), but this commit removed that helper when this section was replaced with wiki/RAG methods. Any caller that hits the vision compatibility helper now gets AttributeError instead of a boolean compatibility result, which breaks the vision-guard flow at runtime.
```python
filename = f"chat_history_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.md"
file_path = self.vault_root / "raw" / filename
```
Use collision-proof filenames for chat history flushes
This flush path uses a filename with only second-level precision and then writes with write_text, so two flushes in the same second target the same file and the later flush overwrites earlier buffered snapshots. That is reachable under concurrent traffic or when UNSLOTH_WIKI_CHAT_HISTORY_FLUSH_SECONDS=0, causing silent history loss and weaker downstream RAG context.
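A sketch of a collision-proof naming scheme (the helper name is illustrative): keep the sortable timestamp, but append a random suffix so simultaneous flushes never target the same file.

```python
import datetime
import uuid

def chat_history_filename() -> str:
    """Second-precision timestamps can collide; add a short random suffix."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"chat_history_{stamp}_{uuid.uuid4().hex[:8]}.md"
```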
```python
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
```
Handle missing watchdog dependency for default watcher mode
The watcher module imports watchdog at module scope, so environments without that extra cannot import core.wiki.watcher at all. Because watcher startup is enabled by default, this leads to startup falling back to the exception path and silently disabling automatic raw-file ingestion/analysis unless watchdog is preinstalled (it is not listed in the backend default requirements file).
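One common pattern is to guard the import at module scope and expose availability explicitly, so startup can log a clear message instead of failing to import the watcher module at all (a sketch; the helper is hypothetical):

```python
try:
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler
    WATCHDOG_AVAILABLE = True
except ImportError:
    Observer = None
    FileSystemEventHandler = object  # harmless base when watchdog is absent
    WATCHDOG_AVAILABLE = False

def start_watcher_if_possible(start) -> bool:
    """Call start() only when watchdog is importable; report whether it ran."""
    if not WATCHDOG_AVAILABLE:
        return False
    start()
    return True
```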
for more information, see https://pre-commit.ci
💡 Codex Review
Reviewed commit: 877bb00ba2
```python
# Save history for future RAG
self._save_chat_history_to_wiki(messages)
```
Skip RAG/history hooks for internal wiki LLM calls
generate_chat_response now always runs _get_rag_context(...) and _save_chat_history_to_wiki(messages), but this method is also used by _route_wiki_llm_stub for wiki ingestion/enrichment/retry flows (not just user chat). In that path, internal maintenance prompts are persisted as chat history and re-ingested into the wiki, which pollutes retrieval with non-conversational prompt scaffolding and can recursively bloat the RAG corpus over time. Add an opt-out flag (or a dedicated internal generation path) so only user-facing chat requests write history / inject RAG.
```python
role = msg.get("role", "unknown").capitalize()
content = msg.get("content", "").strip()
if content:
```
Normalize message content types before buffering chat history
This assumes every message content is a string and calls .strip(). The legacy /generate/stream route forwards GenerateRequest.messages as raw OpenAI-format dicts, so content can be non-string (e.g., multimodal arrays/tool payloads); in those cases this raises AttributeError, the outer except swallows it, and the entire snapshot is dropped. Coerce/serialize non-string content before trimming so history persistence remains reliable across valid message shapes.
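A sketch of one normalization approach (the helper is hypothetical): coerce the OpenAI-style content field to a string before trimming, keeping text parts of multimodal arrays and serializing anything else.

```python
import json

def coerce_content(content) -> str:
    """Normalize OpenAI-style message content to a plain string."""
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        # Multimodal arrays: keep the text parts, drop image/tool payloads.
        parts = [p.get("text", "") for p in content if isinstance(p, dict)]
        return " ".join(t for t in parts if t).strip()
    if content is None:
        return ""
    return json.dumps(content)
```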
Add an explicit generation flag to bypass RAG/history hooks for internal wiki maintenance prompts, and harden chat history buffering by normalizing non-string message content.
💡 Codex Review
Reviewed commit: 9bfe81a205
```python
max_new_tokens = _WIKI_LLM_MAX_TOKENS,
repetition_penalty = 1.0,
cancel_event = None,
enable_wiki_rag_history = False,
```
Remove unsupported wiki-history flag from orchestrator call
_route_wiki_llm_stub passes enable_wiki_rag_history=False to backend.generate_chat_response, but this route gets its backend from get_inference_backend() (the orchestrator), whose generate_chat_response does not accept that keyword. The resulting TypeError is swallowed by the local except, so transformer-backed wiki calls silently fall back to returning the prompt text instead of model output, degrading /wiki/query, enrichment, and fallback-retry behavior whenever GGUF is not active.
```python
self._processed_mtime_ns[resolved] = mtime_ns
self._processed_hash[resolved] = file_hash

logger.info(f"New file detected in wiki raw directory: {file_path}")
title = self.ingestor.ingest_file(file_path, contributor = self.contributor)
```
Record watcher hashes only after successful ingest
The watcher marks a file as processed (_processed_mtime_ns/_processed_hash) before ingest_file succeeds. If ingestion fails transiently (e.g., file still being finalized), later events with unchanged content are skipped as duplicates, so that raw file may never be ingested unless it is edited again. This breaks reliable at-least-once ingestion for newly dropped files.
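The shape of the fix can be sketched as follows (parameter names are illustrative): ingest first, and record the dedup state only once ingestion succeeds, so a transient failure leaves the file eligible for retry.

```python
def process_file(resolved, mtime_ns, file_hash, ingest, seen_mtime, seen_hash):
    """Ingest first; record dedup state only on success so failures retry."""
    try:
        title = ingest(resolved)
    except Exception:
        # Leave the file unmarked: a later event with the same content retries.
        return None
    seen_mtime[resolved] = mtime_ns
    seen_hash[resolved] = file_hash
    return title
```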
@zohairshafi can you please address the comments made by the automated reviewers (the valid ones)? Can you also share screenshots of what this functionality looks like? Also, why Graphify?
Add dry-run/apply duplicate merge maintenance for entity/concept pages, including archival and wikilink rewrites. Also apply source-exclusion earlier in retrieval/rerank and compact index planning when source pages are disabled.
for more information, see https://pre-commit.ci
💡 Codex Review
Reviewed commit: 740bdd76a8
```python
return (
    lowered in self._SKIPPED_LOCAL_FILENAMES
    or name.startswith("._")
    or name.startswith(".")
```
Ignore hidden subdirectories when scanning raw ingest candidates
ingest_pending_raw_files() recursively scans raw/ and relies on should_skip_local_file(), but this predicate only checks the basename and does not exclude files under hidden directories (for example raw/.archive/...). After /wiki/archive/stale moves raw files into .archive, those files become eligible for re-ingestion on the next pending-ingest run, which can silently resurrect archived content and churn duplicate wiki pages.
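One way to extend the predicate is to test every path component relative to the raw root, not just the basename (a sketch; the real should_skip_local_file takes a name, so this signature is illustrative):

```python
from pathlib import Path

def should_skip_local_file(path: Path, raw_root: Path) -> bool:
    """Skip hidden files AND anything under a hidden directory like .archive/."""
    rel = path.relative_to(raw_root)
    return any(part.startswith(".") for part in rel.parts)
```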
```python
if not ranked:
    ranked = self._rank_pages(question)
```
Honor analysis-page exclusion during empty ranking fallback
When include_analysis_pages_in_query is disabled, query() first filters out analysis/* pages, but if that leaves no candidates it immediately reruns _rank_pages(question) without reapplying the filter. In that case analysis pages are reintroduced despite the config flag, so deployments that disable analysis-page retrieval can still get self-referential analysis context.
💡 Codex Review
Reviewed commit: 152d0f4b5f
```python
cleaned = content.strip()
if not cleaned:
    raise ValueError(f"Ingestion produced empty content for {file_path}")
return file_path.stem, cleaned
```
Use collision-resistant titles for local wiki ingest
Returning only file_path.stem here makes distinct files with the same basename (for example raw/repoA/README.md and raw/repoB/README.md) share one source_title; downstream LLMWikiEngine.ingest_source slugs that title to the same sources/<slug>.md path, so later ingests silently overwrite earlier content. This causes real data loss and missing context whenever users ingest directory trees that contain common filenames.
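A sketch of a collision-resistant title derived from the path relative to the raw root (helper name is illustrative), so repoA/README.md and repoB/README.md get distinct slugs:

```python
from pathlib import Path

def source_title(file_path: Path, raw_root: Path) -> str:
    """Use the path relative to raw/ so repoA/README.md != repoB/README.md."""
    rel = file_path.relative_to(raw_root)
    return "-".join(rel.with_suffix("").parts)
```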
| "remember", | ||
| "token", | ||
| "previous message", |
Detect history intent with explicit phrases only
Including the bare keyword "token" in history-intent detection makes many non-history queries (for example tokenization, API token usage, token limits) flip into the history branch. In that branch, retrieval is biased toward or replaced by chat-history-* pages, so prompt context can become unrelated to the user’s actual topic and degrade answer quality. History mode should require explicit conversation-memory phrases instead of this broad match.
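A sketch of phrase-based detection (patterns are illustrative): require explicit conversation-memory phrases so topical words like "token" never flip the history branch.

```python
import re

_HISTORY_PATTERNS = [
    r"\bprevious (message|conversation)\b",
    r"\bchat history\b",
    r"\bearlier (you|we) (said|discussed)\b",
]

def is_history_query(question: str) -> bool:
    """Require explicit conversation-memory phrases, not bare keywords."""
    q = question.lower()
    return any(re.search(p, q) for p in _HISTORY_PATTERNS)
```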
@rolandtannous The issues should now be addressed in the latest commit. As for Graphify - I'm using it mainly as a pre-existing architecture so that I don't have to reimplement features for the wiki page merge/delete lifecycle.
This saved a lot of implementation/design time, roughly 1-2 days, mainly by avoiding re-inventing candidate selection, incremental rebuild logic, and maintenance safety checks. But now that I have everything running, if you want everything to be self-contained, I could try to implement these features natively. So far I've been using this branch to test a personal research assistant of sorts, to see how well the wiki can analyse and link together research papers across domains, and how it updates as I add new papers that extend or invalidate older ones. The screenshots below show this:
for more information, see https://pre-commit.ci
💡 Codex Review
Reviewed commit: d7d9f627c6
💡 Codex Review
Reviewed commit: 5b4f360ff0
- use full source identity for stale archive grouping
- watch raw tree recursively in WikiIngestionWatcher
- add regressions for stale grouping and recursive scheduling
💡 Codex Review
Reviewed commit: c05518fd16
```python
file_path.write_text(content, encoding = "utf-8")
self._pending_chat_history_blocks.clear()
self._chat_history_buffer_started_at = None
logger.info(f"Saved buffered chat history to {file_path}")
```
Ingest flushed chat batches when watcher is unavailable
This flush path writes buffered chat history to raw/ and immediately clears the in-memory buffer, but it never invokes wiki_ingestor to index the file. In transformer-backed mode, if UNSLOTH_WIKI_WATCHER=false (or watcher startup fails), these snapshots remain un-ingested, so _get_rag_context cannot retrieve recent history even though persistence appears successful. Add a direct-ingest fallback for this path when automatic watching is not active.
- default merge-maintenance max_merges to 512 via env
- add UNSLOTH_WIKI_MERGE_MAINTENANCE_MAX_MERGES wiring in model/manager/engine
- default GGUF prefill read timeout to 5 hours (env override retained)
for more information, see https://pre-commit.ci
💡 Codex Review
Reviewed commit: 5e0acb77f2
💡 Codex Review
Reviewed commit: 528804d275
```python
_RAG_MAX_PAGES = int(os.getenv("UNSLOTH_WIKI_RAG_MAX_PAGES", "8"))
_RAG_MAX_CHARS_PER_PAGE = int(os.getenv("UNSLOTH_WIKI_RAG_MAX_CHARS_PER_PAGE", "1800"))
_RAG_MAX_TOTAL_CHARS = int(os.getenv("UNSLOTH_WIKI_RAG_MAX_TOTAL_CHARS", "12000"))
```
Parse wiki env integers defensively
These module-level assignments call int(os.getenv(...)) directly, so any non-numeric value (including empty-string env overrides) raises ValueError during import and prevents the inference routes from loading at startup. Since these are operational knobs, a single misconfigured deployment variable can take the whole API down before request handling begins; use the same guarded parsing pattern already used elsewhere in this codebase.
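A sketch of guarded parsing (the helper name is illustrative, not necessarily the pattern used elsewhere in the codebase): fall back to the default on missing or malformed values instead of raising at import time.

```python
import os

def env_int(name: str, default: int) -> int:
    """Parse an integer env var, falling back on missing or malformed values."""
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default

_RAG_MAX_PAGES = env_int("UNSLOTH_WIKI_RAG_MAX_PAGES", 8)
```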
Restrict URL validation to http/https schemes
validate_url() is documented and messaged as allowing only HTTP(S), but _ALLOWED_SCHEMES currently includes "file", which means file://... inputs are accepted and then read by safe_fetch(). In any flow that calls graphify ingest with user-provided URLs, this allows local file exfiltration (for example file:///etc/passwd) into generated artifacts.









This PR implements a persistent wiki-RAG system for Unsloth Studio, inspired by Andrej Karpathy’s design notes.
It uses Graphify v3 tooling and includes the full integration needed for end-to-end operation inside this repository.
This change gives Unsloth users a practical way to evaluate their models inside a real wiki-RAG workflow, not just one-off prompts. It enables side-by-side benchmarking of model behavior in persistent knowledge pipelines, including how well each model helps build and maintain an evolving wiki.
What this PR includes
Why this change
This establishes a persistent, maintainable knowledge layer between raw sources and chat-time retrieval, improving retrieval quality, observability, and long-term wiki health.
Validation
Notes
This is a large change set because Graphify is intentionally vendored for runtime completeness.
Operational details, configuration options, and endpoint behavior are documented in updates.md.