This file provides guidance to Claude Code (claude.ai/code) when working with
code in this repository. Any design docs that you write can be stored in the
design/ folder (which is ignored by git).
We use a justfile as the single task runner. Run `just` to see all available
commands, or `just <command>` to execute one. All recipes use `uv run` to
guarantee the lockfile venv is used (matching CI).
- Format code: `just format` (ruff fix + format via `uv run`)
- Lint code: `just lint` (ruff check + format verification)
- Run CI checks locally: `just ci` (exactly what GitHub Actions runs)
- Install dependencies: `uv sync`
- Process articles: `just process` (CLI interface to main functionality)
- Check database: `just check` (view article statistics)
- Start web interface: `just frontend` (FastHTML web UI on localhost:5001)
- Reset processing: `just reset` (reset article processing status)
- Domain management: `just domains` / `just init <name>`
- Run `just test` to execute the suite under `tests/`.
- Pass extra pytest args: `just test -k test_merger --tb=short`
- CI runs lint + tests on every PR via `.github/workflows/test.yml`.
- Key coverage areas include embedding similarity, lexical blocking, per-type threshold resolution, embedding fingerprints, entity mergers, merge dispute agent routing, extraction caching/retry, canonical name selection, name variant detection, profile grounding, profile versioning, privacy mode enforcement, LLM multi-tool-call recovery, and frontend version navigation.
- Async tests (`@pytest.mark.asyncio`) are excluded in CI due to a missing pytest-asyncio configuration.
The CLI entry point src/process_and_extract.py coordinates the pipeline end to end using a parallel producer/consumer model:
- Configuration Loading: `DomainConfig` reads `configs/<domain>/` to resolve Parquet input paths and output directories.
- Article Loading: PyArrow loads the domain's article table and normalises rows before processing.
- Relevance Checking: `ArticleProcessor.check_relevance` calls the helpers in `src/engine/relevance.py` (Gemini or Ollama) to skip irrelevant sources.
- Parallel Extraction: Multiple extraction workers process articles concurrently via `ThreadPoolExecutor`. Within each article, the 4 entity-type extractions also run in parallel. A shared LLM semaphore (`configure_llm_concurrency()`) bounds cloud API concurrency.
- Extraction Caching: `ExtractionSidecarCache` (`src/utils/extraction_cache.py`) stores extraction outputs as JSON files keyed on content hash, model, entity type, prompt hash, schema hash, and temperature. Version-based invalidation lets you bump the cache version in config to force re-extraction without deleting files.
- Quality Controls & Retry: `src/utils/quality_controls.py` validates extraction output (required fields, name normalisation, within-article dedup) and profile quality (min length, citation regex, tag count, confidence range). Failures are captured in `PhaseOutcome` objects (`src/utils/outcomes.py`). When severe QC flags (`zero_entities`, `high_drop_rate`, `many_duplicates`, `many_low_quality_names`) are detected, the extractor automatically retries once with a repair hint.
- Entity Merging: `EntityMerger` follows an evidence-first cost structure, in which cheap checks run before expensive LLM calls:
  - Lexical blocking: RapidFuzz pre-filters candidates using configurable thresholds from the `dedup` section of domain config.
  - Batched embeddings: `embed_batch_result_sync()` computes vectors for all new entities at once rather than one-by-one.
  - Similarity scoring: Per-entity-type similarity thresholds and embedding fingerprints (`"{model}:{dim}"`) ensure model-change detection.
  - Match checking: LLM-based match verification only runs for candidates that pass the cheap filters.
  - Merge dispute agent: When a match result falls in the "gray band" (similarity within ±`MERGE_GRAY_BAND_DELTA` of threshold) with confidence below `MERGE_UNCERTAIN_CONFIDENCE_CUTOFF`, `MergeDisputeAgent` provides a second-stage LLM analysis that can override the initial merge/skip decision. Deferred cases are written to a review queue JSONL file.
  - Canonical name selection: 5-layer deterministic scoring (`score_canonical_name()` in `src/utils/name_variants.py`) picks the best display name, penalizing acronyms, generic phrases, and contextual suffixes.
- Profile Versioning: `src/engine/profiles.py` maintains `VersionedProfile` history whenever entity content changes.
- Profile Grounding: Post-processing verification (`verify_profile_grounding()`) extracts citation markers from profile text, looks up the cited source articles, and uses an LLM to check whether each claim is supported. The `GroundingReport` includes a grounding score and per-claim support levels.
- Persistence: Batched Parquet writes per entity type via `write_entities_table()` avoid write amplification. Article processing status is tracked in a sidecar JSON file (`ProcessingStatus`) rather than rewriting the articles Parquet.
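The gray-band rule for the merge dispute agent can be sketched as a small predicate. This is only an illustration under stated assumptions: the numeric values below are hypothetical placeholders (the real settings come from the `dedup` section of the domain config), and `needs_dispute_review` is an illustrative name, not a function from the codebase.

```python
# Hypothetical placeholder values; the real ones are configured per domain.
MERGE_GRAY_BAND_DELTA = 0.05
MERGE_UNCERTAIN_CONFIDENCE_CUTOFF = 0.75


def needs_dispute_review(
    similarity: float,
    threshold: float,
    confidence: float,
    delta: float = MERGE_GRAY_BAND_DELTA,
    cutoff: float = MERGE_UNCERTAIN_CONFIDENCE_CUTOFF,
) -> bool:
    """Return True when a match is near the threshold AND low-confidence."""
    # Gray band: similarity lies within +/- delta of the per-type threshold.
    in_gray_band = abs(similarity - threshold) <= delta
    # Only uncertain gray-band matches are escalated to the second-stage agent.
    return in_gray_band and confidence < cutoff
```

Matches well above or below the threshold, or matches the checker is already confident about, skip the second-stage LLM call, which is consistent with the cheap-checks-first ordering described above.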
- Input: Domain configs point to Parquet files with columns such as `id`, `title`, `content`, `url`, and `published_date`.
- Processing: `ArticleProcessor` orchestrates extraction for four entity types, records reflection metadata, runs QC checks (with retry), and keeps track of processing status. Each phase returns a `PhaseOutcome` carrying success/failure context. A single merge actor (the main thread) consumes extraction results in article order and is the only writer to shared state, so no locking is needed.
- Output: Each run updates people/organizations/locations/events tables under the domain's output directory (`DomainConfig.get_output_dir()`).
- Storage: Entities include embeddings (with model/dimension/fingerprint metadata), provenance metadata, processing timestamps, profile version histories, and optional grounding reports.
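The single-merge-actor pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `extract` and `run_pipeline` are hypothetical stand-ins, and no LLM is called.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def extract(article: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical stand-in for per-article extraction (no LLM call)."""
    return {"article_id": article["id"], "entity": article["title"].upper()}


def run_pipeline(articles: List[Dict[str, str]], workers: int = 4) -> List[Dict[str, str]]:
    # Shared state: only the main thread ever appends to this list.
    merged: List[Dict[str, str]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, even when workers
        # finish out of order, so merging follows article order.
        for result in pool.map(extract, articles):
            merged.append(result)  # single writer, so no lock is needed
    return merged
```

Because workers only return values and never touch `merged`, the consumer loop is the one place where shared state changes.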
- Cloud Models: Defaults come from `CLOUD_MODEL` in `src/constants.py` (default: `gemini/gemini-2.0-flash`) and are executed through provider-routed SDK clients (OpenAI SDK for OpenAI-compatible endpoints, Anthropic SDK for Claude models) with Instructor wrappers in `src/utils/llm.py`. Provider routing is handled by `src/utils/provider_routing.py`. Multi-tool-call recovery handles Instructor edge cases.
- Local Models: Ollama is accessed via `OLLAMA_MODEL` (default: `ollama/qwen2.5:32b-instruct-q5_K_M`) for extraction, relevance checks, and match verification. Both models can be overridden via environment variables (`HINBOX_CLOUD_MODEL`, `HINBOX_OLLAMA_MODEL`).
- Embeddings: `EmbeddingManager` (`src/utils/embeddings/manager.py`) chooses cloud/local/hybrid providers and caches vectors for similarity scoring. When `--local` is active, `ensure_local_embeddings_available()` enforces local-only mode.
- Structured Output: Dynamic Pydantic models in `src/dynamic_models.py` and list factories enforce schema consistency for both cloud and local responses.
- Privacy Mode: The `--local` CLI flag calls `disable_llm_callbacks()` to disable all telemetry and forces local embedding mode.
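The environment-variable overrides for the two model defaults can be pictured roughly as follows. This is a sketch under assumptions: `resolve_models` and the `DEFAULT_*` names are hypothetical, and the actual lookup in `src/constants.py` may be structured differently.

```python
import os
from typing import Mapping, Optional, Tuple

# Defaults as documented above; the constant names here are illustrative.
DEFAULT_CLOUD_MODEL = "gemini/gemini-2.0-flash"
DEFAULT_OLLAMA_MODEL = "ollama/qwen2.5:32b-instruct-q5_K_M"


def resolve_models(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (cloud_model, local_model), honouring env overrides if set."""
    source = os.environ if env is None else env
    cloud = source.get("HINBOX_CLOUD_MODEL", DEFAULT_CLOUD_MODEL)
    local = source.get("HINBOX_OLLAMA_MODEL", DEFAULT_OLLAMA_MODEL)
    return cloud, local
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the lookup easy to exercise in tests.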
The web interface (src/frontend/) uses FastHTML with an "Archival Elegance" design theme:
- Routes: Modular route handlers in `routes/` (home, people, organizations, locations, events)
- Data Access: Centralized data loading from Parquet files (`data_access.py`)
- Filtering: Search and filter utilities (`filters.py`)
- Components & Helpers: Shared UI building blocks (`components.py`), including confidence badges, version selectors, tag pills, and alias display, plus helpers for profile versions (`entity_helpers.py`)
- Configuration: App setup, shared state, and `main_layout()` sidebar+content layout (`app_config.py`)
- Design System: Crimson Pro headings, IBM Plex Sans body, warm teal-slate primary (`#2c5f7c`), amber accent (`#c97b3a`), with CSS variables in `static/styles.css`
- Static Assets: CSS and font loading under `static/`
- Engine Modules: `article_processor.py`, `extractors.py`, `mergers.py`, `match_checker.py`, `merge_dispute_agent.py`, and `profiles.py` are surfaced via `src/engine/__init__.py` for a stable import path.
- LLM Helpers: `src/utils/llm.py` and `src/utils/extraction.py` wrap SDK/Instructor interactions. `src/utils/provider_routing.py` resolves model prefixes to SDK targets. Multi-tool-call recovery in `llm.py` handles Instructor edge cases.
- Embeddings: Providers, manager, and similarity helpers live in `src/utils/embeddings/`.
- Caching: `src/utils/extraction_cache.py` provides the persistent sidecar cache; `src/utils/cache_utils.py` has a thread-safe LRU cache and stable hashing helpers shared across modules.
- Name Handling: `src/utils/name_variants.py` provides deterministic name normalisation, acronym detection/generation, equivalence expansion, and canonical name scoring, used by both QC and the merge pipeline.
- Quality Controls: `src/utils/quality_controls.py` (extraction QC, profile QC, profile grounding verification) and `src/utils/outcomes.py` (`PhaseOutcome` structured results) provide deterministic validation.
- Processing Status: `src/utils/processing_status.py` manages a sidecar JSON file tracking which articles have been processed, replacing the old in-Parquet status approach.
- Tests: `tests/` covers embedding accuracy, merger behaviour (lexical blocking, per-type thresholds, fingerprints), merge dispute agent routing, extraction caching and retry, canonical name selection, name variant detection, profile grounding, profile versioning, privacy mode enforcement, domain path resolution, and frontend history rendering.
- Scripts: Utility scripts in `scripts/` support domain management (`init_domain.py`, `list_domains.py`), data fetching, resets, and diagnostics.
- Models: Configured in `src/constants.py` with cloud/local model specifications. Override defaults via `HINBOX_CLOUD_MODEL` and `HINBOX_OLLAMA_MODEL` env vars.
- Dedup: Per-entity-type similarity thresholds, lexical blocking, and merge gray-band/confidence settings configured in the `dedup` section of `configs/<domain>/config.yaml`
- Extraction Cache: Version-based invalidation via `cache.extraction.version` in domain config; cache files live under `{output_dir}/cache/extractions/v{version}/`
- Logging: Structured Rich-based logging in `src/logging_config.py` with colour-coded decision lines (`DecisionKind`: NEW, MERGE, SKIP, DISPUTE, DEFER, ERROR) and gated profile panels (`--show-profiles`)
- Privacy: The `--local` flag calls `disable_llm_callbacks()` and forces local embedding mode. The `_CALLBACKS_ENABLED` flag in `constants.py` controls telemetry.
- Environment: Requires `GEMINI_API_KEY` for cloud processing, optional `OLLAMA_API_URL` for local
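To illustrate how version-based invalidation interacts with the cache layout, the sketch below derives a cache file path from the key fields named earlier (content hash, model, entity type, prompt hash, schema hash, temperature). The function name, the use of SHA-256, and the `<key>.json` file naming are assumptions made for illustration; only the `{output_dir}/cache/extractions/v{version}/` directory layout comes from this document.

```python
import hashlib
import json
from pathlib import Path


def _digest(text: str) -> str:
    """Stable hex digest of a string (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def extraction_cache_path(
    output_dir: str,
    version: int,
    *,
    content: str,
    model: str,
    entity_type: str,
    prompt: str,
    schema: str,
    temperature: float,
) -> Path:
    """Hypothetical helper: map the cache key fields to a JSON file path."""
    key_fields = {
        "content_hash": _digest(content),
        "model": model,
        "entity_type": entity_type,
        "prompt_hash": _digest(prompt),
        "schema_hash": _digest(schema),
        "temperature": temperature,
    }
    # sort_keys makes the serialised key independent of dict ordering.
    key = _digest(json.dumps(key_fields, sort_keys=True))
    # Bumping the version changes the directory, so old files are ignored
    # rather than deleted.
    return Path(output_dir) / "cache" / "extractions" / f"v{version}" / f"{key}.json"
```

Under this layout, changing any key field or bumping the version yields a different path, which forces re-extraction while leaving earlier cache files on disk.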
- When finishing a chunk of work, check with the user to confirm the fix, then:
  - Run `just format` to auto-fix formatting
  - Run `just lint` to verify no remaining issues
  - Run `just test` to execute the test suite
  - Commit and push changes
- Before pushing / opening a PR, run `just ci`, which executes the exact same checks as GitHub Actions. All justfile recipes use `uv run` so there are no version mismatches between local and CI.
- The application has no users yet, so don't worry too much about backwards compatibility. Just make it work.
- When using type hints for dicts or tuples, prefer `typing.Dict`/`typing.Tuple` over the built-in generics for consistency with existing code.
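As a short illustration of the type-hint convention, the function below uses `typing.Dict`, `typing.List`, and `typing.Tuple` rather than `dict[...]` or `tuple[...]`. The function itself is a made-up example, not code from the repository.

```python
from typing import Dict, List, Tuple


def count_by_type(entities: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count (entity_type, name) pairs per entity type."""
    counts: Dict[str, int] = {}
    for entity_type, _name in entities:
        counts[entity_type] = counts.get(entity_type, 0) + 1
    return counts
```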
