Sunbelt Computer Software

Evaluation Protocol

This repository can run SpreadsheetLLM-style table-detection and QA experiments, but it does not bundle the original paper datasets or claim paper metric reproduction by default. Use this protocol to keep synthetic, reconstructed, and paper-original runs separate.

Claim Levels

Do not describe a result as paper-comparable unless the run record includes compatible dataset split metadata, encoder settings, coordinate mode, prompt serializer, model/backend, baseline status, and metric definition.

The metadata validator enforces this distinction through an explicit claim_level field. Directory-scanned local fixtures default to synthetic, and manifest-driven datasets default to reconstructed. The paper-original level is accepted only when the result record carries concrete dataset, split, model/backend, prompt serializer, coordinate mode, baseline, metric, encoder, and tokenizer metadata. A paper-original claim is rejected when its compression metrics were computed with the char/4 tokenizer fallback.
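
The acceptance rule above can be sketched as a small check. This is a minimal illustration, not the repository's validator: the function name check_claim_level, the metadata key names, and the "char/4" tokenizer label are assumptions chosen to mirror the fields listed in this section.

```python
from __future__ import annotations

# Hypothetical key names; the real result record may name these fields differently.
REQUIRED_FOR_PAPER_ORIGINAL = (
    "dataset", "split", "model_backend", "prompt_serializer",
    "coordinate_mode", "baseline", "metric", "encoder", "tokenizer",
)
KNOWN_LEVELS = {"synthetic", "reconstructed", "paper-original"}


def check_claim_level(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the claim is acceptable."""
    level = metadata.get("claim_level", "synthetic")
    if level not in KNOWN_LEVELS:
        return [f"unknown claim_level: {level!r}"]
    if level != "paper-original":
        # Synthetic and reconstructed claims need no extra evidence in this sketch.
        return []
    problems = [
        f"missing {key}" for key in REQUIRED_FOR_PAPER_ORIGINAL
        if not metadata.get(key)
    ]
    # Assumption: the fallback tokenizer is recorded under the label "char/4".
    if metadata.get("tokenizer") == "char/4":
        problems.append("compression metrics used the char/4 tokenizer fallback")
    return problems
```

Returning a list of problems rather than a boolean lets a run record say why a paper-original claim was downgraded.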

Paper Benchmark Shape

The paper reports two main evaluation surfaces:

Table detection: 188 spreadsheets, 311 tables, split into Small, Medium, Large, and Huge by token usage, evaluated with EoB-0 exact boundary matching.
Spreadsheet QA: 64 spreadsheets and 307 QA items, with answers represented as cell addresses or formulas.
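
As an illustration of exact boundary matching, the sketch below counts a predicted range as correct only when it equals a ground-truth range exactly, then derives precision, recall, and F1. The helper names and the normalization step (dropping "$" and spaces, uppercasing) are assumptions; the metric code in this repository may differ in detail.

```python
def normalize_range(a1_range: str) -> str:
    """Canonicalize an A1-style range so that 'a1:$d$8' and 'A1:D8' compare equal."""
    return a1_range.replace("$", "").replace(" ", "").upper()


def eob0_scores(predicted, expected):
    """Return (precision, recall, f1) under exact boundary matching."""
    pred = {normalize_range(r) for r in predicted}
    gold = {normalize_range(r) for r in expected}
    hits = len(pred & gold)  # a prediction counts only if every boundary matches
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

A prediction that is off by a single row, such as F1:G5 against F1:G6, scores as a miss under this definition.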

These assets are not bundled. A manifest-based run can represent equivalent or reconstructed data, but the output must remain labeled as reconstructed unless the original paper assets and procedures are available.

Table-Detection Manifest

Use run_llm_evaluation.py --manifest path/to/table_manifest.json for manifest-driven table detection. Relative workbook paths are resolved from the manifest file location.

{
  "dataset_name": "synthetic_tables_v1",
  "dataset_version": "1",
  "claim_level": "synthetic",
  "split_name": "test",
  "items": [
    {
      "spreadsheet_path": "book1.xlsx",
      "tables": [
        {"range": "A1:D8"}
      ]
    }
  ]
}
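
The path rule can be sketched as follows. This is an illustrative loader, not the one in run_llm_evaluation.py; the function name load_table_manifest is hypothetical, and only the keys shown in the manifest above are read.

```python
import json
from pathlib import Path


def load_table_manifest(manifest_path):
    """Return (workbook_path, [ranges]) pairs with paths resolved from the manifest."""
    manifest_file = Path(manifest_path).resolve()
    manifest = json.loads(manifest_file.read_text(encoding="utf-8"))
    base_dir = manifest_file.parent
    resolved = []
    for item in manifest["items"]:
        workbook = Path(item["spreadsheet_path"])
        if not workbook.is_absolute():
            # Relative paths are anchored at the manifest, not the working directory.
            workbook = base_dir / workbook
        ranges = [table["range"] for table in item.get("tables", [])]
        resolved.append((workbook, ranges))
    return resolved
```

Anchoring at the manifest location keeps a dataset directory relocatable: the manifest and its workbooks can move together without edits.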

Example:

python run_llm_evaluation.py --manifest datasets/synthetic/table_manifest.json \
  --backend echo \
  --echo-response "[{'range': 'A1:D8'}]" \
  --out-record runs/table_eval.json

The result record includes evaluation_metadata with dataset, split, encoder, coordinate, prompt, backend, baseline, and metric fields.

QA Manifest

Use run_qa_evaluation.py --manifest path/to/qa_manifest.json for manifest-driven QA. QA pairs may include answer_type for type-aware exact match normalization.

{
  "dataset_name": "synthetic_qa_v1",
  "dataset_version": "1",
  "claim_level": "synthetic",
  "split_name": "test",
  "items": [
    {
      "spreadsheet_path": "book1.xlsx",
      "qa_pairs": [
        {
          "question": "Which cell contains total revenue?",
          "answer": "[D8]",
          "answer_type": "cell_address",
          "table_range": "A1:D8"
        }
      ]
    }
  ]
}

Supported answer_type values:

cell_address: trim and uppercase.
formula: trim, remove whitespace, and uppercase.
free_text: collapse whitespace and casefold.
literal: trim only. This is the default when unspecified.
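
The four normalization rules can be sketched in a few lines. This is a minimal illustration of the rules as listed; the function names normalize_answer and exact_match are assumptions, and the repository's implementation may handle edge cases differently.

```python
import re


def normalize_answer(value: str, answer_type: str = "literal") -> str:
    """Normalize an answer string according to its answer_type."""
    if answer_type == "cell_address":
        return value.strip().upper()
    if answer_type == "formula":
        # Removes all whitespace, including any inside quoted string literals.
        return re.sub(r"\s+", "", value).upper()
    if answer_type == "free_text":
        return " ".join(value.split()).casefold()
    if answer_type == "literal":
        return value.strip()
    raise ValueError(f"unsupported answer_type: {answer_type!r}")


def exact_match(predicted: str, expected: str, answer_type: str = "literal") -> bool:
    """Compare two answers after applying the same normalization to both."""
    return normalize_answer(predicted, answer_type) == normalize_answer(expected, answer_type)
```

For example, exact_match(" [d8] ", "[D8]", "cell_address") is true, while the same pair compared as literal is false because literal only trims.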

Example:

python run_qa_evaluation.py --manifest datasets/synthetic/qa_manifest.json \
  --backend echo \
  --echo-response "[{'range': 'A1:D8'}]" \
  --out-record runs/qa_eval.json

Fine-Tuning Metadata

Use prepare_finetuning_data.py --metadata-output finetune_manifest.json to write a sidecar describing the generated JSONL. Each JSONL record also includes per-sheet metadata with original ranges, compact prompt ranges, and coordinate mode so training targets can be audited later.

python prepare_finetuning_data.py datasets/train output/finetune.jsonl \
  --metadata-output output/finetune_manifest.json

Before making parity claims about fine-tuned outputs, compare the JSONL sidecar with the evaluation record and confirm that their settings agree.
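
One way to make that comparison mechanical is to diff a fixed set of settings between the two files. The sketch below is illustrative only: the function name and the key names (coordinate_mode, encoder, prompt_serializer) are hypothetical and should be replaced with the fields the sidecar and evaluation record actually carry.

```python
def metadata_mismatches(sidecar: dict, evaluation_metadata: dict,
                        keys=("coordinate_mode", "encoder", "prompt_serializer")):
    """Return {key: (sidecar_value, evaluation_value)} for every setting that differs."""
    return {
        key: (sidecar.get(key), evaluation_metadata.get(key))
        for key in keys
        if sidecar.get(key) != evaluation_metadata.get(key)
    }
```

An empty result means the checked settings agree; any entry is a reason to withhold a parity claim until the difference is explained.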

Baselines

TaPEx is available through the optional HuggingFace transformers wrapper.
Binder is marked unavailable until a real adapter is implemented.
Missing optional dependencies should be reported as skip reasons, not as zero scores.
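
The skip-versus-zero rule can be sketched with a small status helper. This is an assumption-laden illustration: the function name baseline_status and the record keys are hypothetical, and only importlib.util.find_spec from the standard library is relied on to detect a missing package.

```python
import importlib.util


def baseline_status(name, required_module=None, implemented=True):
    """Describe whether a baseline can run, without inventing a score for it."""
    record = {"baseline": name, "status": "available", "skip_reason": None, "score": None}
    if not implemented:
        record.update(status="skipped", skip_reason="adapter not implemented")
    elif required_module and importlib.util.find_spec(required_module) is None:
        # A missing top-level package is a skip reason, never a score of 0.0.
        record.update(
            status="skipped",
            skip_reason=f"optional dependency '{required_module}' is not installed",
        )
    return record
```

Under these assumptions, baseline_status("TaPEx", required_module="transformers") reports a skip when the package is absent, and baseline_status("Binder", implemented=False) always reports a skip. Keeping score as None in both cases prevents a skipped baseline from being averaged in as a zero.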

Claim level	Allowed when	Report wording
Synthetic	Data was generated by this repo or another synthetic generator.	"Synthetic benchmark result"
Reconstructed	Data/splits were rebuilt from public or user-supplied sources.	"Reconstructed benchmark result"
Paper-original	Original paper data, splits, model procedure, and metric settings are available.	"Paper-comparable result"
