Spreadsheet_LLM_Encoder/EVALUATION.md at main · kingkillery/Spreadsheet_LLM_Encoder · GitHub

Evaluation Protocol

This repository can run SpreadsheetLLM-style table-detection and QA experiments, but it does not bundle the original paper datasets or claim paper metric reproduction by default. Use this protocol to keep synthetic, reconstructed, and paper-original runs separate.

Claim Levels

| Claim level | Allowed when | Report wording |
| --- | --- | --- |
| Synthetic | Data was generated by this repo or another synthetic generator. | "Synthetic benchmark result" |
| Reconstructed | Data/splits were rebuilt from public or user-supplied sources. | "Reconstructed benchmark result" |
| Paper-original | Original paper data, splits, model procedure, and metric settings are available. | "Paper-comparable result" |

Do not describe a result as paper-comparable unless the run record includes compatible dataset split metadata, encoder settings, coordinate mode, prompt serializer, model/backend, baseline status, and metric definition.

The metadata validator enforces this distinction with an explicit claim_level field. Directory-scanned local fixtures default to synthetic, manifest-driven datasets default to reconstructed, and paper-original is accepted only when the result record carries concrete dataset, split, model/backend, prompt serializer, coordinate mode, baseline, metric, encoder, and tokenizer metadata. A paper-original claim is rejected if its compression metrics were computed with the char/4 tokenizer fallback.
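The rules above can be sketched as a small check. This is an illustration only, not the repository's validator: the key names in `REQUIRED_FOR_PAPER_ORIGINAL`, the `"char/4"` marker value, and both function names are assumptions made for this example.

```python
from __future__ import annotations

from typing import Any, Mapping

CLAIM_LEVELS = ("synthetic", "reconstructed", "paper-original")

# Hypothetical key names; the real result-record schema may differ.
REQUIRED_FOR_PAPER_ORIGINAL = (
    "dataset_name", "split_name", "model_backend", "prompt_serializer",
    "coordinate_mode", "baseline", "metric", "encoder", "tokenizer",
)


def default_claim_level(from_manifest: bool) -> str:
    """Manifest-driven datasets default to reconstructed, directory scans to synthetic."""
    return "reconstructed" if from_manifest else "synthetic"


def check_claim_level(metadata: Mapping[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the claim is acceptable."""
    level = metadata.get("claim_level")
    if level not in CLAIM_LEVELS:
        return [f"unknown claim_level: {level!r}"]
    if level != "paper-original":
        return []
    # Empty or absent values both count as missing.
    problems = [
        f"missing {name}"
        for name in REQUIRED_FOR_PAPER_ORIGINAL
        if not metadata.get(name)
    ]
    # Assumed marker for the char/4 fallback tokenizer.
    if metadata.get("tokenizer") == "char/4":
        problems.append("compression metrics used the char/4 tokenizer fallback")
    return problems
```

Returning a list of problems rather than raising on the first one lets a run report every missing field at once.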

Paper Benchmark Shape

The paper reports two main evaluation surfaces:

  • Table detection: 188 spreadsheets, 311 tables, split into Small, Medium, Large, and Huge by token usage, evaluated with EoB-0 exact boundary matching.
  • Spreadsheet QA: 64 spreadsheets and 307 QA items, with answers represented as cell addresses or formulas.

These assets are not bundled. A manifest-based run can represent equivalent or reconstructed data, but the output must remain labeled as reconstructed unless the original paper assets and procedures are available.
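For orientation, EoB-0 counts a predicted table as correct only when all four of its boundaries match a ground-truth table exactly. The sketch below scores one sheet under that rule. It is a simplified illustration rather than the repository's metric code, and the function names are hypothetical.

```python
from __future__ import annotations

import re

# Matches A1-style cells or ranges such as "D8", "A1:D8", or "$A$1:$D$8".
_RANGE = re.compile(r"^\$?([A-Z]+)\$?(\d+)(?::\$?([A-Z]+)\$?(\d+))?$")


def col_to_index(col: str) -> int:
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in col:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_range(text: str) -> tuple[int, int, int, int]:
    """Return (top, left, bottom, right) as 1-based row/column numbers."""
    match = _RANGE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not an A1-style range: {text!r}")
    c1, r1, c2, r2 = match.groups()
    c2, r2 = c2 or c1, r2 or r1  # a single cell is a 1x1 range
    left, right = sorted((col_to_index(c1), col_to_index(c2)))
    top, bottom = sorted((int(r1), int(r2)))
    return top, left, bottom, right


def eob0_scores(predicted: list[str], expected: list[str]) -> dict[str, float]:
    """Precision, recall, and F1 where a hit requires identical boundaries."""
    pred = {parse_range(p) for p in predicted}
    gold = {parse_range(g) for g in expected}
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because ranges are compared as sets, a duplicate prediction is counted once. A real scorer might treat duplicates differently.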

Table-Detection Manifest

Use run_llm_evaluation.py --manifest path/to/table_manifest.json for manifest-driven table detection. Relative workbook paths are resolved from the manifest file location.

{
  "dataset_name": "synthetic_tables_v1",
  "dataset_version": "1",
  "claim_level": "synthetic",
  "split_name": "test",
  "items": [
    {
      "spreadsheet_path": "book1.xlsx",
      "tables": [
        {"range": "A1:D8"}
      ]
    }
  ]
}
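
As a rough sketch of how a manifest like this could be read, with relative workbook paths resolved against the manifest's own directory. The helper name `load_table_manifest` is hypothetical and is not taken from the repository.

```python
import json
from pathlib import Path


def load_table_manifest(manifest_path):
    """Return (workbook_path, [range, ...]) pairs from a table manifest."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    # Relative workbook paths are anchored at the manifest's directory,
    # not at the current working directory.
    base = manifest_path.resolve().parent
    resolved = []
    for item in manifest["items"]:
        workbook = Path(item["spreadsheet_path"])
        if not workbook.is_absolute():
            workbook = base / workbook
        ranges = [table["range"] for table in item.get("tables", [])]
        resolved.append((workbook, ranges))
    return resolved
```

Anchoring paths at the manifest location keeps a dataset folder relocatable, since the manifest and its workbooks can move together.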

Example:

python run_llm_evaluation.py --manifest datasets/synthetic/table_manifest.json \
  --backend echo \
  --echo-response "['range': 'A1:D8']" \
  --out-record runs/table_eval.json

The result record includes evaluation_metadata with dataset, split, encoder, coordinate, prompt, backend, baseline, and metric fields.

QA Manifest

Use run_qa_evaluation.py --manifest path/to/qa_manifest.json for manifest-driven QA. QA pairs may include answer_type for type-aware exact match normalization.

{
  "dataset_name": "synthetic_qa_v1",
  "dataset_version": "1",
  "claim_level": "synthetic",
  "split_name": "test",
  "items": [
    {
      "spreadsheet_path": "book1.xlsx",
      "qa_pairs": [
        {
          "question": "Which cell contains total revenue?",
          "answer": "[D8]",
          "answer_type": "cell_address",
          "table_range": "A1:D8"
        }
      ]
    }
  ]
}

Supported answer_type values:

  • cell_address: trim and uppercase.
  • formula: trim, remove whitespace, and uppercase.
  • free_text: collapse whitespace and casefold.
  • literal: trim only. This is the default when unspecified.
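
The normalization rules listed above could be implemented along these lines. This is a minimal sketch: `normalize_answer` and `exact_match` are illustrative names, and the repository's own functions may be organized differently.

```python
import re


def normalize_answer(value: str, answer_type: str = "literal") -> str:
    """Normalize an answer string according to its answer_type."""
    if answer_type == "cell_address":
        return value.strip().upper()
    if answer_type == "formula":
        # Removing all whitespace also covers trimming.
        return re.sub(r"\s+", "", value).upper()
    if answer_type == "free_text":
        # split() with no argument collapses runs of whitespace and trims.
        return " ".join(value.split()).casefold()
    if answer_type == "literal":
        return value.strip()
    raise ValueError(f"unsupported answer_type: {answer_type!r}")


def exact_match(predicted: str, expected: str, answer_type: str = "literal") -> bool:
    """Compare two answers after applying the same normalization to both."""
    return normalize_answer(predicted, answer_type) == normalize_answer(expected, answer_type)
```

For example, `exact_match("[d8]", "[D8]", "cell_address")` is true, while the same pair compared as `literal` is not. Note that the `formula` rule as written also uppercases any string literals inside the formula.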

Example:

python run_qa_evaluation.py --manifest datasets/synthetic/qa_manifest.json \
  --backend echo \
  --echo-response "[D8]" \
  --out-record runs/qa_eval.json

Fine-Tuning Metadata

Use prepare_finetuning_data.py --metadata-output finetune_manifest.json to write a sidecar describing the generated JSONL. Each JSONL record also includes per-sheet metadata with original ranges, compact prompt ranges, and coordinate mode so training targets can be audited later.

python prepare_finetuning_data.py datasets/train output/finetune.jsonl \
  --metadata-output output/finetune_manifest.json

Before making parity claims about fine-tuned outputs, compare the fine-tuning sidecar with the evaluation record, for example to confirm that both used the same coordinate mode.
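
One way to make that comparison concrete is to diff a few settings between the two metadata dictionaries. The key names below are assumptions chosen for illustration. Substitute whichever keys the sidecar and the evaluation record actually share.

```python
# Hypothetical key names; adjust to the real sidecar and record schemas.
COMPARED_KEYS = ("coordinate_mode", "encoder", "prompt_serializer")


def metadata_mismatches(sidecar, evaluation_metadata, keys=COMPARED_KEYS):
    """Return {key: (sidecar_value, evaluation_value)} for every key that differs."""
    return {
        key: (sidecar.get(key), evaluation_metadata.get(key))
        for key in keys
        if sidecar.get(key) != evaluation_metadata.get(key)
    }
```

An empty result means the compared settings agree. It does not show on its own that the run is comparable in every other respect.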

Baselines

  • TaPEx is available through the optional HuggingFace transformers wrapper.
  • Binder is marked unavailable until a real adapter is implemented.
  • Missing optional dependencies should be reported as skip reasons, not as zero scores.
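
The last point can be sketched as a result type that keeps "skipped" distinct from "scored zero". `BaselineResult` and `run_optional_baseline` are hypothetical names used only for this example.

```python
from __future__ import annotations

import importlib.util
from dataclasses import dataclass
from typing import Callable


@dataclass
class BaselineResult:
    name: str
    score: float | None = None       # None means the baseline did not run
    skip_reason: str | None = None   # set only when the baseline was skipped


def run_optional_baseline(
    name: str, required_module: str, scorer: Callable[[], float]
) -> BaselineResult:
    """Run scorer only if the top-level module it depends on is importable."""
    if importlib.util.find_spec(required_module) is None:
        return BaselineResult(
            name,
            skip_reason=f"optional dependency '{required_module}' is not installed",
        )
    return BaselineResult(name, score=scorer())
```

Keeping `score` as `None` for skipped baselines prevents a missing package from being averaged into a report as a zero.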