This repository can run SpreadsheetLLM-style table-detection and QA experiments, but it does not bundle the original paper datasets or claim paper metric reproduction by default. Use this protocol to keep synthetic, reconstructed, and paper-original runs separate.
Do not describe a result as paper-comparable unless the run record includes compatible dataset split metadata, encoder settings, coordinate mode, prompt serializer, model/backend, baseline status, and metric definition.
The metadata validator enforces this distinction through an explicit
claim_level field with three values. synthetic is the default for
directory-scanned local fixtures. reconstructed is the default for
manifest-driven datasets. paper-original is accepted only when the result
record carries concrete dataset, split, model/backend, prompt serializer,
coordinate mode, baseline, metric, encoder, and tokenizer metadata, and it is
rejected when compression metrics were computed with the char/4 tokenizer fallback.
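The acceptance rules above can be sketched as a small check. This is an illustration, not the repository's validator: the function names, the metadata key names, and the use of the string "char/4" to mark the fallback tokenizer are all assumptions made for this example.

```python
# Hypothetical key names; the real result record may spell these differently.
REQUIRED_FOR_PAPER_ORIGINAL = (
    "dataset", "split", "model_backend", "prompt_serializer",
    "coordinate_mode", "baseline", "metric", "encoder", "tokenizer",
)
KNOWN_LEVELS = {"synthetic", "reconstructed", "paper-original"}


def default_claim_level(from_manifest: bool) -> str:
    """Manifest-driven datasets default to reconstructed, local fixtures to synthetic."""
    return "reconstructed" if from_manifest else "synthetic"


def check_claim_level(record: dict) -> list:
    """Return a list of problems; an empty list means the claim level is acceptable."""
    level = record.get("claim_level", "synthetic")
    if level not in KNOWN_LEVELS:
        return [f"unknown claim_level: {level!r}"]
    if level != "paper-original":
        return []  # weaker claims need no extra metadata in this sketch
    meta = record.get("evaluation_metadata", {})
    problems = [f"missing {key}" for key in REQUIRED_FOR_PAPER_ORIGINAL if not meta.get(key)]
    # Assumption: the fallback tokenizer is recorded as the literal string "char/4".
    if meta.get("tokenizer") == "char/4":
        problems.append("compression metrics used the char/4 tokenizer fallback")
    return problems
```

Returning a list of problems rather than a boolean lets a run report say why a paper-original label was refused.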
The paper reports two main evaluation surfaces:
- Table detection: 188 spreadsheets, 311 tables, split into Small, Medium, Large, and Huge by token usage, evaluated with EoB-0 exact boundary matching.
- Spreadsheet QA: 64 spreadsheets and 307 QA items, with answers represented as cell addresses or formulas.
These assets are not bundled. A manifest-based run can represent equivalent or reconstructed data, but the output must remain labeled as reconstructed unless the original paper assets and procedures are available.
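For orientation, exact boundary matching can be sketched as set comparison over normalized ranges: a predicted table counts only if its range equals a ground-truth range exactly. The helper names and the precision/recall/F1 aggregation below are assumptions for illustration and are not taken from the paper's scoring code or from this repository.

```python
def normalize_range(a1_range: str) -> str:
    """Drop absolute-reference markers and spaces, then uppercase (e.g. ' $a$1:d8' -> 'A1:D8')."""
    return a1_range.replace("$", "").replace(" ", "").upper()


def exact_boundary_scores(predicted, expected) -> dict:
    """Score predicted table ranges against expected ones with zero boundary tolerance."""
    pred = {normalize_range(r) for r in predicted}
    gold = {normalize_range(r) for r in expected}
    hits = len(pred & gold)  # only exact range matches count
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Under this sketch a prediction of A1:D9 against a ground truth of A1:D8 scores zero, which is the point of a zero-tolerance boundary metric.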
Use run_llm_evaluation.py --manifest path/to/table_manifest.json for
manifest-driven table detection. Relative workbook paths are resolved from the
manifest file location.
{
  "dataset_name": "synthetic_tables_v1",
  "dataset_version": "1",
  "claim_level": "synthetic",
  "split_name": "test",
  "items": [
    {
      "spreadsheet_path": "book1.xlsx",
      "tables": [
        {"range": "A1:D8"}
      ]
    }
  ]
}
Example:
python run_llm_evaluation.py --manifest datasets/synthetic/table_manifest.json \
  --backend echo \
  --echo-response "['range': 'A1:D8']" \
  --out-record runs/table_eval.json
The result record includes evaluation_metadata with dataset, split, encoder,
coordinate, prompt, backend, baseline, and metric fields.
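The rule that relative workbook paths resolve from the manifest file location can be sketched as follows. This is a minimal reader for the manifest shape shown above, not the loader used by run_llm_evaluation.py; the function name load_table_manifest is hypothetical.

```python
import json
from pathlib import Path


def load_table_manifest(manifest_path):
    """Read a table manifest and resolve workbook paths against the manifest's directory."""
    manifest_path = Path(manifest_path).resolve()
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    base_dir = manifest_path.parent
    items = []
    for item in manifest["items"]:
        workbook = Path(item["spreadsheet_path"])
        if not workbook.is_absolute():
            # Relative paths are anchored at the manifest, not the working directory.
            workbook = base_dir / workbook
        ranges = [table["range"] for table in item.get("tables", [])]
        items.append((workbook, ranges))
    return manifest, items
```

Anchoring at the manifest keeps a dataset directory relocatable: the same manifest works regardless of where the evaluation command is launched from.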
Use run_qa_evaluation.py --manifest path/to/qa_manifest.json for
manifest-driven QA. QA pairs may include answer_type for type-aware exact
match normalization.
{
  "dataset_name": "synthetic_qa_v1",
  "dataset_version": "1",
  "claim_level": "synthetic",
  "split_name": "test",
  "items": [
    {
      "spreadsheet_path": "book1.xlsx",
      "qa_pairs": [
        {
          "question": "Which cell contains total revenue?",
          "answer": "[D8]",
          "answer_type": "cell_address",
          "table_range": "A1:D8"
        }
      ]
    }
  ]
}
Supported answer_type values:
- cell_address: trim and uppercase.
- formula: trim, remove whitespace, and uppercase.
- free_text: collapse whitespace and casefold.
- literal: trim only. This is the default when unspecified.
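The four normalization rules can be sketched in a few lines. This is an illustrative reading of the list above rather than the repository's implementation; the function names normalize_answer and exact_match are assumptions.

```python
import re


def normalize_answer(value: str, answer_type: str = "literal") -> str:
    """Normalize an answer string according to its answer_type before exact matching."""
    if answer_type == "cell_address":
        return value.strip().upper()
    if answer_type == "formula":
        # Removing all whitespace also covers leading and trailing trim.
        return re.sub(r"\s+", "", value).upper()
    if answer_type == "free_text":
        return " ".join(value.split()).casefold()
    if answer_type == "literal":
        return value.strip()
    raise ValueError(f"unsupported answer_type: {answer_type!r}")


def exact_match(predicted: str, expected: str, answer_type: str = "literal") -> bool:
    """Compare two answers after applying the same type-aware normalization to both."""
    return normalize_answer(predicted, answer_type) == normalize_answer(expected, answer_type)
```

With these rules, `exact_match(" [d8] ", "[D8]", "cell_address")` is true, while the same pair compared as literal is not, because literal only trims.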
Example:
python run_qa_evaluation.py --manifest datasets/synthetic/qa_manifest.json \
  --backend echo \
  --echo-response "['range': 'A1:D8']" \
  --out-record runs/qa_eval.json
Use prepare_finetuning_data.py --metadata-output finetune_manifest.json to
write a sidecar describing the generated JSONL. Each JSONL record also includes
per-sheet metadata with original ranges, compact prompt ranges, and coordinate
mode so training targets can be audited later.
python prepare_finetuning_data.py datasets/train output/finetune.jsonl \
  --metadata-output output/finetune_manifest.json
Evaluation against fine-tuned outputs should compare the JSONL sidecar with the evaluation record before making parity claims.
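One way to make that comparison mechanical is to diff the settings that both files should share. The sketch below is hedged: the key names in SHARED_KEYS and the function name are hypothetical, and the actual sidecar and evaluation record may use different field names or nesting.

```python
# Hypothetical key names that both the sidecar and the evaluation metadata would carry.
SHARED_KEYS = ("coordinate_mode", "encoder", "prompt_serializer")


def metadata_mismatches(sidecar: dict, evaluation_metadata: dict, keys=SHARED_KEYS) -> dict:
    """Return {key: (sidecar_value, evaluation_value)} for every missing or differing key."""
    mismatches = {}
    for key in keys:
        left = sidecar.get(key)
        right = evaluation_metadata.get(key)
        # A value missing on either side blocks a parity claim just like a difference does.
        if left is None or right is None or left != right:
            mismatches[key] = (left, right)
    return mismatches
```

Both inputs can be loaded with json.load from the sidecar file and from the evaluation_metadata section of the result record. An empty result is a precondition for a parity claim in this sketch, not proof of one.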
- TaPEx is available through the optional HuggingFace transformers wrapper.
- Binder is marked unavailable until a real adapter is implemented.
- Missing optional dependencies should be reported as skip reasons, not as zero scores.
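The skip-reason rule can be sketched as follows. This is an assumption-laden illustration: the function name, the result keys, and the status strings are hypothetical and do not describe the repository's baseline runner.

```python
import importlib.util


def run_optional_baseline(name: str, required_module: str, evaluate) -> dict:
    """Run a baseline only if its optional top-level dependency is importable."""
    if importlib.util.find_spec(required_module) is None:
        # Report a skip with a reason; a score of None keeps it out of any average.
        return {
            "baseline": name,
            "status": "skipped",
            "skip_reason": f"optional dependency {required_module!r} is not installed",
            "score": None,
        }
    return {"baseline": name, "status": "ok", "skip_reason": None, "score": evaluate()}
```

Recording None instead of 0.0 matters because a zero would be read as a measured failure of the baseline rather than as an absent measurement.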
