InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation
InteractScience is a benchmark specifically designed to evaluate the capability of large language models in generating interactive scientific demonstration code. This project provides a complete evaluation pipeline including model inference, automated testing, and multi-dimensional assessment.
.
├── data/ # Benchmark dataset
│ ├── interactscience.jsonl # Main dataset file containing problems and references
│ └── snapshots/ # Reference screenshot directory
│ ├── *_Snapshot-1.png
│ ├── *_Snapshot-2.png
│ └── ...
├── PFT_tests/ # Program Functionality Testing (PFT) scripts
│ ├── *.spec.js # Playwright test scripts
│ └── ...
├── VQT_tests/ # Visual Quality Testing (VQT) scripts
│ ├── *.spec.js # Playwright test scripts
│ └── ...
├── eval/ # Model inference results
│ ├── interactscience_lm_*.jsonl # Language model inference results
│ ├── interactscience_vlm_*.jsonl # Vision-language model inference results
│ └── ...
├── results/ # Test result data
│ ├── lm_results/ # Language model test results
│ │ ├── PFT_test_results/ # Program functionality test results
│ │ ├── VQT_test_results/ # Visual quality test results
│ │ ├── VQT_clip_results/ # CLIP scoring results
│ │ └── VQT_vlm_judge_results/ # VLM scoring results
│ └── vlm_results/ # Vision-language model test results
├── run_generation.sh # Model inference script
├── run_benchmark.sh # Automated testing script
├── run_vlm_as_judge.sh # VLM scoring script
├── cal_metrics.py # Metrics calculation script
├── test_llm.py # Language model testing main program
├── vlm_as_judge.py # VLM scoring main program
├── clip_score.py # CLIP score calculation
└── extract_and_save_code.py # Code extraction and saving
First install Node.js and npm, then install the Playwright testing environment:
# Install project dependencies
npm install
# Install Playwright browsers
npx playwright installUse the run_generation.sh script for model inference:
# Edit the model path and parameters in the script
vim run_generation.sh
# Run inference (requires model path configuration)
bash run_generation.shScript Description:
- Starts vLLM API server
- Calls
test_llm.pyfor inference - Results saved to
eval/directory
Use the run_benchmark.sh script for automated testing:
# Set the model name to test
export MODEL="your_model_name"
# Run tests
bash run_benchmark.shTesting Process:
- Extract HTML code from inference results (
extract_and_save_code.py) - Execute Program Functionality Testing (PFT) using
playwright_PFT.config.js - Execute Visual Quality Testing (VQT) using
playwright_VQT.config.js - Calculate CLIP similarity scores (
clip_score.py) - Results saved to
results/directory
Use run_vlm_as_judge.sh for VLM-as-Judge evaluation:
# Edit model and path configuration in the script
vim run_vlm_as_judge.sh
# Run VLM scoring
bash run_vlm_as_judge.shScoring Description:
- Uses vision-language models to score generated results
- Compares reference screenshots with generated screenshots
- Evaluation based on predefined checklists
Use cal_metrics.py and cal_vlm_as_judege_score.py to calculate final metrics:
python cal_metrics.py
python cal_vlm_as_judege_score.pyMain dataset file, each line contains a test sample:
id: Unique identifierquestion: Detailed HTML implementation planlm_system_prompt: Language model system promptvlm_system_prompt: Vision-language model system promptimage_path: List of reference screenshot pathssnapshot_checklists: Visual verification checklists
Located in data/snapshots/ directory, naming format:
{task_id}_Snapshot-{number}.png
- Validates functional correctness of HTML code
- Checks interactive element behavior
- Tests JavaScript logic
- Generates page screenshots
- Compares with reference screenshots
- Calculates perceptual similarity (CLIP scores)
- Calculates semantic correctness (VLM-judge scores)
Language model testing main program:
python test_llm.py \
--dataset_path data/interactscience.jsonl \
--prompt_type lm_system_prompt \
--dump_path eval/result.jsonl \
--model_path your_model_path \
--base_url http://localhost:8000/v1 \
--api_key EMPTYVLM scoring main program:
python vlm_as_judge.py \
--reference_image_dir data/snapshots \
--generated_image_dir generated_images \
--checklist_file data/checklists.jsonl \
--output_path results/vlm_judge.jsonl \
--base_url your_api_endpoint \
--api_key your_api_key- Program Functionality Test Pass Rate: Percentage of PFT test cases passed
- Visual Quality Score: Visual similarity based on CLIP model
- VLM Score: Comprehensive score given by multimodal models
We have evaluated 30 state-of-the-art large language models on the InteractScience benchmark. The results are available in the results/ directory.
@article{InteractScience,
author = {Qiaosheng Chen and Yang Liu and Lei Li and Kai Chen and Qipeng Guo and Gong Cheng and Fei Yuan},
title = {InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation},
journal = {arXiv preprint arXiv:2510.09724},
year = {2025}
}
