Cross-language code clone detection (Java ↔ Python) with three experiment pipelines:
- direct: one LLM call per pair on raw source code
- algo_based: extract language-agnostic algorithms (2 calls) + compare algorithms (1 call)
- agentic: ReAct-style agent that uses tools and loads skills from the skills/ folder
The project supports resumable runs and run-level token/cost logging.
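To make the call counts above concrete, the algo_based pipeline can be sketched as follows. This is only an illustration under assumptions: call_llm, extract_algorithm, compare_algorithms, algo_based_predict, and the prompt wording are hypothetical stand-ins, not the project's actual functions under src/workflows/.

```python
from typing import Callable

# Hypothetical type: any function that sends a prompt to an LLM and returns text.
LLMCall = Callable[[str], str]


def extract_algorithm(call_llm: LLMCall, code: str, language: str) -> str:
    """One LLM call: summarize source code as a language-agnostic algorithm."""
    prompt = f"Describe the algorithm implemented by this {language} code:\n{code}"
    return call_llm(prompt)


def compare_algorithms(call_llm: LLMCall, algo_a: str, algo_b: str) -> bool:
    """One LLM call: decide whether two algorithm descriptions are clones."""
    prompt = (
        "Do these two algorithms solve the same problem? Answer YES or NO.\n"
        f"A: {algo_a}\nB: {algo_b}"
    )
    return call_llm(prompt).strip().upper().startswith("YES")


def algo_based_predict(call_llm: LLMCall, java_code: str, python_code: str) -> bool:
    """Three LLM calls per pair: two extractions plus one comparison."""
    java_algo = extract_algorithm(call_llm, java_code, "Java")
    python_algo = extract_algorithm(call_llm, python_code, "Python")
    return compare_algorithms(call_llm, java_algo, python_algo)
```

By contrast, the direct pipeline would issue a single call with both raw snippets in one prompt.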
- Entry points
  - main.py: run one experiment (pipeline × model × dataset)
  - prepare_dataset.py: build normalized datasets under data/
  - evaluate.py: evaluate result CSVs and write reports
- Source packages
  - src/core/: constants, paths, logging
  - src/io/: dataset loader, result writer, token usage writer
  - src/inference/: OpenRouter LLM client, prompts, retry/pacing, response parsing
  - src/workflows/: direct, algo_based, agentic workflows
- Skills (used by agentic workflow)
  - skills/: Markdown SKILL.md files grouped by category
- Raw inputs
  - raw_data/: raw dataset JSONs used by dataset preparation
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Set the environment variable by creating a file named .env in the repo root:
OPENROUTER_API_KEY=YOUR_KEY_HERE
The LLM client is implemented in src/inference/llm.py and uses OpenRouter's OpenAI-compatible API base URL.
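As a rough sketch of how the key could be picked up and used, the snippet below reads .env with the standard library only and builds request headers for OpenRouter's OpenAI-compatible endpoint. The helper names (load_env_file, get_api_key, build_headers) are hypothetical, and the real client in src/inference/llm.py may load the key differently, for example through a dotenv package.

```python
import os
from pathlib import Path

# OpenRouter's OpenAI-compatible base URL.
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(env_path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("OPENROUTER_API_KEY") or load_env_file(env_path).get(
        "OPENROUTER_API_KEY"
    )
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return key


def build_headers(api_key: str) -> dict[str, str]:
    """Headers for a chat completions request against OPENROUTER_BASE_URL."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Failing fast with a clear error when the key is missing tends to be easier to debug than an authentication failure in the middle of a run.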
python prepare_dataset.py
This generates normalized datasets under data/ (for example: data/java_python_xl.json and data/java_python_cn.json). Pipelines will error if these files do not exist.
Run one pipeline/model/dataset combination:
python main.py --pipeline direct --model deepseek_v3 --dataset xlcost
python main.py --pipeline algo_based --model deepseek_v3 --dataset xlcost
python main.py --pipeline agentic --model deepseek_v3 --dataset xlcost
To see valid choices (pipelines, models, datasets):
python main.py --help
Pipelines are implemented in:
- src/workflows/direct_workflow.py
- src/workflows/algo_based_workflow.py
- src/workflows/agentic_workflow.py
Re-running the same command is safe:
- If a pair already has a successful (non-ERROR) row in the results CSV, it is skipped.
- If a pair previously produced ERROR, it will be retried (stale ERROR rows are removed before retry).
This behavior is implemented in src/io/result_helper.py.
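The skip-and-retry rule above can be sketched roughly as follows. The column names pair_id and predicted, the ERROR_LABEL constant, and both function names are assumptions for illustration; the actual logic in src/io/result_helper.py may differ.

```python
import csv
from pathlib import Path

ERROR_LABEL = "ERROR"  # assumed marker for a failed prediction


def load_completed_ids(results_csv: str) -> set[str]:
    """Drop stale ERROR rows in place and return IDs of successful pairs."""
    path = Path(results_csv)
    if not path.exists():
        return set()

    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        rows = list(reader)

    # Keep only successful rows; ERROR rows are removed so they can be retried.
    kept = [row for row in rows if row["predicted"] != ERROR_LABEL]

    if len(kept) != len(rows):
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(kept)

    return {row["pair_id"] for row in kept}


def pending_pairs(all_ids: list[str], results_csv: str) -> list[str]:
    """Pairs that still need a model call, in their original order."""
    done = load_completed_ids(results_csv)
    return [pair_id for pair_id in all_ids if pair_id not in done]
```

Removing ERROR rows before the retry keeps each pair to at most one row in the results file, so later evaluation does not count a pair twice.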
The project writes outputs under an output/ folder in the repo root:
- output/direct/
- output/algo_based/
- output/agentic/
Each run appends rows to a file named like:
results_<model>_<dataset>.csv
Example:
output/direct/results_deepseek_v3_xlcost.csv
Rows include the pair ID, the ground-truth label, and the predicted label.
Each run also appends one token-usage summary row to:
output/<pipeline>/token_usage.csv
Token usage is recorded even when the run is interrupted or crashes (the run_status column reflects the outcome).
Columns:
pipeline,model,dataset,pairs,elapsed_seconds,run_status,successful_requests,prompt_tokens,completion_tokens,total_tokens
Token logging is implemented in src/io/token_usage_writer.py.
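One way to guarantee a usage row even on interruption is a try/finally wrapper, sketched below. The column list comes from the section above; the function names, the status strings ("completed", "interrupted", "crashed"), and the shape of the usage dictionary are assumptions, not the contents of src/io/token_usage_writer.py.

```python
import csv
import time
from pathlib import Path

COLUMNS = [
    "pipeline", "model", "dataset", "pairs", "elapsed_seconds", "run_status",
    "successful_requests", "prompt_tokens", "completion_tokens", "total_tokens",
]


def append_usage_row(csv_path: str, row: dict) -> None:
    """Append one row, writing the header first if the file is new or empty."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)


def run_with_usage_logging(csv_path: str, meta: dict, run) -> None:
    """Call run(usage) and always log one row, even on interrupt or crash."""
    usage = {"pairs": 0, "successful_requests": 0,
             "prompt_tokens": 0, "completion_tokens": 0}
    status = "completed"  # assumed status labels
    start = time.monotonic()
    try:
        run(usage)  # the run mutates the usage counters as it goes
    except KeyboardInterrupt:
        status = "interrupted"
        raise
    except Exception:
        status = "crashed"
        raise
    finally:
        append_usage_row(csv_path, {
            **meta,  # expected keys: pipeline, model, dataset
            **usage,
            "elapsed_seconds": round(time.monotonic() - start, 2),
            "run_status": status,
            "total_tokens": usage["prompt_tokens"] + usage["completion_tokens"],
        })
```

Because the row is written in the finally clause, the counters accumulated before a failure are still recorded, and the original exception is re-raised afterwards.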
The algorithm-based pipeline may write extracted algorithms to:
output/algo_based/algorithms_<model>_<dataset>.json
This is handled by src/io/algorithm_writer.py.
- Logs are written to a file under the logs/ folder (created automatically): logs/experiment.log
- No console logging handler is attached by default, so the terminal output should mainly be the tqdm progress bar.
Logging is configured in src/core/logger.py.
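A file-only logger of the kind described above might look like the sketch below. The function name get_file_logger, the logger name, and the format string are assumptions; src/core/logger.py may configure things differently.

```python
import logging
from pathlib import Path


def get_file_logger(name: str = "experiment", log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to <log_dir>/experiment.log and nowhere else."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)  # create logs/ if missing

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Do not pass records to the root logger, which may print to the console.
    logger.propagate = False

    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(Path(log_dir) / "experiment.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Keeping log records out of the terminal avoids interleaving them with the tqdm progress bar.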
Evaluate experiment outputs (recursive scan under output/ by default):
python evaluate.py
Evaluate a single CSV:
python evaluate.py --file output/direct/results_deepseek_v3_xlcost.csv
Evaluation is implemented in evaluate.py.
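The section does not say which metrics evaluate.py reports. As a hedged illustration, a binary clone-detection result file could be scored as below. This assumes columns named ground_truth and predicted holding "1" (clone) or "0" (not a clone), and assumes that rows with any other predicted value (such as ERROR) are skipped; all names here are hypothetical.

```python
import csv


def confusion_counts(pairs):
    """Count TP/FP/TN/FN over (ground_truth, predicted) label pairs."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0, "skipped": 0}
    for truth, predicted in pairs:
        if predicted not in ("0", "1"):  # e.g. ERROR rows are not scored
            counts["skipped"] += 1
        elif truth == "1":
            counts["tp" if predicted == "1" else "fn"] += 1
        else:  # any ground truth other than "1" is treated as a non-clone
            counts["tn" if predicted == "0" else "fp"] += 1
    return counts


def summarize(counts):
    """Accuracy, precision, recall, and F1, with 0.0 for empty denominators."""
    tp, fp, tn, fn = (counts[k] for k in ("tp", "fp", "tn", "fn"))
    scored = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / scored if scored else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def evaluate_csv(path):
    """Score one results CSV using the assumed column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = [(r["ground_truth"], r["predicted"]) for r in csv.DictReader(handle)]
    return summarize(confusion_counts(rows))
```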
If you see errors about missing data/java_python_*.json files, run:
python prepare_dataset.py
Also confirm that the raw inputs exist under raw_data/.
If the agentic pipeline cannot find skills, confirm the skills/ folder exists at the repo root and contains the expected SKILL.md files. The project root path is defined in src/core/constants.py.
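A quick way to check what the agentic pipeline could find is to list every SKILL.md under skills/. The sketch below assumes the files sit somewhere beneath skills/ in category subfolders; discover_skills is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def discover_skills(project_root: str) -> dict[str, Path]:
    """Map each skill's folder path (relative to skills/) to its SKILL.md file."""
    skills_dir = Path(project_root) / "skills"
    if not skills_dir.is_dir():
        raise FileNotFoundError(f"skills folder not found: {skills_dir}")

    skills: dict[str, Path] = {}
    for skill_file in sorted(skills_dir.rglob("SKILL.md")):
        # The key is the containing folder, e.g. "<category>/<skill_name>".
        key = skill_file.parent.relative_to(skills_dir).as_posix()
        skills[key] = skill_file
    return skills


if __name__ == "__main__":
    # Run from the repo root to print the skills that would be visible.
    for name, path in discover_skills(".").items():
        print(f"{name}: {path}")
```

An empty listing, or a FileNotFoundError, points to a wrong working directory or a missing skills/ folder rather than a problem in the agent itself.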
ReAct-style agents are sensitive to output formatting. If you see frequent parsing/format errors in the agentic pipeline, try:
- a more instruction-following model alias
- reducing temperature (default is deterministic)
- tightening prompts and parsing tolerance in the agentic workflow components under src/inference/ and src/workflows/agentic_workflow.py
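To illustrate what parsing tolerance can mean in practice, the sketch below parses a conventional ReAct step while accepting variations in case, spacing, and Markdown bold markers. The "Action:", "Action Input:", and "Final Answer:" labels are the common ReAct convention and are assumed here; the project's actual output format and parser may differ, and Step and parse_react_output are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    kind: str             # "action" or "final"
    tool: Optional[str]   # tool name for actions, None for final answers
    content: str          # tool input or final answer text


_FINAL = re.compile(r"final\s*answer\s*:\s*(.+)", re.IGNORECASE | re.DOTALL)
_ACTION = re.compile(
    r"action\s*:\s*(?P<tool>[^\n]+?)\s*\n\s*action\s*input\s*:\s*(?P<arg>.*)",
    re.IGNORECASE | re.DOTALL,
)


def parse_react_output(text: str) -> Step:
    """Parse one model turn, tolerating case, spacing, and bold markers."""
    cleaned = text.replace("**", "").strip()  # drop Markdown bold around labels

    final = _FINAL.search(cleaned)
    if final:
        return Step("final", None, final.group(1).strip())

    action = _ACTION.search(cleaned)
    if action:
        return Step("action", action.group("tool").strip(), action.group("arg").strip())

    # Raising lets the caller retry or re-prompt instead of guessing.
    raise ValueError("no Action or Final Answer found in model output")
```

A parser like this checks for a final answer first, so a turn containing both a final answer and an action is treated as finished; whether that is the right priority depends on how the agent loop is written.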
