This module contains Java Microbenchmark Harness (JMH) benchmarks for OpenRefine.
- Java (JDK 11+)
- Maven
From the repository root:
mvn -pl benchmark -am -DskipTests package
This produces:
benchmark/target/openrefine-benchmarks.jar
java -jar benchmark/target/openrefine-benchmarks.jar -l
Useful optional flags:
- -f <n>: forks
- -wi <n>: warmup iterations
- -i <n>: measurement iterations
- -w <duration>: warmup time per iteration (for example 200ms)
- -r <duration>: measurement time per iteration (for example 300ms)
- -t <n>: threads
- -p name=value: benchmark parameter override
Class:
org.openrefine.benchmark.ToNumberBenchmark
Methods:
- toDoubleNew
- toLongNew
Built-in parameter:
iterations: 1000, 10000
Examples:
# Run the whole class
java -jar benchmark/target/openrefine-benchmarks.jar ToNumberBenchmark
# Run one method
java -jar benchmark/target/openrefine-benchmarks.jar ToNumberBenchmark.toLongNew
# Override parameter
java -jar benchmark/target/openrefine-benchmarks.jar ToNumberBenchmark -p iterations=10000

Class:
org.openrefine.benchmark.ApacheLevenshteinBenchmark
Methods:
- apacheLevenshtein
- vicinoLevenshtein
Purpose:
- Compare raw Levenshtein distance computation performance between Apache Commons Text and Vicino.
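For readers unfamiliar with the metric being timed, here is a minimal textbook sketch of Levenshtein distance using two rolling rows. It only illustrates the kind of work each benchmark call performs; it is not the Apache Commons Text or Vicino implementation, and the class name `LevenshteinSketch` is made up for this example.

```java
// Illustrative only: a textbook two-row Levenshtein distance.
// This is NOT the Apache Commons Text or Vicino code that the benchmark measures.
class LevenshteinSketch {

    // Returns the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i characters of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Swap rows so prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

The cost of one call grows with the product of the two string lengths, which is why the choice of dataset affects the measured numbers.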
Dataset selection order:
- JVM property openrefine.benchmark.dataset (if provided)
- main/tests/data/acm_large.txt (if present)
- main/tests/data/government_contracts.csv
- Synthetic fallback dataset
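The selection order above can be sketched roughly as follows. This is an assumption-laden illustration, not the benchmark's actual code: the class `DatasetSelectionSketch`, the `select` method, and the `<synthetic>` marker are hypothetical, and the sketch assumes that both bundled files are checked for existence and that the property value is used as given.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical sketch of the documented fallback chain; names are illustrative.
class DatasetSelectionSketch {

    static final String PROPERTY = "openrefine.benchmark.dataset";

    // Bundled candidates, in the documented order.
    static final List<String> DEFAULTS = List.of(
            "main/tests/data/acm_large.txt",
            "main/tests/data/government_contracts.csv");

    // Marker meaning "generate a synthetic dataset" (an assumption of this sketch).
    static final String SYNTHETIC = "<synthetic>";

    static String select(String propertyValue, Path repoRoot) {
        // 1. An explicitly provided JVM property wins.
        if (propertyValue != null && !propertyValue.isBlank()) {
            return propertyValue;
        }
        // 2. and 3. Otherwise, the first bundled file that exists.
        for (String candidate : DEFAULTS) {
            if (Files.isRegularFile(repoRoot.resolve(candidate))) {
                return candidate;
            }
        }
        // 4. Nothing found: fall back to synthetic data.
        return SYNTHETIC;
    }

    public static void main(String[] args) {
        // Resolve relative to the current working directory.
        System.out.println(select(System.getProperty(PROPERTY), Paths.get("")));
    }
}
```

Because the bundled paths are relative, running the jar from a directory other than the repository root would, under these assumptions, skip steps 2 and 3 and land on the synthetic dataset.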
Examples:
# Run full distance comparison
java -jar benchmark/target/openrefine-benchmarks.jar ApacheLevenshteinBenchmark
# Run with explicit dataset path
java -Dopenrefine.benchmark.dataset=/root/data/itunes_amazon_tableB_large.txt \
-jar benchmark/target/openrefine-benchmarks.jar ApacheLevenshteinBenchmark
# Run one method only
java -jar benchmark/target/openrefine-benchmarks.jar ApacheLevenshteinBenchmark.apacheLevenshtein

Class:
org.openrefine.benchmark.KNNLevenshteinClusteringBenchmark
Methods:
- apacheKNNClustering
- vicinoKNNClustering
Purpose:
- Compare full kNN clustering runtime, not just distance calls.
Built-in parameter (safe defaults):
rowCount: 200, 500
Default warmup/measurement (in code):
- Warmup: 1 iteration x 200ms
- Measurement: 2 iterations x 300ms
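As a rough guide to how long a default run takes, the configured iteration times can be multiplied out into a lower bound. The sketch below is plain arithmetic under stated assumptions: a time-based benchmark mode, and a fork count of 1 (the fork count set in code is not stated here, so substitute your own `-f` value). The class and method names are made up for this example.

```java
// Back-of-the-envelope lower bound on wall-clock time for a JMH run.
// Assumes a time-based mode; the fork count is a caller-supplied assumption.
class RunTimeFloorSketch {

    static long floorMillis(int methods, int paramValues, int forks,
                            int warmupIterations, long warmupMillis,
                            int measurementIterations, long measurementMillis) {
        // Time spent in iterations within a single fork of a single benchmark.
        long perFork = warmupIterations * warmupMillis
                + measurementIterations * measurementMillis;
        // Every method is run once per parameter value and per fork.
        return (long) methods * paramValues * forks * perFork;
    }

    public static void main(String[] args) {
        // 2 methods, 2 rowCount values, 1 fork (assumed),
        // warmup 1 x 200ms, measurement 2 x 300ms.
        System.out.println(floorMillis(2, 2, 1, 1, 200, 2, 300)); // prints 3200
    }
}
```

This is only a floor: JVM startup adds time per fork, and an iteration cannot end until the operation in flight has finished, so a single slow clustering call can stretch an iteration well past its nominal 200ms or 300ms.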
Examples:
# Run end-to-end comparison with defaults
java -jar benchmark/target/openrefine-benchmarks.jar KNNLevenshteinClusteringBenchmark
# Run with explicit dataset and small row count
java -Dopenrefine.benchmark.dataset=/root/data/itunes_amazon_tableB_large.txt \
-jar benchmark/target/openrefine-benchmarks.jar KNNLevenshteinClusteringBenchmark -p rowCount=200
# Extra-safe quick profile
java -jar benchmark/target/openrefine-benchmarks.jar KNNLevenshteinClusteringBenchmark \
-f 1 -wi 0 -i 1 -r 1s -w 0s -p rowCount=200 -t 1
- End-to-end clustering benchmarks can be CPU-intensive.
- If your machine runs hot, use:
  - lower rowCount
  - -t 1
  - fewer iterations (-wi 0 -i 1) for quick checks
- JMH can appear to stay on one iteration for a while whenever each operation is CPU intensive.
