
TextBench — Full-Text Search Benchmark

A reproducible benchmark comparing ClickHouse and Elasticsearch for full-text search (FTS) workloads over log data at scale (1B, 10B, 50B rows).


Goal

Produce a fair, reproducible benchmark measuring query latency and storage efficiency for FTS workloads typical of log analytics. The benchmark focuses exclusively on capabilities available in both systems.

Included:

  • Single-term queries
  • AND / OR boolean queries
  • FTS combined with structured predicates (service, severity, time range)
  • Aggregations driven by FTS filters (faceting, GROUP BY, time histograms)

Excluded:

  • Elasticsearch-only features (phrase matching, wildcards)

Dataset

Synthetic OpenTelemetry logs generated by the OpenTelemetry demo application.

| Property | Value |
|---|---|
| Total rows | 50 billion |
| Time range | 2025-09-23 00:00:00 – 2025-09-23 23:59:59 (one day) |
| Format | Parquet (1B rows per file, sorted by `Timestamp`) |
| S3 bucket | `s3://public-pme/text_bench/` (public, region: eu-west-3) |
| Files | `part_000.parquet` – `part_049.parquet` |

Schema (OpenTelemetry log record)

| Column | Type | Notes |
|---|---|---|
| Timestamp | DateTime | Primary key component |
| TraceId | String | |
| SpanId | String | |
| TraceFlags | UInt8 | |
| SeverityText | LowCardinality(String) | |
| SeverityNumber | UInt8 | |
| ServiceName | LowCardinality(String) | Primary key component |
| Body | String | Full-text indexed field |
| ResourceAttributes | Map(String, String) | |
| ScopeAttributes | Map(String, String) | |
| LogAttributes | Map(String, String) | |
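The schema above might map to a ClickHouse table along the following lines. This is a hedged sketch only — the benchmark's real DDL is produced by `clickhouse/create_and_load.sh`, and the text-index type and parameters shown here are assumptions, not the benchmark's actual settings:

```sql
-- Hypothetical DDL sketch; see clickhouse/ for the DDL actually benchmarked.
CREATE TABLE otel_logs
(
    Timestamp          DateTime,
    TraceId            String,
    SpanId             String,
    TraceFlags         UInt8,
    SeverityText       LowCardinality(String),
    SeverityNumber     UInt8,
    ServiceName        LowCardinality(String),
    Body               String,
    ResourceAttributes Map(String, String),
    ScopeAttributes    Map(String, String),
    LogAttributes      Map(String, String),
    -- Text index on Body to serve hasToken / hasAllTokens / hasAnyTokens;
    -- the index type and tokenizer below are an assumption.
    INDEX body_text_idx Body TYPE text(tokenizer = 'default') GRANULARITY 1
)
ENGINE = MergeTree
-- Ordering matches the two "primary key component" columns noted above.
ORDER BY (ServiceName, Timestamp);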

Benchmark Matrix

| Scale | Files | Rows | Time range covered |
|---|---|---|---|
| 1B | part_000 | 1,000,000,000 | 00:00:00 – 00:29:24 |
| 10B | part_000 – part_009 | 10,000,000,000 | 00:00:00 – 04:53:57 |
| 50B | part_000 – part_049 | 50,000,000,000 | 00:00:00 – 23:59:59 |

Deployment Environment

  • Cloud: AWS
  • Instance type: m6i.8xlarge (32 vCPU, 128 GiB RAM)
  • Storage: GP3 EBS - 16K IOPS
  • Region: eu-west-3
  • Topology: Single node (1× ClickHouse, 1× Elasticsearch — separate instances)

Benchmark Procedure

For each dataset scale:

  1. Ingest the dataset
  2. Restart the service (clears in-memory state)
  3. Run each query 3 times, recording:
    • Cold run: run 1 (after restart, cold OS page cache)
    • Hot run: best of runs 2–3 (warm OS page cache)

Metrics captured:

  • Index size on disk (data size + text index size)
  • Query elapsed time (cold and hot) for each query

Queries (Q1–Q9)

| # | Description | FTS pattern | Result shape |
|---|---|---|---|
| Q1 | Fetch rows — AND, service filter + time-bounded 30 min | AND | fetch + sort |
| Q2 | count() — single token, full corpus | single token | count() |
| Q3 | GROUP BY service + count — OR, full corpus | OR | aggregation |
| Q4 | Date histogram + count — single token, service filter | single token | histogram |
| Q5 | GROUP BY service + count — AND, full corpus | AND | aggregation |
| Q6 | Fetch rows — complex boolean, service + severity + time-bounded 30 min | AND + OR | fetch + sort |
| Q7 | count() — OR 5 high-frequency tokens, full corpus | OR | count() |
| Q8 | Fetch rows — OR 5 tokens, full corpus | OR | fetch + sort |
| Q9 | Date histogram 1hr + count — AND, full corpus | AND | histogram |

ClickHouse queries (clickhouse/queries.sql)

```sql
-- Q1: AND, time-bounded
SELECT Timestamp, ServiceName, SeverityText, Body FROM otel_logs
WHERE ServiceName = 'checkout'
  AND hasAllTokens(Body, ['failed', 'order'])
  AND Timestamp >= '2025-09-23 00:00:00' AND Timestamp < '2025-09-23 00:30:00'
ORDER BY Timestamp DESC LIMIT 100;

-- Q2: single token count
SELECT count() FROM otel_logs WHERE hasToken(Body, 'timeout');

-- Q3: OR 3 tokens, GROUP BY service
SELECT ServiceName, count() AS cnt FROM otel_logs
WHERE hasAnyTokens(Body, ['exception', 'timeout', 'failed'])
GROUP BY ServiceName ORDER BY cnt DESC;

-- Q4: date histogram, single token, service filter
SELECT toStartOfHour(Timestamp) AS hour, count() AS cnt FROM otel_logs
WHERE ServiceName = 'checkout' AND hasToken(Body, 'payment')
GROUP BY hour ORDER BY hour;

-- Q5: AND, GROUP BY service
SELECT ServiceName, count() AS cnt FROM otel_logs
WHERE hasAllTokens(Body, ['connection', 'reset'])
GROUP BY ServiceName ORDER BY cnt DESC;

-- Q6: complex boolean, time-bounded
SELECT Timestamp, ServiceName, SeverityText, Body FROM otel_logs
WHERE ServiceName = 'frontend'
  AND (hasAllTokens(Body, ['connection', 'reset']) OR hasToken(Body, 'timeout'))
  AND SeverityNumber >= 13
  AND Timestamp >= '2025-09-23 00:00:00' AND Timestamp < '2025-09-23 00:30:00'
ORDER BY Timestamp DESC LIMIT 100;

-- Q7: OR 5 tokens count
SELECT count() FROM otel_logs
WHERE hasAnyTokens(Body, ['error', 'exception', 'failed', 'timeout', 'refused']);

-- Q8: OR 5 tokens fetch
SELECT Timestamp, ServiceName, SeverityText, Body FROM otel_logs
WHERE hasAnyTokens(Body, ['error', 'exception', 'failed', 'timeout', 'refused'])
ORDER BY Timestamp DESC LIMIT 100;

-- Q9: date histogram, AND, full corpus
SELECT toStartOfHour(Timestamp) AS hour, count() AS cnt FROM otel_logs
WHERE hasAllTokens(Body, ['connection', 'reset'])
GROUP BY hour ORDER BY hour;
```
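For comparison, the Elasticsearch equivalent of Q2 might be a request body like the following sent to the `_count` API. This is a sketch only — the DSL queries actually benchmarked live under `elasticsearch/`, and the index name is whatever `create_indexes.sh` defines:

```json
{
  "query": {
    "match": { "Body": "timeout" }
  }
}
```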

Reproducing the Benchmark

ClickHouse

```shell
cd clickhouse

# Download parquet files from S3 (example: 10B = parts 0–9)
mkdir -p /data/text_bench
for i in $(seq -w 0 9); do
  wget -q -O /data/text_bench/part_0${i}.parquet \
    https://public-pme.s3.eu-west-3.amazonaws.com/text_bench/part_0${i}.parquet
done

# Create table and load data (10B = 10 files, 4 parallel workers)
./create_and_load.sh text_bench_10B otel_logs /data/text_bench 10 success.log error.log 4

# Run benchmark — result written to results/results_10B.json
./run_queries.sh \
  --query-file queries_10.sql \
  --database text_bench_10B \
  --dataset-size 10000000000 \
  --machine "m6i.8xlarge, 10000gib gp3" \
  > results/results_10B.json
```

Elasticsearch

```shell
cd elasticsearch

# Install and start Elasticsearch
./install.sh
./start.sh

# Create indices for 10B (parts 0–9)
./create_indexes.sh 0 9

# Download from S3 and ingest
./load_data.sh 0 9

# Run DSL benchmark — result written to results/m6i.8xlarge_otel_logs_10b.json
./benchmark.sh 10b

# Run ES|QL benchmark — result written to results/m6i.8xlarge_otel_logs_10b_esql.json
./benchmark_esql.sh 10b
```

To run all three scales:

```shell
for scale in 1b 10b 50b; do
  ./benchmark.sh $scale
  ./benchmark_esql.sh $scale
done
```

To capture detailed disk usage per field (slow — ~1 min/shard):

```shell
./disk_usage.sh 50b   # writes results/m6i.8xlarge_otel_logs_50b_disk_usage.json
```

The S3 bucket is public and accessible without credentials:

```shell
aws s3 cp s3://public-pme/text_bench/part_000.parquet . --no-sign-request
# or
wget https://public-pme.s3.eu-west-3.amazonaws.com/text_bench/part_000.parquet
```

Result format

Each benchmark run emits a single JSON result file:

```json
{
  "system": "ClickHouse",
  "version": "26.3.1",
  "date": "2026-04-20",
  "machine": "m6i.8xlarge, 10000gib gp3",
  "dataset_size": 10000000000,
  "total_size": 995602681520,
  "data_size": 990689862160,
  "text_index_size": 4912812010,
  "result": [
    [0.306, 0.069, 0.065],
    ...
  ]
}
```
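Per the benchmark procedure, each inner array in `result` holds the three timings for one query: run 1 is the cold measurement, and the hot number is the best of runs 2–3. A small self-contained sketch of how one might derive cold/hot latencies from such a file (the `cold_hot` helper is illustrative, not part of the repo):

```python
import json

def cold_hot(result_doc: dict) -> list[tuple[float, float]]:
    """Return (cold, hot) per query: run 1 is cold, hot is the best of runs 2-3."""
    return [(runs[0], min(runs[1:])) for runs in result_doc["result"]]

# Example using the result structure shown above.
doc = json.loads("""
{
  "system": "ClickHouse",
  "dataset_size": 10000000000,
  "result": [[0.306, 0.069, 0.065]]
}
""")
print(cold_hot(doc))  # -> [(0.306, 0.065)]
```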
