| 🎯 Best Private LogLoss | 📊 Best Public LogLoss | ⚡ Training Time |
|---|---|---|
| 0.38484 | 0.38671 | ~45 min (single epoch) |
This project serves as a laboratory for exploring and synthesizing state-of-the-art architectures in Click-Through Rate (CTR) prediction. Rather than implementing a single traditional model, we focus on Hybrid Architecture Synthesis—combining orthogonal strengths from various seminal research papers into unified, high-performance encoders.
Our primary goal is to investigate how explicit cross-networks, attention-based encoders, and field-level importance gating can be fused to capture complex feature interactions in high-cardinality sparse datasets like Avazu.
Our best submission achieved a Private LogLoss of 0.38484 and Public LogLoss of 0.38671 on the Avazu CTR Prediction competition. Below are the optimal hyperparameters discovered through extensive Optuna-based Bayesian optimization.
💡 Tip: You can modify these parameters in `config.py` to experiment with different configurations.
📋 Full optimal configuration
| Component | Parameter | Value |
|---|---|---|
| Backbone | Type | gated_dcn |
| Diversity | Weight | 0.001177 |
| Feature Bagging | Ratio | 0.827 |
| Aggregation | Method | mean |
DCNv2 Cross Network
| Parameter | Value |
|---|---|
| Enabled | ✅ True |
| Layers | 13 |
| Low Rank | 52 |
| LayerNorm | ✅ True |
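The settings above (13 layers, low rank 52, LayerNorm) correspond to DCNv2-style low-rank cross layers. Below is a minimal sketch of that formulation; the class and parameter names are illustrative, not the repo's actual `src/models/layers` implementation:

```python
import torch
import torch.nn as nn

class LowRankCrossLayer(nn.Module):
    """One DCNv2 cross layer with a low-rank factorization W ≈ U @ V."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.V = nn.Linear(dim, rank, bias=False)   # project down to rank
        self.U = nn.Linear(rank, dim, bias=False)   # project back up
        self.bias = nn.Parameter(torch.zeros(dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        # x_{l+1} = x0 * (U V x_l + b) + x_l  (element-wise cross term)
        cross = x0 * (self.U(self.V(xl)) + self.bias)
        return self.norm(cross + xl)

class CrossNetwork(nn.Module):
    """Stack of cross layers; the tuned config uses 13 layers at rank 52."""
    def __init__(self, dim: int, n_layers: int = 13, rank: int = 52):
        super().__init__()
        self.layers = nn.ModuleList(
            LowRankCrossLayer(dim, rank) for _ in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x0, xl = x, x
        for layer in self.layers:
            xl = layer(x0, xl)
        return xl
```

The low-rank factorization cuts each layer's weight count from `dim²` to `2 · dim · rank`, which is what makes a 13-layer stack affordable.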
Residual MLP
| Parameter | Value |
|---|---|
| Hidden Dims | [1408] |
| Activation | relu |
| Dropout | 0.101 |
| Skip Connections | ✅ True |
| LayerNorm | ✅ True |
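A block matching the table above (single 1408-unit hidden layer, ReLU, dropout 0.101, skip connection, LayerNorm) might look like the following sketch; names are illustrative:

```python
import torch
import torch.nn as nn

class ResidualMLPBlock(nn.Module):
    """MLP block with a skip connection and LayerNorm (illustrative)."""
    def __init__(self, dim: int, hidden: int = 1408, dropout: float = 0.101):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, dim),  # project back so the skip can be added
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.net(x))  # skip connection + LayerNorm
```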
Feature Gating
| Parameter | Value |
|---|---|
| Enabled | ✅ True |
| Activation | gelu |
| Low Rank | None |
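Field-level importance gating in the FiBiNET/SENet style can be sketched as below. This is an assumption-laden illustration: the GELU activation and full-rank gate ("Low Rank: None") follow the table, but the exact squeeze/excitation shape of the repo's `FeatureGating` layer may differ:

```python
import torch
import torch.nn as nn

class FeatureGating(nn.Module):
    """SENet-style field importance gating (illustrative sketch).

    Learns a per-field weight from the field embeddings and rescales them,
    letting the model amplify informative fields and damp noisy ones.
    """
    def __init__(self, n_fields: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(n_fields, n_fields),  # full-rank ("Low Rank: None")
            nn.GELU(),                      # activation from the table
            nn.Linear(n_fields, n_fields),
            nn.Sigmoid(),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, n_fields, emb_dim)
        pooled = emb.mean(dim=-1)           # squeeze: (batch, n_fields)
        weights = self.gate(pooled)         # excitation: per-field weights
        return emb * weights.unsqueeze(-1)  # rescale each field's embedding
```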
| Head | Hidden Dims | Activation | Dropout | LayerNorm |
|---|---|---|---|---|
| 1 | [128] | tanh | 0.455 | ❌ |
| 2 | [32] | tanh | 0.383 | ❌ |
| 3 | [512] | silu | 0.413 | ✅ |
| 4 | [16] | mish | 0.068 | ✅ |
Dense Parameters (AdamW)
| Parameter | Value |
|---|---|
| Learning Rate | 2.234e-4 |
| Weight Decay | 3.203e-5 |
| Warmup Ratio | 0.402 |
| Decay Type | none |
Embedding Parameters (Adagrad)
| Parameter | Value |
|---|---|
| Learning Rate | 0.589 |
| Weight Decay | 0.0 |
| Warmup Ratio | 0.346 |
| Decay Type | linear |
| Min LR | 2.04e-7 |
Training
| Parameter | Value |
|---|---|
| Batch Size | 4096 |
| Epochs | 1 |
| Gradient Clipping | 4.968 |
| AMP | ✅ float16 |
| Compile | ✅ torch.compile |
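The dense/embedding split above uses two optimizers: AdamW for dense weights and Adagrad for the sparse embedding tables. A minimal sketch of that split, using the tuned learning rates (the warmup/decay schedules are omitted for brevity, and the parameter-partitioning logic here is an assumption, not the repo's actual training engine):

```python
import torch
import torch.nn as nn

def build_optimizers(model: nn.Module):
    """Route embedding-table parameters to Adagrad and everything else
    to AdamW, mirroring the hyperparameter tables above (illustrative)."""
    emb_params = []
    for module in model.modules():
        if isinstance(module, nn.Embedding):
            emb_params += list(module.parameters(recurse=False))
    emb_ids = {id(p) for p in emb_params}
    dense_params = [p for p in model.parameters() if id(p) not in emb_ids]

    # Dense: AdamW, lr 2.234e-4, weight decay 3.203e-5
    dense_opt = torch.optim.AdamW(dense_params, lr=2.234e-4,
                                  weight_decay=3.203e-5)
    # Embeddings: Adagrad, lr 0.589, no weight decay
    emb_opt = torch.optim.Adagrad(emb_params, lr=0.589, weight_decay=0.0)
    return dense_opt, emb_opt
```

In the training loop both optimizers are stepped each batch; sparse gradients get Adagrad's per-coordinate adaptivity, which suits high-cardinality embedding tables that only see a few rows per step.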
The laboratory implements and synthesizes ideas from several key research directions:

- DCNv2 — explicit, low-rank cross networks for feature interactions
- FiBiNET — SENet-style field-level importance gating
- STEC — attention-based encoders for CTR prediction
The flagship architecture of this lab is the MultiHeadDiversityModel. It represents our current best attempt at architectural synthesis:
```mermaid
graph TD
    subgraph Input["🔌 Sparse Input"]
        F1[Fields 1..N] --> EMB[Hybrid Embedding Layer]
        EMB --> BAG[Feature Bagging / Masking]
    end
    subgraph Backbone["🧠 Shared Research Backbone"]
        BAG --> FG[Feature Gating Layer]
        FG --> DCN["DCNv2 Cross Layers<br/>(13 layers, rank 52)"]
        DCN --> MLP["Residual MLP<br/>(1408 units)"]
    end
    subgraph DiverseHeads["🎯 Multi-Head Prediction"]
        MLP --> H1["Head 1: tanh<br/>(128 units)"]
        MLP --> H2["Head 2: tanh<br/>(32 units)"]
        MLP --> H3["Head 3: silu<br/>(512 units)"]
        MLP --> H4["Head 4: mish<br/>(16 units)"]
    end
    subgraph Aggregation["🔗 Adaptive Fusion"]
        H1 & H2 & H3 & H4 --> AGG[Mean Aggregation]
        AGG --> OUT[Final CTR Probability]
    end
    subgraph Optimization["📉 Objective Function"]
        OUT --> BCE[BCE Loss]
        H1 & H2 & H3 & H4 --> DIV["Diversity Regularization<br/>(λ = 0.00118)"]
        BCE & DIV --> LOSS[Total Multi-Objective Loss]
    end
```
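The objective at the bottom of the diagram combines BCE on the mean-aggregated prediction with a diversity penalty across heads (λ ≈ 0.00118). One plausible formulation, penalizing pairwise correlation between head logits, is sketched below; the repo's exact regularizer may differ:

```python
import torch
import torch.nn.functional as F

def diversity_loss(head_logits: torch.Tensor) -> torch.Tensor:
    """Mean squared pairwise correlation between head outputs.

    head_logits: (batch, n_heads). Driving correlations toward zero
    encourages the heads to make decorrelated errors (illustrative).
    """
    z = head_logits - head_logits.mean(dim=0, keepdim=True)
    z = z / (z.norm(dim=0, keepdim=True) + 1e-8)  # unit-norm columns
    corr = z.T @ z                                # (n_heads, n_heads)
    n = corr.shape[0]
    mask = ~torch.eye(n, dtype=torch.bool, device=corr.device)
    return corr[mask].pow(2).mean()               # off-diagonal terms only

def total_loss(head_logits: torch.Tensor, labels: torch.Tensor,
               lam: float = 1.177e-3) -> torch.Tensor:
    pred = head_logits.mean(dim=1)                # mean aggregation
    bce = F.binary_cross_entropy_with_logits(pred, labels)
    return bce + lam * diversity_loss(head_logits)
```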
We use Optuna to navigate the vast search space (~34 parameters) of our hybrid architectures. Our advanced tuning script supports:
| Feature | Description |
|---|---|
| 🌳 TPE Sampler | Tree-structured Parzen Estimator for Bayesian search |
| ✂️ MedianPruner | Aggressive early stopping of unpromising trials |
| 💾 SQLite Persistence | Resume large-scale studies across sessions |
| 📊 Real-time Dashboard | Optuna Dashboard for visualization |
```bash
# Launch a 100-trial optimization study
python misc/tune_hyperparams.py --n-trials 100 --timeout 28800
```

Key search dimensions:

- 🔢 Interaction Depth: Number of DCN layers vs. Transformer layers
- 🎛️ Diversity Calibration: Tuning the weight of diversity regularization
- 🎨 Per-Head Hyperparameters: Individual activation functions and skip-connection strategies
- 📐 Embedding Dynamics: Adaptive learning rates for sparse vs. dense parameters
```
avazu-ctr/
├── 📂 src/
│   ├── 📂 models/
│   │   ├── 📂 architectures/ # Full hybrid implementations (STEC, MultiHeadDiversity, GatedDCN)
│   │   └── 📂 layers/        # Primitive research blocks (CrossNetwork, SENet, FeatureGating)
│   ├── 📂 training/          # Training engine with hybrid optimizer support
│   └── 📂 config_types/      # Type definitions for configuration validation
├── 📂 misc/                  # Research tools (tune_hyperparams.py, EDA scripts)
├── 📂 papers/                # Foundational research papers
├── 📂 data/                  # Raw and processed datasets
├── 📄 pyproject.toml         # Project config & dependencies (uv)
├── 📄 uv.lock                # Locked dependency versions
├── 📄 config.py              # Best hyperparameter configuration
├── 📄 data_processor.py      # Polars-based streaming data pipeline
└── 📄 train.py               # Main training entry point
```
This project uses uv for fast, reliable dependency management.
```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Sync dependencies (PyTorch CUDA 13.0). For CPU-only, omit the env var.
UV_TORCH_BACKEND=cu130 uv sync --extra dev
```

```bash
# Blazing fast Polars-based streaming processing
uv run python data_processor.py
```

```bash
# 1. Start a tuning study to find architectural sweet spots
uv run python misc/tune_hyperparams.py --n-trials 50

# 2. Train the full model with best config
uv run python train.py

# 3. Analyze results via TensorBoard
uv run tensorboard --logdir=runs
```

```bash
# Run tests
uv run pytest

# Format and lint
uv run ruff format . && uv run ruff check .

# Type check
uv run ty check
```

- Foundation: Avazu CTR Prediction Dataset
- Architecture: Synthesized from DCNv2, FiBiNET, and STEC papers
- Tools: Built with PyTorch, Polars, and Optuna
Licensed under the MIT License
Built with ❤️ for the CTR research community
