[Feature Request] Native MAX Serving Support for parasail-ai/GritLM-7B-vllm (GRITLM Architecture) by Tharun-Kumar-McW · Pull Request #6713 · modular/modular · GitHub
[Feature Request] Native MAX Serving Support for parasail-ai/GritLM-7B-vllm (GRITLM Architecture) #6713

Open
Tharun-Kumar-McW wants to merge 7 commits into modular:main from Tharun-Kumar-McW:my-fix
@Tharun-Kumar-McW

Linked issue

Fixes #6684

Type of change

  • New feature or public API

Motivation

GritLM-7B is a Mistral-7B-based model trained with Generative Representational
Instruction Tuning (GRIT), enabling both high-quality text generation and dense
vector embeddings from a single model. It is widely used in retrieval-augmented
generation (RAG) pipelines and semantic search applications.

MAX had no native support for the GritLM architecture class. This PR adds an
implementation so that parasail-ai/GritLM-7B-vllm can be served directly via
max serve, with no custom flags needed once the architecture is registered.


What changed

Added max/python/max/pipelines/architectures/gritlm/ — a new ModuleV3
architecture package for the GritLM model family.

New files:

| File | Purpose |
| --- | --- |
| __init__.py | Exports ARCHITECTURES = [gritlm_arch] for MAX loader discovery |
| arch.py | SupportedArchitecture registration; the name matches the architectures field in config.json |
| model_config.py | GritLMConfig dataclass; parses the HuggingFace config, including sliding_window |
| gritlm.py | GritLM / GritLMTextModel ModuleV3 graph (CausalLM path only) |
| model.py | GritLMModel (PipelineModelWithKVCache wrapper), input preparation, output unpacking |
| weight_adapters.py | Remaps model.* to language_model.*, drops pooling head weights, casts dtype |
| layers/attention.py | GritLMAttention; GQA with SLIDING_WINDOW_CAUSAL_MASK on every layer |
| layers/transformer_block.py | GritLMTransformerBlock; standard pre-norm decoder block |
| BUILD.bazel | Defines the gritlm Python library and its dependencies for Bazel builds |
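
To illustrate the weight_adapters.py row, here is a minimal sketch of the key remapping it describes, written against plain dictionaries. The function name, the prefix constants, and the choice to pass through keys outside model.* unchanged are assumptions made for illustration; the real adapter operates on MAX weight objects and also casts dtype, which is omitted here.

```python
from typing import Any, Dict

# Hypothetical constants: the prefixes come from the PR description,
# but the names used in weight_adapters.py may differ.
HF_PREFIX = "model."
MAX_PREFIX = "language_model."
POOLING_HEAD_PREFIX = "gritlm_pooling_head."


def convert_state_dict(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Rename checkpoint keys and drop the embedding pooling head."""
    converted: Dict[str, Any] = {}
    for name, tensor in state_dict.items():
        # The pooling head is only used for embeddings, not for CausalLM.
        if name.startswith(POOLING_HEAD_PREFIX):
            continue
        # Rewrite only the leading "model." prefix.
        if name.startswith(HF_PREFIX):
            name = MAX_PREFIX + name[len(HF_PREFIX):]
        # Assumption: any other key (e.g. lm_head.weight) is kept as-is.
        converted[name] = tensor
    return converted
```

Calling convert_state_dict on a checkpoint that contains model.embed_tokens.weight and gritlm_pooling_head.weight would yield language_model.embed_tokens.weight and no pooling head entry.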

Architecture highlights:

  • Mistral-7B backbone: 32 layers, hidden_size=4096, GQA (32 Q / 8 KV heads),
    SwiGLU MLP, RMSNorm.
  • Sliding window attention on all 32 layers (sliding_window=4096) using
    flash_attention_ragged with MHAMaskVariant.SLIDING_WINDOW_CAUSAL_MASK.
  • Standard Mistral RoPE (rope_theta=10000, no scaling).
  • Separate lm_head (tie_word_embeddings=false).
  • CausalLM path only: gritlm_pooling_head.weight is dropped by the
    weight adapter because MAX serves only text generation for this model.
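
The highlights above can be made concrete with a small sketch. It parses the listed hyperparameters from a HuggingFace-style config dictionary, derives the per-head dimension and the GQA grouping, and expresses sliding-window causal masking as a predicate. The class and function names are hypothetical and are not taken from the PR, and the window boundary convention (the window counts the current token) is an assumption; the PR does not state which convention the MAX kernel uses.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class GritLMConfigSketch:
    """Hypothetical stand-in for GritLMConfig; field names follow HF config keys."""
    num_hidden_layers: int
    hidden_size: int
    num_attention_heads: int
    num_key_value_heads: int
    sliding_window: int
    rope_theta: float
    tie_word_embeddings: bool

    @classmethod
    def from_hf_dict(cls, cfg: Dict[str, Any]) -> "GritLMConfigSketch":
        # Pick only the keys this sketch knows about; ignore everything else.
        return cls(**{f.name: cfg[f.name] for f in fields(cls)})

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_attention_heads

    @property
    def q_heads_per_kv_head(self) -> int:
        return self.num_attention_heads // self.num_key_value_heads


def kv_head_for_query_head(cfg: GritLMConfigSketch, q_head: int) -> int:
    """GQA: consecutive query heads share one key/value head."""
    return q_head // cfg.q_heads_per_kv_head


def can_attend(query_pos: int, key_pos: int, window: int) -> bool:
    """Sliding-window causal mask: no future keys, none older than the window."""
    return key_pos <= query_pos and query_pos - key_pos < window


# Values as listed in the PR description.
HF_CONFIG_EXAMPLE = {
    "num_hidden_layers": 32,
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "sliding_window": 4096,
    "rope_theta": 10000.0,
    "tie_word_embeddings": False,
}
cfg = GritLMConfigSketch.from_hf_dict(HF_CONFIG_EXAMPLE)
```

With these values, each head has dimension 4096 / 32 = 128 and every key/value head is shared by 32 / 8 = 4 query heads.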

Testing

Verified on parasail-ai/GritLM-7B-vllm with GPU serving:

max serve \
  --model-path parasail-ai/GritLM-7B-vllm \
  --custom-architectures architectures/gritlm \
  --max-batch-size 256 \
  --max-length 4096 \
  --quantization-encoding bfloat16

Smoke test (the model generates a correct answer):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"parasail-ai/GritLM-7B-vllm",
       "messages":[{"role":"user","content":"What is 2+2?"}],
       "max_tokens":32,"temperature":0}'

Output:

2+2 is equal to 4.
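
For scripted checks, the same smoke test can be sketched in Python with only the standard library. The helper names are hypothetical, and the response parsing assumes the endpoint returns the usual OpenAI-style chat completion shape (choices[0].message.content), which the PR does not show explicitly.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "parasail-ai/GritLM-7B-vllm",
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request; requires a running max serve instance."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]
```

With the server from the max serve command running, ask("What is 2+2?") sends the same request as the curl call.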

GSM8K accuracy vs vLLM reference:

| Model | Task | Accuracy | vs Reference |
| --- | --- | --- | --- |
| parasail-ai/GritLM-7B-vllm | gsm8k_cot_llama | 0.506 | 98.2% |
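
The reference score itself is not listed. Assuming the "vs Reference" column is the ratio of the MAX accuracy to the vLLM accuracy (an interpretation, not something the PR states), the implied reference can be recovered with a short calculation. The function name is hypothetical.

```python
def implied_reference_accuracy(measured: float, ratio_percent: float) -> float:
    """Invert ratio = measured / reference to recover the reference score."""
    return measured / (ratio_percent / 100.0)


# Values from the table above; the result is derived, not reported in the PR.
reference = implied_reference_accuracy(0.506, 98.2)
gap = reference - 0.506
```

Under that assumption the vLLM reference would be roughly 0.515, a gap of about one percentage point.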

Checklist

  • The linked issue above has been reviewed by a maintainer and is agreed-upon,
    or this is a trivial fix that does not need prior approval
  • PR is small and focused — single new architecture, no changes to existing
    architectures
  • I ran ./bazelw run format to format my changes
  • I added or updated tests to cover my changes
  • If AI tools assisted with this contribution, I have included an
    Assisted-by: trailer in my commit message or this PR description

Assisted-by: AI

@Tharun-Kumar-McW Tharun-Kumar-McW requested a review from a team as a code owner June 23, 2026 11:09
@github-actions

github-actions Bot commented Jun 23, 2026


All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

modular-cla-bot Bot added a commit to modular/cla that referenced this pull request Jun 23, 2026