xllm

Current

xllm is an Apache 2.0 licensed high-performance inference engine for LLMs optimized for diverse AI accelerators including NVIDIA and Ascend.

Currency ID xllm

Date Mar 23, 2026

Language English

Signal

xllm · 2026-03-23

GitHub repository jd-opensource/xllm published on 2026-03-23. A high-performance inference engine for LLMs, optimized for diverse AI accelerators. Supports DeepSeek, GLM, Qwen, and other model families. Apache 2.0 license. Technical report available via arXiv.

Context

xllm represents JD.com's open-source contribution to the LLM serving infrastructure stack. It targets multi-accelerator environments, specifically highlighting support for NVIDIA GPUs and domestic hardware like Ascend. The project provides day-0 support for recent model releases including GLM-5 and GLM-4.7, indicating active maintenance aligned with upstream model releases.

Relevance

The entry maps to the infrastructure layer dedicated to model serving and inference optimization. It competes or complements existing high-throughput engines by offering hardware-agnostic deployment paths. This reduces vendor lock-in for operators requiring specific accelerator support beyond standard CUDA ecosystems.

Current State

The project is in active development with published documentation and Docker images. It supports multiple model families including Qwen, GLM, and DeepSeek. The codebase is licensed under Apache 2.0, facilitating integration into commercial and open-source pipelines. Technical reports suggest optimizations for memory bandwidth and kernel fusion specific to target hardware.

Open Questions

Performance benchmarks against vLLM and SGLang across heterogeneous hardware remain to be validated. Community adoption metrics and contribution guidelines require observation. Specific support matrix for non-NVIDIA accelerators beyond Ascend is not fully detailed in the initial signal.

Connections

xllm operates within the same functional domain as vllm and sglang, serving as an alternative inference runtime. xinference provides a broader API abstraction layer that may integrate xllm as a backend. qwen3-5-ollama-local-deployment shares the Qwen model family focus, though xllm targets broader serving scenarios.

Openflows

xllm

Signal

Context

Relevance

Current State

Open Questions

Connections

Connections

Linked from

External references

Mediation note