LLM-Pruner

LLM-Pruner implements structural pruning methods to reduce large language model size while maintaining performance across supported architectures including Llama and BLOOM.

Currency ID llm-pruner

Date Mar 14, 2026

Language English

Last reviewed Mar 21, 2026

Signal

LLM-Pruner · GitHub repository horseee/LLM-Pruner

Reference: NeurIPS 2023 paper "LLM-Pruner: On the Structural Pruning of Large Language Models". License: Apache 2.0. Primary Dependencies: PyTorch >= v1.7.1.

Context

Structural pruning removes whole structures, such as neurons, attention heads, or entire layers, from the model architecture rather than relying solely on quantization or distillation. Because the removed weights are deleted outright, the parameter count and memory footprint shrink at the structural level. This can enable deployment on hardware with strict memory constraints, and it offers an alternative to aggressive quantization, which often degrades accuracy; pruning has its own accuracy cost, so the two should be compared at the target size.
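To make the idea of removing coupled structures concrete, here is a minimal sketch in plain Python. It is not LLM-Pruner's own code. It prunes hidden neurons from a toy two-layer MLP, deleting each neuron's row in the first weight matrix together with its column in the second so that the shapes stay consistent. The function names (neuron_importance, prune_hidden_neurons) are hypothetical, and the L2-norm importance score is a stand-in assumption; the LLM-Pruner paper estimates importance from gradient information instead.

```python
import math

def neuron_importance(w_in, w_out):
    """Score each hidden neuron by the L2 norm of its incoming and outgoing weights.

    w_in:  hidden x input matrix (row i feeds hidden neuron i)
    w_out: output x hidden matrix (column i reads hidden neuron i)
    """
    scores = []
    for i, row in enumerate(w_in):
        incoming = sum(v * v for v in row)
        outgoing = sum(out_row[i] ** 2 for out_row in w_out)
        scores.append(math.sqrt(incoming + outgoing))
    return scores

def prune_hidden_neurons(w_in, w_out, ratio):
    """Remove the lowest-scoring fraction `ratio` of hidden neurons.

    Magnitude-based scoring is an assumption made for this sketch only.
    """
    hidden = len(w_in)
    n_remove = int(hidden * ratio)
    scores = neuron_importance(w_in, w_out)
    # Rank neurons from least to most important, then keep the top ones
    # in their original order.
    ranked = sorted(range(hidden), key=lambda i: scores[i])
    keep = sorted(ranked[n_remove:])
    # Delete the coupled structures together: row i of w_in, column i of w_out.
    pruned_in = [w_in[i] for i in keep]
    pruned_out = [[row[i] for i in keep] for row in w_out]
    return pruned_in, pruned_out, keep

# Toy MLP: 2 inputs -> 4 hidden neurons -> 1 output (illustrative weights).
w_in = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.01]]
w_out = [[1.2, 0.05, -0.7, 0.02]]

pruned_in, pruned_out, kept = prune_hidden_neurons(w_in, w_out, ratio=0.5)
print(kept)
print(len(pruned_in), len(pruned_out[0]))
```

The key design point is that a neuron is never removed from one matrix alone: every weight that reads from or writes to it goes at the same time, which is what distinguishes structural pruning from zeroing individual weights.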

Relevance

As model sizes scale beyond local inference capabilities, structural optimization becomes critical for edge deployment and cost reduction. This tool provides a method to compress models like Llama-3 and BLOOM while preserving architectural integrity, supporting the infrastructure goal of making frontier models accessible on constrained hardware.
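As a rough illustration of why parameter count matters on constrained hardware, the sketch below estimates weight memory before and after pruning. The 7-billion-parameter count, the 20% pruning ratio, and the 2 bytes per parameter (16-bit weights) are illustrative assumptions rather than measurements from LLM-Pruner, and the helpers weight_memory_gib and params_after_pruning are hypothetical. The estimate covers weights only and ignores activations and the KV cache.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

def params_after_pruning(num_params, pruning_ratio):
    """Parameter count left after removing `pruning_ratio` of the parameters."""
    if not 0.0 <= pruning_ratio < 1.0:
        raise ValueError("pruning_ratio must be in [0, 1)")
    return round(num_params * (1.0 - pruning_ratio))

base_params = 7_000_000_000  # assumed model size, for illustration only
ratio = 0.20                 # assumed pruning ratio
pruned_params = params_after_pruning(base_params, ratio)

print(f"before: {weight_memory_gib(base_params):.2f} GiB")
print(f"after:  {weight_memory_gib(pruned_params):.2f} GiB")
```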

Current State

The implementation supports PyTorch-based architectures including Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baichuan, TinyLlama, and ChatGLM. The pruning process is designed to be compatible with existing model weights and training pipelines, allowing for post-training compression without full retraining.

Open Questions

Accuracy retention at high pruning ratios across diverse model families remains uncertain. The stability of pruned models under long-context inference, compared with quantized counterparts, requires further empirical validation. Integration with dynamic serving engines such as vLLM needs explicit testing to confirm compatibility with continuous batching.
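One common way to quantify accuracy retention is to compare perplexity on the same held-out text before and after pruning. The sketch below shows only the arithmetic, under stated assumptions: the helper names are hypothetical, and the per-token log-probabilities are placeholder values, not results from any real model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_increase(baseline_ppl, pruned_ppl):
    """Fractional perplexity increase of the pruned model over the baseline."""
    return (pruned_ppl - baseline_ppl) / baseline_ppl

# Placeholder per-token log-probabilities, not real measurements.
dense_log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
pruned_log_probs = [math.log(0.25), math.log(0.25), math.log(0.25)]

dense_ppl = perplexity(dense_log_probs)
pruned_ppl = perplexity(pruned_log_probs)
print(dense_ppl, pruned_ppl, relative_increase(dense_ppl, pruned_ppl))
```

Repeating this comparison across several pruning ratios and model families would give the retention curve that the first open question asks about.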

Connections

This entry connects to airllm as a structural compression alternative to memory optimization techniques. It relates to unsloth-fine-tuning as a complementary optimization strategy for VRAM reduction. It connects to vllm as a potential inference serving engine for deployed pruned models.
