Heretic

Current

Heretic

Heretic is an open-source tool that automates the removal of safety alignment from transformer language models using directional ablation and parameter optimization, making dealignment an accessible and reproducible operation.

Currency ID heretic

Date Mar 15, 2026

Language English

Last reviewed Mar 23, 2026

Signal

Heretic · GitHub

Context

Heretic automates the process of removing safety alignment from open-weight transformer models — the guardrails, refusal behaviors, and content restrictions installed during RLHF and instruction tuning. It uses directional ablation (identifying and suppressing the model directions associated with refusal behavior) combined with parameter optimization to produce a version of the model with alignment constraints substantially reduced. At 14.1k stars, it has achieved significant community adoption, indicating that dealignment is not a niche concern but a mainstream capability in the open-weights ecosystem.

Relevance

Heretic is a direct and concrete challenge to the assumption that open-weight model release can be made safe through alignment training. If alignment can be reliably removed with a Python script, then safety properties baked in during training are not durable — they are a starting condition that can be modified by anyone with access to the weights and a GPU. This changes the governance calculus for open-weight model release: the question is no longer whether a released model is safe, but whether its safety properties will survive contact with the community.

For Openflows, this is a necessary signal. Understanding the open-source AI ecosystem requires understanding what practitioners do with open models — and dealignment is demonstrably what a significant portion of the community does.

Current State

Active project with 14.1k stars. AGPL-3.0 license means any modifications must be open-sourced. Python implementation. Directional ablation and parameter optimization are both documented as supported methods. Compatible with a range of open-weight transformer architectures.

Open Questions

How durable are the alignment removals produced by Heretic — do they persist under fine-tuning, quantization, or other post-processing?
What are the governance implications for model providers who release open weights knowing that tools like Heretic exist and are widely used?
Does widespread dealignment capability change the calculus for which models should be released as open weights at all?
How does the research community studying alignment robustness use tools like Heretic — is it primarily a red-teaming tool, a jailbreak tool, or something else?

Connections

Heretic is only possible because open-weight models exist — it depends on the commons that the Open Weights Commons circuit describes. Its existence raises the core governance question of that circuit: what does responsible open release mean when release enables modification? The Inspectable Agent Operations circuit assumes that agent behavior can be audited; Heretic demonstrates that the model layer itself can be modified out from under those assumptions. The Autonomous Research Accountability circuit's concern about who is responsible for model behavior becomes acute when the model has been dealigned by a third party and the original developer's safety training no longer applies.

Updates

2026-03-23: As of the latest assessment, the repository star count has increased from 14.1k to 16.5k, indicating sustained growth in community adoption. This update ensures the Context and Current State sections accurately reflect the project's current popularity.

Openflows

Heretic

Signal

Context

Relevance

Current State

Open Questions

Connections

Updates

Connections

Linked from

External references

Mediation note