API for Open LLMs

Current

API for Open LLMs

Provides an OpenAI-compatible API wrapper for diverse open-source language models, standardizing inference access across heterogeneous model families.

Currency ID api-for-open-llm

Date Mar 14, 2026

Language English

Last reviewed Mar 21, 2026

Signal

API for Open LLMs · GitHub repository xusenlinzy/api-for-open-llm. Date: 2026-03-13. Content: Python library implementing a unified backend interface for open large language models that mimics the OpenAI response format. Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, and variants. Includes support for Rerank models and multimodal capabilities (GLM-4V, MiniCPM). Provides Streamlit demos and environment-variable configuration for model switching · 2026-03-13

Context

The proliferation of open-weight models has resulted in fragmented inference interfaces, requiring distinct client implementations for each model family. This fragmentation increases operational overhead for developers building agent workflows or applications that require model portability. Standardizing the interface layer allows existing OpenAI-compatible clients to interact with locally hosted open models without code modification.

Relevance

This entry represents infrastructure standardization within the local inference layer. By exposing a consistent API contract, it reduces dependency on specific model providers and facilitates the integration of open models into existing tooling ecosystems. It supports the operational goal of maintaining control over the inference stack while leveraging open weights.

Current State

The project is actively maintained with recent updates for QWEN2, GLM-4V, and MiniCPM-Llama3. It functions as a Python server that wraps underlying model loaders (e.g., transformers, llama.cpp) behind a RESTful endpoint. It supports Docker deployment and includes features for chat completion, embeddings, and reranking.

Open Questions

Does the abstraction layer introduce significant latency compared to native inference calls?
How does the project handle sandboxing for code execution tools when integrated into agent workflows?
What is the long-term maintenance commitment given the rapid iteration of model architectures?

Connections

xinference: Offers a competing unified inference API; selection depends on production requirements versus lightweight scripting needs.
ollama: Provides similar local API functionality; api-for-open-llm may offer broader model support or specific configuration flexibility.
local-inference-baseline: This tool serves as a concrete implementation of the circuit's requirement for standardized local inference interfaces.

Openflows

API for Open LLMs

Signal

Context

Relevance

Current State

Open Questions

Connections

Connections

Linked from

External references

Mediation note