DeepSWE: Open-Source Programming Agent Benchmark

Current

Datacurve releases DeepSWE, an open-source benchmark evaluating long-horizon trajectory planning and complex codebase editing capabilities of frontier models across TypeScript, Go, Python, JavaScript, and Rust.

Currency ID deepswere-open-source-programming-agent-benchmark

Date May 28, 2026

Language English

Signal

@0xLogicrw: AI infrastructure startup Datacurve has announced DeepSWE, an open-source programming agent benchmark that evaluates the autonomous programming capabilities of frontier large models on very long, complex, real-world software engineering tasks. · twitter · 2026-05-27

Datacurve has released DeepSWE, an open-source benchmark designed to evaluate the autonomous programming capabilities of frontier large language models. Unlike conventional benchmarks that focus on isolated issue resolution, DeepSWE targets long-horizon trajectory planning and complex, multi-file codebase editing. The initial release comprises 113 tasks spanning TypeScript, Go, Python, JavaScript, and Rust, establishing a standardized evaluation surface for agentic software development workflows.

Context

Autonomous coding agents have shifted from single-turn code generation to multi-step, stateful repository manipulation. Existing evaluation suites are often fragmented across narrow task types or rely on synthetic repositories, and so fail to capture the dependency management, architectural refactoring, and iterative debugging required in production environments. DeepSWE responds directly to this gap by structuring evaluation around sustained agent interaction with real-world codebases. This aligns with a broader infrastructure pattern in which agent capabilities are measured not by static prompt adherence but by the ability to maintain context, track state changes, and execute coherent multi-step workflows across heterogeneous language ecosystems.

Relevance

DeepSWE functions as a measurement layer for the agentic software development infrastructure circuit. By standardizing how long-horizon coding tasks are defined, executed, and scored, it provides a reproducible baseline for comparing model routing, tool-use fidelity, and memory retention strategies. The benchmark's focus on trajectory planning directly intersects with specification-driven orchestration and declarative skill packaging, offering empirical data on where current agent frameworks succeed or fail in maintaining execution continuity without human intervention.

Current State

The benchmark is currently in its initial release phase, containing 113 curated tasks across five major programming languages. Datacurve positions it as a technical alternative to SWE-Bench Pro, emphasizing depth over breadth in repository complexity. Early evaluation cycles will likely focus on measuring token efficiency, error recovery rates, and structural integrity preservation during multi-file modifications. The open-source nature of the benchmark allows framework developers to integrate it directly into CI pipelines, enabling continuous capability tracking rather than one-time model comparisons.
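
As an illustration of continuous capability tracking in a CI pipeline, the following is a minimal sketch, not DeepSWE's actual tooling. It assumes each benchmark run can be reduced to one record per task with a task ID, a language, and a pass/fail flag. The `TaskResult` type and the `pass_rate_by_language` and `regressions` helpers are hypothetical names introduced here; the source does not describe DeepSWE's real output schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task record; DeepSWE's real output format may differ."""
    task_id: str
    language: str
    passed: bool


def pass_rate_by_language(results):
    """Return {language: fraction of that language's tasks that passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r.language] += 1
        passes[r.language] += int(r.passed)
    return {lang: passes[lang] / totals[lang] for lang in totals}


def regressions(baseline, current, tolerance=0.0):
    """List languages whose pass rate fell by more than `tolerance`.

    A language missing from `current` is treated as a pass rate of 0.0.
    """
    return sorted(
        lang
        for lang, base in baseline.items()
        if current.get(lang, 0.0) < base - tolerance
    )


# Illustrative data only: two runs over the same three made-up tasks.
previous_run = [
    TaskResult("task-1", "python", True),
    TaskResult("task-2", "python", True),
    TaskResult("task-3", "go", False),
]
latest_run = [
    TaskResult("task-1", "python", True),
    TaskResult("task-2", "python", False),
    TaskResult("task-3", "go", False),
]

dropped = regressions(
    pass_rate_by_language(previous_run),
    pass_rate_by_language(latest_run),
)
print("languages with a lower pass rate:", dropped)
```

A CI job built on this idea could exit with a non-zero status whenever `dropped` is non-empty, which turns the benchmark into a regression gate rather than a one-time comparison. The `tolerance` parameter is one possible way to absorb run-to-run noise.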

Open Questions

How does DeepSWE handle dependency resolution and test suite execution when agents modify shared modules across language boundaries?
What metrics define successful trajectory completion when agents encounter irreversible state changes or conflicting architectural decisions?
Will the benchmark evolve to include automated human-in-the-loop review gates, or remain strictly deterministic in its pass/fail criteria?
How do different runtime environments (e.g., sandboxed vs. local execution) impact the reproducibility of agent performance across the 113 tasks?
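
The reproducibility question above could be approached empirically by running the same agent in two environments and comparing per-task outcomes. The sketch below is one possible way to do that, under the assumption that each run yields a mapping from task ID to pass/fail. The `outcome_agreement` function and the task IDs are hypothetical and not part of DeepSWE.

```python
def outcome_agreement(run_a, run_b):
    """Compare two runs given as {task_id: passed} dictionaries.

    Returns (fraction of shared tasks with the same outcome,
             sorted list of task IDs whose outcomes differ).
    Tasks present in only one run are ignored.
    """
    shared = sorted(run_a.keys() & run_b.keys())
    if not shared:
        raise ValueError("the two runs share no task IDs")
    differing = [t for t in shared if run_a[t] != run_b[t]]
    return (len(shared) - len(differing)) / len(shared), differing


# Illustrative data only: made-up outcomes for four tasks.
sandboxed = {"task-1": True, "task-2": False, "task-3": True, "task-4": True}
local = {"task-1": True, "task-2": True, "task-3": True, "task-4": True}

rate, flaky = outcome_agreement(sandboxed, local)
print(f"agreement: {rate:.2f}, differing tasks: {flaky}")
```

Tasks that land in the differing list are candidates for closer inspection, since a disagreement may reflect environment sensitivity rather than a difference in agent capability.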