24 tasks across 11 languages, tested on 14 models. Overall pass rate: 13.5%
OTelBench evaluates how well AI models can instrument code with OpenTelemetry. Each task requires adding tracing, metrics, or logging to real-world codebases across multiple programming languages.
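To make the task format concrete, here is a rough sketch (not taken from an actual benchmark task) of the kind of change a tracing task asks for, using the OpenTelemetry Go API. The package, function, and attribute names are hypothetical.

```go
package checkout

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// Tracer scoped to this (hypothetical) package.
var tracer = otel.Tracer("example.com/checkout")

// processOrder stands in for an existing function that a task asks the model to instrument.
func processOrder(ctx context.Context, orderID string) error {
	// Start a span and make it the active one in the returned context.
	ctx, span := tracer.Start(ctx, "processOrder")
	defer span.End()

	span.SetAttributes(attribute.String("order.id", orderID))

	if err := chargeCustomer(ctx, orderID); err != nil {
		// Record the failure so it is visible on the trace.
		span.RecordError(err)
		span.SetStatus(codes.Error, "charge failed")
		return err
	}
	return nil
}

// chargeCustomer represents pre-existing business logic, elided here.
func chargeCustomer(ctx context.Context, orderID string) error {
	return nil
}
```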
We test frontier models from OpenAI, Anthropic, Google, and others using OpenRouter. The benchmark is part of Quesma’s research into AI-assisted observability.
Models ranked by pass rate across all tasks. Cost and time columns show totals for running the complete benchmark. See the full methodology for how we evaluate each model's output.
| Highlight | Model | Total |
| --- | --- | --- |
| Cheapest | grok-4.1-fast | $0.03 |
| Fastest | gemini-3-pro-preview | 4m |
| Cheapest | glm-4.7 | $0.23 |
| Fastest | claude-opus-4.5 | 7m |
| Cheapest | claude-opus-4.5 | $1.24 |
| Fastest | claude-opus-4.5 | 10m |
Tasks sorted from easiest to hardest (highest pass rate first). Languages include Go, Java, C++, Python, Rust, PHP, JavaScript, and more. A pass rate of 0% means no model successfully completed the task; these are the hardest instrumentation challenges in the benchmark.
Performance heatmap showing results for each model-task combination. Green indicates efficient solutions, red indicates expensive ones, and white cells represent unsolved tasks. This matrix helps identify which models excel at specific types of instrumentation challenges.
Cost-performance tradeoff across all models (total API cost to run the complete benchmark). The blue line shows the Pareto frontier: no model achieves a higher pass rate at an equal or lower cost than a model on this line. Models below the line are dominated by alternatives that cost less, score higher, or both.
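To spell out the dominance rule behind the frontier, here is a minimal sketch in Go with made-up figures (not the benchmark's actual results): a model is on the frontier when no other model matches or beats it on both cost and pass rate, with at least one strict improvement.

```go
package main

import "fmt"

type model struct {
	name     string
	cost     float64 // total USD to run the benchmark
	passRate float64 // fraction of tasks passed
}

// dominated reports whether a is dominated by b: b is no more expensive,
// no less accurate, and strictly better on at least one of the two axes.
func dominated(a, b model) bool {
	return b.cost <= a.cost && b.passRate >= a.passRate &&
		(b.cost < a.cost || b.passRate > a.passRate)
}

func main() {
	// Hypothetical figures for illustration only.
	models := []model{
		{"model-a", 0.50, 0.10},
		{"model-b", 2.00, 0.25},
		{"model-c", 3.00, 0.20}, // dominated by model-b: costs more, passes less
	}
	for _, m := range models {
		onFrontier := true
		for _, other := range models {
			if other != m && dominated(m, other) {
				onFrontier = false
				break
			}
		}
		fmt.Printf("%s on Pareto frontier: %v\n", m.name, onFrontier)
	}
}
```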
Speed-performance tradeoff (average duration per task, including all inference and tool-use turns). Faster models appear on the left. For time-sensitive applications, this chart helps identify models that balance quick response times with high accuracy.
Model pass rate plotted against public release date.