Can AI instrument code with OpenTelemetry?

24 tasks across 11 languages, tested on 14 models. Overall pass rate: 13.5%

OTelBench evaluates how well AI models can instrument code with OpenTelemetry. Each task requires adding tracing, metrics, or logging to real-world codebases across multiple programming languages.
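
As a concrete illustration of what a task involves, here is a minimal sketch of adding a trace span with the OpenTelemetry Go SDK. The service, function, and attribute names (checkout-service, handleCheckout, order.id) are hypothetical and not drawn from any benchmark task; real tasks also cover metrics, logging, and other languages.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// handleCheckout is a hypothetical service function used only for illustration.
func handleCheckout(ctx context.Context, orderID string) error {
	// Obtain a tracer from the globally registered TracerProvider.
	tracer := otel.Tracer("checkout-service")

	// Start a span covering this operation; End() records its duration.
	ctx, span := tracer.Start(ctx, "handleCheckout")
	defer span.End()

	// Attach request-specific attributes to the span.
	span.SetAttributes(attribute.String("order.id", orderID))

	_ = ctx // downstream calls would receive ctx so child spans nest under this one
	return nil
}

func main() {
	// Export spans to stdout; a real service would typically use an OTLP exporter.
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(context.Background()) }()
	otel.SetTracerProvider(tp)

	if err := handleCheckout(context.Background(), "order-123"); err != nil {
		log.Fatal(err)
	}
}
```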

We test frontier models from OpenAI, Anthropic, Google, and others using OpenRouter. The benchmark is part of Quesma’s research into AI-assisted observability.

Models

| Rank | Provider  | Model                  | Pass rate |
|------|-----------|------------------------|-----------|
| 1    | Anthropic | claude-opus-4.5        | 27.8%     |
| 2    | OpenAI    | gpt-5.2                | 25.0%     |
| 3    | Anthropic | claude-sonnet-4.5      | 20.8%     |
| 4    | Google    | gemini-3-flash-preview | 18.1%     |
| 5    | OpenAI    | gpt-5.2-codex          | 16.7%     |
| 6    | Google    | gemini-3-pro-preview   | 15.3%     |
| 7    | OpenAI    | gpt-5.1                | 13.9%     |
| 8    | Z.ai      | glm-4.7                | 12.5%     |
| 9    | DeepSeek  | deepseek-v3.2          | 11.1%     |
| 10   | OpenAI    | gpt-5.1-codex-max      | 11.1%     |
| 11   | Kimi      | kimi-k2-thinking       | 6.9%      |
| 12   | Anthropic | claude-haiku-4.5       | 5.6%      |
| 13   | Grok      | grok-4                 | 4.2%      |
| 14   | Grok      | grok-4.1-fast          | 2.8%      |

Models ranked by pass rate across all 24 tasks. In the full results table, cost and time columns show totals for running the complete benchmark. See the full methodology for how we evaluate each model’s output.

View all models →

Tasks

| Difficulty | Language | Task                         | Pass rate | Cheapest                           | Fastest                               |
|------------|----------|------------------------------|-----------|------------------------------------|---------------------------------------|
| Easy       | Go       | go-otel-microservices-traces | 53%       | Grok grok-4.1-fast ($0.03)         | Google gemini-3-pro-preview (4m)      |
| Medium     | JS       | js-otel-microservices        | 18%       | Z.ai glm-4.7 ($0.23)               | Anthropic claude-opus-4.5 (7m)        |
| Hard       | PHP      | php-otel-microservices       | 3%        | Anthropic claude-opus-4.5 ($1.24)  | Anthropic claude-opus-4.5 (10m)       |

Tasks sorted by difficulty (highest pass rate first). Languages include Go, Java, C++, Python, Rust, PHP, JavaScript, and more. A pass rate of 0% means no model successfully completed the task—these represent the hardest instrumentation challenges.

View all tasks →

Model × Task Matrix

(Heatmap color scale: cost from $0.01 to $2; unsolved tasks shown in white.)

Performance heatmap showing results for each model-task combination. Green indicates efficient solutions, red indicates expensive ones, and white cells represent unsolved tasks. This matrix helps identify which models excel at specific types of instrumentation challenges.

Model Trade-offs

Cost vs performance

Cost-performance tradeoff across all models (total API cost to run all 24 tasks). The blue line shows the Pareto frontier: models on this line offer optimal cost efficiency for their performance level. Models below the line are dominated by better alternatives.
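
Here, "dominated" means some other model is at least as accurate and no more expensive, and strictly better on at least one of the two axes. A minimal sketch of that check, using hypothetical field names and numbers rather than the benchmark's actual data model:

```go
package main

import "fmt"

// ModelResult holds one model's benchmark totals; field names are illustrative.
type ModelResult struct {
	Name     string
	PassRate float64 // fraction of tasks passed, 0..1
	CostUSD  float64 // total API cost for the full benchmark run
}

// dominates reports whether a is at least as good as b on both axes
// and strictly better on at least one, i.e. b is dominated by a.
func dominates(a, b ModelResult) bool {
	notWorse := a.PassRate >= b.PassRate && a.CostUSD <= b.CostUSD
	strictlyBetter := a.PassRate > b.PassRate || a.CostUSD < b.CostUSD
	return notWorse && strictlyBetter
}

// paretoFrontier returns the models that no other model dominates.
func paretoFrontier(results []ModelResult) []ModelResult {
	var frontier []ModelResult
	for _, candidate := range results {
		dominated := false
		for _, other := range results {
			if dominates(other, candidate) {
				dominated = true
				break
			}
		}
		if !dominated {
			frontier = append(frontier, candidate)
		}
	}
	return frontier
}

func main() {
	// Hypothetical numbers for illustration only.
	results := []ModelResult{
		{"model-a", 0.28, 40.0},
		{"model-b", 0.25, 15.0},
		{"model-c", 0.12, 20.0}, // dominated by model-b: cheaper and more accurate
	}
	for _, m := range paretoFrontier(results) {
		fmt.Println(m.Name) // prints the frontier models: model-a, model-b
	}
}
```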

Speed vs performance

Speed-performance tradeoff (average duration per task, including all inference and tool-use turns). Faster models appear on the left. For time-sensitive applications, this chart helps identify models that balance quick response times with high accuracy.

Performance Over Time

Model pass rate plotted against public release date.