Instrument a Python client-server application with OpenTelemetry (OTEL) tracing. The instrumented application must produce exactly 2 trace IDs, one for each of two separate workflows.
Common failure modes
The test expects 2 trace IDs, but models produce only 1. Models propagate context "too well": they continue the same trace across both workflows instead of starting a separate trace for each.
Example error
AssertionError: Expected more than 1 trace ID, got 1
Performance
| Model | Pass Rate | Runs | Avg Cost | Avg Time |
|---|---|---|---|---|
| gpt-5.2-codex | 0% | | $0.00 | 20m |
| deepseek-v3.2 | 0% | | $0.11 | 15m |
| gemini-3-flash-preview | 0% | | $0.13 | 4m |
| grok-4.1-fast | 0% | | $0.14 | 20m |
| glm-4.7 | 0% | | $0.15 | 8m |
| kimi-k2-thinking | 0% | | $0.16 | 23m |
| gpt-5.1 | 0% | | $0.34 | 10m |
| gemini-3-pro-preview | 0% | | $0.40 | 5m |
| claude-haiku-4.5 | 0% | | $0.41 | 7m |
| grok-4 | 0% | | $0.44 | 9m |
| gpt-5.2 | 0% | | $0.44 | 8m |
| claude-sonnet-4.5 | 0% | | $0.46 | 5m |
| gpt-5.1-codex-max | 0% | | $0.61 | 15m |
| claude-opus-4.5 | 0% | | $0.66 | 6m |