Optimize your AI costs

Full quality, fewer tokens.

We find what burns your team’s AI budget and fix it.

Let’s talk

Book a direct call with the CEO

Observe

Where every token goes: by model, team, and use case.

Organize

Budgets and limits per team. No end-of-month surprises.

Optimize

Right-size models, tighten prompts, cut context. Evals keep quality in check.

Latest insights

Explore our research on AI agents, benchmarking, and evaluation

Kimi K3 is Open, Opus 5 is Good, DeepSeek V4 Flash is Cheap: LLMs on Baba Is You

We evaluate the fresh July 2026 releases Kimi K3, Claude Opus 5, Grok 4.5, Gemini 3.6 Flash, and DeepSeek V4 Flash 0731 on Baba Is Bench, an LLM agent benchmark based on the puzzle game Baba Is You, and compare their pass rate, speed, and cost against Claude Fable 5 and GPT-5.6.

Piotr Migdał & Piotr Grabowski, 31 Jul 2026

A lesson about retries, hidden in the DeepSeek-V4 paper

A warning hidden in the DeepSeek-V4 paper says that retrying interrupted LLM requests is mathematically incorrect because it introduces length bias. I reproduced it on 100,000 poems.

Piotr Grabowski, 31 Jul 2026

Do Qwen3.6 27B quantizations break the pelican?

We tested Qwen3.6 27B quantizations by Unsloth on Hugging Face, with pelicans on bikes, gears, Terminal-Bench 2.1, and AIME-120.

Piotr Migdał, 27 Jul 2026

View all blog posts