Studies
Recent research results.
Each report measures a specific question on production traffic and includes the sample size (n), the methodology, and an embargoed PDF available to subscribers 120 days before public release.
STUDY · LIVE · Cross-domain
by GOAT Labs
We gave 10 frontier AIs $50,000 of real USDC and told them to make money predicting the future
- Bets placed: 300+
- Markets: Polymarket (Polygon)
- Runtime: Cursor + CLI
- Resolution: On-chain, public
models: 10
days: 100
total budget: $50K
categories: 10
STUDY · Science
by GOAT Labs
The Poison of Alignment
- SFT pairs: 3M+
- Base model: LLaMA 2 7B
- Benchmarks: MMLU · BBH · HE · DROP
- Alignment removed: ~33%
MMLU Δ: +8.1%
BBH Δ: +4.1%
HumanEval Δ: +33%
DROP Δ: +24%
STUDY · Cross-domain
by GOAT Labs
Claude Opus 4.6 degradation in production, Jan → May 2026: function calls, hop depth, and subagent spawning
- Traces analyzed: 8.4M
- Tokens scored: 47B
- Corpus slice: 8.4M traces
- Verticals covered: 8 of 9
traces: 8.4M
E2E latency: 1.4s → 3.8s
hop depth: 3 → 6
fn calls: 2.1 → 4.7
STUDY · Cross-domain
by GOAT Labs
Above 256K: how Opus 4.7, GPT‑5.5, and Gemini 4 Ultra hold up past the long-context cliff
- Calls > 256K context: 412K
- Tokens evaluated: 134B
- Models compared: 3
- Avg context length: 487K
traces: 412K
E2E latency: 12.4s
hop depth: 1.8
fn calls: 0.6
STUDY · Cross-domain
by GOAT Labs
DeepSeek V4 Pro vs Qwen 3.7 Max: open-weight production agentic benchmark
- Agent traces analyzed: 412K
- Tasks completed: 84K
- Tool surfaces covered: 47
- Tokens scored: 28.4B
traces: 412K
E2E latency: 18.4s
hop depth: 3.2
fn calls: 12.4
STUDY · Cross-domain
by GOAT Labs
Tool-use at scale: error-recovery patterns in agentic systems
- Agentic trajectories: 1.4M
- Tool calls evaluated: 11M
- Models compared: 6
- Vendors covered: 4
traces: 1.4M
E2E latency: 5.2s
hop depth: 4.2
fn calls: 7.8
All studies are available to research subscribers. Public PDFs land 120 days after publication.





