Studies | GOAT Labs

Studies

Recent research results.

Each report answers a specific question using production traffic and states its sample size (n) and methodology. Subscribers get the embargoed PDF 120 days before public release.

STUDY · LIVE · Cross-domain

by GOAT Labs

We gave 10 frontier AIs $50,000 of real USDC and told them to make money predicting the future

Bets placed: 300+
Markets: Polymarket (Polygon)
Runtime: Cursor + CLI
Resolution: On-chain, public

Models: 10
Days: 100
Total budget: $50K
Categories: 10

View live benchmark →

STUDY · Science

by GOAT Labs

The Poison of Alignment

SFT pairs: 3M+
Base model: LLaMA 2 7B
Benchmarks: MMLU · BBH · HumanEval · DROP
Alignment removed: ~33%

MMLU Δ: +8.1%
BBH Δ: +4.1%
HumanEval Δ: +33%
DROP Δ: +24%

Read study →

STUDY · Cross-domain

by GOAT Labs

Claude Opus 4.6 degradation in production, Jan → May 2026: function calls, hop depth, and subagent spawning

Traces analyzed: 8.4M
Tokens scored: 47B
Corpus slice: 8.4M traces
Verticals covered: 8 of 9

Traces: 8.4M
E2E latency: 1.4s → 3.8s
Hop depth: 3 → 6
Fn calls: 2.1 → 4.7

Read study →

STUDY · Cross-domain

by GOAT Labs

Above 256K: how Opus 4.7, GPT‑5.5, and Gemini 4 Ultra hold up past the long-context cliff

Calls > 256K context: 412K
Tokens evaluated: 134B
Models compared: 3
Avg context length: 487K

Traces: 412K
E2E latency: 12.4s
Hop depth: 1.8
Fn calls: 0.6

Read study →

STUDY · Cross-domain

by GOAT Labs

DeepSeek V4 Pro vs Qwen 3.7 Max: open-weight production agentic benchmark

Agent traces analyzed: 412K
Tasks completed: 84K
Tool surfaces covered: 47
Tokens scored: 28.4B

Traces: 412K
E2E latency: 18.4s
Hop depth: 3.2
Fn calls: 12.4

Read study →

STUDY · Cross-domain

by GOAT Labs

Tool-use at scale: error-recovery patterns in agentic systems

Agentic trajectories: 1.4M
Tool calls evaluated: 11M
Models compared: 6
Vendors covered: 4

Traces: 1.4M
E2E latency: 5.2s
Hop depth: 4.2
Fn calls: 7.8

Read study →

All studies are available to research subscribers. Public PDFs land 120 days after publication. Browse archive →
