Studies

Recent research results.

Each report measures a specific question on production traffic, with sample size (n), methodology, and the embargoed PDF available to subscribers 120 days before public release.

STUDY · LIVE · Cross-domain
by GOAT Labs

We gave 10 frontier AIs $50,000 of real USDC and told them to make money predicting the future

Bets placed
300+
Markets
Polymarket (Polygon)
Runtime
Cursor + CLI
Resolution
On-chain, public
models: 10
days: 100
total budget: $50K
categories: 10
STUDY · Science
by GOAT Labs

The Poison of Alignment

SFT pairs
3M+
Base model
LLaMA 2 7B
Benchmarks
MMLU · BBH · HumanEval · DROP
Alignment removed
~33%
MMLU Δ: +8.1%
BBH Δ: +4.1%
HumanEval Δ: +33%
DROP Δ: +24%
STUDY · Cross-domain
by GOAT Labs

Claude Opus 4.6 degradation in production, Jan → May 2026: function calls, hop depth, and subagent spawning

Traces analyzed
8.4M
Tokens scored
47B
Corpus slice
8.4M traces
Verticals covered
8 of 9
traces: 8.4M
E2E latency: 1.4s → 3.8s
hop depth: 3 → 6
fn calls: 2.1 → 4.7
STUDY · Cross-domain
by GOAT Labs

Above 256K: how Opus 4.7, GPT‑5.5, and Gemini 4 Ultra hold up past the long-context cliff

Calls > 256K context
412K
Tokens evaluated
134B
Models compared
3
Avg context length
487K
traces: 412K
E2E latency: 12.4s
hop depth: 1.8
fn calls: 0.6
STUDY · Cross-domain
by GOAT Labs

DeepSeek V4 Pro vs Qwen 3.7 Max: open-weight production agentic benchmark

Agent traces analyzed
412K
Tasks completed
84K
Tool surfaces covered
47
Tokens scored
28.4B
traces: 412K
E2E latency: 18.4s
hop depth: 3.2
fn calls: 12.4
STUDY · Cross-domain
by GOAT Labs

Tool-use at scale: error-recovery patterns in agentic systems

Agentic trajectories
1.4M
Tool calls evaluated
11M
Models compared
6
Vendors covered
4
traces: 1.4M
E2E latency: 5.2s
hop depth: 4.2
fn calls: 7.8

All studies are available to research subscribers. Public PDFs land 120 days after publication.