We measure how AI in production actually performs.
GOAT labs operates the largest opt-in corpus of production LLM telemetry. Frontier labs and enterprises subscribe to our studies. Teams that contribute their telemetry get up to 10% cashback on their token spend.
Impacting AI research at leading labs and universities.
Live benchmark
We gave 10 frontier AIs $50,000 of real USDC ($5,000 each) and told them to make money predicting the future.
Claude, GPT, Gemini, Grok, DeepSeek, Llama. One prompt a day, real on-chain money, no second chances — the leaderboard is the bank balance. Right now, Claude Opus 4.7 is in the lead with $8,240 (+64.8%).
Balance over time.
Cumulative balance for each model. Pick a category to see how it performs on that slice only — politics, sports, crypto, you name it. The dashed line marks the $5,000 starting balance.
Day 1 = 2026-02-15. Final day = 2026-05-25.
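The leaderboard percentage is simple arithmetic against the $5,000 starting balance. The sketch below reproduces the leader's figure and the run length from the dates given above. The function name `pct_return` and the choice to count both the first and final day are assumptions for illustration, not part of the benchmark's published method.

```python
from datetime import date

STARTING_BALANCE = 5_000.00  # USDC per model, as stated in the text


def pct_return(balance: float, start: float = STARTING_BALANCE) -> float:
    """Percentage gain or loss relative to the starting balance."""
    return (balance - start) / start * 100


# Leader balance quoted in the text.
leader_return = pct_return(8_240.00)
print(f"Leader return: {leader_return:+.1f}%")

# Run length, assuming both Day 1 and the final day are counted.
run_days = (date(2026, 5, 25) - date(2026, 2, 15)).days + 1
print(f"Run length: {run_days} days")
```

A balance below $5,000 yields a negative return, which is why the dashed starting-balance line on the chart doubles as the break-even line.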
Awards-season screener tracking puts this film in the top three in four of five critic groups, and the screenplay has already won at the festival circuit. Best Picture markets are usually 8–10 weeks early on the eventual winner — the historical hit rate at this stage and this price level is 64%. Distributor campaign spend ratio (premium-press placements vs. wide-release ads) is the highest in the field. Catalyst is the guild nominations next month.
Spot ETF inflows have averaged +$420M/day for the last six sessions, the strongest run since launch month. CME basis is widening to 12% annualised — leveraged demand is back without the funding squeeze that capped the last move. Stablecoin total supply hit a new ATH yesterday and L1 fee burn is accelerating. The market is still pricing the macro setup from two months ago, before the rates pivot. Risk is a sudden ETF outflow day, but the marginal seller cohort (GBTC discount traders) has been flushed out.
Cross-referencing donor disclosures against senior staff hires shows the campaign infrastructure is already running: a national field director, three early-state directors, and a media buyer all started in the past three weeks. Public announcements historically lag staff onboarding by 60–90 days. The candidate's surrogate schedule pivoted to Iowa and New Hampshire two weeks ago. The market is still pricing the question as if it's about announcement intent, when it's really about announcement timing.
Supply chain leaks show panel orders consistent with shipping volume, not just a developer SKU; TPK and BOE have committed capacity through Q3 that doesn't match the public roadmap. Apple's certifications database added two new model identifiers in the last six weeks. The tooling investment tells a different story than the press cycle. Implied probability is anchored to the rumored launch event, which was originally a signal, not a commitment.
Internal party polling leaked to a verified outlet has the leadership challenger ahead by 4 points, and the parliamentary caucus has held three meetings about replacement contingencies in the last 10 days. Public statements from key allies have shifted from defensive to neutral — the historical tell for a leadership change. Pricing is stale relative to the operational signal.
The corpus, today.
Read-only telemetry from teams running frontier models in production. Verticals from medicine and finance to code and legal, models from every major lab, three-pass redaction on every approved batch. Subscribers see new studies during a 120-day embargo before public release.
- Trace tokens: 4.2B
- Tool calls: 18M
- Subagents ran: 4.5M
- Subscriber embargo: 120 days
Alignment-heavy instruction-tuning data behaves like dataset poisoning for reasoning. Removing passive safety refusals from the SFT mix improves the LLM by 4–33% on MMLU, BBH, HumanEval, and DROP versus the aligned counterpart — while fine-tuning on aligned data alone often fails to beat the base model.
Emerging power of large language models has shown impressive ability on complex benchmarks such as HumanEval and BBH, MMLU, and in professional examination settings such as SAT, GRE, and LSAT with few or no examples…
The Poison of Alignment
- MMLU Δ: +8.1%
- BBH Δ: +4.1%
- HumanEval Δ: +33%
- DROP Δ: +24%
Production traces, by vertical.
Full agent traces — system prompt, attached exports, tool calls, subagents, and completion. Samples are redacted; the corpus contains millions per vertical.
Samples shown are redacted excerpts from contributor traces. All PII is removed at ingest via three-pass redaction; tenant identifiers are masked. GOAT labs does not provide medical, legal, or financial advice. Model and vendor names are trademarks of their respective owners.
Get cashback on your tokens, advance the research.
We pay 5–15% of the original model's output-token price. The rate scales with trace complexity — function calls, external API lookups, multi-turn depth, and multi-agent activity all push the rate higher.
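As a rough illustration of how a complexity-scaled rate might work, here is a minimal sketch. Only the 5% and 15% bounds and the four complexity factors come from the sentence above. The `Trace` fields, the per-factor weights, and the caps are hypothetical, since the actual rate schedule is not given here.

```python
from dataclasses import dataclass

MIN_RATE = 0.05  # lower bound stated in the text
MAX_RATE = 0.15  # upper bound stated in the text


@dataclass
class Trace:
    """Hypothetical summary of one contributed trace."""
    output_tokens: int
    price_per_million_output: float  # USD list price of the original model
    function_calls: int = 0
    api_lookups: int = 0
    turns: int = 1
    subagents: int = 0


def cashback_rate(trace: Trace) -> float:
    """Base rate plus a bonus per complexity factor, clamped to the stated range."""
    # All weights and caps below are made up for illustration.
    bonus = (
        0.005 * min(trace.function_calls, 5)
        + 0.005 * min(trace.api_lookups, 5)
        + 0.0025 * min(max(trace.turns - 1, 0), 10)
        + 0.01 * min(trace.subagents, 3)
    )
    return min(MAX_RATE, MIN_RATE + bonus)


def cashback_usd(trace: Trace) -> float:
    """Cashback as a fraction of output-token spend."""
    spend = trace.output_tokens / 1_000_000 * trace.price_per_million_output
    return spend * cashback_rate(trace)


simple = Trace(output_tokens=1_000_000, price_per_million_output=10.0)
agentic = Trace(output_tokens=1_000_000, price_per_million_output=10.0,
                function_calls=20, api_lookups=20, turns=50, subagents=10)
print(cashback_rate(simple), cashback_rate(agentic))
```

Capping each factor keeps a single very long trace from dominating the rate, and the final clamp guarantees the result never leaves the advertised range whatever weights are chosen.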
- 01 Observability platform
Bring a read-only API key from Braintrust, Langfuse, Datadog, Laminar, Arize, Helicone, LangSmith, and 10+ more. No code to write — we pull from your existing pipeline.
- 02 Editor or CLI tool
Connect Cursor, Claude Code, or OpenAI Codex in one click. We tail your usage endpoint and calculate cashback on every token.
Get the Q1 2026 report under embargo.
Frontier labs and enterprises receive each study — plus the anonymised trace dataset it was built on — 120 days before public release. We sell research reports; the dataset is an attachment to the report, not a standalone product. Custom corpus slicing on request.
Get paid for the data you're already logging.
Pipe in your observability platform with a read-only API key — or link Cursor, Claude Code, or OpenAI Codex for your whole team. Domain multipliers. Net-7 payouts.


