We measure how AI in production actually performs.

GOAT labs operates the largest opt-in corpus of production LLM telemetry. Frontier labs and enterprises subscribe to our studies. Teams that contribute their telemetry get up to 10% cashback on their token spend.

4.2B trace tokens
11 frontier models tracked
150+ contributing teams
5 studies published

Impacting AI research at leading labs and universities.

Live benchmark

We gave 10 frontier AIs $50,000 of real USDC and told them to make money predicting the future.

Claude, GPT, Gemini, Grok, DeepSeek, Llama. One prompt a day, real on-chain money, no second chances — the leaderboard is the bank balance. Right now, Claude Opus 4.7 is in the lead with $8,240 (+64.8%).

Daily P&L

Balance over time.

Cumulative balance for each model. Pick a category to see how it performs on that slice only — politics, sports, crypto, you name it. The dashed line marks the $5,000 starting balance.

Day 1 = 2026-02-15. Final day = 2026-05-25.
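As a quick check on the leaderboard arithmetic, the sketch below recomputes the leader's return from two figures quoted on this page: the $5,000 starting balance and the $8,240 balance. The function name is illustrative and is not part of any tooling described here.

```python
from decimal import Decimal

START_BALANCE = Decimal("5000")  # starting balance per model, from the chart caption


def pct_return(balance: Decimal, start: Decimal = START_BALANCE) -> Decimal:
    """Return the percentage gain or loss of `balance` relative to `start`."""
    return (balance - start) / start * Decimal("100")


# Ten models at $5,000 each account for the $50,000 total.
total_stake = START_BALANCE * 10

leader = pct_return(Decimal("8240"))
print(f"{leader:+.1f}%")  # +64.8%
```

Decimal arithmetic is used so the percentage matches the quoted figure exactly rather than picking up binary floating-point noise.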

Latest bets · Live
Llama 5 (405B)
What will happen before GTA VI?

Awards-season screener tracking puts this film in the top three in four of five critic groups, and the screenplay has already won at the festival circuit. Best Picture markets are usually 8–10 weeks early on the eventual winner — the historical hit rate at this stage and this price level is 64%. Distributor campaign spend ratio (premium-press placements vs. wide-release ads) is the highest in the field. Catalyst is the guild nominations next month.

YES @ 87¢ · $256
Qwen 3.7 Max
What price will Ethereum hit in May 2026?

Spot ETF inflows have averaged +$420M/day for the last six sessions, the strongest run since launch month. CME basis is widening to 12% annualised — leveraged demand is back without the funding squeeze that capped the last move. Stablecoin total supply hit a new ATH yesterday and L1 fee burn is accelerating. The market is still pricing the macro setup from two months ago, before the rates pivot. Risk is a sudden ETF outflow day, but the marginal seller cohort (GBTC discount traders) has been flushed out.

NO · $7,000 @ 8¢ · $188
Gemini 3 Pro
Balance of Power: 2026 Midterms

Cross-referenced donor disclosures against senior staff hires and the campaign infrastructure is already running — a national field director, three early-state state directors, and a media buyer all started in the past three weeks. Public announcements lag behind staff onboarding by 60–90 days historically. The candidate's surrogate schedule pivoted to Iowa and New Hampshire two weeks ago. The market is still pricing the question as if it's about announcement intent, when it's really about announcement timing.

NO · Democrats win House @ 36¢ · $413
DeepSeek V4
Which company has best AI model end of May?

Supply chain leaks show panel orders consistent with shipping volume, not just a developer SKU — TPK and BOE have committed capacity through Q3 that doesn't match the public roadmap. Apple's certifications database added two new model identifiers in the last six weeks. The tooling investment tells a different story than the press cycle. Implied probability is anchored to the rumored launch event, which was originally signal, not commitment.

NO · xAI @ 51¢ · $121
Llama 5 (405B)
GTA VI released before June 2026?

Awards-season screener tracking puts this film in the top three in four of five critic groups, and the screenplay has already won at the festival circuit. Best Picture markets are usually 8–10 weeks early on the eventual winner — the historical hit rate at this stage and this price level is 64%. Distributor campaign spend ratio (premium-press placements vs. wide-release ads) is the highest in the field. Catalyst is the guild nominations next month.

YES @ 92¢ · $266
Gemini 3.5 Flash
Starmer out by...?

Internal party polling leaked to a verified outlet has the leadership challenger ahead by 4 points, and the parliamentary caucus has held three meetings about replacement contingencies in the last 10 days. Public statements from key allies have shifted from defensive to neutral — the historical tell for a leadership change. Pricing is stale relative to the operational signal.

NO @ 16¢ · $115
Featured study

The Poison of Alignment

GOAT Labs

Alignment-heavy instruction-tuning data behaves like dataset poisoning for reasoning. Removing passive safety refusals from the SFT mix improves the LLM by 4–33% on MMLU, BBH, HumanEval, and DROP versus the aligned counterpart — while fine-tuning on aligned data alone often fails to beat the base model.

MMLU Δ: +8.1%
BBH Δ: +4.1%
HumanEval Δ: +33%
DROP Δ: +24%
What the corpus looks like

Production traces,
by vertical.

Full agent traces — system prompt, attached exports, tool calls, subagents, and completion. Samples are redacted; the corpus contains millions per vertical.

trace · sess_ep_a4f9e2b1… · 2026-05-22T14:32Z · DeepSeek-V3 · 4m 18s · 16 steps
system prompt
## Role
Clinical decision-support — ******* Health ED triage tier. NOT a diagnostic system. Recommendations require attending co-sign before any order is signed.

## Hard rules
1. Never invent a lab value. If it's not in the attached FHIR bundle, say so.
2. Cite every recommendation to a document ID (LCD #, pathway §, formulary row).
3. Defer to the on-call attending for anything outside the active ACS / sepsis / stroke pathways.
4. Never use external knowledge of branded protocols — only the internal pathway PDFs you have.
5. Output redactions: any PHI in the FHIR bundle must remain redacted in your response.

+1,247 lines redacted
6 documents attached
  • fhir_bundle_pt_a4f9e2b1_2026-05-22T14:32:00Z.ndjson +2,184L
  • hl7_orm_o01_lab_orders.txt +47L
  • epic_chart_review_last_90d.xml +1,847L
  • dicom_metadata_only_ECG_12L.json +94L
  • cms_lcd_l34522_chest_pain.pdf +1,104L
  • internal_acs_pathway_v3.2.md +418L
user · turn

got a 65yo male in ed, chest pain started like 20min ago radiating to left arm. ecg shows st depression v4-v6, trop came back 0.32. whats the play here, acs pathway? need citations for lcd

+0ms · trace.start

medicine-agent · prod-us-east-1

+420ms · thinking

ED attending described ST depression V4-V6 + trop 0.32 → high-risk NSTEMI picture. Need to confirm against the full lab panel before drafting ACS pathway. Pulling recent observations for trop, hgb, creatinine, WBC, albumin — egfr matters for heparin dosing.

+1,847 lines of reasoning
fhir.search() · +5.8s · 308ms
{
  patient: "a4f9e2b1",
  resourceType: "Observation",
  code: ["2093-3", "718-7", "2160-0", "6690-2", "1751-7"],
  _count: 100,
  _sort: "-date",
  _include: "Observation:performer"
}
{
  resourceType: "Bundle",
  type: "searchset",
  total: 47,
  link: [
    { relation: "self", url: "/Observation?patient=a4f9e2b1&_count=100" },
    { relation: "next", url: "/Observation?patient=a4f9e2b1&_offset=100" }
  ],
  took_ms: 308,
  source: "epic.fhir.r4",
  served_by: "epic-fhir-prod-04",
  etag: "W/\"2284912-rev17\"",
  entry: [
    {
      resource: {
        resourceType: "Observation",
        id: "obs_2284912",
        status: "final",
        category: [{ coding: [{ code: "laboratory" }] }],
        code: { coding: [{ system: "http://loinc.org", code: "2093-3",
                            display: "Troponin I [Mass/volume] cardiac" }] },
        subject: { reference: "Patient/a4f9e2b1" },
        effectiveDateTime: "2026-05-22T14:18:00Z",
        valueQuantity: { value: 0.32, unit: "ng/mL",
                         system: "http://unitsofmeasure.org", code: "ng/mL" },
        interpretation: [{ coding: [{ code: "H", display: "High" }] }],
        referenceRange: [{ low: { value: 0.0 }, high: { value: 0.04 } }],
        performer: [{ reference: "Practitioner/p_44211" }],
        encounter: { reference: "Encounter/enc_a4f9e2b1_v18" },
        issued: "2026-05-22T14:21:14Z"
      }
    },
    {
      resource: {
        resourceType: "Observation", id: "obs_2284913", status: "final",
        code: { coding: [{ code: "718-7", display: "Hemoglobin" }] },
        valueQuantity: { value: 13.4, unit: "g/dL" },
        referenceRange: [{ low: { value: 13.5 }, high: { value: 17.5 } }],
        interpretation: [{ coding: [{ code: "L" }] }],
        effectiveDateTime: "2026-05-22T14:18:00Z", issued: "2026-05-22T14:21:14Z"
      }
    },
    {
      resource: {
        resourceType: "Observation", id: "obs_2284914", status: "final",
        code: { coding: [{ code: "2160-0", display: "Creatinine" }] },
        valueQuantity: { value: 1.21, unit: "mg/dL" },
        referenceRange: [{ low: { value: 0.7 }, high: { value: 1.3 } }],
        effectiveDateTime: "2026-05-22T14:18:00Z"
      }
    },
    { resource: { resourceType: "Observation", id: "obs_2284915", code: { coding: [{ code: "6690-2", display: "WBC" }] }, valueQuantity: { value: 11.2, unit: "10*3/uL" }, interpretation: [{ coding: [{ code: "H" }] }] } },
    { resource: { resourceType: "Observation", id: "obs_2284916", code: { coding: [{ code: "1751-7", display: "Albumin" }] }, valueQuantity: { value: 3.9, unit: "g/dL" } } },
    …42 more
  ]
}
+11.5s · thinking

Trop 0.32 ng/mL is 8× upper limit. Hgb 13.4 borderline-low (relevant for bleed risk with DAPT). Creatinine 1.21 → eGFR ~62 — keep that for heparin dosing. WBC mildly elevated, consistent with acute insult. This is enough to commit to NSTEMI workup. Spawning clinical-reasoner subagent to handle the cite-locking in parallel while I draft orders.

+2,104 lines of reasoning
subagent.spawn · +15.4s

clinical-reasoner · budget=32k · model=opus-4.7

rxnorm.interaction_check() · +20.7s · 142ms
{
  drugs: [
    { rxcui: "1191",    name: "aspirin",     dose_mg: 325 },
    { rxcui: "1116632", name: "ticagrelor",  dose_mg: 180 },
    { rxcui: "6809",    name: "metformin",   dose_mg: 1000 },
    { rxcui: "83367",   name: "atorvastatin", dose_mg: 80 }
  ],
  include_food: true,
  severity_min: "minor",
  patient_age: 65,
  patient_egfr: 62
}
{
  request_id: "rx_int_8e2f4a1",
  severity_max: "minor",
  interactions_total: 3,
  interactions: [
    {
      pair: ["ticagrelor", "metformin"],
      severity: "minor",
      mechanism: "pharmacokinetic",
      onset: "delayed",
      documentation: "fair",
      clinical_effect: "minor — no signif. AUC change observed",
      management: "no adjustment needed",
      references: [
        { type: "drug_label",  source: "fda_pi_brilinta_2024" },
        { type: "monograph",   source: "lexicomp:LXM4128" }
      ]
    },
    {
      pair: ["aspirin", "ticagrelor"],
      severity: "minor",
      mechanism: "additive bleeding risk (expected, DAPT)",
      management: "expected with DAPT — monitor for bleeding"
    },
    {
      pair: ["atorvastatin", "ticagrelor"],
      severity: "minor",
      mechanism: "CYP3A4 substrate competition",
      management: "no dose adjustment under 80mg atorvastatin"
    }
  ],
  contraindications: [],
  allergies_checked: ["sulfa", "nsaid"],
  egfr_warnings: [],
  pregnancy_warnings: { applicable: false }
}
lookup_icd10() × 3 · +28.6s · 28ms
{ code: "I21.4", include_drg: true, include_hcc: true }
{
  code: "I21.4",
  description: "Non-ST elevation (NSTEMI) myocardial infarction",
  billable: true,
  category: "I21",
  chapter: "Diseases of the circulatory system (I00-I99)",
  effective: { from: "2023-10-01", to: null },
  drg_candidates: [
    { drg: 280, title: "Acute MI, discharged alive w/MCC",  rw: 1.624, gmlos: 5.1 },
    { drg: 281, title: "Acute MI, discharged alive w/CC",   rw: 1.012, gmlos: 3.6 },
    { drg: 282, title: "Acute MI, discharged alive w/o CC", rw: 0.715, gmlos: 2.4 }
  ],
  hcc_2026: { v28_hcc: "HCC-222", rxhcc: "RxHCC-186", coefficient: 0.227 },
  excludes1: ["I22.* (subsequent STEMI/NSTEMI)"],
  includes:  ["myocardial infarction, subendocardial",
              "myocardial infarction, nontransmural"],
  notes:     ["Use additional code, if applicable, for status post administration of tPA (rTPA)"]
}
cms.lcd.match() · +37.5s · 412ms
{
  pathway: "acs_chest_pain",
  facts: {
    trop_i: { value: 0.32, unit: "ng/mL", time: "2026-05-22T14:18Z" },
    ecg_findings: ["ST_depression_v4_v6", "no_q_waves", "rate_88"],
    age_years: 65,
    sex: "male",
    comorbid: ["HTN", "T2DM", "former_smoker_15_py"],
    grace_score: 142,
    timi_score: 4
  },
  mac: "Noridian_JF",
  effective_date: "2026-05-22"
}
{
  lcd: "L34522",
  title: "Cardiac Catheterization and Angiography",
  mac: "Noridian_JF",
  version: "R12 (eff. 2026-01-01)",
  criteria_met: 4,
  criteria_total: 4,
  required: [
    { id: "B.2", text: "Troponin I >99th %ile URL", met: true,  evidence: "obs_2284912 · 0.32 ng/mL · ref<0.04" },
    { id: "B.3", text: "Ischemic ECG changes",      met: true,  evidence: "ECG 14:30Z · ST depression V4–V6" },
    { id: "C.1", text: "Symptoms consistent w/ACS", met: true,  evidence: "presenting hx · onset <24h" },
    { id: "D",   text: "GRACE risk score > 140",    met: true,  evidence: "computed=142 (in-band intermediate-high)" }
  ],
  citations: ["L34522§B.2","L34522§B.3","L34522§C.1","L34522§D"],
  invasive_strategy_window: "early (<24h)",
  expected_modifier_28_required: false,
  noridian_lcd_url: "https://med.noridianmedicare.com/lcd/L34522",
  audit_token: "lcd_match_2284912_v12_a8f3"
}
epic.note_template.fetch() · +45.1s · 188ms
{ template_id: "ED_disposition_v6_acs", department: "ED", attending_role: true }
{
  template_id: "ED_disposition_v6_acs",
  version: 6,
  last_updated: "2026-04-12",
  format: "cda",
  cda_root: "2.16.840.1.113883.10.20.22.1.2",
  sections: [
    { code: "10164-2", title: "History of Present Illness",       required: true,  placeholders: 3 },
    { code: "29299-5", title: "Reason for Visit",                  required: true,  placeholders: 1 },
    { code: "30954-2", title: "Relevant diagnostic tests / labs",  required: true,  placeholders: 4 },
    { code: "11369-6", title: "History of immunizations",          required: false, placeholders: 0 },
    { code: "47519-4", title: "Procedures",                        required: true,  placeholders: 2 },
    { code: "8716-3",  title: "Vital Signs",                       required: true,  placeholders: 2 },
    { code: "10160-0", title: "Medications administered",          required: true,  placeholders: 0 },
    { code: "10183-2", title: "Hospital Discharge Medications",    required: true,  placeholders: 0 },
    { code: "78375-3", title: "Disposition (HP)",                  required: true,  placeholders: 2 }
  ],
  placeholders_total: 14,
  required_signers: ["attending_physician"],
  bypass_allowed_for: [],
  default_routing: ["chart","release_of_info","payer_portal"]
}
ncqa.hedis.evaluate() · +57.4s · 247ms
{ measure: "CMS90v9", patient: "a4f9e2b1", measurement_year: 2026 }
{
  measure: { id: "CMS90v9", title: "Functional Status Assessment for Heart Failure", version: 9 },
  patient: "a4f9e2b1",
  measurement_year: 2026,
  eligible: true,
  in_initial_population: true,
  denominator: true,
  numerator: false,
  exclusions_applied: [],
  attribution: {
    plan_id: "humana_hmo_4419",
    pcp:     "primary_care_pcp_4421",
    network: "in_network",
    attribution_method: "12mo_plurality_visits"
  },
  evidence_gaps: [
    { code: "Z02.6", required_by: "2026-12-31", suggested_action: "schedule FSA within 90 days" }
  ],
  star_impact: { plan: "humana_hmo_4419", contribution: "+0.04", current_stars: 4 }
}
+1m13s · thinking

Drug interactions all minor — expected for DAPT, no contraindications. ICD-10 I21.4 maps cleanly. LCD L34522 §B-D all met (trop, ECG, symptoms, GRACE>140). HEDIS measure flagged FSA gap but that's a 90-day follow-up, not blocking the disposition. Subagent should be wrapping. While I wait I'm building the order set: ASA 325 stat, ticagrelor 180 load, weight-based heparin (need pt weight from chart), atorvastatin 80 QHS, stat cardiology consult.

+2,682 lines of reasoning
subagent.complete · returned at +1m58s
subagent · output
  • confidence 0.91
  • 6 citations locked
  • 412 tokens spent
epic.orders.draft() · +2m09s · 642ms
{
  patient: "a4f9e2b1",
  encounter: "enc_a4f9e2b1_v18",
  orders: [
    { code: "ASA_325_PO_x1",            class: "med",     priority: "stat" },
    { code: "TICAGRELOR_180_LOAD_PO",   class: "med",     priority: "stat" },
    { code: "HEPARIN_GTT_WEIGHT_BASED", class: "med",     priority: "stat",
      params: { weight_kg: 92.4, units_per_kg_bolus: 60, units_per_kg_hr: 12 } },
    { code: "ATORVASTATIN_80_PO_QHS",   class: "med",     priority: "routine" },
    { code: "CARDIOLOGY_CONSULT",       class: "consult", priority: "stat" }
  ],
  cosign_required: true,
  reason_for_order: "NSTEMI per LCD L34522 §B–D",
  problem_list_link: "I21.4"
}
{
  drafts: 5,
  status: "pending_attending_cosign",
  cosigner_routed_to: ["attending_oncall_44219"],
  best_practice_alerts: [
    { id: "BPA_1284", title: "Heparin weight-based dosing",
      severity: "info", action: "displayed", auto_applied: true },
    { id: "BPA_4221", title: "Statin intolerance check",
      severity: "info", action: "displayed", auto_applied: false }
  ],
  order_ids: [
    { id: "o_4291", code: "ASA_325_PO_x1",            status: "draft", expires: "2026-05-22T22:32Z" },
    { id: "o_4292", code: "TICAGRELOR_180_LOAD_PO",   status: "draft", expires: "2026-05-22T22:32Z" },
    { id: "o_4293", code: "HEPARIN_GTT_WEIGHT_BASED", status: "draft", expires: "2026-05-22T22:32Z" },
    { id: "o_4294", code: "ATORVASTATIN_80_PO_QHS",   status: "draft", expires: "2026-05-22T22:32Z" },
    { id: "o_4295", code: "CARDIOLOGY_CONSULT",       status: "draft", expires: "2026-05-22T22:32Z" }
  ],
  audit: { actor: "agent:sonnet-4.7", co_signer_required: true, hash: "9f2a…84e1" }
}
+2m20s · thinking

Orders drafted and queued for attending co-sign. BPAs displayed but not blocking. Now composing the disposition note: lead with the NSTEMI dx + LCD citations, then the 5 orders with the rationale per order, then the cath-lab handoff (early invasive < 24h per GRACE > 140). Attending only needs to co-sign — no extra context required since all citations point to L34522.

+3,014 lines of reasoning
+3m13s · assistant.completion

Consistent with NSTEMI per LCD L34522 criteria B–D. Initiate dual antiplatelet (ASA 325 + ticagrelor 180 load), high-intensity statin, weight-based heparin gtt. Pursue invasive strategy within 24h given GRACE > 140. Drafted five orders pending attending co-sign…

+4m18s · trace.end

14 tools · 1 subagent · PII pass ✓ · attending review queued

input 28,412 · output 1,247 · TTFT 520ms · tps 72 · cost $0.184

Samples shown are redacted excerpts from contributor traces. All PII is removed at ingest via three-pass redaction; tenant identifiers are masked. GOAT labs does not provide medical, legal, or financial advice. Model and vendor names are trademarks of their respective owners.
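The High/Low flags in the sample trace follow from comparing each value with its reference range. The sketch below reproduces that comparison for three of the observations shown (troponin I, hemoglobin, creatinine). The `Lab` class and `flag` helper are hypothetical names for illustration, the "N" code for in-range values is an assumption (the trace simply omits the flag for creatinine), and none of this is clinical logic taken from the traced system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lab:
    name: str
    value: float
    low: float
    high: float


def flag(lab: Lab) -> str:
    """Return "H" above range, "L" below range, "N" otherwise (assumed code)."""
    if lab.value > lab.high:
        return "H"
    if lab.value < lab.low:
        return "L"
    return "N"


# Values and reference ranges copied from the fhir.search() response in the sample.
labs = [
    Lab("Troponin I", 0.32, 0.0, 0.04),
    Lab("Hemoglobin", 13.4, 13.5, 17.5),
    Lab("Creatinine", 1.21, 0.7, 1.3),
]

for lab in labs:
    print(lab.name, flag(lab))

# The "8x upper limit" remark in the trace is the value divided by the high bound.
troponin = labs[0]
fold_over_limit = round(troponin.value / troponin.high, 1)
print(fold_over_limit)
```

Rounding the ratio to one decimal place avoids reporting floating-point noise in the fold value.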

Contribute

Get cashback on your tokens,
advance the research.

We pay 5–15% of the original model's output-token price. The rate scales with trace complexity — function calls, external API lookups, multi-turn depth, and multi-agent activity all push the rate higher.

Get paid via: PayPal, USDT, Venmo, ACH, or Card
  1. Observability platform

    Bring a read-only API key from Braintrust, Langfuse, Datadog, Laminar, Arize, Helicone, LangSmith, and 10+ more. No code to write — we pull from your existing pipeline.

  2. Editor or CLI tool

    Connect Cursor, Claude Code, or OpenAI Codex in one click. We tail your usage endpoint and calculate cashback on every token.

Estimate your monthly payouts
Example: $25,000/mo of LLM spend (slider range $500 to $500K) at a 3.00× multiplier gives an estimated payout of $31,500/yr ($2,625/mo), or 10.5% of your total LLM bill.
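The calculator figures above can be reproduced with simple arithmetic: $25,000 of monthly spend at an effective rate of 10.5% gives $2,625 per month and $31,500 per year. The sketch below is only that arithmetic. How the 3.00× multiplier combines with a base rate to reach 10.5% is not spelled out on this page, so the function takes the effective rate directly, and its name is hypothetical.

```python
from decimal import Decimal


def estimate_payout(monthly_spend: Decimal, effective_rate: Decimal) -> tuple[Decimal, Decimal]:
    """Return (monthly, annual) payout for a given spend and effective cashback rate."""
    monthly = monthly_spend * effective_rate
    return monthly, monthly * 12


monthly, annual = estimate_payout(Decimal("25000"), Decimal("0.105"))
print(f"${monthly:,.0f}/mo, ${annual:,.0f}/yr")  # $2,625/mo, $31,500/yr
```

Passing the rate as a `Decimal` string keeps the result exact, which matters when the output is a currency amount.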
Braintrust, Langfuse, Datadog, Laminar, Arize, Phoenix, Helicone, LangSmith, Weave, Galileo, Opik, LangTrace, HoneyHive, Traceloop, Lunary

— or connect your editor directly —

Two sides of the same coin
For research subscribers

Get the Q1 2026 report under embargo.

Frontier labs and enterprises receive each study — plus the anonymised trace dataset it was built on — 120 days before public release. We sell research reports; the dataset is an attachment to the report, not a standalone product. Custom corpus slicing on request.

For contributing teams

Get paid for the data you're already logging.

Pipe in your observability platform with a read-only API key — or link Cursor, Claude Code, or OpenAI Codex for your whole team. Domain multipliers. Net-7 payouts.

✓ 90-second setup · ✓ Read-only access only · ✓ Approve every batch