Benchtalks 2 From Swe Bench - Detailed Analysis & Overview

Benchtalks #2: From SWE-bench to ProgramBench: The Future of Coding Benchmarks with John Yang

John Yang is a PhD student at Stanford and the creator of the

SWE-Bench is getting replaced???

We finally got a benchmark that actually matches reality. Thank you Browserbase for sponsoring! Check them out at: ...

Beyond SWE-Bench Pro - Where do Agents go from Here?

Yanis He (

SWE Bench Verified - AI Benchmark

AI Agents Just Crossed a Dangerous Line (SWE-bench 70%+)

AI agents are now writing and shipping production code autonomously — and the benchmarks prove it. In this video: 0:00 — The ...

What is SWE Bench ?

What is Swe Bench Pro?

SWE-BENCH: CAN LANGUAGE MODELS RESOLVE REAL-WORLD GITHUB ISSUES?

The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals

Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment ...

Chain of Thought | Introducing SWE-Bench Pro

Claw-SWE-Bench: Benchmark for LLM Coding Agents

In this AI Research Roundup episode, Alex discusses the paper: 'Claw-

Interpreting SWE-bench Scores

SWE bench & SWE agent | Data Brew | Episode 44

In this episode, Kilian Lieret, Research Software Engineer, and Carlos Jimenez, Computer Science PhD Candidate at Princeton ...