SlopCodeBench: Benchmarking How Coding Agents - Detailed Analysis & Overview

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks (Mar 2026)

SlopCodeBench: Evaluating Iterative Coding Agents

In this AI Research Roundup episode, Alex discusses the paper: '

SlopCodeBench: Measuring Code Erosion as Agents Iterate

SlopCodeBench

NatureBench: Testing Coding Agents on Science

In this AI Research Roundup episode, Alex discusses the paper: 'NatureBench: Can

Local Coding Agents on Strix Halo and R9700: Pi, Opencode, and SWE-bench Mini Benchmarks

Episode 1 of a series on building and running AI

Evaluate agents on SWE-Bench

SWE-Bench is one of the most popular (and difficult)

The SWE-bench Lie: Why "95%" Says Nothing About Your Code

A model just scored 95% on SWE-bench — and that number tells you almost nothing about whether it can fix a bug in your repo.

ProgramBench: New Coding Benchmark for LLM Agents

In this AI Research Roundup episode, Alex discusses the paper: 'ProgramBench: Can Language Models Rebuild Programs From ...

Benchmarking AI Agents Against Realistic Analytical Tasks with ADE-bench

[2026 - DAY 2 -

Benchtalks #2: From SWE-bench to ProgramBench: The Future of Coding Benchmarks with John Yang

John Yang is a PhD student at Stanford and the creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ...

Claw-SWE-Bench: Benchmark for LLM Coding Agents

In this AI Research Roundup episode, Alex discusses the paper: 'Claw-SWE-Bench: A

AI Just Solved Coding: The SWE-Bench Data Nobody Is Talking About

Claude Mythos 5 scored 95.5% on SWE-bench Verified as of June 27, 2026 — up from 4.4% when GPT-4 attempted the same ...

Evaluate coding agents on financial SWE work with Ramp SWE-Bench

Today we're releasing Ramp SWE-Bench: a private, production-grounded