SlopCodeBench: Measuring Code Erosion as Agents Iterate - Detailed Analysis & Overview

SlopCodeBench: Measuring Code Erosion as Agents Iterate

SlopCodeBench

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks (Mar 2026)

SlopCodeBench: Evaluating Iterative Coding Agents

In this AI Research Roundup episode, Alex discusses the paper: '

The SWE-bench Lie: Why "95%" Says Nothing About Your Code

A model just scored 95% on SWE-bench — and that number tells you almost nothing about whether it can fix a bug in your repo.

Coz: finding code that counts with causal profiling

Authors: Charlie Curtsinger, Emery D. Berger Abstract: Improving performance is a central concern for software developers.

37,000 Lines of Slop

Scott's thoughts on AI slop and slowing down. Slowing Down: ...

NatureBench: Testing Coding Agents on Science

In this AI Research Roundup episode, Alex discusses the paper: 'NatureBench: Can Coding Agents Match the Published SOTA of ...

Erlang Factory SF 2015 - Scott Lystig Fritchie - Actively measuring and profiling Code

... what you were

Speculative decoding vs standard LLM inference: Side-by-side speed benchmark

This side-by-side comparison demonstrates the real-world performance difference between standard large language model (LLM) ...

145 - Code vs. [REDACTED] (BEST

Caveat emptor! FORENSIC DISCLAIMER: Technical Note: "

AI Just Solved Coding: The SWE-Bench Data Nobody Is Talking About

Claude Mythos 5 scored 95.5% on SWE-bench Verified as of June 27, 2026 — up from 4.4% when GPT-4 attempted the same ...