Media Summary: This video introduces a new series on testing ... This lecture discusses the critical shift from evaluating static LLMs to complex ... On SWE-Bench Pro, six frontier models land within a couple of percentage points of each other. The harness they run inside shifts ...

Offline Evaluations for AI Agents - Detailed Analysis & Overview

The agent evaluation revolution
LLM as a Judge: Scaling AI Evaluation Strategies
How to Evaluate AI Agents: Comprehensive Strategies for Reliable, High‑Quality Agentic Systems
Agent Evaluation & Benchmarks - Agentic AI MOOC 2025 Lecture 4 Summary
How to Evaluate Your AI Agent Using Test Cases and Metrics
Agent evaluation with ADK & Vertex AI | The Agent Factory Podcast
Offline Evaluations for AI Agents in WSO2 Integrator
Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind
Building and evaluating AI Agents — Sayash Kapoor, AI Snake Oil
Observability and Evals for AI Agents: A Simple Breakdown
Agentic Evals by Shishir Patil
Evaluating and Debugging Non-Deterministic AI Agents

The agent evaluation revolution

This video introduces a new series on testing

LLM as a Judge: Scaling AI Evaluation Strategies

Ready to become a certified watsonx

How to Evaluate AI Agents: Comprehensive Strategies for Reliable, High‑Quality Agentic Systems

Evaluating

Agent Evaluation & Benchmarks - Agentic AI MOOC 2025 Lecture 4 Summary

This lecture discusses the critical shift from evaluating static LLMs to complex

How to Evaluate Your AI Agent Using Test Cases and Metrics

Building reliable

Agent evaluation with ADK & Vertex AI | The Agent Factory Podcast

Learn how to effectively

Offline Evaluations for AI Agents in WSO2 Integrator

Agent evaluations

Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind

On SWE-Bench Pro, six frontier models land within a couple of percentage points of each other. The harness they run inside shifts ...

Building and evaluating AI Agents — Sayash Kapoor, AI Snake Oil

Is 2025 the year of

Observability and Evals for AI Agents: A Simple Breakdown

You don't know what your

Agentic Evals by Shishir Patil

Shishir Patil, a Research Scientist at Meta, delivered a presentation on

Evaluating and Debugging Non-Deterministic AI Agents

Evaluate

Agent Evaluation Series | Foundations & Evaluation Fundamentals for Agentic AI | FastTrack TechTalk

Discover how Microsoft's