Mech Interp Reading Group - Detailed Analysis & Overview
Join us in this session as we dive into "Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable ...
Join us in this session as we dive into "Eliciting Secret Knowledge from Language Models" by Bartosz Cywiński, Emil Ryd, Rowan ...
Join us in this session as we dive into "Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting ...
Join us in this session as we dive into "What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free ...
Join us in this session as we dive into "Learning a Generative Meta-Model of LLM Activations" by Grace Luo, Jiahai Feng, Trevor ...
Join us in this session as we dive into "Tracing the thoughts of a large language model" by Anthropic!
Join us in this session as we dive into "The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind" by ...
Join us in this session as we dive into "Beyond Linear Probes: Dynamic Safety Monitoring for Language Models" by James ...
Join us in this session as we dive into "Open Problems in Mechanistic interpretability" by Lee Sharkey et al.!
Join us in this session as we dive into "The Circuits Research Landscape: Results and Perspectives" by Anthropic, EleutherAI, ...
Join us in this session as we dive into "Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?" by Maxime ...
Join us in this session as we dive into "Subliminal Learning Is Steering Vector Distillation" by Camila Blank, Agam Bhatia, ...
Join us in this session as we dive into "Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought" by ...