Mech Interp Reading Group Do - Detailed Analysis & Overview
Join us in this session as we dive into "What ...
Join us in this session as we dive into "Eliciting Secret Knowledge from Language Models" by Bartosz Cywiński, Emil Ryd, Rowan ...
Join us in this session as we dive into "Beyond Linear Probes: Dynamic Safety Monitoring for Language Models" by James ...
Join us in this session as we dive into "Liars' Bench: Evaluating Lie Detectors for Language Models" by Kieron Kretschmar, Walter ...
Join us in this session as we dive into "Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable ...
Join us in this session as we dive into "Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting ...
Join us in this session as we dive into "Tracing the thoughts of a large language model" by Anthropic! Read the article here: ...
Join us in this session as we dive into "Learning a Generative Meta-Model of LLM Activations" by Grace Luo, Jiahai Feng, Trevor ...
Join us in this session as we dive into "Attribution-based Parameter Decomposition" by Dan Braun, Lucius Bushnaq, Stefan ...
Join us in this session as we dive into "Tracing Attention Computation Through Feature Interactions" by Harish Kamath et al.
Join us in this session as we dive into "There Will Be a Scientific Theory of Deep Learning" by Jamie Simon, Daniel Kunin, ...
Join us in this session as we dive into "Symmetry in language statistics shapes the geometry of model representations" by Dhruva ...
Join us in this session as we dive into "Automated Weak-to-Strong Researcher" by Jiaxin Wen, Liang Qiu, Joe Benton, Jan ...