Mech Interp Reading Group - Detailed Analysis & Overview

Mech Interp Reading Group - What Do VLMs NOTICE?

Join us in this session as we dive into "What

Mech Interp Reading Group - Eliciting Secret Knowledge from Language Models

Join us in this session as we dive into "Eliciting Secret Knowledge from Language Models" by Bartosz Cywiński, Emil Ryd, Rowan ...

Mech Interp Reading Group - Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

Join us in this session as we dive into "Beyond Linear Probes: Dynamic Safety Monitoring for Language Models" by James ...

Mech Interp Reading Group - Liars' Bench: Evaluating Lie Detectors for Language Models

Join us in this session as we dive into "Liars' Bench: Evaluating Lie Detectors for Language Models" by Kieron Kretschmar, Walter ...

Mech Interp Reading Group - Formal Mech Interp: Automated Circuit Discovery with Provable Guarantees

Join us in this session as we dive into "Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable ...

Mech Interp Reading Group - ITDA: A Scalable Approach to Interpreting Large Language Models

Join us in this session as we dive into "Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting ...

Mech Interp Reading Group - Tracing the thoughts of a large language model

Join us in this session as we dive into "Tracing the thoughts of a large language model" by Anthropic! Read the article here: ...

Mech Interp Reading Group - Learning a Generative Meta-Model of LLM Activations

Join us in this session as we dive into "Learning a Generative Meta-Model of LLM Activations" by Grace Luo, Jiahai Feng, Trevor ...

Mech Interp Reading Group - Attribution-based Parameter Decomposition

Join us in this session as we dive into "Attribution-based Parameter Decomposition" by Dan Braun, Lucius Bushnaq, Stefan ...

Mech Interp Reading Group - Tracing Attention Computation Through Feature Interactions

Join us in this session as we dive into "Tracing Attention Computation Through Feature Interactions" by Harish Kamath et al.

Mech Interp Reading Group - There Will Be a Scientific Theory of Deep Learning

Join us in this session as we dive into "There Will Be a Scientific Theory of Deep Learning" by Jamie Simon, Daniel Kunin, ...

Mech Interp Reading Group - Symmetry in language statistics shapes geometry of model representations

Join us in this session as we dive into "Symmetry in language statistics shapes the geometry of model representations" by Dhruva ...

Mech Interp Reading Group - Automated Weak-to-Strong Researcher

Join us in this session as we dive into "Automated Weak-to-Strong Researcher" by Jiaxin Wen, Liang Qiu, Joe Benton, Jan ...