Mech Interp Reading Group

Mech Interp Reading Group - Formal Mech Interp: Automated Circuit Discovery with Provable Guarantees

Join us in this session as we dive into "Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable ...

Mech Interp Reading Group - Eliciting Secret Knowledge from Language Models

Join us in this session as we dive into "Eliciting Secret Knowledge from Language Models" by Bartosz Cywiński, Emil Ryd, Rowan ...

Mech Interp Reading Group - ITDA: A Scalable Approach to Interpreting Large Language Models

Join us in this session as we dive into "Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting ...

Mech Interp Reading Group - What Do VLMs NOTICE?

Join us in this session as we dive into "What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free ...

Mech Interp Reading Group - Learning a Generative Meta-Model of LLM Activations

Join us in this session as we dive into "Learning a Generative Meta-Model of LLM Activations" by Grace Luo, Jiahai Feng, Trevor ...

Mech Interp Reading Group - Tracing the thoughts of a large language model

Join us in this session as we dive into "Tracing the thoughts of a large language model" by Anthropic!

Mech Interp Reading Group - The Secret Agenda: LLMs Strategically Lie, Our Safety Tools Are Blind

Join us in this session as we dive into "The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind" by ...

Mech Interp Reading Group - Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

Join us in this session as we dive into "Beyond Linear Probes: Dynamic Safety Monitoring for Language Models" by James ...

Mech Interp Reading Group - Open Problems in Mechanistic Interpretability

Join us in this session as we dive into "Open Problems in Mechanistic Interpretability" by Lee Sharkey et al.!

Mech Interp Reading Group - The Circuits Research Landscape: Results and Perspectives

Join us in this session as we dive into "The Circuits Research Landscape: Results and Perspectives" by Anthropic, EleutherAI, ...

Mech Interp Reading Group - Everything, Everywhere, All at Once: Is Mechanistic Interp Identifiable?

Join us in this session as we dive into "Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?" by Maxime ...

Mech Interp Reading Group - Subliminal Learning Is Steering Vector Distillation

Join us in this session as we dive into "Subliminal Learning Is Steering Vector Distillation" by Camila Blank, Agam Bhatia, ...

Mech Interp Reading Group - Global CoT Analysis: Attempts to uncover patterns across many CoT

Join us in this session as we dive into "Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought" by ...