Steering Language Model Refusal with Sparse Autoencoders

Steering Language Model Refusal with Sparse Autoencoders

[QA] Steering Language Model Refusal with Sparse Autoencoders

Steering vectors: tailor LLMs without training. Part I: Theory (Interpretability Series)

Steering vectors: tailor LLMs without training. Part II: Code (Interpretability Series)

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Inference-Time Steering Is Riskier Than You Think (EACL 2026)

Steering LLM Behavior Without Fine-Tuning

Mechanistic Analysis of LLM Steering Vectors

[QA] Programming Refusal with Conditional Activation Steering

Alignment faking in large language models

Refusal in Language Models Is Mediated by a Single Direction

NEC Talks: Improving Instruction Following in Language Models via Activation Steering – A. Stolfo

Steering Language Model Refusal with Sparse Autoencoders

The paper explores using sparse autoencoders to steer ...

[QA] Steering Language Model Refusal with Sparse Autoencoders

The paper explores using sparse autoencoders to steer ...

Steering vectors: tailor LLMs without training. Part I: Theory (Interpretability Series)

State-of-the-art foundation ...

Steering vectors: tailor LLMs without training. Part II: Code (Interpretability Series)

See Part I for an intro into ...

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Paper: What Drives Representation ...

Inference-Time Steering Is Riskier Than You Think (EACL 2026)

Are the techniques we use to control ...

Steering LLM Behavior Without Fine-Tuning

Modify the behavior or the personality of a ...

Mechanistic Analysis of LLM Steering Vectors

In this AI Research Roundup episode, Alex discusses the paper: 'What Drives Representation ...

[QA] Programming Refusal with Conditional Activation Steering

The paper introduces Conditional Activation ...

Alignment faking in large language models

Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do ...

Refusal in Language Models Is Mediated by a Single Direction

Study explores ...

NEC Talks: Improving Instruction Following in Language Models via Activation Steering – A. Stolfo

Alessandro Stolfo, PhD Candidate at ETH Zürich and Doctoral Fellow at the Swiss Cyber-Defence (CYD) Campus Abstract: ...

Machine Learning Security Seminar Series - Andy Arditi (Northeastern)

Title: A mechanistic view of ...