Media Summary: The paper explores using sparse autoencoders to steer ... The paper introduces affine concept editing (ACE) for controlling ... The paper introduces Conditional Activation Steering (CAST), a method for selectively controlling LLM responses based on input ...

QA Refusal in Language Models - Detailed Analysis & Overview

A Google TechTalk, 2025-06-11, presented by Ashwinee Panda. Privacy in ML Seminar. ABSTRACT: It is widely believed that ... Demonstration ITerated Task Optimization (DITTO) aligns ...

[QA] Refusal in Language Models Is Mediated by a Single Direction
Refusal in Language Models Is Mediated by a Single Direction
[QA] Does Refusal Training in LLMs Generalize to the Past Tense?
[QA] Steering Language Model Refusal with Sparse Autoencoders
I Removed an AI's Ability to Say No (Refusal Ablation Explained)
[QA] Refusal in LLMs is an Affine Function
Towards Monosemanticity: Decomposing Language Models Into Understandable Components
[QA] Programming Refusal with Conditional Activation Steering
Worst-Case Membership Inference of Language Models
Why Can’t Language Models Learn to Speak Backwards?
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
🤗 Tasks: Causal Language Modeling

[QA] Refusal in Language Models Is Mediated by a Single Direction

Study explores ...

Refusal in Language Models Is Mediated by a Single Direction

Study explores ...

[QA] Does Refusal Training in LLMs Generalize to the Past Tense?

Refusal ...

[QA] Steering Language Model Refusal with Sparse Autoencoders

The paper explores using sparse autoencoders to steer ...

I Removed an AI's Ability to Say No (Refusal Ablation Explained)

... flagship models: https://github.com/elder-plinius/L1B3RT4S/blob/main/OPENAI.mkd Research: "

[QA] Refusal in LLMs is an Affine Function

The paper introduces affine concept editing (ACE) for controlling ...

Towards Monosemanticity: Decomposing Language Models Into Understandable Components

This week, we're discussing "Decomposing ...

[QA] Programming Refusal with Conditional Activation Steering

The paper introduces Conditional Activation Steering (CAST), a method for selectively controlling LLM responses based on input ...

Worst-Case Membership Inference of Language Models

A Google TechTalk, 2025-06-11, presented by Ashwinee Panda. Privacy in ML Seminar. ABSTRACT: It is widely believed that ...

Why Can’t Language Models Learn to Speak Backwards?

Language models ...

Show, Don't Tell: Aligning Language Models with Demonstrated Feedback

Demonstration ITerated Task Optimization (DITTO) aligns ...

🤗 Tasks: Causal Language Modeling

An overview of the Causal ...

Testing LLM Integrations: Why Traditional QA Breaks

Testing ...