Anthropic recently unveiled a study titled "Evaluating Feature Steering: A Case Study in Mitigating Social Biases". The study explores feature steering in Claude 3 Sonnet, one of the company's language models, aiming to understand whether this technique can effectively mitigate social biases without compromising the model's overall capabilities.
The research builds on Anthropic's previous interpretability work, which demonstrated their ability to identify and manipulate specific interpretable features of the model. The new experiments examine whether feature steering, a method that adjusts the influence of individual features by modifying the model's internal state, can reliably mitigate social biases without affecting the model's other capabilities.
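To make the mechanism concrete, here is a minimal sketch of additive feature steering: a pre-computed feature direction is scaled and added to a layer's activations during the forward pass. The toy module, the random feature direction, and the strength value are illustrative assumptions, not details of Anthropic's implementation.

```python
import torch
import torch.nn as nn

HIDDEN = 64

class ToyBlock(nn.Module):
    """Stand-in for a single residual-stream layer of a language model."""
    def __init__(self, hidden: int = HIDDEN):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Simplified residual update; real models use attention + MLP blocks.
        return x + torch.relu(self.proj(x))

# Hypothetical unit-norm feature direction; in the study, features come from
# prior interpretability work, not random vectors.
feature_direction = torch.randn(HIDDEN)
feature_direction = feature_direction / feature_direction.norm()

def make_steering_hook(direction: torch.Tensor, strength: float):
    """Create a forward hook that nudges activations along `direction`."""
    def hook(module, inputs, output):
        # Returning a tensor from a forward hook replaces the module's output.
        return output + strength * direction
    return hook

block = ToyBlock()
handle = block.register_forward_hook(make_steering_hook(feature_direction, 4.0))

activations = torch.randn(2, 8, HIDDEN)   # (batch, sequence, hidden)
steered = block(activations)              # forward pass with steering applied
handle.remove()                           # detach the hook to restore default behaviour
unsteered = block(activations)
print(f"activation shift from steering: {(steered - unsteered).norm().item():.2f}")
```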
The Steering “Sweet Spot”
The study examined 29 different features, each linked to social biases, to measure the effect of feature steering on model performance. In other words, researchers aimed to determine if tweaking these features could reduce biases without harming the model’s effectiveness.
In their experiments, Anthropic found promising, albeit mixed, results. The research demonstrated that while feature steering could reduce biases, such as gender or disability bias, it also produced some unintended off-target effects. For example, attempts to reduce gender bias inadvertently increased age bias, highlighting the complex and interconnected nature of these features.
However, they did discover that within a defined steering range (dubbed the "sweet spot"), features could be adjusted without significantly impairing the model's capabilities.
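The sketch below illustrates, with placeholder scoring functions rather than the study's actual evaluations, how such a sweet spot might be located: sweep the steering strength, score capability and bias at each setting, and keep only the strengths whose capability drop stays within a tolerance.

```python
import numpy as np

def capability_score(strength: float) -> float:
    """Placeholder capability metric: degrades as steering strength grows."""
    return 1.0 - 0.02 * strength ** 2

def bias_score(strength: float) -> float:
    """Placeholder bias metric: lower is better, improves with negative steering."""
    return max(0.0, 0.5 + 0.05 * strength)

BASELINE = capability_score(0.0)
TOLERANCE = 0.05  # maximum acceptable capability drop

# Keep only steering strengths whose capability loss stays within tolerance.
sweet_spot = [s for s in np.linspace(-5.0, 5.0, 21)
              if BASELINE - capability_score(s) <= TOLERANCE]

best = min(sweet_spot, key=bias_score)  # least-biased setting inside the sweet spot
print(f"sweet spot: {min(sweet_spot):.1f} to {max(sweet_spot):.1f}; "
      f"best strength: {best:.1f}")
```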
Mixed Results and Limitations
One highlight of the study is the identification of a "neutrality" feature, which consistently reduced social biases across nine dimensions without heavily compromising the model's capabilities. However, the researchers acknowledged that steering was not always predictable: adjusting a feature sometimes led to significant changes in unrelated areas, an issue they refer to as off-target effects.
In the spirit of transparency and openness, Anthropic shared both the promising results and the limitations, providing insight into the potential, as well as the risks, of this method.
Implications for AI Bias Monitoring
The research also points to a broader movement toward transparency and control in AI development. Platforms like Tracking AI already monitor biases in large language models, highlighting political biases and trends over time. Against that backdrop, ongoing evaluation will be crucial to ensure that emerging tools like feature steering are used responsibly and effectively to mitigate biases without unintended consequences.
The path forward, as Anthropic envisions it, involves refining feature steering to improve its precision while minimizing unexpected outcomes. The researchers suggest potential approaches such as examining circuits (interconnected groups of neurons within the model that work together to perform specific functions) or exploring alternative steering methods, like multiplicative or conditional steering, to achieve more reliable control.
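As a purely illustrative sketch (the constants and the NumPy representation are assumptions, not details from the paper), the contrast between these styles of adjustment might look like this when applied to a single feature's per-token activation:

```python
import numpy as np

# Hypothetical per-token activation of one feature (zero where it is inactive).
feature_act = np.array([0.0, 0.3, 1.2, 0.0, 2.1])

additive = feature_act + 0.5                  # shift the feature everywhere
multiplicative = feature_act * 1.5            # scale it only where it already fires
conditional = np.where(feature_act > 1.0,     # intervene only above a threshold
                       feature_act * 0.5,
                       feature_act)

print("additive:      ", additive)
print("multiplicative:", multiplicative)
print("conditional:   ", conditional)
```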