Interpretable Features

A team at ๐€๐ง๐ญ๐ก๐ซ๐จ๐ฉ๐ข๐œ, creator of the Claude models, published a paper about extracting ๐ข๐ง๐ญ๐ž๐ซ๐ฉ๐ซ๐ž๐ญ๐š๐›๐ฅ๐ž ๐Ÿ๐ž๐š๐ญ๐ฎ๐ซ๐ž๐ฌ from Claude 3 Sonnet. This is achieved by placing a sparse autoencoder halfway through the model and then training it. An autoencoder is a neural network that learns to encode input data, here a middle layer of Claude, into a compressed vector representation and then decode it back to the original input. In a sparse autoencoder, a sparsity penalty is added to the loss function, encouraging most units in the representation to remain inactive, which helps in capturing essential features efficiently.

It turns out that these features range from very concrete, e.g., 'Golden Gate Bridge,' to highly abstract and conceptual, such as 'code error' or 'inner conflict.' To get a feeling for the quality of these features, it is illuminating to look at their nearest neighbors (features with similar representation vectors), as shown in Figure 1. The paper contains a link to an interactive tool with more such examples.

Figure 1: Nearest neighbors to ‘inner conflict’ feature. Image from the paper.
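Finding such neighbors amounts to a cosine-similarity lookup over the feature vectors. A minimal sketch, with a toy hand-made dictionary standing in for the learned one:

```python
import numpy as np

def nearest_features(query_idx, W_dec, k=3):
    """Indices of the k features whose vectors are closest (by cosine
    similarity) to the vector of feature `query_idx`."""
    d = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    sims = d @ d[query_idx]
    sims[query_idx] = -np.inf  # exclude the query feature itself
    return np.argsort(-sims)[:k]

# Toy dictionary: 5 feature vectors in a 3-dimensional space.
W_dec = np.array([
    [1.0, 0.0, 0.0],  # feature 0
    [0.9, 0.1, 0.0],  # feature 1: nearly parallel to feature 0
    [0.0, 1.0, 0.0],  # feature 2
    [0.0, 0.0, 1.0],  # feature 3
    [0.7, 0.7, 0.0],  # feature 4
])
print(nearest_features(0, W_dec, k=2))  # feature 1, then feature 4
```

In the paper the same idea is applied to millions of learned features, which is what produces neighbor lists like the one in Figure 1.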

Moreover, the features are ๐ฆ๐ฎ๐ฅ๐ญ๐ข๐ฆ๐จ๐๐š๐ฅ: the โ€˜Golden Gate Bridgeโ€™ feature will get activated regardless of whether the input is an image or a text. The features also carry across ๐ฅ๐š๐ง๐ ๐ฎ๐š๐ ๐ž๐ฌ: the โ€˜tourist attractionโ€™ feature (see Figure 2) will get activated when the model sees either โ€˜tour Eiffelโ€™ in French or ้‡‘้—จๅคงๆกฅ (Golden Gate Bridge in Mandarin).

Figure 2: Inputs activating the ‘tourist attraction’ feature. Image from the paper.

The experiments on influencing the model’s behavior, called feature steering, are fascinating reading. When clamping (i.e., manually setting in the model) the ‘transit infrastructure’ feature to five times its maximum value, the model will send you across a bridge when asked for directions, where it otherwise wouldn’t have.
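Mechanically, clamping is simple: overwrite one feature activation before decoding back into the model's activation space. A minimal sketch, where feature index 42 and the observed maximum of 8.0 are hypothetical stand-ins for the 'transit infrastructure' feature and its measured maximum:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 64, 512
W_dec = rng.normal(0, 0.1, (d_features, d_model))  # toy decoder dictionary

def steer(features, idx, observed_max, scale=5.0):
    """Clamp feature `idx` to `scale` times its observed maximum
    activation, then decode back to an activation vector that would
    replace the original one inside the model."""
    f = features.copy()
    f[idx] = scale * observed_max
    return f @ W_dec

# Toy non-negative feature activations, as a ReLU encoder would produce.
features = np.maximum(0.0, rng.normal(size=d_features))
steered_activation = steer(features, idx=42, observed_max=8.0)
```

The steered activation vector is then substituted into the forward pass, so every subsequent layer sees a model state in which that one feature is strongly "on".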

At this point, you might be wondering whether these insights could be applied to increase models’ safety. Indeed, the paper reports detecting, e.g., unsafe code, bias, sycophancy (I had to look this one up: behavior of flattering or excessively praising someone to gain favor or advantage), deception and power-seeking, and dangerous or criminal information. Could feature steering help steer models’ answers in favorable ways? The authors caution against high expectations, but I believe this research direction has sufficient potential to warrant further exploration.
