Interpretable Features

A team at ๐€๐ง๐ญ๐ก๐ซ๐จ๐ฉ๐ข๐œ, creator of the Claude models, published a paper about extracting ๐ข๐ง๐ญ๐ž๐ซ๐ฉ๐ซ๐ž๐ญ๐š๐›๐ฅ๐ž ๐Ÿ๐ž๐š๐ญ๐ฎ๐ซ๐ž๐ฌ from Claude 3 Sonnet. This is achieved by placing a sparse autoencoder halfway through the model and then training it. An autoencoder is a neural network that learns to encode input data, here a middle layer of Claude, into a compressed vector representation and then decode it back to the original input. In a sparse autoencoder, a sparsity penalty is added to the loss function, encouraging most units in the representation to remain inactive, which helps in capturing essential features efficiently.

It turns out that these features range from very concrete, e.g., 'Golden Gate Bridge,' to highly abstract and conceptual, such as 'code error' or 'inner conflict.' To get a feeling for the quality of these features, it is illuminating to look at their nearest neighbors (features with similar representation vectors), as shown in Figure 1. The paper contains a link to an interactive tool with more such examples.

Figure 1: Nearest neighbors to ‘inner conflict’ feature. Image from the paper.
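Finding such neighbors amounts to a cosine-similarity lookup over the feature vectors. A minimal sketch, with a toy hand-made dictionary standing in for the learned one:

```python
import numpy as np

def nearest_features(query_idx, W_dec, k=3):
    """Indices of the k features whose vectors are closest (by cosine
    similarity) to the vector of feature `query_idx`."""
    d = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    sims = d @ d[query_idx]
    sims[query_idx] = -np.inf  # exclude the query feature itself
    return np.argsort(-sims)[:k]

# Toy dictionary: 5 feature vectors in a 3-dimensional space.
W_dec = np.array([
    [1.0, 0.0, 0.0],  # feature 0
    [0.9, 0.1, 0.0],  # feature 1: nearly parallel to feature 0
    [0.0, 1.0, 0.0],  # feature 2
    [0.0, 0.0, 1.0],  # feature 3
    [0.7, 0.7, 0.0],  # feature 4
])
print(nearest_features(0, W_dec, k=2))  # feature 1, then feature 4
```

In the paper the same idea is applied to millions of learned features, which is what produces neighbor lists like the one in Figure 1.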

Moreover, the features are ๐ฆ๐ฎ๐ฅ๐ญ๐ข๐ฆ๐จ๐๐š๐ฅ: the โ€˜Golden Gate Bridgeโ€™ feature will get activated regardless of whether the input is an image or a text. The features also carry across ๐ฅ๐š๐ง๐ ๐ฎ๐š๐ ๐ž๐ฌ: the โ€˜tourist attractionโ€™ feature (see Figure 2) will get activated when the model sees either โ€˜tour Eiffelโ€™ in French or ้‡‘้—จๅคงๆกฅ (Golden Gate Bridge in Mandarin).

Figure 2: Inputs activating the ‘tourist attraction’ feature. Image from the paper.

The experiments on influencing the model’s behavior, called feature steering, are fascinating reading. When clamping (i.e., manually setting in the model) the ‘transit infrastructure’ feature to five times its maximum value, the model will send you across a bridge when asked for directions, where it otherwise wouldn’t have.
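Mechanically, clamping is simple: overwrite one feature activation before decoding back into the model's activation space. A minimal sketch, where feature index 42 and the observed maximum of 8.0 are hypothetical stand-ins for the 'transit infrastructure' feature and its measured maximum:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 64, 512
W_dec = rng.normal(0, 0.1, (d_features, d_model))  # toy decoder dictionary

def steer(features, idx, observed_max, scale=5.0):
    """Clamp feature `idx` to `scale` times its observed maximum
    activation, then decode back to an activation vector that would
    replace the original one inside the model."""
    f = features.copy()
    f[idx] = scale * observed_max
    return f @ W_dec

# Toy non-negative feature activations, as a ReLU encoder would produce.
features = np.maximum(0.0, rng.normal(size=d_features))
steered_activation = steer(features, idx=42, observed_max=8.0)
```

The steered activation vector is then substituted into the forward pass, so every subsequent layer sees a model state in which that one feature is strongly "on".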

At this point, you might be wondering whether these insights could be applied to increase models’ safety. Indeed, the paper reports detecting, e.g., unsafe code, bias, sycophancy (I had to look this one up: behavior of flattering or excessively praising someone to gain favor or advantage), deception and power-seeking, and dangerous or criminal information. Could feature steering help steer models’ answers in favorable ways? The authors caution against high expectations, but I believe this research direction has sufficient potential to warrant further exploration.
