What are polysemantic neurons?

For a “monosemantic” neuron, there’s a single feature that determines whether it activates strongly. If a neuron responds only to grandmothers, we might call it a grandmother neuron. For a “polysemantic” neuron, in contrast, there are multiple features that can cause it to activate strongly.

As an example, feature visualizations of one such neuron show that it activates when it sees any of a cat face, cat legs, or the front of a car. As far as anyone can tell, this neuron does not respond to both cats and cars because they share some underlying feature. Rather, the neuron just happened to get two unrelated features attached to it.

Why do we think that the neurons are not encoding a shared similarity?

Suppose a neuron is picking out some feature shared by cars and cats. Say the neuron is representing “sleekness”. Then we’d expect images of other “sleek” things, like a snake or a ferret, to activate the neuron. So if we generate lots of different images which highly activate our neuron, and find that they do contain snakes and ferrets, that’s evidence for the neuron picking up on a unified concept of sleekness. Researchers have run such experiments on neurons like this one and found that, no, they activate only on cats and cars — just as the “polysemantic” hypothesis would lead us to expect.
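For the curious, here is a minimal sketch of how such feature-visualization experiments work, assuming PyTorch and torchvision; the model, layer, and channel index below are arbitrary placeholders, and real tooling adds image regularizers and transformations to get cleaner visualizations.

```python
import torch
import torchvision.models as models

# Load a pretrained vision model (an arbitrary choice for illustration).
model = models.googlenet(weights="DEFAULT").eval()

# Capture the activations of one intermediate layer via a forward hook.
acts = {}
model.inception4c.register_forward_hook(lambda mod, inp, out: acts.update(out=out))

channel = 447  # hypothetical index of the neuron/channel under study

# Start from noise and ascend the gradient of that channel's mean activation.
img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
for step in range(256):
    opt.zero_grad()
    model(img)
    loss = -acts["out"][0, channel].mean()  # negative, because we want to maximize
    loss.backward()
    opt.step()

# Inspecting many optimized images like `img` (plus top-activating dataset
# examples) is how researchers check whether a neuron responds to one concept
# or to several unrelated ones.
```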

Why do polysemantic neurons form?

Polysemantic neurons seem to result from a phenomenon known as “superposition”. "Superposition", in this case, means that the neural net takes a larger number of numerical "features" of its input, each of which might otherwise have had its own neuron (for example, one neuron for cat-ness and one for car-ness), and instead spreads those features across a smaller number of neurons, mixing ("loading") them together in different proportions for each neuron.

In fact, if we only care about packing as many features into n neurons as we can, then using polysemantic neurons lets us pack roughly as many as exp(C * n) features, where C is a constant depending on how much overlap between concepts we allow.1 In contrast, using monosemantic neurons would only let us pack in n features.
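To give a rough feel for this claim, here is a toy numerical sketch (not a proof, and with arbitrarily chosen sizes): random directions in an n-dimensional activation space are nearly orthogonal to one another with high probability, which is the geometric fact that lets far more than n feature directions coexist with only small pairwise interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 100       # dimensionality of the activation space
n_features = 2000     # far more candidate feature directions than neurons

# Give each feature a random unit-norm direction in activation space.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Measure the overlap (cosine similarity) between every pair of distinct features.
overlaps = np.abs(directions @ directions.T)
np.fill_diagonal(overlaps, 0.0)
print(f"largest overlap between any two of {n_features} features: {overlaps.max():.2f}")
# Typically prints something around 0.5: two thousand directions fit into a
# 100-dimensional space without any pair coming close to parallel.
```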

Neural network training processes apparently converge on superposition because it lets them pack more features into the limited number of neurons available, or represent the same features with fewer neurons, conserving the rest for more important computations.

This is possible most of the time because most sets of features have a property called "sparsity". "Sparsity" in this case means that, even though each feature in the set has some plausible network input for which it takes a value far from zero, for most inputs most of the features are very small or exactly zero.

Essentially, the network is using a smaller number of "real features" (the neuron activations) to do computations on a larger number of "virtual features", but with some risk of getting its "virtual wires" crossed (in a manner strikingly similar to synesthesia!), or of dropping virtual features whose values are too small to stand out from zero amid all the mixed-together confusion.
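The following toy sketch (with made-up sizes) illustrates both halves of that picture: sparse "virtual features" are stored along random directions in a smaller space of "real" neurons and then read back out, with a little interference from the other active features, i.e. the crossed virtual wires.

```python
import numpy as np

rng = np.random.default_rng(1)
n_virtual, n_neurons = 1000, 100   # many virtual features, fewer real neurons

# Each virtual feature gets a random (nearly orthogonal) direction in neuron space.
embed = rng.normal(size=(n_virtual, n_neurons)) / np.sqrt(n_neurons)

# A sparse input: only a handful of the virtual features are nonzero at once.
features = np.zeros(n_virtual)
active = rng.choice(n_virtual, size=5, replace=False)
features[active] = rng.uniform(1.0, 2.0, size=5)

# "Store" the features in superposition, then "read out" every virtual feature.
neuron_acts = features @ embed     # 100 polysemantic neuron activations
readout = embed @ neuron_acts      # estimates of all 1000 virtual features

for i in active:
    print(f"feature {i}: true {features[i]:.2f}, read back {readout[i]:.2f}")
# Active features come back approximately right; inactive ones read back as small
# nonzero noise, which is the interference between virtual features.
```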

What are the consequences of polysemantic neurons arising in networks?

Polysemantic neurons are a major challenge for the “circuits” research agenda for neural network interpretability, because they limit our ability to reason about neural networks. It’s harder to interpret the computation a circuit performs if the activations of its neurons have multiple meanings. As an example: suppose a circuit contains just two polysemantic neurons, each encoding five different features, with a single weight governing the connection between them. That one weight then effectively governs 25 different connections between features, which means 25 different possible computations could be occurring using only two neurons. That makes it very hard to figure out which computation those neurons are performing at any given time.
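To make that combinatorics concrete, here is a trivial enumeration; the feature labels are invented placeholders, not features from any real network.

```python
# Two polysemantic neurons, five features each, one real weight between them.
features_a = ["cat face", "car front", "clock", "dog ear", "wheel"]    # hypothetical
features_b = ["whiskers", "headlight", "hour hand", "fur", "tire"]     # hypothetical

# Every feature-to-feature link is forced to share the same single weight.
virtual_connections = [(a, b) for a in features_a for b in features_b]
print(len(virtual_connections))  # 25
```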

Progress in interpreting polysemantic neurons

There has been substantial progress. In 2023, Anthropic claimed a breakthrough on this problem in their paper “Towards Monosemanticity”. Anthropic trained large "sparse autoencoder" networks on the dense (non-sparse) activations of other, more polysemantic networks, decomposing those activations into sparse activations spread across a much larger number of neurons. These sparse activations were reported to be more monosemantic, corresponding to more interpretable features.
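In outline, a sparse autoencoder maps each dense activation vector into a much wider latent layer with a sparsity penalty, then reconstructs the original activations from it. The sketch below is a minimal toy version with made-up sizes and random stand-in data, not Anthropic’s actual training setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose dense activations into many sparse, hopefully monosemantic, features."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # d_features >> d_model
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        latents = torch.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(latents)
        return recon, latents

# Hypothetical sizes: 512-dimensional activations, 16x as many SAE features.
sae = SparseAutoencoder(d_model=512, d_features=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # trades reconstruction quality against sparsity

acts = torch.randn(4096, 512)  # stand-in for activations collected from a real model
for step in range(100):
    opt.zero_grad()
    recon, latents = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().mean()
    loss.backward()
    opt.step()
```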

As a result of such progress, Christopher Olah of Anthropic stated he is “now very optimistic [about superposition]”, and would “go as far as saying it’s now primarily an engineering problem — hard, but less fundamental risk.” Why did we caveat Anthropic’s claims? Because some researchers, like Ryan Greenblatt, are more skeptical about the utility of sparse autoencoders as a solution to polysemanticity.

And while we've made some progress in splitting polysemantic neurons into monosemantic ones, that still leaves the issue of figuring out how those polysemantic neurons are used by the network to solve problems. That is, how do we find, and interpret, polysemantic computations in a neural network? This is quite hard, as we've noted above, because a collection of polysemantic neurons can represent many different computations using the features in superposition.

In 2024, a preprint reported some early successes on this problem with an automatic technique for "small" language models (i.e., those under 100 million parameters). The technique, named SHIFT2, builds on sparse autoencoders to find the many different computations that could be occurring within a collection of neurons, polysemantic or not. Each computation can then be split into its own network consisting of monosemantic neurons. That means no superposition of computations, which in turn means a much more interpretable network.
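We can only gesture at the method here. As a loose illustration (not the paper’s actual pipeline, which also involves attribution scores and human judgment to decide what to trim), the basic move is to edit a model’s activations by zeroing out selected sparse-autoencoder features, reusing the SparseAutoencoder sketched above:

```python
import torch

def ablate_features(sae, acts, ablate_idx):
    """Zero out chosen SAE features and rebuild the activations without them."""
    recon, latents = sae(acts)
    latents = latents.clone()
    latents[:, ablate_idx] = 0.0     # drop features judged irrelevant to the task
    return sae.decoder(latents)      # edited activations to feed back into the model

# Hypothetical usage: trim three arbitrarily chosen features from a batch of activations.
acts = torch.randn(8, 512)
edited_acts = ablate_features(sae, acts, ablate_idx=[13, 907, 4021])
```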

One of the authors of the preprint, Sam Marks, wrote that this was "the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction", because of how it fits into research towards the aim of "evaluat[ing] models according to whether they’re performing intended cognition, instead of whether they have intended input/output behavior".


  1. This is a consequence of the Johnson-Lindenstrauss lemma. As this estimate doesn’t account for using the exponential number of features for useful computations, it is unclear if neural networks actually achieve this upper bound in practice. (The use of polysemanticity in computations is an active research area. For a model of how polysemanticity aids computations, see “Towards a Mathematical Framework for Computation in Superposition”.) What about lower bounds on how many concepts are packed into polysemantic neurons in real networks? Well, these estimates depend on assumptions about the training data, initialization of weights, etc. So it is hard to give a good lower bound in general. But for some cases, we do have estimates: e.g., “Incidental polysemanticity” notes that, depending on the ratio of concepts to neurons, the initialization process can lead to a constant fraction of polysemantic neurons. ↩︎

  2. Or, Sparse Human-Interpretable Feature Trimming ↩︎


