Language models can explain neurons in language models


Although the vast majority of our explanations score poorly, we believe we can now use ML techniques to further improve our ability to produce explanations. For example, we found we were able to improve scores by:

  • Iterating on explanations. We can increase scores by asking GPT-4 to come up with possible counterexamples, then revising explanations in light of their activations (see the sketch after this list).
  • Using larger models to give explanations. The average score rises as the capability of the explainer model increases. However, even GPT-4 gives worse explanations than humans, suggesting room for improvement.
  • Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.

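As a rough illustration of the iteration loop mentioned above, the sketch below asks GPT-4 for candidate counterexample snippets, checks their actual activations on the subject model, and then asks GPT-4 to revise the explanation. The prompt wording, the model name, and the `get_activations` helper are assumptions for illustration, not the exact prompts or code used in our experiments.

```python
# Minimal sketch of the "iterate on explanations" loop (illustrative only).
# Assumption: get_activations(text) returns per-token activations of the neuron
# being explained when the subject model is run on `text`.
from openai import OpenAI

client = OpenAI()

def revise_explanation(explanation: str, get_activations) -> str:
    # Ask GPT-4 for short snippets that might contradict the current explanation.
    counterexamples = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"A neuron is explained as: '{explanation}'. "
                       "Write three short text snippets that would test whether "
                       "this explanation is wrong, one per line.",
        }],
    ).choices[0].message.content.splitlines()

    # Run the candidate snippets through the subject model to get real activations.
    evidence = [(text, get_activations(text)) for text in counterexamples if text.strip()]

    # Ask GPT-4 to revise the explanation in light of the observed activations.
    revised = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Original explanation: '{explanation}'. "
                       f"Observed (snippet, activations) pairs: {evidence}. "
                       "Revise the explanation so it better matches the activations.",
        }],
    ).choices[0].message.content
    return revised
```
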
We are open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring that uses publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.

We found more than 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron's top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 did not understand. We hope that as explanations improve, we may be able to rapidly uncover interesting qualitative understanding of the model's computations.
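For context on what a score of 0.8 means: scoring works by having GPT-4 simulate the neuron's activations from the explanation alone and comparing the simulation against the neuron's real activations. Below is a minimal sketch of that comparison step, assuming the real and simulated activations are already available as arrays; the simulation step itself is omitted.

```python
import numpy as np

def explanation_score(real: np.ndarray, simulated: np.ndarray) -> float:
    """Correlation between real and GPT-4-simulated activations.

    A score near 1.0 means the explanation predicts the neuron's behavior well;
    0.8 or higher indicates it captures most of the top-activating behavior.
    """
    # Flatten per-token activations across all text excerpts before correlating.
    real, simulated = real.ravel(), simulated.ravel()
    # Guard against degenerate cases where one series is constant.
    if real.std() == 0 or simulated.std() == 0:
        return 0.0
    return float(np.corrcoef(real, simulated)[0, 1])
```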


