Mon. Dec 23rd, 2024
Language Models Can Explain Neurons in Language Models

Although most of our explanations score poorly, we believe we can use ML techniques to further improve our ability to produce explanations. For example, we found that we could improve scores by:

  • Iterating on explanations. We can increase scores by asking GPT-4 to come up with possible counterexamples, then revising the explanation to account for their activations.
  • Using larger models to give explanations. The average score rises as the explainer model's capabilities increase. However, even GPT-4 gives worse explanations than humans do, suggesting room for improvement.
  • Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.
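The first bullet, iterating on an explanation with counterexamples, is essentially a refine-and-rescore loop. The sketch below shows that control flow; `ask_model` and `score_explanation` are hypothetical stand-ins (stubbed here so the loop runs), not the released pipeline's actual functions.

```python
def ask_model(prompt):
    # Stand-in for a chat-completion call to an explainer model like GPT-4.
    return "revised: " + prompt[:40]

def score_explanation(explanation, activations):
    # Stand-in scorer; the real pipeline simulates activations from the
    # explanation and compares them with the neuron's true activations.
    return min(1.0, 0.1 * len(activations))

def iterate_explanation(explanation, activations, rounds=3):
    """Each round: ask for counterexamples, revise, keep the best-scoring version."""
    best, best_score = explanation, score_explanation(explanation, activations)
    for _ in range(rounds):
        counterexamples = ask_model(f"Give counterexamples for: {best}")
        revised = ask_model(f"Revise '{best}' given: {counterexamples}")
        revised_score = score_explanation(revised, activations)
        if revised_score > best_score:
            best, best_score = revised, revised_score
    return best, best_score
```

With real model calls in place of the stubs, each round either keeps or improves the current explanation's score.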

We are open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.
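To make the scoring idea concrete: an explanation is scored by how well activations simulated from it match the neuron's real per-token activations. The sketch below uses a trivial keyword matcher as the simulator and Pearson correlation as the score; this is an illustrative assumption, not the released scoring code.

```python
def simulate(explanation, tokens):
    """Predict one activation per token: 1.0 if the token appears in the
    explanation text, else 0.0 (a toy stand-in for a GPT-4 simulator)."""
    words = set(explanation.lower().split())
    return [1.0 if t.lower() in words else 0.0 for t in tokens]

def pearson(xs, ys):
    """Pearson correlation, with 0.0 returned for constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def score(explanation, tokens, real_activations):
    """Score = correlation between simulated and real activations."""
    return pearson(simulate(explanation, tokens), real_activations)
```

An explanation that perfectly predicts which tokens activate the neuron scores 1.0 under this measure; an uninformative one scores near 0.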

We found more than 1,000 neurons with explanations scoring at least 0.8, meaning that, according to GPT-4, the explanation accounts for most of the neuron's top-activating behavior. Most of these well-explained neurons are not very interesting, but we also found many interesting neurons that GPT-4 does not yet understand. We hope that as our explanations improve, we will be able to rapidly uncover interesting qualitative understanding of model computations.
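Selecting those well-explained neurons from the released scores is a simple threshold filter. A minimal sketch, assuming a record layout with `layer`, `neuron`, and `score` fields (the actual released schema may differ):

```python
def well_explained(records, threshold=0.8):
    """Return (layer, neuron) ids whose explanation score clears the threshold."""
    return [(r["layer"], r["neuron"]) for r in records if r["score"] >= threshold]

# Hypothetical example records, not taken from the released dataset.
scores = [
    {"layer": 0, "neuron": 13, "score": 0.92},
    {"layer": 5, "neuron": 131, "score": 0.41},
    {"layer": 11, "neuron": 7, "score": 0.85},
]
```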