
Global explanations in LLMs

Global explanations aim to offer insight into the inner workings of an LLM by examining what its individual components have encoded. These components could be neurons, hidden layers, or even larger modules. In this post, we will look at four main families of methods: probing, neuron activation analysis, concept-based methods, and mechanistic interpretability.

Probing-based methods

During self-supervised pre-training, LLMs acquire broad linguistic knowledge from their training data. Probing techniques let us examine what knowledge a model has actually captured. There are two kinds of probing.

The first kind, called classifier-based probing, fits a shallow classifier on top of the frozen representations of a pre-trained or fine-tuned model; if the classifier can predict a linguistic property from those representations, the model is assumed to have encoded that property.
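
To make this concrete, below is a minimal sketch of a classifier-based probe, assuming the Hugging Face transformers and scikit-learn libraries. The model name, the probed layer, and the tiny "mentions an animal" task are illustrative choices of mine, not prescriptions from the survey.

```python
# Sketch of classifier-based probing: freeze a pre-trained LM, extract hidden
# states for labelled sentences, and fit a shallow classifier on top.
# Model name, layer index, and the toy dataset are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "bert-base-uncased"          # any encoder LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Toy probing task: does the sentence mention an animal?
sentences = ["The cat sat on the mat.", "Stocks fell sharply today.",
             "A dog barked at the mailman.", "The meeting was postponed."]
labels = [1, 0, 1, 0]

with torch.no_grad():
    enc = tokenizer(sentences, padding=True, return_tensors="pt")
    out = model(**enc, output_hidden_states=True)
    # Use the [CLS] representation from a middle layer as probe features
    layer = 6
    features = out.hidden_states[layer][:, 0, :].numpy()

# The probe is deliberately shallow; high accuracy suggests the property is
# linearly decodable from the frozen representations.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("Probe training accuracy:", probe.score(features, labels))
```

In practice the probe is evaluated on held-out data, and its accuracy is compared against control tasks to rule out the classifier learning the property on its own.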

The second kind, called parameter-free probing, is data-centric and does not require a probing classifier. Instead, datasets are designed around a property of interest (for example, minimal pairs of grammatical and ungrammatical sentences), and we check whether the language model assigns higher probability to the text that is consistent with that property. See the sketch below.
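
Here is a minimal sketch of parameter-free probing with a minimal pair. The choice of GPT-2 and the subject–verb agreement example are my own illustrative assumptions.

```python
# Sketch of parameter-free probing with minimal pairs: no probe classifier is
# trained; we only compare the language model's likelihood of an acceptable
# sentence against a minimally different unacceptable one.
# Model name and the example pair are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Summed token log-probabilities under the frozen LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss       # mean negative log-likelihood
    return -loss.item() * (ids.shape[1] - 1)     # convert back to a summed score

good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print("acceptable  :", sentence_log_likelihood(good))
print("unacceptable:", sentence_log_likelihood(bad))
# If the model has captured subject-verb agreement, the first score should be higher.
```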

Data-driven prompt search is another technique, in which specific knowledge is examined through the language model's text generation or completion capabilities.
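
For example, cloze-style prompts can query factual knowledge directly through the model's completion ability. The sketch below uses the transformers fill-mask pipeline; the model and prompts are illustrative assumptions on my part.

```python
# Sketch of prompt-based knowledge probing: query factual knowledge through
# the model's completion ability instead of training a probe.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

prompts = [
    "The capital of France is [MASK].",
    "Water is made of hydrogen and [MASK].",
]
for prompt in prompts:
    top = fill(prompt, top_k=1)[0]
    print(f"{prompt}  ->  {top['token_str']} (score={top['score']:.3f})")
```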

Neuron Activation Explanation

This family of techniques studies the activation patterns of individual neurons to understand what they encode. A simple approach is to first identify the important neurons and then learn the relationship between linguistic properties and those neurons, thereby revealing the role specific neurons play in learning and processing linguistic properties.
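
As a rough sketch of the first step, one can rank the neurons in a chosen layer by how differently they activate across two contrasting groups of inputs. The model, layer, and toy sentiment contrast below are illustrative assumptions.

```python
# Sketch of neuron activation analysis: rank hidden units in one layer by how
# differently they fire on two groups of inputs (here: positive vs. negative
# sentences). Model, layer, and the toy sentences are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(model_name)
model.eval()

positive = ["What a wonderful, delightful movie.", "I absolutely loved this."]
negative = ["This was a terrible, boring movie.", "I absolutely hated this."]

def mean_layer_activation(sentences, layer=6):
    """Mean activation per hidden unit, averaged over tokens and sentences."""
    enc = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hs = model(**enc, output_hidden_states=True).hidden_states[layer]
    mask = enc.attention_mask.unsqueeze(-1)
    return (hs * mask).sum(dim=(0, 1)) / mask.sum()

diff = mean_layer_activation(positive) - mean_layer_activation(negative)
top = torch.topk(diff.abs(), k=5)
print("Hidden units most sensitive to the contrast:", top.indices.tolist())
```

The units surfaced this way are only candidates; establishing that they actually encode the property requires further checks, for example ablating them and measuring the effect on behaviour.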

Concept-Based Explanation

Concept-based interpretability algorithms such as Testing with Concept Activation Vectors (TCAV) map the inputs to a set of higher-level, human-understandable concepts and explain the model's predictions in terms of those predefined concepts.
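
The sketch below illustrates the mechanics of TCAV on a toy network with synthetic data: learn a concept activation vector as a linear direction separating concept examples from random examples in activation space, then measure how often moving along that direction increases the logit of the class being explained. With an LLM, the activations would come from a chosen transformer layer instead; the model, concept examples, and target class here are all illustrative.

```python
# Minimal sketch of TCAV mechanics on a toy network and synthetic data.
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)

# Toy classifier: the bottom half produces the "layer" activations we probe.
bottom = nn.Sequential(nn.Linear(10, 16), nn.ReLU())
top = nn.Linear(16, 2)

def activations(x):
    return bottom(x)

# 1. Learn a Concept Activation Vector: a linear direction separating concept
#    examples from random examples in activation space.
concept_inputs = torch.randn(50, 10) + torch.tensor([2.0] + [0.0] * 9)
random_inputs = torch.randn(50, 10)
with torch.no_grad():
    acts = torch.cat([activations(concept_inputs), activations(random_inputs)])
labels = [1] * 50 + [0] * 50
cav = torch.tensor(
    LogisticRegression(max_iter=1000).fit(acts.numpy(), labels).coef_[0],
    dtype=torch.float32,
)

# 2. TCAV score: fraction of class examples whose logit increases when the
#    layer activation moves in the concept direction.
class_inputs = torch.randn(100, 10)
acts = activations(class_inputs)
acts.retain_grad()
logit = top(acts)[:, 1].sum()        # logit of the class being explained
logit.backward()
sensitivities = acts.grad @ cav
tcav_score = (sensitivities > 0).float().mean().item()
print("TCAV score for the concept:", tcav_score)
```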

The difficulty of defining useful concepts and the need to collect labelled examples for each concept are the main challenges in applying concept-based explanations.

Mechanistic Interpretability

Mechanistic interpretability focuses on dissecting the complex network of neurons and their interconnections, aiming to understand how specific components of the model contribute causally to its behaviour and outputs.
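
A common tool in this area is activation patching, a causal intervention in which an activation from a "clean" run is copied into a "corrupted" run to measure how much of the original behaviour it restores. The sketch below assumes GPT-2 and prompts in the style of the indirect-object-identification task; the patched block and position are illustrative choices, not the survey's prescription.

```python
# Sketch of activation patching: copy one block's activation (at the position
# that differs) from a clean run into a corrupted run and check how much of the
# original prediction is restored. Model, prompts, and block are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

clean = "When John and Mary went to the store, John gave a drink to"
corrupt = "When John and Mary went to the store, Mary gave a drink to"
target = tokenizer(" Mary").input_ids[0]   # token whose log-probability we track

clean_ids = tokenizer(clean, return_tensors="pt").input_ids
corrupt_ids = tokenizer(corrupt, return_tensors="pt").input_ids
assert clean_ids.shape == corrupt_ids.shape            # prompts differ by one token
pos = (clean_ids != corrupt_ids).nonzero()[0, 1].item()  # the differing position

block = model.transformer.h[6]             # the component we intervene on

def target_logprob(ids, hook=None):
    """Log-probability of the target token at the final position."""
    handle = block.register_forward_hook(hook) if hook else None
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    if handle:
        handle.remove()
    return torch.log_softmax(logits, dim=-1)[target].item()

# Cache block 6's output on the clean run ...
cached = {}
def cache_hook(module, inputs, output):
    cached["act"] = output[0].detach()

# ... and patch it into the corrupted run at the differing position only.
def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, pos] = cached["act"][:, pos]
    return (hidden,) + output[1:]

print("clean    :", target_logprob(clean_ids, cache_hook))
print("corrupted:", target_logprob(corrupt_ids))
print("patched  :", target_logprob(corrupt_ids, patch_hook))
# If block 6 at that position carries the relevant information, patching should
# move the corrupted run's score for ' Mary' back toward the clean run's.
```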

In this post, we looked at a few techniques for generating global explanations for LLMs. These are my notes from the following paper:

H. Zhao et al., “Explainability for Large Language Models: A Survey,” ACM Trans. Intell. Syst. Technol., vol. 15, no. 2, p. 20:1-20:38, Feb. 2024, doi: 10.1145/3639372.

#Deep-Learning