Finding rare features is cool because it lets us see when they appear and how often.

  1. Finding Rare Features: is_rare = freqs < 1e-4

    Identifies the features that are rare based on a frequency threshold of 1×10⁻⁴ (1e-4 in the code).

  2. Slicing Encoder Weights: rare_enc = encoder.W_enc[:, is_rare]

    Takes the columns of encoder.W_enc that correspond to these rare features.

  3. Mean of Rare Features: rare_mean = rare_enc.mean(-1)

    Averages the selected columns over the feature dimension (the last axis), giving a single vector with one entry per row of encoder.W_enc, i.e. a direction in the encoder's input space.

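  4. Cosine Similarity Calculation: rare_mean @ encoder.W_enc / rare_mean.norm() / encoder.W_enc.norm(dim=0)

    Takes the dot product of rare_mean with every column of encoder.W_enc and divides by both norms, giving one cosine similarity per feature.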

  5. Histogram: px.histogram(...)
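
Here is a minimal sketch of the five steps on synthetic data, in case it helps to see them run end to end. It uses NumPy rather than PyTorch so that it works without a trained model. The names W_enc and freqs, the sizes, and the shared direction planted in the rare columns are all assumptions made up for illustration, not values from your encoder. np.histogram stands in for the px.histogram call, whose arguments are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for encoder.W_enc and freqs (not real model values).
d_in, n_features = 16, 200
W_enc = rng.normal(size=(d_in, n_features))        # one column per feature
freqs = rng.uniform(1e-3, 1e-1, size=n_features)   # firing frequency per feature

# Assumption for the toy: make the first 20 features rare and give them
# a shared direction, so the alignment is visible in the result.
freqs[:20] = 1e-6
shared = rng.normal(size=d_in)
W_enc[:, :20] += 5.0 * shared[:, None]

# 1. Boolean mask of rare features.
is_rare = freqs < 1e-4

# 2. Keep only the encoder columns of the rare features.
rare_enc = W_enc[:, is_rare]                       # shape (d_in, n_rare)

# 3. Average over the feature axis to get one direction of length d_in.
rare_mean = rare_enc.mean(-1)

# 4. Cosine similarity between that direction and every feature's column.
#    (torch's .norm(dim=0) corresponds to np.linalg.norm(..., axis=0).)
cos_sims = rare_mean @ W_enc / np.linalg.norm(rare_mean) / np.linalg.norm(W_enc, axis=0)

# 5. Bin the similarities, as a text-only substitute for the histogram plot.
counts, edges = np.histogram(cos_sims, bins=20, range=(-1, 1))

print("rare features:", int(is_rare.sum()))
print("mean cosine sim (rare):    ", float(cos_sims[is_rare].mean()))
print("mean cosine sim (non-rare):", float(cos_sims[~is_rare].mean()))
```

In this toy setup the rare features score much higher than the rest only because a shared direction was planted in them. On a real encoder, whether the histogram shows a separate cluster near 1 is exactly what the plot is there to reveal.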

Interpretation

The code checks how closely each feature's encoder direction aligns with the average direction of the 'rare' features. It measures that alignment with cosine similarity and plots the distribution as a histogram, which can show whether the rare features point in a common direction or are spread out like the rest of the features.

Given your research interests in AI interpretability, understanding how 'rare' features align with the overall feature set could provide valuable insights into the workings of the model and what it finds important or unimportant.

Would you like to delve into any specific aspect of this code or its implications further?