
Finding rare features is useful because it tells us when they appear and how often.

Taking the mean of the rare features is useful because it summarizes what a typical rare feature looks like.
Cosine similarity: compare the average rare feature with every feature in the encoder. This measures how similar the rare features are to the regular features.

Finding Rare Features: is_rare = freqs < 1e-4

Identifies the features that are rare based on a frequency threshold of 1×10⁻⁴.
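As a small illustration of the thresholding step, here is a sketch in NumPy rather than the PyTorch used in the snippet. The values in freqs are made up for the example; only the comparison against 1e-4 comes from the original code.

```python
import numpy as np

# Hypothetical firing frequencies for six features
# (fraction of tokens on which each feature is active).
freqs = np.array([3e-2, 5e-6, 1e-3, 9e-5, 2e-1, 1e-4])

# Boolean mask: True where a feature fires on fewer than 1 in 10,000 tokens.
# The comparison is strict, so a frequency of exactly 1e-4 is not rare.
is_rare = freqs < 1e-4

n_rare = int(is_rare.sum())          # how many features are rare
rare_fraction = float(is_rare.mean())  # what share of all features that is
```

Summing or averaging the mask is a quick way to see how much of the dictionary falls below the threshold.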
Slicing Encoder Weights: rare_enc = encoder.W_enc[:, is_rare]

Takes the columns of encoder.W_enc that correspond to these rare features.
Mean of Rare Features: rare_mean = rare_enc.mean(-1)

Computes the mean of these rare feature weights across the feature dimension (the last axis), resulting in a single vector with one entry per row of encoder.W_enc, i.e. a direction in the encoder's input space.
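The shapes in the slicing and averaging steps are easy to mix up, so here is a minimal sketch that tracks them. It uses NumPy instead of PyTorch, and the sizes d_in and n_features, the random W_enc, and the mask are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, n_features = 4, 6  # hypothetical sizes, not from the original code

# Stand-in for encoder.W_enc: one column per feature.
W_enc = rng.normal(size=(d_in, n_features))
is_rare = np.array([False, True, False, True, False, False])

# Boolean indexing on the second axis keeps only the rare columns.
rare_enc = W_enc[:, is_rare]    # shape (d_in, n_rare) = (4, 2)

# Averaging over the last axis collapses the rare columns into one vector.
rare_mean = rare_enc.mean(-1)   # shape (d_in,) = (4,)
```

With two rare features, rare_mean is simply the average of columns 1 and 3 of W_enc.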
Cosine Similarity Calculation: rare_mean @ encoder.W_enc / rare_mean.norm() / encoder.W_enc.norm(dim=0)
- rare_mean @ encoder.W_enc: Dot product between the mean of rare features and all the encoder weights. This is essentially measuring how aligned the average rare feature is with each feature in the encoder.
- The division by norms converts it to cosine similarity.
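To check that the expression really is a cosine similarity, here is a sketch with a tiny hand-made matrix whose answers are easy to verify by eye. It is written in NumPy, so np.linalg.norm(W, axis=0) plays the role of W_enc.norm(dim=0); the helper name cosine_to_columns and the example values are hypothetical.

```python
import numpy as np

def cosine_to_columns(v, W):
    """Cosine similarity between vector v (d,) and each column of W (d, n)."""
    # v @ W gives one dot product per column; dividing by both norms
    # rescales each dot product into the range [-1, 1].
    return v @ W / np.linalg.norm(v) / np.linalg.norm(W, axis=0)

# Columns: parallel to v, orthogonal to v, anti-parallel to v, 45 degrees from v.
W = np.array([[1.0, 0.0, -2.0, 1.0],
              [0.0, 3.0,  0.0, 1.0]])
v = np.array([1.0, 0.0])

cos = cosine_to_columns(v, W)  # approximately [1, 0, -1, 0.7071]
```

Note that the column lengths (1, 3, 2, and √2) do not affect the result, which is the point of dividing by the norms.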
Histogram: px.histogram(...)
- Plots a histogram of the calculated cosine similarities.
- The color=utils.to_numpy(is_rare) argument splits the data into two colored groups, rare and non-rare, so each group gets its own histogram.
- marginal='box' adds a box plot above the histogram, summarizing the median, quartiles, and outliers of each group.
- histnorm="percent" shows bar heights as percentages rather than raw counts, and barmode='overlay' draws the two groups on top of each other instead of stacking them.
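The effect of percent normalization with two overlaid groups can be sketched without a plotting library. The example below uses np.histogram and assumes that each color group is normalized on its own, so a small rare group is not dwarfed by a much larger non-rare group. The similarity values, the mask, and the helper percent_hist are made up for illustration.

```python
import numpy as np

def percent_hist(values, bins):
    """Bin counts rescaled so that the bars of one group sum to 100."""
    counts, _ = np.histogram(values, bins=bins)
    return 100.0 * counts / counts.sum()

# Hypothetical cosine similarities and rarity mask.
cos_sims = np.array([0.9, 0.8, 0.85, 0.1, -0.1, 0.05, 0.0, 0.2])
is_rare = np.array([True, True, True, False, False, False, False, False])

# Shared bin edges so the two groups can be overlaid: [-1, -0.5, 0, 0.5, 1].
bins = np.linspace(-1.0, 1.0, 5)

rare_pct = percent_hist(cos_sims[is_rare], bins)     # [0, 0, 0, 100]
common_pct = percent_hist(cos_sims[~is_rare], bins)  # [0, 20, 80, 0]
```

In this toy data every rare feature lands in the top bin while the non-rare features sit near zero, which is the kind of separation the overlaid histogram is meant to reveal.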

Interpretation

The code explores how the 'rare' features relate to the full feature set by measuring how closely each encoder feature aligns with the average rare feature. Plotting the cosine similarities separately for rare and non-rare features shows whether the rare features cluster around a shared direction: if the rare group sits near 1 while the rest sit near 0, the rare features are largely pointing the same way.

Given your research interests in AI interpretability, understanding how 'rare' features align with the overall feature set could provide valuable insights into the workings of the model and what it finds important or unimportant.
