Finding rare features is useful because it tells us which features appear only occasionally and how often they do so.
Finding Rare Features: is_rare = freqs < 1e-4
Identifies the features that are rare based on a frequency threshold of 1×10⁻⁴.
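For context, here is a minimal sketch of how freqs and is_rare might be produced; the feature_acts tensor, its shape, and the placeholder data are assumptions for illustration, not part of the original code.

```python
import torch

# Sketch only: `feature_acts` stands in for autoencoder feature activations
# collected over a sample of tokens (hypothetical name and placeholder values).
feature_acts = torch.rand(10_000, 2048)          # [n_tokens, n_features]

# Frequency = fraction of tokens on which each feature fires (activation > 0).
freqs = (feature_acts > 0).float().mean(dim=0)   # [n_features]

# A feature counts as "rare" if it fires on fewer than 1 in 10,000 tokens.
is_rare = freqs < 1e-4                           # boolean mask, [n_features]
print(f"{is_rare.sum().item()} rare features out of {is_rare.numel()}")
```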
Slicing Encoder Weights: rare_enc = encoder.W_enc[:, is_rare]
Takes the columns of encoder.W_enc that correspond to these rare features.
Mean of Rare Features: rare_mean = rare_enc.mean(-1)
Computes the mean of these rare-feature weights across the feature dimension, giving a single vector in the hidden-unit space: the average rare-feature direction.
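Putting the slicing and averaging steps together, a self-contained sketch might look like the following; the shape [d_in, n_features] for encoder.W_enc (one column per learned feature) and the placeholder tensors are assumptions.

```python
import torch

# Placeholder stand-ins for encoder.W_enc and the rare mask from the step above.
d_in, n_features = 512, 2048
W_enc = torch.randn(d_in, n_features)            # assumed layout: one column per feature
is_rare = torch.rand(n_features) < 0.01          # placeholder rare mask

rare_enc = W_enc[:, is_rare]    # [d_in, n_rare]: keep only the rare-feature columns
rare_mean = rare_enc.mean(-1)   # [d_in]: average rare-feature direction
print(rare_enc.shape, rare_mean.shape)
```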
Cosine Similarity Calculation: rare_mean @ encoder.W_enc / rare_mean.norm() / encoder.W_enc.norm(dim=0)
rare_mean @ encoder.W_enc: dot product between the average rare-feature direction and every column of the encoder, measuring how aligned the average rare feature is with each feature in the encoder. Dividing by rare_mean.norm() and encoder.W_enc.norm(dim=0) normalizes each dot product into a cosine similarity.
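As a sanity check, the same cosine-similarity line can be reproduced with placeholder tensors and compared against PyTorch's built-in cosine_similarity; the shapes used here are assumptions.

```python
import torch
import torch.nn.functional as F

# Placeholder stand-ins for encoder.W_enc and rare_mean.
d_in, n_features = 512, 2048
W_enc = torch.randn(d_in, n_features)
rare_mean = torch.randn(d_in)

# Dot product of the average rare direction with every encoder column,
# then divide by both norms -> one cosine similarity per feature.
cos_sims = rare_mean @ W_enc / rare_mean.norm() / W_enc.norm(dim=0)   # [n_features]

# Cross-check against PyTorch's built-in cosine similarity.
reference = F.cosine_similarity(rare_mean[:, None].expand_as(W_enc), W_enc, dim=0)
assert torch.allclose(cos_sims, reference, atol=1e-5)
```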
Histogram: px.histogram(...)
color=utils.to_numpy(is_rare): splits the histogram into two colored groups according to whether each feature is rare or not.
marginal='box': adds a box plot above the histogram for extra summary statistics.
histnorm="percent", barmode='overlay': normalizes each group to percentages and overlays the rare and non-rare distributions rather than stacking them.
Overall, the code explores how 'rare' features relate to the full feature set in terms of their alignment in the hidden space. It uses cosine similarity to plot this relationship, giving insight into whether the rare features cluster around a shared direction and how they relate to the rest of the features.
Given your research interests in AI interpretability, understanding how 'rare' features align with the overall feature set could provide valuable insights into the workings of the model and what it finds important or unimportant.
Would you like to delve into any specific aspect of this code or its implications further?