Blog


TL;DR

  • Most current mechanistic interpretability work tries to understand models post hoc (e.g., with sparse autoencoders). We investigated whether restricting activations to the simplex could instead yield models that are both practical and more interpretable by design.
  • Points on the simplex have a privileged basis and a natural interpretation as probability distributions, so the hope was that vertices would correspond to relevant features and activations could be read as feature probabilities.
  • We explored several MLP variants constrained to the simplex: stochastic weight matrices, rescaled ReLU (normalize after ReLU), dimension-rescaled ReLU (scale by layer dimension), and a decaying variant that exponentially anneals the scale factor during training.
  • We analyzed the training obstacles and interpretability metrics for each variant. This project was done as part of SPAR (Feb-May 2025) and was later extended into work on simplex-constrained transformers and manifold neural blocks.
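As a rough illustration of the variants listed above, here is a minimal NumPy sketch. The function names and the exact form of the annealing schedule are our own assumptions, not the project's implementation; the common idea is that each layer's output is forced onto (or toward) the probability simplex.

```python
import numpy as np

def rescaled_relu(x, eps=1e-8):
    # Rescaled ReLU: apply ReLU, then normalize so activations sum to 1,
    # i.e. the output is a point on the probability simplex.
    z = np.maximum(x, 0.0)
    return z / (z.sum(axis=-1, keepdims=True) + eps)

def dim_rescaled_relu(x, eps=1e-8):
    # Dimension-rescaled ReLU: same normalization, then scale by the layer
    # dimension d so typical activation magnitudes stay near 1 rather than 1/d.
    d = x.shape[-1]
    return d * rescaled_relu(x, eps)

def decaying_rescaled_relu(x, step, total_steps, eps=1e-8):
    # Decaying variant (assumed schedule): exponentially anneal the scale
    # factor from d down to 1 over training, so the layer ends on the simplex.
    d = x.shape[-1]
    scale = d ** (1.0 - step / total_steps)
    return scale * rescaled_relu(x, eps)

def row_stochastic(W):
    # Stochastic weight matrix (one possible parameterization): softmax each
    # row, so a simplex point times W is again a simplex point.
    e = np.exp(W - W.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

For example, `rescaled_relu(np.array([1.0, -2.0, 3.0]))` returns a nonnegative vector summing to 1, and multiplying any such vector by `row_stochastic(W)` keeps it on the simplex.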