A new adventure: mechanistic interpretability

What if we could compose LLMs from reusable circuits?

It’s always bugged me that we can’t explain how large language models do what they do. That opacity makes the models difficult to trust, possibly unsafe to operate, and wasteful to train. If we could identify cognitive structures within a model, perhaps we could compose other models from them, avoiding much of the expense and haphazard nature of our current practice. This is the domain of mechanistic interpretability (MI).

I’ve started a new project to explore this: an attempt to discover circuits and other reusable cognitive structures in LLMs, using interactive tools that visualize what goes on inside a model as it runs.

Interactive inspection inside the minds of models

The project (as yet unnamed) is a livecoded, browser-based MI framework with which I hope to make the field more accessible to everyone. Inspired by TransformerLens and the work of Anthropic, I want to make model inspection easier to get started with and more interactive. I also want to help build the MI community, specifically through the sharing of results and the standardization of circuit expressions. I think we can bring about a new era in transformer-based AI through the composition of models from reusable circuits, rather than brute-force training on ever-larger sets of questionable data. We can also pursue increased symbiosis between MI and knowledge representation research.
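To make “circuit expressions” concrete, here is a purely hypothetical sketch of what a shareable, standardized circuit description might look like: plain data that any tool could exchange. The format, field names, and head indices are illustrative, not something the project has defined yet.

```python
# Purely hypothetical: one shape a standardized, shareable "circuit
# expression" could take. Field names and head indices are illustrative.
induction_circuit = {
    "name": "induction",
    "model": "gpt2",
    "components": [
        {"role": "previous-token head", "layer": 4, "head": 11},
        {"role": "induction head", "layer": 5, "head": 5},
    ],
    "claim": "completes repeated sequences: ... A B ... A -> B",
}
```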

With current tools, MI remains laborious. Setting up frameworks like TransformerLens is difficult, often requiring complex Python environments with extensive configuration. Access to powerful local GPUs is an absolute requirement. Sharing results with other researchers requires everyone to replicate a complex environment. Worst of all, experiments must be run as large batches of computation, and visualizations are static, making it difficult to develop intuition about model behavior.
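For concreteness, here is roughly what that batch workflow looks like in TransformerLens today (a standard usage sketch, not code from this project): you edit a script, run the whole computation, and inspect the resulting tensors after the fact.

```python
# A typical TransformerLens workflow today: a local Python process,
# one batch computation per experiment, static artifacts afterward.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # downloads weights locally

tokens = model.to_tokens("The Eiffel Tower is in")
logits, cache = model.run_with_cache(tokens)       # run once, cache everything

# Inspect one head's attention pattern from the cache:
# shape [batch, n_heads, query_pos, key_pos]
pattern = cache["pattern", 0][0, 3]
print(pattern.shape)

# Trying a different prompt, layer, or head means editing the script and
# re-running the batch; any visualization is a static plot built afterward.
```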

While there is a small server setup, the app itself runs in a web browser, with no front-end setup at all. The server and its GPUs need not be local, and the architecture lends itself to operation as a service for many researchers at once. Once started, the UI exposes everything as first-class objects that can be inspected, modified, and composed interactively; internal model structures become tangible. With the addition of web synchronization frameworks like Multisynq, multiple researchers can explore and share insights about the same live model.
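As a sketch of how thin that server side can be (hypothetical code, assuming a TransformerLens backend; the endpoint and its parameters are made up for illustration): the browser asks for activations on demand, and the model and its GPU live wherever the server runs.

```python
# Hypothetical sketch: a minimal server the browser UI could query live.
# The /attention endpoint and its parameters are illustrative, not the
# project's actual API.
from flask import Flask, jsonify, request
from transformer_lens import HookedTransformer

app = Flask(__name__)
model = HookedTransformer.from_pretrained("gpt2")  # loaded once, shared by all clients

@app.route("/attention")
def attention():
    prompt = request.args.get("prompt", "Hello world")
    layer = int(request.args.get("layer", 0))
    head = int(request.args.get("head", 0))
    tokens = model.to_tokens(prompt)
    _, cache = model.run_with_cache(tokens)
    # Return one head's attention pattern as plain JSON for the browser to render.
    pattern = cache["pattern", layer][0, head]
    return jsonify({
        "tokens": model.to_str_tokens(prompt),
        "pattern": pattern.tolist(),
    })

if __name__ == "__main__":
    app.run(port=8000)
```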

Let’s Collaborate

If you’re a mechanistic interpretability researcher, or are just interested in the topic, please contact me. I’d love to discuss how we might collaborate on this!
