Full Summary
Over the past 12 weeks I have participated in the AI Safety Fundamentals Alignment course and completed an AI Alignment-focused capstone project. My project explored concepts from the field of Mechanistic Interpretability: I created a tutorial for the SAE Lens GitHub repository that teaches you how to use a sparse autoencoder (SAE) to create a steering vector and influence the responses a model generates.
I had two main aims while completing this project:
Deepen my Experience: I chose a topic I found intriguing during the course and aimed to deepen my understanding of it through hands-on work.
Create a Public Good: I wanted to contribute something useful for others interested in AI safety, like a tutorial that could aid their own learning.
Key Outcome:
The tutorial is available as part of the SAE Lens tutorial list.
Project Overview
Mechanistic Interpretability is a field of AI Alignment research that aims to understand the inner workings of neural networks, which are often seen as mysterious 'black boxes'. Over the last year there has been a wave of new research on sparse autoencoders (SAEs), a tool that helps us identify the features a model has learned. For more detail on SAEs, I recommend this paper by Anthropic.
Here, we use SAEs to target a feature of a model. There are many definitions of what a feature is, but I'm going to define it as a fundamental, interpretable unit into which a neural network's computation can be decomposed. For us, this means a "Jedi" feature, which has been interpreted as "phrases related to the Jedi Master from the Star Wars universe".
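To make the notion of a feature more concrete, here is a minimal, illustrative sketch in plain PyTorch of how an SAE decomposes a model activation into sparse feature activations and reconstructs it from per-feature decoder directions. The class, dimensions, and initialisation are made up for illustration (the sizes roughly match GPT-2 Small and the Res-JB SAEs), and the real training objective is omitted; this is not the SAE Lens implementation.

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Toy sparse autoencoder: activation -> sparse feature activations -> reconstruction."""
    def __init__(self, d_model=768, d_sae=24576):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)  # one direction per feature
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        # ReLU keeps only positively-activating features, giving a sparse code
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, feature_acts):
        # Reconstruction is a weighted sum of decoder directions, one per feature
        return feature_acts @ self.W_dec + self.b_dec

sae = TinySAE()
activation = torch.randn(1, 768)           # a residual-stream activation from the model
feature_acts = sae.encode(activation)      # mostly zeros; the non-zero entries are "features"
reconstruction = sae.decode(feature_acts)  # approximately the original activation
```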
Using a pre-trained version of the GPT-2 Small model and an SAE set trained for this model (Res-JB), I took a simple prompt, the word "Jedi", and used it to construct a steering vector. A steering vector acts as a modifier that lets us influence the model's responses. Here I have made the model talk about Jedi when asked "What do we find in space?".
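As a rough idea of the setup, the snippet below loads GPT-2 Small with TransformerLens and a layer-2 residual-stream SAE from the Res-JB set with SAE Lens. The release and SAE ID strings, and the return signature of `SAE.from_pretrained`, may differ between sae_lens versions, so treat this as a sketch and refer to the tutorial notebook for the exact code.

```python
from transformer_lens import HookedTransformer
from sae_lens import SAE

# GPT-2 Small via TransformerLens
model = HookedTransformer.from_pretrained("gpt2")

# Res-JB: SAEs trained on GPT-2 Small's residual stream; here, layer 2.
# Depending on the sae_lens version, from_pretrained may return just the SAE
# or a (sae, cfg_dict, sparsity) tuple as shown here.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.2.hook_resid_pre",
)
```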
Process:
These are the main steps I took for this project.
Identify Features: First, I retrieved the SAE's feature activations for the prompt. By looking at the top feature activations for the tokens in the simple prompt "Jedi", I could then look up each feature's index using Neuronpedia, an open platform for interpretability research. This allowed me to find my feature of interest at layer 2 of the residual stream.
Create a Steering Vector: I then built the steering vector from the SAE's decoder weights at the index of this feature (7650), and applied it through a hook on the model during subsequent text generation.
Affect the Model's Output: Applying the steering vector let me influence responses to general prompts so that they focus on discussion of Jedi and Star Wars (see image below). The steering vector is scaled by a coefficient that controls how strongly it influences the model. An extremely high coefficient (~1000) made the model repeat words related to Jedi and Star Wars over and over. A condensed code sketch of these steps is shown below.
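For readers who want to see how the three steps fit together, here is a condensed sketch of the pipeline. The feature index (7650) and the layer-2 hook point come from the write-up above; the coefficient value, the number of top features printed, and the generation settings are illustrative choices of mine, and the exact code in the tutorial may differ.

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

# Setup as in the earlier snippet (API details may vary by sae_lens version).
model = HookedTransformer.from_pretrained("gpt2")
sae, _, _ = SAE.from_pretrained(release="gpt2-small-res-jb", sae_id="blocks.2.hook_resid_pre")
hook_name = "blocks.2.hook_resid_pre"  # layer-2 residual stream, matching the Res-JB SAE

# 1. Identify features: run the prompt, cache the layer-2 residual stream,
#    and encode it with the SAE to see which features fire most strongly.
prompt = "Jedi"
_, cache = model.run_with_cache(prompt)
feature_acts = sae.encode(cache[hook_name])          # SAE feature activations per token
top_features = torch.topk(feature_acts[0, -1], k=5).indices
print(top_features)  # look these indices up on Neuronpedia to interpret them

# 2. Create a steering vector: the decoder direction for the "Jedi" feature.
feature_index = 7650
steering_vector = sae.W_dec[feature_index]           # shape: (d_model,)
coefficient = 300.0                                  # illustrative value; scales the steering strength

# 3. Affect the model's output: add the scaled vector to the residual stream via a hook.
def steering_hook(resid_activations, hook):
    return resid_activations + coefficient * steering_vector

with model.hooks(fwd_hooks=[(hook_name, steering_hook)]):
    output = model.generate("What do we find in space?", max_new_tokens=50)
print(output)
```

Raising the coefficient strengthens the steering effect; as noted above, pushing it to around 1000 makes the model repeat Jedi- and Star Wars-related words over and over.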
For a detailed walkthrough, view the SAE Lens tutorial or my Colab notebook.
How is this project useful?
My main hope is that this blog post and tutorial will inspire other people to start engaging with project work in AI Safety, just as this project did for me. The work here is not novel; however, I believe it provides a good entry point for newcomers. The tutorial offers a clear way to engage with the concepts of steering vectors, LLMs, and SAEs, as well as with the code packages used in interpretability research. I think creating your own tutorial is a great way to replicate research or concepts while teaching yourself and others.
For me, completing this project has been a chance to dive into a beginner-level piece of investigative work within AI Alignment, making use of the skills I have been learning while self-studying the ARENA curriculum and taking part in the AI Safety Fundamentals course. It required me to work through a considerable amount of the ARENA curriculum that I had not yet completed. The project has also given me a proper insight into what this kind of work can entail (albeit at a simpler level than day-to-day research). I am now considering replicating more complex research or tackling a novel research problem.
Project work in AI Safety does require a non-trivial amount of prior knowledge. ARENA (specifically the introductory parts of Chapter 1 and then its SAE section) is the main resource I used to advance my skill set. I suggest working your way through this course material before attempting to create your own tutorial.
For more information on SAEs and steering vectors
Mechanistic interpretability? SAEs? Steering vectors? Some of you may be wondering what I'm writing about here. These are complex areas, too in-depth to cover fully in this post and better explained by experts in the field. Instead, I recommend starting with this explainer on mechanistic interpretability and this paper on SAEs for more context.
Recent research that influenced this project, and which you might find interesting, includes the latest update from Google DeepMind’s Mechanistic Interpretability team, Joseph Bloom’s open-source research on Sparse Autoencoders, and Anthropic’s extensive investigation into the use of SAEs with Claude 3 Sonnet.
Acknowledgements
I'd like to thank Joseph Bloom, owner of the SAE Lens package, who provided guidance and feedback on the project idea and on adding it to SAE Lens as a tutorial. If you're interested in the package, view it here: link
I’d also like to thank my mentors and cohort members from the AI Safety Fundamentals coursework, who provided a great space for open discussion on AI topics and the project work.