What is the MIT Algorithmic Alignment Group's research agenda?

The Algorithmic Alignment Group is a research group at MIT that aims to "better align the development of AI with human interests and values." It is part of the Embodied Intelligence group within MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL).

The group's alignment research focuses on prosaic alignment of existing AI systems and on interpretability. As of August 2023, its recent papers include:

  • "Benchmarking Interpretability Tools for Deep Neural Networks", which proposes "trojan rediscovery" as a task for judging the usefulness of interpretability tools, develops two benchmarking approaches based on this task, and uses these benchmarks to evaluate a number of existing interpretability methods. (A "trojan", in this context, is a behavior intentionally implanted into a neural network that causes it to produce aberrant outputs in response to a specific input feature while otherwise maintaining normal performance; see the first sketch after this list.)

  • "Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks", which introduces an automated method for finding "copy/paste" vulnerabilities (i.e., attacks in which copying one natural image onto another results in misclassification by the system) in image classifiers. The authors use this method to identify hundreds of vulnerabilities without human oversight. This is presented as a step towards "scalable oversight over deep neural networks".

  • "Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks", which surveys over 300 publications about interpretability methods, proposes a taxonomy for classifying these methods, and explores connections between interpretability research and work in adversarial robustness, continual learning, and other areas.

The group's work on AI policy and regulation includes: