What is Jacob Steinhardt's research agenda?

Jacob Steinhardt is an Assistant Professor in the Department of Statistics at the University of California, Berkeley. His research focuses on making machine learning systems reliable and aligned with human values, with directions including robustness, reward specification and reward hacking, and scalable alignment.

His current work on robustness covers building models that are robust to distributional shift and to adversaries, balancing performance across different safety axes, and defending against data poisoning.

His work on reward specification is based on the premise that value functions complex enough to capture human values must be inferred from data, without supervision, rather than specified directly. His paper on discovering latent knowledge in language models may pave the way to preventing reward hacking, in which a system exploits differences between the inferred and true rewards.
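The latent-knowledge paper's central idea (Contrast-Consistent Search) can be sketched as an unsupervised objective over a probe's credences on a statement and its negation: the two probabilities should be consistent (sum to roughly one) and confident (not both near 0.5). The function below is an illustrative sketch of that loss, not code from the paper.

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """Contrast-consistency loss sketch: p_pos and p_neg are a probe's
    probabilities that a statement and its negation are true.
    Consistency pushes p_pos toward 1 - p_neg; confidence penalizes
    the degenerate solution where both sit at 0.5."""
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

# A confident, consistent probe scores near zero ...
good = ccs_loss([0.95], [0.05])
# ... while the uninformative "always 0.5" probe is penalized.
degenerate = ccs_loss([0.5], [0.5])
```

Minimizing this loss over a linear probe on model activations is what lets truth-like directions be found without any labels.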

Introducing meaningful human oversight during the training and deployment of large models requires those models to be mechanistically interpretable in the first place. Reverse-engineering GPT-2 small through causal interventions might help us design and monitor such models based on underlying interpretable abstractions.
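The causal-intervention technique behind that reverse-engineering work is often called activation patching: cache an internal activation from a run on one input, substitute it into a run on a different input, and see how the output changes. The toy two-layer network below is a hypothetical stand-in for a transformer component, just to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one model component: two dense layers with random weights.
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 1))

def forward(x, patch_hidden=None):
    """Run the toy model; optionally overwrite the hidden activation
    with a cached one (the causal intervention)."""
    hidden = np.tanh(x @ W1)
    if patch_hidden is not None:
        hidden = patch_hidden
    return hidden @ W2

clean = rng.normal(size=(1, 4))
corrupted = rng.normal(size=(1, 4))

# Cache the hidden activation from the "clean" run.
clean_hidden = np.tanh(clean @ W1)

baseline = forward(corrupted)
patched = forward(corrupted, patch_hidden=clean_hidden)

# A large |patched - baseline| indicates the patched activation is
# causally relevant to the model's output on this input.
effect = float(np.abs(patched - baseline).sum())
```

In the real experiments the same substitution is done head-by-head and layer-by-layer inside GPT-2 small, which is how individual circuits are localized.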

Sources

Jacob Steinhardt, UC Berkeley