Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

January 8, 2026

Fazl Barez

Current interpretability methods have made substantial progress in explaining model internals, but they rarely connect understanding to action. We propose a research agenda for automated interpretability-driven model auditing and control: a system in which domain experts can query a model’s behavior, receive explanations grounded in their expertise, and instruct targeted corrections, all without needing to understand how AI systems work internally. Our agenda comprises eight interrelated research questions that form a complete pipeline: translating queries into testable hypotheses about model internals, localizing capabilities in specific components, generating human-readable explanations, and performing surgical edits with verified outcomes. Four features distinguish our approach: explicit hypothesis generation and testing rather than purely learned mappings from latent space to language; compositional concept graphs that capture how capabilities combine and interact; domain-grounded explanations that enable expert oversight; and human-in-the-loop intervention with predicted side effects. We evaluate the framework on three safety applications: chain-of-thought faithfulness verification, emergent capability prediction during training, and capability composition mapping. This paper outlines the full agenda, proposed experiments, baselines, and evaluation strategies, providing a roadmap for building interpretability tools that are not only scientifically interesting but also practically useful for making AI systems safer and empowering humans to control them.

Europe and the geopolitics of AGI: The need for a preparedness plan

December 18, 2025

AI Benefit-Sharing Framework: Balancing Access and Safety

December 1, 2025