Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

January 8, 2026

Fazl Barez

Current interpretability methods have made substantial progress in explaining model internals, but they rarely connect understanding to action. We propose a research agenda for automated interpretability-driven model auditing and control: a system where domain experts can query a model’s behavior, receive explanations grounded in their expertise, and instruct targeted corrections, all without needing to understand how AI systems work internally. Our agenda comprises eight interrelated research questions forming a complete pipeline: from translating queries into testable hypotheses about model internals, to localizing capabilities in specific components, generating human-readable explanations, and performing surgical edits with verified outcomes. Our approach is distinguished by four features: explicit hypothesis generation and testing rather than purely learned mappings from latent space to language; compositional concept graphs that capture how capabilities combine and interact; domain-grounded explanations that enable expert oversight; and human-in-the-loop intervention with predicted side effects. We evaluate the framework on three safety applications: chain-of-thought faithfulness verification, emergent capability prediction during training, and capability composition mapping. This paper outlines the full agenda, proposed experiments, baselines, and evaluation strategies, providing a roadmap for building interpretability tools that are not only scientifically interesting but also practically useful for making AI systems safer and empowering humans to control them.

Open Problems in Frontier AI Risk Management

February 24, 2026

Financing the AI Triad: Compute, Data and Algorithms. A framework to build local ecosystems

February 18, 2026