Uncovering and Inducing Causal Structure in Deep Learning Models
Margaret Jacks Hall, Greenberg Room (Room 126)
Abstract: A faithful and interpretable explanation of an AI model’s behavior and internal structure is a high-level explanation that is both human-intelligible and consistent with the known but often opaque low-level causal details of the model. We argue that the theory of causal abstraction provides the mathematical foundations for this kind of model explanation. In the analysis mode, we uncover causal structure by intervening on model-internal states to assess whether an interpretable high-level causal model is a faithful description of a deep learning model. In the training mode, we induce interpretable causal structure by intervening during model training to simulate counterfactuals in the deep learning model’s activation space. We show how to uncover and induce causal structure in a variety of case studies on deep learning models that reason over language and/or images.
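
To make the analysis mode concrete, here is a minimal sketch of one intervention on a model-internal state, under loud assumptions: the "network" is a hand-built toy rather than a trained model, the high-level causal model (S = a + b, then output = S + c) is hypothetical, and every function name is illustrative rather than taken from the talk. The sketch overwrites one hidden unit of a base run with its value from a source run and checks whether the output matches the counterfactual predicted by the high-level model.

```python
from itertools import product

# Hypothetical high-level causal model: S = a + b, then out = S + c.
# `s_override` simulates an intervention that sets S directly.
def high_level(a, b, c, s_override=None):
    s = a + b if s_override is None else s_override
    return s + c

# Toy low-level "network" (hand-built, not trained):
# hidden state is [a + b, c]; the output is the sum of the hidden units.
def hidden(a, b, c):
    return [a + b, c]

def readout(h):
    return h[0] + h[1]

def interchange(base, source, unit):
    """Run on `base`, but overwrite hidden[unit] with its value from the `source` run."""
    h = hidden(*base)
    h[unit] = hidden(*source)[unit]
    return readout(h)

def interchange_accuracy(unit, inputs):
    """Fraction of (base, source) pairs where the patched network agrees with
    the high-level counterfactual in which S takes its value from `source`."""
    hits = total = 0
    for base, source in product(inputs, repeat=2):
        s_from_source = source[0] + source[1]
        expected = high_level(*base, s_override=s_from_source)
        hits += interchange(base, source, unit) == expected
        total += 1
    return hits / total

inputs = list(product(range(3), repeat=3))
acc_aligned = interchange_accuracy(0, inputs)     # hypothesis: hidden[0] encodes S
acc_misaligned = interchange_accuracy(1, inputs)  # hidden[1] encodes c, not S
```

In this toy setting, aligning S with the first hidden unit yields perfect agreement, while aligning it with the second does not. That contrast is the kind of evidence such interventions are meant to provide about whether a high-level model is a faithful description.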