Regularizing Black-box Models for Improved Interpretability


Most work on interpretable machine learning has focused on designing either inherently interpretable models, which typically trade off accuracy for interpretability, or post-hoc explanation systems, which tend to lack guarantees about the quality of their explanations. We explore a hybridization of these approaches by directly regularizing a black-box model for interpretability at training time, a method we call ExpO. We find that post-hoc explanations of an ExpO-regularized model are consistently more stable and of higher fidelity, which we show theoretically and support empirically. Critically, we also find that ExpO leads to explanations that are more actionable, significantly more useful, and more intuitive, as supported by a user study.
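
To make the idea of regularizing for explanation fidelity more concrete, here is a minimal sketch. It is not the paper's formulation: the abstract gives no formulas, so the Gaussian neighborhood, the least-squares linear explainer, and the names `neighborhood_fidelity_penalty`, `regularized_loss`, `sigma`, `n_samples`, and `gamma` are all assumptions made for illustration. The penalty measures how poorly a local linear model fits the black-box model near a point. Adding it to the task loss would push the model toward being easier to explain locally.

```python
import numpy as np


def neighborhood_fidelity_penalty(f, x, sigma=0.1, n_samples=32, rng=None):
    """Mean squared residual of the best local linear fit to f around x.

    Hypothetical helper: f maps a 1-D array to a scalar, and sigma sets the
    width of an assumed Gaussian neighborhood around x.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Sample points near x.
    neighbors = x + sigma * rng.standard_normal((n_samples, x.shape[0]))
    preds = np.array([f(p) for p in neighbors])
    # Fit a local linear explanation g(x') = w . x' + b by least squares.
    design = np.hstack([neighbors, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(design, preds, rcond=None)
    residual = design @ coef - preds
    # A small value means the linear explanation is faithful near x.
    return float(np.mean(residual ** 2))


def regularized_loss(task_loss, penalty, gamma=1.0):
    """Task loss plus a weighted interpretability penalty (gamma is assumed)."""
    return task_loss + gamma * penalty
```

A model that is already linear incurs essentially no penalty, while a strongly curved model incurs a larger one. This NumPy version only evaluates the penalty. Using it at training time would presumably require computing it inside an automatic differentiation framework so that gradients can flow through the model.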


Gregory Plumb, Maruan Al-Shedivat, Ángel Alexander Cabrera, Adam Perer, Eric Xing, Ameet Talwalkar
Conference on Neural Information Processing Systems (NeurIPS). Vancouver, 2020.


@inproceedings{plumb2020expo,
  author    = {Plumb, Gregory and Al-Shedivat, Maruan and Cabrera, {\'A}ngel Alexander and Perer, Adam and Xing, Eric and Talwalkar, Ameet},
  booktitle = {Advances in Neural Information Processing Systems},
  pages     = {10526--10536},
  publisher = {Curran Associates, Inc.},
  title     = {Regularizing Black-box Models for Improved Interpretability},
  url       = {},
  volume    = {33},
  year      = {2020}
}