Xu et al. [19] introduced an attention‐based model that automatically learns to describe the content of images and showed through visualization how the model arrives at its results. Ustun and Rudin [20] presented a sparse linear model, called SLIM, for creating data‐driven scoring systems. Their results highlight the interpretability of the proposed systems, which give users a qualitative understanding thanks to their high level of sparsity and small integer coefficients. A common challenge that hinders the usability of this class of methods is the trade‐off between interpretability and accuracy [21]. As noted by Breiman [22], “accuracy generally requires more complex prediction methods … [and] simple and interpretable functions do not make the most accurate predictors.” In a sense, intrinsically interpretable models come at the cost of accuracy.
An alternative approach to interpretability in ML is to construct a highly complex, uninterpretable black‐box model with high accuracy and then to apply a separate set of techniques, in what could be described as a reverse‐engineering process, to provide the needed explanations without altering or even knowing the inner workings of the original model. This class of methods therefore offers post‐hoc explanations [23]. Although it can be significantly complex and costly, most recent work in the XAI field belongs to the post‐hoc class and includes natural language explanations [24], visualizations of learned models [25], and explanations by example [26].
We can thus see that interpretability depends on the nature of the prediction task. As long as the model is accurate for the task and uses a reasonably restricted number of internal components, intrinsically interpretable models are sufficient. If, however, the prediction task calls for complex, highly accurate models, then post‐hoc interpretation methods must be considered. It should also be noted that the literature contains a group of intrinsic methods for complex, uninterpretable models. These methods modify the internal structure of a black‐box model that is not primarily interpretable (which typically applies to the DNNs we are interested in) to mitigate its opacity and thus improve its interpretability [27]. The methods used may either add components that provide additional capabilities or components that belong to the model architecture itself [28, 29], for example as part of the loss function [30] or as part of the architectural structure, in terms of operations between layers [31, 32]; a generic sketch of the loss‐function variant is given below.
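To make the loss‐function route concrete, the following generic sketch (not the specific method of [30]; the function name, the hypothetical attention_weights tensor, and the weight lam are illustrative assumptions) adds a sparsity penalty on an attention/masking layer so that only a few inputs drive each prediction:

```python
import torch
import torch.nn.functional as F

# Generic sketch: augment the task loss with an interpretability-oriented
# regularizer. Here an L1 penalty pushes a (hypothetical) attention/mask
# layer toward sparsity, so that few features drive each prediction.
def training_loss(logits, targets, attention_weights, lam=0.01):
    task_loss = F.cross_entropy(logits, targets)       # standard prediction objective
    sparsity_penalty = attention_weights.abs().mean()  # L1 regularizer on the attention layer
    return task_loss + lam * sparsity_penalty
```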
4.1.2 Global Versus Local Interpretability
Global interpretability facilitates understanding of the whole logic of a model and follows the entire reasoning leading to all the different possible outcomes. This class of methods is helpful when ML models are used to inform population‐level decisions, such as trends in drug consumption or climate change [33]. In such cases, a global effect estimate is more helpful than many separate explanations for all the possible idiosyncrasies. Works that propose globally interpretable models include the aforementioned additive models for predicting pneumonia risk [1] and rule sets generated from sparse Bayesian generative models [18]. However, these models are usually specifically structured, and thus limited in predictive power, in order to preserve interpretability. Yang et al. [33] proposed GIRP (Global model Interpretation via Recursive Partitioning), which builds a global interpretation tree for a wide range of ML models based on their local explanations. In their experiments, the authors showed that their method can discover whether a particular ML model is behaving reasonably or is overfit to some unreasonable pattern. Valenzuela‐Escárcega et al. [34] proposed a supervised approach to information extraction that provides a global, deterministic interpretation. This work supports the idea that representation learning can be successfully combined with traditional, pattern‐based bootstrapping, yielding models that are interpretable. Nguyen et al. [35] proposed an approach based on activation maximization – synthesizing the preferred inputs for neurons in neural networks – via a learned prior in the form of a deep generator network to produce a globally interpretable model for image recognition; the activation maximization technique was previously used by Erhan et al. [36], and a minimal sketch of the idea appears after this paragraph. Although a multitude of techniques is used in the literature to enable global interpretability, global model interpretability is difficult to achieve in practice, especially for models that exceed a handful of parameters. By analogy with humans, who comprehend a whole model by focusing their effort on only part of it at a time, local interpretability can be applied more readily.
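To make the activation maximization idea concrete, the following minimal sketch performs plain gradient ascent on the input with a simple L2 penalty standing in for an image prior, rather than the learned deep‐generator prior of [35]; the choice of pretrained network, class index, and hyperparameters is arbitrary and purely illustrative:

```python
import torch
from torchvision import models

# Minimal activation maximization sketch: gradient ascent on the input image
# so that it maximally activates one chosen output unit (here a class logit).
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
for p in model.parameters():
    p.requires_grad_(False)               # only the input image is optimized

target_class = 130                                    # arbitrary class index for illustration
x = torch.zeros(1, 3, 224, 224, requires_grad=True)   # start from a blank image
optimizer = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    logits = model(x)
    # Maximize the target logit; the L2 term acts as a crude image prior that
    # discourages extreme pixel values (cf. the earlier use in [36]).
    loss = -logits[0, target_class] + 1e-4 * x.pow(2).sum()
    loss.backward()
    optimizer.step()

preferred_input = x.detach()   # synthesized "preferred input" for the chosen unit
```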
Explaining the reasons for a specific decision or single prediction means that interpretability occurs locally. Ribeiro et al. [37] proposed LIME (Local Interpretable Model‐Agnostic Explanations), which approximates a black‐box model locally in the neighborhood of any prediction of interest (a minimal sketch of this idea is given at the end of this paragraph). The work in [38] extends LIME using decision rules. Leave‐one‐covariate‐out (LOCO) [39] is another popular technique for generating local explanation models that offer local variable‐importance measures. In [40], the authors present a method capable of explaining the local decision taken by arbitrary nonlinear classification algorithms, using the local gradients that characterize how a data point has to be moved to change its predicted label. A set of works applying similar methods to image classification models was presented in [41–44]. A common approach to understanding the decisions of image classification systems is to find the regions of an image that are particularly influential for the final classification. Also called sensitivity maps, saliency maps, or pixel attribution maps [45], these approaches use occlusion techniques or gradient‐based calculations to assign an “importance” value to individual pixels, meant to reflect their influence on the final classification. On the basis of the decomposition of a model’s predictions into the individual contributions of each feature, Robnik‐
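As an illustration of the local surrogate idea behind LIME, the following minimal sketch (a simplification under assumed choices of kernel, perturbation scale, and surrogate model, not the reference implementation of [37]) perturbs an instance, weights the perturbations by their proximity to it, and fits a weighted linear model whose coefficients serve as the local explanation; the synthetic data and random forest stand in for an arbitrary black box:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# LIME-style local explanation sketch around a single instance x0.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

x0 = X[0]                                                      # instance to explain
Z = x0 + np.random.normal(0.0, 0.5, size=(1000, X.shape[1]))   # local perturbations
pz = black_box.predict_proba(Z)[:, 1]                          # black-box outputs on perturbations
weights = np.exp(-np.linalg.norm(Z - x0, axis=1) ** 2 / 2.0)   # proximity (kernel) weights

surrogate = Ridge(alpha=1.0).fit(Z, pz, sample_weight=weights)  # interpretable local model
for i, coef in enumerate(surrogate.coef_):
    print(f"feature {i}: {coef:+.3f}")                          # local importance of each feature
```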
4.1.3 Model Extraction
Model‐specific interpretability methods are limited to specific model classes. Here, when we require a particular type of interpretation, we are limited in our choice to models that provide it, potentially at the expense of using a more predictive and representative model. For that reason, there has been a recent surge of interest in model‐agnostic interpretability methods, which are model‐free and not tied to a particular type of ML model. This class of methods separates prediction from explanation. Model‐agnostic interpretations are usually post‐hoc; they are generally used to interpret ANNs and can be local or global. In the interest of improving the interpretability of AI models, a large number of model‐agnostic methods have been developed recently, drawing on a range of techniques from statistics, ML, and data science. Here, we group them into four technique types: visualization, knowledge extraction, influence methods, and example‐based explanation.
1 Visualization: A natural way to understand an ML model, especially a DNN, is to visualize its representations in order to explore the patterns hidden inside its neural units. The largest body of research follows this approach, employing different visualization techniques to see inside these black boxes. Visualization techniques are essentially applied to supervised learning models. The most popular visualization techniques are surrogate models, the partial dependence plot (PDP), and individual conditional expectation (ICE). A surrogate model is an interpretable model (such as a linear model or decision tree) that is trained on the predictions of the original black‐box model in order to interpret the latter. However, there are almost no theoretical guarantees that the simple surrogate model is highly representative of the more complex model. The aforementioned [37] approach is a