This paper is about finding a more rigorous process for defining and evaluating interpretability.
- The current approach is soft: "you'll know it when you see it."
- How do we compare interpretability across model classes, and along which dimensions?
- e.g., a linear regression model can be sparse in features and/or sparse in prototypes
- Do all models and applications have the same interpretability needs?
Why Interpretability?
- Not all ML models need it: interpretability matters only when a human must intervene in, or justify, the model's decision.
- The need comes from incompleteness in the problem formalization: the formal objective does not capture everything we actually care about.
A Taxonomy of Interpretability Evaluation
Application-grounded Evaluation: Real humans, real tasks
- Applications where real humans (typically domain experts) perform the actual task; evaluation means actually doing the task with the model's help.
- Rather than assessing the model in isolation or through theoretical tests, it's evaluated by seeing how well it performs in a practical, real-world scenario, end to end.
- For instance, if a researcher has developed a model to assist doctors in diagnosing a particular disease, the best way to test its effectiveness isn't in a lab or a simulated environment, but directly in a clinical setting, by seeing how well it aids doctors in actual diagnoses.
- Since it's supervised, how is it trained when it's also deployed in a real setting? Is the training data produced and consumed in real time? Wouldn't that produce terrible results that could be dangerous for patients?
- Importantly, training typically occurs offline, before the model is used in a live, real-world setting. The model learns from pre-existing labeled data, adjusting its parameters to minimize the difference between its predictions and the recorded outcomes.
- Once the model has been trained and validated (typically on a separate, held-out set of labeled data to avoid overfitting), it can be deployed in a real-world setting. Even then, it's generally not used in isolation: in a medical context, for example, the model's predictions assist doctors in making diagnoses rather than making the final decision on their own, which provides a safety net and human oversight (a minimal sketch of this workflow follows this list).
- It directly tests whether the model's formal objective matches the real objective humans need; if it works here, that is the strongest indicator of success.
- The hardest form of evaluation to carry out
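To make the offline-training and human-oversight points above concrete, here is a minimal sketch of that workflow. It is not from the paper: the synthetic "patient" data, scikit-learn logistic regression, and the `assist_diagnosis` helper are all assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Pre-existing labeled data (synthetic stand-in for historical patient records).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                 # e.g., lab values, vitals
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # recorded diagnoses

# Offline training on historical data, with a held-out validation split.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# At deployment the model only assists: it surfaces a risk score plus the
# features driving it, and a clinician makes the final call.
def assist_diagnosis(patient_features):
    risk = model.predict_proba(patient_features.reshape(1, -1))[0, 1]
    top = np.argsort(np.abs(model.coef_[0] * patient_features))[::-1][:3]
    return {"risk_score": float(risk), "top_feature_indices": top.tolist()}

print(assist_diagnosis(X_val[0]))  # reviewed by a human, not an automated decision
```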
Human-grounded Metrics: Real humans, simplified tasks
- Conducting human-subject experiments in a controlled environment, not in the application's real-world setting
- These tests probe more general aspects of the quality of a model's output, such as the understandability or clarity of the explanations it provides. For example, to learn what kinds of explanations people can understand under tight time constraints, you might design a simplified, abstract task where factors like task complexity are controlled, so researchers can isolate exactly the aspect they want to test.
- We don't care about the real end-task outcome, only about how well humans can work with the explanation; common experiment designs include:
- binary forced choice: humans are shown two explanations and choose the one they find higher quality
- forward simulation: humans are given an input and an explanation, and must predict the model's output themselves (a minimal harness is sketched after this list)
- counterfactual simulation: humans are given an input, an explanation, and an output, and are asked what must be changed to obtain the desired (correct) output
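As one hedged illustration of the forward-simulation protocol, the sketch below is not from the paper: the `Trial` fields, the rule-style explanation, and the `collect_human_prediction` stand-in are all hypothetical, and in a real study that prediction would come from a human subject.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    input_features: dict   # what the subject is shown
    explanation: str       # e.g., a simple rule extracted from the model
    model_output: int      # the model's actual output (hidden from the subject)

def collect_human_prediction(trial: Trial) -> int:
    """Stand-in for the human step: a real study would show the input and the
    explanation to a subject and record their predicted model output. Here we
    fake a subject who applies the stated rule literally."""
    return int(trial.input_features["lab_value"] > 3)

def forward_simulation_accuracy(trials: list[Trial]) -> float:
    """Fraction of trials where the predicted output matches the model's actual
    output; higher means the explanation lets people simulate the model faithfully."""
    hits = sum(collect_human_prediction(t) == t.model_output for t in trials)
    return hits / len(trials)

trials = [
    Trial({"lab_value": 4.2}, "predict 1 if lab_value > 3", 1),
    Trial({"lab_value": 1.1}, "predict 1 if lab_value > 3", 0),
    Trial({"lab_value": 3.5}, "predict 1 if lab_value > 3", 0),  # model disagrees with its own explanation
]
print(forward_simulation_accuracy(trials))  # ~0.67: the explanation is imperfect
```

The same harness structure extends to the binary and counterfactual variants by changing what the subject is asked to produce.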
Functionally-grounded Evaluation: No humans, proxy tasks
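One way to make "no humans, proxy tasks" concrete is to score a model on an automatically computable proxy for interpretability, such as the sparsity of a linear model mentioned earlier. A minimal sketch, assuming scikit-learn, synthetic data, and sparsity as the chosen proxy:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data in which only two of twenty features actually matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Trade-off curve: sparsity (interpretability proxy) vs. predictive quality,
# with no human in the loop.
for alpha in (0.01, 0.1, 0.5):
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    n_nonzero = int(np.sum(model.coef_ != 0))      # proxy for interpretability
    r2 = r2_score(y_test, model.predict(X_test))   # proxy for task performance
    print(f"alpha={alpha}: non-zero coefficients={n_nonzero}, test R^2={r2:.3f}")
```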