This is the first guest post by Deb. More to come!
In “The clinician and dataset shift in artificial intelligence,” published in the New England Journal of Medicine, a group of physician-scientists describe how a popular sepsis-prediction system developed by the company Epic had to be deactivated. “Changes in patients’ demographic characteristics associated with the coronavirus disease 2019 pandemic” supposedly caused the system to generate spurious alerts, rendering it of little value to clinicians. For the authors, this is a clear illustration of distribution shift, a mismatch between training and test data that, in this case, made it difficult to distinguish between fevers and bacterial sepsis. They go into detail about what this means: distribution shift is a fundamental challenge in machine learning, and whenever we deploy machine learning in the real world without considering how that environment can change (whether changes in technology, e.g., software vendors; changes in population and setting, e.g., new demographics; or changes in behavior, e.g., new reimbursement incentives), we fail to account for the ways in which the data can change or shift between training and test environments. If these shifts are not considered, the model will inevitably fail.
And why not? If the test data diverge from the data used to develop the model, we should expect disappointing results. But the distribution shift problem is so common that ML researchers and practitioners have started seeing it everywhere they look; in many cases, they will inappropriately characterize any failure of a deployed ML model as a distribution shift. This both muddles our understanding of what exactly distribution shift means and limits our vocabulary for the range of failures that can show up in deployment. In this blog post, I’ll use Epic’s sepsis detector to illustrate some of the current confusion about distribution shift, and why the notion of “external validity,” a description of generalization problems used widely in other fields, is perhaps more relevant.
The terminology of distribution shift is both too specific and not specific enough. A “change in distribution” could describe anything from a change in data source to a re-sampling of the same source. These changes could involve changes to the distribution of the input features (i.e., covariate shift), changes to the distribution of the labels (i.e., prior probability shift), or changes to the relationship between the two (i.e., concept drift).
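For readers who want the formal version, here is the standard way these categories are usually carved up (the notation is mine, not the NEJM authors’): write the joint distribution of features x and labels y as p(x, y) = p(y | x) p(x), and ask which factor moves between training and deployment.

```latex
% A common formalization of the three shift types (my notation).
% The joint distribution factors as p(x, y) = p(y \mid x)\, p(x).
\begin{align*}
\text{Covariate shift:} \quad
  & p_{\mathrm{train}}(x) \neq p_{\mathrm{test}}(x),
  && p_{\mathrm{train}}(y \mid x) = p_{\mathrm{test}}(y \mid x) \\
\text{Prior probability (label) shift:} \quad
  & p_{\mathrm{train}}(y) \neq p_{\mathrm{test}}(y),
  && p_{\mathrm{train}}(x \mid y) = p_{\mathrm{test}}(x \mid y) \\
\text{Concept drift:} \quad
  & p_{\mathrm{train}}(y \mid x) \neq p_{\mathrm{test}}(y \mid x)
\end{align*}
```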
The notion of a “data distribution” itself assumes the data come from some imagined data-generating function. In a world of infinitely abundant, independent data points sampled from a fixed probability distribution (i.e., the “independent and identically distributed,” or i.i.d., assumption), describing data in terms of how it is distributed makes a lot of sense. But as Breiman describes in “Statistical Modeling: The Two Cultures,” that assumption rarely holds for real-world data. Very rarely does one actually know the data-generating function, or even a reasonable proxy: real-world data is disorganized, inconsistent, and unpredictable. As a result, the term “distribution” is too vague to provide the specificity necessary to direct actions and interventions. When we talk about a hypothetical distribution shift, we talk about data changes without being specific about which changes happen and why. We also constrain the discourse by looking only at changes in the data in the first place, when in fact many other changes occur between development and deployment (changes in how people interact with the model, changes in how its results are interpreted, and so on). Specifying the type of distribution shift is one solution, but more importantly, we need to understand specific distribution shifts as part of a broader phenomenon of external validity that we, as a field, need to begin to articulate.
The most significant consequence of this myopic obsession with distributions is how it constrains ML evaluations. The benchmarking paradigm that dominates ML at the moment is a by-product of the field’s focus on detecting shifts in data: evaluating models on static test sets is tied to the assumption that failures stem from shifts in the data distribution and not much else. This narrow view confuses the discourse on how to evaluate models for deployment. Several studies of regulatory approvals of ML-based tools in healthcare already demonstrate how over-emphasis on data distribution shift has led ML practitioners, and even regulators, to inappropriately prioritize retrospective studies (i.e., evaluations on static collections of past data) over prospective studies (i.e., examinations of the system within its context of use). Multi-site assessment, demographic subgroup performance reporting, and “side-by-side comparison of clinicians’ performances with and without AI” are also exceedingly rare in the evaluation of ML-based healthcare tools, and median evaluation sample sizes remain small, because none of this fits our current narrow perception of what can go wrong when an ML model is thrown into the real world. Of course distribution shift matters, but the way we focus on it to the exclusion of everything else is regrettable. For better regulation and evaluation methodology for machine learning deployments, we need to expand our thinking and align ourselves with other fields attempting to understand the gap between the theory and practice of interventions.
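To make the contrast concrete, here is roughly what a purely retrospective evaluation reduces to. This is a minimal sketch with hypothetical file and column names, not any vendor’s or regulator’s actual pipeline:

```python
# A minimal sketch of a purely retrospective evaluation.
# File and column names are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import roc_auc_score

# A frozen snapshot of past encounters: one row per hospitalization,
# with the model's risk score and the chart-reviewed sepsis label.
df = pd.read_csv("retrospective_encounters.csv")

# The entire evaluation reduces to a single discrimination metric
# computed on data the model never influenced.
auc = roc_auc_score(df["sepsis_label"], df["model_risk_score"])
print(f"Retrospective AUC: {auc:.2f}")

# What this number cannot tell us:
# - whether clinicians see, trust, or act on the alerts,
# - how alert volume reshapes the clinical workflow (alert fatigue),
# - whether documentation practices or label definitions differ
#   across sites, software vendors, or reimbursement regimes.
```

A single discrimination metric on frozen data says nothing about how the alerts land in the clinic, which is precisely the gap those healthcare regulatory studies point to.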
The broader notion we need is validity, which characterizes the accuracy of the claims being made in a specific context. The related notion of reliability has to do with reproducibility and the consistency of results (think of measurement precision), while validity is concerned with truthfulness: how close the claims come to describing the real relationship between inputs and outputs. Various notions of validity are discussed in the measurement theory, evaluation science, program evaluation, and experimental design literatures, but they share some core concepts. Internal validity concerns whether a consistent causal relationship between inputs and outputs holds within the experiment, and construct validity concerns how well the experimental variables represent the real-world phenomena being studied. When discussing generalization issues, we are most interested in external validity, which asks whether the causal relationship between inputs and outputs observed in the experiment also holds outside the experimental setting.
To understand how external validity differs from the current discourse on distribution shift, let’s go back to the sepsis monitoring example. “External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients,” published in JAMA Internal Medicine, describes a retrospective study of the tool’s use between December 2018 and October 2019 (notably well before the pandemic began). The authors examined 27,697 patients undergoing 38,455 hospitalizations and found that the Epic Sepsis Model predicted the onset of sepsis with an area under the curve of 0.63, “which is substantially worse than the performance reported by its developer.” Furthermore, the tool “did not identify 1,709 patients with sepsis (67%) despite generating alerts… for 6,971 of all 38,455 hospitalized patients (18%), thus creating a large burden of alert fatigue.” These researchers rightly describe these issues as problems of “external validity,” and they examine a range of problems far beyond the data-related “shifts” described in the “Clinician and Dataset Shift” op-ed. They don’t pretend that none of this has to do with changes in the data; of course it does. Epic’s own evaluation used data from three US health systems collected between 2013 and 2015, and that is certainly a different dataset from the University of Michigan’s 2018-2019 patient records. But the authors also comment on how doctors’ interactions with the model changed outcomes, as well as on other external validity factors that had very little to do with data at all, much less with “data distribution shift.” Even when discussing substantive data changes, they are specific about what changed and break down the differences that arose upon deployment at their hospital.
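For a sense of scale, here is the back-of-the-envelope arithmetic implied by the figures quoted above (the derived sensitivity and alert rate are my own arithmetic, not numbers quoted from the paper):

```python
# Back-of-the-envelope arithmetic from the figures quoted above.
# Derived quantities are approximations, not values reported verbatim.
missed_sepsis = 1_709        # sepsis patients the model never alerted on
missed_fraction = 0.67       # reported as 67% of all sepsis patients

total_sepsis = missed_sepsis / missed_fraction   # ~2,550 patients with sepsis
caught_sepsis = total_sepsis - missed_sepsis     # ~840 patients flagged
sensitivity = caught_sepsis / total_sepsis       # ~0.33

alerted_hospitalizations = 6_971
total_hospitalizations = 38_455
alert_rate = alerted_hospitalizations / total_hospitalizations  # ~0.18

print(f"approx. sensitivity: {sensitivity:.0%}")  # ~33%
print(f"alert rate: {alert_rate:.0%}")            # ~18%
```

In other words, alerts fired on nearly a fifth of all hospitalizations while still missing two-thirds of the sepsis cases, which is the alert-fatigue picture the authors describe.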
As this study shows, machine learning needs clear guidelines for evaluating external validity. To begin scaffolding such frameworks, we can learn from the social sciences. For example, Erin Hartman, a UC Berkeley colleague in political science, and her co-author Naoki Egami propose a taxonomy that provides an interesting starting point for this discussion. Their interest is in assessing external validity in settings where a population receives a policy treatment (e.g., sending out voting reminders, updating the tax code, giving out free vaccines) and the impact of that treatment is measured both within the experiment and once it is implemented in the real world. If we consider the treatment to be an ML model’s integration into a broader system, we can begin to articulate what external validity could mean in the algorithmic context. In my next blog post, I’ll work through Hartman and Egami’s framework, along with specific proposals from other fields for taxonomizing external validity issues, and see which of the external validity problems they describe are actually quite relevant to machine learning.