When are predictions policies?

Whether you are speaking to corporate managers, Silicon Valley script kiddies, or seasoned academics pitching commercial applications of their research, you’re likely to hear a lot of claims about what AI is going to do.

Hysterical discussions about machine learning’s applicability begin with a breathless recap of breakthroughs in predictive modeling (9X.XX% accuracy on ImageNet!, 5.XX% word error rate on speech recognition!) and then abruptly leap to prophecies of miraculous technologies that AI will drive in the near future: automated surgeons, human-level virtual assistants, robo-software development, AI-based legal services.

This sleight of hand elides a key question—when are accurate predictions sufficient for guiding actions?

At risk of painting with an overly broad brush, we have three canonical kinds of machine learning problems:

  1. Supervised learning (SL): producing predictive models by mining patterns from collections of labeled examples.
  2. Unsupervised learning (UL): any of a number of tasks that do not require annotations.
  3. Reinforcement learning (RL): learning from a possibly sparse reward signal over the course of interactions with an environment.

The first (supervised) accounts for nearly all current commercially viable applications of machine learning, but only the last (RL) is concerned in any way with models that actually do anything. Unfortunately, modern RL is slow to learn, unstable, and brittle to subtle changes in the environment, obstacles that at present preclude its use in most real-world settings.

Are Predictions Actions?

To see when predictions may or may not be sufficient to guide actions, let’s consider two potential applications of machine learning: (i) Building a tool to help pathologists to recognize tuberculosis based on microscope images; (ii) Automating lending decisions.

The Robot Pathologist

In the first case, blood samples pour in to pathology departments from patients in the nearby geographic region. While there may be some natural variability among patients in what the blood looks like (brighter when more oxygenated, darker when less), the distribution is stable over the scale of days, months, and years. Moreover, what tuberculosis looks like doesn’t change appreciably over time, and the diagnoses assigned by the pathologist are unlikely to significantly impact the distribution of images that we’ll see in the future. So we don’t have to worry much about feedback loops coupling action and observation.

So long as the equipment (say, microscopes and cameras) also remains stable, this seems like an ideal problem for supervised learning. The standard confusion-matrix calculations (accuracy, sensitivity, specificity) are well-aligned with our real-life objectives and there’s little reason to expect the system to break down. If our methods could significantly surpass human accuracy on important diagnostic microscopy problems, it might even be considered irresponsible not to deploy these systems in the wild.
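To make the confusion-matrix calculations concrete, here is a minimal sketch in Python. The function names and the toy labels are invented for illustration, with 1 standing for a tuberculosis-positive image; none of it comes from a real diagnostic dataset.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 means positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def diagnostic_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity.

    Assumes both classes appear in y_true; otherwise a denominator is zero.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of true positives caught
        "specificity": tn / (tn + fp),  # share of true negatives cleared
    }


# Hypothetical labels: three positive images, five negative ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = diagnostic_metrics(y_true, y_pred)
print(metrics)
```

These numbers are only trustworthy as estimates of future performance because, in this setting, tomorrow's images are drawn from roughly the same distribution as yesterday's.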


Alternatively, consider applying supervised learning to the problem of allocating loans among applicants. Here, we might train our model on a dataset consisting of past applicants and their associated outcomes. Limited only by our imaginations and the all-seeing eye of Facebook, our inputs might consist of all available attributes of historical applicants and our labels could, in the simplest case, be binary labels in {0, 1} reflecting whether the loan defaulted. Alternatively, we could regress the fraction of the loan that was paid off, or some other measure of profitability.

We train our MLML system, and go into a trance as each successive iteration of training reduces the error of our classifier. After colleagues apply smelling salts to bring us back from the euphoria of watching the loss drop, we find that our model achieves a smashing result, surpassing the predictive accuracy of previous approaches. Some of our data scientists at MLML-corp might even start to make projections about how much money we could have saved (based on historical backtests) had we deployed the current MLML-system.

But our successful pattern-matching belies trouble ahead. First, we only trained the model based on those loans that were historically approved, raising the question of how we could apply the new model to accurately estimate the risk of default for new applicants that might never have been approved under the previous policy.
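A small simulation can illustrate this selection problem. Everything in the sketch below is an assumption made for illustration: a single synthetic "score" per applicant, a default probability of one minus that score, and an old policy that approved only scores above 0.6. In real life we would never observe outcomes for rejected applicants; the simulation only lets us peek because we generated them ourselves.

```python
import random

OLD_POLICY_THRESHOLD = 0.6  # hypothetical cutoff used by the previous policy


def simulate_applicants(n, seed=0):
    """Generate (score, defaulted) pairs; higher scores default less often."""
    rng = random.Random(seed)
    applicants = []
    for _ in range(n):
        score = rng.random()
        defaulted = rng.random() < (1.0 - score)
        applicants.append((score, defaulted))
    return applicants


def default_rate(applicants):
    """Fraction of the given applicants who defaulted."""
    return sum(defaulted for _, defaulted in applicants) / len(applicants)


population = simulate_applicants(10_000)
approved = [(s, d) for s, d in population if s > OLD_POLICY_THRESHOLD]
rejected = [(s, d) for s, d in population if s <= OLD_POLICY_THRESHOLD]

population_rate = default_rate(population)
approved_rate = default_rate(approved)   # the only rate our training data reveals
rejected_rate = default_rate(rejected)   # unobservable outside a simulation
lowest_score_seen = min(s for s, _ in approved)

print(f"default rate, everyone:      {population_rate:.2f}")
print(f"default rate, approved only: {approved_rate:.2f}")
print(f"default rate, rejected only: {rejected_rate:.2f}")
print(f"lowest score in training data: {lowest_score_seen:.2f}")
```

A model fit to the approved loans never sees a single applicant below the old cutoff, so any prediction it makes in that region is pure extrapolation.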

Moreover, say that when writing up the analysis we caught the NIPS 2017 Test of Time Award talk, chastening us to work out what precisely worked and not just that our model worked. Running an ablation test to see which of our features added predictive value, we are surprised to discover that our model gets a significant boost in accuracy from the footwear choices of the applicants (remember, we used all available data). It turns out that with all other features in the mix, applicants who recently purchased Oxfords are more likely to repay their loans than those who did not.
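For readers who haven't run one, an ablation test can be sketched in a few lines: fit the model with all features, refit with each feature removed, and compare accuracy. The sketch below is a toy under stated assumptions. The twelve rows, the feature names `income_high` and `bought_oxfords`, and the majority-vote lookup-table "model" are all hypothetical, and accuracy is measured on the training rows themselves to keep the example short (a real ablation would use held-out data).

```python
from collections import Counter, defaultdict


def fit_lookup(rows, features):
    """Majority-vote classifier: one prediction per combination of feature values."""
    votes = defaultdict(Counter)
    for row in rows:
        key = tuple(row[f] for f in features)
        votes[key][row["repaid"]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def accuracy(rows, features):
    """Training accuracy of the lookup model restricted to `features`."""
    table = fit_lookup(rows, features)
    hits = sum(table[tuple(row[f] for f in features)] == row["repaid"] for row in rows)
    return hits / len(rows)


def ablation(rows, features):
    """Accuracy lost when each feature is removed, one at a time."""
    full = accuracy(rows, features)
    return {
        f: full - accuracy(rows, [g for g in features if g != f])
        for f in features
    }


def make_rows(income_high, bought_oxfords, repaid, n):
    row = {"income_high": income_high, "bought_oxfords": bought_oxfords, "repaid": repaid}
    return [dict(row) for _ in range(n)]


# Hypothetical historical loans, constructed by hand.
ROWS = (
    make_rows(1, 1, 1, 3)
    + make_rows(1, 0, 1, 1) + make_rows(1, 0, 0, 2)
    + make_rows(0, 1, 1, 2) + make_rows(0, 1, 0, 1)
    + make_rows(0, 0, 0, 3)
)

drops = ablation(ROWS, ["income_high", "bought_oxfords"])
print(drops)
```

In this toy data, removing the footwear feature costs accuracy while removing income costs none, which is exactly the kind of result that should make us suspicious rather than pleased.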

Red flag. If we deploy this model to drive decisions, won’t people catch on and “game” the system? What’s to stop the devious applicants from altering their shoe-buying behavior to fool our model? Perhaps we should keep the model secret to guard against such gaming? Some earnestly suggest this as a solution to this sort of problem, but from a computer security perspective, little could be more idiotic. You can’t keep a system safe by hoping the hacker will never access the blueprints.

If we poke around further we could likely find that some other seemingly spurious correlations picked up by our model ran afoul of notions of fairness. How can we patch up the situation? Do we use interpretable models? Perhaps we can check if the model is learning “bad patterns” and then somehow “subtract out” the bad patterns from our model?

The literature is overflowing with naive solutions to make machine learning society-compatible. But they usually miss the key problem. The entire process of naively chucking predictive models at complex social problems is itself the problem. Often when we apply machine learning, we’re not just addressing a prediction problem. We are deploying a policy whose decisions form a system of incentives, potentially manipulating the behavior of those around us.


Practitioners increasingly force ML into situations where the links between prediction and action are at best tenuous. When we decide who to lend to, who to hire, or when we purport to guide bail decisions, we’re not just mining patterns in i.i.d. data. Instead, when we set policies for choosing actions, we participate in complex and dynamic systems of incentives. If loan applicants change their footwear to boost their scores, if SAT takers adopt strange patterns known to fool automatic graders, they’re not fooling the system. They are responding naturally to a system of incentives. And if we knew what we were doing, we’d build systems that incentivized only beneficial behavior adjustments.
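The footwear story can be played out in a toy simulation. The sketch below assumes a hypothetical population of 100 applicants, a policy that approves anyone who owns Oxfords, and applicants who buy the shoes whenever they cost less than the loan is worth to them. The class, function names, and all numbers are invented for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Applicant:
    repays: bool        # ground truth the lender cares about
    owns_oxfords: bool  # proxy feature the model latched onto


def approve(applicant):
    """The deployed 'policy': approve whoever owns Oxfords."""
    return applicant.owns_oxfords


def repayment_rate_among_approved(applicants):
    approved = [a for a in applicants if approve(a)]
    return sum(a.repays for a in approved) / len(approved)


def best_response(applicant, shoe_cost, loan_value):
    """An applicant buys the shoes if doing so is cheaper than losing the loan."""
    if not applicant.owns_oxfords and shoe_cost < loan_value:
        return replace(applicant, owns_oxfords=True)
    return applicant


def make_group(repays, owns_oxfords, n):
    return [Applicant(repays, owns_oxfords) for _ in range(n)]


# Before deployment, Oxfords really do correlate with repayment.
population = (
    make_group(True, True, 40) + make_group(True, False, 10)
    + make_group(False, True, 10) + make_group(False, False, 40)
)

before = repayment_rate_among_approved(population)

# After deployment, applicants respond to the incentive we created.
gamed = [best_response(a, shoe_cost=100, loan_value=1000) for a in population]
after = repayment_rate_among_approved(gamed)

print(f"repayment rate among approved, before: {before:.2f}")  # 0.80
print(f"repayment rate among approved, after:  {after:.2f}")   # 0.50
```

Nobody's creditworthiness changed between the two measurements. The policy simply altered behavior, and with it the very correlation the model depended on.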

So where then can we go from here? Do we throw out big data predictions and craft simpler systems based on factors that we think we understand mechanistically? For some problems, maybe. Do we develop a generation of tools that understands the mechanisms with which they are interacting? That sounds great, but it will be a while.

While I’m optimistic that we can make progress on the core technical and philosophical problems, I suspect that long before solutions emerge, the short-term misapplication of machine learning will continue to grow more severe. While a core community of social scientists, philosophers, CS theorists, and machine learning researchers is hard at work, current research is only tickling the critical questions. Meanwhile, too many of the practitioners most enthusiastically rushing to expand the uses of machine learning are those least likely to consider its responsible application.

Author: Zachary C. Lipton

Zachary Chase Lipton is an assistant professor at Carnegie Mellon University. He is interested in both core machine learning methodology and applications to healthcare and dialogue systems. He is also a visiting scientist at Amazon AI, and has worked with Amazon Core Machine Learning, Microsoft Research Redmond, & Microsoft Research Bangalore.

4 thoughts on “When are predictions policies?”

  1. To what extent do you think RL would deal with the problems you discuss here, if we could overcome the obstacles that currently prevent it from being easily applied in the real world?

    1. I’m not convinced that RL, even if modern algorithms overcame sample complexity issues etc., addresses the core problems here. When you introduce a set of incentives, you alter the behavior of those around you. RL is just as liable as SL to rely on potentially spurious correlations to drive decisions. Now RL can learn dynamically as the environment adapts, but 1) that doesn’t stop the RL agent from establishing a set of ridiculous incentives in the first place and 2) if there’s a long lag between actions and rewards (consider, e.g., lending decisions), the RL agent might not be able to update fast enough.

    1. So the sets of problems that causal inference, reinforcement learning, mechanism design, and game theory encounter are not disjoint. It’s just that different communities choose to make different simplifying assumptions and focus on different core questions. RL tends to assume away many of the problems that arise in real-world decision-making. Importantly, we assume away confounding (often without stating so explicitly). We also tend to assume away the existence of other agents whose behavior may be subject to change, let alone change prompted by our policies.
