Many times in my career, I’ve been told by respected statisticians that machine learning is nothing more than nonparametric statistics. The longer I work in this field, the more I think this view is both misleading and unhelpful. Not only can I never get a consistent definition of what “nonparametric” means, but the jump from statistics to machine learning is considerably larger than most expect. Statistics is an important tool for understanding machine learning and randomness is valuable for machine learning algorithm design, but there is considerably more to machine learning than what we learn in elementary statistics.
Machine learning at its core is the art and science of prediction. By prediction, I mean the general problem of leveraging regularity of natural processes to guess the outcome of yet unseen events. As before, we can formalize the prediction problem by assuming a population of $N$ individuals with a variety of attributes. Suppose each individual has an associated pair of variables $X$ and $Y$. The goal of prediction is to guess the value of $Y$ from $X$ in a way that minimizes some error metric.
A classic prediction problem aims to find a function that makes the smallest number of incorrect predictions across the population. Think of this function as a computer program that takes $X$ as input and outputs a prediction of $Y$. For a fixed prediction function, we can sum up all of the errors it makes on the population. If we divide by the size of the population, we get the mean error rate of the function.
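In symbols (my notation, not part of the original formulation): if $f$ denotes a prediction function and $(X_i, Y_i)$ the attributes of individual $i$, the mean error rate is

$$ \mathrm{err}(f) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{ f(X_i) \neq Y_i \}, $$

the fraction of individuals in the population on whom the function guesses wrong.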
A particularly important prediction problem is classification. In classification, the attribute $Y$ takes only two values. For example, the input $X$ could be some demographic details about a person, and $Y$ could be whether or not that person is taller than 6 feet. The input $X$ could be an image, and $Y$ could be whether or not the image contains a cat. Or the input could be a set of laboratory results about a patient, and $Y$ could be whether or not the patient is afflicted by a disease. Classification is the simplest and most common prediction problem, one that forms the basis of most contemporary machine learning systems.
For classification problems, it is relatively straightforward to compute the best error rate achievable. First, for every possible value of the attribute $X$, collect the subgroup of individuals in the population with that value. Then, the best assignment for the prediction function is the one that correctly labels the majority of this subgroup. In our height example, we could take all women aged 30 who were born in the United States and reside in California. The optimal label for this group would be decided based on whether or not the majority of people in the group are taller than 6 feet. (Answer: no).
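To make this concrete, here is a minimal Python sketch (my own illustration, not code from the post), assuming the full population is available as a list of `(x, y)` pairs with hashable attributes `x` and binary labels `y`:

```python
from collections import Counter, defaultdict

def minimum_error_rule(population):
    """Build the prediction function with the lowest population error rate:
    for each observed value of x, predict the majority label in that subgroup."""
    counts = defaultdict(Counter)
    for x, y in population:
        counts[x][y] += 1
    majority = {x: c.most_common(1)[0][0] for x, c in counts.items()}
    return lambda x: majority[x]
```

Note that this rule is only defined for values of $X$ that actually appear in the data it was built from, which is exactly the difficulty raised below.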
This minimum error rule is intuitive and simple, but computing the rule exactly requires examining the entire population. What can we do if we work from a subsample? Just as was the case in experiment design, we’d like to be able to design good prediction functions from a small sample of the population so we don’t have to inspect all individuals. For a fixed function, we could use the same law-of-large-numbers approximations to estimate its error rate. That is, if we decide in advance upon a prediction function, we could estimate the percentage of mistakes on the population by gathering a random sample and computing the proportion of mistakes on this subset. Then we could apply a standard confidence interval analysis to extrapolate to the population.
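As a sketch of that procedure (again my own code, with hypothetical names), given a fixed predictor and the population as a list of `(x, y)` pairs:

```python
import math
import random

def estimate_error_rate(predict, population, n=1000, seed=0):
    """Estimate the population error rate of a fixed predictor from a random
    sample of size n, with a rough normal-approximation 95% confidence interval."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    p_hat = sum(predict(x) != y for x, y in sample) / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (max(0.0, p_hat - margin), min(1.0, p_hat + margin))
```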
However, what if we’d like to find a good predictor on the population using only a set of examples sampled from the population? We immediately run into an issue: to find the best prediction function, we needed to observe all possible values of $X$. What if we’d like to make predictions about an individual with a set of attributes that was not observed in our sample?
How can we build accurate population-level predictors from small subsamples? In order to solve this problem, we must make some assumptions about the relationship between predictions at related, but different values of $X$. We can restrict our attention to a set of functions that respect regularity properties that we think the predictions should have. Then, with a subsample from the population, we find the function that minimizes the error on the sample and obeys the prescribed regularity properties.
This optimization procedure is called “empirical risk minimization” and is the core predictive algorithm of machine learning. Indeed, for all of the talk about neuromorphic deep networks with fancy widgets, most of what machine learning does is try to find computer programs that make good predictions on the data we have collected and that respect some sort of rudimentary knowledge that we have about the broader population.
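A stripped-down sketch of what this looks like in code (the candidate class and names here are my own illustration): fix a restricted set of candidate functions that encodes the regularity we believe in, and return the one with the lowest error on the sample.

```python
def empirical_risk_minimization(sample, candidates):
    """Return the candidate function that makes the fewest mistakes on the sample.
    The candidate set encodes our regularity assumptions: no other functions
    are ever considered."""
    def sample_error(f):
        return sum(f(x) != y for x, y in sample) / len(sample)
    return min(candidates, key=sample_error)

# Example candidate class: threshold rules on a single numeric attribute.
candidates = [lambda x, t=t: int(x > t) for t in range(100)]
```

All of the interesting decisions are hidden in the choice of the candidate set.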
The flexibility in defining what “knowledge” or “regularity” means complicates the solution of such empirical risk minimization problems. What does the right set of functions look like? There are three immediate concerns:
- What is the right representation? The set needs to contain enough functions to approximate the true population prediction function well. There are a variety of ways to express complex functions, and each expression has its own benefits and drawbacks.
- How do we search the set efficiently? The set of functions needs to be simple to search over, so we don’t have to evaluate every function in the set, which would be too time consuming. Efficient search for high-quality solutions is called optimization.
- How will the predictor generalize to the broader population? The functions cannot be too complex or else they will fail to capture the regularity and smoothness of the prediction problem (estimating functions of too high complexity is colloquially called “overfitting”).
Balancing representation, optimization, and generalization gets complicated quickly, and this is why we have a gigantic academic and industrial field devoted to the problem.
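As a toy illustration of this balance (my own sketch, switching to a simple regression setting for convenience), fitting polynomials of increasing degree to a small sample typically drives the error on the sample down while the error on held-out data eventually grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # A smooth ground-truth relationship, observed with a little noise.
    return np.sin(3 * x)

x_train = rng.uniform(-1, 1, 20)
y_train = target(x_train) + 0.1 * rng.normal(size=20)
x_test = rng.uniform(-1, 1, 500)
y_test = target(x_test) + 0.1 * rng.normal(size=500)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)  # optimization over the function class
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train error {train_err:.3f}, test error {test_err:.3f}")
```

Here the polynomial degree plays the role of the richness of the function class: too small and the representation is inadequate, too large and the fit stops generalizing.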
I’m repeating myself at this point, but I again want to pound my fist on the table and reiterate that nothing in our development here requires that the relationship between the variables $X$ and $Y$ be probabilistic. Statistical models are often the starting point of discussion in machine learning, but such models are just a convenient way to describe populations and their proportions. Prediction can be analyzed in terms of a deterministic population, and, just as we discussed in the case of randomized experiments, randomness can be introduced as a means of sampling the population to determine trends. Even generalization, which is usually studied as a statistical phenomenon, can be analyzed in terms of the randomness of the sampling procedure with no probabilistic modeling of the population.
On the other hand, some sort of knowledge about the population is necessary. The more we know about how the prediction varies with changes in the covariates, the better a predictor we can build. Engineering such prior knowledge into appropriate function classes and optimization algorithms forms the art and science of contemporary machine learning.
This discussion highlights that while we can view prediction through the lens of statistical sampling, pigeonholing it as simply “nonparametric statistics” does not do the subject justice. While the jump from mean estimation to causal RCTs is small, the jump from mean estimation to prediction is much less immediate. And in machine learning practice, the intuitions from statistics often don’t apply. For example, conventional wisdom from statistics tells us that evaluating multiple models on the same data set amounts to multiple hypothesis testing and will lead to overfitting on the test set. However, there is more and more evidence that using a train-test split does not lead to overfitting. Instead, the phenomenon we see is that dataset benchmarks can remain useful for decades. Another common refrain from statistics is that model complexity must be explicitly constrained in order to extrapolate to new data, but this also does not seem to apply at all to machine learning practice.
Prediction predates probability and statistics by centuries. As Moritz and I chronicle in the introduction to *Patterns, Predictions, and Actions*, astronomers were using pattern matching to predict celestial motions, and the astronomer Edmund Halley realized that similar techniques could be used to predict life expectancy when pricing annuities. Moreover, even though modern machine learning embraced contemporary developments in statistics by Neyman, Pearson, and Wald, the tools quickly grew more sophisticated and separate from core statistical practice. In the next post, I’ll discuss an early example of this divergence between machine learning and statistics, describing some of the theoretical understanding of the Perceptron in the 1960s and how its analysis was decidedly different from the theory advanced by statisticians.