Where Predictive Modeling Goes Astray

I recently reread Yarkoni and Westfall’s in-progress paper, “Choosing prediction over explanation in psychology: Lessons from machine learning”. I like this paper even more than I liked their previous paper, but I think a few notes of caution should be raised about the ways in which a transition to predictive modeling could create problems that psychologists are seldom trained to deal with.

In particular, I think there are three main dangers one encounters when engaging in predictive modeling with the aim of gaining generalizable insights about the world.

Danger 1: What kinds of predictions are needed?

The first danger with advocating “prediction” is that a reasonable reader can be uncertain about what kinds of predictions psychologists will generate after the field’s direction changes. In the mathematical literature, the word “prediction” is vague because of polysemy: the term can be used to refer to two different problems in mathematical modeling. The first problem uses “prediction” to refer to the task of making predictions from the position of an all-seeing but never-acting observer, who will predict the future without ever attempting to change it. The second problem uses “prediction” to refer to the task of making predictions from the position of a decision-maker, who will intervene directly in the world that they observe and who wishes to predict the outcomes of their own actions before they act. For those familiar with the literature on causality, this distinction reflects the gap between probabilistic and causal modeling: the first task involves the use of Judea Pearl’s see() operator and the second task involves the use of his do() operator. The appropriate techniques for developing and evaluating models in these two settings are closely related, but they are generally not identical. If the field is going to move towards predictive modeling, it is important to understand which kind of prediction is being advocated, because students will need to be taught to distinguish these two kinds of predictive modeling so that they can apply the proper techniques.

I think this distinction is particularly important to make when contrasting “prediction” and “explanation”, because I suspect that much of what we call “explanation” is a hodge-podge of causal modeling and the pursuit of models that are easily interpretable by humans. But these two pursuits are not inevitably linked: we should not assume that interpretable models will be valid causal models, nor should we assume that valid causal models will be interpretable. The focus on explanation, in my opinion, allows one to excuse oneself for failing to construct models that make falsifiable and approximately accurate predictions, because one can always assert that the models, despite their predictive failures, are insightful and educational.
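To make the see()/do() contrast concrete, here is a minimal simulation sketch (using a toy structural causal model I made up for illustration, not anything drawn from Yarkoni and Westfall) in which X predicts Y quite well for a passive observer, yet intervening on X accomplishes nothing, because a confounder Z drives both variables:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy structural causal model: a confounder Z drives both X and Y;
# X has no causal effect on Y at all.
Z = rng.normal(size=n)
X = Z + 0.5 * rng.normal(size=n)
Y = 2.0 * Z + 0.5 * rng.normal(size=n)

# see(): predict Y from passively observed X.
# The regression slope is far from zero because of the confounder.
see_slope = np.cov(X, Y)[0, 1] / np.var(X)
print(f"observational slope (see): {see_slope:.2f}")  # roughly 1.6

# do(): predict Y after setting X by fiat, which severs the Z -> X link
# while leaving Y's mechanism untouched. The slope is now roughly zero.
X_do = rng.normal(size=n)
Y_do = 2.0 * Z + 0.5 * rng.normal(size=n)
do_slope = np.cov(X_do, Y_do)[0, 1] / np.var(X_do)
print(f"interventional slope (do): {do_slope:.2f}")  # roughly 0.0
```

A model selected purely for see()-style accuracy would happily lean on X here; a model meant to guide action needs to know that the X–Y association is not causal.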

Danger 2: Mis-specified models perform well locally, but do not generalize well.

The second danger with adopting “prediction” as an objective of science is that we should not assume that the models we use will faithfully represent the data generating processes behind the datasets we wish to predict. But if the models we fit to data are usually mis-specified, we should fear that they will not generalize well when applied to new datasets – because only the true data generating process, when fit to local data, will typically converge to a parameterization that generalizes well across heterogeneous datasets. (And even this claim requires caveats, because of potential failures caused either by using small samples or by using datasets without sufficient variation in the range of inputs to identify the model being learned.)

To see why the use of mis-specified models could be a problem, consider the problem of fitting a linear model to quadratic data described in this pure theory post I wrote a while back. When one fits an approximate model, one needs to be very clear about the domain in which this approximation will be employed: the true model extrapolates flawlessly (for what are essentially tautological reasons), but approximate models often fail disastrously when asked to extrapolate beyond the range of inputs for which they were optimized.
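As a minimal sketch of that failure mode (with made-up coefficients standing in for the “true” quadratic process), consider how a linear fit behaves inside and outside its training range:

```python
import numpy as np

rng = np.random.default_rng(1)

# True data generating process: quadratic with a little noise (invented coefficients).
def dgp(x):
    return 1.0 + 2.0 * x + 3.0 * x**2 + 0.1 * rng.normal(size=x.shape)

# Fit a (mis-specified) linear model on a narrow input range.
x_train = np.linspace(0.0, 1.0, 200)
b1, b0 = np.polyfit(x_train, dgp(x_train), deg=1)

# Interpolation: the linear approximation looks respectable ...
x_in = np.linspace(0.0, 1.0, 200)
rmse_in = np.sqrt(np.mean((b0 + b1 * x_in - dgp(x_in)) ** 2))

# ... but extrapolation outside the training range fails badly.
x_out = np.linspace(3.0, 4.0, 200)
rmse_out = np.sqrt(np.mean((b0 + b1 * x_out - dgp(x_out)) ** 2))

print(f"RMSE on the training range: {rmse_in:.2f}")   # small
print(f"RMSE when extrapolating:    {rmse_out:.2f}")  # much larger
```

The true quadratic family, by contrast, recovers parameters that keep working on the wider range – which is exactly the tautological advantage referred to above.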

Danger 3: Empirical model comparisons are often badly confounded.

The problem of extrapolation for mis-specified models hints at another broad problem with evaluating models empirically: the empirical performance of a model depends on many factors, and it is prohibitively difficult (or even impossible) to perform all-else-equal comparisons between models using only empirical data. In the language of psychological statistics, empirical model comparisons exhibit substantial interactions and suffer from powerful moderators.

This occurs because, when we compare models using finite data sets, we do not generally compare the intrinsic properties of the models in the way that a pure mathematical analysis would allow us to do. Instead, we are forced to select specific algorithms that compute specific estimators, and those estimators are used to compare learned parameterizations of models rather than the parametric families we typically have in mind when talking about mathematical models. (To clarify this distinction, note that the model \(f(x) = \sin(a x)\) is actually an infinite parametric family with a distinct element for every value of \(a\). Training this family on any given dataset will lead one to select a single element from the set – but comparisons between single elements are not generally equivalent to comparisons between sets.)
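A tiny illustration of that distinction (with arbitrary, invented values of \(a\)): fitting the same family to two different datasets selects two different elements, and any empirical comparison then involves those particular elements rather than the family itself.

```python
import numpy as np

# The "model" f(x) = sin(a * x) is really an infinite family, one member per value of a.
def f(x, a):
    return np.sin(a * x)

def fit_a(x, y, grid=np.linspace(0.1, 5.0, 5000)):
    """Select the family member that minimizes squared error on (x, y), by grid search."""
    losses = [np.mean((f(x, a) - y) ** 2) for a in grid]
    return grid[int(np.argmin(losses))]

rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0, 300)

# Two datasets generated by different members of the same family.
y1 = f(x, 1.5) + 0.05 * rng.normal(size=x.shape)
y2 = f(x, 3.0) + 0.05 * rng.normal(size=x.shape)

# Training selects one element per dataset; the selected elements disagree,
# even though the underlying family is identical in both cases.
print(f"a fit to dataset 1: {fit_a(x, y1):.2f}")  # close to 1.5
print(f"a fit to dataset 2: {fit_a(x, y2):.2f}")  # close to 3.0
```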

To put it another way, measured predictive power is, at a minimum, a function of (1) the evaluation metric used, (2) the model being fit, (3) the estimator being used to do the fitting, (4) the algorithm being used to compute the estimator, and (5) the data set being used. When you change any of these inputs to the function that outputs measured predictive power, you generally get different results. This means that it can be very difficult to understand what exactly accounts for the differences in observed predictive power between two entities we nominally call “models”: is your logistic regression with a quadratic polynomial as input outperforming your probit regression with a cubic polynomial as input because of the cubic term? Or because your computer code for evaluating the inverse logit function is numerically unstable? Or because you’re using a small-N sample in which a simpler model does better than the true model – which is being rejected by your, as it were, “underpowered” dataset that cannot reliably lead to the conclusions you would see in the asymptotic limit as N goes to infinity?
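To see how slippery such comparisons can be, here is a small simulated example (with an invented, weak-signal data generating process) in which the ranking of the true linear model and a mis-specified intercept-only model flips as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data generating process: y = 0.1 * x + noise, so the true
# model is linear, but the signal is weak relative to the noise.
beta, sigma = 0.1, 1.0

def expected_test_mse(n_train, fit_slope, n_reps=2000, n_test=500):
    """Monte Carlo estimate of out-of-sample MSE after training on n_train points."""
    mses = []
    for _ in range(n_reps):
        x_tr = rng.normal(size=n_train)
        y_tr = beta * x_tr + sigma * rng.normal(size=n_train)
        if fit_slope:
            b1, b0 = np.polyfit(x_tr, y_tr, deg=1)  # the true model family
        else:
            b1, b0 = 0.0, y_tr.mean()               # mis-specified intercept-only model
        x_te = rng.normal(size=n_test)
        y_te = beta * x_te + sigma * rng.normal(size=n_test)
        mses.append(np.mean((y_te - (b0 + b1 * x_te)) ** 2))
    return float(np.mean(mses))

for n in (10, 10_000):
    print(f"n = {n:>6}: true-model test MSE = {expected_test_mse(n, True):.3f}, "
          f"intercept-only test MSE = {expected_test_mse(n, False):.3f}")
# At n = 10 the intercept-only model tends to win; at n = 10,000 the true
# model wins. Same two "models", opposite empirical verdicts.
```

Nothing about either model family changed between the two rows of output; only the dataset did.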

This type of confounding is not such a large problem for machine learning, because the goal of many machine learning researchers isn’t to discover one single universally correct model parameterization, but rather to develop algorithmic tools that can be used to learn useful model parameterizations from arbitrary datasets. What sets machine learning and statistics apart from psychology and physics is that machine learning and statistics are far more dataset-agnostic than fields with specific empirical areas of focus.

Instead of looking to machine learning, one should keep classical physics in mind when one wants to see a successful example of partially generalizable (but still imperfect) model parameterizations being learned and generating falsifiable and approximately accurate predictions about new datasets. In physics, the equation predicting the position of a falling object in Earth’s gravitational field isn’t a family of arbitrary polynomials, but rather a specific parameterization of a quadratic equation in which parameters like the typical value of the coefficient on the quadratic term are taught to students in introductory courses.
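Concretely, the parameterization in question (neglecting air resistance) looks something like

\[
y(t) = y_0 + v_0 t - \tfrac{1}{2} g t^2, \qquad g \approx 9.8\ \text{m}/\text{s}^2,
\]

where the coefficient on the quadratic term is pinned down by a shared, measured constant rather than re-estimated from scratch for every new dataset.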

More generally, this danger hints at the same concerns about objectives that the first danger addressed: is the goal of a new version of psychology the construction of generalizable predictive model parameterizations or the construction of general-purpose algorithms that can be applied to custom datasets, but which will generally disagree across datasets? If the latter, how will people know when they’re talking past one another?

Conclusion

Despite these three concerns, I believe that Yarkoni and Westfall’s preference for prediction over explanation is not only the right approach for psychology, but also the only long-term cure for psychology’s problems. More sophisticated statistical methods will not, in my mind, solve psychology’s biggest problems: the field may purge itself of its accumulated stock of false positive claims, but it will not create a cumulative science of true positive claims until it can show that it can generate model parameterizations that make falsifiable, unexpected, and approximately accurate predictions on a wide range of new datasets. This is an extremely hard challenge. But I believe Yarkoni and Westfall’s suggestions represent the best chance psychology has to overcome this challenge.