In particular, you might know something about the mathematical rules of how observable data in the present influence the outcome you want to predict – for example, the movement of planets around the sun.
Or it may be that you have noticed, without understanding why, that your outcome always follows a distinctive pattern of circumstances in the present – for example, that when the sun sets, it gets dark. Both of these approaches are models. The first is a mechanistic model, the second is an empirical model. (Spoiler alert, data science models are almost always a mixture of both approaches.)
An example of a mechanistic model is firing an artillery shell to hit a target. The arc that a shell takes after firing follows a very well understood curve called a parabola that describes the trajectory of any non-self-propelled airborne object. If, for example, you turn an ordinary garden hose upwards into the air and turn the tap on you can see the shape of this curve very nicely in the stream of water.
So, assuming our artillery is pointing in the direction of the target, you can use the mathematical formula that describes a parabola as a model and carefully control the amount of propellant and the upwards angle of the barrel and, as a result, do a pretty good job of predicting where the shell is going to hit. As always, there are complications in the real world, but they can be accounted for with some refinements to the model.
An empirical model, sometimes called a statistical model, relies on observation rather than theory. The idea is that if you observe some particular outcome following some particular circumstance then you can reliably predict that outcome in the future. A trivial example might be noticing that every evening when the sun sinks low in the sky and then slips below the horizon it gets dark outside. Armed with repeated observations of that pattern, you don’t need to know anything about the laws of celestial orbits or about the equations for the propagation of light to predict when it will be dark tomorrow, you just need a timepiece.
That example is perhaps too simplistic to be very interesting and it’s easy to see how knowledge of celestial orbits and light propagation would in fact greatly improve the precision of your prediction. But the empirical model works well and needs no special equipment beyond your wristwatch (or cellphone if you’re a millennial). This is a common theme, especially when the prediction concerns a more complex, or less well understood, phenomenon.
For example, imagine that we need a model to predict which products might interest a shopper. There are not very many fundamental, natural principles that can help us here. All we will have to go on is observed data. Which products did the shopper purchase on their previous visits? What has the shopper already purchased on this visit? What products have similar shoppers purchased? Amazon and other online retailers have this down to a fine art. These are all correlations and if you know anything about correlations it’s that (all together now!) correlation does not imply causation. But they don’t have to imply causation to be useful for prediction, they just have to be good signals and not mere background noise.
In fact, this is a common aim of much contemporary medical research where biomarkers are sought to predict disease or other physiological conditions. With the development of ‘omics technologies (genomics, transcriptomics, proteomics, metabolomics, etc, etc), high-throughput screens make it feasible to turn up biomarkers that are correlated with the condition of interest. It’s possible, with these technologies, to screen tens or hundreds of thousands of candidate markers looking for any that correlate with the disease or physiological condition of interest. With all these different biomarkers in play, inevitably some of the correlations found will be spurious, arising mostly through chance and with poor reproducibility, and this is what is meant by signal and noise. But even if all of the discovered biomarkers are noisy signals individually, it can none-the-less be, when taken as a group, that they provide reliable predictive power.
This is what happens under the hood in models which use machine learning. The algorithm sifts and re-sifts all the available data, comparing the combined signals each time with the outcome of interest and eventually returning the best set of predictive signals to the user to validate and test on a new independent set of data.
My dog uses an empirical pattern-based model to predict when we’re going for a walk. If it’s the evening, shortly after dinner, and if I’m wearing my jacket and my boots then excitement reigns! But if any of those signals are missing, or if a negative signal is observed, such as my cycling helmet, he thinks poorly of spending the energy to even lift his head off the floor. His success rate at this prediction is uncannily good, so I’m sure he’s discovered other weak predictors which I’m not aware of that further refine his response. Though I’m not sure which particular machine learning algorithm that he used to create his model. I like to think it was random forests. (And that’s the topic of another blog.)
So correlation and, by extension, empirical models are not to be lightly dismissed. Often mechanistic information is not available, or requires lengthy, involved computation. Sometimes you don’t want to have to generate mechanistic information, because it will require conducting experiments and that can be costly, or time consuming. If your only concern is the reliability of the prediction, then a causative explanation of good signals is nice to have, but not necessary.
Mechanistic models, of course, have several advantages. Only a few input data points are required for a given prediction (amount of propellant, elevation and direction of aim in our artillery example above), whereas the number of observations needed for empirical models tends to grow exponentially with the number of variables included. Extrapolation is possible with mechanistic models. We can make good predictions outside the range of previously used input values. This is not the case with empirical models. If our shopper model above was developed for an electronics store, it won’t be useful for a sports store. Or if my dog is good at predicting when I will take him for a walk, he might be less skilled at the same prediction for my daughter (probably due to insufficient data…).
In truth, as referenced before, almost all models are a combination of mechanistic and empirical thinking. Mechanistic models must allow for some element of empiricism. If this weren’t so, then they would be capable of making predictions that were perfect to the Nth decimal place. The reverse is equally true. Empirical models include mechanistic elements, even if only in the selection of candidate predictor variables to investigate. The choice of one approach over the other is a false dichotomy and the utility of the model matters far more than the underlying approach.