Note: In what follows, I take some notes on some thoughts about how learning is possible. My notes try to connect Hume, Kant, and Nietzsche with statistics (Bayesian inference and causal models) and computer science (machine-learning results like the no-free-lunch theorem, and machine-learning models). Be advised that things are still confused in my mind, I’m no expert in any of these fields, and I’m writing in English, which is not my native language.
Hume famously said that
“When we look about us towards external objects, and consider the operation of causes, we are never able, in a single instance, to discover any power or necessary connexion; any quality, which binds the effect to the cause, and renders the one an infallible consequence of the other. We only find, that the one does actually, in fact, follow the other.” (Hume, An Enquiry Concerning Human Understanding, Section VII).
With this and other statements, Hume is arguing that causality – understood as a necessary connection between cause and effect – can never be found or proved, since there is nothing in our sensible experience that shows us the inner connection between things. See also the following quote from Hume:
“When I see, for instance, a billiard-ball moving in a straight line towards another; even suppose motion in the second ball should by accident be suggested to me, as the result of their contact or impulse; may I not conceive, that a hundred different events might as well follow from the cause? … All these suppositions are consistent and conceivable.”
However, as a practical matter, we do act as if learning were possible. Don’t we know that we can’t fly, that when I write in my native language, Portuguese, people around me understand me, or that when I cross a street, a car that hits me will harm me (I may even be killed)? In fact, if we look at modern science, we see lots and lots of causal claims, lots of cause-and-effect relationships being established, so that it is quite hard to reconcile Hume’s view with the view of modern science. In a sense, the “surprise” is even bigger when we consider several recent developments in statistics about how to properly identify causal effects from data – I’m thinking here of the Neyman-Rubin causal model and of Pearl’s causal models.
This surprise isn’t new, of course. At the time Hume wrote these lines, Newtonian mechanics was gaining more and more ground as the paradigmatic science, which of course seems to contradict Hume’s view. Kant famously discussed this and attempted to solve Hume’s conundrum – or so he said. Kant reframed Hume’s question about how learning is possible as “How are synthetic a priori propositions possible?”
Says Kant in the Prolegomena (§ 30):
the pure concepts of the understanding have no meaning when they depart from objects of experience and refer to things in themselves (noumena). They serve, as it were, only to spell out appearances, so that we may be able to read them as experience (my translation, based on the original and on English and Portuguese translations).
By this, he means that experience is only possible because we have mental faculties that allow us to make sense of the data. In more modern terms, and terms more suited to our discussion, it is the transcendental that allows us to learn from experience.
Whether Kant solved Hume’s problem is debatable. Nietzsche, for instance, criticized Kant’s solution by arguing that Kant didn’t really find the mental faculties – the transcendental – but invented them. And maybe habit (Hume’s own solution to his own problem), as a practical illusion, is what allows us to learn (but if it is an illusion, do we really learn? And what do we mean by “learn”, after all?).
My purpose in bringing up these questions here is not to make an exegesis of these authors to try to understand what they really meant. Rather, I’d like to bring in some developments in statistics, along with my experience with programming – computer science – to look at these questions from a new point of view and, maybe, simultaneously illuminate both the philosophical problems and the scientific questions about causality and learning.
I don’t have the time right now to develop the statistics and programming side of the issue. So, for now, consider that I have only posed the problem. But to avoid merely posing it, I’ll jot down a few notes about what I’m thinking that connect both statistics and programming to these issues.
Regarding the Nietzsche-Kant controversy (if I may call it so), I’d like to point out (neither in favor of Kant nor against Nietzsche) the “grue riddle”, originally due to Nelson Goodman, which I first encountered through Scott Aaronson. My discussion of this point follows him almost verbatim. Aaronson discusses how we come to learn that a hypothesis is true – or at least more likely than other hypotheses, in a probabilistic sense – and then presents the following problem. Suppose we’re asked to consider the hypothesis “All emeralds are green”. Then Aaronson asks: why do we favor this hypothesis over, say, “All emeralds are green before 2030, and then blue afterwards”?
Of course, the second hypothesis is more complicated. But think about it a little further. If our language did not have the words blue and green, but only a word grue, meaning “green before 2030, then blue afterwards”, and a word bleen, meaning “blue before 2030, then green afterwards”, then the hypothesis “all emeralds are grue” would be manifestly simpler than the alternative hypothesis “all emeralds are grue before 2030, then bleen afterwards”. How can we make sense of this riddle?
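As a toy illustration of the riddle (my own sketch, not Aaronson’s), we can make the point that “simplicity” depends on which predicates the language treats as primitive. The token counts below are an arbitrary stand-in for a real complexity measure:

```python
# A toy illustration of Goodman's "grue" riddle: hypothesis complexity
# depends on which predicates the language treats as primitive.
# The encoding scheme here is illustrative, not a real complexity measure.

T = 2030  # the switch year used in the essay's example

# Express each color predicate in a chosen basis of primitives.
# In the green/blue basis, "grue" needs a conditional definition;
# in the grue/bleen basis, it is "green" that needs one.
definitions = {
    ("green/blue", "green"): ["green"],
    ("green/blue", "grue"):  ["green", "before", str(T), "blue", "after"],
    ("grue/bleen", "grue"):  ["grue"],
    ("grue/bleen", "green"): ["grue", "before", str(T), "bleen", "after"],
}

def complexity(basis, predicate):
    """Crude complexity: number of tokens needed to state the predicate."""
    return len(definitions[(basis, predicate)])

# "All emeralds are green" is simpler only relative to the green/blue basis:
print(complexity("green/blue", "green"), complexity("green/blue", "grue"))  # 1 5
print(complexity("grue/bleen", "grue"), complexity("grue/bleen", "green"))  # 1 5
```

So neither hypothesis is simpler in an absolute sense; the asymmetry comes entirely from which words the language starts with.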
If we think about what Kant is saying, we can regard our language and categories as the only practical way of understanding our experience. Or we can think, against Kant and with Nietzsche, that instead of asking why such and such judgments are possible, we should ask why they are necessary. And they are necessary because that is how we learn and use our learning in our day-to-day activities.
But this answer is unsatisfactory in light of the modern achievements of computer science, mostly in the field of machine learning – and even of science itself, with its claims of proven causality. Here, the Bayesian approach may help us a little.
Bayes’ theorem is claimed (add reference) to be the rational way of learning from experience (data). I’m not so sure things are that simple, but let me add this bit. There is one hypothesis that, as Aaronson says, is fundamental for Bayes’ theorem to work: the future must resemble the past in some way for any learning to be possible. If the past says nothing about the future, then no learning is possible.
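As a minimal sketch of what such learning looks like, here is Bayes’ theorem updating beliefs about a coin’s bias. The three hypotheses and the uniform prior are arbitrary choices of mine; the point is that the update only tells us about future tosses if we assume the same bias keeps governing them:

```python
from fractions import Fraction

# Minimal Bayesian updating on a coin's bias. Hypotheses: the probability
# of heads is 1/4, 1/2, or 3/4, with a uniform prior over the three.
# Observing heads shifts belief toward the high-bias hypothesis -- but
# only under the assumption that the future resembles the past, i.e.
# that the same bias governs the tosses we have not yet seen.

prior = {Fraction(1, 4): Fraction(1, 3),
         Fraction(1, 2): Fraction(1, 3),
         Fraction(3, 4): Fraction(1, 3)}

def update(belief, outcome):
    """One application of Bayes' theorem for a single toss ('H' or 'T')."""
    likelihood = {h: (h if outcome == "H" else 1 - h) for h in belief}
    evidence = sum(likelihood[h] * belief[h] for h in belief)
    return {h: likelihood[h] * belief[h] / evidence for h in belief}

posterior = prior
for toss in "HHH":
    posterior = update(posterior, toss)

# After three heads the 3/4-bias hypothesis dominates (posterior 3/4).
for h, p in sorted(posterior.items()):
    print(h, p)
```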
This assumption was perceived long ago by early Bayesians like de Finetti, which is when the concept of exchangeability was introduced. To understand the importance of this concept, think of a lottery. What do past numbers say about the probability of new numbers being drawn? In general, nothing, and no learning is possible: the lottery numbers are taken to be truly independent. And that’s why de Finetti introduced the concept of exchangeability: we need a weaker (less restrictive) notion than independence, one that admits some form of dependence, in order to learn.
However, I’m not sure that exchangeability (with the problems it has) is enough to justify that learning is possible. Consider the so-called no-free-lunch theorems in machine learning. Roughly speaking, these theorems prove Hume’s claim that, given my experience, any number of hypotheses are equally plausible. A bit more formally: given n data points, taking the misclassification rate as the basis of my loss function, and evaluating any algorithm by this loss on out-of-sample data, there is no a priori reason to think that one algorithm will perform better than another. One interpretation is that there is no universally best algorithm. All algorithms are equal, in the sense that it is always possible to find a set of k data points on which one algorithm performs better than another.
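A finite toy version of this point (my own sketch, not one of the formal theorems): if we average uniformly over all possible labelings of k unseen points, every fixed prediction rule gets exactly 50% out-of-sample accuracy, so no rule beats any other a priori:

```python
from itertools import product

# Toy no-free-lunch calculation: averaged over all 2^k possible binary
# labelings ("truths") of k unseen points, any fixed prediction rule
# has exactly 50% out-of-sample accuracy. The choice k = 4 is arbitrary.

k = 4  # number of out-of-sample points

def average_accuracy(predictions):
    """Mean accuracy of a fixed prediction over all 2^k possible truths."""
    total = 0.0
    for truth in product([0, 1], repeat=k):
        correct = sum(p == t for p, t in zip(predictions, truth))
        total += correct / k
    return total / 2 ** k

# Two very different "algorithms" -- identical average performance:
print(average_accuracy([0, 0, 0, 0]))  # 0.5
print(average_accuracy([1, 0, 1, 0]))  # 0.5
```

Without an assumption restricting which truths are plausible, the data cannot favor one predictor over another, which is exactly the role theory plays in the next paragraph.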
My take on these results is that theory is necessary to reduce the hypothesis space, and also to rule out, as unlikely or uninteresting, the regions of the sample space where some other algorithm would outperform our preferred one. So, connecting all the dots: I need theory, language, and some assumptions (like the future resembling the past) for learning to be possible at all. And this means that, strictly speaking, Hume was right. But, for practical purposes, we can consider that learning does happen and that causality can be proved and grounded, on the basis of how we humans organize our experience (and, given Hume’s own practical take on things, it’s not even clear that my answer isn’t Humean in spirit).
This is more or less what I intend to talk about. However, I still want to bring the Neyman-Rubin causal model into this conversation (and maybe Pearl’s). I understand things are confused in my mind, and probably here as well.