Why does machine learning often persist the bias

The two cultures: statistics vs. machine learning?

Last year I read a blog post by Brendan O'Connor entitled "Statistics vs. Machine Learning, Fight!" that discussed some of the differences between the two fields. Andrew Gelman responded positively:

Simon Blomberg:

From R's fortunes package: To put it provocatively: "Machine learning is statistics minus any checking of models and assumptions". - Brian D. Ripley (on the difference between machine learning and statistics), useR! 2004, Vienna (May 2004) :-) Christmas greetings!

Andrew Gelman:

In this case, perhaps we should give up checking models and assumptions more often. Then maybe we could solve some of the problems that machine learning people can solve, but we can't!

There was also the 2001 paper "Statistical Modeling: The Two Cultures" by Leo Breiman, who argued that statisticians rely too heavily on data modeling and that machine learning techniques are making progress by relying instead on the predictive accuracy of models.

Has the statistics field changed in the last ten years in response to this criticism? Do the two cultures still exist, or has statistics expanded to include machine learning techniques such as neural networks and support vector machines?


I think the answer to your first question is simply yes. Take any issue of Statistical Science, JASA, or the Annals of Statistics from the past 10 years and you will find articles on boosting, SVMs, and neural networks, although this area is less active now. Statisticians have absorbed the work of Valiant and Vapnik, and on the other side computer scientists have absorbed the work of Donoho and Talagrand. I don't think there is much difference in scope or method anymore. I never bought Breiman's argument that CS people were only interested in minimizing loss by whatever means work. That view was heavily influenced by his participation in neural network conferences and his consulting work. But PAC learning, SVMs, and boosting all have solid foundations. And today, unlike in 2001, statistics is more concerned with finite-sample properties.

But I think there are three more important differences that won't go away anytime soon.

  1. Publications in methodological statistics are still predominantly formal and deductive, whereas machine learning researchers are more tolerant of new approaches even when they are not accompanied by a proof.
  2. The ML community shares new findings primarily through conferences and their proceedings, while statisticians use journal articles. This slows down progress in statistics and the identification of star researchers. John Langford wrote a lovely post on this topic.
  3. Statistics still covers areas that are (for the time being) of little concern to ML, e.g. survey design, sampling, industrial statistics, etc.

The biggest difference I see between the communities is that statistics emphasizes inference while machine learning emphasizes prediction. When you do statistics, you want to infer the process by which your data were generated. When you do machine learning, you want to know how to predict what future data will look like with respect to some variable.

Of course, the two overlap. Knowing how the data were generated gives you some pointers on what a good predictor might look like. However, one example of the difference is that machine learning has dealt with the p >> n problem (more features/variables than training examples) since its inception, while statistics is only beginning to take this problem seriously. Why? Because with p >> n you can still make good predictions, but you can't draw good conclusions about which variables are actually important and why.

Bayesian: "Hello, Machine Learner!"

Frequentist: "Hello, Machine Learner!"

Machine Learning: "I heard you're good at things. Here's some data."

F: "Yes, let's write down a model and then calculate the MLE."

B: "Hey, F, that's not what you told me yesterday! I had some univariate data and wanted to estimate the variance, and I calculated the MLE. Then you pounced on me and told me to divide by n − 1 instead of by n."

F: "Ah yes, thank you for reminding me. I often think I should use the MLE for everything, but I'm interested in unbiased estimators and so on."

ML: "Um, what is this philosophizing about? Does it help me?"

F: "OK, an estimator is a black box: you put data in and numbers come out. We frequentists do not care how the box was constructed or which principles were used to design it. For example, I don't know how to derive the ÷ (n − 1) rule."

ML: "So what do you care about?"

F: "Evaluation."

ML: "I like the sound of that."

F: "A black box is a black box. If someone claims that a particular estimator is an unbiased estimator for θ, then we try many values of θ in turn, generate many samples from each based on an assumed model, push them through the estimator, and find the average estimated θ. If we can prove that the expected estimate equals the true value for all values of θ, then we say it is unbiased."
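
(An aside from the dialogue.) The evaluation F describes can be sketched as a small simulation. This is only a minimal sketch under stated assumptions: the assumed model is Normal with mean 0, and the function names, sample size, and number of trials are hypothetical choices made for illustration, not anything from the discussion itself.

```python
import random


def variance_mle(xs):
    """Variance estimate that divides by n (the MLE under a Normal model)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def variance_unbiased(xs):
    """Variance estimate that divides by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def average_estimate(estimator, true_var, n=5, trials=20000, seed=0):
    """Average an estimator over many samples drawn from the ASSUMED model N(0, true_var)."""
    rng = random.Random(seed)
    sd = true_var ** 0.5
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sd) for _ in range(n)]
        total += estimator(sample)
    return total / trials


# Try several true values in turn and compare the two black boxes.
for true_var in (1.0, 4.0):
    print(true_var,
          average_estimate(variance_mle, true_var),
          average_estimate(variance_unbiased, true_var))
```

With n = 5, the divide-by-n average should land near (n − 1)/n = 0.8 times the true variance, and the divide-by-(n − 1) average near the true variance itself. Note that the whole check lives inside the assumed model, which is exactly the point ML goes on to press.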

ML: "Sounds great! It sounds like frequentists are pragmatic people. They judge every black box based on its results. The evaluation is the key."

F: "Indeed! I understand that you guys have a similar approach. Cross-validation or something? But that sounds messy to me."

ML: "Messy?"

F: "The idea of testing your estimator on real data seems dangerous to me. The empirical data you use might have all sorts of problems with it, and might not behave according to the model we agreed upon for evaluation."

ML: "What? I thought you said you had proved some results? That your estimator would always be unbiased, for all θ."

F: "Yes. While your method may have worked on the one dataset (the train and test data) that you used in your evaluation, I can prove that mine always works."

ML: "For all datasets?"

F: "No."

ML: "So my method has been cross-validated on one dataset. You have not tested yours on any real dataset?"

F: "That's right."

ML: "Then I'm in the lead! My method is better than yours. It predicts cancer 90% of the time. Your 'proof' is only valid if the entire dataset behaves according to the model you assumed."

F: "Emm, yes, I suppose."

ML: "And this interval of yours has 95% coverage. But I shouldn't be surprised if it only contains the correct value of θ 20% of the time?"

F: "That's right. Unless the data really are Normal (or whatever), my proof is useless."

ML: "So my evaluation is more trustworthy and comprehensive? It only works on the datasets I've tried so far, but at least they are real datasets, warts and all. And there you were, claiming to be the 'thorough' one who was interested in model checking and all."

B: (interjects) "Hey guys, sorry to interrupt. I would like to step in and balance things out, maybe demonstrate a few other issues, but I really love watching my frequentist colleague squirm."

F: "Woah!"

ML: "Okay, kids. It was all about evaluation. An estimator is a black box. Data goes in, data comes out. We approve or disapprove of an estimator based on how well it performs under evaluation. We don't care about the 'recipe' or the 'design principles' that are used."

F: "Yes. But we have very different ideas about which evaluations are important. ML will train and test on real data. I will do an evaluation that is more general (because it involves a broadly applicable proof) and also more limited (because I don't know whether your dataset actually follows the modelling assumptions I use in designing my evaluation)."

ML: "What evaluation do you use, B?"

F: (interjects) "Hey. Don't make me laugh. He doesn't evaluate anything. He just uses his subjective beliefs and runs with them. Or something."

B: "That is the usual interpretation. But it is also possible to define Bayesianism by the evaluations it prefers. Then we can use the idea that none of us cares what is in the black box; we just care about different ways of evaluating."

B continues: "Classic example: a medical test. The result of the blood test is either Positive or Negative. A frequentist will be interested in what proportion of the Healthy people get a Negative result, and likewise in what proportion of the Sick people get a Positive. The frequentist calculates these values for each blood test method under consideration and then recommends the test with the best pair of scores."

F: "Exactly. What more do you want?"

B: "What about the individuals who got a Positive test result? They will want to know 'of those who get a Positive result, how many are actually Sick?' and 'of those who get a Negative result, how many are Healthy?'"

ML: "Ah yes, that seems like a better pair of questions."

F: "HERETIC!"

B: "Here we go again. He doesn't like where this is going."

ML: "This is about 'priors', isn't it?"

F: "EVIL."

B: "Anyway, yes, you are right, ML. To calculate the proportion of Positive-result people who are Sick, you need to do one of two things. One option is to run the tests on many people and just observe the relevant proportions. How many of these people go on to die from the disease, for example."

ML: "That sounds like what I do. Use train-and-test."

B: "But you can calculate these numbers in advance if you are willing to make an assumption about the rate of Sickness in the population. The frequentist also does his calculations in advance, but without using this population-level Sickness rate."
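
(An aside from the dialogue.) B's "calculate in advance" step is Bayes' rule applied to the test's two frequentist scores plus an assumed sickness rate. The sketch below is illustrative only: the function name is hypothetical and the numbers (90% for both scores, 1% and 50% sickness rates) are made-up assumptions, not figures from the discussion.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (P(Sick | Positive), P(Healthy | Negative)) via Bayes' rule.

    sensitivity: P(Positive | Sick)    -- one of the frequentist's two scores
    specificity: P(Negative | Healthy) -- the other score
    prevalence:  assumed P(Sick) in the population -- the prior
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same test, two different assumed sickness rates (both hypothetical).
for prevalence in (0.01, 0.5):
    print(prevalence, predictive_values(0.9, 0.9, prevalence))
```

Under these assumed numbers, the same test answers the patient's question very differently depending on the sickness rate: at 1% prevalence only about one Positive in twelve is actually Sick, while at 50% prevalence it is nine in ten. That dependence on the prior is what F objects to.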


B: "Oh, shut up. You were found out earlier. ML discovered that you are as fond of unfounded assumptions as anyone else. Your 'proven' coverage probabilities won't stack up in the real world unless all of your assumptions hold. Why is my prior assumption so different? You call me crazy, yet you pretend your assumptions are the work of a conservative, solid, assumption-free analysis."

B (continues): "Anyway, ML, as I was saying. Bayesians like a different kind of evaluation. We are more interested in conditioning on the observed data and calculating the accuracy of our estimator accordingly. We cannot perform this evaluation without using a prior. But the interesting thing is that once we have decided on this form of evaluation and have chosen our prior, we have an automatic 'recipe' for creating a suitable estimator. The frequentist has no such recipe. If he wants an unbiased estimator for a complex model, he has no automated way of building one."

ML: "And you do? You can build an estimator automatically?"

B: "Yes. I don't have an automatic way of making an unbiased estimator, because I think bias is a bad way to evaluate an estimator. But given the conditional-on-data evaluation that I like, and the prior, I can connect the prior and the likelihood to give me the estimator."

ML: "Anyway, let's summarize. We all have different ways of evaluating our methods, and we will likely never agree on which methods are best."

B: "Well, that's not fair. We could mix and match them. If any of us has well-labeled training data, we should probably test against it. And in general, we should all test as many assumptions as we can. Some proofs could be fun too, predicting performance under a presumed model of data generation."

F: "Yes, folks. Let's be pragmatic about evaluation. And in fact, I'm going to stop obsessing about infinite-sample properties. I have been asking the scientists to give me an infinite sample, but they still haven't. It's time for me to focus on finite samples again."

ML: "So, we only have one last question. We have argued a lot about how to evaluate our methods, but how do we create our methods?"

B: "Ah. As I said earlier, we Bayesians have the more powerful general method. It may be complicated, but we can always write an algorithm (perhaps a naive form of MCMC) that will sample from our posterior."

F (interjects): "But it might have bias."

B: "So might your methods. Do I need to remind you that the MLE is often biased? Sometimes you struggle to come up with unbiased estimators, and even when you do, you end up with a stupid estimator (for some really complex model) that says the variance is negative. And you call that unbiased. Unbiased, yes. But useful, no!"

ML: "Okay, folks. You're ranting again. Let me ask you a question, F. Have you ever compared the bias of your method with the bias of B's method when you were both working on the same problem?"

F: "Yes. In fact, I hate to admit it, but B's approach sometimes has lower bias and MSE than my estimator!"

ML: "The lesson here is that, even though we disagree a little on evaluation, none of us has a monopoly on creating estimators that have the properties we want."

B: "Yes, we should read each other's work a little more. We can give each other inspiration for estimators. We might find that each other's estimators work great on our own problems."

F: "And I should stop obsessing about bias. An unbiased estimator could have ridiculous variance. I suppose we all need to take responsibility for the choices we make in how we evaluate and for the properties we want to see in our estimators. We can't hide behind a philosophy. Try all the evaluations you can. And I'll keep sneaking a look at the Bayesian literature for new ideas for estimators!"

B: "In fact, a lot of people don't really know what their own philosophy is. I'm not even sure about myself. If I use a Bayesian recipe and then prove a nice theoretical result, doesn't that mean I'm a frequentist? A frequentist cares about such proofs of performance; he doesn't care about recipes. And if I do some train-and-test instead (or as well), does that mean I'm a machine learner?"

ML: "Then we all seem pretty much alike."

In such a discussion, I always remember the famous Ken Thompson quote

When in doubt, use brute force.

In this case, machine learning comes to the rescue when the assumptions are difficult to pin down, or at least it is far better than guessing them wrong.

What enforces more separation than there should be is each discipline's lexicon.

There are many instances where ML uses one term and statistics uses a different term, but both refer to the same thing. That is to be expected, and it doesn't cause any lasting confusion (e.g. features/attributes versus expectation variables, or neural network/MLP versus projection pursuit).

What is much more annoying is that both disciplines use the same term to refer to completely different concepts.

Some examples:

Kernel function

In ML, kernel functions are used in classifiers (e.g. SVM) and of course in kernel machines. The term refers to a simple function (cosine, sigmoidal, RBF, polynomial) used to map data that are not linearly separable into a new input space in which the data become linearly separable (as opposed to using a nonlinear model to begin with).

In statistics, a kernel function is a weighting function that is used in density estimation to smooth the density curve.
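
To make the two meanings concrete, here is a minimal sketch, not a reference implementation. For the ML sense it uses a degree-2 polynomial kernel on 2-D points and checks it against an explicit feature map; for the statistics sense it builds a Gaussian kernel density estimate. All function names, the toy points, and the bandwidth value are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def feature_map(x):
    """Explicit degree-2 map of a 2-D point: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """ML sense: k(x, y) = (x . y)^2, an inner product in the mapped space."""
    return dot(x, y) ** 2


def kde(data, bandwidth):
    """Statistics sense: a density estimate that averages a Gaussian
    weighting function centred on each observation."""
    scale = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(t):
        return scale * sum(math.exp(-0.5 * ((t - x) / bandwidth) ** 2) for x in data)

    return density


x, y = (1.0, 2.0), (3.0, 4.0)
# The kernel value and the inner product of the mapped points agree up to rounding.
print(poly_kernel(x, y), dot(feature_map(x), feature_map(y)))

density = kde([0.0, 1.0, 3.0], bandwidth=0.5)
print(density(1.0))
```

The first function never smooths anything and the second never maps anything to a new space, which is why sharing the word "kernel" causes confusion.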


Regression

In ML, prediction algorithms (or implementations of those algorithms) that return class labels, i.e. "classifiers", are sometimes called machines, e.g. support vector machine, kernel machine. The counterpart to machines are regressors, which return a score (a continuous variable), e.g. support vector regression.

Only rarely do the algorithms have different names depending on the mode. For example, the term MLP is used regardless of whether a class label or a continuous variable is returned.

In statistics, if you are trying to build a model from empirical data to predict a response variable from one or more explanatory variables, you are doing regression analysis. It doesn't matter whether the output is a continuous variable or a class label (e.g. logistic regression). For example, least squares regression refers to a model that returns a continuous value. Logistic regression, on the other hand, returns a probability estimate, which is then discretized into class labels.


Bias

In ML, the bias term in the algorithm is conceptually identical to the intercept term that statisticians use in regression modeling.

In statistics, bias is non-random error: some phenomenon has affected the entire data set in the same direction, which in turn means that this type of error cannot be removed by resampling or by increasing the sample size.

Machine learning seems to have a pragmatic foundation - a practical observation or simulation of reality. Even within statistics, pointless "checking of models and assumptions" can lead to useful methods being discarded.

For example, years ago the very first commercially available (and working) insolvency model implemented by the credit bureaus was built as a plain old linear regression model targeting a 0-1 outcome. Technically it's a bad approach, but in practice it worked.
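
For readers who have not seen this done, here is a minimal sketch of fitting ordinary least squares to a 0-1 outcome (sometimes called a linear probability model). The data are made up purely for illustration, the single predictor is a hypothetical risk score, and nothing here reflects the actual credit bureau model, whose details are not given above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up toy data, NOT real credit data: x is a hypothetical risk score, y is 0/1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

intercept, slope = fit_line(xs, ys)
scores = [intercept + slope * x for x in xs]
print(intercept, slope)
print(min(scores), max(scores))  # fitted values are not confined to [0, 1]
```

One commonly cited reason the approach is "technically bad" shows up even in this toy case: the fitted values fall below 0 and above 1, so they cannot be read as probabilities. The scores still rank the cases from least to most risky, which may be all a practical scoring model needs.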

The biggest differences I've noticed over the past year are:

  • Machine learning experts don't spend enough time on the fundamentals, and many of them don't understand optimal decision making or proper accuracy scoring rules. They don't understand that predictive methods that make no assumptions require larger samples than those that do.
  • We statisticians spend too little time learning good programming practices and new computer languages. We are too slow to change when it comes to computing and to adopting new methods from the statistical literature.

I disagree with this question, as it suggests that machine learning and statistics are different or conflicting sciences ... when the opposite is the case!

Machine learning uses statistics extensively ... A brief survey of any machine learning or data mining software package turns up clustering techniques like k-means, which are also found in statistics ... and even logistic regression, yet another statistical technique.

In my view, the main difference is that statistics has traditionally been used to prove a preconceived theory, and the analysis has usually been designed around that theory. With data mining or machine learning the opposite is usually the case: we have the outcome and just want to find a way to predict it, rather than first asking the question or building the theory.

I talked about this on another forum, the ASA Statistical Consulting eGroup. My answer there was about data mining in particular, but the two go hand in hand. We statisticians have turned up our noses at data miners, computer scientists and engineers. That is wrong. I think one reason for it is that we see some people in these fields ignoring the stochastic nature of their problem. Some statisticians call data mining data snooping or data fishing. Some people do abuse and misuse the methods, but statisticians have fallen behind in data mining and machine learning because we paint them all with a broad brush. Some of the great statistical results have come from outside the field of statistics. Boosting is a major example. But statisticians like Breiman, Friedman, Hastie, Tibshirani, Efron, Gelman, and others got it, and their leadership has brought statisticians into the analysis of microarrays and other large-scale inference problems. While the cultures may never mesh, there is now more cooperation and collaboration between computer scientists, engineers and statisticians.

The real problem is that this question is misguided. It's not machine learning versus statistics, it's machine learning versus real scientific advancement. If a machine learning device makes the right predictions 90% of the time, but I don't understand the "why", how does machine learning contribute to science as a whole? Imagine machine learning being used to predict the positions of the planets: there would be many smug people believing that their SVMs can accurately predict a number of things, but what would they really know about the problem they have in their hands? Obviously, science doesn't really advance through numerical prediction, but through models (mental, mathematical) that allow us to see far beyond the numbers.