Statistics: Losing Ground to CS, Losing Image Among Students (revolutionanalytics.com)
104 points by sampo on Aug 26, 2014 | 70 comments


I disagree. Statistics is not "losing image" among students. It's the other way around: more and more students and software developers have become interested in statistics, inspired by the successes achieved by machine learning and AI researchers. Because machine learning is statistical learning. It's almost impossible to build a machine learning system without learning at least some statistics!

As evidence of the growing popularity of statistics, consider that this submission made the front page on HN.

The "losing image" meme applies, not to the discipline of statistics, but to staid academic departments who have lost power and prestige. They don't like that their subject has become popular in a different department (CS), outside their purview.

PS. I would add that many (but not all) statistics departments have brought this upon themselves. For a long time, their approach to real-world, practical applications has been to apply existing models in ways that can be readily interpreted by human beings, e.g., with "explanatory variables." Meanwhile, machine learning and AI researchers have been experimenting with, testing, and deploying new models and new ways of 'scaling up' old models, focused on making accurate predictions on previously-unseen data. Of course students find the latter more exciting.


Talk about head in the sand. CS is subsuming stats (not really) because computers are a, if not the, major driving force behind statistical advances. L1 regularization became a fad in stats because computational optimization could efficiently solve this non-smooth optimization problem. Engineers are now advancing algorithms and moving them into real-world use, while statisticians are still re-searching their own field and finding applications for existing algorithms. There is nothing wrong with application, but this means that they are not at the forefront of research and, more importantly, that they have to wait for other people to solve their problems because they are not capable of inventing novel algorithms. When the world at large relies on computers for every little thing, and on computer scientists for a lot of scientific advances, is it really surprising that computer scientists are more prominent?

And computer scientists have plenty of rigor. The NIPS conference is an AI/neuroscience conference that is now prestigious in stats as well. Some machine learning experts lack mathematical foundations, but statisticians are not really good mathematicians either. Being able to regurgitate stats theorems and prove convergence in expectation under two dozen conditions (no kidding; read average JASA papers) is not the hallmark of analytic rigor. The average stats Ph.D., even from a top program, lacks a thorough understanding of measure theory. Just look at standard stats textbooks: examples are aplenty (using statistical software), but proofs are nowhere to be found. It is amazing how many statisticians are still using pseudo-inverses in numeric computation.

Statistical education is a problem. The biggest factor is the very software this article is promoting: R. It is an atrocious monstrosity. Perhaps statisticians can live with it. Fine, but don't expect good programmers to come out of using R for five years. Python would be far better. In the end, statisticians are not professional programmers, the same way accountants are not. I don't see too many accountants fretting that CS is subsuming their field. Let us not forget, most statistical methods have their roots in other science and engineering fields, and those fields are not particularly bothered.


Here is a math major's perspective (I also have a C.S. master's from Stanford, a decent school for computer science).

The real reason statistics is losing to machine learning is M&M: money and marketing.

Money is a huge factor when kids choose majors in college. Many of them have student loans to pay off, and for many of them, getting a high-paying job after college is a serious consideration. In this regard, statistics is a great major, but computer science is flat-out ridiculous. When big tech companies are offering a Stanford graduate with only a few summers of programming internships under their belt $150k/year, naturally a lot of kids are lured into computer science.

Then, once they start studying computer science, they discover this thing called machine learning. While I do think there is a difference in emphasis between stats and machine learning, the fundamentals are the same, except the nomenclature sounds much cooler in machine learning. Nonparametric inference sounds esoteric and cryptic, but unsupervised learning sounds futuristic and cool.

The biggest problem with both statistics and machine learning education, I would say, is their lack of emphasis on mathematical foundations. When I was in college, CS229 (Introduction to Machine Learning) was touted to be the hardest class at Stanford. Having helped my friends wade through CS229 problems (a lot of which come down to wading through linear algebra and multivariable differential calculus), I do not think this is remotely true: CS229 is hard because it attracts students who do not have the requisite mathematical maturity to learn statistics/machine learning in a serious way (I am not even talking about the real-analytic foundation of probability and calculus but rudimentary linear algebra and the chain rule).
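To make "rudimentary linear algebra and the chain rule" concrete, here is the kind of derivation many such problems reduce to (my own illustrative example, not an actual CS229 problem-set item): the gradient of the least-squares objective.

```latex
% Least-squares objective:
J(\theta) = \tfrac{1}{2}\,(X\theta - y)^\top (X\theta - y)
% Let r(\theta) = X\theta - y, so that \partial r / \partial \theta = X.
% By the chain rule,
\nabla_\theta J = \left(\frac{\partial r}{\partial \theta}\right)^{\!\top} r
                = X^\top (X\theta - y)
% Setting the gradient to zero yields the normal equations X^\top X \theta = X^\top y.
```

Nothing here is beyond a first linear algebra and multivariable calculus course, which is exactly the point.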

Also, I do think CS229@Stanford is a great class that brings theory and practice together =)

As for R/Python/Matlab/etc., I will let the zealots argue what's best. To me, they are like statistics and machine learning: similar with different emphasis.


I agree with your points so much.

I am an applied Maths student (currently), and when looking for internships I noticed 80% of them are for CS-related stuff. Imagine being me (19 years old), offered a 15K/year job in my first year of university. My best job up until then was cleaning the floor at a mall. I have studied an insane amount of mathematics all my life (proofs after proofs); how much longer do I need to wait until I am able to do something of value for society?

Compare that to the amazing 14-week internship I did at a startup churning out frontend JavaScript. Those 14 weeks were the best weeks of my life; I never felt more confident and of value. My depression vanished overnight, I found a new aim in life, and I treated coding the same way nations treat nuclear weapons.

I learnt more things of value in those 14 weeks than I did in the previous 19 years. I didn't need to set an alarm clock to wake up in the morning. There was no hand-holding, because I was getting paid. I spent the rest of my waking hours churning out code with no need for external motivation.

I got really angry at society for misleading me for such a long time, given how easy it was for me to finally do something useful. I see my friends struggling with the same problems and feel so much of young people's time is wasted. In my mind it's a crime.

I think, as always, mathematicians fail to grasp the human side of the argument. Why are they complaining about lack of interest in statistics? It's their own fault this problem exists. There is no Neil deGrasse Tyson for statistics, there is no Mark Zuckerberg for stats. You don't need to hire a PR department. Just show us the wonderful magic that you speak of and we are smart enough to figure out why we should dedicate our lives to your cause.

This is why, if some statistical axiom/theory doesn't have an applied CS part, I become really skeptical of taking that course.


>I think, as always, mathematicians fail to grasp the human side of the argument. Why are they complaining about lack of interest in statistics? It's their own fault this problem exists. There is no Neil deGrasse Tyson for statistics, there is no Mark Zuckerberg for stats. You don't need to hire a PR department. Just show us the wonderful magic that you speak of and we are smart enough to figure out why we should dedicate our lives to your cause.

It doesn't help that textbooks and course materials in mathematics sometimes appear to be designed for working mathematicians more than for students, with a lot of outright hostility towards the very concept of students learning maths.

(I acknowledge that this comes from Math professors hating the time they spend teaching STEM weed-out courses, but I still think that overall, Mathematics is much worse than even theoretical computer science at avoiding things like Notation Proliferation Syndrome and Greek Letter Syndrome that make it artificially hard to learn.

Explain the rigor, yes, but also explain the meaning and the concepts such that a decent learner should understand them well enough to reconstruct the rigorous parts on-the-fly as he or she needs.)


Yeah, I use this book. It's the clearest book I have ever read for first- and second-year applied Mathematics.

http://www.amazon.co.uk/Engineering-Mathematics-K-Stroud/dp/...

Theoretical Computer Science is great in my opinion.

Many mathematics people complain about external factors. I agree there may be many external factors, but there is also a bucketload of problems with mathematics itself that they are in a position to solve.

I am convinced that you cannot do applied mathematics without computers. An anecdotal example: I have been solving differential equations since I was 15, and only when I tried creating mathematical models for self-driving cars for my CS assignments did I realize I had no clue what I actually understood. I spent 5 years just manipulating algebra which Python can do with one line of code in SymPy.
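For anyone curious, that one line looks roughly like this (a sketch assuming SymPy is installed; the ODE here is my own trivial example, y' = y):

```python
# One line of SymPy replaces pages of by-hand algebraic manipulation:
# solve the ODE y'(t) = y(t) symbolically.
import sympy as sp

t = sp.symbols('t')
y = sp.Function('y')
solution = sp.dsolve(sp.Eq(y(t).diff(t), y(t)), y(t))
print(solution)  # the general solution C1*exp(t)
```

The same `dsolve` call handles a large class of ODEs that take pages to work through manually.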


While it must be a factor, I don't think weariness from teaching intro courses is the main reason we don't see better math books.

I think it is genuinely difficult to write engaging texts that are sufficiently rigorous, to retrain teachers in the methods that go along with them, and to drum up support among administrative types.

Not that it's impossible, but it is very challenging.


Agreed, but it's also easier said than done.


I suspect money is a big motivation behind all the hand-wringing by statisticians. It is about jobs, grants, prestige. I don't blame them; people want these things. I definitely see a competition between computer scientists and statisticians for the mantle of data scientist. When the current bubble bursts and the hype dies down, one side could have won and the world could be very different, or not. What do I know?

I am a little torn between the realistic business demand of paying the least money to get jobs done and thorough mastery and understanding of theories. Look at biostatistics; there, master's graduates become data analysts and process data. They know the nomenclature and can understand instructions from more advanced statisticians. The majority of practitioners don't know proofs, and they carry out their tasks reasonably well. I would call all but a very select few statisticians applied statisticians, and the world is fine with that. I am not sure the machine learning industry will not turn out like this, although I am not looking forward to that day.

I would just note that numerical optimization is a big deal in machine learning, while I know of no statistics department strong in it. The popularity of SVMs owes a load of gratitude to practical breakthroughs in numerical optimization. But if one elevates to the level of function approximation, then statistics is similar to almost all engineering fields. In fact, I suspect statisticians have more in common with engineers than with scientists. I bet Eric Cantor didn't care that polling results had margins of error, but that he lost his House seat due in part to overconfidence coming from unreliable statistics. Real-world applicability, not theoretical nicety.


> I am not sure the machine learning industry will not turn out like this

I am certain it will. The machine learning field is already rife with best practices, general-use algorithms, software solutions and tutorials.


>As for R/Python/Matlab/etc., I will let the zealots argue what's best. To me, they are like statistics and machine learning: similar with different emphasis.

Having tried working with Matlab and having extensive experience with Python, my verdict is: Haskell or OCaml.

OK, but seriously: even once I understand just fine what algorithm I'm implementing, doing numerical computation in these all-dynamic languages is atrocious. I could have saved hours of runtime, on top of saving myself debugging time, if I'd been allowed and able to just write the damned algorithms in a statically-typed language that compiles to machine code.


>When big tech companies are offering a Stanford graduate with only a few summers of programming internships under the belt for 150k/year

Aren't most of these offers in a part of CA where the cost of living is just as ridiculous as the salary, making the salary not ridiculous at all?

I work in the DC area, and veteran programmers normally can't get anywhere close to that unless it's a contract job with no benefits and shitty working conditions. Even then, it will likely be about 30k less than that.


That is precisely a significant reason why I left DC - the pay is tied heavily to years of experience, and getting something like $150k is difficult, albeit not impossible, to do there.

I'm not sure if I buy that a Stanford graduate would necessarily net $150k starting at most companies (certainly possible at big companies, but I am skeptical - one software engineer told me that at one tech giant, he was offered $160k as a senior engineer, and was told that was about the max they could offer any software engineer then. He was a lead developer for a major group for this company, and this was with proven productivity) - I certainly wouldn't hire one at that price currently unless they are absolutely stellar, and I work in downtown Palo Alto for a pretty healthy startup. At that base salary, they'd be making around my salary, with far less productivity and likely even less potential.

$120k is more in the ballpark of what one might typically make, and even then I'd want a strong mind that is open to teaching/mentoring and to investing in developing himself/herself.


All-in compensation for a top student out of a top college (think Stanford, MIT, Berkeley) in SV can be $200k.

But keep in mind that the high cost of living really only applies to certain areas (albeit desirable ones like SF, Palo Alto, even Mountain View). There are tons of affordable locations if you are willing to have a 30 minute commute and are willing to accept sketchier environments (Berkeley, Oakland, other East Bay locations, East Palo Alto, etc.).


"top student out of a top college" doesn't exactly describe a lot of individuals (10? 15?), which makes that compensation value an extreme outlier.


Allow me to revise that to "top student at a decent school." (School is actually less important.) I would say it describes >200 people every year (take the top 5-10% of Stanford, Harvard, Berkeley, MIT, CMU, Princeton, UPenn, UT Austin, Caltech, a couple of others, and then maybe the top 2-5% at UCLA, UCSD, CU-Boulder, and dozens more schools).


Berkeley is sketchy now?


The cheaper parts of Berkeley are sketchy, yes.


> The biggest problem with both statistics and machine learning education, I would say, is their lack of emphasis on mathematical foundations.

Any recommendations for someone who'd like to pick up sound (if modest) foundations?


Yes. For example, to follow CS229's lecture notes, the following material should suffice.

- Linear Algebra Done Right by Sheldon Axler

- Linear Algebra by Steven Levandosky

- Linear Algebra and Multivariable Calculus, and Manifolds by Ted Shifrin (you probably do not need the manifolds part)

Also, as paradoxical as it may sound, CS229 lecture notes themselves. The key here is to verify the details painstakingly. In math, we joke that whenever a book says something is "trivial" or "easy to verify" or "clear to show", it is a substantial exercise in disguise.


Yes, the whole "trivial exercise" thing is in itself a big problem with how mathematics is taught. Someone who's struggling just to understand the text could probably really use a worked-out example.


That's the wrong interpretation. In math, a 'trivial' or 'clearly' is supposed to indicate that a certain detail is important to understanding the subject. And if it isn't clear or trivial then the reader is expected to delve deeper before moving on.

The "big problem" is that math is taught on a restricted time scale - a semester, just like all other standard courses. In reality it might take someone many weeks or months to pick up a certain piece of measure theory or Hilbert spaces. Math builds upon itself, and when it is rushed there is a lack of connection and then ultimately a failure to understand.

This is why MOOCs are a great boon for many, as they allow more freedom to self-dictate the pace.


If that's what it really means then a perfectly good way to say it is: "To test your understanding, make sure you can do this exercise before continuing."

I suspect many mathematicians like being intimidating.


"Left as an exercise to the reader," the bane of my existence as a college student. I especially remember the Sipser book doing this. He'd show a super-simple example, like proving 2+2=4, and then the next thing would be actually interesting and would be left as an exercise. Awesome.


I am a big fan of http://www.amazon.com/Probability-Computing-Randomized-Algor... . Every time it teaches a new chunk of theory it motivates and grounds it in some algorithmic application.


A college-level textbook in linear algebra, another in calculus (must include vector calculus), and one in probability?

For linear algebra and calculus, at least, Coursera also offers courses.


Hogg & Craig used to be one of the standard texts.


I am just an undergrad maths student, but here is my experience:

I learnt more about Bayesian statistics in my Machine Learning course than I did from my statistics module. The reason?

The stats book/course was the dullest nonsense I ever came across; even number theory made more sense to me than statistics. The whole point of the book/course was to prove some really abstract axioms, and no one understood why they were doing it.

Compare that to the process of writing a spam filter from scratch in MATLAB. A mathematical axiomatic system is only useful if you are solving some real problem with it; otherwise, good luck living in the matrix.

I have learnt that it is far easier for mathematics to create garbage than to create useful ideas.

The reason why I think stats sucks so much is that it's a cultural problem in the community. If you google "how to write a spam filter", I am sure someone has made a wikiHow with pictures that explains Bayesian stats even to a snail.

The reason is that the computer science community constantly accepts its ignorance: even if a proof/program is trivial, they will make sure to explain it. In the mathematics world no such thing exists. The big assumption is "we leave it to the reader to prove this trivial fact." No, Mr Smarty Pants; please assume the reader is a 17-year-old girl who cannot send an email (if you want to attract young people, please understand their mindset first).

/end rant


>The reason is that the computer science community constantly accepts its ignorance: even if a proof/program is trivial, they will make sure to explain it. In the mathematics world no such thing exists. The big assumption is "we leave it to the reader to prove this trivial fact." No, Mr Smarty Pants; please assume the reader is a 17-year-old girl who cannot send an email (if you want to attract young people, please understand their mindset first).

Tell it preacher!

Also, if you still want a Bayesian stats book (which covers plenty of the material from a Frequentist intro stats course, too), William Bolstad's Introduction to Bayesian Statistics is as clear as anything I've ever read.


Thanks! I think every book that works for the lowest common denominator like myself should be pushed in schools and highlighted. The book that was pushed in class for us was the most pathetic excuse for a book: http://www.amazon.co.uk/Modern-Engineering-Mathematics-Prof-... I really hate this book. It's the bible of "wall of text" syndrome. The thing that really made me angry is that one of the authors of this book was my prof! I think such a conflict of interest in the classroom should be made a crime.

This book is much better if you are interested in everything applied-mathematics related (besides stats): http://www.amazon.co.uk/Engineering-Mathematics-K-Stroud/dp/... The author is dead, but damn is this book timeless. Everything is explained in a Twitter-like way; every page is made up of 4 slides.


Which stats book did you use for which course?

Intro stats is typically taught abysmally, because it's been dragged down by eighty years of pre-computing historical stats.


I was lucky to have studied in a British-curriculum high school. The books I used were these: http://www.amazon.co.uk/Edexcel-Level-Modular-Mathematics-St... These books were amazing: about 100 pages each, with lots of colour and worked examples, and they go from S1 to S6. What I find amazing is that I have seen some insane Greek-syndrome books explaining the same topics as these "kiddie" coloured books.

However, my statistics education stopped after college (the exception being Khan Academy videos). I have replaced all my stats modules with CS modules in university, and I hacked the stats modules I was forced to do by mugging up formulas and memorising exam techniques :( So you can see why this stats prof complaining ticked me off, as he thinks students haven't tried and are acting in ignorance.


Don't advertise your ignorance.


I'd be interested in more about your beefs with R.

Note that its origins lie in the earlier language 'S', part of the original AT&T UNIX suite, and seen as an adjunct to the data knife of awk. Later you'd likely have paired it with Perl, Python, or Ruby, and likely a pseudo-database (SQLite, say), along with some stats and graphing libraries or tools.

My own stats background includes some very basic SPSS, as well as SAS and R, and graphics under gnuplot. I've always found R tremendously difficult to grok, though I've known a number of statisticians I respected (at least as statisticians, if not as programmers) who found R exceptionally useful and powerful. My understanding is that its real value is as a matrix manipulation language, which may account for both its obscure syntax and appeal among hard-core stats geeks.

But still: your criticisms?


OK, to be fair, R is a reasonable language for statisticians and others. I guess R and S are different enough from usual programming languages that I may be biased against them. But I did spend five years working on bioinformatics and statistical genetics using it, and I couldn't wait to get away from it when I did. I will list my personal peeves.

1. R's syntax has problems. This is well documented on the web. I recall that ":" has surprising precedence: 1:n-1 parses as (1:n)-1, not 1:(n-1).

2. My biggest problem is R's lack of basic abstraction facilities. Beyond functions there is nothing good. There are three object-oriented programming systems, none of which is simple or straightforward. It has first-class functions and lexical closures, so at least basic functional programming should have been possible and fun. But R functions have a weird parameter-passing mechanism (lazy evaluation of promises) which scared me away from investigating deeply. In the end I used Rcpp to write a large codebase and expose an R interface. And I do like to delve deep into things; I was learning template metaprogramming when I was working on bioinformatics.

3. CRAN has way too many low-quality packages, but far more importantly, almost all packages' documentation consists of use cases plus a reference. To this day I don't have a good mental picture of the GenomicRanges data structure (a foundational bioinformatics package in Bioconductor for NGS data analysis) and how to manipulate it (one main data structure, I believe). This is annoying at best and infuriating at worst.

4. R, more than any other language I know, is an implementation-defined language. The language reference manual frankly admits fuzzy rules for some of the more advanced features. Python, which has recently been criticised for similar problems, is at least clear at the level of basic language features.

I don't want this discussion to degenerate into a flamewar. Programming language preference is a highly personal thing, and I respect people who use R; I even admire the programmers who create awesome packages. ggplot2 and plyr are some of the best packages I have ever used, and their documentation is superb. I just don't believe R is a good language for statistics students if they hope to move on to professional programming in other languages.


> Beyond functions there is nothing good. There are three object-oriented programming systems, none of which is simple or straightforward.

I don't think this is quite fair. R was greatly inspired by Scheme and stole a lot of good ideas. S4 is very similar to CLOS, and does a fairly good job. If you come from Python or Java, this system looks crazy, but being unfamiliar with a language style does not make the language bad.

> To this day I don't have a good mental picture of the GenomicRanges data structure (a foundational bioinformatics package in Bioconductor for NGS data analysis) and how to manipulate it (one main data structure, I believe).

It's a dataframe attached to ranges. This design extends lower classes like IRanges, and all the accessors, setters, etc are consistent. You can manipulate the range part with integer range operations and manipulate the dataframe part as a dataframe. I would check out the IRanges vignette — this is very helpful.

> R, more than any other language I know, is an implementation-defined language.

This is the most severe problem, for sure. I'm crossing my fingers that someone forks R's implementation and does it right (which would break a lot of packages and code).


>It's a dataframe attached to ranges. This design extends lower classes like IRanges, and all the accessors, setters, etc are consistent. You can manipulate the range part with integer range operations and manipulate the dataframe part as a dataframe. I would check out the IRanges vignette — this is very helpful.

I agree that the accessors and setters of GRanges are consistent. I was having trouble picturing how manipulating the range part impacts the dataframe. I think they are like a database table join, but that is not explicitly spelled out. It becomes even less clear what happens to the attached dataframes when merging two GRanges intervals. I once had to take a symmetric difference of two GRanges, and it took a few hours of staring at the manual and trying things out at the command prompt.


S4 was heavily influenced by Dylan, FWIW


Not the OP, and I don't have a beef with R, but I prefer python.

A lot of it is just my background. I was a math major, but I didn't use Matlab (much) - my professors just happened to be the kind who prefer C (numerical recipes types). So my intro to math was through programming languages rather than through math environments that can be programmed. I'd say that's probably the big difference between Python and R. I'm sure there's absolutely nothing you can't do with R, but python is a general purpose programming language with great libraries for ML, R is a stats environment with general purpose programming capabilities. Having also been a programmer for 15 years since my school days, I'm more productive in the former (general programming language with great libraries).

There's one other reason. People often speak derisively of "cramming things into a random forest." In my experience, it takes a remarkable amount of intricate programming to cram things into a random forest! Even on a Kaggle assignment, the sort of thing where you have the question and a reasonably clean data set available, you still have to poke and prod and threaten and beg the data into different formats.

And in the wild? Sheesh. Python seems to be the heir to the perl crowd, the "pipe it here and twist it there and cram it" sort of thing. I'd rather be doing that with Python than R.


I agree. It is VERY strange to complain about R in the context of IT/Developers, considering R is widely complained about by statisticians (since .edus have incentive to encourage the purchase of SPSS, SAS, and Mathematica).


> It is amazing how many statisticians are still using pseudo-inverses in numeric computation

I'm not entirely sure what you mean - pseudoinverse/SVD methods definitely have their place.


I fail to believe that computer science graduates also know measure theory.


Except that computer science graduates are not supposed to know measure theory, whereas statisticians ought to know how measure theory underlies probability theory, which in turn underlies statistics.


It is amazing how many statisticians are still using pseudo-inverses in numeric computation.

What is specifically wrong with the pseudo-inverse? Is it that eigenvalues close to zero (that might actually be zero, but are not due to FP error) get magnified immensely? Or is it the general principle that inverting a matrix is usually a bad idea, for reasons that John D. Cook elaborates quite well?

R. It is an atrocious monstrosity. Perhaps statisticians can live with it. Fine, but don't expect good programmers to come out of using R for five years. Python would be far better.

Python has great libraries but I could see Clojure overtaking it as the data science language in the next 10-15 years, and Haskell has some obvious strengths for production systems (strong, static typing).


The following quote is from "Matrix Analysis and Applied Linear Algebra" by Carl D. Meyer, pp. 423-424.

"Generalized inverses are useful in formulating theoretical statements such as those above, but, just as in the case of the ordinary inverse, generalized inverses are not practical computational tools. In addition to being computationally inefficient, serious numerical problems result from the fact that A+ need not be a continuous function of the entries of A. ... Not only is A+(x) discontinuous in the sense that lim_{x->0} A+(x) != A+(0), but it is discontinuous in the worst way because as A(x) comes closer to A(0) the matrix A+(x) moves farther away from A+(0)."

Here A+ denotes the Moore-Penrose pseudo-inverse and A(x) = [1, 0; 0, x] in Matlab notation (or Numpy? I am very rusty on Matlab). And in case you are wondering why "generalized inverses" is plural: there are different kinds of generalized inverses, the Drazin inverse being a well-defined alternative.
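The discontinuity is easy to see numerically. Here is a quick sketch of Meyer's A(x) example in NumPy (my own illustration, assuming NumPy is available):

```python
# For A(x) = [[1, 0], [0, x]], pinv(A(x)) = [[1, 0], [0, 1/x]] when x != 0,
# but pinv(A(0)) = [[1, 0], [0, 0]]: the (1,1) entry blows up as x -> 0
# instead of converging to the value at x = 0.
import numpy as np

def A(x):
    return np.array([[1.0, 0.0], [0.0, x]])

pinv_at_zero = np.linalg.pinv(A(0.0))     # small singular values are truncated
pinv_near_zero = np.linalg.pinv(A(1e-8))  # 1e-8 survives the cutoff and is inverted

print(pinv_at_zero[1, 1])    # tame
print(pinv_near_zero[1, 1])  # huge (~1e8): A(x) near A(0), pinv(A(x)) far from pinv(A(0))
```

So the closer A(x) gets to the singular matrix A(0), the farther its pseudo-inverse moves from pinv(A(0)), which is exactly the "worst way" discontinuity the quote describes.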


Pseudo-inverses (and other direct methods) do have a lot going against them, especially as problems scale. In practice, though, iterative methods introduce their own challenges. Optimal algorithm selection is highly problem-dependent, and often hinges on knowing how to generate an appropriate preconditioner.

GMRES is still pretty much my favorite general method for inverse problems, but it comes with plenty of nuances that you really don't see when using a direct method on a tiny problem that you can bank on being well-conditioned. It's hard to say how often those kinds of problems exist in the real world (in my case: exceedingly rare), but they are still out there.


I would actually bet on Julia before Clojure as the next big data science language.


If I were Kramer, I would dub thee Anti-Statite.


I'm surprised that they didn't mention Leo Breiman's famous paper "Statistical Modeling: The Two Cultures" [0]. A worthwhile read for anyone interested in the topic. I could sum it up, but the abstract does a better job:

----------

Abstract. There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

[0] http://tuvalu.santafe.edu/~aaronc/courses/5352/readings/Brei...


Fantastic paper. I wonder what "irrelevant theory, questionable conclusions" he had in mind.


In my opinion, one big reason for the decline of statistics is the asymmetry in curriculum.

If you are in CS/machine learning, you will be required or encouraged to take a few statistics/probability courses, typically up to a regression course. These are good enough to get you started and familiar with statistical concepts and methodology.

However, if you come from a typical statistics program, programming/computing courses are almost never required. Professors in probability theory/statistics often do not have great computing skills; they might know enough to get around in R/Matlab and teach you the basics of these tools, but a lot of the time you are left in the dark in terms of learning how to code.

If your career goal is to be an analyst, whose job is report-oriented, you can get by knowing just enough R (SAS, Matlab, etc.) scripting, as your deliverables are generally fancy LaTeX PDFs (or presentations). But the current job market pays and values more of a "data scientist", whatever that term means. This type of role typically requires a solid software engineering background, and requires you to be able to work with software developers and commit to the same codebase on a regular basis. The lack of programming/computing skills of a stats major will be a huge minus for this type of responsibility.


Granted, I started college 30+ years ago, so my perspective may be outdated, but I fear that it might not be. What you say about "asymmetry" struck me because it seems like a more widespread issue. Some anecdotes from my college years:

* The CS students did all of their work on the college mainframe, and there was a pervasive attitude that personal computers were toys.

* The earliest adopters of PC's, and of using computers to solve problems rather than as objects of study, were in the physics department.

* Math required students to learn programming, but only one course used it -- numerical analysis.

I wonder if doing symbolic math by hand still dominates math education. Of course I enjoy doing it as a form of recreation, but I think that my use of math in my subsequent career would be stunted, had I not taken an interest in programming outside of the math department.

I also wonder if "the hot stuff in discipline A is being taught in department B" is just a feature of the chaotic nature of scholarship and research, meaning that a student needs to explore more than one field in order to really be top notch in their main chosen field.


> Professors in probability theory/statistics often do not have great computing skills; they might know enough to get around in R/Matlab and teach you the basics of these tools, but a lot of the time you are left in the dark in terms of learning how to code.

Hah, as a random anecdote, I had a graduate random processes class (from the Math department) last quarter where for one of the homework problems in his solution the professor basically used an Excel spreadsheet as a for loop. I was completely dumbfounded.


Here's another relevant HN discussion about the relationship of CS and Statistics:

https://news.ycombinator.com/item?id=6630491

The boosting, random forest, support vector machine, and deep learning ideas should have been developed in Statistics. Why weren't they?

It's much deeper than TI-83 calculators and cookbook AP courses!


Boosting was, I suppose, invented by Schapire and Freund, both CS guys, but its most successful form, gradient boosting, was invented by Friedman, a stats professor. Random forests were developed by Breiman and Cutler, both stats professors. SVMs probably weren't developed in statistics because they don't have a probability model.


It sounds like you're challenging my central claim: conventional statistics has largely ignored the impact of computational technologies on inference, and dismissed the resulting failure as a mere marketing problem.

You are wrong.

Boosting is a generic technique, and reducing it to gradient boosting and then claiming it as "mainstream statistics" is silly. Boosting is not, even today, part of the standard statistics curriculum, despite its wide utility. But the bootstrap is. I wonder why?
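For context, the bootstrap mentioned above is itself a compute-heavy idea that is simple to sketch; a minimal, purely illustrative version (synthetic data, not from any source in this thread):

```python
import numpy as np

# The bootstrap: resample the data with replacement to approximate the
# sampling distribution of a statistic -- here, a 95% percentile interval
# for the mean of a skewed sample.
rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=100)

means = [rng.choice(data, size=data.size, replace=True).mean()
         for _ in range(5000)]
lo, hi = np.percentile(means, [2.5, 97.5])
print(lo, hi)  # interval bracketing the observed sample mean
```

Both the bootstrap and boosting only make sense with cheap computation; the curricular asymmetry is about which one got canonized, not about feasibility.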

Describing Leo Breiman as a stats professor is similarly twisted. He was always an outlier in traditional statistics; in fact, he left a position in the math department at UCLA because he didn't fit in, and began independent consulting, during which time he developed CART. And during his time, years later, as a bona fide stats professor at UC Berkeley, he spent considerable effort overcoming a tendency of mainstream stats to ignore the implications of computing. For instance, he was one of a small number of stats professors who was active in the NIPS community. (His bio at http://projecteuclid.org/download/pdf_1/euclid.ss/1009213290 is absolutely fascinating, see pages 195-6 for his feelings about conventional statistics.)

And don't forget that a lot of the popularity of CART was due to the C4.5 implementation, due to Ross Quinlan, a CS professor with a physics background, which allowed easy building of classifiers on top of trees. Which points out another thing, that these advances have in many cases not come from, strictly speaking, computer science; physicists and engineers have been very influential.

SVM's were developed by Vladimir Vapnik and collaborators, who came out of a Russian math/physics background. Who cares whether SVMs have a model or not? This is an example of the myopia that afflicts old-fashioned statistics. SVMs are highly effective predictors with deep mathematical properties and admit a statistical analysis of their performance and properties. (You don't think every statistician who uses logistic regression really believes this is a model for the data generation process! Yet logistic regression is part of the stats canon.) The SVM analysis ("large margin classifiers") largely took place, again, outside mainstream statistics.

We could say the same thing about neural networks.

And we could say the same thing about deep learning.

NNs: 1980s-1990s. SVMs: 1990s-2000s. Deep learning: 2000s-2010s. In ten years, we'll be adding another example to the list, because this myopia is not just poor marketing skills.


My first stat class was an introductory statistics class that was geared towards biology students. It was pretty terrible.

When I got into CS grad school I took more rigorous statistics courses and fell in love. I mean, head over heels in love. I tried to transfer out of the CS graduate school and into the statistics graduate school. Because I would lose funding and would be starting over, I decided just to quit when I was awarded my masters instead. (Now I just spend my days coding JavaScript and CSS... what happened..)

If I would have been exposed to statistics earlier in life I'm absolutely certain I would have majored in it in addition to computer science.


I did three terms of statistics combined at two different universities and then took the EdX statistics course a few years later. I was not impressed with the way it's taught at any of those places - if that's the usual way to teach statistics, I'm not surprised it's not very popular.

They say "This “engineering-style” research model causes a cavalier attitude towards underlying models and assumptions." - but CS should be easy to verify. Everything I ever did for the CS courses could be looked at and explained exactly why it's right, in simpler terms than what I did in the first place. With statistics I get the feeling it's completely the other way around - you can get some data, answer a question, and it takes a really complicated explanation to prove what you did is right. If you make a simple mistake like using a one-tailed instead of a two-tailed test, you still get a result - but how can you actually test that it's not the right one? I was never taught, nor could I find, any way to do that.

In geometry you can fairly easily throw some more equations at the problem and see your solution matches them too. In CS you can prove some thing in multiple ways, or throw enough data at some algorithm that you can see that both results and complexity bounds are what you expect them to be. The statistics courses I've seen resulted in many people who can go through the usual assignments, but couldn't tell if what they got back was right or wrong.
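The one-tailed/two-tailed pitfall mentioned above is easy to demonstrate: the same data yields two different p-values, and both "look" like valid answers. A hypothetical SciPy sketch (the data here is synthetic, purely illustrative):

```python
import numpy as np
from scipy import stats

# One sample; two hypotheses about it; two p-values, no error message either way.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=50)

t, p_two = stats.ttest_1samp(sample, popmean=0.0)  # two-sided (default)
_, p_one = stats.ttest_1samp(sample, popmean=0.0,
                             alternative='greater')  # one-sided
print(p_two, p_one)  # for a positive t, the one-sided p is half the two-sided p
```

Nothing in the output tells you which test was appropriate; that depends on whether the directional hypothesis was fixed before looking at the data, which is exactly the kind of thing the code cannot check for you.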


As an academic statistician running the largest data science program in the world (10,000+ have completed a class) I think that CS/Data Science only pose a threat to statistics if we don't adapt. I wrote about it here. http://simplystatistics.org/2013/04/15/data-science-only-pos...


I'm not sure what the difference between statistics and machine-learning is supposed to be, seriously. The theoretical foundation of any machine learning algorithm is probability theory and statistics.

Is machine learning implied to be a more "hands on" approach, to focus more on learning a set of techniques than on learning the theoretical underpinning of such techniques? Is it basically "statistical engineering"?

The least charitable interpretation would be that it's just a hipper term for it.

This article pretty much sums up my impression of the divide: http://brenocon.com/blog/2008/12/statistics-vs-machine-learn...


After reading the title I spent some time thinking about what the causes might be before reading the article. To my great amusement the reasons I consider machine learning to be a more vital field than statistics were all given as "structural problems with the CS research “business model”". :)

Take the great recent success of deep learning. This is a method that could never have arisen in statistics, because at this point in time it has no (well, very little) supporting theory. But it works! Given a decade or so we'll likely have a good theory, but we shouldn't let lack of theory hinder the development of applications, and in machine learning we generally don't.

The classic Breiman "Two Cultures" paper is very relevant here.


The irony is that conventional scientists forgot what it was to be scientists! The best science was created without knowing what you were creating.


His attitude seems to be more "slow CS down" than "speed Stats up". He talks about "let's consider the problems in CS first", but then never gets around to seriously discussing the problems of the Stats profession.

The most he seems to admit to is "Stats hasn't been awake, hasn't been present", like it's some kind of defensible excuse. It's a major problem! If you don't show up to the game, you can't say anything about the "competition" playing unfairly or not playing by the right rules or what have you.

We've seen this before in all kinds of discussions. "Them damn kids and their Rubython on Djrails think they invented everything." It paints a picture of some new, young beast hunting down and devouring the old guard bit by bit in some kind of cruel game designed to torture body and mind. Which is completely ridiculous. The new gang never "attacks" the old. They just walk away.

For all sets of tools, the reason people prefer one over another is because they are finding answers quicker and more easily with A rather than B. Nobody cares about the purity of form (or in other words, the ones who do care are statistically insignificant). They care about getting work done.

None of this is to say that working on theory is inferior or superior to working on applied use and engineering. They are all equally valid pursuits. But we all need to be able to admit when people don't share our passions, and not blame them for having their own.

Show up to the game. The game is industry. It's work. Stop focusing on academia as the only legitimate outlet for graduating mathematicians and statisticians. Talking about "adult supervision" is really just armchair quarterbacking. Sit in a meeting with a person with a 30-year-old, unused history degree with absolutely zero understanding of statistics--thinks ALL math is ivory tower bullshit not worth the paper the textbooks are printed on--and explain to him why his desired results can't be obtained from his current data. See how much your statistical models stand up to "let's turn that can't into a can" or "can't is unacceptable, we're not going to leave here until we figure it out".


Statistics is taught wrong from the get-go. Who cares about z-tables and chi values? They distract from the important underlying laws of probability. Statistics has never been more important, and yet the vocabulary is badly outdated, leaving students with a huge mountain of knowledge to climb before reaching anything of real utility.


I don't quite agree, as the standard normal and chi-squared distributions are obviously important to understand. I think what you mean is that teaching stats based on hypothesis testing on simple experiments with small samples is not the best way to introduce students to this world. But the history of statistics was heavily influenced by designed experiments, while most of the data that is out there is observational. That is hard for older generations of statisticians to swallow.


> Who cares about z-tables and chi values?

I think the problem is how these concepts are taught, not the concepts in and of themselves. Too many people leave their graduate-level stats courses (that they take for X scientific/engineering/sociological discipline) believing that 0.90, or 0.05, or 1 SD are somehow magical values. Too few students understand the concepts behind them (I had to work hard on my own to gain a glimpse into them).
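To make the point concrete: the numbers a z-table or chi-squared table hands you are just tail probabilities of particular distributions, computable in one line (a hedged illustration with SciPy, not a claim about how any course teaches it):

```python
from scipy import stats

# Nothing magical about 0.05 or 1.96 -- they are just a tail probability
# of the standard normal and its corresponding quantile.
print(stats.norm.sf(1.96) * 2)    # two-sided tail area for z = 1.96, ~0.05
print(stats.norm.ppf(0.975))      # quantile for alpha = 0.05, ~1.96
print(stats.chi2.sf(3.84, df=1))  # chi-squared tail area, also ~0.05
```

Seeing the thresholds fall out of the distributions this way makes it clearer that 0.05 is a convention, not a law.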


I was expecting a statement like:

"Based on a random sample of institutions surveyed, Statistics is losing ground to CS at a relative mean rate of 5.7% per year (by junior and senior year undergraduate enrollment headcounts, taken over the past five years from the institutions sampled), with a 95% confidence interval of +/- 0.3%."

Ironically, no.


Tsk, tsk, Norman. A good statistician shouldn't use anecdotal evidence with a sample size of one to support his/her points. If I picked up a random stats pub, I bet I could find some missing citations as well.


I wonder if CS should be divided into heavy research and then a serious minor to be applied to other majors.

CS of applications should augment other fields of study. CS of research should be the heavy duty stuff.


I'm currently reading "Discovering Statistics Using R". Even though we're using Python and IPython at work - IPython is simply fantastic! - I hope to integrate the R support magic soon too.



