What we should teach and you should learn.

by Andreas Baumann

Normally, the stats curriculum taught to social scientists tends to emphasize inferential techniques such as analysis of variance and regression over descriptive and dimensionality-reducing techniques such as factor analysis or cluster analysis.

Sooner or later, we’re going to have to change this.

The future of data analysis, whether in the natural sciences, computer science or social science, is dealing with “Big Data” (or, as I’ve heard it called, organic data, as opposed to the designed data of an RCT). Normally, when we deal with these complex data sets, we have a very large body of information (large both in the number of data points and in dimensionality), and most of the measures it contains are very poor measures of the latent traits we’re really interested in. For exactly this reason, it’s going to be more and more important for us to be able to extract the latent traits from the data.

I therefore think that majors in sociology, political science, etc. should move towards emphasizing:

  • Data harvesting in real life: How to extract the information you’re interested in from a social media platform or a database.
  • Data processing: How to adapt the data set for use in an investigation.
  • Dimensionality reduction: How to get from the manifest variables in the data set to the latent variables you’re interested in.
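
To make the last point concrete, here is a minimal sketch of getting from several manifest variables to one latent score. It uses principal component analysis as a simple stand-in for factor analysis, which is a swapped-in technique and not necessarily what a course would teach first. The data are simulated, and the function name, the loadings and the noise level are all assumptions made purely for illustration.

```python
import numpy as np


def first_principal_component(X):
    """Return scores, loadings and explained-variance share of the first component."""
    X = np.asarray(X, dtype=float)
    # Standardize each manifest variable so that no single scale dominates.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The right singular vectors of Z are the principal axes.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[0]
    scores = Z @ loadings
    explained = s[0] ** 2 / np.sum(s ** 2)
    return scores, loadings, explained


# Simulated example (assumption): one latent trait, four noisy manifest measures.
rng = np.random.default_rng(0)
latent = rng.normal(size=500)
weights = np.array([0.9, 0.8, 0.7, 0.6])  # hypothetical strength of each measure
X = latent[:, None] * weights + 0.5 * rng.normal(size=(500, 4))

scores, loadings, explained = first_principal_component(X)

# The sign of a principal component is arbitrary, so compare absolute correlation.
corr = abs(np.corrcoef(scores, latent)[0, 1])
print(f"share of variance on first component: {explained:.2f}")
print(f"|correlation| between scores and latent trait: {corr:.2f}")
```

In real data the latent trait is of course unobserved, so the final correlation check is only possible because the example is simulated. Note also that the whole computation rests on a single singular value decomposition, so a little linear algebra goes a long way here.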

Most of this should come as an “add-on” to the regression courses currently offered, not as a substitute for them. Indeed, maybe a shared foundation in introductory linear algebra is the way to tie these two sets of statistics courses together.