I understand what you think you are predicting, I just can’t work out why you think you predicted it…

predicting

Going to any of the web related conferences it is impossible to not notice how much of these end up being about predictive modelling. Anyone familiar with this modelling will know the literature is dominated by various forms of linear statistics. Here, you take two or more variables and map them onto each other using some probability distribution and a function representative of a straight line. There are then a range of hypothesis testing things you can do like calculating p-values and the rest.

This has led to many people you talk to seemingly believing that a statistical relationship between two variables exists if and only if there is a p-value of less than some arbitrarily chosen level – say 0.1. This is a very odd conclusion to make for the following reasons;

  1. Mapping two variables together with a function of a straight line is a rather stringent condition. There are an almost infinite variety of other functions which may be more appropriate
  2. The 0.1 only makes sense if you can also assume that the mean and standard deviation of either variable are infinitely stable – when considering most human behaviours this seems rather unlikely
  3. There are a whole range of probability distributions that exist. Why choose one over the other?
  4. Why 0.1, what if I only expected my mapping to be accurate 50% of the time?

I think misunderstandings or not discussing the above points causes a lot of bother. Either, people find it easy to dismiss a result based on the fact that they don’t believe such a model can capture the features of the system or, people feel like it is enough to run and off the shelf statistics package and report findings as if the statistic is explanation enough for their conclusions.

Comments are closed.