Original Post Date: Friday, March 4, 2011

I consistently run into this idea of data-driven estimating.  Yet there is no clear explanation of the concept.  I am not trying to provide one here; what I am interested in is what is at the root of this growing movement.  My take is that it is an attempt to scratch an itch.  But what’s the itch?

I believe it is related to my earlier post (Accuracy is Risky Business).  In the struggle to answer the accuracy question, people have decided that understanding the data used in the estimating process is key to understanding its accuracy.  To a certain degree this makes sense.  It is useful to gain insight into the information upon which a cost estimation model was based.

  • How much data was available?
  • How current was the data?
  • Was it relevant to the project being estimated?
  • What statistical techniques were employed to analyze the data?

In terms of data, two general areas are considered relevant: data quality and fit. Understanding how much data was used, how old it was, and where it came from sheds additional light on the results. Once the amount, age and source of the information are established, the next concern is how the information was analyzed.

Descriptive statistics are often used to summarize a collection of data in quantitative terms. They include measures of central tendency (mean, median, mode), dispersion (standard deviation, range, quartiles) and association (correlation). Additionally, there is statistical inference, which draws conclusions about a population from data obtained through some form of random sampling of that population.
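To make those measures concrete, here is a minimal sketch in Python using only the standard library. The project sizes and costs are invented purely for illustration, and the names `summarize` and `pearson` are my own helpers, not part of any estimating tool.

```python
import statistics
from math import sqrt

# Hypothetical historical projects (illustrative numbers only):
# size in thousands of lines of code, actual cost in thousands of dollars.
sizes = [10, 20, 20, 30, 40]
costs = [120, 210, 190, 330, 400]


def summarize(values):
    """Return common descriptive statistics for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),       # sample standard deviation
        "range": max(values) - min(values),
        "quartiles": (q1, q2, q3),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


size_summary = summarize(sizes)
size_cost_correlation = pearson(sizes, costs)

print(size_summary)
print(size_cost_correlation)
```

Note what the summary does not tell you: whether those five projects resemble the one being estimated. A strong correlation on a handful of points says little about fit.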

The industry is fraught with estimates backed by reams of data surrounded by all the necessary statistics. But does understanding descriptive statistics and statistical inference alone improve our understanding of, and confidence in, an estimate?  If so, why then is expert opinion one of the most widely used forms of estimating?  Why do programs based on these data-driven estimates continue to perform poorly?  Is data-driven estimating the answer?  What do you think?