When it comes to cost estimating, “Big Data” feels like a blessing. It’s great to have thousands of datapoints to analyze, especially in our data-driven profession. However, if you are the one trying to wrangle messy data into a usable format, it can sometimes feel like a curse!
Lately, my main task has been working with historical data that PRICE® Research has gathered over the years and amalgamating it into a larger, workable database. While I’m excited to see how we can harness old data to help guide future decisions, it has been an arduous task. The data I’m currently working with is structured in flat files (Excel worksheets) with over 100 columns each. The data needs to be checked for quality, normalized, and then placed into a standard format. My biggest challenge is developing a procedure for handling these files that yields a homogeneous database that is transparent, easy to navigate, and accessible for all customers, while also preparing to ingest future data.
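To make the quality-check and normalization steps concrete, here is a minimal sketch of what the first pass over one worksheet might look like in Python with pandas. The column names and helper functions are hypothetical illustrations, not the actual PRICE® schema:

```python
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize column names: strip whitespace, lowercase, snake_case."""
    df = df.copy()
    df.columns = (
        df.columns.str.strip()
                  .str.lower()
                  .str.replace(r"[^\w]+", "_", regex=True)
                  .str.strip("_")
    )
    return df

def quality_report(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, to flag fields needing review."""
    return df.isna().mean()

# Tiny stand-in for one flat-file worksheet; a real run would use
# pd.read_excel() on each source file.
raw = pd.DataFrame({
    " Unit Cost ($)": [120.0, None, 95.5],
    "Qty Produced": [10, 12, None],
})
clean = normalize_columns(raw)
print(list(clean.columns))    # column names after normalization
print(quality_report(clean))  # share of missing values per column
```

Running the same two functions over every worksheet is one way to get a consistent starting point before the source-specific transformations begin.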
Part of creating the correct methodology is finding the right tools to do the work. The types of applications I’m looking at perform what is normally called Extract, Transform, Load (ETL). I have compared about a dozen tools that perform ETL. Many, such as Stitch, were meant for performing simple ETL on data stored in the cloud or in cloud-based applications. Others, such as RapidMiner, Alteryx, and Informatica, are more complex and can perform many pre-established operations.
However, these programs are most useful when the data sources come in a fixed format and are refreshed on a consistent schedule, such as daily or monthly. That does not match my situation: most of my data sources will be updated sparingly (perhaps once a year), if at all. Each data source will come in a different format and structure, so a unique ETL process must be created for each one. In addition, we will most likely need custom scripts written in a programming language to do some of the work for us.
Therefore, these paid software applications do not add much more value than tools we already have access to, such as Microsoft Power Query and Python. So far, I have decided to work with Power Query and Python, but as I encounter new challenges on this project, I may change my mind.
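As one example of the kind of custom script involved, a per-source ETL pass in Python might map each file’s idiosyncratic headers onto the standard schema. The column names and mapping below are hypothetical, just to show the shape of the approach; each incoming flat file would get its own mapping:

```python
import pandas as pd

# Hypothetical mapping from one source's headers to a standard schema;
# a new mapping would be written for each data source's format.
COLUMN_MAP = {"Part No.": "part_number", "Unit $": "unit_cost"}

def etl_one_source(raw: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    """Extract the mapped columns, rename them to the standard schema,
    and coerce cost fields to numeric (bad entries become NaN)."""
    extracted = raw[list(column_map)]                    # Extract
    standard = extracted.rename(columns=column_map)      # Transform: names
    standard["unit_cost"] = pd.to_numeric(
        standard["unit_cost"], errors="coerce")          # Transform: types
    return standard  # Load step would append this to the shared database

source = pd.DataFrame({
    "Part No.": ["A-1", "A-2"],
    "Unit $": ["12.50", "n/a"],   # "n/a" becomes NaN after coercion
    "Notes": ["", ""],            # unmapped columns are dropped
})
standard = etl_one_source(source, COLUMN_MAP)
```

The point is not the specific code but the pattern: a small, source-specific script per file format, all converging on one standard output.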