As I mentioned in the first blog post about my ETL project, I’m working with large historical datasets, including data on hardware components. Over the past several months, it has become clear to me that categorization is a challenge for anyone normalizing a lot of data within the aerospace and defense industry.
Hardware can be categorized in many ways: by functionality, by physical characteristics, by where it’s located on a system, and by what kind of system it goes on. There is no universally accepted standard taxonomy. I suspect that this is because cost estimators work at different levels of detail. The Department of Defense uses MIL-STD-881E to organize Work Breakdown Structures (WBS) for common systems, but sometimes the lowest level still isn’t low enough for component-level estimates. For example, the Fire Control WBS element can be made up of sensors, antennas, displays, and many more components. Government contractors such as Boeing have their own nomenclature, which might not align with MIL-STD-881E. If people who have worked in the field for decades haven’t agreed on a common terminology for hardware, I doubt that somebody at my level of experience can do it!
While these issues may make collecting new data harder, the data that I currently have has already been partially normalized. This is both a good and a bad thing. It’s great that somebody much more knowledgeable than I am has already organized the data within the datasets I’ve been using. However, the terms they used are not clearly defined in a data dictionary, so if I’m confused by the data, there’s no way for me to find out what the previous data labeler meant. In addition, the text often needs to be cleaned.
For example, a datapoint in my current dataset might be labeled “Blade Antenna, Steel Edge”. I would group that into an “antenna” category of components, but there’s other relevant information I’d like to pull from it: it’s a blade antenna, and its edge is made of steel. However, I am still developing standards for how to organize this information. What do I do if a component is composed of multiple materials? Or if there are multiple ways it could be categorized? I want the structure of the data to be clean, but I don’t want to throw out information that might be a relevant cost driver.
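To make the idea concrete, here is a minimal sketch of one way a label like this could be split into structured fields without discarding anything. It is not the approach I have settled on. The keyword sets `CATEGORIES` and `MATERIALS`, the `ParsedLabel` class, and the `parse_label` function are all hypothetical names made up for illustration, and simple keyword matching is assumed in place of any real taxonomy.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical keyword sets, assumed for illustration only.
# In practice these would come from a documented data dictionary.
CATEGORIES = {"antenna", "sensor", "display"}
MATERIALS = {"steel", "aluminum", "titanium", "copper"}


@dataclass
class ParsedLabel:
    raw: str  # the original label is always kept, so nothing is lost
    categories: List[str] = field(default_factory=list)
    materials: List[str] = field(default_factory=list)
    other_terms: List[str] = field(default_factory=list)


def parse_label(label: str) -> ParsedLabel:
    """Sort each word of a label into category, material, or other terms."""
    parsed = ParsedLabel(raw=label)
    # Lowercase the label and pull out alphabetic words, ignoring punctuation.
    for token in re.findall(r"[a-z]+", label.lower()):
        if token in CATEGORIES:
            target = parsed.categories
        elif token in MATERIALS:
            target = parsed.materials
        else:
            target = parsed.other_terms
        # Lists (not single values) allow multiple categories or materials.
        if token not in target:
            target.append(token)
    return parsed


if __name__ == "__main__":
    print(parse_label("Blade Antenna, Steel Edge"))
```

Storing categories and materials as lists is one possible answer to the multiple-materials question, and keeping both the raw label and the unmatched words (here “blade” and “edge”) means a potential cost driver is not thrown away just because it does not fit the current structure.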
In the next ETL blog post, I will elaborate further on cleaning text data and on the potential of using natural language processing to normalize data.