In this blog post I will talk a bit about Natural Language Processing (NLP) and how it contributes to our work on the Data Set Initiative, along with the extract, transform, load (ETL) of data in general. NLP sits at the intersection of linguistics, computer science, and Artificial Intelligence (AI). As you can see in the picture below, AI also encompasses Machine Learning (ML) and Deep Learning (DL).
Source: Data Science Foundation
We are focusing on the green area.
I recently studied Natural Language Processing to try to solve some of the problems I described in my last blog post. As I stated before, we needed to come up with a classification system for the data that we have.
Categorizing legacy data is one of the challenges we’ve encountered during our work on the Data Set Initiative. The hardware component categories are not formally documented in the data. Unfortunately, there is no standard dictionary or taxonomy of hardware component terms used throughout the aerospace and defense cost estimating industry. Since NLP depends on the machine “understanding” the meaning of words (usually by being trained on other text documents), this made implementing most NLP algorithms impractical.
Another big issue is that the dataset we are working with is highly imbalanced: most hardware component categories are represented only once. This made implementing a Machine Learning technique for classification purposes extremely difficult.
For now, we have written an algorithm that captures certain keywords for hardware component types. I have compiled a list of these keywords based on my experience. Some keywords capture component types (“Primary”), while others are more like subcategories (“Secondary”) that might be useful for additional analysis of the data. Below is a small sample from the beginning of the list:
For example, for the data point “Blade Antenna, Steel Edge” from the previous blog post, we would want to extract “Antenna” and “Blade Antenna” as primary categories. In addition, we would extract the secondary term “Steel”, as that might be an important characteristic for anyone doing analysis on the dataset.
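A minimal sketch of this keyword-capture idea is below. The keyword lists and function name are illustrative assumptions for this post, not our actual lists, which are much longer:

```python
# Illustrative keyword lists -- the real lists are longer and were
# compiled from domain experience, not from any standard taxonomy.
PRIMARY_KEYWORDS = ["Blade Antenna", "Antenna"]
SECONDARY_KEYWORDS = ["Steel", "Aluminum"]

def extract_categories(description):
    """Return (primary, secondary) keyword matches for one data point."""
    text = description.lower()
    primary = [kw for kw in PRIMARY_KEYWORDS if kw.lower() in text]
    secondary = [kw for kw in SECONDARY_KEYWORDS if kw.lower() in text]
    return primary, secondary

primary, secondary = extract_categories("Blade Antenna, Steel Edge")
# primary -> ["Blade Antenna", "Antenna"], secondary -> ["Steel"]
```

A real implementation would also need to handle plurals, abbreviations, and overlapping matches, but simple substring matching is enough to show the idea.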
This algorithm is not cutting-edge NLP technology, but it is useful for pre-processing text data. It still doesn’t solve all of our problems, though. A user would still have to check the data for errors (or for new terms that are not yet part of the keyword list). They might also have to normalize their data in certain ways, such as checking for spelling mistakes. But it could still save significant time when labeling large amounts of data.
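The normalization step mentioned above can be as simple as a small cleanup pass before keyword matching. This is a hedged sketch, not our production code; anything beyond lowercasing and punctuation handling (e.g. actual spell correction) would need a dedicated library:

```python
import re

def normalize(description):
    """Light text normalization applied before keyword matching."""
    text = description.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text

normalize("Blade  Antenna,  Steel-Edge")  # -> "blade antenna steel edge"
```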
Do you want to learn more about the challenges and successes of big data in the cost estimation industry? Join us in a webinar I will be cohosting on May 26th with Arlene Minkiewicz!