
Data for Industrial AI: a strategy to settle the chicken-and-egg conundrum!

By Mike Mohseni


There are fundamental differences between developing machine learning (ML) solutions for applications such as movie recommendations at Netflix and developing classification or prediction modules for manufacturing, especially for complex materials processing operations. Consumer websites like Google or Netflix can generate an enormous amount of data every minute. In industrial applications, by contrast, generating a single data point can cost substantial resources. This leads to a common dilemma: should ML software development follow extensive data collection, or should the final ML model guide data collection?


The short answer is that data collection and ML development should progress in parallel. Focusing on either alone raises the risk of collecting the wrong data for an extended time, or of spending resources on complex ML model architectures that manufacturing lines cannot supply with enough data for training and deployment.


Before discussing a practical development strategy, we should understand the root of data scarcity in industrial applications. We define a data point as the smallest unit of data that can be incorporated in ML software development. For example, in signal anomaly detection, every temperature reading from a thermocouple can be considered a data point. This changes if the thermocouple data is used to develop software that predicts the quality of a specific product. In that case, the complete set of thermocouple data together with the properties of the output product makes up a single data point.
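
The difference between the two definitions can be illustrated with a minimal sketch. The `ProductionRun` class, the function names, and all numeric values below are hypothetical and chosen only for illustration; they do not come from any real production line. The sketch counts how many data points the same thermocouple recordings yield under each definition.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ProductionRun:
    """One manufacturing run (hypothetical structure for illustration)."""
    temperatures_c: List[float]  # thermocouple readings recorded during the run
    product_quality: float       # measured property of the output product


def anomaly_detection_points(runs: List[ProductionRun]) -> List[float]:
    """Signal anomaly detection: every individual reading is a data point."""
    return [t for run in runs for t in run.temperatures_c]


def quality_prediction_points(
    runs: List[ProductionRun],
) -> List[Tuple[List[float], float]]:
    """Quality prediction: the full set of readings plus the product
    property of one run together form a single data point."""
    return [(run.temperatures_c, run.product_quality) for run in runs]


# Two made-up runs with four readings each.
runs = [
    ProductionRun([180.0, 182.5, 181.0, 179.5], 0.92),
    ProductionRun([180.5, 195.0, 181.5, 180.0], 0.71),
]

print(len(anomaly_detection_points(runs)))   # 8 data points
print(len(quality_prediction_points(runs)))  # 2 data points
```

The same recordings give eight data points for anomaly detection but only two for quality prediction, which is why the second kind of application runs into data scarcity so quickly.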


Given what constitutes a data point, the amount of data available for ML development in many industrial applications is considerably limited. We discuss strategies for AI and machine learning development with limited industrial data in a later article.


First and foremost, the development of ML software should be directed by a well-defined business case. The development framework and the type of data needed depend heavily on that business case. Once it is clear, the development process should be gradual, in contrast to the turn-key solution development common in the manufacturing sector. This strategy is especially useful in industrial applications where data collection requires substantial time and financial resources, since it ensures stakeholders realize the benefits of new technologies in the production line while ML engineers have access to real-world data sources for continuous development toward the target product.


From the user's, or manufacturer's, point of view, the initial software delivers limited but practical functionality relative to the target product. As the early versions of the product are used in the production line and accumulate data, ML engineers gain access to meaningful data that can be used to improve the early developments and to complete additional functionality.


In conclusion, an efficient strategy for ML-powered product development is a gradual and collaborative approach. ML technology providers or product developers can employ different marketing and sales strategies to give manufacturers an incentive to accept this seemingly prolonged development cycle. We will review some of these strategies in a future article.