Automatic Retraining of AI Systems for Development of New Molecules and Materials

By Sunil Gupta posted Thu February 24, 2022 02:35 AM

  

NOTE: This is the open preprint of the article.

1. Introduction

The problem of the “unknown” exists for almost every AI system. In the context of AI, the word “unknown” has several meanings: unknown data types, wrong data formats, unknown classes of data, data insufficiency, etc. Within the scope of this article, all of these meanings are grouped under the term “unknown data”.

Unknown data leads to significant quality and accuracy losses in four of the five main machine learning (ML) tasks: classification, clustering, regression and dimensionality reduction. The problem of unknown data is especially important for applied AI in the development of new molecules and materials. The cost of mistakes here runs into billions of dollars and a lack of new materials and substances, for example in healthcare, which is especially critical in COVID times.

One of the specific features of AI projects in the development of new materials is a lack of data. Thus, solving the task directly with artificial neural networks (ANNs) and deep learning is not always possible. Such systems need complicated preprocessing modules that cannot easily be adapted to unknown data types.

The problem of unknown data has been addressed in such AI/ML directions as lifelong learning (LL), generative adversarial networks (GANs) and unsupervised learning. However, these directions have several general drawbacks:

  1. They require significant volumes of data and computation power.
  2. They require significant time frames for training and for scaling to new data types and classes.
  3. They cannot handle new, unknown data types/classes without updating.

These drawbacks make the mentioned solutions insufficient for use in a major part of custom AI projects for new material development (NMD).

2. Problem

APRO Software faced the “unknown data” problem during the development of an AI solution for NMD for healthcare needs. An additional challenge in this project was the requirement that the system fit itself to new data automatically, even though that data could in many cases be considered “unknown”. The challenge had the following manifestations:

  1. Insufficiency of data.
  2. Variations in the methods of obtaining data.
  3. A wide range of material descriptions and characteristics.
  4. Ambiguity in the criteria for assessing the required characteristics.
  5. The proximity of the features of irrelevant materials to those of relevant ones.

These aspects, together with the automatic system fitting requirement, made the project quite complicated to implement, but a solution was found.

3. Solution

The solution is based on the concept of lifelong learning in combination with supervised learning and classical statistical methods. At the system (software) level it is represented as a “re-training AI Core layer”.

The solution is AI-based (ANNs) with partial human-driven control over the fitting of the most important parts of the AI system. It has the following features for controlling the “unknown data” problem:

  1. AI-based quality control sub-system;
  2. AI-based checks for unknown data cases (for example, if data contain information about the wrong subject areas);
  3. AI-based recognition of unknown features and adaptation of processing according to this information (for example, it can detect properties of unknown types and adjust the meta-data generation and the material assessment process based on this event);
  4. AI-based use of new data for fitting (re-training) and updating the system to ensure better quality and accuracy of predictions (NMD).

These features fully cover the requirements. Descriptions of the approach used are presented below. The re-training AI Core layer is a wrapper around the AI Core of the system. It automatically controls the fitting processes and unknown data cases (Image 1).

Image 1. -  Re-training AI core layer
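To make this structure more concrete, the sketch below shows, in simplified Python, how such a re-training layer could wrap an AI Core: preprocessing checks, a quality validation loop, a data accumulator and a re-training controller. All class and method names here are illustrative assumptions, not the actual implementation.

```python
# Minimal structural sketch of a re-training AI Core layer (illustrative names only).
class RetrainingAICoreLayer:
    def __init__(self, ai_core, preprocessor, quality_controller, accumulator, retrain_controller):
        self.ai_core = ai_core                        # wrapped AI Core (ANNs + classical ML)
        self.preprocessor = preprocessor              # unknown-data checks and fitting (Section 3.2)
        self.quality_controller = quality_controller  # NFL-based quality validation (Section 3.1)
        self.accumulator = accumulator                # storage for samples, features, meta-data
        self.retrain_controller = retrain_controller  # re-training logic (Sections 3.3-3.4)

    def process(self, raw_sample, max_adjustments=3):
        sample = self.preprocessor.check_and_fit(raw_sample)
        if sample is None:                            # rejected as wrong format / subject area
            return None
        result = self.ai_core.run(sample)
        for _ in range(max_adjustments):              # negative feedback loop on quality
            if self.quality_controller.accept(result):
                break
            self.quality_controller.adjust(self.ai_core)
            result = self.ai_core.run(sample)
        self.accumulator.store(sample, result)        # accumulate for later re-training
        return result

    def maintenance_cycle(self):
        # intended to run during periods of inactivity
        if self.retrain_controller.should_retrain(self.accumulator):
            self.retrain_controller.retrain(self.ai_core, self.accumulator)
```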

3.1 AI quality control sub-system

The suggested method includes an AI-based quality control sub-system, which is based on ANNs for image processing and classical computer vision (CV) normalization methods. The quality sub-system is part of the Quality validation controller. It controls all the processing steps inside the AI Core and assesses the quality of each sub-module using specially developed metrics. It also calculates the quality and accuracy of the entire system.

The controller operates on a set of flexible rules that decide whether to continue processing the input data or to adjust the system automatically and repeat the processing of that data. Decisions for each AI Core module are made independently.

The controller works on the principle of a negative feedback loop (NFL) and is based on convolutional neural networks (CNNs). If the controller decides that adjustments are needed, it applies the corresponding impact to the appropriate module.

The novelty of this approach is that the adjustment process occurs recursively and includes not only classical ML algorithms but also the impact on ANNs.
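A minimal sketch of how such a negative-feedback controller could be expressed in Python is given below. Metric names, thresholds and the adjustment callbacks are assumptions made for illustration; the real system uses its own specially developed per-module metrics.

```python
# Hedged sketch of an NFL-style quality validation controller.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QualityRule:
    metric: str                     # e.g. "segmentation_iou" or "noise_level" (assumed names)
    threshold: float                # minimum acceptable value for that metric
    adjust: Callable[[], None]      # impact applied to the offending module


class QualityValidationController:
    def __init__(self, rules: List[QualityRule], max_passes: int = 3):
        self.rules = rules
        self.max_passes = max_passes

    def validate(self, run_module: Callable[[], Dict[str, float]]) -> Dict[str, float]:
        """Run a module, check its metrics, and recursively adjust and re-run if needed."""
        metrics = run_module()
        for _ in range(self.max_passes):
            failing = [r for r in self.rules if metrics.get(r.metric, 0.0) < r.threshold]
            if not failing:
                break
            for rule in failing:
                rule.adjust()       # negative feedback: push the module back toward spec
            metrics = run_module()  # repeat processing of the same data
        return metrics
```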

3.2 AI checks for unknown data cases

AI-based data checks for unknown data cases are part of the Preprocessing module (see Image 1). This module handles two main cases within the scope of the “unknown data” problem:

  1. Intelligent checking and rejection of wrong data formats and data samples.
  2. Intelligent fitting of insufficient data samples to an acceptable state.

Examples of completely wrong data are disallowed data formats (including wrong file types) or data samples that represent a wrong subject area. Such samples (the first case) should be excluded from both processing and re-training.

Examples of insufficient data samples include data provided by hardware modules that are not fully supported. Such data should be transformed into a state in which it can be used for processing.

Cases 1 and 2 are interrelated, but in this article they are presented as independent cases.

3.2.1 Intelligent checking and rejection

The module uses simple algorithmic checks (for cases such as wrong data formats) and ANN-based ML approaches for checking for wrong subject areas. A simplified structure of the module is presented in Image 2.


Image 2. -  Intelligent checking and rejection

ANN-based multiclass segmentation attempts to segment a data sample into N classes of features. These “N classes” are themselves subject to fitting (as shown below). Each class has its own metrics and flags that characterize the degree to which that class of features is present in the sample. Based on these degrees and their mutual relations, it is possible to judge which domain the sample data belongs to and whether it can be accepted for further processing or should be rejected.

The originality of the approach is the use of specific features of subject areas rather than general macro criteria.
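As an illustration of this idea, the hedged sketch below computes per-class “degree of presence” scores from a segmentation mask and accepts a sample only if enough of it belongs to classes specific to the expected subject area. The class indices, the threshold and the random mask are assumptions, not values from the real system.

```python
# Illustrative accept/reject decision based on per-class presence degrees
# taken from an ANN segmentation mask.
import numpy as np


def class_presence_degrees(seg_mask: np.ndarray, n_classes: int) -> np.ndarray:
    """Fraction of pixels assigned to each feature class in a segmentation mask."""
    counts = np.bincount(seg_mask.ravel(), minlength=n_classes)
    return counts / seg_mask.size


def accept_sample(seg_mask: np.ndarray,
                  n_classes: int,
                  relevant_classes: set,
                  min_relevant_share: float = 0.05) -> bool:
    """Accept a sample only if enough of it belongs to the relevant subject-area classes."""
    degrees = class_presence_degrees(seg_mask, n_classes)
    relevant_share = sum(degrees[c] for c in relevant_classes)
    return relevant_share >= min_relevant_share


# Usage: a synthetic 64x64 mask with 6 feature classes, classes 1-3 treated as relevant.
mask = np.random.randint(0, 6, size=(64, 64))
print(accept_sample(mask, n_classes=6, relevant_classes={1, 2, 3}))
```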

3.2.2 Intelligent fitting

Intelligent fitting is used for data sample normalization. The system does not know all possible tolerances, which means that the preprocessing cannot be implemented using classical methods alone. Thus, it is based on two-level intelligent preprocessing:

Level 1: Classical methods that fit samples to the correct format and partially fix artefacts from third-party processing (Image 3).


Image 3. -  Intelligent fitting

Level 2: Intelligent processing (based on ANNs) assesses distortions and, where possible, calculates transformation matrices and corrections that compensate for them. This level makes the final correction of third-party distortions.

The novelty of the approach is the use of multi-level preprocessing with a deep learning level (ANNs).
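A hedged sketch of such two-level preprocessing is shown below: Level 1 applies a classical intensity normalization, while Level 2 applies a correction returned by a hypothetical distortion-assessment model standing in for the ANN. The percentile clipping, the dummy model and the affine correction are illustrative assumptions, not the actual pipeline.

```python
# Two-level preprocessing sketch: classical normalization, then a model-driven correction.
import numpy as np
from scipy.ndimage import affine_transform


def level1_classical_fit(sample: np.ndarray) -> np.ndarray:
    """Rescale intensities to [0, 1] and clip obvious outliers (classical fixes)."""
    lo, hi = np.percentile(sample, [1, 99])
    sample = np.clip(sample, lo, hi)
    return (sample - lo) / max(hi - lo, 1e-8)


def level2_ann_correction(sample: np.ndarray, distortion_model) -> np.ndarray:
    """Apply the transformation estimated by the (assumed) ANN-based distortion assessor."""
    matrix, offset = distortion_model(sample)      # e.g. 2x2 matrix plus a shift vector
    return affine_transform(sample, matrix, offset=offset, order=1)


def preprocess(sample: np.ndarray, distortion_model) -> np.ndarray:
    return level2_ann_correction(level1_classical_fit(sample), distortion_model)


# Usage with a dummy "model" that predicts a small rotation-like correction.
dummy_model = lambda s: (np.array([[1.0, 0.02], [-0.02, 1.0]]), np.zeros(2))
corrected = preprocess(np.random.rand(64, 64), dummy_model)
```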

3.3 AI recognition of unknown features

The recognition of unknown features is based on a mix of deep learning and classical statistical methods. Each prepared data sample (Image 3) is accumulated in the Data Accumulator (Image 1). The Data Accumulator can be described as data storage that collects data samples, segmented features and meta-data from the AI Core.

One of the re-training controllers periodically activates statistical analysis in the Data Accumulator. The period duration depends on the rate at which the data volume in the accumulator grows. If the statistical characteristics show that a large share of the data falls into the special “unknown data” classes, the controller activates the fitting process (the system tries to perform these operations during periods of inactivity).
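This trigger logic can be illustrated by the small sketch below, which activates re-training once the share of accumulated samples falling into the “unknown data” classes passes a threshold. The class labels and the 20% threshold are assumptions made for the example.

```python
# Illustrative re-training trigger based on the share of "unknown data" samples.
from collections import Counter


def should_trigger_retraining(accumulated_labels, unknown_labels=("unknown",), threshold=0.2):
    """Return True if the fraction of samples in the unknown classes is high enough."""
    if not accumulated_labels:
        return False
    counts = Counter(accumulated_labels)
    unknown_share = sum(counts[label] for label in unknown_labels) / len(accumulated_labels)
    return unknown_share >= threshold


# Usage: 3 of 10 accumulated samples landed in the "unknown" class, so re-training is triggered.
labels = ["a", "b", "unknown", "a", "unknown", "b", "a", "unknown", "b", "a"]
print(should_trigger_retraining(labels))   # True (share 0.3 >= 0.2)
```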

After activation, the re-training layer starts the recursive fitting process. All the data from the Data Accumulator is sequentially passed through the AI Core. It passes through all the stages used for normal analysis, but the final result goes to the memory of the Re-training processing controller.

The controller accumulates the results and performs classification (see Image 4) using predefined groups of features. Next, the controller performs clustering and a statistical assessment of the results in order to segment new features.

The process is recursive and continues as long as the statistical characteristics allow it. The process of recursive assessment of unknown features is presented in Image 5.
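The “classification, then clustering” step can be sketched as follows: feature vectors that could not be assigned to any known group are clustered to propose candidate new classes. DBSCAN is used here only as an example of a method that does not require the number of new classes in advance; the actual clustering method and parameters in the system may differ.

```python
# Hedged sketch: cluster unassigned feature vectors to propose candidate new classes.
import numpy as np
from sklearn.cluster import DBSCAN


def propose_new_feature_classes(unassigned_vectors: np.ndarray,
                                eps: float = 0.5,
                                min_samples: int = 5):
    """Cluster feature vectors that did not match any known class; label -1 marks noise."""
    cluster_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(unassigned_vectors)
    new_classes = sorted(set(cluster_labels) - {-1})
    return cluster_labels, new_classes      # n_i = len(new_classes) for this iteration


# Usage: two synthetic blobs stand in for feature vectors the classifier could not assign.
rng = np.random.default_rng(0)
vectors = np.vstack([rng.normal(0.0, 0.05, size=(100, 16)),
                     rng.normal(1.0, 0.05, size=(100, 16))])
cluster_labels, new_classes = propose_new_feature_classes(vectors)
print(f"proposed n_i = {len(new_classes)} new feature classes this iteration")
```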


Image 4. -  New features clusterization

The initial number of known class elements (features) is N. Each iteration adds n_i new classes, so after M iterations the system is able to operate with N + Σ n_i elements, where i = 1, 2, 3, …, M (Image 5).

Image 5. -  Unknown elements recognition iteration

The application of this approach is in itself unusual and new. Working with the elements carries an acceptable risk, because the risk is distributed among the elements of the AI system. Sequential processing with different deep learning methods does not allow uncontrolled growth of errors. Thus, this approach is safe for use in critical NMD projects.

3.4 AI data fitting (re-training) and updating

The most important and most complicated part of the project is the re-training of the entire AI Core. This sub-module has a lot in common with the process of recognizing unknown data elements (features). The specifics of NMD for the healthcare sector impose additional requirements on the reliability of this process. Re-training encapsulates elements of recognition and updates all N features as well as the high-level ANNs for meta-data processing.

The process has a start trigger: once the Data Accumulator has collected a volume of data sufficient for re-training, the updating process is initiated. The AI Core performs feature detection and segmentation (the same as for unknown data recognition). Features with the best recognition quality are transformed into labelled training data sets that are used for updating the system. The process also has a validation phase and can be classified as supervised learning. For simple features, the process can include several updates (sub-iterations) in a row (see Image 6).
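The selection of high-quality recognitions as training data can be illustrated with the short sketch below. The confidence scores and the 0.9 cut-off are assumptions; in the described system the quality of recognition comes from the quality validation metrics.

```python
# Illustrative selection of high-confidence recognitions as labelled training data.
def build_retraining_set(samples, predictions, confidences, min_confidence=0.9):
    """Keep only (sample, predicted_label) pairs whose recognition quality is high."""
    return [(sample, label)
            for sample, label, conf in zip(samples, predictions, confidences)
            if conf >= min_confidence]


# Usage: of three recognized features, only the two confident ones become training data.
samples = ["sample_001", "sample_002", "sample_003"]
preds = ["class_A", "class_B", "class_A"]
confs = [0.97, 0.55, 0.93]
print(build_retraining_set(samples, preds, confs))   # two labelled pairs survive the cut-off
```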

Meta-data, which is mainly represented by complex features, should be validated by an expert (human-driven validation; Image 1, Expert Review). This specific meta-data contains far more “unknowns” than modern custom AI solutions can cover (only general AI solutions, such as IBM Watson, could perform this validation instead of a real expert). So, expert validation is the only possible option for validating complex abstract NMD data at each iteration.

The simplified process for one iteration and two sub-iterations of the update is shown in the image below:

Image 6. -  Simplified AI Core updating diagram


The novelty of this approach is that a person (an expert) is included in the process of validating and training the AI system. At the same time, human participation is required only for the validation of single cases, and the results are entered into the system using data augmentation. With such a training method, a large number of iterations can be completed independently, without human participation.
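The sketch below illustrates how a single expert-validated case could be expanded by data augmentation before re-training, so that the expert labels only one sample per case. The particular augmentations (flips, mild noise) are generic examples and not necessarily those used in the actual NMD system.

```python
# Hedged sketch: expand one expert-confirmed sample into several labelled training variants.
import numpy as np


def augment_validated_case(sample: np.ndarray, label: str, n_copies: int = 8,
                           noise_scale: float = 0.01, seed: int = 0):
    """Generate several labelled variants of one expert-validated sample."""
    rng = np.random.default_rng(seed)
    augmented = [(sample, label)]
    for i in range(n_copies):
        variant = sample.copy()
        if i % 2 == 0:
            variant = np.fliplr(variant)                                    # geometric augmentation
        variant = variant + rng.normal(0, noise_scale, variant.shape)       # mild noise
        augmented.append((variant, label))
    return augmented


# Usage: one validated 32x32 sample becomes nine labelled training examples.
batch = augment_validated_case(np.random.rand(32, 32), "new_feature_class")
print(len(batch))
```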

4. Conclusion

During the implementation of an AI-based project for NMD, a set of approaches to the problem of working with unknown data was developed and successfully applied. The solutions are based on the concept of lifelong learning in combination with supervised learning and classical statistical methods.

The solution includes 4 main directions:

Direction 1: Use of active AI-based quality control, which can make adjustments in each part of the system. The novelty of this direction is that the system adjustment process occurs recursively and involves not only classical ML algorithms but also ANNs.

Direction 2: Use of AI-based checks for segmenting subject areas. The originality of the approach is the use of specific features of subject areas rather than general macro criteria.

Direction 3: AI-based recognition of unknown data features. The application of this approach is in itself unusual and new. Working with property elements carries an acceptable risk, because the risk is distributed among the elements of the AI system. Sequential processing with different deep learning methods does not allow uncontrolled growth of errors. Thus, this approach is safe for use in NMD projects for healthcare.

Direction 4: An AI re-training and validation process with independent human-based supervised learning. The novelty of this approach is that a person (an expert) is directly included in the process of validating and training the AI system. At the same time, expert participation is required only for the validation of single cases, and the results are entered into the system using data augmentation.

The described set of approaches is applicable in the field of NMD projects and has high reliability. It also has good potential as a solution for custom AI healthcare projects.

 




#GlobalAIandDataScience
#GlobalDataScience