Improving Augmented Reality Content Creation and Model Training Productivity
By Pradeep Naik and Dr. Gopi
Augmented reality (AR) technology has evolved rapidly in the last few years. As digital transformation puts more pressure on every organization and raises expectations, AR has gained significant attention in enterprises across various industry sectors. The need to improve performance, reduce cycle times, and drive better satisfaction for both employees and customers is pushing enterprises to rethink how work gets done.
AR technology provides tools to overlay digital information onto real-world objects and offer immersive experiences to the user. Whether it is supporting a technician or letting customers explore products, AR applications present a versatile way to increase revenue and profit margins.
Challenges in the Adoption of AR
We have worked with multiple customers to understand the use cases of AR-based applications. We have conducted several pilots on AR assistance applications, which involved the creation of AR content for maintenance and repair procedures involving complex equipment parts/subparts. The build process typically requires:
- Preparing workflows for each procedure, which requires domain knowledge
- Building object recognition models for the required equipment/parts to be used in the workflow
- Optimizing the model for better response and accuracy
- Acquiring the deep-learning expertise needed to optimize the model
For AI-based, guided assistance solutions, the application must be able to recognize the object (equipment part/subpart) and its current state. When building the deep-learning model, the model must be trained with a substantial number of images for each part/subpart so that it can understand the state of the machine at each step of the procedure. There is a huge manual effort involved in capturing the images accurately, annotating the labels on each image, and then initiating the model training.
Creating one AR procedure with 10–12 steps may take about two weeks of effort. This is one of the factors we have observed behind the low adoption and long implementation cycles of AR projects. Our solution not only improves productivity but also reduces the impact of human error and improves accuracy, without requiring skills in AR, AI, or deep learning.
In this blog post, we share innovative content creation solutions for training the models used in AR applications, which help mitigate some of the challenges mentioned above.
Solution 1: Augmentation of Images and Automation of Annotation
To build object detection models with high accuracy, we need large amounts of training data for the deep-learning models. Image capture and annotation are among the first steps in the preparation of AR content. The manual process of capturing and annotating each image is cumbersome, time-consuming, and error-prone. Our solution uses augmentation techniques to generate more images with which to train the model, and automates the annotation. Augmentation transforms the base/reference image into multiple variants using different orientations and features. Once the images are augmented, the automated annotation replicates the original annotations on all the augmented images. The solution that we built has three major components: capture, augmentation, and annotation automation.
Manual Data Collection
Data is critical for training the model. A guided procedure is needed to capture the device data at an optimal distance and cover all angles of the equipment, so that the AR application can detect equipment parts and subparts in real time. Our mobile application enables the AR camera to capture equipment data from all angles at an optimal distance, with guided procedures.
The captured content will be used in augmentation and annotation automation to train the deep learning model.
Augmentation
From the captured data, this solution component uses different augmentation techniques, such as zooming, resizing, flipping, distortion, and rotation, to create additional content. The augmented images are used in the later steps to automate the annotations against the reference-labeled image data.
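As a rough illustration, the sketch below generates augmented variants of a base image with OpenCV. The specific transforms (rotation, flip, zoom, brightness) and their parameter ranges are our illustrative assumptions, not the exact configuration of the solution.

```python
# Illustrative augmentation sketch using OpenCV and NumPy. The transform
# choices and parameter ranges are assumptions for demonstration purposes.
import cv2
import numpy as np

def augment(image, n_variants=10):
    """Generate randomly transformed variants of a base/reference image."""
    rng = np.random.default_rng()
    h, w = image.shape[:2]
    variants = []
    for _ in range(n_variants):
        img = image.copy()
        # Random rotation about the image center.
        angle = rng.uniform(-30, 30)
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        img = cv2.warpAffine(img, M, (w, h))
        # Random horizontal flip.
        if rng.random() < 0.5:
            img = cv2.flip(img, 1)
        # Random zoom: scale, then resize back to the original size.
        scale = rng.uniform(0.8, 1.2)
        img = cv2.resize(img, None, fx=scale, fy=scale)
        img = cv2.resize(img, (w, h))
        # Random brightness shift.
        img = cv2.convertScaleAbs(img, alpha=1.0, beta=rng.uniform(-25, 25))
        variants.append(img)
    return variants
```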
The captured images of equipment parts/subparts are manually annotated using Wipro iX Studio. Here, "annotation" means marking the regions of interest (ROIs) of the various equipment parts and storing them as XML files. These ROIs are used to train the model for object detection.
The image is cropped using the equipment-part ROIs from the reference-labeled image. Each of these cropped, reference-labeled images is then passed to the feature extraction module and stored in a database.
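To make the cropping step concrete, here is a minimal sketch that reads the ROIs from an annotation file and crops them out. We assume a Pascal VOC-style XML layout; the actual schema produced by Wipro iX Studio may differ.

```python
# Crop labeled ROIs from a reference image, assuming a Pascal VOC-style XML
# annotation (the real iX Studio schema may differ).
import xml.etree.ElementTree as ET
import cv2

def crop_rois(image_path, annotation_path):
    """Return {label: [cropped image, ...]} for every ROI in the XML file."""
    image = cv2.imread(image_path)
    crops = {}
    root = ET.parse(annotation_path).getroot()
    for obj in root.iter("object"):
        label = obj.findtext("name")
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (int(float(box.findtext(tag)))
                                  for tag in ("xmin", "ymin", "xmax", "ymax"))
        crops.setdefault(label, []).append(image[ymin:ymax, xmin:xmax])
    return crops
```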
Feature Descriptor Computation
From the cropped reference-labeled image data, we extract the features of the different parts and represent them as scale- and location-oriented descriptors. The features can include color, texture, and shape. The feature descriptors are computed using an image processing algorithm such as Oriented FAST and Rotated BRIEF (ORB) or the scale-invariant feature transform (SIFT). These feature descriptors are compared with the features obtained from the cropped, reference-labeled datasets to automate the annotations on the augmented images.
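A minimal sketch of the descriptor computation with OpenCV's ORB implementation is shown below; SIFT (cv2.SIFT_create) would be a drop-in alternative.

```python
# Compute ORB keypoints and descriptors for one cropped part image.
import cv2

def compute_descriptors(crop):
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=500)  # cap the number of keypoints
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return keypoints, descriptors
```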
Annotation Automation
Automating the annotations requires finding the ROIs of each object in the augmented images and other non-labeled images. These ROIs are used to build a robust dataset for training the model. Because manual annotation is a challenging task, in this approach users annotate only a limited number of images manually; the feature-based matching method then generates annotations for the remaining content automatically.
In any equipment image there will be multiple reference labels, each of which is used to crop the image. The features of these cropped images are compared with the features of the augmented images using a similarity metric, and a rectangular box is drawn wherever the features match. The annotation file (XML) is created from the coordinates of the rectangular boxes for training the model. The box coordinates can vary by around 5% of the drawn box to accommodate the device part within the ROI.
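The sketch below shows one plausible implementation of this matching step: ORB descriptors from a cropped reference part are matched against an augmented image, and when enough matches survive a ratio test, a homography projects the crop's corners into the augmented image to produce the bounding box. The thresholds and the use of a homography are our assumptions.

```python
# Locate a cropped reference part inside an augmented image via ORB feature
# matching plus a RANSAC homography. Thresholds are illustrative assumptions.
import cv2
import numpy as np

def locate_part(crop, augmented, min_matches=15, ratio=0.75):
    """Return (xmin, ymin, xmax, ymax) of the part in `augmented`, or None."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY), None)
    kp2, des2 = orb.detectAndCompute(cv2.cvtColor(augmented, cv2.COLOR_BGR2GRAY), None)
    if des1 is None or des2 is None:
        return None
    # Lowe's ratio test on the two nearest neighbors of each descriptor.
    pairs = cv2.BFMatcher(cv2.NORM_HAMMING).knnMatch(des1, des2, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < min_matches:
        return None
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None
    # Project the crop's corners into the augmented image and box them.
    h, w = crop.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    projected = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
    (xmin, ymin), (xmax, ymax) = projected.min(axis=0), projected.max(axis=0)
    return int(xmin), int(ymin), int(xmax), int(ymax)
```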
The entire process is automated for all augmented images and other unlabeled images. The feature-based matching of device parts and the automated annotation lead to an improved training dataset for real-time detection.
Solution 2: Anomaly Detection in Annotation
Users manually annotate the data used to train the models. While creating this content, users may annotate incorrectly or apply the wrong labels to ROIs. This anomalous data negatively affects the trained model's ability to predict the right ROI for a label and the right class for each ROI.
We have built a solution that helps users identify these anomalies and automatically corrects them, providing a clean annotation dataset for training.
We extract the features of the reference data images and match them against the features of the non-reference data images using computer vision algorithms. A similarity score is computed between the two; if the score for a pair of dataset images is above a predefined threshold, there is no anomaly. If the score falls below the threshold, the image is flagged as anomalous, and the anomaly type, such as a label or ROI mismatch, is then identified.
Feature Descriptor Computation
From the reference datasets, the labeled image regions are cropped for each class and stored in a database. The feature descriptors are computed from the cropped images of each class label using the KAZE algorithm. These features may range from color features to texture features. We obtain the feature descriptors of all class labels from the reference image datasets; these descriptors are then used in the anomaly detection module to find anomalies in the non-reference image datasets. We use the same KAZE algorithm to compute the feature descriptors of the non-reference image datasets, for each label class's ROI.
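The sketch below computes a KAZE descriptor signature for one cropped ROI with OpenCV's built-in implementation. Mean-pooling the per-keypoint descriptors into a single vector, so that two images can later be compared with one cosine similarity, is our simplifying assumption rather than a stated detail of the solution.

```python
# Compute a pooled KAZE descriptor vector for one cropped class-label image.
# Mean-pooling the keypoint descriptors is an illustrative simplification.
import cv2
import numpy as np

def kaze_signature(crop):
    """Return one descriptor vector for the crop, or None if no keypoints."""
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    kaze = cv2.KAZE_create()
    _, descriptors = kaze.detectAndCompute(gray, None)
    if descriptors is None:
        return None
    return descriptors.mean(axis=0)
```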
Anomaly Prediction and Auto-Correction
There are several types of anomalies in non-reference datasets, and we predict them with the help of the reference datasets' features:
- Mismatched labels by the user.
- Enlarged ROIs due to manual annotation.
- Diminished ROIs due to manual annotation.
We compute the similarity between the reference dataset feature descriptors and the non-reference dataset feature descriptors using cosine similarity. If the similarity score is above the predefined threshold (we found 0.7 to work well), there is no anomaly in the class label annotation (or image). Otherwise, there is an anomaly in the class label.
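As a minimal sketch, this check can be expressed as a cosine similarity between the pooled reference and non-reference descriptor vectors (kaze_signature from the earlier sketch), flagged against the 0.7 threshold:

```python
# Flag an annotation as anomalous when the cosine similarity between the
# reference and non-reference descriptor vectors falls below the threshold.
import numpy as np

def is_anomalous(ref_vec, new_vec, threshold=0.7):
    cos_sim = float(np.dot(ref_vec, new_vec) /
                    (np.linalg.norm(ref_vec) * np.linalg.norm(new_vec)))
    return cos_sim < threshold
```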
Once an anomaly is identified, the ROI and mismatched label are corrected using the feature-based matching method described in Solution 1.
We have conducted benchmarking exercises to calculate the productivity and accuracy improvements with the above solutions. The results of our benchmarking are provided in the section below.
Improvements in Productivity and Accuracy
Once the content is generated using the solutions described above, the model is trained using conventional deep-learning techniques. To benchmark the solution, we used a piece of medical equipment with 60 parts. Initially, we annotated 1,100 images manually, with an average of 10 labels per image, and trained the model with a total of 60 label classes. Annotating these images with the manual annotation tool took approximately 100 hours in total. We observed object detection accuracy of 78% after 10 hours of GPU processing for model training. With the augmentation and annotation automation approach, we achieved similar accuracy by manually annotating only 50 images, which took approximately 5–6 hours. This reduced manual effort by more than 60%. Anomaly detection and auto-correction helped correct human errors in labeling the data, further improving the accuracy of the model.
In a typical enterprise, there can be multiple product lines, each with multiple models. For example, a medical equipment maker may have around 100 products, each with 5 models on average. Assuming about 50 parts per model and roughly 1,000 training images per model, there could be approximately 500,000 images (1,000 x 100 x 5) to annotate and train on. Our solutions can help generate this AR content with a saving of over 30,000 person-hours of effort.
This is a huge productivity improvement in content creation, which can drive better adoption of AR implementations across industry domains.
Next Steps
A user-friendly interface for authoring AR content, with automation and better accuracy, is key to successful enterprise adoption of AR. Augmentation and automated annotation help expedite the AR content creation process with better accuracy, which in turn drives AR adoption in the enterprise.
We are currently researching dynamic AR content and procedure creation from actual video capture, using reinforcement learning methods. This will be a major differentiator for our solution in scaling the implementation of AR-based solutions across multiple domains.