AI-based Solution for Enhancing Low Light and Dark Images for Field Assistance in AR

Introduction

Recently, demand for artificial intelligence solutions that improve computer vision has grown across many fields. Solutions such as object detection, fault detection, environment description, and scene prediction are helping to solve many real-life problems. However, these solutions depend on vision-based computations. In dark conditions, the photon count that a camera can capture decreases drastically, and the environment may not be visible. In this scenario, the system will fail at every task that depends on the visibility of the environment, resulting in poor accuracy.

In artificial intelligence solutions based on vision data, it is important to process low light and dark images to draw intelligence from them. Our deep learning architecture makes full use of the limited information available in low light scenarios. The algorithm processes the raw data captured by the camera sensor and produces enhanced JPEG images, which are then used to train the object detection model to detect objects in frames.

High-Level Solution Approach for Image Enhancement

To solve the problem of detecting various parts of equipment in a low light environment, our approach uses an encoder-decoder model for low light image enhancement from raw images. The camera sensor is exposed to the natural environment, and the video stream is passed to the low light recommender system, which predicts the presence of low light conditions. The recommender is a classifier model built on bright and low light images, and it recommends switching to low light mode when low light conditions are present. If low light is detected, the application starts collecting raw images from the camera sensor in a video stream for image enhancement. The raw images are preprocessed and passed to the encoder-decoder model, which is trained on a dataset of short exposure raw images with the corresponding long exposure images as ground truth. The output of the model is a smooth, enhanced image that is sent for further computation in object detection. A minimal sketch of this recommender-gated pipeline is shown after figure 1.

Figure 1: Solution Approach
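The gating logic of this pipeline can be sketched in a few lines. The following is a minimal illustration, not the production implementation: the model file names, the 224x224 classifier input size, and the 0.5 probability threshold are assumptions for the example.

```python
# Minimal sketch of the recommender-gated enhancement pipeline.
# Model file names, input size, and threshold are illustrative assumptions.
import numpy as np
import tensorflow as tf

low_light_classifier = tf.keras.models.load_model("low_light_classifier.h5")
enhancement_model = tf.keras.models.load_model("unet_enhancer.h5")

def process_frame(frame_rgb, raw_capture_fn):
    """frame_rgb: HxWx3 uint8 preview frame; raw_capture_fn: callable that
    returns the packed raw sensor data when low light mode is switched on."""
    x = tf.image.resize(frame_rgb, (224, 224)) / 255.0
    p_low_light = float(low_light_classifier(x[None, ...])[0, 0])
    if p_low_light < 0.5:
        # Bright scene: pass the frame straight to object detection.
        return frame_rgb
    # Low light detected: switch to raw capture and enhance.
    raw = raw_capture_fn()                        # packed raw input, e.g. HxWx4
    enhanced = enhancement_model(raw[None, ...])[0]
    return np.clip(enhanced.numpy() * 255.0, 0, 255).astype(np.uint8)
```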

Once the low light image is enhanced, it is used to detect information from various parts of the equipment as objects. SSD models are trained on the collected data using transfer learning methods and ported to mobile devices. The object detection module detects the objects required by the maintenance and repair procedure used by the field technician. ARKit libraries are used for augmentation on iOS devices and ARCore libraries for augmentation on Android devices.

Image Enhancement on low light images

The output of the image enhancement model needs to be smooth with low noise for better results from the object detection module. We have used the U-net architecture for enhancement of low light images. An encoder-decoder structure is used within the U-net architecture because the input (raw sensor data) and the output (enhanced image) are in different formats. With the given input data, the model is trained end to end to enhance images captured in low light conditions. The model needs to learn the boundaries in the images and the colors of each object, similar to an image segmentation problem.

The U-net architecture was initially used for image segmentation of biomedical images. The method was later extended to other major applications, such as pan-sharpening remote sensing images using pixel-wise regression, where the model learns the regression relationship between multi-resolution image features and target image pixel values, and volumetric segmentation that learns from sparsely annotated data. Image segmentation remains the most widely used application of the U-net architecture, as described in the references.

The U-net architecture consists of a contracting path for the encoder region, typically a convolutional neural network with ReLU activations and max pooling operations, and a symmetric expanding path for the decoder region, typically consisting of transposed convolutions.
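For reference, a minimal U-net sketch in Keras is shown below. The filter counts, the number of down- and up-sampling levels, and the packed 4-channel raw input with a depth-to-space RGB output (as in the SID-style pipeline) are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal U-Net sketch: contracting path (conv + max pooling), symmetric
# expanding path (transposed conv + skip connections). Depths and filter
# counts are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_unet(input_shape=(None, None, 4)):
    inputs = tf.keras.Input(shape=input_shape)

    # Contracting path (encoder).
    c1 = conv_block(inputs, 32); p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, 64);     p2 = layers.MaxPooling2D()(c2)
    c3 = conv_block(p2, 128);    p3 = layers.MaxPooling2D()(c3)

    b = conv_block(p3, 256)      # bottleneck

    # Expanding path (decoder) with skip connections.
    u3 = layers.Conv2DTranspose(128, 2, strides=2, padding="same")(b)
    c4 = conv_block(layers.Concatenate()([u3, c3]), 128)
    u2 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(c4)
    c5 = conv_block(layers.Concatenate()([u2, c2]), 64)
    u1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c5)
    c6 = conv_block(layers.Concatenate()([u1, c1]), 32)

    # 12 output channels -> depth_to_space yields a full-resolution RGB image.
    out = layers.Conv2D(12, 1, padding="same")(c6)
    out = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(out)
    return tf.keras.Model(inputs, out)
```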

Dataset setup for training the U-net Architecture

Training a deep learning model requires a dataset in the input format on which the trained model will predict and compute its output. In our approach, the input to the model is an image taken in low light conditions, which is passed to the encoder-decoder model, and the output is an enhanced image with enough features to detect objects in it. Therefore, we collected image data in dark conditions for training the model, along with corresponding ground truth images. The main challenge in this procedure was collecting the ground truth images. We reviewed existing literature and research work on obtaining ground truth for low light images. We followed the SID (See In the Dark) dataset approach, where images are taken in the dark using a short exposure time and the corresponding ground truth is obtained with a long exposure time. This approach suits the end-to-end encoder-decoder model, and collecting the ground truth images is relatively simple. An illustrative sketch of pairing short exposure frames with their long exposure ground truth is shown below.
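A simple way to build the training pairs is to match each short exposure raw frame to the long exposure ground truth of the same scene. The sketch below assumes a hypothetical directory layout and a "sceneid_exposure" filename pattern purely for illustration.

```python
# Illustrative sketch of pairing short-exposure raw frames with their
# long-exposure ground truth. Directory layout and filename pattern
# (sceneid_exposure.dng) are assumptions, not the actual dataset layout.
import os

def build_pairs(short_dir, long_dir):
    # Index the long-exposure (ground truth) files by scene id.
    long_by_scene = {}
    for name in os.listdir(long_dir):
        scene_id = name.split("_")[0]
        long_by_scene[scene_id] = os.path.join(long_dir, name)

    # Match each short-exposure frame to the ground truth of the same scene.
    pairs = []
    for name in os.listdir(short_dir):
        scene_id = name.split("_")[0]
        if scene_id in long_by_scene:
            pairs.append((os.path.join(short_dir, name), long_by_scene[scene_id]))
    return pairs
```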

Figure 2: Short Exposure Images in seconds
Figure 3: Short exposure time in seconds
Figure 4: Long exposure images

The SID dataset contains 424 distinct reference images. For each reference image, ground truth is provided as long exposure images of the same scene. The dataset contains outdoor images captured under moonlight and streetlights, and indoor images captured with the lights off. To generalize to real scenarios, we added 40 distinct reference images of indoor low light conditions, taken by turning off the lights and allowing slight reflection of a combination of natural light and room lights onto the objects from different locations.

Figure 5: Ground Truth images (long exposure images) at different conditions

We also added 15 distinct reference images taken at night under moonlight with slight streetlight reflection (refer to figure 6). These images were added to balance the dataset across the most common low light conditions. The new images were captured on an iPhone X using the Adobe Lightroom app in DNG format. The DNG images are preprocessed, added to the existing dataset, and pushed for training to the U-net architecture (a sketch of the raw preprocessing step follows figure 6).

Figure 6: Ground Truth images (Long exposure images) at different conditions
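The raw preprocessing follows the SID-style recipe: subtract the black level, normalize by the white level, pack the Bayer mosaic into four half-resolution channels, and scale by the exposure ratio between the short and long exposures. The sketch below is illustrative; the fixed default exposure ratio is an assumption, since in practice the ratio is computed per image pair.

```python
# Sketch of raw DNG preprocessing (SID-style): black-level subtraction,
# normalization, Bayer packing into 4 channels, exposure-ratio scaling.
# The default exposure_ratio is an illustrative assumption.
import numpy as np
import rawpy

def pack_raw_dng(path, exposure_ratio=100.0):
    with rawpy.imread(path) as raw:
        bayer = raw.raw_image_visible.astype(np.float32)
        black = np.mean(raw.black_level_per_channel)
        bayer = np.maximum(bayer - black, 0) / (raw.white_level - black)

    # Pack the 2x2 Bayer pattern into a half-resolution, 4-channel image.
    packed = np.stack([bayer[0::2, 0::2],
                       bayer[0::2, 1::2],
                       bayer[1::2, 0::2],
                       bayer[1::2, 1::2]], axis=-1)
    return np.clip(packed * exposure_ratio, 0.0, 1.0)
```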

Training with U-net Architecture

The training model takes as input the processed raw image and the corresponding ground truth from the dataset. Starting from a random initialization, the least absolute deviation (L1) loss is computed between the model output and the corresponding ground truth for the given input. This loss is then used to update the weights of the network through backpropagation, with the Adam optimizer used to find a local minimum of the loss efficiently. The system hardware consists of 64 GB of RAM and an Nvidia GTX 1080 GPU for faster computation, and the model was trained for 4000 epochs. The final trained model is used to obtain enhanced JPEG images by processing the raw input data. The enhanced images are then used for object detection.
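A minimal training-loop sketch for this setup is shown below, using the L1 (least absolute deviation) loss and the Adam optimizer. The learning rate and the dataset object are assumptions for illustration.

```python
# Minimal training-loop sketch: L1 loss between the enhanced output and the
# long-exposure ground truth, optimized with Adam. Learning rate and the
# dataset iterable are illustrative assumptions.
import tensorflow as tf

model = build_unet()                      # from the U-Net sketch above
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function
def train_step(short_raw, long_gt):
    with tf.GradientTape() as tape:
        enhanced = model(short_raw, training=True)
        loss = tf.reduce_mean(tf.abs(enhanced - long_gt))   # L1 loss
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# for epoch in range(4000):
#     for short_raw, long_gt in dataset:   # pairs built from build_pairs(...)
#         loss = train_step(short_raw, long_gt)
```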

Object Detection with the enhanced images

Figure 7: Object detection using TFLite

Several object detection models were trained and compared on the enhanced images. An enhanced image is passed through a pre-trained deep learning model, which detects the objects and predicts their locations using a TFLite model (refer to figure 7). We set up a training pipeline with a large dataset consisting of images and labeled annotations of the objects to be predicted at detection time. Once the dataset is created, training the deep learning model becomes the major challenge: building a model from scratch is a compute- and time-intensive process. Therefore, we used a model pre-trained on the COCO dataset and performed transfer learning on our labeled dataset. Processing time and accuracy vary depending on the model architecture, so several different models are trained and the final model is chosen based on the tradeoff between accuracy and inference time. For example, for AR-based step-by-step guidance on a mobile device, we converted the models to TFLite files for object detection; the inference time is measured on the mobile device, and the models are compared. We used ARKit for tracking the objects on iOS mobile devices. A sketch of the TFLite export and inference step is shown below.
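The export and on-device inference step can be sketched as follows. The SavedModel path and output tensor ordering are assumptions for illustration; the exact output layout depends on how the SSD model is exported.

```python
# Sketch of exporting a fine-tuned detector to TFLite and running it on an
# enhanced image. Paths are assumptions; input shape/dtype are read from the
# model itself so the snippet works regardless of quantization settings.
import numpy as np
import tensorflow as tf

# Convert the fine-tuned SSD SavedModel to a .tflite file.
converter = tf.lite.TFLiteConverter.from_saved_model("ssd_finetuned_saved_model")
tflite_model = converter.convert()
with open("detector.tflite", "wb") as f:
    f.write(tflite_model)

# Run inference on one enhanced image.
interpreter = tf.lite.Interpreter(model_path="detector.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Placeholder input with the shape/dtype the model expects; in practice this
# is the enhanced image resized to the detector's input resolution.
enhanced = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], enhanced)
interpreter.invoke()
# Output ordering (boxes, classes, scores, count) depends on the exported model.
boxes = interpreter.get_tensor(output_details[0]["index"])
```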

Object detection on local devices has given better response times and predictions. We trained different models on Arduino robot data, and the performance metrics of the different models are captured below.

Table 1: Model comparison: MobileNetV2 vs. MobileNetV1 FPN vs. ResNet50
Table 2: Approach 2, object detection time on Android vs. iOS devices

Among the mobile-friendly deep learning models in table 1, the SSD MobileNetV2 model has a low inference time on the Arduino robot data, but its accuracy is significantly lower than that of the other models. SSD MobileNetV1 FPN has better accuracy than MobileNetV2 at the cost of increased inference time. SSD ResNet was expected to perform best, as it produces better mAP values than MobileNetV1 FPN when trained on the COCO dataset; however, on the Arduino robot dataset, SSD ResNet produced slightly lower mAP values than MobileNetV1 FPN and took longer to run inference. Another observation is that these models performed better on iOS devices than on Android devices (see table 2). For the Arduino robot data, MobileNetV1 FPN was chosen for the object detection required for AR based assistance. In general, the choice falls between MobileNetV2 and MobileNetV1 FPN, based on the tradeoff between accuracy and inference time.

Conclusions and Next steps

The AR based assistance system must detect the object to trigger the next step of action. If there is not sufficient light, images captured in low light do not provide good accuracy, or the required object may not be detected at all. The solution described in the architecture above uses a trained encoder-decoder model to enhance the image from raw images. The enhanced image is then sent to the object detection module, where the trained model detects the various parts of the device along with their locations. The performance metrics gathered by comparing three different models help in choosing the right model based on the tradeoff between accuracy and response time.

In future projects, the end-to-end solution will be extended to Android devices. Action and scene prediction on images can be added to the object detection intelligence to point out anomalies or errors, and sensor information can be fused with image information for better action prediction. The complete solution can be ported to hands-free head mounted devices (HMDs) such as Google Glass and HoloLens. This solution can also be extended to other domains like remote training, where the user learns about a device from an expert, and to consultancy in the healthcare, retail, and e-commerce domains.

References

1. Thivakaran, T. K., and Chandrasekaran, R. M. “Nonlinear Filter Based Image Denoising Using AMF Approach”. International Journal of Computer Science and Information Security, Vol. 7, No. 2 (2010).

2. Ronneberger, O., Fischer, P., and Brox, T., “U-net: Convolutional Networks for Biomedical Image Segmentation.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, (2015).

3. Wei, Y., Zhigang, Z., Cheng, L., and Huiming, T. “Pixel-wise regression using U-Net and its application on pansharpening”. Neurocomputing, Vol. 312, pp. 364–371, (2018).

4. Iglovikov, V., and Shvets, A. “TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation.” arXiv:1801.05746 (2018).

5. Tzutalin. LabelImg. Git code. https://github.com/tzutalin/labelImg (2015).

6. Nadimpalli, V.K.V. and Agnihotram, G., Wipro Ltd, 2021. Method and system for rendering content in low light condition for field assistance. U.S. Patent Application 16/834,729.
