EfficientDet for object detection on edge for AR Assistance

By Abhishek Kumar; Gopichand Agnihotram; Raja Sekhar Reddy S; Surbhit Kumar; Pandurang Naik

Object detection is the task of training computers to locate and identify objects in images or videos. AR assistance based on object detection is expected to help users with the activities associated with smart systems, such as repair, maintenance, interaction, anomaly identification and training. Over the years, companies and researchers have created many object detection architectures and algorithms. Edge-based algorithms are especially popular because cloud-based detection consumes significant computing resources and cycle time, and its latency is a problem for real-time AR assistance.

In the race to create the most accurate and efficient model, the Google Brain team released the EfficientDet model, which has achieved the highest accuracy with the fewest training epochs in object detection tasks. This architecture outperforms the YOLO [4] and region-based [3] families of object detection models while using minimal computation power. This article compares the performance of different object detection models on custom datasets on edge devices and shares our findings.

Figure 1. Edge Computing in different industrial applications

For near real-time object detection on the edge, fast inference is required while maintaining accuracy. The EfficientDet architecture achieves state-of-the-art performance compared to other object detectors, as seen in Figure 2 [1]. It uses 13x to 42x fewer FLOPs and 4x to 9x fewer parameters than earlier models. These improvements significantly reduce training time, computation time and resource utilization, making it more suitable for edge-side deployment.

Figure 2. EfficientDet comparison to other object detectors [1]

High level Solution approach for edge computing

Edge computing is a distributed computing framework that brings enterprise applications closer to data sources such as IoT devices, AR/VR headsets or local edge servers. This proximity to data at its source can deliver strong business benefits, including faster insights, improved response times and better bandwidth availability. The solution described here is object detection based on EfficientDet Lite models, which scales to adapt to the tighter and potentially time-varying resource constraints of edge computing environments.

Figure 3. High Level Approach

The EfficientDet Lite model is first trained on a custom dataset and exported to a TFLite model file; the exported model is then deployed on edge devices. This approach is very useful in applications that require high accuracy with limited computational resources.

Model training with the EfficientDet Architecture

An object detector has three main components that are optimized to improve its training efficiency and inference capability: a backbone that extracts features from the input image, a feature network that takes multiple levels of features from the backbone as input, and a final class/box network that uses the fused features to predict the class and location of each object.

· EfficientNet is used as the backbone because it is more powerful than other currently available backbone architectures. It improves accuracy significantly and greatly reduces the computational resources required to train the model.

· The feature network used here is a bi-directional Feature Pyramid Network (BiFPN), which uses multilevel feature fusion so that information flows in both top-down and bottom-up directions over efficient connections. Learned weights are assigned to input features of different resolutions so that each contributes according to its importance, and depthwise separable convolutions are used to reduce the computation cost significantly.

· The third step to improve model training and prediction accuracy is compound scaling, which jointly controls the depth, width and resolution of the architecture components. This makes the model better optimized for edge devices than other object detectors such as Faster R-CNN and SSD.

· Both the Bi-FPN layers and class/box net layers are repeated multiple times based on the resource constraints.
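To make the saving from depthwise separable convolutions concrete, here is a small arithmetic sketch. The layer shape (a 3x3 kernel with 64 input and 64 output channels) is an illustrative assumption rather than a value taken from our models, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise kernel x kernel convolution followed by a
    1 x 1 pointwise convolution (bias ignored)."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Illustrative layer shape (assumption): 3x3 kernel, 64 -> 64 channels.
    std = standard_conv_params(3, 64, 64)
    sep = separable_conv_params(3, 64, 64)
    print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```

For this assumed shape the separable layer needs roughly one eighth of the weights, which is the kind of reduction that matters on memory-constrained edge devices.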

Here are the methods and techniques used in the architecture that result in better performance in terms of training and edge device deployment.

Compound scaling: This method was introduced to scale up baseline EfficientDet models and improve their object detection capabilities. Because object detectors have many scaling dimensions, a single compound coefficient jointly scales all dimensions of the backbone, the BiFPN and the class/box prediction networks using heuristic rules.

Backbone network: EfficientNet is used as the backbone network. The same width and depth scaling coefficients as EfficientNet B0-B7 are used so that the pretrained EfficientNet weights can be reused. The compound coefficient that scales the width, depth and resolution of the backbone network is the same as in the EfficientNet architecture [2].
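The heuristic scaling rules can be written down in a few lines. The sketch below uses the formulas given in the EfficientDet paper [1] for the compound coefficient phi; the class and function names are our own, and the released model configurations round the BiFPN width to hardware-friendly values. The EfficientDet Lite variants ship with their own configurations, which may differ from these numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledConfig:
    bifpn_width: float      # number of BiFPN channels (before rounding)
    bifpn_depth: int        # number of repeated BiFPN layers
    head_depth: int         # number of layers in the class/box networks
    input_resolution: int   # input image side length in pixels


def compound_scale(phi: int) -> ScaledConfig:
    """Apply the EfficientDet scaling heuristics from [1] for coefficient phi."""
    if phi < 0:
        raise ValueError("phi must be non-negative")
    return ScaledConfig(
        bifpn_width=64 * (1.35 ** phi),    # width grows exponentially
        bifpn_depth=3 + phi,               # depth grows linearly
        head_depth=3 + phi // 3,           # heads deepen every third step
        input_resolution=512 + 128 * phi,  # resolution grows linearly
    )


if __name__ == "__main__":
    for phi in range(4):
        print(phi, compound_scale(phi))
```

A single integer therefore determines the size of every component, which is what lets the same family cover both small edge models and large server models.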

Bi-FPN network: Input image features have different resolutions. To optimize feature aggregation, the solution uses the BiFPN feature network. Its bidirectional cross-scale connections fuse features efficiently, and nodes that have only one input edge are removed because they perform no fusion and contribute less to the feature network. A weighted feature fusion technique assigns a weight to each input feature based on its importance. Because input features at different resolutions contribute unequally to the output features, weighted feature fusion lets the model learn a weight for each input feature, so the network uses each input in proportion to its importance.
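Weighted feature fusion can be illustrated with the fast normalized fusion rule described in the EfficientDet paper [1]: each weight is passed through a ReLU and the weighted sum is divided by the sum of the weights plus a small epsilon. The sketch below is a plain-Python illustration, assuming the input features have already been resized to a common resolution and flattened into equal-length lists; in a real network the weights are learned during training.

```python
from typing import List, Sequence


def fast_normalized_fusion(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-4,
) -> List[float]:
    """Fuse same-shaped feature vectors as sum(w_i * f_i) / (eps + sum(w_j))."""
    if not features or len(features) != len(weights):
        raise ValueError("need one weight per input feature")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all features must have the same length")

    # ReLU keeps every weight non-negative so the normalization is stable.
    positive = [max(0.0, w) for w in weights]
    total = eps + sum(positive)
    return [
        sum(w * f[i] for w, f in zip(positive, features)) / total
        for i in range(length)
    ]


if __name__ == "__main__":
    # Two toy feature vectors with illustrative weights (assumptions).
    print(fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [0.9, 0.1]))
```

Compared with a softmax over the weights, this normalization avoids the exponential, which the paper reports as faster on accelerators with similar accuracy.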

Due to the above-mentioned architectural improvements, training the EfficientDet Lite models on our custom datasets becomes faster on both GPU and CPU machines. Inference time on edge devices is substantially better than that of the other object detector models we tested on edge devices. These models need fewer FLOPs per inference while delivering better accuracy.

Figure 4. Detection Architecture

Training the EfficientDet Lite model on our custom datasets with the TensorFlow Lite Model Maker library reduces training time and requires less data because it uses transfer learning. Object detection with these trained models shows a significant performance improvement on edge devices. The trained and exported model is small and can be deployed easily on edge devices.
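A training run with Model Maker can be sketched roughly as follows. This is a minimal outline rather than our exact pipeline: the function name, directory arguments, label list, epoch count and batch size are placeholders, and the code assumes the `tflite-model-maker` package is installed. The import is placed inside the function so the module can be loaded on machines without that package.

```python
from typing import List


def train_and_export(
    images_dir: str,
    annotations_dir: str,
    label_map: List[str],
    export_dir: str = ".",
    epochs: int = 50,
    batch_size: int = 8,
) -> None:
    """Train EfficientDet-Lite0 on Pascal VOC data and export a TFLite file."""
    # Imported lazily: requires the tflite-model-maker package (assumption).
    from tflite_model_maker import model_spec, object_detector

    # Pretrained EfficientDet-Lite0 specification used for transfer learning.
    spec = model_spec.get("efficientdet_lite0")

    # Load images and their Pascal VOC XML annotations.
    train_data = object_detector.DataLoader.from_pascal_voc(
        images_dir, annotations_dir, label_map
    )

    # Fine-tune the whole model on the custom dataset.
    model = object_detector.create(
        train_data,
        model_spec=spec,
        epochs=epochs,
        batch_size=batch_size,
        train_whole_model=True,
    )

    # Write the TFLite model into export_dir for deployment on the edge device.
    model.export(export_dir=export_dir)
```

A call such as `train_and_export("images", "annotations", ["tv_remote", "key"])` would use hypothetical directory and label names; validation and test data can be loaded the same way and passed to the library's evaluation step.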

Dataset preparation for edge computing for field assistance use case

To train the EfficientDet model, we need input images with descriptive features. The training data should consist of labeled images taken at a proper scale and in a proper format, with all meaningful features included. Data preparation is an iterative and exploratory step that has a huge impact on model performance.

When selecting the data, we should aim for a good distribution of each label. Instead of including a large number of images, the focus should be on images with rich features that help detect the class labels more accurately. Some features are more beneficial to the training process if split into constituent parts, while combining certain features when creating the dataset can improve the model's performance. Model performance largely depends on how the data is prepared for the detection task at hand. Depending on the application, each label should be annotated so that the model sees a good amount of feature variation during training.

The data required to train the EfficientDet Lite model for edge devices should be in CSV format or in Pascal VOC format. Here we have used the Pascal VOC format to train our model. The dataset is divided into 80% for training and 10% each for validation and testing. The training data should be balanced and contain annotated images with all labels or classes in the right proportion. While preparing the data in Pascal VOC format, we should make sure that the minimum bounding box coordinates do not exceed the maximum coordinates and that all relevant tags are present. Tools such as LabelImg can be used to prepare data in the Pascal VOC format.
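The 80/10/10 split can be reproduced with a short helper. The sketch below is one possible way to do it, assuming the dataset is represented as a list of image file names; the function name and the fixed seed are our own choices. A plain random split does not guarantee class balance, so the label distribution in each subset should still be checked.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str], seed: int = 42
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items reproducibly and split them 80% / 10% / 10%."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable

    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    names = [f"img_{i:03d}.jpg" for i in range(100)]
    train, val, test = split_dataset(names)
    print(len(train), len(val), len(test))
```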

Pascal VOC: Pascal VOC annotations are stored as XML files. For each image in the dataset, we create an XML annotation file containing the tags filename, folder, size, object, bounding box and label name. Training the model with data in this format improves model performance and predictions. The data provided to the model for training is converted to TFRecord files, which makes training faster and storage on disk more efficient.
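The checks described above (all relevant tags present, minimum coordinates below maximum coordinates) can be automated before training. The following sketch parses one Pascal VOC annotation with the Python standard library and rejects malformed boxes. The sample file name, label and coordinates are made up for illustration, and the helper names are ours.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Box = Tuple[str, int, int, int, int]  # (label, xmin, ymin, xmax, ymax)

# Illustrative annotation; file name, label and coordinates are assumptions.
SAMPLE_XML = """<annotation>
  <folder>images</folder>
  <filename>remote_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>tv_remote</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>360</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>"""


def _read_int(node: ET.Element, tag: str) -> int:
    """Read a required integer child tag, raising ValueError if it is missing."""
    text = node.findtext(tag)
    if text is None:
        raise ValueError(f"missing <{tag}> tag")
    return int(float(text))


def parse_voc(xml_text: str) -> List[Box]:
    """Return validated boxes from one Pascal VOC annotation document."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    if size is None:
        raise ValueError("missing <size> tag")
    width, height = _read_int(size, "width"), _read_int(size, "height")

    boxes: List[Box] = []
    for obj in root.findall("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        if not label or bndbox is None:
            raise ValueError("object needs <name> and <bndbox> tags")
        xmin, ymin, xmax, ymax = (
            _read_int(bndbox, tag) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # Min values must stay below max values and inside the image.
        if not (0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height):
            raise ValueError(f"invalid box for label {label!r}")
        boxes.append((label, xmin, ymin, xmax, ymax))
    return boxes


if __name__ == "__main__":
    print(parse_voc(SAMPLE_XML))
```

Running such a check over every annotation file catches labeling mistakes early, before they surface as training errors.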

Figure 5. Sample Pascal VOC image

Object detection on edge devices and assisting the field engineers for AR steps

Object detection plays a key role in applications such as device repair, creation of AR steps and procedures, and AR guidance, where it enhances the user experience. Activity steps and procedures based on the detection of parts enable functionality such as guidance, verification and inspection. Along with the detected class, we also get the object's position, which is helpful when notifying users of the steps in an AR environment.

Figure 6. Arduino robot kit components detection

In the detection samples above, we can see that the trained EfficientDet model deployed on edge devices detects parts quickly, in approximately 50 to 100 ms. Such response times are necessary for an immersive experience in repair and maintenance AR assistance use cases. The detection in the example above uses a probability threshold of 0.5. We can vary the threshold based on the number and kind of labels we want to detect for our field use cases.
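Applying the probability threshold is a simple post-processing step on the model output. The sketch below assumes the detections have already been decoded into label, score and box; the `Detection` type and the sample values are hypothetical and only illustrate the filtering.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    label: str
    score: float                    # confidence in [0, 1]
    box: Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax) in pixels


def filter_detections(
    detections: List[Detection], threshold: float = 0.5
) -> List[Detection]:
    """Keep only detections whose score reaches the probability threshold."""
    return [d for d in detections if d.score >= threshold]


if __name__ == "__main__":
    # Made-up detections for illustration.
    raw = [
        Detection("tv_remote", 0.91, (120, 80, 360, 300)),
        Detection("key", 0.42, (400, 210, 450, 260)),
        Detection("key", 0.55, (30, 40, 70, 95)),
    ]
    for det in filter_detections(raw, threshold=0.5):
        print(det.label, det.score, det.box)
```

Raising the threshold reduces false positives at the cost of missing low-confidence parts, so the value is best tuned per use case.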

Figure 7. Television remote and Small Keys detection

The detection of small and large components helps in scene tracking that enables the creation of continuous and persistent AR experiences for large- and small-scale objects.

Tracking and recognizing objects has applications in self-driving cars, medical science, agriculture, robotics and hardware. The probability threshold can be changed for each application when building the application file to be deployed on edge devices, so that the desired labels are detected.

Performance metric comparison with other methods on edge devices

We analysed the performance of different models by porting them to different edge devices. The performance metrics in Table 1 were computed on edge devices over 10 consecutive inference runs to detect the labels in the images. We noted that detection is extremely fast (in milliseconds) and that most labels are detected with accurate bounding boxes. The confidence score of each label also indicates model performance at the 0.5 probability threshold. The models are trained on the TV remote and Arduino datasets. The performance metrics are compared across models such as SSD MobileNet and SSD ResNet on edge devices. The results are presented in the table below.
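A measurement over 10 consecutive runs can be scripted with a small timing harness. The sketch below is an assumption about how such a measurement might be done, not the exact script we used: `run_inference` stands for any callable that performs one detection, and the warm-up run is our own addition to exclude one-time initialization costs.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(
    run_inference: Callable[[], object], runs: int = 10, warmup: int = 1
) -> Dict[str, float]:
    """Time consecutive calls of run_inference and report latency in ms."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    for _ in range(warmup):
        run_inference()  # untimed warm-up call

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(latencies_ms),
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a real model invocation.
    print(benchmark(lambda: sum(range(100_000))))
```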

Table 1: Different Models performance Metrics

Table 1 shows that the EfficientDet Zero (EfficientDet0) model detects objects much more accurately than the other models tested. It also takes much less time per detection. In Figure 8 and Figure 9 below, we can see that the EfficientDet model is considerably smaller while offering good accuracy and lower inference time, which makes it the best suited for edge devices.

Figure 8. Edge Model Size vs % Accuracy on TV remote dataset
Figure 9. Edge Model Size vs Inference Time (in ms) on TV remote dataset

Conclusions and Next steps

Based on our analysis, we found that the EfficientDet model running on edge devices performs on par with cloud-based models. The EfficientDet algorithm proves to be better at identifying smaller objects in the frame than other cloud algorithms such as YOLO and Faster R-CNN.

Compared with the SSD models on edge devices, the EfficientDet model delivers the high inference speed (in milliseconds) needed to provide immersive experiences in AR assistance applications. It also performs much better in both object identification accuracy and detection time.

As for next steps, we are porting the EfficientDet model to other HMD devices such as RealWear and HoloLens to identify objects in real-time video streams for AR assistance. These models will be analysed for accuracy, and the hyperparameters will be tuned so that we can achieve good results for any kind of object in video frame data.


1. Tan, Mingxing et al. “EfficientDet: Scalable and Efficient Object Detection.” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020): 10778–10787.

2. Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv:1905.11946.

3. Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell. 2017 Jun;39(6):1137–1149. doi: 10.1109/TPAMI.2016.2577031. Epub 2016 Jun 6. PMID: 27295650.

4. Handalage, Upulie & Kuganandamurthy, Lakshini. (2021). Real-Time Object Detection Using YOLO: A Review. 10.13140/RG.2.2.24367.66723.


