Cognitive edge computing system design: Some tips and tricks!
By Dr. Manjunath Ramachandra | DMTS Senior Member, Wipro
and Vinutha B N. | Distinguished Member & General Manager, Wipro
With the proliferation of IoT devices, edge intelligence is expected to reach every application domain. Most of these applications involve real-time interactions and call for extremely low latency in AI-driven decision making. At the same time, the AI algorithms powering applications on edge devices need special consideration because of resource constraints. As a result, AI models are required to run at the edge, including on devices close to where the data is generated. To provide a seamless experience, edge intelligence requires continuous tracking of the user, prediction of their next moves, and monitoring of the traffic pattern over the network. To make this happen, models must be sliced and stored across devices, and larger models must be compressed. Offloading of data or processing should happen dynamically to the right device or edge terminal. If the edge needs to support user mobility or mobile devices, process migration, along with partial or completed results, must follow the user at the same pace. Ultimately, the success of edge computing depends upon having the right algorithms available at the right time. Below, we provide fine details of the relevant parameters influencing the choices in data acquisition and processing.
Edge intelligence
Edge AI means that AI software algorithms are processed locally on a hardware device, using data (sensor readings, signals, etc.) created on the device itself. Edge computing is all about moving data processing power back to the edge, and AI on the edge is an active area of research.
Edge intelligence architecture comprises data capture, model development, deployment, and load sharing. Respectively, these map to:
1. edge caching
2. edge training
3. edge inferencing
4. edge offloading
These components and the workflow sequence must be compatible and interoperable across the different edge devices. Training and inferencing share similar needs, such as optimization of resources and computation. Likewise, caching and offloading share the common concern of how much data is processed locally and how much is transferred outside. In turn, offloading and training share the concept of distributed, collaborative computing and merge the results before consumption. Models must be downloaded in real time to cater to changes in the device environment, and performance measurements of the model must be carried out at regular intervals. Deciding when to download an updated model can be a challenge; one approach is to base the decision on how far the current model overshoots its tolerance limits, as sketched below.
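As a rough illustration of that trigger, the Python sketch below keeps a rolling window of recent per-sample errors and requests a model download once the average error overshoots the agreed tolerance. The class, its parameters, and the threshold values are illustrative assumptions, not part of any standard edge API.

```python
from collections import deque

class ModelUpdateMonitor:
    """Hypothetical monitor: flags when the deployed model drifts past tolerance."""

    def __init__(self, tolerance: float, window: int = 100):
        self.tolerance = tolerance          # acceptable average error agreed at deployment
        self.errors = deque(maxlen=window)  # rolling window of recent per-sample errors

    def record(self, error: float) -> None:
        self.errors.append(error)

    def needs_update(self) -> bool:
        # Request a fresh model only when the rolling mean error overshoots tolerance.
        if not self.errors:
            return False
        return sum(self.errors) / len(self.errors) > self.tolerance

monitor = ModelUpdateMonitor(tolerance=0.05)
for err in (0.02, 0.04, 0.09, 0.11):        # errors observed after each inference
    monitor.record(err)
if monitor.needs_update():
    print("download updated model from the edge server")
```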
John enters a mall with his head-mounted display enabled. He is looking for an LED bulb on the shelf. A camera starts capturing John to derive his location, gestures, and expressions. The location information is inferred by the camera itself, while the expressions and gestures are analyzed by the edge server. As he spends more time in front of the rack with Wipro LED bulbs, the inferencing model gets further fine-tuned with the brain-wave (electroencephalograph, EEG) signals from the head-mounted device. Now, the generic model on the server can precisely infer his emotions and expressions. Accordingly, information on other products in the environment illustrated below, including the printer and furniture, pops up as recommendations. John can now see how his office looks in the LED light. John's movements are analyzed by the edge server, and processing is offloaded to stream the corresponding AR or VR content to the edge media server in response to John's actions.
Edge caching
Edge caching involves data collection and archival for future use. While supporting the applications at the edge, data may be generated by the edge device, other devices attached to the same server, or internet-connected devices. Caching involves what data to cache, how to cache, and where to cache (i.e., in the device or edge server or other edge devices).
The type and amount of data that gets cached on each device generating the data, on other edge devices, and on the edge server depend upon the capabilities of the device. Devices often come from different vendors with various form factors, so device capabilities should be clearly advertised through a protocol such as UPnP+.
Numerous factors influence the amount of data cached by the device.
· Ability of the device to process the data, which dictates the amount of data that must be transferred.
· Power available and power required to process and transmit the data. E.g., a video camera capturing hand gestures can cache frames and transfer motion vectors and reference frames. This reduces caching and transmission at the cost of increased computation in the camera itself.
· Proximity of the edge device that caches the data
· Mobility of the data source, if the edge server or edge devices support mobility. This concerns the rate of movement with respect to the edge devices.
· Rate of data acquisition. This is the quantity of data acquired at a single time if it is event driven or data per unit time if it is continuous.
· Intervals of acquisition
· Format of the acquired data.
Based on these parameters, an AI system should be able to decide what to cache, where to cache, when to cache (event driven or time driven based on the use case) and how much to cache. Some of these values may be derived based on the applications as detailed in 3GPP TS 22.104 V18.2.0 (2021–09). For example, the rate of data acquisition should be less than the maximum packet (bit) rate over the channel.
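The sketch below shows, under simplifying assumptions, how these parameters might feed a caching decision. The DeviceProfile fields, thresholds, and location labels are hypothetical; only the channel-rate check loosely mirrors the constraint that the acquisition rate stay below the maximum packet rate.

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    battery_pct: float        # remaining battery, in percent
    free_cache_mb: float      # spare local storage
    compute_headroom: float   # fraction (0..1) of CPU currently available
    link_rate_mbps: float     # channel rate towards the edge server

def choose_cache_location(dev: DeviceProfile, data_rate_mbps: float,
                          data_size_mb: float) -> str:
    # The acquisition rate must stay below the channel rate, otherwise keep data local.
    if data_rate_mbps > dev.link_rate_mbps:
        return "device"
    # Low battery or no space: push the data to the edge server instead.
    if dev.battery_pct < 20 or dev.free_cache_mb < data_size_mb:
        return "edge_server"
    # Enough headroom to cache and pre-process locally.
    if dev.compute_headroom > 0.5:
        return "device"
    # Otherwise fall back to a nearby peer with spare capacity.
    return "nearby_edge_device"

camera = DeviceProfile(battery_pct=65, free_cache_mb=512, compute_headroom=0.3, link_rate_mbps=50)
print(choose_cache_location(camera, data_rate_mbps=8, data_size_mb=120))  # nearby_edge_device
```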
Data caching depends upon the amount of data acquired by the device. Today, cognitive devices ingest only as much data as is required for processing. For example, a security camera acquires images at an exceptionally low frame rate when there is no event happening in front of it and increases the frame rate as someone approaches. However, this advantage is not available in all edge devices. For example, if the computational complexity at the device is reduced to minimize battery usage, the amount of data to be transferred to the server increases, which in turn requires increased usage of the battery and the channel. When more devices contend for resources such as the channel or access to other edge infrastructure (devices, server), a rate-control mechanism such as random early detection can be imposed on the acquisition of the data, as indicated in the figure below.
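As a hedged sketch of that rate-control idea, the class below applies the standard RED recipe (an exponentially weighted average of the queue length compared against two thresholds) to the decision of whether to acquire the next sample. The threshold and weight values are illustrative assumptions.

```python
import random

class REDAcquisitionControl:
    """Random Early Detection applied to sensor-data acquisition: as the local
    queue fills up, new samples are probabilistically skipped before contention
    for the channel gets worse."""

    def __init__(self, min_th=50, max_th=150, max_p=0.1, weight=0.02):
        self.min_th, self.max_th = min_th, max_th   # queue-length thresholds
        self.max_p, self.weight = max_p, weight     # max drop probability, EWMA weight
        self.avg = 0.0

    def admit(self, queue_len: int) -> bool:
        # Exponentially weighted moving average of the local queue length.
        self.avg = (1 - self.weight) * self.avg + self.weight * queue_len
        if self.avg < self.min_th:
            return True                              # light load: acquire every sample
        if self.avg >= self.max_th:
            return False                             # heavy load: skip acquisition
        drop_p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return random.random() >= drop_p             # probabilistic early drop

red = REDAcquisitionControl()
if red.admit(queue_len=120):
    pass  # capture and cache the next frame
```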
Other challenges associated with data caching are the cache replacement and cache coherency policies. These policies are determined by the combination of different classes of edge devices and the edge server.
Edge training
When edge devices use AI models for inferencing, the models need to be updated to keep pace with the changing environment; that is, incremental training is required as conditions change. For example, in AR and VR applications, person-specific interaction styles are picked up by the model: a generic model needs to learn these specific traits to provide a better experience. Edge training therefore calls for distributed training, i.e., training over other edge devices, because the device acquiring the data may not be computationally equipped for incremental learning.
Learning can also happen offline with cached data if the model is not required to adapt to changing users in real time. The AI algorithms involve the selection of the transfer-learning model and the decision on when to retrain. The factors influencing training depend upon the change in patterns for inferencing: data changed due to a different user or environment may not produce the earlier patterns, and new patterns that are relevant for decision making creep in. This calls for a retrain threshold on the appearance of new patterns, beyond which retraining is triggered, as sketched below. In any case, the data available for retraining is small. Zero-shot learning over multiple related edge devices can reduce the need for training data; for example, in AR and VR, image and sound are often associated, and touch and facial expressions are correlated. Standard protocols need to be in place for selection of the cached data (the right data) and the right model for training.
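A minimal sketch of such a retrain trigger is shown below, assuming the top-class confidence of each inference can stand in for "known pattern". The cutoff and threshold values are illustrative assumptions.

```python
import numpy as np

def novel_fraction(confidences: np.ndarray, novelty_cutoff: float = 0.6) -> float:
    """Fraction of recent inferences whose top-class confidence fell below the
    cutoff, used here as a proxy for 'new pattern'."""
    return float(np.mean(confidences < novelty_cutoff))

def should_retrain(confidences: np.ndarray, retrain_threshold: float = 0.15) -> bool:
    # Trigger retraining once the share of novel patterns crosses the threshold.
    return novel_fraction(confidences) > retrain_threshold

# Confidences collected from the last batch of edge inferences (synthetic values).
recent = np.array([0.95, 0.42, 0.88, 0.51, 0.97, 0.93, 0.89, 0.91])
print(should_retrain(recent))  # True: 2 of 8 inferences look novel (25% > 15%)
```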
In the case of collaborative and federated training, constraints on the computing power and delays (processing and transmission) of the participating devices should be considered. The availability of data on the device, and transmission versus processing cost, should be weighed before including or excluding a device from the training.
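The sketch below screens candidate devices for a federated round using exactly those criteria: enough local data, and a processing-plus-transmission delay that fits the round budget. The Candidate fields, the minimum sample count, and the delay budget are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    device_id: str
    samples: int            # locally cached training samples
    flops_available: float  # spare compute, in GFLOPS
    proc_delay_ms: float    # estimated local training time for one round
    tx_delay_ms: float      # estimated time to upload the model update

def select_participants(candidates: List[Candidate], round_budget_ms: float,
                        min_samples: int = 100) -> List[Candidate]:
    eligible = [
        c for c in candidates
        if c.samples >= min_samples                              # enough local data
        and c.proc_delay_ms + c.tx_delay_ms <= round_budget_ms   # fits the round deadline
    ]
    # Prefer devices with more data and more spare compute.
    return sorted(eligible, key=lambda c: (c.samples, c.flops_available), reverse=True)

devices = [Candidate("cam-1", 800, 2.0, 900, 150),
           Candidate("hmd-1", 120, 0.5, 1800, 600),
           Candidate("gw-1", 50, 8.0, 300, 80)]
print([c.device_id for c in select_participants(devices, round_budget_ms=1500)])  # ['cam-1']
```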
Edge inferencing
Here the AI models act upon the acquired data to produce the inference. As AI models are often too large to fit onto a single edge device, several methods are being researched to address the memory constraint. In addition, the limited computing capabilities have a say in how frequently the model can be run and how long processing takes.
Some of the techniques include:
Model splitting
Splitting the model between the end device and network endpoints is a challenge because it depends on the resources available on the device. The split can happen dynamically based on the available device resources, and the model can be split to have a different number of layers at distinct locations. The edge server facilitates the sequential traversal of these layers and ensures the seamless integration of devices from different vendors. The edge server also decides which part of the model, and under what criteria, is dispatched to the device, based on factors including proximity to the edge, local buffer size, and model size (layers that can be independently dispatched). The different split points and the data rates needed to support the required user experience depend on the specific application; for image recognition, an example is defined in 3GPP TR 22.874 V2.0.0 (2021–06).
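The PyTorch snippet below is a minimal sketch of layer-wise splitting: the first few layers run on the device and only the intermediate tensor crosses the network to the edge server, which runs the rest. The toy network and the fixed split point are assumptions; in practice the split point would be chosen by the edge server from the factors listed above.

```python
import torch
import torch.nn as nn

# Toy image-recognition network, expressed as an ordered list of layers.
layers = [nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
          nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10)]

def split_model(split_at: int):
    device_part = nn.Sequential(*layers[:split_at])   # runs on the edge device
    server_part = nn.Sequential(*layers[split_at:])   # runs on the edge server
    return device_part, server_part

device_part, server_part = split_model(split_at=4)
frame = torch.randn(1, 3, 64, 64)                     # one captured frame
intermediate = device_part(frame)                     # only this tensor is transmitted
logits = server_part(intermediate)                    # server completes the inference
print(intermediate.shape, logits.shape)
```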
Model compression
Compressing the model involves developing the model to have a small footprint for faster training and inference. A typical AI model, sized at hundreds of megabytes, cannot sit in an edge device such as a camera, so the model must be compressed. There are multiple ways of reducing the model size with minimal deterioration in accuracy.
1. Weight quantization: Here the weights are stored at reduced precision (e.g., 8-bit integers, or binary values in the extreme case) to reduce the model size, since a full floating-point weight is costly in terms of memory (see the sketch after this list).
2. Knowledge distillation: Here a small, fixed-size student network is trained on the same data to reproduce the behavior of the large original model, retaining the important patterns.
3. Architectural optimization: Here the neurons or processing units are reduced. For example, in a convolutional neural network, the number of filters is reduced, depending upon their significance.
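As a sketch of the first technique, the NumPy snippet below performs a simple symmetric post-training quantization: 32-bit floating-point weights are mapped to 8-bit integers plus one scale factor, cutting the stored size roughly fourfold. The symmetric, per-tensor scheme is just one common choice, not the only way to quantize.

```python
import numpy as np

def quantize_weights(w: np.ndarray):
    # Symmetric per-tensor quantization: map floats to int8 with a single scale factor.
    scale = max(float(np.max(np.abs(w))), 1e-8) / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_weights(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)       # one layer's weight matrix
q, scale = quantize_weights(w)
w_hat = dequantize_weights(q, scale)
print("size reduction:", w.nbytes / q.nbytes)           # ~4x smaller
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```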
Model inferencing
Inferencing the model involves generating a model with a small footprint that can run on low-cost, low-power hardware devices.
Edge offloading
Edge offloading involves the migration of data, processes, or applications from an edge device (or edge server) to another device to perform an indicated task. A device can offload the data and application to more than one device and combine the processed results before consumption. The following figure shows offloading in an AR/VR application where the camera performs minimal processing and offloads the rest to the edge.
Offloading is required for two main reasons.
1. To make up for the resource crunch
2. To support user mobility
More specifically, the triggers for offloading include:
· Excessive computation
· Latency requirement
· Load balancing
· Power saving
· Data saving
· Resource constraints
Four steps are involved in offloading, as indicated in the figure below, and AI algorithms are used at each of these steps. These algorithms run on the scheduler or agent on the edge server. Optimal offloading and migration depend on the timing of the offload and on the handling of partial results: how much to process on an edge device, and to whom to send those results, determines how usable the results are. To perform offloading, it is necessary to estimate, in real time, which data packets should be allocated to the edge device to minimize delay, how many data packets should be uploaded, and which edge device should be selected.
1. When to offload:
The offloading time is estimated through one of the following techniques:
· Multivariate linear regression
· Polynomial multivariate regression
· Random forest regression
· Ridge regression
The prediction is made based on patterns in the consumption of acquired data and the nature of processing required. Alternatively, user settings or interruptions can trigger offloading for the processing or consumption of data.
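As a hedged illustration of the regression-based approach, the snippet below trains scikit-learn's RandomForestRegressor on synthetic samples to predict the time until offloading becomes beneficial. The feature set and the synthetic data are assumptions for illustration only; a real deployment would train on measurements logged by the edge scheduler.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Features per sample: [queue_length, cpu_load, channel_rate, payload_size] (normalized).
X = rng.random((500, 4))
# Synthetic target: seconds until offloading should start.
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] - 0.2 * X[:, 2] + 0.4 * X[:, 3] + 0.05 * rng.standard_normal(500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

current_state = np.array([[0.7, 0.6, 0.3, 0.8]])   # the device's current measurements
print(f"predicted time to offload: {model.predict(current_state)[0]:.3f} s")
```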
2. What to offload:
For offloading, the application and the data need to be segmented and dispatched at the right time. Partitioning the application and data is based on a range of factors as indicated in the figure below.
Some parameters are measured at the time of offloading; these include the battery status of the device, round-trip delays (which reflect the network topology), and available processing power (accounting for background processes and housekeeping). These parameters dictate the quantity of data the device can process or transmit to another device and the applications (or parts of applications) that it can process locally.
By default, the device does not have visibility into the resource status of the potential target device for offloading data or applications. The scheduler running on the edge server estimates the load on different devices at a future point in time and communicates this to the devices prior to the initiation of offloading. Channel congestion status, resource contention, and other factors are determined in advance. Accordingly, data rates are adjusted through a RED (Random Early Detection) algorithm that considers the relative priorities among the offloading devices as well as the applications they execute.
The more partitions there are, the more handshakes there will be; on the other hand, the device may not have adequate resources to run a larger chunk of the code. A cognitive scheduler is used to optimize partitioning and map the applications or tasks onto multiple devices.
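The toy cost model below captures that trade-off: each extra partition adds a fixed handshake overhead, while too few partitions produce chunks that exceed a device's capacity. The cost function and constants are assumptions for illustration, not the scheduler's actual objective.

```python
def partition_cost(total_work: float, n_parts: int,
                   device_capacity: float, handshake_cost: float) -> float:
    per_part = total_work / n_parts
    if per_part > device_capacity:
        return float("inf")                        # a chunk this large will not fit on a device
    return per_part + handshake_cost * n_parts     # critical-path compute + coordination overhead

def best_partition_count(total_work: float, device_capacity: float,
                         handshake_cost: float, max_parts: int = 16) -> int:
    return min(range(1, max_parts + 1),
               key=lambda n: partition_cost(total_work, n, device_capacity, handshake_cost))

# 100 units of work, devices that can each take 30 units, 2 units of handshake cost per partition.
print(best_partition_count(total_work=100.0, device_capacity=30.0, handshake_cost=2.0))
```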
3. Where to offload:
Computation on the local device is determined by the quantity of data acquired, the availability of computational power, and the connectivity needed to transfer the data or processed results further. In turn, these factors are constrained by or depend upon the available battery power and response time. The offloading needs to happen on the right edge device, and migration must happen at the right time so that the user gets the processed result in time. AI algorithms are required for load prediction and for device or user movement prediction (if mobility is involved), subject to the above constraints. As detailed above, a cognitive scheduler on the edge server selects the right target device to offload to, based on the amount of data to churn, the application, the response generation time, the capabilities of the device, and more.
A reinforcement learning model can solve the problems of data packet selection and edge device allocation in real time. Its objective is to select an action that minimizes latency, so the reward is inversely proportional to the latency. The edge server or an edge device acts as a controller, monitors the current state and action, and receives information on the new state and reward. The Q-function is updated after each transition. Significant performance improvements can be achieved across variations in service rates and traffic arrival rates.
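A minimal tabular Q-learning sketch of this controller is shown below: the state is a coarse load level, the action is the index of the target edge device, and the reward is the inverse of the observed latency. The state encoding, hyperparameters, and reward scaling are illustrative assumptions.

```python
import random
from collections import defaultdict

class OffloadAgent:
    """Hypothetical controller run on the edge server (or a device) that learns
    which edge device to allocate packets to, so as to minimize latency."""

    def __init__(self, n_devices: int, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)                  # Q[(state, action)]
        self.n_devices = n_devices
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state) -> int:
        if random.random() < self.epsilon:           # occasional exploration
            return random.randrange(self.n_devices)
        return max(range(self.n_devices), key=lambda a: self.q[(state, a)])

    def update(self, state, action, latency_ms: float, next_state) -> None:
        reward = 1.0 / max(latency_ms, 1.0)          # reward inversely proportional to latency
        best_next = max(self.q[(next_state, a)] for a in range(self.n_devices))
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])

agent = OffloadAgent(n_devices=3)
device = agent.act(state="load_high")                # pick a target device
agent.update("load_high", device, latency_ms=42.0, next_state="load_medium")
```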
4. How to offload:
Once the offload process starts, the program and data are containerized to run on the target device. The data is often encrypted to protect privacy and to prevent intrusion, but not all segments of the data need equal protection: data and program segments are segregated according to sensitivity before the transfer begins.
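The sketch below illustrates that sensitivity-aware packaging using the `cryptography` package's Fernet recipe: only segments marked sensitive are encrypted before offload, while the rest are shipped as-is. The segment labels and packaging format are assumptions for illustration.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, provisioned by the edge security service
fernet = Fernet(key)

def package_for_offload(segments):
    """segments: list of (payload_bytes, is_sensitive) tuples."""
    packaged = []
    for payload, is_sensitive in segments:
        if is_sensitive:
            packaged.append(("enc", fernet.encrypt(payload)))   # protect private data
        else:
            packaged.append(("raw", payload))                   # no crypto cost for public data
    return packaged

packaged = package_for_offload([(b"user EEG trace", True),
                                (b"ambient light level", False)])
print([kind for kind, _ in packaged])   # ['enc', 'raw']
```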
Future directions
The mechanism provided here works well for a moderate number of devices around the edge. However, the number of devices around an edge server is increasing rapidly. To support the increased demand for channels, resources, and complex applications, the underlying arbiter system must become more cognitive. Only the data that is actually required, determined by relevance, needs to be processed; this amounts to turning big data into small data. It means developing algorithms that are given a large volume of data generated by IoT devices and that can throw away not-so-useful data points, retaining only those which carry major weight for decision making.
The same holds for model weights. Only the weights contributing to decision making are handled. This significantly reduces the data and model traffic as well as the cache requirements. Also, prioritization of SLA (Service Level Agreement) services and the reduction of transmission rates based on availability of resources can provide a solution for streaming issues.