In this section, we introduce the details of our 3D object recognition method.
Overview
In this paper, we study efficient 3D object recognition in the mobile edge environment. The overall idea is to combine edge computing and 3D object analysis to improve the user experience of mobile applications. By utilizing the computing resources of edge cloud servers, we can perform high-throughput, complex 3D object analysis tasks with low latency. The process of our method is shown in Fig. 1.
Our edge computing framework consists of many mobile devices, each of which connects to a nearby edge cloud server over a high-speed network based on its location. We assume that the mobile device is equipped with a portable 3D scanning sensor that can reconstruct real objects in the physical world. During 3D reconstruction, the 3D data is compressed on the mobile device by a distance-based compression method [57]. When users need to obtain the category of a 3D object captured by the mobile device, they send a request to our edge computing framework. The edge cloud server and the mobile terminal then collaboratively accomplish the recognition task to respond to the request. The whole process consists of the following four steps.
First, the captured 3D data is uploaded from the mobile device to the edge cloud server. To speed up data transmission, we discard the normal and color information of the captured 3D data and transmit only the compressed geometric data. Since the 3D data has already been compressed during 3D reconstruction, the transmission time is very short.
Second, the edge cloud server schedules the recognition tasks for multiple requests. For each request, the edge cloud server decompresses the received 3D data and then renders 2D images. Following existing view-based methods [12], our method renders 12 images from a fixed set of viewpoints for 3D object recognition. Since the rendering of the 12 images is completely independent, the edge cloud server performs parallel off-screen rendering with a real-time ray tracing algorithm [58].
Third, the edge cloud server sends the 12 rendered images back to the mobile device via the mobile network. As the resolution of each image is only \(224\times 224\), the transmission time is negligible.
Fourth, the mobile device takes the 12 rendered images from the edge cloud server as the input of our lightweight multi-view CNN and executes the recognition model to output the prediction result to the user. In this step, the CNN can be executed in one of three modes: CPU, GPU, or DSP. Users can choose a mode according to their mobile terminal, trading off accuracy against speed.
Of the four steps, the first and third are only responsible for data communication between the mobile device and the edge cloud server. In the following sections, we detail the second and fourth steps: cloud-based rendering and terminal-based recognition.
Cloud-based rendering
The goal of cloud-based rendering is to render multiple photorealistic 2D images on the edge cloud server, which serve as the input of the recognition model. As shown by previous research [59], recognition accuracy increases with the quality of the rendered images. However, the 3D graphics capability of the mobile device is rather limited, and generating most optical effects requires dedicated hardware. Thus, the only choice is to perform the rendering task on the edge cloud server.
To realize cloud-based rendering, we set up the viewpoint set in advance and store it on the edge cloud server. To ensure that the features captured from different views are complementary, we place 12 cameras at different positions around the 3D object, following the pioneering work MVCNN [12]. To place the 3D object in the rendering scene, we use the upright orientation method [60] to adjust its pose and then put its center point at the origin of the 3D coordinate system. The 12 cameras are elevated 30 degrees from the ground plane, point toward the centroid of the 3D object, and are spaced every 30 degrees around it.
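For concreteness, the camera placement can be computed as in the following sketch (our illustration, not the paper's code); the camera distance is an assumed free parameter, and the look-at target is the origin where the upright-oriented object is centered.

```python
import numpy as np

def make_viewpoints(distance=2.0, elevation_deg=30.0, n_views=12):
    """Place n_views cameras around the object centered at the origin: all cameras
    are elevated elevation_deg above the ground plane, spaced 360/n_views degrees
    apart in azimuth, and look at the origin."""
    elev = np.deg2rad(elevation_deg)
    positions = []
    for i in range(n_views):
        azim = np.deg2rad(i * 360.0 / n_views)
        positions.append((distance * np.cos(elev) * np.cos(azim),
                          distance * np.cos(elev) * np.sin(azim),
                          distance * np.sin(elev)))
    return np.array(positions)  # shape (12, 3); the look-at target is the origin

viewpoints = make_viewpoints()
```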
To make the cloud-based rendering results realistic, we add lights to the 3D scene, which produce optical effects such as shadows, reflections, and refraction. To handle the illumination, we generate the rendering results with a real-time ray tracing algorithm [58] instead of rasterization. Ray tracing simulates the basic principle of vision: rays are shot from the eye toward the pixels of the rendered image. The algorithm first checks the intersection between each ray and every triangle face of the 3D object and then determines the shading of the corresponding pixel in the rendered image. In practice, the decompressed 3D object usually has more than 100K triangles, so the complexity is overwhelming due to the inefficiency of handling irregular ray tracing on general-purpose hardware.
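The per-pixel work above boils down to ray–triangle intersection tests. The following NumPy sketch of the standard Möller–Trumbore test only illustrates that inner loop; the server-side renderer relies on hardware-accelerated ray tracing rather than a per-triangle Python loop like this.

```python
import numpy as np

def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-8):
    """Moeller-Trumbore test: return the hit distance t, or None if the ray misses."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) * inv_det
    return t if t > eps else None
```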
As a result, the only viable solution is to resort to dedicated ray-tracing hardware, such as NVIDIA graphics cards. To handle highly concurrent user requests, all the edge cloud servers are equipped with multiple dedicated graphics cards for high-performance ray tracing. Because multi-view rendering is separable, the edge cloud server can run the off-screen rendering subtasks in parallel, which significantly reduces the latency of mobile applications.
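A schematic sketch of this parallel dispatch is shown below; `render_view` stands in for a call into the ray-tracing renderer and is hypothetical, as is the process-pool choice.

```python
from concurrent.futures import ProcessPoolExecutor

def render_all_views(render_view, mesh, viewpoints, resolution=(224, 224)):
    """Dispatch one off-screen rendering subtask per viewpoint.

    `render_view(mesh, viewpoint, resolution)` is a hypothetical wrapper around
    the ray-tracing renderer; because the 12 subtasks are independent, they can
    run concurrently across the server's dedicated graphics cards."""
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(render_view, mesh, vp, resolution) for vp in viewpoints]
        return [f.result() for f in futures]  # 12 rendered images
```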
One challenge of implementing rendering on the edge cloud servers is that they usually have heterogeneous graphics processing units and operating systems. To solve this, instead of using vendor-specific shading languages, we interact with the graphics renderer through an OpenGL-based program packaged as a shared library, which provides unified APIs for 3D graphics rendering. The shared library is then deployed on all the edge cloud servers to realize the ray-tracing rendering process with a small amount of reprogramming. Finally, the 12 rendered 2D images of size \(224\times 224\) are generated on the edge cloud server and sent back to the mobile device.
Terminal-based recognition
Previous multi-view 3D object recognition methods use much more complex networks to maximize recognition performance. However, these networks are too large to be deployed on mobile devices. To improve the efficiency of our model on the mobile terminal, we design a lightweight multi-view CNN architecture based on ShuffleNet [24], which has fewer parameters than other CNN architectures such as VGG11 [61] and ResNet [62]. It is worth noting that there are other lightweight networks for 2D images, such as SqueezeNet [63], Xception [64], and MobileNet [65]. SqueezeNet uses a deeper network to reduce the number of parameters, which requires more inference time. Xception proposes depthwise separable convolution to improve performance, but the number of parameters is not reduced substantially. MobileNet constructs a lightweight network by combining depthwise convolution and pointwise convolution, which hinders information exchange across channels. Thus, we choose ShuffleNet for its strong information interchange between different channels. The input of our CNN is the 12 rendered images generated on the edge cloud server, and its output is the category of the 3D object, which is returned as the response to the user request.
Network Architecture. Given each rendered image, we first use ShuffleNet as the basic CNN architecture to compute its image descriptor. The architecture of ShuffleNet is shown in Fig. 2. The core of ShuffleNet is the channel shuffle operation, which lets information in different groups flow to other groups. Based on the channel shuffle operation, we define two types of units, S1 and S2, as shown in Fig. 3. Unit S1 is the basic unit for feature encoding based on the channel split operation, which splits the input feature channels into two branches. Unit S2 is used for spatial downsampling and removes the channel split operator. The final ShuffleNet architecture is composed of several layers of S1 and S2 units, which is highly efficient for mobile devices.
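A simplified PyTorch sketch of the channel shuffle operation and unit S1 is given below, following the ShuffleNet design of Fig. 3; the layer widths are illustrative, and unit S2 (the downsampling variant without channel split) is omitted.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channels so that information can flow between groups."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class UnitS1(nn.Module):
    """Basic feature-encoding unit: split the channels into two branches,
    transform one branch, concatenate, then shuffle the channels."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False), nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)               # channel split into two branches
        out = torch.cat((x1, self.branch(x2)), dim=1)
        return channel_shuffle(out, groups=2)
```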
To aggregate the image descriptors of the rendered images, our multi-view CNN model uses a view pooling layer that fuses the multiple views of the 3D object in no specific order. The view-pooling layer only keeps the view with the maximal activation, so we simply use an element-wise max pooling operation for information fusion. Finally, a SoftMax layer is added as the classification layer, which produces the category prediction. The whole architecture of our multi-view CNN model is shown in Fig. 4. Our experiments show that such a lightweight CNN architecture can be executed quickly on mobile devices.
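The following PyTorch sketch illustrates the fusion head; `backbone` denotes the shared ShuffleNet feature extractor and is assumed to return one flat descriptor per image, while `feat_dim` and `num_classes` are illustrative values.

```python
import torch
import torch.nn as nn

class MultiViewHead(nn.Module):
    """Fuse the per-view descriptors by element-wise max pooling, then classify."""
    def __init__(self, backbone, feat_dim=1024, num_classes=40):
        super().__init__()
        self.backbone = backbone                  # shared ShuffleNet feature extractor
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, views):                     # views: (batch, 12, 3, 224, 224)
        b, v = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1))       # (batch * 12, feat_dim)
        pooled, _ = feats.view(b, v, -1).max(dim=1)      # view pooling over the 12 views
        return self.classifier(pooled)            # logits; SoftMax is applied afterwards
```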
Model Training. To reduce the cost of implementing the edge computing framework, we design a two-stage training algorithm based on the semi-supervised learning algorithm FixMatch [26]. This algorithm reduces the amount of labeled data required, which is beneficial for upgrading the edge computing framework. Existing semi-supervised methods can be divided into two categories: pseudo-labeling and consistency regularization. Pseudo-labeling methods first learn a deep model from the labeled examples and then use the learned model to predict labels for the remaining examples, which are used to incrementally train the model. In contrast, consistency regularization methods use the invariance of predictions under random transformations of the training data as a regularization term. To integrate the advantages of these two approaches, we use FixMatch for model training, which generates labels for unlabeled samples by combining consistency regularization and pseudo-labeling. The input of our training algorithm is a set of 3D objects \(D=L\bigcup U\), where the objects in L are labeled and those in U are unlabeled. For each labeled object \(s_{i} \in L\), we attach a category tag \(c_{i}\) as its label, and all the category tags form the label set C. Every object is represented by 12 rendered images \(\{m_{j}\}\), each of which shares the category tag of the 3D object.
Existing multi-view methods usually initialize the weights of the whole network directly from a network pre-trained on ImageNet. Since our training samples are all rendered images, these initial weights do not reflect the features of our samples due to the domain discrepancy. Thus, we fine-tune the network on our rendered images before the multi-view learning stage. Accordingly, our model training step has two stages: SVCNN and MVCNN. The goal of the SVCNN stage is to train the single-image representation network, which is a part of our whole model, while the MVCNN stage trains our multi-view CNN model for 3D object recognition.
During the SVCNN stage, we use the FixMatch algorithm to train the image network ShuffleNet S, as shown in Fig. 5. The input of the image network is one rendered image, and the output is the category prediction for that image. To utilize the unlabeled rendered images, we define several image augmentation operations and generate artificial labels through consistency regularization and pseudo-label generation. There are two types of image augmentation: weak augmentation, denoted \(\alpha (\cdot )\), and strong augmentation, denoted \(\beta (\cdot )\). Weak augmentation uses standard flip and shift transformations: images are flipped horizontally with a probability of 50% and translated by up to 12.5%. By contrast, strong augmentation produces stronger distortions: we first apply no more than 4 operations from RandAugment [66] and CTAugment [67], and then randomly select a small square of the augmented image whose pixel values are set to a fixed gray value.
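A possible torchvision realization of the two pipelines is sketched below (assuming a recent torchvision); the exact operation pool and magnitudes of RandAugment/CTAugment used in the paper may differ, and the random-erasing step stands in for the square-masking operation described above.

```python
import torchvision.transforms as T

# Weak augmentation alpha(.): horizontal flip with probability 50% and a shift of up to 12.5%.
weak_aug = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=0, translate=(0.125, 0.125)),
    T.ToTensor(),
])

# Strong augmentation beta(.): at most 4 RandAugment operations, then erase a small
# square of the image (standing in for the square-masking step).
strong_aug = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=0, translate=(0.125, 0.125)),
    T.RandAugment(num_ops=4, magnitude=9),
    T.ToTensor(),
    T.RandomErasing(p=1.0, scale=(0.02, 0.1), value=0.5),
])
```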
For a labeled rendered image \(m_L\) with label c, we use the standard cross-entropy loss as the supervised loss \(L_s\) on the weakly augmented image:
$$\begin{aligned} L_s(m_L)=H(c,S(\alpha (m_L))) \end{aligned}$$
(1)
For an unlabeled rendered image \(m_U\), we apply the weak augmentation and predict the category of the weakly augmented image with the network S. If the network S gives a confident result, i.e., \(\max S(\alpha (m_U))\ge \tau\), the label \(c'=\arg \max S(\alpha (m_U))\) is taken as the pseudo-label of the image \(\alpha (m_U)\). Accordingly, we use the cross-entropy loss as the unsupervised loss \(L_u\) on the strongly augmented image
$$\begin{aligned} L_u(m_U)=H(c',S(\beta (m_U))) \end{aligned}$$
(2)
The final loss function of the SVCNN stage adds the supervised loss over all the labeled images and the unsupervised loss over all the unlabeled images:
$$\begin{aligned} L_1=\frac{1}{B}\sum \limits _{b = 1}^{B} L_s(m_{Lb})+ \frac{1}{\mu B}\sum \limits _{b = 1}^{\mu B} L_u(m_{Ub}) \end{aligned}$$
(3)
where B is the number of labeled images, \(\mu\) is the ratio of unlabeled to labeled images, and the total number of images is \((1+\mu )B\).
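The SVCNN-stage loss of Eqs. (1)–(3) can be computed as in the following PyTorch sketch; the threshold \(\tau\), the model `S`, and the batch layout are illustrative, and the augmentations are assumed to have been applied beforehand.

```python
import torch
import torch.nn.functional as F

def svcnn_loss(S, x_l_weak, y_l, x_u_weak, x_u_strong, tau=0.95):
    """FixMatch-style SVCNN loss (Eqs. 1-3): supervised cross-entropy on weakly
    augmented labeled images plus pseudo-labeled cross-entropy on strongly
    augmented unlabeled images whose weak-view prediction is confident."""
    # Supervised term L_s over the B weakly augmented labeled images (Eq. 1).
    loss_s = F.cross_entropy(S(x_l_weak), y_l)

    # Pseudo-labels c' from the weakly augmented unlabeled images (no gradient).
    with torch.no_grad():
        probs = torch.softmax(S(x_u_weak), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= tau).float()            # keep only confident predictions

    # Unsupervised term L_u over the mu*B strongly augmented unlabeled images (Eq. 2).
    loss_u = (F.cross_entropy(S(x_u_strong), pseudo, reduction="none") * mask).mean()
    return loss_s + loss_u                      # Eq. 3
```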
During the MVCNN stage, we combine FixMatch and view consistency to train the whole MVCNN network M. The input of the network is the 12 rendered images \(\{m_j\}\) of the same 3D object, and the output is the category prediction of the 3D object. The network S learned in the SVCNN stage is part of the MVCNN network, as shown in Fig. 4. The 12 ShuffleNet branches share the same parameters and are used to extract the image descriptors. When assembling the ShuffleNet branches into the MVCNN network, we remove the SoftMax layer and apply the view pooling layer to the penultimate layer of the ShuffleNet. As in the SVCNN stage, we apply the same image augmentation operations to the rendered images to define the loss function on the unlabeled 3D objects. For a labeled 3D object expressed by \(\{m^L_j\}\) with label c, we apply the weak augmentation and use the standard cross-entropy loss as the supervised loss \(L^M_s\)
$$\begin{aligned} L^M_s(\{m^L_j\})=H(c,M(\alpha (m^L_1),...,\alpha (m^L_{12}))) \end{aligned}$$
(4)
For an unlabeled 3D object expressed by \(\{m^U_j\}\), we apply the weak augmentation to all the rendered images and predict the category of the 3D object with the network M. If the network M gives a confident result, i.e., \(\max M(\alpha (m^U_1),...,\alpha (m^U_{12}))\ge \tau\), the label \(c'=\arg \max M(\alpha (m^U_1),...,\alpha (m^U_{12}))\) is taken as the pseudo-label of the 3D object \(\{m^U_j\}\). Accordingly, we use the cross-entropy loss as the unsupervised loss \(L^M_u\) on the unlabeled 3D object
$$\begin{aligned} L^M_u(\{m^U_j\})=H(c',M(\beta (m^U_1),...,\beta (m^U_{12}))) \end{aligned}$$
(5)
Since the rendered images \(\{m^U_j\}\) belong to one 3D object, they should have the same category label. Based on this observation, we add a view consistency term to boost performance. For a set of rendered images \(\{m^U_j\}\), we apply weak and strong augmentation and then minimize the divergence between the predictions for these augmented images. To realize this, we compute the standard deviations of the predictions of the ShuffleNet S on the weakly and strongly augmented images, respectively
$$\begin{aligned} {\begin{matrix} L^\alpha _{std}=Std(S(\alpha (m^U_1)),...,S(\alpha (m^U_{12}))) \\ L^\beta _{std}=Std(S(\beta (m^U_1)),...,S(\beta (m^U_{12}))) \end{matrix}} \end{aligned}$$
(6)
Based on these two standard deviations, we can measure the consistency degree between the predictions of different views of the 3D object. Thus, we define the view consistency term as follows:
$$\begin{aligned} L_v=L^\alpha _{std}+L^\beta _{std} \end{aligned}$$
(7)
Accordingly, the final loss function of the MVCNN stage is defined as:
$$\begin{aligned} L_2=L^M_s+L^M_u+L_v \end{aligned}$$
(8)
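The MVCNN-stage loss of Eqs. (4)–(8) can be sketched as follows; `mvcnn` denotes the full multi-view network M, `svcnn` the shared ShuffleNet S, and the tensor shapes, threshold, and batch layout are illustrative.

```python
import torch
import torch.nn.functional as F

def mvcnn_loss(mvcnn, svcnn, views_l_weak, y_l, views_u_weak, views_u_strong, tau=0.95):
    """MVCNN-stage loss L2 = L_s^M + L_u^M + L_v (Eqs. 4-8).

    The `views_*` tensors have shape (batch, 12, 3, 224, 224) and are assumed
    to be already weakly or strongly augmented, as their names indicate."""
    # Supervised term on the weakly augmented labeled objects (Eq. 4).
    loss_s = F.cross_entropy(mvcnn(views_l_weak), y_l)

    # Pseudo-labels from the weakly augmented unlabeled objects (Eq. 5).
    with torch.no_grad():
        probs = torch.softmax(mvcnn(views_u_weak), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= tau).float()
    loss_u = (F.cross_entropy(mvcnn(views_u_strong), pseudo, reduction="none") * mask).mean()

    # View-consistency term (Eqs. 6-7): the 12 views of one object should agree,
    # so penalize the standard deviation of the per-view predictions of S.
    b, v = views_u_weak.shape[:2]
    pred_w = torch.softmax(svcnn(views_u_weak.flatten(0, 1)), dim=1).view(b, v, -1)
    pred_s = torch.softmax(svcnn(views_u_strong.flatten(0, 1)), dim=1).view(b, v, -1)
    loss_v = pred_w.std(dim=1).mean() + pred_s.std(dim=1).mean()

    return loss_s + loss_u + loss_v             # Eq. 8
```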
Given the loss function, we optimize the network through back-propagation using stochastic gradient descent [68] with a decreasing learning rate.
Model Deployment. The learned deep model is finally deployed on the mobile terminal. After the terminal receives the 12 images rendered on the edge cloud server, it runs the deep model for 3D object recognition. We provide three running modes on the mobile terminal: CPU, GPU, and DSP. The CPU mode is the slowest, while the DSP mode is the fastest. In terms of recognition accuracy, the CPU and GPU modes are identical, while the DSP mode is the lowest because it only supports the quantized model, which introduces quantization error. The user can choose one of the three modes, and correspondingly whether to quantize the deep model, according to the mobile device.
The goal of neural network quantization [69] is to decrease the computation time and energy consumption on the mobile device. After quantization, the weights are stored at a lower bit precision, and the computational cost of matrix multiplication decreases quadratically with the bit width. Network quantization therefore reduces the latency of our edge computing framework considerably. However, quantization without any fine-tuning may degrade recognition accuracy. To avoid this problem, we use the quantization-aware training method [70] to mitigate the quantization error.
To perform quantization-aware training, we first introduce a quantization simulation block into every layer of our model. The quantization simulation block turns a real-valued vector v into an integer vector \(v_{int}\) by rounding and clamping. Specifically, given a real-valued vector v, we first map it to the integer grid \(\{0,1,...,2^8-1\}\):
$$\begin{aligned} v_{int}=clamp(\left\lfloor \frac{v}{s} \right\rceil +z;0,2^8-1) \end{aligned}$$
(9)
where \(\left\lfloor \cdot \right\rceil\) is the round-to-nearest operator, s is the scale factor, and z is the zero point; s and z are optimized during quantization-aware training. The clamping operation is defined as:
$$\begin{aligned} clamp(x;a,c)= {\left\{ \begin{array}{ll} a, &{} x < a \\ x, &{} a\le x\le c\\ c, &{}x>c \end{array}\right. } \end{aligned}$$
(10)
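A minimal PyTorch sketch of the quantization simulation block (Eqs. (9)–(10)) is given below; following common practice for simulated quantization, the integer values are mapped back to real values in the forward pass, and s and z are assumed to be learnable or calibrated parameters.

```python
import torch

def simulate_quantization(v, s, z, n_bits=8):
    """Quantization simulation: map v to the integer grid (Eq. 9), clamp it (Eq. 10),
    then dequantize so the rest of the network still operates on real values."""
    q_max = 2 ** n_bits - 1
    v_int = torch.clamp(torch.round(v / s) + z, 0, q_max)
    return (v_int - z) * s
```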
To fine-tune such a network, we need to back-propagate through the quantization simulation block. However, the gradient of the round-to-nearest operation is not well defined. To approximate the gradient, we use the straight-through estimator, which treats the gradient of the round-to-nearest operator as 1. With this approximation, we can use the standard back-propagation algorithm to fine-tune our MVCNN network with the quantization simulation blocks. After quantization, the quantized MVCNN network can be deployed on the mobile device. To use the DSP for network inference, we remove the data operations in the MVCNN network that exceed four dimensions by converting the 5-dimensional operations in the network structure into 4-dimensional ones. For example, we convert an operator from the shape (3, 12, 1024, 7, 7) to (3, 12, 1024, 7*7). In the DSP environment, the network executes quickly on the mobile device, at a speed close to that of running an unquantized model on a cloud server with a powerful GPU.
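For illustration, the straight-through estimator and the 5-D to 4-D conversion can be written as follows; `simulate_quantization` refers to the sketch above, and the tensor shape is the example given in the text.

```python
import torch

def fake_quant_ste(v, s, z, n_bits=8):
    """Straight-through estimator: the forward pass uses the simulated quantized
    value, while the backward pass treats the rounding as identity (gradient 1)."""
    v_q = simulate_quantization(v, s, z, n_bits)   # sketch from the previous block
    return v + (v_q - v).detach()

# DSP deployment: collapse the last two spatial dimensions so that no tensor
# operation exceeds four dimensions, e.g. (3, 12, 1024, 7, 7) -> (3, 12, 1024, 49).
x = torch.randn(3, 12, 1024, 7, 7)
x_4d = x.flatten(-2)
```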