A cloud-oriented siamese network object tracking algorithm with attention network and adaptive loss function

Aiming at the problems of low success rate and weak robustness of siamese-network-based object tracking algorithms in complex scenes with occlusion, deformation, and rotation, a siamese network object tracking algorithm with an attention network and an adaptive loss function (SiamANAL) is proposed. Firstly, the multi-layer feature fusion module for the template branch (MFFMT) and the multi-layer feature fusion module for the search branch (MFFMS) are designed. A modified convolutional neural network (CNN) performs feature extraction through these fusion modules to solve the problem of feature loss caused by an overly deep network. Secondly, an attention network is introduced into the SiamANAL algorithm to compute the attention of the template map features and search map features, which enhances the features of the object region, reduces the interference of the background region, and improves the accuracy of the algorithm. Finally, an adaptive loss function combining the pairwise Gaussian loss function and the cross entropy loss function is designed to increase the inter-class separation and intra-class compactness of the classification branch and to improve the accuracy rate of classification and the success rate of regression. The effectiveness of the proposed algorithm is verified by comparing it with other popular algorithms on two popular benchmarks, visual object tracking 2018 (VOT2018) and object tracking benchmark 100 (OTB100). Extensive experiments demonstrate that the proposed tracker achieves competitive performance against state-of-the-art trackers. The success rate and precision rate of SiamANAL on OTB100 are 0.709 and 0.883, respectively. With the help of cloud computing services and data storage, the processing performance of the proposed algorithm can be further improved.


Introduction
The task of object tracking [1,2] is to stably locate the object to be tracked in subsequent frames when the size and location of the object in the first frame of the video sequence are given. To make up for the limited computing and storage resources of a single computer, video sequences can be deployed in the cloud, and cloud computing technology can be used to further improve the tracking performance. Object tracking is currently applied in various fields of artificial intelligence, such as intelligent monitoring based on edge-cloud computing [3], vision-based human-computer interaction [4,5], intelligent transportation, and autonomous driving [6].
Object tracking algorithms are mainly divided into two types, namely, generative models and discriminative models. Generative models, such as optical flow methods [7,8] and the mean shift algorithm [9][10][11], have difficulty resisting scale changes, deformation, and interference from similar objects. The mainstream discriminative object tracking methods are discussed in the Related work section. The main contributions of this study are as follows:
• A multi-layer feature fusion module using a modified ResNet50 network is proposed, which fuses the hierarchical features of the last three layers of the ResNet50 network to avoid missing important features in the process of feature extraction.
• Aiming at the limited accuracy of tracking algorithms, an attention network is introduced to encode the self-attention and cross-attention of feature maps. The features of elements with rich semantic information about the object are enlarged while those of irrelevant elements are reduced, and the generalization ability of the search map features is improved.
• To improve the accuracy of object classification, the cross entropy loss function and the pairwise Gaussian loss function are combined in the classification branch to increase inter-class separation and intra-class compactness.
• By comparing the proposed SiamANAL algorithm with other trackers on existing mainstream object tracking datasets, it is verified that the accuracy and robustness of the proposed algorithm are significantly improved.
The remainder of this study is organized as follows. The Related work section summarizes and discusses existing methods of object tracking based on siamese networks. The Proposed SiamANAL algorithm section describes the overall framework of the proposed algorithm and constructs the feature extraction network, the self-attention and cross-attention networks, and the classification-regression subnetwork. The Result analysis and discussion section verifies the tracking effect of the proposed algorithm on different datasets and carries out quantitative and qualitative analysis and discussion against comparison algorithms. Finally, the Conclusion section summarizes the conclusions.

Related work
A typical siamese network consists of two branches with shared parameters, namely, a template branch representing the object features and a search branch representing the current search area. The template is usually obtained from the label box of the first frame in the video sequence and marked as Z, and the search area of each subsequent frame is marked as X. The siamese network takes the two branches Z and X as inputs and uses an offline-trained backbone network ϕ with shared weights to extract the features of the two branches; the parameter of the backbone network is θ. By correlating the features of the template branch and the search branch, the tracking response map of the current frame is obtained, in which the value at each position represents the score of the object at that position. The response map is calculated as follows:

f_θ(Z, X) = ϕ_θ(Z) ⋆ ϕ_θ(X) + b, (1)

where ⋆ denotes the cross-correlation operation and b represents the deviation term of simulated similarity deviation. In Eq. (1), the template Z performs an exhaustive search over the image X to obtain the similarity score of each position.
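The exhaustive search of Eq. (1) can be illustrated with a minimal NumPy sketch; the feature sizes and values below are toy placeholders, not actual backbone features:

```python
import numpy as np

def response_map(feat_z, feat_x, b=0.0):
    """Eq. (1): slide the template feature over the search feature and
    record the similarity score (inner product) at every position."""
    hz, wz, c = feat_z.shape
    hx, wx, _ = feat_x.shape
    out = np.zeros((hx - hz + 1, wx - wz + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feat_z * feat_x[i:i + hz, j:j + wz]) + b
    return out

# Toy features: a 2x2 template slid over a 4x4 search area.
z = np.ones((2, 2, 1))
x = np.zeros((4, 4, 1)); x[1:3, 1:3, 0] = 1.0   # "object" in the centre
r = response_map(z, x)
print(r.shape)          # (3, 3)
print(np.unravel_index(r.argmax(), r.shape))    # peak at the object position (1, 1)
```

The maximum of the response map coincides with the position where the search window best matches the template, which is exactly the matching principle used by the trackers discussed below.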
Generally speaking, the siamese network is trained offline on pairs (Z, X) and the corresponding real label y collected from many training videos, and the backbone parameter θ is continuously optimized during training. To match the maximum value of the response map f_θ(Z, X) with the object position, the loss is usually minimized on the training set, that is:

arg min_θ ℓ(y, f_θ(Z, X)), (2)

where ℓ is the training loss. Based on the above theories, tracking algorithms based on the siamese network have been modified to improve the tracking performance. The SiamFC (fully-convolutional siamese networks for object tracking) algorithm [20] first proposed the siamese structure, which has two inputs: one is the benchmark template manually labeled on the object in the first frame, and the other is the search candidate area of all the other frames in the tracking process. The purpose of the siamese structure is to find, in each frame, the area most similar to the reference template of the first frame. The design and optimization of the loss function play a key role in the tracking effect. The SiamRPN (high performance visual tracking with siamese region proposal network) algorithm [21] introduces the region proposal network (RPN) on top of SiamFC. The RPN sends the features extracted by the siamese neural network into a classification branch and a regression branch, and uses predefined anchor boxes as the reference for the regression of the bounding box, significantly improving the speed and accuracy of the tracking algorithm. Guo et al. [22] propose a fast universal transformation learning model, which can effectively learn changes in the appearance of the object and suppress the background, but its online learning sacrifices the real-time ability of the model. Wang et al. [23] explore the effects of different types of attention mechanisms on template map features in the SiamFC method, including general attention, residual attention, and channel attention.
However, that algorithm does not explore attention networks on search map features. He et al. [24] propose double feature branches, namely, a semantic branch and an appearance branch, which effectively improve the generalization of the algorithm; however, the two branches are trained separately and only combined during inference, and thus lack coupling. The SiamRPN++ (SiamRPN++: evolution of siamese visual tracking with very deep networks) algorithm [25], based on SiamRPN, uses a modified ResNet50 [19] network for feature extraction and also achieves good results. Other classical algorithms based on siamese networks include the Siam R-CNN (Siam R-CNN: visual tracking by re-detection) algorithm [26] and the SiamCAR (SiamCAR: siamese fully convolutional classification and regression for visual tracking) algorithm [27], and both achieve significant tracking effects. Chen et al. [28] propose a siamese network tracking algorithm based on the SiamRPN++ algorithm with online object classification to enhance the context information of the object and improve the robustness of the algorithm. Tan et al. [29] design a fully convolutional anchor-free siamese tracker, which classifies and predicts directly on pixels to improve the robustness of the tracker.
Although tracking models based on the siamese network improve the tracking performance while ensuring real-time tracking, models based on offline training find it difficult to effectively distinguish the tracked object from background information under dim ambient light. How to reduce tracking drift or tracking failure and improve the tracking success rate and robustness when the object is occluded, deformed, or otherwise disturbed remains a key research topic.

Proposed SiamANAL algorithm
The overall framework of the proposed SiamANAL algorithm is shown in Fig. 1 and is divided into four parts: the feature extraction of the siamese network, the self-attention network, the cross-attention network, and the classification-regression subnetwork. The main processing flow consists of the following four parts: 1. Feature extraction: the modified ResNet50 backbone with the fusion modules outputs the template map feature f(Z) and the search map feature f(X) (see 'Proposed feature extraction module' section for details). 2. Self-attention network: f(Z) and f(X) are encoded by the self-attention network to obtain f*(Z) and f*(X) (see 'Self-attention calculation' section for details). 3. Cross-attention network: the search map feature f*(X) is used as the query matrix and the template map feature f*(Z) is used as the coding matrix; they are input into the cross-attention network to obtain the cross-attention output f**(X) (see 'Cross-attention calculation' section for details). 4. Classification-regression subnetwork (see 'Classification and regression subnetwork' section for details).
As the inputs of the classification-regression subnetwork, f*(Z) and f**(X) undergo the depthwise cross-correlation operation. The pairwise Gaussian loss function and the cross entropy loss function are designed to achieve the classification and regression results of the bounding boxes.

Proposed feature extraction module
Low-level features of CNNs, such as edge, color and shape, provide rich position information and can deal with tracking problems in scenes such as illumination changes, but they are not robust to appearance deformation. High-level features better represent rich semantic features and are strongly robust to significant changes in the appearance of the object; however, their spatial resolution is too low to achieve accurate object localization. The object tracking effect can be improved by making full use of the different resolutions of low-level and high-level features, and many methods fuse the two to improve the tracking accuracy. Considering the above factors, the last three layers of the convolutional network, which carry both location features and semantic features, are selected to represent the object. The input of the template branch is the template map Z with size 127 × 127 × 3 and the input of the search branch is the search map X with size 255 × 255 × 3.
The template map feature f(Z) and the search map feature f(X) are output by the modified weight-sharing ResNet50 network, respectively. As the deep learning network becomes deeper, the extracted features become more and more abstract. To avoid losing useful features due to the deep network, a multi-layer feature fusion module is proposed, comprising MFFMT for the template branch and MFFMS for the search branch.

MFFMT
As shown in Fig. 2, MFFMT represents multi-feature extraction of the template branch.
Step 1: Extract the hierarchical features of the last three layers of the ResNet50 network for the template map Z.
Step 2: Compress these hierarchical features with a 1 × 1 convolution kernel so that the number of channels is kept consistent at 256.
Step 3: To reduce the amount of calculation in the template branch, the hierarchical features of the last three layers are center-cropped so that the size of each feature map is kept as 7 × 7 × 256.
Step 4: Concatenate the three feature maps to obtain a feature map with 3 × 256 channels and 7 × 7 spatial size.
Step 5: A ConvTranspose2d operation is used to obtain the feature map f(Z) with size 7 × 7 × 256, which contains all the useful information of the last three layers of the ResNet50 network.

MFFMS
As shown in Fig. 3, MFFMS represents multi-feature extraction of the search branch.
Step 1: Extract the hierarchical features of the last three layers of the ResNet50 network for the search map X.
Step 2: Compress these hierarchical features with a 1 × 1 convolution kernel so that the number of channels is kept consistent at 256.
Step 3: Concatenate the three feature maps to obtain a feature map with 3 × 256 channels and 31 × 31 spatial size.
Step 4: A ConvTranspose2d operation is used to obtain the feature map f(X) with size 31 × 31 × 256, which contains all the useful information of the last three layers of the ResNet50 network.
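The two fusion modules can be sketched in PyTorch as follows, assuming the last three ResNet50 stages output 512, 1024 and 2048 channels at a shared spatial resolution; the class and variable names are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Sketch of MFFMT/MFFMS: compress each of the last three hierarchical
    features to 256 channels with a 1x1 convolution, optionally centre-crop
    (template branch only), concatenate, then fuse 3*256 channels to 256."""
    def __init__(self, in_channels=(512, 1024, 2048), crop=None):
        super().__init__()
        self.compress = nn.ModuleList(
            [nn.Conv2d(c, 256, kernel_size=1) for c in in_channels])
        self.crop = crop  # e.g. 7 for the template branch, None for search
        self.fuse = nn.ConvTranspose2d(3 * 256, 256, kernel_size=1)

    def forward(self, feats):
        outs = []
        for f, conv in zip(feats, self.compress):
            f = conv(f)
            if self.crop is not None:  # centre crop to crop x crop
                s = (f.shape[-1] - self.crop) // 2
                f = f[..., s:s + self.crop, s:s + self.crop]
            outs.append(f)
        return self.fuse(torch.cat(outs, dim=1))

# Template branch crops to 7x7; search branch keeps 31x31.
mffmt = MultiLayerFusion(crop=7)
mffms = MultiLayerFusion(crop=None)
z_feats = [torch.randn(1, c, 15, 15) for c in (512, 1024, 2048)]
x_feats = [torch.randn(1, c, 31, 31) for c in (512, 1024, 2048)]
print(mffmt(z_feats).shape)   # torch.Size([1, 256, 7, 7])
print(mffms(x_feats).shape)   # torch.Size([1, 256, 31, 31])
```

A 1 × 1 ConvTranspose2d leaves the spatial size unchanged, so the fusion step only mixes the 3 × 256 concatenated channels back down to 256.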

Proposed attention network model
The most effective way to improve the accuracy of the algorithm is to improve the expression ability of the feature matrix, and an attention network can further improve the expression ability of the backbone features. The attention network scheme in this paper mainly includes self-attention and cross-attention. Self-attention encodes the correlation between feature elements and channels, which helps to highlight the feature elements that are useful for the object tracking task. Cross-attention encodes the element correlation between two different features and acts on both the search map features and the template map features. Letting the feature elements of the search map perform weight allocation in advance, according to the influence of the feature elements of the template map, is beneficial for improving the accuracy of the cross-correlation results.
Considering the impact on real-time performance, the lightweight attention network introduced in this paper is the Non-local network [30], which has minimal impact on the number of parameters and floating-point operations and can effectively improve the expression of the backbone features. Non-local is a non-local network operation, in contrast to local operations such as convolution and recurrent operations. It captures the long-distance dependency of each element in the input features, which is an extremely informative dependency. The structure diagram is shown in Fig. 4.
The inputs of the Non-local network are two matrices A ∈ R^(H₁×W₁×D) and B ∈ R^(H₂×W₂×D), where H_i and W_i represent the height and width of the matrices respectively, and D represents the number of channels. After matrix A is input, a residual matrix is calculated with matrix B. The inputs of the residual operation are query, key, and value, where query is assigned by matrix A, and key and value are assigned by matrix B. A 1 × 1 × D convolution is performed on each input matrix, followed by a matrix dimension transformation. After two matrix multiplication operations and a final 1 × 1 × D convolution, the output is the residual matrix A*. The final output Ā is obtained by adding the residual matrix A* to the original matrix A. To simplify the expression, the 1 × 1 × D convolutional encoding is denoted by the function Conv. The operation steps are expressed as follows:

query = Conv(A), key = Conv(B), value = Conv(B),

S = softmax((query)_M · ((key)_M)^T), (7)

A* = Conv(S · (value)_M), (8)

Ā = A* ⊕ A, (9)

where "·" represents the matrix multiplication operation, "⊕" represents the element-by-element addition of matrices, "T" represents the transpose operation of the matrix, and "(·)_M" represents combining the first and second (spatial) dimensions of the matrix into one.

The matrix multiplication in Eq. (7) has input dimensions (H₁W₁) × D and D × (H₂W₂), which means that the elements in the spatial dimension of matrix A and the elements in the spatial dimension of matrix B carry out the attention correlation operation one by one. The matrix multiplication in Eq. (8) has input dimensions (H₁W₁) × (H₂W₂) and (H₂W₂) × D, which injects the attention influence coefficients of matrix B into matrix A over the dimensions H₁, W₁ and D. By adding the attention influence coefficient matrix A* and A element by element through Eq. (9), the final output of the Non-local attention network is obtained.
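The Non-local block described above can be sketched in PyTorch as follows (1 × 1 convolutions for query/key/value, two matrix multiplications, residual addition); the channel count and module names are assumptions:

```python
import torch
import torch.nn as nn

class NonLocal(nn.Module):
    """Sketch of the Non-local block: query from matrix A, key/value from
    matrix B, attention map from the first matmul, residual connection."""
    def __init__(self, d=256):
        super().__init__()
        self.q = nn.Conv2d(d, d, 1)    # Conv(A)
        self.k = nn.Conv2d(d, d, 1)    # Conv(B)
        self.v = nn.Conv2d(d, d, 1)    # Conv(B)
        self.out = nn.Conv2d(d, d, 1)  # final 1x1xD convolution

    def forward(self, a, b):
        n, d, h1, w1 = a.shape
        q = self.q(a).flatten(2).transpose(1, 2)   # (N, H1*W1, D)
        k = self.k(b).flatten(2)                   # (N, D, H2*W2)
        v = self.v(b).flatten(2).transpose(1, 2)   # (N, H2*W2, D)
        s = torch.softmax(q @ k, dim=-1)           # attention map (N, H1*W1, H2*W2)
        res = (s @ v).transpose(1, 2).reshape(n, d, h1, w1)
        return a + self.out(res)                   # residual addition

nl = NonLocal(256)
f_z = torch.randn(1, 256, 7, 7)
f_x = torch.randn(1, 256, 31, 31)
print(nl(f_z, f_z).shape)   # self-attention on the template: torch.Size([1, 256, 7, 7])
print(nl(f_x, f_z).shape)   # cross-attention, query from search: torch.Size([1, 256, 31, 31])
```

The same module covers both uses below: self-attention feeds one feature map as both inputs, while cross-attention feeds the search feature as A and the template feature as B.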

Self-attention calculation
The self-attention non-local (SANL) network designed in this paper takes the feature matrix itself as the attention correlation, that is, it uses the search map features and the template map features as the inputs of the query, key and value matrices. For the template map attention network, the template map feature f(Z) is input into the Non-local attention network as both input matrices, that is, A = f(Z) and B = f(Z), and they are substituted into the Non-local attention network model to obtain the self-attention output as follows:

f*(Z) = NL(f(Z), f(Z)),

where NL(·,·) denotes the Non-local attention network and f*(Z) is the template map feature after SANL attention coding.
For the search map attention network, the search map feature f(X) is input into the Non-local attention network as both input matrices, that is, A = f(X) and B = f(X), and they are substituted into the Non-local attention network model to obtain the self-attention output as follows:

f*(X) = NL(f(X), f(X)),

where NL(·,·) denotes the Non-local attention network and f*(X) is the search map feature after SANL attention coding.
After the feature matrices f (Z) and f (X) are encoded by SANL network, the correlation between each feature element of the matrix and the other elements is calculated to obtain f * (Z) and f * (X) . Compared with the feature matrix without coding, the elements with tracking semantic information in f * (Z) and f * (X) are enhanced, so as to obtain better scores in the classification branch. The background elements in f * (Z) and f * (X) are weakened, which will cause less interference to the score results of the classification branch. The feature values of the final object elements with rich semantic information are enlarged, while those of irrelevant elements are reduced.

Cross-attention calculation
The cross-attention-non-local (CANL) network designed in this paper takes the features of the search map as the input of query matrix, and the features of the template map as the input of key and value matrices, that is, A = f * (X) and B = f * (Z) . CANL network structure diagram is shown in Fig. 5.
The template map feature f*(Z) is used to encode the search map feature f*(X), exerting the attention influence on it. The relevant elements in the template map enhance the features of the core semantic elements in the search map, and the cross-attention output is as follows:

f**(X) = NL(f*(X), f*(Z)),

where NL(·,·) denotes the Non-local attention network, f**(X) is the search map feature after CANL attention coding, f*(X) is the search map feature after SANL attention coding, and f*(Z) is the template map feature after SANL attention coding.
After the feature matrix f*(X) is encoded by the CANL network, each of its elements is correlated with each element of f*(Z) to obtain the output f**(X). Compared with f*(X), the feature elements of f**(X) that carry semantic information are enhanced by the influence of f*(Z), while irrelevant background elements are weakened. Before the cross-correlation in the classification and regression network, the search map features thus perceive the attributes of the template map features in advance, improving the generalization ability of the search map features. Thereafter, f*(Z) and f**(X) are sent to the classification branch and the regression branch, respectively.

Classification and regression subnetwork
The algorithm designed in this paper uses the RPN to achieve classification and regression of the object. The function of the classification branch is to distinguish the foreground (the location of the object) from the background (the location of non-objects), and the function of the regression branch is to determine the size of the object. If k anchor boxes with different scales are placed at each position, the classification features have 2k channels and the regression features have 4k channels. The template feature f(Z) and the search feature f(X) are depthwise cross-correlated to obtain the depth feature Y₁ ∈ R^(H×W×D), and the template feature f*(Z) after SANL network coding and the search feature f**(X) after CANL network coding are depthwise cross-correlated to obtain the depth feature Y₂ ∈ R^(H×W×D). Matrices Y₁ and Y₂ are added together to get the final output:

Y = (f(Z) ⋆ f(X)) ⊕ (f*(Z) ⋆ f**(X)),

where ⋆ denotes the channel-by-channel correlation operation and ⊕ denotes element-by-element addition.
Y ∈ R^(H×W×D) is divided into two branches for classification and regression. In the classification branch, the channels are compressed to 2k by a convolution with kernel size 1 × 1 to obtain Y_cls ∈ R^(H×W×2k). In the regression branch, the channels are compressed to 4k by a convolution with kernel size 1 × 1 to obtain Y_reg ∈ R^(H×W×4k).
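The depthwise cross-correlation and the two 1 × 1 head convolutions can be sketched as follows (tensor shapes follow the feature sizes used above; names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def depthwise_xcorr(x, z):
    """Channel-by-channel correlation: each channel of the template z is
    correlated only with the matching channel of the search feature x,
    implemented via grouped convolution."""
    n, c, h, w = x.shape
    out = F.conv2d(x.reshape(1, n * c, h, w),
                   z.reshape(n * c, 1, *z.shape[2:]),
                   groups=n * c)
    return out.reshape(n, c, out.shape[-2], out.shape[-1])

k = 5  # number of anchors per position
cls_head = nn.Conv2d(256, 2 * k, kernel_size=1)  # foreground/background
reg_head = nn.Conv2d(256, 4 * k, kernel_size=1)  # box offsets

f_z,  f_x  = torch.randn(1, 256, 7, 7), torch.randn(1, 256, 31, 31)  # f(Z), f(X)
f_z2, f_x2 = torch.randn(1, 256, 7, 7), torch.randn(1, 256, 31, 31)  # f*(Z), f**(X)
y = depthwise_xcorr(f_x, f_z) + depthwise_xcorr(f_x2, f_z2)  # Y = Y1 + Y2
print(y.shape)             # torch.Size([1, 256, 25, 25])
print(cls_head(y).shape)   # torch.Size([1, 10, 25, 25])  -> 2k channels
print(reg_head(y).shape)   # torch.Size([1, 20, 25, 25])  -> 4k channels
```

Note that `F.conv2d` computes a true cross-correlation (no kernel flipping), which is exactly the ⋆ operation needed here.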

Design of loss function for classification branch
It is assumed that there are N classes and K samples; the feature vector of an input sample is denoted x_i, and its category label is y_i ∈ [1, N]. The cross entropy loss of this sample is:

ℓ(x_i) = −log P_{i,y_i} = −log( exp(W_{y_i}ᵀ x_i) / Σ_{j=1}^{N} exp(W_jᵀ x_i) ),

where P_{i,y_i} represents the probability that the sample belongs to class y_i, W = [W₁, W₂, ..., W_N] represents the parameters of the last fully connected layer of the network, and β_j (j ∈ [1, N]) represents the angle between W_j and x_i. Suppose N = 2 and the class label of x_i is y₁; since W_jᵀ x_i = ‖W_j‖‖x_i‖cos β_j, the cross entropy loss of x_i can be expressed as:

ℓ(x_i) = −log( exp(‖W₁‖‖x_i‖cos β₁) / (exp(‖W₁‖‖x_i‖cos β₁) + exp(‖W₂‖‖x_i‖cos β₂)) ).

It can be seen that when the cross entropy loss ℓ(x_i) is minimized during network training, x_i gradually approaches the vector W₁ while moving away from W₂. Similarly, if the class label of x_i is y₂, it moves closer to W₂ and farther away from W₁. Therefore, the cross entropy loss function ignores intra-class compactness while maximizing inter-class separability. In the object tracking task, even the same object appears diversified due to different viewing angles and illumination. Therefore, the pairwise Gaussian loss (PGL) function [31] is introduced to calculate the classification loss and improve the intra-class compactness of the object. PGL is calculated as follows:

ℓ_PGL = y_ij (1 − exp(−η d_ij²)) + (1 − y_ij) exp(−η d_ij²),

where η represents the simplified proportional parameter of the Gaussian function, d_ij represents the Euclidean distance between two features, and y_ij indicates whether the two features have the same class label: if they do, y_ij is 1; otherwise, y_ij is 0. For two features that belong to the same object, y_ij is 1, and PGL can be expressed as:

ℓ_PGL = 1 − exp(−η d_ij²).

It can be seen that the bigger the Euclidean distance d_ij, the greater the loss ℓ_PGL, the greater the penalty imposed, and the higher the resulting intra-class compactness. Therefore, the cross entropy loss and the pairwise Gaussian loss are combined as the total loss function of the classification model.
The intra-class compactness is improved through PGL, and the inter-class separability is improved through the cross entropy loss function.
The loss function ℓ_cls of the classification branch can be expressed as follows:

ℓ_cls = ℓ(x_i) + ℓ_PGL.
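The combined classification loss can be sketched as follows, under one plausible reading of the PGL definition (the exact functional form in [31] may differ; η and the toy feature values are placeholders):

```python
import numpy as np

def pairwise_gaussian_loss(f_i, f_j, same_class, eta=0.5):
    """One reading of PGL: a Gaussian similarity g = exp(-eta * d^2) of the
    Euclidean distance d between two features; same-class pairs are penalised
    by 1 - g (far apart -> large loss), different-class pairs by g."""
    d2 = np.sum((f_i - f_j) ** 2)
    g = np.exp(-eta * d2)
    return (1.0 - g) if same_class else g

def cross_entropy(logits, label):
    """Plain cross entropy: -log of the softmax probability of the label."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[label])

# A same-class pair that is far apart is penalised more than a nearby one,
# which is what drives intra-class compactness.
near = pairwise_gaussian_loss(np.zeros(4), 0.1 * np.ones(4), True)
far  = pairwise_gaussian_loss(np.zeros(4), 2.0 * np.ones(4), True)
print(near < far)   # True

# Combined classification loss for one sample and one feature pair.
l_cls = cross_entropy(np.array([2.0, -1.0]), 0) + near
```

The cross entropy term pushes classes apart while the Gaussian term pulls same-object features together, matching the stated design intent.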

Design of loss function for regression branch
The smooth L1 loss function is used to train the regression branch. We let (x, y) and (w, h) represent the center coordinate and the size of the anchor box, and (x₀, y₀) and (w₀, h₀) represent the center coordinate and the size of the ground-truth box. After normalization of the regression distance, Eq. (21) is obtained:

δ[0] = (x₀ − x)/w, δ[1] = (y₀ − y)/h, δ[2] = ln(w₀/w), δ[3] = ln(h₀/h). (21)

The smooth L1 loss with the hyper-parameter σ of the Huber loss is:

smooth_L1(x, σ) = 0.5σ²x², if |x| < 1/σ²; |x| − 1/(2σ²), otherwise.

The loss of the regression branch is calculated as follows:

ℓ_reg = Σ_{i=0}^{3} smooth_L1(δ[i], σ).

The final loss of the classification and regression network is calculated as follows:

ℓ_total = λ₁ ℓ_cls + λ₂ ℓ_reg,

where λ₁ and λ₂ are hyper-parameters that balance the classification loss and the regression loss. During model training, λ₁ and λ₂ are set to 1.8 and 2.5, respectively.
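The normalized regression distances of Eq. (21) and a smooth L1 penalty can be sketched as follows (σ, the box values, and the assumed classification loss value are placeholders):

```python
import numpy as np

def regression_targets(anchor, gt):
    """Eq. (21): normalised distances between the anchor box (x, y, w, h)
    and the ground-truth box (x0, y0, w0, h0)."""
    x, y, w, h = anchor
    x0, y0, w0, h0 = gt
    return np.array([(x0 - x) / w, (y0 - y) / h,
                     np.log(w0 / w), np.log(h0 / h)])

def smooth_l1(x, sigma=3.0):
    """Smooth L1 (Huber-style): quadratic near zero, linear beyond 1/sigma^2."""
    x = np.abs(x)
    return np.where(x < 1.0 / sigma ** 2,
                    0.5 * sigma ** 2 * x ** 2,
                    x - 0.5 / sigma ** 2)

delta = regression_targets((10, 10, 20, 20), (12, 9, 22, 18))
l_reg = smooth_l1(delta).sum()           # regression-branch loss
l_total = 1.8 * 0.4 + 2.5 * l_reg        # stated weights 1.8 / 2.5; 0.4 is an assumed l_cls
print(delta)   # approximately [0.1, -0.05, 0.095, -0.105]
```

Normalizing by the anchor size makes the targets scale-invariant, so large and small objects contribute comparably to the regression loss.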

Algorithm flow
The classification and regression subnetwork is used to locate the object, in which the classification branch distinguishes foreground and background, and the regression branch determines the size of the object. The main working steps of the proposed algorithm are shown as follows:

Implementation details
In this paper, the algorithm is built on the PyTorch deep learning framework. The GPU is an NVIDIA GeForce GTX 1080 and the processor is an Intel Core i7-8550U CPU at 2.0 GHz. ResNet50 is initialized with weights pre-trained on ImageNet [32], leaving the parameters of the first two layers unchanged. During the training phase, stochastic gradient descent (SGD) is used to calculate the loss functions of the different layer features, and the gradients are then computed to optimize the network parameters. The training datasets are ImageNet DET [32], COCO [33] and LaSOT [34]. The datasets used for testing include visual object tracking 2018 (VOT2018) [35] and object tracking benchmark 100 (OTB100) [36]. The batch size is 64 and the training lasts 30 epochs: in the first 15 epochs, the learning rate decreases from 0.01 to 0.005, and in the last 15 epochs, it decreases from 0.005 to 0.0005. The number of anchor boxes used for the classification and regression subnetwork is set to k = 5.

Evaluation indicator

Evaluation indicator for VOT
The evaluation indicators used on the VOT dataset include accuracy, robustness, and expected average overlap (EAO). The accuracy rate is used to evaluate the accuracy of the tracker; as the accuracy increases, the success rate increases. In each frame, the tracking accuracy is represented by the intersection over union (IoU), which is defined as:

IoU = (B_G ∩ B_T) / (B_G ∪ B_T),

where B_G represents the manually marked boundary box and B_T represents the predicted boundary box.
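For boxes given as corner coordinates, the IoU can be computed as:

```python
def iou(box_g, box_t):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2):
    intersection area divided by union area."""
    x1 = max(box_g[0], box_t[0]); y1 = max(box_g[1], box_t[1])
    x2 = min(box_g[2], box_t[2]); y2 = min(box_g[3], box_t[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    return inter / (area_g + area_t - inter)

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0 (perfect overlap)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333... (box shifted by half)
```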
Robustness is used to evaluate the stability of the tracker. The more times the tracker restarts, the greater the robustness value, indicating that the tracker is more unstable. EAO is an indicator derived from the comprehensive evaluation of the intersection ratio, restart interval, and restart times, which can reflect the comprehensive performance of the tracker.

Evaluation indicator for OTB
The evaluation indicators of the OTB dataset are the success rate and the precision rate. The success rate is the rate of tracking success across all video frames: a threshold is set, and the intersection over union is used to determine whether tracking in a frame is successful. The precision rate concerns whether the object center position predicted by the algorithm is close to the marked center position; it represents the percentage of center location errors between the predicted position and the ground truth under different thresholds.
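The two OTB indicators can be sketched as follows (the per-frame values are toy placeholders; OTB conventionally reports precision at a 20-pixel threshold and summarizes the success curve by its area under the curve):

```python
import numpy as np

def success_rate(ious, thresholds=np.linspace(0, 1, 21)):
    """Fraction of frames whose IoU exceeds each threshold; the OTB success
    plot is this curve over the thresholds."""
    ious = np.asarray(ious)
    return np.array([(ious > t).mean() for t in thresholds])

def precision_rate(cles, threshold=20.0):
    """Fraction of frames whose centre location error (in pixels) falls
    below the given threshold."""
    return (np.asarray(cles) < threshold).mean()

ious = [0.9, 0.8, 0.4, 0.0]      # per-frame overlaps (toy values)
cles = [3.0, 8.0, 25.0, 120.0]   # per-frame centre errors in pixels
print(success_rate(ious, np.array([0.5])))  # two of four frames exceed IoU 0.5
print(precision_rate(cles))                 # two of four frames within 20 px
```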

Ablation experiment
To verify the performance of the proposed SiamANAL algorithm, an ablation experiment analyzing each component is performed on OTB100. The baseline algorithm used for comparison is SiamFC, and the independent role of each component of the algorithm is experimentally tested. The benchmark algorithm is represented by BaseLine, the tracking result without the fusion module by BaseLine_UN_M, the multi-feature fusion module by BaseLine_M, the self-attention network by BaseLine_M_SANL, the cross-attention network by BaseLine_M_CANL, the self-attention and cross-attention networks together by BaseLine_M_SANL_CANL, and the fusion of all components by BaseLine_SUM. The experimental results are shown in Table 1. It can be seen from Table 1 that BaseLine_SUM, adopted by the proposed SiamANAL algorithm, achieves the best tracking result on OTB100 by using the multi-layer feature fusion module, the self-attention network, and the cross-attention network. The cross-attention network BaseLine_M_CANL has obvious advantages in improving the success rate and precision rate, and plays a greater role than the multi-feature fusion module BaseLine_M. By selecting the attention of the feature map, the background interference in the object region can be filtered out to enhance the expression ability of the object region and effectively improve the tracking performance.

Quantitative experiment and analysis
The proposed SiamANAL algorithm achieves excellent results in extensive testing on VOT2018 and OTB100 compared with other competing tracking algorithms.

Result analysis on VOT2018
The VOT dataset is a classic object tracking test dataset, first proposed by the VOT challenge in 2013, and its content is updated every year. The VOT2018 dataset contains a total of 60 video sequences, all of which are marked with the following visual attributes: occlusion, illumination variation, motion variation, size variation, and camera motion. As shown in Table 2, the proposed SiamANAL algorithm is compared with the KCF [15], Staple [37], SiamFC [20], SiamRPN [21], SiamRPN++ [25], and Siam R-CNN [26] tracking algorithms. The SiamANAL algorithm obtains high accuracy and EAO values.
In conclusion, the SiamANAL algorithm shows good tracking performance on VOT2018. Compared with the SiamRPN algorithm, the accuracy rate is improved by 0.066, the robustness is improved by 0.072, and the EAO is improved by 0.020. The attention network structure enhances the expression of the core semantic elements in the template map features and search map features, thus improving the accuracy of the tracking box during the tracking process.
To compare testing results more intuitively, the comparison results are displayed in the form of a histogram as shown in Fig. 6.
In scenes with background interference, there may be complex background and similar objects. The attention network can weaken background feature elements, thus reducing the influence of background on tracking effect.

Result analysis on OTB100
The OTB100 benchmark divides object tracking scenes into 11 types of visual challenge attributes and labels each video sequence with its challenge attributes; each video sequence may carry more than one attribute tag. In this way, the tracking ability of the algorithm under different challenge attributes can be analyzed. The 11 visual challenge attributes are: scale variation (SV), illumination variation (IV), motion blur (MB), deformation (DEF), occlusion (OCC), out-of-plane rotation (OPR), fast motion (FM), background clutter (BC), out-of-view (OV), in-plane rotation (IPR), and low resolution (LR). The algorithm starts tracking from the real position of the object in the first frame, and the tracking precision and success rate are obtained using one pass evaluation (OPE). As shown in Fig. 7, the proposed SiamANAL algorithm is compared with the KCF [15], Staple [37], SiamFC [20], SiamRPN [21], SiamRPN++ [25], and Siam R-CNN [26] tracking algorithms. The success rate and precision rate of the SiamANAL algorithm rank first. Compared with SiamRPN, the success rate and precision rate are improved by 4.7% and 10.6%, respectively. This proves that the extracted features have strong discrimination ability and that the design of the loss function in the classification and regression subnetwork is effective. For video sequences with challenging attributes, the tracking results are compared in Fig. 8. In video sequences with the BC attribute, the designed attention network can effectively filter out background information and enhance the features of the object position, achieving a high precision rate and success rate; the proposed algorithm thus performs well in BC scenes. However, in video sequences with the OPR attribute, the precision score of the proposed SiamANAL algorithm is second, and the object position located by the regression branch shows a certain deviation.
Table 3 further reports the precision and center location error (CLE) indicators of each comparison algorithm on OTB100. The SiamANAL algorithm exceeds the comparison algorithms in both precision and CLE on the OTB100 dataset. The CLE obtained by SiamANAL is 9.6, significantly lower (i.e., better) than the 14.3 of SiamRPN++, the second-best tracker. This verifies that the non-local attention network, by performing self-attention and cross-attention calculations on the features, enhances the expressive ability of the deep features while reducing the parameter count and computational cost of the CNN.
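The self-attention and cross-attention operations credited above can be sketched as follows. This is a minimal NumPy reimplementation of a generic non-local block, not the paper's exact architecture; the feature-map sizes, the scaled dot-product similarity, and all function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_attention(query_feat, key_feat):
    """Non-local attention: every query position aggregates features
    from all key positions, weighted by embedded similarity.
    query_feat: (Nq, C), key_feat: (Nk, C) flattened feature maps.
    Self-attention passes the same map as both arguments;
    cross-attention passes the other branch's map as key_feat."""
    c = query_feat.shape[1]
    weights = softmax(query_feat @ key_feat.T / np.sqrt(c))  # (Nq, Nk)
    out = weights @ key_feat                                 # aggregated response
    return query_feat + out  # residual connection keeps the original features

np.random.seed(0)
# template branch attends to itself (self-attention) ...
z = np.random.randn(36, 256)    # e.g. a 6x6 template map with 256 channels
z_sa = nonlocal_attention(z, z)
# ... then the search branch attends to the template (cross-attention)
x = np.random.randn(484, 256)   # e.g. a 22x22 search map
x_ca = nonlocal_attention(x, z_sa)
```

Because the attention weights are computed over flattened spatial positions, object-region responses in the search map are reinforced wherever they resemble the template, which is the mechanism the ablation above attributes the CLE gain to.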

Performance and speed analysis
To verify the real-time tracking performance of the SiamANAL algorithm, it is compared with the other algorithms on OTB100 in terms of success rate and speed. High-performance algorithms are usually designed to maximize tracking accuracy, but this compromises real-time operation; conversely, some simple algorithms run in real time but their tracking accuracy is insufficient. As shown in Fig. 9, the SiamANAL algorithm achieves a high success rate at 49 FPS (frames per second): not the fastest, but well above the 25 FPS required for basic real-time tracking. The speed of the SiamANAL algorithm is slightly lower than that of SiamFC, but its tracking success rate is much higher. This is because the hierarchical features of the last three layers of the ResNet50 network are optimized by the fusion module, making the extracted features more discriminative. The floating-point computation of the SiamANAL algorithm is 5.821 GFLOPs (giga floating-point operations), of which the introduced attention network accounts for 0.434 GFLOPs, about 7.5% of the total. The lightweight attention network therefore adds very little computation to the real-time tracking task, so the algorithm retains good real-time performance. Furthermore, at this frame rate large inter-frame displacements caused by high-speed motion rarely occur, which also helps the algorithm track the object more accurately.

Qualitative experiment and analysis
To intuitively illustrate the accuracy of different algorithms, the tracking results on representative sequences are compared and analyzed. Figure 10 shows a visual comparison with six popular algorithms (KCF [15], Staple [37], SiamFC [20], SiamRPN [21], SiamRPN++ [25], and Siam R-CNN [26]) on four typical video sequences: Dog, Tiger1, Matrix, and Lemming.
In the video sequence with the LR and DEF attributes (Fig. 10a, Dog), the object occupies few pixels and lacks detailed features, but the attention network enhances the semantic expression of the object's feature elements, so the tracked object is identified more accurately.
In the video sequence with the IV attribute (Fig. 10b, Tiger1), the proposed algorithm effectively overcomes the influence of illumination variation and achieves robust tracking. The tracking results show that the multi-feature fusion module enhances the features of the object region and reduces background interference.
In the video sequence with the FM and BC attributes (Fig. 10c, Matrix), the object's details are weakened by motion blur. The attention network nevertheless enhances the semantic expressiveness of the object, so the tracked object is still located accurately even under blur.
In the video sequence with the OCC attribute (Fig. 10d, Lemming), the proposed SiamANAL algorithm can accurately relocate the object after it reappears from occlusion, using the classification and regression subnetworks guided by the template map.
The other comparison algorithms achieve good tracking results on the Dog and Tiger1 sequences, reducing the influence of low resolution and illumination variation on object localization. However, in the scene where the object is occluded, they exhibit varying degrees of tracking drift and cannot relocate the object after it reappears.

Conclusion
The object tracking algorithm SiamANAL based on the siamese network is designed by introducing an attention network and an adaptive loss function. The following conclusions can be drawn: (1) The multi-feature fusion module integrates hierarchical features from the last three layers of the ResNet50 network, which solves the problem of partial feature loss caused by an overly deep network. (2) The self-attention and cross-attention modules in the attention network compute attention over the template and search feature maps, so that the resulting features highlight the object area and the tracking process pays more attention to the object. (3) Two loss functions, cross entropy loss and pairwise Gaussian loss, are combined to maximize intra-class compactness and inter-class separability and improve the accuracy of object classification. (4) Quantitative and qualitative analyses of the tracking results on VOT2018 and OTB100 show that the proposed SiamANAL algorithm performs well both overall and on various challenging video sequences.
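Conclusion (3) can be illustrated with a minimal sketch of such a combined classification loss. The Gaussian form of the pairwise term and the fixed weighting below are assumptions for illustration; the paper's adaptive formulation balances the two terms during training.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Standard softmax cross-entropy over foreground/background logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def pairwise_gaussian(feats, labels, sigma=1.0):
    """Pairwise Gaussian loss: pull same-class embeddings together
    (intra-class compactness) and push different-class embeddings
    apart (inter-class separability)."""
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    g = np.exp(-d2 / (2.0 * sigma ** 2))        # similarity in (0, 1]
    same = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)                 # skip self-pairs
    diff = 1.0 - same
    np.fill_diagonal(diff, 0.0)
    # same-class pairs should drive g -> 1, different-class pairs g -> 0
    loss = (same * (1.0 - g) + diff * g).sum() / max(same.sum() + diff.sum(), 1.0)
    return loss

def combined_loss(logits, feats, labels, alpha=0.5):
    # alpha is a fixed trade-off here, whereas an adaptive scheme
    # would adjust it as training progresses
    return cross_entropy(logits, labels) + alpha * pairwise_gaussian(feats, labels)
```

The cross-entropy term drives the per-sample class decision, while the pairwise term shapes the embedding geometry; together they tighten each class cluster while separating foreground from background, which is what improves the classification branch's accuracy.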
The tracking algorithm in this paper uses a fixed template map that is not updated during tracking, so the tracking results oscillate when the algorithm deals with long-term occlusion of the object. In future work, an effective object tracking method based on dual-template fusion will be designed, and the algorithm will be deployed in the cloud to further improve its robustness.