This work proposes a HATT-MSCNN-IBiLSTM architecture for predicting corporate sales from a digital management perspective. The method combines an MSCNN and an IBiLSTM to fuse the advantages of both networks for data processing. A hybrid attention mechanism is then proposed to further process the features and improve prediction performance.
Multi-scale convolutional neural network (CNN)
The CNN is a feedforward network with a deep structure based on convolution calculations. It has a strong capacity for feature expression and can perform translation-invariant classification of input features through its hierarchical layers.
The convolutional layer is the key to understanding the convolutional neural network. A large number of trainable parameters are concentrated in the convolution kernels, and the network is adjusted by the loss during learning: back-propagation computes gradients, and the parameters are continuously moved in the direction that minimizes the loss. By repeatedly feeding data and labels, the network is trained to obtain appropriate weight parameters; this is the process by which the network acquires knowledge. The convolution operation rests on two core ideas, parameter sharing and local (sparse) connectivity, both intended to reduce memory usage and speed up computation. While achieving a suitable improvement in accuracy with less computation, this also reduces overfitting. The convolution operation is given by Eq. (1):
$${x}_j^l=f\left({\sum}_i{x}_i^{l-1}\ast {k}_{ij}^l+{b}_j^l\right)$$
(1)
Where x denotes a feature map, k is the convolution kernel, b is the bias, and f(·) is the activation function.
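As a concrete illustration of Eq. (1), the following minimal NumPy sketch (not the authors' implementation) computes one convolutional layer: each output map sums the convolutions of all input maps with the corresponding kernels, adds a bias, and applies an activation.

```python
# Minimal sketch of Eq. (1): output map j sums the 1-D convolutions of all input maps i
# with kernel k[i][j], adds a bias, and applies an activation f. Written in the
# cross-correlation form commonly used by deep learning frameworks.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv1d_layer(x, k, b, f=relu):
    """x: (in_channels, length), k: (in_channels, out_channels, kernel_size), b: (out_channels,)"""
    in_ch, length = x.shape
    _, out_ch, ksize = k.shape
    out_len = length - ksize + 1                      # 'valid' convolution, stride 1
    y = np.zeros((out_ch, out_len))
    for j in range(out_ch):                           # output feature map x_j^l
        for i in range(in_ch):                        # sum over input maps x_i^{l-1}
            for t in range(out_len):
                y[j, t] += np.dot(x[i, t:t + ksize], k[i, j])
        y[j] += b[j]                                  # bias b_j^l
    return f(y)                                       # activation f(.)
```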
After the convolutional layer extracts features from the input, they are passed to the next pooling layer. The main function of the pooling layer is feature selection and information filtering, which reduces the dimensionality, reduces the number of network parameters, lowers the probability of overfitting, and speeds up network computation. The pooling layer works by replacing a local region of the feature map with a single value and has no weight coefficients. The pooling operation is given by Eq. (2):
$${x}_j^l=f\left( down\left({x}_j^{l-1}\right)\right)$$
(2)
Where x denotes a feature map and down(·) is the down-sampling (pooling) function.
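A similarly minimal sketch of the pooling in Eq. (2), here using max pooling as the down(·) operation (the paper does not fix the pooling type at this point):

```python
# Sketch of Eq. (2): a window statistic (here, the maximum) replaces each local region
# of the feature map; the pooling layer has no trainable weights.
import numpy as np

def max_pool1d(x, pool_size=2):
    """x: (channels, length) -> (channels, length // pool_size)"""
    ch, length = x.shape
    out_len = length // pool_size
    x = x[:, :out_len * pool_size].reshape(ch, out_len, pool_size)
    return x.max(axis=2)                              # down(.) in Eq. (2)
```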
Both convolutional layers and pooling layers operate on local features. The fully connected layer, in contrast, recombines the previously computed local features into complete information through a weight matrix. To minimize the loss of input information, the fully connected layer's primary role is to re-fit local features in a nonlinear fashion. One or more fully connected layers can be stacked before the output layer in order to better handle nonlinear problems.
In the output layer, the result of the previous fully connected layer is fed into a softmax function to classify the input. The softmax function is often used in multi-classification tasks; it computes the probability that the outputs of the preceding neurons belong to each class. The softmax operation is given by Eq. (3):
$${s}_i={e}^{z_i}/{\sum}_j{e}^{z_j}$$
(3)
Where zi is the ith element of the one-dimensional output vector.
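For completeness, a short sketch of Eq. (3); the max-subtraction is a standard numerical-stability detail not shown in the equation:

```python
# Sketch of the softmax in Eq. (3), subtracting the maximum before exponentiation
# for numerical stability.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))             # probabilities summing to 1
```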
From the perspective of learning feature information in the local receptive field, a convolution kernel with too small a scale and stride yields good feature resolution but cannot learn the low-frequency components of the signal well. Conversely, convolution kernels with larger scales and larger strides can learn information over longer time horizons, that is, the low-frequency components of the signal, but cannot capture its high-frequency components. To combine the advantages of different scales, this work adopts a multi-scale feature learning strategy. The one-dimensional multi-scale convolutional layer is shown in Fig. 1.
The one-dimensional multi-scale convolutional layer (1D-MSCL) applies one-dimensional convolution kernels of different scales to the original features to perform multi-scale feature extraction, improving the robustness and discriminability of the features. Based on this multi-scale convolutional layer, the multi-scale convolutional network designed in this work is shown in Fig. 2.
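The following PyTorch sketch illustrates one possible form of the 1D-MSCL: parallel branches with different kernel sizes are applied to the same input and concatenated along the channel dimension. The kernel sizes (3, 5, 7) and filter counts are illustrative assumptions, not values specified by the paper.

```python
# Possible sketch of a 1-D multi-scale convolutional layer: several kernel scales in
# parallel, outputs concatenated channel-wise.
import torch
import torch.nn as nn

class MultiScaleConv1d(nn.Module):
    def __init__(self, in_channels, out_channels_per_branch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_channels, out_channels_per_branch, k, padding=k // 2),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])

    def forward(self, x):                              # x: (batch, channels, length)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: 1 input channel, 16 filters per branch, 3 branches -> 48 output channels
layer = MultiScaleConv1d(1, 16)
y = layer(torch.randn(8, 1, 64))                       # y: (8, 48, 64)
```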
Improved bi-directional long short-term memory
CNNs have no memory; they process each input individually without keeping any state between inputs, which means consecutive inputs and outputs cannot be linked. For such a network to process a time series, the entire sequence has to be presented to the network at once, which leads to a huge number of parameters and excessive computation. When mapping between input and output sequences, recurrent neural networks have the advantage of using the contextual information retained in their hidden states. Standard RNN architectures, however, can only access a limited range of context: the effect of a particular input on the hidden layer, and hence on the network output, decays exponentially over time, so later time steps receive very limited information from earlier parts of the sequence, which can cause training to fail.
The LSTM model is a variant of the standard RNN that can pass information from the past across many time steps. It processes the time series step by step and filters information through a gating mechanism. The LSTM introduces gated units and a cell state; Fig. 3 illustrates one time-step unit of the LSTM model.
The LSTM has three gates: (i) the forget gate, (ii) the input gate, and (iii) the output gate. The forget gate decides which information should be discarded from the current state, as in Eq. (4). The input gate decides what information to add to the memory state, as in Eq. (5). The internal memory cell selects part of the candidate memory state to obtain the new memory state, as in Eqs. (6) and (7). The output gate determines what to output based on the input and the memory cell, as in Eqs. (8) and (9).
$${f}_t=\upsigma \left({W}_f{x}_t+{U}_f{h}_{t-1}+{b}_f\right)$$
(4)
$${i}_t=\upsigma \left({W}_i{x}_t+{U}_i{h}_{t-1}+{b}_i\right)$$
(5)
$$\tilde{c}_t=\mathit{\tanh}\left({W}_c{x}_t+{U}_c{h}_{t-1}+{b}_c\right)$$
(6)
$${c}_t={f}_t\ast {c}_{t-1}+{i}_t\ast \tilde{c}_t$$
(7)
$${o}_t=\upsigma \left({W}_o{x}_t+{U}_o{h}_{t-1}+{b}_o\right)$$
(8)
$${h}_t={o}_t\ast \tanh \left({c}_t\right)$$
(9)
Where Wf, Wi, Wc, Wo and Uf, Ui, Uc, Uo are the weight matrices of the forget gate, input gate, candidate cell state, and output gate, and bf, bi, bc, bo are the corresponding biases; ht−1 and ct−1 are the hidden state and cell state of the previous time step.
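A step-by-step NumPy sketch of one LSTM time step following Eqs. (4)-(9) (framework implementations fuse these operations into a single kernel):

```python
# One LSTM time step, written out to mirror Eqs. (4)-(9).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b are dicts keyed by 'f', 'i', 'c', 'o' (forget, input, candidate, output)."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # Eq. (4) forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # Eq. (5) input gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # Eq. (6) candidate state
    c_t = f_t * c_prev + i_t * c_tilde                          # Eq. (7) cell update
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # Eq. (8) output gate
    h_t = o_t * np.tanh(c_t)                                    # Eq. (9) hidden state
    return h_t, c_t
```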
The LSTM model can extract forward features from time-series data. However, time-series data are related not only to previous data but also, to some extent, to subsequent data. As a variant structure, BiLSTM includes both a forward propagation layer and a backward propagation layer of the LSTM. Figure 4 is a schematic diagram of the BiLSTM approach.
This work addresses the problems of manual parameter tuning and slow convergence in the BiLSTM network. PSO is used to optimize its hyper-parameters: the hyper-parameters to be optimized are mapped to particles, and each particle shares its individual extremum, compares it with the global extremum, and continuously updates its position and velocity in an iterative search. The inertia weight is positively correlated with a particle's global search ability and negatively correlated with its local search ability. In the standard PSO, the inertia weight is fixed, which limits the particles' capacity to find the global or local optimum. Consequently, this study optimizes the inertia weight factor and employs a nonlinear inertia weight, given by Eq. (10), to enhance the performance of the particle swarm optimization and build the IBiLSTM model.
$$w={\left({w}_{max}-{w}_{min}\right)}^{t/{t}_{max}}$$
(10)
Where wmax is the maximum inertia weight, wmin is the minimum inertia weight, t is the current iteration, and tmax is the maximum number of iterations.
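A small sketch of how the nonlinear inertia weight of Eq. (10) would enter a standard PSO velocity update; the cognitive and social coefficients c1 and c2 and the random vectors are the usual PSO ingredients and are assumptions here, not values given in the paper.

```python
# Nonlinear inertia weight of Eq. (10) inside a standard PSO velocity update.
import numpy as np

def inertia_weight(t, t_max, w_max=0.9, w_min=0.4):
    return (w_max - w_min) ** (t / t_max)              # Eq. (10): nonlinear decay with t

def pso_velocity_update(v, x, p_best, g_best, t, t_max, c1=2.0, c2=2.0):
    w = inertia_weight(t, t_max)
    r1, r2 = np.random.rand(*x.shape), np.random.rand(*x.shape)
    return w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
```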
The hybrid attention mechanism
The attention mechanism largely borrows its principle from the human visual attention mechanism; its core is weight distribution. Mathematically and computationally, it can be understood as a weighted summation over sequence information; formally, as a key-value query; and in a physical sense, as a similarity calculation. For neural network models, the importance of information is reflected by weights.
The self-attention mechanism obtains the Query, Key, and Value from the input sequence through three sets of linear transformations. The similarity of each Query to each Key is then computed as a score, which is passed through a softmax to obtain the weights over the input sequence. Finally, the attention weights are applied to the Values, and a weighted summation produces the output sequence. The advantages of self-attention are that it can be computed in parallel and can directly attend to the entire input sequence. The calculation is:
$$Y= softmax\left(Q{K}^T/\sqrt{d_K}\right)V$$
(14)
Where Q, K, and V are obtained from the input X through the parameter matrices WQ, WK, and WV, and dK is the dimension of the key vectors.
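A NumPy sketch of the scaled dot-product self-attention of Eq. (14), with Q, K, and V obtained from X through the parameter matrices:

```python
# Scaled dot-product self-attention as in Eq. (14): scores are scaled by sqrt(d_K),
# normalized with a row-wise softmax, and used to weight V.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                  # Eq. (14)
```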
A variant is the multi-head self-attention mechanism, whose advantage is that different heads can focus on different information. Specifically, multiple sets of Query, Key, and Value are obtained through multiple linear transformations of the input sequence; the intermediate computation is the same as in self-attention. After the multiple output sequences are obtained, the outputs at the same time point are concatenated along the channel dimension. If the concatenated dimension does not match the input dimension of the next layer, an additional linear layer can transform it to the required dimension. The number of intermediate groups corresponds to the number of heads. The multi-head self-attention is shown in Fig. 5.
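The multi-head variant can be expressed directly with PyTorch's built-in module; the configuration below (4 heads over a 64-dimensional sequence) is purely illustrative, not the paper's setting.

```python
# Usage example of multi-head self-attention via PyTorch's built-in module.
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(8, 20, 64)                             # (batch, seq_len, embed_dim)
out, attn_weights = mha(x, x, x)                       # self-attention: Q = K = V = x
```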
The role of channel attention is to focus on the channel dimension of the feature map and obtain a set of attention weights over the channels. The feature map on each channel is equivalent to a feature extracted from the original feature map, so channel attention helps the network extract meaningful features. The upper-layer feature map is used as the input, and adaptive average pooling and adaptive max pooling are applied along the spatial dimensions to compress the spatial information into two channel descriptors. These pass through two 1 × 1 convolutions with shared parameters, which first reduce and then restore the dimensionality; the purpose of the dimensionality reduction is to reduce the number of parameters. The outputs of the two branches are then added element-wise and a sigmoid is applied to generate the channel attention weight scores. Finally, the channel attention weights are applied to the original input feature map, yielding the feature map with channel attention. The channel attention module is shown in Fig. 6.
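A CBAM-style PyTorch sketch matching the description above, written for a 2-D feature map; the reduction ratio r = 8 is an assumed hyper-parameter.

```python
# Channel attention sketch: spatial average and max pooling, a shared two-layer 1x1
# bottleneck, element-wise addition, sigmoid, and re-weighting of the input.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.shared_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # dimensionality reduction
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),  # dimensionality increase
        )

    def forward(self, x):                              # x: (batch, channels, H, W)
        avg = self.shared_mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.shared_mlp(x.amax(dim=(2, 3), keepdim=True))
        weights = torch.sigmoid(avg + mx)              # channel attention scores
        return x * weights                             # inject weights into the input
```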
The role of spatial attention is to focus on the spatial dimension of the feature map and obtain a set of spatial attention weights. After the data pass through the convolutional layers, each pixel in the extracted feature map represents a feature of a certain region in the upper-layer feature map, so the spatial attention mechanism indicates which regions of the data the network should attend to. The spatial attention mechanism transforms the original input into another space while preserving the essential information. The output of the channel attention module is used as the input feature map of the spatial attention module. Global average pooling and global max pooling along the channel dimension compress the channel information; the two resulting maps are concatenated along the channel dimension, and a convolution reduces the channel dimension. A sigmoid then generates the spatial weight scores. Finally, the feature map with spatial attention is obtained by applying the spatial attention weights to the original input feature map. The spatial attention module is shown in Fig. 7.
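A corresponding CBAM-style sketch of the spatial attention module; the 7 × 7 convolution kernel is an assumption, not a value specified in the paper.

```python
# Spatial attention sketch: channel-wise average and max maps are concatenated, reduced
# to one channel by a convolution, and passed through a sigmoid to get the weight map.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                              # x: (batch, channels, H, W)
        avg = x.mean(dim=1, keepdim=True)              # global average pooling over channels
        mx, _ = x.max(dim=1, keepdim=True)             # global max pooling over channels
        weights = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weights                             # inject weights into the input
```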
The hybrid attention mechanism proposed in this work fuses multi-head self-attention, channel attention, and spatial attention to enrich the diversity of features. The hybrid attention mechanism is shown in Fig. 8.
HATT-MSCNN-IBiLSTM for enterprise sales forecasting
Although the CNN model has shown great advantages in extracting data features, it only considers the correlation between adjacent vectors and ignores the temporal ordering of continuous data. The BiLSTM approach can handle time-series features, but its feature extraction is not as comprehensive as that of the CNN. This work therefore uses the CNN to extract high-dimensional spatial features, reduce the feature dimensionality of the data, and capture early weak features that are not obvious. The BiLSTM model then processes the feature sequence output by the CNN part and mines the temporal information in the data. Integrating CNN and BiLSTM yields a CNN-BiLSTM hybrid model, in which the CNN extracts distinctive features and the BiLSTM handles the temporal characteristics. Finally, an attention mechanism is introduced to further process the features and improve their robustness and discriminability. The proposed HATT-MSCNN-IBiLSTM deep learning model for enterprise sales forecasting is shown in Fig. 9.
The HATT-MSCNN-IBiLSTM model first extracts corporate sales features from historical information through the CNN. A bidirectional LSTM network is then used to train on and predict from the corporate sales dataset. The attention mechanism is introduced during network training and assigns different weights to each sales feature, so that the network can learn feature information more effectively.
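To make the data flow concrete, the following high-level PyTorch sketch assembles the pipeline described above, reusing the MultiScaleConv1d sketch from the multi-scale CNN subsection. Layer sizes, the number of heads, and the single linear output head are assumptions; only the multi-head component of the hybrid attention is shown for brevity, and the BiLSTM hyper-parameters would in practice be selected by the improved PSO rather than fixed as here.

```python
# High-level sketch of the MSCNN -> BiLSTM -> attention -> prediction pipeline.
import torch
import torch.nn as nn

class HattMscnnBilstm(nn.Module):
    def __init__(self, in_channels=1, branch_filters=16, hidden_size=64, num_heads=4):
        super().__init__()
        self.mscnn = MultiScaleConv1d(in_channels, branch_filters)          # spatial features
        feat_dim = branch_filters * 3
        self.bilstm = nn.LSTM(feat_dim, hidden_size, batch_first=True,
                              bidirectional=True)                           # temporal features
        self.attention = nn.MultiheadAttention(2 * hidden_size, num_heads,
                                               batch_first=True)            # re-weights time steps
        self.head = nn.Linear(2 * hidden_size, 1)                           # sales forecast

    def forward(self, x):                              # x: (batch, in_channels, length)
        feats = self.mscnn(x).permute(0, 2, 1)         # -> (batch, length, feat_dim)
        seq, _ = self.bilstm(feats)                    # -> (batch, length, 2*hidden)
        att, _ = self.attention(seq, seq, seq)
        return self.head(att[:, -1, :])                # predict from the last time step

model = HattMscnnBilstm()
pred = model(torch.randn(8, 1, 64))                    # pred: (8, 1)
```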
The edge computing model
According to Fig. 10, the suggested model is composed of three distinct layers: the cloud layer, the edge layer, and the local layer. Through a variety of IoT devices and sensors, the local layer is in charge of gathering crucial data related to businesses. After the data have been gathered, they are processed (and/or stored) at the edge layer using small-scale datacenters, or edge clouds. Edge-level processing may include aggregation, for example by deleting duplicate data. The filtered data can then be sent to the remote cloud layer for further processing, including resource management, storage, and monitoring. If the data obtained at the local layer were sent directly to the cloud layer, the cloud network architecture could experience severe delays, although this is not always the case.
Additionally, real-time prediction may take place at the edge layer, while cloud-based prediction remains an option for monitoring services. Real-time predictions made on a remote cloud, such as computing the quickest or safest route, may incur considerable delays depending on the network's capacity and quality. In such cases, provided the necessary data are kept locally, the closest edge cloud can forecast traffic, distance, and road conditions. The data may not, however, be accessible or processable locally because of the edge cloud's limited storage and processing capacity. There are three alternatives in that situation: (i) transfer the necessary data from the cloud to the edge, analyze them, and then discard them; (ii) complete the prediction at the remote cloud; or (iii) train the prediction model at the remote cloud and forecast at the edge layer (distributed computation). Regarding (i), depending on the model being used, this may be costly. Regarding (ii), searching for the right data and then transferring them across the network interface would result in longer response times. With the suggested approach (iii), the model can be trained regularly in the cloud and only prediction would take place at the edge.