A novel privacy-preserving speech recognition framework using bidirectional LSTM

Utilizing speech as the transmission medium in the Internet of Things (IoT) is an effective way to reduce latency while improving the efficiency of human-machine interaction. In the field of speech recognition, Recurrent Neural Networks (RNNs) have significant advantages in improving recognition accuracy. However, some RNN-based intelligent speech recognition applications provide insufficient privacy preservation for speech data, while others that do preserve privacy are time-consuming, especially in model training and speech recognition. Therefore, in this paper we propose a novel Privacy-preserving Speech Recognition framework using a Bidirectional Long short-term memory neural network, namely PSRBL. On the one hand, PSRBL constructs secure activation functions by combining new piecewise-linear functions with an additive secret sharing protocol, namely a secure piecewise-linear Sigmoid and a secure piecewise-linear Tanh, to preserve the privacy of speech data while speech recognition runs on edge servers. On the other hand, to reduce the time spent on both training and recognition while keeping high accuracy, PSRBL first uses the secure activation functions to refit the original activation functions in the bidirectional Long Short-Term Memory neural network (LSTM), and then makes full use of the left and right context information of speech data by employing the bidirectional LSTM. Experiments conducted on the TIMIT speech dataset show that PSRBL performs well. Specifically, compared with state-of-the-art frameworks, PSRBL significantly reduces the time consumption of both training and recognition, under the premise that PSRBL and the compared frameworks offer consistent privacy preservation for speech data.


Introduction
Utilizing speech as the transmission medium in the Internet of Things (IoT) is an effective way to reduce latency while improving the efficiency of human-machine interaction. For example, Apple's Siri and Microsoft's Cortana obtain instructions through speech recognition and return the best-matching results to users, which greatly improves users' work effectiveness. Owing to the advantages of speech, many "IoT + speech" applications in our daily life have been promoted and developed, such as smart homes [1] and self-driving vehicles [2]. Studies have demonstrated that speech recognition applications based on Recurrent Neural Networks (RNNs) perform well in terms of recognition accuracy [3]. However, RNN-based speech recognition applications deployed on a centralized cloud challenge both the transmission performance of IoT devices and the computing performance of the cloud, since they are computation-intensive and require high-capacity memory. The global edge computing market size is anticipated to reach USD 43.4 billion by 2027, exhibiting a CAGR of 37.4% over the forecast period, according to a new report by Grand View Research, Inc. (see footnote 1). In addition, the data traffic of these explosively increasing terminal devices is transmitted to the cloud for processing, which will eventually exceed the cloud's computing and storage capabilities. Thus, most RNN-based speech recognition applications are deployed on edge servers to alleviate the challenges that stem from the computational intensity and insufficient storage capabilities of clouds [4,5].
*Correspondence: xuyan@ahu.edu.cn. 1 Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, Anhui University, 111 Jiulong Road, Hefei, China. Full list of author information is available at the end of the article. (2020) 9:36.
However, in the era of big data with the explosive growth of data volume, deploying a speech recognition application on edge servers is not an effective solution since edge computing suffers from capacity limitations. Hence, the edge-cloud computing paradigm offers a tradeoff between speech recognition applications' requirements for computing resources and low latency, and improves the usage efficiency of the IoT devices [6].
There is no doubt that smart speech services facilitated by the edge-cloud computing paradigm lack security technologies and privacy protection mechanisms [7,8]. That is, speech data involving private information could be maliciously collected, transmitted and analyzed during speech recognition, which enables criminals to learn users' daily behaviors, health statuses, etc. [9]. Secret sharing [10], homomorphic encryption [11] and differential privacy [12] are three conventional techniques for preserving data privacy. Privacy-preserving methods based on homomorphic encryption are computation-intensive and require high-capacity memory, which reduces their commercial practicability. For privacy-preserving methods based on differential privacy, in contrast, the added noise usually results in poor data availability, which increases the probability of errors during speech recognition. Secure Multi-Party Computing (SMC) [13] can be well integrated with edge servers to improve privacy preservation while ensuring high data availability. In addition, for speech recognition, the Long Short-Term Memory neural network (LSTM) is a well-known method with high accuracy and practicability. Therefore, Ma et al. [14] proposed an outsourced privacy-preserving speech recognition framework (namely OPSR) that builds on the advantages of SMC and LSTM.
Although OPSR greatly improves the commercial practicability of speech data and the accuracy of speech recognition while preserving the privacy of speech data, it can still be improved in terms of semantic dependence and time consumption. On the one hand, since studies [15,16] showed that the accuracy of speech recognition can be improved by extracting the bidirectional semantic dependence of speech, the right context of speech sequence data should not be neglected. For example, because of the variety of noises in natural environments, recording devices may lose some information in the speech data, so that the thoughts of speakers cannot be fully expressed. Utilizing both the left and the right context of the missing speech data is an effective way to infer the missing information and thereby improve the accuracy of speech data prediction. On the other hand, the activation functions used in OPSR require multiple iterations during speech recognition. Intuitively, more iterations consume more time. Hence, it is necessary to find a suitable alternative to multiple iterations. Therefore, this paper designs a novel Privacy-preserving Speech Recognition framework using the Bidirectional Long short-term memory neural network, namely PSRBL. Based on SMC and bidirectional LSTM (BiLSTM for short), PSRBL first divides the original speech data randomly into two encrypted parts by means of an additive secret sharing protocol. Then it utilizes the new piecewise-linear functions proposed in this paper to reconstruct the original activation functions in BiLSTM. After that, PSRBL inputs the encrypted speech data into BiLSTM. Finally, PSRBL preserves the privacy of speech data through the collaboration of two independent edge servers during both the training and the recognition phases of the speech recognition model.
Footnote 1: https://www.grandviewresearch.com/press-release/global-edge-computingmarket
The experimental results demonstrate that our PSRBL significantly reduces the time consumption in terms of the training and the recognition of models, and also improves the response speed while preserving the privacy information of the speech data on the edge servers. The main contributions of this paper are summarized as follows.
(1) We propose a novel privacy-preserving speech recognition framework based on SMC and BiLSTM, namely PSRBL. PSRBL preserves the privacy of speech data by running the SMC method on two independent edge servers.
(2) We design new piecewise-linear functions to refit the original activation functions (i.e., Sigmoid and Tanh) in BiLSTM. Compared with the state-of-the-art OPSR framework, PSRBL greatly reduces the time consumption of model training and speech recognition while keeping the training error consistent.
The remainder of this paper is organized as follows. "PSRBL framework" presents our PSRBL framework. "Experiments" reports our experimental results. In "Related work", we review recent work on edge-cloud computing and privacy-preserving speech recognition, and we conclude the paper in "Conclusion".

PSRBL framework
In this section, we will first present the architecture of PSRBL. Then the secure forward propagation and the secure backward propagation of BiLSTM (the core of PSRBL) will be introduced respectively. Finally, we will analyze the correctness and the security capability of BiLSTM.

The architecture of PSRBL
In this paper, we improve and optimize OPSR to obtain the framework PSRBL. The architecture of PSRBL is shown in Fig. 1 and described in the following paragraphs. As can be seen in Fig. 1, PSRBL consists of seven participants: end-users, the smart audio devices that end-users use, two edge servers, the trusted third party, the smart IoT devices and the service providers.
Here, $U$ denotes an end-user. $AD$ denotes the smart audio device that an end-user uses; it records and preprocesses the voice data of the end-user and receives the results of speech recognition. Besides, $AD$ randomly divides the feature vector of the preprocessed data into two parts, denoted as $A_1$ and $A_2$ respectively, and sends them to the two edge servers, which are denoted as $S_1$ and $S_2$ respectively. Note that a training system is deployed on each edge server. $S_1$ and $S_2$ employ BiLSTM for speech recognition training and work together to obtain the complete output, denoted as $f$. $T$ denotes the trusted third party who supports the collaborative computing of $S_1$ and $S_2$. $I$ denotes the smart IoT devices that receive the original output results (denoted as $f_1$ and $f_2$ respectively) from BiLSTM. $I$ performs the summation verification on $f_1$ and $f_2$ to get $f$, so as to obtain the speech recognition results that end-users require. $SP$ denotes the service providers.
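The splitting performed by $AD$ and the summation performed by $I$ can be illustrated with a minimal NumPy sketch. It is only an illustration under stated assumptions: the function names `split_into_shares` and `reconstruct`, the uniform mask range and the use of floating-point shares are hypothetical choices made here, not details taken from the paper.

```python
import numpy as np

def split_into_shares(features, rng):
    """Split a feature vector into two additive shares with A1 + A2 == features."""
    # The mask range [-1, 1) is an assumption made for this sketch only.
    share_1 = rng.uniform(-1.0, 1.0, size=features.shape)
    share_2 = features - share_1
    return share_1, share_2

def reconstruct(f_1, f_2):
    """Recombine two additive shares, as the IoT device I does with f1 and f2."""
    return f_1 + f_2

rng = np.random.default_rng(0)
features = rng.normal(size=123)            # one frame with 123 features
a_1, a_2 = split_into_shares(features, rng)
recovered = reconstruct(a_1, a_2)
```

Each server sees only one of `a_1` and `a_2`, and the original frame is recovered only when both shares are added together.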
1) Compared with OPSR, we use BiLSTM instead of LSTM to train the speech recognition system (as shown in black in Fig. 1). BiLSTM consists of two unidirectional LSTMs, i.e., the forward LSTM with a forward sequence $\overrightarrow{A} = (x_0, x_1, \ldots, x_t, \ldots)$ and the backward LSTM with a backward sequence $\overleftarrow{A} = (\ldots, x_t, x_{t-1}, \ldots, x_0)$, which can be trained in parallel. As stated above, utilizing both the left and the right context of the missing speech data can improve the accuracy of speech data prediction. Therefore, by combining the context information of speech data, BiLSTM can perform effective prediction. The secure forward propagation and the secure backward propagation of BiLSTM are introduced in detail in the "Secure forward propagation" and "Secure backward propagation" sections respectively.
2) New piecewise-linear functions, named PSigmoid and PTanh respectively, are proposed to refit the activation functions that otherwise require multiple iterations (i.e., Sigmoid and Tanh), in order to reduce the time consumption. Compared with Maclaurin polynomials, our piecewise-linear functions greatly reduce the time consumption.
The PSigmoid and PTanh are defined as (1) and (2) respectively. The results of refitting Sigmoid by respectively using the piecewise-linear function PSigmoid and the activation function MNSigmoid used in OPSR are shown in Fig. 2a, and the results of refitting Tanh by respectively using the piecewise-linear function PTanh and the activation function MNTanh used in OPSR are shown in Fig. 2b.
After that, $SP$ transmits the parameters of the piecewise-linear functions as a triple $(a, b, M)$ to $T$ (as shown in purple in Fig. 1), where $a$ and $b$ respectively denote the variable coefficient and the constant term of the piecewise-linear functions, and $M$ denotes the piecewise intervals. $T$ first sums the intermediate values transmitted from $S_1$ and $S_2$ to determine the interval $M$, and then randomly divides $(a, b)$ into $(a', b')$ and $(a'', b'')$. Finally, $(a', b')$ and $(a'', b'')$ are respectively sent back to $S_1$ and $S_2$ to support their secure two-party computation (as shown in blue in Fig. 1).
From Fig. 2a, we can see that the curve PSigmoid closely refits the curve Sigmoid, whereas the curve MNSigmoid only performs well on the interval [-2, 2]. Fig. 2b shows that the curve PTanh closely refits the curve Tanh, whereas the curve MNTanh only performs well on the interval [-1, 1].
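To make the idea of refitting Sigmoid with a piecewise-linear function concrete, the following sketch builds a linear interpolant on uniformly spaced breakpoints and measures its maximum error. The breakpoint placement, the fitting range [-8, 8] and the helper names `fit_piecewise_linear` and `evaluate` are assumptions of this sketch; they are not the coefficients or intervals defined in (1) and (2).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_piecewise_linear(func, lo, hi, pieces):
    """Return breakpoints and per-piece (a, b) so that a*x + b interpolates func."""
    knots = np.linspace(lo, hi, pieces + 1)
    values = func(knots)
    a = np.diff(values) / np.diff(knots)   # slope (variable coefficient) per piece
    b = values[:-1] - a * knots[:-1]       # constant term per piece
    return knots, a, b

def evaluate(x, knots, a, b, left, right):
    """Evaluate the piecewise-linear function, saturating outside the fitted range."""
    x = np.asarray(x, dtype=float)
    idx = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, len(a) - 1)
    y = a[idx] * x + b[idx]
    y = np.where(x < knots[0], left, y)
    y = np.where(x > knots[-1], right, y)
    return y

grid = np.linspace(-8.0, 8.0, 2001)
max_error = {}
for pieces in (4, 8, 16):
    knots, a, b = fit_piecewise_linear(sigmoid, -8.0, 8.0, pieces)
    approx = evaluate(grid, knots, a, b, left=0.0, right=1.0)
    max_error[pieces] = float(np.max(np.abs(approx - sigmoid(grid))))
```

In this simple uniform scheme the maximum error shrinks as the number of pieces grows, which is the qualitative behavior the refitting relies on; the same helper can be applied to `np.tanh` with saturation values -1 and 1.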

Secure forward propagation
A BiLSTM is deployed on each edge server, and each BiLSTM contains a single Forward layer LSTM (F-LSTM for short) and a single Backward layer LSTM (B-LSTM for short). Each F-LSTM (B-LSTM) handles historical information by deploying a nonlinear function and a large number of linear functions, and includes a single forget gate, a single input gate and a single output gate. Given a speech sequence $A = (x_0, x_1, \ldots, x_t, \ldots)$, $S_1$ and $S_2$ perform the privacy-preserving calculations of the forget gate, the input gate and the output gate by employing a secure addition protocol and a secure multiplication protocol. The symbols "→" and "←" respectively indicate the calculation processes of F-LSTM and B-LSTM. The superscript symbols "′" and "″" indicate the calculation processes of the neural network on $S_1$ and $S_2$ respectively. The symbols used in the paper and their interpretations are shown in Table 1. The secure forward propagation of BiLSTM is shown in Fig. 3. We combine the piecewise-linear functions PSigmoid(x) and PTanh(x) with the secure addition protocol and the secure multiplication protocol to construct new secure piecewise-linear functions SPSigmoid(x) and SPTanh(x) respectively, which serve as the activation functions of the gates. Since the construction steps of SPSigmoid(x) and SPTanh(x) are similar, we only take SPSigmoid(x) as an example, which is defined as follows.
where $a'$ and $b'$ respectively denote the coefficient and the constant term of PSigmoid(x) running on $S_1$; $a''$ and $b''$ respectively denote the coefficient and the constant term of PSigmoid(x) running on $S_2$.
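The following sketch illustrates how one linear piece $a x + b$ could be evaluated on additive shares held by two servers. It assumes a Beaver-triple style multiplication with the triple supplied by the trusted third party; this is a common construction chosen here for illustration, and the actual protocols of [14] may differ in detail. All function names are hypothetical, and the coefficients are assumed to have been selected for the correct interval already.

```python
import numpy as np

rng = np.random.default_rng(1)

def share(value):
    """Split an array into two additive shares."""
    value = np.asarray(value, dtype=float)
    first = rng.uniform(-1.0, 1.0, size=value.shape)
    return first, value - first

def secure_add(u_1, v_1, u_2, v_2):
    """Each server adds its own shares locally; no communication is needed."""
    return u_1 + v_1, u_2 + v_2

def secure_mul(u_1, v_1, u_2, v_2):
    """Multiply shared u and v using a triple (p, q, p*q) dealt by the third party."""
    p = rng.uniform(-1.0, 1.0, size=np.shape(u_1))
    q = rng.uniform(-1.0, 1.0, size=np.shape(u_1))
    (p_1, p_2), (q_1, q_2), (r_1, r_2) = share(p), share(q), share(p * q)
    # The servers exchange masked values, so only u - p and v - q become public.
    d = (u_1 - p_1) + (u_2 - p_2)
    e = (v_1 - q_1) + (v_2 - q_2)
    z_1 = r_1 + d * q_1 + e * p_1 + d * e   # one server adds the public d*e term
    z_2 = r_2 + d * q_2 + e * p_2
    return z_1, z_2

def secure_linear(a_sh, b_sh, x_sh):
    """Compute shares of a*x + b from shares of a, b and x."""
    prod_1, prod_2 = secure_mul(a_sh[0], x_sh[0], a_sh[1], x_sh[1])
    return secure_add(prod_1, b_sh[0], prod_2, b_sh[1])

x = np.array([0.3, -1.2, 2.0])
a = np.full_like(x, 0.25)   # illustrative slope, not a coefficient from the paper
b = np.full_like(x, 0.5)    # illustrative constant term
y_1, y_2 = secure_linear(share(a), share(b), share(x))
```

Adding `y_1` and `y_2` recovers `a * x + b`, while neither server ever holds `x`, `a` or `b` in the clear.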
Forget gate. The forget gate SPSigmoid(x) treats $h_{t-1}$ (the output vector of the previous unit) and $x_t$ (the input vector of the current unit) as input values. For each item in $c_{t-1}$ (the memory vector of the previous unit), SPSigmoid(x) generates a value within [0, 1] that controls how much of the previous unit is forgotten. The relevant calculating processes (F-LSTM) are shown as follows.
Input gate. The input gate performs three steps.
Step 1) It utilizes SPSigmoid(x) to calculate $i_t$ (the filtered input vector of the current unit). The relevant calculating processes (B-LSTM) are defined as follows.
Step 2) The input gate uses SPTanh(x) to calculate $\tilde{c}_t$ (the candidate unit state vector of the current unit), which determines how much new information needs to be added. The relevant calculating processes (F-LSTM) are listed as follows.
Step 3) By combining a secure addition function and a secure multiplication function, the input gate employs $c_{t-1}$, $i_t$ and $\tilde{c}_t$ to update the unit state, which is calculated as follows for F-LSTM and B-LSTM.
Output gate. The input values of the output gate are the calculated results of the input gate and the forget gate, which means the output of BiLSTM is influenced by both the long-term memory and the current input value. The output gate first utilizes $W_o$ (the output weight matrix) and $B_o$ (the output bias term) to calculate $o_t$ (the current output vector of the output gate) by using the following equations (F-LSTM). Then, by combining SPSigmoid(x) and SPTanh(x), $h_t$ (the output vector of the current unit) of BiLSTM is calculated from $o_t$ and $c_t$. The relevant calculating processes for F-LSTM and B-LSTM are defined by the following equations.
In the entire process of secure forward propagation of BiLSTM, $[\overrightarrow{h_t}, \overleftarrow{h_t}]$ is the input vector of the output layer at time $t$, where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are calculated by using F-LSTM and B-LSTM respectively.
Table 1 Symbols as well as corresponding interpretations used in the paper
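The gate structure described above can be sketched in cleartext NumPy as follows. This is a plain (non-secure) illustration that uses the standard LSTM gate equations and the ordinary Sigmoid and Tanh; in PSRBL the activations would be replaced by SPSigmoid and SPTanh and all arithmetic would run on secret shares. The parameter layout, the helper names and the random initialization are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, B):
    """One LSTM unit; each W[k] has shape (hidden + features, hidden)."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(z @ W["f"] + B["f"])       # forget gate
    i_t = sigmoid(z @ W["i"] + B["i"])       # input gate, step 1
    c_tilde = np.tanh(z @ W["c"] + B["c"])   # candidate state, step 2
    c_t = f_t * c_prev + i_t * c_tilde       # unit state update, step 3
    o_t = sigmoid(z @ W["o"] + B["o"])       # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def run_direction(sequence, W, B, hidden):
    """Run one unidirectional LSTM over the sequence and collect every h_t."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, W, B)
        outputs.append(h)
    return np.stack(outputs)

def bilstm_forward(sequence, params_fwd, params_bwd, hidden):
    """Concatenate F-LSTM and B-LSTM outputs at every time step."""
    h_fwd = run_direction(sequence, *params_fwd, hidden)
    h_bwd = run_direction(sequence[::-1], *params_bwd, hidden)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

def init_params(features, hidden, rng):
    W = {k: rng.normal(scale=0.1, size=(hidden + features, hidden)) for k in "fico"}
    B = {k: np.zeros(hidden) for k in "fico"}
    return W, B

rng = np.random.default_rng(3)
features, hidden, steps = 123, 80, 8
sequence = rng.normal(size=(steps, features))
out = bilstm_forward(sequence, init_params(features, hidden, rng),
                     init_params(features, hidden, rng), hidden)
```

The backward direction simply processes the reversed sequence and its outputs are reversed again, so that row $t$ of `out` holds both the left-context and the right-context representation of frame $t$.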

Secure backward propagation
The secure backward propagation of BiLSTM is based on the Back Propagation Through Time (BPTT) algorithm. No complex derivative calculation is needed in the secure backward propagation of BiLSTM in this paper, since the original activation functions Sigmoid(x) and Tanh(x) are refitted by SPSigmoid(x) and SPTanh(x) respectively, whose derivatives on each piece are simply the slope of that piece, i.e., SPSigmoid'(x) = a and SPTanh'(x) = a. That is, the training process of the secure backward propagation of BiLSTM can be performed by directly utilizing the secure addition protocol and the secure multiplication protocol, similarly to its secure forward propagation. Based on the trusted third party $T$, the secure BPTT uses the results of the secure forward propagation to obtain the error of the loss function. The error is then sent back to the edge servers to update the corresponding weight matrices and bias terms. BiLSTM iterates this process until the error converges. Since back propagation in B-LSTM is calculated in the same way as in F-LSTM, we only take F-LSTM as an example below. For F-LSTM, $\overrightarrow{\delta}_{t-1}$ denotes the total error vector at time $t-1$, and the gate-wise terms, such as the one subscripted $(c,t)$, are partial error vectors, which are respectively calculated as follows.
The total error vector at time $t$ is initialized to 1, i.e., $\overrightarrow{\delta}_t = 1$, and the total error vector before time $t$ can then be obtained by (24). Given the partial error vectors at each time, and by using the symbol $\nabla$ to denote the gradient, we can obtain an equation. In addition, the corresponding derivations are shown as follows.
Using α to denote the learning rate of BiLSTM, and supposing that α is public and available, the weight matrix and the bias term can be updated by using the gradient values, which are calculated as follows.
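Because $\alpha$ is public, an update of this kind can be carried out by each server on its own shares without interaction. The sketch below assumes the standard gradient-descent rule $W \leftarrow W - \alpha \nabla W$, which is an assumption made here rather than a reproduction of the paper's update equations; the variable names and the learning-rate value are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.01                                    # public learning rate (illustrative value)

W = rng.normal(size=(203, 80))                  # weight matrix in the clear, for checking only
grad = rng.normal(size=(203, 80))               # its gradient in the clear, for checking only

# Additive shares held by S1 and S2.
W_1 = rng.uniform(-1.0, 1.0, size=W.shape)
W_2 = W - W_1
g_1 = rng.uniform(-1.0, 1.0, size=grad.shape)
g_2 = grad - g_1

W_1_new = W_1 - alpha * g_1                     # computed by S1 alone
W_2_new = W_2 - alpha * g_2                     # computed by S2 alone
```

The two updated shares add up to the cleartext result `W - alpha * grad`, since scaling by a public constant and subtracting are both linear operations on the shares.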

Correctness and security analysis
Correctness. As shown in Fig. 2, the piecewise-linear functions PSigmoid(x) and PTanh(x) perform well. Theoretically, PSigmoid(x) and PTanh(x) can approach the original activation functions Sigmoid(x) and Tanh(x) arbitrarily closely. In addition, the secure piecewise-linear functions SPSigmoid(x) and SPTanh(x) are built on the secure addition protocol and the secure multiplication protocol of the additive secret sharing protocol, whose correctness has been proved in [14].
Security analysis. In the processes of forward propagation and back propagation, all input vectors of the secure piecewise-linear functions SPSigmoid(x) and SPTanh(x) are calculated results of the secure protocols. Since the security of the secure addition protocol and the secure multiplication protocol has been demonstrated in [14], the processes of forward propagation and back propagation are secure.

Experiments
In this section, we conduct multiple experiments to evaluate the performance of our PSRBL. We focus on answering the following two research questions.
(1) Question 1. Can the proposed piecewise-linear activation functions improve the accuracy of PSRBL?
(2) Question 2. Does PSRBL perform better than existing state-of-the-art peers?

Experimental dataset and environment
We use the state-of-the-art framework OPSR as the baseline. OPSR is implemented in Python3 and Numpy in [14]. The adopted speech dataset is a part of the TIMIT corpus, which has 123 features coming from a Fourier transform-based filter bank [17]. To make fair comparisons, the performance of PSRBL is evaluated on the same dataset. With a data length of $l = 64$, 2000 processed speech sequences with 123 features are fed to a neural network with 80 neurons, and the time step is set to 8. Therefore, the size of each weight matrix $W_k$ is $203 \times 80$ (123 input features plus 80 hidden units), where $k \in \{f, i, c, o\}$, and the output matrix and the offset terms are $8 \times 80$ and $1 \times 80$ respectively. In addition, PSRBL is also implemented in Python3 and Numpy, and follows the same experimental parameter settings that OPSR used. Note that OPSR uses the Maclaurin polynomial and the Newton iteration method, where the number of iterations is set to 10. PSRBL, however, does not need iterations, since its activation functions are refitted by the piecewise-linear functions. In our experiments, the edge servers have the same configuration: an Intel(R) Core(TM) i5-7500 CPU @ 3.40 GHz and 8.00 GB of memory.

Performance of the piecewise-linear activation functions (Question 1)
Since the accuracy of PSRBL depends on the piecewise-linear activation functions, we first conduct multiple experiments to evaluate the performance of the piecewise-linear activation functions with different numbers of pieces.
The experimental results of the Sigmoid piecewise-linear fitting activation function PNSigmoid are shown in Table 2. As the piecewise number increases, the error values of the forward propagation of both the original and the secure LSTMs decrease continuously. When the piecewise number is 8, the error value reaches $10^{-14}$, which can be ignored. Beyond that point, further increasing the piecewise number yields a negligible change in the error value while the time consumption increases significantly. Therefore, the piecewise number for fitting the Sigmoid activation function is set to 8.
The results of the Tanh piecewise-linear fitting activation function PNTanh are shown in Table 3. When the piecewise number is 11, the error value reaches $10^{-14}$, which can be ignored. Beyond that point, the time consumption increases significantly as the piecewise number continues to increase, so the piecewise number for fitting the Tanh activation function is set to 11.

Performance comparisons (Question 2)
The experimental comparisons are shown in Figs. 4 and 5 (see, e.g., Fig. 4b). Therefore, we can conclude that the frameworks PSRBL and OPSR have the same performance in the privacy-preserving training process of the neural network.
As shown in Fig. 5, the time consumption of PSRBL-F, PSRBL-B and OPSR increases with the number of samples. However, both PSRBL-F and PSRBL-B are much more efficient than OPSR. When the number of samples is less than 100, the time consumption of both PSRBL-F and PSRBL-B is half of that of OPSR (as shown in Fig. 5a). When the number of samples is between 100 and 1000, OPSR takes one and a half times as long as both PSRBL-F and PSRBL-B do (as shown in Fig. 5b). When the number of samples is between 1000 and 2000, OPSR still takes longer than both PSRBL-F and PSRBL-B do (as shown in Fig. 5c).
To sum up, we can conclude that PSRBL significantly reduces the time of both the training and the recognition of the speech model. The reason is that the secure activation functions in PSRBL do not require iterations. The time spent on a single iteration of a single secure activation function in OPSR, denoted as $t_s$, is the sum of the time of running the secure addition protocol once and the secure multiplication protocol once. In the experiments, the number of iterations that the secure activation functions require in OPSR is 10, which means the total time spent on a single secure activation function is $10 t_s$. PSRBL uses piecewise-linear functions instead of iterative approximations to avoid iterations. Suppose that $T$ needs $t_p$ to construct the piecewise intervals; then the total time spent on a single secure activation function is $t_s + t_p$, which means the time consumption of PSRBL depends on $t_p$.
Table 4 shows the time consumption (including the initialization time and the calculation time of a single sample) of the forward and backward calculations. The initialization time refers to 1) the initialization of the weight matrix and the calculation vectors for each bias term, and 2) the initialization of the three gates of LSTM. It can be seen that the initialization time of PSRBL is roughly the same as that of OPSR. However, for the forward calculation, OPSR takes 4.5 times as long as PSRBL does, and for the backward calculation, OPSR takes 2.3 times as long as PSRBL does. Table 5 shows the time consumption of the three gates in the forward and backward calculations. It can be seen that the initialization time of PSRBL is roughly the same as that of OPSR. For the forget gate, the input gate and the output gate, OPSR respectively takes 3, 4 and 6 times as long as PSRBL does.
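The per-activation timing argument above can be written as a short worked inequality. It only restates the two totals given in the text ($10 t_s$ for OPSR and $t_s + t_p$ for PSRBL); the symbols $T_{\mathrm{OPSR}}$ and $T_{\mathrm{PSRBL}}$ are introduced here for illustration.

```latex
\[
\frac{T_{\mathrm{OPSR}}}{T_{\mathrm{PSRBL}}}
  = \frac{10\,t_s}{t_s + t_p},
\qquad
\frac{10\,t_s}{t_s + t_p} > 1
  \iff t_p < 9\,t_s .
\]
```

Under these two totals, PSRBL is faster per secure activation whenever constructing the piecewise intervals costs less than nine rounds of secure addition plus multiplication, and the ratio approaches 10 as $t_p$ becomes small relative to $t_s$.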
As shown in Table 6, the homomorphic encryption-based privacy-preserving scheme needs to encrypt the data first, which expands the data size and thus greatly increases the communication and storage overhead during the training of BiLSTM and in intelligent voice applications. Table 7 shows that the framework based on additive secret sharing performs better than the homomorphic encryption-based framework. To process one frame of audio, the homomorphic encryption-based privacy-preserving Gaussian mixture model (GMM) scheme [11] requires 616.759 ms; its time consumption is almost 6 times that of PSRBL, and its communication and storage overhead is more than 40 times that of PSRBL. In addition, compared with the Hidden Markov Model (HMM) [10] and OPSR, which are based on additive secret sharing, BiLSTM-based models achieve higher accuracy. Compared with OPSR, PSRBL significantly reduces the time consumption, which means it not only saves training time for the entire model but also improves the response time during speech recognition.

Related work
Edge computing is a popular computing paradigm with the aim to minimize the delay between end-users and the cloud, and many applications in our daily life have been promoted and developed, such as QoS prediction [18,19]. However, it is well known that edge computing suffers from capacity limitations. Thus, the edge-cloud computing paradigm is proposed to balance computing resources and low latency. Speech recognition [3] is one of the most important applications.
In the early days, speech recognition research mainly used the Hidden Markov Model (HMM) [20,21]. Lee et al. [22] proposed a co-occurrence smoothing algorithm that enables accurate speech recognition on a minimal training dataset. Nevertheless, HMM neglects the long-term dependence relations between speech data. Lipton et al. [23] showed that Recurrent Neural Networks (RNNs) could effectively solve this problem. Different from the sequential structure of the Convolutional Neural Network (CNN) [24], an RNN forms a complex recurrent chain structure through its input layer, hidden layer and output layer. Because RNNs become unreliable in complex application environments owing to the long-term dependencies problem [26], Gers et al. [25] improved the LSTM network structure by adding a forget gate and peepholes. However, LSTM neglects the right context information of speech data, which results in a loss of semantic information. Hence, Graves et al. [27] proposed a bidirectional long short-term memory neural network. Unlike the left-to-right training used in LSTM, Graves et al. made use of both the left and the right context information to train the speech recognition model. Bin et al. [28] proposed a video captioning framework based on bidirectional long short-term memory and soft attention mechanisms to enhance the ability to recognize persistent motion in video.
In the development of speech recognition technology, most speech recognition systems have been deployed on cloud and (or) edge servers, while the speech data is stored and transmitted in cleartext. That is, speech data with private information can be maliciously collected and analyzed [9]. Therefore, protecting speech data with encryption algorithms is a feasible solution. CryptoNets [29] is a neural network that can be trained on ciphertext. Some studies use Homomorphic Encryption (HE) to protect data privacy, such as MiniONN [30]. For example, Zhang et al. used BGV to encrypt private data [31]; Yilmaz et al. utilized a partially homomorphic cryptosystem as a building block of their privacy-preserving solutions [32]. HE-based methods are time-consuming and require high-capacity memory. Differential privacy is another data privacy-preserving method. However, it decreases the accuracy of speech recognition because of low data availability [11,17]. SMC can effectively improve the availability of encrypted data when combined with edge servers [4]. Huang et al. [33] proposed an SMC-based lightweight framework to protect data privacy. Ma et al. [14] proposed a privacy-preserving speech recognition framework based on LSTM and SMC.

Conclusion
This paper proposes a novel Privacy-preserving Speech Recognition framework using the Bidirectional Long short-term memory neural network (PSRBL). PSRBL makes full use of the left and the right context of speech data to improve the accuracy of speech recognition, and employs piecewise-linear functions to refit the original activation functions for reducing training and recognition time. In addition, PSRBL achieves the privacy-preserving of speech data during speech recognition based on SMC. The experimental results show that PSRBL outperforms the existing approaches.