Voice spoofing countermeasure for voice replay attacks using deep learning

Communication is central to human life, and listening and speaking are its primary forms; both depend on the human voice. Automatic Speaker Verification (ASV) systems verify users by their voices. These systems are susceptible to voice spoofing attacks, which are divided into logical access and physical access attacks. Recently, there has been notable progress in the detection of these attacks. Attackers use advanced devices to record users' voices and replay them to the ASV system in order to gain access for harmful purposes. In this work, we propose a voice spoofing countermeasure to detect voice replay attacks. We enhance the security of ASV systems by building a spoofing countermeasure based on decomposed signals that retain the prominent information. We use two main features, the Gammatone Cepstral Coefficients (GTCC) and the Mel-Frequency Cepstral Coefficients (MFCC), for the audio representation. To classify the features, we use a Bi-directional Long Short-Term Memory (BiLSTM) network in the cloud, a deep learning classifier. We investigated numerous audio features and examined each feature's capability to capture the details needed to label the audio as genuine or spoofed speech. Furthermore, we use various machine learning algorithms to illustrate the superiority of our system over traditional classifiers. The experiments were evaluated in terms of accuracy, precision, recall, F1-score, and Equal Error Rate (EER); the results were 97%, 100%, 90.19%, 94.84%, and 2.95%, respectively.


Introduction
The voice is considered a form of human biometrics and is a medium of communication. The characteristics of a person's voice are unique, so the voice can be used for authentication and biometric identification purposes [1]. Voice biometrics is a simple way of authenticating users that does not require any special sensor device or equipment [2,3]; a regular smartphone or microphone is sufficient. Voice biometrics is used for the verification or recognition of a speaker. Speaker verification is a one-to-one process that compares two speech samples to decide whether both originate from the same individual. Speaker identification, on the other hand, identifies an unknown individual by his or her voice. It is a one-to-many process, although it can be carried out as numerous repetitions of 1-1 comparisons. According to [4], biometric attributes of a person fall into two groups: physiological traits such as the iris, face, and fingerprint, and behavioral traits such as signature, voice, and gait.
Our primary goal is to create a security layer that protects customers from voice replay spoofing attacks on ASV systems. In such a system, the speech signal is used to perform a 1-1 comparison between the user's voice and the voice prints stored in the database. Naika [5] called this Automatic Speaker Verification. ASV systems face two main types of attacks: the spoofing attack and the zero-effort attack. In the latter, a speaker who is not registered presents his or her own speech in an attempt to be granted access as if he or she were registered. This kind of attack is simple to detect because the 1-1 comparison fails to match the registered user. In the spoofing attack, the attacker attempts to gain access by playing speech that was previously recorded and is similar to the speech of a registered speaker. In recent times, the susceptibility of ASV systems to voice spoofing attacks has been increasing [6]. Voice spoofing attacks are further divided into logical and physical access attacks. These attacks reduce the effectiveness of ASV systems [7]. Building state-of-the-art ASV systems is an emerging topic. Over the past few years, several evaluation challenges have been organized [8]. These challenges primarily dealt with logical access attacks, where the samples are created with voice conversion or text-to-speech algorithms. Challenges have also been organized to address both logical and physical access attacks, and physical access attacks alone [9][10][11]. For greater security of ASV systems, the spoofing countermeasure should be able to detect the kinds of attacks present in the training set, and it must also be robust enough to identify unseen attacks with a low EER. In other words, the anti-spoofing system should be able to generalize. The authors in [8] addressed this issue by employing deep audio feature extraction processes.
In [12], several embeddings were obtained from an inner layer of a deep neural network to represent either the whole audio or individual frames of the audio signal. The anti-spoofing system tries to ascertain the genuineness of the input speech signal. After the features are extracted, a classifier is used to classify the speech as genuine or fake. Researchers make use of deep learning or ML classifiers, but deep learning classifiers have proven to be more effective, as is evident in several studies using deep learning approaches [13][14][15][16]. The following are the contributions of this research:
1. We propose a strong voice spoofing countermeasure for the detection of voice replay attacks, based on signals decomposed through empirical mode decomposition.
2. We examine the features of the decomposed acoustic signals and create a hybrid feature-based architecture to detect fake speech.
3. We study the performance of existing ML classifiers in detecting spoofing.
4. We evaluate deep learning classifiers for their effectiveness against voice replay attacks.
The remainder of this paper is organized as follows: the "Literature review" section critically presents related literature. The methodology of the proposed study is discussed in the "Proposed method" section. The "Results" section discusses the experimental results of the proposed system and compares it with traditional methods. The "Conclusion" section concludes this paper.

Literature review
Hanilci et al. [17] presented a method for detecting voice replay spoofing using high-frequency band and glottal excitation information. The glottal excitation information was extracted using Iterative Adaptive Inverse Filtering, which captures the distinguishing characteristics of fake and genuine speech. They observed a decrease of 3.68% and 8.32% in the Equal Error Rate (EER) on the evaluation and development sets. In [18], the authors explored the Enhanced Teager Energy Operator (ETEO), cepstral coefficients, and signal mass to detect replay attacks. The EERs on the evaluation and development sets were 10.75% and 5.55%, respectively. The authors in [19] evaluated a spectrum analyzer referred to as 'cochlear', which comprises level-dependent compression and sharp frequency tuning. They then created a method that uses an adaptive notch filter and a resonant filter for the cochlear model. This technique improved the EER by 60.8% and 51.9%. In [20], the authors proposed a framework to detect replayed audio in order to secure ASV systems against fraudulent use. The framework could also detect fake audio created by spoofing algorithms.
Aljasem et al. [21] presented a system to detect replay attacks based on GMM and SVM classifiers and linear prediction analysis. The evaluation set yielded an EER of 4.8%. In [22], the authors proposed a framework for detecting replayed audio and securing voice assistants like Alexa and Siri, based on the difference in the locations of phonemes between a live human voice and replayed audio. The authors in [23] introduced a technique to detect voice spoofing based on Gammatone Cepstral Coefficient features and LTP for the security of the Internet of Things (IoT) and cyber-physical systems. These features were merged and fed into an SVM to differentiate the two classes. In [24], the authors proposed a voice replay detection framework that employs the spectral and spatial features of signals. They emphasized the non-speech segments and used spatial features based on the generalized cross correlation to identify the difference. Yaguchi et al. [25] investigated logSpec and cepstral coefficients to enhance the identification of attacks. The first feature is based on the ratio of the noise and harmonic sub-bands. The two features were extracted from the linear prediction signals. The development and evaluation datasets had EER reductions of 7% and 51.7%, respectively. In [26], the authors proposed a spoofing detection technique for replay attacks based on an energy separation algorithm [27,28]. They also examined a Teager energy operator because it is robust to noise. They observed EER improvements of 66.34% and 21.88% in noisy and clean environments, respectively. The authors in [29] designed a technique to detect live audio signals based on the constant-Q transform, which uses geometrically distributed frequency bins. In [30], the authors created spectral-based features, i.e., shifted-CQCC and the Glottal Mel-Frequency Cepstral Coefficient (GMFCC), and integrated them for the detection of replay signals.
The shifted-CQCC generated an EER of 11.34%, while CQCC produced an EER of 7.94%. Meng et al. [31] proposed an anti-spoofing measure for smart home systems called ARRAYID, a passive liveness detection scheme that uses the collected speech to distinguish between a live human and replayed speech. Mittal and Dua [32] explored the deep learning models CNN and LSTM together with the CQCC spectral feature. Two levels were used to detect spoofs. In the first level, LSTM and CNN were utilized. The next level used time-distributed wrappers and LSTM. The authors in [33] presented a system that distinguishes between genuine and replayed audio by exploring the distortions introduced by the recording device, using an SVM. In [34], the authors proposed a system to detect replay attacks that places prime importance on the replaying and recording devices. They also introduced a countermeasure system based on the evaluation of a regular audio spoofing tool.
Garg, Bhilare, and Kanhangad [35] introduced an anti-spoofing measure based on MFCC and CQCC features. Sub-band analysis was performed on these features. The EER improved by 36.33% relative to the baseline system. In [36], the authors proposed an anti-spoofing measure for replayed speech based on a model of the human cochlea, as opposed to filter bank models. They also designed two features extracted from the modulation. The experimental results showed that the integration of these features outperformed the filter bank. The authors in [37] used the linear prediction residual signal for the detection of replayed speech. The residual-MFCC and excitation source features obtained from the linear prediction residual of the audio signals were integrated for the detection of replayed audio. The residual-MFCC had better outcomes than the baseline systems. In [38], the authors evaluated a spoofed speech detection system based on the linear frequency residual cepstral coefficient. Two classifiers, a GMM and a CNN, were used to distinguish between replayed and genuine audio signals. A decline of 28.78% and 42.72% in the EER values of the development and evaluation sets, respectively, was reported.

Proposed method
The primary objective of this research is the detection of voice replay attacks against ASV systems. Our proposed system comprises two basic stages: feature extraction and classification. In the first stage, the audio signals are decomposed and two 7-dim feature sets, i.e., MFCCs and GTCCs, are obtained. Next, we use the Bi-directional Long Short-Term Memory (BiLSTM) network as a deep learning classifier. Our proposed spoofing countermeasure determines the user's authenticity based on the voice provided. We used the ASVspoof2019 PA dataset for all tests. Figure 1 illustrates the proposed countermeasure. For the implementation of this work, we used MATLAB R2022a, which has several tools for audio processing. This allowed us to perform both feature extraction and classification in a single tool.

Empirical mode decomposition
Empirical mode decomposition (EMD) was introduced in 1998. It treats a signal as fast oscillations superimposed on slow oscillations, which can be decomposed in a unique way into simple intrinsic oscillations using an adaptive scale, without requiring prior information about the system [39]. The decomposition is based on the local characteristic time scale of the data, which makes it suitable for non-linear and non-stationary processes, and it produces Intrinsic Mode Functions (IMFs) [40]. It has been applied in system detection challenges and health monitoring [41]. Although several applications have demonstrated the validity and robustness of EMD, it has not been used as a countermeasure against voice spoofing [42]. Previously, EMD has been studied in two forms: varying the sifting process and using empirically stated configurations. It has been used in several applications including fault detection, evaluation of biomedical data, and analysis of power and seismic signals. Using EMD, the noise in a signal can be removed and the signal reconstructed from the significant IMFs, which are then used for assessing the audio. In this paper, we first empirically decomposed the signals and obtained the two principal coefficients, as presented in Fig. 2.
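The sifting idea behind EMD can be illustrated with a minimal Python sketch. It is not the MATLAB implementation used in this work: it assumes piecewise-linear envelopes (practical EMD usually uses cubic splines), a fixed number of sifting passes instead of a stopping criterion, and the function names and the two-tone test signal are chosen only for illustration.

```python
import math

def local_extrema(x):
    """Return index lists (maxima, minima) of strict interior local extrema."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] > x[i + 1]:
            maxima.append(i)
        elif x[i - 1] > x[i] < x[i + 1]:
            minima.append(i)
    return maxima, minima

def envelope(x, idx):
    """Piecewise-linear envelope through x at idx (assumption: no splines),
    held constant before the first and after the last extremum."""
    env, k = [], 0
    for i in range(len(x)):
        if i <= idx[0]:
            env.append(x[idx[0]])
        elif i >= idx[-1]:
            env.append(x[idx[-1]])
        else:
            while idx[k + 1] < i:
                k += 1
            a, b = idx[k], idx[k + 1]
            w = (i - a) / (b - a)
            env.append((1 - w) * x[a] + w * x[b])
    return env

def sift_once(x):
    """Subtract the mean of the upper and lower envelopes; None if no extrema."""
    maxima, minima = local_extrema(x)
    if not maxima or not minima:
        return None
    upper, lower = envelope(x, maxima), envelope(x, minima)
    return [xi - 0.5 * (u + l) for xi, u, l in zip(x, upper, lower)]

def emd(x, max_imfs=4, sift_passes=10):
    """Extract up to max_imfs IMFs; returns (imfs, residual)."""
    imfs, residual = [], list(x)
    for _ in range(max_imfs):
        h = residual
        for _ in range(sift_passes):
            nxt = sift_once(h)
            if nxt is None:
                break
            h = nxt
        if h is residual:  # residual has no oscillation left to extract
            break
        imfs.append(h)
        residual = [r - hi for r, hi in zip(residual, h)]
    return imfs, residual

# Illustrative signal: a fast 50 Hz tone riding on a slow 5 Hz tone (1 kHz sampling).
signal = [math.sin(2 * math.pi * 50 * n / 1000) + 0.5 * math.sin(2 * math.pi * 5 * n / 1000)
          for n in range(400)]
imfs, residual = emd(signal)
```

Summing the IMFs and the residual gives back the original signal, which is why a denoised signal can be rebuilt from a chosen subset of IMFs.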

Mel-Frequency cepstrum coefficient
The MFCC is a popular technique for obtaining features from audio signals [43]. It is a filter-bank-based feature extraction technique in the cepstral domain. The Fast Fourier Transform (FFT) and a mel-scaled filter bank are applied to the audio signal. The filter bank divides the spectrum non-linearly, following the mel scale: filters in the lower frequency zones have narrower bandwidths than those in the higher zones. The mel scale is approximately linear below 1 kHz and logarithmic above it. The final stage orders the coefficients according to their significance, which is obtained by computing the discrete cosine transform of the logarithm of the filter bank outputs. The signals were decomposed, and 7-dim MFCC features were extracted from the audio signals. Figure 3 below shows the details.
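The two ingredients described above, the mel-scale warping and the discrete cosine transform of the log filter-bank outputs, can be sketched in a few lines of Python. The conversion formula with the constants 2595 and 700 is one widely used definition of the mel scale and is an assumption here, as are the function names; the sketch leaves out framing, windowing, the FFT, and the triangular filters.

```python
import math

def hz_to_mel(f_hz):
    """Common mel-scale formula (assumed variant): near-linear below ~1 kHz."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(f_low, f_high, num_filters):
    """Filter centers equally spaced on the mel scale, returned in Hz."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_filters)]

def cepstral_coefficients(filter_energies, num_coeffs=7):
    """Unnormalized DCT-II of the log filter-bank energies, keeping num_coeffs terms."""
    logs = [math.log(e) for e in filter_energies]
    m = len(logs)
    return [sum(logs[j] * math.cos(math.pi * k * (j + 0.5) / m) for j in range(m))
            for k in range(num_coeffs)]

centers = mel_center_frequencies(0.0, 8000.0, 20)
```

The spacing between consecutive entries of `centers` grows with frequency, which is the non-linear division of the spectrum described in the text; `num_coeffs=7` mirrors the 7-dim features used in this work.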

Gammatone cepstral coefficients
In the next phase, the audio signals were decomposed and 7-dim GTCC features were extracted for further evaluation. GTCC is another feature extraction technique, originally introduced in [44]. The gammatone function has several characteristics that make gammatone filters suitable for imitating the spectral response of the human auditory system [45][46][47]. The gammatone function is computed by multiplying a Gamma distribution function with a sinusoidal tone. In its standard form, it is expressed as follows:

$$g(t) = K\, t^{\,n-1} e^{-2\pi B t} \cos(2\pi f_c t + \varphi), \quad t > 0 \tag{1}$$

where K, B, n, φ, and f_c represent the amplitude factor, bandwidth parameter, filter order, phase shift, and filter central frequency, respectively. The duration of the filter impulse response is directly connected to the equivalent rectangular bandwidth (ERB), a metric used to approximate the bandwidth of the human auditory filters in the cochlea, a part of the ear. There is a connection between the ERB and B. Equation 2 shows the computation of the ERB:

$$\mathrm{ERB} = \left[ \left( \frac{f_c}{\mathrm{EarQ}} \right)^{n} + \mathrm{minBW}^{\,n} \right]^{1/n} \tag{2}$$

where f_c, minBW, EarQ, and n represent the filter central frequency, the lowest bandwidth at low frequencies, the asymptotic filter quality at high frequencies, and the approximation order, respectively. The central frequencies of the filters are computed using Eq. 3:

$$f_{c_i} = -(\mathrm{EarQ} \cdot \mathrm{minBW}) + (f_h + \mathrm{EarQ} \cdot \mathrm{minBW})\, e^{-i \cdot \mathrm{step} / \mathrm{EarQ}} \tag{3}$$

where f_h is the upper frequency, EarQ and minBW are the ERB parameters, and i is the GT filter index. The step is computed using Eq. 4:

$$\mathrm{step} = \frac{\mathrm{EarQ}}{N} \ln\!\left( \frac{f_h + \mathrm{EarQ} \cdot \mathrm{minBW}}{f_l + \mathrm{EarQ} \cdot \mathrm{minBW}} \right) \tag{4}$$

In Eq. 4, N is the number of filters and f_l is the lower frequency of the filter bank. The GTCC feature extraction process is similar to that of the MFCCs, but GTCCs use a gammatone filter bank instead of a mel filter bank. 7-dim decomposed GTCC features were obtained from the audio. Figure 4 below gives the details.
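A short Python sketch may help make the gammatone quantities concrete. It uses the standard textbook forms of the gammatone impulse response, the ERB, and the ERB-spaced central frequencies. The numeric values EarQ = 9.26449 and minBW = 24.7 (Glasberg and Moore), the order 1, the relation B = 1.019 ERB, and all function names are assumptions commonly found in the literature; they are not taken from this paper.

```python
import math

# Assumed parameter values (Glasberg and Moore); not specified in the text.
EAR_Q = 9.26449
MIN_BW = 24.7
ERB_ORDER = 1

def erb(fc, ear_q=EAR_Q, min_bw=MIN_BW, n=ERB_ORDER):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter centered at fc."""
    return ((fc / ear_q) ** n + min_bw ** n) ** (1.0 / n)

def gammatone(t, fc, order=4, k=1.0, phase=0.0):
    """Gammatone impulse response at time t >= 0 (seconds).
    The bandwidth B = 1.019 * ERB(fc) is a commonly used choice, assumed here."""
    b = 1.019 * erb(fc)
    return (k * t ** (order - 1) * math.exp(-2 * math.pi * b * t)
            * math.cos(2 * math.pi * fc * t + phase))

def center_frequencies(f_low, f_high, num_filters, ear_q=EAR_Q, min_bw=MIN_BW):
    """ERB-spaced central frequencies, from just below f_high down to f_low."""
    c = ear_q * min_bw
    step = ear_q / num_filters * math.log((f_high + c) / (f_low + c))
    return [-c + (f_high + c) * math.exp(-i * step / ear_q)
            for i in range(1, num_filters + 1)]

cfs = center_frequencies(50.0, 8000.0, 20)
impulse = [gammatone(n / 16000.0, 1000.0) for n in range(160)]  # 10 ms at 16 kHz
```

With these assumed constants the ERB at 1 kHz is roughly 133 Hz, and the last central frequency coincides with the lower edge of the band.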

Dataset
The ASVspoof2019 [48] PA dataset used for our experiments is publicly available, and its statistics are shown in Fig. 5. It has three sub-folders: training, development, and evaluation. Each contains bona-fide and replayed speech samples. The genuine data comprises 200 samples collected from 20 different speakers, as illustrated in (#11). Voice replay is produced in a single environment according to nine different attacks, resulting in 1,800 generated samples, as illustrated in (#13). The same method is used for the development partition samples, but only for 10 different speakers, as shown in (#12), which yields 900 samples, as depicted in (#14). This process is repeated

Classification
Figure 6 illustrates the classification stage of the proposed spoofing countermeasure. The audio is processed, and the extracted features are passed into the BiLSTM network to be classified as bona-fide or spoofed audio. BiLSTM has been utilized in several approaches [49]. BiLSTM is a Recurrent Neural Network [50] used for Natural Language Processing and time-series prediction; an audio signal is also time-series data. In an LSTM network, the input moves in a single direction. In a BiLSTM network, on the other hand, the input flows in both the forward and backward directions, which allows the network to use prominent details from both directions. To achieve this, BiLSTM has an additional LSTM layer in which the input moves in the opposite direction. The outputs obtained from the two layers are then merged. Figure 6 illustrates the specifics of the BiLSTM framework.
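The bidirectional flow described above can be sketched in plain Python as two independent LSTM passes, one over the frames in order and one over the reversed frames, whose hidden states are concatenated per frame. This is a toy illustration with random, untrained weights and hypothetical helper names; it is not the MATLAB network used in this work, and the hidden size of 3 is arbitrary.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_lstm(input_dim, hidden_dim, rng):
    """Random weights for the four gates (input, forget, candidate, output)."""
    cols = input_dim + hidden_dim + 1  # +1 for the bias term
    return [[[rng.uniform(-0.5, 0.5) for _ in range(cols)]
             for _ in range(hidden_dim)] for _ in range(4)]

def lstm_pass(weights, sequence, hidden_dim):
    """Run one LSTM over the sequence and return the hidden state at every step."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    outputs = []
    for x in sequence:
        z = list(x) + h + [1.0]
        gates = [[sum(w * v for w, v in zip(row, z)) for row in gate]
                 for gate in weights]
        i = [sigmoid(a) for a in gates[0]]
        f = [sigmoid(a) for a in gates[1]]
        g = [math.tanh(a) for a in gates[2]]
        o = [sigmoid(a) for a in gates[3]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        outputs.append(h)
    return outputs

def bilstm(fwd, bwd, sequence, hidden_dim):
    """Concatenate forward states with backward states re-aligned to frame order."""
    forward = lstm_pass(fwd, sequence, hidden_dim)
    backward = lstm_pass(bwd, sequence[::-1], hidden_dim)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]

rng = random.Random(0)
frames = [[rng.uniform(-1, 1) for _ in range(14)] for _ in range(5)]  # 5 frames, 14-dim
fwd, bwd = make_lstm(14, 3, rng), make_lstm(14, 3, rng)
states = bilstm(fwd, bwd, frames, 3)
```

Each entry of `states` has twice the hidden size: the first half summarizes the frames up to that point, and the second half summarizes the frames from that point to the end.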

Results
This section discusses the comprehensive performance evaluation of the proposed system for detecting voice replay attacks. Our technique's performance was evaluated using the Accuracy, Recall, Precision, F1-score, and Equal Error Rate (EER) performance parameters. However, the comparison with other systems is based on the EER. This experiment was conducted to evaluate our technique (emd-GTCC+emd-MFCC-BiLSTM) using the ASVspoof2019 PA dataset. This dataset has three sets: the training, evaluation, and development sets. The training set is used for training, while the evaluation set is used to test the trained model. The samples of the development set cannot be used to evaluate spoofing countermeasures. We empirically decomposed the audio signals and extracted the 7-dim MFCC and GTCC features from the evaluation and training sets.
As far as we know, this is the first effort to decompose the signals and evaluate the effect on the efficiency of spoofing detectors. We fed the 14-dim (emd-MFCC and emd-GTCC) features into the BiLSTM classifier to classify the audio as authentic or spoofed. Various algorithms show improved performance on the classification of time-series data. Audio is time-series data, and the proposed BiLSTM framework has shown impressive results. The accompanying figure illustrates the outcome of our spoofing countermeasure. Our proposed method obtained an accuracy of 97% for the binary classification of spoofed and bona-fide audio. The 100% precision rate of our technique signifies that the proposed countermeasure is effective in detecting replay signals. It had a recall and F1-score of 90.19% and 94.84%, respectively. The ASVspoof organizers' baselines used Constant Q Cepstral Coefficients (CQCC) with a GMM classifier, and Linear Frequency Cepstral Coefficients (LFCC) with a GMM classifier. The resultant systems, however, are not effective enough to be used in a real-time environment because their features do not capture sufficient information. The 2.95% EER of our method is significantly lower than that of the baseline methods. The voice replay detection baselines obtained EERs of 13.54% and 11.04% using LFCC-GMM and CQCC-GMM, respectively; the EER of our system is lower by 10.59 and 8.09 percentage points, respectively. The ASVspoof2019 PA dataset contains audio samples recorded using several recording devices of different qualities: perfect, high, or low. The rooms used for the replay attack recordings are also of different sizes (10-20m, 5-10m and 2-5m). The PA dataset is therefore diverse. The proposed method had an accuracy of 97%, indicating it is effective in detecting voice replay spoofing attacks.
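The reported figures can be cross-checked with a few lines of Python. The sketch below only uses the standard definition of the F1-score as the harmonic mean of precision and recall, together with the EER values quoted in this section; the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    return 2 * precision * recall / (precision + recall)

# Reported precision (100%) and recall (90.19%) of the proposed system.
f1 = f1_score(1.0, 0.9019)

# Reported EERs (%) of the baselines and of the proposed system.
baseline_eer = {"LFCC-GMM": 13.54, "CQCC-GMM": 11.04}
proposed_eer = 2.95
gaps = {name: round(eer - proposed_eer, 2) for name, eer in baseline_eer.items()}
```

Under these definitions, a precision of 100% and a recall of 90.19% give an F1-score of about 94.84%, and the EER gaps to LFCC-GMM and CQCC-GMM come out at 10.59 and 8.09 percentage points.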

Confusion matrix of the system
This section gives a comprehensive evaluation of the classification results of our proposed system, as depicted in Fig. 8. Some samples that were spoofed were detected as bona-fide. In total, 3% of the data are classified incorrectly, while the rest are correctly classified. In the confusion matrix, 1 stands for the bona-fide class and 2 stands for the spoofed class.

SVM performance
The SVM classifier is utilized in various applications. First, the 14-dim features were extracted to train the SVM classifier. The SVM obtained 78.03% accuracy and 68.27% precision. The F1-score and recall attained by emd-MFCC and emd-GTCC with the SVM are 74.31% and 81.53%, respectively. Figure 9 shows the detailed results.

Confusion matrix of the SVM classifier
A confusion matrix was created for the SVM classifier to evaluate its performance in detecting replay attacks (Figure 10).

The ensemble classifier's performance
The second traditional classifier used for the detection of replay attacks is the ensemble classifier. This classifier is utilized in several applications. The 14-dim features are obtained and passed into the classifier for classification into spoofed and bona-fide audio. This method obtained an accuracy of 84.71%, which is 6.68 percentage points higher than that of the SVM classifier. The precision rate is 71.53%, higher than the 68.27% of the SVM. The F1-score and recall are 76.63% and 81.53%, respectively. Our technique achieved a precision of 100% and an accuracy of 97%, the latter being 12.29 percentage points higher than that of the ensemble classifier. Figure 11 shows the comprehensive results of the classifier and our technique. Figure 12 shows the classification outcomes in terms of the TP, FP, FN, and TN values. It shows that the ensemble classifier correctly classified 3,644 bona-fide and 13,162 spoofed samples, while 1,756 and 1,278 audio samples were misclassified.
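The accuracy reported for the ensemble classifier can be reproduced from the confusion-matrix counts quoted above. The short Python sketch below uses only those counts; the helper name is illustrative.

```python
def accuracy(correct_counts, incorrect_counts):
    """Fraction of samples classified correctly."""
    correct = sum(correct_counts)
    return correct / (correct + sum(incorrect_counts))

# Counts quoted for the ensemble classifier: correctly classified bona-fide and
# spoofed samples, then the two groups of misclassified samples.
ensemble_accuracy = accuracy([3644, 13162], [1756, 1278])
gain_over_svm = round(84.71 - 78.03, 2)  # percentage points
```

The 16,806 correct decisions out of 19,840 samples correspond to the reported 84.71% accuracy, and the difference from the SVM accuracy is the stated 6.68 percentage points.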

Performance of kNN classifier
The performance of the KNN classifier in detecting voice replay attacks was also examined. The KNN classifier is utilized in several applications. The obtained 14-dim features are passed into the KNN for classification into bona-fide or replayed voice. Figure 13 shows that the accuracy achieved using our features with the KNN classifier is 77.09%. The precision rate is 89.2%, which is higher than the 71.53% of the ensemble classifier. The F1-score and recall are 67.94% and 54.86%, respectively. These two parameters are lower for the KNN-based method than for the SVM-based technique and the ensemble classifier.

Performance comparison with existing systems
The performance of our proposed technique is compared with that of other existing methods. The comparison is based on the obtained EER value. The least effective approach is the LFCC-GMM baseline with an EER of 13.54%, while the CQCC-GMM baseline had an EER of 11.04%. The second most effective approach is [49], with an EER of 7.99%; it used a Deep Neural Network (DNN) and the CQSPIC method, with the DNN classifying the speech as authentic or replayed. In comparison with the other methods, our proposed method performed remarkably well, with an EER of 2.95%, which is significantly lower than those of the other techniques. Figure 15 illustrates the comparison between our proposed approach and the others. The experimental results and the comparison with traditional classifiers show that our proposed approach can capture the distinguishing features of authentic audio and replay signals.
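Since the comparison rests on the EER, a minimal Python sketch of how an EER can be estimated from detection scores may be useful. It assumes that a higher score means "more likely bona-fide" and uses a simple sweep over the observed scores as thresholds; the function name and the toy scores are illustrative and are not data from this study.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER (as a fraction): the operating point where the false
    acceptance rate and the false rejection rate are closest to each other."""
    best_gap, eer = float("inf"), None
    for thr in sorted(set(bonafide_scores) | set(spoof_scores)):
        # Spoofed samples accepted as bona-fide at this threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        # Bona-fide samples rejected at this threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy scores (hypothetical): one bona-fide and one spoofed sample overlap.
toy_bonafide = [0.9, 0.8, 0.7, 0.4]
toy_spoof = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_bonafide, toy_spoof)
```

For the toy scores, the two error rates meet at a threshold of 0.6, where one of four spoofed samples is accepted and one of four bona-fide samples is rejected. A lower EER therefore means fewer errors of both kinds at the balanced operating point.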

Conclusion
Attackers use advanced devices to record the voices of bona-fide, registered speakers and replay them to ASV systems to obtain unlawful access for malicious purposes. These kinds of attacks are serious threats to the security of these systems. To secure ASV systems against voice replay spoofing attacks, we proposed a method that uses the empirical mode decomposition of speech signals. GTCC and MFCC are used as features, and the BiLSTM is used to classify the audio as bona-fide or spoofed. The ASVspoof2019 PA dataset is used for the experiments. Our approach achieves an accuracy of 97% and a precision rate of 100%. The F1-score and recall values are 94.84% and 90.19%, respectively. Our proposed approach obtained a significantly lower EER of 2.95%, which is 8.09 and 10.59 percentage points lower than the CQCC-GMM and LFCC-GMM baseline methods, respectively. The evaluation indicates that our proposed system is reliable for the detection of replay attacks. In future work, we aim to explore the efficiency of our proposed approach on algorithm-generated voice attacks.