Categorizing Malware via A Word2Vec-based Temporal Convolutional Network Scheme

As the edge computing paradigm has gained great popularity in recent years, some technical challenges must still be addressed to guarantee smart device security in the Internet of Things (IoT) environment. Smart devices now transmit personal data across the IoT for various purposes, and malware that steals or damages these data causes losses and poses a serious threat to users. To improve malware detection performance on IoT smart devices, we conduct a malware categorization analysis on the Kaggle Microsoft Malware Classification Challenge (BIG 2015) dataset in this article. Motivated by the temporal convolutional network (TCN) structure, we propose a malware categorization scheme built mainly on a pre-trained Word2Vec model. The popular one-hot encoding converts input names from malicious files into high-dimensional vectors, since each name occupies one dimension of the one-hot vector space. Word2Vec pre-training instead yields more compact vectors with fewer dimensions, which leads to fewer parameters and stronger malware feature representation. Moreover, compared with long short-term memory (LSTM), TCN demonstrates better performance with longer effective memory and faster training speed in sequence modeling tasks. The experimental comparisons on this malware dataset reveal better categorization performance with less memory usage and training time. In particular, compared with the state-of-the-art Word2Vec-based LSTM approach, our scheme achieves approximately 1.3% higher accuracy on this malware categorization task, with about 90 thousand fewer parameters and more than 1 hour less training time.


Introduction
Recent developments in the field of edge computing have drawn extensive attention to smart device security in the Internet of Things (IoT) environment [1]. Nowadays, smart devices interact with networks for various purposes. A mass of personal information, including health [...] on malware detection and categorization on IoT remains imperative and promising. Malware detection and analysis have received extensive discussion, yet traditional approaches are not fully applicable to edge devices in the IoT environment. Certain traditional defense techniques applied to general desktop computing environments rely on pre-defined rule libraries. However, because smart devices are portable, they are not always connected to fixed and trusted networks, and thus perimeter-based defenses, including firewalls and intrusion detection, are not available for edge devices [5]. Moreover, as smart devices put more emphasis on real-time interaction, the corresponding malware identification requires faster response speed than on traditional platforms. Current malware identification for edge devices mainly relies on the malware signature databases from software distributors, yet this approach cannot keep up with the growing number of malware samples in the edge computing paradigm. Research on automatic malware analysis techniques in the IoT environment is therefore urgent. In our previous works, to measure the stability of cyber-physical systems (CPSs) under malicious attacks, we developed a finite-time observer to estimate the state of the CPSs [6]. Then, we proposed a kernel learning algorithm to improve the malware detection performance on complex datasets with noise [7]. In addition to detection performance, memory footprint and response speed are also of enormous importance for current smart devices on IoT, and this poses higher requirements for edge malware analysis.
In this article, we are committed to improving edge malware identification performance with low memory footprint and fast response speed.
As one of the most active technology companies, Microsoft has invested heavily in the IoT field, and Windows-based applications have been well developed via its Azure IoT platform services [8]. Focusing on the Windows-based malware invasion problem on the IoT platform, this article proposes a malware categorization scheme that attributes malware to different families through a Word2Vec-based temporal convolutional network (TCN). The model performance is evaluated against several representative works, i.e., Naive Bayes Classifier, OneHot-based TCN, and Word2Vec-based long short-term memory (LSTM), on the Microsoft Malware Classification Challenge (BIG 2015) dataset.
In this research, opcode and application programming interface (API) call name sequences are first extracted from the malware assembly files. Then, in consideration of the benefits of a pre-training strategy for achieving better performance, a Word2Vec model, which encodes textual data with a distributed representation by considering the context, is implemented for input name vectorization. Compared with the one-hot encoding approach, Word2Vec encodes the input names into more compact numeric vectors by training a language model, which leads to a lower memory footprint and better representational ability. Finally, a TCN, as an advanced convolutional network structure for sequence modeling tasks, is developed to attribute the malware. Compared with recurrent neural networks (RNNs), e.g., gated recurrent unit (GRU) and LSTM, TCN is easy to implement in parallel because of its convolutional structure. In addition, TCN has the significant advantage of a lower memory requirement than canonical recurrent networks due to the shared filters across the convolutional layers.
Our contributions in this article are summarized as follows. [...]
The remainder of this article is organized as follows. The next section gives a summary of the background, consisting of the Word2Vec model, the TCN structure, and recent works on IoT malware classification and categorization. Following that, the proposed scheme and its time complexity are elaborated and analyzed. Then, the next part describes the experimental settings and results for model evaluation. The final section includes a conclusion of the proposed scheme and a promising direction for further research.

Word2Vec model
Input name sequences from malware samples are textual data that should be encoded into numeric vectors for feature representation. Word embeddings are general approaches to map the primitive representation of words into high-dimensional numeric vectors in an embedding space while preserving word distances. Word embeddings have attracted increased research interest, and Word2Vec is one of the most significant text representation models among them [9,10]. Word2Vec assumes that contexts in natural language are highly correlated, and hence words can be vectorized according to their contexts [11]. Word vectors can then be obtained from a training corpus to measure the semantic similarities between words in natural language. Note that in Word2Vec the word vectors are generally taken from the weights of the trained language model rather than being its direct training targets. Generally, Word2Vec includes two kinds of architectures, i.e., continuous bag-of-words (CBOW) and skip-gram (SG), to learn distributed representations [12][13][14]. A simple skip-gram model architecture is shown in Fig. 1 [10].
A large and growing body of literature has studied the effectiveness of the Word2Vec model in various areas. In [15], the Word2Vec technique was applied to social relationship mining in a multimedia recommendation method. This method recommended multimedia to users based on a trust relationship, and Word2Vec was used to encode the sentiment words in related comments into word vectors. In [16], a Word2Vec-based music modeling method adopted skip-gram to model slices of music from a large music corpus. According to their experimental results, Word2Vec proved to be a useful embedding technique for capturing meaningful tonal and harmonic relationships in music. Word2Vec has also shown powerful representation ability for inverse virtual screening in the early stage of the drug discovery process. In [17], Word2Vec was combined with a dense fully connected neural network algorithm to perform a binary classification on input protein candidates. In addition, several recent studies have investigated Word2Vec in the areas of malware classification and detection. In [18], a malware detection method named DroidVecDeep was designed to detect unknown malicious applications on the Android platform. Here, features were first extracted by static analysis and ranked by mean decrease impurity, and were then transformed into compact vectors via the Word2Vec model to train a deep classifier. In [19], a LeNet5 structure was developed for malware classification based on multi-channel feature matrices, which were converted from malware binary files and assembly files via the Word2Vec technique.
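To make the skip-gram idea concrete, the following minimal sketch shows how (center, context) training pairs can be drawn from an opcode sequence. The function name `skipgram_pairs`, the `window` parameter, and the toy opcode sequence are illustrative assumptions and are not taken from the article; an actual Word2Vec implementation additionally trains a shallow network on such pairs.

```python
from typing import List, Tuple


def skipgram_pairs(sequence: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every position in the sequence.

    Each name is paired with the names at most `window` positions away,
    which is the kind of training example a skip-gram model learns from.
    """
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


if __name__ == "__main__":
    # Hypothetical opcode sequence, used only for illustration.
    for pair in skipgram_pairs(["push", "mov", "call", "pop"], window=1):
        print(pair)
```

Names that occur in similar contexts produce similar pairs, which is why their learned vectors end up close together in the embedding space.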

Temporal convolutional network
RNNs are considered the general methods for sequence modeling tasks. However, certain convolutional structures show state-of-the-art performance in some sequence modeling tasks, such as audio synthesis, machine translation, and language modeling [20][21][22]. To verify whether the success of convolutional structures is limited to these specific sequence modeling applications, the TCN structure was developed and compared with common RNNs, such as GRU and LSTM, on a comprehensive set of sequence modeling tasks. The comparison results on these tasks indicate better performance and longer effective memory of the TCN structure [23].
TCN uses a specific 1D convolutional structure for sequence information representation. Assuming $x = (x_1, \ldots, x_t, \ldots, x_l)$ is the input sequence, $l$ denotes the input sequence length, $x_t$ denotes the input at time step $t$, $g \in \mathbb{R}^{h \times n}$ represents $n$ convolutional filters with kernel size $h$, and "$*$" denotes the convolution operator, a canonical 1D convolutional operation can be formed as [24]: [...]
However, 1D convolutional networks face information leakage and output shrinkage problems. To overcome these limitations, TCN combines a 1D fully-convolutional network (FCN) and causal convolutions [25]. In a 1D FCN, hidden layers have the same length as the input sequence to prevent the output length from shrinking. In causal convolutions, the output at time step $t$ is convolved only with the neural nodes at time $t$ and earlier in the previous layer. Moreover, considering that the receptive field of a 1D FCN is linear in the number of convolutional layers, the dilated convolution technique is integrated into the TCN structure for longer effective memory. Then, the dilated convolutional layer can be defined as: [...] where "$*_d$" denotes the convolution operation with dilation factor $d$.
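As a rough illustration of the dilated causal convolution described above, the sketch below applies a single filter to a scalar sequence in pure Python. It assumes the common TCN formulation in which the output at step t sums the kernel taps over inputs at steps t, t - d, t - 2d, and so on, with zeros to the left of the sequence; the function name and this exact indexing are assumptions rather than the article's definition.

```python
from typing import List


def dilated_causal_conv1d(x: List[float], g: List[float], d: int = 1) -> List[float]:
    """Single-filter causal convolution: y[t] = sum_k g[k] * x[t - d*k].

    Positions before the start of the sequence are treated as zeros
    (left zero-padding), so the output has the same length as the input
    and y[t] never depends on inputs later than step t.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, weight in enumerate(g):
            idx = t - d * k
            if idx >= 0:
                acc += weight * x[idx]
        y.append(acc)
    return y


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(dilated_causal_conv1d(signal, [1.0, 1.0], d=1))
    print(dilated_causal_conv1d(signal, [1.0, 1.0], d=2))
```

With d = 1 this is an ordinary causal convolution; a larger d lets the same two-tap kernel reach further into the past without adding parameters.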
Residual connection is another important ingredient of TCN [26]. With a residual connection, the output of a branch containing a series of transformations $G$ is added to the input of the block. Assuming the input of the residual block is $z$ and the output of the block is $o$, the residual block can be defined as:
$$o = \mathrm{Activation}(z + G(z))$$
Compared with canonical RNNs, such as LSTM and GRU, TCN generally has longer effective memory and better performance. Additionally, two other advantages follow from the particular TCN structure. The fact that neural nodes in each hidden layer are not sequentially connected enables parallel computation for higher computational efficiency, and the shared filters across each layer lead to fewer parameters in TCN. A common TCN structure is illustrated in Fig. 2 [27].

Machine learning methods on edge malware detection and categorization
With the rapid development of IoT, smart devices have suffered various attacks in edge computing paradigm. For instance, in the distributed denial-of-service (DDoS) attack on October 21, 2016, large amounts of IoT devices, such as digital video recorders (DVRs) and internet protocol (IP) cameras, were infected by Mirai to participate in this attack [28]. Therefore, research on malware categorization and analysis in the IoT environment is of great significance. As machine learning methods, such as support vector machine (SVM), extreme learning machine (ELM), neural network (NN), have shown good achievements on classification tasks, there has been a surge of interest in machine learning methods on edge malware detection in recent years. In [29], Sagar developed a three-stage malware detection model to improve detection performance. Term frequency-inverse document frequency (TF-IDF) and information gain (IG) features were extracted in the first stage, and then principal component analysis (PCA) technique was brought in for feature extraction. Finally, a deep belief network (DBN) with optimized activation function was constructed to attribute the malware. In [4], Niu et al. combined static analysis and extreme gradient boosting (XGBoost) method to overcome the low accuracy of static analysis and high resource overhead of dynamic analysis on X86-based IoT devices in an autonomous driving application. In [30], the opcodes of IoT applications were transmuted into a vector space, and then fuzzy and fast fuzzy tree methods were developed to detect and classify the malware. In addition, control flow graph (CFG) was another common choice for malware classification. In [31], a CFG-based deep learning model was constructed to identify malware and benignware IoT disassembled samples.

The proposed malware categorization scheme
In this section, a brief introduction to the malware dataset used in this work is given first. Then, the pre-processing used to filter the input sequences is analyzed. Furthermore, a Word2Vec-based TCN for malware categorization is elaborated. Through the employment of a pre-trained Word2Vec model, the input name sequences are embedded into a vector space, and then a TCN structure is developed for malware categorization. The whole process is illustrated in Fig. 3. The comparison between the state-of-the-art Word2Vec-based LSTM approach (left) and our proposed scheme (right) is illustrated in Fig. 4. The comparison in Fig. 4 shows that the main differences between our proposed scheme and this Word2Vec-based LSTM are the pre-processing and the categorization network. In pre-processing, we apply extra useful tricks for feature extraction. Continuously repeated names, which represent repeated processes in program execution, provide no additional information for malware categorization. Therefore, a strategy to remove the repeats is designed here. In addition, sequences that are too short, which provide inadequate information for family classification and introduce much noise into the feature representation, are eliminated in our scheme. Considering the categorization network, the TCN in our scheme has longer effective memory due to the dilated convolution structure. Moreover, the residual structure is another reason that our scheme performs better than the Word2Vec-based LSTM. More details about the proposed scheme are described in the following parts.
Fig. 3 The whole process of our proposed scheme. The whole process consists of input, output, pre-processing and test set validation process, and three network modules: Word2Vec pre-trained model, input embedding module and TCN categorization module

Dataset
In this article, experiments on the Microsoft Malware Classification Challenge (BIG 2015) dataset [32] are performed to evaluate the proposed scheme. The original dataset, approximately 500GB in size, consists of more than 20K malware samples belonging to nine malware families. In this work, considering that the test data have no labels and are therefore unusable for supervised tasks, only the labeled training data of the whole competition dataset are utilized. The corresponding assembly source file of every malicious program is produced from the binary file through Interactive Disassembler (IDA) Pro (http://www.hex-rays.com/products/ida/). Then the opcode and API call name sequences are extracted from the corresponding assembly source files.

Pre-processing
Input name sequences are roughly extracted from assembly source files, and therefore further data processing is an essential first step before feature representation [33]. Some extracted sequences contain many consecutive duplicate opcode and API call names which supply no additional information for modeling, so reducing consecutive repeated names is an imperative procedure. Meanwhile, the sequences extracted from assembly source files have unequal lengths, so unifying the length of the sequences is another consideration. As a whole, the main data pre-processing techniques in this work are as follows:
• Filter consecutive duplicate opcode and API call names: Remove the consecutive and identical names in input sequences to avoid redundant information.
• Filter short sequences: Some sequences from assembly source files which consist of only a few opcode and API call names may contain insufficient information to identify the corresponding programs, and these sequences are removed from the dataset.
• Unify the sequence length: Samples with various lengths are tricky for neural networks, and therefore unifying the sequence length is imperative for malware categorization. In this work, a sequence length L is pre-set to equalize the lengths [34]. Sequences longer than L retain the first L names, and those shorter than L are unified via zero-padding.
After the data pre-processing, the dataset contains 10868 samples and the vocabulary contains 1121 unique opcode and API call names. In the experiments, the extracted sequences are split into training, validation, and test sets with proportions of 0.64, 0.16, and 0.2, respectively. The statistical information of each category is shown in Fig. 5 and data samples are shown in Fig. 6.
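The three pre-processing steps listed above can be sketched in a few lines of Python. The helper names, the `min_len` threshold, and the `<pad>` token are hypothetical: the article does not state the minimum length it uses, and it pads with zeros after the names have been mapped to indices, whereas this sketch pads at the token level for readability.

```python
from itertools import groupby
from typing import List, Optional

PAD = "<pad>"  # hypothetical padding token standing in for the zero index


def collapse_repeats(names: List[str]) -> List[str]:
    """Keep one name from each run of consecutive identical names."""
    return [name for name, _ in groupby(names)]


def preprocess(names: List[str], max_len: int, min_len: int) -> Optional[List[str]]:
    """Apply the three steps; return None when the sequence is filtered out."""
    names = collapse_repeats(names)
    if len(names) < min_len:          # too short to be informative
        return None
    names = names[:max_len]           # keep only the first max_len names
    return names + [PAD] * (max_len - len(names))


if __name__ == "__main__":
    raw = ["mov", "mov", "push", "call", "call", "call"]
    print(preprocess(raw, max_len=5, min_len=2))
```

Note that the length filter is applied after duplicates are collapsed, so a long run of a single repeated name is still treated as a short sequence; whether the article applies the steps in this order is an assumption.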

Word2Vec-based TCN structure
Word2Vec-based TCN mainly consists of a Word2Vec model and a TCN sequence analysis model. In this structure, input sequences are first passed to the Word2Vec model, and then the embedding layer weights are initialized with the numeric vectors from the trained Word2Vec model. Subsequently, a specific TCN for malware categorization is trained. Finally, the Word2Vec-based TCN model is automatically evaluated on the test set. The algorithm description is presented in Algorithm 1.
TCN, which consists of several specific convolutional structures, is an advanced sequence modeling structure. Compared with common RNNs, such as LSTM and GRU, TCN is characterized by fewer network parameters and faster training speed with better performance on sequence modeling tasks. In this article, a TCN structure as illustrated in Fig. 7 is developed for malware categorization. In Fig. 7, the TCN is constructed from stacked residual blocks in which the dilation factor grows exponentially as the blocks are stacked. In addition, each residual block contains two dilated causal convolutional layers, and all the convolutional layers in this TCN contain 32 filters. Finally, in the last layer, "fc", a fully connected layer with 9 hidden neurons and a softmax activation function, outputs the predicted family probabilities.
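To see why exponentially growing dilation factors give a long effective memory, the receptive field of such a stack can be computed directly. The sketch below assumes two dilated causal convolutions per residual block, as described above; the kernel size and the dilation list in the example are hypothetical values, since the actual settings are those listed in Table 1.

```python
from typing import Sequence


def tcn_receptive_field(kernel_size: int,
                        dilations: Sequence[int],
                        convs_per_block: int = 2,
                        stacks: int = 1) -> int:
    """Number of input time steps that can influence one output step.

    Every dilated causal convolution with kernel size k and dilation d
    extends the receptive field by (k - 1) * d time steps.
    """
    return 1 + stacks * convs_per_block * (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    # Hypothetical configuration: kernel size 2, dilations doubling per block.
    print(tcn_receptive_field(kernel_size=2, dilations=[1, 2, 4, 8]))
```

Because the sum of a doubling dilation list grows exponentially with the number of blocks, a handful of blocks is enough to cover long opcode sequences, whereas undilated layers would extend the receptive field only linearly.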

Loss function and optimization
Considering that malware categorization on the Microsoft Malware Classification Challenge (BIG 2015) dataset is a multi-class problem, the categorical cross-entropy loss function is adopted in this article.
Assuming $y_{ij}$ denotes the true probability of the $i$th sample belonging to malware family $j$, $\hat{y}_{ij}$ denotes the predicted probability of the $i$th sample belonging to family $j$, $N$ denotes the sample size, and $M$ denotes the number of malware families, the categorical cross-entropy loss function is defined as:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log \hat{y}_{ij}$$
The Adam optimizer, which combines the first moment estimation and the second moment estimation of the gradient, is a common optimizer in neural networks [35]. Hence the Adam optimizer is employed in this work.
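As a small worked example of the categorical cross-entropy loss, the sketch below evaluates it for one-hot labels in plain Python. It assumes the loss is averaged over the N samples, and the `eps` clipping is an implementation convenience to avoid taking the logarithm of zero; neither detail is specified in the article.

```python
import math
from typing import List


def categorical_cross_entropy(y_true: List[List[float]],
                              y_pred: List[List[float]],
                              eps: float = 1e-12) -> float:
    """Mean over samples of -sum_j y_ij * log(y_hat_ij)."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(true_row, pred_row):
            if y > 0:
                total -= y * math.log(max(p, eps))
    return total / len(y_true)


if __name__ == "__main__":
    labels = [[1, 0, 0], [0, 1, 0]]
    probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
    print(categorical_cross_entropy(labels, probs))
```

With one-hot labels only the predicted probability of the true family contributes, so the loss above is simply the average of -log(0.5) over the two samples.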

Time complexity
When a 1D convolutional structure is used for sequence modeling in natural language processing, the input sequences are always encoded into numeric vectors first. Then, assuming $x \in \mathbb{R}^{l \times m}$ is an input sequence where $l$ denotes the length of the input sequence and $m$ denotes the dimensionality of the embedding space, $n$ denotes the number of convolutional filters, and $h$ denotes the length of the 1D convolutional filter kernel ($l \gg h$), the time complexity of the 1D convolutional layer is: [...]
Assuming $d$ is the dilation factor of the dilated convolutional layer in TCN, the time complexity of this dilated convolutional layer is: [...]
Moreover, the mathematical form of a residual connection is:
$$o = \mathrm{Activation}(z + G(z))$$
where $o$ denotes the residual block output, $z$ denotes the input of the block and $G$ denotes a series of transformations. It can be seen that the residual connection is linear. Assuming that $G$ in a residual block contains two dilated convolutional layers, which is the general case, the time complexity of this TCN structure can be approximately estimated as: [...]
From (5) and (8), the time complexities of the TCN residual block and the 1D convolutional structure are roughly comparable. Considering that the input data are fixed after pre-processing and embedding space construction, the number of convolutional filters and the length of the filter kernels are the main variable parameters affecting time consumption in the convolutional structure. Moreover, since dilated convolutions are potent tricks in the TCN structure for a large receptive field, TCN residual blocks require less computing time as the dilation factor grows. Finally, the TCN achieves good performance with less time consumption by stacking several residual blocks.
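The following sketch counts multiply-accumulate operations for a 1D convolutional layer and for a residual block with two dilated layers, using the symbols l, m, n, h and d introduced above. It is a back-of-the-envelope count under stated assumptions, not a reproduction of the article's equations: with causal padding the output keeps length l, while without padding the output is assumed to shrink to l - d(h - 1); the cost of the shortcut addition and any 1x1 convolution is ignored.

```python
def conv1d_macs(l: int, m: int, n: int, h: int, d: int = 1,
                causal_padding: bool = True) -> int:
    """Multiply-accumulates of one 1D convolutional layer.

    l: sequence length, m: input channels (embedding size),
    n: number of filters, h: kernel length, d: dilation factor.
    """
    out_len = l if causal_padding else max(0, l - d * (h - 1))
    return out_len * h * m * n


def residual_block_macs(l: int, m: int, n: int, h: int, d: int,
                        causal_padding: bool = True) -> int:
    """Two stacked dilated layers; the second one sees n input channels."""
    first = conv1d_macs(l, m, n, h, d, causal_padding)
    mid_len = l if causal_padding else max(0, l - d * (h - 1))
    second = conv1d_macs(mid_len, n, n, h, d, causal_padding)
    return first + second


if __name__ == "__main__":
    # Hypothetical sizes: 100 steps, 300-dim embeddings, 32 filters, kernel 3.
    print(conv1d_macs(100, 300, 32, 3))
    print(conv1d_macs(100, 300, 32, 3, d=4, causal_padding=False))
    print(residual_block_macs(100, 300, 32, 3, d=1))
```

Under the padded assumption the count does not depend on d at all, whereas under the unpadded assumption it decreases as d grows; in both cases a residual block stays within a small constant factor of a single layer when l is much larger than h.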

Experiments
To evaluate the performance of our proposed malware categorization scheme, the classical Naive Bayes Classifier for N-gram model (Ngram NBC, for short) is used as the baseline in our experiments [36]. In addition, to verify that the numeric vectors from the pre-trained Word2Vec model are capable of representing the malware feature sequences more precisely, the popular one-hot encoding technique combined with TCN (OneHotTCN, for short) is included in the comparison. Then, our proposed scheme (Word2VecTCN, for short) is compared with the state-of-the-art malware categorization model in [34] (Word2VecLSTM, for short). Finally, our scheme is compared with some other recent works on the same malware dataset.

Experimental environment
Our experiments are conducted on the Kaggle competition of Microsoft Malware Classification Challenge (BIG 2015) dataset to evaluate the malware categorization performance on the IoT malware recognition task. Considering the samples in each category are different in quantity, we divide the dataset into training, validation, and test set in a stratified fashion to ensure the same relative proportion in each set. More dataset statistical information is in the previous section.
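A stratified split of the kind described above can be sketched with the standard library alone. The function below splits each family separately so that every subset keeps roughly the same class proportions; the 0.64/0.16/0.2 fractions come from the article, while the function name, the rounding rule, and the fixed random seed are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(labels: Sequence[Hashable],
                     fractions: Tuple[float, float, float] = (0.64, 0.16, 0.20),
                     seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Return train/validation/test index lists with per-class proportions kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]   # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    toy_labels = [0] * 100 + [1] * 50
    tr, va, te = stratified_split(toy_labels)
    print(len(tr), len(va), len(te))
```

In practice a library routine with a stratification option would normally be used instead; the sketch only makes the per-class bookkeeping explicit.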
Our experiments are implemented in Python with additional libraries such as TensorFlow and Keras, while the training and evaluation processes are conducted on a Tesla K80 GPU in the Google Colaboratory system, a Google cloud service supporting artificial intelligence research [37]. In addition, early stopping and a learning rate schedule are used as extra strategies in the training phase. The learning rate is initially 0.001, and is then reduced to 10% of the original value if the validation loss stops declining for 5 epochs.
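The learning rate schedule described above can be illustrated with a small framework-independent sketch. It assumes that "stops declining" means no new best validation loss for 5 consecutive epochs and that each reduction multiplies the current rate by 0.1; the function name and the toy loss values are hypothetical, and early stopping is not modeled here.

```python
from typing import List, Sequence


def plateau_schedule(val_losses: Sequence[float],
                     lr: float = 0.001,
                     factor: float = 0.1,
                     patience: int = 5) -> List[float]:
    """Learning rate in effect after each epoch, given the validation losses."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:   # plateau detected: shrink the rate
                lr *= factor
                wait = 0
        history.append(lr)
    return history


if __name__ == "__main__":
    losses = [1.0, 0.9] + [0.95] * 5
    print(plateau_schedule(losses))
```

Deep learning frameworks typically ship an equivalent callback, so a sketch like this is mainly useful for checking how a given patience and factor would behave on a recorded loss curve.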

Metrics
The following basic criteria are universally defined for performance evaluation of machine learning techniques: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Here, to evaluate the performance of the malware categorization models, metrics based on the above criteria, namely true positive rate (TPR), false positive rate (FPR), positive predictive value (PPV), F-measure (F-M), and accuracy (ACC), are calculated and compared [38]. In the experiment, the metrics of each malware family are computed first. Then, considering the class imbalance problem in this dataset, weighted results over the nine malware families are also calculated in this article. The metrics are defined as:
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
$$\mathrm{PPV} = \frac{TP}{TP + FP}$$
$$\mathrm{F\text{-}M} = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}}$$
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
Additionally, total training time, test time, and training time per epoch are other important indexes used for time consumption evaluation in this article.
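The per-family metrics and their weighted averages can be computed from a confusion matrix as in the sketch below. It treats each family in turn as the positive class (one-vs-rest) and weights each family by its number of true samples; this support-based weighting is an assumption, since the article does not spell out the weights it uses.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]]) -> List[Dict[str, float]]:
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        tn = total - tp - fn - fp
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        ppv = tp / (tp + fp) if tp + fp else 0.0
        f_m = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
        results.append({"TPR": tpr, "FPR": fpr, "PPV": ppv,
                        "F-M": f_m, "support": float(tp + fn)})
    return results


def weighted_average(results: List[Dict[str, float]], key: str) -> float:
    """Average a metric over classes, weighted by the class sample counts."""
    total = sum(r["support"] for r in results)
    return sum(r[key] * r["support"] for r in results) / total


if __name__ == "__main__":
    toy_cm = [[8, 2], [1, 9]]   # toy two-class confusion matrix
    metrics = per_class_metrics(toy_cm)
    print(metrics)
    print(weighted_average(metrics, "TPR"))
```

With support-based weights the weighted TPR equals the overall accuracy of the multi-class prediction, which gives a quick sanity check on the computation.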

Parameter selection
The parameters in our proposed scheme and the comparison methods are elaborated in this section. As shown in Table 1, TCN and LSTM have some similar parameters. Here, "max sequence length" is the maximum length of the opcode and API call name sequences: sequences longer than this threshold are clipped to "max sequence length", and shorter ones are padded with 0 to reach it. The parameter "batch size" is the number of samples fed into the models in each iteration. The parameter "learning rate" is the learning rate in the optimization procedure. Malware opcode and API call names should be mapped into numeric vectors before feature representation, and "embedding size" is the dimension of the embedding space. The parameter "number of layers" is the number of network layers; for example, two LSTM layers are stacked in this article. The parameter "dropout rate" is the dropout proportion of the network nodes in the training phase. The parameter "hidden layer neuron" represents the number of neurons in the LSTM hidden layers. The parameter "number of filters" is the number of filters in the convolutional layers. The parameter "number of stacks" is the number of stacked convolutional structures in residual blocks. Considering the dilated convolution used in TCN, "dilations" is a list of dilation factors. The parameter "kernel size" is the filter kernel size in the convolutional layers. Not all parameters need to be tuned in both networks, and "-" indicates that the corresponding parameter does not exist in that network. Moreover, the parameters in the OneHot-based method are basically identical to those in the Word2Vec-based TCN, except that there is no "embedding size" in the OneHot-based TCN.

Results
Experimental results are shown in this section. Figures 8 and 9 reveal the accuracy and loss comparisons between our scheme and OneHot-based TCN in the training phase. Figures 10 and 11 reveal the accuracy and loss comparisons between our scheme and Word2Vec-based LSTM in the training phase. The confusion matrix on the test set of our scheme is illustrated in Fig. 12. Then the metrics on each family of our scheme are computed in Table 2. The weighted evaluation metrics and the time consumption comparisons on this malware categorization task are presented in Tables 3 and 4, respectively. Finally, accuracy comparison between our scheme and other works on this dataset is conducted in Table 5.
The comparisons between our proposed Word2Vec-based scheme and the OneHot-based one are shown in Figs. 8 and 9. From Fig. 8, the validation accuracy of our scheme is initially 29.8% and increases to a final value of 97.9%, while the validation accuracy of the OneHot-based TCN method is initially 11.4% and grows to a final accuracy of 96.5%. From Fig. 9, the validation loss of our scheme is initially 5.88 and finally decreases to 0.12, while the validation loss of OneHot-based TCN is initially 2.34 and finally decreases to 0.21. The two figures reveal that Word2Vec has stronger feature representation ability than one-hot encoding on this malware categorization dataset. Specifically, in terms of the embedding layer, the dimension of the numeric vectors generated by one-hot encoding reaches 1121, which is the number of unique opcode and API call names, while the dimension of the numeric vectors trained by Word2Vec is 300. This more compact representation considerably reduces the memory footprint on edge devices.
The comparisons between our proposed scheme and the state-of-the-art Word2Vec-based LSTM model are shown in Figs. 10 and 11. From these two figures, because the "dropout rate" in our scheme is higher than that in Word2Vec-based LSTM, our scheme lags slightly behind the Word2Vec-based LSTM model at the beginning of the training phase. Still, with its powerful feature representation ability, our scheme finally achieves higher accuracy and lower loss than the Word2Vec-based LSTM model on both the training set and the validation set. Furthermore, the Word2Vec-based LSTM model needs to train about 672 thousand parameters while our scheme requires only approximately 582 thousand parameters, and the results show that Word2Vec-based TCN has better representation ability and lower running memory in the training phase. Figure 12 presents the predicted results of our scheme on the test set visually, and Table 2 reports the metrics on each malware family. Combining Fig. 12 and Table 2 reveals that the FPR of "Ramnit" is the highest among the nine families, and therefore identifying more accurately the malware samples that are predicted as "Ramnit" is the bottleneck for enhancing the performance of the Word2Vec-based TCN scheme. When applying this scheme to a practical IoT environment, the samples recognized as "Ramnit" need more attention. Tables 3, 4, and 5 show the comparisons of our scheme and some representative methods. In Table 3, the weighted metrics are computed and compared. The results show that the weighted F-measure and the accuracy of our scheme are approximately 1.2% and 1.3% higher than those of the Word2Vec-based LSTM, and the weighted FPR of our scheme is approximately 0.3% lower. On all these metrics, Word2Vec-based TCN achieves the best performance.
In Table 4, "Training time" is the runtime in the whole training phase, "Test time" is the [...] Considering that the convolutional structure is easy to train in parallel and that our scheme has fewer parameters than the LSTM, TCN takes much less training time than LSTM. In addition, our proposed scheme is compared in Table 5 with three other recent works on the same Microsoft malware dataset. The comparison also verifies the good performance of our scheme.

Conclusion
In this article, a Word2Vec-based TCN scheme is proposed for malware categorization in consideration of edge computing security. Opcode and API call name sequences are extracted from malicious samples firstly, and then the pre-processing is conducted for data cleaning. Subsequently, through the Word2Vec pre-training on the feature sequences, numeric vectors of the input names are generated. Additionally, the malware feature sequences represented by numeric vectors are fed into TCN to fit an IoT malware categorization model. Finally, the model performance is evaluated on the test set. The comparisons with other representative works verify that our proposed scheme can achieve decent performance while requiring a small quantity of memory and training time. From the occupancy of resources point-of-view, the benefits of combining Word2Vec model and TCN structure are noticeable.
Considering the low resource occupancy and good computing performance of our scheme, it has potential applications on smart devices for security. As a universal malware categorization scheme, it is promising for applications in multiple fields of edge computing security, such as intelligent transportation system security control and smart factory protection. The applications of our scheme in these edge computing fields will be considered in future work.