According to the three-layer framework for the literature similarity checking task presented in Fig. 1, our proposed literature similarity checking approach \(\mathrm {SC_{MEC}}\) consists of two steps: first, each literature from the mobile devices is converted into a corresponding index; second, the literature indexes are sent to the central cloud platform for uniform similarity checking. Next, we introduce the two major steps of the \(\mathrm {SC_{MEC}}\) approach.
Step 1: Conversion from literatures to indexes.
Traditional literature similarity checking approaches compare different literatures directly, e.g., word by word, sentence by sentence, or paragraph by paragraph. Such direct comparison is often time-consuming and tiresome [39, 40], which probably lowers the satisfaction of researchers, who expect a quick and accurate similarity checking result. Therefore, to speed up the similarity checking process, we convert each initial literature, which is often long, into a much shorter index or embedding. This way, we can evaluate whether two literatures are similar by comparing their index values instead of their full contents, which minimizes the time cost of similar literature evaluation and discovery. Here, we adopt the classic Simhash and LSH techniques (the time complexity of both techniques has been shown to be close to O(1)). In our proposal, the Simhash technique converts the text of each literature into a corresponding Boolean (0/1) vector that is easy to process and compare in the subsequent similar literature evaluation process.
Next, we introduce the concrete steps of the conversion from literatures to indexes. First, we use the Simhash technique to convert each literature (here, the literature set is denoted by \(Lit_{Set}\) = \((lit_1, \dots , lit_n)\)) into a signature (here, the signature set for the literatures in \(Lit_{Set}\) is denoted by \(Sig_{Set}\) = \((sig_1, \dots , sig_n)\)). The concrete conversion process is introduced in detail as follows.
(1) Word segmentation for literatures.
Each literature often contains many words, which makes it hard to calculate the similarity degree between different literatures directly. To tackle this issue, we first convert a long literature into a word vector through mature tools from the natural language processing (NLP) domain, e.g., word2vec or fastText. The concrete word segmentation process is not introduced in detail here; interested readers can refer to the related literature on NLP. Here, we take a literature lit as an example for illustration. We assume that the literature lit is converted into a word vector \(V_{lit}\) as specified in (1); a minimal tokenization sketch is given after Eq. (1). Note that m is not a fixed value, since different literatures usually contain different numbers of words after word segmentation.
$$\begin{aligned} V_{lit} = (a_1, \dots , a_m) \end{aligned}$$
(1)
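As a minimal sketch of this substep (assuming a plain regex tokenizer instead of a full word2vec/fastText pipeline; the function name segment_words is ours, introduced only for illustration), the following Python snippet splits a raw text into the word vector \(V_{lit}\) of Eq. (1).

```python
import re

def segment_words(text: str) -> list[str]:
    """Split a literature's raw text into its word vector V_lit = (a_1, ..., a_m)."""
    return re.findall(r"[a-z0-9]+", text.lower())

# A (very short) example literature becomes a word vector of length m.
V_lit = segment_words("Mobile edge computing speeds up literature similarity checking.")
print(V_lit)       # ['mobile', 'edge', 'computing', 'speeds', 'up', ...]
print(len(V_lit))  # m = 8
```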
(2) Hash projection from a word vector to a 0/1 vector.
In the last substep, we converted each literature lit into a word vector \(V_{lit}\). However, it is hard to evaluate the similarity between word vectors directly. To tackle this issue, we further convert the word vector \(V_{lit}\) of literature lit into a numerical vector \(NumV_{lit}\). The conversion can be based on any hash projection table; here, for simplicity, we use the classic ASCII coding table widely adopted in the computer domain. For example, if \(V_{lit} = (a_1, a_2)\), \(a_1\) = 11110000 and \(a_2\) = 10101010, then \(NumV_{lit}\) = (1 1 1 1 0 0 0 0 1 0 1 0 1 0 1 0). Next, we replace each “0” entry in the vector by “-1” and thus derive a new vector constituted by “1” and “-1” entries only. For example, \(NumV_{lit}\) = (1 1 1 1 0 0 0 0 1 0 1 0 1 0 1 0) is updated to \(NumV_{lit}\) = (1 1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 -1 1 -1).
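The sketch below illustrates one possible realization of this hash projection, following the ASCII example above (the function name word_to_pm1_bits and the fixed bit length are our illustrative assumptions).

```python
def word_to_pm1_bits(word: str, n_bits: int = 16) -> list[int]:
    """Project one word a_j to a fixed-length vector of +1/-1 entries.

    Following the paper's ASCII example, the characters' 8-bit codes are
    concatenated into a bit string; each '1' stays +1 and each '0' becomes -1.
    The bit string is cyclically padded/truncated to n_bits so that every
    word yields a vector of the same length.
    """
    bits = "".join(f"{ord(c):08b}" for c in word)
    bits = (bits * (n_bits // len(bits) + 1))[:n_bits]
    return [1 if b == "1" else -1 for b in bits]

print(word_to_pm1_bits("edge"))
# [-1, 1, 1, -1, -1, 1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1]
```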
(3) Vector weighting.
Assigning weights is often inevitable in applications involving multiple dimensions or criteria [41,42,43]. Accordingly, each word in a literature should be assigned a concrete weight value indicating its importance in depicting the whole literature. The word weights can be generated in many ways, such as TF/IDF, which is not repeated here. We assume that the weight vector corresponding to the m words in vector \(V_{lit}\) in (1) is \(W_{lit}\), as specified in (2)-(3).
$$\begin{aligned} W_{lit} = (w_1, \dots , w_m) \end{aligned}$$
(2)
$$\begin{aligned} \sum \limits _{j = 1}^m {{w_j}} = 1 \end{aligned}$$
(3)
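A minimal sketch of this weighting substep is shown below; it uses plain term frequency as a stand-in for TF/IDF (the function name word_weights is ours), and only guarantees the normalization of Eq. (3).

```python
from collections import Counter

def word_weights(words: list[str]) -> list[float]:
    """Assign every word occurrence a weight w_j so that sum(w_j) = 1 (Eq. (3)).

    Plain term frequency is used as a stand-in here; TF/IDF or any other
    weighting scheme could be substituted without changing the interface.
    """
    counts = Counter(words)
    raw = [counts[w] for w in words]   # more frequent words get larger raw weights
    total = sum(raw)
    return [v / total for v in raw]

weights = word_weights(["edge", "computing", "edge", "cloud"])
print(weights)  # [0.333..., 0.166..., 0.333..., 0.166...], sums to 1 up to rounding
```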
(4) Vector union by addition.
For a literature lit, with its word vector \(V_{lit}\) = \((a_1, \dots , a_m)\) in Eq. (1) (note that the replacement of “0” by “-1” has already been applied to each word's hash vector) and its weight vector \(W_{lit} = (w_1, \dots , w_m)\) in Eq. (2), we perform a dot product (weighted-sum) operation between \(V_{lit}\) and \(W_{lit}\), whose result is denoted by \(DP_{lit}\) in (4). For example, if \(V_{lit}\) = \((a_1, a_2)\), \(a_1\) = 11110000, \(a_2\) = 10101010, \(w_1\) = 0.4 and \(w_2\) = 0.6, then \(DP_{lit}\) = (1 1 1 1 -1 -1 -1 -1) * 0.4 + (1 -1 1 -1 1 -1 1 -1) * 0.6 = (0.4 0.4 0.4 0.4 -0.4 -0.4 -0.4 -0.4) + (0.6 -0.6 0.6 -0.6 0.6 -0.6 0.6 -0.6) = (1 -0.2 1 -0.2 0.2 -1 0.2 -1).
$$\begin{aligned} DP_{lit} = (b_1, \dots , b_m) = V_{lit} * W_{lit} = \sum \limits _{j = 1}^m {{a_j}*{w_j}} \end{aligned}$$
(4)
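The following sketch computes the entry-wise weighted sum of Eq. (4) and reproduces the worked example above (the function name weighted_sum is ours).

```python
def weighted_sum(word_vectors: list[list[int]], weights: list[float]) -> list[float]:
    """Eq. (4): combine the ±1 vectors of all words into DP_lit = sum_j w_j * a_j."""
    length = len(word_vectors[0])
    dp = [0.0] * length
    for vec, w in zip(word_vectors, weights):
        for k in range(length):
            dp[k] += w * vec[k]
    return dp

# The paper's example: a_1 = 11110000, a_2 = 10101010, w_1 = 0.4, w_2 = 0.6.
a1 = [1, 1, 1, 1, -1, -1, -1, -1]
a2 = [1, -1, 1, -1, 1, -1, 1, -1]
print(weighted_sum([a1, a2], [0.4, 0.6]))
# ≈ [1.0, -0.2, 1.0, -0.2, 0.2, -1.0, 0.2, -1.0] (up to floating-point rounding)
```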
(5) Dimension reduction.
In the above substep, we converted each literature lit into a corresponding vector \(DP_{lit} = (b_1, \dots , b_m)\) by Eq. (4). However, each entry of \(DP_{lit}\) can take any real value, so its value range is very large and not suitable for subsequent similarity calculation and evaluation. To overcome this shortcoming, we narrow each dimension's value range, since binary embedding is widely applied in big data scenarios to reduce search and processing time [44,45,46]. Concretely, we apply the conversion in Eq. (5), after which each entry \(b_j\) of \(DP_{lit}\) is equal to either 1 or 0, which narrows the value range of \(DP_{lit}\) significantly. This way, we achieve the goal of dimension reduction.
$$\begin{aligned} b_j = \left\{ \begin{array}{ll} 1 &\quad \text {if } b_j > 0,\\ 0 &\quad \text {if } b_j \le 0, \end{array} \right. \quad ( j = 1, 2,\dots , m ) \end{aligned}$$
(5)
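A one-line sketch of Eq. (5) is given below (the function name binarize is ours); it turns the real-valued \(DP_{lit}\) of the previous example into a 0/1 signature.

```python
def binarize(dp: list[float]) -> list[int]:
    """Eq. (5): keep 1 where the weighted sum is positive and 0 otherwise,
    yielding the literature's Simhash-style 0/1 signature."""
    return [1 if b > 0 else 0 for b in dp]

print(binarize([1.0, -0.2, 1.0, -0.2, 0.2, -1.0, 0.2, -1.0]))
# [1, 0, 1, 0, 1, 0, 1, 0]
```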
(6) Deep dimension reduction by LSH.
To further reduce the dimensionality of \(DP_{lit}\) for each literature lit, we use the LSH technique to build a deep index for each lit. Concretely, we generate an m-dimensional vector \(X = (x_1, \dots , x_m)\) randomly by Eq. (6), where each entry \(x_j\) of X belongs to the range [-1, 1]. Next, we convert the m-dimensional vector \(DP_{lit}\) into a much shorter r-dimensional vector \(Z\) \((r \ll m)\) by Eqs. (7)-(9). Here, Eq. (7) calculates the projection of the vector \(DP_{lit}\) onto the random vector X; Eq. (8) converts the projection value into a single bit; then the operations in Eqs. (7) and (8) are repeated r times, each time with a freshly generated random vector X (the same r random vectors must be shared by all literatures), to obtain \(z_1, \dots , z_r\). We thus obtain a new vector \(Z = (z_1, \dots , z_r)\) that is much shorter than the original vector \(DP_{lit}\). This way, we achieve the goal of dimension reduction. In addition, the LSH technique has been proven to be a lightweight nearest-neighbor discovery approach whose time complexity is approximately O(1). Therefore, the proposed LSH-based similar literature discovery approach is well suited to the big data context.
$$\begin{aligned} x_j = random (-1, 1) \end{aligned}$$
(6)
$$\begin{aligned} z = DP_{lit}*X =\sum \limits _{j = 1}^m {{b_j}*{x_j}} \end{aligned}$$
(7)
$$\begin{aligned} z = \left\{ \begin{array}{ll} 1 &\quad \text {if } z > 0,\\ 0 &\quad \text {if } z \le 0. \end{array} \right. \end{aligned}$$
(8)
$$\begin{aligned} Z = (z_1, \dots , z_r) \end{aligned}$$
(9)
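The sketch below realizes Eqs. (6)-(9) under the assumption that the random projection vectors are derived from a fixed seed shared by all literatures (the function name lsh_index and the seed parameter are ours).

```python
import random

def lsh_index(signature: list[int], r: int, seed: int = 0) -> tuple[int, ...]:
    """Eqs. (6)-(9): project the m-entry signature onto r random vectors X
    (entries drawn uniformly from [-1, 1]) and keep only the sign of each
    projection, producing the r-bit index Z = (z_1, ..., z_r) with r << m.

    The seed fixes the random vectors; the SAME seed must be used for every
    literature so that all literatures are hashed with identical projections.
    """
    rng = random.Random(seed)
    m = len(signature)
    z = []
    for _ in range(r):
        x = [rng.uniform(-1, 1) for _ in range(m)]         # Eq. (6)
        proj = sum(b * xj for b, xj in zip(signature, x))   # Eq. (7)
        z.append(1 if proj > 0 else 0)                      # Eq. (8)
    return tuple(z)                                         # Eq. (9)

print(lsh_index([1, 0, 1, 0, 1, 0, 1, 0], r=4, seed=42))
```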
Step 2: Similarity checking of literatures based on indexes.
In Step 1, for each literature in \(Lit_{Set}\) = \((lit_1, \dots , lit_n)\), we obtained a corresponding hash index \(Z_j ( j =1, 2, \dots , n )\). Next, we compare any two literatures \(lit_i\) and \(lit_j\) \(( 1 \le i, j \le n, i \ne j)\) by comparing their respective hash indexes \(Z_i\) and \(Z_j\). Concretely, if \(Z_i = Z_j\) holds, we can conclude that literatures \(lit_i\) and \(lit_j\) are similar with high probability. However, this conclusion is not always correct, since LSH is a probability-based neighbor search technique. To reduce the negative influence of this randomness, for each literature \(lit_j\) in \(Lit_{Set}\) we generate not one hash index \(Z_j\) but h indexes \(Z_{j}^1, \dots , Z_{j}^h\), each built from its own set of random vectors. The similarity between literatures \(lit_i\) and \(lit_j\) is then evaluated by Eq. (10). This way, we can judge whether two literatures are similar or identical based on their hash indexes alone, which improves the efficiency of literature similarity evaluation.
$$\begin{aligned} lit_i \,\, \text {and} \,\, lit_j \,\, \text {are similar iff} \,\, \exists \,\, k \,\, \text {satisfying} \,\, Z_{i}^k = Z_{j}^k (1 \le k \le h) \end{aligned}$$
(10)
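A possible realization of Step 2, reusing the lsh_index sketch above, is given below (the function names build_indexes and are_similar are ours): each literature keeps h indexes built with h different seeds, and Eq. (10) reduces to checking whether at least one index pair matches.

```python
def build_indexes(signature: list[int], r: int, h: int) -> list[tuple[int, ...]]:
    """Build h independent LSH indexes Z^1, ..., Z^h for one literature,
    reusing lsh_index from the previous sketch; the k-th index of every
    literature must be built with the same seed k."""
    return [lsh_index(signature, r, seed=k) for k in range(h)]

def are_similar(zs_i: list[tuple[int, ...]], zs_j: list[tuple[int, ...]]) -> bool:
    """Eq. (10): lit_i and lit_j are judged similar iff there exists some k
    with Z_i^k = Z_j^k."""
    return any(zi == zj for zi, zj in zip(zs_i, zs_j))

sig_i = [1, 0, 1, 0, 1, 0, 1, 0]
sig_j = [1, 0, 1, 0, 1, 0, 0, 0]
print(are_similar(build_indexes(sig_i, r=4, h=3), build_indexes(sig_j, r=4, h=3)))
# True if at least one of the three 4-bit index pairs coincides
```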
The details of our proposed \(\mathrm {SC_{MEC}}\) algorithm are described by the following pseudocode.