The basic idea of the proposed Ano-Det method is as follows: first, we convert the health data of students into lightweight health indexes and store them on the cloud platform; next, we calculate the similarity between each pair of students' health conditions based on these health indexes; finally, we cluster the students according to their health indexes and discover possible anomalies from the clustering results. The concrete details of the Ano-Det method are described below.
Step 1: Generate each student’s health index
As the example in Fig. 1 indicates, the students' health data monitored by wearable sensors are often expressed as curves that fluctuate over time. Therefore, we first model the students' health data with a multi-dimensional matrix \(\kappa\) depicted in Eqs. (1)-(2). Here, we assume that there are N students, i.e., \(s_{1}\), ..., \(s_{N}\), and M health criteria (e.g., heart rate, blood pressure, etc.), i.e., \(c_{1}\), ..., \(c_{M}\). Each entry of matrix \(\kappa\), i.e., \(A_{i,j}\) (i = 1, 2, ..., N; j = 1, 2, ..., M), represents student \(s_{i}\)'s health data over criterion \(c_{j}\). Furthermore, as described in Fig. 1, each entry \(A_{i,j}\) is a time-varying curve; therefore, we formulate \(A_{i,j}\) as the vector in Eq. (2), where K denotes the number of time points at which the wearable sensors monitor and record the health conditions of students. For example, K = 3 means that three pieces of health data are recorded by the wearable sensors. In this sense, parameter K describes the health data monitoring frequency.
$$\begin{aligned} \kappa = \begin{array}{cc} & c_{1}\quad \cdots \quad c_{M}\\ \begin{array}{c} s_{1} \\ \vdots \\ s_{N} \end{array} & \left[ \begin{array}{ccc} A_{1, 1} & \cdots & A_{1, M} \\ \vdots & \ddots & \vdots \\ A_{N, 1} & \cdots & A_{N, M} \end{array}\right] \end{array} \end{aligned}$$
(1)
$$\begin{aligned} A_{i, j} = (a_{i, j, 1},\ldots , a_{i, j, K}) \end{aligned}$$
(2)
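To make this data model concrete, the following minimal sketch (in Python with numpy; the sizes and values are hypothetical, not from the paper) builds the \(N*M*K\) tensor \(\kappa\) of Eqs. (1)-(2):

```python
import numpy as np

# Hypothetical sizes: N students, M health criteria, K monitored time points.
N, M, K = 4, 3, 5

rng = np.random.default_rng(0)

# kappa[i, j] is the K-point time series A_{i,j} = (a_{i,j,1}, ..., a_{i,j,K})
# of student s_i on criterion c_j, so kappa is the N*M*K tensor of Eq. (1).
kappa = rng.normal(size=(N, M, K))
```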
As Eqs. (1)-(2) show, \(\kappa\) is an \(N*M*K\) tensor. To ease the subsequent calculations, we need to convert the \(N*M*K\) tensor \(\kappa\) into a multi-dimensional vector. To achieve this goal, we first convert each K-dimensional vector \(A_{i,j}\) into a concrete value. Concretely, we first produce a K-dimensional vector B, presented in Eq. (3). Each entry of vector B is generated by Eq. (4), where the function \(\varGamma (-1, 1)\) is responsible for producing a random value in [-1, 1]. Then, with the K-dimensional vector \(A_{i,j}\) and the K-dimensional vector B, we compute their inner product according to Eq. (5), and the result is denoted by \(\varOmega _{i, j}\).
$$\begin{aligned} B=(b_{1},\ldots , b_{K}) \end{aligned}$$
(3)
$$\begin{aligned} b_{k} = \varGamma (-1, 1)(k=1, 2,\ldots , K) \end{aligned}$$
(4)
$$\begin{aligned} \varOmega _{i, j} = A_{i, j}*B \end{aligned}$$
(5)
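Continuing the sketch above (names are reused from the previous snippet and remain hypothetical), Eqs. (3)-(5) amount to one random projection:

```python
# Eqs. (3)-(4): random vector B whose entries b_k are drawn from [-1, 1].
B = rng.uniform(-1.0, 1.0, size=K)

# Eq. (5): Omega_{i,j} = A_{i,j} . B, computed for all (i, j) pairs at once;
# the last axis of kappa (length K) is contracted against B.
Omega = kappa @ B  # shape (N, M)
```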
According to Eq. (5), \(\varOmega _{i, j}\) is a real value in \((-\infty , +\infty )\). Next, to ease the subsequent calculations, we convert the real value \(\varOmega _{i, j}\) into a Boolean value \(\varPsi _{i, j}\), as formulated by Eq. (6). In Eq. (6), \(\varPsi _{i, j}\) is mapped to 1 or 0, whose rationale is as follows: consider a data point D and a hyperplane H; if point D is above the hyperplane H, then the \(\varPsi _{i, j}\) value corresponding to D equals 1; otherwise, if point D is below the hyperplane H, the \(\varPsi _{i, j}\) value corresponding to D equals 0. This way, the position relationship between point D and hyperplane H can be used to evaluate whether two points are close or not, since two nearby points are likely to fall on the same side of a random hyperplane. This is the theoretical basis behind the hash mapping operation adopted in Eq. (6).
In this way, we convert the K-dimensional vector \(A_{i,j}\) in Eq. (2) into a Boolean value \(\varPsi _{i, j}\). Correspondingly, the \(N*M*K\) tensor \(\kappa\) in Eq. (1) is simplified into the \(N*M\) matrix \(\kappa\) in Eq. (7). Next, we further simplify the \(N*M\) matrix \(\kappa\) into an N-dimensional vector through the transformation in Eq. (8). Here, \(\pi _{i}\) is the decimal value corresponding to the Boolean vector (\(\varPsi _{i, 1}\), ..., \(\varPsi _{i, M}\)). For example, if (\(\varPsi _{i, 1}\), ..., \(\varPsi _{i, M}\)) = (1, 1, 1), then \(\pi _{i}\) = 7. Thus, we convert the \(N*M\) matrix \(\kappa\) in Eq. (7) into the N-dimensional vector \(\kappa\) in Eq. (8). In other words, each student \(s_{i}\) corresponds to a concrete decimal value \(\pi _{i}\), which can be regarded as the health index of student \(s_{i}\).
$$\begin{aligned} \varPsi _{i, j}=\left\{ \begin{array}{rcl} 1 & & \text {when}\ \varOmega _{i, j}>0\\ 0 & & \text {otherwise} \end{array} \right. \end{aligned}$$
(6)
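In the same sketch, the hash mapping of Eq. (6) is simply a sign test on \(\varOmega _{i, j}\):

```python
# Eq. (6): Psi_{i,j} = 1 when Omega_{i,j} > 0, and 0 otherwise.
Psi = (Omega > 0).astype(int)  # shape (N, M), the matrix of Eq. (7)
```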
$$\begin{aligned} \kappa = \begin{array}{cc} & c_{1}\quad \cdots \quad c_{M}\\ \begin{array}{c} s_{1} \\ \vdots \\ s_{N} \end{array} & \left[ \begin{array}{ccc} \varPsi _{1, 1} & \cdots & \varPsi _{1, M} \\ \vdots & \ddots & \vdots \\ \varPsi _{N, 1} & \cdots & \varPsi _{N, M} \end{array}\right] \end{array} \end{aligned}$$
(7)
$$\begin{aligned} \kappa = \begin{array}{cc} \begin{array}{c} s_{1} \\ \vdots \\ s_{N} \end{array} & \left[ \begin{array}{c} \pi _{1} \\ \vdots \\ \pi _{N} \end{array}\right] \end{array} \end{aligned}$$
(8)
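The transformation of Eq. (8) then reads each Boolean row as a binary number; the sketch below assumes, consistent with the example above, that \(\varPsi _{i, 1}\) is the most significant bit:

```python
# Eq. (8): pi_i is the decimal value of the bit string (Psi_{i,1}, ..., Psi_{i,M}),
# e.g. (1, 1, 1) -> 7 when M = 3.
weights = 2 ** np.arange(M - 1, -1, -1)  # (2^{M-1}, ..., 2, 1)
pi = Psi @ weights                       # shape (N,), one health index per student
```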
The advantages of the health index are three-fold. First, a health index contains little private information about a student and hence can be transmitted or released to the cloud platform with low privacy risk, which alleviates people's privacy-disclosure concerns when a cloud platform integrates their distributed data for uniform processing and mining. Second, health index-based similar-student retrieval is fast. Third, the retrieval results based on health indexes are close to those based on the original, sensitive health data. Therefore, we use the health indexes of students in the subsequent distance calculation (Step 2) and anomaly detection (Step 3), which guarantees that the distance calculation and anomaly detection process is both time-efficient and privacy-preserving.
Step 2: Calculate the similarity between each pair of students based on their health indexes
As discussed in Step 1, each student \(s_{i}\) corresponds to a concrete decimal value \(\pi _{i}\). However, \(\pi _{i}\) is derived from the random vector B in Eq. (3), which brings additional uncertainty into the health indexes of students. To reduce this uncertainty, q (q is an integer larger than 1) decimal values are obtained for each student \(s_{i}\). Concretely, for each \(s_{i}\), we repeat the operations in Eqs. (3)-(8) q times to generate \(\pi _{i, 1}\), ..., \(\pi _{i, q}\). We thereby obtain the new matrix \(\kappa\) specified in Eq. (9), in which each student \(s_{i}\) corresponds to a q-dimensional vector (\(\pi _{i, 1}\), ..., \(\pi _{i, q}\)). This vector is regarded as the health index of student \(s_{i}\).
$$\begin{aligned} \kappa = \begin{array}{cc} \begin{array}{c} s_{1} \\ \vdots \\ s_{N} \end{array} & \left[ \begin{array}{c} \left( \pi _{1, 1} \cdots \pi _{1, q} \right) \\ \vdots \\ \left( \pi _{N, 1} \cdots \pi _{N, q} \right) \end{array}\right] \end{array} \end{aligned}$$
(9)
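A compact way to realize Eq. (9) is to wrap Eqs. (3)-(8) in a function and call it with q independent random vectors; the following is a sketch under the same assumptions as the snippets above:

```python
def health_index(kappa, q, rng):
    """Return the N x q matrix of Eq. (9): row i is the health index
    (pi_{i,1}, ..., pi_{i,q}) of student s_i."""
    N, M, K = kappa.shape
    weights = 2 ** np.arange(M - 1, -1, -1)
    index = np.empty((N, q), dtype=int)
    for z in range(q):
        B = rng.uniform(-1.0, 1.0, size=K)  # fresh random vector, Eqs. (3)-(4)
        Psi = (kappa @ B > 0).astype(int)   # Eqs. (5)-(6)
        index[:, z] = Psi @ weights         # Eq. (8)
    return index
```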
With the health indexes of two students \(s_{i}\) and \(s_{j}\), i.e., (\(\pi _{i, 1}\), ..., \(\pi _{i, q}\)) and (\(\pi _{j, 1}\), ..., \(\pi _{j, q}\)), we can compute the similarity between \(s_{i}\) and \(s_{j}\) (denoted by \(Sim(s_{i}, s_{j})\)) with the formulas in Eqs. (10)-(11). Here, \(Sim(s_{i}, s_{j})\) represents the number of dimensions in which the values of \(s_{i}\) and \(s_{j}\) are equal. For example, consider two students \(s_{1}\) and \(s_{2}\) whose health indexes are (1, 2, 3, 4, 5) and (1, 2, 3, 6, 7), respectively; then \(Sim(s_{1}, s_{2})\) = 3 according to Eqs. (10)-(11). Furthermore, to loosen the judgement condition in Eq. (11), we create p (p is an integer larger than 1) hash tables, i.e., we generate \(\kappa _{1}\), ..., \(\kappa _{p}\) by repeating Eq. (9) p times. We then update Eq. (11) to Eq. (12), where the similarity judgement condition is loosened considerably.
$$\begin{aligned} Sim(s_{i}, s_{j}) = \sum \limits _{z=1}^{q} Sim_{i, j, z} \end{aligned}$$
(10)
$$\begin{aligned} \begin{aligned} Sim_{i, j, z} = 1, \text {iff}\ \pi _{i, z} = \pi _{j, z}(z=1, 2,\ldots , q) \end{aligned} \end{aligned}$$
(11)
$$\begin{aligned} \begin{aligned} Sim_{i, j, z} = 1, \text {iff}\ \pi _{i, z} = \pi _{j, z}\ (z=1, 2,\ldots , q) \\ \text {holds}\ \text {in}\ \text {at}\ \text {least}\ \text {one}\ \text {of}\ \kappa _{1},\ldots , \kappa _{p} \end{aligned} \end{aligned}$$
(12)
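Under the same assumptions, \(Sim(s_{i}, s_{j})\) over p hash tables (Eqs. (10)-(12)) can be sketched as follows; the values of p and q are hypothetical:

```python
def similarity(i, j, tables):
    """Sim(s_i, s_j) of Eqs. (10)-(12): count the dimensions z for which
    pi_{i,z} = pi_{j,z} holds in at least one of the tables kappa_1..kappa_p."""
    q = tables[0].shape[1]
    matched = np.zeros(q, dtype=bool)
    for table in tables:            # each table is one N x q matrix of Eq. (9)
        matched |= (table[i] == table[j])
    return int(matched.sum())

# p hash tables, each built with independent randomness.
p, q = 3, 8
tables = [health_index(kappa, q, rng) for _ in range(p)]
```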
Step 3: Student health condition clustering and anomaly detection
According to the similarities between students calculated in Step 2, we can cluster the students into different groups. In general, students whose similarity with each other is large belong to the same group; for example, if the similarity between two students is q, they are put into the same group. To discover the most similar students, we set a threshold \(T (T \le q)\) for \(Sim(s_{i}, s_{j})\): only students \(s_{i}\) and \(s_{j}\) whose \(Sim(s_{i}, s_{j})\) is not smaller than T are deemed similar. Following this clustering rule, we can divide all the students into different groups. Furthermore, the students who have no similar students are regarded as anomalies. This way, we can recognize anomalous students accurately, while the sensitive information contained in the health data transmitted to the cloud platform is well protected.
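For completeness, the thresholding rule of Step 3 can be sketched as follows (a simplified reading of the clustering rule, under the same assumptions as the snippets above: a student with no similar peer is flagged as an anomaly):

```python
def detect_anomalies(tables, T):
    """Flag every student s_i who has no peer s_j with Sim(s_i, s_j) >= T."""
    N = tables[0].shape[0]
    return [
        i for i in range(N)
        if not any(similarity(i, j, tables) >= T for j in range(N) if j != i)
    ]

anomalous_students = detect_anomalies(tables, T=6)  # hypothetical threshold T <= q
```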
Next, we present the following algorithm to ease the understanding of our Ano-Det method.