Linear Regression
In a vector \(Y = f(X), Y\) is a random variable, which depends on other random variables \(X_1, X_2, X_3, ..., X_F\). Y is a random scalar variable, X is a random vector variable, and f() is an unknown function [10]. By measuring X and knowing the function f(), it is feasible to predict Y. This prediction may be advantageous for the direct measurement of Y because measuring X is less costly, less invasive, less harmful to the patient, and permits many more estimations per day. To find f(), the values of Y and X should be measured for N times as expressed in Eq. 4.
$$\begin{aligned} y(n) = f(x(n)), \ n = 1, ..., N \end{aligned}$$
(4)
Where y represents a measured value and x determines a vector of numbers. By assuming observed values y(n), \(n = 1, ..., N\), \(y(n)\in R\), and back to the vector of variables x(n), \((x(n) \in R^F )\), it is possible to predict the future values of y(n), by using feature values of x(n). x(n), called the “regressor”, is the vector of the independent variable. In contrast, y(n), known as the “regressand”, is the dependent variable; the relationship between them is unknown, as shown in Fig. 4.
In this paper, index \(n \in [1, N]\) identifies the patient, and \(f \in [1, F]\) specifies the features so that \(x_f(n)\) is the \(f-th\) feature of the \(n-th\) patient. In linear regression, the assumption is represented in Eq. 5.
$$\begin{aligned} Y = w_1X_1 + w_2X_2 + ... + w_FX_F \end{aligned}$$
(5)
Which measured values are all affected by errors. Equation 6 denotes the vector of the “regressors”.
$$\begin{aligned} y(n) = [x(n)]^{T}w + v(n) \end{aligned}$$
(6)
Where the vector of the “regressors” x(n) and w are the set of weights to be found expressed in Eqs. 7 and 8, respectively. The weight of a feature indicates how significant it is for the following groups of methods. This shows that any particular approach it may be altered dependent on the algorithm that is being studied. Furthermore, v(n) is the measurement error.
$$\begin{aligned} x(n) = [x_1(n), ..., x_F(n)]^T \end{aligned}$$
(7)
$$\begin{aligned} w = [w_1, ..., w_F] \end{aligned}$$
(8)
To expand the problem for several patients, it is possible to write the vector as a matrix, in which each row represents an individual patient, as argued in Eq. 9.
$$\begin{aligned} \left[ \begin{array}{c} y(1)\\ y(2)\\ .\\ .\\ .\\ y(N) \end{array}\right] = \left[ \begin{array}{ccccc} x_{1}(1) &{} x_{2}(1) &{} x_{3}(1) &{} ... &{} x_{F}(1)\\ x_{1}(2) &{} x_{2}(2) &{} x_{3}(2) &{} ... &{} x_{F}(2)\\ . &{} . &{} . &{} ... &{} .\\ . &{} . &{} . &{} ... &{} .\\ . &{} . &{} . &{} ... &{} .\\ x_{1}(N) &{} x_{2}(N) &{} x_{3}(N) &{} ... &{} x_{F}(N) \end{array}\right] \left[ \begin{array}{c} w_1\\ w_2\\ .\\ .\\ .\\ w_F \end{array}\right] \left[ \begin{array}{c} v(1)\\ v(2)\\ .\\ .\\ .\\ v(N) \end{array}\right] \end{aligned}$$
(9)
This paper applies the different linear regression models mentioned in Table 2 to the provided dataset to estimate UPDRS as a linear regression problem.
LLS
Finding the straight line across a group of data points that provides the greatest possible fit is the objective of the LLS approach, which is both the simplest and most often used type of linear regression [10]. Considering LLS makes it possible to write the vector of measurement Y as Eq. 10.
$$\begin{aligned} Y = X^T.w + v(n) \end{aligned}$$
(10)
Where,
-
X has dimensions, number of patients, and features.
-
w is a column vector of weights, and it is unknown.
-
Y the experimental target data.
-
v(n) is a column vector of errors.
w is the unknown vector, and to find out that, the concept of square error is considered, which is the function of w, as demonstrated in Eq. 11.
$$\begin{aligned} f(w) = ||y - X.w||^{2} \end{aligned}$$
(11)
Equations 12 and 13 represent how to find the minimum of the function, the gradient of f(w) should be evaluated and then set equal to 0, and at the end, w can be found.
$$\begin{aligned} \nabla f(w) = -2.X^T.y + 2.X^T.X.w = 0 \end{aligned}$$
(12)
$$\begin{aligned} w = (X^T.X)^{-1}.X^T.y \end{aligned}$$
(13)
Once the optimal value of vector w has been determined, this estimate may be inserted into the f(w) formula, and the minimum is then calculated. Figures 5, 6, 7, and 8 show the results obtained in LLS. From Fig. 5, it can be seen that, by considering LLS, Jitter RAP and Shimmer DDA have a more significant weight than other features. High values of Jitter and Shimmer usually confirm high instability during sustained vowel production. Figures 6 and 7 show the results of the comparison of estimation y and true values for training and test datasets, respectively. The similarity between training and testing appears by comparing them, which describes models that generalize successfully. Figure 8 also shows the similarities between the estimated and true values.
Conjugate Gradient
The Conjugate Gradient technique is generally used as an iterative algorithm; however, it can be used as a direct method and produce a numerical solution [28]. The assumption is that two vectors of \(d_i\) and \(d_k\) are conjugate concerning the symmetric matrix of Q if Eq. 14 is true.
$$\begin{aligned} d_{i}^{T}Qd_{k} = 0 \end{aligned}$$
(14)
To generate conjugate vectors, one of the solutions is considering Eq. 15.
$$\begin{aligned} Qu_k = \lambda _{k}u_k \end{aligned}$$
(15)
Where Q is symmetric, and therefore \(U^{-1} = U^{T}\) and the eigenvectors are orthogonal. In this case, the eigenvectors are also Q-orthogonal. The conjugate gradient algorithm starts from the initial solution, which means \(w_0 = 0\). It evaluates the function’s gradient as stated in Eq. 16.
$$\begin{aligned} g_0 = \nabla f(w_0) = Qw_0 - b = -b \end{aligned}$$
(16)
Approximation of w can be found through Eq. 17.
$$\begin{aligned} w_1 = w_0 + \alpha _{0}d_0 \end{aligned}$$
(17)
Where \(\alpha _{0}\) can be obtained by using Eq. 18
$$\begin{aligned} \alpha _{0} = \frac{d_{0}^{T}b}{d_{0}^{T}Qd_0} \end{aligned}$$
(18)
The conjugate gradient algorithm is defined through Eqs. 19, 20, 21, 22, and 23 to solve the system \(Qwb = 0\), where Q is symmetric and positive definite in \(\mathrm {I\!R}^{N*N}\). The algorithm is started by setting \(d_0 = - g_0 = b\), initial solution as \(w_0 = 0\), and \(k = 0\).
$$\begin{aligned} \alpha _{k} = -\frac{d^{T}_{k}g_{k}}{d^{T}_{k}Qd_k} \end{aligned}$$
(19)
$$\begin{aligned} w_{k+1} = w_k + \alpha _k d_k \end{aligned}$$
(20)
$$\begin{aligned} g_{k+1} = Qw_{k+1} - b = g_k + \alpha _kQd_k \end{aligned}$$
(21)
$$\begin{aligned} \beta _k = \frac{g_{k+1}^{T}Qd_k}{d_{k+1}^{T}Qd_k} \end{aligned}$$
(22)
$$\begin{aligned} d_{k+1} = -g_{k+1} + \beta _k d_k \end{aligned}$$
(23)
Defining the stopping condition of N makes it possible to understand when the algorithm is stopped. If k reaches the N, the threshold is met. Otherwise, the procedure is repeated. Figures 9, 10, 11, and 12 illustrate the results obtained by applying the Gradient Algorithm. The solution has been applied using 100000 iterations and setting the learning coefficient \(10^{5}\). Figures 10, 11, and 12 can be seen as the acceptable performance of the conjugate method. By comparing Figs. 10 and 11 with Figs. 6 and 7, it can be seen that both conjugate and LSS have almost the same performance in the case of generalization. However, the optimal weight vector has a different story. As seen in Fig. 5, shimmer DDA reaches ten value, but this number for Conjugate, Fig. 9, is 0. Among all the features, motor UPDRS stands first and has the highest weight value.
Adam optimization algorithm
Adam stands for Adaptive Moment Estimation. It is an extension of gradient descent. The term “gradient descent” refers to the first-order iterative optimization process used to locate a local minimum or maximum of a differentiable function. There are some points behind Adam’s optimization [10]. The k-th central moment of a random variable \(\alpha\) with mean \(\mu\) is defined as Eq. 24.
$$\begin{aligned} m^{(k)} = E\{(\alpha - \mu )^{k}\} \end{aligned}$$
(24)
The variance of a random variable is the second central moment; the \(k-th\) moment of a random variable \(\alpha\) is defined in Eq. 25.
$$\begin{aligned} \mu ^{(k)} = E\{\alpha ^{k}\} \end{aligned}$$
(25)
This optimization is called Adam because of using the estimation of the first and second moments of the gradient to adapt the learning rate for the weight of the neural network. Adam tests Eqs. 26 and 27.
$$\begin{aligned} \mu ^{(1)}_{i} = m = \beta _1\mu _{i-1}^{(1)} + (1 - \beta _1) \nabla f(x_s) \end{aligned}$$
(26)
$$\begin{aligned} \mu ^{(2)}_{i} = v = \beta _2\mu _{i-1}^{(2)} + (1 - \beta _2) [\nabla f(x_s)]^2 \end{aligned}$$
(27)
Where
-
\(\nabla f(x_s)\) is a current gradient.
-
m and v are moving average.
-
\(\beta\) is the hyperparameter with the default values of 0.9 and 0.999.
The correction factor is introduced due to the relationship between true moments, which is estimated, as mentioned in Eqs. 28 and 29.
$$\begin{aligned} \hat{m} = \hat{\mu }_{i}^{(1)} = \frac{\hat{\mu }_{i}^{(1)}}{1 - \beta ^{i+1}_{1}} \end{aligned}$$
(28)
$$\begin{aligned} \hat{v} = \hat{\mu }_{i}^{(2)} = \frac{\hat{\mu }_{i}^{(2)}}{1 - \beta ^{i+1}_{2}} \end{aligned}$$
(29)
The only step left is to use moving averages to scale the learning rate individually for each parameter. w is calculated by Eq. 30.
$$\begin{aligned} w_{i+1} = w_i - \gamma \frac{\hat{\mu }^1}{\sqrt{\hat{\mu }_{i}^{2} + \epsilon }} \end{aligned}$$
(30)
Where \(\gamma\) is the learning rate or step size. The proportion that weights are updated. Larger values result in faster initial learning before the rate is updated. Smaller values slow the learning rate down during training. \(\epsilon\) is a tiny number to prevent any division by zero in the implementation. Also, it is used as a stopping condition parameter in the algorithm when the new minimum value of the gradient is calculated after each iteration. It is not decreasing more than the value of \(\epsilon\). Figures 13, 14, 15, and 16 demonstrate the result obtained from the Stochastic gradient with Adam. There is a point behind the algorithm were choosing the number of iterations is very significant. If it is too small, the reduction in error will be prolonged, while if it is too large, divergent oscillations in the error plot will occur. In this paper learning coefficient and the number of iterations are set to 0.001 and 2000 after testing the result. By observing Fig. 13, the Total UPDRS depends mostly on Motor UPDRS. This is the same result that Conjugate Gradient got. As stated by Fig. 16, the error is distributed around intervals near zero, which is similar to Figs. 8 and 12. In Figs. 14 and 15, it can be observed that the data’s prediction follows the axis bisector, which confirms that it is correct most of the time. Stochastic gradient with Adam in Python can be shown Fig. 17.
Ridge Regression
Multicollinearity is a significant issue with data. In multicollinearity, least squares are unbiased, and variances are substantial, resulting in predictions far from real values. Ridge regression is utilized to reduce over-fitting concerns; when the noise is excessively huge, the optimum vector w may assume a tremendous value, leading to over-fitting. If \(y = Xw + v\) has some large values of noise/error, the vector \(\hat{w}\) may take huge values. Then, it might be convenient to solve the new problem, as stated in Eq. 31.
$$\begin{aligned} \min _{w} = ||y - Xw||^{2} + \mu ||w||^{2} \end{aligned}$$
(31)
\(\mu\) should be set by trial and error. The gradient of the objective function is noted in Eq. 32.
$$\begin{aligned} \nabla _wg(w) = 2X^TXw - 2X^Ty_{meas} + 2\lambda w \end{aligned}$$
(32)
By setting Eq. 32 equal to 0 leads to Eq. 33.
$$\begin{aligned} \hat{w} = (X^TX + \lambda I)^{-1}X^Ty_{meas} \end{aligned}$$
(33)
Figures 18, and 19 illustrate the Ridge Regression method’s satisfactory performance. Motor UPDRS, like all other techniques except LLS, has the highest weight value, as stated in Fig. 20. Figure 21 illustrates the error histogram of the Ridge Regression, which follows the exact behavior of other methods.