Predicting the total Unified Parkinson’s Disease Rating Scale (UPDRS) based on ML techniques and cloud-based updates

Nowadays, smart health technologies are used in many areas of life and the environment, such as smart living, healthcare, cognitive smart cities, and social systems. Intelligent, reliable, and ubiquitous healthcare systems are part of modern developing technology and deserve serious consideration. Data collected in different ways, for example through Internet of Things (IoT)-assisted sensors, enables physicians to predict, prevent, and treat diseases. Machine Learning (ML) algorithms may improve the accuracy of medical diagnosis and prognosis based on the health data provided by such sensors, helping physicians track symptom significance and treatment steps. In this study, we applied four ML methods to data on Parkinson’s disease to assess their performance and identify the essential features that may be used to predict the total Unified Parkinson’s Disease Rating Scale (UPDRS). Since accessibility and high-performance decision-making are vital for updating physicians and supporting IoT nodes (e.g., wearable sensors), all the data is stored in the cloud, updated by rules, and protected there. Moreover, by allocating additional computational resources and memory, cloud computing can reduce the time complexity of the training phase of ML algorithms when a complete cloud/edge architecture is required. In this setting, the approaches can be investigated over varying numbers of iterations without concern for system configuration, time complexity, or real-time performance. Analysis of the coefficient of determination and the Mean Square Error (MSE) reveals that the applied methods mostly perform at an acceptable level. Moreover, the algorithms' estimated weights indicate that Motor UPDRS is the most significant predictor of Total UPDRS.


Introduction
Numerous industries now profit from cutting-edge technology by disseminating it to the populace. Recent studies demonstrate the engagement of several researchers in wireless communication [1][2][3][4][5] to improve existing systems by addressing pertinent difficulties. Some areas of research, such as AI, have played a crucial part in the evolution of intelligence over the years, enabling projects in areas such as image recognition [6]. As stated by [7], remote-controlled robots will soon become more widespread in various fields. ML, which uses historical data as input and can predict the future based on it [8], is a discipline that handles a wide range of problems in several fields [9]. The importance of ML may be observed in several areas of health care, including genomic medicine [10], cancer detection [11], and early diabetes diagnosis [12]. Moreover, the integration of ML with other technologies, such as the IoT, resulted in the smart hospital development project of [13,14] for managing hospitalized patients efficiently.
Other examples include [15,16], which developed an intelligent system for maintaining social distance in hospitals during the pandemic, and a comparison of ML methods for classifying bioinformatics data [17]. This research concentrates on Parkinson's disease (PD), a chronic degenerative disorder of the Central Nervous System (CNS) that predominantly affects the motor system. According to [18], PD is the most quickly spreading neurological disorder globally. Marras et al. [19] forecast that by 2030 the number of Americans with PD will increase from the current one million to 1.2 million. Analyzing, understanding, and predicting the signs of PD is essential because, according to [20], there is no cure for PD, and the only treatment options are medicines, lifestyle modifications, and surgery. Since there is no cure at present, the Unified Parkinson's Disease Rating Scale (UPDRS) is used to monitor the course of the condition. This paper aims to apply different ML methods and compare their performance based on the MSE and R-squared (R²). Figure 1 summarizes the steps, from collecting the data to evaluating the different algorithms. We apply four linear regression methods, outlined in Table 1, to the PD data and evaluate them to determine the most accurate.

Data Collection
The dataset used in this study comes from the UC Irvine ML repository and was created by Athanasios Tsanas and Max Little of the University of Oxford [21]. Table 2 lists the columns of the dataset after dropping test time (which is not considered in this paper). The dataset consists of 5876 rows and 21 columns. Each row corresponds to one patient record, characterized by twenty features (after dropping test time), and each patient's information is gathered in an ordered manner. On the same day of medical examination, the same patient may have many rows with the same UPDRS result but distinct values for Shimmer, Jitter, etc.; these values are determined by analyzing various voice recordings. The dataset rows should be shuffled to keep the model general and reduce overfitting; shuffling improves the quality and predictive performance of the ML model. The data is then split into three parts:

• Training, 50% of the original dataset.
• Validation, 25% of the original dataset.
• Test, 25% of the original dataset.

Because of the different ranges of the features, and to improve the efficiency of the algorithms, the data should be normalized with respect to the training dataset by applying Eqs. 1, 2, and 3, where,

• z_train_norm is the normalized training set.
• z_validation_norm is the normalized validation set.
• z_test_norm is the normalized test set.
• µ_train is the mean row vector of the training set.
• σ_train is the standard deviation row vector of the training set.
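As a sketch, the split-wise z-score normalization of Eqs. 1, 2, and 3 can be written as follows; the function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def normalize_splits(z_train, z_validation, z_test):
    """Normalize all three splits using the TRAINING statistics only (Eqs. 1-3)."""
    mu_train = z_train.mean(axis=0)       # mean row vector of the training set
    sigma_train = z_train.std(axis=0)     # standard deviation row vector of the training set
    z_train_norm = (z_train - mu_train) / sigma_train
    z_validation_norm = (z_validation - mu_train) / sigma_train
    z_test_norm = (z_test - mu_train) / sigma_train
    return z_train_norm, z_validation_norm, z_test_norm
```

Using the training mean and standard deviation for all three splits avoids leaking validation/test statistics into the model.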
The normalization rescales the numeric columns of the dataset to a standard scale without distorting differences in the ranges of values. The correlation coefficient is then used to quantify the linear relationship between two variables: a correlation is called strong if the coefficient is between +0.50 and +1, moderate if it is between +0.30 and +0.49, and low if it is below +0.29. Figure 2 shows the correlation matrix after dropping irrelevant features. From Fig. 2, there is a strong correlation between motor UPDRS and total UPDRS, and the Shimmer and Jitter parameters have a moderate correlation with each other. In contrast, all other parameters have a low correlation with total UPDRS. Only highly and moderately correlated features have been evaluated to assess the final accuracy of the Machine Learning algorithms. Note, however, that in medical datasets features may occasionally be retained after checking with a physician to determine their value in the real world.
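The correlation screening described above can be sketched with pandas as below; the column names and toy values are illustrative placeholders, not the actual dataset contents.

```python
import pandas as pd

# Illustrative stand-in for the PD dataset (names/values are hypothetical).
df = pd.DataFrame({
    "motor_UPDRS": [20.0, 25.0, 30.0, 35.0],
    "total_UPDRS": [28.0, 33.0, 41.0, 46.0],
    "Jitter_Abs":  [3e-5, 5e-5, 4e-5, 6e-5],
})

corr = df.corr()                               # Pearson correlation matrix
# 'Strong' band from the text: |coefficient| between 0.50 and 1.
strong = corr["total_UPDRS"].abs() >= 0.50
print(corr.round(2))
```

Features falling in the low band (below +0.29 against total UPDRS) would then be dropped before training.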

Cloud-Based Computing Services and Updates
One of the main problems related to ML algorithms is storing, processing, and updating the data efficiently. The traditional approach, the local host, is still popular among researchers and developers [22,23]. One solution to the issues of locally distributed hosts is to use cloud services, which enable developers and researchers to access applications and data remotely [24,25] on a platform with large storage capacity and computational power. Storing and processing data in cloud services provides benefits including:

• Increasing the security of code and data and protecting them against hacker attacks.
• Providing accessibility, so the data can be accessed anytime from anywhere.
• Allocating extra computational resources, which may reduce the time complexity of processing.

Cloud services thus enable developers to increase code privacy and flexibility and, in advanced cases, reduce the system's time complexity and power consumption. Cloud computing can also facilitate the fast and secure processing of distributed end-user data.
As an illustration, smart sensors collect users' medical information, such as blood pressure and heart rate, and send it to the cloud for further processing, where the machine learning algorithms operate. Using cloud-based updates, we can periodically refresh the training data to make more precise predictions and decisions. Cloud-based control also allows a human expert to be involved in sensitive situations, such as the timely prediction of a heart attack. In particular, this paper has used Google Colab, a free cloud service that enables the execution of Python code in Jupyter Notebook format [26,27], which is a further use of the cloud beyond the justifications given above. Moreover, it is possible to use Google Cloud ML APIs for further processing, given the cloud's facilities. Figure 3 represents how better time complexity can be achieved using the concept of the cloud.

Linear Regression
In a model Y = f(X), Y is a random scalar variable that depends on other random variables X_1, X_2, X_3, ..., X_F; X is a random vector variable and f(·) is an unknown function [10]. By measuring X and knowing the function f(·), it is feasible to predict Y. This prediction may be preferable to a direct measurement of Y because measuring X is less costly, less invasive, less harmful to the patient, and permits many more estimations per day. To find f(·), the values of Y and X should be measured N times, as expressed in Eq. 4.
Where y represents a measured value and x is a vector of numbers. Given the observed values y(n), n = 1, ..., N, y(n) ∈ R, and the corresponding vectors of variables x(n) (x(n) ∈ R^F), it is possible to predict future values of y(n) from the feature values of x(n). x(n), called the "regressor", is the vector of independent variables, whereas y(n), known as the "regressand", is the dependent variable; the relationship between them is unknown, as shown in Fig. 4. In this paper, index n ∈ [1, N] identifies the patient and f ∈ [1, F] specifies the feature, so that x_f(n) is the f-th feature of the n-th patient. In linear regression, the assumption is represented in Eq. 5,
in which the measured values are all affected by errors. Equation 6 denotes the vector of the "regressors".
Here the vector of the "regressors" x(n) and the set of weights w to be found are expressed in Eqs. 7 and 8, respectively. The weight of a feature indicates how significant it is for the following groups of methods, and it may differ depending on the algorithm being studied. Furthermore, v(n) is the measurement error.
To expand the problem for several patients, it is possible to write the vector as a matrix, in which each row represents an individual patient, as argued in Eq. 9.
This paper applies the different linear regression models listed in Table 1 to the provided dataset to estimate UPDRS as a linear regression problem.

LLS
Finding the straight line through a set of data points that gives the best possible fit is the objective of the LLS approach, the simplest and most commonly used type of linear regression [10]. With LLS, the vector of measurements Y can be written as in Eq. 10.

Where,
• X is the matrix of patients and features (dimensions: number of patients × number of features).
• w is an unknown column vector of weights.
• Y is the experimental target data.
• v(n) is a column vector of errors.
w is the unknown vector; to find it, the square error, which is a function of w, is considered, as demonstrated in Eq. 11. Equations 12 and 13 show that, to find the minimum of the function, the gradient of f(w) is evaluated and set equal to 0, from which w follows:

∇f(w) = −2 Xᵀy + 2 XᵀXw = 0   (12)

w = (XᵀX)⁻¹ Xᵀy   (13)

Once the optimal value of the vector w has been determined, this estimate may be inserted into the f(w) formula, and the minimum is then calculated. Figures 5, 6, 7, and 8 show the results obtained with LLS. From Fig. 5, it can be seen that with LLS, Jitter RAP and Shimmer DDA have a more significant weight than the other features. High values of Jitter and Shimmer usually confirm high instability during sustained vowel production. Figures 6 and 7 compare the estimated and true values for the training and test datasets, respectively. Their similarity indicates a model that generalizes successfully. Figure 8 also shows the similarity between the estimated and true values.
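A minimal sketch of the LLS closed form of Eq. 13, assuming X is the normalized feature matrix and y is the Total UPDRS column; the function names are ours, not the paper's.

```python
import numpy as np

def lls_fit(X, y):
    """Solve w = (X^T X)^{-1} X^T y (Eq. 13)."""
    # np.linalg.solve on the normal equations is numerically preferable
    # to forming the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

def lls_predict(X, w):
    """Estimate y_hat = X w."""
    return X @ w
```

On noiseless data this recovers the generating weights exactly, which is a useful sanity check before applying it to the PD features.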

Conjugate Gradient
The Conjugate Gradient technique is generally used as an iterative algorithm; however, it can also be used as a direct method and produce a numerical solution [28]. The assumption is that two vectors d_i and d_k are conjugate with respect to the symmetric matrix Q if Eq. 14 holds.
To generate conjugate vectors, one solution is to consider the eigenvalue relation of Eq. 15,

Q u_k = λ_k u_k   (15)

where Q is symmetric, and therefore U⁻¹ = Uᵀ and the eigenvectors are orthogonal; in this case, the eigenvectors are also Q-orthogonal. The conjugate gradient algorithm starts from the initial solution w_0 = 0 and evaluates the gradient of the function as stated in Eq. 16.
The approximation of w can then be found through Eq. 17, where α_0 can be obtained using Eq. 18. The conjugate gradient algorithm is defined through Eqs. 19, 20, 21, 22, and 23 to solve the system Qw − b = 0, where Q is symmetric and positive definite in R^{N×N}. The algorithm starts by setting d_0 = −g_0 = b, the initial solution w_0 = 0, and k = 0.
Defining the stopping condition N makes it possible to determine when the algorithm stops: if k reaches N, the threshold is met; otherwise, the procedure is repeated. Figures 9, 10, 11, and 12 illustrate the results obtained by applying the gradient algorithm. The solution has been computed using 100,000 iterations with the learning coefficient set to 10^5. Figures 10, 11, and 12 show the acceptable performance of the conjugate method. Comparing Figs. 10 and 11 with Figs. 6 and 7 shows that the conjugate method and LLS have almost the same performance in terms of generalization. The optimal weight vector, however, tells a different story: in Fig. 5, Shimmer DDA reaches a value of ten, whereas for the conjugate method, Fig. 9, it is 0. Among all the features, motor UPDRS ranks first with the highest weight value.
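The iteration of Eqs. 19-23 can be sketched as below. This is a generic textbook conjugate gradient for Qw = b with Q symmetric positive definite (e.g., Q = XᵀX, b = Xᵀy), not the paper's exact code; the stopping condition is k reaching N, as in the text.

```python
import numpy as np

def conjugate_gradient(Q, b):
    """Solve Qw = b for symmetric positive definite Q in at most N steps."""
    N = b.shape[0]
    w = np.zeros(N)          # initial solution w_0 = 0
    g = -b                   # initial gradient g_0 = Q w_0 - b = -b
    d = b                    # initial direction d_0 = -g_0 = b
    for _ in range(N):       # stopping condition: k reaches N
        Qd = Q @ d
        alpha = -(d @ g) / (d @ Qd)   # exact line search along d
        w = w + alpha * d
        g = Q @ w - b                 # updated gradient
        beta = (g @ Qd) / (d @ Qd)    # keeps directions Q-orthogonal
        d = -g + beta * d
    return w
```

In exact arithmetic the method terminates in at most N steps, which is why N serves as the stopping threshold.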

Adam optimization algorithm
Adam stands for Adaptive Moment Estimation and is an extension of gradient descent. "Gradient descent" refers to the first-order iterative optimization process used to locate a local minimum or maximum of a differentiable function. Several ideas lie behind the Adam optimizer [10]. The k-th central moment of a random variable α with mean µ is defined in Eq. 24.
The variance of a random variable is its second central moment; the k-th moment of a random variable α is defined in Eq. 25. The optimizer is called Adam because it uses estimates of the first and second moments of the gradient, as in Eqs. 26 and 27, where,

• ∇f(x_s) is the current gradient.
• m and v are moving averages.
• β_1 and β_2 are hyperparameters with default values 0.9 and 0.999, respectively.
A correction factor is introduced because of the relationship between the true moments and their estimates, as given in Eqs. 28 and 29. The only step left is to use the moving averages to scale the learning rate individually for each parameter; w is then updated by Eq. 30.
Where γ is the learning rate or step size, i.e., the proportion by which the weights are updated: larger values give faster initial learning before the rate is updated, while smaller values slow learning during training. ε is a tiny number that prevents division by zero in the implementation; it is also used as a stopping condition: when the new minimum value of the gradient computed after an iteration decreases by less than ε, the algorithm stops. Figures 13, 14, 15, and 16 demonstrate the results obtained with stochastic gradient descent with Adam. One point behind the algorithm is that choosing the number of iterations is very significant: if it is too small, the reduction in error will be prolonged, while if it is too large, divergent oscillations will appear in the error plot. In this paper, the learning coefficient and the number of iterations are set to 0.001 and 2000 after testing the results. Figure 13 shows that Total UPDRS depends mostly on Motor UPDRS, the same result obtained with the Conjugate Gradient. As shown in Fig. 16, the error is distributed in intervals near zero, similar to Figs. 8 and 12. In Figs. 14 and 15, the predictions follow the axis bisector, which confirms that they are correct most of the time. The Python implementation of stochastic gradient descent with Adam is shown in Fig. 17.
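A minimal sketch of a single Adam update following Eqs. 26-30; the function signature and names are ours (the paper's actual implementation is the one shown in Fig. 17), with the default β values from the text.

```python
import numpy as np

def adam_step(w, grad, m, v, t, gamma=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, scaled step (Eqs. 26-30)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment moving average (Eq. 26)
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment moving average (Eq. 27)
    m_hat = m / (1 - beta1**t)                # bias corrections (Eqs. 28-29)
    v_hat = v / (1 - beta2**t)
    w = w - gamma * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter scaled step (Eq. 30)
    return w, m, v
```

Iterating this step with the gradient of the squared error would reproduce the training loop described above.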

Ridge Regression
Multicollinearity is a significant issue in data: under multicollinearity, least squares estimates are unbiased but their variances are large, resulting in predictions far from the real values. Ridge regression is used to reduce over-fitting: when the noise is excessively large, the optimal vector w may take on huge values, leading to over-fitting. Thus, if y = Xw + v contains large values of noise/error, the vector ŵ may take huge values, and it can be convenient to solve the new problem stated in Eq. 31.
µ should be set by trial and error. The gradient of the objective function is given in Eq. 32.
Setting Eq. 32 equal to 0 leads to Eq. 33.
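The closed form of Eq. 33 can be sketched as follows, assuming the same X and y as before; `mu` stands for the regularization weight µ that the text says is set by trial and error.

```python
import numpy as np

def ridge_fit(X, y, mu):
    """Solve w = (X^T X + mu*I)^{-1} X^T y (Eq. 33)."""
    F = X.shape[1]
    # The mu*I term keeps X^T X + mu*I invertible and shrinks w,
    # which is what counters multicollinearity and over-fitting.
    return np.linalg.solve(X.T @ X + mu * np.eye(F), X.T @ y)
```

As µ grows, the norm of the fitted weight vector shrinks toward zero, which is the intended regularizing effect.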

Results and discussion
To analyze the methodologies and understand how well the "regressand" can be predicted from the "regressors" in PD, the MSE and R² were examined. Equation 34 gives the formulation of the MSE, which is commonly used to determine whether the model's findings are satisfactory; a value closer to 0 indicates a better assessment. R², defined in Eq. 35, is also considered.
Tables 3, 4, and 5 demonstrate that the solutions fit the training set more accurately than the test set; in fact, the MSE on the training set is lower than on the test set. Table 6, on the other hand, reports the value of R², which should be as close as possible to 1.
For the coefficient of determination, a value of 1.0 indicates a perfect fit and a highly reliable model for the future, while a value of 0.0 indicates that the calculation fails to model the data accurately. From Table 6, the best coefficient of determination belongs to the Adam optimization algorithm, although it is very close to the results of the other methods. Beyond the acceptable results of the methods on the PD dataset, using cloud storage and Google Cloud decreased the algorithms' time complexity dramatically, requiring on average 40 percent of the time otherwise needed to run the methods. This means that not only is the development of the techniques useful for researchers and doctors, but using the cloud also enables programmers to develop algorithms more optimally.
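The two evaluation metrics can be computed directly from their definitions in Eqs. 34 and 35; this is a minimal sketch, not the paper's code.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Square Error (Eq. 34): closer to 0 is better."""
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    """Coefficient of determination (Eq. 35): closer to 1 is better."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives MSE = 0 and R² = 1, while always predicting the mean of y gives R² = 0, matching the interpretation above.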

Conclusion
With the increasing usage of ML, applying its methods in various areas can solve issues and improve current assumptions. In medical studies, for simplicity, some factors are treated as fixed parameters so that researchers and doctors can understand the problem more easily. Nevertheless, multi-factor models are of interest because single-factor techniques have limited application, which is why regression models are frequently used in such multi-factor situations. Linear regression models allow relationships between multiple factors to be defined and characterized. This research demonstrates that the feature most correlated with Total UPDRS is Motor UPDRS. This is expected, since PD manifests through the patient's movement, while the voice characteristics negatively impact the evaluation of the Total UPDRS parameter. Moreover, the ability of linear regression to predict PD is another result obtained in this paper: by developing linear regression methods, it is possible to predict PD, a disease without a cure for which prevention plays an important role. Ultimately, ML methods should be run and implemented in an environment where they execute in optimal time, which is why the concept of the cloud is used to reduce time complexity and enhance accuracy and accessibility. For future work, it is strongly advised that various PD datasets be used and that a neural network be deployed in the cloud, followed by an evaluation of the methodologies. Moreover, some recent and novel research on edge-computing task management, which facilitates real-time processing, has been indicated in this regard [8] and [4].