Development of a cloud-assisted classification technique for the preservation of secure data storage in smart cities

Cloud computing is a recent advancement for smart cities, driven by the increasing volume of heterogeneous data produced by applications. Processing this volume of data requires more storage capacity and processing power. Data analytics is used to examine various datasets, both structured and unstructured. Nonetheless, as the complexity of data in the healthcare and biomedical communities grows, obtaining more precise results from analyses of medical datasets presents a number of challenges. Big data is abundant in the cloud environment and needs proper classification, which can be performed effectively using machine learning. Machine learning is used to investigate algorithms for learning and data prediction. The Cleveland database is frequently used by machine learning researchers. The performance metrics used to compare the proposed and existing methodologies include execution time, defect detection rate, and accuracy. In this study, two supervised learning-based classifiers, SVM and a novel KNN, were proposed and used to analyze data from a benchmark database obtained from the UCI repository. Initially, intrusions were detected using the SVM classification method. The study then demonstrated that the novel KNN, which relies on a distance function, outperformed previous studies. The accuracy of both approaches is evaluated. The results show that the intrusion detection system (IDS) built on the suggested system produces the best results, with an accuracy of 98.98%.


Introduction
Cloud computing can be used to explore the full potential of smart city services, which are supported by highly inventive and scalable service platforms. Smart cities require a decentralized cloud-based platform and an open-source network to be implemented. Multi-sensor apps can perform complex big data processing using dispersed sensor networks thanks to Internet of Things features included in the cloud platform [1]. The Indian government has different plans for implementing the smart city objective in different cities, depending on the level of development required. India is transforming both rural and urban areas into smart cities in order to improve the quality of life and communication between the government and its citizens. Many factors influence the growth of a smart city, including the facilitation of multiple land uses, the provision of adequate housing for all, the encouragement of multiple modes of transportation, the creation of citizen-friendly and cost-effective governance, and the provision of a distinct character for the city. A cloud-assisted categorization strategy for secure data storage preservation in smart cities is a highly efficient and secure approach to organizing and preserving data that is collected from various smart city devices such as sensors and cameras. This strategy involves leveraging cloud computing technology to store and process data, along with a classification algorithm to sort the data into specific categories based on certain criteria.
The primary objective of this strategy is to provide a secure and scalable solution for managing the vast amounts of data generated by smart city devices. With the use of cloud computing, data can be centrally stored and processed, making it more straightforward to manage and analyze. Furthermore, the classification algorithm ensures the efficient categorization of the data based on its content, which facilitates easier storage and retrieval.
This technique has the potential to enhance the effectiveness of smart city initiatives by enabling better data management and analysis. The centralized storage and analysis of data can provide insights that can inform decision-making and lead to improved services for residents. Additionally, the use of a classification algorithm streamlines the organization of the data, making it easier to access and retrieve specific information when needed.
Cloud computing [2] provides a large platform for smart cities by supplying domain-specific applications with the services they require, driving the design of all system components, and determining the majority of technical choices for everything from intelligent devices and sensors to middleware and computing infrastructure. Figure 1 depicts the entire data mining process, from data storage to data analytics. However, when the amount of stored data becomes extremely large, handling and managing it becomes extremely difficult. Structured databases and database management systems were created to address these issues. Efficient database management systems are required for retrieving specific information from large amounts of aggregate data, and because such systems are widely used, gathering all types of information is simple [3]. Data warehouses collect and store information from various sources. Data mining is a powerful tool for many businesses because it condenses the large amount of information held in data warehouses into usable knowledge. Data mining tools are distinguished by their use of an automated analysis process, which allows new information to be discovered from historical data. In this way, a large set of data can be analyzed at a specific point in time using data mining.
This method is used to analyze various fields or variables in small data samples, and it provides simple and effective solutions for relatively simple data analysis. Essential data that is present in an unorganized form can be discovered through the effective use of data mining. Data mining tools are used to discover previously unknown patterns in databases. Typical pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data, which may also reveal basic data entry errors. The final results are presented to domain experts, such as network supervisors, in a form that humans can understand. These experts extract predictive information from applications using highly efficient data mining tools. Text reports, scientific data, and satellite images can all be valuable sources for extracting information. Simply retrieving information is not enough to make decisions. To improve decision-making, new methods for dealing with the collected data are developed. This technology can extract the essence of stored information, discover patterns in raw data, and perform automatic data summarization [4].

Motivation
In order to effectively manage network traffic, it is necessary to categorize input data into defined classes through classification methods. The behavior of data flow on the network must be analyzed, and traffic must be classified into attacker and non-attacker categories. To achieve this, a Wireshark dataset is utilized, which goes through three critical steps. The first step is data preprocessing, which eliminates redundancy in the data [5]. The next step is clustering, which groups data into clusters based on their similarity and dissimilarity. The center point of each cluster is determined using the k-means clustering approach. The Euclidean distance is then computed to describe the distance between each data point and the center point. Finally, a classifier is used to categorize the input data based on polarity, resulting in accurate classification while reducing execution time. Future work could focus on enhancing the classification accuracy by incorporating a hybrid classifier or by exploring different classification algorithms [6,7]. Additionally, the study acknowledges the limitations of considering only technological means of addressing network security, and suggests that legal and institutional frameworks should also be taken into account.

Major contributions
This research work presents several significant contributions to the field of intrusion detection in cloud storage: i) A novel intrusion detection method using the K-Nearest Neighbor (KNN) classification algorithm has been introduced. This method analyzes the unusual patterns of activity in the network and identifies and isolates abnormal nodes. ii) A framework model has been proposed that utilizes a modified version of the KNN algorithm for classifying network traffic. The model applies a predicting algorithm to improve the accuracy of the classification process. iii) Machine learning techniques have been employed to evaluate the effectiveness of the proposed algorithm and achieve the desired results. iv) The outcomes of the proposed approach have been analyzed and compared with those of existing methods in terms of various parameters such as accuracy, execution time, and information retrieval metrics.

Organization of paper
The subsequent sections of this paper are structured as follows: Section 2 presents the literature review, and Section 3 describes the proposed work along with the methodology. Section 4 examines the analysis of the results. Finally, Section 5 presents the conclusions of the paper and future work.

Literature review
In this section, we review the existing work and methods of different researchers. The fundamental research issue surrounding the cloud is ensuring the order of clients' data in the cloud. Big data storage providers store customers' various data, and this should be verified. Distributed computing has steadily advanced in information technology and will continue to shape I.T. organizations in the coming years.
The cloud also faces significant difficulties. Appropriate physical, logical, and personnel security controls must be in place, especially when collecting cloud data [8]. Furthermore, when such massive amounts of data are moved, the organization of the data may not be reliable. This section describes the research related to the problem of ensuring data security in cloud storage. It also briefly summarizes the research conducted by several scientists in the field of disease prognosis. According to [9], the primary goal of the data security model is to detect attacks in the rail transportation sector. An expert attack detection system known as the BAS was developed to detect assaults and reduce their impact on the subway's environment control subsystem. The expert system, with its inference engine and knowledge base, enables the detection of unauthorized operations and attacks. Blacklist and allow-list rules are included, which can be used to prevent unauthorized attacks. These rules provided extensive protection for the data security of the subway's environment control system. This method protects the data of several subway subsystems. The technology is still being tested due to a number of limitations. However, IDSs can be deployed in urban areas using big data principles.
The authors of [10] discussed internal IDs and IDS models. Real-time forensic algorithms and data mining techniques are used in these models. Data mining techniques were demonstrated to aid in cyber investigation and attack detection. Several analyses from different researchers were used to provide a variety of methods for detecting assaults, which were reported in this paper. The evaluation of this work was beneficial in reaching a satisfactory conclusion. The proposed method improved the precision and increased the number of new discoveries by up to 95%. Existing methods, by contrast, have a 90% accuracy and discovery rate. Based on these findings, it was obvious that the proposed method outperformed previous algorithms in terms of precision and intrusion detection.
The study in [11] investigated an intrusion detection algorithm capable of classifying a large percentage of potential attacks as true or false without the need for operator input [12]. The proposed algorithm was developed using immunology-inspired stimulation rules and a negative selection algorithm. To achieve this, a co-stimulation system and a two-tier negative selection technique were employed. The primary objective of the system was to minimize detection errors while reducing the need for human intervention. With the proposed MNSA algorithm, the study was able to detect around 34% of all attacks without the need for non-self information. Moreover, the algorithm confirmed over 90% of the detections without requiring additional data or an operator. This implies that the proposed algorithm has the potential to significantly reduce the workload of network administrators and enhance the efficiency of network security.
However, there are still some limitations to the proposed algorithm that need to be addressed. For instance, the algorithm's accuracy and performance might be affected by variation in the types of network traffic. Therefore, further research is required to validate the effectiveness of the algorithm in different network environments. Additionally, the algorithm's ability to detect unknown or zero-day attacks needs to be evaluated to determine its overall reliability in real-world scenarios.
With the aim of preventing lung cancer, the author of [13] proposed a new clustering technique called the Foggy K-means algorithm, offering a strategy with strong analytical ability. The study compared the proposed method with the traditional K-means algorithm. The comparison showed that the proposed method better satisfied the cluster authenticity criteria [14]. Field experts could use these findings to create more robust clusters for prediction. The harmful effects of smoking, tuberculosis, radiation produced by various industries, and radioactive materials may all be linked to various illnesses. The results of the proposed clustering technique could be used to categorize lung cancer patients in future research, and the method can identify the factors that have a significant impact on lung cancer.
According to the paper [15], various prediction instruments were used for clustering. This study proposed a novel modified approach for climate prediction called K-mean clustering generic methodology. The goal of this project was to measure the level of pollution in the air. A dataset from the state of West Bengal was used for this purpose. Using the peak mean values of the clusters, a climate group catalogue was created. The K-Means clustering algorithm was used in the air pollution data suite. Climate groups were described using various clusters. The term "modified K" denotes that the algorithm validated the new data and classified it into accessible clusters. The proposed method predicts information on upcoming climate conditions. West Bengal state weather forecasting data was included in the data set. The effects of air pollution could be mitigated with the help of this data set. The modelled estimates accurately predicted climate conditions. Finally, the authors conducted various tests to validate the proposed algorithm's accuracy.
The author of [16] describes the Student Achievement Analysis System (SPAS), which tracks students' academic performance at a specific institution. The proposed approach included a forecasting model that could predict the performance of students in a specific course. This in turn helped instructors identify students who were likely to perform poorly. A data mining technique known as classification was used to forecast student performance by applying data mining rules and classifying students based on the grades they received.
According to the authors of the paper [17], predicting share profit is an important topic in data analysis and prediction. It was assumed that the historical primary data had some analytical relationship with future share profits. The information retrieved from the past worth of these shares was used to decide the selling and purchasing of shares in this work. As a result, those who invested in the stock market benefited from this strategy. A classification model known as a decision tree was used in this work.
To address prediction analysis problems, the researcher used the k-means algorithm and presented results based on accuracy [18]. Both natural and synthetic datasets were used for this purpose. K-means is a clustering method whose primary goal is to divide n patterns into k clusters, with every pattern assigned to the cluster with the nearest mean. The number of clusters, k, was fixed in advance, and each cluster center was given a random initial value. The technique was used to categorize a collection of items into k groups based on their characteristics. To group objects, the sum of the squared Euclidean distances between each object and its corresponding cluster centroid was minimized. According to the tests, clustering produced effective results with high accuracy and robustness.
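The k-means procedure summarized above can be sketched in a few lines. This is a minimal illustration of the standard (Lloyd) iteration, not the implementation evaluated in [18]; the function names `euclidean` and `kmeans`, the iteration cap, and the fixed random seed are assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=100, seed=0):
    """Partition points into k clusters (hypothetical helper, standard Lloyd iteration)."""
    rng = random.Random(seed)
    # Random start values: pick k distinct data points as initial centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assign every pattern to the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids will not move
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Because the initial centroids are sampled from the data, the numeric cluster labels depend on the seed; only the grouping itself is meaningful.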
The author briefly explains the concept of clustering in this work. Clustering divided the data into clusters of similar entities [19]. Objects in each cluster were similar. These objects, however, were distinct from those in other clusters. K-means is a well-known clustering algorithm. This algorithm was widely used in data clustering. However, this algorithm is computationally expensive. The choice of initial centroids had a significant impact on the quality of the final results. This paper proposed a novel approach for improving the algorithm's competence and productivity. The technique presented here reduces the difficulty and time required for mathematical computation. Furthermore, the proposed technique preserved the ease of use of the k-means algorithm. The proposed solution also addresses the issue of the dead unit.
The article [20] discusses research on methods for classifying and predicting non-linear datasets. It states that, compared with other approaches used for prediction and classification, the neural network approach is generally regarded as the best classification method. The backpropagation (BP) algorithm is the most effective training method for an artificial neural network classifier because it updates the weights by propagating errors backward through the network. However, this method is constrained by local minima. The study addresses the problem with a modified technique that improves accuracy and can be used in a variety of future prediction applications.
The study's authors proposed classification methods for risk prediction, pattern recognition, and data mining in clinical cardiovascular medicine [21]. The data has been modelled and classified using a data mining technique known as categorization. Unfortunately, conventional medical scoring methods can only be used up to a point due to the linear combination of elements in the input set. As a result, non-linear complex interaction modelling is not used in medicine. Classification methods are used to overcome this limitation because complex nonlinear correlations between dependent and independent variables can be discovered. Furthermore, it can identify any and all possible links between various prognostic indicators.
The author of [22] proposes two methods for selecting features from the dataset: SVM-RFE and gain ratio. The healthcare industry has a wealth of data that must be mined for hidden patterns, and data mining techniques are required in this field for optimal decision-making. The features selected by the proposed method can be used with the Random Forest and Naive Bayes algorithms, and the obtained results can be used to improve performance. The method assigns each feature a specific importance rating. Experimental results confirmed that the proposed method achieves the highest precision with the least computing effort.
The author of [23] discussed the dual issues of privacy and security in a big data-enabled cloud environment. The three aspects of big data management discussed in that study are outsourcing from data owners, sharing with data consumers, and cloud-based management. The authors advocated the SHA3 hashing technique, which generates a hash of user information and stores it in the Trust Center, as a means of securely authenticating Data Owners and Data Users. The data owners securely transmit data to the cloud server. Compressing the data with the LZMA method makes big data-enabled cloud storage more efficient. Finally, Salsa20 encryption with MapReduce was used to accelerate the encryption and decryption processes. After encryption, the data is uploaded to a remote server.
While cloud computing is relatively mature [24] and its potential benefits are well understood by individual, industry, and government consumers, a number of security and privacy concerns remain. Unsurprisingly, designing cryptographic solutions to ensure the security of cloud services and the privacy of data outsourced to the cloud remains an ongoing research area. The cited paper provides a critique of the wide range of cryptographic schemes designed for securing sensitive data in the cloud computing environment, and outlines the research opportunities in the use of cryptographic techniques in cloud computing.
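The authentication and compression steps attributed to [23] can be illustrated with Python's standard library. This is only a rough sketch under stated assumptions: the `trust_center` dictionary, the function names, and the use of SHA3-256 specifically are hypothetical choices for the example, and the Salsa20/MapReduce encryption stage is omitted because it is not available in the standard library.

```python
import hashlib
import hmac
import lzma

# Hypothetical in-memory stand-in for the Trust Center's hash store.
trust_center = {}


def register_user(user_id, credential):
    """Store a SHA3-256 hash of the user's credential (illustrative only:
    a real deployment would add salting and a dedicated password hash)."""
    trust_center[user_id] = hashlib.sha3_256(credential.encode("utf-8")).hexdigest()


def authenticate(user_id, credential):
    """Return True if the credential hashes to the value held for user_id."""
    stored = trust_center.get(user_id)
    if stored is None:
        return False
    candidate = hashlib.sha3_256(credential.encode("utf-8")).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(stored, candidate)


def compress_for_upload(payload):
    """Compress a bytes payload with LZMA before it is sent to cloud storage."""
    return lzma.compress(payload)


register_user("owner-1", "secret")
data = b"sensor reading " * 100
packed = compress_for_upload(data)
print(authenticate("owner-1", "secret"), len(data), len(packed))
```

Repetitive sensor-style payloads such as the one above compress well, which is the efficiency gain the paragraph refers to.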
Cloud storage systems are increasingly turning to NoSQL [24] database management systems (DBMS) due to their superior availability and performance compared to traditional DBMSs. However, some NoSQL DBMSs sacrifice consistency guarantees for performance gains by using eventual consistency, where an operation is confirmed without checking all nodes. Different consistency levels can be adopted, affecting system behavior. Therefore, it's crucial to assess system design considering distinct consistency levels to develop cloud storage systems. This study proposes an approach using reliability block diagrams and generalized stochastic Petri nets to evaluate availability and performance of cloud storage systems with redundant nodes and eventual consistency based on NoSQL DBMS. The experiment shows that system configuration can cause unavailability from 1 s to 21 h in a year, and performance can decrease by up to 17.9%.
The paper in [25] investigates the problem of efficient data integrity auditing that supports provable data update in a cloud computing environment. It introduces an efficient outsourced data integrity auditing scheme based on the Merkle sum hash tree (MSHT). The scheme can meet the requirements of provable data update and data confidentiality without depending on a third-party authority.
The paper in [26] introduces a threefold methodology to improve the trade-off between I/O performance and capacity utilization of cloud storage for CDS services. This methodology includes: i) definition of a classification model for identifying types of users and contents by analyzing their consumption/demand and sharing patterns, ii) usage of the classification model for defining content availability and load balancing schemes, and iii) integration of a dynamic availability scheme into a cloud-based CDS system.
This paper [27] presents a comparative and systematic study of leading techniques for secure sharing and protecting the data in the cloud environment. It discusses the functioning, potential, and achievements of each solution and provides a comparative analysis. The applicability of the techniques is discussed as per the requirements and the research gaps along with future directions are reported in the field.
This paper [28] discusses a new generation cloud storage system that integrates distributed storage technology. It is designed to support all kinds of OLTP or OLAP business applications and to solve the problems of data security and smooth storage expansion.
In [29], the identity of the Data User making a request for data must be confirmed by the Trust Center before the request can be fulfilled. To read the specified data file, the secret keystream is applied. Two methods, clustering with DBSCAN and indexing with a Fractal Index Tree, were examined for big data management in the cloud. The proposed SADS-Cloud technique was developed for an e-healthcare application, evaluated, and compared to other approaches based on a number of parameters, including information loss, compression ratio, throughput, encryption time, decryption time, and efficiency.
The lack of consideration for the influence of the suggested approach on energy consumption and environmental sustainability is a limitation of the literature review. While using cloud computing for data storage and processing has advantages such as scalability and simplicity of management, it also consumes a lot of energy and has a detrimental influence on the environment. As a result, future research might concentrate on creating and assessing strategies that combine the advantages of cloud computing with energy efficiency and ecological concerns. Another possible research gap is the requirement for a more thorough evaluation and testing of the suggested approach on real-world smart city datasets. While the research exhibits promising findings on a simulated dataset, the suggested technique's performance may vary in different smart city scenarios with variable data qualities and volume.
As a result, future research may include testing and evaluating the suggested approach on a variety of real-world smart city datasets to assess its efficacy and applicability in various scenarios. The study focuses mostly on the technological components of the suggested method, with less emphasis placed on the social and ethical consequences of using cloud computing and data categorization in smart cities. Future studies might look at the social and ethical implications of using such approaches in smart city settings, such as privacy, data ownership, and responsibility. A comparative study of existing work is shown in Table 1.

Proposed work
Prediction analysis is the process of forecasting potential future outcomes based on present data. It consists of two parts: clustering and classification. In this research, the cluster head is constructed using the k-means clustering technique, and its output is used as the classification input by the SVM classifier.
The intrusion detection system in this study uses a KNN model and an SVM model to carry out its operations. The system is governed by several parameters, including a value that represents the number of nodes lying close together and a cutoff value, which is the criterion used to rank the outliers among the nodes. The following terms are defined to help clarify the procedure: (3) the feature vector of a node is composed of its network attributes, and S is the collection of all nodes in the network, whether abnormal or normal; (4) the distance between two distinct nodes is their Euclidean distance, denoted by eudis; (5) the distance function of a node is the value obtained by adding the Euclidean distances to all of its neighboring nodes.
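As a rough illustration of the definitions above, the distance function of a node can be computed by summing the Euclidean distances to its nearest neighbors and comparing the result with the cutoff. This is a sketch under assumptions, not the authors' code: the neighbor count `k`, the `cutoff` threshold, the rule "flag a node when its distance function exceeds the cutoff", and the function names other than `eudis` are all hypothetical.

```python
import math


def eudis(a, b):
    """Euclidean distance between the feature vectors of two nodes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_function(node, others, k):
    """Sum of the Euclidean distances from node to its k nearest neighbors."""
    nearest = sorted(eudis(node, other) for other in others)[:k]
    return sum(nearest)


def flag_abnormal(nodes, k, cutoff):
    """Return indices of nodes whose distance function exceeds the cutoff
    (assumed decision rule; the source does not spell it out)."""
    flagged = []
    for i, node in enumerate(nodes):
        others = nodes[:i] + nodes[i + 1:]
        if distance_function(node, others, k) > cutoff:
            flagged.append(i)
    return flagged


# Four nodes with similar feature vectors and one that lies far from the rest.
nodes = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9)]
abnormal = flag_abnormal(nodes, k=2, cutoff=5.0)
print(abnormal)
```

In this toy example the clustered nodes each have a distance function of 2, while the isolated node's value is above 20, so only the last node is flagged.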

Research methodology
The prediction analysis is carried out in this study. Based on the existing dataset, the prediction analysis can forecast future opportunities. The first section of this research looks at how the KNN classification approach can be used to solve the problem of intrusion detection in wireless sensor networks. An intrusion detection system based on the KNN algorithm is evaluated for parameter selection and error rate in order to distinguish abnormal nodes from normal ones. We decided to put the intrusion detection system through its paces to see how effective it was. The terminal device's physical foundation is made up of both wireless sensor nodes and a wired network card. The wireless sensor nodes used to monitor network activity and propagate blacklists are manufactured by Ningbo Zhongke Integrated Circuit Co., Ltd. Terminal hardware allows for the detection of anomalies in control systems, network traffic, node anomaly evaluation, and attack resolution. The software stack includes TinyOS, an embedded operating system, and the AVRStudio IDE. A serial communication aid is used to exchange control data messages. The intrusion detection system is a complex system that consists of various components, including a wireless network interface module (WAN IM), a data storage module, an analysis and judgment module, and an intrusion reaction module. The WAN IM is installed on the wireless sensor nodes to collect raw data, which is then stored in the data domain by the data storage module. The data is then used in the evaluation and analysis phase. The analysis and judgment module reads the test settings and data from the data storage module to analyze and draw conclusions based on the data. This module also updates the intrusion response module. The intrusion response module plays a critical role in notifying the wireless network interface component of the malicious nodes that need to be blocked. 
Once a blacklist containing the abnormal nodes has been broadcast throughout the network, normal nodes will stop receiving and relaying RREQ signals from the abnormal nodes. This is because any unusual node will be prohibited from further communication. At the same time, the blacklist will be distributed to other nodes to assist in responding to a flooding attack. To improve the accuracy of the system, we trained the model for a total of one thousand cycles. The training process allows the system to learn from the data and improve its performance over time. This intrusion detection system is a crucial tool for maintaining the security of wireless sensor networks and ensuring the safe and efficient operation of smart city applications.
k-means is a clustering algorithm that groups data points based on their similarity. The algorithm takes the dataset as input and computes its arithmetic mean, which represents the central point of the dataset. The Euclidean distance of each point from this center is then calculated [31], and points are divided into distinct groups according to whether they are similar or dissimilar. In this study, the Euclidean distance is measured dynamically, which improves the clustering accuracy. To measure the Euclidean distance dynamically, this study employs a technique known as backpropagation. This method assigns previously unclustered points to clusters and improves the clustering accuracy.

Pre-processing
In this step, the data is provided as input and cleaned: missing values are identified, redundant values are removed, and summary statistics such as the mean and standard deviation are calculated.
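The pre-processing step can be sketched as follows. The source does not say how missing values are handled, so this example assumes that rows containing a missing value (represented here as `None`) are dropped; the function name `preprocess` and the use of the population standard deviation are likewise assumptions.

```python
import statistics


def preprocess(rows):
    """Drop rows with missing values, remove duplicate rows, and compute
    the per-column mean and (population) standard deviation."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row):
            continue  # assumed policy: discard incomplete records
        key = tuple(row)
        if key in seen:
            continue  # redundant record
        seen.add(key)
        cleaned.append(key)
    columns = list(zip(*cleaned))
    means = [statistics.mean(column) for column in columns]
    stdevs = [statistics.pstdev(column) for column in columns]
    return cleaned, means, stdevs


rows = [(1.0, 2.0), (1.0, 2.0), (3.0, None), (3.0, 6.0)]
cleaned, means, stdevs = preprocess(rows)
print(cleaned, means, stdevs)
```

Imputing missing values with the column mean would be an equally plausible reading of the step; dropping rows is simply the shortest option to show.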

Phase of prediction
In this step, the input dataset is divided into training and testing sets. As shown in Fig. 2, the dataset is divided into two parts: the first portion (70%) is used as the training set, and the remaining 30% is used as the testing set.
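A 70/30 split of this kind can be sketched as below. Shuffling before the split and the fixed seed are assumptions made for reproducibility of the example; the source does not state whether the records are shuffled, and the function name is hypothetical.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(10))
train_set, test_set = train_test_split(records)
print(len(train_set), len(test_set))
```

With ten records this yields seven training records and three testing records, and every record appears in exactly one of the two sets.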
The prediction analysis is performed using the KNN classification model. This classifier accepts training and testing data as input and produces predicted data as output. The K-Nearest Neighbor (KNN) algorithm is a simple method. KNN is a non-parametric supervised learning approach, since it makes no assumptions about the underlying data distribution. The technique categorizes patterns based on neighboring training patterns in the feature space. During training, the feature vectors are stored together with the labels of the training patterns. During classification, the unlabeled query point is assigned a label based on the labels of its k nearest neighbors, chosen by majority vote. When k = 1, the object is simply assigned the class of its nearest neighbor. When there are only two classes, k is chosen to be an odd integer to avoid ties. In multiclass classification, a tie can occur even if k is an odd integer [32]. Table 2 contains all the mathematical notation used in this paper. This classifier's primary goal is to categorize patterns according to the majority class of their closest neighbors.
In the majority-voting rule, which in its standard form reads $\hat{y} = \arg\max_{v} \sum_{i=1}^{k} I(v = y_i)$, the variable $v$ represents a class label [33] and $y_i$ is the class label of the $i$-th nearest neighbor. $I(\cdot)$ is the indicator function, which returns 1 if its argument is true and 0 otherwise. As a result, each pattern is allocated to the majority class among its k closest neighbors. The key elements of this approach are a collection of labeled objects, a distance or similarity measure that calculates the separation between objects, and the value of k, the number of nearest neighbors. The identification task succeeds when a suitable similarity function and a suitable value of k are chosen.
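The majority-vote rule can be made concrete with a short sketch of plain KNN using Euclidean distance. This is not the modified-distance variant proposed in the paper; the function names and the tie-breaking behavior are assumptions for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to `query`."""
    # Sort the (features, label) pairs by distance to the query; keep the k nearest.
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tie, most_common returns the label that was counted first,
    # i.e. the tied label whose first occurrence is nearest to the query.
    return votes.most_common(1)[0][0]
```

With two well-separated groups of points labeled "normal" and "attack", a query near the first group receives the label "normal" because all of its nearest neighbors carry that label.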

Previous algorithm
As observed in the IDS packet transformation, classification problems are solved using supervised machine learning algorithms. The SVM transforms the data using a method known as the kernel trick and, based on these transformations, determines the optimal boundary between the possible classes.
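The kernel trick mentioned here can be checked numerically on a small example: a kernel evaluates an inner product in a transformed feature space without constructing that space explicitly. The sketch below uses the homogeneous degree-2 polynomial kernel on two-dimensional inputs; the choice of kernel is an assumption for illustration, since the paper does not state which kernel was used.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel on 2-D inputs: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi(x):
    """Explicit feature map whose inner product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
# Both routes give the same value (up to rounding error); the kernel never builds phi.
print(poly2_kernel(x, y), dot(phi(x), phi(y)))
```

The kernel works in the original two dimensions, while the explicit map needs three; for higher-degree or radial kernels the gap grows, which is why the kernel form is preferred.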

Proposed algorithm
The k-means clustering process leaves some points unclustered, which reduces accuracy. The whole dataset, including all instances, is used as input to k-means clustering, which divides it into groups of similar instances. The results of the k-means clustering technique are then used as input for the SVM classifier, which categorizes data based on hyperplanes [34]. In this research, the k-means technique is enhanced for clustering, and the classification process uses the clustering result as input, which improves the accuracy of the prediction analysis.
Determine the Euclidean distance between the remaining data items and the initial cluster centers $U_i$. Here, $L(\beta)$ is the probability density function of the Lévy distribution.
Determine the judgment value for the odor concentration at each location. The text input is processed by the language model and converted to a vector, where cosine similarity $G$ is a popular similarity metric ($J$).
Cluster analysis is the most extensively researched and widely used technique for unsupervised learning problems [35]. The clustering approach separates a data set into multiple subsets called "class clusters," each with its own clustering center, according to how similar each sample is to every other sample in the data set. Throughout the clustering method, only the cluster structure is formed automatically. The nodes in a network are organized into clusters, with one node serving as each cluster's head. Data from the cluster's other nodes is delivered to the cluster head, which aggregates the information, applies signal processing to it, and delivers it to the distant base station. Operating as a cluster head therefore requires many more resources than serving in another role. Consequently, if the node functioning as the cluster head dies, all nodes in the cluster lose their ability to communicate with one another.
One issue in this paper is that the authors have chosen two types of malware that make significant modifications to the guest operating system, which makes them relatively easy to detect. The author believes that these types of malware would also be easily detected by signatures.
However, the reviewer suggests that, to prove the effectiveness of the proposed approach, the authors should have chosen more covert malware, such as kernel rootkits, which are known for being particularly difficult to detect. By choosing such malware, the authors could demonstrate whether their detector can detect subtle deviations and is thus more effective against sophisticated attacks.
In this approach, the arithmetic mean of the complete data set is taken to measure the center points. Points with similar values are grouped in one cluster, while the others are grouped in a different cluster. Consider the problem of clustering a set of $n$ objects $I = \{1, \ldots, n\}$ into $K$ clusters. For each object $i \in I$, we have a set of $m$ features $\{x_{ij} : j \in J\}$, where $x_{ij}$ describes the $j$th feature of object $i$ quantitatively. Let $x_i = (x_{i1}, \ldots, x_{im})^T$ be the feature vector of object $i$ and $X = (x_1, \ldots, x_n)$ be the feature matrix or data set.

Algorithm 2. Cluster the Node
The clustering task may be restated as an optimization problem that minimizes the following clustering objective function:

$$\min_{U, V} \; \sum_{i \in I} \sum_{k=1}^{K} u_{ik} \, \lVert x_i - v_k \rVert_p^p$$

under the following constraints:

$$\sum_{k=1}^{K} u_{ik} = 1 \quad (i \in I), \qquad u_{ik} \in \{0, 1\} \quad (i \in I,\; k = 1, \ldots, K),$$

where $p = 1, 2$. For $k = 1, \ldots, K$, $v_k \in \mathbb{R}^m$ is the $k$th cluster prototype, and for every $i \in I$, $u_{ik}$ indicates whether object $i$ is a member of the $k$th cluster. For $p = 1$ and $p = 2$, the clustering problem can be solved effectively using the K-median and K-means methods, respectively. Let the cluster prototype matrix be $V = [v_1, \ldots, v_K] \in \mathbb{R}^{m \times K}$ and the membership matrix be $U = [u_1, \ldots, u_n] \in \mathbb{R}^{K \times n}$, where $v_k = (v_{k1}, \ldots, v_{km})^T$ and $u_i = (u_{i1}, \ldots, u_{iK})^T$. Both algorithms solve the clustering problem iteratively as follows:
Step 1. Let $t = 0$ and initialize the cluster prototypes $\{v_k^t : k = 1, \ldots, K\}$.
Step 2. Let $t = t + 1$, and update the membership matrix $U^t$ by fixing the cluster prototype matrix $V^{t-1}$. For any $i \in I$, randomly select $k_t^* \in \arg\min\{\lVert x_i - v_k^{t-1} \rVert_p : k = 1, \ldots, K\}$, set $u_{ik_t^*}^t = 1$ and, for any $k \neq k_t^*$, set $u_{ik}^t = 0$.
Step 3. Update the cluster prototype matrix $V^t$ by fixing the membership matrix $U^t$. When $p = 1$, for any $k = 1, \ldots, K$ and $j \in J$, set $v_{kj}^t$ as the median of the $j$th feature values of the objects in cluster $k$. When $p = 2$, for any $k = 1, \ldots, K$, set $v_k^t$ as the centroid of the objects in cluster $k$; that is, $v_k^t = \frac{1}{\sum_{i \in I} u_{ik}^t} \sum_{i \in I} u_{ik}^t x_i$.
Step 4. If $u_{ik}^t = u_{ik}^{t-1}$ for all $i \in I$ and $k = 1, \ldots, K$, stop and return $U^t$ and $V^t$; otherwise, go to Step 2.
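The four steps of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the seeded random initialization, the `max_iter` safeguard, and breaking ties by lowest cluster index (instead of a random choice among the minimizers) are all assumptions.

```python
import random
from statistics import mean, median


def lp_cost(x, v, p):
    """Sum of |x_j - v_j|^p: Manhattan distance for p=1, squared Euclidean for p=2."""
    return sum(abs(a - b) ** p for a, b in zip(x, v))


def cluster(X, K, p=2, seed=0, max_iter=100):
    """K-median (p=1) or K-means (p=2) following Steps 1-4 of Algorithm 2."""
    rng = random.Random(seed)
    # Step 1: choose K distinct data points as the initial prototypes.
    V = [list(x) for x in rng.sample(X, K)]
    assign = None
    for _ in range(max_iter):  # max_iter is a safeguard not present in the text
        # Step 2: assign every object to its nearest prototype.
        new_assign = [
            min(range(K), key=lambda k: lp_cost(x, V[k], p)) for x in X
        ]
        # Step 4: stop when the memberships no longer change.
        if new_assign == assign:
            break
        assign = new_assign
        # Step 3: per-feature median (p=1) or centroid (p=2) of each cluster.
        center = median if p == 1 else mean
        for k in range(K):
            members = [x for x, a in zip(X, assign) if a == k]
            if members:  # an empty cluster keeps its previous prototype
                V[k] = [center(column) for column in zip(*members)]
    return assign, V
```

Calling `cluster(points, K=2)` on two well-separated groups of points returns one membership index per point together with the final prototypes.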

Store protocol
The data file is divided into blocks as $M = (m_1, m_2, \ldots, m_n)$, and every block contains $s$ sectors in the form $m_i = m_{i1} \,\|\, m_{i2} \,\|\, \cdots \,\|\, m_{is}$ $(1 \le i \le n)$, where each sector $m_{iz} \in \mathbb{Z}_q$ $(1 \le z \le s)$ and $\|$ denotes concatenation. The client first computes $h_i = H_2(m_i)$ $(1 \le i \le n)$ from the data blocks; on top of the ordered hash values, node $w_i$ stores the corresponding hash value $h_i$. Based on $g$ and the secret key $sk$, the client computes the value. $M$ and $T$ are then deleted from the local storage of the client's computer, and only metadata is maintained. Time-dependent pseudo-randomness generated by the Bitcoin blockchain is used to produce periodic challenges. Given the time $t$ as input, the hash value of the latest block that has arrived in the Bitcoin blockchain since time $t$ is obtained. Let $C = \{B.I., F, l\}$ denote the auditor's checking policy. A pseudo-random bit generator is invoked on the input $h^{(b)}$ to obtain random bits by selecting a key pair $f$. The auditor then generates a challenge and sends it to the cloud service provider (CSP) [30].
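The block-hashing and challenge-generation ideas can be sketched with the Python standard library. This is only a loose illustration under stated assumptions: SHA-256 stands in for $H_2$, an arbitrary byte string stands in for the Bitcoin block hash, and the function names are hypothetical. The sketch does not implement the authenticators, the proof, or the verification equations of the protocol.

```python
import hashlib


def split_blocks(data: bytes, block_size: int):
    """Split a file's bytes into consecutive blocks m_1, ..., m_n."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_hashes(blocks):
    """Hash every block; SHA-256 is a stand-in (assumption) for H_2."""
    return [hashlib.sha256(b).hexdigest() for b in blocks]


def derive_challenge(seed_hash: bytes, n_blocks: int, l: int):
    """Derive l distinct pseudo-random block indices from a public seed.

    `seed_hash` plays the role of the latest blockchain block hash, so anyone
    holding the same seed can recompute the same challenge indices.
    """
    l = min(l, n_blocks)  # cannot challenge more distinct blocks than exist
    indices, counter = [], 0
    while len(indices) < l:
        digest = hashlib.sha256(seed_hash + counter.to_bytes(4, "big")).digest()
        idx = int.from_bytes(digest, "big") % n_blocks
        if idx not in indices:
            indices.append(idx)
        counter += 1
    return indices
```

Because the indices depend only on the public seed, the client can later recompute them and check that the auditor challenged the blocks it was supposed to.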
On receiving the challenge $Q^{(b)}$, the CSP computes the indices and coefficients. The CSP then computes the proof of data that is used to check the integrity of the challenged blocks. On receiving the proof, the auditor verifies the correctness of $\rho^{(b)}$. It verifies the indices and coefficients and the value with $T$. Third, the auditor verifies the proof $\rho^{(b)}$ by checking the verification equation; if the equation holds, the challenged data blocks are intact. The auditor saves a log entry to document its auditing of the data. The client chooses a random subset $B$ of indices of Bitcoin blocks and transmits it to the auditor. The auditor then retrieves the values of $Q^{(b)}$, $h^{(b)}$, and $\rho^{(b)}$ from the log file.
The challenge index vector is denoted by $C = (i_1, i_2, \ldots, i_c)$. The auditor obtains the corresponding multi-proof $p$, generates the proof of the appointed logs, and sends it to the client together with the signature $\mathrm{Sig}_{sk_a}(\rho^{(B)})$.
The client verifies $\mathrm{Sig}_{sk_a}(\rho^{(B)})$ and uses $\rho^{(B)}$ and $\mathrm{hash}^{(b)}$ to obtain $Q^{(b)}$; the indices and coefficients $i_\eta, a_\eta$ $(1 \le \eta \le l)$ are then verified. The client verifies $h^{(B)}$ using Eq. 16.
The client verifies the secret key $sk$ and the verified $h^{(B)}$: Equation 17 verifies the client data and the node secret key by computing the hash value generated by Eq. 18.
Assuming the calculation above is correct, the client can be certain that the auditor performed an honest audit of the CSP for all previously challenged data blocks appointed by $B$. The correctness of the equation can be explained as follows:

User-defined parameter optimization
In the proposed algorithm, we use a linear classification method with parameters $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$, which are both unknown but constant. During the SVM training process, these classification parameters (also known as level-1 parameters) are calculated. Once these parameters have been fitted, the hypothesis function $h(x) = \operatorname{sgn}(a^{T}x + b)$ permits binary classification for any $x \in \mathbb{R}^n$.
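The linear decision rule can be written out directly. The sketch below assumes the standard linear form sgn(a·x + b) for the hypothesis function, with already-trained parameters; the parameter values in the usage note are hypothetical, and mapping an argument of exactly zero to +1 is a convention assumed here.

```python
def sgn(z):
    """Sign function; the value at z == 0 is a convention (assumed +1 here)."""
    return 1 if z >= 0 else -1


def hypothesis(a, b, x):
    """Linear SVM decision rule: sgn(a . x + b) for weight vector a and bias b."""
    return sgn(sum(ai * xi for ai, xi in zip(a, x)) + b)
```

For instance, with the hypothetical parameters `a = [1.0, -1.0]` and `b = -0.5`, the point `[2.0, 0.0]` is assigned to class +1 and the point `[0.0, 2.0]` to class -1.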

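Here, $\operatorname{sgn}(\cdot)$ denotes the sign function, which returns $+1$ for positive arguments and $-1$ for negative arguments.
When the dataset under examination is not linearly separable, a nonlinear function $\varphi: \mathbb{R}^n \to D$ is used to map the data to a space $D$, where $d \in \mathbb{N}$ is the number of dimensions of $D$. The hypothesis function is then applied to $\varphi(x)$ instead of $x$, and, with the variables $\xi$ accounting for training errors, the improved classification parameters are derived from the resulting optimization problem.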

Results of analysis
From a data mining standpoint, the work required extensive and meticulous testing. It is also vital to consider the planning and preliminary processing that went into the experiments. This section outlines the experimental setup used to demonstrate the results of a small-scale classification of the UCI data set using Python code.

UCI dataset
The University of California, Irvine (UCI) School of Information and Computer Science maintains a substantial collection of datasets that may be used in research projects [36]. The datasets are categorized according to the kind of machine learning problem. Datasets for classification, regression, and recommendation systems, as well as univariate and multivariate time-series datasets, are available. Many UCI datasets have already been cleaned and are ready for use. The dataset collected from different sources is given as input for classification, as shown in Table 3. Due to the presence of compromised servers, a few classes are generated.

Performance evaluation metrics
The results of the proposed research are evaluated using the following metrics: precision, sensitivity, specificity, and accuracy.
The accuracy of a recognition system is measured as the proportion of correctly identified instances out of the total classified data.
The True Positive Rate (TPR) measures the proportion of positive samples that are correctly classified. The False Positive Rate (FPR) measures how often negative samples are incorrectly classified as positive. The F-score is an accuracy statistic that combines the precision and recall of a test into a single number:

$$F\text{-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (28)$$

It is used to assess binary classification systems, which assign examples to one of two classes.

SVM classifier implementation
The data are divided into several classes using the SVM classification model, as shown in Fig. 3 and Table 2. The classes are identified in the presence of a compromised server. This approach provides an accuracy of 84%. Table 4 shows the performance evaluation of a machine learning model used to classify different types of cyber threats in a cloud storage environment. The evaluation metrics used in this table are precision, recall, F1-score, and support. The model achieved perfect precision, recall, and F1-score for the "Compromised server" class, which means that the model correctly identified all instances belonging to this class without any false positives or false negatives. However, the "failed attack exploit" and "spambot malicious download" classes were not detected by the model at all, resulting in a precision, recall, and F1-score of 0, as shown in Fig. 4.
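The evaluation metrics used in this section can be computed from the four confusion-matrix counts of a single class. The sketch below uses the standard definitions; the function name, the example counts, and the convention of returning 0 when a ratio is undefined are assumptions for illustration and are not taken from the paper's tables.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""

    def ratio(num, den):
        # An undefined ratio (zero denominator) is reported as 0.0 by convention.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # true positive rate / sensitivity
    specificity = ratio(tn, tn + fp)     # true negative rate
    fpr = ratio(fp, fp + tn)             # false positive rate
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f_score = ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "fpr": fpr,
        "accuracy": accuracy,
        "f_score": f_score,
    }
```

A class that is never predicted and never correctly detected (`tp = 0`, `fp = 0`) yields a precision, recall, and F-score of 0 under this convention, which matches the behavior reported for the undetected classes.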

Proposed classifier implementation
The data were classified into different groups using the suggested KNN classification [37] with a modified distance measure, as illustrated in Fig. 5. Hyperparameters are parameters that are not learned from the training data and must be set before training the model. In SVM and KNN models, adjusting hyperparameters such as the regularization parameter (C) and kernel type for SVM, and the number of neighbors (k) for KNN, can enhance model performance. Tuning hyperparameters is a crucial step in building accurate and robust machine learning models. Grid search, random search, and Bayesian optimization are some of the methods used for optimizing hyperparameters. These methods test different combinations of hyperparameters and evaluate model performance using cross-validation. The specific hyperparameters used for the SVM and KNN models, as well as any optimization methods employed, depend on the dataset, project objectives, and available computational resources in the cloud-assisted categorization strategy for secure data storage preservation in smart cities. Hyperparameter tuning can significantly improve model performance and should be considered a critical stage in model development.
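Grid search combined with cross-validation can be sketched for the number of neighbors k in KNN. This is a minimal standard-library illustration, not the tuning procedure used in the paper: the candidate values of k, the number of folds, the interleaved fold assignment, and the function names are all assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training samples nearest to `query`.

    `train` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(samples, k, folds=5):
    """Mean accuracy of KNN over `folds` interleaved cross-validation folds."""
    scores = []
    for f in range(folds):
        # Fold f holds every `folds`-th sample, starting at index f.
        test = samples[f::folds]
        train = [s for i, s in enumerate(samples) if i % folds != f]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / folds


def grid_search_k(samples, candidates=(1, 3, 5, 7), folds=5):
    """Evaluate every candidate k and return the best one with all scores."""
    results = {k: cross_val_accuracy(samples, k, folds) for k in candidates}
    best_k = max(results, key=results.get)  # first candidate with the top score
    return best_k, results
```

The same loop structure extends to SVM hyperparameters such as C and the kernel type by replacing `knn_predict` with the corresponding classifier and iterating over combinations of values.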
The classes are identified in the presence of a compromised server. This approach provides an accuracy of 84%.
We performed an ANOVA test on the security parameters of cloud computing in smart cities; the results are shown in Table 5.

Result output of the proposed classifier
The accuracy score of the KNN classifier is the number of correctly predicted instances (true positives + true negatives) divided by the total number of instances in the dataset. The accuracy of the KNN model is 84.0% and that of the SVM is 90.35%. In this paper, we used cross-validation during the model training phase to estimate the generalization error and evaluate the model's performance. Cross-validation involves partitioning the data set into multiple folds and iteratively training and evaluating the model on different folds. The final performance metric is computed as the average of the performance measures across all the folds.

Comparative analysis
Figure 6 and Table 6 present a comparative examination of the capabilities of the SVM and the KNN. The comparison graph demonstrates that the accuracy achieved by the KNN classifier is superior to that achieved by the SVM classifier [38]. Figure 7 compares the execution times of the proposed and existing algorithms. The comparison graph demonstrates that the KNN strategy yields better outcomes than the SVM approach in terms of execution time [39]. Figure 8 presents a comparison between the SVM and the KNN in terms of precision. The comparison graph demonstrates that the precision achieved by the KNN classifier is superior to that achieved by the SVM classifier [40].
A comparative analysis of the performance of the SVM and the KNN is shown in Fig. 9. P. Su et al. [34] and the author of [41] used the number of abnormal and normal nodes identified during data transmission. Thakare et al. [42], C. H. Wang [43], and T. Wang [44] used the behavior of a cluster of received data during transmission. In the proposed work, we considered the number of abnormal and normal nodes, the cluster behavior during transmission, and the feature matrix of the transmitted data. As the comparison in Table 7 shows, the recall of the KNN classifier is better than that of the SVM classifier. Table 5 and Fig. 10 show a comparative analysis, in terms of accuracy, of existing work by different authors and our proposed work [46,47]. The proposed work outperforms the existing work [48,49].

Conclusion and future work
Machine learning is a powerful method for extracting useful information from a raw dataset. To cluster similar and dissimilar data, the similarity of the input dataset is assessed. In this process, the SVM method is used to classify both similar and dissimilar data types, and the arithmetic mean of the dataset is calculated to determine the center point. The Euclidean distance is then used to compare the similarity of two data points. Finally, an SVM classifier is employed to classify the clustered data based on the input dataset. This study focuses on the use of the KNN algorithm to predict cardiac disease, where the clustered results are used as input for the classification process. Compared to the current method, the improved technique has higher classification accuracy and shorter execution time. However, the proposed algorithm can be further improved by integrating a hybrid classifier for prediction analysis.
The results of the proposed algorithm were evaluated by comparing it with other existing approaches. However, the study's emphasis on security and privacy limits its ability to address human-centered aspects that could impede the widespread adoption of smart cities. To enhance public confidence, further research is necessary to visualize the daily experiences of residents living in smart cities and quantify the various interactions and operational difficulties they face. It is important to note that only technological means were considered in this analysis; the legal and institutional frameworks of a city are equally crucial components that need to be taken into account.
A limitation of the proposed method is that the algorithm's performance may be affected by the specific dataset used, and its generalizability to other datasets is uncertain.

Future work
The use of the law to address trust issues in smart cities is an important topic for future study. Further, smart city projects will benefit immensely from more research aimed at resolving the highlighted obstacles of smart cities (trust challenges, operational and transition challenges, technology challenges, and sustainability challenges). In future work, we will explore the use of ensemble techniques and compare their performance to the single models used in this study. By using ensemble techniques, researchers could potentially improve the accuracy and reliability of the cloud-assisted categorization strategy and enable more effective data management in smart cities.