FLM-ICR: a federated learning model for classification of internet of vehicle terminals using connection records

With the rapid growth of Internet of Vehicles (IoV) technology, the performance and privacy of IoV terminals (IoVT) have become increasingly important. This paper proposes a federated learning model for IoVT classification using connection records (FLM-ICR) to address privacy concerns and poor computational performance in analyzing users' private data in the IoV. FLM-ICR, built on the horizontally federated learning client-server architecture, utilizes an improved multi-layer perceptron and logistic regression network as the model backbone, employs the federated momentum gradient algorithm as the local model training optimizer, and uses the federated Gaussian differential privacy algorithm to protect the security of the computation process. The experiment evaluates the model's classification performance using the confusion matrix, explores the impact of client collaboration on model performance, demonstrates the model's suitability for imbalanced data distribution, and confirms the effectiveness of federated learning for model training. FLM-ICR achieves accuracy, precision, recall, specificity, and F1 score of 0.795, 0.735, 0.835, 0.75, and 0.782, respectively, outperforming existing research methods. By balancing classification performance and privacy security, it is well suited to the computation and analysis of private data in the IoV.


Introduction
The Internet of Vehicles (IoV) is a network system that connects cars with other objects (such as mobile phones, computers, roads, traffic lights, and pedestrians) using wireless communication and information exchange technologies. At its core is a traffic information network control platform that extracts and utilizes the attributes and static and dynamic information of all vehicles through sensors on each car, enabling effective monitoring of vehicle status and providing comprehensive services based on different needs. IoV has been widely applied in distance assurance and real-time navigation, significantly improving traffic efficiency [1]. However, the development of IoV relies on the big data generated by users and their vehicles, which presents challenges in data collection, transmission, and analysis. Firstly, there is a lack of security, which may involve the risk of privacy breaches [2,3]; secondly, there is uneven resource allocation, which may lead to service unfairness [4]. To overcome these challenges, IoV needs to strengthen privacy protection measures to ensure the security of user data while also considering differences in data distribution and establishing a fair resource allocation mechanism. When mining sensitive data, it is necessary to extract usable features without revealing privacy and to use privacy-preserving machine learning (ML) algorithms to balance learning content and privacy security. For example, it is necessary to extract helpful information while protecting patient privacy in medical research. The way to address this issue is to extract general features without disclosing personal privacy, requiring privacy-preserving ML algorithms to balance learning objectives and privacy security [5]. In 2016, Google proposed a privacy-preserving learning framework called Federated Learning (FL), in which data providers keep their data locally, thus suppressing data privacy leakage at the source [6][7][8][9][10]. As a mainstream privacy computing method, FL is a shared ML algorithm with good learning performance. Additionally, FL uses differential privacy (DP) [11] to protect the privacy of the computing process, preventing privacy information leakage while utilizing a large amount of user data for model training.
In recent years, various research methods have emerged. In the study of data distributions, Nilsson et al. [12] conducted a benchmark study on the MNIST dataset, comparing the performance of three FL algorithms using both IID and non-IID data partitions against centralized methods. Li et al. [13] proposed a comprehensive data partition strategy to address non-IID data cases in FL and better understand non-IID data settings. In the study of privacy protection technologies, Abadi et al. [14] proposed the GDP-FL learning algorithm, which combines DP to train models, protect gradient information, and conduct refined privacy cost analysis within the DP framework. Mahawaga Arachchige et al. [15] introduced the LATENT local DP algorithm, providing privacy protection when interacting with untrusted ML services; it enables data owners to add a randomization layer before data leaves their devices. In the study of FL algorithms, Choudhury et al. [16] presented an FL framework where a global model is trained on distributed health data protected by a DP mechanism against potential privacy attacks. Yang et al. [17] proposed the PLU-FedOA algorithm, optimizing horizontal FL deep neural networks with individualized local DP. Lu et al. [18] proposed a new FL-based architecture comprising a hybrid blockchain composed of a permissioned blockchain and a locally directed acyclic graph, and suggested an asynchronous FL scheme. Yang et al. [19] proposed an efficient asynchronous FL algorithm and a dynamic hierarchical aggregation mechanism utilizing gradient sparsification and asynchronous aggregation techniques. In the study of IoV applications, Zhao et al. [20] designed an FL collaborative authentication protocol to prevent private data leakage and reduce data transmission delay for vehicle clients sharing data. Luo et al. [21] addressed the issue of private data leakage in smart cars within the IoV network by introducing a local DP algorithm and designing a data privacy protection scheme tailored to IoV characteristics. Bakopoulou et al. [22] applied FL to mobile packet classification, enabling collaboration among mobile devices to train a global model without sharing raw training data.
From current research, a series of problems remain to be solved: existing deep learning algorithms risk leakage when training on large amounts of private data, and even when their classification performance is good, they cannot account for data privacy; existing privacy-preserving FL algorithms provide low security and slow training, and cannot balance performance and security. Accordingly, this paper aims to build an FL classification model (FLM-ICR) that balances privacy protection and performance to analyze the connection records of Internet of Vehicles terminals (IoVT), verify terminal device functionality, and dynamically monitor users' normal usage. The combination of FL and ML methods in FLM-ICR brings unique advantages to IoVT applications, including protecting user privacy, improving model accuracy and performance, and providing real-time responsiveness and adaptability. This combination promotes the development and innovation of IoVT, providing users with a better driving experience and services. This paper's innovation lies in using skewed classes to divide the dataset into four clients, simulating the non-IID data distribution [23][24][25] found in practical application scenarios. Based on the client-server architecture of horizontal FL, the federated Gaussian differential privacy (federated GDP) algorithm is used on both the client and the server to doubly protect the security of FL training. The federated momentum gradient descent (MGD) algorithm [26] is used in local model training to speed up convergence, and an improved multi-layer perceptron (MLP) [27] and logistic regression (LR) [28] network is used as the model backbone to improve classification performance. These measures address the privacy leakage, low security protection, low efficiency, and poor classification performance found in current research. FLM-ICR securely analyzes shared data in IoVT application scenarios, providing a new direction for the research of privacy-preserving FL. The main contributions of this paper are as follows:
• Using the improved MLP and LR networks as the backbone of FLM-ICR enables better handling of classification problems. It is simple to implement, facilitating the integration of FL for update training.
• Adopting the federated MGD algorithm as the training optimizer accelerates convergence in the local model updates of FLM-ICR and avoids local optima, making it convenient to use and achieving efficient computation.
• Under the client-server architecture of horizontal FL, adopting the federated GDP algorithm safeguards the security of the FL calculation process. It can balance the classification performance and security of FLM-ICR.
The organization of the article is as follows: the first part is the introduction, which presents the background and significance of research in this field and the related work; the second part is the preliminary knowledge, which introduces the theory of FL and DP; the third part is the proposed methodology, which details the data collection module, federated learning control module, differential privacy training module, and classification prediction module in the FLM-ICR model framework; the fourth part is the simulation experiment, which covers the preliminary preparation, model evaluation, training results, and comparative experiments; the fifth part is the conclusion, which summarizes the paper and looks forward to future work.

Federated learning
FL, a distributed ML framework, enables data sharing and joint modeling while ensuring data privacy and legal compliance. It includes horizontal FL [29,30], vertical FL, and federated transfer learning. Horizontal FL involves more overlap in sample features and less overlap in sample sources across multiple datasets, whereas vertical FL involves less overlap in sample features and more overlap in sample sources. Federated transfer learning applies models learned in one field to another based on data, task, or model similarities. A schematic diagram of the FL classification is shown in Fig. 1.
Since the dataset used for model training in this paper consists of connection records of different users under the same type of IoVT, the samples conform to the characteristics of more feature overlap and less source overlap, and the setting belongs to the horizontal FL model. The main methods to protect privacy and security in FL are homomorphic encryption, secure multi-party computation, and DP. Considering communication overhead, accuracy, and the degree of privacy protection comprehensively, the DP method is selected to protect privacy in the FL calculation process.

Differential privacy
DP, as a widely used privacy protection algorithm, uses the technique of adding noise to distort sensitive data, ensuring that deleting or adding a single record in the dataset will not noticeably affect query results. DP protects data availability while significantly reducing the risk of privacy leakage. The original definition is: for two datasets D and D′ on the independent variable space X that differ in only one record, a random algorithm A(x), x ∈ X, satisfies relaxed (ε, δ)-DP on X if, for any output set S,

Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ

where Pr is the probability function, ε is the privacy budget with ε > 0, and δ is the disturbance with 0 < δ < 1. The smaller ε and δ are, the closer Pr[A(D) ∈ S] and Pr[A(D′) ∈ S] are, the smaller the perturbation difference, and the better the effect of DP. In that case, the difference between D and D′ cannot be inferred from the outputs of A(D) and A(D′), thus protecting the private information of the dataset.
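A random algorithm satisfying relaxed (ε, δ)-DP in this sense is commonly realized with a Gaussian mechanism: bound the sensitivity of the query (here, a gradient vector) by clipping it to L2 norm C, then add zero-mean Gaussian noise. A minimal pure-Python sketch, with C and σ chosen to match the settings used later in the paper:

```python
import math
import random

def gaussian_mechanism(gradient, C=0.5, sigma=0.5, rng=None):
    """Clip `gradient` to L2 norm at most C, then add N(0, (sigma*C)^2)
    noise to every coordinate. With a suitable sigma this yields relaxed
    (epsilon, delta)-DP; a smaller privacy budget requires larger noise."""
    rng = rng or random.Random(0)
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in gradient]
    return [g + rng.gauss(0.0, sigma * C) for g in clipped]
```

With sigma = 0 the mechanism reduces to plain clipping, which makes the norm bound easy to check.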
This paper adopts a noise-based DP algorithm, which can be divided into global DP and local DP. Because the model constructed in this paper is used in an FL scenario, the federated GDP method is adopted. The Gaussian mechanism [31][32][33] is designed as the random algorithm A(x) to protect the gradients during model training, and the privacy of the FL calculation process is protected by adding Gaussian noise to perturb the model. Let C be the gradient clipping boundary value and σ the standard deviation of the Gaussian distribution in DP; the privacy budget ε is negatively correlated with the noise. Define:

A(x) = f(x) + N(0, σ²C²)    (1)

where f(x) is the clipped gradient query and N(0, σ²C²) is zero-mean Gaussian noise.

Proposed methodology

FLM-ICR comprises a data collection module, a federated learning control module, a differential privacy training module, and a classification prediction module. This comprehensive framework is visually depicted in Fig. 2, clearly illustrating the interplay between these essential components.
The data collection module is designed to systematically gather a wide array of connection record data from IoVT by leveraging the diverse range of Internet-connected devices integrated into vehicles. These devices encompass in-vehicle communication systems and an assortment of sensors, enabling the collection of crucial vehicle information.

The differential privacy training module harnesses DP, an advanced data privacy protection technique, to safeguard the integrity and confidentiality of connection records and other critical vehicle information. This module incorporates functionalities such as a noise generation mechanism, which introduces controlled randomness to the data to prevent the extraction of sensitive details; noise addition, which deliberately injects noise to obscure individual data points; and calculation of the DP budget, ensuring that the level of privacy protection is carefully calibrated and maintained throughout the training process. By integrating these robust mechanisms, the module ensures that the privacy of sensitive information is upheld, thereby fortifying the security of the entire system.
The classification prediction module, using the MLP and LR networks as the model backbone, can accurately classify normal and abnormal IoVT based on their connection records after FL training. This advanced classification capability not only enhances the accuracy and efficiency of the classification process but also plays a crucial role in maintaining the security and privacy of personal data. By accurately distinguishing normal from abnormal IoVT based on their connection records, this module contributes to an elevated level of data security and integrity, ultimately fostering an enhanced driving experience and service for users.
The subsequent sections provide a detailed examination of the four modules within the model framework. This analysis elucidates each module's specific functions, interactions, and significance, offering a comprehensive understanding of their roles in the context of the FLM-ICR model.

Data collection module
The dataset studied in this paper comes from user connection records generated by the GT101 terminal equipment of an IoV company. An IoVT user connection record is the interconnection information established between the terminal and the big data platform when the user utilizes the terminal; it has become a vital indicator for assessing the functionality of the terminal device. These records contain extensive information, including vehicle travel, driving behavior, and vehicle health status, and the data within them is typically generated in real time, providing live updates on the vehicle's status and behavior. Regular maintenance, by checking the number of days the terminal equipment is normally connected, is a crucial link in IoV operations. The dataset used in the experiment consists of the 15-day user connection records of 1500 IoV terminals (GT101). To protect terminal device information and user privacy, only the terminal number and connection status are kept, forming the connection record dataset (101.csv). 101.csv has a total of 1,500 observations. The first 15 columns are used as independent variables (input x); the extracted features quantify the daily connection status, with days having no connection record encoded as 0 and days with a connection record encoded as 1. The last column is the category value (output y), with category values <=8 (abnormal connection) and >8 (normal connection). An excerpt of the dataset is shown in Fig. 3.
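As a concrete illustration of the label rule, a record of 15 daily 0/1 connection flags maps to a category as follows (a minimal sketch; the function name is ours, not from the dataset code):

```python
def label_connection(record):
    """Map a 15-day connection record (list of 0/1 daily flags) to its
    category: more than 8 connected days is a normal connection,
    otherwise the connection is abnormal."""
    assert len(record) == 15, "one flag per day over the 15-day window"
    return "normal" if sum(record) > 8 else "abnormal"
```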
The CSVDataset class serves as a loading class specifically designed for handling CSV datasets. It processes the data into a format suitable for the model and then randomly divides it into a training set and a test set at a ratio of 8:2, yielding 1200 records in the training set and 300 in the test set. Subsequently, the training dataset is partitioned among four clients. In the case of an IID data distribution, 300 records are allocated to each client, ensuring that all clients possess an identical number of training records with the same category proportions. Conversely, for non-IID data distributions, a skewed class distribution is implemented, so that each client receives a distinct proportion of data from each class while maintaining an equal total data allocation across all four clients. In scenarios involving unbalanced data distribution, the entire training sample is distributed among the four clients, with each client randomly receiving varying numbers and proportions of training data. It is important to note that the IID data distribution represents an idealized scenario with little practical significance, as it is tantamount to centralized learning. This paper conducts model training on non-IID data distributions to ensure the model's applicability to real-world scenarios.

③ Model aggregation: The FedAvg algorithm ensures that the updated global model parameters reflect the collective knowledge contributed by the individual clients while preserving data privacy and mitigating the impact of potentially noisy updates. The average value of the model parameters is computed to update the model on the server side, ensuring that the global model reflects the collaborative insights derived from the FL process. ④ Model broadcast: The server disseminates the newly aggregated model parameters to each client participating in the FL process. This broadcast mechanism ensures that all clients receive the updated global
model parameters, enabling them to synchronize their local models with the collective insights and refinements derived from the collaborative FL process. This synchronization plays a pivotal role in fostering a cohesive and updated understanding of the model across all participating clients, ultimately contributing to the continual improvement and convergence of the global model. ⑤ Local model update: Each client updates its model parameters and recalculates locally. In the FL framework, the system iterates through randomly selected clients, enabling them to download the parameters of the trainable model from the server. The current global model is passed to the client, which updates its local model based on locally available data. The client then performs local training to refine the model, ultimately returning the updated local model. Following this, the client uploads the new model parameters to the server, prompting the server to aggregate updates from multiple clients. This collaborative process continually improves the global model by integrating insights and refinements from the diverse network of participating clients.
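The aggregation and broadcast steps above follow the FedAvg pattern: the server averages parameters coordinate-wise across clients, optionally weighting each client by its local dataset size. A minimal sketch, with parameter vectors flattened to plain lists for clarity:

```python
def fedavg(client_params, client_sizes=None):
    """Average each model parameter across clients; when client_sizes is
    given, clients with more local data contribute proportionally more."""
    if client_sizes is None:
        client_sizes = [1] * len(client_params)
    total = sum(client_sizes)
    n = len(client_params[0])
    return [sum(s * p[i] for s, p in zip(client_sizes, client_params)) / total
            for i in range(n)]
```

The averaged vector is then broadcast back to every client as the new global model.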

Differential privacy training module
The federated GDP algorithm is employed on both the client and the server to enhance the security of the FL calculation process, providing dual protection for both entities during FL training. This protection encompasses two key aspects: client-side federated GDP algorithm training and server-side federated GDP algorithm aggregation. On the client side, the federated GDP algorithm ensures the privacy and security of the client's data during the FL training process. It employs advanced privacy-preserving mechanisms to safeguard sensitive information, allowing clients to securely contribute their local model updates without compromising the confidentiality of their data. On the server side, the federated GDP algorithm aggregates the client updates. It leverages secure aggregation protocols to combine the model updates from multiple clients while preserving privacy, ensuring that the server can effectively learn from the collective knowledge of the clients without accessing their data. By employing client-side federated GDP training and server-side federated GDP aggregation, FL training achieves a robust and privacy-preserving framework. This approach protects the privacy of the client's data and enables collaborative learning across distributed devices, fostering advancements in ML while maintaining data security.
Training of the Client-Side Federated GDP Algorithm: Model training is executed on the client side, with each federated client possessing a fixed dataset and the computing power to engage in federated MGD. Algorithm 1 processes clients with identical network architectures and loss functions, and each local model is initialized with the global model from the server side. The number of iterations of the federated MGD algorithm aligns with the number of training epochs. After each step of the local iterative update, the parameters are clipped, and the client computes the gradient update, generates the updated model, and shares it with the aggregation server. It is essential to note that local data is private to each client and is not shared. The client-based federated GDP algorithm is detailed in Algorithm 1.

Algorithm 1 Client-Based Federated GDP
Algorithm 1 incorporates the addition of noise into the local model training process to safeguard the confidentiality of the client's raw data. More specifically, each client leverages the federated GDP algorithm to handle the gradients when computing gradient updates, guaranteeing that sensitive client information remains secure and is not divulged during model training.
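Putting the pieces together, one local pass on a client can be sketched as: per batch, compute the gradient, clip it to norm C, perturb it with Gaussian noise, and apply a momentum step. This is a simplified pure-Python sketch of the per-client loop, not the paper's exact Algorithm 1; gradients are supplied as precomputed lists, whereas the real algorithm derives them from the local model and data:

```python
import math
import random

def client_local_update(w, batch_grads, C=0.5, sigma=0.5, lr=0.01, mu=0.9, seed=0):
    """One local pass of noisy momentum gradient descent: clip each batch
    gradient to L2 norm C, add N(0, (sigma*C)^2) noise, update the momentum
    buffer v <- mu*v + g, then step w <- w - lr*v."""
    rng = random.Random(seed)
    v = [0.0] * len(w)
    for g in batch_grads:
        norm = math.sqrt(sum(x * x for x in g))
        if norm > C:                                    # clip
            g = [x * C / norm for x in g]
        g = [x + rng.gauss(0.0, sigma * C) for x in g]  # perturb
        v = [mu * vi + gi for vi, gi in zip(v, g)]      # momentum
        w = [wi - lr * vi for wi, vi in zip(w, v)]      # descent step
    return w
```

Only the resulting parameters leave the client; the raw local data never does.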
Aggregation of the Server-Side Federated GDP Algorithm: The server side is responsible for housing the global model, overseeing the entire model training process, and disseminating the initial model to all participating clients. It utilizes Algorithm 2 to receive and aggregate updates from all participating clients in each FL iteration, culminating in the construction of a new model with updated parameters. The server-based federated GDP algorithm is detailed in Algorithm 2.

Algorithm 2 Server-Based Federated GDP
In Algorithm 2, the server employs the federated GDP algorithm to protect the client updates during aggregation. This ensures that each client's contribution is effectively integrated into the final model, facilitating comprehensive global model updates and refinement.
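A sketch of the server side, under the assumption that the server averages the client parameters and may add its own Gaussian perturbation to the aggregate before broadcasting (the parameter name sigma_s is our illustrative choice; setting it to 0 disables server-side noise):

```python
import random

def server_aggregate(client_params, sigma_s=0.0, seed=0):
    """Average the uploaded client parameter vectors coordinate-wise,
    then optionally perturb the aggregate with Gaussian noise before
    broadcasting it back to the clients."""
    rng = random.Random(seed)
    k = len(client_params)
    avg = [sum(p[i] for p in client_params) / k
           for i in range(len(client_params[0]))]
    return [a + rng.gauss(0.0, sigma_s) for a in avg]
```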
By combining this client-server architecture with the federated GDP algorithm, FLM-ICR can achieve global model updates and optimization while protecting user privacy. This approach enhances classification performance and guarantees the security and confidentiality of private data, thereby fostering a robust and privacy-conscious framework for model refinement and collaborative learning.

Classification prediction module
Based on the client-server architecture of horizontal FL, FLM-ICR uses the improved MLP and LR network as the backbone of the local model to improve classification performance. This approach is straightforward to implement and seamlessly integrates FL for update training. The federated MGD algorithm optimizes local model training, accelerating convergence and facilitating efficient calculation for classification tasks.
FLM-ICR uses the improved MLP and LR network as the backbone of the locally trained model. This choice proves more suitable than other algorithms for addressing the IoVT connection record classification problem outlined in this paper. The relatively small number of neural network layers in LR and MLP, coupled with a modest parameter count, results in minimal computing resource requirements and swift computation during training. Furthermore, the LR and MLP model structures are relatively simple, facilitating comprehension and implementation, and are less susceptible to issues such as overfitting. The outputs of LR and MLP are calculated from explicit mathematical formulas and are therefore highly interpretable. MLP and LR are good at handling classification problems and are easy to use, which makes it straightforward to integrate them into FL models for update training.
The backbone network structure of FLM-ICR enhances non-linear modeling capability, model expressiveness, feature extraction capability, and generalization ability compared to traditional MLP and LR networks. This augmentation significantly improves the model's classification performance and enables it to adapt to diverse data distributions, giving FLM-ICR a competitive edge over alternative methods across various tasks. Establishing the MLP and LR networks involves taking each IoVT connection record as input and outputting the predicted class among the two connection types. PyTorch is employed to classify the data, and the two network architectures are presented below.
MLP ((model): Sequential ((0): Linear (in_features=3, out_features=200, bias=True) (1): Dropout (p=0.2, inplace=False) (2): ReLU () (3): Linear (in_features=200, out_features=2, bias=True)))

The improved MLP comprises linear layers, Dropout, and the ReLU activation function. This architecture is built with the Sequential class to construct a feedforward neural network for sample classification. Initially, a linear layer performs a linear transformation to augment the feature information of the samples, with an input dimension of 3 and an output dimension of 200. Dropout is then applied with a probability of 0.2, mitigating overfitting. Subsequently, the ReLU non-linear activation function enhances the network's non-linear expressive capability. Finally, a linear layer is used for dimension reduction and classification.
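At inference time (dropout disabled) the printed architecture reduces to Linear → ReLU → Linear. A minimal pure-Python forward pass over nested-list weights, shown with toy dimensions rather than the trained 3→200→2 parameters:

```python
def mlp_forward(x, W1, b1, W2, b2):
    """Feedforward pass of the improved MLP with dropout off:
    hidden = ReLU(W1 @ x + b1), output = W2 @ hidden + b2."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]
```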
LR ((linear): Linear (in_features=3, out_features=2, bias=True) (sigmoid): Sigmoid () (model): Sequential ((0): Linear (in_features=3, out_features=2, bias=True) (1): Sigmoid ()))

The improved LR model comprises a linear layer and the Sigmoid activation function. This configuration enables the model to calculate the probability of a given sample belonging to a specific class. Initially, the linear layer performs a linear transformation, and the Sigmoid activation function is then applied for binary classification. The Sequential class is employed to build a feedforward neural network, with the linear layer executing the linear transformation and the Sigmoid activation function producing the probability of a sample belonging to a particular class.
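Likewise, the improved LR reduces to a single linear transformation followed by a sigmoid; for a single output logit the predicted probability can be sketched as (the "normal connection" reading of the output is our illustrative interpretation):

```python
import math

def lr_forward(x, w, b):
    """Logistic-regression forward pass for one logit: sigmoid(w.x + b),
    read as the probability that the sample is a normal connection."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))
```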

Preliminary preparation
To ensure the repeatability of the experimental results, all experiments in this paper were carried out on the same laptop. Experimental environment configuration: the central processor is an Intel Core (TM) i7-7700K CPU @ 4.20 GHz with 16 GB of memory, using Python 3.8 and the deep learning framework PyTorch 1.8.1 on the Windows 10 operating system.
The improved MLP and LR networks categorize IoVT into the corresponding types based on the input connections when the dependent variable takes different categorical values. To achieve the best training result, it is necessary to select appropriate optimizers and training step sizes in the model setup to minimize the loss function. The MLP and LR networks use the federated MGD algorithm to update the optimized network weights, with the momentum set to 0.9. For the loss function, the improved MLP network employs the cross-entropy loss, while the improved LR network uses the logarithmic loss. The relevant model parameters are set as follows: the output size is 2, the number of clients is 4, the learning rate is 0.01, the batch size is 128, the number of training epochs is 60, the number of local update rounds per client is 1, the gradient clipping boundary value C is 0.5, and the standard deviation σ of the Gaussian noise is 0.5.

Model evaluation
To assess the effectiveness and feasibility of FLM-ICR, multiple indicators must be considered simultaneously when evaluating the model's performance. In binary classification, the confusion matrix is the primary evaluation index during the model evaluation stage. This matrix, obtained from the experiment, is fundamental for measuring classifier accuracy and deriving most evaluation indicators. It categorizes the two-category samples into positive (P) and negative (N) samples and the predictions into true (T) and false (F), as depicted in Table 1.
In Table 1, TP is the number of actual positive samples predicted as positive, TN is the number of actual negative samples predicted as negative, FP is the number of actual negative samples predicted as positive, and FN is the number of actual positive samples predicted as negative. The accuracy ACC = (TP + TN) / (TP + TN + FP + FN) is the proportion of correctly classified samples among all samples. The precision P = TP / (TP + FP) is the proportion of actual positive samples among those predicted as positive. The recall R = TP / (TP + FN) is the proportion of actual positive samples correctly classified, also known as the sensitivity. The specificity TNR = TN / (TN + FP) is the proportion of actual negative samples correctly classified. The F1 score is F1 = 2P×R / (P + R), which combines the precision and recall.
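The five indicators follow directly from the four confusion-matrix counts; a small helper makes the formulas concrete:

```python
def metrics(tp, tn, fp, fn):
    """Return (ACC, P, R, TNR, F1) computed from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp)          # precision
    r = tp / (tp + fn)          # recall / sensitivity
    tnr = tn / (tn + fp)        # specificity
    f1 = 2 * p * r / (p + r)    # harmonic mean of p and r
    return acc, p, r, tnr, f1
```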
These five evaluation indicators reflect the performance of the classification model. ACC objectively reflects the overall quality of the model, with a value range of [0, 1]; the closer ACC is to 1, the better the model performance. However, when positive and negative samples are unbalanced, the accuracy can only partially reflect the quality of the model. The higher the P, the better the model performance; the higher the R, the better the model performance. The larger the TNR, the smaller the misjudgment rate and the better the model performance. Since both P and R describe model quality from only a single aspect, simply pursuing the improvement of a single indicator makes little sense; increasing both indicators simultaneously is necessary to obtain the optimal model. As the harmonic mean of P and R, the F1 score is a balance point between them that considers both P and R of the classification model. When P and R increase simultaneously, the larger the F1, the better the model. In this paper, the confusion matrix is used to evaluate the classification performance of FLM-ICR. The 300 sample results discussed in the experimental confusion matrices are the classification results on the test data.

Training result
FLM-ICR is trained on the non-IID distribution of the IoVT connection record data, using the MLP and LR networks as the model backbone; the resulting models are called FL-MLP and FL-LR, respectively. FLM-ICR thus trains FL-MLP and FL-LR models with data privacy-preserving capabilities. The confusion matrices of FL-MLP and FL-LR are shown in Fig. 5(a). After 60 training rounds, the fitting curves of the FL-MLP and FL-LR classification accuracy are shown in Fig. 5(b). The values of the five evaluation indicators ACC, P, R, TNR, and F1 can be obtained from the confusion matrices and the related formulas. The training result can be seen more intuitively in Fig. 6.
The confusion matrices in Fig. 5(a) show that the FL-MLP and FL-LR models trained by FLM-ICR have good classification performance. The fitting curves in Fig. 5(b) show that the classification accuracies of FL-MLP and FL-LR trained by FLM-ICR are 0.81 and 0.78, respectively. Accuracy grows slowly early in training because non-uniform sampling causes each client to have data predominantly from one class and very little data from the other. As the number of training epochs increases, the accuracy gradually stabilizes at a higher level. The performance indicators in Fig. 6 show that the FL-MLP model trained by FLM-ICR has an ACC of 0.81, P of 0.74, R of 0.88, TNR of 0.74, and F1 of 0.8, while the FL-LR model has an ACC of 0.78, P of 0.73, R of 0.79, TNR of 0.76, and F1 of 0.76. Both FL-MLP and FL-LR trained by FLM-ICR achieve good model performance.
By exploring various levels of client collaboration during model training, it is demonstrated that the number of clients participating in each round of collaborative training in FLM-ICR affects model performance. By training the model on an unbalanced data distribution, the effectiveness of FLM-ICR in a realistic decentralized data scenario is verified.

• Client collaboration level
The client collaboration level, denoted the C value, controls the number of clients that train in parallel. To explore the influence of the level of collaboration between clients on model performance, FLM-ICR trained the FL-MLP and FL-LR models on the non-IID data distribution with C = 1, C = 0.75, C = 0.5, and C = 0.25. The FL-MLP model trained by FLM-ICR obtained an ACC of 0.81,
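In FedAvg-style training, a collaboration level C is typically realized by sampling a fraction C of the registered clients each round; a minimal sketch with illustrative names, assuming this interpretation of the C value:

```python
import random

def sample_clients(client_ids, C, rng=None):
    """Pick max(1, round(C * num_clients)) clients for one training round,
    mirroring FedAvg-style participation controlled by a fraction C."""
    rng = rng or random.Random(0)
    m = max(1, round(C * len(client_ids)))
    return rng.sample(client_ids, m)

clients = list(range(8))          # 8 hypothetical IoVT clients
for C in (1.0, 0.75, 0.5, 0.25):
    picked = sample_clients(clients, C)
    print(f"C={C}: {len(picked)} clients -> {sorted(picked)}")
```

With C = 1 every client contributes an update each round; smaller C trades per-round gradient diversity for lower communication cost.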

• Unbalanced data distribution training
The unbalanced data distribution reflects the distribution found in practical application scenarios such as the IoV. To verify the practical feasibility of FLM-ICR, an experiment is carried out under the condition of an unbalanced data distribution across clients. The confusion matrices of FL-MLP and FL-LR on the non-IID and unbalanced data distributions are shown in Fig. 9.
Figure 9 presents the confusion matrices of FL-MLP and FL-LR when applied to the non-IID and unbalanced data distributions. These matrices assess the models' ability to handle data heterogeneity and class imbalance and facilitate a comprehensive comparison of the FL-MLP and FL-LR classification performance. The performance indicators of the FL-MLP and FL-LR models trained by FLM-ICR on the non-IID and unbalanced data distributions are shown in Fig. 10.
As Fig. 10 shows, the FL-MLP model trained by FLM-ICR on the unbalanced data distribution obtains ACC, P, R, TNR, and F1 of 0.76, 0.72, 0.75, 0.76, and 0.73, respectively; the FL-LR model obtains 0.72, 0.67, 0.74, 0.7, and 0.7, respectively. The performance of the FL-MLP model trained on the non-IID data distribution is very similar to that of the FL-MLP model trained on the unbalanced data distribution. The performance of the FL-LR model trained on the unbalanced data distribution is slightly lower than that of the FL-LR model trained on the non-IID data distribution. The reason is that clients in the unbalanced distribution differ in the amount of data they hold, so more training rounds are needed to approach the performance achieved under the non-IID data distribution; the final results, however, are similar. The experimental results show that the unbalanced data distribution has little effect on the performance of FLM-ICR, fully verifying its practical feasibility.
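One common way to simulate such a client-side unbalanced, non-IID split in experiments is to give each client a dominant class and an uneven share of the total data. The sketch below is purely illustrative and is not the paper's exact partitioning scheme:

```python
import random

def unbalanced_non_iid_split(labels, n_clients=4, major_frac=0.8, seed=0):
    """Split a binary dataset so each client is dominated by one class
    (non-IID) and clients receive different amounts of data (unbalanced)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    weights = [n_clients - k for k in range(n_clients)]  # uneven sizes
    total = len(labels)
    shards = []
    for k in range(n_clients):
        size = int(total * weights[k] / sum(weights))
        dom, other = (pos, neg) if k % 2 == 0 else (neg, pos)
        n_dom = min(int(size * major_frac), len(dom))
        shard = [dom.pop() for _ in range(n_dom)]
        shard += [other.pop() for _ in range(min(size - n_dom, len(other)))]
        shards.append(shard)
    return shards

labels = [1] * 150 + [0] * 150            # balanced pool of 300 samples
shards = unbalanced_non_iid_split(labels)
print([len(s) for s in shards])           # uneven shard sizes per client
```

Each shard mixes roughly 80% of one class with 20% of the other, so both heterogeneity effects described above are present at once.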

• Verify the validity of FL
To verify the effectiveness of FL in FLM-ICR, the MLP and LR network models are also trained separately without FL. Comparing the performance of FL-MLP with MLP, and of FL-LR with LR, illustrates that using FL in FLM-ICR can protect data privacy while still achieving good model performance. The confusion matrices of FL-MLP, MLP, FL-LR, and LR are shown in Fig. 11.
Figure 11 shows the confusion matrices of FL-MLP, MLP, FL-LR, and LR, providing a comprehensive visual representation of their classification performance and insight into the ACC, P, R, TNR, and F1 of each model. As Fig. 12 shows, the FL-MLP model obtains ACC, P, R, TNR, and F1 of 0.81, 0.74, 0.88, 0.74, and 0.8, respectively; the MLP model obtains 0.83, 0.76, 0.9, 0.76, and 0.82; the FL-LR model obtains 0.78, 0.73, 0.79, 0.76, and 0.76; and the LR model obtains 0.8, 0.73, 0.88, 0.73, and 0.79. The experimental results show that although the MLP and LR models achieve the best performance, they lack privacy protection because they are not trained with FL. FL-MLP and FL-LR trained with FL in FLM-ICR, by contrast, protect data privacy while coming very close to the performance of MLP and LR. This proves that FL in FLM-ICR is effective, maintaining the balance between data privacy and model performance.
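The server-side step that turns local FL-MLP or FL-LR updates into a global model is not spelled out in this section; a FedAvg-style size-weighted parameter average is the standard choice and can be sketched as follows (all names illustrative, an assumed aggregation rule):

```python
def fed_avg(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg-style):
    each parameter is averaged in proportion to the client's data size,
    so larger clients pull the global model more strongly."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * n / total for w, n in zip(client_weights, client_sizes))
        for j in range(n_params)
    ]

# Two hypothetical clients with 2-parameter models:
w_global = fed_avg([[1.0, 2.0], [3.0, 4.0]], client_sizes=[100, 300])
print(w_global)  # parameters pulled toward the larger client
```

Because only parameters, not raw connection records, reach the server, this style of aggregation is what lets FL-MLP and FL-LR approach MLP and LR performance while keeping data local.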
The above three parts of the experiment explore the positive impact of the number of clients participating in each round of FL collaborative training on model performance, demonstrate the applicability of FLM-ICR under the unbalanced data distributions found in real scenarios, and verify that FL-MLP and FL-LR trained with FL in FLM-ICR are effective. The experimental results demonstrate the value of the client-side cooperative training mode, confirm that FLM-ICR is suitable for practical application scenarios, and illustrate that FL plays a vital role in the model.

Comparative experiment
To further verify the validity and feasibility of the model, a comparative experiment was set up under the same dataset and experimental environment, comparing FLM-ICR with CNN-LSTM [34], GDP-FL [14], LATENT [15], and PLU-FedOA [17]. The goal is to show that FLM-ICR outperforms these methods while protecting data privacy. The performance comparison between FLM-ICR and the four algorithms is shown in Table 2.
As the performance comparison in Table 2 shows, although CNN-LSTM has the highest ACC among the methods, it is a traditional deep learning method with no privacy protection capability, so its overall performance is not as good as that of FLM-ICR. The five performance indicators of FLM-ICR are better than those of GDP-FL and LATENT. Depending on the data and application scenario, FLM-ICR outperforms the other methods in terms of ACC, P, R, TNR, and F1. The following insights may explain the advantages of FLM-ICR in these respects:

Conclusion
In the IoV application scenario, the FLM-ICR proposed in this paper is based on the connection record data of IoVT and uses FL and ML methods to classify normal and abnormal terminals efficiently while ensuring data privacy. FLM-ICR uses improved MLP and LR networks as the model backbone, which handle classification problems well, are simple to implement, and integrate conveniently with FL for update training. Under the client-server architecture of horizontal FL, FLM-ICR uses the federated GDP algorithm to protect the security of the FL calculation process and uses the federated MGD algorithm as the training optimizer to accelerate local model convergence and achieve efficient computation. The FL-MLP model trained safely and cooperatively by FLM-ICR obtained ACC, P, R, TNR, and F1 of 0.81, 0.74, 0.88, 0.74, and 0.8, respectively, and the trained FL-LR model obtained 0.78, 0.73, 0.79, 0.76, and 0.76, respectively. The experiments explore the positive impact of the number of clients participating in each round of federated collaborative training on model performance, demonstrate the applicability of FLM-ICR under the unbalanced data distributions of real scenarios, and verify that FL-MLP and FL-LR trained with FL in FLM-ICR are effective. The comparative experiment on the same dataset and experimental environment shows that FLM-ICR outperforms the four existing methods, with higher classification performance and security. FLM-ICR provides a new idea for future big data sharing and collaboration and can be extended to practical scenarios such as hospitals and banks, enabling collaborative training and analysis while protecting personal privacy. In future work, FLM-ICR needs improvement in the following areas: (1) Communication and computation costs: reducing bandwidth and energy consumption through techniques such as optimized aggregation or compressed model updates would enhance the efficiency of FL. (2) Model personalization and adaptation: techniques such as user feedback and context-aware learning enable personalized model training and adaptation to individual user preferences and driving behaviors. (3) Scalability and large-scale deployment: developing scalable algorithms and infrastructure facilitates the widespread deployment of FL in the IoV domain as the number of IoVT and connected vehicles increases. By addressing these limitations, future research can advance FL in IoV applications, leading to more effective models that enhance user driving experiences and services.
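The federated MGD optimizer mentioned in the conclusion builds on the classical momentum update, where a velocity term accumulates past gradients to smooth and accelerate descent. The sketch below is a minimal local-step illustration under that assumption, not the paper's exact algorithm:

```python
def mgd_step(w, grad, velocity, lr=0.1, beta=0.9):
    """One momentum-gradient-descent update: the velocity accumulates a
    decaying sum of past gradients, and the weights move against it."""
    velocity = [beta * v + g for v, g in zip(velocity, grad)]
    w = [wi - lr * vi for wi, vi in zip(w, velocity)]
    return w, velocity

# Minimize f(w) = w^2 (gradient 2w) starting from w = 1.0:
w, v = [1.0], [0.0]
for _ in range(50):
    w, v = mgd_step(w, [2 * wi for wi in w], v)
print(w[0])  # approaches the minimum at 0
```

On each client, such a step replaces plain gradient descent during local training; the smoothed trajectory is what lets the federated variant converge in fewer rounds.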

Fig. 1
Fig. 1 Schematic diagram of FL classification

A(x) = f(x) + Y        (2)

The data collection module gathers information such as communication times, intricate vehicle behaviors, and data transmission. This comprehensive data collection process ensures that a rich and detailed dataset is obtained, facilitating in-depth analysis and insights into the functioning and interactions of IoVT. The federated learning control module is responsible for the seamless execution of FL algorithms on IoVT, ensuring the efficient coordination of model training and robust data privacy considerations. This is achieved through a series of intricate processes, including local computation, where individual IoVT devices perform computations on their local data; model aggregation, which involves the consolidation of locally trained models from multiple devices; model broadcasting, where the updated global model is distributed to the individual devices; and parameter updates, which involve refining the model parameters based on the aggregated information. This meticulous orchestration ensures that the FL algorithms operate effectively and that data privacy is rigorously maintained throughout the model training process.

Fig. 5
Fig. 5 (a) Confusion matrices for FL-MLP and FL-LR (b) Fitting curves of FL-MLP and FL-LR classification accuracy

Fig. 8
Fig. 8 (a) Effects of different C values on the performance of FL-MLP (b) Effects of different C values on the performance of FL-LR

Fig. 9
Fig. 9 Confusion matrices for FL-MLP and FL-LR on non-IID and unbalanced data distributions

Fig. 10
Fig. 10 (a) Performance indicators of FL-MLP on non-IID and unbalanced data distributions (b) Performance indicators of FL-LR on non-IID and unbalanced data distributions

Fig. 11
Fig. 11 Confusion matrices for FL-MLP and MLP, and FL-LR and LR

(1) Data diversity: FLM-ICR can fully utilize data on multiple vehicles and terminal devices for model training, improving model accuracy and performance. (2) Privacy protection: FLM-ICR keeps data on the local device for model training, avoiding centralized data storage and transmission and effectively protecting user privacy. (3) Real-time capability and adaptability: FLM-ICR can perform real-time model training on vehicles and terminal devices, allowing the model to respond and adapt to different driving scenarios and needs in time. (4) Distributed computing: FLM-ICR distributes model training tasks across multiple vehicles and terminal devices and integrates model updates from all parties through aggregation algorithms, thereby improving the efficiency of model training. The feasibility of FLM-ICR is analyzed theoretically and verified by experiments on the IoVT connection record dataset.

Table 1
Confusion matrix

Table 2
Performance comparison