Privacy-preserving sports data fusion and prediction with smart devices in distributed environment

With the rapid advancement of sports analytics and fan engagement technologies, the volume and diversity of physique data generated by smart devices across various distributed sports platforms have grown significantly. Extracting insights from such data and using them to enhance fan experiences offers considerable benefits. Yet this process raises two primary challenges. First, efficiently utilizing the vast datasets in sports analytics is daunting due to the complex nature of the sports industry. Second, the data collected from diverse sources and stored in distributed platforms contain sensitive information, such as fan preferences and athlete performance metrics, posing risks of privacy breaches. To address these challenges, we leverage an advanced Locality-Sensitive Hashing technique, known as PSDFP ALSH, tailored for the sports domain. This paper presents a new privacy-preserving method for sports data fusion and prediction in distributed environments, utilizing enhanced Locality-Sensitive Hashing to protect sensitive information while maintaining high data utility. Through extensive experimentation, our approach demonstrates superior performance over existing methods in terms of Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and computational efficiency.


Introduction
With the integration of physical activities, cyber technologies, and advanced Internet of Things (IoT) applications, the domain of sports analytics is experiencing significant growth, offering unprecedented opportunities for enhancing athletic performance, fan engagement, and sports management. As a key component of this evolution, Cyber Physical Systems (CPS) are increasingly employed to collect, process, and analyze vast amounts of data generated from sensors and devices across various domains [1][2][3][4], such as sports venues and athletes.
Among the myriad applications of CPS in sports, enhancing fan experiences at live events, optimizing athletes' performance through detailed analytics, and personalizing sports marketing strategies are paramount. The ability to fuse and intelligently analyze data from diverse sources can lead to substantial benefits, enabling stakeholders to make informed decisions and provide superior services [5][6][7][8][9].
However, the aggregation and utilization of sensitive data, including athletes' health information, fans' personal preferences, and real-time location data, pose significant privacy concerns. Directly analyzing such data from multiple sources without adequate privacy measures could lead to unintended privacy breaches and potential misuse of personal information [10][11][12][13][14][15]. For example, monitoring a fan's movements within a stadium and their purchasing habits could unintentionally expose private information if not managed carefully. Consequently, there exists an inherent tension between leveraging data to improve sports analytics and ensuring the privacy of the individuals involved [16][17][18][19].
To address these challenges, this paper proposes a novel privacy-preserving approach for sports data fusion and prediction, named PSDFP ALSH, utilizing an amplified Locality-Sensitive Hashing (LSH) technique [20,21] tailored for the sports domain. This approach aims to balance privacy concerns with the need for high-accuracy analytics and efficient data processing. Our contributions are threefold:
• We introduce a privacy-preserving data fusion framework for sports analytics, leveraging LSH techniques to transform sensitive high-dimensional data into privacy-preserving low-dimensional indices without sacrificing data utility.
• We detail the process of aggregating these indices from various data sources, such as wearable sensors and venue IoT devices, to make accurate predictions and provide insights into athletes' performance and fans' behaviors.
• Through extensive experiments with real-world sports data sets, we demonstrate the effectiveness of our approach in providing high-accuracy predictions while ensuring the privacy of individuals, outperforming existing methods in terms of Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and computational efficiency.
The rest of the paper is organized as follows: "Related work" section reviews related work in the field of sports analytics and privacy protection. "Motivation" section describes the motivation behind our approach and formalizes the problem. "A privacy-aware data fusion and prediction approach: PSDFP ALSH" section details the methodology of our PSDFP ALSH approach. "Experiments" section presents experimental results and comparisons with other methods. Finally, "Conclusion" section concludes the paper and discusses future research directions.

Related work
The intersection of privacy protection, sports data fusion, and prediction has garnered significant attention in recent years. This section reviews the literature on methodologies and technologies developed to address privacy concerns while enhancing the accuracy and utility of sports data analysis.
Privacy Protection in Data Fusion. A novel approach integrating federated learning into multi-sensor data fusion was proposed, showcasing superior convergence performance and effective privacy preservation [22][23][24]. Another study introduced a low-energy data fusion privacy protection algorithm for three-dimensional wireless sensor networks, significantly improving privacy protection and data fusion accuracy [25,26]. Additionally, a privacy protection strategy based on time slot allocation and relay in Wireless Body Area Networks (WBAN) demonstrated improved transmission success rates and reduced packet loss [27]. The application of big data and MEMS sensors in martial arts training prediction models exemplifies the fusion of diverse data sources to enhance performance and injury risk predictions [28].
In the realm of smart city services, a privacy-aware data fusion and prediction framework utilizing edge computing was developed, showing improved performance in terms of accuracy and computational efficiency [29][30][31]. Moreover, the use of federated learning in pedestrian trajectory prediction models has been explored, offering better data privacy security and prediction performance [32][33][34][35]. The development of medical sports data privacy protection methods based on legal risk control highlights the importance of standardizing data handling practices in the medical field [36]. An edge computing-based method for big data privacy protection in martial arts training movement trajectory prediction addresses both accuracy and security concerns [37]. Furthermore, an improved privacy protection algorithm for multimodal data fusion combines spatial and transform domain steganographic techniques, ensuring safe and reliable information fusion [38].
The incorporation of artificial intelligence into IoT data fusion and sensing detection has demonstrated potential in facilitating effective data integration and enhancing privacy protection [39][40][41]. A multimedia fusion privacy protection algorithm based on IoT data security under network regulations employs blockchain for a secure, reliable, and collusion-resistant scheme [42][43][44]. Additionally, an innovative method of distributed multi-sensor data fusion under privacy protection leverages federated learning for enhanced privacy [45][46][47]. The significance of legal frameworks and ethical guidelines in the context of sports data privacy protection is increasingly recognized. Research on medical sports data privacy protection underscores the need for a correct understanding of data's role and the standardization of data handling by medical personnel [36].

Motivation
Consider the scenario in Fig. 1, where an athlete, let's call him John, participates in a series of events and training sessions. His performance data, along with physiological metrics, are continuously collected via wearable devices and IoT sensors deployed around the sports facilities. Additionally, John interacts with various other systems for nutrition, healthcare, and fan engagement, generating a diverse set of data spanning multiple domains.
To optimize John's training regimen, enhance his performance, and personalize fan interactions, it is crucial to merge and analyze this multifaceted data. However, this raises significant privacy concerns. John's health data, location information, and personal preferences could be exposed, potentially leading to privacy breaches.
In light of these challenges, this paper proposes a novel, privacy-aware approach for sports data fusion and prediction. Our method leverages an amplified LSH technique to protect the sensitive information inherent in sports data, allowing for the secure and efficient analysis of data from multiple sources. By doing so, we aim to improve athlete performance and fan engagement while upholding the highest standards of privacy protection. Note that sports data often comprise varied data from multiple platforms or sources, which naturally results in different data formats. In this paper, we adopt the locality-sensitive hashing technique to uniformly convert diverse data with different formats (e.g., real data, integer data, etc.) into format-free hashing indices (e.g., a six-dimensional index value 100110). This way, we can overcome the challenges brought by different data formats in multi-source sports data integration.

A privacy-aware data fusion and prediction approach: PSDFP ALSH
In this section, we introduce our novel approach, namely PSDFP ALSH, tailored for the sports analytics domain. This method capitalizes on the enhanced Locality-Sensitive Hashing (LSH) algorithm to address privacy concerns inherent in sports data fusion and prediction. Many hashing techniques can transform sensitive high-dimensional data into privacy-preserving low-dimensional indices, such as Minhash, Simhash, and Hashcode. However, we adopt only the LSH technique to secure sensitive user data when performing multi-source data integration. The main reason is that LSH guarantees, with high probability, that (1) close data points remain close after hashing projection and (2) distant data points remain distant after hashing projection. With these two properties, we can leverage LSH techniques to transform sensitive high-dimensional data into privacy-preserving low-dimensional indices without sacrificing data utility. Our approach is segmented into three pivotal phases: transforming sports performance and fan interaction data into privacy-compliant indices, identifying approximate nearest neighbors based on these indices, and executing targeted predictions for athletes and fans.

Constructing privacy-compliant indices using enhanced LSH functions
Initially, we apply a series of enhanced hash functions to convert the sensitive sports data into privacy-compliant indices. Let $p_{i,k}$ denote the performance metric of athlete i in event k. For a athletes and b sporting events, we represent the athlete-event data matrix as P, detailed in Eq. (1).
$$P = \begin{pmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,b} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,b} \\ \vdots & \vdots & \ddots & \vdots \\ p_{a,1} & p_{a,2} & \cdots & p_{a,b} \end{pmatrix} \qquad (1)$$
The hashing process, as illustrated in Eq. (2), anonymizes the raw data into a binary index.
$$g(\vec{p}) = \begin{cases} 1, & \vec{p} \cdot \vec{r} > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
In the formula above, $\vec{p} = (p_1, p_2, \ldots, p_b)$ symbolizes the data vector for a specific athlete, encompassing performance metrics across events, and $\vec{r} = (r_1, r_2, \ldots, r_b)$ represents a randomly generated vector with each component assigned a value from -1 to 1. As depicted in Eq. (2), if the dot product of $\vec{p}$ and $\vec{r}$ is positive, then $g(\vec{p})$ yields 1; otherwise, $g(\vec{p})$ is 0. The original athlete data are thus converted into a single binary digit, which preserves privacy.
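The single hash function described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name make_hash_function, the seed, and the sample vector p are hypothetical, and the components of r are drawn uniformly from [-1, 1] as one reading of "a value from -1 to 1".

```python
import random

def make_hash_function(dim, seed=None):
    """Return g(p): 1 if the dot product p . r is positive, else 0.

    r is a random vector whose components are drawn uniformly
    from [-1, 1] (an assumption about the sampling distribution).
    """
    rng = random.Random(seed)
    r = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    def g(p):
        dot = sum(p_k * r_k for p_k, r_k in zip(p, r))
        return 1 if dot > 0 else 0

    return g

# Hypothetical athlete vector: one metric for each of b = 4 events.
p = [0.82, 0.35, 0.91, 0.40]
g = make_hash_function(dim=len(p), seed=7)
bit = g(p)
print("hash bit:", bit)
```

Because only the sign of the projection is kept, the bit reveals on which side of a random hyperplane the athlete's vector lies and nothing about the individual metric values.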
Owing to the intrinsic variability and the necessity for precision, we engage M distinct hash functions $g_1(), g_2(), \ldots, g_M()$ to compile a comprehensive binary sequence for each athlete, denoted as the privacy-compliant index $I(p) = (g_1(p), g_2(p), \ldots, g_M(p))$. This procedure ensures the transformation of sensitive sports data into a less intrusive index, safeguarding athletes' and fans' privacy while retaining data utility for predictive analysis within cloud platforms. Algorithm 1 outlines the process for constructing privacy-compliant indices for athletes and fans.
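As a rough sketch of the index construction, the M hash bits can be concatenated into one tuple per athlete. The function names, the seed, and the small matrix P below are hypothetical values chosen for illustration, and the body of Algorithm 1 is not reproduced here.

```python
import random

def make_random_vectors(num_functions, dim, seed=None):
    """Draw one random vector r per hash function, components in [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(num_functions)]

def hash_bit(p, r):
    """One hash function: 1 if p . r > 0, else 0."""
    return 1 if sum(x * y for x, y in zip(p, r)) > 0 else 0

def build_index(p, random_vectors):
    """Privacy-compliant index I(p) = (g_1(p), ..., g_M(p))."""
    return tuple(hash_bit(p, r) for r in random_vectors)

# Hypothetical athlete-event matrix P: 3 athletes, 4 events.
P = [
    [0.82, 0.35, 0.91, 0.40],
    [0.80, 0.30, 0.95, 0.42],
    [0.10, 0.90, 0.15, 0.88],
]
M = 6
vectors = make_random_vectors(M, dim=len(P[0]), seed=42)
indices = [build_index(row, vectors) for row in P]
for athlete, index in enumerate(indices):
    print(athlete, "".join(str(bit) for bit in index))
```

Every athlete must be hashed with the same M random vectors; otherwise the resulting indices are not comparable.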

Identifying athletes' nearest neighbors
This section delineates the process to discern athletes' nearest neighbors within the enhanced sports analytics framework, utilizing the PSDFP ALSH methodology. This process involves several critical steps, leveraging the constructed privacy-compliant indices to ascertain similarity amongst athletes, thereby facilitating targeted predictions and strategic insights.
Following the construction of privacy-compliant indices, consider athlete $a_1$, whose index is denoted by $I(a_1) = (g_1(a_1), g_2(a_1), \ldots, g_M(a_1))$, and similarly $I(a_2)$ for athlete $a_2$. To deduce similarity, an "AND" operation is executed across all corresponding bits of their indices. The requisite condition for similarity is formalized in Eq. (3), ensuring that for athletes to be considered similar, their respective indices must align across all M hash functions.
$$I(a_1) = I(a_2) \iff g_j(a_1) = g_j(a_2), \quad \forall j \in \{1, 2, \ldots, M\} \qquad (3)$$
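The "AND" condition can be illustrated with a short Python sketch that compares two indices bit by bit and collects the outcomes into a 0/1 similarity matrix. The function names and the three sample indices are hypothetical.

```python
def indices_match(index_a, index_b):
    """AND over all M positions: similar only if every bit agrees."""
    if len(index_a) != len(index_b):
        raise ValueError("indices must have the same length M")
    return all(bit_a == bit_b for bit_a, bit_b in zip(index_a, index_b))

def similarity_matrix(indices):
    """Pairwise 0/1 matrix built from the AND condition."""
    n = len(indices)
    return [[1 if indices_match(indices[i], indices[j]) else 0
             for j in range(n)]
            for i in range(n)]

# Hypothetical 6-bit indices for three athletes.
indices = [
    (1, 0, 0, 1, 1, 0),
    (1, 0, 0, 1, 1, 0),  # identical to athlete 0
    (1, 0, 1, 1, 1, 0),  # differs from athlete 0 in one bit
]
sm1 = similarity_matrix(indices)
for row in sm1:
    print(row)
```

A single differing bit is enough to mark two athletes as dissimilar, which is why this test alone is strict and prone to false negatives.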
The intrinsic probabilistic characteristics of LSH might lead to errors in the similarity assessment of Step 1, which could impact the overall predictive accuracy. To address this, N sets of hash functions are employed, with each set undergoing the "AND" operation of Step 1 and producing its own similarity matrix SM 1. In Step 2, these N SM 1 matrices are combined with the "OR" operation: the existence of a similar relationship in any of them signifies similarity between athletes, which mitigates false negatives and yields the matrix SM 2. Step 3 repeats the hash function grouping process of Step 2 to generate Q SM 2 matrices, which are again combined with the "OR" operation. Under this relaxed criterion, athletes are considered similar if at least one of the Q SM 2 matrices indicates similarity. The result is denoted SM 3, and the step balances the sensitivity and specificity of the prediction model.
To refine the similarity matrix and reduce false positives, the "AND" operation is applied across the W SM 3 matrices generated in Step 3. This rigorous criterion ensures that only similarities consistent across all W matrices are recognized, culminating in the final similarity matrix, SM 4, which serves as the foundation for precise data prediction and recommendation.
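One possible reading of this cascade is AND over M hash functions, OR over N sets, OR over Q repetitions, and AND over W repetitions. The sketch below follows that reading; the nesting order, the function names, and the sample data are assumptions made for illustration and may differ from the authors' algorithm.

```python
import random

def hash_bit(p, r):
    return 1 if sum(x * y for x, y in zip(p, r)) > 0 else 0

def sm1_matrix(data, M, rng):
    """Step 1: AND over M fresh hash functions (indices must be equal)."""
    dim = len(data[0])
    vectors = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(M)]
    indices = [tuple(hash_bit(p, r) for r in vectors) for p in data]
    n = len(data)
    return [[int(indices[i] == indices[j]) for j in range(n)]
            for i in range(n)]

def combine(matrices, rule):
    """Merge 0/1 matrices cell by cell; rule is any (OR) or all (AND)."""
    n = len(matrices[0])
    return [[int(rule(m[i][j] for m in matrices)) for j in range(n)]
            for i in range(n)]

def final_similarity(data, M, N, Q, W, seed=0):
    rng = random.Random(seed)
    sm3_list = []
    for _ in range(W):
        sm2_list = []
        for _ in range(Q):
            sm1_list = [sm1_matrix(data, M, rng) for _ in range(N)]
            sm2_list.append(combine(sm1_list, any))  # Step 2: OR over N
        sm3_list.append(combine(sm2_list, any))      # Step 3: OR over Q
    return combine(sm3_list, all)                    # final: AND over W

# Hypothetical data; the last athlete is a scaled copy of the first.
data = [
    [0.82, 0.35, 0.91, 0.40],
    [0.10, 0.90, 0.15, 0.88],
    [-0.50, 0.20, -0.70, 0.10],
]
data.append([2 * v for v in data[0]])
sm4 = final_similarity(data, M=4, N=4, Q=2, W=2, seed=1)
for row in sm4:
    print(row)
```

Athletes whose vectors point in the same direction always receive identical indices, so they remain similar in SM 4 regardless of the random draws.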

Making data prediction and athlete/event recommendation
Upon establishing the final similarity matrix SM 4 through the "Constructing privacy-compliant indices using enhanced LSH functions" and "Identifying athletes' nearest neighbors" sections, which encapsulates the similarity relationships among athletes, we proceed to predictive analyses and recommendations for a specified athlete, denoted as $a_{target}$. The initial step involves identifying the nearest neighbors of $a_{target}$ within SM 4 and collating these into a set named NN.
Subsequently, leveraging the preferences or performance metrics of similar athletes, predictions for $a_{target}$ are formulated. The predictive function is encapsulated in Eq. (5), where |NN| signifies the count of similar athletes within NN and $p_{a,k}$ denotes the performance metric of athlete a in event k. Predictions are made for each event, followed by a ranking process, culminating in the selection of an optimal event or performance improvement strategy for $a_{target}$. Algorithm 3 outlines the process for making these data-driven recommendations.
LSH is a probabilistic approximate neighbor search technique. As a consequence, we cannot guarantee success in every case when using LSH for multi-source data integration and analysis. However, we have adopted two measures to reduce the failure rate: (1) we use multiple hashing functions instead of one when creating indices, which reduces the false-positive probability; (2) we use multiple hashing tables instead of one when recognizing close users, which reduces the false-negative probability. This way, we reduce the failure rate of our proposal as much as possible.
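The prediction and ranking steps might look like the following Python sketch. It assumes that the predictive function is the plain average of the neighbors' metrics for an event, that missing values are stored as None and skipped, and that a higher predicted value ranks first. None of these choices are confirmed by the text, and all names and numbers are hypothetical.

```python
def nearest_neighbors(sm4, target):
    """Athletes marked similar to the target in the final matrix."""
    return [a for a, similar in enumerate(sm4[target])
            if similar and a != target]

def predict(P, neighbors, event):
    """Average of the neighbors' observed metrics for one event.

    Equals the sum divided by |NN| when every neighbor has a value.
    """
    observed = [P[a][event] for a in neighbors if P[a][event] is not None]
    if not observed:
        return None
    return sum(observed) / len(observed)

def recommend(P, sm4, target):
    """Predict the target's missing events and rank them, best first."""
    neighbors = nearest_neighbors(sm4, target)
    predictions = []
    for event, value in enumerate(P[target]):
        if value is None:
            estimate = predict(P, neighbors, event)
            if estimate is not None:
                predictions.append((event, estimate))
    return sorted(predictions, key=lambda item: item[1], reverse=True)

# Hypothetical data: athlete 0 has no record for events 1 and 2.
P = [
    [0.9, None, None],
    [0.8, 0.6, 0.5],
    [0.7, 0.2, 0.7],
    [0.1, 0.9, 0.9],
]
sm4 = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
ranking = recommend(P, sm4, target=0)
print(ranking)
```

In this example athlete 3 is not a neighbor of the target, so its high scores do not influence the predictions.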
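The effect of these two measures can be quantified with the standard amplification formulas. If a single hash function agrees on two athletes with probability s, then M functions combined by "AND" agree with probability s^M, and N such tables combined by "OR" report similarity with probability 1 - (1 - s^M)^N. The sketch below evaluates these expressions. It assumes the classical result s = 1 - θ/π for random-hyperplane hashing, which is exact for Gaussian random vectors and only approximate for components drawn uniformly from [-1, 1]. The function names are hypothetical.

```python
import math

def single_hash_agreement(p, q):
    """Approximate P[g(p) = g(q)] = 1 - theta/pi, theta = angle(p, q)."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    cosine = max(-1.0, min(1.0, dot / norm))
    return 1.0 - math.acos(cosine) / math.pi

def and_then_or(s, M, N):
    """Probability that at least one of N tables matches on all M bits."""
    return 1.0 - (1.0 - s ** M) ** N

s_close = single_hash_agreement([0.8, 0.3, 0.9], [0.8, 0.35, 0.85])
s_far = single_hash_agreement([0.8, 0.3, 0.9], [-0.2, 0.9, -0.1])
for M, N in [(4, 1), (4, 6), (8, 6)]:
    print(M, N,
          round(and_then_or(s_close, M, N), 3),
          round(and_then_or(s_far, M, N), 3))
```

Raising M pushes both probabilities down, which suppresses false positives, while raising N pushes both up, which suppresses false negatives.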

Experimental configurations
To validate the effectiveness of our proposed PSDFP ALSH method within the sports analytics domain, a comprehensive experimental setup was designed. We employed the WSDream dataset, a real-world dataset comprising performance records involving a series of people-service invocation events. The performance evaluation included comparisons with two benchmark methods: the widely acknowledged CF method, known for its predictive accuracy but lacking privacy measures, and SerRec distri-LSH, a privacy-aware approach utilizing LSH for data prediction. Our method's parameters were selected based on preliminary trials: the number of hash functions (M) and hash groups (N) were tested with values set to 4, 6, 8, and 10, while Q and W, indicating the depth of similarity matrices, were similarly varied. This parameterization was designed to investigate the trade-off between privacy protection and predictive accuracy at various levels of data sparsity.
The experimental platform was configured with a 2.50 GHz processor and 16.0 GB memory, running on a Windows 11 OS, and utilized Python 3.6 for implementation. The analysis was conducted across varying data sparsity levels (0.1, 0.3, 0.5, 0.7, and 0.9) to assess the robustness and scalability of the proposed method against the changing density of available data.
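A sparsity level could be simulated along the following lines. The sketch assumes that sparsity is the fraction of matrix entries hidden from the predictor and later used as ground truth. The text does not spell out this definition, so the function apply_sparsity and the sample matrix are purely illustrative.

```python
import random

def apply_sparsity(P, sparsity, seed=None):
    """Hide a fixed fraction of entries by replacing them with None."""
    rng = random.Random(seed)
    cells = [(i, k) for i, row in enumerate(P) for k in range(len(row))]
    hidden = set(rng.sample(cells, int(round(sparsity * len(cells)))))
    return [[None if (i, k) in hidden else value
             for k, value in enumerate(row)]
            for i, row in enumerate(P)]

# Hypothetical 4 x 5 matrix with 20 entries.
P = [[round(0.1 * (i + k + 1), 1) for k in range(5)] for i in range(4)]
sparse_P = apply_sparsity(P, sparsity=0.3, seed=3)
num_hidden = sum(value is None for row in sparse_P for value in row)
print(num_hidden, "of 20 entries hidden")
```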

Test 1: Accuracy Comparison.
In this analysis, we evaluate the accuracy of the three methods using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) as our key metrics. MAE quantifies the average magnitude of errors between predicted values and actual outcomes, whereas RMSE is the square root of the average squared difference and therefore penalizes large errors more heavily. Lower values of MAE and RMSE indicate higher predictive accuracy. To examine the performance across various scenarios, we introduced data sparsity levels of 0.1, 0.3, 0.5, 0.7, and 0.9. This gradient allows for a comprehensive analysis under differing degrees of data availability.
In our findings, illustrated in Fig. 2, at a sparsity level of 0.1, the MAE for CF is the lowest, underscoring its precision. The MAE for our PSDFP ALSH method, while marginally higher than that of CF, is significantly lower than that of SerRec distri-LSH, establishing a compelling case for its application in privacy-sensitive environments. A discernible trend is observed with increasing data sparsity: the accuracy of all methods diminishes, as indicated by rising MAE values. Nonetheless, our method consistently outperforms SerRec distri-LSH across sparsity levels up to 0.9. Similar results for RMSE can be found in Fig. 3.
Despite the superior performance of CF in terms of MAE and RMSE, its direct utilization of user data without anonymization renders it unsuitable for scenarios demanding strict privacy adherence, such as distributed edge-computing frameworks. Consequently, while the accuracy of PSDFP ALSH is slightly lower in comparison, its robust privacy-preserving capabilities and competitive predictive performance, especially against SerRec distri-LSH, validate its effectiveness and feasibility for sports analytics applications. High data sparsity (a ratio of 0.9) makes accurate prediction harder for all methods, because little pertinent user data remains for making informed predictions and recommendations.
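For reference, the two metrics can be computed as follows. This is a minimal sketch using their standard definitions, and the sample values are hypothetical.

```python
import math

def mae(predicted, actual):
    """Mean Absolute Error: average of |prediction - truth|."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root Mean Squared Error: sqrt of the average squared difference."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

predicted = [3.0, 5.0, 2.0]
actual = [2.0, 5.0, 4.0]
print("MAE :", mae(predicted, actual))
print("RMSE:", round(rmse(predicted, actual), 3))
```

Here the errors are 1, 0, and -2, so the MAE is 1.0 while the RMSE is about 1.291, showing how the single larger error weighs more heavily on RMSE.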

Test 2: Evaluating PSDFP ALSH's Accuracy with Variable Parameters
In alignment with the experimental setup detailed in the "Experimental configurations" section, we varied the parameters M, N, Q, and W to investigate their impact on the accuracy of our PSDFP ALSH methodology. Tables 1 and 2 present this analysis, with each column representing the product of M and N, indicating different combinations of hash functions and groups, while each row represents the product of Q and W. Table 1 depicts the MAE associated with three methods, while Table 2 illustrates the RMSE associated with three methods. The product Q * W signifies diverse scenarios based on the aggregation of similarity matrices.
Fig. 2 Comparative MAE analysis of the three approaches under varying data sparsity conditions
Our observations from Tables 1 and 2 reveal that a larger N value and a smaller M value each yield notably superior accuracy for PSDFP ALSH, as indicated by lower MAE and RMSE values compared to other configurations. It was further noted that holding M, N, and Q constant while increasing W led to a decrease in MAE and RMSE, indicating enhanced accuracy. Conversely, keeping M, N, and W fixed while increasing Q resulted in a slight increase in these error metrics.

Test 3: Computational Efficiency of the Approaches
In this evaluation, we explored the computational efficiency of the PSDFP ALSH, SerRec distri-LSH, and CF methodologies. This analysis aimed to ascertain the time cost implications of each method under different data availability conditions. As illustrated in Fig. 4, CF consistently incurs higher computational time than the other two methods, which could be attributed to its direct use of original user data without any form of anonymization or compression, inherently leading to longer processing time.
Conversely, the time costs for PSDFP ALSH and SerRec distri-LSH are comparable and significantly lower than those for CF. The similar efficiency of the two LSH-based methods mainly stems from their use of hash functions to convert high-dimensional data into a lower-dimensional, privacy-preserving format. The CF method's reliance on Pearson similarity calculations directly on users' original data not only raises concerns regarding privacy breaches but also results in increased computational demands, underscoring the superior efficiency and privacy compliance of the PSDFP ALSH approach in distributed edge-computing environments dedicated to smart sports analytics.
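To make the contrast concrete, the Pearson similarity that a CF baseline computes on raw data can be sketched as below. The details of the baseline are not given in the text, so this is only the textbook coefficient evaluated over co-observed events. The function name and the convention of returning 0.0 for degenerate cases are assumptions.

```python
import math

def pearson_similarity(u, v):
    """Pearson correlation over events observed by both users."""
    pairs = [(x, y) for x, y in zip(u, v)
             if x is not None and y is not None]
    if len(pairs) < 2:
        return 0.0  # too little overlap to correlate
    mean_u = sum(x for x, _ in pairs) / len(pairs)
    mean_v = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_u) * (y - mean_v) for x, y in pairs)
    var_u = sum((x - mean_u) ** 2 for x, _ in pairs)
    var_v = sum((y - mean_v) ** 2 for _, y in pairs)
    if var_u == 0 or var_v == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(var_u * var_v)

print(pearson_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(pearson_similarity([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Each comparison touches every raw value of both users, whereas the LSH-based methods compare only short binary indices, which is consistent with the time gap reported in Fig. 4.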

Conclusion
The integration and prediction of multi-source data within edge computing environments are pivotal for enabling intelligent sports analytics applications. However, the inherent tension between ensuring data privacy and maintaining data availability poses significant challenges, particularly in scenarios where preserving the integrity of athlete and event data is paramount. In this study, we introduced an approach leveraging the Locality-Sensitive Hashing technique to anonymize original data into a less sensitive index format. This transformation allows for the calculation of similarities using indices rather than raw data, thereby safeguarding athlete privacy without compromising on the utility of the data for predictive analytics.
Future research directions include the integration of additional privacy-preserving mechanisms alongside our method to further fortify data security. Moreover, we plan to incorporate more contextual variables, such as temporal and spatial factors, to enrich the prediction models. Recognizing that sports data analytics is profoundly influenced by a multitude of factors, integrating these additional dimensions will enable more nuanced and accurate predictions, thus advancing the field of smart sports analytics.

Fig. 1 A motivating example and challenges

Algorithm 1 Constructing privacy-compliant indices for athletes/fans

Fig. 3 Comparative RMSE analysis of the three approaches under varying data sparsity conditions

Fig. 4 Comparative efficiency analysis of the three approaches under varying data sparsity conditions