

Topic and knowledge-enhanced modeling for edge-enabled IoT user identity linkage across social networks

Abstract

Internet of Things (IoT) devices spawn a growing number of diverse social platforms and a wealth of online data at the network edge, propelling the development of cross-platform applications. To integrate cross-platform data, user identity linkage is envisioned as a promising technique that detects whether accounts on different social networks belong to the same identity. However, the profile and social relationship information of IoT users may be inconsistent or unavailable, which undermines the reliability of identity linkage. To this end, we propose TKM, a topic and knowledge-enhanced model for edge-enabled IoT user identity linkage across social networks, which represents user-generated content at both the post level and the account level. Specifically, a topic-enhanced method extracts features at the post level, while an external knowledge-based Siamese neural network aligns user-generated content at the account level. Finally, we show the superiority of TKM over existing methods on two real-world datasets. The results demonstrate improved prediction and retrieval performance when both post-level and account-level representations are used for identity linkage across social networks.

Introduction

The exponential growth of the Internet of Things (IoT) and mobile edge computing (MEC) empowers social networks [1], infusing social media posts with dynamic and diverse characteristics [2, 3]. Concurrently, the number of social platforms centered around IoT devices is steadily increasing [4, 5], and approximately 80% of Internet users register multiple accounts on different social networks to access various online services [6]. Social networks progressively meet users’ escalating demands for self-promotion with IoT devices through cutting-edge media, such as 3D images and augmented reality videos, which impose substantial computational demands [7,8,9]. The evolution of MEC meets users’ real-time needs by offloading tasks to nearby nodes, enhancing the responsiveness and interactivity of MEC applications [10, 11]. For instance, with edge AI, users can swiftly summarize video content, edit photos, and optimize content using prompts [12]. Additionally, MEC applications can analyze user behavior in real time to provide personalized services and recommendations [13]. For example, leveraging location data from users’ IoT devices (e.g., vehicles or gaming devices) [14,15,16], social network applications can suggest nearby activities, businesses, or friends [17]. Thus, the diversity of platforms and online data brought by IoT and MEC applications holds great potential for cross-platform applications, such as social network structure analysis [18,19,20], cross-domain topic detection [21], and multi-layer rumor influence minimization [22, 23]. These applications rely on the comprehensive amalgamation of user data from diverse social networks [24, 25]. However, the heterogeneity of cross-platform online data and the diversity of posts produced by myriad IoT devices make it challenging to integrate users’ scattered data across social networks.

In light of this, as illustrated in Fig. 1, cross-social network identity linkage is envisioned as a promising technique to amalgamate the separated IoT user data into comprehensive social profiles [26], serving as a vital prerequisite of the above cross-platform applications. Driven by MEC, identity linkage can operate on edge nodes for real-time cross-network product recommendations and advertising placements, enhancing user experience and delivering economic value [27]. In particular, given the diversity of user attributes (e.g., user profiles, social relationships) and the data generated by edge-enabled IoT users, a complete view of an IoT user’s characteristics can be modeled to identify accounts across multiple social networks [28, 29]. Some studies have used users’ attribute information for identity linkage, such as social relationships [30,31,32] and user profiles [33,34,35]. However, users tend to make their posts public while keeping their social relationships private, and their profiles may change dynamically. With the User Generated Contents (UGCs) produced by edge-enabled IoT users, a variety of user features can be extracted (e.g., writing style, spatial-temporal features) without the issues mentioned above. From the perspective of UGCs [6, 36], e.g., posts, tweets, and publications, capturing correlations between posts can characterize user behavior at a low acquisition cost, in contrast to user profiles and social relationships [37, 38].

Fig. 1 Illustration of the user identity linkage task

Although using UGCs for identity linkage alleviates the inconsistency of accessing user data, accurately modeling IoT user features remains challenging due to cross-platform distribution disparities and the rich semantic information of UGCs (e.g., text). First, because latent semantic information contributes to the similarity of UGCs, it is necessary to find hidden correlations among different semantic features. Users may post texts with different wording but describing the same event on different social networks. Meanwhile, given the scale of social network data and the complexity of natural language semantics, it is important to represent deep semantic information so as to capture user features and identify corresponding accounts across multiple social networks without text annotations. Second, the granularity of individual posts is too limited to calculate the correlation between different accounts [39]. If the posts of two accounts belonging to the same user are presented differently, post-level comparison may miss the target account, widening the gap between the two identities and degrading linkage accuracy. Therefore, it is essential to represent macro user characteristics (e.g., account-level features) that reinforce the post similarity representation [40]. Furthermore, temporal factors, which play a vital role in the feature representation of UGCs, should also be considered.

Thus, in this paper, we propose a topic and knowledge-enhanced edge-enabled IoT user identity linkage model, named TKM. First, topic information enhances the shallow semantic information represented by BiLSTM in post-level feature representation. Then, an account-level feature representation introduces external knowledge-based alignment to reduce the discrepancy of data distributions among different platforms. When generating similarity distributions at the two levels, we use an attention mechanism to incorporate topic and shallow semantic features at the post level, and the encoder structure of the Transformer at the account level to incorporate temporal factors. Finally, we evaluate our work with datasets from real social platforms: Twitter, Instagram, and Flickr.

Our contribution is summarized as follows.

  • We propose a UGC-based approach named TKM for identity linkage across social networks, incorporating post-level and account-level information to uncover hidden correlations among user features, particularly enhancing semantic information at the post-level using topic information.

  • We develop a knowledge-aware account representation learning module integrated into TKM, which performs feature alignment among users across social networks by introducing external knowledge, addressing the semantic disparities caused by different data distributions across social platforms.

  • Experiments on real-world datasets demonstrate that combining deep semantic features based on latent topic vectors with knowledge graph-based global user representations outperforms methods that rely only on shallow semantic information for identity linkage tasks.

The organization of this paper is as follows. In “Related work” section, related work is reviewed regarding user identity linkage, topic representation model, and external knowledge base, respectively. “Preliminaries” section introduces basic concepts, definitions, and formulation. In “Methodology” section, a topic and knowledge-enhanced identity linkage method is elaborated. “Performance evaluation” section presents the experimental results. “Conclusions and future work” section gives the conclusion.

Related work

User identity linkage

Existing works use user profiles, user relationships, UGCs, and combinations of these sources for user identity linkage across networks. Traditional methods usually adopt the user’s profile information [41, 42]. Goga et al. [43] focused on profile attributes for analyzing social network users, and investigated how profile attributes, such as usernames, location, and friends, affect the overall matching reliability. Nevertheless, users’ profile information could be fictitious. To characterize users more comprehensively, Zhou et al. [44] addressed the challenges arising from incomplete user information and sparse user pairs by proposing TransLink. This approach utilizes the user’s social relationships to generate embedding vectors, which are then projected into a uniform low-dimensional space.

However, in recent years, an increasing number of users choose to conceal their social relationships and dynamically update their profiles, which can degrade the performance of user identity linkage. Different from the above works, several efforts address this challenge using users’ published content, i.e., UGCs. Generally, UGCs contain rich user characteristics and remain public and unaltered. User features, including events, hobbies, attitudes, and other characteristics, can be inferred by analyzing the textual information within UGCs. Chen et al. [36] considered the textual information of posts and used GloVe and BiLSTM to generate user features, observing that the similarity between pairs of user posts in adjacent time periods contributes more to the user similarity distribution. The location information in each post can also generate rich user representations. Based on users’ physical presence, Feng et al. [45] proposed an end-to-end deep learning framework that exploits the spatial-temporal locality of user activities to extract representative features from trajectories. They also demonstrated that network access-related information can be translated into location and thus help complete the user identity linkage task. To alleviate the limitation of using absolute locations, Chen et al. [26] proposed HFUL, which generates location information from user posts based on kernel density estimation; they further built an index structure over the spatio-temporal data and employed pruning strategies to reduce the search space. With the help of the Bayesian personalized ranking (BPR) framework, Song et al. [46] investigated the relationship between multimodal information and used latent compatibility to unify the different complementary kinds of information. In addition, there are models that use a multilayer perceptron to fuse the similarity scores of different modalities [47], as well as a model based on adversarial learning that reduces the information distribution distances across social platforms [48]. When heterogeneous user information is used, the quality of multimodal fusion indirectly affects model effectiveness. Moreover, users are becoming increasingly aware of their personal information, making it ever more difficult to obtain multimodal data [49].

However, existing works based on user-generated contents (UGCs) lack a comprehensive representation of textual features, particularly overlooking the latent semantic information embedded within textual content and neglecting the challenge of semantic distribution disparities across networks. Therefore, in this paper, we concentrate on utilizing textual information from UGCs to comprehensively represent user characteristics, specifically delving into latent textual representations.

Topic representation model

In recent years, topic models have achieved prominent success in natural language processing tasks. Topics can be represented using latent variable generation models [50]. For example, Kingma et al. [51] proposed the variational auto-encoder (VAE), which uses a deep learning model to approximate the parameters of the probability distribution over the latent vector layer, thereby extracting a low-dimensional representation of the latent variables from high-dimensional information. Nan et al. [52] proposed a topic model based on the Wasserstein autoencoder (WAE) structure to address the challenge of distribution matching and avoid the problem of posterior collapse. Furthermore, for the short text posts typically found in social networks, Li et al. [53] clustered the sentiment of comments into a single document and adopted topic information to generate a summary. However, the limitation is that the topic information they used consists of tags given by the user, rather than the latent topic information in the text.

Beyond this, to detect topic information in social networks, Pathak et al. [54] proposed a sentence-level sentiment analysis model for topic modeling, which uses latent semantic indexing constrained by regularization. Meanwhile, short text posts on social networks usually have an informal style and may contain spelling mistakes, Internet buzzwords, and informal grammar. Kolovou et al. [55] proposed a sentiment analysis framework called Tweester, which incorporates several models, including a topic model, a semantic sentiment model, and a word embedding model, to solve the problems of tweet polarity classification and tweet quantification. In particular, they demonstrated that topic modeling can improve the performance of semantic analysis tasks on informal, short-text posts such as tweets.

External knowledge base

Recently, knowledge graphs have attracted increasing research attention as an approach to introducing external knowledge. Lehmann et al. [56] extracted structured knowledge from different language versions of Wikipedia and mapped it to a single shared ontology consisting of different classes and properties, combining knowledge from multiple sources. Beyond this, to explore event-centric knowledge graphs, Sap et al. [57] focused on inferential knowledge, which is expressed in the form of If-Then relations with variables.

Recent developments in language representation have heightened the need for introducing external knowledge. Liu et al. [58] explored knowledge-driven challenges in specific domains by integrating BERT with a knowledge graph. Wang et al. [59] proposed a model named KEPLER to address the challenge of knowledge embedding and pre-trained language representation, which not only integrates factual knowledge into pre-trained language representation models but also generates effective knowledge embeddings. In addition, Sun et al. [60] proposed a contextualized language and knowledge embedding model, named CoLAKE, to reduce the heterogeneity between relevant knowledge contexts and language representations by constructing a word-knowledge graph (WK graph). Moreover, among the approaches that introduce external knowledge to describe the global characteristics of users, Karidi et al. [61] proposed a followee recommendation method that models followers and potential followees based on the same external knowledge and the topics of interest to users.

Preliminaries

In this section, we first introduce the necessary definitions of the identity linkage across social networks and then formulate the research problem.

Table 1 Key terms and descriptions

Basic concepts and definitions

Before introducing our methodology, we define the key terms used in this paper, which are listed in Table 1. Some of these terms are described only for social network \({SN}^{X}\), since they can be defined analogously for social network \({SN}^{Y}\).

Definition 1

(Post-level and account-level representation). Given a social media network \({SN}^{X}\) or \({SN}^{Y}\), each user in the network has her own vector space to represent the different characteristics of this user. In our paper, the user vector space consists of post-level vector representation and account-level vector representation. For each user, post-level representations focus more on the detailed features and connections of each post. Account-level representations are coarser-grained and focus more on the overall features of the user. More specifically, post-level representation consists of the BiLSTM-based textual representation and the VAE-based topic representation. Account-level representation refers to the global features of the user account, which are generated by introducing an external knowledge base. Moreover, textual representation based on BiLSTM refers to shallow semantic information, while topic vector representation indicates deep semantic information.

Definition 2

(Identity linkage). Given two users \(u_{i}^{X}\) and \(u_{j}^{Y}\) from different social networks, we design representation learning models to generate a user vector space from UGCs. Thereafter, we aim to determine whether the accounts \(u_{i}^{X}\) and \(u_{j}^{Y}\) belong to the same user identity.

Whether \(u_{i}^{X}\) and \(u_{j}^{Y}\) belong to the same user identity is indicated by

$$\begin{aligned} \psi \left( u_{i}^{X},u_{j}^{Y} \right) =\left\{ \begin{array}{c} 1,u_{i}^{X}=u_{j}^{Y}\\ 0,u_{i}^{X}\ne u_{j}^{Y}\\ \end{array}. \right. \end{aligned}$$
(1)

Problem formulation

In this paper, our proposed model tackles two main questions: “Is it possible to determine whether two user accounts refer to the same user identity using only the users’ text posts?”, and “Can topic information and comprehensive knowledge graph-based user features enhance the shallow semantic information of users for the identity linkage task?”. Given two arbitrary social media networks \({SN}^{X}\) and \({SN}^{Y}\), let \(U^{X} = \left\{ u_{1}^{X},u_{2}^{X},u_{3}^{X},\ldots ,u_{n}^{X} \right\}\) be the user set of \({SN}^{X}\) and \(U^{Y} = \left\{ u_{1}^{Y},u_{2}^{Y},u_{3}^{Y},\ldots ,u_{n}^{Y} \right\}\) be the user set of \({SN}^{Y}\). Without loss of generality, we only use the content of users’ textual posts, which is a common component of mainstream social media networks and has the advantage of being easily accessible. Furthermore, the social media networks \({SN}^{X}\) and \({SN}^{Y}\) in our model can be arbitrary.

Each user’s posts are denoted by \(\mathcal {H} _{\textrm{i}}=\left\{ \left( t_{g}^{i},p_{g}^{i} \right) \right\} _{g=1}^{G}\), where G is the number of posts, \(t_{g}^{i}\) refers to the content of the g-th textual post of the i-th user, and \(p_{g}^{i}\) refers to its timestamp. Two levels of vector representations are generated from these posts: the post-level vector representation \(level_{\mathcal {H}}\) and the account-level vector representation \(level_U\). \(level_{\mathcal {H}}\) includes two vectors: the textual vector representation \(\widetilde{t}_{g}^{i}\), which captures the shallow semantic information of post \(t_{g}^{i}\), and the topic latent vector representation \(z_{g}^{i}\). \(level_U\) corresponds to the account vector representation \(m_i\), which relies on the knowledge graph \(\mathcal {K} \mathcal {G}\) to perform alignment operations across social media networks.
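To make the notation concrete, the following minimal Python sketch (illustrative, not the authors’ implementation) shows one way the per-user post set \(\mathcal {H}_i\) and the two representation levels could be organized.

```python
# Minimal sketch of the notation above (names are illustrative, not the
# authors' implementation): each user holds G timestamped posts, and two
# levels of vectors are derived from them.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Post:
    text: str          # t_g^i: textual content of the g-th post
    timestamp: float   # p_g^i: timestamp of the g-th post

@dataclass
class UserAccount:
    posts: List[Post]                                               # H_i, |posts| = G
    shallow_vecs: List[List[float]] = field(default_factory=list)   # \tilde{t}_g^i
    topic_vecs: List[List[float]] = field(default_factory=list)     # z_g^i
    account_vec: List[float] = field(default_factory=list)          # m_i
```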

In addition, this paper focuses on linking user accounts across two social networks, but our model can be extended to a multi-network environment as follows.

Given social media networks \(SN^X,SN^Y,SN^Z(X,Y,Z\in N)\) with users \(u_{i}^{X}\), \(u_{j}^{Y}\), \(u_{f}^{Z}\): if \(\left(u_{i}^{X},u_{j}^{Y}\right)\) and \(\left(u_{j}^{Y},u_{f}^{Z}\right)\) refer to the same user identity, then we can establish a linkage between \(u_{i}^{X}\) and \(u_{f}^{Z}\), which indicates that \(u_{i}^{X}\), \(u_{j}^{Y}\), and \(u_{f}^{Z}\) all belong to the same user identity.
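This extension amounts to taking the transitive closure of pairwise linkage decisions. A minimal sketch, assuming the pairwise linkages have already been predicted by the model, uses a union-find structure to group accounts into identities; the account identifiers are illustrative.

```python
# Minimal sketch (not part of TKM itself): propagating pairwise linkage
# decisions across more than two networks with a union-find structure.
# Account identifiers such as ("X", "u_i") are illustrative assumptions.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, a):
        self.parent.setdefault(a, a)
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Pairwise linkages predicted between (SN^X, SN^Y) and (SN^Y, SN^Z).
linked_pairs = [(("X", "u_i"), ("Y", "u_j")), (("Y", "u_j"), ("Z", "u_f"))]

uf = UnionFind()
for a, b in linked_pairs:
    uf.union(a, b)

# ("X", "u_i") and ("Z", "u_f") now share a root, i.e., the same identity.
print(uf.find(("X", "u_i")) == uf.find(("Z", "u_f")))  # True
```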

Methodology

In this section, we detail the proposed topic and knowledge-enhanced identity linkage method with attentive modeling. In essence, the purpose of utilizing topic information is to enhance the shallow semantic information of the posts. Meanwhile, the application of an external knowledge base could perform alignment of UGCs. Accordingly, we can tackle the user identity linkage task by using different representations from multiple levels.

Fig. 2 The overview of the TKM model, including two components: post-level representation and account-level representation

The overall design of TKM

As illustrated in Fig. 2, our proposed model consists of two key components: post-level representation generation and account-level representation generation, to address the challenges in the problem formulation. In particular, two kinds of information are included in the post-level representation learning, one is the information generated with the topic model to represent the deep semantic features in the post, and the other is the shallow semantic features generated with the BiLSTM model. Simultaneously, we use the attention mechanism integrated with temporal post correlation, to fuse the similarity distributions of the two post-level representations. In the account-level representation, we resort to the knowledge graph to obtain top-K triples for each post, and generate the embedding vector of their knowledge representations with the help of the attention mechanism. In particular, the encoder structure of the Transformer is utilized to generate account representation. Moreover, we use a fusion strategy to process post-level similarity and account-level similarity.

Post-level vector representation

VAE based topic latent vector representation

Undoubtedly, the topic is fundamental to the analysis of UGCs in social media, and it is also a significant component of post representation learning. In fact, not all users add topic tags to their posts, so we need to generate topic features from the high-dimensional text information. Meanwhile, although each post appears to be independent, users may use multiple posts to describe similar topics. Intuitively, unlike formal articles where sentences are correlated with each other, prior posts on social networks are unlikely to depend on subsequent posts. Towards this end, we resort to TodKat [62], which designed a topic model to encode the latent topic vectors of utterances in a dialogue. In particular, given the characteristics of posts in social networks, we propose to use a VAE-based topic representation model with a sequential structure for accurate latent topic representation learning. For simplicity, we omit the superscript i, which refers to the i-th user.

To generate a latent topic vector for each post’s text content \(t_g\), we use an internal loop structure over \(z_g\) to handle time-series information. A topic layer is added to the RoBERTa model Ra, where \(Ra_{\mathrm {\varphi }}\) is the part before the topic layer and \(Ra_{\theta }\) is the part after it [63]. The variational approximate posterior can be calculated as

$$\begin{aligned} \begin{array}{l} q_{\phi }\left( z_g\mid \varvec{x}_{\le g},\varvec{z}_{<g} \right) \mathrm {}=N\left( z_g\mid f_{\mu _{\phi }}\left( x_{g}^{R},h_{g-1} \right) ,f_{\sigma _{\phi }}\left( x_{g}^{R},h_{g-1} \right) \right) \\ \mathrm {where~}h_{g-1}=f_{\tau }\left( z_{g-1},x_{g-1}^{R} \right) ,\mathrm {~for~}g>1\\ \end{array}, \end{aligned}$$
(2)

where \(x_{g}^{R}\) refers to the output of \(Ra_{\mathrm {\varphi }}(t_g)\), and \(f_{\mu _{\phi }}(\cdot )\) and \(f_{\sigma _{\phi }}(\cdot )\) correspond to two multilayer perceptrons. More specifically, the multi-headed attention mechanism can be viewed as answering the query “which parts of the context in the post cue the latent topic representation”. It is worth noting that the multi-headed attention mechanism has proven able to capture features effectively in the Transformer model [64]. Thereafter, we can obtain \(f_{\tau }\) as

$$\begin{aligned} f_{\tau }\left( z_{g-1},x_{g-1}^{R} \right) =\mathrm {~Attention~}\left( z_{g-1},x_{g-1}^{R},x_{g-1}^{R} \right) , \end{aligned}$$
(3)

where \(z_{g-1}\) serves as the query and \(x_{g-1}^{R}\) provides the keys and values. To represent the dependencies between \(z_{g-1}\) and \(z_g\) across posts, we can express the prior of \(z_g\) as

$$\begin{aligned} p\left( z_g\mid h_{g-1} \right) =\mathcal {N} \left( z_g\mid f_{\mu _{\gamma }}\left( h_{g-1} \right) ,f_{\sigma _{\gamma }}\left( h_{g-1} \right) \right) , \end{aligned}$$
(4)

where \(f_{\mu _{\gamma }}(\cdot )\) and \(f_{\sigma _{\gamma }}(\cdot )\) denote two multilayer perceptrons analogous to those in (2). In fact, the true posterior \(p_{\theta }\left( z_g\mid \varvec{x}_{\le g},\varvec{z}_{<g} \right)\) of \(z_g\) is intractable, so it is approximated by a neural network, represented as \(q_{\phi }\left( z_g\mid \varvec{x}_{\le g},\varvec{z}_{<g} \right)\). Moreover, we adopt the VAE model to process each post, where the post text is reconstructed from the latent topic vector \(z_g\). Consequently, a language model based on the encoder-decoder architecture can boost the reconstruction of the post and generate the topic more accurately. Accordingly, the reconstruction of \(x_{g}^{R}\) from \(z_g\) is given by

$$\begin{aligned} \hat{x}_g=Ra_{\theta }\left( z_g,\,x_{g}^{R} \right) . \end{aligned}$$
(5)

In addition, following the VAE [51], we can construct the variational lower bound (VLB) \(L_t\) as the sum of the reconstruction loss and the regularization loss. In particular, the reconstruction loss measures how well the generated latent topic vector \(z_g\) reproduces the post content \(x_{g}^{R}\), while the regularization loss refers to the difference between the probability distribution of \(z_g\) and the prior probability distribution (i.e., a Gaussian distribution). Thereafter, we can formulate \(L_t\) as

$$\begin{aligned} \begin{array}{r} L\left( \theta ,\phi ;x \right) =E_{q_{\phi }\left( z_{\le G}|x_{\le G} \right) }\left[ \log \!\,\left( p_{\theta }\left( x_{\le G}|z_{\le G} \right) \right) \right] \mathrm {}\\ -D_{KL}\left( q_{\phi }(z_{\le G}\mid x_{\le G})\parallel p_{\theta }(z_{\le G}) \right) \\ \end{array}, \end{aligned}$$
(6)

where \(D_{KL}\) denotes the KL divergence, and both \(p_{\theta }(z_{\le G})\) and \(q_{\phi }(z_{\le G}|x_{\le G})\) are Gaussian. Thereafter, we can generate a latent topic representation with sequential structure for each post, and we obtain a language model fine-tuned on post content, which is reused later in the knowledge-based representation.
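A minimal PyTorch sketch of one step of this sequential VAE is given below, covering the approximate posterior of Eq. (2), the attention-based recurrence of Eq. (3), the prior of Eq. (4), and the KL term of Eq. (6). The MLP and attention dimensions are illustrative, and the RoBERTa encoding \(x_g^R\) is assumed to be computed elsewhere.

```python
# Minimal PyTorch sketch of one sequential-VAE step from Eqs. (2)-(6),
# assuming x_g is the RoBERTa-side encoding Ra_phi(t_g) of the current post
# and h_prev summarizes earlier posts; all dimensions are illustrative.

import torch
import torch.nn as nn

class TopicStep(nn.Module):
    def __init__(self, d_x: int, d_z: int):
        super().__init__()
        self.mu_post = nn.Linear(d_x + d_z, d_z)       # f_{mu_phi}
        self.logvar_post = nn.Linear(d_x + d_z, d_z)   # f_{sigma_phi}
        self.mu_prior = nn.Linear(d_z, d_z)            # f_{mu_gamma}
        self.logvar_prior = nn.Linear(d_z, d_z)        # f_{sigma_gamma}
        # f_tau: attention with z_{g-1} as query, x_{g-1} as keys/values
        self.attn = nn.MultiheadAttention(embed_dim=d_z, num_heads=1,
                                          kdim=d_x, vdim=d_x, batch_first=True)

    def forward(self, x_g, h_prev):
        # Approximate posterior q_phi(z_g | x_{<=g}, z_{<g}), Eq. (2)
        feat = torch.cat([x_g, h_prev], dim=-1)
        mu_q, logvar_q = self.mu_post(feat), self.logvar_post(feat)
        z_g = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)

        # Prior p(z_g | h_{g-1}), Eq. (4), and its KL term from Eq. (6)
        mu_p, logvar_p = self.mu_prior(h_prev), self.logvar_prior(h_prev)
        kl = 0.5 * torch.sum(logvar_p - logvar_q
                             + (logvar_q.exp() + (mu_q - mu_p) ** 2)
                             / logvar_p.exp() - 1, dim=-1)
        return z_g, kl

    def recurrence(self, z_prev, x_prev):
        # h_{g-1} = Attention(z_{g-1}, x_{g-1}, x_{g-1}), Eq. (3)
        h, _ = self.attn(z_prev.unsqueeze(1), x_prev.unsqueeze(1),
                         x_prev.unsqueeze(1))
        return h.squeeze(1)

# Usage: step = TopicStep(d_x=768, d_z=64); z_g, kl = step(x_g, h_prev)
```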

A BiLSTM based textual vector representation

To solve the problem of weak semantic information in short texts, we adopt the BiLSTM framework to process users’ historical posts. In our work, we regard the text features generated by this method as shallow semantic information. BiLSTM plays a vital role in processing text information due to its explicit modeling of semantic relations within sentences. Despite the tremendous success of BiLSTM in natural language processing (NLP) tasks and the identity linkage task [36, 65], there is scarce work on incorporating shallow semantic information with latent topic information for identity linkage.

Firstly, for a user’s post \(\mathcal {H}\), its textual content \(t_g\) is composed of \(\varUpsilon\) words across multiple sentences, represented as \(t=\left\{ Word^1,Word^2,\cdots ,Word^\varUpsilon \right\}\). To generate the embedding vector of each word, we utilize Global Vectors (GloVe) [36]. In particular, GloVe is a word embedding model that learns word vectors from the statistical information of global lexical co-occurrence, combining the advantages of global statistics and local context window approaches. The BiLSTM then provides a complete modeling of the semantic information of posts. Specifically, for each word \(Word^{\upsilon }\), \(\upsilon \in \left\{ 1,2,\dots ,\varUpsilon \right\}\), the embedded vector is \(\textrm{e}^{\upsilon }\in \mathbb {R} ^{D_e}\). The update gate and reset gate can be calculated as

$$\begin{aligned} \left\{ \begin{array}{c} \textrm{u}_{\upsilon }=\sigma \left( \textrm{W}_u\left[ \overrightarrow{\textrm{f}}^{\upsilon -1},\textrm{e}^{\upsilon } \right] +\textrm{b}_u \right) \\ \textrm{r}_{\upsilon }=\sigma \left( \textrm{W}_r\left[ \overrightarrow{\textrm{f}}^{\upsilon -1},\textrm{e}^{\upsilon } \right] +\textrm{b}_r \right) \\ \end{array}, \right. \end{aligned}$$
(7)

where \(\textrm{W}_u\) and \(\textrm{b}_u\) are the weight matrix and bias vector of the update gate, and \(\textrm{W}_r\) and \(\textrm{b}_r\) are the weight matrix and bias vector of the reset gate. \(\sigma (\mathrm {\cdot })\) denotes the sigmoid activation function. Here, the memory cell state \(\textrm{m}_{\upsilon }\) and the vector \(\overrightarrow{\textrm{f}}^{\upsilon }\) generated by the forward LSTM can be represented as

$$\begin{aligned} \left\{ \begin{array}{c} \textrm{m}_{\upsilon }=\tanh \!\,\left( \textrm{W}_m\left[ \textrm{r}_{\upsilon }\odot \vec {\textrm{f}}^{\upsilon -1}\mathrm {~},\textrm{e}^{\upsilon } \right] +\textrm{b}_m \right) \\ \overrightarrow{\textrm{f}}^{\upsilon }=\textrm{u}_{\upsilon }\odot \textrm{m}_{\upsilon }+\left( 1-\textrm{u}_{\upsilon } \right) \odot \vec {\textrm{f}}^{\upsilon -1}\mathrm {~~~~~~~~}\\ \end{array}, \right. \end{aligned}$$
(8)

where \(\textrm{W}_m\) and \(\textrm{b}_m\) are the weight matrix and bias vector, and \(\odot\) denotes element-wise multiplication. Similarly, we can obtain the backward LSTM vector \(\overleftarrow{\textrm{f}}^{\upsilon }\). Here, the BiLSTM vector representation of \(Word^{\upsilon }\) can be expressed as

$$\begin{aligned} \mathrm {{\textbf {f}}}^{\upsilon }=\left[\overrightarrow{\textrm{f}}^{\upsilon },\overleftarrow{\textrm{f}}^{\upsilon }\right]. \end{aligned}$$
(9)

Consequently, the BiLSTM-based vector representation \(\widetilde{t}\) containing all the information of the post text can be defined as

$$\begin{aligned} \widetilde{t}=\sum \limits _{\upsilon =1}^{\varUpsilon }{\mathrm {{\textbf {f}}}^{\upsilon }}. \end{aligned}$$
(10)
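The following minimal PyTorch sketch illustrates Eqs. (7)-(10), with nn.LSTM(bidirectional=True) standing in for the gated recurrence written out above and GloVe vectors assumed to be pre-loaded; all sizes are illustrative.

```python
# Minimal PyTorch sketch of the BiLSTM-based textual representation in
# Eqs. (7)-(10). GloVe embeddings are assumed to be pre-loaded into
# `glove_weights` (vocab_size x D_e); nn.LSTM stands in for the gated
# recurrence written out above.

import torch
import torch.nn as nn

class TextualEncoder(nn.Module):
    def __init__(self, glove_weights: torch.Tensor, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.bilstm = nn.LSTM(input_size=glove_weights.size(1),
                              hidden_size=hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, num_words) indices of Word^1 .. Word^Upsilon
        e = self.embed(token_ids)        # e^v: GloVe embedding of each word
        f, _ = self.bilstm(e)            # f^v = [forward; backward], Eq. (9)
        return f.sum(dim=1)              # \tilde{t} = sum_v f^v, Eq. (10)

# Usage (illustrative): glove_weights = torch.randn(10000, 300)
# enc = TextualEncoder(glove_weights)
# t_tilde = enc(torch.randint(0, 10000, (4, 30)))   # (4, 256)
```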

Similarity fusion of post-level representations

After feature representation, we incorporate the above two vector representations to generate the post-level similarity distribution. Considering that users are likely to post similar content or topics on different social networks within closely adjacent time periods [66], we need to incorporate a temporal correlation factor when generating similarity distributions. Towards this end, we resort to UserNet [36], with the key modification that the image representation is replaced by the topic representation. In particular, we propose to use the attention mechanism to incorporate the topic latent vector representation \(z_g\) with the textual vector representation \(\widetilde{t}_{g}\) to generate the post-level similarity distribution. The similarity between the different types of semantic information in the posts of \(u_{i}^{X}\) and \(u_{j}^{Y}\) and the temporal weights can be calculated as

$$\begin{aligned} \left\{ \begin{array}{c} S_{g,n}^{\tilde{t}}=cos\!\,\left( \tilde{{t}}_{g}^{i},\tilde{{t}}_{n}^{j} \right) \\ S_{g,n}^{z}=cos\!\,\left( {z} _{g}^{i},{z} _{n}^{j} \right) \\ \hat{{p}}_{g,n}=\frac{1}{\log \!\,\left| p_g-p_n \right| }\\ \end{array}, \right. \end{aligned}$$
(11)

where \(S_{g,n}^{\tilde{t}}\) and \(S_{g,n}^{z}\) denote the shallow semantic similarity and topic similarity between the g-th post of \(u_{i}^{X}\) and the n-th post of \(u_{j}^{Y}\), and \(\hat{{p}}_{g,n}\) denotes the temporal relevance weight between the posts of \(u_{i}^{X}\) and \(u_{j}^{Y}\), where \(p_g\) and \(p_n\) are timestamps. Then, \(\hat{S}_{g,n}^{\tilde{t}}=\hat{{p}}_{g,n}S_{g,n}^{\tilde{t}}\) and \(\hat{S}_{g,n}^{z}=\hat{{p}}_{g,n}S_{g,n}^{z}\) denote the pair-wise similarities weighted by temporal factors.

In addition, if the textual features (e.g., word associations) of users’ posts on different social networks are dominant, more confidence should be placed on the shallow semantic information. In fact, because posts in social networks are informal, shallow semantic information alone cannot accurately identify the association between users. Intuitively, we need to set different confidences for different representations. Accordingly, the attention mechanism for incorporating the two post-level similarities can be expressed as

$$\begin{aligned} \left\{ \begin{array}{c} \textrm{h}_{\tilde{t}}=tanh\!\,\left( \textrm{W}_{\tilde{t}}\hat{S}_{n}^{\tilde{t}}+\textrm{b}_{\tilde{t}} \right) \\ \textrm{h}_z=tanh\!\,\left( \textrm{W}_z\hat{S}_{n}^{z}+\textrm{b}_z \right) \\ \left[ \alpha _{\tilde{t}},\alpha _z \right] =softmax\!\,\left( \textrm{a}^{\textrm{T}}\textrm{con}\!\,\left( \textrm{h}_{\tilde{t}},\textrm{h}_z \right) \right) \\ \end{array}, \right. \end{aligned}$$
(12)

where \(\hat{S}_{n}^{\tilde{t}}=\left[ \hat{S}_{1,n}^{\tilde{t}},\hat{S}_{2,n}^{\tilde{t}},\cdots ,\hat{S}_{G,n}^{\tilde{t}} \right]\) and \(\hat{S}_{n}^{z}=\left[ \hat{S}_{1,n}^{z},\hat{S}_{2,n}^{z},\cdots ,\hat{S}_{G,n}^{z} \right]\). \(\textrm{W}_{\tilde{t}}\), \(\textrm{W}_z\) and \(\textrm{b}_{\tilde{t}}\), \(\textrm{b}_z\) respectively denote the weight matrices and bias vectors, and \(con(\mathrm {\cdot )}\) denotes the concatenation operation. \(\alpha _{\tilde{t}}\) and \(\alpha _z\) denote the confidences of the different semantic information, and the post-level similarity can be calculated as

$$\begin{aligned} \hat{S}_n=\alpha _{\tilde{t}}\hat{S}_{n}^{\tilde{t}}+\alpha _z\hat{S}_{n}^{z}. \end{aligned}$$
(13)

Meanwhile, the post-level similarity distribution can be defined as \(\textbf{d}=\left[ d_1,d_2,\cdots ,d_N \right] \in \mathbb {R} ^N\), where \(d_n=\textrm{avg}\!\,\left( \hat{\textrm{S}}_n \right)\), \(n=1,\ldots ,N\), and N refers to the total number of posts of \(u_{j}^{Y}\). The post-level similarity score is consequently computed as \(\tilde{y}_{\mathcal {H}}=\textrm{sigmoid}\!\,\left( \textbf{w}^T\textbf{d}+b \right)\), and the corresponding cross-entropy loss is denoted as \(L_{\mathcal {H}}\).
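A minimal PyTorch sketch of Eqs. (11)-(13) follows. The parameter shapes and the clamp on the logarithmic temporal weight are assumptions added for concreteness and numerical safety; they are not specified in the paper.

```python
# Minimal PyTorch sketch of the post-level similarity fusion in
# Eqs. (11)-(13). t_x, z_x are the shallow/topic vectors of u_i^X's posts
# (G x D), t_y, z_y those of u_j^Y (N x D), p_x, p_y the timestamps.
# W_t, W_z: (G, d_a); b_t, b_z: (d_a,); a: (d_a,). Shapes are assumptions.

import torch
import torch.nn.functional as F

def temporal_weights(p_x, p_y):
    diff = torch.abs(p_x.unsqueeze(1) - p_y.unsqueeze(0))     # |p_g - p_n|
    # Clamp added for numerical safety when the log term is tiny or negative.
    return 1.0 / torch.clamp(torch.log(diff), min=1e-2)       # Eq. (11)

def post_level_similarity(t_x, z_x, p_x, t_y, z_y, p_y,
                          W_t, b_t, W_z, b_z, a):
    S_t = F.cosine_similarity(t_x.unsqueeze(1), t_y.unsqueeze(0), dim=-1)
    S_z = F.cosine_similarity(z_x.unsqueeze(1), z_y.unsqueeze(0), dim=-1)
    w = temporal_weights(p_x, p_y)
    S_t, S_z = w * S_t, w * S_z                               # \hat{S}^t, \hat{S}^z

    # Attention over the two similarity views, Eq. (12)
    h_t = torch.tanh(S_t.t() @ W_t + b_t)                     # (N, d_a)
    h_z = torch.tanh(S_z.t() @ W_z + b_z)
    scores = torch.stack([h_t @ a, h_z @ a], dim=-1)          # (N, 2)
    alpha = torch.softmax(scores, dim=-1)                     # [alpha_t, alpha_z]

    S = alpha[..., :1] * S_t.t() + alpha[..., 1:] * S_z.t()   # Eq. (13)
    return S.mean(dim=-1)                                     # d_n = avg(\hat{S}_n)
```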

Account-level vector representation

Commonsense knowledge retrieval and embedding

Having obtained the similarity distribution of the different textual features, we can move forward to model the account-level representation. In fact, the same user on different social networks exhibits data distribution disparities; without data alignment across social networks, the effectiveness of identity linkage is affected [45]. Towards this end, we introduce an external knowledge base, which has been successfully used to describe user information [67], to perform the alignment of UGCs. The source of external knowledge is the ATOMIC knowledge graph [57], an event-centric knowledge graph, of which we use the If-Event-Then-Mental-State structure (e.g., “If X gives Y a gift, then Y will likely show appreciation”); this structure has shown promising performance in utterance representation tasks. More specifically, it contains three kinds of information: \(xIntent\,\,\tilde{\varepsilon }^{xI}\), the likely intents of the event; \(xReact\,\,\tilde{\varepsilon }^{xR}\), the likely reactions of the event’s subject; and \(oReact\,\,\tilde{\varepsilon }^{oR}\), the likely reactions of others. For example, given an event “x gives o a gift”, \(\tilde{\varepsilon }^{xI}\) could be “x wants to get along with o”, \(\tilde{\varepsilon }^{xR}\) could be “x feels nervous”, and \(\tilde{\varepsilon }^{oR}\) could be “o feels grateful”.

To retrieve the events most relevant to the textual information \(t_g\), we use the SBERT model [68], which has achieved great success in computing textual semantic similarity. Here, we select the MEAN pooling strategy, which computes the average of all token output vectors.

We denote the most relevant events extracted from the knowledge graph \(\mathcal {K} \mathcal {G}\) as \(\left\{ \tilde{\varepsilon }_{g,k}^{xI},\tilde{\varepsilon }_{g,k}^{xR},\tilde{\varepsilon }_{g,k}^{oR} \right\} _{k=1}^{K}\), i.e., the top-K most similar triples for the g-th post. We then use the language model Ra, fine-tuned during the topic latent vector representation, to generate the embedding vectors of the retrieved knowledge. Here, \(u_g\) is generated by \(Ra^{CLS}(t_g)\). Moreover, based on the attention mechanism, we can generate representations of the posts from the retrieved event triples, calculated as

$$\begin{aligned} \left\{ \begin{array}{c} v_k=\tanh \!\,\left( \left[ \hat{r}_{g,k},z_{n,k} \right] W_{\alpha } \right) \\ \alpha _k=\frac{exp\!\,\left( v_k\left[ z_g,u_g \right] ^{\top } \right) }{\sum _k{\mathrm {}}exp\!\,\left( v_k\left[ z_g,u_g \right] ^{\top } \right) }\\ \end{array}. \right. \end{aligned}$$
(14)

Thereafter, the embedding vector \(\varvec{\hat{R}}_g\) of a post can be calculated as

$$\begin{aligned} \varvec{\hat{R}}_g=\sum \limits _{k=1}^K{\mathrm {}}\alpha _k\hat{r}_{g,k}. \end{aligned}$$
(15)

Then, based on the self-attention mechanism, \(\varvec{\hat{R}}_g\) is aggregated by event relation types to generate \({\hat{R}}_g\).
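A minimal sketch of this retrieval-and-aggregation step is shown below, using the sentence-transformers library for the SBERT retrieval. The model name, the toy event strings, and the simplified scoring that stands in for Eq. (14) are illustrative assumptions, not the authors’ configuration.

```python
# Minimal sketch of the commonsense-knowledge step: retrieve the top-K
# ATOMIC-style events most similar to a post with SBERT, then aggregate
# their embeddings with attention weights as in Eqs. (14)-(15).
# The model name, event strings, and simplified scoring are assumptions.

import torch
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # applies mean pooling

def retrieve_top_k(post_text, kg_events, k=5):
    post_emb = sbert.encode(post_text, convert_to_tensor=True)
    event_embs = sbert.encode(kg_events, convert_to_tensor=True)
    sims = util.cos_sim(post_emb, event_embs).squeeze(0)      # (|KG|,)
    top = torch.topk(sims, k=min(k, len(kg_events)))
    return [kg_events[int(i)] for i in top.indices], event_embs[top.indices]

def aggregate_knowledge(event_embs, post_emb):
    # Simplified stand-in for Eq. (14): score each retrieved triple against
    # the post representation, then form the weighted sum of Eq. (15).
    scores = event_embs @ post_emb                            # (K,)
    alpha = torch.softmax(scores, dim=0)
    return (alpha.unsqueeze(-1) * event_embs).sum(dim=0)      # \hat{R}_g

events = ["PersonX gives PersonY a gift. xIntent: to be nice.",
          "PersonX gives PersonY a gift. oReact: grateful."]
post = "I got my friend a birthday present"
_, embs = retrieve_top_k(post, events, k=2)
r_hat = aggregate_knowledge(embs, sbert.encode(post, convert_to_tensor=True))
```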

Algorithm 1 TKM

Account representation learning

Having generated the knowledge-based representation for each post, we now focus on how to use these representations to generate account features. A naive approach is to stack the obtained vectors into a new matrix chronologically; however, this method cannot model the inherent relationships between posts. In fact, the semantic relatedness among different posts plays a pivotal role in account features, and we need to preserve semantic information while incorporating the sequential characteristics of posts. Towards this end, we propose to embed the knowledge-based vector representations \({\hat{R}}_g\) of users’ historical posts using the encoder structure of the Transformer [64], where each \({\hat{R}}_g\) is fed into the token sequence in chronological order. Given the set of \({\hat{R}}_g\) of all posts of the user, we embed the sequential factor of the posts using the positional encoding [64], which can be calculated as

$$\begin{aligned} \begin{array}{c} PE_{(\textrm{pos},2i)}=sin\!\,\left( pos/10000^{2i/\mathbb {D}} \right) \\ PE_{(\textrm{pos},2i+1)}=cos\!\,\left( pos/10000^{2i/\mathbb {D}} \right) \\ \end{array}, \end{aligned}$$
(16)

where \(\mathbb {D}\) is the dimension of \({\hat{R}}_g\) and pos is the position of the currently processed post. Then, the encoder of the Transformer is utilized to derive the account representation vector \(m_i\) for the i-th user. In particular, self-attention and multi-head attention can explore the semantic connections between posts more effectively.
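A minimal PyTorch sketch of this step is given below: the chronologically ordered \(\hat{R}_g\) vectors receive the sinusoidal positional encoding of Eq. (16) and pass through a Transformer encoder. Mean-pooling the encoder outputs into \(m_i\) is an assumption, since the paper does not specify the pooling operator.

```python
# Minimal PyTorch sketch of the account-representation step: the
# knowledge-based post vectors \hat{R}_g, ordered chronologically, are
# combined with the sinusoidal positional encoding of Eq. (16) and fed to
# a Transformer encoder. Mean-pooling into m_i is an assumption.

import math
import torch
import torch.nn as nn

def positional_encoding(num_posts: int, dim: int) -> torch.Tensor:
    pe = torch.zeros(num_posts, dim)
    pos = torch.arange(num_posts, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(pos * div)     # Eq. (16), even dimensions
    pe[:, 1::2] = torch.cos(pos * div)     # Eq. (16), odd dimensions
    return pe

class AccountEncoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, r_hat: torch.Tensor) -> torch.Tensor:
        # r_hat: (batch, num_posts, dim), posts in chronological order
        x = r_hat + positional_encoding(r_hat.size(1), r_hat.size(2)).to(r_hat)
        return self.encoder(x).mean(dim=1)  # account representation m_i

# Usage: m = AccountEncoder()(torch.randn(2, 50, 256))   # (2, 256)
```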

In addition, we propose to use a Siamese neural network to accurately generate the similarity distributions of different account representations. Accordingly, we can formulate the objective function for classification as

$$\begin{aligned} \tilde{y}_U=\textrm{Softmax}\left( W_t[m_i, m_j,|m_i-m_j|] \right) , \end{aligned}$$
(17)

where \(m_i\) and \(m_j\) denote the account representations of the i-th user on social network \(SN^X\) and the j-th user on social network \(SN^Y\), respectively, and \(W_t\) denotes the weight matrix. The cross-entropy loss function is then used to train the model, denoted as \(L_U\). Finally, the loss function of our identity linkage model is defined as \(L=L_\mathcal {H}+L_U\).
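A minimal PyTorch sketch of the Siamese head in Eq. (17) and the combined loss \(L=L_\mathcal {H}+L_U\) follows; the two-class output and the hidden size are modelling assumptions consistent with the cross-entropy formulation above.

```python
# Minimal PyTorch sketch of the Siamese classification head of Eq. (17)
# and the combined loss L = L_H + L_U. The two-class output and the hidden
# size are assumptions consistent with the cross-entropy formulation.

import torch
import torch.nn as nn

class SiameseHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.W_t = nn.Linear(3 * dim, 2)   # acts on [m_i, m_j, |m_i - m_j|]

    def forward(self, m_i: torch.Tensor, m_j: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([m_i, m_j, torch.abs(m_i - m_j)], dim=-1)
        return self.W_t(feats)             # logits; softmax gives \tilde{y}_U

# \tilde{y}_U = torch.softmax(head(m_i, m_j), dim=-1)
# Joint objective L = L_H + L_U; CrossEntropyLoss expects raw logits.
ce = nn.CrossEntropyLoss()
def tkm_loss(post_logits, account_logits, labels):
    return ce(post_logits, labels) + ce(account_logits, labels)
```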

To generate the final probability \(\tilde{y}\) of user similarity, we incorporate \(\tilde{y}_{\mathcal {H}}\) and \(\tilde{y}_U\) with different fusion strategies, where \(\tilde{y}_{\mathcal {H}}\) and \(\tilde{y}_U\) are the probabilities of two accounts belonging to the same identity. We experiment with three fusion strategies: the geometric mean of \(\tilde{y}_{\mathcal {H}}\) and \(\tilde{y}_U\), their arithmetic mean, and the maximum of the two. The default configuration is the geometric mean.
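The three fusion strategies can be summarized in a few lines; the sketch below is illustrative.

```python
# Minimal sketch of the three fusion strategies compared in the paper;
# the geometric mean is the default configuration.

import math

def fuse(y_post: float, y_account: float, strategy: str = "geometric") -> float:
    if strategy == "geometric":
        return math.sqrt(y_post * y_account)
    if strategy == "arithmetic":
        return (y_post + y_account) / 2.0
    if strategy == "max":
        return max(y_post, y_account)
    raise ValueError(f"unknown strategy: {strategy}")
```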

The procedure of TKM is summarized in Algorithm 1. TKM features a nested loop structure for pairwise similarity calculation, with an overall algorithm complexity of \(O(G{\cdot }N)\), where G and N represent the number of posts for the two users currently being compared.

Performance evaluation

Experiment settings

Datasets

To enable comparison with existing methods, two publicly available user identity linkage datasets, TWIN and TWFL, are utilized. Unlike synthetic datasets, the data in these datasets are collected from real social networks, including Twitter, Instagram, and Flickr. Specifically, each dataset pairs a microblogging platform with an image-sharing platform, with timestamps for each post.

TWIN [36]: The TWIN dataset collects each user’s latest 200 posts with timestamps from two heterogeneous platforms, i.e., Twitter and Instagram, based on the mapping pairs obtained from “#mytweet via Instagram”. Specifically, posts from Instagram have both text and images, while those from Twitter have only text. The dataset comprises posts collected from 2009 to 2018. Users with low post counts are excluded.

TWFL [69]: The TWFL dataset collects user pairs between Twitter and Flickr in 2013 by using the “Friend Finder” mechanism, which is available on major social platforms. We used the image URL provided by TWFL for each Flickr post to make the dataset adequate. Users with low post counts are excluded.

Table 2 shows the details of the two original datasets used in our experiments.

Table 2 A brief description of two datasets

Baselines

First, to evaluate the effectiveness of our method for user identity linkage, five baselines are selected as follows.

  • DPM [41] is a model based on homogeneous UGCs which treats all text content as a whole when dealing with user posts. In the experiments, DPM is conducted by merging textual posts together and generating a textual representation with Doc2Vec. Then, representations are projected to fixed dimensions using principal component analysis, and user similarity is generated using an MLP.

  • GLM [45] is a model which considers the temporal factors among posts. In the experiments, textual posts are embedded with GloVe, a word embedding model that learns word vectors from the statistical information of global lexical co-occurrence. Then, BiLSTM is employed to generate the textual representation. Additionally, the similarity between user pairs is generated by an MLP.

  • TPA [70] is a topic-aware model based on tBERT. In the experiments, similarity between user pairs is generated by taking the average similarity of pair-wise user posts which is calculated by the topic-informed BERT-based architecture.

  • UserNet-T [36] is a model with time-aware similarity generation. In the experiments, only the textual information of posts is considered, and GloVe and BiLSTM are used to generate user features. Specifically, the similarity of pair-wise user posts in closely adjacent time periods contributes more to the user similarity distribution.

  • UserNet [36] is an extension of UserNet-T. It explores users’ image and text features generated by pre-training models to tackle the user identity linkage task. In addition, it utilizes an attention mechanism to integrate the similarities of different modalities with temporal factors.

In addition, to evaluate the effectiveness of different model components, two derivations of TKM are proposed as follows.

  • TKM-NoK removes the account features, which represent the global features of user text, and only uses the other two vector representations for user identity linkage.

  • TKM-NoZ restructures TKM by removing the latent topic vector representation, in a manner similar to TKM-NoK.

Evaluation metric

In order to comprehensively evaluate the effectiveness of TKM, prediction metrics and ranking metrics are utilized to compare the matching and retrieval performance with the baseline models. Specifically, prediction metrics, including Accuracy, Precision, Recall, and F1-score, are utilized to evaluate matching performance. For retrieval performance, we use the Hit-precision based on the top-k candidates of the user identity linkage results [41], which is defined as follows.

$$\begin{aligned} h(x)=\frac{k-({\text {hit}}(x)-1)}{k} \end{aligned}$$
(18)

where \({\text {hit}}(x)\) is the position of the correctly linked user in the returned top-k candidate list. The Hit-precision can then be calculated as follows.

$$\begin{aligned} \text{ hit-precision } =\frac{1}{n} \sum \limits _{i=1}^{i=n} h\left( x_i\right) \end{aligned}$$
(19)

where n is the number of user pairs. Hit-precision measures the retrieval performance of identity linkage, i.e., the method’s ability to accurately retrieve the users most relevant to a given social network user.
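A minimal sketch of Eqs. (18)-(19) is given below. Treating hit(x) as the 1-based rank of the correct account in the top-k list, and assigning a score of 0 when it falls outside the top-k, are assumptions consistent with common usage of this metric.

```python
# Minimal sketch of the Hit-precision metric in Eqs. (18)-(19). hit(x) is
# taken as the 1-based rank of the correct account in the top-k candidate
# list; assigning 0 when the correct account is missing from the top-k is
# an assumption consistent with common usage of this metric.

def hit_precision(ranked_candidates, true_account, k: int = 5) -> float:
    top_k = ranked_candidates[:k]
    if true_account not in top_k:
        return 0.0
    hit = top_k.index(true_account) + 1          # hit(x), 1-based rank
    return (k - (hit - 1)) / k                   # h(x), Eq. (18)

def mean_hit_precision(all_rankings, all_truths, k: int = 5) -> float:
    scores = [hit_precision(r, t, k) for r, t in zip(all_rankings, all_truths)]
    return sum(scores) / len(scores)             # Eq. (19)

# Example: correct account ranked 2nd in the top-5 list -> h(x) = 0.8
print(hit_precision(["u3", "u7", "u1"], "u7", k=5))
```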

Implementation details

In our experiments, TKM and baseline models are conducted on a server with Intel i9-12900K CPU (5.1 GHz, 16 cores), 64 GB DRAM, and 2 NVIDIA RTX 2080Ti GPUs.

Before conducting the experiments, data pre-processing is performed, including removing URLs and processing emojis and tags. During the experiments, we use known matched user pairs as positive samples and randomly generate negative (unmatched) samples based on these known pairs. The experiments are set up as follows. First, the numbers of positive and negative samples are equal, and the dataset is split into 80% training, 10% validation, and 10% testing. The Adam optimizer is used with a learning rate of 0.0001. For hyperparameters such as the learning rate and batch size, we use grid search to determine the optimal values. We train our method for 200 epochs on the training and validation sets and evaluate model performance on the test set. The result of each experimental instance is reported as the average over 10 independent repetitions.
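For concreteness, the following minimal sketch reproduces the data split and optimizer settings described above; the synthetic tensors and the linear layer are placeholders for the real datasets and for TKM.

```python
# Minimal sketch of the data split and optimizer settings described above
# (80/10/10 split, Adam with learning rate 1e-4). The synthetic tensors and
# the tiny linear model stand in for the real datasets and for TKM.

import torch
from torch.utils.data import TensorDataset, random_split

data = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))
n_train = int(0.8 * len(data))
n_val = int(0.1 * len(data))
train_set, val_set, test_set = random_split(
    data, [n_train, n_val, len(data) - n_train - n_val])

model = torch.nn.Linear(16, 2)                        # placeholder for TKM
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```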

Performance evaluation and analysis

Overall performance

In this section, we compare TKM with the baselines (i.e., DPM, GLM, TPA, UserNet-T, and UserNet). Tables 3 and 4 show the overall performance of all the methods on the two datasets, TWIN and TWFL. It can be observed that the proposed TKM model outperforms the other baselines on both TWIN and TWFL. The experimental results also demonstrate that dataset quality affects model performance: the proposed TKM and the baselines perform better on TWIN than on TWFL. For example, DPM achieves a higher Hit-precision (k = 5) on TWIN by 3.38%, GLM by 7.12%, TPA by 2.51%, UserNet-T by 0.77%, UserNet by 1.2%, TKM-NoK by 0.5%, TKM-NoZ by 3.46%, and TKM by 1.14%. This is mainly because Flickr is aimed at photographers, who focus more on posting professional photos than on sharing their lives, so the hidden semantic relevance of posted content between Instagram and Twitter is stronger than that between Flickr and Twitter. For convenience, we adopt k = 5 in the subsequent comparisons of Hit-precision unless otherwise stated.

Table 3 Comparison with the baselines in terms of accuracy, precision, recall and F1-score (%)
Table 4 Comparison with the baselines in terms of Hit-precision (%)

Group 1, consisting of DPM, GLM, TPA, and TKM-NoK, evaluates the effectiveness of topic information. As depicted in Tables 3 and 4, the average Hit-precision of DPM and GLM, which lack topic information, converges to 0.2140 and 0.3413 on TWIN, respectively. TPA, which takes the average similarity of pair-wise posts’ topics, performs better than the models without a topic component in this group by at least 18.19% in Hit-precision and 7.15% in accuracy on TWIN. The best-performing model in this group is TKM-NoK, whose average Hit-precision converges to 0.6759, outperforming DPM, GLM, and TPA in retrieval performance by 3.16X, 1.98X, and 1.29X on TWIN, respectively. Besides, even without external knowledge information, the accuracy of TKM-NoK outperforms TPA, GLM, and DPM by 1.13X, 1.25X, and 1.42X on TWIN, respectively. The main reason is that the topic representation provides additional signals to the semantic information, improving feature extraction for the short texts in posts. Additionally, TKM-NoK, with its integrated attention mechanism and temporal post correlation, comprehensively models the correlation of semantic information.

Then, group 2, consisting of DPM, GLM, and TKM-NoZ, evaluates the effectiveness of knowledge. As depicted in Tables 3 and 4, TKM-NoZ, which introduces an external knowledge base to characterize the semantic features of posts, increases the average accuracy and Hit-precision. The average Hit-precision of TKM-NoZ on TWIN is improved by 44.36% and 31.63% compared with DPM and GLM, respectively, and its average accuracy on TWIN is improved by 22.68% and 14.67%, respectively. Therefore, the data distribution across different social platforms is an important factor, and introducing external knowledge can reduce its limitation.

The experimental results of group 3, involving DPM, GLM, UserNet-T, and TKM, evaluate the effectiveness of temporal modeling. It is noted that, by using temporal modeling with an attention mechanism, UserNet-T’s average Hit-precision converges to 0.5952, outperforming DPM and GLM in retrieval performance by 2.78X and 1.74X on TWIN, respectively. Besides, the accuracy of TKM-NoK also outperforms DPM and GLM by 1.28X and 1.42X on TWIN, respectively. The reason is that UserNet-T uses temporal modeling of paired posts to generate similarity distributions, while BiLSTM analyzes the global correlation between posts when extracting semantic information. Similar modeling is utilized in TKM because users are likely to post similar content or topics on different social networks within closely adjacent time periods. In Tables 3 and 4, it is observed that TKM outperforms the other three algorithms, converging to 0.7012 in Hit-precision and 0.8601 in accuracy on TWIN. This demonstrates that it is effective to use the attention mechanism to incorporate topic and shallow semantic features at the post level and the encoder structure of the Transformer at the account level to incorporate temporal factors.

In addition, TKM performs 1.16% and 2.9% better than UserNet in terms of Hit-precision and accuracy on TWIN, respectively. This confirms that the dominant role of textual information in user representations can be further strengthened by exploring multi-level latent semantic information.

The effect of different post counts

In this section, we focus on the performance of TKM and the baselines with different post counts. The number of posts affects the completeness of the user representation; in general, a greater number of posts contains a more diverse set of user characteristics. To evaluate TKM, the post count is set to 60, 90, 120, and 150, while the other parameters keep their default settings. Table 5 shows the performance of all the methods with different post counts on the two datasets, TWIN and TWFL. It can be observed that the Hit-precision of all methods except DPM presents an upward trend as the post count increases from 60 to 150. Real social network posts often contain varying levels of semantic information; consequently, methods relying on basic semantic feature representation, such as DPM and GLM, perform worse than the others. As depicted in Table 5, over several experiments the average performance of TPA improves by 11.09% on TWIN, an improvement 8.73X and 1.03X larger than those of DPM and GLM, respectively. Meanwhile, the curve of DPM on TWFL fluctuates, with the Hit-precision at post counts of 120 and 150 lower than at 90, indicating that the performance of methods based on simple post embeddings is unstable. As for GLM, which effectively models the connection between different posts based on BiLSTM, its Hit-precision on both datasets presents an upward trend.

Table 5 Comparison with the baselines in terms of post numbers (Hit-precision@Top-5) (%)

In addition, TKM-NoZ obtains a larger improvement in Hit-precision as the post count increases from 60 to 90, exceeding DPM and GLM by 12.14% and 10.54% on TWIN, respectively. This suggests that global text features can capture sufficient information for identity linkage when fewer posts are available. The average performance of TKM, TKM-NoK, and TKM-NoZ improves by 15.78%, 16.46%, and 15.53% on TWIN, respectively. The experimental results demonstrate that methods using topic information benefit more from an increase in post count, while introducing external knowledge improves model performance when posts are scarce.

The contribution of different components of TKM

In order to evaluate the contribution of latent topic features and the account features in the model, we compared TKM with the two derivations, i.e., TKM-NoK and TKM-NoZ. As seen from Table 5, TKM always outperforms its two variants across all post counts. This section uses post counts at 60 and 150 as examples to evaluate the contributions of different TKM components.

As depicted in Table 5, TKM achieves 4.11% higher Hit-precision than TKM-NoZ on TWIN and 3.11% higher on TWFL with a post count of 60. TKM also achieves 4.36% higher Hit-precision than TKM-NoZ on TWIN and 6.68% higher on TWFL with a post count of 150. These evaluations demonstrate the effectiveness of TKM when the latent topic information is utilized to enhance the shallow semantic information represented by BiLSTM in post-level feature representation.

In addition, TKM outperforms TKM-NoK by 3.21% in Hit-precision on TWIN and by 1.04% on TWFL with a post count of 60, and by 2.53% on TWIN and 1.89% on TWFL with a post count of 150. Utilizing coarse-grained account-level feature representation indeed benefits the user identity linkage task by alleviating the limitation of post-level similarity calculation. Table 5 also shows that TKM-NoK performs better than TKM-NoZ. This suggests that latent topic features, representing deep semantic information, play a more significant role in user identity linkage than the account representation. The external knowledge alignment reduces platform-related data distribution disparities, but sacrifices detailed user features that capture deep semantic information.

The effect of different fusion strategies

Firstly, we investigate the contribution of the attention mechanism at the post level. TKM-NoA removes the attention mechanism used in post-level feature similarity fusion, instead taking the average of the post-level similarities, and uses the Geometric Mean strategy to fuse the similarities generated at different levels. In particular, the attention mechanism fuses the similarities derived from different post-level features in our model. Table 6 shows that the attention mechanism improves the performance of our model. Because posts in social networks are informal, TKM-NoA cannot accurately identify the association between users; intuitively, different confidences need to be set for different representations.

Table 6 Comparison with different fusion strategies in terms of accuracy, precision, recall and F1-score (%)
Table 7 Comparison with different fusion strategies in terms of Hit-precision@Top-5 (%)

In addition, for the fusion of the post-level user similarity and the account-level similarity, we employ three strategies to compute the final user similarity. In all three strategies, all post-level features are retained, and the attention mechanism is preserved during the fusion of these features. The performance of the different similarity fusion strategies is presented in Table 7. We observe that the Max strategy performs significantly worse, while the other fusion strategies achieve better performance with only marginal differences. Furthermore, aside from the Max strategy, the choice of similarity fusion strategy has a smaller impact on model performance than the removal of different representations or of the attention mechanism. This indicates that the Max strategy is not suitable for our identity linkage model, and that choosing between the Geometric Mean and Arithmetic Mean strategies has little effect. The decisive factors for the model are the different types of user representations and the way these representations interact with each other.
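As a compact illustration, the three fusion strategies compared in Table 7 can be written as follows. The sketch assumes each strategy combines the attention-fused post-level similarity with the account-level similarity into a single score, consistent with the description above; the exact inputs and any normalization are assumptions.

```python
import math

def fuse(post_sim: float, account_sim: float, strategy: str = "geometric") -> float:
    """Combine post-level and account-level similarities into a final score."""
    if strategy == "arithmetic":   # Arithmetic Mean strategy
        return (post_sim + account_sim) / 2
    if strategy == "geometric":    # Geometric Mean strategy
        return math.sqrt(post_sim * account_sim)
    if strategy == "max":          # Max strategy
        return max(post_sim, account_sim)
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical similarity values for one candidate account pair.
post_sim, account_sim = 0.72, 0.58
for s in ("arithmetic", "geometric", "max"):
    print(s, round(fuse(post_sim, account_sim, s), 3))
```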

Conclusions and future work

In this paper, we focus on cross-social network identity linkage using features extracted from the text posts of edge-enabled IoT users. In particular, we propose a topic and knowledge-enhanced identity linkage method with attentive modeling that relies only on users’ textual information, while combining different levels of information for complete modeling of user characteristics. We also address the problem of reducing the semantic disparities caused by different data distributions across platforms. The experimental results demonstrate the effectiveness of combining latent topic features with external knowledge bases for cross-social network identity linkage. When users publish more textual content, our approach can introduce additional latent semantic signals that enhance the representational capacity of edge-enabled user information. Furthermore, we find that the semantic representation and fusion techniques exert a more significant influence on the model than the similarity fusion strategies.

Although using publicly available user posts for identity linkage improves data acquisition efficiency, our method does not exploit the image information within user posts. In future work, we plan to establish associations between different levels of text and image representations, focusing in particular on latent semantic correlations, to further enhance identity linkage performance with the abundant UGCs generated by MEC applications.

Availability of data and materials

No datasets were generated or analysed during the current study.


Funding

This work was supported in part by the National Natural Science Foundation of China (62372243).

Author information


Contributions

Rui Huang: Conceptualized and designed the methodology, contributed to the analysis of findings, and drafted the manuscript. Tinghuai Ma: Conducted extensive research on the application of identity linkage across social networks and drafted the manuscript. Huan Rong: Contributed to the analysis of findings and revised the manuscript for important intellectual content. Kai Huang: Contributed to the performance evaluation of TKM with baselines, and assisted with the interpretation of data. Nan Bi: Assisted in data collection and analysis, and contributed to the interpretation of data and insights. Ping Liu: Assisted in the code implementation of TKM and revised the manuscript for technical accuracy. Tao Du: Designed and prepared Figs. 1 and 2, and contributed to the manuscript’s organization and formatting.

Corresponding author

Correspondence to Tinghuai Ma.

Ethics declarations

Ethics approval and consent to participate

The research in this paper does not involve any illegal or unethical practices.

Consent for publication

The authors read and approved the final manuscript.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Huang, R., Ma, T., Rong, H. et al. Topic and knowledge-enhanced modeling for edge-enabled IoT user identity linkage across social networks. J Cloud Comp 13, 107 (2024). https://doi.org/10.1186/s13677-024-00659-z


Keywords