PICF-LDA: a topic enhanced LDA with probability incremental correction factor for Web API service clustering

Web API is a popular way to organize network services in cloud computing environment. However, it is a challenge to find an appropriate service for the requestor from massive Web API services. Service clustering can improve the efficiency of service discovery for its ability of reducing search space. Latent Dirichlet Allocation (LDA) is the most frequently used topic model in service clustering. To further improve the topic representation ability of LDA, we propose a new variant model of LDA with probability incremental correction factor (PICF-LDA) to generate the high-quality service representation vectors (SRVs) for Web API services. We first compute the words’ topic contribution degree (TCD) in the service description text by its context weight and part-of-speech (POS) weight. Then the probability incremental correction factor (PICF) for a word is designed based on TCD and the word’s maximum topic probability value. PICF is used to correct the probability distributions in SRVs. Experiments show that PICF-LDA has a better performance than LDA, the variant LDA models and other state-of-the-art topic models in service clustering.


Introduction
With the popularization of cloud computing technology, more and more enterprises have migrated their business systems to cloud service platforms [1]. Publishing cloud services is the main way for enterprises to encapsulate their business capability or products. Users can find suitable cloud services according to their needs [2]. As we all know, cloud computing can provide users with computing power, software and hardware resources. Therefore, it can greatly save the time and cost for the tenants to build new business systems [3].
Web API service is a common service publishing way in cloud computing environment [4]. Many enterprises have published a lot of web API services. For example, Google provides map service by multiple Web APIs, such as Static Maps API, Street View Image API, Distance Matrix API, Roads API, and Time Zone API. Given a road, we can embed its street view image with the speed limit into a web page by invoking Street View Image API and Roads API. Taking the website ProgrammableWeb as an example, we can find more than 25,000 web API services by the end of May 2022. It is convenient for us to build a new value-added application system by invoking these Web APIs. However, how to find an appropriate service for the users is becoming a challenge for the increasing number of Web API services published on the Internet [5].
Service clustering can group similar cloud services as a service cluster. It can reduce search space and improve the efficiency of service discovery [6]. Service clustering is widely used in service discovery. Early research on service clustering and discovery mainly focused on Web services described by WSDL. A WSDL document is a structured text with many tags. We can easily extract the feature information of Web services from the WSDL document to achieve service clustering [7].
Currently, the function descriptions of Web API services are mostly organized by natural language [8]. The developers presented a short piece of text to describe the function, performance and usage of the Web API service. It is difficult to obtain the feature information about Web API services for the lack of tags in the service description texts [9]. To get the key feature of service description text, many researchers have applied topic models to generate topic vectors for Web API services [10][11][12][13]. These topic vectors are generated by service function description text. They are called service representation vectors (SRVs) in this paper. Service clustering for Web API services was carried out by computing the similarity of these SRVs.
LDA is easy to use and robust in topic modeling [14]. It is widely used in service clustering, text mining, sentiment analysis and other fields. In many clustering methods, LDA and its variant models are employed to generate SRVs. However, LDA does not consider the mutual position, semantic similarity and occurrence frequency of the words in the documents. In addition, some words with poor relevance to the topic limit the quality of the topic vectors generated by LDA [15]. To improve the expressive ability and semantic balance of topic modeling, we present a topic-enhanced LDA with a probability incremental correction factor for service clustering. The main contributions of this paper are as follows.
(1) Topic contribution degree (TCD), calculated from a word's occurrence frequency, context similarity and part-of-speech, is proposed to evaluate the importance of the word in generating SRVs.
(2) We design a probability incremental correction factor (PICF) for each word based on TCD. The quality of the SRVs generated by LDA is improved by using PICF to correct the topic probability distribution.
(3) An extensive set of experiments is conducted to evaluate the performance of PICF-LDA. The experimental results show that PICF-LDA outperforms LDA, several variant LDA models and state-of-the-art topic models in service clustering.
The rest of this paper is organized as follows. Related work introduces the related works on service clustering and LDA model. How to compute topic contribution degree for the words in service description text is presented in Topic contribution degree. PICF-LDA elaborates the proposed PICF-LDA. Experiment verifies the effectiveness of PICF-LDA by service clustering experiments. Finally, conclusions and future work are presented in Conclusions.

Related work
Previous study on service clustering is mainly about the clustering methods of Web services which were described by Web service description language (WSDL). There are many tags in the WSDL documents, such as type, operation, input and output [16]. The feature information used in Web service clustering can be easily extracted from WSDL documents by these tags.
Kumara proposed a new approach to cluster Web services by mining WSDL documents and generating an ontology via hidden semantic patterns within the complex terms in service features to measure similarity [17]. Wu clustered Web services by utilizing both WSDL documents and tags to handle the clustering performance limitation caused by uneven tag distribution and noisy tags. He employed tag co-occurrence, tag mining, and semantic relevance measurement for tag recommendation [7]. Agarwal proposed an approach based on Length Feature Weight. It is used to vectorize the pre-processed WSDL files after preprocessing the WSDL documents. Experiments have proved that the proposed method outperforms the clustering done by using TF-IDF method for vector space representation of web services [18].
With the increasing number of cloud services described in natural language, it is difficult to obtain the service features from their description texts. Topic models are therefore widely used in current service clustering. They are employed to extract topic features from the description texts of the cloud services, and these topic features are used to perform service clustering. The topic models applied in service clustering mainly include LSA, LDA, BTM, HDP, and DMM [19]. Among these models, LDA and its variants are the most widely used. Many researchers have sought to improve the performance of the traditional LDA model. For example, Chen proposed WT-LDA, which seamlessly integrates tagging data and WSDL documents through augmented LDA [20]. Zhao presented a semantic Web service discovery method based on LDA clustering that learns from OWL-S Web service documents and enriches the documents with semantic information [21]. Shi put forward WE-LDA, which uses word2vec to obtain word vectors and clusters the words into word clusters by K-means++. These word clusters are incorporated into the semi-supervised training process of LDA [22].
Bukhari proposed a Web service search engine (WSSE) that extracts topics from Web service descriptions based on LDA. WSSE is based on probabilistic topic modeling and clustering techniques that are integrated to support each other by discovering the semantic meaning of Web services and reducing the search space [23]. Zhao employed Word2vec to adapt the representation of services and learned a list of similar words in the service corpus. Moreover, TF-IDF was integrated into the similarity calculation to filter noise words. This method enhances LDA with a similar-word finding and filtering strategy for service clustering [24]. In follow-up work, Zhao proposed a model named HRT-LDA, which considers the effects of different tags on clustering performance. A tag filtering and appending strategy based on transfer learning, Word2vec, TF-IDF and semantic computing is integrated into LDA. Experiments show that HRT-LDA outperforms the state-of-the-art LDA in service clustering [25].
Baskara modeled the Web service structure as a Weighted Directed Acyclic Graph (WDAG). The Biterm Topic Model was then employed to mine the topics on the WDAG for high-precision service similarity calculation [26]. To improve topic modeling accuracy, Hu and Liu proposed SP-BTM, which only chooses words with specific parts of speech to form biterms [27]. After using the HDP model to solve the dimension problem of service vectors, Cao adopted a SOM neural network to cluster these service vectors [28]. Fletcher deployed the HDP technique to extract topics from service descriptions and user requirements to enhance the discovery of services [29].
Some clustering methods in other fields also have enlightening significance for us to improve the quality of service clustering. For example, Li proposed an adaptive time interval clustering algorithm based on density grid. The algorithm can perform adaptive time-interval clustering according to the size of the real-time ship trajectory data stream, so that a ship's hot zone information can be found efficiently and in real-time [30]. Zhao presented an efficient framework to cluster previous summaries with new data. It significantly outperforms the existing incremental face clustering methods [31]. Xue developed a novel density-based clustering approach for incomplete data based on Bayesian theory, which conducts imputation and clustering concurrently and makes use of intermediate clustering results. Experimental results show the effectiveness of the proposed method [32].
A data-driven clustering recommendation method, called DDCR, is proposed to recommend hierarchical clustering or spectral clustering for scRNA-seq data. They perform DDCR on two typical single cell clustering methods, SC3 and RAFSIL, and the results show that DDCR recommends a more suitable downstream clustering method for different scRNA-seq datasets and obtains more robust and accurate results [33]. Hu presented a two-level weighting strategy to measure the importance of views and features. A collaborative working mechanism is introduced to balance the within-view clustering quality and the cross-view clustering consistency [34]. Xiong proposed a semantic clustering news keyword extraction algorithm based on TextRank. It uses the word vectors and k-means clustering to obtain semantic clustering. The proposed algorithm has greater precision, recall, and F1 value than the traditional TextRank and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms [35].

Topic contribution degree
To make up for the deficiency of LDA in considering the words' mutual position, semantic similarity and occurrence frequency, we propose the concept of TCD to express the importance of a word in generating the SRVs. TCD calculates the words' weights in generating SRVs from three aspects: word context similarity, word frequency and part-of-speech. Relevant definitions about cloud service and TCD are presented as follows.
(1) Id is the ID number of the service;
(2) n is the name of the service;
(3) t is the set of service tags;
(4) d is the service description text;
(5) c is the category of the service.
Definition 2. (service representation vector) Given a Web API service s, if $srv = (k_1, k_2, \ldots, k_i, \ldots, k_n)$ is the topic vector generated by a topic model based on s.d, then srv is called the service representation vector of s.
Definition 3. (semantic similarity of words) Let $w_i$ and $w_j$ be two words in a piece of text T, and let $V_{w_i}$ and $V_{w_j}$ be the word vectors of $w_i$ and $w_j$, respectively. The semantic similarity of words $w_i$ and $w_j$ is defined as
Here, $TF_{w_i} = N_{w_i} / N_w$, where $N_{w_i}$ and $N_w$ are the number of occurrences of $w_i$ in d and the total number of words in d, respectively. $IDF_{w_i} = \lg \frac{|D|}{|\{j : w_i \in d_j\}|}$, where $|D|$ is the number of documents in D and $|\{j : w_i \in d_j\}|$ is the number of documents that contain the word $w_i$.
For each word w in d, the POS weight, denoted as PW(w, d), is a weight value assigned on w by its part-of-speech in d.
A word reflects the text topic better if it appears frequently or has a high semantic similarity with other words. From Definition 6, the context weight of a word reflects the importance of the word in the text from the perspectives of word frequency and semantic similarity. It can therefore be used to evaluate the importance of a word in generating SRVs.
In the service description, nouns are usually used to describe the function and operation objects of a cloud service. Verbs are mostly used to describe the operations or tasks contained in this service. Adjectives are commonly adopted to evaluate the quality of the cloud service. By introducing the part-of-speech of words, we further distinguish the importance of various words in the service description text when they are used to generate SRVs. The SRVs can be further optimized once the words in the service description are given different part-of-speech weights.
To comprehensively consider the influence of context weight and POS weight on generating SRVs, we present a concept of topic contribution degree (TCD).
Definition 8 (topic contribution degree) The topic contribution degree of a word w in document d is defined as TCD(w, d) = CW(w, d)*PW(w, d).
How to compute the TCD for the words in the service description text is presented in Algorithm 1. Two empty sets, Corpus_w and TCD_S, are first initialized in line (1). Corpus_w is used to store the service description texts for the cloud services in S. TCD_S is the set of TCD values for the words in the service description texts. All the words in the services' description texts are added to Corpus_w by the code from line (2) to line (4). Then, Word2vec is employed to train the vectors for each word in Corpus_w. These vectors are used to calculate the semantic similarity between the words in the context weight.
We compute the TCD for each word from line (6) to line (12). The word's context weight is obtained from its context fitness and TF-IDF in line (8). The tools NLTK pos_tag and stanfordcorenlp are used to determine the part-of-speech of every word in a service description text. The TCD value is computed from the word's context weight and POS weight in line (9). It should be noted that the POS weight is set as a hyperparameter. We take the quality of SRVs as the optimization goal and tune the POS weights for words with different parts of speech by parameter adjustment. The evaluated TCD value is added to the set TCD_S. The algorithm finally returns the set TCD_S.
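The TCD computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Algorithm 1: the definition of context weight is not reproduced in this text, so the sketch assumes that context fitness is the mean cosine similarity between a word and the other distinct words of the description, and that context weight is the product of context fitness and TF-IDF. The helper names, the toy word vectors (standing in for Word2vec output) and the `pos_of` lookup (standing in for an NLTK tagger) are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def tf_idf(word, doc, corpus):
    """TF = N_w / N and IDF = lg(|D| / document frequency); doc must belong to corpus."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in corpus if word in d)
    return tf * math.log10(len(corpus) / df)

def context_fitness(word, doc, vectors):
    """ASSUMPTION: mean cosine similarity to the other distinct words of the document."""
    others = [w for w in set(doc) if w != word]
    if not others:
        return 0.0
    return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)

def tcd(word, doc, corpus, vectors, pos_of, pos_weight):
    """TCD(w, d) = CW(w, d) * PW(w, d), following Definition 8."""
    # ASSUMPTION: context weight = context fitness * TF-IDF.
    cw = context_fitness(word, doc, vectors) * tf_idf(word, doc, corpus)
    # POS weight is a hyperparameter; untagged or unweighted words get 0.
    pw = pos_weight.get(pos_of[word], 0.0)
    return cw * pw
```

Here `doc` is a list of tokens, `corpus` is a list of such lists, and `pos_weight` maps a part-of-speech label to its weight, so changing the POS weights only requires passing a different dictionary.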

PICF-LDA
LDA is a topic model that learns the hidden topic information of existing documents and returns topic vectors in the form of probability distributions. It is an efficient tool in topic modeling and text clustering. The graphical model of LDA is shown in Fig. 1. Here, n denotes the nth word in document d; α and η are the proportions parameter and the topic parameter, respectively; K is the number of topics; $\theta_d$ is the per-document topic proportions; $Z_{d,n}$ represents the document-word matrix, while $W_{d,n}$ represents the words' probability distribution. LDA assumes that the prior distribution of document topics is a Dirichlet distribution. For any document d, its topic distribution is $\theta_d = \mathrm{Dirichlet}(\vec{\alpha})$.
Here, $\vec{\alpha}$ is a hyper-parameter vector with K dimensions. Then, the topic assignment of the nth word in document d can be calculated as $z_{dn} = \mathrm{multi}(\theta_d)$. Similarly, LDA assumes that the prior distribution of words in a topic is a Dirichlet distribution; that is, for any topic k, its word distribution is $\beta_k = \mathrm{Dirichlet}(\vec{\eta})$.
Finally, the observed word probability distribution of $w_{dn}$ is $w_{dn} = \mathrm{multi}(\beta_{z_{dn}})$. As shown in formula (1), the joint distribution of all the visible variables and the hidden variables in the LDA model can be approximated by Gibbs sampling.
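For reference, the joint distribution of the standard LDA model can be written in the notation above as follows. This is the textbook form; it is assumed, not confirmed, to coincide with formula (1). D denotes the number of documents and N the number of words in a document, both introduced here for illustration.

```latex
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
  = \prod_{k=1}^{K} p(\beta_k \mid \eta)
    \prod_{d=1}^{D} \Bigl( p(\theta_d \mid \alpha)
    \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\,
                    p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \Bigr)
```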
We propose a new variant model of LDA with a probability incremental correction factor (PICF-LDA) to generate high-quality SRVs. The graphical model of PICF-LDA is shown in Fig. 2. An incremental correction factor is introduced to correct the probability distribution values of SRVs. The incremental correction factor for word n in document d is denoted PICF(d, n). The value of PICF(d, n) is the product of TCD(d, n) and argmax($W_{d,n}$). Here, argmax($W_{d,n}$) refers to the maximum probability value over all the topics for n in the word-topic distribution matrix $W_{d,n}$.
PICF-LDA is an improved LDA model. It enhances the quality of topic vectors by utilizing PICF to correct the topic probability distribution. Algorithm 2 shows how to generate SRVs by PICF-LDA. An empty set EQ_srv is initialized in line (1). It is used to store the enhanced-quality SRVs for the cloud services s in S. Then, the preprocessed Web service description texts are sent to the LDA model to obtain srv(s), $W_{d,n}$ and $\theta_d$ in line (3). Here, srv(s) is the SRV for cloud service s, $W_{d,n}$ is the word-topic matrix and $\theta_d$ is the document probability distribution. The correction of SRVs is performed in line (6) to line (12). Each word in the service description text is processed as follows when PICF-LDA performs probability correction.
For the word n in s.d, we first find its maximum topic probability argmax($W_{d,n}$) and the related topic (denoted as $k_{max\_topic}$). Then the correction factor for the word n is computed as PICF(d, n) in line (9). Here, we introduce λ to balance the order of magnitude of PICF. Finally, the value of $\theta_{k_{max\_topic},d}$ in srv(s) is updated by $\theta_{k_{max\_topic},d} \cdot PICF(d, n)$. Algorithm 2 returns the enhanced-quality SRVs in line (13).
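The correction step can be illustrated with a short Python sketch. It is one possible reading of the description, not a reproduction of Algorithm 2, and it rests on three assumptions that the text does not settle: λ enters as a multiplicative scale, so PICF = λ · TCD · (maximum topic probability); "updated by θ · PICF" is read as an increment θ ← θ + θ · PICF; and the corrected vector is optionally renormalized to sum to 1. The function and parameter names are hypothetical.

```python
def picf_correct(theta_d, word_topic, tcd_values, lam=1.0, renormalize=True):
    """Correct one document-topic vector with per-word PICF values.

    theta_d     list of K topic probabilities for one service description
    word_topic  dict: word -> list of K topic probabilities for that word
    tcd_values  dict: word -> TCD(w, d)
    lam         scale factor lambda (ASSUMPTION: applied multiplicatively)
    """
    corrected = list(theta_d)
    for word, probs in word_topic.items():
        p_max = max(probs)              # maximum topic probability of the word
        k_max = probs.index(p_max)      # the topic where that maximum occurs
        picf = lam * tcd_values.get(word, 0.0) * p_max
        # ASSUMPTION: incremental update of the dominant topic's probability.
        corrected[k_max] += corrected[k_max] * picf
    if renormalize:
        # ASSUMPTION: rescale so the corrected vector is again a distribution.
        total = sum(corrected)
        corrected = [p / total for p in corrected]
    return corrected
```

Under this reading, a word with a high TCD and a sharply peaked topic distribution pushes the service vector toward that word's dominant topic, while words with a TCD of zero leave the vector unchanged.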

Experiment
In this section, we carry out a series of clustering experiments to verify whether the quality of the SRVs generated by PICF-LDA is better than that of the SRVs generated by LDA and other topic models. The pipeline for cloud service or Web API clustering is shown in Fig. 3. All the experiments were conducted on a PC with an Intel i7-8750H CPU and 16 GB RAM under Windows 10.
We first crawled the Web API services from cloud platforms such as ProgrammableWeb and Casicloud. After preprocessing, the description information of these API services is transformed into service clustering texts. Then PICF-LDA is employed to generate SRVs based on the service clustering texts. Finally, the K-means++ algorithm is used to cluster these Web API services based on the SRVs generated by PICF-LDA. The performance of the proposed PICF-LDA is compared with LDA, the variant LDA models and other state-of-the-art topic models on three evaluation metrics.

Dataset
An example of a Web API service crawled from ProgrammableWeb is provided in the left part of Fig. 4. The main information that we can get about a Web API service includes the service name, service tag, service category and service description text. The following steps are used to process the service description texts.
(1) Text splitting: The words in the service description are separated by spaces.
(4) Lemmatization: Lemmatization is used to convert a word with tense or voice changes into its root form.
(5) Addition of service tag and category: Words in the service tag and category are added to the service description text.
After the above five steps, we have generated new description texts that can be used for service clustering. The text on the left side of Fig. 4 is the service description text processed by the above steps for "Tweet Archivist API".
We crawled 22,832 Web API services from ProgrammableWeb. To ensure the quality of the services participating in service clustering, a Web API service was removed if its service description contained fewer than 15 words. Categories with fewer than 20 services were also eliminated. The final data set in our experiment contains 19,241 Web API services. They are preprocessed to generate the new service description texts for the service clustering experiments. We rank the service categories according to the number of Web services they include. The top 20, 50, 80 and 132 Web API service categories are chosen as the classification benchmark categories. They are named DS1, DS2, DS3 and DS4, respectively. We carry out the experiments on these data sets with different granularity. Table 1 presents the outline of the datasets.
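The filtering and ranking rules above can be sketched as follows. This is an illustrative sketch under stated assumptions: each service is represented as a dictionary with hypothetical `description` and `category` keys, and the category sizes are counted after the short descriptions have been removed (the text does not say which filter is applied first).

```python
from collections import Counter

def build_dataset(services, min_words=15, min_category_size=20, top_n=None):
    """Filter services and keep only the top_n largest remaining categories."""
    # Drop services whose description has fewer than min_words words.
    kept = [s for s in services if len(s["description"].split()) >= min_words]
    # ASSUMPTION: category sizes are measured after the description filter.
    counts = Counter(s["category"] for s in kept)
    # Rank categories by size and drop those below the minimum size.
    ranked = [c for c, size in counts.most_common() if size >= min_category_size]
    if top_n is not None:
        ranked = ranked[:top_n]
    allowed = set(ranked)
    return [s for s in kept if s["category"] in allowed]
```

With the default thresholds, calling `build_dataset(services, top_n=20)` would correspond to the way DS1 is described, and `top_n=50`, `80` and `132` to DS2, DS3 and DS4.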

Evaluation Metrics
Let X = {x 1 , x 2 , ..., x k } and Y = {y 1 , y 2 , ..., y k } be the predicted clustering labels and the real category labels, respectively. The following evaluation metrics are employed to observe the performance of different topic models.

Normalized Mutual Information (NMI)
NMI is used to evaluate the degree of consistency between two samples. It is the normalization of mutual information (MI) score. The calculation method is shown in formula (2).
MI(X, Y) is the mutual information of X and Y. It reflects the degree of correlation between X and Y. H(X) and H(Y) represent the entropies of X and Y, respectively. F is the normalization function. The range of the NMI value is [0,1]. A higher score means better clustering quality in terms of NMI.
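As a concrete illustration, NMI = MI(X, Y) / F(H(X), H(Y)) can be computed directly from the two label lists. The sketch below assumes that F is the arithmetic mean of the two entropies, which is a common choice; the text does not state which normalization was used. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """Mutual information between two label assignments of equal length."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (n_ab / n) * math.log(n * n_ab / (count_x[a] * count_y[b]))
        for (a, b), n_ab in count_xy.items()
    )

def nmi(x, y):
    """NMI with F = arithmetic mean of H(X) and H(Y) (ASSUMPTION)."""
    norm = (entropy(x) + entropy(y)) / 2
    # Both assignments are single-cluster when norm == 0; treat as identical.
    return mutual_information(x, y) / norm if norm > 0 else 1.0
```

Because only co-occurrence counts are used, the score does not change when cluster labels are renamed, which is what makes it suitable for comparing predicted clusters against real categories.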

Fowlkes-Mallows score (FMI)
FMI is defined as the geometric mean of the pairwise precision and recall. The calculation method is shown in formula (3).
where TP is the number of true positives (the number of positive samples predicted as the positive class), FP is the number of false positives (the number of negative samples predicted as the positive class) and FN is the number of false negatives (the number of positive samples predicted as the negative class).
The range of the FMI value is [0,1]. A higher score means better clustering quality in terms of FMI.
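A minimal sketch of the score, FMI = TP / sqrt((TP + FP)(TP + FN)), is given below. It uses the usual pair-counting convention for clustering, which is an interpretation of the sample-level wording above: a pair of services counts as a true positive when it shares a cluster in both the prediction and the real categories. The function name is hypothetical.

```python
import math
from itertools import combinations

def fmi(pred, true):
    """Fowlkes-Mallows score computed over all unordered pairs of samples."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = true[i] == true[j]
        if same_pred and same_true:
            tp += 1      # pair grouped together in both assignments
        elif same_pred:
            fp += 1      # grouped together only in the prediction
        elif same_true:
            fn += 1      # grouped together only in the real categories
    if tp == 0:
        return 0.0
    return tp / math.sqrt((tp + fp) * (tp + fn))
```

The pairwise loop is quadratic in the number of samples, so for a data set of the size used here a contingency-table implementation would be preferable; the loop is kept only because it mirrors the definition.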

Silhouette Coefficient (SC)
For a sample x, let a be the average distance from x to the other samples in the same category, and b be the average distance from x to the samples in the nearest different category. The Silhouette Coefficient of x is given by formula (4).
The range of SC value is [-1,1]. The higher score means the better clustering quality in view of SC.
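The per-sample score (b - a) / max(a, b) and its average over all samples can be sketched as follows. The sketch assumes Euclidean distance between SRVs and assigns a score of 0 to a sample that is alone in its cluster; both choices are conventions adopted here for illustration rather than details stated in the text.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def silhouette_coefficient(points, labels):
    """Mean silhouette score over all samples (Euclidean distance assumed)."""
    scores = []
    for i, p in enumerate(points):
        # Group the distances from p to every other sample by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            # Singleton cluster, or only one cluster overall: score 0 by convention.
            scores.append(0.0)
            continue
        a = _mean(own)                                  # within-cluster distance
        b = min(_mean(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return _mean(scores)
```

Unlike NMI and FMI, this metric needs only the vectors and the predicted labels, so it measures how compact and well separated the clusters are without reference to the real categories.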

Baseline models
To verify the performance of PICF-LDA, we chose LDA and the following three variant LDA models as baseline models on DS1 to DS4. Meanwhile, the topic models HDP, BTM, DMM and LSA were employed to perform service clustering. The service clustering quality of our method was also compared with that of these state-of-the-art topic models.
(1) LDA-K: This method uses the traditional LDA and K-means++ to perform service clustering.
(2) WE-LDA [21]: In WE-LDA, the word vectors obtained by Word2vec are clustered into word clusters by the K-means++ algorithm, and these word clusters are incorporated to semi-supervise the LDA training process, which can elicit better distributed representations of Web services.
(3) ST-LDA [23]: In ST-LDA, Word2vec is adopted to adapt the representation of services and to learn a list of similar words in the service corpus. TF-IDF is integrated into the similarity calculation to filter noise words for LDA.
(4) HRT-LDA [24]: In HRT-LDA, the effects of different tags on clustering performance are considered. A tag filtering and appending strategy based on transfer learning, Word2vec, TF-IDF and semantic computing is integrated into LDA.

Result and comparison
Compared with the traditional LDA model, the probability incremental correction factor is added into PICF-LDA. PICF consists of three parts: context weight, TF-IDF and POS weight. The words' context weights and TF-IDF values can be calculated by Algorithm 1. The POS weight is a hyperparameter, which needs to be set by parameter adjustment.
We take DS1 as an example to set the POS weights in this section. All the words in the service description texts are divided into nouns, verbs, adjectives and adverbs according to their parts of speech. The adjectives and adverbs were given the same POS weight in this experiment. We use PWn, PWv and PWa to denote the POS weights for nouns, verbs, and adjectives and adverbs, respectively. The evaluation metrics SC, NMI and FMI were investigated during the adjustment of the POS weights.
To find the optimal POS weights in view of SC, NMI and FMI together, we use CS (comprehensive score), the sum of the values of the three evaluation metrics. Table 2 provides the scores of the evaluation metrics with different POS weights in our experiment. We can see that the highest quality of SRVs appears at the POS weights (0.1, 0.4, 0.5). That is, PICF-LDA shows the best performance when the weights of nouns, adjectives and verbs are set as 0.1, 0.4 and 0.5 in DS1.
After determining the POS weights for each dataset, we verify the quality of SRVs by service clustering experiments on the datasets DS1 to DS4. The K-means++ algorithm is employed to perform service clustering.
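The weight selection by comprehensive score can be sketched as a small grid search. This is an illustration only: the search grid is not described in the text, so the sketch assumes candidate triples (PWn, PWv, PWa) of positive multiples of 0.1 that sum to 1, which is consistent with the reported triple (0.1, 0.4, 0.5). The `evaluate` callback is hypothetical; it stands for running PICF-LDA and clustering with the given weights and returning the three metric values.

```python
def candidate_pos_weights(step=10):
    """Yield (PWn, PWv, PWa) triples of positive multiples of 1/step summing to 1."""
    for i in range(1, step - 1):
        for j in range(1, step - i):
            k = step - i - j
            yield (i / step, j / step, k / step)

def select_pos_weights(evaluate, step=10):
    """Return the candidate with the highest CS = SC + NMI + FMI.

    evaluate(weights) must return a (sc, nmi, fmi) tuple for one candidate.
    """
    best_weights, best_cs = None, float("-inf")
    for weights in candidate_pos_weights(step):
        sc, nmi_score, fmi_score = evaluate(weights)
        cs = sc + nmi_score + fmi_score
        if cs > best_cs:
            best_weights, best_cs = weights, cs
    return best_weights, best_cs
```

Summing the three metrics gives them equal influence on the choice, even though SC ranges over [-1,1] while NMI and FMI range over [0,1].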
We compare the performance of PICF-LDA with that of the variant LDA models. The scores of the three evaluation metrics for the different datasets are presented in Figs. 5, 6, 7 and 8. We can see that PICF-LDA achieves better performance than the other models on every evaluation metric and dataset. This indicates that the introduction of PICF has improved the quality of SRVs. The performance improvement of PICF-LDA over the given models on each evaluation metric is shown in Table 3. PICF-LDA improves the topic extraction performance of the traditional LDA model by more than 22%. Compared with the other three state-of-the-art variant models, PICF-LDA also raises the scores of the evaluation metrics by 4%-28.2%. Therefore, PICF-LDA is effective in extracting topic information for service clustering.
Besides LDA, the models HDP, BTM, DMM and LSA are also commonly used topic models for service clustering. We compared the performance of PICF-LDA with that of these state-of-the-art topic models. The scores of NMI, FMI and SC for the different datasets are presented in Figs. 9, 10, 11 and 12. We can see that PICF-LDA achieves better performance than these topic models on every evaluation metric and dataset. Table 4 shows the improvement ratio of the different evaluation metrics on the four data sets. The scores of NMI, FMI and SC have been increased by 17.8% to 32.8%, 23.6% to 33.1%, and 5.6% to 39.2%, respectively.

Conclusions
Although there are many new topic models, LDA is still widely used in service clustering for its ease of use and robustness. Researchers have also proposed a series of improved models based on LDA, and these variant models perform well in topic extraction. To further improve the performance of LDA in service clustering, we proposed an improved LDA model named PICF-LDA. TCD is first presented to help determine the contribution of words in topic extraction. Then PICF is designed to correct the probability distribution of SRVs. PICF-LDA can generate high-quality SRVs for cloud services and improve the quality of service clustering. Our experiments verify that the quality of the SRVs generated by PICF-LDA is better than that of LDA and its variant models. Meanwhile, PICF-LDA also has better topic extraction performance than state-of-the-art topic models in service clustering.
In future work, we will investigate how to improve the performance of PICF-LDA from the perspective of feature word extraction. We will also apply PICF to other topic models to verify the general effectiveness of the proposed method.