A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures
Journal of Cloud Computing volume 9, Article number: 2 (2020)
Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchy structure that, combined with the produced clusters, can be useful in managing documents, thus making the browsing and navigation process easier and quicker, and providing only relevant information to the users’ queries by leveraging the structure relationships. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm that creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm which composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module which makes use of graph theory to obtain insights in the leaf clusters’ connections. The feature vectors representing each document derive from topic modeling. At the implementation level, the clustering method has been dockerized in order to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving significant reduction in both memory consumption and computational time over existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups and the results are promising.
Hierarchical clustering has been proven to be a useful technique in the field of document organization, as it constructs a hierarchy structure of document collections and sub-collections. Such a structure can make the browsing and navigation process easier and quicker  by hiding irrelevant information from the users. Since each cluster and the corresponding sub-clusters represent a set of topic and sub-topics relationships , the hierarchy can help automated systems to return only relevant information to the user, by exploiting the relationships stored in the structure. Moreover, the hierarchy can be used to visualize and interactively explore large amounts of documents . Finally, the hierarchy may be used as a decision tree for the categorization of new documents. However, existing solutions for hierarchical document clustering are faced with serious challenges.
Some of the current problems with document clustering  include the selection of appropriate document features and similarity measures, the quality assessment of the clusters, the implementation of an efficient clustering algorithm which can make optimal use of the available memory and CPU resources, the association of meaningful labels to the final clusters, and the consideration of the semantic relationships between words. Hierarchical document clustering methods have to deal with additional challenges, including the handling of the very high dimensionality of the data. A medium to large set of documents can contain over 10,000 documents; this means that there can be millions of term-document relations, thus leading to an extremely high computational complexity and memory usage. This issue arises from the way most classical hierarchical clustering methods are implemented: they are based on the formulation of high dimensional distance matrices, used for pairwise comparisons between all the available data points.
The high volume of documents that have to be handled daily on the web presents a challenge to a cloud environment as well. In order to provide efficient solutions, researchers are increasingly turning towards scalable approaches, such as the utilization of cloud resources in addition to local computational infrastructures. Running big data analytics algorithms on cloud computing infrastructures appears to be a promising solution. Cloud computing provides shared computing resources on demand over the Internet, including large numbers of compute servers and other resources, which can be scaled up and down according to the computational requirements. The topology of the computers in the cloud is usually hidden from the end user.
Taking all these issues into account, this work focuses on implementing a scalable hierarchical clustering algorithm for document clustering. It attempts to overcome the limits that memory constraints place on the number of documents existing algorithms can handle, and to reduce the overall computational time. The innovation of our proposed algorithm lies in the fact that, instead of constructing an N×N similarity matrix by computing the pairwise similarities between all data points of the dataset in order to construct the hierarchical tree, we build a low-dimensionality frequency matrix as a representation of the root cluster. This cluster is then split recursively while moving down the hierarchy, which significantly reduces the memory requirements. Additionally, the implementation of this work is based on a distributed computing architecture and can therefore handle an increasing number of documents, depending on the available resources. The input of our algorithm consists of documents represented as a bag of topics derived from topic modeling. The documents are transformed into the format required by the algorithm during the preprocessing and transformation phases of our proposed framework. The whole framework has been dockerized in order to facilitate easy deployment on cloud computing infrastructures.
This work is an extended version of our previous work  that presented a multi-metric hierarchical clustering framework for item clustering. Here, we extend the previous work by re-designing our framework in order to be applicable to the more general field of document clustering, and we add a meta-clustering module to the framework. We explore the effectiveness and the performance of our method regarding memory usage and computational time through a more detailed evaluation and many more experiments, utilizing several datasets of varying sizes and content. We compare the results with more baseline hierarchical clustering methods, and we make use of the external evaluation metric FScore. Furthermore, we extend the previous work by parallelizing our clustering algorithm to achieve scalability, we make it suitable for cloud execution using a virtualization solution, and we measure the performance of the method using different hardware resources.
The rest of the paper is organized as follows: the “Literature review” section discusses related work, while the proposed integrated framework for document clustering is analysed in the “A new document clustering framework” section. In the “The hierarchical clustering algorithm” section our hierarchical clustering algorithm is detailed, whereas the “Experiments and evaluation” section contains the experimental results and the clustering evaluation. Finally, conclusions and future work are highlighted in the “Conclusion” section.
Topic modeling in document clustering
Getting from an initial collection of documents to a clustering of the collection is an elaborate procedure, which usually involves several stages. The basic operations are feature extraction and selection, document representation and clustering . Feature extraction is usually the first step of the process and filters out non-appropriate words from the documents’ descriptions. Feature selection is a preprocessing method that removes noisy features and reduces the dimensions of the feature space, in order to yield a better understanding of the data and overall better performance of the clustering method that takes as input those data. In the feature selection stage, various probabilistic models have been used in the literature, like Latent Dirichlet Allocation (LDA)  and Probabilistic Latent Semantic Analysis (PLSA) . Today, a lot of research works around topic modeling focus on distributed implementations of LDA, such as AD-LDA , PLDA  and PLDA+ . BigARTM  is another distributed implementation for topic modeling which includes all popular models such as LDA, PLSA, and many others. Other approaches make use of deep learning techniques for topic extraction (e.g. lda2vec ).
Ahmadi et al. showed that topic-model-based clustering methods generally achieve better results than applying only traditional clustering algorithms such as K-means. LDA has been used in many papers for representation and dimensionality reduction of text documents, as well as for uncovering semantic relations in the text. Ma et al. used LDA for document representation and identification of the most significant topics; the K-means++ algorithm was used to define the initial centers of the clusters and the K-means algorithm was used to form the final clusters. Qiu and Xu presented a clustering method in which LDA was used to extract topics from the texts and the centroids of the K-means algorithm were selected among the nouns with the highest probability values. More recently, Onan et al. proposed an improved ant clustering algorithm, in which two novel heuristic methods enhance the clustering quality of ant-based clustering and LDA is used to represent textual documents. Apart from the classical LDA method, many variants have been examined in the literature [19, 20], including hierarchical LDA, correlated topic models and the hierarchical Dirichlet process.
According to , document clustering can be divided into hard clustering, where each document is assigned to exactly one cluster, and soft clustering, where each document is allowed to appear in more than one cluster. Hard clustering methods can be further divided into the following sub-categories: 1) Partitioning methods, which allocate documents into a fixed number of clusters, with the K-means algorithm and its variants being the most popular ones, 2) Hierarchical methods, which build a dendrogram of clusters, and 3) Frequent itemset-based methods, which use association rule mining techniques to form the clusters. In  some representative papers applying those three categories are reviewed.
Hierarchical clustering algorithms are categorized in two major categories: a) agglomerative (or bottom-up) algorithms and b) divisive (or top-down) algorithms. Agglomerative algorithms can be further categorized according to the similarity measures they employ into single-link, complete-link, group-average, and centroid similarity. Top-down algorithms are typically more complex, as they hold information about the global distribution of the dataset, in contrast to bottom-up methods, which make clustering decisions based on local patterns. The advantages of hierarchical clustering algorithms are that they compose a tree of clusters, a richer data structure carrying more information than the output of flat algorithms, and that they do not require users to define the number of clusters.
Nevertheless, the complexity of the naive hierarchical clustering algorithm is O(N³), as for every decision that needs to be taken, an exhaustive scan of the N×N similarity matrix is necessary. Other more efficient algorithms can reduce the complexity to O(N² log N) (with a heap in the general case) or even O(N²) (with SLINK for single-linkage, CLINK for complete-linkage clustering in the general case, and ROCK, Chameleon for categorical data). BIRCH and its extensions comprise hierarchical clustering procedures that are especially suitable for very large databases, and comprise state of the art incremental hierarchical methods. However, the creation of the N×N similarity matrix is necessary for the majority of the algorithms, hence memory requirements become extremely high.
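To make the memory argument concrete, the following back-of-the-envelope sketch compares the storage of a dense N×N similarity matrix with that of a single small frequency matrix. The function names, the 8-byte value size, the layout of the frequency matrix (assumed here to hold one value per bin and topic) and the example sizes are illustrative assumptions, not figures reported in this paper.

```python
def pairwise_matrix_bytes(n_docs, bytes_per_value=8):
    """Bytes needed for a dense N x N similarity matrix (assumed 8-byte floats)."""
    return n_docs * n_docs * bytes_per_value


def frequency_matrix_bytes(n_topics, n_bins, bytes_per_value=8):
    """Bytes needed for one bins-by-topics frequency matrix (assumed layout)."""
    return n_topics * n_bins * bytes_per_value


if __name__ == "__main__":
    n_docs = 100_000  # hypothetical collection size
    print("pairwise matrix :", pairwise_matrix_bytes(n_docs) / 1e9, "GB")
    print("frequency matrix:", frequency_matrix_bytes(20, 10), "bytes")
```

The quadratic term dominates quickly: doubling the number of documents quadruples the pairwise matrix, whereas the size of a frequency matrix of this shape does not depend on the number of documents at all.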
There have been many recent studies on Hierarchical Clustering algorithms. In , an alternative approach of a single-linkage clustering algorithm was proposed, which was based on minimum spanning trees and had the same complexity as the single-linkage algorithm. In , a new non-greedy incremental algorithm for hierarchical clustering was suggested, which efficiently routes new data points to the leaves of an incrementally-built tree. Another recent work  proposed a hierarchical clustering algorithm based on the hypothesis that two reciprocal nearest data points should be put in one cluster. In another line of work, many researchers treated similarity-based hierarchical clustering as an optimization problem, making use of suitable objective functions [33, 34]. In  for example, the author introduces a cost function that, given pairwise similarities between data points, assigns a score to any possible tree on those points.
In this paper, we introduce a method for clustering documents represented by a number of topics, using an approach that does not require pairwise comparisons between the documents but is instead based on low-dimensional frequency matrices. Since the main algorithm makes use of the frequency of occurrence of the main terms in the documents, we call it Frequency-based Hierarchical Clustering (FBHC). A related clustering method that we presented in one of our previous works  makes use of frequency matrices to construct a hierarchy of biological sequences.
A new document clustering framework
In this section, we present an efficient framework for hierarchical document clustering which makes use of topic modeling to extract feature vectors that represent the processed documents. The proposed framework is shown schematically in Fig. 1 and it is formally described in the following steps:
Input: Documents in bag-of-words representation
Step 1: Word preprocessing
    Stemming, removing stop words, making orthographic transformations, stripping punctuation and substitution, excluding words that are not included in the WordNet database
Step 2: Data transformation
    Step 2.1: Select a method to perform topic modeling
    Step 2.2: Select the number of topics Nθ and the number of words per topic Nw
    Step 2.3: Transform documents to topic vectors
Step 3: Data discretization
    Step 3.1: Select the number of bins B
    Step 3.2: Discretize the topic vectors
Step 4: Hierarchical Clustering
    Step 4.1: Apply the Binary Tree Construction Algorithm (Algorithm 1)
    Step 4.2: Apply the Branch Breaking Algorithm (Algorithm 2)
Step 5: Meta-Clustering
    Step 5.1: Graph Construction
    Step 5.2: Graph Clustering
        Step 5.2.1: Select the threshold thrG
        Step 5.2.2: Exclude all those edges with weights < thrG
        Step 5.2.3: Apply the Graph Merging Algorithm (Algorithm 3)
Step 6: Evaluation
    Step 6.1: Select a technique to compute semantic similarity between the derived topics
    Step 6.2: Compute Topic Similarity TS for each cluster
    Step 6.3: If the actual class labels are known, compute FScore
Data preprocessing module
The initial data taken as input to the framework are documents composed of words. Each word has a corresponding frequency of appearance. Before the data are imported to the data transformation module, the words are preprocessed using various methods, including stemming (using the Porter Stemming Algorithm, see Footnote 1), removing stop words, making orthographic transformations (using the spelling corrector, see Footnote 2), stripping punctuation and substitution. Words that are not included in WordNet, a lexical database of English words, are excluded from the dataset in this module.
Data transformation module
The data transformation module applies topic modeling to the input documents in order to transform them into a compressed representation in terms of their topics. In this way we can deal with the high dimensionality and the sparsity of the features of the documents. Topic modeling is based on the assumption that each document d is described as a random mixture of topics P(θ|d) and each topic θ as a focused multinomial distribution over terms P(w|θ). The number of topics Nθ and the number of terms per topic NW are specified by the user and express the degree of specialization of the latent topics. As the data transformation module is not part of the proposed clustering method, any topic modeling method can be used in this module as a plugin. The literature on topic modeling offers hundreds of models adapted to different situations.
LDA: Latent Dirichlet Allocation [38, 39] is a commonly used method to extract semantic information from the documents and create a feature vector for each document. LDA builds a set of Nθ thematic topics, each expressed with a set of NW terms, utilizing terms that tend to co-occur in a given set of documents. The document-topic distribution P(θ|d) and the topic-term distribution P(w|θ) are estimated from an unlabeled corpus of documents D using Dirichlet priors.
BigARTM: BigARTM  is an open-source library for regularized multimodal topic modeling of large collections, which is based on a non-Bayesian multicriteria approach — Additive Regularization of Topic Models, ARTM . It is a distributed implementation which is proven to be very fast and ideal for big collections of documents.
Lda2vec: lda2vec  is a deep learning-based model which creates topics by mixing Dirichlet topic models and word embedding. It constructs a context vector by adding the composition of a document vector and the word vector, which are simultaneously learned during the training process.
Data discretization module
The numeric vectors created by the data transformation module, i.e. the mixture of topics P(θ|d) calculated by the topic modeling process, are discretized into B partitions by assigning each value to a bin according to the closed interval in which it falls. By making use of alphabetic letters to represent the bins, the numeric vectors are converted into character vectors, which constitute the input data to the clustering procedure. In practice, this is a lossy compression in which the number of bins B is selected according to the amount of information we want the model to take into account.
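As an illustration of this discretization step, the sketch below maps a topic-probability vector to a string of bin letters. It assumes equal-width intervals over [0, 1] and assigns the letter A to the highest interval; the function name `discretize` and both of these choices are assumptions made for the example, since the exact interval boundaries are not spelled out here.

```python
import string


def discretize(topic_vector, n_bins=10):
    """Map each probability in [0, 1] to a bin letter.

    Assumption: equal-width intervals, with 'A' denoting the highest
    interval and the last letter denoting the lowest one.
    """
    letters = string.ascii_uppercase[:n_bins]
    encoded = []
    for p in topic_vector:
        # Index 0 is the lowest interval; p == 1.0 is clamped into the top one.
        idx = min(int(p * n_bins), n_bins - 1)
        encoded.append(letters[n_bins - 1 - idx])
    return "".join(encoded)


if __name__ == "__main__":
    print(discretize([0.95, 0.0, 0.55, 1.0]))
```

A larger `n_bins` keeps more of the original information at the cost of longer alphabets and sparser frequency matrices.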
Design for cloud
Since the proposed clustering algorithm is oriented towards analyzing big data that may not fit in a single machine, provision for cloud execution becomes a necessity. Cloud computing  is moving from large-scale centralized data centres to more distributed multi-cloud settings, which may contain networks of larger and smaller virtualized infrastructure runtime nodes. The use of containers constitutes a lightweight virtualization solution characterized by low resource and time consumption.
Docker is a containerization platform that allows Linux applications, their dependencies, and their settings to be composed into Docker images. These images run as Docker containers on any machine running the Docker daemon, which utilizes kernel namespaces and control groups to isolate running containers and control their set of resources. This makes the deployment of cloud-oriented applications easy, as the image of an application has to be built only once and can then be deployed on every system running the Docker daemon. Docker is also appropriate for software benchmarking experiments, since multiple Docker images can be created based on the same root image but containing different benchmarked configurations.
An image of the proposed FBHC algorithm was built using the Docker technology in order to run performance experiments using different hardware resources in the cloud. The resources that were used and the corresponding experimental results are described in the “Performance testing in the cloud” section.
The hierarchical clustering algorithm
In this section, we propose a novel hierarchical clustering algorithm, consisting of two phases: 1) the construction of a top down binary tree by consecutively dividing the frequency matrix  into two sub-matrices until only unique sequences remain at the leaf-level, and 2) the branch breaking algorithm, where each branch of the tree is pruned at an appropriate level using thresholds for the metrics. The metrics that are used to form the clusters are: a) Identity (I), b) Entropy (H) and c) Bin Similarity (BS), and are described in . In the final, meta-clustering phase, a graph of the leaf clusters generated by the clustering algorithm is constructed.
Binary tree construction
The first phase of the clustering method consists of a top down hierarchical clustering algorithm (Algorithm 1). At the beginning of the process, it is assumed that all N sequences belong to a single cluster (C0), the root cluster, which is subsequently split recursively while moving down the different levels of the tree. Ultimately, the output of the clustering process is presented as a binary tree. The tree is constructed level by level, following a procedure for each cluster (Ci) of the specific level that can run in parallel. This can be formally described in the following steps:
Step 1: Construct frequency and frequency-similarity based matrices (FMi, FSMi).
Step 2: Compute Identity, Entropy and Bin Similarity metrics of the matrices (Ii, ISi, Hi, HSi, BSi) applying the equations described in , on the FMi and the FSMi respectively. From now on, the identity metric computed on the FSM will be called Similarity (IS).
Step 3: Split the frequency matrix into two sub-matrices according to the following criteria:
    Criterion 1: Select the element of the FMi with the highest percentage.
    Criterion 2: If the highest percentage value exists in more than one element of FMi, the column with the lowest entropy value is selected.
    Criterion 3: In the case where more than one column exhibits the exact same entropy value, Criterion 1 is applied to the FSMi.
    Criterion 4: In the case of non-unique columns, Criterion 2 is applied to the FSMi.
    Criterion 5: If the number of columns is still more than one, one column from the above sub-group of columns is randomly selected.
Step 4: Update the Level matrix (Y) and the Metric matrix (M) that contains the metrics for each cluster (I, IS, H, HS, BS).
Step 5: Check for leaf-cluster.
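Step 1 and Criterion 1 of Step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Algorithm 1: the frequency matrix is stored as one dictionary of bin percentages per topic position, elements at 100% are skipped because splitting on them would leave one side empty, and ties are broken by taking the first column and bin in order instead of applying Criteria 2 to 5. The names `frequency_matrix` and `split_once` are hypothetical.

```python
from collections import Counter


def frequency_matrix(seqs):
    """Percentage of each bin letter at each position (one dict per column)."""
    n = len(seqs)
    return [
        {bin_: 100.0 * count / n for bin_, count in Counter(column).items()}
        for column in zip(*seqs)
    ]


def split_once(seqs):
    """Split a cluster on the (position, bin) element with the highest percentage.

    Assumptions: elements at 100% are ignored, and ties are resolved by
    taking the first column/bin encountered (a stand-in for Criteria 2-5).
    Returns None when all sequences are identical (leaf cluster).
    """
    best = None
    for pos, column in enumerate(frequency_matrix(seqs)):
        for bin_, pct in sorted(column.items()):
            if pct < 100.0 and (best is None or pct > best[0]):
                best = (pct, pos, bin_)
    if best is None:
        return None
    _, pos, bin_ = best
    left = [s for s in seqs if s[pos] == bin_]
    right = [s for s in seqs if s[pos] != bin_]
    return (pos, bin_), left, right


if __name__ == "__main__":
    print(split_once(["AAJ", "AAJ", "ABJ", "JBA"]))
```

Applying `split_once` recursively to `left` and `right` until it returns `None` would yield a binary tree whose leaves contain only identical sequences; only the per-cluster frequency matrices are ever built, never a pairwise matrix.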
At the beginning of the process, the user can select the type of the algorithm, i.e. whether the split of the matrices is performed on the FM (identityalgo option), or on the FSM (similarityalgo option). In the similarityalgo option, Criteria 1 & 2 are skipped at Step 3.
The second phase of the clustering method consists of the branch breaking process (Algorithm 2). This algorithm is applied to the binary tree derived from the first phase. In the field of document clustering, creating a hierarchy of the documents can be very useful for document organization. In many cases, apart from the formulation of a hierarchy structure of the clusters, extracting meaningful groups can also be useful. In a partitioning clustering algorithm, the exact number of clusters to be created is chosen by the user. In the FBHC algorithm, a solution to extract useful groups from the binary tree would be to cut the tree at a specific level TC, obtaining all those clusters that belong to level TC and all the leaf clusters that belong to higher levels than TC.
Since the tree is asymmetric and the number of documents in each cluster varies, the tree cannot be cut by selecting a unique level TC for the overall tree. A more accurate procedure to address this problem is to prune the tree using branch-specific thresholds: For each branch, the parent cluster is compared to its two children clusters recursively as one goes down through the path of the tree branch. The comparison is applied using the metrics that have been computed for each cluster Ci (Ii,ISi,Hi,HSi,BSi) and user selected thresholds for each metric (thrI, thrH, thrBS). An additional limitation set for the identity metric is that the leaf clusters must have an Identity value higher than 20%. This lower threshold is set to avoid pruning at a very high level of the tree in the case that Identity is too small and the improvement in the metrics is not big enough.
The hierarchy structure created during the clustering phase is ideal for the graph theory application. A graph can be useful to uncover connections between the clusters and obtain an insight of how similar the leaf clusters are. This information can be used to merge similar clusters together as a next step. To this end, an undirected, weighted and fully connected graph is constructed using the binary tree.
The graph is built by computing the graph similarity matrix, which is a square matrix with order equal to the number of leaf clusters (C). The graph matrix is computed on the patterns that represent the clusters. The graph similarity Gi,j between two clusters Ci and Cj is calculated as the combination of three aspects: a) the number of bins that these clusters have in common through the whole pattern that is computed using the identical bins (pI), b) the number of bins that these clusters have in common through the whole pattern that is computed using the groups of bins (pS), and c) the distance between the nodes Ci,Cj of the tree. TL is defined as the maximum distance present in the binary tree, i.e. the distance between the two nodes that are farthest apart from each other.
The representative pattern of a cluster is a string of length equal to the number of topics and it is extracted using the cluster’s frequency matrix, as follows: The positions of the strings with an exact alignment are represented by the corresponding bin, whereas the rest of them are represented by the symbol “_”. Suppose that we are interested in the distance between clusters 53 and 18. The clusters’ patterns computed with the identical and grouped bins of cluster 53 and the corresponding patterns for cluster 18 are as follows:
pI53 = _ R G _ _ _ _ _ R Y Y Y Y G _ _ V,
pS53 = _ R G _ _ _ _ B R Y Y Y Y G _ A V,
pI18 = _ R G _ _ _ _ _ _ _ _ _ _ _ _ D _,
pS18 = _ R G _ _ _ _ _ G _ _ _ _ _ _ D _,
Then, |pI53∩pI18|=2 and |pS53∩pS18|=2.
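The pattern extraction and the intersection counts of this example can be reproduced with a short sketch. The helper names `representative_pattern` and `common_bins` are hypothetical, and the intersection is interpreted here as the number of positions at which both patterns carry the same bin (positions marked “_” never count).

```python
def representative_pattern(seqs):
    """Keep the bin at positions where all sequences agree, '_' elsewhere."""
    return [col[0] if len(set(col)) == 1 else "_" for col in zip(*seqs)]


def common_bins(p, q):
    """Count positions where both patterns hold the same (non-'_') bin."""
    return sum(1 for a, b in zip(p, q) if a == b and a != "_")


# Patterns copied from the example in the text.
pI53 = "_ R G _ _ _ _ _ R Y Y Y Y G _ _ V".split()
pS53 = "_ R G _ _ _ _ B R Y Y Y Y G _ A V".split()
pI18 = "_ R G _ _ _ _ _ _ _ _ _ _ _ _ D _".split()
pS18 = "_ R G _ _ _ _ _ G _ _ _ _ _ _ D _".split()

if __name__ == "__main__":
    print(common_bins(pI53, pI18), common_bins(pS53, pS18))
```

In both pairs only the R and G bins at the second and third positions coincide, which gives the two counts stated above.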
After the graph construction, the graph is clustered into sub groups. Graph clustering is the task of grouping the graph nodes into clusters taking into consideration the weights of the edges, in such a way that there should be many high weighted edges within each node-cluster and relatively low between the node-clusters. The graph can be clustered using a user-selected threshold thrG, excluding from the graph all those edges that are characterized by a weight smaller than thrG. This threshold is expressed as a percentage and can be selected by observing the distribution of the weights’ values. If the user wants to export a specific number of clusters, then a graph merging procedure can be applied. As described in Algorithm 3, the clustered graph is composed of sub-graphs SGs. By sorting the weights in descending order, the most highly similar and strongly connected SGs can be merged by assigning each node to the corresponding central node of the SGs where it belongs to and forming merged clusters, until the desired number of clusters is reached.
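The thresholding step can be sketched as follows: edges lighter than thrG are dropped and the remaining connected components are taken as the sub-graphs. This covers only the edge exclusion and the identification of sub-graphs; the merging around central nodes performed by Algorithm 3 is not reproduced. The function name, the edge-dictionary layout and the example weights are assumptions made for the illustration.

```python
def threshold_components(weights, thr_g):
    """Drop edges with weight < thr_g and return the connected components.

    weights: dict mapping a node pair (u, v) to its weight in percent.
    Returns a list of sorted node lists, one per sub-graph.
    """
    adjacency = {}
    for (u, v), w in weights.items():
        adjacency.setdefault(u, set())
        adjacency.setdefault(v, set())
        if w >= thr_g:  # keep only sufficiently heavy edges
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    # Hypothetical weights for a fully connected graph of five leaf clusters.
    example = {
        (1, 2): 90, (2, 3): 85, (1, 3): 40, (3, 4): 30, (1, 4): 10,
        (2, 4): 20, (4, 5): 95, (1, 5): 5, (2, 5): 5, (3, 5): 35,
    }
    print(threshold_components(example, thr_g=80))
```

Lowering `thr_g` keeps more edges and therefore produces fewer, larger sub-graphs, which is why inspecting the weight distribution before choosing the threshold is useful.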
Experiments and evaluation
In this section the datasets, the external evaluation measures and the four sets of experiments performed to evaluate and validate the proposed framework are presented.
In order to evaluate the effectiveness of the proposed clustering method on text documents, we used various datasets from several domains such as sentiment analysis, news articles, medical documents, web pages and abstracts provided by [43, 44]. More specifically, we used 23 benchmark datasets from  in order to test the accuracy of our framework, with the smallest and the largest ones consisting of 204 and 18,808 documents, respectively. The number of actual classes of these documents varies from 4 to 51. We also used two big datasets from , the NYTimes news articles and the PubMed abstracts, in order to evaluate the performance of the method in terms of computational time and memory usage. All these datasets are summarized in Table 1, which also shows the number of terms of the original documents, i.e. the number of different words, and the final number of terms after the preprocessing.
In order to apply the proposed clustering procedure, the datasets were preprocessed in the preprocessing module and then transformed into numeric vectors using topic modeling. As a use case in this paper, we utilized the LDA method in the data transformation module. The number of topics Nθ that represent the documents was set to 20, after experimenting with values of Nθ from 5 to 500 and evaluating the results using the perplexity metric, as described in our previous work . Thus, for each document of the datasets, we created topic vectors of length 20.
The document vectors were then discretized in 10 bins represented by alphabetic letters from A to J, making each document represented by a sequence of characters. The bin with the highest percentage is represented by A, whereas the one with the lowest percentage is described with J. In order to create the FSM, the groups of similar bins that were used are non-overlapping and are given by pairing bins in descending order i.e. <A,B>,<C,D>,<E,F>,<G,H>.
Two external evaluation metrics were used to evaluate the effectiveness of the clustering procedure: FScore and Topic Similarity.
When we have knowledge about the true class to which each document belongs, we can use FScore to measure the accuracy of the clustering results. A commonly used technique to measure FScore in hierarchical clustering is to take into account the overall set of clusters that are represented in the hierarchical tree. In this paper, we use the FScore introduced by . Given a particular class Lr of size nr and a particular cluster Ci of size ni, and assuming that nri documents of cluster Ci belong to the real class Lr, the FScore of this class and cluster is given by (2):
$$F(L_r, C_i) = \frac{2 \, R(L_r, C_i) \, P(L_r, C_i)}{R(L_r, C_i) + P(L_r, C_i)} \qquad (2)$$
where R(Lr,Ci) is the recall value defined as nri/nr, and P(Lr,Ci) is the precision value defined as nri/ni for the class Lr and the cluster Ci. The FScore of the class Lr is the maximum FScore value attained at any node in the hierarchical clustering tree T. That is,
$$F(L_r) = \max_{C_i \in T} F(L_r, C_i) \qquad (3)$$
The FScore of the entire clustering solution is then defined to be the sum of the individual class FScore values weighted according to the class size:
$$FScore = \sum_{r=1}^{c} \frac{n_r}{n} F(L_r) \qquad (4)$$
where c is the total number of classes and n is the total number of documents. The higher the FScore value, the better the clustering solution is.
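The FScore computation described above can be sketched in a few lines. The sketch takes the class-cluster FScore to be the usual harmonic mean of the recall nri/nr and the precision nri/ni, takes the maximum over all tree nodes for each class, and weights the class scores by class size. The function name, the dictionary of labels and the representation of tree nodes as sets of document ids are assumptions made for the example.

```python
from collections import Counter


def hierarchical_fscore(labels, tree_nodes):
    """FScore of a hierarchical clustering.

    labels: dict mapping document id -> true class label.
    tree_nodes: iterable of sets of document ids, one per node of the tree.
    """
    n = len(labels)
    class_sizes = Counter(labels.values())
    total = 0.0
    for cls, n_r in class_sizes.items():
        best = 0.0
        for node in tree_nodes:
            n_ri = sum(1 for doc in node if labels[doc] == cls)
            if n_ri == 0:
                continue  # recall and precision are both zero
            recall = n_ri / n_r
            precision = n_ri / len(node)
            best = max(best, 2 * recall * precision / (recall + precision))
        total += (n_r / n) * best  # weight by class size
    return total


if __name__ == "__main__":
    labels = {0: "x", 1: "x", 2: "y", 3: "y"}
    nodes = [{0, 1, 2, 3}, {0, 1}, {2, 3}]
    print(hierarchical_fscore(labels, nodes))
```

Because the maximum is taken over every node of the tree, a class is scored by the single cluster that represents it best, regardless of the level at which that cluster appears.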
Due to the sparsity of the frequency matrix of each cluster and the fact that each cluster is characterized by only a few topics, we evaluated the clustering results by calculating the semantic similarity between major topics of each cluster. The topic similarity is extracted using semantic analysis of the topics that were derived from topic modeling.
Semantic similarity, in contrast to string-based matching can identify semantically relevant concepts that consist of different strings. More specifically, semantic similarity is a metric that is used to measure the distances between a set of terms contained in documents based on their meaning or semantic concept. Many techniques to compute semantic similarities of words are reported in the literature. Using Word Embeddings such as Google’s Word2Vec, or a semantic net such as WordNet are common techniques to compute semantic similarity.
Word2vec: Word2vec is a group of models that are used to produce word embeddings. These models are neural networks that are trained to learn high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. Word2vec takes as its input a large corpus of text and produces a vector space, with each unique word in the corpus being assigned a corresponding vector in the space. Words that share common contexts in the corpus are located close to one another in the space. The similarity between two vectors is defined as the cosine of the angle between them. To compute topic similarity, we use an R implementation of Word2vec to train a model for each dataset by making use of the documents’ description. The similarity between two documents of a dataset is computed using the cosine similarity between the topic vectors that have been extracted after topic modeling.
WordNet: Using a lexical taxonomy such as WordNet to define distances between concepts is another common technique. WordNet [37, 47, 48] is a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. Many different distance metrics use the WordNet taxonomy to obtain semantic similarities. In this work, to calculate the similarity between two words, we use the Resnik distance [49], where the information content of a word is defined as the negative logarithm of the probability of finding the word in a given corpus. This metric only considers the information content of the lowest common level in the hierarchy, i.e. the concept in the taxonomy which has the shortest distance from the concepts compared.
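The idea behind the Resnik measure can be sketched on a toy taxonomy. The sketch assumes the standard definition of information content, IC(c) = -log p(c), and scores two concepts by the IC of their most informative common ancestor. The taxonomy, the probabilities and the function names below are invented for illustration and are unrelated to the real WordNet hierarchy or to the R packages used in this work.

```python
import math

# Hypothetical toy taxonomy (child -> parent) with made-up concept probabilities.
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "bird": "animal",
    "animal": None,
}
PROB = {"animal": 1.0, "mammal": 0.5, "bird": 0.3, "dog": 0.2, "cat": 0.2}


def ancestors(concept):
    """Return the concept itself followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def information_content(concept):
    """IC(c) = -log p(c): rarer concepts carry more information."""
    return -math.log(PROB[concept])


def resnik_similarity(c1, c2):
    """IC of the most informative concept that subsumes both c1 and c2."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(c) for c in common)


print(resnik_similarity("dog", "cat"))   # shared ancestor "mammal"
print(resnik_similarity("dog", "bird"))  # only the root is shared
```

Two concepts whose only shared ancestor is the root get a score of zero, because the root occurs with probability 1 and therefore carries no information.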
Given that each topic i is represented by a set of words θi, to compute the similarity between two topics i and j we first obtained the pairwise similarities between all the words contained in θi and θj using the Resnik distance. To compute the overall matching score between the two topics, i.e. the pairwise Topic Similarity (TSi,j), we used the matching average method (5) [50], which calculates the similarity between two topics θi and θj by dividing the sum of similarity values of all match candidates of both sets by the total number of set tokens. More specifically, the Match(θi,θj) function of the equation counts the number of highly similar words of the two topics, i.e. the number of words that have a Resnik similarity higher than the threshold of 1. By employing (5), an Nθ×Nθ similarity matrix with the pairwise TS between all the Nθ topics was created.
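One possible reading of the matching average can be sketched as follows. The sketch assumes that a word counts as matched when at least one word of the other topic has a similarity above the threshold, and that the score is the number of matched words in both topics divided by the total number of words in both topics. This interpretation, the function names and the similarity values are assumptions made for illustration. The word similarity function is passed in as a parameter, so any measure (Resnik, cosine on embeddings) could be plugged in.

```python
def topic_similarity(topic_i, topic_j, word_sim, threshold=1.0):
    """Matching average between two topics given as lists of words.

    A word is treated as matched if some word of the other topic has a
    similarity strictly above `threshold` (assumed interpretation).
    """
    if not topic_i or not topic_j:
        return 0.0
    matched_i = sum(
        1 for w in topic_i if any(word_sim(w, v) > threshold for v in topic_j)
    )
    matched_j = sum(
        1 for v in topic_j if any(word_sim(w, v) > threshold for w in topic_i)
    )
    return (matched_i + matched_j) / (len(topic_i) + len(topic_j))


# Made-up pairwise word similarities (symmetric lookup, 0.0 when unknown).
TOY_SIM = {
    ("dog", "puppy"): 3.2,
    ("cat", "puppy"): 1.5,
    ("ball", "game"): 0.8,
}


def toy_word_sim(a, b):
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


topic_a = ["dog", "cat", "ball"]
topic_b = ["puppy", "game"]
ts = topic_similarity(topic_a, topic_b, toy_word_sim)
print(ts)
```

In the toy example three of the five words find a partner above the threshold, so the score is 3/5. Repeating the call for every pair of topics would fill the Nθ×Nθ similarity matrix.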
Results and discussion
We performed a number of experiments to evaluate the effectiveness and the performance of our framework. This section is divided into five parts: a) the comparison against baseline hierarchical clustering algorithms in terms of effectiveness is discussed in the “Effectiveness evaluation” section, b) the comparison against a baseline divisive hierarchical clustering algorithm in terms of memory usage and computational time is discussed in the “Performance statistical evaluation” section, c) the performance experiments of the proposed method running in the cloud are discussed in the “Performance testing in the cloud” section, d) the complexity analysis is presented in the “Complexity analysis” section, and e) the application of the overall framework presented in the “A new document clustering framework” section to the NYTimes dataset is discussed in the “Experimental results on the NYTimes dataset” section.
Effectiveness evaluation
The first set of experiments focused on evaluating the quality of the proposed Frequency based hierarchical clustering (FBHC) method on the first 10 datasets described in Table 1. The effectiveness of FBHC was examined using the external metrics TS and FScore, and the results were compared with baseline hierarchical clustering algorithms implemented in R. Both divisive (Diana) and agglomerative (Average, Single, Complete and Ward) hierarchical clustering algorithms were used as baselines.
Table 2 presents the average FScore and TS values of the proposed algorithm and the baseline algorithms. The best result achieved for each dataset is shown in boldface, whereas the second-best result is shown in italics. The FScore was calculated taking into account the whole hierarchy structures that were created by the compared algorithms. For most of the datasets used in the experimental analysis, the highest FScore values are obtained by the proposed FBHC algorithm. For the LATimes, Oh10 and Dmoz-Computers datasets, Ward has a higher FScore and the FBHC algorithm comes second by a small margin, whereas for the Reviews dataset Diana comes first and FBHC comes second.
The average TS values were calculated on the final clusters, whose number was set equal to the number of actual classes for each dataset. To obtain the final clusters from the dendrograms constructed by the baseline algorithms, the cutree R function was used, whereas for the FBHC method, the branch breaking algorithm followed by the meta-clustering module was applied, experimenting with different thresholds until the desired number of clusters was obtained.
To compute the average TS value for each dataset presented in Table 2, we extracted the major topics i, j of each cluster, computed the TSi,j values using WordNet and (5) for all the clusters, and then took the average. Instead of WordNet, we could also use Word2vec to calculate TSi,j. However, the results in Table 2 would remain the same, as we present the difference TS−TSActual. Topic Similarity was calculated only for those clusters that contain more than 5 elements and include at least one major topic. Furthermore, TS was calculated only for those datasets whose actual classes are characterized by major topics. The Tr31, Las2s, Tr12, Tr11, Tr45, Tr41, Oh10, Re0 and Re1 datasets do not follow the rule described in the “Topic similarity” section, hence most of their clusters have NA values for the TS metric. The maximum value that TS may assume is 1, which indicates that each one of the clusters is characterized by a unique major topic. The Single method failed to create clusters with major topics, because it assigned most of the elements to one cluster, with the rest of the clusters containing only one element each. Table 2 shows that the FBHC method usually produces TS values closer to the actual ones, compared to the other methods.
Performance statistical evaluation
The second set of experiments focused on evaluating the performance of the proposed clustering method in terms of memory usage and computational time. The experiments were run in R on a computer with an Intel Core i7 CPU at 3.40 GHz with 8 cores and 24 GB RAM, using one core. The Frequency based Hierarchical Clustering (FBHC) algorithm was compared to the Baseline divisive Hierarchical Clustering algorithm (BHC) Diana. Figure 2 shows that, across subsets of the NYTimes dataset of different sizes, the BHC algorithm has much higher memory demands. For the experiment with N equal to 50,000 documents, the BHC algorithm ran for 11 days before it aborted with an “out of memory” error.
Additional results can be found in Tables 3 and 4, which report, for each subset size, the average memory usage and computational time of both the FBHC and BHC algorithms together with the corresponding results of the statistical evaluation. The statistical evaluation was performed to confirm that the difference in performance between our proposed algorithm and the baseline is significant. This was necessary because the memory usage and computational time of the baseline algorithms varied between executions. The statistical tests allow the results to be generalized.
In the statistical test, the null hypothesis is that the BHC algorithm achieves better performance than the FBHC algorithm in terms of memory usage and computational time. To determine whether this hypothesis must be rejected, a t-test is used (more details about the statistical method can be found in [51]). Tables 3 and 4 report the Degrees of Freedom (DF), i.e. the amount of information in the data, the 95% confidence interval of the differences, the average values of the differences, the t-value and finally the probability value (p-value), which is used to decide on the statistical significance of the terms and the model. According to the reported results, the p-values never exceed α=0.05 for any subset, which means that the null hypothesis must be rejected and that the alternative hypothesis, that FBHC performs better, is supported.
Tables 3 and 4 make clear that as the number of documents increases, the absolute value of the t-value for both memory usage and computational time increases, except for the first run (the subset with the smallest size), where the t-value of the memory usage was extremely high. This means that the difference between the performance of the two methods becomes more and more statistically significant as the number of documents increases. For 25,000 documents, our method achieved over 99% reduction in both memory usage and computational time.
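To make the reported quantities concrete, the sketch below computes a t-value, the degrees of freedom and the mean difference for two sets of repeated measurements. It assumes a paired design, in which both algorithms are measured on the same repetitions. The text does not state which t-test variant was used, so this is only one plausible setup, and the measurements are made-up numbers rather than values from Tables 3 and 4.

```python
import math
import statistics


def paired_t_statistic(sample_a, sample_b):
    """Return (t-value, degrees of freedom, mean difference) for paired samples."""
    if len(sample_a) != len(sample_b) or len(sample_a) < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t-value is undefined")
    t_value = mean_diff / (sd_diff / math.sqrt(n))
    return t_value, n - 1, mean_diff


# Made-up running times in seconds for four repetitions of each algorithm.
bhc_times = [10.0, 12.0, 11.0, 13.0]
fbhc_times = [1.0, 2.0, 1.0, 2.0]

t_value, dof, mean_diff = paired_t_statistic(bhc_times, fbhc_times)
print(t_value, dof, mean_diff)
```

A large positive t-value indicates that the first sample is consistently larger than the second. The p-value would then be read from the t-distribution with the returned degrees of freedom, for example with a statistics package.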
Performance testing in the cloud
The third set of experiments focused on evaluating the performance of the proposed clustering method in the cloud. In this round of experiments, we used the biggest dataset of our collection, the PubMed dataset, which contains 8 million documents. In order to test different cloud resource configurations, we built a docker image of the proposed clustering algorithm. The image is publicly available on Docker Hub (mariakotouza/fbc:pubmed) and includes all the subset datasets that were used in these experiments.
The docker image was run as a container on three different configurations: a) the local computer used in the second round of experiments, b) a server with the following specifications: Ubuntu 18.04.3 LTS (kernel 184.108.40.206-generic), 2 x Intel Xeon X5650 @ 2.67 GHz with 16 cores and 118 GB RAM, and c) a configuration provided by the Okeanos national cloud infrastructure, with the following specifications: VMs with Ubuntu server 18.04, Intel(R) Xeon(R) CPU E5-2650 v3 2.30GHz with 4 cores and 16 GB RAM.
The scalability of our algorithm can be observed in Fig. 3, where different numbers of CPUs of the local computer, the cloud resources and the server were used for each subset. The X axis represents the number of CPUs, and the Y axis represents the execution time in seconds. The different lines in the figures correspond to a different subset size N. Comparing the three plots in the figure we observe that the computational time is highly affected by the available hardware. As for the memory usage, the demands for each core that is used are the same as those presented in Fig. 4 in the following sub-section.
Complexity analysis
Figure 4 shows the results for memory usage and computational time for different subsets of the dataset running on the local computer using one core. The figure makes clear that both metrics follow a linear model, i.e. the complexity is O(N), which means that the running time and the memory usage increase at most linearly with the size of the input N.
The same result can be obtained using a theoretical analysis to estimate the computational cost of analyzing datasets of different sizes. Table 5 shows the expected computational cost of various clustering algorithms applied to the corresponding datasets and executed on the local computer, using 1 core (the results of FBHC running on 8 cores are also shown). The hypothetical values were predicted by training a regression model that uses the number of documents (N) as the X variable and the corresponding computational time (T) as the Y variable, both of which were measured on the PubMed dataset and are depicted in Fig. 4.
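The extrapolation step can be illustrated with an ordinary least-squares fit of T against N. This is only a sketch of the general idea: the text does not specify the form of the regression model, a straight line is assumed here because of the reported O(N) behaviour, and the measurements are made-up numbers rather than the values behind Fig. 4 or Table 5.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_time(slope, intercept, n_docs):
    """Predicted computational time for a dataset with n_docs documents."""
    return intercept + slope * n_docs


# Made-up measurements: number of documents and running time in seconds.
sizes = [1000, 2000, 3000, 4000]
times = [12.0, 22.0, 32.0, 42.0]

slope, intercept = fit_line(sizes, times)
print(predict_time(slope, intercept, 10000))
```

For an algorithm with quadratic cost, the same fit could be applied to N squared instead of N. Extrapolated values of this kind are estimates and depend on the fitted range being representative.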
The computational complexity of our proposed algorithm was compared to the following state-of-the-art hierarchical clustering procedures:
- Diana: the divisive hierarchical clustering algorithm that was used as a baseline in a previous set of experiments.
- HAC: the agglomerative hierarchical clustering algorithms with different linkage criteria that were used as baselines in the previous sets of experiments.
- SLINK: an optimized implementation of hierarchical clustering using the single-linkage criterion, with O(N^2) time complexity.
- Birch: a top-down incremental hierarchical clustering method in which points are inserted greedily using the node statistics, which is ideal for large datasets.
The values presented in Table 5 for these algorithms are hypothetical, like those for the FBHC algorithm. The results indicate once again that the FBHC algorithm outperforms the rest of the methods in terms of computational time. The second-best algorithm that scales to large numbers of documents is Birch. However, Birch is not scalable in terms of memory usage, as we were not able to run it on the local computer for datasets consisting of more than 80,000 documents due to memory limitations.
Experimental results on the NYTimes dataset
The last round of experiments includes the application of our proposed hierarchical clustering framework to the NYTimes dataset. Using the binary tree construction algorithm of type similarityalgo, a binary tree with 23 levels and 1965 leaf clusters was constructed. Table 6 shows that the Identity and Similarity metrics began with 0 values at the root of the tree, whereas the Entropy metric began with 0.32. These values improved when descending the tree levels, until at the leaf level the Similarity value was equal to 100% while the Entropy was equal to 0.
During the second phase, the tree was pruned by applying the branch breaking algorithm using 0.5% as the threshold for all the comparisons of the metrics. The final tree consists of 20 levels and 58 leaf clusters. The average values of each level’s metrics using the FM and the FSM matrices are summarized in Table 7. The table shows that the identity value increased towards the leaves of the tree. Notably, when groups of similar bins are used instead of the bins themselves, the similarity value (IS) was slightly higher, as expected. The values of the Topic Similarity (TS) metric, which is discussed in the following sub-section, are also included in the table.
During the meta-clustering phase of the procedure, a graph is constructed using all the leaf clusters that have been formed after the branch breaking algorithm. The hierarchy structure of the clusters is presented in Fig. 5, where similar clusters are depicted using characteristic colors. The graph is clustered using a threshold equal to 10%, removing all the edges that connected the most dissimilar clusters. Figure 6 presents the fully connected and the clustered graphs. Evidently, most of the big clusters do not have similarities with other clusters, but some smaller clusters like 12, 35, 36, 57, 58, 89, 90, 131, 132 could be merged with cluster 87 due to the high connectivity that is observed. The clusters given as an example for merging are shown in red in Fig. 5.
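The graph clustering step can be sketched as thresholding followed by a search for connected components. The sketch assumes that every pair of leaf clusters has a similarity score and that edges whose similarity falls below the threshold are removed. How the 10% threshold is applied in the actual meta-clustering module is not detailed here, so this is one plausible reading. The cluster labels and similarity values are made up.

```python
from collections import defaultdict


def cluster_graph(nodes, similarities, threshold):
    """Drop weak edges, then group nodes by connected component.

    nodes:        iterable of node labels
    similarities: {(node_a, node_b): similarity score} for undirected edges
    threshold:    edges with similarity below this value are removed
    """
    adjacency = defaultdict(set)
    for (a, b), sim in similarities.items():
        if sim >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    groups = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(sorted(component))
    return groups


# Made-up leaf clusters and pairwise similarities.
leaf_clusters = ["A", "B", "C", "D", "E"]
pairwise = {
    ("A", "C"): 0.40,
    ("B", "C"): 0.25,
    ("A", "B"): 0.05,
    ("D", "E"): 0.02,
    ("D", "C"): 0.01,
}
groups = cluster_graph(leaf_clusters, pairwise, threshold=0.10)
print(groups)
```

Leaf clusters that stay connected after thresholding are candidates for merging, while isolated nodes correspond to clusters with no strong similarity to any other cluster.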
Conclusions
In this paper, we presented a new scalable multi-metric hierarchical clustering framework for document clustering. The input documents are preprocessed and transformed into feature vectors using topic modeling, and afterwards they are discretized, forming sequences of characters. The clustering method is composed of three distinct phases: the binary tree construction algorithm, the branch breaking algorithm, and a meta-clustering module for generating graphical representations of the output. The metrics that are used to form the clusters include Identity, Similarity, Entropy and Bin Similarity. The clustering method exhibits a high degree of parallelism, and several sub-processes can be distributed across multiple CPUs to speed up the whole process. It is also dockerized, to enable execution in almost any configuration in the cloud.
Using this frequency-based approach to perform hierarchical document clustering, many of the limitations on computational time and memory usage that arise as the number of documents increases can be overcome. Our algorithm has increased scalability compared to existing hierarchical clustering algorithms, because it uses frequency tables to form the clusters instead of making pairwise comparisons between all the elements of the dataset. A series of efficiency and performance evaluation experiments have shown a considerable reduction in both execution times and memory requirements over a wide variety of publicly available document sets and cloud infrastructures.
A limitation of our proposed method may be the information loss that comes from the data discretization module, but it is up to the users to select the number of bins B in such a way that the amount of information considered by the model is sufficient for the problem at hand. Considering the effectiveness of the proposed method in the cloud, future work involves further parallelization of the clustering algorithm in order to optimize the use of allocated resources in the cloud, including GPU usage. Moreover, the proposed framework could be extended to handle real-time applications running in the cloud that demand new document categorization. This could be done by implementing a decision-making algorithm that exploits the hierarchy of the clusters to assign new documents to the existing clusters.
Availability of data and materials
The datasets supporting the conclusions of this article are available in the LABIC repository, http://sites.labic.icmc.usp.br/text_collections/, and the UCI Machine Learning repository, https://archive.ics.uci.edu/ml/datasets/Bag+of+Words. The source code of the proposed hierarchical clustering method is publicly available on GitHub, https://github.com/mariakotouza/FBHC, whereas the docker image of the method, which includes all the tested subsets of the PubMed dataset, is publicly available on Docker Hub with the name mariakotouza/FBHC:fbc, https://cloud.docker.com/u/mariakotouza/repository/docker/mariakotouza/fbc. Finally, the docker image of the whole clustering framework is available on Docker Hub with the name mariakotouza/fbc:framework.
LDA: Latent Dirichlet allocation
FBHC: Frequency based hierarchical clustering
BHC: Baseline hierarchical clustering
Jaiswal A, Janwe N (2011) Hierarchical document clustering: a review In: 2nd National Conference on Information and Communication Technology (NCICT) 2011 Proceedings published in International Journal of Computer Applications® (IJCA), 37–41.
Roul RK, Asthana SR, Sahay SK (2015) Automated document indexing via intelligent hierarchical clustering: A novel approach In: 2014 International Conference on High Performance Computing and Applications, ICHPCA 2014. https://doi.org/10.1109/ICHPCA.2014.7045347.
Zhao Y, Karypis G (2002) Evaluation of hierarchical clustering algorithms for document datasets In: Proceedings of the eleventh international conference on Information and knowledge management - CIKM ’02, 515. https://doi.org/10.1145/584792.584877.
Shah N, Mahajan S (2012) Document Clustering: A Detailed Review. Int J Appl Inf Syst (IJAIS) 4(5):30–38. https://doi.org/10.5120/8202-1598.
Bhardwaj S, Jain L, Jain S (2010) Cloud computing: A study of infrastructure as a service (iaas). Int J Eng Inf Technol 2(1):60–63.
Kotouza M, Vavliakis K, Psomopoulos F, Mitkas P (2018) A hierarchical multi-metric framework for item clustering In: 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT). IEEE. https://doi.org/10.1109/bdcat.2018.00031.
Chen CL, Tseng FSC, Liang T (2010) An integration of WordNet and fuzzy association rule mining for multi-label document clustering. Data Knowl Eng 69(11):1208–1226. https://doi.org/10.1016/j.datak.2010.08.003.
Brants T, Chen F, Tsochantaridis I (2002) Topic-based document segmentation with probabilistic latent semantic analysis In: Proceedings of the eleventh international conference on Information and knowledge management - CIKM ’02, 211. https://doi.org/10.1145/584792.584829.
Newman D, Asuncion A, Smyth P, Welling M (2009) Distributed algorithms for topic models. J Mach Learn Res 10(Aug):1801–1828.
Wang Y, Bai H, Stanton M, Chen W-Y, Chang EY (2009) Plda: Parallel latent dirichlet allocation for large-scale applications In: International Conference on Algorithmic Applications in Management, 301–314. Springer. https://doi.org/10.1007/978-3-642-02158-9_26.
Liu Z, Zhang Y, Chang EY, Sun M (2011) Plda+: Parallel latent dirichlet allocation with data placement and pipeline processing. ACM Trans Intell Syst Technol (TIST) 2(3):26.
Vorontsov K, Frei O, Apishev M, Romov P, Dudarenko M (2015) Bigartm: Open source library for regularized multimodal topic modeling of large collections In: Communications in Computer and Information Science, 370–381. Springer International Publishing. https://doi.org/10.1007/978-3-319-26123-2_36.
Moody CE (2016) Mixing dirichlet topic models and word embeddings to make lda2vec. arXiv preprint arXiv:1605.02019. https://arxiv.org/abs/1605.02019.
Ahmadi P, Gholampour I, Tabandeh M (2018) Cluster-based sparse topical coding for topic mining and document clustering. Adv Data Anal Classif 12(3):537–558. https://doi.org/10.1007/s11634-017-0280-3.
Rafi M, Shaikh MS, Farooq A (2010) Document Clustering based on Topic Maps. Int J Comput Appl 12(1):32–36.
Ma Y, Wang Y, Jin B (2014) A three-phase approach to document clustering based on topic significance degree. Expert Syst Appl 41(18):8203–8210. https://doi.org/10.1016/j.eswa.2014.07.014.
Lin Q, Jungang X (2013) A Chinese Word Clustering Method Using Latent Dirichlet Allocation and K-means In: Proceedings of the 2nd International Conference on Advances in Computer Science and Engineering. Atlantis Press. https://doi.org/10.2991/cse.2013.60.
Onan A, Bulut H, Korukoglu S (2017) An improved ant algorithm with LDA-based representation for text document clustering. J Inf Sci 43(2):275–292. https://doi.org/10.1177/0165551516638784.
Yau CK, Porter A, Newman N, Suominen A (2014) Clustering scientific documents with topic modeling. Scientometrics 100(3):767–786. https://doi.org/10.1007/s11192-014-1321-8.
Xu J, Zhou S, Qiu L, Liu S, Li P (2014) A Document Clustering Algorithm Based on Semi-constrained Hierarchical Latent Dirichlet Allocation. Springer International Publishing. https://doi.org/10.1007/978-3-319-12096-6_5.
Steinbach M, Karypis G, Kumar V, et al. (2000) A comparison of document clustering techniques In: KDD Workshop on Text Mining, vol 400, 525–526. arXiv, Boston. https://dl.acm.org/doi/10.1145/233269.233324.
Chen CL, Tseng FSC, Liang T (2010) Mining fuzzy frequent itemsets for hierarchical document clustering. Inf Process Manag 46(2):193–211. https://doi.org/10.1016/j.ipm.2009.09.009.
Manning CD (2009) Intro to Information Retrieval. Inf Retrieval:1–18. https://doi.org/10.1109/LPT.2009.2020494.
Sibson R (1973) Slink: an optimally efficient algorithm for the single-link cluster method. Comput J 16(1):30–34.
Defays D (1977) An efficient algorithm for a complete link method. Comput J 20(4):364–366.
Guha S, Rastogi R, Shim K (2000) Rock: A robust clustering algorithm for categorical attributes. Inf Syst 25(5):345–366.
Karypis G, Han E-HS, Kumar V (1999) Chameleon: Hierarchical clustering using dynamic modeling. Computer 32(8):68–75. https://doi.org/10.1109/2.781637.
Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases In: ACM Sigmod Record, vol 25, 103–114. ACM. https://dl.acm.org/doi/10.1145/233269.233324.
Fichtenberger H, Gillé M, Schmidt M, Schwiegelshohn C, Sohler C (2013) Bico: Birch meets coresets for k-means clustering In: Lecture Notes in Computer Science, 481–492. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-40450-4_41.
Agarwal A, Roul RK (2018) A novel hierarchical clustering algorithm for online resources In: Recent Findings in Intelligent Computing Techniques, 467–476. Springer. https://link.springer.com/chapter/10.1007/978-981-10-8636-6_49.
Kobren A, Monath N, Krishnamurthy A, McCallum A (2017) A hierarchical algorithm for extreme clustering In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 255–264. ACM. https://dl.acm.org/doi/10.1145/3097983.3098079.
Xie W-B, Lee Y-L, Wang C, Chen D-B, Zhou T (2019) Hierarchical clustering supported by reciprocal nearest neighbors. arXiv preprint arXiv:1907.04915. https://arxiv.org/abs/1907.04915.
Charikar M, Chatziafratis V, Niazadeh R (2019) Hierarchical clustering better than average-linkage In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2291–2304. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611975482.139.
Cohen-Addad V, Kanade V, Mallmann-Trenn F, Mathieu C (2019) Hierarchical clustering: Objective functions and algorithms. J ACM (JACM) 66(4):26.
Dasgupta S (2015) A cost function for similarity-based hierarchical clustering. arXiv preprint arXiv:1510.05043. https://arxiv.org/abs/1510.05043.
Tsarouchis S-F, Kotouza MT, Psomopoulos FE, Mitkas PA (2018) A multi-metric algorithm for hierarchical clustering of same-length protein sequences In: IFIP International Conference on Artificial Intelligence Applications and Innovations, 189–199. Springer International Publishing. https://doi.org/10.1007/978-3-319-92016-0_18.
Oram P (2001) WordNet: An Electronic Lexical Database (Fellbaum C, ed.). Cambridge, MA: MIT Press, 1998. 423. Applied Psycholinguistics 22(1):131–134.
Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3(Jan):993–1022.
Newman D, Baldwin T, Cavedon L, Huang E, Karimi S, Martinez D, Scholer F, Zobel J (2010) Visualizing search results and document collections using topic maps. Web Semant Sci Serv Agents World Wide Web 8(2-3):169–175.
Vorontsov K, Potapenko A (2015) Additive regularization of topic models. Mach Learn 101(1-3):303–323.
Pahl C, Lee B (2015) Containers and clusters for edge cloud architectures–a technology review In: 2015 3rd International Conference on Future Internet of Things and Cloud. IEEE. https://doi.org/10.1109/ficloud.2015.35.
Merkel D (2014) Docker: lightweight linux containers for consistent development and deployment. Linux J 2014(239):2.
Rossi RG, Marcacini RM, Rezende SO (2013) Benchmarking text collections for classification and clustering tasks. Institute of Mathematics and Computer Sciences, University of Sao Paulo. http://sites.labic.icmc.usp.br/ragero/docs/TR_395.pdf.
Dheeru D, Karra Taniskidou E (2017) UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Accessed Jan 2019.
Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD ’99. ACM Press. https://doi.org/10.1145/312129.312186.
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. https://arxiv.org/abs/1301.3781.
Feinerer I, Hornik K (2017) Wordnet: WordNet Interface. R package version 0.1-14. https://CRAN.R-project.org/package=wordnet. Accessed Jan 2019.
Wallace M (2007) Jawbone Java WordNet API. http://mfwallace.googlepages.com/jawbone. Accessed Jan 2019.
Resnik P (1995) Using Information Content to Evaluate Semantic Similarity in a Taxonomy 1. https://doi.org/10.1.1.55.5277.
Lin D (1998) Automatic retrieval and clustering of similar words In: COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics. Association for Computational Linguistics. https://doi.org/10.3115/980432.980696.
Kyun KT (2015) T test as a parametric statistic. Korean J Anesthesiol 68(6):540–546. https://doi.org/10.4097/kjae.2015.68.6.540.
This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH – CREATE – INNOVATE (project code: T1EDK-03464).
The authors declare that they have no competing interests.
Cite this article
Kotouza, M.T., Psomopoulos, F.E. & Mitkas, P.A. A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures. J Cloud Comp 9, 2 (2020). https://doi.org/10.1186/s13677-019-0150-y
- Hierarchical document clustering
- Topic modeling
- Performance testing