A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures

Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchy structure that, combined with the produced clusters, can be useful in managing documents: it makes browsing and navigation easier and quicker, and it allows only relevant information to be returned for the users' queries by leveraging the structure relationships. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm that creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm which composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module which makes use of graph theory to obtain insights into the leaf clusters' connections. The feature vectors representing each document derive from topic modeling. At the implementation level, the clustering method has been dockerized in order to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving a significant reduction in both memory consumption and computational time over existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups, and the results are promising.


Introduction
Hierarchical clustering has been proven to be a useful technique in the field of document organization, as it constructs a hierarchy structure of document collections and sub-collections. Such a structure can make the browsing and navigation process easier and quicker [1] by hiding irrelevant information from the users. Since each cluster and the corresponding sub-clusters represent a set of topic and sub-topics relationships [2], the hierarchy can help automated systems to return only relevant information to the user, by exploiting the relationships stored in the structure. Moreover, the hierarchy can be used to visualize and interactively explore large amounts of documents [3]. Finally, the hierarchy may be used as a decision tree for the categorization of new documents. However, existing solutions for hierarchical document clustering are faced with serious challenges.
*Correspondence: maria.kotouza@issel.ee.auth.gr, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
Some of the current problems with document clustering [4] include the selection of appropriate document features and similarity measures, the quality assessment of the clusters, the implementation of an efficient clustering algorithm which can make optimal use of the available memory and CPU resources, the association of meaningful labels to the final clusters, and the consideration of the semantic relationships between words. Hierarchical document clustering methods have to deal with additional challenges, including the handling of the very high dimensionality of the data. A medium to large set of documents can contain over 10,000 documents; this means that there can be millions of term-document relations, thus leading to an extremely high computational complexity and memory usage. This issue arises from the way most classical hierarchical clustering methods are implemented: they are based on the formulation of high dimensional distance matrices, used for pairwise comparisons between all the available data points.
The high volume of documents that have to be handled daily on the web presents a challenge to a cloud environment as well. In order to provide efficient solutions, researchers are increasingly turning towards scalable approaches, such as the utilization of cloud resources in addition to local computational infrastructures. Running big data analytics algorithms on cloud computing infrastructures is a promising way to meet this challenge. Cloud computing [5] provides shared computing resources on demand over the Internet, including large numbers of compute servers and other resources that can be scaled up and down according to the computational requirements. The topology of the computers in the cloud is usually hidden from the end user.
Taking all these issues into account, this work focuses on implementing a scalable hierarchical clustering algorithm for document clustering. It attempts to overcome limitations regarding the number of documents that can be handled by existing algorithms due to memory limitations, and to reduce the overall computational time. The innovation of our proposed algorithm lies in the fact that, instead of constructing an N×N similarity matrix by computing the pairwise similarities between all data points of the dataset in order to construct the hierarchical tree, we build a low-dimensionality frequency matrix as a representation of the root cluster. This cluster is then split recursively while moving down the hierarchy, which significantly reduces the memory requirements. Additionally, the implementation of this work is based on a distributed computing architecture and can therefore handle an increasing number of documents, depending on the available resources. The input of our algorithm consists of documents represented as a bag of topics derived from topic modeling. The documents are transformed into the format required by the algorithm during the preprocessing and transformation phases of our proposed framework. The whole framework has been dockerized in order to facilitate easy deployment on cloud computing infrastructures.
This work is an extended version of our previous work [6] that presented a multi-metric hierarchical clustering framework for item clustering. Here, we extend the previous work by re-designing our framework in order to be applicable to the more general field of document clustering, and we add a meta-clustering module to the framework. We explore the effectiveness and the performance of our method regarding memory usage and computational time through a more detailed evaluation and many more experiments, utilizing several datasets of varying sizes and content. We compare the results with more baseline hierarchical clustering methods, and we make use of the external evaluation metric FScore. Furthermore, we extend the previous work by parallelizing our clustering algorithm to achieve scalability, we make it suitable for cloud execution using a virtualization solution, and we measure the performance of the method using different hardware resources.
The rest of the paper is organized as follows: the "Literature review" section discusses related work, while the proposed integrated framework for document clustering is analysed in the "A new document clustering framework" section. The "The hierarchical clustering algorithm" section details our hierarchical clustering algorithm, whereas the "Experiments and evaluation" section contains the experimental results and the clustering evaluation. Finally, conclusions and future work are highlighted in the "Conclusion" section.

Topic modeling in document clustering
Getting from an initial collection of documents to a clustering of the collection is an elaborate procedure, which usually involves several stages. The basic operations are feature extraction and selection, document representation and clustering [4]. Feature extraction is usually the first step of the process and filters out inappropriate words from the documents' descriptions. Feature selection is a preprocessing method that removes noisy features and reduces the dimensions of the feature space, in order to yield a better understanding of the data and overall better performance of the clustering method that takes those data as input. In the feature selection stage, various probabilistic models have been used in the literature, like Latent Dirichlet Allocation (LDA) [7] and Probabilistic Latent Semantic Analysis (PLSA) [8]. Today, much of the research on topic modeling focuses on distributed implementations of LDA, such as AD-LDA [9], PLDA [10] and PLDA+ [11]. BigARTM [12] is another distributed implementation for topic modeling which includes popular models such as LDA, PLSA, and many others. Other approaches make use of deep learning techniques for topic extraction (e.g. lda2vec [13]).
Ahmadi et al. [14] showed that topic-model-based clustering methods generally achieve better results than applying only traditional clustering algorithms like K-means. LDA has been used in many papers for representation and dimensionality reduction of text documents, as well as for uncovering semantic relations in the text [15]. Ma et al. [16] used LDA for document representation and identification of the most significant topics; the K-means++ algorithm was used to define the initial centers of the clusters and the K-means algorithm was used to form the final clusters. Qiu and Xu [17] presented a clustering method where LDA was used to extract topics from the texts and the centroids of the K-means algorithm were selected among the nouns with the highest probability values. More recently, Onan et al. [18] proposed an improved ant clustering algorithm, in which two novel heuristic methods enhance the clustering quality of ant-based clustering and LDA is used to represent textual documents. Apart from the classical LDA method, many variants have been examined in the literature [19,20], including hierarchical LDA, correlated topic models and the hierarchical Dirichlet process.

Document clustering
According to [21], document clustering can be divided into hard clustering, where each document is assigned to exactly one cluster, and soft clustering, where each document is allowed to appear in more than one cluster. Hard clustering methods can be further categorized in the following sub-categories: 1) Partitioning methods, which allocate documents into a fixed number of clusters, with the K-means algorithm and its variants being the most popular ones, 2) Hierarchical methods [3], which build a dendrogram of clusters and 3) Frequent itemset-based methods [22], which use association rule mining techniques to form the clusters. In [7] some representative papers applying those three categories are reviewed. Hierarchical clustering algorithms [23] are categorized in two major categories: a) agglomerative (or bottom-up) algorithms and b) divisive (or top-down) algorithms. Agglomerative algorithms can be further categorized according to the similarity measures they employ into single-link, complete-link, group-average, and centroid similarity. Top-down algorithms are typically more complex, as they hold information about the global distribution of the dataset, in contrast to bottom-up methods that make clustering decisions based on local patterns. The advantages of hierarchical clustering algorithms are that they compose a tree of clusters, a richer data structure carrying more information than the output of flat algorithms, and that they do not require users to define the number of clusters.
Nevertheless, the complexity of the naive hierarchical clustering algorithm is O(N^3), as for every decision that needs to be taken, an exhaustive scan of the N×N similarity matrix is necessary. Other more efficient algorithms can reduce the complexity to O(N^2 log N) (with a heap in the general case) or even O(N^2) (with SLINK [24] for single-linkage, CLINK [25] for complete-linkage clustering in the general case, and ROCK [26], Chameleon [27] for categorical data). BIRCH [28] and its extensions [29] are hierarchical clustering procedures that are especially suitable for very large databases, and represent state-of-the-art incremental hierarchical methods. However, the creation of the N×N similarity matrix is necessary for the majority of the algorithms, hence memory requirements become extremely high.
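To make the memory argument concrete, the following back-of-the-envelope sketch compares a dense N×N similarity matrix with a single frequency matrix. It assumes 8-byte cells and a frequency matrix of shape (number of bins) × (number of topics); both are illustrative assumptions, and the document, topic and bin counts are the ones reported later in the experimental setup.

```python
def dense_matrix_bytes(rows: int, cols: int, bytes_per_cell: int = 8) -> int:
    """Bytes needed for a dense rows x cols matrix of fixed-width cells."""
    return rows * cols * bytes_per_cell


n_docs = 18_808   # largest benchmark collection in the experiments
n_topics = 20     # topic-vector length used in the experiments
n_bins = 10       # number of discretization bins used in the experiments

# Assumption: 8-byte floating point cells for both matrices.
similarity_bytes = dense_matrix_bytes(n_docs, n_docs)
# Assumption: the frequency matrix has one row per bin and one column per topic.
frequency_bytes = dense_matrix_bytes(n_bins, n_topics)

print(f"N x N similarity matrix: {similarity_bytes / 1e9:.2f} GB")
print(f"bins x topics frequency matrix: {frequency_bytes} bytes")
```

The discretized input sequences themselves still grow linearly with N; the point of the comparison is only that the quadratic term disappears.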
There have been many recent studies on Hierarchical Clustering algorithms. In [30], an alternative approach of a single-linkage clustering algorithm was proposed, which was based on minimum spanning trees and had the same complexity as the single-linkage algorithm. In [31], a new non-greedy incremental algorithm for hierarchical clustering was suggested, which efficiently routes new data points to the leaves of an incrementally-built tree. Another recent work [32] proposed a hierarchical clustering algorithm based on the hypothesis that two reciprocal nearest data points should be put in one cluster. In another line of work, many researchers treated similarity-based hierarchical clustering as an optimization problem, making use of suitable objective functions [33,34]. In [35] for example, the author introduces a cost function that, given pairwise similarities between data points, assigns a score to any possible tree on those points.
In this paper, we introduce a method for clustering documents represented by a number of topics, using an approach that does not require pairwise comparisons between the documents but is instead based on low-dimensional frequency matrices. Since the main algorithm makes use of the frequency of occurrence of the main terms in the documents, we call it Frequency-based Hierarchical Clustering (FBHC). A related clustering method that we presented in one of our previous works [36] makes use of frequency matrices to construct a hierarchy of biological sequences.

A new document clustering framework
In this section, we present an efficient framework for hierarchical document clustering which makes use of topic modeling to extract feature vectors that represent the processed documents. The proposed framework is shown schematically in Fig. 1 and is described in the following steps.

Input: documents in bag-of-words representation
The initial data that are taken as input to the framework are documents composed of words. Each word has a corresponding frequency of appearance. Before importing the data to the data transformation module, the words are preprocessed using various methods, including stemming (using the Porter Stemming Algorithm), removing stop words, making orthographic transformations (using a spelling corrector), stripping punctuation and substitution. The words that are not included in WordNet [37], a lexical database of English words, are excluded from the dataset in this module.

Data transformation module
The data transformation module applies topic modeling to the input documents in order to transform them into a compressed representation in terms of their topics. In this way we can deal with the high dimensionality and the sparsity of the features of the documents. Topic modeling is based on the assumption that each document d is described as a random mixture of topics P(θ|d) and each topic θ as a focused multinomial distribution over terms P(w|θ). The number of topics N_θ and the number of terms per topic N_W are specified by the user and express the degree of specialization of the latent topics. As the data transformation module is not part of the proposed clustering method, any topic modeling method can be used as part of this module, as a plugin. The literature on topic modeling offers hundreds of models adapted to different situations.

LDA: Latent Dirichlet Allocation [38,39] is a commonly used method to extract semantic information from the documents and create a feature vector for each document. LDA builds a set of N_θ thematic topics, each expressed with a set of N_W terms, utilizing terms that tend to co-occur in a given set of documents. The document-topic distribution P(θ|d) and the topic-term distribution P(w|θ) are estimated from an unlabeled corpus of documents D using Dirichlet priors.

BigARTM: BigARTM [12] is an open-source library for regularized multimodal topic modeling of large collections, which is based on a non-Bayesian multicriteria approach, Additive Regularization of Topic Models (ARTM) [40]. It is a distributed implementation which is reported to be very fast and well suited to big collections of documents.

Lda2vec: lda2vec [13] is a deep learning-based model which creates topics by mixing Dirichlet topic models and word embeddings. It constructs a context vector by adding the composition of a document vector and the word vector, which are simultaneously learned during the training process.

Data discretization module
The numeric vectors created by the data transformation module, i.e. the mixture of topics P(θ|d) calculated by the topic modeling process, are discretized into B partitions by assigning each value to a bin based on the closed interval to which it belongs. By making use of alphabetic letters to represent the bins, the numeric vectors are converted into character vectors, which constitute the input data to the clustering procedure. In practice, this is a lossy compression, where the number of bins B is selected based on the amount of information that the model should take into account.

Design for cloud
Since the proposed clustering algorithm is oriented towards analyzing big data that may not fit in a single machine, provision for cloud execution becomes a necessity. Cloud computing [41] is moving from large-scale centralized data centres to more distributed multi-cloud settings, which may contain networks of larger and smaller virtualized infrastructure runtime nodes. The use of containers constitutes a lightweight virtualization solution characterized by low resource and time consumption.
Docker [42] is a containerization platform that allows Linux applications, their dependencies, and their settings to be composed into Docker images. These images run as Docker containers on any machine running the Docker daemon, which utilizes kernel namespaces and control groups to isolate running containers and control their set of resources. This makes the deployment of cloud-oriented applications easy, as the image of an application has to be built only once and can then be deployed on every system running the Docker daemon. Docker is also appropriate for software benchmarking experiments, since multiple Docker images can be created based on the same root image but containing different benchmarked configurations.
An image of the proposed FBHC algorithm was built using Docker in order to run performance experiments using different hardware resources in the cloud. The resources that were used and the corresponding experimental results are described in the "Performance testing in the cloud" section.

The hierarchical clustering algorithm
In this section, we propose a novel hierarchical clustering algorithm, consisting of two phases: 1) the construction of a top-down binary tree by consecutively dividing the frequency matrix [6] into two sub-matrices until only unique sequences remain at the leaf level, and 2) the branch breaking algorithm, where each branch of the tree is pruned at an appropriate level using thresholds for the metrics. The metrics that are used to form the clusters are: a) Identity (I), b) Entropy (H) and c) Bin Similarity (BS), and are described in [6]. In the final, meta-clustering phase, a graph of the leaf clusters generated by the clustering algorithm is constructed.

Binary tree construction
The first phase of the clustering method consists of a top-down hierarchical clustering algorithm (Algorithm 1). At the beginning of the process, it is assumed that all N sequences belong to a single cluster (C_0), the root cluster, which is subsequently split recursively while moving down the different levels of the tree. Ultimately, the output of the clustering process is presented as a binary tree. The tree is constructed level by level, by following a procedure for each cluster (C_i) of the specific level; the clusters of a level can be processed in parallel. This can be described in the following steps:
Step 1 Construct frequency and frequency-similarity based matrices (FM_i, FSM_i).
Step 2 Compute Identity, Entropy and Bin Similarity metrics of the matrices (I_i, IS_i, H_i, HS_i, BS_i) applying the equations described in [6] on the FM_i and the FSM_i respectively. From now on, the identity metric computed on the FSM will be called Similarity (IS).
Step 3 Split the frequency matrix into two sub-matrices according to the following criteria: Criterion 5: If the number of columns is still more than one, one column from the above sub-group of columns is randomly selected.
Step 4 Update the Level matrix (Y) and the Metric matrix (M) that contains the metrics for each cluster (I, IS, H, HS, BS).
Step 5 Check for leaf-cluster.
At the beginning of the process, the user can select the type of the algorithm, i.e. whether the split of the matrices is performed on the FM (identity algo option), or on the FSM (similarity algo option). In the similarity algo option, Criteria 1 & 2 are skipped at Step 3.
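Steps 1 and 2 can be illustrated with a minimal sketch. The exact metric equations are given in [6] and are not reproduced here, so the definitions below are assumptions made for illustration only: the frequency matrix holds the relative frequency of each bin letter at each topic position, Identity is taken as the share of positions where all sequences carry the same bin, and Entropy as the mean Shannon entropy over positions. The function names and the toy sequences are hypothetical.

```python
from collections import Counter
from math import log2


def frequency_matrix(sequences, alphabet):
    """Return {letter: [relative frequency per position]} for equal-length sequences."""
    n = len(sequences)
    length = len(sequences[0])
    columns = [Counter(seq[pos] for seq in sequences) for pos in range(length)]
    return {letter: [col[letter] / n for col in columns] for letter in alphabet}


def identity(fm):
    """Assumed definition: share of positions where a single bin has frequency 1."""
    length = len(next(iter(fm.values())))
    exact = sum(
        1 for pos in range(length) if any(row[pos] == 1.0 for row in fm.values())
    )
    return exact / length


def entropy(fm):
    """Assumed definition: mean Shannon entropy (in bits) over all positions."""
    length = len(next(iter(fm.values())))
    total = 0.0
    for pos in range(length):
        total -= sum(row[pos] * log2(row[pos]) for row in fm.values() if row[pos] > 0)
    return total / length


# Toy cluster of four documents, each discretized into four topic bins.
docs = ["AJJC", "AJJD", "AJIC", "AJIC"]
fm = frequency_matrix(docs, "ABCDEFGHIJ")
print(identity(fm), round(entropy(fm), 4))
```

The split of Step 3 is not sketched, since it depends on the selection criteria of Algorithm 1.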

Branch breaking
The second phase of the clustering method consists of the branch breaking process (Algorithm 2). This algorithm is applied to the binary tree derived from the first phase. In the field of document clustering, creating a hierarchy of the documents can be very useful for document organization. In many cases, apart from the formulation of a hierarchy structure of the clusters, extracting meaningful groups can also be useful. In a partitioning clustering algorithm, the exact number of clusters to be created is chosen by the user. In the FBHC algorithm, one way to extract useful groups from the binary tree would be to cut the tree at a specific level T_C, obtaining all the clusters that belong to level T_C together with all the leaf clusters located above level T_C, i.e. closer to the root.
However, since the tree is asymmetric and the number of documents in each cluster varies, cutting the overall tree at a single level T_C is not appropriate. A more accurate procedure is to prune the tree using branch-specific thresholds: for each branch, the parent cluster is compared to its two child clusters recursively as one goes down the path of the tree branch. The comparison is applied using the metrics that have been computed for each cluster C_i (I_i, IS_i, H_i, HS_i, BS_i) and user-selected thresholds for each metric (thrI, thrH, thrBS). An additional constraint on the identity metric is that the leaf clusters must have an Identity value higher than 20%. This lower bound is set to avoid pruning at a very high level of the tree when Identity is too small and the improvement in the metrics is not big enough.
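The branch-specific pruning can be sketched as a recursive walk over the tree. The exact comparison rule belongs to Algorithm 2 and is not reproduced here; the sketch below assumes, for illustration only, that a branch is cut at a node when neither child improves Identity by at least thrI and the node already exceeds the 20% Identity floor. The tree layout, the function name and the numeric values are hypothetical, and the Entropy and Bin Similarity thresholds would be handled analogously.

```python
def break_branches(tree, node, thr_i, min_identity=0.20):
    """Return the ids of the final clusters found below `node`.

    tree maps a cluster id to {"identity": float, "children": (left, right) or None}.
    Assumed rule: cut at `node` when the best child gains less than thr_i in
    Identity and the node itself is above the minimum Identity floor.
    """
    info = tree[node]
    if info["children"] is None:
        return [node]  # leaf cluster, nothing left to prune
    gain = max(tree[c]["identity"] for c in info["children"]) - info["identity"]
    if gain < thr_i and info["identity"] > min_identity:
        return [node]  # children do not improve enough: keep the parent
    result = []
    for child in info["children"]:
        result.extend(break_branches(tree, child, thr_i, min_identity))
    return result


# Hypothetical five-node tree; Identity values are made up for the example.
tree = {
    0: {"identity": 0.10, "children": (1, 2)},
    1: {"identity": 0.50, "children": (3, 4)},
    2: {"identity": 0.15, "children": None},
    3: {"identity": 0.55, "children": None},
    4: {"identity": 0.52, "children": None},
}
final_clusters = break_branches(tree, 0, thr_i=0.10)
print(final_clusters)
```

With these values the root is split, because its children gain far more than the threshold, while cluster 1 is kept whole, since its children add only 0.05 Identity.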

Graph construction
The hierarchy structure created during the clustering phase lends itself to a graph-based analysis. A graph can be useful to uncover connections between the clusters and to obtain an insight into how similar the leaf clusters are. This information can be used to merge similar clusters together as a next step. To this end, an undirected, weighted and fully connected graph is constructed using the binary tree. The graph is built by computing the graph similarity matrix, which is a square matrix with order equal to the number of leaf clusters (C). The graph matrix is computed on the patterns that represent the clusters. The graph similarity G_{i,j} between two clusters C_i and C_j is calculated as the combination of three aspects: a) the number of bins that these clusters have in common through the whole pattern that is computed using the identical bins (pI), b) the number of bins that these clusters have in common through the whole pattern that is computed using the groups of bins (pS), and c) the distance between the nodes C_i, C_j of the tree. TL is defined as the maximum distance present in the binary tree, i.e. the distance between the two nodes that are farthest apart from each other.

Algorithm 1 (binary tree construction), lines 10 to 17:
10 Elm ← the elements of Elm with the max(FSM_i);
11 if Elm.length < 2 then cell_i ← Elm; Go to step 14; end
12 Elm ← the elements of Elm with the min(HS_i);
13 if Elm.length < 2 then cell_i ← random element of Elm; end
Division
14 j ← index of the left child of node i (C_i);
15 Add the sequences that belong to cell_i to cluster C_j;
16 Add all the other sequences to cluster C_{j+1};
Return(leaf, j, C_j, C_{j+1});
end
endParallelism();
17 Collect the results from the parallel processes and fill in column Y[ , level + 1] with the corresponding cluster ids;
end
Return Y, M;
The representative pattern of a cluster is a string of length equal to the number of topics and it is extracted using the cluster's frequency matrix, as follows: The positions of the strings with an exact alignment are represented by the corresponding bin, whereas the rest of them are represented by the symbol "_". Suppose that we are interested in the distance between clusters 53 and 18. The clusters' patterns computed with the identical and grouped bins of cluster 53 and the corresponding patterns for cluster 18 are as follows: Then, |pI_53 ∩ pI_18| = 2 and |pS_53 ∩ pS_18| = 2.
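The pattern extraction and the counting of common bins can be sketched as follows. The sequences are made-up toy data, not the patterns of clusters 53 and 18, and the function names are hypothetical. The sketch assumes that two patterns share a bin at a position only when both carry the same letter there, with "_" never counting as a match.

```python
def representative_pattern(sequences):
    """Keep the bin letter where all sequences agree, '_' elsewhere."""
    return "".join(
        column[0] if len(set(column)) == 1 else "_" for column in zip(*sequences)
    )


def common_positions(pattern_a, pattern_b):
    """Count positions where both patterns carry the same bin (ignoring '_')."""
    return sum(1 for a, b in zip(pattern_a, pattern_b) if a == b and a != "_")


# Two hypothetical leaf clusters with four topics each.
cluster_a = ["AJJC", "AJJD"]
cluster_b = ["AJIC", "AJIC"]

p_a = representative_pattern(cluster_a)
p_b = representative_pattern(cluster_b)
print(p_a, p_b, common_positions(p_a, p_b))
```

The grouped-bin pattern pS could be obtained in the same way after first replacing every letter by a label of its bin group.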
After the graph construction, the graph is clustered into sub-groups. Graph clustering is the task of grouping the graph nodes into clusters taking into consideration the weights of the edges, in such a way that there are many high-weighted edges within each node-cluster and relatively low-weighted edges between the node-clusters. The graph can be clustered using a user-selected threshold thrG, excluding from the graph all those edges that have a weight smaller than thrG. This threshold is expressed as a percentage and can be selected by observing the distribution of the weights' values. If the user wants to export a specific number of clusters, then a graph merging procedure can be applied. As described in Algorithm 3, the clustered graph is composed of sub-graphs (SGs). By sorting the weights in descending order, the most similar and strongly connected SGs can be merged by assigning each node to the central node of the SG to which it belongs and forming merged clusters, until the desired number of clusters is reached.
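The threshold step can be sketched in a few lines. The sketch assumes that, once the edges lighter than thrG have been removed, the sub-graphs are simply the connected components of what remains; the merging procedure of Algorithm 3 is not covered. The edge weights and the function name are hypothetical.

```python
def threshold_components(weights, thr_g):
    """Drop edges with weight < thr_g and return the connected components.

    weights maps an undirected edge (u, v) to its similarity weight.
    """
    nodes = sorted({node for edge in weights for node in edge})
    adjacency = {node: set() for node in nodes}
    for (u, v), w in weights.items():
        if w >= thr_g:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collecting every node reachable from `start`.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# Fully connected toy graph over four leaf clusters (weights are made up).
weights = {
    (1, 2): 0.9, (1, 3): 0.2, (1, 4): 0.1,
    (2, 3): 0.3, (2, 4): 0.1, (3, 4): 0.8,
}
sub_graphs = threshold_components(weights, thr_g=0.5)
print(sub_graphs)
```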

Experiments and evaluation
In this section the datasets, the external evaluation measures and the four sets of experiments performed to evaluate and validate the proposed framework are presented.

Experimental setup

Datasets
In order to evaluate the effectiveness of the proposed clustering method on text documents, we used various datasets from several domains such as sentiment analysis, news articles, medical documents, web pages and abstracts provided by [43,44]. More specifically, we used 23 benchmark datasets from [43] in order to test the accuracy of our framework, with the smallest and the largest ones consisting of 204 and 18,808 documents, respectively. The number of actual classes of these documents varies from 4 to 51. We also used two big datasets from [44], the NYTimes news articles and the PubMed abstracts, in order to evaluate the performance of the method in terms of computational time and memory usage. All these datasets are summarized in Table 1, which also shows the number of terms of the original documents, i.e. the number of different words, and the final number of terms after the preprocessing.
In order to apply the proposed clustering procedure, the datasets were preprocessed in the preprocessing module and were then transformed into numeric vectors using topic modeling. As a use case in this paper, we utilized the LDA method in the data transformation module. The number of topics N_θ that represent the documents was set to 20, after experimenting with values of N_θ from 5 to 500 and evaluating the results using the perplexity metric, as described in our previous work [6]. Thus, for each document of the datasets, we created topic vectors of length 20.
The document vectors were then discretized into 10 bins represented by alphabetic letters from A to J, so that each document is represented by a sequence of characters. The bin with the highest percentage is represented by A, whereas the one with the lowest percentage is represented by J. In order to create the FSM, the groups of similar bins that were used are non-overlapping and are given by pairing bins in descending order, i.e. < A, B >, < C, D >, < E, F >, < G, H >.
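The discretization into letters can be sketched as follows. The sketch assumes ten equal-width intervals over [0, 1], which the text does not state explicitly; the function name and the sample vector are hypothetical. Following the convention above, A denotes the highest interval and J the lowest.

```python
import string


def discretize(vector, n_bins=10):
    """Map each topic weight in [0, 1] to a bin letter ('A' = highest interval).

    Assumption: the bins are equal-width intervals over [0, 1].
    """
    letters = string.ascii_uppercase[:n_bins]
    sequence = []
    for weight in vector:
        # Index 0 is the lowest interval; a weight of exactly 1.0 joins the top bin.
        index = min(int(weight * n_bins), n_bins - 1)
        sequence.append(letters[n_bins - 1 - index])
    return "".join(sequence)


# Hypothetical four-topic document vector.
print(discretize([0.95, 0.0, 0.05, 0.31]))
```

A coarser representation is obtained by lowering `n_bins`, which trades information for shorter alphabets.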

Evaluation measures
Two external evaluation metrics were used to evaluate the effectiveness of the clustering procedure: FScore and Topic Similarity.

FScore
When the true class of each document is known, FScore can be used to measure the accuracy of the clustering results. A commonly used technique to measure FScore in hierarchical clustering is to take into account the overall set of clusters that are represented in the hierarchical tree. In this paper, we use the FScore introduced by [45]. Given a particular class L_r of size n_r and a particular cluster C_i of size n_i, and assuming that n_ri documents of cluster C_i belong to the real class L_r, the FScore of this class and cluster is given by (2):

$$F(L_r, C_i) = \frac{2 \, R(L_r, C_i) \, P(L_r, C_i)}{R(L_r, C_i) + P(L_r, C_i)} \qquad (2)$$

where R(L_r, C_i) is the recall value defined as n_ri/n_r, and P(L_r, C_i) is the precision value defined as n_ri/n_i for the class L_r and the cluster C_i. The FScore of the class L_r is the maximum FScore value attained at any node in the hierarchical clustering tree T, that is,

$$FScore(L_r) = \max_{C_i \in T} F(L_r, C_i) \qquad (3)$$

The FScore of the entire clustering solution is then defined to be the sum of the individual class FScores weighted according to the class size:

$$FScore = \sum_{r=1}^{c} \frac{n_r}{n} \, FScore(L_r) \qquad (4)$$

where c is the total number of classes and n is the total number of documents. The higher the FScore values, the better the clustering solution is.
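The computation can be sketched on a toy tree. The sketch uses recall n_ri/n_r and precision n_ri/n_i as defined above and assumes that the per-cluster score is their usual harmonic mean; the class score is the maximum over all tree nodes and the overall score is the size-weighted sum. The document ids, labels and tree nodes are hypothetical.

```python
def f_measure(n_ri, n_r, n_i):
    """Harmonic mean of recall (n_ri / n_r) and precision (n_ri / n_i)."""
    if n_ri == 0:
        return 0.0
    recall = n_ri / n_r
    precision = n_ri / n_i
    return 2 * recall * precision / (recall + precision)


def fscore(labels, tree_clusters):
    """labels: {doc: class}; tree_clusters: one set of docs per node of the tree."""
    n = len(labels)
    classes = {}
    for doc, cls in labels.items():
        classes.setdefault(cls, set()).add(doc)

    total = 0.0
    for members in classes.values():
        # Best-matching node anywhere in the tree for this class.
        best = max(
            f_measure(len(members & cluster), len(members), len(cluster))
            for cluster in tree_clusters
        )
        total += len(members) / n * best
    return total


labels = {"d1": "x", "d2": "x", "d3": "x", "d4": "y", "d5": "y"}
tree_clusters = [
    {"d1", "d2", "d3", "d4", "d5"},   # root
    {"d1", "d2", "d3", "d4"}, {"d5"},  # level 1
    {"d1", "d2"}, {"d3", "d4"},        # level 2
]
score = fscore(labels, tree_clusters)
print(round(score, 4))
```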

Topic similarity
Due to the sparsity of the frequency matrix of each cluster and the fact that each cluster is characterized by only a few topics, we evaluated the clustering results by calculating the semantic similarity between major topics of each cluster. The topic similarity is extracted using semantic analysis of the topics that were derived from topic modeling. Semantic similarity, in contrast to string-based matching can identify semantically relevant concepts that consist of different strings. More specifically, semantic similarity is a metric that is used to measure the distances between a set of terms contained in documents based on their meaning or semantic concept. Many techniques to compute semantic similarities of words are reported in the literature. Using Word Embeddings such as Google's Word2Vec, or a semantic net such as WordNet are common techniques to compute semantic similarity.
Word2vec: Word2vec [46] is a group of models that are used to produce word embeddings. These models are neural networks that are trained to learn high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. Word2vec takes as its input a large corpus of text and produces a vector space, with each unique word in the corpus being assigned a corresponding vector in the space. Words that share common contexts in the corpus are located close to one another in the space. The similarity between two vectors is defined as the cosine of the angle between them. To compute topic similarity, we use an R implementation of Word2vec to train a model for each dataset by making use of the documents' descriptions. The similarity between two documents of a dataset is computed using the cosine similarity between the topic vectors that have been extracted after topic modeling.
WordNet: Making use of a lexical taxonomy (i.e. WordNet) to define distances between concepts is another commonly used technique. WordNet [37,47,48] is a large lexical database of English with words grouped into sets of synonyms (synsets). Nouns, verbs, adjectives and adverbs are grouped into synsets, each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. There are many different distance metrics that make use of the WordNet taxonomy to obtain semantic similarities. In this work, in order to calculate the similarity between two words, we use the Resnik distance [49], where the information content of a word is defined as the negative logarithm of the probability of finding the word in a given corpus. This metric only considers the information content of the lowest common level in the hierarchy, i.e. the shared concept in the taxonomy which has the shortest distance from the concepts compared.
Given that each topic i is represented by a set of words θ i , in order to compute the topic similarity between two topics i and j, we first obtained the pairwise similarities between all the words contained in θ i and θ j using the Resnik distance. To compute the overall matching score between the two topics, i.e. the pairwise Topic Similarity (TS i,j ), we used the matching average method (5) [50], which calculates the similarity between two topics θ i and θ j by dividing the sum of similarity values of all match candidates of both sets by the total number of set tokens. More specifically, the Match(θ i , θ j ) function of the equation counts the number of highly similar words of the two topics, i.e. the number of words that have a Resnik similarity higher than the threshold of 1. By employing (5), an N θ × N θ similarity matrix with the pairwise TS between all the N θ topics was created.
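The matching average computation can be sketched as follows. Since equation (5) is not reproduced in this section, the sketch assumes one common reading of it: the number of words in either topic that have a highly similar counterpart in the other topic, divided by the total number of words in both topics. The word-level similarity is passed in as a function; the toy lookup table below is hypothetical and stands in for a real Resnik similarity over WordNet.

```python
def topic_similarity(theta_i, theta_j, word_sim, threshold=1.0):
    """Matching-average score between two topics given as lists of words.

    Assumed form: (matched words in theta_i + matched words in theta_j)
    divided by (len(theta_i) + len(theta_j)).
    """
    # Words of theta_i with at least one highly similar word in theta_j.
    matched_i = sum(
        any(word_sim(a, b) > threshold for b in theta_j) for a in theta_i
    )
    # Words of theta_j with at least one highly similar word in theta_i.
    matched_j = sum(
        any(word_sim(a, b) > threshold for a in theta_i) for b in theta_j
    )
    return (matched_i + matched_j) / (len(theta_i) + len(theta_j))


def similarity_matrix(topics, word_sim, threshold=1.0):
    """Pairwise TS values for a list of topics, as a nested list."""
    return [
        [topic_similarity(ti, tj, word_sim, threshold) for tj in topics]
        for ti in topics
    ]


# Hypothetical word similarities (NOT real Resnik values).
TOY_SIM = {
    frozenset(("car", "truck")): 2.5,
    frozenset(("doctor", "nurse")): 3.1,
}


def toy_word_sim(a, b):
    if a == b:
        return 10.0  # assumption: identical words count as highly similar
    return TOY_SIM.get(frozenset((a, b)), 0.0)


topics = [["car", "doctor", "sky"], ["truck", "nurse"]]
print(similarity_matrix(topics, toy_word_sim))
```

Passing the word similarity as a parameter keeps the aggregation independent of the underlying measure, so the same function could be used with either a WordNet-based or an embedding-based similarity, given a suitable threshold.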

Results and discussion
We performed a number of experiments to evaluate the effectiveness and the performance of our framework. This section is divided into five parts: a) the comparison against baseline hierarchical clustering algorithms in terms of effectiveness is discussed in "Effectiveness evaluation" section, b) the comparison against a baseline divisive hierarchical clustering algorithm in terms of memory usage and computational time is discussed in "Performance statistical evaluation" section, c) the performance experiments of the proposed method running in the cloud are discussed in "Performance testing in the cloud" section, d) the complexity analysis is presented in "Complexity analysis" section, and e) the application of the overall framework presented in "A new document clustering framework" section to the NYTimes dataset is discussed in "Experimental results on the NYTimes dataset" section.

Effectiveness evaluation
The first set of experiments focused on evaluating the quality of the proposed Frequency based hierarchical clustering (FBHC) method, by experimenting on the first 10 datasets described in Table 1. The effectiveness of the FBHC was examined using the external metrics TS and FScore, and comparing the results with baseline hierarchical clustering algorithms implemented in the R language. Both divisive (Diana, https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/diana.html) and agglomerative (Average, Single, Complete and Ward, https://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html) hierarchical clustering algorithms were used as baselines.
In Table 2, the average FScore and TS values of the proposed algorithm and the baseline algorithms are presented. The best result achieved for each dataset is highlighted in boldface, whereas the second highest result is presented in italics. The FScore was calculated taking into account the whole hierarchy structures that were created by the compared algorithms. For most of the datasets used in the experimental analysis, the highest FScore values are obtained by the proposed FBHC algorithm. For the LATimes, Oh10 and Dmoz-Computers datasets, Ward has a higher FScore and the FBHC algorithm comes second with a small difference, whereas for the Reviews dataset Diana comes first and FBHC comes second.
The average TS values were calculated on the final clusters, whose number was set equal to the number of actual classes for each dataset. To obtain the final clusters of the dendrogram trees that were constructed using the baseline algorithms, the cutree R function was used, whereas for the FBHC method, the branch breaking algorithm followed by the meta-clustering module was applied, experimenting with different thresholds until the desired number of clusters was obtained.
To compute the average TS value for each dataset presented in Table 2, we extracted the major topics i, j of each cluster, computed the TS i,j values using WordNet and (5) for all the clusters, and took the average. Instead of WordNet, we could also use Word2vec to calculate TS i,j . However, the results in Table 2 would remain the same, as we present the difference TS − TSActual. Topic Similarity was calculated only for those clusters that contain more than 5 elements and include at least one major topic. Furthermore, TS was calculated only for those datasets whose actual classes are characterized by major topics. The Tr31, Las2s, Tr12, Tr11, Tr45, Tr41, Oh10, Re0 and Re1 datasets do not follow the rule described in "Topic similarity" section, hence most of their clusters have NA values for the TS metric. The maximum value that TS may assume is 1, which indicates that each one of the clusters is characterized by a unique major topic. The Single method failed to create clusters with major topics, because it assigned most of the elements to one cluster, with the rest of the clusters containing only one element each. Table 2 shows that the FBHC method usually produces TS values closer to the actual ones, compared to the other methods.

Performance statistical evaluation
The second set of experiments focused on evaluating the performance of the proposed clustering method in terms of memory usage and computational time. The experiments were run with R on a computer with an Intel Core i7 CPU at 3.40 GHz with 8 cores and 24 GB RAM, using one core. The Frequency based Hierarchical Clustering (FBHC) algorithm was compared to the Baseline divisive Hierarchical Clustering algorithm (BHC) Diana. Figure 2 makes clear that, for subsets of the NYTimes dataset of different sizes, the BHC algorithm has much higher memory demands. For the experiment with N equal to 50,000 documents, the BHC algorithm ran for 11 days before it aborted with an "out of memory" error. Additional results can be found in Tables 3 and 4, which report, for each subset size, the average memory usage and computational time of both the FBHC and BHC algorithms, together with the corresponding results of the statistical evaluation of these values. The statistical evaluation is performed to verify that the difference in performance between our proposed algorithm and the baseline one is significant. This was necessary because the memory usage and computational time of the baseline algorithms varied in each execution. The statistical tests allow the results to be generalized.
In the statistical test, the null hypothesis is that using the BHC algorithm instead of the FBHC one achieves better performance in terms of memory usage and computational time. To determine whether this hypothesis must be rejected, a statistical hypothesis test, namely the t-test, is used (more details about the statistical method can be found in [51]). Tables 3 and 4 report the Degrees of Freedom (DF), i.e. the amount of information in the data, the 95% confidence interval of the differences, the average values of the differences, the t-test value and finally the probability value (p-value), which is used to decide on the statistical significance of the result. According to the reported results, the p-values for all subsets never exceed α = 0.05, which means that the null hypothesis must be rejected and that the alternative hypothesis, i.e. that FBHC performs better, is supported. Tables 3 and 4 make clear that as the number of documents increases, the absolute value of the t-value for both memory usage and computational time increases, except for the first run (the subset with the smallest size), where the t-value of the memory usage was extremely high. This means that the difference between the performance of the two methods becomes more and more statistically significant as the number of documents increases. For 25,000 documents, our method achieved over 99% reduction in both memory usage and computational time.
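The quantities reported in Tables 3 and 4 can be illustrated with a short sketch. It assumes a paired design (one BHC and one FBHC measurement per run), which is suggested by the reported "differences" but is not stated explicitly in the text; the measurements below are invented for illustration and are not the paper's data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Mean difference, t statistic and degrees of freedom for paired samples."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    d_bar = mean(diffs)                  # average value of the differences
    std_err = stdev(diffs) / sqrt(n)     # standard error of the mean difference
    return d_bar, d_bar / std_err, n - 1


# Hypothetical memory usage (GB) over four runs; NOT measured values.
bhc_runs = [10.0, 12.0, 11.0, 13.0]
fbhc_runs = [1.0, 2.0, 1.0, 2.0]

d_bar, t_value, df = paired_t(bhc_runs, fbhc_runs)
print(f"mean difference = {d_bar:.2f}, t = {t_value:.2f}, DF = {df}")
```

The p-value then follows from the Student t distribution with DF degrees of freedom. A large positive t-value, as in this toy example, corresponds to a small one-sided p-value and hence to rejecting the hypothesis that BHC performs better.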

Performance testing in the cloud
The third set of experiments focused on evaluating the performance of the proposed clustering method in the cloud. In this round of experiments, we used the biggest dataset of our collection, the PubMed dataset, which contains 8 million documents. In order to test different cloud resource configurations, we built a Docker image of the proposed clustering algorithm. The image is publicly available on Docker Hub (mariakotouza/fbc:pubmed) and includes all the subset datasets that were used in these experiments.
The Docker image was run as a container on three different configurations: a) the local computer used in the second round of experiments, b) a server with the following specifications: Ubuntu 18.04.3 LTS (kernel 4.15.0.58-generic), 2 x Intel Xeon X5650 @ 2.67 GHz with 16 cores and 118 GB RAM, and c) a configuration provided by the Okeanos national cloud infrastructure, with the following specifications: VMs with Ubuntu server 18.04, Intel(R) Xeon(R) CPU E5-2650 v3 2.30GHz with 4 cores and 16 GB RAM.
The scalability of our algorithm can be observed in Fig. 3, where different numbers of CPUs of the local computer, the cloud resources and the server were used for each subset. The X axis represents the number of CPUs, and the Y axis represents the execution time in seconds. Each line in the figure corresponds to a different subset size N. Comparing the three plots in the figure, we observe that the computational time is highly affected by the available hardware. As for the memory usage, the demands for each core that is used are the same as those presented in Fig. 4. Figure 4 shows the results for memory usage and computational time for different subsets of the dataset running on the local computer using one core. The figure makes clear that both metrics follow a linear model with the complexity being equal to O(N), which means that the running time and the memory usage increase at most linearly with the size of the input N.

Complexity analysis
The same result can be obtained using a theoretical analysis to estimate the computational cost of analyzing datasets of different sizes. Table 5 shows the expected computational cost of various clustering algorithms applied to the corresponding datasets and executed on the local computer using 1 core (the results of FBHC running on 8 cores are also shown). The hypothetical values were predicted by a regression model trained with the number of documents (N) as the X variable and the corresponding computational time (T) as the Y variable; the training values were measured on the PubMed dataset and are depicted in Fig. 4.
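The extrapolation described above can be sketched with an ordinary least-squares line fitted to (N, T) pairs. The linear form is an assumption that follows from the O(N) behaviour reported for FBHC; the text does not specify which regression model was used, and the timings below are invented for illustration only.

```python
def fit_line(ns, ts):
    """Ordinary least-squares fit of T = slope * N + intercept."""
    n_mean = sum(ns) / len(ns)
    t_mean = sum(ts) / len(ts)
    covariance = sum((n - n_mean) * (t - t_mean) for n, t in zip(ns, ts))
    variance = sum((n - n_mean) ** 2 for n in ns)
    slope = covariance / variance
    intercept = t_mean - slope * n_mean
    return slope, intercept


def predict_time(n_docs, slope, intercept):
    """Expected computational time for a dataset with n_docs documents."""
    return slope * n_docs + intercept


# Hypothetical measurements: subset sizes and run times in seconds.
sizes = [1000, 2000, 4000, 8000]
times = [2.0, 4.0, 8.0, 16.0]

slope, intercept = fit_line(sizes, times)
print(predict_time(1_000_000, slope, intercept))
```

For algorithms with super-linear cost, the same approach would need a different model, for example a fit on transformed variables, which is why predicted values of this kind should be read as estimates rather than measurements.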
The computational complexity of our proposed algorithm was compared to state-of-the-art hierarchical clustering procedures, including a clustering method where points are inserted greedily using the node statistics, which is ideal for large datasets. The values presented in Table 5 for these methods are hypothetical, as are those for the FBHC algorithm. The results indicate once again that the FBHC algorithm outperforms the rest of the methods in terms of computational time. The second-best algorithm that scales to a large number of documents is the Birch algorithm. However, that algorithm is not scalable in terms of memory usage, as we were not able to run it on the local computer for datasets consisting of more than 80,000 documents due to memory limitations.

Experimental results on the NYTimes dataset
The last round of experiments includes the application of our proposed hierarchical clustering framework on the NYTimes dataset. Using the binary tree construction algorithm of type similarity algo, a binary tree with 23 levels and 1965 leaf clusters was constructed. Table 6 shows that the Identity and Similarity metrics began with 0 values at the root of the tree, whereas the Entropy metric began with 0.32. These values improved when descending the tree levels, until at the leaf level the Similarity value was equal to 100% while the Entropy was equal to 0. During the second phase, the tree was pruned by applying the branch breaking algorithm using 0.5% as the threshold for all the comparisons of the metrics. The final tree consists of 20 levels and 58 leaf clusters. The average values of each level's metrics using the FM and the FSM matrices are summarized in Table 7. The table shows that the identity value increased towards the leaves of the tree. Notably, when groups of similar bins are used instead of the bins themselves, the similarity value (IS) is a little higher, as expected. The values of the Topic Similarity (TS) metric, which is discussed in "Topic similarity" section, are also included in the table.
During the meta-clustering phase of the procedure, a graph is constructed using all the leaf clusters that have been formed after the branch breaking algorithm. The hierarchy structure of the clusters is presented in Fig. 5, where similar clusters are depicted using characteristic colors. The graph is clustered using a threshold equal to 10%, removing all the edges that were connecting the most dissimilar clusters. Figure 6 presents the fully connected and the clustered graphs. Evidently, most of the big clusters do not have similarities with other clusters, but some smaller clusters like 12, 35, 36, 57, 58, 89, 90, 131, 132 could be merged with the cluster 87 due to the high connectivity that is observed. The aforementioned clusters that were given as an example for merging are presented in Fig. 5 in red.
Fig. 5 The binary tree of the NYTimes dataset. The most similar leaf clusters that were discovered during the graph construction module are presented using the same colors
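The edge-removal step of the meta-clustering phase can be sketched as thresholding a weighted graph and reading off its connected components. This is a minimal illustration under stated assumptions: edge weights are taken to be similarities in [0, 1], and edges below the threshold are dropped. How the 10% threshold is actually applied in the framework is not specified in this section, and the similarity values below are invented; only the cluster identifiers are borrowed from the example in the text.

```python
def cluster_graph(nodes, edges, threshold):
    """Drop edges with similarity below threshold, return connected components.

    nodes: iterable of cluster identifiers.
    edges: dict mapping (u, v) pairs to a similarity in [0, 1] (assumption).
    """
    adjacency = {node: set() for node in nodes}
    for (u, v), similarity in edges.items():
        if similarity >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collects every node reachable from start.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical similarities between four leaf clusters.
leaf_clusters = [12, 35, 87, 40]
similarities = {(12, 87): 0.4, (35, 87): 0.3, (12, 40): 0.05}
print(cluster_graph(leaf_clusters, similarities, threshold=0.10))
```

Each component with more than one member is a candidate group of leaf clusters for merging, while singleton components correspond to clusters with no sufficiently similar neighbour.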

Conclusion
In this paper, we presented a new scalable multi-metric hierarchical clustering framework for document clustering. The input documents are preprocessed and transformed into feature vectors using topic modeling, and afterwards they are discretized, forming sequences of characters. The clustering method is composed of three distinct phases: the binary tree construction algorithm, the branch breaking algorithm, and a meta-clustering module for generating graphical representations of the output. The metrics that are used to form the clusters include Identity, Similarity, Entropy and Bin Similarity. The clustering method exhibits a high degree of parallelism, and several sub-processes can be distributed across multiple CPUs to speed up the whole process. It is also dockerized, to enable execution in almost any configuration in the cloud. Using this frequency-based approach to perform hierarchical document clustering, many limitations on computational time and memory usage that arise as the number of documents increases can be overcome. Our algorithm has increased scalability compared to existing hierarchical clustering algorithms, because it uses frequency tables to form the clusters instead of making pairwise comparisons between all the elements of the dataset. A series of efficiency and performance evaluation experiments have shown considerable reduction in both execution times and memory requirements over a wide variety of publicly available document sets and cloud infrastructures.
A limitation of our proposed method may be the information loss that comes from the data discretization module, but it is up to the users to select the number of bins B in such a way that the amount of information that is considered by the model is sufficient, depending on the problem. Considering the effectiveness of the proposed method in the cloud, future work involves further parallelization of the clustering algorithm in order to optimize the use of allocated resources in the cloud, including GPU usage. Moreover, the proposed framework could be extended to handle real-time applications running in the cloud that demand new document categorization. This could be done by implementing a decision-making algorithm that exploits the hierarchy of the clusters to categorize new documents into the existing clusters.