Efficient parallel spectral clustering algorithm design for large data sets under cloud computing environment

Jin, Ran; Kou, Chunhai; Liu, Ruijuan; Li, Yefeng

doi:10.1186/2192-113X-2-18

Research
Open access
Published: 07 November 2013

Ran Jin^1,2,
Chunhai Kou¹,
Ruijuan Liu¹ &
…
Yefeng Li¹

Journal of Cloud Computing: Advances, Systems and Applications volume 2, Article number: 18 (2013)

Abstract

Spectral clustering has proved to be more effective than most traditional algorithms at finding clusters. However, its high computational complexity limits its usefulness in practical applications. This paper combines spectral clustering with MapReduce: through sparse-matrix eigenvalue evaluation and distributed computation on a cluster, it puts forward improvement ideas and a concrete implementation, and thereby improves the speed of the spectral clustering algorithm. According to the experiments, as the scale of the processed data grows, the clustering rate shows nearly linear growth, and the proposed parallel spectral clustering algorithm is suitable for large-scale data mining. The results provide a research basis for designing better, highly efficient clustering partition algorithms for large data.

Introduction

Cluster analysis is an important and active research field in data mining that concerns the classification of data objects. Cluster analysis techniques are needed to describe and understand data objects conveniently and to extract the inherent information or knowledge hidden in the data. The main idea is to divide the data into several classes or clusters so that objects in the same cluster are as similar as possible while objects in different clusters differ greatly. Broadly, clustering algorithms can be divided into partition methods, hierarchical methods, density methods, model methods and so on [1]. Traditional clustering algorithms generally have the following drawbacks: low clustering efficiency, long processing time on large data, and difficulty in achieving the expected effect. In response, a popular research idea has formed: combine cluster analysis, parallel computing and cloud computing to design an efficient parallel clustering algorithm [2, 3]. This paper adopts the classical spectral clustering algorithm as the foundation for a clustering partition algorithm for large-scale data, and analyzes how to extract valuable, understandable information from large data rapidly, efficiently and at low cost. Parallel computing uses multiple computing resources simultaneously to solve a computational problem, which speeds up program execution and saves investment. With a cluster, many cheap computers can replace expensive servers, so data mining services in a parallel computing environment greatly reduce data processing costs. In addition, cloud computing can provide scalability, reliability and stability when running large-scale applications in a virtual computing environment. Given the characteristics of large cloud computing applications, namely distribution, heterogeneity and massive data, it is well suited to data-intensive applications and processing [3, 4].

Cluster analysis commonly faces the following problems: difficulty in handling massive and distributed data, difficulty in determining parameters, low efficiency and poor clustering quality. In recent years, some researchers have focused on how to accelerate the spectral clustering algorithm [5–12]. Fowlkes et al. proposed using the Nyström approximation to avoid calculating the whole similarity matrix; that is, they trade accurate similarity values for shorter computation time. Dhillon et al. presupposed the availability of the similarity matrix and proposed a method that does not use eigenvectors. Although these methods reduce computation time, they trade clustering accuracy for speed, and they do not address the memory bottleneck. To overcome the memory capacity limit and the computational bottleneck, researchers such as Yang used MPI (Message Passing Interface) to build a distributed environment. However, the MPI mechanism increases the communication cost between machines and over the network. More importantly, an MPI implementation is more complex to program: the communication of the whole cluster has to be controlled explicitly, which is less convenient than Hadoop. Hadoop also offers better fault tolerance. To make the algorithm work on massive data, researchers such as Meng proposed sparsifying the matrix with the nearest-neighbor method and then used the sparsified matrix in a parallel implementation of spectral clustering. Experiments on document data showed that the algorithm can cope effectively with massive data.
In this paper, we first compute the similarity matrix and sparsify it, partitioning the work by data point identifier. We then store the Laplacian matrix in the distributed file system HDFS and use a distributed, parallel Lanczos computation to obtain the eigenvectors. Finally, we obtain the clustering results by running an efficient parallel K-means clustering on the transposed matrix of the eigenvectors. A different parallel strategy is used at each step, so the whole algorithm is accelerated.

The paper is organized as follows. Section Relevant concepts and description briefly introduces the MapReduce paradigm and reviews the traditional spectral clustering algorithm. Section Parallel spectral clustering algorithm design based on Hadoop presents our design and implementation of PSCA (Parallel Spectral Clustering Algorithm). Section The analysis of experiment and result presents the performance evaluation. Section Conclusion draws conclusions and discusses future work.

Relevant concepts and description

As the above analysis shows, the parallel algorithm design is based on Hadoop, so the user's main job is to design and implement the Map and Reduce functions, including the types of the input and output < key, value > pairs and the specific logic of the Map and Reduce functions.

Hadoop platform

With the appearance of Google's MapReduce distributed platform, computations of high complexity can be completed in acceptable time. Based on the ideas of MapReduce, the Apache foundation developed the open source Hadoop project. Hadoop's distributed computing framework can be used to construct a cloud computing (distributed computing) environment. Computation can be distributed evenly over many computing nodes in the cluster, which provides huge computing capacity for large data. Hadoop has high data throughput and offers high fault tolerance, high reliability and scalability. It is composed of two main parts: HDFS (a distributed file system) and the MapReduce programming model. By combining the traditional serial spectral clustering algorithm with the MapReduce programming model and adopting a corresponding parallel strategy, the algorithm can be ported to the Hadoop platform for distributed data mining. The key problem in applying the Hadoop platform to data mining algorithms, however, is how to parallelize the traditional data mining algorithm [13]. The MapReduce (map and reduce) programming model lets the user develop distributed programs conveniently without caring about the underlying details. Throughout its operation, the MapReduce model uses < key, value > pairs as the form of its input and output. It simplifies the programming model of parallel computing and exposes only the necessary interfaces to upper-level users. The stages of the MapReduce computation model work as follows:

(1) Input: An application based on the Hadoop platform and the MapReduce framework usually needs a pair of Map and Reduce functions, supplied by implementing the appropriate interface or extending an abstract class. It should also specify the input and output locations and other operating parameters. This stage divides the data under the input directory into several independent data blocks.
(2) Map: The MapReduce framework treats the application input as a set of < key, value > pairs. At this stage, the framework calls the user-defined Map function on each < key, value > pair, which creates new intermediate < key, value > pairs. The types of the input pairs and the intermediate pairs may differ.
(3) Shuffle: To ensure that the input to Reduce is the sorted output of Map, the framework fetches, over HTTP, all relevant < key, value > pairs in the Map output for each Reduce task. The MapReduce framework then groups the Reduce input by key (different Map tasks may output the same key).
(4) Reduce: This stage traverses the intermediate data and executes the user-defined Reduce function for each unique key. The input is < key, {list of values} > and the output is new < key, value > pairs.
(5) Output: This stage writes the Reduce output to the designated location in the output directory. This completes a typical MapReduce process.
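The Map, Shuffle and Reduce stages listed above can be imitated in a single process. The following is a minimal sketch, not Hadoop code: `run_mapreduce`, `count_map` and `count_reduce` are hypothetical names introduced for illustration, and word counting is used only because it is a familiar example, not because the paper uses it.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the Map, Shuffle and Reduce stages in one process."""
    # Map: each input <key, value> pair yields intermediate <key, value> pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: one call per unique key, visited in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def count_map(line_number, line):
    # Emit <word, 1> for every word in the line.
    return [(word, 1) for word in line.split()]

def count_reduce(word, counts):
    # Sum the list of values collected for one key.
    return [(word, sum(counts))]

result = run_mapreduce([(1, "a b a"), (2, "b c")], count_map, count_reduce)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```

In a real Hadoop job the Map and Reduce calls run on different nodes and the framework performs the shuffle; only the two user functions would be written by hand.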

Traditional spectral clustering algorithm

Spectral clustering is a pairwise clustering algorithm. It was first used in computer vision, VLSI design and other fields, was then adopted in machine learning, and rapidly became a research focus in the international machine learning community. It has very promising prospects for data clustering. The idea of the algorithm derives from spectral graph partitioning theory. If each data sample is regarded as a vertex in V, and each edge in E between two vertices is given a weight W according to the similarity between the samples, we obtain an undirected weighted graph G = (V, E) based on similarity. The clustering problem is thereby transformed into a partition problem on the graph G. The optimal partition criterion based on graph theory is to maximize the similarity within each subgraph and minimize the similarity between subgraphs.

The standard serial spectral clustering algorithm steps are as follows:

(1) Compute the similarity matrix $S \in R^{n \times n}$ and sparsify it;
(2) Construct the diagonal matrix D;
(3) Compute the normalized Laplacian matrix L;
(4) Compute the k smallest eigenvectors of L and form the matrix $Z \in R^{n \times k}$ whose columns are these k eigenvectors;
(5) Normalize Z to obtain $Y \in R^{n \times k}$;
(6) Use the K-means algorithm to cluster the data points $y_{i} \in R^{k}$ (i = 1, …, n) into k clusters.
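The steps above do not give explicit formulas for D, L or Y. One widely used formulation of normalized spectral clustering, assumed here for illustration and not confirmed by the text, is:

```latex
D_{ii} = \sum_{j=1}^{n} S_{ij}, \qquad
L = I - D^{-1/2} S D^{-1/2}, \qquad
Y_{ij} = \frac{Z_{ij}}{\sqrt{\sum_{l=1}^{k} Z_{il}^{2}}}
```

Under this assumption, D holds the row sums of S, the eigenvectors used in step (4) belong to the k smallest eigenvalues of L, and step (5) scales each row of Z to unit length.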

Parallel spectral clustering algorithm design based on Hadoop

In the standard serial spectral clustering algorithm, the computational cost is concentrated in three places: constructing the similarity matrix, computing the k smallest eigenvectors of the Laplacian matrix, and the K-means clustering. The parallel design of the spectral clustering algorithm addresses these three aspects.

Parallel computation of the similarity matrix

Because Hadoop MapReduce provides an outstanding distributed computing framework, we implement our parallel spectral clustering algorithm on it. First, we put the data points x₁, …, x_n into an HBase table that every machine can access; the row key of each data point x_i is its subscript i ∈ {1, …, n}. Then we use a Reduce function to distribute the computation of similarity values between data points automatically. For each data point x_i with identifier i, the Reduce function computes only the similarity values between x_i and the data points x_j whose subscripts are equal to or larger than i (j = i, …, n). We call this “the similar value calculation of subscript i”. In this way, the similarity value of each pair of data points is calculated only once. Clearly, “the similar value calculation of subscript i” is independent of the similar value calculations of other subscripts. Therefore, if we assign different subscripts to different machines, the similar value calculations can run in a distributed environment.

Specifically, “similar value calculation of subscript i” must compute the similarity values of the n − i + 1 data point pairs {< x_i, x_i >, < x_i, x_{i+1} >, …, < x_i, x_n >}. That is, subscript 1 must compute the similarity values of n data point pairs, while the last subscript n computes only one, namely < x_n, x_n >. To balance the load, we group “similar value calculation of subscript 1” with “similar value calculation of subscript n”, “similar value calculation of subscript 2” with “similar value calculation of subscript n − 1”, and so on (see Figure 1), so that each group of two subscripts computes n + 1 pairs. When the similarity values have been calculated, they are written back to the HBase table, where they are used to compute the Laplacian matrix in later steps. The parallel construction of the similarity matrix is shown in Algorithm 1.

Algorithm 1 Reduce function for parallel construction of the similarity matrix

Input: < key, value >, where key is the subscript of a data point and value is null.

Output: < key', value' > = < key, value >

1. index = key, anotherIndex = n − key + 1
2. For i in {index, anotherIndex}
       i_content = getContentFromHBase(i);
       For j = i to n do
           j_content = getContentFromHBase(j);
           sim = computeSimilarity(i_content, j_content);
           storeSimilarity(i, j, sim) into HBase table;
       End For
   End For
3. Output < key, null >
4. End.
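The effect of pairing subscript key with subscript n − key + 1 in Algorithm 1 can be checked with a short count of the similarity pairs each key has to compute. This is a sketch for illustration only: `paired_workloads` is a hypothetical helper, no HBase access is involved, and the assumption that keys run from 1 to ⌈n/2⌉ is ours, since the source does not state the key range.

```python
def paired_workloads(n):
    """Return {key: number of similarity pairs} when subscript `key`
    is grouped with subscript n - key + 1."""
    loads = {}
    for key in range(1, (n + 1) // 2 + 1):
        other = n - key + 1
        # A set, so the middle subscript of an odd n is not counted twice.
        subscripts = {key, other}
        # Subscript i handles the pairs (i, i), (i, i+1), ..., (i, n).
        loads[key] = sum(n - i + 1 for i in subscripts)
    return loads

loads = paired_workloads(6)
print(loads)  # {1: 7, 2: 7, 3: 7}
```

For n = 6 every key computes n + 1 = 7 pairs, and the loads add up to the n(n + 1)/2 = 21 pairs of the upper triangle of the similarity matrix.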

Parallel computation of the k smallest eigenvectors

The Lanczos algorithm is an iterative algorithm invented by Cornelius Lanczos to compute the eigenvalues and eigenvectors of a square matrix, or the singular value decomposition of a rectangular matrix [14]. It is especially effective for very large, sparse matrices [15–17]. The Lanczos algorithm is well suited to computing the k largest (or smallest) eigenvectors of a matrix, because it can find these k eigenvectors in only k iterations [15, 16].

The Lanczos algorithm transforms the original Laplacian matrix L into a real symmetric tridiagonal matrix $T_{mm} = V_{m}^{*} L V_{m}$, with diagonal elements $α_{j} = t_{jj}$ and off-diagonal elements $β_{j} = t_{j-1,j}$. Because $T_{mm}$ is symmetric, $t_{j-1,j} = t_{j,j-1}$. The Lanczos algorithm is shown in Algorithm 2:

Notice that (x, y) is the dot product of two vectors, and after the iteration, we get a tridiagonal matrix composed of α_j and β_j:

$T_{mm} = \begin{pmatrix} α_{1} & β_{2} & & & 0 \\ β_{2} & α_{2} & β_{3} & & \\ & β_{3} & α_{3} & \ddots & \\ & & \ddots & \ddots & β_{m} \\ 0 & & & β_{m} & α_{m} \end{pmatrix}$

Because T_mm is tridiagonal, its eigenvalues and eigenvectors are easy to obtain by other means (such as the QR algorithm). It can be shown that they approximate the eigenvalues and eigenvectors of the original Laplacian matrix L.

Algorithm 2 Lanczos algorithm

1. $v_{1} \leftarrow$ random vector with norm 1
$v_{0} \leftarrow 0$
$β_{1} \leftarrow 0$
2. Iteration: for j = 1, 2, …, m
$w_{j} \leftarrow L v_{j} - β_{j} v_{j - 1}$
$α_{j} \leftarrow (w_{j}, v_{j})$
$w_{j} \leftarrow w_{j} - α_{j} v_{j}$
$β_{j + 1} \leftarrow ∥w_{j}∥$
$v_{j + 1} \leftarrow w_{j} / β_{j + 1}$
3. Return
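The iteration in Algorithm 2 can be written out for a small dense matrix as follows. This is a single-machine sketch using plain Python lists, not the distributed implementation: the function names are hypothetical, a fixed start vector replaces the random one so the run is reproducible, and the early exit when β becomes (numerically) zero is an addition that Algorithm 2 does not show.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(L, v):
    # Dense matrix-vector product, one dot product per row.
    return [dot(row, v) for row in L]

def lanczos(L, v1, m):
    """Run up to m Lanczos steps on the symmetric matrix L from unit vector v1.
    Returns (alphas, betas): alphas holds alpha_1, alpha_2, ... and
    betas holds beta_2, beta_3, ... in the notation of Algorithm 2."""
    v_prev, v, beta = [0.0] * len(v1), list(v1), 0.0
    alphas, betas = [], []
    for _ in range(m):
        w = [a - beta * b for a, b in zip(matvec(L, v), v_prev)]
        alpha = dot(w, v)
        w = [a - alpha * b for a, b in zip(w, v)]
        beta = math.sqrt(dot(w, w))
        alphas.append(alpha)
        if beta < 1e-12:  # breakdown: the Krylov subspace is exhausted
            break
        betas.append(beta)
        v_prev, v = v, [a / beta for a in w]
    return alphas, betas

L = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
alphas, betas = lanczos(L, [1.0, 0.0, 0.0], 3)
print(alphas, betas)  # [2.0, 2.0, 2.0] [1.0, 1.0]
```

For this small example the start vector is the first unit vector and L is already tridiagonal, so the α and β values reproduce the diagonal and the magnitudes of the off-diagonal entries of L.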

From the Lanczos algorithm we can see that the matrix-vector multiplication Lv_j is the time-consuming step. If the matrix had to be loaded into memory, L would have to be moved every time it is multiplied by a vector, which would consume a great deal of time. The distributed facilities provided by Hadoop MapReduce and HDFS adopt an excellent idea: moving the computation close to the data it operates on costs less time than moving the data to the computing program. We therefore use a distributed matrix to store the matrix L to be decomposed on HDFS, partitioned by rows. In each Lanczos iteration, the distributed matrix L on HDFS is not moved; instead, what moves is a vector (i.e. the computation moves). Each time, the vector v_j that is to be multiplied by the matrix is sent to the locations where L is stored in HDFS, and the products of v_j with the rows of L are computed in parallel (see Figure 2). The product Lv_j is the primary time-consuming operation in the Lanczos algorithm, but with L stored in a distributed way on HDFS, it can be completed by Map/Reduce. If the first k eigenvectors are needed, the vector only has to be sent k times to where the matrix is stored.
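The row-wise product described above can be sketched as follows. The code is an in-process illustration under stated assumptions: each "block" stands in for the rows of L held on one node, rows are stored sparsely as {column: value} dictionaries, and `map_row` and `distributed_matvec` are hypothetical names rather than Hadoop or HDFS calls.

```python
def map_row(row_index, sparse_row, v):
    """Map: multiply one stored row of L by the broadcast vector v."""
    return row_index, sum(value * v[col] for col, value in sparse_row.items())

def distributed_matvec(row_blocks, v):
    """row_blocks: list of blocks, each a list of (row_index, sparse_row)."""
    result = [0.0] * len(v)
    for block in row_blocks:  # in Hadoop these blocks would be processed in parallel
        for row_index, sparse_row in block:
            i, value = map_row(row_index, sparse_row, v)
            result[i] = value  # Reduce: assemble the output vector by row index
    return result

blocks = [
    [(0, {0: 2.0, 1: -1.0}), (1, {0: -1.0, 1: 2.0, 2: -1.0})],  # rows on node A
    [(2, {1: -1.0, 2: 2.0})],                                    # rows on node B
]
product = distributed_matvec(blocks, [1.0, 2.0, 3.0])
print(product)  # [0.0, 0.0, 4.0]
```

Only the vector travels to the blocks; the matrix rows stay where they are stored, which is the point the paragraph makes about moving computation rather than data.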

Parallelization of K-means clustering

In the parallel K-means clustering algorithm, a file containing the k initial cluster centers is created and placed on HDFS, where every machine in the cluster can access it. The distance calculation between one data point and the k centers is independent of the distance calculations for other data points, so the distance calculations for different data points can be performed in parallel in the MapReduce framework. There has been much research on parallel K-means clustering, for example [18, 19]. The parallel K-means clustering algorithm designed in this paper consists mainly of a Map function and a Reduce function, with a Combine operation added after the Map function.

Map function design

The task of the Map function is to calculate the distance between each record and the cluster centers and to label each record with the cluster it belongs to. The input is all the records to be clustered, together with the cluster centers from the previous iteration; the records are < key, value > pairs of the form < line number, record line >. Each Map function reads the file describing the cluster centers, finds the center nearest to the input record, and labels the record with the new cluster. The intermediate output < key, value > has the form < cluster category ID, record attribute vector >. The pseudo code of the Map function is as follows:

void Map(Writable key, Text point){
    // mindis starts at the largest possible value
    mindis = MAX_VALUE;
    for(i = 0; i < k; i++){
        if (dis(point, cluster[i]) < mindis){
            mindis = dis(point, cluster[i]);
            currentClusterID = i;}}
    output(currentClusterID, point);}

When the data is large and the objects in each data subset after partitioning are quite similar, the intermediate keys produced by the Map process are very likely to repeat. For example, thousands of records < key j, value j > produced in the Map process would be sent over the network to the designated Reduce function. This wastes valuable network resources, increases latency and reduces I/O performance. Therefore, an optional Combiner function is added after the Map process. The Combiner first merges the output of the Map function locally and outputs < key j, list (value j) >; the partition function hash (key) mod R then divides the intermediate key/value pairs produced by the Combiner into R partitions, and each partition is sent to its designated Reduce function. Figure 3 shows the parallel K-means process.
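A local merge of this kind can be sketched as follows. Instead of passing on the full list of values, this sketch keeps a running (sum vector, count) per cluster, which is one plausible reading of the Combine step and an assumption on our part; `combine` and `partition` are hypothetical names, and integer cluster IDs stand in for hash (key).

```python
def combine(map_output):
    """Merge one mapper's (cluster_id, point) pairs into per-cluster partial sums."""
    partial = {}
    for cluster_id, point in map_output:
        total, count = partial.get(cluster_id, ([0.0] * len(point), 0))
        partial[cluster_id] = ([t + x for t, x in zip(total, point)], count + 1)
    return partial

def partition(cluster_id, num_reducers):
    """Pick the reducer for a key, in the spirit of hash(key) mod R."""
    return cluster_id % num_reducers

local = combine([(0, [1.0, 2.0]), (1, [5.0, 5.0]), (0, [3.0, 4.0])])
print(local)  # {0: ([4.0, 6.0], 2), 1: ([5.0, 5.0], 1)}
```

Three map outputs shrink to two records before anything crosses the network, and the counts let a reducer still compute exact means from the partial sums.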

Reduce function design

The task of the Reduce function is to calculate the new cluster centers from the intermediate results of the Map function, for use in the next round of the MapReduce job. The input < key, value > pair has the form < cluster category ID, {record attribute vector set} >. All records with the same key (i.e., the same cluster category ID) go to one Reduce task, which accumulates the number of points with that key and the sum of the records, computes the mean, and produces a new cluster center description file. The output < key, value > pair has the form < cluster category ID, mean vector >. The pseudo code of the Reduce function is as follows:

void Reduce(Writable key, Iterator<PointWritable> points){
    // num counts the samples assigned to this cluster; its initial value is 0
    num = 0;
    while (points.hasNext()){
        PointWritable currentPoint = points.next();
        num += currentPoint.getNum();
        for(i = 0; i < dimension; i++){
            sum[i] += currentPoint.point[i];}}
    for(i = 0; i < dimension; i++)
        mean[i] = sum[i]/num;
    out(key, mean);}

The iteration continues until the cluster centers no longer change or the number of iterations reaches a preset value.
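Putting the Map and Reduce steps together with this stopping rule gives the loop below. It is a single-process sketch under stated assumptions: the function names are hypothetical, squared Euclidean distance is assumed for dis, the Combine step is omitted, and a cluster that receives no points simply keeps its previous center, a case the text does not discuss.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_map(point, centers):
    """Map: return (index of the nearest center, point)."""
    distances = [squared_distance(point, c) for c in centers]
    return distances.index(min(distances)), point

def kmeans_reduce(points):
    """Reduce: the mean vector of all points assigned to one cluster."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def kmeans(points, centers, max_iterations=100):
    for _ in range(max_iterations):
        groups = {}
        for p in points:
            cluster_id, point = kmeans_map(p, centers)
            groups.setdefault(cluster_id, []).append(point)  # shuffle by key
        # A cluster that received no points keeps its previous center.
        new_centers = [kmeans_reduce(groups[i]) if i in groups else centers[i]
                       for i in range(len(centers))]
        if new_centers == centers:  # centers unchanged: stop iterating
            break
        centers = new_centers
    return centers

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
final_centers = kmeans(points, [[0.0, 0.0], [10.0, 0.0]])
print(final_centers)  # [[0.0, 1.0], [10.0, 1.0]]
```

Each pass of the outer loop corresponds to one MapReduce job; the centers file on HDFS would be rewritten between jobs.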

Analysis of complexity of algorithm

Parallel computation of the similarity matrix

Before the detailed analysis, assume that computing the similarity value S(x_i, x_j) of one pair of data points takes O(1) time, and let m be the number of machines in the cluster. As mentioned above, “similar value calculation of subscript i” computes the similarity values of n − i + 1 data point pairs. Thus the time complexity of “similar value calculation of subscript 1” is O(n), that of “similar value calculation of subscript 2” is O(n − 1), and so on, down to O(1) for “similar value calculation of subscript n”. The time complexity of computing the similarity matrix is therefore O(n + (n − 1) + … + 1) = O((n² + n)/2). Because the calculation is spread evenly over m machines, the time complexity of the parallel similarity matrix calculation is O((n + (n − 1) + … + 1)/m) = O((n² + n)/(2m)).
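As a worked check of the sum above, take the illustrative values n = 10000 and m = 10 (these numbers are chosen for the example and are not measurements from the experiments):

```latex
\sum_{i=1}^{n} (n - i + 1) = \frac{n(n+1)}{2} = \frac{n^{2} + n}{2},
\qquad
n = 10^{4}:\ \frac{10^{8} + 10^{4}}{2} = 50\,005\,000 \ \text{similarity evaluations},
\qquad
m = 10:\ \frac{50\,005\,000}{10} = 5\,000\,500 \ \text{per machine}.
```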

Parallel computation of the k smallest eigenvectors

In the non-parallel case, the time complexity of using the Lanczos algorithm to compute k eigenvectors of the Laplacian L is O(kL^op + k²n) [20], where L^op is the time needed to multiply the matrix L by a vector v_j. Because L has been partitioned by rows and stored on HDFS, the multiplication of L by v_j is distributed over the machines. Under ideal conditions, each multiplication then takes L^op/m, so the time complexity of computing the first k eigenvectors in parallel is O(kL^op/m + k²n).

Parallelization of K-means clustering

The new representation y_i of each data point is k-dimensional, and in each iteration its distance to each of the k centers is calculated. The distance computation for each data point therefore has time complexity O(k²), and the distance calculation for all points in one iteration has time complexity O(nk²). Under ideal conditions, the distance calculations are distributed evenly over the machines and executed in parallel, so the time complexity is reduced to O(nk²/m) per iteration, that is, O(nk²/m) × (number of iterations) in total.

The analysis of experiment and result

Experimental environment

In this experiment, we use 10 computers to set up the Hadoop cluster. Eight of them have a dual-core 2.6 GHz CPU and 4 GB of memory; the other two have a quad-core 2.8 GHz CPU and 8 GB of memory. All run Ubuntu 10.04. The Hadoop version is 0.20.2, and each machine has a gigabit Ethernet card and is connected through a switch.

The experiments use the classic data set DataSet1 provided by KDD Cup '99 to test the correctness of the proposed parallel spectral clustering algorithm. To verify the performance of the proposed parallel algorithm, we use data sets of 10000 (DS1), 50000 (DS2), 100000 (DS3), 1000000 (DS4) and 5000000 (DS5) records; the data samples are the multidimensional data described in [20, 21].

In the experiment, the speedup ratio and the scaleup ratio are used as evaluation indicators.

Experimental results

Correctness validation

Table 1 shows the clustering results for data set DataSet1 in stand-alone mode and with the proposed parallel spectral clustering algorithm. As Table 1 shows, the clustering results of the proposed parallel algorithm and the serial algorithm are highly consistent. The error rate of both is less than 2%, and both achieve good clustering results, which indicates that the parallel spectral clustering algorithm proposed in this paper is correct.

Table 1 Comparison of clustering accuracy of stand-alone mode and parallel algorithm mode proposed in the paper

Test of speedup ratio

The speedup ratio measures how much parallel computing reduces the running time and improves performance. It is an important indicator of the performance of parallel computing: the greater the speedup ratio, the less time the parallel computation takes relative to the serial one, and the higher the parallel efficiency. We ran speedup tests on 10000, 50000, 100000, 1000000 and 5000000 records while varying the number of Hadoop cluster nodes. Table 2 lists the running time of the data sets with different numbers of nodes, and Figure 4 shows the results.

Table 2 Comparison of running time

As Table 2 and Figure 4 show, the speedup of the algorithm improves as the data set grows. The main reasons are as follows: 1) the design of the < key, value > pairs in the Map and Reduce stages of the proposed parallel spectral clustering algorithm is reasonable; 2) we add a Combine operation after the Map stage, which greatly reduces the communication cost between the master node and the slave nodes. Therefore, the speedup improves substantially as the data volume becomes large.

When the data volume is less than 50000, the data volume on each node in the parallel process is not large enough, so the speed is lower than that of the serial spectral clustering algorithm. As the data volume increases, the speed of the parallel algorithm gradually increases; in particular, when the data volume exceeds 1000000, the speedup ratio grows significantly. The running time in stand-alone mode is 3.667 times that of ten computers when the data set has 10000 records, but 1014.39 times when it has 5000000 records. However, Figure 4 also shows that when the number of nodes increases to 8 or more, the gain in speedup narrows. Overall, the execution efficiency of the parallel spectral clustering algorithm on the Hadoop platform is higher than that of the conventional spectral clustering algorithm.

Analysis of scalability

This paper also considers the efficiency of the parallel algorithm, which represents the utilization of the cluster during the execution of the parallel algorithm. It is defined as $\eta = \frac{S_{p}}{N}$, where S_p is the speedup ratio and N is the number of cluster nodes. Figure 5 shows the efficiency of the parallel algorithm proposed in this paper. For generality, the scalability is tested on data sets of 100000, 1000000 and 5000000 records.
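The two indicators can be computed as in the sketch below. The speedup is taken as serial running time divided by parallel running time, which matches how the running-time comparison is described in the text, and the efficiency is S_p divided by N as defined above. The function names and the timings (120 s and 20 s on 8 nodes) are hypothetical values for illustration, not results from Table 2.

```python
def speedup(serial_time, parallel_time):
    """Speedup ratio S_p: serial running time over parallel running time."""
    return serial_time / parallel_time

def efficiency(speedup_ratio, num_nodes):
    """Parallel efficiency: speedup ratio S_p divided by the node count N."""
    return speedup_ratio / num_nodes

# Hypothetical timings: 120 s on one machine, 20 s on 8 nodes.
s_p = speedup(120.0, 20.0)
e = efficiency(s_p, 8)
print(s_p, e)  # 6.0 0.75
```

An efficiency of 1 would mean every added node contributes a full unit of speedup; communication overhead normally keeps it below that.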

As Figure 5 shows, the efficiency curves decline overall. This is mainly because the communication overhead between nodes increases gradually as the number of computing nodes in the cluster grows. The larger the data size, the higher the efficiency of the proposed parallel algorithm and the more stable its efficiency curve, that is, the better its scalability. The experimental results show that the parallel algorithm proposed in this paper scales better on large data sets.

Conclusion

Data on the Internet exists at vast scale and grows rapidly, so techniques for mining high-value information from massive data are urgently needed. As an unsupervised learning method, clustering is commonly used in data statistics and analysis, in areas including data mining, machine learning, pattern recognition and image analysis. Traditional serial clustering algorithms have two problems that make it difficult for them to meet the needs of practical applications: first, clustering is not fast or efficient enough; second, on massive data they are limited by memory capacity and often cannot run effectively. This paper studied the traditional spectral clustering algorithm and designed an efficient parallel spectral clustering algorithm. Its strategy is as follows: compute and sparsify the similarity matrix, partitioning the work by data point; when computing eigenvectors, store the Laplacian matrix on the distributed file system HDFS and use a distributed Lanczos computation to obtain the eigenvectors in parallel; finally, apply the improved parallel K-means clustering to the transposed matrix of the eigenvectors to obtain the clustering results. By adopting a different parallel strategy at each step, the whole algorithm achieves linear growth in speed. The experimental results show that the proposed parallel spectral clustering algorithm is suitable for mining massive data. We hope that the results of this paper can provide inspiration and practical value for subsequent researchers and developers.

References

Hartigan JA: Clustering Algorithms. 1975. Wiley, USA Wiley, USA
MATH Google Scholar
Cui J, Li Q, Yang LP: Fast algorithm for mining association rules based on vertically distributed data in large dense databases. Comput Sci 2011, 38: 216–220.
Google Scholar
Zheng P, Cui LZ, Wang HY, Xu M: A data placement strategy for data-intensive applications in cloud. Comput Sci 2010, 33: 1472–1480.
Google Scholar
Wang P, Meng D, Zhan JF, Tu BB: Review of programming models for data-intensive computing. J Comput Res Dev 2010, 47: 1993–2002.
Google Scholar
Fowlkes C, Belongie S, Chung F, Malik : Spectral grouping using the nyström method. IEEE Trans Pattern Anal Mach Intell 2004, 26: 214–225. 10.1109/TPAMI.2004.1262185
Article Google Scholar
Dhillon IS, Guan Y, Kulis B: Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans Pattern Anal Mach Intell 2007, 29: 1944–1957.
Article Google Scholar
Kumar S, Mohri M, and Talwalkar A: Sampling techniques for the nyström method [C]. Paper presented at the 12th conference on artificial intelligence and statistics, University of California; 2009. 16–18 April 2009 16–18 April 2009
Google Scholar
Zhang K, Tsang I, Kwok J: Improve nyström low-rank approximation and error analysis. Helsinki: Paper presented at the 25th International Conference on Machine Learning; 2008. 5–9 July 2008 5–9 July 2008
Google Scholar
Yan D, Huang L, Jordan MI: Fast approximate spectral clustering. Paris: Paper presented at the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2009. 28 June-1 July 2009 28 June-1 July 2009
Book Google Scholar
Gropp E, Skjellum A: Using MPI-2: advanced features of the message-passing interface. USA: MIT Press; 1999.
Google Scholar
Song Y, Chen W, Bai H, Lin C, Chang E: Parallel spectral clustering. In European Conference, ECML PKDD. The joint conference on Machine Learning and Knowledge Discovery in Databases, Belgium, September 2008. Lecture notes in computer science (Lecture notes in artificial intelligence), vol 5212. Heidelberg: Springer; 2008:374.
Maschhoff K, Sorensen D: A portable implementation of ARPACK for distributed memory parallel architectures. Colorado: Paper presented at the 4th Copper Mountain Conference on Iterative Methods; 1996. 9–13 April 1996
Yang C: The research of data mining based on Hadoop. Dissertation. Chongqing University; 2010.
Cullum J, Willoughby RA: Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Volume 1. USA: Birkhäuser Boston Inc; 1985.
Golub GH, Van Loan CF: Matrix Computations. Maryland: The Johns Hopkins University Press; 1996.
Cullum J, Willoughby RA: Computing eigenvalues of very large symmetric matrices: an implementation of a Lanczos algorithm with no reorthogonalization. J Comput Phys 1981, 44: 329–358. 10.1016/0021-9991(81)90056-5
Mahadevan S: Fast Spectral Learning Using Lanczos Eigenspace Projections. Chicago: Paper presented at the 23rd national conference on artificial intelligence; 2008. 13–17 July 2008
Zhao WZ, Ma HF, Fu YX, Shi ZZ: Research on parallel K-means algorithm design based on hadoop platform. Comput Sci 2011, 38: 166–176.
Niu XZ, She K: Study of fast parallel clustering partition algorithm for large data set. Comput Sci 2012, 39: 134–151.
Feng LN: Research on parallel K-Means clustering method in resume data. Dissertation. Yunnan University; 2010.
Jin R, Kou CH, Liu RJ, Li YF: A Co-optimization routing algorithm in wireless sensor network. Wireless Pers Comm 2013, 70: 1977–1991.

Acknowledgement

This work was supported by the Science and Technology Research Program of Zhejiang Province under grant No. 2011C21036, by the Shanghai Natural Science Foundation under grant No. 10ZR1400100, and by the Projects in Science and Technique of Ningbo Municipal under grant No. 2012B82003.

Author information

Authors and Affiliations

College of Information Science and Technology, Donghua University, Shanghai, P.R.C
Ran Jin, Chunhai Kou, Ruijuan Liu & Yefeng Li
School of Computer Science and Information Technology, Zhejiang Wanli University, Ningbo, P.R.C
Ran Jin

Corresponding author

Correspondence to Ran Jin.

Additional information

Competing interest

The authors declare that they have no competing interests.

Authors’ contributions

The contributions of the paper are twofold: (1) the use of Hadoop to design an improved parallel spectral clustering algorithm for large data sets; and (2) the use of speedup ratio and scalability to verify the superiority of the parallel algorithm. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Jin, R., Kou, C., Liu, R. et al. Efficient parallel spectral clustering algorithm design for large data sets under cloud computing environment. J Cloud Comp 2, 18 (2013). https://doi.org/10.1186/2192-113X-2-18

Received: 09 April 2013
Accepted: 28 October 2013
Published: 07 November 2013
DOI: https://doi.org/10.1186/2192-113X-2-18
