
Advances, Systems and Applications

Table 1 Summary of datasets used to evaluate the hierarchical clustering algorithms

From: A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures

| Dataset | Domain | # of docs | # of terms | # of terms after preprocessing | # of classes | Source |
|---|---|---|---|---|---|---|
| Classic4 | Abstracts | 7095 | 7749 | 6576 | 4 | [43] |
| Reviews | News articles | 4069 | 22,927 | 12,431 | 5 | [43] |
| Tr23 | TREC documents | 204 | 5833 | 4384 | 6 | [43] |
| LATimes | News articles | 6279 | 10,020 | 6389 | 6 | [43] |
| Tr31 | TREC documents | 927 | 10,129 | 6946 | 7 | [43] |
| La2s | News articles | 3075 | 12,433 | 8517 | 7 | [43] |
| WebKb | Web pages | 8282 | 22,892 | 11,009 | 7 | [43] |
| Tr12 | TREC documents | 313 | 5805 | 4283 | 8 | [43] |
| Re8 | News articles | 7674 | 8901 | 5379 | 8 | [43] |
| Tr11 | TREC documents | 414 | 6430 | 4632 | 9 | [43] |
| Tr45 | TREC documents | 690 | 8262 | 6016 | 10 | [43] |
| Tr41 | TREC documents | 878 | 7455 | 5406 | 10 | [43] |
| Oh10 | Medical documents | 1050 | 3239 | 2425 | 10 | [43] |
| Dmoz-Science | Web pages | 6000 | 5011 | 3719 | 12 | [43] |
| Dmoz-Health | Web pages | 3500 | 4217 | 3172 | 13 | [43] |
| Re0 | Articles | 1504 | 2886 | 2209 | 13 | [43] |
| Dmoz-Computers | Web pages | 9500 | 5011 | 3527 | 19 | [43] |
| Wap | Web pages | 1560 | 8460 | 5988 | 20 | [43] |
| 20 Newsgroups | E-mails | 18,808 | 45,434 | 16,499 | 20 | [43] |
| Re1 | Articles | 1657 | 3758 | 2863 | 25 | [43] |
| ACM | Digital library | 3493 | 60,768 | 16,315 | 40 | [43] |
| New3 | News articles | 9558 | 26,833 | 14,483 | 44 | [43] |
| Opinosis | Reviews | 6457 | 2693 | 2201 | 51 | [43] |
| NYTimes | News articles | 300,000 | 102,660 | 18,001 | - | [44] |
| PubMed | Abstracts | 8,200,000 | 141,043 | 21,451 | - | [44] |