Nature-inspired algorithms are based on the concepts of self-organization and complex biological systems. They have been designed by researchers and scientists to solve complex problems in various environmental situations by observing how naturally occurring phenomena behave. The introduction of nature-inspired algorithms has led to new branches of study such as neural networks, swarm intelligence, evolutionary computation, and artificial immune systems. Particle swarm optimization (PSO), social spider optimization (SSO), and other nature-inspired algorithms have found some success in solving clustering problems, but they may converge to local optima due to the lack of balance between exploration and exploitation. In this paper, we propose a novel implementation of SSO, namely social spider optimization for data clustering using single centroid representation and enhanced mating operation (SSODCSC), in order to improve the balance between exploration and exploitation. In SSODCSC, we implemented each spider as a collection of a centroid and the data instances close to it. We allowed non-dominant male spiders to mate with female spiders by converting them into dominant males. We found that SSODCSC produces better values for the sum of intra-cluster distances, the average CPU time per iteration (in seconds), accuracy, and the average silhouette coefficient than the competing algorithms.

Data clustering is one of the most popular unsupervised classification techniques in data mining. It rearranges the given data instances into groups such that similar data instances are placed in the same group while dissimilar data instances are placed in separate groups.

It specifies the resultant topological structures of the network clustering when applied on a computer network.

Data clustering is an NP-hard problem.

In the clustering objective, d_{i,j} denotes the distance of the data instance x_i from the centroid c_j of the cluster to which it is assigned; the sum of these intra-cluster distances (SICD) is to be minimized.
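The intra-cluster distance objective can be illustrated with a short computation (a minimal sketch; the class and method names are ours, not from the paper):

```java
// Computes the sum of intra-cluster distances (SICD): each data
// instance contributes its Euclidean distance to its cluster centroid.
public class Sicd {
    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    // instances[i] belongs to the cluster whose centroid is centroids[label[i]]
    static double sicd(double[][] instances, double[][] centroids, int[] label) {
        double total = 0;
        for (int i = 0; i < instances.length; i++)
            total += euclidean(instances[i], centroids[label[i]]);
        return total;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        double[][] centroids = {{0, 1}, {10, 1}};
        int[] labels = {0, 0, 1, 1};
        System.out.println(sicd(data, centroids, labels)); // 4.0
    }
}
```

A clustering algorithm searches for the centroids (and assignments) that minimize this quantity.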

The classical clustering algorithms are categorized into hierarchical and partitional algorithms. The main drawback of hierarchical clustering is that the clusters formed in an iteration cannot be undone in subsequent iterations.

In this study, we investigated the performance of social spider optimization (SSO) for data clustering using a single centroid representation and enhanced mating operation. The algorithm was experimented on using the Patent corpus5000 datasets and UCI datasets. Each data instance in the UCI datasets is a data vector, but the data instances in the Patent corpus5000 datasets are text files. Before we applied the proposed algorithm on these datasets, the text files were represented as data vectors using the TF-IDF mechanism.

The term weight w_{i,j} of the j-th term in document d_i can be computed using the TF-IDF scheme:

w_{i,j} = tf_{i,j} × log(N / df_j),

where tf_{i,j} is the term frequency of the j-th term in document d_i, N is the total number of documents, and df_j is the number of documents that contain the j-th term.
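The TF-IDF weighting can be sketched in a few lines (one common variant using the natural logarithm; the paper's exact weighting scheme may use a different log base or normalization):

```java
// TF-IDF term weighting: w(i,j) = tf(i,j) * log(N / df(j)), where
// tf(i,j) is the frequency of term j in document i, N is the number
// of documents, and df(j) is the number of documents containing term j.
public class TfIdf {
    static double weight(int tf, int totalDocs, int docFreq) {
        return tf * Math.log((double) totalDocs / docFreq);
    }

    public static void main(String[] args) {
        // A term appearing 3 times in a document, present in 10 of 100 documents:
        System.out.println(weight(3, 100, 10)); // 3 * ln(10) ≈ 6.9078
    }
}
```

A term that appears in every document gets weight 0, so only discriminative terms contribute to the document vectors.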

In the last decade, nature-inspired algorithms have been successfully applied for solving NP-hard clustering problems. In the state-of-the-art nature-inspired algorithms for solving clustering problems, each agent in the population is taken as a collection of K centroids, where K is the number of clusters.

Suppose DS = {dv1, dv2, dv3, dv4, dv5, dv6, dv7, dv8, dv9, dv10} is a dataset of ten data vectors.

It specifies data instances present in Agent1.

It specifies data instances present in Agent2.

It specifies data instances present in Agent3.

It specifies data instances present in Agent4.

Agent | Data vectors | Intra-cluster distance | Part of best solution? |
---|---|---|---|
Spider 1 | dv1, dv2, dv3 | 25 | No |
Spider 2 | dv4, dv5, dv6 | 50 | No |
Spider 3 | dv7, dv8, dv9, dv10 | 75 | No |
Spider 4 | dv1, dv2, dv3, dv4, dv7, dv8 | 40 | Yes |
Spider 5 | dv5, dv6 | 45 | No |
Spider 6 | dv9, dv10 | 70 | No |
Spider 7 | dv1, dv2, dv3, dv4, dv8 | 38 | No |
Spider 8 | dv5, dv10 | 30 | Yes |
Spider 9 | dv6, dv7, dv9 | 90 | No |
Spider 10 | dv1, dv2, dv3, dv4, dv8 | 38 | No |
Spider 11 | dv5, dv7, dv10 | 60 | No |
Spider 12 | dv6, dv9 | 65 | Yes |

In our proposed algorithm, social spider optimization for data clustering using single centroid (SSODCSC), each spider is represented by a single centroid and the list of data instances close to it. This representation requires less computational memory than representing each spider as a collection of K centroids, as the following calculation shows.

For SSODCSC:

Number of spiders used = 50

Number of iterations for best clustering results = 300

Total number of spiders to be computed = 300 * 50 = 15,000

Memory required for storing a double value = 8 bytes

Memory required for storing a spider’s centroid (that consists of m dimension values) = 8 * m bytes, where m is the number of dimensions present in the dataset.

Memory required for storing an integer value representing identification number of a data instance = 4 bytes.

Maximum memory required for storing the list of identification numbers of the data instances closer to the centroid = 4 * n bytes, where n is the number of data instances in the dataset.

Maximum memory required for a spider = (8 * m + 4 * n) bytes.

Therefore, total computational memory of SSODCSC = 15,000 * (8 * m + 4 * n) bytes.

For SSO:

Number of spiders used = 50

Number of iterations for best clustering results = 300

Total number of spiders to be computed = 300 * 50 = 15,000

Memory required for storing a double value = 8 bytes

Memory required for storing a spider's K centroids = 8 * K * m bytes, where K is the number of clusters.

Memory required for storing an integer value representing the identification number of a data instance = 4 bytes.

Maximum memory required for storing the lists of identification numbers of the data instances closer to the K centroids = 4 * n bytes.

Maximum memory required for a spider = (8 * K * m + 4 * n) bytes.

Therefore, total computational memory of SSO = 15,000 * (8 * K * m + 4 * n) bytes, which exceeds that of SSODCSC whenever K > 1.
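The two estimates can be compared with a short calculation (a sketch following the per-spider byte counts in the text; the Wine dataset figures m = 13, n = 178, K = 3 are taken from the dataset description later in the paper):

```java
// Rough memory comparison between the single-centroid representation
// (SSODCSC) and the K-centroid representation (SSO): 8 bytes per double
// dimension value and 4 bytes per data-instance identifier.
public class MemoryEstimate {
    static long ssodcscBytes(long spiders, long m, long n) {
        return spiders * (8 * m + 4 * n);     // one centroid per spider
    }
    static long ssoBytes(long spiders, long m, long n, long k) {
        return spiders * (8 * k * m + 4 * n); // K centroids per spider
    }
    public static void main(String[] args) {
        long spiders = 15000, m = 13, n = 178, k = 3; // e.g., the Wine dataset
        System.out.println(ssodcscBytes(spiders, m, n)); // 12240000
        System.out.println(ssoBytes(spiders, m, n, k));  // 15360000
    }
}
```

For this example the single-centroid representation saves roughly 20% of the memory; the saving grows with K and m.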

The time required for initializing the spiders is also lower in this representation. The average CPU time per iteration depends on the time required for computing fitness values and the time required for computing the next positions of the spiders in the solution space. With the single centroid representation, the fitness values and next positions of the spiders can be computed in less time, so the average CPU time per iteration is reduced. The proposed algorithm returns the K best spiders, whose centroids constitute the final clustering.

In the basic SSO algorithm, non-dominant males are not allowed in the mating operation because of their low weight values. They do not receive any vibrations from other spiders and have no communication in the web, as the communication is established through vibration only.

This paper is organized as follows: “Related Work” describes the recent related work on solving clustering problems using nature-inspired algorithms, “Proposed Algorithm: SSODCSC” describes SSODCSC, “Results” includes experimental results, and we conclude the paper with future work in the section “Discussion.”

The nodes of the graph are samples, and each sample is connected to its k nearest neighbors.

Social spider optimization is based on the cooperative behavior of social spiders in obtaining common food. The spiders are classified into two types, namely male spiders and female spiders. The algorithm keeps track of the global best spider s_gbs and the worst spider s_ws, and each spider moves according to the vibrations received from the global best spider, from its nearest better spider s_nbs, or from its nearest female spider s_nfs.

The figure specifies the components of a spider in SSODCSC.

The SSODCSC algorithm returns the K best spiders s^1_gbs, s^2_gbs, …, s^K_gbs, whose centroids form the K cluster centers.

Social spider optimization for data clustering using single centroid starts with the initialization of spiders in the solution space. Initially all spiders are empty. The fitness of each spider is set to 0, and the weight is set to 1. Each spider s is initialized with a random centroid, each dimension value of which is drawn uniformly between the lower and upper bounds of that dimension in the dataset.

where spider[d] denotes the value of the centroid of spider s in dimension d.

The distances of each data instance from the centroids of all spiders are calculated using the Euclidean distance function. A data instance is assigned to the spider that contains its nearest centroid.
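The initialization and nearest-centroid assignment just described can be sketched as follows (class and method names are illustrative, not from the paper):

```java
import java.util.Random;

// Initialization sketch: each spider holds one random centroid drawn
// uniformly within the per-dimension bounds of the dataset, and each
// data instance is then assigned to the spider with the nearest centroid.
public class InitAssign {
    static double[][] randomCentroids(int spiders, double[] lb, double[] ub, Random rng) {
        double[][] c = new double[spiders][lb.length];
        for (int s = 0; s < spiders; s++)
            for (int d = 0; d < lb.length; d++)
                c[s][d] = lb[d] + rng.nextDouble() * (ub[d] - lb[d]);
        return c;
    }

    // Index of the spider whose centroid is nearest to instance x
    // (squared Euclidean distance; the square root does not change the argmin).
    static int nearest(double[] x, double[][] centroids) {
        int best = 0; double bestDist = Double.MAX_VALUE;
        for (int s = 0; s < centroids.length; s++) {
            double dist = 0;
            for (int d = 0; d < x.length; d++)
                dist += (x[d] - centroids[s][d]) * (x[d] - centroids[s][d]);
            if (dist < bestDist) { bestDist = dist; best = s; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 10}};
        System.out.println(nearest(new double[]{1, 2}, centroids)); // 0
        System.out.println(nearest(new double[]{9, 8}, centroids)); // 1
    }
}
```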

The spiders are moved across the solution space in each iteration of SSODCSC based on their gender. The movement of a spider in the solution space depends on the vibrations received from other spiders. The intensity of the vibration that spider s_i perceives from spider s_j is Vib_{i,j} = w_j · e^{−d_{i,j}²}, where w_j is the weight of s_j and d_{i,j} is the distance between s_i and s_j.
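The vibration intensity is a one-line computation (the standard SSO form; the paper may use an equivalent variant):

```java
// Vibration intensity in SSO: the vibration that spider i perceives from
// spider j decays with the squared distance between them and scales with
// the weight of j: Vib(i,j) = w_j * exp(-d(i,j)^2).
public class Vibration {
    static double intensity(double weightJ, double distance) {
        return weightJ * Math.exp(-distance * distance);
    }
    public static void main(String[] args) {
        System.out.println(intensity(1.0, 0.0)); // 1.0 (maximal at zero distance)
        System.out.println(intensity(0.5, 2.0)); // 0.5 * e^-4 ≈ 0.00916
    }
}
```

Heavy, nearby spiders therefore dominate the information a spider receives through the web.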

The movement of a female spider s_f depends on the vibrations received from the global best spider s_gbs and from its nearest better spider s_nbs; with a fixed probability, s_f is attracted toward them, and otherwise it is repelled from them.

The figure specifies how the next position of a female spider is calculated in SSODCSC.
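The female position update can be sketched in the standard SSO form (a sketch under stated assumptions: alpha, beta, delta, and the movement probability pf are algorithm parameters, vibN and vibG are the vibrations from the nearest-better and global-best spiders; this is not the paper's exact implementation):

```java
import java.util.Arrays;
import java.util.Random;

// Female-spider position update: with probability pf the female moves
// toward the nearest-better spider (nbs) and the global best (gbs),
// otherwise she is repelled; delta adds a small random perturbation.
public class FemaleMove {
    static double[] next(double[] f, double[] nbs, double[] gbs,
                         double vibN, double vibG,
                         double alpha, double beta, double delta,
                         double pf, Random rng) {
        int sign = rng.nextDouble() < pf ? 1 : -1; // attraction or repulsion
        double[] out = new double[f.length];
        for (int d = 0; d < f.length; d++)
            out[d] = f[d]
                   + sign * (alpha * vibN * (nbs[d] - f[d])
                           + beta * vibG * (gbs[d] - f[d]))
                   + delta * (rng.nextDouble() - 0.5);
        return out;
    }

    public static void main(String[] args) {
        // pf = 1 forces attraction, delta = 0 removes the random term:
        double[] moved = next(new double[]{0, 0}, new double[]{1, 1},
                              new double[]{2, 2}, 0.5, 0.5, 1, 1, 0, 1.0,
                              new Random(42));
        System.out.println(Arrays.toString(moved)); // [1.5, 1.5]
    }
}
```

With the random term switched off, the female lands exactly at the weighted combination of the two attractors, which makes the update easy to unit-test.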

The solution space consists of female spiders and male spiders. When data instances are added to or removed from them, their fitness values and weight values change. If the current weight of a male spider is greater than or equal to the median weight of the male spiders, it is considered a dominant male spider. The male spiders that are not dominant are called non-dominant male spiders. The next position of a dominant male spider s_dm is computed as follows.

The position of the spider depends only on the vibrations received from its nearest female spider s_nfs, toward which the dominant male moves.

The figure specifies how the next position of a dominant male spider is calculated in SSODCSC.

The next position of the non-dominant male spider s_ndm is obtained by moving it toward the weighted mean of the positions of the male spiders.

Each dominant male spider mates with the set of female spiders within the specified range of mating to produce a new spider, as shown in the figure.

where diff is the sum of the differences between the upper bound and the lower bound of each dimension, and the range of mating is computed as r = diff / (2 · m), with m the number of dimensions.

The figure specifies how a dominant male spider mates with a set of female spiders to produce a new spider in SSODCSC.
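The mating operation assembles the offspring by roulette-wheel selection over the parents, so heavier parents contribute dimension values with higher probability. A minimal sketch (names are ours; the paper's operator may differ in detail):

```java
import java.util.Random;

// Mating sketch: a new spider's centroid is assembled dimension by
// dimension from the parents (the dominant male plus the females within
// the mating radius) via roulette-wheel selection on their weights.
public class Mating {
    // Pick a parent index with probability proportional to its weight;
    // r is a uniform random number in [0, 1).
    static int roulette(double[] weights, double r) {
        double total = 0;
        for (double w : weights) total += w;
        double cum = 0;
        for (int i = 0; i < weights.length; i++) {
            cum += weights[i] / total;
            if (r < cum) return i;
        }
        return weights.length - 1;
    }

    static double[] offspring(double[][] parents, double[] weights, Random rng) {
        int dims = parents[0].length;
        double[] child = new double[dims];
        for (int d = 0; d < dims; d++)
            child[d] = parents[roulette(weights, rng.nextDouble())][d];
        return child;
    }

    public static void main(String[] args) {
        double[] weights = {0.7, 0.2, 0.1};
        System.out.println(roulette(weights, 0.05)); // 0 (heaviest parent)
        System.out.println(roulette(weights, 0.95)); // 2 (lightest parent)
    }
}
```

Because selection probability scales with weight, an offspring's centroid is biased toward the components of the fittest parents.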

In SSO, the non-dominant male spiders are not allowed to mate with female spiders, as they would produce new spiders having low weights. In SSODCSC, a non-dominant male spider is converted into a dominant male spider by making sure that its weight becomes greater than or equal to the average weight of the dominant male spiders, so that it participates in the mating process and produces a new spider whose weight is better than that of at least one other spider. The theoretical proof of the possibility of converting a non-dominant male spider into a dominant male spider is provided in Theorem 1. Thus, non-dominant male spiders become more powerful than dominant male spiders, as they are made to produce new spiders that surely replace the worst spiders in the population. The theoretical proof of the possibility of obtaining a new spider that is better than the worst spider, after a non-dominant male spider mates with the female spiders, is provided in Theorem 2. The following steps are used to convert a non-dominant male spider into a dominant male spider:

Step 1: Create a list consisting of the data instances of the non-dominant male spider s_ndm, sorted in descending order of their distance from its centroid.

Step 2: Delete the top-most data instance (i.e., the data instance at the greatest distance from the centroid) from the list.

Step 3: Recompute the weight of the non-dominant male spider s_ndm.

Step 4: If the weight of non-dominant male is less than the average weight of dominant male spiders, go to Step 2.
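The four steps above can be sketched as a loop (a sketch under an assumption: here weight is modeled as 1 / (1 + sum of distances), one plausible fitness-to-weight mapping; the paper's exact weight formula may differ):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Conversion sketch for Steps 1-4: repeatedly drop the data instance
// farthest from the centroid until the spider's weight reaches the
// average weight of the dominant male spiders.
public class Conversion {
    // Assumed weight model: inversely related to the sum of distances.
    static double weight(List<Double> distances) {
        double sum = 0;
        for (double d : distances) sum += d;
        return 1.0 / (1.0 + sum);
    }

    static List<Double> convert(List<Double> distances, double avgDominantWeight) {
        List<Double> kept = new ArrayList<>(distances);
        kept.sort(Collections.reverseOrder());        // Step 1: farthest first
        while (!kept.isEmpty() && weight(kept) < avgDominantWeight)
            kept.remove(0);                           // Steps 2-4: drop farthest
        return kept;
    }

    public static void main(String[] args) {
        List<Double> d = new ArrayList<>(Arrays.asList(5.0, 1.0, 0.5));
        System.out.println(convert(d, 0.4)); // [1.0, 0.5]: weight 1/2.5 = 0.4
    }
}
```

Removing the farthest instance always decreases the distance sum, so the loop terminates: in the worst case the list becomes empty and the weight attains its maximum.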

The flowchart for SSODCSC is specified in

Theorem 1: A non-dominant male spider s_ndm can be converted into a dominant male spider by removing data instances from it.

Proof: Let W_avg be the average weight of the dominant male spiders.

According to the definition of a non-dominant male spider, the weight of s_ndm is less than W_avg.

Assume that the theorem is false.

⇒ s_ndm cannot attain a weight greater than or equal to W_avg.

⇒ No matter how many data instances are removed from s_ndm, its weight remains below W_avg.

Let Sum be the sum of distances of the data instances from the centroid of s_ndm.

If the data instance that is the furthest from the centroid of s_ndm is removed from s_ndm, then:

⇒ the sum of distances of the data instances from the centroid of s_ndm decreases:

Sum = Sum − distance of the removed data instance from the centroid of s_ndm.

⇒ the fitness of s_ndm improves, because the fitness is inversely related to Sum.

⇒ the weight of s_ndm increases, because the weight increases with the fitness.

Similarly, if a data instance is added to s_ndm:

⇒ the sum of distances of the data instances from the centroid of s_ndm increases,

⇒ the fitness of s_ndm decreases,

⇒ the weight of s_ndm decreases.

Therefore, the weight of s_ndm increases monotonically as data instances are removed from it.

When all the data instances are removed from s_ndm, Sum becomes 0 and the weight of s_ndm attains its maximum possible value, which is not less than W_avg.

But according to the assumption, the weight of s_ndm always remains below W_avg.

Hence, our assumption is wrong.

So, we can conclude that a non-dominant male spider can be converted into a dominant male spider in the single centroid representation of SSO.

Theorem 2: The weight of the new spider produced when a converted non-dominant male spider s_ndm mates with female spiders is greater than that of the worst spider in the population.

Proof: Let s_ws be the worst spider, that is, the spider with the smallest weight in the population.

Let s_new be the resulting new spider of the mating operation.

Let s_gbs be the global best spider.

Assume that the theorem is false.

It implies that the weight of s_new is less than the weight of s_ws.

In other words, the total number of spiders whose weight is less than or equal to that of s_new is zero.

But according to the Roulette wheel method, each parent spider contributes its centroid components to s_new with a probability proportional to its weight, so heavier parents dominate the composition of s_new.

After conversion, the weight of s_ndm is at least the average weight of the dominant male spiders. As the weight of s_ndm approaches the weight of s_gbs, the composition of s_new approaches that of s_gbs.

Similarly, when the weight of s_ndm equals the weight of s_gbs, s_new effectively becomes s_gbs.

Substituting s_gbs for s_new, the weight of s_new approaches the maximum weight in the population, which is greater than the weight of s_ws.

Hence, our assumption is wrong. Therefore, we can conclude that the weight of the new spider s_new produced by the converted s_ndm is greater than that of the worst spider in the population.

The proposed algorithm and the algorithms used in the comparison were implemented in the Java Runtime Environment, version 1.7.0_51, and the experiments were run on an Intel Xeon CPU E3-1270 v3 (3.50 GHz) with 160 GB RAM, under the Windows 7 Professional operating system.

First, we applied SSODCSC on six Patent corpus datasets. The description of the datasets is given in the following table.

 | Patent corpus1 | Patent corpus2 | Patent corpus3 | Patent corpus4 | Patent corpus5 | Patent corpus6 |
---|---|---|---|---|---|---|
Number of text documents | 100 | 150 | 200 | 250 | 300 | 350 |
Number of clusters | 6 | 7 | 9 | 9 | 8 | 7 |

The SICD, cosine similarity, and accuracy values returned by SSODCSC for each Patent corpus dataset are given in the following table.

Dataset | SICD | Cosine similarity | Accuracy |
---|---|---|---|---|
Patent corpus1 | 10,263.55 | 0.8643 | 0.8666 | 87.53 |
Patent corpus2 | 12,813.98 | 0.7517 | 0.7611 | 79.24 |
Patent corpus3 | 16,600.41 | 0.7123 | 0.7316 | 74.29 |
Patent corpus4 | 20,580.11 | 0.9126 | 0.9315 | 94.05 |
Patent corpus5 | 23,163.24 | 0.8143 | 0.8255 | 83.17 |
Patent corpus6 | 28,426.86 | 0.8551 | 0.8703 | 86.25 |

Dataset | 100 iterations | 150 iterations | 200 iterations | 250 iterations | 300 iterations |
---|---|---|---|---|---|
Patent corpus1 | 27,500.23 | 21,256.45 | 16,329.59 | 13,260.72 | |
Patent corpus2 | 23,464.44 | 21,501.16 | 17,467.15 | 15,254.33 | |
Patent corpus3 | 25,731.05 | 22,150.15 | 19,456.25 | 18,204.42 | |
Patent corpus4 | 31,189.46 | 28,506.72 | 27,155.68 | 24,638.83 | |
Patent corpus5 | 36,124.30 | 33,854.35 | 30,109.52 | 26,138.59 | |
Patent corpus6 | 41,201.22 | 37,367.33 | 33,632.63 | 31,007.25 | |

The best values are specified in bold.

To find the distance between data instances, we used the Euclidean distance function and the Manhattan distance function. Data instances having small differences were placed in the same cluster by the Euclidean distance function, as it de-emphasizes small differences. It was found that SSODCSC produced slightly better clustering results with the Euclidean distance function, as shown in the following table.

Dataset | Euclidean distance function | | Manhattan distance function | |
---|---|---|---|---|
 | Accuracy | Avg. cosine similarity | Accuracy | Avg. cosine similarity |
Patent corpus1 | 82.05 | 0.8198 | | |
Patent corpus2 | 73.33 | 0.7344 | | |
Patent corpus3 | 68.03 | 0.6995 | | |
Patent corpus4 | 85.27 | 0.8637 | | |
Patent corpus5 | 76.49 | 0.7743 | | |
Patent corpus6 | 80.46 | 0.8142 | | |

The best values are specified in bold.

Dataset | PSO | GA | ABC | IBCO | ACO | SMSSO | BFGSA | SOS | SSO | SSODCSC |
---|---|---|---|---|---|---|---|---|---|---|---|
Patent corpus1 | 13,004.21 | 13,256.55 | 13,480.76 | 13,705.09 | 14,501.76 | 14,794.09 | 12,884.53 | 13,250.71 | 13,024.83 | 12,159.98 |
Patent corpus2 | 15,598.25 | 15,997.44 | 16,044.05 | 15,800.55 | 16,895.58 | 17,034.29 | 14,057.22 | 16,842.83 | 15,803.19 | 14,809.66 |
Patent corpus3 | 20,007.12 | 21,255.77 | 23,903.11 | 24,589.19 | 19,956.44 | 19,543.05 | 18,183.14 | 21,259.03 | 19,045.42 | 18,656.93 |
Patent corpus4 | 24,175.19 | 25,023.52 | 27,936.76 | 28,409.58 | 24,498.32 | 25,759.48 | 23,637.83 | 25,109.06 | 24,264.31 | 23,447.12 |
Patent corpus5 | 31,064.62 | 29,879.76 | 31,007.15 | 31,588.66 | 27,442.28 | 30,015.64 | 28,268.55 | 30,129.24 | 29,176.48 | 26,289.88 |
Patent corpus6 | 29,846.53 | 32,226.51 | 33,509.84 | 34,185.35 | 31,993.79 | 32,753.55 | 30,005.81 | 32,208.31 | 31,804.89 | 31,615.35 |

The best values are specified in bold.

Dataset | PSO | GA | ABC | IBCO | ACO | SMSSO | BFGSA | SOS | SSO | SSODCSC |
---|---|---|---|---|---|---|---|---|---|---|---|
Patent corpus1 | 68.29 | 57.97 | 57.06 | 58.26 | 56.08 | 54.15 | 68.28 | 76.03 | 49.14 | 70.22 |
Patent corpus2 | 70.21 | 69.38 | 67.15 | 68.57 | 67.88 | 67.05 | 62.25 | 69.92 | 60.05 | 69.45 |
Patent corpus3 | 65.15 | 64.95 | 62.99 | 63.25 | 67.09 | 66.98 | 51.19 | 68.28 | 64.03 | 67.69 |
Patent corpus4 | 64.93 | 61.03 | 58.78 | 59.11 | 58.12 | 69.49 | 55.28 | 68.87 | 62.49 | 71.10 |
Patent corpus5 | 69.72 | 57.38 | 55.80 | 56.07 | 44.67 | 68.05 | 61.51 | 64.62 | 68.55 | 71.16 |
Patent corpus6 | 58.35 | 62.59 | 60.65 | 61.47 | 54.95 | 69.51 | 64.63 | 69.55 | 72.01 | 70.29 |

The best values are specified in bold.

The silhouette coefficient SC of a data instance x_i is defined as SC(x_i) = (b_i − a_i) / max(a_i, b_i),

where a_i is the average distance of x_i from the other data instances of its own cluster and b_i is the smallest average distance of x_i from the data instances of any other cluster.
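A quick numeric illustration of this definition (a_i and b_i are assumed to be precomputed; the class name is ours):

```java
// Silhouette coefficient of a single instance: SC = (b - a) / max(a, b),
// where a is its average distance to instances in its own cluster and b is
// the smallest average distance to any other cluster. Values near 1
// indicate a well-clustered instance.
public class Silhouette {
    static double sc(double a, double b) {
        return (b - a) / Math.max(a, b);
    }
    public static void main(String[] args) {
        System.out.println(sc(1.0, 4.0)); // 0.75: tight cluster, far neighbors
        System.out.println(sc(3.0, 3.0)); // 0.0: borderline instance
    }
}
```

Averaging SC over all instances gives the average silhouette coefficient reported in the tables below.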

Dataset | PSO | GA | ABC | IBCO | ACO | SMSSO | BFGSA | SOS | SSO | SSODCSC |
---|---|---|---|---|---|---|---|---|---|---|---|
Patent corpus1 | 0.5043 | 0.5990 | 0.4001 | 0.4109 | 0.4844 | 0.3184 | 0.4908 | 0.7015 | 0.5804 | 0.7159 |
Patent corpus2 | 0.5107 | 0.6220 | 0.5922 | 0.3906 | 0.4335 | 0.4577 | 0.5388 | 0.6799 | 0.6496 | 0.6884 |
Patent corpus3 | 0.4498 | 0.4411 | 0.4804 | 0.4188 | 0.5913 | 0.4990 | 0.6588 | 0.6731 | 0.6005 | 0.6691 |
Patent corpus4 | 0.3466 | 0.6618 | 0.5269 | 0.4401 | 0.4548 | 0.4018 | 0.6106 | 0.7177 | 0.5985 | 0.6994 |
Patent corpus5 | 0.4082 | 0.3933 | 0.4005 | 0.4905 | 0.3997 | 0.4833 | 0.6933 | 0.7269 | 0.6208 | 0.7280 |
Patent corpus6 | 0.3225 | 0.4119 | 0.5507 | 0.5055 | 0.4883 | 0.4397 | 0.7045 | 0.6894 | 0.7328 | 0.7448 |

The best values are specified in bold.

From the figures, it can be observed that SSODCSC returns better inter-cluster and intra-cluster distances than the other algorithms on the Patent corpus5000 datasets.

The figure specifies inter-cluster distances returned by clustering algorithms when applied on Patent corpus5000 datasets.

The figure specifies intra-cluster distances returned by clustering algorithms when applied on Patent corpus5000 datasets.

We applied SSODCSC on UCI datasets as well. The description of the datasets is given in the following table.

Dataset | Number of classes | Number of attributes | Number of instances |
---|---|---|---|
Iris | 3 | 4 | 150 |
Wine | 3 | 13 | 178 |
Glass | 6 | 9 | 214 |
Vowel | 6 | 3 | 871 |
Cancer | 2 | 9 | 683 |
CMC | 3 | 9 | 1,473 |
Haberman | 2 | 3 | 306 |
Bupa | 2 | 6 | 345 |

Dataset | 100 iterations | 150 iterations | 200 iterations | 250 iterations | 300 iterations |
---|---|---|---|---|---|
Iris | 125.7045 | 118.9034 | 107.0844 | 100.3683 | |
Vowel | 147,257.5582 | 147,001.1863 | 146,948.7469 | 146,893.7569 | |
CMC | 6,206.8186 | 6,127.4439 | 5,986.2964 | 5,574.6241 | |
Glass | 387.5241 | 340.3885 | 301.0084 | 258.3053 | |
Wine | 17,358.0946 | 17,150.6084 | 16,998.4387 | 16,408.5572 | |

The best values are specified in bold.

Best spiders | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 |
---|---|---|---|---|
Spider 21 | 6.7026 | 3.0001 | 5.482 | 2.018 |
Spider 35 | 5.193 | 3.5821 | 1.4802 | 0.2402 |
Spider 16 | 5.8849 | 2.8009 | 4.4045 | 1.4152 |

Best spiders | Dimension 1 | Dimension 2 | Dimension 3 |
---|---|---|---|
Spider 10 | 508.4185 | 1,838.7035 | 2,558.1605 |
Spider 25 | 408.0024 | 1,013.0002 | 2,310.9836 |
Spider 42 | 624.0367 | 1,308.0523 | 2,333.8023 |
Spider 22 | 357.1078 | 2,292.1580 | 2,976.9458 |
Spider 48 | 377.2070 | 2,150.0418 | 2,678.0003 |
Spider 5 | 436.8024 | 993.0034 | 2,659.0012 |

Best spiders | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 | Dimension 6 | Dimension 7 | Dimension 8 | Dimension 9 |
---|---|---|---|---|---|---|---|---|---|
Spider 23 | 24.4001 | 3.0699 | 3.4986 | 1.8021 | 0.9303 | 0.8206 | 2.2985 | 2.9584 | 0.0271 |
Spider 38 | 43.7015 | 2.9929 | 3.4602 | 3.4568 | 0.8209 | 0.8330 | 1.8215 | 3.4719 | 3.306 |
Spider 16 | 33.4894 | 3.0934 | 3.5599 | 3.5844 | 0.8015 | 0.6629 | 2.169 | 3.2901 | 0.0704 |

Best spiders | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 | Dimension 6 | Dimension 7 | Dimension 8 | Dimension 9 |
---|---|---|---|---|---|---|---|---|---|
Spider 13 | 1.5201 | 14.6023 | 0.06803 | 2.2617 | 73.3078 | 0.0094 | 8.7136 | 1.01392 | 0.0125 |
Spider 29 | 1.5306 | 13.8005 | 3.5613 | 0.9603 | 71.8448 | 0.1918 | 9.5572 | 0.0827 | 0.0071 |
Spider 35 | 1.5169 | 13.3158 | 3.6034 | 1.4236 | 72.7014 | 0.5771 | 8.2178 | 0.0076 | 0.0321 |
Spider 42 | 1.4138 | 13.0092 | 0.0036 | 3.0253 | 70.6672 | 6.2470 | 6.9489 | 0.0078 | 0.0004 |
Spider 48 | 1.5205 | 12.8409 | 3.4601 | 1.3091 | 73.0315 | 0.6178 | 8.5902 | 0.0289 | 0.0579 |
Spider 7 | 1.5214 | 13.0315 | 0.2703 | 1.5193 | 72.7601 | 0.3615 | 11.995 | 0.0472 | 0.0309 |

Best spiders | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 | Dimension 6 | Dimension 7 | Dimension 8 | Dimension 9 | Dimension 10 | Dimension 11 | Dimension 12 | Dimension 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Spider 33 | 12.89 | 2.12 | 2.41 | 19.51 | 98.89 | 2.06 | 1.46 | 0.47 | 1.52 | 5.41 | 0.89 | 2.15 | 686.95 |
Spider 4 | 12.68 | 2.45 | 2.41 | 21.31 | 92.41 | 2.13 | 1.62 | 0.45 | 1.14 | 4.92 | 0.82 | 2.71 | 463.71 |
Spider 3 | 13.37 | 2.31 | 2.62 | 17.38 | 105.08 | 2.85 | 3.28 | 0.29 | 2.67 | 5.29 | 1.04 | 3.39 | 1,137.5 |

PSO | IBCO | ACO | SMSSO | BFGSA | SOS | SSO | SSODCSC | |
---|---|---|---|---|---|---|---|---|---|
Best | 0.0041 | 0.0151 | 0.0168 | 0.0186 | 0.0097 | 0.0148 | 0.0116 | 0.0065 | 0.0048 |
Average | 0.0068 | 0.0192 | 0.0205 | 0.0215 | 0.0126 | 0.0172 | 0.0129 | 0.0082 | 0.0055 |
Worst | 0.0072 | 0.0235 | 0.0245 | 0.0278 | 0.0138 | 0.0194 | 0.0144 | 0.0097 | 0.0069 |

PSO | IBCO | ACO | SMSSO | BFGSA | SOS | SSO | SSODCSC | |
---|---|---|---|---|---|---|---|---|---|
Best | 0.0125 | 0.1263 | 0.1455 | 0.1602 | 0.0188 | 0.0206 | 0.0215 | 0.0178 | 0.0145 |
Average | 0.0136 | 0.1923 | 0.2034 | 0.1698 | 0.0204 | 0.0218 | 0.0228 | 0.0195 | 0.0172 |
Worst | 0.0155 | 0.2056 | 0.2245 | 0.1893 | 0.0219 | 0.0231 | 0.0239 | 0.0227 | 0.0198 |

Dataset | PSO | GA | ABC | IBCO | ACO | SMSSO | BFGSA | SOS | SSO | SSODCSC |
---|---|---|---|---|---|---|---|---|---|---|---|
Wine | 82.25 | 78.79 | 70.25 | 72.48 | 63.34 | 64.88 | 60.10 | 67.88 | 63.78 | 78.42 |
Cancer | 76.95 | 83.42 | 71.38 | 70.55 | 62.98 | 60.34 | 61.95 | 62.03 | 64.80 | 74.34 |
CMC | 50.25 | 51.49 | 55.15 | 57.79 | 51.92 | 50.49 | 51.98 | 52.92 | 52.00 | 51.45 |
Vowel | 66.10 | 68.11 | 60.69 | 64.74 | 62.12 | 68.13 | 54.00 | 68.68 | 65.56 | 70.85 |
Iris | 94.43 | 90.95 | 62.41 | 62.58 | 60.43 | 71.95 | 64.43 | 62.47 | 62.43 | 85.81 |
Glass | 52.88 | 44.94 | 45.01 | 43.72 | 54.66 | 43.36 | 55.48 | 42.21 | 44.46 | 58.54 |

The best values are specified in bold.

We computed the ICD and inter-cluster distances of the resultant clusters of the clustering algorithms when applied on UCI datasets. From the figures, it can be observed that SSODCSC produces smaller intra-cluster distances and larger inter-cluster distances than the other algorithms.

The figure compares the clustering algorithms based on intra-cluster distances when applied on Iris and Glass datasets.

The figure compares intra-cluster distances of clustering algorithms when applied on Wine and Bupa datasets.

The figure compares intra-cluster distances of clustering algorithms when applied on Haberman, Cancer, and CMC datasets.

The figure compares inter-cluster distances of clustering algorithms when applied on Iris, Haberman, Cancer, and CMC datasets.

The figure compares inter-cluster distances of clustering algorithms when applied on Glass, Wine, and Bupa datasets.

Dataset | PSO | GA | ABC | IBCO | ACO | SMSSO | BFGSA | SOS | SSO | SSODCSC |
---|---|---|---|---|---|---|---|---|---|---|---|
Wine | 0.6490 | 0.5629 | 0.5008 | 0.5226 | 0.4151 | 0.4488 | 0.6003 | 0.6109 | 0.6417 | 0.6885 |
Cancer | 0.5894 | 0.6228 | 0.5277 | 0.5848 | 0.4492 | 0.4852 | 0.5995 | 0.5651 | 0.5999 | 0.6107 |
CMC | 0.3733 | 0.3281 | 0.3162 | 0.3726 | 0.3447 | 0.4984 | 0.4805 | 0.4900 | 0.4788 | 0.5111 |
Vowel | 0.4588 | 0.4079 | 0.4277 | 0.4011 | 0.6212 | 0.4105 | 0.5822 | 0.6255 | 0.6020 | 0.6492 |
Iris | 0.7099 | 0.7165 | 0.4388 | 0.4736 | 0.6043 | 0.5059 | 0.6253 | 0.5796 | 0.6511 | 0.6333 |
Glass | 0.3661 | 0.2805 | 0.2996 | 0.2070 | 0.5466 | 0.2900 | 0.4896 | 0.4155 | 0.4011 | 0.4419 |

The best values are specified in bold.

To show the significance of the proposed algorithm, we applied a one-way ANOVA test on the accuracy values shown in the tables above. The null hypothesis is that all the algorithms have the same mean accuracy.

Dataset | Sum | Sum squared | Mean | Variance |
---|---|---|---|---|
 | 396.650 | 26,318.996 | 66.108 | 19.425 |
PSO | 373.300 | 23,327.241 | 62.217 | 20.352 |
GA | 362.430 | 21,979.857 | 60.405 | 17.455 |
ABC | 366.730 | 22,513.033 | 61.122 | 19.577 |
IBCO | 348.790 | 20,646.575 | 58.132 | 74.166 |
ACO | 395.230 | 26,205.548 | 65.872 | 34.218 |
SMSSO | 363.140 | 22,174.032 | 60.523 | 39.118 |
BFGSA | 417.270 | 29,087.550 | 69.545 | 13.701 |
SOS | 376.270 | 23,910.126 | 62.712 | 62.721 |
SSO | 419.910 | 29,395.727 | 69.985 | 1.665 |
SSODCSC | 493.530 | 40,708.696 | 82.255 | 22.677 |

Degrees of freedom: df1 = 10, df2 = 55

Sum of squares for treatment (SSTR) = 2,761.313

Sum of squares for error (SSE) = 1,625.378

Total sum of squares (SST = SSE + SSTR) = 4,386.691

Mean square treatment (MSTR = SSTR/df1) = 276.131

Mean square error (MSE = SSE/df2) = 29.552
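The F statistic follows directly from the two mean squares above (our computation from the reported values, not code from the paper):

```java
// One-way ANOVA F statistic from the mean squares: F = MSTR / MSE.
public class AnovaF {
    static double f(double mstr, double mse) {
        return mstr / mse;
    }
    public static void main(String[] args) {
        // Values reported in the text: MSTR = 276.131, MSE = 29.552.
        System.out.println(f(276.131, 29.552)); // ≈ 9.34
    }
}
```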

Calculated F value (MSTR/MSE) = 276.131/29.552 ≈ 9.34.

We can reject the null hypothesis, as the calculated F value is greater than the critical F value at the assumed significance level of 5%.

Studentized range for df1 = 10 and df2 = 55 is 4.663.

Tukey honestly significant difference = 10.349.

Mean of the algorithm in the first row of the table and SSODCSC differs as 16.14667 is greater than 10.349.

Mean of PSO and SSODCSC differs as 20.03833 is greater than 10.349.

Mean of genetic algorithms (GA) and SSODCSC differs as 21.85000 is greater than 10.349.

Mean of artificial bee colony (ABC) and SSODCSC differs as 21.13333 is greater than 10.349.

Mean of improved bee colony optimization (IBCO) and SSODCSC differs as 24.12333 is greater than 10.349.

Mean of ACO and SSODCSC differs as 16.38333 is greater than 10.349.

Mean of SMSSO and SSODCSC differs as 21.73167 is greater than 10.349.

Mean of BFGSA and SSODCSC differs as 12.71000 is greater than 10.349.

Mean of SOS and SSODCSC differs as 19.54333 is greater than 10.349.

Mean of SSO and SSODCSC differs as 12.27000 is greater than 10.349.

Therefore, it may be concluded that SSODCSC significantly differs from other clustering algorithms.

To show the significance of the proposed algorithm, we also applied a one-way ANOVA test on the accuracy values obtained when the clustering algorithms were applied on the UCI datasets.

Dataset | Sum | Sum squared | Mean | Variance |
---|---|---|---|---|
 | 422.860 | 31,293.957 | 70.477 | 298.439 |
PSO | 417.700 | 30,748.459 | 69.617 | 333.915 |
GA | 364.890 | 22,675.874 | 60.815 | 97.018 |
ABC | 371.860 | 23,589.299 | 61.977 | 108.531 |
IBCO | 355.450 | 21,172.517 | 59.242 | 23.013 |
ACO | 359.150 | 22,098.159 | 59.858 | 120.008 |
SMSSO | 347.940 | 20,296.988 | 57.990 | 23.990 |
BFGSA | 356.190 | 21,657.069 | 59.365 | 102.370 |
SOS | 353.030 | 21,143.238 | 58.939 | 74.308 |
SSO | 419.410 | 30,133.245 | 69.902 | 163.157 |
SSODCSC | 510.810 | 44,665.701 | 85.135 | 235.578 |

Degrees of freedom: df1 = 10, df2 = 55.

Sum of squares for treatment (SSTR) = 4,113.431.

Sum of squares for error (SSE) = 7,901.638.

Total sum of squares (SST = SSE + SSTR) = 12,015.069.

Mean square treatment (MSTR = SSTR/df1) = 411.343.

Mean square error (MSE = SSE/df2) = 143.666.

Calculated F value (MSTR/MSE) = 411.343/143.666 ≈ 2.86.

So, we can reject the null hypothesis, as the calculated F value is greater than the critical F value at the assumed significance level of 5%.

Studentized range for df1 = 10 and df2 = 55 is 4.663.

Tukey honestly significant difference = 22.819.

Means of GA and SSODCSC differ as 24.32000 is greater than 22.819.

Means of ABC and SSODCSC differ as 23.15833 is greater than 22.819.

Means of IBCO and SSODCSC differ as 25.89333 is greater than 22.819.

Means of ACO and SSODCSC differ as 25.27667 is greater than 22.819.

Means of SMSSO and SSODCSC differ as 27.14500 is greater than 22.819.

Means of BFGSA and SSODCSC differ as 25.77000 is greater than 22.819.

Means of SOS and SSODCSC differ as 26.29667 is greater than 22.819.

Therefore, it is obvious that SSODCSC significantly differs from most of the other clustering algorithms when applied on UCI datasets.

We applied our proposed algorithm on the Patent corpus datasets and the UCI datasets.

We compared the clustering results of SSODCSC with those of other clustering algorithms such as PSO, GA, ABC, IBCO, ACO, SMSSO, BFGSA, SOS, and the basic SSO.

In order to conduct the experiments, we formed the Patent corpus1 dataset from 100 text documents belonging to six different classes; Patent corpus2 from 150 documents in seven classes; Patent corpus3 from 200 documents in nine classes; Patent corpus4 from 250 documents in nine classes; Patent corpus5 from 300 documents in eight classes; and Patent corpus6 from 350 documents in seven classes, all drawn from the Patent corpus5000 data repository.

The clustering quality can be validated using ICD and inter-cluster distances. The smaller value for intra-cluster distance and a larger value for inter-cluster distance are the requirements for any clustering algorithm. We computed the ICD and inter-cluster distances of the resultant clusters of the clustering algorithms, when applied on Patent corpus datasets and UCI datasets, and found that SSODCSC produces better results than the other clustering algorithms.

We compared the clustering algorithms on the basis of average CPU time per iteration (in seconds). We found that SSODCSC has the shortest average CPU time per iteration with respect to most of the datasets. The reasons for this are its ability to produce a better solution space after every iteration, to initialize the solution space in less time, to compute fitness values of the spiders in less time, and to find the next positions of the spiders in less time.

We compared the clustering algorithms on the basis of the average silhouette coefficient value. We found that SSODCSC produces better average silhouette coefficient values for both Patent corpus datasets and UCI datasets.

We conducted a one-way ANOVA test separately on the clustering results of Patent corpus datasets and UCI datasets to show the superiority and applicability of the proposed method with respect to text datasets and feature based datasets.

In this paper, we proposed a novel implementation of SSO for data clustering using a single centroid representation and enhanced mating. Additionally, we allowed non-dominant male spiders to mate with female spiders by converting them into dominant males. As a result, the explorative power of the algorithm has been increased and thereby the chance of getting a global optimum has been improved. We compared SSODCSC with other state-of-the-art algorithms and found that it produces better clustering results. We applied SSODCSC on Patent corpus text datasets and UCI datasets and got better clustering results than other algorithms. We conducted a one-way ANOVA test to show its superiority and applicability with respect to text datasets and feature-based datasets. Future work will include the study of applicability of SSODCSC in data classification of brain computer interfaces.

The raw data used in the experiments.

SSODCSC was written in Java. To run it, only the JDK is required. The data files should be stored with a .txt extension in the current working directory. The program works for both text datasets and attribute-based datasets.

The authors declare that they have no competing interests.

The following information was supplied regarding data availability:

The two datasets are available in the