Investigating the meaningful clusters for data sets
Investigative Methods for Computer Science
Assignment-3

Project Idea

Density-based spatial clustering of applications with noise (DBSCAN) is, as the name implies, a density-based data clustering algorithm.
DBSCAN can find many clusters that some other clustering algorithms, such as k-means, cannot, because it uses a density-based definition of a cluster. As a result, it is relatively resistant to noise and can handle clusters of different shapes and sizes. However, the main weakness of DBSCAN is that it has trouble when the clusters have greatly varied densities. More generally, existing density-based algorithms have trouble finding all the meaningful clusters in data sets with varied densities. In my project, I propose a dynamic method for frequently changing data sets that finds clusters of arbitrary shape and a suitable value of Eps for each density level of the data set. In the existing DBSCAN algorithm, the Eps (radius) value is determined globally, so one value is used for the whole data set even when its density varies. With a single global Eps and a single global MinPts, some clusters are impossible to detect.
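The limitation of a single global Eps can be illustrated with a minimal sketch. The snippet below only applies the density test at the heart of DBSCAN: a point counts as a core point when at least MinPts points (itself included) lie within Eps of it. The helper name `core_points`, the toy data set and the parameter values are hypothetical and chosen purely for illustration.

```python
import math

def core_points(points, eps, min_pts):
    """Return the indices of points that have at least min_pts points
    (the point itself included) within distance eps."""
    cores = []
    for i, p in enumerate(points):
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbours >= min_pts:
            cores.append(i)
    return cores

# Hypothetical 2-D data: a dense group (spacing 0.1), one stray point
# close to it, and a sparse group (spacing 1.0).
dense = [(0.1 * i, 0.0) for i in range(5)]          # indices 0..4
stray = [(1.4, 0.0)]                                # index 5
sparse = [(10.0 + 1.0 * i, 0.0) for i in range(5)]  # indices 6..10
points = dense + stray + sparse

small = core_points(points, eps=0.15, min_pts=3)
large = core_points(points, eps=1.5, min_pts=3)

print(small)  # [1, 2, 3]: only the dense group has core points
print(large)  # [0, 1, 2, 3, 4, 5, 7, 8, 9]: the stray point is now a core point too
```

With the small Eps the sparse group has no core points at all, so it would be reported as noise. With the large Eps the sparse group is recovered, but the stray point becomes dense enough to be absorbed into the dense group.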
Existing Research

Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volumes of information produced by the progress of technology make clustering very large-scale data a challenging task.

Project Hypothesis

The basic idea of this proposal is that we need a method to find suitable values of the parameter Eps for the different densities from the k-dist plot; we can then use the traditional DBSCAN algorithm to find clusters. For each value of Eps, DBSCAN is applied to find all the clusters at the corresponding density level. The expected final result avoids marking denser areas and sparser ones as a single cluster. The advantages of this method are that its clusters are easy to understand and that it does not limit itself to particular cluster shapes. To illustrate the new method, 2-dimensional data can be used.
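One possible reading of this idea is sketched below in Python on 2-dimensional data. It runs a plain brute-force DBSCAN once per Eps value, from the smallest Eps (densest level) to the largest. Removing the points that were already clustered before moving on to the next level is an assumption of this sketch, not something the proposal specifies. The names `dbscan` and `cluster_by_levels`, the toy data and the Eps values are all hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Plain DBSCAN with brute-force neighbour search; one label per point."""
    n = len(points)
    neighbours = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue               # already assigned, do not expand again
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])
        cluster += 1
    return labels

def cluster_by_levels(points, eps_levels, min_pts):
    """Run DBSCAN once per Eps (densest level first) on the points that are
    still unclustered. Assumption of this sketch: clustered points are removed
    before the next level is processed."""
    labels = [NOISE] * len(points)
    remaining = list(range(len(points)))
    next_id = 0
    for eps in sorted(eps_levels):
        subset = [points[i] for i in remaining]
        sub_labels = dbscan(subset, eps, min_pts)
        for i, lab in zip(remaining, sub_labels):
            if lab != NOISE:
                labels[i] = next_id + lab
        next_id += max(sub_labels, default=NOISE) + 1
        remaining = [i for i, lab in zip(remaining, sub_labels) if lab == NOISE]
    return labels

# Hypothetical data: dense group, one stray point near it, sparse group.
dense = [(0.1 * i, 0.0) for i in range(5)]
stray = [(1.4, 0.0)]
sparse = [(10.0 + 1.0 * i, 0.0) for i in range(5)]
points = dense + stray + sparse

single = dbscan(points, eps=1.5, min_pts=3)
levels = cluster_by_levels(points, eps_levels=[0.15, 1.5], min_pts=3)

print(single)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(levels)  # [0, 0, 0, 0, 0, -1, 1, 1, 1, 1, 1]
```

In this toy case the single global Eps absorbs the stray point into the dense cluster, while the level-by-level run keeps it as noise and still recovers the sparse cluster.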
How to Test Hypothesis

Suppose that the noise around the denser cluster C1 has the same density as the other cluster C2. If the density threshold is low enough (a large Eps for a fixed MinPts) that DBSCAN finds C2 as a cluster, then C1 and the points surrounding it will become a single cluster. If the density threshold is high enough (a small Eps for a fixed MinPts) that DBSCAN finds C1 as a separate cluster, and the points surrounding it are marked as noise, then C2 and the points surrounding it will also be marked as noise. DBSCAN also has trouble with high-dimensional data because density is more difficult to define for such data.

To determine the parameters Eps and MinPts, we need to look at the behaviour of the distance from a point to its kth nearest neighbour, which is called k-dist. The k-dists are computed for all data points for some k and then plotted in ascending order; we expect to see a sharp change in the plotted graph.
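The k-dist computation can be sketched in a few lines of Python. The snippet computes the distance from every point to its kth nearest neighbour, sorts the values in ascending order, and locates the largest jump between consecutive values. Using the single largest jump as the "sharp change" is a simplification assumed here; the function names, the two grid-shaped groups and k = 2 are hypothetical choices for illustration.

```python
import math

def sorted_k_dist(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    k_dists = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists)

def sharpest_jump(values):
    """Index i at which values[i + 1] - values[i] is largest."""
    return max(range(len(values) - 1), key=lambda i: values[i + 1] - values[i])

# Hypothetical data: a dense 3x3 grid (spacing 0.1) and a sparse 3x3 grid (spacing 1.0).
dense = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
sparse = [(10.0 + i, 10.0 + j) for i in range(3) for j in range(3)]
points = dense + sparse

k_dists = sorted_k_dist(points, k=2)
jump = sharpest_jump(k_dists)

# Candidate Eps values: the k-dist just before the jump (dense level)
# and the largest k-dist (sparse level). Both choices are assumptions.
eps_dense = k_dists[jump]
eps_sparse = k_dists[-1]
print(jump, round(eps_dense, 3), round(eps_sparse, 3))
```

Plotting `k_dists` against its index gives the k-dist plot described above; on this toy data the curve is flat near 0.1, rises sharply after the ninth value, and is flat again near 1.0.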
This sharp change in the k-dist value corresponds to a suitable value of Eps for each density level of the data set. In the k-dist plot, small changes show up as the density level of the data set under examination varies, but eventually, beyond a certain point, a sharp change shows up. The value of Eps determined in this way depends on k, but does not change dramatically as k changes. These are some of the hypothetical test cases that can be considered when working on the proposed method.

References

· M. Ester, A. Frommelt, H.-P. Kriegel, and J. Sander, “Spatial data mining: database primitives, algorithms and efficient DBMS support,” Data Mining and Knowledge Discovery, vol. 4, no. 2, pp. 193–216, 2000.
· M. Hemalatha, M. Naga, and N. Saranya, “A recent survey of knowledge discovery in spatial data mining,” International Journal of Computer Science, vol. 8, no. 3, article 2, 2011.
· Y. He, H. Tan, W. Luo, H. Mao, D. Ma, S. Feng, and J. Fan, “MR-DBSCAN: An efficient parallel density-based clustering algorithm using MapReduce,” in Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems, Tainan, Taiwan, 7–9 December 2011, pp. 473–480.