Friday, December 28, 2012
IEEE Java Project - Clustering with Multi-Viewpoint based Similarity Measure
ABSTRACT:
All clustering methods have to assume some cluster relationship among the
data objects that they are applied on. Similarity between a pair of objects can
be defined either explicitly or implicitly. In this paper, we introduce a novel
multi-viewpoint based similarity measure and two related clustering methods.
The major difference between a traditional dissimilarity/similarity measure and
ours is that the former uses only a single viewpoint, which is the origin,
while the latter utilizes many different viewpoints, which are objects assumed
to not be in the same cluster with the two objects being measured. Using
multiple viewpoints, more informative assessment of similarity could be achieved.
Theoretical analysis and empirical study are conducted to support this claim.
Two criterion functions for document clustering are proposed based on this new
measure. We compare them with several well-known clustering algorithms that use
other popular similarity measures on various document collections to verify the
advantages of our proposal.
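The abstract describes the measure only in words. As a rough illustration, the sketch below assumes the form used in the cited paper: the similarity of two objects is the average, over viewpoints taken from outside their cluster, of the dot product of the two difference vectors. The class and method names are hypothetical, and documents are assumed to be dense vectors of equal length.

```java
import java.util.List;

final class MultiViewpointSimilarity {
    /** Dot product of (a - h) and (b - h): a and b as seen from viewpoint h. */
    static double relativeDot(double[] a, double[] b, double[] h) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            sum += (a[k] - h[k]) * (b[k] - h[k]);
        }
        return sum;
    }

    /**
     * Averages the relative dot product over all viewpoints.
     * Assumption: the caller passes only objects that are NOT in the
     * cluster of a and b, as the abstract requires.
     */
    static double similarity(double[] a, double[] b, List<double[]> viewpoints) {
        if (viewpoints.isEmpty()) {
            throw new IllegalArgumentException("need at least one viewpoint");
        }
        double total = 0.0;
        for (double[] h : viewpoints) {
            total += relativeDot(a, b, h);
        }
        return total / viewpoints.size();
    }
}
```

With the origin as the only viewpoint, this reduces to the plain dot product, which matches the single-viewpoint case described above.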
EXISTING SYSTEMS
·
Clustering is one of the most interesting and important
topics in data mining. The aim of clustering is to find intrinsic structures in
data, and organize them into meaningful subgroups for further study and
analysis. Many clustering algorithms are published every year.
·
Existing systems greedily pick the next frequent item set,
which represents the next cluster, so as to minimize the overlap between the
documents that contain both that item set and some of the remaining item sets.
·
In other words, the clustering result depends on the order in which
the item sets are picked, which in turn depends on the greedy heuristic. Our
method does not select clusters in a sequential order. Instead, we
assign each document to the best cluster.
PROPOSED SYSTEM
·
The main goal is to develop a
novel hierarchical algorithm for document clustering that provides maximum
efficiency and performance.
·
The work focuses on
studying and exploiting the cluster-overlap phenomenon to design cluster
merging criteria. In particular, it proposes a new way to compute the overlap
rate in order to improve time efficiency and “veracity”. Building on the
hierarchical clustering method, it uses the Expectation-Maximization (EM)
algorithm in a Gaussian mixture model to estimate the parameters, and merges
the two sub-clusters whose overlap is the largest.
·
Experiments on both public data
and document clustering data show that this approach can improve the efficiency
of clustering and save computing time.
Given a data set satisfying the
distribution of a mixture of Gaussians, the degree of overlap between
components affects the number of clusters “perceived” by a human operator or
detected by a clustering algorithm. In other words, there may be a significant
difference between intuitively defined clusters and the true clusters
corresponding to the components in the mixture.
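The merging rule described above (combine the two sub-clusters whose overlap is largest) can be sketched as follows. The post does not say how the overlap rate is computed, so this sketch assumes it is already available as a symmetric matrix `overlap`, where `overlap[i][j]` is the overlap rate between sub-clusters i and j. The class and method names are hypothetical.

```java
final class OverlapMerge {
    /**
     * Returns {i, j} with i < j for the pair of sub-clusters with the
     * largest overlap rate, or null when there are fewer than two.
     */
    static int[] mostOverlappingPair(double[][] overlap) {
        int[] best = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < overlap.length; i++) {
            // Only the upper triangle is scanned, since the matrix is assumed symmetric.
            for (int j = i + 1; j < overlap.length; j++) {
                if (overlap[i][j] > bestValue) {
                    bestValue = overlap[i][j];
                    best = new int[] {i, j};
                }
            }
        }
        return best;
    }
}
```

A hierarchical procedure would call this repeatedly, merging the returned pair and recomputing the overlap rates after each merge.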
MODULES
·
HTML PARSER
·
CUMULATIVE DOCUMENT
·
DOCUMENT SIMILARITY
·
CLUSTERING
MODULE DESCRIPTION:
HTML Parser
·
Parsing is the first step performed when a document enters the
processing stage.
·
Parsing here means separating out and identifying the
meta tags in an HTML document.
·
The raw HTML file is read and parsed by visiting all
the nodes in its tree structure.
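As a minimal illustration of the meta-tag identification step, the sketch below pulls `name`/`content` pairs out of raw HTML. Note the simplification: it uses a regular expression from the Java standard library rather than a real walk over the parsed tree, and it only recognizes tags written with double quotes and with `name` before `content`. The class and method names are hypothetical, and a real project would more likely use a proper HTML parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaTagExtractor {
    // Matches <meta name="..." content="..."> (case-insensitive, name first).
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name\\s*=\\s*\"([^\"]*)\"\\s+content\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    /** Returns meta-tag names (lower-cased) mapped to their content, in document order. */
    static Map<String, String> extract(String html) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            tags.put(m.group(1).toLowerCase(), m.group(2));
        }
        return tags;
    }
}
```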
Cumulative Document
·
The cumulative document is the combination of all the documents;
it contains the meta-tags from every document.
·
We find the references (to other pages) in the input base
document, read those documents, find the references in them, and so on.
·
In this way, the meta-tags of all the documents are identified,
starting from the base document.
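The reference-following step above is essentially a graph traversal from the base document. The sketch below assumes the references of each page have already been extracted into a map called `links` (page id to referenced page ids); that map, the class name, and the method name are hypothetical. It uses a breadth-first walk and visits each page once, so circular references do not loop forever.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CumulativeDocument {
    /** Returns every page reachable from base, in the order it was first discovered. */
    static Set<String> reachablePages(String base, Map<String, List<String>> links) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        visited.add(base);
        queue.add(base);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            // Pages with no recorded references are treated as having none.
            List<String> refs = links.getOrDefault(page, Collections.<String>emptyList());
            for (String next : refs) {
                if (visited.add(next)) { // add() returns false for pages already seen
                    queue.add(next);
                }
            }
        }
        return visited;
    }
}
```

The cumulative document would then be built by collecting the meta-tags of every page in the returned set.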
Document Similarity
·
The similarity between two documents is measured with
cosine similarity.
·
The weights used in the cosine similarity are the
TF-IDF values of the phrases (meta-tags) of the two documents.
·
These are obtained by computing the term weights involved.
·
TF = C / T
·
IDF = D / DF
D → total number of documents
DF → number of times the word is found in the entire corpus
C → number of times the word appears in the document
T → total number of words in the document
·
TFIDF = TF * IDF
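The weighting and similarity steps above can be sketched in Java as follows. The sketch follows the definitions as written in this post, so IDF is the plain ratio D / DF with DF counted as total occurrences in the corpus; many TF-IDF formulations instead take a logarithm and count documents containing the word. It also assumes each document is already a list of words and that the document being weighted is part of the corpus. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TfIdfCosine {
    /** TF-IDF weight per word of one document: (C / T) * (D / DF). */
    static Map<String, Double> weights(List<String> doc, List<List<String>> corpus) {
        // DF: occurrences of each word across the whole corpus.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> other : corpus) {
            for (String w : other) df.merge(w, 1, Integer::sum);
        }
        // C: occurrences of each word in this document.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : doc) counts.merge(w, 1, Integer::sum);

        double t = doc.size();          // T: words in the document
        double d = corpus.size();       // D: documents in the corpus
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = e.getValue() / t;
            double idf = d / df.get(e.getKey()); // doc is assumed to be in corpus
            result.put(e.getKey(), tf * idf);
        }
        return result;
    }

    /** Cosine similarity of two sparse weight vectors; 0 if either is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) normB += v * v;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

For a two-document corpus `[a, b]` and `[a, c]`, the shared word `a` gets weight 0.5 in each document and the unique words get 1.0, giving a cosine similarity of about 0.2.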
Clustering
·
Clustering is a division of data into groups of similar
objects.
·
Representing the data by fewer clusters necessarily loses
certain fine details, but achieves simplification.
Similar
documents are grouped together in a cluster if their cosine similarity
is greater than a specified threshold.
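The post does not spell out the grouping procedure, so the following is only one possible reading: a single pass in which each document joins the first cluster whose first member it is similar enough to, and otherwise starts a new cluster. It assumes that a higher cosine similarity means the documents are more alike, so a document joins a cluster when the similarity exceeds the threshold. Documents are assumed to be dense weight vectors of equal length, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ThresholdClustering {
    /** Cosine similarity of two dense vectors; 0 when either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns clusters as lists of document indices. */
    static List<List<Integer>> cluster(double[][] docs, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            List<Integer> home = null;
            for (List<Integer> c : clusters) {
                // Compare against the cluster's first member as its representative.
                if (cosine(docs[c.get(0)], docs[i]) > threshold) {
                    home = c;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                clusters.add(home);
            }
            home.add(i);
        }
        return clusters;
    }
}
```

The result of this kind of single-pass scheme depends on the order of the documents; it is shown here only because it is the simplest procedure consistent with a fixed similarity threshold.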
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
•
System : Pentium IV 2.4 GHz.
•
Hard Disk : 40 GB.
•
Floppy Drive : 1.44 MB.
•
Monitor : 15 VGA Colour.
•
Mouse : Logitech.
•
RAM : 512 MB.
SOFTWARE REQUIREMENTS:
•
Operating system : - Windows XP.
•
Coding Language : - JAVA
REFERENCE:
Duc Thang Nguyen, Lihui Chen and Chee Keong Chan, “Clustering with
Multi-Viewpoint based Similarity Measure”, IEEE Transactions on Knowledge
and Data Engineering, Vol. 24, No. 6, June 2012.