A Near-Duplicate Detection Algorithm to Facilitate Document Clustering
Keywords: Web Content Mining
Instruments and machines
Electronic computers. Computer science
Abstract: Web mining faces major problems due to duplicate and near-duplicate web pages. Detecting near duplicates is very difficult in a large collection of data such as the Internet. Such pages contribute significantly to performance degradation when data from heterogeneous sources is integrated, and they increase either index storage space or serving costs. Detecting these pages has many potential applications; for example, a near duplicate may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing, duplicate and near-duplicate documents in order to support document clustering. We demonstrate our approach in the domain of web news articles. The experimental results show that our algorithm performs favorably in terms of similarity measures. Identifying duplicate and near-duplicate documents also reduced the memory required by repositories.