EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING

Kumar, Mr. Niraj and Vemula, Mr. Venkata Vinay Babu and Srinathan, Dr. Kannan and Varma, Dr. Vasudeva (2010) EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING. [Conference Paper]

Full text available as:

PDF (EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING) - Published Version
Available under License Creative Commons Public Domain Dedication.
73Kb

Abstract

This paper addresses the question: “How can we use Wikipedia-based concepts in document clustering with less human involvement while still achieving effective improvements in the results?” We propose a method that exploits the importance of N-grams in a document and uses Wikipedia-based additional knowledge for GAAC (group-average agglomerative clustering) based document clustering. The importance of an N-gram in a document depends on several features including, but not limited to, its frequency, the position of its occurrence in a sentence, and the position in the document of the sentence in which it occurs. First, we introduce a new similarity measure that takes the weighted N-gram importance into account when computing similarity during document clustering. As a result, the chances of topical similarity in clustering are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that fail to match owing to differences in writing scheme or strategy. Our experimental results on a publicly available text dataset show that the devised system achieves a significant improvement in performance over bag-of-words based state-of-the-art systems in this area.
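
The general idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the weighting formula (a count damped by sentence and word position), the use of cosine similarity, and the names ngram_weights, cosine, group_average and gaac are all assumptions made for illustration, and the Wikipedia-based filtering and concept-matching step is omitted entirely. The group-average criterion used here is the standard one, the mean pairwise similarity over all document pairs in the merged cluster.

```python
import math
from collections import defaultdict
from itertools import combinations


def ngram_weights(sentences, max_n=2):
    """Weight each N-gram by frequency, damped by sentence and word position.

    The 1/(1 + position) damping is an illustrative assumption,
    not the weighting scheme of the paper.
    """
    weights = defaultdict(float)
    for s_idx, sentence in enumerate(sentences):
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for w_idx in range(len(tokens) - n + 1):
                gram = " ".join(tokens[w_idx:w_idx + n])
                # Earlier sentences and earlier positions contribute more.
                weights[gram] += 1.0 / (1 + s_idx) + 1.0 / (1 + w_idx)
    return dict(weights)


def cosine(a, b):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * b[g] for g, w in a.items() if g in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_average(c1, c2, vectors):
    """Mean pairwise similarity over all document pairs in the merged cluster."""
    pairs = list(combinations(c1 + c2, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def gaac(docs, k):
    """Merge the best-scoring pair of clusters until only k clusters remain."""
    vectors = [ngram_weights(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > max(k, 1):
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Toy input: each document is a list of sentences (made-up example data).
docs = [
    ["wikipedia concepts help document clustering", "clustering groups documents"],
    ["document clustering uses wikipedia concepts"],
    ["the football match ended in a draw"],
    ["a late goal decided the football match"],
]
clusters = gaac(docs, 2)
print(clusters)
```

This naive loop recomputes every pairwise similarity at each merge, which is fine for a toy example but too slow for a real corpus, where cached similarities or a library implementation of average linkage would be preferable.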

Item Type:	Conference Paper
Keywords:	Document clustering, Group-average agglomerative clustering, Community detection, Similarity measure, N-gram, Wikipedia based additional knowledge.
Subjects:	Computer Science > Statistical Models
ID Code:	7148
Deposited By:	Kumar, Mr Niraj
Deposited On:	22 Nov 2010 14:10
Last Modified:	11 Mar 2011 08:57
