Cogprints

EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING

Kumar, Mr. Niraj and Vemula, Mr. Venkata Vinay Babu and Srinathan, Dr. Kannan and Varma, Dr. Vasudeva (2010) EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING. [Conference Paper]

Full text available as:

PDF (EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING) - Published Version
Available under License Creative Commons Public Domain Dedication.

73Kb

Abstract

This paper addresses the question: “How can we use Wikipedia based concepts in document clustering with less human involvement, accompanied by effective improvements in results?” We propose a method that exploits the importance of N-grams in a document and uses Wikipedia based additional knowledge for GAAC (group-average agglomerative clustering) based document clustering. The importance of an N-gram in a document depends on several features, including but not limited to its frequency, the position of its occurrence within a sentence, and the position of that sentence within the document. First, we introduce a new similarity measure that takes weighted N-gram importance into account when computing document similarity during clustering. As a result, the chances of grouping topically similar documents together are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that fail to match owing to differences in writing scheme or strategy. Our experimental results on a publicly available text dataset show that the devised system significantly outperforms state-of-the-art bag-of-words based systems in this area.
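
As a rough illustration of the ideas named in the abstract, the following Python sketch scores N-grams by frequency and position, compares documents with a weighted overlap, and merges clusters by group-average similarity. It is not the authors' method: the reciprocal position discounts, the weighted Jaccard-style measure, the stopping threshold, and the `keep` set (a hypothetical stand-in for a Wikipedia-derived list of accepted N-grams) are all assumptions made for the example, since the abstract does not give the actual formulas.

```python
import re
from collections import defaultdict
from itertools import combinations


def ngram_weights(text, max_n=2, keep=None):
    """Assumed importance score: every occurrence adds weight (frequency),
    discounted by sentence position and by position within the sentence.
    `keep` is a hypothetical lowercase set of accepted N-grams."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    weights = defaultdict(float)
    for s_idx, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z0-9]+", sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                if keep is not None and gram not in keep:
                    continue  # treat unlisted N-grams as noise
                weights[gram] += 1.0 / ((1 + s_idx) * (1 + i))
    return dict(weights)


def weighted_similarity(w1, w2):
    """Weighted Jaccard-style overlap in [0, 1] (an assumed stand-in measure)."""
    keys = set(w1) | set(w2)
    num = sum(min(w1.get(k, 0.0), w2.get(k, 0.0)) for k in keys)
    den = sum(max(w1.get(k, 0.0), w2.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


def gaac(docs, threshold=0.1, keep=None):
    """Group-average agglomerative clustering: repeatedly merge the two
    clusters whose union has the highest average pairwise similarity,
    stopping once that average drops below `threshold`."""
    weights = [ngram_weights(d, keep=keep) for d in docs]
    sim = {(i, j): weighted_similarity(weights[i], weights[j])
           for i, j in combinations(range(len(docs)), 2)}

    def group_average(members):
        # Average over all distinct document pairs in the candidate cluster.
        pairs = list(combinations(sorted(members), 2))
        return sum(sim[p] for p in pairs) / len(pairs)

    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda ab: group_average(clusters[ab[0]] + clusters[ab[1]]))
        merged = clusters[a] + clusters[b]
        if group_average(merged) < threshold:
            break
        clusters[a] = sorted(merged)
        del clusters[b]  # a < b, so index a is unaffected
    return clusters


if __name__ == "__main__":
    sample = [
        "Wikipedia concepts improve document clustering. Clustering groups documents.",
        "Document clustering with Wikipedia concepts works well.",
        "Football players scored goals in the match.",
        "The match ended after football players scored.",
    ]
    print(gaac(sample, threshold=0.1))
```

Calling `gaac` on a list of raw document strings returns clusters as lists of document indices. The threshold-based stopping rule is only one possible choice; a fixed target number of clusters would work equally well in this sketch.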

Item Type: Conference Paper
Keywords: Document clustering, Group-average agglomerative clustering, Community detection, Similarity measure, N-gram, Wikipedia based additional knowledge.
Subjects: Computer Science > Statistical Models
ID Code: 7148
Deposited By: Kumar, Mr Niraj
Deposited On: 22 Nov 2010 14:10
Last Modified: 11 Mar 2011 08:57
