---
abstract: "This paper addresses the question: “How can we use Wikipedia-based concepts in document clustering with less human involvement while still improving the results?” We propose a method that exploits the importance of N-grams in a document and uses Wikipedia-based additional knowledge for group-average agglomerative clustering (GAAC) of documents. The importance of an N-gram in a document depends on several features, including, but not limited to, its frequency, the position of its occurrence in a sentence, and the position in the document of the sentence in which it occurs. First, we introduce a new similarity measure that takes the weighted N-gram importance into account when computing document similarity during clustering. As a result, the chances of grouping topically similar documents are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that do not match because of differences in writing scheme or strategy. Our experimental results on a publicly available text dataset show that the devised system significantly improves performance over state-of-the-art bag-of-words based systems in this area."
altloc: []
chapter: ~
commentary: ~
commref: ~
confdates: '25-28 October, 2010'
conference: KDIR 2010
confloc: 'Valencia, Spain'
contact_email: ~
creators_id:
  - niraj_kumar@research.iiit.ac.in
  - vinaybabu.vv@gmail.com
  - srinathan@iiit.ac.in
  - vv@iiit.ac.in
creators_name:
  - family: Kumar
    given: 'Niraj'
    honourific: Mr.
    lineage: ''
  - family: Vemula
    given: Venkata Vinay Babu
    honourific: Mr.
    lineage: ''
  - family: Srinathan
    given: Kannan
    honourific: Dr.
    lineage: ''
  - family: Varma
    given: Vasudeva
    honourific: Dr.
    lineage: ''
date: 2010-10-25
date_type: published
datestamp: 2010-11-22 14:10:12
department: ~
dir: disk0/00/00/71/48
edit_lock_since: ~
edit_lock_until: 0
edit_lock_user: ~
editors_id: []
editors_name: []
eprint_status: archive
eprintid: 7148
fileinfo: /style/images/fileicons/application_pdf.png;/7148/1/KDIR_Niraj.pdf
full_text_status: public
importid: ~
institution: ~
isbn: ~
ispublished: pub
issn: ~
item_issues_comment: []
item_issues_count: 0
item_issues_description: []
item_issues_id: []
item_issues_reported_by: []
item_issues_resolved_by: []
item_issues_status: []
item_issues_timestamp: []
item_issues_type: []
keywords: 'Document clustering, Group-average agglomerative clustering, Community detection, Similarity measure, N-gram, Wikipedia based additional knowledge.'
lastmod: 2011-03-11 08:57:49
latitude: ~
longitude: ~
metadata_visibility: show
note: ~
number: ~
pagerange: ~
pubdom: TRUE
publication: ~
publisher: ~
refereed: TRUE
referencetext: "1. Banerjee, S., Ramanathan, K., Gupta, A., 2007. Clustering Short Texts using Wikipedia; SIGIR’07, July 23–27, Amsterdam, The Netherlands.\r\n\r\n2. Clauset, A., Newman, M., Moore, C., 2004. Finding community structure in very large networks. Physical Review E, 70:066111, 2004.\r\n\r\n3. Hammouda, K., Matute, D., Kamel, M., 2005. CorePhrase: Keyphrase Extraction for Document Clustering; In IAPR: 4th International Conference on Machine Learning and Data Mining.\r\n\r\n4. Han, J., Kim, T., Choi, J., 2007. Web Document Clustering by Using Automatic Keyphrase Extraction; Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops.\r\n\r\n5. Hu, X., Zhang, X., Lu, C., Park, E., Zhou, X., 2009. Exploiting Wikipedia as External Knowledge for Document Clustering; KDD’09.\r\n\r\n6. Huang, A., Milne, D., Frank, E., Witten, I., 2008. Clustering Documents with Active Learning Using Wikipedia. ICDM 2008.\r\n\r\n7. Huang, A., Milne, D., Frank, E., Witten, I., 2009. Clustering documents using a Wikipedia-based concept representation. In Proc 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining.\r\n\r\n8. Kaufman, L., and Rousseeuw, P., 1999. Finding Groups in data: An introduction to cluster analysis, 1999, John Wiley & Sons.\r\n\r\n9. Kumar, N., Srinathan, K., 2008. Automatic Keyphrase Extraction from Scientific Documents Using N-gram Filtration Technique. In the Proceedings of ACM DocEng.\r\n\r\n10. Newman, M., Girvan, M., 2004. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004.\r\n\r\n11. Steinbach, M., Karypis, G., and Kumar, V., 2000. A Comparison of document clustering techniques. Technical Report. Department of Computer Science and Engineering, University of Minnesota.\r\n\r\n12. Tan, P., Steinbach, M., Kumar, V., 2006. Introduction to Data Mining; Addison-Wesley; ISBN-10: 0321321367.\r\n\r\n13. Zhao, Y., Karypis, G., 2001. Criterion functions for document clustering: experiments and analysis, Technical Report. Department of Computer Science, University of Minnesota."
relation_type: []
relation_uri: []
reportno: ~
rev_number: 23
series: ~
source: ~
status_changed: 2010-11-22 14:10:12
subjects:
  - comp-sci-stat-model
succeeds: ~
suggestions: ~
sword_depositor: ~
sword_slug: ~
thesistype: ~
title: EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING
type: confpaper
userid: 8811
volume: ~