---
abstract: "This paper addresses the question: “How can we use Wikipedia-based concepts in document clustering with less human involvement, accompanied by effective improvements in results?” In the devised system, we propose a method that exploits the importance of N-grams in a document and uses Wikipedia-based additional knowledge for GAAC-based document clustering. The importance of an N-gram in a document depends on several features including, but not limited to: its frequency, the position of its occurrence in a sentence, and the position within the document of the sentence in which it occurs. First, we introduce a new similarity measure that takes weighted N-gram importance into account when computing similarity during document clustering. As a result, the chances of topical similarity in clustering are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that fail to match owing to differences in writing scheme or strategy. Our experimental results on a publicly available text dataset show that the devised system significantly improves performance over bag-of-words based state-of-the-art systems in this area."
altloc: []
chapter: ~
commentary: ~
commref: ~
confdates: '25-28 October, 2010'
conference: KDIR 2010
confloc: 'Valencia, Spain'
contact_email: ~
creators_id:
  - niraj_kumar@research.iiit.ac.in
  - vinaybabu.vv@gmail.com
  - srinathan@iiit.ac.in
  - vv@iiit.ac.in
creators_name:
  - family: Kumar
    given: 'Niraj'
    honourific: Mr.
    lineage: ''
  - family: Vemula
    given: Venkata Vinay Babu
    honourific: Mr.
    lineage: ''
  - family: Srinathan
    given: Kannan
    honourific: Dr.
    lineage: ''
  - family: Varma
    given: Vasudeva
    honourific: Dr.
    lineage: ''
date: 2010-10-25
date_type: published
datestamp: 2010-11-22 14:10:12
department: ~
dir: disk0/00/00/71/48
edit_lock_since: ~
edit_lock_until: 0
edit_lock_user: ~
editors_id: []
editors_name: []
eprint_status: archive
eprintid: 7148
fileinfo: /style/images/fileicons/application_pdf.png;/7148/1/KDIR_Niraj.pdf
full_text_status: public
importid: ~
institution: ~
isbn: ~
ispublished: pub
issn: ~
item_issues_comment: []
item_issues_count: 0
item_issues_description: []
item_issues_id: []
item_issues_reported_by: []
item_issues_resolved_by: []
item_issues_status: []
item_issues_timestamp: []
item_issues_type: []
keywords: 'Document clustering, Group-average agglomerative clustering, Community detection, Similarity measure, N-gram, Wikipedia based additional knowledge.'
lastmod: 2011-03-11 08:57:49
latitude: ~
longitude: ~
metadata_visibility: show
note: ~
number: ~
pagerange: ~
pubdom: TRUE
publication: ~
publisher: ~
refereed: TRUE
referencetext: |-
  1. Banerjee, S., Ramanathan, K., Gupta, A., 2007. Clustering Short Texts using Wikipedia; SIGIR’07, July 23–27, Amsterdam, The Netherlands.

  2. Clauset, A., Newman, M., Moore, C., 2004. Finding community structure in very large networks. Physical Review E, 70:066111, 2004.

  3. Hammouda, K., Matute, D., Kamel, M., 2005. CorePhrase: Keyphrase Extraction for Document Clustering; In IAPR: 4th International Conference on Machine Learning and Data Mining.

  4. Han, J., Kim, T., Choi, J., 2007. Web Document Clustering by Using Automatic Keyphrase Extraction; Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops.

  5. Hu, X., Zhang, X., Lu, C., Park, E., Zhou, X., 2009. Exploiting Wikipedia as External Knowledge for Document Clustering; KDD’09.

  6. Huang, A., Milne, D., Frank, E., Witten, I., 2008. Clustering Documents with Active Learning Using Wikipedia. ICDM 2008.

  7. Huang, A., Milne, D., Frank, E., Witten, I., 2009. Clustering documents using a Wikipedia-based concept representation. In Proc 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining.

  8. Kaufman, L., and Rousseeuw, P., 1999. Finding Groups in data: An introduction to cluster analysis, 1999, John Wiley & Sons.

  9. Kumar, N., Srinathan, K., 2008. Automatic Keyphrase Extraction from Scientific Documents Using N-gram Filtration Technique. In the Proceedings of ACM DocEng.

  10. Newman, M., Girvan, M., 2004. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004.

  11. Steinbach, M., Karypis, G., and Kumar, V., 2000. A Comparison of document clustering techniques. Technical Report. Department of Computer Science and Engineering, University of Minnesota.

  12. Tan, P., Steinbach, M., Kumar, V., 2006. Introduction to Data Mining; Addison-Wesley; ISBN-10: 0321321367.

  13. Zhao, Y., Karypis, G., 2001. Criterion functions for document clustering: experiments and analysis, Technical Report. Department of Computer Science, University of Minnesota.
relation_type: []
relation_uri: []
reportno: ~
rev_number: 23
series: ~
source: ~
status_changed: 2010-11-22 14:10:12
subjects:
  - comp-sci-stat-model
succeeds: ~
suggestions: ~
sword_depositor: ~
sword_slug: ~
thesistype: ~
title: EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING
type: confpaper
userid: 8811
volume: ~