--- abstract: "We describe NRC's submission to the Anomaly Detection/Text Mining competition organised at the Text Mining Workshop 2007. This submission relies on a straightforward implementation of the probabilistic categoriser described in (Gaussier et al., ECIR'02). This categoriser is adapted to handle multiple labelling and a piecewise-linear confidence estimation layer is added to provide an estimate of the labelling confidence. This technique achieves a score of 1.689 on the test data.\n" altloc: [] chapter: ~ commentary: ~ commref: ~ confdates: 28 April 2007 conference: Text Mining 2007 confloc: 'Minneapolis, USA' contact_email: ~ creators_id: [] creators_name: - family: Goutte given: Cyril honourific: '' lineage: '' date: 2007 date_type: published datestamp: 2007-07-28 department: ~ dir: disk0/00/00/56/26 edit_lock_since: ~ edit_lock_until: ~ edit_lock_user: ~ editors_id: [] editors_name: [] eprint_status: archive eprintid: 5626 fileinfo: /style/images/fileicons/application_pdf.png;/5626/1/goutte07tmw.pdf full_text_status: public importid: ~ institution: ~ isbn: ~ ispublished: pub issn: ~ item_issues_comment: [] item_issues_count: 0 item_issues_description: [] item_issues_id: [] item_issues_reported_by: [] item_issues_resolved_by: [] item_issues_status: [] item_issues_timestamp: [] item_issues_type: [] keywords: 'Text categorization, probabilistic model, confidence estimation, multi-label categorization, category description' lastmod: 2011-03-11 08:56:55 latitude: ~ longitude: ~ metadata_visibility: show note: ~ number: ~ pagerange: ~ pubdom: FALSE publication: ~ publisher: ~ refereed: FALSE referencetext: | Cancedda, N., Goutte, C., Renders, J.-M., Cesa-Bianchi, N., Conconi, A., Li, Y., Shawe-Taylor, J., Vinokourov, A., Graepel, T. and Gentile, C. (2002). Kernel Methods for Document Filtering. The Eleventh Text REtrieval Conference (TREC 2002), National Institute of Standards and Technology (NIST). Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). 
Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38. Gandrabur, S., Foster, G. and Lapalme, G. (2006). Confidence Estimation for NLP Applications. ACM Transactions on Speech and Language Processing, 3(3):1-29. Gaussier, E., Goutte, C., Popat, K., and Chen, F. (2002). A hierarchical model for clustering and categorising documents. Proceedings of the 24th BCS-IRSG Colloquium on IR Research (ECIR'02), pp. 229-247. Springer. Gillick, L., Ito, Y. and Young, J. (1997). A Probabilistic Approach to Confidence Estimation and Evaluation. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'97), Volume 2, pp. 879-882. Goutte, C. and Gaussier, E. (2004). Method for multi-class, multi-label categorization using probabilistic hierarchical modeling. US Patent 7,139,754 (granted Nov. 21, 2006). Goutte, C. and Gaussier, E. (2005). A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. Advances in Information Retrieval, 27th European Conference on IR Research (ECIR 2005), pp. 345-359. Springer. Hofmann, T. (1999). Probabilistic latent semantic analysis. Uncertainty in Artificial Intelligence (UAI'99), pp. 289-296. Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. ECML '98: Proceedings of the 10th European Conference on Machine Learning, pp. 137-142. McCallum, A. and Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48. McCallum, A. (1999). Multi-Label Text Classification with a Mixture Model Trained by EM. AAAI'99 Workshop on Text Learning. Renders, J.-M., Gaussier, E., Goutte, C., Pacull, F. and Csurka, G. (2006). Categorization in multiple category systems. 
Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), pp. 745-752. Rose, K., Gurewitz, E., and Fox, G. (1990). A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(9):589-594. Vapnik, V. N. (1998). Statistical Learning Theory. Wiley. Zadrozny, B. and Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. Machine Learning, Proceedings of the Eighteenth International Conference (ICML'01). relation_type: [] relation_uri: [] reportno: ~ rev_number: 12 series: ~ source: ~ status_changed: 2007-09-12 17:11:07 subjects: - comp-sci-stat-model - ling-comput - comp-sci-mach-learn succeeds: ~ suggestions: ~ sword_depositor: ~ sword_slug: ~ thesistype: ~ title: Fast & Confident Probabilistic Categorization type: confpaper userid: 7131 volume: ~