creators_name: Goutte, Cyril
type: confpaper
datestamp: 2007-07-28
lastmod: 2011-03-11 08:56:55
metadata_visibility: show
title: Fast & Confident Probabilistic Categorization
ispublished: pub
subjects: comp-sci-stat-model
subjects: ling-comput
subjects: comp-sci-mach-learn
full_text_status: public
keywords: Text categorization, probabilistic model, confidence estimation, multi-label categorization, category description
abstract: We describe NRC's submission to the Anomaly Detection/Text Mining competition organised at the Text Mining Workshop 2007. This submission relies on a straightforward implementation of the probabilistic categoriser described in (Gaussier et al., ECIR'02). This categoriser is adapted to handle multiple labelling, and a piecewise-linear confidence estimation layer is added to provide an estimate of the labelling confidence. This technique achieves a score of 1.689 on the test data.
date: 2007
date_type: published
refereed: FALSE
referencetext:
Cancedda, N., Goutte, C., Renders, J.-M., Cesa-Bianchi, N., Conconi, A., Li, Y., Shawe-Taylor, J., Vinokourov, A., Graepel, T. and Gentile, C. (2002). Kernel Methods for Document Filtering. The Eleventh Text REtrieval Conference (TREC 2002), National Institute of Standards and Technology (NIST).
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.
Gandrabur, S., Foster, G. and Lapalme, G. (2006). Confidence Estimation for NLP Applications. ACM Transactions on Speech and Language Processing, 3(3):1-29.
Gaussier, E., Goutte, C., Popat, K. and Chen, F. (2002). A hierarchical model for clustering and categorising documents. Proceedings of the 24th BCS-IRSG Colloquium on IR Research (ECIR'02), pp. 229-247. Springer.
Gillick, L., Ito, Y. and Young, J. (1997). A Probabilistic Approach to Confidence Estimation and Evaluation. ICASSP '97: Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'97), Volume 2, pp. 879-882.
Goutte, C. and Gaussier, E. (2004). Method for multi-class, multi-label categorization using probabilistic hierarchical modeling. US Patent 7,139,754 (granted Nov. 21, 2006).
Goutte, C. and Gaussier, E. (2005). A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. Advances in Information Retrieval, 27th European Conference on IR Research (ECIR 2005), pp. 345-359. Springer.
Hofmann, T. (1999). Probabilistic latent semantic analysis. Uncertainty in Artificial Intelligence (UAI'99), pp. 289-296.
Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. ECML '98: Proceedings of the 10th European Conference on Machine Learning, pp. 137-142.
McCallum, A. and Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
McCallum, A. (1999). Multi-Label Text Classification with a Mixture Model Trained by EM. AAAI'99 Workshop on Text Learning.
Renders, J.-M., Gaussier, E., Goutte, C., Pacull, F. and Csurka, G. (2006). Categorization in multiple category systems. Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), pp. 745-752.
Rose, K., Gurewitz, E. and Fox, G. (1990). A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(9):589-594.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.
Zadrozny, B. and Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. Machine Learning, Proceedings of the Eighteenth International Conference (ICML'01).
citation: Goutte, Cyril (2007) Fast & Confident Probabilistic Categorization. [Conference Paper]
document_url: http://cogprints.org/5626/1/goutte07tmw.pdf