Fast & Confident Probabilistic Categorization

Goutte, Cyril (2007) Fast & Confident Probabilistic Categorization. [Conference Paper]

We describe NRC's submission to the Anomaly Detection/Text Mining competition organised at the Text Mining Workshop 2007. This submission relies on a straightforward implementation of the probabilistic categoriser described in (Gaussier et al., ECIR'02). This categoriser is adapted to handle multiple labelling, and a piecewise-linear confidence estimation layer is added to estimate the labelling confidence. This technique achieves a score of 1.689 on the test data.
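
As a rough illustration of what a piecewise-linear confidence layer can look like, the sketch below maps a raw category score to a confidence value by linear interpolation between breakpoints. The function name, the breakpoints, and the example scores are hypothetical and are not taken from the submission. In practice the breakpoints would presumably be fitted on held-out data.

```python
import numpy as np

def piecewise_linear_confidence(scores, knots_x, knots_y):
    """Map raw scores to confidence values by linear interpolation
    between the breakpoints (knots_x, knots_y). Scores outside the
    breakpoint range are clamped to the first or last confidence value."""
    return np.interp(scores, knots_x, knots_y)

# Hypothetical breakpoints (assumption, for illustration only):
# a score of 0.2 maps to confidence 0.1, a score of 0.8 to 0.9.
knots_x = np.array([0.0, 0.2, 0.8, 1.0])
knots_y = np.array([0.0, 0.1, 0.9, 1.0])

# Hypothetical raw scores for three document/category pairs.
scores = np.array([0.1, 0.5, 0.9])
confidence = piecewise_linear_confidence(scores, knots_x, knots_y)
print(confidence)
```

Using increasing confidence values at the breakpoints keeps the mapping monotonic, so the ranking of labels by raw score is preserved.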

Item Type: Conference Paper
Keywords: Text categorization, probabilistic model, confidence estimation, multi-label categorization, category description
Subjects: Computer Science > Statistical Models
Linguistics > Computational Linguistics
Computer Science > Machine Learning
ID Code: 5626
Deposited By: Goutte, Dr. Cyril
Deposited On: 28 Jul 2007
Last Modified: 11 Mar 2011 08:56

References in Article

Cancedda, N., Goutte, C., Renders, J.-M., Cesa-Bianchi, N., Conconi, A., Li, Y., Shawe-Taylor, J., Vinokourov, A., Graepel, T. and Gentile, C. (2002). Kernel Methods for Document Filtering. The Eleventh Text REtrieval Conference (TREC 2002), National Institute of Standards and Technology (NIST).

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.

Gandrabur, S., Foster, G. and Lapalme, G. (2006). Confidence Estimation for NLP Applications. ACM Transactions on Speech and Language Processing, 3(3):1-29.

Gaussier, E., Goutte, C., Popat, K. and Chen, F. (2002). A hierarchical model for clustering and categorising documents. Proceedings of the 24th BCS-IRSG Colloquium on IR Research (ECIR'02), pp. 229-247. Springer.

Gillick, L., Ito, Y. and Young, J. (1997). A Probabilistic Approach to Confidence Estimation and Evaluation. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'97), Volume 2, pp. 879-882.

Goutte, C. and Gaussier, E. (2004). Method for multi-class, multi-label categorization using probabilistic hierarchical modeling. US Patent 7,139,754 (granted Nov. 21, 2006).

Goutte, C. and Gaussier, E. (2005). A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. Advances in Information Retrieval, 27th European Conference on IR Research (ECIR 2005), pp. 345-359. Springer.

Hofmann, T. (1999). Probabilistic latent semantic analysis. Uncertainty in Artificial Intelligence (UAI'99), pp. 289-296.

Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. ECML '98: Proceedings of the 10th European Conference on Machine Learning, pp. 137-142.

McCallum, A. and Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.

McCallum, A. (1999). Multi-Label Text Classification with a Mixture Model Trained by EM. AAAI'99 Workshop on Text Learning.

Renders, J.-M., Gaussier, E., Goutte, C., Pacull, F. and Csurka, G. (2006). Categorization in multiple category systems. Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), pp. 745-752.

Rose, K., Gurewitz, E. and Fox, G. (1990). A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(9):589-594.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.

Zadrozny, B. and Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. Machine Learning, Proceedings of the Eighteenth International Conference (ICML'01).

