Data Cube Approximation and Mining using Probabilistic Modeling

Goutte, Cyril and Missaoui, Rokia and Boujenoui, Ameur (2007) Data Cube Approximation and Mining using Probabilistic Modeling. [Departmental Technical Report]

Full text available as:



On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and under different abstraction levels in a dimension hierarchy. However, such techniques are not aimed at mining multidimensional data. Since data cubes are nothing but multi-way tables, we propose to analyze the potential of two probabilistic modeling techniques, namely non-negative multi-way array factorization and log-linear modeling, with the ultimate objective of compressing and mining aggregate and multidimensional values. With the first technique, we compute the set of components that best fit the initial data set and whose superposition coincides with the original data; with the second technique we identify a parsimonious model (i.e., one with a reduced set of parameters), highlight strong associations among dimensions and discover possible outliers in data cells. A real life example will be used to (i) discuss the potential benefits of the modeling output on cube exploration and mining, (ii) show how OLAP queries can be answered in an approximate way, and (iii) illustrate the strengths and limitations of these modeling approaches.

Item Type:Departmental Technical Report
Keywords:data cubes, OLAP, data warehouses, multidimensional data, non-negative multi-way array factorization, log-linear modeling
Subjects:Computer Science > Machine Learning
Computer Science > Artificial Intelligence
ID Code:5622
Deposited By: Goutte, Dr. Cyril
Deposited On:28 Jul 2007
Last Modified:11 Mar 2011 08:56

References in Article

Select the SEEK icon to attempt to find the referenced article. If it does not appear to be in cogprints you will be forwarded to the paracite service. Poorly formated references will probably not work.

Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716--723, 1974.

Brian Babcock, Surajit Chaudhuri, and Gautam Das. Dynamic sample selection for approximate query processing. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 539--550, New York, NY, USA, 2003. ACM Press.

Daniel Barbara and Xintao Wu. Using loglinear models to compress datacube. In WAIM '00: Proceedings of the First International Conference on Web-Age Information Management, pages 311--322, London, UK, 2000. Springer-Verlag.

Daniel Barbara and Xintao Wu. Loglinear-based quasi cubes. J. Intell. Inf. Syst., 16(3):255--276, 2001.

Christophe Biernacki, Gilles Celeux, and Gérard Govaert. An improvement of the NEC criterion for assessing the number of clusters in a mixture model. Pattern Recognition Letter, 20:267--272, 1999.

Christophe Biernacki, Gilles Celeux, and Gérard Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):719--725, 2000.

Ameur Boujenoui and Daniel Zéghal. Effet de la structure des droits de vote sur la qualité des mécanismes internes de gouvernance: cas des entreprises canadiennes. Canadian Journal of Administrative Sciences, 23(3):183--201, 2006.

I.~Buciu and I.~Pitas. Application of non-negative and local non negative matrix factorization to facial expression recognition. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04) - Volume 1, 2004.

Wray Buntine. Variational extensions to EM and multinomial PCA. In ECML'02, pages 23--34, 2002.

Surajit Chaudhuri and Umeshwar Dayal. An overview of data warehousing and olap technology. SIGMOD Rec., 26(1):65--74, 1997.

Ronald Christensen. Log-linear Models. Springer-Verlag, New York, 1997.

W.E. Deming and F.F. Stephan. On a least squares adjustment of a sampled frequency table when the expected marginal total are known. The Annals of Mathematical Statistics, 11(4):427--444, 1940.

A.~P. Dempster, N.~M. Laird, and D.~B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1--38, 1977.

Venkatesh Ganti, Mong-Li Lee, and Raghu Ramakrishnan. Icicles: Self-tuning samples for approximate query answering. In VLDB '00: Proceedings of the 26th International Conference on Very Large Data Bases, pages 176--187, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

E.~Gaussier, C.~Goutte, K.~Popat, and F.~Chen. A hierarchical model for clustering and categorising documents. In Advances in Information Retrieval, Lecture Notes in Computer Science, pages 229--247. Springer, 2002.

Eric Gaussier and Cyril Goutte. Relation between PLSA and NMF and implications. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 601--602, New York, NY, USA, 2005. ACM Press.

Cyril Goutte, Kenji Yamada, and Eric Gaussier. Aligning words using matrix factorisation. In ACL'04, pages 503--510, 2004.

D.~Guillamet, J.~Vitria, and B.~Schiele. Introducing a weighted non-negative matrix factorization for image classification. Pattern Recognition Letter, 24(14):2447--2454, 2003.

David Guillamet, Marco Bressan, and Jordi Vitria. A weighted non-negative matrix factorization for local representation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 942--947, 2001.

S.J. Haberman. Analysis of qualitative data, Volume 1. Academic Press, New York, 1978.

Venky Harinarayan, Anand Rajaraman, and Jeffrey~D. Ullman. Implementing data cubes efficiently. In SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pages 205--216, New York, NY, USA, 1996. ACM Press.

D.C. Hoaglin, F.~Mosteller, and J.W. Tukey. Understanding Robust and Exploratory Data Analysis. John Wiley, New York, USA, 1983.

Thomas Hofmann. Probabilistic latent semantic analysis. In UAI'99, pages 289--296. Morgan Kaufmann, 1999.

Xin Jin, Yanzan Zhou, and Bamshad Mobasher. Web usage mining based on probabilistic latent semantic analysis. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 197--205, New York, NY, USA, 2004. ACM Press.

David Knoke and Peter Burke. Log-Linear Models. Sage, Beverly Hills, CA, USA, 1980.

Laks V.~S. Lakshmanan, Jian Pei, and Yan Zhao. Quotient cube: How to summarize the semantics of a data cube. In Proceedings of the 28th International Conference on Very Large Databases, VLDB, pages 778--789, 2002.

Daniel~D. Lee and H.~Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788--791, 1999.

Daniel~D. Lee and H.~Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS'13. MIT Press, 2001.

J.~S. Lee, D.~D. Lee, S.~Choi, and D.~S. Lee. Application of non-negative matrix factorization to dynamic positron emission tomography. In Proceedings of the International Conference on Independent Component Analysis and Signal Separation (ICA2001), 2001.

Hongjun Lu, Ling Feng, and Jiawei Han. Beyond intratransaction association analysis: mining multidimensional intertransaction association rules. ACM Trans. Inf. Syst., 18(4):423--454, 2000.

Riadh~Ben Messaoud, Omar Boussaid, and Sabine Rabaséda. A new olap aggregation based on the ahc technique. In DOLAP'04: Proceedings of the 7th ACM international workshop on Data warehousing and OLAP, pages 65--72, New York, NY, USA, 2004. ACM Press.

Morten Morup, Lars~Kai Hansen, Josef Parnas, and Sidse~M. Arnfred. Decomposing the time-frequency representation of {EEG} using non-negative matrix and multi-way factorization. Technical report, Informatics and Mathematical Modeling, Technical University of Denmark, 2006.

Themis Palpanas, Nick Koudas, and Alberto Mendelzon. Using datacube aggregates for approximate querying and deviation detection. IEEE Transactions on Knowledge and Data Engineering, 17(11):1465--1477, 2005.

How ROB created the rating system. Globe and Mail, Report on Business, B6, october 7 2006.

K.~Rose, E.~Gurewwitz, and G.~Fox. A deterministic annealing approach to clustering. Pattern Recogn. Letters, 11(9):589--594, 1990.

Sunita Sarawagi, Rakesh Agrawal, and Nimrod Megiddo. Discovery-driven exploration of olap data cubes. In EDBT '98: Proceedings of the 6th International Conference on Extending Database Technology, pages 168--182, London, UK, 1998. Springer-Verlag.

Gideon Schwartz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461--464, 1978.

J.~Shanmugasundaram, U.~Fayyad, and P.~S. Bradley. Compressed data cubes for olap aggregate query approximation on continuous dimensions. In KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 223--232. ACM Press, 1999.

P.~Smaragdis and J.C. Brown. Non-negative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 177--180, 2003.

Suvrit Sra and Inderjit~S. Dhillon. Nonnegative matrix approximation: Algorithms and applications. Technical report, Dept. of Computer Sciences, The Univ. of Texas at Austin, May 2006.

J.K. Vermunt. Log-linear models, volume Encyclopedia of Statistics in Behavioral Science, pages 1082--1093. Wiley, Chichester, UK, 2005.

Max Welling and Markus Weber. Positive tensor factorization. Pattern Recognition Letters, 22(12):1255--1261, 2001.

Dong Xin, Jiawei Han, Xiaolei Li, and Benjamin~W. Wah. Star-cubing: Computing iceberg cubes by top-down and bottom-up integration. In VLDB, 2003.

Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In SIGIR'03, pages 267--273, 2003.


Repository Staff Only: item control page