Items from Data Mining track
Number of items: 22.
and Zhou, Jingyu
and Guo, Minyi A Class-Feature-Centroid Classifier for Text Categorization.
Automated text categorization is an important technique for many web applications, such as document indexing, document filtering, and cataloging web resources. Many different approaches have been proposed for the automated text categorization problem. Among them, centroid-based approaches have the advantage of short training and testing times due to their computational efficiency. As a result, centroid-based classifiers have been widely used in many web applications. However, the accuracy of centroid-based classifiers is inferior to that of SVM, mainly because the centroids found during construction are far from their optimal locations. We design a fast Class-Feature-Centroid (CFC) classifier for multi-class, single-label text categorization. In CFC, a centroid is built from two important class distributions: an inter-class term index and an inner-class term index. CFC proposes a novel combination of these indices and employs a denormalized cosine measure to calculate the similarity score between a text vector and a centroid. Experiments on the Reuters-21578 corpus and the 20-newsgroup email collection show that CFC consistently outperforms state-of-the-art SVM classifiers on both micro-F1 and macro-F1 scores. In particular, CFC is more effective and robust than SVM when data is sparse.
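As an illustration of the centroid-based scheme the abstract describes, here is a minimal Python sketch: each centroid weight multiplies an inner-class index (exponential in the term's within-class document frequency) by an inter-class index (logarithmic in how few classes contain the term), and classification is a denormalized dot product against the raw centroid. The index definitions and the constant `b` are illustrative stand-ins, not the paper's exact formulation.

```python
import math

def cfc_centroids(docs, labels, b=math.e):
    """Toy Class-Feature-Centroid sketch. Each doc is a set of terms.
    Centroid weight for term t in class c combines an inner-class index
    (how widely t occurs within c) with an inter-class index (how few
    classes contain t). Illustrative formulas, not the paper's exact ones."""
    classes = sorted(set(labels))
    vocab = sorted({t for d in docs for t in d})
    centroids = {}
    for c in classes:
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        centroid = {}
        for t in vocab:
            df_tc = sum(1 for d in class_docs if t in d)   # docs of class c with t
            cf_t = sum(1 for c2 in classes                 # classes containing t
                       if any(t in d for d, l in zip(docs, labels) if l == c2))
            if df_tc and cf_t:
                inner = b ** (df_tc / len(class_docs))     # inner-class index
                inter = math.log(len(classes) / cf_t)      # inter-class index
                centroid[t] = inner * inter                # 0 if t is in all classes
        centroids[c] = centroid
    return centroids

def classify(doc, centroids):
    # Denormalized cosine: plain dot product with the raw (unnormalized) centroid.
    return max(centroids, key=lambda c: sum(centroids[c].get(t, 0.0) for t in doc))
```

Note how a term that appears in every class gets zero inter-class weight, so only discriminative terms shape the centroid.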
and Liu, Chao
and Kannan, Anitha
and Minka, Tom
and Taylor, Michael
and Wang, Yi-Min
and Faloutsos, Christos Click Chain Model in Web Search.
Given a terabyte click log, can we build an efficient and effective click model? It is commonly believed that web search click logs are a gold mine for the search business, because they reflect users' preferences over the web documents presented by the search engine. Click models provide a principled approach to inferring the user-perceived relevance of web documents, which can be leveraged in numerous applications in search businesses. Due to the huge volume of click data, scalability is a must. We present the click chain model (CCM), which is based on a solid Bayesian framework. It is both scalable and incremental, meeting the computational challenges imposed by voluminous click logs that constantly grow. We conduct an extensive experimental study on a data set containing 8.8 million query sessions obtained in July 2008 from a commercial search engine. CCM consistently outperforms two state-of-the-art competitors on a number of metrics, with over 9.7% better log-likelihood, over 6.2% better click perplexity, and much more robust (up to 30% better) prediction of the first and last clicked positions.
Abdel Hamid, Ossama
and Behzadi, Behshad
and Christoph, Stefan
and Henzinger, Monika Detecting the Origin of Text Segments Efficiently.
In the origin detection problem an algorithm is given a set S of documents, ordered by creation time, and a query document D. It needs to output, for every consecutive sequence of k alphanumeric terms in D, the earliest document in S in which the sequence appeared (if such a document exists). Algorithms for the origin detection problem can, for example, be used to detect the “origin” of text segments in D and thus to detect novel content in D. They can also find the document from which the author of D has copied the most (or show that D is mostly original). We propose novel algorithms for this problem and evaluate them together with a large number of previously published algorithms. Our results show that (1) detecting the origin of text segments efficiently can be done with very high accuracy even when the space used is less than 1% of the size of the documents in S, (2) the precision degrades smoothly with the amount of available space, and (3) various estimation techniques can be used to increase the performance of the algorithms.
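The core bookkeeping behind origin detection can be sketched directly: map every consecutive k-term sequence to the earliest document containing it. The paper's contribution is doing this in under 1% of the corpus size via fingerprinting and estimation; the exact-shingle version below is a toy illustration only.

```python
def build_origin_index(docs_in_time_order, k=3):
    """Toy origin-detection sketch: for every consecutive k-term sequence
    (shingle), remember the earliest document, by creation-time order, that
    contained it. Real implementations fingerprint shingles to save space;
    here we store them exactly for clarity."""
    earliest = {}
    for idx, doc in enumerate(docs_in_time_order):
        terms = doc.split()
        for i in range(len(terms) - k + 1):
            shingle = tuple(terms[i:i + k])
            earliest.setdefault(shingle, idx)   # keep only the first (earliest) doc
    return earliest

def origins(query_doc, earliest, k=3):
    """For each k-shingle of the query document, return the index of its
    earliest source document, or None if the shingle is novel content."""
    terms = query_doc.split()
    return [earliest.get(tuple(terms[i:i + k]))
            for i in range(len(terms) - k + 1)]
```

A run of `None` values in the output marks a novel passage; a run of equal indices marks a segment likely copied from that one source.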
and Zhang, Ya A Dynamic Bayesian Network Click Model for Web Search Ranking.
As with any application of machine learning, web search ranking requires labeled data. The labels usually come in the form of relevance assessments made by editors. Click logs can also provide an important source of implicit feedback and can be used as a cheap proxy for editorial labels. The main difficulty, however, comes from the so-called position bias: URLs appearing in lower positions are less likely to be clicked even if they are relevant. In this paper, we propose a Dynamic Bayesian Network which aims to provide an unbiased estimate of relevance from the click logs. Experiments show that the proposed click model outperforms other existing click models in predicting both click-through rate and relevance.
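A hedged generative sketch of the kind of model the abstract describes: each examined URL is clicked with a probability reflecting perceived relevance, a click satisfies the user with a probability reflecting actual relevance, and an unsatisfied user continues downward with some perseverance. The parameter names (`attract`, `satisfy`, `gamma`) are illustrative; the paper's contribution is inferring these quantities from click logs, not simulating them.

```python
import random

def simulate_dbn_session(urls, attract, satisfy, gamma=0.9, rng=None):
    """Generative sketch of a DBN-style click model: the user scans top-down.
    An examined url is clicked with probability attract[u] (perceived
    relevance); a clicked url satisfies the user with probability satisfy[u]
    (actual relevance); an unsatisfied user moves to the next position with
    perseverance gamma. Returns the per-position click indicators."""
    rng = rng or random.Random(0)
    clicks = []
    for u in urls:
        clicked = rng.random() < attract[u]
        clicks.append(clicked)
        if clicked and rng.random() < satisfy[u]:
            break                      # satisfied: session ends here
        if rng.random() > gamma:
            break                      # user abandons the session
    return clicks

def empirical_ctr(urls, attract, satisfy, n=20000):
    """Monte Carlo CTR per position; exposes the position bias."""
    rng = random.Random(42)
    counts = [0] * len(urls)
    for _ in range(n):
        for pos, c in enumerate(simulate_dbn_session(urls, attract, satisfy, rng=rng)):
            counts[pos] += c
    return [c / n for c in counts]
```

Even with identical relevance at both positions, the simulated CTR drops with rank, which is exactly the bias the model lets one correct for.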
Wook Kim, Jong
and Selçuk Candan, K.
and Tatemura, Junichi Efficient Overlap and Content Reuse Detection in Blogs and Online News Articles.
The use of blogs to track and comment on real world (political, news, entertainment) events is growing. Similarly, as more individuals start relying on the Web as their primary information source and as more traditional media outlets try reaching consumers through alternative venues, the number of news sites on the Web is also continuously increasing. Content reuse, whether in the form of extensive quotations or content borrowing across media outlets, is very common in blog and news entries tracking the same real-world event. Knowledge about which web entries reuse content from which others can be an effective asset when organizing these entries for presentation. On the other hand, this knowledge is not cheap to acquire: considering the size of the space of related web entries, it is essential that the techniques developed for identifying reuse are fast and scalable. Furthermore, the dynamic nature of blog and news entries necessitates incremental processing for reuse detection. In this paper, we develop a novel qSign algorithm that efficiently and effectively analyzes the blogosphere for quotation and reuse identification. Experimental results show that with qSign, processing time gains from 10X to 100X are possible while maintaining reuse detection rates of up to 90%. Furthermore, processing time gains can be pushed by multiple orders of magnitude (from 100X to 1000X) at 70% recall.
and Zhou, Ke
and Xue, Gui-Rong
and Zha, Hongyuan
and Yu, Yong Enhancing Diversity, Coverage and Balance for Summarization through Structure Learning.
Document summarization plays an increasingly important role with the exponential growth of documents on the Web. Many supervised and unsupervised approaches have been proposed to generate summaries from documents. However, these approaches seldom simultaneously consider summary diversity, coverage, and balance, which to a large extent determine the quality of summaries. In this paper, we consider extract-based summarization emphasizing the following three requirements: 1) diversity in summarization, which seeks to reduce redundancy among sentences in the summary; 2) sufficient coverage, which focuses on avoiding the loss of the document's main information when generating the summary; and 3) balance, which demands that different aspects of the document have about the same relative importance in the summary. We formulate the extract-based summarization problem as learning a mapping from a set of sentences of a given document to a subset of the sentences that satisfies the above three requirements. The mapping is learned by incorporating several constraints in a structure learning framework; we explore the graph structure of the output variables and employ structural SVM to solve the resulting optimization problem. Experiments on the DUC2001 data sets demonstrate significant performance improvements in terms of F1 and ROUGE metrics.
and Gurevich, Maxim Estimating the ImpressionRank of Web Pages.
The ImpressionRank of a web page (or, more generally, of a web site) is the number of times users viewed the page while browsing search results. ImpressionRank captures the visibility of pages and sites in search engines and is thus an important measure, which is of interest to web site owners, competitors, market analysts, and end users. All previous approaches to estimating the ImpressionRank of a page rely on privileged access to private data sources, like the search engine's query log. In this paper we present the first external algorithm for estimating the ImpressionRank of a web page. This algorithm relies on access to three public data sources: the search engine, the query suggestion service of the search engine, and the web. In addition, the algorithm is local and uses modest resources. It can therefore be used by almost any party to estimate the ImpressionRank of any page on any search engine. En route to estimating the ImpressionRank of a page, our algorithm solves a novel variant of the keyword extraction problem: it finds the most popular search keywords that drive impressions of a page. Empirical analysis of the algorithm on the Google and Yahoo! search engines indicates that it is accurate and provides interesting insights about sites and search queries.
and Ganti, Venkatesh
and Xin, Dong Exploiting Web Search to Generate Synonyms for Entities.
Tasks recognizing named entities such as products, person names, or locations from documents have recently received significant attention in the literature. Many solutions to these tasks assume the existence of reference entity tables. An important challenge that needs to be addressed in the entity extraction task is ascertaining whether or not a candidate string approximately matches a named entity in a given reference table. Prior approaches have relied on string-based similarity, which only compares a candidate string with the entity it matches. In this paper, we exploit web search engines in order to define new similarity functions. We then develop efficient techniques to facilitate approximate matching in the context of our proposed similarity functions. In an extensive experimental evaluation, we demonstrate the accuracy and efficiency of our techniques.
and Moore, Andrew W. Fast Dynamic Reranking in Large Graphs.
In this paper we consider the problem of re-ranking search results by incorporating user feedback. We present a graph-theoretic measure for discriminating irrelevant results from relevant results using a few labeled examples provided by the user. The key intuition is that nodes relatively closer (in graph topology) to the relevant nodes than to the irrelevant nodes are more likely to be relevant. We present a simple sampling algorithm to evaluate this measure at specific nodes of interest, and an efficient branch and bound algorithm to compute the top k nodes from the entire graph under this measure. On quantifiable prediction tasks the introduced measure outperforms other diffusion-based proximity measures, which take only the positive relevance feedback into account. On the Entity-Relation graph built from the authors and papers of the entire DBLP citation corpus (1.4 million nodes and 2.2 million edges), our branch and bound algorithm takes about 1.5 seconds to retrieve the top 10 nodes w.r.t. this measure with 10 labeled nodes.
and Kossinets, Gueorgi
and Kleinberg, Jon
and Lee, Lillian How Opinions are Received by Online Communities: A Case Study on Amazon.com Helpfulness Votes.
There are many on-line settings in which users publicly express opinions. A number of these offer mechanisms for other users to evaluate these opinions; a canonical example is Amazon.com, where reviews come with annotations like “26 of 32 people found the following review helpful.” Opinion evaluation appears in many off-line settings as well, including market research and political campaigns. Reasoning about the evaluation of an opinion is fundamentally different from reasoning about the opinion itself: rather than asking, “What did Y think of X?”, we are asking, “What did Z think of Y’s opinion of X?” Here we develop a framework for analyzing and modeling opinion evaluation, using a large-scale collection of Amazon book reviews as a dataset. We find that the perceived helpfulness of a review depends not just on its content but also, in subtle ways, on how the expressed evaluation relates to other evaluations of the same product. As part of our approach, we develop novel methods that take advantage of the phenomenon of review “plagiarism” to control for the effects of text in opinion evaluation, and we provide a simple and natural mathematical model consistent with our findings. Our analysis also allows us to distinguish among the predictions of competing theories from sociology and social psychology, and to discover unexpected differences in the collective opinion-evaluation behavior of user populations from different countries.
and Cai, Rui
and Wang, Yida
and Zhu, Jun
and Zhang, Lei
and Ma, Wei-Ying Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums.
Web forums have become an important data resource for many web applications, but extracting structured data from unstructured web forum pages is still a challenging task due to both complex page layout designs and unrestricted user-created posts. In this paper, we study the problem of structured data extraction from various web forum sites. Our target is to find a solution as general as possible for extracting structured data, such as post title, post author, post time, and post content, from any forum site. In contrast to most existing information extraction methods, which only leverage the knowledge inside an individual page, we incorporate both page-level and site-level knowledge and employ Markov logic networks (MLNs) to effectively integrate all useful evidence by learning their importance automatically. Site-level knowledge includes (1) the linkages among different object pages, such as list pages and post pages, and (2) the interrelationships of pages belonging to the same object. The experimental results on 20 forums show a very encouraging information extraction performance and demonstrate the generality of the proposed approach across various forums. We also show that the performance is limited if only page-level knowledge is used, whereas incorporating site-level knowledge significantly improves both precision and recall.
and Rajan, Suju
and Narayanan, Vijay K. Large Scale Multi-Label Classification via MetaLabeler.
The explosion of online content has made the management of such content non-trivial. Web-related tasks such as web page categorization, news filtering, query categorization, and tag recommendation often involve the construction of multi-label categorization systems on a large scale. Existing multi-label classification methods either do not scale or have unsatisfactory performance. In this work, we propose MetaLabeler to automatically determine the relevant set of labels for each instance without intensive human involvement or expensive cross-validation. Extensive experiments conducted on benchmark data show that the MetaLabeler tends to outperform existing methods. Moreover, MetaLabeler scales to millions of multi-labeled instances and can be deployed easily. This enables us to apply the MetaLabeler to a large scale query categorization problem in Yahoo!, yielding a significant improvement in performance.
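The two-stage idea is easy to sketch: score every label with any base classifier, let a separate meta-model predict how many labels the instance deserves, and keep the top-k. The helper below is a hypothetical illustration; in the paper the meta-model is itself trained (as a regression or classification problem) rather than hand-supplied.

```python
def metalabeler_predict(label_scores, predict_k):
    """Sketch of the MetaLabeler idea: rather than thresholding each
    one-vs-rest score independently, a meta-model predicts how many labels
    this instance should receive, and the top-k scoring labels are kept.
    `label_scores` maps label -> base-classifier score; `predict_k` is any
    callable standing in for the trained meta-model."""
    k = max(1, predict_k(label_scores))          # always emit at least one label
    ranked = sorted(label_scores, key=label_scores.get, reverse=True)
    return set(ranked[:k])
```

This sidesteps per-label threshold tuning entirely, which is what makes the scheme cheap at web scale.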
and Fan, Wei
and Peng, Jing
and Verscheure, Olivier
and Ren, Jiangtao Latent Space Domain Transfer between High Dimensional Overlapping Distributions.
Transferring knowledge from one domain to another is challenging for a number of reasons. Since both the conditional and marginal distributions of the training data and test data are non-identical, a model trained in one domain is usually low in accuracy when directly applied to a different domain. For many applications with large feature sets, such as text documents, sequence data, medical data, or image data of different resolutions, two domains usually do not contain exactly the same features, introducing large numbers of “missing values” when considered over the union of features from both domains. In other words, their marginal distributions are at most overlapping. At the same time, these problems are usually high dimensional, with several thousands of features. Thus, the combination of high dimensionality and missing values makes the relationship between the conditional probabilities of the two domains hard to measure and model. To address these challenges, we propose a framework that first brings the marginal distributions of the two domains closer by “filling up” the missing values of disjoint features. Afterwards, it looks for comparable sub-structures in the “latent space” mapped from the expanded feature vector, where both the marginal and conditional distributions are similar. With these sub-structures in latent space, the proposed approach then finds common concepts that are transferable across domains with high probability. During prediction, unlabeled instances are treated as “queries”, the most related labeled instances from the out-domain are retrieved, and the classification is made by weighted voting over the retrieved out-domain examples.
We formally show that importing feature values across domains and latent semantic indexing jointly make the distributions of two related domains easier to measure than in the original feature space, and that the nearest-neighbor method employed to retrieve related out-domain examples has bounded error when predicting in-domain examples. Software and datasets are available for download.
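A rough sketch of the pipeline under stated assumptions: mean-fill the disjoint ("missing") features over the union of both domains, project into a low-dimensional latent space (plain SVD here, standing in for the paper's latent semantic mapping), then classify each in-domain instance by distance-weighted voting over its nearest out-domain neighbours. All dimensions and constants are illustrative.

```python
import numpy as np

def transfer_predict(X_out, y_out, X_in, dim=2, k=3):
    """Latent-space transfer sketch: stack out-domain and in-domain rows
    over the union of features (NaN = feature absent in that domain),
    fill missing values with observed column means so the marginals
    overlap, project via SVD into a latent space, and label each
    in-domain row by distance-weighted kNN voting over out-domain rows."""
    X = np.vstack([X_out, X_in]).astype(float)
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)        # "fill up" missing values
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :dim] * s[:dim]                      # latent-space coordinates
    Z_out, Z_in = Z[:len(X_out)], Z[len(X_out):]
    preds = []
    for z in Z_in:                                # in-domain rows act as "queries"
        d = np.linalg.norm(Z_out - z, axis=1)
        nn = np.argsort(d)[:k]
        votes = {}
        for i in nn:
            votes[y_out[i]] = votes.get(y_out[i], 0.0) + 1.0 / (d[i] + 1e-9)
        preds.append(max(votes, key=votes.get))
    return preds
```

The distance weighting matters: with k larger than the nearest cluster, far-away out-domain neighbours still vote but carry little weight.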
Bennett, Paul N.
and Maxwell Chickering, David
and Mityagin, Anton Learning Consensus Opinion: Mining Data from a Labeling Game.
We consider the problem of identifying the consensus ranking for the results of a query, given preferences among those results from a set of individual users. Once consensus rankings are identified for a set of queries, these rankings can serve for both evaluation and training of retrieval and learning systems. We present a novel approach to collecting the individual user preferences over image-search results: we use a collaborative game in which players are rewarded for agreeing on which image result is best for a query. Our approach is distinct from other labeling games because we are able to elicit directly the preferences of interest with respect to image queries extracted from query logs. As a source of relevance judgments, this data provides a useful complement to click data. Furthermore, the data is free of positional biases and is collected by the game without the risk of frustrating users with non-relevant results; this risk is prevalent in standard mechanisms for debiasing clicks. We describe data collected over 34 days from a deployed version of this game that amounts to about 18 million expressed preferences between pairs. Finally, we present several approaches to modeling this data in order to extract the consensus rankings from the preferences and better sort the search results for targeted queries.
and Liu, Yandong
and Zhou, Ding
and Agichtein, Eugene
and Zha, Hongyuan Learning to Recognize Reliable Users and Content in Social Media with Coupled Mutual Reinforcement.
Community Question Answering (CQA) has emerged as a popular forum for users to pose questions for other users to answer. Over the last few years, CQA portals such as Naver and Yahoo! Answers have exploded in popularity, and now provide a viable alternative to general-purpose Web search. At the same time, the answers to past questions submitted in CQA sites comprise a valuable knowledge repository which could be a gold mine for information retrieval and automatic question answering. Unfortunately, the quality of the submitted questions and answers varies widely, and a large fraction of the content is not usable for answering queries. Approaches for retrieving relevant and high-quality content have been proposed previously, but they require large amounts of manually labeled data, which limits the applicability of such supervised approaches to new sites and domains. In this paper we address this problem by developing a semi-supervised coupled mutual reinforcement framework for simultaneously calculating content quality and user reputation, which requires relatively few labeled examples to initialize the training process. Results of a large scale evaluation demonstrate that our methods are more effective than previous approaches for finding high-quality answers, questions, and users. More importantly, our quality estimation significantly improves the accuracy of search over CQA archives compared with state-of-the-art methods.
and Herbrich, Ralf
and Graepel, Thore Matchbox: Large Scale Online Bayesian Recommendations.
We present a probabilistic model for generating personalised recommendations of items to users of a web service. The Matchbox system makes use of content information in the form of user and item meta data in combination with collaborative filtering information from previous user behavior in order to predict the value of an item for a user. Users and items are represented by feature vectors which are mapped into a low-dimensional ‘trait space’ in which similarity is measured in terms of inner products. The model can be trained from different types of feedback in order to learn user-item preferences. Here we present three alternatives: direct observation of an absolute rating each user gives to some items, observation of a binary preference (like/don't like), and observation of a set of ordinal ratings on a user-specific scale. Efficient inference is achieved by approximate message passing involving a combination of Expectation Propagation (EP) and Variational Message Passing. We also include a dynamics model which allows an item's popularity, a user's taste or a user's personal rating scale to drift over time. By using Assumed-Density Filtering (ADF) for training, the model requires only a single pass through the training data. This is an on-line learning algorithm capable of incrementally taking account of new data so the system can immediately reflect the latest user preferences. We evaluate the performance of the algorithm on the MovieLens and Netflix data sets consisting of approximately 1,000,000 and 100,000,000 ratings respectively. This demonstrates that training the model using the on-line ADF approach yields state-of-the-art performance with the option of improving performance further if computational resources are available by performing multiple EP passes over the training data.
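The trait-space representation at Matchbox's core is compact enough to sketch: linear maps send user and item feature vectors into a shared low-dimensional space, and predicted affinity is their inner product. The maps below are given rather than learned; the paper's actual contribution is the Bayesian machinery (EP, VMP, ADF) that infers them from feedback.

```python
import numpy as np

def matchbox_score(user_feats, item_feats, U, V):
    """Sketch of the Matchbox bilinear representation: U and V map user and
    item feature vectors into a shared low-dimensional 'trait space', and
    the predicted value of the item for the user is the inner product of
    the two trait vectors. U and V are assumed already learned."""
    u_trait = U @ user_feats      # user's position in trait space
    v_trait = V @ item_feats      # item's position in trait space
    return float(u_trait @ v_trait)
```

With one-hot user and item features, the rows of U and V are exactly per-user and per-item trait vectors, so classic matrix-factorization recommenders fall out as a special case.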
and Zhai, ChengXiang
and Sundaresan, Neel Rated Aspect Summarization of Short Comments.
Web 2.0 technologies have enabled more and more people to freely comment on different kinds of entities (e.g. sellers, products, services). The large scale of this information poses the need and the challenge of automatic summarization. In many cases, each of the user-generated short comments comes with an overall rating. In this paper, we study the problem of generating a “rated aspect summary” of short comments: a decomposed view of the overall ratings for the major aspects, so that a user can gain different perspectives on the target entity. We formally define the problem and decompose the solution into three steps. We demonstrate the effectiveness of our methods using eBay sellers' feedback comments. We also quantitatively evaluate each step of our methods and study how well humans agree on such a summarization task. The proposed methods are quite general and can be used to generate rated aspect summaries automatically given any collection of short comments, each associated with an overall rating.
and Kenthapadi, Krishnaram
and Mishra, Nina
and Ntoulas, Alexandros Releasing Search Queries and Clicks Privately.
The question of how to publish an anonymized search log was brought to the forefront by a well-intentioned but privacy-unaware AOL search log release. Since then a series of ad-hoc techniques have been proposed in the literature, though none are known to be provably private. In this paper, we take a major step towards a solution: we show how queries, clicks, and their associated perturbed counts can be published in a manner that rigorously preserves privacy. Our algorithm is decidedly simple to state, but non-trivial to analyze. On the opposite side of privacy is the question of whether the data we can safely publish is of any use. Our findings offer a glimmer of hope: we demonstrate that a non-negligible fraction of queries and clicks can indeed be safely published via a collection of experiments on a real search log. In addition, we select an application, keyword generation, and show that the keyword suggestions generated from the perturbed data resemble those generated from the original data.
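In the spirit of the approach described (publishing perturbed counts only when it is safe to do so), here is a toy sketch: add Laplace noise to each query's count and release only items whose noisy count clears a threshold, so rare and therefore potentially identifying queries are suppressed. The noise scale and threshold are illustrative, not the paper's calibration or privacy analysis.

```python
import math
import random

def private_release(counts, epsilon=1.0, threshold=50, rng=None):
    """Toy noisy-threshold release: perturb each count with Laplace noise of
    scale 1/epsilon, publish only items whose noisy count exceeds the
    threshold. Rare (identifying) queries almost never survive; popular
    queries survive with counts close to the truth. Illustrative only."""
    rng = rng or random.Random(0)

    def laplace(scale):
        # Inverse-CDF sampling of a Laplace(0, scale) variate.
        u = rng.random() - 0.5
        return -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))

    released = {}
    for query, c in counts.items():
        noisy = c + laplace(1.0 / epsilon)
        if noisy > threshold:
            released[query] = round(noisy)
    return released
```

The threshold is what buys suppression of singleton queries; the noise is what prevents the published counts themselves from leaking exact frequencies.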
Ali Bayir, Murat
and Hakki Toroslu, Ismail
and Cosar, Ahmet
and Fidan, Guven Smart Miner: A New Framework for Mining Large Scale Web Usage Data.
In this paper, we propose a novel framework called SmartMiner for the web usage mining problem, which uses link information to produce accurate user sessions and frequent navigation patterns. Unlike the simple session concepts in time- and navigation-based approaches, where sessions are sequences of web pages requested from the server or viewed in the browser, SmartMiner sessions are sets of paths traversed in the web graph that correspond to users' navigations among web pages. We have modeled session construction as a new graph problem and utilized a new algorithm, Smart-SRA, to solve this problem efficiently. For the pattern discovery phase, we have developed an efficient version of the Apriori-All technique which uses the structure of the web graph to increase performance. From the experiments that we have performed on both real and simulated data, we have observed that SmartMiner produces at least 30% more accurate web usage patterns than other approaches, including previous session construction methods. We have also studied the effect of having the referrer information in the web server logs, showing that different versions of Smart-SRA produce similar results. Another contribution is a distributed version of the SmartMiner framework implemented with the Map/Reduce paradigm. We conclude that our scalable framework can efficiently process terabytes of web server logs belonging to multiple web sites.
and Chen, Bee-Chung
and Elango, Pradheep Spatio-Temporal Models for Estimating Click-through Rate.
We propose novel spatio-temporal models to estimate click-through rates in the context of content recommendation. We track article CTR at a fixed location over time through a dynamic Gamma-Poisson model and combine information from correlated locations through dynamic linear regressions, significantly improving on per-location models. Our models adjust for user fatigue through an exponential tilt to the first-view CTR (the probability of a click on the first article exposure) that is based only on user-specific repeat-exposure features. We illustrate our approach on data obtained from a module (Today Module) published regularly on the Yahoo! Front Page and demonstrate significant improvement over commonly used baseline methods. Large scale simulation experiments to study the performance of our models under different scenarios provide encouraging results. Throughout, all modeling assumptions are validated via rigorous exploratory data analysis.
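The dynamic Gamma-Poisson tracker at the heart of the per-location model is simple to sketch: keep Gamma pseudo-counts of clicks and views, discount them each interval so old evidence fades, and read off the posterior-mean CTR. The decay constant below is illustrative, and this sketch omits the paper's cross-location regressions and fatigue adjustment.

```python
def gamma_poisson_update(alpha, beta, clicks, views, decay=0.95):
    """One interval of a dynamic Gamma-Poisson CTR tracker: CTR has a
    Gamma(alpha, beta) state with posterior mean alpha/beta. The old state
    is discounted by `decay` before folding in the new interval's clicks
    and views, so the estimate adapts as an article's CTR drifts."""
    alpha = decay * alpha + clicks
    beta = decay * beta + views
    return alpha, beta

def estimate_ctr(alpha, beta):
    """Posterior-mean CTR under the Gamma state."""
    return alpha / beta
```

With decay < 1 the effective sample size is capped at roughly views_per_interval / (1 - decay), which is what lets the estimate follow a drifting CTR instead of freezing on stale history.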
and Nie, Zaiqing
and Liu, Xiaojiang
and Zhang, Bo
and Wen, Ji-Rong StatSnowball: a Statistical Approach to Extracting Entity Relationships.
Traditional relation extraction methods require pre-specified relations and relation-specific human-tagged examples. Bootstrapping systems significantly reduce the number of training examples, but they usually apply heuristic-based methods to combine a set of strict hard rules, which limits the ability to generalize and thus yields low recall. Furthermore, existing bootstrapping methods do not perform open information extraction (Open IE), which can identify various types of relations without requiring pre-specifications. In this paper, we propose a statistical extraction framework called Statistical Snowball (StatSnowball), which is a bootstrapping system that can perform both traditional relation extraction and Open IE. StatSnowball uses discriminative Markov logic networks (MLNs) and softens hard rules by learning their weights in a maximum likelihood estimation sense. The MLN is a general model and can be configured to perform different levels of relation extraction. In StatSnowball, pattern selection is performed by solving an l1-norm penalized maximum likelihood estimation, which enjoys well-founded theory and efficient solvers. We extensively evaluate the performance of StatSnowball in different configurations on both a small but fully labeled data set and large-scale Web data. Empirical results show that StatSnowball can achieve significantly higher recall without sacrificing high precision during iterations with a small number of seeds, and that the joint inference of the MLN can improve performance. Finally, StatSnowball is efficient, and we have developed a working entity relation search engine called Renlifang based on it.
and Jiang, Daxin
and Pei, Jian
and Chen, Enhong
and Li, Hang Towards Context-Aware Search by Learning a Very Large Variable Length Hidden Markov Model from Search Logs.
Capturing the context of a user’s query from the previous queries and clicks in the same session may help understand the user’s information need. A context-aware approach to document re-ranking, query suggestion, and URL recommendation may improve users’ search experience substantially. In this paper, we propose a general approach to context-aware search. To capture contexts of queries, we learn a variable length Hidden Markov Model (vlHMM) from search sessions extracted from log data. Although the mathematical model is intuitive, how to learn a large vlHMM with millions of states from hundreds of millions of search sessions poses a grand challenge. We develop a strategy for parameter initialization in vlHMM learning which can greatly reduce the number of parameters to be estimated in practice. We also devise a method for distributed vlHMM learning under the map-reduce model. We test our approach on a real data set consisting of 1.8 billion queries, 2.6 billion clicks, and 840 million search sessions, and evaluate the effectiveness of the vlHMM learned from the real data on three search applications: document re-ranking, query suggestion, and URL recommendation. The experimental results show that our approach is both effective and efficient.
About this site
This website has been set up for WWW2009 by Christopher Gutteridge of the University of Southampton, using our EPrints software.
We (the Southampton EPrints Project) intend to preserve the files and HTML pages of this site for many years; however, we will turn it into flat files for long-term preservation. This means that at some point in the months after the conference the search, metadata export, JSON interface, OAI, etc. will be disabled as we "fossilize" the site. Please plan accordingly. Feel free to ask nicely for us to keep the dynamic site online longer if there's a really good (or cool) use for it... [this has now happened; this site is now static]