Items from Search track
Number of items: 15.
Agrawal, Sanjay and Chakrabarti, Kaushik and Chaudhuri, Surajit and Ganti, Venkatesh and König, Arnd Christian and Xin, Dong. Exploiting Web Search Engines to Search Structured Databases.
Web search engines often federate many user queries to relevant structured databases. For example, a product-related query might be federated to a product database containing product descriptions and specifications. The relevant structured data items are then returned to the user along with web search results. However, each structured database is searched in isolation. Hence, the search often produces empty or incomplete results, as the database may not contain the required information to answer the query. In this paper, we propose a novel integrated search architecture. We establish and exploit the relationships between web search results and the items in structured databases to identify the relevant structured data items for a much wider range of queries. Our architecture leverages existing search engine components to implement this functionality at very low overhead. We demonstrate the quality and efficiency of our techniques through an extensive experimental study.

Pandey, Sandeep and Broder, Andrei and Chierichetti, Flavio and Josifovski, Vanja and Kumar, Ravi and Vassilvitskii, Sergei. Nearest-Neighbor Caching for Content-Match Applications.
Motivated by contextual advertising systems and other web applications involving efficiency–accuracy tradeoffs, we study similarity caching. Here, a cache hit is said to occur if the requested item is similar but not necessarily equal to some cached item. We study two objectives that dictate the efficiency–accuracy tradeoff and provide our caching policies for these objectives. By conducting extensive experiments on real data we show similarity caching can significantly improve the efficiency of contextual advertising systems, with minimal impact on accuracy. Inspired by the above, we propose a simple generative model that embodies two fundamental characteristics of page requests arriving at advertising systems, namely, long-range dependencies and similarities. We provide theoretical bounds on the gains of similarity caching in this model and demonstrate these gains empirically by fitting the actual data to the model.
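A minimal sketch of the core idea, assuming sparse page feature vectors and cosine similarity; the brute-force scan and the LRU backing store are illustrative stand-ins for the paper's actual policies and for the approximate nearest-neighbor index a production system would use:

```python
from collections import OrderedDict

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = sum(w * w for w in u.values()) ** 0.5
    nv = sum(w * w for w in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

class SimilarityCache:
    """LRU cache that reports a hit when a stored page vector is 'close
    enough' to the requested one. The threshold is the accuracy knob:
    lowering it raises the hit rate at the cost of serving ads computed
    for a merely similar page."""

    def __init__(self, capacity, threshold=0.9):
        self.capacity = capacity
        self.threshold = threshold
        self.store = OrderedDict()  # page_id -> (vector, cached_result)

    def lookup(self, page_id, vector):
        # Brute-force nearest-neighbor scan over cached pages; an exact
        # page_id match counts as similarity 1.0.
        best_id, best_sim = None, 0.0
        for pid, (vec, _) in self.store.items():
            sim = 1.0 if pid == page_id else cosine(vector, vec)
            if sim > best_sim:
                best_id, best_sim = pid, sim
        if best_id is not None and best_sim >= self.threshold:
            self.store.move_to_end(best_id)   # refresh LRU position
            return self.store[best_id][1]     # serve the cached result
        return None                           # miss: caller recomputes

    def insert(self, page_id, vector, result):
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)    # evict least-recently used
        self.store[page_id] = (vector, result)
```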
Broder, Andrei and Ciccolo, Peter and Gabrilovich, Evgeniy and Josifovski, Vanja and Metzler, Donald and Riedel, Lance and Yuan, Jeffrey. Online Expansion of Rare Queries for Sponsored Search.

Sponsored search systems are tasked with matching queries to relevant advertisements. The current state-of-the-art matching algorithms expand the user’s query using a variety of external resources, such as Web search results. While these expansion-based algorithms are highly effective, they are largely inefficient and cannot be applied in real time. In practice, such algorithms are applied offline to popular queries, with the results of the expensive operations cached for fast access at query time. In this paper, we describe an efficient and effective approach for matching ads against rare queries that were not processed offline. The approach builds an expanded query representation by leveraging offline processing done for related popular queries. Our experimental results show that our approach significantly improves the effectiveness of advertising on rare queries with only a negligible increase in computational cost.
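The following sketch illustrates the general flavor of the approach under simplifying assumptions: word overlap stands in for query relatedness, and `offline_expansions` is a hypothetical dictionary of precomputed expansion vectors for popular queries. The paper's actual online mechanism is more sophisticated.

```python
def expand_rare_query(query, offline_expansions, k=3):
    """Build an expanded representation for an unseen rare query by
    combining the cached expansions of the k most related popular
    queries. offline_expansions: {popular_query: {term: weight}}."""
    q_terms = set(query.lower().split())
    scored = []
    for pq, vec in offline_expansions.items():
        pq_terms = set(pq.split())
        if q_terms & pq_terms:
            # Jaccard overlap as a cheap relatedness score
            score = len(q_terms & pq_terms) / len(q_terms | pq_terms)
            scored.append((score, vec))
    scored.sort(key=lambda x: x[0], reverse=True)
    expanded = {t: 1.0 for t in q_terms}      # always keep original terms
    for score, vec in scored[:k]:
        for term, w in vec.items():           # fold in related expansions,
            expanded[term] = expanded.get(term, 0.0) + score * w
    return expanded
```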
Wang, Xuerui and Broder, Andrei and Fontoura, Marcus and Josifovski, Vanja. A Search-based Method for Forecasting Ad Impression in Contextual Advertising.

Contextual advertising (also called content match) refers to the placement of small textual ads within the content of a generic web page. It has become a significant source of revenue for publishers ranging from individual bloggers to major newspapers. At the same time it is an important way for advertisers to reach their intended audience. This reach depends on the total number of exposures of the ad (impressions) and its click-through rate (CTR), which can be viewed as the probability of an end-user clicking on the ad when shown. These two orthogonal, critical factors are both difficult to estimate, and even individually each is very informative and useful in planning and budgeting advertising campaigns. In this paper, we address the problem of forecasting the number of impressions for new or changed ads in the system. Producing such forecasts, even within large margins of error, is quite challenging: 1) ad selection in contextual advertising is a complicated process based on tens or even hundreds of page and ad features; 2) the publishers’ content and traffic vary over time; and 3) the scale of the problem is daunting: over the course of a week it involves billions of impressions, hundreds of millions of distinct pages, hundreds of millions of ads, and varying bids of other competing advertisers. We tackle these complexities by simulating the presence of a given ad with its associated bid over weeks of historical data. We obtain an impression estimate by counting how many times the ad would have been displayed if it were in the system over that period of time. We estimate this count by an efficient two-level search algorithm over the distinct pages in the data set. Experimental results show that our approach can accurately forecast the expected number of impressions of contextual ads in real time. We also show how this method can be used in tools for bid selection and ad evaluation.
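A toy version of the counting idea, assuming a simple relevance-times-bid ranking and a flat record of historical page views; the paper's two-level search over distinct pages is a far more efficient way to compute this kind of count at scale, and `CandidateAd` is an illustrative stand-in:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class CandidateAd:
    bid: float
    relevance: Callable[[Dict[str, float]], float]  # page features -> score

def forecast_impressions(ad: CandidateAd,
                         page_views: List[Tuple[Dict[str, float],
                                                List[Tuple[float, float]]]],
                         slots: int = 3) -> int:
    """page_views holds one record per historical page view: the page's
    features plus the (relevance_score, bid) pairs of the ads actually
    shown. The candidate earns a simulated impression whenever its
    rank-score (relevance x bid) would have made the top `slots`."""
    impressions = 0
    for features, shown in page_views:
        cand = ad.relevance(features) * ad.bid
        bar = sorted((s * b for s, b in shown), reverse=True)
        threshold = bar[slots - 1] if len(bar) >= slots else 0.0
        if cand > threshold:
            impressions += 1
    return impressions
```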
Chakrabarti, Deepayan and Kumar, Ravi and Punera, Kunal. Quicklink Selection for Navigational Query Results.

Quicklinks for a website are navigational shortcuts displayed below the website homepage on a search results page that let users jump directly to selected points inside the website. Since the real estate on a search results page is constrained and valuable, picking the best set of quicklinks to maximize the benefit for a majority of users is an important problem for search engines. Using user browsing trails obtained from browser toolbars and a simple probabilistic model, we formulate the quicklink selection problem as a combinatorial optimization problem. We first demonstrate the hardness of the objective, and then propose an algorithm that is provably within a factor of (1 − 1/e) of the optimal. We also propose a different algorithm that works on trees and can find the optimal solution; unlike the previous algorithm, this algorithm can incorporate natural constraints on the set of chosen quicklinks. The efficacy of our methods is demonstrated via empirical results on both a manually labeled set of websites and a set of webpages for which quicklink click-through rates were obtained from a real-world search engine.
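The (1 − 1/e) guarantee is the classic bound for greedy maximization of a monotone submodular objective. A sketch under a simplified weighted max-coverage reading of the problem (each candidate quicklink covers a set of weighted browsing trails; the paper's objective comes from a probabilistic model of trails rather than plain coverage):

```python
def greedy_quicklinks(candidates, trails, k):
    """candidates: {url: set of trail ids passing through url};
    trails: {trail_id: weight}, e.g., observed frequency.
    Greedily pick k quicklinks maximizing covered trail weight. The
    objective is monotone submodular, so greedy is within (1 - 1/e)
    of the optimal selection."""
    chosen, covered = [], set()
    for _ in range(k):
        best_url, best_gain = None, 0.0
        for url, trail_set in candidates.items():
            if url in chosen:
                continue
            gain = sum(trails[t] for t in trail_set - covered)
            if gain > best_gain:
                best_url, best_gain = url, gain
        if best_url is None:    # no remaining candidate adds coverage
            break
        chosen.append(best_url)
        covered |= candidates[best_url]
    return chosen
```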
Chen, Jay and Subramanian, Lakshminarayanan and Li, Jinyang. RuralCafe: Web Search in the Rural Developing World.
The majority of people in rural developing regions do not have access to the World Wide Web. Traditional network connectivity technologies have proven to be prohibitively expensive in these areas. The emergence of new long-range wireless technologies provides hope for connecting these rural regions to the Internet. However, the network connectivity provided by these new solutions is by nature intermittent, due to high network usage rates, frequent power cuts, and the use of delay-tolerant links. Typical applications, especially interactive applications like web search, do not tolerate intermittent connectivity. In this paper, we present the design and implementation of RuralCafe, a system intended to support efficient web search over intermittent networks. RuralCafe enables users to perform web search asynchronously and find what they are looking for in one round of intermittency, as opposed to multiple rounds of search/downloads. RuralCafe does this by providing an expanded search query interface which allows a user to specify additional query terms to maximize the utility of the results returned by a search query. Given knowledge of the limited available network resources, RuralCafe performs optimizations to prefetch pages to best satisfy a search query based on a user’s search preferences. In addition, RuralCafe does not require modifications to the web browser, and can provide single-round search results tailored to various types of networks and economic constraints. We have implemented and evaluated the effectiveness of RuralCafe using queries from logs made to a large search engine, queries made by users in an intermittent setting, and live queries from a small testbed deployment. We have also deployed a prototype of RuralCafe in Kerala, India.

Hu, Jian and Wang, Gang and Lochovsky, Fred and Sun, Jian-tao and Chen, Zheng. Understanding User's Query Intent with Wikipedia.
Understanding the intent behind a user’s query can help a search engine automatically route the query to corresponding vertical search engines to obtain particularly relevant content, thus greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) intent representation, (2) domain coverage, and (3) semantic interpretation. Current approaches to predicting the user’s intent mainly utilize machine learning techniques. However, it is difficult, and often requires substantial human effort, to meet all these challenges with statistical machine learning approaches. In this paper, we propose a general methodology for the problem of query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge bases. The Wikipedia concepts are used as the intent representation space; thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified by mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method can achieve much better coverage when classifying queries in an intent domain even when the number of seed intent examples is very small. Moreover, the method is very general and can be easily applied to various intent domains. We demonstrate the effectiveness of this method in three different applications, i.e., travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform quantitative evaluations against two baseline methods, and the experimental results show that our method significantly outperforms other approaches in each intent domain.
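A schematic of the mapping step, assuming two hypothetical precomputed structures standing in for the representations the paper builds from Wikipedia: a term-to-concept index and per-domain concept weights.

```python
def classify_intent(query, concept_index, domain_concepts):
    """concept_index: {term: {concept: weight}} linking query words to
    Wikipedia articles/categories; domain_concepts: {domain: {concept:
    weight}} grown from a few seed queries. The query is mapped into
    concept space, then scored against each intent domain."""
    q_vec = {}
    for term in query.lower().split():
        for concept, w in concept_index.get(term, {}).items():
            q_vec[concept] = q_vec.get(concept, 0.0) + w
    scores = {
        domain: sum(w * c_vec.get(c, 0.0) for c, w in q_vec.items())
        for domain, c_vec in domain_concepts.items()
    }
    # return the best-scoring domain, or None if nothing matched
    return max(scores, key=scores.get) if any(scores.values()) else None
```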
Chierichetti, Flavio and Kumar, Ravi and Raghavan, Prabhakar. Compressed Web Indexes.

Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how compactly can we build such indexes while still allowing fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf’s law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf’s law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.
Diemert, Eustache and Vandelle, Gilles. Unsupervised Query Categorization using Automatically-Built Concept Graphs.
Automatic categorization of user queries is an important component of general-purpose (Web) search engines, particularly for triggering rich, query-specific content and sponsored links. We propose an unsupervised learning scheme that dramatically reduces the cost of setting up and maintaining such a categorizer, while retaining good categorization power. The model is stored as a graph of concepts where graph edges represent the cross-references between the concepts. Concepts and relations are extracted from query logs by an offline Web mining process, which uses a search engine as a powerful summarizer for building a concept graph. Empirical evaluation indicates that the system compares favorably on publicly available data sets (such as KDD Cup 2005) as well as on portions of the current query stream of Yahoo! Search, where it is already changing the experience of millions of Web search users.
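One plausible reading of categorization over such a concept graph is spreading activation from the concepts matched in the query; the sketch below illustrates that idea, not the paper's exact procedure (the graph, category labels, hop count, and decay are all assumed inputs):

```python
def categorize(query_concepts, graph, categories, hops=2, decay=0.5):
    """graph: {concept: {neighbor: edge_weight}} mined offline;
    categories: {concept: category}. Activation starts at the concepts
    matched in the query, spreads along cross-reference edges with
    per-hop decay, and the category accumulating the most activation
    wins."""
    activation = {c: 1.0 for c in query_concepts}
    frontier = dict(activation)
    for _ in range(hops):
        nxt = {}
        for c, a in frontier.items():
            for nb, w in graph.get(c, {}).items():
                nxt[nb] = nxt.get(nb, 0.0) + a * w * decay
        for c, a in nxt.items():
            activation[c] = activation.get(c, 0.0) + a
        frontier = nxt
    votes = {}
    for c, a in activation.items():
        cat = categories.get(c)
        if cat is not None:
            votes[cat] = votes.get(cat, 0.0) + a
    return max(votes, key=votes.get) if votes else None
```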
Yan, Hao and Ding, Shuai and Suel, Torsten. Inverted Index Compression and Query Processing with Optimized Document Ordering.

Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies [7, 23, 25, 6, 24] first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performance. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages.
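A compact illustration of why docID reordering helps, using plain gap encoding plus variable-byte coding, one of the standard baseline techniques (the paper studies and optimizes considerably more advanced codes):

```python
def varbyte_encode(numbers):
    """Variable-byte code: 7 payload bits per byte, high bit set on the
    last byte of each number. Small numbers fit in a single byte."""
    out = bytearray()
    for n in numbers:
        chunks = []
        while True:
            chunks.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        for c in reversed(chunks[1:]):   # most-significant chunks first
            out.append(c)
        out.append(chunks[0] | 0x80)     # flag the final byte
    return bytes(out)

def compress_postings(doc_ids):
    """Gap-encode a sorted posting list, then varbyte the gaps.
    Renumbering docIDs so similar documents sit close together shrinks
    the gaps within each list, which is where the compression win
    comes from."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return varbyte_encode(gaps)

# Clustered IDs compress better than scattered ones:
# len(compress_postings([1000, 1001, 1003, 1007]))   -> 5 bytes
# len(compress_postings([1000, 9000, 23000, 61000])) -> 9 bytes
```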
Ding, Shuai and He, Jinru and Yan, Hao and Suel, Torsten. Using Graphics Processors for High Performance IR Query Processing.

Web search engines are facing formidable performance challenges due to data sizes and query loads. The major engines have to process tens of thousands of queries per second over tens of billions of documents. To deal with this heavy workload, such engines employ massively parallel systems consisting of thousands of machines. The significant cost of operating these systems has motivated a lot of recent research into more efficient query processing mechanisms. We investigate a new way to build such high performance IR systems using graphics processing units (GPUs). GPUs were originally designed to accelerate computer graphics applications through massive on-chip parallelism. Recently a number of researchers have studied how to use GPUs for other problem domains such as databases and scientific computing [9, 8, 12]. Our contribution here is to design a basic system architecture for GPU-based high-performance IR, to develop suitable algorithms for subtasks such as inverted list compression, list intersection, and top-k scoring, and to show how to achieve highly efficient query processing on GPU-based systems. Our experimental results for a prototype GPU-based system on 25.2 million web pages show promising gains in query throughput.
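As a reference point for one of the named subtasks, here is a sequential list-intersection kernel; on a GPU the same work is spread across many threads, and the paper develops GPU-specific algorithms rather than this CPU version:

```python
from bisect import bisect_left

def intersect(a, b):
    """Intersect two sorted docID lists by binary-searching the longer
    list for each element of the shorter one, advancing the search
    window monotonically. This is the core subtask that a GPU engine
    parallelizes across thousands of threads."""
    if len(a) > len(b):
        a, b = b, a          # iterate over the shorter list
    out, lo = [], 0
    for x in a:
        lo = bisect_left(b, x, lo)
        if lo == len(b):
            break            # exhausted the longer list
        if b[lo] == x:
            out.append(x)
    return out
```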
Ji, Shengyue and Li, Guoliang and Li, Chen and Feng, Jianhua. Efficient Interactive Fuzzy Keyword Search.

Traditional information systems return answers after a user submits a complete query. Users often feel “left in the dark” when they have limited knowledge about the underlying data, and have to use a try-and-see approach for finding information. A recent trend of supporting autocomplete in these systems is a first step towards solving this problem. In this paper, we study a new information-access paradigm, called “interactive, fuzzy search,” in which the system searches the underlying data “on the fly” as the user types in query keywords. It extends autocomplete interfaces by (1) allowing keywords to appear in multiple attributes (in an arbitrary order) of the underlying data; and (2) finding relevant records that have keywords matching query keywords approximately. This framework allows users to explore data as they type, even in the presence of minor errors. We study research challenges in this framework for large amounts of data. Since each keystroke of the user could invoke a query on the backend, we need efficient algorithms to process each query within milliseconds. We develop various incremental-search algorithms using previously computed and cached results in order to achieve an interactive speed. We have deployed several real prototypes using these techniques. One of them supports interactive search on the UC Irvine people directory; it has been used regularly and is well received by users due to its friendly interface and high efficiency.
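A self-contained sketch of per-keystroke incremental matching with edit-distance tolerance, caching one dynamic-programming row per dictionary word so no work from earlier keystrokes is repeated; the paper's trie-based incremental algorithms achieve the same effect far more efficiently at scale:

```python
class FuzzyTypeahead:
    """Match the user's growing keystroke prefix against a dictionary,
    tolerating up to max_errors edits against some prefix of each word."""

    def __init__(self, words, max_errors=1):
        self.max_errors = max_errors
        # rows[w][j] = edit distance between the query so far and w[:j]
        self.rows = {w: list(range(len(w) + 1)) for w in words}
        self.query = ""

    def keystroke(self, ch):
        """Extend every cached DP row by exactly one column for the new
        character, then report words with a prefix within max_errors."""
        self.query += ch
        i = len(self.query)
        for w, prev in self.rows.items():
            cur = [i]  # distance from query[:i] to the empty prefix
            for j in range(1, len(w) + 1):
                cost = 0 if w[j - 1] == ch else 1
                cur.append(min(prev[j] + 1,          # insertion
                               cur[j - 1] + 1,       # deletion
                               prev[j - 1] + cost))  # substitution
            self.rows[w] = cur
        # min over the row = best distance to any prefix of w; note that
        # very short queries match almost everything, so a real system
        # would also rank the candidates.
        return [w for w, row in self.rows.items()
                if min(row) <= self.max_errors]
```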
Gan, Qingqing and Suel, Torsten. Improved Techniques for Result Caching in Web Search Engines.
Query processing is a major cost factor in operating large web search engines. In this paper, we study query result caching, one of the main techniques used to optimize query processing performance. Our first contribution is a study of result caching as a weighted caching problem. Most previous work has focused on optimizing cache hit ratios, but given that processing costs of queries can vary significantly, we argue that total cost savings also need to be considered. We describe and evaluate several algorithms for weighted result caching, and study the impact of Zipf-based query distributions on result caching. Our second and main contribution is a new set of feature-based cache eviction policies that achieve significant improvements over all previous methods, substantially narrowing the existing performance gap to the theoretically optimal (clairvoyant) method. Finally, using the same approach, we also obtain performance gains for the related problem of inverted list caching.
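A sketch of the weighted-caching view using the classic Landlord eviction rule (cost-aware "rent" charging) for unit-size entries; the paper's feature-based eviction policies go well beyond this baseline:

```python
class CostAwareCache:
    """Weighted result cache: cheap-to-recompute results are evicted
    before expensive ones, optimizing total processing cost saved
    rather than raw hit ratio."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # query -> [result, cost, credit]

    def get(self, query):
        e = self.entries.get(query)
        if e is None:
            return None
        e[2] = e[1]        # on a hit, restore credit to the full cost
        return e[0]

    def put(self, query, result, cost):
        if query in self.entries:
            self.entries[query] = [result, cost, cost]
            return
        if len(self.entries) >= self.capacity:
            # Landlord step: charge every entry rent equal to the
            # minimum remaining credit, then evict a zero-credit entry.
            rent = min(e[2] for e in self.entries.values())
            for e in self.entries.values():
                e[2] -= rent
            victim = min(self.entries, key=lambda q: self.entries[q][2])
            del self.entries[victim]
        self.entries[query] = [result, cost, cost]
```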
Gollapudi, Sreenivas and Sharma, Aneesh. An Axiomatic Approach for Result Diversification.
Understanding user intent is key to designing an effective ranking system in a search engine. In the absence of any explicit knowledge of user intent, search engines want to diversify results to improve user satisfaction. In such a setting, the probability-ranking-principle-based approach of presenting the most relevant results on top can be suboptimal, and hence the search engine would like to trade off relevance for diversity in the results. In analogy to prior work on ranking and clustering systems, we use the axiomatic approach to characterize and design diversification systems. We develop a set of natural axioms that a diversification system is expected to satisfy, and show that no diversification function can satisfy all the axioms simultaneously. We illustrate the use of the axiomatic framework by providing three example diversification objectives that satisfy different subsets of the axioms. We also uncover a rich link to the facility dispersion problem that results in algorithms for a number of diversification objectives. Finally, we propose an evaluation methodology to characterize the objectives and the underlying axioms. We conduct a large-scale evaluation of our objectives based on two data sets: a data set derived from the Wikipedia disambiguation pages and a product database.
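To make the relevance-versus-diversity tradeoff concrete, here is a generic greedy diversifier over an MMR/dispersion-style objective; it is one illustrative member of the family of objectives the paper characterizes axiomatically, not the paper's specific functions:

```python
def diversify(candidates, relevance, distance, k, lam=0.5):
    """Greedily build a k-result list, each time adding the document
    that maximizes lam * relevance + (1 - lam) * (min distance to the
    documents already picked). candidates: iterable of doc ids;
    relevance: {doc: score}; distance: symmetric function on docs."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def gain(d):
            # distance term rewards dissimilarity to the current picks
            div = min((distance(d, s) for s in selected), default=1.0)
            return lam * relevance[d] + (1 - lam) * div
        best = max(pool, key=gain)
        selected.append(best)
        pool.remove(best)
    return selected
```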
Yi, Xing and Raghavan, Hema and Leggetter, Chris. Discovering Users' Specific Geo Intention in Web Search.
Discovering users’ specific and implicit geographic intention in web search can greatly help satisfy users’ information needs. We build a geo intent analysis system that uses minimal supervision to learn a model from large amounts of web-search logs for this discovery. We build a city language model, which is a probabilistic representation of the language surrounding the mention of a city in web queries. We use several features derived from these language models to: (1) identify users’ implicit geo intent and pinpoint the city corresponding to this intent; (2) determine whether the geo intent is localized around the users’ current geographic location; (3) predict cities for queries that mention an entity located in a specific place. Experimental results demonstrate the effectiveness of using features derived from the city language model. We find that (1) the system has over 90% precision and more than 74% accuracy for the task of detecting users’ implicit city-level geo intent; (2) the system achieves more than 96% accuracy in determining whether implicit geo queries are local geo queries, neighbor-region geo queries, or none of these; (3) the city language model can effectively retrieve cities in location-specific queries with high precision (88%) and recall (74%); human evaluation shows that the language model predicts city labels for location-specific queries with high accuracy (84.5%).
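A minimal city-language-model sketch, assuming a log of queries with explicit city mentions and simple additive smoothing; the actual system's features and supervision are richer than this unigram version:

```python
import math

def build_city_models(query_log):
    """query_log: list of (context_terms, city) pairs, where context_terms
    are the query words surrounding an explicit city mention. Each city
    gets unigram counts over its surrounding language."""
    counts, totals = {}, {}
    for terms, city in query_log:
        model = counts.setdefault(city, {})
        for t in terms:
            model[t] = model.get(t, 0) + 1
        totals[city] = totals.get(city, 0) + len(terms)
    return counts, totals

def best_city(query_terms, counts, totals, alpha=0.1):
    """Score each city by the add-alpha smoothed log-likelihood of the
    (city-free) query words under its language model; a high-scoring
    city suggests implicit geo intent toward that city."""
    vocab = {t for model in counts.values() for t in model}
    best, best_ll = None, float("-inf")
    for city, model in counts.items():
        ll = sum(math.log((model.get(t, 0) + alpha) /
                          (totals[city] + alpha * len(vocab)))
                 for t in query_terms)
        if ll > best_ll:
            best, best_ll = city, ll
    return best
```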
Traditional information systems return answers after a user submits a complete query. Users often feel “left in the dark” when they have limited knowledge about the underlying data, and have to use a try-and-see approach for finding information. A recent trend of supporting autocomplete in these systems is a first step towards solving this problem. In this paper, we study a new information-access paradigm, called “interactive, fuzzy search,” in which the system searches the underlying data “on the fly” as the user types in query keywords. It extends autocomplete interfaces by (1) allowing keywords to appear in multiple attributes (in an arbitrary order) of the underlying data; and (2) finding relevant records that have keywords matching query keywords approximately. This framework allows users to explore data as they type, even in the presence of minor errors. We study research challenges in this framework for large amounts of data. Since each keystroke of the user could invoke a query on the backend, we need efficient algorithms to process each query within milliseconds. We develop various incrementalsearch algorithms using previously computed and cached results in order to achieve an interactive speed. We have deployed several real prototypes using these techniques. One of them has been deployed to support interactive search on the UC Irvine people directory, which has been used regularly and well received by users due to its friendly interface and high efficiency. answers. This information-access paradigm requires the user to have certain knowledge about the structure and content of the underlying data repository. In the case where the user has limited knowledge about the data, often the user feels “left in the dark” when issuing queries, and has to use a tryand-see approach for finding information, as illustrated by the following example. At a conference venue, an attendee named John met a person from a university. After the conference he wanted to get more information about this person, such as his research projects. All John knows about the person is that he is a professor from that university, and he only remembers the name roughly. In order to search for this person, John goes to the directory page of the university. Figure 1 shows such an interface. John needs to fill in the form by providing information for multiple attributes, such as name, phone, department, and title. Given his limited information about the person, especially since he does not know the exact spelling of the person’s name, John needs to try a few possible keywords, go through the returned results, modify the keywords, and reissue a new query. He needs to repeat this step multiple times to find the person, if lucky enough. This search interface is neither efficient nor user friendly. Li, GuoliangJi, Shengyue and Li, Guoliang and Li, Chen and Feng, Jianhua Efficient Interactive Fuzzy Keyword Search.
Traditional information systems return answers after a user submits a complete query. Users often feel “left in the dark” when they have limited knowledge about the underlying data, and have to use a try-and-see approach for finding information. A recent trend of supporting autocomplete in these systems is a first step towards solving this problem. In this paper, we study a new information-access paradigm, called “interactive, fuzzy search,” in which the system searches the underlying data “on the fly” as the user types in query keywords. It extends autocomplete interfaces by (1) allowing keywords to appear in multiple attributes (in an arbitrary order) of the underlying data; and (2) finding relevant records that have keywords matching query keywords approximately. This framework allows users to explore data as they type, even in the presence of minor errors. We study research challenges in this framework for large amounts of data. Since each keystroke of the user could invoke a query on the backend, we need efficient algorithms to process each query within milliseconds. We develop various incrementalsearch algorithms using previously computed and cached results in order to achieve an interactive speed. We have deployed several real prototypes using these techniques. One of them has been deployed to support interactive search on the UC Irvine people directory, which has been used regularly and well received by users due to its friendly interface and high efficiency. answers. This information-access paradigm requires the user to have certain knowledge about the structure and content of the underlying data repository. In the case where the user has limited knowledge about the data, often the user feels “left in the dark” when issuing queries, and has to use a tryand-see approach for finding information, as illustrated by the following example. At a conference venue, an attendee named John met a person from a university. After the conference he wanted to get more information about this person, such as his research projects. All John knows about the person is that he is a professor from that university, and he only remembers the name roughly. In order to search for this person, John goes to the directory page of the university. Figure 1 shows such an interface. John needs to fill in the form by providing information for multiple attributes, such as name, phone, department, and title. Given his limited information about the person, especially since he does not know the exact spelling of the person’s name, John needs to try a few possible keywords, go through the returned results, modify the keywords, and reissue a new query. He needs to repeat this step multiple times to find the person, if lucky enough. This search interface is neither efficient nor user friendly. Li, JinyangChen, Jay and Subramanian, Lakshminarayanan and Li, Jinyang RuralCafe: Web Search in the Rural Developing World.
The majority of people in rural developing regions do not have access to the World Wide Web. Traditional network connectivity technologies have proven to be prohibitively expensive in these areas. The emergence of new long-range wireless technologies provide hope for connecting these rural regions to the Internet. However, the network connectivity provided by these new solutions are by nature intermittent due to high network usage rates, frequent power-cuts and the use of delay tolerant links. Typical applications, especially interactive applications like web search, do not tolerate intermittent connectivity. In this paper, we present the design and implementation of RuralCafe, a system intended to support efficient web search over intermittent networks. RuralCafe enables users to perform web search asynchronously and find what they are looking for in one round of intermittency as opposed to multiple rounds of search/downloads. RuralCafe does this by providing an expanded search query interface which allows a user to specify additional query terms to maximize the utility of the results returned by a search query. Given knowledge of the limited available network resources, RuralCafe performs optimizations to prefetch pages to best satisfy a search query based on a user’s search preferences. In addition, RuralCafe does not require modifications to the web browser, and can provide single round search results tailored to various types of networks and economic constraints. We have implemented and evaluated the effectiveness of RuralCafe using queries from logs made to a large search engine, queries made by users in an intermittent setting, and live queries from a small testbed deployment. We have also deployed a prototype of RuralCafe in Kerala, India. Lochovsky, FredHu, Jian and Wang, Gang and Lochovsky, Fred and Sun, Jian-tao and Chen, Zheng Understanding User's Query Intent with Wikipedia.
Understanding the intent behind a user’s query can help search engine to automatically route the query to some corresponding vertical search engines to obtain particularly relevant contents, thus, greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) Intent representation; (2) Domain coverage and (3) Semantic interpretation. Current approaches to predict the user’s intent mainly utilize machine learning techniques. However, it is difficult and often requires many human efforts to meet all these challenges by the statistical machine learning approaches. In this paper, we propose a general methodology to the problem of query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge base. The Wikipedia concepts are used as the intent representation space, thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified through mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method can achieve much better coverage to classify queries in an intent domain even through the number of seed intent examples is very small. Moreover, the method is very general and can be easily applied to various intent domains. We demonstrate the effectiveness of this method in three different applications, i.e., travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform the quantitative evaluations in comparison with two baseline methods, and the experimental results show that our method significantly outperforms other approaches in each intent domain. Metzler, DonaldBroder, Andrei and Ciccolo, Peter and Gabrilovich, Evgeniy and Josifovski, Vanja and Metzler, Donald and Riedel, Lance and Yuan, Jeffrey Online Expansion of Rare Queries for Sponsored Search.
Metzler, Donald
Broder, Andrei and Ciccolo, Peter and Gabrilovich, Evgeniy and Josifovski, Vanja and Metzler, Donald and Riedel, Lance and Yuan, Jeffrey Online Expansion of Rare Queries for Sponsored Search.
Sponsored search systems are tasked with matching queries to relevant advertisements. The current state-of-the-art matching algorithms expand the user's query using a variety of external resources, such as Web search results. While these expansion-based algorithms are highly effective, they are largely inefficient and cannot be applied in real time. In practice, such algorithms are applied offline to popular queries, with the results of the expensive operations cached for fast access at query time. In this paper, we describe an efficient and effective approach for matching ads against rare queries that were not processed offline. The approach builds an expanded query representation by leveraging offline processing done for related popular queries. Our experimental results show that our approach significantly improves the effectiveness of advertising on rare queries with only a negligible increase in computational cost.
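One simple way to picture "leveraging offline processing done for related popular queries" is to borrow the cached expansion vectors of the rare query's lexically closest popular neighbors. This is an illustrative guess at the shape of such a system, not the authors' algorithm; the cache contents and the Jaccard similarity measure are assumptions:

```python
# Expand a rare query by similarity-weighted combination of the precomputed
# expansion vectors of its nearest popular queries.
from collections import Counter

offline_expansions = {  # popular query -> expansion term weights (hypothetical)
    "digital camera": Counter({"photography": 2.0, "lens": 1.5, "canon": 1.0}),
    "camera reviews": Counter({"ratings": 1.8, "photography": 1.2}),
}

def token_overlap(q1: str, q2: str) -> float:
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b)   # Jaccard similarity on tokens

def expand_rare_query(query: str, k: int = 2) -> Counter:
    neighbors = sorted(offline_expansions,
                       key=lambda q: token_overlap(query, q), reverse=True)[:k]
    expansion = Counter()
    for q in neighbors:
        w = token_overlap(query, q)
        for term, weight in offline_expansions[q].items():
            expansion[term] += w * weight  # weight donors by query similarity
    return expansion

print(expand_rare_query("vintage camera straps"))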
Pandey, Sandeep
Pandey, Sandeep and Broder, Andrei and Chierichetti, Flavio and Josifovski, Vanja and Kumar, Ravi and Vassilvitskii, Sergei Nearest-Neighbor Caching for Content-Match Applications.
Motivated by contextual advertising systems and other web applications involving efficiency–accuracy tradeoffs, we study similarity caching. Here, a cache hit is said to occur if the requested item is similar, but not necessarily equal, to some cached item. We study two objectives that dictate the efficiency–accuracy tradeoff and provide caching policies for these objectives. Through extensive experiments on real data we show that similarity caching can significantly improve the efficiency of contextual advertising systems, with minimal impact on accuracy. Inspired by the above, we propose a simple generative model that embodies two fundamental characteristics of page requests arriving at advertising systems, namely long-range dependencies and similarities. We provide theoretical bounds on the gains of similarity caching in this model and demonstrate these gains empirically by fitting the actual data to the model.
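A concrete way to realize the approximate hit test is locality-sensitive hashing: hash each request's feature vector with random hyperplanes so that nearby vectors tend to share a bucket, and treat any bucket collision as a hit. This is a generic LSH construction offered as a sketch, not the policies studied in the paper:

```python
# Similarity cache keyed on random-hyperplane signatures: vectors on the same
# side of all hyperplanes collide, so similar requests reuse cached results.
import random

DIM, BITS = 8, 6
random.seed(0)
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def signature(vec):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

cache = {}  # signature -> cached result

def lookup(vec):
    return cache.get(signature(vec))   # hit if a similar vector was cached

def insert(vec, result):
    cache[signature(vec)] = result
```

More hyperplane bits make hits rarer but more accurate, which is exactly the efficiency–accuracy dial the abstract describes.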
Punera, Kunal
Chakrabarti, Deepayan and Kumar, Ravi and Punera, Kunal Quicklink Selection for Navigational Query Results.
Quicklinks for a website are navigational shortcuts displayed below the website homepage on a search results page that let users jump directly to selected points inside the website. Since the real estate on a search results page is constrained and valuable, picking the best set of quicklinks to maximize the benefit for a majority of users is an important problem for search engines. Using user browsing trails obtained from browser toolbars and a simple probabilistic model, we formulate the quicklink selection problem as a combinatorial optimization problem. We first demonstrate the hardness of the objective, and then propose an algorithm that is provably within a factor of (1 − 1/e) of optimal. We also propose a different algorithm that works on trees and can find the optimal solution; unlike the previous algorithm, this algorithm can incorporate natural constraints on the set of chosen quicklinks. The efficacy of our methods is demonstrated via empirical results on both a manually labeled set of websites and a set for which quicklink click-through rates for several webpages were obtained from a real-world search engine.
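The (1 − 1/e) guarantee is the signature of greedy maximization of a monotone submodular objective. A minimal coverage-style sketch (the trail data is invented, and the paper's probabilistic objective is richer than plain coverage): each candidate link "covers" the observed user trails that pass through it, and we greedily pick the link adding the most new coverage.

```python
# Greedy max-coverage selection of k quicklinks; greedy on a monotone
# submodular objective achieves the (1 - 1/e) approximation factor.
def select_quicklinks(link_to_trails: dict[str, set[int]], k: int) -> list[str]:
    chosen, covered = [], set()
    for _ in range(k):
        best = max(link_to_trails,
                   key=lambda l: len(link_to_trails[l] - covered), default=None)
        if best is None or not (link_to_trails[best] - covered):
            break   # no candidate adds new coverage
        chosen.append(best)
        covered |= link_to_trails.pop(best)
    return chosen

trails = {"/login": {1, 2, 3}, "/jobs": {3, 4}, "/contact": {5}, "/about": {1}}
print(select_quicklinks(trails, k=2))  # -> ['/login', '/jobs']
```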
Raghavan, Hema
Yi, Xing and Raghavan, Hema and Leggetter, Chris Discovering Users' Specific Geo Intention in Web Search.
Discovering users' specific and implicit geographic intent in web search can greatly help satisfy users' information needs. We build a geo intent analysis system that uses minimal supervision to learn a model from large amounts of web-search logs for this discovery. We build a city language model, which is a probabilistic representation of the language surrounding the mention of a city in web queries. We use several features derived from these language models to: (1) identify users' implicit geo intent and pinpoint the city corresponding to this intent; (2) determine whether the geo intent is localized around the user's current geographic location; and (3) predict cities for queries that mention an entity located in a specific place. Experimental results demonstrate the effectiveness of using features derived from the city language model. We find that (1) the system has over 90% precision and more than 74% accuracy for the task of detecting users' implicit city-level geo intent; (2) the system achieves more than 96% accuracy in determining whether implicit geo queries are local geo queries, neighboring-region geo queries, or neither; and (3) the city language model can effectively retrieve cities in location-specific queries with high precision (88%) and recall (74%); human evaluation shows that the language model predicts city labels for location-specific queries with high accuracy (84.5%).
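A city language model in its simplest form is a smoothed unigram distribution over the terms that co-occur with each city mention in logged queries. A minimal sketch with an invented toy log (the paper trains on large-scale logs and derives several features beyond raw likelihood):

```python
# Per-city unigram models with add-one smoothing, scored in log space.
from collections import Counter, defaultdict
import math

log = [("new york", "cheap hotels near times square"),
       ("boston", "red line subway schedule"),
       ("new york", "broadway show tickets")]

models = defaultdict(Counter)
for city, context in log:
    models[city].update(context.split())

def score(city: str, query: str, vocab_size: int = 10000) -> float:
    m, total = models[city], sum(models[city].values())
    # Log-probability of the query under the smoothed city unigram model.
    return sum(math.log((m[t] + 1) / (total + vocab_size))
               for t in query.split())

query = "hotels near broadway"
print(max(models, key=lambda c: score(c, query)))  # -> "new york"
```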
Raghavan, Prabhakar
Chierichetti, Flavio and Kumar, Ravi and Raghavan, Prabhakar Compressed Web Indexes.
Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the following problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how compactly can we build such indexes while still allowing fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.
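For reference, a Zipfian distribution with exponent s over a vocabulary of N terms assigns the i-th most frequent term the probability below (generic textbook notation, not lifted from the paper):

```latex
p_i = \frac{i^{-s}}{H_{N,s}}, \qquad H_{N,s} = \sum_{j=1}^{N} j^{-s},
```

so for s = 1 a term's probability falls off inversely with its frequency rank, which is why a small head of terms dominates posting-list volume while the long tail dominates vocabulary size.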
Riedel, Lance
Broder, Andrei and Ciccolo, Peter and Gabrilovich, Evgeniy and Josifovski, Vanja and Metzler, Donald and Riedel, Lance and Yuan, Jeffrey Online Expansion of Rare Queries for Sponsored Search.
Sponsored search systems are tasked with matching queries to relevant advertisements. The current state-of-the-art matching algorithms expand the user's query using a variety of external resources, such as Web search results. While these expansion-based algorithms are highly effective, they are largely inefficient and cannot be applied in real time. In practice, such algorithms are applied offline to popular queries, with the results of the expensive operations cached for fast access at query time. In this paper, we describe an efficient and effective approach for matching ads against rare queries that were not processed offline. The approach builds an expanded query representation by leveraging offline processing done for related popular queries. Our experimental results show that our approach significantly improves the effectiveness of advertising on rare queries with only a negligible increase in computational cost.
Sharma, Aneesh
Gollapudi, Sreenivas and Sharma, Aneesh An Axiomatic Approach for Result Diversification.
Understanding user intent is key to designing an effective ranking system in a search engine. In the absence of any explicit knowledge of user intent, search engines want to diversify results to improve user satisfaction. In such a setting, the approach based on the probability ranking principle, of presenting the most relevant results on top, can be suboptimal, and hence the search engine would like to trade off relevance for diversity in the results. In analogy to prior work on ranking and clustering systems, we use the axiomatic approach to characterize and design diversification systems. We develop a set of natural axioms that a diversification system is expected to satisfy, and show that no diversification function can satisfy all the axioms simultaneously. We illustrate the use of the axiomatic framework by providing three example diversification objectives that satisfy different subsets of the axioms. We also uncover a rich link to the facility dispersion problem, which yields algorithms for a number of diversification objectives. Finally, we propose an evaluation methodology to characterize the objectives and the underlying axioms. We conduct a large-scale evaluation of our objectives based on two data sets: a data set derived from the Wikipedia disambiguation pages and a product database.
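The facility dispersion connection can be made concrete with the classic farthest-point greedy heuristic: pick each next result to maximize its minimum distance to those already chosen, so near-duplicates are pushed down the list. This is a generic dispersion sketch with a made-up word-overlap distance, not the paper's specific objectives:

```python
# Farthest-point (Gonzalez-style) greedy diversification of a result list.
def diversify(items: list[str], dist, k: int) -> list[str]:
    chosen = [items[0]]                      # seed with the top-ranked result
    while len(chosen) < min(k, len(items)):
        # Maximize distance to the closest already-chosen item.
        nxt = max((i for i in items if i not in chosen),
                  key=lambda i: min(dist(i, c) for c in chosen))
        chosen.append(nxt)
    return chosen

# Toy distance: fraction of non-shared words between result titles.
def word_dist(a: str, b: str) -> float:
    wa, wb = set(a.split()), set(b.split())
    return 1 - len(wa & wb) / len(wa | wb)

results = ["jaguar car dealer", "jaguar car parts", "jaguar animal facts"]
print(diversify(results, word_dist, k=2))  # picks the animal page second
```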
Subramanian, Lakshminarayanan
Chen, Jay and Subramanian, Lakshminarayanan and Li, Jinyang RuralCafe: Web Search in the Rural Developing World.
The majority of people in rural developing regions do not have access to the World Wide Web. Traditional network connectivity technologies have proven to be prohibitively expensive in these areas. The emergence of new long-range wireless technologies provides hope for connecting these rural regions to the Internet. However, the network connectivity provided by these new solutions is by nature intermittent, due to high network usage rates, frequent power cuts, and the use of delay-tolerant links. Typical applications, especially interactive applications like web search, do not tolerate intermittent connectivity. In this paper, we present the design and implementation of RuralCafe, a system intended to support efficient web search over intermittent networks. RuralCafe enables users to perform web search asynchronously and find what they are looking for in one round of intermittency, as opposed to multiple rounds of search and downloads. RuralCafe does this by providing an expanded search query interface that allows a user to specify additional query terms to maximize the utility of the results returned by a search query. Given knowledge of the limited available network resources, RuralCafe performs optimizations to prefetch pages that best satisfy a search query, based on a user's search preferences. In addition, RuralCafe does not require modifications to the web browser, and can provide single-round search results tailored to various types of networks and economic constraints. We have implemented and evaluated the effectiveness of RuralCafe using queries from the logs of a large search engine, queries made by users in an intermittent setting, and live queries from a small testbed deployment. We have also deployed a prototype of RuralCafe in Kerala, India.
Suel, Torsten
Gan, Qingqing and Suel, Torsten Improved Techniques for Result Caching in Web Search Engines.
Query processing is a major cost factor in operating large web search engines. In this paper, we study query result caching, one of the main techniques used to optimize query processing performance. Our first contribution is a study of result caching as a weighted caching problem. Most previous work has focused on optimizing cache hit ratios, but given that the processing costs of queries can vary significantly, we argue that total cost savings also need to be considered. We describe and evaluate several algorithms for weighted result caching, and study the impact of Zipf-based query distributions on result caching. Our second and main contribution is a new set of feature-based cache eviction policies that achieve significant improvements over all previous methods, substantially narrowing the existing performance gap to the theoretically optimal (clairvoyant) method. Finally, using the same approach, we also obtain performance gains for the related problem of inverted list caching.
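Weighted caching in miniature: the classic GreedyDual policy, shown below as a generic illustration of cost-aware eviction rather than the paper's feature-based policies, gives each cached result a credit equal to its processing cost and evicts the lowest credit. Instead of decrementing every entry on eviction, it raises a global floor L to the evicted entry's credit, so cheap-to-recompute queries age out first:

```python
# GreedyDual eviction: favor keeping results that are expensive to recompute.
class GreedyDualCache:
    def __init__(self, capacity: int):
        self.capacity, self.L = capacity, 0.0
        self.store = {}    # query -> result
        self.credit = {}   # query -> credit (H value)

    def get(self, query):
        if query in self.store:
            self.credit[query] = self.L + self.cost(query)  # refresh on hit
            return self.store[query]
        return None

    def put(self, query, result):
        if len(self.store) >= self.capacity:
            victim = min(self.credit, key=self.credit.get)
            self.L = self.credit.pop(victim)  # raise the aging floor
            del self.store[victim]
        self.store[query] = result
        self.credit[query] = self.L + self.cost(query)

    def cost(self, query) -> float:
        # Stand-in for a measured per-query processing cost (assumption).
        return float(len(query.split()))
```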
Yan, Hao and Ding, Shuai and Suel, Torsten Inverted Index Compression and Query Processing with Optimized Document Ordering.
Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies [7, 23, 25, 6, 24] first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performance. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages.
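The benefit of clustering similar documents is that docID gaps shrink, and gap-based codes then emit fewer bytes. Variable-byte coding is one of the standard baselines in this line of work; a minimal sketch (7 data bits per byte, with the high bit marking the terminating byte of each gap):

```python
# Variable-byte coding of docID gaps (d-gaps).
def vbyte_encode(doc_ids: list[int]) -> bytes:
    out, prev = bytearray(), 0
    for d in doc_ids:
        gap, prev = d - prev, d
        chunk = []
        while True:
            chunk.append(gap & 0x7F)   # low 7 bits first
            gap >>= 7
            if not gap:
                break
        chunk[0] |= 0x80               # stop bit on the lowest 7 bits,
        out.extend(reversed(chunk))    # which are emitted last
    return bytes(out)

def vbyte_decode(data: bytes) -> list[int]:
    ids, cur, gap = [], 0, 0
    for b in data:
        gap = (gap << 7) | (b & 0x7F)
        if b & 0x80:                   # last byte of this gap
            cur += gap
            ids.append(cur)
            gap = 0
    return ids

docs = [5, 9, 12, 300]
assert vbyte_decode(vbyte_encode(docs)) == docs
```

With a good document ordering most gaps fit in a single byte, which is exactly the effect the renumbering studies exploit.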
Ding, Shuai and He, Jinru and Yan, Hao and Suel, Torsten Using Graphics Processors for High Performance IR Query Processing.
Web search engines are facing formidable performance challenges due to data sizes and query loads. The major engines have to process tens of thousands of queries per second over tens of billions of documents. To deal with this heavy workload, such engines employ massively parallel systems consisting of thousands of machines. The significant cost of operating these systems has motivated a lot of recent research into more efficient query processing mechanisms. We investigate a new way to build such high-performance IR systems using graphics processing units (GPUs). GPUs were originally designed to accelerate computer graphics applications through massive on-chip parallelism. Recently, a number of researchers have studied how to use GPUs for other problem domains such as databases and scientific computing [9, 8, 12]. Our contribution here is to design a basic system architecture for GPU-based high-performance IR, to develop suitable algorithms for subtasks such as inverted list compression, list intersection, and top-k scoring, and to show how to achieve highly efficient query processing on GPU-based systems. Our experimental results for a prototype GPU-based system on 25.2 million web pages show promising gains in query throughput.
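One of the subtasks named above, inverted list intersection, is easy to see on the CPU; the paper's contribution is parallelizing such kernels across thousands of GPU threads. A sequential galloping-search sketch of the subtask itself (not the paper's GPU algorithm):

```python
# Intersect a short posting list with a long one using galloping
# (exponential) search followed by binary search.
from bisect import bisect_left

def intersect(short: list[int], long: list[int]) -> list[int]:
    result, lo = [], 0
    for x in short:
        # Gallop: double the probe distance until we pass x...
        step = 1
        while lo + step < len(long) and long[lo + step] < x:
            step *= 2
        # ...then binary-search the bracketed range for x.
        lo = bisect_left(long, x, lo, min(lo + step + 1, len(long)))
        if lo < len(long) and long[lo] == x:
            result.append(x)
    return result

print(intersect([3, 9, 40], list(range(0, 100, 3))))  # -> [3, 9]
```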
Sun, Jian-tao
Hu, Jian and Wang, Gang and Lochovsky, Fred and Sun, Jian-tao and Chen, Zheng Understanding User's Query Intent with Wikipedia.
Understanding the intent behind a user's query can help a search engine automatically route the query to corresponding vertical search engines and obtain particularly relevant content, thereby greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) intent representation; (2) domain coverage; and (3) semantic interpretation. Current approaches to predicting the user's intent mainly utilize machine learning techniques. However, it is difficult, and often requires much human effort, to meet all these challenges with statistical machine learning approaches. In this paper, we propose a general methodology for the problem of query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge bases. The Wikipedia concepts are used as the intent representation space; thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified by mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method achieves much better coverage when classifying queries in an intent domain even though the number of seed intent examples is very small. Moreover, the method is very general and can easily be applied to various intent domains. We demonstrate the effectiveness of this method in three different applications, i.e., travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform quantitative evaluations in comparison with two baseline methods, and the experimental results show that our method significantly outperforms the other approaches in each intent domain.
Vandelle, Gilles
Diemert, Eustache and Vandelle, Gilles Unsupervised Query Categorization using Automatically-Built Concept Graphs.
Automatic categorization of user queries is an important component of general-purpose (Web) search engines, particularly for triggering rich, query-specific content and sponsored links. We propose an unsupervised learning scheme that dramatically reduces the cost of setting up and maintaining such a categorizer, while retaining good categorization power. The model is stored as a graph of concepts, where graph edges represent the cross-references between concepts. Concepts and relations are extracted from query logs by an offline Web mining process, which uses a search engine as a powerful summarizer for building the concept graph. Empirical evaluation indicates that the system compares favorably on publicly available data sets (such as KDD Cup 2005), as well as on portions of the current query stream of Yahoo! Search, where it is already changing the experience of millions of Web search users.
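A toy of the concept-graph idea: activate the concepts a query mentions, let activation spread one step along cross-reference edges, and read off the strongest concept. The graph, weights, and spreading rule below are all invented for illustration; the paper builds its graph by mining query logs with a search engine as summarizer:

```python
# One step of spreading activation over a tiny hypothetical concept graph.
concept_graph = {  # concept -> cross-referenced concepts (hypothetical)
    "mortgage": ["finance", "real estate"],
    "finance": ["mortgage"],
    "real estate": ["mortgage"],
}

def categorize(query: str) -> str:
    activation = {c: 0.0 for c in concept_graph}
    for concept in concept_graph:
        if concept in query.lower():
            activation[concept] += 1.0
    for concept, weight in list(activation.items()):  # one spreading step
        for neighbor in concept_graph[concept]:
            activation[neighbor] += 0.5 * weight
    return max(activation, key=activation.get)

print(categorize("best mortgage rates 2009"))  # -> "mortgage"
```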
Vassilvitskii, Sergei
Pandey, Sandeep and Broder, Andrei and Chierichetti, Flavio and Josifovski, Vanja and Kumar, Ravi and Vassilvitskii, Sergei Nearest-Neighbor Caching for Content-Match Applications.
Motivated by contextual advertising systems and other web applications involving efficiency–accuracy tradeoffs, we study similarity caching. Here, a cache hit is said to occur if the requested item is similar, but not necessarily equal, to some cached item. We study two objectives that dictate the efficiency–accuracy tradeoff and provide caching policies for these objectives. Through extensive experiments on real data we show that similarity caching can significantly improve the efficiency of contextual advertising systems, with minimal impact on accuracy. Inspired by the above, we propose a simple generative model that embodies two fundamental characteristics of page requests arriving at advertising systems, namely long-range dependencies and similarities. We provide theoretical bounds on the gains of similarity caching in this model and demonstrate these gains empirically by fitting the actual data to the model.
Wang, Gang
Hu, Jian and Wang, Gang and Lochovsky, Fred and Sun, Jian-tao and Chen, Zheng Understanding User's Query Intent with Wikipedia.
Understanding the intent behind a user's query can help a search engine automatically route the query to corresponding vertical search engines and obtain particularly relevant content, thereby greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) intent representation; (2) domain coverage; and (3) semantic interpretation. Current approaches to predicting the user's intent mainly utilize machine learning techniques. However, it is difficult, and often requires much human effort, to meet all these challenges with statistical machine learning approaches. In this paper, we propose a general methodology for the problem of query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge bases. The Wikipedia concepts are used as the intent representation space; thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified by mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method achieves much better coverage when classifying queries in an intent domain even though the number of seed intent examples is very small. Moreover, the method is very general and can easily be applied to various intent domains. We demonstrate the effectiveness of this method in three different applications, i.e., travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform quantitative evaluations in comparison with two baseline methods, and the experimental results show that our method significantly outperforms the other approaches in each intent domain.
Wang, Xuerui
Wang, Xuerui and Broder, Andrei and Fontoura, Marcus and Josifovski, Vanja A Search-based Method for Forecasting Ad Impression in Contextual Advertising.
Contextual advertising (also called content match) refers to the placement of small textual ads within the content of a generic web page. It has become a significant source of revenue for publishers ranging from individual bloggers to major newspapers. At the same time it is an important way for advertisers to reach their intended audience. This reach depends on the total number of exposures of the ad (impressions) and its click-through rate (CTR), which can be viewed as the probability of an end user clicking on the ad when shown. These two orthogonal, critical factors are both difficult to estimate, and even individually each can be very informative and useful in planning and budgeting advertising campaigns. In this paper, we address the problem of forecasting the number of impressions for new or changed ads in the system. Producing such forecasts, even within large margins of error, is quite challenging: (1) ad selection in contextual advertising is a complicated process based on tens or even hundreds of page and ad features; (2) the publishers' content and traffic vary over time; and (3) the scale of the problem is daunting: over the course of a week it involves billions of impressions, hundreds of millions of distinct pages, hundreds of millions of ads, and varying bids of other competing advertisers. We tackle these complexities by simulating the presence of a given ad, with its associated bid, over weeks of historical data. We obtain an impression estimate by counting how many times the ad would have been displayed if it had been in the system over that period of time. We estimate this count with an efficient two-level search algorithm over the distinct pages in the data set. Experimental results show that our approach can accurately forecast the expected number of impressions of contextual ads in real time. We also show how this method can be used in tools for bid selection and ad evaluation.
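The replay idea can be shown brute force: for every historical page view, rerun ad selection with the hypothetical new ad included and count its wins. The data structures and score-times-bid ranking below are invented stand-ins; the paper's efficient two-level search over distinct pages is precisely what makes this loop tractable at scale:

```python
# Brute-force impression forecast by replaying historical page views.
def forecast_impressions(page_views, new_ad_score, new_ad_bid, slots=3):
    impressions = 0
    for page in page_views:
        # Historical competitors on this page: (relevance score, bid) pairs.
        new_ad = (new_ad_score(page), new_ad_bid)
        ranked = sorted(page["competitors"] + [new_ad],
                        key=lambda a: a[0] * a[1], reverse=True)
        if new_ad in ranked[:slots]:   # would the new ad have won a slot?
            impressions += 1
    return impressions

views = [{"competitors": [(0.9, 1.0), (0.4, 0.5), (0.3, 0.2)]},
         {"competitors": [(0.95, 2.0), (0.9, 1.5), (0.8, 1.2)]}]
print(forecast_impressions(views, new_ad_score=lambda p: 0.7,
                           new_ad_bid=1.0))  # -> 1
```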
Xin, Dong
Agrawal, Sanjay and Chakrabarti, Kaushik and Chaudhuri, Surajit and Ganti, Venkatesh and Christian König, Arnd and Xin, Dong Exploiting Web Search Engines to Search Structured Databases.
Web search engines often federate many user queries to relevant structured databases. For example, a product-related query might be federated to a product database containing product descriptions and specifications. The relevant structured data items are then returned to the user along with the web search results. However, each structured database is searched in isolation. Hence, the search often produces empty or incomplete results, as the database may not contain the information required to answer the query. In this paper, we propose a novel integrated search architecture. We establish and exploit the relationships between web search results and the items in structured databases to identify the relevant structured data items for a much wider range of queries. Our architecture leverages existing search engine components to implement this functionality at very low overhead. We demonstrate the quality and efficiency of our techniques through an extensive experimental study.
Yan, Hao
Yan, Hao and Ding, Shuai and Suel, Torsten Inverted Index Compression and Query Processing with Optimized Document Ordering.
Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies [7, 23, 25, 6, 24] first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performance. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages.
Ding, Shuai and He, Jinru and Yan, Hao and Suel, Torsten Using Graphics Processors for High Performance IR Query Processing.
Web search engines are facing formidable performance challenges due to data sizes and query loads. The major engines have to process tens of thousands of queries per second over tens of billions of documents. To deal with this heavy workload, such engines employ massively parallel systems consisting of thousands of machines. The significant cost of operating these systems has motivated a lot of recent research into more efficient query processing mechanisms. We investigate a new way to build such high-performance IR systems using graphics processing units (GPUs). GPUs were originally designed to accelerate computer graphics applications through massive on-chip parallelism. Recently, a number of researchers have studied how to use GPUs for other problem domains such as databases and scientific computing [9, 8, 12]. Our contribution here is to design a basic system architecture for GPU-based high-performance IR, to develop suitable algorithms for subtasks such as inverted list compression, list intersection, and top-k scoring, and to show how to achieve highly efficient query processing on GPU-based systems. Our experimental results for a prototype GPU-based system on 25.2 million web pages show promising gains in query throughput.
Yi, Xing
Yi, Xing and Raghavan, Hema and Leggetter, Chris Discovering Users' Specific Geo Intention in Web Search.
Discovering users' specific and implicit geographic intent in web search can greatly help satisfy users' information needs. We build a geo intent analysis system that uses minimal supervision to learn a model from large amounts of web-search logs for this discovery. We build a city language model, which is a probabilistic representation of the language surrounding the mention of a city in web queries. We use several features derived from these language models to: (1) identify users' implicit geo intent and pinpoint the city corresponding to this intent; (2) determine whether the geo intent is localized around the user's current geographic location; and (3) predict cities for queries that mention an entity located in a specific place. Experimental results demonstrate the effectiveness of using features derived from the city language model. We find that (1) the system has over 90% precision and more than 74% accuracy for the task of detecting users' implicit city-level geo intent; (2) the system achieves more than 96% accuracy in determining whether implicit geo queries are local geo queries, neighboring-region geo queries, or neither; and (3) the city language model can effectively retrieve cities in location-specific queries with high precision (88%) and recall (74%); human evaluation shows that the language model predicts city labels for location-specific queries with high accuracy (84.5%).
Yuan, Jeffrey
Broder, Andrei and Ciccolo, Peter and Gabrilovich, Evgeniy and Josifovski, Vanja and Metzler, Donald and Riedel, Lance and Yuan, Jeffrey Online Expansion of Rare Queries for Sponsored Search.
Sponsored search systems are tasked with matching queries to relevant advertisements. The current state-of-the-art matching algorithms expand the user's query using a variety of external resources, such as Web search results. While these expansion-based algorithms are highly effective, they are largely inefficient and cannot be applied in real time. In practice, such algorithms are applied offline to popular queries, with the results of the expensive operations cached for fast access at query time. In this paper, we describe an efficient and effective approach for matching ads against rare queries that were not processed offline. The approach builds an expanded query representation by leveraging offline processing done for related popular queries. Our experimental results show that our approach significantly improves the effectiveness of advertising on rare queries with only a negligible increase in computational cost.
This list was generated on Fri Feb 15 08:40:29 2019 GMT.