Items from Poster Session track
Number of items: 93.
Abekawa, Takeshi
Nadamoto, Akiyo and Aramaki, Eiji and Abekawa, Takeshi and Murakami, Yohei Content Hole Search in Community-type Content. In community-type content such as blogs and SNSs, we call the user’s unawareness of information a “content hole” and the search for this information a “content hole search.” A content hole search differs from similarity searching and has a variety of types. In this paper, we propose different types of content holes and define each type. We also propose an analysis of dialogue related to community-type content and introduce content hole search by using Wikipedia as an example.
Agrawal, Mayank
Amitay, Einat
Amitay, Einat and Carmel, David and Har'El, Nadav and Ofek-Koifman, Shila and Soffer, Aya and Yogev, Sivan and Golbandi, Nadav Social Search and Discovery Using a Unified Approach. We explore new ways of improving a search engine using data from Web 2.0 applications such as blogs and social bookmarks. This data contains entities such as documents, people and tags, and relationships between them. We propose a simple yet effective method, based on faceted search, that treats all entities in a unified manner: returning all of them (documents, people and tags) on every search, and allowing all of them to be used as search terms. We describe an implementation of such a social search engine on the intranet of a large enterprise, and present large-scale experiments which verify the validity of our approach.
An, Ning
An, Ning and Chatterjee, Raja and Horhammer, Mike and Ravada, Siva Securely Implementing Open Geospatial Consortium Web Service Interface Standards in Oracle Spatial. In this paper, we briefly describe the implementation of various Open Geospatial Consortium Web Service Interface Standards in Oracle Spatial 11g. We highlight how we utilize Oracle’s implementation of OASIS Web Services Security (WSS) to provide a robust security framework for these OGC Web Services. We also discuss our future direction in supporting OGC Web Service Interface Standards. In addition to the mandated XML interfaces, Oracle Spatial WFS, Oracle Spatial CSW and Oracle Spatial OpenLS also support SOAP interfaces. To improve performance, Oracle Spatial WFS and Oracle Spatial CSW also implement a caching mechanism to support retrieving records from a single query across different web requests. Below, we describe each of the supported OGC Web Services in Oracle Spatial. Due to space limits, we will emphasize Oracle Spatial WFS support to illustrate our unique implementation.
Angelova, Ralitsa
Apte, Varsha
Mathur, Vipul and Dhopeshwarkar, Sanket and Apte, Varsha MASTH Proxy: An Extensible Platform for Web Overload Control. Many overload control mechanisms for Web-based applications aim to prevent overload by setting limits on factors such as admitted load, number of server threads, and buffer size. For this they need online measurements of metrics such as response time, throughput, and resource utilization. This requires instrumentation of the server by modifying server code, which may not be feasible or desirable. An alternate approach is to use a proxy between the clients and servers. We have developed a proxy-based overload control platform called MASTH Proxy (Multi-class Admission-controlled Self-Tuning Http Proxy). It records detailed measurements, supports multiple request classes, manages queues of HTTP requests, provides tunable parameters and enables easy implementation of dynamic overload control. This gives designers of overload control schemes a platform where they can concentrate on developing the core control logic, without the need to modify upstream server code.
Aramaki, Eiji
Nadamoto, Akiyo and Aramaki, Eiji and Abekawa, Takeshi and Murakami, Yohei Content Hole Search in Community-type Content. In community-type content such as blogs and SNSs, we call the user’s unawareness of information a “content hole” and the search for this information a “content hole search.” A content hole search differs from similarity searching and has a variety of types. In this paper, we propose different types of content holes and define each type. We also propose an analysis of dialogue related to community-type content and introduce content hole search by using Wikipedia as an example.
Aras, Hidir
Krause, Markus and Aras, Hidir Playful Tagging - Folksonomy Generation Using Online Games. Collaborative tagging is a powerful method to create folksonomies that can be used to grasp or filter user preferences or to enhance web search. Recent research has shown that, depending on the number of users and the quality of user-provided tags, powerful community-driven semantics or “ontologies” can emerge – as was evident from analyzing user data from social web applications such as del.icio.us or Flickr. Unfortunately, most web pages do not contain tags and, thus, no vocabulary that describes the information provided. A common problem in web page annotation is motivating users to participate constantly, i.e. to keep tagging. In this paper we describe our approach of a binary verification game that embeds collaborative tagging into online games in order to produce domain-specific folksonomies.
Atkinson, Martin
Atkinson, Martin and Van der Goot, Erik Near Real Time Information Mining in Multilingual News. This paper presents a near real-time multilingual news monitoring and analysis system that forms the backbone of our research work. The system integrates technologies to address the problems related to information extraction and analysis of open source intelligence on the World Wide Web. By chaining together different techniques in text mining, automated machine learning and statistical analysis, we can automatically determine who, where and, to a certain extent, what is being reported in news articles.
Balakrishnan, Rajesh
Baraglia, Ranieri
Baraglia, Ranieri and Cacheda, Fidel and Carneiro, Victor and Formoso, Vreixo and Perego, Raffaele and Silvestri, Fabrizio Search Shortcuts: Driving Users Towards Their Goals. Giving suggestions to users of Web-based services is a common practice aimed at enhancing their navigation experience. Major Web search engines usually provide suggestions in the form of queries that are, to some extent, related to the current query typed by the user and to the knowledge learned from past usage of the system. In this work we introduce Search Shortcuts as “successful” queries that allowed users, in the past, to satisfy their information needs. Unlike conventional suggestion techniques, our search shortcuts allow effectiveness to be evaluated by exploiting a simple train-and-test approach. We have applied several Collaborative Filtering algorithms to this problem, evaluating them on real query log data. We generate the shortcuts from all user sessions belonging to the testing set, and measure the quality of the shortcuts suggested by considering the similarity between them and the navigational user behavior.
Baumgartner, Robert
Krüepl, Bernhard and Holzinger, Wolfgang and Darmaputra, Yansen and Baumgartner, Robert A Flight Meta-Search Engine with Metamorph. We demonstrate a flight meta-search engine that is based on the Metamorph framework. Metamorph provides mechanisms to model web forms together with the interactions which are needed to fulfil a request, and can generate interaction sequences that pose queries using these web forms and collect the results. In this paper, we discuss an interesting new feature that makes use of the forms themselves as an information source. We show how data can be extracted from web forms (rather than the data behind web forms) to generate a graph of flight connections between cities. The flight connection graph allows us to vastly reduce the number of queries that the engine sends to airline websites in the most interesting search scenarios: those that involve the controversial practice of creative ticketing, in which agencies attempt to find lower price fares by using more than one airline for a journey. We describe a system which obtains data from a number of websites to identify promising routes and prune the search tree. Heuristics that make use of geographical information and an estimation of cost based on historical data are employed. The results are then made available to improve the quality of future search requests. Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software. General Terms: Algorithms, Design, Experimentation. Keywords: Hidden Web, Web Data Extraction, Web Form Mapping, Web Form Extraction.
Baykan, Eda
Baykan, Eda and Henzinger, Monika and Marian, Ludmila and Weber, Ingmar Purely URL-based Topic Classification. Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content [7], but a URL-only classifier is preferable (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page’s content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser, without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to download it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
Bhat, Satish
Dasgupta, Sourish and Bhat, Satish and Lee, Yugyung SGPS: A Semantic Scheme for Web Service Similarity. Today’s Web is becoming a platform for services to be dynamically interconnected to produce a desired outcome. It is important to formalize the semantics of the contextual elements of web services. In this paper, we propose a novel technique called Semantic Genome Propagation Scheme (SGPS) for measuring similarity between semantic concepts. We show how SGPS is used to compute a multi-dimensional similarity between two services. We evaluate the SGPS similarity measurement in terms of similarity performance and scalability.
Bischoff, Kerstin
Bischoff, Kerstin and Firan, Claudiu S. and Paiu, Raluca Deriving Music Theme Annotations from User Tags. Music theme annotations would be really beneficial for supporting retrieval, but are often neglected by users while annotating. Thus, in order to support users in tagging and to fill the gaps in the tag space, in this paper we develop algorithms for recommending theme annotations. Our methods exploit already existing user tags, the lyrics of music tracks, as well as combinations of both. We compare the results for our recommended theme annotations against genre and style recommendations – a much easier and already studied task. We evaluate the quality of our recommended tags against an expert ground truth data set. Our results are promising and provide interesting insights into possible extensions for music tagging systems to support music search.
Bogaerts, Jérôme
Sire, Stéphane and Paquier, Micaël and Vagner, Alain and Bogaerts, Jérôme A Messaging API for Inter-Widgets Communication. Widget containers are used everywhere on the Web, for instance as customizable start pages to Web desktops. In this poster, we describe the extension of a widget container with an inter-widgets communication layer, as well as the subsequent application programming interfaces (APIs) added to the Widget object to support this feature. We present the benefits of a drag and drop facility within widgets and conclude with a call for standardization of inter-widgets communication on the Web.
Boldú, Marc
Bolivar, Alvaro
Shen, Dan and Wu, Xiaoyuan and Bolivar, Alvaro Rare Item Detection in e-Commerce Site. As the largest online marketplace in the world, eBay has a huge inventory that contains plenty of great rare items with potentially large, even rapturous, groups of buyers. These items are obscured in the long tail of eBay item listings and are hard to find through existing searching or browsing methods. The eBay query log shows that users have great demand for rare items. To keep up with this demand, the paper proposes a method to automatically detect rare items in eBay online listings. A large set of features relevant to the task is investigated to filter items and further measure item rareness. Experiments on the most rarity-demand-intensive domains show that the method can effectively detect rare items (> 90% precision).
Braga, Daniele
Brunner, Jean-Sebastien
Brunner, Jean-Sebastien and Gatellier, Patrick Raise Semantics at the User Level for Dynamic and Interactive SOA-based Portals. In this paper, we describe the fully dynamic semantic portal we implemented, integrating Semantic Web technologies and Service Oriented Architecture (SOA). The goals of the portal are twofold: first it helps administrators to easily propose new features in the portal using semantics to ease the orchestration process; secondly it automatically generates a customized user interface for these scenarios. This user interface takes into account different devices and assists end-users in the use of the portal taking benefit of context awareness. All the added-value of this portal is based on a core semantics defined by an ontology. We present here the main features of this portal and how it was implemented using state-of-the-art technologies and frameworks.
Bu, Jiajun
Wu, Hao and Qiu, Guang and He, Xiaofei and Shi, Yuan and Qu, Mingcheng and Shen, Jing and Bu, Jiajun and Chen, Chun Advertising Keyword Generation Using Active Learning. This paper proposes an efficient relevance feedback based interactive model for keyword generation in sponsored search advertising. We formulate the ranking of relevant terms as a supervised learning problem and suggest new terms for the seed by leveraging user relevance feedback information. Active learning is employed to select the most informative samples from a set of candidate terms for user labeling. Experiments show our approach improves the relevance of generated terms significantly with little user effort required.
Wang, Junfeng and He, Xiaofei and Wang, Can and Pei, Jian and Bu, Jiajun and Chen, Chun and Guan, Ziyu and Gang, Lu News Article Extraction with Template-Independent Wrapper. We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
Qu, Mingcheng and Qiu, Guang and He, Xiaofei and Zhang, Cheng and Wu, Hao and Bu, Jiajun and Chen, Chun Probabilistic Question Recommendation for Question Answering Communities. User-Interactive Question Answering (QA) communities such as Yahoo! Answers are growing in popularity. However, as these QA sites always have thousands of new questions posted daily, it is difficult for users to find the questions that are of interest to them. Consequently, this may delay the answering of the new questions. This gives rise to question recommendation techniques that help users locate interesting questions. In this paper, we adopt the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and propose a novel metric to evaluate the performance of our approach. The experimental results show our recommendation approach is effective.
Zhu, Junyan and Wang, Can and He, Xiaofei and Bu, Jiajun and Chen, Chun and Shang, Shujie and Qu, Mingcheng and Lu, Gang Tag-Oriented Document Summarization. Social annotations on a Web document are highly generalized descriptions of the topics contained in that page. Their tagging frequency indicates varying degrees of user attention. This makes annotations a good resource for summarizing multiple topics in a Web page. In this paper, we present a tag-oriented Web document summarization approach that uses both the document content and the tags annotated on that document. To improve summarization performance, a new tag ranking algorithm named EigenTag is proposed to reduce noise in tags. Meanwhile, an association mining technique is employed to expand the tag set and tackle the sparsity problem. Experimental results show that our tag-oriented summarization yields a significant improvement over approaches that do not use tags.
Böhnstedt, Doreen
Scholl, Philipp and Domínguez García, Renato and Böhnstedt, Doreen and Rensing, Christoph and Steinmetz, Ralf Towards Language-Independent Web Genre Detection. The term web genre denotes the type of a given web resource, in contrast to the topic of its content. In this research, we focus on recognizing the web genres blog, wiki and forum. We present a set of features that exploit the hierarchical structure of the web page’s HTML mark-up and thus, in contrast to related approaches, do not depend on a linguistic analysis of the page’s content. Our results show that it is possible to achieve a very good accuracy for a fully language-independent detection of structured web genres. features (e.g. part-of-speech tagging and document terms), structural features (e.g. HTML tag frequencies, use of facets used to enable functionalities like form input elements) and simple text statistics (e.g. frequencies of punctuation). However, a fact often neglected by related work is that the absolute dominance of the English language on the web is decreasing. Thus, it is important to develop a way of recognizing web genres independently of the language used on the respective web page. As many genres exhibit a certain structural and visual layout, this property makes it possible to ignore linguistic features altogether.
Cacheda, Fidel
Baraglia, Ranieri and Cacheda, Fidel and Carneiro, Victor and Formoso, Vreixo and Perego, Raffaele and Silvestri, Fabrizio Search Shortcuts: Driving Users Towards Their Goals. Giving suggestions to users of Web-based services is a common practice aimed at enhancing their navigation experience. Major Web search engines usually provide suggestions in the form of queries that are, to some extent, related to the current query typed by the user and to the knowledge learned from past usage of the system. In this work we introduce Search Shortcuts as “successful” queries that allowed users, in the past, to satisfy their information needs. Unlike conventional suggestion techniques, our search shortcuts allow effectiveness to be evaluated by exploiting a simple train-and-test approach. We have applied several Collaborative Filtering algorithms to this problem, evaluating them on real query log data. We generate the shortcuts from all user sessions belonging to the testing set, and measure the quality of the shortcuts suggested by considering the similarity between them and the navigational user behavior.
Cai, Rui
Lin, Chen and Yang, Jiang-Ming and Cai, Rui and Wang, Xin-Jing and Wang, Wei and Zhang, Lei Modeling Semantics and Structure of Discussion Threads. The abundant knowledge in web communities has motivated research interest in discussion threads. The dynamic nature of discussion threads poses interesting and challenging problems for computer scientists. Although techniques such as semantic models or structural models have been shown to be useful in a number of areas, they are inefficient in understanding discussion threads due to the temporal dependence among posts in a discussion thread. Such dependence causes semantics and structure to be coupled with each other in discussion threads. In this paper, we propose a sparse coding-based model named SMSS to Simultaneously Model Semantic and Structure of discussion threads.
Carboni, Davide
Manca, Roberto and Massidda, Francesco and Carboni, Davide Visualization of Geo-annotated Pictures in Mobile Phones. In this work, a novel mobile browser for geo-referenced pictures is introduced and described. We use the term browser to denote a system aimed at browsing pictures selected from a large set like Internet photo sharing services. The criteria to filter a subset of pictures to browse are three: the user's actual position, the user's actual heading, and the user's preferences. In this work we only focus on the first two criteria leaving the integration of user's preferences for future developments.
Carlsson, Niklas
Carmel, David
Amitay, Einat and Carmel, David and Har'El, Nadav and Ofek-Koifman, Shila and Soffer, Aya and Yogev, Sivan and Golbandi, Nadav Social Search and Discovery Using a Unified Approach. We explore new ways of improving a search engine using data from Web 2.0 applications such as blogs and social bookmarks. This data contains entities such as documents, people and tags, and relationships between them. We propose a simple yet effective method, based on faceted search, that treats all entities in a unified manner: returning all of them (documents, people and tags) on every search, and allowing all of them to be used as search terms. We describe an implementation of such a social search engine on the intranet of a large enterprise, and present large-scale experiments which verify the validity of our approach.
Carneiro, Victor
Baraglia, Ranieri and Cacheda, Fidel and Carneiro, Victor and Formoso, Vreixo and Perego, Raffaele and Silvestri, Fabrizio Search Shortcuts: Driving Users Towards Their Goals. Giving suggestions to users of Web-based services is a common practice aimed at enhancing their navigation experience. Major Web search engines usually provide suggestions in the form of queries that are, to some extent, related to the current query typed by the user and to the knowledge learned from past usage of the system. In this work we introduce Search Shortcuts as “successful” queries that allowed users, in the past, to satisfy their information needs. Unlike conventional suggestion techniques, our search shortcuts allow effectiveness to be evaluated by exploiting a simple train-and-test approach. We have applied several Collaborative Filtering algorithms to this problem, evaluating them on real query log data. We generate the shortcuts from all user sessions belonging to the testing set, and measure the quality of the shortcuts suggested by considering the similarity between them and the navigational user behavior.
Castro, Paul
Lin, Yu-Ru and Sun, Jimeng and Castro, Paul and Konuru, Ravi and Sundaram, Hari and Kelliher, Aisling Extracting Community Structure through Relational Hypergraphs. Social media websites promote diverse user interaction on media objects as well as user actions with respect to other users. The goal of this work is to discover community structure in rich media social networks, and observe how it evolves over time, through analysis of multi-relational data. The problem is important in the enterprise domain, where extracting emergent community structure on enterprise social media can help in forming new collaborative teams, aid in expertise discovery, and guide long-term enterprise reorganization. Our approach consists of three main parts: (1) a relational hypergraph model for modeling various social contexts and interactions; (2) a novel hypergraph factorization method for community extraction on multi-relational social data; (3) an online method to handle temporal evolution through incremental hypergraph factorization. Extensive experiments on real-world enterprise data suggest that our technique is scalable and can extract meaningful communities. To evaluate the quality of our mining results, we use our method to predict users’ future interests. Our prediction outperforms baseline methods (frequency counts, pLSA) by 36-250% on average, indicating the utility of leveraging multi-relational social context by using our method.
Ceri, Stefano
Chan, Su
Chung, Sukwon and Shiowattana, Dungjit and Dmitriev, Pavel and Chan, Su The Web of Nations. In this paper, we report on a large-scale study of structural differences among the national webs. The study is based on a web-scale crawl conducted in the summer of 2008. More specifically, we study two graphs derived from this crawl: the nation graph, with nodes corresponding to nations and edges to links among nations, and the host graph, with nodes corresponding to hosts and edges to hyperlinks among pages on the hosts. Contrary to some of the previous work [2], our results show that webs of different nations are often very different from each other, both in terms of their internal structure and in terms of their connectivity with other nations.
Chan, W. K.
Jiang, Bo and Chan, W. K. and Zhang, Zhenyu and Tse, T. H. Where to Adapt Dynamic Service Compositions. Peer services depend on one another to accomplish their tasks, and their structures may evolve. A service composition may be designed to replace its member services whenever the quality of the composite service fails to meet certain quality-of-service (QoS) requirements. Finding the services and service invocation endpoints that have the greatest impact on quality is important to guide subsequent service adaptations. This paper proposes a technique that samples the QoS of composite services and continually analyzes them to identify artifacts for service adaptation. The preliminary results show that our technique has the potential to effectively find such artifacts in services.
Chandra, Praphul
Chandra, Praphul and Gupta, Ajay Retaining Personal Expression for Social Search. The Web is being extensively used for personal expression, which includes ratings, reviews, recommendations, and blogs. This user-created content, e.g. a book review on Amazon.com, becomes the property of the website, and the user often does not have easy access to it. In some cases, a user’s feedback may get averaged with feedback from other users, e.g. ratings of a video. We argue that the creator of such content needs to be able to retain (a link to) her created content. We introduce the concept of MEB, which is a user-controlled store of such retained links. A MEB allows a user to access and share all the reviews she has given on different websites. With this capability users can allow their friends to search through their feedback. Searching through one’s social network allows harnessing the power of social networks where known relationships provide the context & trust necessary to interpret feedback.
Chang, William
Chang, William and Pantel, Patrick and Popescu, Ana-Maria and Gabrilovich, Evgeniy Towards Intent-Driven Bidterm Suggestion. In online advertising, pervasive in commercial search engines, advertisers typically bid on few terms, and the scarcity of data makes ad matching difficult. Suggesting additional bidterms can significantly improve ad clickability and conversion rates. In this paper, we present a large-scale bidterm suggestion system that models an advertiser’s intent and finds new bidterms consistent with that intent. Preliminary experiments show that our system significantly increases the coverage of a state-of-the-art production system used at Yahoo while maintaining comparable precision.
Chatterjee, Raja
An, Ning and Chatterjee, Raja and Horhammer, Mike and Ravada, Siva Securely Implementing Open Geospatial Consortium Web Service Interface Standards in Oracle Spatial. In this paper, we briefly describe the implementation of various Open Geospatial Consortium Web Service Interface Standards in Oracle Spatial 11g. We highlight how we utilize Oracle’s implementation of OASIS Web Services Security (WSS) to provide a robust security framework for these OGC Web Services. We also discuss our future direction in supporting OGC Web Service Interface Standards. In addition to the mandated XML interfaces, Oracle Spatial WFS, Oracle Spatial CSW and Oracle Spatial OpenLS also support SOAP interfaces. To improve performance, Oracle Spatial WFS and Oracle Spatial CSW also implement a caching mechanism to support retrieving records from a single query across different web requests. Below, we describe each of the supported OGC Web Services in Oracle Spatial. Due to space limits, we will emphasize Oracle Spatial WFS support to illustrate our unique implementation.
Chen, Chun
Wu, Hao and Qiu, Guang and He, Xiaofei and Shi, Yuan and Qu, Mingcheng and Shen, Jing and Bu, Jiajun and Chen, Chun Advertising Keyword Generation Using Active Learning. This paper proposes an efficient relevance feedback based interactive model for keyword generation in sponsored search advertising. We formulate the ranking of relevant terms as a supervised learning problem and suggest new terms for the seed by leveraging user relevance feedback information. Active learning is employed to select the most informative samples from a set of candidate terms for user labeling. Experiments show our approach improves the relevance of generated terms significantly with little user effort required.
Wang, Junfeng and He, Xiaofei and Wang, Can and Pei, Jian and Bu, Jiajun and Chen, Chun and Guan, Ziyu and Gang, Lu News Article Extraction with Template-Independent Wrapper. We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
Qu, Mingcheng and Qiu, Guang and He, Xiaofei and Zhang, Cheng and Wu, Hao and Bu, Jiajun and Chen, Chun Probabilistic Question Recommendation for Question Answering Communities. User-Interactive Question Answering (QA) communities such as Yahoo! Answers are growing in popularity. However, as these QA sites always have thousands of new questions posted daily, it is difficult for users to find the questions that are of interest to them. Consequently, this may delay the answering of the new questions. This gives rise to question recommendation techniques that help users locate interesting questions. In this paper, we adopt the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and propose a novel metric to evaluate the performance of our approach. The experimental results show our recommendation approach is effective.
Zhu, Junyan and Wang, Can and He, Xiaofei and Bu, Jiajun and Chen, Chun and Shang, Shujie and Qu, Mingcheng and Lu, Gang Tag-Oriented Document Summarization. Social annotations on a Web document are highly generalized descriptions of the topics contained in that page. Their tagging frequency indicates the varying degrees of attention users pay to each topic. This makes annotations a good resource for summarizing multiple topics in a Web page. In this paper, we present a tag-oriented Web document summarization approach that uses both the document content and the tags annotated on that document. To improve summarization performance, a new tag ranking algorithm named EigenTag is proposed to reduce noise in tags. Meanwhile, an association mining technique is employed to expand the tag set to tackle the sparsity problem. Experimental results show that our tag-oriented summarization achieves a significant improvement over approaches that do not use tags.
Chen, Dewei
Chen, Dewei and Tang, Jie and Li, Juanzi and Zhou, Lizhu Discovering the Staring People From Social Networks. In this paper, we study a novel problem of staring people discovery from social networks, which is concerned with finding people who are not only authoritative but also sociable in the social network. We formalize this problem as an optimization programming problem. Taking the co-author network as a case study, we define three objective functions and propose two methods to combine these objective functions. A genetic algorithm based method is further presented to solve this problem. Experimental results show that the proposed solution can effectively find the staring people from social networks.
Chen, Huajun
Lu, Bin and Wu, Zhaohui and Ni, Yuan and Xie, Guotong and Zhou, Chunying and Chen, Huajun sMash: Semantic-based Mashup Navigation for Data API Network. With the proliferation of data APIs, it is not uncommon for users who have no clear idea about data APIs to encounter difficulties building mashups that satisfy their requirements. In this paper, we present a semantic-based mashup navigation system, sMash, that makes mashup building easy by constructing and visualizing a real-life data API network. We build a sample network by gathering more than 300 popular APIs and find that the relationships between them are so complex that our system will play an important role in guiding users and inspiring them to build interesting mashups easily. The system is accessible at: http://www.dart.zju.edu.cn/mashup.
Chen, Yunfei
Chen, Zheng
Wang, Gang and Hu, Jian and Zhu, Yunzhang and Li, Hua and Chen, Zheng Competitive Analysis from Click-Through Log. Existing keyword suggestion tools from various search engine companies can automatically suggest keywords related to the advertisers’ products or services, taking into account simple statistics of the keywords, such as search volume, cost per click (CPC), etc. However, the nature of the generalized Second Price Auction suggests that a better understanding of the competitors’ keyword selection and bidding strategies helps more in winning the auction than relying only on general search statistics. In this paper, we propose a novel keyword suggestion strategy, called Competitive Analysis, to explore the keyword-based competition relationships among advertisers and eventually help advertisers to build campaigns with better performance. The experimental results demonstrate that the proposed Competitive Analysis can both help advertisers to promote their product sales and generate more revenue for the search engine companies.
Liu, Ning and Yan, Jun and Fan, Weiguo and Yang, Qiang and Chen, Zheng Identifying Vertical Search Intention of Query through Social Tagging Propagation. A pressing task during the unification process is to identify a user’s vertical search intention based on the user’s query. In this paper, we propose a novel method to propagate social annotation, which includes user-supplied tag data, to both queries and VSEs for semantically bridging them. Our proposed algorithm consists of three key steps: query annotation, vertical annotation and query intention identification. Our algorithm, referred to as TagQV, verifies that the social tagging can be propagated to represent Web objects such as queries and VSEs besides Web pages. Experiments on real Web search queries demonstrate the effectiveness of TagQV in query intention identification.
Ni, Xiaochuan and Sun, Jian-Tao and Hu, Jian and Chen, Zheng Mining Multilingual Topics from Wikipedia. In this paper, we try to leverage a large-scale and multilingual knowledge base, Wikipedia, to help effectively analyze and organize Web information written in different languages. Based on the observation that one Wikipedia concept may be described by articles in different languages, we adapt existing topic modeling algorithm for mining multilingual topics from this knowledge base. The extracted “universal” topics have multiple types of representations, with each type corresponding to one language. Accordingly, new documents of different languages can be represented in a space using a group of universal topics, which makes various multilingual Web applications feasible.
Liu, Ning and Yan, Jun and Chen, Zheng A Probabilistic Model Based Approach for Blended Search. In this paper, we propose to model the blended search problem by assuming conditional dependencies among queries, VSEs and search results. The probability distributions of this model are learned from search engine query log through unigram language model. Our experimental exploration shows that, (1) a large number of queries in generic Web search have vertical search intentions; and (2) our proposed algorithm can effectively blend vertical search results into generic Web search, which can improve the Mean Average Precision (MAP) by as much as 16% compared to traditional Web search without blending. these components into a single list. However, from the classical meta-search problem’s configuration, the query log of component search engines is not available for study. In this extended abstract, we model the blended search problem based on the conditional dependencies among queries, VSEs and all the search results. We utilize the usage information, i.e. query log, of all the VSEs, which are not available for traditional metasearch engines, to learn the model parameters by the smoothed unigram language model. Finally, given a user query, the search results from both generic Web search and different VSEs are ranked together by inferring their probabilities of relevance to the given query. The main contributions of this work are, (1) through studying the belonging vertical search engines’ query log of a commercial search engine, we show the importance of blended search problem; (2) we propose a novel probabilistic model based approach to explore the blended search problem; and (3) we experimentally verify that our proposed algorithm can effectively blend vertical search results into generic Web search, which can improve the MAP by as much as 16% in contrast to traditional Web search without vertical search blending and by 10% compared to some other ranking baseline.
Yan, Jun and Liu, Ning and Qing Chang, Elaine and Ji, Lei and Chen, Zheng Search Result Re-ranking Based on Gap between Search Queries and Social Tags. Both search engine click-through log and social annotation have been utilized as user feedback for search result re-ranking. However, to the best of our knowledge, no previous study has explored the correlation between these two factors for the task of search result re-ranking. In this paper, we show that the gap between search queries and social tags of the same web page can well reflect its user preference score. Motivated by this observation, we propose a novel algorithm, called Query-Tag-Gap (QTG), to re-rank search results for better user satisfaction. Intuitively, on one hand, the search users’ intentions are generally described by their queries before they read the search results. On the other hand, the web annotators semantically tag web pages after they read the content of the pages. The difference between users’ recognition of the same page before and after they read it is a good reflection of user satisfaction. In this extended abstract, we formally define the query set and tag set of the same page as users’ pre- and post-knowledge respectively. We empirically show the strong correlation between user satisfaction and user’s knowledge gap before and after reading the page. Based on this gap, experiments have shown outstanding performance of our proposed QTG algorithm in search result re-ranking.
Chi, Mingmin
Chi, Mingmin and Zhang, Peiwu and Zhao, Yingbin and Feng, Rui and Xue, Xiangyang Web Image Retrieval ReRanking with Multi-view Clustering. General image retrieval is often carried out by a text-based search engine, such as Google Image Search. In this case, natural language queries are used as input to the search engine. Usually, the user queries are quite ambiguous and the returned results are not well organized, as the ranking is often done by the popularity of an image. In order to address these problems, we propose to use both the textual and visual contents of retrieved images to reRank web retrieved results. In particular, a machine learning technique, a multi-view clustering algorithm, is proposed to reorganize the original results provided by the text-based search engine. Preliminary results validate the effectiveness of the proposed framework.
Chung, Chin-Wan
Lee, Jihyun and Min, Jun-Ki and Chung, Chin-Wan An Effective Semantic Search Technique using Ontology. In this paper, we present a semantic search technique considering the type of desired Web resources and the semantic relationships between the resources and the query keywords in the ontology. In order to effectively retrieve the most relevant top-k resources, we propose a novel ranking model. To do this, we devise a measure to determine the weight of the semantic relationship. In addition, we consider the number of meaningful semantic relationships between a resource and keywords, the coverage of keywords, and the distinguishability of keywords. Through experiments using real datasets, we observe that our ranking model provides more accurate semantic search results compared to existing ranking models.
Chung, Sukwon
Chung, Sukwon and Shiowattana, Dungjit and Dmitriev, Pavel and Chan, Su The Web of Nations. In this paper, we report on a large-scale study of structural differences among the national webs. The study is based on a web-scale crawl conducted in the summer of 2008. More specifically, we study two graphs derived from this crawl: the nation graph, with nodes corresponding to nations and edges to links among nations, and the host graph, with nodes corresponding to hosts and edges to hyperlinks among pages on the hosts. Contrary to some of the previous work [2], our results show that webs of different nations are often very different from each other, both in terms of their internal structure, and in terms of their connectivity with other nations.
Chung Su, Tse
Shieh, Jyh-Ren and Hsieh, Yung-Huan and Yeh, Yang-Ting and Chung Su, Tse and Lin, Ching-Yung and Wu, Ja-Ling Building Term Suggestion Relational Graphs from Collective Intelligence. This paper proposes an effective approach to provide relevant search terms for conceptual Web search. A ‘Semantic Term Suggestion’ function has been included so that users can find the query term most appropriate to what they really need. Conventional approaches for term suggestion involve extracting frequently occurring key terms from retrieved documents. They must deal with term extraction difficulties and interference from irrelevant documents. In this paper, we propose a semantic term suggestion function called Collective Intelligence based Term Suggestion (CITS). CITS provides a novel social-network based framework for relevant term suggestion with a semantic graph of the search term, without being limited to the specific query term. A visualization of the semantic graph is presented to the users to help them browse search results from related terms in the semantic graph. The search results are ranked each time according to their relevance to the related terms in the entire query session. Compared to two popular commercial search engines, a user study of 18 users on 50 search terms showed better user satisfaction and indicated the potential usefulness of the proposed method in real-world search applications.
Cordier, Marie-Odile
Wang, Wei and Masseglia, Florent and Guyet, Thomas and Quiniou, Rene and Cordier, Marie-Odile A General Framework for Adaptive and Online Detection of Web Attacks. Detection of web attacks is an important issue in the current defense-in-depth security framework. In this paper, we propose a novel general framework for adaptive and online detection of web attacks. The general framework can be based on any online clustering method. A detection model based on the framework is able to learn online and deal with “concept drift” in web audit data streams. Str-DBSCAN, our extension of DBSCAN [1] to streaming data, and StrAP [3] are both used to validate the framework. The detection model based on the framework automatically labels the web audit data and adapts to changes in normal behavior while identifying attacks through dynamic clustering of the streaming data. A very large set of real HTTP log data collected in our institute is used to validate the framework and the model. The preliminary testing results demonstrate its effectiveness.
Cortez, Eli
Toda, Guilherme A. and Cortez, Eli and Mesquita, Filipe and da Silva, Altigran S. and Moura, Edleno and Neubert, Marden Automatically Filling Form-Based Web Interfaces with Free Text Inputs. On the web of today, the most prevalent way for users to interact with data-intensive applications is through form-based interfaces composed of several data input fields, such as text boxes, radio buttons, pull-down lists, check boxes, etc. Although these interfaces are popular and effective, in many cases free text interfaces are preferred over form-based ones. In this paper we discuss the proposal and the implementation of a novel IR-based method for using data-rich free text to interact with form-based interfaces. Our solution takes a free text as input, implicitly extracts data values from it and fills the appropriate fields with them. For this task, we rely on values of previous submissions for each field, which are freely obtained from the usage of form-based interfaces.
Darmaputra, Yansen
Krüepl, Bernhard and Holzinger, Wolfgang and Darmaputra, Yansen and Baumgartner, Robert A Flight Meta-Search Engine with Metamorph. We demonstrate a flight meta-search engine that is based on the Metamorph framework. Metamorph provides mechanisms to model web forms together with the interactions which are needed to fulfil a request, and can generate interaction sequences that pose queries using these web forms and collect the results. In this paper, we discuss an interesting new feature that makes use of the forms themselves as an information source. We show how data can be extracted from web forms (rather than the data behind web forms) to generate a graph of flight connections between cities. The flight connection graph allows us to vastly reduce the number of queries that the engine sends to airline websites in the most interesting search scenarios: those that involve the controversial practice of creative ticketing, in which agencies attempt to find lower price fares by using more than one airline for a journey. We describe a system which obtains data from a number of websites to identify promising routes and prune the search tree. Heuristics that make use of geographical information and an estimation of cost based on historical data are employed. The results are then made available to improve the quality of future search requests. Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software. General Terms: Algorithms, Design, Experimentation. Keywords: Hidden Web, Web Data Extraction, Web Form Mapping, Web Form Extraction.
Dasdan, Ali
Dasdan, Ali and Drome, Chris and Kolay, Santanu Thumbs-Up: A Game for Playing to Rank Search Results. Human computation is an effective way to channel human effort spent playing games to solving computational problems that are easy for humans but difficult for computers to automate. We propose Thumbs-Up, a new game for human computation with the purpose of playing to rank search results. Our experience from users shows that Thumbs-Up is not only fun to play, but produces more relevant rankings than both a major search engine and optimal rank aggregation using the Kemeny rule.
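For readers unfamiliar with the Kemeny rule named as a baseline in the Thumbs-Up abstract above, the following is a minimal illustrative sketch, not code from the paper. It assumes the standard definition: the consensus ranking is the one that minimizes the total number of pairwise disagreements (Kendall tau distance) with the input rankings. The function names and the toy votes are hypothetical, and the brute-force search over all permutations is only practical for a handful of items.

```python
from itertools import combinations, permutations


def kendall_tau_distance(a, b):
    """Count item pairs that rankings a and b order differently."""
    pos_b = {item: i for i, item in enumerate(b)}
    # combinations(a, 2) yields (x, y) with x ranked before y in a.
    return sum(1 for x, y in combinations(a, 2) if pos_b[x] > pos_b[y])


def kemeny_aggregate(rankings):
    """Brute-force Kemeny consensus over rankings of the same items."""
    items = sorted(rankings[0])
    best = min(
        permutations(items),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )
    return list(best)


# Hypothetical toy input: three rankings of three search results.
votes = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
consensus = kemeny_aggregate(votes)
print(consensus)
```

Exact Kemeny aggregation is computationally hard in general, so real systems typically rely on approximations; the sketch only illustrates the objective being optimized.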
Dasdan, Ali and Huynh, Xinh User-Centric Content Freshness Metrics for Search Engines. In order to return relevant search results, a search engine must keep its local repository synchronized to the Web, but it is usually impossible to attain perfect freshness. Hence, it is vital for a production search engine continually to monitor and improve repository freshness. Most previous freshness metrics, formulated in the context of developing better synchronization policies, focused on the web crawler while ignoring other parts of a search engine. But, the freshness of documents in a web crawler does not necessarily translate directly into the freshness of search results as seen by users. We propose metrics for measuring freshness from a user’s perspective, which take into account the latency between when documents are crawled and when they are viewed by users, as well as the variation in user click and view frequency among different documents. We also describe a practical implementation of these metrics that were used in a production search engine.
Kolay, Santanu and Dasdan, Ali The Value of Socially Tagged URLs for a Search Engine. Social bookmarking has emerged as a growing source of human generated content on the web. In essence, bookmarking involves URLs and tags on them. In this paper, we perform a large scale study of the usefulness of bookmarked URLs from the top social bookmarking site Delicious. Instead of focusing on the dimension of tags, which has been covered in the previous work, we explore social bookmarking from the dimension of URLs. More specifically, we investigate the Delicious URLs and their content to quantify their value to a search engine. For their value in leading to good content, we show that the Delicious URLs have higher quality content and more external outlinks. For their value in satisfying users, we show that the Delicious URLs have more clicked URLs as well as get more clicks. We suggest that based on their value, the Delicious URLs should be used as another source of seed URLs for crawlers.
Dasgupta, Sourish
Dasgupta, Sourish and Bhat, Satish and Lee, Yugyung SGPS: A Semantic Scheme for Web Service Similarity. Today’s Web is becoming a platform for services to be dynamically interconnected to produce a desired outcome. It is important to formalize the semantics of the contextual elements of web services. In this paper, we propose a novel technique called Semantic Genome Propagation Scheme (SGPS) for measuring similarity between semantic concepts. We show how SGPS is used to compute a multi-dimensional similarity between two services. We evaluate the SGPS similarity measurement in terms of similarity performance and scalability.
Dekhil, Mohamed
Ghosh, Riddhiman and Dekhil, Mohamed Discovering User Profiles. In this paper we describe techniques for the discovery and construction of user profiles. Leveraging the emergent data web, our system addresses the problem of sparseness of user profile information currently faced by both asserted and inferred profile systems. A profile mediator, which dynamically builds the most suitable user profile for a particular service or interaction in real time, is employed in our prototype implementation.
Della Valle, Emanuele
Deng, Ting
Deng, Ting and Huai, Jinpeng and Li, Xianxian and Du, Zongxia and Guo, Huipeng Automated Synthesis of Composite Services with Correctness Guarantee. In this paper, we propose a novel approach for composing existing web services to satisfy the correctness constraints on the design, including freedom from deadlock and unspecified reception, and temporal constraints expressed as Computation Tree Logic formulas. An automated synthesis algorithm based on a learning algorithm is introduced, which guarantees that the composite service is the most general way of coordinating services so that the correctness is ensured. We have implemented a prototype system evaluating the effectiveness and efficiency of our synthesis approach through an experimental study. In this paper we propose a novel approach to synthesize the composite service from a given set of services, where the designer only needs to set the correctness constraints on the desired behaviors of the targeted service and the synthesis will be automatically performed with the correctness guaranteed. We implemented a prototype system and the preliminary experimental results on a practical travel agent example show that our synthesis approach is effective and efficient.
Dhopeshwarkar, Sanket
Mathur, Vipul and Dhopeshwarkar, Sanket and Apte, Varsha MASTH Proxy: An Extensible Platform for Web Overload Control. Many overload control mechanisms for Web-based applications aim to prevent overload by setting limits on factors such as admitted load, number of server threads and buffer size. For this they need online measurements of metrics such as response time, throughput, and resource utilization. This requires instrumentation of the server by modifying server code, which may not be feasible or desirable. An alternate approach is to use a proxy between the clients and servers. We have developed a proxy-based overload control platform called MASTH Proxy: Multi-class Admission-controlled Self-Tuning Http Proxy. It records detailed measurements, supports multiple request classes, manages queues of HTTP requests, provides tunable parameters and enables easy implementation of dynamic overload control. This gives designers of overload control schemes a platform where they can concentrate on developing the core control logic, without the need to modify upstream server code.
Dmitriev, Pavel
Zheng, Shuyi and Dmitriev, Pavel and Lee Giles, C. Graph Based Crawler Seed Selection. This paper identifies and explores the problem of seed selection in a web-scale crawler. We argue that seed selection is not a trivial problem but a very important one. Selecting proper seeds can increase the number of pages a crawler will discover, and can result in a collection with more “good” and fewer “bad” pages. Based on the analysis of the graph structure of the web, we propose several seed selection algorithms. The effectiveness of these algorithms is demonstrated by our experimental results on real web data.
Chung, Sukwon and Shiowattana, Dungjit and Dmitriev, Pavel and Chan, Su The Web of Nations. In this paper, we report on a large-scale study of structural differences among the national webs. The study is based on a web-scale crawl conducted in the summer of 2008. More specifically, we study two graphs derived from this crawl: the nation graph, with nodes corresponding to nations and edges to links among nations, and the host graph, with nodes corresponding to hosts and edges to hyperlinks among pages on the hosts. Contrary to some of the previous work [2], our results show that webs of different nations are often very different from each other, both in terms of their internal structure, and in terms of their connectivity with other nations.
Dom, Byron
He, Xiaofeng and Duan, Lei and Zhou, Yiping and Dom, Byron Threshold Selection for Web-Page Classification with Highly Skewed Class Distribution. We propose a novel cost-efficient approach to threshold selection for binary web-page classification problems with imbalanced class distributions. In many binary-classification tasks the distribution of classes is highly skewed. In such problems, using uniform random sampling in constructing sample sets for threshold setting requires large sample sizes in order to include a statistically sufficient number of examples of the minority class. On the other hand, manually labeling examples is expensive and budgetary considerations require that the size of sample sets be limited. These conflicting requirements make threshold selection a challenging problem. Our method of sample-set construction is a novel approach based on stratified sampling, in which manually labeled examples are expanded to reflect the true class distribution of the web-page population. Our experimental results show that using false positive rate as the criterion for threshold setting results in lower-variance threshold estimates than using other widely used accuracy measures such as F1 and precision.
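The idea in the abstract above of expanding a stratified labeled sample to reflect the population, and then choosing a threshold by false positive rate, can be illustrated with a small sketch. This is an assumption-laden illustration rather than the authors' method: the strata, the weighting scheme (population stratum size divided by number of labeled examples in the stratum), the target rate and all names are hypothetical.

```python
def weighted_fpr(samples, threshold):
    """Estimate the population false positive rate at a score threshold.

    samples: (score, is_positive, weight) triples, where weight is the
    number of population pages each labeled example stands for.
    """
    negatives = sum(w for _, y, w in samples if not y)
    false_pos = sum(w for s, y, w in samples if not y and s >= threshold)
    return false_pos / negatives if negatives else 0.0


def select_threshold(samples, target_fpr):
    """Return the lowest observed score whose estimated FPR meets the target."""
    for t in sorted({s for s, _, _ in samples}):
        if weighted_fpr(samples, t) <= target_fpr:
            return t
    return None


# Hypothetical strata: (population size, labeled (score, is_positive) pairs).
strata = [
    (9000, [(0.1, False), (0.2, False), (0.3, False)]),               # low scores
    (1000, [(0.6, False), (0.7, True), (0.8, False), (0.9, True)]),   # high scores
]

# Expand each labeled example by population_size / labeled_count.
samples = [
    (score, label, population / len(labeled))
    for population, labeled in strata
    for score, label in labeled
]

chosen = select_threshold(samples, target_fpr=0.03)
print(chosen, weighted_fpr(samples, chosen))
```

Oversampling the high-score stratum puts more of the labeling budget where minority-class pages are likely to be found, while the weights keep the rate estimate tied to the full population.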
Domínguez García, Renato
Scholl, Philipp and Domínguez García, Renato and Böhnstedt, Doreen and Rensing, Christoph and Steinmetz, Ralf Towards Language-Independent Web Genre Detection. The term web genre denotes the type of a given web resource, in contrast to the topic of its content. In this research, we focus on recognizing the web genres blog, wiki and forum. We present a set of features that exploit the hierarchical structure of the web page’s HTML mark-up and thus, in contrast to related approaches, do not depend on a linguistic analysis of the page’s content. Our results show that it is possible to achieve very good accuracy for a fully language-independent detection of structured web genres. features (e.g. part-of-speech tagging and document terms), structural features (e.g. HTML tag frequencies, use of facets used to enable functionalities like form input elements) and simple text statistics (e.g. frequencies of punctuation). However, a fact often neglected by related work is that the absolute dominance of the English language on the web is decreasing. Thus, it is important to develop a way of recognizing web genres independently of the language used on the respective web page. As many genres exhibit a certain structural and visual layout, this property makes it possible to ignore linguistic features altogether.
Dong, Zheng-Bin
Dong, Zheng-Bin and Song, Guo-Jie and Xie, Kun-Qing and Wang, Jing-Yao An Experimental Study of Large-Scale Mobile Social Network. A mobile social network is a typical social network in which individuals with similar interests or commonalities converse and connect with one another using the mobile phone. Our work in this paper focuses on the experimental study of this kind of social network with the support of large-scale real mobile call data. The main contributions are three-fold: firstly, a large-scale real mobile phone call log of one city has been extracted from a mobile phone carrier in China to construct a mobile social network; secondly, common features of traditional social networks, such as power-law distribution and small diameter, have been tested, with which we confirm that the mobile social network is a typical scale-free network and exhibits the small-world phenomenon; lastly, different from traditional analytical methods, important properties of the actors, such as gender and age, have been introduced into our experiments, with some interesting findings about human behavior: for example, middle-aged people are more active than the young and the old, and in old age women are unusually more active than men.
Drome, Chris
Dasdan, Ali and Drome, Chris and Kolay, Santanu Thumbs-Up: A Game for Playing to Rank Search Results. Human computation is an effective way to channel human effort spent playing games into solving computational problems that are easy for humans but difficult for computers to automate. We propose Thumbs-Up, a new game for human computation with the purpose of playing to rank search results. Our experience with users shows that Thumbs-Up is not only fun to play, but also produces more relevant rankings than both a major search engine and optimal rank aggregation using the Kemeny rule.
Du, Zongxia
Deng, Ting and Huai, Jinpeng and Li, Xianxian and Du, Zongxia and Guo, Huipeng Automated Synthesis of Composite Services with Correctness Guarantee. In this paper, we propose a novel approach for composing existing web services to satisfy correctness constraints on the design, including freedom from deadlock and unspecified reception, and temporal constraints expressed as Computation Tree Logic formulas. An automated synthesis algorithm based on a learning algorithm is introduced, which guarantees that the composite service is the most general way of coordinating services so that correctness is ensured. We have implemented a prototype system and evaluated the effectiveness and efficiency of our synthesis approach through an experimental study. In this paper we propose a novel approach to synthesizing a composite service from a given set of services, in which the designer only needs to set the correctness constraints on the desired behaviors of the target service and the synthesis is then performed automatically with correctness guaranteed. We implemented a prototype system, and preliminary experimental results on a practical travel agent example show that our synthesis approach is effective and efficient.
Duan, Lei
He, Xiaofeng and Duan, Lei and Zhou, Yiping and Dom, Byron Threshold Selection for Web-Page Classification with Highly Skewed Class Distribution. We propose a novel cost-efficient approach to threshold selection for binary web-page classification problems with imbalanced class distributions. In many binary-classification tasks the distribution of classes is highly skewed. In such problems, using uniform random sampling in constructing sample sets for threshold setting requires large sample sizes in order to include a statistically sufficient number of examples of the minority class. On the other hand, manually labeling examples is expensive and budgetary considerations require that the size of sample sets be limited. These conflicting requirements make threshold selection a challenging problem. Our method of sample-set construction is a novel approach based on stratified sampling, in which manually labeled examples are expanded to reflect the true class distribution of the web-page population. Our experimental results show that using false positive rate as the criterion for threshold setting results in lower-variance threshold estimates than using other widely used accuracy measures such as F1 and precision.
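The abstract does not say how the strata are formed or how labeled examples are expanded, so the following is only a minimal sketch of one plausible reading: each labeled page in a stratum is weighted by population size divided by sample size, and the false positive rate at a threshold is estimated from those weights. The names `Stratum`, `estimated_fpr` and `pick_threshold` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Stratum:
    # Number of pages in this stratum in the full collection (assumed known).
    population: int
    # Manually labeled sample: (classifier score, is_positive) pairs.
    labeled: List[Tuple[float, bool]]


def estimated_fpr(strata: List[Stratum], threshold: float) -> float:
    """Weighted false positive rate at a threshold.

    Each labeled page stands for population / len(labeled) pages of its
    stratum, which is the assumed way of expanding the sample here.
    """
    false_pos = 0.0
    negatives = 0.0
    for s in strata:
        if not s.labeled:
            continue
        weight = s.population / len(s.labeled)
        for score, is_positive in s.labeled:
            if not is_positive:
                negatives += weight
                if score >= threshold:
                    false_pos += weight
    return false_pos / negatives if negatives else 0.0


def pick_threshold(strata: List[Stratum], max_fpr: float) -> Optional[float]:
    """Lowest observed score whose estimated FPR stays within max_fpr."""
    candidates = sorted({score for s in strata for score, _ in s.labeled})
    for t in candidates:
        if estimated_fpr(strata, t) <= max_fpr:
            return t
    return None
```

Because the estimated false positive rate can only fall as the threshold rises, scanning the candidate scores in ascending order returns the most permissive threshold that still meets the target.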
Eager, Derek
Eda, Takeharu
Eda, Takeharu and Uchiyama, Toshio and Uchiyama, Tadasu and Yoshikawa, Masatoshi Signaling Emotion in Tagclouds. In order to create more attractive tagclouds that get people interested in tagged content, we propose a simple but novel tagcloud in which font size is determined by a tag’s entropy value rather than by its popularity. Our method raises users’ emotional interest in the content by emphasizing more emotional tags. Our initial experiments show that emotional tagclouds attract more attention than normal tagclouds at first glance; thus they will enhance the role of the tagcloud as a social signaller.
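The abstract does not define the distribution over which a tag's entropy is taken, nor whether high or low entropy yields a larger font. As a rough sketch only, the code below assumes each tag comes with usage counts over some set of buckets (for example, content items) and maps higher Shannon entropy linearly to a larger font size. The function names, the bucket counts and the point-size range are all hypothetical.

```python
import math
from typing import Dict, List


def shannon_entropy(counts: List[int]) -> float:
    """Shannon entropy (in bits) of a count distribution."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def font_sizes(tag_counts: Dict[str, List[int]],
               min_pt: int = 10, max_pt: int = 32) -> Dict[str, int]:
    """Map each tag's entropy linearly onto [min_pt, max_pt].

    Assumption: higher entropy means a larger font; the source does not
    state the direction of the mapping.
    """
    entropies = {tag: shannon_entropy(c) for tag, c in tag_counts.items()}
    lo, hi = min(entropies.values()), max(entropies.values())
    span = hi - lo
    sizes = {}
    for tag, h in entropies.items():
        if span == 0:
            sizes[tag] = min_pt  # all tags equally informative
        else:
            sizes[tag] = round(min_pt + (h - lo) / span * (max_pt - min_pt))
    return sizes
```

A conventional tagcloud would use the raw tag frequency in place of `shannon_entropy`; swapping that single function is the only difference between the two renderings in this sketch.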
Fan, Weiguo
Liu, Ning and Yan, Jun and Fan, Weiguo and Yang, Qiang and Chen, Zheng Identifying Vertical Search Intention of Query through Social Tagging Propagation. A pressing task during the unification process is to identify a user’s vertical search intention based on the user’s query. In this paper, we propose a novel method to propagate social annotation, which includes user-supplied tag data, to both queries and vertical search engines (VSEs) for semantically bridging them. Our proposed algorithm consists of three key steps: query annotation, vertical annotation and query intention identification. Our algorithm, referred to as TagQV, verifies that social tagging can be propagated to represent Web objects such as queries and VSEs in addition to Web pages. Experiments on real Web search queries demonstrate the effectiveness of TagQV in query intention identification.
Farahat, Ayman
Farahat, Ayman Privacy Preserving Frequency Capping in Internet Banner Advertising. We describe an optimize-and-dispatch approach for delivering pay-per-impression advertisements in online advertising. The platform provider for an advertising network commits to showing advertisers’ banner ads while capping the number of advertising messages shown to a unique user as the user transitions through the network. The traditional approach for enforcing frequency caps has been to use cross-site cookies to track users. However, cross-site cookies and other tracking mechanisms can infringe on user privacy. In this paper, we propose a novel linear programming approach that decides when to show an ad to the user based solely on the page currently viewed by the user. We show that the frequency caps are fulfilled in expectation. We show the efficacy of that approach using simulation results. Categories and Subject Descriptors: G.3 Mathematics of Computing: Probability and Statistics. General Terms: Algorithms. Keywords: User Model, Markov Chain. [...] to transition from one section of the advertising network to another based on a random yet known probability transition matrix. The traditional approach to frequency capping is to use cross-site cookies to track users through the web properties where the advertising network is serving ads. The cookies are used to keep a count of the number of ads the user has seen. When the user has reached the maximum daily cap, no further ads are shown. However, there has been growing concern over the privacy issues associated with tracking the user across multiple sites. Furthermore, up to 33% of users delete their cookies, making the cookie-based approach unreliable [3]. We propose a novel algorithm that can be used to ensure that the frequency caps are fulfilled in expectation. The approach is based on formulating a linear optimization program that maximizes the expected number of ads seen by the user subject to the frequency cap constraints. The solution to the linear program gives a set of probabilistic weights used by the ad server to decide whether to serve the ad when a user arrives at a specific web page.
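The linear program itself is not given in the abstract, so the sketch below is a deliberately simplified stand-in rather than the author's formulation. It assumes a single ad, a single daily cap, and a known expected number of visits per page (for example, derived from the Markov user model). Under those assumptions, maximizing the sum of visits times serve probability, subject to that sum not exceeding the cap and each probability lying in [0, 1], is solved by scaling all probabilities uniformly. All names here are hypothetical.

```python
import random
from typing import Dict


def serve_probabilities(expected_visits: Dict[str, float],
                        cap: float) -> Dict[str, float]:
    """Per-page serve weights whose expected impressions do not exceed cap.

    Simplified single-ad case: uniform scaling is one optimal solution of
    max sum(v * p) subject to sum(v * p) <= cap and 0 <= p <= 1.
    """
    total = sum(expected_visits.values())
    p = 1.0 if total <= cap else cap / total
    return {page: p for page in expected_visits}


def expected_impressions(expected_visits: Dict[str, float],
                         weights: Dict[str, float]) -> float:
    """Expected number of ads seen by one user under the given weights."""
    return sum(v * weights[page] for page, v in expected_visits.items())


def should_serve(page: str, weights: Dict[str, float], rng=random) -> bool:
    """Ad-server decision that uses only the current page, with no user tracking."""
    return rng.random() < weights[page]
```

The point of the design is visible in `should_serve`: the decision depends only on the page being viewed and a random draw, so no cookie or per-user counter is needed, and the cap holds on average rather than for every individual user.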
Feng, Dan
Tu, Xudong and Wang, Xin-Jing and Feng, Dan and Zhang, Lei Ranking Community Answers via Analogical Reasoning. Due to the lexical gap between questions and answers, automatically detecting the right answers is very challenging for community question-answering sites. In this paper, we propose an analogical reasoning-based method. It treats questions and answers as relational data and ranks an answer by measuring the analogy of its link to a query with the links embedded in previous relevant knowledge; the answer that links in the most analogous way to the new question is assumed to be the best answer. We based our experiments on 29.8 million Yahoo! Answers question-answer threads and showed the effectiveness of the approach.
Feng, Jianhua
Li, Guoliang and Feng, Jianhua and Zhou, Lizhu Interactive Search in XML Data. In a traditional keyword-search system over XML data, a user composes a keyword query, submits it to the system, and retrieves relevant subtrees. When the user has limited knowledge about the data, the user often feels “left in the dark” when issuing queries, and has to use a try-and-see approach for finding information. In this paper, we study a new information-access paradigm for XML data, called “Inks,” in which the system searches the underlying data “on the fly” as the user types in query keywords. Inks extends existing XML keyword search methods by interactively answering keyword queries. We propose effective indices, early-termination techniques, and efficient search algorithms to achieve a high interactive speed. We have implemented our algorithm. The experimental results show that Inks achieves high search efficiency and result quality.
Feng, Rui
Chi, Mingmin and Zhang, Peiwu and Zhao, Yingbin and Feng, Rui and Xue, Xiangyang Web Image Retrieval ReRanking with Multi-view Clustering. General image retrieval is often carried out by a text-based search engine, such as Google Image Search. In this case, natural language queries are used as input to the search engine. Usually, the user queries are quite ambiguous and the returned results are not well organized, as the ranking is often done by the popularity of an image. In order to address these problems, we propose to use both the textual and visual contents of retrieved images to re-rank web retrieval results. In particular, a machine learning technique, namely a multi-view clustering algorithm, is proposed to reorganize the original results provided by the text-based search engine. Preliminary results validate the effectiveness of the proposed framework.
Firan, Claudiu S.
Bischoff, Kerstin and Firan, Claudiu S. and Paiu, Raluca Deriving Music Theme Annotations from User Tags. Music theme annotations would be really beneficial for supporting retrieval, but are often neglected by users while annotating. Thus, in order to support users in tagging and to fill the gaps in the tag space, in this paper we develop algorithms for recommending theme annotations. Our methods exploit already existing user tags, the lyrics of music tracks, as well as combinations of both. We compare the results for our recommended theme annotations against genre and style recommendations – a much easier and already studied task. We evaluate the quality of our recommended tags against an expert ground truth data set. Our results are promising and provide interesting insights into possible extensions for music tagging systems to support music search.
Formoso, Vreixo
Baraglia, Ranieri and Cacheda, Fidel and Carneiro, Victor and Formoso, Vreixo and Perego, Raffaele and Silvestri, Fabrizio Search Shortcuts: Driving Users Towards Their Goals. Giving suggestions to users of Web-based services is a common practice aimed at enhancing their navigation experience. Major Web search engines usually provide suggestions in the form of queries that are, to some extent, related to the current query typed by the user and to the knowledge learned from past usage of the system. In this work we introduce Search Shortcuts as “successful” queries that allowed users, in the past, to satisfy their information needs. Unlike conventional suggestion techniques, our search shortcuts allow effectiveness to be evaluated with a simple train-and-test approach. We have applied several Collaborative Filtering algorithms to this problem, evaluating them on real query log data. We generate the shortcuts from all user sessions belonging to the testing set, and measure the quality of the suggested shortcuts by considering the similarity between them and the navigational user behavior.
Francesco Barbieri, Davide
Gabrilovich, Evgeniy
Chang, William and Pantel, Patrick and Popescu, Ana-Maria and Gabrilovich, Evgeniy Towards Intent-Driven Bidterm Suggestion. In online advertising, pervasive in commercial search engines, advertisers typically bid on few terms, and the scarcity of data makes ad matching difficult. Suggesting additional bidterms can significantly improve ad clickability and conversion rates. In this paper, we present a large-scale bidterm suggestion system that models an advertiser’s intent and finds new bidterms consistent with that intent. Preliminary experiments show that our system significantly increases the coverage of a state-of-the-art production system used at Yahoo while maintaining comparable precision.
Gang, Lu
Wang, Junfeng and He, Xiaofei and Wang, Can and Pei, Jian and Bu, Jiajun and Chen, Chun and Guan, Ziyu and Gang, Lu News Article Extraction with Template-Independent Wrapper. We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
Gatellier, Patrick
Brunner, Jean-Sebastien and Gatellier, Patrick Raise Semantics at the User Level for Dynamic and Interactive SOA-based Portals. In this paper, we describe the fully dynamic semantic portal we implemented, integrating Semantic Web technologies and Service Oriented Architecture (SOA). The goals of the portal are twofold: first, it helps administrators to easily propose new features in the portal, using semantics to ease the orchestration process; second, it automatically generates a customized user interface for these scenarios. This user interface takes into account different devices and assists end-users in the use of the portal by taking advantage of context awareness. All the added value of this portal is based on core semantics defined by an ontology. We present here the main features of this portal and how it was implemented using state-of-the-art technologies and frameworks.
Gatius, Marta
Gatius, Marta and González, Meritxell A Flexible Dialogue System for Enhancing Web Usability. In this paper, we study how the performance and usability of web dialogue systems could be enhanced by using an appropriate representation of the different types of knowledge involved in communication: general dialogue mechanisms, specific domain-restricted linguistic and conceptual knowledge, and information on how well the communication process is doing. We describe the experiments carried out to analyze how to improve this knowledge representation in the web dialogue system we developed. [...] the system to the user’s expertise and application complexity we have distinguished two types of messages: directed and open. Directed system messages are explicit about the information the system needs from the user at each state of the communication. Open system messages prompt the user to provide the information needed, but not as strongly. Although different types of messages could be used, the initiative (or control) of the dialogue is always mixed because the user can decide to select a new task at any state of the communication and the system will always guide the user to provide the needed information. In order to improve the adaptability of the dialogue system (DS), several systems ([1],[3]) dynamically adapt the dialogue strategy. In our DS we have incorporated an independent module that uses data on how well the communication is doing to determine the type of message and the confirmation policy. Following the methodology proposed in [1], we analyzed a corpus of dialogues to obtain the data that gives information about the most appropriate system strategy and the amount of evidence each piece of data gives. The dialogue cues used by the adaptive module to determine the system’s response are related to the system’s errors as well as to the content of the user’s intervention (asking for help, giving new relevant data and giving unexpected data).
Geng, Guang-Gang
Geng, Guang-Gang and Li, Qiudan and Zhang, Xinchang Link Based Small Sample Learning for Web Spam Detection. Robust statistical-learning-based web spam detection systems often require large amounts of labeled training data. However, labeled samples are more difficult, expensive and time-consuming to obtain than unlabeled ones. This paper proposes link-based semi-supervised learning algorithms to boost the performance of a classifier by integrating traditional self-training with topological-dependency-based link learning. Experiments with a few labeled samples on the standard WEBSPAM-UK2006 benchmark showed that the algorithms are effective.
Ghosh, Riddhiman
Ghosh, Riddhiman and Dekhil, Mohamed Discovering User Profiles. In this paper we describe techniques for the discovery and construction of user profiles. Leveraging the emergent data web, our system addresses the problem of sparseness of user profile information currently faced by both asserted and inferred profile systems. A profile mediator that dynamically builds the most suitable user profile for a particular service or interaction in real time is employed in our prototype implementation.
Golbandi, Nadav
Amitay, Einat and Carmel, David and Har'El, Nadav and Ofek-Koifman, Shila and Soffer, Aya and Yogev, Sivan and Golbandi, Nadav Social Search and Discovery Using a Unified Approach. We explore new ways of improving a search engine using data from Web 2.0 applications such as blogs and social bookmarks. This data contains entities such as documents, people and tags, and relationships between them. We propose a simple yet effective method, based on faceted search, that treats all entities in a unified manner: returning all of them (documents, people and tags) on every search, and allowing all of them to be used as search terms. We describe an implementation of such a social search engine on the intranet of a large enterprise, and present large-scale experiments which verify the validity of our approach.
Gomes, Paulo
Oliveira, Pedro and Gomes, Paulo Instance-based Probabilistic Reasoning in the Semantic Web. Most approaches for dealing with uncertainty in the Semantic Web rely on the principle that this uncertainty is already asserted. In this paper, we propose a new approach to learn and reason about uncertainty in the Semantic Web. Using instance data, we learn the uncertainty of an OWL ontology, and use that information to perform probabilistic reasoning on it. For this purpose, we use Markov logic, a new representation formalism that combines logic with probabilistic graphical models. [...] cumbersome and difficult task, invalidating all the gains that could arise from the annotation. In fact, uncertainty is a common characteristic of the current Web. When we create a webpage, for example, search engines are responsible for assessing its probabilistic relevance, compared to other pages, to certain topics. We don’t have to state that information explicitly: we just create the content, and search engines do the rest. So, we must develop similar automatic mechanisms to perform reasoning in the Semantic Web. In this work, we study how to perform probabilistic reasoning on OWL ontologies without any kind of uncertainty annotation. To assert the uncertainty of an ontology’s axioms, we use solely the information in its instances. For this purpose, we use Markov logic [4], a novel approach that combines logic and probability in the same representation.
González, Meritxell
Gatius, Marta and González, Meritxell A Flexible Dialogue System for Enhancing Web Usability. In this paper, we study how the performance and usability of web dialogue systems could be enhanced by using an appropriate representation of the different types of knowledge involved in communication: general dialogue mechanisms, specific domain-restricted linguistic and conceptual knowledge, and information on how well the communication process is doing. We describe the experiments carried out to analyze how to improve this knowledge representation in the web dialogue system we developed. [...] the system to the user’s expertise and application complexity we have distinguished two types of messages: directed and open. Directed system messages are explicit about the information the system needs from the user at each state of the communication. Open system messages prompt the user to provide the information needed, but not as strongly. Although different types of messages could be used, the initiative (or control) of the dialogue is always mixed because the user can decide to select a new task at any state of the communication and the system will always guide the user to provide the needed information. In order to improve the adaptability of the dialogue system (DS), several systems ([1],[3]) dynamically adapt the dialogue strategy. In our DS we have incorporated an independent module that uses data on how well the communication is doing to determine the type of message and the confirmation policy. Following the methodology proposed in [1], we analyzed a corpus of dialogues to obtain the data that gives information about the most appropriate system strategy and the amount of evidence each piece of data gives. The dialogue cues used by the adaptive module to determine the system’s response are related to the system’s errors as well as to the content of the user’s intervention (asking for help, giving new relevant data and giving unexpected data).
Gorshkov, Andrey
Nikolaev, Kirill and Zudina, Ekaterina and Gorshkov, Andrey Combining Anchor Text Categorization and Graph Analysis for Paid Link Detection. In order to artificially boost the rank of commercial pages in search engine results, search engine optimizers pay for links to these pages on other websites. Identifying paid links is important for a web search engine to produce highly relevant results. In this paper we introduce a novel method of identifying such links. We start by training a classifier of anchor text topics and analyzing web pages for the diversity of their outgoing commercial links. Then we use this information and analyze the link graph of the Russian Web to find pages that sell links and sites that buy links, and to identify the paid links. Testing on manually marked samples showed high efficiency of the algorithm.
Gottron, Thomas
Grefenstette, Gregory
Popescu, Adrian and Grefenstette, Gregory Deducing Trip Related Information from Flickr. Uploading tourist photos is a popular activity on photo sharing platforms. These photographs and their associated metadata (tags, geo-tags, and temporal information) should be useful for mining information about the sites visited. However, user-supplied metadata are often noisy and efficient filtering methods are needed before extracting useful knowledge. We focus here on exploiting temporal information, associated with tourist sites that appear in Flickr. From automatically filtered sets of geo-tagged photos, we deduce answers to questions like “how long does it take to visit a tourist attraction?” or “what can I visit in one day in this city?” Our method is evaluated and validated by comparing the automatically obtained visit duration times to manual estimations.
Grineva, Maria
Grossniklaus, Michael
Guan, Ziyu
Wang, Junfeng and He, Xiaofei and Wang, Can and Pei, Jian and Bu, Jiajun and Chen, Chun and Guan, Ziyu and Gang, Lu News Article Extraction with Template-Independent Wrapper. We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
Gunopulos, Dimitrios
Platakis, Manolis and Kotsakos, Dimitrios and Gunopulos, Dimitrios Searching for Events in the Blogosphere. Over the last few years, blogs (web logs) have gained massive popularity and have become one of the most influential web social media in our times. Every blog post in the Blogosphere has a well defined timestamp, which is not taken into account by search engines. By conducting research regarding this feature of the Blogosphere, we can attempt to discover bursty terms and correlations between them during a time interval. We apply Kleinberg’s automaton on extracted titles of blog posts to discover bursty terms, we introduce a novel representation of a term’s burstiness evolution called State Series and we employ a Euclidean-based distance metric to discover potential correlations between terms without taking into account their context. We evaluate the results trying to match them with real life events. Finally, we propose some ideas for further evaluation techniques and future research in the field.
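The abstract does not spell out the State Series representation, so the following sketch assumes it is one integer burst state per time step for each term (as a burst-detection automaton might emit) and that all series cover the same time interval. It shows only the Euclidean comparison step, not Kleinberg's automaton; the function names and the sample data are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def state_series_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Euclidean distance between two equally long state series."""
    if len(a) != len(b):
        raise ValueError("state series must cover the same time steps")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_terms(series_by_term: Dict[str, List[int]],
                  term: str, top_k: int = 3) -> List[str]:
    """Terms whose burstiness evolves most similarly to the given term."""
    target = series_by_term[term]
    scored = [
        (state_series_distance(target, series), other)
        for other, series in series_by_term.items()
        if other != term
    ]
    return [other for _, other in sorted(scored)[:top_k]]
```

For example, with `{"election": [0, 1, 2, 1, 0], "vote": [0, 1, 2, 2, 0], "football": [2, 0, 0, 0, 2]}`, the term nearest to "election" is "vote", since their series differ at a single time step. The comparison uses only the timing of bursts and ignores the textual context of the terms, which matches the description in the abstract.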
Guo, Huipeng
Deng, Ting and Huai, Jinpeng and Li, Xianxian and Du, Zongxia and Guo, Huipeng Automated Synthesis of Composite Services with Correctness Guarantee. In this paper, we propose a novel approach for composing existing web services to satisfy correctness constraints on the design, including freedom from deadlock and unspecified reception, and temporal constraints expressed as Computation Tree Logic formulas. An automated synthesis algorithm based on a learning algorithm is introduced, which guarantees that the composite service is the most general way of coordinating services so that correctness is ensured. We have implemented a prototype system and evaluated the effectiveness and efficiency of our synthesis approach through an experimental study. In this paper we propose a novel approach to synthesizing a composite service from a given set of services, in which the designer only needs to set the correctness constraints on the desired behaviors of the target service and the synthesis is then performed automatically with correctness guaranteed. We implemented a prototype system, and preliminary experimental results on a practical travel agent example show that our synthesis approach is effective and efficient.
Guo, Li
Li, Yang and Lu, Tian-Bo and Guo, Li and Tian, Zhi-Hong and Nie, Qin-Wu Towards Lightweight and Efficient DDoS Attacks Detection for Web Server. In this poster, building on our previous work on a lightweight DDoS (Distributed Denial-of-Service) attack detection mechanism for web servers using TCM-KNN (Transductive Confidence Machines for K-Nearest Neighbors) and genetic-algorithm-based instance selection, we propose a more efficient and effective instance selection method named E-FCM (Extend Fuzzy C-Means). Using this method, we obtain a much shorter training time for TCM-KNN while ensuring high detection performance. Therefore, the optimized mechanism is more suitable for lightweight DDoS attack detection in real network environments. In our previous work, we proposed an effective anomaly detection method based on the TCM-KNN algorithm to detect DDoS attacks and thereby ensure the QoS of web servers. The method detects network anomalies with a higher detection rate, higher confidence and fewer false positives than traditional methods, because it combines “strangeness” with “p-values” measures to evaluate the network traffic, in contrast to conventional detection based on ad-hoc thresholds or particular definitions. Secondly, we utilize the new objective measurement as the input feature space of TCM-KNN to effectively detect DDoS attacks against web servers. Finally, we introduce a Genetic Algorithm (GA) based instance selection method to boost the real-time detection performance of TCM-KNN and thus make it an effective and lightweight mechanism for DDoS detection for web servers [4, 5]. However, we found that the computational cost of GA is high, which results in long training times for TCM-KNN.
Guo, Xiaolin
Zhang, Kaimin and Wang, Lu and Guo, Xiaolin and Pan, Aimin and Zhu, Bin B. WPBench: A Benchmark for Evaluating the Client-side Performance of Web 2.0 Applications. In this paper, a benchmark called WPBench is reported to evaluate the responsiveness of Web browsers for modern Web 2.0 applications. In WPBench, variations of servers and networks are removed and the benchmark result is the closest to what Web users would perceive. To achieve these, WPBench records users’ interactions with typical Web 2.0 applications, and then replays Web navigations when benchmarking browsers. The replay mechanism can emulate the actual user interactions and the characteristics of the servers and the networks in a consistent way independent of browsers so that any browser compliant to the standards can be benchmarked fairly. In addition to describing the design and generation of WPBench, we also report the WPBench comparison results on the responsiveness performance for three popular Web browsers: Internet Explorer, Firefox and Chrome.
Gupta, Ajay
Chandra, Praphul and Gupta, Ajay Retaining Personal Expression for Social Search. The Web is being extensively used for personal expression, which includes ratings, reviews, recommendations and blogs. This user-created content, e.g. a book review on Amazon.com, becomes the property of the website, and the user often does not have easy access to it. In some cases, a user’s feedback may get averaged with feedback from other users, e.g. ratings of a video. We argue that the creator of such content needs to be able to retain (a link to) her created content. We introduce the concept of MEB, which is a user-controlled store of such retained links. A MEB allows a user to access and share all the reviews she has given on different websites. With this capability, users can allow their friends to search through their feedback. Searching through one’s social network allows harnessing the power of social networks where known relationships provide the context and trust necessary to interpret feedback.
Gupta, Anubha
Ramanujam, Sunitha and Gupta, Anubha and Khan, Latifur and Seida, Steven and Thuraisingham, Bhavani Relationalizing RDF Stores for Tools Reusability. The emergence of Semantic Web technologies and standards such as Resource Description Framework (RDF) has introduced novel data storage models such as the RDF Graph Model. In this paper, we present a research effort called R2D, which attempts to bridge the gap between RDF and RDBMS concepts by presenting a relational view of RDF data stores. Thus, R2D is essentially a relational wrapper around RDF stores that aims to make the variety of stable relational tools that are currently in the market available to RDF stores without data duplication and synchronization issues.
Gupta, Manish
Gupta, Manish Predicting Click Through Rate for Job Listings. Click Through Rate (CTR) is an important metric for ad systems, job portals, recommendation systems. CTR impacts publisher’s revenue, advertiser’s bid amounts in “pay for performance” business models. We learn regression models using features of the job, optional click history of job, features of “related” jobs. We show that our models predict CTR much better than predicting avg. CTR for all job listings, even in absence of the click history for the job listing.
Guyet, ThomasWang, Wei and Masseglia, Florent and Guyet, Thomas and Quiniou, Rene and Cordier, Marie-Odile A General Framework for Adaptive and Online Detection of Web Attacks. Detection of web attacks is an important issue in current defense-in-depth security frameworks. In this paper, we propose a novel general framework for adaptive and online detection of web attacks. The general framework can be based on any online clustering method. A detection model based on the framework is able to learn online and deal with “concept drift” in web audit data streams. Str-DBSCAN, our extension of DBSCAN [1] to streaming data, and StrAP [3] are both used to validate the framework. The detection model based on the framework automatically labels the web audit data and adapts to changes in normal behavior while identifying attacks through dynamic clustering of the streaming data. A very large set of real HTTP log data collected at our institute is used to validate the framework and the model. The preliminary testing results demonstrated its effectiveness.
Halpin, HarryHalpin, Harry Is There Anything Worth Finding on the Semantic Web? There has recently been an upsurge of interest in the possibilities of combining structured data and ad-hoc information retrieval from traditional hypertext. In this experiment, we run queries extracted from a query log of a major search engine against the Semantic Web to discover if the Semantic Web has anything of interest to the average user. We show that there is indeed much information on the Semantic Web that could be relevant for many queries for people, places and even abstract concepts, although they are overwhelmingly clustered around a Semantic Web-enabled export of Wikipedia known as DBPedia. We use a search query log of approximately 15 million distinct queries from Microsoft Live Search. This query log contains 14,921,285 queries. Of these queries, 7,095,302 (47.55%) were unique, and corrected for capitalization, 6,623,635 (44.39%) were unique.
Har'El, NadavAmitay, Einat and Carmel, David and Har'El, Nadav and Ofek-Koifman, Shila and Soffer, Aya and Yogev, Sivan and Golbandi, Nadav Social Search and Discovery Using a Unified Approach. We explore new ways of improving a search engine using data from Web 2.0 applications such as blogs and social bookmarks. This data contains entities such as documents, people and tags, and relationships between them. We propose a simple yet effective method, based on faceted search, that treats all entities in a unified manner: returning all of them (documents, people and tags) on every search, and allowing all of them to be used as search terms. We describe an implementation of such a social search engine on the intranet of a large enterprise, and present large-scale experiments which verify the validity of our approach.
Hassanzadeh, OktieHassanzadeh, Oktie and Lim, Lipyeow and Kementsietsidis, Anastasios and Wang, Min A Declarative Framework for Semantic Link Discovery over Relational Data. In this paper, we present a framework for online discovery of semantic links from relational data. Our framework is based on declarative specification of the linkage requirements by the user, which allows matching data items in many real-world scenarios. These requirements are translated to queries that can run over the relational data source, potentially using the semantic knowledge to enhance the accuracy of link discovery. Our framework lets data publishers easily find and publish high-quality links to other data sources, and therefore could significantly enhance the value of the data in the next generation of the web.
He, XiaofeiWu, Hao and Qiu, Guang and He, Xiaofei and Shi, Yuan and Qu, Mingcheng and Shen, Jing and Bu, Jiajun and Chen, Chun Advertising Keyword Generation Using Active Learning. This paper proposes an efficient relevance feedback based interactive model for keyword generation in sponsored search advertising. We formulate the ranking of relevant terms as a supervised learning problem and suggest new terms for the seed by leveraging user relevance feedback information. Active learning is employed to select the most informative samples from a set of candidate terms for user labeling. Experiments show our approach improves the relevance of generated terms significantly with little user effort required.
Wang, Junfeng and He, Xiaofei and Wang, Can and Pei, Jian and Bu, Jiajun and Chen, Chun and Guan, Ziyu and Gang, Lu News Article Extraction with Template-Independent Wrapper. We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
Qu, Mingcheng and Qiu, Guang and He, Xiaofei and Zhang, Cheng and Wu, Hao and Bu, Jiajun and Chen, Chun Probabilistic Question Recommendation for Question Answering Communities. User-Interactive Question Answering (QA) communities such as Yahoo! Answers are growing in popularity. However, as these QA sites always have thousands of new questions posted daily, it is difficult for users to find the questions that are of interest to them. Consequently, this may delay the answering of the new questions. This gives rise to question recommendation techniques that help users locate interesting questions. In this paper, we adopt the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and propose a novel metric to evaluate the performance of our approach. The experimental results show our recommendation approach is effective.
Zhu, Junyan and Wang, Can and He, Xiaofei and Bu, Jiajun and Chen, Chun and Shang, Shujie and Qu, Mingcheng and Lu, Gang Tag-Oriented Document Summarization. Social annotations on a Web document are highly generalized descriptions of the topics contained in that page. The frequency with which they are tagged indicates varying degrees of user attention. This makes annotations a good resource for summarizing multiple topics in a Web page. In this paper, we present a tag-oriented Web document summarization approach that uses both the document content and the tags annotated on that document. To improve summarization performance, a new tag ranking algorithm named EigenTag is proposed to reduce noise in tags. Meanwhile, an association mining technique is employed to expand the tag set and tackle the sparsity problem. Experimental results show that our tag-oriented summarization achieves a significant improvement over approaches that do not use tags.
He, XiaofengHe, Xiaofeng and Duan, Lei and Zhou, Yiping and Dom, Byron Threshold Selection for Web-Page Classification with Highly Skewed Class Distribution. We propose a novel cost-efficient approach to threshold selection for binary web-page classification problems with imbalanced class distributions. In many binary-classification tasks the distribution of classes is highly skewed. In such problems, using uniform random sampling in constructing sample sets for threshold setting requires large sample sizes in order to include a statistically sufficient number of examples of the minority class. On the other hand, manually labeling examples is expensive and budgetary considerations require that the size of sample sets be limited. These conflicting requirements make threshold selection a challenging problem. Our method of sample-set construction is a novel approach based on stratified sampling, in which manually labeled examples are expanded to reflect the true class distribution of the web-page population. Our experimental results show that using false positive rate as the criterion for threshold setting results in lower-variance threshold estimates than using other widely used accuracy measures such as F1 and precision.
Heatherly, RaymondLindamood, Jack and Heatherly, Raymond and Kantarcioglu, Murat and Thuraisingham, Bhavani Inferring Private Information Using Social Network Data. On-line social networks, such as Facebook, are increasingly utilized by many users. These networks allow people to publish details about themselves and connect to their friends. Some of the information revealed inside these networks is private and it is possible that corporations could use learning algorithms on the released data to predict undisclosed private information. In this paper, we explore how to launch inference attacks using released social networking data to predict undisclosed private information about individuals. We then explore the effectiveness of possible sanitization techniques that can be used to combat such inference attacks under different scenarios. social network data could be used to predict some individual private trait that a user is not willing to disclose (e.g., political or religious affiliation) and explore the effect of possible data sanitization alternatives on preventing such private information leakage. To our knowledge this is the first comprehensive paper that discusses the problem of inferring private traits using real-life social network data and possible sanitization approaches to prevent such inference. First, we present a modification of Naïve Bayes classification that is suitable for classifying large amount of social network data. Our modified Naïve Bayes algorithm predicts privacy sensitive trait information using both node traits and link structure. We compare the accuracy of our learning method based on link structure against the accuracy of our learning method based on node traits. Please see extended version of this paper [3] for further details of our modified Naïve Bayes classifier.
In order to protect privacy, we sanitize both trait (e.g., deleting some information from a user’s on-line profile) and link details (e.g., deleting links between friends) and explore the effect they have on combating possible inference attacks. Our initial results indicate that just sanitizing trait information or link information may not be enough to prevent inference attacks and comprehensive sanitization techniques that involve both aspects are needed in practice. Similar to our paper, in [2], authors consider ways to infer private information via friendship links by creating a Bayesian Network from the links inside a social network. A similar privacy problem for online social networks is discussed in [4]. Compared to [2] and [4], we provide techniques that help in choosing the most effective traits or links that need to be removed for protecting privacy.
Henzinger, MonikaBaykan, Eda and Henzinger, Monika and Marian, Ludmila and Weber, Ingmar Purely URL-based Topic Classification. Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content [7], but a URL-only classifier is preferable, (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page’s content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser, without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to download it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
Holzinger, WolfgangKrüepl, Bernhard and Holzinger, Wolfgang and Darmaputra, Yansen and Baumgartner, Robert A Flight Meta-Search Engine with Metamorph. We demonstrate a flight meta-search engine that is based on the Metamorph framework. Metamorph provides mechanisms to model web forms together with the interactions which are needed to fulfil a request, and can generate interaction sequences that pose queries using these web forms and collect the results. In this paper, we discuss an interesting new feature that makes use of the forms themselves as an information source. We show how data can be extracted from web forms (rather than the data behind web forms) to generate a graph of flight connections between cities. The flight connection graph allows us to vastly reduce the number of queries that the engine sends to airline websites in the most interesting search scenarios; those that involve the controversial practice of creative ticketing, in which agencies attempt to find lower price fares by using more than one airline for a journey. We describe a system which attains data from a number of websites to identify promising routes and prune the search tree. Heuristics that make use of geographical information and an estimation of cost based on historical data are employed. The results are then made available to improve the quality of future search requests. Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software General Terms: Algorithms, Design, Experimentation. Keywords: Hidden Web, Web Data Extraction, Web Form Mapping, Web Form Extraction.
Horhammer, MikeAn, Ning and Chatterjee, Raja and Horhammer, Mike and Ravada, Siva Securely Implementing Open Geospatial Consortium Web Service Interface Standards in Oracle Spatial. In this paper, we briefly describe the implementation of various Open Geospatial Consortium Web Service Interface Standards in Oracle Spatial 11g. We highlight how we utilize Oracle’s implementation of OASIS Web Services Security (WSS) to provide a robust security framework for these OGC Web Services. We also discuss our future direction in supporting OGC Web Service Interface Standards. In addition to the mandated XML interfaces, Oracle Spatial WFS, Oracle Spatial CSW and Oracle Spatial OpenLS also support SOAP interfaces. To improve performance, Oracle Spatial WFS and Oracle Spatial CSW also implement caching mechanism to support retrieving records from a single query across different web requests. Below, we describe each of supported OGC Web Services in Oracle Spatial. Due to space limits, we will emphasize Oracle Spatial WFS support to illustrate our unique implementation.
Hsieh, Yung-HuanShieh, Jyh-Ren and Hsieh, Yung-Huan and Yeh, Yang-Ting and Chung Su, Tse and Lin, Ching-Yung and Wu, Ja-Ling Building Term Suggestion Relational Graphs from Collective Intelligence. This paper proposes an effective approach to providing relevant search terms for conceptual Web search. A ‘Semantic Term Suggestion’ function is included so that users can find the query term most appropriate to what they really need. Conventional approaches to term suggestion involve extracting frequently occurring key terms from retrieved documents. They must deal with term extraction difficulties and interference from irrelevant documents. In this paper, we propose a semantic term suggestion function called Collective Intelligence based Term Suggestion (CITS). CITS provides a novel social-network based framework for suggesting relevant terms using a semantic graph of the search term, without being limited to the specific query term. A visualization of the semantic graph is presented to users to help them browse search results from related terms in the semantic graph. The search results are ranked each time according to their relevance to the related terms in the entire query session. Compared with two popular commercial search engines, a user study of 18 users on 50 search terms showed better user satisfaction and indicated the potential usefulness of the proposed method in real-world search applications.
Hsu, Chao-JungHsu, Chao-Jung and Huang, Chin-Yu Reliability Analysis Using Weighted Combinational Models for Web-based Software. In the past, some studies suggested that engineers can use combined software reliability growth models (SRGMs) to obtain more accurate reliability predictions during testing. In this paper, three weighted combinational models, namely equal, linear, and nonlinear weight, are proposed for reliability estimation of web-based software. We further investigate the estimation accuracy of using a genetic algorithm to determine the weight assignment for the proposed models. Preliminary results show that the linearly and nonlinearly weighted combinational models have better prediction capability than a single SRGM and the equally weighted combinational model for web-based software.
Hu, JianWang, Gang and Hu, Jian and Zhu, Yunzhang and Li, Hua and Chen, Zheng Competitive Analysis from Click-Through Log. Existing keyword suggestion tools from various search engine companies can automatically suggest keywords related to advertisers’ products or services, taking into account simple statistics of the keywords, such as search volume and cost per click (CPC). However, the nature of the generalized Second Price Auction suggests that a better understanding of competitors’ keyword selection and bidding strategies helps more in winning the auction than relying only on general search statistics. In this paper, we propose a novel keyword suggestion strategy, called Competitive Analysis, to explore the keyword-based competition relationships among advertisers and eventually help advertisers to build campaigns with better performance. The experimental results demonstrate that the proposed Competitive Analysis can both help advertisers to promote their product sales and generate more revenue for the search engine companies.
Ni, Xiaochuan and Sun, Jian-Tao and Hu, Jian and Chen, Zheng Mining Multilingual Topics from Wikipedia. In this paper, we try to leverage a large-scale and multilingual knowledge base, Wikipedia, to help effectively analyze and organize Web information written in different languages. Based on the observation that one Wikipedia concept may be described by articles in different languages, we adapt existing topic modeling algorithm for mining multilingual topics from this knowledge base. The extracted “universal” topics have multiple types of representations, with each type corresponding to one language. Accordingly, new documents of different languages can be represented in a space using a group of universal topics, which makes various multilingual Web applications feasible.
Hu, Weiming
Hua, Xian-ShengLi, Lusong and Mei, Tao and Liu, Chris and Hua, Xian-Sheng GameSense. This paper presents a novel game-like advertising system called GameSense, which is driven by the compelling contents of online images. Given a Web page which typically contains images, GameSense is able to select suitable images to create online in-image games for advertising. The contextually relevant ads (i.e., product logos) are embedded at appropriate positions within the online games. The ads are selected based on not only textual relevance but also visual content similarity. The game is able to provide viewers rich experience and thus promote the embedded ads to provide more effective advertising.
Huai, JinpengDeng, Ting and Huai, Jinpeng and Li, Xianxian and Du, Zongxia and Guo, Huipeng Automated Synthesis of Composite Services with Correctness Guarantee. In this paper, we propose a novel approach for composing existing web services to satisfy the correctness constraints to the design, including freeness of deadlock and unspecified reception, and temporal constraints in Computation Tree Logic formula. An automated synthesis algorithm based on learning algorithm is introduced, which guarantees that the composite service is the most general way of coordinating services so that the correctness is ensured. We have implemented a prototype system evaluating the effectiveness and efficiency of our synthesis approach through an experimental study. In this paper we propose a novel approach to synthesize the composite service from a given set of services, where the designer only needs to set the correctness constraints on the desired behaviors of the targeted service and the synthesis will be automatically performed with the correctness guaranteed. We implemented a prototype system and the preliminary experimental results on a practical travel agent example show that our synthesis approach is effective and efficient.
Huang, Chin-YuHsu, Chao-Jung and Huang, Chin-Yu Reliability Analysis Using Weighted Combinational Models for Web-based Software. In the past, some studies suggested that engineers can use combined software reliability growth models (SRGMs) to obtain more accurate reliability predictions during testing. In this paper, three weighted combinational models, namely equal, linear, and nonlinear weight, are proposed for reliability estimation of web-based software. We further investigate the estimation accuracy of using a genetic algorithm to determine the weight assignment for the proposed models. Preliminary results show that the linearly and nonlinearly weighted combinational models have better prediction capability than a single SRGM and the equally weighted combinational model for web-based software.
Hussain, Toufeeq
Huynh, XinhDasdan, Ali and Huynh, Xinh User-Centric Content Freshness Metrics for Search Engines. In order to return relevant search results, a search engine must keep its local repository synchronized to the Web, but it is usually impossible to attain perfect freshness. Hence, it is vital for a production search engine continually to monitor and improve repository freshness. Most previous freshness metrics, formulated in the context of developing better synchronization policies, focused on the web crawler while ignoring other parts of a search engine. But, the freshness of documents in a web crawler does not necessarily translate directly into the freshness of search results as seen by users. We propose metrics for measuring freshness from a user’s perspective, which take into account the latency between when documents are crawled and when they are viewed by users, as well as the variation in user click and view frequency among different documents. We also describe a practical implementation of these metrics that were used in a production search engine.
Iida, Toshinari
Ji, LeiYan, Jun and Liu, Ning and Qing Chang, Elaine and Ji, Lei and Chen, Zheng Search Result Re-ranking Based on Gap between Search Queries and Social Tags. Both search engine click-through logs and social annotations have been utilized as user feedback for search result re-ranking. However, to the best of our knowledge, no previous study has explored the correlation between these two factors for the task of search result re-ranking. In this paper, we show that the gap between search queries and social tags of the same web page can reflect its user preference score well. Motivated by this observation, we propose a novel algorithm, called Query-Tag-Gap (QTG), to re-rank search results for better user satisfaction. Intuitively, on one hand, the search users’ intentions are generally described by their queries before they read the search results. On the other hand, the web annotators semantically tag web pages after they read the content of the pages. The difference between users’ recognition of the same page before and after they read it is a good reflection of user satisfaction. In this extended abstract, we formally define the query set and tag set of the same page as users’ pre- and post-knowledge respectively. We empirically show the strong correlation between user satisfaction and the user’s knowledge gap before and after reading the page. Based on this gap, experiments have shown outstanding performance of our proposed QTG algorithm in search result re-ranking.
Jiang, BoJiang, Bo and Chan, W. K. and Zhang, Zhenyu and Tse, T. H. Where to Adapt Dynamic Service Compositions. Peer services depend on one another to accomplish their tasks, and their structures may evolve. A service composition may be designed to replace its member services whenever the quality of the composite service fails to meet certain quality-of-service (QoS) requirements. Finding services and service invocation endpoints having the greatest impact on the quality are important to guide subsequent service adaptations. This paper proposes a technique that samples the QoS of composite services and continually analyzes them to identify artifacts for service adaptation. The preliminary results show that our technique has the potential to effectively find such artifacts in services.
Jiang, Lili
Juffinger, AndreasJuffinger, Andreas and Lex, Elisabeth Crosslanguage Blog Mining and Trend Visualisation. People use weblogs to express thoughts, present ideas and share knowledge; weblogs are therefore extraordinarily valuable resources for, among other things, trend analysis. Trends are derived from the chronological sequence of blog post counts per topic. The comparison with a reference corpus allows qualitative statements about identified trends. We propose a crosslanguage blog mining and trend visualisation system to analyse blogs across languages and topics. The trend visualisation facilitates the identification of trends and the comparison with the reference news article corpus. To prove the correctness of our system we computed the correlation between trends in blogs and news articles for a subset of blogs and topics. The evaluation corroborated our hypothesis of a high correlation coefficient for these subsets and therefore the correctness of our system for different languages and topics is proven.
Kantarcioglu, MuratLindamood, Jack and Heatherly, Raymond and Kantarcioglu, Murat and Thuraisingham, Bhavani Inferring Private Information Using Social Network Data. On-line social networks, such as Facebook, are increasingly utilized by many users. These networks allow people to publish details about themselves and connect to their friends. Some of the information revealed inside these networks is private and it is possible that corporations could use learning algorithms on the released data to predict undisclosed private information. In this paper, we explore how to launch inference attacks using released social networking data to predict undisclosed private information about individuals. We then explore the effectiveness of possible sanitization techniques that can be used to combat such inference attacks under different scenarios. social network data could be used to predict some individual private trait that a user is not willing to disclose (e.g., political or religious affiliation) and explore the effect of possible data sanitization alternatives on preventing such private information leakage. To our knowledge this is the first comprehensive paper that discusses the problem of inferring private traits using real-life social network data and possible sanitization approaches to prevent such inference. First, we present a modification of Naïve Bayes classification that is suitable for classifying large amount of social network data. Our modified Naïve Bayes algorithm predicts privacy sensitive trait information using both node traits and link structure. We compare the accuracy of our learning method based on link structure against the accuracy of our learning method based on node traits. Please see extended version of this paper [3] for further details of our modified Naïve Bayes classifier.
In order to protect privacy, we sanitize both trait (e.g., deleting some information from a user’s on-line profile) and link details (e.g., deleting links between friends) and explore the effect they have on combating possible inference attacks. Our initial results indicate that just sanitizing trait information or link information may not be enough to prevent inference attacks and comprehensive sanitization techniques that involve both aspects are needed in practice. Similar to our paper, in [2], authors consider ways to infer private information via friendship links by creating a Bayesian Network from the links inside a social network. A similar privacy problem for online social networks is discussed in [4]. Compared to [2] and [4], we provide techniques that help in choosing the most effective traits or links that need to be removed for protecting privacy.
Kasneci, Gjergji
Keller, MatthiasKeller, Matthias and Nussbaumer, Martin Cascading Style Sheets: A Novel Approach Towards Productive Styling with Today's Standards. In this paper we present an approach for generating Cascading Style Sheet documents automatically when the desired effect on the content elements is specified. While a Web user agent resolves the CSS rules and computes their effect, our approach handles the reverse direction. We argue that this can remarkably improve CSS productivity, since the process of CSS authoring always involves this direction implicitly. Our approach offers a new and innovative way to reuse chunks of markup together with their presentation. It furthermore bears potential for the optimization and reorganization of CSS documents. We describe the criteria for CSS code quality that guided us, including a quantitative indicator for the abstractness of a CSS presentation specification. An evaluation and recomputation of the CSS for 25,000 HTML documents shows that, with respect to these criteria, the automatically generated code comes close to manually authored code.
Kelliher, AislingLin, Yu-Ru and Sun, Jimeng and Castro, Paul and Konuru, Ravi and Sundaram, Hari and Kelliher, Aisling Extracting Community Structure through Relational Hypergraphs. Social media websites promote diverse user interaction on media objects as well as user actions with respect to other users. The goal of this work is to discover community structure in rich media social networks, and observe how it evolves over time, through analysis of multi-relational data. The problem is important in the enterprise domain where extracting emergent community structure on enterprise social media, can help in forming new collaborative teams, aid in expertise discovery, and guide long term enterprise reorganization. Our approach consists of three main parts: (1) a relational hypergraph model for modeling various social context and interactions; (2) a novel hypergraph factorization method for community extraction on multi-relational social data; (3) an online method to handle temporal evolution through incremental hypergraph factorization. Extensive experiments on real-world enterprise data suggest that our technique is scalable and can extract meaningful communities. To evaluate the quality of our mining results, we use our method to predict users’ future interests. Our prediction outperforms baseline methods (frequency counts, pLSA) by 36-250% on the average, indicating the utility of leveraging multi-relational social context by using our method.
Kementsietsidis, AnastasiosHassanzadeh, Oktie and Lim, Lipyeow and Kementsietsidis, Anastasios and Wang, Min A Declarative Framework for Semantic Link Discovery over Relational Data. In this paper, we present a framework for online discovery of semantic links from relational data. Our framework is based on declarative specification of the linkage requirements by the user, which allows matching data items in many real-world scenarios. These requirements are translated to queries that can run over the relational data source, potentially using the semantic knowledge to enhance the accuracy of link discovery. Our framework lets data publishers easily find and publish high-quality links to other data sources, and therefore could significantly enhance the value of the data in the next generation of the web.
Khan, Latifur
Ramanujam, Sunitha and Gupta, Anubha and Khan, Latifur and Seida, Steven and Thuraisingham, Bhavani Relationalizing RDF Stores for Tools Reusability. The emergence of Semantic Web technologies and standards such as Resource Description Framework (RDF) has introduced novel data storage models such as the RDF Graph Model. In this paper, we present a research effort called R2D, which attempts to bridge the gap between RDF and RDBMS concepts by presenting a relational view of RDF data stores. Thus, R2D is essentially a relational wrapper around RDF stores that aims to make the variety of stable relational tools that are currently on the market available to RDF stores without data duplication and synchronization issues.
Kil, Hyunyoung
Kil, Hyunyoung and Nam, Wonhong and Lee, Dongwon Automatic Web Service Composition with Abstraction and Refinement. The behavioral description based Web Service Composition (WSC) problem aims at the automatic construction of a coordinator web service that controls a set of web services to reach a goal state. However, solving the WSC problem exactly with a realistic model is doubly-exponential in the number of variables in web service descriptions. In this paper, we propose a novel efficient approximation-based algorithm using automatic abstraction and refinement to dramatically reduce the number of variables needed to solve the problem.
Kim, Jinil
Lee, Taehyung and Kim, Jinil and Wook Kim, Jin and Kim, Sung-Ryul and Park, Kunsoo Detecting Soft Errors by Redirection Classification. A soft error redirection is a URL redirection to a page that returns the HTTP status code 200 (OK) but actually has no content relevant to the client request. Since such redirections degrade the performance of web search engines in many ways, it is highly desirable to remove as many of them as possible. We propose a novel approach to detect soft error redirections by analyzing redirection logs collected during crawling. Experimental results on huge crawl data show that our measure can classify soft error redirections effectively.
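The abstract does not say which measure the authors compute from the redirection logs, so the following is only an illustrative sketch under a stated assumption: a target page that answers 200 (OK) for redirects coming from many distinct source URLs is treated as a likely generic error page. The log layout, the function name `likely_soft_error_targets`, and the `min_sources` threshold are all hypothetical.

```python
from collections import defaultdict


def likely_soft_error_targets(redirect_log, min_sources=3):
    """Return redirect targets that look like soft error pages.

    redirect_log: iterable of (source_url, target_url, target_status) tuples.
    Assumption (not from the paper): a target that returns 200 and receives
    redirects from at least `min_sources` distinct sources is suspicious.
    """
    sources_by_target = defaultdict(set)
    for source, target, status in redirect_log:
        if status == 200:  # soft errors hide behind a successful status code
            sources_by_target[target].add(source)
    return {
        target
        for target, sources in sources_by_target.items()
        if len(sources) >= min_sources
    }


log = [
    ("http://example.org/a", "http://example.org/notfound.html", 200),
    ("http://example.org/b", "http://example.org/notfound.html", 200),
    ("http://example.org/c", "http://example.org/notfound.html", 200),
    ("http://example.org/old", "http://example.org/new", 200),
]
suspects = likely_soft_error_targets(log)
```

A real classifier would need further signals, since popular landing pages also attract many redirects; the threshold here is only a placeholder.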
Kim, Sung-Ryul
Lee, Taehyung and Kim, Jinil and Wook Kim, Jin and Kim, Sung-Ryul and Park, Kunsoo Detecting Soft Errors by Redirection Classification. A soft error redirection is a URL redirection to a page that returns the HTTP status code 200 (OK) but actually has no content relevant to the client request. Since such redirections degrade the performance of web search engines in many ways, it is highly desirable to remove as many of them as possible. We propose a novel approach to detect soft error redirections by analyzing redirection logs collected during crawling. Experimental results on huge crawl data show that our measure can classify soft error redirections effectively.
Kitagawa, Hiroyuki
Kohlschütter, Christian
Kohlschütter, Christian A Densitometric Analysis of Web Template Content. What makes template content in the Web so special that we need to remove it? In this paper I present a large-scale aggregate analysis of textual Web content, corroborating statistical laws from the field of Quantitative Linguistics. I analyze the idiosyncrasy of template content compared to regular “full text” content and derive a simple yet suitable quantitative model.
Kolay, Santanu
Dasdan, Ali and Drome, Chris and Kolay, Santanu Thumbs-Up: A Game for Playing to Rank Search Results. Human computation is an effective way to channel human effort spent playing games to solving computational problems that are easy for humans but difficult for computers to automate. We propose Thumbs-Up, a new game for human computation with the purpose of playing to rank search results. Our experience from users shows that Thumbs-Up is not only fun to play, but produces more relevant rankings than both a major search engine and optimal rank aggregation using the Kemeny rule.
Kolay, Santanu and Dasdan, Ali The Value of Socially Tagged URLs for a Search Engine. Social bookmarking has emerged as a growing source of human-generated content on the web. In essence, bookmarking involves URLs and tags on them. In this paper, we perform a large-scale study of the usefulness of bookmarked URLs from the top social bookmarking site Delicious. Instead of focusing on the dimension of tags, which has been covered in previous work, we explore social bookmarking from the dimension of URLs. More specifically, we investigate the Delicious URLs and their content to quantify their value to a search engine. For their value in leading to good content, we show that the Delicious URLs have higher quality content and more external outlinks. For their value in satisfying users, we show that more of the Delicious URLs are clicked and that they receive more clicks. We suggest that, based on their value, the Delicious URLs should be used as another source of seed URLs for crawlers.
Konuru, Ravi
Lin, Yu-Ru and Sun, Jimeng and Castro, Paul and Konuru, Ravi and Sundaram, Hari and Kelliher, Aisling Extracting Community Structure through Relational Hypergraphs. Social media websites promote diverse user interaction on media objects as well as user actions with respect to other users. The goal of this work is to discover community structure in rich media social networks, and observe how it evolves over time, through analysis of multi-relational data. The problem is important in the enterprise domain, where extracting emergent community structure on enterprise social media can help in forming new collaborative teams, aid in expertise discovery, and guide long-term enterprise reorganization. Our approach consists of three main parts: (1) a relational hypergraph model for modeling various social contexts and interactions; (2) a novel hypergraph factorization method for community extraction on multi-relational social data; (3) an online method to handle temporal evolution through incremental hypergraph factorization. Extensive experiments on real-world enterprise data suggest that our technique is scalable and can extract meaningful communities. To evaluate the quality of our mining results, we use our method to predict users’ future interests. Our prediction outperforms baseline methods (frequency counts, pLSA) by 36-250% on average, indicating the utility of leveraging multi-relational social context by using our method.
Kotsakos, Dimitrios
Platakis, Manolis and Kotsakos, Dimitrios and Gunopulos, Dimitrios Searching for Events in the Blogosphere. Over the last few years, blogs (web logs) have gained massive popularity and have become one of the most influential web social media of our time. Every blog post in the Blogosphere has a well-defined timestamp, which is not taken into account by search engines. By conducting research regarding this feature of the Blogosphere, we can attempt to discover bursty terms and correlations between them during a time interval. We apply Kleinberg’s automaton on extracted titles of blog posts to discover bursty terms, we introduce a novel representation of a term’s burstiness evolution called State Series, and we employ a Euclidean-based distance metric to discover potential correlations between terms without taking into account their context. We evaluate the results by trying to match them with real-life events. Finally, we propose some ideas for further evaluation techniques and future research in the field.
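As a rough illustration of the distance step described in this abstract, the sketch below compares two State Series with a plain Euclidean distance. It assumes (this is not stated in the abstract) that a State Series is a sequence of integer burst states, one per time step, as produced by a burst-detection automaton; the function name and the sample series are hypothetical.

```python
import math


def state_series_distance(series_a, series_b):
    """Euclidean distance between two equal-length burst-state sequences."""
    if len(series_a) != len(series_b):
        raise ValueError("state series must cover the same time steps")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(series_a, series_b)))


# Hypothetical burst states for three terms over six time steps.
term_x = [0, 0, 1, 2, 1, 0]
term_y = [0, 0, 1, 2, 1, 0]
term_z = [2, 1, 0, 0, 0, 0]

same = state_series_distance(term_x, term_y)
different = state_series_distance(term_x, term_z)
```

A small distance means the two terms burst at the same times, which is the kind of context-free correlation the abstract refers to.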
Krause, Markus
Krause, Markus and Aras, Hidir Playful Tagging - Folksonomy Generation Using Online Games. Collaborative Tagging is a powerful method to create folksonomies that can be used to grasp/filter user preferences or enhance web search. Recent research has shown that, depending on the number of users and the quality of user-provided tags, powerful community-driven semantics or “ontologies” can emerge, as was evident from analyzing user data from social web applications such as del.icio.us or Flickr. Unfortunately, most web pages do not contain tags and, thus, no vocabulary that describes the information provided. A common problem in web page annotation is motivating users to participate constantly, i.e. to keep tagging. In this paper we describe our approach of a binary verification game that embeds collaborative tagging into online games in order to produce domain-specific folksonomies.
Krüepl, Bernhard
Krüepl, Bernhard and Holzinger, Wolfgang and Darmaputra, Yansen and Baumgartner, Robert A Flight Meta-Search Engine with Metamorph. We demonstrate a flight meta-search engine that is based on the Metamorph framework. Metamorph provides mechanisms to model web forms together with the interactions which are needed to fulfil a request, and can generate interaction sequences that pose queries using these web forms and collect the results. In this paper, we discuss an interesting new feature that makes use of the forms themselves as an information source. We show how data can be extracted from web forms (rather than the data behind web forms) to generate a graph of flight connections between cities. The flight connection graph allows us to vastly reduce the number of queries that the engine sends to airline websites in the most interesting search scenarios: those that involve the controversial practice of creative ticketing, in which agencies attempt to find lower fares by using more than one airline for a journey. We describe a system which obtains data from a number of websites to identify promising routes and prune the search tree. Heuristics that make use of geographical information and an estimation of cost based on historical data are employed. The results are then made available to improve the quality of future search requests. Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software. General Terms: Algorithms, Design, Experimentation. Keywords: Hidden Web, Web Data Extraction, Web Form Mapping, Web Form Extraction.
Lee, Dongwon
Kil, Hyunyoung and Nam, Wonhong and Lee, Dongwon Automatic Web Service Composition with Abstraction and Refinement. The behavioral description based Web Service Composition (WSC) problem aims at the automatic construction of a coordinator web service that controls a set of web services to reach a goal state. However, solving the WSC problem exactly with a realistic model is doubly-exponential in the number of variables in web service descriptions. In this paper, we propose a novel efficient approximation-based algorithm using automatic abstraction and refinement to dramatically reduce the number of variables needed to solve the problem.
Lee, Jeehoon
Shin, Hyoseop and Lee, Jeehoon Ranking User-Created Contents by Search User's Inclination in Online Communities. Searching posts effectively has become an important issue in large-scale online communities. In particular, search users with different inclinations have different kinds of posts in mind when they search. To address this problem, we propose a scheme for ranking posts based on search users’ inclination. A user ranking score is employed to capture posts that are relevant to a specific user inclination. Specifically, we present a scheme to rank posts in terms of user expertise and popularity. Experimental results show that different user inclinations can produce quite different search results and that the proposed scheme achieves about 70% accuracy.
Lee, Jihyun
Lee, Jihyun and Min, Jun-Ki and Chung, Chin-Wan An Effective Semantic Search Technique using Ontology. In this paper, we present a semantic search technique considering the type of desired Web resources and the semantic relationships between the resources and the query keywords in the ontology. In order to effectively retrieve the most relevant top-k resources, we propose a novel ranking model. To do this, we devise a measure to determine the weight of the semantic relationship. In addition, we consider the number of meaningful semantic relationships between a resource and keywords, the coverage of keywords, and the distinguishability of keywords. Through experiments using real datasets, we observe that our ranking model provides more accurate semantic search results compared to existing ranking models.
Lee, Taehyung
Lee, Taehyung and Kim, Jinil and Wook Kim, Jin and Kim, Sung-Ryul and Park, Kunsoo Detecting Soft Errors by Redirection Classification. A soft error redirection is a URL redirection to a page that returns the HTTP status code 200 (OK) but actually has no content relevant to the client request. Since such redirections degrade the performance of web search engines in many ways, it is highly desirable to remove as many of them as possible. We propose a novel approach to detect soft error redirections by analyzing redirection logs collected during crawling. Experimental results on huge crawl data show that our measure can classify soft error redirections effectively.
Lee, Yugyung
Dasgupta, Sourish and Bhat, Satish and Lee, Yugyung SGPS: A Semantic Scheme for Web Service Similarity. Today’s Web is becoming a platform for services to be dynamically interconnected to produce a desired outcome. It is important to formalize the semantics of the contextual elements of web services. In this paper, we propose a novel technique called Semantic Genome Propagation Scheme (SGPS) for measuring similarity between semantic concepts. We show how SGPS is used to compute a multi-dimensional similarity between two services. We evaluate the SGPS similarity measurement in terms of similarity performance and scalability.
Lee Giles, C.
Zheng, Shuyi and Dmitriev, Pavel and Lee Giles, C. Graph Based Crawler Seed Selection. This paper identifies and explores the problem of seed selection in a web-scale crawler. We argue that seed selection is not a trivial problem but a very important one. Selecting proper seeds can increase the number of pages a crawler will discover, and can result in a collection with more “good” and fewer “bad” pages. Based on the analysis of the graph structure of the web, we propose several seed selection algorithms. The effectiveness of these algorithms is demonstrated by our experimental results on real web data.
Lex, Elisabeth
Juffinger, Andreas and Lex, Elisabeth Crosslanguage Blog Mining and Trend Visualisation. People use weblogs to express thoughts, present ideas and share knowledge; weblogs are therefore extraordinarily valuable resources for, among other things, trend analysis. Trends are derived from the chronological sequence of blog post counts per topic. The comparison with a reference corpus allows qualitative statements about identified trends. We propose a crosslanguage blog mining and trend visualisation system to analyse blogs across languages and topics. The trend visualisation facilitates the identification of trends and the comparison with the reference news article corpus. To prove the correctness of our system we computed the correlation between trends in blogs and news articles for a subset of blogs and topics. The evaluation corroborated our hypothesis of a high correlation coefficient for these subsets and therefore the correctness of our system for different languages and topics is proven.
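The evaluation in this abstract rests on correlating a blog trend with a news trend for the same topic. The abstract does not name the coefficient used, so the sketch below assumes the Pearson coefficient over per-period post counts; the function name and the sample counts are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length count series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly post counts for one topic.
blog_counts = [12, 15, 40, 38, 20, 14]
news_counts = [5, 6, 18, 17, 9, 6]
r = pearson(blog_counts, news_counts)
```

A value of `r` close to 1 would indicate that the blog trend tracks the reference news corpus for that topic.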
Li, Guoliang
Li, Guoliang and Feng, Jianhua and Zhou, Lizhu Interactive Search in XML Data. In a traditional keyword-search system in XML data, a user composes a keyword query, submits it to the system, and retrieves relevant subtrees. In the case where the user has limited knowledge about the data, the user often feels “left in the dark” when issuing queries, and has to use a try-and-see approach for finding information. In this paper, we study a new information-access paradigm for XML data, called “Inks,” in which the system searches on the underlying data “on the fly” as the user types in query keywords. Inks extends existing XML keyword search methods by interactively answering keyword queries. We propose effective indices, early-termination techniques, and efficient search algorithms to achieve a high interactive speed. We have implemented our algorithm. The experimental results show that Inks achieves high search efficiency and result quality.
Li, Hua
Wang, Gang and Hu, Jian and Zhu, Yunzhang and Li, Hua and Chen, Zheng Competitive Analysis from Click-Through Log. Existing keyword suggestion tools from various search engine companies can automatically suggest keywords related to the advertisers’ products or services, taking into account simple statistics of the keywords, such as search volume, cost per click (CPC), etc. However, the nature of the generalized Second Price Auction suggests that a better understanding of the competitors’ keyword selection and bidding strategies helps more in winning the auction than relying only on general search statistics. In this paper, we propose a novel keyword suggestion strategy, called Competitive Analysis, to explore the keyword-based competition relationships among advertisers and eventually help advertisers to build campaigns with better performance. The experimental results demonstrate that the proposed Competitive Analysis can both help advertisers to promote their product sales and generate more revenue for the search engine companies.
Li, Juanzi
Chen, Dewei and Tang, Jie and Li, Juanzi and Zhou, Lizhu Discovering the Staring People From Social Networks. In this paper, we study a novel problem of staring people discovery from social networks, which is concerned with finding people who are not only authoritative but also sociable in the social network. We formalize this problem as an optimization programming problem. Taking the co-author network as a case study, we define three objective functions and propose two methods to combine these objective functions. A genetic algorithm based method is further presented to solve this problem. Experimental results show that the proposed solution can effectively find the staring people from social networks.
Li, Lian
Li, Lusong
Li, Lusong and Mei, Tao and Liu, Chris and Hua, Xian-Sheng GameSense. This paper presents a novel game-like advertising system called GameSense, which is driven by the compelling contents of online images. Given a Web page which typically contains images, GameSense is able to select suitable images to create online in-image games for advertising. The contextually relevant ads (i.e., product logos) are embedded at appropriate positions within the online games. The ads are selected based on not only textual relevance but also visual content similarity. The game is able to provide viewers with a rich experience and thus promote the embedded ads to provide more effective advertising.
Li, Qiudan
Geng, Guang-Gang and Li, Qiudan and Zhang, Xinchang Link Based Small Sample Learning for Web Spam Detection. A robust statistical learning based web spam detection system often requires large amounts of labeled training data. However, labeled samples are more difficult, expensive and time-consuming to obtain than unlabeled ones. This paper proposes link based semi-supervised learning algorithms to boost the performance of a classifier; they integrate traditional Self-training with topological dependency based link learning. Experiments with a few labeled samples on the standard WEBSPAM-UK2006 benchmark showed that the algorithms are effective.
Li, Wei
Zhang, Dell and Mao, Robert and Li, Wei The Recurrence Dynamics of Social Tagging. How often do tags recur? How hard is predicting tag recurrence? What tags are likely to recur? We try to answer these questions by analysing the RSDC08 dataset, in both individual and collective settings. Our findings provide useful insights for the development of tag suggestion techniques etc.
Li, Xianxian
Deng, Ting and Huai, Jinpeng and Li, Xianxian and Du, Zongxia and Guo, Huipeng Automated Synthesis of Composite Services with Correctness Guarantee. In this paper, we propose a novel approach for composing existing web services to satisfy the correctness constraints on the design, including freedom from deadlock and unspecified reception, and temporal constraints expressed as Computation Tree Logic formulas. An automated synthesis algorithm based on a learning algorithm is introduced, which guarantees that the composite service is the most general way of coordinating services so that correctness is ensured. We have implemented a prototype system and evaluated the effectiveness and efficiency of our synthesis approach through an experimental study. In this paper we propose a novel approach to synthesize the composite service from a given set of services, where the designer only needs to set the correctness constraints on the desired behaviors of the targeted service and the synthesis is performed automatically with correctness guaranteed. We implemented a prototype system, and the preliminary experimental results on a practical travel agent example show that our synthesis approach is effective and efficient.
Li, Yang
Li, Yang and Lu, Tian-Bo and Guo, Li and Tian, Zhi-Hong and Nie, Qin-Wu Towards Lightweight and Efficient DDoS Attacks Detection for Web Server. In this poster, based on our previous work in building a lightweight DDoS (Distributed Denial-of-Service) attack detection mechanism for web servers using TCM-KNN (Transductive Confidence Machines for K-Nearest Neighbors) and genetic algorithm based instance selection methods, we further propose a more efficient and effective instance selection method, named E-FCM (Extend Fuzzy C-Means). By using this method, we can obtain a much shorter training time for TCM-KNN while ensuring high detection performance. Therefore, the optimized mechanism is more suitable for lightweight DDoS attack detection in a real network environment. In our previous work, we proposed an effective anomaly detection method based on the TCM-KNN algorithm to fulfill the DDoS attack detection task and so ensure the QoS of web servers. The method detects network anomalies with a higher detection rate, higher confidence and fewer false positives than traditional methods, because it combines “strangeness” with “p-values” measures to evaluate the network traffic, in contrast to conventional detection based on ad-hoc thresholds or on particular definitions. Secondly, we utilize the new objective measurement as the input feature space of TCM-KNN, to effectively detect DDoS attacks against web servers. Finally, we introduce a Genetic Algorithm (GA) based instance selection method to boost the real-time detection performance of TCM-KNN and thus make it an effective and lightweight mechanism for DDoS detection for web servers [4, 5]. However, we found that the computational cost of GA is high, which results in a long training time for TCM-KNN.
Liao, Hong-Luan
Lim, Lipyeow
Hassanzadeh, Oktie and Lim, Lipyeow and Kementsietsidis, Anastasios and Wang, Min A Declarative Framework for Semantic Link Discovery over Relational Data. In this paper, we present a framework for online discovery of semantic links from relational data. Our framework is based on declarative specification of the linkage requirements by the user, which allows matching data items in many real-world scenarios. These requirements are translated to queries that can run over the relational data source, potentially using semantic knowledge to enhance the accuracy of link discovery. Our framework lets data publishers easily find and publish high-quality links to other data sources, and therefore could significantly enhance the value of the data in the next generation of the web.
Lin, Chen
Lin, Chen and Yang, Jiang-Ming and Cai, Rui and Wang, Xin-Jing and Wang, Wei and Zhang, Lei Modeling Semantics and Structure of Discussion Threads. The abundant knowledge in web communities has motivated research interest in discussion threads. The dynamic nature of discussion threads poses interesting and challenging problems for computer scientists. Although techniques such as semantic models or structural models have been shown to be useful in a number of areas, they are ineffective for understanding discussion threads because of the temporal dependence among posts in a discussion thread. Such dependence causes semantics and structure to be coupled with each other in discussion threads. In this paper, we propose a sparse coding-based model named SMSS to Simultaneously Model Semantic and Structure of discussion threads.
Lin, Ching-Yung
Shieh, Jyh-Ren and Hsieh, Yung-Huan and Yeh, Yang-Ting and Chung Su, Tse and Lin, Ching-Yung and Wu, Ja-Ling Building Term Suggestion Relational Graphs from Collective Intelligence. This paper proposes an effective approach to provide relevant search terms for conceptual Web search. A ‘Semantic Term Suggestion’ function is included so that users can find the query term that best matches what they really need. Conventional approaches for term suggestion involve extracting frequently occurring key terms from retrieved documents. They must deal with term extraction difficulties and interference from irrelevant documents. In this paper, we propose a semantic term suggestion function called Collective Intelligence based Term Suggestion (CITS). CITS provides a novel social-network based framework for suggesting relevant terms, using a semantic graph of the search term that is not limited to the specific query term. A visualization of the semantic graph is presented to users to help them browse search results from related terms in the semantic graph. The search results are ranked each time according to their relevance to the related terms in the entire query session. Compared with two popular commercial search engines, a user study of 18 users on 50 search terms showed better user satisfaction and indicated the potential usefulness of the proposed method in real-world search applications.
Lin, Yu-Ru
Lin, Yu-Ru and Sun, Jimeng and Castro, Paul and Konuru, Ravi and Sundaram, Hari and Kelliher, Aisling Extracting Community Structure through Relational Hypergraphs. Social media websites promote diverse user interaction on media objects as well as user actions with respect to other users. The goal of this work is to discover community structure in rich media social networks, and observe how it evolves over time, through analysis of multi-relational data. The problem is important in the enterprise domain, where extracting emergent community structure on enterprise social media can help in forming new collaborative teams, aid in expertise discovery, and guide long-term enterprise reorganization. Our approach consists of three main parts: (1) a relational hypergraph model for modeling various social contexts and interactions; (2) a novel hypergraph factorization method for community extraction on multi-relational social data; (3) an online method to handle temporal evolution through incremental hypergraph factorization. Extensive experiments on real-world enterprise data suggest that our technique is scalable and can extract meaningful communities. To evaluate the quality of our mining results, we use our method to predict users’ future interests. Our prediction outperforms baseline methods (frequency counts, pLSA) by 36-250% on average, indicating the utility of leveraging multi-relational social context by using our method.
Lindamood, Jack
Lindamood, Jack and Heatherly, Raymond and Kantarcioglu, Murat and Thuraisingham, Bhavani Inferring Private Information Using Social Network Data. On-line social networks, such as Facebook, are increasingly utilized by many users. These networks allow people to publish details about themselves and connect to their friends. Some of the information revealed inside these networks is private and it is possible that corporations could use learning algorithms on the released data to predict undisclosed private information. In this paper, we explore how to launch inference attacks using released social networking data to predict undisclosed private information about individuals. We then explore the effectiveness of possible sanitization techniques that can be used to combat such inference attacks under different scenarios. ... social network data could be used to predict some individual private trait that a user is not willing to disclose (e.g., political or religious affiliation) and explore the effect of possible data sanitization alternatives on preventing such private information leakage. To our knowledge this is the first comprehensive paper that discusses the problem of inferring private traits using real-life social network data and possible sanitization approaches to prevent such inference. First, we present a modification of Naïve Bayes classification that is suitable for classifying large amounts of social network data. Our modified Naïve Bayes algorithm predicts privacy-sensitive trait information using both node traits and link structure. We compare the accuracy of our learning method based on link structure against the accuracy of our learning method based on node traits. Please see the extended version of this paper [3] for further details of our modified Naïve Bayes classifier.
In order to protect privacy, we sanitize both trait (e.g., deleting some information from a user’s on-line profile) and link details (e.g., deleting links between friends) and explore the effect they have on combating possible inference attacks. Our initial results indicate that just sanitizing trait information or link information may not be enough to prevent inference attacks and comprehensive sanitization techniques that involve both aspects are needed in practice. Similar to our paper, in [2], authors consider ways to infer private information via friendship links by creating a Bayesian Network from the links inside a social network. A similar privacy problem for online social networks is discussed in [4]. Compared to [2] and [4], we provide techniques that help in choosing the most effective traits or links that need to be removed for protecting privacy.
Liu, Chris
Li, Lusong and Mei, Tao and Liu, Chris and Hua, Xian-Sheng GameSense. This paper presents a novel game-like advertising system called GameSense, which is driven by the compelling contents of online images. Given a Web page which typically contains images, GameSense is able to select suitable images to create online in-image games for advertising. The contextually relevant ads (i.e., product logos) are embedded at appropriate positions within the online games. The ads are selected based on not only textual relevance but also visual content similarity. The game is able to provide viewers with a rich experience and thus promote the embedded ads to provide more effective advertising.
Liu, Ning
Liu, Ning and Yan, Jun and Fan, Weiguo and Yang, Qiang and Chen, Zheng Identifying Vertical Search Intention of Query through Social Tagging Propagation. A pressing task during the unification process is to identify a user’s vertical search intention based on the user’s query. In this paper, we propose a novel method to propagate social annotation, which includes user-supplied tag data, to both queries and vertical search engines (VSEs) for semantically bridging them. Our proposed algorithm consists of three key steps: query annotation, vertical annotation and query intention identification. Our algorithm, referred to as TagQV, verifies that social tagging can be propagated to represent Web objects such as queries and VSEs in addition to Web pages. Experiments on real Web search queries demonstrate the effectiveness of TagQV in query intention identification.
Liu, Ning and Yan, Jun and Chen, Zheng A Probabilistic Model Based Approach for Blended Search. In this paper, we propose to model the blended search problem by assuming conditional dependencies among queries, VSEs and search results. The probability distributions of this model are learned from the search engine query log through a unigram language model. Our experimental exploration shows that (1) a large number of queries in generic Web search have vertical search intentions; and (2) our proposed algorithm can effectively blend vertical search results into generic Web search, which can improve the Mean Average Precision (MAP) by as much as 16% compared to traditional Web search without blending. ... these components into a single list. However, in the classical meta-search problem’s configuration, the query log of the component search engines is not available for study. In this extended abstract, we model the blended search problem based on the conditional dependencies among queries, VSEs and all the search results. We utilize the usage information, i.e. the query log, of all the VSEs, which is not available to traditional metasearch engines, to learn the model parameters with a smoothed unigram language model. Finally, given a user query, the search results from both generic Web search and different VSEs are ranked together by inferring their probabilities of relevance to the given query. The main contributions of this work are: (1) by studying the query logs of a commercial search engine’s own vertical search engines, we show the importance of the blended search problem; (2) we propose a novel probabilistic model based approach to explore the blended search problem; and (3) we experimentally verify that our proposed algorithm can effectively blend vertical search results into generic Web search, which can improve the MAP by as much as 16% in contrast to traditional Web search without vertical search blending and by 10% in contrast to another ranking baseline.
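To make the "smoothed unigram language model" step in this abstract more concrete, the sketch below scores a query against one unigram model per vertical, each trained on that vertical's query log. It is a simplified illustration, not the authors' model: it assumes add-alpha (Laplace) smoothing, ignores priors and the search results themselves, and uses made-up vertical names and logs.

```python
import math
from collections import Counter


def train_unigram(queries):
    """Count whitespace-separated terms over one vertical's query log."""
    counts = Counter(term for query in queries for term in query.split())
    return counts, sum(counts.values())


def log_likelihood(query, model, vocab_size, alpha=1.0):
    """log P(query | vertical) under an add-alpha smoothed unigram model."""
    counts, total = model
    return sum(
        math.log((counts[term] + alpha) / (total + alpha * vocab_size))
        for term in query.split()
    )


def rank_verticals(query, logs_by_vertical):
    """Order verticals by how well their query-log model explains the query."""
    vocab = {t for qs in logs_by_vertical.values() for q in qs for t in q.split()}
    vocab |= set(query.split())
    models = {v: train_unigram(qs) for v, qs in logs_by_vertical.items()}
    return sorted(
        logs_by_vertical,
        key=lambda v: log_likelihood(query, models[v], len(vocab)),
        reverse=True,
    )


# Hypothetical query logs for two verticals.
logs = {
    "images": ["cat pictures", "dog photos"],
    "news": ["election results", "stock market news"],
}
ranking = rank_verticals("cat photos", logs)
```

In a blended ranker, a score of this kind would be one factor in the probability of relevance used to interleave vertical and generic results.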
Yan, Jun and Liu, Ning and Qing Chang, Elaine and Ji, Lei and Chen, Zheng Search Result Re-ranking Based on Gap between Search Queries and Social Tags. Both search engine click-through logs and social annotation have been utilized as user feedback for search result re-ranking. However, to the best of our knowledge, no previous study has explored the correlation between these two factors for the task of search result re-ranking. In this paper, we show that the gap between the search queries and the social tags of the same web page can well reflect its user preference score. Motivated by this observation, we propose a novel algorithm, called Query-Tag-Gap (QTG), to re-rank search results for better user satisfaction. Intuitively, on one hand, the search users’ intentions are generally described by their queries before they read the search results. On the other hand, the web annotators semantically tag web pages after they read the content of the pages. The difference between users’ recognition of the same page before and after they read it is a good reflection of user satisfaction. In this extended abstract, we formally define the query set and tag set of the same page as users’ pre- and post-knowledge respectively. We empirically show the strong correlation between user satisfaction and the user’s knowledge gap before and after reading the page. Based on this gap, experiments have shown outstanding performance of our proposed QTG algorithm in search result re-ranking.
Liu, Qiaoling
Wang, Haofen and Liu, Qiaoling and Xue, Gui-Rong and Yu, Yong and Zhang, Lei and Pan, Yue Dataplorer: A Scalable Search Engine for the Data Web. More and more structured information in the form of semantic data is available nowadays. It offers a wide range of new possibilities, especially for semantic search and Web data integration. However, its effective exploitation still poses a number of challenges, e.g. usability, scalability and uncertainty. In this paper, we present Dataplorer, a solution designed to address these challenges. We address usability through the use of hybrid queries and faceted search, while still preserving scalability thanks to an extension of the inverted index to support this type of query. Moreover, Dataplorer deals with uncertainty by means of a powerful ranking scheme to find relevant results. Our experimental results show that our proposed approach is promising, and they make us believe that it is possible to extend the current IR infrastructure to query and search the Web of data. Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval. General Terms: Algorithms, Performance, Experimentation. Keywords: hybrid query, inverted index, ranking, faceted search. The usability challenge is addressed by providing the user with hybrid query capabilities, leveraging the power of structured queries and the ease of use of keyword search. We also propose a faceted search functionality that allows users to progressively compose the structured part of their information need after having started with imprecise keywords. Scalability is one of the main challenges that hybrid queries are facing, due to the large amount of data. Inspired by the cross field of DB and IR integration, we make IR compatible with hybrid search through an extension of the inverted index, which is thus able to scale as well as to handle structured information.
To ensure that uncertainty does not prevent relevant results from being returned, we provide a powerful ranking scheme that considers the structures of both data and hybrid queries for score propagation and aggregation during results ranking. As an improvement of our previous work [3], we support faceted search with integrated ranking to tackle both usability and uncertainty issues while preserving efficiency.
Lizorkin, Dmitry
Lu, Bin
Lu, Bin and Wu, Zhaohui and Ni, Yuan and Xie, Guotong and Zhou, Chunying and Chen, Huajun sMash: Semantic-based Mashup Navigation for Data API Network. With the proliferation of data APIs, it is not uncommon for users who have no clear idea about data APIs to have difficulty building mashups that satisfy their requirements. In this paper, we present a semantic-based mashup navigation system, sMash, which makes mashup building easy by constructing and visualizing a real-life data API network. We build a sample network by gathering more than 300 popular APIs and find that the relationships between them are so complex that our system will play an important role in guiding users and inspiring them to build interesting mashups easily. The system is accessible at: http://www.dart.zju.edu.cn/mashup.
Lu, Gang
Zhu, Junyan and Wang, Can and He, Xiaofei and Bu, Jiajun and Chen, Chun and Shang, Shujie and Qu, Mingcheng and Lu, Gang Tag-Oriented Document Summarization. Social annotations on a Web document are highly generalized descriptions of the topics contained in that page. Their tagging frequency indicates varying degrees of user attention. This makes annotations a good resource for summarizing multiple topics in a Web page. In this paper, we present a tag-oriented Web document summarization approach that uses both the document content and the tags annotated on that document. To improve summarization performance, a new tag ranking algorithm named EigenTag is proposed to reduce noise in tags. Meanwhile, an association mining technique is employed to expand the tag set to tackle the sparsity problem. Experimental results show that our tag-oriented summarization achieves a significant improvement over approaches that do not use tags.
Lu, Tian-Bo
Li, Yang and Lu, Tian-Bo and Guo, Li and Tian, Zhi-Hong and Nie, Qin-Wu Towards Lightweight and Efficient DDoS Attacks Detection for Web Server. In this poster, based on our previous work in building a lightweight DDoS (Distributed Denial-of-Service) attack detection mechanism for web servers using TCM-KNN (Transductive Confidence Machines for K-Nearest Neighbors) and genetic algorithm based instance selection methods, we further propose a more efficient and effective instance selection method, named E-FCM (Extend Fuzzy C-Means). By using this method, we can obtain a much lower training time for TCM-KNN while ensuring high detection performance. Therefore, the optimized mechanism is more suitable for lightweight DDoS attack detection in a real network environment. In our previous work, we proposed an effective anomaly detection method based on the TCM-KNN algorithm to fulfill the DDoS attack detection task towards ensuring the QoS of web servers. The method is good at detecting network anomalies with a higher detection rate, higher confidence and fewer false positives than traditional methods, because it combines “strangeness” with “p-values” measures to evaluate the network traffic, in contrast to conventional detection based on ad-hoc thresholds or on particular definitions. Secondly, we utilize the new objective measurement as the input feature space of TCM-KNN, to effectively detect DDoS attacks against web servers. Finally, we introduce a Genetic Algorithm (GA) based instance selection method to boost the real-time detection performance of TCM-KNN and thus make it an effective and lightweight mechanism for DDoS detection for web servers [4, 5]. However, we found that the computational cost of GA is high, which results in a high training time for TCM-KNN.
Luo, Guan
Maghoul, Farzin
Yi, Jeonghee and Maghoul, Farzin Query Clustering using Click-Through Graph. In this paper we describe the problem of discovering query clusters from a click-through graph of web search logs. The graph consists of a set of web search queries, a set of pages selected for the queries, and a set of directed edges, each connecting a query node to a page node clicked by a user for the query. The proposed method extracts all maximal bipartite cliques (bicliques) from a click-through graph and computes an equivalence set of queries (i.e., a query cluster) from the maximal bicliques. A cluster of queries is formed from the queries in a biclique. We present a scalable algorithm that enumerates all maximal bicliques from the click-through graph. We have conducted experiments on Yahoo web search queries and the result is promising.
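As a small illustration of the biclique idea in the abstract above, the sketch below enumerates maximal bicliques of a toy click-through graph by closing the clicked-page sets under intersection, then keeps the query sides that contain several queries. This is a naive enumeration, not the scalable algorithm the authors present; the function names, the `min_queries` threshold and the example queries are hypothetical.

```python
def maximal_bicliques(clicks):
    """clicks: dict mapping a query to the set of pages clicked for it.

    Returns a set of (queries, pages) pairs, both frozensets, such that every
    query clicked every page and neither side can be extended.
    """
    # Every maximal biclique's page side is an intersection of clicked-page sets.
    closed = set()
    frontier = {frozenset(pages) for pages in clicks.values() if pages}
    while frontier:
        closed |= frontier
        new = set()
        for page_set in frontier:
            for pages in clicks.values():
                inter = page_set & pages
                if inter and inter not in closed:
                    new.add(frozenset(inter))
        frontier = new
    bicliques = set()
    for page_set in closed:
        queries = frozenset(q for q, pages in clicks.items() if page_set <= pages)
        bicliques.add((queries, page_set))
    return bicliques


def query_clusters(clicks, min_queries=2):
    """Query sides of maximal bicliques that contain at least min_queries queries."""
    return {qs for qs, _ in maximal_bicliques(clicks) if len(qs) >= min_queries}


# Hypothetical toy click-through data.
clicks = {
    "jaguar car": {"a", "b"},
    "jaguar auto": {"a", "b"},
    "jaguar cat": {"c"},
    "big cat": {"c", "d"},
}
bicliques = maximal_bicliques(clicks)
clusters = query_clusters(clicks)
```

The closure step is exponential in the worst case, which is why a scalable enumeration method matters on real search logs.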
Mahanti, Anirban
Manca, Roberto
Manca, Roberto and Massidda, Francesco and Carboni, Davide Visualization of Geo-annotated Pictures in Mobile Phones. In this work, a novel mobile browser for geo-referenced pictures is introduced and described. We use the term browser to denote a system aimed at browsing pictures selected from a large set, such as those of Internet photo sharing services. Three criteria are used to filter the subset of pictures to browse: the user's actual position, the user's actual heading, and the user's preferences. In this work we focus only on the first two criteria, leaving the integration of the user's preferences for future developments.
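The two criteria the abstract focuses on, position and heading, can be illustrated with a short sketch: keep the pictures that lie within some distance of the user and within an angular window around the direction the user is facing. The distance limit, the half field of view, the function names and the coordinates below are assumptions for illustration; the abstract does not describe how the actual system filters pictures.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0


def pictures_in_view(pictures, lat, lon, heading, max_dist_m=1000.0, half_fov=30.0):
    """pictures: iterable of (name, lat, lon). Returns names near and ahead of the user."""
    selected = []
    for name, plat, plon in pictures:
        if distance_m(lat, lon, plat, plon) > max_dist_m:
            continue
        # Smallest absolute angle between the picture's bearing and the heading.
        diff = abs((bearing_deg(lat, lon, plat, plon) - heading + 180.0) % 360.0 - 180.0)
        if diff <= half_fov:
            selected.append(name)
    return selected


# Hypothetical pictures around a user who is facing north.
pictures = [
    ("north", 39.005, 9.0),
    ("east", 39.0, 9.005),
    ("far_north", 39.1, 9.0),
]
in_view = pictures_in_view(pictures, lat=39.0, lon=9.0, heading=0.0)
```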
Mao, Robert
Zhang, Dell and Mao, Robert and Li, Wei The Recurrence Dynamics of Social Tagging. How often do tags recur? How hard is predicting tag recurrence? What tags are likely to recur? We try to answer these questions by analysing the RSDC08 dataset, in both individual and collective settings. Our findings provide useful insights for the development of tag suggestion techniques etc.
Marian, Ludmila
Baykan, Eda and Henzinger, Monika and Marian, Ludmila and Weber, Ingmar Purely URL-based Topic Classification. Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content [7], but a URL-only classifier is preferable, (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page’s content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser, without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to download it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
Martin, Ludger
Masseglia, Florent
Wang, Wei and Masseglia, Florent and Guyet, Thomas and Quiniou, Rene and Cordier, Marie-Odile A General Framework for Adaptive and Online Detection of Web Attacks. Detection of web attacks is an important issue in the current defense-in-depth security framework. In this paper, we propose a novel general framework for adaptive and online detection of web attacks. The general framework can be based on any online clustering method. A detection model based on the framework is able to learn online and deal with “concept drift” in web audit data streams. Str-DBSCAN, our extension of DBSCAN [1] to streaming data, and StrAP [3] are both used to validate the framework. The detection model based on the framework automatically labels the web audit data and adapts to changes in normal behavior while identifying attacks through dynamic clustering of the streaming data. A very large set of real HTTP log data collected in our institute is used to validate the framework and the model. The preliminary testing results demonstrate its effectiveness.
Massidda, Francesco
Manca, Roberto and Massidda, Francesco and Carboni, Davide Visualization of Geo-annotated Pictures in Mobile Phones. In this work, a novel mobile browser for geo-referenced pictures is introduced and described. We use the term browser to denote a system aimed at browsing pictures selected from a large set, such as those of Internet photo sharing services. Three criteria are used to filter the subset of pictures to browse: the user's actual position, the user's actual heading, and the user's preferences. In this work we focus only on the first two criteria, leaving the integration of the user's preferences for future developments.
Mathur, Vipul
Mathur, Vipul and Dhopeshwarkar, Sanket and Apte, Varsha MASTH Proxy: An Extensible Platform for Web Overload Control. Many overload control mechanisms for Web based applications aim to prevent overload by setting limits on factors such as admitted load, number of server threads, and buffer size. For this they need online measurements of metrics such as response time, throughput, and resource utilization. This requires instrumentation of the server by modifying server code, which may not be feasible or desirable. An alternate approach is to use a proxy between the clients and servers. We have developed a proxy-based overload control platform called MASTH Proxy (Multi-class Admission-controlled Self-Tuning Http Proxy). It records detailed measurements, supports multiple request classes, manages queues of HTTP requests, provides tunable parameters and enables easy implementation of dynamic overload control. This gives designers of overload control schemes a platform where they can concentrate on developing the core control logic, without the need to modify upstream server code.
Medelyan, Olena
Mei, Tao
Li, Lusong and Mei, Tao and Liu, Chris and Hua, Xian-Sheng GameSense. This paper presents a novel game-like advertising system called GameSense, which is driven by the compelling contents of online images. Given a Web page which typically contains images, GameSense is able to select suitable images to create online in-image games for advertising. The contextually relevant ads (i.e., product logos) are embedded at appropriate positions within the online games. The ads are selected based not only on textual relevance but also on visual content similarity. The game is able to provide viewers with a rich experience and thus promote the embedded ads for more effective advertising.
Mesquita, Filipe
Toda, Guilherme A. and Cortez, Eli and Mesquita, Filipe and da Silva, Altigran S. and Moura, Edleno and Neubert, Marden Automatically Filling Form-Based Web Interfaces with Free Text Inputs. On the web of today the most prevalent solution for users to interact with data-intensive applications is the use of form-based interfaces composed of several data input fields, such as text boxes, radio buttons, pull-down lists, check boxes, etc. Although these interfaces are popular and effective, in many cases, free text interfaces are preferred over form-based ones. In this paper we discuss the proposal and the implementation of a novel IR-based method for using data-rich free text to interact with form-based interfaces. Our solution takes a free text as input, implicitly extracts data values from it and fills the appropriate fields using them. For this task, we rely on values of previous submissions for each field, which are freely obtained from the usage of form-based interfaces.
Michel, Sebastian
Michel, Sebastian and Weber, Ingmar Rethinking Email Message and People Search. We show how a number of novel email search features can be implemented without any kind of natural language processing (NLP) or advanced data mining. Our approach inspects the email headers of all messages a user has ever sent or received and it creates simple per-contact summaries, including simple information about the message exchange history, the domain of the sender or even the sender’s gender. With these summaries advanced questions/tasks such as “Who do I still need to reply to?” or “Find ‘fun’ messages sent by friends.” become possible. As a proof of concept, we implemented a Mozilla-Thunderbird extension, adding powerful people search to the popular email client.
Min, Jun-Ki
Lee, Jihyun and Min, Jun-Ki and Chung, Chin-Wan An Effective Semantic Search Technique using Ontology. In this paper, we present a semantic search technique considering the type of desired Web resources and the semantic relationships between the resources and the query keywords in the ontology. In order to effectively retrieve the most relevant top-k resources, we propose a novel ranking model. To do this, we devise a measure to determine the weight of the semantic relationship. In addition, we consider the number of meaningful semantic relationships between a resource and keywords, the coverage of keywords, and the distinguishability of keywords. Through experiments using real datasets, we observe that our ranking model provides more accurate semantic search results compared to existing ranking models.
Mitra, Siddharth
Morishima, Atsuyuki
Moura, Edleno
Toda, Guilherme A. and Cortez, Eli and Mesquita, Filipe and da Silva, Altigran S. and Moura, Edleno and Neubert, Marden Automatically Filling Form-Based Web Interfaces with Free Text Inputs. On the web of today the most prevalent solution for users to interact with data-intensive applications is the use of form-based interfaces composed of several data input fields, such as text boxes, radio buttons, pull-down lists, check boxes, etc. Although these interfaces are popular and effective, in many cases, free text interfaces are preferred over form-based ones. In this paper we discuss the proposal and the implementation of a novel IR-based method for using data-rich free text to interact with form-based interfaces. Our solution takes a free text as input, implicitly extracts data values from it and fills the appropriate fields using them. For this task, we rely on values of previous submissions for each field, which are freely obtained from the usage of form-based interfaces.
Murakami, Yohei
Nadamoto, Akiyo and Aramaki, Eiji and Abekawa, Takeshi and Murakami, Yohei Content Hole Search in Community-type Content. In community-type content such as blogs and SNSs, we call the user’s unawareness of information a “content hole” and the search for this information a “content hole search.” A content hole search differs from similarity searching and has a variety of types. In this paper, we propose different types of content holes and define each type. We also propose an analysis of dialogue related to community-type content and introduce content hole search by using Wikipedia as an example.
Nadamoto, Akiyo
Nadamoto, Akiyo and Aramaki, Eiji and Abekawa, Takeshi and Murakami, Yohei Content Hole Search in Community-type Content. In community-type content such as blogs and SNSs, we call the user’s unawareness of information a “content hole” and the search for this information a “content hole search.” A content hole search differs from similarity searching and has a variety of types. In this paper, we propose different types of content holes and define each type. We also propose an analysis of dialogue related to community-type content and introduce content hole search by using Wikipedia as an example.
Nakamizo, Akiyoshi
Nam, Wonhong
Kil, Hyunyoung and Nam, Wonhong and Lee, Dongwon Automatic Web Service Composition with Abstraction and Refinement. The behavioral description based Web Service Composition (WSC) problem aims at the automatic construction of a coordinator web service that controls a set of web services to reach a goal state. However, solving the WSC problem exactly with a realistic model is doubly-exponential in the number of variables in web service descriptions. In this paper, we propose a novel efficient approximation-based algorithm using automatic abstraction and refinement to dramatically reduce the number of variables needed to solve the problem.
Neubert, Marden
Toda, Guilherme A. and Cortez, Eli and Mesquita, Filipe and da Silva, Altigran S. and Moura, Edleno and Neubert, Marden Automatically Filling Form-Based Web Interfaces with Free Text Inputs. On the web of today the most prevalent solution for users to interact with data-intensive applications is the use of form-based interfaces composed of several data input fields, such as text boxes, radio buttons, pull-down lists, check boxes, etc. Although these interfaces are popular and effective, in many cases, free text interfaces are preferred over form-based ones. In this paper we discuss the proposal and the implementation of a novel IR-based method for using data-rich free text to interact with form-based interfaces. Our solution takes a free text as input, implicitly extracts data values from it and fills the appropriate fields using them. For this task, we rely on values of previous submissions for each field, which are freely obtained from the usage of form-based interfaces.
Ni, Xiaochuan
Ni, Xiaochuan and Sun, Jian-Tao and Hu, Jian and Chen, Zheng Mining Multilingual Topics from Wikipedia. In this paper, we try to leverage a large-scale and multilingual knowledge base, Wikipedia, to help effectively analyze and organize Web information written in different languages. Based on the observation that one Wikipedia concept may be described by articles in different languages, we adapt an existing topic modeling algorithm for mining multilingual topics from this knowledge base. The extracted “universal” topics have multiple types of representations, with each type corresponding to one language. Accordingly, new documents of different languages can be represented in a space using a group of universal topics, which makes various multilingual Web applications feasible.
Ni, Yuan
Lu, Bin and Wu, Zhaohui and Ni, Yuan and Xie, Guotong and Zhou, Chunying and Chen, Huajun sMash: Semantic-based Mashup Navigation for Data API Network. With the proliferation of data APIs, it is not uncommon for users who have no clear idea about data APIs to have difficulty building mashups that satisfy their requirements. In this paper, we present a semantic-based mashup navigation system, sMash, which makes mashup building easy by constructing and visualizing a real-life data API network. We build a sample network by gathering more than 300 popular APIs and find that the relationships between them are so complex that our system will play an important role in guiding users and inspiring them to build interesting mashups easily. The system is accessible at: http://www.dart.zju.edu.cn/mashup.
Nie, Qin-Wu
Li, Yang and Lu, Tian-Bo and Guo, Li and Tian, Zhi-Hong and Nie, Qin-Wu Towards Lightweight and Efficient DDoS Attacks Detection for Web Server. In this poster, based on our previous work in building a lightweight DDoS (Distributed Denial-of-Service) attack detection mechanism for web servers using TCM-KNN (Transductive Confidence Machines for K-Nearest Neighbors) and genetic algorithm based instance selection methods, we further propose a more efficient and effective instance selection method, named E-FCM (Extend Fuzzy C-Means). By using this method, we can obtain a much lower training time for TCM-KNN while ensuring high detection performance. Therefore, the optimized mechanism is more suitable for lightweight DDoS attack detection in a real network environment. In our previous work, we proposed an effective anomaly detection method based on the TCM-KNN algorithm to fulfill the DDoS attack detection task towards ensuring the QoS of web servers. The method is good at detecting network anomalies with a higher detection rate, higher confidence and fewer false positives than traditional methods, because it combines “strangeness” with “p-values” measures to evaluate the network traffic, in contrast to conventional detection based on ad-hoc thresholds or on particular definitions. Secondly, we utilize the new objective measurement as the input feature space of TCM-KNN, to effectively detect DDoS attacks against web servers. Finally, we introduce a Genetic Algorithm (GA) based instance selection method to boost the real-time detection performance of TCM-KNN and thus make it an effective and lightweight mechanism for DDoS detection for web servers [4, 5]. However, we found that the computational cost of GA is high, which results in a high training time for TCM-KNN.
Nikolaev, Kirill
Nikolaev, Kirill and Zudina, Ekaterina and Gorshkov, Andrey Combining Anchor Text Categorization and Graph Analysis for Paid Link Detection. In order to artificially boost the rank of commercial pages in search engine results, search engine optimizers pay for links to these pages on other websites. Identifying paid links is important for a web search engine to produce highly relevant results. In this paper we introduce a novel method of identifying such links. We start by training a classifier of anchor text topics and analyzing web pages for the diversity of their outgoing commercial links. Then we use this information and analyze the link graph of the Russian Web to find pages that sell links and sites that buy links, and to identify the paid links. Testing on manually marked samples showed high efficiency of the algorithm.
Nussbaumer, Martin
Keller, Matthias and Nussbaumer, Martin Cascading Style Sheets: A Novel Approach Towards Productive Styling with Today's Standards. In this paper we present an approach for generating Cascading Style Sheet documents automatically when the desired effect on the content elements is specified. While a Web user agent resolves the CSS rules and computes their effect, our approach handles the way back. We argue that this can remarkably improve CSS productivity, since the process of CSS authoring always involves this direction implicitly. Our approach offers a new and innovative way to reuse chunks of markup together with their presentation. It furthermore bears potential for the optimization and reorganization of CSS documents. We describe the criteria for CSS code quality that guided us, including a quantitative indicator for the abstractness of a CSS presentation specification. An evaluation and recomputation of the CSS for 25,000 HTML documents shows that, with respect to these criteria, the automatically generated code comes close to manually authored code.
Ofek-Koifman, Shila
Amitay, Einat and Carmel, David and Har'El, Nadav and Ofek-Koifman, Shila and Soffer, Aya and Yogev, Sivan and Golbandi, Nadav Social Search and Discovery Using a Unified Approach. We explore new ways of improving a search engine using data from Web 2.0 applications such as blogs and social bookmarks. This data contains entities such as documents, people and tags, and relationships between them. We propose a simple yet effective method, based on faceted search, that treats all entities in a unified manner: returning all of them (documents, people and tags) on every search, and allowing all of them to be used as search terms. We describe an implementation of such a social search engine on the intranet of a large enterprise, and present large-scale experiments which verify the validity of our approach.
Oiwa, Yutaka
Oiwa, Yutaka and Takagi, Hiromitsu and Watanabe, Hajime and Suzuki, Hirofumi PAKE-based Mutual HTTP Authentication for Preventing Phishing Attacks. We developed a new Web authentication protocol with password-based mutual authentication which prevents various kinds of phishing attacks. This protocol protects users’ passwords against any phisher even if a dictionary attack is employed, and prevents phishers from giving users a false impression of successful authentication. The protocol is designed with interoperability in mind for the many recent Web applications that require features which current HTTP authentication does not provide. The protocol is proposed as an Internet Draft submitted to the IETF, and implemented on both the server side (as an Apache extension) and the client side (as a Mozilla-based browser and an IE-based one). Categories and Subject Descriptors: K.6.5 [Management of Computing and Information Systems]: Security and Protection - Authentication. General Terms: Security, Standardization. Keywords: Network protocol, Mutual authentication, HTTP.
Olivas, Jose A.
Romero, Francisco P. and Serrano-Guerrero, Jesus and Olivas, Jose A. Bucefalo: A Tool for Intelligent Search and Filtering for Web-based Personal Health Records. In this poster, a tool named BUCEFALO is presented. This tool is specially designed to improve information retrieval tasks in web-based Personal Health Records (PHR). It implements semantic and multilingual query expansion techniques and information filtering algorithms in order to help users find the most valuable information about a specific clinical case. The filtering model is based on fuzzy-prototype-based filtering, data quality measures, user profiles and healthcare ontologies. The first experimental results illustrate the feasibility of this tool. With PHR standards, the relevant health information is reliably and unambiguously tagged using XML within a single file. The use of XML allows this information to be read, understood and processed by any application which uses the standard. Google Health and Microsoft Health use a subset of the CCR (Continuity of Care Record) standard. The CCR standard is the most widely used patient health summary. A document in CCR format is an XML document that consists of a header, a footer, and a body of health data organized into as many as 17 sections, e.g. problems and conditions, medications list, allergies list, family history, procedures, encounters, etc. These web-based PHRs are examples of multi-user document repositories. The clinical reports can be read by different users (nurses, physicians, students) and for different purposes (diagnosis, learning, research). When a document repository has many users and many purposes, there are different points of view of the same repository structure. Therefore, a technique is needed that can manage these different points of view in knowledge retrieval tasks. In this case, fuzzy logic is especially recommendable due to its special features for modeling information retrieval applications.
Oliveira, Pedro
Oliveira, Pedro and Gomes, Paulo Instance-based Probabilistic Reasoning in the Semantic Web. Most of the approaches for dealing with uncertainty in the Semantic Web rely on the principle that this uncertainty is already asserted. In this paper, we propose a new approach to learn and reason about uncertainty in the Semantic Web. Using instance data, we learn the uncertainty of an OWL ontology, and use that information to perform probabilistic reasoning on it. For this purpose, we use Markov logic, a new representation formalism that combines logic with probabilistic graphical models. Asserting this uncertainty is, however, a cumbersome and difficult task, invalidating all the gains that could arise from the annotation. In fact, uncertainty is a common characteristic of the current Web. When we create a webpage, for example, search engines are responsible for asserting its probabilistic relevance, compared to other pages, to certain topics. We don’t have to explicitly state that information: we just create its content, and search engines do the rest. So, we must develop similar automatic mechanisms to perform reasoning in the Semantic Web. In this work, we study how we can perform probabilistic reasoning on OWL ontologies without any kind of uncertainty annotation. To assert the uncertainty of its axioms, we use solely the information of its instances. For this purpose, we use Markov logic [4], a novel approach that combines logic and probability in the same representation.
Paiu, Raluca
Bischoff, Kerstin and Firan, Claudiu S. and Paiu, Raluca Deriving Music Theme Annotations from User Tags. Music theme annotations would be really beneficial for supporting retrieval, but are often neglected by users while annotating. Thus, in order to support users in tagging and to fill the gaps in the tag space, in this paper we develop algorithms for recommending theme annotations. Our methods exploit already existing user tags, the lyrics of music tracks, as well as combinations of both. We compare the results for our recommended theme annotations against genre and style recommendations – a much easier and already studied task. We evaluate the quality of our recommended tags against an expert ground truth data set. Our results are promising and provide interesting insights into possible extensions for music tagging systems to support music search.
Pan, AiminZhang, Kaimin and Wang, Lu and Guo, Xiaolin and Pan, Aimin and Zhu, Bin B. WPBench: A Benchmark for Evaluating the Client-side Performance of Web 2.0 Applications. In this paper, a benchmark called WPBench is reported to evaluate the responsiveness of Web browsers for modern Web 2.0 applications. In WPBench, variations of servers and networks are removed, so the benchmark result is the closest to what Web users would perceive. To achieve this, WPBench records users’ interactions with typical Web 2.0 applications and then replays the Web navigations when benchmarking browsers. The replay mechanism can emulate the actual user interactions and the characteristics of the servers and the networks in a consistent way, independent of browsers, so that any standards-compliant browser can be benchmarked fairly. In addition to describing the design and generation of WPBench, we also report WPBench comparison results on the responsiveness of three popular Web browsers: Internet Explorer, Firefox and Chrome.
Pan, YueWang, Haofen and Liu, Qiaoling and Xue, Gui-Rong and Yu, Yong and Zhang, Lei and Pan, Yue Dataplorer: A Scalable Search Engine for the Data Web. More and more structured information in the form of semantic data is available nowadays. It offers a wide range of new possibilities, especially for semantic search and Web data integration. However, its effective exploitation still brings about a number of challenges, e.g. usability, scalability and uncertainty. In this paper, we present Dataplorer, a solution designed to address these challenges. We address usability through hybrid queries and faceted search, while preserving scalability thanks to an extension of the inverted index that supports this type of query. Moreover, Dataplorer deals with uncertainty by means of a powerful ranking scheme to find relevant results. Our experimental results show that the proposed approach is promising, and they make us believe that it is possible to extend the current IR infrastructure to query and search the Web of data. Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval. General Terms: Algorithms, Performance, Experimentation. Keywords: hybrid query, inverted index, ranking, faceted search. [...] sake of the others. The usability challenge is addressed by providing the user with hybrid query capabilities, leveraging the power of structured queries and the ease of use of keyword search. We also propose a faceted search functionality that allows users to progressively compose the structured part of their information need after having started with imprecise keywords. Scalability is one of the main challenges that hybrid queries face, due to the large amount of data. Inspired by work on DB and IR integration, we make IR compatible with hybrid search through an extension of the inverted index, so that it can scale as well as handle structured information.
To ensure that uncertainty does not prevent relevant results from being returned, we provide a powerful ranking scheme that considers the structures of both the data and the hybrid queries for score propagation and aggregation during result ranking. As an improvement on our previous work [3], we support faceted search with integrated ranking to tackle both usability and uncertainty issues while preserving efficiency.
Pantel, PatrickChang, William and Pantel, Patrick and Popescu, Ana-Maria and Gabrilovich, Evgeniy Towards Intent-Driven Bidterm Suggestion. In online advertising, pervasive in commercial search engines, advertisers typically bid on few terms, and the scarcity of data makes ad matching difficult. Suggesting additional bidterms can significantly improve ad clickability and conversion rates. In this paper, we present a large-scale bidterm suggestion system that models an advertiser’s intent and finds new bidterms consistent with that intent. Preliminary experiments show that our system significantly increases the coverage of a state of the art production system used at Yahoo while maintaining comparable precision.
Paquier, MicaëlSire, Stéphane and Paquier, Micaël and Vagner, Alain and Bogaerts, Jérôme A Messaging API for Inter-Widgets Communication. Widget containers are used everywhere on the Web, for instance as customizable start pages to Web desktops. In this poster, we describe the extension of a widget container with an inter-widgets communication layer, as well as the subsequent application programming interfaces (APIs) added to the Widget object to support this feature. We present the benefits of a drag and drop facility within widgets and conclude by a call for standardization of inter-widgets communication on the Web.
Parikh, NishParikh, Nish and Sundaresan, Neel Buzz-Based Recommender System. In this paper, we describe a buzz-based recommender system based on a large source of queries in an eCommerce application. The system detects bursts in query trends. These bursts are linked to external entities like news and inventory information to find the queries currently in-demand which we refer to as buzz queries. The system follows the paradigm of limited quantity merchandising, in the sense that on a per-day basis the system shows recommendations around a single buzz query with the intent of increasing user curiosity, and improving activity and stickiness on the site. A semantic neighborhood of the chosen buzz query is selected and appropriate recommendations are made on products that relate to this neighborhood.
Paris, Pilar
Park, KunsooLee, Taehyung and Kim, Jinil and Wook Kim, Jin and Kim, Sung-Ryul and Park, Kunsoo Detecting Soft Errors by Redirection Classification. A soft error redirection is a URL redirection to a page that returns the HTTP status code 200 (OK) but has actually no relevant content to the client request. Since such redirections degrade the performance of web search engines in many ways, it is highly desirable to remove as many of them as possible. We propose a novel approach to detect soft error redirections by analyzing redirection logs collected during crawling operation. Experimental results on huge crawl data show that our measure can classify soft error redirections effectively.
Pasca, MariusReisinger, Joseph and Pasca, Marius Bootstrapped Extraction of Class Attributes. As an alternative to previous studies on extracting class attributes from unstructured text, which consider either Web documents or query logs as the source of textual data, a bootstrapped method extracts class attributes simultaneously from both sources, using a small set of seed attributes. The method improves extraction precision and also improves attribute relevance across 40 test classes.
Pei, JianWang, Junfeng and He, Xiaofei and Wang, Can and Pei, Jian and Bu, Jiajun and Chen, Chun and Guan, Ziyu and Gang, Lu News Article Extraction with Template-Independent Wrapper. We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
Peng, Jin
Perego, RaffaeleBaraglia, Ranieri and Cacheda, Fidel and Carneiro, Victor and Formoso, Vreixo and Perego, Raffaele and Silvestri, Fabrizio Search Shortcuts: Driving Users Towards Their Goals. Giving suggestions to users of Web-based services is a common practice aimed at enhancing their navigation experience. Major Web search engines usually provide suggestions in the form of queries that are, to some extent, related to the current query typed by the user and to the knowledge learned from past usage of the system. In this work we introduce Search Shortcuts: “successful” queries that, in the past, allowed users to satisfy their information needs. Unlike conventional suggestion techniques, search shortcuts allow effectiveness to be evaluated with a simple train-and-test approach. We have applied several Collaborative Filtering algorithms to this problem, evaluating them on real query log data. We generate the shortcuts from all user sessions belonging to the testing set, and measure the quality of the suggested shortcuts by considering the similarity between them and the navigational user behavior.
Platakis, ManolisPlatakis, Manolis and Kotsakos, Dimitrios and Gunopulos, Dimitrios Searching for Events in the Blogosphere. Over the last few years, blogs (web logs) have gained massive popularity and have become one of the most influential web social media in our times. Every blog post in the Blogosphere has a well defined timestamp, which is not taken into account by search engines. By conducting research regarding this feature of the Blogosphere, we can attempt to discover bursty terms and correlations between them during a time interval. We apply Kleinberg’s automaton on extracted titles of blog posts to discover bursty terms, we introduce a novel representation of a term’s burstiness evolution called State Series and we employ a Euclidean-based distance metric to discover potential correlations between terms without taking into account their context. We evaluate the results trying to match them with real life events. Finally, we propose some ideas for further evaluation techniques and future research in the field.
Popescu, AdrianPopescu, Adrian and Grefenstette, Gregory Deducing Trip Related Information from Flickr. Uploading tourist photos is a popular activity on photo sharing platforms. These photographs and their associated metadata (tags, geo-tags, and temporal information) should be useful for mining information about the sites visited. However, user-supplied metadata are often noisy and efficient filtering methods are needed before extracting useful knowledge. We focus here on exploiting temporal information, associated with tourist sites that appear in Flickr. From automatically filtered sets of geo-tagged photos, we deduce answers to questions like “how long does it take to visit a tourist attraction?” or “what can I visit in one day in this city?” Our method is evaluated and validated by comparing the automatically obtained visit duration times to manual estimations.
Popescu, Ana-MariaChang, William and Pantel, Patrick and Popescu, Ana-Maria and Gabrilovich, Evgeniy Towards Intent-Driven Bidterm Suggestion. In online advertising, pervasive in commercial search engines, advertisers typically bid on few terms, and the scarcity of data makes ad matching difficult. Suggesting additional bidterms can significantly improve ad clickability and conversion rates. In this paper, we present a large-scale bidterm suggestion system that models an advertiser’s intent and finds new bidterms consistent with that intent. Preliminary experiments show that our system significantly increases the coverage of a state of the art production system used at Yahoo while maintaining comparable precision.
Porras, Mercè
Qing Chang, ElaineYan, Jun and Liu, Ning and Qing Chang, Elaine and Ji, Lei and Chen, Zheng Search Result Re-ranking Based on Gap between Search Queries and Social Tags. Both search engine click-through logs and social annotations have been utilized as user feedback for search result re-ranking. However, to the best of our knowledge, no previous study has explored the correlation between these two factors for the task of search result re-ranking. In this paper, we show that the gap between the search queries and the social tags of the same web page can reflect its user preference score well. Motivated by this observation, we propose a novel algorithm, called Query-Tag-Gap (QTG), to re-rank search results for better user satisfaction. Intuitively, on the one hand, search users’ intentions are generally described by their queries before they read the search results. On the other hand, web annotators semantically tag web pages after they read the content of the pages. The difference between users’ recognition of the same page before and after they read it is a good reflection of user satisfaction. In this extended abstract, we formally define the query set and tag set of the same page as users’ pre- and post-knowledge respectively. We empirically show the strong correlation between user satisfaction and the user’s knowledge gap before and after reading the page. Based on this gap, experiments have shown outstanding performance of our proposed QTG algorithm in search result re-ranking.
Qiu, BingyuYanai, Keiji and Qiu, Bingyu Mining Cultural Differences from a Large Number of Geotagged Photos. We propose a novel method to detect cultural differences over the world automatically by using a large amount of geotagged images on the photo sharing Web sites such as Flickr. We employ the state-of-the-art object recognition technique developed in the research community of computer vision to mine representative photos of the given concept for representative local regions from a large-scale unorganized collection of consumer-generated geotagged photos. The results help us understand how objects, scenes or events corresponding to the same given concept are visually different depending on local regions over the world.
Qiu, GuangWu, Hao and Qiu, Guang and He, Xiaofei and Shi, Yuan and Qu, Mingcheng and Shen, Jing and Bu, Jiajun and Chen, Chun Advertising Keyword Generation Using Active Learning. This paper proposes an efficient relevance feedback based interactive model for keyword generation in sponsored search advertising. We formulate the ranking of relevant terms as a supervised learning problem and suggest new terms for the seed by leveraging user relevance feedback information. Active learning is employed to select the most informative samples from a set of candidate terms for user labeling. Experiments show our approach improves the relevance of generated terms significantly with little user effort required.
Qu, Mingcheng and Qiu, Guang and He, Xiaofei and Zhang, Cheng and Wu, Hao and Bu, Jiajun and Chen, Chun Probabilistic Question Recommendation for Question Answering Communities. User-Interactive Question Answering (QA) communities such as Yahoo! Answers are growing in popularity. However, as these QA sites always have thousands of new questions posted daily, it is difficult for users to find the questions that are of interest to them. Consequently, this may delay the answering of the new questions. This gives rise to question recommendation techniques that help users locate interesting questions. In this paper, we adopt the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and propose a novel metric to evaluate the performance of our approach. The experimental results show our recommendation approach is effective.
Qu, MingchengWu, Hao and Qiu, Guang and He, Xiaofei and Shi, Yuan and Qu, Mingcheng and Shen, Jing and Bu, Jiajun and Chen, Chun Advertising Keyword Generation Using Active Learning. This paper proposes an efficient relevance feedback based interactive model for keyword generation in sponsored search advertising. We formulate the ranking of relevant terms as a supervised learning problem and suggest new terms for the seed by leveraging user relevance feedback information. Active learning is employed to select the most informative samples from a set of candidate terms for user labeling. Experiments show our approach improves the relevance of generated terms significantly with little user effort required.
Qu, Mingcheng and Qiu, Guang and He, Xiaofei and Zhang, Cheng and Wu, Hao and Bu, Jiajun and Chen, Chun Probabilistic Question Recommendation for Question Answering Communities. User-Interactive Question Answering (QA) communities such as Yahoo! Answers are growing in popularity. However, as these QA sites always have thousands of new questions posted daily, it is difficult for users to find the questions that are of interest to them. Consequently, this may delay the answering of the new questions. This gives rise to question recommendation techniques that help users locate interesting questions. In this paper, we adopt the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and propose a novel metric to evaluate the performance of our approach. The experimental results show our recommendation approach is effective.
Zhu, Junyan and Wang, Can and He, Xiaofei and Bu, Jiajun and Chen, Chun and Shang, Shujie and Qu, Mingcheng and Lu, Gang Tag-Oriented Document Summarization. Social annotations on a Web document are highly generalized descriptions of the topics contained in that page. The frequency with which each tag is applied indicates the degree of user attention. This makes annotations a good resource for summarizing multiple topics in a Web page. In this paper, we present a tag-oriented Web document summarization approach that uses both the document content and the tags annotated on that document. To improve summarization performance, a new tag ranking algorithm named EigenTag is proposed to reduce noise in tags. Meanwhile, an association mining technique is employed to expand the tag set and tackle the sparsity problem. Experimental results show that our tag-oriented summarization is a significant improvement over approaches that do not use tags.
Quiniou, ReneWang, Wei and Masseglia, Florent and Guyet, Thomas and Quiniou, Rene and Cordier, Marie-Odile A General Framework for Adaptive and Online Detection of Web Attacks. Detection of web attacks is an important issue in current defense-in-depth security frameworks. In this paper, we propose a novel general framework for adaptive and online detection of web attacks. The general framework can be based on any online clustering method. A detection model based on the framework is able to learn online and deal with “concept drift” in web audit data streams. Str-DBSCAN, our extension of DBSCAN [1] to streaming data, and StrAP [3] are both used to validate the framework. The detection model based on the framework automatically labels the web audit data and adapts to changes in normal behavior while identifying attacks through dynamic clustering of the streaming data. A very large set of real HTTP log data collected in our institute is used to validate the framework and the model. The preliminary testing results demonstrate its effectiveness.
Ramanujam, SunithaRamanujam, Sunitha and Gupta, Anubha and Khan, Latifur and Seida, Steven and Thuraisingham, Bhavani Relationalizing RDF Stores for Tools Reusability. The emergence of Semantic Web technologies and standards such as Resource Description Framework (RDF) has introduced novel data storage models such as the RDF Graph Model. In this paper, we present a research effort called R2D, which attempts to bridge the gap between RDF and RDBMS concepts by presenting a relational view of RDF data stores. Thus, R2D is essentially a relational wrapper around RDF stores that aims to make the variety of stable relational tools that are currently in the market available to RDF stores without data duplication and synchronization issues.
Ravada, SivaAn, Ning and Chatterjee, Raja and Horhammer, Mike and Ravada, Siva Securely Implementing Open Geospatial Consortium Web Service Interface Standards in Oracle Spatial. In this paper, we briefly describe the implementation of various Open Geospatial Consortium Web Service Interface Standards in Oracle Spatial 11g. We highlight how we utilize Oracle’s implementation of OASIS Web Services Security (WSS) to provide a robust security framework for these OGC Web Services. We also discuss our future direction in supporting OGC Web Service Interface Standards. In addition to the mandated XML interfaces, Oracle Spatial WFS, Oracle Spatial CSW and Oracle Spatial OpenLS also support SOAP interfaces. To improve performance, Oracle Spatial WFS and Oracle Spatial CSW also implement a caching mechanism that supports retrieving records from a single query across different web requests. Below, we describe each of the supported OGC Web Services in Oracle Spatial. Due to space limits, we emphasize Oracle Spatial WFS support to illustrate our unique implementation.
Reisinger, JosephReisinger, Joseph and Pasca, Marius Bootstrapped Extraction of Class Attributes. As an alternative to previous studies on extracting class attributes from unstructured text, which consider either Web documents or query logs as the source of textual data, a bootstrapped method extracts class attributes simultaneously from both sources, using a small set of seed attributes. The method improves extraction precision and also improves attribute relevance across 40 test classes.
Rensing, ChristophScholl, Philipp and Domínguez García, Renato and Böhnstedt, Doreen and Rensing, Christoph and Steinmetz, Ralf Towards Language-Independent Web Genre Detection. The term web genre denotes the type of a given web resource, in contrast to the topic of its content. In this research, we focus on recognizing the web genres blog, wiki and forum. We present a set of features that exploit the hierarchical structure of the web page’s HTML mark-up and thus, in contrast to related approaches, do not depend on a linguistic analysis of the page’s content. Our results show that it is possible to achieve very good accuracy for a fully language-independent detection of structured web genres. [...] features (e.g. part-of-speech tagging and document terms), structural features (e.g. HTML tag frequencies, use of facets that enable functionality, such as form input elements) and simple text statistics (e.g. frequencies of punctuation). However, a fact often neglected by related work is that the absolute dominance of the English language on the web is decreasing. Thus, it is important to develop a way of recognizing web genres independently of the language used on the respective web page. As many genres exhibit a certain structural and visual layout, this property makes it possible to ignore linguistic features altogether.
Ribera, Mireia
Romero, Francisco P.Romero, Francisco P. and Serrano-Guerrero, Jesus and Olivas, Jose A. Bucefalo: A Tool for Intelligent Search and Filtering for Web-based Personal Health Records. In this poster, a tool named BUCEFALO is presented. This tool is specially designed to improve information retrieval tasks in web-based Personal Health Records (PHR). It implements semantic and multilingual query expansion techniques and information filtering algorithms in order to help users find the most valuable information about a specific clinical case. The filtering model is based on fuzzy-prototype-based filtering, data quality measures, user profiles and healthcare ontologies. The first experimental results illustrate the feasibility of this tool. [...] standards the relevant health information is reliably and unambiguously tagged using XML within a single file. The use of XML allows this information to be read, understood and processed by any application that uses the standard. Google Health and Microsoft Health use a subset of the CCR (Continuity of Care Record) standard. The CCR standard is the most used patient health summary. A document in CCR format is an XML document that consists of a header, a footer, and a body of health data organized into as many as 17 sections, e.g. problems and conditions, medications list, allergies list, family history, procedures, encounters, etc. These web-based PHRs are examples of multi-user document repositories. The clinical reports can be read by different users (nurses, physicians, students) and for different purposes (diagnosis, learning, research). When a document repository has many users and many purposes, there are different points of view of the same repository structure. Therefore, a technique is needed that can manage these different points of view in knowledge retrieval tasks. In this case, fuzzy logic is especially suitable because its features lend themselves to modeling information retrieval applications.
Sato, SatoshiSato, Satoshi Crawling English-Japanese Person-Name Transliterations from the Web. Automatic compilation of lexicons is a dream of lexicon compilers as well as lexicon users. This paper proposes a system that crawls English-Japanese person-name transliterations from the Web, which works as a back-end collector for automatic compilation of a bilingual person-name lexicon. Our crawler collected 561K transliterations in five months. From them, an English-Japanese person-name lexicon with 406K entries has been compiled by automatic post-processing. This lexicon is much larger than other similar resources, including the English-Japanese lexicon of HeiNER obtained from Wikipedia. [...] names written in Latin script are transliterated into Katakana script according to their pronunciation. English-Japanese transliteration of person names is difficult for several reasons, such as the limited coverage of existing bilingual lexicons, non-English (e.g., French and German) person names appearing in English texts, and spelling variants in Katakana script. 2. There is a possibility that we can compile a large English-Japanese person-name lexicon from the Web, because many transliteration instances of person names exist on the Web. In fact, human translators use the Web as a virtual low-quality bilingual lexicon. 3. New person names, and new person-name transliterations, are produced every day. Human translators want the bilingual person-name lexicon to be updated frequently. This paper proposes a system that crawls English-Japanese person-name transliterations from the Web, which works as a back-end collector for automatic lexicon compilation. From the collected transliterations, a bilingual person-name lexicon is produced by automatic post-processing. This attempt at automatic lexicon compilation can be viewed as a conversion from a virtual low-quality bilingual lexicon (i.e., the Web) to a real high-quality bilingual lexicon.
Scholl, PhilippScholl, Philipp and Domínguez García, Renato and Böhnstedt, Doreen and Rensing, Christoph and Steinmetz, Ralf Towards Language-Independent Web Genre Detection. The term web genre denotes the type of a given web resource, in contrast to the topic of its content. In this research, we focus on recognizing the web genres blog, wiki and forum. We present a set of features that exploit the hierarchical structure of the web page’s HTML mark-up and thus, in contrast to related approaches, do not depend on a linguistic analysis of the page’s content. Our results show that it is possible to achieve very good accuracy for a fully language-independent detection of structured web genres. [...] features (e.g. part-of-speech tagging and document terms), structural features (e.g. HTML tag frequencies, use of facets that enable functionality, such as form input elements) and simple text statistics (e.g. frequencies of punctuation). However, a fact often neglected by related work is that the absolute dominance of the English language on the web is decreasing. Thus, it is important to develop a way of recognizing web genres independently of the language used on the respective web page. As many genres exhibit a certain structural and visual layout, this property makes it possible to ignore linguistic features altogether.
Seida, StevenRamanujam, Sunitha and Gupta, Anubha and Khan, Latifur and Seida, Steven and Thuraisingham, Bhavani Relationalizing RDF Stores for Tools Reusability. The emergence of Semantic Web technologies and standards such as Resource Description Framework (RDF) has introduced novel data storage models such as the RDF Graph Model. In this paper, we present a research effort called R2D, which attempts to bridge the gap between RDF and RDBMS concepts by presenting a relational view of RDF data stores. Thus, R2D is essentially a relational wrapper around RDF stores that aims to make the variety of stable relational tools that are currently in the market available to RDF stores without data duplication and synchronization issues.
Serrano-Guerrero, JesusRomero, Francisco P. and Serrano-Guerrero, Jesus and Olivas, Jose A. Bucefalo: A Tool for Intelligent Search and Filtering for Web-based Personal Health Records. In this poster, a tool named BUCEFALO is presented. This tool is specially designed to improve information retrieval tasks in web-based Personal Health Records (PHR). It implements semantic and multilingual query expansion techniques and information filtering algorithms in order to help users find the most valuable information about a specific clinical case. The filtering model is based on fuzzy-prototype-based filtering, data quality measures, user profiles and healthcare ontologies. The first experimental results illustrate the feasibility of this tool. [...] standards the relevant health information is reliably and unambiguously tagged using XML within a single file. The use of XML allows this information to be read, understood and processed by any application that uses the standard. Google Health and Microsoft Health use a subset of the CCR (Continuity of Care Record) standard. The CCR standard is the most used patient health summary. A document in CCR format is an XML document that consists of a header, a footer, and a body of health data organized into as many as 17 sections, e.g. problems and conditions, medications list, allergies list, family history, procedures, encounters, etc. These web-based PHRs are examples of multi-user document repositories. The clinical reports can be read by different users (nurses, physicians, students) and for different purposes (diagnosis, learning, research). When a document repository has many users and many purposes, there are different points of view of the same repository structure. Therefore, a technique is needed that can manage these different points of view in knowledge retrieval tasks. In this case, fuzzy logic is especially suitable because its features lend themselves to modeling information retrieval applications.
Shang, ShujieZhu, Junyan and Wang, Can and He, Xiaofei and Bu, Jiajun and Chen, Chun and Shang, Shujie and Qu, Mingcheng and Lu, Gang Tag-Oriented Document Summarization. Social annotations on a Web document are highly generalized descriptions of the topics contained in that page. The frequency with which each tag is applied indicates the degree of user attention. This makes annotations a good resource for summarizing multiple topics in a Web page. In this paper, we present a tag-oriented Web document summarization approach that uses both the document content and the tags annotated on that document. To improve summarization performance, a new tag ranking algorithm named EigenTag is proposed to reduce noise in tags. Meanwhile, an association mining technique is employed to expand the tag set and tackle the sparsity problem. Experimental results show that our tag-oriented summarization is a significant improvement over approaches that do not use tags.
Shen, DanShen, Dan and Wu, Xiaoyuan and Bolivar, Alvaro Rare Item Detection in e-Commerce Site. As the largest online marketplace in the world, eBay has a huge inventory containing plenty of great rare items with potentially many, even rapturous, buyers. These items are obscured in the long tail of eBay item listings and are hard to find through existing searching or browsing methods. According to the eBay query log, there is great demand for rare items from users. To keep up with this demand, the paper proposes a method to automatically detect rare items in eBay's online listings. A large set of features relevant to the task is investigated to filter items and further measure item rareness. Experiments on the most rarity-demand-intensive domains show that the method can detect rare items effectively (> 90% precision).
Shen, JingWu, Hao and Qiu, Guang and He, Xiaofei and Shi, Yuan and Qu, Mingcheng and Shen, Jing and Bu, Jiajun and Chen, Chun Advertising Keyword Generation Using Active Learning. This paper proposes an efficient relevance feedback based interactive model for keyword generation in sponsored search advertising. We formulate the ranking of relevant terms as a supervised learning problem and suggest new terms for the seed by leveraging user relevance feedback information. Active learning is employed to select the most informative samples from a set of candidate terms for user labeling. Experiments show our approach improves the relevance of generated terms significantly with little user effort required.
Shi, YuanWu, Hao and Qiu, Guang and He, Xiaofei and Shi, Yuan and Qu, Mingcheng and Shen, Jing and Bu, Jiajun and Chen, Chun Advertising Keyword Generation Using Active Learning. This paper proposes an efficient relevance feedback based interactive model for keyword generation in sponsored search advertising. We formulate the ranking of relevant terms as a supervised learning problem and suggest new terms for the seed by leveraging user relevance feedback information. Active learning is employed to select the most informative samples from a set of candidate terms for user labeling. Experiments show our approach improves the relevance of generated terms significantly with little user effort required.
Shieh, Jyh-RenShieh, Jyh-Ren and Hsieh, Yung-Huan and Yeh, Yang-Ting and Chung Su, Tse and Lin, Ching-Yung and Wu, Ja-Ling Building Term Suggestion Relational Graphs from Collective Intelligence. This paper proposes an effective approach to provide relevant search terms for conceptual Web search. ‘Semantic Term Suggestion’ function has been included so that users can find the most appropriate query term to what they really need. Conventional approaches for term suggestion involve extracting frequently occurring key terms from retrieved documents. They must deal with term extraction difficulties and interference from irrelevant documents. In this paper, we propose a semantic term suggestion function called Collective Intelligence based Term Suggestion (CITS). CITS provides a novel social-network based framework for relevant terms suggestion with a semantic graph of the search term without limiting to the specific query term. A visualization of semantic graph is presented to the users to help browsing search results from related terms in the semantic graph. The search results are ranked each time according to their relevance to the related terms in the entire query session. Comparing to two popular commercial search engines, a user study of 18 users on 50 search terms showed better user satisfactions and indicated the potential usefulness of proposed method in real-world search applications.
Shin, HyoseopShin, Hyoseop and Lee, Jeehoon Ranking User-Created Contents by Search User's Inclination in Online Communities. Searching posts effectively has become an important issue in large-scale online communities. Especially, if search users have different inclinations when they search posts, they have different kinds of posts in their minds. To address this problem, in this paper, we propose a scheme of ranking posts based on search users’ inclination. User ranking score is employed to capture posts that are relevant to a specific user inclination. Specifically, we present a scheme to rank posts in terms of user expertise and popularity. Experimental results show that different user inclinations can produce quite different search results and the proposed scheme achieves about 70% accuracy.
Shiowattana, DungjitChung, Sukwon and Shiowattana, Dungjit and Dmitriev, Pavel and Chan, Su The Web of Nations. In this paper, we report on a large-scale study of structural differences among the national webs. The study is based on a web-scale crawl conducted in the summer of 2008. More specifically, we study two graphs derived from this crawl: the nation graph, with nodes corresponding to nations and edges to links among nations, and the host graph, with nodes corresponding to hosts and edges to hyperlinks among pages on the hosts. Contrary to some of the previous work [2], our results show that webs of different nations are often very different from each other, both in terms of their internal structure and in terms of their connectivity with other nations.
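The relationship between the two graphs described in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a host is assigned to a nation by its last DNS label; the abstract does not say how hosts are mapped to nations, and the function names `nation_of` and `nation_graph` are hypothetical.

```python
from collections import Counter


def nation_of(host: str) -> str:
    """Hypothetical mapping (an assumption): last DNS label as nation code."""
    return host.rsplit(".", 1)[-1].lower()


def nation_graph(host_edges):
    """Collapse (source_host, target_host) links into weighted nation edges."""
    edges = Counter()
    for src, dst in host_edges:
        edges[(nation_of(src), nation_of(dst))] += 1
    return edges


# Made-up host links, for illustration only.
host_edges = [
    ("a.example.de", "b.example.fr"),
    ("c.example.de", "d.example.fr"),
    ("a.example.de", "e.example.de"),
]
graph = nation_graph(host_edges)
```

Here each nation-to-nation edge carries the number of host-level links it aggregates, so links within a nation and links across nations can be compared directly.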
Silvestri, FabrizioBaraglia, Ranieri and Cacheda, Fidel and Carneiro, Victor and Formoso, Vreixo and Perego, Raffaele and Silvestri, Fabrizio Search Shortcuts: Driving Users Towards Their Goals. Giving suggestions to users of Web-based services is a common practice aimed at enhancing their navigation experience. Major Web search engines usually provide suggestions in the form of queries that are, to some extent, related to the current query typed by the user and to the knowledge learned from past usage of the system. In this work we introduce Search Shortcuts as “successful” queries that allowed users, in the past, to satisfy their information needs. Unlike conventional suggestion techniques, our search shortcuts allow effectiveness to be evaluated with a simple train-and-test approach. We have applied several Collaborative Filtering algorithms to this problem, evaluating them on real query log data. We generate the shortcuts from all user sessions belonging to the testing set, and measure the quality of the suggested shortcuts by considering the similarity between them and the navigational user behavior.
Sire, StéphaneSire, Stéphane and Paquier, Micaël and Vagner, Alain and Bogaerts, Jérôme A Messaging API for Inter-Widgets Communication. Widget containers are used everywhere on the Web, for instance as customizable start pages to Web desktops. In this poster, we describe the extension of a widget container with an inter-widgets communication layer, as well as the subsequent application programming interfaces (APIs) added to the Widget object to support this feature. We present the benefits of a drag and drop facility within widgets and conclude by a call for standardization of inter-widgets communication on the Web.
Soffer, AyaAmitay, Einat and Carmel, David and Har'El, Nadav and Ofek-Koifman, Shila and Soffer, Aya and Yogev, Sivan and Golbandi, Nadav Social Search and Discovery Using a Unified Approach. We explore new ways of improving a search engine using data from Web 2.0 applications such as blogs and social bookmarks. This data contains entities such as documents, people and tags, and relationships between them. We propose a simple yet effective method, based on faceted search, that treats all entities in a unified manner: returning all of them (documents, people and tags) on every search, and allowing all of them to be used as search terms. We describe an implementation of such a social search engine on the intranet of a large enterprise, and present large-scale experiments which verify the validity of our approach.
Song, Guo-JieDong, Zheng-Bin and Song, Guo-Jie and Xie, Kun-Qing and Wang, Jing-Yao An Experimental Study of Large-Scale Mobile Social Network. Mobile social network is a typical social network where one or more individuals of similar interests or commonalities, conversing and connecting with one another using the mobile phone. Our works in this paper focus on the experimental study for this kind of social network with the support of large-scale real mobile call data. The main contributions can be summarized as three-fold: firstly, a large-scale real mobile phone call log of one city has been extracted from a mobile phone carrier in China to construct mobile social network; secondly, common features of traditional social networks, such as power law distribution and small diameter etc, have been experimented, with which we confirm that the mobile social network is a typical scale-free network and has small-world phenomenon; lastly, different from traditional analytical methods, important properties of the actors, such as gender and age, have been introduced into our experiments with some interesting findings about human behavior, for example, the middle-age people are more active than the young and old people, and the female is unusual more active than the male while in the old age.
Steinmetz, RalfScholl, Philipp and Domínguez García, Renato and Böhnstedt, Doreen and Rensing, Christoph and Steinmetz, Ralf Towards Language-Independent Web Genre Detection. The term web genre denotes the type of a given web resource, in contrast to the topic of its content. In this research, we focus on recognizing the web genres blog, wiki and forum. We present a set of features that exploit the hierarchical structure of the web page’s HTML mark-up and thus, in contrast to related approaches, do not depend on a linguistic analysis of the page’s content. Our results show that it is possible to achieve very good accuracy for fully language-independent detection of structured web genres. ... features (e.g. part-of-speech tagging and document terms), structural features (e.g. HTML tag frequencies, use of facets used to enable functionalities like form input elements) and simple text statistics (e.g. frequencies of punctuation). However, a fact often neglected by related work is that the absolute dominance of the English language on the web is decreasing. Thus, it is important to develop a way of recognizing web genres independently of the language used on the respective web page. As many genres exhibit a certain structural and visual layout, this property makes it possible to ignore linguistic features altogether.
Suchanek, Fabian M.
Sugimoto, Shigeo
Sulé, Andreu
Sun, Jian-TaoNi, Xiaochuan and Sun, Jian-Tao and Hu, Jian and Chen, Zheng Mining Multilingual Topics from Wikipedia. In this paper, we try to leverage a large-scale and multilingual knowledge base, Wikipedia, to help effectively analyze and organize Web information written in different languages. Based on the observation that one Wikipedia concept may be described by articles in different languages, we adapt existing topic modeling algorithm for mining multilingual topics from this knowledge base. The extracted “universal” topics have multiple types of representations, with each type corresponding to one language. Accordingly, new documents of different languages can be represented in a space using a group of universal topics, which makes various multilingual Web applications feasible.
Sun, JimengLin, Yu-Ru and Sun, Jimeng and Castro, Paul and Konuru, Ravi and Sundaram, Hari and Kelliher, Aisling Extracting Community Structure through Relational Hypergraphs. Social media websites promote diverse user interaction on media objects as well as user actions with respect to other users. The goal of this work is to discover community structure in rich media social networks, and observe how it evolves over time, through analysis of multi-relational data. The problem is important in the enterprise domain where extracting emergent community structure on enterprise social media, can help in forming new collaborative teams, aid in expertise discovery, and guide long term enterprise reorganization. Our approach consists of three main parts: (1) a relational hypergraph model for modeling various social context and interactions; (2) a novel hypergraph factorization method for community extraction on multi-relational social data; (3) an online method to handle temporal evolution through incremental hypergraph factorization. Extensive experiments on real-world enterprise data suggest that our technique is scalable and can extract meaningful communities. To evaluate the quality of our mining results, we use our method to predict users’ future interests. Our prediction outperforms baseline methods (frequency counts, pLSA) by 36-250% on the average, indicating the utility of leveraging multi-relational social context by using our method.
Sundaram, HariLin, Yu-Ru and Sun, Jimeng and Castro, Paul and Konuru, Ravi and Sundaram, Hari and Kelliher, Aisling Extracting Community Structure through Relational Hypergraphs. Social media websites promote diverse user interaction on media objects as well as user actions with respect to other users. The goal of this work is to discover community structure in rich media social networks, and observe how it evolves over time, through analysis of multi-relational data. The problem is important in the enterprise domain where extracting emergent community structure on enterprise social media, can help in forming new collaborative teams, aid in expertise discovery, and guide long term enterprise reorganization. Our approach consists of three main parts: (1) a relational hypergraph model for modeling various social context and interactions; (2) a novel hypergraph factorization method for community extraction on multi-relational social data; (3) an online method to handle temporal evolution through incremental hypergraph factorization. Extensive experiments on real-world enterprise data suggest that our technique is scalable and can extract meaningful communities. To evaluate the quality of our mining results, we use our method to predict users’ future interests. Our prediction outperforms baseline methods (frequency counts, pLSA) by 36-250% on the average, indicating the utility of leveraging multi-relational social context by using our method.
Sundaresan, NeelParikh, Nish and Sundaresan, Neel Buzz-Based Recommender System. In this paper, we describe a buzz-based recommender system based on a large source of queries in an eCommerce application. The system detects bursts in query trends. These bursts are linked to external entities like news and inventory information to find the queries currently in-demand which we refer to as buzz queries. The system follows the paradigm of limited quantity merchandising, in the sense that on a per-day basis the system shows recommendations around a single buzz query with the intent of increasing user curiosity, and improving activity and stickiness on the site. A semantic neighborhood of the chosen buzz query is selected and appropriate recommendations are made on products that relate to this neighborhood.
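The burst-detection step mentioned in this abstract can be sketched as follows. The abstract does not specify the detector, so this sketch swaps in a simple trailing-window rule as an assumption: a day is flagged when its query count exceeds the window mean plus `k` standard deviations. The function name `burst_days` and the parameters `window` and `k` are hypothetical.

```python
from statistics import mean, pstdev


def burst_days(daily_counts, window=7, k=3.0):
    """Return indices of days whose count exceeds mean + k * stdev
    of the preceding `window` days (an assumed rule, not the paper's)."""
    bursts = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        threshold = mean(history) + k * pstdev(history)
        if daily_counts[i] > threshold:
            bursts.append(i)
    return bursts


# Made-up daily counts for one query; day 7 is an obvious spike.
counts = [10, 12, 11, 9, 10, 11, 10, 50, 12]
bursts = burst_days(counts)
```

A query flagged this way would only be a candidate; per the abstract, the system additionally links bursts to external entities such as news and inventory information before treating them as buzz queries.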
Suzuki, HirofumiOiwa, Yutaka and Takagi, Hiromitsu and Watanabe, Hajime and Suzuki, Hirofumi PAKE-based Mutual HTTP Authentication for Preventing Phishing Attacks. We developed a new Web authentication protocol with password-based mutual authentication that prevents various kinds of phishing attacks. The protocol protects users’ passwords against phishers even if a dictionary attack is employed, and prevents phishers from giving users a false sense of successful authentication. The protocol is designed for interoperability with many recent Web applications, which require features that current HTTP authentication does not provide. The protocol has been proposed as an Internet Draft submitted to the IETF and implemented on both the server side (as an Apache extension) and the client side (as a Mozilla-based browser and an IE-based one). Categories and Subject Descriptors: K.6.5 [Management of Computing and Information Systems]: Security and Protection - Authentication. General Terms: Security, Standardization. Keywords: Network protocol, Mutual authentication, HTTP.
Takagi, HiromitsuOiwa, Yutaka and Takagi, Hiromitsu and Watanabe, Hajime and Suzuki, Hirofumi PAKE-based Mutual HTTP Authentication for Preventing Phishing Attacks. We developed a new Web authentication protocol with password-based mutual authentication that prevents various kinds of phishing attacks. The protocol protects users’ passwords against phishers even if a dictionary attack is employed, and prevents phishers from giving users a false sense of successful authentication. The protocol is designed for interoperability with many recent Web applications, which require features that current HTTP authentication does not provide. The protocol has been proposed as an Internet Draft submitted to the IETF and implemented on both the server side (as an Apache extension) and the client side (as a Mozilla-based browser and an IE-based one). Categories and Subject Descriptors: K.6.5 [Management of Computing and Information Systems]: Security and Protection - Authentication. General Terms: Security, Standardization. Keywords: Network protocol, Mutual authentication, HTTP.
Tang, JieChen, Dewei and Tang, Jie and Li, Juanzi and Zhou, Lizhu Discovering the Staring People From Social Networks. In this paper, we study a novel problem of staring people discovery from social networks, which is concerned with finding people who are not only authoritative but also sociable in the social network. We formalize this problem as an optimization programming problem. Taking the co-author network as a case study, we define three objective functions and propose two methods to combine these objective functions. A genetic algorithm based method is further presented to solve this problem. Experimental results show that the proposed solution can effectively find the staring people from social networks.
Termens, Miquel
Thuraisingham, BhavaniLindamood, Jack and Heatherly, Raymond and Kantarcioglu, Murat and Thuraisingham, Bhavani Inferring Private Information Using Social Network Data. On-line social networks, such as Facebook, are increasingly utilized by many users. These networks allow people to publish details about themselves and connect to their friends. Some of the information revealed inside these networks is private and it is possible that corporations could use learning algorithms on the released data to predict undisclosed private information. In this paper, we explore how to launch inference attacks using released social networking data to predict undisclosed private information about individuals. We then explore the effectiveness of possible sanitization techniques that can be used to combat such inference attacks under different scenarios. ... social network data could be used to predict some individual private trait that a user is not willing to disclose (e.g., political or religious affiliation) and explore the effect of possible data sanitization alternatives on preventing such private information leakage. To our knowledge this is the first comprehensive paper that discusses the problem of inferring private traits using real-life social network data and possible sanitization approaches to prevent such inference. First, we present a modification of Naïve Bayes classification that is suitable for classifying large amounts of social network data. Our modified Naïve Bayes algorithm predicts privacy-sensitive trait information using both node traits and link structure. We compare the accuracy of our learning method based on link structure against the accuracy of our learning method based on node traits. Please see the extended version of this paper [3] for further details of our modified Naïve Bayes classifier.
In order to protect privacy, we sanitize both trait (e.g., deleting some information from a user’s on-line profile) and link details (e.g., deleting links between friends) and explore the effect they have on combating possible inference attacks. Our initial results indicate that just sanitizing trait information or link information may not be enough to prevent inference attacks and comprehensive sanitization techniques that involve both aspects are needed in practice. Similar to our paper, in [2], authors consider ways to infer private information via friendship links by creating a Bayesian Network from the links inside a social network. A similar privacy problem for online social networks is discussed in [4]. Compared to [2] and [4], we provide techniques that help in choosing the most effective traits or links that need to be removed for protecting privacy.
Ramanujam, Sunitha and Gupta, Anubha and Khan, Latifur and Seida, Steven and Thuraisingham, Bhavani Relationalizing RDF Stores for Tools Reusability. The emergence of Semantic Web technologies and standards such as Resource Description Framework (RDF) has introduced novel data storage models such as the RDF Graph Model. In this paper, we present a research effort called R2D, which attempts to bridge the gap between RDF and RDBMS concepts by presenting a relational view of RDF data stores. Thus, R2D is essentially a relational wrapper around RDF stores that aims to make the variety of stable relational tools that are currently in the market available to RDF stores without data duplication and synchronization issues.
Tian, Zhi-HongLi, Yang and Lu, Tian-Bo and Guo, Li and Tian, Zhi-Hong and Nie, Qin-Wu Towards Lightweight and Efficient DDoS Attacks Detection for Web Server. In this poster, building on our previous work on a lightweight DDoS (Distributed Denial-of-Service) attack detection mechanism for web servers using TCM-KNN (Transductive Confidence Machines for K-Nearest Neighbors) and genetic-algorithm-based instance selection, we propose a more efficient and effective instance selection method named E-FCM (Extend Fuzzy C-Means). With this method, we obtain a much shorter training time for TCM-KNN while maintaining high detection performance. The optimized mechanism is therefore more suitable for lightweight DDoS attack detection in real network environments. In our previous work, we proposed an effective anomaly detection method based on the TCM-KNN algorithm to detect DDoS attacks and thereby ensure the QoS of web servers. The method detects network anomalies with a higher detection rate, higher confidence and fewer false positives than traditional methods, because it combines “strangeness” and “p-value” measures to evaluate network traffic, in contrast to conventional detection based on ad hoc thresholds or particular definitions. Secondly, we utilize the new objective measurement as the input feature space of TCM-KNN to effectively detect DDoS attacks against web servers. Finally, we introduce a Genetic Algorithm (GA) based instance selection method to boost the real-time detection performance of TCM-KNN and thus make it an effective and lightweight mechanism for DDoS detection for web servers [4, 5]. However, we found that the computational cost of GA is high, which results in a long training time for TCM-KNN.
Toda, Guilherme A.Toda, Guilherme A. and Cortez, Eli and Mesquita, Filipe and da Silva, Altigran S. and Moura, Edleno and Neubert, Marden Automatically Filling Form-Based Web Interfaces with Free Text Inputs. On the web of today, the most prevalent way for users to interact with data-intensive applications is through form-based interfaces composed of several data input fields, such as text boxes, radio buttons, pull-down lists, check boxes, etc. Although these interfaces are popular and effective, in many cases free text interfaces are preferred over form-based ones. In this paper we discuss the proposal and the implementation of a novel IR-based method for using data-rich free text to interact with form-based interfaces. Our solution takes free text as input, implicitly extracts data values from it and fills the appropriate fields with them. For this task, we rely on values of previous submissions for each field, which are freely obtained from the usage of form-based interfaces.
Tse, T. H.Jiang, Bo and Chan, W. K. and Zhang, Zhenyu and Tse, T. H. Where to Adapt Dynamic Service Compositions. Peer services depend on one another to accomplish their tasks, and their structures may evolve. A service composition may be designed to replace its member services whenever the quality of the composite service fails to meet certain quality-of-service (QoS) requirements. Finding services and service invocation endpoints having the greatest impact on the quality are important to guide subsequent service adaptations. This paper proposes a technique that samples the QoS of composite services and continually analyzes them to identify artifacts for service adaptation. The preliminary results show that our technique has the potential to effectively find such artifacts in services.
Tu, XudongTu, Xudong and Wang, Xin-Jing and Feng, Dan and Zhang, Lei Ranking Community Answers via Analogical Reasoning. Due to the lexical gap between questions and answers, automatically detecting right answers becomes very challenging for community question-answering sites. In this paper, we propose an analogical reasoning-based method. It treats questions and answers as relational data and ranks an answer by measuring the analogy of its link to a query with the links embedded in previous relevant knowledge; the answer that links in the most analogous way to the new question is assumed to be the best answer. We based our experiments on 29.8 million Yahoo! Answers question-answer threads and showed the effectiveness of the approach.
Uchiyama, TadasuEda, Takeharu and Uchiyama, Toshio and Uchiyama, Tadasu and Yoshikawa, Masatoshi Signaling Emotion in Tagclouds. In order to create more attractive tagclouds that get people interested in tagged content, we propose a simple but novel tagcloud where font size is determined by tag’s entropy value, not the popularity to its content. Our method raises users’ emotional interest in the content by emphasizing more emotional tags. Our initial experiments show that emotional tagclouds attract more attention than normal tagclouds at first look; thus they will enhance the role of tagcloud as a social signaller.
Uchiyama, ToshioEda, Takeharu and Uchiyama, Toshio and Uchiyama, Tadasu and Yoshikawa, Masatoshi Signaling Emotion in Tagclouds. In order to create more attractive tagclouds that get people interested in tagged content, we propose a simple but novel tagcloud where font size is determined by tag’s entropy value, not the popularity to its content. Our method raises users’ emotional interest in the content by emphasizing more emotional tags. Our initial experiments show that emotional tagclouds attract more attention than normal tagclouds at first look; thus they will enhance the role of tagcloud as a social signaller.
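The idea of sizing tags by entropy rather than popularity can be illustrated with a short sketch. The abstract does not say over which distribution the entropy is computed or how it maps to font size, so both are assumptions here: each tag has a list of hypothetical counts over some categories, Shannon entropy is computed over that distribution, and entropy is scaled linearly to a pixel range. Whether higher or lower entropy should yield a larger font is likewise an assumption.

```python
import math


def entropy(counts):
    """Shannon entropy (bits) of a count distribution; zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def font_sizes(tag_counts, min_px=10, max_px=32):
    """Linearly scale each tag's entropy to a font size in [min_px, max_px].
    Assumes higher entropy means a larger font; the source does not say."""
    ents = {tag: entropy(c) for tag, c in tag_counts.items()}
    lo, hi = min(ents.values()), max(ents.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all entropies match
    return {
        tag: round(min_px + (e - lo) / span * (max_px - min_px))
        for tag, e in ents.items()
    }


# Made-up per-category counts for three tags.
tag_counts = {"funny": [5, 5, 5, 5], "sad": [20, 0, 0, 0], "cute": [10, 10, 0, 0]}
sizes = font_sizes(tag_counts)
```

Note that "funny" and "sad" have the same total count in this toy data, so a popularity-based cloud would render them identically, while the entropy-based sizing separates them.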
Vagner, AlainSire, Stéphane and Paquier, Micaël and Vagner, Alain and Bogaerts, Jérôme A Messaging API for Inter-Widgets Communication. Widget containers are used everywhere on the Web, for instance as customizable start pages to Web desktops. In this poster, we describe the extension of a widget container with an inter-widgets communication layer, as well as the subsequent application programming interfaces (APIs) added to the Widget object to support this feature. We present the benefits of a drag and drop facility within widgets and conclude by a call for standardization of inter-widgets communication on the Web.
Van der Goot, ErikAtkinson, Martin and Van der Goot, Erik Near Real Time Information Mining in Multilingual News. This paper presents a near real-time multilingual news monitoring and analysis system that forms the backbone of our research work. The system integrates technologies to address the problems related to information extraction and analysis of open source intelligence on the World Wide Web. By chaining together different techniques in text mining, automated machine learning and statistical analysis, we can automatically determine who, where and, to a certain extent, what is being reported in news articles.
Viswanathan, Amar
Wang, CanWang, Junfeng and He, Xiaofei and Wang, Can and Pei, Jian and Bu, Jiajun and Chen, Chun and Guan, Ziyu and Gang, Lu News Article Extraction with Template-Independent Wrapper. We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
Zhu, Junyan and Wang, Can and He, Xiaofei and Bu, Jiajun and Chen, Chun and Shang, Shujie and Qu, Mingcheng and Lu, Gang Tag-Oriented Document Summarization. Social annotations on a Web document are highly generalized description of topics contained in that page. Their tagged frequency indicates the user attentions with various degrees. This makes annotations a good resource for summarizing multiple topics in a Web page. In this paper, we present a tag-oriented Web document summarization approach by using both document content and the tags annotated on that document. To improve summarization performance, a new tag ranking algorithm named EigenTag is proposed in this paper to reduce noise in tags. Meanwhile, association mining technique is employed to expand tag set to tackle the sparsity problem. Experimental results show our tag-oriented summarization has a significant improvement over those not using tags.
Wang, GangWang, Gang and Hu, Jian and Zhu, Yunzhang and Li, Hua and Chen, Zheng Competitive Analysis from Click-Through Log. Existing keyword suggestion tools from various search engine companies could automatically suggest keywords related to the advertisers’ products or services, counting in simple statistics of the keywords, such as search volume, cost per click (CPC), etc. However, the nature of the generalized Second Price Auction suggests that better understanding the competitors’ keyword selection and bidding strategies better helps to win the auction, other than only relying on general search statistics. In this paper, we propose a novel keyword suggestion strategy, called Competitive Analysis, to explore the keyword based competition relationships among advertisers and eventually help advertisers to build campaigns with better performance. The experimental results demonstrate that the proposed Competitive Analysis can both help advertisers to promote their product selling and generate more revenue to the search engine companies.
Wang, HaofenWang, Haofen and Liu, Qiaoling and Xue, Gui-Rong and Yu, Yong and Zhang, Lei and Pan, Yue Dataplorer: A Scalable Search Engine for the Data Web. More and more structured information in the form of semantic data is available nowadays. It offers a wide range of new possibilities, especially for semantic search and Web data integration. However, its effective exploitation still poses a number of challenges, e.g. usability, scalability and uncertainty. In this paper, we present Dataplorer, a solution designed to address these challenges. We address usability through the use of hybrid queries and faceted search, while preserving scalability thanks to an extension of the inverted index to support this type of query. Moreover, Dataplorer deals with uncertainty by means of a powerful ranking scheme to find relevant results. Our experimental results show that our proposed approach is promising and lead us to believe that it is possible to extend the current IR infrastructure to query and search the Web of data. Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval. General Terms: Algorithms, Performance, Experimentation. Keywords: hybrid query, inverted index, ranking, faceted search. ... sake of the others. The usability challenge is addressed by providing the user with hybrid query capabilities, leveraging the power of structured queries and the ease of use of keyword search. We also propose a faceted search functionality that allows users to progressively compose the structured part of their information need after having started with imprecise keywords. Scalability is one of the main challenges that hybrid queries face, due to the large amount of data. Inspired by the cross field of DB and IR integration, we make IR compatible with hybrid search through an extension of the inverted index, so that it can scale as well as handle structured information.
To ensure that uncertainty does not remain as a problem to return relevant results, we provide a powerful ranking scheme that considers structures of both data and hybrid queries for score propagation and aggregation during results ranking. As an improvement of our previous work [3], we support faceted search with integrated ranking to tackle both usability and uncertainty issues while preserving efficiency.
Wang, Jianyong
Wang, Jing-Yao
Dong, Zheng-Bin and Song, Guo-Jie and Xie, Kun-Qing and Wang, Jing-Yao An Experimental Study of Large-Scale Mobile Social Network. A mobile social network is a typical social network in which individuals with similar interests or commonalities converse and connect with one another using mobile phones. Our work in this paper focuses on the experimental study of this kind of social network with the support of large-scale real mobile call data. The main contributions are three-fold: firstly, a large-scale real mobile phone call log of one city has been extracted from a mobile phone carrier in China to construct a mobile social network; secondly, common features of traditional social networks, such as power-law distribution and small diameter, have been examined experimentally, confirming that the mobile social network is a typical scale-free network and exhibits the small-world phenomenon; lastly, unlike traditional analytical methods, important properties of the actors, such as gender and age, have been introduced into our experiments, with some interesting findings about human behavior: for example, middle-aged people are more active than young and old people, and women are unusually more active than men in old age.
Wang, JunfengWang, Junfeng and He, Xiaofei and Wang, Can and Pei, Jian and Bu, Jiajun and Chen, Chun and Guan, Ziyu and Gang, Lu News Article Extraction with Template-Independent Wrapper. We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
Wang, LuZhang, Kaimin and Wang, Lu and Guo, Xiaolin and Pan, Aimin and Zhu, Bin B. WPBench: A Benchmark for Evaluating the Client-side Performance of Web 2.0 Applications. In this paper, a benchmark called WPBench is reported to evaluate the responsiveness of Web browsers for modern Web 2.0 applications. In WPBench, variations of servers and networks are removed and the benchmark result is the closest to what Web users would perceive. To achieve these, WPBench records users’ interactions with typical Web 2.0 applications, and then replays Web navigations when benchmarking browsers. The replay mechanism can emulate the actual user interactions and the characteristics of the servers and the networks in a consistent way independent of browsers so that any browser compliant to the standards can be benchmarked fairly. In addition to describing the design and generation of WPBench, we also report the WPBench comparison results on the responsiveness performance for three popular Web browsers: Internet Explorer, Firefox and Chrome.
Wang, Min
Hassanzadeh, Oktie and Lim, Lipyeow and Kementsietsidis, Anastasios and Wang, Min A Declarative Framework for Semantic Link Discovery over Relational Data. In this paper, we present a framework for online discovery of semantic links from relational data. Our framework is based on declarative specification of the linkage requirements by the user, which allows matching data items in many real-world scenarios. These requirements are translated to queries that can run over the relational data source, potentially using the semantic knowledge to enhance the accuracy of link discovery. Our framework lets data publishers easily find and publish high-quality links to other data sources, and therefore could significantly enhance the value of the data in the next generation of the web.
Wang, Shengyuan
Wang, Wei
Wang, Wei and Masseglia, Florent and Guyet, Thomas and Quiniou, Rene and Cordier, Marie-Odile A General Framework for Adaptive and Online Detection of Web Attacks. Detection of web attacks is an important issue in the current defense-in-depth security framework. In this paper, we propose a novel general framework for adaptive and online detection of web attacks. The general framework can be based on any online clustering method. A detection model based on the framework is able to learn online and deal with “concept drift” in web audit data streams. Str-DBSCAN, our extension of DBSCAN [1] to streaming data, and StrAP [3] are both used to validate the framework. The detection model based on the framework automatically labels the web audit data and adapts to normal behavior changes while identifying attacks through dynamic clustering of the streaming data. A very large set of real HTTP log data collected in our institute is used to validate the framework and the model. The preliminary testing results demonstrate its effectiveness.
Lin, Chen and Yang, Jiang-Ming and Cai, Rui and Wang, Xin-Jing and Wang, Wei and Zhang, Lei Modeling Semantics and Structure of Discussion Threads. The abundant knowledge in web communities has motivated research interest in discussion threads. The dynamic nature of discussion threads poses interesting and challenging problems for computer scientists. Although techniques such as semantic models or structural models have been shown to be useful in a number of areas, they are inefficient in understanding discussion threads due to the temporal dependence among posts in a discussion thread. Such dependence couples semantics and structure with each other in discussion threads. In this paper, we propose a sparse coding-based model named SMSS to Simultaneously Model Semantic and Structure of discussion threads.
Wang, Xin-Jing
Lin, Chen and Yang, Jiang-Ming and Cai, Rui and Wang, Xin-Jing and Wang, Wei and Zhang, Lei Modeling Semantics and Structure of Discussion Threads. The abundant knowledge in web communities has motivated research interest in discussion threads. The dynamic nature of discussion threads poses interesting and challenging problems for computer scientists. Although techniques such as semantic models or structural models have been shown to be useful in a number of areas, they are inefficient in understanding discussion threads due to the temporal dependence among posts in a discussion thread. Such dependence couples semantics and structure with each other in discussion threads. In this paper, we propose a sparse coding-based model named SMSS to Simultaneously Model Semantic and Structure of discussion threads.
Tu, Xudong and Wang, Xin-Jing and Feng, Dan and Zhang, Lei Ranking Community Answers via Analogical Reasoning. Due to the lexical gap between questions and answers, automatically detecting the right answers is very challenging for community question-answering sites. In this paper, we propose an analogical reasoning-based method. It treats questions and answers as relational data and ranks an answer by measuring the analogy of its link to a query with the links embedded in previous relevant knowledge; the answer that links in the most analogous way to the new question is assumed to be the best answer. We based our experiments on 29.8 million Yahoo! Answers question-answer threads and showed the effectiveness of the approach.
Watanabe, Hajime
Oiwa, Yutaka and Takagi, Hiromitsu and Watanabe, Hajime and Suzuki, Hirofumi PAKE-based Mutual HTTP Authentication for Preventing Phishing Attacks. We developed a new Web authentication protocol with password-based mutual authentication which prevents various kinds of phishing attacks. This protocol protects users’ passwords against phishers even if a dictionary attack is employed, and prevents phishers from giving users a false sense of successful authentication. The protocol is designed for interoperability with many recent Web applications, which require many features that current HTTP authentication does not provide. The protocol is proposed as an Internet Draft submitted to IETF, and implemented on both the server side (as an Apache extension) and the client side (as a Mozilla-based browser and an IE-based one). Categories and Subject Descriptors: K.6.5 [Management of Computing and Information Systems]: Security and Protection, Authentication. General Terms: Security, Standardization. Keywords: Network protocol, Mutual authentication, HTTP.
Weber, IngmarBaykan, Eda and Henzinger, Monika and Marian, Ludmila and Weber, Ingmar Purely URL-based Topic Classification. Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content [7], but a URL-only classifier is preferable, (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page’s content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser, without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to download it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
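A URL-only classifier first has to turn the URL string into features. The following is a minimal sketch of one plausible featurization, assuming lowercase alphabetic tokens plus character trigrams; the abstract does not specify the features used, so the function names `url_tokens` and `char_ngrams` and the choice of trigrams are illustrative assumptions, not the authors' method.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_tokens(url):
    """Split the host and path of a URL into lowercase alphabetic tokens.

    Hypothetical helper: the tokenization rule (letters only, length >= 2)
    is an assumption made for illustration.
    """
    parsed = urlparse(url.lower())
    text = parsed.netloc + parsed.path
    return [t for t in re.split(r"[^a-z]+", text) if len(t) >= 2]


def char_ngrams(token, n=3):
    """Return all contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def url_features(url):
    """Bag of features (tokens and their trigrams) for a single URL."""
    features = Counter()
    for token in url_tokens(url):
        features[token] += 1
        for gram in char_ngrams(token):
            features["3g:" + gram] += 1
    return features


example = "http://www.cs.example.edu/~smith/courses/cs101.html"
print(url_tokens(example))
```

A feature bag of this kind could then be passed to any standard binary classifier, one per topic, which matches the per-topic setup the abstract describes.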
Michel, Sebastian and Weber, Ingmar Rethinking Email Message and People Search. We show how a number of novel email search features can be implemented without any kind of natural language processing (NLP) or advanced data mining. Our approach inspects the email headers of all messages a user has ever sent or received and it creates simple per-contact summaries, including simple information about the message exchange history, the domain of the sender or even the sender’s gender. With these summaries advanced questions/tasks such as “Who do I still need to reply to?” or “Find ‘fun’ messages sent by friends.” become possible. As a proof of concept, we implemented a Mozilla-Thunderbird extension, adding powerful people search to the popular email client.
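The question “Who do I still need to reply to?” can be answered from header fields alone. The sketch below assumes each message has already been reduced to a dictionary with `from`, `to` and `date` keys; that record layout, the function name `pending_replies`, and the rule "the contact wrote to me after my last message to them" are assumptions for illustration, not the logic of the Thunderbird extension.

```python
from datetime import datetime


def pending_replies(headers, me):
    """Return contacts whose latest message to `me` is newer than
    the latest message `me` sent to them (or who never got a reply)."""
    last_received = {}  # contact -> time of their latest message to me
    last_sent = {}      # contact -> time of my latest message to them
    for h in headers:
        when = h["date"]
        if h["from"] == me:
            for rcpt in h["to"]:
                if rcpt not in last_sent or when > last_sent[rcpt]:
                    last_sent[rcpt] = when
        elif me in h["to"]:
            sender = h["from"]
            if sender not in last_received or when > last_received[sender]:
                last_received[sender] = when
    return sorted(
        contact
        for contact, received in last_received.items()
        if contact not in last_sent or last_sent[contact] < received
    )


me = "me@example.org"
mailbox = [
    {"from": "alice@example.org", "to": [me], "date": datetime(2009, 1, 1)},
    {"from": me, "to": ["alice@example.org"], "date": datetime(2009, 1, 2)},
    {"from": "bob@example.org", "to": [me], "date": datetime(2009, 1, 3)},
]
print(pending_replies(mailbox, me))
```

Because only two timestamps per contact are kept, the summary stays small no matter how long the message history is, in the spirit of the per-contact summaries the abstract mentions.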
Weikum, Gerhard
Wook Kim, JinLee, Taehyung and Kim, Jinil and Wook Kim, Jin and Kim, Sung-Ryul and Park, Kunsoo Detecting Soft Errors by Redirection Classification. A soft error redirection is a URL redirection to a page that returns the HTTP status code 200 (OK) but has actually no relevant content to the client request. Since such redirections degrade the performance of web search engines in many ways, it is highly desirable to remove as many of them as possible. We propose a novel approach to detect soft error redirections by analyzing redirection logs collected during crawling operation. Experimental results on huge crawl data show that our measure can classify soft error redirections effectively.
Wu, HaoWu, Hao and Qiu, Guang and He, Xiaofei and Shi, Yuan and Qu, Mingcheng and Shen, Jing and Bu, Jiajun and Chen, Chun Advertising Keyword Generation Using Active Learning. This paper proposes an efficient relevance feedback based interactive model for keyword generation in sponsored search advertising. We formulate the ranking of relevant terms as a supervised learning problem and suggest new terms for the seed by leveraging user relevance feedback information. Active learning is employed to select the most informative samples from a set of candidate terms for user labeling. Experiments show our approach improves the relevance of generated terms significantly with little user effort required.
Qu, Mingcheng and Qiu, Guang and He, Xiaofei and Zhang, Cheng and Wu, Hao and Bu, Jiajun and Chen, Chun Probabilistic Question Recommendation for Question Answering Communities. User-Interactive Question Answering (QA) communities such as Yahoo! Answers are growing in popularity. However, as these QA sites always have thousands of new questions posted daily, it is difficult for users to find the questions that are of interest to them. Consequently, this may delay the answering of the new questions. This gives rise to question recommendation techniques that help users locate interesting questions. In this paper, we adopt the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and propose a novel metric to evaluate the performance of our approach. The experimental results show our recommendation approach is effective.
Wu, Ja-LingShieh, Jyh-Ren and Hsieh, Yung-Huan and Yeh, Yang-Ting and Chung Su, Tse and Lin, Ching-Yung and Wu, Ja-Ling Building Term Suggestion Relational Graphs from Collective Intelligence. This paper proposes an effective approach to provide relevant search terms for conceptual Web search. ‘Semantic Term Suggestion’ function has been included so that users can find the most appropriate query term to what they really need. Conventional approaches for term suggestion involve extracting frequently occurring key terms from retrieved documents. They must deal with term extraction difficulties and interference from irrelevant documents. In this paper, we propose a semantic term suggestion function called Collective Intelligence based Term Suggestion (CITS). CITS provides a novel social-network based framework for relevant terms suggestion with a semantic graph of the search term without limiting to the specific query term. A visualization of semantic graph is presented to the users to help browsing search results from related terms in the semantic graph. The search results are ranked each time according to their relevance to the related terms in the entire query session. Comparing to two popular commercial search engines, a user study of 18 users on 50 search terms showed better user satisfactions and indicated the potential usefulness of proposed method in real-world search applications.
Wu, Ou
Wu, Xiaoyuan
Shen, Dan and Wu, Xiaoyuan and Bolivar, Alvaro Rare Item Detection in e-Commerce Site. As the largest online marketplace in the world, eBay has a huge inventory containing plenty of great rare items with potentially large numbers of, even rapturous, buyers. These items are obscured in the long tail of eBay item listings and are hard to find through existing searching or browsing methods. The eBay query log shows that there is great demand for rare items from users. To keep up with this demand, the paper proposes a method to automatically detect rare items in eBay online listings. A large set of features relevant to the task are investigated to filter items and further measure item rareness. The experiments on the most rarity-demand-intensive domains show that the method can effectively detect rare items (> 90% precision).
Wu, Yi-Chuan
Wu, ZhaohuiLu, Bin and Wu, Zhaohui and Ni, Yuan and Xie, Guotong and Zhou, Chunying and Chen, Huajun sMash: Semantic-based Mashup Navigation for Data API Network. With the proliferation of data APIs, it is not uncommon that users who have no clear ideas about data APIs will encounter difficulties to build Mashups to satisfy their requirements. In this paper, we present a semantic-based mashup navigation system, sMash that makes mashup building easy by constructing and visualizing a real-life data API network. We build a sample network by gathering more than 300 popular APIs and find that the relationships between them are so complex that our system will play an important role in navigating users and give them inspiration to build interesting mashups easily. The system is accessible at: http://www.dart.zju.edu.cn/mashup.
Xie, GuotongLu, Bin and Wu, Zhaohui and Ni, Yuan and Xie, Guotong and Zhou, Chunying and Chen, Huajun sMash: Semantic-based Mashup Navigation for Data API Network. With the proliferation of data APIs, it is not uncommon that users who have no clear ideas about data APIs will encounter difficulties to build Mashups to satisfy their requirements. In this paper, we present a semantic-based mashup navigation system, sMash that makes mashup building easy by constructing and visualizing a real-life data API network. We build a sample network by gathering more than 300 popular APIs and find that the relationships between them are so complex that our system will play an important role in navigating users and give them inspiration to build interesting mashups easily. The system is accessible at: http://www.dart.zju.edu.cn/mashup.
Xie, Kun-Qing
Dong, Zheng-Bin and Song, Guo-Jie and Xie, Kun-Qing and Wang, Jing-Yao An Experimental Study of Large-Scale Mobile Social Network. A mobile social network is a typical social network in which individuals with similar interests or commonalities converse and connect with one another using mobile phones. Our work in this paper focuses on the experimental study of this kind of social network with the support of large-scale real mobile call data. The main contributions are three-fold: firstly, a large-scale real mobile phone call log of one city has been extracted from a mobile phone carrier in China to construct a mobile social network; secondly, common features of traditional social networks, such as power-law distribution and small diameter, have been examined experimentally, confirming that the mobile social network is a typical scale-free network and exhibits the small-world phenomenon; lastly, unlike traditional analytical methods, important properties of the actors, such as gender and age, have been introduced into our experiments, with some interesting findings about human behavior: for example, middle-aged people are more active than young and old people, and women are unusually more active than men in old age.
Xue, Gui-Rong
Wang, Haofen and Liu, Qiaoling and Xue, Gui-Rong and Yu, Yong and Zhang, Lei and Pan, Yue Dataplorer: A Scalable Search Engine for the Data Web. More and more structured information in the form of semantic data is nowadays available. It offers a wide range of new possibilities especially for semantic search and Web data integration. However, their effective exploitation still brings about a number of challenges, e.g. usability, scalability and uncertainty. In this paper, we present Dataplorer, a solution designed to address these challenges. We consider the usability through the use of hybrid queries and faceted search, while still preserving the scalability thanks to an extension of inverted index to support this type of query. Moreover, Dataplorer deals with uncertainty by means of a powerful ranking scheme to find relevant results. Our experimental results show that our proposed approach is promising and it makes us believe that it is possible to extend the current IR infrastructure to query and search the Web of data. Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval. General Terms: Algorithms, Performance, Experimentation. Keywords: hybrid query, inverted index, ranking, faceted search. The usability challenge is addressed by providing the user with hybrid query capabilities, leveraging the power of structured queries and the ease of use of keyword search. We also propose a faceted search functionality that allows users to progressively compose the structured part of their information need after having started with imprecise keywords. Scalability is one of the main challenges that hybrid queries are facing, due to the large amount of data. Inspired from the cross field of DB and IR integration, we make IR compatible with hybrid search through an extension of the inverted index, and thus able to scale as well as to handle structured information.
To ensure that uncertainty does not remain as a problem to return relevant results, we provide a powerful ranking scheme that considers structures of both data and hybrid queries for score propagation and aggregation during results ranking. As an improvement of our previous work [3], we support faceted search with integrated ranking to tackle both usability and uncertainty issues while preserving efficiency.
Zhang, Congle and Xue, Gui-Rong and Yu, Yong and Zha, Hongyuan Web-Scale Classification with Naive Bayes. The traditional Naive Bayes classifier performs miserably on web-scale taxonomies. In this paper, we investigate the reasons behind this bad performance. We discover that the low performance is not entirely caused by the intrinsic limitations of Naive Bayes, but mainly comes from two largely ignored problems: the contradiction pair problem and the discriminative evidence cancelation problem. We propose modifications that can alleviate the two problems while preserving the advantages of Naive Bayes. The experimental results show that our modified Naive Bayes can significantly improve the performance on real web-scale taxonomies.
Xue, XiangyangChi, Mingmin and Zhang, Peiwu and Zhao, Yingbin and Feng, Rui and Xue, Xiangyang Web Image Retrieval ReRanking with Multi-view Clustering. General image retrieval is often carried out by a text-based search engine, such as Google Image Search. In this case, natural language queries are used as input to the search engine. Usually, the user queries are quite ambiguous and the returned results are not well-organized as the ranking often done by the popularity of an image. In order to address these problems, we propose to use both textual and visual contents of retrieved images to reRank web retrieved results. In particular, a machine learning technique, a multi-view clustering algorithm is proposed to reorganize the original results provided by the text-based search engine. Preliminary results validate the effectiveness of the proposed framework.
Yadav, Amit
Yan, JunLiu, Ning and Yan, Jun and Fan, Weiguo and Yang, Qiang and Chen, Zheng Identifying Vertical Search Intention of Query through Social Tagging Propagation. A pressing task during the unification process is to identify a user’s vertical search intention based on the user’s query. In this paper, we propose a novel method to propagate social annotation, which includes user-supplied tag data, to both queries and VSEs for semantically bridging them. Our proposed algorithm consists of three key steps: query annotation, vertical annotation and query intention identification. Our algorithm, referred to as TagQV, verifies that the social tagging can be propagated to represent Web objects such as queries and VSEs besides Web pages. Experiments on real Web search queries demonstrate the effectiveness of TagQV in query intention identification.
Liu, Ning and Yan, Jun and Chen, Zheng A Probabilistic Model Based Approach for Blended Search. In this paper, we propose to model the blended search problem by assuming conditional dependencies among queries, VSEs and search results. The probability distributions of this model are learned from the search engine query log through a unigram language model. Our experimental exploration shows that (1) a large number of queries in generic Web search have vertical search intentions; and (2) our proposed algorithm can effectively blend vertical search results into generic Web search, which can improve the Mean Average Precision (MAP) by as much as 16% compared to traditional Web search without blending. In the classical meta-search configuration, the query log of component search engines is not available for study. In this extended abstract, we model the blended search problem based on the conditional dependencies among queries, VSEs and all the search results. We utilize the usage information, i.e. the query logs, of all the VSEs, which is not available to traditional meta-search engines, to learn the model parameters with the smoothed unigram language model. Finally, given a user query, the search results from both generic Web search and different VSEs are ranked together by inferring their probabilities of relevance to the given query. The main contributions of this work are: (1) by studying the query logs of the vertical search engines belonging to a commercial search engine, we show the importance of the blended search problem; (2) we propose a novel probabilistic model based approach to explore the blended search problem; and (3) we experimentally verify that our proposed algorithm can effectively blend vertical search results into generic Web search, which can improve the MAP by as much as 16% compared to traditional Web search without vertical search blending and by 10% over some other ranking baseline.
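To make the idea of learning from vertical query logs concrete, here is a minimal sketch of scoring a query against per-vertical unigram language models. The abstract only says that a smoothed unigram language model is used; the Jelinek-Mercer interpolation with a background model, the weight `lam`, and the function names `build_model` and `log_likelihood` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def build_model(queries):
    """Unigram term counts over a list of logged query strings."""
    return Counter(w for q in queries for w in q.lower().split())


def log_likelihood(query, model, background, lam=0.8):
    """log P(query | vertical) under a unigram model, interpolated with a
    background model (Jelinek-Mercer smoothing; an assumed choice)."""
    model_total = sum(model.values())
    background_total = sum(background.values())
    score = 0.0
    for w in query.lower().split():
        p = (lam * model[w] / model_total
             + (1 - lam) * background[w] / background_total)
        if p == 0.0:
            return float("-inf")  # term unseen in every log
        score += math.log(p)
    return score


# Hypothetical toy logs for two vertical search engines (VSEs).
logs = {
    "images": ["sunset photos", "cat pictures"],
    "news": ["election results", "stock market news"],
}
models = {name: build_model(qs) for name, qs in logs.items()}
background = sum(models.values(), Counter())

ranked = sorted(
    models,
    key=lambda name: log_likelihood("cat photos", models[name], background),
    reverse=True,
)
print(ranked)
```

The background model keeps a vertical from receiving zero probability just because one query term never appeared in its own log, which is the usual motivation for smoothing in this setting.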
Yan, Jun and Liu, Ning and Qing Chang, Elaine and Ji, Lei and Chen, Zheng Search Result Re-ranking Based on Gap between Search Queries and Social Tags. Both search engine click-through logs and social annotation have been utilized as user feedback for search result re-ranking. However, to the best of our knowledge, no previous study has explored the correlation between these two factors for the task of search result re-ranking. In this paper, we show that the gap between search queries and social tags of the same web page can well reflect its user preference score. Motivated by this observation, we propose a novel algorithm, called Query-Tag-Gap (QTG), to re-rank search results for better user satisfaction. Intuitively, on one hand, the search users’ intentions are generally described by their queries before they read the search results. On the other hand, the web annotators semantically tag web pages after they read the content of the pages. The difference between users’ recognition of the same page before and after they read it is a good reflection of user satisfaction. In this extended abstract, we formally define the query set and tag set of the same page as users’ pre- and post-knowledge respectively. We empirically show the strong correlation between user satisfaction and the user’s knowledge gap before and after reading the page. Based on this gap, experiments have shown outstanding performance of our proposed QTG algorithm in search result re-ranking.
Yanai, KeijiYanai, Keiji and Qiu, Bingyu Mining Cultural Differences from a Large Number of Geotagged Photos. We propose a novel method to detect cultural differences over the world automatically by using a large amount of geotagged images on the photo sharing Web sites such as Flickr. We employ the state-of-the-art object recognition technique developed in the research community of computer vision to mine representative photos of the given concept for representative local regions from a large-scale unorganized collection of consumer-generated geotagged photos. The results help us understand how objects, scenes or events corresponding to the same given concept are visually different depending on local regions over the world.
Yang, Jiang-Ming
Lin, Chen and Yang, Jiang-Ming and Cai, Rui and Wang, Xin-Jing and Wang, Wei and Zhang, Lei Modeling Semantics and Structure of Discussion Threads. The abundant knowledge in web communities has motivated research interest in discussion threads. The dynamic nature of discussion threads poses interesting and challenging problems for computer scientists. Although techniques such as semantic models or structural models have been shown to be useful in a number of areas, they are inefficient in understanding discussion threads due to the temporal dependence among posts in a discussion thread. Such dependence couples semantics and structure with each other in discussion threads. In this paper, we propose a sparse coding-based model named SMSS to Simultaneously Model Semantic and Structure of discussion threads.
Yang, QiangLiu, Ning and Yan, Jun and Fan, Weiguo and Yang, Qiang and Chen, Zheng Identifying Vertical Search Intention of Query through Social Tagging Propagation. A pressing task during the unification process is to identify a user’s vertical search intention based on the user’s query. In this paper, we propose a novel method to propagate social annotation, which includes user-supplied tag data, to both queries and VSEs for semantically bridging them. Our proposed algorithm consists of three key steps: query annotation, vertical annotation and query intention identification. Our algorithm, referred to as TagQV, verifies that the social tagging can be propagated to represent Web objects such as queries and VSEs besides Web pages. Experiments on real Web search queries demonstrate the effectiveness of TagQV in query intention identification.
Yeh, Yang-TingShieh, Jyh-Ren and Hsieh, Yung-Huan and Yeh, Yang-Ting and Chung Su, Tse and Lin, Ching-Yung and Wu, Ja-Ling Building Term Suggestion Relational Graphs from Collective Intelligence. This paper proposes an effective approach to provide relevant search terms for conceptual Web search. ‘Semantic Term Suggestion’ function has been included so that users can find the most appropriate query term to what they really need. Conventional approaches for term suggestion involve extracting frequently occurring key terms from retrieved documents. They must deal with term extraction difficulties and interference from irrelevant documents. In this paper, we propose a semantic term suggestion function called Collective Intelligence based Term Suggestion (CITS). CITS provides a novel social-network based framework for relevant terms suggestion with a semantic graph of the search term without limiting to the specific query term. A visualization of semantic graph is presented to the users to help browsing search results from related terms in the semantic graph. The search results are ranked each time according to their relevance to the related terms in the entire query session. Comparing to two popular commercial search engines, a user study of 18 users on 50 search terms showed better user satisfactions and indicated the potential usefulness of proposed method in real-world search applications.
Yi, Jeonghee
Yi, Jeonghee and Maghoul, Farzin Query Clustering using Click-Through Graph. In this paper we describe a problem of discovering query clusters from a click-through graph of web search logs. The graph consists of a set of web search queries, a set of pages selected for the queries, and a set of directed edges, each connecting a query node to a page node clicked by a user for the query. The proposed method extracts all maximal bipartite cliques (bicliques) from a click-through graph and computes an equivalence set of queries (i.e., a query cluster) from the maximal bicliques. A cluster of queries is formed from the queries in a biclique. We present a scalable algorithm that enumerates all maximal bicliques from the click-through graph. We have conducted experiments on Yahoo web search queries and the result is promising.
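To illustrate what a maximal biclique of a click-through graph is, the sketch below enumerates them by brute force on a toy graph. It is deliberately not the scalable algorithm the abstract refers to (which is not described here): it tries every subset of queries and is exponential in the number of queries. The input layout (a dictionary from query to the set of clicked pages) and the function name `maximal_bicliques` are assumptions for illustration.

```python
from itertools import combinations


def maximal_bicliques(clicks):
    """Enumerate maximal bicliques of a bipartite click-through graph.

    clicks: dict mapping each query to the set of pages clicked for it.
    Returns a set of (queries, pages) pairs of frozensets.
    Brute force over query subsets; only suitable for tiny examples.
    """
    pages_of = {q: frozenset(p) for q, p in clicks.items()}
    queries_of = {}
    for q, pages in pages_of.items():
        for p in pages:
            queries_of.setdefault(p, set()).add(q)

    result = set()
    queries = sorted(pages_of)
    for size in range(1, len(queries) + 1):
        for subset in combinations(queries, size):
            # Pages clicked for every query in the subset.
            pages = frozenset.intersection(*(pages_of[q] for q in subset))
            if not pages:
                continue
            # Close the query side: every query that clicked all those pages.
            closed = frozenset(
                set.intersection(*(queries_of[p] for p in pages))
            )
            result.add((closed, pages))
    return result


clicks = {
    "jaguar car": {"a", "b"},
    "jaguar auto": {"a", "b"},
    "jaguar cat": {"c"},
}
for cluster, pages in sorted(maximal_bicliques(clicks), key=lambda x: sorted(x[0])):
    print(sorted(cluster), sorted(pages))
```

In the toy graph, the two car-related queries share both clicked pages and therefore end up on the query side of the same biclique, which is the sense in which a biclique yields a query cluster.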
Yogev, SivanAmitay, Einat and Carmel, David and Har'El, Nadav and Ofek-Koifman, Shila and Soffer, Aya and Yogev, Sivan and Golbandi, Nadav Social Search and Discovery Using a Unified Approach. We explore new ways of improving a search engine using data from Web 2.0 applications such as blogs and social bookmarks. This data contains entities such as documents, people and tags, and relationships between them. We propose a simple yet effective method, based on faceted search, that treats all entities in a unified manner: returning all of them (documents, people and tags) on every search, and allowing all of them to be used as search terms. We describe an implementation of such a social search engine on the intranet of a large enterprise, and present large-scale experiments which verify the validity of our approach.
Yoshikawa, MasatoshiEda, Takeharu and Uchiyama, Toshio and Uchiyama, Tadasu and Yoshikawa, Masatoshi Signaling Emotion in Tagclouds. In order to create more attractive tagclouds that get people interested in tagged content, we propose a simple but novel tagcloud where font size is determined by tag’s entropy value, not the popularity to its content. Our method raises users’ emotional interest in the content by emphasizing more emotional tags. Our initial experiments show that emotional tagclouds attract more attention than normal tagclouds at first look; thus they will enhance the role of tagcloud as a social signaller.
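As a rough illustration of sizing tags by entropy rather than popularity, the sketch below computes a Shannon entropy per tag and scales it linearly to a font size. The abstract does not say which distribution the entropy is taken over or how it maps to size, so the per-document usage counts, the linear scaling, the point-size range, and the names `shannon_entropy` and `font_sizes` are all assumptions.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def font_sizes(tag_usage, min_pt=10, max_pt=32):
    """Map each tag to a font size scaled linearly by its entropy.

    tag_usage: dict of tag -> dict of document id -> usage count
    (an assumed input shape, not taken from the paper).
    """
    entropies = {tag: shannon_entropy(list(usage.values()))
                 for tag, usage in tag_usage.items()}
    low, high = min(entropies.values()), max(entropies.values())
    span = high - low
    sizes = {}
    for tag, h in entropies.items():
        # If all tags share one entropy value, fall back to the minimum size.
        scale = 0.0 if span == 0 else (h - low) / span
        sizes[tag] = round(min_pt + scale * (max_pt - min_pt))
    return sizes

# Hypothetical usage counts.
tag_usage = {
    "funny": {"doc1": 5, "doc2": 5, "doc3": 5, "doc4": 5},
    "sad": {"doc1": 5, "doc2": 5},
    "news": {"doc1": 20},
}
sizes = font_sizes(tag_usage)
```

Note that "news" is the most frequently used tag here yet gets the smallest size, which is the contrast with popularity-based tagclouds that the abstract draws.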
Yu, YongWang, Haofen and Liu, Qiaoling and Xue, Gui-Rong and Yu, Yong and Zhang, Lei and Pan, Yue Dataplorer: A Scalable Search Engine for the Data Web. More and more structured information in the form of semantic data is nowadays available. It offers a wide range of new possibilities especially for semantic search and Web data integration. However, their effective exploitation still brings about a number of challenges, e.g. usability, scalability and uncertainty. In this paper, we present Dataplorer, a solution designed to address these challenges. We consider the usability through the use of hybrid queries and faceted search, while still preserving the scalability thanks to an extension of inverted index to support this type of query. Moreover, Dataplorer deals with uncertainty by means of a powerful ranking scheme to find relevant results. Our experimental results show that our proposed approach is promising and it makes us believe that it is possible to extend the current IR infrastructure to query and search the Web of data. Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval. General Terms: Algorithms, Performance, Experimentation. Keywords: hybrid query, inverted index, ranking, faceted search. The usability challenge is addressed by providing the user with hybrid query capabilities, leveraging the power of structured queries and the ease of use of keyword search. We also propose a faceted search functionality that allows users to progressively compose the structured part of their information need after having started with imprecise keywords. Scalability is one of the main challenges that hybrid queries are facing, due to the large amount of data. Inspired from the cross field of DB and IR integration, we make IR compatible with hybrid search through an extension of the inverted index, and thus able to scale as well as to handle structured information.
To ensure that uncertainty does not remain as a problem to return relevant results, we provide a powerful ranking scheme that considers structures of both data and hybrid queries for score propagation and aggregation during results ranking. As an improvement of our previous work [3], we support faceted search with integrated ranking to tackle both usability and uncertainty issues while preserving efficiency.
Zhang, Congle and Xue, Gui-Rong and Yu, Yong and Zha, Hongyuan Web-Scale Classification with Naive Bayes. Traditional Naive Bayes Classifier performs miserably on web-scale taxonomies. In this paper, we investigate the reasons behind such bad performance. We discover that the low performance are not completely caused by the intrinsic limitations of Naive Bayes, but mainly comes from two largely ignored problems: contradiction pair problem and discriminative evidence cancelation problem. We propose modifications that can alleviate the two problems while preserving the advantages of Naive Bayes. The experimental results show our modified Naive Bayes can significantly improve the performance on real web-scale taxonomies.
Zha, HongyuanZhang, Congle and Xue, Gui-Rong and Yu, Yong and Zha, Hongyuan Web-Scale Classification with Naive Bayes. Traditional Naive Bayes Classifier performs miserably on web-scale taxonomies. In this paper, we investigate the reasons behind such bad performance. We discover that the low performance are not completely caused by the intrinsic limitations of Naive Bayes, but mainly comes from two largely ignored problems: contradiction pair problem and discriminative evidence cancelation problem. We propose modifications that can alleviate the two problems while preserving the advantages of Naive Bayes. The experimental results show our modified Naive Bayes can significantly improve the performance on real web-scale taxonomies.
Zhan, Jian
Zhang, ChengQu, Mingcheng and Qiu, Guang and He, Xiaofei and Zhang, Cheng and Wu, Hao and Bu, Jiajun and Chen, Chun Probabilistic Question Recommendation for Question Answering Communities. User-Interactive Question Answering (QA) communities such as Yahoo! Answers are growing in popularity. However, as these QA sites always have thousands of new questions posted daily, it is difficult for users to find the questions that are of interest to them. Consequently, this may delay the answering of the new questions. This gives rise to question recommendation techniques that help users locate interesting questions. In this paper, we adopt the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and propose a novel metric to evaluate the performance of our approach. The experimental results show our recommendation approach is effective.
Zhang, CongleZhang, Congle and Xue, Gui-Rong and Yu, Yong and Zha, Hongyuan Web-Scale Classification with Naive Bayes. Traditional Naive Bayes Classifier performs miserably on web-scale taxonomies. In this paper, we investigate the reasons behind such bad performance. We discover that the low performance are not completely caused by the intrinsic limitations of Naive Bayes, but mainly comes from two largely ignored problems: contradiction pair problem and discriminative evidence cancelation problem. We propose modifications that can alleviate the two problems while preserving the advantages of Naive Bayes. The experimental results show our modified Naive Bayes can significantly improve the performance on real web-scale taxonomies.
Zhang, DellZhang, Dell and Mao, Robert and Li, Wei The Recurrence Dynamics of Social Tagging. How often do tags recur? How hard is predicting tag recurrence? What tags are likely to recur? We try to answer these questions by analysing the RSDC08 dataset, in both individual and collective settings. Our findings provide useful insights for the development of tag suggestion techniques etc.
Zhang, Jian-Ying
Zhang, KaiminZhang, Kaimin and Wang, Lu and Guo, Xiaolin and Pan, Aimin and Zhu, Bin B. WPBench: A Benchmark for Evaluating the Client-side Performance of Web 2.0 Applications. In this paper, a benchmark called WPBench is reported to evaluate the responsiveness of Web browsers for modern Web 2.0 applications. In WPBench, variations of servers and networks are removed and the benchmark result is the closest to what Web users would perceive. To achieve these, WPBench records users’ interactions with typical Web 2.0 applications, and then replays Web navigations when benchmarking browsers. The replay mechanism can emulate the actual user interactions and the characteristics of the servers and the networks in a consistent way independent of browsers so that any browser compliant to the standards can be benchmarked fairly. In addition to describing the design and generation of WPBench, we also report the WPBench comparison results on the responsiveness performance for three popular Web browsers: Internet Explorer, Firefox and Chrome.
Zhang, LeiWang, Haofen and Liu, Qiaoling and Xue, Gui-Rong and Yu, Yong and Zhang, Lei and Pan, Yue Dataplorer: A Scalable Search Engine for the Data Web. More and more structured information in the form of semantic data is nowadays available. It offers a wide range of new possibilities especially for semantic search and Web data integration. However, their effective exploitation still brings about a number of challenges, e.g. usability, scalability and uncertainty. In this paper, we present Dataplorer, a solution designed to address these challenges. We consider the usability through the use of hybrid queries and faceted search, while still preserving the scalability thanks to an extension of inverted index to support this type of query. Moreover, Dataplorer deals with uncertainty by means of a powerful ranking scheme to find relevant results. Our experimental results show that our proposed approach is promising and it makes us believe that it is possible to extend the current IR infrastructure to query and search the Web of data. Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval. General Terms: Algorithms, Performance, Experimentation. Keywords: hybrid query, inverted index, ranking, faceted search. The usability challenge is addressed by providing the user with hybrid query capabilities, leveraging the power of structured queries and the ease of use of keyword search. We also propose a faceted search functionality that allows users to progressively compose the structured part of their information need after having started with imprecise keywords. Scalability is one of the main challenges that hybrid queries are facing, due to the large amount of data. Inspired from the cross field of DB and IR integration, we make IR compatible with hybrid search through an extension of the inverted index, and thus able to scale as well as to handle structured information.
To ensure that uncertainty does not remain as a problem to return relevant results, we provide a powerful ranking scheme that considers structures of both data and hybrid queries for score propagation and aggregation during results ranking. As an improvement of our previous work [3], we support faceted search with integrated ranking to tackle both usability and uncertainty issues while preserving efficiency.
Lin, Chen and Yang, Jiang-Ming and Cai, Rui and Wang, Xin-Jing and Wang, Wei and Zhang, Lei Modeling Semantics and Structure of Discussion Threads. The abundant knowledge in web communities has motivated the research interests in discussion threads. The dynamic nature of discussion threads poses interesting and challenging problems for computer scientists. Although techniques such as semantic models or structural models have been shown to be useful in a number of areas, they are inefficient in understanding discussion threads due to the temporal dependence among posts in a discussion thread. Such dependence causes that semantics and structure coupled with each other in discussion threads. In this paper, we propose a sparse coding-based model named SMSS to Simultaneously Model Semantic and Structure of discussion threads.
Tu, Xudong and Wang, Xin-Jing and Feng, Dan and Zhang, Lei Ranking Community Answers via Analogical Reasoning. Due to the lexical gap between questions and answers, automatically detecting right answers becomes very challenging for community question-answering sites. In this paper, we propose an analogical reasoning-based method. It treats questions and answers as relational data and ranks an answer by measuring the analogy of its link to a query with the links embedded in previous relevant knowledge; the answer that links in the most analogous way to the new question is assumed to be the best answer. We based our experiments on 29.8 million Yahoo! Answers question-answer threads and showed the effectiveness of the approach.
Zhang, PeiwuChi, Mingmin and Zhang, Peiwu and Zhao, Yingbin and Feng, Rui and Xue, Xiangyang Web Image Retrieval ReRanking with Multi-view Clustering. General image retrieval is often carried out by a text-based search engine, such as Google Image Search. In this case, natural language queries are used as input to the search engine. Usually, the user queries are quite ambiguous and the returned results are not well-organized as the ranking often done by the popularity of an image. In order to address these problems, we propose to use both textual and visual contents of retrieved images to reRank web retrieved results. In particular, a machine learning technique, a multi-view clustering algorithm is proposed to reorganize the original results provided by the text-based search engine. Preliminary results validate the effectiveness of the proposed framework.
Zhang, XinchangGeng, Guang-Gang and Li, Qiudan and Zhang, Xinchang Link Based Small Sample Learning for Web Spam Detection. Robust statistical learning based web spam detection system often requires large amounts of labeled training data. However, labeled samples are more difficult, expensive and time consuming to obtain than unlabeled ones. This paper proposed link based semi-supervised learning algorithms to boost the performance of a classifier, which integrates the traditional Self-training with the topological dependency based link learning. The experiments with a few labeled samples on standard WEBSPAM-UK2006 benchmark showed that the algorithms are effective.
Zhang, Yun-Fei
Zhang, ZhenyuJiang, Bo and Chan, W. K. and Zhang, Zhenyu and Tse, T. H. Where to Adapt Dynamic Service Compositions. Peer services depend on one another to accomplish their tasks, and their structures may evolve. A service composition may be designed to replace its member services whenever the quality of the composite service fails to meet certain quality-of-service (QoS) requirements. Finding services and service invocation endpoints having the greatest impact on the quality are important to guide subsequent service adaptations. This paper proposes a technique that samples the QoS of composite services and continually analyzes them to identify artifacts for service adaptation. The preliminary results show that our technique has the potential to effectively find such artifacts in services.
Zhao, YingbinChi, Mingmin and Zhang, Peiwu and Zhao, Yingbin and Feng, Rui and Xue, Xiangyang Web Image Retrieval ReRanking with Multi-view Clustering. General image retrieval is often carried out by a text-based search engine, such as Google Image Search. In this case, natural language queries are used as input to the search engine. Usually, the user queries are quite ambiguous and the returned results are not well-organized as the ranking often done by the popularity of an image. In order to address these problems, we propose to use both textual and visual contents of retrieved images to reRank web retrieved results. In particular, a machine learning technique, a multi-view clustering algorithm is proposed to reorganize the original results provided by the text-based search engine. Preliminary results validate the effectiveness of the proposed framework.
Zheng, ShuyiZheng, Shuyi and Dmitriev, Pavel and Lee Giles, C. Graph Based Crawler Seed Selection. This paper identifies and explores the problem of seed selection in a web-scale crawler. We argue that seed selection is not a trivial but very important problem. Selecting proper seeds can increase the number of pages a crawler will discover, and can result in a collection with more “good” and less “bad” pages. Based on the analysis of the graph structure of the web, we propose several seed selection algorithms. Effectiveness of these algorithms is proved by our experimental results on real web data.
Zhou, ChunyingLu, Bin and Wu, Zhaohui and Ni, Yuan and Xie, Guotong and Zhou, Chunying and Chen, Huajun sMash: Semantic-based Mashup Navigation for Data API Network. With the proliferation of data APIs, it is not uncommon that users who have no clear ideas about data APIs will encounter difficulties to build Mashups to satisfy their requirements. In this paper, we present a semantic-based mashup navigation system, sMash that makes mashup building easy by constructing and visualizing a real-life data API network. We build a sample network by gathering more than 300 popular APIs and find that the relationships between them are so complex that our system will play an important role in navigating users and give them inspiration to build interesting mashups easily. The system is accessible at: http://www.dart.zju.edu.cn/mashup.
Zhou, LizhuChen, Dewei and Tang, Jie and Li, Juanzi and Zhou, Lizhu Discovering the Staring People From Social Networks. In this paper, we study a novel problem of staring people discovery from social networks, which is concerned with finding people who are not only authoritative but also sociable in the social network. We formalize this problem as an optimization programming problem. Taking the co-author network as a case study, we define three objective functions and propose two methods to combine these objective functions. A genetic algorithm based method is further presented to solve this problem. Experimental results show that the proposed solution can effectively find the staring people from social networks.
Li, Guoliang and Feng, Jianhua and Zhou, Lizhu Interactive Search in XML Data. In a traditional keyword-search system in XML data, a user composes a keyword query, submits it to the system, and retrieves relevant subtrees. In the case where the user has limited knowledge about the data, often the user feels “left in the dark” when issuing queries, and has to use a try-and-see approach for finding information. In this paper, we study a new information-access paradigm for XML data, called “Inks,” in which the system searches on the underlying data “on the fly” as the user types in query keywords. Inks extends existing XML keyword search methods by interactively answering keyword queries. We propose effective indices, early-termination techniques, and efficient search algorithms to achieve a high interactive speed. We have implemented our algorithm. The experimental results show that Inks achieves high search efficiency and result quality.
Zhou, YipingHe, Xiaofeng and Duan, Lei and Zhou, Yiping and Dom, Byron Threshold Selection for Web-Page Classification with Highly Skewed Class Distribution. We propose a novel cost-efficient approach to threshold selection for binary web-page classification problems with imbalanced class distributions. In many binary-classification tasks the distribution of classes is highly skewed. In such problems, using uniform random sampling in constructing sample sets for threshold setting requires large sample sizes in order to include a statistically sufficient number of examples of the minority class. On the other hand, manually labeling examples is expensive and budgetary considerations require that the size of sample sets be limited. These conflicting requirements make threshold selection a challenging problem. Our method of sample-set construction is a novel approach based on stratified sampling, in which manually labeled examples are expanded to reflect the true class distribution of the web-page population. Our experimental results show that using false positive rate as the criterion for threshold setting results in lower-variance threshold estimates than using other widely used accuracy measures such as F1 and precision.
Zhu, Bin B.Zhang, Kaimin and Wang, Lu and Guo, Xiaolin and Pan, Aimin and Zhu, Bin B. WPBench: A Benchmark for Evaluating the Client-side Performance of Web 2.0 Applications. In this paper, a benchmark called WPBench is reported to evaluate the responsiveness of Web browsers for modern Web 2.0 applications. In WPBench, variations of servers and networks are removed and the benchmark result is the closest to what Web users would perceive. To achieve these, WPBench records users’ interactions with typical Web 2.0 applications, and then replays Web navigations when benchmarking browsers. The replay mechanism can emulate the actual user interactions and the characteristics of the servers and the networks in a consistent way independent of browsers so that any browser compliant to the standards can be benchmarked fairly. In addition to describing the design and generation of WPBench, we also report the WPBench comparison results on the responsiveness performance for three popular Web browsers: Internet Explorer, Firefox and Chrome.
Zhu, JunyanZhu, Junyan and Wang, Can and He, Xiaofei and Bu, Jiajun and Chen, Chun and Shang, Shujie and Qu, Mingcheng and Lu, Gang Tag-Oriented Document Summarization. Social annotations on a Web document are highly generalized description of topics contained in that page. Their tagged frequency indicates the user attentions with various degrees. This makes annotations a good resource for summarizing multiple topics in a Web page. In this paper, we present a tag-oriented Web document summarization approach by using both document content and the tags annotated on that document. To improve summarization performance, a new tag ranking algorithm named EigenTag is proposed in this paper to reduce noise in tags. Meanwhile, association mining technique is employed to expand tag set to tackle the sparsity problem. Experimental results show our tag-oriented summarization has a significant improvement over those not using tags.
Zhu, YunzhangWang, Gang and Hu, Jian and Zhu, Yunzhang and Li, Hua and Chen, Zheng Competitive Analysis from Click-Through Log. Existing keyword suggestion tools from various search engine companies could automatically suggest keywords related to the advertisers’ products or services, counting in simple statistics of the keywords, such as search volume, cost per click (CPC), etc. However, the nature of the generalized Second Price Auction suggests that better understanding the competitors’ keyword selection and bidding strategies better helps to win the auction, other than only relying on general search statistics. In this paper, we propose a novel keyword suggestion strategy, called Competitive Analysis, to explore the keyword based competition relationships among advertisers and eventually help advertisers to build campaigns with better performance. The experimental results demonstrate that the proposed Competitive Analysis can both help advertisers to promote their product selling and generate more revenue to the search engine companies.
Zudina, EkaterinaNikolaev, Kirill and Zudina, Ekaterina and Gorshkov, Andrey Combining Anchor Text Categorization and Graph Analysis for Paid Link Detection. In order to artificially boost the rank of commercial pages in search engine results, search engine optimizers pay for links to these pages on other websites. Identifying paid links is important for a web search engine to produce highly relevant results. In this paper we introduce a novel method of identifying such links. We start with training a classifier of anchor text topics and analyzing web pages for diversity of their outgoing commercial links. Then we use this information and analyze link graph of the Russian Web to find pages that sell links and sites that buy links and to identify the paid links. Testing on manually marked samples showed high efficiency of the algorithm.
Zuo, Haiqiang
da Silva, Altigran S.Toda, Guilherme A. and Cortez, Eli and Mesquita, Filipe and da Silva, Altigran S. and Moura, Edleno and Neubert, Marden Automatically Filling Form-Based Web Interfaces with Free Text Inputs. On the web of today the most prevalent solution for users to interact with data-intensive applications is the use of form-based interfaces composed by several data input fields, such as text boxes, radio buttons, pull-down lists, check boxes, etc. Although these interfaces are popular and effective, in many cases, free text interfaces are preferred over form-based ones. In this paper we discuss the proposal and the implementation of a novel IR-based method for using data rich free text to interact with form-based interfaces. Our solution takes a free text as input, extracts implicitly data values from it and fills appropriate fields using them. For this task, we rely on values of previous submissions for each field, which are freely obtained from the usage of form-based interfaces.
This list was generated on Fri Feb 15 07:21:28 2019 GMT.
About this site
This website has been set up for WWW2009 by Christopher Gutteridge of the University of Southampton, using our EPrints software.
Preservation
We (Southampton EPrints Project) intend to preserve the files and HTML pages of this site for many years; however, we will turn it into flat files for long-term preservation. This means that at some point in the months after the conference the search, metadata-export, JSON interface, OAI etc. will be disabled as we "fossilize" the site. Please plan accordingly. Feel free to ask nicely for us to keep the dynamic site online longer if there's a really good (or cool) use for it... [this has now happened, this site is now static]