Content Analysis in Web 2.0

Published Version

PDF (Exploring Collaboratively Annotated Data for Automatic Annotation) (222Kb)

PDF (An Integrated Approach for Relation Extraction from Wikipedia Texts) (943Kb)

PDF (A Ngram-based Statistical Machine Translation Approach for Text Normalization on Chat-speak Style Communications) (162Kb)

PDF (Opinion Analysis on CAW 2.0 Datasets) (73Kb)

PDF (Using automatic keyword extraction to detect off-topic posts in online discussion boards) (429Kb)

PDF (Detection of Harassment on Web 2.0) (210Kb)

Abstract

Web mining deals with understanding and discovering information in the World Wide Web. It focuses on analyzing three sources of information: web structure, user activity and contents. In Web 2.0, data on web structure and user activity can be handled in much the same way as on the traditional Web; for contents, however, conventional analysis and mining procedures are no longer suitable. This is mainly because Web 2.0 contents are generated by users, who use language very freely and constantly incorporate new communication elements that are generally context dependent. This kind of language can also be found in chats, SMS, e-mails and other channels of informal textual communication.

This workshop focuses on the problem of making Web 2.0 both searchable and analyzable in terms of its contents. This is an extremely important endeavor for current web mining technologies for two reasons: first, user-generated content (UGC) is growing faster than ever; second, automatic analysis of UGC will help improve ordinary users' experience of Internet resources and opportunities while also helping to detect and track criminal and terrorist activity.

In this first edition of the workshop we aim to draw the attention of interested research groups and companies to the new challenges and opportunities of Web 2.0 content analysis. More specifically, we will focus on tasks within the scope of text content mining, with the intention of extending the coverage to multimedia data in future editions. Accordingly, for this first edition we will collect and provide a corpus to be used as an experimental collection for research on three shared tasks: text normalization, opinion mining and misbehavior detection.

The text normalization shared task addresses the chat-speak style of communication. Some recent research in this area has dealt with SMS communications from the perspective of machine translation. In this shared task we attempt to generalize the problem to Web 2.0 contents and to explore additional alternatives that participants may come up with.

The opinion mining shared task addresses problems such as determining text subjectivity and polarity, and sentiment analysis. Although these problems have already been approached from different perspectives, most of the research has been carried out on domain-specific data and applications in which users are asked to rate services or products. Our intention is to turn attention to the more general domain in which Web 2.0 users express their sentiments and opinions in their daily interaction within a virtual community.

Finally, the misbehavior detection shared task addresses the detection of inappropriate activity in which some users of a virtual community harass or offend other members. We consider that this shared task can provide a good starting point for a future shared task with the more ambitious goal of classifying users and detecting impersonation (identity theft) in on-line criminal activity.

About this site

This website has been set up for WWW2009 by Christopher Gutteridge of the University of Southampton, using our EPrints software.

Preservation

We (Southampton EPrints Project) intend to preserve the files and HTML pages of this site for many years; however, we will turn it into flat files for long-term preservation. This means that at some point in the months after the conference the search, metadata-export, JSON interface, OAI etc. will be disabled as we "fossilize" the site. Please plan accordingly. Feel free to ask nicely for us to keep the dynamic site online longer if there's a really good (or cool) use for it... [this has now happened, this site is now static]