creators_name: Scholl, Philipp
creators_name: Domínguez García, Renato
creators_name: Böhnstedt, Doreen
creators_name: Rensing, Christoph
creators_name: Steinmetz, Ralf
type: conference_item
datestamp: 2009-04-06 19:13:33
lastmod: 2009-04-07 14:03:03
metadata_visibility: show
title: Towards Language–Independent Web Genre Detection
ispublished: pub
full_text_status: public
pres_type: poster
abstract: The term web genre denotes the type of a given web resource, in contrast to the topic of its content. In this research, we focus on recognizing the web genres blog, wiki and forum. We present a set of features that exploit the hierarchical structure of the web page’s HTML mark-up and thus, in contrast to related approaches, do not depend on a linguistic analysis of the page’s content. Our results show that it is possible to achieve very good accuracy for a fully language-independent detection of structured web genres. features (e.g. part-of-speech tagging and document terms), structural features (e.g. HTML tag frequencies, use of facets that enable functionality, such as form input elements) and simple text statistics (e.g. frequencies of punctuation). However, a fact often neglected by related work is that the absolute dominance of the English language on the web is decreasing. Thus, it is important to develop a way of recognizing web genres independently of the language used on the respective web page. As many genres exhibit a certain structural and visual layout, this property makes it possible to ignore linguistic features altogether.
date: 2009-04
pagerange: 1157-1157
event_title: 18th International World Wide Web Conference
event_location: Madrid, Spain
event_dates: April 20th-24th, 2009
event_type: conference
refereed: TRUE
citation: Scholl, Philipp and Domínguez García, Renato and Böhnstedt, Doreen and Rensing, Christoph and Steinmetz, Ralf (2009) Towards Language–Independent Web Genre Detection. In: 18th International World Wide Web Conference, April 20th-24th, 2009, Madrid, Spain.
document_url: http://www2009.eprints.org/159/1/p1157.pdf