creators_name: Pasternack, Jeff creators_name: Roth, Dan type: conference_item datestamp: 2009-04-06 19:11:34 lastmod: 2009-04-07 14:02:41 metadata_visibility: show title: Extracting Article Text from the Web with Maximum Subsequence Segmentation ispublished: pub full_text_status: public pres_type: paper abstract: Much of the information on the Web is found in articles from online news outlets, magazines, encyclopedias, review collections, and other sources. However, extracting this content from the original HTML document is complicated by the large amount of less informative and typically unrelated material such as navigation menus, forms, user comments, and ads. Existing approaches tend to be either brittle and demand significant expert knowledge and time (manual or tool-assisted generation of rules or code), necessitate labeled examples for every different page structure to be processed (wrapper induction), require relatively uniform layout (template detection), or, as with Visual Page Segmentation (VIPS), are computationally expensive. We introduce maximum subsequence segmentation, a method of global optimization over token-level local classifiers, and apply it to the domain of news websites. Training examples are easy to obtain, both learning and prediction are linear time, and results are excellent (our semi-supervised algorithm yields an overall F1score of 97.947%), surpassing even those produced by VIPS with a hypothetical perfect block-selection heuristic. We also evaluate against the recent CleanEval shared task with surprisingly good cross-task performance cleaning general web pages, exceeding the top “text-only” score (based on Levenshtein distance), 87.8% versus 84.1%. date: 2009-04 pagerange: 971-971 event_title: 18th International World Wide Web Conference event_location: Madrid, Spain event_dates: April 20th-24th, 2009 event_type: conference refereed: TRUE citation: Pasternack, Jeff and Roth, Dan (2009) Extracting Article Text from the Web with Maximum Subsequence Segmentation. In: 18th International World Wide Web Conference, April 20th-24th, 2009, Madrid, Spain. document_url: http://www2009.eprints.org/98/1/p971.pdf