<mets:mets OBJID="oai:generic.eprints.org:98" LABEL="Eprints Item" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mets="http://www.loc.gov/METS/"><mets:metsHdr CREATEDATE="2019-02-15T10:26:23Z"><mets:agent TYPE="ORGANIZATION" ROLE="CUSTODIAN"><mets:name>WWW2009 EPrints</mets:name></mets:agent></mets:metsHdr><mets:dmdSec ID="DMD_oai:generic.eprints.org:98_mods"><mets:mdWrap MDTYPE="mods"><mets:xmlData><mods:titleInfo><mods:title>Extracting Article Text from the Web with Maximum Subsequence Segmentation</mods:title></mods:titleInfo><mods:name type="personal"><mods:namePart type="given">Jeff</mods:namePart><mods:namePart type="family">Pasternack</mods:namePart><mods:role><mods:roleTerm type="text">author</mods:roleTerm></mods:role></mods:name><mods:name type="personal"><mods:namePart type="given">Dan</mods:namePart><mods:namePart type="family">Roth</mods:namePart><mods:role><mods:roleTerm type="text">author</mods:roleTerm></mods:role></mods:name><mods:abstract>Much of the information on the Web is found in articles from online news outlets, magazines, encyclopedias, review collections, and other sources. However, extracting this content from the original HTML document is complicated by the large amount of less informative and typically unrelated material such as navigation menus, forms, user comments, and ads. Existing approaches tend to be either brittle and demand significant expert knowledge and time (manual or tool-assisted generation of rules or code), necessitate labeled examples for every different page structure to be processed (wrapper induction), require relatively uniform layout (template detection), or, as with Visual Page Segmentation (VIPS), are computationally expensive. 
We introduce maximum subsequence segmentation, a method of global optimization over token-level local classifiers, and apply it to the domain of news websites. Training examples are easy to obtain, both learning and prediction are linear time, and results are excellent (our semi-supervised algorithm yields an overall F1 score of 97.947%), surpassing even those produced by VIPS with a hypothetical perfect block-selection heuristic. We also evaluate against the recent CleanEval shared task with surprisingly good cross-task performance cleaning general web pages, exceeding the top “text-only” score (based on Levenshtein distance), 87.8% versus 84.1%.</mods:abstract><mods:originInfo><mods:dateIssued encoding="iso8601">2009-04</mods:dateIssued></mods:originInfo><mods:genre>Conference or Workshop Item</mods:genre></mets:xmlData></mets:mdWrap></mets:dmdSec><mets:amdSec ID="TMD_oai:generic.eprints.org:98"><mets:rightsMD ID="rights_oai:generic.eprints.org:98_mods"><mets:mdWrap MDTYPE="mods"><mets:xmlData><mods:useAndReproduction>
<p xmlns="http://www.w3.org/1999/xhtml"><strong>For work being deposited by its own author:</strong> 
In self-archiving this collection of files and associated bibliographic 
metadata, I grant WWW2009 EPrints the right to store 
them and to make them permanently available publicly for free on-line. 
I declare that this material is my own intellectual property and I 
understand that WWW2009 EPrints does not assume any 
responsibility if there is any breach of copyright in distributing these 
files or metadata. (All authors are urged to prominently assert their 
copyright on the title page of their work.)</p>

<p xmlns="http://www.w3.org/1999/xhtml"><strong>For work being deposited by someone other than its 
author:</strong> I hereby declare that the collection of files and 
associated bibliographic metadata that I am archiving at 
WWW2009 EPrints is in the public domain. If this is 
not the case, I accept full responsibility for any breach of copyright 
that distributing these files or metadata may entail.</p>

<p xmlns="http://www.w3.org/1999/xhtml">Clicking on the deposit button indicates your agreement to these 
terms.</p>
    </mods:useAndReproduction></mets:xmlData></mets:mdWrap></mets:rightsMD></mets:amdSec><mets:fileSec><mets:fileGrp USE="reference"><mets:file SIZE="877068" ID="oai:generic.eprints.org:98_98_1" MIMETYPE="application/octet-stream" OWNERID="http://www2009.eprints.org/98/1/p971.pdf"><mets:FLocat LOCTYPE="URL" xlink:href="http://www2009.eprints.org/98/1/p971.pdf" xlink:type="simple"></mets:FLocat></mets:file></mets:fileGrp></mets:fileSec><mets:structMap><mets:div DMDID="DMD_oai:generic.eprints.org:98_mods" AMDID="TMD_oai:generic.eprints.org:98"><mets:fptr FILEID="oai:generic.eprints.org:98_98_1"></mets:fptr></mets:div></mets:structMap></mets:mets>