Hi Les: You're arguing that Webometrics should count PDFs, and I fully agree.
I was only arguing that Webometrics should not *limit* its count to PDFs.
Sorry if I didn't make that clear.
BTW, I'd make the analogous case to publishers. Publish in PDF if you like, but
never publish in PDF-only. If you offer PDF editions, then also offer XML or
HTML editions.
Best, Peter
Peter Suber
www.bit.ly/suber
----------
On Sun, Jul 11, 2010 at 6:49 AM, Leslie Carr <lac_at_ecs.soton.ac.uk> wrote:
> On 10 Jul 2010, at 15:37, Peter Suber wrote:
>
> > For more detail on "rich media" or "rich files", see the
> > Webometrics page on methodology: "Only the number of text
> > files in Acrobat format (.pdf) ... are considered." ... This is
> > a bug, not a feature. A more useful ranking would try to
> > count full-text scholarly or peer-reviewed articles regardless
> > of format. I know that's hard to do. But it's a mistake to
> > use any format as a surrogate for that status, and especially
> > a format as flawed as PDF. Even if Webometrics wanted to
> > reward some formats more than others, it should not reward
> > PDF.
>
> I think it should. The overwhelming majority of academic papers are
> distributed online as PDF; the overwhelming majority of things in
> repositories that are not PDF are not academic papers.
>
> > The format is optimized for print or reading, not for use or
> > reuse. PDFs are slow to load and often not even readable in
> > bandwidth-poor parts of the world. They crash many browsers.
> > They often lack working links; when they do have links, they
> > require users to open in the same window rather than in a
> > separate window, losing the file that took so long to load.
> > Users can't deep-link to subsections. Publishers can lock
> > them to prevent cutting and pasting. Publishers can insert
> > scripts to make them unreadable offline or after a certain
> > time. PDFs impede text processing by users, text mining by
> > software, handicapped access ("read-aloud" software), and
> > mark-up by third parties.
>
> This is an argument about what software/data formats researchers *should*
> use; affecting their authoring and editorial processes is probably beyond
> the scope of what we can expect from this league table.
>
> > PubMed Central scores low in the Webometric rankings because
> > it has no PDFs.
>
> It does "have" PDFs - it might ingest articles in XML, but it certainly
> exports them in PDF. Enquiring of Google (site:www.ncbi.nlm.nih.gov
> filetype:pdf) shows that it has about 6,690,000 PDFs.
>
> > But PMC is one of the most populated and useful OA
> > repositories in the world.
>
> This is something that needs investigating. If I had to guess why it ranks
> so low, it might be because no-one is linking INTO pubmed; rather they are
> linking to the original publishers.
>
> > The format it uses instead of PDF, the NLM DTD coded in XML,
> > is vastly superior to PDF for every scholarly purpose. I
> > haven't had time to code my articles in XML. But since even
> > HTML is superior to PDF for purposes of access and reuse, I
> > self-archive in HTML rather than PDF whenever I can.
>
> For the record, I completely agree with you about PDF / HTML / XHTML. If
> only Microsoft Word (and LaTeX) had decent export facilities that produced
> good "semantic" HTML.
> --
> Les Carr
Received on Sun Jul 11 2010 - 19:58:14 BST