Mailing List Discussions



At the 2nd Workshop on the Open Archives Initiative (OAI), 'Gaining Independence with ePrints Archives and OAI' (17-19 October at CERN, Geneva), a forum for the discussion of 'subject' issues was identified as an important next step. Southampton is introducing this on the new oai-eprints mailing list, which was created as a result of the workshop.

Previous mailing lists have been reviewed briefly, and relevant items have been linked to the key mailing lists.

Questions to consider for Institutional e-Print archives:

To assist discussion, a table (Word, HTML) has been created to illustrate the variety of approaches to subject search provision in existing archives. The table identifies a selection of e-Print archives, including those using the eprints.org software, annotated with the type of classification or terminology they appear to be using and an indication of what type of archive they appear to be. The V.1 software was issued without a default classification or thesaurus. From our review, it appears that most archives see the need for categorization and are using some sort of in-house classification or terminology. One wonders what the situation would have been had eprints.org offered a default classification from the start.

Options for 'Browse by subject'


Unless a complete set of records (or a date-limited set) is being harvested, the metadata must have a level of uniformity, either in subject categorization or in indexing terminology, between harvester and harvested, to enable selectivity (and mappability) of records. However, if a push model were used rather than harvesting (pull), the data provider would select which records would be deposited with the service provider / discipline-based archive.
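To make the selectivity point concrete: the OAI-PMH protocol only lets a harvester be selective along two axes, sets and datestamps, which is why shared subject-based sets matter. The sketch below builds a selective ListRecords request; the repository URL and the set name are hypothetical, for illustration only.

```python
from urllib.parse import urlencode

def list_records_url(base_url, set_spec=None, from_date=None, prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL. A harvester can be
    selective only via sets ('set') and datestamps ('from'/'until')."""
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    if set_spec:
        params["set"] = set_spec       # e.g. a subject-based set
    if from_date:
        params["from"] = from_date     # incremental, date-limited harvest
    return base_url + "?" + urlencode(params)

# Hypothetical repository and set name.
url = list_records_url("http://archive.example.org/oai",
                       set_spec="physics", from_date="2002-10-01")
print(url)
```

Without an agreed meaning for a set like "physics" across archives, the harvester has no reliable way to pull only the records it wants.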

A rigorously applied universal categorization scheme would simplify and enable the harvesting envisaged in the OAI vision.

However, is it likely that a self-archiving researcher will bother to sort through a long list of terms/classifications to select the correct ones?

In mediated archiving (through an information unit), intelligent/manual indexing is no problem, but it is time-consuming.



A universal categorization scheme that all archives would use. Examples:

1. The Library of Congress System (LoC) organizes material according to twenty-one branches of knowledge. The 21 categories (labelled A-Z, except I, O, W, X and Y) are further divided by adding one or two additional letters and a set of numbers.

2. In the Dewey Decimal Classification (DDC), basic classes are organized by disciplines or fields of study. At the broadest level, the DDC is divided into ten main classes, which together cover the entire world of knowledge. Each main class is further divided into ten divisions, and each division into ten sections (not all the numbers for the divisions and sections have been used). The three summaries are available online.

3. The Universal Decimal Classification (UDC) is a multilingual classification scheme for all fields of knowledge, consisting of Arabic numerals and common punctuation marks. It was adapted from the DDC.

see also: Controlled vocabularies, thesauri and classification systems available in the WWW. DC Subject (Koch, T)
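The DDC's decimal structure makes its "three summaries" purely positional, which is one reason it suits mechanical processing. A minimal sketch of that decomposition (the example captions in the comments are from the published DDC summaries):

```python
def ddc_summaries(number):
    """Decompose a three-digit DDC number into its main class,
    division and section (the DDC's 'three summaries')."""
    n = int(number)
    return {
        "main class": n // 100 * 100,   # e.g. 500 Natural sciences & mathematics
        "division":   n // 10 * 10,     # e.g. 510 Mathematics
        "section":    n,                # e.g. 516 Geometry
    }

print(ddc_summaries("516"))
# {'main class': 500, 'division': 510, 'section': 516}
```

A service provider could use the same trick to collapse detailed local DDC numbers to broad classes for cross-archive browsing.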


Discipline-based categorization schemes agreed to be used by archives in a particular discipline (offered by an organization or journal). Examples:

1. ACM Classification System

2. Medical Subject Headings (MeSH)

3. Global Change Master Directory (GCMD)

4. Journal of Economic Literature Classification System



A default universal categorization scheme issued with the software, with the option to import an in-house or specialized scheme

1. Enables harvesting using the universal classification, while retaining the specificity of local indexing for local search and retrieval
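This option amounts to a crosswalk: each record carries both its specific in-house term and a mapped term from the default universal scheme. The mapping table below is hypothetical, with illustrative DDC-style captions standing in for whatever default scheme the software shipped.

```python
# Hypothetical crosswalk from an archive's in-house subject terms to a
# default universal scheme (illustrative DDC-style captions).
CROSSWALK = {
    "Quantum Computing": "004 Computer science",
    "Medieval History":  "900 History",
    "Cell Biology":      "570 Life sciences",
}

def export_subject(local_term):
    """Expose the universal term for harvesting, while keeping the
    specific local term for local search and retrieval."""
    universal = CROSSWALK.get(local_term, "000 Generalities")
    return {"local": local_term, "universal": universal}

print(export_subject("Quantum Computing"))
```

The service provider harvests only the coarse universal term; the fine-grained local term never needs to be understood outside the originating archive.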


A new categorization scheme agreed by the OAI community, e.g. based on faculty/department titles. Examples:

Cambridge UK

Southampton UK

Montreal Canada

Yale USA

Melbourne Australia



There are now a number of software packages that extract keywords from articles and automatically allocate keyword indexing, e.g. NSTEIN, Autonomy, etc., but these build 'knowledge bases' which then act as a controlled indexing vocabulary. Some go further and map the extracted keywords to an accepted universal classification scheme, e.g. OCLC - Dewey. (These will be explored as partners in the e-Prints UK Project.)
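The commercial products are proprietary, but the underlying idea can be sketched simply: pull frequent terms from the text, then look them up in a knowledge base that maps keywords to classes. Everything below (stopword list, keyword-to-class table) is a toy stand-in for the knowledge bases such packages build, assuming nothing about any particular product.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "on"}

# Toy keyword-to-class mapping, standing in for a learned knowledge base.
KEYWORD_TO_CLASS = {"protein": "Biochemistry", "algorithm": "Computer Science"}

def extract_keywords(text, k=3):
    """Crude automatic indexing: the k most frequent non-stopword terms."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]

def classify(text):
    """Map extracted keywords onto an accepted classification scheme."""
    for kw in extract_keywords(text):
        if kw in KEYWORD_TO_CLASS:
            return KEYWORD_TO_CLASS[kw]
    return "Unclassified"

abstract = "The algorithm sorts records; the algorithm scales to large archives."
print(extract_keywords(abstract), classify(abstract))
```

Real systems use far richer statistics and training data, but the shape is the same: extraction produces an uncontrolled vocabulary, and the mapping step is what turns it into something a harvester can rely on.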

Most of this software is proprietary; how can it be used within an open-source activity? Value-added services offered may be chargeable and could therefore fund the software. Lesser-known or project software may be open source.

Report on automatic classification systems (Gietz, P.)



Can we rely on web search engines like Google to search deeply or accurately enough?


Stevan Harnad (Self-Archiving FAQ / 26. Classification)

When we want to search the journal literature, we do not look to any university classification system: we go to indexing services such as INSPEC, MEDLINE, ISI, etc. (Those do have their own classification systems, but it is unlikely that any of those classifications could out-perform google-style boolean search on an inverted full-text index, especially if aided by citation-frequency-based, hit-based, recency-based, or relevance-based ranking of search output, as done, for example, by citebase).
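Harnad's "boolean search on an inverted full-text index" is the mechanism Google-style engines use instead of any classification scheme. A minimal sketch of the idea, with toy documents invented for illustration:

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: each term maps to the set of
    document ids whose full text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Boolean AND: documents containing every query term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {1: "quantum error correction",
        2: "error estimates in surveys",
        3: "quantum field theory"}
index = build_index(docs)
print(boolean_and(index, "quantum", "error"))   # {1}
```

No categorization is applied at deposit time; the burden shifts entirely to query-time intersection and ranking, which is exactly the trade-off the two quotations below debate.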

Bill Arms (D-Lib, Jul 2000)

For medical research, no web search engine can approach the National Library of Medicine's Medline service. Medline has over 11 million references and abstracts. It is built by a team of indexers who have knowledge of bio-medical research, using indexing rules and MeSH subject headings that have been developed laboriously over decades. In contrast, web search services such as Google are entirely automated. The indexes are built by a team of computers with no knowledge of what they are indexing. Google has the advantage over Medline of indexing hundreds of millions of web pages, and doing so repeatedly every month. It is quite useful for finding general information on medical topics, but it does not index the major scientific journals, its indexing records are crude, it has no understanding of medical terminology, and makes no attempt to separate sound medicine from quackery. It is a long way from being a substitute for Medline.

On the other hand, consider the trade-off between Google and Inspec, which is the leading abstracting and indexing service for computing. I used to be a regular user of Inspec, but have largely abandoned it in favor of Google. In many areas of computing, Google's restriction to open access web materials is relatively unimportant, since almost every significant result first appears on a web site and only later reaches the printed journals, if ever. Google is more up to date than Inspec, its coverage is broader and its indexing records are good enough for me to find what I am looking for. But its greatest strength is that everything in its indexes is available online with open access. In computing, substantially the same information is often available from several sources. Google provides a direct link to an open access version. Inspec references a formally published version, which is usually printed or online with restricted access. For my purposes, Google's broad coverage and convenient links more than compensate for its weaknesses.