HRAF, again

HRAF (HRAFSIR@YALEVM.CIS.YALE.EDU)
Fri, 29 Apr 1994 13:33:23 EDT

I heard that some of you missed this the first time, so here goes again.
Apologies to those who have seen it before.

Thanks to all of you who expressed interest in the HRAF Interest Group. When
we have a draft of the proposal to the AAA we will put it on the net for your
comments. The rest of this message is a reply to Barry Lewis's comments on
HRAF.

Predicting events ten years from now is dangerous in an environment as
volatile and rapidly changing as that of e-texts. Everyone, including HRAF,
knows that the Internet is going to play a large role in the future of
information delivery. That role has not been defined specifically as yet; the
Internet is still an unorganized and user-unfriendly thing. HRAF's
distribution plans for the near future are just the beginning of our
electronic distribution effort. We fully expect them to change in considerably
less than ten years. We decided to begin with CD-ROM publication, since our
surveys showed that that is the medium librarians want. Libraries are the
institutions through which people access the HRAF archive, although they may
not be forever. I doubt that Barry thinks HRAF should wait ten years before
issuing anything electronically. There are many things that can, should be,
and will be done in the meantime.

What is happening network-wise, however, and what HRAF will probably become
involved in long before Barry's ten years are up, is relatively local
networking of electronic text databases using client-server architecture. The
distribution media are of less significance than the data and file structures,
data and file transfer protocols, and search and retrieval software. All of
these are complex technical matters. They involve issues such as the SGML
coding of text, the use of the Z39.50 data transfer protocol, and other
things. Both SGML and Z39.50 involve ISO and NISO
standards. I honestly doubt that these are issues that AAA cares to be
involved in, but they are the nuts and bolts of "...[defining] standards, data
priorities, and address methodological problems associated with the
development, maintenance, and analysis of these data archives." The proposed
HRAF Interest Group, while having some concern with these information science
issues, will probably be more concerned with anthropological research and
teaching using the HRAF database. It should be interesting, since once a
database like this is available people will start to think of things they
never imagined before.

I call the e-text produced by those thousands of men and women armed with
hand-held OCRs "Ad-Hoc E-Texts." They are ad-hoc because they adhere to none
of the standards mentioned above and each one needs to be approached
differently from the other. They are in no way comparable to the database that
HRAF is producing. The Ad-Hoc E-Texts are usually unorganized "flat ASCII" or,
worse yet, WordPerfect files with simple or no search and retrieval software
supplied.

HRAF differs significantly from the Ad-Hoc E-Texts in that what we basically
do at HRAF and have always done is to add value to texts. This value that we
add consists most significantly in the selection of texts and the application
at a very specific level, i.e., each paragraph of the text, of analytic,
controlled-vocabulary indexing. I am referring to the use of the OUTLINE OF
CULTURAL MATERIALS (OCM) and the OUTLINE OF WORLD CULTURES (OWC). The
development and application of these indexing systems is a significant task
and much of the cost of the HRAF archive subscription pays for these value-
added features. We do not simply grab books off the library shelves, scan them
with hand-held OCRs, and toss them out to the HRAF membership. In a slightly
different vein, HRAF, like others, has found that OCR scanning is not a
reliable method of data conversion for large and diverse collections of text.
HRAF discovered several years ago that it was cheaper to key the data in than
to correct the errors generated by OCR scanning. HRAF has contracted with a
data conversion business. Data conversion businesses offer all of the available
methods, from OCR and image scanning to keying in. They scan what they can and
key in the rest. Our experience over about a five-year period, with our own
in-house effort to scan and two different data conversion houses, is that most
of the data needs to be keyed in. This is a costly process.
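
To make the paragraph-level indexing described above more concrete, here is a
minimal sketch of how subject-coded paragraphs might be stored and searched.
It is an illustration only and not HRAF's actual data structure or software.
The class and function names are hypothetical, and the codes ("SUBJ-1",
"CULT-A", and so on) are placeholders standing in for OCM and OWC codes, not
real categories from either outline.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    """One paragraph of a source text, tagged by an indexer."""
    doc_id: str    # hypothetical document identifier
    culture: str   # placeholder standing in for an OWC code
    number: int    # paragraph position within the document
    text: str
    subjects: set = field(default_factory=set)  # placeholders for OCM codes


def build_subject_index(paragraphs):
    """Map each subject code to the paragraphs indexed under it."""
    index = defaultdict(list)
    for para in paragraphs:
        for code in para.subjects:
            index[code].append(para)
    return index


def search(index, code, culture=None):
    """Return paragraphs carrying `code`, optionally limited to one culture."""
    hits = index.get(code, [])
    if culture is not None:
        hits = [p for p in hits if p.culture == culture]
    return hits


# Invented sample data; the texts and codes are illustrative only.
paragraphs = [
    Paragraph("doc1", "CULT-A", 1, "Text about a wedding feast.", {"SUBJ-1", "SUBJ-2"}),
    Paragraph("doc1", "CULT-A", 2, "Text about house building.", {"SUBJ-3"}),
    Paragraph("doc2", "CULT-B", 1, "Text about a marriage rite.", {"SUBJ-1"}),
]

index = build_subject_index(paragraphs)
all_hits = search(index, "SUBJ-1")                        # across all cultures
culture_hits = search(index, "SUBJ-1", culture="CULT-B")  # one culture only
```

The point of the sketch is that a search retrieves individual paragraphs by
controlled-vocabulary code rather than by whatever words the author happened
to use, which is something a flat ASCII file cannot offer without this added
indexing.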

Barry is primarily concerned with the high cost of the electronic HRAF. The
subscription cost is very close to the cost of the microfiche, and the increase
over the microfiche price covers the cost of converting the more than twice as
much data that will be included each year from now on. There is
considerable cost for HRAF to select documents for the database, index them
with the OCM, convert them to electronic format, publish, distribute, and
support it each year. HRAF is a not-for-profit membership organization and we
try our best to supply the database at as low a cost as possible to member
institutions. Regarding the Electronic HRAF, last summer we announced active
member annual dues of $3,400 and non-member dues of $5,400. Since then, as we
have controlled the cost of the Electronic HRAF, we have lowered the dues to
$2,900 for members and $3,900 for non-members. The dues are set by HRAF's
Board of Directors, which is currently composed of thirteen social scientists
and eight librarians, all of whom are well aware of their own institutions'
efforts at cost containment. HRAF's bottom line is that if HRAF did not cover
its production costs, HRAF would cease to exist.

The social sciences are not well represented in the e-text world. Most e-text
centers are full of humanities texts and some are technologically very
sophisticated. The social sciences are not represented for a very simple
reason: the social sciences are still not really very interested in texts; they
are interested in data, and the "rawer" the better. The humanities are far
ahead technologically because texts, qua texts, are precisely what they are
interested in. I suggest that Barry look at the latest (March 1994) issue of
INFORMATION TECHNOLOGY AND LIBRARIES that is devoted to the subject of
electronic texts.

The Smithsonian and another organization have already published a series of
CD-ROMs consisting of BAE publications from 1879-1960. It is called "North
American Indian" and it consists of four CD-ROMs containing 150 titles with
28,000 pages of text, images, etc. The BAE database sells for $695 per CD-ROM
volume and is apparently issued irregularly. By the way, since Barry was
making comparisons between HRAF and the BAE database, each annual installment
of HRAF will contain 50,000 pages of text and approximately 1,500 photographs;
we expect that the first five years' worth (ca. 160,000 pages of text) will fit
on one CD-ROM. The entire HRAF database, if produced now, using current
compression techniques, would fill five CD-ROMs, containing ca. 6,500 titles
and 800,000 text pages. There is a big difference between the BAE database and
the HRAF database.
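
For anyone who wants to check the comparison, the figures quoted in this
message can be worked through as follows. This is only rough arithmetic on
the numbers given above. It assumes all four BAE volumes are bought at the
$695 list price, and it uses the member dues of $2,900 for one annual HRAF
installment. The variable names are my own. Note that the BAE page count
includes images, so pages per disc is not a strict like-for-like measure.

```python
# BAE "North American Indian" set, figures as quoted in this message.
bae_discs = 4
bae_pages = 28_000
bae_price_per_disc = 695  # dollars per CD-ROM volume

bae_pages_per_disc = bae_pages // bae_discs
bae_total_price = bae_discs * bae_price_per_disc
bae_price_per_page = bae_total_price / bae_pages

# Entire HRAF database if produced now, figures as quoted in this message.
hraf_discs = 5
hraf_pages = 800_000
hraf_pages_per_disc = hraf_pages // hraf_discs

# One annual HRAF installment at the member rate.
hraf_annual_pages = 50_000
hraf_member_dues = 2_900  # dollars per year
hraf_member_price_per_page = hraf_member_dues / hraf_annual_pages

print(f"BAE:  {bae_pages_per_disc:,} pages per disc, "
      f"${bae_price_per_page:.3f} per page")
print(f"HRAF: {hraf_pages_per_disc:,} pages per disc (full database), "
      f"${hraf_member_price_per_page:.3f} per page (annual, member rate)")
```

On these assumptions the BAE set works out to 7,000 pages per disc at roughly
ten cents a page, against 160,000 pages per disc for HRAF at roughly six cents
a page for a member's annual installment.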

This is already long-winded enough, so I will refrain from explaining the
tedious and dense details of the data and file structures. If anyone is
interested in these details, let me know.