Why we should archive USENET

Danny Yee (danny@STAFF.CS.SU.OZ.AU)
Sun, 28 Aug 1994 13:28:50 +1000

[ note, this is from a thread in news.misc and alt.culture.usenet ]

In article <FISCHER.94Aug19170005@ssx1.dina.kvl.dk>,
>True. The amount of data (nope, I'm not going to call it information)
>flowing thru' USENET these days is so vast that archiving all of it
>really makes no sense. You'll end up with more data than you can
>index, search, or do anything at all useful with.

I don't think this is true. If you are looking at something simple
enough, automated search and analysis programs can easily plow through
gigabytes of data. Just ask Kibo about it :-).

>The art of archiving is not to collect everything in sight. Rather,
>it's to carefully select the useful bits (useful according to some
>specification defined by what you're up to) and archive only the data
>that meets it.

But for social science research it'd be nice to have *everything*, not
just the bits people have decided there's a market for on CD-ROM.

Here's an example of something I'd like to do that would be best done
with complete USENET archives.

Construct a graph (or perhaps a table) showing relationships between newsgroups
based on crossposting and shared inhabitants (=posters). It's pretty
clear that there are "relationships" between groups - news.groups and
news.misc, for example, or all the startrek groups. But how exactly
do the demographics of this work? What would it mean if there were
a high inverse correlation between participation in alt.aquaria
and comp.*? Could one factor talk.origins posters into two groups
based on postings to bionet.* or *religion*?
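
As a rough illustration of the kind of tabulation I mean, here is a minimal
Python sketch. It assumes the archive has already been reduced to
(poster, Newsgroups header) pairs, one per article; the function name
group_relationships and the choice of Jaccard overlap as the measure of
"shared inhabitants" are my own assumptions for the example, not an
established method. It counts crossposted articles for each pair of groups
and computes how much the sets of posters overlap.

```python
from collections import Counter, defaultdict
from itertools import combinations


def group_relationships(articles):
    """Tabulate pairwise newsgroup relationships.

    articles: iterable of (poster, newsgroups_header) pairs, where
    newsgroups_header is the comma-separated Newsgroups line.
    Returns (crossposts, shared), both keyed by alphabetically
    ordered (group_a, group_b) tuples.
    """
    crossposts = Counter()        # pair -> number of articles posted to both
    posters = defaultdict(set)    # group -> set of distinct posters

    for poster, newsgroups in articles:
        groups = sorted({g.strip() for g in newsgroups.split(",") if g.strip()})
        for group in groups:
            posters[group].add(poster)
        for pair in combinations(groups, 2):
            crossposts[pair] += 1

    # Jaccard overlap of poster sets: |A & B| / |A | B| for each pair.
    shared = {}
    for a, b in combinations(sorted(posters), 2):
        union = posters[a] | posters[b]
        shared[(a, b)] = len(posters[a] & posters[b]) / len(union)
    return crossposts, shared
```

Calling crossposts, shared = group_relationships(pairs) gives two tables
that could be drawn as a weighted graph. A low overlap by itself proves
nothing, of course, since big groups and prolific posters will dominate
the raw counts; that is where the careful statistics come in.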

I believe this might produce some interesting results, though one
would have to be *very* careful with one's statistics. It might
even be possible to infer something about the relationship between
different ideologies and ideas in the "real" world outside the Net.
[ The advantage of USENET over other media here is that (a) the data is
available in standard electronic format already, (b) one can get stats
on participants (not on readers) and (c) the newsgroup namespace is
(at least formally) uniform - it makes more sense to try and compare
alt.sex.stories and comp.sys.mac.misc than it does to compare Playboy
and MacWorld! ]

Now all I need is a research grant that would provide me with a
couple of optical jukeboxes, a few gigabytes of disk space and lots
of processing power. (A$40K would probably do it :-) But I reckon
I could do a scaled-down version of this on my 486, with its single
CD-ROM drive and 500 MB disk. (And I intend to, when I find the time.)

Danny Yee (danny@cs.su.oz.au)