Re: [Gossip] More Missing messages OR ....Does everything have to be done at once?

Earl Hood Sat, 22 Jun 2002 14:08:03 -0700

On June 22, 2002 at 15:48, "William J. Kammerer" wrote:

> As a side issue, are you beginning to wonder whether MHonArc is really
> suited for a massively scalable system of mail archiving such as yours?
> It seems to me that over 99% of all e-mail received by the Mail Archive
> will never be looked at within the first few weeks of their posting.  Is
> too much disk capacity and processing power taken up in immediately
> doing the Email-to-HTML conversion, along with the concomitant
> generation of index pages?


I believe the MHonArc limitations are no longer a factor since Jeff
configured things a while back to have MHonArc deal only with the last X
messages of an archive (I do not know offhand what Jeff
set X to).  Older messages are dropped from the navigational
index but are still left on the file system for search-based retrieval.
Theoretically, you could still browse old threads at the message page
level, but the indexes will only show the last X messages of an archive.
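
To illustrate the idea of a bounded index, here is a minimal Python sketch.
It is not how MHonArc or mail-archive.com implements it; the function name
and the max_entries parameter are hypothetical, and the real value of X is
unknown to me.

```python
from collections import deque


def build_index(message_ids, max_entries):
    """Return the newest max_entries message ids, oldest first.

    message_ids is assumed to be in arrival order. Ids that fall out of
    the window are only dropped from the index; nothing here touches the
    message pages stored on disk.
    """
    # A deque with maxlen discards the oldest entry on each append once full.
    window = deque(maxlen=max_entries)
    for msg_id in message_ids:
        window.append(msg_id)
    return list(window)
```

With four messages and a window of two, only the two newest ids remain
in the index, while all four message pages would stay on disk.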

Hence, the resources of concern appear to be mainly disk space for
the old messages and for search index updates. Jeff, do you have any
stats on htdig indexing?  Have you thought about doing some data
compression?

From Jeff's previous status messages to this list, it appears that
disk space has been the main problem as mail-archive grows and is,
I believe, the reason why messages are not showing up currently.

This problem does revisit the following issue: Should there be
message expiration at mail-archive.com?  

> Perhaps mail should just be queued away and indexed as it's received - a
> much less consumptive process.  And only if one actually were to call up
> a mailing list archive - or particular message - would the appropriate
> HTML be generated, "just-in-time," using ASP or something like that.

You are trading one set of resource consumption for another.
For high-traffic sites, it is always better to have as much work done
"offline", at publishing time, than at delivery time.  Delivery
is much more sensitive to time-based performance concerns, and doing the
work then increases the software complexity and possibly the
administration (I know this is a major concern for Jeff).  On-the-fly
conversion can be very, very costly for complex and/or large messages.
You would then need a caching system and an auto expiration of data from
the cache if you want to minimize disk space (otherwise, if the HTML were
to be kept around indefinitely, why not just pre-convert as it is done now).
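
As a rough picture of what such a cache with auto expiration involves,
here is a minimal Python sketch.  Everything in it is an assumption for
illustration: the class name, the ttl_seconds parameter, and the trivial
convert() method that stands in for the real (expensive) mail-to-HTML
conversion.

```python
import html
import time


class ExpiringHtmlCache:
    """Convert messages on demand and forget the HTML after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._entries = {}          # msg_id -> (expires_at, html_page)
        self.conversions = 0        # counts how often the costly step ran

    def convert(self, raw_text):
        # Hypothetical stand-in for the real conversion work.
        self.conversions += 1
        return "<pre>" + html.escape(raw_text) + "</pre>"

    def get(self, msg_id, raw_text):
        now = self.clock()
        entry = self._entries.get(msg_id)
        if entry is not None and entry[0] > now:
            return entry[1]         # still fresh: no conversion needed
        page = self.convert(raw_text)
        self._entries[msg_id] = (now + self.ttl, page)
        return page

    def purge_expired(self):
        """Drop stale entries to reclaim space; returns how many were removed."""
        now = self.clock()
        stale = [key for key, (expires_at, _) in self._entries.items()
                 if expires_at <= now]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

Note that the raw message still has to be stored, and every cache miss
pays the full conversion cost while a reader is waiting.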

BTW, you would also have to deal with critical area problems.  For example,
if multiple requests are made for the same message, do all of them block
waiting for the message to be converted by the first request?  I.e., some
form of synchronization is required (which is bad for high-traffic sites).
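
One possible shape for that synchronization, sketched in Python under the
assumption of a threaded server: the first request for a message does the
conversion while later requests for the same message wait on a
per-message lock.  The class and method names are hypothetical, and the
convert callable is supplied by the caller.

```python
import threading


class SingleFlightConverter:
    """Ensure each message is converted once, even under concurrent requests."""

    def __init__(self, convert):
        self._convert = convert
        self._guard = threading.Lock()  # protects the two dicts below
        self._locks = {}                # msg_id -> per-message lock
        self._pages = {}                # msg_id -> converted HTML

    def get(self, msg_id, raw_text):
        with self._guard:
            if msg_id in self._pages:
                return self._pages[msg_id]
            lock = self._locks.setdefault(msg_id, threading.Lock())
        # Requests for the same message queue up here; requests for other
        # messages use different locks and are not held up.
        with lock:
            with self._guard:
                if msg_id in self._pages:
                    # Another request finished the conversion while we waited.
                    return self._pages[msg_id]
            page = self._convert(raw_text)
            with self._guard:
                self._pages[msg_id] = page
            return page
```

Even in this small form, the waiting requests tie up threads for the
duration of the conversion, and the per-message locks are never cleaned
up here, which is the kind of added complexity I mean.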

Either way, you have not really solved the disk space problem, and
the amount of time by which you delay hitting disk capacity with a
just-in-time model versus the static model may not be that much (it
depends heavily on the usage model).

What mail-archive.com is experiencing now is the limit of the resources
that a single individual can provide.  It appears mail-archive has
grown bigger than Jeff ever thought it would.  Since there are
several open source projects that utilize the service, it would be
nice if some contribution, such as resources, were provided to
mail-archive to avoid problems like the current situation.

--ewh

_______________________________________________
Gossip mailing list
[EMAIL PROTECTED]
http://jab.org/cgi-bin/mailman/listinfo/gossip
