Dear Torsten,

I accept all the replies (I would be crazy not to.. :)). But I would
like to remind you again of the possibilities of an SQL-based database.
Building on the engine of a database such as Oracle or Postgres (as said
previously), it allows creating an RDBMS that stores all the data in a
single clustered database, which, as you can easily imagine, would give
a university, and even more so a large company site, the capability to
create a complete index of all its documents.
Last month I ran my crawler at my university to count the documents
present and produce some statistics. There were about 20,000,000 HTML
and hypertext documents across all the servers (which makes me assume
there are many more, because a great deal of the unreached ones were in
.ps, .pdf, or other formats that cannot be followed).
I have not tested ht://Dig on indexing this great amount of data, but
everything leads me to think that BerkeleyDB is not the appropriate way
to do it.
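
To make the single-database idea a bit more concrete, here is a minimal
sketch in Python. It is only an illustration under assumptions: SQLite
stands in for a server database such as Postgres or Oracle, and the table
names, column names and helper functions are hypothetical, not anything
ht://Dig defines.

```python
import sqlite3

# SQLite (in memory) stands in for a server database such as Postgres or
# Oracle. The schema below is hypothetical, not ht://Dig's own.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (
        id      INTEGER PRIMARY KEY,
        url     TEXT NOT NULL UNIQUE,
        title   TEXT,
        excerpt TEXT
    );
    CREATE TABLE word (
        word        TEXT NOT NULL,
        doc_id      INTEGER NOT NULL REFERENCES document(id),
        occurrences INTEGER NOT NULL,
        PRIMARY KEY (word, doc_id)
    );
""")


def add_document(url, title, text):
    """Store one document and its per-word occurrence counts."""
    cur = conn.execute(
        "INSERT INTO document (url, title, excerpt) VALUES (?, ?, ?)",
        (url, title, text[:200]),  # keep only a short excerpt
    )
    doc_id = cur.lastrowid
    counts = {}
    for w in text.lower().split():
        counts[w] = counts.get(w, 0) + 1
    conn.executemany(
        "INSERT INTO word (word, doc_id, occurrences) VALUES (?, ?, ?)",
        [(w, doc_id, c) for w, c in counts.items()],
    )
    return doc_id


def search(term):
    """Return URLs of documents containing term, most occurrences first."""
    rows = conn.execute(
        "SELECT d.url FROM word w JOIN document d ON d.id = w.doc_id "
        "WHERE w.word = ? ORDER BY w.occurrences DESC, d.url",
        (term.lower(),),
    )
    return [r[0] for r in rows]


add_document("http://example.edu/a.html", "A", "database index database")
add_document("http://example.edu/b.html", "B", "index of documents")
print(search("database"))
print(search("index"))
```

The point is only that one SQL engine holds both the document list and the
word index, so any machine that can reach the database can query the whole
collection. How a particular engine copes with long, variable-length
excerpts is a separate question that this sketch does not answer.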

Another way could be to parallelize the storing and search routines: put,
for example, 50 BerkeleyDBs on 50 different machines, clustered as a
Beowulf system, which would be queried via rsh at search time...
The problem that arises here is: htdig creates separate databases that
htmerge merges, eliminating identical documents... this is a good
process, but, as said, it makes the parallelization unfeasible...
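
One possible way around this, sketched below in Python, would be to choose
the machine for each document from a hash of its content, so that identical
documents always land on the same machine and can be eliminated there
without a global merge. This is a different technique from what htdig and
htmerge do today, and it is only a sketch under assumptions: plain
dictionaries stand in for the BerkeleyDB files on the 50 machines, the loop
over shards stands in for the rsh calls, and all names are hypothetical.

```python
import hashlib

NUM_SHARDS = 50  # one shard per machine, as in the example above


def shard_for(content: bytes) -> int:
    """Pick a shard from the content hash: identical content, same shard."""
    digest = hashlib.sha256(content).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class Shard:
    """Stand-in for one machine's local database (hypothetical)."""

    def __init__(self):
        self.docs = {}  # content digest -> (url, content)

    def add(self, url: str, content: bytes) -> bool:
        """Store the document unless identical content is already here."""
        key = hashlib.sha256(content).hexdigest()
        if key in self.docs:
            return False  # duplicate eliminated locally
        self.docs[key] = (url, content)
        return True

    def search(self, term: bytes):
        return [url for url, content in self.docs.values() if term in content]


shards = [Shard() for _ in range(NUM_SHARDS)]


def index(url: str, content: bytes) -> bool:
    """Route a document to its shard; returns False for a duplicate."""
    return shards[shard_for(content)].add(url, content)


def search(term: bytes):
    """Ask every shard and combine the answers (the rsh step, in spirit)."""
    hits = []
    for shard in shards:
        hits.extend(shard.search(term))
    return sorted(hits)


print(index("http://a.example/1", b"hello world"))
print(index("http://b.example/copy", b"hello world"))
print(index("http://a.example/2", b"other hello"))
print(search(b"hello"))
```

The cost of this choice is that the crawler has to fetch a document before
it knows which machine will store it, so it only parallelizes the storing
and searching, not the crawling itself.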

Let's talk about it...

Ciao
tomi


----- Original Message -----
From: Torsten Neuer <[EMAIL PROTECTED]>
To: tomi <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, June 28, 2000 10:23 AM
Subject: Re: [htdig3-dev] Creating a SQL backend...


> > tomi wrote:
> >
> > Hi guys,
> >
> > This topic (creating a SQL backend) has been added to the TODO list. I
> > studied a little the problem and I had a little discussion on it with
> > Geoff Hutchinson. It is a great step forward creating databases
> > directly in SQL based formats:
> > 1) it is much more reliable;
>
> Why?  You make a statement without proving it.  Any stable database
> system can be said to be equally reliable to another one.  However,
> since SQL systems allow the database to reside on *remote* machines and
> thus require the backend to have a net connection to the SQL host, it
> might even be *less* reliable than the approach used by Ht://Dig.
>
>
> > 2) considering many cases a webmaster or a system engineer could face
> > (very huge databases, clustering, RDBMS, etc.), SQL is the best way to
> > overcome any problem, thanks to database engines such as Postgres
> > or mySQL (just to cite only GNU projects).
>
> SQL is surely a more general approach, favours bigger databases and
> distributed processing of queries.  It also allows for concurrent
> updates of the search engine databases more easily than BerkeleyDB.  But
> AFAIK neither PostgreSQL nor mySQL is a GNU project (GNU is running its
> own database project which has its goal set on implementing a SQL-92
> database system.  I have not seen a usable distribution of it, however).
>
> > Creating an SQL database is not difficult:
> > in fact, if we want to create, for example, the table employee, we
> > simply use the following statement:
> >
> [...]
> >
> > Further details on inserting and modifying data will follow if you
> > require them. This e-mail was only to demonstrate the ease of creating
> > SQL tables and databases.
>
> You used a rather trivial example here.  One of the main problems of the
> Ht://Dig search engine with regards to an SQL backend implementation is
> the use of document excerpts which in nearly all cases are too large for
> many popular SQL database engines.  Not only are those excerpts pretty
> large, but they are of a dynamic length.  In order to make such a
> document database portable to all SQL engines, you will probably need to
> store each document in a single file in a document-db directory, causing
> lookups to be extremely slow for larger databases - let alone that you
> (again) have to stick with single host databases and cannot truly utilize
> the power of the SQL engine unless you also use networked file systems
> for this document-db directory.
>
>
> just my 2 cc,
>
>   Torsten
>
> --
> InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
> Waldhofstraße 14                            Tel: +49-4101-403605
> D-25474 Ellerbek                            Fax: +49-4101-403606
> E-Mail: [EMAIL PROTECTED]            Internet: http://www.inwise.de
>
> ------------------------------------
> To unsubscribe from the htdig3-dev mailing list, send a message to
> [EMAIL PROTECTED]
> You will receive a message to confirm this.
>
>

