Thanks for the fast answer,

If I understand correctly, B3S is the Billion Triple Challenge
dataset, correct? Located here http://challenge.semanticweb.org/ . It
is stated to contain 1.14 billion statements.

The current number of statements in the GenBank N3 dump is
6,561,103,030 triples.
For RefSeq it's 3,299,862,816 triples.

So it's roughly 3 to 6 times bigger. But the most problematic part is
the size of some literals, because a single literal can hold a
complete genome sequence (hundreds of kilobytes to megabytes).
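
For reference, here is the arithmetic behind that "3 to 6 times"
estimate as a small Python sketch. It only uses the triple counts
quoted above and the stated 1.14 billion figure; the variable names
are mine. The ratios come out at about 2.9 for RefSeq and 5.8 for
GenBank.

```python
# Triple counts quoted above; the challenge figure is the stated 1.14 billion.
BTC_TRIPLES = 1_140_000_000
GENBANK_TRIPLES = 6_561_103_030
REFSEQ_TRIPLES = 3_299_862_816

# How many times larger each dump is than the challenge dataset.
genbank_ratio = GENBANK_TRIPLES / BTC_TRIPLES
refseq_ratio = REFSEQ_TRIPLES / BTC_TRIPLES

print(f"GenBank is {genbank_ratio:.1f}x the challenge dataset")
print(f"RefSeq is {refseq_ratio:.1f}x the challenge dataset")
```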

The compression ratio of the N3 files with gzip is 1:10. The current
size of virtuoso.db looks like the UNcompressed size of what I've
succeeded in loading so far. So I'm worried I will need 1 terabyte of
free space to build this triplestore. The point of compression would
be to store gigabytes of data in a smaller space, data that won't be
used in indexing, searching or anything else except to be fetched
when needed. I'm using Virtuoso 6 TP2.
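
To illustrate what I mean by compressing only the literal, here is a
minimal Python sketch. It is my own illustration and not something
Virtuoso provides: it uses Python's zlib on a synthetic random
nucleotide string, so the ratio it gives is only indicative and real
sequences will compress differently.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic 500 kB "genome" literal (assumption: real data will differ).
sequence = "".join(random.choice("ACGT") for _ in range(500_000)).encode("ascii")

# Compress once at load time; the compressed bytes are what would be stored.
compressed = zlib.compress(sequence, 9)
ratio = len(sequence) / len(compressed)

# The literal is only decompressed when it is actually fetched.
restored = zlib.decompress(compressed)

print(f"{len(sequence)} bytes -> {len(compressed)} bytes ({ratio:.1f}:1)")
```

The stored value is never searched or indexed in this scheme, only
decompressed on retrieval, which is the use case I have in mind.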

I've set my logging to /dev/null because the server didn't want to
start without something on the TransactionFile line.

----From virtuoso web site------------
 TransactionFile=virtuoso.trx

This is the transaction log file. If this parameter is omitted, which
should never be the case in practice, the database will run without
log, meaning that it cannot recover transactions committed since last
checkpoint if it should abnormally terminate. There is always a single
log file for one server. The file grows as transactions get committed
until a checkpoint is reached at which time the transactions are
applied to the database file and the trx file is reclaimed, unless
CheckpointAuditTrail is enabled.
------------------------------------------------------

So my line is currently
TransactionFile = /dev/null

It works well and actually increases the speed, but I'm not sure it's
completely OK to do it this way. Is the server supposed to start
without any problem when this line is missing?

As for striping, I'm currently loading onto a disk array of 5
physical hard drives that is exposed as a single logical disk. So I
don't think I will gain anything from striping. From the Virtuoso web
site: "Striping only has a potential performance benefit when striped
across multiple devices". Do you think I should try striping anyway?

Thanks,

Marc-Alexandre Nolin

2009/10/1 Ivan Mikhailov <imikhai...@openlinksw.com>:
>> Does somebody ever load something has large has a complete N3 version
>> of GenBank or Refseq into a single Virtuoso Triplestore? I'm using a
>> the ttlp_mt program has mentioned on how to load Bio2RDF data, but I
>> call it from a Perl script.
>
> We've loaded B3S dataset plus some extra data without performance
> troubles, so I see no reason to worry in your case.
>
>>  Is there a way to enable compression of the object part of the triple if 
>> its a literal because
>> I've plenty of sequence in these dump that take a lot of space when
>> not compressed?
>
> There's no special compression of objects, but version 6 compacts
> database pages so there's little reason for extra compression anyway.
>
>>  Also, is there a faster way to load than ttlp_mt for
>> N3 because the load is slowing down?
>
> While the database is in personal use and consists only data from outer
> sources, you may wish to disable logging while loading. More important,
> set appropriate number of buffers and, if possible, make the database
> striped. Refer to http://docs.openlinksw.com/virtuoso/dbadm.html, esp.
> to section 6.1.9.1.15. [Striping] for more details.
>
> Best Regards,
>
> Ivan Mikhailov
> OpenLink Software
> http://virtuoso.openlinksw.com
>
>
>
