Thanks for the fast answer. If I understand correctly, B3S is the Billion Triple Challenge dataset, located at http://challenge.semanticweb.org/, correct? It is stated there that it contains 1.14 billion statements.

The current number of statements for the GenBank N3 dump is 6,561,103,030 triples. For RefSeq it is 3,299,862,816 triples. So they are roughly 3 to 6 times bigger than B3S. But the most problematic part is the size of some literals, because a single literal can hold a complete genome sequence (hundreds of kilobytes up to the megabyte range). The compression ratio of the N3 files with gzip is 1:10. The current size of virtuoso.db looks like the uncompressed size of what I have managed to load into it so far. So I'm worried that I will need 1 terabyte of free space to build this triplestore. The point of compression would be to store gigabytes of data in a smaller space, data that won't be used for indexing, searching or anything else except being fetched when needed.

I'm using Virtuoso 6 TP2. I've set my logging to /dev/null because the server would not start without something on the TransactionFile line.

----From Virtuoso web site------------
TransactionFile=virtuoso.trx
This is the transaction log file. If this parameter is omitted, which should never be the case in practice, the database will run without log, meaning that it cannot recover transactions committed since last checkpoint if it should abnormally terminate. There is always a single log file for one server. The file grows as transactions get committed until a checkpoint is reached at which time the transactions are applied to the database file and the trx file is reclaimed, unless CheckpointAuditTrail is enabled.
------------------------------------------------------

So my line is currently

TransactionFile = /dev/null

It works well and actually increases the speed, but I'm not sure it is completely OK to do it this way. Is the server supposed to start without any problem when this line is missing?

As for striping, I'm currently loading onto a disk array of 5 physical hard drives that is presented as a single logical disk. So I don't think I will gain anything from striping.

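Coming back to the size question for a moment, here is a rough check of the figures quoted in this message, as a small Python sketch. The triple counts and the 1:10 gzip ratio are the ones given above. The function names are only illustrative, and the 100 GB input in the last line is a made-up example value, not a measured size of the dumps.

```python
# Triple counts quoted in this message.
B3S_TRIPLES = 1_140_000_000        # Billion Triple Challenge: "1.14 billion statements"
GENBANK_TRIPLES = 6_561_103_030    # GenBank N3 dump
REFSEQ_TRIPLES = 3_299_862_816     # RefSeq N3 dump

GZIP_RATIO = 10  # the N3 files compress about 1:10 with gzip


def size_ratio(triples: int, baseline: int = B3S_TRIPLES) -> float:
    """How many times larger a dump is than the baseline, by triple count."""
    return triples / baseline


def uncompressed_gb(compressed_gb: float, ratio: float = GZIP_RATIO) -> float:
    """Estimate the uncompressed size from the gzipped size and the ratio."""
    return compressed_gb * ratio


if __name__ == "__main__":
    print(f"GenBank vs B3S: {size_ratio(GENBANK_TRIPLES):.1f}x")
    print(f"RefSeq  vs B3S: {size_ratio(REFSEQ_TRIPLES):.1f}x")
    # Hypothetical input: 100 GB of gzipped N3 (not a measured figure).
    print(f"100 GB gzipped -> about {uncompressed_gb(100):.0f} GB uncompressed")
```

This only compares triple counts, so it says nothing about the very large literals, which are the part that worries me most.
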
On striping, the Virtuoso web site says: "Striping only has a potential performance benefit when striped across multiple devices". Do you think I should try striping anyway?

Thanks,

Marc-Alexandre Nolin

2009/10/1 Ivan Mikhailov <imikhai...@openlinksw.com>:
>> Has anybody ever loaded something as large as a complete N3 version
>> of GenBank or RefSeq into a single Virtuoso triplestore? I'm using
>> the ttlp_mt program as mentioned in how to load Bio2RDF data, but I
>> call it from a Perl script.
>
> We've loaded B3S dataset plus some extra data without performance
> troubles, so I see no reason to worry in your case.
>
>> Is there a way to enable compression of the object part of the triple if
>> it's a literal, because
>> I have plenty of sequences in these dumps that take a lot of space when
>> not compressed?
>
> There's no special compression of objects, but version 6 compacts
> database pages so there's little reason for extra compression anyway.
>
>> Also, is there a faster way to load than ttlp_mt for
>> N3, because the load is slowing down?
>
> While the database is in personal use and consists only of data from outer
> sources, you may wish to disable logging while loading. More important,
> set appropriate number of buffers and, if possible, make the database
> striped. Refer to http://docs.openlinksw.com/virtuoso/dbadm.html, esp.
> to section 6.1.9.1.15. [Striping] for more details.
>
> Best Regards,
>
> Ivan Mikhailov
> OpenLink Software
> http://virtuoso.openlinksw.com