> If I understand correctly, B3S is the Billion Triple Challenge
> dataset, correct? Located here http://challenge.semanticweb.org/ . It
> is stated that it has 1.14 billion statements.

Yes, B3S was 1.14G triples, but we have some additional data (still much
less than your 10G).

> The current number of statements for the Genbank N3 dump is
> 6,561,103,030 triples.
> For Refseq it's 3,299,862,816 triples.
> 
> So it's between 3 and 6 times bigger. But the most problematic part is
> the size of some literals, because they can be a complete genome
> sequence in a single literal (hundreds of kilobytes to a megabyte).

I agree, B3S and other LOD resources have only relatively short
literals.
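
As a quick sanity check of the "3 to 6 times" figure quoted above, here is a
minimal Python sketch that divides the two triple counts by the 1.14G figure
for B3S. The variable names are only illustrative, and the B3S count is the
rounded 1.14 billion, not an exact number.

```python
# Triple counts as quoted in this thread.
B3S_TRIPLES = 1_140_000_000  # rounded figure, not an exact count
DUMPS = {
    "Genbank": 6_561_103_030,
    "Refseq": 3_299_862_816,
}

# Ratio of each dump to B3S.
ratios = {name: count / B3S_TRIPLES for name, count in DUMPS.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x B3S")
```

So Refseq is a bit under 3 times and Genbank a bit under 6 times the size of
B3S, counting triples only and ignoring literal sizes.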

> The compression ratio of the N3 files with gzip is 1:10. The current
> size of virtuoso.db looks like the uncompressed size of what I've
> managed to load into it so far. So I'm worried I will need 1
> terabyte of free space to build this triplestore. The point of
> compression is to be able to store gigabytes of data in a smaller
> space; that data won't be used in indexing, searching or anything
> except being fetched when needed. I'm using Virtuoso 6 TP2.

Virtuoso 6 compacts pages to as little as 50% of their original size. But
you usually have _two_ copies of each page in the database: one in its
normal place and one remapped. That is used to recover the database after
a crash: if a page has changed since the last checkpoint, then the database
contains the state of the page at checkpoint time, and the log replay
restores all changes that were made to it after the checkpoint.
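
The recovery idea can be sketched with a toy model in Python. This is only
an illustration of "checkpoint image plus log replay", not Virtuoso's actual
page format or code; the class and method names are made up for the example.

```python
class ToyPageStore:
    """Toy model: checkpoint images plus a log of later page writes."""

    def __init__(self):
        self.checkpointed = {}  # page id -> content as of the last checkpoint
        self.current = {}       # page id -> latest content (lost in a crash)
        self.log = []           # (page id, content) writes since checkpoint

    def write(self, page_id, content):
        # Every change goes to the log first, then to the live page.
        self.log.append((page_id, content))
        self.current[page_id] = content

    def checkpoint(self):
        # Freeze the current state and start a fresh log.
        self.checkpointed = dict(self.current)
        self.log.clear()

    def recover(self):
        # After a crash: start from the checkpoint state, replay the log.
        pages = dict(self.checkpointed)
        for page_id, content in self.log:
            pages[page_id] = content
        return pages


store = ToyPageStore()
store.write(1, "a")
store.checkpoint()
store.write(1, "b")   # page 1 changed after the checkpoint
store.write(2, "c")   # page 2 created after the checkpoint
recovered = store.recover()
```

In this model both the checkpoint-time copy of page 1 and its newer version
exist at once, which is the reason for the extra space mentioned above.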

For objects of hundreds of kilobytes each, gz compression would be nice,
of course, but there's no such built-in feature in RDF storage (even
though there are ready-to-use gz_compress and gz_uncompress functions in
Virtuoso/PL). I'll think about it.
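
To get a feeling for what per-literal compression would involve, here is a
small Python sketch that compresses one large sequence literal and restores
it. Python's zlib module (the same DEFLATE algorithm gzip uses) stands in for
the Virtuoso/PL functions, whose exact behaviour is not shown here. The
helper names and the random 500,000-character test sequence are assumptions
made for the example, not real Genbank data.

```python
import random
import zlib


def compress_literal(text: str) -> bytes:
    """Compress a literal's UTF-8 bytes at the highest zlib level."""
    return zlib.compress(text.encode("utf-8"), 9)


def decompress_literal(blob: bytes) -> str:
    """Restore the original literal from its compressed form."""
    return zlib.decompress(blob).decode("utf-8")


# Hypothetical stand-in for a genome sequence stored as a single literal.
random.seed(0)
sequence = "".join(random.choice("ACGT") for _ in range(500_000))

blob = compress_literal(sequence)
ratio = len(sequence) / len(blob)
print(f"{len(sequence)} bytes -> {len(blob)} bytes (ratio {ratio:.1f}:1)")
```

A random four-letter sequence carries about 2 bits per character, so it
cannot shrink below roughly a quarter of its size; real N3 text with
repeated URIs may compress better, as the 1:10 figure quoted above suggests.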

> I've set my logging to /dev/null because the server didn't want to
> start without something on the TransactionFile line.

Strange, I'll check that but not right now, I'm afraid.

> As for striping, I'm currently loading onto a disk array of 5
> physical hard drives exposed as a single logical disk. So I don't think
> I will gain anything from striping. From the Virtuoso web site: "Striping
> only has a potential performance benefit when striped across multiple
> devices". Do you think I should try to stripe anyway?

No, you should have one IO queue per independently controlled hard drive.
With 5 physical hard drives mounted individually, you would be able to
stripe across all 5 and thus get somewhat better performance than the OS
can provide, but the difference is small if you have plenty of RAM and the
OS disk buffers are hundreds of megabytes to gigabytes. So just keep
running. The FS type plays some role (xfs is preferable) but in most cases
you may ignore the difference, especially if the disks are far from full.


