Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

Hugh Williams Mon, 01 Sep 2014 07:52:29 -0700

Best Regards
Hugh Williams
Professional Services
OpenLink Software, Inc.      //              http://www.openlinksw.com/
Weblog   -- http://www.openlinksw.com/blogs/
LinkedIn -- http://www.linkedin.com/company/openlink-software/
Twitter  -- http://twitter.com/OpenLink
Google+  -- http://plus.google.com/100570109519069333827/
Facebook -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers

On 25 Aug 2014, at 14:33, Jörn Hees <j_h...@cs.uni-kl.de> wrote:

> Hi again,
> 
> On 22 Aug 2014, at 17:44, Jörn Hees <j_h...@cs.uni-kl.de> wrote:
> 
>> On 22 Aug 2014, at 17:51, Hugh Williams <hwilli...@openlinksw.com> wrote:
>> 
>>> What I would not expect though is for the memory consumption to continue to 
>>> increase until the server is killed due to an OOM error, which would imply a 
>>> possible memory leak. That is why I recommend building from the develop/7 
>>> branch, where there have been improvements in memory management.
>> 
>> I currently just used the stable 7.1.0 release. I'll try with the dev build 
>> again and report back...
> 
> So i'm running the import on a fresh dev build since my last email and i'm 
> now at a total memory consumption of 31218/32177 MB (buffers: 15 MB, cache: 
> remaining ~700 MB).
> 
> The Virtuoso process has allocated 31.5 GB (VIRT), 30.1 GB (RES) and 3.812 MB 
> (SHR) Memory.
> 
> I'm not sure if i really have to run the importer till it's killed for out of 
> memory (as i said it becomes pretty slow after a while and is currently only 
> seeking around with 200 KB/s) or if this is enough already. As 
> NumberOfBuffers is set to 2720000 as recommended i guess that anything above 
> 21 GB is suspicious... we're at > 31 GB now.
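
The 21 GB figure above can be sanity-checked with a little arithmetic. This is 
a minimal sketch, assuming each buffer holds one 8 KB database page and 
ignoring any per-buffer bookkeeping overhead; the function name is made up 
for illustration and is not part of Virtuoso:

```python
# Rough size of the buffer pool implied by NumberOfBuffers.
# Assumption: one buffer = one 8 KB page, no per-buffer overhead counted.
PAGE_SIZE_BYTES = 8 * 1024


def buffer_pool_gib(number_of_buffers, page_size=PAGE_SIZE_BYTES):
    """Return the approximate buffer pool size in GiB."""
    return number_of_buffers * page_size / 2**30


pool_gib = buffer_pool_gib(2_720_000)  # value quoted in the message above
print(f"NumberOfBuffers=2720000 -> about {pool_gib:.2f} GiB of page buffers")
```

Under these assumptions the pool alone accounts for roughly 21 GB, so a 
resident size of about 30 GB leaves several GB that the buffer setting does 
not explain.
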
> 
> 
> I've also split up the input file into 100M line chunks so that i can track 
> the progress a bit better...
> 14 of these are completely loaded now, so 1.4 G triples, the 15th is 
> currently running.
> These are the start times as reported in DB.DBA.LOAD_LIST. I added a column 
> for loaded triples (not necessarily unique):
> 2014.8.22 19:59 0
> 2014.8.22 20:09 100M
> 2014.8.22 20:22 200M
> 2014.8.22 20:39 300M
> 2014.8.22 20:53 400M
> 2014.8.22 21:11 500M
> 2014.8.22 21:31 600M
> 2014.8.22 22:03 700M
> 2014.8.22 22:39 800M
> 2014.8.22 23:32 900M
> 2014.8.23 00:17 1G
> 2014.8.23 02:47 1.1G
> 2014.8.23 08:51 1.2G
> 2014.8.23 18:02 1.3G
> 2014.8.24 16:16 1.4G
> 
> The import times for 100M triples seem to be roughly about:
> - 10 minutes initially
> - 30 minutes after 600M loaded triples
> - 45 minutes after 900M triples
> - 2h:30 after 1G triples (I'm guessing that this is when the set Memory-Limit 
> is hit)
> - 6h after 1.1G triples
> - 10h after 1.2G triples
> - 22h after 1.3G triples
> - >22h after 1.4G triples
> 
> The last 4 lines sadly don't give me the impression that this scales nearly 
> linearly after virtuoso runs out of fast random access memory and has to rely 
> on block storage :-/ Is there maybe a setting which allows virtuoso to fall 
> back to a merge-sort like approach like creating sorted temp dbs and then 
> merging them bottom up? Wouldn't this scale way beyond the available RAM 
> sizes and not cause the seek&wait pattern i observe?!?
> 
> 
> Anything else i can do to help to debug this? Can i stop the import?
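
The per-chunk load times quoted above can be recomputed from the listed start 
times. The following is only a sketch: it assumes the chunks were loaded one 
after another, so that the gap between two consecutive start times is the 
load time of the earlier chunk. The helper name is hypothetical.

```python
from datetime import datetime

# Start times as listed in the quoted message (one per 100M-line chunk).
START_TIMES = [
    "2014.8.22 19:59", "2014.8.22 20:09", "2014.8.22 20:22",
    "2014.8.22 20:39", "2014.8.22 20:53", "2014.8.22 21:11",
    "2014.8.22 21:31", "2014.8.22 22:03", "2014.8.22 22:39",
    "2014.8.22 23:32", "2014.8.23 00:17", "2014.8.23 02:47",
    "2014.8.23 08:51", "2014.8.23 18:02", "2014.8.24 16:16",
]
CHUNK_TRIPLES = 100_000_000


def chunk_minutes(start_times):
    """Minutes between consecutive start times (assumes sequential loading)."""
    parsed = [datetime.strptime(s, "%Y.%m.%d %H:%M") for s in start_times]
    return [int((b - a).total_seconds() // 60)
            for a, b in zip(parsed, parsed[1:])]


durations = chunk_minutes(START_TIMES)
for index, minutes in enumerate(durations, start=1):
    rate = CHUNK_TRIPLES / (minutes * 60)  # triples per second
    print(f"chunk {index:2d}: {minutes:5d} min, ~{rate:,.0f} triples/s")
```

The throughput column makes the drop easier to see than the raw timestamps: 
the first chunk and the chunk started at 1.3G differ by more than two orders 
of magnitude.
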

[Hugh] Did you let the load continue, or was it stopped? Development indicates 
that your suggestion is not without merit, but implementing it is not as simple 
as it may seem because the indexes are not all sequential; something like that 
could possibly be implemented. It is suggested that you try dropping the 
indexes on the RDF_QUAD table, loading the Freebase datasets, and then 
recreating the indexes after loading, which would require a smaller working set 
that would better fit into the 32 GB of RAM available. The commands for 
dropping the necessary indexes are:

        drop index rdf_quad_pogs;
        drop index rdf_quad_sp;
        drop index rdf_quad_op;
        drop index rdf_quad_gs;

and the respective indexes can then be recreated as detailed at:

        http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtRDFPerformanceTuning?#RDF%20Index%20Scheme

Note that, since this is v7, the indexes need to be recreated as column-wise 
indexes. Let us know how this works for you. Note that you can also use the 
ld_meter scripts we provide for monitoring Virtuoso bulk loader activity, as 
detailed at:

        http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksGuideLDMeterUtility

Also, how many "rdf_loader_run()" processes do you have running when 
performing the load? For v7 we typically recommend running 
(number of cores * 0.4) of them for best performance.
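
As a quick illustration of the cores * 0.4 rule of thumb, here is a minimal 
sketch. Rounding down and never going below one loader are assumptions made 
for this example, not part of the recommendation, and the function name is 
hypothetical:

```python
import os


def recommended_loaders(cores, factor=0.4):
    """Number of rdf_loader_run() sessions for a machine with `cores` cores.

    Assumption: round down, but always run at least one loader.
    """
    return max(1, int(cores * factor))


cores = os.cpu_count() or 1  # os.cpu_count() may return None
print(f"{cores} cores -> {recommended_loaders(cores)} rdf_loader_run() sessions")
```

For example, under these assumptions an 8-core machine would run 3 loaders 
and a 16-core machine would run 6.
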

Regards
Hugh

> 
> Cheers,
> Jörn
> 

