Hi Otis,

Yes, you are right: LuSql is heavily optimized for multi-threaded/multi-core use.
It also performs better with multiple threads even on a single core, because
Lucene indexing is heavily I/O-bound.

So if the DB is the bottleneck, then yes, LuSql or any other tool is not
going to help. Resolve the DB bottleneck first, and then decide which tool
best serves your indexing requirements.

Only slightly off topic: I have noticed one problem with DBs (with
LuSql and custom JDBC clients processing records) when the fetch size
is too large and the per-record processing is heavy: sometimes the
connection times out, because the accumulated processing delay makes
the gap between fetching one batch and the next too long. Reducing the
fetch size solves it. I am not sure whether Solr/DIH users have run
into this. LuSql allows setting the fetch size (as DIH does, I
believe), and an unreleased version re-issues the SQL with an offset
to the last+1 record when this happens.
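
For what it's worth, the fix is a one-line change on the JDBC statement. A
minimal sketch (plain JDBC; the URL, credentials and query below are
placeholders, and some drivers treat setFetchSize() only as a hint):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class FetchSizeExample {
      public static void main(String[] args) throws Exception {
          // Placeholder URL/credentials: substitute your own database here.
          Connection conn = DriverManager.getConnection(
                  "jdbc:mysql://localhost/mydb", "user", "pass");
          Statement stmt = conn.createStatement();

          // A smaller fetch size makes the driver go back to the DB for the
          // next batch more often, so the idle time on the connection stays
          // short even when each record takes a long time to process/index.
          stmt.setFetchSize(100);

          ResultSet rs = stmt.executeQuery("SELECT id, title FROM articles");
          while (rs.next()) {
              // ... expensive per-record processing / Lucene indexing here ...
          }
          rs.close();
          stmt.close();
          conn.close();
      }
  }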

-glen

2009/7/23 Otis Gospodnetic <otis_gospodne...@yahoo.com>:
> Note that the statement about LuSql (or really any other tool; LuSql is just
> an example because it was mentioned) is true only if Solr is underutilized,
> because DIH uses a single thread to talk to Solr (is this correct?) vs. LuSql
> using multiple (I'm guessing that's the case because of the multicore comment).
>
> But if the DB itself is your bottleneck, and I've seen plenty of such cases,
> then the speed of DIH vs. LuSql vs. something else matters less.  Glen, please
> correct me if I'm wrong about this - I know you have done plenty of
> benchmarking. :)
>
>  Otis
> --
> Sematext is hiring: http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: Glen Newton <glen.new...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Thursday, July 23, 2009 5:52:43 AM
>> Subject: Re: DataImportHandler / Import from DB : one data set comes in  
>> multiple rows
>>
>> Chantal,
>>
>> You might consider LuSql[1].
>> It has much better performance than Solr DIH. It runs 4-10 times faster on a
>> multicore machine, and can run in 1/20th the heap size Solr needs. It
>> produces a Lucene index.
>>
>> See slides 22-25 in this presentation comparing Solr DIH with LuSql:
>> http://code4lib.org/files/glen_newton_LuSql.pdf
>>
>> [1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
>>
>> Disclosure: I am the author of LuSql.
>>
>> Glen Newton
>> http://zzzoot.blogspot.com/
>>
>> 2009/7/22 Chantal Ackermann :
>> > Hi all,
>> >
>> > this is my first post, as I am new to SOLR (some Lucene exp).
>> >
>> > I am trying to load data from an existing datamart into SOLR using the
>> > DataImportHandler but in my opinion it is too slow due to the special
>> > structure of the datamart I have to use.
>> >
>> > Root Cause:
>> > This datamart uses a row-based (pivot) approach to present its data. It was
>> > done this way to allow adding more attributes to a data set without having to
>> > change the table structure.
>> >
>> > Impact:
>> > To use the DataImportHandler, I have to pivot the data to recreate one
>> > row per data set. Unfortunately, this results in more queries, and less
>> > performant ones. Moreover, there are sometimes multiple rows for a single
>> > attribute, which require separate queries - or trickier subselects that
>> > probably don't speed things up.
>> >
>> > Here is an example of the relation between DB requests, row fetches and
>> > actual number of documents created:
>> >
>> >
>> > Total Requests made to DataSource: 3737
>> > Total Rows Fetched: 5380
>> > Total Documents Skipped: 0
>> > Full Dump Started: 2009-07-22 18:19:06
>> > Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
>> > Committed: 2009-07-22 18:22:29
>> > Optimized: 2009-07-22 18:22:29
>> > Time taken: 0:3:22.484
>> >
>> >
>> > (Full index creation.)
>> > There are about half a million data sets in total. That would require about
>> > 30h for indexing? My feeling is that there are far too many row fetches per
>> > data set.
>> >
>> > I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
>> > around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge
>> > factor 10, ram buffer size 32).
>> >
>> > Possible solutions?
>> > A) Write my own DataImportHandler?
>> > B) Write my own "MultiRowTransformer" that accepts several rows as input
>> > argument (not sure this is a valid option)?
>> > C) Approach the DB developers to add a flat table with one data set per 
>> > row?
>> > D) ...?
>> >
>> > If someone would like to share their experiences, that would be great!
>> >
>> > Thanks a lot!
>> > Chantal
>> >
>> >
>> >
>> > --
>> > Chantal Ackermann
>> >
>>
>>
>>
>
>


