Forgot to add... sometimes really hammering at the SQL query in DIH can be
fruitful: can you write one huge, monster query that's faster than the
sub-queries?
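As a rough illustration of what that might look like, here is a minimal
data-config.xml sketch that collapses the nested entities into a single
JOIN; the table and column names are made up, so adapt them to your schema:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/yourdb" user="..." password="..."/>
  <document>
    <!-- One flat query with JOINs in place of nested child entities,
         so DIH makes a single pass over one result set instead of
         issuing a lookup query per row. -->
    <entity name="transaction" query="
        SELECT t.id, t.amount, v.name AS vendor, p.name AS product
        FROM transactions t
        JOIN vendors  v ON v.uid = t.vendor_uid
        JOIN products p ON p.uid = t.product_uid">
      <field column="id"      name="id"/>
      <field column="amount"  name="amount"/>
      <field column="vendor"  name="vendor"/>
      <field column="product" name="product"/>
    </entity>
  </document>
</dataConfig>

If a one-to-many child table makes the flat JOIN produce duplicate rows,
something like MySQL's GROUP_CONCAT (or a sub-select) can fold the children
back into a single multivalued field.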
I've also seen people run processes on the DB that move all the data into
a temporary place, making use of all the nifty stuff you can do there, and
then use DIH on _that_. Or on a view.

All that said, I generally prefer using SolrJ if DIH doesn't do the job
after a day or two of fiddling; it gives more control.

Good luck!
Erick

On Thu, May 26, 2016 at 11:02 AM, John Blythe <j...@curvolabs.com> wrote:
> oo gotcha. cool, will make sure to check it out and bounce any related
> questions through here.
>
> thanks!
>
> best,
>
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | j...@curvolabs.com
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713
>
> On Thu, May 26, 2016 at 1:45 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Solr commits aren't the issue, I'd guess. All the time is
>> probably being spent getting the data from MySQL.
>>
>> I've had some luck writing to Solr from a DB through a
>> SolrJ program; here's a place to get started:
>> searchhub.org/2012/02/14/indexing-with-solrj/
>> You can peel out the Tika bits pretty easily, I should think.
>>
>> One technique I've used is to cache some of the DB tables
>> in Java's memory to keep from having to do the secondary
>> lookup(s). This only really works if the "secondary table" is
>> small enough to fit in Java's memory, of course. You can do
>> some creative things with caching partial tables if you can
>> sort appropriately.
>>
>> Best,
>> Erick
>>
>> On Thu, May 26, 2016 at 9:01 AM, John Blythe <j...@curvolabs.com> wrote:
>> > hi all,
>> >
>> > i've got layered entities in my solr import. it's calling on some
>> > transactional data from a MySQL instance. there are two fields that
>> > are used to then look up other information from other tables via their
>> > related UIDs, one of which has its own child entity with yet another
>> > select statement to grab up more data.
>> >
>> > it fetches at about 120/s but processes at ~50-60/s. we currently only
>> > have close to 500k records, but it's growing quickly, and it's becoming
>> > increasingly painful to make modifications due to the reimport that
>> > then needs to occur.
>> >
>> > i feel like i'd seen some threads regarding commits of new data,
>> > master/slave, or solrcloud/sharding that could help in some ways
>> > related to this, but as of yet i can't scrounge them up with my
>> > searches (ironic :p).
>> >
>> > can someone help by pointing me to some good material related to this
>> > sort of thing?
>> >
>> > thanks-
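To make the SolrJ route Erick describes above concrete, here is a minimal
sketch combining the JDBC read, the in-memory cache for the small secondary
table, and batched adds with one commit at the end. The JDBC URL, core name,
and table/column names are hypothetical, and it assumes SolrJ 5.x, where
new HttpSolrClient(url) is still available:

// A minimal sketch of the SolrJ-plus-cache approach described above.
// The JDBC URL, core name, and table/column names are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DbIndexer {
  public static void main(String[] args) throws Exception {
    try (Connection db = DriverManager.getConnection(
             "jdbc:mysql://localhost/yourdb", "user", "pass");
         HttpSolrClient solr =
             new HttpSolrClient("http://localhost:8983/solr/yourcore")) {

      // Cache the small secondary table once up front, instead of
      // issuing one lookup query per row the way a DIH sub-entity does.
      Map<String, String> vendorNames = new HashMap<>();
      try (ResultSet rs = db.createStatement()
               .executeQuery("SELECT uid, name FROM vendors")) {
        while (rs.next()) vendorNames.put(rs.getString(1), rs.getString(2));
      }

      // Walk the main table and build documents, sending them in
      // batches and committing once at the very end.
      List<SolrInputDocument> batch = new ArrayList<>();
      try (ResultSet rs = db.createStatement().executeQuery(
               "SELECT id, amount, vendor_uid FROM transactions")) {
        while (rs.next()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", rs.getString("id"));
          doc.addField("amount", rs.getBigDecimal("amount"));
          doc.addField("vendor", vendorNames.get(rs.getString("vendor_uid")));
          batch.add(doc);
          if (batch.size() >= 1000) {
            solr.add(batch);
            batch.clear();
          }
        }
      }
      if (!batch.isEmpty()) solr.add(batch);
      solr.commit();
    }
  }
}

Sending documents in batches and committing once at the end avoids paying
a network round trip per document, which is usually where a naive SolrJ
loop loses its time.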