Having more carefully read Erick's post - I see that's essentially what he said, in a much more straightforward way.

I will also second Erick's suggestion of hammering on the SQL. We found that fruitful many times at the same gig. I develop but am not a SQL master, so in a similar situation I'll usually seek out a specialist to help me make sure the query isn't wasteful. It frequently was, and I learned a lot.
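To put a rough shape on that, here's a minimal SolrJ-style sketch of the approach Erick describes further down: one flat JOIN instead of DIH's per-row sub-selects, with documents sent to Solr in batches. Table names, columns, core name, and credentials are all invented for illustration - treat it as a starting point, not a drop-in implementation.

// Rough sketch only - table/column names, core name, credentials, and batch
// size are invented. The idea: one flat JOIN replaces the DIH nested
// entities, and documents go to Solr in batches from a small SolrJ program.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FlatJoinIndexer {

  public static void main(String[] args) throws Exception {
    // Push the joins down into MySQL once, instead of firing a sub-select
    // (or two) for every fetched row.
    String flatQuery =
        "SELECT t.id, t.amount, v.name AS vendor_name, p.description AS product_desc "
      + "FROM transactions t "
      + "JOIN vendors  v ON v.id = t.vendor_id "    // was a DIH sub-entity
      + "JOIN products p ON p.id = t.product_id";   // was the nested child entity

    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://localhost:3306/mydb", "user", "pass");
         Statement stmt = conn.createStatement();
         // SolrJ 5/6 style; newer releases use HttpSolrClient.Builder
         SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycollection")) {

      // MySQL Connector/J quirk: stream rows instead of buffering them all in memory
      stmt.setFetchSize(Integer.MIN_VALUE);

      List<SolrInputDocument> batch = new ArrayList<>();
      try (ResultSet rs = stmt.executeQuery(flatQuery)) {
        while (rs.next()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", rs.getString("id"));
          doc.addField("amount", rs.getDouble("amount"));
          doc.addField("vendor_name", rs.getString("vendor_name"));
          doc.addField("product_desc", rs.getString("product_desc"));
          batch.add(doc);

          if (batch.size() >= 1000) {   // send in chunks, not one doc at a time
            solr.add(batch);
            batch.clear();
          }
        }
      }
      if (!batch.isEmpty()) {
        solr.add(batch);
      }
      solr.commit();   // a single commit at the end, not per batch
    }
  }
}

Done this way the secondary selects disappear entirely, and batching and commit behavior are under your control rather than DIH's.
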
On Thu, May 26, 2016 at 12:31 PM, John Bickerstaff <j...@johnbickerstaff.com> wrote:

> It may or may not be helpful, but there's a similar class of problem that
> is frequently solved either by stored procedures or by running the query on
> a time-frame and storing the results... Doesn't matter if the end-point
> for the data is Solr or somewhere else.
>
> The problem is long running queries that are extremely complex and stress
> the database performance too heavily.
>
> The solution is to de-normalize the data you need... store it in that form
> and then the query gets really fast... sort of like a data warehouse type
> of thing. (Don't shoot, I know this isn't data warehousing...)
>
> Postgres even has something called an "automatically updateable view" that
> might serve - if that's your back end.
>
> Anyway - the underlying strategy is to find a way to flatten your data
> preparatory to turning it into solr documents by some means - either by
> getting it out on shorter-running queries all the time into some kind of
> store (Kafka, text file, whatever) or by using some feature of the database
> (stored procs writing to a summary table, automatically updatable view or
> similar).
>
> In this way, when you make your query, you make it against the "flattened"
> data - which is, ideally, all in one table - and then all the complexity of
> joins etc... is washed away and things ought to run pretty fast.
>
> The cost, of course, is a huge table with tons of duplicated data... Only
> you can say if that's worth it. I did this at my last gig and we truncated
> the table every 2 weeks to prevent it growing forever.
>
> In case it's helpful...
>
> PS - if you have the resources, a duplicate database can really help here
> too - again my experience is mostly with Postgres which allows a "warm"
> backup to be live. We frequently used this for executive queries that were
> using the database like a data warehouse because they were so
> time-consuming. It kept the load off production.
>
> On Thu, May 26, 2016 at 12:18 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Forgot to add... sometimes really hammering at the SQL query in DIH
>> can be fruitful, can you make a huge, monster query that's faster than
>> the sub-queries?
>>
>> I've also seen people run processes on the DB that move all the
>> data into a temporary place making use of all of the nifty stuff you
>> can do there and then use DIH on _that_. Or the view.
>>
>> All that said, I generally prefer using SolrJ if DIH doesn't do the job
>> after a day or two of fiddling, it gives more control.
>>
>> Good Luck!
>> Erick
>>
>> On Thu, May 26, 2016 at 11:02 AM, John Blythe <j...@curvolabs.com> wrote:
>> > oo gotcha. cool, will make sure to check it out and bounce any related
>> > questions through here.
>> >
>> > thanks!
>> >
>> > best,
>> >
>> >
>> > --
>> > *John Blythe*
>> > Product Manager & Lead Developer
>> >
>> > 251.605.3071 | j...@curvolabs.com
>> > www.curvolabs.com
>> >
>> > 58 Adams Ave
>> > Evansville, IN 47713
>> >
>> > On Thu, May 26, 2016 at 1:45 PM, Erick Erickson <erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Solr commits aren't the issue I'd guess. All the time is
>> >> probably being spent getting the data from MySQL.
>> >>
>> >> I've had some luck writing to Solr from a DB through a
>> >> SolrJ program, here's a place to get started:
>> >> searchhub.org/2012/02/14/indexing-with-solrj/
>> >> you can peel out the Tika bits pretty easily I should
>> >> think.
>> >>
>> >> One technique I've used is to cache
>> >> some of the DB tables in Java's memory to keep
>> >> from having to do the secondary lookup(s). This only
>> >> really works if the "secondary table" is small enough to fit in
>> >> Java's memory of course. You can do some creative
>> >> things with caching partial tables if you can sort appropriately.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, May 26, 2016 at 9:01 AM, John Blythe <j...@curvolabs.com> wrote:
>> >> > hi all,
>> >> >
>> >> > i've got layered entities in my solr import. it's calling on some
>> >> > transactional data from a MySQL instance. there are two fields that are
>> >> > used to then lookup other information from other tables via their related
>> >> > UIDs, one of which has its own child entity w yet another select statement
>> >> > to grab up more data.
>> >> >
>> >> > it fetches at about 120/s but processes at ~50-60/s. we currently only have
>> >> > close to 500k records, but it's growing quickly and thus is becoming
>> >> > increasingly painful to make modifications due to the reimport that needs
>> >> > to then occur.
>> >> >
>> >> > i feel like i'd seen some threads regarding commits of new data,
>> >> > master/slave, or solrcloud/sharding that could help in some ways related to
>> >> > this but as of yet can't scrounge them up w my searches (ironic :p).
>> >> >
>> >> > can someone help by pointing me to some good material related to this sort
>> >> > of thing?
>> >> >
>> >> > thanks-
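
One more rough sketch, of the table-caching idea Erick mentions above (again, all names are made up): if the secondary table is small enough to fit in the JVM heap, load it once and do the lookup from a Map instead of firing a sub-select per row.

// Sketch only - assumes a small lookup table ("vendors") that fits easily in heap.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class VendorCache {

  private final Map<String, String> vendorNameById = new HashMap<>();

  // Load the whole lookup table once, up front.
  public VendorCache(Connection conn) throws SQLException {
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT id, name FROM vendors")) {
      while (rs.next()) {
        vendorNameById.put(rs.getString("id"), rs.getString("name"));
      }
    }
  }

  // Replaces the per-row secondary lookup the DIH sub-entity was doing.
  public String nameFor(String vendorId) {
    return vendorNameById.get(vendorId);
  }
}

Inside the indexing loop you'd then call something like cache.nameFor(rs.getString("vendor_id")) instead of going back to MySQL for every record.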