Having more carefully read Erick's post - I see that's essentially what he said, in a much more straightforward way.

I will also second Erick's suggestion of hammering on the SQL. We found that fruitful many times at the same gig. I develop but am not a SQL master, so in a similar situation I'll usually seek out a specialist to help me make sure the query isn't wasteful. It frequently was, and I learned a lot.
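To put a rough shape on that, here's a minimal SolrJ-style sketch of the approach Erick describes further down: one flat JOIN instead of DIH's per-row sub-selects, with documents sent to Solr in batches. Table names, columns, core name, and credentials are all invented for illustration - treat it as a starting point, not a drop-in implementation.

// Rough sketch only - table/column names, core name, credentials, and batch
// size are invented. The idea: one flat JOIN replaces the DIH nested
// entities, and documents go to Solr in batches from a small SolrJ program.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FlatJoinIndexer {

  public static void main(String[] args) throws Exception {
    // Push the joins down into MySQL once, instead of firing a sub-select
    // (or two) for every fetched row.
    String flatQuery =
        "SELECT t.id, t.amount, v.name AS vendor_name, p.description AS product_desc "
      + "FROM transactions t "
      + "JOIN vendors  v ON v.id = t.vendor_id "    // was a DIH sub-entity
      + "JOIN products p ON p.id = t.product_id";   // was the nested child entity

    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://localhost:3306/mydb", "user", "pass");
         Statement stmt = conn.createStatement();
         // SolrJ 5/6 style; newer releases use HttpSolrClient.Builder
         SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycollection")) {

      // MySQL Connector/J quirk: stream rows instead of buffering them all in memory
      stmt.setFetchSize(Integer.MIN_VALUE);

      List<SolrInputDocument> batch = new ArrayList<>();
      try (ResultSet rs = stmt.executeQuery(flatQuery)) {
        while (rs.next()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", rs.getString("id"));
          doc.addField("amount", rs.getDouble("amount"));
          doc.addField("vendor_name", rs.getString("vendor_name"));
          doc.addField("product_desc", rs.getString("product_desc"));
          batch.add(doc);

          if (batch.size() >= 1000) {   // send in chunks, not one doc at a time
            solr.add(batch);
            batch.clear();
          }
        }
      }
      if (!batch.isEmpty()) {
        solr.add(batch);
      }
      solr.commit();   // a single commit at the end, not per batch
    }
  }
}

Done this way the secondary selects disappear entirely, and batching and commit behavior are under your control rather than DIH's.
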
On Thu, May 26, 2016 at 12:31 PM, John Bickerstaff <j...@johnbickerstaff.com> wrote:

> It may or may not be helpful, but there's a similar class of problem that
> is frequently solved either by stored procedures or by running the query on
> a time-frame and storing the results... Doesn't matter if the end-point
> for the data is Solr or somewhere else.
>
> The problem is long running queries that are extremely complex and stress
> the database performance too heavily.
>
> The solution is to de-normalize the data you need... store it in that form
> and then the query gets really fast... sort of like a data warehouse type
> of thing. (Don't shoot, I know this isn't data warehousing...)
>
> Postgres even has something called an "automatically updateable view" that
> might serve - if that's your back end.
>
> Anyway - the underlying strategy is to find a way to flatten your data
> preparatory to turning it into solr documents by some means - either by
> getting it out on shorter-running queries all the time into some kind of
> store (Kafka, text file, whatever) or by using some feature of the database
> (stored procs writing to a summary table, automatically updatable view or
> similar).
>
> In this way, when you make your query, you make it against the "flattened"
> data - which is, ideally, all in one table - and then all the complexity of
> joins etc... is washed away and things ought to run pretty fast.
>
> The cost, of course, is a huge table with tons of duplicated data... Only
> you can say if that's worth it. I did this at my last gig and we truncated
> the table every 2 weeks to prevent it growing forever.
>
> In case it's helpful...
>
> PS - if you have the resources, a duplicate database can really help here
> too - again my experience is mostly with Postgres which allows a "warm"
> backup to be live. We frequently used this for executive queries that were
> using the database like a data warehouse because they were so
> time-consuming. It kept the load off production.
>
> On Thu, May 26, 2016 at 12:18 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Forgot to add... sometimes really hammering at the SQL query in DIH
>> can be fruitful, can you make a huge, monster query that's faster than
>> the sub-queries?
>>
>> I've also seen people run processes on the DB that move all the
>> data into a temporary place making use of all of the nifty stuff you
>> can do there and then use DIH on _that_. Or the view.
>>
>> All that said, I generally prefer using SolrJ if DIH doesn't do the job
>> after a day or two of fiddling, it gives more control.
>>
>> Good Luck!
>> Erick
>>
>> On Thu, May 26, 2016 at 11:02 AM, John Blythe <j...@curvolabs.com> wrote:
>> > oo gotcha. cool, will make sure to check it out and bounce any related
>> > questions through here.
>> >
>> > thanks!
>> >
>> > best,
>> >
>> >
>> > --
>> > *John Blythe*
>> > Product Manager & Lead Developer
>> >
>> > 251.605.3071 | j...@curvolabs.com
>> > www.curvolabs.com
>> >
>> > 58 Adams Ave
>> > Evansville, IN 47713
>> >
>> > On Thu, May 26, 2016 at 1:45 PM, Erick Erickson <erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Solr commits aren't the issue I'd guess. All the time is
>> >> probably being spent getting the data from MySQL.
>> >>
>> >> I've had some luck writing to Solr from a DB through a
>> >> SolrJ program, here's a place to get started:
>> >> searchhub.org/2012/02/14/indexing-with-solrj/
>> >> you can peel out the Tika bits pretty easily I should
>> >> think.
>> >>
>> >> One technique I've used is to cache
>> >> some of the DB tables in Java's memory to keep
>> >> from having to do the secondary lookup(s). This only
>> >> really works if the "secondary table" is small enough to fit in
>> >> Java's memory of course. You can do some creative
>> >> things with caching partial tables if you can sort appropriately.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, May 26, 2016 at 9:01 AM, John Blythe <j...@curvolabs.com> wrote:
>> >> > hi all,
>> >> >
>> >> > i've got layered entities in my solr import. it's calling on some
>> >> > transactional data from a MySQL instance. there are two fields that are
>> >> > used to then lookup other information from other tables via their related
>> >> > UIDs, one of which has its own child entity w yet another select statement
>> >> > to grab up more data.
>> >> >
>> >> > it fetches at about 120/s but processes at ~50-60/s. we currently only have
>> >> > close to 500k records, but it's growing quickly and thus is becoming
>> >> > increasingly painful to make modifications due to the reimport that needs
>> >> > to then occur.
>> >> >
>> >> > i feel like i'd seen some threads regarding commits of new data,
>> >> > master/slave, or solrcloud/sharding that could help in some ways related to
>> >> > this but as of yet can't scrounge them up w my searches (ironic :p).
>> >> >
>> >> > can someone help by pointing me to some good material related to this sort
>> >> > of thing?
>> >> >
>> >> > thanks-
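
One more rough sketch, of the table-caching idea Erick mentions above (again, all names are made up): if the secondary table is small enough to fit in the JVM heap, load it once and do the lookup from a Map instead of firing a sub-select per row.

// Sketch only - assumes a small lookup table ("vendors") that fits easily in heap.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class VendorCache {

  private final Map<String, String> vendorNameById = new HashMap<>();

  // Load the whole lookup table once, up front.
  public VendorCache(Connection conn) throws SQLException {
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT id, name FROM vendors")) {
      while (rs.next()) {
        vendorNameById.put(rs.getString("id"), rs.getString("name"));
      }
    }
  }

  // Replaces the per-row secondary lookup the DIH sub-entity was doing.
  public String nameFor(String vendorId) {
    return vendorNameById.get(vendorId);
  }
}

Inside the indexing loop you'd then call something like cache.nameFor(rs.getString("vendor_id")) instead of going back to MySQL for every record.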