And it may also be that there are whole classes of user for whom denormalization is just too heavy a cross to bear and for whom a little extra money spent on more hardware is a great tradeoff.
And... Lucene's indexing may be superior to your average SQL database, so
that a Solr JOIN could be so much better than your average RDBMS SQL JOIN.
That would be an interesting benchmark.

-- Jack Krupansky

On Fri, Apr 15, 2016 at 11:06 AM, Joel Bernstein <joels...@gmail.com> wrote:

> I think people are going to be surprised though by the speed of the joins.
> The joins also get faster as the number of shards, replicas and worker
> nodes grows in the cluster. So we may see people building out large
> clusters and using the joins in OLTP scenarios.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 15, 2016 at 10:58 AM, Jack Krupansky
> <jack.krupan...@gmail.com> wrote:
>
> > And of course it depends on the specific queries, both in terms of what
> > fields will be searched and which fields need to be returned.
> >
> > Yes, OLAP is the clear sweet spot, where taking 500 ms to 2 or even 20
> > seconds for a complex query may be just fine vs. OLTP/search where under
> > 150 ms is the target. But, again, it will depend on the nature of the
> > query, the cardinality of each search field, the cross product of
> > cardinality of search fields, etc.
> >
> > -- Jack Krupansky
> >
> > On Fri, Apr 15, 2016 at 10:44 AM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > In general the Streaming Expression joins are designed for interactive
> > > OLAP type workloads. So BI and data warehousing scenarios are the
> > > sweet spot. There may be scenarios where high QPS search applications
> > > will work with the distributed joins, particularly if the joins
> > > themselves are not huge. But the specific use cases need to be tested.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky
> > > <jack.krupan...@gmail.com> wrote:
> > >
> > > > It will be interesting to see which use cases work best with the new
> > > > streaming JOIN vs. which will remain best with full denormalization,
> > > > or whether you simply have to try both and benchmark them.
> > > >
> > > > My impression had been that streaming JOIN would be ideal for bulk
> > > > operations rather than traditional-style search queries. Maybe there
> > > > are three use cases: bulk read based on broad criteria, top-n
> > > > relevance search query, and specific document (or small number of
> > > > documents) based on multiple fields.
> > > >
> > > > My suspicion is that doing JOIN on five tables will likely be slower
> > > > than accessing a single document of a denormalized table/index.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <joels...@gmail.com>
> > > > wrote:
> > > >
> > > > > Solr now has full distributed join capabilities as part of the
> > > > > Streaming Expression library. Keep in mind that these are
> > > > > distributed joins so they shuffle records to worker nodes to
> > > > > perform the joins. These are comparable to joins done by SQL over
> > > > > MapReduce systems, but they are very responsive and can respond
> > > > > with sub-second response time for fairly large joins in parallel
> > > > > mode. But these joins do lend themselves to large distributed
> > > > > architectures (lots of shards and replicas). Target QPS also needs
> > > > > to be taken into account and tested in deciding whether these
> > > > > joins will meet the specific use case.
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > > On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dpg...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > The Streaming API with Streaming Expressions (or Parallel SQL if
> > > > > > you want to use SQL) can give you the functionality you're
> > > > > > looking for. See
> > > > > > https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > > > > > and
> > > > > > https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> > > > > > SQL queries coming in through the Parallel SQL Interface are
> > > > > > translated down into Streaming Expressions - if you need to do
> > > > > > something that SQL doesn't yet support you should check out the
> > > > > > Streaming Expressions to see if it can support it.
> > > > > >
> > > > > > With these you could store your data in separate collections (or
> > > > > > the same collection with different docType field values) and
> > > > > > then during search perform a join (inner, outer, hash) across
> > > > > > the collections. You could, if you wanted, even join with data
> > > > > > NOT in solr using the jdbc streaming function.
> > > > > >
> > > > > > - Dennis Gove
> > > > > >
> > > > > > On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG
> > > > > > <lat...@mdpi.com.invalid> wrote:
> > > > > >
> > > > > >> '*would I then be able to query a specific field of articles or
> > > > > >> other "table" (with the same OR BETTER performance)?*'
> > > > > >> -> And especially, would I be able to get only 1 article in the
> > > > > >> result...
> > > > > >>
> > > > > >> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
> > > > > >>
> > > > > >> Thanks Jack.
> > > > > >>
> > > > > >> I know that Solr is a search engine, but this replaces a search
> > > > > >> in my MySQL DB with this model:
> > > > > >>
> > > > > >>
> > > > > >> *My goal is to improve my environment (and my performance at
> > > > > >> the same time).*
> > > > > >>
> > > > > >> *Yes, I have a Solr data model... but atm I created 4 different
> > > > > >> indexes for "similar service usage".*
> > > > > >> *So atm, for 70 million documents, I am duplicating journal
> > > > > >> data and publisher data all the time in 1 index (for all
> > > > > >> articles from the same journal/pub) in order to be able to
> > > > > >> retrieve all data in 1 query...*
> > > > > >>
> > > > > >> *I found yesterday that there is the possibility to create
> > > > > >> something like an array of <entity> in the data-conf.xml.*
> > > > > >> e.g. (pseudo code - incomplete):
> > > > > >> <entity name="solr_publisher" query="select id, name from publishers">
> > > > > >>   <entity name="solr_journal" query="select id, name as j_name from
> > > > > >>       journals WHERE publisher_id='${solr_publisher.id}'">
> > > > > >>     <entity name="solr_articles" query="select id, title, abstract from
> > > > > >>         articles WHERE journal_id='${solr_journal.id}'">
> > > > > >>       <entity name="solr_authors" query="select given_name, last_name
> > > > > >>           from authors WHERE article_id='${solr_articles.id}'"/>
> > > > > >>     </entity>
> > > > > >>   </entity>
> > > > > >> </entity>
> > > > > >>
> > > > > >>
> > > > > >> *Would this be a good option? Is this the denormalization you
> > > > > >> were proposing?*
> > > > > >>
> > > > > >> *If yes, would I then be able to query a specific field of
> > > > > >> articles or other "table" (with the same OR BETTER
> > > > > >> performance)? If yes, I might probably merge all the different
> > > > > >> indexes together.*
> > > > > >> *I'm currently joining everything in mysql, so duplicating the
> > > > > >> fields in the solr (pseudo code):*
> > > > > >> <entity name="all" query="select * from articles INNER JOIN
> > > > > >> journal on [...]">
> > > > > >> *So I have an index for authors query, a general one for
> > > > > >> articles (only needed info of other tables) ...*
> > > > > >>
> > > > > >> Thanks in advance for the tips. :)
> > > > > >>
> > > > > >> Kind regards,
> > > > > >> Bastien
> > > > > >>
> > > > > >> On 14/04/2016 16:23, Jack Krupansky wrote:
> > > > > >>
> > > > > >> Solr is a search engine, not a database.
> > > > > >>
> > > > > >> JOINs? Although Solr does have some limited JOIN capabilities,
> > > > > >> they are more for special situations, not the front-line go-to
> > > > > >> technique for data modeling for search.
> > > > > >>
> > > > > >> Rather, denormalization is the front-line go-to technique for
> > > > > >> data modeling in Solr.
> > > > > >>
> > > > > >> In any case, the first step in data modeling is always to focus
> > > > > >> on your queries - what information will be coming into your
> > > > > >> apps and what information will the apps want to access based on
> > > > > >> those inputs.
> > > > > >>
> > > > > >> But wait... you say you are upgrading, which suggests that you
> > > > > >> have an existing Solr data model, and probably queries as well.
> > > > > >> So...
> > > > > >>
> > > > > >> 1. Share at least a summary of your existing Solr data model as
> > > > > >> well as at least a summary of the kinds of queries you perform
> > > > > >> today.
> > > > > >> 2. Tell us what exactly is driving your inquiry - are queries
> > > > > >> too slow, too cumbersome, not sufficiently powerful, or... what
> > > > > >> exactly is the problem you need to solve.
> > > > > >>
> > > > > >>
> > > > > >> -- Jack Krupansky
> > > > > >>
> > > > > >> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG
> > > > > >> <lat...@mdpi.com.invalid> wrote:
> > > > > >>
> > > > > >>> Hi Guys,
> > > > > >>>
> > > > > >>> *I am upgrading from solr 4.2 to 6.0.*
> > > > > >>> *I successfully (after some time) migrated the config files
> > > > > >>> and other parameters...*
> > > > > >>>
> > > > > >>> Now I'm just wondering if my indexes are following the best
> > > > > >>> practices...(and they are probably not :-) )
> > > > > >>>
> > > > > >>> What would be the best if we have this kind of sql data to
> > > > > >>> write in Solr:
> > > > > >>>
> > > > > >>>
> > > > > >>> I have several different services which need (more or less)
> > > > > >>> different data based on these JOINs...
> > > > > >>>
> > > > > >>> e.g.:
> > > > > >>> Service A needs lots of data (but not all),
> > > > > >>> Service B needs a little data (some fields already included in
> > > > > >>> A),
> > > > > >>> Service C needs a bit more data than B (some fields already
> > > > > >>> included in A/B)...
> > > > > >>>
> > > > > >>> *1. Would it be better to create one single index?*
> > > > > >>> *-> i.e.: this will duplicate journal info for every single
> > > > > >>> article*
> > > > > >>>
> > > > > >>> *2. Would it be better to create several specific indexes for
> > > > > >>> each group of similar services?*
> > > > > >>> *-> i.e.: this will use more space on the disks (and there are
> > > > > >>> ~70 million documents to join)*
> > > > > >>>
> > > > > >>> *3. Would it be better to create an index per table and make a
> > > > > >>> join?*
> > > > > >>> *-> if yes, how??*
> > > > > >>>
> > > > > >>> Kind regards,
> > > > > >>> Bastien
> > > > > >>>
> > > > > >>
> > > > > >> Kind regards,
> > > > > >> Bastien Latard
> > > > > >> Web engineer
> > > > > >> --
> > > > > >> MDPI AG
> > > > > >> Postfach, CH-4005 Basel, Switzerland
> > > > > >> Office: Klybeckstrasse 64, CH-4057
> > > > > >> Tel. +41 61 683 77 35
> > > > > >> Fax: +41 61 302 89 18
> > > > > >> E-mail: latard@mdpi.com
> > > > > >> http://www.mdpi.com/
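
For anyone weighing the two approaches discussed in this thread, here is a minimal sketch in plain Python (not Solr code and not the Streaming Expression API) of what an inner hash join does and how it can be used to build denormalized, one-per-article documents. The table and field names loosely follow the pseudo config quoted above; the sample rows and the helper names `hash_join` and `denormalize` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the publisher/journal/article tables.
publishers = [{"id": 1, "name": "Pub A"}]
journals = [{"id": 10, "publisher_id": 1, "j_name": "Journal X"}]
articles = [
    {"id": 100, "journal_id": 10, "title": "First article"},
    {"id": 101, "journal_id": 10, "title": "Second article"},
]


def hash_join(left, right, left_key, right_key):
    """Inner hash join: index `right` by its key, then probe with each left row."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    joined = []
    for row in left:
        for match in index.get(row[left_key], []):
            # Left-hand fields win on name clashes (e.g. "id").
            joined.append({**match, **row})
    return joined


def denormalize(articles, journals, publishers):
    """Build one flat document per article, duplicating journal and publisher data."""
    with_journal = hash_join(articles, journals, "journal_id", "id")
    return hash_join(with_journal, publishers, "publisher_id", "id")


docs = denormalize(articles, journals, publishers)
for doc in docs:
    print(doc["id"], doc["title"], doc["j_name"], doc["name"])
```

Doing this once at index time corresponds to the denormalized single-index option (journal and publisher fields repeated in every article document); doing the equivalent at query time across separate collections corresponds to the join option. Which one is faster for a given workload still has to be benchmarked, as noted above.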