Re: Solr best practices for many to many relations...

Joel Bernstein Fri, 15 Apr 2016 08:07:17 -0700

I think people are going to be surprised though by the speed of the joins.
The joins also get faster as the number of shards, replicas and worker
nodes grow in the cluster. So we may see people building out large clusters
and using the joins in OLTP scenarios.


Joel Bernstein
http://joelsolr.blogspot.com/
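
A minimal sketch of how a join can be spread across worker nodes, which is
the scaling behaviour described above. It only builds the Streaming
Expression string; it does not contact Solr. The collection names
(articles, journals), the worker collection name (workers), and the field
names are assumptions made for illustration, and the sketch assumes the
join key has the same name in both collections and is available to the
/export handler.

```python
def partitioned_search(collection, fields, key):
    """search() source sorted on `key` and hash-partitioned across workers."""
    fl = ",".join(fields)
    return (
        f'search({collection}, q="*:*", fl="{fl}", sort="{key} asc", '
        f'qt="/export", partitionKeys="{key}")'
    )


def parallel_inner_join(worker_collection, left, right, key, workers):
    """Wrap an innerJoin of two partitioned streams in parallel()."""
    join = f'innerJoin({left}, {right}, on="{key}")'
    return (
        f'parallel({worker_collection}, {join}, '
        f'workers="{workers}", sort="{key} asc")'
    )


if __name__ == "__main__":
    # Hypothetical collections and fields, not taken from the thread.
    articles = partitioned_search("articles", ["journal_id", "title"], "journal_id")
    journals = partitioned_search("journals", ["journal_id", "journal_name"], "journal_id")
    print(parallel_inner_join("workers", articles, journals, "journal_id", 4))
```

Both inner streams are partitioned on the join key so that matching tuples
reach the same worker; raising the workers value is the knob that the
scaling claim refers to, and the actual gain would need to be measured.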

On Fri, Apr 15, 2016 at 10:58 AM, Jack Krupansky <[email protected]>
wrote:

> And of course it depends on the specific queries, both in terms of what
> fields will be searched and which fields need to be returned.
>
> Yes, OLAP is the clear sweet spot, where taking 500 ms to 2 or even 20
> seconds for a complex query may be just fine vs. OLTP/search where under
> 150 ms is the target. But, again, it will depend on the nature of the
> query, the cardinality of each search field, the cross product of
> cardinality of search fields, etc.
>
> -- Jack Krupansky
>
> On Fri, Apr 15, 2016 at 10:44 AM, Joel Bernstein <[email protected]>
> wrote:
>
> > In general the Streaming Expression joins are designed for interactive
> > OLAP type workloads. So BI and data warehousing scenarios are the sweet
> > spot. There may be scenarios where high QPS search applications will work
> > with the distributed joins, particularly if the joins themselves are not
> > huge. But the specific use cases need to be tested.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky <[email protected]>
> > wrote:
> >
> > > It will be interesting to see which use cases work best with the new
> > > streaming JOIN vs. which will remain best with full denormalization, or
> > > whether you simply have to try both and benchmark them.
> > >
> > > My impression had been that streaming JOIN would be ideal for bulk
> > > operations rather than traditional-style search queries. Maybe there are
> > > three use cases: bulk read based on broad criteria, top-n relevance search
> > > query, and specific document (or small number of documents) based on
> > > multiple fields.
> > >
> > > My suspicion is that doing JOIN on five tables will likely be slower than
> > > accessing a single document of a denormalized table/index.
> > >
> > > -- Jack Krupansky
> > >
> > > On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <[email protected]>
> > > wrote:
> > >
> > > > Solr now has full distributed join capabilities as part of the Streaming
> > > > Expression library. Keep in mind that these are distributed joins, so they
> > > > shuffle records to worker nodes to perform the joins. These are comparable
> > > > to joins done by SQL over MapReduce systems, but they are very responsive
> > > > and can respond with sub-second response time for fairly large joins in
> > > > parallel mode. But these joins do lend themselves to large distributed
> > > > architectures (lots of shards and replicas). Target QPS also needs to be
> > > > taken into account and tested in deciding whether these joins will meet
> > > > the specific use case.
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > > On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <[email protected]>
> > > > wrote:
> > > >
> > > > > The Streaming API with Streaming Expressions (or Parallel SQL if you
> > > > > want to use SQL) can give you the functionality you're looking for. See
> > > > > https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > > > > and
> > > > > https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> > > > > SQL queries coming in through the Parallel SQL Interface are translated
> > > > > down into Streaming Expressions. If you need to do something that SQL
> > > > > doesn't yet support, you should check out the Streaming Expressions to
> > > > > see if they can support it.
> > > > >
> > > > > With these you could store your data in separate collections (or the
> > > > > same collection with different docType field values) and then during
> > > > > search perform a join (inner, outer, hash) across the collections. You
> > > > > could, if you wanted, even join with data NOT in Solr using the jdbc
> > > > > streaming function.
> > > > >
> > > > > - Dennis Gove
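
The cross-collection join described above could look roughly like the
following sketch. It builds an innerJoin expression over two search()
sources and the URL for a collection's /stream handler, without sending
any request. The collection names, field names, and base URL are
hypothetical; the sketch assumes both sides are sorted on their join
field, which a merge-style innerJoin needs, and that those fields can be
read through the /export handler.

```python
from urllib.parse import urlencode


def search_expr(collection, fields, sort_field):
    """Build a search() source that streams all documents sorted on one field."""
    fl = ",".join(fields)
    return (
        f'search({collection}, q="*:*", fl="{fl}", '
        f'sort="{sort_field} asc", qt="/export")'
    )


def inner_join_expr(left, right, on):
    """Join two sorted streams; `on` is "field" or "leftField=rightField"."""
    return f'innerJoin({left}, {right}, on="{on}")'


def stream_url(base_url, collection, expr):
    """URL that submits the expression to the collection's /stream handler."""
    return f"{base_url}/{collection}/stream?" + urlencode({"expr": expr})


if __name__ == "__main__":
    # Hypothetical collections: one document per article, one per journal.
    articles = search_expr("articles", ["journal_id", "title"], "journal_id")
    journals = search_expr("journals", ["id", "name"], "id")
    expr = inner_join_expr(articles, journals, "journal_id=id")
    print(stream_url("http://localhost:8983/solr", "articles", expr))
```

Whether this beats a single denormalized index for a given query mix is
exactly what the rest of the thread says must be benchmarked.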
> > > > >
> > > > >
> > > > > On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
> > > > > [email protected]> wrote:
> > > > >
> > > > > >> '*would I then be able to query a specific field of articles or other
> > > > > >> "table" (with the same OR BETTER performance)?*'
> > > > > >> -> And especially, would I be able to get only 1 article in the
> > > > > >> result...
> > > > >>
> > > > >> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
> > > > >>
> > > > >> Thanks Jack.
> > > > >>
> > > > > >> I know that Solr is a search engine, but this replaces a search in my
> > > > > >> MySQL DB with this model:
> > > > >>
> > > > >>
> > > > > >> *My goal is to improve my environment (and my performance at the same
> > > > > >> time).*
> > > > > >>
> > > > > >> *Yes, I have a Solr data model... but atm I created 4 different indexes
> > > > > >> for "similar service usage".*
> > > > > >> *So atm, for 70 million documents, I am duplicating journal data and
> > > > > >> publisher data all the time in 1 index (for all articles from the same
> > > > > >> journal/pub) in order to be able to retrieve all data in 1 query...*
> > > > >>
> > > > > >> *I found yesterday that it is possible to create something like an array
> > > > > >> of <entity> elements in the data-conf.xml.*
> > > > >> e.g. (pseudo code - incomplete):
> > > > >> <entity  name="solr_publisher" query="select name from
> publishers">
> > > > >> <entity name="solr_journal" query="select name as j_name from
> > journals
> > > > >> WHERE publisher_id='${solr_publisher.id}'">
> > > > >> <entity name="solr_articles" query="select title, abstract from
> > > articles
> > > > >> WHERE journal_id='${solr_journal.id}'">
> > > > >> <entity name="solr_authors" query="select given_name, last_name
> from
> > > > >> authors WHERE article_id='${solr_article.id}'">
> > > > >>
> > > > >>
> > > > > >> *Would this be a good option? Is this the denormalization you were
> > > > > >> proposing?*
> > > > > >>
> > > > > >> *If yes, would I then be able to query a specific field of articles or
> > > > > >> other "table" (with the same OR BETTER performance)? If yes, I might
> > > > > >> merge all the different indexes together.*
> > > > > >> *I'm currently joining everything in MySQL, so duplicating the fields in
> > > > > >> Solr (pseudo code):*
> > > > > >> <entity name="all" query="select * from articles INNER JOIN journal on [...]">
> > > > > >> *So I have an index for author queries, a general one for articles (only
> > > > > >> the needed info from other tables) ...*
> > > > >>
> > > > >> Thanks in advance for the tips. :)
> > > > >>
> > > > >> Kind regards,
> > > > >> Bastien
> > > > >>
> > > > >> On 14/04/2016 16:23, Jack Krupansky wrote:
> > > > >>
> > > > >> Solr is a search engine, not a database.
> > > > >>
> > > > > >> JOINs? Although Solr does have some limited JOIN capabilities, they are
> > > > > >> more for special situations, not the front-line go-to technique for data
> > > > > >> modeling for search.
> > > > >>
> > > > >> Rather, denormalization is the front-line go-to technique for data
> > > > >> modeling in Solr.
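
As an illustration of the denormalization recommended above, here is a
minimal sketch that flattens publisher, journal, article, and author rows
into one document per article before indexing. The row layout and the
output field names (journal_name, publisher_name, author_names) are
assumptions made for the example, not the poster's actual schema.

```python
from collections import defaultdict


def denormalize(publishers, journals, articles, authors):
    """Return one flat document per article with its parent data copied in.

    `publishers` and `journals` are dicts keyed by id; `articles` and
    `authors` are lists of row dicts (hypothetical layout).
    """
    names_by_article = defaultdict(list)
    for author in authors:
        full_name = f'{author["given_name"]} {author["last_name"]}'
        names_by_article[author["article_id"]].append(full_name)

    docs = []
    for article in articles:
        journal = journals[article["journal_id"]]
        publisher = publishers[journal["publisher_id"]]
        docs.append({
            "id": article["id"],
            "title": article["title"],
            "abstract": article["abstract"],
            "journal_name": journal["name"],      # repeated for every article
            "publisher_name": publisher["name"],  # repeated for every article
            "author_names": names_by_article[article["id"]],  # multi-valued
        })
    return docs


if __name__ == "__main__":
    docs = denormalize(
        {1: {"name": "Publisher A"}},
        {10: {"name": "Journal X", "publisher_id": 1}},
        [{"id": 100, "title": "T", "abstract": "A", "journal_id": 10}],
        [{"article_id": 100, "given_name": "Ada", "last_name": "Lovelace"}],
    )
    print(docs)
```

The journal and publisher names are duplicated on every article, which is
the disk-space cost raised in the original question; in exchange, a single
query returns everything a service needs.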
> > > > >>
> > > > > >> In any case, the first step in data modeling is always to focus on your
> > > > > >> queries: what information will be coming into your apps and what
> > > > > >> information the apps will want to access based on those inputs.
> > > > > >>
> > > > > >> But wait... you say you are upgrading, which suggests that you have an
> > > > > >> existing Solr data model, and probably queries as well. So...
> > > > > >>
> > > > > >> 1. Share at least a summary of your existing Solr data model as well as
> > > > > >> at least a summary of the kinds of queries you perform today.
> > > > > >> 2. Tell us what exactly is driving your inquiry: are queries too slow,
> > > > > >> too cumbersome, not sufficiently powerful, or... what exactly is the
> > > > > >> problem you need to solve?
> > > > >>
> > > > >>
> > > > >> -- Jack Krupansky
> > > > >>
> > > > >> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
> > > > >> <[email protected]>[email protected]> wrote:
> > > > >>
> > > > >>> Hi Guys,
> > > > >>>
> > > > > >>> *I am upgrading from Solr 4.2 to 6.0.*
> > > > > >>> *I successfully (after some time) migrated the config files and other
> > > > > >>> parameters...*
> > > > > >>>
> > > > > >>> Now I'm just wondering if my indexes are following the best
> > > > > >>> practices...(and they are probably not :-) )
> > > > > >>>
> > > > > >>> What would be the best approach if we have this kind of SQL data to
> > > > > >>> write in Solr:
> > > > >>>
> > > > >>>
> > > > > >>> I have several different services which need (more or less) different
> > > > > >>> data based on these JOINs...
> > > > > >>>
> > > > > >>> e.g.:
> > > > > >>> Service A needs lots of data (but not all),
> > > > > >>> Service B needs a little data (some fields already included in A),
> > > > > >>> Service C needs a bit more data than B (some fields already included in
> > > > > >>> A/B)...
> > > > >>>
> > > > > >>> *1. Would it be better to create one single index?*
> > > > > >>> *-> i.e.: this will duplicate journal info for every single article*
> > > > > >>>
> > > > > >>> *2. Would it be better to create several specific indexes, one for each
> > > > > >>> group of similar services?*
> > > > > >>> *-> i.e.: this will use more space on the disks (and there are
> > > > > >>> ~70 million documents to join)*
> > > > > >>>
> > > > > >>> *3. Would it be better to create an index per table and make a join?*
> > > > > >>> *-> if yes, how?*
> > > > >>>
> > > > >>> Kind regards,
> > > > >>> Bastien
> > > > >>>
> > > > >>>
> > > > >>
> > > > >> Kind regards,
> > > > >> Bastien Latard
> > > > >> Web engineer
> > > > >> --
> > > > >> MDPI AG
> > > > >> Postfach, CH-4005 Basel, Switzerland
> > > > >> Office: Klybeckstrasse 64, CH-4057
> > > > >> Tel. +41 61 683 77 35
> > > > >> Fax: +41 61 302 89 18
> > > > > >> E-mail: [email protected]
> > > > > >> http://www.mdpi.com/
> > > > >>
> > > > >>
> > > > >> Kind regards,
> > > > >> Bastien Latard
> > > > >> Web engineer
> > > > >> --
> > > > >> MDPI AG
> > > > >> Postfach, CH-4005 Basel, Switzerland
> > > > >> Office: Klybeckstrasse 64, CH-4057
> > > > >> Tel. +41 61 683 77 35
> > > > >> Fax: +41 61 302 89 18
> > > > >> E-mail: [email protected]http://www.mdpi.com/
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>
