Re: Solr best practices for many to many relations...

Jack Krupansky Fri, 15 Apr 2016 08:21:12 -0700

And it may also be that there are whole classes of user for whom
denormalization is just too heavy a cross to bear and for whom a little
extra money spent on more hardware is a great tradeoff.
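
For concreteness, here is a minimal sketch of the denormalization burden being discussed, assuming joined SQL rows arrive one per article/author pair. The field names (article_id, j_name, publisher, author) and the helper name are hypothetical, loosely modeled on the pseudo code quoted further down; this is an illustration, not a recommended schema.

```python
def denormalize(rows):
    """Collapse joined rows (one per article/author pair) into one flat
    document per article, repeating journal and publisher data each time."""
    docs = {}
    for row in rows:
        doc = docs.setdefault(row["article_id"], {
            "id": row["article_id"],
            "title": row["title"],
            "j_name": row["j_name"],        # duplicated for every article
            "publisher": row["publisher"],  # duplicated for every article
            "authors": [],                  # would map to a multivalued field
        })
        doc["authors"].append(row["author"])
    return list(docs.values())


# Hypothetical sample rows, as a SQL JOIN across the four tables might return.
rows = [
    {"article_id": 1, "title": "A", "j_name": "J1", "publisher": "P", "author": "X"},
    {"article_id": 1, "title": "A", "j_name": "J1", "publisher": "P", "author": "Y"},
    {"article_id": 2, "title": "B", "j_name": "J1", "publisher": "P", "author": "X"},
]
docs = denormalize(rows)
```

Every article document carries its own copy of the journal and publisher values, which is the duplication (and the re-indexing work when a journal changes) that some users may prefer to trade for hardware.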


And... Lucene's indexing may be superior to your average SQL database, so
that a Solr JOIN could be so much better than your average RDBMS SQL JOIN.
That would be an interesting benchmark.
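
A rough sketch of how such a comparison might be timed, assuming each candidate (a Solr streaming JOIN, an RDBMS JOIN, a denormalized lookup) is wrapped in a zero-argument callable. The function names and the stand-in workloads are hypothetical; a real benchmark would also need realistic data volumes, cache control, and concurrent load.

```python
import statistics
import time


def median_latency_ms(query_fn, runs=20, warmup=3):
    """Call query_fn repeatedly and return the median wall-clock time in ms."""
    for _ in range(warmup):
        query_fn()  # warm caches before measuring
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        query_fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare(candidates, runs=20):
    """Map each candidate name to its median latency in milliseconds."""
    return {name: median_latency_ms(fn, runs) for name, fn in candidates.items()}


if __name__ == "__main__":
    # Stand-in workloads; replace with real calls to Solr and to the database.
    results = compare({
        "solr_join": lambda: sum(range(1000)),
        "sql_join": lambda: sum(range(2000)),
    })
    for name, ms in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name}: {ms:.3f} ms")
```

Median rather than mean is used here so that a single slow outlier does not dominate; for the OLTP/search case the tail latency would matter as well.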

-- Jack Krupansky

On Fri, Apr 15, 2016 at 11:06 AM, Joel Bernstein <joels...@gmail.com> wrote:

> I think people are going to be surprised though by the speed of the joins.
> The joins also get faster as the number of shards, replicas and worker
> nodes grow in the cluster. So we may see people building out large clusters
> and using the joins in OLTP scenarios.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 15, 2016 at 10:58 AM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > And of course it depends on the specific queries, both in terms of what
> > fields will be searched and which fields need to be returned.
> >
> > Yes, OLAP is the clear sweet spot, where taking 500 ms to 2 or even 20
> > seconds for a complex query may be just fine vs. OLTP/search where under
> > 150 ms is the target. But, again, it will depend on the nature of the
> > query, the cardinality of each search field, the cross product of
> > cardinality of search fields, etc.
> >
> > -- Jack Krupansky
> >
> > On Fri, Apr 15, 2016 at 10:44 AM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > In general the Streaming Expression joins are designed for interactive
> > > OLAP-type workloads. So BI and data warehousing scenarios are the sweet
> > > spot. There may be scenarios where high QPS search applications will work
> > > with the distributed joins, particularly if the joins themselves are not
> > > huge. But the specific use cases need to be tested.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky
> > > <jack.krupan...@gmail.com> wrote:
> > >
> > > > It will be interesting to see which use cases work best with the new
> > > > streaming JOIN vs. which will remain best with full denormalization, or
> > > > whether you simply have to try both and benchmark them.
> > > >
> > > > My impression had been that streaming JOIN would be ideal for bulk
> > > > operations rather than traditional-style search queries. Maybe there are
> > > > three use cases: bulk read based on broad criteria, top-n relevance
> > > > search query, and specific document (or small number of documents) based
> > > > on multiple fields.
> > > >
> > > > My suspicion is that doing JOIN on five tables will likely be slower than
> > > > accessing a single document of a denormalized table/index.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <joels...@gmail.com>
> > > > wrote:
> > > >
> > > > > Solr now has full distributed join capabilities as part of the
> > > > > Streaming Expression library. Keep in mind that these are distributed
> > > > > joins, so they shuffle records to worker nodes to perform the joins.
> > > > > These are comparable to joins done by SQL over MapReduce systems, but
> > > > > they are very responsive and can respond with sub-second response time
> > > > > for fairly large joins in parallel mode. But these joins do lend
> > > > > themselves to large distributed architectures (lots of shards and
> > > > > replicas). Target QPS also needs to be taken into account and tested in
> > > > > deciding whether these joins will meet the specific use case.
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > > On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dpg...@gmail.com> wrote:
> > > > >
> > > > > > The Streaming API with Streaming Expressions (or Parallel SQL if you
> > > > > > want to use SQL) can give you the functionality you're looking for. See
> > > > > > https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > > > > > and
> > > > > > https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> > > > > > SQL queries coming in through the Parallel SQL Interface are translated
> > > > > > down into Streaming Expressions - if you need to do something that SQL
> > > > > > doesn't yet support you should check out the Streaming Expressions to
> > > > > > see if they can support it.
> > > > > >
> > > > > > With these you could store your data in separate collections (or the
> > > > > > same collection with different docType field values) and then during
> > > > > > search perform a join (inner, outer, hash) across the collections. You
> > > > > > could, if you wanted, even join with data NOT in solr using the jdbc
> > > > > > streaming function.
> > > > > >
> > > > > > - Dennis Gove
> > > > > >
> > > > > >
> > > > > > On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG
> > > > > > <lat...@mdpi.com.invalid> wrote:
> > > > > >
> > > > > >> '*would I then be able to query a specific field of articles or other
> > > > > >> "table" (with the same OR BETTER performance)?*'
> > > > > >> -> And especially, would I be able to get only 1 article in the
> > > > > >> result...
> > > > > >>
> > > > > >> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
> > > > > >>
> > > > > >> Thanks Jack.
> > > > > >>
> > > > > >> I know that Solr is a search engine, but this replaces a search in my
> > > > > >> mysql DB with this model:
> > > > > >>
> > > > > >>
> > > > > >> *My goal is to improve my environment (and my performance at the same
> > > > > >> time).*
> > > > > >>
> > > > > >> *Yes, I have a Solr data model... but atm I created 4 different indexes
> > > > > >> for "similar service usage".*
> > > > > >> *So atm, for 70 million documents, I am duplicating journal data and
> > > > > >> publisher data all the time in 1 index (for all articles from the same
> > > > > >> journal/pub) in order to be able to retrieve all data in 1 query...*
> > > > > >>
> > > > > >> *I found yesterday that there is the possibility to create something
> > > > > >> like an array of <entity> in the data-conf.xml.*
> > > > > >> e.g. (pseudo code - incomplete):
> > > > > >> <entity name="solr_publisher" query="select id, name from publishers">
> > > > > >>   <entity name="solr_journal" query="select id, name as j_name from
> > > > > >>       journals WHERE publisher_id='${solr_publisher.id}'">
> > > > > >>     <entity name="solr_articles" query="select id, title, abstract from
> > > > > >>         articles WHERE journal_id='${solr_journal.id}'">
> > > > > >>       <entity name="solr_authors" query="select given_name, last_name
> > > > > >>           from authors WHERE article_id='${solr_articles.id}'"/>
> > > > > >>     </entity>
> > > > > >>   </entity>
> > > > > >> </entity>
> > > > > >>
> > > > > >>
> > > > > >> *Would this be a good option? Is this the denormalization you were
> > > > > >> proposing?*
> > > > > >>
> > > > > >> *If yes, would I then be able to query a specific field of articles or
> > > > > >> other "table" (with the same OR BETTER performance)? If yes, I might
> > > > > >> merge all the different indexes together.*
> > > > > >> *I'm currently joining everything in mysql, so duplicating the fields in
> > > > > >> Solr (pseudo code):*
> > > > > >> <entity name="all" query="select * from articles INNER JOIN journal on
> > > > > >> [...]">
> > > > > >> *So I have an index for author queries, a general one for articles (only
> > > > > >> the needed info from other tables) ...*
> > > > > >>
> > > > > >> Thanks in advance for the tips. :)
> > > > > >>
> > > > > >> Kind regards,
> > > > > >> Bastien
> > > > > >>
> > > > > >> On 14/04/2016 16:23, Jack Krupansky wrote:
> > > > > >>
> > > > > >> Solr is a search engine, not a database.
> > > > > >>
> > > > > >> JOINs? Although Solr does have some limited JOIN capabilities, they are
> > > > > >> more for special situations, not the front-line go-to technique for data
> > > > > >> modeling for search.
> > > > > >>
> > > > > >> Rather, denormalization is the front-line go-to technique for data
> > > > > >> modeling in Solr.
> > > > > >>
> > > > > >> In any case, the first step in data modeling is always to focus on your
> > > > > >> queries - what information will be coming into your apps and what
> > > > > >> information will the apps want to access based on those inputs.
> > > > > >>
> > > > > >> But wait... you say you are upgrading, which suggests that you have an
> > > > > >> existing Solr data model, and probably queries as well. So...
> > > > > >>
> > > > > >> 1. Share at least a summary of your existing Solr data model as well as
> > > > > >> at least a summary of the kinds of queries you perform today.
> > > > > >> 2. Tell us what exactly is driving your inquiry - are queries too slow,
> > > > > >> too cumbersome, not sufficiently powerful, or... what exactly is the
> > > > > >> problem you need to solve?
> > > > > >>
> > > > > >>
> > > > > >> -- Jack Krupansky
> > > > > >>
> > > > > >> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG
> > > > > >> <lat...@mdpi.com.invalid> wrote:
> > > > > >>
> > > > > >>> Hi Guys,
> > > > > >>>
> > > > > >>> *I am upgrading from solr 4.2 to 6.0.*
> > > > > >>> *I successfully (after some time) migrated the config files and
> > > other
> > > > > >>> parameters...*
> > > > > >>>
> > > > > >>> Now I'm just wondering if my indexes are following the best
> > > > > >>> practices...(and they are probably not :-) )
> > > > > >>>
> > > > > >>> What would be best if we have this kind of sql data to write in Solr:
> > > > > >>>
> > > > > >>>
> > > > > >>> I have several different services which need (more or less) different
> > > > > >>> data based on these JOINs...
> > > > > >>>
> > > > > >>> e.g.:
> > > > > >>> Service A needs lots of data (but not all),
> > > > > >>> Service B needs little data (some fields already included in A),
> > > > > >>> Service C needs a bit more data than B (some fields already included in
> > > > > >>> A/B)...
> > > > > >>>
> > > > > >>> *1. Would it be better to create one single index?*
> > > > > >>> *-> i.e.: this will duplicate journal info for every single article*
> > > > > >>>
> > > > > >>> *2. Would it be better to create several specific indexes for each
> > > > > >>> similar service?*
> > > > > >>> *-> i.e.: this will use more space on the disks (and there are ~70
> > > > > >>> million documents to join)*
> > > > > >>>
> > > > > >>> *3. Would it be better to create an index per table and make a join?*
> > > > > >>> *-> if yes, how??*
> > > > > >>>
> > > > > >>> Kind regards,
> > > > > >>> Bastien
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > > >> Kind regards,
> > > > > >> Bastien Latard
> > > > > >> Web engineer
> > > > > >> --
> > > > > >> MDPI AG
> > > > > >> Postfach, CH-4005 Basel, Switzerland
> > > > > >> Office: Klybeckstrasse 64, CH-4057
> > > > > >> Tel. +41 61 683 77 35
> > > > > >> Fax: +41 61 302 89 18
> > > > > >> E-mail: latard@mdpi.com
> > > > > >> http://www.mdpi.com/
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
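
To make the streaming join discussed in the quoted thread concrete, here is a hedged sketch that only builds an innerJoin Streaming Expression and the URL that would carry it to the /stream handler. The collection names (articles, journals), the field names, and the localhost address are assumptions borrowed from the pseudo code above, not a tested setup. Both inner streams are sorted on the join key, and the /export handler requires docValues on the fields it returns and sorts by.

```python
from urllib.parse import urlencode


def search_stream(collection, fields, sort_field, query="*:*"):
    """Build a search() stream that exports every match, sorted on sort_field."""
    return 'search({c}, q="{q}", fl="{fl}", sort="{s} asc", qt="/export")'.format(
        c=collection, q=query, fl=",".join(fields), s=sort_field)


def inner_join(left, right, left_key, right_key):
    """Wrap two sorted streams in an innerJoin on left_key = right_key."""
    return 'innerJoin({}, {}, on="{}={}")'.format(left, right, left_key, right_key)


# Hypothetical collections: one per "table" instead of one denormalized index.
expr = inner_join(
    search_stream("articles", ["id", "title", "journal_id"], "journal_id"),
    search_stream("journals", ["id", "j_name"], "id"),
    "journal_id",
    "id",
)

# Assumed local Solr; the expression travels in the expr request parameter.
url = "http://localhost:8983/solr/articles/stream?" + urlencode({"expr": expr})
print(expr)
print(url)
```

Nothing here contacts Solr; whether such a join meets a given latency or QPS target is exactly what would have to be tested per use case.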
