Re: Solr best practices for many to many relations...

Joel Bernstein Fri, 15 Apr 2016 06:57:10 -0700

Solr now has full distributed join capabilities as part of the Streaming
Expression library. Keep in mind that these are distributed joins so they
shuffle records to worker nodes to perform the joins. These are comparable
to joins done by SQL over MapReduce systems, but they are very responsive
and can respond with sub-second response time for fairly large joins in
parallel mode. But these joins do lend themselves to large distributed
architectures (lots of shards and replicas). Target QPS also needs to be
taken into account and tested in deciding whether these joins will meet the
specific use case.
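
For illustration, a minimal sketch of what a parallel join of this kind could look like. The Python below only builds a Streaming Expression string and a URL for the /stream handler; it does not contact Solr. The collection names (journals, articles, workers), the field names, the host, and the worker count are all assumptions invented for this example, and the join key is assumed to be a docValues field so that the /export handler can stream the full sorted result set.

```python
from urllib.parse import urlencode


def search(collection, fl, key, q="*:*"):
    """Build a search() source sorted and partitioned on the join key.

    qt="/export" streams the whole sorted result set; partitionKeys lets
    parallel() send tuples with the same key to the same worker.
    """
    return (
        f'search({collection}, q="{q}", fl="{fl}", '
        f'sort="{key} asc", qt="/export", partitionKeys="{key}")'
    )


def parallel_inner_join(left, right, key, worker_collection, workers):
    """Wrap an innerJoin() of two sorted streams in parallel()."""
    join = f'innerJoin({left}, {right}, on="{key}")'
    return (
        f'parallel({worker_collection}, {join}, '
        f'workers="{workers}", sort="{key} asc")'
    )


# Hypothetical collections and fields, for illustration only.
expr = parallel_inner_join(
    search("journals", "journal_id,journal_name", "journal_id"),
    search("articles", "journal_id,title", "journal_id"),
    key="journal_id",
    worker_collection="workers",
    workers=4,
)

# Assumed host and port; the expression is sent to the /stream handler.
url = "http://localhost:8983/solr/workers/stream?" + urlencode({"expr": expr})
print(expr)
```

Both inputs are sorted on the join key because innerJoin() merges two sorted streams. Whether the response time and QPS are acceptable still has to be measured against the real shard layout.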


Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dpg...@gmail.com> wrote:

> The Streaming API with Streaming Expressions (or Parallel SQL if you want
> to use SQL) can give you the functionality you're looking for. See
> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> and
> https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> SQL queries coming in through the Parallel SQL Interface are translated
> down into Streaming Expressions - if you need to do something that SQL
> doesn't yet support you should check out the Streaming Expressions to see
> if it can support it.
>
> With these you could store your data in separate collections (or the same
> collection with different docType field values) and then during search
> perform a join (inner, outer, hash) across the collections. You could, if
> you wanted, even join with data NOT in Solr using the jdbc streaming
> function.
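
A small sketch of the single-collection variant described above, assuming one collection whose documents carry a docType field. It only assembles a hashJoin() expression string. The collection name (catalog), the docType values, and the field names are hypothetical, and the fields are assumed to have docValues so that /export can be used.

```python
def typed_search(collection, doc_type, fl, key):
    """search() restricted to one docType value, sorted on the join key."""
    return (
        f'search({collection}, q="docType:{doc_type}", fl="{fl}", '
        f'sort="{key} asc", qt="/export")'
    )


def hash_join(left, hashed, key):
    """hashJoin() loads the `hashed` stream into memory and probes it with
    each tuple from `left`, so `hashed` should be the smaller side."""
    return f'hashJoin({left}, hashed={hashed}, on="{key}")'


# Hypothetical collection, docType values, and fields.
expr = hash_join(
    typed_search("catalog", "article", "journal_id,title", "journal_id"),
    typed_search("catalog", "journal", "journal_id,journal_name", "journal_id"),
    key="journal_id",
)
print(expr)
```

Here the journal side is hashed because there are far fewer journals than articles.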
>
> - Dennis Gove
>
>
> On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
> lat...@mdpi.com.invalid> wrote:
>
>> '*would I then be able to query a specific field of articles or other
>> "table" (with the same OR BETTER performances)?*'
>> -> And especially, would I be able to get only 1 article in the result...
>>
>> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
>>
>> Thanks Jack.
>>
>> I know that Solr is a search engine, but this replaces a search in my
>> MySQL DB with this model:
>>
>>
>> *My goal is to improve my environment (and my performances at the same
>> time).*
>>
>> *Yes, I have a Solr data model... but atm I created 4 different indexes
>> for "similar service usage".*
>> *So atm, for 70 million documents, I am duplicating journal data and
>> publisher data all the time in 1 index (for all articles from the same
>> journal/pub) in order to be able to retrieve all data in 1 query...*
>>
>> *I found yesterday that there is the possibility to create like an array
>> of <entity> in the data-conf.xml.*
>> e.g. (pseudo code - incomplete):
>> <entity name="solr_publisher" query="select id, name from publishers">
>>   <entity name="solr_journal" query="select id, name as j_name from journals
>>   WHERE publisher_id='${solr_publisher.id}'">
>>     <entity name="solr_articles" query="select id, title, abstract from articles
>>     WHERE journal_id='${solr_journal.id}'">
>>       <entity name="solr_authors" query="select given_name, last_name from
>>       authors WHERE article_id='${solr_articles.id}'"/>
>>     </entity>
>>   </entity>
>> </entity>
>>
>>
>> * Would this be a good option? Is this the denormalization you were
>> proposing? *
>>
>> *If yes, would I then be able to query a specific field of articles or
>> other "table" (with the same OR BETTER performances)? If yes, I might
>> probably merge all the different indexes together. *
>> *I'm currently joining everything in mysql, so duplicating the fields in
>> the solr (pseudo code):*
>> <entity  name="all" query="select * from articles INNER JOIN journal on
>> [...]">
>> *So I have an index for authors query, a general one for articles (only
>> needed info of other tables) ...*
>>
>> Thanks in advance for the tips. :)
>>
>> Kind regards,
>> Bastien
>>
>> On 14/04/2016 16:23, Jack Krupansky wrote:
>>
>> Solr is a search engine, not a database.
>>
>> JOINs? Although Solr does have some limited JOIN capabilities, they are
>> more for special situations, not the front-line go-to technique for data
>> modeling for search.
>>
>> Rather, denormalization is the front-line go-to technique for data
>> modeling in Solr.
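
As a sketch of what denormalization means for the publisher/journal/article/author model in this thread: each article becomes one flat document that carries copies of its journal and publisher values, with authors stored as a multi-valued field. The column names follow the pseudo code quoted in this thread; the Solr field names (journal_name, publisher_name, author_names), the id scheme, and the sample rows are assumptions made up for illustration.

```python
def denormalize(article, journal, publisher, authors):
    """Flatten one article and its related rows into a single Solr document."""
    return {
        "id": f"article-{article['id']}",
        "title": article["title"],
        "abstract": article["abstract"],
        # Journal and publisher values are copied into every article document.
        "journal_name": journal["name"],
        "publisher_name": publisher["name"],
        # The many-to-many side becomes a multi-valued field.
        "author_names": [f"{a['given_name']} {a['last_name']}" for a in authors],
    }


# Made-up sample rows standing in for the MySQL tables.
publisher = {"id": 1, "name": "Example Press"}
journal = {"id": 10, "publisher_id": 1, "name": "Journal of Examples"}
article = {"id": 100, "journal_id": 10, "title": "A title", "abstract": "An abstract"}
authors = [
    {"given_name": "Ada", "last_name": "Lovelace"},
    {"given_name": "Alan", "last_name": "Turing"},
]

doc = denormalize(article, journal, publisher, authors)
print(doc["author_names"])
```

The cost is the duplication of journal and publisher values across articles; the benefit is that one query on one index returns everything a service needs.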
>>
>> In any case, the first step in data modeling is always to focus on your
>> queries - what information will be coming into your apps and what
>> information will the apps want to access based on those inputs.
>>
>> But wait... you say you are upgrading, which suggests that you have an
>> existing Solr data model, and probably queries as well. So...
>>
>> 1. Share at least a summary of your existing Solr data model as well as
>> at least a summary of the kinds of queries you perform today.
>> 2. Tell us what exactly is driving your inquiry - are queries too slow,
>> too cumbersome, not sufficiently powerful, or... what exactly is the
>> problem you need to solve.
>>
>>
>> -- Jack Krupansky
>>
>> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
>> lat...@mdpi.com.invalid> wrote:
>>
>>> Hi Guys,
>>>
>>> *I am upgrading from solr 4.2 to 6.0.*
>>> *I successfully (after some time) migrated the config files and other
>>> parameters...*
>>>
>>> Now I'm just wondering if my indexes are following the best
>>> practices...(and they are probably not :-) )
>>>
>>> What would be the best if we have this kind of sql data to write in Solr:
>>>
>>>
>>> I have several different services which need (more or less), different
>>> data based on these JOINs...
>>>
>>> e.g.:
>>> Service A needs lots of data (but not all),
>>> Service B needs a little data (some fields already included in A),
>>> Service C needs a bit more data than B (some fields already included in
>>> A/B)...
>>>
>>> *1. Would it be better to create one single index?*
>>> *-> i.e.: this will duplicate journal info for every single article*
>>>
>>> *2. Would it be better to create several specific indexes for each
>>> group of similar services?*
>>> *-> i.e.: this will use more space on the disks (and there are
>>> ~70 million documents to join)*
>>>
>>> *3. Would it be better to create an index per table and make a join?*
>>> *-> if yes, how??*
>>>
>>> Kind regards,
>>> Bastien
>>>
>>>
>>
>> Kind regards,
>> Bastien Latard
>> Web engineer
>> --
>> MDPI AG
>> Postfach, CH-4005 Basel, Switzerland
>> Office: Klybeckstrasse 64, CH-4057
>> Tel. +41 61 683 77 35
>> Fax: +41 61 302 89 18
>> E-mail: latard@mdpi.com http://www.mdpi.com/
>>
>>
>
