Re: Solr best practices for many to many relations...

Bastien Latard - MDPI AG Sun, 17 Apr 2016 23:11:31 -0700

Thanks everybody.

Your answers are very interesting, but I'm not sure I understand them properly (sorry, I'm not an expert; it may be obvious to you).


*When you're speaking about denormalization, does it mean:

1. something like that?*

   <!-- id columns added to each select so the ${...id} references resolve -->
   <entity name="solr_publisher" query="select id, name from publishers">
     <entity name="solr_journal" query="select id, name as j_name from
       journals WHERE publisher_id='${solr_publisher.id}'">
       <entity name="solr_articles" query="select id, title, abstract from
         articles WHERE journal_id='${solr_journal.id}'">
         <entity name="solr_authors" query="select given_name, last_name from
           authors WHERE article_id='${solr_articles.id}'"/>
       </entity>
     </entity>
   </entity>
   */-> I think that the answer is "no".../*


*2. One separate index for each SQL table?*

/-> If yes, how can I then retrieve all the needed data (i.e. the intersection)? JOIN/Streaming expressions?/

Otherwise, when you're speaking about JOIN, is it a join between 2 different indexes, or between several fields of the same index?

/Reminder: there are around 68 million articles, each linked to 1 journal and 1 publisher. And I have 8 different services requesting the data (so I cannot really provide a specific use case)./

*Would it be better/faster to query a single denormalized index (all the data in the same place **/- but a larger index because of duplicated data/**), or to query several indexes (smaller indexes, but needing a Solr "join")?*


Thanks.

Kind regards,
Bastien


On 15/04/2016 17:20, Jack Krupansky wrote:

And it may also be that there are whole classes of user for whom
denormalization is just too heavy a cross to bear and for whom a little
extra money spent on more hardware is a great tradeoff.

And... Lucene's indexing may be superior to your average SQL database, so
that a Solr JOIN could be so much better than your average RDBMS SQL JOIN.
That would be an interesting benchmark.

-- Jack Krupansky

On Fri, Apr 15, 2016 at 11:06 AM, Joel Bernstein <joels...@gmail.com> wrote:

I think people are going to be surprised though by the speed of the joins.
The joins also get faster as the number of shards, replicas and worker
nodes grows in the cluster. So we may see people building out large clusters
and using the joins in OLTP scenarios.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 15, 2016 at 10:58 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

And of course it depends on the specific queries, both in terms of what
fields will be searched and which fields need to be returned.

Yes, OLAP is the clear sweet spot, where taking 500 ms to 2 or even 20
seconds for a complex query may be just fine vs. OLTP/search where under
150 ms is the target. But, again, it will depend on the nature of the
query, the cardinality of each search field, the cross product of
cardinality of search fields, etc.

-- Jack Krupansky

On Fri, Apr 15, 2016 at 10:44 AM, Joel Bernstein <joels...@gmail.com> wrote:

In general the Streaming Expression joins are designed for interactive
OLAP type workloads. So BI and data warehousing scenarios are the sweet
spot. There may be scenarios where high QPS search applications will work
with the distributed joins, particularly if the joins themselves are not
huge. But the specific use cases need to be tested.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

It will be interesting to see which use cases work best with the new
streaming JOIN vs. which will remain best with full denormalization, or
whether you simply have to try both and benchmark them.

My impression had been that streaming JOIN would be ideal for bulk
operations rather than traditional-style search queries. Maybe there are
three use cases: bulk read based on broad criteria, top-n relevance search
query, and specific document (or small number of documents) based on
multiple fields.

My suspicion is that doing JOIN on five tables will likely be slower than
accessing a single document of a denormalized table/index.

-- Jack Krupansky

On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <joels...@gmail.com> wrote:

Solr now has full distributed join capabilities as part of the Streaming
Expression library. Keep in mind that these are distributed joins, so they
shuffle records to worker nodes to perform the joins. These are comparable
to joins done by SQL over MapReduce systems, but they are very responsive
and can respond with sub-second response time for fairly large joins in
parallel mode. But these joins do lend themselves to large distributed
architectures (lots of shards and replicas). Target QPS also needs to be
taken into account and tested in deciding whether these joins will meet the
specific use case.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dpg...@gmail.com> wrote:

The Streaming API with Streaming Expressions (or Parallel SQL if you want
to use SQL) can give you the functionality you're looking for. See
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
and
https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
SQL queries coming in through the Parallel SQL Interface are translated
down into Streaming Expressions - if you need to do something that SQL
doesn't yet support you should check out the Streaming Expressions to see
if it can support it.

With these you could store your data in separate collections (or the same
collection with different docType field values) and then during search
perform a join (inner, outer, hash) across the collections. You could, if
you wanted, even join with data NOT in Solr using the jdbc streaming
function.
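
A minimal sketch of what such a cross-collection join could look like, assuming two hypothetical collections named `articles` and `journals` that share a `journal_id` field, and a hypothetical Solr host at `localhost:8983`. None of these names come from the thread. The script only builds the expression string and the request URL for the `/stream` handler; it does not contact a server. Both inner streams are sorted on the join key, since `innerJoin` merges sorted streams, and the `/export` handler is used so that whole result sets can be streamed.

```python
from urllib.parse import urlencode


def search_expr(collection, fields, sort_field):
    """Build a search() source that streams all rows sorted by the join key."""
    field_list = ",".join(fields)
    return (
        f'search({collection}, q="*:*", fl="{field_list}", '
        f'sort="{sort_field} asc", qt="/export")'
    )


def inner_join_expr(left, right, on):
    """Wrap two sorted streams in an innerJoin() on a shared field."""
    return f'innerJoin({left}, {right}, on="{on}")'


# Hypothetical collections and fields, for illustration only.
articles = search_expr("articles", ["id", "title", "journal_id"], "journal_id")
journals = search_expr("journals", ["journal_id", "j_name"], "journal_id")
expr = inner_join_expr(articles, journals, "journal_id")

# The expression is sent as the "expr" parameter of the /stream handler.
url = "http://localhost:8983/solr/articles/stream?" + urlencode({"expr": expr})

print(expr)
print(url)
```

Whether this is fast enough for a given service still has to be measured against the target QPS, as noted elsewhere in the thread.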

- Dennis Gove


On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:

'*would I then be able to query a specific field of articles or other "table" (with the same OR BETTER performance)?*'
-> And especially, would I be able to get only 1 article in the result...

On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:

Thanks Jack.

I know that Solr is a search engine, but this replaces a search in my MySQL DB with this model:


*My goal is to improve my environment (and my performance at the same time).*

*Yes, I have a Solr data model... but atm I created 4 different indexes for "similar service usage".*
*So atm, for 70 million documents, I am duplicating journal data and publisher data all the time in 1 index (for all articles from the same journal/pub) in order to be able to retrieve all data in 1 query...*

*I found yesterday that it is possible to create something like an array of <entity> in the data-conf.xml.*
e.g. (pseudo code - incomplete):
<entity name="solr_publisher" query="select id, name from publishers">
  <entity name="solr_journal" query="select id, name as j_name from journals WHERE publisher_id='${solr_publisher.id}'">
    <entity name="solr_articles" query="select id, title, abstract from articles WHERE journal_id='${solr_journal.id}'">
      <entity name="solr_authors" query="select given_name, last_name from authors WHERE article_id='${solr_articles.id}'"/>
    </entity>
  </entity>
</entity>


*Would this be a good option? Is this the denormalization you were proposing?*

*If yes, would I then be able to query a specific field of articles or other "table" (with the same OR BETTER performance)? If yes, I might probably merge all the different indexes together.*
*I'm currently joining everything in MySQL, so duplicating the fields in Solr (pseudo code):*
<entity name="all" query="select * from articles INNER JOIN journal on [...]">
*So I have an index for author queries, a general one for articles (only the needed info from other tables) ...*

Thanks in advance for the tips. :)

Kind regards,
Bastien

On 14/04/2016 16:23, Jack Krupansky wrote:

Solr is a search engine, not a database.

JOINs? Although Solr does have some limited JOIN capabilities, they are
more for special situations, not the front-line go-to technique for data
modeling for search.
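
As a rough illustration of that limited JOIN, here is a sketch that builds query strings for Solr's join query parser (`{!join from=... to=...}`), both within one index and across two indexes via `fromIndex`. The field names (`id`, `journal_id`, `doc_type`, `j_name`), the `journals` index name, and the journal value are assumptions made for the example, not taken from the thread. The join behaves like a filter: it returns only documents from the "to" side, so fields from the "from" side (for example the journal name) are not included in the results.

```python
def join_query(from_field, to_field, inner_query, from_index=None):
    """Build a query string for Solr's join query parser.

    inner_query selects the "from" documents; the result set contains the
    documents whose to_field matches the from_field of those documents.
    """
    local_params = f"from={from_field} to={to_field}"
    if from_index is not None:
        # Cross-index join: the "from" documents live in another index.
        local_params += f" fromIndex={from_index}"
    return "{!join " + local_params + "}" + inner_query


# Same index: journals and articles distinguished by a doc_type field.
same_index = join_query("id", "journal_id", "doc_type:journal AND j_name:Example")

# Two indexes: journal documents live in a separate "journals" index.
cross_index = join_query("id", "journal_id", "j_name:Example", from_index="journals")

print(same_index)
print(cross_index)
```

Because only one side's fields come back, a service that needs article, journal and publisher data in a single response would still need a second lookup or a denormalized document.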

Rather, denormalization is the front-line go-to technique for data
modeling in Solr.
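
To make "denormalization" concrete, here is a minimal sketch that flattens relational rows into one self-contained document per article, with the journal and publisher names copied into every article and the authors stored as a multi-valued field. The table layout follows the publisher/journal/article/author model described in this thread, but the exact column names, the output field names and the sample rows are assumptions for illustration.

```python
def denormalize(articles, journals, publishers, authors):
    """Flatten the relational rows into one flat document per article."""
    journal_by_id = {j["id"]: j for j in journals}
    publisher_by_id = {p["id"]: p for p in publishers}

    # Group author names by article so they fit a multi-valued field.
    authors_by_article = {}
    for a in authors:
        full_name = a["given_name"] + " " + a["last_name"]
        authors_by_article.setdefault(a["article_id"], []).append(full_name)

    docs = []
    for art in articles:
        journal = journal_by_id[art["journal_id"]]
        publisher = publisher_by_id[journal["publisher_id"]]
        docs.append({
            "id": art["id"],
            "title": art["title"],
            "abstract": art["abstract"],
            # Journal and publisher data are duplicated into every article.
            "j_name": journal["name"],
            "publisher_name": publisher["name"],
            "authors": authors_by_article.get(art["id"], []),
        })
    return docs


# Hypothetical sample rows.
publishers = [{"id": 1, "name": "Example Publisher"}]
journals = [{"id": 10, "name": "Example Journal", "publisher_id": 1}]
articles = [
    {"id": 100, "title": "First", "abstract": "...", "journal_id": 10},
    {"id": 101, "title": "Second", "abstract": "...", "journal_id": 10},
]
authors = [
    {"article_id": 100, "given_name": "Jane", "last_name": "Doe"},
    {"article_id": 100, "given_name": "John", "last_name": "Roe"},
]

docs = denormalize(articles, journals, publishers, authors)
for doc in docs:
    print(doc)
```

The cost is the duplicated journal and publisher data on every article; the benefit is that a single query against a single index returns everything a service needs.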

In any case, the first step in data modeling is always to focus on your
queries - what information will be coming into your apps and what
information will the apps want to access based on those inputs.

But wait... you say you are upgrading, which suggests that you have an
existing Solr data model, and probably queries as well. So...

1. Share at least a summary of your existing Solr data model as well as
at least a summary of the kinds of queries you perform today.
2. Tell us what exactly is driving your inquiry - are queries too slow,
too cumbersome, not sufficiently powerful, or... what exactly is the
problem you need to solve.


-- Jack Krupansky

On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:

Hi Guys,

*I am upgrading from Solr 4.2 to 6.0.*
*I successfully (after some time) migrated the config files and other parameters...*

Now I'm just wondering if my indexes are following the best
practices...(and they are probably not :-) )

What would be the best approach if we have this kind of SQL data to write in Solr:


I have several different services which need (more or less) different data based on these JOINs...

e.g.:
Service A needs lots of data (but not all),
Service B needs a little data (some fields already included in A),
Service C needs a bit more data than B (some fields already included in A/B)...

*1. Would it be better to create one single index?*
*-> i.e.: this will duplicate journal info for every single article*

*2. Would it be better to create several specific indexes for each group of similar services?*
*-> i.e.: this will use more space on the disks (and there are ~70 million documents to join)*

*3. Would it be better to create an index per table and make a join?*
*-> if yes, how??*

Kind regards,
Bastien

Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail: latard@mdpi.com
http://www.mdpi.com/


