Thanks everybody.
Your answers are very interesting, but I'm not sure I understand them
properly (sorry, I'm not an expert, so this may be obvious to you).
*When you speak about denormalization, do you mean:
1. something like this?*
<entity name="solr_publisher" query="select id, name from publishers">
  <entity name="solr_journal" query="select id, name as j_name from
      journals WHERE publisher_id='${solr_publisher.id}'">
    <entity name="solr_articles" query="select id, title, abstract from
        articles WHERE journal_id='${solr_journal.id}'">
      <entity name="solr_authors" query="select given_name, last_name from
          authors WHERE article_id='${solr_articles.id}'"/>
    </entity>
  </entity>
</entity>
*/-> I think that the answer is "no".../*
*2. One separate index for each SQL table?*
/-> If yes, how can I then retrieve all the needed data (i.e. the
intersection)? With a JOIN or Streaming Expressions?/
Otherwise, when you speak about a JOIN, is it a join between 2
different indexes, or between several fields of the same index?
/Reminder: there are around 68 million articles, each of which is
linked to 1 journal and 1 publisher. And I have 8 different services
requesting the data (so I cannot really provide a specific use case)./
*Would it be better/faster to query a single denormalized index (all the
data in one place, /but a larger index because of duplicated
data/), or to query several indexes (smaller indexes, but requiring
a Solr "join")?*
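To make the two options concrete, here is a minimal sketch of what "denormalization" usually means in this context: one flat document per article, with the journal and publisher fields copied into it. The sample rows and every field name below are hypothetical stand-ins for the SQL tables described in this thread, not the real schema.

```python
# Hypothetical sample rows standing in for the SQL tables; all names are assumptions.
publishers = {1: {"name": "Publisher A"}}
journals = {10: {"name": "Journal X", "publisher_id": 1}}
articles = [
    {"id": 100, "title": "First article", "abstract": "Abstract one", "journal_id": 10},
    {"id": 101, "title": "Second article", "abstract": "Abstract two", "journal_id": 10},
]
authors = [
    {"article_id": 100, "given_name": "Ada", "last_name": "Lovelace"},
    {"article_id": 100, "given_name": "Alan", "last_name": "Turing"},
]


def denormalize(articles, journals, publishers, authors):
    """Build one flat document per article, copying parent fields into it."""
    # Group author names by article so they can become a multi-valued field.
    authors_by_article = {}
    for author in authors:
        full_name = author["given_name"] + " " + author["last_name"]
        authors_by_article.setdefault(author["article_id"], []).append(full_name)

    docs = []
    for article in articles:
        journal = journals[article["journal_id"]]
        publisher = publishers[journal["publisher_id"]]
        docs.append({
            "id": article["id"],
            "title": article["title"],
            "abstract": article["abstract"],
            # Journal and publisher data are repeated in every article document.
            "journal_name": journal["name"],
            "publisher_name": publisher["name"],
            "authors": authors_by_article.get(article["id"], []),
        })
    return docs


docs = denormalize(articles, journals, publishers, authors)
for doc in docs:
    print(doc)
```

Each article document carries its own copy of `journal_name` and `publisher_name`, which is the duplication being asked about; in exchange, a single query against one index returns everything for an article without a join.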
Thanks.
Kind regards,
Bastien
On 15/04/2016 17:20, Jack Krupansky wrote:
And it may also be that there are whole classes of user for whom
denormalization is just too heavy a cross to bear and for whom a little
extra money spent on more hardware is a great tradeoff.
And... Lucene's indexing may be superior to your average SQL database, so
that a Solr JOIN could be so much better than your average RDBMS SQL JOIN.
That would be an interesting benchmark.
-- Jack Krupansky
On Fri, Apr 15, 2016 at 11:06 AM, Joel Bernstein <joels...@gmail.com> wrote:
I think people are going to be surprised though by the speed of the joins.
The joins also get faster as the number of shards, replicas and worker
nodes grows in the cluster. So we may see people building out large clusters
and using the joins in OLTP scenarios.
Joel Bernstein
http://joelsolr.blogspot.com/
On Fri, Apr 15, 2016 at 10:58 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
And of course it depends on the specific queries, both in terms of what
fields will be searched and which fields need to be returned.
Yes, OLAP is the clear sweet spot, where taking 500 ms to 2 or even 20
seconds for a complex query may be just fine vs. OLTP/search where under
150 ms is the target. But, again, it will depend on the nature of the
query, the cardinality of each search field, the cross product of
cardinality of search fields, etc.
-- Jack Krupansky
On Fri, Apr 15, 2016 at 10:44 AM, Joel Bernstein <joels...@gmail.com> wrote:
In general the Streaming Expression joins are designed for interactive
OLAP-type workloads. So BI and data warehousing scenarios are the sweet
spot. There may be scenarios where high-QPS search applications will work
with the distributed joins, particularly if the joins themselves are not
huge. But the specific use cases need to be tested.
Joel Bernstein
http://joelsolr.blogspot.com/
On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
It will be interesting to see which use cases work best with the new
streaming JOIN vs. which will remain best with full denormalization, or
whether you simply have to try both and benchmark them.
My impression had been that streaming JOIN would be ideal for bulk
operations rather than traditional-style search queries. Maybe there are
three use cases: bulk read based on broad criteria, top-n relevance search
query, and specific document (or small number of documents) based on
multiple fields.
My suspicion is that doing JOIN on five tables will likely be slower than
accessing a single document of a denormalized table/index.
-- Jack Krupansky
On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <joels...@gmail.com> wrote:
Solr now has full distributed join capabilities as part of the Streaming
Expression library. Keep in mind that these are distributed joins, so they
shuffle records to worker nodes to perform the joins. These are comparable
to joins done by SQL over MapReduce systems, but they are very responsive
and can respond with sub-second response time for fairly large joins in
parallel mode. But these joins do lend themselves to large distributed
architectures (lots of shards and replicas). Target QPS also needs to be
taken into account and tested in deciding whether these joins will meet
the specific use case.
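As a rough illustration of the kind of work a join over sorted record streams involves, here is a toy single-process merge join in Python. It is not Solr's implementation and does none of the shuffling to worker nodes described above; it only assumes that both inputs arrive already sorted on their join key. The record fields (`journal_id`, `j_id`, `j_name`) are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right, left_key, right_key):
    """Inner-join two iterables of dicts that are already sorted on their join keys."""
    left_groups = groupby(left, key=itemgetter(left_key))
    right_groups = groupby(right, key=itemgetter(right_key))
    done = (None, None)  # sentinel returned once a stream is exhausted
    lk, lrows = next(left_groups, done)
    rk, rrows = next(right_groups, done)
    while lrows is not None and rrows is not None:
        if lk < rk:
            # Left key has no partner yet: advance the left stream.
            lk, lrows = next(left_groups, done)
        elif lk > rk:
            # Right key has no partner yet: advance the right stream.
            rk, rrows = next(right_groups, done)
        else:
            # Keys match: emit every left/right combination for this key.
            right_buffer = list(rrows)
            for left_row in lrows:
                for right_row in right_buffer:
                    yield {**left_row, **right_row}
            lk, lrows = next(left_groups, done)
            rk, rrows = next(right_groups, done)


# Hypothetical records, each list sorted on its join key.
articles = [
    {"article_id": 100, "title": "A", "journal_id": 10},
    {"article_id": 101, "title": "B", "journal_id": 10},
    {"article_id": 102, "title": "C", "journal_id": 12},
]
journals = [
    {"j_id": 10, "j_name": "Journal X"},
    {"j_id": 11, "j_name": "Journal Y"},
    {"j_id": 12, "j_name": "Journal Z"},
]

joined = list(merge_join(articles, journals, "journal_id", "j_id"))
for row in joined:
    print(row)
```

Because each stream is read once in key order, the cost grows with the number of records streamed rather than with their product, which is one reason the size of the join and the target QPS both matter when testing.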
Joel Bernstein
http://joelsolr.blogspot.com/
On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dpg...@gmail.com> wrote:
The Streaming API with Streaming Expressions (or Parallel SQL if you want
to use SQL) can give you the functionality you're looking for. See
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
and
https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
SQL queries coming in through the Parallel SQL Interface are translated
down into Streaming Expressions. If you need to do something that SQL
doesn't yet support, you should check out the Streaming Expressions to see
if they can support it.
With these you could store your data in separate collections (or the same
collection with different docType field values) and then during search
perform a join (inner, outer, hash) across the collections. You could, if
you wanted, even join with data NOT in Solr using the jdbc streaming
function.
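For orientation, a join across two collections might be expressed roughly as below. This is only a sketch: the collection names (`articles`, `journals`) and field names are assumptions, and the Python helpers merely assemble the expression text and the `expr` request parameter that would be sent to a collection's `/stream` handler; nothing is sent here. As far as I know, `innerJoin` expects both inner streams to be sorted on the join field, and `qt="/export"` requires docValues on the exported and sorted fields, so check both against the Streaming Expressions page linked above.

```python
from urllib.parse import urlencode


def search_expr(collection, fields, sort_field, query="*:*"):
    """Render a search() source that streams the full result set sorted on one field."""
    field_list = ",".join(fields)
    return (
        f'search({collection}, q="{query}", fl="{field_list}", '
        f'sort="{sort_field} asc", qt="/export")'
    )


def inner_join_expr(left, right, on):
    """Wrap two stream sources in an innerJoin() on the given field pairing."""
    return f'innerJoin({left}, {right}, on="{on}")'


# Hypothetical collections and fields: one document per article, one per journal.
expr = inner_join_expr(
    search_expr("articles", ["id", "title", "journal_id"], "journal_id"),
    search_expr("journals", ["j_id", "j_name"], "j_id"),
    on="journal_id=j_id",
)

# The expression is passed as the "expr" parameter of the /stream request.
query_string = urlencode({"expr": expr})

print(expr)
print(query_string)
```

Restricting the `q` of the `articles` source to a single id would be one way to get only 1 article back with its journal fields attached, at the cost of running the join at query time.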
- Dennis Gove
On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <lat...@mdpi.com.invalid> wrote:
'*would I then be able to query a specific field of articles or another
"table" (with the same OR BETTER performance)?*'
-> And especially, would I be able to get only 1 article in the result...
On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
Thanks Jack.
I know that Solr is a search engine, but this replaces a search in my
MySQL DB with this model:
*My goal is to improve my environment (and my performance at the same
time).*
*Yes, I have a Solr data model... but atm I have created 4 different
indexes for "similar service usage".*
*So atm, for 70 million documents, I am duplicating journal data and
publisher data all the time in 1 index (for all articles from the same
journal/pub) in order to be able to retrieve all data in 1 query...*
*I found out yesterday that it is possible to create something like an
array of <entity> elements in the data-conf.xml.*
e.g. (pseudo code - incomplete):
<entity name="solr_publisher" query="select id, name from publishers">
  <entity name="solr_journal" query="select id, name as j_name from
      journals WHERE publisher_id='${solr_publisher.id}'">
    <entity name="solr_articles" query="select id, title, abstract from
        articles WHERE journal_id='${solr_journal.id}'">
      <entity name="solr_authors" query="select given_name, last_name from
          authors WHERE article_id='${solr_articles.id}'"/>
    </entity>
  </entity>
</entity>
*Would this be a good option? Is this the denormalization you were
proposing?*
*If yes, would I then be able to query a specific field of articles or
another "table" (with the same OR BETTER performance)? If yes, I would
probably merge all the different indexes together.*
*I'm currently joining everything in MySQL, and so duplicating the fields
in Solr (pseudo code):*
<entity name="all" query="select * from articles INNER JOIN journal on [...]">
*So I have an index for author queries, a general one for articles (with
only the needed info from other tables) ...*
Thanks in advance for the tips. :)
Kind regards,
Bastien
On 14/04/2016 16:23, Jack Krupansky wrote:
Solr is a search engine, not a database.
JOINs? Although Solr does have some limited JOIN capabilities, they are
more for special situations, not the front-line go-to technique for data
modeling for search.
Rather, denormalization is the front-line go-to technique for data
modeling in Solr.
In any case, the first step in data modeling is always to focus on your
queries - what information will be coming into your apps and what
information will the apps want to access based on those inputs.
But wait... you say you are upgrading, which suggests that you have an
existing Solr data model, and probably queries as well. So...
1. Share at least a summary of your existing Solr data model as well as
at least a summary of the kinds of queries you perform today.
2. Tell us what exactly is driving your inquiry - are queries too slow,
too cumbersome, not sufficiently powerful, or... what exactly is the
problem you need to solve.
-- Jack Krupansky
On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <lat...@mdpi.com.invalid> wrote:
Hi Guys,
*I am upgrading from Solr 4.2 to 6.0.*
*I successfully (after some time) migrated the config files and other
parameters...*
Now I'm just wondering if my indexes are following the best
practices... (and they are probably not :-) )
What would be best if we have this kind of SQL data to write into Solr:
I have several different services which need (more or less) different
data based on these JOINs...
e.g.:
Service A needs lots of data (but not all),
Service B needs a little data (some fields already included in A),
Service C needs a bit more data than B (some fields already included in
A/B)...
*1. Would it be better to create one single index?*
*-> i.e. this will duplicate journal info for every single article*
*2. Would it be better to create several specific indexes, one for each
group of similar services?*
*-> i.e. this will use more space on disk (and there are ~70 million
documents to join)*
*3. Would it be better to create an index per table and make a join?*
*-> if yes, how?*
Kind regards,
Bastien
Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail: latard@mdpi.com
http://www.mdpi.com/