Re: Performance of cross join vs block join

Roman Chyla Mon, 22 Jul 2013 12:31:35 -0700

Hello Mikhail,

PS: I am sending this to solr-user as well; I've realized I was writing just to
you, sorry...


On Mon, Jul 22, 2013 at 3:07 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello Roman,
>
> Please get me right. I have no idea what happened with that dependency.
> There are recent patches from Yonik; they should be more up to date, and I
> think he can help you with particular issues. From common (captain's)
> sense, I propose specifying any nearby version of jetty; I don't think there
> is much reason to rely on that particular one.
>
> I think about your problem from time to time. You are right, it's
> definitely not a case for block join. I'm still trying to figure out how to
> make it computationally easier. As far as I understand, you have a recursive
> many-to-many relationship and need to traverse it during the search.
>
> doc(id, author, text, references:[docid,....] )
>
> I'm not sure it's possible with Lucene now, but if it is, what do you think
> about writing a DocValues stripe that contains internal Lucene docnums instead
> of external docIds? It moves a few steps from query time to index time, and
> hence can gain some performance.
>

Our use case of many-to-many relations is probably a weird one and we ought
to de-normalize the values. What I do (building a citation network in
memory, using Lucene caches) is just a work-around that happens to
out-perform index seeking, which is no surprise, but at the expense of
memory. I am aware that de-normalization may be necessary, and DocValues
would probably be a step towards it. The joins give great flexibility,
which is really cool, but that comes with its own price...
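
A minimal sketch of the in-memory work-around described above, combined with
the idea of storing internal docnums rather than external ids. It is plain
Java with no Lucene or Solr classes, and every name in it (CitationCacheSketch,
referencedBy, the sample ids) is hypothetical, chosen only for illustration.
The assumption is that external ids are resolved to doc numbers once, so that
following citations at query time is just array reads, paid for with memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: this is not the Lucene/Solr join code.
final class CitationCacheSketch {
    // references[docnum] holds the internal doc numbers cited by that document.
    private final int[][] references;

    // Resolve external ids to internal doc numbers once, up front.
    CitationCacheSketch(String[] idByDocnum, Map<String, List<String>> refsById) {
        Map<String, Integer> docnumById = new HashMap<>();
        for (int doc = 0; doc < idByDocnum.length; doc++) {
            docnumById.put(idByDocnum[doc], doc);
        }
        references = new int[idByDocnum.length][];
        for (int doc = 0; doc < idByDocnum.length; doc++) {
            List<String> refs =
                refsById.getOrDefault(idByDocnum[doc], Collections.emptyList());
            List<Integer> resolved = new ArrayList<>();
            for (String ref : refs) {
                Integer target = docnumById.get(ref);
                if (target != null) {
                    resolved.add(target); // references to unknown ids are skipped
                }
            }
            references[doc] = new int[resolved.size()];
            for (int i = 0; i < resolved.size(); i++) {
                references[doc][i] = resolved.get(i);
            }
        }
    }

    // Query time: follow the edges of every matching document via array reads.
    BitSet referencedBy(BitSet matches) {
        BitSet result = new BitSet(references.length);
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            for (int target : references[doc]) {
                result.set(target);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] ids = {"A", "B", "C", "D"};
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("A", Arrays.asList("B", "C"));
        refs.put("B", Arrays.asList("C"));
        refs.put("D", Arrays.asList("A", "missing"));
        CitationCacheSketch cache = new CitationCacheSketch(ids, refs);

        BitSet matches = new BitSet();
        matches.set(0); // "A" matched the inner query
        matches.set(3); // "D" matched the inner query
        System.out.println(cache.referencedBy(matches)); // prints {0, 1, 2}
    }
}
```

The trade-off is the one named above: the int[][] costs memory for every edge
in the network, but a request no longer has to seek terms for each matching
document.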


>
> Also, I noticed you are hesitant about cross-segment joins. You
> actually shouldn't be, for the following reasons:
>  - Join is Solr code (which is a top-reader beast);
>  - it obtains and works with SolrIndexSearcher, which is a top reader...
>  - the join happens at Weight without any awareness of leaf segments.
>
> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/JoinQParserPlugin.java#L272
>

Thanks. I believe I have not used it because there was a very small
chance it could have been fast enough. It reads terms/joins for docs
that match the query, so in that sense it is no different from
pre-computing the citation cache - but it happens for every query/request,
and so for 0.5M edges it must take some time. But I guess I should
measure it. I haven't made notes, so now I am having a hard time backtracking
:)

roman


> It seems to me cross segment join works well.
>
>
>
> On Mon, Jul 22, 2013 at 3:08 AM, Roman Chyla <roman.ch...@gmail.com>wrote:
>
>> Ah, in case you know the solution, here is the ant output:
>>
>> resolve:
>> [ivy:retrieve]
>> [ivy:retrieve] :: problems summary ::
>> [ivy:retrieve] :::: WARNINGS
>> [ivy:retrieve] module not found:
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] ==== local: tried
>> [ivy:retrieve]  
>> /home/rchyla/.ivy2/local/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/ivys/ivy.xml
>> [ivy:retrieve]   -- artifact
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
>> [ivy:retrieve]  
>> /home/rchyla/.ivy2/local/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/jars/jetty-deploy.jar
>> [ivy:retrieve] ==== shared: tried
>> [ivy:retrieve]  
>> /home/rchyla/.ivy2/shared/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/ivys/ivy.xml
>> [ivy:retrieve]   -- artifact
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
>> [ivy:retrieve]  
>> /home/rchyla/.ivy2/shared/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/jars/jetty-deploy.jar
>> [ivy:retrieve] ==== public: tried
>> [ivy:retrieve]
>> http://repo1.maven.org/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
>> [ivy:retrieve] ==== sonatype-releases: tried
>> [ivy:retrieve]
>> http://oss.sonatype.org/content/repositories/releases/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
>> [ivy:retrieve]   -- artifact
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
>> [ivy:retrieve]
>> http://oss.sonatype.org/content/repositories/releases/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.jar
>> [ivy:retrieve] ==== maven.restlet.org: tried
>> [ivy:retrieve]
>> http://maven.restlet.org/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
>> [ivy:retrieve]   -- artifact
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
>> [ivy:retrieve]
>> http://maven.restlet.org/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.jar
>> [ivy:retrieve] ==== working-chinese-mirror: tried
>> [ivy:retrieve]
>> http://mirror.netcologne.de/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
>> [ivy:retrieve]   -- artifact
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
>> [ivy:retrieve]
>> http://mirror.netcologne.de/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.jar
>> [ivy:retrieve] ::::::::::::::::::::::::::::::::::::::::::::::
>> [ivy:retrieve] ::          UNRESOLVED DEPENDENCIES         ::
>> [ivy:retrieve] ::::::::::::::::::::::::::::::::::::::::::::::
>> [ivy:retrieve] :: org.eclipse.jetty#jetty-deploy;8.1.10.v20130312: not
>> found
>> [ivy:retrieve] ::::::::::::::::::::::::::::::::::::::::::::::
>> [ivy:retrieve] :::: ERRORS
>> [ivy:retrieve] impossible to acquire lock for
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for
>> org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve]
>> [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>>
>> BUILD FAILED
>> /dvt/workspace/lucene-3076/build.xml:39: The following error occurred
>> while executing this line:
>> /dvt/workspace/lucene-3076/solr/build.xml:181: The following error
>> occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/common-build.xml:450: The following error
>> occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/common-build.xml:390: The following error
>> occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/common-build.xml:336: The following error
>> occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/common-build.xml:356: The following error
>> occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/common-build.xml:405: The following error
>> occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/example/build.xml:54: impossible to
>> resolve dependencies:
>>  resolve failed - see output for details
>>
>>
>>
>> On Sun, Jul 21, 2013 at 7:06 PM, Roman Chyla <roman.ch...@gmail.com>wrote:
>>
>>> Hello Mikhail,
>>>
>>> I have applied the patch (cleanly), but am getting a silly 'unresolved
>>> dependencies' from ivy when running 'ant test' - I know I could solve this,
>>> but I wonder whether you want to update your patch (well, I am just
>>> guessing that it specifies some non-available version of jetty [??] - or
>>> maybe it is the latest trunk)
>>>
>>>
>>> But it occurred to me: the block joins rely on the segments, right? So
>>> every child must be in the same index segment? Can you confirm, please? I can
>>> run the test, but if this is the case, we'll be getting different results
>>> (a different number of joins).
>>>
>>> roman
>>>
>>>
>>> On Fri, Jul 12, 2013 at 6:22 PM, Roman Chyla <roman.ch...@gmail.com>wrote:
>>>
>>>> Hello Mikhail,
>>>>
>>>> If 3076 is easy to apply over the current trunk, I can try to run it
>>>> some time next week - not much of a testing framework, just a bunch of
>>>> python scripts, but I can run it against the 200GB index to produce a
>>>> comparison.
>>>>
>>>> So my tests were run against Lucene's join,
>>>> http://lucene.apache.org/core/4_0_0/join/org/apache/lucene/search/join/JoinUtil.html
>>>>
>>>> The join query is created as:
>>>>
>>>> return JoinUtil.createJoinQuery(idField, false, refField, innerQuery,
>>>>                     req.getSearcher(), ScoreMode.Avg);
>>>>
>>>> The reasons I could not use the other joins are:
>>>>
>>>> - we are indexing citations/references and we cannot know parent-child
>>>> relationships (so we cannot do special indexing)
>>>> - we need to get at the score (of the first-order query children, the
>>>> final score is computed from that)
>>>>
>>>> I don't remember if LuceneJoin is cross-index (probably not), but it
>>>> doesn't matter to us (now) - the score thing may be relaxed, but it is
>>>> crucial NOT to be limited by the index segments, as said above, a paper can
>>>> cite other papers across the whole database (it is not manufacturer-product
>>>> type of relation that can be known and indexed into one segment).
>>>>
>>>> roman
>>>>
>>>>
>>>> On Fri, Jul 12, 2013 at 2:41 PM, Mikhail Khludnev <
>>>> mkhlud...@griddynamics.com> wrote:
>>>>
>>>>> Hello Roman,
>>>>>
>>>>> Thanks for your interest. I briefly looked at your approach, and I'm
>>>>> really interested in your numbers.
>>>>>
>>>>> Here is the trivial code. I'd rather rely on your testing
>>>>> framework, and I can provide you a version of Solr 4.2 with SOLR-3076
>>>>> applied. Do you need it?
>>>>> https://github.com/m-khl/join-tester
>>>>>
>>>>> What you are saying about benchmark representativeness definitely
>>>>> makes sense. I didn't try to establish a complete, absolutely
>>>>> representative benchmark. I just wanted rough numbers relevant to my
>>>>> use case. I'm from eCommerce, and that volume was enough for me.
>>>>>
>>>>> What I didn't get is 'not the block joins, because these cannot be
>>>>> used for citation data - we cannot reasonably index them into one
>>>>> segment'. Usually there is no problem with blocks in a multi-segment
>>>>> index, though a block definitely can't span segments. Anyway, please
>>>>> elaborate.
>>>>> One of the block join benefits is the ability to hit only the first matched
>>>>> child in a group and jump over the following ones. It isn't applicable in
>>>>> general, but sometimes gives a huge gain.
>>>>>
>>>>>
>>>>> On Fri, Jul 12, 2013 at 8:29 PM, Roman Chyla <roman.ch...@gmail.com>wrote:
>>>>>
>>>>>> Hi Mikhail,
>>>>>> I have commented on your blog, but it seems I have done something wrong,
>>>>>> as the comment is not there. Would it be possible to share the test setup
>>>>>> (script)?
>>>>>>
>>>>>> I have found out that the crucial thing with joins is the number of
>>>>>> 'joins' [hits returned], and it seems that the experiments I have seen so
>>>>>> far were geared towards small collections - even if Erick's index was 26M,
>>>>>> the number of hits was probably small - you can see a very different story
>>>>>> if you face some [other] real data. Here is a citation network, and I was
>>>>>> comparing Lucene's join [i.e. not the block joins, because these cannot be
>>>>>> used for citation data - we cannot reasonably index them into one segment]:
>>>>>>
>>>>>>
>>>>>> https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png
>>>>>>
>>>>>> Notice that the y axis is sqrt, so the running time for the Lucene join
>>>>>> is growing, and growing very fast! It takes Lucene 30s to do a search that
>>>>>> selects 1M hits.
>>>>>>
>>>>>> The comparison is against our own implementation of a similar search,
>>>>>> but the main point I am making is that join benchmarks should
>>>>>> show the number of hits selected by the join operation. Otherwise, a very
>>>>>> important detail is hidden.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>>   roman
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev <
>>>>>> mkhlud...@griddynamics.com> wrote:
>>>>>>
>>>>>> > On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu <
>>>>>> mihaela...@yahoo.com
>>>>>> > >wrote:
>>>>>> >
>>>>>> > > Hi Mikhail,
>>>>>> > >
>>>>>> > > I used the term block join wrongly. When I said block join I was
>>>>>> > > referring to a join performed on a single core, versus a cross join,
>>>>>> > > which is performed on multiple cores.
>>>>>> > > But I saw your benchmark (from cache) and it seems that block join has
>>>>>> > > better performance. Is this functionality available in Solr 4.3.1?
>>>>>> >
>>>>>> > Nope, SOLR-3076 has been waiting for ages.
>>>>>> >
>>>>>> >
>>>>>> > > I did not find such examples on Solr's wiki page.
>>>>>> > > Does this functionality require a special schema or special indexing?
>>>>>> >
>>>>>> > Special indexing - yes.
>>>>>> >
>>>>>> >
>>>>>> > > How would I need to index the data from my tables? In my case all the
>>>>>> > > indices have a common schema anyway, since I am using dynamic fields, so
>>>>>> > > I can easily add all documents from all tables to one Solr core, adding
>>>>>> > > a discriminator field to each document.
>>>>>> > >
>>>>>> > Correct, but the notion of a 'discriminator field' is a little bit
>>>>>> > different for blockjoin.
>>>>>> >
>>>>>> >
>>>>>> > >
>>>>>> > > Could you point me to some more documentation?
>>>>>> > >
>>>>>> >
>>>>>> > I can recommend only these:
>>>>>> >
>>>>>> > http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
>>>>>> > http://www.youtube.com/watch?v=-OiIlIijWH0
>>>>>> >
>>>>>> >
>>>>>> > > Thanks in advance,
>>>>>> > > Mihaela
>>>>>> > >
>>>>>> > >
>>>>>> > > ________________________________
>>>>>> > >  From: Mikhail Khludnev <mkhlud...@griddynamics.com>
>>>>>> > > To: solr-user <solr-user@lucene.apache.org>; mihaela olteanu <
>>>>>> > > mihaela...@yahoo.com>
>>>>>> > > Sent: Thursday, July 11, 2013 2:25 PM
>>>>>> > > Subject: Re: Performance of cross join vs block join
>>>>>> > >
>>>>>> > >
>>>>>> > > Mihaela,
>>>>>> > >
>>>>>> > > For me it's reasonable that a single-core join takes the same time as a
>>>>>> > > cross-core one. I just can't see what gain could be obtained in the
>>>>>> > > former case.
>>>>>> > > I'm hardly able to comment on the join code; I looked into it, and it's
>>>>>> > > not trivial, at least. Block join doesn't need to obtain parentId term
>>>>>> > > values/numbers and look up parents by them. Both of these actions are
>>>>>> > > expensive. Also, blockjoin works as an iterator, but join needs to
>>>>>> > > allocate memory for a parents bitset and populate it out of order, which
>>>>>> > > impacts scalability.
>>>>>> > > Also, in None scoring mode BJQ doesn't need to walk through all children,
>>>>>> > > but only hits the first. Another nice feature is 'both side leapfrog': if
>>>>>> > > you have a highly restrictive filter/query intersecting with BJQ, it
>>>>>> > > allows skipping many parents and children as well, which is not possible
>>>>>> > > in Join, which has a fairly 'full-scan' nature.
>>>>>> > > The main performance factor for Join is the number of child docs.
>>>>>> > > I'm not sure I got all your questions; please specify them in more
>>>>>> > > detail if something is still unclear.
>>>>>> > > Have you seen my benchmark
>>>>>> > > http://blog.griddynamics.com/2012/08/block-join-query-performs.html ?
>>>>>> > >
>>>>>> > >
>>>>>> > >
>>>>>> > > On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu <
>>>>>> mihaela...@yahoo.com
>>>>>> > > >wrote:
>>>>>> > >
>>>>>> > > > Hello,
>>>>>> > > >
>>>>>> > > > Does anyone know of any performance measurements for cross joins
>>>>>> > > > compared to joins inside a single index?
>>>>>> > > >
>>>>>> > > > Is a join inside a single index that stores all documents of various
>>>>>> > > > types (from the parent table or from child tables) with a
>>>>>> > > > discriminator field faster than a cross join (where each document
>>>>>> > > > type resides in its own index)?
>>>>>> > > >
>>>>>> > > > I have performed some tests, but it seems to me that a join in a
>>>>>> > > > single (bigger) index does not add much of a speed improvement
>>>>>> > > > compared to cross joins.
>>>>>> > > >
>>>>>> > > > Why would a block join be faster than a cross join if this is the case?
>>>>>> > > > What are the variables that count when trying to improve the query
>>>>>> > > > execution time?
>>>>>> > > >
>>>>>> > > > Thanks!
>>>>>> > > > Mihaela
>>>>>> > >
>>>>>> > >
>>>>>> > >
>>>>>> > >
>>>>>> > > --
>>>>>> > > Sincerely yours
>>>>>> > > Mikhail Khludnev
>>>>>> > > Principal Engineer,
>>>>>> > > Grid Dynamics
>>>>>> > >
>>>>>> > > <http://www.griddynamics.com>
>>>>>> > > <mkhlud...@griddynamics.com>
>>>>>> >
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
