Hello Mikhail,

PS: sending this to solr-user as well; I've realized I was writing just to you, sorry...

On Mon, Jul 22, 2013 at 3:07 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

> Hello Roman,
>
> Please get me right. I have no idea what happened with that dependency.
> There are recent patches from Yonik; they should be more current, and I
> think he can help you with particular issues. From common (captain's)
> sense, I propose to specify any closer version of jetty; I don't think
> there is much reason to rely on that particular one.
>
> I'm thinking about your problem from time to time. You are right, it's
> definitely not a case for block join. I'm still trying to figure out how to
> make it computationally easier. As far as I get it, you have a recursive
> many-to-many relationship and need to traverse it during the search.
>
> doc(id, author, text, references:[docid,....] )
>
> I'm not sure it's possible with Lucene now, but if it is, what do you think
> about writing a DocValues stripe that contains internal Lucene docnums
> instead of external docIds? It moves a few steps from query time to index
> time, hence can gain some performance.
>

Our use case of many-to-many relations is probably a weird one and we ought
to de-normalize the values. What I do (building a citation network in
memory, using Lucene caches) is just a work-around that happens to
out-perform the index seeking. No surprise there, but it comes at the
expense of memory. I am aware that de-normalization may be necessary, and
DocValues would probably be a step towards it. The joins give great
flexibility, which is really cool, but that comes with its own price...

> Also, I noticed you hesitate regarding cross-segment joins. You
> actually shouldn't, for the following reasons:
> - Join is Solr code (which is a top reader beast);
> - it obtains and works with SolrIndexSearcher, which is a top reader...
> - the join happens at Weight without any awareness of leaf segments.
>
> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/JoinQParserPlugin.java#L272
>

Thanks. I think I did not use it because (I believe) there was only a very
small chance it could have been fast enough. It reads terms/joins for the
docs that match the query, so in that sense it is no different from
pre-computing the citation cache, but it happens for every query/request,
and so for 0.5M edges it must take some time. But I guess I should measure
it. I haven't made notes, so now I am having a hard time backtracking :)

roman

> It seems to me cross segment join works well.
>
>
>
> On Mon, Jul 22, 2013 at 3:08 AM, Roman Chyla <roman.ch...@gmail.com> wrote:
>
>> ah, in case you know the solution, here is the ant output:
>>
>> resolve:
>> [ivy:retrieve]
>> [ivy:retrieve] :: problems summary ::
>> [ivy:retrieve] :::: WARNINGS
>> [ivy:retrieve] module not found: org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] ==== local: tried
>> [ivy:retrieve] /home/rchyla/.ivy2/local/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/ivys/ivy.xml
>> [ivy:retrieve] -- artifact org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
>> [ivy:retrieve] /home/rchyla/.ivy2/local/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/jars/jetty-deploy.jar
>> [ivy:retrieve] ==== shared: tried
>> [ivy:retrieve] /home/rchyla/.ivy2/shared/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/ivys/ivy.xml
>> [ivy:retrieve] -- artifact org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
>> [ivy:retrieve] /home/rchyla/.ivy2/shared/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/jars/jetty-deploy.jar
>> [ivy:retrieve] ==== public: tried
>> [ivy:retrieve] http://repo1.maven.org/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
>> [ivy:retrieve] ==== sonatype-releases: tried
>> [ivy:retrieve] http://oss.sonatype.org/content/repositories/releases/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
>> [ivy:retrieve] -- artifact org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
>> [ivy:retrieve] http://oss.sonatype.org/content/repositories/releases/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.jar
>> [ivy:retrieve] ==== maven.restlet.org: tried
>> [ivy:retrieve] http://maven.restlet.org/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
>> [ivy:retrieve] -- artifact org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
>> [ivy:retrieve] http://maven.restlet.org/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.jar
>> [ivy:retrieve] ==== working-chinese-mirror: tried
>> [ivy:retrieve] http://mirror.netcologne.de/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
>> [ivy:retrieve] -- artifact org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
>> [ivy:retrieve] http://mirror.netcologne.de/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.jar
>> [ivy:retrieve] ::::::::::::::::::::::::::::::::::::::::::::::
>> [ivy:retrieve] :: UNRESOLVED DEPENDENCIES ::
>> [ivy:retrieve] ::::::::::::::::::::::::::::::::::::::::::::::
>> [ivy:retrieve] :: org.eclipse.jetty#jetty-deploy;8.1.10.v20130312: not found
>> [ivy:retrieve] ::::::::::::::::::::::::::::::::::::::::::::::
>> [ivy:retrieve] :::: ERRORS
>> [ivy:retrieve] impossible to acquire lock for org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve] impossible to acquire lock for org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
>> [ivy:retrieve]
>> [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>>
>> BUILD FAILED
>> /dvt/workspace/lucene-3076/build.xml:39: The following error occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/build.xml:181: The following error occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/common-build.xml:450: The following error occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/common-build.xml:390: The following error occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/common-build.xml:336: The following error occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/common-build.xml:356: The following error occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/common-build.xml:405: The following error occurred while executing this line:
>> /dvt/workspace/lucene-3076/solr/example/build.xml:54: impossible to resolve dependencies:
>> resolve failed - see output for details
>>
>>
>>
>> On Sun, Jul 21, 2013 at 7:06 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
>>
>>> Hello Mikhail,
>>>
>>> I have applied the patch (cleanly), but am getting a silly 'unresolved
>>> dependencies' error from ivy when running 'ant test'. I know I could
>>> solve this, but I wonder whether you want to update your patch (well, I
>>> am just guessing that it specifies some non-available version of jetty
>>> [??] - or maybe it is the latest trunk)
>>>
>>>
>>> but, it occurred to me, the block-joins rely on the segments, right?
>>> so every child must be in the same index segment? Can you confirm,
>>> please? I can run the test, but if this is the case, we'll be getting
>>> different results (a different number of joins)
>>>
>>> roman
>>>
>>>
>>> On Fri, Jul 12, 2013 at 6:22 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
>>>
>>>> Hello Mikhail,
>>>>
>>>> If 3076 is easy to apply over the current trunk, I can try to run it
>>>> some time next week. It is not much of a testing framework, just a
>>>> bunch of Python scripts, but I can run it against the 200GB index to
>>>> produce a comparison.
>>>>
>>>> So my tests were run against Lucene's join,
>>>> http://lucene.apache.org/core/4_0_0/join/org/apache/lucene/search/join/JoinUtil.html
>>>>
>>>> The join query is created as:
>>>>
>>>> return JoinUtil.createJoinQuery(idField, false, refField, innerQuery,
>>>> req.getSearcher(), ScoreMode.Avg);
>>>>
>>>> The reasons I could not use the other joins are:
>>>>
>>>> - we are indexing citations/references and we cannot know parent-child
>>>> relationships (so we cannot do special indexing)
>>>> - we need to get at the score (of the first-order query children; the
>>>> final score is computed from that)
>>>>
>>>> I don't remember if LuceneJoin is cross-index (probably not), but it
>>>> doesn't matter to us (now). The score requirement may be relaxed, but it
>>>> is crucial NOT to be limited by the index segments: as said above, a
>>>> paper can cite other papers across the whole database (it is not a
>>>> manufacturer-product type of relation that can be known and indexed
>>>> into one segment).
>>>>
>>>> roman
>>>>
>>>>
>>>> On Fri, Jul 12, 2013 at 2:41 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>>>>
>>>>> Hello Roman,
>>>>>
>>>>> Thanks for your interest. I briefly looked at your approach, and I'm
>>>>> really interested in your numbers.
>>>>>
>>>>> Here is the trivial code. I'd rather rely on your testing
>>>>> framework, and I can provide you a version of Solr 4.2 with SOLR-3076
>>>>> applied.
>>>>> Do you need it?
>>>>> https://github.com/m-khl/join-tester
>>>>>
>>>>> What you are saying about benchmark representativeness definitely
>>>>> makes sense. I didn't try to establish a complete, absolutely
>>>>> representative benchmark. I just wanted to have rough numbers relevant
>>>>> to my use case, certainly. I'm from eCommerce; that volume was enough
>>>>> for me.
>>>>>
>>>>> What I didn't get is 'not the block joins, because these cannot be
>>>>> used for citation data - we cannot reasonably index them into one
>>>>> segment'. Usually, there is no problem with blocks in a multi-segment
>>>>> index; a block definitely can't span across segments. Anyway, please
>>>>> elaborate.
>>>>> One of the block join benefits is the ability to hit only the first
>>>>> matched child in a group and jump over the following ones. It isn't
>>>>> applicable in general, but sometimes gives a huge gain.
>>>>>
>>>>>
>>>>> On Fri, Jul 12, 2013 at 8:29 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
>>>>>
>>>>>> Hi Mikhail,
>>>>>> I have commented on your blog, but it seems I have done something
>>>>>> wrong, as the comment is not there. Would it be possible to share the
>>>>>> test setup (script)?
>>>>>>
>>>>>> I have found out that the crucial thing with joins is the number of
>>>>>> 'joins' [hits returned], and it seems that the experiments I have seen
>>>>>> so far were geared towards small collections. Even if Erick's index
>>>>>> was 26M, the number of hits was probably small. You can see a very
>>>>>> different story if you face some [other] real data.
>>>>>> Here is a citation network, and I was comparing Lucene's join [i.e.
>>>>>> not the block joins, because these cannot be used for citation data -
>>>>>> we cannot reasonably index them into one segment]:
>>>>>>
>>>>>>
>>>>>> https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png
>>>>>>
>>>>>> Notice that the y axis uses a sqrt scale, so the running time for the
>>>>>> Lucene join is growing, and growing very fast! It takes Lucene 30s to
>>>>>> do the search that selects 1M hits.
>>>>>>
>>>>>> The comparison is against our own implementation of a similar search,
>>>>>> but the main point I am making is that the join benchmarks should be
>>>>>> showing the number of hits selected by the join operation. Otherwise,
>>>>>> a very important detail is hidden.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> roman
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>>>>>>
>>>>>> > On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu <mihaela...@yahoo.com> wrote:
>>>>>> >
>>>>>> > > Hi Mikhail,
>>>>>> > >
>>>>>> > > I have used the term block join wrongly. When I said block join I
>>>>>> > > was referring to a join performed on a single core, versus a cross
>>>>>> > > join, which was performed on multiple cores.
>>>>>> > > But I saw your benchmark (from cache) and it seems that block join
>>>>>> > > has better performance. Is this functionality available in Solr
>>>>>> > > 4.3.1?
>>>>>> >
>>>>>> > nope, SOLR-3076 has been waiting for ages.
>>>>>> >
>>>>>> >
>>>>>> > > I did not find such examples on Solr's wiki page.
>>>>>> > > Does this functionality require a special schema, or special
>>>>>> > > indexing?
>>>>>> >
>>>>>> > Special indexing - yes.
>>>>>> >
>>>>>> >
>>>>>> > > How would I need to index the data from my tables?
>>>>>> > > In my case anyway, all the indices have a common schema since I am
>>>>>> > > using dynamic fields, thus I can easily add all documents from all
>>>>>> > > tables to one Solr core, but add a discriminator field to each
>>>>>> > > document.
>>>>>> > >
>>>>>> > correct, but the notion of a 'discriminator field' is a little bit
>>>>>> > different for blockjoin.
>>>>>> >
>>>>>> >
>>>>>> > >
>>>>>> > > Could you point me to some more documentation?
>>>>>> > >
>>>>>> >
>>>>>> > I can recommend only those
>>>>>> >
>>>>>> > http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
>>>>>> > http://www.youtube.com/watch?v=-OiIlIijWH0
>>>>>> >
>>>>>> >
>>>>>> > > Thanks in advance,
>>>>>> > > Mihaela
>>>>>> > >
>>>>>> > >
>>>>>> > > ________________________________
>>>>>> > > From: Mikhail Khludnev <mkhlud...@griddynamics.com>
>>>>>> > > To: solr-user <solr-user@lucene.apache.org>; mihaela olteanu <mihaela...@yahoo.com>
>>>>>> > > Sent: Thursday, July 11, 2013 2:25 PM
>>>>>> > > Subject: Re: Performance of cross join vs block join
>>>>>> > >
>>>>>> > >
>>>>>> > > Mihaela,
>>>>>> > >
>>>>>> > > For me it's reasonable that a single-core join takes the same time
>>>>>> > > as a cross-core one. I just can't see what gain can be obtained in
>>>>>> > > the former case.
>>>>>> > > I'm hardly able to comment on the join code; I looked into it, and
>>>>>> > > it's not trivial, at least. With block join, there is no need to
>>>>>> > > obtain parentId term values/numbers and look up parents by them.
>>>>>> > > Both of these actions are expensive. Also, blockjoin works as an
>>>>>> > > iterator, but join needs to allocate memory for the parents bitset
>>>>>> > > and populate it out of order, which impacts scalability.
>>>>>> > > Also, in None scoring mode BJQ doesn't need to walk through all
>>>>>> > > children, but only hits the first.
>>>>>> > > Also, a nice feature is 'both side leapfrog': if you have a highly
>>>>>> > > restrictive filter/query that intersects with BJQ, it allows
>>>>>> > > skipping many parents and children as well. That's not possible in
>>>>>> > > Join, which has a fairly 'full-scan' nature.
>>>>>> > > The main performance factor for Join is the number of child docs.
>>>>>> > > I'm not sure I got all your questions; please specify them in more
>>>>>> > > detail if something is still unclear.
>>>>>> > > Have you seen my benchmark
>>>>>> > > http://blog.griddynamics.com/2012/08/block-join-query-performs.html ?
>>>>>> > >
>>>>>> > >
>>>>>> > >
>>>>>> > > On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu <mihaela...@yahoo.com> wrote:
>>>>>> > >
>>>>>> > > > Hello,
>>>>>> > > >
>>>>>> > > > Does anyone know about some measurements in terms of performance
>>>>>> > > > for cross joins compared to joins inside a single index?
>>>>>> > > >
>>>>>> > > > Is the join inside a single index that stores all documents of
>>>>>> > > > various types (from the parent table or from children tables)
>>>>>> > > > with a discriminator field faster than the cross join (where
>>>>>> > > > basically each document type resides in its own index)?
>>>>>> > > >
>>>>>> > > > I have performed some tests, but to me it seems that having a
>>>>>> > > > join in a single index (a bigger index) does not add much speed
>>>>>> > > > improvement compared to cross joins.
>>>>>> > > >
>>>>>> > > > Why would a block join be faster than a cross join if this is the
>>>>>> > > > case? What are the variables that count when trying to improve
>>>>>> > > > the query execution time?
>>>>>> > > >
>>>>>> > > > Thanks!
>>>>>> > > > Mihaela
>>>>>> > >
>>>>>> > >
>>>>>> > > --
>>>>>> > > Sincerely yours
>>>>>> > > Mikhail Khludnev
>>>>>> > > Principal Engineer,
>>>>>> > > Grid Dynamics
>>>>>> > >
>>>>>> > > <http://www.griddynamics.com>
>>>>>> > > <mkhlud...@griddynamics.com>
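
For readers following the thread, here is a rough sketch of the in-memory work-around mentioned at the top (a citation network held as arrays of internal doc numbers, so the second-order lookup needs no term seeking at query time). It is only an illustration under stated assumptions: the class name CitationCache and the method referencesOf are hypothetical, the int[][] stands in for whatever Lucene cache or DocValues structure would really hold the references, and no Lucene API is used.

```java
import java.util.BitSet;

/** Hypothetical in-memory citation network; not the actual code discussed in the thread. */
public class CitationCache {
    // references[docnum] = internal doc numbers of the papers that docnum cites.
    // Assumption: this array was pre-computed once, e.g. when a searcher is opened.
    private final int[][] references;

    public CitationCache(int[][] references) {
        this.references = references;
    }

    /** Returns the set of documents cited by any document in firstOrderHits. */
    public BitSet referencesOf(BitSet firstOrderHits) {
        BitSet result = new BitSet(references.length);
        // Walk only the set bits (the first-order hits), then follow their edges.
        for (int doc = firstOrderHits.nextSetBit(0); doc >= 0;
                 doc = firstOrderHits.nextSetBit(doc + 1)) {
            for (int cited : references[doc]) {
                result.set(cited);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy network: doc 0 cites 1 and 2; doc 1 cites 2; doc 2 cites nothing; doc 3 cites 0.
        int[][] refs = { {1, 2}, {2}, {}, {0} };
        CitationCache cache = new CitationCache(refs);

        BitSet hits = new BitSet();
        hits.set(0);
        hits.set(3);

        System.out.println(cache.referencesOf(hits)); // prints {0, 1, 2}
    }
}
```

In this sketch the work per query is proportional to the number of edges leaving the first-order hits, which is consistent with the point made in the thread that the number of hits selected by the join is what matters; the price is the memory needed to keep the whole array resident.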