Roman,

The video was very clarifying, and I realized block joins would be a great fit for my problem. However, I got worried about the size of the block: I could have 10 million children for one parent, for instance. Although these could all stay in the same shard, do you guys think it would be a huge problem at _query time_? Of course the indexing would take longer, but if it can query faster, it would be a great fit for my case...
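To make the requirement concrete, here is a minimal sketch in plain Python (not Solr code) of the matching semantics I am after. The `Page` and `User` classes and the function names are made up for illustration only. It contrasts a per-child match, which is what I understand a block join gives me, with a match against a flattened user document:

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    title: str


@dataclass
class User:
    user_id: str
    pages: list = field(default_factory=list)


def page_matches(page):
    # Both conditions must hold on the SAME web page.
    return page.url.startswith("www.") and "solr" in page.title.lower()


def block_join_match(user):
    # Parent matches only if at least one child satisfies every condition.
    return any(page_matches(p) for p in user.pages)


def flattened_match(user):
    # Denormalized user document: each condition may be satisfied
    # by a different page, which is the false positive of scenario 2.
    has_url = any(p.url.startswith("www.") for p in user.pages)
    has_title = any("solr" in p.title.lower() for p in user.pages)
    return has_url and has_title


# Scenario 1: one page satisfies both conditions.
u1 = User("u1", [Page("www.example.com", "Intro to Solr")])
# Scenario 2: the conditions are split across two pages.
u2 = User("u2", [Page("www.example.com", "Home"),
                 Page("blog.example.com", "Solr tips")])

per_child = [u.user_id for u in (u1, u2) if block_join_match(u)]
flattened = [u.user_id for u in (u1, u2) if flattened_match(u)]
print(per_child, flattened)
```

Only `u1` should count as a match; the flattened version also returns `u2`, which is why simply storing all the page fields in the user document does not work for me.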
Best regards,
Marcelo.

2013/7/8 Roman Chyla <roman.ch...@gmail.com>

> Hello,
>
> The joins are not the only idea, you may want to write your own function
> (ValueSource) that can implement your logic. However, I think you should
> not throw away the regex idea (as being slow), before trying it out -
> because it can be faster than the joins. Your problem is that the number of
> entities need to be limited, see recent replies of Jack Krupansky on the
> number of fields.
>
> The joins are of different kinds, I recommend this link to see their
> differences: http://vimeo.com/44299232
>
> If your data relations can fit in memory, a smart cache (ie [un]inverted
> index) will always outperform lucene joins - look at the chart inside this:
> http://code4lib.org/files/2ndOrderOperatorsv2.pdf
>
> roman
>
>
> On Mon, Jul 8, 2013 at 4:03 PM, Marcelo Elias Del Valle
> <mvall...@gmail.com> wrote:
>
> > Hello all,
> >
> > I am using Solr Cloud today and I have the following need:
> >
> >    - My queries focus on counting how many users attend to some criteria.
> >    So my main document is "user" (parent table)
> >    - Each user can access several web pages (a child table) and each web
> >    page might have several attributes.
> >    - I need to lookup for users where there is some page accessed by them
> >    which matches a set of attributes. For example, I have two scenarios:
> >       1. if a user accessed a web page WP1 with a URL that starts with
> >       "www." and with a title that includes "solr", then the user is a
> >       match.
> >       2. However, if there is a webpage WP1 with such url and ANOTHER WP2
> >       that includes "solr" in the title, this is not a match.
> >
> >
> > If I were modeling this on a relational DB, user would be a table and
> > url would be other. However, as I using solr, my first option would be
> > denormalizing first. Simply storing all the fields in the user document
> > wouldn't work, as I would work as described in scenario 2.
> >
> > I thought in two solutions for these:
> >
> >    - Using the idea of an inverted index - Having several kinds of
> >    documents (user, web page, entity 3, entity 4, etc.) where each entity
> >    (web page, for instance) would have a field to relate to the user id.
> >    Then, using a cross join in solr to get the results where there was a
> >    match on user (parent table) and also on each child entity (in other
> >    words, to merge the results of several queries that might return user
> >    ids). This has a drawback of using a join.
> >    - Having just a user document and storing each web page as only one
> >    field (like a json). To search, the same field would need to match a
> >    regular expression that includes both conditions. This would make my
> >    search slower and I would not be able to apply the same technique if
> >    the child tables also had children.
> >
> > Am I missing any obvious solution here? I would love to receive critics
> > on this, as I am probably not the only one who have this problem... I
> > would like more ideas on how to denormalize data in this case. Is the
> > join my best option here?
> >
> > Best regards,
> > --
> > Marcelo Elias Del Valle
> > http://mvalle.com - @mvallebr
> >

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr