Hello Flavio,

Out of curiosity, are you already using this in production? Would you share your results / benchmarks with us, if you have any? I wonder how it is performing for you, since I was thinking of using a very similar schema.

The thing is: each option has drawbacks; there is no universally "good" or "bad" schema, if I understood things correctly. Even joins, which we should generally avoid in a NoSQL technology like Solr, may be a good option in some cases. I guess sometimes the only thing that can answer these questions is POCs and benchmarks.

I am not a Solr expert, and there are several committers on this list who can help you much better than I can, but the way I see it, you should try your solution, measure how it performs, and keep looking for alternatives that perform better. As I said, I am not an expert, but I wouldn't call your model a "bad model" that needs fixing. It is one possible model, and maybe another model could perform better. As with algorithms, we should assume we can always do better...
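Just to make the trade-off concrete, here is a rough sketch of what the two approaches would look like at query time (plain Python, only building the request URLs; the core name, Solr URL, and field names such as user_id and webpage_category are assumptions on my part, not your actual schema):

```python
from urllib.parse import urlencode

# Assumed Solr endpoint and core name, for illustration only.
SOLR = "http://localhost:8983/solr/mycore/select"

# 1) Normalized schema: user docs and webpage docs are separate, so finding
#    users with a "Technical" webpage needs a query-time join. The join
#    parser matches webpage docs on category:Technical, then returns the
#    user docs whose id matches the webpages' user_id.
join_params = {
    "q": "{!join from=user_id to=id}category:Technical",
    "wt": "json",
}

# 2) Denormalized schema: the user document carries its pages (here assuming
#    the page attributes are flattened into a multivalued field like
#    webpage_category, a variant of the webpage1/webpage2 layout). The same
#    question becomes a plain field query, with no join work at query time.
flat_params = {
    "q": "webpage_category:Technical",
    "wt": "json",
}

join_url = SOLR + "?" + urlencode(join_params)
flat_url = SOLR + "?" + urlencode(flat_params)
print(join_url)
print(flat_url)
```

The denormalized query is cheaper per request, but as Jack points out below, every update to that user document rewrites the whole thing, so the write pattern matters as much as the read pattern.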
Best regards,
Marcelo.

2013/7/11 Flavio Pompermaier <pomperma...@okkam.it>

> I also have a similar scenario, where fundamentally I have to retrieve all
> urls where a userid has been found.
> So, in my schema, I designed the url as (string) key and a (possibly huge)
> list of attributes automatically mapped to strings.
> For example:
>
> Url1 (key):
> - language: en
> - content:userid1
> - content:userid1
> - content:userid1 (i.e. 3 times actually for user 1)
> - content:userid2
> - content:userid3
> - author:userid4
>
> and so on and so forth.
> So, if I understood correctly, you are saying that this is a bad design?
> How should I fix my schema, in your opinion, in that case?
>
> Best,
> Flavio
>
>
> On Wed, Jul 10, 2013 at 11:53 PM, Jack Krupansky <j...@basetechnology.com>wrote:
>
> > Simple answer: avoid a large number of values in a single document. There
> > should only be a modest to moderate number of fields in a single document.
> >
> > Is the data relatively static, or subject to frequent updates? Updating
> > any field of a single document, even with atomic update, requires Solr to
> > read and rewrite every field of the document. So, lots of smaller documents
> > are best for a frequent-update scenario.
> >
> > Multivalued fields are great for storing a relatively small list of
> > values. You can add to the list easily, but under the hood, Solr must read
> > and rewrite the full list as well as the full document. And there is no
> > way to address or synchronize individual elements of multivalued fields.
> >
> > Joins are great... if used in moderation. Heavy use of joins is not a
> > great idea.
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Marcelo Elias Del Valle
> > Sent: Wednesday, July 10, 2013 5:37 PM
> > To: solr-user@lucene.apache.org
> > Subject: amount of values in a multi value field - is denormalization
> > always the best option?
> >
> >
> > Hello,
> >
> > I have asked a question recently about Solr limitations and another about
> > joins. It turns out this question is about both at the same time.
> > I am trying to figure out how to denormalize my data so I will need just
> > 1 document in my index instead of performing a join. I figure one way of
> > doing this is storing an entity as a multivalued field, instead of storing
> > different fields.
> > Let me give an example. Consider the entities:
> >
> > User:
> > id: 1
> > name: Joan of Arc
> > age: 27
> >
> > Webpage:
> > id: 1
> > url: http://wiki.apache.org/solr/Join
> > category: Technical
> > user_id: 1
> >
> > id: 2
> > url: http://stackoverflow.com
> > category: Technical
> > user_id: 1
> >
> > Instead of creating 1 document for the user, 1 for webpage 1 and 1 for
> > webpage 2 (1 parent and 2 children), I could store the webpages in user
> > multivalued fields, as follows:
> >
> > User:
> > id: 1
> > name: Joan of Arc
> > age: 27
> > webpage1: ["id:1", "url: http://wiki.apache.org/solr/Join", "category:
> > Technical"]
> > webpage2: ["id:2", "url: http://stackoverflow.com", "category:
> > Technical"]
> >
> > It would probably perform better than the join, right? However, it made
> > me think about Solr limitations again. What if I have 200 million webpages
> > (200 million fields) per user? Or imagine a case where I could have 200
> > million values in a field, like in the case where I need to index every
> > HTML DOM element (div, a, etc.) of each web page the user visited.
> > I mean, if I need to run this query and it is a business requirement no
> > matter what, then although denormalizing could be better than using
> > query-time joins, I wonder if distributing the data present in this single
> > document across the cluster wouldn't give me better performance. And that
> > is something I won't get with block joins or multivalued fields...
> > I guess there is probably no right answer to this question (at least
> > not a known one), and I know I should create a POC to check how each
> > approach performs... But do you think such a large number of values in a
> > single document could make denormalization infeasible in an extreme case
> > like this? Would you agree if I said denormalization is not always the
> > right option?
> >
> > Best regards,
> > --
> > Marcelo Elias Del Valle
> > http://mvalle.com - @mvallebr
>

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr