I also have a similar scenario, where fundamentally I have to retrieve all URLs where a user id has been found. So, in my schema, I designed the URL as the (string) key, with a (possibly huge) list of attributes automatically mapped to strings. For example:
Url1 (key):
- language: en
- content:userid1
- content:userid1
- content:userid1 (i.e. 3 times, actually, for user 1)
- content:userid2
- content:userid3
- author:userid4

and so on and so forth. So, if I understood correctly, you're saying that this is a bad design? How should I fix my schema in that case, in your opinion?

Best,
Flavio

On Wed, Jul 10, 2013 at 11:53 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Simple answer: avoid a "large number of values in a single document". There
> should only be a modest to moderate number of fields in a single document.
>
> Is the data relatively static, or subject to frequent updates? To update
> any field of a single document, even with atomic update, Solr must
> read and rewrite every field of the document. So lots of smaller documents
> are best for a frequent-update scenario.
>
> Multivalued fields are great for storing a relatively small list of
> values. You can add to the list easily, but under the hood Solr must read
> and rewrite the full list as well as the full document. And there is no
> way to address or synchronize individual elements of multivalued fields.
>
> Joins are great... if used in moderation. Heavy use of joins is not a
> great idea.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Marcelo Elias Del Valle
> Sent: Wednesday, July 10, 2013 5:37 PM
> To: solr-user@lucene.apache.org
> Subject: amount of values in a multi value field - is denormalization
> always the best option?
>
> Hello,
>
> I have asked a question recently about Solr limitations and another about
> joins. It turns out this question is about both at the same time.
> I am trying to figure out how to denormalize my data so that I need just 1
> document in my index instead of performing a join. One way of doing this
> is storing an entity as a multivalued field, instead of storing it as
> separate documents.
> Let me give an example.
> Consider the entities:
>
> User:
> id: 1
> name: Joan of Arc
> age: 27
>
> Webpage:
> id: 1
> url: http://wiki.apache.org/solr/Join
> category: Technical
> user_id: 1
>
> id: 2
> url: http://stackoverflow.com
> category: Technical
> user_id: 1
>
> Instead of creating 1 document for the user, 1 for webpage 1, and 1 for
> webpage 2 (1 parent and 2 children), I could store the webpages in
> multivalued fields on the user, as follows:
>
> User:
> id: 1
> name: Joan of Arc
> age: 27
> webpage1: ["id:1", "url: http://wiki.apache.org/solr/Join", "category: Technical"]
> webpage2: ["id:2", "url: http://stackoverflow.com", "category: Technical"]
>
> It would probably perform better than the join, right? However, it made
> me think about Solr limitations again. What if I have 200 million webpages
> (200 million fields) per user? Or imagine a case where I could have 200
> million values in a single field, as in the case where I need to index every
> HTML DOM element (div, a, etc.) of each web page a user visited.
> I mean, if I need to do the query and this is a business requirement no
> matter what, then although denormalizing could be better than using
> query-time joins, I wonder whether distributing the data present in this
> single document across the cluster wouldn't give me better performance.
> And that is something I won't get with block joins or multivalued fields...
> I guess there is probably no right answer to this question (at least
> not a known one), and I know I should create a POC to check how each
> approach performs... But do you think such a large number of values in a
> single document could make denormalization infeasible in an extreme case
> like this? Would you agree with me if I said denormalization is not always
> the right option?
>
> Best regards,
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
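For concreteness, here is a minimal sketch of the two document shapes being compared, written as plain Python dicts in the JSON style Solr's update handler accepts. This is only an illustration of the modeling trade-off, not Solr client code; the field names `webpage_url` / `webpage_category` and the `denormalize` helper are my own inventions, not from the thread.

```python
# Option 1: normalized -- one parent document plus one document per
# webpage, linked by user_id and combined at query time with a join.
user = {"id": "1", "name": "Joan of Arc", "age": 27}
webpages = [
    {"id": "w1", "url": "http://wiki.apache.org/solr/Join",
     "category": "Technical", "user_id": "1"},
    {"id": "w2", "url": "http://stackoverflow.com",
     "category": "Technical", "user_id": "1"},
]


def denormalize(user, webpages):
    """Option 2: fold the webpages into parallel multivalued fields on
    a single user document, so no join is needed at query time.

    The values at index i of each list describe the same webpage, but
    the whole document must be rewritten on every update -- which is
    exactly Jack's warning about large documents above.
    """
    doc = dict(user)
    doc["webpage_url"] = [w["url"] for w in webpages]
    doc["webpage_category"] = [w["category"] for w in webpages]
    return doc


flat = denormalize(user, webpages)
print(flat["webpage_url"])
```

Adding a third webpage to the normalized model is one small new document; in the denormalized model it means rereading and rewriting the entire user document, which is why the frequent-update and 200-million-values cases argue against flattening.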