Simple answer: avoid "large number of values in a single document". There should only be a modest to moderate number of fields in a single document.

Is the data relatively static, or subject to frequent updates? Updating any field of a document, even with an atomic update, requires Solr to read and rewrite every field of that document. So many smaller documents work best when updates are frequent.
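
As a minimal sketch of what an atomic update request looks like, the snippet below builds the JSON body that would be posted to Solr's update handler. The "set" modifier is Solr's atomic-update syntax; the helper name build_atomic_update is hypothetical, and the "age" field is borrowed from the user example in the quoted message. The request names only one field, but on the server side the whole document is still re-indexed.

```python
import json


def build_atomic_update(doc_id, field, new_value):
    """Return a JSON body for Solr's update handler that sets one field.

    Hypothetical helper for illustration; only the "set" modifier
    itself is Solr syntax.
    """
    # The client sends just the changed field, but Solr still has to read
    # the stored document and re-index all of its fields.
    return json.dumps([{"id": doc_id, field: {"set": new_value}}])


# Assumed example: change the age of the user with id 1.
body = build_atomic_update("1", "age", 28)
print(body)
```

The small request body can be misleading: the server-side cost tracks the size of the whole document, not the size of the change.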

Multivalued fields are great for storing a relatively small list of values. You can add to the list easily, but under the hood, Solr must read and rewrite the full list as well as the full document. And there is no way to address or synchronize individual elements of multivalued fields.
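
To illustrate the last point, here is a toy model in plain Python (not Solr code) of how matching behaves when related values are flattened into parallel multivalued fields. The field names webpage_url and webpage_category, the second URL, and the "News" category are assumptions made up for this sketch. Because each field is just a flat list, a query that ANDs two fields can match values that came from different pages.

```python
def matches_all(doc, criteria):
    """Toy AND query: every field must contain its value somewhere in its list."""
    return all(value in doc.get(field, []) for field, value in criteria.items())


# Two web pages flattened into parallel multivalued fields (hypothetical values).
user = {
    "id": "1",
    "webpage_url": ["http://wiki.apache.org/solr/Join", "http://example.com/news"],
    "webpage_category": ["Technical", "News"],
}

# No single page has this URL together with this category, yet the document
# matches, because the two lists are not tied together element by element.
cross_match = matches_all(
    user,
    {"webpage_url": "http://wiki.apache.org/solr/Join", "webpage_category": "News"},
)
print(cross_match)
```

Packing each page into its own field, as proposed in the quoted message, avoids this particular mismatch but multiplies the number of fields instead.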

Joins are great... if used in moderation. Heavy use of joins is not a great idea.
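
For reference, a query-time join over the example entities might be expressed as in the sketch below, which only builds the request parameters. The {!join from=... to=...} local-params syntax is Solr's join query parser. The helper name is hypothetical, and the sketch assumes that user and webpage documents share one index with ids that do not collide, which the example in the quoted message does not guarantee.

```python
from urllib.parse import urlencode, parse_qs


def build_join_query(from_field, to_field, inner_query):
    """Return URL-encoded select parameters that use Solr's join query parser."""
    # Find documents matching inner_query, collect their from_field values,
    # and return the documents whose to_field holds one of those values.
    q = "{!join from=%s to=%s}%s" % (from_field, to_field, inner_query)
    return urlencode({"q": q, "wt": "json"})


# Assumed example: users who own at least one "Technical" web page.
params = build_join_query("user_id", "id", "category:Technical")
print(params)
```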

-- Jack Krupansky

-----Original Message-----
From: Marcelo Elias Del Valle
Sent: Wednesday, July 10, 2013 5:37 PM
To: solr-user@lucene.apache.org
Subject: amount of values in a multi value field - is denormalization always the best option?

Hello,

   I recently asked a question about Solr limitations and some about
joins. It turns out that this question is about both at the same time.
   I am trying to figure out how to denormalize my data so I will need just 1
document in my index instead of performing a join. I figure one way of
doing this is storing an entity as a multivalued field, instead of storing
different fields.
   Let me give an example. Consider the entities:

User:
   id: 1
   name: Joan of Arc
   age: 27

Webpage:
   id: 1
   url: http://wiki.apache.org/solr/Join
   category: Technical
   user_id: 1

   id: 2
   url: http://stackoverflow.com
   category: Technical
   user_id: 1

   Instead of creating 1 document for the user, 1 for webpage 1 and 1 for
webpage 2 (1 parent and 2 children), I could store webpages in a user
multivalued field, as follows:

User:
   id: 1
   name: Joan of Arc
   age: 27
   webpage1: ["id:1", "url: http://wiki.apache.org/solr/Join", "category: Technical"]
   webpage2: ["id:2", "url: http://stackoverflow.com", "category: Technical"]

   It would probably perform better than the join, right? However, it made
me think about Solr limitations again. What if I have 200 million webpages
(200 million fields) per user? Or imagine a case where I could have 200
million values in a field, as in the case where I need to index every HTML DOM
element (div, a, etc.) for each web page the user visited.
   I mean, if I need to do the query and this is a business requirement no
matter what, although denormalizing could be better than using query-time
joins, I wonder if distributing the data present in this single document
across the cluster wouldn't give me better performance. And this is
something I won't get with block joins or multivalued fields...
   I guess there is probably no right answer for this question (at least
not a known one), and I know I should create a POC to check how each
performs... But do you think such a large number of values in a single
document could make denormalization impossible in an extreme case like
this? Would you agree if I said denormalization is not always
the right option?

Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
