Re: amount of values in a multi value field - is denormalization always the best option?

Jack Krupansky Wed, 10 Jul 2013 14:54:29 -0700

Simple answer: avoid a "large number of values in a single document". There should only be a modest to moderate number of fields in a single document.

Is the data relatively static, or subject to frequent updates? Updating any field of a single document, even with an atomic update, requires Solr to read and rewrite every field of the document. So, lots of smaller documents are best for a frequent-update scenario.
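
To make the atomic update case concrete, here is a minimal sketch of the JSON body such a request carries. The helper name build_atomic_update is made up for illustration, the field names come from the example in the original message, and the body is only built here, not sent to a Solr instance.

```python
import json

# Hypothetical helper (not part of any Solr client library): builds the JSON
# body for an atomic update of one field on one document.
def build_atomic_update(doc_id, field, value, op="set"):
    """Return a one-document atomic-update payload for Solr's JSON update handler."""
    if op not in {"set", "add", "inc"}:
        raise ValueError(f"unsupported atomic update operation: {op}")
    # Only the changed field is sent; Solr still rewrites the whole document.
    return [{"id": doc_id, field: {op: value}}]

# Change the user's age; "id" and "age" are taken from the example entities.
payload = build_atomic_update("1", "age", 28)
body = json.dumps(payload)
print(body)
```

The request itself is small, but that says nothing about the work done on the server: the whole document is still reindexed.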

Multivalued fields are great for storing a relatively small list of values. You can add to the list easily, but under the hood, Solr must read and rewrite the full list as well as the full document. And there is no way to address or synchronize individual elements of multivalued fields.
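
As a rough mental model of that read-and-rewrite cost, the sketch below treats a stored document as a plain dict. It is a conceptual illustration only, not Solr's actual code path; apply_add and the webpage_urls field are assumptions made for the example.

```python
import copy

# Conceptual model only: appending one value to a multivalued field is treated
# as reading the whole stored document and writing a complete new copy.
def apply_add(stored_doc, field, value):
    """Model an 'add' to a multivalued field as a full read-modify-rewrite."""
    new_doc = copy.deepcopy(stored_doc)           # read every stored field
    new_doc.setdefault(field, []).append(value)   # extend the whole list
    return new_doc                                # the full document is reindexed

# "webpage_urls" is a hypothetical multivalued field for this sketch.
user = {
    "id": "1",
    "name": "Joan of Arc",
    "age": 27,
    "webpage_urls": ["http://wiki.apache.org/solr/Join"],
}
updated = apply_add(user, "webpage_urls", "http://stackoverflow.com")
fields_rewritten = len(updated)  # every field is rewritten, not just the list
print(fields_rewritten, len(updated["webpage_urls"]))
```

In this model the work per update grows with the size of the document and of the list, which is why very long multivalued fields and frequent updates do not mix well.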

Joins are great... if used in moderation. Heavy use of joins is not a great idea.
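
For reference, a query-time join over the example entities could be expressed roughly as below, using the {!join from=... to=...} query parser syntax. The helper build_join_params and the doc_type filter field are hypothetical; the filter is there only because the example user and webpage share the id value 1.

```python
from urllib.parse import urlencode

# Hypothetical helper: builds request parameters for a Solr query-time join.
def build_join_params(from_field, to_field, inner_query, filter_query=None):
    """Return query parameters that join matching documents from one field to another."""
    params = {"q": f"{{!join from={from_field} to={to_field}}}{inner_query}"}
    if filter_query:
        params["fq"] = filter_query
    return params

# Find users that own at least one "Technical" webpage.
# "doc_type" is an assumed discriminator field, not part of the original example.
params = build_join_params("user_id", "id", "category:Technical", "doc_type:user")
query_string = urlencode(params)
print(params["q"])
```

Each such query makes Solr resolve the inner query and map its matches across fields, so the cost scales with the number of matching child documents.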


-- Jack Krupansky

-----Original Message-----
From: Marcelo Elias Del Valle
Sent: Wednesday, July 10, 2013 5:37 PM
To: solr-user@lucene.apache.org
Subject: amount of values in a multi value field - is denormalization always the best option?


Hello,

   I recently asked a question about Solr limitations and another about
joins. It turns out that this question is about both at the same time.
   I am trying to figure out how to denormalize my data so that I need just
1 document in my index instead of performing a join. I figure one way of
doing this is storing an entity as a multivalued field, instead of storing
different fields.
   Let me give an example. Consider the entities:

User:
   id: 1
   name: Joan of Arc
   age: 27

Webpage:
   id: 1
   url: http://wiki.apache.org/solr/Join
   category: Technical
   user_id: 1

   id: 2
   url: http://stackoverflow.com
   category: Technical
   user_id: 1

   Instead of creating 1 document for the user, 1 for webpage 1 and 1 for
webpage 2 (1 parent and 2 children), I could store the webpages in
multivalued fields on the user, as follows:

User:
   id: 1
   name: Joan of Arc
   age: 27
   webpage1: ["id:1", "url: http://wiki.apache.org/solr/Join", "category: Technical"]
   webpage2: ["id:2", "url: http://stackoverflow.com", "category: Technical"]

   It would probably perform better than the join, right? However, it made
me think about Solr limitations again. What if I have 200 million webpages
(200 million fields) per user? Or imagine a case where I could have 200
million values in a field, as when I need to index every HTML DOM
element (div, a, etc.) of each web page a user visited.
   I mean, if I need to do the query and this is a business requirement no
matter what, then although denormalizing could be better than using
query-time joins, I wonder if distributing the data present in this single
document across the cluster wouldn't give me better performance. And this is
something I won't get with block joins or multivalued fields...
   I guess there is probably no right answer to this question (at least
not a known one), and I know I should create a POC to check how each
option performs... But do you think such a large number of values in a
single document could make denormalization impossible in an extreme case
like this? Would you agree with me if I said denormalization is not always
the right option?

Best regards,
--
Marcelo Elias Del Valle

http://mvalle.com - @mvallebr
